Feb 26 2016
 

Today I am at the British Library (BL) Labs Roadshow 2016 event in Edinburgh. I’m liveblogging so, as usual, all comments, corrections and additions are very much welcomed.

Introduction – Dr Beatrice Alex, Research Fellow at the School of Informatics, University of Edinburgh

I am delighted to welcome the team from the British Library Labs today, this is one of their roadshows. And today we have a liveblogger (that’s me) and we are encouraging you to tweet to the hashtag #bldigital.

Doing digital research at the British Library – Nora McGregor, Digital Curator at the British Library

Nora is starting with a brief video on the British Library – to a wonderful soundtrack, made from the collections by DJ Yoda. If you read 5 items a day it would take you 80,000 years to get through the collections. One of the oldest things we have in the collection is a set of oracle bones – 3000 years old. Some of the newest items are in the UK Web Archive – contemporaneous websites.

Today we are here to talk about the Digital Research Team. We support the curation and use of the BL’s digital collections. And Ben and Mahendra, talking today, are part of our Carnegie-funded digital research labs.

We help researchers by working with those operating at the intersection of academic research, cultural heritage and technology to support new ways of exploring and accessing the BL collections. This is through getting content into digital forms and supporting skills development, including the skills of BL staff.

In terms of getting digital content online we curate collections to be digitised and catalogued. Within digitisation projects we now have a digital curation role dedicated to that project, who can support scholars to get the most out of these projects. For instance we have a Hebrew Manuscripts digitisation project – with over 3000 manuscripts spanning 1000 years digitised. That collection includes rare scrolls and our curator for this project, Adi, has also done things like creating 3D models of artefacts like those scrolls. So these curators really ensure scholars get the most from digitised materials.

You can find this and all of our digitisation projects on our website: http://bl.uk/subjects/digital-scholarship where you can find out about all of our curators and get in touch with them.

We are also supporting different departments to get paper-based catalogues into digital form. So we had a project called Collect e-Card. You won’t find this on our website but our cards, which include some in, for instance, Chinese scripts or Urdu, are being crowdsourced so that we can make materials more accessible. Do take a look: http://libcrowds.com/project/urducardcatalogue_d1.

One of the things we initially set up for our staff as a two year programme was a Digital Research Support and Guidance programme. That kicked off in 2012 and we’ve created 19 bespoke one-day courses for staff covering the basics of Digital Scholarship which is delivered on a rolling basis. So far we have delivered 88 courses to nearly 400 staff members. Those courses mean that staff understand the implications of requests for images at specific qualities, to understand text mining requests and questions, etc.

These courses are intended to build capacity. The materials from these courses are also available online for scholars. And we are also here to help if you want to email a question we will be happy to point you in the right direction.

So, in terms of the value of these courses… A curator came to a course on cleaning up data and she went on to get a grant of over £70k for Big Data History of Music – a project with Royal Holloway to undertake analysis as a proof of concept around patterns in the history of music – trends in printing for instance.

We also have events, competitions and awards. One of these is “Off the Map”, a very cool endeavour, now in its fourth year. I’m going to show you a video on The Wondering Lands of Alice, our most recent winner. We digitise materials for this competition and teams compete to build video games – this one is actually in our current Alice in Wonderland exhibition. It uses digitised content from our collection and you can see the calibre of these is very high.

There is a new competition open now, for any kind of digital media based on our digital collections. So do take a look at this.

So, if you want to get in touch with us you can find us at http://bl.uk/digital or tweet #bldigital.

British Library Labs – Mahendra Mahey, Project Manager of British Library Labs.

You can find my slides online (link to follow).

I manage a project called British Library Labs, based in the Digital Research team, who we work closely with. What we are trying to do is to get researchers, artists, entrepreneurs, educators, and anyone really to experiment with our digital collections. We are especially interested in people finding new things from our collections, especially things that would be very difficult to do with our physical collections.

To give you a sense of the space the project occupies, I thought I’d show you some work from a researcher called Adam Crymble, King’s College London (a video called Big Data + Old History). Adam entered a competition to explain his research in visual/comic book format (we are now watching the video, which talks about using digital texts for distant reading and computational approaches to selecting relevant material, and to quantify the importance of key factors).

Other examples of the kinds of methods we hope researchers will use with our data span text mining and georeferencing, as well as creative reuses.

Just to give you a sense of our scale… The British Library says we are the world’s largest library by number of items. 180 million (or so) items, with only about 1-2% digitised. Now new acquisitions do increasingly come in digital form, including the UK Web Archive, but it is still a small proportion of the whole.

What we are hoping to do with our digital scholarship site is to launch data.bl.uk (soon) where you can directly access data. But as I did last year I have also brought a network drive so you can access some of our data today. We have some challenges around sharing data, we sometimes literally have to shift hard drives… But soon there will be a platform for downloading some of this.

So, imagine 20 years from now… I saw a presentation on technology and how we use “digital”… Well, we won’t use “digital” in front of scholarship or humanities, it will just be part of the mainstream methodologies.

But back to the present… The reason I am here is to engage people like you, to encourage you to use our stuff, our content. One way to do this is through our BL Labs Competition, the deadline for which is 11th April 2016. And, to get you thinking, the best idea pitched to me during the coffee break gets a goodie bag – you have 30 seconds in that break!

Once ideas are (formally) submitted to the BL there will be 2 finalists announced in late May 2016. They then get a residency with financial (up to £3600), technical and curatorial support from June to October 2016. And a winner is then announced later in the year.

We also have the BL Labs Awards. This is for work already done with our content in interesting and innovative ways. You can submit projects – previous and new – by 5th September 2016. We have four categories: Artistic; Commercial; Research; and Learning/Teaching. Those categories reflect the increasingly diverse range of those engaging with our content. Winners are announced at a symposium on 7th November 2016 when prizes are given out!

So today is all about projects and ideas. Today is really the start of the conversation. What we have learned so far is that the kinds of ideas that people have will change quite radically once you try and access, examine and use the data. You can really tell the difference between someone who has tried to use the data and someone who has not when you look at their ideas/competition entries. So, do look at our data, do talk to us about your ideas. Aside from those competitions and awards we also collaborate in projects so we want to listen to you, to work with you on ideas, to help you with your work (capacity permitting – we are a small team).

Why are we doing this? We want to understand who wants to use our material, and more importantly why. We will try and give some examples to inspire you, to give you an idea of what we are doing. You will see some information on your seat (sorry blog followers, I only have the paper copy to hand) with more examples. We really want to learn how to support digital experiments better, what we can do, how we can enable your work. I would say the number one lesson we have learned – not new but important – is that it’s ok to make mistakes and to learn from these (cue a Jimmy Wales Fail Faster video).

So, I’m going to talk about the competition. One of our two finalists last year was Adam Crymble – the same one whose PhD project was highlighted earlier – and he’s now a lecturer in Digital History. He wanted to crowdsource tagging of historical images through Crowdsource Arcade – harnessing the appeal of 80s video games to improve the metadata and usefulness of historical images. So we needed to find an arcade machine, and then set up games on it – like Tag Attack – created by collaborators across the world. Tag Attack used a fox character trotting out images which you had to tag with one of four categories before he left the screen.

I also want to talk about our Awards last year. Our Artistic award winner was Mario Klingemann – Quasimondo. He wrote a bit of code for his birthday that found Flickr images of 44 men who look 44. He also found Tragic Looking Women, etc. All of these were done computationally.

In Commercial, our entrant used images to cross-stitch ties that she sold on Etsy.

The winner last year, from the Research category, was the Spatial Humanities group at Lancaster, looking for disease patterns and mapping those.

And a Special Jury prize went to James Heald, who did tremendous work with Flickr images from the BL, making them more available on Wikimedia, particularly map data.

Finally, loads of other projects I could show… One of my favourites is a former Pixar animator who developed some software to animate some of our images (The British Library Art Project).

So, some lessons we have learned is that there is huge appetite to use BL digital content and data (see Flickr Commons stats later). And we are a route to finding that content – someone called us a “human API for the BL content”!

We want to make sure you get the most from our collections, we want to help your projects… So get in touch.

And now I just want to introduce Katrina Navickas who will talk about her project.

Political Meetings Mapper – Katrina Navickas

I am part of the Digital History Research Centre at the University of Hertfordshire. My focus right now is on Chartism, the big movement in the 19th Century campaigning for the vote. I am especially interested in the meetings they held, where and when they met and gathered.

The Chartists held big public meetings, but also weekly local meetings advertised in the press, including the local press. The BL holds huge amounts of those newspapers. So my challenge was to find out more about those meetings – how many were advertised in the Northern Star newspaper from 1838 to 1850. The data is well structured for this… Now that may seem like a simple computational challenge but I come from a traditional research background, used to doing things by hand. I wanted to do this more automatically, at a much larger scale than previously possible. My mission was to find out how many meetings there were, where they were held, and how we could find those meetings automatically in the newspapers. We also wanted to make connections between the papers and georeferenced historical maps, and also any meetings that appear in playbills, as some meetings were in theatres (though most were in pubs).

But this wasn’t that simple to do… Just finding the right files is tricky. The XML is some years old so is quite poor really. The OCR was quite inaccurate, hard to search. And we needed to find maps from the right period.

So, the first stage was to redo the OCR of the original image files. Initially we thought we’d need to do what Bob Nicholson did with Historic Jokes, which was getting volunteers to re-do them. But actually newer OCR software (ABBYY FineReader 12) did a much better job and we just needed a volunteer student to check the text – mainly punctuation rather than spelling. Then we needed to geocode places using a gazetteer. And then we needed to use Python code with regular expressions to extract dates, using some basic NLP to calculate the dates of words like “tomorrow” – easier as the paper always came out on a Saturday.
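
To make that concrete, here is a minimal sketch of those two extraction steps – not the project’s actual code; the pattern, the sample notice and the dates are invented for the example. It catches explicit dates with a regular expression and resolves relative expressions like “to-morrow” by counting from the Saturday issue date:

```python
import re
from datetime import date, timedelta

# Catch explicit dates like "Monday the 3rd May" in a meeting notice.
DATE_PATTERN = re.compile(
    r"(monday|tuesday|wednesday|thursday|saturday|sunday|friday)?\s*"
    r"(?:the\s+)?(\d{1,2})(?:st|nd|rd|th)?\s+"
    r"(january|february|march|april|may|june|july|august|"
    r"september|october|november|december)",
    re.IGNORECASE,
)

# Relative expressions, resolved by counting from the issue date.
RELATIVE_OFFSETS = {"this day": 0, "to-morrow": 1, "tomorrow": 1}

def resolve_relative(expression, issue_date):
    """Turn a relative expression into a concrete date, counting
    from the (always Saturday) issue date."""
    offset = RELATIVE_OFFSETS.get(expression.lower())
    return issue_date + timedelta(days=offset) if offset is not None else None

notice = "A public meeting will be held to-morrow at the Carpenters' Hall."
issue = date(1841, 5, 1)  # a Saturday
for expression in RELATIVE_OFFSETS:
    if expression in notice.lower():
        print(resolve_relative(expression, issue))  # 1841-05-02 (the Sunday)

m = DATE_PATTERN.search("on Monday the 3rd May, at eight o'clock")
print(m.groups() if m else None)  # ('Monday', '3', 'May')
```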

So, in terms of building a historical gazetteer: we extracted place names and ran them through http://sandbox.idre.ucla.edu/tools/geocoder, with lat and long parameters to check locations. But we still needed to do some geocoding by hand. The areas we were looking at had changed a lot through slum clearances. We therefore needed to geolocate some of the historical places, using detailed georeferenced 1840s maps of Manchester, and geocode those.
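
A hedged illustration of that checking step (the bounding box and the sample coordinates below are invented): keep geocoder results that fall within a rough box around Manchester, and flag the rest for geolocating by hand against the 1840s maps.

```python
# (min_lat, min_lon, max_lat, max_lon) – a rough, invented box around Manchester
MANCHESTER_BBOX = (53.35, -2.40, 53.55, -2.10)

def in_bbox(lat, lon, bbox):
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

# Hypothetical geocoder output: one good hit, one that landed in London.
geocoded = {
    "Carpenters' Hall": (53.4754, -2.2411),
    "Brown Street": (51.5122, -0.0912),
}

for place, (lat, lon) in geocoded.items():
    status = "ok" if in_bbox(lat, lon, MANCHESTER_BBOX) else "geolocate by hand"
    print(f"{place}: {status}")
```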

In the end, in the scale of this project, we looked at only 1841-1844. From that we extracted 5519 meetings (and counting), identifying the texts and dates. And that coverage spanned 462 towns and villages (and counting). In that data we found 200+ lecture tours – Chartist lecturers were paid to go on tours.

So, you can find all of our work so far here: http://politicalmeetingsmapper.co.uk. The website is still a bit rough and ready, and we’d love feedback. It’s built on the Omeka platform – designed for showing collections – which also means we have some limitations, but it does what we wanted it to.

Our historical maps are with thanks to the NLS whose brilliant historical mapping tiles – albeit from a slightly later map – were easier to use than the BL georeferenced map when it came to plot our data.

Interestingly, although this was a Manchester paper, we were able to see meeting locations in London – which let us compare to Charles Booth’s poverty maps. Also to do some heatmapping of that data. Basically we are experimenting with this data… Some of this stuff is totally new to me, including trialling a machine learning approach to understand the texts of a meeting advertisement – using an IPython Notebook to make a classifier to try to identify meeting texts.
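
The sort of classifier experiment described might look like the minimal sketch below – assuming scikit-learn, with a toy two-class training set invented for the example; the real notebook would use labelled extracts from the newspaper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: 1 = meeting advertisement, 0 = other text.
train_texts = [
    "a public meeting will be held on monday evening at the association room",
    "the members of the charter association will meet at eight o'clock",
    "wheat prices rose sharply at the corn exchange this week",
    "the ship arrived at liverpool with a cargo of cotton",
]
train_labels = [1, 1, 0, 0]

# Bag-of-words features feeding a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["a meeting of the chartists will be held at the town hall"]))
# -> [1]
```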

So, what next? Well we want to refine our NLP parsing for more dates and other data. And I also want to connect “forthcoming meetings” to reports from the same meeting in the next issue of the paper. Also we need to do more machine learning to identify columns and types of texts in the unreconstructed XML of the newspapers in the BL Digital Collections.

Now that’s one side of our work, but we also did some creative engagement around this too. We got dressed up in Victorian costume, building on our London data analysis and did a walking tour of meetings ending in recreating a Chartist meeting in a London Pub.

Q&A

Q1) I’m looking at Data mining for my own research. I was wondering how much coding you knew before this project – and after?

A1) My training had only been in GIS, and I’d done a little introduction to coding, but I basically spent the summer learning how to do this Python coding. Having a clear project gave me the focus and opportunity to do that. I still don’t consider myself a Digital Historian I guess, but I’m getting there. So, whatever coding skills you have already, don’t be scared, do enter the competition – you get help, support, and pointers in the right direction to learn the skills you need.

Farces and Failures: an overview of projects that have used the British Library’s digital content and data – Ben O’Steen, Technical Lead of British Library Labs.

My title isn’t because our work is farce and failure… It’s intentionally to reference the idea that it can be really important early in the process to ensure we have a shared understanding of terminology as that can cause all manner of confusion. The names and labels we choose shape the questions that people will ask and the assumptions we make. For instance “Labs” might make you imagine test tubes… or puppies… In fact we are based in the BL building in St Pancras, in offices, with curators.

Our main purpose is to make the collections available to you, to help you find the paths to walk through, where to go, what you can find, where to look. We work with researchers on their specific problems, and although that work is specific we are also trying to assess how widely this problem is felt. Much of our work is to feed back to the library what researchers really want and need to do their work.

There is also this notion that people tell us things that they think we need to hear in order to help them. As if you need secret passwords to access the content; people can see us as gatekeepers. But that isn’t how BL Labs works. We are trying to develop things that avoid the expected model of scholarship – of coming in, getting one thing, and leaving. That’s not what we see. We see scholars looking for 10,000 things to work with. People ask us “Give me all of collection X” but is that useful? Collections are often collected that way, named that way, for administrative reasons – the naming associated with a particular digitisation funder, or from a collection. So the Dead Sea Scrolls are scanned in a music collection because the settings were the same for digitising them… That means the “collection” isn’t always that helpful.

So, farce… Think of the Two Ronnies’ “Fork Handles”/“Four Candles” sketch…

We have some common farce-inducing words:

  • Collection (see above)
  • Access – but that has different meanings, sometimes “access” is “on-site” and without download, etc. Access has many meanings.
  • Content – we have so much, that isn’t a useful term. We have personal archives, computers, the UK Web domain trawl, pictures of manuscripts, OCR, derived data. Content can be anything. We have to be specific.
  • Metadata – one person’s metadata is another’s data. Not helpful except in a very defined context.
  • Crowdsourced – means different things to different people. You must understand how the data was collected – what was the community, how did they do it, what was the QA process. That applies to any collaborative research data collection, not just crowdsourcing.

An example of complex provenance…

Microsoft Books digitisation project. It started in 2007 but stopped in 2009 when the MS Book search project was cancelled. This digitised 49K works (~65K volumes). It has been online since 2012 via a “standard” page-turning interface but we have very low usage statistics. That collection is quite random; items were picked shelf by shelf, with books missing. People do data analysis of those works and draw conclusions that don’t make sense if you don’t understand that provenance.

So we had a competition entry in 2013 that wanted to analyse that collection… and it actually led to a project called the Sample Generator by Pieter Francois. This compared physical to digital collections to highlight just how unrepresentative that sample is for drawing any conclusions.

Allen B. Riddell’s 2012 study of the HathiTrust corpus, “Where are the novels?”, similarly looked at the bias in digitised resources.

We have really big gaps in our knowledge. In fact librarians may recognise the square brackets of the soul… The data in records that isn’t actually confirmed – inferred information within metadata. If you look at the Microsoft Books project it’s about half inferred information. A lot of the Sample Generator peaks of what has been digitised are down to inferred year of publication based on content – guesswork rather than reliable dates.

But we can use this data. So Bob Nicholson’s competition entry on Victorian Jokes led to the Mechanical Comedian Twitter account. We didn’t have a good way into these texts, we had to improvise around these ideas. And we did find some good jokes… If you search for “My Mother in-law” and “Victorian Humour” you’ll see a great video for this.

That project looked for patterns of words. That’s the same technique applied to Political Meetings Mapper.

So “Access” again… These newspapers were accessible but we didn’t have access to them… Keyword search fails miserably and bulk access is an issue. But that issue is useful to know about. Research and genealogical needs are different, and these papers were digitised partly for those more lucrative genealogical needs, to browse and search.

There are over 600 digital archives; we can only spend so long characterising each of them. The Microsoft Books digitisation project was public domain so that let us experiment richly and quickly. We identified images of people, we found image details. We started to post images to Twitter and Tumblr (via Mechanical Curator)… There was demand and we weren’t set up to deliver it, so we used Flickr Commons – 1 TB for free – with limited awareness of what page an image was from, what region. We had minimal metadata but others started tagging and adding to our knowledge. Nora did a great job of collating these images that had started to be tagged (by people and machines). And usage of images has been huge. 13-20 million hits on average every month, over 330M hits to date.

Is this Iterative Crowdsourcing (Mia Ridge)? We crowdsource broad facts and subcollections of related items emerge. There is no one size fits all; it has to be project based. We start with no knowledge but build from there. But these have to be purposefully contextless. Presenting them on Flickr removed the illustrations’ context. The sheer amount of data is huge. David Foster Wallace has a great comment that “if your fidelity to perfectionism is too high, you never do anything”. We have a fear of imperfection in all universities, and we need to have the space to experiment. We can re-present content in new forms; it might work, it might not. Metaphors don’t translate between media – like turning pages on a screen, or scrolling a book forever.

With our map collection we ran a tagathon and found nearly 30,000 maps. 10,000 were tagged by hand, 20,000 were found by machine. We have that nice combination of human and machine. We are now trying to georeference our maps and you can help with that.

But it’s not just research… We encourage people to do new things – make colouring books for kids, make collages – like David Normal’s Burning Man installation (also shown at St Pancras). That stuff is part of playing around.

Now, I’ve talked about “crowdsourcing” several times. There can be lots of bad assumptions around that term. It’s assumed to be about a crowd of people all doing a small thing, about special software, that if you build it they will come, it’s easy, it’s cheap, it’s totally untrustworthy… These aren’t right. It’s about being part of a community, not just using it. When you look at Zooniverse data you see a common pattern – that 1-2% of your community will do the majority of the work. You have to nurture the expert group within your community. This means you can crowdsource starting with that expert group – something we are also doing in a variety of those groups. You have to take care of all your participants but that core crowd really matters.

So, for crowdsourcing you don’t need special software. If you build something they don’t necessarily come; they often don’t. And something we like to flag up is the idea of playing games, trying the unusual… Can we avoid keyboard and mouse? That arcade game does that; it asks whether we can make use of casual interaction to get useful data. That experiment is based on a Raspberry Pi and loads of great ideas from others using our collections. They are about the game dynamic… How we deal with data – how to understand how the game dynamics impact on the information you can extract.

So, in summary…

Don’t be scared of using words like “collection” and “access” with us… But understand that there will be a dialogue… that helps avoid disappointment, helps avoid misunderstanding or wasted time. We want to be clear and make sure we are all on the same page early on. I’m there to be your technical guide and lead on a project. There is space to experiment, to not be scared to fail and learn from that failure when it happens. We are there to have fun, to experiment.

Questions & Discussion

Q1) I’m a historian at the National Library of Scotland. You talked about that Microsoft Books project and the randomness of that collection. Then you talked about the Flickr metadata – isn’t that the same issue… Is that suitable for data mining? What do you do with that metadata?

A1) A good point. Part of what we have talked about is that those images just tell you about part of one page in a book. The mapping data is one of the ways we can get started on that. So if we geotag an image or a map with Aberdeen then you can perhaps find that book via that additional metadata, even if Aberdeen would not be part of the catalogue record, the title etc. There are big data approaches we can take but there is work on OCR etc. that we can do.

Q2) A question for Ben about tweeting – the Mechanical Curator and the Mechanical Comedian. For the Curator… They come out quite regularly… How are they generated?

A2) That is mechanical… There are about 1200 lines of code that roam the collection looking for similar stuff… The text is generated from book metadata… It is looking at data on the hard drive – access to everything, so quite random. If there is no match it finds another random image.
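
Stripped of the ~1200 lines, the behaviour Ben describes might be sketched like this – a toy stand-in, with an invented in-memory index in place of the metadata on the hard drive:

```python
import random

# Hypothetical stand-in for the image/metadata index on the hard drive.
image_index = [
    {"file": "000123_p45_img2.jpg", "title": "A Tour in the Highlands",
     "year": 1852, "page": 45},
    {"file": "004567_p112_img1.jpg", "title": "The History of Birds",
     "year": 1841, "page": 112},
]

def pick_and_caption(index):
    """Roam the index, pick a random image, and build a caption
    from the book metadata that travelled with it."""
    record = random.choice(index)
    caption = (f"From '{record['title']}' ({record['year']}), "
               f"page {record['page']}.")
    return record["file"], caption

print(pick_and_caption(image_index))
```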

Q2) And the Mechanical Comedian?

A2) That is run by Bob. The jokes are mechanically harvested, but he adds the images. He does that himself – with a bit of curation in terms of the badness of jokes – and adds images with help of a keen volunteer.

Q3) I work at the National Library of Scotland. You said to have fun and experiment. What is your response to the news of job cuts at Trove, at the National Library of Australia?

A3 – Ben) Trove is a leader in this space and I know a lot of people are incredibly upset about that.

A3 – Nora) The thing with digital collections is that they are global. Our own curators love Trove and I know there is a Facebook group to support Trove so, who knows, perhaps that global response might lead to a reversal?

Mahendra: I just wanted to say again that learning about the stories and provenance of a collection is so important. Talking about the back stories of collections. Sometimes the reasons content is not made available have nothing to do with legality… Those personal connections are so important.

Q4) I’m interested in your use of the IPython Notebook. You are using that to access content on BL servers and website? So you didn’t have to download lots of data? Is that right?

A4) I mainly use it as a communication tool between myself and Ben… I type ideas into the notebook, Ben helps me turn that into code… It seemed the best tool to do that.

Q4) That’s very interesting… The Human API in action! As a researcher is that how it should be?

A4) I think so. As a researcher I’m not really a coder. For learning, these spaces are great; they act as a sandbox.

Q4) And your code was written for your project, should that be shared with others?

A4) All the code is on a GitHub page. It isn’t perfect. That extract, code, geocode idea would be applicable to many other projects.

Mahendra: There is a balance that we work with. There are projects that are fantastic partnerships of domain experts working with technical experts wanting problems to solve. But we also see domain experts wanting to develop technical skills for their projects. We’ve seen both. Not sure of the answer… We did an event at Oxford who do a critical coding course where they team humanities and computer scientists… It gives computer scientists experience of really insanely difficult problems, the academics get experience of framing questions in precise ways…

Ben: And by understanding coding and…

Comment (me): I just wanted to encourage anyone creating research software to consider submitting papers on that to the Journal of Open Research Software, a metajournal for sharing and finding software specifically created for research.

Q5) It seemed like the Political Meetings Mapper and the Palimpsest project had similar goals, so I wondered why they selected different workflows.

A5 – Bea Alex) The project came about because I spoke to Miranda Anderson who had the idea at the Digital Scholarship Day of Ideas. At that time we were geocoding historical trading documents and we chatted about automating that idea of georeferencing texts. That is how that project came about… There was a large manual aspect as well as the automated aspects. But the idea was to reduce that manual effort.

A5 – Katrina) Our project was a much smaller team. This was very much a pilot project to meet a particular research issue. The outcomes may seem similar but we worked on a smaller scale, seeing what one researcher could do. As a traditional academic historian I don’t usually work in groups, let alone big teams. I know other projects work at larger scale though – like Ian Gregory’s Lakes project.

A5 – Mahendra) Time was a really important aspect in decisions we took in Katrina’s project, and of focusing the scope of that work.

A5 – Katrina) Absolutely. It was about what could be done in a limited time.

A5 – Bea) One of the aspects from our work is that we sourced data from many collections, and the structure could be different for each mention. Whereas there is probably a more consistent structure because of the single newspaper used in Katrina’s project, which lends itself better to a regular expressions approach.

And next we moved to coffee and networking. We return at 3.30 for more excellent presentations (details below). 

BL Labs Awards: Research runner up project: “Palimpsest: Telling Edinburgh’s Stories with Maps” – Professor James Loxley, Palimpsest, University of Edinburgh

I am going to talk about a project which I led in collaboration with colleagues in English Literature, with Informatics here, with visualisation experts at St Andrews, and with EDINA.

The idea came from Miranda Anderson, in 2012, who wanted to explore how people imagine Edinburgh in a literary sense, how the place is imagined and described. And one of the reasons for being interested in doing this is the fact that Edinburgh was the world’s first UNESCO City of Literature. The City of Literature Trust in Edinburgh is also keen to promote that rich literary heritage.

We received funding from the AHRC from January 2014 to March 2015. And the name came from the concept of the palimpsest, the text that is rewritten and erased and layered upon – and of the city as a palimpsest, changing and layering over time. The website was to have the same name but, as that wasn’t quite as accessible, we called it LitLong in the end.

We had some key aims for this project. There are particular ways literature is packaged for tourists etc. We weren’t interested in where authors were born or died. Or the authors that live here. What we were interested in was how the city is imagined in the work of authors, from Robert Louis Stevenson to Muriel Spark or Irvine Welsh.

And we wanted to do that in a different way. Our initial pilot in 2012 was all done manually. We had to extract locations from texts. We had a very small data set and it offered us things we already knew – relying on well known Edinburgh books, working with the familiar. The kind of map produced there told us what we already knew. And we wanted to do something new. And this is where we realised that the digital methods we were thinking about really gave us an opportunity to think of the literary cityscape in a different mode.

So, we planned to textmine large collections of digital text to identify narrative works set in Edinburgh. We weren’t constrained to novels; we included short stories, memoirs… imaginative narrative writing. We excluded poetry as that was too difficult a processing challenge for the scale of the project. And we were very lucky to have support and access to British Library works, as well as material from the HathiTrust and the National Library of Scotland. We mainly worked with out of copyright works. But we did specifically get permission from some publishers for in-copyright works. Not all publishers were forthcoming and happy for work to be text mined. We were text mining works – not making them freely available – but for some publishers full text for text mining wasn’t possible.

So we had large collections of works, mainly but not exclusively out of copyright. And we set about textmining those collections to find those set in Edinburgh. Then we georeferenced the Edinburgh placenames in those works to make mapping possible. And then finally we created visualisations offering different viewpoints into the data.

The best way to talk about this is to refer to text from our website:

Our aim in creating LitLong was to find out what the topography of a literary city such as Edinburgh would look like if we allowed digital reading to work on a very large body of texts. Edinburgh has a justly well-known literary history, cumulatively curated down the years by its many writers and readers. This history is visible in books, maps, walking tours and the city’s many literary sites and sights. But might there be other voices to hear in the chorus? Other, less familiar stories? By letting the computer do the reading, we’ve tried to set that familiar narrative of Edinburgh’s literary history in the less familiar context of hundreds of other works. We also want our maps and our app to illustrate old connections, and forge new ones, among the hundreds of literary works we’ve been able to capture.

That’s the kind of aims we had, what we were after.

So our method started with identifying texts with a clear Edinburgh connection or, as we called it, “Edinburghyness“. Then, within those works, to actually try and understand just how relevant they were. And that proved tricky. Some of the best stuff about this project came from close collaboration between literary scholars and informatics researchers. The back and forth was enormously helpful.

We came across some seemingly obvious issues. The first thing we saw was that there was a huge amount of theological works… Which was odd… And turned out to be because the Edinburgh placename “Trinity” was in there. Then “Haymarket” is a place in London as well as Edinburgh. So we needed to rank placenames, and part of that was the ambiguity of names, and understanding that some places are more likely to specifically be Edinburgh than others.
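
One illustrative way to operationalise that ranking (not the project’s code – the names and weights below are invented): weight unambiguous Edinburgh names heavily, discount names like “Trinity” or “Haymarket” that exist elsewhere, and score each text by its weighted mentions.

```python
# Invented weights: 1.0 = effectively only Edinburgh, lower = ambiguous.
PLACE_WEIGHTS = {
    "Grassmarket": 1.0,
    "Canongate": 1.0,
    "Haymarket": 0.3,  # also a place in London
    "Trinity": 0.1,    # very common, e.g. in theological works
}

def edinburghyness(text):
    """Weighted count of Edinburgh place-name mentions in a text."""
    return sum(w for name, w in PLACE_WEIGHTS.items() if name in text)

docs = {
    "doc_a": "They walked from the Grassmarket down the Canongate.",
    "doc_b": "A sermon on the Trinity, preached at Haymarket.",
}
for doc_id in sorted(docs, key=lambda d: -edinburghyness(docs[d])):
    print(doc_id, edinburghyness(docs[doc_id]))  # doc_a 2.0, doc_b 0.4
```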

From there, with selected works, we wanted to draw out snippets – of varying lengths but usually a sensible syntactic shape – around those mentions of specific placenames.
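
A toy version of that snippet step: take the sentence containing each place-name mention as the extract, which gives snippets of varying length but a sensible syntactic shape (real texts would need a more robust sentence splitter than this regex).

```python
import re

def snippets(text, place):
    """Return the sentences in which a place name is mentioned."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if place in s]

text = ("He crossed the North Bridge at dusk. The lights of the Canongate "
        "were coming on. It began to rain.")
print(snippets(text, "Canongate"))
# ['The lights of the Canongate were coming on.']
```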

At the end of that process we had a dataset of 550 published works, across a range of narrative genres. They include over 1600 Edinburgh place names of lots of different types, since literary engagement with a city might be with a street, a building, open spaces, areas, monuments etc. In mapping terms you can be more exact; in literature you have these areas and diverse types of “place”, so our gazetteer needed to be flexible to that. And what that all gave us in total was 47,000 extracts from literary works, all focused on a place name mention.

That was the work itself but we also wanted to engage people in our work. So we brought Sir Walter Scott back to life. He came along to the Edinburgh International Book Festival in 2014. He kind of got away from us and took on a life of his own… He ended up being part of the celebrations of the 200th anniversary of Waverley. And popped up again last year on the Borders Railway when that launched! That was fun!

We did another event at EIBF in 2015 with James Robertson who was exploring LitLong and data there. And you can download that as a podcast.

So, we were very very focused on making this project work, but we were also thinking about the users.

The resource itself you can visit at LitLong.org. I will talk a little about the two forms of visualisation. The first is a location visualiser largely built and developed by Uta Hinrichs at St Andrews. That allows you to explore the map, to look at keywords associated with locations – which indicate a degree of qualitative engagement. We also have a searchable database where you can see the extracts. And we have an app version which allows you to wander in among the extracts, rather than see them from above – our visualisation colleagues call this the “Frog’s Eye View”. You can wander between extracts, browse the range of them. It works quite well on the bus!

We were obviously delighted to be able to do this! Some of the obstacles seemed tough but we found workable solutions… But we hope it is not the end of the story. We are keen to explore new ways to make the resource explorable. Right now there isn’t a way for interaction to leave a trace – other people’s routes through the city, other people’s understanding of the topography. There is scope for more analysis of the texts themselves. For instance we considered doing a mood map of the city; we weren’t able to do that in this project but there is scope for it. And as part of building on the project we have a bit of funding from the AHRC, so lots of interesting lines of enquiry there. And if you want to explore the resource do take a look, get in touch etc.

Q&A

Q1) Do you think someone could run sentiment analysis over your text?

A1) That is entirely plausible. The data is there and tagged so that you could do that.

A1 – Bea) We did have an MSc project just starting to explore that in fact.

A1) One of our buttons on the homepage is “LitLong Lab” where we share experiments in various ways.

Q2) Some science fiction authors have imagined near future Edinburgh, how could that be mapped?

A2) We did have some science fiction in the texts, including the winner of our writing competition. We have texts from a range of ages of work but a contemporary map, so there is scope for keying data to historic maps, and those exist thanks to the NLS. As to the future… The not-yet-Edinburgh… Something I’d like to do… It is not uncommon for fictional places to exist in real places – like 221B Baker Street or 44 Scotland Street – and I thought it would be fun to see the linguistic qualities associated with a fictional place, and compare them to real places with the same sort of profile. So, perhaps for futuristic places that would work – using the linguistic profile to do that.

Q3) I was going to ask about chronology – but you just answered that. So instead I will ask about crowd sourcing.

A3) Yes! As an editor I am most concerned about potential effort. For this scale and speed we had to let go of issues of mistakes, we know they are there… Places that move, some false positives, and some books that used Edinburgh placenames but are other places (e.g. some Glasgow texts). At the moment we don’t have a full report function or similar. We weren’t able to create it to enable corrections in that sort of way. What we decided to do is make a feature of a bug – celebrating those as worm holes! But I would like to fine tune and correct, with user interactions as part of that.

Q4) Is the data set available?

A4) Yes, through an API created by EDINA. Open for out of copyright work.


Geoparsing Jisc Historical Texts – Dr Claire Grover, Senior Research Fellow, School of Informatics, University of Edinburgh

I’ll be talking about a current project, a very rapid project to geoparse all of the Jisc Historical Texts. So I’ll talk about the Geoparser and then more about that project.

The Edinburgh Geoparser has been developed over a number of years in collaboration with EDINA. It has been deployed in various projects and places, mainly also in collaboration with EDINA. And it has these main steps (a toy sketch follows the list):

  • Use named entity recognition to identify place names in texts
  • Find matching records in a gazetteer
  • In cases of ambiguity (e.g. Paris, Springfield), resolve using contextual information from the document
  • Assign coordinates of preferred reading to the placename
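
Here is that pipeline as a toy sketch – not the Edinburgh Geoparser itself. It assumes spaCy’s small English model as a stand-in for the named entity recogniser and an inline dictionary as a stand-in gazetteer, with a crude contextual disambiguation rule:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in NER; labels depend on the model

# Stand-in gazetteer: each entry maps a name to candidate readings
# as (country, lat, lon).
GAZETTEER = {
    "Paris": [("France", 48.86, 2.35), ("United States", 33.66, -95.56)],
    "Leicester": [("United Kingdom", 52.64, -1.13)],
}

def geoparse(text):
    doc = nlp(text)
    places = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    results = {}
    for place in places:
        candidates = GAZETTEER.get(place, [])
        # Crude disambiguation: prefer the reading whose country is itself
        # mentioned in the document, else fall back to the first candidate.
        # A place missing from the gazetteer cannot be grounded at all.
        preferred = next((c for c in candidates if c[0] in text),
                         candidates[0] if candidates else None)
        results[place] = preferred
    return results

print(geoparse("Richard III was reburied in Leicester, not in Paris, France."))
```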

So, you can use the Geoparser either via EDINA’s Unlock Text, or you can download it, or you can try a demonstrator online (links to follow).

To give you an example I have a news piece on the burial of Richard III. You can see the Geoparser performs entity recognition of all types – people as well as places – as that helps with disambiguation later on. Then, using that text, the parser ranks the likelihood of possible locations.

A quick word on gazetteers. The knowledge of possible interpretations comes from a gazetteer, which pairs place names with lat/long coordinates. So, if you know your data you can choose a gazetteer relevant to it (e.g. just the UK). The Edinburgh Geoparser is configured to provide a choice of gazetteers and can be configured to use other gazetteers.

If a place is not in a gazetteer it cannot be grounded. If the correct interpretation of a place name is not in the gazetteer, it cannot be grounded correctly. Modern gazetteers are not ideal for historical documents, so historical gazetteers need to be used or developed. So, for instance, the DEEP (Directory of English Place Names) and PELAGIOS (ancient world) gazetteers have been useful in our current work.

The current Jisc Historical Texts (http://historicaltexts.jisc.ac.uk/) project has been working with EEBO and ECCO texts as well as the BL Nineteenth Century Books collection. These are large and highly varied data sets. So, for instance, yesterday I took a random sample of writers and texts… The collection is so large we’ve only seen a tiny portion of it. We can process it but we can’t look at it all.

So, what is involved in georeferencing this text? Well, we have to get all the data through the Edinburgh Geoparser pipeline. That requires adapting the pipeline to recognise place names as accurately as possible in historical text. And we need to adjust the georeferencing strategy to be more detailed.

Adapting our place name recognition relies a lot on lexicons. The standard Edinburgh Geoparser has three lexicons derived from the Alexandria Gazetteer (global, very detailed), Ordnance Survey (Great Britain, quite detailed) and DEEP. We’ve also added more lexicons from more gazetteers – including larger place names from GeoNames (population over 10,000), populated places from Natural Earth, and only larger places from DEEP – and we score recognised place names based on how many and which lexicons they occur in. Low-scoring placenames are removed – we reckon people’s tolerance for missing a place is higher than their tolerance for false positives.
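
A minimal sketch of that scoring idea (lexicon contents, weights and the threshold here are all invented): a candidate earns points for each lexicon it appears in, and low scorers are dropped as probable false positives.

```python
# Invented stand-ins: each lexicon is a set of names plus a weight.
LEXICONS = {
    "geonames_large": ({"London", "Paris", "Edinburgh"}, 2.0),
    "natural_earth": ({"London", "Edinburgh"}, 1.5),
    "deep_large": ({"London", "Haue"}, 1.0),
}

def score(name):
    """Sum the weights of the lexicons a candidate name occurs in."""
    return sum(weight for names, weight in LEXICONS.values() if name in names)

THRESHOLD = 1.5  # below this, treat the candidate as a probable false positive
for candidate in ["London", "Edinburgh", "Haue"]:
    s = score(candidate)
    print(candidate, s, "keep" if s >= THRESHOLD else "drop")
```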

Working with old texts also means huge variation in spellings… There are a lot of false negatives – missed placenames – because of this (e.g. Maldauia, Demnarke, Saxonie, Spayne). Old spellings also result in false positives (Grasse, Hamme, Lyon, Penne, Sunne, Haue, Ayr). So we have tried to remove the false positives, to remove bad placenames.
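
One hedged way to recover spelling-variant false negatives like these – not necessarily what the project does – is to fuzzy-match unrecognised tokens against gazetteer entries, for instance with the standard library’s difflib:

```python
from difflib import get_close_matches

# A tiny invented slice of a gazetteer's name list.
GAZETTEER_NAMES = ["Moldavia", "Denmark", "Saxony", "Spain", "France"]

for variant in ["Maldauia", "Demnarke", "Saxonie", "Spayne"]:
    # Return the closest gazetteer name above a similarity cutoff, if any.
    match = get_close_matches(variant, GAZETTEER_NAMES, n=1, cutoff=0.6)
    print(variant, "->", match[0] if match else "no match")
```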

When it comes to actually georeferencing these places we need coordinates for place names from gazetteers. We used three gazetteers in succession: Pleiades++, GeoNames and then DEEP. In addition to using those gazetteers we can weight the results based on location in the world, using a bounding box. So we prefer locations in the UK and Europe, then those in the East, not extending to the West as much… and excluding Australia and New Zealand (unknown at that time).
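
In sketch form (the gazetteer contents and the bounding box are invented stand-ins), that grounding strategy might look like this: try each gazetteer in order, and among a gazetteer’s candidate readings prefer those inside the favoured box.

```python
# (min_lat, min_lon, max_lat, max_lon) – a rough, invented box over Europe
EUROPE_BBOX = (34.0, -11.0, 62.0, 40.0)

GAZETTEERS = [
    ("pleiades++", {"Roma": [(41.9, 12.5)]}),
    ("geonames", {"Boston": [(42.36, -71.06), (52.98, -0.03)]}),  # US and UK
    ("deep", {"Boston": [(52.98, -0.03)]}),
]

def weight(lat, lon, bbox):
    """Prefer candidates inside the favoured bounding box."""
    min_lat, min_lon, max_lat, max_lon = bbox
    inside = min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    return 1.0 if inside else 0.5

def ground(name):
    for source, gazetteer in GAZETTEERS:  # gazetteers tried in succession
        if name in gazetteer:
            best = max(gazetteer[name], key=lambda c: weight(*c, EUROPE_BBOX))
            return source, best
    return None

print(ground("Boston"))  # GeoNames, preferring the Lincolnshire reading
```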

So looking at EEBO and ECCO we can see some frequent place names from each gazetteer – which shows how different they are. In terms of how many mentions we have found, there are over 3 million locations in EEBO and over 250K in ECCO (a much smaller collection). The early EEBO texts have a lot of locations in Israel, Italy and France; the early books are more concerned with the ancient world and Biblical texts, so these statistics suggest that we are doing the right thing here.

These are really old texts, we have huge volumes of them, and there is huge variety in the data, and that all makes this a hard task. We still don’t know how the work will be received but we think Jisc will put this work in a sandbox area and we should get some feedback on it.

Find out more:

  • http://historicaltexts.jisc.ac.uk/
  • https://www.ltg.ed.ac.uk/software/geoparser
  • http://edina.ac.uk/unlock/
  • http://placenames.org.uk/
  • https://googleancientplaces.wordpress.com/

Q&A

Q1) What about historical Gaelic place names?

A1) I’m not sure these texts have those. But we did apply a language tag at paragraph level. These are supposed to be English texts but there is lots of Latin, Welsh, Spanish, French and German. We only georeferenced paragraphs thought to be English. If there are Gaelic names then, if they are in the Ordnance Survey gazetteer, they may have been picked up…
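
That paragraph-level filtering might be sketched like this – using the third-party langdetect package purely as an illustrative stand-in for whatever the project actually used:

```python
from langdetect import DetectorFactory, detect  # pip install langdetect

DetectorFactory.seed = 0  # langdetect is probabilistic; fix the seed

paragraphs = [
    "The king rode out from Edinburgh toward the border.",
    "Il pleut beaucoup à Paris en automne.",
]
# Only paragraphs detected as English would go on to be georeferenced.
english_only = [p for p in paragraphs if detect(p) == "en"]
print(english_only)
```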


Pitches – Mahendra and co

Can the people who pitched me share their ideas with the room…

Lorna: I’m interested in open education and I’d love to get some of the BL content out there. I’ve been working on the new HECoS coding scheme for different subjects. And I thought that it would be great to classify the BL content with HECoS.

Karen: I’ve been looking at copyright music collections at St Andrews. There are gaps in legal deposit music from the late 18th and 19th centuries, as we know publishers deposited less in Scottish libraries than at the BL. So we could compare and see what reached the outer reaches of the UK.

Nina: My idea was a digital Pilgrim’s Progress where you can take a virtual tour of a journey with all sorts of resources… To see why some places are most popular in texts etc.

David: I think my idea has been done… It was going to be IPython – Katrina is already doing this! But to make it more unique… It’s quite hard work for Ben to support scholars in that way, so I think researchers should be encouraged to approach Ben etc., but also to get non-programmers crafting complex queries, make the good ones reusable by others… and have those reused queries marked up as of particular quality. And to make it more fun… you could have a sort of treasure hunt jam, with people using that facility to hunt on a theme… share interesting information… have researchers see tweets or shared things… A group treasure hunt to encourage people by helping them share queries…

Mahendra: So we are supposed to decide the winners now… But I think we’ll get all our pitchers to share the bag – all great ideas… The idea was to start conversations. You should all have an email from me so, if you have found this inspiring or interesting, we’ll continue that conversation.

And with that we are done! Thanks to all for a really excellent session!

May 14 2014
 

Today I am at the University of Edinburgh Digital Humanities and Social Sciences Digital Scholarship Day of Ideas 2014, which is taking place at the Edinburgh Centre for Carbon Innovation, High Street Yards, Edinburgh. This year’s event takes, as its specialist focus, “data”. These notes have been taken live so my usual disclaimers apply and comments, questions and corrections are, as ever, very much welcomed.

Introduction: Prof Dorothy Miell, Head of College of Humanities and Social Science

I’m really pleased to welcome everybody here today. This is our third Digital Scholarship Day of Ideas and they are an opportunity to bring in interesting outside speakers, but also for all of us interested in this area to come together, to network and build relationships, and to take work forward. Again today we have a mixture of international and local speakers, and this year we are keeping us all in one room so we can all hear from those speakers. I am really glad to see such a popular take up for the day, and mixing from across the college and Information Services.

Digital HSS, which organised this event, is work that Sian Bayne leads and there are a series of events throughout the year in that strand, as well as these events.

Today we are going to be talking about the idea of data, particularly what data means for scholars in the humanities, how can we understand the term Big Data that we hear in the Social Sciences, and how can we use these concepts in our own work.

Sian Bayne, Associate Dean (digital scholarship) is introducing our first speaker. Annette describes herself as an “itinerant researcher”. Annette’s work focuses on internet and qualitative research methods, and the ethical aspects of internet research. I think she has a real talent for great paper titles. One of my favourites is “Undermining Data” – which today’s talk is partially based on – but I also loved that she had a paper entitled “Fieldwork in Social Media: What would Malinowski do?”. Anyway, I am delighted to welcome Professor Annette Markham.

Can we get beyond ‘data’? Questioning the dominance of a core term in scientific inquiry – Prof Annette Markham, Department of Informatics, Umeå University, Sweden; Department of Aesthetics & Communication, Aarhus University, Denmark; School of Communication, Loyola University, Chicago (session chair: Dr Sian Bayne)

As Sian mentioned, I have spent a lot of time… I was a professor for ten years before I quit in 2007 and pushed myself across other disciplines, to push forward some philosophical work on methods. For the last 5 years or so I’ve been thinking about innovative and creative ways to rethink methods so that they resonate better with the complexity of modern life. I work with STS – Science and Technology Studies – scholars in Denmark, informatics scholars, machine learning scholars in Boston, language scholars in Helsinki… So a real range across the disciplines.

The work today is around methods work I’ve done with colleagues over the last few years; much is captured in a special issue of First Monday: Vol 18, No 10: Making Data – Big Data and Beyond. And this I’m doing from a post-humanist, STS, non-positivist sort of perspective, thinking about the way in which data can be used to indicate that we share an understanding when actually we are understanding the same information in very different ways. For some, data can be an easy term, consistent with your world view… a word that you understand in your own method of inquiry. Data and data sets might be familiar parts of your work. We all come from somewhere, we all do research… what I say may not be new, or may be totally new… it may resonate… or not at all… but I want this to be a provocation, to make you question and think about data and our methods.

So, why me? Well, mainly I guess because I know about methods… This entire talk is part of a bigger project where I look at method, at forms of inquiry… but looking at method directly isn’t quite right; better to look at it from the side, from the corner of your eye… And to look at method is to look at the conditions in which we undertake inquiry in the 21st century. For many of us inquiry is shaped by funding, and funding privileges that which produces evidence, which can be archived. For many qualitative researchers this is unthinkable… a coffee stain on field notes might have meaning for you as an ethnographer, but how can that have meaning for anyone else? How can that be archivable or sharable or mineable?

And I think we also have to think about what it is that we do when we do inquiry, when we do research… to get rid of some of the baggage of inquiry – like collecting data, analysing and then writing up – as there are many forms of inquiry that don’t fit that linear approach. Another way to think of this is to think of frames, of how we frame our research. As an American scholar trained in the Chicago School of Sociology, I cannot help but cite Erving Goffman. Frames tell us to focus on something, and to ignore other things… So if I show you a picture of a frame here…. If I say Mona Lisa you might think of that painting. If I tell you to look outside of the frame you might envision the wall, or the gallery, or what sits outside that frame. And if you change the frame it changes what you see, what you focus on… so if I show you a frame diagram of a sphere and say that is a frame, a frame for research, what do you see? (Some comment that they see the globe, 3D techniques, movement.) The frame tells us to think about certain phenomena…. and also not to think about others… if I say Mona Lisa now… we think of very different things… Similarly an atomic structure type image works as a very different type of frame – no inside or outside but all interconnected nodes… But it’s almost impossible to easily frame, again, Mona Lisa…

So, another frame – a not-quite-closed drawn circle – and this is to say that frames don’t tell you a lot about what they do… and Goffman and others say that frames work best when they are almost invisible…. like maps (except, say, the McArthur Corrective Map). So, by repositioning a map, or by standing in an elevator the wrong way and talking to people – as Harold Garfinkel had his students do – we have a frame that helps us look differently at what we do. “Data” can make us think we look at the same map, when we are not… Data may be understood not as a shortcut term or a metonym; it could be taken rather as preexisting aspects of the phenomenon, filtered and created through a process, and organised in some way. Not the meaning I want for my work, but not good or bad…

So I want to come back to “How are our research sensibilities being framed?”. In order to understand inquiry we have to understand three other things. (1) How do we frame culture and experience in the 21st Century; (2) How do we frame objects and processes of inquiry; (3) How do we frame “what counts” as proper and legitimate inquiry?

For me (1), as someone focused on internet studies, I think about how our research context has shifted, and how our global society has shifted, since the internet. It’s networked for instance. But it is also interesting to note how this frame has shifted considerably since the early days of the internet… Take an image from the Atlas of Cyberspace – an image suggesting the internet as a tunnel. Cityscapes were also common ways to understand the web. MIT suggested different ways to understand a computer interface. This is about what happened, the interests in the early days of the internet in the 90s. That playfulness and those radical ideas changed as commerce became a standard part of the internet. Skipping forward to Facebook for instance… interfaces are easy to understand, friendly; almost all social media looks the same, almost all websites look the same… and Google is a real model for this as their interface has always been so clean…

But I think the significant issue here about socio-technical research and understanding has been shaped by these internet interfaces we encounter on a daily basis.

For me, frame (2) hasn’t changed that much… two slides…. this to me represents any phenomenon or study – a whole series of different networks of nodes connected to the centre. There is no obvious starting point. It is not clear what belongs in the centre – a person, an event, a device – and there are all these entanglements characterising these relationships. And yet our methods were designed for, and work best in, traditional anthropological fieldwork conditions… And the process is still very linear in how we understand it – albeit with iterative cycles – but it’s still presented that way. And that matters as it privileges the neat and tidy inquiry over the messy inquiry, the inquiry without clear conclusions… so how we frame inquiry hasn’t changed much in terms of inquiry methods.

Finally, and briefly, (3) my provocation is: I think we’ve gone backwards… you can go back to the 60s or earlier and look at feminist scholars and their total re-understanding of scientific method, and situated research. But as budgets tighten, as research is funded under more conservative conditions, this stuff that isn’t well understood isn’t as popular… so we’ve seen a return to evidence based methods, to clear conclusions, to scientific process. Particularly in media coverage of research. It’s still a dominant theme…

So… What is data?

I don't want to be glib here. The word "data" is awfully easy to toss around. It is. In everyday life this term is a metonym for lots of stuff, highly specific but unspecified stuff. It is arguably quite a powerful rhetorical term. As Daniel Rosenberg says, the use of the term data has really shifted over the last few hundred years. It appeared in the 1760s or so. Many of those associated with the word only had it appear in translations posthumously. It is derived from Latin and, in the 1760s, it referred to conditions that exist before argument. Then to something that exists before analysis. And in that context data has no theoretical baggage. It cannot be questioned. It always exists… it has an incontrovertible it-ness. A "fact" can be proven false. But false data is still "data". Over time and usage, "data" has come to represent the entirety of what the researcher seeks and needs in pursuit of the goal of inquiry. To consider the word from my non-positivist stance, I ask "what is data within the more general idea of inquiry?". In the mid 1980s I was taught not to use that word: we collect materials, we collect artefacts as ethnographers… and we construct… data… see, even I used it there, so hard not to. It has been operationalised as discrete and incontrovertible.

Big data has brought out critical responses, timely and subtle responses… boyd and Crawford (2011) came up with six provocations for big data. And Nancy Baym (2013) also talks about all social media metrics being a nonrepresentative, partial sample. And there is an inherent ambiguity that arises from decontextualising a moment of clicking from a stream of activity and turning it into a standalone data point. Bruno Latour talked about this too, in discussing soil from the Amazon – removing something from its context.

And this idea disturbs me, particularly when understanding social life as represented in technology. Even outside the western world, even if we don't use technology, as Sonia Livingstone notes, we are all implicated in technology in our everyday life. So, I want to show you a very common metaphor for everyday life in the 21st century – a Samsung Galaxy SII ad. I love this ad – it's low-hanging fruit for rhetorical critique! It flattens everything – your hopes and dreams offered at equal value to services or products you might buy – rendering it all as infinitesimal bits that swirl around, can be transmitted, transformed, controlled – as long as we purchase that particular phone. An interesting depiction of life as data – and of humans and their data as new. It's not unusual, and not a problem, as long as we don't buy into it as a notion uncritically.

This ad troubles me more. This is from Global Pulse, a UN initiative that distributes data on prices in the developing world. It follows the story of a woman affected by price shifts. So this ad… it has a lot of persuasive power, and I want to be careful about the argument that I make to conclude…

I really like what we get from many big data analyses. I have nothing against big data or computational analysis. Some of the work you hear about today is extraordinary, powerful… I won't make an argument against data, or against using data to solve certain problems. I want to talk about what Kate Crawford calls "big data fundamentalism". I wouldn't go that far… algorithms can be powerful, but not all human experience can be reduced to data points. And not everything can be framed by big data. Data can be hugely valuable, but it's important to trouble what is included and what is missed by big data. That advert implies data can be understood as it happens. Data is always filtered, transformed, framed… and from that you draw conclusions. Data operates within the larger framework of inquiry. We have to remember that we have strong and robust models for inquiry that do not place data at the core of inquiry. Data might be important – it should be the chorus, not the main player on the stage. The focus of non-positivist research is upon collecting the messy stuff…

And I wanted to show a visualisation, created in Gephi, by one of my colleagues who looked at Arab Spring coverage in media and social media in Sweden… In doing this, as he shifts the algorithm he is manipulating data, changing how the data appears to us, changing variables to make his case… most of the algorithms in Gephi create neat, round visualisations. Alex Galloway critiques this by saying that some forms may not be representable, and this tool does not accommodate that, or encourages us to think that all networks can be visualised in that way. These visualisations and network analyses are about algorithms… So I sort of want to leave it there, to say that data functions very powerfully as a term… and that from a methodological perspective it creates a very particular frame that warrants concern, particularly when the dominant context tells us that data is the way to do inquiry.
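[Editorial aside: to make that point concrete, here is a minimal Python sketch – my illustration, not the speaker's – of how the same network data yields different pictures under different layout algorithms. networkx stands in for Gephi's layout engines, and the karate club graph is a stock example, not the Arab Spring dataset.]

```python
# Same data, different pictures: layout choice shapes what we "see".
import networkx as nx

G = nx.karate_club_graph()  # a stock example graph

layouts = {
    "spring (force-directed)": nx.spring_layout(G, seed=1),
    "circular": nx.circular_layout(G),
    "random": nx.random_layout(G, seed=1),
}

for name, pos in layouts.items():
    # Where node 0 "sits" in the picture is an artefact of the algorithm,
    # not of the underlying network.
    x, y = pos[0]
    print(f"{name:25s} node 0 at ({x:+.2f}, {y:+.2f})")
```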

Q&A

Q: I enjoyed that but I find you more pessimistic than I would be. That last visualisation shows how different understandings of that network are possible. It's easy to create a strawman like this, but I've been reading papers where videos are included… the audience can all think about different interpretations. We can click on a data point to see that interview, to see that complex account of that point. There are many more opportunities to create richer entanglements of data… we should emphasise those, emphasise that complexity rather than hide the complexity of how that data is created.

A: Thanks for finishing my talk for me! If we consider the generative aspects of inquiry then we can use the tools to be transparent about the playfulness of interrogation, by offering multiple interpretations… I talk about a process of Borrow / Play / Move / Interrogate / Generate. So I was a bit pessimistic – that Global Pulse ad always depresses me. But I agree!

Q: I was taken by your argument that human experience cannot be reduced to a single data point… what else can it be reduced to… it implies an alternative to data… so what might that be?

A: I think that question is not one that I would ask. To me that is not the most important question. For me it’s about how we might make social change – how might I create interventions, how might I represent someone’s story. I’m not saying that there is an alternative… but that discussion of data in general puts us in that sort of terrain… and what is more interesting or important is to consider why we do research in the first place, why do we want to look for a particular phenomenon… to not let data overwhelm any other arguments.

Q: I think your talk noted that big data focuses on how people are similar and what similarities there are, whilst ethnography tends to be about difference. That makes data tracking that covers most people particularly depressing. Is that the distinction though?

A: I think I would see it as simplification versus complexity… how do we envision inquiry in ways that try to explode the phenomenon into an even more complex set of entanglements and connections. It may be about differences but doesn't have to be… it's about what emerges from a more generative process… it's an interesting reading though, I wouldn't disagree.

Q: I wanted to share a story with you, of finishing my PhD, a study of social workers when I was a social worker. I had an interview for a research post at the Scottish Government and one of the panel asked me "and how did you analyse your data?" – and I had never thought of my interviews and discussions as data… and since then I've been in academia for 20 years, but actually I've had to put that idea, that people are not data, aside to progress my career – holding onto the concept but learning to talk the talk…

A: I can relate to that. You hear that a lot, struggling to find the vocabulary to make your work credible and understandable to other people. With my students I help them see that the vocabulary of science is there, and has been dominant… and to help them use other terms to replace the terms they use in the inquiry, in their method… these terms of mine (Borrow / play / move / interrogate / generate) to get them thinking another way, to make them look at their work in a different way from that dominant method. These become a way that people can talk about the same thing but with less weighty vocabulary, or terms that do not carry that baggage. So that’s one way I try to do that…

Crowd-sourced data coding for the social sciences: Massive non-expert coding of political texts – Prof Ken Benoit, Professor of Quantitative Social Research Methods, London School of Economics and Political Science (session chair: Prof John McInnes)

Professor John McInnes is introducing our next speaker, Professor Ken Benoit. Ken not only talks about big data but has the computational skills to work with it.

I will be showing you something very practical…. I had an idea that I’d do something live… so it could be an Epic Fail!

So I took the UKIP European Election Manifesto… converted to plain text in my text editor. Made every sentence one line… put into spreadsheet… Then I’m using CrowdFlower with some text questions… So I’ll leave that to run…
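[Editorial aside: the preprocessing he describes might look something like this minimal Python sketch – my reconstruction, not Benoit's actual script; the file names are invented.]

```python
# One sentence per line, then a spreadsheet a crowdsourcing platform can ingest.
import csv
import re

with open("ukip_manifesto.txt", encoding="utf-8") as f:
    text = f.read()

# Naive sentence splitter: break on ., ! or ? followed by whitespace.
# A real project would use a proper sentence tokeniser.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

with open("ukip_sentences.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence_id", "text"])
    for i, sentence in enumerate(sentences, start=1):
        writer.writerow([i, sentence])
```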

So back to my talk… the goal is to measure unobservable quantities… we want to understand ideology – "left-right" policy positions… we have theories of how people vote: that they vote for the parties most proximate to their own positions. For political scientists this is a huge issue. We might also want to measure corruption, cultural values, power… but today I'm going to focus on policy positions.

A lot of political science data is "created" by experts… a lot of it is, frankly, made up. A lot of it comes from hand-coded text units – you take a text, you unitise it… e.g. immigration policy statements… (Comparative Manifesto Project, Policy Agendas Project). Another way is solicited expert opinion (Benoit and Laver, Chapel Hill, etc.) – I worked with Laver for years on understanding the policies of each party. It's expensive work, it takes an expert an hour to fill out a form… a real headache… We have expert-completed checklists (Polity, Comparative Parliamentary Democracy Dataset, Freedom House, etc.). There are coded international events (KEDS, Penn State Event Data). And we have inductively scaled quantities (factor analysis, such as "Billy Joe Jimbon" factor analysis).

So what are some of the problems of coding using "experts"? Who are experts anyway? It is difficult to find coders who are suitably qualified. It's hard to find them AND hard to train them… most of the experts coding texts tend to be PhD students who find it a pleasing thing to do whilst avoiding finishing their thesis. There can be knowledge effects, since no text is ever anonymous to an expert coder with country knowledge. Human coders are unreliable – their codings of the same text unit will vary wildly. And even single coding is relatively costly and time-consuming, so only one coder codes each text. Even when you pay the experts, they are still doing you a favour!

So I will talk about an alternative solution to this problem, and that problem is about classifying text units. The idea is to observe a political party's policy position by content analysis of its texts. Party manifestos are the most common texts. The idea behind content analysis is breaking text into small units and then using human judgement to apply pre-defined codes, e.g. coding something as right-wing policy. And usually that is done for LOTS of sentences by only ONE coder.

Tomorrow I'll be in Berlin… the biggest (only?) game in town is the Comparative Manifesto Project (CMP). This is a huge project with 3,500 party manifestos from 55 countries, covering 1945-2010 and still going. Human coders are trained and have PhDs. They break manifestos into sentences and use human judgement to apply pre-defined codes. Each sentence is assigned to one of 56 policy categories. Category percentages of the total text are used to measure policy. And each manifesto is seen, and coded, by just one coder.

So… what could we do? Crowd-sourcing involves outsourcing a task by distributing it to an unspecified group, usually in parts… The basic idea, versus expert coding, is to reduce the expertise of each coder but increase the number of coders. Distribute texts for coding partially and randomly. Increase the number of coders per sentence. Treat different coders as exchangeable – and anonymous; we don't care if they are sitting in an internet cafe in Estonia in their underwear, or doing this on a day off from a bank…

The coding scheme here is a more simplified one. We applied it to 18 manifestos of the "big 3" British parties from 1987 to 2010. A sentence can be coded as Economic, Social or neither… under either of the first two categories there are further options (anti, neutral or pro), from "Very left" to "Very right", or "Very liberal" to "Very conservative". And there is a 10-question test showing correct codings, to guide the coder and to keep them on track.

So, to get this started we wanted a comparison we understood. We wanted to compare crowd coding to expert coding. So my colleague and I, and some graduate students, coded a total of 123,000 sentences between us, with between 4 and 6 coders per manifesto, using the same system to be deployed to the crowd. This was a benchmark for the crowd-sourcing end of things. It took ages to do… that's a lot of expert coding, and in practice you wouldn't get this happening… For the crowdsourced codings we got almost twice as many codings…

We used an IRT-type scaling model to estimate position. We didn't want to just take averages here… we used a multinomial method. We treat each sentence as an item, to which the manifesto is responding, and the left-ness or right-ness (etc.) as a quality it exhibits. Despite that complexity, we found that a mean-of-means approach led to very similar results. We are trying to simplify that multinomial method… but now the results…
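[Editorial aside: the mean-of-means aggregation, as I understood it, is simple enough to sketch – average the coders' scores within each sentence, then average across sentences. My illustration with invented values, not the project's code or data.]

```python
# Mean of means: per-sentence coder averages, then a manifesto-level average.
from collections import defaultdict
from statistics import mean

# (sentence_id, coder_id, score): scores run from -2 (very left)
# to +2 (very right); values are invented for illustration.
codings = [
    (1, "c1", -1), (1, "c2", -2), (1, "c3", -1),
    (2, "c1",  0), (2, "c4",  1),
    (3, "c2",  2), (3, "c3",  1), (3, "c5",  2),
]

by_sentence = defaultdict(list)
for sentence_id, _, score in codings:
    by_sentence[sentence_id].append(score)

sentence_means = [mean(scores) for scores in by_sentence.values()]
manifesto_position = mean(sentence_means)
print(f"estimated position: {manifesto_position:+.2f}")
```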

Comparing expert codings to expert surveys on economic and social positions looks pretty good… good correlation, particularly for the economic dimension – which is what we'd expect, and what we see.

We tested how best to serve up sentences… we tried them in order and out of order. We found a .98 correlation, so order doesn't matter…

For the crowd-sourcing we used CrowdFlower, a front end to many crowd-sourcing platforms, not just Mechanical Turk. It uses a quality monitoring system: you have to maintain an 80% "trust" score or you are rejected. Trust is maintained through "gold" questions, carefully selected and generated by experts…
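[Editorial aside: that trust-score rule reduces to a simple check – accuracy on the expert-written gold questions must stay at or above 0.8. A minimal sketch, with hypothetical field names and answers:]

```python
# Keep a coder only while their accuracy on "gold" questions stays >= 0.8.
def trust_score(answers, gold):
    """Fraction of gold questions the coder answered correctly."""
    graded = [(qid, ans) for qid, ans in answers.items() if qid in gold]
    if not graded:
        return 0.0
    correct = sum(1 for qid, ans in graded if ans == gold[qid])
    return correct / len(graded)

gold = {"g1": "economic", "g2": "social", "g3": "neither"}
coder_answers = {"g1": "economic", "g2": "social", "g3": "social"}

score = trust_score(coder_answers, gold)
print(f"trust={score:.2f}, keep={score >= 0.8}")  # trust=0.67, keep=False
```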

So, we can go back to the live experiment… it's 96% complete!

So, looking at results in two dimensions… if the Liberal Democrats were actually liberal they would be right on economics and left on social… but actually they are more left on economics. The Conservatives are on the right socially but nearer the left in some cases… but it's not about the analysis so much as the comparison with the benchmark…

When we look at expert codings versus crowd coders… well, the points are all over the place, but we see correlations of 0.96 for the economic and 0.92 for the social dimension. So in both cases there isn't total agreement – we either have a small crowd of experts or a bigger crowd of non-experts. It's always an average, just at a different scale…

So, how many coders do we need? There is no need for 20 codings of a sentence if it's clearly not about immigration policy… we massively oversampled, then drew subsets to estimate standard errors… as codings are added, the uncertainty starts to collapse… The rate of collapse for experts is substantially steeper… in aggregate, across these two processes, you need about five times more non-expert coders than experts. But you can get good codings with five coders…
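[Editorial aside: that subsampling exercise can be simulated – draw k codings per sentence from an oversampled pool and watch the standard error of the estimate shrink as k grows. Simulated data, my sketch only:]

```python
# Uncertainty collapses as codings per sentence (k) increase.
import random
from statistics import mean, stdev

random.seed(42)

# Simulate 200 sentences, each with a pool of 20 available codings.
pool = [[random.gauss(0.5, 1.0) for _ in range(20)] for _ in range(200)]

def estimate(k):
    """Mean-of-means estimate using k randomly drawn codings per sentence."""
    return mean(mean(random.sample(codings, k)) for codings in pool)

for k in (1, 2, 5, 10, 20):
    draws = [estimate(k) for _ in range(100)]
    print(f"k={k:2d}  mean={mean(draws):+.3f}  se={stdev(draws):.3f}")
```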

So we did some tests on immigration policy… we used the 2010 British manifestos, knowing that there were two expert surveys on this dimension (but no CMP measures). We only coded whether a sentence was about immigration or not, and if so, whether it was positive or not. It cost about $300. We ran it again, at the same cost, with extremely similar results…

Doing this we had a 0.96 correlation with the Benoit 2010 expert survey, a 0.94 correlation with the Chapel Hill survey, and between the two runs a correlation of around 0.94. It would have been higher, but the experts differentiated between the immigration policies of Labour and the Conservatives in ways that were not obvious positions in the texts… positions that experts knew about from outside the texts…

So, who are these people? Who are these crowd coders? They are from all over the world… the top countries were the USA, Britain, India and Estonia. One person coded over 10,000 sentences! A crazy person who loves coding! The mean trust score rarely drops below 0.8, as you'll be booted off if it does… You don't pay, or take data from, those who fail. Where are these jobs being sourced? We tried Mechanical Turk… we've used CrowdFlower… there are huge numbers of these sites – a student looked at about 40 of them… but trust scores are great no matter how these people are sourced… The recruitment techniques are not all ideal… but coders don't stay in the system if their trust score drops. There is no relationship between coder quality and platform…

Conclusions here. Non-experts produce valid results; you just need a few more of them. Experts have variance, have noise, so experts are just another version of a crowd, with higher expertise (lower variance). Repeat experiments show that the method is reliable (and replicable). Some venues require your work to be replicable… is data plus script a good way to do that? Here you really can: you can replicate everything. You can redo in February what you did in December… with the right text you can reproduce the result. Why does this appeal? Well, it's cheap, it's flexible. It's great for PhD students who lack access to experts. And you can work independently of big organisations that have their own agenda for a study. You can try an idea, run it again, tweak it, see what works… and go back again… And this works for any data production job that is easily distributed into simple tasks… sign up for Mechanical Turk as a worker, see what it's like to actually do this… for instance, transcriptions of noisy audio tapes… a common job is uploading 5-second clips for transcription, which gives you pretty good human transcription that timestamps weave back together. Better than computer methods…

So, we are 100% finished with our UKIP crowdsourcing experiment… Interestingly, 40 negative, 48 positive… it needs further analysis…

Q&A

Q: In terms of checking coders do the right thing – do you check them at the beginning or do you check during the process of coding?

A: Here I cheated a bit… I used 126 gold questions from another experiment. You have to give a reason for each question, about why it's there – if the person doesn't get it right then they get text explaining why that is the case… Very clear, unambiguous questions here. But when you deploy a job you can monitor how participants responded, or whether they contested a question… In a previous experiment we had so many contested responses to one question that I looked again and removed it…

Q: A very interesting talk… I am a computer scientist and I am interested in whether now you have that huge gold data set you have thought about using machine learning.

A: Yes, we won’t let that go to waste. The crowd data too…

Q: I am impressed but have two questions… you look at every sentence of every manifesto… they are funny things as not every sentence is about the thing you are searching for – how do you deal with that? And a lot of what is in manifestos are sort of dog whistle things – with subtexts that the reader will pick up, how do you deal with that in crowdsourcing?

A: You get contextual sentences around the one you are coding; that helps indicate the relevance of that sentence, its context. In terms of the dog whistle question… people think that, but manifestos are not designed to be subtle. They actually tend to be very plain, very clear. It's rare for that subtlety to be present. If you want truly outrageous immigration policy, look at the BNP manifesto… every single area is about immigration, not subtle at all.

Q: I'm a linguist, and I find this very interesting… a question about tasks appropriate to crowdsourcing: those that can be broken down into small tasks, and that your participants can relate to their daily life. I am doing work on musical interpretation… I need experts because I can't see how to frame that in language in a way that is interpretable to non-experts…

A: You can't give them something that's complex… I couldn't do your task… you can't assume who your crowd is; we have very little information… we didn't ask about language, but they wouldn't retain that trust score without some good English language skills. And workers have a trust score across projects, so anything they can't do they avoid, as losing that score is too costly… You could simplify the task with some sort of test of correct or incorrect interpretation… but we keep the task simple.

Q: A very interesting talk, I have a quick question about how you set the right price for these tasks… how do you do that? People come from different areas and different contexts.

A: Good question. We paid 2 US cents per sentence. We tried at 5 cents and it was done very fast but quality wasn’t better. A job at 1 cent didn’t happen fast at all. So it’s about timings and pricing of other jobs.

Q: Could you say something about the ethics of this kind of method… you are not giving much consideration to the production of these texts, so I wondered if you could talk about the ethics of this work and responsibilities as researchers.

A: Well, I didn't ruin any rainforests, or ruin any summers. These people have signed up to terms and conditions. They are responsible for taxation in their jurisdiction. Our agreement with CrowdFlower gives them that responsibility. And it's voluntary. Hopefully no sweatshops for this… I'm receptive to the idea of what the ethical concerns could be… but I couldn't see anything inherently wrong with the notion of crowdsourcing that would be a concern. We did run it past the ethics committee at LSE. We didn't directly contact people; tasks were completed on the internet through a third-party supplier.

Q: You were showing public domain documents… but for research documents not in the public domain how would security be handled…

A: Generally transcriptions are private… and segments are usually only 3 or 5 seconds… like reading a document from the shredder basket… the system has that data, but workers do not have access to that system.

Q: But the system does have that so you need trust in the platform…

A: Yes.

Comment from floor: companies like CrowdFlower have convinced companies to give them data – doctors' notes etc. They have had to work on making sure they can assure customers about the privacy of data… as a researcher, when you go in, you can consider what is being done in that business market in comparison.

Q: Have you compared volunteer coders to paid coders? I am thinking particularly about ethical side of things and motivations, particularly given how in political tasks participants often have their own agendas. Might be interesting to do.

A: Volunteer crowdsourcing? Yes, it would be interesting to compare that…

Reading Data: Experiments in the Generative Humanities – Dr Lisa Otty, Lecturer in English Literature and Digital Humanities, University of Edinburgh (session chair: Dr Tom Mole)

Dr Tom Mole is introducing our next speaker, Dr Lisa Otty, whose interests are in the relationship between reading, writing and the technologies of transcription. And she will be talking about her work on reading poetry, and the process of what happens when we read a poem.

Now, to be a literature scholar speaking at an event like this, I have to acknowledge that data is not a term typically used in our field. The texts we are used to reading are often books, poems… but a text is not necessarily a traditional material; it may also be another linguistic unit, something more complex. The Open Archival Information System model (CCSDS 2002) describes data as "a reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing". Interpretation being crucial there. When we look at texts like books or poems, those are "cooked" – edited, curated, finished. Data is too often not seen as that.

Johanna Drucker – in Humanities Approaches to Graphical Display (DHQ 5.1, 2011) – talks about data as taken not given, constructed from the phenomenological world. Data passes itself off as a priori conditions, as if it were the same as the phenomena observed, collapsing the critical gap between data collection and observation.

Some of these arguments gel with the arguments around close versus distant reading. And I think it can therefore be more productive to see data as a generative process…

Between 2009-2012 I was involved in the research project Poetry Beyond Text (University of Glasgow and University of Kent). This was a collaborative project, so inevitably some of my reflections and insights are also collaborative, and I would like to acknowledge my colleagues' work here. The project was looking at interpretation of poetry, and particularly visual forms of poetry such as artist books. What these works share is that they are deeply resistant to being shared as just information.

For example, Eugen Gomringer's (1954) "silencio" is an example of how the space is more resonant than the words around it… So how do we interpret these texts? And how do our processes for interpretation affect our understanding? One method, popular in psychology, is eye tracking… a physical way of registering what you are doing. We combined eye-tracking with self-reporting. Eye tracking takes advantage of the movements of a small area of the retina. A map of concentration shows those little jumps, those movements around the page. But it's an odd process to be part of – you wear a head brace with a camera focused on your eye. You get a great deal of data from the process. Where there is more concentration, that usually indicates trickiness or challenge or interest in that section – particularly likely for challenging parts of a text. From this data you can generate visualisations. (We are watching a video of the eye tracking process for poetry.)

Doing this we found a lot of patterns. We saw that people did focus on and understand space, but only when that space has significance in the process – in poems where space is more conceptual than mimetic. Interestingly, people who recorded high confusion also reported liking the poems much more… In experiments with post-linear poems we saw cross-linear connections. All readers start with a linear reading pattern before visual reading. And that reflects the colour Stroop test – the psychology test showing that visual information trumps linguistic information… so visual readings and habitual reading processes are hard to overcome. We are programmed to read in a certain way… our habits are only broken by obstacles or glitches in the text we are reading…

Now, in talking about this project, if I talk about "findings" I am back in those traditional research methods… and that would be misleading. We were a cross-disciplinary team, so I am particularly interested in focusing on that process, on how we worked together. The eye tracking generates huge amounts of numerical data… we faced real challenges in understanding how to read this data… a useful reminder that data's apparent neutrality has real repercussions. It's one thing to make data open, another to enable people to work with it.

My colleagues in psychology didn't understand our interest in visualisations of the numerical eye tracking data; a visualisation is an abstraction… and you have to understand the software to understand how that abstraction works. Psychologists prefer to interpret through the numerical data itself. They see visualisations, graphs etc. as having a rhetorical rather than an analytical function. Our team were interested in that rhetorical function. We were humanists running an experiment – the framework was of hypotheses, of labs, of subjects… but the team came from a creative practice background, so this sense of experiment was also in play. In its broadest terms, an experiment is about seeing something in process and seeing how it behaves; for scientists it is about testing hypotheses in this way; creative experiments are rather different… For humanist analysis of these texts you have to deal with a huge number of variables, very much a contrast to traditional psychology experiments. For creative experiments there is a long tradition of work in surrealism, dadaism, etc. – the idea that poetry can unleash and disrupt our traditional reading of texts… they are deliberately breaking our habits. The reader of the literary form is a potentially revolutionisable(?) subject.

In literary scholarship and the humanities, reading is a social, contextualised process. In psychology, reading is a biomechanical process; my colleagues in this field collapse the human and the machine. In a recent article, Lutz Koepnick asked Can Computers Read? (2014) and discussed the different possible understandings of what reading is for, and what our ideological framework of reading means to us… computational reading is less about what computers are, more about how we invest in them and envision them.

One of the things that came out of our project was the connections between poetry and psychology, and the connections to creative experiments.

To finish I want to talk about some examples of experiments around reading and what reading can mean.

The Readers Project – John Cayley and Daniel Howe (2009- ) – explores imaginative critiques of reading. Cayley is a literary scholar and has been working in digital production for some time. The Readers Project features "programmed autonomous entities". Each reader moves through a text at a different speed and in a different way. For each part of the experiment projections are used, and they are often shown with books – a deliberate choice. A number of interfaces are available. But these readers move according to machine reading rather than biomechanical reading. Cayley terms this an exploration of vectors of reading… directions in which reading might take off. It explores and engages with new creative understandings of reading. Cayley seems to see this in an avant-garde context, with an emphasis on the constructed nature of the work.

"because the project's readers move within and are thus composed by the words within which they move, they also, effectively, write. They generate texts and the traces of their writings are offered to the project's human readers as such, as writing, as literary art." (Cayley, The Readers Project website)

As someone engaging with these pieces the experience is of reading with, more than processing or consuming or analysing.

Tower – by Simon Biggs and Mark Shovman (2011), working at Hive – uses natural language processing to build visualisations. When the interactor speaks, their words spiral around them. And other texts are also present – the project is inspired by the Tower of Babel and builds up and up. Shovman's previous work at Hive was on geometric structure. Biggs' hope is that participants "will be enabled to reflect upon the inter-relations of the things that they are experiencing and their own contingency as part of that set of things."

Michelle Kendrick talks about hybrids, that hybrid of human and machine interaction, the centrality of human investment in computer reading.

When I talk about this work I am overwhelmed by the rhetorical significance of words like "experiment" and the dominance of scientific research methods – the first interpretation of this work is often, wrongly, that it applies scientific methods to literary interpretation. But instead this work is about interpretation and about exploring methods of understanding and interpretation.

Q&A

Q: You talked about different disciplines coming together. Do you think there is a need for humanities researchers to understand data and computational methods?

A: I think we would all benefit from a better understanding of data and analysis, particularly as we move more and more into using digital tools. I’m not sure if that needs to be in the curriculum but it’s certainly important.

Q: One of the interesting things about reading is the idea of it being a process of encoding and decoding… but the code shifts continuously… and a challenge in experimental reading or interpretation is that literature is always experimental to some extent, because the code always changes.

A: On the idea of reading as always being experimental… I think that experimental writing is about disruption… less about process and more about creating challenge.

Q: I was very struck in what you were presenting there in the Poetry Beyond Text project about the importance of spatiality and space… so I was wondering about explicit spatial understandings – the eye tracking being a form of spatial understanding…

A: We were looking at the way people had interpreted those texts in the past, the ways people had looked at that poetry… they had talked about the structural work of the poets themselves… and we wanted to look beyond that… We wanted to find out people's responses to some of these processes, and what the relationship was between that experience and those critical views of those texts.

Q: Did you do any work on different kinds of readers – expert readers or people who had studied these works?

A: It was quite a small group but we looked at the same people over time and we did see development over time. We worked mainly with students in literature or art and most hadn’t encountered this type of concrete poetry before but were well experienced with reading.

Q: I wanted to ask you about the ways in which we are trained to read… there are apps showing images of texts very, very quickly – are we developing skills to read quickly rather than to read fully and understand the text?

A: There was a process of showing text to the eye rapidly – RSVP, rapid serial visual presentation, was the acronym – to allow you to absorb more quickly, but in actual fact it was quite uncomfortable. We do see digital texts playing with those notions. I don't think we will move away from slow reading, but we are seeing more of these rapid reading processes and technologies.

Chair: Kinetic Text project works in some of these ways, about focusing eye movement…

A: The text can also manipulate eye movement and therefore your reading and understanding of the text. Very interesting in that respect.

Algorithm, Data and Interpretation – Dr Stephen Ramsay, Associate Professor of English at the University of Nebraska; Fellow at the Center for Digital Research in the Humanities (session chair: Prof James Loxley)

James Loxley is introducing our next speaker, Dr Stephen Ramsay.

I want to say that my mother is from Ireland, a little place west of here, and she said that if she had ever been to University it would have been to University of Edinburgh which she felt was the best in the world.

Now, I was planning to give a technical talk – I teach computer science in an English faculty. But instead I'm going to talk about data. So I'm going to start with the 1965 blackout of New York. At the time the story was about disaster, groping in the dark, a city stranded. But then 9 months later the papers ran stories on the growth in birth rates, a sharp rise across hospitals across the state, all recording above-average numbers of births. Although one report noted that Jewish hospitals did not see an increase. Sociologists talked about the blackout as in some way responsible… Three years later a sociologist published a terse statement showing no increase in births after the Great Blackout. This work looked at the average gestation period, noting that births would have been higher from June through to August, not just in August… and he found that 1966 was not unusual or remarkable. Blackout babies were a myth…

You could read this tale as a cautionary one about the misuse of data. But I think it can be read another way… the New York Times piece said something about human nature – people turning to each other when the power goes out is a sad reflection on the place of television in our lives, but a hopeful narrative for humanity. Citing birth rates and data, and using scientific language, adds to that. And the comments about Jewish people show prejudice. But at the same time, the subsequent analysis frames the public as prone to fantasy, as uninformed, with the scholar overcoming this…

The idea of "lies, damn lies, and statistics" encourages us to always look for the falsehood hiding behind the truth… so we think about what stories we are being told, and what story we want to tell. It's simple advice that is hard to follow. I want to give a different spin on this. I think of data as narrative in waiting. The way we use data is instructive – we talk about lists, numbers… Pride and Prejudice does not seem to be a data set unless we convert it. It gains narrative in transformation. Data can be shown to show and mean things – like stories, stories waiting to be told… data doesn't mean anything by itself; someone has to hear what it is saying…

What does data look like in its pre-interpretive state? There is an internet site called "Found" – collecting random items such as notes, cards, love letters, shopping lists. Materials without their context. Abandoned artefacts. All can be found there. But the great glorious treasure of Found is its lists…

[small pause here for technical difficulty reasons]

These lists are just abandoned slips of paper… one for instance says:

beer

neat

dogfoot

domestic

stenga

another:

roach spray

flashlight

watermellon

The spareness and absence of context quickly turns these data-like lists into narrative… not all are funny… one reads:

go out for a walk with someone

speak with someone

watch tv

go out to cemetry to speak to mom

go to my room

Have you ever wanted to give your data a hug? Bram Stoker said that in writing Dracula he just wanted to write something scary… his novel is far more interesting without him, as the interpretations of others are fascinating and intriguing… Do facts matter in the humanities? In some areas… who painted a picture, when a treaty was signed… these are not contingent truth claims… surely we can say fact is a good word for those things that are not subject to debate. Scholars can debate whether a painting is by Rembrandt or his school; that debate is about establishing a fact. But facts still matter…

If we look at Rembrandt's Night Watch, the lighting of the girl, equal to that of the captain, is intriguing. If Rembrandt had said it meant nothing, we'd probably ignore him… The signing of a treaty may be a fact, but why it occurred is much more interesting. The humanities are about that category 1 inquiry more than category 2 fact inquiries. Often this is the critique of the humanities and the digital humanities. Jonathan Gottschall insists that the humanities should embrace scientific approaches and a sense of optimism… He sees the sciences as doing a better job of this stuff, but says "what makes literature special" should be retained… without saying what those things are. There are unsettled matters if one takes scientific approaches. Of course Gottschall's nightmare is to understand data with the same criticality we apply to Bram Stoker, questioning its being and meaning… and I suggest we make that nightmare a reality!

[More technical issues… ]

What I wanted to show you was a list of English novels [being read to us]… It is a list, from Hoover, that organises novels in terms of the breadth of their vocabulary. I have shown this list to many people over the last few years, including many professors… they see Faulkner and Henry James at the top and approve of that, and of Mark Twain… and young adult novelists at the bottom… but actually I read you the list in ascending order… Faulkner and James are at the bottom. Kipling and Lewis are at the top. And there it starts… richness is questioned… people want to point out how clearly correct the answer is, despite having given the wrong answer; some explain that the methodology is flawed or misreported… these are category 1 people being annoyed by category 2 reality…
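[Editorial aside: a toy version of the kind of ranking behind Hoover's list – ordering texts by how many distinct word types they use. My sketch; titles and texts are invented, and note that raw type counts are sensitive to text length, which is one of the methodological objections people reach for.]

```python
# Rank texts by vocabulary breadth (number of distinct word types).
import re

texts = {
    "Novel A": "the cat sat on the mat and the cat slept on the mat",
    "Novel B": "a dog barked while another hound howled at dusk",
}

def vocabulary_size(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens))

for title in sorted(texts, key=lambda t: -vocabulary_size(texts[t])):
    print(title, vocabulary_size(texts[title]))
```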

But when we stop using it as a gotcha it becomes a more provocative question… each of these titles contains a thousand, a hundred thousand thoughts and connections… that is what we do… as humanists we make those connections… we ask questions of the narrative we have created… Part of our problem is a general discomfort with letting the computer tell us what is so… but if we stop resisting that we might see peculiar mappings of books as cultural objects… it might show us a way to a deeper understanding of reading itself… it raises any number of questions about the development of English style… and most of all it raises questions about our discursive paradigms.

That gives us narrative possibilities we could not otherwise see. We cannot think of a text as a 50k-word block. The computer can ONLY apprehend the text in such terms. To understand the computer as finding facts is to miss the point. It is about creating triggers to ask questions, to look at the text in new ways. This is something I came across working on Virginia Woolf's The Waves. The structure is so orderly… and without traditional cultural narrative. And the characters speak in very similar styles, sentence structures, image patterns… some see some difference by gender or solidarity… but overall it is about unity… this is the sort of problem that attracts text analysis scholars like myself. I ran clustering algorithms looking for similarities unseen by scholars. On a lark we posed a simple question: "what are the words that the women in the novel use in common, that none of the men do?" It turns out that there are 9 such words. You could see that as a narrative – like a Found list – and then we did it with the men and found 120 words! Dramatic. So many words… Some critics found that disparity frightening… some think it backs up the sexism of the western canon. Others see this as a chance to ask other questions… to try with other authors, novels, characters… if you think this way, perhaps you've caught the DH bug; I welcome you. But do we think we'll find an answer to questions of gender and isolation? Do we want to answer those? The humanities want a world that is more complex, deeper than we thought. That process is a conversation…
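[Editorial aside: the query Ramsay describes is essentially set arithmetic – words every female speaker uses that no male speaker does, and vice versa. A minimal sketch, with toy token lists standing in for the characters' speech in The Waves:]

```python
# Words shared by *every* speaker in one group and used by *no* speaker in the other.
def exclusive_shared_words(group, others):
    shared = set.intersection(*(set(tokens) for tokens in group.values()))
    used_by_others = set.union(*(set(tokens) for tokens in others.values()))
    return shared - used_by_others

women = {"Susan": ["field", "apple", "window"],
         "Jinny": ["dance", "apple", "window"],
         "Rhoda": ["pool", "apple", "window"]}
men = {"Bernard": ["phrase", "window"],
       "Neville": ["poem", "window"],
       "Louis": ["ledger", "window"]}

print(exclusive_shared_words(women, men))  # {'apple'}
print(exclusive_shared_words(men, women))  # set()
```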

In 2015 the Text project will release huge volumes of literature. Perseus contains most Greek texts… there are huge new resources, and almost all the questions we might ask of these corpora have not been asked before… we can say they will transform the humanities, but that may not be true… the limiting factor is whether we choose to remain humanists in the face of such abundance… perhaps we need to be programmers, tool builders, text engineers… many more of us need to invite the new texts – lists, ngrams, maps etc. – into our ongoing conversation. We are here to talk about philosophical issues of data, and these issues are critical… but we have to be engaging with these questions… Digital humanities means databases, mark-up, watermelon…!

Q&A

Q: I am intrigued to think about how we design for the things we don't yet know we need to know…

A: Sure, imagining what we don't know… you inevitably build your own questions into the tools… ironically, an issue for scientific methods too. The nice thing about computers is that they are fast, obedient and stupid. They will do anything we ask them to, even our own most stupid ideas – huge serendipity just baked into that! It's a problem, but it's amazing how the computer does that job for me, surprisingly.

Q: That was a brilliant, fascinating talk. Part of the problem with digital humanities for literature right now is that it either tells us what we do know… or it tells us what we don't know, but then we worry that it's wrong… The description of the richness list was part of that. I really liked your call for an ongoing discussion that includes computer-generated data… but I don't see how we get past the current position. If all literary criticism says something is so, and expects "yes, but…", I can see how computer-generated data sits in that… but how can data be a participant in that conversation – beyond ruling something out, or concurring with expectations?

A: Excellent point, and let's not downplay the first part of your question at all. I saw Franco Moretti give a talk about titles getting shorter, for instance… who'd have thought?! But I think it has a lot to do with how we build our tools… I find it frustrating that we all use R, or tools designed for science or psychology… I want our tools to look more like the art-informed projects Lisa talked about. I think the humanities needs to do more like that, to generate the synergies. Tools that are more ludic.

Q: Maybe it's about perceived barriers being quite high. An earlier speaker talked about the role of repeatability. The ambiguity of reading a poem is repeatable. If the barriers to entry are low enough for repetition, and for others to play, to ask new questions, maybe that brings the data in as part of the conversation…

A: There are tools that let you play with the text more ludically. Voyant, for instance. But we come with a lot of cultural baggage as humanists… there is a phenomenon where, no matter what people are talking about, they give a literary-critical reading of a text, but when they show a graph we all think we are scientists… there is so much cultural baggage. We haven't learned how to be humanistic users of these tools, or to create our own tools.

Q: A question and an observation… There is a school of thought in cognitive psychology that humans are infinitely able to retrofit any narrative to any circumstances whatsoever, and that is very much what was coming through your data… Many humanities departments have become pseudo social sciences departments… but if you don’t have a clear distinction between category 1 and category 2 they can end up doing their own thing…

A: I don't want that for the humanities. I resist the social-science-type study of literature, of the human record or of the human condition… In my own work I move between being a literary critic and being an engineer… when it comes to writing software, that fuzzy method definition is wrong, it doesn't work… when I am a literary critic it is about all those shades of grey, those complexities… but those different states both seem important in pursuit of that end goal… if studying flu outbreaks, let's not be ludic… but for Bram Stoker, we should!

Q: In my own field of politics there was a particular set of work which gave statistical data a bad name… and I wonder if in your field the risk of the same is there…

A: In digital literary studies this is sometimes seen as a 25-year project to get literary profs into the digital field… but I always say that that's not true; there'll always be things to be done. There was a book in the 70s that looked at slavery in an entirely quantitative way; it made the argument no one wanted to hear, that slavery had been extremely lucrative. Economists said it was profitable. History fled from statistical methods for years after that… but historians do all now agree that it was profitable. And there is quantitative work there again/still. If I had to predict, I'd say the same thing seems likely for digital literary studies…

Q: I can’t resist one here… I was following a blog by Kirsch where you say that scholars should code and I wanted to ask about that…

A: OK, well, Kirsch lumps me in with the positivists… I'm not quite in the devil's party. But I teach programming and software engineering to humanists. It's extremely divisive… My views have softened over the years… for me programming is a magnificent intellectual exercise… knowing about it seems to help understand the world. But also, if you want to do research in this area you need some technical skills. If that's programming… well, learn what you need, whether that's GIS, 3D graphics… if you want to build things you might need coding!

Big Data and the Co-Production of Social Scientific Knowledge – Prof Rob Procter, Professor of Social Informatics, University of Warwick (session chair: Prof Robin Williams)

Professor Robin Williams is now introducing Professor Rob Procter, our next speaker, talking about his work around social informatics.

The eagle-eyed amongst you will spot my change of title – but digital is infinitely rewritable! I am working in the overlap of sociology and computational tools and methods. The second thing I want to talk about is sociology in the age of "big data". I think what this demonstrates is the opportunity for sociology to respond in various different ways to this big data, and the tools to interrogate that data. The evolution of tools and methods is a key thing to look at in this area. That brings me to the Collaborative Online Social Media Observatory (COSMOS) and the tools we are developing for understanding social media… and then I want to talk about sociology beyond the academy – the co-production of social scientific knowledge. There are other types of expertise being mobilised at the moment, in looking at the computational turn things are taking. Not always a comfortable thing for social scientists…

So firstly, social informatics. What is that? Well, to me it's the inter-disciplinary study of the factors that shape the adoption and use of ICTs. And what gets me excited is how these then move into real processes. For me the emphasis is on innovation as a public, participatory process of experimentation and learning, where the meanings of technologies are collaboratively explored and co-produced. You can argue that social media is a large-scale experiment in social learning… Of course, as we witness a growing scale of adoption, more people experience those processes: how social media works, how they might adopt or use it… to me this is a fascinating area to study. And because it is public and involves social media, it is very easy to see what's going on… to some extent. And generally that data is accessible for social research purposes. It is not quite that simple, but you can research without the barrier of having to pay for data if you do it in a careful way.

So these developments have led me to social media as a prime area of my research. Firstly, some work we did on the impact of Web 2.0 on scholarly communications – work with Robin Williams and James Stewart. Many of us will be part of this, many of us tweet our research… but many of us are not clear what that means, what the implications are. So we did some work and got some interesting demographic findings… we also did interviews with people and got ideas of why they were, and why they were not, adopting… Some were very polarised. In parallel we looked at how scholarly publishers incorporate social media tools into their work, in order to remain key players… they do lots of experiments, often focused on measuring impact and seeing the movement of their work to other audiences. Some try providing blogs on their content. But all with mixed success. One comment noted that it is easier to get comments on cricket reports than on research online… So it's hard to understand and capture impact…

I'll come back to that, and to the co-creation of knowledge. But first I want to talk about the riots in England in 2011. This was work in conjunction with the Guardian newspaper. They had been given 2.5 million tweets directly by Twitter. They wanted to know whether social media was particularly vulnerable to the sharing of false information – did that support calls for shutting down social media at times of crisis? So we looked at a number of different rumours known about and present in the corpus: zoo animals on the loose; the London Eye on fire; Miss Selfridge on fire; rioters attack a children's hospital in Birmingham. I will talk about that latter example. We wanted to ask how people use, understand and interpret social media in these circumstances, how they engage with rumours…

So this is about sociology in the age of “big data”. It calls for interpretive methods but we can’t do that at scale easily… so we need computational methods to focus scarce human resources. We could crowdsource some of this but at this scale that would still be a challenge…

So firstly, let's look at the work of Savage and Burrows (2007), who talked about the "coming crisis of empirical sociology": the best sociology, as they saw it, was conducted by private companies, who have the greatest and most useful data sets, which sociologists could not rival nor access. However, we might be more confident about the continuing relevance of the social sciences… social media provides a lot of born-digital data… maybe this should be called the "social data deluge". There is a lot of data available, much of it freely. Meanwhile there are lots of policy initiatives to promote open data in government, for/by anyone with a legitimate usage for it. Perhaps we can be more confident about the future of academic sociology…

But if you look at the purposes this data is put to, it's a more mixed picture… we see analysis of social media for stock market prediction, where correlation is mistaken for causality. Perhaps more interesting are protest movements – like Occupy Wall Street – or the use of social media during the Egyptian revolution… Is it a tool for political change, a way for citizens to acquire more freedom? A way for movements to organise themselves? There is lots of discussion of these contexts. Methodologically it's a challenge of quantity, and of methods that combine social science understanding with social media tools enabling analysis of large-scale data…

So back to that rumour from the riots and that rumour of a children’s hospital being attacked in Birmingham. This requires thorough work with the data, but focused where it counts.

So, what sparked this off was someone tweeting that the police were assembling in large numbers outside the hospital… therefore the hospital must be under threat. A reasonable inference.

So, methodologically: computational methods for analysing tweets are an active area of research – sentiment analysis, topic analysis. We combined a relatively simple tool looking at information flows… looking at flow from "opinion leaders" to others (e.g. RTs). Once that information flow analysis has been done, we can use the relative sizes of flows to prioritise the data – size as a proxy for importance… this structure, we argue, is useful for focusing human effort. And then we used coding frames and conventional qualitative methods of content analysis to understand how Twitter was used – to inductively analyse information flow content to develop a "code frame" of topics; to use the code frame to categorise information flows (e.g. agreement, disagreement, etc.); and then to visualise that analysis of information flows…
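[Editorial aside: the information-flow step, as described, can be approximated very simply – group retweets under the original "opinion leader" tweet and rank flows by size, so hand-coding effort goes to the largest flows first. My sketch; the tweet structure is hypothetical, not the COSMOS format.]

```python
# Rank information flows by size (number of retweets of the source tweet).
from collections import Counter

tweets = [
    {"id": 1, "user": "a", "retweet_of": None},
    {"id": 2, "user": "b", "retweet_of": 1},
    {"id": 3, "user": "c", "retweet_of": 1},
    {"id": 4, "user": "d", "retweet_of": None},
    {"id": 5, "user": "e", "retweet_of": 4},
    {"id": 6, "user": "f", "retweet_of": 1},
]

flow_sizes = Counter(t["retweet_of"] for t in tweets if t["retweet_of"] is not None)

# Largest flows first: these are the ones worth hand-coding with the code frame.
for source_id, size in flow_sizes.most_common():
    print(f"flow rooted at tweet {source_id}: {size} retweets")
```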

So here we see that original tweet… you see the rumour mushroom, versions appear… bounding circles reflect information flows… and individuals and their influence… Initially tweets agree/repeat… and then we start to see common-sense reasoning: those working at or near the hospital dispute the threat; others point out that the police station is next door to the hospital, thus providing an alternative understanding. People respond and do not just accept the rumour as true… So rumours do break quickly, BUT they are not necessarily more vulnerable, as versions and challenges quickly appear to provide an alternative likely truth. That process might be more rapid with authoritative sources – media or police in this case – adding their voice. But false information may persist longer, with potential risk to public safety – see the follow-on Pheme project.

But I wanted to talk about authoritative sources again – the police and media and how they use social media. The question is: what were the police doing on Twitter at that time? Well, another interesting case here… the riots in Manchester led to people creating new accounts to draw attention to public bodies like the police, as an auxiliary service to raise awareness of what was going on. Quite an interesting use of social media, where people see something like this arising.

So what these examples demonstrate is innovation as a co-production… lots of people collectively experimenting, trying out things, learning about what social media can and cannot do. So I think it’s a prime example for sociologists. And we see uses are emergent, people learn as they use… and it continues to change and people reinvent their own uses… And we all do this, we have our own uses and agenda shaping our interactions.

So this work led to development of tools for use by social scientists… COSMOS involved James S, Ewan K, etc. from Edinburgh… It would be an error to assume social media can tell us everything that takes place in the world – this data goes with crime data, demographic data, etc. The aim of COSMOS is to forge interdisciplinary working between social and computing scientists. To provide open, sustainable platform for interoperable social media analysis tools. And refine and evolve capabilities, provide service models compatible with needs of diverse user communities.

There are existing tools out there for social media analysis… but many are blackbox systems, and it's hard to understand the process that is taking place. So we want those blackbox processes to be opened up; they are complex but can be understood and explored…

So the COSMOS tools let you view timelines, to look at rates and flows… to select data based on keywords and hashtags… to view the networks of who is tweeting… and to compare data with demographic data.
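
As an illustration of the timeline and keyword-selection features – purely a sketch with assumed column names, not the COSMOS implementation – the same operations are easy to express with pandas:

```python
# Illustrative only (not the COSMOS code): filter a tweet collection by
# hashtag and compute the rate of tweets over time. Column names assumed.
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2011-08-08 20:01", "2011-08-08 20:03", "2011-08-08 21:15"]),
    "text": ["#ukriots hospital rumour", "#ukriots rt", "all quiet now"],
})

riots = df[df["text"].str.contains("#ukriots")]       # keyword/hashtag selection
rate = riots.set_index("time").resample("1h").size()  # tweets per hour: the timeline view
print(rate)
```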

There are also some experimental tools around geographical clustering: the way people use Twitter can show geographical patterns. Another strand is topic modelling, topic clustering… identifying tweets on the same topic. This is where NLP, and Ewan and his colleagues in Informatics, have become important.
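
A minimal sketch of what tweet topic clustering can look like – here with off-the-shelf LDA from scikit-learn, standing in for the bespoke NLP work described:

```python
# Toy topic clustering of tweets with LDA; the actual work described used
# the Informatics group's own NLP tools, not this off-the-shelf pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "police outside the childrens hospital",
    "hospital is fine the police station is next door",
    "riots spreading on the high street tonight",
    "shops looted on the high street",
]

X = CountVectorizer(stop_words="english").fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-tweet topic mixture: similar rows cluster together
```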

So current research is looking at: social media and civil society – social media as digital agora; “hate” speech and social media – understanding users, networks and information flows – there is a learning challenge here about people not understanding the impact and implications of their comments, perhaps a misunderstanding of social media; citizen social science – harnessing volunteer effort; social media and predictions – crime sensing, data integration and statistical modelling; suicide clusters and social media; humanitarianism 2.0 – care for the future; BBC World Service – tweeting the Olympics. And we have a wide range of collaborators and community engagement.

Let me briefly talk about social media as digital agora… it may sound implausible… many talk about social media as a force for change… opportunities to promote democracy… not just in less democratic countries, but also in democratic countries where processes don't seem to work as well… So we are looking at social media as communication in smaller communities. And also thinking about social resilience in a day to day, small scale way… problems which, if not managed, may become bigger issues. For that we have studied Twitter in several locations, collected data, interviewed participants… and built up a network of communications. What is interesting, for instance, is that the non governmental group @c3sc seems to have a big impact. We have to see how this all plays out… it deserves a longitudinal approach…

So, to conclude… let me talk about the lessons for academic sociology… and I think it's about sociology beyond the academy and the role of wider players. Firstly data journalism – I was interested in Steven's 1965 press accounts of the blackout earlier. Perhaps nowadays the way journalists are being trained might change that… journalists are increasingly data savvy. We see this through Fact Check, through the RealityCheck blog… through sourcing from social media. So too citizen journalism, used to gather evidence of what is happening… tools like Ushahidi… and a sense of empowerment for these communities… it reminds me of the notion of sousveillance… and the possibility of greater accountability… And citizen journalism in the expenses scandal – the Guardian recruited people to look at the expense claims. The journalists couldn't do all of that themselves… so they recruited others.

So, citizen social science… in various ways (see Harris 2012, “Oh man, the crowd is getting an F in social science”). And with Ken Benoit's work discussed earlier… we see more people coming into social science knowledge production…

So the boundaries of social science research production are becoming more porous, and social scientific knowledge production is changing, potentially becoming more open. These developments create an opportunity to reinvigorate the project for a “public sociology” – as per Burawoy (2005) and his call “For a public sociology”: to make sociology accountable to more people, to organisations, to those in power. Ethically we need to ask what is needed and wanted, how the agenda is set, how to deliver more meaningful and useful social sciences to the public.

How can we do that? New modes of scholarly communication and technology, but that's not enough… we've also been working with a company on a possible programme for the BBC where social media is used to reflect on the week – a knowledge transfer concept. Also knowledge transfer in the Pheme project – for discriminating false and true information… all quite conventional… but we need other pathways to impact: people as sensors and interpreters of social life, training and capacity building in ways we have not done before. Something that has emerged in science and citizen science is the notion of workshops, hackathons, getting people engaged in using mundane technologies for their own research (e.g. Public Lab). We need something similar for social media tools, so people can extract the data they want, for their purposes, for their agenda… to create a more public sociology that people can do themselves. And we need to have an open dialogue about research problems.

Q&A

Q: My question is about COSMOS and the riot rumours work… within COSMOS do you have space for formal input around ethics and law? You cut close to making people identifiable and locatable. And related to that… with police in those circles… it may arouse suspicions about motives… for instance in Birmingham did the police just monitor or did they tweet?

A: They did tweet, but not on that rumour. It is an understandable concern that collaborations make powerful state actors more powerful… for us, we want these technologies available for anyone to use… not some exclusive arrangement; they should be available to communities, third sector organisations… anyone who feels that social media may be important in their research.

Q: I was more concerned about self-led vigilantes, those who might gang up on others…

A: It is a responsibility of civil society to be aware of those dangers, to have mechanisms to avoid harm. It does exist already… so if social media becomes an instrument of that we have to respond and be aware – that is partly what the hate speech project is about… The bigger learning problem is about conduct in the social media space, and the problem that people don't realise how quickly conduct becomes visible to a much bigger group of others… and that relates to ethics… Twitter is a public domain space, but when something is highlighted by others… we have to revisit the ethics issues time and again. For the riots study we did the usual clearance process… Like Ken we were told it was fine… but told not to make people identifiable – and that is nearly impossible in social media. Not an easy thing to resolve.

Q: I'm curious about changes in social media platforms and how that affects us… moves from Facebook to Twitter to Snapchat to Instagram… how does that become apparent – it may be invisible – and how do we track it?

A: There is a fundamental issue of sustainability of access to data from social media. It is not too much of a problem to gather data if you design harvesting appropriately for the platforms' rate limits. In terms of other platforms, people moving to them, and changes in the modality, observability and accessibility of data… what social research needs is agreement with the providers of data that, under certain conditions, their data is available for research… to make legitimate access to data easy. There are efforts to archive data – the Library of Congress collects all tweets, and is likely to allow access under licence, I think – to ensure access to platforms as the use of platforms changes…

Edinburgh Data Science initiative – Prof Dave Robertson, Head of School of Informatics

Sian Bayne quickly introducing Dave Robertson providing a coda to today’s session.

I'm just briefly going to talk about the Edinburgh Data Science Initiative. The idea is data as the catalyst for change in multiple academic disciplines and business sectors.

So firstly the business side… big data can be very big and very fast… that can be off-putting in the humanities… But you don't have to build something big to be part of this… I work in these areas but my models are small… and there is a stack you never see – the economic and political side of this stuff.

And here's the other one… this is about variety and velocity – a chart from IBM – looking at predictions of the volume of data and, more interestingly, the uncertainty of data… And the data sits in a few categories… enterprise data, loads of social media, and loads of sensors (the internet of things)… but uncertainty over aggregate data is getting hugely large… and that's not in the sphere of traditional engineering, or traditional business…

The next slide here is about architectures… this is topical… it's IBM's Watson system, the one that won Jeopardy… it harvests loads of information and generates hypotheses… This stack starts with very computational stuff but the top layers look much more like humanities work and concepts…

Now technology and society interact, and often technology pushes on society. For instance look at Moore's Law (computing capacity roughly doubling every couple of years) mapped against the cost of sequencing the human genome: it looks radically different, costs drop hugely in the late 2000s as a lot of effort is pushed in. And that drop in cost to $1000 per genome is socially important… I could sequence my genome… maybe I don't want to. You can sequence at population scales… the machines generate a TB of data a week too – huge data being generated! And this works the other way around… sometimes technology gives you an inflection point and you have to keep up, sometimes society pushes back. A lot of time online is spent on social networks (allegedly 1/7)… now a unified channel for discovery and interaction… And the number of connected devices is zooming up…

So that's the sort of thing that is pushing a lot of this… We have spoken to all the schools in the university… everyone reacts… you will find everyone recognising this… and you hear them saying “and it changes the way it makes me think about my research”. That's so unusual, to have such a common response…

Why this is important at Edinburgh… We have many interdisciplinary foundations at Edinburgh… All are relevant, no matter how data intensive, but we are well developed in interdisciplinary working…

And we have a whole data driven start up Ecosystem in Edinburgh… we have Silicon Walk (miicard, zonefox, etc.), Waverley Gate (Amazon, Microsoft), Appleton Tower (Informatics Ventures, feusd, Disney research, tigerface), Evo House (FlockEdu, Lucky Frame, etc), Quartermile (Skyscanner, IBM), Informatics, Techcube (FanDuel, Outplay, CloudSoft, etc.). A huge ecosystem here!

So, I’ll leave it there but input, feedback welcomed, just speak to myself and/or Kevin.

And that was it for the day…

Nov 11 2013
 

This afternoon I am attending “A digital humanities workshop in four keys: medicine, law, bibliography and crime”, a University of Edinburgh Digital Humanities and Social Sciences event. I will be liveblogging throughout the event and you can keep an eye on related tweets on the #digitalhss tag. The event sees four post doctoral researchers discussing their digital humanities work.

As usual this is a liveblog so my notes may include the odd error or typo – please let me have your thoughts or corrections in the comments below!

Alison Crockford – Digital articulations: writing medicine in Edinburgh

In addition to the four keys we identified, we also thought about the four ways you can engage with the humanities field more widely. And in addition to medicine I will be talking about notions of public engagement.

Digital articulations plays on the idea of the crossover of humanities and medicine: both the state of being flexibly joined together and of expressing the self. The idea came from the Dissecting Edinburgh exhibition at Surgeons' Hall. Edinburgh has a unique history of medicine when compared to other areas of the UK, but scholars don't give much consideration to the regional history and how medicine in an area may be reflected in literature. So you get British texts or anthologies with maybe one or two Scottish writers bundled in. Edinburgh is one of the most prominent cities in the history of medicine. My own research is concerned with the late 19th century, but this trend really goes back at least as far as the fifteenth century. As an early career researcher I can't access the multimillion pound grants from the ESRC you might need… so digital humanities became a kind of natural platform. I wanted to build a better, more trans-historical perspective on literature and medicine; I would need input from specialists across those areas, and I would also need ways to visualise this research in a way that would make sense to researchers and other audiences. I was considering building an anthology and spoke to a colleague creating a digital anthology. I chose to do it this way with a tool called Omeka, in part because of its accessibility to other audiences. Public engagement is seen as increasingly favourable, particularly for early career researchers. I'm interested in tools to foster research, but also to do so in digital spaces that are public, and what that means.

I don't have a background in digital humanities and there doesn't seem to be a single clear definition. But I'm going to talk about some of the possibilities: what drives a project, how does that influence the result, etc. I will take my cues from Matthew Kirschenbaum's essay on digital humanities and English literature. He sees digital humanities as concerned with scholarship and pedagogy being more public, more collaborative, and more connected to infrastructure.

I was reassured to know I am not alone in looking at this issue and in having questions; there was a blog post on HASTAC – the Humanities, Arts, Science and Technology Alliance and Collaboratory – looking at the intersection between the digital humanities and public engagement, despite that organisation being already active in that space. I get the sense that this topic comes up as being there, but perhaps only recently have there been deliberate reflections on its implications.

The Digital Humanities Manifesto 2.0 talks about increasingly public spheres. There's a kind of derision in Kirschenbaum's take on digital humanities and public engagement. I'm not sure public engagement deserves such derisive treatment, even though I am concerned about how public engagement and similar value judgements are increasingly chipping away at the humanities. But there is more potential there…

Many digital humanities tools are web based apps; they are potentially public spaces, and there are implications for our perspectives on any digital humanities, or indeed any humanities work. For instance the Oxford digital humanities conference last year, looking at impact, talked about public engagement as something more than just dissemination – something richer: thinking about the participation of your audience, their needs and interests, not just your own.

Bowarst states that humanities scholars may risk letting existing technologies dictate their work, rather than being the inventors and designers of their tools. Whilst we may be more likely to be adopters, I do not think that is always the case, nor necessarily a problem. Working as Wikipedian in Residence at NLS I have been impressed with the number of GLAM collaborations embracing a range of existing kit: Flickr, WordPress, Omeka, Drupal.

Omeka is designed for non technical users; it is based around templates and editable content, and is about the presentation of materials. Such sites are often designed for researchers, those already interested, who will see it as a tool for their research but not for wider audiences (e.g. digitising historical serialised fiction, and depictions of disability in nineteenth century literature). But these can look samey as websites – there are limitations without design support. However, looking at Lincoln 200, or indeed the George Arthur Plimpton rare book and manuscript page versus the Treasures of the New York Public Library website, the latter is more visual and appealing. So I am interested in having the appeal of a public orientated website with the quality of a scholarly tool.

So looking at Gothic Past we see something that is both visual and of quality. You can save materials. The plugins, opportunities for discourse, etc. in Omeka open up public engagement in richer ways…

Returning to medical humanities… I think it has inherent links to public engagement: it helps enhance understanding of perceptions of health and illness, and its impact can be so universal. Viewing medicine through the lens of literature engages a massively diverse audience who have their own interests, experiences and perspectives to share. Giving a local focus also connects to the large community interested in local history. And designing the resource for that diverse audience with these many perspectives will help shape the tool. Restricting a resource to researchers…

Q&A

Q) Really interesting, particularly the problems of digital humanities and research… Could you say more about Omeka and how you plan to use it?
A) I have a wish list for what I want to make from Omeka. I would like logins, the ability to save material, and to have user added content and keywords to drive the site, so that there is input from other audiences – not just researchers but also public audiences. For instance exhibitions around digital patienthood. I hope to be a good customer. If you don't have the technological skills, you still have to put in the time to understand the software, to create good briefs; two months in I'm still working with the web team to create a good resource. I want to be a good customer so that I get what I want without making the team's life hell!

Q) What do you think being a good client means for our students? Bergson mentions that the more we rely on existing technologies, the harder it becomes to think outside the box.
A) I think some of those coming up behind me have a better understanding of things digital… but that is of corporately driven websites, and they don't necessarily look beyond that. Maybe you need something akin to research methods training, looking at open source materials and resources. But realistically that may not be possible.

Q) I wanted to ask about the way the digital humanities is perceived as a thing. In your public engagement work is that phrase used?
A) I think largely people think that these are the humanities and these are digital tools. There are parallel conversations in humanities and in cultural contexts… the idea of the digital library just being the library. So this doesn't seem to be specific to academia; it is a struggle for others too to work out how to incorporate the digital into their experience.
Q) We are already post digital?
A) Kind of… the idea of a digital resource from a library being a different tool doesn't really seem to be something you actively consider – you just see a cool tool.

Q) do you think the schism between research and public engagement exists in the cultural sector?
A) They have a better potential chance to bridge that. They must provide materials for research and also for public engagement and public audiences. We think about research first and sharing further, but these organisations think inherently about their audiences – and the resources are great for research, for instance the historical Post Office directory research. The sector is a good place to look to see what we might do.

Chen Wei Zhu – Rethinking property: copyright law and digital humanities research

Chen Wei did his research on open source but spent much of that time at the British Library.

I will be doing a whistle stop tour of copyright law, mainly drawing on the non digital. Just to set the scene… when did the digital humanities start? 1946 is a convenient start date: an Italian Jesuit priest began indexing the massive work of Thomas Aquinas; the texts were digitised, put onto CD-ROM and are now online. But at that time the term wasn't digital humanities but “humanities computing”. I tried Google's n-gram viewer, and based on that corpus you see that “humanities computing” comes in in the 1970s but “digital humanities” emerges in the 1990s. Humanities computing is still hugely used, but it will be interesting to see when “digital humanities” becomes dominant or bigger. A health warning here… the corpus is best between the 1820s and 1922. Works published in the US up to 1922 are out of copyright, but in Europe materials published before then may still be in copyright. And another health warning… OCR scanning isn't reliable before the 1820s because of print inconsistencies and changes, e.g. the long “s” being read as “f”. The long “s” fell out of use after The Times newspaper dropped it in 1803. So there is much data to clean up.
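
As a toy illustration of that clean-up problem (not the Google Books pipeline), a normalisation pass might replace the long-s glyph where OCR preserved it and patch known “f”-for-“s” misreadings; the dictionary here is invented:

```python
# Toy sketch of long-s normalisation for OCR'd early print.
import re

def normalise_long_s(text: str) -> str:
    # Replace the long-s glyph directly where OCR preserved it...
    text = text.replace("\u017f", "s")  # 'ſ' -> 's'
    # ...and patch a few known 'f'-for-'s' misreadings via a toy dictionary.
    known = {"fenfe": "sense", "paffage": "passage"}
    pattern = re.compile(r"\b(" + "|".join(known) + r")\b")
    return pattern.sub(lambda m: known[m.group(1)], text)

print(normalise_long_s("the fenfe of the paffage"))  # -> "the sense of the passage"
```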

So what is digital humanists' opinion and understanding of copyright? I feel that digital humanities scholars are quite frustrated – e.g. Burdick et al 2012 see it this way, and Cohen and Rosenzweig 2005 see it as an issue of things never being fixed? [check this reference]

The US Copyright Office is shut down – the US federal government shutdown included the Copyright Office. It is still saying it is shut… there will be a huge backlog for registering copyright.

So how did copyright law begin? What is the connection between the Loch Ness Monster and copyright? The story goes that St Columba was not only the first sighter of Nessie, but also the first person engaged in a copyright dispute. There is a mythical connection too…

Columba, party to that first copyright dispute, is sometimes called the patron saint of copyright – a huge misunderstanding; he is more the first pirate, copying a manuscript without the permission of his tutor. When he was caught secretly copying the book of psalms, St Finnian was very angry and wanted to restrict the copy. The ruling: “to every cow belongs her calf, therefore to every book belongs its copy”. So this was the first copyright case. Columba had the decision go against him, and he rose up against the king, leading something of a bloodbath.

Now in this case neither Finnian nor Columba was a clear author, and there was no publishing planned or taking place. So skip forward to 12th century China and we see Cheng Sheren, the first publisher to register their copyright. We see a picture like pre-18th century England, where the publisher holds the copyright. In China, as in 16th and 17th century England, it is all about censorship, not copyright in any other sense.

The Statute of Anne 1710 is the first copyright act, bringing in the rights of authors and including no censorship clauses – the first modern copyright law. But author based copyright didn't really take off until the early nineteenth century; before then there was another ethos. Only as authors come to be seen as romantic geniuses in the romantic age does this model take off. Publishers recede to the background to manage economic aspects and authors move to the forefront.

Enter stage left the Authors Guild. So, Authors Guild vs HathiTrust (2012). The Authors Guild has around 8000 members at present. There was an encouraging decision: the district judge recognised a fair use defence for HathiTrust digitising copies of texts. The judge identified two types of transformation: full text search, and accessibility of text. That is a very important aspect of the ruling, and the judge was convinced of the fair use defence. Some humanities scholars submitted briefs; Matthew Jockers did an analysis of the use of digitised text.

Where we are… we started from the sixth century and ended in 2012. The meaning of “copy” has changed. Is digitisation the same as copying by hand? As digital humanists and copyright lawyers we have to reimagine the role of copyright and the role of the author in copyright. We could see authors as intellectual property owners – we didn't see “intellectual property” emerge as a term until the 1960s, with an influential book and the IPO set up, but that idea does change our thoughts on copyright to some extent. But we also see open source, coined in 1998… there is parallel growth there… We are more a steward and custodian than an exclusive intellectual property owner.

Q&A

Q) Just to be a pedant here… your discussion of the romantic author… I think you got it reversed… the law precedes the author by a distance. In the 18th century original works – poems, epic poems like the work of Alexander Pope etc. – were, for the sake of their authors' rank as gentlemen and their royal sponsors, made into books of vellum, extremely expensive… The way the publishers got around the need to publish these expensive texts was to republish out of copyright works, recycled materials (including Shakespeare), etc. – cheap material on recycled rag paper. When new works appear, when paper costs drop, then you see new types of writing replacing old writing and publishers have little say… And in the early nineteenth century you see authors assert power. The profit and capitalisation of ideas in the republishing of works is crucial to the current Authors Guild debate.

A) I'm glad you mentioned Alexander Pope; he sued in a 1741 case. Almost all cases from the 1710s onwards are between publishers, but Pope actually sued his publisher at that time. So there is a gradual change… going into the nineteenth century.

Q) US versus UK?
A) A divergence of law… the US Copyright Act gave a 56 year term, and that regime was in place until 1978… anything published before 1923 is out of copyright in the US. In the UK it is 70 years after the author's death; in Canada 50 years – hence the sheet music sites in Canada: stuff out of copyright in Canada but not in the UK, yet you can access it in the UK. Copyright is definitely territorial, but internet access is not.
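
A toy worked example of that territorial divergence, using only the simplified rules given in the answer (real copyright law has many more exceptions and transitional provisions); the function name and its rules are illustrative only:

```python
# Simplified public-domain check per jurisdiction, as described above.
def in_public_domain(pub_year: int, author_death_year: int,
                     country: str, today: int = 2013) -> bool:
    if country == "US":        # pre-1923 publications are out of copyright
        return pub_year < 1923
    if country == "UK":        # life of the author plus 70 years
        return today > author_death_year + 70
    if country == "Canada":    # life of the author plus 50 years
        return today > author_death_year + 50
    raise ValueError("unknown jurisdiction")

# Sheet music by an author who died in 1950: free in Canada, not in the UK.
print(in_public_domain(1930, 1950, "Canada"))  # True  (1950 + 50 < 2013)
print(in_public_domain(1930, 1950, "UK"))      # False (1950 + 70 > 2013)
```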

Q) interesting you raised music, a whole other complicated history there.
A) Absolutely, very complex. For instance Stravinsky's work was very difficult for him to copyright because of Russia's take on property.

Q) On the ease of violating copyright law… working for Wikipedia and Wikimedia UK… it can be twisted around. At the NLS we frequently have conversations about releasing digitised materials. In the UK, unlike the US, newly digitised material has new rights attached. But we have just been putting content out there.
Comment) The British Library lets you use copies for print runs of less than 3000 copies, but if you have an ebook contract you have to pay huge sums for an image.
Q) It costs more to enforce copyright and fees. The NLS has a non commercial clause for digitised materials; usually we won't charge if they come and ask us. But the cost of enforcement can be higher than what pursuing it would recover. Is this unique to digital?

Gregory Adam Scott – The digital bibliography of Chinese Buddhism as a research and reference tool

Gregory is a digital humanities post doctoral fellow at IASH; his doctorate looked at printing and publishing in early Buddhist cultures. His talk has a new title: “Building and rebuilding a digital catalogue for modern Chinese Buddhism”.

I chose this title inspired by Jorge Luis Borges' “The Library of Babel”, containing the sum of all possible knowledge: versions with all typographic mistakes, the catalogue itself… I evoke this to represent the challenge we face today in looking at mountains of data; whilst the text may be less random, we still risk becoming lost in our own library of Babel.

My own work looks at a narrower range of data. I began studying the digital catalogue of Chinese Buddhism, cataloguing texts from between 1866 and 1950. But first a whistle stop tour of printing and religious printing in China. A woodblock print edition of the Diamond Sutra from 868 CE remains the earliest printed text that records the year of printing. In pre-modern East Asian print history, religious texts were some of the most frequently printed texts. The printing blocks of the Korean Buddhist canon were an enormous undertaking in terms of time, cost and political support. Often the costs were supported by the idea that contributing to publishing religious works participates in a merit economy, bringing good things to you and to your family which can then be gifted to others – so these texts often include a credit to donors, in which they dedicate the texts to loved ones.

Yang Wenhui (1837-1911) and his students published hundreds of texts and thousands of copies; he was a hugely influential lay Buddhist publisher. The introduction of movable type and western printing processes was hugely important: more work was printed in a thirty seven year window than in the previous two thousand years. This is great in terms of accessing primary sources but problematic for understanding printing cultures. We see publishers opening up. The history of modern China is peppered with conflict and political and cultural change. Religious studies were often overlooked in the move towards secularisation – this is now slowly changing. And libraries often lack key religious texts; it can be particularly hard to track the history of print in this period because of variance in names – of contributors, of texts – and of cataloguing.

So I wanted to go back to original sources to understand what had been published. I started with five key sources whose authors had created bibliographies based on accessing original materials rather than relying on secondary sources. There were still errors and inconsistencies. I merged these together where appropriate. I wanted to maintain citations so that the original published sources could be accessed and the work could be understood properly.

I did this by transcribing the data. I used a simple, bare bones method with XML, separating the data and the display of the data. If someone wants to transform the data this format will allow them to do that. It is used simply; tags and descriptions are as human readable as possible. I want future researchers to be able to understand this. I also used Python for some automated tasks for indexing some of these texts.
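
To give a flavour of that approach – the tag names below are invented for illustration, not Gregory's actual schema – a record might look like this, parsed with Python's standard library:

```python
# A bare-bones XML bibliographic record, kept human readable, parsed with
# the standard library. Tag names are hypothetical.
import xml.etree.ElementTree as ET

record_xml = """
<entry id="0001">
  <title>金剛經</title>
  <contributor role="editor">楊文會</contributor>
  <year>1902</year>
  <source>Publisher catalogue, Nanjing, 1902</source>
</entry>
"""

entry = ET.fromstring(record_xml)
print(entry.get("id"), entry.findtext("title"), entry.findtext("year"))
# Keeping data separate from display like this lets later researchers
# transform the records (to HTML, MARC, CSV...) without re-keying them.
```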

Looking at the web interface that I put online: it uses PHP, the same stack as Omeka, and the database runs on SQL. There is a search interface where you can enter Chinese keywords, and eventually you will be able to search by year or pairs of years. It returns an index number, title, involved author etc. – simple but helpful information. It includes 2328 entries, where a spike in 1902 is very evident. And then each item has its own static HTML page, which is easy to cite and includes all the information I know about the text. So far I think this resource has been useful in producing data to point the way towards future work… less the end of research, more the beginning. This work has let me see previously undiscovered texts; you can also look across trends, across connections, at relationships to the larger historical picture. It could also be applied to other disciplines and regions.
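
Behind such a PHP/SQL interface, the keyword-plus-year search might look roughly like this – sketched here in Python with sqlite3, with guessed table and column names:

```python
# Rough sketch of the keyword-and-year search; table/column names invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (idx INTEGER, title TEXT, author TEXT, year INTEGER)")
conn.execute("INSERT INTO texts VALUES (1, '金剛經', '楊文會', 1902)")

keyword, start, end = "金剛", 1900, 1910
rows = conn.execute(
    "SELECT idx, title, author, year FROM texts "
    "WHERE title LIKE ? AND year BETWEEN ? AND ?",   # keyword plus year-pair search
    (f"%{keyword}%", start, end),
).fetchall()
print(rows)  # -> [(1, '金剛經', '楊文會', 1902)]
```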

All of my input to this project is provided under Creative Commons (non commercial). Bibliographic data isn't copyrightable as it is factual knowledge, but the collection of it could be seen as original work, so I've stated it is my work that I am happy for others to use.

The reason there is such a spike in 1902: where a date is not known it is assigned to that default date, so that all texts have a date.

This catalogue is different from book suppliers' data as the purpose is so different; my research use is not about purchase in the same way. I want to add features and finesse this somewhat, but my dream is of doing what I'd call “biblio-biographies”: seeing the appearance of a text over time, where it appears in publishers' catalogues… and how the pricing and presentation changes. For instance, looking at the Diamond Sutra we see different numbers of editions; one offers a special price for 1000 copies. I have used bibliographic sources, but there are so many more forms and formats that I will need to consider, and each source will be treated differently. Adverts may appear for publications that were never produced. I have moved from bibliography, to catalogue, to something else.

Q&A

Q) Why not use existing catalogue tools?
A) Nothing had the right sort of fields – very different roles of authors, editors, etc., not in a standard format. I considered MARC, but it would be relatively easy to transform the XML to MARC.

Q) are you thinking about that next stage, about having ways for more people to contribute.
A) I have been involved in the wiki based dictionary of Chinese Buddhism; we opened it up to colleagues and nothing happened – only us, the co-editors, contributed. A big issue is about getting credit for your work, which may be the barrier to contribution.
Comment) Have a look at the website BRANCH on nineteenth century literature; they have asked for short articles and campaigned for inclusion in the MLA bibliography, and that helps with prestige. You just need big names to write one thing…

Q) Could you say something more about other sources?
A) There are periodicals – a huge number of them. A lot of these focus on particular printings of texts, some include advertisements, etc., so these texts point off to other nodes and records.

Q) You talked about deliberately designing your catalogue for onward transformation – have you thought about how you will move forward with the structure for the data?
A) I'm not sure yet, but I will stick to the principle that simple is good, and reusable and transformable are good.
Comment) You might want to look at records of music and musical performance.
A) I'll keep that in mind; readings of these texts are often referred to as performances, so that may be a useful parallel.

Louise Settle – Digitally mapping Crime in Edinburgh, 1900-1939

Louise is a digital humanities post doctoral fellow at IASH and her work builds upon her PhD research on gender and crime in early twentieth century Scotland.

I want to talk about digital technologies and the visualisation of data, particularly spatial data. I will draw upon my own research data on prostitution, and consider the potential for data analysis.

My thesis looked at prostitution in Scotland from 1892 to 1939. The first half looked at the work of reformers, and the second half at how that impacted on the lives of women at this time. So why do crime statistics matter? Well, they set prostitution in context, recording changes and changing attitudes. My data comes from the burgh court records: where arrests took place, where police looked for arrests, and the locations of brothels at this time. Obviously I'm only looking at offences – so the women who were caught – and that's important in terms of understanding the data. Because these were paper records, not digitised, I looked at four years only, coinciding with census years, or the years with full data nearest census years.

I used Edinburgh Map Builder, developed as part of the Visualising Urban Geographies project led by Professor Richard Rodger, who helped me use this tool, although it is a very simple tool to use. It allows you to use NLS historical maps, Google Maps and your own data. There are a range of maps available so you pick the right map; you can zoom in and out and find the appropriate area to focus on. To map the addresses, you input your data either manually or by uploading a spreadsheet, and then you press “start geocoding” to have your records appear on the map. You can change pin colours etc. and calculate the distance between different points. Do have a look and play around with it yourself.
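
The geocoding step that Map Builder automates can be illustrated with the geopy library against the free Nominatim service – a sketch only, since the real tool plots results onto NLS historical map layers, which this does not do:

```python
# Illustrative address-to-coordinates step (requires network access).
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="liveblog-example")
addresses = ["The Mound, Edinburgh", "Calton Hill, Edinburgh"]

for addr in addresses:
    loc = geolocator.geocode(addr)      # address string -> latitude/longitude
    if loc is not None:
        print(addr, loc.latitude, loc.longitude)
```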

The visual aspect is a very simple and clear way to explore your subject, and the visual element is particularly good for non specialist audiences, but it also helps you spot trends and patterns you may not have noticed before. So, looking at maps of my data from 1903, 1911, 1921 and 1931: the maps visualise the location of offences, and for example it was clear from the maps that the location changed over time, particularly the move from the Old Town to the New Town. In 1903 offences are spread across the city. In 1911 there are many more offences, particularly around the Mound. In 1921 the move to the New Town is further evident. By 1931 the New Town shift is more evident still, with some on Calton Hill too.

The visual patterns tell us a lot, in the context of the research, about the social geography of Edinburgh. The Old Town is often seen as a working class area and the New Town as a middle class area. Prostitution appears to move towards the centre, but that is also where the shopping areas and the tourist areas are – this tells us there is more work there. The women keep being arrested there but that does not deter them: small fines and prison spells did not deter. Entertainment locations were more important than policing policies. You can see that a project that is not necessarily about geography has benefitted from that spatial analysis aspect.

If you have spatial information in your own research then do have a look at Edinburgh Map Builder. If you have data for elsewhere in the UK you can use Digimap, which includes both contemporary and historical maps; there are workshops at Edinburgh University, and the website is on the bottom there. That's UK-wide. And a new thing I've been playing with is HistoryPin – this uses historical photography. You can set up profiles, pictures, pins, etc. and plot these according to location. You can plot particular events, from your computer or smartphone, and look at historical images and data. So I have been plotting prostitution related locations such as the Kosmo Club and the coffee stalls on The Mound. You can add your data and plot it on the map. It is a very easy to use site, and for public engagement this is a great tool.

Q&A

Q) I was quite interested in those visual tools and the linking of events, tying them to geographical places. And there are other ways to visualise social network maps; I wonder how it would be to map those in your work – there must be social connections there. Social network analysis can look very similar… I wanted to know if you have considered that or come across that sort of linkage.
A) I haven’t but that sounds really exciting.

Q) I wanted to ask you about the distribution and policing. If one were to return to the maps: there are some marked differences in the number of offences – arrests? – how much detail did you take out of it? You said the women were going back and were not deterred. In 1911 there are markedly different numbers. But even at the times when there was actually more policing towards the Old Town, the police were just sticking to the main routes. So was the Old Town a lawless zone at that time – police not wanting to venture into dark alleys? And how long does Edinburgh's tolerance zone persist? And it's curious to see that without Leith too! Now the city operates more directly, but perhaps before the amalgamation of the authorities there wasn't such a direct displacement effect?
A) In terms of Leith, it was occurring there. The argument comes from the suggestion that prostitution was informally tolerated in the Old Town… I don't disagree that it happened in the Old Town, but my argument is that it was also happening in the New Town, and measures there don't stop it when they should. And my research also sees the police not always caring, and judges and juries moving for reform rather than harsher sentences. Cafes and ice cream parlours were a cause of concern in Glasgow in 1911, which may impact the figures then. The 1903 records are not complete; it may be an outlier, as the general trend is of decreasing offences over time…

Q) About the visualisation tool: there is a tremendous amount of interest in those maps – are these maps important for research design, for research questions? Or would you wish for a tool with more possibility for contextualisation – for instance statistics from authorities etc. – to interpret your findings? What possibilities are there for researchers to have these tools yield more?
A) The maps are interesting, and they are more appealing, but they need to be used with tables, charts and statistics. If I were just presenting on the work I would have included those other factors. In 1903, for instance, you lose some density when all the dots are in the same place. An interactive tool to handle that would be great.

Comment) What is so attractive about visualisation is speed and efficiency, but that also means there is a risk of concluding too quickly, of not necessarily reflecting the reality of prostitution – the reader may read your map of offences in that way; that will be easy to do, but the methodology can be dull to people and that can mean misunderstandings.
A) absolutely. This needs to be in context.

Q) Could you have layers comparing income against offences etc.? And have you found any projects that are developing something more complex…?
A) The big project is the Edinburgh Atlas; there is a mini conference on hidden histories and geographies of Edinburgh, on mapping crime – it's on the IASH mailing list. There are others doing that.

Q) You talked about women seduced by foreigners in Edinburgh?
A) In Edinburgh there was concern about Italians at ice cream parlours; Brazilians were the concern in Glasgow. And in Edinburgh there was also a German Jewish pimp of concern as well.

Discussion more widely…

Comment) I'm primarily a learning technologist and I spend my life trying to get people to start from the activity they want to undertake, and not starting with the tools. I found it refreshing that you all started with your data and looked for tools with the right affordances. How did you find you were helped with that search for a tool?
Louise) It was human contacts. I saw a lecture from Professor Richard Rodger.
Ally) It was similar for me: I found software through a contact, but found it hard to find what else was out there. It basically came down to Omeka or Drupal, which the web team knew about. But it would have been great to know what was out there, what the differences are, what resources there are. Even looking through DHNow and DH Quarterly there isn't a sense of easily identifying the options for tools. That can be a bit of an issue.
Greg) I used the tools colleagues were using to build my own…
Comment) HCI has the notion of affordances: what a tool easily enables you to do and what else it could enable you to do. Is there something there about describing affordances for the humanities? My sense is that often tools are pitched towards the sciences, and sometimes terminology varies even, so understanding affordances varies.
Ally) Sometimes developing your own tools is good, but even a little knowledge and terminology lets me get better results from these tools; if I come to these tools and these colleagues with no knowledge then I will not have a successful outcome. I want to really explore Omeka so that I feel confident and able with it.

Question) Have the tools changed your research questions or ways of working?
Louise) Not me.
Ally) For me they have. I was introduced to the nineteenth century disability reader digital anthology, and knowing what was possible with the tools changed what I wanted to do with my project – to some degree. The basic aim – I want to know more about late nineteenth century medical history – hasn't changed. But the project has.
Wei Chen) I find the legal documents, Creative Commons licenses etc. most useful; I was able to be involved in the first version of the Chinese Creative Commons license.
Greg) it hasn’t changed my questions but the scale of work possible and how I might explore it has changed for me.

Question) What advice would you give for people thinking about digital tools for research?
Greg) Don't be afraid to just try things out; work out what's possible…
Louise) Do ask for help, do take advantage of courses…

Question) I was struck by the issue of time in your presentations. Have you reflected on the process of the use of time – how to use it creatively and constrain it? And how has that use of time perhaps changed your view of text, of hard copy materials?
Ally) With digital projects you can find the time runs away with you; you should not underestimate the time necessary. But at the same time I would otherwise spend hours and hours leafing through texts to answer a research question. I want to use this tool to reduce the time needed to find the data I need, to access it, to interpret it. This project is about developing this tool to benefit myself and others later. You need to step back and be realistic about what is possible.
Louise) That's part of the issue with digital humanities. My work will be in a traditional book format, but the HistoryPin work, while very engaging, does not count towards a career, towards a job. That's a challenge for digital humanities and for early career researchers – it's why our scholarships are so good.
Wei Chen) And there is the distant versus close reading difference. Close reading still has a role, but distant reading allows us to interrogate that reading, to find that resource, etc.
Greg) Nothing we are doing is unrecognisable as research, but we are able to examine more material, or to do things more quickly. We are not doing everything differently, but using new tools in our work.

Question) Do you think this investment in tools is changing the humanities, as a result of this temporal and labour investment? Ally, you talked about putting off other work…
Ally) Well, I am doing research – you always have to manage many projects at once, and there will be an impact. But I chose the digital path because time and financial limitations changed what was possible. It could have been done another, very expensive way. So I'm not putting off research: I would probably be spending years collating information… instead I am setting something up to facilitate my own research in the future. As for the relationship between distant and close reading – that divide isn't as firm as it appears.
Comment) The superficial view of the digital is happening in teaching. Universities jump on the digitisation bandwagon in a way that changes how humanists are employed, how software is copyrighted and licensed. All these tools help universities save money. One can overreact… but realignments of labour and resources make not so positive inroads…
Ally) It's a huge problem; I have huge concerns about the University's MOOC programme. There was discussion of open access, and of individuals to talk about what these mean…
Louise) Not sure, but I know colleagues are concerned.
Wei Chen) Open access is about economic growth, not hardcore humanist values. Humanist values should be at the core for digital humanists; there will be an increasingly curatorial role for all formats of material.
Comment) About critical engagements…

Question) One of my concerns about this sort of work, and about work in geography, is the ways of making and curating an archive. I was wondering about the length of time an archive is available after a project. There was a BBC project, Save Our Sounds; it finished and the map is no longer accessible… so who looks after and preserves the data?
Greg) I think it's hard to “lose” data; it's about implementation, not methods.
Ally) I think it's about how the digital humanities adopt tools, about reflecting on the project aftermath. When looking into project funding you don't want that tool lost. It's not an issue of methodology or individuals, but it has implications for future archiving.
Comment) Which is why Greg's work in XML matters.
Me) And the use of research data management plans and research data repositories helps ensure planning and curation of data at the outset, and ensures long term access and sustainability.

May 02 2013
 

Today I am blogging from the University of Edinburgh Digital Scholarship Day of Ideas 2, a day long look at research in the digital humanities and social sciences. You can find out more on the event on the Digital HSS website. As usual these are live blog posts so apologies for any spelling errors, typos, etc. And please do leave your comments and corrections here.

Professor Dorothy Miell, head of the College of Humanities and Social Sciences, is introducing the day. Last year we shaped the day around external speakers, but we are well aware that there is such a wealth of work taking place here in Edinburgh, so this year we have reshaped the event to include more input from researchers here in Edinburgh, with break out sessions and discussion time. The event is part of a programme of events in the Digital HSS thread, led by Sian Bayne. The programme includes workshops and a range of other events. Just yesterday a group of us were discussing how to take forward this work: how to help groups gather around applications for grants, developing fora for postgraduates, etc. If you have any ideas please do contact Sian and let her know.

Our first speaker is Tara McPherson who is based in the School of Cinematic Arts at USC in Los Angeles. She is a researcher on cinema and gender. Her new media research concentrates on computation, gender and race as well as new paradigms of publishing and authorship.

Scholarship across scales: humanities research in a networked world – Dr Tara McPherson, School of Cinematic Arts, University of Southern California

We are often told we are living in an era of big data, of large digital data sets and the speed of their expansion. And so much of this work is created by citizens – “vernacular archives” such as Flickr and YouTube – and those spaces are the data for emerging scholars. And we are already further along in understanding how big data and linked data can support scholarship. There is a project called DataONE – Data Observation Network for Earth – a grant funded project for scientists, a grand archive of knowledge. This is the sort of data aggregation Foucault warned us about! But it's not just the scientists: in the humanities we also have huge data sets. The Holocaust testimony video collection is an example of that – we can use it as visual evidence in a way that was previously unavailable to us. Study of expression, of memory, of visual aspects can be explored alongside more traditional ways of exploring those testimonies. And we can begin to ask ourselves what happens when we visualise big data in new ways. If communication is increasingly in forms like video, what are the opportunities for scholarship to take advantage of that new material, the vernaculars, and what does it mean that we can now have interpretation presented in parallel to evidence? Whilst many humanities scholars have been sceptical about the combination of human and machine interpretations, there are rich possibilities for thinking about these not as alternative forms but as a continuum. And we will see shifts in how we collaborate, in sharing the outcomes of our knowledge. Rather than thinking of our outputs as texts, as publications, we also need to think about data sets and software – stuff that exists at multiple levels, from bite size records (metadata that records our work, for instance), to book size, to bigger. And we need to think about how we credit work, how we recognise effort, how we assess that work. How do we reward and assess innovation – how do we do that for research that may not lead to immediate articles but may be much longer, much bigger in scale?

Going back to DataONE, there is a sub project called eBird, a tool to allow birdwatchers to gather data on birds. They are somewhat ahead of the game in thinking about crowdsourced science. Colleagues at Dartmouth are starting to look at crowdsourcing data. My son plays a game that lets you fold proteins and contributes to scientific research. There are examples from Wikipedia, to protein folding, to metadata games, etc., which also challenge traditional publishing. The Shakespeare Quarterly challenges peer review with an open process – an often challenging form of peer review. Gary Hall and colleagues at Goldsmiths are also innovating with open journals. But we also see a change from academic knowledge as something which should be locked away, a move away from the book as fetish object, etc. In the UK we saw JISC fund livingbooksaboutlife.org – open access science curated by humanists and scientists.

And we see information that can be discovered and represented in many ways. We can get hung up on Google or library catalogue search dynamics, but actually searches can be quite different. So for something like Textmap we get an idea of different modes of discovering, browsing and searching the archive – opportunities for academics to reinterpret and reuse data. The opportunity to manipulate and reuse data gives our archive much more fluidity. We can engage on many different registers. You can imagine the Shoah Foundation archive, which I showed earlier, having a K12 interface as well as interfaces for researchers, for publishers etc. Some may be functional interfaces but some may be much more playful, more experimental.

Humanities scholars and artists are helping to design some of these spaces. The tools will not take the form that we need them to, as particular humanities scholars, unless we are part of that process. We often don't think of ourselves as having that role, but we have to shape those ways to communicate our data, to visualise it, etc. Humanities scholars have spent years interpreting text, visual aspects, emotion, embodiment – we are extremely well placed to contribute, to help build better tools, better visualisations etc. There is no automatic fit between the design of the database and the work of the humanities researcher. Data can have inconsistencies, nuances, multiple interpretations; they don't easily fit into a database, but databases can be designed to accommodate that. Mukurtu (www.mukurtu.org) is an ethnographic database and exploration space; the researcher has worked with the world intellectual property association and indigenous groups to record and access data according to their knowledge protocols, reflecting kinship relations and codings of trust. We also have much to learn from experimental interactive design. The Open Ended Group (openendedgroup.com) do large scale digitisation. They have digitised a huge closed Detroit factory and used 3D visualisation. It's for an experimental art space, not a science museum. It's a powerful piece to experience and inhabit, and it explores the grammars of visuality. It's not about literal reinterpretation but creative and immersive exploration.

Another example: Sharon Daniel's database driven documentary from IV drug users in a needle exchange programme in San Francisco – 100 hours of audio to be explored through the interface, a work in Vectors. Vectors is a journal I edit, an experiment on the boundary of humanities research, visual interpretation and screen culture. Can you play an argument like a video game? Can you be immersed in an argument like a film? Another example here is an audio exploration of the largest women's prisons in California, curated to make an argument about our complicity in the rhetoric of imprisonment by the state. The piece has a tree based structure which allows exploration based on where you have been. You can navigate the piece through a variety of themes. You can follow one woman's story through the archive in a variety of ways, and explore incarceration and the paradigms on which it depends. The piece is quite different to a typical journal article – it will be different every time – which raises interesting questions for the assessment of scholarship. It's fairly typical of what else is in the archive. We pair scholars with minimal or no programming experience with design and programming staff in the lab. A fantastic co-creative process, but not scalable, especially as many of these pieces are in Flash. But we have identified many research questions and areas for exploration here.

I work in a cinema school, looking at visual cultures. We found we needed tools; we didn't want to build tools, but the scholarly interpretation needed by our scholars does not fit into existing rigid structures. Since we began to work in this area we've moved to thinking about the potential around vernacular knowledge: collaboration with the Shoah Foundation, temporal and geographical maps from Hypercities that let you explore materials in space and time. And from those partnerships we have formed a group, the Alliance for Networking Visual Culture (scalar.usc.edu/anvc), funded by Carnegie Mellon(?), with partners from the Internet Archive, the Shoah Foundation, traditional humanities research centres, design partners, and 8 university presses, to explore non-traditional scholarly publications – and those presses have committed to publishing these born digital scholarly materials. And you can begin to think about scholarship across scales, with new combinations, ways to draw in the archives. Traditionally humanities scholars have had a vampiric relationship with the archive! We can imagine that in the world of Linked Data the round tripping of our scholarly knowledge back to the archive might become quicker and more effective. So we've been building a prototype… this is a born digital book about YouTube by a media scholar, which takes the form of YouTube. It's an open access book, but peer reviewed in the same way as any other. So we have built a platform called “Scalar”, a publishing platform for scholars who use visual materials. Anyone can log in, play with the software, try to create and engage with it. It's connected to archives – partners, YouTube, Vimeo, etc. – and particularly to Critical Commons, an archive that includes some commercial materials (under US copyright law) and also links to the metadata around that material. And it lets you create structures that allow you to take multiple paths through materials and data, more like a scholarly form but not necessarily in linear routes. So, for example, “We are all children of Algeria” by Nicholas Mirzoeff: he had a book coming out in print, but when it was submitted the Arab Spring took place and was very relevant to the book, so he created a companion piece. As you build a piece in Scalar a number of visualisations are generated on the fly to show you data on the content of the book: a visual table of contents, metadata, the paths, etc. Another recent project, “The Nicest Kids in Town” – on American Bandstand – includes video that couldn't be in the book. Also Diana Taylor and the Hemispheric Institute…

Henry Jenkins and colleagues' interactive book on digital cultures. Third World Majority, an activist archive with scholarly expert pathways through it, blurring the boundary between an edited collection and an archival collection. And The Knotted Line blurs public humanities and public curation: it explores incarceration in the US and is built on the Scalar API with its own, quite tactile, interface.

These tools allow us to explore the outputs of scholarly research in different ways, and the relationship to evidence, but also to think about teaching differently. See the programme in humanities and media studies, at the intersection of theory and practice, where students must "make" a dissertation rather than write one. See also Rethinking Learning – a series of cards and materials from which students could create peer-to-peer learning; it is also a dissertation. The author, Jeff Watson, will be in a tenure-track role in Canada in the fall. Susana Ruiz has created a dissertation prototype which is a model of learning around games and video archives. Both of these projects look at new possibilities for teaching and learning.

We are building tools here for humanities scholars, not "digital" humanities scholars. We build upon rich traditions of scholarly citation and annotation. Our evidence can live side by side with the analysis, which increases the potential rigour of scholarship – the reader has far more opportunity to question or assess those arguments. And the user/reader has an opportunity to remix. This isn't about watering down our scholarship or making it ritzy; rather it is about making our scholarship flexible to an ever changing world and accessible in new ways.

Q&A

Q1 – Richard Coyne, Architecture & ECA) You raised the question of citation and academic and scholarly practices. Visual materials can be difficult to cite in that way.

A) We tried stuff out here. A Flash project is really hard to quote; accessing a specific audio file in Sharon Daniel's work is really challenging. But in Scalar each object has a unique identifier and URI, you can export as XML and PDF, and you can use the API. It's a traditional relational database with quite an idiosyncratic semantic layer on top, so you can build interesting stuff because of that combination.
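
To make that concrete, here is a minimal sketch of what pulling structured data out of a Scalar book might look like. The URL pattern, the .rdfjson suffix and the book address are assumptions for illustration – check Scalar's own documentation for the real export and API endpoints.

```python
# A hedged sketch: fetch one Scalar page's RDF serialised as JSON and list
# its properties. The URL pattern and suffix are assumptions, not Scalar's
# documented API.
import requests

BOOK_URL = "https://scalar.usc.edu/works/some-book"  # hypothetical book

def fetch_page_metadata(slug):
    """Request a page's RDF-as-JSON export and print its properties."""
    url = f"{BOOK_URL}/{slug}.rdfjson"  # assumed export suffix
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()  # RDF-JSON shape: {node URI: {predicate: [values]}}
    for node_uri, properties in data.items():
        print(node_uri)
        for predicate, values in properties.items():
            for value in values:
                print(f"  {predicate}: {value.get('value')}")

fetch_page_metadata("index")
```

The point of the per-object URIs is exactly this: anything in a book can be addressed, harvested and cited as data, not just read.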

Q2) You talked about emotion. There can be excitement around this sort of material, but for some there is a sense of fear around knowing how to engage, particularly when incorporating it into our own curricula and research. We can be quite traditional when we return to our desks. Any simple on-ramps to get through the fear barrier?

A2) It's been a slog, even at USC, dealing with visual rhetorics and argument. We have an institute in visual literacy for practice-based PhD and interactive undergraduate and postgraduate programmes. We have guidelines and rubrics developed there for multimedia work and assessment, and those have been useful rubrics for other schools in the university. At university level, for tenure and promotion committees, we have created criteria for assessing digital scholarship – the different ways to evaluate that work. The issue is less the form of the work than actually assessing the contribution of such a wide range of collaborators with very different skills. We have borrowed from the sciences but that's not a simple mapping; there are issues. We have had only four digital media PhDs completed so far but all have gone on to good things. Visual temporality has traditions that it can draw upon… it will be an unevenly distributed move for the next 10 years or so at least.

Q3 – Clara O'Shea, School of Education) On the engagement with the living archive, and the role of the scholar in that – what are the ethical implications? And in what ways is your work changing the way scholars assess their own work?

A3) I'm just starting to look at assessing the role of the digital archive and its radical shift in purpose from the traditional archive. The library is about access; the archive is there to preserve. Digitally that split isn't as relevant. Ethically it is very tricky, though. The Shoah Foundation, set up by Steven Spielberg, recorded materials long before the web. Now, the interviewees did sign away their rights to the materials, but we have been working with the board of the Shoah Foundation on what is and is not appropriate to do with them. There are projects for kids to remix video – so we have developed an ethical editing guideline for those students. At Dartmouth, with that metadata game, there has been a need to really think about the ethical and quality implications – exploring by layer, the difference between "expert" and crowdsourced, is one way that has been handled. In terms of scholars, it changes the relationship to evidence and to scholars' own work. So, back to the Shoah material: they have a policy of not providing transcripts, as they want researchers to actually watch the video, to understand hesitancy and emotion. They have had scholars who got students to make transcripts for them and analysed those; the Shoah Foundation queried the analysis and whether the scholars had seen the films. When those scholars actually watched the films their experience and analysis was quite different.

I was trained as a feminist film scholar at a time when it was hard to find the films. I had read about the films before seeing them, often long before, and you could be left wondering whether what the scholar you had read had seen was the same thing. Having the evidence there changes that, gives you a more direct relationship. Also writing small sections of arguments, writing more modularly – that is what you start to do rather than the long-form structures we are used to, and that can be really appropriate for humanities scholars in some areas.

And now many thank yous, and on to breakouts. I am going to Breakout 2, chaired by Professor Robin Williams:
I will be talking about a project from the last three years looking at electronic literature as a model for creative innovation and practice. It's mainly about networked communities of data analysts and practitioners. I was looking at ideas, concepts and new ontologies – of creativity in particular – and focusing on co-creation and collaboration. I say that is novel but really it isn't: co-creation and collaboration pre-date the digital era, pre-date publishing, in craftsmanship traditions. I was looking at both amateur and professional artists and practitioners, in transnational, transcultural contexts: how we use the internet to create, say, art. So this is about exploring process, creativity, community, these sorts of aspects.
We came across the idea of creativity as a social ontology: creativity as "an activity of exchange that enables (creates) people and communities" (Simon Biggs). You need interaction in the making process in this sort of ontology. In the communities I engaged with, creativity was a subsequent activity of the collaborative community; they were interested in the making process rather than the objects of the making. Ethnographically I took a postmodern multi-sited approach as a framework: follow the community; follow the artefact; follow the metaphor; follow the story; follow the life; follow the conflict; and I added the idea of follow the line (follow the rhizome). The communities are dynamic, changing; they move in different directions. The same with the voices – how many are there within those communities? The fieldwork was very nomadic, both offline and online. I started following one community, then found many others connected. I followed online but also offline (within Europe). I looked at a network physically based in London; other communities started in New Zealand and moved to Germany, Italy, etc., and their online presences moved beyond this.
I was looking at the idea of a "creative land" sat between place, artefact and practice. The practices are connected through a community of bodies that make these assemblages happen. I drew on the theoretical approach of (?) on creative lands. I didn't just look at the creation of objects but also the creation of communities, looking at creativity as synergy and assemblage. So I looked at Furtherfield.org, probably the largest digital arts community in Europe. They have an offline gallery in London where I undertook fieldwork from January 2011, and this is still ongoing. The name comes from the idea of being further than the leftfield; their basis is political, rooted in the politics of the late 1970s, but also in criticism of the commercialism of the Young British Artists and Saatchi's influence on the arts. I looked at the daily activities, how they communicated their activities, and it is very equally distributed, not hierarchical. For example one co-founder, Marc Garrett, talked of the community as "the medium" for this work. The artists involved could come from sound, to network art, to cyberperformance – quite an open approach by Furtherfield. They have created the idea of DIWO – Do It With Others – for the making of art and artistic practice. This is defined on their website and clearly requires social interaction and collaboration as part of the work; it is about heterarchy. The DIWO ethos is about contemporary forms of collaboration, an open and political praxis, about peer-to-peer processes for learning, sharing and making knowledge. And there is the idea of media art ecologies – based on Bucht(?), who believes in a continuum of humans and environment, and on Gregory Bateson, who talked about ecologies of mind: multifunctional, with different ideas and cultures coming together to make an assemblage.
The particular projects using digital platforms tend to focus on social change, particularly environmental change. And there is a project called "make-shift": two groups, one distributed around the world, one in Exeter. They stage cyberperformances, and they have an open source "App Space" performance space for video, materials, tweets, etc. This is one kind of process, of use of ideas. The artists have particular materials for performance, including facilities that allow multiple audiences, multiple mixing, multiple points of access to be part of the performance. Another performance brings in comments from Facebook, as well as one performer's belongings from the last 5 years, juxtaposing these with other forms of collection.
Another project is Read/Write Reality, part of the work of Art is Open Source. Their idea is creating academies of knowledge: they share the knowledge of how to use open source tools to make art. One project of Art is Open Source uses ubiquitous reality movies with WordPress. Their work is about co-creation and collaboration. I am also looking at AOS: Ubiquitous Pompeii through autoethnographic processes. This works with high school children in Pompeii, designing and imagining possibilities to see the city in different ways, and co-creating and remixing material with schools – using ubiquitous technology to co-create cities. It is still about peer-to-peer processes, about co-design… we are seeing the process of working together. The largest and best known project of Art is Open Source is La Cura – the call for a cure for a brain tumour, sharing medical information, scans, etc. openly on the web.
Q&A
Q1) We have a project on open source and film – how do people engaging in these works actually make money from them?
A1) Furtherfield are using crowdfunding, education projects etc. to keep running. Art is Open Source runs educational and other projects and provides funding to make some of these projects happen.
Q2) You write in scholarly journals etc. Did the keynote give you thoughts about how the projects you look at might be written up in new ways?
A2) Yes, I think one thing that is interesting is the idea of being open source, but I would also like to see collaborative writing. The monograph is all about me. I would like to see multi-voice texts and will look at this for sure.

Copyright, authorship and ownership in digital co-creative practices – Dr Smita Kheria

My work arose from Penny's previous project, and some of the participants will be common to Penny's presentation just now. My research interest is in exploring the norms of collaborative practices so far as copyright and ownership are concerned. I am a copyright lawyer and I am interested in how authors relate to copyright law in their practices. Copyright law poses two problems: firstly, how it conceives of authorship and how the author is credited; and secondly, how collaborative authors are perceived and how that works in practice, particularly in emerging collaborative processes online.

So, just to ensure we are all in the same place: copyright protects the work, and it must be an original work – there must be some originality, some effort, skill and judgement. Usually the author is the first owner: they are the copyright holder and have the economic rights. In collaborative work there are particular assumptions. In co-authorship – for example distinct chapters in a book – each author has the rights for their contribution. Where a joint author is perceived – a collaborative authorship – all contributors have rights, but there are no distinctions within the concept of a joint author. And that has implications for the perception of authorship.

Last year Penny and I worked on a six-month AHRC project looking at the creation and publication of the "Digital Manual", exploring authority, authorship and voice through interviews and focus groups. Participants were working with open source mechanisms. We asked participants – and creators – what the role and meaning of collaborative authorship was for them, what they felt about this, the rules of attribution, etc. And we found no set rules here, just some ideas of how they should perceive authorship, with some commonalities across all four communities – which included make-shift (from UpStage) and Art is Open Source. What they created was built in real time, changing regularly, grounded heavily in collaboration. In the first case study, on Art is Open Source, we saw a very hands-off approach to authorship and ownership. They are a network; they provide open source platforms and software, and also ran a fake competition in the project we were looking at. They were clear about the ownership of the platform and the software – open source and GPL-licensed. But as authors they wanted to disappear: they don't want control, and do not mind what others do with the material they have created. For instance, with a book which came out of the project, they felt the publishers had forced them to be named on the cover. They did take responsibility for the process, but didn't want to police what was made with what they made available. They felt attribution was important – generally important – but they were not concerned about attribution of their own work.

This was very different to Sauti ya Wakulima, a collaborative knowledge base created with a group of farmers in Tanzania who share materials gathered via smartphone. There is an ongoing community around farming practices, climate change, etc. The person who set up this project took a very active role in the content created and in the platform. He spoke to the farmers about the licensing of content, which was made available under Creative Commons. His own perception of authorship was different: he did see himself as the author of the software, although he talks about using others' materials and code. He was the author but, as he put it, "not everything came from my own mind".

Looking at UpStage, from make-shift: the platform is totally open. But what about the performances? Well, they left that to the performers. There was no licence fee payment option within the platform, for instance. Performance organisers used the term "brokers" of collaborative performances in the space, but when asked about a performance – the capture of the performance, for instance – they conceived of themselves as authors. They wanted to dissociate themselves from notions of authorship, but that was very much their own perception. And there was ambiguity about the images contributed around performances as well.

And the final case study was FLOSS Manuals – a collection of manuals on free and open source software. It is entirely open and editable, a collaborative publishing platform with a lot of manuals. When editing videos we had taken in this work I actually used one of their manuals myself. The platform is open, but what about the content? The platform owners take a very active role in the content. They have clear licensing, using the GPL: anyone can publish, sell or reuse content. Within the community creating the manuals there was no consensus – the licence was imposed by the platform owners. And the creative community here radically expanded attribution: anyone who had done anything at all (a single letter, a font face, etc.) was credited. There was some uncertainty when we spoke to them, as the community was unsure about attribution and licensing.

This was a small study, but it is clear that collaboration and co-creation have huge implications for perceptions of authorship and huge relevance for copyright law.

Q&A

Q1 – Ewan Klein, Informatics) A comment more than a question: the GPL does not let you do whatever you like. But do you think that Creative Commons would have provided a trail of attribution in the right way?

A1) Yes, Creative Commons would allow that, but not all of those we spoke to had the same feeling about attribution – about how work should be attributed and whether it should be attributed at all. And under the law some contributions may not be a copyright work (e.g. one line in a manual); here attribution and copyright ownership would be split. Do you attribute the collective or the individuals? The farmers went for collective attribution… that solves the problem but not the issue of who should be attributed.

Q2 – Chris Speed) There is something here to do with reciprocity. In terms of the commons – in common land there are implicit models of not grazing all your sheep… could that translate to copyright?

A2) Reciprocity did come up as a suggestion on the basis of which attribution could be made. But how do you assess reciprocity? This comes back to Robin's question of funding. All of these projects were started by grants, and thereafter funded by second jobs, projects, PhDs and voluntary contributors. So if people come in voluntarily, is attribution the least you can do (e.g. FLOSS)? Or, if you get a performance out of it, is that reciprocity enough? Now, these were very different projects and that does need bearing in mind, but those differences were interesting.

Simon: There is a model of attribution in open source software. In open source films we see this work at first, but it falls apart when it gets to the interface between enthusiasm-driven creation and longer-term sustainability.

Penny: FLOSS is an interesting one. This is sort of a benevolent dictator model, though he was reluctant to be involved. They do not have money and are looking in different directions… This open source, almost utopian community has realised that it needs funding to continue.

Smita: And they had an issue: they could publish those manuals, but so could anyone else. It would be good to go back in a year's time to see what has happened.

— And a break whilst I spoke at the Scottish Crucible —

“It’s a computer m’lord”: law and regulation for the digital economy – Prof Burkhard Schafer

I have come in a little late here, but Burkhard is talking about new forms of data, such as monitoring data on older people – collected for monitoring their health but with potential ethical and legal concerns. What if you use technology to help people with their memory and it raises legal issues? What if it leads to a criminal investigation? New forms of data collection invalidate traditional metaphors, traditional divisions of law.

I am based at the law school, notoriously the scene of a crime – the body snatchers of Edinburgh. The law tried to manage the supply side, and that led to other forms of regulation:

Regulation through architecture (Larry Lessig) – they restricted access, built fencing around graves, and patented thick metal coffins that allowed the decomposition to be observed before burial, to deter body snatchers. I call this DRM (Death Risk Management!). But this does relate to the loss of things that are precious. There was a case of a father who gave his daughter, who was dying of cancer, a phone with an unlimited voicemail box. But the phone was in her name, and when she died the messages were deleted. He took legal action, but this is not an easy case.

Whose assets are they, and whose privacy is at stake? What happens to digital artefacts after death? This is complex. This work is part of a multidisciplinary research project – not just informaticians and lawyers but anthropologists, sociologists, etc. We came up with radical suggestions far from those of the judges. For instance the "Dead Man's Switch" – a way to wipe your hard drive and remove embarrassing stuff on your death. There were joke companies promising to look after pets in the case of the Rapture, to ensure your pets were taken care of by good atheists. But there are serious questions about such a service… about legal liability when taking action on behalf of a dead person.

What about disintermediation? When the body snatchers were banned, some cut out the middle man – killing for bodies rather than digging them up. But could it happen again? Well, child trafficking and sex abuse prey on the naïve in some of the same ways. We work in this area, looking at ways to understand the roles of social workers, teachers and police so that they can extract the information they need to evidence a case without breaching data protection law or compromising privacy. This is one of our more technical projects, around encryption. And it includes consideration of risk to informants – what can be shared and how – to make sure that necessary data is shared without exposing those in responsible roles as informers on their clients or communities.

Robots bring deep-seated problems. They will be something more than machines; they change how we think about and interact with technology. To give an example: is it appropriate, legally and ethically, to give someone suffering with Alzheimer's a robot that speaks like her husband, even if it comforts her? It may be justifiable emotionally, but it is a massive deception. Similarly, is it ethical to have robots looking like people – should that be another law of robotics?

Meanwhile we have SenseCam devices that automatically take images throughout the wearer's day. Alzheimer's patients have been given these and work through the images with their support worker – going through their day, remembering what they have done – and this seems to have benefits for memory retrieval. (They use these devices on dogs too, for more fun purposes.) Legally… well, in galleries, theatres and cinemas photography is banned, but should there be an overriding right to take pictures? In Germany public buildings are copyrighted and images cannot be taken. We let guide dogs go where other dogs cannot; maybe this is a similar justification.

And a final example: David Valentine records his performances "Duellists" and "The Commercial" in public space, then makes demands on the council for the CCTV footage of his performances, asserting his performers' rights. Legally, in the UK, this is complex!

Q&A

Q1 – Jen Ross, School of Education) With the recent release of Google Glass some restaurants and businesses banned it, and I'm wondering about the social response to, and impact of, these technologies.

A) Google "St Patrick's Day Google Glass" for an amusing example. One of the concerns I have is that these devices are being designed in health and medical settings but are being marketed for lifelogging. This is sort of a trojan horse for changing privacy laws and expectations. "Private" has its origins in the Latin for robbing time from others; we expect to be able to be alone. It's fine if we are OK with having images taken, etc. But without the ability to be alone – if privacy is a public good, not just a private good – then we may not want people to give it up so easily. It becomes very complicated: lots of frivolous uses trying to win public acceptance on the back of essentially medical technologies.

Q2) I worked on a project with Charles Raab on data sharing. A thing I found in that context is that once you've released data into that space… you've talked about the advocacy role of the social worker… but once data is released, how do you retrench into your social role?

A2) It's not surprising that in cases of child abuse the evidence was there but had not been shared. Rules have been changed but it still doesn't work; people find a way around them. If I don't trust the recording mechanism I don't share the data; if I'm concerned about the use of my data then I don't write things down any longer. The evidence we've found from the social scientists and the political scientists is that technology alone doesn't change that. In our approach people respond to specific requests rather than dumping all their data – otherwise they simply won't comply, in all manner of creative ways. And it's a distributed system rather than a centralised one, for the same reason.

Letting your digits do the walking: on the road with Ben Jonson, 1618 & 2013 – Prof James Loxley and Dr Anna Groundwater

We are at the beginning of our digital journey in comparison to others who have been talking today. I will tell you a bit about the manuscript we are looking at, its significance, and the journey we think it could take us on. In 1618 Ben Jonson walked from London to Edinburgh on foot – an extended walk for which there was no detailed evidence until James Loxley came across an account by a walking companion: a treasure trove of primary evidence for researchers, and a window into life along the Great North Road. So I will talk a bit about how we can recreate that world, and understand it using primary and digital resources.

My experience of digital online resources was that of a beginner: I physically dug around in regional and national archives along the Great North Road. Digital catalogues have really helped me to do this; they have allowed me to achieve much more, and in a much more cost-effective manner. Tools like EEBO have helped me speed up the collation of materials online, to gather biographical information alongside literary texts. Most apposite here is EDINA's Digimap – I've been using it on a daily basis as a way to reinterpret and consider networks and social spaces in early modern Britain.

And the literature allows us to understand social spaces and social practices. We can look at practices of hospitality at that time – the experience Jonson was having. Welbeck Abbey, for instance, is discussed in the manuscript, with specific descriptions of taking over the house from Sir William. There is also mention of Mr Bonner, the Sheriff in Newcastle. Some of this text we have been able to verify, and we have been able to use the OED to understand some of the terminology, e.g. hullock, a wine for very important people.

The texts also provide a history of cultural interests – the antiquarianism of tourism and travel; the places visited; the castles, buildings and grand houses along the way; and the route taken, from Belvoir Castle through to Pettifour Well in Kinghorn. Edinburgh Castle, for instance, was one of his stops. We can use art and images of that era to recreate that voyage. We can physically make these journeys, but we can make them digitally too – the digital journey remaking the mental and physical connections of that historical journey.

Over to James: I will touch on the dimensions of the project which have emerged as we have been going along – dimensions of which we have become aware. This was a digital project right from the start. Since we began talking about the project and the manuscript, many have asked how the manuscript came to light and why this has happened now. The story is a disappointing one. In fact it involved me sitting down to consider the potential of a digitised set of catalogues, produced by the National Archives, which catalogue archives around the UK, in a project called Access to Archives. This allowed discovery of collections and the structure of collections. I was looking through the materials and how they worked, and I was able to find a literary manuscript and where it sat in its collection… it seemed to refer to Ben Jonson, but the spelling was such that no one searching would have found it. So there was no rummaging in archive attics. But we have been further exploring the digital dimensions.

Because we have a journey here – because it is not like Boswell's account of Samuel Johnson but is instead a list of people, places, food, etc. – we can see dimensions that are not classically those a literary scholar is looking for: what we might see as a quantifiable text. For instance, the account records the time a leg of the journey began, the time of arrival, and the locations. From a distance of 9.5 miles covered in 3 hours we can work out the walking pace: Jonson seems to be at about 3.17 mph (the modern human average is about 3.3 mph). An interesting one, since Jonson in his own notes says he is around 20 stone – maybe something is not quite right there?
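
The pace arithmetic here is simple to reproduce. A minimal sketch – only the 9.5-mile, 3-hour figure comes from the talk; the leg label is invented for illustration:

```python
# Walking pace = distance / time for each recorded leg of the journey.
legs = [
    {"leg": "a morning's walking (example from the talk)", "miles": 9.5, "hours": 3.0},
]

for leg in legs:
    pace = leg["miles"] / leg["hours"]
    print(f'{leg["leg"]}: {pace:.2f} mph')
# 9.5 miles in 3 hours gives 3.17 mph, just under the modern average of ~3.3 mph.
```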

We don't know who wrote the account; we have candidates, but the companion is still anonymous. We can work out the height of the companion using surviving architectural drawings of a venue visited: he is about 5'5"!

We are inevitably working with small data here. Having places, times, distances, speeds, etc. allows us to visualise the journey in ways we maybe could not before – a manifestation beyond the annotated text. We've initially been exploring that in terms of a map (see blogs.hss.ed.ac.uk/ben-jonsons-walk). This initial map on our website gives a sense of the places visited (via map pins), and on those pins we include the time they were there plus notes, which are growing as metadata (excellent sweet water at York!). This is a starting point to begin to map out the data the walk has presented us with – really at a "rehearsal" stage. There is a performative aspect to this walk: Jonson is greeted by crowds, by property owners, etc. along the way. People have told us that we must reenact the walk! So we are doing a virtual walk: from 8th July "Ben" will tweet the walk in real time on Twitter, and that will be linked into the map and the information on the blog site – an interaction between those channels. Hopefully Ben will get into conversations as he is on his way; that's part of what we'd like to do!

We are already thinking about the possibilities of expanding this for future projects. There is an example called Mapping the Lakes: a team at Lancaster University tracked Thomas Gray's and Coleridge's journeys around the Lakes, created with a GIS to visualise the walks. They have mapped obvious markers but have also tried to map more subjective things, such as the mood of the walk, and you can look at the walks separately or together. That seems a way of thinking about the literary journey that we would like to develop for ourselves, beyond the map we are "performing" this summer… There is clearly an interplay between sites and routes; some are easier to map and work out than others. In some places there was a guide to take them on their way – making the obvious route very hard to establish. We are thinking also about how the mapping of the journey could bring in different possibilities: views, prospects, the meaning of sites, etc. We haven't represented those on the map but we would love to, particularly to compare their walk to modern walks. How do different models of the walk undertaken "for the sake of it" compare? And how can we take that walk, preserve that experience, feed in other materials, etc.? We hope to approach the AHRC for follow-on funding, and we would love to talk to anyone interested in the spatiality of walking who might be interested in engaging.

Q&A

Q1) A connection: Joseph DeLappe(?), an artist in the US, recreated Gandhi's walk using a treadmill hooked up to a Second Life avatar and reproduced the walk there… a possible digital precursor.

A1) An interesting possibility – we could perhaps get gradients in. There are analogues or comparators out there to explore. There is a deepening and intensifying interest in the process and practice of walking, and in how that carries with it expectations and kinds of appropriate representational modelling – doing some justice to spatiality without assuming a single model is all that we need… we need to weave different senses of the spatial within the literary walk.

Q2 – Rocio) A comment on the idea of the walk: you could make it a collective walk – ask people in the surrounding areas to do a bit of it, make it interactive, and have them add their part of the journey, if you can't do it yourselves.

A) That is exactly what we hope to do. We want to bring in local history societies, walking groups, etc. on the old roads and feed that in.

Old light on new media: medieval practices in the digital age – Dr Eyal Poleg

We are working on a project called Manuscript Studies in an Interoperable Digital Environment, funded by the Mellon Foundation. We have found interesting parallels between the reading of medieval manuscripts and digital practices; perhaps we can learn from medieval practices when thinking about developing digital ones. In many ways printed books are an interim step between practices we see across old and new media.

Let's start with hypertext. Hypertext is very common in medieval manuscripts, particularly in the Bible. The problem with the New Testament is the Gospels: how do you jump from one parallel passage to another? You can explore a version at the University of Toronto, for instance. In the manuscript era we get the Eusebian canons: in the margins of each episode sit the Eusebian canon numbers, and you use the canon tables to jump from one passage to another – very similar to clicking on a link. This starts something new in exploring the text.
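
In modern terms the canon tables behave very much like a link table: each Gospel passage carries a section number pointing into a table of parallels, and following it is the medieval equivalent of clicking a link. A minimal sketch of that structure – the sample row is illustrative, not real canon data:

```python
# Eusebian canon tables as a cross-reference data structure: each row lists
# parallel passages; a passage "links" to the other passages in its row.
canon_tables = {
    "Canon I (all four Gospels)": [
        ("Matthew 3:13", "Mark 1:9", "Luke 3:21", "John 1:32"),  # illustrative row
    ],
}

def parallels(passage):
    """Follow the 'link' from one passage to its parallels."""
    for canon, rows in canon_tables.items():
        for row in rows:
            if passage in row:
                return [p for p in row if p != passage]
    return []

print(parallels("Mark 1:9"))  # -> the parallel passages in the other Gospels
```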

In the 12th century there is a beautiful text in France. It is a working manuscript: it has physical cut and paste. It shows the authors wrestling with technology, experimenting with navigating the text, inventing references. And they tie that to the "late medieval bible" – the Gutenberg Bible is a replication of one of these bibles. The innovation of these bibles is evident in the chapter divisions: previously there were no divisions in the text. From 1230 onwards, with the help of Stephen Langton, the Archbishop of Canterbury, we have the chapter divisions, and we begin to get book-and-chapter references. This fits the mindset of Christian exegetes at the time, of the linkages within the Bible. And this linking took off like wildfire – the most efficient way to link and navigate. When we think about hypertext in the medieval period we also have to think of the web of allusions that readers carried. When reading a text, for example a psalter, there is an interaction of text, image and sound; for monks, reading the text created a world of allusion. Using digital technology we can replicate that to an extent – by adding the musical strata of the text, intricate links that evoke the memory of the men and women who would read these texts.

The wiki is a structure we also see in medieval texts. Even now the interaction one has with a printed book is limited. In the Middle Ages books were different: they were communal objects, even for the monks. Annotations were seen to add value to the text – a communal project of reading. You can read generations of commentators through the margins of the text. The way this took place – and this is worth considering – is by giving ample space to interact, to comment on the text: space deliberately left, interlinear and marginal glosses, spaces for comments and annotation. You can see the different hands, texts and monks reflected in the communal commenting on the text, and you see some commentators responding to each other. In one manuscript in Glasgow an O character has been vandalised; a later reader found this offensive and erased it for future readers… so how much interaction, erasure and change to the text do we allow readers? That would have been a nice image…

There is also a sort of open code emerging in manuscripts. A printed book is not that open, but looking across copies of the same manuscript we see differences – some are errors, some changes by the scribe. In the medieval era the scribe assumed the text could be faulty and tried to correct it; the text was in flux. Scholars use this to reconstruct the text, and we can also explore connections between one manuscript and another. But of course, what is a text? What is a changed text? What is a fixed text?

And finally we have non-linear texts, which can now be created in digital environments: not necessarily beginning, middle and end; navigation can be very different. For instance a medieval teaching manual uses images and associated ideas to explore, but these are non-linear – the images point us in directions within the text. And this ties into a late medieval aesthetic vision of allusions, the idea of a network of allusions.

Q&A

Q1) This is a fascinating talk; there are several very orchestrated ways to explore medieval manuscripts that this relates to. You touch on websites reflecting print books, not necessarily taking advantage of the multimodal opportunities of the web.

A1) That was the starting point of the project. Mellon saw medieval manuscripts increasingly being digitised, but people were using them as printed texts, and it wanted to look at new ways of working. So for instance you can see the Summarium, a prototype that uses TEI to annotate a non-linear version of the texts in a communal way.

Q2) Is there a connection between the idea of hypertext in medieval texts and the role of the church as an information system? There have been times when the physical church acted as an information system for state information etc. – I'm not sure if that is true of the medieval era.

A2) In the Middle Ages, unlike in the Reformation, this is less about enforcement and more about the reality of texts. You live the texts. Monks especially live and breathe the text and information: you wake and pray seven times a day, you are surrounded by images, you are embedded within the textuality.

Q3) Do you find any dilution of the texts in transferring them to digital technologies? I am sure that institutions are very careful about this.

A3) This is not an issue for us. The texts are not of interest to religious institutions today. Very early or very late texts might be an issue, but these are not.

Q4) Have you ever come across work on the reception of Roman law in the Middle Ages in codices? I think the author came to similar conclusions analysing legal texts as hypertext and wikis – a secular model of the same phenomenon.

A4) I wasn't aware of that, but I would be interested to have the references. The manuscript texts were a little behind legal texts, but it would be very interesting to compare.

And now on to the closing from Sian Bayne, saying that it really has been a day of new ideas, very inspiring. And thank yous to the audience and the organisers and of course to all of our speakers.

 

Feb 24 2012
 

Having hotfooted it over from George Square, where I gave my Innovative Learning Week session on social media – more on that in a future blog post – I'm now at the School of Education for a seminar from Dr Melissa Terras, Reader in Electronic Communication at the UCL Department of Information Studies and Co-director of the UCL Centre for Digital Humanities. She'll be talking on "The Virtual Visitor: What Do We Know About Users of Digitised Cultural Heritage?".

You can find out more about the event on the Digital HSS website: http://www.digital.hss.ed.ac.uk/?page_id=493

As usual this is a liveblog, so it may have spelling errors, typos, etc. Just let me know if you have corrections and comments.

Jen Ross is introducing Melissa and mentioning her fantastic talk yesterday. And now over to Melissa:

I'm originally from near here – from Kirkcaldy – so I am delighted to be up here. So… I do a lot of analysis, but over the last few years we have been getting increasing requests from cultural heritage organisations around where we are in Bloomsbury who want to learn about technology and work with us. So I'll be talking about how users use an online collection database, with a bit on log analysis: what we know, what we can't know, and what technologies are available to us at the moment.

So the project I'll be talking about is the British Museum (BM) Collection Online (COL), and particularly the Research area of the site. That area of the website was launched in 2007; by the end of 2009 there were 2 million objects online, with 600k+ images, and by the end of 2011, 800k images. But the BM wanted to know how these were being used, what impact this was having.

So we've been doing user studies at UCLDH for some time:

Log Analysis of Internet Resources in the Arts and Humanities – we got permission to look at server logs to see how many people were using AHRC-funded resources (not all sites had very many at the time, but it was a while ago). We had an idea of "if we build it they will come", but we have a better idea now: we know that it isn't that simple.

User Centred Interactive Search with Digital Libraries – led by my colleague Claire Warwick, this looked at the comparison of in-person and online library experiences: functionality and information-seeking environment features.

Virtual Environments for Research in Archaeology – we took a huge long-term dig site used by the University of Reading, set up wifi and provided technology to try to speed up note-taking. Huge fun… technology isn't great in the rain… and struggles in the sunshine!

QRator – this is with KCL and an exhibit of biological specimens. We have iPads around the room and visitors can answer questions there, or on the website, or on Twitter – questions like "is this an ethical thing to display?". And we can analyse those comments – the vast majority are intelligent answers. My PhD student Claire Ross is working on that, and she has a NESTA grant to do the same thing with all 5 sites of the Imperial War Museum.

Linksphere (Claire Ross was a research assistant here) – the intent here was to build a huge integrated system with the University of Reading. But we did have this highly engaged student who wanted stuff to do. She wanted to do user studies but the original project wasn't ready for that, so she asked someone at the BM to let her see their log data… and she worked on that!

We also have work placement students, and one of them, Vera Motyckova, was engaged on this specific project.

We have various PhD students working with the British Library, the British Museum, the Science Museum and the Grant Museum. We are very lucky to work with amazing partners here.

So: all museums and cultural heritage organisations are putting materials online, but we don't know why people come to these sites or what they search for. We wanted to try to find out with the BM website.

So, Methods here…

Quantitative – log analysis, link analysis, analytics (e.g. Google Analytics), surveys

Qualitative – open-ended survey tasks, interviews, focus groups

Toolkit for the Impact of Digitised Scholarly Resources – a fantastic toolkit outlining these methods.

So Google Analytics or web log analysis tells us where visitors come from, which pages they visited, how long they stayed, and the search terms captured from the search within the site. An online survey (SurveyMonkey) allows a more in-depth look. In-depth interviews – we didn't have time for one-to-one interviews, but these would have been very valuable.
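
For a sense of what the log analysis side involves, here is a minimal sketch of extracting a site's internal search terms from a web server access log. The log file name and the searchText query-string parameter are assumptions for illustration, not the BM's actual setup:

```python
# Count internal search terms found in an Apache-style access log.
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

SEARCH_PARAM = "searchText"  # hypothetical parameter name
request_re = re.compile(r'"GET (\S+) HTTP')

terms = Counter()
with open("access.log") as log:  # hypothetical log file
    for line in log:
        match = request_re.search(line)
        if not match:
            continue
        query = parse_qs(urlparse(match.group(1)).query)
        for term in query.get(SEARCH_PARAM, []):
            terms[term.strip().lower()] += 1

for term, count in terms.most_common(10):
    print(count, term)
```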

So, some up-to-date context. The BM had 10.5m visits / 60m pageviews in 2011; 1.2m of those included a visit to the Research section containing the collection online – that's about 11% of visitors, to the serious research section of the site. By comparison there were 5.8m physical visits to the museum (a rise of 4.9%).

The study I'm talking about was a year and a bit ago. We looked at stats from 2009–2010, when there were almost 9 million visits, averaging 6 pages per visitor; 30% were returning visitors, from 230 countries; and there were almost 2 million searches on the site.

The most common search terms are about famous artefacts at the BM. But we are interested in the database – what people are looking for and why.

When you look at log analysis and Google Analytics the problem is it's a bit dry… Some 21% of people come to the collection directly. About 15% come in through other sites – mainly the BM shopping sites. Many visits also came from other sites, indicating a good number of inbound links.

This graph (on screen, showing a major jump) shows access to the BM site via mobile. There is an instant boost in mid October – one week (the SIM card activation period) after O2 launched the iPhone. It's that obvious in the stats. But whilst you can see usage, you don't know about intent or usage behaviour.
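
A jump like that is easy to flag programmatically: compare each week's mobile visit count to the previous week's and report large ratios. The weekly counts below are invented for illustration:

```python
# Flag sudden week-on-week jumps in (hypothetical) mobile visit counts.
weekly_mobile_visits = [310, 295, 320, 305, 940, 1010, 1150]

for week, (prev, curr) in enumerate(
        zip(weekly_mobile_visits, weekly_mobile_visits[1:]), start=2):
    ratio = curr / prev
    if ratio > 2:  # arbitrary threshold for a "sudden jump"
        print(f"Week {week}: visits jumped {ratio:.1f}x ({prev} -> {curr})")
```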

We put a pop-up survey request on the site – it hit 1 in 5 visitors – and it ran from 3rd June 2010 to 2nd July 2010. We got 2.5k respondents, which delighted us. There were 30 main questions – multiple choice, Likert scale and defined tasks.

If we look at the age of respondents, most were 21–30, with a good number between 31 and 50 years old – not much use by under-20s, of course. Some 29% of users were in the UK; of those, 91% were in England, 6% in Scotland and 3% in Wales – none in Northern Ireland. There was also a large group from the US.

When we asked people how they had come to know about BM COL, most said they had heard through a friend or a colleague, and many from an academic website; relatively few had come in via a search engine.

When we asked users what best describes their reason for using COL, 50% were doing academic research, 12% were using it for professional research, and only around 18% for personal interest/fun. And when we looked at academic role it was clear that postgraduate researchers were the major group, along with professors and other students.

We asked users how they wanted to search. Most wanted free text. They also wanted to search by type of object, which is interesting though the metadata doesn't support that yet. Some did want the data organised by museum gallery, but most were not passionate about replicating the offline experience online.

On the whole people were looking for specific objects, some for a group of objects. Unsurprisingly, the "latest research" content was not as well used – these visitors were doing research themselves.

And next… a lesson in phrasing questions well here… we asked what type of object people were coming to the website to see. Some of the print materials are hard to access in the museum, so they are of particular interest.

In terms of how often the collection database was used, something like 30% were using COL for the first time, and another third only occasionally. We really needed to understand that – it reflects the response rate rather than simply who is using the site.

When we asked users what improvements they would want, most wanted more images, many wanted improved search facilities, and some wanted more collections. Some wanted zoomable and 360-degree images – great stuff, but is it realistic to do at this scale? Notably, most users wanted COL to be easy to find and access, and improvements have been made to address that.

Users wanted images in higher resolution, image ordering, fewer steps for retrieving images once logged in, and an option to return to search results after ordering an item.

Users wanted to see more objects – with better information on physical presence/loan status included.

And we asked which social media applications people wanted in COL. 65% said none. Some said Facebook; some said tagging. This was a big "meh!" – the users were not that interested; they were there for serious research.

We asked users if they reuse image-based content elsewhere – some said no, some said yes… we should have asked for more detail here.

We asked people whether they were on the BM's website as a result of a visit to the museum – we asked twice, and people were clear that physical visits and website visits were different things: COL is used for serious research vs. the fun of in-person visits.

And finally we had a task at the end of the survey. In computer science you give people a sample task to find out how they would run a search on your website. So we used the example of a Greek vase. I knew the specific item and couldn't find it… so: "you are searching for a Greek vase which you know is in the BM as you have seen it in a print catalogue. It is an Attic black-figured lekythos depicting the myth of psychostasia, and it has the catalogue number B 639."

So, if you look for B 639, or psychostasia, or lekythos, you get nothing. The 6 people out of 174 who found the vase took the space out of "B 639". The British Museum were surprised at this…

We had a similar task for a painting. In both cases we found that users try to search in Google-type ways – people learn their information strategies in those places, and while we can't be the same, we have to be aware of that.
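
The "B 639" finding points at one concrete fix: normalise catalogue numbers on both the index and the query side, so that "B 639", "B639" and "b639" all hit the same record. A minimal sketch with an invented toy catalogue:

```python
# Forgiving catalogue-number matching: collapse case and internal whitespace.
def normalise(text):
    return "".join(text.split()).lower()

records = {"B 639": "Attic black-figured lekythos"}  # toy catalogue entry
index = {normalise(number): number for number in records}

for query in ["B 639", "B639", "b 639"]:
    number = index.get(normalise(query))
    print(query, "->", records[number] if number else "no match")
```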

One of the interesting analytical things you can do is compare where people say they are with where the logs show them to be. These cross-checks help you see if you have a representative sample. We didn't offer the survey in other languages, though – perhaps we should have. But there is only so much cross-correlation we can do between logs and the survey, and in between there is a space around motivation and behaviour that you can't quite touch – that's where surveys come in…
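
A sketch of that cross-check: compare the country breakdown reported in the survey with the one derived from the logs. All of the figures below are invented for illustration:

```python
# Compare self-reported locations (survey) with logged locations (IP geolocation).
survey = {"UK": 0.29, "US": 0.25, "other": 0.46}  # hypothetical shares
logs = {"UK": 0.31, "US": 0.22, "other": 0.47}

for country in survey:
    gap = survey[country] - logs[country]
    print(f"{country}: survey {survey[country]:.0%}, logs {logs[country]:.0%}, gap {gap:+.0%}")
```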

So, to wrap up: we showed what scholarly perceptions of the BM's information environment are. We found that digital resources are used extensively by academics as part of their research process and are considered vital to their research. We found a preference for visual material. And a real difference between a physical and a virtual visit.

We found that social media was not a priority. But we also found that academics display specific information-seeking behaviour and sophisticated search strategies. They come for known objects, with deep knowledge of the materials – it's a serious and purposeful visit.

We got a better understanding of the search patterns and information-seeking behaviour of a specific user group, and provided a valuable guide for the further development and refinement of the BM COL pages – not many recommendations, but really clear guidance.

And we are going to repeat this work as longitudinal research. We are also involved in work with other people – the National Gallery, the BM and the National Museum of Wales – for a full 2-year longitudinal study of all types of users. The work completed so far has acted as a great pilot – and our masters and PhD students are invaluable here. We've been looking at the National Gallery: they have 2,000 items(?) – far fewer than the BM – but all really, really famous.

And finally, thank yous to Matthew Cock and David Prudames, British Museum; Claire Ross, UCLDH; and Vera Motyckova, ex-student at UCLDH (now at the BBC). And also to @paleofuture for an amazing 1962 comic predicting that researchers might consult materials at the Library of Congress or British Museum remotely – very nice!

Q&A

Q1) How confident are you that 50% of users are researchers – aren't they more likely to fill in surveys?

A1) You're right. This comes down to survey methodology, and you just have to be honest about that in your write-ups. We got a 10% response rate – about right, but what about the others? Even so, seeing that there is real use by researchers is useful – even if only for that 10%. When people cite this material in the literature we don't see references to the online version of an item, just the item itself, so you can't tell from the literature.

Q2) I thought that 35% of users being willing to engage with social media isn't bad.

A2) Yeah, but the demographic is mostly postgrads – Facebook and Twitter users, etc. – so it seems low. And if you are deciding about resources, the money seems better spent on digitised resources than on investment in social media. The National Museum of Wales have millions of items, and users who are no longer in Wales but are doing family and genealogy research – there is a real balance to strike. But the BM work showed that providing as much information as possible is best.

Q4) Did you find out much about unsuccessful visits?

A4) Yes, at the end of the survey there was an "other" box, and they told us about frustrated searches and images they couldn't find – all very useful and important, I think. Most responses were pretty positive. On surveys people can share more negative comments, so it was great to have so much positivity.

Q5) I'm from RCAHMS and we've been working on crowd tagging – could you say something on this…

A5) I did a tiny amount of work on Flickr community tagging, but there is huge interest in this. I have spoken to people who run these virtual museums, and they get millions and millions of hits – it tends to be image-based stuff. But libraries and museums tend not to provide this: enthusiasts do this stuff exhaustively whilst museums don't. There is more to be done here. Projects like the First World War Poetry Digital Archive did a great community-sourced project digitising amateur people's material. The other thing is that amateurs do not need the physical items – they collect digital copies of an image; they want an exhaustive collection. Archives and museums would never do that, and we can do more to look at that relationship, around who owns what and who gets to engage with it. I'm doing a talk on Jeremy Bentham later today – we have one woman who has done almost half the transcribing! She did it to unwind! Tapping into those key users is really interesting. And Pinterest and Tumblr etc. are picked up fast by these enthusiasts – they are very responsive to change. But organisations are getting better at that and being more pragmatic about sharing. The BM has a SPARQL endpoint that people can use to build their own interface on the data. That's the future – making systems robust enough for better access.
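
For anyone curious what "build their own interface on the data" looks like in practice, here is a minimal sketch of querying a SPARQL endpoint from Python. The endpoint URL and the predicate used are assumptions for illustration – check the British Museum's own documentation for the live service details:

```python
# Query a SPARQL endpoint for ten objects and their labels.
import requests

ENDPOINT = "https://collection.britishmuseum.org/sparql"  # assumed URL

query = """
SELECT ?object ?label WHERE {
  ?object <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["object"]["value"], "-", binding["label"]["value"])
```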

Jen: On that note we should let Melissa go on to her next talk.

Q3) I work at the National Galleries of Scotland and we struggle to work out how much data to put with the objects – collection/label information vs. narrative, perhaps.

A3) We found people want to know the size, the details. The story behind the object may not be as important as the basic information; provenance was important for users. But these are different questions for art vs. history.

::: Update: Melissa Terras has allowed the organisers of this session to share the audio recording alongside the slides so I have now embedded these here :::



 

Feb 22 2012
 

Today I am at Digital Scholarship: A Day of Ideas, a day of "talks and discussions for staff and PhD students in HSS, to inspire and share ideas for digital research, teaching and scholarship. An exciting programme of invited speakers working in the field of digital scholarship will present their ideas and their work", taking place at the Business School at the University of Edinburgh. The event has been arranged as part of the excellent Digital HSS programme of activities. The full programme is available here (and I've also linked to the related abstracts in the titles for today's talks below).

::: Update: The videos are live on YouTube here :::

The event is also being webcast and can be viewed here: http://www.digital.hss.ed.ac.uk/?page_id=504

As this is a liveblog the usual caveats apply re: typos, errors, etc. And please do leave me comments and corrections!
