Today I am at the British Library (BL) Labs Roadshow 2016 event in Edinburgh. I’m liveblogging so, as usual, all comments, corrections and additions are very much welcomed.
Introduction – Dr Beatrice Alex, Research Fellow at the School of Informatics, University of Edinburgh
I am delighted to welcome the team from the British Library Labs today; this is one of their roadshows. And today we have a liveblogger (that’s me) and we are encouraging you to tweet to the hashtag #bldigital.
Doing digital research at the British Library – Nora McGregor, Digital Curator at the British Library
Nora is starting with a brief video on the British Library – to a wonderful soundtrack, made from the collections by DJ Yoda. If you read 5 items a day it would take you 80,000 years to get through the collections. One of the oldest things we have in the collection are oracle bones – 3000 years old. Some of the newest items are the UK Web Archive – contemporaneous websites.
Today we are here to talk about the Digital Research Team. We support the curation and use of the BL’s digital collections. And Ben and Mahendra, talking today, are part of our Carnegie Funded digital research labs.
We help researchers by working with those operating at the intersection of academic research, cultural heritage and technology to support new ways of exploring and accessing the BL collections. This is through getting content into digital forms, supporting skills development, including the skills of BL staff.
In terms of getting digital content online we curate collections to be digitised and catalogued. Within digitisation projects we now have a digital curation role dedicated to that project, who can support scholars to get the most out of these projects. For instance we have a Hebrew Manuscripts digitisation project – with over 3000 manuscripts spanning 1000 years digitised. That collection includes rare scrolls and our curator for this project, Adi, has also done things like creating 3D models of artefacts like those scrolls. So these curators really ensure scholars get the most from digitised materials.
You can find this and all of our digitisation projects on our website: http://bl.uk/subjects/digital-scholarship where you can find out about all of our curators and get in touch with them.
We are also supporting different departments to get paper based catalogues into digital form. So we had a project called Collect e-Card. You won’t find this on our website but our cards, which include some in, for instance, Chinese scripts or Urdu, are being crowdsourced so that we can make materials more accessible. Do take a look: http://libcrowds.com/project/urducardcatalogue_d1.
One of the things we initially set up for our staff as a two year programme was a Digital Research Support and Guidance programme. That kicked off in 2012 and we’ve created 19 bespoke one-day courses for staff covering the basics of Digital Scholarship which is delivered on a rolling basis. So far we have delivered 88 courses to nearly 400 staff members. Those courses mean that staff understand the implications of requests for images at specific qualities, to understand text mining requests and questions, etc.
These courses are intended to build capacity. The materials from these courses are also available online for scholars. And we are also here to help if you want to email a question we will be happy to point you in the right direction.
So, in terms of the value of these courses… A curator came to a course on cleaning up data and she went on to get a grant of over £70k for Big Data History of Music – a project with Royal Holloway to undertake analysis as a proof of concept around patterns in the history of music – trends in printing for instance.
We also have events, competitions and awards. One of these is “Off the Map”, a very cool endeavour, now in its fourth year. I’m going to show you a video on The Wondering Lands of Alice, our most recent winner. We digitise materials for this competition, teams compete to build video games, and this one is actually in our current Alice in Wonderland exhibition. This uses digitised content from our collection and you can see the calibre of these is very high.
There is a new competition open now. The new one is for any kind of digital media based on our digital collections. So do take a look at this.
So, if you want to get in touch with us you can find us at http://bl.uk/digital or tweet #bldigital.
British Library Labs – Mahendra Mahey, Project Manager of British Library Labs.
You can find my slides online (link to follow).
I manage a project called British Library Labs, based in the Digital Research team, who we work closely with. What we are trying to do is to get researchers, artists, entrepreneurs, educators, and anyone really to experiment with our digital collections. We are especially interested in people finding new things from our collections, especially things that would be very difficult to do with our physical collections.
What I thought I’d do, to illustrate the space the project occupies, is to show you some work from a researcher called Adam Crymble, King’s College London (a video called Big Data + Old History). Adam entered a competition to explain his research in visual/comic book format (we are now watching the video which talks about using digital texts for distant reading and computational approaches to selecting relevant material; to quantify the importance of key factors).
Other examples of the kinds of methods we hope researchers will use with our data span text mining, georeferencing, as well as creative reuses.
Just to give you a sense of our scale… The British Library says we are the world’s largest library by number of items. 180 million (or so) items, with only about 1-2% digitised. Now new acquisitions do increasingly come in digital form, including the UK Web Archive, but it is still a small proportion of the whole.
What we are hoping to do with our digital scholarship site is to launch data.bl.uk (soon) where you can directly access data. But as I did last year I have also brought a network drive so you can access some of our data today. We have some challenges around sharing data, we sometimes literally have to shift hard drives… But soon there will be a platform for downloading some of this.
So, imagine 20 years from now… I saw a presentation on technology and how we use “digital”… Well we won’t use “digital” in front of scholarship or humanities, it will just be part of the mainstream methodologies.
But back to the present… The reason I am here is to engage people like you, to encourage you to use our stuff, our content. One way to do this is through our BL Labs Competition, the deadline for which is 11th April 2016. And, to get you thinking, the best idea pitched to me during the coffee break gets a goodie bag – you have 30 seconds in that break!
Once ideas are (formally) submitted to the BL there will be 2 finalists announced in late May 2016. They then get a residency with some financial (up to £3600) and technical and curatorial support from June to October 2016. And a winner is then announced later in the year.
We also have the BL Labs Awards. This is for work already done with our content in interesting and innovative ways. You can submit projects – previous and new – by 5th September 2016. We have four categories: Artistic; Commercial; Research; and Learning/Teaching. Those categories reflect the increasingly diverse range of those engaging with our content. Winners are announced at a symposium on 7th November 2016 when prizes are given out!
So today is all about projects and ideas. Today is really the start of the conversation. What we have learned so far is that the kinds of ideas that people have will change quite radically once you try and access, examine and use the data. You can really tell the difference between someone who has tried to use the data and someone who has not when you look at their ideas/competition entries. So, do look at our data, do talk to us about your ideas. Aside from those competitions and awards we also collaborate in projects so we want to listen to you, to work with you on ideas, to help you with your work (capacity permitting – we are a small team).
Why are we doing this? We want to understand who wants to use our material, and more importantly why. We will try and give some examples to inspire you, to give you an idea of what we are doing. You will see some information on your seat (sorry blog followers, I only have the paper copy to hand) with more examples. We really want to learn how to support digital experiments better, what we can do, how we can enable your work. I would say the number one lesson we have learned – not new but important – is that it’s ok to make mistakes and to learn from these (cue a Jimmy Wales Fail Faster video).
So, I’m going to talk about the competition. One of our two finalists last year was Adam Crymble – the same one whose PhD project was highlighted earlier – and he’s now a lecturer in Digital History. He wanted to crowdsource tagging of historical images through Crowdsource Arcade – harnessing the appeal of 80s video games to improve the metadata and usefulness of historical images. So we needed to find an arcade machine, and then set up games on it – like Tag Attack – created by collaborators across the world. Tag Attack used a fox character trotting out images which you had to tag to one of four categories before he left the screen.
I also want to talk about our Awards last year. Our Artistic award winner last year was Mario Klingemann – Quasimondo. He found images of 44 men who Look 44 with Flickr images – a bit of code he wrote for his birthday! He found Tragic Looking Women etc. All of these done computationally.
In Commercial our entrant used images to cross stitch ties that she sold on Etsy.
The winner last year, from the Research category, was Spatial Humanities in Lancaster, looking for disease patterns and mapping those.
And our Special Jury prize was for James Heald, who did tremendous work with Flickr images from the BL, making them more available on Wikimedia, particularly map data.
Finally, loads of other projects I could show… One of my favourites is a former Pixar animator who developed some software to animate some of our images (The British Library Art Project).
So, one lesson we have learned is that there is a huge appetite to use BL digital content and data (see Flickr Commons stats later). And we are a route to finding that content – someone called us a “human API for the BL content”!
We want to make sure you get the most from our collections, we want to help your projects… So get in touch.
And now I just want to introduce Katrina Navickas who will talk about her project.
Political Meetings Mapper – Katrina Navickas
I am part of the Digital History Research Centre at the University of Hertfordshire. My focus right now is on Chartism, the big movement in the 19th Century campaigning for the vote. I am especially interested in the meetings they held, where and when they met and gathered.
The Chartists held big public meetings, but also weekly local meetings advertised in the press and local press. The BL holds huge amounts of those newspapers. So my challenge was to find out more about those meetings – how many there were advertised in the Northern Star newspaper from 1838 to 1850. The data is well structured for this… Now that may seem like a simple computational challenge but I come from a traditional research background, used to doing things by hand. I wanted to do this more automatically, at a much larger scale than previously possible. My mission was to find out how many meetings there were, where they were held, and how we could find those meetings automatically in the newspapers. We also wanted to make connections between papers, georeferenced historical maps, and also any that appear in playbills as some meetings were in theatres (though most were in pubs).
But this wasn’t that simple to do… Just finding the right files is tricky. The XML is some years old so is quite poor really. The OCR was quite inaccurate, hard to search. And we needed to find maps from the right period.
So, the first stage was to redo the OCR of the original image files. Initially we thought we’d need to do what Bob Nicholson did with Historic Jokes, which was getting volunteers to re-do them. But actually newer OCR software (ABBYY FineReader 12) did a much better job and we just needed a volunteer student to check the text – mainly about punctuation not spelling. Then we needed to geo-code places using a gazetteer. And then we needed to use Python code with regular expressions to extract dates, and some basic NLP to calculate the dates of words like “tomorrow” – easier as the paper always came out on a Saturday.
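(An illustrative aside from the editor, not code shown in the talk.) A minimal sketch of how relative date words might be resolved against a known issue date. The pattern, the function name `resolve_relative_dates` and the rule that a weekday name means the next such day after publication are all assumptions made for illustration; the project's actual code may work quite differently.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Assumed pattern: "tomorrow" / "to-morrow" or a bare weekday name.
RELATIVE = re.compile(
    r"\b(to-?morrow|" + "|".join(WEEKDAYS) + r")\b", re.IGNORECASE
)

def resolve_relative_dates(text, issue_date):
    """Return (matched phrase, resolved date) pairs, counting forward
    from the issue date of the newspaper."""
    results = []
    for match in RELATIVE.finditer(text):
        word = match.group(1).lower().replace("-", "")
        if word == "tomorrow":
            resolved = issue_date + timedelta(days=1)
        else:
            # Days until the next occurrence of that weekday,
            # strictly after the issue date (1 to 7 days ahead).
            offset = (WEEKDAYS.index(word) - issue_date.weekday() - 1) % 7 + 1
            resolved = issue_date + timedelta(days=offset)
        results.append((match.group(0), resolved))
    return results
```

Because every issue appeared on a Saturday, the issue date alone is enough to anchor each phrase; a paper with irregular publication days would need the date parsed from each issue.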
So, in terms of building a historical gazetteer. We extracted place names and ran them through: http://sandbox.idre.ucla.edu/tools/geocoder. Ran through with parameters of Lat and Long to check locations. But we still needed to do some geocoding by hand. The areas we were looking at have changed a lot through slum clearances. We therefore needed to geolocate some of the historical places, using detailed 1840s georeferenced maps of Manchester, and geocoding those.
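(Another editor's aside, not from the talk.) One way to picture the latitude/longitude check is a bounding-box filter that accepts plausible results and sets the rest aside for geocoding by hand. The box values, the `GB_BOUNDS` name and the `split_by_bounds` function below are hypothetical; this is a sketch of the general idea rather than the project's workflow.

```python
# Rough bounding box for Great Britain (assumed, approximate values).
GB_BOUNDS = {"min_lat": 49.8, "max_lat": 60.9, "min_lon": -8.7, "max_lon": 1.8}

def split_by_bounds(geocoded, bounds=GB_BOUNDS):
    """Split geocoder output into accepted results and places to check by hand.

    `geocoded` maps a place name to a (lat, lon) tuple, or to None when the
    geocoder returned nothing.
    """
    accepted = {}
    needs_review = []
    for place, coords in geocoded.items():
        if coords is None:
            needs_review.append(place)
            continue
        lat, lon = coords
        inside = (bounds["min_lat"] <= lat <= bounds["max_lat"]
                  and bounds["min_lon"] <= lon <= bounds["max_lon"])
        if inside:
            accepted[place] = coords
        else:
            needs_review.append(place)
    return accepted, needs_review
```

A check like this only catches gross errors, such as a name matched to a town on another continent. Streets lost to slum clearance would still need the historical maps.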
In the end, in the scale of this project, we looked at only 1841-1844. From that we extracted 5519 meetings (and counting) – and identifying text and dates. And that coverage spanned 462 towns and villages (and counting). In that data we found 200+ lecture tours – Chartist lecturers were paid to go on tours.
So, you can find all of our work so far here: http://politicalmeetingsmapper.co.uk. The website is still a bit rough and ready, and we’d love feedback. It’s built on the Omeka platform – designed for showing collections – which also means we have some limitations but it does what we wanted to.
Our historical maps are with thanks to the NLS whose brilliant historical mapping tiles – albeit from a slightly later map – were easier to use than the BL georeferenced map when it came to plot our data.
Interestingly, although this was a Manchester paper, we were able to see meeting locations in London – which let us compare to Charles Booth’s poverty maps. Also to do some heatmapping of that data. Basically we are experimenting with this data… Some of this stuff is totally new to me, including trialling a Machine Learning approach to understand the texts of a meeting advertisement – using an IPython Notebook to make a classifier to try to identify meeting texts.
So, what next? Well we want to refine our NLP parsing for more dates and other data. And I also want to connect “forthcoming meetings” to reports from the same meeting in the next issue of the paper. Also we need to do more machine learning to identify columns and types of texts in the unreconstructed XML of the newspapers in the BL Digital Collections.
Now that’s one side of our work, but we also did some creative engagement around this too. We got dressed up in Victorian costume, building on our London data analysis and did a walking tour of meetings ending in recreating a Chartist meeting in a London Pub.
Q1) I’m looking at Data mining for my own research. I was wondering how much coding you knew before this project – and after?
A1) My training had only been in GIS, and I’d done a little introduction to coding but I basically spent the summer learning how to do this Python coding. Having a clear project gave me the focus and opportunity to do that. I still don’t consider myself a Digital Historian I guess but I’m getting there. So, no matter whether you have any coding skills already don’t be scared, do enter the competition – you get help, support, and pointed in the right direction to learn the skills you need to.
Farces and Failures: an overview of projects that have used British Library’s Digital Content and data – Ben O’Steen, Technical Lead of British Library Labs.
My title isn’t because our work is farce and failure… It’s intentionally to reference the idea that it can be really important early in the process to ensure we have a shared understanding of terminology as that can cause all manner of confusion. The names and labels we choose shape the questions that people will ask and the assumptions we make. For instance “Labs” might make you imagine test tubes… or puppies… In fact we are based in the BL building in St Pancras, in offices, with curators.
Our main purpose is to make the collections available to you, to help you find the paths to walk through, where to go, what you can find, where to look. We work with researchers on their specific problems, and although that work is specific we are also trying to assess how widely this problem is felt. Much of our work is to feed back to the library what researchers really want and need to do their work.
There is also this notion that people tell us things that they think we need to hear in order to help them. As if you need secret passwords to access the content, people can see us as gatekeepers. But that isn’t how BL Labs work. We are trying to develop things that avoid the expected model of scholarship – of coming in, getting one thing, and leaving. That’s not what we see. We see scholars looking at 10,000 things to work with. People ask us “Give me all of collection X” but is that useful? Collections are often collected that way, named that way for administrative reasons – the naming associated with a particular digitisation funder, or from a collection. So the Dead Sea Scrolls are scanned in a music collection because the settings were the same for digitising them… That means the “collection” isn’t always that helpful.
So farce… If we think Fork handles/4 Candles…
We have some common farce-inducing words:
- Collection (see above)
- Access – but that has different meanings, sometimes “access” is “on-site” and without download, etc. Access has many meanings.
- Content – we have so much, that isn’t a useful term. We have personal archives, computers, archives, UK Web domain trawl, pictures of manuscripts, OCR, derived data. Content can be anything. We have to be specific.
- Metadata – one person’s metadata is another’s data. Not helpful except in a very defined context.
- Crowdsourced – means different things to different people. You must understand how the data was collected – what was the community, how did they do it, what was the QA process. That applies to any collaborative research data collection, not just crowdsourcing.
An example of complex provenance…
Microsoft Books digitisation project. It started in 2007 but stopped in 2009 when the MS Book search project was cancelled. This digitised 49K works (~65k volumes). It has been online since 2012 via a “standard” page turning interface but we have very low usage statistics. That collection is quite random, items were picked shelf by shelf with books missing. People do data analysis of those works and draw conclusions that don’t make sense if you don’t understand that provenance.
So we had a competition entry in 2013 that wanted to analyse that collection… But actually led to a project called the Sample Generator by Pieter Francois. This compared physical to digital collections to highlight the issues of how unrepresentative that sample is for drawing any conclusions.
Allen B Riddell’s “Where are the novels?” (2012) looked at the HathiTrust corpus and similarly examined the bias in digitised resources.
We have really big gaps in our knowledge. In fact librarians may recognise the square brackets of the soul… The data in records that isn’t actually confirmed, inferred information within metadata. If you look at the Microsoft Books project it’s about half inferred information. A lot of the Sample Generator peaks in what has been digitised are there because of inferred year of publication based on content – guesswork rather than reliable dates.
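(Editor's aside, not from the talk.) Catalogue records conventionally put supplied or inferred information in square brackets, so a quick way to see how much of a date distribution rests on guesswork is to flag bracketed years before counting. The sketch below assumes simple date strings such as `[1850?]` or `1862.`; the patterns and the `parse_year` name are illustrative only, and real catalogue dates are messier than this.

```python
import re

# A four-digit year somewhere inside square brackets, e.g. "[1850?]" or "[ca. 1840]".
BRACKETED_YEAR = re.compile(r"\[[^\]]*?(\d{4})[^\]]*\]")
# A plain four-digit year with no brackets.
PLAIN_YEAR = re.compile(r"\b(\d{4})\b")

def parse_year(date_field):
    """Return (year, inferred) for a catalogue date string.

    `inferred` is True when the year appears inside square brackets.
    Returns (None, False) when no four-digit year is found.
    """
    match = BRACKETED_YEAR.search(date_field)
    if match:
        return int(match.group(1)), True
    match = PLAIN_YEAR.search(date_field)
    if match:
        return int(match.group(1)), False
    return None, False
```

Plotting inferred and stated years as separate series would show whether a peak in the data reflects publishing history or a cataloguer's best guess.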
But we can use this data. So Bob Nicholson’s competition entry on Victorian Jokes led to the Mechanical Comedian Twitter account. We didn’t have a good way into these texts, we had to improvise around these ideas. And we did find some good jokes… If you search for “My Mother in-law” and “Victorian Humour” you’ll see a great video for this.
That project looked for patterns of words. That’s the same technique applied to Political Meetings Mapper.
So “Access” again… These newspapers were accessible but we didn’t have access to them… Keyword search fails miserably and bulk access is an issue. But that issue is useful to know about. Research and genealogical needs are different and these papers were digitised partly for those more lucrative genealogical needs to browse and search.
There are over 600 digital archives, we can only spend so long characterising each of them. Microsoft Books digitisation project was public domain so that let us experiment richly quickly. We identified images of people, we found image details. We started to post images to Twitter and Tumblr (via Mechanical Curator)… There was demand and we weren’t set up to deliver those so we used Flickr Commons – 1 TB for free – with the limited awareness of what page an image was from, what region. We had minimal metadata but others started tagging and adding to our knowledge. Nora did a great job of collating these images that had been started to be tagged (by people and machines). And usage of images has been huge. 13-20 million hits on average every month, over 330 M hits to date.
Is this Iterative Crowdsourcing (Mia Ridge)? We crowdsource broad facts and subcollections of related items will emerge. There is no one size fits all, has to be project based. We start with no knowledge but build from there. But these have to be purposefully contextless. Presenting them on Flickr removed the illustrations’ context. The sheer amount of data is huge. David Foster Wallace has a great comment that “if your fidelity to perfectionism is too high, you never do anything”. We have a fear of imperfection in all universities, and we need to have the space to experiment. We can re-represent content in new forms, it might work, it might not. Metaphors don’t translate between media – like turning pages on a screen, or scrolling a book forever.
With our map collection we ran a tagathon and found nearly 30,000 maps. 10,000 were tagged by hand, 20,000 were found by machine. We have that nice combination of human and machine. We are now trying to georeference our maps and you can help with that.
But it’s not just research… We encourage people to do new things – make colouring books for kids, make collages – like David Normal’s Burning Man installation (also shown at St Pancras). That stuff is part of playing around.
Now, I’ve talked about “Crowd sourcing” several times. There can be lots of bad assumptions of that term. It’s assumed to be about a crowd of people all doing a small thing, about special software, that if you build it they will come, it’s easy, it’s cheap, it’s totally untrustworthy… These aren’t right. It’s about being part of a community, not just using it. When you look at Zooniverse data you see a common pattern – that 1-2% of your community will do the majority of the work. You have to nurture the expert group within your community. This means you can crowdsource starting with that expert group – something we are also doing in a variety of those groups. You have to take care of all your participants but that core crowd really matter.
So, for crowdsourcing you don’t need special software. If you build something they don’t necessarily come, they often don’t. And something we like to flag up is the idea of playing games, trying the unusual… Can we avoid keyboard and mouse? That arcade game does that, it asks that idea of whether we can make use of casual interaction to get useful data. That experiment is based on a Raspberry Pi and loads of great ideas from others using our collections. They are about the game dynamic… How we deal with data – how to understand how the game dynamics impact on the information you can extract.
So, in summary…
Don’t be scared of using words like “collection” and “access” with us… But understand that there will be a dialogue… that helps avoid disappointment, helps avoid misunderstanding or wasted time. We want to be clear and make sure we are all on the same page early on. I’m there to be your technical guide and lead on a project. There is space to experiment, to not be scared to fail and learn from that failure when it happens. We are there to have fun, to experiment.
Questions & Discussion
Q1) I’m a historian at the National Library of Scotland. You talked about that Microsoft Books project and the randomness of that collection. Then you talked about the Flickr metadata – isn’t that the same issue… Is that suitable for data mining? What do you do with that metadata?
A1) A good point. Part of what we have talked about is that those images just tell you about part of one page in a book. The mapping data is one of the ways we can get started on that. So if we geotag an image or a map with Aberdeen then you can perhaps find that book via that additional metadata, even if Aberdeen would not be part of the catalogue record, the title etc. There are big data approaches we can take but there is work on OCR etc. that we can do.
Q2) A question for Ben about Tweeting – the Mechanical Curator and the Mechanical Comedian. For the Curator… They come out so regularly… How are they generated?
A2) That is mechanical… There are about 1200 lines of code that roam the collection looking for similar stuff… The text is generated from the book’s metadata… It is looking at data on the hard drive – access to everything so quite random. If no match it finds another random image.
Q2) And the Mechanical Comedian?
A2) That is run by Bob. The jokes are mechanically harvested, but he adds the images. He does that himself – with a bit of curation in terms of the badness of jokes – and adds images with help of a keen volunteer.
Q3) I work at the National Library of Scotland. You said to have fun and experiment. What is your response to the news of job cuts at Trove, at the National Library of Australia?
A3 – Ben) Trove is a leader in this space and I know a lot of people are incredibly upset about that.
A3 – Nora) The thing with digital collections is that they are global. Our own curators love Trove and I know there is a Facebook group to support Trove so, who knows, perhaps that global response might lead to a reversal?
Mahendra: I just wanted to say again that learning about the stories and provenance of a collection is so important. Talking about the back stories of collections. Sometimes the reasons content is not made available have nothing to do with legality… Those personal connections are so important.
Q4) I’m interested in your use of the IPython Notebook. You are using that to access content on BL servers and website? So you didn’t have to download lots of data? Is that right?
A4) I mainly use it as a communication tool between myself and Ben… I type ideas into the notebook, Ben helps me turn that into code… It seemed the best tool to do that.
Q4) That’s very interesting… The Human API in action! As a researcher is that how it should be?
A4) I think so. As a researcher I’m not really a coder. For learning these spaces are great, they act as a sandbox.
Q4) And your code was written for your project, should that be shared with others?
A4) All the code is on a GitHub page. It isn’t perfect. That extract, code, geocode idea would be applicable to many other projects.
Mahendra: There is a balance that we work with. There are projects that are fantastic partnerships of domain experts working with technical experts wanting problems to solve. But we also see domain experts wanting to develop technical skills for their projects. We’ve seen both. Not sure of the answer… We did an event at Oxford who do a critical coding course where they team humanities and computer scientists… It gives computer scientists experience of really insanely difficult problems, the academics get experience of framing questions in precise ways…
Ben: And by understanding coding and
Comment (me): I just wanted to encourage anyone creating research software to consider submitting papers on that to the Journal of Open Research Software, a metajournal for sharing and finding software specifically created for research.
Q5) It seemed like the Political Meetings Mapper and the Palimpsest project had similar goals, so I wondered why they selected different workflows.
A5 – Bea Alex) The project came about because I spoke to Miranda Anderson who had the idea at the Digital Scholarship Day of Ideas. At that time we were geocoding historical trading documents and we chatted about automating that idea of georeferencing texts. That is how that project came about… There was a large manual aspect as well as the automated aspects. But the idea was to reduce that manual effort.
A5 – Katrina) Our project was so much smaller team. This is very much a pilot project to meet a particular research issue. The outcomes may seem similar but we worked on a smaller scale, seeing what one researcher could do. As a traditional academic historian I don’t usually work in groups, let alone big teams. I know other projects work at larger scale though – like Ian Gregory’s Lakes project.
A5 – Mahendra) Time was a really important aspect in decisions we took in Katrina’s project, and of focusing the scope of that work.
A5 – Katrina) Absolutely. It was about what could be done in a limited time.
A5 – Bea) One of the aspects from our work is that we sourced data from many collections, and the structure could be different for each mention. Whereas there is probably a more consistent structure because of the single newspaper used in Katrina’s project, which lends itself better to a regular expressions approach.
And next we moved to coffee and networking. We return at 3.30 for more excellent presentations (details below).
BL Labs Awards: Research runner up project: “Palimpsest: Telling Edinburgh’s Stories with Maps” – Professor James Loxley, Palimpsest, University of Edinburgh
I am going to talk about a project which I led in collaboration with colleagues in English Literature, with Informatics here, with visualisation experts at St Andrews, and with EDINA.
The idea came from Miranda Anderson, in 2012, who wanted to explore how people imagine Edinburgh in a literary sense, how the place is imagined and described. And one of the reasons for being interested in doing this is the fact that Edinburgh was the world’s first UNESCO City of Literature. The City of Literature Trust in Edinburgh is also keen to promote that rich literary heritage.
We received funding from the AHRC from January 2014 to March 2015. And the name came from the concept of the Palimpsest, the text that is rewritten and erased and layered upon – and of the city as a Palimpsest, changing and layering over time. The original website was to have the same name but as that wasn’t quite as accessible, we called that LitLong in the end.
We had some key aims for this project. There are particular ways literature is packaged for tourists etc. We weren’t interested in where authors were born or died. Or the authors that live here. What we were interested in was how the city is imagined in the work of authors, from Robert Louis Stevenson to Muriel Spark or Irvine Welsh.
And we wanted to do that in a different way. Our initial pilot in 2012 was all done manually. We had to extract locations from texts. We had a very small data set and it offered us things we already knew – relying on well known Edinburgh books, working with the familiar. The kind of map produced there told us what we already knew. And we wanted to do something new. And this is where we realised that the digital methods we were thinking about really gave us an opportunity to think of the literary cityscape in a different mode.
So, we planned to textmine large collections of digital text to identify narrative works set in Edinburgh. We weren’t constrained to novels, we included short stories, memoirs… Imaginative narrative writing. We excluded poetry as that was too difficult a processing challenge for the scale of the project. And we were very lucky to have the support and access to British library works, as well as material from the HathiTrust, and the National Library of Scotland. We mainly worked with out of copyright works. But we did specifically get permission from some publishers for in-copyright works. Not all publishers were forthcoming, and happy for work to be text mined. We were text mining work – not making them freely available – but for some publishers full text for text mining wasn’t possible.
So we had large collections of works, mainly but not exclusively out of copyright. And we set about textmining those collections to find those set in Edinburgh. And then we georeferenced the Edinburgh placenmmaes in those works to make mapping possible. And then finally we created visualisations offering different viewpoints into the data.
The best way to talk about this is to refer to text from our website:
Our aim in creating LitLong was to find out what the topography of a literary city such as Edinburgh would look like if we allowed digital reading to work on a very large body of texts. Edinburgh has a justly well-known literary history, cumulatively curated down the years by its many writers and readers. This history is visible in books, maps, walking tours and the city’s many literary sites and sights. But might there be other voices to hear in the chorus? Other, less familiar stories? By letting the computer do the reading, we’ve tried to set that familiar narrative of Edinburgh’s literary history in the less familiar context of hundreds of other works. We also want our maps and our app to illustrate old connections, and forge new ones, among the hundreds of literary works we’ve been able to capture.
That’s the kind of aims we had, what we were after.
So our method started with identifying texts with a clear Edinburgh connection or, as we called it “Edinburghyness“. Then, within those works to actually try and understand just how relevant they were. And that proved tricky. Some of the best stuff about this project came from close collaboration between literary scholars and informatics researchers. The back and forth was enormously helpful.
We came across some seemingly obvious issues. The first thing we saw was that there was a huge amount of theological works… Which was odd… And turned out to be because the Edinburgh placename “Trinity” was in there. Then “Haymarket” is a place in London as well as Edinburgh. So we needed to rank placenames and part of that was the ambiguity of names, and understanding that some places are more likely to specifically be Edinburgh than others.
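One way to picture the ranking described above is a weighted count, where each placename carries a weight reflecting how reliably it points to Edinburgh. This is only a minimal sketch in Python: the weights, the table of names and the function `edinburgh_score` are hypothetical illustrations, not the project's actual method.

```python
import re

# Hypothetical weights: higher means the name more reliably signals Edinburgh.
# These values are invented for illustration only.
PLACE_WEIGHTS = {
    "canongate": 0.9,
    "cowgate": 0.9,
    "haymarket": 0.3,   # also a place in London
    "trinity": 0.05,    # mostly theological usage in older texts
}


def edinburgh_score(text: str) -> float:
    """Sum the weights of every known placename mention in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(PLACE_WEIGHTS.get(token, 0.0) for token in tokens)


sermon = "The Holy Trinity, the Trinity, and again the Trinity."
novel = "He walked down the Canongate and into the Cowgate."
```

With weights like these, a sermon that repeats "Trinity" scores far lower than a novel with two distinctive street names, which is the effect the ranking was after.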
From there, with selected works, we wanted to draw out snippets – of varying lengths but usually a sensible syntactic shape – with those mentions of specific placenames.
At the end of that process we had a dataset of 550 published works, across a range of narrative genres. They have over 1600 Edinburgh place names of lots of different types, since literary engagement with a city might be a street, a building, open spaces, areas, monuments etc. In mapping terms you can be more exact, in literature you have these areas and diverse types of “place”, so our gazetteer needed to be flexible to that. And what that all gave us in total was 47,000 extracts from literary works, all focused on a place name mention.
That was the work itself but we also wanted to engage people in our work. So we brought Sir Walter Scott back to life. He came along to the Edinburgh International Book Festival in 2014. He kind of got away from us and took on a life of his own… He ended up being part of the celebrations of the 200th anniversary of Waverley. And popped up again last year on the Borders Railway when that launched! That was fun!
We did another event at EIBF in 2015 with James Robertson who was exploring LitLong and data there. And you can download that as a podcast.
So, we were very very focused on making this project work, but we were also thinking about the users.
The resource itself you can visit at LitLong.org. I will talk a little about the two forms of visualisation. The first is a location visualiser largely built and developed by Uta Hinrichs at St Andrews. That allows you to explore the map, to look at keywords associated with locations – which indicate a degree of qualitative engagement. We also have a searchable database where you can see the extracts. And we have an app version which allows you to wander in among the extracts, rather than see from above – our visualisation colleagues call this the “Frogs Eye View”. You can wander between extracts, browse the range of them. It works quite well on the bus!
We were obviously delighted to be able to do this! Some of the obstacles seemed tough but we found workable solutions… But we hope it is not the end of the story. We are keen to explore new ways to make the resource explorable. Right now there isn’t a way where interaction leaves a trace – other people’s routes through the city, other people’s understanding of the topography. There is scope for more analysis of the texts themselves. For instance we considered doing a mood map of the city. We weren’t able to do that in this project, but there is scope to do it. And as part of building on the project we have a bit of funding from the AHRC so lots of interesting lines of enquiry there. And if you want to explore the resource do take a look, get in touch etc.
Q1) Do you think someone could run sentiment analysis over your text?
A1) That is entirely plausible. The data is there and tagged so that you could do that.
A1 – Bea) We did have an MSc project just starting to explore that in fact.
A1) One of our buttons on the homepage is “LitLong Lab” where we share experiments in various ways.
Q2) Some science fiction authors have imagined near future Edinburgh, how could that be mapped?
A2) We did have some science fiction in the texts, including the winner of our writing competition. We have texts from a range of ages of work but a contemporary map, so there is scope for keying data to historic maps, and those exist thanks to the NLS. As to the future… The not-yet-Edinburgh… Something I’d like to do… It is not uncommon that fictional places exist in real places – like 221 Baker Street or 44 Scotland Street – and I thought it would be fun to see the linguistic qualities associated with a fictional place, and compare to real places with the same sort of profile. So, perhaps for futuristic places that would work – using linguistic profile to do that.
Q3) I was going to ask about chronology – but you just answered that. So instead I will ask about crowd sourcing.
A3) Yes! As an editor I am most concerned about potential effort. For this scale and speed we had to let go of issues of mistakes, we know they are there… Places that move, some false positives, and some books that used Edinburgh placenames but are other places (e.g. some Glasgow texts). At the moment we don’t have a full report function or similar. We weren’t able to create it to enable corrections in that sort of way. What we decided to do is make a feature of a bug – celebrating those as worm holes! But I would like to fine tune and correct, with user interactions as part of that.
Q4) Is the data set available?
A4) Yes, through an API created by EDINA. Open for out of copyright work.
Palimpsest seeks to find new ways to present and explore Edinburgh’s literary cityscape, through interfaces showcasing extracts from a wide range of celebrated and lesser known narrative texts set in the city. In this talk, James will set out some of the project’s challenges, and some of the possibilities for the use of cultural data that it has helped to unearth.
Geoparsing Jisc Historical Texts – Dr Claire Grover, Senior Research Fellow, School of Informatics, University of Edinburgh
I’ll be talking about a current project, a very rapid project to geoparse all of the Jisc Historical Texts. So I’ll talk about the Geoparser and then more about that project.
The Edinburgh Geoparser has been developed over a number of years in collaboration with EDINA. It has been deployed in various projects and places, mainly also in collaboration with EDINA. And it has various main steps:
- Use named entity recognition to identify place names in texts
- Find matching records in a gazetteer
- In cases of ambiguity (e.g. Paris, Springfield), resolve using contextual information from the document
- Assign coordinates of preferred reading to the placename
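The four steps above can be sketched with a toy example in Python. This is not the Edinburgh Geoparser's code or API: the gazetteer entries, the capitalised-word stand-in for named entity recognition, and the "closest to the unambiguous places" heuristic are all assumptions made for illustration.

```python
import re
from math import hypot

# Toy gazetteer: place name -> list of candidate readings (label, lat, lon).
# Coordinates are approximate and the entries are illustrative only.
GAZETTEER = {
    "Paris": [("Paris, France", 48.86, 2.35), ("Paris, Texas", 33.66, -95.56)],
    "Lyon": [("Lyon, France", 45.76, 4.84)],
    "Dallas": [("Dallas, Texas", 32.78, -96.80)],
}


def recognise(text):
    """Step 1 (stand-in for NER): capitalised words that the gazetteer knows."""
    return [w for w in re.findall(r"[A-Z][a-z]+", text) if w in GAZETTEER]


def geoparse(text):
    """Steps 2-4: look up candidates, resolve ambiguity, assign coordinates."""
    names = recognise(text)
    # Unambiguous names act as context anchors for the ambiguous ones.
    anchors = [GAZETTEER[n][0] for n in names if len(GAZETTEER[n]) == 1]
    resolved = {}
    for name in names:
        candidates = GAZETTEER[name]

        def distance_to_anchors(candidate):
            # Crude distance in degrees; good enough for a toy comparison.
            return sum(hypot(candidate[1] - a[1], candidate[2] - a[2]) for a in anchors)

        # Preferred reading: the candidate nearest the document's other places.
        resolved[name] = min(candidates, key=distance_to_anchors)
    return resolved
```

For example, `geoparse("From Lyon we travelled to Paris.")` picks the French reading of Paris, while a sentence mentioning Dallas pulls Paris towards Texas. With no anchors in the text, the sketch simply falls back to the first candidate listed.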
So, you can use the Geoparser either via EDINA’s Unlock Text, or you can download it, or you can try a demonstrator online (links to follow).
To give you an example I have a news piece on the burial of Richard III. You can see the Geoparser looks for entity recognition of all types – people as well as places – as that helps with disambiguation later on. Then using that text the parser ranks the likelihood of possible locations.
A quick word on gazetteers. The knowledge of possible interpretations comes from a gazetteer, which pairs place names to lat/long. So, if you know your data you can choose a gazetteer relevant to that (e.g. just the UK). The Edinburgh Geoparser is configured to provide a choice of gazetteers and can be configured to use other gazetteers.
If a place is not in a gazetteer it cannot be grounded. If the correct interpretation of a place name is not in the gazetteer, it cannot be grounded correctly. Modern gazetteers are not ideal for historical documents so historical gazetteers need to be used/developed. So for instance the DEEP (Directory of English Place Names) or PELAGIOS (ancient world) gazetteers have been useful in our current work.
The current Jisc Historical Texts (http://historicaltexts.jisc.ac.uk/) project has been working with EEBO and ECCO texts as well as the BL Nineteenth Century collections. These are large and highly varied data sets. So, for instance, yesterday I did a random sample of writers and texts… which is so large we’ve only seen a tiny portion of it. We can process it but we can’t look at it all.
So, what is involved in us georeferencing this text? Well we have to get all the data through the Edinburgh Geoparser pipeline. And that requires adapting the geoparser pipeline to recognise place names to work as accurately as possible on historical text. And we need to adjust the georeferencing strategy to be more detailed.
Adapting our place name recognition relies a lot on lexicons. The standard Edinburgh Geoparser has three lexicons derived from the Alexandria Gazetteer (global, very detailed); Ordnance Survey (Great Britain, quite detailed), DEEP. We’ve also added more lexicons from more gazetteers… including larger place names in Geonames (population over 10,000), populated places from Natural Earth, only larger places from DEEP, and then score recognised place names based on how many and which lexicons they occur in. Low scored placenames are removed – we reckon people’s tolerance for missing a place is higher than their tolerance for false positives.
Working with old texts also means huge variation of spellings… There are a lot of missed placenames (false negatives) because of this (e.g. Maldauia, Demnarke, Saxonie, Spayne). The spelling variation also results in false positives (Grasse, Hamme, Lyon, Penne, Sunne, Haue, Ayr). So we have tried to remove the false positives, to remove bad placenames.
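To make the lexicon scoring concrete, here is a minimal Python sketch. The lexicon contents, the per-lexicon weights and the threshold are all invented for illustration; the talk did not give the actual values or formula.

```python
# Hypothetical lexicons and weights, invented for illustration.
LEXICONS = {
    "geonames_large": {"london", "paris", "york"},
    "natural_earth": {"london", "paris"},
    "deep_large": {"london", "york", "hamme"},
}
LEXICON_WEIGHTS = {"geonames_large": 1.0, "natural_earth": 1.0, "deep_large": 0.5}
THRESHOLD = 1.0  # assumed cut-off; candidates scoring below it are dropped


def lexicon_score(name: str) -> float:
    """Add up the weights of every lexicon that contains the name."""
    key = name.lower()
    return sum(w for lex, w in LEXICON_WEIGHTS.items() if key in LEXICONS[lex])


def filter_candidates(names):
    """Keep only candidates whose lexicon score reaches the threshold."""
    return [n for n in names if lexicon_score(n) >= THRESHOLD]
```

Here `filter_candidates(["London", "Hamme", "York"])` keeps London and York but drops Hamme, which appears in only one low-weight lexicon. Raising the threshold trades more missed places for fewer false positives, which matches the tolerance described above.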
When it comes to actually georeferencing these places we need coordinates for place names from gazetteers. We used three gazetteers in succession: Pleiades++, GeoNames and then DEEP. In addition to using those gazetteers we can weight the results based on locations in the world – based on a bounding box. So we can prefer locations in the UK and Europe, then those in the East. Not extending to the West as much… And excluding Australia and New Zealand (unknown at that time).
So looking at EEBO and ECCO we can see some frequent place names from each gazetteer – which shows how different they are. In terms of how many terms we have found there are over 3 million locations in EEBO, over 250k in ECCO (a much smaller collection). The early EEBO collections have a lot of locations in Israel, Italy, France. The early books are more concerned with the ancient world and Biblical texts so these statistics suggest that we are doing the right thing here.
These are really old texts, we have huge volumes of them, and there is a huge variety of the data and that all makes this a hard task. We still don’t know how the work will be received but we think Jisc will put this work in a sandbox area and we should get some feedback on it.
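The cascade and the bounding-box preference might look something like the following Python sketch. The gazetteer entries, box coordinates and weights are assumptions for illustration, and the exclusion of Australia and New Zealand is not modelled here.

```python
# Toy gazetteers, consulted in order. Entries and coordinates are illustrative.
GAZETTEERS = [
    ("Pleiades++", {"Athens": [(37.97, 23.73)]}),
    ("GeoNames", {"Athens": [(33.95, -83.38), (37.98, 23.73)],
                  "Boston": [(42.36, -71.06), (52.98, -0.03)]}),
    ("DEEP", {"Boston": [(52.98, -0.03)]}),
]

# Assumed bounding boxes (south, west, north, east) with preference weights.
REGIONS = [
    ((35.0, -11.0, 71.0, 40.0), 3.0),   # roughly the UK and Europe
    ((5.0, 40.0, 55.0, 145.0), 2.0),    # roughly "the East"
]
DEFAULT_WEIGHT = 1.0


def region_weight(lat, lon):
    """Return the weight of the first bounding box containing the point."""
    for (south, west, north, east), weight in REGIONS:
        if south <= lat <= north and west <= lon <= east:
            return weight
    return DEFAULT_WEIGHT


def ground(name):
    """Use the first gazetteer that knows the name, then prefer favoured regions."""
    for source, entries in GAZETTEERS:
        if name in entries:
            best = max(entries[name], key=lambda c: region_weight(*c))
            return source, best
    return None  # not in any gazetteer, so it cannot be grounded
```

In this sketch `ground("Boston")` skips Pleiades++, finds two candidates in GeoNames and prefers the Lincolnshire one because it falls inside the European box, while an unknown name returns `None`.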
Find out more:
Q1) What about historical Gaelic place names?
A1) I’m not sure these texts have these. But we did apply a language tag on a paragraph level. These are supposed to be English texts but there is lots of Latin, Welsh, Spanish, French and German. We only georeferenced texts thought to be English. If there are Gaelic names and they are in Ordnance Survey, they may have been picked up…
Claire will talk about work the Edinburgh Language Technology Group have been doing for Jisc on geoparsing historical texts such as the British Library’s Nineteenth Century Books and Early English Books Online Text Creation Partnership which is creating standardized, accurate XML/SGML encoded electronic text editions of early print books.
Pitches – Mahendra and co
Can the people who pitched me
Lorna: I’m interested in open education and I’d love to get some of the BL content out there. I’ve been working on the new HECoS coding schema for different subjects. And I thought that it would be great to classify the BL content with HECoS.
Karen: I’ve been looking at Copyright music collections at St Andrews. There are gaps in legal deposit music from late 18th and 19th century as we know publishers deposited less in Scottish versus BL. So we could compare and see what reached outer reaches of the UK.
Nina: My idea was a digital Pilgrims Progress where you can have a virtual tour of a journey with all sorts of resources.. To see why some places are most popular in texts etc.
David: I think my idea has been done.. It was going to be iPython – Katrina is already doing this! But to make it more unique… It’s quite hard work for Ben to support scholars in that way so I think researchers should be encouraged to approach Ben etc. but also get non-programmers to craft complex queries, make the good ones reusable by others… and have those reused be marked up as of particular quality. And to make it more fun… Could have a sort of treasure hunt jam with people using that facility to have a treasure hunt on a theme… share interesting information… Have researchers see tweets or shared things… A group treasure hunt to encourage people by helping them share queries…
Mahendra: So we are supposed to decide the winners now… But I think we’ll get all our pitchers to share the bag – all great ideas… The idea was to start conversations. You should all have an email from me so, if you have found this inspiring or interesting, we’ll continue that conversation.
And with that we are done! Thanks to all for a really excellent session!