May 02, 2018

This morning I’m at the “Working with the British Library’s Digital Content, Data and Services for your research (University of Edinburgh)” event at the Informatics Forum to hear about work that has been taking place at the British Library Labs programme, and with BL data more widely. I’ll be liveblogging and, as usual, any comments, questions and corrections are welcome.

Introduction and Welcome – Professor Melissa Terras

Welcome to this British Library Labs event, which is about work that fits into wider work taking place, and coming, here at Edinburgh. British Library Labs works in a space that is changing all the time, and we need to think about how we as researchers can use digital content and this kind of work – and we’ll be hearing from some Edinburgh researchers using British Library data in their work today.

“What is British Library Labs? How have we engaged researchers, artists, entrepreneurs and educators in using our digital collections” – Ben O’Steen, Technical Lead, British Library Labs

We work to engage researchers, artists, entrepreneurs and educators to use our digital collections – we don’t build stuff, we find ways to enable access and use of our data.

The British Library isn’t just our building in St Pancras, we also have a huge document supply and storage facility in Boston Spa. At St Pancras we don’t just have the collections, we have space to work, we have reading rooms, and we have five underground floors hidden away there. We also have a public mission and a “Living Knowledge Vision” which helps us to shape our work.

British Library Labs has been running for four years now, funded by the Andrew W. Mellon Foundation, and we are in our third funded phase where we are trying to make this business as usual… So the BL supports the reader who wants to read 3 things, and the reader who wants to read 300,000 things. To do that we have some challenges to face to make things more accessible – not least to help people deal with the sheer scale of the collections. And we want to avoid people having to learn unfamiliar formats and methodologies which are really about the library and our processes. We also want to help people explore the feel of collections, their “shape” – what’s missing, what’s there, why, and how to understand that. We also want to help people navigate data in new ways.

So, for the last few years we have been trying to help researchers address their own specific problems, but also trying to work out if each is part of a wider problem, to see where there are general issues. But a lot of what we have done has been about getting started… We have a lot of items – about 180 million – but any count we have is always an estimate. Those items include 14m books, 60m patents, 8m stamps, 3m sound recordings… So what do researchers ask for…

Well, researchers often ask for all the content we have. That hides a failure on our part: we should have better tools to help people understand what is there, and what they actually want. That is a big ask, and it means a lot of internal change. So, we try to give researchers as much as we have… Sometimes that’s TBs of data, sometimes GBs… And data might be all sorts of stuff – not just the text but the images, the bindings, etc. If we take a digitised item we have an image of the cover, we have pictures, we have text, and we also have OCR for these books – when people ask for “all” of the book, is that the images, the OCR or both? One of those is much easier to provide…

Facial recognition is quite hot right now… That was one of the original reasons people asked for access to all of the illustrations – I run something called the Mechanical Curator to help highlight those images – they asked if they could have the images, so we now have 120m images on Flickr. What we knew about the images was the book, and the page. All the categorisation and metadata now there has come from people and machines looking at the data. We worked with Wikimedia UK to find maps, using manual and machine learning techniques – kind of in competition – to identify those maps… And they have now been moved into georeferencing tools (bl.uk/maps) and fed back to Flickr and also into the catalogue… But that breaks the catalogue… It’s not the best way to do this, so that has triggered conversations within the library about what we do differently, what we do extra.

As part of the crowdsourcing I built an arcade machine – and we ran a game jam with several usable games to categorise or confirm categories. That’s currently in the hallway by the lifts in the building, and was the result of work with researchers.

We put our content out there under a CC0 license, and then we have awards to recognise great use of our data. And this was submitted – the official music video for “Hey There Young Sailor”, made using that content! We also have the Off the Map competition – a curated set of data for undergraduate gaming students based on a theme… Every year there is something exceptional.

I mentioned the library catalogue being challenging, and people not always understanding that when you ask for everything, that isn’t everything that exists. But there are still holes… When we look at the metadata for our 19th century books we see huge amounts of data in [square brackets], meaning the data isn’t known but is the best suggestion. And this becomes more obvious when we look at work researcher Pieter Francois did on the collection – showing spikes in publication dates at 5 year intervals… which reflects the guesses at publication year that tend to be e.g. 1800/1805/1810. So if you take those intervals to shape your data, it will be distorted. And then what we have digitised is not representative of that, and it’s a very small part of the collection…

There is bias in digitisation then, and we try to help others understand that. Right now our digitised collections are about 3% of our collections. Of the digitised material 15% is openly licensed. But only about 10% is online. About 85% of our collections can only be accessed “on site” as licenses were written pre-internet. We have been exploring that, and exploring what that means…

So, back to use of our data… People have a hierarchy of needs from big broad questions down to filtered and specific queries… We have to get to the place where we can address those specific questions. We know we have messy OCR, so that needs addressing.

We have people looking for (sometimes terrible) jokes – see Victorian Humour run by Bob Nicholson based on his research – this is stuff that can’t be found with keywords…

We have Katrina Navickas mapping political activity in the 19th century. This looks different but uses the same data and the same platform – using Jupyter Notebooks. And we have researchers looking at black abolitionists. We have SherlockNet trying to do image classification… And we find work all over the place building on our data, on our images… We found a card game – Moveable Type – built on our images. And David Normal building montages of images. We’ve had the Poetic Places project.

So, we try to help people explore. We know that our services need to be better… And that our services shape expectations of the data – and can omit and hide aspects of the collections. Exploring data is difficult, especially with collections at this scale – and it often requires specific skills and capabilities.

British Library Labs working with University of Edinburgh and University of St Andrews Researchers

“Text Mining of News Broadcasts” – Dr. Beatrice Alex, Informatics (University of Edinburgh)

Today I’ll be talking about my work with speech data, which is funded by my Turing fellowship. I work in a group who have mainly worked with text, but this project has built on work with speech transcripts – and I am doing work on a project with news footage, and dialogues between humans and robots.

The challenges of working with speech include its particular characteristics: short utterances and interjections; speaker assumptions – different from e.g. newspaper text; and turn taking. Transcripts often lack sentence boundaries, punctuation and case distinctions. And there are errors introduced by speech recognition.

So, I’m just going to show you an example of our work which you can view online – https://jekyll.inf.ed.ac.uk/geoparser-speech/. Here you can do real time speech recognition, and this can then also be run through the Edinburgh Geoparser to look for locations and identify them on the map. There are a few errors and, where locations haven’t been recognised correctly in the speech recognition step, they also don’t map well. The steps in this pipeline are speech recognition (Google ASR), then text restoration, and then text and data mining.
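To make that pipeline concrete, here is a minimal sketch of the transcript-to-map stage in Python, using spaCy for named entity recognition and the Nominatim geocoder as stand-ins for the Edinburgh Geoparser (which is what the project actually uses); the example transcript is invented.

```python
# Sketch: lower-cased ASR-style transcript -> place entities -> coordinates.
# spaCy + geopy are stand-ins here, not the tools described in the talk.
import spacy
from geopy.geocoders import Nominatim

nlp = spacy.load("en_core_web_sm")              # small English model
geolocator = Nominatim(user_agent="bl-labs-demo")

transcript = ("heavy flooding has been reported in glasgow and parts "
              "of the scottish borders this morning")

# NER struggles with lower-cased ASR output, which is why the real
# pipeline restores case and punctuation first.
doc = nlp(transcript.title())                   # crude case restoration for the demo

for ent in doc.ents:
    if ent.label_ in ("GPE", "LOC"):            # place-like entities
        hit = geolocator.geocode(ent.text)
        if hit:
            print(ent.text, hit.latitude, hit.longitude)
```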

So, at the BL I’ve been working with Luke McKernan, lead curator for news and moving images. I have had access to a small set of example news broadcast files for prototype development. This is too small for testing/validation – I’d have to be onsite at BL to work on the full collection. And I’ve been using the CallHome collection (telephone transcripts) and BBC data which is available locally at Informatics.

So looking at an example we can see good text recognition. In my work I have implemented a case restoration step (for named entities and sentence-initial words) using rule-based lexicon lookup, and I also use Punctuator 2 – an open source tool which adds punctuation. That works much better, but still isn’t at an ideal level. Meanwhile the Geoparser was designed for written text, so it works well but misses things… Improvement work has taken place but there is more to do… And we have named entity recognition in use here too – looking for locations, names, etc.
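As a toy illustration of what a rule-based lexicon lookup for case restoration might look like (far simpler than the actual implementation described here), consider:

```python
# Toy case restoration: known named entities from a lexicon, plus
# capitalisation of the sentence-initial word.
LEXICON = {"edinburgh": "Edinburgh", "bbc": "BBC", "glasgow": "Glasgow"}

def restore_case(tokens):
    restored = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:                  # lexicon lookup for named entities
            restored.append(LEXICON[tok])
        elif i == 0:                        # sentence-initial word
            restored.append(tok.capitalize())
        else:
            restored.append(tok)
    return " ".join(restored)

print(restore_case("the bbc reported flooding in edinburgh".split()))
# -> "The BBC reported flooding in Edinburgh"
```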

The next steps are to test the effect of ASR quality on text mining (using CallHome and BBC broadcast data) with formal evaluation; to improve the text mining on speech transcript data based on further error analysis; and, longer term, to look at applications in the healthcare sector.

Q&A

Q1) Could this technology be applied to songs?

A1) It could be – we haven’t worked with songs before but we could look at applying it.

“Text Mining Historical Newspapers” – Dr. Beatrice Alex and Dr. Claire Grover, Senior Research Fellow, Informatics (University of Edinburgh) [Bea Alex will present Claire’s paper on her behalf]

Claire is involved in an Administrative Data Research Centre Scotland project looking at local Scottish newspapers, text mining them, and connecting them to other work. Claire managed to get access to the BL newspapers through Cengage and Gale – with help from the University of Edinburgh Library. This isn’t all of the BL newspaper collection, but part of it. This collection of data is also now available for use by other researchers at Edinburgh. Issues we had here were that access to more recent newspapers is difficult, and the OCR quality. Claire’s work focused on three papers in the first instance, from Aberdeen, Dundee and Edinburgh.

Claire adapted the Edinburgh Geoparser to process the OCR format of the newspapers and added local gazetteer resources for Aberdeen, Dundee and Edinburgh from OS OpenData. Each article was then automatically annotated with paragraph, sentence and word mark-up; named entities – people, places, organisations; locations; and geo coordinates.
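As a rough sketch of what using a local gazetteer involves – the file name, columns and place list below are hypothetical, and the real work extended the Edinburgh Geoparser itself – the basic lookup step is:

```python
# Sketch: load a local gazetteer (e.g. derived from OS OpenData) and attach
# coordinates to place mentions found in an article. File and columns are
# hypothetical.
import csv

def load_gazetteer(path):
    with open(path, newline="", encoding="utf-8") as f:
        return {row["name"].lower(): (float(row["lat"]), float(row["lon"]))
                for row in csv.DictReader(f)}

gazetteer = load_gazetteer("aberdeen_gazetteer.csv")    # hypothetical file

article_places = ["Torry", "Woodside", "Union Street"]  # mentions from one article
for place in article_places:
    coords = gazetteer.get(place.lower())
    print(place, coords if coords else "not in gazetteer")
```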

So, for example, a scanned item from the Edinburgh Evening News from 1904 – it’s not a great scan, and the OCR is OK but erroneous. Named entities are identified, locations are marked. Because of the scale of the data Claire took just one year from most of the papers and worked with a huge number of articles, announcements, images etc. She also drilled down into the geoparsed newspaper articles.

So for Aberdeen in 1922 there were over 19 million word/punctuation tokens and over 230,000 location mentions. Claire then used frequency methods and concordances to understand the data. For instance she looked for mentions of Aberdeen placenames by frequency – and that shows the regions/districts of Aberdeen – Torry, Woodside, and also Union Street… Then Claire dug down again… Looking at Torry, the mentions included Office, Rooms, Suit, etc., which gives a sense of the area – a place people rented accommodation in. In just the news articles (not ads etc.) the Torry mentions are about Council, Parish, Councillor, politics, etc.
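The frequency step is essentially counting location mentions across the annotated articles; a minimal sketch (with a toy list standing in for the ~230,000 real mentions) might look like this:

```python
# Sketch: rank place mentions by frequency. Toy data only.
from collections import Counter

location_mentions = ["Torry", "Union Street", "Torry", "Woodside",
                     "Torry", "Union Street", "Woodside"]

for place, count in Counter(location_mentions).most_common():
    print(place, count)
```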

Looking at concordances, Claire looked at “fish”, for instance, to see what else was mentioned and, in summary, she noted that the industry was depressed after WW1; that there was unemployment in Aberdeen and the fishing towns of Aberdeenshire; that there was competition from German trawlers landing Icelandic fish; that there were hopes to work with Germany and Russia on the industry; and that government was involved in supporting the industry and taking action to improve it.
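A concordance is just a keyword-in-context view over the tokenised text; NLTK’s Text class (a stand-in for whatever tooling was actually used, with an invented example sentence) gives the idea:

```python
# Sketch: keyword-in-context concordance lines for "fish".
import nltk
from nltk.text import Text

nltk.download("punkt", quiet=True)

raw = ("The fish trade remained depressed after the war, and German trawlers "
       "landed Icelandic fish at Aberdeen while local boats lay idle.")
tokens = nltk.word_tokenize(raw)

Text(tokens).concordance("fish", width=60)
```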

With the Dundee data we can see the topic modelling that Claire did for the articles – for instance a clustering of cars, police, accidents etc.; there is a farming and agriculture topic; sports (golf etc.)… And you can look at the headlines from those topics and see how they reflect the identified topics.
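For readers who haven’t used topic modelling, a minimal LDA sketch with gensim (toy documents, not Claire’s actual pipeline) shows the shape of the technique:

```python
# Sketch: LDA topic modelling over tokenised articles with gensim. Toy data.
from gensim import corpora, models

docs = [["car", "police", "accident", "street"],
        ["farm", "cattle", "harvest", "prices"],
        ["golf", "course", "match", "club"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```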

So, next steps for this work will include: improving the text analysis and geoparsing components; getting access to more recent newspapers – there is missing infrastructure for larger data sets but we are working on this; scaling up the system to process the whole data set and store the text mining output; tools to summarise content; and tools for search – filtering by place, date, linguistic context – tools beyond the command line.

“Visualizing Cultural Collections as a Speculative Process” – Dr. Uta Hinrichs, Lecturer at the School of Computer Science (University of St Andrews)

My research focuses on visualisation and Human Computer Interaction. I am particularly interested in how interfaces can make digital collections visible. I have worked on a couple of projects with Bea Alex and others in the room to visualise texts. I will talk a little bit about LitLong, and the process of developing early visualisations for the project.

So, some background… Edinburgh is a UNESCO City of Literature, with lots of literature about and set in the city. And we wanted to automate the discovery of Edinburgh-based literature from available digitised text. That meant a large number of texts – about 380k – from collections including the BL 19th Century Books collection. And we wanted to make the results accessible to the public.

There were lots of people involved here, from Edinburgh University (PI, James Loxley), Informatics, St Andrews, and EDINA. We worked with out-of-copyright texts, but we also had special permission to work with some in-copyright texts, including Irvine Welsh. And a lot of work was done to geoparse the text – and assess its “Edinburghyness”. For each mention we had the author, the title, the year, and snippets of the text from around the mention. This led to visualisations – I worked on LitLong 1.0 and I’ll talk about this, but a further version (LitLong 2.0) launched last year.

So you can explore clusters of places mentioned in texts, you can explore the clustered words and snippets around the mentions. And you can zoom in to specific texts – again you can see the text snippets in detail. When you explore the snippets, you can see what else is there, to explore other snippets.
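As an illustration of the basic idea – plotting geoparsed snippets on an interactive map you can zoom into – here is a minimal folium sketch (folium is my stand-in, not what LitLong itself is built on, and the snippets and coordinates below are invented):

```python
# Sketch: place geoparsed text snippets on an interactive map. Toy data.
import folium

snippets = [
    {"place": "Grassmarket", "lat": 55.947, "lon": -3.196,
     "text": "…down into the Grassmarket as the lamps were lit…"},
    {"place": "Leith Walk", "lat": 55.966, "lon": -3.177,
     "text": "…the long trudge up Leith Walk in the rain…"},
]

m = folium.Map(location=[55.953, -3.188], zoom_start=13)   # central Edinburgh
for s in snippets:
    folium.Marker([s["lat"], s["lon"]],
                  popup=f"{s['place']}: {s['text']}").add_to(m)
m.save("litlong_sketch.html")                              # open in a browser
```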

So in terms of the design considerations, we wanted a multi-faceted interactive overview of the data – Edinburgh locations; books; extracted snippets; authors; keywords. Maps and lists are familiar, and we wanted this tool to be accessible to scholars but also the public. We took an approach that allowed “generous” explorations (Mitchell Whitelaw, 2015), so there are suggestions of how to explore further, with parts of the data showing… Weighted tag clouds let you get a feel of the data, for instance.
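A weighted tag cloud is straightforward to sketch from keyword frequencies – here with the wordcloud package as a stand-in for LitLong’s own rendering, and invented counts:

```python
# Sketch: render a weighted tag cloud from keyword frequencies. Toy data.
from wordcloud import WordCloud

frequencies = {"castle": 120, "close": 95, "tavern": 60,
               "harbour": 40, "kirk": 35}

cloud = WordCloud(width=600, height=300, background_color="white")
cloud.generate_from_frequencies(frequencies)
cloud.to_file("tagcloud_sketch.png")
```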

As a process it wasn’t like the text mining happened and then we magically had the visualisations… It was iterative. And we actually used visualisation tools to assess which texts were in scope, and which weren’t going to be relevant – and to mark each up to keep or rule it out. This interface included information on where in a text the mention occurred – to help identify how much a text actually was about Edinburgh.

We had a creative visualisation process… We launched the interface in 2015, and there was some iteration, and that also inspired LitLong 2.0 which is a much more public-friendly way to explore the material in different ways.

So, I think it is important to think about visualisation as a speculative process. This allows you to make early computational analysis approaches visible and facilitates QA and the curatorial process, and to promote new interactions transforming a print-based culture into something different – thinking about materiality rather than just content is important as we enable exploration. When I look back at my own work I see some similarities in interfaces… You can see the unique qualities of the collections in the data trends, but we are doing much more work on designing interfaces that surface the unique qualities of the collection in new ways.

Q&A

Q1) What did you learn about Edinburgh or literature in Edinburgh from this project?

A1) The literature scholars would be better able to talk about that but I know it has inspired new writers. Used in teaching. And also discovered some characteristics of Edinburgh, and women writers in the corpus… James Loxley (Edinburgh) and Tara Thompson (Edinburgh Napier University) could say more about how this is being used in new literary research.

“Public Private Digitisation Partnerships at the British Library” – Hugh Brown, British Library Digitisation Project Manager

I work as part of the Digital Scholarship team at the British Library, which was founded in 2010 to support colleagues and researchers to make innovative use of BL digital collections and data – recognising the gap in provision we had there. The team is led by Adam Farquhar, Head of Digital Scholarship, and by Neil Fitzgerald, Head of Digital Research Team. We are cross-disciplinary experts in the areas of digitisation, librarianship, digital history and humanities, computer and data science, and we look at how technology is transforming research and, in turn, our services. And we include British Library Labs, Digital Curators, and the Endangered Archives Programme (EAP).

So, we help get content online and digitised, we support researchers, and we run a training programme to bridge skills so that researchers can begin to engage with digital resources. We expect that in 10-15 years time those will be core research skills so we might not exist – it will just be part of the norm. But we are a long way off that at the moment. We also currently run Hack and Yack events to experiment and discuss. And we also have a Reading Room to share what’s happening in the world, to share best practice.

In terms of our collections and partnerships, we have historically had a slightly piecemeal digitisation approach, so we now have a joined-up strategy that sits under our Living Knowledge strategy and includes partnership, commercial strategy and our own collection strategy. Our partnerships recognise that we don’t always have the skills we need to make content available, whilst our commercial strategy – where I work – allows us to digitise as much as possible, in a context where we don’t have infinite funding for digitisation.

We have various factors in mind when considering potential partnerships. The types of approach include partnerships based on whether materials are in or out of copyright – if in copyright then commercial partners have to clear rights. We do public/private partnerships with technology partners. We have non-commercial organisational and/or consortium funding. And we have philanthropic donor-funded work. Then we think about content – content strategy, asset ownership, digitisation location. We think about value – audience type/interest/geography, and topicality. We think about copyright – whether the British Library owns the rights, rights of reuse. We think about discoverability – the ability to identify and search, and access that maximises exposure. We look at the (BL) benefit – funding, access etc. We look at risk. And we look at the contract – whether it is non-exclusive, commercial/non-commercial.

So, we have had public-private digitisation partnerships with Gale Cengage Learning, Adam Matthew Digital, findmypast, Google Books, Microsoft books, etc. And looking at examples: Google Books has been 80m+ images digitised; Microsoft books was 25m images; findmypast has done 23m+ images of newspapers; Gale Cengage Learning has done 18th century collections – 22m images, 19th century online 2.2m+ images, and Arabic books, etc.

The process begins with liaison with key publishers. Then there is market and content research. Then we plan and agree a plan, including licensing of rights for a fixed term (5-10 years), and royalty arrangements and reading room access. Then digitisation takes place, funded by the partner – either by setting up a satellite studio, or using the BL studio. So our partners digitise content and give us that content; in exchange they get a 5-10 year exclusive agreement to use that content on their platform. And the revenue generated for the BL helps support what we do, and our curators’ work around digitisation.

So Findmypast was an interesting example. We had electoral registers and India Office Records – data with real commercial value. So we put a tender out for a digitisation partner, and Findmypast was selected… Part of that was to do with the challenges of the electoral registers, which were in inconsistent formats etc. so needed a lot of specific work. And we also needed historical country boundaries to be understood to make it work. There was also a lot of manual OCR work to do.

Gale Cengage tend to be education/universities focused and they work with researchers. We worked with them to select 19th century materials to fit their themes and interests. They did the Early Arabic Books project – a really complex project. The Private Case collection consisted mainly of books that had been inaccessible on grounds of obscenity, dating from around 1600 to 1960.

With Adam Matthew Digital we were approached to contribute material from the electoral registers and India Office Records, and materials on the East India Company.

Now these are exciting projects but we want 20-30% of content generated in these projects to be available as a corpus for research and that’s important to our agreements.

Challenges in the workflow include ensuring business partners and scanning vendors have a good understanding of the material the BL holds in our collections. We have to define and provide the metadata requirements the BL needs to supply to the partners. We need to get statistics and project plans from business partners. There are logistical challenges around understanding the impact of digitisation on the BL departments supporting the process. We have to manage partners’ business drivers versus BL curatorial drivers. We have to manage the partners’ digitisation vendors on site. And we have to ensure the final digital assets/metadata received meet BL requirements for sign-off and ingest.

Q&A

Q1) How can we actually access this stuff for research?

A1) For pure research that can be done. For example we have a company in Brighton who are doing research on the electoral roll. That’s not in competition with what the private partner is doing.

Comment from Melissa) My experience is “don’t ask, don’t get” – so if you see something you want to use in your research, do ask!

“The Future of BL Labs and Digital Research at the Library” – Ben O’Steen

I’ve handed out some personas for users of our digital collections – and a blank sheet on the back. We are trying to build up a picture of the needs of our users, their skills and interests, and that helps us illustrate what we do – that’s a thing to come back to (see: https://goo.gl/M41Pc4/)

So I want to talk about the future of BL Labs. We are a project and our funding is due to finish. Our role has been to engage with researchers and that is going to continue – maybe with that same brand, just not as a project. We need to learn what researchers want to do… We need to collect evidence of demand. And we are developing a business model and support process to make this “business as usual” at the BL. We want to help create a pathway to developing a “Digital Research Suite” at the BL by 2019. But we want to think about what that might be, and we are piloting ideas including small 2-person workrooms for digital projects. And we can control access – so that we can see how this works, and ensure that the users understand what you can and cannot do with the data (that you can’t just download everything and walk out with it).

And many other places are being “inspired” by our model – take a look at the Library of Congress work in particular.

So, at this stage we are looking at our business model and how we can make these scalable services. Our model to date has been smaller scale, about capabilities to get started, etc. That is not scalable at the level we’ve been working. We need a more hands-off process and to be able to see more people. We also run the BL Labs Awards which, instead of working with people, recognise work people have already done. People submit and then in October our advisory board reviews the entries and looks for work that champions our content.

To develop our business model we are exploring, evaluating and implementing it, using the business model canvas. We have internal and external business model development, implementation and evaluation groups, and we are exploring how this could work in practice. And we are testing, piloting and implementing the model. That means:

  • developing a support service
    • Entry level – about the collection, documentation improvements, case studies that help show what is in there.
    • Baseline – basic enquiry service to enable researchers to understand if a BL project is the right path, any legal restrictions that need addressing, etc. We try to get you to the next stage of developing your idea.
    • Intermediate – Consultation service, which will be written in as part of a bid.
    • Advanced – support 10 projects per year through an application process
  • Augment data.bl.uk – that was a placeholder for a year, and now a tender has just gone out for a repository type service for 12-18 months
    • e.g. sample datasets, tools, examples of use
    • Pilot use of Jupyter Notebooks / Docker / other tools for open and onsite data
  • Researcher access to BL APIs
  • Reading room services – onsite access/compute for digital collections – which means us training staff

This has come about as we’ve seen a pattern in approaches that start with an initial exploration phase, then transition into investigation and then some sort of completion phase. There had been a false assumption (on the data providers’ part) that data-based work must start at the investigation phase – that people have an idea of the project they want to do, know the data already, know the collections. What we are piloting is that essential exploratory stage, acknowledging that it happens. And that pattern shifts around – exploration and investigation stages can fork off in different directions, and that’s fine.

So, in terms of timescales: there seems to be a phase of quick initial work; a longer and variable transition into investigation – probably months; investigation itself takes months to a year; and, crucially, that completion stage.

Exploration is about understanding the data in an open-ended fashion. It is about discovering the potential tools to work with the data. We want people to gain awareness of their capabilities and limitations – a reality check and an opportunity to understand the need for partners and/or new tools. And it’s about developing a firmer query, as that helps you to understand the cost, risk and time you might need. Exploration (e.g. the V&A Spelunker) lets you get a sense of what’s there, which gives you a different way in than keyword or catalogue search. And then you have artists like Mario Klingemann – collating images of people looking sad… It’s artistic but it says something about how women are portrayed in the 19th century. He’s also done work on hats on the ground – and found it’s always a fight! This is showing cultural memes – an important question… An older example is the Cooper Hewitt collection – which lets you see all of the tags – including various types of similarity that show new ways into the data.

So, what should a digital exploration service look like? Which apps? Does Jupyter Notebook assume too much?

We’ve found that every time we present the data, it shapes the perception. For instance the On the Road manuscript is on a roll. If you print a book on a receipt roll it’s different – it reads, and is understood, differently.

MIT have a Moral Machine survey (http://moralmachine.mit.edu/) which is the classic trolley problem – crowdsourced for autonomous vehicles. But that presentation shapes and limits the questions, and that is biased. Some of the best questions we’ve seen have been from people who have asked very broad questions and haven’t engaged in exploration in other ways. They are hard to answer (e.g. all depictions of women) but they reveal more. Presenting results as a searchable list shapes how we interpret them… But, for instance, showing newspaper articles as if in a giant newspaper – not a list of results – changes what you do. And that’s why tools like IIIF seem useful.
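For anyone unfamiliar with IIIF, the Image API is essentially a URL pattern for requesting regions, sizes and rotations of an image; a small sketch (with a placeholder server and identifier, not a real BL endpoint) shows the shape of a request:

```python
# Sketch: build a IIIF Image API URL. Base URL and identifier are placeholders.
def iiif_url(base, identifier, region="full", size="max",
             rotation=0, quality="default", fmt="jpg"):
    # IIIF Image API pattern: {id}/{region}/{size}/{rotation}/{quality}.{format}
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

print(iiif_url("https://example.org/iiif", "page-0042"))
# -> https://example.org/iiif/page-0042/full/max/0/default.jpg
```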

So… We have things like Gender API. It looks good, it looks professional… If you try it with a western name, does it work? If you try it with an Indian name, does it work? If you try it with a 19th century name, does it work? Know that marketeers will use this. See also sentiment analysis. Some of these tools are based on Twitter. I found a researcher working on 18th century texts, looking for sentiment about war and conflict… through a tool developed and trained on tweets. We have to be transparent about what is happening, and help people understand what they are doing… Hence thinking about personas.

We are trying to think about how we show what is missing from a collection, rather than just what is present, so that data can be used in a more informed way. We are looking at what research environments we can provide – we know that people want to use their own, but we can sometimes be a bit stuffed by licensing based in a paper era. On-site tools can help. Should we enable research environments for open data that can be used off site too? We are thinking about focus – are the query, tooling and collections required well defined; is it feasible – legal, cost, ethical, source data quality, etc.; is it affordable – time, people, money; etc.

So, we have, on the BL Labs website, a form – it’s long so do send us feedback on whether that is the right format etc. – to help us understand demand and skills.

Those personas – please fill these in – and let us know the technical part, what you might want, how technical the support you need. We are keen to discuss your needs, challenges and issues.

And with that we are done and moving onto lunch and discussion. Thanks to Ben, Hugh, Bea and Uta as well as Melissa and the Digital Scholarship Team!