Jun 14 2017

Following on from Day One of IIPC/RESAW I’m at the British Library for a connected Web Archiving Week 2017 event: Digital Conversations @BL, Web Archives: truth, lies and politics in the 21st century. This is a panel session chaired by Elaine Glaser (EG) with Jane Winters (JW), Valerie Schafer (VS), Jefferson Bailey (JB) and Andrew Jackson (AJ). 

As usual, this is a liveblog so corrections, additions, etc. are welcomed. 

EG: Really excited to be chairing this session. I’ll let everyone speak for a few minutes, then ask some questions, then open it out…

JB: I thought I’d talk a bit about our archiving strategy at Internet Archive. We don’t archive the whole of the internet, but we aim to collect a lot of it. The approach is multi-pronged: to take entire web domains in a shallow but broad strategy; to work with other libraries and archives to focus on particular subjects or areas or collections; and then to work with researchers who are mining or scraping the web, but not necessarily having preservation strategies. So, when we talk about political archiving or web archiving, it’s about getting as much as possible, with different volumes and frequencies. We know we can’t collect everything, so we capture important things frequently, less important things less frequently. And we work with national governments, with national libraries…

The other thing I wanted to raise is T.R. Schellenberg, who was an important archivist at the National Archives in the US. He had an idea about archival strategies: that there is a primary documentation strategy and a secondary strategy. The primary is for a government and its agencies to undertake for their own use, the secondary is for future use in unknown ways… And including documentary and evidential material (the latter being how and why things are done). Those evidential elements become much more meaningful on the web, and that has emerged and become more meaningful in the context of our current political environment.

AJ: My role is to build a Web Archive for the United Kingdom. So I want to ask a question that comes out of this… “Can a web archive lie?”. Even putting to one side that it isn’t possible to archive the whole web… There is confusion because we can’t get every version of everything we capture… Then there are biases from our work. We choose all UK sites, but some are captured more than others… And our team isn’t as diverse as it could be. And what we collect is also constrained by technological capability. And we are limited by time issues… We don’t normally know when material is created… The crawler often finds things only when they become popular… So an academic paper is picked up after the BBC News item about it – they are out of order. We would like to use more structured data, such as Twitter, which has clear publication dates…

But can the archive lie? Well, with digital material it is much easier than with print to make an untraceable change. As digital becomes increasingly predominant we need to be aware that our archive could be hacked… So we have to protect against that, with evidence that we haven’t been hacked… And we have to build systems that are secure and can maintain that trust. Libraries will have to take care of each other.

JW: The Oxford Dictionaries word of the year in 2016 was “post-truth”, whilst the Australian dictionary went for “fake news”. Fake news for them is either disinformation on websites for political purposes, or for commercial benefit. Merriam-Webster went for “surreal” – their most searched-for word. It feels like we live in very strange times… There aren’t calls for resignation where there once were… Hasn’t it always been thus though…? For all the good citizens who point out the errors of a fake image circulated on Twitter, for many the truth never catches the lie. Fakes, lies and forgeries have helped change human history…

But modern fake news is different to that which existed before. Firstly there is the speed of fake news… Mainstream media can only counteract or address this after the fact. Some newspapers and websites do publish corrections, but that isn’t the norm. Once, publishing took time and means. Social media has made it much easier to self-publish. One can create, but one can also check accuracy and integrity – reverse image searching to see whether a photo has been photoshopped, or actually shows an earlier, unrelated event…

And we have politicians making claims that they believe can be deleted and disappear from our memory… We have web archives – on both sides of the Atlantic. The European Referendum NHS pledge claim is archived and lasts long beyond the bus – which was bought by Greenpeace and repainted. The archives have also been capturing political parties’ websites throughout our endless election cycle… The DUP website crashed after the announcement of the election results because of demand… But the archive copy was available throughout. There was also a rumour that a hacker was creating an Irish language version of the DUP website… But that wasn’t a new story, it was from 2011… And again the archive shows that, and archives of news websites show that too.

Social Networks’ Responses to Terrorist Attacks in France – Valerie Schafer.

Before 9/11 we had some digital archives of terrorist materials on the web. But this event challenged archivists and researchers. The Charlie Hebdo, Paris Bataclan and Nice attacks are archived… People can search at the BNF to explore these archives, to provide users a way to see what has been said. And at the INA you can also explore the archive, including Twitter archives. You can search, see keywords, explore timelines crossing key hashtags… And you can search for images… including the emojis used in discussion of Charlie Hebdo and the Bataclan.

We also have Archive-It collections for Charlie Hebdo. This raises some questions of what should and should not be collected… We do not normally collect newspapers and audiovisual sites, but decided to in this case as we faced a special event. But we still face challenges – it is easier to collect data from Twitter than from Facebook. And it is free to collect Twitter data in real time, but the archived/older data is charged for, so you have to capture it in the moment. And there are limits on API collection… INA captured more than 12 million tweets for Charlie Hebdo, for instance; it is very complete but not exhaustive.
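That “capture it in the moment” constraint has a practical consequence: a live collector has to filter and deduplicate as tweets arrive, because older data cannot be re-fetched for free. A minimal sketch of that step follows; the tweet dicts and field names here are hypothetical stand-ins, not INA’s actual pipeline or the real Twitter API response format.

```python
# Sketch: keep only tweets matching target hashtags, deduplicated by id.
# The tweet structure (dicts with "id" and "hashtags") is hypothetical.

def collect(stream, targets, seen=None):
    """Yield tweets whose hashtags intersect `targets`, skipping duplicates."""
    seen = set() if seen is None else seen
    targets = {t.lower().lstrip("#") for t in targets}
    for tweet in stream:
        tags = {h.lower().lstrip("#") for h in tweet.get("hashtags", [])}
        if tweet["id"] not in seen and tags & targets:
            seen.add(tweet["id"])
            yield tweet

stream = [
    {"id": 1, "hashtags": ["JeSuisCharlie"]},
    {"id": 2, "hashtags": ["weather"]},
    {"id": 1, "hashtags": ["JeSuisCharlie"]},  # duplicate delivery
]
kept = list(collect(stream, {"#jesuischarlie"}))  # keeps only tweet id 1
```

In a real deployment the `seen` set would have to be persisted between runs, since streamed data that is missed cannot be cheaply recovered later.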

We continue to collect for #jesuischarlie and #bataclan… These are continually used and added to, in similar or related attacks, etc. There is a time for exploring and reflecting on this data, and space for critics too…

But we also see that content gets deleted… It is hard to find fake news on social media, unless you are looking for it… Looking for #fakenews just won’t cut it… So, we had a study on fake news… And we recommend that authorities are cautious about material they share. But also there is a need for cross checking – the kinds of projects with Facebook and Twitter. Web archives are full of fake news, but also full of others’ attempts to correct and check fake news as well…

EG: I wanted to go back in time to the idea of the term “fake news”… In order to understand what “fake news” actually is, we have to understand how it differs from previous lies and mistruths… I’m from outside the web world… We are often looking at tactics to fight fire with fire, to use an unfortunate metaphor… How new is it? And who is to blame and why?

JW: Talking about it as a web problem, or a social media issue, isn’t right. It’s about humans making decisions to critique that content or not. But it is also about algorithmic sharing and the visibility of that information.

JB: I agree. What is new is the way media is produced, disseminated and consumed – those have technological underpinnings. And they have been disruptive of publication and interpretation in a web world.

EG: Shouldn’t we be talking about a culture, not just technology… It’s not just the “vessel”… Doesn’t the dissemination have more of a role than perhaps we are suggesting…

AJ: When you build a social network or any digital space you build in different affordances… So Facebook and Twitter are different. And you can create automated accounts, with Twitter especially offering affordances for bots etc. which allow you to give the impression of a movement. There are ways to change those affordances, but there will also always be fake news and issues…

EG: There are degrees of agency in fake news.. from bots to deliberate posts…

JW: I think there is also the aspect of performing your popularity – creating content for likes and shares, regardless of whether what you share is true or not.

VS: I know terrorism is different… But for any tweet sharing fake news you get four retweets denying it… You have more tweets denying fake news than sharing it…

AJ: One wonders about the filter bubble impact here… Facebook encourages inward-looking discussion… Social media has helped like-minded people find each other, and perhaps they can be clipped off more easily from the wider discussion…

VS: I think also what is interesting is the game between social media and traditional media… You have questions and relationships there…

EG: All the internet can do is reflect the crooked timber of reality… We know that people have confirmation bias: we are quite tolerant of untruths, and less tolerant of information that contradicts our perceptions, even if it is true. You have people and the net being equally tolerant of lies and mistruths… But isn’t there another factor here… The people demonised as gatekeepers… The structures of authority that were put in place – journalism and academia… Their resources are reduced now… So what role do you see for those traditional gatekeepers…

VS: These gatekeepers are no longer the traditional gatekeepers they once were… They work in 24-hour news cycles and have to work to that. In France they are trying to rethink that role; there were a lot of questions about this… Whether that’s about how you react to changing events, and what happens during elections… People are thinking about that…

JB: There is an authority and responsibility for media still, but has the web changed that? Looking back it’s surprising now how few organisations controlled most of the media… But is that so different now?

EG: I still think you are being too easy on the internet… We’ve had investigative journalism by Carole Cadwalladr and others on Cambridge Analytica and others who deliberately manipulate reality… You talked about witness testimony in relation to terrorism… Isn’t there an immediacy and authenticity challenge there… Donald Trump’s tweets… They are transparent but not accountable… Haven’t we created a problem that we are now trying to fix?

AJ: Yes. But there are two things going on… It seems that people care less about lying… People see Trump lying, and they don’t care, and media organisations don’t care as long as advertising money comes in… There is a parallel for that in social media – the flow of content and ads takes priority over truth. There is an economic driver common to both mediums that is warping things…

JW: There is an unpopularity aspect too… a (nameless) newspaper here shares content to generate “I can’t believe this!” reactions, and with them sharing and advertising income… But on a positive note, there is scope and appetite for strong investigative journalism… and that is facilitated by the web and digital methods…

VS: Citizens do use different media and cross media… Colleagues are working on how TV is used… And different channels, to compare… Mainstream and social media are strongly crossed together…

EG: I did want to talk about the temporal element… Twitter exists in the moment, making it easy to hold people accountable… Do you see Twitter doing what newspapers did?

AJ: Yes… A substrate…

JB: It’s amazing how much of the web is archived… With “Save Page Now” we see all kinds of things archived – including pages that exposed the Russian downing of a Ukrainian plane… Citizen action, spotting the need to capture data whilst it is still there, happens all the time…
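The “Save Page Now” workflow JB mentions is simple enough to script: the Internet Archive accepts a capture request at `https://web.archive.org/save/` followed by the target URL. A minimal sketch follows; the example URL is hypothetical, and real use is subject to the service’s rate limits and terms.

```python
# Build a "Save Page Now" request URL for the Internet Archive.
# This sketch only constructs the URL; actually sending the request
# needs network access.
from urllib.parse import quote

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_now_url(url: str) -> str:
    """Return the Save Page Now capture URL for `url`."""
    # Keep URL structure characters intact, escape anything else.
    return SAVE_ENDPOINT + quote(url, safe=":/?&=")

request_url = save_page_now_url("http://example.org/evidence.html")
# To trigger the capture for real (requires network):
# import urllib.request
# urllib.request.urlopen(request_url)
```

This is what makes the citizen action JB describes possible: anyone who spots at-risk content can request a capture with a single HTTP request.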

EG: I am still sceptical about citizen journalism… It’s a small group of people from narrow demographics, and it’s time consuming… Perhaps there is still a need for journalist roles… We did talk about filter bubbles… We hear about newspapers and media as biased… But isn’t the issue that communities of misinformation are not penetrated by the other side, nor by the truth…

JW: I think bias in newspapers is quite interesting and different to unacknowledged bias… Most papers are explicit in their perspective… So you know what you will get…

AJ: I think so, but bias can be quite subtle… Different perspectives on a common issue allows comparison… But other stories only appear in one type of paper… That selection case is harder to compare…

EG: This really is a key point… There is a difference between facts and truth, and explicitly framed interpretation or commentary… Those things are different… That’s where I wonder about web archives… When I look at Wikipedia… It’s almost better to go to a source with an explicit bias where I can see a take on something, unlike Wikipedia which tries to focus on fact. Talking about politicians lying misses the point… It should be about a specific rhetorical position… That definition of truth comes up when we think of the role of the archive… How do you deal with that slightly differing definition of what truth is…

JB: I talked about different complementary collecting strategies… The archivist as such has some political power in deciding what goes in the historical record… The volume of the web does undercut that power in a way that I think is good – archives have historically been about the rich and the powerful… So making archives non-exclusive somewhat addresses that… But there will be fake news in the archive…

JW: But that’s great! Archives aren’t about collecting truth. Things will be in there that are not true, partially true, or factual… It’s for researchers to sort that out later…

VS: Your comment on Wikipedia… They do try to be factual, neutral… But not truth… And to have a good balance of power… For us as researchers we can be surprised by the neutral point of view… Fortunately the web archive does capture a mixture of opinions…

EG: Yeah, so that captures what people believed at a point of time – true or not… So I would like to talk about the archive itself… Do you see your role as being successors to journalists… Or as being able to harvest the world’s record in a different way…

JB: I am an archivist with that training and background, as are a lot of people working on web archives and in related spaces. Certainly historic preservation drives a lot of the collecting… But there are also engineering and technological aspects. So it’s people interested in archiving and preservation, but also technology… And software engineers interested in web archiving.

AJ: I’m a physicist but I’m now running web archives. And for us it’s an extension of the legal deposit role… Anything made public on the web should go into the legal deposit… That’s the theory, in practice there are questions of scope, and where we expend quality assurance energy. That’s the source of possible collection bias. And I want tools to support archivists… And also to prompt for challenging bias – if we can recognise that taking place.

JW: There are also questions of what you foreground in Special Collections. There are decisions being made about collections that will be archived and catalogued more deeply…

VS: At the BNF my colleagues work in an area with a tradition, with legal deposit responsibility… There are politics of heritage and what it should be. I think that is the case for many places where that activity sits with other archivists and librarians.

EG: You do have this huge responsibility to curate the record of human history… How do you match the top-down requirements with the bottom-up nature of the web as we now talk about it?

JW: One way is to have others come in to your department to curate particular collections…

JB: We do have special collections – people can choose their own, public suggestions, feeds from researchers, all sorts of projects to get the tools in place for building web archives for their own communities… I think for the sake of longevity and use going forward, the curated collections will probably have more value… Even if they seem more narrow now.

VS: Also interesting that archives did not select bottom-up curation. In Switzerland they went top down – there are a variety of approaches across Europe.

JW: We heard about the 1916 Easter Rising archive earlier, which was through public nominations… Which is really interesting…

AJ: And social media can help us – by seeing links and hashtags. When we looked at this 4-5 years ago everyone linked to the BBC, but now we have more fake news sites etc…

VS: We do have this question of what should be archived… We see capture of the vernacular web – kitten or unicorn gifs etc… !

EG: I have a dystopian scenario in my head… Could you see a time years from now when newspapers are dead, public broadcasters are more or less dead… And we have flotsam and jetsam… We have all this data out there… And all kinds of actors who use all this social media data… Can you reassure me?

AJ: No…

JW: I think academics are always ready to pick holes in things, I hope that that continues…

JB: I think more interesting is the idea that there may not be a web… Apps, walled gardens… Facebook is pretty hard to web archive – they make it intentionally more challenging than it should be. There are lots of communication tools that disappeared… So I worry more about loss of a web that allows the positive affordances of participation and engagement…

EG: There is the issue of privatising and sequestering the web… I am becoming increasingly aware of the importance of organisations like the BL and the Internet Archive… Those roles used to be taken on by publicly appointed organisations and bodies… How are they impacted by commercial privatisation… And how are those roles changing… How do you envisage that public sphere of collecting…

JW: For me more money for organisations like the British Library is important. Trust is crucial, and I trust that they will continue to do that in a trustworthy way. Commercial entities cannot be trusted to protect our cultural heritage…

AJ: A lot of people know what we do with physical material, but are surprised by our digital work. We have to advocate for ourselves. We are also constrained by the legal framework we operate within, and we have to challenge that over time…

JB: It’s super exciting to see libraries and archives recognised for their responsibility and trustworthiness… But that also puts them at higher risk from those they hold accountable; being recognised as bastions of accountability makes them more vulnerable.

VS: Recently we had 20th birthday of the Internet Archive, and 10 years of the French internet archiving… This is all so fast moving… People are more and more aware of web archiving… We will see new developments, ways to make things open… How to find and search and explore the archive more easily…

EG: The question then is how we access this data… The new masters of the universe will be those emerging gatekeepers who can explore the data… What is the role between them and the public’s ability to access data…

VS: It is not easy to explain everything around web archives but people will demand access…

JW: There are different levels of access… Most people will be able to access what they want. But there is also a great deal of expertise in organisations – it isn’t just commercial data work. And working with the Alan Turing Institute and cutting edge research helps here…

EG: One of the founders of the internet, Vint Cerf, says that “if you want to keep your treasured family pictures, print them out”. Are we overly optimistic about the permanence of the record?

AJ: We believe we have the skills and capabilities to maintain most if not all of it over time… There is an aspect of benign neglect… But if you are active about your digital archive you could have a copy on every continent… Digital allows you to protect content from different types of risk… I’m confident that the library can do this as part of its mission.


Q1) Coming back to fake news and journalists… There is a changing role between the web as a communications media, and web archiving… Web archives are about documenting this stuff for journalists for research as a source, they don’t build the discussion… They are not the journalism itself.

Q2) I wanted to come back to the idea of the Filter Bubble, in the sense that it mediates the experience of the web now… It is important to capture that in some way, but how do we archive that… And changes from one year to the next?

Q3) It’s kind of ironic to have nostalgia about journalism and traditional media as gatekeepers, in a country where Rupert Murdoch is traditionally that gatekeeper. Global funding for web archiving is tens of millions; the budget for the web is tens of billions… The challenges are getting harder – right now you can use robots.txt but we have DRM coming and that will make it illegal to archive the web – and the budgets have to increase to match that to keep archives doing their job.

AJ: To respond to Q3… Under the legislation it will not be illegal for us to archive that data… But it will make it more expensive and difficult to do, especially at scale. So your point stands, even with that. In terms of the filter bubble, personalised feeds are out of our scope, but we know they are important… It would be good to partner with an organisation where the modern experience of media is explicitly part of its role.
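On the robots.txt point in Q3: whether a crawler may fetch a given URL under a site’s robots.txt rules can be checked with Python’s standard `urllib.robotparser`. A small sketch with a hypothetical rule set and crawler name:

```python
# Check crawl permission against robots.txt rules using the standard library.
# The rules and the user-agent name here are hypothetical examples; a real
# crawler would fetch the live /robots.txt from the target site.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

allowed = rp.can_fetch("ExampleArchiveBot", "https://example.org/news/story.html")
blocked = rp.can_fetch("ExampleArchiveBot", "https://example.org/private/page.html")
# allowed is True, blocked is False
```

Robots.txt is a voluntary convention rather than an enforcement mechanism, which is part of why the questioner contrasts it with DRM.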

JW: I think that idea of the data not being the only thing that matters is important. Ethnography is important for understanding the context around all that other stuff… To help you with supplementary research. On the expense side, it is increasingly important to demonstrate the value of that archiving… We need to think in terms of financial return to digital and creative economies, which is why researchers have to engage with this.

VS: Regarding the first two questions… Archives reflect reality, so there will be lies there… Of course web archives must be crossed and compared with other archives… And contextualisation matters, the digital environment in which the web was living… Contextualisation of web environment is important… And with terrorist archive we tried to document the process of how we selected content, and archive that too for future researchers to have in mind and understand what is there and why…

JB: I was interested in the first question, this idea of what happens in preserving the conversation… That timeline was sometimes decades before but is now weeks or days or less… In terms of experience, websites are now personalised, and our ability to capture that broadly is impossible… So we need to capture that experience, and the emergent personalisation… The web wasn’t public before, as ARPAnet, then it became public, but it seems to be ebbing a bit…

JW: With a longer term view… I wonder if the open stuff which is easier to archive may survive beyond the gated stuff that traditionally was more likely to survive.

Q4) Today we are 24 years into advertising on the web. We take ad-driven models as a given, and we see fake news as a consequence of that… So, my question is, Minitel was a large system that ran on a different model… Are there different ways to change the revenue model to change fake or true news and how it is shared…

Q5) Theresa May has been outspoken on fake news and wants a crackdown… The way I interpret that is censorship and banning of sites she does not like… Jefferson said that he’s been archiving sites that she won’t like… What will you do if she asks you to delete parts of your archive…

JB: In the US?!

Q6) Do you think we have sufficient web literacy amongst policy makers, researchers and citizens?

JW: On that last question… Absolutely not. I do feel sorry for politicians who have to appear on the news to answer questions but… Some of the responses and comments, especially on encryption and cybersecurity have been shocking. It should matter, but it doesn’t seem to matter enough yet… 

JB: We have a tactic of “geopolitical redundancy” to ensure our collections are shielded from political endangerment by making copies – which is easy to do – and locate them in different political and geographical contexts. 

AJ: We can suppress access to content, but not delete it. We don’t do that…

EG: Is there a further risk of data manipulation… Of Trump and Farage and data… a covert threat… 

AJ: We do have to understand and learn how to cope with potential attack… Any one domain is a single point of failure… so we need to share metadata, content where possible… But web archives are fortunate to have the strong social framework to build that on… 

Q7) Going back to that idea of what kinds of responsibilities we have to enable a broader range of people to engage in a rich way with the digital archive… 

Q8) I was thinking about questions in context, and trust in content in the archive… And realising that web archives are fairly young… Generally researchers are close to the resource they are studying… Can we imagine projects in 50-100 years time where we are more separate from what we should be trusting in the archive… 

Q9) My perspective comes from building a web archive for European institutions… And can the archive lie… Do we need a legal notice on the archive, disclaimers, our method… How do we ensure people do not misinterpret what we do? How do we make the process of archiving more transparent?

JB: That question of who has the resources to access web archives is important. It is a responsibility of institutions like ours… To ensure even small collections can be accessed, that researchers and citizens are empowered with skills to query the archive, and things like APIs to enable that too… The other question was on evidencing curatorial decisions – we are notoriously poor at that historically… But there is a lot of technological mystery there that we should demystify for users… All sorts of complexity there… The web archiving community needs to work on that provenance information over the next few years…

AJ: We do try to record this but as Jefferson said much of this is computational and algorithmic… So we maybe need to describe that better for wider audiences… That’s a bigger issue anyway, that understanding of algorithmic process. At the British Library we are fortunate to have capacity for text mining our own archives… We will be doing more than that… It will be small at first… But as it’s hard to bring data to the queries, we must bring queries to the archive. 

JW: I think it is so hard to think ahead to the long term… You’ll never pre-empt all usage… You just have to do the best that you can. 

VS: You won’t collect everything, every time… The web archive is not an exact mirror… It is “reborn digital heritage”… We have to document everything, but we can try to give some digital literacy to students so they have a way to access the web archive and engage with it… 

EG: Time is up, Thank you our panellists for this fantastic session. 

Feb 26 2016

Today I am at the British Library (BL) Labs Roadshow 2016 event in Edinburgh. I’m liveblogging so, as usual, all comments, corrections and additions are very much welcomed.

Introduction – Dr Beatrice Alex, Research Fellow at the School of Informatics, University of Edinburgh

I am delighted to welcome the team from the British Library Labs today; this is one of their roadshows. And today we have a liveblogger (that’s me) and we are encouraging you to tweet to the hashtag #bldigital.

Doing digital research at the British Library – Nora McGregor, Digital Curator at the British Library

Nora is starting with a brief video on the British Library – to a wonderful soundtrack made from the collections by DJ Yoda. If you read 5 items a day it would take you 80,000 years to get through the collections. Among the oldest things we have in the collection are oracle bones – 3,000 years old. Some of the newest items are in the UK Web Archive – contemporaneous websites.
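As a quick sanity check on that figure, 80,000 years at 5 items a day implies a collection in the hundreds of millions of items, which is consistent with the scale usually quoted for the BL:

```python
# Items implied by reading 5 items a day for 80,000 years
# (ignoring leap years).
items_implied = 5 * 365 * 80_000
print(items_implied)  # 146000000, i.e. ~146 million items
```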

Today we are here to talk about the Digital Research Team. We support the curation and use of the BL’s digital collections. And Ben and Mahendra, talking today, are part of our Carnegie-funded digital research labs.

We help researchers by working with those operating at the intersection of academic research, cultural heritage and technology to support new ways of exploring and accessing the BL collections. This is through getting content into digital forms and supporting skills development, including the skills of BL staff.

In terms of getting digital content online we curate collections to be digitised and catalogued. Within digitisation projects we now have a digital curation role dedicated to that project, who can support scholars to get the most out of these projects. For instance we have a Hebrew Manuscripts digitisation project – with over 3000 manuscripts spanning 1000 years digitised. That collection includes rare scrolls and our curator for this project, Adi, has also done things like creating 3D models of artefacts like those scrolls. So these curators really ensure scholars get the most from digitised materials.

You can find this and all of our digitisation projects on our website: http://bl.uk/subjects/digital-scholarship where you can find out about all of our curators and get in touch with them.

We are also supporting different departments to get paper-based catalogues into digital form. So we had a project called Collect e-Card. You won’t find this on our website, but our cards, which include some in, for instance, Chinese scripts or Urdu, are being crowdsourced so that we can make materials more accessible. Do take a look: http://libcrowds.com/project/urducardcatalogue_d1.

One of the things we initially set up for our staff as a two-year programme was a Digital Research Support and Guidance programme. That kicked off in 2012 and we’ve created 19 bespoke one-day courses for staff covering the basics of Digital Scholarship, which are delivered on a rolling basis. So far we have delivered 88 courses to nearly 400 staff members. Those courses mean that staff understand the implications of requests for images at specific qualities, understand text mining requests and questions, etc.

These courses are intended to build capacity. The materials from these courses are also available online for scholars. And we are also here to help if you want to email a question we will be happy to point you in the right direction.

So, in terms of the value of these courses… A curator came to a course on cleaning up data and went on to get a grant of over £70k for Big Data History of Music – a project with Royal Holloway to undertake analysis as a proof of concept around patterns in the history of music – trends in printing, for instance.

We also have events, competitions and awards. One of these is “Off the Map”, a very cool endeavour, now in its fourth year. I’m going to show you a video on The Wondering Lands of Alice, our most recent winner. We digitise materials for this competition, teams compete to build video games, and this one is actually in our current Alice in Wonderland exhibition. This uses digitised content from our collection and you can see the calibre of these is very high.

There is a new competition open now. The new one is for any kind of digital media based on our digital collections. So do take a look at this.

So, if you want to get in touch with us you can find us at http://bl.uk/digital or tweet #bldigital.

British Library Labs – Mahendra Mahey, Project Manager of British Library Labs.

You can find my slides online (link to follow).

I manage a project called British Library Labs, based in the Digital Research team, who we work closely with. What we are trying to do is to get researchers, artists, entrepreneurs, educators, and anyone really to experiment with our digital collections. We are especially interested in people finding new things from our collections, especially things that would be very difficult to do with our physical collections.

What I thought I’d do, and the space the project occupies, is to show you some work from a researcher called Adam Crymble, Kings College London (a video called Big Data + Old History). Adam entered a competition to explain his research in visual/comic book format (we are now watching the video which talks about using digital texts for distant reading and computational approaches to selecting relevant material; to quantify the importance of key factors).

Other examples of the kinds of methods we hope researchers will use with our data span text mining and georeferencing, as well as creative reuses.

Just to give you a sense of our scale… The British Library says we are the world’s largest library by number of items. 180 million (or so) items, with only about 1-2% digitised. Now new acquisitions do increasingly come in digital form, including the UK Web Archive, but it is still a small proportion of the whole.

What we are hoping to do with our digital scholarship site is to launch data.bl.uk (soon) where you can directly access data. But as I did last year I have also brought a network drive so you can access some of our data today. We have some challenges around sharing data, we sometimes literally have to shift hard drives… But soon there will be a platform for downloading some of this.

So, imagine 20 years from now… I saw a presentation on technology and how we use “digital”… Well, we won’t use “digital” in front of scholarship or humanities, it will just be part of the mainstream methodologies.

But back to the present… The reason I am here is to engage people like you, to encourage you to use our stuff, our content. One way to do this is through our BL Labs Competition, the deadline for which is 11th April 2016. And, to get you thinking, the best idea pitched to me during the coffee break gets a goodie bag – you have 30 seconds in that break!

Once ideas are (formally) submitted to the BL there will be 2 finalists announced in late May 2016. They then get a residency with some financial (up to £3600), technical and curatorial support from June to October 2016. And a winner is then announced later in the year.

We also have the BL Labs Awards. This is for work already done with our content in interesting and innovative ways. You can submit projects – previous and new – by 5th September 2016. We have four categories: Artistic; Commercial; Research; and Learning/Teaching. Those categories reflect the increasingly diverse range of those engaging with our content. Winners are announced at a symposium on 7th November 2016 when prizes are given out!

So today is all about projects and ideas. Today is really the start of the conversation. What we have learned so far is that the kinds of ideas that people have will change quite radically once you try and access, examine and use the data. You can really tell the difference between someone who has tried to use the data and someone who has not when you look at their ideas/competition entries. So, do look at our data, do talk to us about your ideas. Aside from those competitions and awards we also collaborate in projects so we want to listen to you, to work with you on ideas, to help you with your work (capacity permitting – we are a small team).

Why are we doing this? We want to understand who wants to use our material, and more importantly why. We will try and give some examples to inspire you, to give you an idea of what we are doing. You will see some information on your seat (sorry blog followers, I only have the paper copy to hand) with more examples. We really want to learn how to support digital experiments better, what we can do, how we can enable your work. I would say the number one lesson we have learned – not new but important – is that it’s ok to make mistakes and to learn from these (cue a Jimmy Wales Fail Faster video).

So, I’m going to talk about the competition. One of our two finalists last year was Adam Crymble – the same one whose PhD project was highlighted earlier – and he’s now a lecturer in Digital History. He wanted to crowdsource tagging of historical images through Crowdsource Arcade – harnessing the appeal of 80s video games to improve the metadata and usefulness of historical images. So we needed to find an arcade machine, and then set up games on it – like Tag Attack – created by collaborators across the world. Tag Attack used a fox character trotting out images which you had to tag to one of four categories before he left the screen.

I also want to talk about our Awards last year. Our Artistic award winner was Mario Klingemann – Quasimondo. He found images of 44 men who look 44 among our Flickr images – a bit of code he wrote for his birthday! He also found “Tragic Looking Women”, etc. All of these were done computationally.

In Commercial, our entrant used images to cross-stitch ties that she sold on Etsy.

The winner last year, from the Research category, was Spatial Humanities at Lancaster, looking for disease patterns and mapping those.

And a Special Jury prize went to James Heald, who did tremendous work with Flickr images from the BL, making them more available on Wikimedia, particularly map data.

Finally, loads of other projects I could show… One of my favourites is a former Pixar animator who developed some software to animate some of our images (The British Library Art Project).

So, some lessons we have learned is that there is huge appetite to use BL digital content and data (see Flickr Commons stats later). And we are a route to finding that content – someone called us a “human API for the BL content”!

We want to make sure you get the most from our collections, we want to help your projects… So get in touch.

And now I just want to introduce Katrina Navickas who will talk about her project.

Political Meetings Mapper – Katrina Navickas

I am part of the Digital History Research Centre at the University of Hertfordshire. My focus right now is on Chartism, the big movement in the 19th Century campaigning for the vote. I am especially interested in the meetings they held, where and when they met and gathered.

The Chartists held big public meetings, but also weekly local meetings advertised in the press. The BL holds huge amounts of those newspapers. So my challenge was to find out more about those meetings – how many were advertised in the Northern Star newspaper from 1838 to 1850. The data is well structured for this… Now that may seem like a simple computational challenge, but I come from a traditional research background, used to doing things by hand. I wanted to do this more automatically, at a much larger scale than previously possible. My mission was to find out how many meetings there were, where they were held, and how we could find those meetings automatically in the newspapers. We also wanted to make connections between papers, georeferenced historical maps, and also any meetings that appear in playbills, as some meetings were in theatres (though most were in pubs).

But this wasn’t that simple to do… Just finding the right files is tricky. The XML is some years old so is quite poor really. The OCR was quite inaccurate, hard to search. And we needed to find maps from the right period.

So, the first stage was to redo the OCR of the original image files. Initially we thought we’d need to do what Bob Nicholson did with Historic Jokes, which was getting volunteers to re-do them. But actually newer OCR software (Abbyy Finereader 12) did a much better job and we just needed a volunteer student to check the text – mainly punctuation rather than spelling. Then we needed to geocode places using a gazetteer. And then we used Python code with regular expressions to extract dates, with some basic NLP to resolve words like “tomorrow” into actual dates – easier as the paper always came out on a Saturday.
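As a rough illustration of that last step, here is a minimal sketch (not the project’s actual code) of resolving relative date words against a Saturday issue date; the phrase list and helper names are invented for the example:

```python
import re
from datetime import date, timedelta

# Hypothetical mapping of relative-date phrases to day offsets.
RELATIVE_DAYS = {"tomorrow": 1, "to-morrow": 1, "this evening": 0}
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_date(phrase, issue_date):
    """Turn a phrase like 'tomorrow' or 'on Monday next' into a date,
    relative to the paper's (Saturday) issue date."""
    p = phrase.lower()
    if p in RELATIVE_DAYS:
        return issue_date + timedelta(days=RELATIVE_DAYS[p])
    m = re.search(r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)", p)
    if m:
        target = WEEKDAYS.index(m.group(1))
        # Next occurrence of that weekday after the issue date.
        delta = (target - issue_date.weekday()) % 7 or 7
        return issue_date + timedelta(days=delta)
    return None

issue = date(1841, 7, 3)  # a Saturday
print(resolve_date("tomorrow", issue))        # 1841-07-04
print(resolve_date("on Monday next", issue))  # 1841-07-05
```

Because the Northern Star appeared on a fixed weekday, every relative phrase resolves unambiguously, which is what makes this simple approach workable.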

So, in terms of building a historical gazetteer: we extracted place names and ran them through http://sandbox.idre.ucla.edu/tools/geocoder, checking the returned latitude and longitude to verify locations. But we still needed to do some geocoding by hand. The areas we were looking at had changed a lot through slum clearances, so we had to geolocate some of the historical places ourselves, using detailed, georeferenced 1840s maps of Manchester.
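That two-stage lookup – automatic geocoding checked against expected bounds, with a hand-built gazetteer as fallback – might be sketched like this. The bounding box, place names and coordinates below are illustrative assumptions, not project data:

```python
# Rough bounding box around Manchester: lat_min, lat_max, lon_min, lon_max.
MANCHESTER_BBOX = (53.35, 53.55, -2.40, -2.10)

# Hypothetical hand-built entries, georeferenced from 1840s maps,
# for places lost to slum clearance.
HAND_GAZETTEER = {
    "carpenters' hall": (53.4794, -2.2453),
}

def within_bbox(lat, lon, bbox=MANCHESTER_BBOX):
    """Check a coordinate pair falls inside the expected area."""
    lat_min, lat_max, lon_min, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

def resolve_place(name, geocoder_hits):
    """Pick the first geocoder hit inside the bounding box,
    else fall back to the hand-built historical gazetteer."""
    for lat, lon in geocoder_hits:
        if within_bbox(lat, lon):
            return (lat, lon)
    return HAND_GAZETTEER.get(name.strip().lower())
```

A hit for a vanished street would fail the geocoder stage and be picked up from the manual gazetteer instead.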

In the end, within the scale of this project, we looked at only 1841-1844. From that we extracted 5519 meetings (and counting), identifying texts and dates. That coverage spanned 462 towns and villages (and counting). In that data we found 200+ lecture tours – Chartist lecturers were paid to go on tours.

So, you can find all of our work so far here: http://politicalmeetingsmapper.co.uk. The website is still a bit rough and ready, and we’d love feedback. It’s built on the Omeka platform – designed for showing collections – which also means we have some limitations, but it does what we wanted it to.

Our historical maps are with thanks to the NLS whose brilliant historical mapping tiles – albeit from a slightly later map – were easier to use than the BL georeferenced map when it came to plot our data.

Interestingly, although this was a Manchester paper, we were able to see meeting locations in London – which let us compare to Charles Booth’s poverty maps, and also to do some heatmapping of that data. Basically we are experimenting with this data… Some of this stuff is totally new to me, including trialling a machine learning approach to understand the texts of a meeting advertisement – using an IPython Notebook to make a classifier to try to identify meeting texts.
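The project’s notebook isn’t shown here, but the classifier idea can be sketched as a tiny naive Bayes model in plain Python; the training snippets below are invented examples, not Northern Star data:

```python
import math
from collections import Counter

def train(labelled_docs):
    """Count word frequencies per class and class frequencies."""
    counts = {"meeting": Counter(), "other": Counter()}
    totals = Counter()
    for text, label in labelled_docs:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, model):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    counts, totals = model
    vocab = set(counts["meeting"]) | set(counts["other"])
    best, best_score = None, float("-inf")
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [
    ("a public meeting will be held at the hall", "meeting"),
    ("the members will meet on monday evening", "meeting"),
    ("wheat prices rose sharply this week", "other"),
    ("the ship arrived at liverpool on tuesday", "other"),
]
model = train(docs)
print(classify("a meeting will be held on monday", model))  # meeting
```

Even this toy version shows why a classifier helps: adverts share a small, repetitive vocabulary (“meeting”, “will be held”) that separates them from other newspaper columns.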

So, what next? Well we want to refine our NLP parsing for more dates and other data. And I also want to connect “forthcoming meetings” to reports from the same meeting in the next issue of the paper. Also we need to do more machine learning to identify columns and types of texts in the unreconstructed XML of the newspapers in the BL Digital Collections.

Now that’s one side of our work, but we also did some creative engagement around this too. We got dressed up in Victorian costume, building on our London data analysis and did a walking tour of meetings ending in recreating a Chartist meeting in a London Pub.


Q1) I’m looking at Data mining for my own research. I was wondering how much coding you knew before this project – and after?

A1) My training had only been in GIS, and I’d done a little introduction to coding, but I basically spent the summer learning how to do this Python coding. Having a clear project gave me the focus and opportunity to do that. I still don’t consider myself a Digital Historian I guess, but I’m getting there. So, whether or not you have any coding skills already, don’t be scared, do enter the competition – you get help, support, and pointed in the right direction to learn the skills you need.

Farces and Failures: an overview of projects that have used the British Library’s digital content and data – Ben O’Steen, Technical Lead of British Library Labs.

My title isn’t because our work is farce and failure… It’s intentionally to reference the idea that it can be really important early in the process to ensure we have a shared understanding of terminology as that can cause all manner of confusion. The names and labels we choose shape the questions that people will ask and the assumptions we make. For instance “Labs” might make you imagine test tubes… or puppies… In fact we are based in the BL building in St Pancras, in offices, with curators.

Our main purpose is to make the collections available to you, to help you find the paths to walk through, where to go, what you can find, where to look. We work with researchers on their specific problems, and although that work is specific we are also trying to assess how widely this problem is felt. Much of our work is to feed back to the library what researchers really want and need to do their work.

There is also this notion that people tell us things that they think we need to hear in order to help them. As if you need secret passwords to access the content, people can see us as gatekeepers. But that isn’t how BL Labs works. We are trying to develop things that avoid the expected model of scholarship – of coming in, getting one thing, and leaving. That’s not what we see. We see scholars looking for 10,000 things to work with. People ask us “Give me all of collection X”, but is that useful? Collections are often assembled and named for administrative reasons – the naming associated with a particular digitisation funder, or with a collection. So the Dead Sea Scrolls are scanned in a music collection because the settings were the same for digitising them… That means the “collection” isn’t always that helpful.

So farce… Think of the Two Ronnies’ “Fork Handles/Four Candles” sketch…

We have some common farce-inducing words:

  • Collection (see above)
  • Access – but that has different meanings, sometimes “access” is “on-site” and without download, etc. Access has many meanings.
  • Content – we have so much, that isn’t a useful term. We have personal archives, computers, archives, UK Web domain trawl, pictures of manuscripts, OCR, derived data. Content can be anything. We have to be specific.
  • Metadata – one person’s metadata is another’s data. Not helpful except in a very defined context.
  • Crowdsourced – means different things to different people. You must understand how the data was collected – what was the community, how did they do it, what was the QA process. That applies to any collaborative research data collection, not just crowdsourcing.

An example of complex provenance…

Microsoft Books digitisation project. It started in 2007 but stopped in 2009 when the MS Book search project was cancelled. This digitised 49K works (~65k volumes). It has been online since 2012 via a “standard” page-turning interface, but we have very low usage statistics. That collection is quite random; items were picked shelf by shelf, with books missing. People do data analysis of those works and draw conclusions that don’t make sense if you don’t understand that provenance.

So we had a competition entry in 2013 that wanted to analyse that collection… But actually led to a project called the Sample Generator by Pieter Francois. This compared physical to digital collections to highlight the issues of how unrepresentative that sample is for drawing any conclusions.

Allen B. Riddell’s 2012 study “Where Are the Novels?” looked at the HathiTrust corpus and similarly examined the bias in digitised resources.

We have really big gaps in our knowledge. In fact librarians may recognise the square brackets of the soul… The data in records that isn’t actually confirmed – inferred information within metadata. If you look at the Microsoft Books project, it’s about half inferred information. A lot of the peaks in the Sample Generator’s view of what has been digitised come from inferred year of publication based on content – guesswork rather than reliable dates.

But we can use this data. So Bob Nicholson’s competition entry on Victorian Jokes led to the Mechanical Comedian Twitter account. We didn’t have a good way into these texts, we had to improvise around these ideas. And we did find some good jokes… If you search for “My Mother in-law” and “Victorian Humour” you’ll see a great video for this.

That project looked for patterns of words. That’s the same technique applied to Political Meetings Mapper.

So “Access” again… These newspapers were accessible but we didn’t have access to them… Keyword search fails miserably and bulk access is an issue. But that issue is useful to know about. Research and genealogical needs are different, and these papers were digitised partly for those more lucrative genealogical needs, to browse and search.

There are over 600 digital archives; we can only spend so long characterising each of them. The Microsoft Books digitisation project was public domain, so that let us experiment richly and quickly. We identified images of people, we found image details. We started to post images to Twitter and Tumblr (via Mechanical Curator)… There was demand and we weren’t set up to deliver it, so we used Flickr Commons – 1 TB for free – with limited awareness of what page an image was from, what region. We had minimal metadata but others started tagging and adding to our knowledge. Nora did a great job of collating these images as they started to be tagged (by people and machines). And usage of the images has been huge: 13-20 million hits on average every month, over 330M hits to date.

Is this Iterative Crowdsourcing (Mia Ridge)? We crowdsource broad facts and subcollections of related items will emerge. There is no one-size-fits-all; it has to be project-based. We start with no knowledge but build from there. But these have to be purposefully contextless. Presenting them on Flickr removed the illustrations’ context. The sheer amount of data is huge. David Foster Wallace has a great comment that “if your fidelity to perfectionism is too high, you never do anything”. We have a fear of imperfection in all universities, and we need to have the space to experiment. We can re-represent content in new forms; it might work, it might not. Metaphors don’t translate between media – like turning pages on a screen, or scrolling a book forever.

With our map collection we ran a tagathon and found nearly 30,000 maps. 10,000 were tagged by hand, 20,000 were found by machine. We have that nice combination of human and machine. We are now trying to georeference our maps and you can help with that.

But it’s not just research… We encourage people to do new things – make colouring books for kids, make collages – like David Normal’s Burning Man installation (also shown at St Pancras). That stuff is part of playing around.

Now, I’ve talked about “crowdsourcing” several times. There can be lots of bad assumptions around that term. It’s assumed to be about a crowd of people all doing a small thing, about special software, that if you build it they will come, that it’s easy, it’s cheap, it’s totally untrustworthy… These aren’t right. It’s about being part of a community, not just using it. When you look at Zooniverse data you see a common pattern – that 1-2% of your community will do the majority of the work. You have to nurture the expert group within your community. This means you can crowdsource starting with that expert group – something we are also doing in a variety of those groups. You have to take care of all your participants but that core crowd really matters.

So, for crowdsourcing you don’t need special software. If you build something they don’t necessarily come; they often don’t. And something we like to flag up is the idea of playing games, trying the unusual… Can we avoid keyboard and mouse? That arcade game does that; it asks whether we can make use of casual interaction to get useful data. That experiment is based on a Raspberry Pi and loads of great ideas from others using our collections. They are about the game dynamic… How we deal with data – how to understand how the game dynamics impact on the information you can extract.

So, in summary…

Don’t be scared of using words like “collection” and “access” with us… But understand that there will be a dialogue… that helps avoid disappointment, helps avoid misunderstanding or wasted time. We want to be clear and make sure we are all on the same page early on. I’m there to be your technical guide and lead on a project. There is space to experiment, to not be scared to fail and learn from that failure when it happens. We are there to have fun, to experiment.

Questions & Discussion

Q1) I’m a historian at the National Library of Scotland. You talked about that Microsoft Books project and the randomness of that collection. Then you talked about the Flickr metadata – isn’t that the same issue… Is that suitable for data mining? What do you do with that metadata?

A1) A good point. Part of what we have talked about is that those images just tell you about part of one page in a book. The mapping data is one of the ways we can get started on that. So if we geotag an image or a map with Aberdeen then you can perhaps find that book via that additional metadata, even if Aberdeen would not be part of the catalogue record, the title etc. There are big data approaches we can take but there is work on OCR etc. that we can do.

Q2) A question for Ben about tweeting – the Mechanical Curator and the Mechanical Comedian. For the Curator… the posts come out quite regularly… How are they generated?

A2) That is mechanical… There are about 1200 lines of code that roam the collection looking for similar stuff… The text is generated from book metadata… It looks at data on the hard drive – it has access to everything, so it’s quite random. If there is no match it finds another random image.

Q2) And the Mechanical Comedian?

A2) That is run by Bob. The jokes are mechanically harvested, but he adds the images. He does that himself – with a bit of curation in terms of the badness of jokes – and adds images with help of a keen volunteer.

Q3) I work at the National Library of Scotland. You said to have fun and experiment. What is your response to the news of job cuts at Trove, at the National Library of Australia?

A3 – Ben) Trove is a leader in this space and I know a lot of people are incredibly upset about that.

A3 – Nora) The thing with digital collections is that they are global. Our own curators love Trove and I know there is a Facebook group to support Trove so, who knows, perhaps that global response might lead to a reversal?

Mahendra: I just wanted to say again that learning about the stories and provenance of a collection is so important. Talking about the back stories of collections. Sometimes the reasons content is not made available have nothing to do with legality… Those personal connections are so important.

Q4) I’m interested in your use of the IPython Notebook. You are using that to access content on BL servers and website? So you didn’t have to download lots of data? Is that right?

A4) I mainly use it as a communication tool between myself and Ben… I type ideas into the notebook, Ben helps me turn that into code… It seemed the best tool to do that.

Q4) That’s very interesting… The Human API in action! As a researcher is that how it should be?

A4) I think so. As a researcher I’m not really a coder. For learning, these spaces are great; they act as a sandbox.

Q4) And your code was written for your project, should that be shared with others?

A4) All the code is on a GitHub page. It isn’t perfect. That extract, code, geocode idea would be applicable to many other projects.

Mahendra: There is a balance that we work with. There are projects that are fantastic partnerships of domain experts working with technical experts wanting problems to solve. But we also see domain experts wanting to develop technical skills for their projects. We’ve seen both. Not sure of the answer… We did an event at Oxford, where they run a critical coding course that teams humanities scholars with computer scientists… It gives the computer scientists experience of really insanely difficult problems, and the academics get experience of framing questions in precise ways…

Ben: And by understanding coding and

Comment (me): I just wanted to encourage anyone creating research software to consider submitting papers on that to the Journal of Open Research Software, a metajournal for sharing and finding software specifically created for research.

Q5) It seemed like the Political Meetings Mapper and the Palimpsest project had similar goals, so I wondered why they selected different workflows.

A5 – Bea Alex) The project came about because I spoke to Miranda Anderson who had the idea at the Digital Scholarship Day of Ideas. At that time we were geocoding historical trading documents and we chatted about automating that idea of georeferencing texts. That is how that project came about… There was a large manual aspect as well as the automated aspects. But the idea was to reduce that manual effort.

A5 – Katrina) Our project was so much smaller team. This is very much a pilot project to meet a particular research issue. The outcomes may seem similar but we worked on a smaller scale, seeing what one researcher could do. As a traditional academic historian I don’t usually work in groups, let alone big teams. I know other projects work at larger scale though – like Ian Gregory’s Lakes project.

A5 – Mahendra) Time was a really important aspect in decisions we took in Katrina’s project, and of focusing the scope of that work.

A5 – Katrina) Absolutely. It was about what could be done in a limited time.

A5 – Bea) One of the aspects of our work is that we sourced data from many collections, and the structure could be different for each mention, whereas Katrina’s project drew on a single newspaper with a more consistent structure, which lends itself better to a regular expressions approach.

And next we moved to coffee and networking. We return at 3.30 for more excellent presentations (details below). 

BL Labs Awards: Research runner up project: “Palimpsest: Telling Edinburgh’s Stories with Maps” – Professor James Loxley, Palimpsest, University of Edinburgh

I am going to talk about a project which I led in collaboration with colleagues in English Literature, with Informatics here, with visualisation experts at St Andrews, and with EDINA.

The idea came from Miranda Anderson, in 2012, who wanted to explore how people imagine Edinburgh in a literary sense, how the place is imagined and described. And one of the reasons for being interested in doing this is the fact that Edinburgh was the world’s first UNESCO City of Literature. The City of Literature Trust in Edinburgh is also keen to promote that rich literary heritage.

We received funding from the AHRC from January 2014 to March 2015. And the name came from the concept of the Palimpsest, the text that is rewritten and erased and layered upon – and of the city as a Palimpsest, changing and layering over time. The original website was to have the same name but as that wasn’t quite as accessible, we called that LitLong in the end.

We had some key aims for this project. There are particular ways literature is packaged for tourists etc. We weren’t interested in where authors were born or died. Or the authors that live here. What we were interested in was how the city is imagined in the work of authors, from Robert Louis Stevenson to Muriel Spark or Irvine Welsh.

And we wanted to do that in a different way. Our initial pilot in 2012 was all done manually. We had to extract locations from texts. We had a very small data set and it offered us things we already knew – relying on well known Edinburgh books, working with the familiar. The kind of map produced there told us what we already knew. And we wanted to do something new. And this is where we realised that the digital methods we were thinking about really gave us an opportunity to think of the literary cityscape in a different mode.

So, we planned to textmine large collections of digital text to identify narrative works set in Edinburgh. We weren’t constrained to novels; we included short stories, memoirs… Imaginative narrative writing. We excluded poetry as that was too difficult a processing challenge for the scale of the project. And we were very lucky to have the support of, and access to, British Library works, as well as material from the HathiTrust and the National Library of Scotland. We mainly worked with out-of-copyright works, but we did specifically get permission from some publishers for in-copyright works. Not all publishers were forthcoming and happy for work to be text mined. We were text mining works – not making them freely available – but for some publishers full text for text mining wasn’t possible.

So we had large collections of works, mainly but not exclusively out of copyright. And we set about textmining those collections to find those set in Edinburgh. And then we georeferenced the Edinburgh place names in those works to make mapping possible. And then finally we created visualisations offering different viewpoints into the data.

The best way to talk about this is to refer to text from our website:

Our aim in creating LitLong was to find out what the topography of a literary city such as Edinburgh would look like if we allowed digital reading to work on a very large body of texts. Edinburgh has a justly well-known literary history, cumulatively curated down the years by its many writers and readers. This history is visible in books, maps, walking tours and the city’s many literary sites and sights. But might there be other voices to hear in the chorus? Other, less familiar stories? By letting the computer do the reading, we’ve tried to set that familiar narrative of Edinburgh’s literary history in the less familiar context of hundreds of other works. We also want our maps and our app to illustrate old connections, and forge new ones, among the hundreds of literary works we’ve been able to capture.

That’s the kind of aims we had, what we were after.

So our method started with identifying texts with a clear Edinburgh connection or, as we called it “Edinburghyness“. Then, within those works to actually try and understand just how relevant they were. And that proved tricky. Some of the best stuff about this project came from close collaboration between literary scholars and informatics researchers. The back and forth was enormously helpful.

We came across some seemingly obvious issues. The first thing we saw was that there was a huge number of theological works… Which was odd… And it turned out to be because the Edinburgh placename “Trinity” was in there. Then “Haymarket” is a place in London as well as Edinburgh. So we needed to rank placenames, and part of that was handling the ambiguity of names, and understanding that some places are more likely than others to specifically be Edinburgh.
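One way to sketch that ranking idea is to weight each placename by how strongly it points to Edinburgh specifically, then score a text by summing the weights of the names it mentions. The names and weights below are purely illustrative assumptions, not the project’s gazetteer:

```python
# Hypothetical weights: distinctive Edinburgh names score highly,
# ambiguous names (shared with London, or common in other genres) score low.
PLACE_WEIGHTS = {
    "grassmarket": 1.0,   # distinctively Edinburgh
    "cowgate": 1.0,
    "haymarket": 0.3,     # also a London placename
    "trinity": 0.1,       # common in theological works
}

def edinburghyness(text):
    """Crude document score: sum of weights of matched placenames."""
    t = text.lower()
    return sum(w for name, w in PLACE_WEIGHTS.items() if name in t)

print(edinburghyness("He crossed the Grassmarket towards the Cowgate."))  # 2.0
```

A theological tract mentioning only “Trinity” would score 0.1 and fall below any sensible threshold, while a novel moving through several distinctive locations would score highly.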

From there, with selected works, we wanted to draw out snippets – of varying lengths but usually a sensible syntactic shape – with those mentions of specific placenames.

At the end of that process we had a dataset of 550 published works, across a range of narrative genres. They have over 1600 Edinburgh place names of lots of different types, since literary engagement with a city might be with a street, a building, open spaces, areas, monuments etc. In mapping terms you can be more exact; in literature you have these areas and diverse types of “place”, so our gazetteer needed to be flexible to that. And what that all gave us in total was 47,000 extracts from literary works, all focused on a place name mention.

That was the work itself, but we also wanted to engage people in our work. So we brought Sir Walter Scott back to life. He came along to the Edinburgh International Book Festival in 2014. He kind of got away from us and took on a life of his own… He ended up being part of the celebrations of the 200th anniversary of Waverley. And popped up again last year on the Borders Railway when that launched! That was fun!

We did another event at EIBF in 2015 with James Robertson who was exploring LitLong and data there. And you can download that as a podcast.

So, we were very very focused on making this project work, but we were also thinking about the users.

The resource itself you can visit at LitLong.org. I will talk a little about the two forms of visualisation. The first is a location visualiser largely built and developed by Uta Hinrichs at St Andrews. That allows you to explore the map, to look at keywords associated with locations – which indicate a degree of qualitative engagement. We also have a searchable database where you can see the extracts. And we have an app version which allows you to wander in among the extracts, rather than see them from above – our visualisation colleagues call this the “Frog’s Eye View”. You can wander between extracts, browse the range of them. It works quite well on the bus!

We were obviously delighted to be able to do this! Some of the obstacles seemed tough but we found workable solutions… But we hope it is not the end of the story. We are keen to explore new ways to make the resource explorable. Right now there isn’t a way for interaction to leave a trace – other people’s routes through the city, other people’s understanding of the topography. There is scope for more analysis of the texts themselves: for instance, we considered doing a mood map of the city. We weren’t able to do that in this project, but there is scope for it. And as part of building on the project we have a bit of funding from the AHRC, so there are lots of interesting lines of enquiry. And if you want to explore the resource do take a look, get in touch, etc.


Q1) Do you think someone could run sentiment analysis over your text?

A1) That is entirely plausible. The data is there and tagged so that you could do that.

A1 – Bea) We did have an MSc project just starting to explore that in fact.

A1) One of our buttons on the homepage is “LitLong Lab” where we share experiments in various ways.
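As an aside, the kind of sentiment analysis suggested in Q1 could be sketched very simply with a lexicon approach. This is a toy illustration only – the word lists are stand-ins I have made up, not a real sentiment lexicon, and the place names are just examples:

```python
# Toy lexicon-based sentiment scorer for place-name extracts: count
# positive and negative words, return a score in [-1, 1]. The word
# lists here are illustrative stand-ins for a real sentiment lexicon.

POSITIVE = {"bright", "fair", "beloved", "pleasant", "noble"}
NEGATIVE = {"dark", "grim", "dreary", "foul", "bleak"}

def sentiment(extract):
    words = [w.strip(".,;!?").lower() for w in extract.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def mood_by_place(extracts):
    # extracts: list of (place_name, text); average sentiment per place,
    # the kind of aggregate a "mood map" of the city might start from.
    scores = {}
    for place, text in extracts:
        scores.setdefault(place, []).append(sentiment(text))
    return {p: sum(s) / len(s) for p, s in scores.items()}

print(mood_by_place([
    ("Grassmarket", "A dark and dreary close."),
    ("Calton Hill", "A bright and pleasant prospect."),
]))
# → {'Grassmarket': -1.0, 'Calton Hill': 1.0}
```

A serious attempt would of course use a proper sentiment model, but because the LitLong data is already tagged by place, aggregation of this kind is straightforward.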

Q2) Some science fiction authors have imagined near future Edinburgh, how could that be mapped?

A2) We did have some science fiction in the texts, including the winner of our writing competition. We have texts from a range of ages of work but a contemporary map, so there is scope for keying data to historic maps, and those exist thanks to the NLS. As to the future… The not-yet-Edinburgh… Something I’d like to do… It is not uncommon for fictional places to exist in real places – like 221B Baker Street or 44 Scotland Street – and I thought it would be fun to see the linguistic qualities associated with a fictional place, and compare them to real places with the same sort of profile. So perhaps for futuristic places that would work – using a linguistic profile to do that.
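The profile-matching idea in that answer could be sketched as word-frequency vectors compared by cosine similarity. Everything below (the place names, the extracts, the helper names) is my own illustration of the general technique, not anything from the project:

```python
# Sketch of comparing the "linguistic profile" of a fictional place
# with real places: build word-frequency vectors from the extracts
# mentioning each place, then rank real places by cosine similarity.
from collections import Counter
from math import sqrt

def profile(extracts):
    words = [w.strip(".,").lower() for text in extracts for w in text.split()]
    return Counter(words)

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(fictional_extracts, real_places):
    # real_places: {place_name: list of extracts mentioning that place}
    target = profile(fictional_extracts)
    return max(real_places, key=lambda p: cosine(target, profile(real_places[p])))

real = {
    "Princes Street": ["busy shops and crowds", "crowds along the busy street"],
    "Greyfriars": ["quiet graves and old stones", "old quiet kirkyard"],
}
print(most_similar(["a quiet place of old stones"], real))  # → Greyfriars
```

Real work would want stopword removal and better weighting (e.g. tf-idf), but the shape of the comparison is the same.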

Q3) I was going to ask about chronology – but you just answered that. So instead I will ask about crowd sourcing.

A3) Yes! As an editor I am most concerned about potential effort. For this scale and speed we had to let go of issues of mistakes; we know they are there… Places that have moved, some false positives, and some books that use Edinburgh place names but are set elsewhere (e.g. some Glasgow texts). At the moment we don’t have a full report function or similar – we weren’t able to build that sort of correction mechanism. What we decided to do is make a feature of a bug, celebrating those errors as worm holes! But I would like to fine-tune and correct, with user interactions as part of that.

Q4) Is the data set available?

A4) Yes, through an API created by EDINA. It is open for out-of-copyright works.

Palimpsest seeks to find new ways to present and explore Edinburgh’s literary cityscape, through interfaces showcasing extracts from a wide range of celebrated and lesser known narrative texts set in the city. In this talk, James will set out some of the project’s challenges, and some of the possibilities for the use of cultural data that it has helped to unearth.

Geoparsing Jisc Historical Texts – Dr Claire Grover, Senior Research Fellow, School of Informatics, University of Edinburgh

I’ll be talking about a current project, a very rapid project to geoparse all of the Jisc Historical Texts. So I’ll talk about the Geoparser and then more about that project.

The Edinburgh Geoparser has been developed over a number of years in collaboration with EDINA. It has been deployed in various projects and places, mainly also in collaboration with EDINA. It has four main steps:

  • Use named entity recognition to identify place names in texts
  • Find matching records in a gazetteer
  • In cases of ambiguity (e.g. Paris, Springfield), resolve using contextual information from the document
  • Assign the coordinates of the preferred reading to the place name
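The four steps above can be sketched end to end in a few lines. To be clear, the gazetteer, the token-matching “recognition” and the nearest-neighbour disambiguation heuristic below are toy stand-ins of my own, not the Geoparser’s actual implementation:

```python
# Toy illustration of the four geoparsing steps: recognise candidate
# place names, look them up in a gazetteer, disambiguate ambiguous
# names using other places mentioned in the document, and assign
# coordinates to the preferred reading.

GAZETTEER = {
    "Paris": [("Paris, France", 48.86, 2.35), ("Paris, Texas", 33.66, -95.56)],
    "France": [("France", 46.0, 2.0)],
    "Texas": [("Texas", 31.0, -100.0)],
}

def recognise(text):
    # Stand-in for named entity recognition: any token that appears
    # in the gazetteer counts as a candidate place name.
    return [tok.strip(".,") for tok in text.split() if tok.strip(".,") in GAZETTEER]

def disambiguate(name, context_names):
    # Prefer the reading geographically closest to the other places
    # mentioned in the document (a crude contextual heuristic).
    candidates = GAZETTEER[name]
    others = [GAZETTEER[c][0] for c in context_names if c != name]
    if not others or len(candidates) == 1:
        return candidates[0]
    def dist(cand):
        return min((cand[1] - o[1]) ** 2 + (cand[2] - o[2]) ** 2 for o in others)
    return min(candidates, key=dist)

def geoparse(text):
    names = recognise(text)
    return {n: disambiguate(n, names) for n in names}

result = geoparse("She travelled from Paris to the south of France.")
print(result["Paris"][0])  # → Paris, France (the French reading wins on context)
```

The real pipeline uses proper NER and richer contextual evidence, but the division of labour between recognition, lookup, disambiguation and grounding is the same.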

So, you can use the Geoparser either via EDINA’s Unlock Text, or you can download it, or you can try a demonstrator online (links to follow).

To give you an example I have a news piece on the burial of Richard III. You can see the Geoparser performs entity recognition of all types – people as well as places – as that helps with disambiguation later on. Then, using that text, the parser ranks the likelihood of possible locations.

A quick word on gazetteers. The knowledge of possible interpretations comes from a gazetteer, which pairs place names with lat/long coordinates. So, if you know your data you can choose a gazetteer relevant to it (e.g. just the UK). The Edinburgh Geoparser provides a choice of gazetteers and can be configured to use others.

If a place is not in a gazetteer it cannot be grounded. If the correct interpretation of a place name is not in the gazetteer, it cannot be grounded correctly. Modern gazetteers are not ideal for historical documents, so historical gazetteers need to be used or developed. For instance, the DEEP (Directory of English Place Names) and PELAGIOS (ancient world) gazetteers have been useful in our current work.

The current Jisc Historical Texts (http://historicaltexts.jisc.ac.uk/) project has been working with EEBO and ECCO texts as well as the BL Nineteenth Century collections. These are large and highly varied datasets. So, for instance, yesterday I did a random sample of writers and texts… The collection is so large we’ve only seen a tiny portion of it. We can process it but we can’t look at it all.

So, what is involved in georeferencing this text? Well, we have to get all the data through the Edinburgh Geoparser pipeline. That requires adapting the pipeline to recognise place names as accurately as possible in historical text. And we need to adjust the georeferencing strategy to be more detailed.

Adapting our place name recognition relies a lot on lexicons. The standard Edinburgh Geoparser has three lexicons, derived from the Alexandria Gazetteer (global, very detailed), the Ordnance Survey (Great Britain, quite detailed), and DEEP. We’ve also added more lexicons from further gazetteers, including larger place names from GeoNames (population over 10,000), populated places from Natural Earth, and only the larger places from DEEP. We then score recognised place names based on how many and which lexicons they occur in. Low-scoring place names are removed – we reckon people’s tolerance for missing a place is higher than their tolerance for false positives.
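That scoring-and-thresholding idea can be sketched as below. The lexicon contents, weights and threshold are illustrative guesses on my part, not the project’s actual values:

```python
# Sketch of the lexicon-scoring idea: a candidate place name earns
# weight from each lexicon it appears in, and low-scoring candidates
# are dropped to favour precision over recall. Contents and weights
# are illustrative, not the real lexicons.

LEXICONS = {
    "geonames_large": ({"London", "Edinburgh", "Paris"}, 2),
    "natural_earth":  ({"London", "Edinburgh"}, 2),
    "deep_large":     ({"London", "Edinburgh", "Haue"}, 1),
}

def score(name):
    # Sum the weights of every lexicon that lists this name.
    return sum(weight for names, weight in LEXICONS.values() if name in names)

def filter_candidates(candidates, threshold=2):
    # Drop low-scoring names: missing a place is judged more tolerable
    # than a false positive.
    return [c for c in candidates if score(c) >= threshold]

print(filter_candidates(["Edinburgh", "Haue", "Paris"]))
# → ['Edinburgh', 'Paris']  ("Haue" appears in only one weak lexicon)
```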

Working with old texts also means huge variation in spellings… A lot of place names are missed (false negatives) because of this (e.g. Maldauia, Demnarke, Saxonie, Spayne). Old spellings also produce false positives (Grasse, Hamme, Lyon, Penne, Sunne, Haue, Ayr). So we have tried to remove the false positives, to weed out bad place names.
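One common way to recover spelling variants of this kind – not necessarily what this project does – is fuzzy matching against a gazetteer by edit distance. A minimal sketch, with a tiny made-up gazetteer:

```python
# Sketch of matching historical spelling variants (e.g. "Spayne",
# "Demnarke") against a modern gazetteer using edit distance. A real
# system would use tuned rules and a full gazetteer; this is a toy.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

GAZETTEER = ["Spain", "Denmark", "Saxony", "Moldavia"]

def match_variant(variant, max_dist=3):
    # Return the closest gazetteer entry, or None if nothing is close.
    best = min(GAZETTEER, key=lambda g: edit_distance(variant.lower(), g.lower()))
    return best if edit_distance(variant.lower(), best.lower()) <= max_dist else None

print(match_variant("Spayne"))    # → Spain
print(match_variant("Demnarke"))  # → Denmark
```

Note the trade-off: loosening `max_dist` recovers more variants but reintroduces exactly the false positives the scoring step tries to suppress.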

When it comes to actually georeferencing these places we need coordinates for place names from gazetteers. We used three gazetteers in succession: Pleiades++, GeoNames and then DEEP. In addition to using those gazetteers we can weight the results based on location in the world, using bounding boxes. So we prefer locations in the UK and Europe, then those in the East, not extending to the West as much… And we exclude Australia and New Zealand (unknown to writers at that time).
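The bounding-box weighting might look something like this in code. The boxes, weights and candidate scores below are illustrative numbers of my own, not the project’s configuration:

```python
# Sketch of weighting candidate readings by region: candidates inside
# a preferred bounding box (here roughly the UK and Europe) get a
# boost, eastern regions get a smaller one, and regions unknown to
# early modern writers are excluded. All values are illustrative.

BOXES = [
    # (name, min_lat, max_lat, min_lon, max_lon, weight)
    ("uk_europe", 35.0, 72.0, -11.0, 40.0, 2.0),
    ("east",       0.0, 60.0,  40.0, 150.0, 1.0),
]
EXCLUDED = [("australasia", -50.0, -10.0, 110.0, 180.0)]

def region_weight(lat, lon):
    for _, lat0, lat1, lon0, lon1 in EXCLUDED:
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return 0.0  # e.g. Australia/New Zealand: ruled out entirely
    for _, lat0, lat1, lon0, lon1, w in BOXES:
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return w
    return 0.5  # elsewhere: allowed but not preferred

def best_reading(candidates):
    # candidates: list of (name, lat, lon, base_score) from the gazetteer.
    return max(candidates, key=lambda c: c[3] * region_weight(c[1], c[2]))

perth = best_reading([
    ("Perth, Scotland", 56.4, -3.4, 1.0),
    ("Perth, Australia", -31.95, 115.86, 1.2),
])
print(perth[0])  # → Perth, Scotland
```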

Looking at EEBO and ECCO we can see some frequent place names from each gazetteer – which shows how different they are. In terms of how many terms we have found, there are over 3 million locations in EEBO and over 250k in ECCO (a much smaller collection). The early EEBO collections have a lot of locations in Israel, Italy and France. The early books are more concerned with the ancient world and Biblical texts, so these statistics suggest that we are doing the right thing here.

These are really old texts, we have huge volumes of them, and there is huge variety in the data, and that all makes this a hard task. We still don’t know how the work will be received, but we think Jisc will put it in a sandbox area and we should get some feedback on it.

Find out more:

  • http://historicaltexts.jisc.ac.uk/
  • https://www.ltg.ed.ac.uk/software/geoparser
  • http://edina.ac.uk/unlock/
  • http://placenames.org.uk/
  • https://googleancientplaces.wordpress.com/


Q1) What about historical Gaelic place names?

A1) I’m not sure these texts have those. But we did apply a language tag at paragraph level. These are supposed to be English texts but there is a lot of Latin, Welsh, Spanish, French and German. We only georeferenced texts thought to be English. Gaelic names may have been picked up if they appear in the Ordnance Survey gazetteer…

Claire will talk about work the Edinburgh Language Technology Group have been doing for Jisc on geoparsing historical texts such as the British Library’s Nineteenth Century Books and Early English Books Online Text Creation Partnership which is creating standardized, accurate XML/SGML encoded electronic text editions of early print books.

Pitches – Mahendra and co

Mahendra: Can the people who pitched me their ideas earlier come up and share them…

Lorna: I’m interested in open education and I’d love to get some of the BL content out there. I’ve been working on the new HECoS coding schema for different subjects, and I thought that it would be great to classify the BL content with HECoS.

Karen: I’ve been looking at copyright music collections at St Andrews. There are gaps in legal deposit music from the late 18th and 19th century, as we know publishers deposited less with the Scottish libraries than with the BL. So we could compare and see what reached the outer reaches of the UK.

Nina: My idea was a digital Pilgrim’s Progress where you can have a virtual tour of a journey with all sorts of resources… To see why some places are most popular in texts, etc.

David: I think my idea has been done… It was going to be iPython – Katrina is already doing this! But to make it more unique… It’s quite hard work for Ben to support scholars in that way, so I think researchers should be encouraged to approach Ben etc., but also we should let non-programmers craft complex queries, make the good ones reusable by others, and have the reused ones marked up as being of particular quality. And to make it more fun… we could have a sort of treasure hunt jam, with people using that facility to run a treasure hunt on a theme, share interesting information, have researchers see tweets or shared things… A group treasure hunt to encourage people by helping them share queries…

Mahendra: So we are supposed to decide the winners now… But I think we’ll get all our pitchers to share the bag – all great ideas! The idea was to start conversations. You should all have an email from me, so if you have found this inspiring or interesting, we’ll continue that conversation.

And with that we are done! Thanks to all for a really excellent session!