May 16, 2018
 

Today I am at the Digital Scholarship Day of Ideas, organised by the Digital Scholarship programme at University of Edinburgh. I’ll be liveblogging all day so, as usual, I welcome additions, corrections, etc. 

Welcome & Introduction – Melissa Terras, Professor of Digital Cultural Heritage, University of Edinburgh

Hi everyone, it is my great pleasure to welcome you to the Digital Day of Ideas 2018 – I’ve been on stage here before as I spoke at the very first one in 2012. I am introducing the day but want to give my thanks to Anouk Lang and Professor James Loxley for putting the event together and their work in supporting digital scholarship. Today is an opportunity to focus on digital research methods and work.

Later on I am pleased that we have speakers from sociology and economic sociology, and the nexus of that with digital techniques, areas which will feed into the Edinburgh Futures Institute. We’ll also have opportunity to talk about the future of digital methods, and particularly what we can do here to support that.

Lynn Jamieson – Introduction

Susan Halford is professor of sociology but also director of the institution-wide Web Science Institute.

Symphonic Social Science and the Future of Big Data Analytics – Susan J Halford, Professor of Sociology & Director of Web Science Institute, University of Southampton

Abstract: Recent years have seen ongoing battles between proponents of big data analytics, using new forms of digital data to make computational and statistical claims about the social world, and many social scientists who remain sceptical about the value of big data, its associated methods and claims to knowledge. This talk suggests that we must move beyond this, and offers some possible ways forward. The first part of the talk takes inspiration from a mode of argumentation identified as ‘symphonic social science’ which, it is suggested, offers a potential way forward. The second part of the talk considers how we might put this into practice, with a particular emphasis on visualisation and the role that this could play in overcoming disciplinary hierarchies and enabling in-depth interdisciplinary collaboration.

It’s a great pleasure to be here in very sunny Edinburgh, and to be speaking to such a wide-ranging audience. My own background is geography, politics, English literature, sociology and, in recent years, computer science. That interdisciplinary background has been increasingly important as we start to work with data, new forms of data, new types of work with data, and new knowledge – but let’s query that – from that data. All this new work raises significant challenges, especially as those individual fields come from very different backgrounds. I’m going to look at this from the perspective of sociology and perhaps the social sciences; I won’t claim to cover all of the arts and humanities as well.

My talk today is based on work that I have been doing with Mike Savage on “big data”, the new forms of practice emerging around these new forms of data, and the claims being made about how we understand the social world. In this world there has been something of a stand-off between data scientists and social scientists. Chris Anderson (in 2008), a writer for Wired, essentially claimed “the data will speak for itself” – you won’t need the disciplines. Many have pushed back hard on this. The pushback is partly methodological: these data do not capture every aspect of our lives, they capture partial traces, often lacking in demographic detail (do we care? sociologists generally do…), and we know little of its promise. And it is very hard to work with these data without computational methods – tools for pattern recognition generally, not usually thorough sociological approaches. These can present concerning, ethically problematic results as if they were unproblematic. So, this is highly challenging. John Goldthorpe says “whatever big data may have for ‘knowing capitalism’, its value to social science has… remained open to questions…”.

Today I want to move beyond that stand-off. The divisiveness and siloing of disciplines is destructive – it’s not good for social science and it’s not good for big data analytics either. From a social science perspective, that position marginalises the social sciences, sociology specifically, and makes us unable to take part in this big data paradigm which – love it or loathe it – has growing importance, influence, and investment. We have to take part in this for three major reasons: (1) it is happening anyway – it will march forward with or without us; (2) these new data and methods do offer new opportunities for social sciences research; and (3) we may be able to shape big data analytics as the field emerges – it is very much in formation right now. It’s also really bad for data science not to engage with the social sciences… Anderson and others made these claims ten years ago… Reality hasn’t really borne that out. In commercial contexts – recommendations, behaviour tracking and advertising – the data and analysis are doing that. But in actually drawing understanding of the world, it hasn’t really happened. And even the evangelists have moved on… Wired itself has moved to saying “big data is a tool, but should not be considered the solution”. Jeff Hammerbacher (co-credited with coining the term “data science” in 2008) said in 2013: “the best minds of my generation are thinking about how to make people click ads… that sucks”.

We have a wobble here, a real change in the discourse. We have a call for greater engagement with domain experts. We have a recognition that data are only part of the picture. We need to build a middle ground between those two positions of data science and social science. This isn’t easy… It’s really hard for a variety of reasons. There are bodies buried here… But rather than focus on that, I want to focus on how we take big steps forward here…

The inspiration here are three major social science projects: Bowling Alone (Robert Putnam); The Spirit Level – Richard Wilkinson and Kate Pickett; Capital – Thomas Piketty. These projects have made huge differences, influencing public policy and in the case of Bowling Alone, really reshaped how governments make policy. These aren’t by sociologists. They aren’t connected as such. The connection we make in our paper is that we see a new style of social science argumentation – and we see it as a way that social scientists may engage in data analytics.

There are some big similarities between these books. They are all data driven. Sociology at the end of the 20th century was highly theoretical… At the beginning of the 21st century we see data-driven works. And the authors haven’t generated their own research data here; they have drawn on existing research data. Piketty has drawn together diverse tax data… but also Jane Austen quotes… Not just mixed methods but huge repurposing. These books don’t make claims for causality based on data; their claims for causality are supported by theory. However, they present data throughout, supporting their arguments. Data is key, with images to hold the data together. There is a “visual consistency”. The books each have a key graph that essentially summarises the book. Putnam talks about social capital, Piketty talks about the rise and fall of wealth inequality in the 20th century.

In each of these texts data, method and visualisation are woven into a repeated refrain, combined with theory as a composite whole to make powerful arguments about the nature of social life and social change over the long term. We call this a “Symphonic Aesthetic”: different instruments and refrains build, come in and go… and the whole is greater than the sum of the parts.

OK, that’s an observation about the narrative… But why does that matter? We think it’s a way to engage with and disrupt big data. There are similarities: re-purposing multiple and varied “found” data sources; an emphasis on correlation; use of visualisation. There are differences too: theoretical awareness; choice of data; temporality is different – big data has huge sets of data looking at tiny, focused and often real-time moments, while social science takes long-term comparisons – potentially over 100 years. The role of correlation is different. Big data analytics looks for a result (at least in the early stage); in symphonic aesthetics there is a real interest in correlation through statistical and theoretical understandings. The practice of visualisation varies as well. In big data it is the result; in symphonic aesthetics it is part of the process, not the end of the process.

Those similarities are useful but there is much still to do: symphonic authors do not use new forms of digital data, their methods cannot simply be applied, big data demand new and unfamiliar skills and collaborations. So I want to talk about the prospective direction of travel around data; method; theory; visualisation practice.

So, firstly, data. If we talk about symphonic aesthetics we have to think about critical data pragmatism. That is about lateral thinking – a redirection of what data already exist. And we have to move beyond naivety – we cannot claim they are “naturally occurring” mirrors/telescopes etc. They are deliberately socio-technical constructions. And we need to understand what the data are and what they are not: socio-technical processes of data construction (e.g. carefully constructed samples); understanding and using demographic biases (go with the biases and use the data as appropriate, rather than claiming they are representative; or maybe ignore that and look at network construction, flows, mobilities – e.g. John Murrey’s work).

Secondly, method. We have to be methodologically plural. Normally we do mixed methods – some quantitative, some qualitative. But most of us aren’t yet trained for computational methods, and that is a problem. Many of the most interesting things about these data – their scale, complexity etc. – are not things we can accommodate in our traditional methods. We need to extend our repertoire here. Social network analysis has a long and venerable history – we can apply that more intensive, smaller-scale tradition to large-scale network data. But we also need machine learning – supervised (with training sets) and unsupervised (without). This allows you to seek evidence of different, perhaps even contradictory, patterns. Machine learning can also help you find the structures and patterns in the data – which you may well not know in data sets at this scale.
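To make the supervised/unsupervised distinction concrete, here is a toy sketch of my own (not anything shown in the talk) – invented word-count features standing in for documents, a nearest-centroid classifier for the supervised case, and a minimal k-means for the unsupervised case:

```python
# A toy sketch contrasting supervised and unsupervised learning on tiny
# "document" feature vectors. The word counts below are invented; a real
# project would use scikit-learn or similar on much larger data.

def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# --- Supervised: we provide a labelled training set -------------------
training = {
    "politics": [[5, 0], [4, 1]],   # counts of (policy words, sport words)
    "sport":    [[0, 5], [1, 4]],
}
centroids = {label: centroid(vecs) for label, vecs in training.items()}

def classify(vec):
    """Assign the label of the nearest class centroid."""
    return min(centroids, key=lambda label: distance(vec, centroids[label]))

print(classify([4, 0]))  # a policy-heavy document -> "politics"

# --- Unsupervised: no labels, let the data suggest groups -------------
def kmeans(vectors, k=2, steps=10):
    """A minimal k-means: find k clusters without any labels."""
    cents = vectors[:k]  # naive initialisation
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k), key=lambda i: distance(v, cents[i]))
            clusters[j].append(v)
        cents = [centroid(c) if c else cents[i] for i, c in enumerate(clusters)]
    return cents

data = [[5, 0], [4, 1], [0, 5], [1, 4]]
print(kmeans(data))  # two centroids emerge, one per latent topic
```

The supervised classifier can only confirm or refute the categories we gave it; the k-means step has no labels at all, which is exactly the “find structure you didn’t know to look for” point made above.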

We have this quote from Amir Goldberg (2015): “sociologists often round up the usual suspects. They enter the metaphorical crime scene every day, armed with strong and well-theorised hypotheses about who the murderer should or at least plausibly might be.”

To be very clear I am not suggesting we outsource analysis to computational methods: we need to understand what the methods are doing and how.

Thirdly, theory. We have to use abductive reasoning – a constant interplay between data, method and theory. Initial methods may be informed by initial hunches, themes, etc. We might use those methods to see if there is something interesting there… Perhaps there isn’t, or perhaps you build upon this. That interplay and iterative process is, I suspect, something sociologists already do.

So, how do we bring this all together in practice? Most sociologists do not have a sophisticated understanding of the methods; and most computer scientists may understand the methods but not the theoretical elements. I am suggesting something end to end, with both sociologists and computer scientists working together.

It isn’t the only answer, but I am suggesting that visualisation becomes an analytical method, rather than a “result” – and thinking about a space for work where both sociological and computer science expertise are equally valid rather than in competition. At best, visualisations are “instruments for reasoning about quantitative information. Often the most effective way to describe, explore and summarise a set of numbers – even a very large set – is to look at pictures of those numbers” (Tufte 1998). Visualisations become interdisciplinary boundary objects. Beyond a mode of argumentation… visualisation becomes a mode of practice.

An example of this was a visualisation of the network of a hashtag, a collaboration with my colleague Ramin, which developed over time as we asked each other questions about how the data was presented and what that means…
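As a rough illustration of the analytic step that sits behind such a hashtag network visualisation – entirely my own invented example, not Halford and Ramin’s actual pipeline – here is a sketch that builds a reply network and computes degree centrality, the kind of measure a visualisation might encode as node size:

```python
# Hypothetical sketch: build a network of users who reply to one another
# on a hashtag, then compute degree centrality. The tweets are invented.

from collections import defaultdict

tweets = [
    {"user": "alice", "replies_to": "bob"},
    {"user": "carol", "replies_to": "bob"},
    {"user": "bob",   "replies_to": "alice"},
    {"user": "dave",  "replies_to": "carol"},
]

# An undirected edge set: a reply links two users (duplicates collapse).
edges = {frozenset((t["user"], t["replies_to"])) for t in tweets}

# Degree centrality: how many distinct others each user is linked to.
degree = defaultdict(int)
for edge in edges:
    for user in edge:
        degree[user] += 1

# The ordering a visualisation might encode as node size.
print(sorted(degree.items(), key=lambda kv: -kv[1]))
```

The point of the dialogue Halford describes is that a picture built from numbers like these prompts new questions (why is that node so central? what is that cluster?), which in turn send you back to the data.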

In conclusion, sociology flourished in the 20th century, developing methods, data and theory that gave us expertise in “the social” (a near monopoly). This is changing – new forms of data, new forms of expertise… and claims being made which we may, or may not, think are valid, and which stand on the work of sociologists. But there is some promise in the idea of the symphonic aesthetic. For data science: data science has to be credible and there is recognition of that – see for instance Cathy O’Neil’s work, “Weapons of Math Destruction”, which also pushes in this direction. For sociological research: but not all of it – these won’t be the right methods for everyone. For public sociology: this is being used in lots of ways already – algorithmic sentencing debates, Cambridge Analytica… There is a real place for sociologists to reshape sociology in the public understanding. There are big epistemological implications here… Changing the data and methods changes what we study… but it has always been like that. Big data can do something different – not necessarily better, but different.

Q&A

Q1) I was really interested in your comments about visualisations as a method… Johanna Drucker talks about visual technology and visual discourse – and issues of visualisations being biased towards positivistic approaches – and advocates for getting involved in the design of visualisation tools.

A1) I’m familiar with these concepts. That work I did with Ramin is early speculative work… But it builds and is based on classic social network analysis so yes, I agree, that reflects some issues.

Q2 – Tim Squirrel) I guess my question is about the trade-off between access and making meaningful critiques. Often sociology is about critiquing power and the methods by which power is transmitted. The more data proliferates, the more the data is locked behind doors – like the kind of data Facebook holds. And in order to access that data you have to compromise the kinds of critiques you can make. How do you navigate that narrow channel, to make critiques without compromising those…

A2) The field is quite unsettled… It looked settled a year ago but I think Cambridge Analytica will have a major impact… That may make the doors more closed… Or perhaps we will see these platforms – for instance Facebook – understanding that to retain credibility they have to create a segregation between their own use of the data and research (not funded by Facebook), so that there is proper separation. But I’m not naive about how that will work in practice… Maybe we have to tread a careful line… And maybe that does mean not being critical in all the ways we might be, in every paper. Empirical data may help us make critical cases across the diverse range of scholarship taking place.

Q3 – Jake Broadhurst) Data science has been used in the social world already, how do we keep up and remain relevant?

A3) It is a pressing challenge. The academy does not have the scale or capacity to address data science in the way the private sector does. One of the big issues is ethics… And how difficult it is for academics to navigate ethics of social media and social data. And it is right that we are bound to ethical processes in a way data scientists and even journalists do not need to. But it is also absolutely right that our ethics committees have to understand new methods, and the realities of the gold standard consent and other options where that is not feasible.

The discussion we are having now, in the wake of Cambridge Analytica, is crucial. Two years ago I’d ask students what data they felt was collected, they just didn’t know. And understanding that is part of being relevant.

Q4 – Karen Gregory) If you were taking up a sociology PhD next year, how would you take that up?

A4) My official response would be that I’d do a PhD in Web Science. We have a programme at University of Southampton, taking students from a huge array of backgrounds, and giving them all the same theoretical and methodological backgrounds. They then have to have 2 supervisors, from at least 2 different disciplines for their PhD.

Q5 – Kate Orton Johnson) How do we tackle the structures of HE that prevent those interdisciplinary projects, creating space, time, collaborative push to create the things that you describe?

A5) It’s a continuous struggle. Money helps – we’ve had £10m from EPSRC and that really helps. UKRI could help – I’m sceptical but hopeful about interdisciplinary possibilities here. Having PhD supervision across really different disciplines is a beautiful thing, you learn so much and it leads to new things. Universities talk about interdisciplinary work but the reality doesn’t always match up. Money helps. Interdisciplinary research helps. Collaboration on small scales – conference papers etc. also help.

Q6 – David, research in AI and Law) I found your comments about dialogues between data scientists and social scientists interesting… How can you achieve something similar with law scholars and data scientists… especially if trying to avoid hierarchical issues? Law and data science is a really interesting space right now… GDPR but also algorithmic accountability – legal aspects of equality, protected categories, etc. Very few users of big data have faced up to the risks of how they use the data, and the potential for legal challenge on the basis of discrimination.

A6) You have to find joint areas of enthusiasm, and fundable areas, and that’s where you have to start.

The Economics Agora Online: Open Surveys and the Politics of Expertise – Tod van Gunten, Lecturer in Economic Sociology, University of Edinburgh

Abstract: In recent years, research centres in both the United States and United Kingdom have conducted open online surveys of professional economists in order to inform the public about expert opinion.  Media attention to a US-based survey has centred on early research claiming to show a broad policy consensus among professional economists.  However, my own research shows that there is a clear alignment of political ideology in this survey.  My talk will discuss the value and limitations of these online surveys as tools for informing the public about expert opinion.

Thank you for the invitation to speak today, and for Susan’s great and inspiring talk. I wouldn’t claim the label “symphonic” for this talk, but I think there is something of that spirit in it. This project is based on found and repurposed data. It isn’t particularly “big” data… but the “found” aspect of the data raises profound questions. Data never holds the answers on its own; it is always crucial to understand method and context. Visualisation is a big part of this. And it is about public sociology – so it hasn’t just been published in journals but in the popular press as well.

I am a sociologist who studies economists as a sociological object in their own right. So, there was a famous moment in 2008 when the Queen, during the midst of the largest global financial crisis since 1929, asked an economist “why did nobody notice it?”. Because she is the Queen, the British Academy convened a panel to respond to this question. And they said that lots of people did a good job, but that no-one had it as their job to put everything together. Meanwhile, with Brexit, we’ve seen economists as a profession receiving substantial criticism.

Economists are hugely influential; we study them because of the politics of expertise – economics is the most politically influential social science. So, I’m going to talk about properties we would like politically influential experts to have:

  1. A high level of professional consensus within the relevant community of experts. The gold standard here is climate science. If we have a community of experts that all agree, there seems to be a need for action. That’s a good principle.
  2. Form policy opinions independently of their own political ideology. We will receive and have confidence in advice from an independent expert more than someone presenting their own views.
  3. Acknowledge professional debate in expressing their views – that they acknowledge when issues are not settled.

So in this paper I want to look at how we may use data to measure these aspects. And I’ll be going through some theory around the cultural structure of belief spaces and how this relates to data – big data – in the context of economics (but this theory can be used in other contexts as well).

I want to open on the “economics agora” online. I want to talk about two surveys here – these are open online surveys of economists since the financial crisis. It is no coincidence that these have emerged at this time. These surveys are in the UK and in the USA. And unusually the results include publishing the full responses, and the names of the responders – by their consent. These are famous/well known individuals in their field. This allows us to do more… Bring in data that is not in the survey – the CVs of the respondents for instance so including universities, political activities, their co-authorship network, etc. The survey organisers’ goal is to inform the public, but finding patterns in the data requires aggregation and analysis. This isn’t just individual responses, but understanding the context of the data. And again, this isn’t big data, this is quite small data. But these approaches apply to big data too.

So one of these surveys is the Chicago Booth IGM Economic Experts Panel. Each month they put a question to around 40 economists about some issue of the moment – the impact of autonomous cars, for instance. The second survey is run by the Centre for Macroeconomics, based in London, and again they ask a panel for responses. Typically the UK/European survey shows much more disagreement than the US survey.

There are a lot of issues with these surveys: they are small (the UK/EU one is expanding) and non-random samples; deliberately elitist samples (the US survey draws on the “top 7” economics departments in US universities, mainly Ivy League) – why would you take this sample? Well, you wouldn’t really… but you do get very high-status economists. The UK survey has a much wider range in its sample. I think these surveys are great… but I think they should do a better job! Another problem is the high rate of “softball” questions – in the US survey, not in the UK/EU surveys. For instance “imposing new US tariffs on steel and aluminium will improve Americans’ welfare” – it’s timely, but we already know that there is high consensus here. We need to ask harder questions! And finally we need to think about the motivations of the people who produce the data – the survey designers are looking to raise the profile of the profession. In a Wall Street Journal piece, the designers of the US survey talked about wanting to counteract the idea of a lack of consensus in the field – and they are the ones asking the questions.

Gordon and Dahl (2013) looked at views and consensus in the field based on the surveys. They presented this as being a “remarkably high degree of consensus” with little variance across schools and departments, and considered on that basis how influential the field should be. This got big pick-up… the Washington Post covered it. Nobel-winning economist Paul Krugman picked it up in his New York Times opinion column. He is on record (New York Times, 2009) as saying pretty much the opposite – that there is polarisation between the “saltwater” economists in the Keynesian camp and the “freshwater” economists who are very much the opposite.

So, a bit of theory… What do we mean by consensus, polarisation, factions etc? How do groups of people structure their belief systems? We have decades of literature and theory around understanding belief systems, going back to political scientists in the 1960s. Philip Converse (1964) found that most American voters do not adhere to a coherent political ideology – this is still the case. Their belief systems are disorganised or “unconstrained” – one belief does not let you predict another belief. So, for instance, comparing a belief that you should “reduce immigration” with “reduce corporate tax” could show little correlation; those beliefs don’t automatically go together. Now, if you are a voter in the UK in 2018 there probably is more alignment. That pattern is a “constrained” or “aligned” correlation. If you look at polarisation you see clusters of correlation.
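A quick illustration of “constrained” vs “unconstrained” belief systems as correlation between two survey items – the scores below are invented, and this is my sketch rather than Converse’s data:

```python
# Two survey items scored 1-5 by five respondents (e.g. support for
# reducing immigration vs cutting corporate tax). Invented numbers.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Constrained/aligned: knowing one answer lets you predict the other.
aligned_a = [1, 2, 3, 4, 5]
aligned_b = [1, 2, 3, 4, 5]
print(pearson(aligned_a, aligned_b))   # 1.0: perfectly aligned

# Unconstrained: the two beliefs vary largely independently.
mixed_a = [1, 2, 3, 4, 5]
mixed_b = [3, 5, 1, 4, 2]
print(pearson(mixed_a, mixed_b))       # -0.3: weak, near-zero alignment
```

With many items rather than two, clusters of high pairwise correlation are what show up as the polarised factions described above.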

So, that paper on economists looks for clusters. I looked at polarisation to get at latent ideology, noting partisanship (known involvement in e.g. political left- or right-leaning think tanks etc. – or marked as “none”), current department (freshwater vs saltwater) and belief dimension. Unsurprisingly, those involved in Republican/conservative organisations and those with backgrounds in Democratic/liberal organisations were very different, leaning right and left respectively. This is the same data that generated the paper showing consensus and little variance.

There is a high degree of consensus in this survey, but you can also see ideological alignment. Those can be consistent. But it depends on what you think, and what you ask. The UK survey – more recently expanded to Europe – shows much less consensus. This could mean there is more consensus in the US than in Europe; but it could also mean that the questions being asked in the UK survey are harder questions. The UK survey asks very complex questions… e.g. “Do you agree that, in a period of great uncertainty and after a prolonged period of weak real wage growth, monetary policy makers can afford to wait for greater certainty about real wage developments and building inflationary pressure before raising interest rates?”. So, you can’t measure consensus without a comparison with another group. You can see consensus on a question, not of a group/community or set of beliefs.

So, looking at a recent UK/EU survey on anti-establishment views vs monetary conservatism, you can see a diversity of views here.

So, back to those qualities. Professional consensus is harder to measure than it first appears.

Respondents are asked to give both their vote and their level of confidence. So, when experts give an opinion on hot topics you’d really want a low confidence score, to show you don’t have a partisan respondent on your hands. Looking at the data here in the US surveys, we see a lot of overly confident responses. Respondents with a stronger ideological disposition (an aligned belief structure) exhibit systematic overconfidence. In general, across all questions, when asked politically salient questions they state higher confidence than on questions with little or no political salience.
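The salience/confidence comparison could be sketched like this – the respondents and scores below are invented for illustration, not the IGM data:

```python
# Invented-data sketch: mean stated confidence on politically salient vs
# non-salient questions. A positive gap suggests overconfidence on
# politically charged items.

from statistics import mean

responses = [
    {"respondent": "r1", "salient": True,  "confidence": 9},
    {"respondent": "r1", "salient": False, "confidence": 6},
    {"respondent": "r2", "salient": True,  "confidence": 8},
    {"respondent": "r2", "salient": False, "confidence": 7},
]

def mean_confidence(rows, salient):
    """Average confidence over rows matching the given salience flag."""
    return mean(r["confidence"] for r in rows if r["salient"] is salient)

gap = mean_confidence(responses, True) - mean_confidence(responses, False)
print(gap)  # positive gap = higher confidence on politically salient items
```

In the real analysis one would of course condition on respondents’ ideological alignment as well, which is where the systematic overconfidence finding comes from.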

By way of conclusion… Am I joining ranks with Michael Gove – “people in this country have had enough of experts”? No. I would say something more nuanced. Arguably, if professions in general, and economists in particular, have lost political legitimacy, then professional over-reach (“look how much consensus we have”) is not the answer. Claiming consensus where none exists is over-reach. Transparency about professional debate is always better than overstating consensus. Political legitimacy is a scarce resource and should be treated as such.

The economics agora online is a useful tool for studying the beliefs of an important community of experts… but survey designers should up their game. If you want an “unbiased” expert, choose someone whose belief structure is unconstrained. You probably want someone in the middle – people whose belief systems are not correlated. You need a theory of how groups form beliefs… So read cultural sociology!

Q&A

Q1) In thinking about the resistance to “naturally occurring data” and the idea of an “unbiased expert” – do you have a sense that that isn’t possible… Rather than getting that, should we instead shift the conversation to make the politics relevant – to be clear in a way that makes the numbers make sense…

A1) If we choose which experts to listen to, which do we listen to…

Q1) It was interesting to think of economists as “not political” – if that’s the conversation… I think the non-biased expert… that raises issues. We query whether that even exists… Maybe we can shift the conversation.

A1) I guess I would want to push back a little bit. I am sympathetic to the view that there is no unbiased expert, but… I do a lot of work on economists and how they influence policy. I think the world does need economists, especially for monetary policy and the technical aspects of policy. So, having some tools to understand this profession, how they structure beliefs… We need more tools to unpack that set of questions… I’m trying to find ways to study this profession using quantitative and qualitative tools, and to understand its impact on politics and society.

Q2) You mentioned a graph to show polarisation – how did you do that?

A2) This is not based on data, this is based on theoretical patterns… A series of plots using a test data set to illustrate the patterns of the theory – it’s theoretical rather than empirical data.

Q3) A slight follow up… How much have you played with non linear tools… Consensus and confidence… Research on scientific knowledge shows that people who know a little about science have higher confidence than those who know more… That could impact that data on confidence.

A3) We did look at non-linearity – it doesn’t make a big difference to some measures here.

Q4) What definition of “expert” are you using, and why?

A4) People with PhDs in economics. In the US case they are high-status people in the field… In the UK/EU case it is broader. Most work as professors of economics; some work in the private sector, in financial services. For my purposes it’s holding a PhD in economics… In other work I’ve done on organisations in Latin America, you have senior political elites with those credentials and many without, so boundary work becomes more important there.

Q5) I think some of the Chicago questions also go to the public. Have you looked at that?

A5) It’s not publicly available… I’ve been thinking about asking for that. But it would be interesting to know if members of the public structure their belief systems differently. There is some work that compares public beliefs to these questions.

Q6) I work on spatial models around expert agreement and disagreement – there are interesting measures there and on polarisation. Also dimensionality reduction, since you are trying to identify latent ideological positions… not sure if you’ve looked at that. Political behaviour research has…

Q7) I wanted to ask about how much the very different types of respondents and samples you have between the US and UK/EU surveys. I was particularly wondering about the high status nature of the US experts and how much that status plays a part… You talked about doing some social network and contextual work here so I was wondering the degree to which their network and co-authorship and professional standing feeds into wanting to be seen to take a particular view, or visibly agree.

A7) The social network part, and the co-authorship data, is going to lead to a paper. We found that people who are closer in co-authoring papers are ideologically closer – not totally surprising… So there is a social approval thing and a selection bias. We think the latter is the more likely interpretation here – the homophily effect: they co-author non-political papers, but they still pick ideologically aligned co-authors. The status thing is interesting… The UK/EU expert panel is less hierarchical – maybe that reflects practice. In terms of monitoring each other’s responses… I think it’s more a contrarian thing… They want to find ways to disagree… They can add comments… So lots of “My colleagues all think this, but if you think about it this other way you get this opposite response”.

Q8) My question/comment is about the “unconstrained” idea space – it feels funny and attractive… But also quite negative… Unconstrained… Disorganised… But you are talking about it as a positive quality. Does that suggest they haven’t thought this stuff through?

A8) I’m glad you asked this. This question came up in the 1960s and it was seen as terrible that the ideologies didn’t align to political parties… The field has turned on its head now. In the 1960s this was seen as politically naive. Actually more educated voters are seen to have more constrained beliefs… But with the economists that unconstrained belief system is good, as it shows that they are not bringing in their partisan/ideological standpoint. There is a contradiction there. The idea is that the more information you have, the more constrained your belief system should be… But only to a point. There is a really interesting paper by ? de Surrey and Ari Goldberg that compares ideological voters and unconstrained voters, and they find a third group that is e.g. politically liberal and economically conservative. This is a really interesting area of the literature. There are a bunch of new methods that are getting us nearer that question…

We broke for lunch and workshops at this point… 

Workshops: Parallel workshop sessions – please see descriptors below.

  • Text Analysis for the Tech Beginner – Suzanne Black, PhD student in LLC
  • An Introduction to Digital Manufacture – Mike Boyd (uCreate Studio Manager, UoE)
  • ‘I have the best words’: Twitter, Trump and Text Analysis – Dave Elsmore (EDINA)
  • An Introduction to Databases, with MariaDB & Navicat – Bridget Moynihan (LLC, UoE)
  • Introduction to Data Visualisation in Processing – Jules Rawlinson (Music, ECA, UoE)
  • Jupyter Notebooks and The University of Edinburgh Noteable service – Overview and Introduction – James Reid (EDINA)
  • Obtaining and working with Facebook Data – Simon Yuill (Goldsmiths)

I attended the Introduction to Data Visualisation in Processing workshop which was really interesting, and left me wanting to have a further play to see where it may potentially be useful. 

Round Table Discussion

  • Melissa Terras (MT), Professor of Digital Cultural Heritage
  • Kirsty Lingstadt (KL), Head of Digital Library and Depute Director of Library and University Collections
  • Ewan McAndrew (EM), Wikimedian in Residence
  • Tim Squirell (TS), PhD Student, Science, Technology and Innovation Studies, working on communities, expertise, and negotiations of those concepts.

MT: I wanted to start with quite a personal place… I realised last year that I was sort of grieving for the internet. I grew up with the internet, it’s been a big part of my life and friendships… But the internet has taken a different turn… And there is a need to step away from that a bit to stay sane. There is a need to step back and reflect, and think about the University Space. I feel maybe we could have stepped in… The questions of Facebook, Twitter, the use of data… The human nature of trust… And how we use and engage and archive and preserve some of these spaces… I think that makes it interesting to an academic in the digital space right now.

EM: I think the idea of the web turned quite sour after Cambridge Analytica. Tim Berners-Lee spoke on Channel 4 News about how it’s not enough to build and run the open web – we have to look critically at what is being done with it, what people are building. I also think of the Scottish Referendum, and of Strathclyde University in Glasgow, which called upon all librarians to support political literacy. But that could be “universities” not just “libraries” – there is a need for much more information literacy, almost as a service.

KL: The role of the university is about knowledge and supporting and preserving knowledge, with the library central to that… As the digital world changes we need those skills of information literacy, to think critically about what we see on the web, and how we understand that. That’s an important thread the library offers and supports. The arts, humanities and social sciences really support that development of critical engagement, literacy, context and the origins of big data. I was very much chiming with CILIPS work on information literacy – the university library has a really important part to play here…

TS: I want to make three brief points, on engagement, expertise and access. One of the things I’ve observed around online communities is that there is a tendency not to notice a community until something happens. I study some quite extreme communities, including the involuntary celibate community, and you can’t raise interest until people go out and kill people. We really need to see more engagement and understanding, not just treating these communities as objects of interest. The second point is about experts and what that means… I think that reification of expertise is naive at best, and often dangerous. Only engaging with experts, or using them to corroborate your beliefs, or feeling that you only engage with an expert class, overlooks the way most people engage with issues. And finally on access… In light of Cambridge Analytica, Facebook has shut down research access for all but its own programme (run with funding councils). Doing that means only people working at the companies, or at elite universities with particular track records, get access…

Comment: Interesting that you mentioned Tim Berners-Lee, as he was the reason Web Science got set up at Southampton. The narrative was… I invented the web (discuss) and it has gone wrong (discuss). That was a perspective that didn’t problematise information or communication. The idea was that we would reengineer the web (discuss) as if it is technical, not a complex socio-technical network. I’m not being negative but supporting your statements. The restructuring of the Information Technology GCSE was a travesty – there was no attempt at critical engagement, just at programming. And it is really important that we envision what we want the web to be. There is no fixed idea of the web. We have gone down the rabbit hole of behavioural tracking and advertising as the only economic model… But we could play with that. I would make a pitch for utopianism… With Donna Haraway: staying with the trouble, and thinking about what else we could do.

Comment: I wondered about… that sense of the internet as being what we hoped it could be… But also the issue of the attack on net neutrality in the US, and immediate recognition that that isn’t ok… How do we back away, not engage in the toxic parts of the internet… But also save the parts that are worth saving… Keeping an eye on legislation? Do we protect without participating?

MT: I immediately started to think of how we talk about bitcoin – very utopian visions, now turning into a profit-making machine, as has happened with the internet… How do we build structures that can be used to make money… without that consuming the rest of it? The internet is consuming all the other stuff… I think bitcoin will be the same… The same people who had money 200 years ago will be the same people who’ll make money now… It’s partly information literacy, partly being cynical, being civic… Being alive to issues…

TS: I am going to say two contradictory-sounding things… So many of these issues are engineering solutions to social problems. I was at a conference with someone talking about a blockchain-based education network, with a smart contract to validate credentials – taking the human out of the process in order to improve the situation. Bitcoin is supposed to be trustless… But at some point you have a human interface, and it will fail… You will always face problems you couldn’t spot – unless you spoke to a social scientist. But what goes with that, for us as social scientists, is the need to engage with the engineering side of things… There’s lots of “if only we could have known what would happen with Cambridge Analytica”, but we’ve known about that for years… We struggle to be listened to by policy makers when compared with businesses, who have legitimate routes in and argue for a lack of accountability. Platforms are not neutral; you can engineer the behaviours available in the space. You have to understand the feedback loop between administration and engineering.

EM: Thinking about democratisation… And thinking about utopian visions… Putting my Wikimedian hat on… I think that it has been amazing to see the work done by students here… There is real benefit to having a very transparent space online where you can query or change or contribute to the world. Wikipedia is committed to keeping the human element at its core. One of the ways that Wikipedia checks and balances the data is that you can’t edit certain protected pages unless your account is at least four days old (and has made a minimum number of edits).

KL: That’s where libraries of all kinds come in – a space or platform to trace the source, the archive materials… And digital data… Data curation and longer term lifecycles.. Digital content being created… To check, to contribute.

Comment: There’s an interesting underlying narrative that the web has gone wrong, and that the economy has gone wrong… As if these structured inequalities are accidental, but they are not – they are deliberate. We need a critical historical narrative of the web and how this has taken place… And of where the web has come from. We need more engagement from the humanities here… There are underlying themes here.

Comment: From literary and fan fiction studies we have for years been talking to a literature and community that exists online and how that interacts online. Fan fiction is often written by women, by BME and LGBTQ and non-binary people… We have a cry of “own the servers” to avoid exploitation… Could anyone comment on that type of utopian vision – the local and the global… Who accesses the data…

KL: From my context of the library, it’s about putting materials out there to access what they need as equitably as possible… But that’s difficult… For archives and personal material there are restrictions and limitations for good reason… We haven’t cracked that perfectly… It is a challenge, there isn’t an easy answer to it…

EM: From a Wikipedia angle… Wikipedia had a conversation within and around the community about where the community is going by 2030… Where it was going, what it needed to do to share and provide access to knowledge around the world… To enable better understanding… Towards more civic and better societies. But there are huge disparities of access. Out of that came the sense of knowledge not as a product but as a service. And the idea of knowledge equity – in terms of access, but recognising that only 10% of editors are female, it’s Northern Hemisphere orientated, and only 2.5% of geotagged content relates to Africa. It’s not shying away from that, but instead trying to address it over time… Which is why Wiki Project Medicine has created “the internet in a box” to enable access to a downloaded medical version of the content, to improve access to information.

Comment: From a biological sciences background… My question underpins everything here… We haven’t really touched on digital preservation – it’s a big and worrying thing. I’ve listened to comment on big gaps in digital data; it’s really difficult in the long term. How will that be affected by GDPR, and what can be done there in terms of preservation and access? We are looking more and more at the cloud… The carbon footprint of ICT is expected to be 40% by 2040. Thinking about preservation and the more and more carbon-intensive nature of the web, what can universities do to tackle these issues?

KL: Digital preservation is close and dear to us. It is challenging and not easy. It’s not a commodity you can just buy; there isn’t one way to do this. We are trying to tackle certain areas. We are trying to preserve the university’s history. We also look actively at research data produced by the University. Addressing those two areas, there is still a huge area of web output and web archiving… There is interest in the University’s output, but less in the wider context. We acknowledge that agenda and push it up in the university – digital humanities helps here, and access to information helps us make our case. GDPR does present complexity; it does mean working with encryption… For company/global content that’s broader.

Comment: In terms of the issue of experts… I think it’s interesting to see experts by credentials, or by reputation… And how that relates to the internet… It seems like a great way to be a self-made expert… To promote yourself as an expert because you have a blog. You may have stature and influence… But that’s very different from a PhD or academic expertise… I’m interested that part of being an expert is admitting when you don’t know something… It seems the public wants experts to tell them the answer right now… What is the role of the internet here?

TS: I have a lot of thoughts on this. It’s basically my PhD. If I ramble… Stop me… I think this is fundamentally about the way we reconceptualise expertise… There is the idea of it being reified, as rare and based on credentials, and that being in conflict with other types of self-made influentials. Steven Taylor has a paper on experts across three types, including this group of self-made experts… They come to represent a much larger group – it hasn’t democratised broadcast but it’s certainly opened up and broadened the field somewhat. When we understand expertise as belonging only to credentialed people in specific organisations, we limit communication. We have to be able to engage as compellingly as these people who are able to weaponise, essentially, nonsense, and see how we can be as engaging as them. We have to be provocative and interesting. We can’t expect people to just come and ask the right experts. The burden shouldn’t be on audiences; the burden should be on “experts” to be palatable and appealing as experts.

MT: The anti-expertise thing isn’t a new thing either… It goes right back to the founding of universities, particularly in the Victorian era… I have a book coming out on professors in children’s literature, with an accompanying anthology, and every single story is “the professor is rubbish”. All of them. All about not trusting experts, just when expertise is being formalised… The general populace ridiculing them… The internet has boosted that again. But on a positive note… Crowdsourcing is a positive development… We did a few crowdsourcing projects that truly changed access to and use of information – work that used to only be done by palaeographers, looking at Jeremy Bentham’s papers… The internet helped us speed that all up… If we have the right platforms, the right structures, we can do the right things… But we can’t let “expertise is rubbish” perpetuate.

EM: Again with digital preservation, there is a cost attached… There may be volunteers… If there is a platform or a lack of cost… You can do a lot. And archive a lot in public ways…

KL: I was going to add that the cultural heritage sector has an interesting relationship with working with the community… But there is this tension about how and who can contribute, and who can do it best. But the crowd is full of enthusiasm… As long as work is provenanced… That is a really good way to positively use the web.

Comment: In response to the Cambridge Analytica stuff… And why didn’t they listen to the social scientists… Isn’t GDPR an example of the law doing as good a job as it could… And data ownership… Legislative work in Europe on copyright and data ownership… If we want to set the right example, it’s not enough to throw up our hands in horror… You have to engage in legislative process… Laws do have an impact in cyberspace.

Comment: Business models – and how we change them – shape the platform. Investment doesn’t go in equally – and as universities we do start-ups, we do engagement with industry. How do we move beyond all of these businesses being set up by young wealthy guys, and open that up… And reconceptualise success as more than just exit, and data as asset – that being personal data. I also wanted to note that web archiving does take place – with the Internet Archive, who operate in the more permissive US copyright context (and are mirrored in Canada – they were concerned that Trump might interfere with the archive). There is a small but politically aware web archiving community, but part of making that and any platform work is about acknowledging that there is cost to running platforms, and to archiving materials…

Comment: That idea of “an expert” – surely we reconceptualise the expert as a distributed thing.

TS: Yes.

MT: And with that I’d like to thank the panel and draw this to a close. We hope to have some announcements in the next year about expanding this work, and this day takes place in an environment that contributed to my coming to Edinburgh, with the City Deal, and with the work driving Edinburgh to be the Data Driven Innovation capital of Europe.

May 022018
 

This morning I’m at the “Working with the British Library’s Digital Content, Data and Services for your research (University of Edinburgh)” event at the Informatics Forum to hear about work that has been taking place at the British Library Labs programme, and with BL data recently. I’ll be liveblogging and, as usual, any comments, questions, corrections etc. are welcome.

Introduction and Welcome – Professor Melissa Terras

Welcome to this British Library Labs event, this is about work that fits into wider work taking place and coming here at Edinburgh. British Library Labs works in a space that is changing all the time, and we need to think about how we as researchers can use digital content and this kind of work – and we’ll be hearing from some Edinburgh researchers using British Library data in their work today.

“What is British Library Labs? How have we engaged researchers, artists, entrepreneurs and educators in using our digital collections” – Ben O’Steen, Technical Lead, British Library Labs

We work to engage researchers, artists, entrepreneurs and educators to use our digital collections – we don’t build stuff, we find ways to enable access and use of our data.

The British Library isn’t just our building in St Pancras; we also have a huge document supply and storage facility in Boston Spa. At St Pancras we don’t just have the collections – we have space to work, we have reading rooms, and we have five underground floors hidden away there. We also have a public mission and a “Living Knowledge Vision” which helps us to shape our work.

British Library Labs has been running for four years now, funded by the Andrew W. Mellon Foundation, and we are in our third funded phase where we are trying to make this business as usual… The BL supports the reader who wants to read 3 things, and the reader who wants to read 300,000 things. To do that we have some challenges to face to make things more accessible – not least helping people deal with the sheer scale of the collections. And we want to avoid people having to learn unfamiliar formats and methodologies which are about the library and our processes. We also want to help people explore the feel of collections, their “shape” – what’s missing, what’s there, why, and how to understand that. We also want to help people navigate data in new ways.

So, for the last few years we have been trying to help researchers address their own specific problems, but also trying to work out if each is part of a wider problem, to see where there are general issues. But a lot of what we have done has been about getting started… We have a lot of items – about 180 million – but any count we have is always an estimate. Those items include 14m books, 60m patents, 8m stamps, 3m sound recordings… So what do researchers ask for?

Well, researchers often ask for all the content we have. That hides the failure that we should have better tools to help them understand what is there, and what they want. That is a big ask, and it means a lot of internal change. So, we try to give researchers as much as we have… Sometimes that’s TBs of data, sometimes GBs… And data might be all sorts of stuff – not just the text but the images, the bindings, etc. If we take a digitised item we have an image of the cover, we have pictures, we have text, we also have OCR for these books – when people ask for “all” of a book, is that the images, the OCR, or both? One of those is much easier to provide…

Facial recognition is quite hot right now… That was one of the original reasons people wanted to access all of the illustrations – I run something called the Mechanical Curator to help highlight those images – they asked if they could have the images – so we now have 120m images on Flickr. What we knew about the images was the book, and the page. All the categorisation and metadata now there has come from people and machines looking at the data. We worked with Wikimedia UK to find maps, using manual and machine learning techniques – kind of in competition – to identify those maps… And they have now been moved into georeferencing tools (bl.uk/maps) and fed back to Flickr and also into the catalogue… But that breaks the catalogue… It’s not the best way to do this, so that has triggered conversations within the library about what we do differently, what we do extra.

As part of the crowdsourcing I built an arcade machine – and we ran a game jam with several usable games to categorise or confirm categories. That’s currently in the hallway by the lifts in the building, and was the result of work with researchers.

We put our content out there under a CC0 license, and then we have awards to recognise great use of our data. And this was submitted – the official music video for “Hey There Young Sailor”, using that content! We also have the Off the Map competition – a curated set of data for undergraduate gaming students based on a theme… Every year there is something exceptional.

I mentioned the library catalogue being challenging. And people not always understanding that when you ask for everything, that isn’t everything that exists. There are still holes… When we look at the metadata for our 19th century books we see huge amounts of data in [square brackets], meaning the data isn’t known but is the best suggestion. And this becomes more obvious when we look at work researcher Pieter Francois did on the collection – showing spikes in publication dates at 5-year intervals… Which reflects guesses at publication year that tend to be e.g. 1800/1805/1810. So if you take intervals to shape your data, it will be distorted. And what we have digitised is not representative of that, and it’s a very small part of the collection…
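That five-year spike pattern is easy to check for on any list of catalogued publication years. This is an illustrative sketch with synthetic, made-up years – not real BL records:

```python
from collections import Counter

# Synthetic publication years (illustrative only, not real BL data):
# catalogued guesses tend to land on round years like 1800, 1805, 1810.
years = [1798, 1800, 1800, 1800, 1801, 1803, 1805, 1805, 1805, 1810, 1810]

counts = Counter(years)
# Share of records falling on five-year boundaries - a high share
# suggests many dates are cataloguers' guesses rather than known dates.
on_boundary = sum(c for y, c in counts.items() if y % 5 == 0)
share = on_boundary / len(years)
print(f"{on_boundary}/{len(years)} records on 5-year boundaries ({share:.0%})")
```

On real data you would compare this share against what a uniform spread of true publication dates would give (roughly 20%).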

There is bias in digitisation then, and we try to help others understand that. Right now our digitised collections are about 3% of our collections. Of the digitised material, 15% is openly licensed. But only about 10% is online. About 85% of our collections can only be accessed “on site”, as licenses were written pre-internet. We have been exploring that, and exploring what that means…

So, back to use of our data… People have a hierarchy of needs, from big broad questions down to filtered and specific queries… We have to get to the place where we can address those specific questions. We know we have messy OCR, so that needs addressing.

We have people looking for (sometimes terrible) jokes – see Victorian Humour run by Bob Nicholson based on his research – this is stuff that can’t be found with keywords…

We have Kavina Novrakas mapping political activity in the 19th century. This looks different but uses the same data and the same platform – using Jupyter Notebooks. And we have researchers looking at black abolitionists. We have SherlockNet trying to do image classification… And we find work all over the place building on our data and our images… We found a card game – Moveable Type – built on our images. And David Normal building montages of images. We’ve had the Poetic Places project.

So, we try to help people explore. We know that our services need to be better… And that our services shape expectations of the data – and can omit and hide aspects of the collections. Exploring data is difficult, especially with collections at this scale – and it often requires specific skills and capabilities.

British Library Labs working with University of Edinburgh and University of St Andrews Researchers

“Text Mining of News Broadcasts” – Dr. Beatrice Alex, Informatics (University of Edinburgh)

Today I’ll be talking about my work with speech data, which is funded by my Turing fellowship. I work in a group who have mainly worked with text, but this project has built on work with speech transcripts – and I am doing work on a project with news footage, and dialogues between humans and robots.

The challenges of working with speech include its particular characteristics: short utterances, interjections; speaker assumptions – different from e.g. newspaper text; turn taking. Often transcripts miss sentence boundaries and punctuation, or are missing case distinctions. And there are errors introduced by speech recognition.

So, I’m just going to show you an example of our work, which you can view online – https://jekyll.inf.ed.ac.uk/geoparser-speech/. Here you can do real-time speech recognition, and this can then also be run through the Edinburgh Geoparser to look for locations and identify them on the map. There are a few errors and, where locations haven’t been recognised in the speech recognition, they also don’t map well. The steps in this pipeline are speech recognition (ASR), then Google text restoration, and then text and data mining.
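As a rough illustration of the geoparsing step in that pipeline, here is a naive gazetteer lookup. This is only a sketch – the real Edinburgh Geoparser does far more (tokenisation, named entity recognition, disambiguation between places sharing a name) – and the coordinates are approximate:

```python
# Naive gazetteer-based geoparsing sketch. The real Edinburgh Geoparser
# performs tokenisation, NER and disambiguation; this only does exact
# string lookup. Coordinates are approximate.
GAZETTEER = {
    "Edinburgh": (55.95, -3.19),
    "Glasgow": (55.86, -4.25),
    "Aberdeen": (57.15, -2.09),
}

def geoparse(text):
    """Return (place, lat, lon) for each gazetteer entry found in text."""
    return [(name, lat, lon)
            for name, (lat, lon) in GAZETTEER.items()
            if name in text]

print(geoparse("The train from Edinburgh to Aberdeen was delayed"))
```

The errors mentioned in the talk follow directly from this structure: if ASR garbles a placename, the lookup (and hence the map pin) fails.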

So, at the BL I’ve been working with Luke McKernan, lead curator for news and moving images. I have had access to a small set of example news broadcast files for prototype development. This is too small for testing/validation – I’d have to be onsite at BL to work on the full collection. And I’ve been using the CallHome collection (telephone transcripts) and BBC data which is available locally at Informatics.

So, looking at an example, we can see good text recognition. In my work I have implemented a case restoration step (named entities and sentence initials) using rule-based lexicon lookup, and also using Punctuator 2 – an open source tool which adds punctuation. That works much better, but isn’t yet at an ideal level. Meanwhile the Geoparser was designed for text, so it works well but misses things… Improvement work has taken place but there is more to do… And we have named entity recognition in use here too – looking for locations, names, etc.
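The rule-based lexicon lookup for case restoration might be sketched like this. This is a hypothetical stand-in, not the project’s actual code (and punctuation restoration via Punctuator 2 is a separate step):

```python
def restore_case(tokens, lexicon):
    """Rule-based case restoration: lowercase ASR tokens are recased
    using a lexicon of known named entities, plus sentence-initial
    capitalisation. A hypothetical stand-in for the lexicon-lookup
    step described in the talk."""
    out = []
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            out.append(lexicon[tok])   # known named entity
        elif i == 0:
            out.append(tok.capitalize())  # sentence-initial token
        else:
            out.append(tok)
    return out

lexicon = {"edinburgh": "Edinburgh", "bbc": "BBC"}
asr = "the bbc reported flooding in edinburgh".split()
print(" ".join(restore_case(asr, lexicon)))
# The BBC reported flooding in Edinburgh
```

Restoring case this way matters downstream: geoparsers and NER tools trained on written text expect “Edinburgh”, not “edinburgh”.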

The next steps are to test the effect of ASR quality on text mining – using CallHome and BBC broadcast data – with formal evaluation; to improve the text mining on speech transcript data based on further error analysis; and, in the longer term, applications in the healthcare sector.

Q&A

Q1) Could this technology be applied to songs?

A1) It could be – we haven’t worked with songs before but we could look at applying it.

“Text Mining Historical Newspapers” – Dr. Beatrice Alex and Dr. Claire Grover, Senior Research Fellow, Informatics (University of Edinburgh) [Bea Alex will present Claire’s paper on her behalf]

Claire is involved in an Administrative Data Research Centre Scotland project looking at local Scottish newspapers, aiming to text mine them and connect them to other work. Claire managed to get access to the BL newspapers through Cengage and Gale – with help from the University of Edinburgh Library. This isn’t all of the BL newspaper collection, but part of it. This collection of data is also now available for use by other researchers at Edinburgh. Issues we had here were that access to more recent newspapers is difficult, and the OCR quality. Claire’s work focused on three papers in the first instance, from Aberdeen, Dundee and Edinburgh.

Claire adapted the Edinburgh Geoparser to process the OCR format of the newspapers and added local gazetteer resources for Aberdeen, Dundee and Edinburgh from OS OpenData. Each article was then automatically annotated with paragraph, sentence and word mark-up; named entities – people, places, organisations; locations; and geo coordinates.

So, for example, a scanned item from the Edinburgh Evening News from 1904 – it’s not a great scan but the OCR is OK, if erroneous. Named entities are identified, locations are marked. Because of the scale of the data Claire took just one year from most of the papers and worked with a huge number of articles, announcements, images etc. She also drilled down into the geoparsed newspaper articles.

So for Aberdeen in 1922 there were over 19 million word/punctuation tokens and over 230,000 location mentions. She then used frequency methods and concordances to understand the data. For instance she looked for mentions of Aberdeen placenames by frequency – and that shows the regions/districts of Aberdeen – Torry, Woodside, and also Union Street… Then Claire dug down again… Looking at Torry, the mentions included Office, Rooms, Suit, etc., which gives a sense of the area – a place people rented accommodation in. In just the news articles (not ads etc.) the Torry mentions are about Council, Parish, Councillor, politics, etc.
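The frequency lists described here can be approximated in a few lines of Python. This is a toy illustration with a made-up sentence, not the project’s code:

```python
import re
from collections import Counter

def place_frequencies(text, placenames):
    """Count mentions of known placenames in a text, most frequent
    first - the kind of frequency list described above."""
    tokens = re.findall(r"[A-Za-z]+", text)
    wanted = set(placenames)
    return Counter(t for t in tokens if t in wanted).most_common()

sample = "Torry Parish Council met in Torry; Woodside and Union were absent"
print(place_frequencies(sample, ["Torry", "Woodside", "Union"]))
```

At the scale of 19 million tokens this same approach works unchanged; the expensive part in practice is the upstream OCR and geoparsing, not the counting.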

Looking at concordances, Claire looked at “fish”, for instance, to see what else was mentioned and, in summary, she noted that the industry was depressed after WW1; there was unemployment in Aberdeen and the fishing towns of Aberdeenshire; there was competition from German trawlers landing Icelandic fish; there were hopes to work with Germany and Russia on the industry; and government was involved in supporting the industry and taking action to improve it.
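A concordance (keyword-in-context) view like the one described can be sketched as follows – again a toy version with invented text, not the actual tooling:

```python
def concordance(tokens, keyword, window=3):
    """Simple keyword-in-context (KWIC): return the window of tokens
    around each occurrence of the keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = max(0, i - window)
            hits.append(" ".join(tokens[left:i + window + 1]))
    return hits

text = "prices for fish fell again as German trawlers landed Icelandic fish"
print(concordance(text.split(), "fish", window=2))
```

Reading down the resulting lines is how patterns like “German trawlers” and “Icelandic fish” surface from thousands of mentions.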

With the Dundee data we can see the Topic Modelling that Claire did for the articles – for instance clustering of cars, police, accidents etc; there is a farming and agriculture topic; sports (golf etc)… And you can look at the headlines from those topics and see how that reflect the identified topics.

So, next steps for this work will include: improving the text analysis and geoparsing components; getting access to more recent newspapers – there is missing infrastructure for larger data sets but we are working on this; scaling up the system to process the whole data set and store the text mining output; tools to summarise content; and tools for search – filtering by place, date, linguistic context – tools beyond the command line.

“Visualizing Cultural Collections as a Speculative Process” – Dr. Uta Hinrichs, Lecturer at the School of Computer Science (University of St Andrews)

My research focuses on visualisation and Human Computer Interaction. I am particularly interested in how interfaces can make visible digital collections. I have worked on a couple of projects with Bea Alex and others in the room to visualise texts. I will talk a little bit about LitLong, and the process in developing early visualisations for the project.

So, some background… Edinburgh is a UNESCO City of Literature, with lots of literature about and in the city. And we wanted to automate the discovery of Edinburgh-based literature from available digitised texts. That included a large number of texts – about 380k – from collections including the BL 19th Century Books collection. And we wanted to make the results accessible to the public.

There were lots of people involved here, from Edinburgh University (PI: James Loxley), Informatics, St Andrews, and EDINA. We worked with out-of-copyright texts, but also had special permission to work with some in-copyright texts, including Irvine Welsh. And a lot of work was done to geoparse the texts – and assess their Edinburghyness. For each mention we had the author, the title, the year, and snippets of the text from around the mention. This led to visualisations – I worked on LitLong 1.0 and I’ll talk about this, but a further version (LitLong 2.0) launched last year.

So you can explore clusters of places mentioned in texts, you can explore the clustered words and snippets around the mentions. And you can zoom in to specific texts – again you can see the text snippets in detail. When you explore the snippets, you can see what else is there, to explore other snippets.

So in terms of the design considerations we wanted a multi-faceted interactive overview of the data – Edinburgh locations; books; extracted snippets; authors; keywords. Maps and lists are familiar, and we wanted this tool to be accessible to scholars but also the public. We took an approach that allows "generous" exploration (Mitchell Whitelaw, 2015), so there are suggestions of how to explore further, with parts of the data showing… Weighted tag clouds, for instance, let you get a feel for the data.
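The weighted tag clouds mentioned here can be approximated with simple word counts over the extracted snippets – a minimal sketch with made-up snippets and an ad hoc stopword list, scaling counts to font sizes:

```python
from collections import Counter

# Hypothetical snippets extracted around place mentions.
snippets = [
    "the old town wynds were dark and narrow",
    "dark closes of the old town",
    "the castle rock above the old town",
]

STOPWORDS = {"the", "and", "of", "were", "above"}

# Count content words across all snippets.
counts = Counter(
    word
    for s in snippets
    for word in s.split()
    if word not in STOPWORDS
)

# Scale counts linearly to font sizes for a weighted tag cloud, e.g. 10-30pt.
min_c, max_c = min(counts.values()), max(counts.values())
sizes = {
    w: 10 + 20 * (c - min_c) / max(max_c - min_c, 1)
    for w, c in counts.items()
}
```

Real systems would add stemming and tf-idf-style weighting, but the principle – frequency mapped to visual weight – is the same.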

As a process it wasn't a case of the text mining happening and then we magically had the visualisations… It was iterative. And we actually used visualisation tools to assess which texts were in scope and which weren't going to be relevant – and to mark texts to keep or to rule out. This interface included information on where in a text the mention occurred – to help identify how much a text actually was about Edinburgh.

We had a creative visualisation process… We launched the interface in 2015, and there was some iteration, and that also inspired LitLong 2.0, which is a much more public-friendly way to explore the material in different ways.

So, I think it is important to think about visualisation as a speculative process. This allows you to make early computational analysis approaches visible and facilitate QA and curatorial processes. And to promote new interactions, transforming a print-based culture into something different – thinking about materiality rather than just content is important as we enable exploration. When I look back at my own work I see some similarities in interfaces… You can see the unique qualities of the collections in the data trends, but we are doing much more work on designing interfaces that surface the unique qualities of the collection in new ways.

Q&A

Q1) What did you learn about Edinburgh or literature in Edinburgh from this project?

A1) The literature scholars would be better able to talk about that, but I know it has inspired new writers, and it has been used in teaching. We also discovered some characteristics of Edinburgh, and of women writers in the corpus… James Loxley (Edinburgh) and Tara Thompson (Edinburgh Napier University) could say more about how this is being used in new literary research.

“Public Private Digitisation Partnerships at the British Library” – Hugh Brown, British Library Digitisation Project Manager

I work as part of the Digital Scholarship team at the British Library, which was founded in 2010 to support colleagues and researchers to make innovative use of BL digital collections and data – recognising the gap in provision we had there. The team is led by Adam Farquhar, Head of Digital Scholarship, and by Neil Fitzgerald, Head of Digital Research Team. We are cross-disciplinary experts in the areas of digitisation, librarianship, digital history and humanities, computer and data science, and we look at how technology is transforming research and, in turn, our services. And we include the British Library Labs, Digital Curators, and the Endangered Archives Programme (EAP).

So, we help get content online and digitised, we support researchers, and we run a training programme to bridge skills so that researchers can begin to engage with digital resources. We expect that in 10-15 years time those will be core research skills so we might not exist – it will just be part of the norm. But we are a long way off that at the moment. We also currently run Hack and Yack events to experiment and discuss. And we also have a Reading Room to share what’s happening in the world, to share best practice.

In terms of our collections and partnerships, we have historically had a slightly piecemeal digitisation approach, so we now have a joined-up strategy that sits under our Living Knowledge strategy and includes partnership, commercial strategy and our own collection strategy. Our partnerships recognise that we don't always have the skills we need to make content available, whilst our commercial strategy – where I work – allows us to digitise as much as possible, in a context where we don't have infinite funding for digitisation.

We have various factors in mind when considering potential partnerships. The types of approach include partnerships based on whether materials are in or out of copyright – if in copyright then commercial partners have to clear rights. We do public/private partnerships with technology partners. We have non-commercial organisational and/or consortium funding. And we have philanthropic donor-funded work. Then we think about content – content strategy, asset ownership, digitisation location. We think about value – audience type/interest/geography, and topicality. We think about copyright – whether the British Library owns the rights, rights of reuse. We think about discoverability – the ability to identify and search, and access that maximises exposure. We look at the (BL) benefit – funding, access etc. We look at risk. And we look at contract – whether it is non-exclusive, commercial/non-commercial.

So, we have had public-private digitisation partnerships with Gale Cengage Learning, Adam Matthew Digital, findmypast, Google Books, Microsoft Books, etc. And looking at examples: Google Books has been 80m+ images digitised; Microsoft Books was 25m images; findmypast has done 23m+ images of newspapers; Gale Cengage Learning has done 18th century collections (22m images), 19th century collections online (2.2m+ images), and Arabic books, etc.

The process begins with liaison with key publishers. Then there is market and content research. Then we agree a plan, including licensing of rights for a fixed term (5-10 years), royalty arrangements and reading room access. Then digitisation takes place, funded by the partner – either by setting up a satellite studio, or using the BL studio. So our partners digitise content and give us that content; in exchange they get a 5-10 year exclusive agreement to use that content on their platform. And revenue generated for the BL helps support what we do, and our curators' work around digitisation.

So Findmypast was an interesting example. We had electoral registers and India Office Records – data with real commercial value. So we put out a tender for a digitisation partner, and Findmypast was selected… Part of that was to do with the challenges of the electoral registers, which were in inconsistent formats etc. so needed a lot of specific work. We also needed historical country boundaries to be understood to make it work. And there was a lot of manual OCR work to do.

With Gale Cengage, they tend to be education/university focused and they work with researchers. We worked with them to select 19th century materials to fit their themes and interests. They did the early Arabic books project – a really complex project. And the private case collection, which consisted mainly of books that had been inaccessible on grounds of obscenity from between around 1600 and 1960.

With Adam Matthew Digital, we were approached to contribute material from the electoral registers and India Office Records, and materials on the East India Company.

Now these are exciting projects but we want 20-30% of content generated in these projects to be available as a corpus for research and that’s important to our agreements.

Challenges in the workflow include ensuring business partners and scanning vendors have a good understanding of the material the BL holds in our collections. We have to define and provide the metadata requirements the BL needs to supply to the partners. We have to get statistics and project plans from business partners. There are logistical challenges around understanding the impact of digitisation on BL departments supporting the process. We have to manage partners' business drivers versus BL curatorial drivers. We have to manage the partners' digitisation vendors on site. And we have to ensure the final digital assets/metadata received meet BL requirements for sign off and ingest.

Q&A

Q1) How can we actually access this stuff for research?

A1) For pure research that can be done. For example we have a company in Brighton who are doing research on the electoral roll. That’s not in competition with what the private partner is doing.

Comment from Melissa) My experience is “don’t ask, don’t get” – so if you see something you want to use in your research, do ask!

“The Future of BL Labs and Digital Research at the Library” – Ben O’Steen

I’ve handed out some personas for users of our digital collections – and a blank sheet on the back. We are trying to build up a picture of the needs of our users, their skills and interests, and that helps us illustrate what we do – that’s a thing to come back to (see: https://goo.gl/M41Pc4/)

So I want to talk about the future of BL Labs. We are a project and our funding is due to finish. Our role has been to engage with researchers and that is going to continue – maybe with that same brand, just not as a project. We need to learn what researchers want to do… We need to collect evidence of demand. And we are developing a business model and support process to make this "business as usual" at the BL. We want to help create a pathway to developing a "Digital Research Suite" at the BL by 2019. But we want to think about what that might be, and we are piloting ideas including small two-person workrooms for digital projects. And we can control access – so that we can see how this works, and ensure that users understand what you can and cannot do with the data (that you can't just download everything and walk out with it).

And many other places are being “inspired” by our model – take a look at the Library of Congress work in particular.

So, at this stage we are looking at our business model and how we can make these scalable services. Our model to date has been smaller scale, about capabilities to get started, etc. That is not scalable at the level we've been working. We need a more hands-off process and to be able to see more people. We also run the BL Labs Awards which, instead of working with people, recognise work people have already done. People submit and then in October our advisory board reviews the entries and looks for work that champions our content.

To develop our business model we are exploring, evaluating and implementing options, using the business model canvas. We have internal and external business model development, implementation and evaluation groups, exploring how this could work in practice. And we are testing, piloting and implementing the model. That means:

  • developing support service
    • Entry level – about the collection, documentation improvements, case studies that help show what is in there.
    • Baseline – basic enquiry service to enable researchers to understand if a BL project is the right path, any legal restrictions that need addressing, etc. We try to get you to the next stage of developing your idea.
    • Intermediate – Consultation service, which will be written in as part of a bid.
    • Advanced – support for 10 projects per year through an application process
  • Augment data.bl.uk – that was a placeholder for a year, and now a tender has just gone out for a repository type service for 12-18 months
    • e.g. sample datasets, tools, examples of use
    • Pilot use of Jupyter Notebooks / Docker and other tools for open and onsite data
  • Researcher access to BL APIs
  • Reading room services – onsite access/compute for digital collections – which means us training staff

This has come about as we've seen a pattern in approaches that start with an initial exploration phase, then transition into investigation and then some sort of completion phase. There had been a false assumption (on the data providers' part) that data-based work must start at the investigation phase – that you have an idea of the project you want to do, know the data already, know the collections. What we are piloting is that essential exploratory stage, acknowledging that it happens. And that pattern shifts around – exploration and investigation stages can fork off in different directions, and that's fine.

So, in terms of timescales, exploration seems to be a phase of quick initial work. A longer and variable transition then takes place into investigation – probably months. Then investigation itself takes months to a year. And crucially there is that completion stage.

Exploration is about understanding the data in an open-ended fashion. It is about discovering the potential tools to work with the data. We want people to gain awareness of their capabilities and limitations – a reality check and an opportunity to understand the need for partners and/or new tools. And it's about developing a firmer query, as that helps you to understand the cost, risk and time you might need. Exploration (e.g. the V&A Spelunker) lets you get a sense of what's there, which gives you a different way in than keyword or catalogue search. And then you have artists like Mario Klingemann – collating images of people looking sad… It's artistic but says something about how women are portrayed in the 19th century. He's also done work on hats on the ground – and found it's always a fight! This is showing cultural memes – an important question… An older example is the Cooper Hewitt collection, which lets you see all of the tags – including various types of similarity that show new ways into the data.

So, what should a digital exploration service look like? Which apps? Does Jupyter Notebook assume too much?

We’ve found that every time we present the data, it shapes the perception. For instance the On the Road manuscript is on a roll. If you print a book on a receipt roll it’s different and reads and is understood differently.

MIT have a Moral Machine survey (http://moralmachine.mit.edu/) which is the classic trolley problem – crowdsourced for autonomous vehicles. But that presentation shapes and limits the questions, and that is biased. Some of the best questions we've seen have been from people who have asked very broad questions and haven't engaged in exploration in other ways. They are hard to answer (e.g. all depictions of women) but they reveal more. Presenting results as a searchable list shapes how we interpret them… But, for instance, showing newspaper articles as if in a giant newspaper – not a list of results – changes what you do. And that's why tools like IIIF seem useful.

So… We have things like Gender API. It looks good, it looks professional… If you try it with a western name, does it work? If you try it with an Indian name, does it work? If you try it with a 19th century name, does it work? Know that marketeers will use this. See also sentiment analysis. Some of these tools are based on Twitter. I found a researcher working on 18th century texts, looking for sentiment about war and conflict… through a tool developed and trained for tweets. We have to be transparent in what is happening, in understanding what you are doing… Hence thinking about personas.
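The point here – test a modern-trained tool against your actual historical material before trusting it – can be sketched as a small validation harness. The classifier below is a deliberately naive stand-in with a tiny modern lexicon, not the real Gender API or any actual sentiment tool:

```python
def validate_tool(classify, samples):
    """Run a third-party classifier over period-appropriate samples
    and report how often it returns an answer at all - a first sanity
    check before trusting a tool trained on modern data (e.g. tweets)."""
    results = {"answered": 0, "unknown": 0}
    for text in samples:
        label = classify(text)  # stand-in for e.g. a gender or sentiment API
        if label is None:
            results["unknown"] += 1
        else:
            results["answered"] += 1
    return results

# A naive stand-in classifier with a tiny, modern lexicon.
MODERN_LEXICON = {"emma": "female", "jack": "male"}
naive = lambda name: MODERN_LEXICON.get(name.lower())

# 19th-century names the modern lexicon is likely to miss.
report = validate_tool(naive, ["Euphemia", "Ebenezer", "Emma"])
```

A high "unknown" rate (or, worse, confident answers that are wrong for the period) is exactly the signal that the tool's training data does not match your corpus.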

We are trying to think about how we show what is missing from a collection, rather than what is present, so that data can be used in a more informed way. We are looking at what research environments we can provide – we know that people want to use their own, but we can sometimes be a bit stuffed by licensing based in a paper era. On-site tools can help. Should we enable research environments for open data that can be used off-site too? We are thinking about focus – are the query, tooling and collections required well defined; is it feasible – legal, cost, ethical, source data quality, etc.; is it affordable – time, people, money; etc.

So, we have, on the BL Labs website, a form – it’s long so do send us feedback on whether that is the right format etc. – to help us understand demand and skills.

Those personas – please fill these in – and let us know the technical part, what you might want, how technical the support you need. We are keen to discuss your needs, challenges and issues.

And with that we are done and moving on to lunch and discussion. Thanks to Ben, Hugh, Alex and Uta as well as Melissa and the Digital Scholarship Team!