Jul 05 2016

This afternoon I’m at UCL for the “If you give a historian code: Adventures in Digital Humanities” seminar from Jean Bauer of the Center for Digital Humanities at Princeton University, who is being hosted by Melissa Terras of the UCL Centre for Digital Humanities. I’ll be liveblogging so, as usual, any corrections and additions are very much welcomed.

Melissa is introducing Jean, who is in London en route to DH 2016 in Kraków next week. Over to Jean:

I’m delighted to be here with all of the wonderful work Melissa has been doing here. I’m going to talk a bit about how I got into digital humanities, but also about how scholars in library and information sciences, and scholars in other areas of the humanities might find these approaches useful.

So, this image (American Commissioners of the Preliminary Peace Negotiations with Great Britain. By Benjamin West, London, England; 1783 (begun). Oil on canvas. (Unframed) Height: 28 ½” (72.3 cm); Width: 36 ¼” (92.7 cm). 1957.856) is by Benjamin West, and shows the Treaty of Paris, 1783. This is the era that I research and what I am interested in. In particular I am interested in John Adams, the first minister of the United States – he even gets one line in Hamilton: the musical. He’s really interesting as he was very concerned with getting thinking and processes on paper, and for the work he did with Europe, where there hadn’t really been American foreign consuls before. And he was also working on areas of North America, making changes that locked the British out of particular trading blocks through adjustments brought about by that peace treaty – and I might add that this is a weird time to give this talk in England!

Now, the foreign service at this time kind of lost contact once they reached Europe and left the US. So the correspondence is really important and useful to understand these changes. There are only 12 diplomats in Europe from 1775-1788, but that grows and grows with consuls and diplomats increasing steadily. And most of those consuls are unpaid as the US had no money to support them. When people talk about the diplomats of this time they tend to focus on future presidents etc. and I was interested in this much wider group of consuls and diplomats. So I had a dataset of letters, sent to John Jay, as he was negotiating the treaty. To use that I needed to put this into some sort of data structure – so, this is it. And this is essentially the world of 1820 as expressed in code. So we have locations, residences, assignments, letters, people, etc. Within that data structure we have letters – sent to or from individuals, to or from locations, they have dates assigned to them. And there are linkages here. Databases don’t handle fuzzy dates well, and I don’t want invalid dates, so I have a Boolean logic here. And also a process for handling enclosures – right now that’s letters but people did enclose books, shoes, statuettes – all sorts of things! And when you look at locations these connect to “in states” and states and location information… This data set occurs within the Napoleonic wars so none of the boundaries are stable in these times so the same location shifts in meaning/state depending on the date.
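As an aside from the liveblog, here is a minimal sketch of how the two ideas above might be modelled: a date that is always valid but carries Boolean flags for what is really known, and a location whose state depends on the date. This is not the EAFSD schema; every class, field and function name here (`FuzzyDate`, `StateMembership`, `state_on`, `Letter`) is an assumption made for illustration, and the sample places are deliberately placeholders rather than historical data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class FuzzyDate:
    """A date stored as a valid calendar date plus flags for what is known.

    Hypothetical design: unknown parts default to 1 so the stored date is
    always valid, and the Boolean flags record the real precision.
    """
    year: int
    month: int = 1
    day: int = 1
    month_known: bool = True
    day_known: bool = True

    def as_date(self) -> date:
        # Always constructible, so the database never holds an invalid date.
        return date(self.year, self.month, self.day)

    def label(self) -> str:
        # Show only the parts that are actually known.
        if not self.month_known:
            return f"{self.year}"
        if not self.day_known:
            return f"{self.year}-{self.month:02d}"
        return self.as_date().isoformat()


@dataclass
class StateMembership:
    """A location belongs to a state only between two dates (inclusive)."""
    location: str
    state: str
    start: date
    end: date


@dataclass
class Letter:
    sender: str
    recipient: str
    origin: str
    destination: str
    sent: FuzzyDate


def state_on(location: str, when: date,
             memberships: List[StateMembership]) -> Optional[str]:
    """Return the state a location belonged to on a given date, if any."""
    for m in memberships:
        if m.location == location and m.start <= when <= m.end:
            return m.state
    return None


# Placeholder data only; these are not real places or boundaries.
memberships = [
    StateMembership("Location A", "State X", date(1790, 1, 1), date(1799, 12, 31)),
    StateMembership("Location A", "State Y", date(1800, 1, 1), date(1820, 12, 31)),
]
letter = Letter("Sender", "Recipient", "Location A", "Location B",
                FuzzyDate(1794, 6, day_known=False))
print(letter.sent.label())
print(state_on(letter.origin, letter.sent.as_date(), memberships))
```

The point of the flags is that queries can still sort and filter on a real date column, while displays and analyses can respect the fact that the day or month was never recorded.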

So, John Jay has all this correspondence between May 27 and Nov 19, 1794 and the letters are going from Europe to North America, and between the West Indies and North America. Many of these are reporting on trouble. In the West Indies there are ship seizures… And there are debts to Britain… And none of these issues get resolved in that treaty. Instead John Jay and Lord Grenville set up a series of committees – and this is the historical precedent for mediation. Which is why I was keen to understand what information John Jay had available. None of this correspondence got to him early enough in time. There wasn’t information there to resolve the issue, but enough to understand it. But there were delays for safety, for practical issues – the State Department was 6 people at this time – but the information was being collected in Philadelphia. So you have a centre collecting data from across the continent, but not able to push it out quickly enough…

And if you look at the people in these letters you see John Jay, and you see Edmund Jennings Randolph mentioned most regularly. So, I have this elaborate database (The Early American Foreign Service Database – EAFSD) and lots of ways to visualise this… Which enables us to see connections, linkages, and places where different comparisons highlight different areas of interest. And this is one of the reasons I got into the Humanities. There are all these papers – usually for famous historical men – and they get digitised, also the enclosures… In a single file(!), parsing that with a partial typescript, you start to see patterns. You see not summaries of information being shared, not aggregation and analysis, but the letters being bundled up and sent off – like a repeater note. So, building up all of this stuff… Letters are objects, they have relationships to each other, they move across space and time. You look at the papers of John Adams, or of any political leader, and they are just in order of date sent… Requiring us to flip back and forth. Databases and networks allow us to follow those conversations, to understand new orders to read those letters in.

Now, I had a background in code before I was a graduate student. What I do now at Princeton (as Associate Director of the Center for Digital Humanities) is to work with librarians and students to build new projects. We use a lot of relational databases, and network analysis… And that means a student like one I have at the moment can have a fully described, fully structured data set on a Vagrant machine that she can engage with, query, analyse, and convey to her examiners etc. Now this student was an Excel junkie but approaching the data as a database allows us to structure the data, to think about information, the nature of sources and citation practices, and also to get major demographic data on her group and the things she’s working on.

Another thing we do at Princeton is to work with libraries and with catalogue data – thinking about data in MARC, MODS, or METS records, and thinking about the extraction and reformatting of that data to query and rethink it. And we work with librarians on information retrieval, and how that could be translated to research – book history perhaps. Princeton University Library bought the personal library of philosopher Jacques Derrida – close to 19,000 volumes (they thought it was about 15,000 until they were unpacked), so two projects are happening simultaneously. One is at the Center for Digital Humanities, looking at how Derrida marked up the texts, and then went on to use and cite them in Of Grammatology. The other is with BibFrame – a Linked Open Data standard for library catalogues – and they are looking at books sent to Derrida, with dedications to him. Now there won’t be much overlap of those projects just now – Of Grammatology was his first book, so it largely predates the books dedicated or gifted to him. But we are building our databases for both projects as Linked Open Data, all being added a book at a time, so the hope is that we’ll be able to look at any relationships between the books that he owned and the way that he was using and being gifted items. And this is an experiment to explore those connections, and to expose that via the library catalogue… But the library wants to catalogue all works, not just those with research interest. And it can be hard to connect research work, with depth and challenge, back to the catalogue but that’s what we are trying to do. And we want to be able to encourage more use of and access to the works, without the library having to stand behind the work or analyse the work of a particular scholar.

So, you can take a data structure like this, then set up your system with appropriate constraints and affordances, which need to be thought about as they will shape what you can and will do with your data later on. Continents have particular locations, boundaries, shape files. But you can’t mark out the boundaries for empires and states. The Western boundary at this time is a very contested thing indeed. In my system states are merely groups of locations, so that I can follow mercantile power, and think from a political viewpoint. But I wanted a tool with broader use, hence that other data. Locations seem very safe and neutral but they really are not, they are complex and disputed. Now for that reason I wanted this tool – Project Quincy – to have others using it, but that hasn’t happened yet… Because this was very much created for my research and research question… It’s my own little Mind Palace for my needs… But I have heard from a researcher looking to catalogue those letters, and that would be very useful. Systems like this can have interesting afterlives, even if they don’t have the uptake we want Open Source Digital Humanities tools to have. The biggest impact of this project has been that I have the schema online. Some people do use the American Foreign Correspondents databases – it is one of the few places you can find this information, especially about consuls. But that schema being shared online has been helping others to make their own systems… In that sense the more open documentation we can do, the better all of our projects could be.

I also created those diagrams that you were seeing – with DAVILA, a programme that allows you to create easy to read, easy to follow, annotated, colour coded visuals. They are prettier than most database diagrams. I hope that when documentation is appealing and more transparent, it will get used more… That additional step helps people understand what you’ve made available for them… And you can use documentation to help teach someone how to make a project. So when my student was creating her schema, it was an example I could share or reference. Having something more designed was very helpful.


Q1) Can you say more about the Derrida project and that holy grail of hanging that other stuff on the catalogue record?

A1) So the BibFrame schema is not as flexible as you’d like, it’s based on MARC, but it’s Linked Open Data, it can be expressed in RDF or JSON… And that lets us link records up. And we are working in the same library so we can link up on people, locations, maybe also major terms, and on the accession id number too. We haven’t tried it yet but…
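As an aside from the liveblog, a minimal sketch of what linking two independently maintained record sets on a shared accession id could look like. This does not use the real BibFrame vocabulary or any Princeton system; the field names (`accession_id`, `title`, `annotation_count`), the identifiers and the function `link_on_accession` are all invented for illustration.

```python
import json

# Hypothetical catalogue-side records (not real BibFrame properties).
catalogue_records = [
    {"accession_id": "ACC-0001", "title": "Example volume one"},
    {"accession_id": "ACC-0002", "title": "Example volume two"},
]

# Hypothetical research-side records kept in a separate database.
research_records = [
    {"accession_id": "ACC-0002", "annotation_count": 3},
]


def link_on_accession(catalogue, research):
    """Attach research data to each catalogue record with the same id.

    Catalogue records with no research counterpart are kept, with
    "research" set to None, so the catalogue stays complete.
    """
    by_id = {r["accession_id"]: r for r in research}
    linked = []
    for rec in catalogue:
        merged = dict(rec)  # copy, so the catalogue record is not modified
        merged["research"] = by_id.get(rec["accession_id"])
        linked.append(merged)
    return linked


linked = link_on_accession(catalogue_records, research_records)
print(json.dumps(linked, indent=2))
```

Keeping the research data under its own key, rather than merging fields into the catalogue record, reflects the idea of two data structures growing side by side without one overwriting the other.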

Q1) And how do you make the distinction between authoritative record and other data?

A1) Jill Benson(?)’s team are creating authoritative linked open data records for all of the catalogue. And we are creating Linked Open Data, we’ll put it in a relational database with an API and an endpoint to query to generate that data. Once we have something we’ll look at offering a Triple Store on an ongoing basis. So, basically it is two independent data structures growing side by side with an awareness of each other. You can connect via API but we are also hoping for a demo of the Derrida library in BibFrame in the next year or two. At least a couple of the books there will be annotated, so you can see data from under the catalogue.

Q1) What about the commentary or research outputs from that…

A1) So, once we have our data, we’ll make a link to the catalogue and pull in from the researcher system. The link back to the catalogue is the harder bit.

Q2) I had a suggestion for a geographic system you might be interested in called Pelagios… And I don’t know if you could feed into that – it maps historical locations, fictional locations etc.

A2) There is a historical location atlas held by the Newberry so there are shapefiles. Last I looked at Pelagios it was concerned more with the ancient world.

Comment) Latest iteration of funding takes it to Medieval and Arabic… It’s getting closer to your period.

A2) One thing that I really like about Pelagios is that they have split locations from their name, which accommodates multiple names, multiple imaginings and understandings etc. It’s a really neat data model. My model is more of a hack – so in mine “London” is at the centre of modern London… Doesn’t make much sense for London but I do similar for Paris, that probably makes more sense. So you could go in deeper… There was a time when I was really interested in where all of Jay’s London correspondents were… That was what put me into thinking about network analysis… 60 letters are within London alone. I thought about disambiguating it more… But I was more interested in the people. So I went down a Royal Mail in London 1794 rabbit hole… And that was interesting, thinking about letters as a unit of information… Diplomatic notes fix conversations into a piece of paper you can refer to later – capturing the information and decisions. They go back and forth… So the ways letters came and went across London – sometimes several per day, sometimes over a week within the city… is really interesting… London was and is extremely complicated.
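As an aside from the liveblog, a minimal sketch of the idea of splitting a location from its names, so that one place can carry several names. This is not the actual Pelagios data model or the Project Quincy schema; the classes `Place` and `PlaceName`, the identifier `place-1` and the helper `find_places` are assumptions made for illustration, and the single coordinate pair is just a point in central London, much like the simplification described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlaceName:
    """One name for a place; a place may have many (hypothetical model)."""
    name: str
    language: Optional[str] = None


@dataclass
class Place:
    """A place identified by an id and a point, independent of its names."""
    place_id: str
    latitude: float
    longitude: float
    names: List[PlaceName] = field(default_factory=list)

    def known_as(self, name: str) -> bool:
        # Case-insensitive match against any of the attached names.
        return any(n.name.lower() == name.lower() for n in self.names)


def find_places(name: str, places: List[Place]) -> List[Place]:
    """Return every place that carries the given name."""
    return [p for p in places if p.known_as(name)]


# A single point standing in for a whole city.
london = Place(
    place_id="place-1",
    latitude=51.5074,
    longitude=-0.1278,
    names=[PlaceName("London", "en"), PlaceName("Londres", "fr")],
)

for place in find_places("londres", [london]):
    print(place.place_id, [n.name for n in place.names])
```

Because names live in their own records, adding a historical or foreign-language name never requires touching the place itself, and two places sharing a name are simply both returned by the lookup.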

Q3) I was going to ask about different letters. Those letters in London sound more like memos than a letter. But the others being sent are more precarious, at more time delay… My background is classics so there you tend to see a single letter – and you’d commission someone like Cicero to write a letter to you to stick up somewhere – but these letters are part of a conversation… So what is the difference in these transatlantic letters?

A3) There are lots of letters. I treat letters capaciously… If there is a “to” or “from” it’s in. So there are diplomatic notes between John Jay and George Hammond – a minister not an ambassador as the US didn’t warrant that. Hammond was bad at his job – he saw a war coming and therefore didn’t see value in negotiating. They exchange notes, forward conversations back and forth. My data set for my research was all the letters sent to Jay, not those sent by Jay. I wanted to see what information Jay had available. With Hammond he kept a copy of all his letters to Jay, as evidence for very petty disputes. The letters from the West Indies were from Nathanial Cabbot Dickinson, who was sent as an information collector for the US government. Jay was sent to Europe on the treaty…. So the kick off for Jay’s treaty is changes that see food supplies to the British West Indies being stopped. Hammond actually couldn’t find a ship to take evidence against admiralty courts… They had to go through Philadelphia, then through London. So that cluster of letters includes older letters. Letters from the coast include complaints from angry American consuls…. There are urgent cries for help from the US. There is every possible genre… One of the things I love about American history is that Jay needs all the information he can get. When you map letters – like the Republic of Letters project at Stanford – you have this issue of someone writing to their tailor, not just important political texts. But for diplomats all information matters… Now you could say that a letter to a tailor is important but you could also say you are looking to map the boundaries of intellectual history here… Now in my system I map duplicates sent transatlantically, as those really matter, not all arrived, etc. I don’t map duplicates within London, as that isn’t as notable and is more about after the fact archiving.

Q4) Did John Jay keep diaries that put this correspondence in context?

A4) He did keep diaries… I do have analysis of how John Quincy Adams wrote letters in his time. He created subject headings, he analysed them, he recreated a filing system and way of managing his letters – he’d docket his letters, noting date received. He was like a human database… Hence naming my database after him.

Q5) There are a couple of different types of use for a tool like this. There is your use and then there is reuse of the engineering. I have correspondence earlier than Jay’s, mainly centred on London… Could I download the system and input my own letters?

A5) Yes, if you go to eafsd.org you’ll find more information there and you can try out the system. The database is Project Quincy and that’s on GitHub (GPL 3.0) and you can fire it up in Django. It comes with a nice interface. And do get in touch and I’ll update you on the system etc. It runs in the Django framework, can use any database underneath it. And there may be a smaller tractable letter database running underneath it.

Comment) On BibFrame… We have a Library and Information Studies programme in which we teach BibFrame. We set up a project with a teaching tool which is also on GitHub – it’s linked from my staff page.

A quick note as follow up:

If you have research software that you have created for your work, and which you are making available under an open source license, then I would recommend looking at some of the dedicated metajournals that will help you raise awareness of your project and ensure it is well documented for others to reuse. I would particularly recommend the Journal of Open Research Software (which, for full disclosure, I sit on the Editorial Advisory Board for), or the Journal of Open Source Software (as recommended by the lovely Daniel S. Katz in response to my post).


May 14 2014

Today I am at the University of Edinburgh Digital Humanities and Social Sciences Digital Scholarship Day of Ideas 2014 which is taking place at the Edinburgh Centre for Carbon Innovation, High Street Yards, Edinburgh. This year’s event takes, as its specialist focus, “data”. These notes have been taken live so my usual disclaimers apply and comments, questions and corrections are, as ever, very much welcomed.

Introduction: Prof Dorothy Miell, Head of College of Humanities and Social Science

I’m really pleased to welcome everybody here today. This is our third Digital Scholarship Day of Ideas and they are an opportunity to bring in interesting outside speakers, but also for all of us interested in this area to come together, to network and build relationships, and to take work forward. Again today we have a mixture of international and local speakers, and this year we are keeping us all in one room so we can all hear from those speakers. I am really glad to see such a popular take up for the day, and mixing from across the college and Information Services.

Digital HSS, which organised this event, is work that Sian Bayne leads and there are a series of events throughout the year in that strand, as well as these events.

Today we are going to be talking about the idea of data, particularly what data means for scholars in the humanities, how can we understand the term Big Data that we hear in the Social Sciences, and how can we use these concepts in our own work.

Sian Bayne, Associate Dean (digital scholarship) is introducing our first speaker. Annette describes herself as an “itinerant researcher”. Annette’s work focuses on internet and qualitative research methods, and the ethical aspects of internet research. I think she has a real talent for great paper titles. One of my favourites is “Undermining Data” – which today’s talk is partially based on – but I also loved that she had a paper entitled “Fieldwork in Social Media: What would Malinowski do?”. Anyway, I am delighted to welcome Professor Annette Markham.

Can we get beyond ‘data’? Questioning the dominance of a core term in scientific inquiry – Prof Annette Markham, Department of Informatics, Umeå University, Sweden; Department of Aesthetics & Communication, Aarhus University, Denmark; School of Communication, Loyola University, Chicago (session chair: Dr Sian Bayne)

As Sian mentioned I have spent a lot of time… I was a professor for ten years before I quit in 2007 and pushed myself across other disciplines, to push forward some philosophical work on methods. For the last 5 years or so I’ve been thinking about innovative and creative ways to think of methods to resonate better with the complexity of modern life. I work with STS – Science and Technology Studies – scholars in Denmark, Informatics scholars, machine learning scholars in Boston, language scholars in Helsinki… So a real range across the disciplines.

The work today is around methods work I’ve done with colleagues over the last few years, much of which is captured in a special issue of First Monday: Vol 18, No 10: Making Data – Big Data and Beyond Special Issue. And this I’m doing from a post humanist, STS, non positivist sort of perspective, thinking about the way in which data can be used to indicate that we share an understanding when actually, we are understanding the same information in very different ways. For some data can be an easy term, consistent with your world view… a word that you understand in your own method of inquiry. Data and data sets might be familiar parts of your work. We all come from somewhere, we all do research… what I say may not be new, or may be totally new… it may resonate… or not at all… but I want this to be a provocation, to make you question and think about data and our methods.

So, why me? Well mainly I guess because I know about methods… so this entire talk is part of a bigger project where I look at method, at forms of inquiry… but looking at method directly isn’t quite right, it is better to look at it from the side, from the corner of your eye… And to look at method is to look at the conditions in which we undertake inquiry in the 21st century. For many of us inquiry is shaped by funding, and funding privileges that which produces evidence, which can be archived. For many qualitative researchers this is unthinkable… a coffee stain on field notes might have meaning for you as an ethnographer but how can that have meaning for anyone else? How can that be archivable or sharable or mineable?

And I think we also have to think about what it is that we do when we do inquiry, when we do research… to get rid of some of the baggage of inquiry – like collecting data, analysing and then writing up – as there are many forms of inquiry that don’t fit that linear approach. Another way to think of this is to think of frames, of how we frame our research. As an American scholar trained in the Chicago School of Sociology, I cannot help but cite Erving Goffman. Frames tell us to focus on something, and to ignore other things… So if I show you a picture of a frame here…. If I say Mona Lisa you might think of that painting. If I tell you to look outside of the frame you might envision the wall, or the gallery, or what sits outside that frame. And if you change the frame it changes what you see, what you focus on… so if I show you a frame diagram of a sphere and say that is a frame, a frame for research, what do you see? (some comment they see the globe, they see 3D techniques, they see movement). The frame tells us to think about certain phenomena…. to also not think about others… if I say Mona Lisa now… we think of very different things… Similarly an atomic structure type image works as a very different type of frame – no inside or outside but all interconnected nodes… But it’s almost impossible to easily frame, again, Mona Lisa…

So, another frame – a not-quite-closed drawn circle – and this is to say that frames don’t tell you a lot about what they do… and Goffman and others say that frames work best when they are almost invisible…. like maps (except say the McArthur Corrective Map). So, by repositioning a map, or by standing in an elevator the wrong way and talking to people – as Harold Garfinkel had his students do – we have a frame that helps us look differently at what we do. “Data” can make us think we look at the same map, when we are not… Data may not be understood as a shortcut term or a metonym; it could be taken rather as preexisting aspects of the phenomenon that have been filtered and created through a process, and organised in some way. Not the meaning I want for my work but not good or bad…

So I want to come back to “How are our research sensibilities being framed?”. In order to understand inquiry we have to understand three other things. (1) How do we frame culture and experience in the 21st Century; (2) How do we frame objects and processes of inquiry; (3) How do we frame “what counts” as proper and legitimate inquiry?

For me (1), as someone focused on internet studies, I think about how our research context has shifted, and how has our global society shifted, since the internet. It’s networked for instance. But also interesting to note how this frame has shifted considerably since the early days of the internet… So taking an image from the Atlas of CyberSpace – an image suggesting the internet as a tunnel. But city scapes were also common ways to understand the world. MIT suggested different ways to understand a computer interface. This is about what happened, the interests in the early days of the internet in the 90s. That playfulness and radical ideas change as commerce becomes a standard part of the internet. Skipping forward to Facebook for instance… interfaces are easy to understand, friendly, almost all social media looks the same, almost all websites look the same… and Google is a real model for this as their interface has always been so clean…

But I think the significant issue here is that socio-technical research and understanding have been shaped by these internet interfaces we encounter on a daily basis.

For me frame (2) hasn’t changed that much… two slides…. this to me represents any phenomenon or study – a whole series of different networks of nodes connected to the centre. There is no obvious starting point. It is not clear what belongs in the centre – a person, an event, a device – and there are all these entanglements characterising these relationships. And yet our methods were designed for and work best in the traditional anthropological fieldwork conditions… And the process is still very linear in how we understand it – albeit with iterative cycles – but it’s still presented that way. And that matters as it privileges the neat and tidy inquiry over the messy inquiry, the inquiry without clear conclusions… so how we frame inquiry hasn’t changed much in terms of inquiry methods.

Finally, and briefly, (3) my provocation is: I think we’ve gone backwards… you can go back to the 60s or earlier and look at feminist scholars and their total reunderstanding of scientific method, and situated research. But as budgets tighten, as research is funded under more conservative conditions, this stuff that isn’t well understood isn’t as popular… so we’ve seen a return to evidence based methods, to clear conclusions, to scientific process. Particularly in media coverage of research. It’s still a dominant theme…

So… What is data?

I don’t want to be glib here. The word “data” is awfully easy to toss around. It is. In every day life this term is a metonym for lots of stuff, highly specific but unspecified stuff. It is arguably quite a powerfully rhetorical term. As Daniel Rosenberg says, the use of the term data has really shifted over the last few hundred years. It appeared in the 1760s or so. Many of those associated with the word only had it appear in translations posthumously. It is derived from Latin and, in the 1760s, it was about conditions that exist before argument. Then as something that exists before analysis. And in that context data has no theoretical baggage. It cannot be questioned. It always exists… has an incontrovertible it-ness. A “fact” can be proven false. But false data is still “data”. Over time and usage “data” has come to represent the entirety of what the researcher seeks and needs in pursuit of the goal of inquiry. To consider the word in my non-positivist stance, I see data as “what is data within the more general idea of inquiry”. In the mid 1980s I was taught not to use that word, we collect materials, we collect artefacts as ethnographers… and we construct… data… see even I used it there, so hard not to. It has been operationalised as discrete and incontrovertible.

Big data has brought critical responses out, they are timely and subtle responses… and boyd and Crawford (2011) came up with six provocations for big data. And Nancy Baym (2013) also talks about all social media metrics being a nonrepresentative partial sample. And that there is an inherent ambiguity that arises from decontextualising a moment of clicking from a stream of activity and turning it into a stand alone data point. Bruno Latour talked about this too, in talking about soil from the Amazon, of removing something from its context.

And this idea disturbs me, particularly when understanding social life as represented in technology. Even outside the western world, even if we don’t use technology, as Sonia Livingstone notes, we are all implicated in technology in our everyday life. So, I want to show you a very common metaphor for everyday life in the 21st century – a Samsung Galaxy SII ad. I love this ad – it’s low hanging fruit for rhetorical critique! It flattens everything – your hopes and dreams offered at equal value to services or products you might buy… and flattens it all into equal, infinitesimal bits that swirl around, can be transmitted, transformed, controlled – as long as we purchase that particular phone. An interesting depiction of life as data – and humans and their data as new. It’s not unusual and not a problem, as we don’t buy into it uncritically as a notion.

This ad troubles me more. This is Global Pulse, an NGO, a sub committee of the UN, that distributes data on prices in the developing world. It follows the story of a woman affected by price shifts. So this ad… it has a lot of persuasive power and I want to be careful about the argument that I make to conclude…

I really like what we get from many big data analyses. I have nothing against big data or computational analysis. Some of the work you hear about today is extraordinary, powerful… I won’t make an argument about data, about data to solve certain problems. I want to talk about what Kate Crawford talks about as “big data fundamentalism”. I wouldn’t go that far… algorithms can be powerful but not all human experience can be reduced to data points. And not everything can be framed by big data. Data can be hugely valuable but it’s important to trouble what is included and what is missed by big data. That advert implies data can be understood as it happens. Data is always filtered, transformed, framed… from that you draw conclusions. Data operates within the larger framework for inquiry. We have to remember that we have strong and robust models for inquiry that do not focus on data as the core of inquiry. Data might be important – it should be the chorus not the main player on the stage. The focus of non-positivist research is upon collecting the messy stuff….

And I wanted to show a visualisation, created in Gephi, by one of my colleagues who looked at Arab Spring coverage in media and social media in Sweden… In doing this, as he shifts the algorithm he is manipulating data, changing how the data appears to us, changing variables to make his case… most of the algorithms of Gephi create neat round visualisations. Alex Galloway critiques this by saying that some forms may not be representable, and this tool does not accommodate that, or encourages us to think that all networks can be visualised in that way. These visualisations and network analyses are about algorithms… So I sort of want to leave it there, to say that data functions very powerfully as a term… and that from a methodological perspective it creates a very particular frame that warrants concern, particularly when the dominant context tells us that data is the way to do inquiry.


Q: I enjoyed that but I find you more pessimistic than I would be. That last visualization shows how different understandings of that network are possible. It’s easy to create a strawman like this but I’ve been reading papers where videos are included in papers… the audience can all think about different interpretations. We can click on a data point, to see that interview, to see that complex account of that point. There are many more opportunities to create richer entanglements of data… we should emphasize those, emphasize that complexity rather than hide the complexity of how that data is created.

A: Thanks for finishing my talk for me! If we consider the generative aspects of inquiry then we can use the tools to be transparent about the playfulness of interrogation, by offering multiple interpretations… I talk about a process of Borrow / Play / Move / Interrogate / Generate. So I was a bit pessimistic – that Global Pulse ad always depresses me. But I agree!

Q: I was taken by your argument that human experience cannot be reduced to a single data point… what else can it be reduced to… it implies an alternative to data… so what might that be?

A: I think that question is not one that I would ask. To me that is not the most important question. For me it’s about how we might make social change – how might I create interventions, how might I represent someone’s story. I’m not saying that there is an alternative… but that discussion of data in general puts us in that sort of terrain… and what is more interesting or important is to consider why we do research in the first place, why do we want to look for a particular phenomenon… to not let data overwhelm any other arguments.

Q: I think your talk noted that big data focuses on how people are similar and what similarities there are, whilst ethnography tends to be about difference. That makes the data tracking that covers most people particularly depressing. Is that the distinction though?

A: I think I would see it as simplification versus complexity… how do we envision inquiry in ways that try to explode the phenomenon into an even more complex set of entanglements and connections. It may be about differences but doesn’t have to be… it’s about what emerges from a more generative process… it’s an interesting reading though, I wouldn’t disagree.

Q: I wanted to share a story with you of finishing my PhD, a study of social workers when I was a social worker. I had an interview for a research post at the Scottish Government and one of the panel asked me “and how did you analyze your data” and I had never thought of my interviews and discussions as data… and since then I’ve been in academia for 20 years but actually I’ve had to put that idea, that people are not data, aside to progress my career – holding onto the concept but learning to talk the talk…

A: I can relate to that. You hear that a lot, struggling to find the vocabulary to make your work credible and understandable to other people. With my students I help them see that the vocabulary of science is there, and has been dominant… and to help them use other terms to replace the terms they use in the inquiry, in their method… these terms of mine (Borrow / play / move / interrogate / generate) to get them thinking another way, to make them look at their work in a different way from that dominant method. These become a way that people can talk about the same thing but with less weighty vocabulary, or terms that do not carry that baggage. So that’s one way I try to do that…

Crowd-sourced data coding for the social sciences: Massive non-expert coding of political texts – Prof Ken Benoit, Professor of Quantitative Social Research Methods, London School of Economics and Political Science (session chair: Prof John McInnes)

Professor John McInnes is introducing our next speaker, Professor Ken Benoit. Ken not only talks about big data but has the computational skills to work with it.

I will be showing you something very practical…. I had an idea that I’d do something live… so it could be an Epic Fail!

So I took the UKIP European Election Manifesto… converted to plain text in my text editor. Made every sentence one line… put into spreadsheet… Then I’m using CrowdFlower with some text questions… So I’ll leave that to run…
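The preparation step described here (plain text, one sentence per line, into a spreadsheet) could look roughly like the sketch below. This is not Ken's actual workflow: the regex splitter is a deliberately naive assumption, the function names are hypothetical, and the sample text is invented rather than taken from any manifesto.

```python
import csv
import io
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentences_to_csv(text):
    """Return CSV text with one numbered sentence per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sentence_id", "sentence"])
    for i, sentence in enumerate(split_sentences(text), start=1):
        writer.writerow([i, sentence])
    return buf.getvalue()


# Invented sample text, not taken from any real manifesto.
sample = "We will cut taxes. We will fund hospitals! Will it work?"
print(sentences_to_csv(sample))
```

A real manifesto would need a more careful splitter, since abbreviations and numbered lists break this simple rule.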

So back to my talk… the goal is to measure unobservable quantities… we want to understand ideology – the “left-right” policy positions… we have theories of how people vote, that they vote to parties most proximate to their own positions. For political scientists this is a huge issue. We might also want to measure corruption, cultural values, power… but today I’m going to focus on those policy positions.

A lot of political science data is “created” by experts… a lot of it is, frankly, made up. A lot of it is about hand-coded text units – you take a text, you unitise it…. e.g. immigration policy statements… (Comparative Manifesto Project, Policy Agenda Project). Another way is Solicited Expert Opinion (Benoit and Laver, Chapel Hill, etc) – I worked with Laver for years looking at understanding of policies of each party. It’s expensive work, takes an expert an hour to fill out a form… real headache… We have expert-completed checklists (Polity, Comparative Parliamentary Democracy Dataset, Freedom House, etc.). And there are Coded International events (KEDS, Penn State Event Data). And we have inductively scaled quantities (factor analysis such as “Billy Joe Jimbon Factoral analysis”).

So what are some of the problems of coding using “experts”? Who are experts anyway? Difficult to find coders who are suitably qualified. It’s hard to find them AND hard to train them… most of the experts coding texts tend to be PhD students who find it a pleasing thing to do whilst avoiding finishing their thesis. There can be knowledge effects since no text is ever anonymous to an expert coder with country knowledge. Human coders are unreliable – their codings of the same text unit will vary wildly. And even single coding is relatively costly and time-consuming. So only one coder codes each text. Even when you pay the experts, they are still doing you a favour!

So I will talk about an alternative solution to this problem, and that problem is about classifying text units. So the idea is to observe a political party’s policy position by content analysis of its texts. And party manifestos are the most common texts. The idea behind content analysis is breaking text into small units and then using human judgement to apply pre-defined codes, e.g. coding something as right wing policy. And usually that is done for LOTS of sentences by only ONE coder.

Tomorrow I’ll be in Berlin… the biggest (only?) game in town is the Comparative Manifesto Project (CMP). This is a huge project with 3500 party manifestos from 55 countries from 1945-2010 though still going. Human coders are trained and have PhDs. They break manifestos into sentences, human judgement to apply pre-defined codes. Each sentence assigned to one of 56 policy categories. Category percentages of the total text are used to measure policy. And each manifesto is seen by just one coder, and coded by just one coder.
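The "category percentages of the total text" measure can be illustrated with a few lines of code. This is only a sketch of the arithmetic: the category labels below are invented placeholders, not the CMP's actual 56 categories, and `category_shares` is a hypothetical helper name.

```python
from collections import Counter


def category_shares(codes):
    """Percentage of coded sentences falling into each category."""
    if not codes:
        return {}
    counts = Counter(codes)
    total = len(codes)
    return {category: 100.0 * n / total for category, n in counts.items()}


# One code per sentence, as assigned by a single coder (invented labels).
manifesto_codes = ["economy", "economy", "welfare", "uncoded"]
print(category_shares(manifesto_codes))
```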

So… what could we do… crowd-sourcing involves outsourcing a task by distributing it to an unspecific group, usually in parts… the basic idea of this, versus expert coding, is that it reduces the expertise of each of the coders, but increases the number of coders. Distribute texts for coding partially and randomly. Increase the number of coders per sentence. Treat different coders as exchangeable – and anonymous, and we don’t care if they are sitting in an internet cafe in Estonia in their underwear, or whether they engage on a day off from a bank…

The coding scheme here is to have a more simplified coding scheme. We applied it to 18 of the “big 3” British party manifestos from 1987 to 2010. So a sentence can be coded as Economic, Social or neither… under either of the first two categories there are further options (anti, neutral or pro) from “Very left” to “Very right”, or “Very liberal” to “Very conservative”. And there is a 10 question test to show correct codings, to guide the coder and to keep them on track.

So, to get this started we wanted a comparison we understood. We wanted to compare crowd coding to expert coding. So my colleague and I, and some graduate students, coded a total of 123,000 sentences between us… With between 4 and 6 coders per manifesto and using the same system to be deployed to the crowd. This was a benchmark for the crowd sourcing end of things. This took ages to do… we did that…. that’s a lot of expert coding… and in practice you wouldn’t get this happening… For the crowdsourced codings we got almost twice as many codings…

We used an IRT type scaling model to estimate position. We didn’t want to just take averages here… we used a multinomial method here. We treat each sentence as an item, to which the manifesto is responding, and the left or rightness (etc) as a quality they exhibit. Despite that complexity we found that a mean of means approach led to very similar results. We are trying to simplify that multinomial method… but now the results…
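The "mean of means" idea, as I understand it, is to average the coders' scores within each sentence first and then average those sentence means across the manifesto. A minimal sketch follows; the data layout, the function name and the scores (on an assumed -2 "very left" to +2 "very right" scale) are all invented for illustration, and this is not the IRT model mentioned in the talk.

```python
from collections import defaultdict
from statistics import mean


def mean_of_means(codings):
    """Average per-sentence mean scores across all sentences.

    `codings` is a list of (sentence_id, score) pairs, one per coder judgement.
    """
    by_sentence = defaultdict(list)
    for sentence_id, score in codings:
        by_sentence[sentence_id].append(score)
    sentence_means = [mean(scores) for scores in by_sentence.values()]
    return mean(sentence_means)


# Invented judgements: sentence 1 was coded three times, sentence 2 twice.
codings = [(1, -2.0), (1, -1.0), (1, 0.0), (2, 1.0), (2, 1.0)]
print(mean_of_means(codings))              # each sentence counts once
print(mean(score for _, score in codings))  # pooled mean, weighted by codings
```

The two printed values differ because the pooled mean gives more weight to sentences that happened to receive more codings, which matters when sentences are distributed to coders randomly.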

Comparing expert codings to expert surveys on economic and social positions, the results look pretty good… good correlation, for economic particularly, which is a thing that we’d expect – and we see.

We tested to see how best to serve up results… we tried the sentences in order and out of order. Found .98 correlation so order doesn’t matter…

For the crowd sourcing we used CrowdFlower, a front end to many crowd-sourcing platforms, not just Mechanical Turk. It uses a quality monitoring system so that you have to maintain an 80% “trust” score or be rejected. Trust is maintained through “gold questions” carefully selected and generated by experts…
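To make the gold-question idea concrete, here is a small sketch of one way such a trust score could work: the share of gold questions a coder answers correctly, with coders below a threshold dropped. How the platform really computes trust was not described in the talk, so the formula, function names and data below are assumptions for illustration only.

```python
def trust_score(answers, gold):
    """Share of gold questions this coder answered correctly (None if none seen)."""
    judged = [q for q in answers if q in gold]
    if not judged:
        return None
    correct = sum(1 for q in judged if answers[q] == gold[q])
    return correct / len(judged)


def keep_trusted(coders, gold, threshold=0.8):
    """Return the ids of coders whose trust score meets the threshold."""
    kept = []
    for coder_id, answers in coders.items():
        score = trust_score(answers, gold)
        if score is not None and score >= threshold:
            kept.append(coder_id)
    return kept


# Invented gold answers and coder responses.
gold = {"g1": "economic", "g2": "social", "g3": "neither", "g4": "economic", "g5": "social"}
coders = {
    "coder_a": {"g1": "economic", "g2": "social", "g3": "neither", "g4": "economic", "g5": "neither"},
    "coder_b": {"g1": "social", "g2": "social", "g3": "economic", "g4": "economic", "g5": "social"},
}
print(keep_trusted(coders, gold))
```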

So, we can go back to the live experiment… it’s 96% complete!

So, looking at results in two dimensions… if Liberal Democrats were actually Liberal would be right of economics and left of social… but actually they are more left on economics. Conservatives on the right socially but getting nearer the left in some cases… but it’s not about the analysis so much as the comparison with the benchmark…

When we look at expert codings versus crowd coders… well the points are all over the place but we see correlations of 0.96 for economic, 0.92 for social dimensions. So in both cases there isn’t total agreement – we either have a small crowd of experts or a bigger crowd of non experts. It’s always an average but just a matter of scale…
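For readers less used to these figures, the correlations quoted are presumably Pearson correlations between the expert-based and crowd-based position estimates, one pair per manifesto. A self-contained sketch of that calculation is below; the scores are invented and bear no relation to the study's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Invented position estimates for five manifestos.
expert_scores = [-1.2, -0.4, 0.1, 0.9, 1.5]
crowd_scores = [-1.0, -0.6, 0.3, 0.7, 1.6]
print(round(pearson(expert_scores, crowd_scores), 3))
```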

So, how many coders do we need? No need for 20 codes for a sentence if it’s clearly not about immigration policy… we did massively over sample, then drew sub sets from there to estimate the standard error… we saw that as coders are added the uncertainty in our estimates starts to collapse… The rate of collapse for experts is substantially steeper… for an aggregate of these two processes you need five times more non-expert coders than experts. But you can run good codings with five coders…
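The over-sample-then-subsample approach can be sketched as follows: from a large pool of codings, repeatedly draw subsets of a given size, take the mean of each, and see how much those means vary. This is an illustrative assumption about the procedure rather than the study's actual code; the pool of scores is invented and `subsample_se` is a hypothetical name.

```python
import random
from statistics import mean, stdev


def subsample_se(scores, n_coders, draws=500, seed=0):
    """Spread of the mean score across random subsets of `n_coders` codings."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    means = [mean(rng.sample(scores, n_coders)) for _ in range(draws)]
    return stdev(means)


# Invented pool of 100 codings on a -2..+2 scale.
pool = [-2, -1, 0, 1, 2] * 20
for n in (2, 5, 20):
    print(n, round(subsample_se(pool, n), 3))
```

The printed spread shrinks as the subset size grows, which is the "collapse" of uncertainty described above; a noisier pool would need more coders to reach the same spread.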

So we did some tests for immigration policy… used 2010 British manifestos, knowing that there were two expert surveys on this dimension (but no CMP measures). Only coded immigration or not, and if immigration is positive or not. Cost about $300. Ran again, same cost, extremely similar results…

Doing this we had 0.96 correlation with Benoit 2010 expert survey. .94 correlation with Chapel Hill Survey. And between the two runs correlation of around 0.94. Would have been higher… the experts differed between the immigration policies of Labour and Conservative… were not obvious positions in the text… but they had positions that experts knew about…

So, who are these people? Who are these crowd coders? They are from all over the world… the top countries were USA, Britain, India and Estonia. One person coded over 10,000 sentences! Crazy person loves coding! The mean trust score rarely drops below 0.8 as you’ll be booted off if it does… You don’t pay or get data from those that fail. Where are these jobs being sourced? We tried Mechanical Turk… we’ve used CrowdFlower… there are huge numbers of these sites – a student looked at about 40 of these sites… but trust scores are great no matter how these people are sourced… Techniques are not all ideal… but they don’t stay in the system if trust score changes. No relationship between coder quality and platform…

Conclusions here. Non experts produce valid results, just need a few more of them. Experts have variance, have noise, so experts are just another version of a crowd with higher expertise (lower variance). Repeat experiments prove that the method is reliable (and replicable). Some places require your work to be replicable… is data plus script a good way to do that? Here you really can… You can replicate everything here. You can redo in February what you did in December… with the right text you can reproduce the result. Why does this appeal? Well it’s cheap, it’s flexible. Great for PhD students who lack expert access. And you can work independently from big organisations that have their own agenda for a study. You can try an idea, run again, tweak, see what works… Can go back again… And this works for any data production job that is easily distributed into simple tasks… sign up for Mechanical Turk, be a worker, see what it’s like to actually do this… for instance for transcriptions of audio tapes… it’s noisy…. a common job is that they upload 5 second clips and you transcribe that… gives you pretty good human transcription that timestamps weave back together. Better than computer method…

So, we are 100% finished with our UKIP crowdsourcing experiment… Interestingly 40 negative, 48 positive… needs further analysis…


Q: In terms of checking coders do the right thing – do you check them at the beginning or do you check during the process of codings?

A: Here I cheated a bit… used 126 gold questions from another experiment. You have to give a reason for each question about why it’s there – if the person doesn’t get it right then they get text to explain why that is the case… Very clear unambiguous questions here. But when you deploy a job you can monitor how participants responded or if they contested it… In a previous experiment we had so many contested responses that I actually looked again and removed it…

Q: A very interesting talk… I am a computer scientist and I am interested in whether now you have that huge gold data set you have thought about using machine learning.

A: Yes, we won’t let that go to waste. The crowd data too…

Q: I am impressed but have two questions… you look at every sentence of every manifesto… they are funny things as not every sentence is about the thing you are searching for – how do you deal with that? And a lot of what is in manifestos are sort of dog whistle things – with subtexts that the reader will pick up, how do you deal with that in crowdsourcing?

A: You get contextual sentences around the one you are coding, that helps indicate the relevance of that sentence, its context. In terms of the dog whistle question… people think that but manifestos are not designed to be subtle. They actually tend to be very plain, very clear. It’s rare for that subtlety to be present. Want truly outrageous immigration policy look at the BNP manifesto… every single area is about immigration, not subtle at all.

Q: I’m a linguist, I find it very interesting… and a question about tasks appropriate to crowdsourcing. Those that can be broken down into small tasks, and that your participants can relate to their daily life. I am doing work on musical interpretation… I need experts because I can’t see how to do that in language, in a way that is interpretable to non experts…

A: You can’t give something that’s complex… I couldn’t do your task… you can’t assume who your crowd is, we have very little information… we didn’t ask about language but they wouldn’t retain that trust score without some good English language skills. But workers have a trust score across projects so anything they can’t do they avoid as losing that score is too costly… You could simplify the task with some sort of task that can test correct or incorrect interpretation… but we keep the task simple.

Q: A very interesting talk, I have a quick question about how you set the right price for these tasks… how do you do that? People come from different areas and different contexts.

A: Good question. We paid 2 US cents per sentence. We tried at 5 cents and it was done very fast but quality wasn’t better. A job at 1 cent didn’t happen fast at all. So it’s about timings and pricing of other jobs.

Q: Could you say something about the ethics of this kind of method… you are not giving much consideration to the production of these texts, so I wondered if you could talk about the ethics of this work and responsibilities as researchers.

A: Well I didn’t ruin any rainforests, or ruin any summers. These people have signed up for terms and conditions. They are responsible for taxation in their jurisdiction. Our agreement with CrowdFlower gives them responsibility. And it’s voluntary. Hopefully no sweatshops for this… I’m receptive to the idea of what ethical concerns could be… but couldn’t see anything inherently wrong about the notion of crowdsourcing that would be a concern. Did run past ethics committee at LSE. Didn’t directly contact people, completing tasks on the internet through third party supplier.

Q: You were showing public domain documents… but for research documents not in the public domain how would security be handled…

A: Generally transcriptions are private… but segments are usually 3 or 5 seconds… like reading a document from the shredder basket… the system has that data but workers do not have access to that system.

Q: But the system does have that so you need trust in the platform…

A: Yes.

Comment from floor: companies like CrowdFlower have convinced companies to give them data – doctors’ notes etc. They have had to work on making sure they can assure customers about privacy of data… as a researcher when you go in you can consider what is being done in that business market in comparison.

Q: Have you compared volunteer coders to paid coders? I am thinking particularly about ethical side of things and motivations, particularly given how in political tasks participants often have their own agendas. Might be interesting to do.

A: Volunteer crowdsourcing? Yes, it would be interesting to compare that…

Reading Data: Experiments in the Generative Humanities – Dr Lisa Otty, Lecturer in English Literature and Digital Humanities, University of Edinburgh (session chair: Dr Tom Mole)

Dr Tom Mole is introducing our next speaker, Dr Lisa Otty, whose interests are in the relationship between reading, writing and the technologies of transcription. And she will be talking about her work on Reading Poetry, and the process of what happens when we read a poem.

Now to be a literature scholar speaking at an event like this I have to acknowledge that data is not a term typically used in our field. When you think about what we are used to reading, texts are often books, poems… but a text is not necessarily a traditional material but may also be another linguistic unit, something more complex. The Open Archival Information System reference model (CCSDS 2002) describes data as “a reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing”. Interpretation being crucial there. When we look at texts like books or poems those are “cooked” – edited, curated, finished. Data is too often not seen as that.

Johanna Drucker – in Humanities Approaches to Graphical Display (DHQ 5.1 2011) – talks about data as Taken Not Given, Constructed from the Phenomenological World. Data passes itself off as a priori conditions, as if same as phenomena observed, collapsing the critical gap between the data collection and observation.

Some of these arguments gel with some of the arguments around close versus distant reading. And I think it can therefore be more productive to see data as a generative process…

Between 2009-2012 I was involved in the research project Poetry Beyond Text (University of Glasgow, and University of Kent). This was a collaborative project so inevitably some of my reflections and insights are also collaborative and I would like to acknowledge my colleagues’ work here. The project was looking at interpretation of poetry, and particular visual forms of poetry such as artists’ books. What these works share is that they are deeply resistant to being shared as just information.

For example Eugen Gomringer’s (1954) “silencio” is an example of how the space is more resonant than the words around it… So how do we interpret these texts? And how do our processes for interpretation affect our understanding? One method, popular in psychology, is eye tracking… a physical way of registering what you are doing. We combined eye-tracking with self-reporting. Eye tracking takes advantage of the movements of a small area of the retina. So a map of concentration sees those little jumps, those movements around the page. But it’s an odd process to be part of – you wear a head brace with a camera focused on your eye. You get a great deal of data from the process. Where there is more concentration that usually indicates trickiness or challenge or interest in that section – particularly likely for challenging parts of text. From this data you can generate visualisations. (We are watching a video of eye tracking process for poetry).

Doing this we found a lot of patterns. We saw that people did focus and understand space, but only when that space has significance in the process, in poems where space is more conceptual than mimetic. But interestingly people who recorded high confusion also reported liking them much more… In experiments with post linear poems and their cross-linear connections, all people start with a linear reading pattern before visual reading. And that reflects the colour strip test – psychology test that shows that visual information trumps linguistic information… so visual readings and habitual reading processes are hard to overcome. We are programmed to read in a certain way… our habits are only broken by obstacles or glitches in the text we are reading…

Now talking about this project, if I talk about findings I am back in that traditional research methods… and that would be misleading. We were a cross disciplinary team and so I am particularly interested in focusing on that process, on how we worked on that. The eye tracking data generates huge amounts of numerical data… we faced real challenges in understanding how to understand, to read this data… a useful reminder of the fact that data’s apparent neutrality has real repercussions. It’s one thing to make data open, another to enable people to work with it.

My colleagues in psychology didn’t understand our interest in visualisations of numerical eye tracking data, it is an abstraction… and you have to understand the software to understand how that abstraction works. Psychologists like to interpret the data through the numerical data. They see visualisations, graphs etc. as having a rhetorical rather than analytical function. Our team were interested in that rhetorical function. We were humanists running an experiment – the framework was of hypotheses, of labs, of subjects… but the team came from creative practice background so this sense of experiment was also in play. In its broadest terms experiments are about seeing something in process and seeing how it behaves, for scientists about testing hypotheses in this way, creative experiments rather different… For humanist analysis of these texts you have to deal with a huge number of variables, very much a contrast to traditional psychology experiments. For creative experiments there is a long tradition of work in surrealism, dadaism, etc. that poetry can unleash and disrupt our traditional reading of texts… they are deliberately breaking our habits. The reader of the literary form is a potentially revolutionasible(?) subject.

In literary scholarship and humanities the process of reading is a social, contextualised process. In psychology reading is a biomedical process, my colleagues in this field collapse the human and machine. A recent article by Lutz Koepnick asked Can Computers Read? (2014) and discussed the different possible understandings of what reading is for… what our ideological framework of reading means to us… computational reading is less about what computers are, more about how we invest in them and envision them.

One of the things that came out of our project was the connections between poetry and psychology, and the connections to creative experiments.

To finish I want to talk about some examples of experiments around reading and what reading can mean.

The readers project – John Cayley and Daniel Howe (2009 – ) – their work explores imaginative critiques of reading. Cayley is a literary scholar and has been working in digital production for some time. The readers project features “programmed autonomous entities”. Each reader moves through a text at different speeds and in different ways. So for each part of the experiment projections are used, and they are often shown with books, a deliberate choice. A number of interfaces are available. But these readers move according to machine reading rather than biomechanical reading. Cayley terms this an exploration of vectors of reading… directions in which reading might take off. It explores and engages with new creative understandings of reading. This seems to be seen by Cayley in avant garde context. Emphasis on constructed nature of the work.

“because the project’s readers move within and are thus composed by the words within which they move, they also, effectively, write. They generate texts and the traces of their writings are offered to the project’s human readers as such, as writing, as literary art.” (Cayley, The Readers Project website).

As someone engaging with these pieces the experience is of reading with, more than processing or consuming or analysing.

Tower – by Simon Biggs and Mark Shovman (2011), working at Hive, uses knowledge of natural language processing to build visualisations. When the interactor speaks their words spiral around them. And other texts are also present – the project is inspired by the Tower of Babel and builds up and up. Shovman’s previous work at Hive was on geometric structure. Biggs’ hope is that participants “will be enabled to reflect upon the inter-relations of the things that they are experiencing and their own contingency as part of that set of things.”

Michelle Kendrick talks about hybrids, that hybrid of human and machine interaction, the centrality of human investment in computer reading.

When I talk about this work I am overwhelmed by the rhetorical significance of words like “experiment” and the dominance of scientific research methods – the first interpretation of this work is often wrongly around seeing the work as applying scientific methods to literary interpretation. But instead this work is about interpretation and exploring methods of understanding and interpretation.


Q: You talked about different disciplines coming together. Do you think there is a need for humanities researchers to understand data and computational methods?

A: I think we would all benefit from a better understanding of data and analysis, particularly as we move more and more into using digital tools. I’m not sure if that needs to be in the curriculum but it’s certainly important.

Q: One of the interesting things about reading is the idea of it being a process of encoding and decoding… but the code shifts continuously… and a challenge in experimental reading or interpretation is that literature is always experimental to some extent because the code always changes.

A: I think the idea of reading as always being experimental… I think that experimental writing is about disruption… less about process but more about creating challenge.

Q: I was very struck in what you were presenting there in the Poetry Beyond Text project about the importance of spatiality and space… so I was wondering about explicit spatial understandings – the eye tracking being a form of spatial understanding…

A: We were looking at the way that people had been interpreting those texts in the past, in the ways people had looked at that poetry in the past… they had talked about the structural work of the poets themselves… and we wanted to look beyond that…We wanted to find out people’s responses to some of these processes, and what the relationship was between that experience and those critical views of those texts.

Q: Did you do any work on different kinds of readers – expert readers or people who had studied these works?

A: It was quite a small group but we looked at the same people over time and we did see development over time. We worked mainly with students in literature or art and most hadn’t encountered this type of concrete poetry before but were well experienced with reading.

Q: I wanted to ask you about the ways in which we are trained to read… there are apps showing images of texts very very quickly, are we developing skills to read quickly rather than more fully and understand the text.

A: There was a process of rapid image showing to the eye (RSVP was the acronym) – to allow you to absorb more quickly but in actual fact that was quite uncomfortable. We do see digital texts playing with those notions. I don’t think we will move away from slow reading but we are seeing more of these rapid reading processes and technologies.

Chair: Kinetic Text project works in some of these ways, about focusing eye movement…

A: The text can also manipulate eye movement and therefore your reading and understanding of the text. Very interesting in that respect.

Algorithm Data and Interpretation – Dr Stephen Ramsay, Associate Professor of English at the University of Nebraska; Fellow at the Center for Digital Research in the Humanities (session chair: Prof James Loxley)

James Loxley is introducing our next speaker, Dr Stephen Ramsay.

I want to say that my mother is from Ireland, a little place west of here, and she said that if she had ever been to University it would have been to University of Edinburgh which she felt was the best in the world.

Now I was planning to give a technical talk – I teach computer science in an English faculty. But instead I’m going to talk about data. So I’m going to start with the 1965 blackout of New York. At the time it was about disaster, groping in the dark, a city stranded. But then 9 months later they ran stories on the growth in birth rates, a sharp rise across hospitals across the state. All recording above average numbers of births. Although one report noted that Jewish hospitals did not see an increase. Sociologists talked about the blackout as in some way responsible… three years later a sociologist published a terse statement showing no increase in births after the Great Blackout. This work looked at average gestation period and noted that births would have been higher from June through to August, not just in August… but he found that 1966 was not unusual or remarkable. Black Out Babies were a myth…

You could read this tale as a cautionary one about the misuse of data. But I think this can be read another way… the New York Times piece said something about human nature – people turning to each other when power out is a sad reflection on the place of television in our life, but a hopeful narrative for humanity. And citing birth rates and data and using scientific language adds to that. And the comments about Jewish people shows prejudice. But at the same time that subsequent analysis frames the public as prone to fantasy, as uninformed, with the scholar overcoming this…

The idea of “lies, damn lies, and statistics” encourages us to always look for falsehood hiding behind truth… so we think of what stories we are being told, and what story we want to tell. It’s simple advice that is hard to do. I want to give a different spin on this. I think that data is narrative automatic. The way we use data is instructive – we talk about lists, numbers… Pride and Prejudice does not seem to be a data set unless we convert it. It gains narrative in transformation. The data can be shown to show and mean things – like stories, stories waiting to be told… data doesn’t mean anything by itself, someone has to hear what it is saying…

What does data look like in its pre-interpretive state? There is an internet site called “Found” – collecting random items such as notes, cards, love letters, shopping lists. Materials without their context. Abandoned artefacts. All can be found there. But the great glorious treasure of Found is its lists…

[small pause here for technical difficulty reasons]

These lists are just abandoned slips of paper… one for instance says:







roach spray



The spareness and absence of context quickly turns these data-like lists into narrative… not all are funny… one reads:

go out for a walk with someone

speak with someone

watch tv

go out to cemetry to speak to mom

go to my room

Have you ever wanted to give your data a hug? Bram Stoker said in writing Dracula he just wanted to write something scary… his novel is far more interesting without him as the interpretations of others are fascinating and intriguing… Do facts matter in the humanities? In some areas… who painted a picture, when a treaty was signed… these are not contingent truth claims… surely we can say fact is a good word for those things that are not subject to debate. Scholars can debate whether a painting is by Rembrandt or his school, that debate is about establishing a fact. But facts still matter…

If we look at Rembrandt’s Night Watch the lighting of the girl equating to that of the captain is intriguing. If he said it meant nothing we’d probably ignore him… The signing of a treaty may be a fact but why it occurred is much more interesting. Humanities are about that category 1 inquiry more than the category 2 fact inquiries. Often this is the critique of the humanities and the digital humanities, Jonathan Gottschall insists that the humanities should embrace scientific approaches and sense of optimism… And sees the sciences as doing a better job of this stuff but that “what makes literature special” should be retained… he doesn’t say what those things are. There are unsettled matters if one takes scientific approaches. Of course Gottschall’s nightmare is to understand data with the same criticality we apply to Bram Stoker, questioning its being and meaning… and I suggest we make that nightmare a reality!

[More technical issues… ]

What I wanted to show you was a list of English Novels [being read to us]… It is a list, from Hoover, that organises novels in terms of the breadth of their vocabulary. I have shown this list to many people over the last few years, including many professors… they see Faulkner and Henry James at the top and approve of that and of Mark Twain…. and young adult novel writers at the bottom… but actually I read you the list in ascending order… Faulkner and James are at the bottom. Kipling and Lewis are at the top. And there it starts… richness is questioned… people want to point out how clearly correct the answer is, despite having given the wrong answer; some explain that the methodology is flawed or misreported… these are category 1 people being annoyed by category 2 reality…
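A minimal sketch of how a list like this could be produced, assuming "breadth of vocabulary" means the number of distinct word forms in an equal-sized opening block of each novel. That definition, the function names and the toy texts are illustrative assumptions; the talk does not describe the actual method behind the list.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_breadth(text, block_size=50_000):
    """Count distinct word forms in the first `block_size` tokens.

    Using the same block size for every text keeps long and short
    novels comparable (an assumption made for this sketch).
    """
    tokens = tokenize(text)[:block_size]
    return len(set(tokens))


def rank_by_breadth(texts, block_size=50_000):
    """Return (title, breadth) pairs in ascending order of breadth."""
    scores = {title: vocabulary_breadth(body, block_size)
              for title, body in texts.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Toy stand-ins for novels; real use would load full texts.
toy_texts = {
    "Novel A": "the cat sat on the mat and the cat slept",
    "Novel B": "a quick brown fox jumps over one lazy sleeping dog",
}
ranking = rank_by_breadth(toy_texts, block_size=10)
for title, breadth in ranking:
    print(title, breadth)
```

The ranking is ascending, as in the reading of the list described above, so whether the "top" of the list is the richest or the poorest vocabulary depends entirely on the order in which it is read out.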

But when we stop using it as a Gotcha it is a more provocative question… each of these titles contains a thousand, a hundred thousand thoughts and connections… it is what we do… as humanists we make those connections… we ask questions of the narrative we have created… part of our problem is a general discomfort with letting the computer tell us what is so… but if we stop doing that we might see peculiar mappings of books as cultural objects… it might show us a way to deeper understanding of reading itself… it raises any number of questions about the development of English style… and most of all it raises questions of our discursive paradigms.

That gives us narrative possibilities we could not see. We cannot think of text as 50k word blocks. The computer can ONLY apprehend the text in such terms. To understand the computer as finding facts is to miss the point. It is about creating triggers to ask questions, to look at the text in new ways. This is something I came across working on Virginia Woolf’s The Waves. The structure is so orderly… and without traditional cultural narrative. And the characters speak in very similar styles, sentence structures, image patterns… some see some difference between gender or solidarity… but overall it is about unity… this is the sort of problem that attracts text analysis scholars like myself. I ran algorithm clustering models looking for similarities unseen by scholars. On a lark we posed a simple question… “what are the words that the women in the novel use in common, that none of the men do?” and it turns out that there are 9 such words. Could see that as a narrative – like a Found list – and then we did it with men and found 120 words! Dramatic. So many words… Some critics found that disparity frightening… some think it backs up sexism of the western canon. Others see this as a chance to ask other questions… to try with other authors, novels, characters… if you think this way, perhaps you’ve caught the DH bug, I welcome you. But do we think we’ll find an answer to questions of gender and isolation? Do we want to answer those? The humanities want a world that is more complex, deeper than we thought. That process is a conversation…
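The question “what are the words that the women use in common, that none of the men do?” maps onto basic set operations: intersect the vocabularies of one group, then subtract the union of the other group’s vocabularies. The sketch below is one hedged way to express that; the character labels and speech are invented placeholders, not text from the novel, and the tokenisation is an assumption.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def words_exclusive_to_group(speech_by_character, group, others):
    """Words every character in `group` uses and no character in `others` uses."""
    group_vocabs = [set(tokenize(speech_by_character[name])) for name in group]
    shared = set.intersection(*group_vocabs)
    other_vocab = set().union(
        *(set(tokenize(speech_by_character[name])) for name in others)
    )
    return shared - other_vocab


# Placeholder speech, purely illustrative.
speech = {
    "woman_1": "the sea and the light",
    "woman_2": "light on the sea",
    "man_1": "the garden and the door",
    "man_2": "a door in a garden",
}
women = ["woman_1", "woman_2"]
men = ["man_1", "man_2"]

women_only = words_exclusive_to_group(speech, women, men)
men_only = words_exclusive_to_group(speech, men, women)
print(sorted(women_only), sorted(men_only))
```

Running the same function in both directions is what produces the two lists compared in the talk; the sizes of those lists depend heavily on how words are tokenised and on how much each character speaks.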

In 2015 the Text project will release huge volumes of literature. Perseus contains most Greek texts… there are huge new resources. Almost all questions we ask of these corpuses have not been asked before… we can say they will transform the humanities but that may not be true… the limiting factor is whether we choose to remain humanists in the face of such abundance… perhaps we need to be programmers, tool builders, text engineers… many more of us need to invite the new texts – lists, ngrams, maps etc. – into our ongoing conversation. We are here to talk about philosophical issues of data and these issues are critical… but we have to be engaging with these questions…. Digital humanities means databases, mark up, watermelon…!


Q: I am intrigued to think about how we design for the things we don’t know what we need to know…

A: Sure, imagining what we don’t know… you inevitably build your own questions into the tools… ironically an issue for scientific methods. The nice thing about computers is that they are fast, obedient and stupid. They will do anything we ask them to, even our own most stupid ideas, huge serendipity just baked into that! It’s a problem but it’s amazing how the computer does that job for me, surprisingly.

Q: That was a brilliant fascinating talk. Part of the problem with digital humanities for literature right now is that it either tells us what we do know… or it tells us what we don’t know but then we worry that it’s wrong… The description of the richness list was part of that. I really liked your call for an ongoing discussion that includes computer generated data… but I don’t see how we get past the current description. If all literary criticism says something is so, and expects “yes, but…” I can see how computer generated data sits in that… but how can data be a participant in that conversation – beyond ruling something out, or concurring with expectations.

A: Excellent point and let’s not downplay at all the first part of your question. I saw Franco Moretti give a talk about titles getting shorter for instance… who’d have thought?! But I think it has a lot to do with how we build our tools… I find it frustrating that we all use R, or tools designed for science or psychology… I want our schools to look more like the art-informed projects Lisa talked about. I think the humanities needs to do more like that, to generate the synergies. Tools that are more ludic.

Q: Maybe it is about perceived barriers being quite high. An earlier speaker talked about the role of repeatability. Ambiguity in reading a poem is repeatable. If barriers to entry are low enough for repetition and for others to play, to ask new questions, maybe that brings the data in as part of the conversation…

A: There are tools that let you play with the text more ludically. Voyant for instance. But we come with a lot of cultural baggage as humanists… there is a phenomenon that… no matter what they are talking about they give a literary critical reading of a text but when they show a graph we all think we are scientists… there is so much cultural baggage. We haven’t learned how to be humanistic users of these tools, or to create our own tool.

Q: A question and an observation… There is a school of thought in cognitive psychology that humans are infinitely able to retrofit any narrative to any circumstances whatsoever, and that is very much what was coming through your data… Many humanities departments have become pseudo social sciences departments… but if you don’t have a clear distinction between category 1 and category 2 they can end up doing their own thing…

A: I don’t want the humanities. I resist the social science type study of literature, the study of human record or of the human condition… when we are talking about… in my own work I move between being a literary critic and being an engineer… when it comes to writing software that method definition is wrong, it doesn’t work… when I am a literary critic it is about all those shades of grey, those complexities… but those different states both seem important in pursuit of that end goal… if studying flu outbreaks let’s not be ludic… but for Bram Stoker then we should!

Q: In my own field of politics there was a particular set of work which gave statistical data a bad name… and I wonder whether the same risk is there in your field…

A: In digital literary studies this is sometimes seen as a 25 year project to get literary profs into the digital field… but I always say that that’s not true, there’ll always be things to be done. There was a book in the 70s that looked at slavery in an entirely quantitative way, it made the argument no one wanted to hear, that slavery had been extremely lucrative. Economists said that it’s profitable. History fled from statistical methods for years after that… but they do all agree that that was profitable. And there is quantitative work there again/still. If I had to predict I’d say the same thing for digital literary studies does seem likely…

Q: I can’t resist one here… I was following a blog by Kirsch where you say that scholars should code and I wanted to ask about that…

A: OK, well Kirsch lumps me in with the positivists… I’m not quite in the devil’s party. But I teach programming and software engineering to humanists. It’s extremely divisive… My views have softened over the years… for me programming is a magnificent intellectual exercise… knowing about it seems to help understand the world. But also if you want to do research in this area you need some technical skills. If that’s programming… well learn what you need whether that’s GIS, 3D Graphics… if you want to build things you might need coding!

Big Data and the Co-Production of Social Scientific Knowledge – Prof Rob Procter, Professor of Social Informatics, University of Warwick (session chair: Prof Robin Williams)

Professor Robin Williams is now introducing Professor Rob Procter, our next speaker, talking about his work around social informatics.

The eagle eyed amongst you will spot my change of title – but digital is infinitely rewritable! I am working in the overlap of sociology and computational tools and methods. So, the second thing I want to talk about is Sociology in the age of “big data”. I think what this demonstrates is the opportunities for sociology to respond in various different ways to this big data, and tools to interrogate that data. The evolving of tools and methods is a key thing to look at in the area. So that brings me to the Collaborative Online Social Media Observatory (COSMOS) and tools we are developing for understanding social media… and then I want to talk about Sociology beyond the academy – the co-production of social scientific knowledge. But there are other types of expertise being mobilised at the moment, in looking at the computational turns things are taking. Not always a comfortable thing for social scientists…

So firstly Social Informatics. So what is that? Well to me it’s the inter-disciplinary study of factors that shape adoption and use of ICTs. And what gets me excited is how these then move into real processes. And for me the emphasis is on innovation as a public, participatory process of experimentation and learning where meanings of technologies are collaboratively explored and co-produced. In social media you can argue that this is a large scale experiment in social learning… Of course as we witness growing scale of adoption more people experience those processes: how social media works, how they might adopt or use it… to me this is a fascinating area to study. And because it is public and involves social media it is very easy to see what’s going on… to some extent. And generally that data is accessible for social research purposes. It is not quite that simple but you can research without barriers of having to pay for data if you do it in a careful way.

So these developments have led me into social media as a prime area of my research. So firstly some work we did on the impact of Web 2.0 on scholarly communications – work with Robin Williams and James Stewart – many of us will be part of this, many of us tweet our research… but many of us are not clear of what that means, what the implications are. So we did some work, got some interesting demographic research… we also did interviews with people and got ideas of why they were, and why they were not adopting… Some very polarised. And in parallel we looked at how scholarly publishers incorporate social media tools into their work, in order to remain key players… they do lots of experiments and often that is focused on measuring impact and seeing the movement of their work to other audiences. Some try providing blogs on their content. But that is all with mixed success. A comment notes that it is easier to get comments on cricket reports than on research online… So it’s hard to understand and capture impact…

I’ll come back to that and about co-creation of knowledge. But first I want to talk about the riots in England in 2011. This was work in conjunction with the Guardian Newspaper. They had been given 2.5 million tweets directly by Twitter. They wanted to know if social media was particularly vulnerable for sharing false information, did that support calls for shutting down social media at times of crisis? So we looked at a number of different rumours known about and present in the corpus: zoo animals on the loose; London Eye on fire; Miss Selfridge on fire; rioters attack a children’s hospital in Birmingham. I will talk about that latter example. But we wanted to ask about how people use and understand and interpret social media in these circumstances, how they engage with rumours…

So this is about sociology in the age of “big data”. It calls for interpretive methods but we can’t do that at scale easily… so we need computational methods to focus scarce human resources. We could crowdsource some of this but at this scale that would still be a challenge…

So firstly let’s look at the work of Savage and Burrows (2007), who talked about the “coming crisis of empirical sociology” because the best sociology, as they saw it, was conducted by private companies who have the greatest and most useful data sets which sociologists could not rival nor access. However we might be more confident about the continuing relevance of social sciences… social media provides a lot of born digital data… maybe this should be entitled the “social data deluge”. There is a lot of data available, much of it freely available. Meanwhile lots of policy initiatives to promote open data in government for/by anyone with a legitimate usage for it. Perhaps we can be more confident about the future of academic sociology…

But if you see the purpose this data is put to, it’s a more mixed picture… so we see analysis of social media for stock market prediction. But here correlation is mistaken for causality. Perhaps more interesting are protest movements – like Occupy Wall Street – or use of social media during the Egyptian revolution… It is a tool for political change, a way for citizens to acquire more freedom and change? Is it a movement to organise themselves? Lots of discussion of these contexts. Methodologically it’s a challenge of quantity, and methods that combine social science understanding with social media tools enabling analysis of large scale data…

So back to that rumour from the riots and that rumour of a children’s hospital being attacked in Birmingham. This requires thorough work with the data, but focused where it counts.

So, what sparked this off was someone tweeting that the police were assembling in large numbers outside the hospital… therefore the hospital must be under threat. A reasonable inference.

So, methodologically we undertook computational methods for analysing tweets in an active area of research: sentiment analysis; topic analysis. We combine a relatively simple tool looking at information flows… and then looking at flow from “opinion leaders” to others (e.g. RTs). Once that information flow analysis has been done we can then take those relative sizes to analyse that data, size as proxy for importance… this structure, we argue, is relatively useful for focusing human effort. And then we used coding frames for conventional qualitative methods of content analysis to understand how Twitter was used – to inductively analyse information flow content to develop a “code frame” of topics; use code frame to categorise information flows (e.g. agreement, disagreement, etc.); and then we used visualisation around that analysis of information flows…
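As a rough illustration of “size as proxy for importance”, the sketch below groups tweets into information flows (an original tweet plus its retweets) and ranks the flows by size, so that human coders could start with the largest. The record layout (`id`, `retweet_of`) and the assumption that every retweet points directly at the original tweet are hypothetical; this is not the project’s actual tooling.

```python
from collections import defaultdict


def group_into_flows(tweets):
    """Group tweet ids by the original tweet they repeat.

    Each tweet is a dict with an `id` and a `retweet_of` field
    (None for an original tweet). Both field names are assumptions.
    """
    flows = defaultdict(list)
    for tweet in tweets:
        root = tweet["retweet_of"] if tweet["retweet_of"] is not None else tweet["id"]
        flows[root].append(tweet["id"])
    return dict(flows)


def rank_flows(flows):
    """Return (root_id, size) pairs, largest flow first."""
    sizes = [(root, len(members)) for root, members in flows.items()]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


# Toy corpus: three original tweets with different numbers of retweets.
toy_tweets = [
    {"id": 1, "retweet_of": None},
    {"id": 2, "retweet_of": 1},
    {"id": 3, "retweet_of": 1},
    {"id": 4, "retweet_of": None},
    {"id": 5, "retweet_of": 4},
    {"id": 6, "retweet_of": None},
]
flows = group_into_flows(toy_tweets)
ranked = rank_flows(flows)
for root, size in ranked:
    print(root, size)
```

A ranking like this only decides where to look first; the categorising of each flow as agreement, disagreement and so on remains the manual, code-frame step described above.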

So here we see that original tweet… you see the rumour mushroom, versions appear… bounding circles reflect information flows… and individuals and their influence… Initially tweets agree/repeat… and we then start to see common sense reasoning: those working or nearby dispute the threat, others point out that the police station is next door to the hospital thus providing alternative understanding. People respond and do not just accept the rumour as true… So rumours do break quickly BUT they are not necessarily more vulnerable as versions and challenges quickly appear to provide alternative likely truth. That process might be more rapid with authoritative sources – media or police in this case – adding their voice. But false information may persist longer, with potential risk to public safety – see follow on Pheme project.

But I wanted to talk about authoritative sources again. The police and media and how they use social media. The question is what were the police doing on Twitter at that time? Well another interesting case here… riots in Manchester led to people creating new accounts to draw attention to public bodies like the police, as an auxiliary service to raise awareness of what was going on. Quite an interesting use of social media where these see something like this arising.

So what these examples demonstrate is innovation as a co-production… lots of people collectively experimenting, trying out things, learning about what social media can and cannot do. So I think it’s a prime example for sociologists. And we see uses are emergent, people learn as they use… and it continues to change and people reinvent their own uses… And we all do this, we have our own uses and agenda shaping our interactions.

So this work led to development of tools for use by social scientists… COSMOS involved James S, Ewan K, etc. from Edinburgh… It would be an error to assume social media can tell us everything that takes place in the world – this data goes with crime data, demographic data, etc. The aim of COSMOS is to forge interdisciplinary working between social and computing scientists. To provide open, sustainable platform for interoperable social media analysis tools. And refine and evolve capabilities, provide service models compatible with needs of diverse user communities.

There are existing tools out there for social media analysis… but many are blackbox systems, it’s hard to understand that process that is taking place. So we want those blackbox processes to be opened up, they are complex but can be understood and explored…

So the Cosmos Tools let you view timelines, to look at rates and flows… to look for selection based on keywords and hashtags… and to view the networks of who is tweeting… and to compare data with demographic data.

Also some experimental tools around geographical tools for clustering. The way people use Twitter can show geographical patterns. Another factor is about topic modelling, topic clustering… identifying tweets on the same topic. This is where NLP and Ewan and his colleagues in Informatics have become important.

So current research looking at: Social media and civil society – social media as digital agora; “hate” speech and social media – understanding users, networks and information flows – a learning challenge here about people not understanding impact and implications of their comments, perhaps a misunderstanding of social media… ; citizen social science – harnessing volunteer effort; social media and predictions – crime sensing, data integration and statistical modelling; suicide clusters and social media; humanitarianism 2.0 – care for the future; BBC World Service – tweeting the Olympics. And we have a wide range of collaborators and community engagement.

Let me briefly talk about social media as digital agora… may sound implausible… many talk about social media as a force for change… opportunities to promote democracy… not just in less democratic countries, but also democratic countries where processes don’t seem to work as well… So we are looking at social media in communicative, in smaller communities. And also thinking about social resilience in a day to day small scale way… problems which if not managed may become bigger issues. For that we have studied Twitter in several locations, collected data, interviewed participants… and built up a network of communications. What is interesting, for instance, is that non governmental group @c3sc seems to have big impact. We have to see how this all plays out… deserves longitudinal approach…

So, to conclude… let me talk about the lessons for academic sociology… and I think it’s about sociology beyond the academy and the role of wider players. Firstly data journalism – I was interested in Steven’s 1965 press accounts of the black out earlier. Perhaps nowadays the way journalists are being trained might change that… journalists are increasingly data savvy. We see this through Fact Check, through RealityCheck blog… through sourcing from social media. So too is citizen journalism, used to gather evidence of what is happening… tools like Ushahidi… and a sense of empowerment for these communities… reminds me of notion of sousveillance… and the possibility of greater accountability… And Citizen Journalism in the expenses scandal – the Guardian recruited people to look at the expense claims. The journalists couldn’t do that externally… so recruited others.

So, citizen social science… in various ways (see Harris 2012 “Oh man, the crowd is getting an F in social science”). And Ken Benoit’s work discussed earlier… we see more people coming into social science understanding…

So the boundaries of social science research production are becoming more porous, social scientific knowledge production is changing, potentially becoming more open. These developments create an opportunity to reinvigorate the project for a “public sociology” – as per Burawoy (2005) and his call “For a public sociology”. To make sociology accountable to more people, to organisations, to those in power. Ethically we need to ask what is needed and wanted, how the agenda is set, how to deliver more meaningful and useful social sciences to the public.

How can we do that? New modes of scholarly communications, technology, but it’s not enough… we’ve also been working with a company on a  possible programme for the BBC where social media is used to reflect on the week, a knowledge transfer concept. Also knowledge transfer in the Pheme project – for discriminating false and true information… all quite conventional… but we need other pathways to impact… with people as sensors and interpreters of social life, training and capacity building – in ways we have not done before, and something that has emerged in science and citizen science has been the notion of workshops, hackathons, getting people engaged in using mundane technologies for their own research (e.g. Public Lab), we need something similar for tools, social media, to extract data they want for their purposes for their agenda… to create more public sociology that people can do themselves. And we need to also have an open dialogue about research problems.


Q: My question is about COSMOS and the riot rumours stuff… within COSMOS do you have space for formal input around ethics and law… you cut close to making people identifiable and locatable. And related to that… with police in those circles… may arouse suspicions about motives… for instance in Birmingham did police just monitor or did they tweet.

A: They did tweet but not on that rumour. It is an understandable concern that collaborations make powerful state actors more powerful… for us we want these technologies available for anyone to use them… not some exclusive arrangement, should be available to communities, third sector organisations… anyone who feels that social media may be important in their research

Q: I was more concerned about self-led vigilantes, those who might gang up on others…

A: A responsibility of civil society to be aware of those dangers, to have mechanisms to avoid harm. It does exist already… so if social media becomes instrument of that we have to respond and be aware – partly what hate speech project is about… Bigger learning problem is about conduct in social media space. And the probable issue that people don’t realise how conduct quickly becomes visible to much bigger group of others… and that relates to ethics… Twitter is public domain space but when something is highlighted by others… we have to revisit the ethics issues time and again… for the study for the riots we did the usual clearance process… Like Ken we were told it was fine… but don’t make identifiable but that is nearly impossible in social media. Not an easy thing to resolve.

Q: I’m curious about changes in social media platforms and how that affects us… moves from Facebook to Twitter to Snapchat to Instagram… how does that become apparent, may be invisible, how do we track that…

A: There is a fundamental issue of sustainability of access to data from social media. Not too much of a problem to gather data if you design harvesting appropriately for their rate limits. In terms of other platforms, and people moving to them, and changes in modality and observability and accessibility of data… what social research needs is agreement with providers of data that, under certain conditions of access, that their data is available for research.. to make access for legitimate data easy. There are efforts to archive data – Library of Congress collects all tweets. Likely to allow access under license I think, to ensure access to platforms as use of platforms change…

Edinburgh Data Science initiative – Prof Dave Robertson, Head of School of Informatics

Sian Bayne quickly introducing Dave Robertson providing a coda to today’s session.

I’m just briefly going to talk about the Edinburgh Data Science Initiative. The ideas being data as the catalyst for change in multiple academic disciplines and business sectors.

So firstly the business side… big data can be very big and very fast… that can be off-putting in the humanities… And you don’t have to build something big to be part of this… I work in these areas but my models are small… and there is a stack you never see – economic and political side of this stuff.

And here’s the other one… this is about variety and velocity – a chart from IBM – looking at predictions of the volume of data and, more interestingly, the uncertainty of data… And the data sits in a few categories… Enterprise Data, loads of Social Media, and loads of Sensors (internet of things)… but uncertainty over aggregate data is getting hugely large… and that’s not in sphere of traditional engineering, or traditional business…

The next slide here is about architectures… this is topical… it’s IBM’s Watson system… this is the one that won Jeopardy… harvested loads of information and hypothesis generation… This stack starts with very computational stuff but the top layers look much more like humanities work and concepts…

Now technology and society interact. Often technology pushes on society. For instance if we look at Moore’s Law (memory in your computer doubles every year) mapped against the cost of mapping the human genome. It looks radically different, costs drop hugely in the late 2000s as a lot of effort is pushed in here. And that drop in cost to $1000 per unit… that is socially important… I could sequence my genome… maybe I don’t want to. You can sequence at population scales… machines generate a TB of data a week too – huge data being generated! And this works the other way around… sometimes technology gives you an inflection point and you have to keep up, sometimes society pushes back. A lot of time online is spent on social networks (allegedly 1/7)… now a unified channel for discovery and interaction… And the number of connected devices is zooming up…

So that’s the sort of thing that is pushing a lot of things… A lot of people have spoken to all the schools in the university… everyone reacts… you will find everyone recognising this… and you hear them saying “and it changes the way it makes me think about my research”. That’s so unusual to have such a common response…

Why this is important at Edinburgh… We have many interdisciplinary foundations at Edinburgh… All are relevant, no matter how data intensive, but we are well developed in interdisciplinary working…

And we have a whole data driven start up Ecosystem in Edinburgh… we have Silicon Walk (miicard, zonefox, etc.), Waverley Gate (Amazon, Microsoft), Appleton Tower (Informatics Ventures, feusd, Disney research, tigerface), Evo House (FlockEdu, Lucky Frame, etc), Quartermile (Skyscanner, IBM), Informatics, Techcube (FanDuel, Outplay, CloudSoft, etc.). A huge ecosystem here!

So, I’ll leave it there but input, feedback welcomed, just speak to myself and/or Kevin.

And that was it for the day…

Related resources:

May 14, 2014, 10:10 am. Posted in Events Attended, LiveBlogs.
Nov 11 2013

This afternoon I am attending “A digital humanities workshop in four keys: medicine, law, bibliography and crime“, a University of Edinburgh Digital Humanities and Social Sciences event. I will be liveblogging throughout the event and you can keep an eye on related tweets on the #digitalhss tag. The event sees four post doctoral researchers discussing their digital humanities work.

As usual this is a liveblog so my notes may include the odd error or typo – please let me have your thoughts or corrections in the comments below!

Alison Crockford – Digital articulations: writing medicine in Edinburgh

In addition to the four keys we identified we also thought about the four ways you can engage with the humanities field more widely. And in addition to medicine I will be talking about notions of public engagement.

Digital articulations plays on the idea of the crossover of humanities and medicine. So both the state of being flexibly joined together and of expressing the self. The idea came from the Dissecting Edinburgh exhibition at Surgeons Hall. Edinburgh has a unique history of medicine when compared to other areas of the UK. But scholars don’t give much consideration to the regional history and how medicine in an area may be reflected in literature. So you get British texts or anthologies with maybe one or two Scottish writers bundled in. Edinburgh is one of the most prominent cities in the history of medicine. My own research is concerned with the late 19th century but this trend really goes back at least as far as the fifteenth century. As an early career researcher I can’t access the multimillion pound grants from the ESRC you might need… So digital humanities became a kind of natural platform. I wanted to build a better, more trans-historical perspective on literature and medicine, and would need input from specialists across those areas. I would also need ways to visualise this research in a way that would make sense to researchers and other audiences. I was considering building an anthology and spoke to a colleague creating a digital anthology. I chose to do it this way with a tool called Omeka, in part because of its accessibility to other audiences. Public engagement is seen as increasingly favourable, particularly for early career researchers. I’m interested in tools to foster research but also to do so in digital spaces that are public, and what that means.

I don’t have a background in digital humanities and there doesn’t seem to be a single clear definition. But I’m going to talk about some of the possibilities: what drives a project, how does that influence the result, etc. I will take my cues from Matthew Kirschenbaum’s 2002 essay on digital humanities and English literature. He sees it as concerned with scholarship and pedagogy being more public, more collaborative, and more connected to infrastructure.

I was reassured to know I am not alone in looking at this issue and in having questions: there was a blog post on HASTAC – the Humanities, Arts, Science and Technology Alliance and Collaboratory. This was looking at the intersection between the digital humanities and public engagement, despite that organisation being already active in that space. I get the sense that this topic comes up as being there, but perhaps only recently have there been deliberate reflections on the implications of that.

The Digital Humanities Manifesto 2.0 talks about increasingly public spheres. There’s a kind of derision in Kirschenbaum’s take on digital humanities and public engagement. I’m not sure public engagement deserves such derisive treatment, even though I am concerned about how public engagement and similar value judgements are increasingly chipping away at the humanities. But there is more potential there…

Many digital humanities tools are web based apps, so they are potentially public spaces, and there are implications for our perspectives on any digital humanities, or indeed any humanities, work. For instance the Oxford digital humanities conference last year, looking at impact, nonetheless talked about public engagement as something more than just dissemination, as something richer. Thinking about the participation of your audience, their needs and interests, not just your own.

Bowarst states that humanities scholars may risk letting existing technologies dictate their work, rather than being the inventors and designers of their tools. Whilst we may be more likely to be adopters I do not think that is always the case, nor necessarily a problem. Working as Wikipedian in Residence at NLS I have been impressed with the number of GLAM collaborations embracing a range of existing kit: Flickr, WordPress, Omeka, Drupal.

Omeka is designed for non technical users; it is based around templates and editable content. It is about presentation of materials. The sites are designed for researchers, those already interested… who will see it as a tool for their research, but not for wider audiences (e.g. Digitising historical serialised fiction and depictions of disability in nineteenth century literature). But these can look samey as websites; there are limitations without design support. However, look at Lincoln 200, or indeed the George Arthur Plimpton rare book and manuscript page versus the treasures of the New York Public Library website, which is more visual and appealing. So I am interested in having the appeal of a public orientated website with the quality of a scholarly tool.

So looking at Gothic Past we see something that is both visual and of quality. You can save materials. The ways these plugins, opportunities for discourse etc. in Omeka open up public engagement in richer ways…

Returning to medical humanities… I think it has inherent links to public engagement: it helps enhance understanding of perceptions of health and illness. Its impact can be so universal. Viewing medicine through the lens of literature enables a massively diverse audience who have their own interests, experiences and perspectives to share. Giving a local focus also connects to the large community interested in local history. And designing the resource for that diverse audience with these many perspectives will help shape the tool. Restricting a resource to researchers


Q) Really interesting, particularly the problems of digital humanities and research… Could you say more about Omeka and how you plan to use it?
A) I have a wish list for what I want to make from Omeka. I would like logins, the ability to save material, and to have user added content and keywords to drive the site, so that there is input from other audiences, not just researchers but also public audiences. For instance exhibitions around digital patienthood. I hope to be a good customer. If you don’t have the technological skills, you still have to put in the time to understand the software, to create good briefs; two months in I’m still working with the web team to create a good resource. I want to be a good customer so that I get what I want without making the team’s life hell!

Q) What do you think being a good client means for our students? Bergson mentions that the more we rely on existing technologies, the harder it becomes to think outside the box.
A) I think some of those coming up behind me have a better understanding of things digital… But those are the corporately driven websites, and they don’t necessarily look beyond that. Maybe you need something akin to research methods, looking at open source materials and resources. But realistically that may not be possible.

Q) I wanted to ask about the way the digital humanities is perceived as a thing. In your public engagement work is that phrase used?
A) I think largely people think that these are the humanities and these are digital tools. There are parallel conversations in humanities and in the cultural contexts… The idea of the digital library just being the library. So this doesn’t seem to be specific to academia; it is a struggle for others too to work out how to incorporate the digital into their experience.
Q) We are already post digital?
A) Kind of… The idea of a digital resource from a library being a different tool doesn’t really seem to be what you actively consider; you see a cool tool.

Q) Do you think the schism between research and public engagement exists in the cultural sector?
A) They have a better potential chance to do that. They must provide materials for research and also for public engagement and public audiences. We think about research and then sharing further, but these organisations think inherently about their audiences, and yet the resources are great for research, for instance the historical post office directory research. The sector is a good place to look to see what we might do.

Chen Wei Zhu – Rethinking property: copyright law and digital humanities research

Chen Wei did his research on open source but spent much of that time at the British Library.

I will be doing a whistle stop tour of copyright law, mainly drawing on the non digital. Just to set the scene… When did the digital humanities start? 1946 is a convenient start date: an Italian Jesuit priest tried to index the massive work of Thomas Aquinas; the texts were digitised, put onto CD-ROM and are now online. But at that time the term wasn’t digital humanities but “humanities computing”. I tried Google’s Ngram Viewer and based on that corpus you see that “humanities computing” comes in in the 1970s but “digital humanities” emerges in the 1990s. Humanities computing is still hugely used but it will be interesting to see when “digital humanities” becomes dominant or bigger. A health warning here… The data is best between the 1820s and 1922. 1922 in the US marks the beginning of copyright, but in Europe materials published before then were already in copyright. And another health warning… Google’s scanning kit isn’t perfect before the 1820s because of print inconsistencies and changes, e.g. the long s being read as “f” instead of “s”. It fell out of use after the Times newspaper dropped the long s in 1803. So much data to clear up.

So what are digital humanists’ opinions and understanding of copyright? I feel that digital humanities scholars are quite frustrated. E.g. Burdick et al. 2012 see it this way. Cohen and Rosenzweig 2005 see it as an issue of things never being fixed? [check this reference]

The US Copyright Office is shut down… The US federal government closure included the Copyright Office being shut down. It is still saying it is shut… There will be a huge backlog for registering copyright.

So how did copyright law begin? What is the connection between the Loch Ness Monster and copyright? The story goes that St Columba is not only the first person to have sighted Nessie, but also the first person engaged in a copyright dispute. There is a mythical connection too…

Columba is sometimes called the patron saint of copyright, which is a huge misunderstanding: he is more the first pirate, copying a manuscript without the permission of his tutor. When he was caught secretly copying the book of psalms St Finnian was very angry, and he wanted to restrict the copy. The idea was “to every cow belongs her calf, therefore to every book belongs its copy”. So this was the first copyright case. Columba had the decision go against him, and he rose up against the king, so he led something of a bloodbath.

Now in this case there was no clear author, neither Finnian nor Columba. And no publishing was planned or taking place. So skipping forward to 12th century China we see Cheng Sheren, the first publisher to register their copyright. We see a picture like pre 18th century England, where the publisher has copyright. In China, as in 16th and 17th century England, it is all about censorship, not copyright in any other sense.

The Statute of Anne 1710 is the first copyright act, which brings in the rights of authors and does not include censorship clauses. It is the first modern copyright law. But author based copyright didn’t really take off until the early nineteenth century; I think this was another ethos. Only as authors are seen as romantic geniuses in the romantic age does this model take off. Publishers recede to the background to manage economic aspects and authors move to the forefront.

Enter stage left the Authors Guild. So Authors Guild vs HathiTrust (2012). The Authors Guild has around 8000 members at present. The encouraging part of the decision is that the district judge recognised a fair use defence for HathiTrust to digitise copies of texts. The judge identified two types of transformation: full text search, and accessibility of text. That is very, very important as an aspect of the ruling. And the judge was convinced of the fair use defence. Some humanities scholars made submissions; Matthew Jockers did an analysis of the use of digitised text.

Where we are… We started from the year 1550 and ended in 2012. The meaning of copy has changed. Is digitisation the same as copying by hand? And as digital humanists and copyright lawyers we have to reimagine the role of copyright and the role of the author in copyright. We could see authors as intellectual property owners. We didn’t see intellectual property emerge as a term until the 1960s, when we saw an influential book and the IPO set up, but that idea does change our thoughts of copyright to some extent. But we also see open source, coined in 1998… There is parallel growth there… We are more stewards and custodians rather than exclusive intellectual property owners.


Q) Just to be a pedant here… Your discussion of the romantic author… I think you got it reversed… The law precedes the author by a distance. In the 18th century original works, poems, epic poems like the work of Alexander Pope etc. for the sake of erectile, their rank of gentlemen, and royal sponsors made books of vellum, extremely expensive… The way the publishers got around the need to publish these expensive texts was to republish out of copyright works, recycled materials (including Shakespeare), etc.: cheap material on recycled rag paper. When new works appear, when paper costs drop, then you see new types of writing replacing old writing and publishers have little say… And in the early nineteenth century you see authors assert power. Profit and the capitalisation of ideas in the republishing of works is crucial to the current Authors Guild debate.

A) I’m glad you mentioned Alexander Pope; he is quoted in a 1771 case. Almost all cases from the 1710s onwards are between publishers, but Pope actually sued his publisher in that time. That is a gradual change… going into the nineteenth century.

Q) US versus UK?
A) There is a divergence of law… In 1922… The US copyright act was a 56 year act. In 1978 that was in place… Anything pre 1922 is out of copyright. In the UK it is 70 years after the author’s death. In Canada it is 50 years, hence sheet music sites in Canada. There is stuff out of copyright in Canada but not in the UK, but you can access it in the UK. Copyright is definitely territorial but internet access is not.

Q) Interesting you raised music, a whole other complicated history there.
A) Absolutely, very complex. For instance Stravinsky’s work was very difficult for him to copyright because of Russia’s take on property.

Q) The ease of violating copyright law… Working for Wikipedia and Wikimedia UK… It can be twisted around. At the NLS we frequently have conversations about releasing digitised materials. In the UK, unlike the US, newly digitised material has new rights attached. But we have just been putting content out there.
Comment) The British Library lets you use copies for fewer than 3000 copies, but if you have an ebook contract you have to pay huge sums for an image.
Q) It costs more to enforce copyright and fees. The NLS have a non commercial clause for digitised materials; usually we won’t charge if they come and ask us. But the cost of enforcement can be higher than it is worth pursuing. Is this unique to digital?

Gregory Adam Scott – The digital bibliography of Chinese Buddhism as a research and reference tool

Gregory is a digital humanities post doctoral fellow at IASH; his doctorate looked at printing and publishing in early Buddhist cultures. His talk has a new title: “Building and rebuilding a digital catalogue for modern Chinese Buddhism”.

I chose this title inspired by Jorge Luis Borges’ “The Library of Babel”, containing the sum of all possible knowledge, versions with all typographic mistakes, the catalogue itself… I evoke this to represent the challenge we face today in looking at mountains of data; whilst the text may be less random we still risk becoming lost in our own library of Babel.

My own work looks at a narrower range of data. I began studying the digital catalogue of Chinese Buddhism, cataloguing texts from between 1866 and 1950. But first a whistle stop tour of printing and religious printing in China. A woodblock print edition of the Diamond Sutra from 868 CE remains the earliest printed text that records the year of printing. In pre-modern East Asian print history religious texts were some of the most frequently printed texts. The printing blocks of the Korean Buddhist canon were an enormous undertaking in terms of time, cost and political support. Often the costs were supported by the idea that contributing to publishing religious works would be part of something of a merit economy, bringing good things to you and to your family, which can then be gifted to others – so these texts often include a credit to donors in which they dedicated the texts to loved ones.

Yang Wenhui (1837-1911) and his students published hundreds of texts in thousands of copies, and he was a hugely influential lay Buddhist publisher. The introduction of movable type and western printing processes was hugely important: more work was printed in a thirty seven year window than in the previous two thousand years. This is great in terms of accessing primary sources but problematic for understanding printing cultures. We see publishers opening up. The history of modern China is peppered with conflict and political and cultural change. And religious studies were often overlooked in the move towards secularisation; this is now slowly changing. And libraries often lacked key religious texts, and it can be particularly hard to track the history of print in this time because of variance of names, of contributors, of texts, and of cataloguing.

So I wanted to go back to original sources to understand what has been published. So I started with five key sources who had created bibliographies based on accessing original materials rather than relying on secondary sources. There were still errors and inconsistencies. I merged these together where appropriate. I wanted to maintain citations so that the original published sources could be accessed, and so that the work could be understood properly.

I did this by transcribing the data. I used a simple, bare bones method with XML, separating the data from the display of the data. If someone wants to transform the data this format will allow them to do that. It is used simply: tags and descriptions are as human readable as possible. I want future researchers to be able to understand this. I also used Python for some automated tasks for indexing some of these texts.
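As a rough illustration of what separating data from display can look like (a sketch only: the element names `entry`, `title`, `year` and `source`, and all of the sample values, are invented for this example and are not taken from Gregory's actual catalogue files), a record kept as plain XML can be read with Python's standard library and then displayed in whatever way a later researcher prefers:

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag names and values are illustrative
# assumptions, not the catalogue's real schema or data.
SAMPLE = """
<catalogue>
  <entry id="0001">
    <title>金剛經</title>
    <year>1920</year>
    <source>Bibliography A, p. 12</source>
  </entry>
</catalogue>
"""

def load_entries(xml_text):
    """Parse the XML and return a list of plain dictionaries (the data)."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("entry"):
        entries.append({
            "id": node.get("id"),
            "title": node.findtext("title"),
            "year": int(node.findtext("year")),
            "source": node.findtext("source"),
        })
    return entries

def render_line(entry):
    """One possible display; other displays can reuse the same data."""
    return f"{entry['id']}: {entry['title']} ({entry['year']}) [{entry['source']}]"

entries = load_entries(SAMPLE)
for entry in entries:
    print(render_line(entry))
```

Because the display lives in `render_line` and not in the XML, the same file could just as easily feed a web page, a spreadsheet or another catalogue format.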

Looking at the web interface that I put online, it uses PHP, the same stack as Omeka. The database runs on SQL. There is a search interface where you can enter Chinese keywords, and eventually you will be able to search by year or pairs of years. It returns an index number, title, involved author etc.: simple but helpful information. It includes 2328 entries, where the spike at the golden age of China in 1902 is very evident. And then each item has its own static HTML page. That is easy to cite and includes all the information I know about this text. So far I think this resource has been useful to produce data to point the way towards future work… Less the end of research, more the beginning. This work has let me see previously undiscovered texts; you can also look across trends, across connections, and at the relationships to the larger historical picture. It could also be applied to other disciplines and regions.

All of my input to this project is provided under Creative Commons (non commercial). Bibliographic data isn’t copyrightable as it is public knowledge, but the collection of it could be seen to be original work, so I’ve said it is my work that I am happy for others to use.

The reason there is such a spike in 1902 is that where a date is not known the text is assigned to that date, so that all texts will have a date.
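Anyone charting the catalogue by year would want to account for that placeholder. A minimal sketch, assuming the years have already been pulled out into a simple list (the sample years below are invented, and `counts_by_year` is a hypothetical helper, not part of the catalogue's code):

```python
from collections import Counter

PLACEHOLDER_YEAR = 1902  # per the talk: assigned when the real date is unknown

def counts_by_year(years, skip_placeholder=True):
    """Count entries per year, optionally leaving out the placeholder year."""
    counter = Counter(years)
    if skip_placeholder:
        # Remove the placeholder year if present; do nothing otherwise.
        counter.pop(PLACEHOLDER_YEAR, None)
    return counter

# Invented sample years, for illustration only.
sample_years = [1902, 1902, 1902, 1915, 1915, 1931]
print(counts_by_year(sample_years))
print(counts_by_year(sample_years, skip_placeholder=False))
```

One caveat with this approach: it also hides any text that really was printed in 1902, so a separate "date known" flag on each record would be a cleaner design than reusing a real year.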

This catalogue is different from book suppliers’ data as the purpose is so different; my research use is not for purchase in the same way. I want to add features and finesse this somewhat, but my dream is of doing what I’d call “Biblio-Biographies” to see the appearance of a text over time, seeing where it appears in publishers’ catalogues… and how the pricing and presentation changes. For instance looking at the Diamond Sutra we see different numbers of editors, and one offers a special price for 1000 copies. I used bibliographic sources but there are so many more forms and formats that I will need to consider, and each source will be treated differently. Adverts may appear for publications that were never produced. I have moved from bibliography, to catalogue, to something else.


Q) Why not use existing catalogue tools?
A) They didn’t have anything with the right sort of fields: there are very different roles of authors, editors, etc., not in a standard format. I considered MARC, but it would be relatively easy to transform the XML to MARC.

Q) Are you thinking about that next stage, about having ways for more people to contribute?
A) I have been involved in the wiki based dictionary of Chinese Buddhism; we opened it up to colleagues and nothing happened. Only we, the co-editors, contributed. The big issue is about getting credit for your work, which may be the barrier to contribution.
Comment) Have a look at the website Branch on nineteenth century literature: they have asked for short articles and campaigned for inclusion in the MLA bibliographies, and that helps with prestige. You just need big names to write one thing…

Q) Could you say something more about other sources?
A) There are periodicals, a huge number of them. A lot of these focus in on particular printings of texts, some include advertisements, etc., so these texts point off to other nodes and records.

Q) You talked about deliberately designing your catalogue for onward transformation; have you thought about how you will move forward with the structure for the data…
A) I’m not sure yet but I will stick to the principle that simple is good, and that reusable and transformable are good.
Comment) You might want to look at records of music and musical performance.
A) I’ll keep that in mind; readings of these texts are often referred to as performances so that may be a useful parallel.

Louise Settle – Digitally mapping Crime in Edinburgh, 1900-1939

Louise is a digital humanities post doctoral fellow at IASH and her work builds upon her PhD research on gender and crime in the nineteenth century.

I want to talk about digital technologies and visualisation of data, particularly visualisation of spatial data. I will draw upon my own research data on prostitution, and consider the potential for data analysis.

My thesis looked at prostitution in Scotland from 1892 to 1939. The first half looked at the work of reformers, and the second half looks at how that impacted on the lives of women at this time. So why do crime statistics matter? Well, they set prostitution in context, recording changes and changing attitudes. My data comes from the burgh court records: where arrests took place, where police looked for arrests, and the locations of brothels at this time. Obviously I’m only looking at offences, so the women who were caught, and that’s important in terms of understanding the data. Because these were paper records, not digitised, I looked at four years only, coinciding with census years, or the years with full data nearest census years.

I used Edinburgh Map Builder, developed as part of the Visualising Urban Geographies project led by Professor Richard Rodger, who helped me use this tool, although it is a very simple tool to use. This allows you to use NLS historical maps, Google Maps and your own data. There are a range of maps available so you pick the right map; you can zoom in and out and find the appropriate area to focus on. To map the addresses, you input your data either manually or by uploading a spreadsheet, and then you press “start geocoding” to have your records appear on the map. You can change pin colours etc. and calculate the distance between different points. Do have a look and play around with it yourself.
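For anyone curious how a measurement between two geocoded points can be worked out, here is a small sketch using the haversine formula. This is a general-purpose technique and not a claim about how Edinburgh Map Builder itself does the calculation; the function name and the two coordinate pairs are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two illustrative points, roughly in central Edinburgh (values are assumptions).
point_a = (55.9533, -3.1883)
point_b = (55.9500, -3.1900)
print(round(haversine_km(*point_a, *point_b), 3), "km")
```

Over the short distances involved in a single city the result is close to what you would measure on a flat map, which makes it a reasonable first check on how clustered a set of plotted offences is.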

The visual aspect is a very simple and clear way to explore your subject, and the visual element is particularly good for non specialist audiences, but it also helps you spot trends and patterns you may not have noticed before. So looking at maps of my data from 1903, 1911, 1921 and 1931: the maps visualise the location of offences, and it was clear from the maps that the location changed over time, particularly the move from the Old Town to the New Town. In 1903 offences are spread across the city. In 1911 there are many more offences, particularly around the Mound. In 1921 the move to the New Town is further evident. By 1931 the New Town shift is more evident still, with some on Calton Hill too.

The visual patterns tell us a lot, in the context of the research, about the social geography of Edinburgh. Often the Old Town is seen as a working class area and the New Town as a middle class area. Prostitution appears to move towards the centre, but that is also the train station, the shopping areas, the tourist areas. This tells us there is more work there. They keep being arrested there but that does not deter them. Small fines and prison spells did not deter. Entertainment locations were more important than policing policies. You can see that a project that is not necessarily about geography has benefitted from that spatial analysis aspect.

If you have spatial information in your own research then do have a look at Edinburgh Map Builder. But if you have data for elsewhere in the UK you can use Digimap, which includes both contemporary and historical maps. There are workshops at Edinburgh University, and the website is on the bottom there. That’s UK-wide. And a new thing I’ve been playing with is Historypin – this uses historical photography. You can set up profiles, pictures, pins, etc. and you can plot these according to location. You can plot particular events, from your computer or smartphone. You can look at historical images and data. So I have been plotting prostitution related locations such as the Kosmo Club and the coffee stalls on The Mound. You can add your data and plot it on the map. It is a very easy site to use and, for this idea of public engagement, it is a great tool.


Q) I was quite interested in those visual tools and the linking of events, tying them to geographical places. And there are other ways to visualise, such as social network maps; I wonder how it would be to map those in your work, as there must be social connections there. Social network analysis can look very similar… I wanted to know if you have considered that or come across that sort of linkage.
A) I haven’t but that sounds really exciting.

Q) I wanted to ask you about the distribution and policing. If one were to return to the maps: some marked differences in the number of offences – arrests? – how much detail did you take out of it? You said they were going back and were not deterred. In 1911 there are markedly different numbers. But even at the times when there was actually more policing towards the Old Town, the police were just sticking to the main routes. So was the Old Town a lawless zone at that time? Police not wanting to venture into dark alleys. And how long does Edinburgh’s tolerance zone persist? And it’s curious to see that without Leith too! As now the city operates a more direct reflection, but perhaps before the amalgamation of the authorities there wasn’t such a direct deflection effect?
A) In terms of Leith, it was occurring there. The argument is coming from the suggestion that it was informally tolerated in the Old Town… I don’t disagree that it happened in the Old Town but my argument is that it is also happening in the New Town and measures there don’t stop it when they should. And my research also sees the police not always caring, and judges and juries moving for reform rather than harsher sentences. Cafes and ice cream parlours were a cause of concern in Glasgow in 1911, which may impact the figures then. The 1903 records are not correct; it may be an outlier, as the general trend is of decreasing offences over time…

Q) About the visualisation tool: you have a tremendous amount of interest in those maps. Are these important for research design, for research questions? Or would you wish for a tool with more possibility for contextualisation, for instance statistics from authorities etc., to interpret your findings? What possibilities are there for researchers to have these tools yield more?
A) The maps are interesting, they are more appealing. But these need to be used with tables, charts, statistics. If I were just presenting on the work I would have included those other factors. So in 1903 you lose some density when all the dots are in the same place. But an interactive tool to do that would be great.

Comment) What is so attractive about visualisation is speed and efficiency, but that also means there is a risk of concluding too quickly, of not necessarily reflecting the reality of prostitution – the reader may read your map of offences in that way, and that will be easy to do, but the methodology can be dull to people and that can mean misunderstandings.
A) Absolutely. This needs to be in context.

Q) Could you have layers comparing income against offences etc.? And have you found any projects that were developing something more complex…
A) The big project is the Edinburgh Atlas. There is a mini conference on hidden histories and geographies of Edinburgh, on mapping crime; it’s on the IASH mailing list, and there are others doing that.

Q) You talked about women seduced by foreigners in Edinburgh?
A) In Edinburgh there was concern about Italians at ice cream parlours; Brazilians were the concern in Glasgow. And in Edinburgh there was also a German Jewish pimp of concern as well.

Discussion more widely…

Comment) I’m primarily a learning technologist and I spend my life trying to get people to start from the activity they want to undertake, and not to start with the tools. I found it refreshing that you all started with your data and looked for tools with the right affordances. How did you find you were helped with that search for a tool?
Louise) It was human contacts. I saw a lecture from Professor Richard Rodger.
Ally) It was similar for me: I found software through a contact but found it hard to find what else was out there. It basically came down to Omeka or Drupal, which the web team knew about. But it would have been great to know what was out there, what the differences are, what resources there are. Even looking through DHNow and DH Quarterly there isn’t a sense of easily identifying the options for the tools. That can be a bit of an issue.
Greg) I used the tools colleagues were using to build my own…
Comment) HCI has the notion of affordances: what a tool easily enables you to do and what else it could enable you to do. Is there something there about describing affordances for the humanities? My sense is that often they are pitched towards the sciences, and sometimes even the terminology varies, so understanding of affordances varies.
Ally) Sometimes developing your own tools is good, but even a little knowledge and terminology lets me get better results from these tools; if I come to these tools and these colleagues with no knowledge then I will not have a successful outcome. I want to really explore Omeka so that I feel confident and able with it.

Question) Have the tools changed your research questions or ways of working?
Louise) Not me.
Ally) For me they have. I was introduced to the 19th century disability reader digital anthology, and knowing what was possible with the tools changed what I wanted to do with my project. It did to some degree. But the basic aim, that I want to know more about late nineteenth century medical history, hasn’t changed. But the project has.
Chen Wei) I find the legal documents, Creative Commons licenses etc. most useful; I was able to be involved in the first version of the Chinese Creative Commons license.
Greg) It hasn’t changed my questions, but the scale of work possible and how I might explore it has changed for me.

Question) What advice would you give for people thinking about digital tools for research?
Greg) don’t be afraid to just try things out, work out what’s possible…
Louise) do ask for help, do take advantage of courses…

Question) I was struck by the issue of time when you gave your presentations. Have you reflected on the process of the use of time? How to use it creatively and constrain it? And how that use of time perhaps changed your view of text, of hard copy materials?
Ally) With digital projects you can find you go with the additional time used. You should not underestimate the time necessary. But at the same time I would spend hours and hours leafing through texts to answer a research question. I want to use this tool to reduce the time to find the data I need, to access it, to interpret it. But this project is about developing this tool to benefit myself and others later. You need to be realistic, step back, and be realistic about what is possible.
Louise) That’s part of the issue of digital humanities. My work will be in a traditional book format, but the Historypin work, though very engaging, is not counting towards a career, towards a job. That’s a challenge for digital humanities and for early career researchers; it’s why our scholarships are so good.
Chen Wei) And there is the distant versus close reading difference. Close reading still has a role, but distant reading allows us to interrogate that reading, to find that resource, etc.
Greg) Nothing we are doing is unrecognisable as research, but we are able to perhaps examine more material, or to do things more quickly. We are not doing everything differently but using new tools in our work.

Question) do you think this investment in tools is changing humanities as a result of this temporal and labour investment in tools? Ally, you talked about putting off other work…
Ally) well I am doing research. You always have to manage many projects at once. And there will be an impact. But I chose the digital path because time and financial limitations changed what was possible. It could have been done another, very expensive, way. So I’m not putting off research; I would probably be spending years collating information… Instead I am setting something up to facilitate my own research in the future. The relationship between distant and close reading: that divide isn’t as fiery as it appears.
Comment) the superficial view of the digital is happening in teaching. Universities jump on the digitisation bandwagon in a way that changes how humanists are employed, how software is copyrighted and licensed. All these tools help universities save money. One can overreact… Realignments of labour and resources make not so positive inroads…
Ally) it’s a huge problem, I have huge concerns about the University’s MOOC programme. There was discussion of open access individuals to talk about what this means…
Louise) not sure but I know colleagues are concerned.
Wei Chen) open access is about economic growth, not hardcore humanist values. Humanist values should be at the core for digital humanists; there will be an increasingly curatorial role for all formats of material.
Comment) about critical engagements

Question) one of my concerns about this sort of work, and the work in geography, is in ways of making and curating an archive. I was wondering about the length of time an archive is available after a project. There was a BBC project to save our sound and it finished and the map is no longer accessible… So who looks after and preserves data?
Greg) I think it’s hard to “lose” data, it’s about implementation not methods.
Ally) I think it’s about how digital humanities adopt tools, about reflecting on project aftermath. When looking into project funding you don’t want that tool lost. It’s not an issue of methodology or individuals but it has implications for future archiving.
Comment) which is why Greg’s work in XML matters
Me) and the use of research data management plans and research data repositories to help ensure planning and curating of data at the outset, and to ensure long-term access and sustainability.