Mar 14 2016
 

Back on 2nd December 2015 I attended a Digital Scholarship event arranged by Anouk Lang, lecturer in Digital Humanities at the University of Edinburgh.

The event ran in two parts: the first section enabled those interested in digital humanities to hear about events, training opportunities, and experiences of others, mainly those based within the College of Humanities and Social Sciences; the second half of the event involved short presentations and group discussions on practical needs and resources available. My colleague Lisa Otty and I had been asked to present at the second half of the day, sharing the range of services, skills and expertise EDINA offer for digital scholarship (do contact us if you’d like to know more), and were delighted to be able to attend the full half day event.

My notes were captured live, so all the usual caveats about typos, corrections, additions, etc. apply despite the delay in me setting this live. 

The event is opening, after a wee intro from Anouk Lang, with a review of various events and sessions around Digital Humanities, starting with those who had attended the DHOxSS: Digital Humanities Summer School at Oxford in summer 2015.

Introduction to Digital Humanities – Rhona Alcock

Attended with a bursary from the College. Annual event for a range of interests and levels. There was an introductory strand on DH, then another ~9 that were much more specialist. The introductory session was for those new to the field, or keen to broaden understanding. Gave an overview of other strands (some quite technical). Learnt a huge amount from that week. Came back and wrote up notes – 25 pages of them, and I've referred back to them a huge amount. The main topics I benefited from were planning DH research projects and lightweight usability testing (and QA). Great session on crowdsourcing and combining it with your research area. Also some great sessions on knowledge exchange and engagement. More than anything what I had was the sense of connecting with others interested in DH and what they could get from it and do with it. And loads of new ideas for what I do, new audiences to take it to. Practical advice to take it forward, tools etc. So, highly recommended.

David Oulton, College Web Team

I went to the summer school too. Found a big gap in the range of tools and resources (WordPress, Drupal, ESRI, ArcGIS, etc). There are so many tools that are just out there and can be used for all sorts of things, that can be downloaded, etc.

I’ve created a list on pbworks (dhresourcesforprojectbuilding.pbworks.com). And there are lots of resources listed there. I’m keen to encourage academics to just go and try these things, that can be set up quickly and easily.

Digital Medieval – Jeremy Piercy

Worked with the new digital Bodleian interactive tools. Interactive scans of images that allow you to see hidden writing with a variety of light sources – something you often can't do with the physical artefact. Also learned a lot about ArcGIS, and how useful that can be. Also curation issues – many of our documents will eventually cease to be, and this allows ongoing curation of those items. We have various high quality images and scans that will "always be accessible". Portability of data is a key issue – not tying your work to a particular interface that may later fail/change/etc. Need to be able to move information around. I wasn't the only person at that…

Gavin Willshaw, Digital Curator at the Library

I also went to Digital Approaches to Medieval and Renaissance Studies. It was quite specialist in many ways. Lots of hard work but I learned a lot. I was also keen to understand how DH researchers work and what the library can do to support that with collections and tools. There were also some sessions on DIY digitisation – mini projects around doing that, managing that data. Tools such as Retro Mobile – ways to see what lies beneath images. Also some quite good introductory overviews of areas like TEI and IIIF for interoperability etc. Also several workshops to try stuff out directly. Really enjoyed seeing new stuff – e.g. hyperspectral imaging. Also a session on how museums can use wifi signals to track visitors' movements and tailor what they do to that experience. That fed into discussions we've been having about beacons and Bluetooth in the library. From my point of view it was a really interesting mixture of tools and skills and experience. And I'm now looking at how the library can get more involved. Again, would highly recommend.

Humanities Data: Curation, Analysis, Access and Reuse – Rocio von Jungenfeld

I work in the data library, but am also finishing my PhD at ECA. I was looking at data tools and analysis tools, to compare what we do to what others do, and the recommendations from other places. Had very interesting speakers – a combination of HathiTrust and also OxII. They gave really practical advice on software out there, schemas, metadata, methodology, great insights into data tools and analysis. Some tools were new to me, e.g. Gephi, and it was very useful. Good experience overall – there was a big Edinburgh contingent there, stuff done together. Interesting people and good lectures. Would recommend. My notes are available.

Harriet Cornell, Edinburgh Law School

I'm a postdoc in HCA, and project officer for the Political Settlements programme. It was a complex workshop. I went for this one rather than the introduction as I felt up to date, but this was at the sharp end in terms of the technical stuff. It galloped through the technical material, and I would have liked more time on software, but reflecting afterwards it was great. Trying OpenRefine – which I didn't know about – but also how we label and tag data and research. Really useful. Four things I took away:

  1. The capacity for DH projects – having the director of the HathiTrust there was great, showing what is possible.
  2. Tools, particularly OpenRefine – so useful. I know that Anouk and Anna have run workshops on this but it's brilliant.
  3. Labelling and tagging. I do lots of blogging on WordPress and thinking about SEO, and just thinking about that in a different way was great.
  4. Design, Curation, Research and Longevity – thinking about the time and cost of planning and making things properly sustainable, after e.g. 10 years.

If I did it again I'd have done the introductory workshop. But this was great if you were happy to get down with OWL, ontologies, Python etc. I was tweeting from the session with the #dhoxss tag.

Linked Data for the Humanities – Anouk Lang, School of Literature, Languages and Culture

When you are a researcher your data is your baby. My lovingly curated research database with rich information about which historical figures were writing to whom, from which places, at which times. The problem is that if you want to share that data, your stuff is hard for others to use. The solution is ontologies and linked open data.

So what is an ontology? It is a structured way of understanding the context of an object – e.g. for the British Museum it might be where it was from, who acquired it and where and when, when it is from, where it is located, what it is made from. So we have linked data, which we express as a "triple" – a subject, predicate, and object. So for a James Joyce letter you have a lot of known individuals already – James Joyce is out there. And then locations-wise there are lists of locations – you want an authority list (someone else's ontology of places). And then the predicate (e.g. when James Joyce was born) is also already available…

So, data is stored in a "triplestore" and you can query it using SPARQL, which lets you ask about name, place, location etc. There is a structure for SPARQL queries. That lets you query stuff within others' databases.

So, if you are interested in using stuff in others’ databases or how to share your data with others, then you want to learn about Linked Open Data and SPARQL.
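
To make that concrete, here is a minimal sketch (my own illustration, not from the talk) of a SPARQL query run from Python with the SPARQLWrapper library against DBpedia's public endpoint, asking where and when James Joyce was born. Each line of the WHERE clause is a triple pattern: subject, predicate, object.

```python
# Minimal sketch: query DBpedia's public SPARQL endpoint for James Joyce's
# birth place and birth date. Each WHERE-clause line is a triple pattern.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?place ?date WHERE {
        dbr:James_Joyce dbo:birthPlace ?place .
        dbr:James_Joyce dbo:birthDate  ?date .
    }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["place"]["value"], row["date"]["value"])
```

Swap in the endpoint of your own triplestore to query a research database of your own in the same way.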

DHSO – Jim Mistiff

PhD student in English Lit. Went to the summer school and did the Linked Open Data and RDF course. Before my PhD I was a Drupal developer, so I had an interest from that side and it was interesting to see it from another angle. I'll be looking at a specific application of LOD, specifically for a text-heavy usage in my own PhD. Not a perfect solution but an interesting starting point for a conversation.

Getting some definitions out of the way. LOD is a scary term for literary humanists. Any text I'm using is a data object really. Breaking a long text up is a more useful way to think of it. Open Data helps you share data with others (and vice versa). The Linked part lets you interlink stuff. So if, say, there is a DB at Edinburgh (modernist data) and one at UVic (who have modernist correspondence), you can link those together to form one complete set.

One of the results of doing LOD is that your dataset or database becomes part of a version of the internet that is easier for machines to read. Right now the internet is a set of dumb links – LOD allows text data objects to be more machine readable. I think it will be easier to explain that through an example.

So, I'm writing my PhD on Hugh MacDiarmid. I'm interested in his later stuff, particularly a poem called In Memoriam James Joyce. I'm interested in its construction. He wrote the poem through borrowing/plagiarising from other sources which are not credited – e.g. from an ad in the Writers and Artists Yearbook 1949. I've gone through that poem and found about 60% of the sources. That is a very interlinked text that maps nicely onto the idea of LOD. So, what I'd like to do is to datafy MacDiarmid. What I propose, in terms of what Anouk covered, is to take the text, word by word… and take the two lines from the poem, break them into triples… Can do it character by character… can automate… Can then apply Stanford's NLP tools to it… and then identify automatically when you've got the name of a real person in there, the name of a text… A very vague linked data model here… Bits in boxes (in his diagram) that are in other sources. So a mention of Pape – meaning Capt A.G. Pape – and we could grab info from DBpedia. And then a text by that author could be pulled in/connected to. Etc. This makes assumptions about what else is out there. But what it gives us is a rich version of the poem. We tend to read many texts in these ways – looking up definitions, references etc. Not suggesting a change in core tasks, but using technology to enhance what we do. We could turn an LOD version of the poem into a hyperlinked version that pops up those obscure references etc. Much of what we already do, but in an automated way.
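
As a rough illustration of what "datafying" a poem might look like (my own sketch, not Jim's actual data model), here a line of the poem becomes a resource in Python's rdflib, with a recognised proper name linked out to DBpedia; the namespace and line identifier are invented for the example.

```python
# Sketch: represent a poem line as an RDF resource and link a detected
# proper name out to DBpedia. Namespace and line id are hypothetical.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import DCTERMS

POEM = Namespace("http://example.org/in-memoriam-james-joyce/")  # hypothetical
g = Graph()

line = POEM["line-42"]                                  # hypothetical line id
g.add((line, DCTERMS.isPartOf, POEM["poem"]))
g.add((line, DCTERMS.description, Literal("…text of the line…")))
# A name recognised in the line, linked to an external authority:
g.add((line, DCTERMS.references, URIRef("http://dbpedia.org/resource/James_Joyce")))

print(g.serialize(format="turtle"))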

My suggestion expects and relies on other people doing parts of the work. What we can all do is get our data closer to LOD than it is. There are five steps (see 5stardata.info by Tim Berners-Lee). Step one is to get it online – e.g. a PDF on the web. The next step is to structure that data – a table rather than an image of a table, for instance. The next step is an open format – so that table in CSV rather than Excel. The next step is using RDF to point to things. The final step is the linking itself – linking from proper names and quotations to other names and data stores. And then we have LOD.

DHSI, University of Victoria, Canada – Anouk Lang

This is the premier DH summer school but costly to get to. It is cheaper with student membership of a computing in the humanities association (?).

I did three workshops – I was there for 3 weeks. I'm going to talk about Programming for Human|ist|s. My area uses R, Python etc. I did an intense week-long Python course. So you start with a spec. For us, we decided to construct a script that would visit a website and pull out certain bits of information relating to discussion posts (username, date of posting, content) and write those to a spreadsheet so they can be used for analysis. So, a web scraper.

Then you write Pseudocode. And that includes pulling in other people’s stuff – loads of Googling Stack Exchange for code and libraries.

So, Beautiful Soup does loads of the web scraping. Then we wanted it to write to a file. So then you can build the code and run it, and it creates a file. Now when I did that I did pull out the appropriate text. It looks like a mess, but that's a starting point. I spent the rest of the summer consolidating things, and doing some other things. That included building something fun using a Markov chain generator – to feed text in, and produce lovely parodies. You can then use Python to automate Twitter posts. So we did a fun PatrickTwite Twitterbot (to go with a book launch).
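
For a sense of what the finished scraper might look like, here is a condensed sketch along the lines described (not the actual course code): fetch a discussion page, pull out username, date and post content with Beautiful Soup, and write them to a CSV. The URL and the CSS classes are hypothetical, since every forum's markup differs.

```python
# Sketch of the scraper: fetch a (hypothetical) forum thread, extract each
# post's username, date and content, and write them to a CSV spreadsheet.
import csv
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.org/forum/thread-1").text   # hypothetical URL
soup = BeautifulSoup(html, "html.parser")

with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["username", "date", "content"])
    for post in soup.select("div.post"):                          # hypothetical markup
        writer.writerow([
            post.select_one(".username").get_text(strip=True),
            post.select_one(".date").get_text(strip=True),
            post.select_one(".content").get_text(strip=True),
        ])
```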

Programming Historian Live – Anna Groundwater, History, Classics and Archaeology

I'm a historian. I'm here to talk about Programming Historian Live, which I attended in London in October. My first encounter was with the website, and that is a fantastic website (programminghistorian.org) – free, open, very enabling. It takes you through tutorials in lots of software you can use. Great range, from Zotero and AntConc (a corpus analysis tool) to Python. Stuff on data cleaning and network analysis. Using Omeka to do online exhibits – which will be used for a masters renaissance course. It comes from the best DH ethos. I recommend W3C for LOD and Open Data. There is also a clear tutorial on SPARQL on Programming Historian. It also has web scraping tutorials.

My interest is network analysis. Martin Düring uses Palladio – a free online network analysis tool which visualises those networks. The site takes you through it step by step, and gives you working examples with data from the website so that you can do it yourself. Actually doing stuff is a great way to learn. I've now recommended it to several dissertation students who will use Palladio and then move on to Gephi. One thing to add is that these amazing tools are compelling and exciting, but there is a theoretical underpinning that you need to understand if you use them. And Programming Historian also covers areas of that.

So, fantastic resources here (and Anouk also gave a lovely plug for EDINA's geospatial tools and expertise, including my colleagues' QGIS training).

The session on the BL was led by James Baker (now at Sussex, formerly at the BL). He's also an SSI fellow – the fellowship gives you £3k for DH work and a profile there. I learned a lot on AntConc, on TEI, and on shell and wget for web scraping.

AntConc was with Anouk, and the tutorial was written by Anouk and Heather Froehlich at Stirling. AntConc helps you analyse a corpus. It's free to download to look at your corpus. So, for example, a set of movie reviews (test data from Programming Historian) that has "shot" marked up across the reviews. I like this because it combines the patterns of big sets of data with the ability to see the exact context – it combines distant and close reading concepts. And here "shot" is shown in its many definitions and uses. Concordance is seeing that word with its surrounding words (and you can choose the number of surrounding words). I'm interested in this because of the cognitive geography of James VI and I, using text analysis with AntConc. Some work has already been done on Agatha Christie using AntConc (in a paper on language and dementia by Ian Lancashire [PDF]).

Other useful resources: James Baker did a great reflective post on his blog, Cradled in Caricature. Also there are a range of people I recommend following on Twitter around digital scholarship, digital humanities, etc. including: @heatherfro, @williamjturkel, @adam_crymble, @ruthahnert, @melissaterras.

And that brought Section 1 to a close. Section 2 of this event – which I was presenting at – was collaboratively noted. See: https://docs.google.com/document/d/1U1CTcTi7hC-1gVq5HD762k2MCgSXaXDfPp5KsDNfOlk/

Feb 26 2016
 

Today I am at the British Library (BL) Labs Roadshow 2016 event in Edinburgh. I’m liveblogging so, as usual, all comments, corrections and additions are very much welcomed.

Introduction – Dr Beatrice Alex, Research Fellow at the School of Informatics, University of Edinburgh

I am delighted to welcome the team from the British Library Labs today, this is one of their roadshows. And today we have a liveblogger (that's me) and we are encouraging you to tweet to the hashtag #bldigital.

Doing digital research at the British Library – Nora McGregor, Digital Curator at the British Library

Nora is starting with a brief video on the British Library – to a wonderful soundtrack made from the collections by DJ Yoda. If you read 5 items a day it would take you 80,000 years to get through the collections. Among the oldest things in the collection are oracle bones – 3000 years old. Some of the newest items are the UK Web Archive – contemporaneous websites.

Today we are here to talk about the digital Research Team. We support the curation and use of the BL’s Digital collections. And Ben and Mahendra, talking today, are part of our Carnegie Funded digital research labs.

We help researchers by working with those operating at the intersection of academic research, cultural heritage and technology to support new ways of exploring and accessing the BL collections. This is through getting content into digital forms and supporting skills development, including the skills of BL staff.

In terms of getting digital content online we curate collections to be digitised and catalogued. Within digitisation projects we now have a digital curation role dedicated to that project, who can support scholars to get the most out of these projects. For instance we have a Hebrew Manuscripts digitisation project – with over 3000 manuscripts spanning 1000 years digitised. That collection includes rare scrolls and our curator for this project, Adi, has also done things like creating 3D models of artefacts like those scrolls. So these curators really ensure scholars get the most from digitised materials.

You can find this and all of our digitisation projects on our website: http://bl.uk/subjects/digital-scholarship where you can find out about all of our curators and get in touch with them.

We are also supporting different departments to get paper based catalogues into digital form. So we had a project called Collect e-Card. You won't find this on our website, but our cards, which include some in, for instance, Chinese scripts or Urdu, are being crowdsourced so that we can make materials more accessible. Do take a look: http://libcrowds.com/project/urducardcatalogue_d1.

One of the things we initially set up for our staff as a two year programme was a Digital Research Support and Guidance programme. That kicked off in 2012 and we've created 19 bespoke one-day courses for staff covering the basics of Digital Scholarship, delivered on a rolling basis. So far we have delivered 88 courses to nearly 400 staff members. Those courses mean that staff understand the implications of requests for images at specific qualities, understand text mining requests and questions, etc.

These courses are intended to build capacity. The materials from these courses are also available online for scholars. And we are also here to help if you want to email a question we will be happy to point you in the right direction.

So, in terms of the value of these courses… A curator came to a course on cleaning up data and she went on to get a grant of over £70k for Big Data History of Music – a project with Royal Holloway to undertake analysis as a proof of concept around patterns in the history of music – trends in printing for instance.

We also have events, competitions and awards. One of these is "Off the Map", a very cool endeavour, now in its fourth year. I'm going to show you a video on The Wondering Lands of Alice, our most recent winner. We digitise materials for this competition and teams compete to build video games – this one is actually in our current Alice in Wonderland exhibition. It uses digitised content from our collection and you can see the calibre of these is very high.

There is a new competition open now. The new one is for any kind of digital media based on our digital collections. So do take a look at this.

So, if you want to get in touch with us you can find us at http://bl.uk/digital or tweet #bldigital.

British Library Labs – Mahendra Mahey, Project Manager of British Library Labs.

You can find my slides online (link to follow).

I manage a project called British Library Labs, based in the Digital Research team, who we work closely with. What we are trying to do is to get researchers, artists, entrepreneurs, educators, and anyone really to experiment with our digital collections. We are especially interested in people finding new things from our collections, especially things that would be very difficult to do with our physical collections.

What I thought I'd do, to show the space the project occupies, is to show you some work from a researcher called Adam Crymble, King's College London (a video called Big Data + Old History). Adam entered a competition to explain his research in visual/comic book format (we are now watching the video, which talks about using digital texts for distant reading and computational approaches to selecting relevant material, and to quantify the importance of key factors).

Other examples of the kinds of methods we hope researchers will use with our data span text mining and georeferencing, as well as creative reuses.

Just to give you a sense of our scale… The British Library says we are the world’s largest library by number of items. 180 million (or so) items, with only about 1-2% digitised. Now new acquisitions do increasingly come in digital form, including the UK Web Archive, but it is still a small proportion of the whole.

What we are hoping to do with our digital scholarship site is to launch data.bl.uk (soon) where you can directly access data. But as I did last year I have also brought a network drive so you can access some of our data today. We have some challenges around sharing data, we sometimes literally have to shift hard drives… But soon there will be a platform for downloading some of this.

So, imagine 20 years from now… I saw a presentation on technology and how we use "digital"… Well, we won't use "digital" in front of scholarship or humanities, it will just be part of the mainstream methodologies.

But back to the present… The reason I am here is to engage people like you, to encourage you to use our stuff, our content. One way to do this is through our BL Labs Competition, the deadline for which is 11th April 2016. And, to get you thinking, the best idea pitched to me during the coffee break gets a goodie bag – you have 30 seconds in that break!

Once ideas are (formally) submitted to the BL there will be 2 finalists announced in late May 2016. They then get a residency with some financial (up to £3600) and technical and curational support from June to October 2016. And a winner is then announced later in the year.

We also have the BL Labs Awards. This is for work already done with our content in interesting and innovative ways. You can submit projects – previous and new – by 5th September 2016. We have four categories: Artistic; Commercial; Research; and Learning/Teaching. Those categories reflect the increasingly diverse range of those engaging with our content. Winners are announced at a symposium on 7th November 2016 when prizes are given out!

So today is all about projects and ideas. Today is really the start of the conversation. What we have learned so far is that the kinds of ideas that people have will change quite radically once you try and access, examine and use the data. You can really tell the difference between someone who has tried to use the data and someone who has not when you look at their ideas/competition entries. So, do look at our data, do talk to us about your ideas. Aside from those competitions and awards we also collaborate in projects so we want to listen to you, to work with you on ideas, to help you with your work (capacity permitting – we are a small team).

Why are we doing this? We want to understand who wants to use our material, and more importantly why. We will try and give some examples to inspire you, to give you an idea of what we are doing. You will see some information on your seat (sorry blog followers, I only have the paper copy to hand) with more examples. We really want to learn how to support digital experiments better, what we can do, how we can enable your work. I would say the number one lesson we have learned – not new but important – is that it’s ok to make mistakes and to learn from these (cue a Jimmy Wales Fail Faster video).

So, I'm going to talk about the competition. One of our two finalists last year was Adam Crymble – the same one whose PhD project was highlighted earlier – and he's now a lecturer in Digital History. He wanted to crowdsource tagging of historical images through Crowdsource Arcade – harnessing the appeal of 80s video games to improve the metadata and usefulness of historical images. So we needed to find an arcade machine, and then set up games on it – like Tag Attack – created by collaborators across the world. Tag Attack used a fox character trotting out images which you had to tag into one of four categories before he left the screen.

I also want to talk about our Awards last year. Our Artistic award winner last year was Mario Klingemann – Quasimondo. He found images of 44 men who look 44 among the Flickr images – a bit of code he wrote for his birthday! He also found Tragic Looking Women, etc. All of these were done computationally.

In the Commercial category our entrant used images to cross-stitch ties that she sold on Etsy.

The winner last year, from the Research category, was Spatial Humanities at Lancaster, looking for disease patterns and mapping those.

And there was a Special Jury prize for James Heald, who did tremendous work with Flickr images from the BL, making them more available on Wikimedia, particularly map data.

Finally, loads of other projects I could show… One of my favourites is a former Pixar animator who developed some software to animate some of our images (The British Library Art Project).

So, some lessons we have learned is that there is huge appetite to use BL digital content and data (see Flickr Commons stats later). And we are a route to finding that content – someone called us a “human API for the BL content”!

We want to make sure you get the most from our collections, we want to help your projects… So get in touch.

And now I just want to introduce Katrina Navickas who will talk about her project.

Political Meetings Mapper – Katrina Navickas

I am part of the Digital History Research Centre at the University of Hertfordshire. My focus right now is on Chartism, the big movement in the 19th Century campaigning for the vote. I am especially interested in the meetings they held, where and when they met and gathered.

The Chartists held big public meetings, but also weekly local meetings advertised in the press, particularly the local press. The BL holds huge amounts of those newspapers. So my challenge was to find out more about those meetings – how many were advertised in the Northern Star newspaper from 1838 to 1850. The data is well structured for this… Now that may seem like a simple computational challenge, but I come from a traditional research background, used to doing things by hand. I wanted to do this more automatically, at a much larger scale than previously possible. My mission was to find out how many meetings there were, where they were held, and how we could find those meetings automatically in the newspapers. We also wanted to make connections between the papers, georeferenced historical maps, and also any meetings that appear in playbills, as some meetings were in theatres (though most were in pubs).

But this wasn’t that simple to do… Just finding the right files is tricky. The XML is some years old so is quite poor really. The OCR was quite inaccurate, hard to search. And we needed to find maps from the right period.

So, the first stage was to redo the OCR of the original image files. Initially we thought we'd need to do what Bob Nicholson did with historic jokes, which was getting volunteers to re-do them. But actually newer OCR software (ABBYY FineReader 12) did a much better job and we just needed a volunteer student to check the text – mainly for punctuation rather than spelling. Then we needed to geocode places using a gazetteer. And then we used Python code with regular expressions to extract dates, and some basic NLP to calculate the dates of words like "tomorrow" – easier as the paper always came out on a Saturday.
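
A simplified sketch of that regex-plus-rules idea (my own illustration, not Katrina's actual code): find explicit weekday mentions and relative words like "to-morrow" in an advert, then resolve them against the issue date, which for the Northern Star was always a Saturday.

```python
# Sketch: find date-like mentions in an advert and resolve them relative to
# the Saturday issue date of the newspaper.
import re
from datetime import date, timedelta

DATE_RE = re.compile(
    r"\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b"
    r"|\bto-?morrow\b|\bthis evening\b",
    re.IGNORECASE,
)

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

def resolve(mention: str, issue_date: date) -> date:
    """Turn a textual mention into a calendar date relative to the issue date."""
    m = mention.lower().replace("-", "")
    if m == "tomorrow":
        return issue_date + timedelta(days=1)
    if m == "this evening":
        return issue_date
    # A named weekday means the next occurrence of that day after the Saturday issue.
    days_ahead = (WEEKDAYS.index(m) - issue_date.weekday()) % 7
    return issue_date + timedelta(days=days_ahead or 7)

advert = "A public meeting will be held on Monday evening at the Hall of Science."
issue = date(1841, 5, 1)   # a Saturday
for match in DATE_RE.finditer(advert):
    print(match.group(0), "->", resolve(match.group(0), issue))
```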

So, in terms of building a historical gazetteer: we extracted place names and ran them through http://sandbox.idre.ucla.edu/tools/geocoder, with lat and long parameters to check locations. But we still needed to do some geocoding by hand. The areas we were looking at have changed a lot through slum clearances. We therefore needed to geolocate some of the historical places, using detailed georeferenced 1840s maps of Manchester, and geocoding those.
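
The gazetteer step might look something like this minimal sketch, under the assumption that the historical gazetteer has been exported to a CSV of place name, latitude and longitude (the file name and example places are hypothetical):

```python
# Sketch: look extracted place names up in a local historical gazetteer CSV;
# anything not found is flagged for geocoding by hand.
import csv

def load_gazetteer(path: str) -> dict:
    """Map lower-cased place names to (lat, lon) pairs."""
    gaz = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            gaz[row["place"].lower()] = (float(row["lat"]), float(row["lon"]))
    return gaz

gazetteer = load_gazetteer("manchester_1840s_gazetteer.csv")   # hypothetical file

places = ["Hall of Science", "Carpenters' Hall", "Tib Street"]  # hypothetical examples
for name in places:
    coords = gazetteer.get(name.lower())
    print(name, "->", coords if coords else "needs geocoding by hand")
```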

In the end, at the scale of this project, we looked at only 1841-1844. From that we extracted 5519 meetings (and counting), identifying text and dates. And that coverage spanned 462 towns and villages (and counting). In that data we found 200+ lecture tours – Chartist lecturers were paid to go on tours.

So, you can find all of our work so far here: http://politicalmeetingsmapper.co.uk. The website is still a bit rough and ready, and we'd love feedback. It's built on the Omeka platform – designed for showing collections – which also means we have some limitations, but it does what we wanted it to.

Our historical maps are with thanks to the NLS, whose brilliant historical mapping tiles – albeit from a slightly later map – were easier to use than the BL georeferenced map when it came to plotting our data.

Interestingly, although this was a Manchester paper, we were able to see meeting locations in London – which let us compare to Charles Booth's poverty maps, and also to do some heatmapping of that data. Basically we are experimenting with this data… Some of this stuff is totally new to me, including trialling a machine learning approach to understanding the texts of a meeting advertisement – using an IPython Notebook to make a classifier to try to identify meeting texts.
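
As a toy sketch of that classification idea (not the project's actual notebook), a bag-of-words model can be trained on snippets labelled as meeting adverts or not, then used to flag likely meeting notices in unseen newspaper text; the training snippets here are invented.

```python
# Sketch: train a small bag-of-words classifier to spot meeting adverts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets; in practice these would come from checked OCR text.
texts = [
    "A public meeting of the Chartists will be held on Monday evening",
    "Mr. O'Connor will lecture at the Hall of Science to-morrow",
    "For sale, a quantity of superior Welsh flannel",
    "Births, marriages and deaths registered this week",
]
labels = [1, 1, 0, 0]   # 1 = meeting advert, 0 = other column text

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["A meeting will take place at the Carpenters' Hall on Tuesday"]))
```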

So, what next? Well we want to refine our NLP parsing for more dates and other data. And I also want to connect “forthcoming meetings” to reports from the same meeting in the next issue of the paper. Also we need to do more machine learning to identify columns and types of texts in the unreconstructed XML of the newspapers in the BL Digital Collections.

Now that’s one side of our work, but we also did some creative engagement around this too. We got dressed up in Victorian costume, building on our London data analysis and did a walking tour of meetings ending in recreating a Chartist meeting in a London Pub.

Q&A

Q1) I’m looking at Data mining for my own research. I was wondering how much coding you knew before this project – and after?

A1) My training had only been in GIS, and I'd done a little introduction to coding, but I basically spent the summer learning how to do this Python coding. Having a clear project gave me the focus and opportunity to do that. I still don't consider myself a Digital Historian I guess, but I'm getting there. So, whether or not you have any coding skills already, don't be scared, do enter the competition – you get help, support, and are pointed in the right direction to learn the skills you need.

Farces and Failures: an overview of projects that have used the British Library's digital content and data – Ben O'Steen, Technical Lead of British Library Labs.

My title isn't because our work is farce and failure… It's intentionally referencing the idea that it is really important early in the process to ensure we have a shared understanding of terminology, as mismatches there can cause all manner of confusion. The names and labels we choose shape the questions that people will ask and the assumptions we make. For instance "Labs" might make you imagine test tubes… or puppies… In fact we are based in the BL building in St Pancras, in offices, with curators.

Our main purpose is to make the collections available to you, to help you find the paths to walk through, where to go, what you can find, where to look. We work with researchers on their specific problems, and although that work is specific we are also trying to assess how widely this problem is felt. Much of our work is to feed back to the library what researchers really want and need to do their work.

There is also this notion that people tell us things that they think we need to hear in order to help them. As if you need secret passwords to access the content; people can see us as gatekeepers. But that isn't how BL Labs works. We are trying to develop things that avoid the expected model of scholarship – of coming in, getting one thing, and leaving. That's not what we see. We see scholars looking for 10,000 things to work with. People ask us "Give me all of collection X", but is that useful? Collections are often collected that way, named that way, for administrative reasons – the naming associated with a particular digitisation funder, or from a collection. So the Dead Sea Scrolls are scanned in a music collection because the settings were the same for digitising them… That means the "collection" isn't always that helpful.

So farce… Think of the Fork Handles/Four Candles sketch…

We have some common farce-inducing words:

  • Collection (see above)
  • Access – that has many meanings; sometimes "access" means "on-site only" and without download, etc.
  • Content – we have so much, that isn’t a useful term. We have personal archives, computers, archives, UK Web domain trawl, pictures of manuscripts, OCR, derived data. Content can be anything. We have to be specific.
  • Metadata – one person's metadata is another's data. Not helpful except in a very defined context.
  • Crowdsourced – means different things to different people. You must understand how the data was collected – what was the community, how did they do it, what was the QA process. That applies to any collaborative research data collection, not just crowdsourcing.

An example of complex provenance…

Microsoft Books digitisation project. It started in 2007 but stopped in 2009 when the MS Book Search project was cancelled. This digitised 49K works (~65k volumes). It has been online since 2012 via a "standard" page turning interface but we have very low usage statistics. That collection is quite random; items were picked shelf by shelf with books missing. People do data analysis of those works and draw conclusions that don't make sense if you don't understand that provenance.

So we had a competition entry in 2013 that wanted to analyse that collection… But it actually led to a project called the Sample Generator by Pieter Francois. This compared physical to digital collections to highlight just how unrepresentative that sample is for drawing any conclusions.

Allen B. Riddell looked at the HathiTrust corpus in a 2012 piece called "Where are the novels?", which similarly looked at the bias in digitised resources.

We have really big gaps in our knowledge. In fact librarians may recognise the square brackets of the soul… the data in records that isn't actually confirmed, inferred information within metadata. If you look at the Microsoft Books project it's about half inferred information. A lot of the peaks in the Sample Generator view of what has been digitised are down to inferred years of publication based on content – guesswork rather than reliable dates.

But we can use this data. So Bob Nicholson’s competition entry on Victorian Jokes led to the Mechanical Comedian Twitter account. We didn’t have a good way into these texts, we had to improvise around these ideas. And we did find some good jokes… If you search for “My Mother in-law” and “Victorian Humour” you’ll see a great video for this.

That project looked for patterns of words. That’s the same technique applied to Political Meetings Mapper.

So "Access" again… These newspapers were accessible but we didn't have access to them… Keyword search fails miserably and bulk access is an issue. But that issue is useful to know about. Research and genealogical needs are different, and these papers were digitised partly for those more lucrative genealogical needs to browse and search.

There are over 600 digital archives, and we can only spend so long characterising each of them. The Microsoft Books digitisation project was public domain so that let us experiment richly and quickly. We identified images of people, we found image details. We started to post images to Twitter and Tumblr (via the Mechanical Curator)… There was demand and we weren't set up to deliver it, so we used Flickr Commons – 1 TB for free – with only limited awareness of what page an image was from, what region. We had minimal metadata but others started tagging and adding to our knowledge. Nora did a great job of collating these images that had started to be tagged (by people and machines). And usage of the images has been huge: 13-20 million hits on average every month, over 330 million hits to date.

Is this Iterative Crowdsourcing (Mia Ridge)? We crowdsource broad facts and subcollections of related items will emerge. There is no one size fits all; it has to be project based. We start with no knowledge but build from there. But these have to be purposefully contextless. Presenting them on Flickr removed the illustrations' context. The sheer amount of data is huge. David Foster Wallace has a great comment that "if your fidelity to perfectionism is too high, you never do anything". We have a fear of imperfection in all universities, and we need to have the space to experiment. We can re-represent content in new forms; it might work, it might not. Metaphors don't translate between media – like turning pages on a screen, or scrolling a book forever.

With our map collection we ran a tagathon and found nearly 30,000 maps. 10,000 were tagged by hand, 20,000 were found by machine. We have that nice combination of human and machine. We are now trying to georeference our maps and you can help with that.

But it’s not just research… We encourage people to do new things – make colouring books for kids, make collages – like David Normal’s Burning Man installation (also shown at St Pancras). That stuff is part of playing around.

Now, I've talked about "crowdsourcing" several times. There can be lots of bad assumptions behind that term. It's assumed to be about a crowd of people all doing a small thing, about special software, that if you build it they will come, it's easy, it's cheap, it's totally untrustworthy… These aren't right. It's about being part of a community, not just using it. When you look at Zooniverse data you see a common pattern – that 1-2% of your community will do the majority of the work. You have to nurture the expert group within your community. This means you can crowdsource starting with that expert group – something we are also doing in a variety of those groups. You have to take care of all your participants but that core crowd really matters.

So, for crowdsourcing you don't need special software. If you build something they don't necessarily come; they often don't. And something we like to flag up is the idea of playing games, trying the unusual… Can we avoid keyboard and mouse? That arcade game does that; it asks whether we can make use of casual interaction to get useful data. That experiment is based on a Raspberry Pi and loads of great ideas from others using our collections. They are about the game dynamic… How we deal with data – how to understand how the game dynamics impact on the information you can extract.

So, in summary…

Don’t be scared of using words like “collection” and “access” with us… But understand that there will be a dialogue… that helps avoid disappointment, helps avoid misunderstanding or wasted time. We want to be clear and make sure we are all on the same page early on. I’m there to be your technical guide and lead on a project. There is space to experiment, to not be scared to fail and learn from that failure when it happens. We are there to have fun, to experiment.

Questions & Discussion

Q1) I’m a historian at the National Library of Scotland. You talked about that Microsoft Books project and the randomness of that collection. Then you talked about the Flickr metadata – isn’t that the same issue… Is that suitable for data mining? What do you do with that metadata?

A1) A good point. Part of what we have talked about is that those images just tell you about part of one page in a book. The mapping data is one of the ways we can get started on that. So if we geotag an image or a map with Aberdeen then you can perhaps find that book via that additional metadata, even if Aberdeen would not be part of the catalogue record, the title etc. There are big data approaches we can take but there is work on OCR etc. that we can do.

Q2) A question for Ben about tweeting – the Mechanical Curator and the Mechanical Comedian. For the Curator… They come out fairly regularly… How are they generated?

A2) That is mechanical… There are about 1200 lines of code that roam the collection looking for similar stuff… The text is generated from the books' metadata… It is looking at data on the hard drive – with access to everything, so quite random. If there is no match it finds another random image.

Q2) And the Mechanical Comedian?

A2) That is run by Bob. The jokes are mechanically harvested, but he adds the images. He does that himself – with a bit of curation in terms of the badness of jokes – and adds images with help of a keen volunteer.

Q3) I work at the National Library of Scotland. You said to have fun and experiment. What is your response to the news of job cuts at Trove, at the National Library of Australia?

A3 – Ben) Trove is a leader in this space and I know a lot of people are incredibly upset about that.

A3 – Nora) The thing with digital collections is that they are global. Our own curators love Trove and I know there is a Facebook group to support Trove so, who knows, perhaps that global response might lead to a reversal?

Mahendra: I just wanted to say again that learning about the stories and provenance of a collection is so important. Talking about the back stories of collections. Sometimes the reasons content is not made available have nothing to do with legality… Those personal connections are so important.

Q4) I’m interested in your use of the IPython Notebook. You are using that to access content on BL servers and website? So you didn’t have to download lots of data? Is that right?

A4) I mainly use it as a communication tool between myself and Ben… I type ideas into the notebook, Ben helps me turn that into code… It seemed the best tool to do that.

Q4) That’s very interesting… The Human API in action! As a researcher is that how it should be?

A4) I think so. As a researcher I'm not really a coder. For learning, these spaces are great, they act as a sandbox.

Q4) And your code was written for your project, should that be shared with others?

A4) All the code is on a GitHub page. It isn’t perfect. That extract, code, geocode idea would be applicable to many other projects.

Mahendra: There is a balance that we work with. There are projects that are fantastic partnerships of domain experts working with technical experts who want problems to solve. But we also see domain experts wanting to develop technical skills for their projects. We've seen both. Not sure of the answer… We did an event at Oxford, who run a critical coding course where they team up humanities scholars and computer scientists… It gives the computer scientists experience of really insanely difficult problems, and the academics get experience of framing questions in precise ways…

Ben: And by understanding coding and…

Comment (me): I just wanted to encourage anyone creating research software to consider submitting papers on that to the Journal of Open Research Software, a metajournal for sharing and finding software specifically created for research.

Q5) It seemed like the Political Meetings Mapper and the Palimpsest project had similar goals, so I wondered why they selected different workflows.

A5 – Bea Alex) The project came about because I spoke to Miranda Anderson who had the idea at the Digital Scholarship Day of Ideas. At that time we were geocoding historical trading documents and we chatted about automating that idea of georeferencing texts. That is how that project came about… There was a large manual aspect as well as the automated aspects. But the idea was to reduce that manual effort.

A5 – Katrina) Our project was a much smaller team. This was very much a pilot project to meet a particular research issue. The outcomes may seem similar but we worked on a smaller scale, seeing what one researcher could do. As a traditional academic historian I don't usually work in groups, let alone big teams. I know other projects work at larger scale though – like Ian Gregory's Lakes project.

A5 – Mahendra) Time was a really important aspect in decisions we took in Katrina’s project, and of focusing the scope of that work.

A5 – Katrina) Absolutely. It was about what could be done in a limited time.

A5 – Bea) One of the aspects from our work is that we sourced data from many collections, and the structure could be different for each mention. Whereas there is probably a more consistent structure because of the single newspaper used in Katrina’s project, which lends itself better to a regular expressions approach.

And next we moved to coffee and networking. We return at 3.30 for more excellent presentations (details below). 

BL Labs Awards: Research runner up project: “Palimpsest: Telling Edinburgh’s Stories with Maps” – Professor James Loxley, Palimpsest, University of Edinburgh

I am going to talk about a project which I led in collaboration with colleagues in English Literature, with Informatics here, with visualisation experts at St Andrews, and with EDINA.

The idea came from Miranda Anderson, in 2012, who wanted to explore how people imagine Edinburgh in a literary sense, how the place is imagined and described. And one of the reasons for being interested in doing this is the fact that Edinburgh was the world’s first UNESCO City of Literature. The City of Literature Trust in Edinburgh is also keen to promote that rich literary heritage.

We received funding from the AHRC from January 2014 to March 2015. The name came from the concept of the palimpsest, the text that is rewritten, erased and layered upon – and of the city as a palimpsest, changing and layering over time. The original website was to have the same name but, as that wasn't quite as accessible, we called it LitLong in the end.

We had some key aims for this project. There are particular ways literature is packaged for tourists etc. We weren’t interested in where authors were born or died. Or the authors that live here. What we were interested in was how the city is imagined in the work of authors, from Robert Louis Stevenson to Muriel Spark or Irvine Welsh.

And we wanted to do that in a different way. Our initial pilot in 2012 was all done manually. We had to extract locations from texts. We had a very small data set and it offered us things we already knew – relying on well known Edinburgh books, working with the familiar. The kind of map produced there told us what we already knew. And we wanted to do something new. And this is where we realised that the digital methods we were thinking about really gave us an opportunity to think of the literary cityscape in a different mode.

So, we planned to text mine large collections of digital text to identify narrative works set in Edinburgh. We weren't constrained to novels; we included short stories, memoirs… imaginative narrative writing. We excluded poetry as that was too difficult a processing challenge for the scale of the project. And we were very lucky to have support from, and access to, British Library works, as well as material from the HathiTrust and the National Library of Scotland. We mainly worked with out of copyright works, but we did specifically get permission from some publishers for in-copyright works. Not all publishers were forthcoming and happy for work to be text mined. We were text mining works – not making them freely available – but for some publishers full text for text mining wasn't possible.

So we had large collections of works, mainly but not exclusively out of copyright. And we set about text mining those collections to find those set in Edinburgh. Then we georeferenced the Edinburgh place names in those works to make mapping possible. And then finally we created visualisations offering different viewpoints into the data.

The best way to talk about this is to refer to text from our website:

Our aim in creating LitLong was to find out what the topography of a literary city such as Edinburgh would look like if we allowed digital reading to work on a very large body of texts. Edinburgh has a justly well-known literary history, cumulatively curated down the years by its many writers and readers. This history is visible in books, maps, walking tours and the city’s many literary sites and sights. But might there be other voices to hear in the chorus? Other, less familiar stories? By letting the computer do the reading, we’ve tried to set that familiar narrative of Edinburgh’s literary history in the less familiar context of hundreds of other works. We also want our maps and our app to illustrate old connections, and forge new ones, among the hundreds of literary works we’ve been able to capture.

That’s the kind of aims we had, what we were after.

So our method started with identifying texts with a clear Edinburgh connection or, as we called it, "Edinburghyness". Then, within those works, we tried to understand just how relevant they were. And that proved tricky. Some of the best stuff about this project came from close collaboration between literary scholars and informatics researchers. The back and forth was enormously helpful.

We came across some seemingly obvious issues. The first thing we saw was that there was a huge number of theological works… which was odd… and turned out to be because the Edinburgh place name "Trinity" was in there. Then "Haymarket" is a place in London as well as Edinburgh. So we needed to rank place names, and part of that was dealing with the ambiguity of names, and understanding that some places are more likely to specifically be Edinburgh than others.
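
An illustrative sketch of that ranking idea (my own, with invented weights, not the project's code): ambiguous names like "Trinity" or "Haymarket" carry less weight than distinctively Edinburgh names, and a work's "Edinburghyness" is the sum of the weights of the names it mentions.

```python
# Sketch: score a text's "Edinburghyness" from weighted place name mentions.
# The weights here are invented for the example.
PLACE_WEIGHTS = {
    "cowgate": 1.0,
    "canongate": 1.0,
    "grassmarket": 1.0,
    "arthur's seat": 0.9,
    "haymarket": 0.3,   # also a place in London
    "trinity": 0.1,     # very common in theological works
}

def edinburghyness(text: str) -> float:
    """Sum the weights of Edinburgh place names mentioned in a text."""
    lowered = text.lower()
    return sum(w for name, w in PLACE_WEIGHTS.items() if name in lowered)

sample = "They walked from the Grassmarket up the Cowgate towards Haymarket."
print(edinburghyness(sample))   # 2.3
```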

From there, with the selected works, we wanted to draw out snippets – of varying lengths but usually a sensible syntactic shape – with those mentions of specific place names.

At the end of that process we had a dataset of 550 published works, across a range of narrative genres. They contain over 1600 Edinburgh place names of lots of different types, since literary engagement with a city might be with a street, a building, open spaces, areas, monuments etc. In mapping terms you can be more exact; in literature you have these areas and diverse types of "place", so our gazetteer needed to be flexible to that. And what that all gave us in total was 47,000 extracts from literary works, all focused on a place name mention.

That was the work itself, but we also wanted to engage people in our work. So we brought Sir Walter Scott back to life. He came along to the Edinburgh International Book Festival in 2014. He kind of got away from us and took on a life of his own… He ended up being part of the celebrations of the 200th anniversary of Waverley. And he popped up again last year on the Borders Railway when that launched! That was fun!

We did another event at EIBF in 2015 with James Robertson who was exploring LitLong and data there. And you can download that as a podcast.

So, we were very very focused on making this project work, but we were also thinking about the users.

The resource itself you can visit at LitLong.org. I will talk a little about the two forms of visualisation. The first is a location visualiser, largely built and developed by Uta Hinrichs at St Andrews. That allows you to explore the map, to look at keywords associated with locations – which indicate a degree of qualitative engagement. We also have a searchable database where you can see the extracts. And we have an app version which allows you to wander in among the extracts, rather than see them from above – our visualisation colleagues call this the "Frog's Eye View". You can wander between extracts, browse the range of them. It works quite well on the bus!

We were obviously delighted to be able to do this! Some of the obstacles seemed tough but we found workable solutions… But we hope it is not the end of the story. We are keen to explore new ways to make the resource explorable. Right now there isn't a way for interaction to leave a trace – other people's routes through the city, other people's understanding of the topography. There is scope for more analysis of the texts themselves. For instance we considered doing a mood map of the city; we weren't able to do that in this project, but there is scope to do it. And as part of building on the project we have a bit of funding from the AHRC, so there are lots of interesting lines of enquiry there. And if you want to explore the resource do take a look, get in touch etc.

Q&A

Q1) Do you think someone could run sentiment analysis over your text?

A1) That is entirely plausible. The data is there and tagged so that you could do that.

A1 – Bea) We did have an MSc project just starting to explore that in fact.

A1) One of our buttons on the homepage is “LitLong Lab” where we share experiments in various ways.

Q2) Some science fiction authors have imagined near future Edinburgh, how could that be mapped?

A2) We did have some science fiction in the texts, including the winner of our writing competition. We have texts from a range of ages of work but a contemporary map, so there is scope for keying data to historic maps, and those exist thanks to the NLS. As to the future… The not-yet-Edinburgh… Something I'd like to do… It is not uncommon that fictional places exist in real places – like 221B Baker Street or 44 Scotland Street – and I thought it would be fun to see the linguistic qualities associated with a fictional place, and compare them to real places with the same sort of profile. So, perhaps for futuristic places that would work – using the linguistic profile to do that.

Q3) I was going to ask about chronology – but you just answered that. So instead I will ask about crowd sourcing.

A3) Yes! As an editor I am most concerned about potential effort. For this scale and speed we had to let go of issues of mistakes; we know they are there… Places that move, some false positives, and some books that used Edinburgh place names but are set in other places (e.g. some Glasgow texts). At the moment we don't have a full report function or similar. We weren't able to build it to enable corrections in that sort of way. What we decided to do is make a feature of a bug – celebrating those as worm holes! But I would like to fine tune and correct, with user interactions as part of that.

Q4) Is the data set available?

A4) Yes, through an API created by EDINA. Open for out of copyright work.

Palimpsest seeks to find new ways to present and explore Edinburgh’s literary cityscape, through interfaces showcasing extracts from a wide range of celebrated and lesser known narrative texts set in the city. In this talk, James will set out some of the project’s challenges, and some of the possibilities for the use of cultural data that it has helped to unearth.

Geoparsing Jisc Historical Texts – Dr Claire Grover, Senior Research Fellow, School of Informatics, University of Edinburgh

I’ll be talking about a current project, a very rapid project to geoparse all of the Jisc Historical Texts. So I’ll talk about the Geoparser and then more about that project.

The Edinburgh Geoparser has been developed over a number of years in collaboration with EDINA. It has been deployed in various projects and places, mainly also in collaboration with EDINA. It has these main steps:

  • Use named entity recognition to identify place names in texts
  • Find matching records in a gazetteer
  • In cases of ambiguity (e.g. Paris, Springfield), resolve using contextual information from the document
  • Assign coordinates of preferred reading to the placename

So, you can use the Geoparser either via EDINA’s Unlock Text, or you can download it, or you can try a demonstrator online (links to follow).
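
To show the shape of those four steps (this is a toy sketch, not the Edinburgh Geoparser itself, and the tiny gazetteer is invented): recognise a place name, look up candidate records, resolve ambiguity using the other places mentioned in the document, and attach coordinates.

```python
# Toy sketch of the geoparsing steps: gazetteer lookup plus context-based
# disambiguation. The gazetteer entries here are invented for illustration.
GAZETTEER = {
    "paris": [("Paris, France", 48.86, 2.35), ("Paris, Texas", 33.66, -95.56)],
    "london": [("London, UK", 51.51, -0.13)],
    "springfield": [("Springfield, Illinois", 39.80, -89.64),
                    ("Springfield, Massachusetts", 42.10, -72.59)],
}

def resolve(place: str, context_coords: list) -> tuple:
    """Pick the gazetteer candidate closest to the other places in the text."""
    candidates = GAZETTEER[place.lower()]
    if not context_coords or len(candidates) == 1:
        return candidates[0]
    # crude proximity score: squared distance to the mean of the context coordinates
    mean_lat = sum(c[0] for c in context_coords) / len(context_coords)
    mean_lon = sum(c[1] for c in context_coords) / len(context_coords)
    return min(candidates,
               key=lambda c: (c[1] - mean_lat) ** 2 + (c[2] - mean_lon) ** 2)

# "Paris" in a document that also mentions London resolves to Paris, France.
print(resolve("Paris", [(51.51, -0.13)]))
```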

To give you an example, I have a news piece on the burial of Richard III. You can see the Geoparser looks for entities of all types – people as well as places – as that helps with disambiguation later on. Then, using that text, the parser ranks the likelihood of possible locations.

A quick word on gazetteers. The knowledge of possible interpretations comes from a gazetteer, which pairs place names with lat/long coordinates. So, if you know your data you can choose a gazetteer relevant to that (e.g. just the UK). The Edinburgh Geoparser provides a choice of gazetteers and can be configured to use others.

If a place is not in a gazetteer it cannot be grounded. If the correct interpretation of a place name is not in the gazetteer, it cannot be grounded correctly. Modern gazetteers are not ideal for historical documents, so historical gazetteers need to be used or developed. So for instance the DEEP (Directory of English Place Names) or PELAGIOS (ancient world) gazetteers have been useful in our current work.

The current Jisc Historical Texts (http://historicaltexts.jisc.ac.uk/) project has been working with EEBO and ECCO texts as well as the BL Nineteenth Century collections. These are large and highly varied data sets. So, for instance, yesterday I took a random sample of writers and texts… The collection is so large we’ve only seen a tiny portion of it. We can process it but we can’t look at it all.

So, what is involved in us georeferencing this text? Well, we have to get all the data through the Edinburgh Geoparser pipeline. That requires adapting the geoparser pipeline so that place name recognition works as accurately as possible on historical text. And we need to make the georeferencing strategy more detailed.

Adapting our place name recognition relies a lot on lexicons. The standard Edinburgh Geoparser has three lexicons derived from the Alexandria Gazetteer (global, very detailed), Ordnance Survey (Great Britain, quite detailed), and DEEP. We’ve also added more lexicons from more gazetteers… including larger place names in GeoNames (population over 10,000), populated places from Natural Earth, and only the larger places from DEEP. We then score recognised place names based on how many and which lexicons they occur in. Low-scored placenames are removed – we reckon people’s tolerance for missing a place is higher than their tolerance for false positives.
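As a rough illustration of that lexicon-based filtering, the sketch below scores each recognised name by which lexicons it appears in and drops the low scorers. The lexicon contents, weights and threshold here are invented for the example, not the project’s actual configuration.

```python
# Toy illustration of scoring recognised place names by lexicon membership.
# The lexicon contents, weights and threshold are assumptions for this sketch.

LEXICONS = {
    "geonames_large": {"London", "Paris", "Edinburgh", "Lyon"},   # population > 10,000
    "natural_earth": {"London", "Paris", "Edinburgh"},            # populated places
    "deep_large":    {"London", "Edinburgh", "Grasse"},           # larger DEEP places
}

# Some lexicons are treated as stronger evidence than others (invented weights).
WEIGHTS = {"geonames_large": 2, "natural_earth": 2, "deep_large": 1}
THRESHOLD = 2  # names scoring below this are discarded as likely false positives

def score_name(name):
    return sum(WEIGHTS[lex] for lex, names in LEXICONS.items() if name in names)

def filter_candidates(recognised):
    kept = [(name, score_name(name)) for name in recognised]
    # Favour precision: better to miss a place than to keep a false positive.
    return [name for name, score in kept if score >= THRESHOLD]

print(filter_candidates(["London", "Grasse", "Sunne", "Lyon"]))
# -> ['London', 'Lyon']  ('Grasse' scores 1, 'Sunne' scores 0)
```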

Working with old texts also means huge variation in spellings… Variant spellings produce a lot of false negatives – placenames we miss (e.g. Maldauia, Demnarke, Saxonie, Spayne). They also result in false positives (Grasse, Hamme, Lyon, Penne, Sunne, Haue, Ayr). So we have tried to remove the false positives, to remove bad placenames.

When it comes to actually georeferencing these places we need coordinates for place names from gazetteers. We used three gazetteers in succession: Pleiades++, GeoNames and then DEEP. In addition to using those gazetteers we can weight the results based on location in the world, using a bounding box: we prefer locations in the UK and Europe, then those in the East, not extending as far to the West, and excluding Australia and New Zealand (unknown at that time).
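A hedged sketch of that two-part strategy – try the gazetteers in succession, then prefer candidates that fall inside a rough bounding box of the “known world” – might look like this. The stub lookups and box coordinates are placeholders, not the project’s real setup.

```python
# Sketch of gazetteer fallback plus bounding-box weighting (all values illustrative).

# Stub lookups standing in for Pleiades++, GeoNames and DEEP queries.
def lookup_pleiades(name): return []          # placeholder: ancient-world gazetteer
def lookup_geonames(name):                    # placeholder: modern gazetteer
    return [{"name": name, "lat": 55.95, "lon": -3.19}] if name == "Edinburgh" else []
def lookup_deep(name): return []              # placeholder: historical English places

GAZETTEERS = [lookup_pleiades, lookup_geonames, lookup_deep]

# Rough "world known to these writers" box: invented numbers for the sketch.
PREFERRED_BOX = {"min_lat": 10, "max_lat": 72, "min_lon": -30, "max_lon": 60}

def in_preferred_box(candidate):
    return (PREFERRED_BOX["min_lat"] <= candidate["lat"] <= PREFERRED_BOX["max_lat"]
            and PREFERRED_BOX["min_lon"] <= candidate["lon"] <= PREFERRED_BOX["max_lon"])

def georeference(name):
    for gazetteer in GAZETTEERS:              # try each gazetteer in succession
        candidates = gazetteer(name)
        if candidates:
            # Prefer candidates inside the bounding box over those outside it.
            return max(candidates, key=in_preferred_box)
    return None                               # cannot be grounded

print(georeference("Edinburgh"))
```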

So looking at EEBO and ECCO we can see some frequent place names from each gazetteer – which shows how different they are. In terms of how many terms we have found, there are over 3 million locations in EEBO, and over 250k in ECCO (a much smaller collection). The early EEBO collections have a lot of locations in Israel, Italy and France. The early books are more concerned with the ancient world and Biblical texts, so these statistics suggest that we are doing the right thing here.

These are really old texts, we have huge volumes of them, and there is huge variety in the data, and that all makes this a hard task. We still don’t know how the work will be received, but we think Jisc will put this work in a sandbox area and we should get some feedback on it.

Find out more:

  • http://historicaltexts.jisc.ac.uk/
  • https://www.ltg.ed.ac.uk/software/geoparser
  • http://edina.ac.uk/unlock/
  • http://placenames.org.uk/
  • https://googleancientplaces.wordpress.com/

Q&A

Q1) What about historical Gaelic place names?

A1) I’m not sure these texts have those. But we did apply a language tag at paragraph level. These are supposed to be English texts but there is lots of Latin, Welsh, Spanish, French and German. We only georeferenced text thought to be English. If Gaelic names do appear and they are in the Ordnance Survey gazetteer, they may have been picked up…

Claire will talk about work the Edinburgh Language Technology Group have been doing for Jisc on geoparsing historical texts such as the British Library’s Nineteenth Century Books and Early English Books Online Text Creation Partnership which is creating standardized, accurate XML/SGML encoded electronic text editions of early print books.

Pitches – Mahendra and co

Can the people who pitched to me earlier share their ideas with the room…

Lorna: I’m interested in open education and I’d love to get some of the BL content out there. I’ve been working on the new HECoS coding schema for different subjects. And I thought that it would be great to classify the BL content with HECoS.

Karen: I’ve been looking at copyright music collections at St Andrews. There are gaps in legal deposit music from the late 18th and 19th century, as we know publishers deposited less in the Scottish libraries than with the BL. So we could compare and see what reached the outer reaches of the UK.

Nina: My idea was a digital Pilgrim’s Progress where you can take a virtual tour of the journey with all sorts of resources… to see why some places are most popular in texts etc.

David: I think my idea has been done… It was going to be iPython – Katrina is already doing this! But to make it more unique… It’s quite hard work for Ben to support scholars in that way, so I think researchers should be encouraged to approach Ben etc., but also to get non-programmers to craft complex queries, make the good ones reusable by others, and have the reused ones marked up as being of particular quality. And to make it more fun… you could have a sort of treasure hunt jam, with people using that facility to have a treasure hunt on a theme, share interesting information, have researchers see tweets or shared things… A group treasure hunt to encourage people by helping them share queries…

Mahendra: So we are supposed to decide the winners now… But I think we’ll get all our pitchers to share the bag – all great ideas… The idea was to start conversations. You should all have an email from me so, if you have found this inspiring or interesting, we’ll continue that conversation.

And with that we are done! Thanks to all for a really excellent session!

Feb 25 2016
 
Today we have our second eLearning@ed/LTW Showcase and Network event. I’m liveblogging so, as usual, corrections and updates are welcome. 
Jo Spiller is welcoming us along and introducing our first speaker…
Dr. Chris Harlow – “Using WordPress and Wikipedia in Undergraduate Medical & Honours Teaching: Creating outward facing OERs”
I’m just going to briefly tell you about some novel ways of teaching medical students and undergraduate biomedical students using WordPress and platforms like Wikipedia. So I will be talking about our use of WordPress websites in the MBChB curriculum. Then I’ll tell you about how we’ve used the same model in Reproductive Biology Honours. And then how we are using Wikipedia in Reproductive Biology courses.
We use WordPress websites in the MBChB curriculum during Year 2 student selected components. Students work in groups of 6 to 9 with a facilitator. They work with a provided WordPress template – the idea being that the focus is on the content rather than the look and feel. In the first semester the topics are chosen by the group’s facilitator. In semester two the topics and facilitators are selected by the students.
So, looking at example websites you can see that the students have created rich websites, with content, appendices. It’s all produced online, marked online and assessed online. And once that has happened the sites are made available on the web as open educational resources that anyone can explore and use here: http://studentblogs.med.ed.ac.uk/
The students don’t have any problem at all building these websites and they create these wonderful resources that others can use.
In terms of assessing these resources there is a 50% group mark on the website by an independent marker, a 25% group mark on the website from a facilitator, and (at the students’ request) a 25% individual mark on student performance and contribution which is also given by the facilitator.
In terms of how we have used this model with Reproductive Biology Honours it is a similar idea. We have 4-6 students per group. This work counts for 30% of their Semester 1 course “Reproductive Systems” marks, and assessment is along the same lines as the MBChB. Again, we can view examples here (e.g. “The Quest for Artificial Gametes”). Worth noting that there is a maximum word count of 6000 words (excluding Appendices).
So, now onto the Wikipedia idea. This was something which Mark Wetton encouraged me to do. Students are often told not to use or rely on Wikipedia but, speaking as a biomedical scientist, I use it all the time. You have to use it judiciously but it can be an invaluable tool for engaging with unfamiliar terminology or concepts.
The context for the Wikipedia work is that we have 29 Reproductive Biology Honours students (50% Biomedical Sciences, 50% intercalating medics), split into groups of 4-5 students. We did this in Semester 1, week 1, as part of the core “Research Skills in Reproductive Biology” course. And we benefited from expert staff including two Wikipedians in Residence (at different Scottish organisations), a librarian, and a learning, teaching and web colleague.
So the students had an introduction to Wikipedia, then some literature searching examples. We went on to group work sessions to find papers on particular topics, looking for differences in definitions, spellings and terminology. We discussed findings. This led on to group work where each group defined their own aspect to research. And from there they looked to create Wikipedia edits/pages.
The groups really valued trying out different library resources and search engines, and seeing the varying content that was returned by them.
The students then, in the following week, developed their Wikipedia editing skills so that they could combine their work into a new page for Neuroangiogenesis. Getting that online in an afternoon was incredibly exciting. And actually that page was high in the search rankings immediately. Looking at the traffic statistics that page seemed to be getting 3 hits per day – a lot more reads than the papers I’ve published!
So, we will run the exercise again with our new students. I’ve already identified some terms which are not already out there on Wikipedia. This time we’ll be looking to add to or improve High-grade Serous Carcinoma, and Fetal Programming. But we have further terms that need more work.
Q&A
Q1) Did anyone edit the page after the students were finished?
A1) A number of small corrections and one querying of whether a PhD thesis was a suitable reference – whether a primary or secondary reference. What needs done more than anything else is building more links into that page from other pages.
Q2) With the WordPress blogs you presumably want some QA as these are becoming OERs. What would happen if a project got, say, a low C?
A2) Happily that hasn’t happened yet. That would be down to the tutor I think… But I think people would be quite forgiving of undergraduate work, which is clearly what it is presented as.
Q3) Did you consider peer marking?
A3) An interesting question. Students are concerned that there are peers in their groups who do not contribute equally, or who let their peers carry them.
Comment) There is a tool called PeerAim where peer input weights the marks of students.
Q4) Do all of those blog projects have the same model? I’m sure I saw something on peer marking?
A4) There is peer feedback but not peer marking at present.
Dr. Anouk Lang – “Structuring Data in the Humanities Classroom: Mapping literary texts using open geodata”
I am a digital humanities scholar in the school of Languages and Linguistics. One of the courses I teach is digital humanities for literature, which is a lovely class and I’m going to talk about projects in that course.
The first MSc project the students looked at was to explore Robert Louis Stevenson’s The Dynamiter. Although we were mapping the text, the key aim was to understand who wrote which part of it.
So the reason we use mapping in this course is that these are brilliant analytical students but they are not used to working with structured data, and this is an opportunity to do that. So, using CartoDB – a brilliant tool that will draw data from Google Sheets – they needed to identify locations in the text, but I also asked students to give the texts an “emotion rating”. That is a rating of intensity of emotion based on the work of Ian Gregory – a spatial historian who has worked on the emotional intensity of Lake District texts.
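To give a flavour of the kind of structured data involved, here is a hypothetical sketch of the sort of rows students might build by hand – place name, coordinates and an emotion rating – written out as a CSV of the kind a Google Sheet feeding CartoDB could hold. The column names and values are invented for illustration.

```python
# Hypothetical example of the kind of structured data built by hand:
# one row per place mention, with coordinates and an 'emotion rating'.
# Column names and values are invented for illustration.
import csv

rows = [
    {"chapter": 1, "place": "Leicester Square", "lat": 51.5103, "lon": -0.1301, "emotion_rating": 2},
    {"chapter": 1, "place": "Charing Cross",    "lat": 51.5074, "lon": -0.1278, "emotion_rating": 4},
    {"chapter": 3, "place": "Utah",             "lat": 39.3200, "lon": -111.0937, "emotion_rating": 5},
]

with open("dynamiter_places.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["chapter", "place", "lat", "lon", "emotion_rating"])
    writer.writeheader()
    writer.writerows(rows)
```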
So, the students build this database by hand. And then, loaded into CartoDB, you get all sorts of nice ways to visualise the data. So, looking at a map of London you can see where the story occurs. The Dynamiter is a very weird text, with a central story in London but side stories about the planting of bombs, which is kind of played as comedy. The view I’m showing here is a heatmap, so you can see the geographical scope of the text. Robert Louis Stevenson was British, but his wife was American, and you see that this book brings in American references, including unexpected places like Utah.
So, within CartoDB you can try different ways to display your data. You can view a “Torque Map” that shows chronology of mentions – for this text, which is a short story, that isn’t the most helpful perhaps.
Now we do get issues of anachronism. OpenStreetMap – on which CartoDB is based – is a contemporary map, and the geography and locations on the map change over time. And so another open data source was hugely useful in this project. Over at the National Library of Scotland there is a wonderful maps librarian called Chris Fleet who has made huge numbers of historical maps available, not only as scanned images but as map tiles through a historical open maps API, so you can zoom into detailed historical maps. That means that when mapping a text from, say, the late 19th century, it’s incredibly useful to be able to view a contemporaneous map alongside the text.
You can view the Robert Louis Stevenson map here: http://edin.ac/20ooW0s.
So, moving to this year’s project… We have been looking at Jean Rhys. Rhys was a white Creole born in Dominica who lived mainly in Europe. She is a really located author, with place important to her work. For this project, rather than hand coding texts, I used the wonderful, wonderful Edinburgh Geoparser (https://www.ltg.ed.ac.uk/software/geoparser/) – a tool I recommend, and a new version is imminent from Claire Grover and colleagues in LTG, Informatics.
So, the Geoparser goes through the text and picks out strings that look like places, then tells you which it thinks is the most likely location for that place – based on aspects like nearby words in the text etc. That produces XML, and Claire has created an XSLT stylesheet for me, so all the students have had to do is manually clean up that data. The Geoparser gives you a GeoNames reference that enables you to check latitude and longitude. Now this sort of data cleaning, the concept of gazetteers – these are bread and butter tools of the digital humanities. These are tools which are very unfamiliar to many of us working in the humanities. This is open, shared, and the opposite of the scholar working away secretly in the library.
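For what that clean-up step might look like in practice, here is a hedged sketch that pulls place names and coordinates out of geoparser-style XML into a CSV for manual checking. The file, element and attribute names are assumptions for illustration and should be checked against the actual output of the version you run.

```python
# Sketch of the kind of clean-up step described above: pull place names and
# coordinates out of geoparser XML into rows a student can check by hand.
# The element and attribute names here are assumptions for illustration --
# check them against the actual output of the version you are running.
import xml.etree.ElementTree as ET
import csv

tree = ET.parse("rhys_output.xml")            # hypothetical geoparser output file
rows = []
for ent in tree.iter("ent"):                  # assumed element name for an entity
    if ent.get("type") == "location":
        rows.append({
            "name": "".join(ent.itertext()).strip(),
            "gazref": ent.get("gazref"),      # assumed gazetteer reference attribute
            "lat": ent.get("lat"),
            "lon": ent.get("long"),
        })

with open("rhys_places_to_check.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "gazref", "lat", "lon"])
    writer.writeheader()
    writer.writerows(rows)
```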
We do websites in class to benefit from that publicness – and the meaning of public scholarship. When students are doing work in public they really rise to the challenge. They know it will connect to their real world identities. I insist students show their name, their information, their image because this is part of their digital scholarly identities. I want people who Google them to find this lovely site with its scholarship.
So, for our Jean Rhys work I will show you a mock-up preview of our data. One of the great things about visualising your data in these ways is that you can spot errors in your data. So, for instance, checking a point in Canada we see that the Geoparser has picked Halifax, Nova Scotia when the text indicates Halifax in England. When I raised this issue in class today the student got a wee bit embarrassed and made immediate changes… which again is a kind of perk of working in public.
Next week my students will be trying out QGIS with Tom Armitage of EDINA; that’s a full-on GIS system, so that will be really exciting.
For me there are real pedagogical benefits to these tools. Students have to really think hard about structuring their data, which is really important. As humanists we have to put the data in our work into computational form. Taking this kind of class means they are more questioning of data, of what it means, of what accuracy is. They are critically engaged with data and they are prepared to collaborate in a gentle kind of way. They also get to think about place in a literary sense, in a way they haven’t before.
We like to think that we have it all figured out in terms of understanding place in literature. But when you put a text into a spreadsheet you really have to understand what is being said about place in a whole different way than in a close reading. So, if you take a sentence like: “He found them a hotel in Rue Lamartine, near Gare du Nord, in Montmartre” – is that one location or three? The Edinburgh Geoparser maps two points but not Rue Lamartine… so you have to use Google Maps for that… And is that accurate? And you have to discuss whether those two map points are distorting. The discussion there is richer than any other discussion you would have around close reading. We are so confident about close readings… We assume it as a research method… This is a different way to close read – to shoehorn the text into a different structure.
So, I really like Michel de Certeau’s “Spatial stories” in The Practice of Everyday Life (De Certeau 1984), where he talks about structured space and the ambiguous realities of use and engagement in that space. And that’s what that Rue Lamartine type example is all about.
Q&A
Q1) What about looking at distance between points – how the length of discussion varies in comparison to real distance?
A1) That’s an interesting thing. And that CartoDB Torque display is crude but exciting to me – a great way to explore that sort of question.
OER as Assessment – Stuart Nicol, LTW
I’m going to be talking about OER as assessment from a student’s perspective. I study part time on the MSc in Digital Education and a few years ago I took a module called Digital Futures for Learning, a course co-created by participants where assessment is built around developing an Open Educational Resource. The purpose is to “facilitate learning for the whole group”. This requires a pedagogical approach (to running the module) which is quite structured in order to enable that flexibility.
So, for this course, the assessment structure is 30% for a position paper (the basis of content for the OER), then 40% of the mark for the OER (30% peer-assessed and tutor moderated / 10% self-assessed), and then the final 30% of the marks come from an analysis paper that reflects on the peer assessment. You could then resubmit the OER along with that paper reflecting on the process.
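As a small worked example of that weighting (the component marks below are invented; only the weights come from the course structure described above):

```python
# Worked example of the assessment weighting described above (marks out of 100).
# The component marks are invented; the weights come from the course structure.
def final_mark(position_paper, oer_peer, oer_self, analysis_paper):
    return (0.30 * position_paper      # position paper
            + 0.30 * oer_peer          # OER, peer-assessed and tutor moderated
            + 0.10 * oer_self          # OER, self-assessed
            + 0.30 * analysis_paper)   # analysis paper reflecting on peer assessment

print(final_mark(position_paper=65, oer_peer=70, oer_self=72, analysis_paper=68))  # roughly 68.1
```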
I took this module a few years ago, before the University’s adoption of an open educational resource policy, but I was really interested in this. So I ended up building a course on Open Accreditation, and Open Badges, using weebly: http://openaccreditation.weebly.com/.
This was really useful as a route to learn about Open Educational Resources generally but that artefact has also become part of my professional portfolio now. It’s a really different type of assignment and experience. And, looking at my stats from this site I can see it is still in use, still getting hits. And Hamish (Macleod) points to that course in his Game Based Learning module now. My contact information is on that site and I get tweets and feedback about the resource which is great. It is such a different experience to the traditional essay type idea. And, as a learning technologist, this was quite an authentic experience. The course structure and process felt like professional practice.
This type of process, and use of open assessment, is in use elsewhere. In Geosciences there are undergraduate students working with local schools and preparing open educational resources around that. There are other courses too. We support that with advice on copyright and licensing. There are also real opportunities for this in the SLICCs (Student Led Individually Created Courses). If you are considering going down this route then there is support at the University from the IS OER Service – we have a workshop at KB on 3rd March. We also have the new Open.Ed website, about Open Educational Resources which has information on workshops, guidance, and showcases of University work as well as blogs from practitioners. And we now have an approved OER policy for learning and teaching.
The new OER policy also relates to assessment: it makes clear that OERs are created by both staff and students.
And finally, fresh from the ILW Editathon this week, we have Ewan MacAndrew, our new Wikimedian in residence, who will introduce us to Histropedia (Interactive timelines for Wikipedia: http://histropedia.com) and run through a practical introduction to Wikipedia editing.
Wikimedian in Residence – University of Edinburgh – Ewan MacAndrew
Ewan is starting by introducing us to “Listen to Wikipedia“, which turns live edits on Wikipedia right now into melodic music. The site colour codes edits from logged-in users, anonymous users, and clean-up bots.
My new role, as Wikimedian in Residence, comes about from a collaboration between the University of Edinburgh and the Wikimedia Foundation. My role fits into their complementary missions, which centre on the broad vision of imagining a world where all knowledge is openly available. My role is to enhance teaching and the curriculum, but also to help highlight the rich heritage and culture around the university beyond that, and to help raise awareness of its commitment to open knowledge. But this isn’t a new collaboration – it builds on an ongoing relationship of events and activities.
It’s also important to note that I am a Wikimedian in Residence, rather than a Wikipedian in Residence. Wikimedia is the charitable foundation behind Wikipedia, but it has a huge family of projects including Wikibooks, MediaWiki, Wikispecies, etc. That includes Wikidata, the database of all knowledge that both humans and machines can read, which is completely language independent – the model Wikipedia is trying to work towards.
So, what is Wikipedia and how does it work? Well we have over 5 million articles, 38 million pages, over 800 million edits, and over 130k active users.
There has been past work by the University with Wikimedia. There was the Women, Science and Scottish editathon for ILW 2015, Chris Harlow already spoke about his work, there was an Ada Lovelace editathon from October 2015, Gavin Willshaw was part of #1Lib1Ref day for Wikipedia’s 15th Birthday in January 2016. Then last week we had the History of Medicine editathon for ILW 2016 which generated 4 new articles, improved 56 articles, uploaded over 500 images to Wikicommons. Those images, for instance, have huge impact as they are of University buildings and articles with images are far more likely to be clicked on and explored.
You can explore that recent editathon in a Storify I made of our work…


View the story “University of Edinburgh Innovative Learning Week 2016 – History of Medicine Wikipedia editathon” on Storify

We are now looking at new and upcoming events, our next editathon is for International Women’s Day. In terms of ideas for events we are considering:

  • Edinburgh Gothic week – cross curricular event with art, literature, film, architecture, history, music and crime
  • Robert Louis Stevenson Day
  • Scottish Enlightenment
  • Scottish photographers and Image-a-thons
  • Day of the Dead
  • Scotland in WWI Editathon – zeppelin raids, Craiglockhart, etc.
  • Translationathons…

Really open to any ideas here. Do take a look at reports and updates on the University of Edinburgh Wikimedian in Residence activities here: https://en.wikipedia.org/wiki/Wikipedia:University_of_Edinburgh

So, I’m going to now quickly run through the five pillars of Wikipedia, which are:

  1. An encyclopedia – not a gossip column, or blog, etc. So we take an academic, rigorous approach to the articles we are putting in.
  2. Neutral point of view – trying to avoid “peacock terms”. Only saying things that are certain, backed up by reliable published sources.
  3. Free content that anyone can use, edit and distribute.
  4. Respect and civility – when I run sessions I ask people to note that they are new users so that others in the community treat you with kindness and respect.
  5. No firm rules – for every firm rule there has to be flexibility to work with subjects that may be tricky, might not quite work. If you can argue the case, and that is accepted, there is the freedom to accept exceptions.

People can get bogged down in the detail of Wikipedia. Really the only rule is to “Be bold not reckless!“.

When we talk about what a reliable source is on Wikipedia: Wikipedia is based on reliable published sources with a reputation for fact-checking and accuracy. Academic and peer-reviewed scholarly material is often used (bearing in mind the “no original research” rule). High quality mainstream publications too. Blogs are not generally seen as reliable, but sites like the BBC and CNN are. And you need several independent sources for a new article – generally we look for 250 words and 3 reliable sources for a new Wikipedia article.

Ewan is now giving us a quick tour through enabling the new (fantastic!) visual editor, which you can do by editing your settings as a registered user. He’s also encouraging us to edit our own profile page (you can say hello to Ewan via his page here), formatting and linking our profiles to make them more relevant and useful. Ewan is also showing how to use Wikimedia Commons images in profiles and pages. 

So, before I finish I wanted to show you Histropedia, which allows you to create timelines from Wikipedia categories.

Ewan is now demonstrating how to create timelines, to edit them, to make changes. And showing how the timelines understand “important articles” – which is based on high visibility through linking to other pages. 

If you create a timeline you can save it either as a personal timeline, or as a public timeline for others to explore. The other thing to be aware of is that Wikidata can be queried for more specialised topics – for instance looking at descendants of Robert the Bruce, or even something as specific as female descendants of Robert the Bruce born in Denmark. That just uses Robert the Bruce and a Wikidata term called “child of”, and from those two fields you can build a very specific timeline. Histropedia uses both categories and Wikidata terms… so here it is using both of those.
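As a rough illustration of the kind of Wikidata query that sits behind such a result, here is a hedged Python sketch against the public Wikidata SPARQL endpoint. This is not Histropedia’s own code: the Q-number for Robert the Bruce is left as a placeholder, and the property/item IDs used (P40 child, P21 sex or gender, P19 place of birth, P131 located in, Q6581072 female, Q35 Denmark) should be checked against Wikidata before relying on them.

```python
# Hedged sketch of a Wikidata query for "female descendants of Robert the Bruce
# born in Denmark". QROBERT is a placeholder -- look up the real Q-number on
# wikidata.org. All property/item IDs should be verified before use.
import requests

QROBERT = "Q_PLACEHOLDER"  # hypothetical: replace with Robert the Bruce's Q-number

query = f"""
SELECT ?person ?personLabel WHERE {{
  wd:{QROBERT} wdt:P40+ ?person .        # any descendant via one or more 'child' links
  ?person wdt:P21 wd:Q6581072 .          # sex or gender: female
  ?person wdt:P19 ?birthplace .
  ?birthplace wdt:P131* wd:Q35 .         # place of birth in (or within) Denmark
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "dh-timeline-sketch/0.1"},
)
for result in response.json()["results"]["bindings"]:
    print(result["personLabel"]["value"])
```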

Q&A

Q1) Does Wikidata draw on structured text in articles?

A1) It’s based on “an instance of”… “place of education” or “created on” etc. That’s one of the limitations of Histropedia right now… It can’t differentiate between birth and death dates and dates of reign, so it is limited to birth and death dates, foundation dates etc.

Q2) How is Wikipedia “language independent”?

A2) Wikipedia is language dependent; Wikidata is language independent. So, no matter what tool draws on Wikidata, it functions in every single language. Wikipedia doesn’t work that way – we have to transfer or translate texts between different language versions of Wikipedia. Wikidata uses a Q code that is neutral across all languages, which gets round that issue of language.

Q3) Are you holding any introductory events?

A3) Yes, trying to find best ways to do that. There are articles from last week’s editathon which we could work on.

And with that we are done – and off to support our colleague Jeremy Knox’s launch of his new book: Posthumanism and the Massive Open Online Course: Contaminating the Subject of Global Education.

Thanks to all our fantastic speakers, and our lovely organisers for this month’s event: Stuart Nicol and Jo Spiller.