Aug 03 2017

Today I am at Repository Fringe which runs today and tomorrow in Edinburgh and is celebrating 10 years of Repofringe! I’m just here today – presenting a 10×10 on our recent Reference Rot in Theses: A HiberActive Pilot project work – and will be blogging whilst I’m here. As usual, as this is live, may include the odd typo or error so all comments, corrections, questions, additions, etc. are very much welcomed!

Welcome – Janet Roberts, Director of EDINA

My colleagues were explaining to me that this event came from an idea from Les Carr that there should be not just one repository conference, but also a fringe – and here we are at the 10th Repository Fringe, on the cusp of the Edinburgh Fringe.

So, this week we celebrate ten years of repository fringe, and the progress we have made over the last 10 years to share content beyond borders. It is a space for debating future trends and challenges.

At EDINA we established the OpenDepot to provide a space for those without an institutional repository… That has now migrated to Zenodo… and the challenges are changing, around the size of data, how we store and access that data, and what those next generation repositories will look like.

Over the next few days we have some excellent speakers as well as some fringe events, including the Wiki Datathon – so I hope you have all brought your laptops!

Thank you to our organising team from EDINA, DCC and the University of Edinburgh. Thank you also to our sponsors: Atmire; FigShare; Arkivum; ePrints; and Jisc!

Opening Keynote – Kathleen Shearer, Executive Director of COAR: Raising our game – repositioning repositories as the foundation for sustainable scholarly communication

Theo Andrew: I am delighted to introduce Kathleen, who has been working in digital libraries and repositories for years. COAR is an international organisation of repositories, and I’m pleased to say that Edinburgh has been a member for some time.

Kathleen: Thank you so much for inviting me. It’s actually my first time speaking in the UK and it’s a little bit intimidating as I know that you folks are really ahead here.

COAR is now about 120 members. Our activities fall into four areas: presenting an international voice so that repositories are part of a global community with diverse perspectives. We are being more active in training for repository managers, something which is especially important in developing countries. And the other area is value added services, which is where today’s talk on the repository of the future comes in. The vision here is about…

But first, a rant… The international publishing system is broken! And it is broken for a number of reasons – there is access, and the cost of access. The cost of scholarly journals goes up far beyond the rate of inflation. That touches us in Canada – where I am based, in Germany, in the UK… But much more so in the developing world. And then we have the “Big Deal”. A study of University of Montreal libraries by Stephanie Gagnon found that of 50k subscribed-to journals, really there were only 5,893 unique essential titles. But often those deals aren’t opted out of, as the key core journals purchased separately would cost the same as the big deal.

We also have a participation problem… Juan Pablo Alperin’s map of authors published in Web of Science shows a huge bias towards the US and the UK, a seriously reduced participation in Africa and parts of Asia. Why does that happen? The journals are operated from the global North, and don’t represent the kinds of research problems in the developing world. And one Nobel Prize winner notes that the pressure to publish in “luxury” journals encourages researchers to cut corners and pursue trendy fields rather than areas where there are those research gaps. That was the case with Zika virus – you could hardly get research published on that until a major outbreak brought it to the attention of the dominant publishing cultures, then there was huge appetite to publish there.

Timothy Gowers talks about “perverse incentives” which are supporting the really high costs of journals. It’s not just a problem for researchers and how they publish, it’s also a problem of how we incentivise researchers to publish. So, this is my goats in trees slide… It doesn’t feel like goats should be in trees… Moroccan tree goats are taught to climb the trees when there isn’t food on the ground… I think of the researchers able to publish in these high end journals as being the lucky goats in the tree here…

In order to incentivise participation in high end journals we have created a lucrative publishing industry. I’m sure you’ve seen the recent Guardian article: “Is the staggeringly profitable business of scientific publishing bad for science?”. Yes. For those reasons of access and participation. We see very few publishers publishing the majority of titles, and there is a real concentration in the market…

My colleague Leslie Chan, funded by the International Development Council, talked about openness not just being about gaining access to knowledge but also about having access to participate in the system.

On the positive side… Open access has arrived. A recent study (Piwowar et al 2017) found that about 45% of articles published in 2015 were open access. And that is increasing every year. And you have probably seen the May 27th 2016 statement from the EU that all research they fund must be open by 2020.

It hasn’t been a totally smooth transition… APCs (Article Processing Charges) are very much in the mix and part of the picture… Some publishers are trying to slow the growth of access, but they can see that it’s coming and want to retain their profit margins. And they want to move to all APCs. There is discussion here… There is a project called OA2020 which wants to flip from subscription based to open access publishing. It has some traction but there are concerns here, particularly about sustainability of scholarly comms in the long term. And we are not sure that publishers will go for it… Particularly one of them (Elsevier) which exited talks in The Netherlands and Germany. In Germany the tap was turned off for a while for Elsevier – and there wasn’t a big uproar from the community! But the tap has been turned back on…

So, what will the future be around open access? If you look across APCs and their average value… If you think about the relative value of journals, especially the value of high end journals… I don’t think we’ll see APC increases slowing in the future.

At COAR we have a different vision…

Lorcan Dempsey talked about the idea of the “inside out” library. Similarly a new MIT Future of Libraries Report – published by a broad stakeholder group that had spent 6 months working on a vision – came up with the need for libraries to be an open, trusted, durable, interdisciplinary, interoperable content platform. So, like the inside out library, it’s about collecting the output of your organisation and making it available to the world…

So, for me, if we collect articles… We just perpetuate the system and we are not in a position to change the system. So how do we move forward at the same time as being kind of reliant on that system?

Eloy Rodrigues, at Open Repository earlier this year, asked whether repositories are a success story. They are ubiquitous, they are adopted and networked… But then they are also using old, pre-web technologies; mostly passive recipients; limited interoperability making value added systems hard; and not really embedded in researcher workflows. These are the kinds of challenges we need to address in next generation of repositories…

So we started a working group on Next Generation Repositories to define new technologies for repositories. We want to position repositories as the foundation for a distributed, globally networked infrastructure for scholarly communication. And on top of which we want to be able to add layers of value added services. Our principles include distributed control to guard against failure, change, etc. We want this to be inclusive, and reflecting the needs of the research communities in the global south. We want intelligent openness – we know not everything can be open.

We also have some design assumptions, with a focus on the resources themselves, not just associated metadata. We want to be pragmatic, and make use of technologies we have…

To date we have identified major use cases and user stories, and shared those. We determined functionality and behaviours; and a conceptual model. At the moment we are defining specific technologies and architectures. We will publish recommendations in September 2017. We then need to promote it widely and encourage adoption and implementation, as well as the upgrade of repositories around the world (a big challenge).

You can view our user stories online. But I’d like to talk about a few of these… We would like to enable peer review on top of repositories… To slowly, incrementally replace what researchers do. That’s not building peer review in repositories, but as a layer on top. We also want some social functionalities like recommendations. And we’d like standard usage metrics across the world to understand what is used and how… We are looking to the UK and the IRUS project there as that has already been looked at here. We also need to address discovery… Right now we use metadata, rather than indexing full text content… So content can be hard to get to unless the metadata is obvious. We also need data syncing so that hubs, indexing systems, etc. reflect changes in the repositories. And we also want to address preservation – that’s a really important role that we should do well, and it’s something that can set us apart from the publishers – preservation is not part of their business model.

So, this is a slide from Peter Knoth at CORE – a repository aggregator – who talks about expanding the repository, and the potential to layer all of these additional services on top.

To make this happen we need to improve the functionality of repositories: to be of and not just on the web. But we also need to step out of the article paradigm… The whole system is set up around the article, but we need to think beyond that, deposit other content, and ensure those research outputs are appropriately recognised.

So, we have our (draft) conceptual model… It isn’t around siloed individual repositories, but around a whole network. And we have some draft recommendations for technologies for next generation repositories. These are a really early view… These are things like: ResourceSync; Signposting; messaging protocols; message queues; the IIIF Presentation API; OAuth; Webmention; and more…
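For readers unfamiliar with one of those draft technologies: Signposting is essentially typed HTTP Link headers that let machines follow relations such as “this is the citable identifier” or “this is the content file”. A minimal sketch of parsing such a header in Python – the header value and URLs below are illustrative examples, not from a real repository, and the parser is deliberately naive:

```python
# Minimal sketch: turning a Signposting-style HTTP Link header into
# (target URL, rel) pairs. Illustrative only; a production parser should
# handle quoted commas and multiple rel values per link.

def parse_link_header(value):
    """Split an HTTP Link header into a list of (target URL, rel) pairs."""
    links = []
    for part in value.split(","):
        segments = [s.strip() for s in part.split(";")]
        url = segments[0].strip("<>")
        rel = None
        for seg in segments[1:]:
            if seg.startswith("rel="):
                rel = seg[len("rel="):].strip('"')
        links.append((url, rel))
    return links

# Hypothetical header a repository landing page might emit:
header = ('<https://doi.org/10.5281/zenodo.000000>; rel="cite-as", '
          '<https://example.org/record/1/article.pdf>; rel="item"')

for url, rel in parse_link_header(header):
    print(rel, "->", url)
```

The point of the pattern is that an aggregator or browser plugin can follow `rel="item"` to the full text without scraping HTML.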

Critical to the widespread adoption of this process is the widespread adoption of the behaviours and functionalities for next generation repositories. It won’t be a success if only one software or approach takes these on. So I’d like to quote a Scottish industrialist, Andrew Carnegie: “strength is derived from unity…. “. So we need to coalesce around a common vision.

And it isn’t just about a common vision, science is global and networked and our approach has to reflect and connect with that. Repositories need to balance a dual mission to (1) showcase and provide access to institutional research and (2) be nodes in a global research network.

To support better networking in repositories, in May in Venice we signed an International Accord for Repository Networks, with networks from Australasia, Canada, China, Europe, Japan, Latin America, South Africa, and the United States. For us there is a question about how best we work with the UK internationally. We work with OpenAIRE but maybe we need something else as well. The networks across those areas are advancing at different paces, but have committed to move forward.

There are three areas of that international accord:

  1. Strategic coordination – to have a shared vision and a stronger voice for the repository community
  2. Interoperability and common “behaviours” for repositories – supporting the development of value added services
  3. Data exchange and cross regional harvesting – to ensure redundancy and preservation. This has started but there is a lot to do here still, especially as we move to harvesting full text, not just metadata. And there is interest in redundancy for preservation reasons.

So we need to develop the case for a distributed community-managed infrastructure, that will better support the needs of diverse regions, disciplines and languages. Redundancy will safeguard against failure. With less risk of commercial buy out. It places the library at the centre… But… I appreciate it is much harder to sell a distributed system… We need branding that really attracts researchers to take part and engage in the system…

And one of the things we want to avoid… Yesterday it was announced that Elsevier has acquired bepress. bepress is mainly used in the US and there will be much thinking about the implications for their repositories. So not only should institutional repositories be distributed, but they should be different platforms, and different open source platforms…

Concluding thoughts here… Repositories are a technology and technologies change. What its really promoting is a vision in which institutions, universities and their libraries are the foundational nodes in a global scholarly communication system. This is really the future of libraries in the scholarly communication community. This is what libraries should be doing. This is what our values represent.

And this is urgent. We see Elsevier consolidating, buying platforms, trying to control publishers and the research cycle, we really have to move forward and move quickly. I hope the UK will remain engaged with this. And I look forward to your participation in our ongoing dialogue.


Q1 – Les Carr) I was very struck by that comment about the need to balance the local and the global. I think that’s a really major opportunity for my university. Everyone is obsessed about their place in the global university rankings, their representation as a global university. This could be a real opportunity, led by our libraries and knowledge assets, and I’m really excited about that!

A1) I think the challenge around that is trying to support common values… If you are competing with other institutions it’s not always an incentive to adopt systems with common technologies, measures, approaches. So there needs to be a benefit for institutions in joining this network. It is a huge opportunity, but we have to show the value of joining that network. It’s maybe easier in the UK, Europe, Canada. In the US they don’t see that value as much… They are not used to collaborating in this way and have been one of the hardest regions to bring onboard.

Q2 – Adam Field) Correct me if I’m wrong… You are talking about a Commons… In some way the benefits are watered down as part of the Commons, so how do we pay for this system, how do we make this benefit the organisation?

A2) That’s where I see that challenge of the benefit. There has to be value… That’s where value added systems come in… So a recommender system is much more valuable if it crosses all of the repositories… That is a benefit and allows you to access more material and for more people to access yours. I know CORE at the OU are already building a recommender system in their own aggregated platform.

Q3 – Anna Clements) At the sharp end this is not a problem for libraries, but a problem for academia… If we are seen as librarians doing things to or for academics that won’t have as much traction… How do we engage academia…

A3) There are researchers keen to move to open access… But it’s hard to represent what we want to do at a global level when many researchers are focused on that one journal or area and making that open access… I’m not sure what the elevator pitch should be here. I think if we can get to that usage statistics data there, that will help… If we can build an alternative system that even research administrators can use in place of impact factor or Web of Science, that might move us forward in terms of showing this approach has value. Administrators are still stuck in having to evaluate the quality of research based on journals and impact factors. This stuff won’t happen in a day. But having standardised measures across repositories will help.

So, one thing we’ve done in Canada with the U15 (top 15 universities in Canada)… They are at the top of what they can do in terms of the cost of scholarly journals so they asked us to produce a paper for them on how to address that… I think that issue of cost could be an opportunity…

Q4) I’m an academic and we are looking for services that make our life better… Here at Edinburgh we can see that libraries are naturally the consistent point of connection with the repository. Does that translate globally?

A4) It varies globally. Libraries are fairly well recognised in Western countries. In the developing world there are funding and capacity challenges that make that harder… There is also a question of whether we need repositories for every library… Can we do more consortial repositories or similar?

Q5 – Chris) You talked about repositories supporting all kinds of materials… And how can they “wag the dog” of the article?

A5) I think with research data there is so much momentum there around making data available… But I don’t know how well we are set up with research data management to ensure data can be found and reused. We need to improve the technology in repositories. And we need more resources too…

Q6) Can we do more to encourage academics, researchers, students to reuse data and content as part of their practice?

A6) I think the more content we have at Commons level, the more it can be reused. We have to improve discoverability, and improve the functionality to help that content to be reused… There is huge use of machine reuse of content – I was speaking with Peter Knoth about this – but that isn’t easy to do with repositories…

Theo) It would be really useful to see Open Access buttons more visible, using repositories for document delivery, etc.

Chris Banks, Director of Library Services, Imperial College: Focusing upstream: supporting scholarly communication by academics

Gavin MacLachlan: I’d just like to welcome you again to Edinburgh, our beautiful city and our always lovely weather (note for remote followers: it’s dreich and raining!). I’m here to introduce Chris, whose work with LIBER and LERU will be well known to you.

Chris: This is my first fringe and I find it quite terrifying that I’m second up! Now, I’m going to go right back to basics and policy…

The Finch report in 2012 and Research Councils UK: we had RCUK policy; funding available for immediate Gold OA (including hybrid); embargo limits apply where Green OA chosen. Nevertheless the transition across the world is likely to take a number of years. For my money we’ve moved well on repositories, partly as the UK has gone it alone in terms of funding that transition process.

In terms of REF we had the funding council REF policy (2013), which is applicable to all outputs that are to be submitted to the post 2014 REF exercise – effectively it covers all researchers. No additional funding is available. Where Green OA is selected, there is a requirement to use repositories. There were also two paragraphs (15 and 26) shaping what we have been doing…

That institutions are encouraged to go beyond the minimum (and will receive credit for doing so) – and the visibility of that is where we see the rise of University presses. And the statement that repositories do not need to be accessible for reuse and text mining, but that, again, there will be credit for those that are. Those two paragraphs have driven what we’ve been doing at Imperial.

At the moment UK researchers face the “policy stack” challenge. There are many funder policies; the REF policy differs substantially from other policies and applies to all UK research academics – you can comply with RCUK policy and fall foul of REF; many publisher policies…

So how can the REF policy help? Institutions recognise IP, copyright and open access policies are not necessarily supporting funder compliance – something needs to be done. There is a variety of approaches to academic IP observed in UK institutions. Legally in the UK the employer is the first copyright holder… subject to any other agreements and unless the individual is a contractor etc.

Publishers have varying approaches to copyright, from licence to first publish, to outright copyright transfer. Licences are not read by academics. It’s not just in publishing… It’s social media… It’s a big problem.

For the library we want to create frictionless services. We need to upscale services to all researchers – REF policy requirements. We can’t easily give an answer to researchers on their OA options. So we started our work at Imperial to address this, and to ensure our own organisational policy aligned with funder policies. We also wanted to preserve academic choice over publishing, and the ability to sign away rights when necessary (though encouraging scrutiny of licences). We have a desire to maximise the impact of publication. And there is a desire to retain some re-use rights for us in teaching etc, including rights to diagrams etc.

The options we explored with academics ranged from doing as we do at the moment – with academics signing over copyright – through to the institution claiming copyright on all academic outputs. And we wanted to look at two existing models in between: the SPARC model (the academic signs copyright over to the publisher but licenses it back); and the Harvard model – which we selected.

The Harvard model is implemented as part of the university OA policy. The academic deposits Author Accepted Manuscripts (AAMs) and grants a non-exclusive licence to the university for all journal articles. It is a well established policy and has been in use (elsewhere) since 2008. Where a journal seeks a waiver that can be managed by exception. And this is well tested in Ivy League colleges but also much more widely, including universities in Kenya.

The benefit here is that academia retains rights, and authors have the right to make articles open access – open access articles have higher citations than closed ones. Authors can continue to publish in the journal of choice irrespective of whether it allows open access or not. It is a single means by which authors can comply with green open access policies. We are minimising reliance on hybrid open access – reducing “double dipping”, paying twice through subscriptions and APCs – a complex and costly process. I think we and publishers see money for hybrid OA models drying up in the future, as the UK has pretty much been the one place doing that. Instead funding is typically used for pure gold OA models and publications.

We have made some changes to the Harvard model policy to make it work in the context of UK law, also to ensure it facilitated funder deposit compliance and REF eligibility. The next step here is that 60 institutions overall are interested and we have a first mover group of around 12 institutions. We are discussing with publishers. And we have had wider engagement with the researcher, library, research office and legal office communities. We have a website and advocacy materials under development. We are also drafting boilerplate texts for authors, collaboration agreements etc. especially for large international collaborative projects. We have a steering committee established and that includes representatives from across institutions, and including a publisher.

At the moment we are addressing some publisher concerns and perceptions. Publishers are taking a very particular approach to us. We have a real range of responses. Some are very positive – including the pure gold (e.g. PLoS) and also learned society (e.g. Royal Society) publishers. Other publishers have raised concerns and are in touch with the steering group, and with ALPSP.

Summary of current concerns:

  • that it goes beyond the requirements of Finch. We have stated that UK-SCL is to support REF and other funder policies
  • AAMs will be made available on publication. Response: yes, as per Harvard model around since 2008
  • Administrative burden on UK authors/institutions as publishers would have to ask for waivers in 70-80% of cases. We have responded that in other Harvard-model experiences it has been less than 5% and we can’t see why UK authors would be treated differently.
  • They noted that only 8% of material submitted to the REF were green OA compliant. We have noted that only 8% submitted were green OA, not 8% of all eligible for submission.

Researchers have also raised concerns

  • the need to seek agreement from co-authors, especially in collaborations. Can be addressed through a phased/gradual implementation
  • Fear that a publisher will refuse to publish. Institutions using the Harvard model report no instances of this happening
  • Learned Societies – fear loss of income. No reliable research evidence to back up this fear.
  • Don’t like the CC-BY-ND Licence. That is to comply with RCUK but warrants further discussion.

Our next step is a further meeting with the PA/ALPSP to take place during the summer. We have encouraged proposals to deliver more than simply minimum REF eligibility, which would resolve the current funder/publisher policy stack complexity. We will finalise the website, waiver system, advocacy materials and boilerplate texts. To gain agreement on early mover institutions and on the date of first adoption. And to notify publishers.

Another bit of late breaking news… Publishers recently went to HEFCE to ask about policy statements and, as a result of that, HEFCE will be clarifying that it is pushing for minimum compliance and encouraging more than that. One concern of the REF policy had been that only material submitted to the REF would have been deposited…

Last time my institution submitted 5k items; more than half were not monographs. We submitted 95% of our researchers. Out of that, four items were looked at; now it would be two. And from that our funding is decided. And you can see, from that, why that bigger encouragement for the open scholarly ecosystem is so important.

I wanted to close by sharing some useful further materials and to credit others who have been part of this work.

One important thing to note is that we are trying to help researchers and universities to comply as policies from funders and publishers evolve. I would like to see that result in discussion with publishers, and a move to all gold OA… The AAM is not the goal here, it’s the published article. Now that could see the end of repositories – something I am cautious of raising with this audience. Now in the…


Q1) The elephant in the room here is Sci Hub… They are making 95% of published content available for free. You have AAMs out there… And we haven’t seen subscriptions drop.

A1) So our initiative is about legal sharing. And also need to note that the UK is just one scholarly community. And others have not moved towards mandates and funding. I think it is a shame that the fights being picked are with institutions, when we have that elephant in the room…

Q2) Congratulations on the complex and intricate discussions you have been holding… Almost a legal game of Twister, where all the participants hate each other! This is a particular negotiation at the end of a process, at the end of the scholarly publishing chain. How might you like your experience to feed into training of researchers and their own understanding of copyright, ownership of their own outputs?

A2) The challenge that we observe is that we have many younger researchers and authors who are very passionate and ethically minded about openness. They are under pressure from supervisors who say they will not get a tenured position if they don’t have a “good” journal on their CV. And they are frustrated by the slow movement on the San Francisco Declaration on Research Assessment. Right now the quality journals remain those subscription high impact journals. But we have research showing the higher use of open access journals. But we still have that debate within academe that is slowing down that environment. So training researchers about their IP and copyright really matters. I also think it is interesting that Sir Mark Walport is in charge of UKRI, as he has written before about the evolving scholarly record, and the scattering of articles and outputs, instead building online around research projects. He gave a talk at LIBER in 2015, and wrote an article for THE. He was also at Wellcome when they first introduced their mandate, so I think we really do have someone who understands that complexity and the importance of openness.

10×10 presentations (Chair: Ianthe Sutherland, University Library & Collections)

  1. v2.juliet – A Model For SHERPA’s Mid-Term Infrastructure. Adam Field, Jisc

I’m here from SHERPA HQ at Jisc! I’m going to go back to 2006… We saw YouTube celebrating its first year… Eight Out of Ten Cats began… The Nintendo Wii appeared… And… SHERPA/JULIET was launched (SHERPA having been around since 2001). So, when we set up Sherpa REF as a beta service in 2016 we had to build something new, as JULIET hadn’t been set up for APIs and interoperability in that kind of way.

So, we set about a new SHERPA/JULIET based around a pragmatic, functional data model; to move data into a platform; to rebrand to Jisc norms; a like-for-like replacement; and a precedent for our other services as we update them all…

So, a quick demo… We now have the list of funders – as before – including an overview of open access. So if we choose Cancer Research UK… You can see the full metadata record, with headings for more information. You can see which groups it is part of… We have a nice open API where you can retrieve information.

So, whilst it was a like for like rebuild we have snuck in new features, including FundRef DOIs – added automatically where possible, and to be added to with manual input too. More flexible browsing. And a JSON API – really easy to work with. And in the future we’d like funders to be able to add to their own records, and other useful 3rd party editorial features. We want to integrate ElasticSearch. And we want to add microservices…
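A JSON API for funder policy records should make this data easy to consume in scripts. As a hedged illustration only – the response structure and field names below are my own assumptions for the sketch, not the documented SHERPA/JULIET schema – client code might look something like this:

```python
import json

# Hypothetical example of consuming a SHERPA/JULIET-style JSON record.
# The payload shape and field names are illustrative assumptions; consult
# the service's own API documentation for the real schema.
sample_response = """
{
  "funder": {
    "name": "Cancer Research UK",
    "fundref_doi": "10.13039/501100000289",
    "publishing_requirements": [
      {"type": "open_access_archiving", "embargo_months": 6}
    ]
  }
}
"""

record = json.loads(sample_response)
funder = record["funder"]
print(funder["name"], funder["fundref_doi"])
for req in funder["publishing_requirements"]:
    # Each requirement entry carries a type and any embargo period.
    print(req["type"], req.get("embargo_months"))
```

The FundRef DOI in the payload is the kind of identifier mentioned in the talk; having it in machine-readable records is what lets third parties join funder policies to grant metadata.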

In terms of our process here… The hard part was analysing the existing data, structuring it into a more appropriate shape… The next part was much easier… We configured EPrints, imported data, and added some bespoke service requirements.

Right now we have a beta of SHERPA/JULIET. Do take a look please! We are now working on OpenDOAR. And then SHERPA/ROMEO is expected in early 2018.

We now want your feedback! Email with your comments and feedback. We’ll have feedback sessions later today that you can join us for and share your thoughts, ask questions about the API. And myself and Tom Davey our user interface person, are here all day – come find us!

  2. CORE Recommender: a plug in suggesting open access content. Nancy Pontika, CORE

I want to talk about discoverability of content in repositories… Salo 2008, Konkiel 2012 and Acharya 2017 talk about the challenges of discoverability in repositories. So, what is needed? Well, we need recommender systems in repositories so that we can increase the number of incoming links to relevant resources…

For those of you new to repositories, CORE is an aggregation service; we are global and focused, and we have started harvesting gold OA journals… We have services at various levels, including for text mining and data science. We have a huge collection of 8 million full text articles and 77 million metadata records… They are all in one place… So we can build a good recommendation system.

What effect can we have? Well it will increase accessibility, meaning more engagement and a higher Click-Through Rate (CTR); people access resources on CORE twice as often via its recommender system as via search. And that additional engagement increases the time spent in your repositories – which is good for you. And you can open another way to find research…

For instance you can see within White Rose Research Online that suggested articles are presented that come from all of the collections of CORE, including basic geographic information… We would like crowd sourced feedback here. The more users that engage in feedback, the more the recommender will improve. We also get feedback from our community. At the moment the first tab is CORE recommendations, the second tab is institutional recommendations. We’ve had feedback that institutions would prefer it the other way… We have heard that… Although we note that CORE recommendations are better as it’s a bigger data set… We want to make sure the institutional tab appears first unless there are few recommendations/poor matches… We are working on this…

CORE Recommender has been installed at St Mary’s; LSHTM; the OU; University of York; University of Sheffield; York St John; Strathclyde University… and others with more to follow.

How does it work? Currently it’s an article-to-article recommender system. There is preprocessing to make this possible. What is unique is that recommendations are based on full text, and the full text is open access.
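To give a flavour of what “article-to-article recommendation based on full text” means in practice, here is a toy sketch using TF-IDF weighting and cosine similarity – CORE’s production recommender is of course far more sophisticated, and the documents here are made up:

```python
import math
from collections import Counter

# Toy corpus of "full text" documents (invented for illustration).
articles = {
    "a1": "open access repositories improve discoverability of research",
    "a2": "repositories and aggregation services support open access",
    "a3": "crows use tools collaboratively in the wild",
}

def tfidf_vectors(docs):
    """Build a simple TF-IDF vector for each document."""
    n = len(docs)
    tokenised = {k: v.split() for k, v in docs.items()}
    df = Counter()                      # document frequency per term
    for toks in tokenised.values():
        df.update(set(toks))
    vecs = {}
    for k, toks in tokenised.items():
        tf = Counter(toks)
        vecs[k] = {t: (c / len(toks)) * math.log(n / df[t])
                   for t, c in tf.items()}
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(doc_id, docs, top_n=1):
    """Rank all other documents by full-text similarity to doc_id."""
    vecs = tfidf_vectors(docs)
    scores = {k: cosine(vecs[doc_id], v)
              for k, v in vecs.items() if k != doc_id}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("a1", articles))  # ['a2']
```

The open-access angle matters because this only works when the recommender can actually read the full text of everything it is comparing.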

What is the CORE recommender not? It is not always right – but which recommendation system is? And it does not compare the “quality” of the recommended articles with the “quality” of the initial paper…

  1. Enhancing Two workflows with RSpace & Figshare: Active Data to Archival Data and Research to Publication. Rory Macneil, Research Space and Megan Hardeman of Figshare

Rory: Most of the discussion so far has been on publications, but we are talking about data. I think it’s fair to say that FigShare in the data field, and RSpace in the lab notebooks world, have been totally fixated on interoperability!

Right now most data does not make it into repositories… Some shouldn’t be shared, but even the data that should be shared is not. One way to increase deposit is to make it easy to deposit data. Integrating with RSpace notebooks allows easy and quick deposit.

So, in RSpace you can capture metadata of various types. There are lots of ways to organise the data… And to use that you just need to activate the FigShare plugin. Then you select the data to deposit – exporting one or many documents… You select what you want to deposit, and the format to deposit in. You can export all of your work, or all of your lab’s work – whatever level of granularity you want to share… You deposit to Figshare… And over to Megan!

Megan: Figshare is a repository where users can make all of their research outputs available in citable, accessible ways (all marked up for Google Scholar). You can upload any file type (we support over 1,000 types); we assign a DOI at item level; store items in perpetuity (and backed up in DPN); track usage stats and Altmetrics (more exposure); and you can collaborate with researchers inside and outside your institution.

figshare has an open API and integrations with RSpace and other organisations and tools…
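As an illustration of what working against an open REST API like figshare’s involves, here is a sketch that assembles (but does not send) a deposit request. The endpoint and field names follow my understanding of the figshare v2 API, but treat them as illustrative and check the current API documentation before relying on them:

```python
import json

API_BASE = "https://api.figshare.com/v2"
TOKEN = "YOUR-PERSONAL-TOKEN"  # placeholder, not a real credential

def build_create_article_request(title, description, keywords):
    """Assemble the URL, headers and JSON body for creating an article record."""
    return {
        "url": f"{API_BASE}/account/articles",
        "headers": {
            "Authorization": f"token {TOKEN}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "title": title,
            "description": description,
            "tags": keywords,
        }),
    }

req = build_create_article_request(
    "Lab notebook export", "Exported from an ELN", ["notebook", "data"])
print(req["url"])  # https://api.figshare.com/v2/account/articles
```

An integration like the RSpace one essentially automates this kind of call, attaching the exported notebook files to the created article.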

For an example… You can see an electronic lab notebook from RSpace which can be browsed and explored in the browser!

  1. Thesis digitisation project. Gavin Willshaw, University of Edinburgh

I’m digital curator here, and manager of the PhD digitisation project. This project sees a huge amount of content going into ERA, our repository. In the last three years we’ve moved from having two photographers to having two teams of photographers and cataloguers across two sites – we are investing heavily.

We have 17,000 PhD theses and they will all be online by the end of 2018. This will provide global access to the entire PhD collection. We have obtained some equipment. We are creating metadata records, and also undertaking some preservation work where required.

The collection is largely standardised… But we have some Latin and handwritten theses. We have awkward objects too – like slices of lungs!

For 10,000 theses we have duplicates and they are scanned destructively. 3,000 unique theses are scanned non-destructively in house. And 4,000 unique theses are outsourced. All are OCRed. And they are all catalogued, with data protection checks made to determine which can be shared online in full and which cannot.

In terms of copyright and licensing, that still sits with the author. We have contacted some authors and had positive feedback. It’s a risk, but a low risk. In any case we can’t assert the copyright or change licences on our own. And we already have over 2,500 theses live.

And these theses are not just text… We have images that are rare and unusual. We share some of these highlights on our blog, and we use the hashtag #UoEPhD on Twitter. We have some notable theses… Alexander McCall Smith’s PhD is there; Isabel Emslie Hutton, a doctor in the first world war in the Balkans – so noted she was on a stamp in Serbia last year; Helen Pankhurst; and of course members of staff from the university too!

Impact wise, the theses on ERA have been downloaded 2 million times since 2012. Those digitised in the project are seeing around 3,000 downloads per month. Oddly our most popular thesis right now is on the differentiation of people in Norwich. We are also looking at what else we can do… Linking theses to Wikipedia; adding a thesis to Wikisource (and getting 10x the views); and now looking at what else… text and data mining etc.

  1. ‘Weather Cloudy & Cool Harvest Begun’: St Andrews output usage beyond the repository. Michael Bryce, University of St Andrews

I didn’t expect it to actually be cloudy today…!

Our repository has been going since 2006, and use has been growing steadily…

Some of the highlights from our repository have included research on New Caledonian crows and collaborative tool use. We also have farming diaries in our repository under a Creative Commons license… We push that out into the community in blog posts and posters… So going beyond traditional publications and use. Our material on Syria has seen significant usage, driven partly by use in OJS journals.

Our repository isn’t currently OpenAIRE compliant, but we have some content shared that way, which means a bigger audience… For instance material on virtual learning environments associated with a big EU project.

We’ve also been engaging with publishers and broadcasters. The BBC asked us to digitise a thesis at the time of broadcasting Coast, which added that work to our repository.

When we reached our 10,000th item we had cake! And helped publicise the student and their work to a wider audience…

Impact and the REF panel session

Brief for this session: How are institutions preparing for the next round of the Research Excellence Framework #REF2021, and how do repositories feature in this? What lessons can we learn from the last REF and what changes to impact might we expect in 2021? How can we improve our repositories and associated services to support researchers to achieve and measure impact with a view to the REF? In anticipation of the forthcoming announcement by HEFCE later this year of the details of how #REF2021 will work, and how impact will be measured, our panel will discuss all these issues and answer questions from RepoFringers.

Chair: Keith McDonald (KM), Assistant Director, Research and Innovation Directorate, Scottish Funding Council

The panel here includes Pauline Jones, REF Manager at University of Edinburgh, and a veteran of the two previous REFs – she was at Napier University in 2008, and was working at the SFC (where I work) for the previous REF, where she was involved in the introduction of Impact.

Catriona Firth (CF), REF Deputy Manager, HEFCE

I used to work in universities, now I am a poacher-turned-gamekeeper I suppose!

Today I want to talk about Impact in REF 2014. Impact was introduced and assessed for the first time in REF 2014. After extensive consultation Impact was defined in an inclusive way. So, for REF 2014, impact was assessed in four-page case studies describing impacts that had occurred between January 2008 and July 2013. The submitting university must have produced high quality research since 1993 that contributed to the impacts. Each submitting unit (usually subject area) returned one case study, plus an additional case study for every 10 staff.

At the end of the REF 2014 we had 6,975 case studies submitted. On average across submissions, 44% of impacts were judged outstanding (4*) by over 250 external users of research, working jointly with the academic panels. There was a global spread of impact, and those impacts were across a wealth of areas of life – policy, performance and creative practice, etc. There was, for instance, a case study of drama and performance that had an impact on nuclear technology. The HEFCE report on impact is highly recommended reading.

In November 2015 Lord Nicholas Stern was commissioned by the Minister for Universities and Science to conduct an independent review of the REF. He found that the exercise was excellent, and had achieved what was desired. However there were recommendations for improvement:

  • lowering the burden on institutions
  • less game-playing and use of loopholes
  • less personalisation, more institutionally focused – to take pressure off individuals but also recognise and reward institutional investment in research
  • recognition for investment
  • more rounded view of research activity – again avoiding distortion
  • interdisciplinary emphasis – some work could…
  • broaden impact – and find ways to capture, reward, and promote the ways UK research has a benefit on and impacts society.

If you go to the HEFCE website you’ll see a video of a webinar on the Stern Review and specifically on staff and outputs, including that all research active staff should be included, that outputs be determined at assessment level, and that outputs should not be portable.

In terms of impact there was keenness to broaden and deepen the definition of impact and provide additional guidance. Policy was seen as a safer kind of case study before. The Stern Review emphasised a need for more focus on public engagement and impact on curricula and/or pedagogy. Reduce the number of required case studies to a minimum of one. And to include impact arising from research, research activity, or a “body of work”. And having a quality threshold for underpinning research based on rigour – not just originality. And the opportunity to resubmit case studies if the impact was ongoing.

We have been receiving feedback – over 400 responses – which are being summarised. That feedback includes positive feedback on broadening impact and on aligning definitions of impact and public engagement across funding bodies. There were some concerns about sub-profiles based on one case study – especially in small departments. And in those cases you’d know exactly whose work and case study was 4* (or not). There have been concerns about how you separate rigour from originality and significance. There was a lot of support for a broader basis of research, but challenges in drawing boundaries in practice – in terms of timing and how far back you go… For scholarly career assessment do you go back further? And there was broad support for resubmission of 2014 case studies but questions about “additionality” – could it be the same sort of impact or did it need to be something new or additional? So, we are working on those questions at the moment.

The other suggestion from the Stern Review was the idea of an institutional level assessment of impact, giving universities opportunities to submit case studies that didn’t fall neatly elsewhere – case studies arising from multi- and interdisciplinary and collaborative work – and that these should be 10-20% of total impact case studies, with a minimum of one. But feedback has been unclear here, particularly on the conflation of interdisciplinary research with institutional profiles. There was concern also that the University might take over a case study that would otherwise sit in another unit.

So, the next step is communications in summer/autumn 2017. There will be a REF initial decisions document, a summary of consultation responses, and sharing of full consultation responses (with permission). And there will be a launch of our REF 2021 website and Twitter account.

Anne-Sofie Laegran (ASL), Knowledge Exchange Manager, College of Arts, Humanities and Social Sciences, University of Edinburgh

KM: Is resubmission better for some areas than others?

ASL: I think it depends on what you mean by resubmission… We have some good case studies arising from the same research as in 2014, but they are different impacts.

So… I will give you a view from the trenches. To start, I draw your attention to the University strapline that we have been “Influencing the world since 1583”. But we have to demonstrate and evidence that of course.

There has been an impact of impact in academia… When I started in 2008 it was about having conversations about the importance of having an impact; now it is much more about how you do this. There has been a culture change – all academic staff must consider the potential impact of research. The challenge is not only to create impact but also to demonstrate impact. There is also an incentive to show impact – it is part of career progression, part of recruitment, and part of promotion.

Impact of impact in academia has also been about training – how to develop pathways as well as how to capture and evidence impact. And there has been more support – expert staff as well as funding from funders and from the university.

In terms of developing pathways to impact we have borrowed questions that funders ask:

  • who may benefit from your research?
  • what might the benefits be?
  • what can you do to ensure potential beneficiaries and decision makers have the opportunity to engage and benefit?

And it is also – especially when capturing impact – about developing new skills and networks.

For instance… If you want to impact the NHS, who makes decisions, makes changes… If you are working with museums and galleries the decision makers will vary depending on where you can find that influence. And, for instance, you rarely partner with the Scottish Government, but you may influence NGOs who then influence Scottish Government.

Whatever the impact, it starts from excellent research; which leads to knowledge exchange – public engagement, influencing policy, informing professional practice and service delivery, technology transfer; and that results in impact. You don’t “do” impact: your work is used and influences others, which then effects a change and an impact.

REF impact challenges include demonstrating change/benefit as opposed to reporting engagement activity. Attributing that change to research. And providing robust evidence. In 2014 that was particularly tricky as the guidance was in 2012 so people had to dig back… That should be less of an issue now, we’ve been collecting evidence along the way…

Some cases that we think did well, and/or had feedback were doing well:

  • A College of Art scholar, who has a dual appointment at the National Galleries of Scotland, curated the Impressionism Scotland show with over 100k visitors. There was good feedback that also generated debate. It changed how the gallery curates shows. And on the market the works displayed went up in value – it had a real economic impact.
  • In law two researchers have been undertaking longitudinal work on young people, their lives, careers, and criminal careers. That is funded by Scottish Government. That research led to a new piece of policy based on the findings of that research. And there was a quote from Scottish Government showing a decline in youth crime, attributing that to the policy change, and which was based on research – showing that clear line of impact.
  • In sociology, a researcher wrote about the impact of research on the financial crisis for the London Review of Books, it was well received and he was named one of the most influential thinkers on the crisis; his work was translated to French; it was picked up in parliament; and Stephanie Flanders – then BBC economics editor – tweeted that this work was the most important on the financial crisis.
  • In music, researchers developed the Skoog, an instrument for disabled students to engage in music. They set up a company and attracted investment. At the time of the REF they had 6 employees and were selling to organisations – so reaching many people. And they were also used in the cultural olympiad during the 2012 Olympics, showing that wider impact.

So for each of these you can see there was both activity, and impact here.

In terms of RepoFringe areas I was asked to talk about the role of repositories and open access. It is potentially important. But typically we don’t see impact coming from the scholarly publication, it’s usually the activities coming from the research or from that publication. Making work open access certainly isn’t enough to just trigger impact.

Social media can be important but it needs to have a high level of engagement, reach and/or significance to demonstrate more than mere dissemination. That Stephanie Flanders example wouldn’t be enough on its own; it works as part of demonstrating another impact, and is a good way to track impact, to understand your audience… And to follow up and see what happened next…

Metrics – there is no doubt that numeric evidence was important. Our head of research said last time that “numbers speak louder than adjectives”, but they have to be relevant and useful. You need context. Standardised metrics/altmetrics don’t work – a big report recently concluded the same. Altmetrics are alternative metrics that can be tracked online, using DOIs. A company called Altmetric gathers that data, which can be useful to track… And can be manipulated by friends with big Twitter followings… It won’t replace case studies, but may be useful for tracking…

In terms of the importance of impact… It relates to 20% of the REF score and determined 26% of the funding in Scotland. Funding attracted per annum for the next 7 years:

  • 4* case study brings in £45-120k
  • 3* £9-25k
  • 2* £0
  • a 4* output, for comparison, is worth £7-15k…

The question that does come up is “what is impact” and yes, a single Tweet could be impact that someone has read and engaged with your work… But those big impact case studies are about making a real change and a range of impacts.

Pauline Jones (PJ), REF Manager and Head of Strategic Performance and Research Policy, University of Edinburgh

Thank you to Catriona and Anne-Sofie for introducing impact. I wanted to reinforce the idea that this is what we are doing anyway, making an impact on society, so it is important anyway, not only because of the REF.

Catriona suggested we had a “year off” but actually once REF happened we went into an intense period of evaluation and reflection, then of course the Stern review, consultation, general election… It has been quite non-stop. But actually even if that wasn’t all going on, we’d need our academics to be aware of the REF and of open access. I think open access is incredibly important, people are looking for it… Research is publicly funded… But it has required a lot of work to get up and running.

Although we are roughly at mid point between REFs, we are up and running, gathering impact, preparing to emphasise our impact. In terms of collecting evidence, depositing papers… That will happen in most universities. I think many will be doing the sort of Mock REFs/REF readiness exercises that we have been undertaking. We are also already thinking about drafting our case studies. As we get nearer to submission we’ll take decisions on inclusion… and getting everything ready.

So for REF 2021 we have a long time period over which the submission is prepared. There is no period over which outputs, impacts and environment don’t count. Academics are thinking now about what to include: a 2017 REF readiness exercise to focus on open access and numbers; a 2018 Mock REF to focus on quality. And we all have to have a CRIS system now to make that work.

What’s new here? We are still waiting for the draft guidance to understand what’s happening. There are open access journal articles/conference proceedings. There are probably the challenges of submitting all research staff, and decoupling the one-to-four staff-to-outputs ratio. That break is quite a good thing… Some researchers might struggle to produce four key outputs – part time staff, staff returning from maternity leave, etc. But we want a sense of what that looks like from our mock/readiness work. The non-portability requirement seems useful and desirable, but speaking personally I think the researcher invests a lot – not just the institution – making that complex. Taking all those together I’m not sure the Stern aim of less complexity or burden holds here, not alongside those changes.

And then we have the institutional impact case studies – we had a number of interdisciplinary examples of work, so we are comfortable with that possibility. The institutional environment is largely shared, so doing that once for the whole university could be a really helpful reduction in workload. And each new element will have implications for how CRIS systems support REF submissions.

And as we prepare for REF 2021 we also have to look to REF 2028. We think open data will be important given the Concordat on Open Research Data (signed by HEFCE, RCUK, Universities UK and Wellcome) so we can get ready now, ready for when that happens. I’m pretty confident that open access monographs will be part of the next REF (following the Monographs and Open Access HEFCE report). Then there is institutional impact – it may not happen here but may be back. And then there are metrics. We have The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management.

In terms of responsible metrics, we haven’t heard the last of them… The Forum for Responsible Metrics talks about data and metrics to support decisions, not as the sole driver; the conversation will not end with The Metric Tide. Metrics are alluring but to date they haven’t worked well versus other types of evidence.

So, how do we prepare?

  • For REF 2021 we need to be agile, support research managers to help academics deposit work, and lobby CRIS system designers to provide fit-for-purpose systems.
  • For REF 2028 we have to understand the benefits and challenges of making more research open
  • Be part of the conversation on responsible metrics – any bibliometrics experts in the room will stay busy.
  • And we want to have interoperability in systems…


Q1) How can we do something useful in terms of impact for case studies as our repository remit expands to different materials, different levels of openness, etc.?

A1 – ASL) I think being easily accessible on University websites, making them findable… Then also perhaps improved search functionality, and some way to categorise what comes out… If creating things other than peer reviewed publications – “what is this?” type information. I might have been too negative about repositories because historically our data wasn’t in those… I think actually the sciences find that even more important…

Q1) For collecting evidence?

A1 – ASL) Yes, for collecting… Some have metrics that help us see how those impacts have worked.

A1 – PJ) We’ve been talking about how best to use our CRIS to improve join up and understand those impacts…

A1 – CF) I think it’s also about getting that rounded view of the researcher – their outputs, publications, etc. being captured as impacts alongside the outputs… That could be useful and valuable…

Q2) A common theme was the burden of this exercise… But it could be argued that it drives positive changes… How can the REF add to the sector?

A2 – CF) Wearing my personal and former-job hat, as an impact officer, I did see the REF drive strategic investment in universities, including public engagement, that rewards, recognises, and encourages more engagement with the community. There is real sharing of knowledge brought about by impact and the REF.

A2 – ASL) Totally agree.

A2 – PJ) More broadly the REF and RAE… They recognise the importance of research and supporting researchers. For us we get £75M a year through the research excellence funding. And we see the quality of research publications going up…

Q3) Do you have any comments on the academic community and how that supports the REF, particularly around data?

A3 – PJ) At Edinburgh we are very big – we submitted 1,800 staff, and we could have submitted up to 2,500. In my previous role we had much smaller numbers of research staff… So they are different challenges and different systems… We have spoken to our Informatics colleagues to see what we can do. There are definitely benefits at the level of building a system to manage this…

Q3) In an academic environment we have collegiate working practice, and need systems that work together.

A3 – PJ) We have a really distributed set up at Edinburgh, so we are constantly having that conversation, and looking for cross cutting areas, exchanging information…

Q4) The relationship with the researcher changes here… In previous years universities talked about “their research” but it was actually all structured around the individual. In this new model that shift is big, as are the role and responsibility of the organisation, and the ways that schools interact with their researchers…

A4 – ASL) You do see that in pre-funding application activity with internal peer review processes that build that collegiality within the organisation…

Q5) I was intrigued by the comment that lots of impact isn’t associated with outputs… So that raises questions about the importance of outputs in the REF. Should we rebalance how the output is valued?

A5 – ASL) Perhaps. For example, when colleagues are providing evidence to government and parliament it is rare for publications to be referenced, and rare for publications to be read… I don’t think those matter in themselves… But they include methodology, rigour, evidence of quality of work. That then becomes briefing papers etc… Otherwise you and I could just make a paper – but that would be opinion. So you need that (hard to read) academic publication, and you have to acknowledge that those are different things with different roles – and that has to be demonstrated in the case studies.

A5 – CF) I think it’s an interesting question, especially thinking ahead to REF 2021… We are considering how those impacts on the field and impacts on wider society are represented – some blue skies research won’t have impact for many years to come…

Q6) I think a lay summary of a piece of work is so crucial. Science Open and Jon Tennant are putting up lay summaries, and you have Kudos and other things contributing to that… The public want to understand what they are reading. I have personally sat on panels as a lay member and I know how hard that kind of lay summary is to produce, to understand what has taken place.

A6 – ASL) You do need that lay summary of work, or briefing paper – or there are expert communities which are not lay people… You have to think about audiences, communicate your work widely, and target it… I think repositories are useful for accessing work, but it’s not enough to put it there – just as it isn’t enough to put an article out there – you have to actively reach out to your audiences.

A6 – CF) I would agree and I would add that there is such a key need to help academics to do that, to support skills for writing lay summaries… Getting it clearer benefits the researcher, their thinking, and how they tell others about their work – that truly enables knowledge exchange.

A6 – PJ) And it benefits the academic audience too. I was listening to a podcast where academics from across disciplines assessed which papers were most valuable, and being readable to a lay audience was a key factor in how those papers did.

10×10 presentations (Chair: Ianthe Sutherland, University Library & Collections)

  1. National Open Data and Open Science Policies in Europe. Martin Donnelly, DCC

I’m talking about some work we’ve done at DCC with SPARC Europe looking at Open Data and Open Science policies across Europe.

The DCC is a centre of expertise in digital curation and data management. We maintain a watching brief on funders’ research data policies (largely focused on the UK). SPARC Europe is a membership organisation comprising academic institutions, library consortia, funding bodies, research institutes and publishers. Their goal is advocating change in scholarly communications for the benefit of research and society. And we have been collaborating since 2016 looking at open data and open science policies across Europe.

So, what is a policy? Well the dictionary definition works, it’s a set of ideas or a plan of what to do in particular situations that has been agreed to officially by a group of people or an organisation.

In this work we looked at national policies – in some regions with a single research funder that could be the funder policy, but in the UK the AHRC wouldn’t count here as that is not a national policy across the whole country. The last known analysis of this sort dates back to 2013 and much has changed since then.

We began by compiling and briefly describing a list of national policies in the EU and some ERA states (IS, NO, CH). We circulated that list for comment and additions. We also sought intelligence from contacts from DCC European projects to ask about the status of national approaches, forthcoming or existing policies, etc. We then attempted to classify the policies.

Across the thirteen countries we found: 6 funder policies; 4 national plans or roadmaps; 2 concordat type documents; 2 laws; and one working paper. That adds up to more than 13 as some countries have parallel documents. Identifying the lead, ranking or sponsoring organisation was not always straightforward; sometimes documents were co-signed by partners or groups. All of the policies discussed research data; 7 addressed open access publication explicitly; 6 addressed software, code, tools or models; 5 addressed methods, workflows or protocols; and one addressed physical (non-digital) samples.

Most policies were prescriptive or imperative. Monitoring of compliance and/or penalties are not that common. And these are new – only 2 policies pre-date 2014, though there are preceding open access policies. And new policies keep appearing as a result of our work… Two policies have been translated into English specifically because of this work (Estonia, Cyprus). The EC’s Open Research Data Pilot for Horizon 2020 was cited in multiple policy documents. And we hope that Brexit won’t diminish our role or engagement in European open data policy.

  1. IIIF: you can keep your head while all around are losing theirs! Scott Renton, University of Edinburgh

IIIF is the International Image Interoperability Framework, which enables you to use images in your cultural heritage resources. IIIF works through two APIs. You bring in images through the Image API via IIIF compliant URLs – long URLs that include the region of the image, instructions for display, etc. The other API is the Presentation API, which is much more about curation, including the ability to curate collections of content – so you can structure these as, say, an image of a building that is related to images of the rooms in that building.
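Those “long URLs” in the Image API follow a fixed template – {server}/{prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format} – so they can be generated mechanically. A small sketch (the base URL and identifier here are made up):

```python
# Build IIIF Image API URLs from the standard template. Defaults request the
# full image at default quality as a JPEG.
def iiif_image_url(base, identifier, region="full", size="full",
                   rotation=0, quality="default", fmt="jpg"):
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

base = "https://images.example.ac.uk/iiif/2"  # hypothetical IIIF server

# Full image at default quality:
print(iiif_image_url(base, "stcecilias-0001"))
# A 200px-wide rendering of a cropped region of the same image:
print(iiif_image_url(base, "stcecilias-0001",
                     region="100,100,800,600", size="200,"))
```

This is what makes IIIF viewers interoperable: any compliant server will answer URLs of this shape, so zooming and cropping need no server-specific code.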

We have images in Luna and we pushed on Luna to support IIIF, and we did get success there. We implemented IIIF in December. We have made a lot of progress and have IIIF websites online. The workflows are really complex but it allows us to maintain one set of images and metadata through these embedded images, rather than having to copy and duplicate work. And those images are zoomable, draggable, etc. Metadata games is also IIIF compliant. And it is feeding into our websites, including the new St Cecilia’s Hall museum website.

Our next implementation was for Coimbra – which includes other people’s images. For our images, and those of other IIIF compliant organisations, that was easy, but we had to set up our own image server (Cantaloupe) to manage images from others.

The next challenge was the Mahabharata Scroll. It is a huge document, but the IIIF spec and Luna allow us to programme a sequence of viewers…

And our main achievement has been Polyanno, which enables annotation: annotations can be stored in manifests, uploaded and discussed. It’s proving very popular with the IIIF community. We have a huge number of images still to convert to IIIF, but lots of plans, lots of ideas, and lots to do…

We are also collaborating with the NLS around their content, and are happy to talk with others about IIIF!

  1. Reference Rot in theses: a HiberActive pilot. Nicola Osborne, EDINA, University of Edinburgh

This was my presentation – so notes from me here but some links to Site2Cite, a working demo/pilot tool for researchers to proactively archive their web citations as they are doing their research, to ensure that by the time they submit their PhD, have their work published, or begin follow up work, they still have access to those important resources.

Introducing Site2Cite:

Try out the Site2Cite tools for yourself here:

You can view my full slides (slightly updated to make more sense for those who didn’t hear the accompanying talk) from the 10×10 here:

This ISG Innovation Fund pilot project builds upon our previous Andrew W. Mellon-funded Hiberlink project, a collaboration between EDINA, Los Alamos National Laboratory, and the University of Edinburgh School of Informatics. The Hiberlink project built on and worked with Herbert Van de Sompel’s Memento work.

  1. Lifting the lid on global research impact: implementation and analysis of a Request a Copy service. Dimity Flanagan, London School of Economics and Political Science

Apologies for missing the first few minutes of Dimity’s talk…

LSE have only recently implemented the “request a copy” button in the repository but, having done that Dimity and colleagues have been researching how it is used.

We’ve had about 500 requests so far. The most popular requests have been for international relations, law and media areas. And we see demand from organisations and governments – including requests explicitly stating that they do not subscribe to the journal and they felt it was crucial to their work. There is that potential impact here being revealed in requests for articles ahead of key meetings and events, etc.

And these requests show huge reach, coming from organisations locally and around the world.

One thing we have noticed is that we get a lot of requests from students who can definitely access the version of record through journals subscribed to by their university – they don’t realise and that causes avoidable delay. We have also seen academics linking from reading lists to restricted items in repositories. But, on a more positive note, we’ve had lots of requests from our alumni – 70% of our alumni are international and that shows really positive impact for our work.

Overall this button and the evidence that requests provide has been really positive.

  1. What RADAR did next: developing a peer review process for research plans. Nicola Siminson, Glasgow School of Art

RADAR captures performances, exhibitions, as well as traditional articles, monographs etc. It is hosted on EPrints. And we encourage staff to add as much metadata as possible. But increasingly it is being used internally, with staff developing annual research plans (ARPs) and that feeding into allocations in the year ahead.

These ARPs arose in part from the outcome of the REF 2014 assessment. The peer reviewed (but not openly available) ARPs aim to enable research time to be allocated more effectively, with a view to maximising the number of high quality submissions to the next REF. RADAR houses the template as it played a key role in the GSA REF 2014 submissions, and staff already use and know the system.

The template went live in 2015, and was tweaked, trialled and relaunched in February 2015. The ARP template captures the research, the researcher’s details, and the expected impact of their work – plus a submit process. The process was really quite manual, so we thought carefully about how this should work… Once submitted, the digital ARP went into a manual process. After piloting, we built the peer review process into RADAR, including access management that allows the researcher sole access until submission, and then manages access back and forth as required.

We discussed this work with EPrints in Autumn 2016 and development commenced in Spring 2017. This was quite an involved process. The system was live in time for ARP panel chairs to send feedback and results.

So the process now sees ARPS submit; RADAR admin provides Head of Research with report of all ARPs submitted. Then it goes through a series of review stages and feedback stages.

So administrators can view ARPs, panels, status, etc. and there is space for reviews to be captured and the outcome to be shared.

Lessons learned here… No matter how much testing you have done, you’ll still need to tweak and flag things – it’s useful to have a keen researcher to test it and feed back on those tweaks. We still need to increase the prominence of the summary and decision for the researcher, with more differentiated fields for peer reviews, etc. In conclusion, the ARP peer review process has been integrated into RADAR and will be fully tested next year. The continued development of RADAR is bearing fruit – researchers are using the repository and adding more outputs, offering greater visibility and downloads for GSA.

Explore our repository at

  1. Edinburgh DataVault: Local implementation of Jisc DataVault: the value of testing. Pauline Ward, EDINA

I am Pauline Ward from the Research Data Service at the University of Edinburgh, and I am based at EDINA, which is part of the University. Jisc commissioned UoE’s Library and University Collections (L&UC) team to design a service for researchers to store data for the long term with the Jisc Data Vault. And we’ve now implemented a version of this at Edinburgh – using that software from L&UC, with the service specified and managed by EDINA.

The DataVault allows safe data storage in the University’s archival storage option, which links this data to a metadata record in Pure without having to re-enter any of the data. And, optionally, to receive a DOI for the data which can be used in publications and other outputs – depending on the context and appropriate visibility of the data. That allows preservation of data at the University. The DataVault is not for making data public – we have a service called DataShare for that.

So, let’s talk about metadata… We push that metadata to Pure and keep DataVault metadata as concise as possible. We need metadata that is usable and have some manual intervention to check and curate that.

We had a fairly extensive user testing process, to ensure documentation works well, then we also recruited academics from across the University to bring us their data and test the system to help us ensure it met their needs.

So, the interim version is out there, and we are continuing to develop and improve it.

  1. Data Management & Preservation using PURE and Archivematica at Strathclyde. Alan Morrisson, University of Strathclyde

We are governed and based in the research department. We wanted to look at both research data management and long term preservation, including reflecting on whether Pure is the right tool for the job here. Pure was already in use at Strathclyde when our Research Data Deposit Policy was being developed, so we deliberately made the policy as open as possible. Also Strathclyde is predominantly a STEM university, and we started off by surveying what else was out there… We knew the quantity and type of data coming in…

And since we opened up the service, in terms of data deposits to date we have seen a steady increase from about 200 to 400 data sets over the last year.

In terms of our preservation and curation systems, we have Pure in place and that does a lot – data storage, metadata, DOIs etc. But we’ve also recently implemented Archivematica – it’s free, it’s open source, it’s compatible with Jisc DataVault. So the workflow right now is that data, metadata and related outputs are added to Pure, and a DOI is minted. This feeds the Knowledgebase portal. In parallel, the data from Pure goes to Archivematica, where it is ingested and processed for preservation, and the AIP METS file is cleaned using METSFlask before being stored.

The benefit of this setup is that Pure is familiar to researchers, does a good job of metadata management and related content, and has a customised front end (Knowledgebase). Archivematica is well supported, open source, and designed for archiving. But those systems don’t work together; we are manually moving data across. Pure is designed for storage and presentation, not curation. And Archivematica only recognises about 40% of the data.

So, in the future we are reviewing our system, perhaps using Pure for metadata only. We are keeping an eye on Jisc RDSS and considering possible Arkivum like storage. And generally looking at what is possible and most appropriate moving forward for curation and archiving.

  1. Open Access… From Oblivion… To the Spotlight? Dawn Hibbert, University of Northampton

I’ll be looking back over the last ten years… And actually ten years back I was working here in Accommodation Services, so not thinking about repositories at all!

Looking back at 2007/8, in the repository world we had our NECTAR repository. Then in 2011 a Jisc-funded project enabled an author deposit tool for NECTAR. At that time we had a carrot/incentive for deposit, but no stick. Which was actually a nice thing, as we’ve now slipped more towards it all being about the REF.

By 2012/13 we engaged with our researchers around open access, who gave feedback such as “it’s in the library – you can get a copy from there”, or “it’s only £30 to buy the journal I publish in; if I make my article free the journal will go under”, or “My work is not funded by RCUK, so why should my work be open access?”. We wanted everything open… But by 2014/15 (and the HEFCE announcement) we were still getting “I don’t have to give you anything until 2016” and similar… And we get that idea of “it’s all about the REF”. And it is not. Using the REF in that way, and the repository in that way, overlooks the other benefits of open access.

So in 2016/17 HEFCE compliance started. Attitudes have shifted. But the focus has all been about gold APCs and the idea of the university paying. When actually we are using the green OA route: HEFCE-compliant deposit now, open access later. And for us we really want researchers to deposit much more than the open access part (we can do that later on).

So, in 2017 and beyond we are looking at emphasising the benefits, sharing that information, being positive about the opportunities, not just using the HEFCE stick. And for open access work we are looking at improving acceptance, extending open access to other outputs, and focusing on visibility of research outputs – the Kudos-type tools. And we are shifting the focus to Digital Preservation.

We are looking at datasets being open access too. RDM and Digital Preservation gaining ground. And when work is deposited, shared, tweeted, etc. that can really shift attitudes and show benefits and engagement for academic colleagues.

But we still see lots of money spent on APCs and journal subscriptions. And we have yet to see what happens with RCUK and REF compliance.

  1. Automated metadata collection from the researcher CV Lattes Platform to aid IR ingest. Chloe Furnival, Universidade Federal de São Carlos

I am pleased to present work by myself and my colleagues from Sao Paulo in Brazil. Back in 1999 all Brazilian universities were required to share CVs of their research and academic staff on a platform (Curriculo Lattes), which now has over 2 million records.

However, our University’s repository was only launched in 2016. Unlike many universities, which use Web of Science or Scopus to capture their researchers’ work, we saw that the Lattes CV Platform held the key and most up to date metadata – always extremely current, as funders require it to be. It is a really useful stepping stone to identify our staff publications for the initial repository.

Two very well known researchers, Mena-Chalco and Cesar Jr (2013), developed ScriptLattes for this extraction. But then the CNPq decided to implement a CAPTCHA, which blocks that script. They alleged this was for security reasons, but it created an uproar as the CVs were seen as “our data”… So this has all been very complicated and has impacted on our plans to identify our own researchers’ work… So we went for SOAP (Simple Object Access Protocol) instead. We also developed a proxy server to deal with CNPq limits, based on the OpenResty platform, to share access to the Lattes SOAP webservices. That lets us manage our local IP address and manage load/avoid going over capacity.

We extract the data in XML format, then process it in Python to generate Dublin Core. Then we use another script to eliminate duplicates using the Jaccard measure, which helps detect near-duplicate records… Once processed, it is held in DSpace. Each record in Lattes has a unique identifier, as that site uses an ID number that all Brazilians are required to have to access e.g. a bank account.
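The Jaccard-based deduplication step described above can be sketched as follows. This assumes simple whitespace tokenisation of record titles; the threshold value is illustrative, not the project's actual parameter.

```python
# Jaccard-similarity deduplication over record titles (illustrative only).
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def dedupe(titles, threshold=0.9):
    """Keep each title unless it is near-identical to one already kept."""
    kept = []
    for title in titles:
        if all(jaccard(title, k) < threshold for k in kept):
            kept.append(title)
    return kept
```

The attraction of Jaccard here is that it tolerates small differences in word order, casing and punctuation that plague metadata harvested from self-maintained CVs.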

So the CVs of 1,166 teaching staff and researchers working at our HEI were retrieved in just 11 minutes, including metadata for 78K journal articles and proceedings papers. We had the specific objective of gaining direct and official access to public metadata held in Lattes CVs.

  1. The Changing Face of Goldsmiths Research Online. Jeremiah Spillane, Goldsmiths, University of London

JS: Goldsmiths Research Online started as a vanilla install of EPrints, and it has become more and more customised over time. Important to that development have been several projects. The Jisc Kultur project created a transferable and sustainable institutional repository model for research output in the creative and applied arts, creating the facility for capturing multimedia content in repositories.

Kultur led to the Jisc Kaptur project, led by VADS working with various art colleges including Goldsmiths and GSA.

Then in 2009 we had the Defiant Objects project which looked to understand what makes some objects more difficult to deposit than others.

Jeremiah’s colleague: RAE/REF work has looked at policy versus the open access ethos – and striking the right balance there. So, the Goldsmiths website now includes content brought in from the repository. And that is now organised depending on the needs of different departments. We are also redesigning the website to better embed content to enable exploration of visual content. And the new design should be in place by autumn this year.

Speaking of design… We have been working with OJS but have been wanting to more thoroughly design OJS journals, so we have a new journal coming, Volupte, which runs on OJS in the background but uses SquareSpace at the front end – that’s a bit of an experiment at the moment.

JS: So, the repository continues to develop, whilst our end users primarily focus on their research.

Take a look at:

And with that Day One, and my visit to Repository Fringe 2017, is done. 

Aug 022017
The digital flyer for my CODI 2017 show - huge thanks to the CODI interns for creating this.

As we reach the end of the academic year, and I begin gearing up for the delightful chaos of the Edinburgh Fringe and my show, Is Your Online Reputation Hurting You?, I thought this would be a good time to look back on a busy recent few months of talks and projects (inspired partly by Lorna Campbell’s post along the same lines!).

This year the Managing Your Digital Footprint work has been continuing at a pace…

We began the year with funding from the Principal’s Teaching Award Scheme for a new project, led by Prof. Sian Bayne: “A Live Pulse”: Yik Yak for Teaching, Learning and Research at Edinburgh. Sian, Louise Connelly (PI for the original Digital Footprint research), and I have been working with the School of Informatics and a small team of fantastic undergraduate student research associates to look at Yik Yak and anonymity online. Yik Yak closed down this spring which has made this even more interesting as a cutting edge research project. You can find out more on the project blog – including my recent post on addressing ethics of research in anonymous social media spaces; student RA Lilinaz’s excellent post giving her take on the project; and Sian’s fantastic keynote from #CALRG2017, giving an overview of the challenges and emerging findings from this work. Expect more presentations and publications to follow over the coming months.

Over the last year or so Louise Connelly and I have been busy developing a Digital Footprint MOOC, building on our previous research, training and best practice work to share this with the world. We designed a three week MOOC (Massive Open Online Course) that runs on a rolling basis on Coursera – a new session kicks off every month. The course launched this April and we were delighted to see it get some fantastic participant feedback and press coverage (including a really positive experience of being interviewed by The Sun).

The MOOC has been going well and building interest in the consultancy and training work around our Digital Footprint research. Last year I received ISG Innovation Fund support to pilot this service and the last few months have included great opportunities to share research-informed expertise and best practices through commissioned and invited presentations and sessions including those for Abertay University, University of Stirling/Peer Review Project Academic Publishing Routes to Success event, Edinburgh Napier University, Asthma UK’s Patient Involvement Fair, CILIPS Annual Conference, CIGS Web 2.0 & Metadata seminar, and ReCon 2017. You can find more details of all of these, and other presentations and workshops on the Presentations & Publications page.

In June an unexpected short notice invitation came my way to do a mini version of my Digital Footprint Cabaret of Dangerous Ideas show as part of the Edinburgh International Film Festival. I’ve always attended EIFF films but also spent years reviewing films there, so it was lovely to perform as part of the official programme, working with our brilliant CODI compère Susan Morrison and my fellow mini-CODI performer, mental health specialist Professor Steven Lawrie. We had a really engaged audience with loads of questions – an excellent way to try out ideas ahead of this August’s show.

Also in June, Louise and I were absolutely delighted to find out that our article (in Vol. 11, No. 1, October 2015) for ALISS Quarterly, the journal of the Association of Librarians and Information Professionals in the Social Sciences, had been awarded Best Article of the Year. Huge thanks to the lovely folks at ALISS – this was lovely recognition for our article, which you can read in full in the ALISS Quarterly archive.

In July I attended the European Conference on Social Media (#ecsm17) in Vilnius, Lithuania. In addition to co-chairing the Education Mini Track with the lovely Stephania Manca (Italian National Research Council), I was also there to present Louise and my Digital Footprint paper, “Exploring Risk, Privacy and the Impact of Social Media Usage with Undergraduates“, and to present a case study of the EDINA Digital Footprint consultancy and training service for the Social Media in Practice Excellence Awards 2017. I am delighted to say that our service was awarded 2nd place in those awards!


My Social Media in Practice Excellence Award 2017 2nd place certificate (still awaiting a frame).

You can read more about the awards – and my fab fellow finalists Adam and Lisa – in this EDINA news piece.

On my way back from Lithuania I had another exciting stop to make at the Palace of Westminster. The lovely folk at the Parliamentary Digital Service invited me to give a talk, “If I Googled you, what would I find? Managing your digital footprint” for their Cyber Security Week which is open to members, peers, and parliamentary staff. I’ll have a longer post on that presentation coming very soon here. For now I’d like to thank Salim and the PDS team for the invitation and an excellent experience.


The digital flyer for my CODI 2017 show (click to view a larger version) – huge thanks to the CODI interns for creating this.

The final big Digital Footprint project of the year is my forthcoming Edinburgh Fringe show, Is Your Online Reputation Hurting You? (book tickets here!). This year the Cabaret of Dangerous Ideas has a new venue – the New Town Theatre – and two strands of events: afternoon shows; and “Cabaret of Dangerous Ideas by Candlelight”. It’s a fantastic programme across the Fringe and I’m delighted to be part of the latter strand with a thrilling but challengingly competitive Friday night slot during peak fringe! However, that evening slot also means we can address some edgier questions so I will be talking about how an online reputation can contribute to fun, scary, weird, interesting experiences, risks, and opportunities – and what you can do about it.

QR code for CODI17 Facebook Event

Help spread the word about my CODI show by tweeting with #codi17 or sharing the associated Facebook event.

To promote the show I will be doing a live Q&A on YouTube on Saturday 5th August 2017, 10am. Please do add your questions via Twitter (#codi17digifoot) or via this anonymous survey and/or tune in on Saturday (the video below will be available on the day and after the event).

So, that’s been the Digital Footprint work this spring/summer… What else is there to share?

Well, throughout this year I’ve been working on a number of EDINA’s ISG Innovation Fund projects…

The Reference Rot in Theses: a HiberActive Pilot project has been looking at how to develop the fantastic prior work undertaken during the Andrew W. Mellon-funded Hiberlink project (a collaboration between EDINA, Los Alamos National Laboratory, and the University of Edinburgh School of Informatics), which investigated “reference rot” (where URLs cease to work) and “content drift” (where URLs work but the content changes over time) in scientific scholarly publishing.

For our follow up work the focus has shifted to web citations – websites, reports, etc. – something which has become a far more visible challenge for many web users since January. I’ve been managing this project, working with developer, design and user experience colleagues to develop a practical solution around the needs of PhD students, shaped by advice from Library and University Collections colleagues.

If you are familiar with the Memento standard, and/or follow Herbert Van de Sompel and Martin Klein’s work, you’ll be well aware of how widespread the challenge of web citations changing over time can be, and of the seriousness of the implications. The Internet Archive might be preserving all the (non-R-rated) gifs from Geocities, but without preserving government reports, ephemeral content, social media etc. we would be missing a great deal of the cultural record and, in terms of where our project comes in, crucial resources and artefacts in many modern scholarly works. If you are new to the issue of web archiving, I would recommend a browse of my notes from the IIPC Web Archiving Week 2017 and papers from the co-located RESAW 2017 conference.
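For the curious, Memento (RFC 7089) works over plain HTTP: a client asks a TimeGate for the capture of a URL closest to a desired datetime via the Accept-Datetime header. Here is a hedged sketch using Python's standard library; the aggregator endpoint shown is the Memento project's public TimeGate and should be treated as an assumption.

```python
import urllib.request

# Build a Memento TimeGate request: the Accept-Datetime header asks the
# gateway for the archived capture closest to the given RFC 1123 date.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"  # assumed endpoint

def timegate_request(url, when):
    return urllib.request.Request(TIMEGATE + url,
                                  headers={"Accept-Datetime": when})

# Opening this with urllib.request.urlopen() would follow the redirect to
# the closest memento; the response's final URL and its Memento-Datetime
# header identify the archived snapshot.
req = timegate_request("http://example.com/", "Thu, 01 Jun 2017 00:00:00 GMT")
```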

A huge part of the HiberActive project has been working with five postgraduate student interns to undertake interviews and usability work with PhD students across the University. My personal and huge thanks to Clarissa, Juliet, Irene, Luke and Shiva!


A preview of the HiberActive gif featuring Library Cat.

You can see the results of this work at our demo site, and we would love your feedback on what we’ve done. You’ll find an introductory page on the project as well as three tools for archiving websites and obtaining the appropriate information to cite – hence adopting the name one of our interviewees suggested, Site2Cite. We are particularly excited to have a tool which enables you to upload a Word or PDF document, have all URLs detected, and then get back a list of those URLs and the archived citable versions (as a CSV file).
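That upload-and-detect workflow could be sketched roughly as below. This is a hedged reconstruction, not Site2Cite's actual code: the URL regex is deliberately simplistic, and the lookup uses the Internet Archive's public Wayback availability endpoint rather than whatever archiving backend the project itself uses.

```python
import csv
import json
import re
import urllib.parse
import urllib.request

URL_RE = re.compile(r"https?://[^\s)\"'>]+")

def extract_urls(text):
    """Pull distinct URLs out of a document's text (simplistic regex)."""
    return sorted({u.rstrip(".,;") for u in URL_RE.findall(text)})

def wayback_lookup(url):
    """Return the closest archived snapshot URL from the Internet
    Archive's availability API, or '' if no capture is known."""
    api = ("https://archive.org/wayback/available?url="
           + urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest", {}).get("url", "")

def write_citation_csv(text, path):
    """Write url / archived-copy pairs, one CSV row per detected URL."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "archived_url"])
        for url in extract_urls(text):
            writer.writerow([url, wayback_lookup(url)])
```

Keeping the text-processing step (`extract_urls`) separate from the network call makes the detection logic easy to test on its own.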

Now that the project is complete, we are looking at what the next steps may be so if you’d find these tools useful for your own publications or teaching materials, we’d love to hear from you.  I’ll also be presenting this work at Repository Fringe 2017 later this week so, if you are there, I’ll see you in the 10×10 session on Thursday!

To bring HiberActive to life our students suggested something fun, and my colleague Jackie created a fun and informative gif featuring Library Cat, Edinburgh’s world famous sociable on-campus feline. Library Cat has also popped up in another EDINA ISG Innovation-funded project, Pixel This, which my colleagues James Reid and Tom Armitage have been working on. This project has been exploring how Pixel Sticks could be used around the University. To try them out properly I joined the team for a fun photography night in George Square, with a Pixel Stick loaded with images of notable University of Edinburgh figures. One of my photos from that night, featuring the ghostly image of the much missed Library Cat (1.0), went a wee bit viral over on Facebook:

James Reid and I have also been experimenting with Tango-capable phone handsets in the (admittedly daftly named) Strictly Come Tango project. Tango creates impressive 3D scans of rooms and objects and we have been keen to find out what one might do with that data, how it could be used in buildings and georeferenced spaces. This was a small exploratory project but you can see a wee video on what we’ve been up to here.

In addition to these projects I’ve also been busy with continuing involvement in the Edinburgh Cityscope project, which I sit on the steering group for. Cityscope provided one of our busiest events for this spring’s excellent Data Fest. You can read more about EDINA’s participation in this exciting new event around big data, data analytics and data-driven innovation here.

I have also been working on two rather awesome Edinburgh-centric projects. Curious Edinburgh officially launched for Android, and released an updated iOS app, for this year’s Edinburgh International Science Festival in April. The app includes History of Science, Medicine, Geosciences and Physics tours, plus a brand new Biotechnology tour that lets you explore Edinburgh’s fantastic scientific legacy. The current PTAS-funded project is led by Dr Niki Vermeulen (Science, Technology & Innovation Studies), with tours written by Dr Bill Jenkins, and will see the app used in teaching around 600 undergraduate students this autumn. If you are curious about the app (pun entirely intended!), visiting Edinburgh – or just want to take a long distance virtual tour – do download the app, rate and review it, and let us know what you think!


A preview of the new Curious Edinburgh History of Biotechnology and Genetics Tour.

The other Edinburgh project which has been progressing at a pace this year is LitLong: Word on the Street, an AHRC-funded project which builds on the prior LitLong project to develop new ways to engage with Edinburgh’s rich literary heritage. Edinburgh was the first city in the world to be awarded UNESCO City of Literature status (in 2008) and there are huge resources to draw upon. Prof. James Loxley (English Literature) is leading this project, which will be showcased in some fun and interesting ways at the Edinburgh International Book Festival this August. Keep an eye on for updates or follow @litlong.

And finally… Regular readers here will be aware that I’m Convener for eLearning@ed (though my term is up and I’ll be passing the role onto a successor later this year – nominations welcomed!), a community of learning technologists and academic and support staff working with technologies in teaching and learning contexts. We held our big annual conference, eLearning@ed 2017: Playful Learning this June and I was invited to write about it on the ALTC Blog. You can explore a preview and click through to my full article below.

Playful Learning: the eLearning@ed Conference 2017

Phew! So, it has been a rather busy few months for me, which is why you may have seen slightly fewer blog posts and tweets from me of late…

In terms of the months ahead there are some exciting things brewing… But I’d also love to hear any ideas you may have for possible collaborations as my EDINA colleagues and I are always interested to work on new projects, develop joint proposals, and work in new innovative areas. Do get in touch!

And in the meantime, remember to book those tickets for my CODI 2017 show if you can make it along on 11th August!

Jul 042017

Today I am again at the Mykolo Romerio Universitetas in Vilnius, Lithuania, for the European Conference on Social Media 2017. As usual this is a liveblog so additions, corrections etc. all welcome… 

Keynote presentation: Daiva Lialytė, Integrity PR, Lithuania: Practical point of view: push or pull strategy works on social media 

I attended your presentations yesterday, and you are going so far into detail in social media. I am a practitioner and we can’t go into that same sort of depth because things are changing so fast. I have to confess that a colleague, a few years ago, suggested using social media and I thought “Oh, it’s all just cats” and I wasn’t sure. But it was a big success, we have six people working in this area now. And I’m now addicted to social media. In fact, how many times do you check your phone per day? (various guesses)…

Well, we are checking our smartphones 100-150 times per day. And some people would rather give up sex than smartphones! And we have this constant flood of updates and information – notifications that pop up all over the place… And there are a lot of people, organisations, brands, NGOs, etc. all want our attention on social media.

So, today, I want to introduce three main ideas here as a practitioner and marketer…

#1 Right Mindset

Brands want to control everything, absolutely everything… The colour, the font, the images, etc. But now social media says that you have to share your brand in other spaces, to lose some control. And I want to draw on Paul Holmes, a PR expert (see and he says when he fell in love with social media, there were four key aspects:

  • Brands (in)dependency
  • Possibilities of (non)control
  • Dialogue vs monologue
  • Dynamic 24×7

And I am going to give some examples here. So Gap, the US fashion brand, they looked at updating their brand. They spent a great deal of money to do this – not just the logo but all the paperwork, branded items, etc. They launched it, it went to the media… And it was a disaster. The Gap thought for a few days. They said “Thank you brand lover, we appreciate that you love our brand and we are going to stick with the old one”. And this raises the question of to whom a brand belongs… Shareholders or customers? Perhaps now we must think about customers as owning the brand.

Yesterday I saw a presentation from Syracuse on University traditions – and some of the restrictions of maintaining brand – but in social media that isn’t always possible. So, another example… Lagerhaus (like a smaller scale Ikea). They were launching a new online store and wanted to build community (see videos), so they targeted six interior design blogs and created “pop up online stores” – bloggers could select products from the store’s range and promote them as they liked. That gained media attention and Facebook likes for the store’s Facebook page. And there was then an online store launch, with invitees approached by bloggers, and their pop up stores continue. So this is a great example of giving control to others, and building authentic interest in your brand.

In terms of dialogue vs monologue I’d quote from Michael Dell here, on the importance of engaging in honest, direct conversations with customers and stakeholders. This is all great… But the reality is that many who talk about this never actually do it… Indeed some just shut down spaces when they can’t engage properly. However, Dell has set up a social media listening and command centre. 22k+ posts are monitored daily, engaging 1000+ customers per week. This is tightly integrated with the @dellcares Twitter/Facebook team. And they have managed to convert “ranters” to “ravers” in 30% of cases. And there has been a decrease in negative commentary since engaging in this space. Posts need quick responses – a few minutes, or hours, are great; longer and it becomes less and less useful…

Similarly we’ve seen Scandinavian countries and banks engaging, even when they have been afraid of negative comments. And this is part of the thing about being part of social media – the ability to engage in dialogue, to be part of and react to the conversations.

Social media is really dynamic, 24×7. You have to move fast to take advantage. So, Lidl… They heard about a scandal in Lithuania about the army paying a fortune for spoons – some were €40 each. So Lidl ran a promotion about being able to get everything, including spoons, cheaper there. It was funny, clever, creative and worked well.

Similarly Starbucks vowing to hire 10,000 refugees in the US (and now in EU) following Trump’s travel ban, that was also being dynamic, responding quickly.

#2 Bold Actions

When we first started doing social media… we faced challenges… Because the future is uncertain… So I want to talk about several social media apps here…

Google+ launched claiming to be bigger than Facebook, to do it all better. Meanwhile WhatsApp… Did great… But disappearing as a brand, at least in Lithuania. SnapChat has posts disappearing quickly… Young people love it. The owner has said that it won’t be sold to Facebook. Meanwhile Facebook is trying desperately to copy functionality. We have clients using SnapChat, fun but challenging to do well… Instagram has been a big success story… And it is starting to be bigger than Facebook in some demographics.

A little history here… If you look at a world map of social networks from December 2009, we see quite a lot of countries having their own social networks which are much more popular. By 2013, it’s much more Facebook, but there are still some national social media networks in Lithuania or Latvia. And then by 2017 we see in Africa uptake of Twitter and Instagram. Still a lot of Facebook. My point here is that things move really quickly. For instance young people love SnapChat, so we professionally need to be there too. You can learn new spaces quickly… But it doesn’t matter as you don’t have to retain that for long, everything changes fast. For instance in the US I have read that Facebook is banning posts by celebrities where they promote items… That is good, that means they are not sharing other content…

I want to go in depth on Facebook and Twitter. Of course the most eminent social media platform is Facebook. They are too big to be ignored. 2 billion monthly active Facebook users (June 2017). 1.28 billion people log onto Facebook daily. 83 million fake profiles. Users aged 25 to 34 are the biggest age group, at 29.7% of users. For many people Facebook is the first thing they check in the morning when they wake up. And 42% of marketers report that Facebook is very important to their business. And we now have brands approaching us to set up a Facebook presence no matter what their area of work.

What Facebook does well is very precise targeting – the more precise the more you pay, but that’s ok. So that’s based on geolocation, demographic characteristics, social status, interests, even real time location. That works well but remember that there are 83 million fake profiles too.

So that’s push, what about pull? Well there are the posts, clicks, etc. And there is Canvas – which works for mobile users, story driven ads (mini landing), creative story, generating better results and click through rates. (We are watching a Nespresso mobile canvas demo.) Another key tool is Livestream – free of charge, with notifications for your followers, and it’s live discussion. But you need to be well prepared and tell a compelling story to make proper use of this. But you can do it from anywhere in the world. For instance one time I saw a livestream of Barack Obama’s farewell – that only had 15k viewers though, so it’s free but you have to work to get engagement.

No matter which tool, “content is king!” (Bill Gates, 1996). Clients want us to create good stories here but it is hard to do… So what makes the difference? The Content Marketing Institute (US), 2015, suggests:

  1. Content
  2. Photos
  3. Newsletters
  4. Video
  5. Article
  6. Blogs
  7. Events
  8. Infographics
  9. Mobile applications
  10. Conferences and Livestreams

So, I will give some examples here… I’ll show you the recent winner of Cannes Lions 2017 for social media and digital category. This is “Project Graham” – a public driver safety campaign about how humans are not designed to survive a crash… Here is how we’d look if we were – this was promoted heavily in social media.

Help for push from Facebook – well the algorithms prioritise content that does well. And auctions to reach your audience mean that it is cheaper to run good content that really works for your audience.

And LinkedIn meanwhile is having a renaissance. It was quite dull, but they changed their interface significantly a few months back, and now we see influencers (in Lithuania) using LinkedIn, sharing content there. For instance lawyers have adopted the space. Some were predicting LinkedIn would die, but I am not so sure… It is the biggest professional social network – 467 million users in 200 countries. And it is the biggest network of professionals – a third of professionals have a LinkedIn profile. Users spend 17 minutes per day, 40% use it every day, 28% of all internet users use LinkedIn. And it is really functioning as a public CV, for recruitment, and for ambassadorship – you can share richer information here.

I wanted to give a recent example – it is not a sexy looking case study – but it worked very well. This was work with Ruptela, a high tech company that provides fleet management based on GPS tracking and real-time vehicle monitoring and control. They needed to rapidly hire 15 new sales representatives via social media. That’s a challenge as young people – especially in the IT sector – are leaving Lithuania or working in Lithuania-based expertise centres for UK, Danish, etc. brands.

So we ran a campaign, on a tiny budget (incomparable with headhunters for instance), around “get a job in 2 days” and successfully recruited 20 sales representatives. LinkedIn marketing is expensive, but very targeted and much cheaper than you’d otherwise pay.

#3 Right Skills

In terms of the skills for these spaces:

  • copywriter (for good storytelling)
  • visualist (graphics, photo, video)
  • community manager (to maintain appropriate contact) – the skills for that cannot be underestimated.
  • And… Something that I missed… 

You have to be like a one man band – good at everything. But then we have young people coming in with lots of those skills, and can develop them further…

So, I wanted to end on a nice story/campaign… An ad for Budweiser about not drinking and driving.


Q1) Authenticity is the big thing right now… But do you think all that “authentic” advertising content may get old and less effective over time?

A1) People want to hear from their friends, from people like them, in their own words. Big brands want that authenticity… But they also want total control which doesn’t fit with that. The reality is probably that something between those two levels is what we need but that change will only happen as it becomes clear to big brands that their controlled content isn’t working anymore.

Q2) With that social media map… What age group was that? I didn’t see SnapChat there.

A2) I’m not sure, it was a map of dominant social media spaces…

Q3) I wanted to talk about the hierarchy of content… Written posts, visual content etc… What seemed to do best was sponsored video content that was subtitled.

A3) Facebook itself prioritises video content – it is cheaper to use this in your marketing. If you do video, yes, you have to have subtitles so that people can read rather than listen to the videos… And with videos, especially “authentic video”, that will be heavily prioritised by Facebook. So we are doing a lot of video work.

Introduction to ECSM 2018 – Niall Corcoran, Limerick Institute of Technology, Ireland

I wanted to start by thanking our hosts this year, Vilnius has been excellent. Next year we’ll be a bit earlier in the year – late June – and we’ll be at the Limerick Institute of Technology, Ireland. We have campuses around the region with 7000 students and 650 staff, teaching from levels 6 to 10. The nearest airport is Shannon, or we are an easy distance from Cork or Dublin airports.

In terms of social media we do research on the Social Media Interactive Learning Environment, the Limerick Interactive Storytelling Network, social media for teaching and research, and social media for cancer recovery.

In terms of Limerick itself, 80-90% of Europe’s contact lenses are manufactured there! There is a lot of manufacturing in Limerick, with many companies having their European headquarters there. So, I’ve got a short video made by one of our students to give you a sense of the town. And we hope to see you there next year!

Social Media Competition Update

The top three placed entries are: Developing Social Paleontology – Lisa Lundgren; EDINA Digital Footprint Consulting and Training Service – Nicola Osborne (yay!); Traditions Mobile App – Adam Peruta.

Stream A: Mini track on Ethical use of social media data – Chair: Dragana Calic

The Benefits and Complications of Facebook Memorials – White Michelle, University of Hawai’i at Manoa, USA

I wanted to look at who people imagine are their audience for these memorials. And this came about because a death made me look at this, and I decided to look into it in more depth.

So, I’m using danah boyd’s definition of social networking here. We are talking Facebook, Twitter, SnapChat etc. So, a Facebook Memorial is a group that is created specifically to mark the death of a friend or family member – or of a public figure (e.g. Michael Jackson).

Robert Zebruck and Brubecker talk about imagined audience as the flattening of realities. So, right now I can see people in the room, I can see who you are, how you react, how to modify my tone or style to meet you, to respond to you. But it is hard to do that on social media. We see context collapse. And we can be sat there alone at our computer and not have that sense of being public. Sometimes with memorials we will say things for that audience, but in other cases perhaps it is sharing memories of drinking together, or smoking weed with someone… Memories that may jar with others.

It was a long road to get to this research. My review board were concerned about emotional distress of interviewees. I agreed in the end to interview via Skype or Facebook and to check everything was ok after every question, to make it easier to see and review their state of mind. I had to wait over a year to interview people, the death had to not be by suicide, and the participants had to be over 18 years old. So I did conduct qualitative research over Skype and Facebook… And I found interviewees by looking at memorial pages that are out there – there are loads there, not all labelled as memorials.

So, my data… I began by asking who people thought they were talking to… Many hadn’t thought about it. They talked about family members, friends… Even in a very controlled group you can have trolls and haters who can get in… But often people assumed that other people were like them. A lot of people would write to the deceased – as if visiting a grave, say. I asked if they thought the person could hear or understand.. But they hadn’t really thought about it, it felt like the right thing to do… And they wanted family and friends to hear from them. They felt likes, shares, etc. were validating and therapeutic, and that sense of connection was therapeutic. Some even made friends through going out drinking, or family gatherings… with friends of friends who they hadn’t met before…

This inability to really think about or understand the imagined audience led to context collapse. Usually family is in charge of these pages… And that can be challenging… For instance an up and coming football star died suddenly, and then it was evident that it was the result of a drug overdose… And that was distressing for the family who tried to remove that content. There is an idea of alternative narratives. Fake news or alternative facts has a particular meaning right now… But we are all used to presenting ourselves in a particular way to different friends, etc. In one memorial site the deceased had owed money to a friend, and they still felt owed that money and were posting about that – like a fight at the funeral… It’s very hard to monitor ourselves and other people…

And there was fighting about who owned the person… Some claiming that someone was their best friend, fights over who was more important or who was more influential. It happens in real life… But not quite as visibly or with all involved…

So, in conclusion… There are a lot of benefits to Facebook Memorials. Psychologists talk of the benefit of connecting, grieving, not feeling alone, getting support. Death happens. We are usually sad when it happens… Social networking sites provide another way to engage and connect. So if I’m in Lithuania and there is a funeral in Hawaii that I can’t travel to, I can still connect. It is changing our social norms, and how we connect. But we can do more to make it work better – safety and security need improving. Facebook have now added the ability to will your page to someone. And now if someone dies you can notify Twitter – it changes the account slightly, birthday reminders no longer pop up, it acts as a memorial. There are new affordances.

Personally, doing this research was very sad, and it’s not an area I want to continue looking at. It was emotionally distressing for me to do this work.


Q1) I am old enough to remember LiveJournal and remember memorials there. They used to turn a page into a memorial, then were deleted… Do you think Facebook should sunset these memorials?

A1) I personally spoke to people who would stare at the page for a month, expecting posts… Maybe you go to a funeral, you mourn, you are sad… But that page sticking around feels like it extends that… But I bet Mark Zuckerberg has some money making plan for keeping those profiles there!

Q2) What is the motivation for such public sharing in this way?

A2) I think young people want to put it out there, to share their pain, to have it validated – “feel my pain with me”. One lady I spoke to, her boyfriend was killed in a mass shooting… Eventually she couldn’t look at it, it was all debate about gun control and she didn’t want to engage with that any more…

Q3) Why no suicides? I struggle to see why they are automatically more distressing than other upsetting deaths…

A3) I don’t know… But my review board thought it would be more distressing for people…

Q4) How do private memorials differ from celebrity memorials?

A4) I deliberately avoided celebrities, but also my IRB didn’t want me to look at any groups without permission from every member of that group…

Comment) I’ve done work with public Facebook groups, my IRB was fine with that.

A4) I think it was just this group really… But there was concern about publicly identifiable information.

Online Privacy: Present Need or Relic From the Past? – Aguirre-Jaramillo Lina Maria, Universidad Pontificia Bolivariana, Colombia

In the influential essay The Right to Privacy, in the Harvard Law Review (1890), Warren and Brandeis defined privacy as “the right to be let alone”. But in the last ten years or so we now see sharing of information that not long ago would have been seen and expected to be private. Earl Warren was a famous US judge and he said “The fantastic advances in the field of electronic communication constitute a greater danger to the privacy of the individual.”

We see privacy particularly threatened by systematic data collection. Mark Zuckerberg (2010) claims “Privacy is no longer a social norm”. This has been used as evidence of disregard toward users’ rights and data – the manner in which data is stored, changed and used, and the associated threats. But we also see counter arguments such as the American Library Association’s Privacy Revolution campaign.

So, this is the context for this work in Colombia. It is important to understand the literature in this area, particularly around data use, data combinations, and the connection between privacy concerns and behaviours online (Joinson et al 2008). And we also refer to the work of Sheehan (2002) in the characterisations of online users. Particularly we are interested in new privacy concerns and platforms, particularly Facebook. The impact of culture on online privacy has been studied by Cho, Rivera Sanchez and Lim (2009).

The State of the Internet from the OII found that Colombia had between 40 and 60% of people online. Internet uptake is, however, lower than in e.g. the US. And in Colombia 46% of our population is aged 25-54.

So, my study is currently online. A wider group is also engaging in personal and group interviews. Our analysis will focus on what background knowledge, risk and privacy awareness there is amongst participants; what self-efficacy level is revealed by participants – their knowledge and habits; and what interest and willingness there is to acquire more knowledge and gain more skills to manage privacy. At a later stage we will be building a prototype tool.

Our conclusions so far… Privacy is hard to define and we need to do more to define it. Privacy is not a concept articulated in a single universally accepted definition. Different groups trade off privacy differently. Relevant concepts here include background knowledge, computer literacy, privacy risk, and self efficacy.

And finally… Privacy is still important but often ignored as important in the wider culture. Privacy is not a relic but a changing necessity…


Q1) Did age play a role in privacy? Do young people care as much as older people?

A1) They seem to care when they hear stories of peers being bullied, or harassed, or hear stories of hacked Instagram accounts. But their idea of privacy is different. But there is information that they do not want to have public or stolen. So we are looking more at that, and also at a need to understand how they want to engage in privacy. As my colleague Nicola Osborne from Edinburgh said in her presentation yesterday, we have to remember students already come in with a long internet/social media history and presence.

Q2) I was wondering about cultural aspect… Apps used and whether privacy is important… For instance SnapChat is very exhibitionist but also ephemeral…

A2) I don’t have full answers yet but… Young people share on SnapChat and Instagram to build popularity with peers… But almost none of them are interested in Twitter… At least that’s the case in Colombia. But they do know some content on Facebook may be more vulnerable than on SnapChat and Instagram… It may be that they have the idea of SnapChat as a space they can control perhaps…

Q3) I often feel more liberal with what I share on Facebook, than students who are 10 or 15 years younger… I would have some privacy settings but don’t think about the long story of that… From my experience students are a lot more savvy in that way… When they first come in, they are very aware of that… Don’t want a bigger footprint there…

A3) That is not exactly true in Colombia. The idea of a Digital Footprint affecting their career is not a thing in the same way… They are just becoming aware of it… But that idea of exhibitionism… I have found that most of the students in Colombia seem quite happy to share lots of selfies and images of their feet… That became a trend in other countries about three years ago… They don’t want to write much… Just to say “I’m here”… And there has been some interesting research in terms of the selfie generation and ideas of expressing yourself and showing yourself… May be partly to do with other issues… In Colombia many young women have plastic surgery – this came out of the 1980s and 1990s… Many women, young women, have cosmetic surgery and want to share that… More on Instagram than Pinterest – Pinterest is for flowers and little girlie things…

Q4) You were talking about gender, how do privacy attitudes differ between males and females?

A4) The literature review suggests women tend to be more careful about what they publish online… They may be more careful selecting networks and where they share content… More willing to double check settings, and to delete content they might have difficulty explaining… Also more willing to discuss issues of privacy… Things may change over time… Suggestion that people will get to an age where they do care more… But we also need to see how the generation that have all of their images online, even from being a baby, will think about this… But generally seems to be slightly more concern or awareness from women…

Comment) I wanted to just follow up the Facebook comment and say that I think it may not be age but experience of prior use that may shape different habits there… Students typically arrive at our university with hundreds of friends having used Facebook since school, and so they see that page as a very public space – in our research some students commented specifically on that and their changing use and filtering back of Facebook contacts… For a lot of academics and mid career professionals Facebook is quite a private social space, Twitter plays more that public role. But it’s not age per se perhaps, it’s that baggage and experience.

Constructing Malleable Truth: Memes from the 2016 U.S. Presidential Campaign – Wiggins Bradley, Webster University, Vienna, Austria

Now, when I wrote this… Trump was “a candidate”. Then he was nominee. Then president elect… And now President. And that’s been… surprising… So that’s the context.

I look at various aspects in my research, including internet memes. So, in 2008 Obama’s campaign was great at using social media, at getting people out there sharing and campaigning for them on a voluntary and enthusiastic basis. 2016 was the meme election I think. Now people researching memes feel they must refer to Richard Dawkins talking about memes. He meant ideas… That’s not the same as internet memes… So what are the differences between Dawkins’ memes and internet memes? Well honestly they are totally different EXCEPT that they require attention, and have to be reproducible…

Mikhail Bakhtin wrote about the Carnivalesque as something that subverts the dominant mode or perspective, it turns the world on its head… The king becomes the jester and the jester becomes the king. So the Trump tie memes… We need no text here, the absurd is made more absurd. It is very critical. It has that circus level laugh… He’s a clown or a buffoon… You know about it and how to reproduce this.

In terms of literature… There is work on memes, but I think to understand memes with millennials, but also baby boomers, even people in their 70s and 80s… We have to go back to major theorists, concepts and perspectives – Henry Jenkins, Erving Goffman, etc. This is a new mode of communication I think, not a new language, but a new mode.

So method wise… I wanted to do a rhetorical-critical analysis of selected internet memes from the Facebook page Bernie Sanders’ Dank Meme Stash, which had over 420k members when I wrote this slide – more now. It was founded by a college student in October 2015. And there are hundreds of thousands of memes there. People create and curate them.

Two months before and one month after the US Election I took two sets of samples… Memes that received 1000 or more likes/reactions. And memes that received at least 500 likes/reactions and at least 100 shares. As an unexpected side note I found that I needed to define “media narrative”. There doesn’t seem to be a good definition. I spoke to Brooke Gladstone of WNYC, I spoke with colleagues in Vienna… We don’t usually take time to think about media narrative… For instance the shooting at Pulse Nightclub has a narrative on the right around gun control, for others it’s around it being a sad and horrible event…

So, media narrative I am defining as:

  1. Malleable depending upon the ability to ask critical questions
  2. Able to shape opinion as well as perceptions of reality and a person’s decision-making process and…
  3. Linguistic and image-based simulations of real-world events which adhere and/or appeal to ontologically specific perspectives, which may include any intentional use of stereotyping, ideology, allegory, etc.

Some findings… The relational roles between image and text are interchangeable because of the relationship to popular culture. Barthes (1977) talks about the text loading the image, burdening it with culture, a moral, an imagination. And therefore the text in internet memes fluctuates depending on the intended message and the dependence on popular culture.

So, for instance we have an image from Nightmare at 20,000 ft, a classic Twilight Zone image… You need to know nothing here and if I replace a monster face with Donald Trump’s face… It’s instantly accessible and funny. But you can put any image there depending on the directionality of the intended meaning. So you have the idea of the mytheme or function of the monster/devil/etc. can be replaced by any other monster… It doesn’t matter, the reaction will depend on your audience.

Back to Barthes (1977) again, I find him incredibly salient to the work I’ve done here. One thing emerging from this and Russian memes work done before is the idea of polysemic directionality. The meme has one direction and intentionality… No matter what version of this image you use…

So, here’s a quick clip from The Silence of the Lambs. And here is Buffalo Bill, who kills women and skins them… A very scary character… We have him being a disturbing advisor in memes. If you get that reference it has more weight, but you don’t need to know the reference.

We have the image of Hillary as Two Face, we have Donald as The Joker… And a poster saying “Choose”. The vitriol directed at Clinton was far worse than that at Trump… Perhaps because Sanders supporters were disappointed at not getting the nomination.

We have intertextuality, we also have inter-memetic references… For example the Hillary-deletes-the-electoral-college meme which plays on the Grandma-on-the-internet memes… You also have Superman vs Trump – particularly relevant to immigrant populations (Jenkins 2010).

So, conclusions… The construction of a meme is affected by and dependent on the media around it… That is crucial… We have heard about fake news, and we see memes in support of that fake news… And you may see that on all sides here. Intertextual references rely on popular culture, and inter-memetic references assume knowledge – a new form of communication. And I would argue that memes are a digital myth – I think Lévi-Strauss might agree with me on that…

And to close, for your viewing pleasure, the Trump Executive Order meme… The idea of a meme, an idea that can be infinitely replaced with anything really…


Q1) This new sphere of memes… Do you think that Trump represents a new era of presidency… Do you think that this will pass? With Trump posting to his own Twitter account…

A1) I think that it will get more intense… And offline too… We see stickers in Austrian elections with meme-like images… These are tools for millennials. They are hugely popular in Turkey… Governments in Turkey, Iran and China are using memes as propaganda against other parties… I’m not sure it’s new but we are certainly more aware of it… Trump is a reality TV star with the nuclear keys… That should scare us… But memes won’t go away…

Q2) In terms of memes in real life… What about bumper stickers… ? They were huge before… They are kind of IRL memes…

A2) I am working on a book at the moment… And one of the chapters is on pre-digital memes. In WWII soldiers used to write “Kilroy was here”. Is Magritte’s Ceci n’est pas une pipe a meme? There is definitely a legacy of that… So yes, but it depends on national and regional context…

Q3) So… In Egypt we saw memes about Trump… We were surprised at the election outcome… What happened?

A3) Firstly, there is that bias of the reinforcing narrative… If you looked at the Sanders meme page you might have had the idea that Clinton would not win because, for whatever reason, these people hated Hillary. Real rage and hatred towards her… And Trump as clown Hitler… Won’t happen… Then it did… Then rage against him went up… After the Muslim ban, the women’s march etc…

Q4) There are some memes that seem to be everywhere – Charlie and the Chocolate Factory, Sean Bean, etc… Why are we picking those specific particular memes of all things?

A4) Like the Picard WTF meme… Know Your Meme is a great resource… In the scene that Picard image is from he’s reciting Shakespeare to keep Lwaxana Troi away from the aliens… It doesn’t matter… But it just fits, it has a meaning.

Q5) Gender and memes: I wondered about the aspect of gender in memes, particularly thinking about Clinton – many of those reminded me of the Mary Beard memes and trolling… There are trolling memes – the frog for Trump… the semi-pornographic memes against women… Is there more to that than just (with all her baggage) Clinton herself?

A5) Lisa Silvestri from Gonzaga, Washington State and Limor Shifman in Tel Aviv do work in that area. Shifman looks at online jokes of all types and has done some work on gender.

Q6) Who makes memes? Why?

A6) I taught a course on internet memes and cultures. That was one of the best attended courses ever. My students concluded that the author didn’t matter… But look at 4Chan and Reddit or Know Your Meme… And you can tell who created it… But does that matter… It’s almost a public good. Who cares who created the Trump tie meme? With the United Airlines incident you can see that video, it turned into a meme… and the company lost millions in stock value.

Stream B: Mini track on Enterprise Social Media – Chair: Paul Alpar

The Role of Social Media in Crowdfunding – Makina Daniel, University of South Africa, Pretoria, South Africa

My work seeks to find the connection between social media and finance, specifically crowdfunding. The paper introduces the phenomenon of crowdfunding, and how the theory of social networking underpins social media. The theory around social media is still developing… It is underpinned by theories of information systems and technology adoption, though social media has different characteristics.

So, a definition of crowdfunding. Crowdfunding is essentially an aspect of crowdsourcing, spurred by ubiquitous web 2.0 technologies. And “Crowdfunding refers to the efforts of entrepreneurial individuals and groups – cultural, social and for-profit – to fund their ventures by drawing on relatively small contributions from a relatively large number of individuals using the internet, without standard financial intermediaries” (Mollick 2014).

Since 2010 there have been growing amounts of money raised globally through crowdfunding. Forbes estimates $34 billion in 2015 (compared to $16 billion in 2014, and $880 million in 2010). The World Bank estimates that crowdfunding will raise $93 billion annually by 2025. This growth couldn’t be achieved in the absence of internet technology, and social media are critical in promoting this form of alternative finance.

Cheung and Lee (2010) examined social influence processes in determining collective social action in the context of online social networks. Their model shows intentional social action, with users considering themselves part of the social fabric. And they explain three processes of social influence: subjective norm – the self outside of any group; group norm – self-awareness as a member of a group; and social identity – the self in context. Other authors explain social media’s popularity as a lack of trust in traditional media, with people wary of information that emanates from people they do not know personally. Kaplan and Haenlein (2010) define social media as “a group of internet-based applications that build on the ideological and technological foundations of web 2.0 applications that allow the creation and exchange of user generated content”. So it is a form of online interaction that enables people to create, comment, share and exchange content with other people.

So, how does social media facilitate finance, or crowdfunding? Since social media assists in maintaining social ties, this should in turn aid the facilitation of crowdfunding campaigns. Drawing on Linus’s Law – “given enough eyeballs, all bugs are shallow” – large groups are more adept at detecting potential flaws in a campaign than individuals (alone), thus preventing fraudulent campaigns from raising money for crowdfunding projects. Facebook, Twitter, etc. provide spaces for sharing and connection and are therefore suitable for crowdfunding campaigns. Studies have shown that 51% of Facebook users are more likely to buy a product after becoming a fan of the product’s Facebook page (Knudsen 2015).

Brossman (2015) views crowdfunding as existing in two phases: (i) brand awareness and (ii) targeting people to support/back one’s campaign. Crowdfunding sites such as Kickstarter and IndieGoGo allow project creators to publish pertinent information and updates, as well as to link to social media. Those connections help deal with the relative lack of social networking functionality within the platforms themselves, where creators can post project descriptions, draw on a community of users and utilise web 2.0 technologies that allow users to comment on projects and attract money.

A study by Moisseyev (2013) of 100 Kickstarter projects found a connection between social media approval and success in funding. Mollick (2014) observed that crowdfunding success is associated with having a large number of friends in online social networks: a founder with ten Facebook friends would have a 9% chance of succeeding; one with 100 friends a 20% chance; one with 1000 friends a 40% chance. He cited a film industry example where more friends mapped to much higher potential success rates.

So, in conclusion, we don’t have many studies in this area yet. But social media is observed to aid crowdfunding campaigns through its ability to network disparate people through the internet. One notable feature is that although there are many forms of social media, crowdfunding utilises a limited number of spaces, primarily Facebook and Twitter. Future research should examine how the expertise of the creator (requestor of funds), project type, social network, and online presence influence motivations.


Q1) I was wondering if you see any connection between the types of people who back crowdfunding campaigns, and why particular patterns of social media use, or popularity, are being found. For instance, anecdotally, the people who back lots of crowdfunding campaigns – not just one-off – tend to be young men in their 30s and 40s. So I was wondering about that profile of backers, and if that profile is part of what makes those social media approaches work.

A1) The majority of people using social media are young people… But young people as sources of finance for, say, small businesses… They are mainly likely to be either studying or starting a professional career… Not accumulating money to give out… So we see a disconnect between who is on social media – on Twitter, Facebook, etc. – and who can give… You are successful in raising funding from people who cannot give much… One would expect that if people in mid career were using the most social media, we would see more money coming from crowdfunding… One aspect of crowdfunding is that you are asking for small amounts… And young people are able to spare that much…

Q2) So most people giving funding on crowdfunding sites are young people, and they give small amounts…

A2) Yes… And that data from Mollick… combined with evidence of people who are using Facebook…

Q2) What about other specialised crowdfunding networks… ?

A2) There is more work to be done. But even small crowdfunding networks will connect to supporters through social media…

Q3) Have you looked at the relative offerings of the crowdfunding campaigns?

A3) Yes, technology products are more successful on these platforms than other projects…

Using Enterprise Social Networks to Support Staff Knowledge Sharing in Higher Education – Corcoran Niall, Limerick Institute of Technology, Ireland and Aidan Duane, Waterford Institute of Technology, Ireland

This work is rooted in knowledge management; this is the basis for the whole study. So I wanted to start with Ikujiro Nonaka: “in an economy where the only certainty is uncertainty… ” And Lew Platt, former CEO of Hewlett-Packard, said “If HP knew what HP knows it would be three times more productive” – highlighting the crucial role of knowledge sharing.

Organisations can gain competitive advantage through encouraging and promoting knowledge sharing – that’s the theory at least. It’s very important in knowledge-intensive organisations, such as public HEIs. HEIs need to compete in a global market place… We need to share knowledge… Do we do this?

And I want to think about this in the context of social media. We know that social media enables the creation, sharing or exchange of information, ideas and media in virtual communities and networks. And organisational applications are close to some of the ideals of knowledge management: supporting group interaction towards establishing communities; enabling creation and sharing of content; helping improve collaboration and communication within organisations; and offering distinct technological features that are ideally suited for knowledge sharing. Social media represents a fundamental disruption in knowledge management, and is reinvigorating it as a field.

We do see Enterprise Social Networks (ESN). But if you just bring one into an organisation, people don’t necessarily just go and use it. People need a reason to share. So another aspect is communities of practice (Lave and Wenger 1991), an important knowledge management strategy, increasingly used. This is about groups of people who share a passion for something – loose and informal social structures, largely voluntary, and about sharing tacit knowledge. So Communities of Practice (CoP) tend to meet from time to time – in person or virtually.

ESN can be used to create virtual communities. This is particularly suitable for distributed communities – our university has multiple campuses for instance.

So, knowledge sharing in HEIs… Well many don’t do it. A number of studies have shown that KM implementation and knowledge sharing in HEIs is at a low level. Why? Organisational culture, organisational structures, bureaucratic characteristics. And there is a well documented divide/mistrust between faculty and support staff (silos) – particularly in work from Australia, the US and the UK. So, can CoP and ESN help? Well in theory they can bypass the structures that reinforce silos. That’s an ideal picture; whether we get there is a different thing.

So our research looked at what the antecedents for staff knowledge sharing are, and what the dominant problems in the implementation of ESN and CoP are. The contextual setting here is Limerick Institute of Technology. I used to work in IT services and this work came significantly from that interest. There is a significant practical aspect to the research so action research seemed like the most appropriate approach.

So we had a three cycle action research project. We looked at Yammer. It has all the features of social networking you’d expect – can engage in conversations, tagged, shared, can upload content. It lends itself well to setting up virtual communities, very flexible and powerful tools for virtual communities. We started from scratch and grew to 209 users.

Some key findings… We found culture and structure are major barriers to staff knowledge sharing. We theorised that and found it to be absolutely the case. The faculty/staff divide in HEIs exacerbates the problem. Management have an important role to play in shaping a knowledge sharing environment. The existence of CoP is essential to build a successful knowledge sharing environment, and community leaders and champions are required for the ESN. Motivation to participate is also crucial. If people feel motivated, and they see benefit, that can be very helpful. And those benefits can potentially lead to culture change, which then affects motivation…

We found that our organisation has a strong hierarchical model. Very bureaucratic and rigid. Geographic dispersal doesn’t help. To fix this we need to move from a transactional culture. The current organisational structure contributes to the faculty/staff divide, and limits opportunities and motivations for staff and faculty to work together. But we also found examples where they were working well together. And in terms of the role of management, they have significant importance, and have to be involved to make this work.

Virtual communities as a Knowledge Management strategy have the potential to improve collaboration and interaction between staff, but this has to be seen as valued, relevant, a valid work activity. Staff motivation wise there are some highly motivated people, but not all. Management have to understand that.

So management need to understand the organisational culture; recognise the existence of structural and cultural problems; etc. Some of the challenges here are the public sector hierarchical structures – public accountability, media scrutiny, transactional culture, etc.


Q1) On a technical level, which tools are most promising for tacit knowledge sharing…

A1) The whole ability to have a conversation. Email doesn’t work for that, you can’t branch threads… That is a distinctive feature of Yammer groups, to also like/view/be onlookers in a conversation. We encourage people to like something if they read it, to see that it is useful. But the ability to have a proper conversation, and organised meetings and conversations in real time.

Q2) What kind of things are they sharing?

A2) We’ve seen some communities that are large, they have a real sense of collaboration. We’re had research coming out of that, some really positive outcomes.

Q3) Have you seen any evidence of use in different countries… What are barriers across different regions, if known?

A3) I think the barriers are similar to the conceptual model (in the proceedings) – both personal and organisational barriers… People are largely afraid to share stuff… They are nervous of being judged… Also that engagement on this platform might make managers think that they are not working. Age is a limiting factor – economic issues mean we haven’t recruited new staff for almost 10 years, so we are older as a staff group.

Q3) It might be interesting to compare different cultures – Asian cultures are more closed, I think…

A3) Yes, that would be really interesting to do…

Q4) I am trying to think how and what I might share with my colleagues in professional services, technical staff, etc.

A4) The way this is constructed is in communities… We have staff interested in using Office 365 and Classroom Notebook, and so we set up a group to discuss that. We have champions who lead that group and guide it. So what is posted there would be quite specific… But in Yammer you can also share to all… But we monitor and also train our users in how and where to post… You can sign up for groups or create new groups… And it is moderated. But not limited to specifically work related groups – sports and social groups are there too. And that helps grow the user base and helps people see benefits.

Q5) Have you looked at Slack at all? Or done any comparison there?

A5) We chose Yammer because of price… We have it in O365, very practical reason for that… We have looked at Slack but no direct comparison.

Finalists in the Social Media in Practice Excellence Competition present their Case Histories

EDINA Digital Footprint Consulting and Training Service – Nicola Osborne

No notes for this one…

Developing Social Paleontology – Lisa Lundgren

This is work with a software development company, funded by the National Science Foundation. And this was a project to develop a community of practice around paleontology… People often think “dinosaurs” but actually it’s about a much wider set of research and studies of fossils. For our fossil project to meet its goal, to develop and support that community, we needed to use social media. So we have a My Fossil community, which is closed to the community, but also a Facebook group and Twitter presence. We wanted to use social media in an educative way to engage the community with our work.

We began with design studies which looked at what basic elements contribute to engagement with social media, and how to engage. We were able to assess practical contributions and build an educative and evidence-based social media plan. So we wanted to create daily posts using social paleontology, e.g. #TrilobiteTuesday; design branded image-focused posts that are practice-specific, meet design principles, and often hyperlink to vetted paleontological websites; and respond to members in ways that encourage chains of communication. There is a theoretical contribution here as well. And we think there are further opportunities to engage more with social paleontology and we are keen for feedback and further discussion. So, I’m here to chat!


Traditions Mobile App – Adam Peruta.

When new university students come to campus they have lots of concerns, like: what is this place, where do I fit in, how can I make new friends? That is particularly the case at small universities, who want to ensure students feel part of the community, and want to stay around. This is where the Traditions Challenge app comes in – it provides challenges and activities to engage new students in university traditions and features. This was trialled at Ithaca College. So, for instance, we encourage students to go along to events, meet other new students, etc. We encourage students to meet their academic advisors outside of the classroom. To explore notable campus features. And to explore the local community more – like the farmers market. So we have a social feed – you can like, comment, there is an event calendar, a history of the school, etc. And the whole process is gamified: you gain points through challenges, you can go on the leaderboard so there are incentives to gain status… And there are prizes too.

Looking at the results this year… We had about 200 students who collectively completed over 1400 challenges; the person who completed the most (and won a shirt) completed 53 challenges. There are about 100 challenges in the app so it’s good they weren’t all done in one year. And we see over 50k screen views, so we know that the app is getting attention whether or not people engage in the challenges. Student focus groups raised themes of the enjoyment of the challenge list, motivation for participation (which varied), app design and user experience – if there’s one key takeaway: this demographic has really high expectations for user interface, design and tone – and contribution to identity… There is lots of academic research showing that the more students are engaged on campus, the more likely they are to remain at that university and remain engaged through their studies and as alumni. So there is loads of potential here, and opportunity to do more with the data.

So, the digital experience is preferred, mobile development is expensive and time consuming, good UI/UX is imperative to success, universities are good at protecting their brands, and we learned that students really want to augment their on-campus academic experiences.

Conference organiser: Those were the finalists from yesterday, so we will award the prizes for first, second and third… and the PhD prize…

Third place is Lisa; Second place is me (yay!); First place is Adam and the Traditions mobile app.

I’m going to rely on others to tweet the PhD winners…

The best poster went to IT Alignment through Artificial Intelligence – Amir – this was partly based on Amir’s performance, as his poster went missing and he had to present from an A4 version of the poster; he did a great job of presenting.

Thank you to our hosts here… And we hope you can join us in Limerick next year!

Thanks to all at ECSM 2017.

Jul 032017

Today I am at the Mykolo Romerio Universitetas in Vilnius, Lithuania, for the European Conference on Social Media 2017. As usual this is a liveblog so additions, corrections etc. all welcome… 

Welcome and Opening by the Conference and Programme Chairs: Aelita Skaržauskienė and Nomeda Gudelienė

Nomeda Gudelienė: I am head of research here and I want to welcome you to Lithuania. We are very honoured to have you here. Social media is very important for building connections and networking, but conferences are also really important still. And we are delighted to have you here in our beautiful Vilnius – I hope you will have time to explore our lovely city.

We were founded 25 years ago when our country gained independence from the Soviet Union. We focus on social studies – there was a gap for new public officials, for lawyers, etc., and so our university was founded.

Keynote presentation: Dr. Edgaras Leichteris, Lithuanian Robotics Association – Society in the cloud – what is the future of digitalization?

I wanted to give something of an overview of how trends in ICT are moving – I’m sure you’ve all heard that none of us will have jobs in 20 years because robots will have them all (cue laughter).

I wanted to start with this complex timeline of emerging science and technology that gives an overview of Digital, Green, Bio, Nano, Neuro. Digitalisation is the most important of these trends; it underpins this all. How many of us think digitalisation will save paper? Maybe not for universities or government, but young people are shifting to digital. But there are major energy implications of that – we are using a lot of power and heat to digitise our society. This takes us through some of those other areas… Can you imagine social networking when we have direct neural interfaces?

This brings me to the hype curve – where we see a great deal of excitement, the trough of disillusionment, and through to where the real work is. Gartner creates a hype cycle graph every year to illustrate technological trends. At the moment we can pick out areas like augmented reality, virtual reality, digital currency. When you look at business impact… Well, I thought that the areas that seem to be showing real change include the Internet of Things – in modern factories you see very few people now, they are just there for packaging as we have sensors and devices everywhere. We have privacy-enhancing technologies, blockchain, brain computer interfaces, and virtual assistants. So we have technologies which are being genuinely disruptive.

Trends wise we also see political focus here. Why is digital a key focus in the European Union? Well, we have captured only a small percentage of the potential. And when we look across the Digital Economy and Society Index we see this is about skills, about high quality public services – a real priority in Lithuania at the moment – not just about digitalisation for its own sake. Now, a few days ago the US press laughed at Jean-Claude Juncker admitting he still doesn’t have a smartphone, but at the same time, he and others leading the EU see that the future is digital.

Some months back I was asked at a training session “Close your eyes. You are now in 2050. What do you see?”. When I thought about that my view was rather dystopic, rather “Big Brother is watching you”, rather hierarchical. And then we were asked to throw out those ideas and focus instead on what can be done. In the Cimulact EU project we have been looking at citizens’ visions to inform a future EU research and innovation agenda. In general I note that among people from older European countries there was more optimism about green technologies and technology enabling societies… Whilst people from Eastern European countries tended to be more concerned with the technologies themselves, and with issues of safety and privacy. And we’ve been bringing these ideas together. For me the vision is technology in the service of people, enabling citizens, and creating systems for green and smart city development, and about personal freedom and responsibility. What unites all of these scenarios? The information was gathered offline. People wanted security, privacy, communication… They didn’t want the technologies per se.

Challenges here? I think that privacy and security are key for social media, along with the focus on the right tool, for the right audience, at the right time. If we listen to Tim Berners-Lee we note that the web is developing in a way divergent from the original vision. Lorrie Faith Cranor, of Carnegie Mellon University, notes that privacy is possible in laboratory conditions, but in the reality of the real world it is hard to actually achieve. That’s why people such as Aral Balkan, self-styled Cyborg Rights Activist, are prominent here – he has founded a cross-Europe party just focusing on privacy issues. He says that the business model of mainstream technology under “surveillance capitalism” is “people farming” and that it is toxic to human rights and democracy. And he is trying to bring those issues into more prominence.

Another challenge is engagement. The use of and time on social media is increasing every year. But what does that mean? Mark Schaefer, Director of Schaefer Marketing Solutions, describes this as “content shock” – we don’t have the capacity to deal with and consume the amount of content we are now encountering. Jay Baer just wrote the book “Hug Your Haters”, making the differentiation between “offstage haters” and “onstage haters”. Offstage haters tend to be older, offline, and only go public if you do not respond. Onstage haters post to every social media network not thinking about the consequences. So his book is about how to respond to, and deal with, many forms of hate on the internet. And one company he recently consulted for has 150 people working to respond to that sort of “onstage” hate.

And then we have the issue of trolling. In Lithuania we have a government trying to limit alcohol consumption – you can just imagine how many people were being supported by alcohol companies to comment and post and respond to that.

We also need to think about engagement in something valuable. Here I wanted to highlight three initiatives; two are quite mature, the third is quite new. The first is “My Government”, or E-citizens. This is about engaging citizens and asking them what they think – they post a question, and it provides a (simple) space for discussion. The one that I engaged with only had four respondents but it was really done well. Lithuania 2.0 was looking at ways to generate creative solutions at government level. That project ended up with a lot of nice features… Every time we took it out, they wanted new features… People engaged but then dropped off… What was being contributed didn’t seem directly enough fed into government, and there was a need to feed back to commentators what had happened as a result of their posts. So, we have reviewed this work and are designing a new way to do this which will be more focused around single topics or questions over a contained period of time, with direct routes to feed that into government.

And I wanted to talk about the right tools for the right audiences. I have a personal story here to do with the idea of whether you really need to be in every network. Colleagues asked why I was not on Twitter… There was lots of discussion, but only 2 people were using Twitter in the audience… So these people were trying to use a tool they didn’t understand to reach people who were not using those tools.

Thinking about different types of tools… You might know that last week in Vilnius we had huge rainfall and a flood… Here we have people sharing open data that allows us to track and understand that sort of local emergency.

And there is the issue of how to give users personalised tools, and give opportunity for different opinions – going beyond your filter bubble – and earn profit. My favourite tool was called Personal Journal – it had just the right combination – until it was bought by Flipboard. Algorithmic tailoring can do this well, but there is that need to make it work, to expose people to wider views. There is a social responsibility aspect here.

So, the future seems to look like decentralisation – including safe silos that can connect to each other; and the right tools for the right audience. On decentralisation Blockchain, or technologies like it, are looking important. And we are starting to see possible use of that in Universities for credentialing. We can also talk about uses for decentralisation like this.

We will also see new forms of engagement going mass market. Observe the “digital natives” who really don’t want to work in a factory… See those people going to get a coffee, needing money… So they put on their visor/glasses and manage a team in a factory somewhere – maybe Australia – only until that money is earned. We will also see better artificial intelligence working on the side of the end users.

The future is ours – we define now, what will happen!


Q1) I was wondering what you mean by Blockchain, I haven’t heard it before.

A1) It’s quite complicated to explain… I suggest you Google it – there are some lovely explanations out there. Essentially we have a distributed ledger…
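[Liveblogger’s aside: for readers in the same position as the questioner, the core idea is a chain of records where each block embeds a cryptographic hash of the previous one, so past entries cannot be quietly altered. A minimal illustrative sketch, ignoring the distributed/consensus side entirely and not how any production blockchain actually works:]

```python
import hashlib
import json

def block_hash(data, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps({"data": data, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def make_block(data, prev_hash):
    """A block records its data, a pointer to its predecessor, and its own hash."""
    return {"data": data, "prev_hash": prev_hash, "hash": block_hash(data, prev_hash)}

def chain_is_valid(chain):
    """Recompute each block's hash: tampering with any earlier block breaks the links."""
    for prev, curr in zip(chain, chain[1:]):
        if curr["prev_hash"] != block_hash(prev["data"], prev["prev_hash"]):
            return False
    return True

genesis = make_block("genesis", "0")
b1 = make_block("Alice pays Bob 5", genesis["hash"])
b2 = make_block("Bob pays Carol 2", b1["hash"])
chain = [genesis, b1, b2]

print(chain_is_valid(chain))   # True
genesis["data"] = "Alice pays Bob 500"   # try to rewrite history
print(chain_is_valid(chain))   # False - the later blocks no longer match
```

The "distributed" part is that many parties hold copies of this chain and agree on which extension is valid, which is what makes the tamper-evidence useful.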

Q2) You spoke about the green issues around digitalisation, and I know Block Chain comes with serious environmental challenges – how do we manage that environmental and technological convenience challenge?

A2) Me and my wife have really different views of green… She thinks we go back to the yurt and the plants. I think differently… I think yes, we consume more… But we have to find the spots where we consume lots of energy and use technology to make them more sustainable. Last week I was at the LEGO factory in Denmark and they are working on how to make that sustainable… But that is challenging as their clients want trusted, robust, long-lasting materials. There are already some technologies, but we have to see how that will happen.

Q3) How do you see the role of artificial intelligence in privacy? Do you see it as a smart agent and intermediary between consumers and marketers?

A3) I am afraid of a future like the one Elon Musk warns of, where artificial intelligence takes over. But what AI can do is help us interpret data for our decisions. And it can interpret patterns, filter information, help us make the best use of information. At the same time there is always a tension between advertisers and those who want to block advertisers. In Lithuanian media we see pop-ups requesting that we switch off ad blocking tools… At the same time we will see more ad blockers… So Google, Amazon, Facebook… They will use AI to target us better in different ways. I remember hearing from someone that you will always have advertising – but you’ll like it, as it will be tailored to your preferences.

Q4) Coming from a background of political sciences and public administration… You were talking about decentralisation… Wouldn’t it be useful to differentiate between developed and developing world, or countries in transition… In some of those contexts decentralisation can mean a lack of responsibility and accountability…

A4) We see real gaps already between cities and rural communities – increasingly cities are their own power and culture, with a lot of decisions taken like mini states. You described a possible scenario that is quite 1984-like, of centralisation for order. But personally I still believe in decentralisation. There is a need for responsibility and accountability, but you have more potential for human rights and…

Aelita Skaržauskienė: Thank you to Edgaras! I actually just spent a whole weekend reading about Blockchain, as here in Lithuania we are becoming a hub for FinTech – financial innovation start-ups.

So, I just wanted to introduce today here. Social media is very important for my department – more than 33 researchers here look at social technologies. Social media is rising in popularity, but more growth lies ahead. More than 85% of internet users are engaging with social media BUT over 5 billion people in the world still lack regular access to the internet, so that number will increase. There have already been so many new collaborations made possible by social media.

Thank you so much for your attention in this exciting and challenging research topic!

Stream B: Mini track on Social Media in Education (Chair: Nicola Osborne and Stefania Manca)

As I’m chairing this session (as Stefania is presenting), my notes do not include Q&A I’m afraid. But you can be confident that interesting questions were asked and answered!

The use of on-line media at a Distance Education University – Martins Nico, University of South Africa, Pretoria, South Africa

The University of South Africa is an online-only university, so I will be talking about research we have been doing on the use of Twitter, WhatsApp, Messenger, Skype and Facebook by students. A number of researchers have also explored obstacles experienced in social media; some identified obstacles will be discussed.

In terms of professional teaching dispositions, these are the principles, commitments, values and professional ethics that influence the attitude and behaviour of educators. I called on my background in organisational psychology and measuring instruments to explore different ideas of presence: virtual/technological; pedagogical; expert/cognitive; social. And these sit on a scale from behaviours that are easily changed to those that are difficult to change. I want to focus on the difficult to change area of incorporating technologies significantly into practice – the virtual/technological presence area.

Now, about our university… We have 350k students and +/- 100k non-formal students. African and international students from 130 countries. We are a distance education university. 60% are between 25 and 39 and 63.9% are female. At Unisa we think about “blended” learning, from posting materials (snail mail) through to online presence. In our open online distance learning context we are using tools including WhatsApp, BBM, Mxit, WeChat, Research Gate, Facebook, LinkedIn, intranet, Google drive and wiki spaces, multimedia etc. We use a huge range, but it is up to the lecturer exactly which of these they use. For all the modules online you can view course materials, video clips, articles, etc. For this module that I’m showing here, you have to work online, you can’t work offline, it’s a digital course.

So, the aim of our research was to understand how effectively the various teaching dispositions are using the available online media, and to what extent there is a relationship between disposition and technology used. Most respondents we had (40.5%) had 1 to 3 years of service. Most respondents (45.1%) were Baby Boomers. Most were female (61%), most respondents were lecturers and senior lecturers.

Looking at the results, the most used tool was WhatsApp, with instant messaging and social networking high overall. Microblogging and digital curation were amongst the least used.

Now, when we compare that to the dispositions, we see an interesting correlation between social presence dispositions and instant messaging, and between virtual presence dispositions and research networking and cloud computing. The most significant relationships were between virtual presence and online tools. There was no significant correlation between pedagogical presence and any particular tools.

I just wanted to talk about the generations at play here: Baby Boomers, Gen X-ers, and Millennials. Looking at the ANOVA analysis for generations and gender, only for instant messaging and social networking was there any significant result; in both cases millennials use these the most. In terms of gender, we see females using social networking and instant messaging more than males. The results show that the younger generation, or millennials, and females use these two online media significantly more than other groups – for our university that has implications for ensuring our staff understand the spaces our students use.
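As a quick aside from me (the blogger): the ANOVA the speaker mentions tests whether group means (here, generations) differ by more than chance would suggest. A minimal sketch of the calculation in pure Python – with made-up usage scores, not the Unisa study's data:

```python
# One-way ANOVA F-statistic, from scratch. The three groups and their
# 1-5 usage scores below are invented for illustration - they are NOT
# the study's data.

def anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    # Between-group sum of squares (df = k - 1)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares (df = n - k)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

boomers     = [2, 3, 2, 3, 2]   # hypothetical instant-messaging scores
gen_x       = [3, 3, 4, 3, 3]
millennials = [4, 5, 4, 5, 4]
print(round(anova_f([boomers, gen_x, millennials]), 2))  # prints 19.0
```

A large F value (judged against the appropriate F-distribution) is what lets the speaker call the millennials' higher usage "significant".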

The results confirmed that millennials are most inclined to use instant messaging and social networking. Females were using these the most.

So, my recommendation? To increase usage of online tools, the university will need to train academics in the usage of the various online tools; to arrange workshops on new technology, social media and mobile learning; to advise and guide academics to increase web self-efficacy and compensate accordingly; and to determine the needs and preferences of students pertaining to the use of social media in an ODL environment, and focus…

Towards a Multilevel Framework for Analysing Academic Social Network Sites: A Network Socio-Technical Perspective – Manca Stefania, National Research Council of Italy and Juliana Elisa Raffaghelli, University of Florence, Italy

I work in the field of learning, distance education, distance learning, social media and social networking. I'm going to share with you some work I am doing with Juliana Elisa Raffaghelli on the use of social networking sites for academic purposes. I know there are lots of different definitions here; in this talk I mean the use of social media sites for scholarly communication. As we all know there are many different ways to communicate our work, including academic publications and conferences like this, but we have also seen a real increase in the use of social media for scholarly communication. Academia.edu and ResearchGate are in widest use of these, but others are out there.

The aim of my study was to investigate these kinds of sites, not only in terms of adoption and uptake, but also what kinds of actions people take on these sites. The study is a theoretical piece of work taking a socio-technical perspective. Before I talk more about this I wanted to define some of the terms and context here.

Digital Scholarship is the use of digital evidence, methods of inquiry, research, publication and preservation to achieve scholarly and research goals, and can encompass both scholarly communication using digital media and research on digital media. Martin Weller, one of the first to explore this area, describes digital scholarship as shorthand for an intersection of technology-related developments, namely digital content, networked distribution, and open practices – and the potential transformational quality of that intersection.

A recent update to this work, by Greenhow and Gleason (2014), defines Social Scholarship as the means by which social media affordances and values evolve the ways scholarship is done in academia. And Veletsianos and Kimmons (2012) have talked about Networked Participatory Scholarship as a new form of scholarship arising from these new uses of technology and new types of practice.

There are lots of concerns and tensions that have been raised here: the blurring boundaries of personal and professional identities; the challenge of unreliable information online – many say that ResearchGate and Academia.edu have a huge number of fake profiles, and that not all of what is there can be considered reliable; a perception that these sites may not be useful – a social factor; the challenge of finding time to curate different sites; and, given the traditional idea of “publish or perish”, some concern over these sites.

The premise of this study is to look at popular academic sites like ResearchGate and Academia.edu. Although these sites are increasingly transforming scholarly communication and academic identity, there is a need to understand them at a socio-technical level, which is where this study comes in. Academic social network sites are networked socio-technical systems, determined by both social forces and technological features. The design, implementation and use of such technologies sit in a wider cultural and social context (Hudson and Wolf 2003?).

I wanted to define these sites through a multilevel framework, with a socio-economic layer (ownership, governance, business model); techno-cultural layer (technology, user/usage, content); networked-scholar layer (networking, knowledge sharing, identity). Those first two layers come from a popular study of social networking usage, but we added that third level to capture those scholarly qualities. The first two levels refer to the structure and wider context.

We also wanted to bring in social capital theory/ies, encompassing the capacity of social networks to produce goods for mutual benefits (Bourdieu, 1986). This can take the form of useful information, personal relationships or group networks (Putnam 2000). We took this approach because the scholarly community can be viewed as knowledge sharing entities formed by trust, recognition etc. I will move past an overview of social capital types here, and move to my conclusion here…

This positions academic social network sites as networked socio-technical systems that afford social capital among scholars… And here we see structural and distributed scholarly capital.

So to finish a specific example: ResearchGate. The site was founded in 2008 by two physicists and a computer scientist. More than 12 million members distributed worldwide in 193 countries. The majority of members (60%) belong to scientific subject areas, and it is intended to open up science and enable new work and collaboration.

When we look at ResearchGate from the perspective of the socio-economic layer…. Ownership is for-profit. Governance is largely through terms and conditions. The business model is largely based on a wide range of free-of-charge services, with some subscription aspects.

From the techno-cultural layer… Technology automatically signals who one may be interested in connecting with, via news feeds, prompts, endorsements, and new researchers to follow. Usage can be passive, or users can become active participants by making new connections. And content – the site affords publication of diverse types of scientific outputs.

From the networked scholar layer: networking – follow and recommend; knowledge sharing – commenting, the questions feature, the search function, existing Q&As, expertise and skills; and identity – through profile, score, reach and h-index.

On Linking Social Media, Learning Styles, and Augmented Reality in Education – Kurilovas Eugenijus, Julija Kurilova and Viktorija Dvareckiene, Vilnius University Institute of Mathematics and Informatics, Lithuania

Eugenijus: So, why augmented reality? Well, according to predictions it will be the main environment for education by 2020, and we need to think about linking it to students on the one hand, and to academia on the other. So, the aim of this work is to present an original method to identify students who prefer to actively engage in social media and want to use augmented reality, and to relate this to learning styles.

Looking over the literature, we see tremendous development of social media, powered by innovative web technologies, web 2.0 and social networks. But there are so many different approaches here, and every student is different. The possibilities of AR seem almost endless, and the literature suggests AR may be more effective than traditional methods. Only one meta-analysis directly addresses personalisation of AR-based systems/environments in education, and the learning styles element of that work is about differences in student needs rather than being specifically focused on learning styles.

Another aspect of AR can be cognitive overload from the information, the technological devices, and the tasks students need to undertake. Few studies seem to look at the pedagogy of AR, rather than tests of AR.

So, our method… All learning processes, activities and scenarios should be personalised to student learning styles. We undertook a simple and convenient expert evaluation method based on the application of trapezoidal fuzzy numbers, looking at suitability of use in e-learning. The questions given to experts focused on the suitability of social media and AR learning activities in learning. After that, details explaining the Felder-Silverman learning styles model (four dimensions included) were provided for the experts.

After the experts completed the questionnaire, it's easy to calculate the average values of suitability of the learning styles and learning activities for AR and social media. So we can now easily compute the average for learning styles… Every student could come in and answer a learning styles questionnaire and get their own table – their personal, individual learning styles. Then, combining that score with the expert ratings of AR and social media, we can calculate suitability indexes of all learning styles for particular students. The programme does this in, say, 20 seconds…
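A quick sketch from me of how such a suitability index could be computed – my own illustration of the idea (weighting averaged expert ratings by the student's learning-style profile), not the authors' actual formula, style names or values:

```python
# Combine averaged expert ratings for one learning component (e.g. an AR
# activity) with one student's learning-style questionnaire results.
# All names and numbers here are hypothetical.

def suitability_index(student_styles, expert_ratings):
    """Average of the expert ratings, weighted by the student's
    strength in each learning-style dimension."""
    total = sum(student_styles.values())
    return sum(expert_ratings[s] * w for s, w in student_styles.items()) / total

# Averaged expert ratings (0-1) of an AR activity per style dimension
ar_ratings = {"visual": 0.9, "verbal": 0.3, "active": 0.85, "reflective": 0.35}
# One student's style profile from the questionnaire
student = {"visual": 0.8, "verbal": 0.2, "active": 0.7, "reflective": 0.3}

print(round(suitability_index(student, ar_ratings), 2))  # prints 0.74
```

A recommender system would then rank learning components for each student by this index, along the lines the speaker describes.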

So, we asked 9 experts to share their opinion on particular learning styles… The experts see social media and AR as particularly suitable for visual and activist learning styles. We think that suitability indexes should be included in recommender systems – the main component of a personalised learning system – and should be linked to particular students according to those suitability indexes. The higher the suitability index, the better the learning components fit a particular student's needs.

So, expert evaluation, linking learning activities and students by suitability index and recommender system are main intelligent technologies applied to personalise learning. An optimal learning scenario would make use of this to personalise learning. And as already noted Augmented Reality and social media are most suitable for visual and activist learners; most unsuitable for verbal and reflective learners… And that will be reflected in student happiness and outcomes. Visual and activist learners prefer to actively use learning scenarios based on application of AR and social media.

According to Felder and Silverman most people of college age and older are visual. Visual learners remember best what they see rather than what they hear. Visual learners are better able to remember images rather than verbal or text information. For visual learners the optimal learning scenario should include a range of visual materials.

Active learners do not learn much in situations that require them to be passive. They feel more comfortable with or better at active experimentation than reflective observation. For active learners the optimal scenario should include doing something that relates to the wider outside world.

And some conclusions… Learning styles show how learning scenarios can best be tuned to learners. The influence of visual and social media has shifted student expectations, but many teaching organisations are still quite traditional…

We now have a short break for lunch. This afternoon my notes will be sparse – I’ll be presenting in the Education Mini Track and then, shortly after, in the Social Media Excellence Awards strand. Normal service will be resumed after this afternoon’s coffee break. 

Stream B: Mini track on Social Media in Education (Chair: Nicola Osborne and Stefania Manca)

Digital Badges on Education: Past, Present and Future – Araujo Inês, Carlos Santos, Luís Pedro, and João Batista, Aveiro University, Portugal

I’ve come a little late into Ines’ talk but she is taken us through the history of badges as a certification, including from Roman times. 

Badges were used like an honour, but also as a punishment, with badges and tattoos used to classify that experience. A pilgrim going to Santiago de Compostela had a badge, but there was a huge range of fake badges out there – the Pope eventually required you to come to Rome to get your badge. We also have badges like martial arts belts, and badges for scouts… So… Badges have baggage.

With the beginning of the internet we started the beginnings of digital badges, as a way to recognise achievements and to recognise professional achievements. So, we have the person who receives the badge, the person/organisation who issues the badge, and the place where the badge can be displayed. And we have incentives to collect and share badges associated with various cities across the world.

Many platforms have badges. We have Open Badges infrastructures (Credly, BadgeOS, etc.) and we have places to display and share badges. Educational platforms also support badges, including Moodle, Edmodo, SAPO Campus (at our speaker's home institution), etc. But in our VLE we didn't see badges being used as we expected, so we tried to look at how badges are being used worldwide…

How are badges being used? Authority; award and motivation; sequential orientation – gain one badge, then the next; research; recognition; identity; evidence of achievement; credentialing. The biggest use was around two major areas: motivation (for students, but also teachers and others), and credentialing. In fact some 10% of digital badges are used to motivate and reward teachers, and to recognise their skills. However, the major use is with students, split across award, credentialing, and evidence of achievement.

So, our final recommendation was for the integration of badges in education: choose a platform; show the advantage of using a repository (e.g. a backpack for digital badges); choose the type of badge – mission type and/or award type; and enjoy it.

Based on this information we began a MOOC: Badges: how to use it. And you can see a poster on the MOOC. And this was based on the investigation we did for this work.


Q1) Have you had some feedback, or collected some information on students’ interest on badges… How do they react or care about getting those badges?

A1) Open Badges are not really known to everyone in Portugal. The first task I had was to explain them and their advantages. Teachers like the idea… They feel it is very important for their students and have tried it with their students. Most of the experiments show students enjoying the badges… But I'm not sure they understand that they can reuse a badge by showing it on social media, in the community… That is a task still to do. From the first experience I know of, from the teachers who were in the MOOC: they enjoyed it, they liked it, they asked for more badges.

Q2) I know about the concept here… Any issues with dual ways to assess students – grades and badges.

A2) Teachers can use them alongside grading, in parallel. Or, if they use them in sequence, students understand how to achieve that grade. The teacher has to decide how best to use them – whether for assessment or to motivate towards a better grade.

Q3) Thank you! I’m co-ordinating an EU open badge project so I’d like to invite you to publish. Is the MOOC only in Portuguese? My students are designing interactive modules – CC licensed – with best practice guidance. Maybe we can translate and reuse?

A3) It’s only in Portuguese at the moment. We have about 120 people engaged in the MOOC and it runs on SAPO Campus. They are working on a system of badges that can be used across all institutions so that teachers can share badges, a repository to choose from and use in their own teaching.

Comment) Some of that unification really useful for having a shared understanding of meaning and usage of badges.

Yes, but from what I could see teachers were not using badges because they hadn’t really seen examples of how to use them. And they get a badge at the end of the course!

Q4) What is the difference between digital badges and open badges?

A4) Open Badges is a specific standard designed by Mozilla. Digital badges can be created by everyone.

Comment) At my institution the badges are about transferrable skills… They have to meet unit learning outcomes, graduate learning outcomes. They can get prior learning certified through them as well to reduce taught classes for masters students. But that requires that solid infrastructure.

We have infrastructure to issue badges: someone can create a badge and issue it to a person. The badge has metadata – where it was issued, why, by whom – and is then made available in a repository, e.g. the Mozilla Backpack.
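To make that metadata concrete, here is a minimal sketch of the kind of assertion record an Open Badge carries – field names in the spirit of the Open Badges specification, but the identifiers and values are all hypothetical placeholders:

```python
import json

# A badge assertion records who earned which badge, from whom, and when.
# The URLs and email below are placeholders, not a real issued badge.
assertion = {
    "type": "Assertion",
    "recipient": {"type": "email", "identity": "student@example.org"},
    "badge": "https://example.org/badges/mooc-completion.json",  # BadgeClass: name, criteria, issuer
    "issuedOn": "2017-07-03T12:00:00Z",
    "verification": {"type": "hosted"},  # verified by fetching from the issuer
}
print(json.dumps(assertion, indent=2))
```

A displayer such as the Mozilla Backpack reads exactly this kind of record to show where, why, and by whom a badge was issued.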

Exploring Risk, Privacy and the Impact of Social Media Usage with Undergraduates – Connelly Louise and Nicola Osborne, University of Edinburgh, UK

Thanks to all who came along! Find our abstract and (shortly after today) our preprint here.

And I’ve now moved on to the Best Practice Awards strand where I’ll be presenting shortly… I’ve come in to the questions for Lisa Lundgren (and J. Crippen Kent)’s presentation on using social media to develop social paleontology. From the questions I think I missed hearing about a really interesting project. 

EDINA Digital Footprint Consultancy & Training Service – Osborne Nicola, University of Edinburgh, UK 

Well, that was me. No notes here, but case study will be available soon. 

D-Move – Petrovic Otto, University of Graz, Austria

This is a method and software environment to anticipate “digital natives'” acceptance of technology innovations, looking particularly at how the academic sector has long-term impact on the private sector. And our students are digital natives – that's important. So, to introduce myself: I'm professor of information systems at the University of Graz, Austria. I have had a number of international roles and a strong role in bridging academia and government – I am a member of the regulatory authority for telecommunications for Austria – and I have started three companies.

So, what is the challenge? In 2020 more than half of all the people living in our world will have been born and raised with digital media and the internet – they are digital natives. And they are quite different regarding their values and norms, behaviours and attitudes – consider the big changes in industries like media, commerce, banking, transport or travel. They have a growing aversion to traditional surveys based on “imagine a situation where you use a technology like…”. Meanwhile, surveys designed, executed and interpreted by traditional “experts” will result in traditional views – the real experts are the digital natives, and the results should be gained through digital natives' lives…

So the solution? It is an implemented method based on the Delphi approach: digital natives are used as experts in a multi-round, structured group communication process. In each round they collect their own impressions regarding the Delphi issue – so, for instance, we have digital natives engaging in self-monitoring of their activities.

So, we recruited 4 groups of 5 digital natives; round one discussion as well as interviews with 130 digital natives; field experience embedded in everyday life; discussion; and analysis. We want to be part of the daily life of the digital native, but a big monolithic space won't work – things change, and different groups use different spaces. We need social media and we need other types of interfaces… We don't know them today. We have a data capturing layer for pictures, video and annotations. We also need data storage, data presentation and sharing, data tagging and organisation, access control and privacy, private spaces and personalisation… Access control is crucial, as individuals want to keep their data private until they want to share it (if at all).
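The layering described – capture, storage, tagging, sharing, and access control with private-by-default data – could be sketched roughly like this; the class and field names are my own illustration, not D-Move's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Capture:
    kind: str                      # "picture", "video" or "annotation"
    payload: str                   # file name or note text
    tags: list = field(default_factory=list)
    shared: bool = False           # private until the participant opts in

class Store:
    """Data storage layer: everything is private by default."""
    def __init__(self):
        self._items = []
    def add(self, item):
        self._items.append(item)
    def shared_items(self):
        # Access-control layer: only explicitly shared items are visible
        return [i for i in self._items if i.shared]

store = Store()
store.add(Capture("picture", "tram_stop.jpg", tags=["transport"]))
store.add(Capture("annotation", "prefer the app to the ticket machine", shared=True))
print(len(store.shared_items()))  # prints 1
```

The key design point the speaker stresses is that last filter: nothing a participant captures leaves their private space until they choose to share it.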

D-Move gives insights into changes in Digital Natives views, experiences, self-monitoring, etc. And in terms of understanding “why” digital natives behave as they do. The participants show high satisfaction with D-Move as a space for learning. D-Move has been implemented and used in different industries for many years – used for media, transport and logistics, travel industry, health and fitness. It started with messaging based social media, going to social media platforms, finally implementing social internet of things technologies. And we are currently working with one of the most prestigious hotels – with a customer base typically in their seventies… So we are using D-Move to better understand the luxury sector and what parts of technology they need to engage with. D-Move is part of Digital Natives “natural” communication behaviour. And an on-going cycle of scientific evaluation and further technical development.

In terms of the next steps: firstly, the conceptual models will be applied to the whole process to better understand digital natives' thinking, feeling and behaviour; using different front ends focused on internet of things technologies; and offering D-Move to different industries to book particular issues, rather like using an omnibus survey. D-Move is both a research environment and a teaching environment – two streams going in the same direction, including as a teaching instrument.


Q1) Your digital native participants, how do you recruit them?

A1) It depends on the age group; it ranges from age 10 to nearer age 30. Through our university we can reach 20-25 year olds; for 10 to 20 year olds we work with schools. 25 to 30 year olds are harder to recruit.

Q2) What about ethical issues? How do you get informed consent from 10 to 18 year olds?

A2) These issues are usually based on real issues in life, and this is why security and privacy is very important. And we have sophisticated ways of indicating what is and is not OK to share. This is partly through storing data in our storage. It is not a public system, the data is not accessible to others.

Q3) We’ve seen a few presentations on using data from participants. According to the POPI Act (based on EU GDPR, you can’t use data without consent… How do you get around that?

A3) It’s easier because it is not a public system, and we do not relate information in publications, only at an aggregated level.

At this point I feel it is important to note my usual “digital native” caveat: I don't agree with the speaker on this term (or the generalisations around it), which has been disputed widely in the literature, including by Marc Prensky, its originator.

The Traditions Challenge mobile App – Peruta Adam, Syracuse University, New York, USA

I’ve been looking at how colleges and universities have been using social media in student recruitment, alumni engagement etc. And it has been getting harder and harder to get access to social media data over the years, so I decided to design my own thing.

So, think back to your first days at university. You probably had a lot of concerns. For instance, Ithaca College is in a town less than 7 miles wide; there isn't a big sports programme; it is hard to build community. So… The Traditions Challenge is a mobile app to foster engagement and community building for incoming university students – it works as a sort of bucket list of things to do and engage with. It launched at Ithaca in August 2016 with over 100 challenges. For instance FYRE, which already encourages engagement, is a challenge here. Faculty Office Hours is its own challenge – a way to get students to find out about these. At the fountains – a notable feature on campus – you can have your picture taken. And we encourage students to explore the town, for instance by engaging with the farmers market.

So there is a list of challenges, plus a feed to see what else is happening on campus, and information on the school. And this is all gamified: challenges earn points, and there is a leaderboard which gives students status. There are some actual real-world prizes too – stickers, a nice sweatshirt, etc. This is all designed to get students more engaged, and engaged early on at university – there is a lot of academic research showing that students who are more involved and engaged are more likely to stay at that university.

Traditions are very important in US universities – we have over 4,000 institutions – and those traditions translate into a real sense of identity for students. There are materials on traditions – keepsake books for ticket stubs, images, etc. – but these are not digital. Those are nice, but there is no way to track what is going on (plus, who takes physical pictures?). In fact Ithaca tried that approach on campus – a pack, whiteboards, etc. But this year, with the app, there is much more data that can be quantified. This year we had around 200 sign-ups (4% of on-campus students). We didn't roll out to everyone, but picked influencers and told them to invite friends, then those friends to invite their friends, etc. Those 200 sign-ups did over 1,400 challenges, and 44 checked in for prizes. Of the top ten challenges, 70% of the most popular were off-campus, and 100% of those were non-academic experiences. There is a sense of students being most successful when they are involved in a lot of things and have more activities going on. It is hard to compare the analogue with the app, but we know that at least 44 students checked in for prizes with the app, versus 8 checking in when we ran the analogue challenges.

In terms of students responding to the challenges, they enjoyed the combination of academic and non-academic activities. One student, who’d been enrolled for 3 years, found out about events on campus through the app that he had never heard about before. Some really responded to the game, to the competition. Others just enjoyed the check list, and a way to gather memories. Some just really want the prize! (Others were a lot less excited). Maybe more prizes could also help – we are trying that.

In terms of app design and UX, this cohort hugely cares about the wording of things and the look of things… Their expectations are really, really high.

In terms of identity students reported feeling a real sense of connection to Ithaca – but it’s early days, we need some longitudinal data here.

We found that the digital experience is preferred. Mobile development is expensive and time consuming – I had an idea, tried to build a prototype, and applied for a grant to hire a designer, but everyone going down this path has to understand that you need developers, designers, and marketing staff at the university to be involved. And, like I said, the expectations were really high. We ran workshops before making anything to make sure we understood those expectations.

I would also note that universities in the US are really getting protective of their brand, the use of logos, fonts etc. They really trusted me but it took several goes to get a logo we were all happy with.

And finally, data from the app and from follow-up work shows that students really want to augment their experience with on-campus activities and off-campus activities… And active and involved students seem to lead to active and involved alumni – that would be great data to track. The old book approach was lovely, as tangible things are good – but it's easy to automate some printing from the app…

So, what’s happening now? Students are starting, they will see posters and postcards, they will see targeted Facebook ads.

I think that this is a good example of how a digital experience can connect with a really tangible experience.

And finally, I’m from Suracuse University, and I’d like to thank Ithaca College, and NEAT for their support.


Q1) What is the quality of contribution like here?

A1) It looks quite a lot like an Instagram update – a photo, text, tagging – and you can edit it later.

Q2) And can you share to other social media?

A2) Yes, they can share to Facebook and Twitter.

Q3) I wanted to ask about the ethics of what happens when students take images of each other?

A3) Like other types of social media, that’s a social issue. But there is a way to flag images and admins can remove content as required.

Q4) Most of your data is from female participants?

A4) Yes, about 70% of people who took part were female participants.

Q5) How did you recruit users for your focus groups?

A5) We recruited our heaviest app users and emailed them to invite them along. The other thing I wanted to note: it wasn't me, or colleagues, running the focus groups – it was student facilitators, to make this peer to peer.

Q6) How reliable is the feedback? Aren’t they going to be easy to please here?

A6) Sure, they will be eager to please, so there may be some bias. I will eventually be doing some research on these data points.

Q7) Any plans to expand to other universities?

A7) Yes, would love to compare the three different types of US universities in particular.

Q8) Is the app free to students?

A8) Yes, I suspect if I was to monetize this it would be for the university – a license type set up.

Mini track on Social Media in Education – Chair: Nicola Osborne and Stefania Manca

Evaluation of e-learning via Social Networking Website by full-time Students in Russia – Pivovarov Ivan, RANEPA, Russia

Why did I look at this area? Well, the Russian Government is presently struggling with poor education service delivery. There is great variety in the efficiency and quality of higher education, so the Russian Government is looking for ways to make significant improvements. In my opinion, social media can be effective in full-time teaching, and that's what my research was looking at.

So, I wanted to determine the best techniques for delivery of e-learning via social networking websites. I was looking at VK rather than Facebook. VK is by far the biggest social media platform in Russia; the second biggest is Instagram. There is strong competition there.

So I was looking at the views of students about educational usage of VK, targeting bachelor students from the Russian Presidential Academy of National Economy and Public Administration – an atypical institution focused specifically on public administration. A special interest group was created on VK and educational content was regularly uploaded there. We had hundreds of people in this group – hoping for 1,000 in future. Material included assignments, educational contests, etc. Finally, after six months of using this space, I made a questionnaire and asked my students what they liked, what they didn't like, what they disliked the most, etc., and we had 100 responses. Age wise, 82% were between 18 and 21 years old; 12% were 21-24 years old; 6% were older than 24. This shows that users of social media are typically young – when they move on in life, have families etc., they tend not to use social media. We also asked about Facebook: 53% had a Facebook account, 47% did not.

We asked what the advantages of VK are over Facebook. 52% said most of their friends were on VK; 13% said VK has a more user-friendly interface than Facebook; 29% said VK has more interesting content – sharing of music, films etc. – than Facebook. Looking at usage of VK for educational purposes, 35% use it weekly; 31% very seldom; 14% 2-3 times a week; 10% daily. Usage is generally heavier on weekdays, dropping at the weekend.

So, what motivated people to be a member of the special interest group on a social media website? Most (53%) said ease of access to information; 31% the dissemination of information; 4% the chance of interaction. And when asked what they wanted to improve, most (53%) wanted to increase teacher-student interaction – more teachers joining them on social media.

Students mostly preferred posts from teachers that were about administration of the unit (28%) and content (28%). When asked if they wanted to watch video lectures, 85% said yes. One year after this work I started to record video lectures – short (5-10 mins) – which became available prior to a lecture, so students could pick up new definitions, new terms, etc. in advance. And in the lecture we follow up and go into details – we can go straight into discussion. So this response inspired me to create this video content.

I also asked if students had taken an online class before: 52% had, 48% hadn’t. I asked students how they liked interacting on social media – 86% of students found it positive (but I only asked them after they’d been assessed, to avoid too much bias in the results).

Conclusions here… Well, I wanted to compare Russia to other contexts. Students in Russia wanted more teacher-student interaction. “Comments must be encouraged” was not a finding in our experiment, but it was in research in Turkey.


Q1) Is there an equivalent to YouTube in Russia?

A1) Yes, YouTube is big. There is an alternative called RuTube – maybe more the Russian Vimeo. No Twitter – Telegram is nearest. And there is no Russian analogue to SnapChat, but that is being pushed aside by Instagram Stories now I think. WhatsApp is very popular, but I don’t see the educational potential there. This semester I had students make online broadcasts of my lecture… with Instagram Stories… VK does try and copy features from other worldwide spaces – they have stories. But Instagram is most popular.

Q2) Among the takeaways is the need for more intense interaction between students and teaching staff. Are your teaching staff motivated to do this? I do this as a “hobby” in my institution. Is it formalised in your school? And also, you spoke about the strength of VK versus Facebook – you noted that people using VK drives traffic… So where do you see opportunities for new platforms in Russia?

A2) Your second question, that’s hard to predict. Two or three years ago it was hard to predict Instagram Stories or Snapchat. But I guess probably social media associated with sport…

Q2) Won’t that potential be hampered by attitudes in the population that steer toward what they know?

A2) I don’t think so… On the time usage front I think my peers probably share your concerns about time and engagement.

Comment) It depends on how it develops… We have a minimum standard. In our LMS there is a widget, and staff have to make videos each semester for it – that’s now a minimum practice. Although in the long run teaching isn’t really rewarded – it’s research that is typically rewarded… Do you have to answer to a manager on this in terms of restrictions on trying things out?

A2) No, I am lucky, I am free to experiment. I have a big freedom I think.

Q3) Do you feel uncomfortable being in a social space with your students… To be appropriate in your profile picture… What is your dynamic?

A3) All my photos are clean anyway! Sports, conferences… But yes, as a University teacher you have to be sensible. You have to be careful with images etc… But still…

Comment) But that’s something people struggle with – whether to have one account or several…

A3) I’m a very public person… Open to everyone… So no embarrassing photos! On the LMS: my university has announced that we will have a new learning management system. But there is a question of whether students will like that or engage with that. There is Clayton Christensen’s concept of disruptive innovation. This tool wasn’t designed for education, but it can be… Will an LMS be comfortable for students to use though?

Comment) Our university is almost post-LMS… So maybe if you don’t have one already, you could jump somewhere else, to a web 2.0 delivery system…

A3) The system will be run and tested in Moscow, and then rolled out to the regions…

Q4) You ran this course for your students at your institution, but was the group open to others? And how does that work in terms of payments if some are students, some are not?

A4) Everyone can join the group. And when they finish, they don’t escape from the group, they stay, they engage, they like etc. Not everyone, but some. Including graduates. So the group is open and everyone can join it.

Developing Social Media Skills for Professional Online Reputation of Migrant Job-Seekers – Buchem Ilona, Beuth University of Applied Sciences Berlin, Germany

We have 12,800 students, many of whom have a migrant background, although the work I will present isn’t actually for our students – it’s for migrants seeking work.

Cue a short video on what it means to be a migrant moving across the world in search of a brighter future and a safe place to call home, noting the significant rise in migration, often because of conflict and uncertainty.

That was a United Nations video about refugees. Germany has accepted a huge number of refugees, over 1.2 million in 2015, 2016. And, because of that, we have and need quite a complex structure of programmes and support for migrants making their home. At the same time here Germany has shortages of skilled workers so there is a need to match up skills and training here. There is particular need for doctors, engineers, experts in technology and ICT for instance.

But, it’s not all good news. Unemployment in Germany is twice as high among people with a migration background compared to those without. At the same time we have migrants with high skills and social capital, but it is hard if not impossible to certify and check that. Migrant academics, including refugees, are often faced with unemployment, underemployment or challenging work patterns.

In that video we saw a certificate… Germany is a really organised country, but that means that without certificates and credentials it is hard to have your skills recognised. But we also see the idea of the connected migrant, with social media enabling that – for social gain but also to help find jobs and training.

So the project here is “BeuthBonus”, a follow on project. We are targeted at skilled migrant workers – this partly fills a gap in delivery as training programmes for unskilled workers are more common. It was developed to help migrant academics to find appropriate work at the appropriate level. The project is funded by the German Federal Ministry of Research and Education, the German Federal Ministry of Labour, and also part of an EU Open Badges pilot as we are also an Open Badges Network pilot for recognition of skills.

Our participants 2015-16 are 28 in total (12 female, 16 male), from 61 applications. Various backgrounds but we have 20 different degrees there: 28% BA, 18% MA, 7% PhD. They are mainly 30-39 or 40-49 and they are typically from Tunisia, Afghanistan, Syria, etc.

So, the way this works is that we cooperate with different programmes – e.g. an engineer might take an engineering refresher/top up. We also have a module on social media – just one module – to help participants understand social media, develop their skills, and demonstrate their skills to employers. This is also a good fit as job applications are now overwhelmingly digital. And recruiters’ attitudes to digital CVs have moved from reserved to positive.

So, in terms of how companies in Germany are using social media in recruitment. Xing, a German language only version of a tool like LinkedIn, is the biggest for recruitment advertising. In terms of active sourcing in social media, 45% of job seekers prefer to be approached. And in fact 21% of job seekers would pay to be better visible in these space. 40% of job openings are actively sourced – higher in IT sector.

So we know that building an online professional reputation is important, and more highly skilled job hunters will particularly benefit from this. So, we have a particular way that we do this – a process for migrants to develop their online professional reputation. They start by searching for themselves, then others comment on what was found. They are asked to reflect and think about their own strengths and the requirements of the labour market. Then they go in and look at how the spaces are used, how people brand themselves, and use these spaces. Then there is some framing around a theme; they plan what they will do, and then set up a schedule for the next weeks and months… So they put it into action.

We then have instrumental ways to assess this – do they use social media, how do they use it, how often, how they connect with others, and how they express themselves online. We also take some culture specific and gender specific considerations into account in doing this.

And, to enhance online presence we look at OpenBadges, set goals, and work towards them. I will not introduce OpenBadges, but I will talk about how we understand competencies. So we have a tool called ProfilPASS – a way to capture experience as transferrable skills that can be presented to the world. We designed badges accordingly. And we have BeuthBonus Badges in the Open Badge Network, but these are on Moodle and available in German and in English to enable flexibility in applying for jobs. Those badges span different levels; participants are issued badges at the appropriate levels, and they can share them on Xing or LinkedIn as appropriate. And we also encourage them to look at other sources of digital badges – from IBM developerWorks or Women’s Business Club, etc.

So, these results have been really good. Before the programme we had 7% employed, but after we had 75% employed. This tends to be a short term perspective. Before the programme 0% had a digital CV, after 72% did. We see that 8% had an online profile before, but 86% now do. And that networking means they have contacts, and they have a better understanding of the labour market in Germany.

In our survey 83% felt Open Badges are useful for enhancing online reputation.

Open Badge Network has initiatives across the world. We work on Output 4: Open Badges in Territories. We work with employers on how best to articulate the names


Q1) In your refugee and migration terminology, do you have subcategories?

A1) We do have sub categories around e.g. language level, so can refer them to language programmes before they are coming to us. And there had been a change – it used to be that economic migrants were not entitled to education, but that has changed now. Migrants and refugees are the target group. It depends on the target group…

Q2) In terms of the employer, do you create a contact point?

A2) We have an advisory board drawn from industry, also our trainers are drawn from industry.

Q3) I was wondering about the cultural differences about online branding?

A3) I have observations only, as we have only small samples and from many countries. One difference is that some people are more reserved, and would not approach someone in a direct way… They would wave (only)… And in Germany the hierarchy is not important in terms of having conversations, making approaches, but that isn’t the case in some other places. And sharing an image, and a persona… that can be challenging. That personal/professional mix can be even tricky.

Q4) How are they able to manage those presences online?

A4) Doing that searching in a group.. And with coaches they have direct support, a space to discuss what is needed, etc.

Q5) Lets say you take a refugee from country x, what is needed?

A5) They have to have a degree, and they have to have good german – a requirement of our funder – and they have to be located in Germany.

Comment) This seems like it is building so much capacity… I think what you are doing over there is fantastic and opening doors to lots of people.

Q6) In Germany, all natives have these skills already? Or do you do this for German people too? Maybe they should?

A6) For our students I tend to just provide guidance for this. But yes, maybe we need this for all our students too.

Jun 30 2017

Today I’m at ReCon 2017, giving a presentation later (flying the flag for the unconference sessions!) today but also looking forward to a day full of interesting presentations on publishing for early careers researchers.

I’ll be liveblogging (except for my session) and, as usual, comments, additions, corrections, etc. are welcomed. 

Jo Young, Director of the Scientific Editing Company, is introducing the day and thanking the various ReCon sponsors. She notes: ReCon started about five years ago (with a slightly different name). We’ve had really successful events – and you can explore them all online. We have had a really stellar list of speakers over the years! And on that note…

Graham Steel: We wanted to cover publishing at all stages, from preparing for publication, submission, journals, open journals, metrics, alt metrics, etc. So our first speakers are really from the mid point in that process.

SESSION ONE: Publishing’s future: Disruption and Evolution within the Industry

100% Open Access by 2020 or disrupting the present scholarly comms landscape: you can’t have both? A mid-way update – Pablo De Castro, Open Access Advocacy Librarian, University of Strathclyde

It is an honour to be at this well attended event today. Thank you for the invitation. It’s a long title but I will be talking about how are things are progressing towards this goal of full open access by 2020, and to what extent institutions, funders, etc. are being able to introduce disruption into the industry…

So, a quick introduction to me. I am currently at the University of Strathclyde library, having joined in January. It’s quite an old university (founded 1796) and a medium size university. Previous to that I was working in The Hague on the EC FP7 Post-Grant Open Access Pilot (OpenAIRE), providing funding to cover OA publishing fees for publications arising from completed FP7 projects. Maybe not the most popular topic in the UK right now but… The main point of explaining my context is that this EU work was more of a funder’s perspective, and now I’m able to compare that to more of an institutional perspective. As a result of this pilot there was a report commissioned by a British consultant: “Towards a competitive and sustainable open access publishing market in Europe”.

One key element in this open access EU pilot was the OA policy guidelines which acted as key drivers, and made eligibility criteria very clear. Notable here: publications to hybrid journals would not be funded, only fully open access; and a cap of no more than €2000 for research articles, €6000 for monographs. That was an attempt to shape the costs and ensure accessibility of research publications.

So, now I’m back at the institutional open access coalface. Lots had changed in two years. And it’s great to be back in this space. It is allowing me to explore ways to better align institutional and funder positions on open access.

So, why open access? Well in part this is about more exposure for your work, higher citation rates, compliant with grant rules. But also it’s about use and reuse including researchers in developing countries, practitioners who can apply your work, policy makers, and the public and tax payers can access your work. In terms of the wider open access picture in Europe, there was a meeting in Brussels last May where European leaders call for immediate open access to all scientific papers by 2020. It’s not easy to achieve that but it does provide a major driver… However, across these countries we have EU member states with different levels of open access. The UK, Netherlands, Sweden and others prefer “gold” access, whilst Belgium, Cyprus, Denmark, Greece, etc. prefer “green” access, partly because the cost of gold open access is prohibitive.

Funders’ policies are a really significant driver towards open access. Funders include Arthritis Research UK, Bloodwise, Cancer Research UK, Breast Cancer Now, British Heart Foundation, Parkinson’s UK, Wellcome Trust, Research Councils UK, HEFCE, the European Commission, etc. Most support green and gold, and will pay APCs (Article Processing Charges), but it’s fair to say that early career researchers are not always at the front of the queue for getting those paid. HEFCE in particular have a green open access policy, requiring research outputs from any part of the university to be made open access; otherwise they will not be eligible for the REF (Research Excellence Framework). As a result, compliance levels are high – probably the top of Europe at the moment. The European Commission supports green and gold open access, but typically green as this is more affordable.

So, there is a need for quick progress at the same time as ongoing pressure on library budgets – we pay both for subscriptions and for APCs. Offsetting agreements, which discount subscriptions by APC charges, could be a good solution. There are pros and cons here. In principle it will allow quicker progress towards OA goals, but it will disproportionately benefit legacy publishers. It brings publishers into APC reporting – right now that spend is sometimes invisible to the library as APCs are paid by researchers, so this is a shift and a challenge. It’s supposed to be a temporary stage towards full open access. And it’s a very expensive intermediate stage: not every country can or will afford it.
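A toy calculation may make the offsetting logic concrete. All figures here are hypothetical, purely for illustration – real agreements vary a lot in how much of the APC spend is credited back:

```python
# Hypothetical figures illustrating how an offsetting agreement
# nets APC spend against the subscription bill.
subscription = 100_000   # annual subscription with one publisher (illustrative)
apc = 2_000              # article processing charge per gold OA article (illustrative)
oa_articles = 30         # articles made open access with that publisher this year

# Without an agreement the library effectively pays twice ("double dipping"):
total_without_offset = subscription + apc * oa_articles

# With full offsetting, the APC spend is credited against the subscription,
# so the library's total outlay is just the subscription:
total_with_offset = (subscription - apc * oa_articles) + apc * oa_articles

print(total_without_offset, total_with_offset)  # prints: 160000 100000
```

The gap between the two totals is exactly the APC spend, which is why offsetting is attractive to libraries in the short term even though it leaves the underlying subscription cost with legacy publishers untouched.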

So how can disruption happen? Well, one way to deal with this would be through policies – for instance not funding hybrid journals (as done in OpenAIRE). And disruption is happening (legal or otherwise), as we can see in Sci-Hub usage, which comes from all around the world, not just developing countries. Legal routes are possible in licensing negotiations. In Germany there is Projekt DEAL being negotiated, and this follows similar negotiations elsewhere. At the moment Elsevier is the only publisher not willing to include open access journals.

In terms of tools… The EU has just announced plans to launch its own platform for funded research to be published. And the Wellcome Trust already has a space like this.

So, some conclusions… Open access is unstoppable now, but it still needs to generate sustainable and competitive implementation mechanisms. It is getting more complex and difficult to communicate to researchers – that’s a serious risk. Open Access will happen via a combination of strategies and routes – internal fights just aren’t useful (e.g. green vs gold). The temporary stage towards full open access needs to benefit library budgets sooner rather than later. And the power here really lies with researchers, whom OA advocates aren’t always able to keep informed. It is important that you know which journals are open and which are hybrid, and why that matters. And we need to consider whether informing authors on where it would make economic sense to publish is beyond the remit of institutional libraries.

To finish, some recommended reading:

  • “Early Career Researchers: the Harbingers of Change” – Final report from Ciber, August 2016
  • “My Top 9 Reasons to Publish Open Access” – a great set of slides.


Q1) It was interesting to hear about offsetting. Are those agreements one-off? continuous? renewed?

A1) At the moment they are one-off and intended to be a temporary measure. But they will probably mostly get renewed… National governments and consortia want to understand how useful they are, how they work.

Q2) Can you explain green open access and gold open access and the difference?

A2) In Gold Open Access, the author pays to make their paper open on the journal website. If that’s a hybrid – i.e. subscription – journal, you essentially pay twice: once to subscribe, once to make it open. Green Open Access means that your article goes into your repository (after any embargo), into the worldwide repository landscape (see:

Q3) As much as I agree that choices of where to publish are for researchers, but there are other factors. The REF pressures you to publish in particular ways. Where can you find more on the relationships between different types of open access and impact? I think that can help?

A3) Quite a number of studies. For instance is APC related to Impact factor – several studies there. In terms of REF, funders like Wellcome are desperate to move away from the impact factor. It is hard but evolving.

Inputs, Outputs and emergent properties: The new Scientometrics – Phill Jones, Director of Publishing Innovation, Digital Science

Scientometrics is essentially the study of science metrics and evaluation of these. As Graham mentioned in his introduction, there is a whole complicated lifecycle and process of publishing. And what I will talk about spans that whole process.

But, to start, a bit about me and Digital Science. We were founded in 2011 and we are wholly owned by the Holtzbrinck Publishing Group, which owns the Nature group. Being privately funded we are able to invest in innovation by researchers, for researchers, trying to create change from the ground up – things like Labguru, a lab notebook (like RSpace); Altmetric; Figshare; ReadCube; Peerwith; Transcriptic, an IoT company; etc.

So, I’m going to introduce a concept: the Evaluation Gap. This is the difference between the metrics and indicators currently or traditionally available, and the information that those evaluating your research might actually want to know. Funders might; tenure panels – hiring and promotion panels; universities – your institution, your office of research management; government, funders, policy organisations – all want to achieve something with your research…

So, how do we close the evaluation gap? Introducing altmetrics. These add to academic impact other types of societal impact – policy documents, grey literature, mentions in blogs, peer review mentions, social media, etc. What else can you look at? Well, you can look at grants being awarded… When you see a grant awarded for a new idea, someone then publishes… someone else picks up and publishes… That can take a long time, so grants can tell us things before publications do. You can also look at patents – a measure of commercialisation and potential economic impact further down the line.

So you see an idea germinate in one place, work with collaborators at the institution, spreading out to researchers at other institutions, and gradually out into the big wide world… As that idea travels outward it gathers more metadata, more impact, more associated materials, ideas, etc.

And at Digital Science we have innovators working across that landscape, along that scholarly lifecycle… But there is no point having that much data if you can’t understand and analyse it. You have to classify the data first to do that… Historically that was done by subject area, but increasingly research is interdisciplinary; it crosses different fields. So single tags/subjects are not useful – you need a proper taxonomy to apply here. And there are various ways to do that. You need keywords and semantic modelling, and you can choose to:

  1. Use an existing one if available, e.g. MeSH (Medical Subject Headings).
  2. Consult with subject matter experts (the traditional way to do this, could be editors, researchers, faculty, librarians who you’d just ask “what are the keywords that describe computational social science”).
  3. Text mine abstracts or full text articles (using the content to create a list from your corpus with bag of words/frequency of words approaches, for instance, to help you cluster and find the ideas, with a taxonomy emerging).
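As a rough sketch of option 3, a bag-of-words comparison can be done with nothing more than term-frequency vectors and cosine similarity. The abstracts and stopword list below are invented for illustration; a real pipeline would use proper tokenisation, stemming and a much larger corpus:

```python
from collections import Counter
import math

STOPWORDS = {"the", "of", "and", "a", "in", "to", "for"}

def bag_of_words(text):
    """Tokenise a text into a term-frequency vector, dropping stopwords."""
    tokens = [w.strip(".,;:()").lower() for w in text.split()]
    return Counter(w for w in tokens if w and w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy "abstracts" – the first two should land in the same cluster.
abstracts = [
    "social media usage among university students",
    "students and social media in higher education",
    "gold open access publishing fees in Europe",
]

vectors = [bag_of_words(t) for t in abstracts]
pairs = [(i, j, cosine(vectors[i], vectors[j]))
         for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
closest = max(pairs, key=lambda p: p[2])
print(closest[:2])  # the two student/social-media abstracts pair up: (0, 1)
```

Clustering pairwise similarities like these (e.g. with hierarchical clustering over a larger corpus) is one way a subject taxonomy can be made to emerge from the documents themselves rather than being imposed up front.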

Now, we are starting to take that text mining approach. But to be of use, that data needs to be cleaned and curated. So we hand curated a list of institutions to go into GRID, the Global Research Identifier Database, to understand organisations and their relationships. Once you have that all mapped you can link to ISNI, CrossRef databases, etc. And when you have that organisational information you can include georeferences to visualise where organisations are…

An example that we built for HEFCE was the Digital Science BrainScan. The UK has a dual funding model where there is both direct funding and block funding, with the latter awarded by HEFCE and it is distributed according to the most impactful research as understood by the REF. So, our BrainScan, we mapped research areas, connectors, etc. to visualise subject areas, their impact, and clusters of strong collaboration, to see where there are good opportunities for funding…

Similarly we visualised text mined impact statements across the whole corpus. Each impact is captured as a coloured dot. Clusters show similarity… Where things are far apart, there is less similarity. And that can highlight where there is a lot of work on, for instance, management of rivers and waterways… And these weren’t obvious before, as they cut across disciplines…


Q1) Who do you think benefits the most from this kind of information?

A1) In the consultancy we have clients across the spectrum. In the past we have mainly worked for funders and policy makers to track effectiveness. Increasingly we are talking to institutions wanting to understand strengths, to predict trends… And by publishers wanting to understand if journals should be split, consolidated, are there opportunities we are missing… Each can benefit enormously. And it makes the whole system more efficient.

Against capital – Stuart Lawson, Birkbeck University of London

So, my talk will be a bit different. The arguments I will be making are not in opposition to any of the other speakers here, but are about critically addressing the ways we are currently working, and how publishing works. I have chosen to speak on this topic today as I think it is important to make visible the political positions that underlie our assumptions and the systems we have in place today. There are calls to become more efficient but I disagree… Ownership and governance matter at least as much as the outcome.

I am an advocate for open access and I am currently undertaking a PhD looking at open access and how our discourse around it has been co-opted by neoliberal capitalism. And I believe these issues aren’t technical but social, and reflect inequalities in our society; any company claiming to benefit society while operating as a commercial company should raise questions for us.

Neoliberalism is a political project to reshape all social relations to conform to the logic of capital (this is the only slide; apparently a written and referenced copy will be posted on Stuart’s blog). This system turns us all into capital, entrepreneurs of ourselves – quantification and metricification, whether through tuition fees that put a price on education and turn students into consumers selecting based on rational indicators of future income, or through pitting universities against each other rather than having them collaborate. It isn’t just overtly commercial, but about applying ideas of the market to all elements of our work – high impact factor journals, metrics, etc. in the service of proving our worth. If we do need metrics, they should be open and nuanced, but if we only do metrics for people’s own careers, and perform for careers and promotion, then these play into neoliberal ideas of control. I fully understand how hard it is to live and do research without engaging in and playing the game. It is easier to choose not to do this if you are in a position of privilege, and that reflects and maintains inequalities in our organisations.

Since power relations are often about labour and worth, this is inevitably part of work, and of the value of labour. When we hear about disruption in the context of Uber, it is about disrupting the rights of workers and labour unions; it ignores the needs of the people who do the work – it is a neoliberal idea. I would recommend seeing Audrey Watters’ recent presentation for the University of Edinburgh on the “Uberisation of Education”.

The power of capital in scholarly publishing, and neoliberal values in our scholarly processes… When disruptors align with the political forces that need to be dismantled, I don’t see that as useful or properly disruptive. Open Access is a good thing in terms of access. But there are two main strands of policy… Research Councils have spent over £80m on paying researchers’ APCs. Publishing open access does not require payment of fees – there are OA journals funded in other ways. But if you want the high end, visible journals, they are often hybrid journals, and 80% of that RCUK spend has been on hybrid journals. So work is being made open access, but right now this money flows from public funds to a small group of publishers – who take a 30-40% profit – and that system was set up to continue benefitting publishers. You can share or publish to repositories… Those are free to deposit and use. The concern with OA policy is the connection to the REF: it constrains where you can publish and what that means, and work must always be measured in this restricted structure. It can be seen as compliance rather than a progressive movement toward social justice. But open access is having a really positive impact on the accessibility of research.

If you are angry at Elsevier, then you should also be angry at Oxford University and Cambridge University, and others for their relationships to the power elite. Harvard made a loud statement about journal pricing… It sounded good, and they have a progressive open access policy… But it is also bullshit – they have huge amounts of money… There are huge inequalities here in academia and in relationship to publishing.

And I would recommend strongly reading some history on the inequalities, and the racism and capitalism that was inherent to the founding of higher education so that we can critically reflect on what type of system we really want to discover and share scholarly work. Things have evolved over time – somewhat inevitably – but we need to be more deliberative so that universities are more accountable in their work.

To end on a more positive note, technology is enabling all sorts of new and inexpensive ways to publish and share. But we don’t need to depend on venture capital. Collective and cooperative running of organisations in these spaces – such as the cooperative centres for research… There are small scale examples that show the principles, and that this can work. Writing, reviewing and editing is already being done by the academic community; let’s build governance and process models to continue that, to make it work, to ensure work is rewarded but that the driver isn’t commercial.


Comment) That was awesome. A lot of us here are here to learn how to play the game. But the game sucks. I am a professor, I get to do a lot of fun things now, because I played the game… We need a way to have people able to do their work without that game. But we need something more specific than socialism… Libraries used to publish academic data… Lots of these metrics are there and useful… And I work with them… But I am conscious that we will be fucked by them. We need a way to react to that.

Redesigning Science for the Internet Generation – Gemma Milne, Co-Founder, Science Disrupt

Science Disrupt run regular podcasts, events, a Slack channel for scientists, start ups, VCs, etc. Check out our website. We talk about five focus areas of science. Today I wanted to talk about redesigning science for the internet age. My day job is in journalism and I think a lot about start ups, about how we can influence academia, and about how success manifests itself in the internet age.

So, what am I talking about? Things like Pavegen – power generating paving stones. They are all over the news! The press love them! BUT the science does not work, the physics does not work…

I don’t know if you heard about Theranos which promised all sorts of medical testing from one drop of blood, millions of investments, and it all fell apart. But she too had tons of coverage…

I really like science start ups, I like talking about science in a different way… But how can I convince the press, the wider audience, what is good stuff, and what is just hype, not real… One of the problems we face is that people not engaged in research either can’t access the science, or can’t read it even if they can access it… This problem is really big and it influences where money goes and what sort of stuff gets done!

So, how can we change this? There are amazing tools to help (Authorea, Overleaf, Figshare, Publons, LabWorm) and this is great and exciting. But I feel it is very short term… Trying to change something that doesn’t work anyway… Doing collaborative lab notes a bit better, publishing a bit faster… OK… But is it good for sharing science? Thinking about journalists and corporates, they don’t care about academic publishing, it’s not where they go for scientific information. How do we rethink that… What if we were to rethink how we share science?

AirBnB and Amazon are on my slide here to make the point of the difference between incremental change vs. real change. AirBnB addressed issues with hotels, issues of hotels being samey… They didn’t build a hotel; instead they thought about what people want when they travel, what mattered to them… Similarly Amazon didn’t try to incrementally improve supermarkets… They did something different. They dug to the bottom of why something exists and rethought it…

Imagine science was “invented” today (ignore all the realities of why that’s impossible). But imagine we think of this thing, we have to design it… How do we start? How will I ask questions, find others who ask questions…

So, a bit of a thought experiment here… Maybe I’d post a question on reddit, set up my own sub-reddit. I’d ask questions, ask why they are interested… Create a big thread. And if I have a lot of people, maybe I’ll have a Slack with various channels about all the facets around a question, invite people in… Use the group to project manage this project… OK, I have a team… Maybe I create a Meet Up Group for that same question… Get people to join… Maybe 200 people are now gathered and interested… You gather all these folk into one place. Now we want to analyse ideas. Maybe I share my question and initial code on GitHub, find collaborators… And share the code, make it open… Maybe it can be reused… It has been collaborative at every stage of the journey… Then maybe I want to build a microscope or something… I’d find the right people, I’d ask them to join my Autodesk 360 to collaboratively build engineering drawings for fabrication… So maybe we’ve answered our initial question… So maybe I blog that, and then I tweet that…

The point I’m trying to make is, there are so many tools out there for collaboration, for sharing… Why aren’t more researchers using these tools that are already there? Rather than designing new tools… These are all ways to engage and share what you do, rather than just publishing those articles in those journals…

So, maybe publishing isn’t the way at all? I get the “game” but I am frustrated about how we properly engage, and really get our work out there. Getting industry to understand what is going on. There are lots of people inventing in new ways… You can use stuff in papers that isn’t being picked up… But see what else you can do!

So, what now? I know people are starved for time… But if you want to really make that impact you think your work deserves… I understand there is a concern around scooping… But there are ways around that… And if you want to know about all these tools, do come talk to me!


Q1) I think you are spot on with vision. We want faster more collaborative production. But what is missing from those tools is that they are not designed for researchers, they are not designed for publishing. Those systems are ephemeral… They don’t have DOIs and they aren’t persistent. For me it’s a bench to web pipeline…

A1) Then why not create a persistent archived URI – a webpage where all of a project’s content is shared. 50% of all academic papers are only read by the person that published them… These stumbling blocks in the way of sharing… It is crazy… We shouldn’t just stop and not share.

Q2) Thank you, that has given me a lot of food for thought. The issue of work not being read, I’ve been told that by funders so very relevant to me. So, how do we influence the professors… As a PhD student I haven’t heard about many of those online things…

A2) My co-founder of Science Disrupt is a computational biologist and PhD student… My response would be about not asking, just doing… Find networks, find people doing what you want. Benefit from collaboration. Sign an NDA if needed. Find the opportunity, then come back…

Q3) I had a comment and a question. Code repositories like GitHub are persistent and you can find a great list of code repositories and meta-articles around those on the Journal of Open Research Software. My question was about AirBnB and Amazon… Those have made huge changes but I think the narrative they use now is different from where they started – and they started more as incremental change… And they stumbled on bigger things, which looks a lot like research… So… How do you make that case for the potential long term impact of your work in a really engaging way?

A3) It is the golden question. Need to find case studies, to find interesting examples… a way to showcase similar examples… and how that led to things… Forget big pictures, jump the hurdles… Show that bigger picture that’s there but reduce the friction of those hurdles. Sure those companies were somewhat incremental but I think there is genuinely a really different mindset there that matters.

And we now move to lunch. Coming up…

UNCONFERENCE SESSION 1: Best Footprint Forward – Nicola Osborne, EDINA

This will be me – talking about managing a digital footprint and how robust web links are part of that lasting digital legacy- so no post from me but you can view my slides on Managing Your Digital Footprint and our Reference Rot in Theses: A HiberActive Pilot here.

SESSION TWO: The Early Career Researcher Perspective: Publishing & Research Communication

Getting recognition for all your research outputs – Michael Markie, F1000

I’m going to talk about things you do as researchers that you should get credit for, not just traditional publications. This week in fact there was a very interesting article on the history of science publishing “Is the staggering profitable business of scientific publishing bad for science?”. Publishers came out of that poorly… And I think others are at fault here too, including the research community… But we do have to take some blame.

There’s no getting away from the fact that the journal is the coin of the realm, for career progression, institutional reporting, grant applications. For the REF, will there be impact factors? REF says maybe not, but institutions will be tempted to use that to prioritise. Publishing is being looked at by impact factor…

And it’s not just where you publish. There are other things that you do in your work for which you should get more credit. Data; software/code – in bioinformatics there are new software tools that are part of the research, and are they getting the recognition they should; all results – not just the successes but also the negative results… Publishers want cool and sexy stuff but realistically we are funded for this, we should be able to publish and be recognised for it; peer review – there is no credit for it, yet peer reviews often improve articles and warrant credit; expertise – all the authors who added expertise, including non-research staff; everyone should know who contributed what…

So I see research as being more than a journal article. Right now we just package it all up into one tidy thing, but we should be fitting into that bigger picture. So, I’m suggesting that we need to disrupt it a bit more and publish in a different way… Publishing introduces delays – of up to a year. Journals don’t really care about data… That’s a real issue for reproducibility. And there is bias involved in publishing; there is a real lack of transparency in publishing decisions. All of the above means there is real research waste. At the same time there is demand for results, for quicker action, for wider access to work.

So, at F1000 we have been working on ways to address these issues. We launched Wellcome Open Research, and after launching that the Bill & Melinda Gates Foundation contacted us to build a similar platform. And we have also built an open research model for UCL Child Health (at Great Ormond Street).

The process involves sending a paper in, checking there is no plagiarism and that ethics are appropriate. But no other filtering. That can take up to 7 days. Then we ask for your data – no data, no publication. Then once the publication and data deposition are made, the work is published and an open peer review and user commenting process begins; reviewers are named and credited, and they contribute to improving that article and to the article revision. Those reviewers have three options: approved, approved with reservations, or not approved as it stands. So to get to PMC and indexed in PubMed you need two “approved” statuses, or two “approved with reservations” and an “approved”.

So this connects to lots of stuff… For data that’s with DataCite, Figshare, Plotly, and the Resource Identification Initiative. For software/code we work with Code Ocean, Zenodo, GitHub. For all results we work with PubMed, and you can publish other formats… etc.

Why are funders doing this? Wellcome Trust spent £7m on APCs last year… So this platform is partly a service to stakeholders, with a complementary capacity for all research findings. We are testing a new approach to improve science and its impact – to accelerate access and sharing of findings and data; efficiency to reduce waste and support reproducibility; an alternative OA model, etc.

Make an impact, know your impact, show your impact – Anna Ritchie, Mendeley, Elsevier

A theme across the day is that there is increasing pressure and challenges for researchers. It’s never been easier to get your work out – new technology, media, platforms. And yet, it’s never been harder to get your work seen: more researchers, producing more outputs, dealing with competition. So how do you ensure you and your work make an impact? Options mean opportunities, but also choices. Traditional publishing is still important – but not enough. And there are both older and newer ways to help make your research stand out.

The Publishing Campus is a big thing here. These are free resources to support you in publishing. There are online lectures, interactive training courses, and expert advice. And things happen – live webinars, online lectures (e.g. Top 10 Tips for Writing a Really Terrible Journal Article!), interactive courses. There are suites of materials around publishing, and around developing your profile.

At some point you will want to look at choosing a journal. Metrics may be part of what you use to choose a journal – but use both quantitative and qualitative (e.g. ask colleagues and experts). You can also use Elsevier Journal Finder – you can search for your title and abstract and subject areas to suggest journals to target. But always check the journal guidance before submitting.

There is also the opportunity for article enrichments which will be part of your research story – 2D radiological data viewer, R code Viewer, Virtual Microscope, Genome Viewer, Audioslides, etc.

There are also less traditional journals: Heliyon covers all disciplines, so you can report your original and technically sound results of primary research, regardless of perceived impact. MethodsX is entirely about methods work. Data in Brief allows you to describe your data to facilitate reproducibility, make it easier to cite, etc. And an alternative to a data article is to add datasets on Mendeley.

And you can also use Mendeley to understand your impact through Mendeley Stats. There is a very detailed dashboard for each publication – this is powered by Scopus so it works for all articles indexed in Scopus. Stats like users, Mendeley users with that article in their library, citations, related works… And you can see how your article is being shared. You can also show your impact on Mendeley, with a research profile that is as comprehensive as possible – not just your publications but with wider impacts, press mentions… And enabling you to connect to other researchers, to other articles and opportunities. This is what we are trying to do to make Mendeley help you build your online profile as a researcher. We intend to grow those profiles to give a more comprehensive picture of you as a researcher.

And we want to hear from you. Every journal, platform, and product is co-developed with ongoing community input. So do get in touch!

How to share science with hard to reach groups and why you should bother – Becky Douglas

My background is physics, high energy gravitational waves, etc… As I was doing my PhD I got very involved in science engagement. Hopefully most of you think about science communication and public outreach as being a good thing. It does seem to be something that arises in job interviews and performance reviews. I’m not convinced that everyone should do this – not everyone enjoys or is good at it – but there is huge potential if you are enthusiastic. And there is more expectation on scientists to do this to gain recognition, to help bring trust back to scientists, and to right some misunderstandings. And by the way, talks and teaching don’t count here.

And not everyone goes to science festivals. It is up to us to provide alternative and interesting things for those people. There are a few people who won’t be interested in science… But there are many more people who don’t have time or don’t see the appeal to them. These people deserve access to new research… And there are many ways to communicate that research. New ideas are always worth doing, and can attract new people and get dialogue you’d never expect.

So, article writing is a great way to reach out… Not just in science magazines (or on personal blogs). Newspapers and magazines will often print science articles – reach out to them. And you can pitch other places too – Cosmo prints science. Mainstream publications are desperate for people who understand science to write about it in engaging ways – sometimes you’ll be paid for your work as well.

Schools are obvious, but they are great ways to access people from all backgrounds. You’ll do extra well if you can connect it to the current curriculum! Put the effort in to build a memorable activity or event. Send them home with something fun and you may well reach parents as well…

More unusual events would be things like theatre, for instance Lady Scientists Stitch and Bitch. Stitch and Bitch is an international thing where you get together and sew and craft and chat. So this show was a play which was about travelling back in time to gather all the key lady scientists, and they sit down to discuss science over some knitting and sewing. Because it was theatre it was an extremely diverse group, not people who usually go to science events. When you work with non scientists you get access to a whole new crowd.

Something a bit more unusual… Soapbox Science, which I brought to Glasgow in 2015. It’s science busking where you talk about your cutting edge research. Often attached to science festivals but out in public, to draw a crowd from those shopping, or visiting museums, etc. It’s highly interactive. Most attendees had not been to a science event before, they didn’t go out to see science, but they enjoyed it…

And finally, interact with local communities. The WI has science events, as do Scouts and Guides, and meet up groups… You can just contact and reach out to those groups. They have questions of their own. It allows you to speak to really interesting groups. But it does require lots of time. I was based in Glasgow, now in Falkirk, and I’ve just done some of this with schools in the Gorbals where we knew that the kids rarely go on to science subjects…

So, this is really worth doing. Your work, if it is tax-payer funded, should be accessible to the public. Some people don’t think they have an interest in science – some are right, but others just remember dusty chalkboards and bland text books. You have to show them it’s something more than that.

What helps or hinders science communication by early career researchers? – Lewis MacKenzie

I’m a postdoc at the University of Leeds. I’m a keen science communicator and I try to get out there as much as possible… I want to talk about what helps or hinders science communication by early career researchers.

So, who are early career researchers? Well, undergraduates are a huge pool of early career researchers and scientists which tends to be untapped; also PhDs; also postdocs. There are some shared barriers here: travel costs, time… That is especially the case in inaccessible parts of Scotland. There is a real issue that science communication is not recognised as work (or training). And not all supervisors have a positive attitude to science communication. As well as all the other barriers to careers in science, of course.

Let’s start with science communication training. I’ve been through the system as an undergraduate, PhD student and postdoc. A lot of training is (rightly) targeted at PhD students, often around writing, conferences, elevator pitches, etc. But there are issues/barriers for ECRs… Pro-active sci comm is often not formally recognised as training/CPD/workload – especially at evenings and weekends. I also think undergraduate sci comm modules are minimal/non-existent. You get dedicated sci comm masters now, so there is lots to explore. And there are relatively poor sci comm training opportunities for postdocs. But across the board, media skills training is pretty much limited – how do you make YouTube videos, podcasts, web comics, or write in a magazine? – and that’s where a lot of science communication takes place!

Sci comm in schools includes some great stuff. STEMNET is an excellent route for ECRs, industry, retirees, etc to volunteer, with some basic training, background checks, and a contact hub of schools and volunteers. However it is a confusing school system (especially in England), with confusing curricula. How do you do age-appropriate communication? And just getting to the schools can be tricky – most PhDs and sci comm people won’t have a car. It’s basic but important as a barrier.

Science Communication Competitions are quite widespread. They tend to be aimed at PhD students, incentives being experience, training and prizes. But there are issues/barriers for ECRs – often conventional “stand and talk” format; not usually collaborative – even though team work can be brilliant, the big famous science communicators work with a team to put their shows and work together; intense pressure of competitions can be off putting… Some alternative formats would help with that.

Conferences… Now there was a tweet earlier this week from @LizyLowe suggesting that every conference should have a public engagement strand – how good would that be?!

Research Grant “Impact Plans”: major funders now require “impact plans” revolving around science communication. That makes time and money for science communication, which is great. But there are issues. The grant writer often designates activities before ECRs are recruited. These prescriptive impact plans aren’t very inspiring for ECRs. Money may be inefficiently spent on things like expensive web design. I think we need a more agile approach to include input from ECRs once recruited.

Finally I wanted to finish with science communication fellowships. These are run by the likes of the Wellcome Trust (Engagement Fellowships) and the STFC. These are for the Olympic gold medallists of sci comm. But they are not great for ECRs. The dates are annual and inflexible – and the process takes over 6 months – it is a slow decision making process. And they are intensively competitive, so not very ECR friendly, which is a shame as many sci comm people are ECRs. So perhaps more institutions or agencies should offer sci comm fellowships? And a continuous application process with shorter spells?

To sum up… ECRs at different career stages require different training and organisational support to enable science communication. And science communication needs to be recognised as formal work/training/education – not an out of hours hobby! There are good initiatives out there but there could be many more.

PANEL DISCUSSION – Michael Markie, F1000 (MM); Anna Ritchie, Mendeley, Elsevier (AR); Becky Douglas (BD); Lewis MacKenzie (LM) – chaired by Joanna Young (JY)

Q1 (JY): Picking up on what you said about Pathways to Impact statements… What advice would you give to ECRs if they are completing one of these? What should they do?

A1 (LM): It’s quite a weird thing to do… Two strands… This research will make loads of money and we’ll commercialise it; and the science communication strand. It’s easier to say you’ll do a science festival event, harder to say you’ll do a press release… You can say you will blog your work once a month, or tweet a day in the lab… You can do that. In my fellowship application I proposed a podcast on biophysics that I’d like to do. You can be creative with your science communication… But there is a danger that people aren’t imaginative and make it a box-ticking thing. Just doing a science festival event and a webpage isn’t that exciting. And those plans are written once… But projects run for three years maybe… Things change, skills change, people on the team change…

A1 (BD): As an ECR you can ask for help – ask supervisors, peers, ask online, ask colleagues… You can always ask for advice!

A1 (MM): I would echo that you should ask experienced people for help. And think tactically as different funders have their own priorities and areas of interest here too.

Q2: I totally agree with the importance of communicating your science… But showing impact of that is hard. And not all research is of interest to the public – playing devil’s advocate – so what do you do? Do you broaden it? Do you find another way in?

A2 (LM): Taking a step back and talking about broader areas is good… I talk a fair bit about undergraduates as science communicators… They have really good broad knowledge and interest. They can be excellent. And this is where things like Soapbox Science can be so effective. There are other formats too… Things like Bright Club, which communicates research through comedy… That’s really different.

A2 (BD) I would agree with all of that. I would add that if you want to measure impact then you have to think about it from the outset – will you count people, have some sort of voting or questionnaires? You have to plan this stuff in. The other thing is that you have to pitch things carefully to your audience. If I run events on gravitational waves I will talk about space and black holes… Whereas with a 5 year old I ask about gravity and we jump up and down so they understand what is relevant to them in their lives.

A2 (LM): In terms of metrics for science communication… I was at the British Science Association conference a few years back and this was a major theme… Becky mentioned getting kids to post notes in boxes at sessions… Professional science communicators think a great deal about this… Maybe not as much as us “Sunday fun run” type people, but we should engage more.

Comment (AR): When you prepare an impact statement are you asked for metrics?

A2 (LM): Not usually… They want impact but don’t ask about that…

A2 (BD): Whether or not you are asked for details of how something went you do want to know how you did… And even if you just ask “Did you learn something new today?” that can be really helpful for understanding how it went.

Q3: I think there are too many metrics… As a microbiologist… which ones should I worry about? Should there be a module at the beginning of my PhD to tell me?

A3 (AR): There is no one metric… We don’t want a single number to sum us up. There are so many metrics because one number isn’t enough… There is experimentation going on with what works, and what works for you… So be part of the conversation, and be part of the change.

A3 (MM): I think there are too many metrics too… We are experimenting. Altmetrics are indicators, there are citations, that’s tangible… We just have to live with a lot of them all at once at the moment!

UNCONFERENCE SESSION 2: Preprints: A journey through time – Graham Steel

This will be a quick talk plus plenty of discussion space… From the onset of thinking about this conference I was very keen to talk about preprints…

So, who knows what a preprint is? There are plenty of different definitions out there – see Neylon et al 2017. But we’ll take the Wikipedia definition for now. I thought preprints dated to the 1990s. But I found a paper that referenced a preprint from 1922!

Let’s start there… Preprints were ticking along fine… But then a fightback began: in 1966 preprints were made outlaws, when Nature wanted to take “lethal steps” to end preprints. In 1969 we had a thing called the “Ingelfinger Rule” – we’ll come back to that later… Technology-wise, various technologies ticked along… In 1989 Tim Berners-Lee came along; in 1991 CERN set up the web, and arXiv was also set up and grew swiftly… About 8k preprints are uploaded to arXiv each month as of 2016. Then, in 2007-12, we had Nature Precedings…

But in 2007, the fightback began again… In 2012 the Ingelfinger rule was creating stress… There are almost 35k journals; only 37 still use the Ingelfinger rule… But they include key journals like Cell.

But we also saw the launch of bioRxiv in 2013. And we’ve had an explosion of preprints since then… Also in 2013 the £5m Center for Open Science was set up. This provides a central place for preprints… A central space, with over 2m preprints so far. There are now a LOT of new …Xiv preprint sites. In 2015 we saw the launch of the ASAPbio movement.

Earlier this year Mark Zuckerberg invested billions in bioRxiv… But everything comes at a price…

Scotland spends on average £11m per year to access research through journals. The best average figure for APCs I could find is $906. Per preprint it’s $10. If you want to post a preprint you have to check the terms of your journal – these are usually extremely clear. Best to check in SHERPA/RoMEO.
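To put those per-article figures side by side, here is a minimal sanity-check sketch. The $906 and $10 costs are the speaker’s figures; the annual article count is a hypothetical number chosen purely for scale:

```python
# Comparing the quoted per-article costs: ~$906 for an average APC
# vs ~$10 to host a preprint. The article count is hypothetical.
apc_per_article = 906       # average APC quoted in the talk ($)
preprint_per_article = 10   # per-preprint cost quoted in the talk ($)

articles_per_year = 10_000  # hypothetical annual output, for scale only
apc_total = articles_per_year * apc_per_article
preprint_total = articles_per_year * preprint_per_article

print(f"APCs: ${apc_total:,}")              # APCs: $9,060,000
print(f"Preprints: ${preprint_total:,}")    # Preprints: $100,000
print(f"Ratio: ~{apc_total // preprint_total}x")  # Ratio: ~90x
```

Whatever article count you assume, the ratio stays at roughly 90:1, which is the speaker’s underlying point about preprints being a far cheaper dissemination route.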

If you want to find out more about preprints there is a great Twitter list, some recommended preprints reading, and these slides are available online.


Q1: I found SHERPA/RoMEO by accident… But really useful. Who runs it?

A1: It’s funded by Jisc

Q2: How about findability…

A2: ArXiv usually points to where this work has been submitted. And you can go back and add the DOI once published.

Q2: It’s acting as a static archive then? To hold the green copy

A2: And there is collaborative activity across that… And there is work to make those findable, to share them, they are shared on PubMed…

Q2: One of the problems I see is purely discoverability… Getting it easy to find on Google. And integration into knowledgebases, can be found in libraries, in portals… Hard for a researcher looking for a piece of research… They look for a subject, a topic, to search an aggregated platform and link out to it… To find the repository… So people know they have legal access to preprint copies.

A2: You have CORE at the OU, which aggregates preprints and suggests additional items when you search. There is ongoing work to integrate with CRIS systems, which are frequently commercial, so there are interoperability issues here.

Comment: ArXiv is still the place for high energy physics so that is worth researchers going directly too…

Q3: Can I ask about preprints and research evaluation in the US?

A3: It’s an important way to get the work out… But the lack of peer review is an issue there so emerging stuff there…

GS: My last paper was taking forever to come out; we thought it wasn’t going to happen… We posted to PeerJ but discovered that the journal did use the Ingelfinger Rule, which scuppered us…

Comment: There are some publishers that want to put preprints on their own platform, so everything stays within their space… How does that sit/conflict with what libraries do…

GS: It’s a bit “us! us! us!”

Comment: You could see all submitted to that journal, which is interesting… Maybe not health… What happens if not accepted… Do you get to pull it out? Do you see what else has been rejected? Could get dodgy… Some potential conflict…

Comment: I believe it is positioned as a separate entity but with a path of least resistance… It’s a question… The thing is.. If we want preprints to be more in academia as opposed to publishers… That means academia has to have the infrastructure to do that, to connect repositories discoverable and aggregated… It’s a potential competitive relationship… Interesting to see how it plays out…

Comment: For Scopus and Web of Science… Those won’t take preprints… Takes ages… And do you want to give up more rights to the journals… ?

Comment: Can see why people would want multiple copies held… That seems healthy… My fear is it requires a lot of community based organisation to be a sustainable and competitive workflow…

Comment: Worth noting the radical “platinum” open access… Lots of preprints out there… Why not get authors to submit them, and organise them into a free, open journal without a publisher… That’s Tim Gowers’ thing… It’s not hard to put together a team to peer review thematically and put out issues of a journal with no charges…

GS: That’s very similar to the Open Library of Humanities… And the Wellcome Trust & Gates Foundation stuff, and the big EU platform. But the Gates one could be huge. Wellcome Trust is relatively small so far… But an EU-wide platform will have major ramifications…

Comment: Platinum is more about overlay journals… Also like SCOAP3, and they do metrics on citations etc. to compare use…

GS: In open access we know about green, gold and with platinum it’s free to author and reader… But use of words different in different contexts…

Q4: What do you think the future is for pre-prints?

A4 – GS: There is a huge boom… There’s currently some duplication of central open preprints platforms. But information on use is clear and uptake is on the rise… It will plateau at some point, as PLOS ONE did. They launched in 2006 and probably plateaued around 2015. But it is number 2 in the charts of mega-journals, behind Scientific Reports. They increased APCs (around $1,450) and that didn’t help (especially as they were profitable)…

SESSION THREE: Raising your research profile: online engagement & metrics

Green, Gold, and Getting out there: How your choice of publisher services can affect your research profile and engagement – Laura Henderson, Editorial Program Manager, Frontiers

We are based in Lausanne in Switzerland. We are a fully digital, fully open access publisher. All 58 of our journals are published under CC-BY licenses. And the organisation was set up by scientists who wanted to change the landscape. So I wanted to talk today about how this can change your work.

What is traditional academic publishing?

Typically readers pay – journal subscriptions via institution/library, or pay per view. Given the costs and number of articles, these are expensive – $14B in journal revenue in 2014 works out at around $7k per article. It’s slow too… The journal rejection cascade can take 6 months to a year each time. Up to 1 million papers – valid papers – are rejected every year. And this limits access to research: around 80% of research papers are behind subscription paywalls. So knowledge gets out very slowly and inaccessibly.
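As a quick sanity check of those two figures (the ~2 million articles per year is inferred from the speaker’s numbers, not stated in the talk):

```python
# The speaker's figures: ~$14B of journal revenue in 2014 at ~$7k
# per article implies roughly 2 million published articles per year.
revenue_2014 = 14e9      # total journal revenue quoted ($)
cost_per_article = 7e3   # implied average revenue per article ($)

implied_articles = revenue_2014 / cost_per_article
print(f"{implied_articles:,.0f} articles per year")  # 2,000,000 articles per year
```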

By comparison, open access… Well, Green OA allows you to publish and then self-archive your paper in a repository where it can be accessed for free. You can use an institutional or central repository, or I’d suggest both. And there can be a delay due to embargo. Gold OA makes research output immediately available from the publisher, and you retain the copyright, so no embargoes. It is fully discoverable via indexing and professional promotion services to relevant readers. There is no subscription fee to the reader, but it usually involves APCs paid by the institution.

How does Open Access publishing compare? Well, it inverts the funding – the institution/grant funder supports authors directly, rather than paying huge subscription fees for packages dictated by publishers. It’s cheaper – Green OA is usually free. The average Gold OA fee is c. $1000-$3000 – actually half what is paid per article under subscription publishing. We do see projections of open access overtaking subscription publishing by 2020.

So, what benefits does open access bring? Well there is peer-review; scalable publishing platforms; impact metrics; author discoverability and reputation.

And I’d now like to show you what you should look for from any publisher – open access or others.

Firstly, you should expect basic services: quality assurance and indexing. Peter Suber suggests checking the DOAJ – Directory of Open Access Journals. You can also see if the publisher is part of OASPA, which excludes publishers who fail to meet their standards. What else? Look for peer review and good editors – you can find the joint COPE/OASPA/DOAJ Principles of Transparency and Best Practice in Scholarly Publishing. So you need to have clear peer review processes. And you need a governing board and editors.

At Frontiers we have an impact-neutral peer review process. We don’t screen for papers with highest impact. Authors, reviewers and the handling Associate Editor interact directly with each other in the online forum. Names of editors and reviewers are published on the final version of the paper. And this leads to an average of 89 days from submission to acceptance – and that’s an industry-leading timing… And that’s what won an ALPSP Innovation Award.

So, what are the extraordinary services a top OA publisher can provide? Well altmetrics are more readily available now. Digital articles are accessible and trackable. In Frontiers our metrics are built into every paper… You can see views, downloads, and reader demographics. And that’s post-publication analytics that doesn’t rely on impact factor. And it is community-led impact – your peers decide the impact and importance.

How discoverable are you? We launched a bespoke built-in networking profile for every author and user: Loop. It scrapes all major index databases to find your work – constantly updating. It’s linked to ORCID and is included in the peer review process. When people look at your profile you can truly see your impact in the world.

In terms of how peers find your work we have article alerts going to 1 million people, and a newsletter that goes to 300k readers. And our articles have 250 million article views and downloads, with hotspots in Mountain View California, and in Shendeng, and areas of development in the “Global South”.

So when you look for a publisher, look for a publisher with global impact.

What are all these dots and what can linking them tell me? – Rachel Lammey, Crossref

Crossref are a not for profit organisation. So… We have articles out there, datasets, blogs, tweets, Wikipedia pages… We are really interested to understand these links. We are doing that through Crossref Event Data, tracking the conversation, mainly around objects with a DOI. The main way we use and mention publications is in the citations of articles. That’s the traditional way to discuss research and understand news. But research is being used in lots of different ways now – Twitter and Reddit…

So, where does Crossref fit in? It is the DOI registration agency for scholarly content. Publishers register their content with us. URLs do change and do break… And that means you need something more persistent so it can still be used in research… Last year at ReCon we tried to find DOI gaps in reference lists – hard to do. Even within journals publications move around… And switch publishers… The DOI fixes that reference. We are sort of a switchboard for that information.

I talked about citations and references… Now we are looking beyond that. It is about capturing data and relationships so that understanding and new services (by others) can be built… As such it’s an API (Application Programming Interface) – it’s lots of data rather than an interface. So it captures subject, relation, object, tweet, mentions, etc. We are generating this data (as of yesterday we’ve seen 14m events), but we are not doing anything with it ourselves, so this is a clear set of data to do further work on.
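The subject–relation–object shape Rachel describes can be sketched in a few lines. This is a hedged illustration: the field names (`subj_id`, `relation_type_id`, `obj_id`, `source_id`) follow my reading of the public Crossref Event Data documentation, while the sample events and DOIs are invented for the example.

```python
from collections import Counter

# Invented sample events, shaped like Event Data records
sample_events = [
    {"subj_id": "https://twitter.com/example/status/1", "relation_type_id": "discusses",
     "obj_id": "https://doi.org/10.5555/12345678", "source_id": "twitter"},
    {"subj_id": "https://en.wikipedia.org/wiki/Example", "relation_type_id": "references",
     "obj_id": "https://doi.org/10.5555/12345678", "source_id": "wikipedia"},
    {"subj_id": "https://someblog.example/post", "relation_type_id": "discusses",
     "obj_id": "https://doi.org/10.5555/87654321", "source_id": "web"},
]

def events_per_source(events):
    """Count subject-relation-object events by the source that reported them."""
    return Counter(e["source_id"] for e in events)

def mentions_of(events, doi_url):
    """All (subject, relation) pairs pointing at one DOI."""
    return [(e["subj_id"], e["relation_type_id"])
            for e in events if e["obj_id"] == doi_url]
```

The point of the triple shape is exactly what the talk says: Crossref supplies the raw links, and any analysis (per-source counts, per-DOI mention lists) is left to whoever consumes the data.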

We’ve been doing work with NISO Working Group on altmetrics, but again, providing the data not the analysis. So, what can this data show? We see citation rings/friends gaming the machine; potential peer review scams; citation patterns. How can you use this data? Almost any way. Come talk to us about Linked Data; Article Level Metrics; general discoverability, etc.
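As a rough illustration of the “citation rings” point, one naive signal is pairs of works that cite each other. Everything below is hypothetical – the DOIs are made up and real analyses over this kind of data would be far more sophisticated than a reciprocity check.

```python
# Hypothetical citation edges: (citing DOI, cited DOI)
citations = [
    ("10.5555/a", "10.5555/b"),
    ("10.5555/b", "10.5555/a"),  # reciprocal pair: one crude "ring" signal
    ("10.5555/a", "10.5555/c"),
]

def reciprocal_pairs(edges):
    """Flag pairs of works that cite each other."""
    edge_set = set(edges)
    return sorted({tuple(sorted(e)) for e in edge_set
                   if (e[1], e[0]) in edge_set})
```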

We’ve done some work ourselves… For instance the Live Data from all sources – including Wikipedia citing various pages… We have lots of members in Korea, and started looking just at citations on Korean Wikipedia. It’s free under a CC0 license. If you are interested, go make something cool… Come ask me questions… And we have a beta testing group and we welcome your feedback and experiments with our data!

The wonderful world of altmetrics: why researchers’ voices matter – Jean Liu, Product Development Manager, Altmetric

I’m actually five years out of graduate school, so I have some empathy with PhD students and ECRs. I really want to go through what Altmetrics is and what measures there are. It’s not controversial to say that altmetrics have been experiencing a meteoric rise over the last few years… That is partly because we have so much more to draw upon than the traditional journal impact factors, citation counts, etc.

So, who are we? We have about 20 employees, were founded in 2011, and are all based in London. And we’ve started to see that people are receptive to altmetrics, partly because of the (near) instant feedback… We tune into the Twitter firehose – that phrase is apt! Altmetrics also showcase many “flavours” of attention and impact that research can have – and not just articles. And the signals we track are highly varied: policy documents, news, blogs, Twitter, post-publication peer review, Facebook, Wikipedia, LinkedIn, Reddit, etc.
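For a sense of what those varied signals look like per paper, here is a minimal sketch of summarising a few per-source counts for one DOI. The endpoint URL and field names (`cited_by_tweeters_count`, etc.) reflect my understanding of Altmetric’s public details API – treat them as assumptions – and the sample record is invented.

```python
def altmetric_url(doi):
    # Assumption: Altmetric's public details endpoint, no key for basic lookups
    return "https://api.altmetric.com/v1/doi/" + doi

def summarise(record):
    # Pick out a few of the per-source counts a details record may carry
    fields = ("cited_by_tweeters_count", "cited_by_posts_count",
              "cited_by_policies_count")
    return {f: record.get(f, 0) for f in fields}

# Invented sample record, shaped like an Altmetric details response
sample = {"cited_by_tweeters_count": 42, "cited_by_posts_count": 7}
```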

Altmetrics also have limitations. They are not a replacement for peer review or citation-based metrics. They can be gamed – but data providers have measures in place to guard against this. We’ve seen interesting attempts at gamification – but often caught…

Researchers are not only the ones who receive attention in altmetrics, but they are also the ones generating attention that make up altmetrics – but not all attention is high quality or trustworthy. We don’t want to suggest that researchers should be judged just on altmetrics…

Meanwhile universities are asking interesting questions: how can our researchers change policy? Which conference can I send people to that will be most useful? Etc.

So, let’s take the topic of “diabetic neuropathy”. Looking around we can see a blog, an NHS/NICE guidance document, and a piece in The Conversation. A whole range of items here. And you can track attention over time… Both by volume, but also you can look at influencers across e.g. News Outlets, Policy Outlets, Blogs and Tweeters. And you can understand where researcher voices feature (all are blogs). And I can then compare news and policy and see the difference. The profiles for News and Blogs are quite different…

How can researchers voices be heard? Well you can write for a different audience, you can raise the profile of your work… You can become that “go-to” person. You also want to be really effective when you are active – altmetrics can help you to understand where your audience is and how they respond, to understand what is working well.

And you can find out more by trying the altmetric bookmarking browser plugin, by exploring these tools on publishing platforms (where available), or by taking a look.

How to help more people find and understand your work – Charlie Rapple, Kudos

I’m sorry to be the last person on the agenda, you’ll all be overwhelmed as there has been so much information!

I’m one of the founders of Kudos and we are an organisation dedicated to helping you increase the reach and impact of your work. There is such competition for funding, a huge growth in outputs, there is a huge fight for visibility and usage, a drive for accountability and a real cult of impact. You are expected to find and broaden the audience for your work, to engage with the public. And that is the context in which we set up Kudos. We want to help you navigate this new world.

Part of the challenge is knowing where to engage. We did a survey last year with around 3000 participants to ask how they share their work – conferences, academic networking, conversations with colleagues all ranked highly; whilst YouTube, slideshare, etc. are less used.

Impact is built on readership – impacts cross a variety of areas… But essentially it comes down to getting people to find and read your work. So, for me it starts with making sure you increase the number of people reaching and engaging with your work. Hence the publication is at the centre – for now. That may well be changing as other material is shared.

We’ve talked a lot about metrics, there are very different ones and some will matter more to you than others. Citations have high value, but so do mentions, clicks, shares, downloads… Do take the time to think about these. And think about how your own actions and behaviours contribute back to those metrics… So if you email people about your work, track that to see if it works… Make those connections… Everyone has their own way and, as Nicola was saying in the Digital Footprint session, communities exist already, you have to get work out there… And your metrics have to be about correlating what happens – readership and citations. Kudos is a management tool for that.

In terms of justifying the time spent: communications do increase impact. We have been building up data on how that takes place. A team from Nanyang Technological Institute did a study of our data in 2016 and they saw that authors using the Kudos tools – promoting their work – had 23% higher growth in downloads of full text on publisher sites. And that really shows the value of doing that engagement. It will actually lead to meaningful results.

So a quick look at how Kudos works… It’s free for researchers and it takes about 15 minutes to set up, about 10 minutes each time you publish something new. You can find a publication, you can use your ORCID if you have one… It’s easy to find your publication and once you have, you have a page for that work where you can create a plain language explanation of it and why it is important – that is grounded in talking to researchers about what they need. That plain text is separate from the abstract. It’s that first quick overview. The advantage of this is that it is easier for people within the field to skim and scan your work; people outside your field in academia can skip the terminology of your field and understand what you’ve said. And there are people outside academia who want to get a handle on research and apply it in non-academic ways. People can actually access your work and actually understand it. There is a lot of research to back that up.

Also on publication page you can add all the resources around your work – code, data, videos, interviews, etc. So for instance Claudia Sick does work on baboons and why they groom where they groom – that includes an article and all of that press coverage together. That publication page gives you a URL, you can post to social media from within Kudos. You can copy the trackable link and paste wherever you like. The advantage to doing this in Kudos is that we can connect that up to all of your metrics and your work. You can get them all in one place, and map it against what you have done to communicate. And we map those actions to show which communications are more effective for sharing… You can really start to refine your efforts… You might have built networks in one space but the value might all be in another space.

Sign up now – we are about to launch a game on building up your profile and impact, which scores your research impact and lets you compare to others.

PANEL DISCUSSION – Laura Henderson, Editorial Program Manager, Frontiers (LH); Rachel Lammey, Crossref (RL); Jean Liu, Product Development Manager, Altmetric (JL); Charlie Rapple, Kudos (CR). 

Q1: Really interesting but how will the community decide which spaces we should use?

A1 (CR): Yes, in the Nanyang work we found that most work was shared on Facebook, but more links were engaged with on Twitter. There is more to be done, and more to filter through… But we have to keep building up the data…

A1 (LH): We are coming from the same sort of place as Jean there, altmetrics are built into Frontiers, connected to ORCID, Loop built to connect to institutional plugins (totally open plugin). But it is such a challenge… Facebook, Twitter, LinkedIn, SnapChat… Usually personal choice really, we just want to make it easier…

A1 (JL): It’s about interoperability. We are all working in it together. You will find certain stats on certain pages…

A1 (RL): It’s personal choice, it’s interoperability… But it is about options. Part of the issue with impact factor is the issue of being judged by something you don’t have any choice or impact upon… And I think that we need to give new tools, ways to select what is right for them.

Q2: These seem like great tools, but how do we persuade funders?

A2 (JL): We have found funders being interested independently, particularly in the US. There is this feeling across the scholarly community that things have to change… And funders want to look at what might work, they are already interested.

A2 (LH): We have an office in Brussels which lobbies the European Commission; we are trying to get our voice for Open Science heard, to make a difference to policies and mandates… The impact factor has been convenient, it’s well embedded, it was designed by an institutional librarian, so we are out lobbying for change.

A2 (CR): Convenience is key. Nothing has changed because nothing has been convenient enough to replace the impact factor. There is a lot of work and innovation in this area, and it is not only on researchers to make that change happen, it’s on all of us to make that change happen now.

Jo Young (JY): To finish, a few thank yous… Thank you all for coming today, to all of our speakers, and a huge thank you to Peter and Radic (our cameramen), to Anders, Graham and Jan for their work in planning this. And to Nicola and Amy who have been liveblogging, and to all who have been tweeting. Huge thanks to CrossRef, Frontiers, F1000, JYMedia, and PLoS.

And with that we are done. Thanks to all for a really interesting and busy day!


Jun 282017

Today I am at the eLearning@ed Conference 2017, our annual day-long event for the eLearning community across the University of Edinburgh – including learning technologists, academic staff and some postgraduate students. As I’m convener of the community I’m also chairing some sessions today so the notes won’t be at quite my normal pace!

As usual comments, additions and corrections are very welcome. 

For the first two sections I’m afraid I was chairing so there were no notes… But huge thanks to Anne Marie for her excellent quick run through exciting stuff to come… 

Welcome – Nicola Osborne, elearning@ed Convenor

Forthcoming Attractions – Anne Marie Scott, Head of Digital Learning Applications and Media

And with that it was over to our wonderful opening keynote… 

Opening Keynote: Prof. Nicola Whitton, Professor of Professional Learning, Manchester Metropolitan University: Inevitable Failure Assessment? Rethinking higher education through play (Chair: Dr Jill MacKay)

Although I am in education now, my background is as a computer scientist… So I grew up with failure. Do you remember the ZX Spectrum? Loading games there was extremely hit and miss. But the games there – all text based – were brilliant, they worked, they took you on adventures. I played all the games but I don’t think I ever finished one… I’d get a certain way through and then we’d have that idea of catastrophic failure…

And then I met a handsome man… It was unrequited… But he was a bit pixellated… Here was Guybrush Threepwood of the Monkey Island series. And that game changed everything – you couldn’t catastrophically fail, it was almost impossible. But in this game you can take risks, you can try things, you can be innovative… And that’s important for me… That space for failure…

What matters is the way that we and our students think about failure in Higher Education, and deal with failure in Higher Education. If we think we can go through life and never fail, we are set for disappointment. We don’t laud the failures. J.K. Rowling, the biggest author, was rejected 12 times. The Beatles, the biggest band of the 20th Century, were rejected by record labels many many times. The lightbulb failed hundreds of times! Thomas Edison said he didn’t fail 100 times, he succeeded in lots of stages…

So, to laud failure… Here are some of mine:

  1. Primary 5 junior mastermind – I’m still angry! I chose horses as my specialist subject so, a tip, don’t do that!
  2. My driving test – that was a real resilience moment… I’ll do it again… I’ll have more lessons with my creepy driving instructor, but I’ll do it again.
  3. First year university exams – failed one exam, by one mark… It was borderline and they said “but we thought you needed to fail” – I had already been told off for not attending lectures. So I gave up my summer job, spent the summer re-sitting. I learned that there is only so far you can push things… You have to take things seriously…
  4. Keeping control of a moped – in Thailand, with no training… Driving into walls… And learning when to give up… (we then went by walking and bus)
  5. Funding proposals and article submissions, regularly, too numerous to count – failure is inevitable… As academics we tend not to tell you about all the times we fail… We are going to fail… So we have to be fine to fail and learn from it. I was involved in a Jisc project in 2009… I’ve published most on it… It really didn’t work… And when it didn’t work they funded us to write about that. And I was very lucky, one of the Innovation Programme Managers who had funded us said “hey, if some of our innovation funding isn’t failing, then we aren’t being innovative”. But that’s not what we talk about.

For us, for our students… We have to understand that failure is inevitable. Things are currently set up as failure being a bad outcome, rather than an integral part of the learning process… And learning from failure is really important. I have read something – though I’ve not been able to find it again – that those who pass their driving test on the second attempt are better drivers. Failure is about learning. I have small children… They spent their first few years failing to talk then failing to walk… That’s not failure though, it’s how we learn…

Just a little bit of theory. I want to talk a bit about the concept of the magic circle… The Magic Circle came from game theory, from the 1950s. Picked up by ? Zimmerman in early 2000s… The idea is that when you play with someone, you enter this other space, this safe space, where normal rules don’t apply… Like when you see animals playfighting… There is mutual agreement that this doesn’t count, that there are rules and safety… In Chess you don’t just randomly grab the king. Pub banter can be that safe space with different rules applying…

This happens in games, this happens in physical play… How can we create magic circles in learning… So what is that:

  • Freedom to fail – if you won right away, there’s no point in playing it. That freedom to fail and not be constrained by the failure… How we look at failure in games is really different from how we look at failure in Higher Education.
  • Lusory attitude – this is about a willingness to engage in play, to forget about the rules of the real world, to abide by the rules of this new situation. To park real life… To experiment, that is powerful. And that idea came from Bernard Suits, whose book, The Grasshopper, is a great Playful Learning read.
  • Intrinsic motivation – this is the key area of magic circle for higher education. The idea that learning can be and should be intrinsically motivating is really really important.

So, how many of you have been in an academic reading group? OK, how many have lasted more than a year? Yeah, they rarely last long… People don’t get round to reading the book… We’ve set up a book group with special rules: you either HAVE TO read the book, or you HAVE TO PRETEND that you read the book. We’ve had great turn out, no idea if they all read the books… But we have great discussion… Reframing that book group just a small bit makes a huge difference.

That sort of tiny change can be very powerful for integrating playfulness. We don’t think twice about doing this with children… Part of the issue with play, especially with adults, is what matters about play… About that space to fail. But also the idea of play as a socialised bonding space, for experimentation, for exploration, for possibilities, for doing something else, for being someone else. And the link with motivation is quite well established… I think we need to understand that different kind of play has different potential, but it’s about play and people, and safe play…

This is my theory heavy slide… This is from a paper I’ve just completed with colleagues in Denmark. We wanted to think “what is playful learning”… We talk about Higher Education and playful learning in that context… So what actually is it?

Well there is signature pedagogy for playful learning in higher education, under which we have surface (game) structures; deep (play) structures; implicit (playful) structures. Signature pedagogy could be architecture or engineering…

This came out of work on what students respond to…

So Surface (game) structures includes: ease of entry and explicit progression; appropriate and flexible levels of challenge; engaging game mechanics; physical or digital artefacts. Those are often based around games and digital games… But you can be playful without games…

Deep (play) structures is about: active and physical engagement; collaboration with diversity; imagining possibilities; novelty and surprises.

Implicit (playful) structures: lusory attitude; democratic values and openness; acceptance of risk-taking and failure; intrinsic motivation. That is so important for us in higher education…

So, rant alert…

Higher Education is broken. And that is because schools are broken. I live in Manchester (I know things aren’t as bad in Scotland) and we have assessment all over the place… My daughter is 7 and sitting exams. Two weeks of them. They are talking about exams for reception kids – 4 year olds! We have a performative culture of “you will be assessed, you will be assessed”. And then we are surprised when that’s how our students respond… And we have the TEF appearing… The golds, silvers, and bronzes… Based on fairly random metrics… And then we are surprised when people work to the metrics. I think that assessment is a great way to suck out all the creativity!

So, some questions my kids have recently asked:

  • Are there good viruses? I asked an expert… apparently there are, for treating people… (But they often mutate.)
  • Do mermaids lay eggs? Well they are part fish…
  • Do Snow Leopards eat tomatoes? Where did this question come from? Who knows? Apparently they do eat monkeys… What?!

But contrast that to what my students ask:

  • Will I need to know this for the exam?
  • Are we going to be assessed on that?

That’s what happens when we work to the metrics…

We are running a course where there were two assessments. One was formative… And students got angry that it wasn’t worth credit… So I started to think about what was important about assessment? So I plotted the feedback from low to high, and consequence from low to high… So low consequence, low feedback…

We have the idea of the Trivial Fail – we all do those and it doesn’t matter (e.g. forgetting to signal at a roundabout), and lots of opportunity to fail like that.

We also have the Critical Fail – High Consequence and Low Feedback – kids exams and quite a lot of university assessment fits there.

We also have Serious Fail – High Consequence and High Feedback – I’d put PhD Vivas there… consequences matter… But there is feedback and can be opportunity to manage that.

What we need to focus on in Higher Education is the Micro Fail – low consequence with high feedback. We need students to have that experience, and to value that failure, to value failure without consequence…

So… How on earth do we actually do this? How about we “Level Up” assessment… With bosses at the end of levels… And you keep going until you reach as far as you need to go, and have feedback filled in…

Or the Monkey Island assessment. There is a goal but it doesn’t matter how you get there… You integrate learning and assessment completely, and ask people to be creative…

Easter Egg assessment… Not to do with chocolate but “Easter Eggs” – surprises… You don’t know how you’ll be assessed… Or when you’ll be assessed… But you will be! And it might be fun! So you have to go to lectures… Real life works like that… You can’t know which days will count ahead of time.

Inevitable Failure assessment… You WILL fail first time, maybe second time, third time… But eventually pass… Or even maybe you can’t ever succeed and that’s part of the point.

The point is that failure is inevitable and you need to be able to cope with that and learn from that. On which note… Here is my favourite journal, the Journal of Universal Rejection… This is quite a cathartic experience, they reject everything!

So I wanted to talk about a project that we are doing with some support from the HEA… Eduscapes… Have you played Escape Rooms? They are so addictive! There are lots of people creating educational Escape Rooms… This project is a bit different… So there are three parts… You start by understanding what the Escape Room is, how they work; then some training; and then design a game. But they have to trial them again and again and again. We’ve done this with students, and with high school students three times now. There is inevitable failure built in here… And the project can run over days or weeks or months… But you start with something and try and fail and learn…

This is collaborative, it is creative – there is so much scope to play with, sometimes props, sometimes budget, sometimes what they can find… In the schools case they were maths and Comp Sci students so there was a link to the curriculum. It is not assessed… But other people will see it – that’s quite a powerful motivator… We have done this with reflection/portfolio assessment… That resource is now available, there’s a link, and it’s a really simple way to engage in something that doesn’t really matter…

And while I’m here I have to plug our conference, Playful Learning, now in its second year. We were all about thinking differently about conferences… But always presenting at traditional conferences. So our conference is different… Most of it is hands on, all different stuff, a space to do something different – we had a storytelling in a tent as one of these… Lots of space but nothing really went wrong. But we need something to fail. Applications are closed this year… But there will be a call next year… So play more, be creative, fail!

So, to finish… I’m playful, play has massive potential… But we also have to think about diversity of play, the resilience to play… A lot of the research on playful learning and assessment doesn’t recognise the importance of gender, race, context, etc… And the importance of the language we use in play… It has nuance, and comes with distinctions… We have to encourage people to play and get involved. And we really have to re-think assessment – for ourselves, of universities, of students, of school pupils… Until we rethink this, it will be hard to have any real impact for playful learning…

Jill: Thank you so much, that was absolutely brilliant. And that Star Trek reference is “Kobayashi Maru”!


Q1) In terms of playful learning and assessment, I was wondering how self-assessment can work?

A1) That brings me back to previous work I have done around reflection… And I think that’s about bringing that reflection into playful assessment… But it’s a hard question… More space and time for reflection, possibly more space for support… But otherwise not that different from other assessment.

Q2) I run a research methods course for an MSc… We tried to invoke playfulness with a fake data set with dragons and princesses… Any other examples of that?

A2) I think that that idea of it being playful, rather than games, is really important. You can use playful images, or data that makes rude shapes when you graph it!

Q3) Nic knows that I don’t play games… I was interested in that difference between gaming and play and playfulness… There is something about games that don’t entice me at all… But that Lusory attitude did feel familiar and appealing… That suspension of disbelief and creativity… And that connection with gendered discussion of play and games.

A3) We are working on a taxonomy of play. That’s quite complex… Some things are clearly play… A game, messing with LEGO… Some things are not play, but can be playful… Crochet… Jigsaw puzzles… They don’t have to be creative… But you can apply that attitude to almost anything. So there is play and there is a playful attitude… That latter part is the key thing, the being prepared to fail…

Q4) Not all games are fun… Easy to think playfulness and games… A lot of games are work… Competitive gaming… Or things like World of Warcraft – your wizard chores. And intensity there… Failure can be quite problematic if working with 25 people in a raid – everyone is tired and angry… That’s not a space where failure is ok… So in terms of what we can learn from games it is important to remember that games aren’t always fun or playful…

A4) Indeed, and not all play is fun… I hate performative play – improv, people touching me… It’s about understanding… It’s really nuanced. It used to be that “students love games because they are fun” and now “students love play because it’s fun” and that’s still missing the point…

Q5) I don’t think you are advocating this but… Thinking about spoonful of sugar making assessment go down… Tricking students into assessment??

A5) No. It’s taking away the consequences in how we think about assessment. I don’t have a problem with exams, but the weight on that, the consequences of failure. It is inevitable in HE that we grade students at different levels… So we have to think about how important assessment is in the real world… We don’t have equivalents of University assessments in the real world… Let’s say I do a bid, lots of work, not funded… In the real world I try again. If you fail your finals, you don’t get to try again… So it’s about not making it “one go and it’s over”… That’s hard but a big change and important.

Q6) I started in behavioural science in animals… Play there is “you’ll know it when you see it” – we have clear ideas of what other behaviours look like, but play is hard to describe but you know it when you see it… How does that work in your taxonomy…

A6) I have a colleague who is a physical science teacher trainer… And he’s gotten to “you’ll know it when you see it”… Sometimes that is how you perceive that difference… But that’s hard when you apply for grants! It’s a bit of an artificial exercise…

Q7) Can you tell us more about play and cultural diversity, and how we need to think about that in HE?

A7) At the moment we are at the point that people understand and value play in different ways. I have a colleague looking at diversity in play… A lot of research previously is on men, and privileged white men… So partly it’s about explaining why you are doing what you are doing, in the way you are doing it… You have to think beyond that, to appropriateness, to have play in your toolkit…

Q8) You talk about physical spaces and playfulness… How much impact does that have?

A8) It’s not my specialist area but yes, the physical space matters… And you have to think about how to make your space more playful…

Introductions to Break Out Sessions: Playful Learning & Experimentation (Nicola Osborne)

  • Playful Learning – Michael Boyd (10 min)

We are here today with the UCreate Studio… I am the manager of the space, and we have student assistants. We also have high school students supporting us too. This pilot runs to the end of July and provides a central Maker Space… To create things, to make things, to generate ideas… This is a mixture of the maker movement; we are a space for playful learning through making. There are about 1400 maker spaces worldwide, many in Universities in the UK too… Why do they pop up in Universities? They are great creative spaces to learn.

You can get hands on with technology… It is about peer based learning… And project learning… It’s a safe space to fail – it’s non assessed stuff…

Why is it good for learning? Well for instance the World Economic Forum predict that 35% of core professional skills will change from 2015 to 2020. Complex problem solving, critical thinking, creativity, judgement and decision making, cognitive flexibility… These are things that can’t be automated… And can be supported by making and creating…

So, what do we do? We use new technologies, we use technologies that are emerging but not yet widely adopted. And we are educational… That first few months is the hard bit… We don’t lecture much, we are there to help and guide and scaffold. Students can feel confident that they have support if they need it.

And, we are open source! Anyone in the University can use the space, be supported in the space, for free as long as they openly share and license whatever they make. Part of that bigger open ethos.

So, what gets made? Includes academic stuff… Someone made a holder for his spectrometer and 3D printed it. He’s now looking to augment this with his chemistry to improve that design; we have Josie in archaeology scanning artefacts and then using that to engage people – using VR; Dimitra in medicine, following a poster project for a cancer monitoring chip, started prototyping; Hayden in Geosciences is using 3D scanning to see the density of plant matter to understand climate change.

But it’s not just that. Also other stuff… Henry studies architecture, but has a grandfather who needs meds, and his family worries about whether he takes his medicine… So he’s designed a system that connects to a display showing that. Then Greg in ECA is looking at projecting memories on people… To see how that helps…

So, I wanted to flag some ideas we can discuss… In one of the first projects when I arrived, Fiona Hale and Chris Speed (ECA) ran “Maker Go”, which had product design students from across the years come up with mobile maker space projects… Results were fantastic – a bike to use to scan a space… A way to follow and make paths with paint, to a coffee machine powered by failed crits, etc. Brilliant stuff. And afterwards there was a self-organised (the first they can remember) exhibition, Velodrama…

Next up was the Edinburgh IoT challenge… Students and academics came together to address challenges set by the Council, the University, etc. Designers, engineers, scientists… It led to a really special project: 2 UG students approached us to set up the new Embedded and Robotics Society – they run sessions every two weeks. And it’s going from strength to strength.

Last but not least… A digital manufacturing IP session trialled last term with Dr Stema Kieria, to explore 3D scanning and printing and the impact on IP… Huge areas… Echoes of taping songs off the radio. We took something real, showed it hands on, learned about the technologies, scanned copyright materials, and explored this. They taught me stuff! And that led to a Law and Artificial Intelligence Hackathon in March. This was law and informatics working together, huge ideas… We hope to see them back in the studio soon!

  • Near Future Teaching Vox Pops – Sian Bayne (5 mins)

I am Assistant Vice Principal for Digital Education and I was very keen to look at designing the future of digital education at Edinburgh. I am really excited to be here today… We want you to answer some questions on what teaching will look like in this university in 20 or 30 years time:

  • will students come to campus?
  • will we come to campus?
  • will we have AI tutors?
  • How will teaching change?
  • Will learning analytics trigger new things?
  • How will we work with partner organisations?
  • Will peers accredit each other?
  • Will MOOCs still exist?
  • Will performance enhancement be routine?
  • Will lectures still exist?
  • Will exams exist?
  • Will essays be marked by software?
  • Will essays exist?
  • Will discipline still exist?
  • Will the VLE still exist?
  • Will we teach in VR?
  • Will the campus be smart? And what does e.g. IoT to monitor spaces mean socially?
  • Will we be smarter through technology?
  • What values should shape how we change? How we use these technologies?

Come be interviewed for our voxpops! We will be videoing… If you feel brave, come see us!

And now to a break… and our breakout sessions, which were… 

Morning Break Out Sessions

  • Playful Learning Mini Maker Space (Michael Boyd)
  • 23 Things (Stephanie (Charlie) Farley)
  • DIY Film School (Gear and Gadgets) (Stephen Donnelly)
  • World of Warcraft (download/set up information here) (Hamish MacLeod & Clara O’Shea)
  • Near Future Teaching Vox Pops (Sian Bayne)

Presentations: Fun and Games and Learning (Chair: Ruby Rennie, Lecturer, Institute for Education, Teaching and Leadership (Moray House School of Education))

  • Teaching with Dungeons & Dragons – Tom Boylston

I am based in Anthropology and we’ve been running a course on the anthropology of games. And I just wanted to talk about that experience of creating playful teaching and learning. So, Dungeons and Dragons was designed in the 1970s… You wake up, you’re chained up in a dungeon, you are surrounded by aggressive warriors… And as a player you choose what to do – fight them, talk to them, etc… And you can roll a die to decide an action, to make the next play. It is always a little bit improvisational, and that’s where the fun comes in!

There are some stigmas around D&D as the last bastion of the nerdy white bloke… But… The situation we had was a 2 hour lecture slot, and I wanted to split that in two. To engage with a reading on the creative opportunities of imagination. I wanted them to make a character, almost like in creative writing classes, to play that character and see what that felt like, how that changed things… Because part of the fun of role playing is getting to be someone else. Now these games do raise identity issues – gender, race, sexuality… That can be great but it’s not what you want in a big group with people you don’t yet have trust with… But there is something special about being in a space with others, where you don’t know what could happen… It is not a simple thing to take a traditional teaching setting and make it playful… One of the first things we look at when we think about play is people needing to consent to play… And if you impose that on a room, that’s hard…

So early in the course we looked at Erving Goffman’s Frame Analysis, and we used Pictionary cards… We looked at the social cues from the space, the placement of seats, microphones, etc. And then the social cues of play… Some of the foundational work on animal play asks us how you know dogs are playfighting… It’s the half-bite, playful rather than painful… So how do I invite a room full of people to play? I commanded people to play Pictionary, to come up and play… Eventually someone came up… Eventually the room accepted that and the atmosphere changed. It really helped that we had been reading about framing. And I asked what had changed and they were able to think and talk about that…

But D&D… People were sceptical. We started with students making me a character. They made me Englebert, a 5 year old lizard creature… To display the playful situation, a bit silly, to model and frame the situation… Sent them comedy D&D podcasts to listen to and asked them to come back a week later… I promised that we wouldn’t do it every week but… I shared some creative writing approaches to writing a back story, to understand what would matter about this character… Only having done this preparatory work, thought about framing… Only then did I try out my adventure on them… It’s about a masquerade in Cameroon, and children try on others’ masks… I didn’t want to appropriate that. But just to take some cues and ideas and tone from that. And when we got to the role playing, the students were up for it… And we did this either as individual students, or they could pair up…

And then we had a debrief – crucial for a playful experience like this. People said there was more negotiation than they expected as they set up the scene and created. They were surprised how people took care of their characters…

The concluding thing was… At the end of the course I had probably shared more that I cared about. Students interrupted me more – with really great ideas! And students really engaged.


Q1) Would you say that D&D would be a better medium than an online role playing game… Extemporisation rather than structured computation?

A1) We did talk about that… We created a WoW character… There really is a lot of space, unexpected situations you can create in D&D… Lots of improvisation… More happened in that than in the WoW stuff that we did… It was surprisingly great.

Q2) Is that partly about sharing and revealing you, rather than the playfulness per se?

A2) Maybe a bit… But I would have found that hard in another context. The discussion of games really brought that stuff out… It was great and unexpected… Play is the creation of unexpected things…

Q3) There’s a trust thing there… We can’t expect students to trust us and the process, unless we show our trust ourselves…

A3) There was a fair bit of background effort… Thinking about signalling a playful space, and how that changes the space… The playful situations did that without me intending to or trying to!

Digital Game Based Learning in China – Sihan Zhou

I have been finding this event really inspiring… There is so much to think around playfulness. I am from China, and the concept of playful learning is quite new in China so I’m pleased to talk to you about the platform we are creating – Tornado English…

On this platform we have four components, including a bilingual animation, a game, and a bilingual chatbot… If the user clicks on the game, they can download it… So far we have created two games: Word Pop – vocabulary learning – and Run Rabbit – syntactic learning – both based around Mayer’s model (2011).

Game mechanics are usually understood by comparing user skill and level of challenge – too easy and users will get bored, but if it’s too challenging then users will be frustrated and demotivated. For apps in China, many of the educational products tend to be more challenging than fun – more educational apps than educational games. So our games use timing and scoring to make things more playful, with interactions like popping bubbles and clicking on moles popping out of holes in the ground. In Word Smash students have to match images to vocab as quickly as possible… In Run Rabbit… The student has to speak a phrase in order to get the rabbit to run to the right word in the game and place it…
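As a rough illustration of the timing-and-scoring idea described above, here is a minimal sketch of a score function where faster correct answers earn a bigger bonus. This is hypothetical, not the actual Tornado English code; the point values and time limit are invented for the example.

```python
def round_score(correct: bool, seconds_taken: int,
                base_points: int = 100, time_limit: int = 10) -> int:
    """Score one answer: a correct answer earns base points plus a bonus
    that shrinks as the player takes longer, keeping time pressure playful."""
    if not correct or seconds_taken >= time_limit:
        return 0  # wrong, or too slow: no points this round
    time_bonus = base_points * (time_limit - seconds_taken) // time_limit
    return base_points + time_bonus

print(round_score(True, 2))   # fast correct answer → 180
print(round_score(True, 9))   # slow correct answer → 110
print(round_score(False, 2))  # wrong answer → 0
```

The balance the speaker describes – challenging but not frustrating – then becomes a matter of tuning the time limit and bonus curve against the player's skill.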

When we designed the game, we considered how we could ensure that the game is educationally effective, and to integrate it with the English curriculum in school. We tie to the 2011 English Curriculum Standards for Compulsory Education in China. Students have to complete a sequence of levels to reach the next level of learning – autonomous learning in a systematic way.

So, we piloted this app in China, working with 6 primary schools in Harbin, China. Data has been collected from interviews with teachers, classroom observation, and questionnaires with parents.

This work is a KTP – a Knowledge Transfer Partnership – project and the KTP research is looking at Chinese primary school teachers’ attitudes towards game-based learning. And there is also an MSc TESOL dissertation looking at teachers’ attitudes towards game-based learning… For instance, they may or may not be able to actually use these tools in the classroom because of the way teaching is planned and run. The results of this work will be presented soon – do get in touch.

Our future game development will focus more on a communicative model, task-based learning, and learner autonomy. So the character lands on a new planet, and has to find their way, repair their rocket, and return to earth… To complete those tasks the learner has to develop the appropriate language to do well… But this is all exploratory so do talk to me and inspire me.


Q1) I had some fantastic Chinese students in my playful anthropology course and they were explaining quite mixed attitudes to these approaches in China. Clearly there is that challenge to get authorities to accept it… But what’s the compromise between learning and fun.

A1) The game has features designed for fun… I met with the education bureau and teachers, to talk about how this is educationally effective… Then when I get into classrooms to talk to the students, I focus more on gaming features – why you play it, how you progress and unlock new levels. The emphasis has to be quite different depending on the audience. One has to understand the context.

Q2) How have the kids responded?

A2) They have been really inspired and want to try it out. The kids are 8 or 9 years old… They were keen but also knew that their parents weren’t going to be as happy about playing games in the week when they are supposed to do “homework”. We get data on how this is used… We see good use on week days, but huge use on weekends, and longer play time too!

Q3) In terms of changing attitudes to game-based learning in China… When we wanted to test it in Taiwan the attitude was different; we were expected to build playful approaches in…

A3) There is “teaching reform” taking place… And more games and playfulness in the classrooms. But digital games were the problem in terms of triggering a cautious mentality. The new generation uses more elearning… But there is a need to demonstrate that usefulness and take it out to others.

VR in Education – Cinzia Pusceddu-Gangarosa

I am manager of learning technology in the School of Biological Sciences, and also a student on the wonderful MSc in Digital Education. I’m going to talk about Virtual Reality in Education.

I wanted to start by defining VR. The definition I like best is from Merriam-Webster. It includes key ideas… the idea of a “simulated world” and the ways one engages with it. VR technologies include headsets like Oculus Rift (high end) through to Google Cardboard (low end) that let you engage… But there is more interesting stuff there too… There are VR “Cave” spaces – where you enter and are surrounded by screens. There are gloves, there are other kinds of experience.

Part of virtual reality is about an intense idea of presence, of being there, of being immersed in the world, fully engaged – so much so that the interface disappears, you forget you are using technologies.

In education VR is not anything new. The first applications were in the 1990s… But in the 2000s desktop VR became more common – spaces such as Second Life – more acceptable and less costly to engage with.

I want to show you a few examples here… One of the first experiments was from the Institute for Simulation and Training, PA, where students could play “noseball” – playing with a virtual ball via a set of wearables. You can see they still use headsets, similar to now but not particularly sophisticated… I also wanted to touch on some other university experiments with VR… The first one is Google Expeditions. This is not a product that has been looked at in universities – it has been trialled in schools a lot… It’s a way to travel in time and space through Google Cardboard… Through the use of apps and tools… And Google supports teachers to use this.

A more interesting experiment is at Stanford’s Virtual Human Interaction Lab, looking at cognitive effects on students’ behaviour, and perspective-taking in these spaces, looking at empathy – how VR promotes and encourages empathy. Students impersonating a tree are more cautious about wasting paper. Students impersonating a person have more connection and thoughtfulness about their behaviour towards that person… There was even an experiment on being a cow and whether that might make students more likely to become vegetarian.

Another interesting experiment is at Boston University who are engaging with Ulysses – based on a book but not in a literal way. At Penn State they have been experimenting with VR and tactile experiences.

So, to conclude, what are the strengths of VR in education? Well, it is about experiencing what is not otherwise possible – due to cost, distance, time, size, or safety. Also non-symbolic learning (maths, chemistry, etc.); learning by doing; and engaging experiences. But there are weaknesses too: it is hard to find a VR designer; it requires technical support; and sometimes VR may not be the right technology – maybe we want to replicate the wrong thing, maybe it’s not innovative enough…


Q1) Art Gallery/use in your area?

A1) I would like to do a VR project. It’s hard to understand until you try it out… Most of what I’ve presented is based on what I’ve read and researched, but I would love to explore the topic in a real project.

Q2) With all these technologies, I was wondering if a story is an important accompaniment to the technology and the experience?

A2) I think we do need a story. I don’t think any technology adds value unless we have a vision, and an understanding of the full potential of the technology – what it does differently, and what it really adds to the situation and the story…

Coming up…

Afternoon Keynote: Dr Hamish MacLeod, Senior Lecturer in Digital Education, Institute for Education, Community and Society, Moray House School of Education: Learning with and through Ambiguity (Chair: Cinzia Pusceddu-Gangarosa)

Nicola was talking about her youth and childhood… I will share one too… I knew all was fine and well, that I was expected… When my primary school teacher, when I was 7, pushed me into the swimming pool. This threat was absolutely no threat… My superpower was to arise miraculously undrowned. That playful interaction was important to me, it signalled a relationship, a sign of belonging… A trivial example but…

Playfulness can cement relationships between learners and teachers without judgement. Something similar arose when we ran our eLearning and Digital Cultures MOOC, there was gentle mocking of the team… So for instance there was a video of us as animated Star Trek characters – not just playful but reflecting back to us our ideas… My all time favourite playful response, one of the digital artefacts created, was Andy Mitchell’s intervention… A fake Twitter account, in my name, automatically tweeting cyber security messages. We thoroughly approved and Andy was one of our volunteer tutors on the next run.

Brian Sutton Smith talks about the ambiguity of play, understanding play in both humans and animals, in young and in mature individuals. He came up with 7 rhetorics of play:

  • Play as progress – child development. We talk about children playing, adults engaging in recreation. We can be frivolous as adults but not playful.
  • Play as fate – games of chance
  • Play as power – competition and sporting prowess
  • Play as (community) identity, festivals and carnivals
  • Play as the Imaginary and phantasmagorical, narrative and theatrical
  • Play as Self-actualisation
  • Play as Frivolous

For adults play is seen very differently. Bruegel shows play as time wasting and problematic… But then he also paints children at play… I commend to you the Wikipedia article on Bruegel’s Children’s Games – over 80 games are defined there and that definition is interesting.

When dogs playfight they prolong the “fight”, they self-disable to keep things going… Snapshots don’t tell you it’s playful… But it is. And one giveaway is the precursor… playful postures, trusting postures… The play context is real. The nip is not a bite, but neither is it not-a-bite… It’s what the bite means (Schechner (1988)). Context is everything. There is a marvellous quote on the about-ness of learning: “there is a reason that education is an accusative”.

I was very taken by Jen Ross’ Learning with Digital Provocations talk at the CAHSS Digital Day of Ideas earlier this year. This talk aims to reanimate the debate, framing disruption in terms of “inventiveness, provocation, uncertainty and the concept of ‘not-yetness'”. Disruption says it is revolutionary, but it is not all that revolutionary in its reality. We have gurus of disruption claiming that university is for training students for jobs…

Assuming we aren’t training students for the zombie apocalypse, what are we doing in Higher Education? Well we want our students to behave intelligently, in the sense of Piaget. Or in the sense of Burrhus Frederic Skinner – “Education is what is left when what we have learned has been forgotten”. There are some sorts of skills and mindset for meeting new challenges that they have not met before…

So that mindset, those skills, do have some alignment with Sutton Smith’s ideas of play. And I wanted to show a wonderful local example [a video of Professor Alan Murray – who actually taught me with the innovative and memorable electric guitar/memorable analogies shown in the video].

So, before I move on… Who wants to get the ball into the cup? [cue controlled chaos with three volunteers].

One classic definition of play (Man, Play and Games – Roger Caillois) – it is entered into freely, it has no particular purpose. And again Bernard Suits’ The Grasshopper has my favourite definition: “the voluntary attempt to overcome unnecessary obstacles”. This is the lusory attitude Nicola talks about… And that is about overlooking efficient solutions simply for the sake of the play itself.

Thank you to Ross Galloway for an example here… Enrico Fermi came up with the idea of Fermi Problems – making informed guesses and estimates for calculations where a solution does not (yet) exist. Estimation is an important skill. Ross’ example was “can we estimate how much it costs to light all of Edinburgh” – students had their own answers… But tutors saw similar and precise results. Of course you can Google “what does it cost to light Edinburgh”. To use that is to miss the point. The students getting that answer and using it don’t get that it is about understanding how to approach the problem, not what the answer is.
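To illustrate the method behind a Fermi problem – decompose the question into factors you can reasonably guess, then multiply – here is a sketch of such an estimate. Every number below is an invented assumption, since (as Ross' example shows) the point is the approach, not the answer:

```python
# A Fermi estimate: "what does it cost to light Edinburgh?"
# All values are illustrative guesses, not real figures.
streetlights = 60_000      # assumed: tens of thousands of lamps
watts_each = 100           # assumed: ~100 W average per lamp
hours_per_night = 10       # lit roughly dusk to dawn
nights_per_year = 365
price_per_kwh = 0.15       # assumed electricity price in GBP

kwh_per_year = streetlights * watts_each * hours_per_night * nights_per_year / 1000
cost_per_year = kwh_per_year * price_per_kwh
print(f"~GBP {cost_per_year:,.0f} per year")  # an order-of-magnitude answer
```

What matters pedagogically is that two students with different guesses should still land within an order of magnitude of each other – suspiciously identical, precise answers are the signature of Google rather than estimation.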

But there is a real challenge to find problems that cannot be solved by Google… That ensures there is that space for play and creative approach.

I wanted to give an example here of our MSc in Digital Education module in the Introduction to Digital Games-Based Learning, which was set up with a great deal of input from Fiona Hale, and many Edinburgh colleagues, as well as Nicola Whitton herself and her book Learning with Digital Games. Several other texts are key for us: Digital Games Based Learning by Marc Prensky, and James Paul Gee’s What Video Games Have to Teach Us About Learning and Literacy. These two books are radically different. Prensky talks about games mediating instead of traditional approaches, as a modern solution. Gee meanwhile suggests drawing on successful principles from games – he argued that learning should become more playful and game-informed. I would say that Gee has been firmly on the right side of history… And that’s reflected in the naming of today, of Nicola’s conference, and indeed the new ALT Playful Learning Special Interest Group – which has changed its name from games-based learning.

I’m not saying that we shouldn’t be using games in learning… but…

Gee outlines 13 principles, and I’d like to draw out:

  • co-design

Students having agency and taking decisions in their learning, project-based, resource-based learning, etc. I would say an excellent local example here is the SLICCs – the Student-Led Individually-Created Courses – also Pete Evans’ work on micro credit courses. So this is about agency, ownership. Although learners may need structure and constraint to have agency in this context.

  • Identity and belonging

Identity formation is what happens in education, and also in games (a warlock, a blue hedgehog, etc.); our law students will become lawyers, our medical students will become doctors. Not to be disrespectful but it can be useful to think about our students as “playing at” being professionals. Lave and Wenger talk about this process, of legitimate peripheral participation. Gee talks about the real, virtual and projective identities… “Saying that if learners in classrooms carry learning so far as to take on a projective identity, something magical happens…” A marvellous example of this, and you can see a video on MediaHopper, is the “white coat ceremony” at the Vet School. The aim is to welcome students into membership of the profession and the community. Certificates are presented, photographs are taken, students then get to put on their white coats for the first time… And then students stand and recite the vet student oath. And then they celebrate with rather staged photographs! And at the end of the year they have a big class photo – in their animal onesies! This is serious fun, this is a rite of passage, it is a welcoming into the community. And it symbolises taking on that identity…

  • Fish tanks and
  • Sand boxes

This is about safe space to explore, to play, to experiment. There is a key role for us here in our own practice, but also in how peers may impact on that safety… Sometimes we need to play the fool ourselves, to take the pressure off a student, to take the fire away from someone and reinforce play as legitimate. This is the idea of “teacher as jester”. Sandboxes are about tinkering; the idea of bricolage comes in here – as in the constructivism of ? and in Sherry Turkle’s work. And indeed uCreate and 23 Things – space to play and create and have space to think…

It has been said by some students that they don’t enjoy our games-based learning course… And then they do the course design module and then they get it… And in that spirit… I would heartily recommend Charlie Farley and Gavin Willshaw’s Board Game Jams – in one hour there is new language, metaphor, ways to think about what we do in education.

In our course we do also talk about stories…. I’m sure many of you have worked with the notion of role playing, but there are wider and more inclusive approaches in “scenario learning”. One of our former colleagues here, Martin Crapper, talked about environmental enquiry processes and, rather than lecture on it, he actually held an environmental enquiry. The students were engaged, over 2 days, to come with prepared statements as they would as expert witnesses in an environmental enquiry for developers or environmental groups. So rather than read about or think about, they participated.

Another example: a student on our course who teaches at the University of Sunderland uses a legal appeal around illegally selling cinema tickets. Sophie presents to her students a letter that she asks them to imagine she found in the archive… In this case the correspondent wants an expert witness in cognitive psychology on parsing and understanding evidence that might be used there.

And briefly… Gamification… This shows up in various ways, such as PeerWise. The bit that I think is most useful here isn’t answering questions for peers, but authoring questions. Many of the features here are reminiscent of social networks – you can upvote, follow, engage. There is also a pushback by some students who see this as a trivial experience… A push back to Sutton Smith’s frivolity… And we also have Top Hat – voting on learners’ own mobile devices, the new equivalent of clickers… Lecture events can be made interactive in this way… Students are presented with a question, vote on it, and the idea is that when students discuss the question they are more likely to converge on the right answer… What about students who got the right answer right away… What do they talk about? The evidence is that they discuss the problem, other boundary cases… They continue to “play the game” here… They keep that lusory attitude.

I will mention this book – although failure has been well covered – “The Art of Failure: An Essay on the Pain of Playing Video Games” – Jesper Juul.

So I argue that ALL conceptual learning is accompanied with the sound of pennies dropping… a trickle or a clunk. We talk about threshold concepts, we can see these or recall these as issues we have overcome. All learning involves thresholds and overcoming them. All learning happens in a liminal space, becoming, crossing over, finding the right place to cross over… We may have to try different ways… Often we have to engage actively in doing that which we wish to understand. Papert talks about students learning by doing, and reflecting on what they have done – bricolage. I would argue this is the territory of Vygotsky’s Zone of Proximal Development – I can succeed with the help of a skilled peer, it is the space of apprenticeship… It is scary, but if we embrace the ambiguity of play we can make learning more successful.


Q1) I wanted to ask you about the twin idea of risk and safety. In your talk I don’t know if the playful learning you are talking about should be risky and dangerous, or should it just look like that?

A1) I think perhaps it should be increasingly risky… So that one is supported in taking risks. Gee talks about a psychosocial moratorium – young people have permission in the world to screw up! It’s about the consequences, and about our responsibility to protect them from the consequences. As Nicola talked about earlier, assessment can be the issue, the barrier to working with our students… But we work up to more risky engagements.

Q2) I wondered if the bit that isn’t a bite, but isn’t not-a-bite is a kind of parody… That students kind of have to “fake it till they make it”

A2) I want my students to play along, to engage in parody and metaphor and play with that, rather than it being tangential illustration, but engaging as if in the real world…

Q3) Especially from your examples, the ease you can find information on Google, that really is changing education. Information is now so easily found, you have to engage people in the process of learning, not of information transfer…

A3) I was very much struck by that example from Ross. That idea we have of working with facts… And that sometimes you don’t want the facts… Students must be capable of trying things they have never done before. And of wanting genuine engagement, not just the answer to the problem, but the trajectory towards the solution… How do we set that up as the thing we are doing… Maybe we have to be more explicit about that… We can create tasks around finding stuff… But… How many of us hate when, in a pub conversation, someone runs to the phone to find the answer to some name you can’t remember… For me I want that solution and I’ll grab the phone… But some people want that chat, they don’t want the answer… It’s the voluntary espousal of unnecessary obstacles.

Q4) A comment and nice example… I read that meerkats remove the sting from scorpions and give it (the scorpions) to their young so that they can explore without being stung.

Comment) It is true… It’s the only animal, other than humans, that teach in stages… So they will give a scorpion without a sting and work up to the full scorpion…

Q5) Would you apply that need to fail to those teaching them…

A5) I think so, Jen and I have talked about the idea of “successful not-knowing”. Another Ross Galloway example… The student asks a question, with a new example… And he has a better example… So he works through that instead… And he goes wrong… he works back… the students comment from the side… and he moves on… And in his course feedback the students said “we liked it when you made a mistake, we really saw physics being done”. A few years ago at Networked Learning a speaker in favour of the lecture said it was an opportunity for students to see the lecturer “thinking on their feet”… That thinking on their feet is the key bit, the discussion, the extemporising… That’s where I feel comfortable – but others will not. But the risk is absolutely for them (the students), but we have to model that risk and they will respect it when they see it. So again Alan Murray would lose some of his dignity [in his playful approaches], but not respect… He was secure in his position.

And with that we moved out into further breakout sessions… 

Afternoon Break Out Sessions

  • Playful Learning Mini Maker Space – Michael Boyd
  • 23 Things – Stephanie (Charlie) Farley
  • DIY Film School (Gear and Gadgets) – Stephen Donnelly
  • Gamifying Wikipedia – Ewan McAndrew
  • Near Future Teaching Vox Pops – Sian Bayne

To finish the day we have several more short presentations… 

Presentations (Chair: Ross Ward, Learning Technology Advisor (ISG Learning, Teaching & Web Services))

Learning to Code: A Playful Approach – Areti Manataki

I’m a senior researcher in the School of Informatics. And I’ve been quite active in teaching children how to program, both online and offline. My research is in AI, rather than education, so I’m here to share with you and to learn from you.

So Code Yourself! was a special programme that ran in both English and Spanish, organised by University of Edinburgh and University of Uraguay. The whole idea was to introduce programming to people with little or no experience of coding. We wanted to emphasise that it’s fun, it’s relevant to the real world. and anyone can have a go!

We covered the basics of algorithms and control structures, computational thinking, software engineering and programming in Scratch. This was for young teenagers and set real challenges, with some structure, using Scratch. And we included real life algorithms – like how to make a sandwich – and we had strong visual elements thanks to the UoE MOOCs production team. And it was all about having fun, including building your own games of all types… Like hunting ghosts, or plants vs zombies…

The forum turns out to be a really important space in the MOOCs. We encouraged them to use the forums, to discuss, and we included tasks and discussions in the weekly emails… So we would say, go to the discussion boards to teach an alien how to brush his teeth, or draw an object and have people guess what it is.

The course has been running a couple of years, and our reach has been good… We have had over 110k participants and 2881 course completers. There are slightly more men than women. The age profile varies, though is higher amongst young people than on a typical Coursera course. And older audiences enjoyed the programming. And we asked if they plan to program again in the future; many plan to, and that’s brilliant, that’s what I hoped for.

We’ve also been running coding workshops at the Edinburgh International Science Festival, having a go and trying to build their own game… We’ve had games with dinosaurs flying over cars! And they seem to enjoy it – and excitement levels are high! Many of these kids are familiar with Scratch, but they liked the freedom to play.

And since the MOOC was working for adults, we’ve tweaked it and run the course for students and staff. Again, it came out as being fun!

We now want to reach out to more people. I actually do this in my own time, I’m passionate about it and want to reach young people, especially from disadvantaged backgrounds, and teachers who can use this in the classroom!


Q1) You commented that it was aimed at young people but also that Coursera’s audience aren’t really that young. Is the course running on demand? What’s the support?

A1) We first launched in March 2016 as a session. Then moved in August to monthly sessions, with learners encouraged to move to the next session if they haven’t completed. I was a bit scared when we moved to that phase in terms of support. But Coursera has launched the mentors scheme, where previous learners who enjoyed the course can get involved more actively on discussion boards. They do a fantastic job and learners are supported.

Q2) Is there a platform aimed at a younger audience?

A2) I’m not sure. The University of Uruguay has experience in online teaching with youngsters and they designed their own platform – and it didn’t look that different. But I do think the style needs to be more appropriate.

Q3) You talked about working with younger audiences – who maybe encounter coding in schools, and your Coursera audiences and those young people’s parents may be seeing code from their kids or from professional contexts… Have you looked at teaching for an older audience who tend to be less digitally engaged in general?

A3) Yes, we did have some older MOOC takers… And they are amongst the most enthusiastic and they share their stories with us. We have done an online session with older people too and as long as you make them feel safe, and able to use the technology, that’s half the battle. When they are confident they are some of the most enthusiastic participants!

Enriched engagement with recorded lectures – John Lee

I’m going to talk about enriching recorded lectures and thinking about how we can enrich recordings with other content… This work is now turning into a PTAS project with colleagues in Informatics.

Rich media resources are increasingly recorded and created with the intention of using them for teaching and learning. And recorded lectures are perhaps the most obvious example of such materials. We capture content, but an opportunity feels like it is being lost here; there is so much opportunity if we can integrate these into some new ecology of materials in an interesting way.

So, what’s the problem with doing that?  We have tools for online editing, annotation, linking to resources… These are not technically difficult to do… The literature captures lots of approaches and systems, and often positive experiences, but you don’t then seem to see people going on to use them in teaching and learning… And that includes me actually. Somehow it seems to be remarkably difficult… Perhaps interfaces are somehow not yet optimal for educational uses. If you look at YouTube’s editor… It works nicely… But not a natural thing to use or adapt to an educational context… Perhaps if we can design easier ways to present and interact with rich media, it can be more successful.

And maybe part of this is about making it more fun! It could be that doing this kind of work isn’t necessarily playful… But working with these kinds of materials should be assisted by playful approaches. And one way to do that is to bring in more diverse sources and resources, bringing in YouTube, Vimeo… Bringing in playful content… Perhaps we can also crowdsource more content creation, more content from students themselves… And build content creation hand in hand with content use… Designing learning activities that implicate and build on use and creation of rich media, for instance…

So, I’ve been building a prototype based on APIs from various places, just built with Javascript. We have tried this out on a course called “Digital Playgrounds for the Online Public” (Denitsa Petrova). Students found it interesting and raised various issues… Again I’m following this tradition of trying things out, seeing it is interesting… But I now want to avoid the trap of not going any further with it… How? Partly by integrating with student projects – informatics and Digital Media students – to design ways to combine this media in new and different ways. Some real possibilities there…


Q1) I really like this idea. I trialled an idea with new students around writing up lecture notes, uploading, sharing, upvoting, comments… A few students tried it and then it fell off exponentially. But one student kept it going all year – and apologised if he didn’t share his notes… But in future years no one did that… You are talking about taking a lecture and scaling up and building on it… But students sometimes want to go the other way… What else will you add for uptake?

A1) That’s a really good point. The ideas in our proposal tackle that head on. We want to design learning activities that produce and build on those recordings – by creating resources, linking content together… The idea is that if you leave them to do it by themselves, they often won’t. But if we structure it into the course itself, it becomes a learning process, and students reflect on the course itself. So the lecture material may be used in a flipped classroom type of way… So they take from and work with the video, rather than just watch it. Sometimes we use pre-recorded content and that needs to be part of the picture too… Engagement is always more difficult… We want to foster engagement, and if we can do that the rest should fall into place.

DIY Filmschool and Media Hopper (MoJo) – Stephen Donnelly

I’m Stephen Donnelly and I work with the Media Team in LTW. I’ll talk about some of what we do…

The idea for the DIY Film School came from MoJo – the idea of Mobile Journalism… And how journalism is changing because of technology… The idea is that we are so used to seeing media on YouTube, filmed on a mobile phone… There’s no point commenting on production values when we see that. We are so used to seeing mobile footage used in broadcast. There are two reasons for this… The way we view media has changed, but also we now all have mobile devices. We all have a mobile phone… And broadcasters have figured that out. And most news stories now have a video element in there… So we wanted to help others to engage in that.

So, how did we get here…

I used to work with the BBC and we had gotten to the point where you sent out producers and cameramen… You probably take videos all the time and share them online… But you get to a professional context and everything changes… But you can shoot your own stuff. So we have purchased some inexpensive kit that works with your existing device – mics, rigs, lenses. And, in addition, we run a DIY film school. It teaches the basics of film making – which apply no matter what you use to film… Framing, being stable, how to zoom, lighting and shooting in appropriate light… And audio. People often forget about audio and actually, people tolerate bad video but you really need good audio… And how to be prepared when doing a shoot.

So, after you’ve made your amazing films from DIY Film School where do you put it? Well we have MediaHopper… And you can do really cool stuff – store your content, share your content, etc.

I have a few examples of films made with the DIY film gear… So here we have a “how to” video for shooting an interview with two mobile cameras so that you can cut between the two shots – a simple interview set up. Another option here, a video on “Life after Cardiac Arrest” – they had commissioned out before, now shooting themselves, upskilling the team, and they make some really really nice stuff. And lastly is a video by Michael Seery who has students making videos – in this case how to use a UV-vis spectrophotometer. It’s all about the content and not about the technology.


Q1) How much do those mobile rigs tend to cost?

A1) These are in the region of £100 for a steadycam rig… I could put a target on John, say, and it will follow him. It’s so cheap compared to what we would have purchased in the past. Nothing is more than ~£100 – you can buy lots rather than one camera. And you can loan out our rigs too. You guys have all the best content, it’s getting you guys to use it…

Q2) Have you…

A2) We had a colleague working in archives, wanting to capture the process… They were doing it through pictures… and making their own videos has changed how they communicate their work… And MediaHopper has changed how academic colleagues are sharing their work in lots of ways, not just DIY Film School…

Q3) Great to see how easy things are now. One of the things that we are keen to do in the School of Education is adding captions. That’s easy on YouTube, how about in MediaHopper?

A3) We are running a pilot at the moment. Including manual and automated captions. The latter is easier but more hit and miss. Get in touch to get involved.

Closing Remarks – Prof. Sian Bayne, Moray House School of Education

Nicola and team asked me to close the conference. The theme of the conference feels spot on at the end of a busy and exhausting year. Today has been a lovely reminder to bring playfulness into our everyday lives. Thank you to our fantastic speakers Nicola and Hamish, to colleagues who have presented, and run breakouts and posters all day!

Thank you to Nicola, this is her last conference as convener so huge thanks. Thank you to Ross Ward. To Charlie Farley. To Susan Greig. To Ruby Rennie. And also to Marshall Dozier. And a special thank you to Cinzia Pusceddu-Gangarosa who I know Sian also meant to thank in her talk. 

And I want to invite everyone left to come and drink wine and eat cheese as a very very informal leaving do for Hamish who is retiring. Thank you to everyone for everything, and for coming!

And with that we are done… And I’m off to drink wine and eat cheese… 

Jun 162017

It’s the final day of the IIPC/RESAW conference in London. See my day one and day two posts for more information on this. I’m back in the main track today and, as usual, these are live notes so comments, additions, corrections, etc. are all welcome.

Collection development panel (Chair: Nicola Bingham)

James R. Jacobs, Pamela M. Graham & Kris Kasianovitz: What’s in your web archive? Subject specialist strategies for collection development

We’ve been archiving the web for many years but the need for web archiving really hit home for me in 2013 when NASA took down every one of their technical reports – for review on various grounds. And the web archiving community was very concerned. Michael Nelson said in a post “NASA information is too important to be left on computers”. And I wrote about when we rely on pointing not archiving.

So, as we planned for this panel we looked back on previous IIPC events and we didn’t see a lot about collection curation. We posed three topics all around these areas. So for each theme we’ll watch a brief screencast by Kris to introduce it…

  1. Collection development and roles

Kris (via video): I wanted to talk about my role as a subject specialist and how collection development fits into that. As a subject specialist that is a core part of the role, and I use various tools to develop the collection. I see web archiving as absolutely being part of this. Our collection is books, journals, audio visual content, quantitative and qualitative data sets… Web archives are just another piece of the pie. And when we develop our collection we are looking at what is needed now but in anticipation of what will be needed 10 or 20 years in the future, building a solid historical record that will persist in collections. And we think about how our archives fit into the bigger context of other archives around the country and around the world.

For the two web archives I work on – and the Bay Area Governments archives – I am the primary person engaged in planning, collecting, describing and making available that content. And when you look at the web capture life cycle you need to ensure the subject specialist is included and their role understood and valued.

The archive involves a group from several organisations including the government library. We have been archiving since 2007, initially with the California Digital Library, and moved to Archive-It in 2013.

The Bay Area Governments archives includes materials on 9 counties, but primarily and comprehensively focused on two key counties here. We bring in regional governments and special districts where policy making for these areas occur.

Archiving these collections has been incredibly useful for understanding government, their processes, how to work with government agencies and the dissemination of this work. But as the sole responsible person that is not ideal. We have had really good technical support from Internet Archive around scoping rules, problems with crawls, thinking about writing regular expressions, how to understand and manage what we see from crawls. We’ve also benefitted from working with our colleague Nicholas Taylor here at Stanford who wrote a great QA report which has helped us.

We are heavily reliant on crawlers, on tools and technologies created by you and others, to gather information for our archive. And since most subject selectors have pretty big portfolios of work – outreach, instruction, as well as collection development – we have to have good ties to developers, and to the wider community with whom we can share ideas and questions is really vital.

Pamela: I’m going to talk about two Columbia archives, the Human Rights Web Archive (HRWA) and Historic Preservation and Urban Planning. I’d like to echo Kris’ comments about the importance of subject specialists. The Historic Preservation and Urban Planning archive is led by our architecture subject specialist and we’d reached a point where we had to collect web materials to continue that archive – and she’s done a great job of bringing that together. Human Rights seems to have long been networked – using the idea of the “internet” long before the web and hypertext. We work closely with Alex Thurman, and have an additional specially supported web curator, but there are many more ways to collaborate and work together.

James: I will also reflect on my experience. And the FDLP – Federal Depository Library Program – involves libraries receiving absolutely every government publication in order to ensure a comprehensive archive. There is a wider programme allowing selective collection. At Stanford we are 85% selective – we only weed out content (after five years) very lightly and usually flyers etc. As a librarian I curate content. As an FDLP library we have to think of our collection as part of the wider set of archives, and I like that.

As archivists we also have to understand provenance… How do we do that with the web archive. And at this point I have to shout out to Jefferson Bailey and colleagues for the “End of Term” collection – archiving all gov sites at the end of government terms. This year has been the most expansive, and the most collaborative – including FTP and social media. And, due to the Trump administration’s hostility to science and technology we’ve had huge support – proposals of seed sites, data capture events etc.

  2. Collection Development approaches to web archives, perspectives from subject specialists

As subject specialists we all have to engage in collection development – there are no vendors in this space…

Kris: Looking again at the two government archives I work on, there are Depository Program Statuses to act as a starting point… But these haven’t been updated for the web. However, this is really a continuation of the print collection programme. And web archiving actually lets us collect more – we are no longer reliant on agencies putting content into the Depository Program.

So we really treat this as a domain collection. And no-one is really doing this except some UCs, myself, and the state library and archives – not the other depository libraries. However, we don’t collect think tanks, or the not-for-profit players that influence policy – this is for clarity, although this content provides important context.

We also had to think about granularity… For instance for the CA transport there is a top level domain and sub domains for each regional transport group, and so we treat all of these as seeds.

Scoping rules matter a great deal, partly as our resources are not unlimited. We have been fortunate that with the archive that we have about 3TB space for this year, and have been able to utilise it all… We may not need all of that going forwards, but it has been useful to have that much space.

Pamela: Much of what Kris has said reflects our experience at Columbia. Our web archiving strengths mirror many of our other collection strengths and indeed I think web archiving is this important bridge from print to fully digital. I spent some time talking with our librarian (Chris) recently, and she will add sites as they come up in discussion, she monitors the news for sites that could be seeds for our collection… She is very integrated in her approach to this work.

For the human rights work one of the challenges is the time that we have to contribute. And this is a truly interdisciplinary area with unclear boundaries, and those are both challenging aspects. We do look at subject guides and other practice to improve and develop our collections. And each fall we sponsor about two dozen human rights scholars to visit and engage, and that feeds into what we collect… The other thing that I hope to do in the future is to do more assessment, to look at more authoritative lists in order to compare with other places… Colleagues look at a site called Idealist which lists opportunities and funding in these types of spaces. We also try to capture sites that look more vulnerable – small activist groups – although it is not clear if they actually are at risk.

Cost wise, the expensive parts of collecting are the human effort to catalogue, and the permission process. And yesterday’s discussion raised the possible need for ethics groups as part of the permissions process.

In the web archiving space we have to be clearer on scope and boundaries as there is such a big, almost limitless, set of materials to pick from. But otherwise plenty of parallels.

James: For me the material we collect is in the public domain so permissions are not part of my challenge here. But there are other aspects of my work, including LOCKSS. In the case of the Fugitive US Agencies Collection we take entire sites (e.g. CBO, GAO, EPA) plus sites at risk (e.g. Census, Current Industrial Reports). These “fugitive” agencies include publications that should be in the depository programme but are not. And those lost documents that fail to make it out, they are what this collection is about. When a library notes a lost document I will share that on the Lost Docs Project blog, and then also am able to collect and seed the cloud and web archive – using the WordPress Amber plugin – for links. For instance the CBO report on the health bill, aka Trump Care, was missing… In fact many CBO publications were missing so I have added them as seeds for our Archive-It collection.

  3. Discovery and use of web archives

Discovery and use of web archives is becoming increasingly important as we look for needles in ever larger haystacks. So, firstly, over to Kris:

Kris: One way we get archives out there is in our catalogue, and into WorldCat. That’s one place to help other libraries know what we are collecting, and how to find and understand it… So I would be interested to do some work with users around what they want to find and how… I suspect it will be about a specific request – e.g. a city council in one place over a ten year period… But they won’t be looking for a web archive per se… We have to think about that, and what kind of intermediaries are needed to make that work… Can we also provide better seed lists and documentation for this? In Social Sciences we have the Code Book and I think we need to share the equivalent information for web archives, to expose documentation on how the archive was built… And linking to seeds and other parts of collections.

One other thing we have to think about is process, and documenting the ingest mechanism. We are trying to do this to better describe what we do… But maybe there is a standard way to produce that sort of documentation – like the Codebook…

Pamela: Very quickly… At Columbia we catalogue individual sites. We also have a customised portal for the Human Rights. That has facets for “search as research” so you can search and develop and learn by working through facets – that’s often more useful than item searches… And, in terms of collecting for the web we do have to think of what we collect as data for analysis as part of a larger data sets…

James: In the interests of time we have to wrap up, but there was one comment I wanted to make, which is that there are tools we use but also gaps that we see for subject specialists [see slide]… And Andrew’s comments about the catalogue struck home with me…


Q1) Can you expand on that issue of the catalogue?

A1) Yes, I think we have to see web archives both as bulk data AND collections as collections. We have to be able to pull out the documents and reports – the traditional materials – and combine them with other material in the catalogue… So it is exciting to think about that, about the workflow… And about web archives working into the normal library work flows…

Q2) Pamela, you commented about a permissions framework as possibly vital for IRB considerations for web research… Is that from conversations with your IRB, or speculative?

A2) That came from Matt Weber’s comment yesterday on IRBs becoming more concerned about web archive-based research. We have been looking for faster processes… But I am always very aware of the ethical concern… People do wonder about ethics and permissions when they see the archive… It will be interesting to see how we can navigate these challenges going forward…

Q3) Do you use LCSH and are there any issues?

A3) Yes, we do use LCSH for some items and the collections… Luckily someone from our metadata team worked with me. He used Dublin Core, with LCSH within that. He hasn’t indicated issues. Government documents in the US (and at state level) typically use LCSH so no, no issues that I’m aware of.

Plenary (Macmillan Hall): Posters with lightning talks (Chair: Olga Holownia)

Olga: I know you will be disappointed that it is the last day of Web Archiving Week! Maybe next year it should be Web Archiving Month… And then a year!

So, we have lightning talks that go with posters that you can explore during the break, and speak to the presenters as well.

Tommi Jauhiainen, Heidi Jauhiainen, & Petteri Veikkolainen: Language identification for creating national web archives

Petteri: I am web archivist at the National Library of Finland. But this is really about Tommi’s PhD research on native Finno-Ugric languages and the internet. This work began in 2013 as part of the Kone Foundation Language Programme. It gathers texts in small languages on the web… They had to identify that content to capture them.

We extracted the web links on Finnish web pages, and also crawled Russian, Estonian, Swedish, and Norwegian domains for these languages. They used HeLI and Heritrix. We used the list of Finnish URLs in the archive, rather than transferring the WARC files directly. So HeLI is the Helsinki language identification method, one of the best in the world. It can be found on GitHub, and can be used for your language as well! The full service will be out next year, but you can ask about HeLI if you want it earlier.
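
HeLI’s actual models are more sophisticated (word and character n-grams with backoff – the code is on GitHub), but as a loose illustration of how character n-gram language identification works in general – not HeLI’s algorithm, and with toy training data – a sketch might look like:

```python
from collections import Counter
import math

def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams in a training text."""
    t = " " + text.lower() + " "
    grams = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def identify(text, profiles, n=3, penalty=1e-6):
    """Pick the language whose profile gives the text the highest
    log-probability; unseen n-grams get a small penalty probability."""
    t = " " + text.lower() + " "
    grams = [t[i:i + n] for i in range(len(t) - n + 1)]
    scores = {lang: sum(math.log(prof.get(g, penalty)) for g in grams)
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

# Toy profiles from a few words; a real system trains on large corpora.
profiles = {
    "fi": ngram_profile("hyvää päivää kiitos paljon suomen kieli on kaunis"),
    "en": ngram_profile("good day thank you very much the english language"),
}
print(identify("kiitos hyvää päivää", profiles))
```

The same idea scales up by training one profile per target language and scoring each crawled page against all profiles.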

Martin Klein: Robust links – a proposed solution to reference rot in scholarly communication

I work at Los Alamos, I have two short talks and both are work with my boss Herbert Van de Sompel, who I’m sure you’ll be aware of.

So, the problem of robust links is that links break and referenced content changes. It is hard to ensure the author’s intention is honoured. So, you write a paper last year, point to the EPA, and this year the URI doesn’t work…

So, there are two ways to do this… You can create a snapshot of a referenced resource… with e.g. the Internet Archive, Archive.is, or WebCite. That’s great… But the citation people use is then the URI of the archive copy… Sometimes the original URI is included… But what if the URI-M is a copy elsewhere – or the copy is no longer present…

So, second approach: decorate your links by referencing the original URI, the datetime of archiving, and the snapshot URI. That makes your link more robust, meaning you can find the live version. The original URI allows finding captures in all web archives. The capture datetime lets you identify when/what version of the site is used.

How do you do this? With HTML5 link decoration, alongside the href attribute (data-originalurl and data-versiondate). And we talked about this in a D-Lib article, with some JavaScript that makes those links actionable!
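
As a sketch of what the decorated markup looks like – attribute names as in the published Robust Links convention, with illustrative URLs:

```python
def robust_link(uri_m, original_uri, version_date, text):
    """Build an HTML anchor decorated per the Robust Links convention:
    href points at the snapshot (URI-M), data-originalurl at the live
    resource (URI-R), data-versiondate at the capture date."""
    return (f'<a href="{uri_m}" '
            f'data-originalurl="{original_uri}" '
            f'data-versiondate="{version_date}">{text}</a>')

link = robust_link(
    "https://web.archive.org/web/20160120141611/http://www.epa.gov/",
    "http://www.epa.gov/",
    "2016-01-20",
    "EPA")
print(link)
```

The decoration is purely additive, so browsers without the Robust Links JavaScript simply follow the href as usual.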

So, come talk to me upstairs about this!

Herbert Van de Sompel, Michael L. Nelson, Lyudmila Balakireva, Martin Klein, Shawn M. Jones & Harihar Shankar: Uniform access to raw mementos

Martin: Hello, it’s still me, I’m still from Los Alamos! But this is a more collaborative project…

The problem here… Most web archives augment their mementos with custom banners and links… So, in the Internet Archive there is a banner from them, and a pointer on links to a copy in the archive. There are lots of reasons, legal, convenience… But that enhancement doesn’t represent the website at the time of capturing… As a researcher those enhancements are detrimental as you have to rewrite links again.

For us and our Memento Reconstruct, and other replay systems that’s a challenge. Also makes it harder to check the veracity of content.

Currently some systems do support this… OpenWayback and pywb do allow this – you can add {datetime}im_/URI-R to do this, for instance. But that is quite dependent on the individual archive.
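
For instance, a small helper for building such a raw-memento URL (the archive prefix and capture timestamp here are illustrative) might look like:

```python
def raw_memento_url(archive_prefix, timestamp14, uri_r):
    """URL for an unmodified ('raw') memento in OpenWayback/pywb-style
    archives: the im_ flag after the 14-digit capture timestamp asks for
    the capture as originally archived, without banner or link rewriting."""
    return f"{archive_prefix}/{timestamp14}im_/{uri_r}"

url = raw_memento_url("https://web.archive.org/web",
                      "20160120141611", "http://www.epa.gov/")
print(url)
```

The Prefer-header proposal below would replace this per-archive URL convention with a uniform HTTP mechanism.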

So, we propose using the Prefer Header in HTTP Request…

Option 1: Request header sent against Time Gate

Option 2: Request header sent against Memento

So come talk to us… Both versions work, I have a preference, Ilya has a different preference, so it should be interesting!

Sumitra Duncan: NYARC discovery: Promoting integrated access to web archive collections

NYARC is a consortium formed in 2006 from research libraries at the Brooklyn Museum, The Frick Collection and the Museum of Modern Art. There is a two year Mellon grant to implement the program. And there are 10 collections in Archive-It devoted to scholarly art resources – including artist websites, gallery sites, catalogues, and lists of lost and looted art. There is a seed list of 3900+ sites.

To put this in place we asked for proof of concept discovery sites – we only had two submitted. We selected Primo from Ex Libris. This brings in materials using the OpenSearch API. The set up does also let us pull in other archives if we want to. And you can choose whether to include the web archive (or not). The access points are through MARC Records and Full Records Search, and are in both the catalogue and WorldCat. We don’t, however, have faceted results for the web archive as it’s not in the API.

And recently, after discussion with Martin, we integrated Memento into the archive, which lets users explore all captured content with Memento Time Travel.

In the future we will be doing usability testing of the discovery interface, we will promote use of web archive collections, and encouraging use in new digital art projects.

Find NYARC’s Archive-It Collections: Documentation at http://wiki.nyarc.??

João Gomes:

Olga: Many of you will be aware of Arquivo. We couldn’t go to Lisbon to mark the 10th anniversary of the Portuguese web archive, but we welcome Joao to talk about it.

Joao: We have had ten years of preserving the Portuguese web, collaborating, researching and getting closer to our researchers, and ten years celebrating a lot.

Hello, I am João Gomes, the head of Arquivo.pt. We are celebrating ten years of our archive. We are having our national event in November – you are all invited to attend and party a lot!

But what about the next 10 years? We want to be one of the best archives in the world… With improvements to full text search, launching new services – like image searching and high quality archiving services. Launching an annual prize for research projects over the archive. And at the same time increasing our collection and user community.

So, thank you to all in this community who have supported us since 2007. And long live!

Changing records for scholarship & legal use cases (Chair: Alex Thurman)

Martin Klein & Herbert Van de Sompel: Using the Memento framework to assess content drift in scholarly communication

This project is to address both link rot and content drift – as I mentioned earlier in my lightning talk. I talked about link rot there; content drift is where the URI persists but the content there changes, perhaps out of all recognition, so that what I cite is not reproducible.

You may or may not have seen this, but there was a Supreme Court case referencing a website, and someone thought it would be really funny to purchase that domain and put up a very custom 404 error. But you can see pages that change between submission and publication. By contrast if you look at arXiv for instance you see an example of a page with no change over 20 years!

This matters partly as we reference URIs increasingly, hugely so since 2008.

So, some of this I talked about three years ago when I introduced the Hiberlink project, a collaborative project with the University of Edinburgh where we coined the term “reference rot”. This issue is a threat to the integrity of the web-based scholarly record. Resources do not have the same sense of fixity as, e.g., a journal article. And custodianship is also not as long term – custodians are not always as interested.

We wrote about link rot in PLOS ONE. But now we want to focus on content drift. We published a new article on this in PLOS ONE a few months ago. This is actually based on the same corpus – the entirety of arXiv, of PubMed Central, and also over 2 million articles from Elsevier. This covered publications from January 1997 to December 2012. We only looked at URIs for non-scholarly resources – not the DOIs but the blog posts, the Wikipedia pages, etc. We ended up with a total of around 1 million URIs for these corpora. And we also kept the start date of the article with our data.

So, what is our approach for assessing content drift? We take the publication date of the URI as t. Then we try to find a Memento of the referenced URI from before publication (t-1) and a Memento from after publication (t+1). Two thirds of the URIs we looked at have such a pair across archives. So now we do text analysis, looking at textual similarity between t-1 and t+1. We compute normalised scores (values 0 to 100) for:

  • simhash
  • Jaccard – sets of character changes
  • Sorensen-Dice
  • Cosine – contextual changes
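
As a rough illustration of what such normalised scores look like, here is a simplified sketch of two of the measures over plain word sets. This is not the project's actual implementation, which computes all four measures over the extracted text of archived pages:

```python
# Illustrative 0-100 similarity scores in the spirit of the Jaccard and
# Sorensen-Dice measures mentioned above. A sketch over word sets, not
# the Hiberlink/Memento project's code.

def jaccard_score(a: str, b: str) -> float:
    """Jaccard similarity over word sets, scaled to 0-100."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)

def sorensen_dice_score(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over word sets, scaled to 0-100."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))

pre = "the quick brown fox"   # text at t-1
post = "the quick brown cat"  # text at t+1
print(jaccard_score(pre, post))        # 3 shared / 5 total = 60.0
print(sorensen_dice_score(pre, post))  # 2*3 / (4+4) = 75.0
```

A pair scoring 100 on every measure would count as an exact match; the project requires a perfect score across all four before treating a Memento as representative.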

So we defined a Memento as representative if it gets a perfect score across all four measures. And we did some sanity checks too, via HTTP headers – the E-Tag and Last-Modified being the same is a good signal. And that sanity check passed: 98.88% of Mementos were representative.
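
The header-based sanity check can be sketched like this (a hypothetical helper with invented header values, not the authors' actual code):

```python
# Sketch of the sanity check described above: if the ETag or
# Last-Modified preserved for the pre- and post-publication Mementos
# match, the resource very likely did not change between captures.

def headers_agree(pre: dict, post: dict) -> bool:
    """True if either validator header is present in both and identical."""
    for name in ("ETag", "Last-Modified"):
        if name in pre and name in post and pre[name] == post[name]:
            return True
    return False

pre = {"ETag": '"abc123"', "Last-Modified": "Tue, 01 Jan 2008 00:00:00 GMT"}
post = {"ETag": '"abc123"'}
print(headers_agree(pre, post))  # True: same ETag on both captures
```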

Out of the 650k pairs we found, about 313k URIs have representative Mementos. There wasn’t any big difference across the three collections.

Now, with these 313k links, over 200k had a live site. And that allowed us to analyse and compare the live and archived versions. We used those same four measures to check similarity. Those vary so we aggregate. And we find that 23.7% of URIs have not drifted. But that means that over 75% have drifted and may not be representative of author intent.

In our work 25% of the most recent papers we looked at (2012) have not drifted at all. That gets worse going back in time, as is intuitive. Again, the differences across the corpora aren’t huge. PMC isn’t quite the same – as there were fewer articles initially. But the trend is common… In Elsevier’s 1997 works only 5% of content has not drifted.

So, take aways:

  1. Scholarly articles increasingly contain URI references to web-at-large resources.
  2. Such resources are subject to reference rot (link rot and content drift).
  3. Custodians of these resources are typically not overly concerned with the archiving of their content or the longevity of the scholarly record.
  4. Spoiler: Robust links are one way to address this at the outset.


Q1) Have you had any thought on site redesigns where human readable content may not have changed, but pages have.

A1) Yes. We used those four measures to address that… We strip out all of the HTML and formatting. Cosine ignores very minor “and” vs. “or” changes, for instance.

Q1) What about Safari’s readability mode?

A1) No. We used something like Beautiful Soup to strip out code. Of course you could also do visual analysis to compare pages.
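
A minimal version of that stripping step, using only the standard library rather than Beautiful Soup itself, might look like the following (a simplified stand-in for the same idea: drop all markup and compare visible text only):

```python
# Extract visible text from HTML before running similarity measures,
# skipping <script> and <style> content entirely.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def visible_text(html: str) -> str:
    """Return only the user-visible text of an HTML document."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.parts)

print(visible_text("<p>Hello <b>world</b><script>x=1</script></p>"))
# Hello world
```

With Beautiful Soup installed, `BeautifulSoup(html, "html.parser").get_text()` does much the same job.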

Q2) You are systematically underestimating the problem… You are looking at publication date… It will have been submitted earlier – generally 6-12 months.

A2) Absolutely. For the sake of the experiment it’s the best we can do… Ideally you’d be as close as possible to the authoring process… When published, as you say, it may already have drifted…

Q3) A comment and a question… 

Preprints versus publication… 

A3) No, we didn’t look explicitly at pre-prints. In arXiv those are…

The URIs in Elsevier articles seem to rot more than those in arXiv articles… We think that could be because Elsevier articles tend to reference more .com URIs whereas arXiv references more .org URIs, but we need more work to explore that…

Nicholas Taylor: Understanding legal use cases for web archives

I am going to talk about use of web archives in litigation. But out of scope here is the areas of perservation of web citations; terms of service and API agreements for social media collection; copyright; right to be forgotten.

So, why web archives? Well, it’s where the content is. In some cases social media may only be available in web archives. Courts do now accept web archive evidence. The earliest IAWM (Internet Archive Wayback Machine) evidence was as early as 2004. Litigants routinely challenge this evidence but courts often accept IAWM evidence – commonly through affidavit or testimony, through judicial notice, sometimes through expert testimony.

The IA have affidavit guidance and they suggest asking the court to ensure they will accept that evidence, making that the issue for the courts not the IA. And interpretation is down to the parties in the case. There is also information on how the IAWM works.

Why should we care about this? Well legal professionals are our users too. Often we have unique historical data. And we can help courts and juries correctly interpret web archive evidence leading to more informed outcomes. Other opportunities may be to broaden the community of practice by bringing in legal technology professionals. And this is also part of mainstreaming web archives.

Why might we hesitate here? Well, typically cases serve private interests rather than public goods. There is an immature open source software culture for legal technology. And market solutions for web and social media archiving for this context do already exist.

Use cases for web archiving in litigation mainly have to do with information on individual webpages at a point in time; information on individual webpages over a period of time; and persistence of navigational paths over a period of time. Types of cases include civil litigation and intellectual property cases (which have a separate court in the US). I haven’t seen any criminal cases using the archive but that doesn’t mean they don’t exist.

Where archives are used there is a focus on authentication and validity of the record. Telewizja Polska USA Inc v. Echostar Video Inc. (2004) saw the parties argue over the evidence, but the court accepted it. In Specht v. Google Inc (2010) the evidence was not admissible as it had not come in through the affidavit rule.

Another important rule in the US context is judicial notice (FRE 201), which allows a fact to be entered into evidence. Archives have been used in this context, for instance in Martins v 3PD, Inc (2013) and Pond Guy, Inc. v. Aquascape Designs (2011). And in Tompkins v 23andMe, Inc (2014) both parties used IAWM screenshots, and the court went out and found further screenshots that countered both of these to an extent.

Expert testimony (FRE 702) has included Khoday v Symantec Corp et al (2015), where the expert on navigational paths was queried but the court approved that testimony.

In terms of reliability factors, things raised as concerns include the IAWM disclaimer, incompleteness, provenance, and temporal coherence. I haven’t seen any examples on discreteness, temporal coherence with HTTP headers, etc.

Nassar v Nassar (2017) was a defamation case where the IAWM disclaimer saw the court not accept evidence from the archive.

Stabile v. Paul Smith Ltd. (2015) saw incomplete archives used, with the court acknowledging this but accepting the relevance of what was entered.

Marten Transport Ltd v Plattform Advertising Inc. (2016) was also incomplete, with discussion of banners and ads, but the court understood that IAWM does account for some of this. Objections have included issues with crawlers, and concern that a human witness wasn’t directly involved in capturing the pages. The literature includes different perceptions of incompleteness. We also have issues of live site “leakage” via AJAX – where new ads leaked into archived pages…

Temporal coherence can be complicated. Web archive captures can include mementos that are embedded and archived at different points in time, so that the composite does not totally make sense.

The Memento Time Travel service shows you temporal coherence. See also Scott Ainsworth’s work. That kind of visualisation can help courts to understand temporal coherence. Other datetime estimation strategies include “Carbon Dating” (and constituent services), comparing X-Archive-Orig-Last-Modified with the Memento datetime, etc.
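
A sketch of that header-comparison strategy, with invented example values (both headers use the standard HTTP date format, so the standard library can parse them):

```python
# Compare the origin server's preserved Last-Modified header with the
# capture's Memento-Datetime. If the page was last modified well before
# it was captured, the capture time overestimates the content's age.
from email.utils import parsedate_to_datetime

memento_datetime = "Sun, 02 Aug 2009 20:15:00 GMT"     # when it was archived
orig_last_modified = "Mon, 27 Jul 2009 11:30:00 GMT"   # when it last changed

captured = parsedate_to_datetime(memento_datetime)
modified = parsedate_to_datetime(orig_last_modified)

print((captured - modified).days)  # 6 days between last change and capture
```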

Interpreting datetimes is complicated, and of great importance in legal cases. These can be interpreted from a static datetime in the text of an archived page, the Memento datetime, the headers, etc.

Servicenow, Inc. v Hewlett-Packard Co. (2015) was a patent case where things must be published a year before to count as “prior art”, and in this case the archive showed an earlier date than other documentation.

In terms of IAWM provenance… cases have questioned this. Sources for IAWM include a range of different crawls, but what does that mean for reliable provenance? There are other archives out there too, but I haven’t seen evidence of these being used in court yet. Canonicality is also an interesting issue… Personalisation of content served to the archival agent is an unanswered question. What about client artifacts?

So, what’s next? If we want to better serve legal and research use cases, then we need to surface more provenance information, improve interfaces for understanding temporal coherence, and make volatile aspects visible…

So, some questions for you:

  1. why else might we care, or not care, about legal use cases?
  2. what other reliability factors are relevant?
    1. What is the relative importance of different reliability factors?
    2. For what use cases are different reliability factors relevant?


Q1) Should we save WhoIs data alongside web archives?

A1) I haven’t seen that use case but it does provide context and provenance information

Q2) Is the legal status of IA relevant – it’s not a publicly funded archive. What about security certificates or similar to show that this is from the archive and unchanged?

A2) To the first question, courts have typically been more accepting of web evidence from .gov websites. They treat that as reliable or official. Not sure if that means they are more inclined to use it.. On the security side, there were some really interesting issues raised by Ilya and Jack. As courts become more concerned, they may increasingly look for those signs. But there may be more of those concerns…

Q3) I work with one of those commercial providers… A lot of lawyers want to be able to submit WARCs captured by Webrecorder or similar to courts.

A3) The legal system is very document-centric… Much of the data coming in is PDF, and that does raise those temporal issues.

Q3) Yes, but they do also want to render WARC, to bring that in to their tools…

Q4) Did you observe any provenance work outside the archive – developers, GitHub commits… Stuff beyond the WARC?

A4) I didn’t see examples of that… Maybe has to do with… These cases often go back a way… Sites created earlier…

Anastasia Aizman & Matt Phillips: Instruments for web archive comparison in Perma

Matt: We are here to talk to you about some web archiving work we are doing. We are from the Harvard Library Innovation Lab. We have learnt so much from what you are doing, thank you so much. Perma is creating tools to help you cite stuff on the web, to capture the WARC, and to organise those things…

We got started on this work when examining documents in the Supreme Court corpus from 1996 to present. We saw that Zittrain et al, in the Harvard Law Review, found more than 70% of references had rotted. So we wanted to build tools to help with that…

Anastasia: So, we have some questions…

  1. How do we know a website has changed?
  2. How do we know which changes are important?

So, what is a website made of? There are a lot of different resources on, say, a Washington Post article – perhaps 90 components. Some are visual, some are hidden… So, again, how can we tell if the site has changed, and if the change is significant? And how do you convey that to the user?

In 1997, Andrei Broder wrote about syntactic clustering of the web. In that work he looked at every site on the world wide web. Things have changed a great deal since then… Websites are more dynamic now, and we need more ways to compare pages…

Matt: So we have three types of comparison…

  • image comparison – we flatten the page down… If we compare two shots of Hacker News a few minutes apart there is a lot of similarity, but difference too… So we create a third image showing/highlighting the differences and can see where those changes are…

Why do image comparison? It’s kind of a dumb way to understand difference… Well it’s a mental model the human brain can take in. The HCI is pretty simple here – users regularly experience that sort of layering – and we are talking general web users here. And it’s easy to have images on hand.

So, sometimes it works well… Here’s an example… A silly one… A post that is the same but we have a cup of coffee with and without coffee in the mug, and small text differences. Comparisons like this work well…

But it works less well where we see banner ads on webpages and they change all the time… But what does that mean for the content? How do we fix that? We need more fidelity, we need more depth.

Anastasia: So we need another way to compare… Looking at a Washington Post page from 2016 and 2017… Here we can see what has been deleted, and we can see what has been added… And the tagline of the paper itself has changed in this case.

The pros of this highlighting approach are that it’s in use in lots of places and it’s intuitive… BUT it has to ignore invisible-to-the-user tags. And it is kind of stupid… With two totally different headlines, both saying “Supreme Court”, it sees similarity where there is none.

So what about other similarity measures… ? Maybe a score would be nice, rather than an overlay highlighting change. So, for that we are looking at:

  • Jaccard coefficient (MinHash) – this is essentially like applying a Venn diagram to two archives.
  • Hamming distance (SimHash) – this turns strings into 1s and 0s and figures out where the differences are, as a distance/ratio.
  • SequenceMatcher (baseline/truth) – this looks for sequences of words… It is good but hard to use as it is slow.
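
A toy version of the SimHash/Hamming-distance idea described above (a sketch of the general technique, not Perma's implementation):

```python
# Toy SimHash: hash each token to a 64-bit value, take signed bit votes
# across all tokens, and keep bits with a positive vote as the
# fingerprint. Similar texts produce fingerprints with a small Hamming
# distance.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    votes = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

d_same = hamming_distance(simhash("supreme court ruling"),
                          simhash("supreme court ruling"))
d_diff = hamming_distance(simhash("supreme court ruling"),
                          simhash("completely different text"))
print(d_same)  # identical text gives distance 0
print(d_diff)
```

MinHash, by contrast, estimates the Jaccard coefficient by comparing the minimum hash values of each set, which is why the two measures can disagree on the same pair of pages.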

So, we took Washington Post archives (2000+) and resources (12,000) and looked at SimHash – big gaps. MinHash was much closer…

When we can calculate those changes… do they matter? If it’s ads, do you care? Some people will. Human eyes are needed…

Matt: So, how do we convey this information to the user… Right now in Perma we have a banner, we have highlighting, or you can choose image view. And you can see changes highlighted in “File Changes” panel on top left hand side of the screen. You can click to view a breakdown of where those changes are and what they mean… You can get to an HTML diff (via Javascript).

So, those are our three measures sitting in our Perma container..

Anastasia: So future work – coming soon – will look at weighted importance. We’d love your ideas on what is important – is HTML more important than text? We want a command line (CLI) tool as well. And then we want to look at a similarity measure for images – there is other research on this out there, and we need to look at that. We want a “paranoia” heuristic – to see EVERY change, but with a tickbox to show only the important changes. And we need to work together!

Finally we’d like to thank you, and our colleagues at Harvard who support this work.


Q1) Nerdy questions… How tightly bound are these similarity measures to the tool?

A1 – Anastasia) Not at all – you should be able to use them on the command line.

A1 – Matt) Perma is a Python Django stack and it’s super open source so you should be able to use this.

Comment) This looks super awesome and I want to use it!

Matt) These are really our first steps into this… So we welcome questions, comments, discussion. Come connect with us.

Anastasia) There is so much more work we have coming up that I’m excited about… Cutting up website to see importance of components… Also any work on resources here…

Q2) Do you primarily serve legal scholars? What about litigation stuff Nicholas talked about?

A2) We are in the law school but Perma is open to all. The litigation stuff is interesting..

A2 – Anastasia) It is a multi purpose school and others are using it. We are based in the law school but we are spreading to other places!

Q3) Thank you… There were HTML comparison tools that existed… But they go away and then we have nothing. A CLI will be really useful… And a service comparing any two URLs would be useful… Maybe worth looking at work on Memento damage – missing elements and their impact on the page – CSS, colour, alignment, missing images, etc., and relative importance. How do you highlight invisible changes?

A3 – Anastasia) This is really the complexity of this… And of the UI… Showing the users the changes… Many of our users are not from a technical background… Educating by showing changes is one way. The list with the measures is just very simple… But if a hyperlink has changed, that potentially is more important… So, do we organise the list to indicate importance? Or do we calculate that another way? We welcome ideas about that?

Q3) We have a service running in Memento showing scores on various levels that shows some of that, which may be useful.

Q4) So, a researcher has a copy of what they were looking at… Can other people look at their copy? So, researchers can use this tool as proof that it is what they cited… Can links be shared?

A4 – Matt) Absolutely. We have a way to do that from the Blue Book. Some folks make these private but that’s super super rare…

Understanding user needs (Chair Nicola Bingham)

Peter Webster, Chris Fryer & Jennifer Lynch: Understanding the users of the Parliamentary Web Archive: a user research project

Chris: We are here to talk about some really exciting user needs work we’ve been doing. The Parliamentary Archives holds several million historical records relating to Parliament, dating from 1497. My role is to ensure that archive continues, in the form of digital records as well. One aspect of that is the Parliamentary Web Archive. This captures around 30 URLs – the official Parliamentary websphere content from 2009. But we also capture official social media feeds – Twitter, Facebook and Instagram. This work is essential as it captures our relationship with the public. But we don’t have a great idea of our users’ needs, and we wanted to find out more and understand what they use and what they need.

Peter: The objectives of the study were:

  • assess levels and patterns of use – what areas of the sites they are using, etc.
  • gauge levels of user understanding of the archive
  • understand the value of each kind of content in the web archive – to understand curation effort in the future.
  • test UI for fit with user needs – and how satisfied they were.
  • identify most favoured future developments – what directions should the archive head in next.

The research method was an analysis of usage data, then a survey questionnaire – and we threw lots of effort at engaging people in that. There were then 16 individual user observations, where we sat with the users and asked them to carry out tests and narrate their work. And then we had group workshops with parliamentary staff and public engagement staff, as well as four workshops with the external user community tailored to particular interests.

So we had a rich set of data from this. We identified important areas of the site. We also concluded that the archive and the relationship to the Parliament website, and that website itself, needed rethinking from the ground up.

So, what did we find of interest to this community?

Well, we found users are hard to find and engage – despite engaging the social media community – and staff were similarly hard to engage, not least as the internal workshop was just after the EU referendum. Users are largely ignorant about what web archives are – we asked about the UK Web Archive, the Government Web Archive, and the Parliamentary Web Archive… It appeared that survey respondents understood what these are, BUT in the workshops most were thinking about the online version of Hansard – a kind of archive, but not what was intended. We also found that users are not always sure what they’re doing – particularly when viewing in a live browser snapshots of the site from previous dates, as several snapshots might exist from different points in time. There were also some issues with understanding the Wayback Machine surround for the archived content – difficulty understanding what was content and what was the frame. There was a particular challenge around using URL search. People tried everything they could to avoid it… We asked them to find archived pages for the homepage of… and had many searches for “homepage” – there was a real lack of understanding of the browse and search functionality. There is also no correlation between how well users did with the task and how well they felt they did. I take from that that a lack of feedback, requests or issues does not mean there is not an issue.

Second group of findings… We struggled to find academic participants for this work. But our users prioritised in their own way. It became clear that users wanted discovery mechanisms that match their mental map – and actually the archive mapped more to an internal view of how parliament worked… And browsing taxonomies and structures didn’t work for them. That led to a card sorting exercise to rethink this. We also found users liked structures and wanted discovery based on entities: people, acts, publications – so search connected with that structure works well. Also users were very interested to engage in their own curation, tagging and folksonomy, make their own collections, share materials. Teachers particularly saw potential here.

So, what don’t users want? They have a variety of real needs but they were less interested in derived data sets like link browse. I demonstrated data visualisation, including things like n-grams and work on WARCs; API access; take-home data… No interest from them!

So, three general lessons come out of this… If you are engaging in this sort of research, spend as much resource on it as possible. We need to cultivate the users we do know – they are hard to find but great when you find them. And remember the diversity of the groups of users you deal with…

Chris: So the picture Peter is painting is complex, and can feel quite disheartening. But this work has uncovered issues in some of our assumptions, and really highlights the needs of users in the public. We now have a much better understanding and can start to address these concerns.

What we’ve done internally is raise the profile of the Parliamentary Web Archive amongst colleagues. We got delayed with procurement… But we have a new provider (MirrorWeb) and they have really helped here too. So we are now in a good place to deliver a user-centred resource at:

We would love to keep the discussion going… Just not about #goatgate! (contact them on @C_Fryer and @pj_webster)


Q1) Do you think there will be tangible benefits for the service and/or the users, and how will you evidence that?

A1 – Chris) Yes. We are redeveloping the web archive. And as part of that we are looking at how we can connect the archive to the catalogue, which is all part of a new online services project. We have tangible results to work on… It’s early days but we want to translate them into tangible benefits.

Q2) I imagine the parliament is a very conservative organisation that doesn’t delete content very often. Do you have a sense of what people come to the archive for?

A2 – Chris) Right now it is mainly people who are very aware of the archive, what it is and why it exists. But the research highlighted that many of the people less familiar with the archive wanted the archived versions of content on the live site, and the older content was more of interest.

A2 – Peter) One thing we did was to find out what the difference was between what was on the live website and what was on the archive… And looking ahead… The archive started in 2009… But demand seems to be quite consistent in terms of type of materials.

A2 – Chris) But it will take us time to develop and make use of this.

Q3) Can you say more about the interface and design… So interesting that they avoid the URL search.

A3 – Peter) The outsourced provider was Internet Memory Research… When you were in the archive there was an A-Z browse, a keyword search and a URL search. Above that the site had a taxonomy that linked out, and that didn’t work. I asked them to use that browse and it was clear that their thought process directed them to the wrong places… So the recommendation was that it needs to be elsewhere, and more visible.

Q4) You were talking about users wanting to curate their own collections… Have you been considering setting up user dashboards to create and curate collections.

A4 – Chris) We are hoping to do that with our website and service, but it may take a while. But it’s a high priority for us.

Q5) I was interested to understand: the users that you selected for the survey… were they connected before and part of the existing user base, or did you find them through your own efforts?

A5 – Peter) a bit of both… We knew more about those who took the survey and they were the ones we had in the observations. But this was a self selecting group, and they did have a particular interest in the parliament.

Emily Maemura, Nicholas Worby, Christoph Becker & Ian Milligan: Origin stories: documentation for web archives provenance

Emily: We are going to talk about origin stories and it comes out of interest in web archives, provenance, trust. This has been a really collaborative project, and working with Ian Milligan from Toronto. So, we have been looking at two questions really: How are web archives made? How can we document or communicate this?

We wanted to look at the choices and decisions in creating collections. We have been studying the creation of University of Toronto Libraries (UTL) Archive-It collections:

  • Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving.
  • Toronto 2015 Pan Am games (crawled regularly for one month one-off event)
  • Global Summitry Archive

So, thinking about web archives and how they are made we looked at the Web Archiving Life Cycle Model (Bragg et al 2013), which suggests a linear process… But the reality is messier… and iterative as test crawls are reviewed, feed into production crawls… But are also patched as part of QA work.

From this work then we have four things you should document for provenance:

  1. Scoping is iterative and regularly reviewed, and the data budget is a key part of this.
  2. The process of crawls is important to document, as the influence of live web content and actors can be unpredictable.
  3. There may be different considerations for access; choices for mode of access can impact discovery, and may be particularly well suited to particular users or use cases.
  4. The fourth thing is context: the organisational and environmental factors that influence a web archiving programme. That context is important to understand those decision spaces and choices.

Nick: So, in order to understand these collections we had to look at the organisational history of web archiving. For us web archiving began in 2005, and we piloted what became Archive-It in 2006. It was in a liminal state for about 8 years… There were few statements around collection development until last year really. But the new policy talks about scoping, policy, permissions, etc.

So that transition towards service is reflected in staffing. It is still a part time commitment but is written into several people’s job descriptions now, and it is higher profile. But there are resourcing challenges around crawling platforms – the earliest archives had to be automatic; data budgets; storage limits. There are policies, permissions, robots.txt policy, access restrictions. And there is the legal context… Copyright laws changed a lot in 2012… We started with permissions, then opt-outs, but now it’s take-down based…

Looking in turn at these collections:

Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving. Covers main parties and ever changing group of loosely defined interest groups. This was hard to understand as there were four changes of staff in the time period.

Toronto 2015 Pan Am games (crawled regularly for one month one-off event) – based around a discrete event.

Global Summitry Archive – this is a collaborative archive, developed by researchers. It is a hybrid and is an ongoing collection capturing specific events.

In terms of scoping we looked at motivation – whether a mandate, an identified need or use, or collaboration or coordination amongst institutions. These projects are based around technological budgets and limitations… In some cases we only really understand what’s taking place when we see crawling taking place. Researchers did think ahead but, for instance, video is excluded… and there is no description of why text was prioritised over video or other storage. You can see evidence of a lack of explicit justifications for crawling particular sites… We have some information and detail, but it’s really useful to annotate content.

In the most recent elections the candidate sites had altered robots.txt… They weren’t trying to block us but the technology used and their measures against DDOS attacks had that effect.

In terms of access we needed metadata and indexes, but the metadata and how they are populated shapes how that happens. We need interfaces but also data formats and restrictions.

Emily: We tried to break out these interdependencies and interactions around what gets captured… Whether a site is captured is down to a mixture of organisational policies and permissions; legal context and copyright law for fair dealing, etc. The wider context elements also change over time… Including changes in staff, as well as changes in policy, in government, etc. This can all impact usage and clarity of how what is there came to be.

So, conclusions and future work… In telling the origin stories we rely on many different aspects and it is very complex. We are working towards an extended paper. We believe a little documentation goes a long way… We have a proposal for structured documentation:


Q1) We did this exercise in the Netherlands… We needed to go further in the history of our library… Because in the ’90s we already collected interesting websites for clients – the first time we thought about the web as an important stance.. But there was a gap there between the main library work and the web archiving work…

Q2) I always struggle with what can be conveyed that is not in the archive… Sites not crawled, technical challenges, sites that it was decided not to crawl early on… That very initial thinking needs to be conveyed to pre-seed things… It is hard to capture that…

A2 – Emily) There is so much in scoping that is before the seed list that gets into the crawl… Nick mentioned there are proposals for new collections that explains the thinking…

A2 – Nick) That’s about the best way to do it… Can capture pre-seeds and test crawls… But need that “what should be in the collection”

A2 – Emily) The CPPP is actually based on a prior web list of suggested sites… Which should also have been archived.

Q3) In any kind of archive the same issues are hugely present… Decisions are rarely described… though there is a whole area of post-modern archival description around that… But a lot comes down to the creator of the collection. And I haven’t seen much work on what should be in the archive that is expected to be there… A different context I guess…

A3 – Emily) I’ve been reading a lot of postmodern archival theory… It is challenging to document all of that, especially in a way that is useful for researchers… But we have to be careful not to transfer over all those issues from the archive into the web archive…

Q4) You made the point that the liberal party candidate had blocked access to the Internet Archive crawler… That resonated for me as that’s happened a few times for our own collection… We have legal deposit legislation and that raises questions of whose responsibility it is to take that forward..

A4 – Nick) I found it fell to me… Once we got the right person on the phone it was an easy case to make – and it wasn’t one site but all the candidates for that party!

Q5) Have you had any positive or negative responses to opt-outs and takedowns?

A5 – Nick) We don’t host our own Wayback Machine so we use the Internet Archive’s policy. We honour takedowns but get very, very few. Our communications team might have felt differently but we had someone quite bullish in charge.

Nicola) As an institution there is a very variable appetite for risk – hard to communicate internally, let alone externally to our users.

Q6) In your research have you seen any web archive documenting themselves well? People we should follow? Or based mainly on your archives?

A6) It’s mainly based on our own archives… We haven’t done a comprehensive search of other archives’ documentation.

Jackie Dooley, Alexis Antracoli, Karen Stoll Farrell & Deborah Kempe: Developing web archiving metadata best practices to meet user needs

Alexis: We are going to present on the OCLC Research Library Partnership web archiving working group. So, what was the problem? Well, web archives are not very easily discoverable in the ways people are usually used to discovering archives or library resources. This was the most widely shared issue across two OCLC surveys and so a working group was formed.

At Princeton we use Archive-It, but you had to know we did that… It wasn’t in the catalogue, it wasn’t on the website… So you wouldn’t find it… Then we wanted to bring it into our discovery system but that meant two different interfaces… So… If we take an example of one of our finding aids… We have the College Republican Records (2004-2016) – an on-campus group with websites… This was catalogued with DACS. But how to use the title and dates appropriately? Is the date the content, the seed, what?! And extent – documents, space, or…? We went for the number of websites as that felt like something users would understand. We wrote Archive-It into the description… But we wanted guidelines…

So, the objective of this group is to develop best practices for web archiving descriptive metadata. We have undertaken a literature review, and looked at best practices for descriptive metadata across single and multiple sites.

Karen: For our literature review we looked at peer reviewed literature but also some other sources, and synthesised that. So, who are the end users of web archives? I was really pleased the UK Parliament work focused on public users, as the research tends to focus on academia. Where we can get some clarity on users is on their needs: to read specific web pages/sites; data and text mining; technology development or systems analysis.

In terms of behaviours, Costa and Silva (2010) classify three groups, much cited by others: Navigational; Informational; or Transactional.

Takeaways… A couple of things that we found – some beyond metadata… Raw data can be a high barrier, so users want accessible interfaces and unified searches, but the user does want to engage directly with the metadata to understand the background and provenance of the data. We need to be thinking about flexible formats and engagement. To enable access we need re-use and rights statements. And we need to be very direct in indicating live versus archived material.

Users also want provenance: when and why was this created? They want context. They want to know the collection criteria and scope.

For metadata practitioners there are distinct approaches… archival and bibliographic – RDA, MARC, Dublin Core, MODS, finding aids, DACS; data elements vary widely, and change quite quickly.

Jackie: We analysed metadata standards and institutional guidelines; we evaluated existing metadata records in the wild… Our preparatory work raised a lot of questions about building a metadata description… Is the website creator/owner the publisher? Author? Subject? What is the title? Who is the host institution – and will it stay the same? Is it important to clearly state that the resource is a website (not a “web resource”)?

And what does the provenance actually refer to? We saw a lot of variety!

In terms of setting up the context we have use cases for libraries, archives, research… Some comparisons between bibliographic and archival approaches to description; description of archived and live sites – libraries mostly catalogue live rather than archived sites; and then you have different levels… Collection level, site level… And there might be document-level descriptions.

So, we wanted to establish data dictionary characteristics. We wanted something simple, not a major new cataloguing standard. So this is a lean 14-element data dictionary, grounded in existing cataloguing rules, so it can be part of wider systems. Common elements are used for the identification and discovery of types of resources; other elements have to have clear applicability in the discovery of all types of resources. But some things aren’t included as they are not specific to web archives – e.g. audience.

So the 14 data elements are:

  • Access/rights*
  • Collector
  • Contributor*
  • Creator*
  • Date*
  • Description*…

Elements with asterisks are direct maps to Dublin Core fields.
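As an illustration only (the element names follow the talk; the `dc:` term strings and helper function are my assumption, not part of the OCLC work), the asterisked mappings could be sketched as a simple crosswalk:

```python
# Hypothetical sketch: crosswalk from the data dictionary elements named
# above to Dublin Core terms. Only elements mentioned in the talk are
# included; "Collector" has no Dublin Core equivalent, so it maps to None.
DC_CROSSWALK = {
    "Access/rights": "dc:rights",
    "Contributor": "dc:contributor",
    "Creator": "dc:creator",
    "Date": "dc:date",
    "Description": "dc:description",
    "Collector": None,  # archive-specific; no direct Dublin Core mapping
}

def to_dublin_core(record):
    """Re-key a record's elements to Dublin Core where a mapping exists."""
    return {
        DC_CROSSWALK[k]: v
        for k, v in record.items()
        if DC_CROSSWALK.get(k) is not None
    }

record = {"Creator": "College Republicans", "Date": "2004-2016",
          "Collector": "Princeton University Library"}
# Collector is dropped from the Dublin Core view; it stays archive-specific.
```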

So, Access Conditions (to be renamed “Rights”) is a direct mapping to Dublin Core “Rights”. This provides the circumstances that affect the availability and/or reuse of an archived website or collection – e.g. for Twitter. And it’s not just about rights, because so often we don’t actually know the rights, but we know what can be done with the data.

Collector was the strangest element… There is no equivalent in Dublin Core… This is about the organisation responsible for the curation and stewardship of an archived website or collection. The only other place that uses Collector is the Internet Archive. We did consider “repository” but, while it may do all those things… for archived websites the site lives elsewhere, and e.g. Princeton decides to collect those things.

We have a special case for Collector where Archive-It creates its own collection…

So, we have three publications, due out in July, on this work…


Q1) I was a bit disappointed in the draft report – it wasn’t what I was expecting… We talked about complexities of provenance and wanted something better to convey that to researchers, and we have such detailed technical information we can draw from Archive-It.

A1 – Jackie) Our remit was about description, only. Provenance is bigger than that. Descriptive metadata was appropriate as scope. We did a third report on harvesting tools and whether metadata could be pulled from them… We should have had “descriptive” in our working group name too perhaps…

A1) It is maybe my fault too… But it’s that mapping to DACS that is not perfect… We are taking a different track at the University at Albany.

A1 – Jackie) This is NOT a standard; it addresses an absence of metadata that often exists for websites. Scalability of metadata creation is a real challenge… The average time available is 0.25 FTE looking at this. The provenance, the nuance of what was and was not crawled, is not doable at scale. This is intentionally lean. If you will be using DACS then a lot of data goes straight in. All standards, with the exception of Dublin Core, are more detailed…

Q2) How difficult is this to put into practice for MARC records? For us we treat a website as a collection… You tend to describe the online publication… A lot of what we’d want to put in just can’t make it in…

A2 – Jackie) In MARC the 852 field is the closest to Collector that you can get. (Collector is comparable to Dublin Core’s Contributor; EAD’s <repository>; MARC’s 524, 852 $a and 852 $b; MODS’ location; or schema.org’s schema:OwnershipInfo.)

Researcher case studies (Chair: Alex Thurman)

Jane Winters: Moving into the mainstream: web archives in the press

This paper accompanies my article for the first issue of Internet Histories. I’ll be talking about the increasing visibility of web archives and much greater public knowledge of web archive.

So, who are the audiences for web archives? Well, they include researchers in the arts, humanities and social sciences – my area, and where some tough barriers are. They are also policymakers, particularly crucial in relation to legal deposit and access. Also the “general public” – though it is really many publics. And journalists as mediators with the public.

What has changed with media? Well, there was an initial focus on technology which reached an audience predisposed to that. But increasingly web archives come into discussion of politics and current affairs, and there are also social and cultural concerns starting to emerge. There is real interest around launches and anniversaries – a great way for web archives to get attention, like the Easter Rising archive we heard about this week. We do also get that “digital dark age” klaxon, which web archives can and do address. And with Brexit and Trump there is a silver lining… A real interest in archives as a result.

So in 2013 Niels Brügger arranged the first RESAW meeting in Aarhus. And at that time we had one of these big media moments…

Computer Weekly, 12th November 2013, reported on the Conservatives erasing official records of speeches from the Internet Archive as a serious breach. Coverage in computing media migrated swiftly to coverage in the mainstream press – the Guardian’s election coverage; BBC News… The hook was that a number of those speeches were about the importance of the internet to open public debate… That hook, that narrative, was obviously lovely for the media. Interestingly the Conservatives then responded that many of those speeches were actually still available in the BL’s UK Web Archive. The speeches also made Channel 4 News – and they used it as a hook to talk about broken promises.

Another lovely example was Dr Anat Ben-David from the Open University who got involved with BBC Click on restoring the lost .yu domain. This didn’t come from us trying to get something in the news… They knew our work and we could then point them in the direction of really interesting research… We can all do this highlighting and signposting which is why events like this are so useful for getting to know each others’ work.

When you make the tabloids you know you’ve done well… In 2016 the BBC Food website faced closure as part of cuts. The Independent didn’t lead with the closure, but with how to find the recipes when the website goes… They directed everyone to the Internet Archive – as it’s open (unlike the British Library). The UK Web Archive blog did post about this, explaining what they are collecting, and why they collect important cultural materials. The BBC actually backpedalled… maintaining the pages, but not updating them. But that message got out that web archiving is for everyone… Building it into people’s daily lives.

The UK Web Archive went live in 2013 – the BBC covered this (and the fact that it is not online). The 20th anniversary of the BnF archive had a lot of French press coverage. That’s a great hook as well. Then I mentioned that Digital Dark Age set of stories… Bloomberg had the subtitle “if you want to preserve something, print it” in 2016. We saw similar from the Royal Society. But generally journalists do know who to speak to at the BL, or DPC, or IA to counter that view… It can be a really positive story. Even that negative story can be used as a positive thing if you have that connection with journalists…

So this story – “Raiders of the Lost Web: If a Pulitzer-finalist 34-part series can disappear from the web, anything can” – looks like it will be that sort of story again… But actually this is about the forensic reconstruction of the work. The article also talks about cinema at risk, again preserved thanks to the Internet Archive. The piece of journalism that had been “lost” was about the death of 23 children in a bus crash… It was lost twice: it wasn’t widely reported, then the story disappeared… But the longer article here talks about that case and the importance of web archiving as a whole.

Talking of traumatic incidents… Brexit coverage of the NHS £350m per week saving on the Vote Leave website… It disappeared after the vote. But you can use the Internet Archive, and the structured referendum collection from the UK Legal Deposit libraries, so the promises are retained into the long term…

And finally, on to Trump! In an Independent article on Melania Trump’s website disappearing, the journalist treats the Internet Archive as another source, a way to track change over time…

And indeed all of the coverage of IA in the last year, and their mirror site in Canada, that isn’t niche news, that’s mainstream coverage now. The more we have stories on data disappearing, or removed, the more opportunities web archives have to make their work clear to the world.


Q1) A fantastic talk and close to my heart as I try to communicate web archives. I think that web archives have fame when they get into fiction… The BBC series New Tricks had a denouement centred on finding a record on the Internet Archive… Are there other fictional representations of web archives?

A1) A really interesting suggestion! Tweet us both if you’ve seen that…

Q2) That coverage is great…

A2) Yes, being held to account is a risk… But that is a particular product of our time… Hopefully when it is clear that it is evidence for any set of politicians… The users may be partisan, even if the content is… It’s a hard line to tread… Non publicly available archives mitigate that… But absolutely a concern.

Q3) It is a big win when there are big press mentions… What happens… Is it more people aware of the tools, or specifically journalists using them?

A3) It’s both, but I think it’s how news travels… More people will read an article in the Guardian than will look at the BL website. But they really demonstrate the value and importance of the archive. You want – like the BBC recipe website’s 100k petition – that public support. We ran a workshop here on a random Saturday recently… It was pitched as tracing family or local history… And a couple were delighted to find their church community website from 15 years ago… They learned the value of the archive that way… We did a gaming event with late 1980s games in the IA… That’s brilliant – a kid’s birthday party was going to be inspired by that, a fab use we hadn’t thought of… But journalism is often the easy win…

Q4) Political press and journalistic use is often central… But I love that GifCities project… The nostalgia of the web… The historicity… That use… The way they highlight the datedness of old web design is great… The way we can associate archives with web vernaculars that are not evidenced elsewhere is valuable and awesome… Leveraging that should be kept in mind.

A4) The GifCities always gets a “Wow” – it’s a great way to engage people in a teaching setting… Then lead them onto harder real history stuff..!

Q5) Last year when we celebrated the anniversary I had a chance to speak with journalists. They were intrigued that we collect blogs, forums, stuff that is off the radar… And they titled the article “Maybe your Sky Blog is being archived in France” (Sky Blog is a popular teen blog platform)… But what does it mean not to forget the stupid things people wrote on the internet when they were 15…

A5) We’ve had three sessions so far, and only once did that question arise… But maybe people aren’t thinking like that. It’s more of an issue for the public archive… Less of a worry for a closed archive… But so much of the embarrassing stuff is in Facebook, so not in the archive. It matters especially with the right to be forgotten legislation… But there is also that thing of having written something worth archiving…

Q6) The thing about The Crossing is interesting… Their font was copyrighted… They had to get specific permission from the designer… But that site is in Flash… And soon you’ll need Ilya Kreymer’s old web tools to see it at all.

A6) Absolutely. That’s a really fascinating article and they had to work to revive and play that content…

Q6) And six years old! Only six years!

Cynthia Joyce: Keyword ‘Katrina’: a deep dive through Hurricane Katrina’s unsearchable archive

I’ll be talking about how I use web archives – rather than engaging with the technology directly. I was a journalist for 20 years before teaching journalism, which I do at the University of Mississippi. Every year we take a study group to New Orleans to look at the aftermath of Katrina. Katrina was 12 years ago. But there has been a lot of gentrification and so there are few physical scars there… It was weird to have to explain how hard things were to my 18-year-old students. And I wanted to bring that to life… But not just through the news coverage, which is shown at anniversaries or as an update piece… The story is not a discrete event but an era…

I found the best way to capture that era was through blogging. New Orleans was not a tech-savvy space; it was a poor, black city with high levels of illiteracy. Web 1.0 had skipped New Orleans and the Deep South in a lot of ways… It was pre-Twitter, Facebook was in its infancy, mobiles were primitive. Katrina was probably when many in New Orleans started texting – doable on struggling networks. There was also that Digital Divide – it is out of fashion to talk about this but it is a real gap.

So, 80% of the city flooded, more than 800 people died, 70% of residents were displaced. The storm didn’t cause the problems here, it was the flooding and the failure of the levees. That is an important distinction, as that sparked the rage, the activism, the need for action was about the sense of being lied to and left behind.

I was working as a journalist from 1995 – very much web 1.0. I was an editor post-Katrina. And I was a resident of New Orleans from 2001 to 2007. We had questions of what to do with comments, follow-ups, retention of content… A lot of content didn’t need preserving… But actually that set of comments should be the shame of Advance Digital and Condé Nast… It was interesting how little help they provided to one of their client papers…

I was conducting research as a citizen, but with journalistic principles and approaches… My method was madness, basically… I had instincts, stories to follow, high points, themes that had been missed in mainstream media. I interviewed a lot of people… I followed and used a cross-list of blogrolls… This was a lot of surfing, not just searching…

The Wayback Machine helped me so much there, to see those blogrolls, seeing those pages… That idea of the vernacular, drilling down 10 years later, was very helpful and interesting… To experience it again… To go through, to see common experiences… I also did social media posts and call-outs – an affirmative action approach. African American people were on camera, but there was not a lot of first-person documentation… I posted something on Binders Full of Women Writers… I searched more than 300 blogs. I chose the entries… I did it for them… I picked out moving, provocative, profound content… Then let them opt out, or suggest something else… It was an ongoing dialogue with 70 people crowd-curating a collective diary. New Orleans Press produced a physical book, I sent it to Jefferson, and the IA created a special collection for this.

In terms of choosing themes… The original TOC was based on categories that organically emerged… It’s not all sad, it’s often dark humour…

  • Forever days
  • An accounting
  • Led Astray (pets)
  • Re-entry
  • Kindness of Strangers
  • Indecision
  • Elsewhere = not New Orleans
  • Saute Pans of Mercy (food)
  • Guyville

Guyville, for instance… for months no schools were open, so it was a really male space, then so much construction… But some women there thought that was great too. A really specific culture and space.

Some challenges… Some work was journalists writing off the record. We got permissions where we could – we have them for all of the people who survived.

I just wanted to talk about Josh Cousin, a former resident of the St Bernard projects. His nickname was the “Bookman” – he was an unusual, nerdy kid and was 18 when Katrina hit. They stayed… But were forced to leave eventually… It was very sad… They were forced onto a bus, not told where they were going; they took their dog… Someone on the bus complained and Cheddar was turfed onto the highway… They got taken to Houston. The first post Josh posted was a defiant “I made it” type post… He first had online access at the Astrodome. They had online machines that no-one was using… But he was… And he started getting mail, shoes, stuff in the post… He was training people to use these machines. This kid is a hero… At the sort-of book launch for contributors he brought Cheddar the dog… Found through Petfinder… He had been adopted by a couple in Connecticut who had renamed him “George Michael” – they tried to make Josh pay $3000 as they didn’t want their dog going back to New Orleans…

In terms of other documentary evidence… Material is all PDF only… The email record of Michael D. Brown… shows he’s concerned about dog sitting… And he later criticised people for not evacuating because of their pets… Two weeks later his emails do talk about pets… There were obviously other things going on… But this narrative, this diary of that time… really brings the reality to life.

I was in a newsroom during Arab Spring… And that’s when they had no option but to run what’s on Twitter, it was hard to verify but it was there and no journalists could get in. And I think Katrina was that kind of moment for blogging…

On Archive-It you can find the Katrina collection… Responses ranged from resistance and suspicion to gratitude… Some people barely remembered writing stuff, and certainly didn’t expect it to be archived. I was collecting 8-9 years later… I was reassured to read (in the Chronicle of Higher Ed) about a historian at the Holocaust museum who wasn’t convinced about blogging, until Trump said something stupid and that triggered her to engage.


Q1 – David) In 2002 the LOCKSS program had a meeting with subject specialists at the NY Public Library… And among those publications deemed worth preserving was The Exquisite Corpse, published out of New Orleans. After Katrina we were able to give Andrei Codrescu back his materials and it carried on publishing until 2015… A good news story of archiving from that time.

A1) There are dozens of examples… The thing that I found too is that there is often no appointed steward… Without institutional support it can be passed around, forgotten… I’d get excited, then realise just one person was the advocate, rather than an institution preserving it for posterity.

Andre wrote some amazing things, and captured that mood in the early days of the storm…

Q2) I love how your work shows blending of work and sources and web archives in conversation with each other… I have a mundane question… Did you go through any human subjects approval for this work from your institution.

A2) I was an independent journalist at the time… But I went to the University of New Orleans as the publisher had done a really interesting project with community work… I went to ask them if this project already existed… And basically I ended up creating it… He said “are you pitching it?” and that’s where it came from. Naiveté benefited me.

Q3) Did anyone opt out of this project, given the traumatic nature of this time and work?

A3) Yes, a lot of people… But I went to people who were thought leaders here, who were likely to see the benefit of this… So, for instance, Karen Gadbois had a blog called Squandered Heritage (now The Lens, the ProPublica of New Orleans)… And the participation of people like that helped build confidence in, and the validity of, the project.

Colin Post: The unending lives of net-based artworks: web archives, browser emulations, and new conceptual frameworks

Framing an artwork is never easy… Art objects are “lumps” of the physical world to be described… But what about net-based artworks? How do we make these objects of art history… And they raise questions of how we define an artwork in the first place… I will talk about Homework by Alexei Shulgin as an example of where we need techniques and practices of web archiving around net-based artworks. I want to suggest a new conceptualisation of net-based artworks as plural, proliferating, heterogeneous archives. Homework is typical, and includes pop-ups and self-conscious elements that make it challenging to preserve…

So, this came from a real assignment for Natalie Bookchin’s course in 1997. Alexei Shulgin encouraged artists to turn in homework for grading, and did so himself… His piece was a single sentence followed by pop-up messages – something we use differently today; they have a different significance… Pop-ups proliferate across the screen like spam, making the user aware of the browser and its affordances and role… Homework replicates structures of authority and expertise – grading, organising, critiquing, including or excluding artists… But rendered absurd…

Homework was intended to be ephemeral… But Shulgin curated the assignments turned in, and the late assignments. It may be tempting to think of these net works as performance art, with records only of a particular moment in time. But actually this is a full record of the artwork… Homework has entered into archives as well as Shulgin’s own space. It is heterogeneous… All acting on the work. The nature of pop-up messages may have changed, but the work carries the conditions of its original creation and is still changing in the world today.

Shulgin, in conversation with Armin Medosch in 1997, felt: “The net at present has few possibilities for self expression but there is unlimited possibility for communication. But how can you record this communicative element, how can you store it?”. There are so many ways and artists, but how to capture them… One answer is web archiving… There are at least 157 versions of Homework in the Internet Archive… This is not comprehensive, but his own site is well archived… The capture of connections is determined by incidence rather than choice… The crawler only caught some of these. But these are not discrete objects… The works on Shulgin’s site, the captures others have made, the websites that are still available – it is one big object. This structure reflects the work itself; archival systems sustain and invigorate it through the same infrastructure…
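As an aside: capture counts like this can be checked against the Wayback Machine’s public CDX API, which returns one row per capture of a URL. A minimal sketch (the example URL below is a placeholder of my own, not taken from the talk):

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import json

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url):
    """Build a CDX API query listing every capture of `url` as JSON."""
    return CDX_ENDPOINT + "?" + urlencode({"url": url, "output": "json"})

def count_captures(url):
    """Fetch the capture list; the first JSON row is a header row,
    each remaining row is one capture."""
    with urlopen(cdx_query_url(url)) as resp:
        rows = json.load(resp)
    return max(len(rows) - 1, 0)

# e.g. count_captures("example.org/homework/") would return how many
# captures the Wayback Machine holds for that (placeholder) URL.
```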

To return to the communicative elements… Archives do not capture the performative aspects of the piece. But we must also attend to the way the object has transformed over time… in order to engage with complex net-based artworks… These cannot be easily separated into “original” and “archived” but exist more as a continuum…

Frank Upward (1996) describes the Records Continuum Model… This is around four dimensions: Creation, Capture, Organisation, and Pluralisation. All of these are present in the archive of Homework… As copies appear in the Internet Archive, in Rhizome… And spread out… You could describe this as the vitalisation of the artwork on the web… at Rhizome is a way to emulate the browser… This provides some assurance of the retention of old websites… But that is not a direct representation of the original work… The context and experience can vary – including the (now) speedy loading of pages… And possible changes in appearance… When I load Homework there… I see 28 captures all combined, from records spanning 10 years… The piece wasn’t uniformly archived at any one time… I view the whole piece but actually it is emulated and artificial… It is disintegrated and inauthentic… But in the continuum it is another continuous layer in space and time.

Niels Brügger in “Website history” (2010) talks about “writing the complex strategic situation in which an artefact is entangled”. Digital archives and emulators preserve Homework, but are in themselves generative… But that isn’t exclusive to web archiving… It is something we see in Eugène Viollet-le-Duc (1854/1996), who talks about re-establishing a work in a finished state that may never in fact have existed at any point in time.

Q1) A really interesting and important work, particularly around plurality. I research at Rhizome and we have worked on the Net Art Anthology – an online exhibition with emulators… Is this faithful… should we present a plural version of the work?

A1) I have been thinking about this a lot… but I don’t think Rhizome should have to do all of this… Art historians should do this contextual work too… The Net Art Anthology does the convenient access work, but art historians need to do the context work too.

Q1) I agree completely. For an art historian, what provenance metadata should we provide for works like this to make them most useful… Give me a while and I’ll have a wish list…

Comment) A shout out for Ghent in Belgium, which is doing work on online art, so I’ll connect you up.

Q2) Is Homework still an active interactive work?

A2) The final list was really in 1997 – it is only on the IA now… It did end at that time… so experiencing the piece is about looking back… that is artefactual, a trace. But Shulgin has past work on his page… sort of a capture and framing as archive.

Q3) How does Homework fit in your research?

A3) I’m interested in 90s art, preservation, and those interactions.

Q4) Have you seen that job of contextualisation done well, presented with the work? I’m thinking of Ellie Harrison’s quantified self work and how different that looked at the time from now…

A4) Rhizome does this well, as do galleries collecting net artists… especially with emulated works… The Guggenheim showed originals and emulated versions, and part of that work was foregrounding the preservation and archiving aspects of the work.

Closing remarks: Emmanuelle Bermès & Jane Winters

Emmanuelle: Thank you all for being here. This has been three very intense days. Five days for those at Archives Unleashed. To close, a few comments on IIPC. We were originally to meet in Lisbon, and I must apologise again to our Portuguese colleagues; we hope to meet there again… But co-locating with RESAW was brilliant – I saw a tweet that we were creating archives in the room next door to those who use and research them. And researchers are our co-creators.

And so many of our questions this week have been about truth and reliability and trust. This is a sign of the growth and maturity of the community.

IIPC has had a tough year. We are still a young and fragile organisation… We have to transition to a strong worldwide community. We need all the voices and inputs to grow and to transform into something more resilient. We will have an annual meeting at an event in Ottawa later this year.

Finally, thank you so much to Jane and colleagues from RESAW, to Nicholas and the WAC committee, and to Olga and the BL for getting this all together so well.

Jane: You were saying how good it has been to bring archivists and researchers together, to see how we can help and not just ask… A few things struck me: discussion of context and provenance; and at the other end permanence and longevity.

We will have a special issue of Internet Histories, so do email us.

Thank you to Niels Brügger and NetLab, the Coffin Trust who funded our reception last night, the RESAW Programme Committee, and the really important people – the events team at the University of London – and Robert Kelly who did our wonderful promotional materials. And Olga, who has made this all possible.

And we do intend to have another RESAW conference in two years, in June.

And thank you to Nicholas and Niels for representing IIPC, and to all of you for sharing your fantastic work.

And with that a very interesting week of web archiving comes to an end. Thank you all for welcoming me along!

Jun 152017

I am again at the IIPC WAC / RESAW Conference 2017 and today I am in the very busy technical strand at the British Library. See my Day One post for more on the event and on the HiberActive project, which is why I’m attending this very interesting event.

These notes are live so, as usual, comments, additions, corrections, etc. are very much welcomed.

Tools for web archives analysis & record extraction (chair Nicholas Taylor)

Digging documents out of the archived web – Andrew Jackson

This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…

  • staff looked in an outlook calendar for reminders
  • looked for new updates since last check
  • download each to local folder and open
  • check catalogue to avoid re-submitting
  • upload to internal submission portal
  • add essential metadata
  • submit for ingest
  • clean up local files
  • update stats sheet
  • Then ingest, usually automated (but can require intervention)
  • Updates catalogue once complete
  • New catalogue records processed or enhanced as necessary.

It was very manual, and very inefficient… So we have created a harvester:

  • Setup: specify “watched targets” then…
  • Harvest (harvester crawls targets as usual) –> ingested… but also…
  • Document extraction:
    • spot documents in the crawl
    • find landing page
    • extract machine-readable metadata
    • submit to W3ACT (curation tool) for review
  • Acquisition:
    • check document harvester for new publications
    • edit essential metadata
    • submit to catalogue
  • Cataloguing
    • cataloguing records processed as necessary
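The document-extraction step above can be sketched roughly as follows. This is an illustrative example only – the tag names and HTML are hypothetical, not the BL's actual code – showing how a landing page can be scanned for machine-readable metadata and document (PDF) links:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> tags and PDF links from a landing page."""
    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            # Publishers often expose citation_title, citation_author, etc.
            self.metadata[attrs["name"]] = attrs["content"]
        elif tag == "a" and (attrs.get("href") or "").lower().endswith(".pdf"):
            self.pdf_links.append(attrs["href"])

html = """<html><head>
<meta name="citation_title" content="Annual Report 2016">
<meta name="citation_author" content="Example Agency">
</head><body><a href="/downloads/report-2016.pdf">Download PDF</a></body></html>"""

parser = MetaExtractor()
parser.feed(html)
print(parser.metadata["citation_title"])  # Annual Report 2016
print(parser.pdf_links)                   # ['/downloads/report-2016.pdf']
```

The extracted guesses would then go to W3ACT for a curator to review, as described above.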

This is better but there are challenges. Firstly, what is a “publication”? With the eprints team there was a one-to-one print and digital relationship. But now there is no more one-to-one. For example, an original report will have an ISBN… But the landing page is a representation of the publication; that’s where the assets are… When stuff is catalogued, what can frustrate technical folk is that you take date and text from the page – honouring what is there rather than normalising it… We can dishonour intent by capturing the pages… It is challenging…

MARC is initially alarming… For a developer used to current data formats it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we actually use MARC, and where we want to be now…

One of the intentions of the metadata extraction work was to provide an initial guess at the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that the names of authors etc. in the document metadata are rarely correct. We use the worst extractor first, and layer up so we have the best shot. What works best is extracting from the HTML; it is a big and consistent publishing space so it’s worth us working on extracting that.

What works even better is the API data – it’s in JSON, it’s easy to parse, it’s worth coding as it is a bigger publisher for us.
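The “layer up” idea – run the least reliable extractor first and let more trusted sources overwrite its guesses – could look something like this sketch. The extractor names and fields here are invented for illustration:

```python
def extract_from_pdf_properties(doc):
    # Embedded PDF metadata: frequently present but often wrong.
    return {"title": doc.get("pdf_title"), "author": doc.get("pdf_author")}

def extract_from_html(doc):
    # Landing-page meta tags: usually better than embedded PDF properties.
    return {"title": doc.get("html_citation_title")}

def extract_from_api(doc):
    # Publisher API JSON: easy to parse and most reliable when available.
    return {"title": doc.get("api_title"), "author": doc.get("api_author")}

def layered_extract(doc):
    """Run extractors from least to most trusted; later non-empty values win."""
    result = {}
    for extractor in (extract_from_pdf_properties, extract_from_html, extract_from_api):
        for key, value in extractor(doc).items():
            if value:
                result[key] = value
    return result

doc = {"pdf_title": "untitled1.pdf",
       "html_citation_title": "Policy Paper 42",
       "api_author": "Cabinet Office"}
print(layered_extract(doc))  # {'title': 'Policy Paper 42', 'author': 'Cabinet Office'}
```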

But now we have to resolve references… Multiple use cases for “records about this record”:

  • publisher metadata
  • third party data sources (e.g. Wikipedia)
  • Our own annotations and catalogues
  • Revisit records

We can’t ignore the revisit records… Have to do a great big join at some point… To get best possible quality data for every single thing….
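That “great big join” might be sketched like this: group the “records about a record” by URL and let the most trusted source win per field. The trust ordering here is an assumption for illustration, not the BL's actual ranking:

```python
from collections import defaultdict

def join_records(records):
    """Group 'records about a record' by URL, keeping the best value per field."""
    # Assumed source ranking, least to most trusted; a curatorial decision in practice.
    trust = {"revisit": 0, "publisher": 1, "wikipedia": 2, "catalogue": 3}
    best = defaultdict(dict)
    for rec in sorted(records, key=lambda r: trust[r["source"]]):
        for key, value in rec.items():
            if key != "source" and value:
                best[rec["url"]][key] = value  # more trusted sources overwrite
    return dict(best)

records = [
    {"url": "http://example.gov/report", "source": "publisher", "title": "report.pdf"},
    {"url": "http://example.gov/report", "source": "catalogue", "title": "Annual Report"},
    {"url": "http://example.gov/report", "source": "revisit", "last_seen": "2017-05-01"},
]
merged = join_records(records)
print(merged["http://example.gov/report"]["title"])  # Annual Report
```

Revisit records contribute their own fields (capture history) without displacing the catalogue's better title.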

And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But when I retry document extraction I can accidentally run up another chain each time… If we do our Solr searches correctly it should be easy, so we will be correcting this…

We do need to do more future experimentation… Multiple workflows bring synchronisation problems. We need to ensure documents are accessible when discoverable, and need to be able to re-run automated extraction.

We want to iteratively improve automated metadata extraction:

  • improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
  • Bring together different sources
  • Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals)

And we still have that tension over what a publication is… A tension between established practice and publisher output. We need to trial different approaches with catalogues and users… Close that whole loop.


Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…

A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.

Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert Van de Sompel and Michael Nelson about this thing of describing a document…

A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently….

Q2) Geoffrey Bilder also working on this…

A2) And that’s the ideal… To improve the standards more broadly…

Q3) Are these all PDF files?

A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…

Q4) What does the user see at the end of this… Is it a PDF?

A4) This work ends up in our search service, and that metadata helps them find what they are looking for…

Q4) Do they know its from the website, or don’t they care?

A4) Officially, the way the library thinks about monographs and serials, would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too..

Q4) For me as an archivist all that data on where the document is from, what issues in accessing it they were, etc. would extremely useful…

Q5) You spoke yesterday about engaging with machine learning… Can you say more?

A5) This is where I’d like to do more user work. The library is keen on subject headings – that’s a big high-level challenge so it’s quite amenable to machine learning. We have a massive golden data set… There’s at least a master’s thesis in there, right! And if we built something, then ran it over the 3 million-ish items with little metadata, it could be incredibly useful. In my opinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…

Comment) That thing of everything ending up as a PDF is on the way out, by the way… You should look at – a new journal from Google and Y Combinator – and that’s the future of these sorts of formats: it’s JavaScript and GitHub. Can you collect it? Yes, you can. You can visit the page, switch off the network, and it still works… And it’s there and will update…

A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.

Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform

Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.

So, we have fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But whilst these are amazing collections, they aren’t accessed or used much. I think this is mainly down to two issues: people don’t know they are there, and the access mechanisms don’t fit well with their practices. Maybe when the Archive-It API is live that will fix it all… Right now, though, it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting; most use the Archive-It service. But if you are a researcher… to use web archives you really have to be interested and engaged; you need to be an expert.

So, building this portal is about making this easier to use… We want web archives to be used on page 150 in some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, Simon Fraser University – that represents about half of the archive in Canada.

We work on workflow… We run workshops… We separated the collections so that postdocs can look at this.

We are using Warcbase and command line tools. We transferred data from the Internet Archive and generate checksums; we generate scholarly derivatives – plain text, hypertext graph, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at visualisations of the Canadian political parties and political interest group web crawls, which track changes, although that may include crawler issues.

Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.

Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tight-knit, and I like how we build and share and develop on each other’s work. Last year we presented this work; we’ve indexed 10 TB of WARCs since then, representing 200+ million Solr docs. We have grown from one collection and have needed additional facets: institution, collection name, collection ID, etc.

Then we have also dealt with scaling issues… from a 30-40 GB to a 1 TB sized index. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on an OpenStack… Right now it isn’t at production scale, but it’s getting there. We are looking at SolrCloud and potentially using a shard per collection.
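For flavour, a faceted query against a Solr index with the institution/collection facets mentioned above might be built like this. The core name and field names here are hypothetical, not the project's actual schema:

```python
import json
import urllib.parse
import urllib.request

def build_select_url(solr_url, query, facet_fields):
    """Build a Solr /select URL faceted by e.g. institution and collection."""
    params = {
        "q": query,
        "rows": 10,
        "wt": "json",
        "facet": "true",
        "facet.field": facet_fields,   # doseq=True repeats the parameter per field
        "facet.mincount": 1,
    }
    return solr_url + "/select?" + urllib.parse.urlencode(params, doseq=True)

def facet_search(solr_url, query, facet_fields):
    """Run the query and return Solr's JSON response (requires a running Solr)."""
    with urllib.request.urlopen(build_select_url(solr_url, query, facet_fields)) as resp:
        return json.load(resp)

# Hypothetical usage against a local index:
url = build_select_url("http://localhost:8983/solr/walk", "pipeline",
                       ["institution", "collection_name"])
print(url)
```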

Last year we had a Solr index using the Shine front end… It’s great but it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is Blacklight, which is in wide use in libraries. There is a bigger community, better APIs, bug fixes, etc… So we have set up a prototype called Warclight. It does almost all that Shine does, except the tree structure and the advanced searching…

Ian spoke about derivative datasets… For each collection, via Blacklight or ScholarsPortal we want domain/URL Counts; Full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.

So, that goal Ian talked about: one central hub for archived data and derivatives…


Q1) Do you plan to make graphs interactive, by using Kibana rather than Gephi?

A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That was great but didn’t look great in the browser. But it would be great if the casual user could look at drag and drop R type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…

A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kibana for stuff so in due course we may bring that in…

Q2) Interesting as we are doing similar things at the BnF. We did use Shine, looked at Blacklight, but built our own thing…. But we are looking at what we can do… We are interested in that web archive discovery collections approaches, useful in other contexts too…

A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet..

Q2) We tried to give people WARC and WARC files… Our actual users didn’t want that, they want full text…

A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…

Q3) Do you think, in a few years’ time…?

A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next three years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder…

Q4) What are some of the organisational, admin and social challenges of building this?

A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging…. “is an institution going to devote a person to this?”

A4 – Ian) This is about making this more accessible… People are more used to Blacklight than Shine. People respond poorly to WARCs. But they can deal with PDFs and CSVs; those are familiar formats…

A4 – Nick) And when I get back I’m going to be doing some work and sharing to enable an actual community to work on this..

Gregory Wiedeman: Automating access to web archives with APIs and ArchivesSpace

A little bit of context here… At the University at Albany, SUNY, we are a public university with state records laws that require us to archive. This is consistent with traditional collecting. But we have no dedicated web archives staff – so no capacity for lots of manual work.

One thing I wanted to note is that web archives are records. Some have a paper equivalent, or did for many years (e.g. the Undergraduate Bulletin). We also have things like Word documents. And then we have things like university sports websites, some of which we do need to keep…

The seed isn’t a good place to manage these as records. But archives theory and practices adapt well to web archives – they are designed to scale, they document and maintain context, with relationship to other content, and a strong emphasis on being a history of records.

So, we are using DACS – Describing Archives: A Content Standard – to describe archives, so why not use that for web archives? It focuses on intellectual content, is ignorant of formats, and is designed for pragmatic access to archives. We also use ArchivesSpace – a modern tool for aggregated records that allows curators to add metadata about a collection. And it interleaves with our physical archives.

So, for any record in our collection… you can specify a subject… A Python script goes to look at our CDX, looks at numbers, schedules processes, and then, as we crawl a collection, records the extents and data collected… And then this shows in our catalogue… So we have our paper records, our digital captures… Users can then find an item, and only then do you need to think about format and context. And there is an awesome article by David Graves(?) which talks about how that aggregation encourages new discovery…
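A script in that spirit – querying a Wayback-style CDX server API and counting captures per year for a seed – might look like this. The endpoint, seed, and row shape are illustrative, not Albany's actual setup:

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

def summarise(rows):
    """rows: CDX-server JSON output – a header row followed by capture rows."""
    header, data = rows[0], rows[1:]
    ts = header.index("timestamp")
    return Counter(row[ts][:4] for row in data)  # count captures per year

def capture_counts_by_year(cdx_endpoint, url_prefix):
    """Fetch captures for a seed via a Wayback-style CDX server API."""
    query = urllib.parse.urlencode({
        "url": url_prefix,
        "matchType": "prefix",
        "output": "json",
    })
    with urllib.request.urlopen(cdx_endpoint + "?" + query) as resp:
        return summarise(json.load(resp))

# Offline demo with the row shape such an API returns:
sample = [
    ["timestamp", "original", "statuscode"],
    ["20160101000000", "http://www.albany.edu/athletics/", "200"],
    ["20170301120000", "http://www.albany.edu/athletics/", "200"],
]
print(summarise(sample))  # Counter({'2016': 1, '2017': 1})
```

Such per-year extents could then be written into the ArchivesSpace record alongside the physical holdings.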

Users need to understand where web archives come from. They need provenance to frame their research question – it adds weight to their research. So we need to capture what was attempted to be collected – collecting policies included. We have just started to do this with a statement on our website. We need a more standardised content source. This sort of information should be easy to use and comprehend, but it is hard to find the right format to do that.

We also need to capture what was collected. We are using the Archive-It Partner Data API, part of the Archive-It 5.0 system. That API captures:

  • type of crawl
  • unique ID
  • crawl result
  • crawl start, end time
  • recurrence
  • exact date, time, etc…

This looks like a big JSON file. Knowing what has been captured – and not captured – is really important to understand context. What can we do with this data? Well we can see what’s in our public access system, we can add metadata, we can present some start times, non-finish issues etc. on product pages. BUT… it doesn’t address issues at scale.
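A minimal sketch of turning such a JSON crawl report into a human-readable note for a finding aid. The field names here are illustrative, not the exact Partner Data API schema:

```python
def crawl_summary(crawl_reports):
    """Condense Archive-It-style crawl report JSON into a note for an
    archival description (field names are illustrative)."""
    lines = []
    for report in crawl_reports:
        lines.append(
            "{type} crawl {id}: {start} to {end}, result: {result}".format(
                type=report.get("crawl_type", "unknown"),
                id=report.get("id", "?"),
                start=report.get("start_time", "?"),
                end=report.get("end_time", "?"),
                result=report.get("result", "?"),
            )
        )
    return "\n".join(lines)

reports = [
    {"crawl_type": "scheduled", "id": 101, "start_time": "2017-01-01T00:00",
     "end_time": "2017-01-01T06:00", "result": "finished"},
    {"crawl_type": "test", "id": 102, "start_time": "2017-02-01T00:00",
     "end_time": "2017-02-01T01:00", "result": "aborted"},
]
print(crawl_summary(reports))
```

Surfacing aborted or partial crawls like this is exactly the "what was not captured" context researchers need.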

So, we are now working on a new open digital repository using the Hydra system – though not called that anymore! Possibly we will expose data in the API. We need standardised data structure that is independent of tools. And we also have a researcher education challenge – the archival description needs to be easy to use, re-share and understand.

Find our work – sample scripts, command line query tools – on Github:


Q1) Right now people describe collection intent, crawl targets… How could you standardise that?

A1) I don’t know… Need an intellectual definition of what a crawl is… And what the depth of a crawl is… They can produce very different results and WARC files… We need to articulate this in a way that is clear for others to understand…

Q1) Anything equivalent in the paper world?

A1) It is DACS but in the paper work we don’t get that granular… This is really specific data we weren’t really able to get before…

Q2) My impression is that ArchiveSpace isn’t built with discovery of archives in mind… What would help with that…

A2) I would actually put less emphasis on web archives… Long term you shouldn’t have all these things captured. We just need a good API access point really… I would rather it be modular, I guess…

Q3) Really interesting… the definition of Archive-It, what’s in the crawl… And interesting to think about conveying what is in the crawl to researchers…

A3) From what I understand the Archive-It people are still working on this… With documentation to come. But we need granular way to do that… Researchers don’t care too much about the structure…. They don’t need all those counts but you need to convey some key issues, what the intellectual content is…

Comment) Looking ahead to the WASAPI presentation… Some steps towards vocabulary there might help you with this…

Comment) I also raised that sort of issue for today’s panels – high level information on crawl or collection scope. Researchers want to know when crawlers don’t collect things, when they stop – usually to do with freak-outs about what isn’t retained… But that idea of understanding absence really matters to researchers… It is really necessary to get some of this… There is a crapton of data in the partners API – most isn’t super interesting to researchers, so some community effort to find 6 or 12 data points that can explain that crawl process, the gaps, etc. would help…

A4) That issue of understanding users is really important, but also hard as it is difficult to understand who our users are…

Harvesting tools & strategies (Chair: Ian Milligan)

Jefferson Bailey: Who, what, when, where, why, WARC: new tools at the Internet Archive

Firstly, apologies for any repetition between yesterday and today… I will be talking about all sorts of updates…

So, WayBack Search… You can now search the WayBackMachine… including keyword, host/domain search, etc. The index is built on inbound anchor text links to a homepage. It is pretty cool and it’s one way to access this content which is not URL based. We also wanted to look at domain and host routes into this… So, if you look at the page for a given host or domain, you can now see statistics and visualisations. And there is an API so you can make your own visualisations – for hosts or for domains.

We have done stat counts for specific domains or crawl jobs… The API is all in json so you can just parse this for, for example, how much of what is archived for a domain is in the form of PDFs.
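For example, given such JSON stat counts keyed by MIME type (the shape here is an assumption about the API output), the PDF share of a domain could be computed like so:

```python
def pdf_share(mime_counts):
    """Fraction of archived captures for a domain that are PDFs,
    given a mapping of MIME type to capture count (shape is illustrative)."""
    total = sum(mime_counts.values())
    return mime_counts.get("application/pdf", 0) / total if total else 0.0

counts = {"text/html": 9000, "application/pdf": 800, "image/jpeg": 200}
print(round(pdf_share(counts), 2))  # 0.08
```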

We also now have search by format using the same idea, the anchor text, the file and URL path, and you can search for media assets. We don’t have exciting front end displays yet… But I can search for e.g. Puppy, mime type: video, 2014… And get lots of awesome puppy videos [the demo is the Puppy Bowl 2014!]. This media search is available for some of the WayBackMachine for some media types… And you can again present this in the format and display you’d like.

For search and profiling we have a new 14-column CDX including new language, simhash and sha256 fields. Language will help users find material in their local/native languages. The simhash is pretty exciting… it allows you to see how much a page has changed. We have been using it on Archive-It partners… and it is pretty good. For instance, seeing a government blog change month to month shows the (dis)similarity.
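A toy version of SimHash shows why it works for change detection: similar pages yield fingerprints with a small Hamming distance. This is a generic sketch of the technique, not IA's implementation:

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash over word tokens: similar texts get nearby fingerprints."""
    vector = [0] * bits
    for token in text.lower().split():
        # Hash each token to 64 bits; each bit votes +1/-1 on the fingerprint.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

v1 = simhash("minutes of the march committee meeting on web archiving")
v2 = simhash("minutes of the april committee meeting on web archiving")
print(hamming(v1, v2))  # small: the pages are similar
print(hamming(v1, simhash("totally unrelated page about puppy videos")))  # larger
```

Comparing a page's SimHash to its previous capture is also how this could support deduplication, as raised in Q4 below.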

For those that haven’t seen the capture tool – Brozzler is in production in Archive-It, with 3 dozen organisations using it. This has also led to warcprox developments too. It was intended for AV and social media stuff. We have a Chromium cluster… It won’t do domain harvesting, but it’s good for social media.

In terms of crawl quality assurance we are working with the Internet Memory Foundation to create quality tools. These build on internal crawl priorities work at IA – crawler beans, comparison testing. And this is about quality at scale. You can find reports on how we also did associated work on the WayBackMachine’s crawl quality. We are also looking at tools to monitor crawls for partners, trying to find large scale crawling quality issues as they happen… There aren’t great analytics… But there is domain-scale monitoring, domain scale patch crawling, and Slack integrations.

For domain scale work, for patch crawling, we use WAT analysis for embeds and most-linked resources. We rank by inbound links and add to the crawl. ArchiveSpark is a framework for cluster-based data extraction and derivation (WA+).
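The inbound-link ranking for patch crawling could be sketched like this, given WAT-derived (source, target) link pairs. The data shapes and URLs are illustrative:

```python
from collections import Counter

def rank_patch_candidates(link_pairs, already_crawled):
    """Rank uncrawled URLs by inbound-link count, from (source, target) link data."""
    inbound = Counter(target for _, target in link_pairs)
    return [url for url, _ in inbound.most_common() if url not in already_crawled]

links = [
    ("http://a.gov/", "http://a.gov/report.pdf"),
    ("http://a.gov/news", "http://a.gov/report.pdf"),
    ("http://a.gov/", "http://a.gov/contact"),
]
candidates = rank_patch_candidates(links, {"http://a.gov/contact"})
print(candidates)  # ['http://a.gov/report.pdf']
```

The top-ranked missing URLs would then be added to a patch crawl.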

Although this is a technical presentation we are also doing an IMLS funded project to train public librarians in web archiving to preserve online local history and community memory, working with partners in various communities.

Other collaborations and research include our End of Term web archive 2016/17, when the administration changes… No one is the official custodian for the .gov web. And this year the widespread deletion of data has given this work greater profile than usual. This time the work was with IA, LOC, UNT, GWU, and others: 250+ TB of .gov/.mil as well as White House and Obama social media content.

There has already been discussion of the Partner Data API. We are currently re-building this, so come talk to me if you are interested. We are working with partners to make sure this is useful, makes sense, and is made more relevant.

We take a lot of WARC files from people to preserve… So we are looking to see how we can get partners to do this with and for us. We are developing a pipeline for automated WARC ingest for web services.

There will be more on WASAPI later, but this is part of work to ensure web archives are more accessible… And that uses API calls to connect up repositories.

We have also built a WAT API that allows you to query most of the metadata for a WARC file. You can feed it URLs and get back what you want – except the page type.

We have new portals and searches now and coming. This is about putting new search layers on TLD content in the WayBackMachine… So you can pick media types, and just from one domain, and explore them all…

And with a statement on what archives should do – involving a gif of a centaur entering a rainbow room – that’s all… 


Q1) What are implications of new capabilities for headless browsing for Chrome for Brozzler…

A1 – audience) It changes how fast you can do things, not really what you can do…

Q2) What about HTTP POST for WASAPI?

A2) Yes, it will be in the Archive-It web application… We’ll change a flag and then you can go and do whatever… And there is reporting on the backend. It doesn’t usually affect crawl budgets, and it should be pretty automated… There is a UI… Right now we do a lot manually; the idea is to do it less manually…

Q3) What do you do with pages that don’t specify encoding… ?

A3) It doesn’t go into URL tokenisation… We would wipe character encoding in anchor text – it gets cleaned up before Elasticsearch…

Q4) The SIMHASH is before or after the capture? And can it be used for deduplication

A4) After capture before CDX writing – it is part of that process. Yes, it could be used for deduplication. Although we do already do URL deduplication… But we could compare to previous SIMHASH to work out if another copy is needed… We really were thinking about visualising change…

Q5) I’m really excited about WATS… What scale will it work on…

A5) The crawl is on 100 TB – we mostly use existing WARC and Json pipeline… It performs well on something large. But if a lot of URLs, it could be a lot to parse.

Q6) With quality analysis and improvement at scale, can you tell me more about this?

A6) We’ve given the IMF access to our own crawls… But we have been comparing our own crawls to our own crawls… Comparing to Archive-It is more interesting… And looking at domain level… We need to share some similar size crawls – BL and IA – and figure out how results look and differ. It won’t be content based at that stage; it will be hotpads and URLs and things.

Michele C. Weigle, Michael L. Nelson, Mat Kelly & John Berlin: Archive what I see now – personal web archiving with WARCs

Mat: I will be describing tools here for web users. We want to enable individuals to create personal web archives in a self-contained way, without external services. Standard web archiving tools are difficult for non-IT-experts, and “Save page as” is not suitable for web archiving. Why do this? It’s for people who don’t want to touch the command line, but also to ensure content is preserved that wouldn’t otherwise be. More archives are more better.

It is also about creation and access, as both elements are important.

So, our goals involve advancing development of:

  • WARCreate – create WARC from what you see in your browser.
  • Web Archiving Integration Layer (WAIL)
  • Mink

WARCreate is… a Chrome browser extension to save WARC files from your browser; no credentials pass through third parties. It heavily leverages the Chrome webRequest API. But it was built in 2012, and APIs and libraries have evolved since, so we had to work on that. We also wanted three new modes for browser-based preservation: record mode – retain the buffer as you browse; countdown mode – preserve a reloading page on an interval; event mode – preserve the page when it is automatically reloaded.

So you simply click on the WARCreate button in the browser to generate WARC files – designed for non-technical people.

Web Archiving Integration Layer (WAIL) is a stand-alone desktop application; it offers collection-based web archiving, and includes Heritrix for crawling, OpenWayback for replay, and Python scripts compiled to OS-native binaries (.app, .exe). One recent advancement was a new user interface. We ported Python to Electron – using web technologies to create native apps, which means you can use native languages to help you preserve. We also moved from a single archive to collection-based archiving, ported OpenWayback to pywb, and started doing native Twitter integration – over time and hashtags…

So, the original app was a tool to enter a URI and then get a notification. The new version is a little more complicated but provides that new collection-based interface. Right now both of these are out there… Eventually we’d like to merge functionality here. So, an example here, looking at the UK election as a collection… You can enter information, then crawl to within defined boundaries… You can kill processes, or restart an old one… And this process integrates with Heritrix to give status of a task here… And if you want to Archive Twitter you can enter a hashtag and interval, you can also do some additional filtering with keywords, etc. And then once running you’ll get notifications.

Mink… is a Google Chrome browser extension. It indicates archival capture counts as you browse, and quickly submits URIs to multiple archives from the UI. The name is from Mink(owski) space. Recent enhancements include adding the number of archived pages to an icon at the bottom of the page, allowing users to set preferences on how to view large sets of mementos, and communication with user-specified or local archives…

The old Mink interface could be affected by page CSS, as it sat in the DOM. So we have moved to a shadow DOM, making it more reliable and easy to use. You then have more consistent, intuitive Miller columns for many captures. It’s an integration of live and archived web whilst you are viewing the live web. You can see year, month, day, etc., refined to what you want to look at. And you have an icon in Mink to make a request to save the page now – with notification of status.

So, in terms of tool integration…. We want to ensure integration between Mink and WAIL so that Mink points to local archives. In the future we want to decouple Mink from external Memento aggregator – client-side customisable collection of archives instead.

See: for tools and source code.


Q1) Do you see any qualitative difference in capture between WARCreate and WARC recorder?

A1) We capture the representation right at the moment you saw it.. Not the full experience for others, but for you in a moment of time. And that’s our goal – what you last saw.

Q2) Who are your users, and do you have a sense of what they want?

A2) We have a lot of digital humanities scholars wanting to preserve Twitter and Facebook – the stream as it is now, exactly as they see it. So that’s a major use case for us.

Q3) You said it is watching as you browse… What happens if you don’t select a WARC?

A3) If you have hit record you could build up content as pages reload and are in that record mode… It will impact performance but you’ll have a better capture…

Q3) Just a suggestion but I often have 100 tabs open but only want to capture something once a week so I might want to kick it off only when I want to save it…

Q4) That real time capture/playback – are there cool communities you can see using this…

A4) Yes, I think with CNN coverage of a breaking storm allows you to see how that story evolves and changes…

Q5) Have you considered a mobile version for social media/web pages on my phone?

A5) Not currently supported… Chrome doesn’t support that… There is an app out there that lets you submit to archives, but not to create WARC… But there is a movement to making those types of things…

Q6) Personal archiving is interesting… but jailed in my laptop… Great for personal content… But then can I share my WARC files with the wider community?

A6) That’s a good idea… And more captures is better… So there should be a way to aggregate these together… I am currently working on that, but you should need to be able to specify what is shared and what is not.

Q6) One challenge there is about organisations and what they will be comfortable with sharing/not sharing.

Lozana Rossenova and Ilya Kreymer, Rhizome: Containerised browsers and archive augmentation

Lozana: As you probably know, Webrecorder is a high fidelity interactive recording of any web site you browse – and how you engage. And we have recently released an app in Electron format.

Webrecorder is a worm’s eye view of archiving, tracking how users actually move around the web… For instance, for Instagram and Twitter posts around #lovewins you can see the quality is high. Webrecorder uses symmetrical archiving – in the live browser and in a remote browser… And you can capture, then replay…

In terms of how we organise webrecorder: we have collections and sessions.

The thing I want to talk about today is remote browsers, and my work with Rhizome on internet art. A lot of these works actually require old browser plugins and tools… So Webrecorder enables capture and replay even where the technology is no longer available.

To clarify: the programme says “containerised” but we now refer to this as “remote browsers” – still using Docker containers to run these various older browsers.

When you go to record a site you select the browser and the site, and it begins the recording… The Java applet runs and shows you a visualisation of how it is being captured. You can do this with Flash as well… If we open the multimedia in a normal (Chrome) browser, it doesn’t work. Restoration is easier with just Flash; you need other things to capture Flash with other dependencies and interactions.

Remote browsers are really important for Rhizome work in general, as we use them to stage old artworks in new exhibitions.

Ilya: I will be showing some upcoming beta features, including ways to use Webrecorder to improve other archives…

Firstly, which other web archives? So I built a public web archives repository:

And with this work we are using WAM – the Web Archiving Manifest. And added a WARC source URI and WARC creation date field to the WARC Header at the moment.
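To illustrate the idea, a WARC record header carrying those two provenance fields might look like the sketch below. This is my own illustration, not the presenters’ code – check the WAM spec for the exact field spellings.

```python
# Sketch: a WARC record header extended with the two provenance fields
# mentioned in the talk (source archive URI and original creation date);
# field names here are assumptions based on the description.
def warc_record_header(record_id, target_uri, source_uri, source_date):
    """Build a WARC/1.0 response-record header block as a string."""
    fields = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", record_id),
        ("WARC-Target-URI", target_uri),
        # Provenance of a patched capture: which archive it came from, and when.
        ("WARC-Source-URI", source_uri),
        ("WARC-Creation-Date", source_date),
    ]
    lines = ["WARC/1.0"] + [f"{k}: {v}" for k, v in fields]
    return "\r\n".join(lines) + "\r\n\r\n"

hdr = warc_record_header(
    "<urn:uuid:1234>",
    "http://example.com/",
    "https://web.archive.org/web/20170302000000/http://example.com/",
    "2017-03-02T00:00:00Z",
)
print(hdr)
```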

So, Jefferson already talked about patching – patching remote archives from the live web… It is an approach where we patch either from the live web or from other archives, depending on what is available or missing. So, for instance, if I look at a Washington Post page in the archive from 2nd March… It shows how other archives are being patched in to deliver the page to me… In the collection I have a thing called “patch” that captures this.

Once pages are patched, then we introduce extraction… We are extracting again using remote archiving and automatic patching. So you combine extraction and patching features. You create two patches and two WARC files. I’ll demo that as well… So, here’s a page from the CCA website and we can patch that… And then extract that… And then when we patch again we get the images, the richer content, a much better recording of the page. So we have 2 WARCs here – one from the British Library archive, one from the patching that might be combined and used to enrich that partial UKWA capture.

Similarly we can look at a CNN page and take patches from e.g. the Portuguese archive. And once it is done we have a more complete archive… When we play this back you can display the page as it appeared, and patch files are available for archives to add to their copy.

So, this is all in beta right now but we hope to release it all in the near future…


Q1) Every web archive already has a temporal issue where the content may come from other dates than the page claims… But you could aggravate that problem. Have you considered this?

A1) Yes. There are timebounds for patching. And also around what you display to the user so they understand what they see… e.g. to patch only within the week or the month…

Q2) So it’s the closest date to what is in web recorder?

A2) The other sources are the closest successful result on/closest to the date from another site…

Q3) Rather than a fixed window for collection, seeing frequency of change might be useful to understand quality/relevance… But I think you are replaying…

A3) Have you considered a headless browser… with the address bar…

A3 – Lozana) Actually for us the key use case is about highlighting and showcasing old art works to the users. It is really important to show the original page as it appeared – in the older browsers like Netscape etc.

Q4) This is incredibly exciting. But how difficult is the patching… What does it change?

A4) If you take a good capture and a static image is missing… Those are easy to patch in… If highly contextualised – like Facebook, that is difficult to do.

Q5) Can you do this in realtime… So you archive with then you want to patch something immediately…

A5) This will be in the new version I hope… So you can check other sources and fall back to other sources and scenarios…

Comment – Lozana) We have run UX work with an archiving organisation in Europe for cultural heritage, and their use case is that they use Archive-It and do QA the next day… A crawl might miss something highly dynamic, so they want to be able to patch it pretty quickly.

Ilya) If you have an archive that is not in the public archive list on Github please do submit it as a fork request and we’ll be able to add it…

Leveraging APIs (Chair: Nicholas Taylor)

Fernando Melo and Joao Nobre: API: enabling automatic analytics over historical web data

Fernando: We are a publicly available web archive, mainly of Portuguese websites from the .pt domain. So, what can you do with our API?

Well, we built our first image search using our API, for instance a way to explore Charlie Hebdo materials; another application enables you to explore information on Portuguese politicians.

We support the Memento protocol, and you can use the Memento API. We are one of the time gates for time travel searches. And we also have full text search as well as URL search, through our OpenSearch API. We have extended our API to support temporal searches of the Portuguese web. Full text search requests can be made through a URL query, e.g. a query for euro 2004 would search for mentions of euro 2004, and you can add parameters to this, or search as a phrase rather than keywords.

You can also search mime types – so just within PDFs for instance. And you can also run URL searches – e.g. all pages from the New York Times website… And if you provide time boundaries the search will look for the capture from the nearest date.
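As a rough illustration of the requests described, here is a small Python sketch. The endpoint path and parameter names are my assumptions for illustration, not the documented API – only the kinds of filters (temporal bounds, mime type, phrase search) come from the talk.

```python
from urllib.parse import urlencode

# Assumed endpoint path, for illustration only.
BASE = "https://arquivo.pt/textsearch"

def fulltext_query(terms, from_ts=None, to_ts=None, mime=None, phrase=False):
    """Build a hypothetical full-text search URL with temporal bounds."""
    q = f'"{terms}"' if phrase else terms  # quote the terms for a phrase search
    params = {"q": q}
    if from_ts:
        params["from"] = from_ts  # temporal lower bound, e.g. YYYYMMDDHHMMSS
    if to_ts:
        params["to"] = to_ts
    if mime:
        params["type"] = mime     # e.g. restrict results to "pdf"
    return BASE + "?" + urlencode(params)

url = fulltext_query("euro 2004", from_ts="20040101000000", to_ts="20041231235959")
print(url)
```

With time boundaries supplied, the archive would (as described) return captures nearest the requested dates.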

Joao: I am going to talk about our image search API. This works based on keyword searches, you can include operators such as limiting to images from a particular site, to particular dates… Results are ordered by relevance, recency, or by type. You can also run advanced image searches, such as for icons, you can use quotation marks for names, or a phrase.

The request parameters include:

  • query
  • stamp – timestamp
  • start – first index of search
  • safeImage (yes; no; all) – restricts search only to safe images.

The response is returned in JSON with total results, URL, width, height, alt, score, timestamp, mime, thumbnail, nsfw, and pageTitle fields.
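To make the response shape concrete, here is a sketch of parsing such a result. The values are made up and the exact key spellings are my assumptions – only the field names listed above come from the talk; the base64 thumbnail decoding reflects the answer given in the Q&A below.

```python
import base64
import json

# Hypothetical response illustrating the fields described in the talk;
# values and exact key spellings are illustrative assumptions.
sample = json.dumps({
    "totalResults": 1,
    "results": [{
        "url": "http://example.pt/img.png",
        "width": 640, "height": 480,
        "alt": "an example image",
        "score": 0.9,
        "timestamp": "20040612000000",
        "mime": "image/png",
        "thumbnail": base64.b64encode(b"\x89PNG...").decode("ascii"),
        "nsfw": False,
        "pageTitle": "Example page",
    }],
})

data = json.loads(sample)
for img in data["results"]:
    # The thumbnail is base64-encoded, so decode it back to raw image bytes.
    thumb_bytes = base64.b64decode(img["thumbnail"])
    print(img["pageTitle"], img["mime"], len(thumb_bytes), "bytes")
```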

More on all of this:


Q1) How do you classify safe for work/not safe for work?

A1 – Fernando) This is a closed beta version. Safe/not safe for work is based on a classifier built around a training set from Yahoo. We are not for blocking things, but we want to be able to exclude shocking images if needed.

Q1) We have this same issue in the GifCities project – we have a manually curated training set to handle that.

Comment) Maybe you need to have more options for that measure to provide levels of filtering…

Q2) With that json response, why did you include title and alt text…

A2) We process the image and extract, from the URL, the image text… So we capture the image, the alt text, but we thought that perhaps the page title would be interesting, giving some sense of context. Maybe the text before/after would also be useful but that takes more time… We are trying to keep this working…

Q3) What is the thumbnail value?

A3) It is in base64. But we can make that clearer in the next version…

Nicholas Taylor: Lots more LOCKSS for web archiving: boons from the LOCKSS software re-architecture

This is following on from the presentation myself and colleagues did at last year’s IIPC on APIs.

LOCKSS came about from a serials librarian and a computer scientist. They were thinking about emulating the best features of the system for preserving print journals, allowing libraries to conserve their traditional role as preserver. The LOCKSS boxes would sit in each library, collecting from publishers’ website, providing redundancy, sharing with other libraries if and when that publication was no longer available.

18 years on this is a self-sustaining programme running out of Stanford, with tens of networks and hundreds of partners. Lots of copies isn’t exclusive to LOCKSS, but it is the decentralised replication model that addresses the facts that long-term bit integrity is hard to ensure, and that more (correlated) copies don’t necessarily keep things safe and can leave them vulnerable to hackers. So this model is community approved, published on, and well established.

Last year we started re-architecting the LOCKSS software as a series of web services. Why do this? Well, to reduce support and operation costs – taking advantage of other software on the web and in web archiving; to de-silo components and enable external integration – we want components to find use in other systems, especially in web archiving; and we are preparing to evolve with the web, to adapt our technologies accordingly.

What that means is that LOCKSS systems will treat WARC as a storage abstraction, and more seamlessly handle processing layers, proxies, etc. We also already integrate Memento, but this will also let us engage with WASAPI – of which more in the next talk.

We have built a service for bibliographic metadata extraction, for web harvest and file transfer content; we can map values in DOM tree to metadata fields; we can retrieve downloadable metadata from expected URL patterns; and parse RIS and XML by schema. That model shows our bias to bibliographic material.
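To give a flavour of the bibliographic side, a minimal RIS parser might look like the sketch below. This is my own illustration of the general technique, not the LOCKSS service – the tag-to-field mapping is an assumption, and the real plugin framework is far richer.

```python
# Map a few standard RIS tags to metadata fields (illustrative subset).
RIS_MAP = {"TI": "title", "AU": "authors", "PY": "year", "DO": "doi"}

def parse_ris(text):
    """Parse a single RIS record into a metadata dict."""
    record = {"authors": []}
    for line in text.splitlines():
        # RIS lines look like "TI  - Some Title": 2-char tag, "  - ", value.
        if len(line) < 6 or line[2:6] != "  - ":
            continue
        tag, value = line[:2], line[6:].strip()
        field = RIS_MAP.get(tag)
        if field == "authors":
            record["authors"].append(value)
        elif field:
            record[field] = value
    return record

ris = """TY  - JOUR
TI  - An Example Article
AU  - Doe, Jane
PY  - 2017
ER  - 
"""
rec = parse_ris(ris)
print(rec["title"], rec["year"])
```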

We are also using plugins to make bibliographic objects and their metadata on many publishing platforms machine-intelligible. We mainly work with publishing/platform heuristics like Atypon, Digital Commons, HighWire, OJS and Silverchair. These vary so we have a framework for them.

The use cases for metadata extraction would include applying it to consistent subsets of content in larger corpora; curating PA materials within broader crawls; retrieving faculty publications online; or retrieving from university CMSs. You can also undertake discovery via bibliographic metadata, with your institution’s OpenURL resolver.

As described in a 2005 D-Lib paper by DSHR et al, we are looking at on-access format migration. For instance x-bitmap to GIF.

Probably the most important core preservation capability is the audit and repair protocol. Network nodes conduct polls to validate the integrity of distributed copies of data chunks. More nodes = more security – more nodes can be down; more copies can be corrupted… The nodes do not trust each other in this model and responses cannot be cached. And when copies do not match, the node audits and repairs.
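A much-simplified sketch of the idea, in Python: the poller issues a fresh random nonce, each node hashes its copy together with the nonce (so answers cannot be precomputed or cached), and disagreeing digests reveal a damaged copy. This is my own toy illustration – the real LOCKSS protocol is far more involved.

```python
import hashlib
import secrets

def vote(nonce, content):
    """A node's poll response: hash of (nonce + its copy of the content)."""
    return hashlib.sha256(nonce + content).hexdigest()

def run_poll(copies):
    """Return the nodes whose copies disagree with the majority."""
    nonce = secrets.token_bytes(16)  # fresh nonce: responses can't be cached
    votes = {node: vote(nonce, data) for node, data in copies.items()}
    tally = {}
    for v in votes.values():
        tally[v] = tally.get(v, 0) + 1
    majority = max(tally, key=tally.get)
    return [node for node, v in votes.items() if v != majority]

copies = {
    "node-a": b"the archived content",
    "node-b": b"the archived content",
    "node-c": b"the archived CONTENT",  # a corrupted copy
}
print("needs repair:", run_poll(copies))
```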

We think that functionality may be useful in other distributed digital preservation networks, in repository storage replication layers. And we would like to support varied back-ends including tape and cloud. We haven’t built those integrations yet…

To date our progress has addressed the WARC work. By end of 2017 we will have Docker-ised components, have a web harvest framework, polling and repair web service. By end of 2018 we will have IP address and Shibboleth access to OpenWayBack…

By all means follow and plug in. Most of our work is in a private repository, which then copies to GitHub. And we are moving more towards a community orientated software development approach, collaborating more, and exploring use of LOCKSS technologies in other contexts.

So, I want to end with some questions:

  • What potential do you see for LOCKSS technologies for web archiving, other use cases?
  • What standards or technologies could we use that we maybe haven’t considered?
  • How could we help you to use LOCKSS technologies?
  • How would you like to see LOCKSS plug in more to the web archiving community?


Q1) Will these work with existing LOCKSS software, and do we need to update our boxes?

A1) Yes, it is backwards compatible. And the new features are containerised so that does slightly change the requirements of the LOCKSS boxes but no changes needed for now.

Q2) Where do you store bibliographic metadata? Or is it in the WARC?

A2) It is separate from the WARC, in a database.

Q3) With the extraction of the metadata… We have some resources around translators that may be useful.

Q4 – David) Just one thing on your simplified example… For each node… they all have to calculate a new separate nonce… None of the answers are the same… They all have to do all the work… It’s actually a system where untrusted nodes are compared… And several nodes can’t gang up on the others… Each peer randomly decides when to poll on things… There is no leader here…

Q5) Can you talk about format migration…

A5) It’s a capability already built into LOCKSS but we haven’t had to use it…

A5 – David) It’s done on the HTTP requests, which include acceptable formats… You can configure this thing so that if an acceptable format isn’t found, then you transform the content to an acceptable format… (see the paper mentioned earlier). It is based on mime type.
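The mechanism described is essentially HTTP content negotiation. A minimal sketch of the idea, with a stub converter standing in for a real x-bitmap-to-GIF transformation (my own illustration, not the LOCKSS implementation):

```python
def accepts(accept_header, mime):
    """Check whether a mime type satisfies an HTTP Accept header."""
    types = [part.split(";")[0].strip() for part in accept_header.split(",")]
    return mime in types or "*/*" in types

def serve(stored_mime, payload, accept_header, converters):
    """Serve stored content as-is if acceptable, else migrate on access."""
    if accepts(accept_header, stored_mime):
        return stored_mime, payload
    for target, convert in converters.items():
        if accepts(accept_header, target):
            return target, convert(payload)  # on-access format migration
    raise ValueError("no acceptable format")

# Stub converter standing in for a real x-bitmap -> GIF transformation.
converters = {"image/gif": lambda data: b"GIF89a" + data}

mime, body = serve("image/x-xbitmap", b"<xbm data>", "image/gif,image/png", converters)
print(mime)
```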

Q6) We are trying to use LOCKSS as a generic archive crawler… Is that still how it will work…

A6) I’m not sure I have a definitive answer… LOCKSS will still be web harvesting-based. It will still be interesting to hear about approaches that are not web harvesting based.

A6 – David) Also interesting for CLOCKSS which are not using web harvesting…

A6) For the CLOCKSS and LOCKSS networks – the big networks – the web harvesting portfolio makes sense. But other networks with other content types, that is becoming more important.

Comment) We looked at doing transformation that is quite straightforward… We have used an API

Q7) Can you say more about the community project work?

A7) We have largely run LOCKSS as more of an in-house project, rather than a community project. We are trying to move it more in the direction of say, Blacklight, Hydra….etc. A culture change here but we see this as a benchmark of success for this re-architecting project… We are also in the process of hiring a partnerships manager and that person will focus more on creating documentation, doing developer outreach etc.

David: There is a (fragile) demo that has a lot of this… The goal is to continue that through the LAAWS project, as a way to try this out… You can (cautiously) engage with that now, but it will be published to GitHub at some point.

Jefferson Bailey & Naomi Dushay: WASAPI data transfer APIs: specification, project update, and demonstration

Jefferson: I’ll give some background on the APIs. This is an IMLS funded project in the US looking at Systems Interoperability and Collaborative Development for Web Archives. Our goals are to:

  • build WARC and derivative dataset APIs (AIT and LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
  • Seed and launch community modelled on characteristics of successful development and participation from communities ID’d by project
  • Sketch a blueprint and technical model for future web archiving APIs informed by project R&D
  • Technical architecture to support this.

So, we’ve already run WARC and Digital Preservation surveys. 15-20% of Archive-It users download and locally store their WARCs – for various reasons – that is small and hasn’t really moved, and that’s why data transfer was a core area. We are doing online webinars and demos. We ran a national symposium on API-based interoperability and digital preservation, and we have white papers to come from this.

Development wise we have created a general specification, a LOCKSS implementation, Archive-it implementation, Archive-it API documentation, testing and utility (in progress). All of this is on GitHub.

The WASAPI Archive-It Transfer API is written in Python, meets all gen-spec criteria, with swagger YAML in the repos. Authorisation uses the AIT Django framework (same as the web app), not defined in the general specification. We are using browser cookies or HTTP basic auth. We have a basic endpoint (in production) which returns all WARCs for that account; base/all results are paginated. In terms of query parameters you can use: filename; filetype; collection (ID); crawl (ID for AIT crawl job); etc.

So what do you get back? A JSON object with: pagination, count, request-url, includes-extra. You have fields including account (Archive-It ID); checksums; collection (Archive-It ID); crawl; crawl time; crawl start; filename; filetype; locations; size. And you can request these through simple HTTP queries.
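A rough sketch of consuming such an endpoint, walking the paginated response. The base path and JSON key names here are my assumptions from the description; the HTTP fetch is stubbed so the sketch runs offline – in practice it would be an authenticated GET returning JSON.

```python
from urllib.parse import urlencode

# Assumed endpoint shape, for illustration only.
BASE = "https://partner.archive-it.org/wasapi/v1/webdata"

def build_url(**params):
    return BASE + "?" + urlencode(params)

def list_warcs(fetch, **params):
    """Follow the paginated JSON response, yielding WARC file records."""
    url = build_url(**params)
    while url:
        page = fetch(url)
        for f in page["files"]:
            yield f
        url = page.get("next")  # pagination link; None on the last page

# Stub standing in for the real authenticated HTTP call:
pages = {
    build_url(collection=123): {
        "files": [{"filename": "a.warc.gz", "size": 10}],
        "next": BASE + "?page=2",
    },
    BASE + "?page=2": {
        "files": [{"filename": "b.warc.gz", "size": 20}],
        "next": None,
    },
}

names = [f["filename"] for f in list_warcs(pages.get, collection=123)]
print(names)
```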

You can also submit jobs for generating derivative datasets. We use existing query language.

In terms of what is to come, this includes:

  1. Minor AIT API features
  2. Recipes and utilities (testers welcome)
  3. Community building research and report
  4. A few papers on WA APIs
  5. Ongoing surveys and research
  6. Other APIs in WASAPI (past and future)

So we need some way to bring together these APIs regularly. And also an idea of what other APIs we need to support, and how to prioritise that.

Naomi: I’m talking about the Stanford take on this… These are the steps Nicholas, as project owner, does to download WARC files from Archive-It at the moment… It is a 13 step process… And this grant funded work focuses on simplifying the first six steps and making the process more manageable and efficient. As a team we are really focused on not being dependent on bespoke softwares; things must be maintainable, with continuous integration set up, excellent test coverage, and automate-able. There is a team behind this work, and this was their first touch of any of this code – three neophytes working on this with much to learn.

We are lucky to be just down the corridor from LOCKSS. Our preferred language is Ruby but Java would work best for LOCKSS. So we leveraged LOCKSS engineering here.

The code is at:

You only need Java to run the code. And all arguments are documented in Github. You can also view a video demo:


These videos are how we share our progress at the end of each Agile sprint.

In terms of work remaining we have various tweaks, pull requests, etc. to ensure it is production ready. One of the challenges so far has been about thinking about crawls and patches, and the context of the WARC.


Q1) At Stanford are you working with the other WASAPI APIs, or just the downloads one.

A1) I hope the approach we are taking is a welcome one. But we have a lot of projects taking place, but we are limited by available software engineering cycles for archives work.

Note that we do need a new readme on GitHub

Q2) Jefferson, you mentioned plans to expand the API, when will that be?

A2 – Jefferson) I think that it is pretty much done and stable for most of the rest of the year… WARCs do not have crawl IDs or start dates – hence adding crawl time.

Naomi: It was super useful that the team that built the downloader was separate from the team building the WASAPI, as that surfaced a lot of the assumptions, issues, etc.

David: We have a CLOCKSS implementation pretty much building on the Swagger. I need to fix our ID… But the goal is that you will be able to extract stuff from a LOCKSS box using WASAPI using URL or Solr text search. But timing wise, don’t hold your breath.

Jefferson: We’d also like others feedback and engagement with the generic specification – comments welcome on GitHub for instance.

Web archives platforms & infrastructure (Chair: Andrew Jackson)

Jack Cushman & Ilya Kreymer: Thinking like a hacker: security issues in web capture and playback

Jack: We want to talk about securing web archives, and how web archives can get themselves into trouble with security… We want to share what we’ve learnt, and what we are struggling with… So why should we care about security as web archives?

Ilya: Well web archives are not just a collection of old pages… No, high fidelity web archives run untrusted software. And there is an assumption that a live site is “safe” so there is nothing to worry about… but that isn’t right either…

Jack: So, what could a page do that could damage an archive? Not just a virus or a hack… but more than that…

Ilya: Archiving local content… Well a capture system could have privileged access – on local ports or network server or local files. It is a real threat. And could capture private resources into a public archive. So. Mitigation: network filtering and sandboxing, don’t allow capture of local IP addresses…

Jack: Threat: hacking the headless browser. Modern captures may use PhantomJS or other browsers on the server, most browsers have known exploits. Mitigation: sandbox your VM

Ilya: Stealing user secrets during capture… Normal web flow… But you have other things open in the browser. Partial mitigation: rewriting – rewrite cookies to exact path only; rewrite JS to intercept cookie access. Mitigation: separate recording sessions – for webrecorder use separate recording sessions when recording credentialed content. Mitigation: Remote browser.

Jack: So assume we are running… Threat: cross site scripting to steal archive login

Ilya: Well you can use a subdomain…

Jack: Cookies are separate?

Ilya: Not really.. In IE10 the archive within the archive might steal login cookie. In all browsers a site can wipe and replace cookies.

Mitigation: run the web archive on a separate domain from everything else. Use iFrames to isolate web archive content. Load the web archive app from the app domain, load iFrame content from the content domain. As Webrecorder and Perma both do.

Jack: Now, in our content frame… how bad could it be if that content leaks? What if we have live web leakage on playback? This can happen all the time… It’s hard to stop that entirely… JavaScript can send messages back and fetch new content… to mislead, track users, rewrite history. Bonus: for private archives – any of your captures could export any of your other captures.

The best mitigation is a Content-Security-Policy header, which can limit access to the web archive domain.
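A minimal sketch of that mitigation: serve replayed content with a CSP that only allows the archive’s own content domain, so leaked scripts cannot fetch from the live web. The domain names and exact policy here are illustrative assumptions, not any particular archive’s configuration.

```python
# Example domains, following the app-domain / content-domain split above.
CONTENT_DOMAIN = "https://content.example-archive.org"
APP_DOMAIN = "https://app.example-archive.org"

def replay_headers():
    """Response headers for a replayed page, locking it to the archive."""
    policy = "; ".join([
        f"default-src {CONTENT_DOMAIN}",     # block fetches from the live web
        f"frame-ancestors {APP_DOMAIN}",     # only our banner frame may embed it
    ])
    return {
        "Content-Security-Policy": policy,
        "Content-Type": "text/html; charset=utf-8",
    }

hdrs = replay_headers()
print(hdrs["Content-Security-Policy"])
```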

Ilya: Threat: Showing different page contents when archived… Pages can tell they’re in an archive and act differently. Mitigation: Run the archive in a containerised/proxy mode browser.

Ilya: Threat: Banner spoofing… This is a dangerous but quite easy to execute threat. Pages can dynamically edit the archives banner…

Jack: Suppose I copy the code of a page that was captured and add fake evidence, change the metadata of the date collected, and/or the URL bar…

Ilya: You can’t do that in Perma because we use frames. But if you don’t separate banner and content, this is a fairly easy exploit to do… So, mitigation: use iFrames for replay; don’t inject the banner into the replay frame… It’s a fidelity/security trade-off…

Jack: That’s our top 7 tips… But what next? What we introduce today is a tool: a version of Webrecorder with every security problem possible turned on… You can run it locally on your machine to try all the exploits and think about mitigations and what to do about them!

And you can find some exploits to try, some challenges… Of course if you actually find a flaw in any real system please do be respectful


Q1) How much is the bug bounty?! [laughs] What do we do about the use of very old browsers…

A1 – Jack) If you use an old browser you may be compromised already… But we use the most robust solution possible… In many cases there are secure options that work with older browsers too…

Q2) Any trends in exploits?

A2 – Jack) I recommend the book The Tangled Web… And there is an aspect that when you run a web browser there will always be some sort of issue…

A2 – Ilya) We have to get around security policies to archive the web… It wasn’t designed for archiving… But that raises its own issues.

Q3) Suggestions for browser makers to make these safer?

A3) Yes, but… How do you do this with current protocols and APIs

Q4) Does running old browsers and escaping from containers keep you awake at night…

A4 – Ilya) Yes!

A4 – Jack) If anyone is good at container escapes please do write that challenge as we’d like to have it in there…

Q5) There’s a great article called “Familiarity breeds contempt” which notes that old browsers and software get more vulnerable over time… It is particularly a big risk where you need old software to archive things…

A5 – Jack) Thanks David!

Q6) Can you say more about the headers being used…

A6) The idea is we write the CSP header to only serve from the archive server… And they can be quite complex… You may want to add something of your own…

Q7) May depend on what you see as a security issue… for me it may be about the authenticity of the archive… By building something in the website that shows different content in the archive…

A7 – Jack) We definitely think that changing the archive is a security threat…

Q8) How can you check the archives and look for arbitrary hacks?

A8 – Ilya) It’s pretty hard to do…

A8 – Jack) But it would be a really great research question…

Mat Kelly & David Dias: A collaborative, secure, and private InterPlanetary WayBack web archiving system using IPFS

David: Welcome to the session on going InterPlanetary… We are going to talk about peer to peer and other technology to make web archiving better…

We’ll talk about the InterPlanetary File System (IPFS) and InterPlanetary WayBack (IPWB)…

IPFS is also known as the distributed web, moving from location-based to content-based addressing… As we are aware, the web has some problems… You have experience of using a service, accessing email, using a document… There is some break in connectivity… And suddenly all those essential services are gone… Why? Why do we need to have the services working in such a vulnerable way? Even with a simple page, you lose a connection and you get a 404. Why?

There is a real problem with permanence… We have this URI, the URL, telling us the protocol, location and content path… But when we come back later – weeks or months – and that content has moved elsewhere… Either somewhere else you can find, or somewhere you can’t. Sometimes it’s like the content has been destroyed… But every time people see a webpage, you download it to your machine… These issues come from location addressing…

In content addressing we tie content to a unique hash that identifies the item… So a Content Identifier (CID) allows us to do this… And then, in a network, when I look for that data… If there is a disruption to the network, we can ask any machine where the content is… And the node near you can show you what is available before you ever go to the network.
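A toy illustration of the content-addressing idea: derive the identifier from the bytes themselves, so any node holding the same bytes can serve them and the receiver can verify what it got. Real IPFS CIDs use multihash and base32/base58 encodings; this plain SHA-256 hex digest is just a stand-in.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Simplified stand-in for a CID: hash of the content itself."""
    return hashlib.sha256(data).hexdigest()

store = {}  # any node's local blob store, keyed by content ID

def put(data: bytes) -> str:
    cid = content_id(data)
    store[cid] = data
    return cid

def get(cid: str) -> bytes:
    data = store[cid]
    # Self-verifying: the bytes must hash back to the ID we asked for.
    assert content_id(data) == cid
    return data

cid = put(b"<html>an archived page</html>")
print(cid[:12], "->", len(get(cid)), "bytes")
```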

IPFS is already used in video streaming (inc. Netflix), legal documents, 3D models – with HoloLens for instance – for games, for scientific data and papers, blogs and webpages, and totally distributed web apps.

IPFS allows this to be distributed and offline, saves space, optimises bandwidth usage, etc.

Mat: So I am going to talk about IPWB. The motivation here is that the persistence of archived web data is dependent on the resilience of the organisation and the availability of data. The design extends the CDXJ format, with an indexing and IPFS dissemination procedure, and a replay and IPFS pull procedure. So an adapted CDXJ adds to the metadata structure a header with the hash for the content.
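To make that concrete, a sketch of such an extended CDXJ line: the usual SURT key, timestamp, and JSON block, plus a locator pointing at the IPFS hashes for the record’s header and payload. The field names and locator format here are my illustrative assumptions – see the ipwb source for the exact schema.

```python
import json

def cdxj_line(surt, timestamp, mime, status, header_cid, payload_cid):
    """Build a CDXJ index line with IPFS locators (illustrative schema)."""
    meta = {
        "mime_type": mime,
        "status_code": status,
        # Where replay can pull the record from: header digest, payload digest.
        "locator": f"urn:ipfs/{header_cid}/{payload_cid}",
    }
    return f"{surt} {timestamp} {json.dumps(meta)}"

line = cdxj_line(
    "com,example)/", "20170615000000", "text/html", "200",
    "QmHeaderHashExample", "QmPayloadHashExample",
)
print(line)
```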

Dave: One of the ways IPFS is pushing the boundary is in the browser tab, in a browser extension, and in a service worker acting as a proxy for requests the browser makes, with no changes to the interface (that one is definitely in alpha!)…

So the IPWB can expose the content to the IPFS and then connect and do everything in the browser without needing to download and execute code on their machine. Building it into the browser makes it easy to use…

Mat: And IPWB enables privacy, collaboration and security, building the encryption method and key into the WARC. Similarly CDXJs may be transferred for our users’ replay… Ideally you won’t need a CDXJ on your own machine at all…

We are also rerouting, rather than rewriting, for archival replay… We’ll be presenting on that later this summer…

And I think we just have time for a short demo…

For more see:


Q1) Mat, I think that you should tell that story of what you do…

A1) So, I looked for files on another machine…

A1 – Dave) When Mat has the archive file on a remote machine… Someone looks for this hash on the network, send my way as I have it… So when Mat looked, it replied… so the content was discovered… request issued, received content… and presented… And that also lets you capture pages appearing differently in different places and easily access them…

Q2) With the hash addressing, are there security concerns…

A2 – Dave) We use Multihash, currently using SHA… But you can use different hash functions; they just verify the link… In IPFS we prevent issues with self-describing hash functions…

Q3) The problem is that the hash function does end up in the URL… and it will decay over time because the hash function will decay… It’s a really hard problem to solve – making a choice now that may be wrong… But there is no way of knowing the right choice.

A3) At least we can use the hash function to indicate whether it looks likely to be the right or wrong link…

Q4) Is hash functioning itself useful with or without IPFS… Or is content addressing itself inherently useful?

A4 – Dave) I think the IPLD is useful anyway… So with legal documents where links have to stay intact, and not be part of the open web, then IPFS can work to restrict that access but still make this more useful…

Q5) If we had a content addressable web, almost all these web archiving issues would be resolved really… It is hard to know if content is in Archive 1 or Archive 2. A content addressable web would make it easier to archive… Important to keep in mind…

A5 – Dave) I 100% agree! A content addressed web lets you understand what is important to capture. And IPFS saves a lot of bandwidth and a lot of storage…

Q6) What is the longevity of the hashs and how do I check that?

A6 – Dave) OK, you can check the integrity of the hash. And we have a blockchain-based storage network and cryptocurrency that does handle this information… Using an address in a public blockchain… That’s our solution for some of those specific problems.

Andrew Jackson (AJ), Jefferson Bailey (JB), Kristinn Sigurðsson (KS) & Nicholas Taylor (NT): IIPC Tools: autumn technical workshop planning discussion

AJ: I’ve been really impressed with what I’ve seen today. There is a lot of enthusiasm for open source and collaborative approaches and that has been clear today and the IIPC wants to encourage and support that.

Now, in September 2016 we had a hackathon but there were some who just wanted to get something concrete done… And we might therefore adjust the format… Perhaps pre-define a task well ahead of time… but also run a parallel track at the next hackathon for the more experimental side. Is that a good idea? What else might work?

JB: We looked at Archives Unleashed, and we did a White House Social Media Hackathon earlier this year… This is a technical track but… it’s interesting to think about what kind of developer skills/what mix will work best… We have lots of web archiving engineers… They don’t use the software that comes out of it… We find it useful to have archivists in the room…

Then, from another angle: at the hackathons… IIPC doesn’t have a lot of money and travel is expensive… The impact of that gets debated – it’s a big budget line for 8-10 institutions out of 53 members. The outcomes are obviously useful but… expecting people to be totally funded for days on end across the world isn’t feasible… So maybe more little events, or fewer bigger events, can work…

Comment 1) Why aren’t these sessions recorded?

JB: Too much money. We have recorded some of them… Sometimes it happens, sometimes it doesn’t…

AJ: We don’t have in-house skills, so it’s third party… And that’s the issue…

JB: It’s a quality thing…

KS: But also, when we’ve done it before, it’s not heavily watched… And the value can feel questionable…

Comment 1) I have a camera at home!

JB: People can film whatever they want… But that’s on people to do… IIPC isn’t an enforcement agency… But we should make it clear that people can film them…

KS: For me… You guys are doing incredible things… And it’s things I can’t do at home. The other aspect is that… There are advancements that never quite happened… But I think there is value in the unconference side…

AJ: One of the things with unconference sessions is that…

NT: I didn’t go to the London hackathon… Now we have a technical team, it’s more appealing… The conference in general is good for surfacing issues we have in common… such as extraction of metadata… But there is also the question of when we sit down to deal with some specific task… That could be useful for taking things forward…

AJ: I like the idea of a counter conference, focused on the tools… I was a bit concerned that if there were really specific things… What does it need to be to be worth your organisation flying you to it… Too narrow and it’s exclusionary… Too broad and maybe it’s not helpful enough…

Comment 2) Worth seeing the model used by Python – they have a sprint after their conference. That isn’t an unconference but lets you come together. Mozilla Fest Sprint picks a topic and then next time you work on it… Sometimes looking at other organisations with less money are worth looking at… And for things like crowd sourcing coverage etc… There must be models…

AJ: This is cool.. You will have to push on this…

Comment 3) I think that tacking on to a conference helps…

KS: But challenging to be away from office more than 3/4 days…

Comment 4) Maybe look at NodeJS Community and how they organise… They have a website, with three workshops… People organise events pretty much monthly… And create material in local communities… Less travel but builds momentum… And you can see that that has impact through local NodeJS events now…

AJ: That would be possible to support as well… with IIPC or organisational support… Bootstrapping approaches…

Comment 5) Other than hackathon there are other ways to engage developers in the community… So you can engage with Google Summer of Code for instance – as mentors… That is where students look for projects to work on…

JB: We have two GSoC and like 8 working without funding at the moment… But it’s non trivial to manage that…

AJ: Onboarding new developers in any way would be useful…

Nick: Onboarding into the weird and wacky world of web archiving… If IIPC can curate a lot of onboarding stuff, that would be really good for potential… for getting started… Not relying on a small number of people…

AJ: We have to be careful as IIPC tools page is very popular, but hard to keep up to date… Benefits can be minor versus time…

Nick: Do you have GitHub? Just put up an awesome list!

AJ: That’s a good idea…

JB: Microfunding projects – sub $10k – is also an option for cost-recovered bought-out time for some of these sorts of tasks… That would be really interesting…

Comment 6) To expand on what Jefferson and Nick were saying… I’m really new… Went to IIPC in April. I am enjoying this and learning a lot… I’ve been talking to a lot of you… That would really help more people get the technical environment right… Organisations want to get into archiving on a small scale…

Olga: We do have a list on GitHub… but it’s not up to date or well used…

AJ: We do have this document, we have GitHub… But we could refer to each other… and point to the getting started stuff (only), rather than maintaining lists…

Comment 7) Google has a page – could take inspiration from that… Licensing, communities, etc… Very simple plain English getting started guide/documentation…

Comment 8) I’m very new to the community… And I was wondering to what extent you use Slack and Twitter between events to maintain these conversations and connections?

AJ: We have a Slack channel, but we haven’t publicised it particularly but it’s there… And Twitter you should tweet @NetPreserve and they will retweet then this community will see that…

Jun 14 2017

From today until Friday I will be at the International Internet Preservation Coalition (IIPC) Web Archiving Conference 2017, which is being held jointly with the second RESAW: Research Infrastructure for the Study of Archived Web Materials Conference. I’ll be attending the main strand at the School of Advanced Study, University of London, today and Friday, and at the technical strand (at the British Library) on Thursday. I’m here wearing my “Reference Rot in Theses: A HiberActive Pilot” – aka “HiberActive” – hat. HiberActive is looking at how we can better enable PhD candidates to archive web materials they are using in their research, and citing in their thesis. I’m managing the project and working with developers, library and information services stakeholders, and a fab team of five postgraduate interns who are, whilst I’m here, out and about around the University of Edinburgh talking to PhD students to find out how they collect, manage and cite their web references, and what issues they may be having with “reference rot” – content that changes, decays, disappears, etc. We will have a webpage for the project and some further information to share soon but if you are interested in finding out more, leave me a comment below or email me: These notes are being taken live so, as usual for my liveblogs, I welcome corrections, additions, comment etc. (and, as usual, you’ll see the structure of the day appearing below with notes added at each session). 

Opening remarks: Jane Winters

This event follows the first RESAW event which took place in Aarhus last year. This year we again highlight the huge range of work being undertaken with web archives. 

This year a few things are different… Firstly we are holding this with the IIPC, which means we can run the event over 3 days, and means we can bring together librarians, archivists, and data scientists. The BL have been involved and we are very grateful for their input. We are also excited to have a public event this evening, highlighting the increasingly public nature of web archiving. 

Opening remarks: Nicholas Taylor

On behalf of the IIPC Programme Committee I am hugely grateful to colleagues here at the School of Advanced Studies and at the British Library for being flexible and accommodating us. I would also like to thank colleagues in Portugal, and hope a future meeting will take place there as had been originally planned for IIPC.

For us we have seen the Web Archiving Conference as an increasingly public way to explore web archiving practice. The programme committee saw a great increase in submissions, requiring a larger than usual commitment from the programming committee. We are lucky to have this opportunity to connect as an international community of practice, to build connections to new members of the community, and to celebrate what you do. 

Opening plenary: Leah Lievrouw – Web history and the landscape of communication/media research Chair: Nicholas Taylor

I intend to go through some context in media studies. I know this is a mixed audience… I am from the Department of Information Studies at UCLA and we have a very polyglot organisation – we can never assume that we all understand each others backgrounds and contexts. 

A lot about the web, and web archiving, is changing, so I am hoping that we will get some Q&A going about how we address some gaps in possible approaches. 

I’ll begin by saying that it has been some time now that computing – computers as communication devices – has been seen as a medium. This seems commonplace now, but when I was in college this was seen as fringe, in communication research, in the US at least. But for years documentarists, engineers, programmers and designers have seen information resources, data and computing as tools and sites for imagining, building, and defending “new” societies; enacting emancipatory cultures and politics… A sort of Alexandrian vision of “all the knowledge in the world”. This is still part of the idea that we have in web archiving. Back in the day the idea of fostering this kind of knowledge would bring about internationality, world peace, modernity. When you look at old images you see artefacts – it is more than information, it is the materiality of artefacts. I am a contributor to Nils’ web archiving handbook, and he talks about history written of the web, and history written with the web. So there are attempts to write history with the web, but what about the tools themselves? 

So, this idea about connections between bits of knowledge… This goes back before browsers. Many of you will be familiar with H.G. Wells’ World Brain; Suzanne Briet’s Qu’est-ce que la documentation? (1951) is a very influential work in this space; Jennifer Light wrote a wonderful book on Cold War intellectuals, and their relationship to networked information… One of my lecturers was one of these in fact, thinking about networked cities… Vannevar Bush’s “As we may think” (1945) saw information as essential to order and society. 

Another piece I often teach, J.C.R. Licklider and Robert W. Taylor (1968) in “The computer as a communication device”, talked about computers communicating but not in the same ways that humans make meaning. In fact this graphic shows a man’s computer talking to an insurance salesman saying “he’s out” and the caption “your computer will know what is important to you and buffer you from the outside world”.

We then have this counterculture movement in California in the 1960s and 1970s… And that feeds into the emerging tech culture. We have The Well coming out of this. Stewart Brand wrote The Whole Earth Catalog (1968-78). And actually in 2012 someone wrote a new Whole Earth Catalog… 

Ted Nelson, author of Computer Lib/Dream Machines (1974), is known as the person who came up with the concept of the link, between computers, to information… He’s an inventor essentially. Computer Lib/Dream Machines was a self-published title, a manifesto… The subtitle for Computer Lib was “you can and must understand computers NOW”. Counterculture was another element, and this is way before the web, where people were talking about networked information… But these people were not thinking about preservation and archiving, but there was an assumption that information would be kept… 

And then as we see information utilities and wired cities emerging, mainly around cable TV but also local public access TV… There was a lot of capacity for information communication… In the UK you had teletext, in Canada there was Telidon… And you were able to start thinking about information distribution wider and more diverse than central broadcasters… With services like LexisNexis emerging we had these ideas of information utilities… There was a lot of interest in the 1980s, and back in the 1970s too. 

Harold Sackman and Norman Nie (eds.) The Information Utility and Social Choice (1970); H.G. Bradley, H.S. Dordick and B. Nanus, the Emerging Network Marketplace (1980); R.S. Block “A global information utility”, the Futurist (1984); W.H. Dutton, J.G. Blumer and K.L. Kraemer “Wired cities: shaping the future of communications” (1987).

This new medium looked more like point-to-point communication, like the telephone. But no-one was studying that. There were communications scholars looking at face to face communication, and at media, but not at this on the whole. 

Now, that’s some background, I want to periodise a bit here… And I realise that is a risk of course… 

So, we have the Pre-browser internet (early 1980s-1990s). Here the emphasis was on access – to information, expertise and content at centre of early versions of “information utilities”, “wired cities” etc. This was about everyone having access – coming from that counter culture place. More people needed more access, more bandwidth, more information. There were a lot of digital materials already out there… But they were fiddly to get at. 

Now, when the internet became privatised – moved away from military and universities – the old model of markets and selling information to mass markets, the transmission model, reemerged. But there was also this idea that because the internet was point-to-point – and any point could get to any other point… And that everyone would eventually be on the internet… The vision was of the internet as “inherently democratic”. Now we recognise the complexity of that right now, but that was the vision then. 

Post-browser internet (early 1990s to mid-2000s) – was about web 1.0. Browsers and WWW were designed to search and retrieve documents, discrete kinds of files, to access online documents. I’ve said “Web 1.0” but had a good conversation with a colleague yesterday who isn’t convinced about these kinds of labels, but I find them useful shorthand for thinking about the web at particular points in time/use. In this era we had email still but other types of authoring tools arose… Encouraging a wave of “user generated content” – wikis, blogs, tagging, media production and publishing, social networking sites. This sounds such a dated term now but it did change who could produce and create media, and that was the term in use around this time. 

Then we began to see Web 2.0 with the rise of “smart phones” in the mid-2000s, merging mobile telephony and specialised web-based mobile applications, accelerating user content production and social media profiling. And the rise of social networking sounded a little weird to those of us with sociology training who were used to these terms from the real world, from social network analysis. But Facebook is a social network. Many of the tools, blogging for example, can be seen as having a kind of mass media quality – so instead of a movie studio making content… But I can have my blog which may have an audience of millions or maybe just, like, 12 people. But that is highly personal. Indeed one of the earliest so-called “killer apps” for the internet was email. Instead of shipping data around for processing – as the architecture originally got set up for – you could send a short note to your friend elsewhere… Email hasn’t changed much. That point-to-point communication suddenly and unexpectedly became more than half of the ARPANET. Many people were surprised by that. That pattern of interpersonal communication over networks continued to repeat itself – we see it with Facebook, Twitter, and even with blogs etc. that have feedback/comments etc. 

Web 2.0 is often talked about as social driven. But what is important from a sociology perspective, is the participation, and the participation of user generated communities. And actually that continues to be a challenge, it continues to be not the thing the architecture was for… 

In the last decade we’ve seen algorithmic media emerging, and the rise of “web 3.0”. Both access and participation are appropriated as commodities to be monitored, captured, analyzed, monetised and sold back to individuals, reconceived as data subjects. Everything is thought about as data, data that can be stored, accessed… Access itself, the action people take to stay in touch with each other… We all carry around monitoring devices every day… At UCLA we are looking at the concept of the “data subject”. Bruce ? used to talk about the “data footprint” or the “data cloud”. We are at a moment where we are increasingly aware of being data subjects. London is one of the most remarkable cities in the world in terms of surveillance… The UK in general, but London in particular… And that is ok culturally; I’m not sure it would be in the United States. 

We did some work in UCLA to get students to mark up how many surveillance cameras there were, who controlled them, who had set them up, how many there were… Neither Campus police nor university knew. That was eye opening. Our students were horrified at this – but that’s an American cultural reaction. 

But if we conceive of our own connections to each other, to government, etc. as “data” we begin to think of ourselves, and everything, as “things”. Right now we have systems and governance maximising the market; institutional and government surveillance; unrestricted access to user data; moves towards real-time flows rather than “stocks” of documents or content. Surveillance isn’t just about government – supermarkets are some of our most surveilled spaces. 

I currently have students working on a “Named Data Networking” project. The idea is that data will be enclosed, that data is time-based, to replace IP, the Internet Protocol. So that rather than packets, data is flowing all the time. So that it would be like opening the nearest tap to get water. One of the interests here is from the movie and television industry, particularly web streaming services who occupy significant percentages of bandwidth now… 

There are a lot of ways to talk about this, to conceive of this… 

1.0 tends to be about documents, press, publishing, texts, search, retrieval, circulation, access, reception, production-consumption: content. 

2.0 is about conversations, relationships, peers, interaction, communities, play – as a cooperative and flow experience, mobility, social media (though I rebel against that somewhat): social networks. 

3.0 is about algorithms, “clouds” (as fluffy benevolent things, rather than real and problematic, with physical spaces, server farms), “internet of things”, aggregation, sensing, visualisation, visibility, personalisation, self as data subject, ecosystems, surveillance, interoperability, flows: big data, algorithmic media. Surveillance is kind of the environment we live in. 

Now I want to talk a little about traditions in communication studies.. 

In communication, broadly and historically speaking, there has been one school of thought that is broadly social scientific, from sociology and communications research, that thinks about how technologies are “used” for expression, interaction, as data sources or analytic tools. Looking at media in terms of their effects on what people know or do, can look at media as data sources, but usually it is about their use. 

There are theories of interaction, group process and influence; communities and networks; semantic, topical and content studies; law, policy and regulation of systems/political economy. One key question we might ask here: “what difference does the web make as a medium/milieu for communicative action, relations, interaction, organising, institutional formation and change?” Those from a science and technology background might know about the issues of shaping – we shape technology and technology shapes us. 

Then there is the more cultural/critical/humanist or media studies approach. When I come to the UK, people who do media studies still think of humanist studies as being different, “what people do”. However this cultural/critical approach is about analyses of digital technologies and the web; design, affordances, contexts, consequences – through a philosophical, historical, critical lens. How power is distributed is important in this tradition. 

In terms of theoretical schools, we have the Toronto School/media ecology – the Marshall McLuhan take – which is very much about the media itself; American cultural studies, and the work of James Carey and his students; the Birmingham School – the British take on media studies; and new materialism – that you see in Digital Humanities, German Media Studies, which says we have gone too far from the roles of the materials themselves. So, we might ask “What is the web itself (social and technical constituents) as both medium and product of culture, under what conditions, times and places?”

So, what are the implications for Web Archiving? Well I hope we can discuss this, thinking about a table of:

Web Phase | Soc sci/admin | Crit/Cultural

  • Documents: content + access
  • Conversation: Social nets + participation
  • Data/Algorithms: algorithmic media + data subjects

Comment: I was wondering about ArXiv and the move to sharing multiple versions, pre-prints, post prints…

Leah: That issue of changes in publication, what preprints mean for who is paid for what, that’s certainly changing things and an interesting question here…

Comment: If we think of the web moving from documents, towards fluid state, social networks… It becomes interesting… Where are the boundaries of web archiving? What is a web archiving object? Or is it not an object but an assemblage? Also ethics of this…

Leah: It is an interesting move from the concrete, the material… And then this whole cultural heritage question, what does it instantiate, what evidence is it, whose evidence is it? And do we participate in hardening those boundaries… Or do we keep them open… How porous are our boundaries…

Comment: What about the role of metadata?

Leah: Sure, arguably the metadata is the most important thing… What we say about it, what we define it as… And that issue of fluidity… We think of metadata as having some sort of fixity… One thing that has begun to emerge in surveillance contexts… Where law enforcement says “we aren’t looking at your content, just the metadata”, well it turns out that is highly personally identifiable, it’s the added value… What happens when that secondary data becomes the most important thing… In fact where many of our data systems do not communicate with each other, those connections are through the metadata (only).

Comment: In terms of web archiving… As you go from documents, to conversations, to algorithms… Archiving becomes so much more complex. Particularly where interactions are involved… You can archive the data and the algorithm but you still can’t capture the interactions there…

Leah: Absolutely. As we move towards the algorithmic level its not a fixed thing. You can’t just capture the Google search algorithms, they change all the time. The more I look at this work through the lens of algorithms and data flows, there is no object in the classic sense…

Comment: Perhaps, like a movie, we need longer temporal snapshots…

Leah: Like the algorithmic equivalence of persistence of vision. Yes, I think that’s really interesting.

And with that the opening session is over, with organisers noted that those interested in surveillance may be interested to know that Room 101, said to have inspired the room of the same name in 1984, is where we are having coffee…

Session 1B (Chair: Marie Chouleur, National Library of France):

Jefferson Bailey (Deputy chair of IIPC, Director of Web Archiving, Internet Archive): Advancing access and interface for research use of web archives

I would like to thank all of the organisers again. I’ll be giving a broad rather than deep overview of what the Internet Archive is doing at the moment.

For those that don’t know, we are a non-profit digital library and archive founded in 1996. We work in a former church and it’s awesome – we do open public lunches every Friday, so you are welcome to visit if you are ever in San Francisco. We have lots of open source technology and we are very technology-driven.

People always ask about stats… We are at 30 Petabytes plus multiple copies right now, including 560 billion URLs, 280 billion webpages. We archive about 1 billion URLs per week, and have partners and facilities around the world, including here in the UK where we have Wellcome Trust support.

So, searching… This is the WayBackMachine. Most of our traffic – 75% – is automatically directed to the new service. So, if you search for, say, UK Parliament, you’ll see the screenshots, the URLs, and some statistics on what is there and captured. So, how does it work? With that much data it’s hard to do full text search! Even the raw text (not HTML) is 3-5 PB. So, we figured the most instructive and easiest text to work with is the anchor text of all in-bound links to a homepage. The index text covers 443 million homepages, drawn from 900B in-bound links from other cross-domain websites. Is that perfect? No, but it’s the best that works on this scale of data… And people tend to make keyword type searches, which this works for.
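As a rough sketch of the anchor-text idea described above – indexing a homepage by the link text of pages that point to it, rather than by its own full text – consider this toy inverted index (the sample links are invented for illustration, not Internet Archive data):

```python
from collections import defaultdict

# Hypothetical in-bound links: (source_site, target_homepage, anchor_text)
inbound_links = [
    ("news.example.org", "parliament.uk", "UK Parliament"),
    ("blog.example.com", "parliament.uk", "the British Parliament"),
    ("blog.example.com", "bl.uk", "British Library"),
]

# Build an inverted index from anchor-text tokens to target homepages.
index = defaultdict(set)
for source, target, anchor in inbound_links:
    for token in anchor.lower().split():
        index[token].add(target)

def search(query):
    """Return homepages whose in-bound anchor text contains every query term."""
    results = [index.get(tok.lower(), set()) for tok in query.split()]
    return set.intersection(*results) if results else set()

print(search("uk parliament"))  # {'parliament.uk'}
```

At Internet Archive scale the same idea would of course be built with distributed indexing rather than an in-memory dict, but the principle – keyword matching against in-bound anchor text instead of page text – is the same.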

You can also now, in the new Way Back Machine, see a summary tab which includes a visualisation of data captured for that page, host, domain, MIME-type or MIME-type category. It’s really fun to play with. It’s really cool information to work with. That information is in the Way Back Machine (WBM) for 4.5 billion hosts; 256 million domains; 1238 TLDs. There are also special collections – building this for specific crawls/collections such as our .gov collection. And there is an API – so you can create your own visualisations if you like.

We have also created a full text search for AIT (Archive-It). This was part of a total rebuild of full text search in Elasticsearch. 6.5 billion documents with a 52 TB full text index. In total AIT is 23 billion documents and 1 PB. Searches are across all 8000+ collections. We have improved relevance ranking, metadata search, performance. And we have a Media Search coming – it’s still a test at present. So you can search non-textual content with a similar process.
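The Elasticsearch rebuild described here would be queried through a JSON query DSL; as a hedged illustration of how full text relevance, ranking and metadata search fit together, here is what such a request body could look like (the field names and collection id are made up, not Archive-It’s actual schema):

```python
import json

# Sketch of an Elasticsearch request body: a relevance-scored full text
# match combined with a non-scoring metadata filter.
query = {
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": "web archiving",
                    # "^2" boosts title matches for relevance ranking
                    "fields": ["title^2", "content"],
                }
            },
            # restrict results to a single (hypothetical) collection
            "filter": {"term": {"collection_id": 1234}},
        }
    },
    "size": 10,
}

# In practice this body would be POSTed to an index's /_search endpoint.
body = json.dumps(query)
```

The `bool` query with `must` plus `filter` is the usual Elasticsearch pattern for mixing scored full text search with exact metadata constraints.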

So, how can we help people find things better… search, full text search… And APIs. The APIs power the detail charts: capture counts, year, size, new, domain/hosts. Explore that more and see what you can do. We’ve also been looking at Data Transfer APIs to standardise transfer specifications for web data exchange between repositories for preservation. For research use you can submit “jobs” to create derivative datasets from WARCs from specific collections. And it allows programmatic access to AIT WARCs, submission of a job, job status, derivative results list. More at:

In other API news we have been working with WAT files – a sort of metadata file derived from a WARC. This includes headers and content (title, anchor/text, metas, links). We have API access to some capture content – a better way to get programmatic access to the content itself. So we have a test build on a 100 TB WARC set (EOT). It’s like the CDX API with a build – replays WATs not WARCs (see: You can analyse, for example, term counts across the data.
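A WAT record stores this derived metadata (headers, title, anchors, links) as a JSON envelope; a small sketch of pulling the title and anchor links out of one payload (the nesting follows the published WAT layout, but the sample record itself is invented):

```python
import json

# A made-up WAT payload for one capture, following the WAT envelope layout:
# Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata.
wat_record = json.loads("""
{
  "Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Head": {"Title": "Example Homepage"},
          "Links": [
            {"path": "A@/href", "url": "http://example.org/about", "text": "About us"},
            {"path": "IMG@/src", "url": "http://example.com/logo.png"}
          ]
        }
      }
    }
  }
}
""")

html_meta = (wat_record["Envelope"]["Payload-Metadata"]
             ["HTTP-Response-Metadata"]["HTML-Metadata"])

# Keep only anchor links ("A@/href") that actually carry anchor text.
anchors = [(link["url"], link["text"])
           for link in html_meta.get("Links", [])
           if link.get("path") == "A@/href" and "text" in link]

print(html_meta["Head"]["Title"])  # Example Homepage
print(anchors)  # [('http://example.org/about', 'About us')]
```

Because WATs carry only this metadata rather than full page bodies, term counts, link graphs and anchor-text analyses can be run over them far more cheaply than over the underlying WARCs.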

In terms of analysing language we have a new CDX code to help identify languages. You can visualise this data, see the language of the texts, etc. A lot of our content right now is in English – we need less focus on English in the archive.

We are always interested in working with researchers on building archives, not just using them. So we are working on the News Measures Research Project. We are looking at 663 local news sites representing 100 communities. 7 crawls for a composite week (July-September 2016).

We are also working with a Katrina Blogs project: after the research was done and the project was published, we created a special collection of the sites used so that it can be accessed and explored.

And in fact we are generally looking at ways to create useful sub-collections and ways to explore content. For instance Gif Cities is a way to search for gifs from Geocities. We have a Military Industrial Powerpoint Complex, turning PPT into PDFs and creating a special collection.

We did a new collection, with a dedicated portal ( which archives US Congress for NARA. We capture this every 2 years, and it also raised questions of indexing YouTube videos.

We are also looking at historical ccTLD Wayback Machines. Built on IA global crawls and added historic web data with keyword and mime/format search, embed linkback, domain stats and special features. This gives a German view – from the .de domain – of the archive.

And we continue to provide data and datasets for people. We love Archives Unleashed – which ran earlier this week. We did an Obama White House data hackathon recently. We have a webinar on APIs coming very soon.


Q1) What is anchor text?

A1) That’s when you create a link to a page – the text that is associated with that page.

Q2) If you are using anchor text in that keyword search… What happens when the anchor text is just a URL…

A2) We are tokenising all the URLs too. And yes, we are using a kind of PageRank type understanding of popular anchor text.

Q3) Is that TLD work.. Do you plan to offer that for all that ask for all top level domains?

A3) Yes! Because subsets are small enough that they allow search in a more manageable way… We basically build a new CDX for each of these…

Q4) What are issues you are facing with data protection challenges and archiving in the last few years… Concerns about storing data with privacy considerations.

A4) No problems for us. We operate as a library… The Way Back Machine is used in courts, but not by us – in US courts it’s recognised as a thing you can use in court.

Panel: Internet and Web Histories – Niels Brügger – Chair (NB); Marc Weber (MW); Steve Jones (SJ); Jane Winters (JW)

We are going to talk about the internet and the web, and also about the new journal, Internet Histories, which I am editing. The new journal addresses what my colleagues and I saw as a gap. On the one hand there are journals like New Media and Society and Internet Studies which are great, but rarely focus on history. And media history journals are excellent but rarely look at web history. We felt there was a gap there… And Taylor & Francis Routledge agreed with us… The inaugural issue is a double issue, 1-2, and the people on our panel today are authors in that first issue; we asked them to address six key questions from members of our international editorial board.

For this panel we will have an argument, counter-statement, and questions from the floor type format.

A Common Language – Marc Weber

This journal has been a long time coming… I am Curatorial Director, Internet History Program, Computer History Museum. We have been going for a while now. This Internet History program was probably the first one of its kind in a museum.

When I first said I was looking at the history of the web in the mid ’90s, people were puzzled… Now most people have moved to incurious acceptance. Until recently there was also tepid interest from researchers. But in the last few years it has reached critical mass – and this journal is a marker of that change.

We have this idea of a common language, the sharing of knowledge. For a long time my own perspective was mostly focused on the web; it was only when I started the Internet History program that I thought about the fuller sweep of cyberspace. We come in through one path or thread, and it can be (too) easy to only focus on that… The first major network, the ARPAnet, was there and has become the internet. Telenet was one of the most important commercial networks in the 1970s, but who here now remembers Anne Reid of Telenet? [no-one] And by contrast, what about Vint Cerf? [some] However, we need to understand what changed, what did not succeed in the long term, how things changed and shifted over time…

We are kind of in the Victorian era of the internet… We have 170 years of telephones, 60 years of going online… longer of imagining a connected world. Our internet history goes back to the 1840s and the telegraph. And a useful thought here: “The past isn’t over. It isn’t even past” – William Faulkner. Of this history only small portions are preserved properly. There are risks in not having a collective narrative… And in not understanding particular aspects in proper context. There is also scope for new types of approaches and work, not just applying traditional approaches to the web.

There is a risk of a digital dark age – we have a film to illustrate this at the museum, although I don’t think this crowd needs persuading of the importance of preserving the web.

So, going forward… We need to treat history and preservation as something to do quickly, we cannot go back and find materials later…

Response – Jane Winters

Marc makes, I think convincingly, the case for a common language, and for understanding the preceding and surrounding technologies, why they failed, and their commercial, political and social contexts. And I agree with the importance of capturing that history, with oral history a key means to do this. Secondly, the call to look beyond your own interest or discipline – interdisciplinary research is always challenging, but in the best sense, and can be hugely rewarding when done well.

Understanding the history of the internet and its context is important, although I think we see too many comparisons with early printing. Although some of those views are useful… I think there is real importance in getting to grips with these histories now, not in a decade or two. Key decisions will be made, from net neutrality to mass surveillance, and right now the understanding and analysis of the issues is not sophisticated – such as the incompatibility of “back doors” and secure internet use. And as researchers we risk focusing on the content, not the infrastructure. I think we need a new interdisciplinary research network, and we have all the right people gathered here…


Q1) Mark, as you are from a museum… Have you any thoughts about how you present the archived web, the interface between the visitor and the content you preserve.

A1) What we do now with the current exhibits… the star isn’t the objects, it is the screen. We do archive some websites – but don’t try to replicate the internet archive but we do work with them on some projects, including the GeoCities exhibition. When you get to things that require emulation or live data, we want live and interactive versions that can be accessed online.

Q2) I’m a linguist and was intrigued by the interdisciplinary collaboration suggested… How do you see linguists and the language of the web fitting in…

A2) Actually there is a postdoc – Naomi – looking at how different language communities in the UK have engaged through looking at the UK Web Archive, seeing how language has shaped their experience and change in moving to a new country. We are definitely thinking about this and it’s a really interesting opportunity.

Out from the PLATO Cave: Uncovering the pre-Internet history of social computing – Steve Jones, University of Illinois at Chicago

I think you will have gathered that there is no one history of the internet. PLATO was a space for education and, for my interest, it also became a social space and a platform for online gaming. These uses were spontaneous rather than centrally led. PLATO was an acronym for Programmed Logic for Automatic Teaching Operations (see the diagram in Ted Nelson’s Dream Machine publication).
There were two key interests in developing for PLATO – one was multi-player games, and the other was communication. And the latter was due to laziness… Originally the PLATO lab was in a large room, and we couldn’t be bothered to walk to each other’s desks. So “Talk” was created – and that saved standard messages so you didn’t have to say the same thing twice!

As time went on, I undertook undergraduate biology studies and engaged with the Internet and saw that interaction as similar… At that time data storage was so expensive that storing content in perpetuity seemed absurd… If it was kept, it’s because you hadn’t got to writing it yet. You would print out code – then rekey it – that was possible at the time given the number of lines per programme. So, in addition to the materials that were missing… There were boxes of ledger-size green-bar printouts from a particular PLATO Notes group of developers. Having found this in the archive I took pictures to OCR – that didn’t work! I got – brilliantly and terribly – funding to preserve that text. That content can now be viewed side by side in the archive – images next to re-keyed text.

Now, PLATO wasn’t designed for lay users; it was designed for professionals, although also used by university and high school students who had the time to play with it. So you saw changes between developer and community values, seeing the development of affordances in the context of the discourse of the developers – that archived set of discussions. The value of that work is to describe and engage with this history not just from our current-day perspective, but to understand the context, the people and their discourse at the time.

Response – Mark

PLATO is sort of the perfect example of a system that didn’t survive into the mainstream… Those communities knew each other; the idea of the flatscreen – which led to the laptop – came from PLATO. PLATO had a distinct messaging system, separate from the ARPAnet route. It’s a great corpus to see how this was used – were there flames? What does one-to-many communication look like? It is a wonderful example of the importance of preserving these different threads… And PLATO was one of the very first spaces not full of only technical people.

PLATO was designed for education, and that meant users were mainly students, and that shaped community and usage. There was a small experiment with community time-sharing memory stores – with terminals in public places… But PLATO began in the late ’60s and ran through into the ’80s; it is the poster child for preserving earlier systems. PLATO Notes became Lotus Notes – that isn’t around now, but in its own domain PLATO was the progenitor of much of what we do with education online now, and that history is also very important.


Q1) I’m so glad, Steve, that you are working on PLATO. I used to work in Medical Education in Texas and we had PLATO terminals to teach basic science to first and second year medical education students, and ER simulations. And my colleagues and I were taught computer instruction around PLATO. I am interested that you wanted to look at the discourse at UIC around PLATO – so, what did you find? I only experienced PLATO at the consumer end of the spectrum, so I wondered what the producer end was like…

A1) There are a few papers on this – search for it – but two basic things stand out… (1) the degree to which, as a mainframe system, PLATO was limited, and the conflict between the systems people and the gaming people. The gaming used a lot of the capacity, and although that taxed the system it did also mean they developed better code, showed what PLATO was capable of, and helped with the case for funding and support. So it wasn’t just “shut PLATO down”; it was a complex two-way thing; (2) the other thing was around the emergence of community. Almost anyone could sit at a terminal and use the system. There were occasional flare-ups and they mirrored community responses seen even later around flamewars, competition for attention, community norms… Hopefully others will mine that archive too and find some more things.

Digital Humanities – Jane Winters

I’m delighted to have an article in the journal, but I won’t be presenting on this. Instead I want to talk about digital humanities and web archives. There is a great deal of content in web archives but we still see little research engagement in web archives, there are numerous reasons including the continuing work on digitised traditional texts, and slow movement to develop new ways to research. But it is hard to engage with the history of the 21st century without engaging with the web.

The mismatch between the value of web archives and the use and research around them was part of what led us to set up a project here in 2014, to equip researchers to use web archives and encourage others to do the same. For many humanities researchers it will take a long time to move to born-digital resources. And to engage with material that subtly differs for different audiences. There are real challenges to using this data – web archives are big data. As humanities scholars we are focused on the small, the detailed, we can want to filter down… But there is room for a macro-historical view too – what Tim Hitchcock calls the “beautiful chaos” of the web.

Exploring the wider context one can see change on many levels – from the individual person or business, to wide spread social and political change. How the web changes the language used between users and consumers. You can also track networks, the development of ideas… It is challenging but also offers huge opportunities. Web archives can include newspapers, media, and direct conversation – through social media. There is also visual content, gifs… The increase in use of YouTube and Instagram. Much of this sits outside the scope of web archives, but a lot still does make it in. And these media and archiving challenges will only become more challenging as see more data… The larger and more uncontrolled the data, the harder the analysis. Keyword searches are challenging at scale. The selection of the archive is not easily understood but is important.

The absence of metadata is another challenge too. The absence of metadata or alternative text can render images, particularly, invisible. And the mix of formats and types, of the personal and the public, is most difficult but also most important. For instance the announcement of a government policy, the discussion around it, a petition perhaps, a debate in parliament… These are not easy to locate… Our histories are almost inherently online… But they only gain any real permanence through preservation in web archives, and that’s why humanists and historians really need to engage with them.

Response – Steve

I particularly want to talk about archiving in scholarship. In order to fit archiving into scholarly models… administrators increasingly make the case for scholarship in the context of employment and value. But archive work is important. Scholars are discouraged from this sort of work because it is not quick, and it’s harder to get published… Separately, you need organisations to engage in preservation of their own online presences. The degree to which archive work is needed is not reflected by promotion committees, organisational support, or local archiving processes. There are immense rhetorical challenges here, to persuade others of the value of this work. There have been successful cases made to encourage telephone providers to capture and share historical information. I was at a telephone museum recently and asked the archivist about the archive… She handed me a huge book on the founding of Southwestern Bell, published in a very small run… She gave me a copy, but no-one had asked about this before… That’s wrong though, it should be captured. So we can do some preservation work ourselves just by asking!


Q1) Jane, you mentioned a skills gap for humanities researchers. What sort of skills do they need?

A1) I think the complete lack of quantitative data training, how to sample, how to make meaning from quantitative data. They have never been engaged in statistical training. They have never been required to do it – you specialise so early here. Also, basic command line stuff… People don’t understand that or why they have to engage that way. Those are two simple starting points. Those help them understand what they are looking at, what an ngram means, etc.

Session 2B (Chair: Tom Storrar)

Philip Webster, Claire Newing, Paul Clough & Gianluca Demartini: A temporal exploration of the composition of the UK Government Web Archive

I’m afraid I’ve come into this session a little late. I have come in at the point that Philip and Claire are talking about the composition of the archive – mostly 2008 onwards – and looking at status codes of UK Government Web Archive. 

Philip: The hypothesis for looking at HTTP status codes was to see whether changes in government showed up as trends in the status codes. Actually, when we looked at post-2008 data we didn’t see what we expected there. However we did find that there was an increase in not finding what was requested – and thought this may be about moving to dynamic pages – but this is not a strong trend.

In terms of MIME types – media types – which are restricted to:

Application – flash, java, Microsoft Office Documents. Here we saw trends away from PDF as the dominant format. Microsoft word increases, and we see the increased use of Atom – syndication – coming across.

Executable – we see quite a lot of javascript. The importance of flash decreased over time – which we expected – and the increased in javascript (javascript and javascript x).

Document – PDF remains prevalent. Also MS Word, some MS Excel. Open formats haven’t really taken hold…

Claire: The Government Digital Strategy included guidance to use open document formats as much as possible, but that wasn’t mandated until late 2014 – a bit too late for our data set unfortunately. But the Government Digital Strategy in 2011 was, itself, published in Word and PDF!

Philip: If we take document type outside of PDFs you see that lack of open formats more clearly..

Image – This includes images appearing in documents, plus icons. And occasionally you see non-standard media types associated with the MIME types. JPEG usage is fairly consistent over time. GIF and PNG are comparable… GIF was being phased out for IP reasons, with PNG to replace it, and you see that change over time…

Text – Text is almost all HTML. You also see a lot of plain text, stylesheets, XML…

Video – we saw compressed video formats… gradually superseded by embedded YouTube links. However we do still see a lot of Flash video retained. And we see a large and increasing use of MP4, used by Apple devices.

Another thing that is available over time is relative file size. However the CDX index only contains compressed size data and therefore is not a true representation of file size trends – you can’t compare images to their pre-archiving versions. That means for this work we’ve limited the data set to those where you can tell the before and after status of the image files. We saw some spikes in compressed image formats over time; it is not clear if this shows departmental issues…

To finish on a high note… There is an increase in the use of https rather than http. I thought it might be the result of a campaign, but it seems to be a general trend..

The conclusion… Yes, it is possible to do temporal analysis of CDX index data but you have to be careful, looking at proportion rather than raw frequency. SQL is feasible, commonly available and low cost. Archive data has particular weaknesses – data cannot be assumed to be fully representative, but in some cases trends can be identified.
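The approach described – loading CDX index data into a SQL store and looking at proportions rather than raw frequencies – can be sketched roughly as follows. This is a minimal illustration only: the three sample lines are invented, and the 11-field CDX layout assumed here (urlkey, timestamp, original URL, MIME type, status code, digest, redirect, robot flags, compressed length, offset, WARC filename) varies between archives.

```python
import sqlite3

# Illustrative CDX lines, not real UK Government Web Archive data.
cdx_lines = [
    "gov,example)/ 20090315120000 http://example.gov/ text/html 200 ABC - - 5120 0 a.warc.gz",
    "gov,example)/doc 20090316120000 http://example.gov/doc application/pdf 200 DEF - - 90210 0 a.warc.gz",
    "gov,example)/ 20140315120000 https://example.gov/ text/html 200 GHI - - 4096 0 b.warc.gz",
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cdx (year INTEGER, mime TEXT, status INTEGER, length INTEGER)")
for line in cdx_lines:
    f = line.split()
    # Year from the 14-digit timestamp; compressed length from field 9.
    db.execute("INSERT INTO cdx VALUES (?, ?, ?, ?)",
               (int(f[1][:4]), f[3], int(f[4]), int(f[8])))

# Proportion (not raw frequency) of each MIME type per year, as the
# conclusion above recommends. Requires SQLite 3.25+ for window functions.
rows = db.execute("""
    SELECT year, mime,
           1.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY year) AS share
    FROM cdx GROUP BY year, mime ORDER BY year, mime
""").fetchall()
for year, mime, share in rows:
    print(year, mime, round(share, 2))
```

The same pattern scales to status codes or compressed sizes by swapping the grouped column.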


Q1) Very interesting, thank you. Can I understand… You are studying the whole archive? How do you take account of having more than one copy of the same data over time?

A1) There is a risk of one website being overrepresented in the archive. There are checks that can be done… But that is more computationally expensive…

Q2) With the seed list, is that generating the 404 rather than actual broken links?

A2 – Claire) We crawl by asking the crawler to go out to find links and seed from that. It generally looks within the domain we’ve asked it to capture…

Q3) At various points you talked about peaks and trends… Have you thought about highlighting that to folks who use your archive so they understand the data?

A3 – Claire) We are looking at how we can do that more. I have read about historians’ interest in the origins of the collection, and we are thinking about this, but we haven’t done that yet.

Caroline Nyvang, Thomas Hvid Kromann & Eld Zierau: Capturing the web at large – a critique of current web citation practices

Caroline: We are all here as we recognise the importance and relevance of internet research. Our paper looks at web referencing and citation within the sciences. We propose a new format to replace the URL+date format usually recommended. We will talk about a study of web references in 35 Danish master’s theses from the University of Copenhagen, then further work on monograph referencing, then a new citation format.

The work on 35 master’s theses submitted to the University of Copenhagen covered, as a set, 899 web references – an average of 26.4 web references per thesis; some had none, and the maximum was 80. This gave us some insight into how students cite URLs. Of those students citing websites: 21% gave the date for all links; 58% had dates for some but not all sites; 22% had no dates. Some of those URLs pointed to homepages or search results.

We looked at web rot and web references – almost 16% could not be accessed by the reader, checked or reproduced. An error rate of 16% isn’t that remarkable – in 1992 a study of 10 journals found that a third of references were inaccurate enough to make it hard to find the source again. But web resources are dynamic, and issues will vary, and likely increase over time.

The number of web references does not seem to correlate with particular subjects. Students are also quite imprecise when they reference websites. And even when the correct format was used, 15.5% of all the links would still have been dead.

Thomas: We looked at 10 Danish academic monographs published from 2010 to 2016. Although this is a small number of titles, it allowed us to see some key trends in the citation of web content. There was a wide range in the proportion of web citations used – 25% at the top, 0% at the bottom of these titles. The location of web references in these texts is not uniform. On the whole scholars rely on printed scholarly work… But web references are still important. This isn’t a systematic review of these texts… In theory these links should all work.

We wanted to see the status after five years… We used a traffic light system. 34.3% were red – broken, dead, or a different page; 21% were amber – critical links that either refer to changed or at-risk material; 44.7% were green – working as expected.

This work showed that web references become dead links within a limited number of years. In our work the URLs that go to the front page, with instructions of where to look, actually, ironically, lasted best. Long complex URLs were most at risk… So, what can we do about this…

Eld: We felt that we had to do something here, to address what is needed. We can see from the studies that today’s practice of URL plus date stamp does not work. We need a new standard, a way to reference something stable. The web is a marketplace and changes all the time. We need to look at the web archives… And we need precision and persistency. We felt there were four necessary elements, and we call it the PWID – Persistent Web IDentifier. The four elements are:

  • Archived URL
  • Time of archiving
  • Web archive – precision, and an indication that you have verified this is what you expect. Also persistency: the researcher has to understand the archive – is it a small or large archive, what is the contextual legislation?
  • Content coverage specification – is it part only? Is it the HTML? Is it the page including images as it appears in your browser? Is it the site including referred pages within the domain?

So we propose a form of reference which can be textually expressed as:

web archive:, archiving time: 2016-04-20 18:21:47 UTC, archived URL: http://resaw.en/, content coverage: webpage
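A reference in this four-part form can be assembled mechanically. The sketch below is my own illustration of the idea, not part of the proposal: the `PWID` class and `citation` helper are invented names, and the archive name `netarkivet.dk` is a hypothetical example (the talk’s own example left the archive field blank).

```python
from dataclasses import dataclass

@dataclass
class PWID:
    """The four elements of the proposed PWID reference."""
    archive: str       # which web archive holds the copy (hypothetical example below)
    archive_time: str  # time of archiving, UTC
    archived_url: str  # the URL as archived
    coverage: str      # e.g. "webpage", "html", "site"

    def citation(self) -> str:
        # Textual form following the example given in the talk.
        return (f"web archive: {self.archive}, "
                f"archiving time: {self.archive_time}, "
                f"archived URL: {self.archived_url}, "
                f"content coverage: {self.coverage}")

ref = PWID("netarkivet.dk", "2016-04-20 18:21:47 UTC", "http://resaw.en/", "webpage")
print(ref.citation())
```

Keeping the four elements as distinct fields, rather than packing them into an archive playback URL, is exactly what makes the reference readable and archive-independent.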

But why not just use a web archive URL? Well, such URLs can be hard to read, there is a lot of technology embedded in the URL, and it is not as accessible.


This is now in as an ISO 690 suggestion and proposed as a URI type.

To sum up, all research fields need to refer to the web. Good scientific practice cannot take place with current approaches.


Q1) I really enjoyed your presentation… I was wondering what citation format you recommend for content behind paywalls, and for dynamic content – things that are not in the archive.

A1 – Eld) We have proposed this for content in the web archive only. You have to put it into an archive to be sure, then you refer to it. But we haven’t tried to address those issues of paywall and dynamic content. BUT the URI suggestion could refer to closed archives too, not just open archives.

A1 – Caroline) We also wanted to note that this approach is to make web citations align with traditional academic publication citations.

Q2) I think perhaps what you present here is an idealised way to present archiving resources, but what about the marketing and communications challenge here – to better cite websites, and to use this convention when they aren’t even using best practice for web resources.

A2 – Eld) You are talking about marketing to get people to use this, yes? We are starting with the ISO standard… That’s one aspect. I hope also that this event is something that can help promote this and help to support it. We hope to work with different people, like you, to make sure it is used. We have had contact with Zotero for instance. But we are a library… We only have the resources that we have.

Q3) With some archives of the web there can be a challenge for students, for them to actually look at the archive and check what is there..

A3) Firstly citing correctly is key. There are a lot of open archives at the moment… But we hope the next step will be more about closed archives, and ways to engage with these more easily, to find common ground, to ensure we are citing correctly in the first place.

Comment – Nicola Bingham, BL) I like the idea of incentivising not just researchers but also publishers to support web archiving – another point of pressure for web archives… And making the case for openly accessible articles.

Q4) Have you come across Martin Klein and Herbert Van de Sompel’s work on robust links, and Memento?

A4 – Eld) Memento is excellent to find things, but usually you do not have the archive in there… I don’t think a way of referencing without the archive is a precise reference…

Q5) When you compare to web archive URL, it was the content coverage that seems different – why not offer as an incremental update.

A5) As far as I know, using a # in the URL doesn’t offer that specificity…

Comment) I would suggest you could define the standard for after that # in the URLs to include the content coverage – I’ll take that offline.

Q6) Is there a proposal there… For persistence across organisations, not just one archive.

A6) I think from my perspective there should be a registry, so that when archives change or move you can find the new location. Our persistent identifier isn’t persistent if you can change something. And I think archives must be large organisations, with formal custodians, to ensure persistence.

Comment) I would like to talk offline about content addressing and Linked Data to directly address and connect to copies.

Andrew Jackson: The web archive and the catalogue

I wanted to talk about some bad experiences I had recently… There is a recent BL video of the journey of a (print) collection item… From posting to processing, cataloguing, etc… I have worked at the library for over 10 years, but this year for the first time I had to get to grips with the library catalogue… I’ll talk more about that tomorrow (in the technical strand) but we needed to update our catalogue… accommodating the different ways the catalogue and the archive see content.

Now, that video, the formation of teams, the structure of the organisation, the physical structure of our building is all about that print process, and that catalogue… So it was a surprise to me – maybe not to you – that the catalogue isn’t just bibliographic data, it’s also a workflow management tool…

There is a chain of events here… Sometimes events are in a line, sometimes in circles… Always forwards…

Now, last year legal deposit came in for online items… The original digital processing workflow went from acquisition to ingest to cataloguing… But most of the content was already in the archive… We wanted to remove duplication, and make the process more efficient… So we wanted to automate this as a harvesting process.

For our digital work previously we also had a workflow, from nomination, to authorisation, etc… With legal deposit we have to get it all, all the time, all the stuff… So, we don’t select individual news items; we want all news sites every day… We might specify crawl targets, but more likely we’ll see what we’ve had before and draw them in… But this is a dynamic process…

So, our document harvester looks for “watched targets”, harvests, extracts documents for web archiving… and also ingest. There are relationships to acquisition, that feeds into cataloguing and the catalogue. But that is an odd mix of material and metadata. So that’s a process… But webpages change… For print matter things change rarely, it is highly unusual. For the web changes are regular… So how do we bring these things together…

To borrow an analogy from our Georeferencing project… Users engage with an editor to help us understand old maps. So, imagine a modern web is a web archive… Then you need information, DOIs, places and entities – perhaps a map. This kind of process allows us to understand the transition from print to online. So we think about this as layers of transformation… Where we can annotate the web archive… Or the main catalogue… That can be replaced each time this is needed. And the web content can, with this approach, be reconstructed with some certainty, later in time…

Also this approach allows us to use rich human curation to better understand that which is being automatically catalogued and organised.

So, in summary: the catalogue tends to focus on chains of operation and backlogs, item by item. The web archive tends to focus on transformation (and re-transformation) of data. A layered data model can bring them together. It means revisiting the data (but fixity checking requires this anyway). It’s costly in terms of disk space required. And it allows rapid exploration and experimentation.

Q1) To what extend is the drive for this your users, versus your colleagues?

A1) The business reason is that it will save us money… Taking away manual work. But, as a side effect we’ve been working with cataloguing colleagues in this area… And their expectations are being raised and changed by this project. I do now much better understand the catalogue. The catalogue tends to focus on tradition not output… So this project has been interesting from this perspective.

Q2) Are you planning to publish that layer model – I think it could be useful elsewhere?

A2) I hope to yes.

Q3) And could this be used in Higher Education research data management?

A3) I have noticed that with research data sets there are some tensions… Some communities use change management, functional programming etc… Hadoop, which we use, requires replacement of data… So yes, but this requires some transformation to do.

We’d like to use the same base data infrastructure for research… Otherwise we have to maintain this pattern of work separately.

Q4) Your model… suggests WARC files and such archive documents might become part of new views and routes in for discovery.

A4) That’s the idea – for discovery to be decoupled from where you store the file.

Nicola Bingham, UK Web Archive: Resource not in archive: understanding the behaviour, borders and gaps of web archive collections

I will describe the shape and the scope of the UK Web Archive, to give some context for you to explore it… By way of introduction.. We have been archiving the UK Web since 2013, under UK non-print legal deposit. But we’ve also had the Open Archive (since 2004); Legal Deposit Archive (since 2013); and the Jisc Historical Archive (1996-2013).

The UK Web Archive includes around 400 TB of compressed data, and in the region of 11–12 billion records. We grow, on average, 60–70 TB and 3 billion records per year. We want to be comprehensive but, that said, we can’t collect everything and we don’t want to collect everything… Firstly, we collect UK websites only. We carry out web archiving under the 2013 regulations, and they state that we may collect only UK published web content – meaning content on a UK web domain, or by a person whose work occurs in the UK. So, we can automate harvesting from the UK TLDs (.uk, .scot, .cymru etc.) and from UK hosting – a geo-IP look-up to locate the server. Then manual checks: so UK content on Facebook, WordPress, Twitter and the like cannot be automated…
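The two automated checks described – UK TLD, then UK hosting by geo-IP, with everything else left for manual review – might be sketched as below. This is illustrative only: the `geo_ip_country` stub and the TLD tuple are my assumptions, not the actual crawler logic.

```python
from urllib.parse import urlparse

# UK top-level domains mentioned in the talk ("etc." covers others).
UK_TLDS = (".uk", ".scot", ".cymru", ".wales")

def geo_ip_country(host: str) -> str:
    # Placeholder for a real geo-IP look-up (resolving the host and
    # locating the server); hard-coded here to keep the sketch runnable.
    return {"example.com": "GB"}.get(host, "??")

def in_automated_scope(url: str) -> bool:
    """True if the URL passes either automated check: UK TLD, or UK
    hosting by geo-IP. Anything else needs a manual check."""
    host = urlparse(url).hostname or ""
    if host.endswith(UK_TLDS):
        return True
    return geo_ip_country(host) == "GB"

print(in_automated_scope("https://www.bl.uk/"))        # UK TLD
print(in_automated_scope("http://example.com/"))       # UK-hosted (per stub)
print(in_automated_scope("https://twitter.com/ukwa"))  # left for manual check
```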

We only collect published content. Out of scope here are:

  • Film and recorded sound where AV content predominates, e.g. YouTube
  • Private intranets and emails.
  • Social networking sites only available to restricted groups – if you need a login or special permissions, they are out of scope.

Web archiving is expensive. We have to provide good value for money… We crawl the UK domain on an annual basis (only). Some sites are crawled more frequently, but annual misses a lot. We cap domains at 512 MB – which captures many sites in their entirety, but others we only capture part of (unless we override the automatic settings).

There are technical limitations too, around:

  • Database-driven sites – crawlers struggle with these
  • Programming scripts
  • Plug-ins
  • Proprietary file formats
  • Blockers – robots.txt or access denied.

So there are misrepresentations… For instance with the One Hundred Women blog we capture the content but not the stylesheet – that’s a fairly common limitation.

We also have curatorial input to locate the “important stuff”. In the British Library web archiving is not performed universally by all curators, we rely on those who do engage, usually voluntarily. We try to onboard as many curators and specialist professionals as possible to widen coverage.

So, I’ve talked about gaps and boundaries, but I also want to talk about how the users of the archive find this information, so that even where there are gaps, it’s a little more transparent…

We have the Collection Scoping Document; this captures the scope, motivation, parameters and timeframe of a collection. This document could, in a pared-down form, be made available to end users of the archive.

We have run user testing of our current UK Web Archive website, and our new version. And even more general audiences really wanted as much contextual information as possible. That was particularly important on our current website – where we only shared permission-cleared items. But this is one way in which contextual information can be shown in the interface with the collection.

The metadata can be browsed and searched, though users will be directed to come in to view the content.

So, an example of a collection would be 1000 Londoners, showing the context of the work.

We also gather information during the crawling process… We capture information on crawler configuration, seed list, exclusions… I understand this could be used and displayed to users to give statistics on the collection…

So, what do we know about what the researchers want to know? They want as much documentation as they possibly can. We have engaged with the research community to understand how best to present data to the community. And indeed that’s where your feedback and insight is important. Please do get in touch.


Q1) You said you only collect “published” content… How do you define that?

A1) Under the legal deposit regulations… the legal deposit libraries may collect content openly available on the web, and also content that is paywalled or behind login credentials – UK publishers are obliged to provide credentials for crawling. BUT how we make that accessible… is a different matter – we wouldn’t republish that on the open web without logins/credentials.

Q2) How do you have any ideas about packaging this type of information for users and researchers – more than crawler config files.

A2) The short answer is no… We’d like to invite researchers to access the collection in both a close reading sense, and a big data sense… But I don’t have that many details about that at the moment.

Q3) A practical question: if you know you have to collect something… If you have a web copy of a government publication, say, and the option of the original, older, (digital) document… Is the web archive copy enough, do you have the metadata to use that the right way?

A3) Yes, so on the official publications… This is where the document harvester tool comes into play, adding another layer of metadata to pass the document through various access elements appropriately. We are still dealing with this issue though.

Chris Wemyss – Tracing the Virtual community of Hong Kong Britons through the archived web

I’ve joined this a wee bit late after a fun adventure on the Senate House stairs… 

Looking at the Gwulo: Old Hong Kong site… User content is central to this site, which is centred on a collection of old photographs, buildings, people, landscapes… The website has started to add features to explore categorisations of images… And the site is led by an older British resident. He described subscribers as expats who have moved away, who remember an old version of Hong Kong that no longer exists – one user described it as an interactive photo album. There is clearly more to be done on this phenomenon of building these collective resources to construct this type of place. The founder comments on Facebook groups – they are about the now: “you don’t build anything, you just have interesting conversations”.

A third example then, Swire Mariners Association. This site has been running, nearly unchanged, for 17 years, but they have a very active forum, a very active Facebook group. These are all former dockyard workers, they meet every year, it is a close knit community but that isn’t totally represented on the web – they care about the community that has been constructed, not the website for others.

So, in conclusion archives are useful in some cases. Using oral history and web archives together is powerful, however, where it is possible to speak to website founders or members, to understand how and why things have changed over time. Seeing that change over time already gives some idea of the futures people want to see. And these sites indicate the demand for communities, active societies, long after they are formed. And illustrates how people utilise the web for community memory…


Q1) You’ve raised a problem I hadn’t really thought about. How can you tell if they are more active on Facebook or the website… How do you approach that?

A1) I have used web archiving as one source to arrange other things around… Looking for new websites, finding and joining the Facebook group, finding interviewees to ask about that. But I wouldn’t have been prompted to ask about the website and its change/lack of change without consulting the web archives.

Q2) Were participants aware that their pages were in the archive?

A2) No, not at all. The blog I showed first was started by two guys, Gwulo is run by one guy… And he quite liked the idea that this site would live on in the future.

David Geiringer & James Baker: The home computer and networked technology: encounters in the Mass Observation Project archive, 1991-2004

I have been doing web history work on various communities, including some work on GeoCities which is coming out soon… And I heard about the Mass Observation Project which, from 1991 to 2004, asked respondents about computers and how they were using them in their lives… The archives capture comments like:

“I confess that sometimes I resort to using the computer using the cut and paste technique to write several letters at once”

Confess is a strong word there… Over this period of observation we saw the production of text moving to computers, computers moving into most homes, the rebuilding of modernity. We welcome comments on this project, and hope to publish soon, where you can find out more on our method and approach.

So, each year since 1981 the Mass Observation Project has issued directives to respondents to respond to key issues, e.g. football, or the AIDS crisis. They issued the technology directive in 1991. From that year we see several fans of the word processor – words like love, dream… Responses to the 1991 directive are overwhelmingly positive… Something that was not the case for other technologies on the whole…

“There is a spell check on this machine. Also my mind works faster than my hand and I miss out letters. This machine picks up all my faults and corrects them. Thank you computer.”

After this positive response, though, we start to see etiquette issues, concerns about privacy… Writing some correspondence by hand, some using simulated handwriting… And concerns start to appear about adapting letters, whether that is cheating or not… Ethical considerations appearing… It is apparent that sometimes guilt around typing text is also slightly humorous… Some playful mischief there…

Altering the context of the issue of copy and paste… the time and effort to write a unique manuscript is a concern… Interestingly the directive asked about printing and filing emails… And one respondent notes that actually it wasn’t financial or business records they kept, but emails from their ex…

Another comments that they wish they had printed more emails during their pregnancy, a way of situating yourself in time and remembering the experience…

I’m going to skip ahead to how computers fitted into their home… People talk about dining rooms, and offices, and living rooms.. Lots of very specific discussions about where computers are placed and why they are placed there… One person comments:

“Usually at the dining room at home which doubles as our office and our coffee room”

Others talk about quieter spaces… The positioning of a computer seems to create some competition for use of space. The home changing to make room for the computer or the network… We also start to see (in 2004) comments about home life and work life, the setting up of a hotmail account as a subtle act of resistance, the reassertion of the home space.

A Mass Observation Directive in 1996 asked about email and the internet:

“Internet – we have this at work and it’s mildly useful. I wouldn’t have it at home because it costs a lot to be quite sad and sit alone at home” (1996)

So, observers from 1991-2004 talked about efficiencies of the computer and internet, copy, paste, ease… But this then reflected concerns about the act of creating texts, of engaging with others, computers as changing homes and spaces. Now, there are really specific findings around location, gender, class, age, sexuality… The overwhelming majority of respondents are white middle class cis-gendered straight women over 50. But we do see that change of response to technology, a moment in time, from positive to concerned. That runs parallel to the rise of the World Wide Web… We think our work does provide context to web archive work and web research, with textual production influenced by these wider factors.


Q1) I hadn’t realised mass observation picked up again in 1980. My understanding was that previously it was the observed, not the observers. Here people report on their own situations?

A1) They self report on themselves. At one point they are asked to draw their living room as well…

Q1) I was wondering about business machinery in the home – typewriters, for instance

A1) I don’t know enough about the wider archive. All of this newer material was done consistently… The older mass observation material was less consistent – people recorded on the street, or notes made in pubs. What is interesting is that in the newer responses you see a difference in the writing of the response… As they move from handwriting to typewriters to computers…

Q2) Partly you were talking about how people write and use computers, and a bit about how people archive themselves… But the only work I could find on how people archive themselves digitally was by Microsoft Research… Is there anything since then? In that paper you could almost read regret between the lines… the loss of photo albums, letters, etc…

A2) My colleague David Geiringer, with whom I co-wrote the paper, was initially looking at self-archiving. There was very, very little. But printing stuff comes up… And the tensions there. There is enough there – people talking about worries and loss… There is lots in there… The great thing with Mass Obs is that you can have a question but then you have to dig around a lot to find things…

Ian Milligan, University of Waterloo and Matthew Weber, Rutgers University – Archives Unleashed 4.0: presentation of projects (#hackarchives)

Ian: I’m here to talk about what happened on the first two days of Web Archiving Week. And I’d like to thank our hosts, supporters, and partners for this exciting event. We’ll do some lightning talks on the work undertaken… But why are historians organising data hackathons? Well, because we face problems in our popular cultural history. Problems like GeoCities… Kids write about Winnie the Pooh, people write about their love of Buffy the Vampire Slayer, their love of cigars… We face a problem of huge scale… 7 million users of the web now online… It’s the scale that boggles the mind. Compare it to the Old Bailey – one of very few sources on ordinary people, who otherwise leave only birth, death, marriage or criminal justice records. Its 197,745 trials across 239 years, from 1674 to 1913, form one of the biggest collections of texts about ordinary people… But from just 7 years of GeoCities we have 413 million web documents.

So, we have a problem, and myself, Matt and Olga from the British Library came together to build community, to establish a common vision of web archiving documents, to find new ways of addressing some of these issues.

Matt: I’m going to quickly show you some of what we did over the last few days… and the amazing projects created. I’ve always joked that Archives Unleashed is letting folk run amok to see what they can do… We started around 2 years ago, in Toronto, then Library of Congress, then at Internet Archive in San Francisco, and we stepped it up a little for London! We had the most teams, we had people from as far as New Zealand.

We started with some socialising in a pub on Friday evening, so that when we gathered on Monday we’d already done some introductions. Then a formal overview, and quickly forming teams to work and develop ideas… continuing through day one and day two… We ended up with 8 complete projects:

  • Robots in the Archives
  • US Elections 2008 and 2010 – text and keyword analysis
  • Study of Gender Distribution in Olympic communities
  • Link Ranking Group
  • Intersection Analysis
  • Public Inquiries Implications (Shipman)
  • Image Search in the Portuguese Web Archive
  • Rhyzome Web Archive Discovery Archive

We will hear from the top three from our informal voting…

Intersection Analysis – Jess

We wanted to understand how we could find a cookbook methodology for understanding the intersections between different data sets. So, we looked at the Occupy Movement (2011/12) with a Web Archive, a Rutgers archive and a social media archive from one of our researchers.

We normalised CDX, crunched WAT files for outlinks, and extracted links from tweets. We generated counts and descriptive data, and the union/intersection between every pair of datasets. We had over 74 million records across the datasets, but only 0.3% overlap between the collections… If you go to our website we have a visualisation of overlaps, tree maps of the collections…
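The overlap computation described here boils down to set intersections over normalised URLs. A minimal sketch, with hypothetical URL sets standing in for the CDX/WAT/tweet data:

```python
# Sketch of the intersection analysis: each collection is reduced to a
# set of normalised URLs; overlap is the size of the intersection
# relative to the union. The collection names and URLs are invented.
from itertools import combinations

def overlap(a: set, b: set) -> float:
    """Jaccard-style overlap between two URL sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical normalised URL sets standing in for the real datasets.
collections = {
    "web_archive": {"example.org/a", "example.org/b", "example.org/c"},
    "rutgers":     {"example.org/b", "example.org/d"},
    "twitter":     {"example.org/e"},
}

# Pairwise overlap between every pair of collections.
for (name1, urls1), (name2, urls2) in combinations(collections.items(), 2):
    print(f"{name1} vs {name2}: {overlap(urls1, urls2):.1%}")
```

With real collections of tens of millions of URLs the same idea applies, just computed over hashed or sorted CDX entries rather than in-memory sets.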

We wanted to use the WAT files to explore Outlinks in the data sets, what they were linking to, how much of it was archived (not a lot).

Parting thoughts? Overlap is inversely proportional to the diversity of URIs – in other words, the more collectors, the better. Diversifying seed lists with social media is good.

Robots in the Archive 

We focused on robots.txt. Our question was “what do we miss when we respect robots.txt?”. At the National Library of Denmark we respect this… At the Internet Archive they’ve started to ignore it in some contexts. So, what did we do? We extracted robots.txt files from the WARC collection, then applied them retroactively, then compared to the link graph.
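The retroactive check can be sketched with Python’s standard urllib.robotparser: parse the robots.txt text recovered from the WARC, then test which captured URLs would have been excluded. The rules and URLs below are invented for illustration:

```python
# Apply an archived robots.txt retroactively: parse its rules, then
# check which captured URLs a given crawler would have been blocked from.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt text recovered from a WARC record.
archived_robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(archived_robots_txt.splitlines())

# Hypothetical URLs captured in the same collection.
captured_urls = [
    "http://example.gov/index.html",
    "http://example.gov/private/report.pdf",
]

for url in captured_urls:
    blocked = not parser.can_fetch("heritrix", url)
    print(url, "BLOCKED" if blocked else "ok")
```

Running the same check per-host over a whole collection gives the "what would we have missed" counts the team reports below.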

Our data was from The National Archives and from the 2010 election. We started by looking at user-agent blocks. Four had specifically blocked the Internet Archive, but some robot names were very old and out of date… And we looked at crawl delay… Looking specifically at the sub-collection of the Department of Energy and Climate Change… We would have missed only 24 links that would have been blocked…

So, the impact of robots.txt is minimal for this collection. Our method can be applied to other collections and extended to further the discussion on ignoring robots.txt. And our code is on GitHub.

Link Ranking Group 

We looked at link analysis to ask if all links are treated the same… We wanted to test if links in <li> elements are different from content links (in <p> or <div>). We used WarcBase scripts to export manageable raw HTML, loaded it into the BeautifulSoup library, and used this on the Rio Olympics sites…
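The classification step can be sketched as follows. The team used BeautifulSoup on WarcBase-exported HTML; this standard-library version is an illustrative stand-in that tracks the open-tag stack and, for each link, records its nearest enclosing li/p/div and whether the href is absolute or relative:

```python
# Classify links by their nearest enclosing container (<li>, <p>, <div>)
# and by absolute vs relative href. The sample HTML is invented.
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkClassifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.counts = {}   # (container, kind) -> count

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # nearest enclosing li/p/div, scanning the stack inside-out
                container = next((t for t in reversed(self.stack)
                                  if t in ("li", "p", "div")), "none")
                kind = "absolute" if urlparse(href).scheme else "relative"
                key = (container, kind)
                self.counts[key] = self.counts.get(key, 0) + 1
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to (and including) the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass

html = """<div><p>See <a href="http://example.com/page">absolute</a>.</p>
<ul><li><a href="/nav">relative nav link</a></li></ul></div>"""

parser = LinkClassifier()
parser.feed(html)
print(parser.counts)
```

Run over every page in a collection, the (container, kind) counts give exactly the 3/4 vs 1/4 style proportions discussed below.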

So we started looking at the WARCs… We said, well, we should test whether links were absolute or relative… Comparing hard (absolute) links to relative links we didn’t see many differences…

But we started to look at a previous election data set… There we saw links in tables, and there relative links were about 3/4 of links, and the other 1/4 were hard links. We did some investigation about why we had more hard links (proportionally) than before… Turns out this is a mixture of SEO practice, but also use of CMS (Content Management Systems) which make hard links easier to generate… So we sort of stumbled on that finding…

And with that the main programme for today is complete. There is a further event tonight and battery/power sockets permitting I’ll blog that too. 

Apr 052017
Cakes at the CIGS Web 2.0 and Metadata Event 2017

Today I’m at the Cataloguing and Indexing Group Scotland event – their 7th Metadata & Web 2.0 event – Somewhere over the Rainbow: our metadata online, past, present & future. I’m blogging live so, as usual, all comments, corrections, additions, etc. are welcome. 

Paul Cunnea, CIGS Chair is introducing the day noting that this is the 10th year of these events: we don’t have one every year but we thought we’d return to our Wizard of Oz theme.

On a practical note, Paul notes that if we have a fire alarm today we’d normally assemble outside St Giles Cathedral but as they are filming The Avengers today, we’ll be assembling elsewhere!

There is also a cupcake competition today – expect many baked goods to appear on the hashtag for the day #cigsweb2. The winner takes home a copy of Managing Metadata in Web-scale Discovery Systems / edited by Louise F Spiteri. London : Facet Publishing, 2016 (list price £55).

Engaging the crowd: old hands, modern minds. Evolving an on-line manuscript transcription project / Steve Rigden with Ines Byrne (not here today) (National Library of Scotland)

Ines has led the development of our crowdsourcing side. My role has been on the manuscripts side. Any transcription is about discovery. For the manuscripts team we have to prioritise digitisation so that we can deliver digital surrogates that enable access, and to open up access. Transcription hugely opens up texts but it is time consuming and that time may be better spent on other digitisation tasks.

OCR has issues but works relatively well for printed texts. Manuscripts are a different matter – handwriting, ink density, paper, all vary wildly. The REED(?) project is looking at what may be possible but until something better comes along we rely on human effort. Generally the manuscript team do not undertake manual transcription, but do so for special exhibitions or very high priority items. We also have the challenge that so much of our material is still under copyright so cannot be done remotely (but can be accessed on site). The expected user community generally can be expected to have the skill to read the manuscript – so a digital surrogate replicates that experience. That being said, new possibilities shape expectations. So we need to explore possibilities for transcription – and that’s where crowd sourcing comes in.

Crowd sourcing can resolve transcription, but issues with copyright and data protection still have to be resolved. It has taken time to select suitable candidates for transcription. In developing this transcription project we looked to other projects – like Transcribe Bentham which was highly specialised, through to projects with much broader audiences. We also looked at transcription undertaken for the John Murray Archive, aimed at non specialists.

The selection criteria we decided upon was for:

  • Hands that are not too troublesome.
  • Manuscripts that have not been re-worked excessively with scoring through, corrections and additions.
  • Documents that are structurally simple – no tables or columns for example where more complex mark-up (tagging) would be required.
  • Subject areas with broad appeal: genealogies, recipe book (in the old crafts of all kinds sense), mountaineering.

Based on our previous John Murray Archive work we also want the crowd to provide us with structured text, so that it can be easily used, by tagging the text. That’s an approach that is borrowed from Transcribe Bentham, but we want our community to be self-correcting rather than doing QA of everything going through. If something is marked as finalised and completed, it will be released with the tool to a wider public – otherwise it is only available within the tool.

The approach could be summed up as keep it simple – and that requires feedback to ensure it really is simple (something we did through a survey). We did user testing on our tool; it particularly confirmed that users just want to go in and use it, and want it to be intuitive – that’s a problem with transcription and mark up, so there are challenges in making that usable. We have a great team who are creative and have come up with solutions for us… But meanwhile other projects have emerged. If the REED project is successful in getting machines to read manuscripts then perhaps these tools will become redundant. Right now there is nothing out there or in scope for transcribing manuscripts at scale.

So, lets take a look at Transcribe NLS

You have to log in to use the system. That’s mainly to help deter malicious or erroneous data. Once you log into the tool you can browse manuscripts; you can also filter by the completeness of the transcription, and the grade of the transcription – we ummed and ahhed about including that but we thought it was important to include.

Once you pick a text you click the button to begin transcribing – you can enter text, special characters, etc. You can indicate if text is above/below the line. You can mark up where the figure is. You can tag whether the text is not in English. You can mark up gaps. You can mark that an area is a table. And you can also insert special characters. It’s all quite straight forward.


Q1) Do you pick the transcribers, or do they pick you?

A1) Anyone can take part but they have to sign up. And they can indicate a query – which comes to our team. We do want to engage with people… As the project evolves we are looking at the resources required to monitor the tool.

Q2) It’s interesting what you were saying about copyright…

A2) The issue of copyright here is about sharing off site. A lot of our manuscripts are unpublished. We use exceptions such as the 1956 Copyright Act for old works whose authors had died. The selection process has been difficult, working out what can go in there. We’ve also cheated a wee bit…

Q3) What has the uptake of this been like?

A3) The tool is not yet live. We think it will build quite quickly – people like a challenge. Transcription is quite addictive.

Q4) Are there enough people with palaeography skills?

A4) I think that most of the content is C19th, where handwriting is the main challenge. For much older materials we’d hit that concern and would need to think about how best to do that.

Q5) You are creating these documents that people are reading. What is your plan for archiving these.

A5) We do have a colleague considering and looking at digital preservation – longer term storage being more the challenge – as part of the normal digital preservation scheme.

Q6) Are you going for a Project Gutenberg model? Or have you spoken to them?

A6) It’s all very localised right now, just seeing what happens and what uptake looks like.

Q7) How will this move back into the catalogue?

A7) Totally manual for now. It has been the source of discussion. There was discussion of pushing things through automatically once transcribed to a particular level but we are quite cautious and we want to see what the results start to look like.

Q8) What about tagging with TEI? Is this tool a subset of that?

A8) There was a John Murray Archive, including mark up and tagging. There was a handbook for that. TEI is huge but there is also TEI Light – the JMA used a subset of the latter. I would say this approach – that subset of TEI Light – is essentially TEI Very Light.

Q9) Have other places used similar approaches?

A9) Transcribe Bentham is similar in terms of tagging. The University of Iowa Civil War Archive has also had a similar transcription and tagging approach.

Q10) The metadata behind this – how significant is that work?

A10) We have basic metadata for these. We have items in our digital object database and simple metadata goes in there – we don’t replicate the catalogue record but ensure it is identifiable, log date of creation, etc. And this transcription tool is intentionally very basic at the moment.

Coming up later…

Can web archiving the Olympics be an international team effort? Running the Rio Olympics and Paralympics project / Helena Byrne (British Library)

I am based at the UK Web Archive, which is based at the British Library. The British Library is one of the six legal deposit libraries. The BL are also a member of the International Internet Preservation Consortium – as are the National Library of Scotland. The Content Development Group works on any project with international relevance and a number of interested organisations.

Last year I was lucky enough to be lead curator on the Olympics 2016 Web Archiving project. We wanted to get a good range of content. Historically our archives for Olympics have been about the events and official information only. This time we wanted the wider debate, controversy, fandom, and the “e-Olympics”.

We received a lot of nominations for sites. This is one of the biggest projects we have been involved in. There were 18 IIPC members involved in the project, but nominations also came from beyond the membership. We think this will be a really good resource for those researching the events in Rio. We had material in 34 languages in total. English was the top language collected – reflecting IIPC memberships to some extent. In terms of what we collected it included official IOC materials – but few, as we have a separate archive across Games for these. But subjects included athletes, teams, gender, doping, etc. There were a large number of website types submitted. Not all material nominated was collected – some had incomplete metadata, some crawls were unsuccessful, there were duplicate nominations, and the web is quite fragile still, so some links were already dead when we reached them.

There were four people involved here: myself, my line manager, the two IIPC chairs, and the IIPC communications person (also based at the BL). We designed a collection strategy to build engagement as well as content. The Olympics is something with very wide appeal and lots of media coverage around the political and Zika situation, so we did widen the scope of collection.

Thinking about our users, we chose collaborative tools that worked in contributors’ contexts: WebEx, Google Drive and Maps, and Slack (free in many contexts) was really useful. Chapter 8 in “Altmetrics” is great for alternatives to Google – it is important to have those as Google is simply not accessible in some locations.

We used mostly Google Sheets for IIPC member nominations – 15 fields, 6 of which were obligatory. For non-members we used a (simplified) Google Form – shared through social media. Some non-IIPC member organisations used this approach – for instance a librarian in Hawaii submitted lots of Pacific islands content.

In terms of communicating the strategy we developed instructional videos (with free tools – Screencastomatic and Windows Movie Maker) with text and audio commentary, print summaries, emails, and public blog posts. Resources were shared via Google Drive so that IIPC members could download and redistribute them.

No matter whether IIPC member or through the nomination form, we wanted six key fields:

  1. URL – free form
  2. Event – drop down option
  3. Title – free form (and English translation option if relevant)
  4. Olympic/Paralympic sport – drop down option
  5. Country – free form
  6. Contributing organisation – free form (for admin rather than archive purposes)
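The six mandatory fields lend themselves to a simple completeness check before crawling. A minimal sketch – the field names and sample row below are illustrative, not the project’s actual spreadsheet schema:

```python
# Check a nomination row for the six mandatory fields described above.
# Field names and the sample row are invented for illustration.
MANDATORY = ["url", "event", "title", "sport", "country", "contributor"]

def missing_fields(nomination: dict) -> list:
    """Return the mandatory fields that are absent or blank."""
    return [f for f in MANDATORY if not str(nomination.get(f, "")).strip()]

row = {
    "url": "http://example.org/rio-coverage/",
    "event": "Olympics",
    "title": "Rio coverage",
    "sport": "Athletics",
    "country": "",          # left blank by the nominator
    "contributor": "British Library",
}
print(missing_fields(row))  # fields still needed before the crawl
```

A check like this, run over the shared sheet, catches the incomplete nominations mentioned in the issues list below before they reach the crawler.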

There are no international standards for cataloguing web archive data. OCLC have a working group looking at this just now – they are due to report this year. One issue that has been raised is the context of those doing the cataloguing – cataloguing versus archiving.

Communication was essential on a regular basis – there was quite a long window of nomination and collection across the summer. We had several pre-event crawl dates, then further dates during and after both the Olympics and the Paralympics. I would remind folk about these, and provide updates on what was collected, sharing that map of content collected. We also blogged the project to engage and promote what we were doing. Participants enjoyed the updates – it helped them justify time spent on the project to their own managers and organisations.

There were some issues along the way…

  • The trailing slash is required for the crawler – if there is no trailing slash the crawler takes everything it can find, so attempting all of the BBC or Twitter is a problem.
  • Not tracking the date of nomination – e.g. organisations adding to the spreadsheet without updating date of nomination – that was essential to avoid duplication so that’s a tip for Google forms.
  • Some people did not fill in all of the six mandatory fields (or didn’t fill them in completely).
  • Country name vs Olympic team name. That is unexpectedly complex. Team GB includes England, Scotland, Wales and Northern Ireland… But Northern Irish athletes can also compete for Ireland. Palestine isn’t recognised as a country in all places, but it is in the Olympics. And there was a Refugee Team as well – with no country to tie to. Similar issues of complexity came out of organisation names – there are lots of ways to write the name of the British Library, for instance.
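The trailing-slash issue above comes down to prefix matching: a seed without the slash also matches sibling paths. A toy illustration – a deliberately simplified scope rule, not the production crawler’s logic:

```python
# Simplified prefix-based crawl scoping: without the trailing slash,
# the seed prefix also matches sibling paths, so the crawl scope
# balloons well beyond what the nominator intended.
def in_scope(url: str, seed_prefix: str) -> bool:
    return url.startswith(seed_prefix)

urls = [
    "http://example.com/sport/rio-2016",
    "http://example.com/sport-extra/other",   # sibling path, not intended
]

print([u for u in urls if in_scope(u, "http://example.com/sport/")])  # intended scope
print([u for u in urls if in_scope(u, "http://example.com/sport")])   # over-matches
```

Real crawlers apply much richer scope rules, but the failure mode is the same: the shorter prefix quietly admits whole neighbouring sections of a site.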

We promoted the project with four blog posts sharing key updates and news. We had limited direct contact – mostly through email and Slack/messaging. We also had a unique hashtag for the collection #Rio2016WA – not catchy but avoids confusion with Wario (Nintendo game) – and Twitter chat, a small but international chat.

Ethically we only crawl public sites but the IIPC also have a take down policy so that anyone can request their site be removed.

Conclusions… Be aware of any cultural differences with collaborators. Know who your users are. Have a clear project plan, available in different mediums. And communicate regularly – to keep enthusiasm going. And, most importantly, don’t assume anything!

Finally… Web Archiving Week is in London in June, 12th-16th 2017. There is a “Datathon” but the deadline is Friday! Find out more at And you can find out more about the UK Web Archive via our website and blog: You can also follow us and the IIPC on Twitter.

Explore the Olympics archive at:


Q1) For British Library etc… Did you use a controlled vocabulary

A1) No but we probably will next time. There were suggestions/autocomplete. Similarly for countries. For Northern Irish sites I had to put them in as Irish and Team GB at the same time.

Q2) Any interest from researchers yet? And/or any connection to those undertaking research – I know internet researchers will have been collecting tweets…

A2) Colleagues in Rio identified a PhD project researching the tweets – very dynamic content so hard to capture. Not huge amount of work yet. I want to look at the research projects that took place after the London 2012 Olympics – to see if the sites are still available.

Q3) Anything you were unable to collect?

A3) In some cases articles are only open for short periods of time – we’d do more regular crawls of those nominations next time I think.

Q4) What about Zika content?

A4) We didn’t have a tag for Zika, but we did have one for corruption, doping, etc. Lots of corruption post event after the chair of the Irish Olympic Committee was arrested!

Statistical Accounts of Scotland / Vivienne Mayo (EDINA)

I’m based at EDINA and we run various digital services and projects, primarily for the education sector. Today I’m going to talk about the Statistical Accounts of Scotland. These are a hugely rich and valuable collection of statistical data that span both the agricultural and industrial revolutions in Scotland. The online service launched in 2001 but has recently been thoroughly refreshed and relaunched.

There are two accounts. The first set was created (1791-1799) by Sir John Sinclair of Ulbster. He had a real zeal for agricultural data. There had been attempts to collect data in the 16th and 17th centuries. So Sir John set about a plan to get every minister in Scotland to collect data on their parish. He was inspired by German surveys but also had his own ideas for the project:

“an inquiry into the state of a country, for the purpose of ascertaining the quantum of happiness enjoyed by its inhabitants, and the means of its future improvement”

He also used the word “Statistics” as a kind of novel, interesting term – it wasn’t in wide use. And the statistics in the accounts are more qualitative than the quantitative data we associate with the word today.

Sir John sent ministers 160 questions, then another 6, then another set a year later, so that there were 171 in total. So you can imagine how delighted they were to receive that. And the questions (you can access them all in the service) were hard to answer – asking about the wellbeing of parishioners, how their circumstances could be ameliorated… But ministers were paid by the landowners who employed their parishioners so that data also has to be understood in context. There were also more factual questions on crops, pricing, etc.

It took a long time – 8 years – to collect the data. But it was a major achievement. And these accounts were part of a “pyramid” of data for the agricultural reports. He had county reports, but also higher level reports. This was at the time of the Enlightenment and the idea was that with this data you could improve the condition of life.

Even though the ministers did complete their returns, for some it was a struggle – and certainly hard to be accurate. Population tables were hard to get correct, especially in the context of scepticism that this data might be used to collect taxes or for other non-beneficial purposes.

The Old Account was a real success. And the Church of Scotland commissioned a New Account from 1834-45 as a follow up to that set of accounts.

The online service was part of one of the biggest digitisation projects in Scotland in the late 1990s, with the accounts going live in 2001. But much had changed since then in terms of functionality that any user might expect. In this new updated service we have added the ability to tag, to annotate, to save… Transcriptions have been improved, the interface has been improved. We have also made it easier to find associated resources – selected by our editorial board drawn from libraries, archives, specialists on this data.

When Sir John published the Old Accounts he printed them in volumes as they were received – that makes it difficult to browse and explore those. And there can be multiple accounts for the same parish. So we have added a way to browse each of the 21 volumes so that it is easier to find what you need. Place is key for our users and we wanted to make the service more accessible. Page numbers were an issue too – our engineers provide numbering of sections – so if you look for Portpatrick – you can find all of the sections and volumes where that area occurs. Typically sections are a parish report, but it can be other types of content too – title pages, etc.

Each section is associated with a Parish – which is part of a county. And there may be images (illustrations such as coal seams, elevations of notable buildings in the parish, etc.). Each section is also associated with pages – including images of the pages – as well as transcripts and indexed data used to enable searching.

So, if I search for tea drinking… described as a moral menace in some of the earlier accounts! When you run a search like this it identifies associated sections, the related resources, and associated words – those words that often occur with the search term. For tea-drinking, “twopenny” is often associated… Following that thread I found a County of Forfar account from 1793… And this turns out to be the slightly alarming-sounding home brew…

“They make their own malt, and brew it into that kind of drink called Two-penny which, till debased in consequence of multiplied taxes, was long the favourite liquor of all ranks of people in Dundee.”

When you do look at a page like this you can view the transcription – which tends to be easier to read than the scanned pages with their flourishes and “f” instead of “s”. You can tag, annotate, and share the pages. There are lots of ways to explore and engage with the text.
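The “associated words” feature mentioned above could plausibly be computed as simple co-occurrence counts over the transcripts. A toy sketch – the text, the crude stopword list, and the method are all illustrative, not EDINA’s actual implementation:

```python
# Toy "associated words": count terms that co-occur with the query word
# in the same document. Texts and stopword list are invented.
from collections import Counter
import re

STOPWORDS = {"the", "and", "a", "as", "was", "were", "in"}

def associated_words(texts, query, top=3):
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z\-]+", text.lower())
        if query in words:
            # count every other (non-stopword) term in the same document
            counts.update(w for w in words
                          if w != query and w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top)]

docs = [
    "tea-drinking and twopenny were the favourite topics",
    "the minister deplored tea-drinking as a moral menace",
    "twopenny was brewed in dundee",
]
print(associated_words(docs, "tea-drinking"))
```

A production version would weight by document frequency and filter stopwords more carefully, but the principle – surfacing terms like “twopenny” next to “tea-drinking” – is the same.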

There are lots of options to search the service – simple search, advanced search, and new interactive maps of areas and parishes – these use historic maps from the NLS collections and are brand new to the service.

With all these new features we’d love to hear your feedback when you do take a look at the service – do let us know how you find it.

I wanted to show an example of change and illustration here. The old Accounts of Dumfries (Vol 5, p. 119) talk about the positive improvements to housing and the idea of “improvement” as a very positive thing. We also see an illustration from the New Accounts of the old habitations and the new modern houses of the small tenants – but that was from a parish owned by the Duke of Sutherland, who had a notorious reputation as a brutal landlord for clearing land and murdering tenants to make these “improvements”. So, again, one has to understand the context of this content.

Looking at Dumfries in the Old Accounts things looked good, with some receiving poor support. The increase in industry means that by the New Accounts the population has substantially grown, as has poverty. The minister also comments on the impact of the three inns in town, and the increase in poaching. A transitory population can also affect health – there is a vivid account of a cholera outbreak from 15th Sept to 27th Nov 1832. That seems relatively recent, but at that point they thought transmission was through the air; they didn’t realise it was waterborne until some time later.

Some accounts, like that one, are highly descriptive. But many are briefer or less richly engaging. Deaths are often carefully captured. The minister for Dumfries put together a whole table of deaths – causes of which include, surprisingly, teething. And there are also records of healthcare and healthcare costs – including one individual paying for several thousand children to be inoculated against smallpox.

Looking at the schools near us here in central Edinburgh, there was free education for some poor children. But schooling mostly wasn’t free. The cost of teaching one child reading and writing would be a twelfth of a farm labourer’s salary. To climb the social ladder with e.g. French, Latin, etc., the teaching was far more expensive. And indeed there is a chilling quote in the New Accounts from Cadder, County of Lanark (Vol 8, p. 481), speaking of attitudes that education was corrupting for the poor. This was before education became mandatory (in 1834).

There is also some colourful stuff in the Accounts. There is a lot of witchcraft, local stories, and folk stories. One of my colleagues found a lovely story about a tradition that the last person buried in one area “manned the gates” until the next one arrived. Then one day two people died and there were fisticuffs!

I was looking for something else entirely and found, in Fife, a story of a girl who set sail from Greenock, was captured by pirates, was sold into a harem, and became a princess in Morocco – there’s a book called The Fourth Queen based on that story.

There is an anvil known as the “Reformation Cloth” – pre-Reformation there was a blacksmith who thought the Catholic priest was having an affair with his wife… and took his revenge by attacking the offending part of the priest on that anvil. I suspect that there may have been some ministerial stuff at play here too – the parish minister notes that “no other catholic minister replaced him” – but it is certainly colourful.

And that’s all I wanted to share today. Hopefully I’ve piqued your interest. You can browse the accounts for free, and some of the richer features are part of our subscription service. Explore the Statistical Accounts of Scotland at: You can also follow us on Twitter, Facebook, etc.


Q1) SOLR indexing and subject headings – can you say more?

A1) They used subject headings from the original transcriptions, and then there were some additions made based on those.

Comment) The Accounts are also great for Wikipedia editing! I found references to Christian Shaw, a thread pioneer I was looking to build a page about – she is in the Accounts as she was mentioned in a witchcraft trial that is included there. It can be a really useful way to find details that aren’t documented elsewhere.

Q2) You said it was free to browse – how about those related resources?

A2) Those related resources are part of the subscription services.

Q3) Any references to sports and leisure?

A3) Definitely to festivals, competitions, events etc. As well as some regular activities in the parish.

Beyond bibliographic description: emotional metadata on YouTube / Diane Pennington (University of Strathclyde)

I want to start with this picture of a dog in a dress…. How do you feel when you see this picture? How do you think she was feeling? [people in the room guess the pup might be embarrassed].

So, this is Tina, she’s my dog. She’s wearing a dress we had made for her when we got married… And when she wears it she always looks so happy… And people, when I shared it on social media, also thought she looked happy. And that got me curious about emotion and emotional responses… That isn’t accommodated in bibliographic metadata. As a community we need to think about how this material makes us feel, how else can we describe things? When you search for music online mood is something you might want to see… But usually it’s recommendations like “this band is similar to…”. My favourite band is U2 and I get recommended Coldplay… And that makes me mad, they aren’t similar!

So, when we teach and practice ILS, we think about information as text that sits in a database, waiting for a user to write a query and get a match. The problem is that there are so many other ways that people also want to look for information – not just bibliographic information or full text, but in other areas too, like the bodily – what pain means (Yates 2015); photographs, videos, music (Rasmussen Neal, 2012) – where the full text doesn’t inherently include the search terms or keywords; “matter and energy” (Bates, 2006) – the idea that there is information everywhere and the need to think more broadly to describe this.

I’ve been working in this area for a while and I started looking at Flickr, at pictures that are tagged “happy”. Those tend to include smiling people, holiday photos, sunny days, babies, cute animals. Relevance rankings showed “happy” more often, people engaged and liked more with happy photos… But music is different. We often want music that matches our mood… There were differences to tags and understanding music… Heavy metal sounds angry, slower or minor-key music sounds sad…

So, the work I’m talking about you can also find in an article published last year.

My work was based on the U2 song, Song for Someone, and there are over 150 fan videos created for it… If I show you this one (by Dimas Fletcher) you’ll see it has high production values… The song was written by Bono for his wife – they’ve been together since they were teenagers – and it’s very slow and emotional, reminiscing about being together. So this video is a really different interpretation.

Background to this work, and theoretical framework for it, includes:

  • “Basic emotions” from cognition, psychology, music therapy (Ekman, 1992)
  • Emotional Information Retrieval
  • Domains of fandom and aca-fandom (Stein & Busse, 2009; Bennett, 2014)
  • Online participatory culture, such as writing fan fiction or making cover versions of videos for love songs (Jenkins, 2013)
  • U2 academic study
  • Intertextuality as a practice in online participatory culture (Varmacelli 2013?)

So I wanted to do a discourse analysis (Budd & Raber 1996, Iedema 2003) applied to intertextuality, analysing the emotional information conveyed in 150 YouTube cover videos of U2’s Song for Someone. And also a quantitative view of views, comments, likes and dislikes – indicating responses to them.
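The quantitative side – aggregating views, comments, likes and dislikes across the videos – might look like this in outline. The records below are invented for illustration; in practice they could come from the YouTube Data API (videos.list with part="statistics"):

```python
def summarise(videos):
    """Aggregate response metrics across a set of fan videos."""
    total = {"views": 0, "likes": 0, "dislikes": 0, "comments": 0}
    for v in videos:
        for key in total:
            total[key] += v.get(key, 0)
    # share of positive reactions - "more likes than dislikes" in this study
    total["like_ratio"] = total["likes"] / max(1, total["likes"] + total["dislikes"])
    return total

fan_videos = [
    {"views": 1200, "likes": 40, "dislikes": 2, "comments": 11},
    {"views": 300, "likes": 9, "dislikes": 0, "comments": 3},
]
stats = summarise(fan_videos)
print(stats)
```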

The producers of these videos created lots of different types of videos. Some were cover versions. Some were original versions of the song with new visual content. Some were tutorials on how to play the song. And then there were videos exhibiting really deep personal connections with the song.

So the cover versions are often very emotional – a comment says as much. That emotion level is metadata. There are videos in context – background details, kids dancing, etc. But then some are filmed out of a plane window. The tutorials include people, some annotated “karaoke piano” tutorials…

Intertextuality… You need to understand your context. So one of the videos shows a guy in a yellow cape who is reaching and touching the Achtung Baby album cover before starting to sing. In another video a person is in the dark, in shadow… But here Song for Someone lyrics and title on the wall, but then playing and mashing up with another song. In another video the producer and his friend try to look like U2.

Then we have the producers comments and descriptions that add greatly to understanding those videos. Responses from consumers – more likes than dislikes; almost all positive comments – this is very different from some Justin Bieber YouTube work I did a while back. You see comments on the quality of the cover, on the emotion of the song.

The discussion is an expression of emotion. The producers show tenderness, facial expressions, surrounds, music elements. And you see social construction here…

And we can link this to something like FRBR… U2 as authoritative version, and FRBR relationships… Is there a way we can show the relationship between Songs of Innocence by William Blake, Songs of Innocence as an album, cover versions, etc.
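One very rough way to picture those FRBR-style relationships as data – an authoritative version with derivative works linked to it. The predicate names here are illustrative only, not a real FRBR vocabulary binding:

```python
# A tiny triple store of typed relationships between versions of a song.
triples = set()

def relate(subject, predicate, obj):
    triples.add((subject, predicate, obj))

relate("Song for Someone (U2 recording)", "realisationOf", "Song for Someone (work)")
relate("Cover by Dimas Fletcher", "derivativeOf", "Song for Someone (U2 recording)")
relate("Karaoke piano tutorial", "derivativeOf", "Song for Someone (U2 recording)")

def derivatives_of(item):
    """All works recorded as derived from `item`."""
    return sorted(s for s, p, o in triples if p == "derivativeOf" and o == item)

print(derivatives_of("Song for Someone (U2 recording)"))
```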

As we move forward there is so much more we need to do when we design systems for description that accommodate more than just keywords/bibliographic records. There is no full text inherent in a video or other non-textual document – an indexing problem. And we need to account for not only emotion, but also socially constructed and individually experienced emotional responses to items. Ultimate goal – help people to find things in meaningful ways to even potentially be useful in therapies (Hanser 2010).


Q1) Comment more than a question… I work with film materials in the archive, and we struggle to bring that alive, but you do have some response from the cataloguer and their reactions – and reactions at the access centre – and that could be part of the record.

A1) That’s part of archives – do we need it in every case… Some of the stuff I study gets taken down… Do we need to archive (some of) them?

Q1) Also a danger that you lose content because catalogue records are not exciting enough… Often stuff has to go on YouTube to get seen and accessed – but then you lose that additional metadata…

A1) We do need to go where our audience is… Maybe we do need to be on YouTube more… And maybe we can use Linked Data to make things more findable. Catalogue records rarely come up high enough in search results…

Q2) This is a really subjective way to mark something up… So, for instance, Songs of Innocence was imposed on my iPhone and I respond quite negatively to that… How do you catalogue emotion with that much subjectivity at play?

A2) This is where we have happy songs versus individual perspectives… The Beatles’ Here Comes the Sun is mostly seen as happy… But if someone broke up with you during it…  How do we build algorithms that tune into those different opinions?

Q3) How do producers choose to tag things – the lyrics, the tune, their reaction… But you kind of answered that… I mean people have Every Breath You Take by the Police as their first song at a wedding but it’s about a jilted lover stalking his ex…

A3) We need to think about how we provide access, and how we can move forward with this… My first job was in a record store and people would come in and ask “can I buy this record that was on the radio at about 3pm” and that was all they could offer… We need those facets, those emotions…

Q4) I had the experience of seeing quite a neutral painting but then with more context that painting meant something else entirely… So how do we account for that, that issue of context and understanding of the same songs in different ways…

A4) There isn’t one good solution to that but part of the web 2.0 approach is about giving space for the collective and the individual perspective.

Q5) How about musical language?

A5) Yeah… I took an elective on music librarianship. My tutor there showed me the tetrachords in Dido and Aeneas as a good example of an opera that people respond to in very particular ways. There are musical styles that map to particular emotions.

Our 5Rights: digital rights of children and young people / Dev Kornish, Dan Dickson, Bethany Wilson (5Rights Youth Commission)

We are from Young Scot and the 5Rights Youth Commission.

1 in 5 young people have missed food or sleep because of the internet.

How many unemployed young people struggle with entering work due to the lack of digital skills? It’s 1 in 10 who struggle with CVs, online applications, and jobs requiring digital skills.

How young do people start building their digital footprint? Before birth – an EU study found that 80% of mothers had shared images, including scans, of their children.

Bethany: We are passionate about our rights and how our rights can be maintained in a digital world. When it comes to protecting young people online it can be scary… But that doesn’t mean we shouldn’t use the internet or technology when it is used critically. The 5Rights campaign aims to ensure we have that understanding.

Dan: The UNCRC outlines rights and these are: the right to remove; the right to know – who has your data and what they are doing with it; the right to safety and support; the right to informed and conscious use – we should be able to opt out or remove ourselves if we want to; right to digital literacy – to use and to create.

Bethany: Under the right to remove, we do sometimes post things we shouldn’t but we should be able to remove things if we want to. In terms of the right to know – we don’t read the terms and conditions but we have the right to be informed, we need support. The right to safety and support requires respect – dismissing our online life can make us not want to talk about it openly with you. If you speak to us openly and individually then we will appreciate your support but restrictions cannot be too restrictive. Technology is designed to be addictive and that’s a reality we need to engage with. Technology is a part of most aspects of our lives, teaching and curriculum should reflect that. It’s not just about coding, it’s about finding information, and to understand what is reliable, what sources we can trust. And finally you need to listen to us, to our needs, to be able to support us.

And a question for us: What challenges have you encountered when supporting young people online? [a good question]

And a second question: What can you do in your work to realise young people’s rights in the digital world?

Q1) What digital literacy is being taught in schools right now?

A1) It varies from school to school, and depends on the education authority. Education Scotland have had it as a priority, but only over the last year… It depends…

Q2) My kid’s 5 and she has library cards…

Comment) The perception is that kids are experts by default

A2 – Dan) That’s not the case but there is that perception of “digital natives” knowing everything. And that isn’t the case…

Dan: Do you want to share what you’ve been discussing?

Comment: It’s not just an age thing… Some love technology, some hate it… But it’s hard to be totally safe online… How do you protect people from that…

Dan: It is incredibly difficult, especially in education.

Comment [me]: There is a real challenge when the internet is filtered and restricted – it is hard to teach real world information literacy and digital literacy when you are doing that in an artificial school set up. That was something that came up in the Royal Society of Edinburgh Digital Participation Inquiry I was involved in a few years ago. I also wanted to add that we have a new MOOC on Digital Footprints that is particularly aimed at those leaving school/coming into university.

Bethany: We really want deletion, when we use our right to remove, to be proper deletion. We really want to know where our data is held. And we want everyone to have access to quality information online and offline. And we want the right to disengage when we want to. And we want digital literacy to be about more than just coding – also about what we do and can do online.

Dan: We invite you all to join our 5Rights Coalition to show your support and engagement with this work. We are now in the final stages of this work and will be publishing our report soon. We’ve spoken to Google, Facebook, Education Scotland, mental health organisations, etc. We hope our report will provide great guidance for implementing the 5Rights.

You can find out more and contact us: #5RightsYC


Q1) Has your organisation written any guidance for librarians in putting these rights into action?

A1) Not yet but that report should include some of that guidance.

Playing with metadata / Gavin Willshaw and Scott Renton (University of Edinburgh)

Gavin: Scott and I will be talking about our metadata games project which we’ve been working on for the last few years. My current focus is on PhD digitisation but I’m also involved in this work. I’ll give an overview, what we’ve learned… And then Scott will give more of an idea of the technical side of things.

A few years ago we had two full-time photographers working on high-quality digital images. Now there are three photographers, five scanning assistants, and several specialists all working in digitisation. And that means we have a lot more digital content. A few years ago we launched our one-stop shop into our digital collections. You can access the images at: We have around 30k images, and most are CC BY licensed at high resolution.

Looking at the individual images we tend to have really good information about the volume the image comes from, but prior to this project we had little information on what was actually in the image. That made them hard to find. We didn’t really have anyone to catalogue this. A lot of these images are as much as 10 years old – made for projects but not necessarily intended to go online. So, we decided to create this game to improve the description of our collections…

The game has a really retro theme – we didn’t want to spend too long on the design side of things, just keep it simple. And the game is open to everyone.

So, stage 1: tag. You harvest initial tags, it’s an open text box, there is no quality review, and there are points for tags entered. We do have some safety measures to avoid swear words or stop words.
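A minimal sketch of that stage-1 safety measure – accept a free-text tag only if it passes basic sanity checks and a blocklist. The blocklist and limits here are placeholders, not the project's actual rules:

```python
# Placeholder blocklist; in practice this would include a swear-word list too.
STOP_AND_SWEAR_WORDS = {"the", "and", "of", "a"}

def accept_tag(tag):
    """Keep a free-text tag only if it is non-empty, a sane length,
    and not on the blocklist."""
    tag = tag.strip().lower()
    return bool(tag) and len(tag) <= 50 and tag not in STOP_AND_SWEAR_WORDS

print([t for t in ["skeleton", "the", "  "] if accept_tag(t)])
```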

Stage 2: vote. You vote on the quality of others’ tags. It’s a closed system – good/bad/don’t know. That filters out any initial gobbledegook. You get points…

The tags are QAed and imported into our image management system. We make a distinction between formal metadata and crowdsourced tags. We show that on the record and include a link to the tool – so others can go and play.
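The stage-2 vote step feeding the QA import might be sketched like this – a tag is promoted once its votes clear a threshold. The threshold value is an assumption on my part, not the project's actual setting:

```python
def promote(tag_votes, threshold=3):
    """Stage-2 QA: promote a tag to the image management system once its
    net vote (good minus bad) clears the threshold."""
    return [t["tag"] for t in tag_votes if t["good"] - t["bad"] >= threshold]

votes = [
    {"tag": "horse", "good": 5, "bad": 0},
    {"tag": "zzzz", "good": 0, "bad": 4},
]
print(promote(votes))  # only well-voted tags are pushed back
```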

We don’t see crowdsourcing as being just about free labour, but about communities of people with an interest and knowledge. We see it as a way to engage and connect with people beyond the usual groups – members of the public, educators, anyone really. People playing the game range from age 7 to 70, and we are interested in having the widest audience possible. And obviously the more people use the system, the more tags and participation we get. We also get feedback for improvements – some features in the game came from feedback. In theory it frees up staff time, but it takes time to run. But it lets us reach languages, collections, and special knowledge that may not be in our team.

To engage our communities we took the games on tour across our sites. We’ve also brought the activity into other events – Innovative Learning Week/Festival of Creative Learning; Ada Lovelace Day; exhibitions – e.g. the Where’s Dolly game that coincided with the Towards Dolly exhibition. Those events are vital to get interest – it doesn’t work to expect people to just find it themselves.

In terms of motivation people like to do something good, some like to share their skills, and some just enjoy it because it is fun and a wee bit competitive. We’ve had a few (small) prizes… We also display real time high scores at events which gets people in competitive mode.

This also fits into an emerging culture of play in Library and Information Services… Looking at play in learning – it being ok to try things whether or not they succeed. These have included Board Game Jam sessions using images from the collections, learning about copyright and IP in a fun context. Ada Lovelace day I’ve mentioned – designing your own Raspberry Pi case out of LEGO, Making music… And also Wikipedia Editathons – also fun events.

There is also an organisation called Tiltfactor who have their own metadata games looking at tagging and gaming. They have Zen Tag – like ours. But also Nextag for video and audio. And also Guess What!, a multiplayer game of description. We put about 2000 images into the metadata games platform Tiltfactor run and got huge numbers of tags quickly. They are at quite a different scale.

We’ve also experimented with Lady Grange’s correspondence in the Zooniverse platform, where you have to underline or indicate names and titles etc.

We’ve also put some of our images into Crowdcrafting to see if we can learn more about the content of images.

There are Pros and Cons here…

Pros:

  • Hosted service
  • Easy to create an account
  • Easy to set up and play
  • Range of options – not just tagging
  • Easy to load in images from Dropbox/Flickr

Cons:

  • Some limitations on what you can do
  • Technical expertise needed for best value – especially in platforms like Crowdcrafting.

What we’ve learned so far is that it is difficult to create an engaging platform, but combining it with events and activities – with a target theme and collections – works well. Incentives and prizes help. Considerable staff time is needed. And crowdsourced tags are a complement rather than an alternative to the official record.

Scott: So I’ll give the more technical side of what we’ve done. Why we needed them, how we built them, how we got on, and what we’ve learned.

I’ve been hacking away at workflows for a good 7 years. We have a reader who sees something they want, and they request a photograph of the page. They don’t provide much information – just what is needed. These make for skeleton records – and we now have about 30k of them. It also used to be the case that it was easier to buy in a high-end piece of kit for a project than a low-level cataloguer… That means we end up with data being copied and pasted in by photographers rather than good records.

We have all these skeletons… But we need some meat on our bones… If we take an image from the Incunabula we want to know that there’s a skeleton on a horse with a scythe. Now the image platform we have does let us annotate an image – but it’s hidden away and hard to use. We needed something better and easier. That’s where we came up with an initial front end. When I came in it was a module for us to use. It was Gavin who said “hey, this should be a game”. So the nostalgic computer games thing is weirdly appealing (like the Google Maps Pacman April Fool!). So it’s super simple, you put in a few words…

And it is truly lo-fi. It’s LAMP (Linux, Apache, MySQL, PHP) – not cool! Front end design retrofit. Authentication added to let students and staff log in. In terms of design decisions we have a moderation module, we have a voting module, we have a scoreboard, we have stars for high contributors. And now more complex games: a set number of items, a clock, featured items, and Easter Eggs within the game. For instance in the Dolly the Sheep game we hid a few images with hideous Comic Sans that you could stumble upon if you tagged enough images!

We do have moderation, a voting module, thresholds, demarcation… Tiltfactor told us we’re the only library putting data from the crowd back into our system – people are really nervous about this, but we demarcate it really carefully.

We now have a codebase we can clone. We skin it up differently for particular events or exhibitions – like Dolly – but it’s all the same idea with different design and collections. This all connects up through (authenticated) APIs back into the image management system (Luna).

So, how have we gotten on?

  • 283 users
  • 34070 tags in system
  • 15616 tags from our game
  • 18454 tags from Tiltfactor metadata games pushed in
  • 6212 tags pushed back into our system – that’s because of a backlog in the moderation (upvotes may be good enough).

So, what next? Well we have MSc projects coming up. We are having a revamp with an intern signed up for the summer – responsiveness, links to social media, more gamification, more incentives, authentication for non UoE users, etc.

And also we are excited about IIIF – about beautification of websites with embedded viewers, streamlining (thumbnails through URL; photoshopping through URL etc) and annotations. You can do deep zoom into images without having to link out to do that with an image.
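Those “thumbnails through URL” and deep-zoom features come from the IIIF Image API's URL syntax, which encodes region, size, rotation and quality in the path itself. A small sketch (the server base and identifier are placeholders):

```python
def iiif_url(base, identifier, region="full", size="full",
             rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API URL: region/size/rotation/quality all
    live in the URL path, so derivatives need no separate pipeline."""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# A 200px-wide thumbnail, straight from the URL:
thumb = iiif_url("https://images.example.org/iiif", "0012345", size="200,")
print(thumb)
```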

We also have the Polyglot Project – coming soon – which is a palaeography project for manuscripts in our collections of any age, in any language. We asked an intern to find a transcription and translation module using IIIF. She’s come up with something fantastic… Ways to draw around text, for users to add annotations, to discuss annotations, etc. She’s included 50-60 keyboards, so almost all languages are supported. Not sure how to bring it back into core systems, but we’re really excited about this.

That’s basically where we’ve gotten to. And if you want to try the games, come and have a play.


Q1) That example you showed for IIIF tagging has words written in widely varied spellings… You wouldn’t key it in as written in the document.

A1 – Scott) We do have a project looking at this. We have a girl looking at dictionaries to find variants and different spellings.

A1 – Gavin) There are projects like Transcribe Bentham who will have faced that issue…

Comment – Paul C) It’s a common issue… Methods like fuzzy searching help with that…
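As a rough illustration of the fuzzy-searching idea mentioned here, using only the standard library (the word list is invented; real systems typically use edit-distance search in the index itself):

```python
from difflib import get_close_matches

# Invented list of known canonical forms.
known_forms = ["edinburgh", "edinbrugh", "dumfries", "portpatrick"]

def normalise(word, cutoff=0.8):
    """Map a variant spelling to its closest known form, if any is
    similar enough; otherwise return the word unchanged."""
    matches = get_close_matches(word.lower(), known_forms, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(normalise("Edenburgh"))
```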

Q2) I’m quite interested about how you identify parts of images, and how you feed that back to the catalogue?

A2 – Scott) Right now I think the scope of the project is… Well it will be interesting to see how best to feed into catalogue records. Still to be addressed.

Q3 – Paul C) You built this in-house… How open is it? Can others use it?

A3 – Gavin) It is using Luna image management system…

A3 – Scott) It’s based on Luna for derivatives and data. It’s on Github and it is open. The website is open to everyone. You login through EASE – you can join as an “EASE Friend” if you aren’t part of the University. Others can use the code if they want it…

And finally it was me up to present…

Managing your Digital Footprint : Taking control of the metadata and tracks and traces that define us online / Nicola Osborne (EDINA)

Obviously I didn’t take notes on my session, but you can explore the slides below:

Look out for a new blogpost very soon on some of the background to our new Digital Footprint MOOC, which launched on Monday 3rd April. You can join the course now, or sign up to join the next run of the course next month, here:

And with that the event drew to a close, with thank yous to all of the organisers, speakers, and attendees!


 April 5, 2017  Posted by at 11:08 am Events Attended, LiveBlogs, Presentation and Performance