Jun 30 2017
 

Today I’m at ReCon 2017, giving a presentation later today (flying the flag for the unconference sessions!) but also looking forward to a day full of interesting presentations on publishing for early career researchers.

I’ll be liveblogging (except for my session) and, as usual, comments, additions, corrections, etc. are welcomed. 

Jo Young, Director of the Scientific Editing Company, is introducing the day and thanking the various ReCon sponsors. She notes: ReCon started about five years ago (with a slightly different name). We’ve had really successful events – and you can explore them all online. We have had a really stellar list of speakers over the years! And on that note…

Graham Steel: We wanted to cover publishing at all stages, from preparing for publication, submission, journals, open journals, metrics, alt metrics, etc. So our first speakers are really from the mid point in that process.

SESSION ONE: Publishing’s future: Disruption and Evolution within the Industry

100% Open Access by 2020 or disrupting the present scholarly comms landscape: you can’t have both? A mid-way update – Pablo De Castro, Open Access Advocacy Librarian, University of Strathclyde

It is an honour to be at this well attended event today. Thank you for the invitation. It’s a long title but I will be talking about how things are progressing towards this goal of full open access by 2020, and to what extent institutions, funders, etc. are able to introduce disruption into the industry…

So, a quick introduction to me. I am currently at the University of Strathclyde library, having joined in January. It’s quite an old university (founded 1796) and a medium size university. Before that I was working in The Hague on the EC FP7 Post-Grant Open Access Pilot (OpenAIRE), providing funding to cover OA publishing fees for publications arising from completed FP7 projects. Maybe not the most popular topic in the UK right now but… The main point of explaining my context is that this EU work was more of a funder’s perspective, and now I’m able to compare that to more of an institutional perspective. As a result of this pilot there was a report commissioned by a British consultant: “Towards a competitive and sustainable open access publishing market in Europe”.

One key element in this open access EU pilot was the OA policy guidelines which acted as key drivers, and made eligibility criteria very clear. Notable here: publications to hybrid journals would not be funded, only fully open access; and a cap of no more than €2000 for research articles, €6000 for monographs. That was an attempt to shape the costs and ensure accessibility of research publications.
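The eligibility criteria above can be sketched as a small rule. This is only an illustration: the function name and its parameters are hypothetical, and it assumes that a fee above the cap would be covered up to the cap rather than refused outright, which the talk did not specify.

```python
# Caps quoted in the talk, in euros.
CAPS_EUR = {"article": 2000, "monograph": 6000}


def fundable_amount(fee_eur, output_type, journal_is_hybrid):
    """Return the amount covered under the (assumed) reading of the rules.

    Assumption: fees above the cap are covered up to the cap.
    """
    if journal_is_hybrid:
        return 0  # publications in hybrid journals were not funded
    return min(fee_eur, CAPS_EUR[output_type])


print(fundable_amount(1500, "article", journal_is_hybrid=False))
print(fundable_amount(2600, "article", journal_is_hybrid=False))
print(fundable_amount(1500, "article", journal_is_hybrid=True))
```

Under this reading, the cap limits what the funder pays in a fully open access journal, while the hybrid rule excludes a publication whatever its fee.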

So, now I’m back at the institutional open access coalface. Lots has changed in two years. And it’s great to be back in this space. It is allowing me to explore ways to better align institutional and funder positions on open access.

So, why open access? Well in part this is about more exposure for your work, higher citation rates, compliance with grant rules. But also it’s about use and reuse, including by researchers in developing countries, practitioners who can apply your work, policy makers, and the public and tax payers, who can access your work. In terms of the wider open access picture in Europe, there was a meeting in Brussels last May where European leaders called for immediate open access to all scientific papers by 2020. It’s not easy to achieve that but it does provide a major driver… However, across these countries we have EU member states with different levels of open access. The UK, Netherlands, Sweden and others prefer “gold” access, whilst Belgium, Cyprus, Denmark, Greece, etc. prefer “green” access, partly because the cost of gold open access is prohibitive.

Funders’ policies are a really significant driver towards open access. Funders include Arthritis Research UK, Bloodwise, Cancer Research UK, Breast Cancer Now, British Heart Foundation, Parkinson’s UK, Wellcome Trust, Research Councils UK, HEFCE, European Commission, etc. Most support green and gold, and will pay APCs (Article Processing Charges), but it’s fair to say that early career researchers are not always at the front of the queue for getting those paid. HEFCE in particular have a green open access policy, requiring research outputs from any part of the university to be made open access; otherwise they will not be eligible for the REF (Research Excellence Framework). As a result, compliance levels are high – probably top of Europe at the moment. The European Commission supports green and gold open access, but typically green as this is more affordable.

So, there is a need for quick progress at the same time as ongoing pressure on library budgets – we pay both for subscriptions and for APCs. Offsetting agreements, which discount subscriptions by the APC charges paid, are one way to do this and could be a good solution. There are pros and cons here. In principle it will allow quicker progress towards OA goals, but it will disproportionately benefit legacy publishers. It brings publishers into APC reporting – right now APCs are sometimes invisible to the library as they are paid by researchers, so this is a shift and a challenge. It’s supposed to be a temporary stage towards full open access. And it’s a very expensive intermediate stage: not every country can or will afford it.

So how can disruption happen? Well one way to deal with this would be the policies – suggesting not to fund hybrid journals (as done in OpenAIRE). And disruption is happening (legal or otherwise), as we can see in Sci-Hub usage, which comes from all around the world, not just developing countries. Legal routes are possible in licensing negotiations. In Germany there is a Projekt DEAL being negotiated. And this follows similar negotiations by openaccess.nl. At the moment Elsevier is the only publisher not willing to include open access journals.

In terms of tools… The EU has just announced plans to launch its own platform for funded research to be published. And Wellcome Trust already has a space like this.

So, some conclusions… Open access is unstoppable now, but still needs to generate sustainable and competitive implementation mechanisms. But it is getting more complex and difficult to disseminate to researchers – that’s a serious risk. Open Access will happen via a combination of strategies and routes – internal fights just aren’t useful (e.g. green vs gold). The temporary stage towards full open access needs to benefit library budgets sooner rather than later. And the power here really lies with researchers, whom OA advocates aren’t always able to keep informed. It is important that you know which are open and which are hybrid journals, and why that matters. And we need to ask whether informing authors on where it would make economic sense to publish is beyond the remit of institutional libraries.

To finish, some recommended reading:

  • “Early Career Researchers: the Harbingers of Change” – Final report from Ciber, August 2016
  • “My Top 9 Reasons to Publish Open Access” – a great set of slides.

Q&A

Q1) It was interesting to hear about offsetting. Are those agreements one-off? continuous? renewed?

A1) At the moment they are one-off and intended to be a temporary measure. But they will probably mostly get renewed… National governments and consortia want to understand how useful they are, how they work.

Q2) Can you explain green open access and gold open access and the difference?

A2) In Gold Open Access, you as the author pay to make your paper open on the journal website. If that’s a hybrid – so subscription – journal you essentially pay twice, once to subscribe, once to make open. Green Open Access means that your article goes into your repository (after any embargo), into the world wide repository landscape (see: https://www.jisc.ac.uk/guides/an-introduction-to-open-access).

Q3) I agree that choices of where to publish are for researchers, but there are other factors. The REF pressures you to publish in particular ways. Where can you find more on the relationships between different types of open access and impact? I think that can help?

A3) Quite a number of studies. For instance is APC related to Impact factor – several studies there. In terms of REF, funders like Wellcome are desperate to move away from the impact factor. It is hard but evolving.

Inputs, Outputs and emergent properties: The new Scientometrics – Phill Jones, Director of Publishing Innovation, Digital Science

Scientometrics is essentially the study of science metrics and evaluation of these. As Graham mentioned in his introduction, there is a whole complicated lifecycle and process of publishing. And what I will talk about spans that whole process.

But, to start, a bit about me and Digital Science. We were founded in 2011 and we are wholly owned by Holtzbrinck Publishing Group, who owned Nature group. Being privately funded we are able to invest in innovation by researchers, for researchers, trying to create change from the ground up. Things like labguru – a lab notebook (like rspace); Altmetric; Figshare; readcube; Peerwith; transcriptic – IoT company, etc.

So, I’m going to introduce a concept: The Evaluation Gap. This is the difference between the metrics and indicators currently or traditionally available, and the information that those evaluating your research might actually want to know. Funders might. Tenure panels – hiring and promotion panels. Universities – your institution, your office of research management. Government, funders, policy organisations, all want to achieve something with your research…

So, how do we close the evaluation gap? Introducing altmetrics. It adds to academic impact with other types of societal impact – policy documents, grey literature, mentions in blogs, peer review mentions, social media, etc. What else can you look at? Well you can look at grants being awarded… You see a grant awarded for a new idea, then the researcher publishes… someone else picks it up and publishes… That can take a long time, so grants can tell us before publications. You can also look at patents – a measure of commercialisation and potential economic impact further down the line.

So you see an idea germinate in one place, work with collaborators at the institution, spreading out to researchers at other institutions, and gradually out into the big wide world… As that idea travels outward it gathers more metadata, more impact, more associated materials, ideas, etc.

And at Digital Science we have innovators working across that landscape, along that scholarly lifecycle… But there is no point having that much data if you can’t understand and analyse it. You have to classify that data first to do that… Historically that was done by subject area, but increasingly research is interdisciplinary, it crosses different fields. So single tags/subjects are not useful, you need a proper taxonomy to apply here. And there are various ways to do that. You need keywords and semantic modeling and you can choose to:

  1. Use an existing one if available, e.g. MeSH (Medical Subject Headings).
  2. Consult with subject matter experts (the traditional way to do this, could be editors, researchers, faculty, librarians who you’d just ask “what are the keywords that describe computational social science”).
  3. Text mine abstracts or full text articles (using the content to create a list from your corpus with bag of words/frequency of words approaches, for instance, to help you cluster and find the ideas, with a taxonomy emerging).
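To make the third option more concrete, here is a minimal bag of words sketch. It is not Digital Science’s method: the stop-word list, the function names and the three sample abstracts are all made up for illustration. It counts how many abstracts each term appears in and keeps the most widely shared terms as candidate keywords.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "on", "with"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def candidate_keywords(abstracts, top_n=5):
    """Count the number of abstracts each term occurs in (document frequency)
    and return the top_n most widely shared terms with their counts."""
    doc_freq = Counter()
    for abstract in abstracts:
        doc_freq.update(set(tokenize(abstract)))  # count each term once per abstract
    return doc_freq.most_common(top_n)


# Made-up abstracts, purely for illustration.
abstracts = [
    "Flood risk modelling for river catchments",
    "River management policy and flood defence",
    "Social network analysis of policy debates",
]
keywords = candidate_keywords(abstracts, top_n=3)
print(keywords)
```

Terms shared across abstracts from different disciplines (here “flood”, “river” and “policy”) are the kind of signal that lets clusters, and eventually a taxonomy, emerge from the corpus rather than from predefined subject areas.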

Now, we are starting to take that text mining approach. But to be of use that data needs to be cleaned and curated. So we hand curated a list of institutions to go into GRID: Global Research Identifier Database, to understand organisations and their relationships. Once you have that all mapped you can look at ISNI, CrossRef databases etc. And when you have that organisational information you can include georeferences to visualise where organisations are…

An example that we built for HEFCE was the Digital Science BrainScan. The UK has a dual funding model where there is both direct funding and block funding, with the latter awarded by HEFCE and distributed according to the most impactful research as understood by the REF. So, for our BrainScan, we mapped research areas, connectors, etc. to visualise subject areas, their impact, and clusters of strong collaboration, to see where there are good opportunities for funding…

Similarly we visualised text mined impact statements across the whole corpus. Each impact is captured as a coloured dot. Clusters show similarity… Where things are far apart, there is less similarity. And that can highlight where there is a lot of work on, for instance, management of rivers and waterways… And these weren’t obvious as across disciplines…

Q&A

Q1) Who do you think benefits the most from this kind of information?

A1) In the consultancy we have clients across the spectrum. In the past we have mainly worked for funders and policy makers to track effectiveness. Increasingly we are talking to institutions wanting to understand strengths, to predict trends… And by publishers wanting to understand if journals should be split, consolidated, are there opportunities we are missing… Each can benefit enormously. And it makes the whole system more efficient.

Against capital – Stuart Lawson, Birkbeck University of London

So, my talk will be a bit different. The arguments I will be making are not in opposition to any of the other speakers here, but are about critically addressing the current ways we are working, and how publishing works. I have chosen to speak on this topic today as I think it is important to make visible the political positions that underlie our assumptions and the systems we have in place today. There are calls to become more efficient but I disagree… Ownership and governance matter at least as much as the outcome.

I am an advocate for open access and I am currently undertaking a PhD looking at open access and how our discourse around this has been coopted by neoliberal capitalism. And I believe these issues aren’t technical but social and reflect inequalities in our society, and any company claiming to benefit society but operating as commercial companies should raise questions for us.

Neoliberalism is a political project to reshape all social relations to conform to the logic of capital (this is the only slide, apparently a written and referenced copy will be posted on Stuart’s blog). This system turns us all into capital, entrepreneurs of our selves – quantification, metricification, whether through tuition fees that put a price on education and turn students into consumers selecting based on rational indicators of future income; or through pitting universities against each other rather than having them work collaboratively. It isn’t just overtly commercial, but about applying ideas of the market in all elements of our work – high impact factor journals, metrics, etc. in the service of proving our worth. If we do need metrics, they should be open and nuanced, but if we only do metrics for people’s own careers and perform for careers and promotion, then these play into neoliberal ideas of control. I fully understand the pressure to live and do research without engaging and playing the game. It is easier to choose not to do this if you are in a position of privilege, and that reflects and maintains inequalities in our organisations.

Since power relations are often about labour and worth, this is inevitably part of work, and the value of labour. When we hear about disruption in the context of Uber, it is about disrupting the rights of workers and labour unions; it ignores the needs of the people who do the work; it is a neoliberal idea. I would recommend seeing Audrey Watters’ recent presentation for University of Edinburgh on the “Uberisation of Education”.

The power of capital in scholarly publishing, and neoliberal values in our scholarly processes… When disruptors align with the political forces that need to be dismantled, I don’t see that as useful or properly disruptive. Open Access is a good thing in terms of access. But there are two main strands of policy… Research Councils have given over £80m to researchers to pay APCs. Publishing open access does not require payment of fees; there are OA journals that are funded in other ways. But if you want the high end visible journals they are often hybrid journals, and 80% of that RCUK money has been spent on hybrid journals. So work is being made open access, but right now this money flows from public funds to a small group of publishers – who take a 30-40% profit – and that system was set up to continue benefitting publishers. You can share or publish to repositories… Those are free to deposit and use. The concern with OA policy is the connection to the REF: it constrains where you can publish and what publications mean, and they must always be measured in this restricted structure. It can be seen as compliance rather than a progressive movement toward social justice. But open access is having a really positive impact on the accessibility of research.

If you are angry at Elsevier, then you should also be angry at Oxford University and Cambridge University, and others for their relationships to the power elite. Harvard made a loud statement about journal pricing… It sounded good, and they have a progressive open access policy… But it is also bullshit – they have huge amounts of money… There are huge inequalities here in academia and in relationship to publishing.

And I would recommend strongly reading some history on the inequalities, and the racism and capitalism that was inherent to the founding of higher education so that we can critically reflect on what type of system we really want to discover and share scholarly work. Things have evolved over time – somewhat inevitably – but we need to be more deliberative so that universities are more accountable in their work.

To end on a more positive note, technology is enabling all sorts of new and inexpensive ways to publish and share. But we don’t need to depend on venture capital. Collective and cooperative running of organisations in these spaces – such as the cooperative centres for research… There are small scale examples that show the principles, and that this can work. Writing, reviewing and editing is already being done by the academic community; let’s build governance and process models to continue that, to make it work, to ensure work is rewarded but that the driver isn’t commercial.

Q&A

Comment) That was awesome. A lot of us here will be here to learn how to play the game. But the game sucks. I am a professor, I get to do a lot of fun things now, because I played the game… We need a way to have people able to do their work that way without that game. But we need something more specific than socialism… Libraries used to publish academic data… Lots of these metrics are there and useful… And I work with them… But I am conscious that we will be fucked by them. We need a way to react to that.

Redesigning Science for the Internet Generation – Gemma Milne, Co-Founder, Science Disrupt

Science Disrupt run regular podcasts, events, a Slack channel for scientists, start ups, VCs, etc. Check out our website. We talk about five focus areas of science. Today I wanted to talk about redesigning science for the internet age. My day job is in journalism and I think a lot about start ups, about how we can influence academia, and about how success manifests itself in the internet age.

So, what am I talking about? Things like Pavegen – power generating paving stones. They are all over the news! The press love them! BUT the science does not work, the physics does not work…

I don’t know if you heard about Theranos, which promised all sorts of medical testing from one drop of blood and attracted millions in investment, and it all fell apart. But it too had tons of coverage…

I really like science start ups, I like talking about science in a different way… But how can I convince the press, the wider audience what is good stuff, and what is just hype, not real… One of the problems we face is that if you are not engaged in research you either can’t access the science, or can’t read it even if you can access it… This problem is really big and it influences where money goes and what sort of stuff gets done!

So, how can we change this? There are amazing tools to help (Authorea, Overleaf, protocols.io, figshare, publons, labworm) and this is great and exciting. But I feel it is very short term… Trying to change something that doesn’t work anyway… Doing collaborative lab notes a bit better, publishing a bit faster… OK… But is it good for sharing science? Thinking about journalists and corporates, they don’t care about academic publishing, it’s not where they go for scientific information. How do we rethink that… What if we were to rethink how we share science?

AirBnB and Amazon are on my slide here to make the point of the difference between incremental change and real change. AirBnB addressed issues with hotels, issues of hotels being samey… They didn’t build a hotel, instead they thought about what people want when they travel, what matters for them… Similarly Amazon didn’t try to incrementally improve supermarkets… They did something different. They dug to the bottom of why something exists and rethought it…

Imagine science was “invented” today (ignore all the realities of why that’s impossible). But imagine we think of this thing, we have to design it… How do we start? How will I ask questions, find others who ask questions…

So, a bit of a thought experiment here… Maybe I’d post a question on reddit, set up my own sub-reddit. I’d ask questions, ask why they are interested… Create a big thread. And if I have a lot of people, maybe I’ll have a Slack with various channels about all the facets around a question, invite people in… Use the group to project manage this project… OK, I have a team… Maybe I create a Meet Up Group for that same question… Get people to join… Maybe 200 people are now gathered and interested… You gather all these folk into one place. Now we want to analyse ideas. Maybe I share my question and initial code on GitHub, find collaborators… And share the code, make it open… Maybe it can be reused… It has been collaborative at every stage of the journey… Then maybe I want to build a microscope or something… I’d find the right people, I’d ask them to join my Autodesk 360 to collaboratively build engineering drawings for fabrication… So maybe we’ve answered our initial question… So maybe I blog that, and then I tweet that…

The point I’m trying to make is, there are so many tools out there for collaboration, for sharing… Why aren’t more researchers using these tools that are already there? Rather than designing new tools… These are all ways to engage and share what you do, rather than just publishing those articles in those journals…

So, maybe publishing isn’t the way at all? I get the “game” but I am frustrated about how we properly engage, and really get your work out there. Getting industry to understand what is going on. There are lots of people inventing in new ways… You can use stuff in papers that isn’t being picked up… But see what else you can do!

So, what now? I know people are starved for time… But if you want to really make that impact, that you think is more interesting… I understand there is a concern around scooping… But there are ways to do that… And if you want to know about all these tools, do come talk to me!

Q&A

Q1) I think you are spot on with vision. We want faster more collaborative production. But what is missing from those tools is that they are not designed for researchers, they are not designed for publishing. Those systems are ephemeral… They don’t have DOIs and they aren’t persistent. For me it’s a bench to web pipeline…

A1) Then why not create a persistent archived URI – a webpage where all of a project’s content is shared. 50% of all academic papers are only read by the person that published them… These stumbling blocks in the way of sharing… It is crazy… We shouldn’t just stop and not share.

Q2) Thank you, that has given me a lot of food for thought. The issue of work not being read, I’ve been told that by funders so very relevant to me. So, how do we influence the professors… As a PhD student I haven’t heard about many of those online things…

A2) My co-founder of Science Disrupt is a computational biologist and PhD student… My response would be about not asking, just doing… Find networks, find people doing what you want. Benefit from collaboration. Sign an NDA if needed. Find the opportunity, then come back…

Q3) I had a comment and a question. Code repositories like GitHub are persistent and you can find a great list of code repositories and meta-articles around those on the Journal of Open Research Software. My question was about AirBnB and Amazon… Those have made huge changes but I think the narrative they use now is different from where they started – and they started more as incremental change… And they stumbled on bigger things, which looks a lot like research… So… How do you make that case for the potential long term impact of your work in a really engaging way?

A3) It is the golden question. Need to find case studies, to find interesting examples… a way to showcase similar examples… and how that led to things… Forget big pictures, jump the hurdles… Show that bigger picture that’s there but reduce the friction of those hurdles. Sure those companies were somewhat incremental but I think there is genuinely a really different mindset there that matters.

And we now move to lunch. Coming up…

UNCONFERENCE SESSION 1: Best Footprint Forward – Nicola Osborne, EDINA

This will be me – talking about managing a digital footprint and how robust web links are part of that lasting digital legacy – so no post from me but you can view my slides on Managing Your Digital Footprint and our Reference Rot in Theses: A HiberActive Pilot here.

SESSION TWO: The Early Career Researcher Perspective: Publishing & Research Communication

Getting recognition for all your research outputs – Michael Markie, F1000

I’m going to talk about things you do as researchers that you should get credit for, not just traditional publications. This week in fact there was a very interesting article on the history of science publishing, “Is the staggeringly profitable business of scientific publishing bad for science?”. Publishers came out of that poorly… And I think others are at fault here too, including the research community… But we do have to take some blame.

There’s no getting away from the fact that the journal is the coin of the realm, for career progression, institutional reporting, grant applications. For the REF, will there be impact factors? REF says maybe not, but institutions will be tempted to use that to prioritise. Publishing is being looked at by impact factor…

And it’s not just where you publish. There are other things that you do in your work and which you should get more credit for. Data; software/code – in bioinformatics there are new software and tools that are part of the research, are they getting the recognition they should; all results – not just the successes but also the negative results… Publishers want cool and sexy stuff but realistically we are funded for this, we should be able to publish and be recognised for it; peer review – there is no credit for it, peer reviews often improve articles and warrant credit; expertise – all the authors who added expertise, including non-research staff, everyone should know who contributed what…

So I see research as being more than a journal article. Right now we just package it all up into one tidy thing, but we should be fitting into that bigger picture. So, I’m suggesting that we need to disrupt it a bit more and publish in a different way… Publishing introduces delays – of up to a year. Journals don’t really care about data… That’s a real issue for reproducibility. And there is bias involved in publishing, there is a real lack of transparency in publishing decisions. All of the above means there is real research waste. At the same time there is demand for results, for quicker action, for wider access to work.

So, at F1000 we have been working on ways to address these issues. We launched Wellcome Open Research, and after launching that the Bill & Melinda Gates Foundation contacted us to build a similar platform. And we have also built an open research model for UCL Child Health (at Great Ormond Street).

The process involves sending a paper in, and checking that there is no plagiarism and that ethics are appropriate. But no other filtering. That can take up to 7 days. Then we ask for your data – no data then no publication. Then once the publication and data deposition are made, the work is published and an open peer review and user commenting process begins. Reviewers are named and credited, and they contribute to improving the article and to the article revision. Those reviewers have three options: approved, approved with reservations, or not approved as it stands. So to get to PMC and be indexed in PubMed you need two “approved” statuses, or two “approved with reservations” and an “approved”.
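The indexing threshold at the end of that process can be written as a small check. This is a sketch only: it reads the rule as “two approved, or one approved plus two approved with reservations”, the function and constant names are hypothetical, and how “not approved” reports interact with the rule was not covered in the talk, so they are simply not counted here.

```python
from collections import Counter

APPROVED = "approved"
RESERVATIONS = "approved with reservations"
NOT_APPROVED = "not approved"


def meets_indexing_threshold(reports):
    """Return True if the reviewer statuses satisfy the rule as read above.

    Assumption: "not approved" reports neither help nor block.
    """
    counts = Counter(reports)
    two_approved = counts[APPROVED] >= 2
    one_approved_two_reservations = (
        counts[APPROVED] >= 1 and counts[RESERVATIONS] >= 2
    )
    return two_approved or one_approved_two_reservations


print(meets_indexing_threshold([APPROVED, APPROVED]))
print(meets_indexing_threshold([APPROVED, RESERVATIONS, RESERVATIONS]))
print(meets_indexing_threshold([APPROVED, RESERVATIONS]))
```

The point of the sketch is that the filter sits after publication: the article is already public, and the reviewer statuses only decide whether it goes on to PMC and PubMed.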

So this connects to lots of stuff… For Data that’s with DataCite, Figshare, Plotly, Resource Identification Initiative. For Software/code we work with Code Ocean, Zenodo, GitHub. For All results we work with PubMed, you can publish other formats… etc.

Why are funders doing this? Wellcome Trust spent £7m on APCs last year… So this platform is partly a service to stakeholders with a complementary capacity for all research findings. We are testing a new approach to improve science and its impact – to accelerate access and sharing of findings and data; efficiency to reduce waste and support reproducibility; alternative OA model, etc.

Make an impact, know your impact, show your impact – Anna Ritchie, Mendeley, Elsevier

A theme across the day is that there is increasing pressure and challenges for researchers. It’s never been easier to get your work out – new technology, media, platforms. And yet, it’s never been harder to get your work seen: more researchers, producing more outputs, dealing with competition. So how do you ensure you and your work make an impact? Options mean opportunities, but also choices. Traditional publishing is still important – but not enough. And there are both older and newer ways to help make your research stand out.

Publishing campus is a big thing here. These are free resources to support you in publishing. There are online lectures, interactive training courses, and expert advice. And things happen – live webinars, online lectures (e.g. Top 10 Tips for Writing a Really Terrible Journal Article!), interactive courses. There are suites of materials around publishing, around developing your profile.

At some point you will want to look at choosing a journal. Metrics may be part of what you use to choose a journal – but use both quantitative and qualitative (e.g. ask colleagues and experts). You can also use Elsevier Journal Finder – you can search for your title and abstract and subject areas to suggest journals to target. But always check the journal guidance before submitting.

There is also the opportunity for article enrichments which will be part of your research story – 2D radiological data viewer, R code Viewer, Virtual Microscope, Genome Viewer, Audioslides, etc.

There are also less traditional journals: Heliyon covers all disciplines, so you can report your original and technically sound results of primary research, regardless of perceived impact. MethodsX is entirely about methods work. Data in Brief allows you to describe your data to facilitate reproducibility, make it easier to cite, etc. And an alternative to a data article is to add datasets on Mendeley.

And you can also use Mendeley to understand your impact through Mendeley Stats. There is a very detailed dashboard for each publication – this is powered by Scopus so works for all articles indexed in Scopus. Stats like users, Mendeley users with that article in their library, citations, related works… And you can see how your article is being shared. You can also show your impact on Mendeley, with a research profile that is as comprehensive as possible –  not just your publications but with wider impacts, press mentions…. And enabling you to connect to other researchers, to other articles and opportunities. This is what we are trying to do to make Mendeley help you build your online profile as a researcher. We intend to grow those profiles to give a more comprehensive picture of you as a researcher.

And we want to hear from you. Every journal, platform, and product is co-developed with ongoing community input. So do get in touch!

How to share science with hard to reach groups and why you should bother – Becky Douglas

My background is physics, high energy gravitational waves, etc… As I was doing my PhD I got very involved in science engagement. Hopefully most of you think about science communication and public outreach as being a good thing. It does seem to be something that arises in job interviews and performance reviews. I’m not convinced that everyone should do this – not everyone enjoys or is good at it – but there is huge potential if you are enthusiastic. And there is more expectation on scientists to do this to gain recognition, to help bring trust back to scientists, and right some misunderstanding. And by the way talks and teaching don’t count here.

And not everyone goes to science festivals. It is up to us to provide alternative and interesting things for those people. There are a few people who won’t be interested in science… But there are many more people who don’t have time or don’t see the appeal to them. These people deserve access to new research… And there are many ways to communicate that research. New ideas are always worth doing, and can attract new people and get dialogue you’d never expect.

So, article writing is a great way to reach out… Not just in science magazines (or on personal blogs). Newspapers and magazines will often print science articles – reach out to them. And you can pitch other places too – Cosmo prints science. Mainstream publications are desperate for people who understand science to write about it in engaging ways – sometimes you’ll be paid for your work as well.

Schools are obvious, but they are great ways to access people from all backgrounds. You’ll do extra well if you can connect it to the current curriculum! Put the effort in to build a memorable activity or event. Send them home with something fun and you may well reach parents as well…

More unusual events would be things like theatre, for instance Lady Scientists Stitch and Bitch. Stitch and Bitch is an international thing where you get together and sew and craft and chat. So this show was a play which was about travelling back in time to gather all the key lady scientists, and they sit down to discuss science over some knitting and sewing. Because it was theatre it was an extremely diverse group, not people who usually go to science events. When you work with non scientists you get access to a whole new crowd.

Something a bit more unusual… Soapbox Science, I brought to Glasgow in 2015. It’s science busking where you talk about your cutting edge research. Often attached to science festivals but out in public, to draw a crowd from those shopping, or visiting museums, etc. It’s highly interactive. Most had not been to a science event before, they didn’t go out to see science, but they enjoyed it…

And finally, interact with local communities. WI have science events, Scouts and Guides, meet up groups… You can just contact and reach out to those groups. They have questions in their own effort. It allows you to speak to really interesting groups. But it does require lots of time. But I was based in Glasgow, now in Falkirk, and I’ve just done some of this with schools in the Gorbals where we knew that the kids rarely go on to science subjects…

So, this is really worth doing. Your work, if it is tax-payer funded, should be accessible to the public. Some people don’t think they have an interest in science – some are right but others just remember dusty chalkboards and bland text books. You have to show them it’s something more than that.

What helps or hinders science communication by early career researchers? – Lewis MacKenzie

I’m a postdoc at the University of Leeds. I’m a keen science communicator and I try to get out there as much as possible… I want to talk about what helps or hinders science communication by early career researchers.

So, who are early career researchers? Well undergraduates are a huge pool of early career researchers and scientists which tend to be untapped; also PhDs; also postdocs. There are some shared barriers here: travel costs, time… That is especially the case in inaccessible parts of Scotland. There is a real issue that science communication is work (or training). And not all supervisors have a positive attitude to science communication. As well as all the other barriers to careers in science of course.

Let’s start with science communication training. I’ve been through the system as an undergraduate, PhD student and postdoc. A lot of training is (rightly) targeted at PhD students, often around writing, conferences, elevator pitches, etc. But the issues/barriers for ECRs include… Pro-active sci comm is often not formally recognized as training/CPD/workload – especially at evenings and weekends. I also think undergraduate sci comm modules are minimal/non-existent. You get dedicated sci comm masters now, there is lots to explore. And there are relatively poor sci comm training opportunities for post docs. But across the board media skills training is pretty much limited – how do you make youtube videos, podcasts, web comics, writing in a magazine – and that’s where a lot of science communication takes place!

Sci Comm in Schools includes some great stuff. STEMNET is an excellent route for ECRs, industry, retirees, etc. to volunteer, with some basic training, background checks, and a contact hub linking schools and volunteers. However the school system (especially in England) and curricula are confusing. How do you do age-appropriate communication? And just getting to the schools can be tricky – most PhDs and Sci Comm people won’t have a car. It’s basic but important as a barrier.

Science Communication Competitions are quite widespread. They tend to be aimed at PhD students, incentives being experience, training and prizes. But there are issues/barriers for ECRs – often conventional “stand and talk” format; not usually collaborative – even though team work can be brilliant, the big famous science communicators work with a team to put their shows and work together; intense pressure of competitions can be off putting… Some alternative formats would help with that.

Conferences… Now there was a tweet earlier this week from @LizyLowe suggesting that every conference should have a public engagement strand – how good would that be?!

Research Grant “Impact Plans”: major funders now require “impact plans” revolving around science communication. That makes time and money for science communication which is great. But there are issues. The grant writer often designates activities before ECRs are recruited. These prescriptive impact plans aren’t very inspiring for ECRs. Money may be inefficiently spent on things like expensive web design. I think we need a more agile approach to include input from ECRs once recruited.

Finally I wanted to finish with Science Communication Fellowships. These are run by people like the Wellcome Trust (Engagement Fellowships) and the STFC. These are for the Olympic gold medallists of Sci Comm. But they are not great for ECRs. The dates are annual and inflexible – and the process is over 6 months – it is a slow decision making process. And they are intensively competitive so not very ECR friendly, which is a shame as many sci comm people are ECRs. So perhaps more institutions or agencies should offer sci comm fellowships? And a continuous application process with shorter spells?

To sum up… ECRs at different career stages require different training and organisational support to enable science communication. And science communication needs to be recognised as formal work/training/education – not an out of hours hobby! There are good initiatives out there but there could be many more.

PANEL DISCUSSION – Michael Markie, F1000 (MM); Anna Ritchie, Mendeley, Elsevier (AR); Becky Douglas (BD); Lewis MacKenzie (LM) – chaired by Joanna Young (JY)

Q1 (JY): Picking up on what you said about Pathways to Impact statements… What advice would you give to ECRs if they are completing one of these? What should they do?

A1 (LM): It’s quite a weird thing to do… Two strands… This research will make loads of money and commercialise it; and the science communication strand. It’s easier to say you’ll do a science festival event, harder to say you’ll do a press release… Can say you will blog your work once a month, or tweet a day in the lab… You can do that. In my fellowship application I proposed a podcast on biophysics that I’d like to do. You can be creative with your science communication… But there is a danger that people aren’t imaginative and make it a box-ticking thing. Just doing a science festival event and a webpage isn’t that exciting. And those plans are written once… But projects run for three years maybe… Things change, skills change, people on the team change…

A1 (BD): As an ECR you can ask for help – ask supervisors, peers, ask online, ask colleagues… You can always ask for advice!

A1 (MM): I would echo that you should ask experienced people for help. And think tactically as different funders have their own priorities and areas of interest here too.

Q2: I totally agree with the importance of communicating your science… But showing impact of that is hard. And not all research is of interest to the public – playing devil’s advocate – so what do you do? Do you broaden it? Do you find another way in?

A2 (LM): Taking a step back and talking about broader areas is good… I talk a fair bit about undergraduates as science communicators… They have really good broad knowledge and interest. They can be excellent. And this is where things like Soapbox Science can be so effective. There are other formats too… Things like Bright Club which communicates research through comedy… That’s really different.

A2 (BD): I would agree with all of that. I would add that if you want to measure impact then you have to think about it from the outset – will you count people, some sort of voting or questionnaires. You have to plan this stuff in. The other thing is that you have to pitch things carefully to your audience. If I run events on gravitational waves I will talk about space and black holes… Whereas with a 5 year old I ask about gravity and we jump up and down so they understand what is relevant to them in their lives.

A2 (LM): In terms of metrics for science communication… I was at the British Science Association conference a few years back and this was a major theme… Becky mentioned getting kids to post notes in boxes at sessions… Professional science communicators think a great deal about this… Maybe not as much us “Sunday Fun Run” type people but we should engage more.

Comment (AR): When you prepare an impact statement are you asked for metrics?

A2 (LM): Not usually… They want impact but don’t ask about that…

A2 (BD): Whether or not you are asked for details of how something went you do want to know how you did… And even if you just ask “Did you learn something new today?” that can be really helpful for understanding how it went.

Q3: I think there are too many metrics… As a microbiologist… which ones should I worry about? Should there be a module at the beginning of my PhD to tell me?

A3 (AR): There is no one metric… We don’t want a single number to sum us up. There are so many metrics because one number isn’t enough… There is experimentation going on with what works and what works for you… So be part of the conversation, and be part of the change.

A3 (MM): I think there are too many metrics too… We are experimenting. Altmetrics are indicators, there are citations, that’s tangible… We just have to live with a lot of them all at once at the moment!

UNCONFERENCE SESSION 2: Preprints: A journey through time – Graham Steel

This will be a quick talk plus plenty of discussion space… From the onset of thinking about this conference I was very keen to talk about preprints…

So, who knows what a preprint is? There are plenty of different definitions out there – see Neylon et al 2017. But we’ll take the Wikipedia definition for now. I thought preprints dated to the 1990s. But I found a paper that referenced a pre-print from 1922!

Let’s start there… Preprints were ticking along fine… But then a fightback began. In 1966 preprints were made outlaws when Nature wanted to take “lethal steps” to end preprints. In 1969 we had a thing called the “Ingelfinger Rule” – we’ll come back to that later… Technology wise various technologies ticked along… In 1989 Tim Berners Lee came along. In 1991 Cern set up, also ArXiv set up and grew swiftly… About 8k preprints are uploaded to ArXiv each month as of 2016. Then, in 2007-12 we had Nature Precedings…

But in 2007, the fightback began… In 2012 the Ingelfinger rule was creating stress… There are almost 35k journals, only 37 still use the Ingelfinger rule… But they include key journals like Cell.

But we also saw the launch of bioRxiv in 2013. And we’ve had an explosion of preprints since then… Also in 2013 there was a £5m Center for Open Science set up. This is a central place for preprints, with over 2m preprints so far. There are now a LOT of new …Xiv preprint sites. In 2015 we saw the launch of the ASAPbio movement.

Earlier this year Mark Zuckerberg invested billions in bioRxiv… But everything comes at a price…

Scotland spends on average £11m per year to access research through journals. The best average for APCs I could find is $906. Per pre-print it’s $10. If you want to post a pre-print you have to check the terms of your journal – usually extremely clear. Best to check in SHERPA/RoMEO.

If you want to find out more about preprints there is a great Twitter list, also some recommended preprints reading. Find these slides: slideshare.net/steelgraham and osf.io/zjps6/.

Q&A

Q1: I found Sherpa/Romeo by accident…. But really useful. Who runs it?

A1: It’s funded by Jisc

Q2: How about findability…

A2: ArXiv usually points to where this work has been submitted. And you can go back and add the DOI once published.

Q2: It’s acting as a static archive then? To hold the green copy

A2: And there is collaborative activity across that… And there is work to make those findable, to share them, they are shared on PubMed…

Q2: One of the problems I see is purely discoverability… Getting it easy to find on Google. And integration into knowledgebases, can be found in libraries, in portals… Hard for a researcher looking for a piece of research… They look for a subject, a topic, to search an aggregated platform and link out to it… To find the repository… So people know they have legal access to preprint copies.

A2: You have CORE at the OU which aggregates preprints and suggests additional items when you search. There is ongoing work to integrate with CRIS systems, which are frequently commercial, so interoperability matters here.

Comment: ArXiv is still the place for high energy physics so that is worth researchers going directly to…

Q3: Can I ask about preprints and research evaluation in the US?

A3: It’s an important way to get the work out… But the lack of peer review is an issue there so emerging stuff there…

GS: My last paper was taking forever to come out, we thought it wasn’t going to happen… We posted to PeerJ but discovered that that journal did use the Ingelfinger Rule which scuppered us…

Comment: There are some publishers that want to put preprints on their own platform, so everything stays within their space… How does that sit/conflict with what libraries do…

GS: It’s a bit “us! us! us!”

Comment: You could see all submitted to that journal, which is interesting… Maybe not health… What happens if not accepted… Do you get to pull it out? Do you see what else has been rejected? Could get dodgy… Some potential conflict…

Comment: I believe it is positioned as a separate entity but with a path of least resistance… It’s a question… The thing is.. If we want preprints to be more in academia as opposed to publishers… That means academia has to have the infrastructure to do that, to connect repositories discoverable and aggregated… It’s a potential competitive relationship… Interesting to see how it plays out…

Comment: For Scopus and Web of Science… Those won’t take preprints… Takes ages… And do you want to give up more rights to the journals… ?

Comment: Can see why people would want multiple copies held… That seems healthy… My fear is it requires a lot of community based organisation to be a sustainable and competitive workflow…

Comment: Worth noting the radical “platinum” open access… Lots of preprints out there… Why not get authors to submit them, organise into free, open journal without a publisher… That’s Tim Gowers’s thing… It’s not hard to put together a team to peer review thematically and put out issues of a journal with no charges…

GS: That’s very similar to the Open Library of Humanities… And the Wellcome Trust & Gates Foundation stuff, and the big EU platform. But the Gates one could be huge. Wellcome Trust is relatively small so far… But EU-wide will have major ramifications…

Comment: Platinum is more about overlay journals… Also like SCOAP3 and they do metrics on citations etc. to compare use…

GS: In open access we know about green, gold and with platinum it’s free to author and reader… But use of words different in different contexts…

Q4: What do you think the future is for pre-prints?

A4 – GS: There is a huge boom… There’s currently some duplication of central open preprints platforms. But information is clear on use and uptake is on the rise… It will plateau at some point like PLOS ONE. They launched in 2006 and they probably plateaued around 2015. But it is number 2 in the charts of mega-journals, behind Scientific Reports. They increased APCs (around $1450) and that didn’t help (especially as they were profitable)…

SESSION THREE: Raising your research profile: online engagement & metrics

Green, Gold, and Getting out there: How your choice of publisher services can affect your research profile and engagement – Laura Henderson, Editorial Program Manager, Frontiers

We are based in Lausanne in Switzerland. We are a fully digital, fully open access publisher. All of our 58 journals are published under CC-BY licenses. And the organisation was set up by scientists that wanted to change the landscape. So I wanted to talk today about how this can change your work.

What is traditional academic publishing?

Typically readers pay – journal subscriptions via institution/library or pay per view. Given the costs and number of articles they are expensive – $14B journals revenue in 2014 works out at $7k per article. It’s slow too… Journal rejection cascade can take 6 months to a year each time. Up to 1 million papers – valid papers – are rejected every year. And these limit access to research: around 80% of research papers are behind subscription paywalls. So knowledge gets out very slowly and inaccessibly.
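As a quick sanity check on the figures quoted above, the per-article cost implies a total article count. This is only a back-of-envelope sketch, assuming the revenue figure is in US dollars and that both numbers are the rounded values given in the talk:

```python
# Figures as quoted in the talk (rounded; assumed to be US dollars).
total_revenue_usd = 14_000_000_000   # journals revenue in 2014
cost_per_article_usd = 7_000         # quoted cost per article

# Number of articles implied by dividing revenue by per-article cost.
implied_articles = total_revenue_usd / cost_per_article_usd

print(f"Implied articles per year: {implied_articles:,.0f}")
```

In other words, the two quoted figures are consistent with roughly two million articles being published in that year.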

By comparison open access… Well Green OA allows you to publish and then self-archive your paper in a repository where it can be accessed for free. You can use an institutional or central repository, or I’d suggest both. And there can be a delay due to embargo. Gold OA makes research output immediately available from the publisher and you retain the copyright so no embargoes. It is fully discoverable via indexing and professional promotion services to relevant readers. No subscription fee to reader but usually involves APCs to the institution.

How does Open Access publishing compare? Well it inverts the funding – the institution/grant funder supports authors directly, rather than paying huge subscription fees for packages dictated by publishers. It’s cheaper – Green OA is usually free. Gold OA average fee is c. $1000 – $3000 – actually that’s half what is paid for subscription publishing. We do see projections of open access overtaking subscription publishing by 2020.

So, what benefits does open access bring? Well there is peer-review; scalable publishing platforms; impact metrics; author discoverability and reputation.

And I’d now like to show you what you should look for from any publisher – open access or others.

Firstly, you should expect basic services: quality assurance and indexing. Peter Suber suggests checking the DOAJ – Directory of Open Access Journals. You can also see if the publisher is part of OASPA which excludes publishers who fail to meet their standards. What else? Look for peer review and good editors – you can find the joint COPE/OASPA/DOAJ Principles of Transparency and Best Practice in Scholarly Publishing. So you need to have clear peer review processes. And you need a governing board and editors.

At Frontiers we have an impact-neutral peer review process. We don’t screen for papers with highest impact. Authors, reviewers and the handling Associate Editor interact directly with each other in the online forum. Names of editors and reviewers are published on the final version of the paper. And this leads to an average of 89 days from submission to acceptance – and that’s an industry leading timing… And that’s what won an ALPSP Innovation Award.

So, what are the extraordinary services a top OA publisher can provide? Well altmetrics are more readily available now. Digital articles are accessible and trackable. In Frontiers our metrics are built into every paper… You can see views, downloads, and reader demographics. And that’s post-publication analytics that doesn’t rely on impact factor. And it is community-led impact – your peers decide the impact and importance.

How discoverable are you? We launched a bespoke built-in networking profile for every author and user: Loop. It scrapes all major index databases to find your work – constantly updating. It’s linked to ORCID and is included in the peer review process. When people look at your profile you can truly see your impact in the world.

In terms of how peers find your work we have article alerts going to 1 million people, and a newsletter that goes to 300k readers. And our articles have 250 million article views and downloads, with hotspots in Mountain View California, and in Shenzhen, and areas of development in the “Global South”.

So when you look for a publisher, look for a publisher with global impact.

What are all these dots and what can linking them tell me? – Rachel Lammey, Crossref

Crossref are a not for profit organisation. So… We have articles out there, datasets, blogs, tweets, Wikipedia pages… We are really interested to understand these links. We are doing that through Crossref Event Data, tracking the conversation, mainly around objects with a DOI. The main way we use and mention publications is in the citations of articles. That’s the traditional way to discuss research and understand news. But research is being used in lots of different ways now – Twitter and Reddit…

So, where does Crossref fit in? It is the DOI registration agency for scholarly content. Publishers register their content with us. URLs do change and do break… And that means you need something more persistent so it can still be used in their research… Last year at ReCon we tried to find DOI gaps in reference lists – hard to do. Even within journals publications move around… And switch publishers… The DOI fixes that reference. We are sort of a switchboard for that information.
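To make the “switchboard” idea concrete, here is a minimal sketch of turning a DOI, however it was written in a reference list, into a single persistent link via the public doi.org resolver. The helper name `doi_to_url` and the example DOI are placeholders for illustration, not anything shown in the talk:

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Common ways a DOI gets written in reference lists.
KNOWN_PREFIXES = ("https://doi.org/", "http://dx.doi.org/", "doi:")


def doi_to_url(doi: str) -> str:
    """Normalise a DOI string and return a resolver link for it."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Percent-encode unusual characters but keep the "/" separators.
    return DOI_RESOLVER + quote(doi, safe="/")


# Placeholder DOI, used here purely as an example.
print(doi_to_url("doi:10.5555/12345678"))
```

The link stays the same even if the article later moves to a different URL or publisher; only the record behind the resolver is updated.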

I talked about citations and references… Now we are looking beyond that. It is about capturing data and relationships so that understanding and new services (by others) can be built… As such it’s an API (Application Programming Interface) – it’s lots of data rather than an interface. So it captures subject, relation, object, tweet, mentions, etc. We are generating this data (as of yesterday we’ve seen 14m events); we are not doing anything with it, so this is a clear set of data to do further work on.
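The subject, relation, object structure described here lends itself to simple counting. The sketch below uses a handful of made-up events to show the idea; the field names (`subj_id`, `relation_type_id`, `obj_id`, `source_id`) are an assumption about how such events are laid out and should be checked against the Event Data documentation before use:

```python
from collections import Counter

# Made-up events for illustration only; this is not real Crossref data.
# Each event says: <subject> <relation> <object>, as seen on <source>.
events = [
    {"subj_id": "https://twitter.com/example/status/1",
     "relation_type_id": "discusses",
     "obj_id": "https://doi.org/10.5555/12345678",
     "source_id": "twitter"},
    {"subj_id": "https://en.wikipedia.org/wiki/Example",
     "relation_type_id": "references",
     "obj_id": "https://doi.org/10.5555/12345678",
     "source_id": "wikipedia"},
    {"subj_id": "https://twitter.com/example/status/2",
     "relation_type_id": "discusses",
     "obj_id": "https://doi.org/10.5555/87654321",
     "source_id": "twitter"},
]


def events_per_object(events):
    """Count how many events point at each DOI."""
    return Counter(e["obj_id"] for e in events)


def sources_for(events, obj_id):
    """Count which sources mention a given DOI."""
    return Counter(e["source_id"] for e in events if e["obj_id"] == obj_id)


print(events_per_object(events).most_common())
print(sources_for(events, "https://doi.org/10.5555/12345678"))
```

Because the data is just a list of relationships, the interpretation (what counts as attention, what counts as gaming) is left to whoever builds on it.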

We’ve been doing work with NISO Working Group on altmetrics, but again, providing the data not the analysis. So, what can this data show? We see citation rings/friends gaming the machine; potential peer review scams; citation patterns. How can you use this data? Almost any way. Come talk to us about Linked Data; Article Level Metrics; general discoverability, etc.

We’ve done some work ourselves… For instance the Live Data from all sources – including Wikipedia citing various pages… We have lots of members in Korea, and started looking just at citations on Korean Wikipedia. It’s free under a CC0 license. If you are interested, go make something cool… Come ask me questions… And we have a beta testing group and we welcome your feedback and experiments with our data!

The wonderful world of altmetrics: why researchers’ voices matter – Jean Liu, Product Development Manager, Altmetric

I’m actually five years out of graduate school, so I have some empathy with PhD students and ECRs. I really want to go through what Altmetrics is and what measures there are. It’s not controversial to say that altmetrics have been experiencing a meteoric rise over the last few years… That is partly because we have so much more to draw upon than the traditional journal impact factors, citation counts, etc.

So, who are altmetric.com? We have about 20 employees, founded in 2011 and all based in London. And we’ve started to see that people are receptive to altmetrics, partly because of the (near) instant feedback… We tune into the Twitter firehose – that phrase is apt! Altmetrics also showcase many “flavours” of attention and impact that research can have – and not just articles. And the signals we track are highly varied: policy documents, news, blogs, Twitter, post-publication peer review, Facebook, Wikipedia, LinkedIn, Reddit, etc.

Altmetrics also have limitations. They are not a replacement for peer review or citation-based metrics. They can be gamed – but data providers have measures in place to guard against this. We’ve seen interesting attempts at gamification – but often caught…

Researchers are not only the ones who receive attention in altmetrics, but they are also the ones generating attention that make up altmetrics – but not all attention is high quality or trustworthy. We don’t want to suggest that researchers should be judged just on altmetrics…

Meanwhile Universities are asking interesting questions: how can our researchers change policy? Which conference can I send people to which will be most useful? etc.

So, let’s see the topic of “diabetic neuropathy”. Looking around we can see a blog, an NHS/NICE guidance document, and a piece in The Conversation. A whole range of items here. And you can track attention over time… Both by volume, but also you can look at influencers across e.g. News Outlets, Policy Outlets, Blogs and Tweeters. And you can understand where researcher voices feature (all are blogs). And I can then compare news and policy and see the difference. The profiles for News and Blogs are quite different…

How can researchers’ voices be heard? Well you can write for a different audience, you can raise the profile of your work… You can become that “go-to” person. You also want to be really effective when you are active – altmetrics can help you to understand where your audience is and how they respond, to understand what is working well.

And you can find out more by trying the altmetric bookmarking browser plugin, by exploring these tools on publishing platforms (where available), or by taking a look.

How to help more people find and understand your work – Charlie Rapple, Kudos

I’m sorry to be the last person on the agenda, you’ll all be overwhelmed as there has been so much information!

I’m one of the founders of Kudos and we are an organisation dedicated to helping you increase the reach and impact of your work. There is such competition for funding, a huge growth in outputs, there is a huge fight for visibility and usage, a drive for accountability and a real cult of impact. You are expected to find and broaden the audience for your work, to engage with the public. And that is the context in which we set up Kudos. We want to help you navigate this new world.

Part of the challenge is knowing where to engage. We did a survey last year with around 3000 participants to ask how they share their work – conferences, academic networking, conversations with colleagues all ranked highly; whilst YouTube, slideshare, etc. are less used.

Impact is built on readership – impacts cross a variety of areas… But essentially it comes down to getting people to find and read your work. So, for me it starts with making sure you increase the number of people reaching and engaging with your work. Hence the publication is at the centre – for now. That may well be changing as other material is shared.

We’ve talked a lot about metrics, there are very different ones and some will matter more to you than others. Citations have high value, but so do mentions, clicks, shares, downloads… Do take the time to think about these. And think about how your own actions and behaviours contribute back to those metrics… So if you email people about your work, track that to see if it works… Make those connections… Everyone has their own way and, as Nicola was saying in the Digital Footprint session, communities exist already, you have to get work out there… And your metrics have to be about correlating what happens – readership and citations. Kudos is a management tool for that.

In terms of justifying the time, communications do increase impact. We have been building up data on how that takes place. A team from Nanyang Technological University did a study of our data in 2016 and they saw that researchers using the Kudos tools to promote their work had 23% higher growth in downloads of full text on publisher sites. And that really shows the value of doing that engagement. It will actually lead to meaningful results.

So a quick look at how Kudos works… It’s free for researchers (www.growkudos.com) and it takes about 15 minutes to set up, about 10 minutes each time you publish something new. You can find a publication, you can use your ORCID if you have one… It’s easy to find your publication and once you have, you get a page for it where you can create a plain language explanation of your work and why it is important – that is grounded in talking to researchers about what they need. For example: http://bit.ly/plantsdance. That plain text is separate from the abstract. It’s that first quick overview. The advantage of this is that it is easier for people within the field to skim and scan your work; people outside your field in academia can skip the terminology of your field and understand what you’ve said. It also helps people outside academia to get a handle on research and apply it in non-academic ways. People can actually access your work and actually understand it. There is a lot of research to back that up.

Also on publication page you can add all the resources around your work – code, data, videos, interviews, etc. So for instance Claudia Sick does work on baboons and why they groom where they groom – that includes an article and all of that press coverage together. That publication page gives you a URL, you can post to social media from within Kudos. You can copy the trackable link and paste wherever you like. The advantage to doing this in Kudos is that we can connect that up to all of your metrics and your work. You can get them all in one place, and map it against what you have done to communicate. And we map those actions to show which communications are more effective for sharing… You can really start to refine your efforts… You might have built networks in one space but the value might all be in another space.

Sign up now: we are about to launch a game about building up your profile and impact, which scores your research impact and lets you compare to others.

PANEL DISCUSSION – Laura Henderson, Editorial Program Manager, Frontiers (LH); Rachel Lammey, Crossref (RL); Jean Liu, Product Development Manager, Altmetric (JL); Charlie Rapple, Kudos (CR). 

Q1: Really interesting but how will the community decide which spaces we should use?

A1 (CR): Yes, in the Nanyang work we found that most work was shared on Facebook, but more links were engaged with on Twitter. There is more to be done, and more to filter through… But we have to keep building up the data…

A1 (LH): We are coming from the same sort of place as Jean there, altmetrics are built into Frontiers, connected to ORCID, Loop built to connect to institutional plugins (totally open plugin). But it is such a challenge… Facebook, Twitter, LinkedIn, SnapChat… Usually personal choice really, we just want to make it easier…

A1 (JL): It’s about interoperability. We are all working in it together. You will find certain stats on certain pages…

A1 (RL): It’s personal choice, it’s interoperability… But it is about options. Part of the issue with impact factor is the issue of being judged by something you don’t have any choice or impact upon… And I think that we need to give researchers new tools, ways to select what is right for them.

Q2: These seem like great tools, but how do we persuade funders?

A2 (JL): We have found funders being interested independently, particularly in the US. There is this feeling across the scholarly community that things have to change… And funders want to look at what might work, they are already interested.

A2 (LH): We have an office in Brussels which lobbies the European Commission, we are trying to get our voice for Open Science heard, to make a difference to policies and mandates… The impact factor has been convenient, it’s well embedded, it was designed by an institutional librarian, so we are out lobbying for change.

A2 (CR): Convenience is key. Nothing has changed because nothing has been convenient enough to replace the impact factor. There is a lot of work and innovation in this area, and it is not only on researchers to make that change happen, it’s on all of us to make that change happen now.

Jo Young (JY): To finish, a few thank yous… Thank you all for coming along today, to all of our speakers, and a huge thank you to Peter and Radic (our cameramen), to Anders, Graham and Jan for work in planning this. And to Nicola and Amy who have been liveblogging, and to all who have been tweeting. Huge thanks to CrossRef, Frontiers, F1000, JYMedia, and PLoS.

And with that we are done. Thanks to all for a really interesting and busy day!

 

Jun 282017
 

Today I am at the eLearning@ed Conference 2017, our annual day-long event for the eLearning community across the University of Edinburgh – including learning technologists, academic staff and some postgraduate students. As I’m convener of the community I’m also chairing some sessions today so the notes won’t be at quite my normal pace!

As usual comments, additions and corrections are very welcome. 

For the first two sections I’m afraid I was chairing so there were no notes… But huge thanks to Anne Marie for her excellent quick run through exciting stuff to come… 

Welcome – Nicola Osborne, elearning@ed Convenor

Forthcoming Attractions – Anne Marie Scott, Head of Digital Learning Applications and Media

And with that it was over to our wonderful opening keynote… 

Opening Keynote: Prof. Nicola Whitton, Professor of Professional Learning, Manchester Metropolitan University: Inevitable Failure Assessment? Rethinking higher education through play (Chair: Dr Jill MacKay)

Although I am in education now, my background is as a computer scientist… So I grew up with failure. Do you remember the ZX Spectrum? Loading games there was extremely hit and miss. But the games there – all text based – were brilliant, they worked, they took you on adventures. I played all the games but I don’t think I ever finished one… I’d get a certain way through and then we’d have that idea of catastrophic failure…

And then I met a handsome man… It was unrequited… But he was a bit pixellated… Here was Guybrush Threepwood of the Monkey Island series. And that game changed everything – you couldn’t catastrophically fail, it was almost impossible. But in this game you can take risks, you can try things, you can be innovative… And that’s important for me… That space for failure…

This is about the way that we and our students think about failure in Higher Education, and deal with failure in Higher Education. If we think that we can go through life and never fail, we are set for disappointment. We don’t laud the failures. J.K. Rowling, biggest author, rejected 12 times. The Beatles, biggest band of the 20th Century, were rejected by record labels many, many times. The lightbulb failed hundreds of times! Thomas Edison said he didn’t fail 100 times, he succeeded in lots of stages…

So, to laud failure… Here are some of mine:

  1. Primary 5 junior mastermind – I’m still angry! I chose horses as my specialist subject so, a tip, don’t do that!
  2. My driving test – that was a real resilience moment… I’ll do it again… I’ll have more lessons with my creepy driving instructor, but I’ll do it again.
  3. First year university exams – failed one exam, by one mark… It was borderline and they said “but we thought you needed to fail” – I had already been told off for not attending lectures. So I gave up my summer job, spent the summer re-sitting. I learned that there is only so far you can push things… You have to take things seriously…
  4. Keeping control of a moped – in Thailand, with no training… Driving into walls… And learning when to give up… (we then went by walking and bus)
  5. Funding proposals and article submissions, regularly, too numerous to count – failure is inevitable… As academics we tend not to tell you about all the times we fail… We are going to fail… So we have to be fine to fail and learn from it. I was involved in a Jisc project in 2009… I’ve published most on it… It really didn’t work… And when it didn’t work they funded us to write about that. And I was very lucky, one of the Innovation Programme Managers who had funded us said “hey, if some of our innovation funding isn’t failing, then we aren’t being innovative”. But that’s not what we talk about.

For us, for our students… We have to understand that failure is inevitable. Things are currently set up as failure being a bad outcome, rather than an integral part of the learning process… And learning from failure is really important. I have read something – though I’ve not been able to find it again – that those who pass their driving test on the second attempt are better drivers. Failure is about learning. I have small children… They spent their first few years failing to talk then failing to walk… That’s not failure though, it’s how we learn…

Just a little bit of theory. I want to talk a bit about the concept of the magic circle… The Magic Circle came from game theory, from the 1950s. Picked up by ? Zimmerman in early 2000s… The idea is that when you play with someone, you enter this other space, this safe space, where normal rules don’t apply… Like when you see animals playfighting… There is mutual agreement that this doesn’t count, that there are rules and safety… In Chess you don’t just randomly grab the king. Pub banter can be that safe space with different rules applying…

This happens in games, this happens in physical play… How can we create magic circles in learning… So what is that:

  • Freedom to fail – if you won right away, there’s no point in playing it. That freedom to fail and not be constrained by the failure… How we look at failure in games is really different from how we look at failure in Higher Education.
  • Lusory attitude – this is about a willingness to engage in play, to forget about the rules of the real world, to abide by the rules of this new situation. To park real life… To experiment, that is powerful. And that idea came from Bernard Suits whose book, The Grasshopper, is a great Playful Learning read.
  • Intrinsic motivation – this is the key area of magic circle for higher education. The idea that learning can be and should be intrinsically motivating is really really important.

So, how many of you have been in an academic reading group? OK, how many have lasted more than a year? Yeah, they rarely last long… People don’t get round to reading the book… We’ve set up a book group with special rules: you either HAVE TO read the book, or you HAVE TO PRETEND that you read the book. We’ve had great turn out, no idea if they all read the books… But we have great discussion… Reframing that book group just a small bit makes a huge difference.

That sort of tiny change can be very powerful for integrating playfulness. We don’t think twice about doing this with children… Part of the issue with play, especially with adults, is what matters about play… About that space to fail. But also the idea of play as a socialised bonding space, for experimentation, for exploration, for possibilities, for doing something else, for being someone else. And the link with motivation is quite well established… I think we need to understand that different kinds of play have different potential, but it’s about play and people, and safe play…

This is my theory heavy slide… This is from a paper I’ve just completed with colleagues in Denmark. We wanted to think “what is playful learning”… We talk about Higher Education and playful learning in that context… So what actually is it?

Well there is signature pedagogy for playful learning in higher education, under which we have surface (game) structures; deep (play) structures; implicit (playful) structures. Signature pedagogy could be architecture or engineering…

This came out of work on what students respond to…

So Surface (game) structures includes: ease of entry and explicit progression; appropriate and flexible levels of challenge; engaging game mechanics; physical or digital artefacts. Those are often based around games and digital games… But you can be playful without games…

Deep (play) structures is about: active and physical engagement; collaboration with diversity; imagining possibilities; novelty and surprises.

Implicit (playful) structures: lusory attitude; democratic values and openness; acceptance of risk-taking and failure; intrinsic motivation. That is so important for us in higher education…

So, rant alert…

Higher Education is broken. And that is because schools are broken. I live in Manchester (I know things aren’t as bad in Scotland) and we have assessment all over the place… My daughter is 7 and sitting exams. Two weeks of them. They are talking about exams for reception kids – 4 year olds! We have a performative culture of “you will be assessed, you will be assessed”. And then we are surprised when that’s how our students respond… And we have the TEF appearing… The golds, silvers, and bronzes… Based on fairly random metrics… And then we are surprised when people work to the metrics. I think that assessment is a great way to suck out all the creativity!

So, some questions my kids have recently asked:

  • Are there good viruses? I asked an expert… apparently there are for treating people.. (But they often mutate.)
  • Do mermaids lay eggs? Well they are part fish…
  • Do Snow Leopards eat tomatoes? Where did this question come from? Who knows? Apparently they do eat monkeys… What?!

But contrast that to what my students ask:

  • Will I need to know this for the exam?
  • Are we going to be assessed on that?

That’s what happens when we work to the metrics…

We are running a course where there were two assessments. One was formative… And students got angry that it wasn’t worth credit… So I started to think about what was important about assessment? So I plotted the feedback from low to high, and consequence from low to high… So low consequence, low feedback…

We have the idea of the Trivial Fail – we all do those and it doesn’t matter (e.g. forgetting to signal at a roundabout), and lots of opportunity to fail like that.

We also have the Critical Fail – High Consequence and Low Feedback – kids exams and quite a lot of university assessment fits there.

We also have Serious Fail – High Consequence and High Feedback – I’d put PhD Vivas there… consequences matter… But there is feedback and can be opportunity to manage that.

What we need to focus on in Higher Education is the Micro Fail – low consequence with high feedback. We need students to have that experience, and to value that failure, to value failure without consequence…
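For anyone who wants to pin down that two-by-two grid, here is a minimal sketch in Python. The names `Fail` and `classify_fail` are my own, purely illustrative, and I am assuming that the Trivial Fail sits in the low consequence, low feedback corner, which is my reading of the slide rather than something stated outright.

```python
from enum import Enum


class Fail(Enum):
    """The four kinds of failure from the talk (labels as noted above)."""
    TRIVIAL = "Trivial Fail"    # assumed: low consequence, low feedback
    MICRO = "Micro Fail"        # low consequence, high feedback
    CRITICAL = "Critical Fail"  # high consequence, low feedback
    SERIOUS = "Serious Fail"    # high consequence, high feedback


def classify_fail(high_consequence: bool, high_feedback: bool) -> Fail:
    """Map a point on the consequence/feedback grid to a kind of failure."""
    if high_consequence:
        return Fail.SERIOUS if high_feedback else Fail.CRITICAL
    return Fail.MICRO if high_feedback else Fail.TRIVIAL


if __name__ == "__main__":
    # A PhD viva: the consequences matter, but there is feedback.
    print(classify_fail(high_consequence=True, high_feedback=True).value)
    # A formative exercise with rich comments: the quadrant to aim for.
    print(classify_fail(high_consequence=False, high_feedback=True).value)
```

Treating each axis as a simple high/low switch is a simplification; the plot described in the talk runs from low to high on both axes.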

So… How on earth do we actually do this? How about we “Level Up” assessment… With bosses at the end of levels… And you keep going until you reach as far as you need to go, and have feedback filled in…

Or the Monkey Island assessment. There is a goal but it doesn’t matter how you get there… You integrate learning and assessment completely, and ask people to be creative…

Easter Egg assessment… Not to do with chocolate but “Easter Eggs” – surprises… You don’t know how you’ll be assessed… Or when you’ll be assessed… But you will be! And it might be fun! So you have to go to lectures… Real life works like that… You can’t know which days will count ahead of time.

Inevitable Failure assessment… You WILL fail first time, maybe second time, third time… But eventually pass… Or even maybe you can’t ever succeed and that’s part of the point.

The point is that failure is inevitable and you need to be able to cope with that and learn from that. On which note… Here is my favourite journal, the Journal of Universal Rejection… This is quite a cathartic experience, they reject everything!

So I wanted to talk about a project that we are doing with some support from the HEA… Eduscapes… Have you played Escape Rooms? They are so addictive! There are lots of people creating educational Escape Rooms… This project is a bit different… So there are three parts… You start by understanding what the Escape Room is, how they work; then some training; and then design a game. But they have to trial them again and again and again. We’ve done this with students, and with high school students three times now. There is inevitable failure built in here… And the project can run over days or weeks or months… But you start with something and try and fail and learn…

This is collaborative, it is creative – there is so much scope to play with, sometimes props, sometimes budget, sometimes what they can find… In the schools case they were maths and Comp Sci students so there was a link to the curriculum. It is not assessed… But other people will see it – that’s quite a powerful motivator… We have done this with reflection/portfolio assessment… That resource is now available, there’s a link, and it’s a really simple way to engage in something that doesn’t really matter…

And while I’m here I have to plug our conference, Playful Learning, now in its second year. We were all about thinking differently about conferences… But always presenting at traditional conferences. So our conference is different… Most of it is hands on, all different stuff, a space to do something different – we had storytelling in a tent as one of these… Lots of space but nothing really went wrong. But we need something to fail. Applications are closed this year… But there will be a call next year… So play more, be creative, fail!

So, to finish… I’m playful, play has massive potential… But we also have to think about diversity of play, the resilience to play… A lot of the research on playful learning and assessment doesn’t recognise the importance of gender, race, context, etc… And the importance of the language we use in play… It has nuance, and comes with distinctions… We have to encourage people to play and get involved. And we really have to re-think assessment – for ourselves, of universities, of students, of school pupils… Until we rethink this, it will be hard to have any real impact for playful learning…

Jill: Thank you so much, that was absolutely brilliant. And that Star Trek reference is “Kobayashi Maru”!

Q&A

Q1) In terms of playful learning and assessment, I was wondering how self-assessment can work?

A1) That brings me back to previous work I have done around reflection… And I think that’s about bringing that reflection into playful assessment… But it’s a hard question… More space and time for reflection, possibly more space for support… But otherwise not that different from other assessment.

Q2) I run a research methods course for an MSc… We tried to invoke playfulness with a fake data set with dragons and princesses… Any other examples of that?

A2) I think that that idea of it being playful, rather than games, is really important. Can use playful images, or data that makes rude shapes when you graph it!

Q3) Nic knows that I don’t play games… I was interested in that difference between gaming and play and playfulness… There is something about games that doesn’t entice me at all… But that lusory attitude did feel familiar and appealing… That suspension of disbelief and creativity… And that connection with gendered discussion of play and games.

A3) We are working on a taxonomy of play. That’s quite complex… Some things are clearly play… A game, messing with LEGO… Some things are not play, but can be playful… Crochet… Jigsaw puzzles… They don’t have to be creative… But you can apply that attitude to almost anything. So there is play and there is a playful attitude… That latter part is the key thing, the being prepared to fail…

Q4) Not all games are fun… Easy to think playfulness and games… A lot of games are work… Competitive gaming… Or things like World of Warcraft – your wizard chores. And intensity there… Failure can be quite problematic if working with 25 people in a raid – everyone is tired and angry… That’s not a space where failure is ok… So in terms of what we can learn from games it is important to remember that games aren’t always fun or playful…

A4) Indeed, and not all play is fun… I hate performative play – improv, people touching me… It’s about understanding… It’s really nuanced. It used to be that “students love games because they are fun” and now “students love play because it’s fun” and that’s still missing the point…

Q5) I don’t think you are advocating this but… Thinking about spoonful of sugar making assessment go down… Tricking students into assessment??

A5) No. It’s taking away the consequences in how we think about assessment. I don’t have a problem with exams, but the weight on that, the consequences of failure. It is inevitable in HE that we grade students at different levels… So we have to think about how important assessment is in the real world… We don’t have equivalents of University assessments in the real world… Let’s say I do a bid, lots of work, not funded… In the real world I try again. If you fail your finals, you don’t get to try again… So it’s about not making it “one go and it’s over”… That’s hard but a big change and important.

Q6) I started in behavioural science in animals… Play there is “you’ll know it when you see it” – we have clear ideas of what other behaviours look like, but play is hard to describe but you know it when you see it… How does that work in your taxonomy…

A6) I have a colleague who is a physical science teacher trainer… And he’s gotten to “you’ll know it when you see it”… Sometimes that is how you perceive that difference… But that’s hard when you apply for grants! It’s a bit of an artificial exercise…

Q7) Can you tell us more about play and cultural diversity, and how we need to think about that in HE?

A7) At the moment we are at the point that people understand and value play in different ways. I have a colleague looking at diversity in play… A lot of research previously is on men, and privileged white men… So partly it’s about explaining why you are doing, what you are doing, in the way you are doing it… You have to think beyond that, to appropriateness, to have play in your toolkit…

Q8) You talk about physical spaces and playfulness… How much impact does that have?

A8) It’s not my specialist area but yes, the physical space matters… And you have to think about how to make your space more playful..

Introductions to Break Out Sessions: Playful Learning & Experimentation (Nicola Osborne)

  • Playful Learning – Michael Boyd (10 min)

We are here today with the UCreate Studio… I am the manager of the space, we have student assistants. We also have high school students supporting us too. This pilot runs to the end of July and provides a central Maker Space… To create things, to make things, to generate ideas… This draws on the maker movement: we are a space for playful learning through making. There are about 1400 maker spaces worldwide, many in Universities in the UK too… Why do they pop up in Universities? They are great creative spaces to learn.

You can get hands on with technology… It is about peer based learning… And project learning… It’s a safe space to fail – it’s non assessed stuff…

Why is it good for learning? Well for instance the World Economic Forum predicts that 35% of core professional skills will change from 2015 to 2020. Complex problem solving, critical thinking, creativity, judgement and decision making, cognitive flexibility… These are things that can’t be automated… And can be supported by making and creating…

So, what do we do? We use new technologies, we use technologies that are emerging but not yet widely adopted. And we are educational… That first few months is the hard bit… We don’t lecture much, we are there to help and guide and scaffold. Students can feel confident that they have support if they need it.

And, we are open source! Anyone in the University can use the space, be supported in the space, for free as long as they openly share and license whatever they make. Part of that bigger open ethos.

So, what gets made? Includes academic stuff… Someone made a holder for his spectrometer and 3D printed it. He’s now looking to augment this with his chemistry to improve that design; we have Josie in archeology scanning artefacts and then using that to engage people – using VR; Dimitra in medicine, following a poster project for a cancer monitoring chip, she started prototyping; Hayden in Geosciences is using 3D scanning to see the density of plant matter to understand climate change.

But it’s not just that. Also other stuff… Henry studies architecture, but has a grandfather who needs meds and his family worries about whether he takes his medicine… So he’s designed a system that connects a display of that. Then Greg on ECA is looking at projecting memories on people… To see how that helps…

So, I wanted to flag some ideas we can discuss… One of the first projects when I arrived, Fiona Hale and Chris Speed (ECA) ran “Maker Go”, which had product design students, across the years, come up with a mobile maker space project… Results were fantastic – a bike to use to scan a space… A way to follow and make paths with paint, to a coffee machine powered by failed crits etc. Brilliant stuff. And afterwards there was a self-organised (first they can remember) exhibition, Velodrama…

Next up was the Edinburgh IoT challenge… Students and academics came together to address challenges set by Council, Uni, etc. Designers, Engineers, Scientists… Led to a really special project, 2 UG students approached us to set up the new Embedded and Robotics Society – they run sessions every two weeks. And it is going from strength to strength.

Last but not least… Digital manufacturing IP session trialled last term with Dr Stema Kieria, to explore 3D scanning and printing and the impact on IP… Huge areas… Echoes of taping songs off the radio. Took something real, showed it hands on, learned about technologies, scanned copyright materials, and explored this. They taught me stuff! And that led to a Law and Artificial Intelligence Hackathon in March. This was law and informatics working together, huge ideas… We hope to see them back in the studio soon!

  • Near Future Teaching Vox Pops – Sian Bayne (5 mins)

I am Assistant Vice Principal for Digital Education and I was very keen to look at designing the future of digital education at Edinburgh. I am really excited to be here today… We want you to answer some questions on what teaching will look like in this university in 20 or 30 years time:

  • will students come to campus?
  • will we come to campus?
  • will we have AI tutors?
  • How will teaching change?
  • Will learning analytics trigger new things?
  • How will we work with partner organisations?
  • Will peers accredit each other?
  • Will MOOCs still exist?
  • Will performance enhancement be routine?
  • Will lectures still exist?
  • Will exams exist?
  • Will essays be marked by software?
  • Will essays exist?
  • Will disciplines still exist?
  • Will the VLE still exist?
  • Will we teach in VR?
  • Will the campus be smart? And what does eg IoT to monitor spaces mean socially?
  • Will we be smarter through technology?
  • What values should shape how we change? How we use these technologies?

Come be interviewed for our voxpops! We will be videoing… If you feel brave, come see us!

And now to a break… and our breakout sessions, which were… 

Morning Break Out Sessions

  • Playful Learning Mini Maker Space (Michael Boyd)
  • 23 Things (Stephanie (Charlie) Farley)
  • DIY Film School (Gear and Gadgets) (Stephen Donnelly)
  • World of Warcraft (download/set up information here) (Hamish MacLeod & Clara O’Shea)
  • Near Future Teaching Vox Pops (Sian Bayne)

Presentations: Fun and Games and Learning (Chair: Ruby Rennie, Lecturer, Institute for Education, Teaching and Leadership (Moray House School of Education))

  • Teaching with Dungeons & Dragons – Tom Boylston

I am based in Anthropology and we’ve been running a course on the anthropology of games. And I just wanted to talk about that experience of creating playful teaching and learning. So, Dungeons and Dragons was designed in the 1970s… You wake up, you’re chained up in a dungeon, you are surrounded by aggressive warriors… And as a player you choose what to do – fight them, talk to them, etc… And you can roll a dice to decide an action, to make the next play. It is always a little bit improvisational, and that’s where the fun comes in!

There are some stigmas around D&D as the last bastion of the nerdy white bloke… But… The situation we had was a 2 hour lecture slot, and I wanted to split that in two. To engage with a reading on the creative opportunities of imagination. I wanted them to make a character, almost like creative writing classes, to play that character and see what that felt like, how that changed that… Because part of the fun of role playing is getting to be someone else. Now these games do raise identity issues – gender, race, sexuality… That can be great but it’s not what you want in a big group with people you don’t yet have trust with… But there is something special about being in a space with others, where you don’t know what could happen… It is not a simple thing to take a traditional teaching setting and make it playful… One of the first things we look at when we think about play is people needing to consent to play… And if you impose that on a room, that’s hard…

So early in the course we looked at Erving Goffman’s Frame Analysis, and we used Pictionary cards… We looked at the social cues from the space, the placement of seats, microphones, etc. And then the social cues of play… Some of the foundational work of animal play asks us how you know dogs are playfighting… It’s the half-bite, playful rather than painful… So how do I invite a room full of people to play? I commanded people to play Pictionary, to come up and play… Eventually someone came up… Eventually the room accepted that and the atmosphere changed. It really helped that we had been reading about framing. And I asked what had changed and they were able to think and talk about that…

But D&D… People were sceptical. We started with students making me a character. They made me Englebert, a 5 year old lizard creature… To display the playful situation, a bit silly, to model and frame the situation… I sent them comedy D&D podcasts to listen to and asked them to come back a week later… I promised that we wouldn’t do it every week but… I shared some creative writing approaches to writing a back story, to understand what would matter about this character… Only having done this preparatory work, thought about framing… Only then did I try out my adventure on them… It’s about a masquerade in Cameroon, and children try on others’ masks… I didn’t want to appropriate that. But just to take some cues and ideas and tone from that. And when we got to the role playing, the students were up for it… And we did this either as individual students, or they could pair up…

And then we had a debrief – crucial for a playful experience like this. People said there was more negotiation than they expected as they set up the scene and created. They were surprised how people took care of their characters…

The concluding thing was… At the end of the course I had probably shared more that I cared about. Students interrupted me more – with really great ideas! And students really engaged.

Q&A

Q1) Would you say that D&D would be a better medium than an online role playing game… Extemporisation rather than structured compunction?

A1) We did talk about that… We created a WoW character… There really is a lot of space, unexpected situations you can create in D&D… Lots of improvisation… More happened in that than in the WoW stuff that we did… It was surprisingly great.

Q2) Is that partly about sharing and revealing you, rather than the playfulness per se?

A2) Maybe a bit… But I would have found that hard in another context. The discussion of games really brought that stuff out… It was great and unexpected… Play is the creation of unexpected things…

Q3) There’s a trust thing there… We can’t expect students to trust us and the process, unless we show our trust ourselves…

A3) There was a fair bit of background effort… Thinking about signalling a playful space, and how that changes the space… The playful situations did that without me intending to or trying to!

Digital Game Based Learning in China – Sihan Zhou

I have been finding this event really inspiring… There is so much to think around playfulness. I am from China, and the concept of playful learning is quite new in China so I’m pleased to talk to you about the platform we are creating – Tornado English…

On this platform we have four components – a bilingual animation, a game, and a bilingual chat bot… If the user clicks on the game, they can download it… So far we have created two games: Word Pop – vocabulary learning and Run Rabbit – syntactic learning, both based around Mayer’s model (2011).

The game mechanics are usually understood by comparing user skills and level of challenge – too easy and users will get bored, but if it’s too challenging then users will be frustrated and demotivated. So for apps in China, many of the educational products tend to be more challenging than fun – more educational apps than educational games. So in our games we use timing and scoring to make things more playful and interactions like popping bubbles, clicking on moles popping out of holes in the ground. In Word Smash students have to match images to vocab as quickly as possible… In Run Rabbit… The student has to speak a phrase in order to get the rabbit to run to the right word in the game and place it…

When we designed the game, we considered how we could ensure that the game is educationally effective, and to integrate it with the English curriculum in school. We tie to the 2011 English Curriculum Standards for Compulsory Education in China. Students have to complete a sequence of levels to reach the next level of learning – autonomous learning in a systematic way.

So, we piloted this app in China, working with 6 primary schools in Harbin, China. Data has been collected from interviews with teachers, classroom observation, and questionnaires with parents.

This work is a KTP – a Knowledge Transfer Partnership – project and the KTP research is looking at Chinese primary school teachers’ attitudes towards game-based learning. And there is also an MSc TESOL Dissertation looking at teachers’ attitudes towards game-based learning… For instance they may or may not be able to actually use these tools in the classroom because of the way teaching is planned and run. The results of this work will be presented soon – do get in touch.

Our future game development will focus more on a communicative model, task-based learning, and learner autonomy. So the character lands on a new planet, has to find their way, repair their rocket, and return to earth… To complete those tasks the learner has to develop the appropriate language to do well… But this is all exploratory so do talk to me and inspire me.

Q&A

Q1) I had some fantastic Chinese students in my playful anthropology course and they were explaining quite mixed attitudes to these approaches in China. Clearly there is that challenge to get authorities to accept it… But what’s the compromise between learning and fun?

A1) The game has features designed for fun… I met with the education bureau and teachers, to talk about how this is educationally effective… Then when I get into classrooms to talk to the students, I focus more on gaming features, why you play it, how you progress and unlock new levels. Emphasis has to be quite different depending on the audience. One has to understand the context.

Q2) How have the kids responded?

A2) They have been really inspired and want to try it out. The kids are 8 or 9 years old… They were keen but also knew that their parents weren’t going to be as happy about playing games in the week when they are supposed to do “homework”. We get data on how this is used… We see good use on week days, but huge use on weekends, and longer play time too!

Q3) In terms of changing attitudes to game based learning in China… If you are wanting to test it in Taiwan the attitude was different, we were expected to build playful approaches in…

A3) There is “teaching reform” taking place… And more games and playfulness in the classrooms. But digital games was the problem in terms of triggering a mentality and caution. The new generation uses more elearning… But there is a need to demonstrate that usefulness and take it out to others.

VR in Education – Cinzia Pusceddu-Gangarosa

I am manager of learning technology in the School of Biological Sciences, and also a student on the wonderful MSc in Digital Education. I’m going to talk about Virtual Reality in Education.

I wanted to start by defining VR. The definition I like best is from Merriam-Webster. It includes key ideas… the idea of a “simulated world” and the ways one engages with it. VR technologies include headsets like Oculus Rift (high end) through to Google Cardboard (low end) that let you engage… But there is more interesting stuff there too… There are VR “Cave” spaces – where you enter and are surrounded by screens. There are gloves, there are other kinds of experience.

Part of virtual reality is about an intense idea of presence, of being there, of being immersed in the world, fully engaged – so much so that the interface disappears, you forget you are using technologies.

In education VR is not anything new. The first applications were in the 1990s… But in the 2000s desktop VR becomes more common – spaces such as Second Life – more acceptable and less costly to engage with.

I want to show you a few examples here… One of the first experiments was from the Institute for Simulation and Training, PA, where students could play “noseball” to play with a virtual ball in a set of wearables. You can see they still use headsets, similar to now but not particularly sophisticated… I also wanted to touch on some other university experiments with VR… The first one is Google Expeditions. This is not a product that has been looked at in universities – it has been trialled in schools a lot… It’s a way to travel in time and space through Google Cardboard… Through the use of apps and tools… And Google supports teachers to use this.

A more interesting experiment is at Stanford’s Virtual Human Interaction Lab, looking at cognitive effects on students’ behaviour, and perspective-taking in these spaces, looking at empathy – how VR promotes and encourages empathy. Students impersonating a tree are more cautious about wasting paper. Or impersonating a person brings more connection and thoughtfulness about their behaviour to that person… Even an experiment on being a cow and whether that might make them more likely to become vegetarian.

Another interesting experiment is at Boston University who are engaging with Ulysses – based on a book but not in a literal way. At Penn State they have been experimenting with VR and tactile experiences.

So, to conclude, what are the strengths of VR in education? Well, it is about experiencing what is otherwise not possible because of cost, distance, time, size or safety. Also non-symbolic learning (maths, chemistry, etc); learning by doing; and engaging experiences. But there are weaknesses too: it is hard to find a VR designer; it requires technical support; and sometimes VR may not be the right technology – maybe we want to replicate the wrong thing, maybe not innovative enough…

Q&A

Q1) Art Gallery/use in your area?

A1) I would like to do a VR project. It’s hard to understand until you try it out… Most of what I’ve presented is based on what I’ve read and researched, but I would love to explore the topic in a real project.

Q2) With all these technologies, I was wondering if a story is an important accompaniment to the technology and the experience?

A2) I think we do need a story. I don’t think any technology adds value unless we have a vision, and an understanding of full potential of the technology – and what it does differently, and what it really adds to the situation and the story…

Coming up…

Afternoon Keynote: Dr Hamish MacLeod, Senior Lecturer in Digital Education, Institute for Education, Community and Society, Moray House School of Education: Learning with and through Ambiguity (Chair: Cinzia Pusceddu-Gangarosa)

Nicola was talking about her youth and childhood… I will share one too… I knew all was fine and well, that I was expected… When my primary school teacher, when I was 7, pushed me into the swimming pool. This threat was absolutely no threat… My superpower was to arise miraculously undrowned. That playful interaction was important to me, it signalled a relationship, a sign of belonging… A trivial example but…

Playfulness can cement relationships between learners and teachers without judgement. Something similar arose when we ran our eLearning and Digital Cultures MOOC, there was gentle mocking of the team… So for instance there was a video of us as animated Star Trek characters – not just playful but reflecting back to us our ideas… My all time favourite playful response, one of the digital artefacts created, was Andy Mitchell’s intervention… A fake Twitter account, in my name, automatically tweeting cyber security messages. We thoroughly approved and Andy was one of our volunteer tutors on the next run.

Brian Sutton Smith talks about the ambiguity of play, understanding play in both humans and animals, in young and in mature individuals. He came up with 7 rhetorics of play:

  • Play as progress – child development. We talk about children playing, adults engaging in recreation. We can be frivolous as adults but not playful.
  • Play as fate – games of chance
  • Play as power – competition and sporting prowess
  • Play as (community) identity, festivals and carnivals
  • Play as the Imaginary and phantasmagorical, narrative and theatrical
  • Play as Self-actualisation
  • Play as Frivolous

For adults play is seen very differently. Bruegel shows play as time wasting and problematic… But then he also paints children at play… I commend to you the Wikipedia article on Children’s Games – over 80 games defined there and that definition is interesting.

When dogs playfight they prolong “fight”, they self-disable to keep things going… Snapshots don’t tell you it’s playful… But it is. And one give away is the pre-cursor… playful postures, trusting postures… The play context is real. The nip is not a bite, but neither is it not-a-bite… It’s what the bite means (Schechner (1988)). Context is everything. There is a marvellous quote on the about-ness of learning “there is a reason that education is an accusative”.

I was very taken by Jen Ross’ Learning with Digital Provocations talk at the CAHSS Digital Day of Ideas earlier this year. This talk aims to reanimate the debate, framing disruption in terms of “inventiveness, provocation, uncertainty and the concept of ‘not-yetness’”. Disruption says it is revolutionary, but it is not all that revolutionary in its reality. We have gurus of disruption claiming that university is for training students for jobs…

Assuming we aren’t training students for the zombie apocalypse, what are we doing in Higher Education? Well we want our students to behave intelligently, in the sense of Piaget. Or in the sense of Burrhus Frederic Skinner – “Education is what is left when what we have learned has been forgotten”. There are some sorts of skills and mindset for meeting new challenges that they have not met before…

So that mindset, those skills, do have some alignment with Sutton Smith’s ideas of play. And I wanted to show a wonderful local example [a video of Professor Alan Murray – who actually taught me with the innovative and memorable electric guitar/memorable analogies shown in the video].

So, before I move on… Who wants to get the ball into the cup? [cue controlled chaos with three volunteers].

One classic definition of play (Man, Play and Games – Roger Caillois) – it is entered into freely, it has no particular purpose. And again Bernard Suits’ The Grasshopper, it has my favourite definition: “the voluntary attempt to overcome unnecessary obstacles”. This is the lusory attitude Nicola talks about… And that is about overlooking efficient solutions simply for

Thank you to Ross Galloway for an example here… Enrico Fermi came up with the idea of Fermi Problems – making informed guesses and estimates for calculations where a solution does not (yet) exist. Estimation is an important skill. Ross’ example was “can we estimate how much it costs to light all of Edinburgh” – students had their own answers… But tutors saw similar and precise results. Of course you can Google “what does it cost to light Edinburgh”. To use that is to miss the point. The students getting that answer and using it don’t get that it is about understanding how to approach the problem, not what the answer is.
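[An aside from me rather than from the talk: the shape of a Fermi estimate like this can be sketched in a few lines of Python. Every number below is a made-up placeholder for illustration, not a figure from Ross Galloway’s exercise or a real Edinburgh statistic, and the function name is my own. The point is the chain of reasoning, not the answer.]

```python
def estimate_annual_lighting_cost(population, people_per_streetlight,
                                  lamp_watts, hours_per_night, pence_per_kwh):
    """Chain rough guesses into an order-of-magnitude annual cost in pounds."""
    # Guess the number of streetlights from the size of the population.
    streetlights = population / people_per_streetlight
    # Energy used per year: lights x kilowatts each x hours lit x nights.
    kwh_per_year = streetlights * (lamp_watts / 1000) * hours_per_night * 365
    # Convert pence to pounds.
    return kwh_per_year * pence_per_kwh / 100


# Hypothetical guesses only: none of these numbers come from the talk.
guess = estimate_annual_lighting_cost(
    population=500_000,
    people_per_streetlight=8,
    lamp_watts=100,
    hours_per_night=11,
    pence_per_kwh=12,
)
print(f"Roughly £{guess:,.0f} per year, give or take an order of magnitude")
```

Changing any one guess and re-running shows how sensitive the estimate is to that assumption, which is the part a Googled answer never teaches.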

But there is a real challenge to find problems that cannot be solved by Google… That ensures there is that space for play and creative approach.

I wanted to give an example here of the Introduction to Digital Games-Based Learning module on our MSc in Digital Education, which was set up with a great deal of input from Fiona Hale, and many Edinburgh colleagues, as well as Nicola Whitton herself and her book Learning with Digital Games. Other key texts for us are Digital Games Based Learning by Marc Prensky, and James Paul Gee’s What Video Games Have to Teach Us About Learning and Literacy. These two books are radically different. Prensky talks about games mediating instead of traditional approaches, as a modern solution. Gee meanwhile suggests drawing on successful principles from games – he argued that learning should become more playful and game informed. I would say that Gee has been firmly on the right side of history… And that’s reflected in the naming of today, of Nicola’s conference, and indeed the recommended new ALT Playful Learning Special Interest Group – which has changed its name from games based learning.

I’m not saying that we shouldn’t be using games in learning… but…

Gee outlines 13 principles, and I’d like to draw out:

  • co-design

Students having agency and taking decisions in their learning, project based, resource based learning, etc. I would say an excellent local example here is the SLICCs – the Student-Led Individually-Created Courses, also Pete Evan’s work on micro credit courses. So this is about agency, ownership. Although learners may need structure and constraint to have agency in this context.

  • Identity and belonging

Identity formation is what happens in education, and also in games (a warlock, a blue hedgehog, etc.), our law students will become lawyers, our medical students will become doctors. Not to be disrespectful but it can be useful to think about our students as “playing at” being professionals. Lave and Wenger talk about this process, of legitimate peripheral participation. Gee talks about the real, virtual and projective identities… “Saying that if learners in classrooms carry learning so far as to take on a projective identity, something magical happens…” A marvellous example of this, and you can see a video on MediaHopper, is the “white coat ceremony” at the Vet School. The aim is to welcome students into membership of the profession and the community. Certificates are presented, photographs are taken, students then get to put on their white coats for the first time… And then students stand and recite the vet student oath. And then they celebrate with rather staged photographs! And at the end of the year they have a big class photo – in their animal onesies! This is serious fun, this is a rite of passage, it is a welcoming into the community. And it symbolises taking on that identity…

  • Fish tanks and
  • Sand boxes

This is about safe space to explore, to play, to experiment. There is a key role for us here in our own practice, but also in how peers may impact on that safety… Sometimes we need to play the fool ourselves, to take the pressure off a student, to take the fire away from someone and reinforce play as legitimate. This is the idea of “teacher as jester”. Sandboxes are about tinkering, and ideas of bricolage come in here – as in the constructivism of ? and in Sherry Turkle’s work. And indeed uCreate and 23 Things, space to play and create and have space to think…

It has been said by some students that they don’t enjoy our games-based learning course… And then they do the course design module and then they get it… And in that spirit… I would heartily recommend Charlie Farley and Gavin Willshaw’s Board Game Jams – in one hour there is new language, metaphor, ways to think about what we do in education.

In our course we do also talk about stories…. I’m sure many of you have worked with the notion of role playing, but there are wider and more inclusive approaches in “scenario learning”. One of our former colleagues here, Martin Crapper, talked about environmental enquiry processes and, rather than lecture on it, he actually held an environmental enquiry. The students were engaged, over 2 days, to come with prepared statements as they would as expert witnesses in an environmental enquiry for developers or environmental groups. So rather than read about or think about, they participated.

Another example, a student on our course who teaches at the University of Sunderland, uses a legal appeal around illegally selling cinema tickets. Sophie presents to her students a letter that she asks them to imagine she found in the archive… In this case the correspondent wants an expert witness in cognitive psychology on parsing and understanding evidence that might be used there.

And briefly… Gamification… This sits in various ways, such as PeerWise. The bit that I think is most useful here isn’t answering questions for peers, but authoring questions. Many of the features here are reminiscent of social networks – you can upvote, follow, engage. There is also a pushback by some students who see this as a trivial experience… A push back to Sutton Smith’s frivolity… And we also have Top Hat – voting on learners’ own mobile devices, the new equivalent of clickers… Lecture events can be made interactive in this way… Students should be presented with a question, vote on it, and the idea is that when students discuss the question they are more likely to converge on the right answer… What about students who got the right answer right away… What do they talk about? The evidence is that they discuss the problem, other boundary cases… They continue to “play the game” here… They keep that lusory attitude.

I will mention this book – although failure has been well covered – “The Art of Failure: An Essay on the Pain of Playing Video Games” – Jesper Juul.

So I argue that ALL conceptual learning is accompanied with the sound of pennies dropping… a trickle or a clunk. We talk about threshold concepts, we can see these or recall these as issues we have overcome. All learning involves thresholds and overcoming them. All learning happens in a liminal space, becoming, crossing over, finding the right place to cross over… We may have to try different ways… Often we have to engage actively in doing that which we wish to understand. Papert talks about students learning by doing, and reflecting on what they have done – bricolage. I would argue this is the territory of Vygotsky’s Zone of Proximal Development – I can succeed with the help of a skilled peer, it is the space of apprenticeship… It is scary but if we embrace the ambiguity of play we can make learning more successful.

Q&A

Q1) I wanted to ask you about the twin idea of risk and safety. In your talk I don’t know if the playful learning you are talking about should be risky and dangerous, or should it just look like that?

A1) I think perhaps it should be increasingly risky… So that one is supported in taking risks. Gee talks about psychosocial moratorium – young people have permission in the world to screw up! It’s about the consequences, and about our responsibility to protect them from the consequences. As Nicola talked about earlier assessment can be the issue, the barrier to working with our students… But we work up to have more risky engagements.

Q2) I wondered if the bit that isn’t a bite, but isn’t not-a-bite is a kind of parody… That students kind of have to “fake it till they make it”

A2) I want my students to play along, to engage in parody and metaphor and play with that, rather than it being tangential illustration, but engaging as if in the real world…

Q3) Especially from your examples, the ease you can find information on Google, that really is changing education. Information is now so easily found, you have to engage people in the process of learning, not of information transfer…

A3) I was very much struck by that example from Ross. That idea we have of working with facts… And that sometimes you don’t want the facts… Students must be capable of trying things they have never done before. And of wanting genuine engagement not just the answer to the problem, but the trajectory towards the solution… How do we set that up as the thing we are doing… Maybe we have to be more explicit about that… We can create tasks around finding stuff… But… How many of us hate when, in a pub conversation, someone runs to the phone to find the answer to some name you can’t remember… For me I want that solution and I’ll grab the phone… But some people want that chat, they don’t want the answer… It’s the voluntary espousal of unnecessary obstacles.

Q4) A comment and nice example… I read that meerkats remove the sting from scorpions and give it (the scorpions) to their young so that they can explore without being stung.

Comment) It is true… It’s the only animal, other than humans, that teaches in stages… So they will give a scorpion without a sting and work up to the full scorpion…

Q5) Would you apply that need to fail to those teaching them…

A5) I think so, Jen and I have talked about the idea of “successful not-knowing”. Another Ross Galloway example… The student asks a question, with a new example… And he has a better example… So he works through that instead… And he goes wrong… he works back… the students comment from the side… and he moves on… And in his course feedback the students said “we liked it when you made a mistake, we really saw physics being done”. A few years ago at Networked Learning a speaker in favour of the lecture said it was an opportunity for students to see the lecturer “thinking on their feet”… That thinking on their feet is the key bit, the discussion, the extemporising… That’s where I feel comfortable – but others will not. But the risk is absolutely for them (the students), but we have to model that risk and they will respect it when they see it. So again Alan Murray would lose some of his dignity [in his playful approaches], but not respect… He was secure in his position.

And with that we moved out into further breakout sessions… 

Afternoon Break Out Sessions

  • Playful Learning Mini Maker Space – Michael Boyd
  • 23 Things – Stephanie (Charlie) Farley
  • DIY Film School (Gear and Gadgets) – Stephen Donnelly
  • Gamifying Wikipedia – Ewan McAndrew
  • Near Future Teaching Vox Pops – Sian Bayne

To finish the day we have several more short presentations… 

Presentations (Chair: Ross Ward, Learning Technology Advisor (ISG Learning, Teaching & Web Services))

Learning to Code: A Playful Approach – Areti Manataki

I’m a senior researcher in the School of Informatics. And I’ve been quite active in teaching children how to programme both online and offline. My research is in AI, rather than education, so I’m here to share with you and to learn from you.

So Code Yourself! was a special programme that ran in both English and Spanish, organised by University of Edinburgh and University of Uruguay. The whole idea was to introduce programming to people with little or no experience of coding. We wanted to emphasise that it’s fun, it’s relevant to the real world, and anyone can have a go!

We covered the basics of algorithms and control structures, computational thinking, software engineering and programming in Scratch. This was for young teenagers and set real challenges, with some structure, using Scratch. And we included real life algorithms – like how to make a sandwich – and we had strong visual elements thanks to the UoE MOOCs production team. And it was all about having fun, including building your own games of all types… Like hunting ghosts, or plants vs zombies…

The forum turns out to be a really important space in the MOOCs. We encouraged them to use the forums, to discuss, and we included tasks and discussions in the weekly emails… So we would say, go to the discussion boards to teach an alien how to brush his teeth. Or draw an object and have people guess what it is about.

The course has been running a couple of years, our reach has been good… We have had over 110k participants and 2881 course completers. There are slightly more men than women. The age profile varies, though the proportion of young people is higher than on a typical Coursera course. And older audiences enjoyed the programming. And we asked if they plan to programme again in the future, many plan to and that’s brilliant, that’s what I hoped for.
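[A quick aside from me: the two figures quoted imply a completion rate, which can be checked in a couple of lines. Since the participant count was given as “over 110k”, I treat 110,000 as a floor, which makes the result an upper bound. The variable names are mine.]

```python
# Figures as quoted in the talk; "over 110k" is treated as a minimum.
participants = 110_000
completers = 2_881

# With at least 110,000 participants, the true rate can only be lower.
completion_rate = completers / participants
print(f"Completion rate: at most about {completion_rate:.1%}")
```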

We’ve also been running coding workshops at the Edinburgh International Science Festival, with kids having a go and trying to build their own game… We’ve had games with dinosaurs flying over cars! And they seem to enjoy it – and excitement levels are high! Many of these kids are familiar with Scratch, but they liked the freedom to play.

And since the MOOC was working for adults, we’ve tweaked it and run the course for students and staff. Again, it came out as being fun!

We now want to reach out to more people. I actually do this in my own time, I’m passionate about it and want to reach young people, especially from disadvantaged backgrounds, and teachers who can use this in the classroom!

Q&A

Q1) You commented that it was aimed at young people but also that Coursera’s audience aren’t really that young. Is the course running on demand? What’s the support?

A1) We first launched in March 2016 as a session. Then moved in August to monthly sessions, with learners encouraged to move to the next session if they haven’t completed. I was a bit scared when we moved to that phase in terms of support. But Coursera has launched the mentors scheme, where previous learners who enjoyed the course can get involved more actively on discussion boards. They do a fantastic job and learners are supported.

Q2) Is there a platform aimed at a younger audience?

A2) I’m not sure. The Uruguay university have experience in online teaching with youngsters and they designed their own platform – and it didn’t look that different. But I do think the style needs to be more appropriate.

Q3) You talked about working with younger audiences – who maybe encounter coding in schools, and your Coursera audiences and those young people’s parents may be seeing code from their kids or from professional contexts… Have you looked at teaching for an older audience who tend to be less digitally engaged in general?

A3) Yes, we did have some older MOOC takers… And they are amongst the most enthusiastic and they share their stories with us. We have done an online session with older people too and as long as you make them feel safe, and able to use the technology, that’s half the battle. When they are confident they are some of the most enthusiastic participants!

Enriched engagement with recorded lectures – John Lee

I’m going to talk about enriching recorded lectures and thinking about how we can enrich recordings with other content… This work is now turning into a PTAS project with colleagues in Informatics.

Rich media resources are increasingly recorded and created with the intention of using them for teaching and learning. And recorded lectures are perhaps the most obvious example of such materials. We capture content, but an opportunity feels like it is being lost here, there is so much opportunity if we can integrate these into some new ecology of materials in an interesting way.

So, what’s the problem with doing that?  We have tools for online editing, annotation, linking to resources… These are not technically difficult to do… The literature captures lots of approaches and systems, and often positive experiences, but you don’t then seem to see people going on to use them in teaching and learning… And that includes me actually. Somehow it seems to be remarkably difficult… Perhaps interfaces are somehow not yet optimal for educational uses. If you look at YouTube’s editor… It works nicely… But not a natural thing to use or adapt to an educational context… Perhaps if we can design easier ways to present and interact with rich media, it can be more successful.

And maybe part of this is about making it more fun! It could be that doing this kind of work isn’t necessarily playful… But working with these kinds of materials should be assisted by playful approaches. And one way to do that is to bring in more diverse sources and resources, bringing in YouTube, Vimeo… Bringing in playful content… Perhaps we can also crowdsource more content creation, more content from students themselves… And build content creation hand in hand with content use… Designing learning activities that implicate and build on use and creation of rich media, like Lynda.com for instance…

So, I’ve been building a prototype based on APIs from various places, just built with JavaScript. We have tried this out on a course called “Digital Playgrounds for the Online Public” (Denitsa Petrova). Students found it interesting and raised various issues… Again I’m following this tradition of trying things out, seeing it is interesting… But I now want to avoid the trap of not going any further with it… How? Partly by integrating with student projects – Informatics and Digital Media students – to design ways to combine this media in new and different ways. Some real possibilities there…

Q&A

Q1) I really like this idea. I trialled an idea with new students around writing up lecture notes, uploading, sharing, upvoting, comments… A few students tried it and then it fell off exponentially. But one student kept it going all year – and apologised if he didn’t share his notes… But in future years no one did that… You are talking about taking up a lecture and scaling up and building on… But students sometimes want to go the other way… What else will you add for uptake?

A1) That’s a really good point. The ideas in our proposal tackle that head on. We want to design learning activities that produce and build on those recordings – by creating resources, linking content together… The idea is that if you leave them to do it by themselves, they often won’t. But if we structure it into the course itself, it becomes a learning process, and they reflect on the course itself. So the lecture material may be used in a flipped classroom type of way… So they take from and work with the video, rather than just watch it. Sometimes we use pre-recorded content and that needs to be part of the picture too… Engagement is always more difficult… We want to foster engagement, and if we can do that the rest should fall into place.

DIY Filmschool and Media Hopper (MoJo) – Stephen Donnelly

I’m Stephen Donnelly and I work with the Media Team in LTW. I’ll talk about some of what we do…

The idea for the DIY Film School came from MoJo – the idea of Mobile Journalism… And how journalism is changing because of technology… The idea is that we are so used to seeing media on YouTube, filmed on a mobile phone… There’s no point commenting on production values when we see that. We are so used to seeing mobile footage used in broadcast. There are two reasons for this… The way we view media has changed, but also we now all have mobile devices. We all have a mobile phone… And broadcasters have figured that out. And most news stories now have a video element in there… So we wanted to help others to engage in that.

So, how did we get here…

I used to work with the BBC and we had gotten to the point where you sent out producers and cameramen… You probably take videos all the time and share them online… But you get to a professional context and everything changes… But you can shoot your own stuff. So we have purchased some inexpensive kit that works with your existing device – mics, rigs, lenses. And, in addition, we run a DIY film school. It teaches the basics of film making – that apply no matter what you use to film… Framing, being stable, how to zoom, lighting and shooting in appropriate light… And audio. People often forget about audio and actually, people tolerate bad video but you really need good audio… And how to be prepared when doing a shot.

So, after you’ve made your amazing films from DIY Film School where do you put it? Well we have MediaHopper… And you can do really cool stuff – store your content, share your content, etc.

I have a few examples of films made with the DIY film gear… So here we have a “how to” video for shooting an interview with two mobile cameras so that you can cut between the two shots – a simple interview set up. Another option here, a video on “Life after Cardiac Arrest” – they had commissioned out before, now shooting themselves, upskilling the team, and they make some really really nice stuff. And lastly is a video by Michael Seery who has students making videos – in this case how to use a UV-vis spectrophotometer. It’s all about the content and not about the technology.

Q&A

Q1) How much do those mobile rigs tend to cost?

A1) These are in the region of £100 for a steadycam rig… I could put a target on John, say, and it will follow him. It’s so cheap compared to what we would have purchased in the past. Nothing is more than ~£100 – you can buy lots rather than one camera. And you can loan out our rigs too. You guys have all the best content, it’s getting you guys to use it…

Q2) Have you

A2) We had a colleague working in archives, wanting to capture the process… They were doing it through pictures… and making their own videos has changed how they communicate their work… And MediaHopper has changed how academic colleagues are sharing their work in lots of ways, not just DIY Film School…

Q3) Great to see how easy things are now. One of the things that we are keen to do in the School of Education is adding captions. That’s easy on YouTube, how about in MediaHopper?

A3) We are running a pilot at the moment. Including manual and automated captions. The latter is easier but more hit and miss. Get in touch to get involved.

Closing Remarks – Prof. Sian Bayne, Moray House School of Education

Nicola and team asked me to close the content. The theme of the conference feels spot on at the end of a busy and exhausting year. Today has been a lovely reminder to bring playfulness into our everyday lives. Thank you to our fantastic speakers Nicola and Hamish, to colleagues who have presented, and run breakouts and posters all day!

Thank you to Nicola, this is her last conference as convener so huge thanks. Thank you to Ross Ward. To Charlie Farley. To Susan Greig. To Ruby Rennie. And also to Marshall Dozier. And a special thank you to Cinzia Pusceddu-Gangarosa who I know Sian also meant to thank in her talk. 

And I want to invite everyone left to come and drink wine and eat cheese as a very very informal leaving do for Hamish who is retiring. Thank you to everyone for everything, and for coming!

And with that we are done… And I’m off to drink wine and eat cheese… 

Jun 21 2017
 

Last Thursday I attended the Guardian Teacher Network Seminar: Technology in schools: money saver or money waster? at Kings Place, London. The panel was chaired by Kate Hodge (KH), head of content strategy at Jaywing Content and former editor of the Guardian Teacher Network, and featured:

  • John Galloway (JG), advisory teacher for ICT/special educational needs and inclusion, Tower Hamlets Council.
  • Donald Clark (DC), founder, PlanB Learning and investor in EdTech companies with experience of teaching maths and physics in FE in the UK and US.
  • Michael Mann (MM), senior programme manager, education team, Nesta Innovation Lab.
  • Naureen Khalid (NK), school governor and co-founder of @UkGovChat.

These are my live notes from the event – although these are a wee bit belated they are more or less unedited so comments, corrections, additions etc. are welcomed. 

The panel began with introductions, mainly giving an overview of their background. The two who said a wee bit more were:

John Galloway, specialist on technologies for students with special needs and inclusion, I work half time at Tower Hamlets with students but also a lot of training. It’s the skills of adults that is often the challenge. The rest of my time I consult, I’m a freelance writer, I am a judge of the BETT awards.

Michael Mann (MM), NESTA, our interest is that we don’t think EdTech has reached its potential yet… Our feeling is that we haven’t seen that impact yet. And since our report five years ago we’ve invested in companies and charities who focus on impact. Also do research with UCL, and work with teachers to trial things in real classrooms.

All comments below are credited to the speakers with their initials (see above), and audience comments and questions are marked as such… 

KH: What’s the next big thing in tech?

DC: It’s AI… It’s the new UI no matter what you use really… I only invest in AI now… Education is curiously immune from this at the moment but it won’t be… It is perfect for providing feedback and improving the eLearning experience – that crappy gamification or read then quiz experience… We are in a funny transitionary phase..

MM: There has been an interesting trend recently where specialist kit is becoming mainstream… touch screens for instance, or speech to text… So, I think that is closing the gap between our minds and our machines… The gap is closing… The latest thing in special educational needs has been eye games – your eyes are the controller… That is moving into mainstream gaming so that will become bigger… So I see a bigger convergence there… And the other thing I see happening is VR. That will allow children to go places they can’t go – for all kids but that has particular benefits and relevance for, say, a child in a wheelchair. For autistic children you put them in environments so they can understand size, lights, noise, and deal with the anxiety… before they visit…

KH: What are the challenges of implementing that in the classroom?

JG: The tech – and costs, the space… But also the creativity… A lot of what’s created are not particularly engaging or educational. I’d like to see teachers able to make things themselves… And then we need to think about pedagogy… But that’s the big issue…

DC: I can give you an example in the context of teaching Newton’s Laws with kids… We downloaded a bunch of VR apps… And the NASA apps there were great for understanding and really feeling Newton’s three laws… Couldn’t do that with a blackboard… And that’s all free…

KH: How accessible is that… ?

DC: Almost every kid has a smartphone… Google Cardboard is maybe £5… It’s very cheap… It won’t replace a teacher, at least not yet. I wouldn’t teach basic mathematics with VR, but I wouldn’t teach Newton’s three laws any other way…

MM: We are piloting a thing called RocketFund and one of the first people to use VR used it in history… After that ran we have about 10 projects because they’d seen what was possible…

DC: “Fieldtrips” can be free… I’ve also seen a brilliant project with a 360 degree camera in a classroom used in a teaching space – a £250 camera – and brilliant for showing issues with behaviour, managing the classroom etc.

NK: Now if something is free, I would have no objection at all!

KH: How do you measure impact?

NK: Well if someone has a really old PC and it runs slow… that’s a quick and clear impact. But it’s about how they will use it, what studies are there and are they reliable… Could you do this any other way? What’s different?

MM: A lot of these technologies do not have evidence on them… But you will have toolkits, ideas that are well grounded on peer instruction, or tutoring… If you can take pedagogical approaches and link it to a tool you are using, that’s great. There’s work on online tutoring, and there is a company which provides tutoring from India… And I want to know how they ensure that they follow established criteria…

DC: I think we’ve had a lot of device fetishism… We’ve seen huge amounts of tablets imposed… and abandoned… You have to regard tech as a medium – not a gadget or a school. I think we’ve had disastrous experiences with iPads in secondary schools… They work in primary schools but actually writing on iPads doesn’t work well… It’s a disaster… And it’s a consumer device, not enabling higher order writing, coding, creation skills… I recommend that you look at Audrey Mullen’s work – she was a school kid when she started a company called Kite Reviews… She said we don’t want tablets or mobiles, that laptops were better…

Comment: What about iPads in schools… I did a David Hockney project with Year 10 students, that riffed off his use of iPads and the students really engaged with it… I’ve also used it in a portrait project as well… And one of the things I’m interested in is how you use it in more than writing and literacy…

JG: I just want to come back to measuring impact… It depends what you want to use it for… Donald gave us an example of using an iPad for the wrong thing, and from the audience that example of using iPads in the right ways… No-one in industry would code on an iPad… We have to use technology appropriate to the context and the wider world.

KH: How would you know that?

JG: As a teacher you have to gain expertise and transfer that to your teaching…

KH: You might be an expert in history but not in ITT…

JG: As a teacher you have to understand the technology you are being given to use… You have to understand the pedagogy… And you have to prove to teachers that the technology will improve their practice… I’m not sure any teacher has ever taught the perfect lesson, you always can think of ways to improve that… And that’s how you consider your work… One of the best innovations in teaching has been TeachMeets – informal exchanges of practice, experiences, etc. The reasons technology in classrooms is not as successful as it should be are complex…

NK: I know of someone who purchased an app, bought into it, sent people off to training… But it was the wrong app for what you are trying to do… So do the research first before you purchase anything…

DC: I think that the key word here is procurement… And teachers shouldn’t be doing that with hardware… You have to start with teaching needs, but actually general school software too – website, comms with parents, VLEs etc… It’s back end stuff… Take the art example… I know lots of artists… none using iPads… They use more sophisticated computers that enable the same stuff and more… It’s not David Hockney, that’s the tail wagging the dog… It’s general needs… Most kids have devices… I’d spend money on topping up for inclusion… And you have to do that cost benefit analysis first…

MM: Cost benefit analysis and expert approaches aren’t realistic in many schools… Often it’s more realistic to do small scale trialling… If it works, guide their peers, if not, then quit there… Practical experimentation, test and learn is the way forward I would say…

JG: I think that the challenge is often the enthusiast… You need to give things to the cynic!

DC: There is a role for sensible professional advice. In Higher Ed we have Jisc, we are quite sensible… But we don’t have that advice available for schools… It all goes a bit odd… It’s all anecdotal rather than evidence based… Otherwise we are just pottering about… And we end up with the lowest common denominator in terms of skills and understanding…

JG: I’m getting a bit nostalgic for BECTA, and NESTA FutureLab… doing interesting stuff. A lot of research now is funded by companies engaged in the research…

MM: I agree… but there is no evidence for white boards, tablets, whatever as they don’t work on their own… Has to be evidence informed…

DC: Cost effectiveness is always about tech as an intervention in education… The evidence for schools is that writing accuracy goes down 31% and is a huge problem on tablets… Unless…

NK: There’s good evidence that typing notes in class doesn’t work

DC: Absolutely… Although there is plenty of evidence that lectures don’t work and we still do that… They have power devolved and in my view they are not really teachers… That happens every day…

Comment from audience: That doesn’t happen every day…

MM: We have to be careful about how we use the word evidence… Lectures may not be correlated with success but that may be to do with the quality of teaching staff, of lecturers…

KH: One of you talked about giving technology to the cynic… How do you overcome this…

JG: I think that the doubter, the cynic… will ask all the questions, find all the faults… But also see what works if it works…

KH: Often use of tech comes down to the enthusiasts and evangelists… But teachers lack space to be creative… How can we adopt technology if we lack that time and opportunity…

JG: We have so much more technology now, it has permeated our lives more… Our thinking, our discussion, potentially our classrooms… But I haven’t seen smartphones in schools much yet… We haven’t talked about bring your own device… There is an element of risk.. potential for videoing, for sharing bad practice, for bullying and harassment… But there is a lot of nervousness there…

DC: I think we have to move away from just thinking about technology in the classroom. I’m dead against it. Bring tech into a room in a one-to-many context… I’d rather use learner technology… Good teachers are teachers in the classroom… Kids really use tech at home, with homework… When you struggled when I was a kid you got stuck… but now you can use devices… to find the answer but also the method… And we have adaptive learning that can tailor to every kid. I think learner technology and away from the classroom is where it needs to be… Rather than the smart board debacle… Where one minister brought that in, Promethean made millions…

JG: I don’t recognise the classroom you are describing… I see teachers using technology, with big changes over the last twenty years… It is the appropriate use of technology in the appropriate places in learning… And thinking about the right technology for the job… If we took technology out of the classroom we’d just have lectures wouldn’t we?!

DC: The issue of collaboration is interesting… There is work from Stanford that many group works/collaborative technological driven things in the classroom… That most kids aren’t doing anything, but it looks collaborative… versus a good teacher doing the Socratic thing…

MM: I don’t think the in/outside the classroom thing is as important as the issue of what works, how things adapt, immediate feedback to with FitTech…. But it all comes back to pedagogy….

NK: It all comes back to what the problem is that you are trying to solve…

KH: What about the right way to do this… There’s the start-up like run fast, fail fast approach… Then the procurement approach…

NK: We want evidence based procurement… I don’t want to fund trials… Schools are poor…

KH: Start ups don’t throw it and see if it works… They use data to change their approach…And that’s what I’m talking about… Trialling then using evidence to inform decisions…

DC: The last thing I want to do is to waste time or money with start ups going into schools… I think taking risks in schools like that is very risky… I’m also not sure governors should be procuring… The senior team should… But often there is no digital strategy… It needs to be tactical not strategic…

JG: Suppose we get the kids to assess the start up product… There is a great project called Apps For Good… It gets kids to engage in the idea, the design process, the entrepreneurial aspect… There is a role for start ups for teaching kids about how this happens… I think education is a risky business anyway… We think something good will happen, kids have to trust the teacher… I think risk can be quite a healthy thing, and managing risk… Introducing something new can be edgy and can be quite invigorating…

NK: As a governor I don’t want my school going into the red financially… We need to operate within our means…

KH: It wasn’t about start ups in the classrooms… Even a small spend…. Can be risky…

MM: Isn’t there a risk of a big roll out of something that doesn’t work for your school? Some risks will feel riskier than others… School culture and character all matter…

JG: We do have examples of technologies that didn’t work but now do… VLEs didn’t take off… Schools don’t use them… It was an expensive risk… But many use Google Classroom which is essentially the same thing… It’s free but needs maintenance…

DC: Actually with new start ups… you want evidence, you want research to prove the usefulness. 50% of start ups fail, and you don’t want to adopt stuff that will fail…

JG: But someone has to try things first, to try new things, to bring something new into the classroom.

KH: How do we take Ed Tech forward… ?

DC: At risk of repeating myself… Professional procurement, technology strategy, strategic leadership in this…

Comment from crowd: Where do you get the evidence if you don’t test it in the classroom…

DC: I am involved in a big adaptive learning company… We are doing research with Cambridge University…

Comment from crowd: so for the schools taking part, that is a risk!

DC: No, it’s all carefully set up, with control groups… Not just by recommendation by colleagues…

JG: Setting up trials in schools is incredibly difficult, especially with control groups… Even if you do that you have to look at who was teaching, who was unwell then, etc. It’s very very hard to compare… And if it is showing improvement then morally should you withhold that technology from some pupils… One of the trials I can think of was around use of iPads… Give them their own budget for apps… But give them free choice… And then have them talk about that… It’s a trial but it’s very low cost, it’s very effective, it’s judging fit of tech to the space…

NK: I’ve known schools go for the iPad whether or not it works… Why go for the most expensive tablets… to try them!

DC: In the US there was a 1.3bn deal with Apple in California… And iPads are not there now… They now use Chrome Books…

JG: But that was imposed from the top.. And that’s an important issue…

Comment: I want to take issue with something Donald was talking about… I am all in favour of evidence based research and everything… But it is hard to find time to find the research, and a lot of effort to actually read through it… 3 pages of methodology before the conclusion… By the time it’s published it’s out of date anyway… I write about evidence on my website and often no firm conclusions come out of this… Ultimately anecdotal evidence matters… Asking questions of what was this trying to solve, what worked, what didn’t… Question: does Donald agree with me?

DC: No!

Comment: We all know the digital age is coming, kids have to work with computers, how can schools prepare children for that work and keep traditional teaching too..?

MM: For me there are two aspects: digital skills like codeclubs, programming… The other side is that when we are in this world with automation, what sort of jobs will survive… We have a report at Nesta called Creativity vs Robots… Skills that are most robust are creative, collaborative, dexterous… Preparing kids for the future still requires factual knowledge but also collaborative and problem solving skills… It’s not that it doesn’t exist, we just really need to focus on that…

JG: Maybe controversially I will say that we don’t… We should teach flexibility and to learn. A few years back I wrote for TimeEd… I visited Harrow – relatively unlimited funding… They don’t teach computing… They don’t get there until Year 9… Prep schools don’t teach it… Not “academic” enough for A-level or GCSE. They do some ICT skills… I guess they will get jobs, good ones… But they don’t prepare them for that… They prepare them to be leaders and the elite… I’m not necessarily sold on the idea that you have to prepare kids to be the makers… We teach reading and writing, but not digital literacy… Or how to read a film or a computer game, why failure is important… We don’t teach that… We might teach them how to create the game… So in part “don’t” and in part “expand the curriculum”.

Comment: For Mr Galloway… Why did you go to Harrow not Eton… They invest in innovation and you get to be amused at top hats and tails?

JG: Tube ride!

DC: It would be madness to ignore technology in schools… But coding is this year’s thing… ! Kids need skills when they leave school…

NK: I have great problems with the idea of 21st Century skills… We can’t train kids for jobs that don’t exist… Jobs from hundreds of years ago…

MM: There is a social justice aspect here… Mark Zuckerberg went to one of the top schools… If we don’t expose all children to technology opportunities they can miss out…

JG: In Harrow they don’t impose technology on teachers… but they get it if they ask for it. They also give kids Facebook accounts and teach them how to use it…

Comment: When we think about technology in schools, when do we think about teachers perspective… can we motivate and engage students with 21st century skills and possibilities…

NK: With all the money in the world, yes. We are in the position where schools can barely afford the teachers… We have to live within their means…

DC: Are teachers the right people to teach these skills… Is that what teachers are best suited to… Not sure subject orientated teachers are well placed for that.

JG: Teachers do teach collaboration. Social media is about relationships… It’s just a form of that… CPD for teachers is outside of school time and that means keen teachers engage there…

MM: Having some teachers into smartphones. Some who are not… Some teachers are into outdoor education and camping… Others are not… You wouldn’t want to exclude kids from the experience of camping… That’s how you can think about the ideas of digital literacy here… Finding the enthusiasm and route in…

Comments: A lot of what we, in this room, know of technology is through past exposure and experience of technology. Children are sponges.. They can often teach the teachers, with scaffolding from the teachers, about this era of technology… The kids are often better and quicker at using the technology… We have to think about where this might lead them…

Comment: On procurement and evidence… Michael talked about small trials… Do we think that the specific and unique contexts of schools do not justify that type of small scale trialling…?

MM: I think context is key in trials… Even outside of tech… Approaches like peer learning have great evidence… But the actual implementation can make a big difference… But you have to weigh up whether your context is as unique as you think…

DC: That can also be an excuse… Having been involved in procurement in tech… You don’t throw tech about… You think about what the context is, do serious homework before spending the money… You need the strategy and change management to roll things out and sustaining the effort… That’s almost invariably absent in the school context… Quite haphazard… “everyone’s unique… Let’s just play with this stuff”

Comment (I’m the director of a startup empowering primary aged girls and augmented reality to encourage routes into STEM subjects): In terms of costs and being a governor… Start ups are obsessed with evidence. One of the best things you can do is work with start ups, they really want that evidence… If you are worried about costs you can trial things… But it is a risk when you are teaching… You were also talking about jobs that don’t exist at the moment… That means new jobs in new fields… One thing that strikes me this evening is that no one has talked about science, technology, arts and maths… And teachers don’t come in from that route into schools… We’ve been talking to Jim Knight. In primary schools you don’t get labs but you can use AR to do experiments… to look in this area… My point is you’ve been talking about technology, is it worth it… Would have been great to hear from someone with positive experiences, or an Ed Tech company… This feels like a lot of slamming down of technology…

JG: Can I talk about positive experiences… Technology is life changing and amazing… removing technology from classrooms is a horrendous… Your example of not having enough good qualified science teachers is an important one…

DC: I am not sure about AR and VR… I’d be careful with some of these things… Hololens isn’t there yet… Leading edge tech is a bit of a honeytrap… I raise VR as it’s on every phone… and free…

Commenter: AR is on phones… !

KH: Thank you for a really lively discussion!

And with that the rather spirited discussions came to an end! Some interesting things to consider but I felt like there was so much that wasn’t discussed properly because of the direction the conversation took – issues like access to wifi; measures to use but make technology safe – and what they mean for information literacy; technology beyond devices… So, I’d love to hear your comments below on Ed Tech in Schools.

Posted June 21, 2017 at 10:23 pm in Digital Education
Jun 16 2017
 

It’s the final day of the IIPC/RESAW conference in London. See my day one and day two posts for more information on this. I’m back in the main track today and, as usual, these are live notes so comments, additions, corrections, etc. are all welcome.

Collection development panel (Chair: Nicola Bingham)

James R. Jacobs, Pamela M. Graham & Kris Kasianovitz: What’s in your web archive? Subject specialist strategies for collection development

We’ve been archiving the web for many years but the need for web archiving really hit home for me in 2013 when NASA took down every one of their technical reports – for review on various grounds. And the web archiving community was very concerned. Michael Nelson said in a post “NASA information is too important to be left on nasa.gov computers”. And I wrote about when we rely on pointing not archiving.

So, as we planned for this panel we looked back on previous IIPC events and we didn’t see a lot about collection curation. We posed three topics all around these areas. So for each theme we’ll watch a brief screen cast by Kris to introduce them…

  1. Collection development and roles

Kris (via video): I wanted to talk about my role as a subject specialist and how collection development fits into that. As a subject specialist that is a core part of the role, and I use various tools to develop the collection. I see web archiving as absolutely being part of this. Our collection is books, journals, audio visual content, quantitative and qualitative data sets… Web archives are just another piece of the pie. And when we develop our collection we are looking at what is needed now but in anticipation of what will be needed 10 or 20 years in the future, building a solid historical record that will persist in collections. And we think about how our archives fit into the bigger context of other archives around the country and around the world.

For the two web archives I work on – CA.gov and the Bay Area Governments archives – I am the primary person engaged in planning, collecting, describing and making available that content. And when you look at the web capture life cycle you need to ensure the subject specialist is included and their role understood and valued.

The CA.gov archive involves a group from several organisations including the government library. We have been archiving since 2007 in the California Digital Library initially. We moved into Archive-It in 2013.

The Bay Area Governments archives includes materials on 9 counties, but primarily and comprehensively focused on two key counties here. We bring in regional governments and special districts where policy making for these areas occurs.

Archiving these collections has been incredibly useful for understanding government, their processes, how to work with government agencies and the dissemination of this work. But as the sole responsible person that is not ideal. We have had really good technical support from Internet Archive around scoping rules, problems with crawls, thinking about writing regular expressions, how to understand and manage what we see from crawls. We’ve also benefitted from working with our colleague Nicholas Taylor here at Stanford who wrote a great QA report which has helped us.

We are heavily reliant on crawlers, on tools and technologies created by you and others, to gather information for our archive. And since most subject selectors have pretty big portfolios of work – outreach, instruction, as well as collection development – having good ties to developers, and to the wider community with whom we can share ideas and questions, is really vital.

Pamela: I’m going to talk about two Columbia archives, the Human Rights Web Archive (HRWA) and Historic Preservation and Urban Planning. I’d like to echo Kris’ comments about the importance of subject specialists. The Historic Preservation and Urban Planning archive is led by our architecture subject specialist and we’d reached a point where we had to collect web materials to continue that archive – and she’s done a great job of bringing that together. Human Rights seems to have long been networked – using the idea of the “internet” long before the web and hypertext. We work closely with Alex Thurman, and have an additional specially supported web curator, but there are many more ways to collaborate and work together.

James: I will also reflect on my experience. And the FDLP – Federal Depository Library Program – involves libraries receiving absolutely every government publication in order to ensure a comprehensive archive. There is a wider programme allowing selective collection. At Stanford we are 85% selective – we only weed out content (after five years) very lightly and usually flyers etc. As a librarian I curate content. As an FDLP library we have to think of our collection as part of the wider set of archives, and I like that.

As archivists we also have to understand provenance… How do we do that with the web archive? And at this point I have to shout out to Jefferson Bailey and colleagues for the “End of Term” collection – archiving all gov sites at the end of government terms. This year has been the most expansive, and the most collaborative – including FTP and social media. And, due to the Trump administration’s hostility to science and technology we’ve had huge support – proposals of seed sites, data capture events etc.

2. Collection Development approaches to web archives, perspectives from subject specialists

As subject specialists we all have to engage in collection development – there are no vendors in this space…

Kris: Looking again at the two government archives I work on, there are Depository Program Statuses to act as a starting point… But these haven’t been updated for the web. However, this is really a continuation of the print collection programme. And web archiving actually lets us collect more – we are no longer reliant on agencies putting content into the Depository Program.

So, for CA.gov we really treat this as a domain collection. And no-one is really doing this except some UCs, myself, and state library and archives – not the other depository libraries. However, we don’t collect think tanks, or the not-for-profit players that influence policy – this is for clarity although this content provides important context.

We also had to think about granularity… For instance for the CA transport there is a top level domain and sub domains for each regional transport group, and so we treat all of these as seeds.

Scoping rules matter a great deal, partly as our resources are not unlimited. We have been fortunate that with the CA.gov archive we have about 3TB space for this year, and have been able to utilise it all… We may not need all of that going forwards, but it has been useful to have that much space.
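As an aside from me (not from the talk): the combination of seeds, subdomains and scoping rules that Kris describes can be sketched in a few lines of Python. This is only an illustration under my own assumptions. The host name, the exclusion patterns and the `in_scope` function are all hypothetical, and this is not the rule syntax used by Archive-It or by any particular crawler.

```python
import re
from urllib.parse import urlsplit

# Hypothetical seed host: an agency site whose regional sub-sites are
# subdomains (assumption for illustration, not a real collection seed).
SEED_HOSTS = ["transport.example.gov"]

# Hypothetical exclusion rules, written as regular expressions, for URL
# patterns that tend to waste crawl budget (endless calendars, session ids).
EXCLUDE_PATTERNS = [
    re.compile(r"/calendar/\d{4}/"),
    re.compile(r"[?&]sessionid="),
]


def in_scope(url: str) -> bool:
    """Return True if url is on a seed host (or a subdomain of one)
    and does not match any exclusion pattern."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    on_seed = any(host == seed or host.endswith("." + seed) for seed in SEED_HOSTS)
    if not on_seed:
        return False
    # Match exclusions against the path plus query string only.
    target = parts.path + ("?" + parts.query if parts.query else "")
    return not any(pattern.search(target) for pattern in EXCLUDE_PATTERNS)
```

With these made-up rules, a regional sub-site such as `district4.transport.example.gov` is kept, while a think tank on another domain, or a calendar page on the seed host, is left out of the crawl.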

Pamela: Much of what Kris has said reflects our experience at Columbia. Our web archiving strengths mirror many of our other collection strengths and indeed I think web archiving is this important bridge from print to fully digital. I spent some time talking with our librarian (Chris) recently, and she will add sites as they come up in discussion, she monitors the news for sites that could be seeds for our collection… She is very integrated in her approach to this work.

For the human rights work one of the challenges is the time that we have to contribute. And this is a truly interdisciplinary area with unclear boundaries, and those are both challenging aspects. We do look at subject guides and other practice to improve and develop our collections. And each fall we sponsor about two dozen human rights scholars to visit and engage, and that feeds into what we collect… The other thing that I hope to do in the future is to do more assessment to look at more authoritative lists in order to compare with other places… Colleagues look at a site called Idealist which lists opportunities and funding in these types of spaces. We also try to capture sites that look more vulnerable – small activist groups – although it is not clear if they actually are that risky.

Cost wise the expensive part of collecting is both human effort to catalogue, and the permission process in the collecting process. And yesterday’s discussion raised the possible need for ethics groups as part of the permissions process.

In the web archiving space we have to be clearer on scope and boundaries as there is such a big, almost limitless, set of materials to pick from. But otherwise plenty of parallels.

James: For me the material we collect is in the public domain so permissions are not part of my challenge here. But there are other aspects of my work, including LOCKSS. In the case of Fugitive US Agencies Collection we take entire sites (e.g. CBO, GAO, EPA) plus sites at risk (e.g. Census, Current Industrial Reports). These “fugitive” agencies include publications that should be in the depository programme but are not. And those lost documents that fail to make it out, they are what this collection is about. When a library notes a lost document I will share that on the Lost Docs Project blog, and then also am able to collect and seed the cloud and web archive – using the WordPress Amber plugin – for links. For instance the CBO’s look at the health bill, aka Trump Care, was missing… In fact many CBO publications were missing so I have added it as a seed for our Archive-It.

3. Discovery and use of web archives

Discovery and use of web archives is becoming increasingly important as we look for needles in ever larger haystacks. So, firstly, over to Kris:

Kris: One way we get archives out there is in our catalogue, and into WorldCat. That’s one place to help other libraries know what we are collecting, and how to find and understand it… So would be interested to do some work with users around what they want to find and how… I suspect it will be about a specific request – e.g. city council in one place over a ten year period… But they won’t be looking for a web archive per se… We have to think about that, and what kind of intermediaries are needed to make that work… Can we also provide better seed lists and documentation for this? In Social Sciences we have the Code Book and I think we need to share the equivalent information for web archives, to expose documentation on how the archive was built… And linking to seeds and other parts of collections.

One other thing we have to think about is process and document ingest mechanism. We are trying to do this for CA.gov to better describe what we do… But maybe there is a standard way to produce that sort of documentation – like the Codebook…

Pamela: Very quickly… At Columbia we catalogue individual sites. We also have a customised portal for the Human Rights. That has facets for “search as research” so you can search and develop and learn by working through facets – that’s often more useful than item searches… And, in terms of collecting for the web we do have to think of what we collect as data for analysis as part of a larger data sets…

James: In the interests of time we have to wrap up, but there was one comment I wanted to make, which is that there are tools we use but also gaps that we see for subject specialists [see slide]… And Andrew’s comments about the catalogue struck home with me…

Q&A

Q1) Can you expand on that issue of the catalogue?

A1) Yes, I think we have to see web archives both as bulk data AND collections as collections. We have to be able to pull out the documents and reports – the traditional materials – and combine them with other material in the catalogue… So it is exciting to think about that, about the workflow… And about web archives working into the normal library work flows…

Q2) Pamela, you commented about permissions framework as possibly vital for IRB considerations for web research… Is that from conversations with your IRB or speculative?

A2) That came from Matt Webber’s comment yesterday on IRB becoming more concerned about web archive-based research. We have been looking for faster processes… But I am always very aware of the ethical concern… People do wonder about ethics and permissions when they see the archive… Interesting to see how we can navigate these challenges going forward…

Q3) Do you use LCSH and are there any issues?

A3) Yes, we do use LCSH for some items and the collections… Luckily someone from our metadata team worked with me. He used Dublin Core, with LCSH within that. He hasn’t indicated issues. Government documents in the US (and at state level) typically use LCSH so no, no issues that I’m aware of.

Plenary (Macmillan Hall): Posters with lightning talks (Chair: Olga Holownia)

Olga: I know you will be disappointed that it is the last day of Web Archiving Week! Maybe next year it should be Web Archiving Month… And then year!

So, we have lightning talks that go with posters that you can explore during the break, and speak to the presenters as well.

Tommi Jauhiainen, Heidi Jauhiainen, & Petteri Veikkolainen: Language identification for creating national web archives

Petteri: I am web archivist at the National Library of Finland. But this is really about Tommi’s PhD research on native Finno-Ugric languages and the internet. This work began in 2013 as part of the Kone Foundation Language Programme. It gathers texts in small languages on the web… They had to identify that content to capture them.

We extracted the web links on Finnish web pages, also crawled Russian, Estonian, Swedish, and Norwegian domains for these languages. They used HeLI and Heritrix. We used the list of Finnish URLs in the archive, rather than transferring the WARC files directly. So HeLI is the Helsinki language identification method, one of the best in the world. It can be found on GitHub. And can be used for your language as well! The full service will be out next year, but you can ask HeLI if you want that earlier.

Martin Klein: Robust links – a proposed solution to reference rot in scholarly communication

I work at Los Alamos, I have two short talks and both are work with my boss Herbert Van de Sompel, who I’m sure you’ll be aware of.

So, the problem of robust links is that links break and reference content changes. It is hard to ensure the author’s intention is honoured. So, you write a paper last year, point to the EPA, the DOI this year doesn’t work…

So, there are two ways to do this… You can create a snapshot of a referenced resource… with Perma.cc, Internet Archive, archive.is, WebCite. That’s great… But the citation people use is then the URI of the archive copy… Sometimes the original URI is included… But what if the URI-M is a copy elsewhere – archive.is or the no longer present mummify.it.

So, second approach, decorate your links by referencing: the URI of the archived copy, datetime of archiving, and the resource’s original URI. That makes your link more robust, meaning you can still find the live version. The original URI allows finding captures in all web archives. The capture datetime lets you identify when/what version of the site is used.

How do you do this? With HTML5 link decoration: alongside the href attribute you add data- attributes (data-original and data-versiondate). And we talked about this in a D-Lib article, with some JavaScript that makes that actionable!
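A minimal sketch of what a decorated link might look like when generated programmatically. The attribute names follow these notes (data-original, data-versiondate) and should be checked against the published Robust Links documentation; the function name and the example URLs are hypothetical.

```python
from html import escape

def robust_link(href, original_uri, version_date, text):
    """Build an <a> element carrying the original URI and capture date.

    The data-* attribute names are taken from the talk notes and are an
    assumption, not a confirmed specification.
    """
    return (
        f'<a href="{escape(href, quote=True)}" '
        f'data-original="{escape(original_uri, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f'{escape(text)}</a>'
    )

# Hypothetical example: the href points at an archived snapshot,
# while the decoration records where the resource originally lived.
link = robust_link(
    href="https://archive.example/20170616/http://example.org/report",
    original_uri="http://example.org/report",
    version_date="2017-06-16",
    text="the report",
)
print(link)
```

The point of the design is that a reader (or a script) who finds the href dead still has the original URI and a date to look up in any other web archive.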

So, come talk to me upstairs about this!

Herbert Van de Sompel, Michael L. Nelson, Lyudmila Balakireva, Martin Klein, Shawn M. Jones & Harihar Shankar: Uniform access to raw mementos

Martin: Hello, it’s still me, I’m still from Los Alamos! But this is a more collaborative project…

The problem here… Most web archives augment their mementos with custom banners and links… So, in the Internet Archive there is a banner from them, and a pointer on links to a copy in the archive. There are lots of reasons, legal, convenience… But that enhancement doesn’t represent the website at the time of capturing… As a researcher those enhancements are detrimental as you have to rewrite links again.

For us and our Memento Reconstruct, and other replay systems that’s a challenge. Also makes it harder to check the veracity of content.

Currently some systems do support this… OpenWayback and pywb do allow this – you can add the {datetime}im_/URI-R to do this, for instance. But that is quite dependent on the individual archive.

So, we propose using the Prefer Header in HTTP Request…

Option 1: Request header sent against Time Gate

Option 2: Request header sent against Memento
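As a rough illustration of the idea, here is how a client might attach a Prefer header (the general HTTP mechanism from RFC 7240) to a request aimed at either a TimeGate or a Memento. The preference token "raw" and the archive URL are placeholders, since the talk notes do not record the exact values proposed; the request is built but not sent.

```python
import urllib.request

def raw_memento_request(url, preference="raw"):
    """Prepare a request that asks the archive for an unaugmented memento.

    `preference` is a hypothetical token; the real proposal may use a
    different value.
    """
    req = urllib.request.Request(url)
    req.add_header("Prefer", preference)
    return req

# Hypothetical TimeGate URL; nothing is fetched here.
req = raw_memento_request("https://archive.example/timegate/http://example.org/")
print(req.get_header("Prefer"))
```

The same function covers both options in the talk: only the URL changes, depending on whether the header goes to the TimeGate or directly to the Memento.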

So come talk to us… Both versions work, I have a preference, Ilya has a different preference, so it should be interesting!

Sumitra Duncan: NYARC discovery: Promoting integrated access to web archive collections

NYARC is a consortium formed in 2006 from research libraries at Brooklyn Museum, The Frick Collection and the Museum of Modern Art. There is a two year Mellon grant to implement the program. And there are 10 collections in Archive-It devoted to scholarly art resources – including artist websites, gallery sites, catalogues, lists of lost and looted art. There is a seed list of 3900+ sites.

To put this in place we asked for proof of concept discovery sites – we only had two submitted. We selected Primo from Ex Libris. This brings in materials using the OpenSearch API. The set up does also let us pull in other archives if we want to. And you can choose whether to include the web archive (or not). The access points are through MARC Records and Full Records Search, and are in both the catalogue and WorldCat. We don’t, however, have faceted results for the web archive as it’s not in the API.

And recently, after discussion with Martin, we integrated Memento into the archive, which lets them explore all captured content with Memento Time Travel.

In the future we will be doing usability testing of the discovery interface, we will promote use of web archive collections, and encouraging use in new digital art projects.

Find NYARC’s Archive-It Collections: www.nywarc.org/webarchive. Documentation at http://wiki.nyarc.??

João Gomes: Arquivo.pt

Olga: Many of you will be aware of Arquivo. We couldn’t go to Lisbon to mark the 10th anniversary of the Portuguese web archive, but we welcome Joao to talk about it.

Joao: We have had ten years of preserving the Portuguese web, collaborating, researching and getting closer to our researchers, and ten years celebrating a lot.

Hello I am Joao Gomes, the head of Arquivo.pt. We are celebrating ten years of our archive. We are having our national event in November – you are all invited to attend and party a lot!

But what about the next 10 years? We want to be one of the best archives in the world… With improvements to full text search, and launching new services – like image searching and high quality archiving services. Launching an annual prize for research projects over Arquivo.pt. And at the same time increasing our collection and user community.

So, thank you to all in this community who have supported us since 2007. And long live Arquivo.pt!

Changing records for scholarship & legal use cases (Chair: Alex Thurman)

Martin Klein & Herbert Van de Sompel: Using the Memento framework to assess content drift in scholarly communication

This project is to address both link rot and content drift – as I mentioned earlier in my lightning talk. I talked about link rot there; content drift is where the URI stays but the content there changes, perhaps out of all recognition, so that what I cite is not reproducible.

You may or may not have seen this but there was a Supreme Court case referencing a website, and someone thought it would be really funny to purchase that, put up a very custom 404 error. But you can see pages that change between submission and publication. By contrast if you look at arxiv for instance you see an example of a page with no change over 20 years!

This matters partly as we reference URIs increasingly, hugely so since 2008.

So, some of this I talked about three years ago where I introduced the Hiberlink project, a collaborative project with the University of Edinburgh where we coined the term “reference rot”. This issue is a threat to the integrity of the web-based scholarly record. Resources do not have the same sense of fixity as, e.g., a journal article. And custodianship is also not as long term, custodians are not always as interested.

We wrote about link rot in PLOS ONE. But now we want to focus on Content Drift. We published a new article on this in PLOS ONE a few months ago. This is actually based on the same corpus – the entirety of arXiv, of PubMedCentral, and also over 2 million articles from Elsevier. This covered publications from January 1997 to December 2012. We only looked at URIs for non scholarly articles – not the DOIs but the blog posts, the Wikipedia page, etc. We ended up with a total of around 1 million URIs for these corpora. And we also kept the start date of the article with our data.

So, what is our approach for assessing content drift? We take publication date of URI as t. Then we try to find a Memento Pre of referenced URI (t-1) and the Memento Post of referenced URI (t+1). Two thirds of the URIs we looked at have this pair across archives. So now we do text analysis, looking at textual similarity between t-1 and t+1. We use measures that compute normalised scores (values 0 to 100) for:

  • simhash
  • Jaccard – sets of character changes
  • Sorensen-Dice
  • Cosine – contextual changes

So we defined a perfect Representative Memento if it gets a perfect score across all four measures. And we did some sanity checks too, via HTTP headers – ETag and Last-Modified being the same are a good measure. And that sanity check passed! 98.88% of Mementos were representative.

Out of the 650k pairs we found, about 313k URIs have representative Mementos. There wasn’t any big difference across the three collections.
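To make the scoring idea concrete, here is a small sketch of three of the four measures (simhash is left out for brevity), each normalised to 0–100 as described above. It tokenises on whitespace-separated words, which is an assumption for illustration and not necessarily how the study tokenised text; all function names are hypothetical.

```python
import math
from collections import Counter

def tokens(text):
    # Assumption: lowercase word tokens; the study may have used other units.
    return text.lower().split()

def jaccard(a, b):
    """Shared tokens divided by all distinct tokens, scaled to 0-100."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)

def sorensen_dice(a, b):
    """Twice the shared tokens divided by the two set sizes, scaled to 0-100."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))

def cosine(a, b):
    """Cosine of the angle between term-frequency vectors, scaled to 0-100."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return 100.0 * dot / norm if norm else 0.0

def is_representative(pre, post, tolerance=1e-9):
    """A pair counts as representative only if every measure is (near) 100."""
    scores = (jaccard(pre, post), sorensen_dice(pre, post), cosine(pre, post))
    return all(s >= 100.0 - tolerance for s in scores)
```

Requiring a perfect score on every measure is deliberately strict: any textual change between the pre and post Mementos disqualifies the pair.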

Now, with these 313k links, over 200k had a live site. And that allowed us to analyse and compare the live and archived versions. We used those same four measures to check similarity. Those vary so we aggregate. And we find that 23.7% of URIs have not drifted. But that means that over 75% have drifted and may not be representative of author intent.

In our work 25% of the most recent papers we looked at (2012) have not drifted at all. That gets worse going back in time, as is intuitive. Again, the differences across the corpora aren’t huge. PMC isn’t quite the same – as there were fewer articles initially. But the trend is common… In Elsevier’s 1997 works only 5% of content has not drifted.

So, take aways:

  1. Scholarly articles increasingly contain URI references to web-at-large resources
  2. Such resources are subject to reference rot (link rot and content drift)
  3. Custodians of these resources are typically not overly concerned with archiving of their content and longevity of the scholarly record
  4. Spoiler: Robust links are one way to address this at the outset.

Q&A

Q1) Have you had any thought on site redesigns where human readable content may not have changed, but pages have.

A1) Yes. We used those four measures to address that… We strip out all of the HTML and formatting. Cosine ignores very minor “and” vs. “or” changes for instance.

Q1) What about Safari readability mode?

A1) No. We used something like Beautiful Soup to strip out code. Of course you could also do visual analysis to compare pages.

Q2) You are systematically underestimating the problem… You are looking at publication date… It will have been submitted earlier – generally 6-12 months.

A2) Absolutely. For the sake of the experiment it’s the best we can do… Ideally you’d be as close as possible to the authoring process… When published, as you say, it may already

Q3) A comment and a question… 

Preprints versus publication… 

A3) No, we didn‘t look explicitly at pre-prints. In arXiv those are

The URIs in articles in Elsevier seem to rot more than those in arXiv.org articles… We think that could be because Elsevier articles tend to reference more .coms whereas arXiv references more .org URIs but we need more work to explore that…

Nicholas Taylor: Understanding legal use cases for web archives

I am going to talk about use of web archives in litigation. But out of scope here are the areas of preservation of web citations; terms of service and API agreements for social media collection; copyright; right to be forgotten.

So, why web archives? Well it’s where the content is. In some cases social media may only be available in web archives. Courts do now accept web archive evidence. The earliest use of IAWM (Internet Archive Wayback Machine) evidence was as early as 2004. Litigants routinely challenge this evidence but courts often accept IAWM evidence – commonly through affidavit or testimony, through judicial notice, sometimes through expert testimony.

The IA have affidavit guidance and they suggest asking the court to ensure they will accept that evidence, making that the issue for the courts not the IA. And interpretation is down to the parties in the case. There is also information on how the IAWM works.

Why should we care about this? Well legal professionals are our users too. Often we have unique historical data. And we can help courts and juries correctly interpret web archive evidence leading to more informed outcomes. Other opportunities may be to broaden the community of practice by bringing in legal technology professionals. And this is also part of mainstreaming web archives.

Why might we hesitate here? Well typically cases serve private interests rather than public goods. There is an immature open source software culture for legal technology. And market solutions for web and social media archiving for this context do already exist.

Use cases for web archiving in litigation mainly have to do with information on individual webpages at a point in time; information on individual webpages over a period of time; persistence of navigational paths over a period of time. And types of cases include civil litigation and intellectual property cases (which are a separate court in the US). I haven’t seen any criminal cases using the archive but that doesn’t mean they don’t exist.

Where archives are used there is a focus on authentication and validity of the record. Telewizja Polska USA Inc v. Echostar Video Inc. (2004) saw arguing over the evidence but the court accepting it. In Specht v. Google Inc (2010) the evidence was not admissible as it had not come through the affidavit rule.

Another important rule in the US context is Judicial notice (FRE 201) which is a rule that allows a fact to be entered into evidence. And archives have been used in this context. For instance Martins v 3PD, Inc (2013). And Pond Guy, Inc. v. Aquascape Designs (2011). And in Tompkins v 23andme, Inc (2014) – both parties used IAWM screenshots and the courts went out and found further screenshots that countered both of these to an extent.

Expert testimony (FRE 702) has included Khoday v Symantec Corp et al (2015) where the expert on navigational paths was queried but the court approved that testimony.

In terms of reliability factors, things that are raised as concerns include IAWM disclaimer, incompleteness, provenance, temporal coherence. Not seen any examples on discreteness, temporal coherence with HTTP headers, etc.

Nassar v Nassar (2017) was a defamation case where the IAWM disclaimer saw the court not accept evidence from the archive.

Stabile v. Paul Smith Ltd. (2015) saw incomplete archives used, with the court acknowledging but accepting relevance of what was entered.

Marten Transport Ltd v Plattform Advertising Inc. (2016) also involved incomplete captures, with discussion of banners and ads, but the court understood that IAWM does account for some of this. Objections have included issues with crawlers, and concern that a human/witness wasn’t directly involved in capturing the pages. The literature includes different perceptions of incompleteness. We also have issues of live site “leakage” via AJAX – where new ads leaked into archive pages…

Temporal coherence can be complicated. Web archive captures can include mementos that are embedded and archived at different points in time so that the composite does not totally make sense.

The Memento Time Travel service shows you temporal coherence. See also Scott Ainsworth’s work. That kind of visualisation can help courts to understand temporal coherence. Other datetime estimation strategies include “Carbon Dating” (and constituent services); comparing X-Archive-Orig-last-modified with Memento datetime, etc.

Interpreting datetimes is complicated, and of great importance in legal cases. These can be interpreted from static datetime of text in archived page, the Memento datetime, the headers, etc.

Servicenow, Inc. v Hewlett-Packard Co. (2015) was a patent case, where things must be published a year ago to be “prior art”, and in this case the archive showed an earlier date than other documentation.

In terms of IAWM provenance… Cases have questioned this. Sources for IAWM include a range of different crawls but what does that mean for reliable provenance? There are other archives out there too, but I haven’t seen evidence of these being used in court yet. Canonicality is also an interesting issue… Personalisation of content served to the archival agent is an unanswered question. What about client artifacts?

So, what’s next? If we want to better serve legal and research use cases, then we need to surface more provenance information; to improve interfaces to understand temporal coherence and make volatile aspects visible…
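One of the comparisons mentioned above, between the origin server’s last-modified value and the capture datetime, can be sketched in a few lines. This assumes the archive replays the original header as X-Archive-Orig-last-modified and reports the capture time as Memento-Datetime, both in standard HTTP date format; the sample values are invented for illustration.

```python
from email.utils import parsedate_to_datetime

def capture_lag_seconds(headers):
    """Seconds between the page's reported last modification and its capture.

    A large gap suggests the page content predates the capture by that much;
    a negative gap would be a sign that something is inconsistent.
    """
    modified = parsedate_to_datetime(headers["X-Archive-Orig-last-modified"])
    captured = parsedate_to_datetime(headers["Memento-Datetime"])
    return (captured - modified).total_seconds()

# Invented sample headers, not taken from a real capture.
sample = {
    "X-Archive-Orig-last-modified": "Mon, 01 Jun 2015 12:00:00 GMT",
    "Memento-Datetime": "Wed, 03 Jun 2015 12:00:00 GMT",
}
print(capture_lag_seconds(sample))
```

Neither value is proof on its own: Last-Modified is whatever the origin server claimed, so this is better read as one signal among several.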

So, some questions for you,

  1. why else might we care, or not about legal use cases?
  2. what other reliability factors are relevant?
    1. What is the relative importance of different reliability factors?
    2. For what use cases are different reliability factors relevant?

Q&A

Q1) Should we save WhoIs data alongside web archives?

A1) I haven’t seen that use case but it does provide context and provenance information

Q2) Is the legal status of IA relevant – it’s not a publicly funded archive. What about security certificates or similar to show that this is from the archive and unchanged?

A2) To the first question, courts have typically been more accepting of web evidence from .gov websites. They treat that as reliable or official. Not sure if that means they are more inclined to use it.. On the security side, there were some really interesting issues raised by Ilya and Jack. As courts become more concerned, they may increasingly look for those signs. But there may be more of those concerns…

Q3) I work with one of those commercial providers… A lot of lawyers want to be able to submit WARCs captured by Webrecorder or similar to courts.

A3) The legal system is very document centric… Much of their data coming in is PDF and that does raise those temporal issues.

Q3) Yes, but they do also want to render WARC, to bring that in to their tools…

Q4) Did you observe any provenance work outside the archive – developers, GitHub commits… Stuff beyond the WARC?

A4) I didn’t see examples of that… Maybe has to do with… These cases often go back a way… Sites created earlier…

Anastasia Aizman & Matt Phillips: Instruments for web archive comparison in Perma.cc

Matt: We are here to talk to you about some web archiving work we are doing. We are from the Harvard innovation lab. We have learnt so much from what you are doing, thank you so much. Perma.cc is creating tools to help you cite stuff on the web, to capture the WARC, organises those things…

We got started on this work when examining documents looking at the Supreme Court corpus from 1996 to present. We saw that Zittrain et al, Harvard Law Review, found more than 70% of references had rotted. So we wanted to build tools to help that…

Anastasia: So, we have some questions…

  1. How do we know a website has changed
  2. How do we know which are important changes.

So, what is a website made of… There are a lot of different resources that will appear on, say, a Washington Post article will have perhaps 90 components. Some are visual, some are hidden… So, again, how can we tell if the site has changed, if it is significant… And how do you convey that to the user.

In 1997, Andrei Broder wrote about Syntactic clustering of the web. In that work he looked at every site on the world wide web. Things have changed a great deal since then… Websites are more dynamic now, we need more ways to compare pages…

Matt: So we have three types of comparison…

  • image comparison – we flatten the page down… If we compare two shots of Hacker News a few minutes apart there is a lot of similarity, but difference too… So we create a third image showing/highlighting the differences and can see where those changes are…

Why do image comparison? It’s kind of a dumb way to understand difference… Well it’s a mental model the human brain can take in. The HCI is pretty simple here – users regularly experience that sort of layering – and we are talking general web users here. And it’s easy to have images on hand.

So, sometimes it works well… Here’s an example… A silly one… A post that is the same but we have a cup of coffee with and without coffee in the mug, and small text differences. Comparisons like this work well…

But it works less well where we see banner ads on webpages and they change all the time… But what does that mean for the content? How do we fix that? We need more fidelity, we need more depth.

Anastasia: So we need another way to compare… Looking at a Washington Post page from 2016 and 2017… Here we can see what has been deleted, and we can see what has been added… And the tagline of the paper itself has changed in this case.

The pros of this highlighting approach are that it’s in use in lots of places, it’s intuitive… BUT it has to ignore invisible-to-the-user tags. And it is kind of stupid… With two totally different headlines, both saying “Supreme Court”, it sees similarity where there is none.
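A toy version of that deleted/added highlighting, using Python’s standard difflib on word lists. This is only a sketch of the general technique, not Perma.cc’s implementation, and the headlines are made up.

```python
from difflib import SequenceMatcher

def word_changes(old, new):
    """Return (deleted_words, added_words) between two visible-text strings."""
    a, b = old.split(), new.split()
    deleted, added = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(a[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(b[j1:j2])
    return deleted, added

# Two invented headlines that share "Supreme Court" but say different things.
deleted, added = word_changes(
    "Supreme Court hears case",
    "Supreme Court delays case",
)
print("deleted:", deleted)
print("added:", added)
```

It also shows the weakness noted above: the shared words count as unchanged even when the meaning of the headline has shifted.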

So what about other similarity measures… ? Maybe a score would be nice, rather than an overlay highlighting change. So, for that we are looking at:

  • Jaccard Coefficient (MinHash) – this is essentially like applying a Venn diagram to two archives.
  • Hamming distance (SimHash) – This turns strings into 1s and 0s and figures out where the differences are… The difference/ratio
  • Sequence Matcher (Baseline/Truth) – this looks for sequences of words… It is good but hard to use as it is slow.
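For readers who want to try two of these measures, here is a compact sketch: a simple SimHash fingerprint compared by Hamming distance, and a word-level SequenceMatcher ratio as the slower baseline. The hashing choices (MD5, 64 bits, whitespace tokens) are assumptions for illustration; MinHash is omitted to keep the example short.

```python
import hashlib
from difflib import SequenceMatcher

def simhash(text, bits=64):
    """Fingerprint where each bit is a majority vote over the word hashes."""
    counts = [0] * bits
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")

def sequence_ratio(a, b):
    """Baseline similarity in [0, 1] from matching runs of words."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

old = "the court accepted the archived page as evidence"
new = "the court rejected the archived page as evidence"
print(hamming(simhash(old), simhash(new)), sequence_ratio(old, new))
```

A small Hamming distance means similar fingerprints, so near-duplicate pages can be compared cheaply, while the sequence ratio is more faithful but slower on long documents.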

So, we took Washington Post archives (2000+) and resources (12,000) and looked at SimHash – big gaps. MinHash was much closer…

When we can calculate those changes… does it matter? If it’s ads, do you care? Some people will. Human eyes are needed…

Matt: So, how do we convey this information to the user… Right now in Perma we have a banner, we have highlighting, or you can choose image view. And you can see changes highlighted in “File Changes” panel on top left hand side of the screen. You can click to view a breakdown of where those changes are and what they mean… You can get to an HTML diff (via Javascript).

So, those are our three measures sitting in our Perma container..

Anastasia: So future work – coming soon – will look at weighted importance. We’d love your idea of what is important – is HTML more important than text? We want a Command Line (CLI) tool as well. And then we want to look at a similarity measure for images – other research on this out there, we need to look at that. We want a “Paranoia” heuristic – to see EVERY change, but with a tickbox to allow only the important change. And we need to work together!

Finally we’d like to thank you, and our colleagues at Harvard who support this work.

Q&A

Q1) Nerdy questions… How tightly bound are these similarity measures to the Perma.cc tool?

A1 – Anastasia) Not at all – should be able to use on command line

A1 – Matt) Perma is a Python Django stack and it’s super open source so you should be able to use this.

Comment) This looks super awesome and I want to use it!

Matt) These are really our first steps into this… So we welcome questions, comments, discussion. Come connect with us.

Anastasia) There is so much more work we have coming up that I’m excited about… Cutting up website to see importance of components… Also any work on resources here…

Q2) Do you primarily serve legal scholars? What about litigation stuff Nicholas talked about?

A2) We are in the law school but Perma is open to all. The litigation stuff is interesting..

A2 – Anastasia) It is a multi purpose school and others are using it. We are based in the law school but we are spreading to other places!

Q3) Thank you… There were HTML comparison tools that exist… But they go away and then we have nothing. A CLI will be really useful… And a service comparing any two URLs would be useful… Maybe worth looking at work on Memento damage – missing elements, and impact on the page – CSS, colour, alignment, images missing, etc. and relative importance. How do you highlight invisible changes?

A3 – Anastasia) This is really the complexity of this… And of the UI… Showing the users the changes… Many of our users are not from a technical background… Educating by showing changes is one way. The list with the measures is just very simple… But if a hyperlink has changed, that potentially is more important… So, do we organise the list to indicate importance? Or do we calculate that another way? We welcome ideas about that?

Q3) We have a service running in Memento showing scores on various levels that shows some of that, which may be useful.

Q4) So, a researcher has a copy of what they were looking at… Can other people look at their copy? So, researchers can use this tool as proof that it is what they cited… Can links be shared?

A4 – Matt) Absolutely. We have a way to do that from the Blue Book. Some folks make these private but that’s super super rare…

Understanding user needs (Chair Nicola Bingham)

Peter Webster, Chris Fryer & Jennifer Lynch: Understanding the users of the Parliamentary Web Archive: a user research project

Chris: We are here to talk about some really exciting user needs work we’ve been doing. The Parliamentary Archives holds several million historical records relating to Parliament, dating from 1497. My role is to ensure that archive continues, in the form of digital records as well. One aspect of that is the Parliamentary Web Archive. This captures around 30 URLs – the official Parliamentary websphere content from 2009. But we also capture official social media feeds – Twitter, Facebook and Instagram. This work is essential as it captures our relationship with the public. But we don’t have a great idea of our users’ needs and we wanted to find out more and understand what they use and what they need.

Peter: The objectives of the study were:

  • assess levels and patterns of use – what areas of the sites they are using, etc.
  • gauge levels of user understanding of the archive
  • understand the value of each kind of content in the web archive – to understand curation effort in the future.
  • test UI for fit with user needs – and how satisfied they were.
  • identify most favoured future developments – what directions should the archive head in next.

The research method was an analysis of usage data, then a survey questionnaire – and we threw lots of effort at engaging people in that. There were then 16 individual user observations, where we sat with the users, asked them to carry out tests and narrate their work. And then we had group workshops with parliamentary staff and public engagement staff, as well as four workshops with the external user community tailored to particular interests.

So we had a rich set of data from this. We identified important areas of the site. We also concluded that the archive and the relationship to the Parliament website, and that website itself, needed rethinking from the ground up.

So, what did we find of interest to this community?

Well, we found users are hard to find and engage – despite engaging the social media community – and staff similarly, not least as the internal workshop was just after the EURef; that they are largely ignorant about what web archives are – we asked about the UK Web Archive, the Government Archive, and the Parliamentary Archive… It appeared that survey respondents understood what these are BUT in the workshops most were thinking about the online version of Hansard – a kind of archive but not what was intended. We also found that users are not always sure what they’re doing – particularly when engaging in a live browser with snapshots of the site from previous dates, and with the fact that several snapshots might exist from different points in time. There were also some issues with understanding the Wayback Machine surround for the archived content – difficulty understanding what was content, what was the frame. There was a particular challenge around using URL search. People tried everything they could to avoid that… We asked them to find archived pages for the homepage of parliament.uk… And had many searches for “homepage” – there was real lack of understanding of the browser and the search functionality. There is also no correlation between how well users did with the task and how well they felt they did. I take from that that a lack of feedback, requests, issues, does not mean there is not an issue.

Second group of findings… We struggled to find academic participants for this work. But our users prioritised in their own way. It became clear that users wanted discovery mechanisms that match their mental map – and actually the archive mapped more to an internal view of how parliament worked… And browsing taxonomies and structures didn’t work for them. That led to a card sorting exercise to rethink this. We also found users liked structures and wanted discovery based on entities: people, acts, publications – so search connected with that structure works well. Also users were very interested to engage in their own curation, tagging and folksonomy, make their own collections, share materials. Teachers particularly saw potential here.

So, what don’t users want? They have a variety of real needs but they were less interested in derived data sets like link browse; I demonstrated data visualisation, including things like ngrams, work on WARCS; API access; take home data… No interest from them!

So, three general lessons coming out of this… If you are engaging in this sort of research, spend as much resource as possible. We need to cultivate users that we do know, they are hard to find but great when you find them. Remember the diversity of groups of users you deal with…

Chris: So the picture Peter is painting is complex, and can feel quite disheartening. But his work has uncovered issues in some of our assumptions, and really highlights needs of users in the public. We now have a much better understanding so can start to address these concerns.

What we’ve done internally is raise the profile of the Parliamentary Web Archive amongst colleagues. We got delayed with procurement… But we have a new provider (MirrorWeb) and they have really helped here too. So we are now in a good place to deliver a user-centred resource at: webarchive.parliament.uk.

We would love to keep the discussion going… Just not about #goatgate! (contact them on @C_Fryer and @pj_webster)

Q&A

Q1) Do you think there will be tangible benefits for the service and/or the users, and how will you evidence that?

A1 – Chris) Yes. We are redeveloping the web archive. And as part of that we are looking at how we can connect the archive to the catalogue and that is all part of a new online services project. We have tangible results to work on… It’s early days but we want to translate it to tangible benefits.

Q2) I imagine the parliament is a very conservative organisation that doesn’t delete content very often. Do you have a sense of what people come to the archive for?

A2 – Chris) Right now it is mainly people who are very aware of the archive, what it is and why it exists. But the research highlighted that many of the people less familiar with the archive wanted the archived versions of content on the live site, and the older content was more of interest.

A2 – Peter) One thing we did was to find out what the difference was between what was on the live website and what was on the archive… And looking ahead… The archive started in 2009… But demand seems to be quite consistent in terms of type of materials.

A2 – Chris) But it will take us time to develop and make use of this.

Q3) Can you say more about the interface and design… So interesting that they avoid the URL search.

A3 – Peter) The outsourced provider was Internet Memory Research… When you were in the archive there was an A-Z browser, a keyword search and a URL search. Above that, the parliament.uk site had a taxonomy that linked out, and that didn’t work. I asked them to use that browse and it was clear that their thought process directed them to the wrong places… So the recommendation was that it needs to be elsewhere, and more visible.

Q4) You were talking about users wanting to curate their own collections… Have you been considering setting up user dashboards to create and curate collections?

A4 – Chris) We are hoping to do that with our website and service, but it may take a while. But it’s a high priority for us.

Q5) I was interested to understand, the users that you selected for the survey… Were they connected before and part of the existing user base, or did you find them through your own efforts?

A5 – Peter) A bit of both… We knew more about those who took the survey and they were the ones we had in the observations. But this was a self selecting group, and they did have a particular interest in the parliament.

Emily Maemura, Nicholas Worby, Christoph Becker & Ian Milligan: Origin stories: documentation for web archives provenance

Emily: We are going to talk about origin stories and it comes out of interest in web archives, provenance, trust. This has been a really collaborative project, and working with Ian Milligan from Toronto. So, we have been looking at two questions really: How are web archives made? How can we document or communicate this?

We wanted to look at choices and decisions in creating collections. We have been studying the creation of University of Toronto Libraries (UTL) Archive-It collections:

  • Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving.
  • Toronto 2015 Pan Am games (crawled regularly for one month one-off event)
  • Global Summitry Archive

So, thinking about web archives and how they are made we looked at the Web Archiving Life Cycle Model (Bragg et al 2013), which suggests a linear process… But the reality is messier… and iterative as test crawls are reviewed, feed into production crawls… But are also patched as part of QA work.

From this work then we have four things you should document for provenance:

  1. Scoping is iterative and regularly reviewed, and the data budget is a key part of this.
  2. The process of crawls is important to document as the influence of live web content and actors can be unpredictable.
  3. There may be different considerations for access, choices for mode of access can impact discovery, and may be particularly well suited to particular users or use cases.
  4. The fourth thing is context, and the organisational or environmental factors that influence web archiving program – that context is important to understand those decision spaces and choices.

Nick: So, in order to understand these collections we had to look at the organisational history of web archiving. For us web archiving began in 2005, and we piloted what became Archive-It in 2006. It was in a liminal state for about 8 years… There were few statements around collection development until last year really. But the new policy talks about scoping, policy, permissions, etc.

So that transition towards service is reflected in staffing. It is still a part time commitment but is written into several people’s job descriptions now, it is higher profile. But there are resourcing challenges around crawling platforms – the earliest archives had to be automatic; data budgets; storage limits. There are policies, permissions, robots.txt policy, access restrictions. And there is the legal context… Copyright laws changed a lot in 2012… Started with permissions, then opt outs, but now it’s take down based…

Looking in turn at these collections:

Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving. Covers main parties and ever changing group of loosely defined interest groups. This was hard to understand as there were four changes of staff in the time period.

Toronto 2015 Pan Am games (crawled regularly for one month one-off event) – based around a discrete event.

Global Summitry Archive – this is a collaborative archive, developed by researchers. It is a hybrid and is an ongoing collection capturing specific events.

In terms of scoping we looked at motivation whether mandate, an identified need or use, collaboration or coordination amongst institutions. These projects are based around technological budgets and limitations… In cases we only really understand what’s taking place when we see crawling taking place. Researchers did think ahead but, for instance, video is excluded… But there is no description of why text was prioritised over video or other storage. You can see evidence of a lack of explicit justifications for crawling particular sites… We have some information and detail, but it’s really useful to annotate content.

In the most recent elections the candidate sites had altered robots.txt… They weren’t trying to block us but the technology used and their measures against DDOS attacks had that effect.
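To illustrate how a robots.txt file can shut out an archive crawler without anyone intending it, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example.org URL are hypothetical (the talk did not show the candidates' actual files); `archive.org_bot` is used as the crawler's user agent string.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical file: a blanket rule, e.g. added as an anti-bot measure,
# which also blocks well-behaved archive crawlers.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

# Hypothetical fix: the same blanket rule plus an explicit exception
# for the archive crawler.
WITH_EXCEPTION = """\
User-agent: archive.org_bot
Allow: /

User-agent: *
Disallow: /
"""

page = "https://example.org/candidate"
print(crawler_allowed(BLANKET_BLOCK, "archive.org_bot", page))
print(crawler_allowed(WITH_EXCEPTION, "archive.org_bot", page))
```

The first file blocks the archive crawler along with everything else; the second lets it through while still turning other bots away. Recording which situation applied at crawl time is exactly the sort of process documentation being argued for here.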

In terms of access we needed metadata and indexes, but the metadata and how they are populated shapes how that happens. We need interfaces but also data formats and restrictions.

Emily: We tried to break out these interdependencies and interactions around what gets captured… Whether a site is captured is down to a mixture of organisational policies and permissions; legal context and copyright law for fair dealing, etc. The wider context elements also change over time… Including changes in staff, as well as changes in policy, in government, etc. This can all impact usage and clarity of how what is there came to be.

So, conclusions and future work… In telling the origin stories we rely on many different aspects and it is very complex. We are working towards an extended paper. We believe a little documentation goes a long way… We have a proposal for structured documentation: goo.gl/CQwMt2

Q&A

Q1) We did this exercise in the Netherlands… We needed to go further in the history of our library… Because in the ’90s we already collected interesting websites for clients – the first time we thought about the web as an important stance.. But there was a gap there between the main library work and the web archiving work…

Q2) I always struggle with what can be conveyed that is not in the archive… Sites not crawled, technical challenges, sites that it is decided not to crawl early on… That very initial thinking needs to be conveyed to pre-seed things… Hard to capture that…

A2 – Emily) There is so much in scoping that is before the seed list that gets into the crawl… Nick mentioned there are proposals for new collections that explains the thinking…

A2 – Nick) That’s about the best way to do it… Can capture pre-seeds and test crawls… But need that “what should be in the collection”

A2 – Emily) The CPPP is actually based on a prior web list of suggested sites… Which should also have been archived.

Q3) In any kind of archive the same issues are hugely there… Decisions are rarely described… Though a whole area of post modern archive description around that… But a lot comes down to the creator of the collection. But I haven’t seen much work on what should be in the archive that is expected to be there… A different context I guess..

A3 – Emily) I’ve been reading a lot of post modern archive theory… It is challenging to document all of that, especially in a way that is useful for researchers… But have to be careful not to transfer over all those issues from the archive into the web archive…

Q4) You made the point that the liberal party candidate had blocked access to the Internet Archive crawler… That resonated for me as that’s happened a few times for our own collection… We have legal deposit legislation and that raises questions of whose responsibility it is to take that forward..

A4 – Nick) I found it fell to me… Once we got the right person on the phone it was an easy case to make – and it wasn’t one site but all the candidates for that party!

Q5) Have you had any positive or negative responses to opt-outs and take downs?

A5 – Nick) We don’t host our own Wayback Machine so use their policy. We honour take downs but get very very few. Our communications team might have felt differently but we had someone quite bullish in charge.

Nicola) As an institution there is a very variable appetite for risk – hard to communicate internally, let alone externally to our users.

Q6) In your research have you seen any web archive documenting themselves well? People we should follow? Or based mainly on your archives?

A6) It’s mainly based on our own archives… We haven’t done a comprehensive search of other archives’ documentation.

Jackie Dooley, Alexis Antracoli, Karen Stoll Farrell & Deborah Kempe: Developing web archiving metadata best practices to meet user needs

Alexis: We are going to present on the OCLC Research Library Partnership web archive working group. So, what was the problem? Well, web archives are not very easily discoverable in the ways people are usually used to discovering archives or library resources. This was the most widely shared issue across two OCLC surveys and so a working group was formed.

At Princeton we use Archive-It, but you had to know we did that… It wasn’t in the catalogue, it wasn’t on the website… So you wouldn’t find it… Then we wanted to bring it into our discovery system but that meant two different interfaces… So… If we take an example of one of our finding aids… We have the College Republican Records (2004-2016) and they are an on-campus group with websites… This was catalogued with DACS. But how to use the title and dates appropriately? Is the date the content, the seed, what?! And extent – documents, space, or… we went for the number of websites as that felt like something users would understand. We wrote Archive-It into the description… But we wanted guidelines…

So, the objective of this group is to find best practices for web archiving metadata. We have undertaken a literature review, and looked at best practices for descriptive metadata across single and multiple sites.

Karen: For our literature review we looked at peer reviewed literature but also some other sources, and synthesised that. So, who are the end users of web archives… I was really pleased the UK Parliament work focused on public users, as the research tends to focus on academia. Where we can get some clarity on users is on their needs: to read specific web pages/site; data and text mining; technology development or systems analysis.

In terms of behaviours Costa and Silva (2010) classify three groups, much cited by others: Navigational; Informational or Transactional.

Take aways… A couple of things that we found – some beyond metadata… Raw data can be a high barrier so they want accessible interfaces, unified searches, but the user does want to engage directly with the metadata to make sense of the background and provenance of the data. We need to be thinking about flexible formats, engagement. And to enable access we need re-use and rights statements. And we need to be very direct indicating live versus archive material.

Users also want provenance: when and why was this created? They want context. They want to know the collection criteria and scope.

For metadata practitioners there are distinct approaches… archival and bibliographic approaches – RDA, MARC, Dublin Core, MODS, finding aids, DACS; Data elements vary widely, and change quite quickly.

Jackie: We analysed metadata standards and institutional guidelines; we evaluated existing metadata records in the wild… Our preparatory work raised a lot of questions about building a metadata description… Is the website creator/owner the publisher? author? subject? What is the title? Who is the host institution – and will it stay the same? Is it important to clearly state that the resource is a website (not a “web resource”)?

And what does the provenance actually refer to? We saw a lot of variety!

In terms of setting up the context we have use cases for library, archives, research… Some comparisons between bibliographic and archival approaches to description; description of archived and live sites – mostly libraries catalogue live not archived sites; and then you have different levels… Collection level, site level… And there might be document-level descriptions.

So, we wanted to establish data dictionary characteristics. We wanted something simple, not a major new cataloguing standard. So this is a lean 14 element standard, which is grounded in those cataloguing rules, so can be part of wider systems. The categories we have include common elements that are used for identification and discovery of types of resources; other elements have to have clear applicability in the discovery of all types of resources. But some things aren’t included as not super specific to web archives – e.g. audience.

So the 14 data elements are:

  • Access/rights*
  • Collector
  • Contributor*
  • Creator*
  • Date*
  • Description*…

Elements with asterisks are direct maps to Dublin Core fields.
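A rough way to picture the asterisk convention is as a lookup table from each element to its Dublin Core field, if it has one. This is only a sketch covering the six elements captured above, not the full 14. “Rights” for Access/rights comes from the talk; the other Dublin Core names are assumed from the matching element names, and the function name is hypothetical.

```python
from typing import Dict, Optional

# Partial list only: the six elements noted from the slide.
# Value = Dublin Core field the element maps to directly, or None.
# Assumption: apart from "rights", the Dublin Core names are inferred
# from the matching element names rather than quoted from the speakers.
ELEMENT_TO_DUBLIN_CORE: Dict[str, Optional[str]] = {
    "Access/rights": "rights",
    "Collector": None,  # no asterisk: no direct Dublin Core equivalent
    "Contributor": "contributor",
    "Creator": "creator",
    "Date": "date",
    "Description": "description",
}


def direct_dublin_core_maps(elements: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Keep only the elements that map directly to a Dublin Core field."""
    return {name: term for name, term in elements.items() if term is not None}


print(sorted(direct_dublin_core_maps(ELEMENT_TO_DUBLIN_CORE)))
```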

So, Access Conditions (to be renamed as “Rights”) is a direct mapping to Dublin Core “Rights”. This provides the circumstances that affect the availability and/or reuse of an archived website or collection. E.g. for Twitter. And it’s not just about rights because so often we don’t actually know the rights, but we know what can be done with the data.

Collector was the strangest element… There is no equivalent in Dublin Core… This is about the organisation responsible for curation and stewardship of an archived website or collection. The only other place that uses Collector is the Internet Archive. We did consider “repository” but, it may do all those things but… for archived websites… the site lives elsewhere but e.g. Princeton decides to collect those things.

We have a special case for Collector where Archive-It creates its own collection…

So, we have three publications, due out in July on this work..

Q&A

Q1) I was a bit disappointed in the draft report – it wasn’t what I was expecting… We talked about complexities of provenance and wanted something better to convey that to researchers, and we have such detailed technical information we can draw from Archive-It.

A1 – Jackie) Our remit was about description, only. Provenance is bigger than that. Descriptive metadata was appropriate as scope. We did a third report on harvesting tools and whether metadata could be pulled from them… We should have had “descriptive” in our working group name too perhaps…

A1) It is maybe my fault too… But it’s that mapping of DACS that is not perfect… We are taking a different track at University of Albany.

A1 – Jackie) This is NOT a standard, it addresses an absence of metadata that often exists for websites. Scalability of metadata creation is a real challenge… The average time available is 0.25 FTE looking at this. The provenance, the nuance of what was and was not crawled is not doable at scale. This is intentionally lean. If you will be using DACS then a lot of data goes straight in. All standards, with the exception of Dublin Core, are more detailed…

Q2) How difficult is this to put in practice for MARC records? For us we treat a website as a collector… You tend to describe the online publication… A lot of what we’d want to put in just can’t make it in…

A2 – Jackie) In MARC the 852 field is the closest to Collector that you can get. (Collector is comparable to Dublin Core’s Contributor; EAD’s <repository>; MARC’s 524, 852 a and 852 b; MODS’ location or schema.org’s schema:OwnershipInfo.)
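The comparison Jackie lists for Collector reads naturally as a small crosswalk table. The sketch below simply restates those comparable fields as noted in this session; the variable and function names are hypothetical, and “comparable” should not be read as an exact equivalence.

```python
from typing import Dict, List

# Comparable fields for the "Collector" element, as noted in the session.
COLLECTOR_CROSSWALK: Dict[str, List[str]] = {
    "Dublin Core": ["Contributor"],
    "EAD": ["<repository>"],
    "MARC": ["524", "852 a", "852 b"],
    "MODS": ["location"],
    "schema.org": ["schema:OwnershipInfo"],
}


def comparable_fields(standard: str) -> List[str]:
    """Return the fields noted as comparable to Collector, or an empty list."""
    return COLLECTOR_CROSSWALK.get(standard, [])


print(comparable_fields("MARC"))
```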

Researcher case studies (Chair: Alex Thurman)

Jane Winters: Moving into the mainstream: web archives in the press

This paper accompanies my article for the first issue of Internet Histories. I’ll be talking about the increasing visibility of web archives and much greater public knowledge of web archives.

So, who are the audiences for web archives? Well they include researchers in the arts, humanities and social sciences – my area and where some tough barriers are. They are also policymakers, particularly crucial in relation to legal deposit and access. Also “general public” – though it is really many publics. And journalists as a mediator with the public.

What has changed with media? Well there was an initial focus on technology which reached an audience predisposed to that. But increasingly web archives come into discussion of politics and current affairs but there are also social and cultural concerns starting to emerge. There is real interest around launches and anniversaries – a great way for web archives to get attention, like the Easter Rising archive we heard about this week. We do also get that “digital dark age” klaxon which web archives can and do address. And with Brexit and Trump there is a silver lining… And a real interest in archives as a result.

So in 2013 Niels Brügger arranged the first RESAW meeting in Aarhus. And at that time we had one of these big media moments…

Computer Weekly, 12th November 2013, reported on Conservatives erasing official records of speeches from the Internet Archive as a serious breach. Coverage in computing media migrated swiftly to coverage in the mainstream press, the Guardian’s election coverage; BBC News… The hook was that a number of those speeches were about the importance of the internet to open public debate… That hook, that narrative was obviously lovely for the media. Interestingly the Conservatives then responded that many of those speeches were actually still available in the BL’s UK Web Archives. The speeches also made Channel 4 News – and they used it as a hook to talk about broken promises.

Another lovely example was Dr Anat Ben-David from the Open University who got involved with BBC Click on restoring the lost .yu domain. This didn’t come from us trying to get something in the news… They knew our work and we could then point them in the direction of really interesting research… We can all do this highlighting and signposting which is why events like this are so useful for getting to know each others’ work.

When you make the tabloids you know you’ve done well… In 2016 the BBC Food website was faced with closure as part of cuts. The Independent didn’t lead with this, but with how to find recipes when the website goes… They directed everyone to the Internet Archive – as it’s open (unlike the British Library). Although the UK Web Archive blog did post about this, explaining what they are collecting, and why they collect important cultural materials. The BBC actually back-pedalled… Maintaining the pages, but not updating them. But that message got out that web archiving is for everyone… Building it into people’s daily lives.

When the UK Web Archive launched in 2013 the BBC covered this (and the fact that it is not online). The 20th anniversary of the BnF archive had a lot of French press coverage. That’s a great hook as well. Then I mentioned that Digital Dark Age set of stories… Bloomberg had the subtitle “if you want to preserve something, print it” in 2016. We saw similar from the Royal Society. But generally journalists do know who to speak to from BL, or DPC, or IA to counter that view… Can be a really positive story. Even that negative story can be used as a positive thing if you have that connection with journalists…

So this story: “Raiders of the Lost Web: If a Pulitzer-finalist 34 part series can disappear from the web, anything can” looks like it will be that sort of story again… But actually this is about the forensic reconstruction of the work. And the article also talks about cinema at risk, again also preserved thanks to the Internet Archive. This piece of journalism that had been “lost” was about the death of 23 children in a bus crash… It was lost twice as it wasn’t reported, then the story disappeared… But the longer article here talks about that case and the importance of web archiving as a whole.

Talking of traumatic incidents… Brexit coverage of the NHS £350m per week saving on the Vote Leave website… But it disappeared after the vote. BUT you can use the Internet Archive, and the structured referendum collection from the UK Legal Deposit libraries, so the promises are retained into the long term…

And finally, on to Trump! In an Independent article on Melania Trump’s website disappearing, the journalist treats the Internet Archive as another source, a way to track change over time…

And indeed all of the coverage of IA in the last year, and their mirror site in Canada, that isn’t niche news, that’s mainstream coverage now. The more we have stories on data disappearing, or removed, the more opportunities web archives have to make their work clear to the world.

Q&A

Q1) A fantastic talk and close to my heart as I try to communicate web archives. I think that web archives have fame when they get into fiction… The BBC series New Tricks had a denouement centred on finding a record on the Internet Archive… Are there other fictional representations of web archives?

A1) A really interesting suggestion! Tweet us both if you’ve seen that…

Q2) That coverage is great…

A2) Yes, being held to account is a risk… But that is a particular product of our time… Hopefully when it is clear that it is evidence for any set of politicians… The users may be partisan, even if the content is… It’s a hard line to tread… Non publicly available archives mitigate that… But absolutely a concern.

Q3) It is a big win when there are big press mentions… What happens… Is it more people aware of the tools, or specifically journalists using them?

A3) It’s both but I think it’s how news travels… More people will read an article in the Guardian than will look at the BL website. But they really demonstrate the value and importance of the archive. You want – like the BBC recipe website 100k petition – that public support. We ran a workshop here on a random Saturday recently… It was pitched as tracing family or local history… And a couple were delighted to find their church community website from 15 years ago… It was that easy to know about the value of the archive that way… We did a gaming event with late 1980s games in the IA… That’s brilliant, a kid’s birthday party was going to be inspired by that – that’s a fab use we hadn’t thought of… But journalism is often the easy win…

Q4) Political press and journalistic use is often central… But I love that GifCities project… The nostalgia of the web… The historicity… That use… The way they highlight the datedness of old web design is great… The way we can associate archives with web vernacular that is not evidenced elsewhere is valuable and awesome… Leveraging that should be kept in mind.

A4) The GifCities always gets a “Wow” – it’s a great way to engage people in a teaching setting… Then lead them onto harder real history stuff..!

Q5) Last year when we celebrated the anniversary I had a chance to speak with journalists. They were intrigued that we collect blogs, forums, stuff that is off the radar… And they titled the article “Maybe your Sky Blog is being archived in France” (Sky Blogs is a popular teen blog platform)… But what about not forgetting the stupid things you wrote on the internet when you were 15…

A5) We’ve had three sessions so far, only once did that question arise… But maybe people aren’t thinking like that. More of an issue for the public archive… Less of a worry for a closed archive… But so much of the embarrassing stuff is in Facebook so not in the archive. But it matters especially in the right to be forgotten legislation… But there is also that thing of having something worth archiving…

Q6) The thing of The Crossing is interesting… Their font was copyright… They had to get specific permission from the designer… But that site is in Flash… And soon you’ll need Ilya Kreymer’s old web tools to see it at all.

A6) Absolutely. That’s a really fascinating article and they had to work to revive and play that content…

Q6) And six years old! Only six years!

Cynthia Joyce: Keyword ‘Katrina’: a deep dive through Hurricane Katrina’s unsearchable archive

I’ll be talking about how I use – rather than engaging in the technology directly. I was a journalist for 20 years before teaching journalism, which I do at University of Mississippi. Every year we take a study group to New Orleans to look at the outcome of Katrina. Katrina was 12 years ago. But there is a lot of gentrification and so there are few physical scars there… It was weird to have to explain how hard things were to my 18 year old students. And I wanted to bring that to life… But not just the news coverage which is shown as anniversary, do an update piece… The story is not a discrete event, an era…

I found the best way to capture that era was through blogging. New Orleans was not a tech savvy space, it was a poor, black, high levels of illiteracy sort of space. Web 1.0 had skipped New Orleans and the Deep South in a lot of ways… It was pre-Twitter, Facebook in infancy, mobiles were primitive. Katrina was probably when many in New Orleans started texting – doable on struggling networks. There was also that Digital Divide – out of trend to talk about this but this is a real gap.

So, 80% of the city flooded, more than 800 people died, 70% of residents were displaced. The storm didn’t cause the problems here, it was the flooding and the failure of the levees. That is an important distinction, as that sparked the rage, the activism, the need for action was about the sense of being lied to and left behind.

I was working as a journalist for Salon.com from 1995 – very much web 1.0. I was an editor at Nola.com post Katrina. And I was a resident of New Orleans 2001-2007. We had questions of what to do with comments, follow up, retention of content… A lot of content wasn’t needing preserving… But actually that set of comments should be the shame of Advance Digital and Condé Nast… It was interesting how little help they provided to Nola.com, one of their client papers…

I was conducting research as a citizen, but with journalistic principles and approaches… My method was madness basically… I had instincts, stories to follow, high points, themes that had been missed in mainstream media. I interviewed a lot of people… I followed and used a cross-list of blog rolls… This was a lot of surfing, not just searching…

The WayBackMachine helped me so much there, to see that blogroll, seeing those pages… That idea of the vernacular, drill down 10 years later was very helpful and interesting… To experience it again… To go through, to see common experiences… I also did social media posts and call outs – an affirmative action approach. African American people were on camera, but not a lot of first party documentation… I posted something on Binders Full of Women Writers… I searched more than 300 blogs. I chose the entries… I did it for them… I picked out moving, provocative, profound content… Then let them opt out, or suggest something else… It was an ongoing dialogue with 70 people crowd curating a collective diary. New Orleans Press produced a physical book, and I sent it to Jefferson and IA created a special collection for this.

In terms of choosing themes… The original TOC was based on categories that organically emerged… It’s not all sad, it’s often dark humour…

  • Forever days
  • An accounting
  • Led Astray (pets)
  • Re-entry
  • Kindness of Strangers
  • Indecision
  • Elsewhere = not New Orleans
  • Saute Pans of Mercy (food)
  • Guyville

Guyville for instance… for months no schools were open, so it was a really male space, then so much construction… But some women there thought that was great too. A really specific culture and space.

Some challenges… Some work was journalists writing off the record. We got permissions where we could – we have them for all of the people who survived.

I just wanted to talk about Josh Cousin, a former resident of St Bernard projects. His nickname was the “Bookman” – he was an unusual nerdy kid and was 18 when Katrina hit. They stayed… But were forced to leave eventually… It was very sad… They were forced onto a bus, not told where they were going, they took their dog… Someone on the bus complained. Cheddar was turfed onto the highway… They got taken to Houston. The first post Josh posted was a defiant “I made it” type post… He first had online access when he was at the Astrodome. They had online machines that no-one was using… But he was… And he started getting mail, shoes, stuff in the post… He was training people to use these machines. This kid is a hero… At the sort of book launch for contributors he brought Cheddar the dog… Through pet finder… He had been adopted by a couple in Conneticut who had renamed him “George Michael” – they tried to make him pay $3000 as they didn’t want their dog going back to New Orleans…

In terms of other documentary evidence… Material is all as PDF only… The email record of Michael D. Brown… shows he’s concerned about dog sitting… And later criticised people for not evacuating because of their pets… Two weeks later his emails do talk about pets… There were obviously other things going on… But this narrative, this diary of that time… really brings this reality to life.

I was in a newsroom during Arab Spring… And that’s when they had no option but to run what’s on Twitter, it was hard to verify but it was there and no journalists could get in. And I think Katrina was that kind of moment for blogging…

On Archive-it you can find the Katrina collection… Ranging from resistance and suspicion to gratitude… Some people barely remembered writing stuff, certainly didn’t expect it to be archived. I was collecting 8-9 years later… I was reassured to read about a historian at the Holocaust museum (in Chronicle of Higher Ed) who wasn’t convinced about blogging, until Trump said something stupid and that triggered her to engage.

Q&A

Q1 – David) In 2002 the LOCKSS program had a meeting with subject specialists at NY Public Library… And among those that were deemed worth preserving was The Exquisite Corpse. That was published out of New Orleans. After Katrina we were able to give Andrei Codrescu back his materials and that carried on publishing until 2015… A good news story of archiving from that time.

A1) There are dozens of examples… The things that I found too is that there is no appointed steward… If no institutional support it can be passed round, forgotten… I’d get excited then realise just one person was the advocate, rather than an institution to preserve it for posterity.

Andrei wrote some amazing things, and captured that mood in the early days of the storm…

Q2) I love how your work shows blending of work and sources and web archives in conversation with each other… I have a mundane question… Did you go through any human subjects approval for this work from your institution.

A2) I was an independent journalist at the time… But went to University of New Orleans as the publisher had done a really interesting project with community work… I went to ask them if this project already existed… And basically I ended up creating it… He said “are you pitching it?” and that’s where it came from. Naivety benefited me.

Q3) Did anyone opt out of this project, given the traumatic nature of this time and work?

A3) Yes, a lot of people… But I went to people who were kind of thought leaders here, who were likely to see the benefit of this… So, for instance Karen Gadbois had a blog called Squandered Heritage (now The Lens, the ProPublica of New Orleans)… And participation of people like that helped build confidence in, and lend validity to, the project.

Colin Post: The unending lives of net-based artworks: web archives, browser emulations, and new conceptual frameworks

Framing an artwork is never easy… Art objects are “lumps” of the physical world to be described… But what about net based art works? How do we make these objects of art history… And they raise questions of how we define an artwork in the first place… I will talk about Homework by Alexei Shulgin (http://www.easylife.org/homework/) as an example of where we need techniques and practices of web archiving around net based artworks. I want to suggest a new conceptualisation of net-based artworks as plural, proliferating, heterogeneous archives. Homework is typical, and includes pop ups and self-conscious elements that make it challenging to preserve…

So, this came from a real assignment for Natalie Bookchin’s course in 1997. Alexei Shulgin encouraged artists to turn in homework for grading, and did so himself… And his piece was a single sentence followed by pop up messages – something we use differently today, which has a different significance… Pop ups proliferate across the screen like spam, making the user aware of the browser and its affordances and role… Homework replicates structures of authority and expertise, grading, organising, critiques, including or excluding artists… But rendered absurd…

Homework was intended to be ephemeral… But Shulgin curates assignments turned in, and late assignments. It may be tempting to think of these net works as performance art, with records only of a particular moment in time. But actually this is a full record of the artwork… Homework has entered into archives as well as Shulgin’s own space. It is heterogeneous… All acting on the work. The nature of pop up messages may have changed but the conditions of its original creation and it is still changing the world today.

Shulgin, in conversation with Armin Medosch in 1997, felt “The net at present has few possibilities for self expression but there is unlimited possibility for communication. But how can you record this communicative element, how can you store it?”. There are so many ways and artists but how to capture them… One answer is web archiving… There are at least 157 versions of Homework in the Internet Archive.. This is not comprehensive, but his own site is well archived… But capacity of connections is determined by incidence rather than choice… The crawler only caught some of these. But these are not discrete objects… The works on Shulgin’s site, the captures others have made, the websites that are still available, is one big object. This structure reflects the work itself, archival systems sustain and invigorate through the same infrastructure…

To return to the communicative elements… Archives do not capture the performative aspects of the piece. But we must also attend to the way the object has transformed over time… In order to engage with complex net-based artworks… They cannot be easily separated into “original” and “archived” but are more of a continuum…

Frank Upward (1996) described the Records Continuum Model… This is around four dimensions: Creation, Capture, Organisation, and Pluralisation. All of these are present in the archive of Homework… As copies appear in the Internet Archive, in Rhizome… And spread out… You could describe this as the vitalisation of the artwork on the web…

oldweb.today at Rhizome is a way to emulate the browser… This provides some assurance of the retention of old websites… But that is not the direct representation of the original work… The context and experience can vary – including the (now) speedy load of pages… And possible changes in appearance… When I load Homework here… I see 28 captures all combined, from records over 10 years… The piece wasn’t uniformly archived at any one time… I view the whole piece but actually it is emulated and artificial… It is disintegrated and inauthentic… But in the continuum it is another continuous layer in space and time.

Niels Brügger in “website history” (2010) talks about “Writing the complex strategic situation in which an artefact is entangled”. Digital archives and emulators preserve Homework, but are in themselves generative… But that isn’t exclusive to web archiving… It is something we see in Eugène Viollet-le-Duc (1854/1996), who talks about reestablishing a work in a finished state that may never in fact have existed at any point in time.

Q1) a really interesting and important work, particularly around plurality. I research at Rhizome and we have worked with Net Art Anthology – an online exhibition with emulators… is this faithful… should we present a plural version of the work?

A1) I have been thinking about this a lot… but I don’t think Rhizome should have to do all of this… Art historians should do this contextual work too… Net Art Anthology does the convenience access work but art historians need to do the context work too.

Q1) I agree completely. For an art historian what provenance metadata should we provide for works like this to make it most useful… Give me a while and I’ll have a wish list… 

Comment) A shout out for Gent in Belgium, where they are doing work on online art, so I’ll connect you up.

Q2) Is Homework still an active interactive work?

A2) The final list was really in 1997 – only on IA now… It did end at this time… so experiencing the piece is about looking back… that is artefactual, or a trace. But Shulgin has past work on his page… sort of a capture and framing as archive.

Q3) How does Homework fit in your research?

A3) I’m interested in 90s art, preservation, and those interactions.

Q4) Have you seen that job of contextualisation done well, presented with the work? I’m thinking of Eli Harrison’s quantified self work and how different that looked at the time from now… 

A4) Rhizome does this well, galleries collecting net artists… especially with emulated works… The Guggenheim showed originals and emulated versions, and part of that work was foregrounding the preservation and archiving aspects of the work.

Closing remarks: Emmanuelle Bermès & Jane Winters

Emmanuelle: Thank you all for being here. These were three very intense days. Five days for those at Archives Unleashed. To close, a few comments on IIPC. We were originally to meet in Lisbon, and I must apologise again to Portuguese colleagues, we hope to meet again there… But colocating with RESAW was brilliant – I saw a tweet that we are creating archives in the room next door to those who use and research them. And researchers are our co-creators.

And so many of our questions this week have been about truth and reliability and trust. This is a sign of growth and maturity of the groups. 

IIPC has had a tough year. We are still a young and fragile group… we have to transition to a strong world wide community. We need all the voices and inputs to grow and to transform into something more resilient. We will have an annual meeting at an event in Ottawa later this year.

Finally thank you so much to Jane and colleagues from RESAW, and to Nicholas and WARC committee, and Olga and BL to get this all together so well.

Jane: you were saying how good it has been to bring archivists and researchers together, to see how we can help and not just ask… A few things struck me: discussion of context and provenance; and at the other end permanence and longevity. 

We will have a special issue of Internet Histories so do email us 

Thank you to Niels Brügger and NetLab, The Coffin Trust who funded our reception last night, RESAW Programme Committee, and the really important people – the events team at University of London, and to Robert Kelly who did our wonderful promotional materials. And Olga who has made this all possible.

And we do intend to have another RESAW conference in June in 2 years.

And thank you to Nicholas and Niels for representing IIPC, and to all of you for sharing your fantastic work.

And with that a very interesting week of web archiving comes to an end. Thank you all for welcoming me along!

Jun 152017
 

I am again at the IIPC WAC / RESAW Conference 2017 and today I am in the very busy technical strand at the British Library. See my Day One post for more on the event and on the HiberActive project, which is why I’m attending this very interesting event.

These notes are live so, as usual, comments, additions, corrections, etc. are very much welcomed.

Tools for web archives analysis & record extraction (chair Nicholas Taylor)

Digging documents out of the archived web – Andrew Jackson

This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…

  • staff looked in an outlook calendar for reminders
  • looked for new updates since last check
  • download each to local folder and open
  • check catalogue to avoid re-submitting
  • upload to internal submission portal
  • add essential metadata
  • submit for ingest
  • clean up local files
  • update stats sheet
  • Then ingest usually automated (but can require intervention)
  • Updates catalogue once complete
  • New catalogue records processed or enhanced as necessary.

It was very manual, and very inefficient… So we have created a harvester:

  • Setup: specify “watched targets” then…
  • Harvest (harvester crawl targets as usual) –> Ingested… but also…
  • Document extraction:
    • spot documents in the crawl
    • find landing page
    • extract machine-readable metadata
    • submit to W3ACT (curation tool) for review
  • Acquisition:
    • check document harvester for new publications
    • edit essential metadata
    • submit to catalogue
  • Cataloguing
    • cataloguing records processed as necessary
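
As a rough illustration of the document extraction step above, the sketch below spots PDF documents among crawled URLs that fall under a watched target, and takes the referring page as a first guess at the landing page. The record layout (url, content type, via), the function name and the sample URLs are all assumptions made for illustration; this is not the British Library's crawl log format or harvester code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlRecord:
    """One hypothetical crawl log entry (field names are illustrative)."""
    url: str
    content_type: str
    via: Optional[str]  # page the crawler found this URL on, if known


def spot_documents(records, watched_prefixes):
    """Return (document_url, landing_page) pairs for PDFs under watched targets."""
    found = []
    for rec in records:
        is_pdf = rec.content_type.lower().startswith("application/pdf")
        is_watched = any(rec.url.startswith(p) for p in watched_prefixes)
        if is_pdf and is_watched:
            # Assumption: the referring page is a reasonable first guess
            # at the landing page; a curator would still review it.
            found.append((rec.url, rec.via))
    return found


records = [
    CrawlRecord("https://example.gov.uk/reports/", "text/html", None),
    CrawlRecord("https://example.gov.uk/reports/annual.pdf", "application/pdf",
                "https://example.gov.uk/reports/"),
    CrawlRecord("https://other.example.org/file.pdf", "application/pdf", None),
]
documents = spot_documents(records, ["https://example.gov.uk/"])
print(documents)
```

In a real pipeline the pairs would then go to a curation tool for review rather than straight to the catalogue, as the list above describes.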

This is better but there are challenges. Firstly, what is a “publication”? With the eprints team there was a one-to-one print and digital relationship. But now, no more one-to-one. For example, gov.uk publications… An original report will have an ISBN… But that landing page is a representation of the publication, that’s where the assets are… When stuff is catalogued, what can frustrate technical folk… You take date and text from the page – honouring what is there rather than normalising it… We can dishonour intent by capturing the pages… It is challenging…

MARC is initially alarming… For a developer used to current data formats, it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we do use MARC, and where we want to be now…

One of the intentions of the metadata extraction work was to provide an initial guess of the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that the authors’ names etc. in the document metadata are rarely correct. We start with the worst extractor, and layer up so we have the best shot. What works best is extracting the HTML. Gov.uk is a big and consistent publishing space so it’s worth us working on extracting that.
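
The layering idea can be sketched as a simple merge in which more trusted sources override less trusted ones, but never with an empty value. This is a minimal sketch under assumptions: the three extractor outputs, their field names and their ordering are invented for illustration and are not the team's actual extractors.

```python
def layer_metadata(*layers):
    """Merge metadata dicts; later (more trusted) layers override earlier ones.

    Empty or missing values never overwrite a value already found.
    """
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if value:  # skip None and empty strings
                merged[key] = value
    return merged


# Hypothetical outputs of three extractors, ordered worst to best.
from_pdf_properties = {"title": "Microsoft Word - final_v3.doc", "author": "jsmith"}
from_landing_html = {"title": "Annual Report 2016", "author": ""}
from_publisher_api = {"title": "Annual report 2016 to 2017", "published": "2017-06-01"}

best_guess = layer_metadata(from_pdf_properties, from_landing_html, from_publisher_api)
print(best_guess)
```

The weak author value from the PDF properties survives here only because nothing better was found, which is the kind of guess a cataloguer would still need to check.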

What works even better is the gov.uk API data – it’s in JSON, it’s easy to parse, it’s worth coding as it is a bigger publisher for us.

But now we have to resolve references… Multiple use cases for “records about this record”:

  • publisher metadata
  • third party data sources (e.g. Wikipedia)
  • Our own annotations and catalogues
  • Revisit records

We can’t ignore the revisit records… Have to do a great big join at some point… To get best possible quality data for every single thing….

And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But… When I retry document extraction I can accidentally run up another chain each time… If we do our Solr searches correctly it should be easy so will be correcting this…

We do need to do more future experimentation… Multiple workflows bring synchronisation problems. We need to ensure documents are accessible when discoverable. Need to be able to re-run automated extraction.

We want to iteratively improve automated metadata extraction:

  • improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
  • Bring together different sources
  • Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals)

And we still have that tension between what a publication is… A tension between established practice and publisher output. Need to trial different approaches with catalogues and users… Close that whole loop.

Q&A

Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…

A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.

Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert Van de Sompel and Michael Nelson about this thing of describing a document…

A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently….

Q2) Geoffrey Bilder also working on this…

A2) And that’s the ideal… To improve the standards more broadly…

Q3) Are these all PDF files?

A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…

Q4) What does the user see at the end of this… Is it a PDF?

A4) This work ends up in our search service, and that metadata helps them find what they are looking for…

Q4) Do they know it’s from the website, or don’t they care?

A4) Officially, the way the library thinks about monographs and serials, would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too..

Q4) For me as an archivist all that data on where the document is from, what issues there were in accessing it, etc. would be extremely useful…

Q5) You spoke yesterday about engaging with machine learning… Can you say more?

A5) This is where I’d like to do more user work. The library is keen on subject headings – that’s a big high level challenge so that’s quite amenable to machine learning. We have a massive golden data set… There’s at least a masters thesis in there, right! And if we built something, then ran it over the 3 million ish items with little metadata, it could be incredibly useful. In my opinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…

Comment) That thing of everything ending up as a PDF is on the way out by the way… You should look at Distill.pub – a new journal from Google and Y Combinator – and that’s the future of these sorts of formats, it’s JavaScript and GitHub. Can you collect it? Yes, you can. You can visit the page, switch off the network, and it still works… And it’s there and will update…

A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.

Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform

Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.

So, we have fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But, whilst these are amazing collections, they aren’t accessed or used much. I think this is mainly down to two issues: people don’t know they are there; and the access mechanisms don’t fit well with their practices. Maybe when the Archive-it API is live that will fix it all… Right now though it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting, most use the Archive-It service. But, if you are a researcher… to use web archives you really have to be interested and engaged, you need to be an expert.

So, building this portal is about making this easier to use… We want web archives to be used on page 150 in some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, Simon Fraser University – that represents about half of the archive in Canada.

We work on workflow… We run workshops… We separated the collections so that post docs can look at this

We are using Warcbase (warcbase.org) and command line tools: we transfer data from the Internet Archive and generate checksums; we generate scholarly derivatives – plain text, hypertext graph, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at visualisation of the Canadian political parties and political interest group web crawls which track changes, although that may include crawler issues.

Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.

Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tight-knit, and I like how we build and share and develop on each other’s work. Last year we presented webarchives.ca. We’ve indexed 10 TB of WARCs since then, representing 200+ M Solr docs. We have grown from one collection and we have needed additional facets: institution; collection name; collection ID, etc.
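
To illustrate what faceting on those extra fields looks like at the query level, here is a minimal sketch that builds a Solr select query string with one `facet.field` parameter per facet. The `facet` and `facet.field` parameters are standard Solr; the field names, the core name in the printed path and the helper function are assumptions for illustration, not the webarchives.ca schema.

```python
from urllib.parse import urlencode, parse_qs


def build_facet_query(text, facet_fields, rows=10):
    """Build the query string for a Solr select request with field facets."""
    params = [("q", text), ("rows", str(rows)), ("facet", "true")]
    # Solr accepts facet.field repeated once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


# Field names are assumptions for illustration only.
query_string = build_facet_query(
    "climate change", ["institution", "collection_name", "collection_id"]
)
print("/solr/webarchive/select?" + query_string)
print(parse_qs(query_string)["facet.field"])
```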

Then we have also dealt with scaling issues… 30-40GB to 1TB sized index. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on OpenStack… But right now it isn’t at production scale, but getting there. We are looking at SolrCloud and potentially using a Shard2 per collection.

Last year we had a Solr index using the Shine front end… It’s great but… it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is Blacklight which is in wide use in libraries. There is a bigger community, better APIs, bug fixes, etc… So we have set up a prototype called WARCLight. It does almost all that Shine does, except the tree structure and the advanced searching…

Ian spoke about derivative datasets… For each collection, via Blacklight or ScholarsPortal we want domain/URL Counts; Full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.
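
As a small illustration of two of those derivatives, the sketch below computes per-domain page counts and a weighted domain-to-domain link graph from lists of URLs. It is plain Python over made-up sample data, not Warcbase itself, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of a URL, lower-cased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls):
    """Count captured pages per domain."""
    return Counter(domain_of(u) for u in urls)


def domain_graph(links):
    """Aggregate page-level (source, target) links into weighted domain edges."""
    edges = Counter()
    for source, target in links:
        src, dst = domain_of(source), domain_of(target)
        if src != dst:  # ignore links within a single site
            edges[(src, dst)] += 1
    return edges


# Made-up sample data standing in for a crawl.
pages = ["http://www.party-a.example/", "http://party-a.example/news",
         "http://party-b.example/"]
links = [("http://www.party-a.example/", "http://party-b.example/"),
         ("http://party-a.example/news", "http://party-b.example/about"),
         ("http://party-a.example/news", "http://party-a.example/")]

counts = domain_counts(pages)
graph = domain_graph(links)
print(counts.most_common())
print(dict(graph))
```

Edge lists of this shape are what network tools such as Gephi can load, which is why a pre-computed graph saves researchers from touching the WARC files at all.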

So, that goal Ian talked about: one central hub for archived data and derivatives…

Q&A

Q1) Do you plan to make graphs interactive, by using Kibana rather than Gephi?

A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That was great but didn’t look great in the browser. But it would be great if the casual user could look at drag and drop R type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…

A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kibana for stuff so in due course we may bring that in…

Q2) Interesting as we are doing similar things at the BnF. We did use Shine, looked at Blacklight, but built our own thing…. But we are looking at what we can do… We are interested in that web archive discovery collections approaches, useful in other contexts too…

A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet..

Q2) We tried to give people WARC and WARC files… Our actual users didn’t want that, they want full text…

A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…

Q3) Do you think in few years time

A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next 3 years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder…

Q4) What are some of the organisational, admin and social challenges of building this?

A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging…. “is an institution going to devote a person to this?”

A4 – Ian) This is about making this more accessible… People are more used to Blacklight than Shine. People respond poorly to WARC. But they can deal with PDFs with CSV, those are familiar formats…

A4 – Nick) And when I get back I’m going to be doing some work and sharing to enable an actual community to work on this..

Gregory Wiedeman: Automating access to web archives with APIs and ArchivesSpace

A little bit of context here… At the University at Albany, SUNY, we are a public university with state records laws that require us to archive. This is consistent with traditional collecting. But we have no dedicated web archives staff – so no capacity for lots of manual work.

One thing I wanted to note is that web archives are records. Some have a paper equivalent, or did for many years (e.g. Undergraduate Bulletin). We also have things like Word documents. And then we have things like University sports websites, some of which we do need to keep…

The seed isn’t a good place to manage these as records. But archives theory and practices adapt well to web archives – they are designed to scale, they document and maintain context, with relationship to other content, and a strong emphasis on being a history of records.

So, we are using DACS (Describing Archives: A Content Standard) to describe archives, so why not use that for web archives? It focuses on intellectual content, ignorant of formats; designed for pragmatic access to archives. We also use ArchivesSpace – a modern tool for aggregated records that allows curators to add metadata about a collection. And it is interleaved with our physical archives.

So, for any record in our collection… You can specify a subject… a Python script goes to look at our CDX, looks at numbers, schedules processes, and then as we crawl a collection the extents and data collected… And then shows in our catalogue… So we have our paper records, our digital captures… Users can then find an item, and only then do you need to think about format and context. And, there is an awesome article by David Graves(?) which talks about how that aggregation encourages new discovery…
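
A script of that kind might start by turning CDX lines into the extent information an archival description needs: how many captures, and over what date range. The sketch below is only a guess at the shape of such a step. It assumes the common space-separated CDX layout with a 14-digit timestamp in the second field, uses invented sample lines, and is not the University at Albany's actual script.

```python
from datetime import datetime

SAMPLE_CDX = """\
edu,albany)/ 20150301120000 http://www.albany.edu/ text/html 200 AAAA - - 1043 334 a.warc.gz
edu,albany)/ 20160915080000 http://www.albany.edu/ text/html 200 BBBB - - 1100 901 b.warc.gz
edu,albany)/bulletin 20170102030405 http://www.albany.edu/bulletin text/html 200 CCCC - - 980 77 c.warc.gz
"""


def summarise_cdx(cdx_text):
    """Return capture count plus first and last capture dates."""
    timestamps = []
    for line in cdx_text.splitlines():
        fields = line.split()
        # Skip blank lines and any header line (its second field is not a timestamp).
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        timestamps.append(datetime.strptime(fields[1], "%Y%m%d%H%M%S"))
    if not timestamps:
        return {"captures": 0, "first": None, "last": None}
    return {
        "captures": len(timestamps),
        "first": min(timestamps).date().isoformat(),
        "last": max(timestamps).date().isoformat(),
    }


summary = summarise_cdx(SAMPLE_CDX)
print(summary)
```

The resulting count and date range are the sort of values that could be written into an extent and date field of a catalogue record.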

Users need to understand where web archives come from. They need provenance to frame their research question – it adds weight to their research. So we need to capture what was attempted to be collected – collecting policies included. We have just started to do this with a statement on our website. We need a more standardised content source. This sort of information should be easy to use and comprehend, but it is hard to find the right format to do that.

We also need to capture what was collected. We are using the Archive-It Partner Data API, part of the Archive-It 5.0 system. That API captures:

  • type of crawl
  • unique ID
  • crawl result
  • crawl start, end time
  • recurrence
  • exact date, time, etc…

This looks like a big JSON file. Knowing what has been captured – and not captured – is really important to understand context. What can we do with this data? Well we can see what’s in our public access system, we can add metadata, we can present some start times, non-finish issues etc. on product pages. BUT… it doesn’t address issues at scale.
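
To show how that JSON could be used to surface "non-finish issues", here is a minimal sketch that flags crawl jobs whose status is anything other than a clean finish. The response shape, field names and status values are hypothetical and chosen only for illustration; the real Partner Data API may differ.

```python
import json

# Hypothetical response shape; the real API's field names may differ.
SAMPLE_RESPONSE = """
[
  {"id": 101, "type": "ANNUAL", "status": "FINISHED",
   "start": "2017-01-05T10:00:00Z", "end": "2017-01-06T02:30:00Z"},
  {"id": 102, "type": "TEST", "status": "FINISHED_TIME_LIMIT",
   "start": "2017-02-01T09:00:00Z", "end": "2017-02-02T09:00:00Z"},
  {"id": 103, "type": "ONE_TIME", "status": "FINISHED",
   "start": "2017-03-10T12:00:00Z", "end": "2017-03-10T13:15:00Z"}
]
"""


def incomplete_crawls(jobs):
    """Return IDs of crawl jobs that did not end with a plain FINISHED status."""
    return [job["id"] for job in jobs if job["status"] != "FINISHED"]


jobs = json.loads(SAMPLE_RESPONSE)
flagged = incomplete_crawls(jobs)
print(flagged)
```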

So, we are now working on a new open digital repository using the Hydra system – though not called that anymore! Possibly we will expose data in the API. We need standardised data structure that is independent of tools. And we also have a researcher education challenge – the archival description needs to be easy to use, re-share and understand.

Find our work – sample scripts, command line query tools – on Github:

http://github.com/UAlbanyArchives/describingWebArchives

Q&A

Q1) Right now people describe collection intent, crawl targets… How could you standardise that?

A1) I don’t know… Need an intellectual definition of what a crawl is… And what the depth of a crawl is… They can produce very different results and WARC files… We need to articulate this in a way that is clear for others to understand…

Q1) Anything equivalent in the paper world?

A1) It is DACS but in the paper world we don’t get that granular… This is really specific data we weren’t really able to get before…

Q2) My impression is that ArchivesSpace isn’t built with discovery of archives in mind… What would help with that…

A2) I would actually put less emphasis on web archives… Long term you shouldn’t have all these things captures. We just need a good API access point really… I would rather it be modular I guess…

Q3) Really interesting… the definition of Archive-It, what’s in the crawl… And interesting to think about conveying what is in the crawl to researchers…

A3) From what I understand the Archive-It people are still working on this… With documentation to come. But we need granular way to do that… Researchers don’t care too much about the structure…. They don’t need all those counts but you need to convey some key issues, what the intellectual content is…

Comment) Looking ahead to the WASAPI presentation… Some steps towards vocabulary there might help you with this…

Comment) I also added that sort of issue for today’s panels – high level information on crawl or collection scope. Researchers want to know when crawlers don’t collect things, when to stop – usually to do with freak outs about what isn’t retained… But that idea of understanding absence really matters to researchers… It is really necessary to get some… There is a crapton of data in the partners API – most isn’t super interesting to researchers so some community effort to find 6 or 12 data points that can explain that crawl process/gaps etc…

A4) That issue of understanding users is really important, but also hard as it is difficult to understand who our users are…

Harvesting tools & strategies (Chair: Ian Milligan)

Jefferson Bailey: Who, what, when, where, why, WARC: new tools at the Internet Archive

Firstly, apologies for any repetition between yesterday and today… I will be talking about all sorts of updates…

So, WayBack Search… You can now search WayBackMachine… Including keyword, host/domain search, etc. The index is built on inbound anchor text links to a homepage. It is pretty cool and it’s one way to access this content which is not URL based. We also wanted to look at domain and host routes into this… So, if you look at the page for, say, parliament.uk you can now see statistics and visualisations. And there is an API so you can make your own visualisations – for hosts or for domains.

We have done stat counts for specific domains or crawl jobs… The API is all in json so you can just parse this for, for example, how much of what is archived for a domain is in the form of PDFs.
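
The PDF example can be sketched in a few lines once the statistics are in hand. The payload below is an invented stand-in (capture counts keyed by MIME type); the actual structure returned by the Internet Archive API is not described in the talk, so treat both the shape and the numbers as assumptions.

```python
import json

# Hypothetical payload: capture counts per MIME type for one domain.
SAMPLE_STATS = '{"text/html": 7200, "application/pdf": 1800, "image/jpeg": 1000}'


def mime_share(stats, mime_type):
    """Return the fraction of all captures that have the given MIME type."""
    total = sum(stats.values())
    return stats.get(mime_type, 0) / total if total else 0.0


stats = json.loads(SAMPLE_STATS)
pdf_share = mime_share(stats, "application/pdf")
print(f"{pdf_share:.0%} of captures are PDFs")
```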

We also now have search by format using the same idea, the anchor text, the file and URL path, and you can search for media assets. We don’t have exciting front end displays yet… But I can search for e.g. Puppy, mime type: video, 2014… And get lots of awesome puppy videos [the demo is the Puppy Bowl 2014!]. This media search is available for some of the WayBackMachine for some media types… And you can again present this in the format and display you’d like.

For search and profiling we have a new 14 column CDX including new language, simhash, sha256 fields. Language will help users find material in their local/native languages. The SIMHASH is pretty exciting… that allows you to see how much a page has changed. We have been using it on Archive-It partners… And it is pretty good. For instance seeing a government blog change month to month shows the (dis)similarity.
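
For readers unfamiliar with simhash: it reduces a text to a short fingerprint such that similar texts tend to get fingerprints differing in few bits, so the Hamming distance between two captures' fingerprints is a cheap measure of change. Below is a generic, word-level sketch of the technique using MD5 as the token hash. It is not the Internet Archive's implementation, and the tokenisation and 64-bit size are assumptions.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a simple word-level simhash fingerprint of a text."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


january = "Welcome to the department blog with news and updates"
february = "Welcome to the department blog with news and events"
print(hamming_distance(simhash(january), simhash(january)))
print(hamming_distance(simhash(january), simhash(february)))
```

An unchanged page always gives a distance of zero; how large a distance counts as a meaningful change is a threshold that would need tuning against real captures.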

For those that haven’t seen the Capture tool – Brozzler is in production in Archive-it with 3 dozen organisations using it. This has also led to warcprox developments too. It was intended for AV and social media stuff. We have a chromium cluster… It won’t do domain harvesting, but it’s good for social media.

In terms of crawl quality assurance we are working with the Internet Memory Foundation to create quality tools. These are building on internal crawl priorities work at IA crawler beans, comparison testing. And this is about quality at scale. And you can find reports on how we also did associated work on the WayBackMachine’s crawl quality. We are also looking at tools to monitor crawls for partners, trying to find large scale crawling quality as it happens… There aren’t great analytics… But there are domain-scale monitoring, domain scale patch crawling, and Slack integrations.

For domain scale work, for patch crawling we use WAT analysis for embeds and most linked. We rank by inbound links and add to crawl. ArchiveSpark is a framework for cluster-based data extraction and derivation (WA+).
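
The "rank by inbound links" step could be sketched like this in Python. The input format (plain source/target URL pairs, as might be pulled out of WAT link metadata) and the function name are hypothetical; the actual pipeline was not shown.

```python
from collections import defaultdict


def rank_patch_candidates(links, already_captured, top_n=10):
    """Rank uncaptured targets by how many distinct sources link to them.

    links: iterable of (source_url, target_url) pairs.
    already_captured: set of URLs that do not need patching.
    """
    sources_by_target = defaultdict(set)
    for source, target in links:
        if target not in already_captured and source != target:
            sources_by_target[target].add(source)
    # Most-linked first; ties broken alphabetically so the order is stable.
    ranked = sorted(sources_by_target.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(target, len(sources)) for target, sources in ranked[:top_n]]


if __name__ == "__main__":
    sample_links = [
        ("a.org/1", "x.org/"),
        ("b.org/1", "x.org/"),
        ("a.org/1", "y.org/"),
        ("c.org/", "z.org/"),
        ("b.org/1", "z.org/"),
        ("d.org/", "z.org/"),
    ]
    print(rank_patch_candidates(sample_links, already_captured={"y.org/"}))
```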

Although this is a technical presentation we are also doing an IMLS funded project to train public librarians in web archiving to preserve online local history and community memory, working with partners in various communities.

Other collaborations and research include our end of term web archive 2016/17 when the administration changes… No one is official custodian for the .gov. And this year the widespread deletion of data has given this work greater profile than usual. This time the work was with IA, LOC, UNT, GWU, and others. 250+ TB of .gov/.mil as well as White House and Obama social media content.

There had already been discussion of the Partner Data API. We are currently re-building this so come talk to me if you are interested in this. We are working with partners to make sure this is useful, makes sense, and is made more relevant.

We take a lot of WARC files from people to preserve… So we are looking to see how we can get partners to do this with and for it. We are developing a pipeline for automated WARC ingest for web services.

There will be more on WASAPI later, but this is part of work to ensure web archives are more accessible… And that uses API calls to connect up repositories.

We have also built a WAT API that allows you to query most of the metadata for a WARC file. You can feed it URLs, and get back what you want – except the page type.

We have new portals and searches now and coming. This is about putting new search layers on TLD content in the WayBackMachine… So you can pick media types, and just from one domain, and explore them all…

And with a statement on what archives should do – involving a gif of a centaur entering a rainbow room – that’s all… 

Q&A

Q1) What are implications of new capabilities for headless browsing for Chrome for Brozzler…

A1 – audience) It changes how fast you can do things, not really what you can do…

Q2) What about http post for WASAPI

A2) Yes, it will be in the Archive-It web application… We’ll change a flag and then you can go and do whatever… And there is reporting on the backend. Doesn’t usually affect crawl budgets, it should be pretty automated… There is a UI.. Right now we do a lot manually, the idea is to do it less manually…

Q3) What do you do with pages that don’t specify encoding… ?

A3) It doesn’t go into url tokenisation… We would wipe character encoding in anchor text – it gets cleaned up before elastic search..

Q4) The SIMHASH is before or after the capture? And can it be used for deduplication

A4) After capture before CDX writing – it is part of that process. Yes, it could be used for deduplication. Although we do already do URL deduplication… But we could compare to previous SIMHASH to work out if another copy is needed… We really were thinking about visualising change…

Q5) I’m really excited about WATS… What scale will it work on…

A5) The crawl is on 100 TB – we mostly use existing WARC and Json pipeline… It performs well on something large. But if a lot of URLs, it could be a lot to parse.

Q6) With quality analysis and improvement at scale, can you tell me more about this?

A6) We’ve given the IMF access to our own crawls… But we have been comparing our own crawls to our own crawls… Comparing to Archive-it is more interesting… And looking at domain level… We need to share some similar size crawls – BL and IA – and figure out how results look and differ. It won’t be content based at that stage, it will be hotpads and URLs and things.

Michele C. Weigle, Michael L. Nelson, Mat Kelly & John Berlin: Archive what I see now – personal web archiving with WARCs

Mat: I will be describing tools here for web users. We want to enable individuals to create personal web archives in a self-contained way, without external services. Standard web archiving tools are difficult for non IT experts. “Save page as” is not suitable for web archiving. Why do this? It’s for people who don’t want to touch the command line, but also to ensure content is preserved that wouldn’t otherwise be. More archives are more better.

It is also about creation and access, as both elements are important.

So, our goals involve advancing development of:

  • WARCreate – create WARC from what you see in your browser.
  • Web Archiving Integration Layer (WAIL)
  • Mink

WARCreate is… A Chrome browser extension to save WARC files from your browser, no credentials pass through 3rd parties. It heavily leverages the Chrome webRequest API. But it was built in 2012 so APIs and libraries have evolved so we had to work on that. We also wanted three new modes for browser based preservation: record mode – retain buffer as you browse; countdown mode – preserve reloading page on an interval; event mode – preserve page when automatically reloaded.

So non technical people can simply click on the WARCreate button in the browser to generate WARC files.

Web Archiving Integration Layer (WAIL) is a stand-alone desktop application, it offers collection-based web archiving, and includes Heritrix for crawling, OpenWayback for replay, and Python scripts compiled to OS-native binaries (.app, .exe). One of the recent advancements was a new user interface. We ported Python to Electron – using web technologies to create native apps. And that means you can use native languages to help you to preserve. We also moved from a single archive to collection-based archiving. We also ported OpenWayback to pywb. And we also started doing native Twitter integration – over time and hashtags…

So, the original app was a tool to enter a URI and then get a notification. The new version is a little more complicated but provides that new collection-based interface. Right now both of these are out there… Eventually we’d like to merge functionality here. So, an example here, looking at the UK election as a collection… You can enter information, then crawl to within defined boundaries… You can kill processes, or restart an old one… And this process integrates with Heritrix to give status of a task here… And if you want to Archive Twitter you can enter a hashtag and interval, you can also do some additional filtering with keywords, etc. And then once running you’ll get notifications.

Mink… is a Google Chrome browser extension. It indicates archival capture count as you browse. Quickly submits URI to multiple archives from UI. From Mink(owski) space. Our recent enhancements include enhancements to the interface to add the number of archived pages to the icon at bottom of page. And it allows users to set preferences on how to view a large set of mementos. And communication with user-specified or local archives…

The old Mink interface could be affected by page CSS as it was in the DOM. So we have moved to shadow DOM, making it more reliable and easy to use. And then you have more consistent, intuitive Miller columns for many captures. It’s an integration of live and archive web, whilst you are viewing the live web. And you can see year, month, day, etc. And it is refined to what you want to look at. And you have an icon in Mink to make a request to save the page now – and notification of status.

So, in terms of tool integration…. We want to ensure integration between Mink and WAIL so that Mink points to local archives. In the future we want to decouple Mink from external Memento aggregator – client-side customisable collection of archives instead.

See: http://bit.ly/iipcWAC2017 for tools and source code.

Q&A

Q1) Do you see any qualitative difference in capture between WARCreate and WARC recorder?

A1) We capture the representation right at the moment you saw it.. Not the full experience for others, but for you in a moment of time. And that’s our goal – what you last saw.

Q2) Who are your users, and do you have a sense of what they want?

A2) We have a lot of digital humanities scholars wanting to preserve Twitter and Facebook – the stream as it is now, exactly as they see it. So that’s a major use case for us.

Q3) You said it is watching as you browse… What happens if you don’t select a WARC

A3) If you have hit record you could build up content as pages reload and are in that record mode… It will impact performance but you’ll have a better capture…

Q3) Just a suggestion but I often have 100 tabs open but only want to capture something once a week so I might want to kick it off only when I want to save it…

Q4) That real time capture/playback – are there cool communities you can see using this…

A4) Yes, I think with CNN coverage of a breaking storm allows you to see how that story evolves and changes…

Q5) Have you considered a mobile version for social media/web pages on my phone?

A5) Not currently supported… Chrome doesn’t support that… There is an app out there that lets you submit to archives, but not to create WARC… But there is a movement to making those types of things…

Q6) Personal archiving is interesting… But jailed in my laptop… great for personal content… But then can I share my WARC files with the wider community .

A6) That’s a good idea… And more captures is better… So there should be a way to aggregate these together… I am currently working on that, but you should need to be able to specify what is shared and what is not.

Q6) One challenge there is about organisations and what they will be comfortable with sharing/not sharing.

Lozana Rossenova and Ilya Kreymer, Rhizome: Containerised browsers and archive augmentation

Lozana: As you probably know Webrecorder is a high fidelity interactive recording of any web site you browse – and how you engage. And we have recently released an App in electron format.

Webrecorder is a worm’s eye view of archiving, tracking how users actually move around the web… For instance for Instagram and Twitter posts around #lovewins you can see the quality is high. Webrecorder uses symmetrical archiving – in the live browser and in a remote browser… And you can capture then replay…

In terms of how we organise webrecorder: we have collections and sessions.

The thing I want to talk about today is on Remote browsers, and my work with Rhizome on internet art. And a lot of these works actually require old browser plugins and tools… So Webrecorder enables capture and replay even where technology no longer available.

To clarify: the programme says “containerised” but we now refer to this as “remote browsers” – still using Docker containers to run these various older browsers.

When you go to record a site you select the browser, and the site, and it begins the recording… The Java Applet runs and shows you a visualisation of how it is being captured. You can do this with flash as well… If we open a multimedia in your normal (Chrome) browser, it isn’t working. Restoration is easier with just flash, need other things to capture flash with other dependencies and interactions.

Remote browsers are really important for Rhizome work in general, as we use them to stage old artworks in new exhibitions.

Ilya: I will be showing some upcoming beta features, including ways to use webrecorder to improve other archives…

Firstly, which other web archives? So I built a public web archives repository:

https://github.com/webrecorder/public-web-archives

And with this work we are using WAM – the Web Archiving Manifest. And added a WARC source URI and WARC creation date field to the WARC Header at the moment.

So, Jefferson already talked about patching – patching remote archives from the live web… is an approach where we patch either from live web or from other archives, depending on what is available or missing. So, for instance, if I look at a Washington Post page in the archive from 2nd March… It shows how other archives are being patched in to me to deliver me a page… In the collection I have a thing called “patch” that captures this.

Once pages are patched, then we introduce extraction… We are extracting again using remote archiving and automatic patching. So you combine extraction and patching features. You create two patches and two WARC files. I’ll demo that as well… So, here’s a page from the CCA website and we can patch that… And then extract that… And then when we patch again we get the images, the richer content, a much better recording of the page. So we have 2 WARCs here – one from the British Library archive, one from the patching that might be combined and used to enrich that partial UKWA capture.

Similarly we can look at a CNN page and take patches from e.g. the Portuguese archive. And once it is done we have a more complete archive… When we play this back you can display the page as it appeared, and patch files are available for archives to add to their copy.

So, this is all in beta right now but we hope to release it all in the near future…

Q&A

Q1) Every web archive already has a temporal issue where the content may come from other dates than the page claims to have… But you could aggravate that problem. Have you considered this?

A1) Yes. There are timebounds for patching. And also around what you display to the user so they understand what they see… e.g. to patch only within the week or the month…

Q2) So it’s the closest date to what is in web recorder?

A2) The other sources are the closest successful result on/closest to the date from another site…
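
The time-bounded "closest capture" idea from these two answers could look roughly like this in Python. The data shape (archive name plus capture datetime) and the default one-week window are assumptions for illustration, not Webrecorder's actual code.

```python
from datetime import datetime, timedelta


def pick_patch_capture(target, candidates, window=timedelta(days=7)):
    """Pick the capture closest in time to `target`, or None if none is in the window.

    candidates: iterable of (archive_name, capture_datetime) pairs.
    """
    in_window = [c for c in candidates if abs(c[1] - target) <= window]
    if not in_window:
        return None
    return min(in_window, key=lambda c: abs(c[1] - target))


if __name__ == "__main__":
    wanted = datetime(2017, 3, 2)
    captures = [
        ("archive-a", datetime(2017, 2, 1)),
        ("archive-b", datetime(2017, 3, 5)),
        ("archive-c", datetime(2017, 3, 1)),
    ]
    print(pick_patch_capture(wanted, captures))
```

Returning `None` when nothing falls inside the window is what keeps a patch from silently pulling in content from a very different date.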

Q3) Rather than a fixed window for collection, seeing frequency of change might be useful to understand quality/relevance… But I think you are replaying

A3) Have you considered a headless browser… with the address bar…

A3 – Lozana) Actually for us the key use case is about highlighting and showcasing old art works to the users. It is really important to show the original page as it appeared – in the older browsers like Netscape etc.

Q4) This is incredibly exciting. But how difficult is the patching… What does it change?

A4) If you take a good capture and a static image is missing… Those are easy to patch in… If highly contextualised – like Facebook, that is difficult to do.

Q5) Can you do this in realtime… So you archive with Perma.cc then you want to patch something immediately…

A5) This will be in the new version I hope… So you can check other sources and fall back to other sources and scenarios…

Comment –  Lozana) We have run UX work with an archiving organisation in Europe for cultural heritage and their use case is that they use Archive-It and do QA the next day… Crawl might miss something but highly dynamic, so they want to be able to patch it pretty quickly.

Ilya) If you have an archive that is not in the public archive list on Github please do submit it as a fork request and we’ll be able to add it…

Leveraging APIs (Chair: Nicholas Taylor)

Fernando Melo and Joao Nobre: Arquivo.pt API: enabling automatic analytics over historical web data

Fernando: We are a publicly available web archive, mainly of Portuguese websites from the .pt domain. So, what can you do with our API?

Well, we built our first image search using our API, for instance a way to explore Charlie Hebdo materials; another application enables you to explore information on Portuguese politicians.

We support the Memento protocol, and you can use the Memento API. We are one of the time gates for the time travel searches. And we also have full text search as well as URL search, through our OpenSearch API. We have extended our API to support temporal searches in the Portuguese web. Find this at: http://arquivo.pt/apis/opensearch/. Full text search requests can be made through a URL query, e.g. http://arquivo.pt/opensearch?query=euro 2004 would search for mentions of euro 2004, and you can add parameters to this, or search as a phrase rather than keywords.

You can also search mime types – so just within PDFs for instance. And you can also run URL searches – e.g. all pages from the New York Times website… And if you provide time boundaries the search will look for the capture from the nearest date.
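
A small Python sketch of building such a full text query URL is below. Only the `query` parameter comes from the talk; wrapping the terms in double quotes for a phrase search is an assumption about the syntax, and no other parameter names are guessed here.

```python
from urllib.parse import quote, urlencode

# Endpoint as quoted in the talk; check the arquivo.pt API pages before relying on it.
BASE_URL = "http://arquivo.pt/opensearch"


def build_search_url(terms: str, phrase: bool = False) -> str:
    """Build a full text search URL, optionally quoting the terms as one phrase."""
    query = f'"{terms}"' if phrase else terms
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{BASE_URL}?{urlencode({'query': query}, quote_via=quote)}"


if __name__ == "__main__":
    print(build_search_url("euro 2004"))
    print(build_search_url("euro 2004", phrase=True))
```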

Joao: I am going to talk about our image search API. This works based on keyword searches, you can include operators such as limiting to images from a particular site, to particular dates… Results are ordered by relevance, recency, or by type. You can also run advanced image searches, such as for icons, you can use quotation marks for names, or a phrase.

The request parameters include:

  • query
  • stamp – timestamp
  • Start – first index of search
  • safe Image (yes; no; all) – restricts search only to safe images.

The response is returned in json with total results, URL, width, height, alt, score, timestamp, mime, thumbnail, nsfw, pageTitle fields.

More on all of this: http://arquivo.pt/apis

Q&A

Q1) How do you classify safe for work/not safe for work

A1 – Fernando) This is a closed beta version. Safe for work/nsfw is based on classification worked around training set from Yahoo. We are not for blocking things but we want to be able to exclude shocking images if needed.

Q1) We have this same issue in the GifCities project – we have a manually curated training set to handle that.

Comment) Maybe you need to have more options for that measure to provide levels of filtering…

Q2) With that json response, why did you include title and alt text…

A2) We process image and extract from URL, the image text… So we capture the image, the alt text, but we thought that perhaps the page title would be interesting, giving some sense of context. Maybe the text before/after would also be useful but that takes more time… We are trying to keep this working

Q3) What is the thumbnail value?

A3) It is in base 64. But we can make that clearer in the next version…
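
For anyone wanting to use that thumbnail value, decoding it might look like the Python sketch below. The sample response is made up: the `items` wrapper key and the exact casing of the field names are assumptions, and the thumbnail bytes are a placeholder rather than a real image.

```python
import base64
import json

# Hypothetical response: the "items" key and field casing are assumptions.
SAMPLE_RESPONSE = json.dumps({
    "totalResults": 1,
    "items": [{
        "url": "http://example.pt/img/logo.png",
        "width": 120,
        "height": 60,
        "alt": "logo",
        "mime": "image/png",
        "timestamp": "20040612000000",
        "pageTitle": "Example page",
        "thumbnail": base64.b64encode(b"not-a-real-image").decode("ascii"),
    }],
})


def decode_thumbnails(payload: str) -> list:
    """Return (pageTitle, raw thumbnail bytes) pairs from a JSON response."""
    data = json.loads(payload)
    return [
        (item["pageTitle"], base64.b64decode(item["thumbnail"]))
        for item in data.get("items", [])
    ]


if __name__ == "__main__":
    for title, raw in decode_thumbnails(SAMPLE_RESPONSE):
        print(title, len(raw), "bytes")
```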

Nicholas Taylor: Lots more LOCKSS for web archiving: boons from the LOCKSS software re-architecture

This is following on from the presentation myself and colleagues did at last year’s IIPC on APIs.

LOCKSS came about from a serials librarian and a computer scientist. They were thinking about emulating the best features of the system for preserving print journals, allowing libraries to conserve their traditional role as preserver. The LOCKSS boxes would sit in each library, collecting from publishers’ website, providing redundancy, sharing with other libraries if and when that publication was no longer available.

18 years on this is a self-sustaining programme running out of Stanford, with 10s of networks and hundreds of partners. Lots of copies isn’t exclusive to LOCKSS, but it is the decentralised replication model that addresses the fact that long term bit integrity is hard to solve, and that more (correlated) copies don’t necessarily keep things safe and can make them vulnerable to hackers. So this model is community approved, published on, and well established.

Last year we started re-architecting the LOCKSS software so that it becomes a series of web services. Why do this? Well, to reduce support and operation costs – taking advantage of other software on the web and in web archiving; to de-silo components and enable external integration – we want components to find use in other systems, especially in web archiving; and we are preparing to evolve with the web, to adapt our technologies accordingly.

What that means is that LOCKSS systems will treat WARC as a storage abstraction, and more seamlessly do this, processing layers, proxies, etc. We also already integrate Memento but this will also let us engage WASAPI – which there will be more in our next talk.

We have built a service for bibliographic metadata extraction, for web harvest and file transfer content; we can map values in DOM tree to metadata fields; we can retrieve downloadable metadata from expected URL patterns; and parse RIS and XML by schema. That model shows our bias to bibliographic material.

We are also using plugins to make bibliographic objects and their metadata on many publishing platforms machine-intelligible. We mainly work with publishing/platform heuristics like Atypon, Digital Commons, HighWire, OJS and Silverchair. These vary so we have a framework for them.

The use cases for metadata extraction would include applying to consistent subsets of content in larger corpora; curating PA materials within broader crawls; retrieve faculty publications online; or retrieve from University CMSs. You can also undertake discovery via bibliographic metadata, with your institutions OpenURL resolver.

As described in 2005 D-Lib paper by DSHR et al, we are looking at on-access format migration. For instance x-bitmap to GIF.

Probably the most important core preservation capability is the audit and repair protocol. Network nodes conduct polls to validate integrity of distributed copies of data chunks. More nodes = more security – more nodes can be down; more copies can be corrupted… The nodes do not trust each other in this model and responses cannot be cached. And when copies do not match, the node audits and repairs.
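
A heavily simplified Python sketch of the idea follows. It is not the LOCKSS protocol: the real system is considerably more elaborate, and everything here (one shared nonce per poll, a simple majority vote, the function names) is an assumption for illustration. What it does show is why a fresh random nonce means a node cannot answer from a cached hash and has to re-read its copy.

```python
import hashlib
import os
from collections import Counter
from typing import Dict


def vote(nonce: bytes, content: bytes) -> str:
    """Hash the nonce together with the content, so old answers cannot be reused."""
    return hashlib.sha256(nonce + content).hexdigest()


def poll_and_repair(copies: Dict[str, bytes]) -> Dict[str, bytes]:
    """Run one toy poll over all copies and repair any that lose the vote."""
    nonce = os.urandom(16)  # fresh for every poll
    votes = {node: vote(nonce, data) for node, data in copies.items()}
    winner, count = Counter(votes.values()).most_common(1)[0]
    if count <= len(copies) // 2:
        raise ValueError("no majority agreement; cannot repair safely")
    good = next(data for node, data in copies.items() if votes[node] == winner)
    # Nodes whose vote disagrees with the majority get a repaired copy.
    return {node: (data if votes[node] == winner else good)
            for node, data in copies.items()}


if __name__ == "__main__":
    network = {"lib-a": b"article", "lib-b": b"article", "lib-c": b"artiXle"}
    print(poll_and_repair(network))
```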

We think that functionality may be useful in other distributed digital preservation networks, in repository storage replication layers. And we would like to support varied back-ends including tape and cloud. We haven’t built those integrations yet…

To date our progress has addressed the WARC work. By end of 2017 we will have Docker-ised components, have a web harvest framework, polling and repair web service. By end of 2018 we will have IP address and Shibboleth access to OpenWayBack…

By all means follow and plugin. Most of our work is in a private repository, which then copies to GitHub. And we are moving more towards a community orientated software development approach, collaborating more, and exploring use of LOCKSS technologies in other contexts.

So, I want to end with some questions:

  • What potential do you see for LOCKSS technologies for web archiving, other use cases?
  • What standards or technologies could we use that we maybe haven’t considered
  • How could we help you to use LOCKSS technologies?
  • How would you like to see LOCKSS plug in more to the web archiving community?

Q&A

Q1) Will these work with existing LOCKSS software, and do we need to update our boxes?

A1) Yes, it is backwards compatible. And the new features are containerised so that does slightly change the requirements of the LOCKSS boxes but no changes needed for now.

Q2) Where do you store bibliographic metadata? Or is it in the WARC?

A2) It is separate from the WARC, in a database.

Q3) With the extraction of the metadata… We have some resources around translators that may be useful.

Q4 – David) Just one thing of your simplified example… For each node… They all have to calculate a new separate nonce… None of the answers are the same… They all have to do all the work… It’s actually a system where untrusted nodes are compared… And several nodes can’t gang up on the other… Each peer randomly decides on when to poll on things… There is no leader here…

Q5) Can you talk about format migration…

A5) It’s a capability already built into LOCKSS but we haven’t had to use it…

A5 – David) It’s done on the requests in http, which include acceptable formats… You can configure this thing so that if an acceptable format isn’t found, then you transform it to an acceptable format… (see the paper mentioned earlier). It is based on mime type.
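
The on-access migration David describes could be sketched in Python along these lines. This is an illustration only: it ignores `q` weights in the Accept header, the converter is a placeholder that just tags the bytes (it does not produce a real GIF), and the `image/x-xbitmap` MIME type and all names are assumptions.

```python
def parse_accept(header: str) -> list:
    """Split an HTTP Accept header into MIME types (q-values are ignored here)."""
    types = []
    for part in header.split(","):
        mime = part.split(";")[0].strip().lower()
        if mime:
            types.append(mime)
    return types


def acceptable(mime: str, accepted: list) -> bool:
    major = mime.split("/")[0]
    return any(a in (mime, "*/*", f"{major}/*") for a in accepted)


# Placeholder converter: a real one would actually re-encode the image.
CONVERTERS = {
    ("image/x-xbitmap", "image/gif"): lambda data: b"converted:" + data,
}


def serve(stored_mime: str, data: bytes, accept_header: str):
    """Return (mime, bytes), migrating format only if the stored one is not accepted."""
    accepted = parse_accept(accept_header)
    if acceptable(stored_mime, accepted):
        return stored_mime, data
    for (src, dst), convert in CONVERTERS.items():
        if src == stored_mime and acceptable(dst, accepted):
            return dst, convert(data)
    return stored_mime, data  # nothing suitable: fall back to the original


if __name__ == "__main__":
    print(serve("image/x-xbitmap", b"bits", "image/gif, image/png"))
```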

Q6) We are trying to use LOCKSS as a generic archive crawler… Is that still how it will work…

A6) I’m not sure I have a definitive answer… LOCKSS will still be web harvesting-based. It will still be interesting to hear about approaches that are not web harvesting based.

A6 – David) Also interesting for CLOCKSS which are not using web harvesting…

A6) For the CLOCKSS and LOCKSS networks – the big networks – the web harvesting portfolio makes sense. But other networks with other content types, that is becoming more important.

Comment) We looked at doing transformation that is quite straightforward… We have used an API

Q7) Can you say more about the community project work?

A7) We have largely run LOCKSS as more of an in-house project, rather than a community project. We are trying to move it more in the direction of say, Blacklight, Hydra….etc. A culture change here but we see this as a benchmark of success for this re-architecting project… We are also in the process of hiring a partnerships manager and that person will focus more on creating documentation, doing developer outreach etc.

David: There is a (fragile) demo that you can have a lot of this… The goal is to continue that through the laws project, as a way to try this out… You can (cautiously) engage with that at demo.laws.lockss.org but it will be published to GitHub at some point.

Jefferson Bailey & Naomi Dushay: WASAPI data transfer APIs: specification, project update, and demonstration

Jefferson: I’ll give some background on the APIs. This is an IMLS funded project in the US looking at Systems Interoperability and Collaborative Development for Web Archives. Our goals are to:

  • build WARC and derivative dataset APIs (AIT and LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
  • Seed and launch community modelled on characteristics of successful development and participation from communities ID’d by project
  • Sketch a blueprint and technical model for future web archiving APIs informed by project R&D
  • Technical architecture to support this.

So, we’ve already run WARC and Digital Preservation Surveys. 15-20% of Archive-it users download and locally store their WARCS – for various reasons – that is small and hasn’t really moved, that’s why data transfer was a core area. We are doing online webinars and demos. We ran a national symposium on API based interoperability and digital preservation and we have white papers to come from this.

Development wise we have created a general specification, a LOCKSS implementation, Archive-it implementation, Archive-it API documentation, testing and utility (in progress). All of this is on GitHub.

The WASAPI Archive-it Transfer API is written in python, meets all gen-spec criteria, swagger yaml in the repos. Authorisation uses AIT Django framework (same as web app), not defined in general specification. We are using browser cookies or http basic auth. We have a basic endpoint (in production) which returns all WARCs for that account; base/all results are paginated. In terms of query parameters you can use: filename; filetype; collection (ID); crawl (ID for AIT crawl job) etc.

So what do you get back? A JSON object has: pagination, count, request-url, includes-extra. You have fields including account (Archive-it ID); checksums; collection (Archive-It ID); crawl; crawl time; crawl start; filename; filetype; locations; size. And you can request these through simple http queries.
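
A rough Python sketch of working with that kind of response is below. The base URL is a made-up placeholder, the `files` wrapper key is an assumption, and the sample entry is invented; only the parameter and field names listed above come from the talk. No network call is made.

```python
import hashlib
from urllib.parse import urlencode

BASE_URL = "https://archive.example/wasapi/v1/webdata"  # hypothetical endpoint


def build_query(**params) -> str:
    """Build a query URL from parameters such as collection, crawl, filetype."""
    clean = {k: v for k, v in params.items() if v is not None}
    return f"{BASE_URL}?{urlencode(clean)}" if clean else BASE_URL


def checksum_ok(entry: dict, data: bytes) -> bool:
    """Check downloaded bytes against every checksum listed for the file."""
    checksums = entry.get("checksums", {})
    return bool(checksums) and all(
        hashlib.new(algo, data).hexdigest() == expected
        for algo, expected in checksums.items()
    )


# Invented sample response for illustration.
SAMPLE_RESPONSE = {
    "count": 1,
    "files": [{
        "filename": "EXAMPLE-20170101000000-00000.warc.gz",
        "filetype": "warc",
        "size": 11,
        "checksums": {"sha1": hashlib.sha1(b"hello world").hexdigest()},
        "locations": ["https://archive.example/files/EXAMPLE-20170101000000-00000.warc.gz"],
    }],
}

if __name__ == "__main__":
    print(build_query(collection=1234, filetype="warc"))
    print(checksum_ok(SAMPLE_RESPONSE["files"][0], b"hello world"))
```

Verifying checksums after transfer is the part that matters for preservation: it confirms the local copy matches what the service says it holds.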

You can also submit jobs for generating derivative datasets. We use existing query language.

In terms of what is to come, this includes:

  1. Minor AIT API features
  2. Recipes and utilities (testers welcome)
  3. Community building research and report
  4. A few papers on WA APIs
  5. Ongoing surveys and research
  6. Other APIs in WASAPI (past and future)

So we need some way to bring together these APIs regularly. And also an idea of what other APIs we need to support, and how to prioritise that.

Naomi: I’m talking about the Stanford take on this… These are the steps Nicholas, as project owner, does to download WARC files from Archive-it at the moment… It is a 13 step process… And this grant funded work focuses on simplifying the first six steps and making it more manageable and efficient. As a team we are really focused on not being dependent on bespoke software: things must be maintainable, with continuous integration set up, excellent test coverage, and automate-able. There is a team behind this work, and this was their first touching of any of this code – you had 3 neophytes working on this with much to learn.

We are lucky to be just down the corridor from LOCKSS. Our preferred language is Ruby but Java would work best for LOCKSS. So we leveraged LOCKSS engineering here.

The code is at: https://github.com/sul-dlss/wasapi-downloader/.

You only need Java to run the code. And all arguments are documented in Github. You can also view a video demo:

[Embedded YouTube video demo]

These videos are how we share our progress at the end of each Agile sprint.

In terms of work remaining we have various tweaks, pull requests, etc. to ensure it is production ready. One of the challenges so far has been about thinking crawls and patches, and the context of the WARC.

Q&A

Q1) At Stanford are you working with the other WASAPI APIs, or just the downloads one.

A1) I hope the approach we are taking is a welcome one. But we have a lot of projects taking place, but we are limited by available software engineering cycles for archives work.

Note that we do need a new readme on GitHub

Q2) Jefferson, you mentioned plans to expand the API, when will that be?

A2 – Jefferson) I think that it is pretty much done and stable for most of the rest of the year… WARCs do not have crawl IDs or start dates – hence adding crawl time.

Naomi: It was super useful that the team that built the downloader was separate from the team building the WASAPI, as that surfaced a lot of the assumptions, issues, etc.

David: We have a CLOCKSS implementation pretty much building on the Swagger. I need to fix our ID… But the goal is that you will be able to extract stuff from a LOCKSS box using WASAPI using URL or Solr text search. But timing wise, don’t hold your breath.

Jefferson: We’d also like others’ feedback and engagement with the generic specification – comments welcome on GitHub for instance.

Web archives platforms & infrastructure (Chair: Andrew Jackson)

Jack Cushman & Ilya Kreymer: Thinking like a hacker: security issues in web capture and playback

Jack: We want to talk about securing web archives, and how web archives can get themselves into trouble with security… We want to share what we’ve learnt, and what we are struggling with… So why should we care about security as web archives?

Ilya: Well web archives are not just a collection of old pages… No, high fidelity web archives run untrusted software. And there is an assumption that a live site is “safe” so nothing to worry about… but that isn’t right either..

Jack: So, what could a page do that could damage an archive? Not just a virus or a hack… but more than that…

Ilya: Archiving local content… Well a capture system could have privileged access – on local ports or network server or local files. It is a real threat. And could capture private resources into a public archive. So. Mitigation: network filtering and sandboxing, don’t allow capture of local IP addresses…
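
The "don't allow capture of local IP addresses" mitigation could start from something like this Python sketch, using the standard `ipaddress` module. It only checks literal IP addresses and `localhost`; as the comments note, a real capture system would also need to resolve hostnames and re-check the resulting addresses (and handle redirects), which is left out here. The function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_host(host: str) -> bool:
    """True for localhost and for literal IPs in private or otherwise local ranges."""
    lowered = host.lower()
    if lowered == "localhost" or lowered.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A hostname, not an IP literal: it must be resolved and the
        # resulting address re-checked before capture (not done here).
        return False
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def allowed_to_capture(url: str) -> bool:
    """Allow only http(s) URLs whose host is not obviously local."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    return not is_blocked_host(host)


if __name__ == "__main__":
    for candidate in ("http://8.8.8.8/", "http://127.0.0.1:8080/admin", "file:///etc/passwd"):
        print(candidate, allowed_to_capture(candidate))
```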

Jack: Threat: hacking the headless browser. Modern captures may use PhantomJS or other browsers on the server, most browsers have known exploits. Mitigation: sandbox your VM

Ilya: Stealing user secrets during capture… Normal web flow… But you have other things open in the browser. Partial mitigation: rewriting – rewrite cookies to exact path only; rewrite JS to intercept cookie access. Mitigation: separate recording sessions – for webrecorder use separate recording sessions when recording credentialed content. Mitigation: Remote browser.

Jack: So assume we are running MyArchive.com… Threat: cross site scripting to steal archive login

Ilya: Well you can use a subdomain…

Jack: Cookies are separate?

Ilya: Not really.. In IE10 the archive within the archive might steal login cookie. In all browsers a site can wipe and replace cookies.

Mitigation: run web archive on a separate domain from everything else. Use iFrames to isolate web archive content. Load web archive app from app domain, load iFrame content from content domain. As Webrecorder and Perma.cc both do.

Jack: Now, in our content frame… how bad could it be if that content leaks… What if we have live web leakage on playback. This can happen all the time… It’s hard to stop that entirely… Javascript can send messages back and fetch new content… to mislead, track users, rewrite history. Bonus: for private archives – any of your captures could export any of your other captures.

The best mitigation is a Content-Security-Policy header, which can limit access to the web archive domain.
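A minimal Python sketch of what such a header value might look like follows. The function name and the example origin are assumptions, and the directive list is deliberately small; a real replay system would likely need a more complex policy. Inline script and eval are allowed here because archived pages commonly rely on them, while every network source is limited to the archive's content origin.

```python
def build_replay_csp(content_origin: str) -> str:
    """Build a Content-Security-Policy value that keeps replayed pages on one origin."""
    # Archived pages often need inline script and eval, so those stay allowed;
    # loading or connecting anywhere other than the content origin does not.
    sources = ["'unsafe-inline'", "'unsafe-eval'", "data:", "blob:", content_origin]
    directives = [
        ("default-src", sources),
        # form-action does not fall back to default-src, so it is set explicitly.
        ("form-action", [content_origin]),
    ]
    return "; ".join(f"{name} {' '.join(values)}" for name, values in directives)


headers = {"Content-Security-Policy": build_replay_csp("https://content.example.org")}
```

The value would be sent on every replayed response from the content domain, so that a captured page trying to reach the live web is blocked by the browser.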

Ilya: Threat: Show different page contents when archived… Pages can tell they’re in an archive and act differently. Mitigation: Run archive in containerised/proxy mode browser.

Ilya: Threat: Banner spoofing… This is a dangerous but quite easy to execute threat. Pages can dynamically edit the archive’s banner…

Jack: Suppose I copy the code of a page that was captured and change it into fake evidence, changing the metadata of the date collected, and/or the URL bar…

Ilya: You can’t do that in Perma because we use frames. But if you don’t separate banner and content, this is a fairly easy exploit to do… So, Mitigation: Use iFrames for replay; don’t inject banner into replay frame… It’s a fidelity/security trade off…

Jack: That’s our top 7 tips… But what next… What we introduce today is a tool called http://warc.games. This is a version of webrecorder with every security problem possible turned on… You can run it locally on your machine to try all the exploits and think about mitigations and what to do about them!

And you can find some exploits to try, some challenges… Of course if you actually find a flaw in any real system please do be respectful.

Q&A

Q1) How much is the bug bounty?! [laughs] What do we do about the use of very old browsers…

A1 – Jack) If you use an old browser you may be compromised already… But we use the most robust solution possible… In many cases there are secure options that work with older browsers too…

Q2) Any trends in exploits?

A2 – Jack) I recommend the book The Tangled Web… And there is an aspect that when you run a web browser there will always be some sort of issue

A2 – Ilya) We have to get around security policies to archive the web… It wasn’t designed for archiving… But that raises its own issues.

Q3) Suggestions for browser makers to make these safer?

A3) Yes, but… How do you do this with current protocols and APIs

Q4) Does running old browsers and escaping from containers keep you awake at night…

A4 – Ilya) Yes!

A4 – Jack) If anyone is good at container escapes please do write that challenge as we’d like to have it in there…

Q5) There’s a great article called “Familiarity Breeds Contempt” which notes that old browsers and software get more vulnerable over time… It is particularly a big risk where you need old software to archive things…

A5 – Jack) Thanks David!

Q6) Can you say more about the headers being used…

A6) The idea is we write the CSP header to only serve from the archive server… And they can be quite complex… May want to add something of your own…

Q7) May depend on what you see as a security issue… for me it may be about the authenticity of the archive… By building something in the website that shows different content in the archive…

A7 – Jack) We definitely think that changing the archive is a security threat…

Q8) How can you check the archives and look for arbitrary hacks?

A8 – Ilya) It’s pretty hard to do…

A8 – Jack) But it would be a really great research question…

Mat Kelly & David Dias: A collaborative, secure, and private InterPlanetary WayBack web archiving system using IPFS

David: Welcome to the session on going InterPlanetary… We are going to talk about peer to peer and other technology to make web archiving better…

We’ll talk about InterPlanetary File System (IPFS) and InterPlanetary WayBack (IPWB)…

IPFS is also known as the distributed web, moving from location based to content based… As we are aware, the web has some problems… You have experience of using a service, accessing email, using a document… There is some break in connectivity… And suddenly all those essential services are gone… Why? Why do we need to have the services working in such a vulnerable way… Even a simple page, you lose a connection and you get a 404. Why?

There is a real problem with permanence… We have this URI, the URL, telling us the protocol, location and content path… But when we come back later – weeks or months – and that content has moved elsewhere… Either somewhere else you can find, or somewhere you can’t. Sometimes it’s like the content has been destroyed… But every time people see a webpage, you download it to your machine… These issues come from location addressing…

In content addressing we tie content to a unique hash that identifies the item… So a Content Identifier (CID) allows us to do this… And then, in a network, when I look for that data… If there is a disruption to the network, we can ask any machine where the content is… And the node near you can show you what is available before you ever go to the network.
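The idea of content addressing can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class and function names are hypothetical, and the address is a plain SHA-256 hex digest rather than a real IPFS CID. It shows that the same bytes always get the same address, and that any node holding the bytes can answer a request for them.

```python
import hashlib


class ContentStore:
    """Toy node that stores blocks under the hash of their content."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]  # raises KeyError if this node lacks it
        # The address doubles as an integrity check on what was returned.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


def fetch(address: str, nodes) -> bytes:
    """Ask each node in turn; where the content lives does not matter."""
    for node in nodes:
        try:
            return node.get(address)
        except KeyError:
            continue
    raise KeyError(address)
```

For example, if one node has gone away, `fetch(address, [offline_node, nearby_node])` still succeeds as long as some reachable node holds the block, which is the contrast with location addressing drawn in the talk.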

IPFS is already used in video streaming (inc. Netflix), legal documents, 3D models – with HoloLens for instance, for games, for scientific data and papers, blogs and webpages, and totally distributed web apps.

IPFS allows this to be distributed and available offline, saves space, optimises bandwidth usage, etc.

Mat: So I am going to talk about IPWB. Motivation here is that the persistence of archived web data is dependent on the resilience of the organisation and the availability of data. The design is extending the CDXJ format, with an indexing and IPFS dissemination procedure, and a Replay and IPFS Pull Procedure. So an adapted CDXJ adds a header with the hash for the content to the metadata structure.
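A hedged sketch of what an index line carrying content hashes could look like, in Python. CDXJ lines are generally a sortable URL key, a timestamp and a JSON block; the JSON field names used here (`header_hash`, `payload_hash`) and the use of SHA-256 hex digests are assumptions for illustration only, not IPWB's actual schema.

```python
import hashlib
import json


def make_index_line(surt_key: str, timestamp: str, header: bytes, payload: bytes) -> str:
    """Build one CDXJ-style line whose JSON block points at content by hash."""
    record = {
        # Hypothetical field names; a real index would use its own schema.
        "header_hash": hashlib.sha256(header).hexdigest(),
        "payload_hash": hashlib.sha256(payload).hexdigest(),
    }
    return f"{surt_key} {timestamp} {json.dumps(record, sort_keys=True)}"


def parse_index_line(line: str):
    """Split a line back into key, timestamp and the decoded JSON block."""
    surt_key, timestamp, raw_json = line.split(" ", 2)
    return surt_key, timestamp, json.loads(raw_json)
```

With an index like this, replay only needs the small index file; the headers and payloads themselves can be pulled from whichever peer holds the matching hash.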

Dave: One of the ways IPFS is making changes in the boundary is in browser tab, in browser extension and service worker as a proxy for requests the browser makes, with no changes to the interface (that one is definitely in alpha!)…

So the IPWB can expose the content to the IPFS and then connect and do everything in the browser without needing to download and execute code on their machine. Building it into the browser makes it easy to use…

Mat: And IPWB enables privacy, collaboration and security, building encryption method and key into the WARC. Similarly CDXJs may be transferred for our users’ replay… Ideally you won’t need a CDXJ on your own machine at all…

We are also rerouting, rather than rewriting, for archival replay… We’ll be presenting on that late this summer…

And I think we just have time for a short demo…

For more see: https://github.com/oduwsdl/ipwb

Q&A

Q1) Mat, I think that you should tell that story of what you do…

A1) So, I looked for files on another machine…

A1 – Dave) When Mat has the archive file on a remote machine… Someone looks for this hash on the network, send my way as I have it… So when Mat looked, it replied… so the content was discovered… request issued, received content… and presented… And that also lets you capture pages appearing differently in different places and easily access them…

Q2) With the hash addressing, are there security concerns…

A2 – Dave) We use Multihash, using SHA… But you can use different hash functions, they just verify the link… In IPFS we prevent that issue with self-describing hashes…
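The Multihash idea is that a hash value carries a small prefix saying which function produced it and how long the digest is, so the function can change later without breaking old links. A minimal Python sketch follows; it covers only two functions, uses the registered multihash codes for sha2-256 (0x12) and sha2-512 (0x13), and assumes single-byte prefixes, which holds for these digest lengths. The helper names are hypothetical.

```python
import hashlib

# name -> (multihash function code, hashlib constructor)
HASH_FUNCTIONS = {
    "sha2-256": (0x12, hashlib.sha256),
    "sha2-512": (0x13, hashlib.sha512),
}


def encode_multihash(data: bytes, function_name: str = "sha2-256") -> bytes:
    """Return <function code><digest length><digest> for the given data."""
    code, hasher = HASH_FUNCTIONS[function_name]
    digest = hasher(data).digest()
    return bytes([code, len(digest)]) + digest


def verify_multihash(data: bytes, multihash: bytes) -> bool:
    """Re-hash the data with whatever function the prefix names and compare."""
    if len(multihash) < 2:
        raise ValueError("multihash too short")
    code, length, digest = multihash[0], multihash[1], multihash[2:]
    for function_code, hasher in HASH_FUNCTIONS.values():
        if function_code == code:
            expected = hasher(data).digest()
            return length == len(expected) and digest == expected
    raise ValueError(f"unknown hash function code: {code:#x}")
```

Because the verifier reads the function from the prefix, content hashed today with sha2-256 and content hashed later with another function can sit side by side and both still be checked.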

Q3) The problem is that the hash function does end up in the URL… and it will decay over time because the hash function will decay… It’s a really hard problem to solve – making a choice now that may be wrong… But there is no way of choosing the right choice.

A3) At least we can use the hash function to indicate whether it looks likely to be the right or wrong link…

Q4) Is hash functioning itself useful with or without IPFS… Or is content addressing itself inherently useful?

A4 – Dave) I think the IPLD is useful anyway… So with legal documents where links have to stay intact, and not be part of the open web, then IPFS can work to restrict that access but still make this more useful…

Q5) If we had a content addressable web, almost all these web archiving issues would be resolved really… It is hard to know if content is in Archive 1 or Archive 2. A content addressable web would make it easier to be archived… Important to keep in mind…

A5 – Dave) I 100% agree! Content addressed web lets you understand what is important to capture. And IPFS saves a lot of bandwidth and a lot of storage…

Q6) What is the longevity of the hashes and how do I check that?

A6 – Dave) OK, you can check the integrity of the hash. And we have filecoin.io which is a blockchain-based storage network and cryptocurrency, and that does handle this information… Using an address in a public blockchain… That’s our solution for some of those specific problems.

Andrew Jackson (AJ), Jefferson Bailey (JB), Kristinn Sigurðsson (KS) & Nicholas Taylor (NT): IIPC Tools: autumn technical workshop planning discussion

AJ: I’ve been really impressed with what I’ve seen today. There is a lot of enthusiasm for open source and collaborative approaches and that has been clear today and the IIPC wants to encourage and support that.

Now, in September 2016 we had a hackathon but there were some who just wanted to get something concrete done… And we might therefore adjust the format… Perhaps pre-define a task well ahead of time… But also a parallel track for the next hackathon/more experimental side. Is that a good idea? What else may be?

JB: We looked at Archives Unleashed, and we did a White House Social Media Hackathon earlier this year… This is a technical track but… it’s interesting to think about what kind of developer skills/what mix will work best… We have lots of web archiving engineers… They don’t use the software that comes out of it… We find it useful to have archivists in the room…

Then, from another angle, there is the cost of the hackathons… IIPC doesn’t have a lot of money and travel is expensive… The impact of that gets debated – it’s a big budget line for 8-10 institutions out of 53 members. The outcomes are obviously useful but… If people expect to be totally funded for days on end across the world, that isn’t feasible… So maybe more little events, or fewer bigger events can work…

Comment 1) Why aren’t these sessions recorded?

JB: Too much money. We have recorded some of them… Sometimes it happens, sometimes it doesn’t…

AJ: We don’t have in-house skills, so it’s third party… And that’s the issue…

JB: It’s a quality thing…

KS: But also, when we’ve done it before, it’s not heavily watched… And the value can feel questionable…

Comment 1) I have a camera at home!

JB: People can film whatever they want… But that’s on people to do… IIPC isn’t an enforcement agency… But we should make it clear that people can film them…

KS: For me… You guys are doing incredible things… And it’s things I can’t do at home. The other aspect is that… There are advancements that never quite happened… But I think there is value in the unconference side…

AJ: One of the things with unconference sessions is that

NT: I didn’t go to the London hackathon… Now we have a technical team, it’s more appealing… The conference in general is good for surfacing issues we have in common… such as extraction of metadata… But there is also the question of when we sit down to deal with some specific task… That could be useful for taking things forward…

AJ: I like the idea of a counter conference, focused on the tools… I was a bit concerned that if there were really specific things… What does it need to be to be worth your organisations flying you to them… Too narrow and it’s exclusionary… Too broad and maybe it’s not helpful enough…

Comment 2) Worth seeing the model used by Python – they have a sprint after their conference. That isn’t an unconference but lets you come together. Mozilla Fest Sprint picks a topic and then next time you work on it… Sometimes other organisations with less money are worth looking at… And for things like crowd sourcing coverage etc… There must be models…

AJ: This is cool.. You will have to push on this…

Comment 3) I think that tacking on to a conference helps…

KS: But challenging to be away from office more than 3/4 days…

Comment 4) Maybe look at NodeJS Community and how they organise… They have a website, NodeSchool.io with three workshops… People organise events pretty much monthly… And create material in local communities… Less travel but builds momentum… And you can see that that has impact through local NodeJS events now…

AJ: That would be possible to support as well… with IIPC or organisational support… Bootstrapping approaches…

Comment 5) Other than hackathon there are other ways to engage developers in the community… So you can engage with Google Summer of Code for instance – as mentors… That is where students look for projects to work on…

JB: We have two GSoC and like 8 working without funding at the moment… But it’s non trivial to manage that…

AJ: Onboarding new developers in any way would be useful…

Nick: Onboarding into the weird and wacky world of web archiving… If IIPC can curate a lot of onboarding stuff, that would be really good for potential… for getting started… Not relying on a small number of people…

AJ: We have to be careful as IIPC tools page is very popular, but hard to keep up to date… Benefits can be minor versus time…

Nick: Do you have GitHub? Just put up an awesome list!

AJ: That’s a good idea…

JB: Microfunding projects – sub $10k – is also an option, for cost-recovered bought-out time for some of these sorts of tasks… That would be really interesting…

Comment 6) To expand on what Jefferson and Nick were saying… I’m really new… Went to IIPC in April. I am enjoying this and learning a lot… I’ve been talking to a lot of you… That would really help more people get the technical environment right… Organisations want to get into archiving on a small scale…

Olga: We do have a list on GitHub… but not up to date and well used…

AJ: We do have this document, we have GitHub… But we could refer to each other… and point to the getting started stuff (only). Rather get away from lists…

Comment 7) Google has an OpenSource.guide page – could take inspiration from that… Licensing, communities, etc… Very simple plain English getting started guide/documentation…

Comment 8) I’m very new to the community… And I was wondering to what extent you use Slack and Twitter between events to maintain these conversations and connections?

AJ: We have a Slack channel; we haven’t publicised it particularly but it’s there… And on Twitter you should tweet @NetPreserve and they will retweet, then this community will see that…

Jun 142017
 

Following on from Day One of IIPC/RESAW I’m at the British Library for a connected Web Archiving Week 2017 event: Digital Conversations @BL, Web Archives: truth, lies and politics in the 21st century. This is a panel session chaired by Elaine Glaser (EG) with Jane Winters (JW), Valerie Schafer (VS), Jefferson Bailey (JB) and Andrew Jackson (AJ). 

As usual, this is a liveblog so corrections, additions, etc. are welcomed. 

EG: Really excited to be chairing this session. I’ll let everyone speak for a few minutes, then ask some questions, then open it out…

JB: I thought I’d talk a bit about our archiving strategy at Internet Archive. We don’t archive the whole of the internet, but we aim to collect a lot of it. The approach is multi-pronged: to take entire web domains in shallow but broad strategy; to work with other libraries and archives to focus on particular subjects or areas or collections; and then to work with researchers who are mining or scraping the web, but not necessarily having preservation strategies. So, when we talk about political archiving or web archiving, it’s about getting as much as possible, with different volumes and frequencies. I think we know we can’t collect everything but important things frequently, less important things less frequently. And we work with national governments, with national libraries…

The other thing I wanted to raise is T.R. Schellenberg, who was an important archivist at the National Archives in the US. He had an idea about archival strategies: that there is a primary documentation strategy, and a secondary strategy. The primary is for a government and agencies to do for their own use, the secondary for future use in unknown ways… And including documentary and evidentiary material (the latter being how and why things are done). Those evidentiary elements become much more meaningful on the web, which has emerged and become more meaningful in the context of our current political environment.

AJ: My role is to build a Web Archive for the United Kingdom. So I want to ask a question that comes out of this… “Can a web archive lie?”. Even putting to one side that it isn’t possible to archive the whole web.. There is confusion because we can’t get every version of everything we capture… Then there are biases from our work. We choose all UK sites, but some are captured more than others… And our team isn’t as diverse as it could be. And what we collect is also constrained by technology capability. And we are limited by time issues… We don’t normally know when material is created… The crawler often finds things only when they become popular… So the academic paper is picked up after a BBC News item – they are out of order. We would like to use more structured data, such as Twitter which has clear publication date…

But can the archive lie? Well, with digital material it is much easier than with print to make an untraceable change. As digital is increasingly predominant we need to be aware that our archive could be hacked… So we have to protect against that, and evidence that we haven’t been hacked… And we have to build systems that are secure and can maintain that trust. Libraries will have to take care of each other.

JW: The Oxford Dictionary word of the year in 2016 was “post truth” whilst the Australian dictionary went for “Fake News”. Fake News for them is disinformation on websites for either political purposes or commercial benefit. Merriam-Webster went for “surreal” – their most searched for word. It feels like we live in very strange times… There aren’t calls for resignation where there once were… Hasn’t it always been thus though…? For all the good citizens who point out the errors of a fake image circulated on Twitter, for many the truth never catches the lie. Fakes, lies and forgeries have helped change human history…

But modern fake news is different to that which existed before. Firstly there is the speed of fake news… Mainstream media only counteracts or addresses this. Some newspapers and websites do public corrections, but that isn’t the norm. Once publishing took time and means. Social media has made it much easier to self-publish. One can create, but also one can check accuracy and integrity – reverse image searching to see when a photo has been photoshopped or shows events of two things before…

And we have politicians making claims that they believe can be deleted and disappear from our memory… We have web archives – on both sides of the Atlantic. The European Referendum NHS pledge claim is archived and lasts long beyond the bus – which was bought by Greenpeace and repainted. The archives have also been capturing political parties’ websites throughout our endless election cycle… The DUP website crashed after announcement of the election results because of demand… But the archive copy was available throughout. There was also a rumour that a hacker was creating an Irish language version of the DUP website… But that wasn’t a new story, it was from 2011… And again the archive shows that, and archives of news websites do that.

Social Networks Responses to Terrorist Attacks in France – Valerie Schafer. 

Before 9/11 we had some digital archives of terrorist materials on the web. But this event challenged archivists and researchers. Charlie Hebdo, Paris Bataclan and Nice attacks are archived… People can search at the BNF to explore these archives, to provide users a way to see what has been said. And at the INA you can also explore the archive, including Twitter archives. You can search, see keywords, explore timelines crossing key hashtags… And you can search for images… including the emojis used in discussion of Charlie Hebdo and Bataclan.

We also have Archive-It collections for Charlie Hebdo. This raises some questions of what should and should not be collected… We did not normally collect newspapers and audiovisual sites, but decided to in this case as we faced a special event. But we still face challenges – it is easier to collect data from Twitter than from Facebook. It is free to collect Twitter data in real time, but the archived/older data is charged for so you have to capture it in the moment. And there are limits on API collection… INA captured more than 12 Million tweets for Charlie Hebdo, for instance; it is very complete but not exhaustive.

We continue to collect for #jesuischarlie and #bataclan… They are continually used and added to, in similar or related attacks, etc. There is a time for exploring and reflecting on this data, and space for critics too…

But we also see that content gets deleted… It is hard to find fake news on social media, unless you are looking for it… Looking for #fakenews just won’t cut it… So, we had a study on fake news… And we recommend that authorities are cautious about material they share. But also there is a need for cross checking – the kinds of projects with Facebook and Twitter. Web archives are full of fake news, but also full of others’ attempts to correct and check fake news as well…

EG: I wanted to go back in time to the idea of the term “fake news”… In order to understand what “Fake News” actually is, we have to understand how it differs from previous lies and mistruths… I’m from outside the web world… We are often looking at tactics to fight fire with fire, to use an unfortunate metaphor… How new is it? And who is to blame and why?

JW: Talking about it as a web problem, or a social media issue isn’t right. It’s about humans making decisions to critique or not that content. But it is about algorithmic sharing and visibility of that information.

JB: I agree. What is new is the way media is produced, disseminated and consumed – those have technological underpinnings. And they have been disruptive of publication and interpretation in a web world.

EG: Shouldn’t we be talking about a culture not just technology… It’s not just the “vessel”… Doesn’t the dissemination have more of a role than perhaps we are suggesting…

AJ: When you build a social network or any digital space you build in different affordances… So that Facebook and Twitter are different. And you can create automated accounts, with Twitter especially offering an affordance for robots etc which allows you to give the impression of a movement. There are ways to change those affordances, but there will also always be fake news and issues…

EG: There are degrees of agency in fake news.. from bots to deliberate posts…

JW: I think there is also the aspect of performing your popularity – creating content for likes and shares, regardless of whether what you share is true or not.

VS: I know terrorism is different… But for any tweet sharing fake news you get 4 retweets denying it… You have more tweets denying than sharing fake news…

AJ: One wonders about the filter bubble impact here… Facebook encourages inward looking discussion… Social media has helped like minded people find each other, and perhaps they can be clipped off more easily from the wider discussion…

VS: I think also what is interesting is the game between social media and traditional media… You have questions and relationship there…

EG: All the internet can do is reflect the crooked timber of reality… We know that people have confirmation bias, we are quite tolerant of untruths, and less tolerant of information that contradicts our perceptions, even if untrue. You have people and the net being equally tolerant of lies and mistruths… But isn’t there another factor here… The people demonised as gatekeepers… By putting in place structures of authority – which were journalism and academics… Their resources are reduced now… So what role do you see for those traditional gatekeepers…

VS: These gatekeepers are no more the traditional gatekeepers that they were…. They work in 24 hour news cycles and have to work to that. In France they are trying to rethink that role, there were a lot of questions about this… Whether that’s about how you react to changing events, and what happens during election…. People thinking about that…

JB: There is an authority and responsibility for media still, but has the web changed that? Looking back it’s surprising now how few organisations controlled most of the media… But is that that different now?

EG: I still think you are being too easy on the internet… We’ve had investigative journalism by Carole Cadwalladr and others on Cambridge Analytica and others who deliberately manipulate reality… You talked about witness testimony in relation to terrorism… Isn’t there an immediacy and authenticity challenge there… Donald Trump’s tweets… They are transparent but not accountable… Haven’t we created a problem that we are now trying to fix?

AJ: Yes. But there are two things going on… It seems to be that people care less about lying… People see Trump lying, and they don’t care, and media organisations don’t care as long as advertising money comes in… A parallel for that in social media – the flow of content and ads takes priority over truth. There is an economic driver common to both mediums that is warping that…

JW: There is an unpopularity aspect too… a (nameless) newspaper here that shares content to generate “I can’t believe this!” and then sharing and generating advertising income… But on a positive note, there is scope and appetite for strong investigative journalism… and that is facilitated by the web and digital methods…

VS: Citizens do use different media and cross media… Colleagues are working on how TV is used… And different channels, to compare… Mainstream and social media are strongly crossed together…

EG: I did want to talk about temporal element… Twitter exists in the moment, making it easy to make people accountable… Do you see Twitter doing what newspapers did?

AJ: Yes… A substrate…

JB: It’s amazing how much of the web is archived… With “Save Page Now” we see all kinds of things archived – including pages that exposed the whole Russian downing a Ukrainian plane… Citizen action, spotting the need to capture data whilst it is still there and that happens all the time…

EG: I am still sceptical about citizen journalism… It’s a small group of people from narrow demographics, it’s time consuming… Perhaps there is still a need for journalist roles… We did talk about filter bubbles… We hear about newspapers and media as biased… But isn’t the issue that communities of misinformation are not penetrated by the other side, but by the truth…

JW: I think bias in newspapers is quite interesting and different to unacknowledged bias… Most papers are explicit in their perspective… So you know what you will get…

AJ: I think so, but bias can be quite subtle… Different perspectives on a common issue allows comparison… But other stories only appear in one type of paper… That selection case is harder to compare…

EG: This really is a key point… There is a difference between facts and truth, and explicitly framed interpretation or commentary… Those things are different… That’s where I wonder about web archives… When I look at Wikipedia… It’s almost better to go to a source with an explicit bias where I can see a take on something, unlike Wikipedia which tries to focus on fact. Talking about politicians lying misses the point… It should be about a specific rhetorical position… That definition of truth comes up when we think of the role of the archive… How do you deal with that slightly differing definition of what truth is…

JB: I talked about different complementary collecting strategies… The Archivist as a thing has some political power in deciding what goes in the historical record… The volume of the web does undercut that power in a way that I think is good – archives have historically been about the rich and the powerful… So making archives non-exclusive somewhat addresses that… But there will be fake news in the archive…

JW: But that’s great! Archives aren’t about collecting truth. Things will be in there that are not true, partially true, or factual… It’s for researchers to sort that out later…

VS: Your comment on Wikipedia… They do try to be factual, neutral… But not truth… And to have a good balance of power… For us as researchers we can be surprised by the neutral point of view… Fortunately the web archive does capture a mixture of opinions…

EG: Yeah, so that captures what people believed at a point of time – true or not… So I would like to talk about the archive itself… Do you see your role as being successors to journalists… Or as being able to harvest the world’s record in a different way…

JB: I am an archivist with that training and background, as are a lot of people working on web archives and interesting spaces. Certainly historic preservation drives a lot of collecting aspects… But also engineering and technological aspects. So it’s people interested in archiving, preservation, but also technology… And software engineers interested in web archiving.

AJ: I’m a physicist but I’m now running web archives. And for us it’s an extension of the legal deposit role… Anything made public on the web should go into the legal deposit… That’s the theory, in practice there are questions of scope, and where we expend quality assurance energy. That’s the source of possible collection bias. And I want tools to support archivists… And also to prompt for challenging bias – if we can recognise that taking place.

JW: There are also questions of what you foreground in Special Collections. There are decisions being made about collections that will be archived and catalogued more deeply…

VS: In BNF my colleagues work in an area with a tradition, with legal deposit responsibility… There are politics of heritage and what it should be. I think that is the case for many places where that activity sits with other archivists and librarians.

EG: You do have this huge responsibility to curate the record of human history… How do you match the top down requirements with the bottom up nature of the web as we now talk about it?

JW: One way is to have others come in to your department to curate particular collections…

JB: We do have special collections – people can choose their own, public suggestions, feeds from researchers, all sorts of projects to get the tools in place for building web archives for their own communities… I think for the sake of longevity and use going forward, the curated collections will probably have more value… Even if they seem more narrow now.

VS: Also interesting that archives did not select bottom-up curation. In Switzerland they went top down – there are a variety of approaches across Europe.

JW: We heard about the 1916 Easter Rising archive earlier, which was through public nominations… Which is really interesting…

AJ: And social media can help us – by seeing links and hashtags. When we looked at this 4-5 years ago everyone linked to the BBC, but now we have more fake news sites etc…

VS: We do have this question of what should be archived… We see capture of the vernacular web – kitten or unicorn gifs etc… !

EG: I have a dystopian scenario in my head… Could you see a time years from now when newspapers are dead, public broadcasters are more or less dead… And we have flotsam and jetsam… We have all this data out there… And kinds of data who use all this social media data… Can you reassure me?

AJ: No…

JW: I think academics are always ready to pick holes in things, I hope that that continues…

JB: I think more interesting is the idea that there may not be a web… Apps, walled gardens… Facebook is pretty hard to web archive – they make it intentionally more challenging than it should be. There are lots of communication tools that disappeared… So I worry more about loss of a web that allows the positive affordances of participation and engagement…

EG: There is the issue of privatising and sequestering the web… I am becoming increasingly aware of the importance of organisations – like the BL and Internet Archive… Those roles did used to be taken on by publicly appointed organisations and bodies… How are they impacted by commercial privatisation… And how those roles are changing… How do you envisage that public sphere of collecting…

JW: For me more money for organisations like the British Library is important. Trust is crucial, and I trust that they will continue to do that in a trustworthy way. Commercial entities cannot be trusted to protect our cultural heritage…

AJ: A lot of people know what we do with physical material, but are surprised by our digital work. We have to advocate for ourselves. We are also constrained by the legal framework we operate within, and we have to challenge that over time…

JB: It’s super exciting to see libraries and archives recognised for their responsibility and trust… But that also puts them at higher risk by those who they hold accountable, and being recognised as bastions of accountability makes them more vulnerable.

VS: Recently we had 20th birthday of the Internet Archive, and 10 years of the French internet archiving… This is all so fast moving… People are more and more aware of web archiving… We will see new developments, ways to make things open… How to find and search and explore the archive more easily…

EG: The question then is how we access this data… The new masters of the universe will be those emerging gatekeepers who can explore the data… What is the role between them and the public’s ability to access data…

VS: It is not easy to explain everything around web archives but people will demand access…

JW: There are different levels of access… Most people will be able to access what they want. But there is also a great deal of expertise in organisations – it isn’t just commercial data work. And working with the Alan Turing Institute and cutting edge research helps here…

EG: One of the founders of the internet, Vint Cerf, says that “if you want to keep your treasured family pictures, print them out”. Are we overly optimistic about the permanence of the record?

AJ: We believe we have the skills and capabilities to maintain most if not all of it over time… There is an aspect of benign neglect… But if you are active about your digital archive you could have a copy in every continent… Digital allows you to protect content from different types of risk… I’m confident that the library can do this as part of its mission.

Q&A

Q1) Coming back to fake news and journalists… There is a changing role between the web as a communications media, and web archiving… Web archives are about documenting this stuff for journalists for research as a source, they don’t build the discussion… They are not the journalism itself.

Q2) I wanted to come back to the idea of the Filter Bubble, in the sense that it mediates the experience of the web now… It is important to capture that in some way, but how do we archive that… And changes from one year to the next?

Q3) It’s kind of ironic to have nostalgia about journalism and traditional media as gatekeepers, in a country where Rupert Murdoch is traditionally that gatekeeper. Global funding for web archiving is tens of millions; the budget for the web is tens of billions… The challenges are getting harder – right now you can use robots.txt but we have DRM coming and that will make it illegal to archive the web – and the budgets have to increase to match that to keep archives doing their job.

AJ: To respond to Q3… Under the legislation it will not be illegal for us to archive that data… But it will make it more expensive and difficult to do, especially at scale. So your point stands, even with that. In terms of the Filter Bubble, they are out of our scope, but we know they are important… It would be good to partner with an organisation where the modern experience of media is explicitly part of its role.

JW: I think that idea of the data not being the only thing that matters is important. Ethnography is important for understanding that context around all that other stuff…  To help you with supplementary research. On the expense side, it is increasingly important to demonstrate the value of that archiving… Need to think in terms of financial return to digital and creative economies, which is why researchers have to engage with this.

VS: Regarding the first two questions… Archives reflect reality, so there will be lies there… Of course web archives must be crossed and compared with other archives… And contextualisation matters, the digital environment in which the web was living… Contextualisation of web environment is important… And with terrorist archive we tried to document the process of how we selected content, and archive that too for future researchers to have in mind and understand what is there and why…

JB: I was interested in the first question, this idea of what happens and preserving the conversation… That timeline was sometimes decades before but is now weeks or days or less… In terms of experience websites are now personalised and our ability to capture that is impossible on a broad question. So we need to capture that experience, and the emergent personalisation… The web wasn’t public before, as ARPAnet, then it became public, but it seems to be ebbing a bit…

JW: With a longer term view… I wonder if the open stuff which is easier to archive may survive beyond the gated stuff that traditionally was more likely to survive.

Q4) Today we are 24 years into advertising on the web. We take ad-driven models as a given, and we see fake news as a consequence of that… So, my question is, Minitel was a large system that ran on a different model… Are there different ways to change the revenue model to change fake or true news and how it is shared…

Q5) Theresa May has been outspoken on fake news and wants a crackdown… The way I interpret that is censorship and banning of sites she does not like… Jefferson said that he’s been archiving sites that she won’t like… What will you do if she asks you to delete parts of your archive…

JB: In the US?!

Q6) Do you think we have sufficient web literacy amongst policy makers, researchers and citizens?

JW: On that last question… Absolutely not. I do feel sorry for politicians who have to appear on the news to answer questions but… Some of the responses and comments, especially on encryption and cybersecurity have been shocking. It should matter, but it doesn’t seem to matter enough yet… 

JB: We have a tactic of “geopolitical redundancy” to ensure our collections are shielded from political endangerment by making copies – which is easy to do – and locate them in different political and geographical contexts. 

AJ: We can suppress content by access. But not deletion. We don’t do that… 

EG: Is there a further risk of data manipulation… Of Trump and Farage and data… a covert threat… 

AJ: We do have to understand and learn how to cope with potential attack… Any one domain is a single point of failure… so we need to share metadata, content where possible… But web archives are fortunate to have the strong social framework to build that on… 

Q7) Going back to that idea of what kinds of responsibilities we have to enable a broader range of people to engage in a rich way with the digital archive… 

Q8) I was thinking about questions in context, and trust in content in the archive… And realising that web archives are fairly young… Generally researchers are close to the resource they are studying… Can we imagine projects in 50-100 years time where we are more separate from what we should be trusting in the archive… 

Q9) My perspective comes from building a web archive for European institutions… And can the archive live… Do we need legal notice on the archive, disclaimers, our method… How do we ensure people do not misinterpret what we do. How do we make the process of archiving more transparent. 

JB: That question of who has resources to access web archives is important. It is a responsibility of institutions like ours… To ensure even small collections can be accessed, that researchers and citizens are empowered with skills to query the archive, and things like APIs to enable that too… The other question on evidencing curatorial decisions – we are notoriously poor at that historically… But there is a lot of technological mystery there that we should demystify for users… All sorts of complexity there… The web archiving needs to work on that provenance information over the next few years… 

AJ: We do try to record this but as Jefferson said much of this is computational and algorithmic… So we maybe need to describe that better for wider audiences… That’s a bigger issue anyway, that understanding of algorithmic process. At the British Library we are fortunate to have capacity for text mining our own archives… We will be doing more than that… It will be small at first… But as it’s hard to bring data to the queries, we must bring queries to the archive. 

JW: I think it is so hard to think ahead to the long term… You’ll never pre-empt all usage… You just have to do the best that you can. 

VS: You won’t collect everything, every time… The web archive is not an exact mirror… It is “reborn digital heritage”… We have to document everything, but we can try to give some digital literacy to students so they have a way to access the web archive and engage with it… 

EG: Time is up. Thank you to our panellists for this fantastic session.

Jun 142017
 

From today until Friday I will be at the International Internet Preservation Consortium (IIPC) Web Archiving Conference 2017, which is being held jointly with the second RESAW: Research Infrastructure for the Study of Archived Web Materials Conference. I’ll be attending the main strand at the School of Advanced Study, University of London, today and Friday, and at the technical strand (at the British Library) on Thursday. I’m here wearing my “Reference Rot in Theses: A HiberActive Pilot” – aka “HiberActive” – hat. HiberActive is looking at how we can better enable PhD candidates to archive web materials they are using in their research, and citing in their thesis. I’m managing the project and working with developers, library and information services stakeholders, and a fab team of five postgraduate interns who are, whilst I’m here, out and about around the University of Edinburgh talking to PhD students to find out how they collect, manage and cite their web references, and what issues they may be having with “reference rot” – content that changes, decays, disappears, etc. We will have a webpage for the project and some further information to share soon but if you are interested in finding out more, leave me a comment below or email me: nicola.osborne@ed.ac.uk. These notes are being taken live so, as usual for my liveblogs, I welcome corrections, additions, comments etc. (and, as usual, you’ll see the structure of the day appearing below with notes added at each session).

Opening remarks: Jane Winters

This event follows the first RESAW event which took place in Aarhus last year. This year we again highlight the huge range of work being undertaken with web archives. 

This year a few things are different… Firstly we are holding this with the IIPC, which means we can run the event over 3 days, and means we can bring together librarians, archivists, and data scientists. The BL have been involved and we are very grateful for their input. We are also excited to have a public event this evening, highlighting the increasingly public nature of web archiving.

Opening remarks: Nicholas Taylor

On behalf of the IIPC Programme Committee I am hugely grateful to colleagues here at the School of Advanced Study and at the British Library for being flexible and accommodating us. I would also like to thank colleagues in Portugal, and hope a future meeting will take place there as had been originally planned for IIPC.

For us we have seen the Web Archiving Conference as an increasingly public way to explore web archiving practice. The programme committee saw a great increase in submissions, requiring a larger than usual commitment from the programming committee. We are lucky to have this opportunity to connect as an international community of practice, to build connections to new members of the community, and to celebrate what you do. 

Opening plenary: Leah Lievrouw – Web history and the landscape of communication/media research Chair: Nicholas Taylor

I intend to go through some context in media studies. I know this is a mixed audience… I am from the Department of Information Studies at UCLA and we have a very polyglot organisation – we can never assume that we all understand each other’s backgrounds and contexts.

A lot about the web, and web archiving, is changing, so I am hoping that we will get some Q&A going about how we address some gaps in possible approaches. 

I’ll begin by saying that it has been some time now that computing, computers as communication devices, has been seen as a medium. This seems commonplace now, but when I was in college this was seen as fringe, in communication research, in the US at least. But for years documentarists, engineers, programmers and designers have seen information resources, data and computing as tools and sites for imagining, building, and defending “new” societies; enacting emancipatory cultures and politics… A sort of Alexandrian vision of “all the knowledge in the world”. This is still part of the idea that we have in web archiving. Back in the day the idea was that fostering this kind of knowledge would bring about internationality, world peace, modernity. When you look at old images you see artefacts – it is more than information, it is the materiality of artefacts. I am a contributor to Niels’ web archiving handbook, and he talks about history written of the web, and history written with the web. So there are attempts to write history with the web, but what about the tools themselves?

So, this idea about connections between bits of knowledge… This goes back before browsers. Many of you will be familiar with H.G. Wells’ World Brain; Suzanne Briet’s Qu’est-ce que la documentation (1951) is a very influential work in this space; Jennifer Light wrote a wonderful book on Cold War Intellectuals, and their relationship to networked information… One of my lecturers was one of these in fact, thinking about networked cities… Vannevar Bush “As we may think” (1945) saw information as essential to order and society.

Another piece I often teach, J.C.R. Licklider and Robert W. Taylor (1968) in “The Computer as a Communication Device” talked about computers communicating but not in the same ways that humans make meaning. In fact this graphic shows a man’s computer talking to an insurance salesman saying “he’s out” and the caption “your computer will know what is important to you and buffer you from the outside world”.

We then have this counterculture movement in California in the 1960s and 1970s… And that feeds into the emerging tech culture. We have The Well coming out of this. Stewart Brand wrote The Whole Earth Catalog (1968-78). And actually in 2012 someone wrote a new Whole Earth Catalog…

Ted Nelson, author of Computer Lib/Dream Machines (1974), is known as the person who came up with the concept of the link, between computers, to information… He’s an inventor essentially. Computer Lib/Dream Machines was a self-published title, a manifesto… The subtitle for Computer Lib was “you can and must understand computers NOW”. Counterculture was another element, and this is way before the web, where people were talking about networked information… These people were not thinking about preservation and archiving, but there was an assumption that information would be kept…

And then as we see information utilities and wired cities emerging, mainly around cable TV but also local public access TV… There was a lot of capacity for information communication… In the UK you had teletext, in Canada there was Telidon… And you were able to start thinking about information distribution wider and more diverse than central broadcasters… With services like LexisNexis emerging we had these ideas of information utilities… There was a lot of interest in the 1980s, and back in the 1970s too.

Harold Sackman and Norman Nie (eds.) The Information Utility and Social Choice (1970); H.G. Bradley, H.S. Dordick and B. Nanus, The Emerging Network Marketplace (1980); R.S. Block “A global information utility”, The Futurist (1984); W.H. Dutton, J.G. Blumler and K.L. Kraemer “Wired cities: shaping the future of communications” (1987).

This new medium looked more like point-to-point communication, like the telephone. But no-one was studying that. There were communications scholars looking at face to face communication, and at media, but not at this on the whole. 

Now, that’s some background, I want to periodise a bit here… And I realise that is a risk of course… 

So, we have the Pre-browser internet (early 1980s-1990s). Here the emphasis was on access – to information, expertise and content at centre of early versions of “information utilities”, “wired cities” etc. This was about everyone having access – coming from that counter culture place. More people needed more access, more bandwidth, more information. There were a lot of digital materials already out there… But they were fiddly to get at. 

Now, when the internet became privatised – moved away from military and universities – the old model of markets and selling information to mass markets, the transmission model, reemerged. But there was also this idea that because the internet was point-to-point – and any point could get to any other point… And that everyone would eventually be on the internet… The vision was of the internet as “inherently democratic”. Now we recognise the complexity of that right now, but that was the vision then.

Post-browser internet (early 1990s to mid-2000s) – was about web 1.0. Browsers and WWW were designed to search and retrieve documents, discrete kinds of files, to access online documents. I’ve said “Web 1.0” but had a good conversation with a colleague yesterday who isn’t convinced about these kinds of labels, but I find them useful shorthand for thinking about the web at particular points in time/use. In this era we had email still but other types of authoring tools arose… Encouraging a wave of “user generated content” – wikis, blogs, tagging, media production and publishing, social networking sites. This sounds such a dated term now but it did change who could produce and create media, and it was the team around LA around this time.

Then we began to see Web 2.0 with the rise of “smart phones” in the mid-2000s, merging mobile telephony and specialised web-based mobile applications, accelerating user content production and social media profiling. And the rise of social networking sounded a little weird to those of us with sociology training who were used to these terms from the real world, from social network analysis. But Facebook is a social network. Many of the tools, blogging for example, can be seen as having a kind of mass media quality – so instead of a movie studio making content… But I can have my blog which may have an audience of millions or maybe just, like, 12 people. But that is highly personal. Indeed one of the earliest so-called “killer apps” for the internet was email. Instead of shipping data around for processing – as the architecture originally got set up for – you could send a short note to your friend elsewhere… Email hasn’t changed much. That point-to-point communication suddenly and unexpectedly became more than half of the ARPANET. Many people were surprised by that. That pattern of interpersonal communication over networks continued to repeat itself – we see it with Facebook, Twitter, and even with Blogs etc. that have feedback/comments etc.

Web 2.0 is often talked about as social driven. But what is important from a sociology perspective, is the participation, and the participation of user generated communities. And actually that continues to be a challenge, it continues to be not the thing the architecture was for… 

In the last decade we’ve seen algorithmic media emerging, and the rise of “web 3.0”. Both access and participation appropriated as commodities to be monitored, captured, analyzed, monetised and sold back to individuals, reconceived as data subjects. Everything is thought about as data, data that can be stored, accessed… Access itself, the action people take to stay in touch with each other… We all carry around monitoring devices every day… At UCLA we are looking at the concept of the “data subjects”. Bruce ? used to talk about the “data footprint” or the “data cloud”. We are at a moment where we are increasingly aware of being data subjects. London is one of the most remarkable cities in the world in terms of surveillance… The UK in general, but London in particular… And that is ok culturally, I’m not sure it would be in the United States.

We did some work in UCLA to get students to mark up how many surveillance cameras there were, who controlled them, who had set them up, how many there were… Neither Campus police nor university knew. That was eye opening. Our students were horrified at this – but that’s an American cultural reaction. 

But if we conceive of our own connections to each other, to government, etc. as “data” we begin to think of ourselves, and everything, as “things”. Right now systems and governance maximising the market, institutional government surveillance; unrestricted access to user data; moves towards real-time flows rather than “stocks” of documents or content. Surveillance isn’t just about government – supermarkets are some of our most surveilled spaces. 

I currently have students working on a “name domain infrastructure” project. The idea is that data will be enclosed, that data is time-based, to replace the IP, the Internet Protocol. So that rather than packages, data is flowing all the time. So that it would be like opening the nearest tap to get water. One of the interests here is from the movie and television industry, particularly web streaming services who occupy significant percentages of bandwidth now… 

There are a lot of ways to talk about this, to conceive of this… 

1.0 tends to be about documents, press, publishing, texts, search, retrieval, circulation, access, reception, production-consumption: content.

2.0 is about conversations, relationships, peers, interaction, communities, play – as a cooperative and flow experience, mobility, social media (though I rebel against that somewhere): social networks. 

3.0 is about algorithms, “clouds” (as fluffy benevolent things, rather than real and problematic, with physical spaces, server farms), “internet of things”, aggregation, sensing, visualisation, visibility, personalisation, self as data subject, ecosystems, surveillance, interoperability, flows: big data, algorithmic media. Surveillance is kind of the environment we live in. 

Now I want to talk a little about traditions in communication studies…

In communication, broadly and historically speaking, there has been one school of thought that is broadly social scientific, from sociology and communications research, that thinks about how technologies are “used” for expression, interaction, as data sources or analytic tools. Looking at media in terms of their effects on what people know or do, can look at media as data sources, but usually it is about their use. 

There are theories of interaction, group process and influence; communities and networks; semantic, topical and content studies; law, policy and regulation of systems/political economy. One key question we might ask here: “what difference does the web make as a medium/milieu for communicative action, relations, interaction, organising, institutional formation and change?” Those from a science and technology background might know about the issues of shaping – we shape technology and technology shapes us.

Then there is the more cultural/critical/humanist or media studies approach. When I come to the UK people who do media studies still think of humanist studies as being different, “what people do”. However this approach of cultural/critical/etc. is about analyses of digital technologies and web; design, affordances, contexts, consequences – philosophical, historical, critical lens. How power is distributed are important in this tradition. 

In terms of theoretical schools, we have the Toronto School/media ecology – the Marshall McLuhan take – which is very much about the media itself; American cultural studies, and the work of James Carey and his students; Birmingham school – the British take on media studies; and new materialism – that you see in Digital Humanities, German Media Studies, that says we have gone too far from the roles of the materials themselves. So, we might ask “What is the web itself (social and technical constituents) as both medium and product of culture, under what conditions, times and places?”

So, what are the implications for Web Archiving? Well I hope we can discuss this, thinking about a table of:

Web Phase | Soc sci/admin | Crit/Cultural

  • Documents: content + access
  • Conversation: Social nets + participation
  • Data/Algorithms: algorithmic media + data subjects

Comment: I was wondering about ArXiv and the move to sharing multiple versions, pre-prints, post prints…

Leah: That issue of changes in publication, what preprints mean for who is paid for what, that’s certainly changing things and an interesting question here…

Comment: If we think of the web moving from documents, towards fluid state, social networks… It becomes interesting… Where are the boundaries of web archiving? What is a web archiving object? Or is it not an object but an assemblage? Also ethics of this…

Leah: It is an interesting move from the concrete, the material… And then this whole cultural heritage question, what does it instantiate, what evidence is it, whose evidence is it? And do we participate in hardening those boundaries… Or do we keep them open… How porous are our boundaries…

Comment: What about the role of metadata?

Leah: Sure, arguably the metadata is the most important thing… What we say about it, what we define it as… And that issue of fluidity… We think of metadata as having some sort of fixity… One thing that has begun to emerge in surveillance contexts… Where law enforcement says “we aren’t looking at your content, just the metadata”, well it turns out that is highly personally identifiable, it’s the added value… What happens when that secondary data becomes the most important thing… In fact, where many of our data systems do not communicate with each other, those connections are through the metadata (only).

Comment: In terms of web archiving… As you go from documents, to conversations, to algorithms… Archiving becomes so much more complex. Particularly where interactions are involved… You can archive the data and the algorithm but you still can’t capture the interactions there…

Leah: Absolutely. As we move towards the algorithmic level it’s not a fixed thing. You can’t just capture the Google search algorithms, they change all the time. The more I look at this work through the lens of algorithms and data flows, there is no object in the classic sense…

Comment: Perhaps, like a movie, we need longer temporal snapshots…

Leah: Like the algorithmic equivalence of persistence of vision. Yes, I think that’s really interesting.

And with that the opening session is over, with organisers noting that those interested in surveillance may be interested to know that Room 101, said to have inspired the room of the same name in 1984, is where we are having coffee…

Session 1B (Chair: Marie Chouleur, National Library of France):

Jefferson Bailey (Deputy chair of IIPC, Director of Web Archiving, Internet Archive): Advancing access and interface for research use of web archives

I would like to thank all of the organisers again. I’ll be giving a broad rather than deep overview of what the Internet Archive is doing at the moment.

For those that don’t know, we are a non-profit Digital Library and Archive founded in 1996. We work in a former church and it’s awesome – you are welcome to visit, and we do open public lunches every Friday if you are ever in San Francisco. We have lots of open source technology and we are very technology-driven.

People always ask about stats… We are at 30 Petabytes plus multiple copies right now, including 560 billion URLs, 280 billion webpages. We archive about 1 billion URLs per week, and have partners and facilities around the world, including here in the UK where we have Wellcome Trust support.

So, searching… This is the Way Back Machine. Most of our traffic – 75% – is automatically directed to the new service. So, if you search for, say, UK Parliament, you’ll see the screenshots, the URLs, and some statistics on what is there and captured. So, how does it work? With that much data to do full text search! Even the raw text (not HTML) is 3-5 PB. So, we figured the most instructive and easiest to work with text is the anchor text of all in-bound links to a homepage. The index text covers 443 million homepages, drawn from 900B in-bound links from other cross-domain websites. Is that perfect? No, but it’s the best that works on this scale of data… And people tend to make keyword type searches which this works for.
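As an aside from me rather than from the talk: the anchor text idea can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions, not the Internet Archive's implementation. The function names (`build_anchor_index`, `search`), the tuple layout of the links and the example URLs are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host part of a URL."""
    return urlparse(url).netloc.lower()


def build_anchor_index(links):
    """Map each anchor-text term to the target pages it points at, with counts.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    Only links whose source and target hosts differ are indexed, mirroring
    the "in-bound links from other websites" idea in the notes above.
    """
    index = defaultdict(lambda: defaultdict(int))
    for source, target, text in links:
        if host_of(source) == host_of(target):
            continue  # skip a site's links to itself
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term][target] += 1
    return index


def search(index, query):
    """Rank targets by how many in-bound anchor terms match the query."""
    scores = defaultdict(int)
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        for target, count in index.get(term, {}).items():
            scores[target] += count
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))


# Hypothetical example data, invented for illustration only.
links = [
    ("http://news.example.org/a", "http://www.parliament.uk/", "UK Parliament"),
    ("http://blog.example.net/p", "http://www.parliament.uk/", "Parliament website"),
    ("http://www.parliament.uk/about", "http://www.parliament.uk/", "home"),
    ("http://news.example.org/b", "http://www.bbc.co.uk/", "BBC News UK"),
]
index = build_anchor_index(links)
results = search(index, "uk parliament")
```

In this toy data the parliament homepage ranks first for "uk parliament" because two different external sites describe it with matching words, while its own "home" link is ignored.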

You can also now, in the new Way Back Machine, see a summary tab which includes a visualisation of data captured for that page, host, domain, MIME-type or MIME-type category. It’s really fun to play with. It’s really cool information to work with. That information is in the Way Back Machine (WBM) for 4.5 billion hosts; 256 million domains; 1238 TLDs. Also special collections that exist – building this for specific crawls/collections such as our .gov collection. And there is an API – so you can create your own visualisations if you like.

We have also created a full text search for AIT (Archive-It). This was part of a total rebuild of full text search in Elasticsearch. 6.5 billion documents with a 52 TB full text index. In total AIT is 23 billion documents and 1 PB. Searches are across all 8000+ collections. We have improved relevance ranking, metadata search, performance. And we have a Media Search coming – it’s still a test at present. So you can search non textual content with a similar process.

So, how can we help people find things better… search, full text search… And APIs. The APIs power the details charts, captures counts, year, size, new, domain/hosts. Explore that more and see what you can do. We’ve also been looking at Data Transfer APIs to standardise transfer specifications for web data exchange between repositories for preservation. For research use you can submit “jobs” to create derivative datasets from WARCs from specific collections. And it allows programmatic access to AIT WARCs, submission of job, job status, derivative results list. More at: https://github.com/WASAPI-Community/data-transfer-apis.

In other API news we have been working with WAT files – a sort of metadata file derived from a WARC. This includes Headers and content (title, anchor/text, metas, links). We have API access to some capture content – a better way to get programmatic access to the content itself. So we have a test build on a 100 TB WARC set (EOT). It’s like CDX API with a build – replays WATs not WARCs (see: http://vinay-dev.us.archive.org:8080/eot2016/20170125090436/http://house.gov/). You can analyse, for example, term counts across the data.
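Again as my own aside: to give a feel for what "term counts across the data" might look like, here is a minimal Python sketch. It assumes a deliberately simplified, hypothetical JSON record with just a `title` and a list of `links` (each with `url` and `text`). Real WAT records are more deeply nested, and this is not the API shown in the talk.

```python
import json
import re
from collections import Counter

# Hypothetical, simplified metadata records (one JSON document per capture).
# The field names here are assumptions made for this sketch.
SAMPLE_RECORDS = [
    json.dumps({
        "url": "http://house.gov/",
        "title": "House of Representatives",
        "links": [
            {"url": "http://house.gov/members", "text": "Find your representative"},
            {"url": "http://house.gov/committees", "text": "House committees"},
        ],
    }),
    json.dumps({
        "url": "http://senate.gov/",
        "title": "Senate",
        "links": [
            {"url": "http://house.gov/", "text": "The House"},
        ],
    }),
]


def term_counts(records):
    """Count lower-cased word terms in the titles and link texts of all records."""
    counts = Counter()
    for raw in records:
        record = json.loads(raw)
        texts = [record.get("title", "")]
        texts.extend(link.get("text", "") for link in record.get("links", []))
        for text in texts:
            counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


counts = term_counts(SAMPLE_RECORDS)
top_terms = counts.most_common(3)
```

The point of working from metadata records like this is that you never have to open the (much larger) archived page bodies to answer simple questions about titles and link text.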

In terms of analysing language we have a new CDX code to help identify languages. You can visualise this data, see the language of the texts, etc. A lot of our content right now is in English – we need less focus on English in the archive.

We are always interested in working with researchers on building archives, not just using them. So we are working on the News Measures Research Project. We are looking at 663 local news sites representing 100 communities. 7 crawls for a composite week (July-September 2016).

We are also working with a Katrina Blogs project: after the research was done and the project was published, we created a special collection of the sites used so that it can be accessed and explored.

And in fact we are generally looking at ways to create useful sub collections and ways to explore content. For instance Gif Cities is a way to search for gifs from Geocities. We have a Military Industrial Powerpoint Complex, turning PPT into PDFs and creating a special collection.

We did a new collection, with a dedicated portal (https://www.webharvest.gov) which archives US Congress for NARA. We capture this every 2 years, and it also raised questions of indexing YouTube videos.

We are also looking at historical ccTLD Wayback Machines. Built on IA global crawls and added historic web data with keyword and mime/format search, embed linkback, domain stats and special features. This gives a German view – from the .de domain – of the archive.

And we continue to provide data and datasets for people. We love Archives Unleashed – which ran earlier this week. We did an Obama Whitehouse data hackathon recently. We have a webinar on APIs coming very soon.

Q&A

Q1) What is anchor text?

A1) That’s when you create a link to a page – the text that is associated with that page.

Q2) If you are using anchor text in that keyword search… What happens when the anchor text is just a URL…

A2) We are tokenising all the URLs too. And yes, we are using a kind of PageRank type understanding of popular anchor text.
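As a rough illustration of what "tokenising URLs" can mean for keyword search, here is a minimal sketch. It is not the Internet Archive's implementation; the function name `tokenise_url` and the choice to split on every non-alphanumeric character are assumptions made purely for illustration.

```python
import re
from urllib.parse import urlsplit


def tokenise_url(url):
    """Split a URL into lowercase keyword tokens (hypothetical helper).

    Assumption: host, path and query are all treated as searchable text,
    and any run of non-alphanumeric characters is a token boundary.
    """
    parts = urlsplit(url)
    raw = f"{parts.netloc} {parts.path} {parts.query}"
    return [token.lower() for token in re.split(r"[^A-Za-z0-9]+", raw) if token]


if __name__ == "__main__":
    # When the anchor text is just a URL, its pieces still become search terms.
    print(tokenise_url("http://house.gov/about/our-history?page=2"))
```

With an approach like this, a link whose anchor text is only a URL still contributes words such as "house" and "history" to the index.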

Q3) On that TLD work… Do you plan to offer that to everyone who asks, for all top level domains?

A3) Yes! Because subsets are small enough that they allow search in a more manageable way… We basically build a new CDX for each of these…

Q4) What are issues you are facing with data protection challenges and archiving in the last few years… Concerns about storing data with privacy considerations.

A4) No problems for us. We operate as a library… The Wayback Machine is used in courts, but not by us – in US courts it’s recognised as a thing you can use in court.

Panel: Internet and Web Histories – Niels Brügger – Chair (NB); Marc Weber (MW); Steve Jones (SJ); Jane Winters (JW)

We are going to talk about the internet and the web, and also to talk about the new journal, Internet Histories, which I am editing. The new journal addresses what my colleagues and I saw as a gap. On the one hand there are journals like New Media and Society and Internet Studies which are great, but rarely focus on history. And media history journals are excellent but rarely look at web history. We felt there was a gap there… And Taylor & Francis Routledge agreed with us… The inaugural issue is a double issue 1-2, and people on our panel today are authors in our first journal, and we asked them to address six key questions from members of our international editorial board.

For this panel we will have an argument, counter statement, and questions from the floor type format.

A Common Language – Marc Weber

This journal has been a long time coming… I am Curatorial Director, Internet History Program, Computer History Museum. We have been going for a while now. This Internet History program was probably the first one of its kind in a museum.

When I first said I was looking at the history of the web in the mid ’90s, people were puzzled… Now most people have moved to incurious acceptance. Until recently there was also tepid interest from researchers. But in the last few years it has reached critical mass – and this journal is a marker of this change.

We have this idea of a common language, the sharing of knowledge. For a long time my own perspective was mostly focused on the web, it was only when I started the Internet History program that I thought about the fuller sweep of cyberspace. We come in through one path or thread, and it can be (too) easy to only focus on that… The first major networks, the ARPAnet was there and has become the internet. Telenet was one of the most important commercial networks in the 1970s, but who here now remembers Anne Reid of Telenet? [no-one] And by contrast, what about Vint Cerf [some]. However, we need to understand what changed, what did not succeed in the long term, how things changed and shifted over time…

We are kind of in the Victorian era of the internet… We have 170 years of telephones, 60 years of going on line… longer of imagining a connected world. Our internet history goes back to the 1840s and the telegraph. And a useful thought here, “The past isn’t over. It isn’t even past” William Faulkner. Of this history only small portions are preserved properly. Some of the risks of not having a collective narrative… And not understanding particular aspects in proper context. There is also scope for new types of approaches and work, not just applying traditional approaches to the web.

There is a risk of a digital dark age – we have a film to illustrate this at the museum although I don’t think this crowd needs persuading of the importance of preserving the web.

So, going forward… We need to treat history and preservation as something to do quickly, we cannot go back and find materials later…

Response – Jane Winters

Marc makes, I think convincingly, the case for a common language, and for understanding the preceding and surrounding technologies, why they failed and their commercial, political and social contexts. And I agree with the importance of capturing that history, with oral history a key means to do this. Secondly the call to look beyond your own interest or discipline – interdisciplinary research is always challenging, but in the best sense, and can be hugely rewarding when done well.

Understanding the history of the internet and its context is important, although I think we see too many comparisons with early printing. Although some of those views are useful… I think there is real importance in getting to grips with these histories now, not in a decade or two. Key decisions will be made, from net neutrality to mass surveillance, and right now the understanding and analysis of the issues is not sophisticated – such as the incompatibility of “back doors” and secure internet use. And as researchers we risk focusing on the content, not the infrastructure. I think we need a new interdisciplinary research network, and we have all the right people gathered here…

Q&A

Q1) Marc, as you are from a museum… Have you any thoughts about how you present the archived web, the interface between the visitor and the content you preserve.

A1) What we do now with the current exhibits… the star isn’t the objects, it is the screen. We do archive some websites – but don’t try to replicate the internet archive but we do work with them on some projects, including the GeoCities exhibition. When you get to things that require emulation or live data, we want live and interactive versions that can be accessed online.

Q2) I’m a linguist and was intrigued by the interdisciplinary collaboration suggested… How do you see linguists and the language of the web fitting in…

A2) Actually there is a postdoc – Naomi – looking at how different language communities in the UK have engaged through looking at the UK Web Archive, seeing how language has shaped their experience and change in moving to a new country. We are definitely thinking about this and it’s a really interesting opportunity.

Out from the PLATO Cave: Uncovering the pre-Internet history of social computing – Steve Jones, University of Illinois at Chicago

I think you will have gathered that there is no one history of the internet. PLATO was a space for education and for my interest it also became a social space, and a platform for online gaming. These uses were spontaneous rather than centrally led. PLATO was an acronym for Programmed Logic for Automatic Teaching Operations (see diagram in Ted Nelson’s Dream Machine publication and https://en.wikipedia.org/wiki/PLATO_(computer_system)).
There were two key interests in developing for PLATO – one was multi-player games, and the other was communication. And the latter was due to laziness… Originally the PLATO lab was in a large room, and we couldn’t be bothered to walk to each other’s desks. So “Talk” was created – and that saved standard messages so you didn’t have to say the same thing twice!

As time went on, I undertook undergraduate biology studies and engaged in the Internet and saw that interaction as similar… At that time data storage was so expensive that storing content in perpetuity seemed absurd… If it was kept it’s because you hadn’t got to writing it yet. You would print out code – then rekey it – that was possible at the time given the number of lines per programme. So, in addition to the materials that were missing… There were boxes of Ledger-size green bar print outs from a particular PLATO Notes group of developers. Having found this in the archive I took pictures to OCR – that didn’t work! I got – brilliantly and terribly – funding to preserve that text. That content can now be viewed side by side in the archive – images next to re-keyed text.

Now, PLATO wasn’t designed for lay users, it was designed for professionals although also used by university and high school students who had the time to play with it. So you saw changes between developer and community values, seeing development of affordances in the context of the discourse of the developers – that archived set of discussions. The value of that work is to describe and engage with this history not just from our current day perspective, but to understand the context, the people and their discourse at the time.

Response – Marc

PLATO sort of is the perfect example of a system that didn’t survive into the mainstream… Those communities knew each other, the idea of the flatscreen – which led to the laptop – came from PLATO. PLATO had a distinct messaging system, separate from the ARPAnet route. It’s a great corpus to see how this was used – were there flames? What does one-to-many communication look like? It is a wonderful example of the importance of preserving these different threads.. And PLATO was one of the very first spaces not full of only technical people.

PLATO was designed for education, and that meant users were mainly students, and that shaped community and usage. There was a small experiment with community time sharing memory stores – with terminals in public places… But PLATO began in the late ’60s and ran through into the 80s, it is the poster child for preserving earlier systems. PLATO notes became Lotus Notes – that isn’t there now but in its own domain, PLATO was the progenitor of much of what we do with education online now, and that history is also very important.

Q&A

Q1) I’m so glad, Steve, that you are working on PLATO. I used to work in Medical Education in Texas and we had PLATO terminals to teach basic science first and second year medical education students and ER simulations. And my colleagues and I were taught computer instruction around PLATO. I am interested that you wanted to look at discourse around UIC around PLATO – so, what did you find? I only experienced PLATO at the consumer end of the spectrum, so I wondered what the producer end was like…

A1) There are a few papers on this – search for it – but two basic things stand out… (1) the degree to which as a mainframe system PLATO was limited as a system, and the conflict between the systems people and the gaming people. The gaming used a lot of the capacity, and although that taxed the system it did also mean they developed better code, showed what PLATO was capable of, and helped with the case for funding and support. So it wasn’t just shut PLATO down, it was a complex 2-way thing; (2) the other thing was around the emergence of community. Almost anyone could sit at a terminal and use the system. There were occasional flare ups and they mirrored community responses even later around flamewars, competition for attention, community norms… Hopefully others will mine that archive too and find some more things.

Digital Humanities – Jane Winters

I’m delighted to have an article in the journal, but I won’t be presenting on this. Instead I want to talk about digital humanities and web archives. There is a great deal of content in web archives but we still see little research engagement in web archives, there are numerous reasons including the continuing work on digitised traditional texts, and slow movement to develop new ways to research. But it is hard to engage with the history of the 21st century without engaging with the web.

The mismatch of the value of web archives and the use and research around the archive was part of what led us to set up a project here in 2014 to equip researchers to use web archives, and encourage others to do the same. For many humanities researchers it will take a long time to move to born-digital resources. And to engage with material that subtly differs for different audiences. There are real challenges to using this data – web archives are big data. As humanities scholars we are focused on the small, the detailed, we can want to filter down… But there is room for a macro historical view too. What Tim Hitchcock calls the “beautiful chaos?” of the web.

Exploring the wider context one can see change on many levels – from the individual person or business, to wide spread social and political change. How the web changes the language used between users and consumers. You can also track networks, the development of ideas… It is challenging but also offers huge opportunities. Web archives can include newspapers, media, and direct conversation – through social media. There is also visual content, gifs… The increase in use of YouTube and Instagram. Much of this sits outside the scope of web archives, but a lot still does make it in. And these media and archiving challenges will only become more challenging as we see more data… The larger and more uncontrolled the data, the harder the analysis. Keyword searches are challenging at scale. The selection of the archive is not easily understood but is important.

The absence of metadata is another challenge too. The absence of metadata or alternative text can render images, particularly, invisible. And the mix of formats and types of personal and the public is most difficult but also most important. For instance the announcement of a government policy, the discussion around it, a petition perhaps, a debate in parliament… These are not easy to locate… Our histories are almost inherently online… But they only gain any real permanence through preservation in web archives, and that’s why humanists and historians really need to engage with them.

Response – Steve

I particularly want to talk about archiving in scholarship. In order to fit archiving into scholarly models… administrators increasingly make the case for scholarship in the context of employment and value. But archive work is important. Scholars are discouraged from this sort of work because it is not quick, it’s harder to be published… Separately you need organisations to engage in preservation of their online presences. The degree to which archive work is needed is not reflected by promotion committees, organisational support, local archiving processes. There are immense rhetorical challenges here, to persuade others of the value of this work. There had been successful cases made to encourage telephone providers to capture and share historical information. I was at a telephone museum recently and asked about the archive… She handed me a huge book on the founding of Southwestern Bell, published in a very small run… She gave me a copy but no-one had asked about this before… That’s wrong though, it should be captured. So we can do some preservation work ourselves just by asking!

Q&A

Q1) Jane, you mentioned a skills gap for humanities researchers. What sort of skills do they need?

A1) I think the complete lack of quantitative data training, how to sample, how to make meaning from quantitative data. They have never been engaged in statistical training. They have never been required to do it – you specialise so early here. Also, basic command line stuff… People don’t understand that or why they have to engage that way. Those are two simple starting points. Those help them understand what they are looking at, what an ngram means, etc.

Session 2B (Chair: Tom Storrar)

Philip Webster, Claire Newing, Paul Clough & Gianluca Demartini: A temporal exploration of the composition of the UK Government Web Archive

I’m afraid I’ve come into this session a little late. I have come in at the point that Philip and Claire are talking about the composition of the archive – mostly 2008 onwards – and looking at status codes of UK Government Web Archive. 

Philip: The hypothesis for looking at http status codes was to see if changes in government raised trends in the http status code. Actually, when we looked at post-2008 data we didn’t see what we expected there. However we did find that there was an increase in not finding what was requested – and thought this may be about moving to dynamic pages – but this is not a strong trend.

In terms of MIME types – media types – which are restricted to:

Application – flash, java, Microsoft Office Documents. Here we saw trends away from PDF as the dominant format. Microsoft word increases, and we see the increased use of Atom – syndication – coming across.

Executable – we see quite a lot of javascript. The importance of flash decreased over time – which we expected – and there was an increase in javascript (javascript and javascript x).

Document – PDF remains prevalent. Also MS Word, some MS Excel. Open formats haven’t really taken hold…

Claire: The Government Digital Strategy included guidance to use open document formats as much as possible, but that wasn’t mandated until late 2014 – a bit too late for our data set unfortunately. But the Government Digital Strategy in 2011 was, itself, published in Word and PDF!

Philip: If we take document type outside of PDFs you see that lack of open formats more clearly..

Image – This includes images appearing in documents, plus icons. And occasionally you see non-standard media types associated with the MIME-types. Jpegs are fairly consistent over time. Gif and Png are comparable… Gif was being phased out for IP reasons, with Png to replace it, and you see that change over time…

Text – Text is almost all HTML. You see a lot of plain text, stylesheets, XML…

Video – we saw compressed video formats… but gradually superseded with embedded YouTube links. However we do still see a lot of flash video retained. And we see a large, increasing amount of MP4, used by Apple devices.

Another thing that is available over time is relative file sizes. However the CDX index only contains compressed size data and therefore is not a true representation of file size trends. So you can’t compare images to their pre-archiving version. That means for this work we’ve limited the data set to those where you can tell the before and after status of the image files. We saw some spikes in compressed image formats over time, not clear if this shows departmental issues…

To finish on a high note… There is an increase in the use of https rather than http. I thought it might be the result of a campaign, but it seems to be a general trend..

The conclusion… Yes, it is possible to do temporal analysis of CDX index data but you have to be careful, looking at proportion rather than raw frequency. SQL is feasible, commonly available and low cost. Archive data has particular weaknesses – data cannot be assumed to be fully representative, but in some cases trends can be identified.
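To make the "proportion rather than raw frequency" point concrete, here is a small sketch in Python. The speakers used SQL; Python is swapped in here only for illustration. It assumes space-separated CDX lines in which the second field is a 14-digit timestamp and the fourth field is the MIME type, which is a common CDX layout but should be checked against the index actually in use. The sample records and the function name `mime_proportions_by_year` are invented.

```python
from collections import Counter, defaultdict


def mime_proportions_by_year(cdx_lines):
    """Return {year: {mime: share of that year's captures}}.

    Assumption: fields are space separated, field 2 is a timestamp
    (YYYYMMDDhhmmss) and field 4 is the MIME type.
    """
    counts = defaultdict(Counter)
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip header or malformed lines
        year, mime = fields[1][:4], fields[3]
        counts[year][mime] += 1
    proportions = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        proportions[year] = {mime: n / total for mime, n in counter.items()}
    return proportions


# Invented sample records, for illustration only.
SAMPLE = [
    "uk,gov,example)/a 20090101000000 http://example.gov.uk/a application/pdf 200 X - - 10 0 f.warc.gz",
    "uk,gov,example)/b 20090201000000 http://example.gov.uk/b application/pdf 200 X - - 10 0 f.warc.gz",
    "uk,gov,example)/c 20090301000000 http://example.gov.uk/c application/pdf 200 X - - 10 0 f.warc.gz",
    "uk,gov,example)/d 20090401000000 http://example.gov.uk/d text/html 200 X - - 10 0 f.warc.gz",
    "uk,gov,example)/e 20100101000000 http://example.gov.uk/e text/html 200 X - - 10 0 f.warc.gz",
]

if __name__ == "__main__":
    for year, shares in sorted(mime_proportions_by_year(SAMPLE).items()):
        print(year, shares)
```

Working with shares per year means that a year with a much bigger crawl does not look like a surge in any one format.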

Q&A

Q1) Very interesting, thank you. Can I understand… You are studying the whole archive? How do you take account of having more than one copy of the same data over time?

A1) There is a risk of one website being overrepresented in the archive. There are checks that can be done… But that is more computationally expensive…

Q2) With the seed list, is that generating the 404 rather than actual broken links?

A2 – Claire) We crawl by asking the crawler to go out to find links and seed from that. It generally looks within the domain we’ve asked it to capture…

Q3) At various points you talked about peaks and trends… Have you thought about highlighting that to folks who use your archive so they understand the data?

A3 – Claire) We are looking at how we can do that more. I have read about historians’ interest in the origins of the collection, and we are thinking about this, but we haven’t done that yet.

Caroline Nyvang, Thomas Hvid Kromann & Eld Zierau: Capturing the web at large – a critique of current web citation practices

Caroline: We are all here as we recognise the importance and relevance of internet research. Our paper looks at web referencing and citation within the sciences. We propose a new format to replace the URL+date format usually recommended. We will talk about a study of web references in 35 Danish master’s theses from the University of Copenhagen, then further work on monograph referencing, then a new citation format.

The work on 35 masters theses submitted to Copenhagen university, included, as a set: 899 web references, there was an average of 26.4 web references – some had none, the max was 80. This gave us some insight into how students cite URL. Of those students citing websites: 21% gave the date for all links; 58% had dates for some but not all sites; 22% had no dates. Some of those URLs pointed to homepages or search results.

We looked at web rot and web references – almost 16% could not be accessed by the reader, checked or reproduced. An error rate of 16% isn’t that remarkable – in 1992 a study of 10 journals found that a third of references were inaccurate enough to make it hard to find the source again. But web resources are dynamic and issues will vary, and likely increase over time.

The amount of web references does not seem to correlate with particular subjects. Students are also quite imprecise when they reference websites. And even when the correct format was used 15.5% of all the links would still have been dead.

Thomas: We looked at 10 Danish academic monographs published from 2010-2016. Although this is a small number of titles, it allowed us to see some key trends in the citation of web content. There was a wide range of number of web citations used – 25% at the top, 0% at the bottom of these titles. Location of web references in these texts are not uniform. On the whole scholars rely on printed scholarly work… But web references are still important. This isn’t a systematic review of these texts… In theory these links should all work.

We wanted to see the status after five years… We used a traffic light system. 34.3% were red – broken, dead, a different page; 20?% were amber – critical links that either refer to changed or at risk material; 44.7% were green – working as expected.

This work showed that web references turn into dead links within a limited number of years. In our work the URLs that go to the front page, with instructions of where to look, actually, ironically, lasted best. Long complex URLs were most at risk… So, what can we do about this…

Eld: We felt that we had to do something here, to address what is needed. We can see from the studies that today’s practice of URL and date stamp does not work. We need a new standard, a way to reference something stable. The web is a marketplace and changes all the time. We need to look at the web archives… And we need precision and persistency. We felt there were four necessary elements, and we call it the PWID – Persistent Web IDentifier. The four elements are:

  • Archived URL
  • Time of archiving
  • Web archive – precision and indication that you verified this is what you expect. Also persistency. Researcher has to understand that – is it a small or large archive, what is contextual legislation.
  • Content coverage specification – is it part only? Is it the html? Is it the page including images as it appears in your browser? Is it a page? Is it the site including referred pages within the domain?

So we propose a form of reference which can be textually expressed as:

web archive: archive.org, archiving time: 2016-04-20 18:21:47 UTC, archived URL: http://resaw.en/, content coverage: webpage

But, why not use web archive URL? Of the form:

https://web.archive.org/web/20160420182147/http://resaw.en/

Well, this can be hard to read, there is a lot of technology embedded in the URL. It is not as accessible.

So, a PWID URI:

pwid:archive.org:2016-04-20_18.21.47Z:page:http://resaw.en/
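To show how the four elements fit together, here is a minimal sketch that builds and parses a PWID in the textual form shown above. It follows only the single example given in the talk, so the timestamp pattern, the helper names (`build_pwid`, `parse_pwid`, `wayback_url`) and the conversion to a Wayback-style replay URL are assumptions, not part of any published specification.

```python
from datetime import datetime

# Assumption: timestamp layout inferred from the one example in the talk.
PWID_TIME = "%Y-%m-%d_%H.%M.%SZ"


def build_pwid(archive, archived_at, coverage, url):
    """Join archive, archiving time (UTC), coverage and URL into a PWID string."""
    return f"pwid:{archive}:{archived_at.strftime(PWID_TIME)}:{coverage}:{url}"


def parse_pwid(pwid):
    """Split a PWID into its parts; the URL is last, so it may contain colons."""
    scheme, archive, stamp, coverage, url = pwid.split(":", 4)
    if scheme != "pwid":
        raise ValueError("not a PWID: " + pwid)
    return {
        "archive": archive,
        "time": datetime.strptime(stamp, PWID_TIME),
        "coverage": coverage,
        "url": url,
    }


def wayback_url(pwid):
    """Hypothetical resolver: only meaningful when the archive is archive.org."""
    parts = parse_pwid(pwid)
    stamp = parts["time"].strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{parts['url']}"


if __name__ == "__main__":
    pwid = build_pwid("archive.org", datetime(2016, 4, 20, 18, 21, 47), "page", "http://resaw.en/")
    print(pwid)
    print(wayback_url(pwid))
```

Putting the URL last is what lets a simple split with a fixed number of separators work, since the archived URL itself contains a colon.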

This is now in as an ISO 690 suggestion and proposed as a URI type.

To sum up, all research fields need to refer to the web. Good scientific practice cannot take place with current approaches.

Q&A

Q1) I really enjoyed your presentation… I was wondering what citation format you recommend for content behind paywalls, and for dynamic content – things that are not in the archive.

A1 – Eld) We have proposed this for content in the web archive only. You have to put it into an archive to be sure, then you refer to it. But we haven’t tried to address those issues of paywall and dynamic content. BUT the URI suggestion could refer to closed archives too, not just open archives.

A1 – Caroline) We also wanted to note that this approach is to make web citations align with traditional academic publication citations.

Q2) I think perhaps what you present here is an idealised way to present archiving resources, but what about the marketing and communications challenge here – to better cite websites, and to use this convention when they aren’t even using best practice for web resources.

A2 – Eld) You are talking about marketing to get people to use this, yes? We are starting with the ISO standard… That’s one aspect. I hope also that this event is something that can help promote this and help to support it. We hope to work with different people, like you, to make sure it is used. We have had contact with Zotero for instance. But we are a library… We only have the resources that we have.

Q3) With some archives of the web there can be a challenge for students, for them to actually look at the archive and check what is there..

A3) Firstly citing correctly is key. There are a lot of open archives at the moment… But we hope the next step will be more about closed archives, and ways to engage with these more easily, to find common ground, to ensure we are citing correctly in the first place.

Comment – Nicola Bingham, BL) I like the idea of incentivising not just researchers but also publishers to incentivise web archiving, another point of pressure to web archives… And making the case for openly accessible articles.

Q4) Have you come across Martin Klein and Herbert Van de Sompel’s work on robust links, and Memento?

A4 – Eld) Memento is excellent to find things, but usually you do not have the archive in there… I don’t think the way of referencing without the archive is a precise reference…

Q5) When you compare to web archive URL, it was the content coverage that seems different – why not offer as an incremental update.

A5) As far as I know that relies on using a # in the URL and that doesn’t offer that specificity…

Comment) I would suggest you could define the standard for after that # in the URLs to include the content coverage – I’ll take that offline.

Q6) Is there a proposal there… For persistence across organisations, not just one archive.

A6) I think from my perspective there should be a registry when archives change/move to find the new registry. Our persistent identifier isn’t persistent if you can change something. And I think archives must be large organisations, with formal custodians, to ensure it is persistent.

Comment) I would like to talk offline about content addressing and Linked Data to directly address and connect to copies.

Andrew Jackson: The web archive and the catalogue

I wanted to talk about some bad experiences I had recently… There is a recent BL video of the journey of a (print) collection item… From posting to processing, cataloguing, etc… I have worked at the library for over 10 years, but this year for the first time I had to get to grips with the library catalogue… I’ll talk more about that tomorrow (in the technical strand) but we needed to update our catalogue… Accommodating the different ways the catalogue and the archive see content.

Now, that video, the formation of teams, the structure of the organisations, the physical structure of our building is all about that print process, and that catalogue… So it was a surprise for me – maybe not you – that the catalogue isn’t just bibliographic data, it’s also a workflow management tool…

There is a chain of events here… Sometimes events are in a line, sometimes in circles… Always forwards…

Now, last year legal deposit came in for online items… The original digital processing workflow went from acquisition to ingest to cataloguing… But most of the content was already in the archive… We wanted to remove duplication, and make the process more efficient… So we wanted to automate this as a harvesting process.

For our digital work previously we also had a workflow, from nomination, to authorisation, etc… With legal deposit we have to get it all, all the time, all the stuff… So, we don’t collect news items, we want all news sites every day… We might specify crawl targets, but more likely that we’ll see what we’ve had before and draw them in… But this is a dynamic process….

So, our document harvester looks for “watched targets”, harvests, extracts documents for web archiving… and also ingest. There are relationships to acquisition, that feeds into cataloguing and the catalogue. But that is an odd mix of material and metadata. So that’s a process… But webpages change… For print matter things change rarely, it is highly unusual. For the web changes are regular… So how do we bring these things together…

To borrow an analogy from our Georeferencing project… Users engage with an editor to help us understand old maps. So, imagine a modern web is a web archive… Then you need information, DOIs, places and entities – perhaps a map. This kind of process allows us to understand the transition from print to online. So we think about this as layers of transformation… Where we can annotate the web archive… Or the main catalogue… That can be replaced each time this is needed. And the web content can, with this approach, be reconstructed with some certainty, later in time…

Also this approach allows us to use rich human curation to better understand that which is being automatically catalogued and organised.

So, in summary: the catalogue tends to focus on chains of operation and backlogs, item by item. The web archive tends to focus on transformation (and re-transformation) of data. Layered data model can bring them together. Means revisiting the data (but fixity checking requires this anyway). It’s costly in terms of disk space required. And it allows rapid exploration and experimentation.

Q1) To what extent is the drive for this your users, versus your colleagues?

A1) The business reason is that it will save us money… Taking away manual work. But, as a side effect we’ve been working with cataloguing colleagues in this area… And their expectations are being raised and changed by this project. I do now much better understand the catalogue. The catalogue tends to focus on tradition not output… So this project has been interesting from this perspective.

Q2) Are you planning to publish that layer model – I think it could be useful elsewhere?

A2) I hope to yes.

Q3) And could this be used in Higher Education research data management?

A3) I have noticed that with research data sets there are some tensions… Some communities use change management, functional programming etc… Hadoop, which we use, requires replacement of data… So yes, but this requires some transformation to do.

We’d like to use the same based data infrastructure for research… Otherwise had to maintain this pattern of work.

Q4) Your model… suggests WARC files and such archive documents might become part of new views and routes in for discovery.

A4) That’s the idea, for discovery to be decoupled from where you store the file.

Nicola Bingham, UK Web Archive: Resource not in archive: understanding the behaviour, borders and gaps of web archive collections

I will describe the shape and the scope of the UK Web Archive, to give some context for you to explore it… By way of introduction.. We have been archiving the UK Web since 2013, under UK non-print legal deposit. But we’ve also had the Open Archive (since 2004); Legal Deposit Archive (since 2013); and the Jisc Historical Archive (1996-2013).

The UK Web Archive includes around 400 TB of compressed data. And in the region of 11-12 billion records. We grow, on average, by 60-70 TB and 3 billion records per year. We want to be comprehensive but, that said, we can’t collect everything and we don’t want to collect everything… Firstly we collect UK websites only. We carry out web archiving under the 2013 regulations, and they state that we may collect only UK published web content – meaning content on a UK web domain, or by a person whose work occurs in the UK. So, we can automate harvesting from UK TLDs (.uk, .scot, .cymru etc.) and UK hosting – using a geo-IP lookup to locate the server. Then there are manual checks. So Facebook, WordPress, Twitter cannot be automated…
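As a rough illustration of the automated part of that scoping, here is a minimal Python sketch that sorts URLs into "can be harvested automatically" and "needs a manual check" by top-level domain alone. The function names and the short TLD list are assumptions for illustration (only the TLDs named in the talk are included), and the geo-IP hosting check is left out entirely. It is not the UK Web Archive's actual tooling.

```python
from urllib.parse import urlsplit

# Assumption: an illustrative subset of UK top-level domains, taken from the talk.
UK_TLDS = (".uk", ".scot", ".cymru")


def in_automatic_scope(url: str) -> bool:
    """Return True when the URL's host name ends in one of the listed UK TLDs."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith(UK_TLDS)


def triage(urls):
    """Split URLs into (automatic, manual) lists based on the TLD test only."""
    automatic, manual = [], []
    for url in urls:
        (automatic if in_automatic_scope(url) else manual).append(url)
    return automatic, manual


if __name__ == "__main__":
    auto, manual = triage([
        "https://www.bl.uk/collection",
        "https://example.wordpress.com/blog",
    ])
    print("automatic:", auto)
    print("manual check:", manual)
```

Anything on a platform such as WordPress or Twitter falls into the manual list here, which matches the point that those sites cannot be scoped automatically.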

We only collect published content. Out of scope here are:

  • Film and recorded sound where AV content predominates, e.g. YouTube
  • Private intranets and emails.
  • Social networking sites only available to restricted groups – if you need a login or special permissions they are out of scope.

Web archiving is expensive. We have to provide good value for money… We crawl the UK domain on an annual basis (only). Some sites are crawled more frequently, but an annual crawl misses a lot. We cap domains at 512 MB – which captures many sites in their entirety, but there are others that we only capture part of (unless we override automatic settings).

There are technical limitations too, around:

  • Database driven sites – crawlers struggle with these
  • Programming scripts
  • Plug-ins
  • Proprietary file formats
  • Blockers – robots.txt or access denied.

So there are misrepresentations… For instance the One Hundred Women blog captures the content but not the stylesheet – that’s a fairly common limitation.

We also have curatorial input to locate the “important stuff”. In the British Library web archiving is not performed universally by all curators, we rely on those who do engage, usually voluntarily. We try to onboard as many curators and specialist professionals as possible to widen coverage.

So, I’ve talked about gaps and boundaries, but I also want to talk about how the users of the archive find this information, so that even where there are gaps, it’s a little more transparent…

We have the Collection Scoping Document, which captures the scope, motivation, parameters and timeframe of a collection. This document could, in a pared-down form, be made available to end users of the archive.

We have run user testing of our current UK Web Archive website, and our new version. And even more general audiences really wanted as much contextual information as possible. That was particularly important on our current website – where we only shared permission-cleared items. But this is one way in which contextual information can be shown in the interface with the collection.

The metadata can be browsed and searched, though users will be directed to come in to view the content.

So, an example of a collection would be 1000 Londoners, showing the context of the work.

We also gather information during the crawling process… We capture information on crawler configuration, seed list, exclusions… I understand this could be used and displayed to users to give statistics on the collection…

So, what do we know about what the researchers want to know? They want as much documentation as they possibly can. We have engaged with the research community to understand how best to present data to the community. And indeed that’s where your feedback and insight is important. Please do get in touch.

Q&A

Q1) You said you only collect “published” content… How do you define that?

A1) With legal deposit regulations… The legal deposit libraries may collect content openly available on the web… and also content that is paywalled or behind login credentials. UK publishers are obliged to provide credentials for crawling. BUT how we make that accessible… Is a different matter – we wouldn’t republish that on the open web without logins/credentials.

Q2) Do you have any ideas about packaging this type of information for users and researchers – more than crawler config files?

A2) The short answer is no… We’d like to invite researchers to access the collection in both a close reading sense, and a big data sense… But I don’t have that many details about that at the moment.

Q3) A practical question: if you know you have to collect something… If you have a web copy of a government publication, say, and the option of the original, older, (digital) document… Is the web archive copy enough, do you have the metadata to use that the right way?

A3) Yes, so on the official publications… This is where the document harvester tool comes into play, adding another layer of metadata to pass the document through various access elements appropriately. We are still dealing with this issue though.

Chris Wemyss – Tracing the Virtual community of Hong Kong Britons through the archived web

I’ve joined this a wee bit late after a fun adventure on the Senate House stairs… 

Looking at the Gwulo: Old Hong Kong site… User content is central to this site, which is centred on a collection of old photographs, buildings, people, landscapes… The website starts to add features to explore categorisations of images… And the site is led by an older British resident. He described subscribers as expats who have moved away, for whom this is an old version of Hong Kong that no longer exists – one user described it as an interactive photo album… There is clearly more to be done on this phenomenon of building these collective resources to construct this type of place. The founder comments on Facebook groups – they are about the now, “you don’t build anything, you just have interesting conversations”.

A third example then, Swire Mariners Association. This site has been running, nearly unchanged, for 17 years, but they have a very active forum, a very active Facebook group. These are all former dockyard workers, they meet every year, it is a close knit community but that isn’t totally represented on the web – they care about the community that has been constructed, not the website for others.

So, in conclusion archives are useful in some cases. Using oral history and web archives together is powerful, however, where it is possible to speak to website founders or members, to understand how and why things have changed over time. Seeing that change over time already gives some idea of the futures people want to see. And these sites indicate the demand for communities, active societies, long after they are formed. And illustrates how people utilise the web for community memory…

Q&A

Q1) You’ve raised a problem I hadn’t really thought about. How can you tell if they are more active on Facebook or the website… How do you approach that?

A1) I have used web archiving as one source to arrange other things around… Looking for new websites, finding and joining the Facebook group, finding interviewees to ask about that. But I wouldn’t have been prompted to ask about the website and its change/lack of change without consulting the web archives.

Q2) Were participants aware that their pages were in the archive?

A2) No, not at all. The blog I showed first was started by two guys, Gwulo is run by one guy… And he quite liked the idea that this site would live on in the future.

David Geiringer & James Baker: The home computer and networked technology: encounters in the Mass Observation Project archive, 1991-2004

I have been doing web research on various communities, including some work on GeoCities which is coming out soon… And I heard about the Mass Observation project which, from 1991 – 2004, asked people about computers and how they were using them in their lives… The archives capture comments like:

“I confess that sometimes I resort to using the computer using the cut and paste technique to write several letters at once”

Confess is a strong word there.. Over this period of observation we saw production of text moving to computers, computers moving into most homes, the rebuilding of modernity. We welcome comment on this project, and hope to publish soon where you can find out more on our method and approach.

So, each year since 1981 the Mass Observation Project has issued directives to respondents to respond to key issues like football, or the AIDS crisis. They issued the technology directive in 1991. From that year we see several fans of the word processor – words like love, dream… Responses to the 1991 directive are overwhelmingly positive… Something that was not the case for other technologies on the whole…

“There is a spell check on this machine. Also my mind works faster than my hand and I miss out letters. This machine picks up all my faults and corrects them. Thank you computer.”

After this positive response though we start to see etiquette issues, concerns about privacy… Writing some correspondence by hand. Some use simulated hand writing… And start to have concerns about adapting letters, whether that is cheating or not… Ethical considerations appearing.. It is apparent that sometimes guilt around typing text is also slightly humorous… Some playful mischief there…

Altering the context of the issue of copy and paste… the time and effort to write a unique manuscript is a concern… Interestingly the directive asked about printing and filing emails… And one respondent notes that actually it wasn’t financial or business records, but emails from their ex…

Another comments that they wish they had printed more emails during their pregnancy, a way of situating yourself in time and remembering the experience…

I’m going to skip ahead to how computers fitted into their home… People talk about dining rooms, and offices, and living rooms.. Lots of very specific discussions about where computers are placed and why they are placed there… One person comments:

“Usually at the dining room at home which doubles as our office and our coffee room”

Others talk about quieter spaces… The positioning of a computer seems to create some competition for use of space. The home changing to make room for the computer or the network… We also start to see (in 2004) comments about home life and work life, the setting up of a hotmail account as a subtle act of resistance, the reassertion of the home space.

A Mass Observation Directive in 1996 asked about email and the internet:

“Internet – we have this at work and it’s mildly useful. I wouldn’t have it at home because it costs a lot to be quite sad and sit alone at home” (1996)

So, observers from 1991-2004 talked about efficiencies of the computer and internet, copy, paste, ease… But this then reflected concerns about the act of creating texts, of engaging with others, computers as changing homes and spaces. Now, there are really specific findings around location, gender, class, age, sexuality… The overwhelming majority of respondents are white middle class cis-gendered straight women over 50. But we do see that change of response to technology, a moment in time, from positive to concerned. That runs parallel to the rise of the World Wide Web… We think our work does provide context to web archive work and web research, with textual production influenced by these wider factors.

Q&A

Q1) I hadn’t realised mass observation picked up again in 1980. My understanding was that previously it was the observed, not the observers. Here people report on their own situations?

A1) They self report on themselves. At one point they are asked to draw their living room as well…

Q1) I was wondering about business machinery in the home – typewriters for instance?

A1) I don’t know enough about the wider archive. All of this newer material was done consistently… The older mass observation material was less consistent – people recorded on the street, or notes made in pubs. What is interesting is that in the newer responses you see a difference in the writing of the response… As they move from handwritten to typewriter to computer…

Q2) Partly you were talking about how people write and use computers. And a bit about how people archive themselves… But the only paper I could find on how people archive themselves digitally was by Microsoft Research… Is there anything since then? In that paper though you could almost read regret between the lines… the loss of photo albums, letters, etc…

A2) My colleague David Geiringer, with whom I co-wrote the paper, was initially looking at self-archiving. There was very very little. But printing stuff comes up… And the tensions there. There is enough there, people talking about worries and loss… There is lots in there… The great thing with Mass Obs is that you can have a question but then you have to dig around a lot to find things…

Ian Milligan, University of Waterloo and Matthew Weber, Rutgers University – Archives Unleashed 4.0: presentation of projects (#hackarchives)

Ian: I’m here to talk about what happened on the first two days of Web Archiving Week. And I’d like to thank our hosts, supporters, and partners for this exciting event. We’ll do some lightning talks on the work undertaken… But why are historians organising data hackathons? Well, because we face problems in our popular cultural history. Problems like GeoCities… Kids write about Winnie the Pooh, people write about the love of Buffy the Vampire Slayer, their love of cigars… We face a problem of huge scale… 7 million users of the web now online… It’s the scale that boggles the mind, and compare it to the Old Bailey – one of very few sources on ordinary people. They leave birth, death, marriage or criminal justice records… The 197,745 trials over 239 years, between 1674 and 1913, are the biggest collection of texts about ordinary people… But from 7 years of GeoCities we have 413 million web documents.

So, we have a problem, and myself, Matt and Olga from the British Library came together to build community, to establish a common vision of web archiving documents, to find new ways of addressing some of these issues.

Matt: I’m going to quickly show you some of what we did over the last few days… and the amazing projects created. I’ve always joked that Archives Unleashed is letting folk run amok to see what they can do… We started around 2 years ago, in Toronto, then Library of Congress, then at Internet Archive in San Francisco, and we stepped it up a little for London! We had the most teams, we had people from as far as New Zealand.

We started with some socialising in a pub on Friday evening, so that when we gathered on Monday we’d already done some introductions. Then a formal overview and quickly forming teams to work and develop ideas… And continuing through day one and day two… We ended up with 8 complete projects:

  • Robots in the Archives
  • US Elections 2008 and 2010 – text and keyword analysis
  • Study of Gender Distribution in Olympic communities
  • Link Ranking Group
  • Intersection Analysis
  • Public Inquiries Implications (Shipman)
  • Image Search in the Portuguese Web Archive
  • Rhizome Web Archive Discovery Archive

We will hear from the top three from our informal voting…

Intersection Analysis – Jess

We wanted to understand how we could find a cookbook methodology for understanding the intersections between different data sets. So, we looked at the Occupy Movement (2011/12) with a Web Archive, a Rutgers archive and a social media archive from one of our researchers.

We normalised CDX, crunched WAT for outlinks and extracted links from tweets. We generated counts and descriptive data, union/intersection between every data set. We had over 74 million datasets, but only 0.3% overlap between the collections… If you go to our website we have a visualisation of overlaps, tree maps of the collections…

We wanted to use the WAT files to explore Outlinks in the data sets, what they were linking to, how much of it was archived (not a lot).

Parting thoughts? Overlap is inversely proportional to the diversity of URIs – in other words, the more collectors, the better. Diversifying seed lists with social media is good.
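To make the union/intersection step concrete, here is a minimal Python sketch of comparing collections pairwise. It is not the team's code: the function names are hypothetical, and the URL normalisation (dropping the scheme, a leading "www.", trailing slashes and any query string) is an assumption chosen to keep the example short. The overlap is reported as shared URLs divided by the union of the two collections.

```python
from itertools import combinations
from urllib.parse import urlsplit


def normalise(url: str) -> str:
    """Reduce a URL to host + path so trivially different forms compare equal.

    Assumption: scheme, leading 'www.', trailing slash and query are ignored.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def overlap_report(collections):
    """Return {(name_a, name_b): shared / union} for every pair of collections."""
    sets = {name: {normalise(u) for u in urls} for name, urls in collections.items()}
    report = {}
    for a, b in combinations(sorted(sets), 2):
        union = sets[a] | sets[b]
        shared = sets[a] & sets[b]
        report[(a, b)] = len(shared) / len(union) if union else 0.0
    return report


if __name__ == "__main__":
    # Hypothetical sample data, standing in for CDX records and tweet links.
    sample = {
        "web_archive": ["http://www.occupy.example/a/", "http://occupy.example/b"],
        "tweets": ["https://occupy.example/a", "http://other.example/c"],
    }
    for pair, ratio in overlap_report(sample).items():
        print(pair, f"{ratio:.1%}")
```

How aggressively to normalise is a judgment call: stricter matching lowers the measured overlap, looser matching raises it.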

Robots in the Archive 

We focused on robots.txt. And our question was “what do we miss when we respect robots.txt?”. At National Library of Denmark we respect this… At Internet Archive they’ve started to ignore that in some contexts. So, what did we do? We extracted robots.txt from the WARC collection. Then applied it retroactively. Then we wanted to compare to the link graph.

Our data was from The National Archives and from the 2010 election. We started by looking at user-agent blocks. Four had specifically blocked the internet archive, but some robot names were very old and out of date.. And we looked at crawl delay… Looking specifically at the sub collection of the department for energy and climate change… We would have missed only 24 links that would have been blocked…

So, the impact of robots.txt is minimal for this collection. Our method can be applied to other collections and extended to further the discussion on ignoring robots.txt. And our code is on GitHub.
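The idea of applying robots.txt retroactively can be sketched with Python's standard library. This is an illustration, not the team's GitHub code: the function name, the sample rules and the sample URLs are all hypothetical. Note that `urllib.robotparser` matches on the path only, so the caller has to pair each robots.txt with links from the same host.

```python
from urllib.robotparser import RobotFileParser


def blocked_links(robots_txt: str, links, user_agent: str = "*"):
    """Return the links that the given robots.txt would have disallowed.

    Assumption: every link belongs to the host the robots.txt came from.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in links if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical robots.txt with one crawler-specific block and one general rule.
    robots = (
        "User-agent: ia_archiver\n"
        "Disallow: /\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /private/\n"
    )
    archived = [
        "http://example.gov.uk/private/report.pdf",
        "http://example.gov.uk/public/index.html",
    ]
    print("generic crawler would miss:", blocked_links(robots, archived))
    print("ia_archiver would miss:", blocked_links(robots, archived, "ia_archiver"))
```

Counting the returned links per host gives the kind of "what would we have missed" figure described above.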

Link Ranking Group 

We looked at link analysis to ask if all links are treated the same… We wanted to test if links in <li> are different from content links (<p> or <div>). We used Warcbase scripts to export manageable raw HTML, and loaded it into the BeautifulSoup library. We used this on the Rio Olympic sites…

So we started looking at WARCs… We said, well, we should test if absolute or relative links… And comparing hard links to relative links but didn’t see lots of differences…

But we started to look at a previous election data set… There we saw links in tables, and there relative links were about 3/4 of links, and the other 1/4 were hard links. We did some investigation about why we had more hard links (proportionally) than before… Turns out this is a mixture of SEO practice, but also use of CMS (Content Management Systems) which make hard links easier to generate… So we sort of stumbled on that finding…
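For readers who want to try this kind of comparison, here is a minimal Python sketch that counts links by enclosing element and by whether the href is absolute ("hard") or relative. The team used Warcbase and BeautifulSoup; this example swaps in the standard library's `html.parser` so it runs without extra installs, and the class and function names are hypothetical. Treating any href without a host as "relative" is a simplifying assumption (it also catches things like mailto: links).

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkClassifier(HTMLParser):
    """Count <a href> links by nearest enclosing container and link type."""

    CONTAINERS = {"li", "p", "div", "td"}

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open container tags
        self.counts = Counter()  # (container, "absolute" | "relative") -> count

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.stack.append(tag)
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # A link with a host part is treated as absolute ("hard").
                kind = "absolute" if urlsplit(href).netloc else "relative"
                container = self.stack[-1] if self.stack else "other"
                self.counts[(container, kind)] += 1

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed tags.
            while self.stack and self.stack.pop() != tag:
                pass


def classify_links(html: str) -> Counter:
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    page = (
        '<ul><li><a href="/medals">Medals</a></li>'
        '<li><a href="http://rio.example/news">News</a></li></ul>'
        '<p>See the <a href="schedule.html">schedule</a></p>'
    )
    for key, count in sorted(classify_links(page).items()):
        print(key, count)
```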

And with that the main programme for today is complete. There is a further event tonight and battery/power sockets permitting I’ll blog that too. 

Jun 062017
 

Today I am at the CILIPS Conference 2017: Strategies for Success. I’ll be talking about our Digital Footprint work and Digital Footprint MOOC (#DFMOOC). Meanwhile back in Edinburgh my colleagues Louise Connelly (PI for our Digital Footprint research) and Sian Bayne (PI for our Yik Yak research) are at the Principal’s Teaching Award Scheme Forum 2017 talking about our “A Live Pulse”: YikYak for understanding teaching, learning and assessment at Edinburgh research project. So, lots of exciting digital footprint stuff afoot!

I’ll be liveblogging the sessions I’m sitting in today here, as usual corrections, additions, etc. always welcome. You’ll see the programme below becoming 

We have opened with the efficient and productive CILIPS AGM. Now, a welcome from the CILIPS President, Liz McGettigan, reflecting on the last year for libraries in Scotland. She is also presenting the student awards to Adam Dombovari (in absentia) and Laura Anne MacNeil. She is also announcing the inauguration of a new CILIPS award Scotland’s Library and Information Professional of the Year Award – nomination information coming soon on the website – the first award will be given out at the Autumn Gathering.

Keynote One – The Road to Copyright Literacy: a journey towards library empowerment Dr. Jane Secker, Senior Lecturer in Educational Development at City, University of London and Chris Morrison, Copyright and Licensing Compliance Officer, University of Kent

Jane: We are going to take you on the road to copyright literacy… And we have on our tour shirts – these are Copyright exception shirts… They are a parody of Guns N’ Roses tour shirts…

Now, we want to ask you: How does copyright make you feel? [cue some voting] Mostly confused…

Chris: When we’ve done this across the country people have said it made them warm and fuzzy, very happy, but also worried, anxious or confused and faintly cautious…

Jane: Now Copyright gets Chris and me really excited… But what gets us even more excited… Star Wars! When they were working on the prequels to Star Wars, George Lucas’ advice to the young film makers was “Don’t be Afraid”…

Chris: Fear leads to a fight or flight. That’s not what you need… you need to work through it calmly and diligently…

Jane: So let’s take this back a bit…

Chris: I was a musician, so I thought what job can I do around music… So I started working at PRS – who handle performing rights for music… Then moved onto the British Library working on copyright… Music turns out to be less glamorous than I expected, libraries turned out to be much more glamorous than I expected! My life changed, I moved to Kent and now work at University of Kent as Copyright Officer, and they are brilliant in supporting me to do things like this!

Jane: I went to Aberystwyth, worked with old newspapers – out of copyright so really it wasn’t my thing… I worked at the National History Museum…. Then at the British Library… When I moved to UCL to work on digitising lecture materials and course materials copyright became my thing, researching this area… Then onto LSE, working with staff on training, working with academics around copyright literacy… And just recently I moved to University of London in a lectureship role, again educating people on Copyright.

Chris: Now, in 2014 we finally saw some reforms to Copyright Law following the Hargreaves Review…

Jane: When that review came out in 2011 I needed a speaker, someone mentioned Chris… And that was the beginning of a beautiful friendship really… A few years later I was at a conference in Dubrovnik and heard about a concept called “Copyright Literacy” and I wanted to run some research around that – 600 of you completed that, and actually research on copyright literacy took place across 14 countries…

Chris: Out of that work we started looking at resources, including designing Copyright: the board game (CC licensed) which helps you to work out

Jane: Chris and I are part of the Universities UK Copyright and licensing group. We also have a book out: Copyright and e-learning: a guide for practitioners (second edition). One thing that came out of our first research was librarians being nervous and concerned about copyright… We wanted to do more in this area… So we decided to do some work on phenomenography and Copyright as an experience, as a phenomenon, to enable us to understand appropriate educational interventions.

Chris: We categorised the experiences in various ways:

  • category 1: copyright is a problem
  • category 2: copyright is complicated and shifting
  • category 3: copyright is a known entity requiring coherent messages
  • category 4: copyright is an opportunity for negotiation, collaboration and co-construction and understanding…

Jane: Copyright is a problem… The idea of copyright as an imposition… and not well aligned to goals of librarianship, of making material available to people…

In category 2 it’s about copyright as complicated, shifting, changing… “for non-copyright queries the answer is yes, or no, or a series of instructions but for copyright questions it’s maybe, or maybe, or maybe…”

Chris: In category 3 it’s about behaviour change, compliance, avoiding getting into trouble with publishers or the law.

The fourth category is about copyright as an opportunity… It can be about being assertive. When you look at what you share or publish… It can be easy to make sweeping assumptions… So you have to have conversations to reach a shared understanding of copyright… It’s best practice in the industry… And it’s important to also bring that to the profession…

And now that the one minute silence for London is observed… It’s Jane and Chris’ Don’t be Afraid Quiz Time… I won’t blog this as it is fast paced and there are prizes at stake! However… I have learned that HG Wells’ work only came out of copyright this year… 

Jane: So, what does this all mean?

Chris: What would the world be like without copyright literacy?

Jane: It would be a sad world… But why… Without copyright people don’t want to share things, people don’t know how to advise people… We can end up being risk averse – playing it safe and saying no… There are works in the public domain – if we don’t know what we can and can’t do, we see a reduction in what is available. And actually for libraries that would increase costs – rights holders will happily sell you licenses that you may not need – you may be able to use works under copyright exceptions…

Chris: So, we’ve been trying to find ways of bridging the gaps… It’s clearly a complex subject in a complex environment… We want to connect the practitioners to the activists. Some of us are really aware but there is a gap, people working in the profession but not focused on copyright. There is also the concept of creators and consumers, and copyright enables that… But the realities of that distinction are unclear… Automatic copyright can be useful but also challenging… And then we have rightsholders and libraries, and the need to work together to address barriers… There is also a thing about legal language, and the idea that copyright can only be explained in legal jargon, but there are ways to communicate it in a clearer way…

We have been doing work on the role of the copyright officer – and are analysing data from a survey on this…

Jane: To come back to copyright literacy, and critical copyright literacy… We have traditionally focused on training, and one day training events… I think we need to think differently. I spent some time with Prof. John Naughton in Cambridge… He’d use the example of “think about your children at school and sex education… Do they need education, or do they need training?!”.

There is balance between training and approach.. We want to develop people to think individually and find their own answers.. It’s about avoiding binary questions and become comfortable with uncertainty. There is no one way to Google, or one way to explore a catalogue, and there isn’t just one answer in copyright.

Chris: To put this into practice Jane and I have been setting up groups and get togethers in our local areas, London and the South East, for communities of practice around copyright.

Jane: And that’s also about rethinking copyright education for librarians… Bridging the gap between a one day course and a PG Diploma in Copyright law, focusing on what librarians need to know about copyright, focusing on the copyright queries we work with. And we have to talk to library schools about the copyright education young professionals are getting during their qualification…

So, that leads us to the point I wanted to make: Copyright literacy is a journey not a destination (“Morrison and Secker (with apologies to Ralph Waldo Emerson)”). And you have to be comfortable with all that uncertainty.

So, some take aways…

Chris: Copyright is about knowledge, money and power. It is also about privileges, in all meanings of that word.

Jane: Copyright literacy means sharing and working as a community.

Chris: Librarians! Copyright belongs to you, own it! Indeed it belongs to everyone – not lawyers, but everyone.

Jane: Our next tour stop is Manchester! Join us! Now, we don’t expect you to love copyright. We want you to not be afraid, confused, baffled, but to see it as an exciting opportunity, and something where as a librarian you have some special privileges…

Find out more at: https://copyrightliteracy.org or on Twitter: @UKCopyrightLit

Q&A

Q1: When I was a copyright librarian the question was “will I be sued”… ?

A1, Chris: It does come up when I speak to copyright officers. Copyright is civil not criminal law. Your organisation is often where responsibility lies. But rarely does anything go to court, usually it is demands for money, you pay it or deal with it in a process to make your case… That process is crucial as it makes it an efficient and helpful process.

A1, Jane: That does seem to be a major fear for people… Not many actual court cases though…

A1, Chris: There are very few. There was one in Australia on photocopying, but few recently… There’s not a lot of money in suing libraries… But there is a risk to be managed, and libraries need to show they are doing the right thing…

A fab opening session from Chris and Jane – not a surprise (the fun factor – always some copyright surprises and learning!) based on previous experience of their talks and workshops but delightful nonetheless… 

Parallel Session 1: Overcoming disability and barriers: Using assistive Technologies in libraries A joint presentation from

  • Craig Mill – CALL Scotland and Edinburgh Libraries award winning Visually Impaired People Project
  • Jim McKenzie – Lifelong Learning Library Development Leader – Disability Support,
  • Paul McCloskey – Lifelong Learning Strategic Development Officer (Libraries) and
  • Lindsay MacLeod – Project Volunteer

Craig Mill: I am from CALL Scotland. One of the things we do is to provide an equipment pool for schools and children, so that devices can be tried out. For instance we provide Augmentative and Alternative Communications devices and tools – traditionally these were hugely expensive but there are now inexpensive iPad apps that do much of this.

We also have learning resources, many of them supported by funding from NHS Scotland.

We also provide Books for All, which includes texts prepared to be accessible for those with additional support needs… Students can search for books, download them, and use them on their own devices. These are curriculum books, they are provided as PDF in a variety of formats, including large print for visually impaired students… You can magnify, adapt, and you can use preferences to alter document colours for high contrast, you can activate read out loud… You can customise to meet children’s needs. Lots of our Scottish Government funding goes towards the Books for All database.

We also have adapted digital assessments. When you have the SQA physical past paper, you can also now use this service to download and use digital past papers. Again these are a PDF type format with answer boxes. The pupil can go in, type in answers… And you have annotation tools… Including notes/sticky notes… These can reduce costs by thousands for scribes… Can just have a student with a laptop and headphones now…

We also have Scottish voices… Traditionally synthetic voices have been quite mechanical… We have a collection of Scottish synthetic voices: Heather; Stuart; Caitilin (Gaelic). We have students using these in Scotland in schools, colleges, HE. And if you have a computer voice, you need something to read that…

We also have a tool called “WordTalk” that sits in Word. It just sits there and reads back to you as you type, it’s a free text-to-speech plugin.

As well as that we have lots of information on assistive technologies. We are asked a lot about supporting pupils with dyslexia. So we now have quite a comprehensive resource on writing, reading, some case studies as well… e.g. Hamish uses OneNote, Notability, iPads… Some really useful stuff here.

And under our downloads section, if you are looking for resources, you’ll find the posters and leaflets – which we’ve become popular for. The most popular by far is our iPad Apps for Learners with Dyslexia resource.

Finally, my colleague Allan recently wrote a blog article on scanning pens and reading pens. These are now much much more accurate than they used to be. He wrote a comparison of the reading pens. In England there is an “exam pen” in exams… But it doesn’t have dictionaries etc. built in. Whereas the C-Pen reader has lots of features added in, including dictionaries… They are the market leaders. Allan compared these with apps that do similar things.

Paul, Jim and Lindsey

Paul: I’ll talk about how our work ties to local and national priorities. Then Jim will talk about the project, and Lindsey will give his experiences as a user.

Our message today is about helping visually impaired people to be empowered to be self-sufficient, with technology enabling access to information. Over 180k people in Scotland are affected by a significant level of sight loss. And the aging population and rise of diabetes mean that this is expected to double in Scotland in the next 10 years.

Blind and partially sighted people can feel isolated. In work on users’ needs, in their own words, they gave their priorities. VIP supports three of these:

  • That I can access information, making the most of the opportunity that technology can bring.
  • That I have someone to talk to.
  • That I have the support that I need.

And VIP helps support citizen engagement, community participation and participation in the library.

Jim: We can purchase equipment, but we also provide expertise and the time to get people set up. Four years ago Apple was leading the way with technologies… Setting it up wasn’t the easiest in the world. We had an existing resource centre that people used regularly. We had new users… We wanted to get new users engaged – posters in the library wouldn’t cut it. So we went out… To the RNIB Cafe, where we set up an audio book group, we talked to the eye hospital, we talked to guide dogs, we talked to the thriving macular degeneration group in Edinburgh. We concentrated on these groups and worked hard to develop those relationships.

We thought hard about location. We had 28 libraries, we set up in 10. We looked at safety in crossings and roads. We looked at the location of bus stops – we started a group in one location but no-one came as the bus was too far, crossings weren’t good. We also looked at facilities, and we looked at staffing. We gave some training in what we were offering. We got them to set up a patron, show them how to use wifi – if they could do that, it would be fine. Not all apps are accessible, but many are. There are podcasts. There is the RNIB Tech Talk podcast. Apple has Blind Vis, a group for those with visual impairment. There are apps for VO – Voice Over – to get you used to the interface.

Things we have to guard against included not spreading ourselves too thin – hence 10 not 28 libraries. We have used volunteers and champions. And we had to stay up to date, technology changes really really quickly. We get asked about books and newspapers. One group were asked what they really missed – one guy missed poker… Surprisingly hard to find an accessible app. We eventually found one – Theta Poker (where money is not involved) and I actually recommend it as an app designed for a visually impaired person.

It can be challenging to find and keep great volunteers, but when you find a great one it makes all the difference… On which note, over to Lindsey…

Lindsey: My personal involvement was back in 2014, through an introduction by the RNIB to Jim and what he was doing. I wanted to bring my experience in econtent into volunteering, and the Edinburgh Libraries were doing exactly the kind of things I wanted to do… When you are blind or visually impaired there are fewer choices but the Apple products are really great – not an advert, others are available!

I was really impressed by the groups I met… But the speed of progress is variable. The demographics of blind and partially sighted people tend towards older people and it takes longer to learn later in life, so we work with that. There were differences between blind and partially sighted people. The latter group can try to grasp onto what they are used to doing – and have to be convinced that with a blank screen they are still getting the functionality. That was a learning curve for me but I’ve had a great mentor. Abilities vary… And people’s familiarity with technology varies – the swiping idea can take many back to year zero though.

With these groups we do ask what they want from these devices. Some want to make a change. Some want just emails or audiobooks… But they learn there is virtually no limit to what they can do with an electronic device. The learning is not a linear classroom approach – given the mixture of abilities. So it’s more like a learning spiral, revisiting basic techniques, ensuring they understand what devices can do.

The local library environment is largely great. There is privacy. The staff are very welcoming. And ease of access is important – it’s daunting to navigate a new city without a guide. Libraries should be a universal space, and the things we learn require face to face interaction. Group feedback is essential, to tailor to needs, and to know when to revisit things and refresh them.

As a volunteer this has been a hugely rewarding experience, and I thank the libraries for that.

Paul: I hope Jim and Lindsey have given you an idea of the service. Right now we are looking at evaluating the programme, using RNIB and Online Today. We are also working with them to reach a wider group. We are seeing growth in volunteers, and we are seeing growth in capacity as important. Having a dialogue with our service users has been crucial, for instance deaf-blind families. The reinforcement and training have to continue, be refreshed, almost continually refresh the project, in order to reach a point of sustainability. It’s also brilliant that many who came to us for support are now leading the classes…

Traditionally people with visual impairment have been behind with technology, but with this project that is no longer the case. We’ll be running Six Steps courses over the next few months – see http://www.readingsight.org.uk/ I’m going to conclude with a video of Christine Morris – probably our best speaker of the bunch but sadly she couldn’t come along today!

Chris: I became partially sighted then blind and because of that didn’t do much and didn’t feel as able to leave the house… Then I got an iPhone… I went to the City Library and was shown by Jim how to use it… I then moved to using the Craigmillar library… At a certain point a number of us moved to iPads… It was a big jump but we all made steady progress… It was quite challenging as new people kept joining the group, but volunteers came in to help… Then I couldn’t make the same journey… I now go to the Stockbridge Library – much closer to home – and go regularly.

The technology has changed my life. I can now use email to stay in touch with friends across the world, I can listen to music, listen to the radio, I can download podcasts – The Archers, From Our Own Correspondent, and Inside Radio. And if you have a little sight you can use the camera, and the iPlayer – not useful for me… But I gather I can now record it with audio descriptions so I will try that!

Jim tried to make it that we didn’t just use the technology for practical things, but for fun things too… games and whatnot. I really like doing crosswords – I still do the Daily Telegraph crossword every day with my husband but I can’t do it on my own. But Jim showed me a crossword app I can use on the iPad on my own.

I think it’s so useful for people like me, who would otherwise be quite isolated. It has been a lifeline and I hope to go on and do much more with the technology!

Q&A

Comment: It’s great to hear first hand from a service user.

Paul: We presented to the COSLA judges a year ago. We had Chrissy and she was great – I’m sure that’s why we won! She highlighted things that seem small but can be a big challenge – like the crossword puzzle.

Chair: Some of you may be aware that a digital strategy piece of work has taken place, with a survey. In one question on assistive technology, only 11% of libraries claim to have assistive technology… But that may be about understanding definitions… So we will come back to that…

And now it’s networking lunch and exhibition time… 

Parallel Sessions 2: Spotlight on research – Papers on: Linked Data

Opening Scotland’s library content to the world (Dr. Diane Pennington, University of Strathclyde)

Thanks for coming to hear about linked data right after lunch! I will give an overview of Linked Data for those of you who may not be sure what it is…

So, a quick note on the evolution of the web (1989-now). We started with Web 1.0, hand-coded HTML pages, accessible and reliable, but not interactive; then Web 2.0 with Facebook and Twitter, where everyone can post, share and respond without extensive technical knowledge. Web 3.0 or the Semantic Web is about new ways to imagine and combine information on the web…

Tim Berners-Lee outlined the Semantic Web in terms of using URIs as names for things – so Strathclyde’s name on the semantic web is http://www.strath.ac.uk/ for instance.

When someone looks up a name, provide useful [RDF] information. Think grammatically here in terms of understanding relationships in a structured way. And we can include links to other URIs so that people can discover new things.
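To make the “think grammatically” point concrete, here is a minimal sketch (not from the talk) of linked data statements as subject–predicate–object triples, written in plain Python and printed in N-Triples style. The Strathclyde URI is the one mentioned above; the `example.org` vocabulary and the Glasgow URI are hypothetical placeholders chosen purely for illustration.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A plain text value, as opposed to a URI that names a thing."""
    value: str


STRATH = "http://www.strath.ac.uk/"            # URI used as a name in the talk
EX = "http://example.org/vocab/"               # hypothetical vocabulary (assumption)
GLASGOW = "http://example.org/place/glasgow"   # hypothetical URI (assumption)

# Each statement reads like a tiny sentence: subject, predicate, object.
triples = [
    (STRATH, EX + "name", Literal("University of Strathclyde")),
    (STRATH, EX + "locatedIn", GLASGOW),       # a link to another URI
    (GLASGOW, EX + "name", Literal("Glasgow")),
]


def term(t):
    """Format one term: literals in quotes, URIs in angle brackets."""
    if isinstance(t, Literal):
        escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{t}>"


def to_ntriples(statements):
    """Serialise the statements one per line, N-Triples style."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in statements)


def linked_uris(statements, subject):
    """Return the other URIs a subject links to, so new things can be discovered."""
    return [o for s, _, o in statements if s == subject and not isinstance(o, Literal)]


print(to_ntriples(triples))
print(linked_uris(triples, STRATH))
```

Looking up the Strathclyde URI in this toy data returns both descriptive information (its name) and a link onward to another URI, which is the “discover new things” step in miniature.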

Anyone that uses Google is using Linked Data. When you see that panel – the Knowledge Graph – that is based on linked data from Wikipedia, YouTube, etc.

So that’s the basis of linked data, open data…

In 2015 the Scottish Government published an Open Data Strategy. They want any public service creating non-personal, non-commercially sensitive data to share it as linked data. And in the Scottish Government’s “Realising Scotland’s full potential in a digital world: A Digital Strategy for Scotland” (2017) this is further reinforced. And there is an official Scottish Government open linked data statistics page.

But this isn’t all where it should be… And what about libraries implementing linked data… Why should we do it? Well because people can more easily find library resources on the web – through Google not (only) through our catalogue; more creative applications based on library metadata; opportunities for cataloguing innovation and efficiency.

Back to Tim Berners-Lee’s star rating of linked data… We are a long way from 5*s now.

So I have been doing a survey of Scotland’s Linked Open Data, with over 120 responses… A lot more people know what “linked data” means, rather than “semantic web” – a very related term. When I asked what it means, they knew it was about resource sharing, linking, availability and connectedness…  When it came to what “semantic web” means themes were around improved web searching, more structured online data for better organisation… But many of the definitions were not really correct…

When we asked if libraries had implemented, or were planning to implement any linked data… Not on the whole, which is unfortunate. Some concerns and limitations were about licensing constraints – permission needed from database providers to link. Teach practitioners what linked data can concretely achieve… Lack of knowledge – decisions made further up the chain? Potential loss of control of data. Concern that digitisation is linked to monetisation… And what to link to…

Despite that wider set of government strategy priorities, and NLS actions in this direction, there remain barriers to implementation… Lack of awareness, lack of time…

This is ongoing research, and I’ll be publishing the survey analysis at some point. I will be looking at Scottish library websites. I also want to do interviews around those plans/lack of plans… I also want to increase awareness among the ILS community around linked data and semantic web to potentially increase uptake..

Towards an information literacy strategy for Scotland (Dr. John Crawford)

Brief background here. I directed the Scottish Information Literacy Project (2004-2010). We built up a great network of contacts and collaborators. After I retired we shifted to the Right Information Community of Practice, founded in 2012. We communicate by blogging, email and twitter with meetings twice a year.

We bring together a diverse range of library sectors and representatives from education and skills bodies…

We have done various things… Including activism. In 2014 the Royal Society of Edinburgh’s interim report, Spreading the benefits of digital participation, came out… And we submitted a lengthy response which duly ended up in the final report. It outlined the role of libraries, and of information and digital skills… And the need for those skills to be embedded throughout the lifespan. These are all good, but hard to do.

We managed to meet with the minister in June 2015, we focused on democratic renewal for a better informed society. There was a further conference in 2016. We were able to improve links with other relevant bodies. The minister wanted a focus on “digital literacy” rather than “information literacy” which means she wouldn’t give us money…

But we live in a different world now. Many opportunities to vote and, in some cases in Scotland,  that included 16 and 17 year olds, bringing information literacy to a new group in a new way. After the referendum there was an increased interest from young people in politics and engaging in that type of debate.

Another area here is “health literacy”… This is tough information to get our heads round, and it matters greatly. 43% of English working-age adults will struggle to understand instructions to calculate a childhood paracetamol dose… That’s very basic and crucial literacy…

One of the things I tried to do when chairing the information literacy project was try to focus on particular innovation areas – including Konstantina Martzoukou’s work with refugees that was presented this morning for instance. That was supported by an information literacy organisation… And connected information literacy to background policy documents…

Bill Johnston chairs the Older Person’s Alliance looking at older people and literacy around good health, pensions, recreation, etc. Lauren Smith is working on political engagement of young people, and the role of school librarians in political information literacy with young people. And we have making it easy – a health literacy policy for Scotland.

How do we evaluate services like this? And what kind of performance indicator can we use? It needs to be precise, and be a genuine indicator of success. I had a look at the literature… And it kept coming back to a special issue of Library Trends that I co-edited around 2011. Particularly work by Andrew Whitworth, which included “information literacy policy documents should be about information literacy and not something else” – sounds obvious but often they are actually about something else, e.g. IT skills. He also stated that such policy documents should have some sort of government support and relevance. They should be fully cross-sector. They should be informed and preferably led by the professional bodies of the countries concerned, and should be collaborative across organisations.

The other paper was by Woody, where he presented his “ten commandments” which included: patience and perseverance; find an in-house champion; link to the 21st century; resistance to change; don’t bite off more than you can chew, etc.

Whitworth’s criteria, particularly that one of information literacy being muddled with the digital agenda, have proved quite thorny. From Woody’s work the issue of champions has been partly addressed by attracting support from professional bodies, other professions and activities. Aiming for the top has been more problematic. Linking information literacy to specific long standing goals and reforms has been key to our activities. We’ve done our best to pilot, test and experiment, and to deliberate objectively on that.

If you want further reading I will recommend that 2011 issue of Library Trends, 60 (2), and “Strategic policy making issues in information literacy”, in Library and Information Research, 40 (123), 2016, which includes articles by Lauren Smith and Bill Johnston.

Q&A

Q1 – me) I was part of the RSE Inquiry Committee and we did have a lot of discussion about the relationship between digital literacy and information literacy – in a way it is all information literacy and we were aware of that, but also keen to focus on the specific challenges and issues around digital in that report. But I’d agree that information literacy is the fundamental set of skills.

A1 – John) It took so long for CILIPS to be interested in information literacy as the predominant skill set. IT skills and digital literacy skills do naturally lead onto information literacy. I think we failed to make our case a number of years ago, and should have done.

Q2) Why wouldn’t the minister fund information literacy?

A2 – John) If you speak to a government minister you have to look to those around and behind them… Civil servants do have an agenda of their own, and they do present that to the government ministers… They have successfully presented the digital literacy agenda to ministers… Something that was encouraging was that the minister – Fiona Hyslop – did connect the idea of digital literacy to wider information literacy.

Q3) What is the kind of potential for linked data in libraries?

A3) Say all of our libraries in Scotland shared catalogue records as linked open data, then those would also appear, not just videos and that type of content, when someone searches Google for e.g. “Loch Lomond”.

Comment 4) I work for the Scottish Government civil service. I would say it is a bit more positive now, with the digital strategy launched this year. It has taken us a while to make that link between information literacy and digital literacy. Slow progress but it is happening…

Q5) For small public libraries what is the first small step we should take?

A5) I would say what is the unique thing in your library, and focus on that, the quick wins… and make it available as linked open data.

Q6) How do we prioritise linked data over other issues when we are strapped for resources and have many priorities?

A6) Partly it’s about accountability, findability, transparency to those that pay for our libraries through taxes, council tax, etc. A public accountability approach can be helpful.

Parallel Session 3: If I googled you – what would I find? Managing your digital footprint (Nicola Osborne, Digital Education Manager, EDINA)

3.15-3.35pm Refreshments and exhibition

Slides from my session will be available on my presentations and publications page shortly… 

Keynote 2 – Securing the future: where next for our community in 2018 and beyond? Nick Poole, Chief Executive, CILIP

This has been such a good event, it always has such a good buzz, and it is such a privilege to be part of. I’m talking about securing the future, but it’s not about us securing the future for ourselves, we’re securing the next generation’s right to learn, to be informed.

Two years ago there was a presentation here from IFLA about Sustainable Development Goals. These are the bigger context for the work all of us are doing. Whatever the outcome of 8th June it will be a fresh start for your daily work, to make sure there is opportunity for these people.

We are living in a future that is transformative, and we are the people to make that happen, whether we realise it yet or not. We are a powerful community of information professionals. We are not just librarians, we are information managers, we are data professionals, we are knowledge managers. And it is so important that we are united in our values, and so excited about where we go next as a community.

CILIP members are embedded across the spectrum of public sector, private sector, third sector, all types of organisation. There are over 60,000 of us. And the CIBR estimates that there will be 100,000 jobs for knowledge professionals in the coming years.

So you may have seen Securing the Future, our action plan 2016-2020. Our goal stated there is to “put library and information skills at the heart of a democratic, equal and prosperous society”. We want talented, creative library and information professionals everywhere. To shout about what we do. We have three connected goals around being stronger and more inclusive as an organisation.

We have come a long way together as the four CILIP regions. A lot of what we speak about, our campaigns, are about delivering real, measurable change in the opportunities for and status of librarians and information professionals.

I just wanted to pause to thank everyone for the fantastically effective #LibrariesMatter advocacy campaigns. When you win here, it benefits the wider community across the UK, it is media coverage and impact and meaningful stories of how we make a difference that I can take to government to explain what we do, that we can do these things too.

I really admire that in Scotland you have a little bit of swagger and confidence about your libraries and where you are going, and we want to learn from it. And it makes a difference. In the local elections every single party made an above the line commitment to libraries.

And that has led into a national school library strategy for Scotland by the deputy First Minister for Scotland. We know that is words, but it can make a real impact, it is happening, it is hard to go back on. And I can take that back to UK government and make the case for England too.

As you may know we have been working with Chris Riddell (@chrisriddell50) to build our argument about the huge importance of libraries and schools for literacy, for early years.

We have just launched, after announcement of the election, the #factsmatter campaign, calling on all parties to use evidence based campaigning. Most have signed up, though one – I won’t name them – said “that sounds like a trap!”

Facts DO matter. We shouldn’t tolerate fake facts, fake news, in our politics. Big Issue founder John Bird has advocated for us and continues to do that. We have celebrities and public figures backing this.

We have the “A Million Decisions” campaign demonstrating how librarians make a difference to healthcare, the lives and money saved because of knowledge and information. Coming to the NHS England commitment to libraries. We are absolutely delighted that there is a sister campaign – “A Right Decision” – in Scotland with NHS Scotland.

We are starting to look at how we develop a skilled workforce for the future. We do see retirements and redundancy, but we also see a huge influx of new entries to the profession. We have to develop skills, to ensure transferable information skills. I want young kids to say “I want to be a librarian” and for their parents to be proud of that!

So, we have to develop solutions and routes into the profession that open us out…

Some announcements here. In our event in July in Manchester we will be launching a sector-wide Ethics Review. We will also launch a Public Library Skills Strategy for England, partnering with the Society of Chief Librarians. And that’s all about opening up the pathways.

Finally, how do we become a bigger, better, more inclusive professional association? Right now we represent about 15% of the sector. Other professional associations represent more like 23%. So to do that we need to make membership more accessible, more affordable, and make sure we champion equality, diversity, and truly represent the sector.

We will build our member networks, we will work on new standards, communities, and publications. And we will continue to build partnerships with organisations and companies that help us achieve our goals.

Carla Hayden, Librarian of Congress, says “When librarians get together, something great happens”. We know that, we believe that.

And we aren’t securing the future for ourselves but for that new generation, for those that need us to be there.

Thank you to all at CILIPS, and here’s to CILIP and CILIPS working together to make the difference to 2018.

Q&A

Q1) You said information skills are at the heart of a democratic society… Isn’t it the case that CILIP has been a bit of a latecomer to information literacy? You and I were on the board in 2011 when we were asked to endorse the Alexandria Proclamation, which had been published 5 years previously… We are catching up but… We’ve had a Scottish and a Welsh Information Literacy project, when will there be a CILIP-led Information Literacy project?

A1) Great question. We had three asks of political parties: to support public and school libraries; to acknowledge that the future is data driven; and that we need to have a workforce with information literacy skills to prepare them for the world. I think information literacy will have impact when there is an article in Tesco magazine. Facts Matter has been a really good opportunity to do that. And we need to build something after the election.

Comment) I think that whole campaign is spot on, and it’s great that that has tied into something so current and bigger than the sector, and created new opportunities.

A2) I’d like to say it was long plotted… Honestly I was on an eBay shop doing badges and decided it was the right slogan. Two organisations came to us on the back of the campaign, including the Royal Statistical Society, as they saw real opportunities to work together to build an information literate population.

President’s closing remarks – Liz McGettigan

I won’t go into huge detail but I have to thank Kathy and Sean for making such a brilliant seamless event. Thank you to our sponsors, and to Alex and our AV team who have been spot on. Most of all thank you to all our speakers, you have inspired us all. There have been fabulous presentations across such useful areas over the last two days. We have been impressed with projects on working with refugees, working with health information, such a range. When people say “libraries are just about books”, think back on all these amazing projects you are all delivering out there! I never cease to be amazed by what you are doing. I hope you go home inspired and galvanised. And it’s not about Sean, Kathy, Nick and I, it’s about all of you advocating for what you do, getting out and talking to media. So get out there!