Jun 21 2017

Last Thursday I attended the Guardian Teacher Network Seminar – Technology in schools: money saver or money waster? – at Kings Place, London. The panel was chaired by Kate Hodge (KH), head of content strategy at Jaywing Content and former editor of the Guardian Teacher Network, and featured:

  • John Galloway (JG), advisory teacher for ICT/special educational needs and inclusion, Tower Hamlets Council.
  • Donald Clark (DC), founder, PlanB Learning and investor in EdTech companies with experience of teaching maths and physics in FE in the UK and US.
  • Michael Mann (MM), senior programme manager, education team, Nesta Innovation Lab.
  • Naureen Khalid (NK), school governor and co-founder of @UkGovChat.

These are my live notes from the event – although these are a wee bit belated they are more or less unedited so comments, corrections, additions etc. are welcomed. 

The panel began with introductions, mainly giving an overview of their background. The two who said a wee bit more were:

John Galloway, specialist on technologies for students with special needs and inclusion, I work half time at Tower Hamlets with students but also a lot of training. It’s the skills of adults that is often the challenge. The rest of my time I consult, I’m a freelance writer, I am a judge of the BETT awards.

Michael Mann (MM), NESTA, our interest is that we don’t think EdTech has reached its potential yet… Our feeling is that we haven’t seen that impact yet. And since our report five years ago we’ve invested in companies and charities who focus on impact. Also do research with UCL, and work with teachers to trial things in real classrooms.

All comments below are credited to the speakers with their initials (see above), and audience comments and questions are marked as such… 

KH: What’s the next big thing in tech?

DC: It’s AI… It’s the new UI no matter what you use really… I only invest in AI now… Education is curiously immune from this at the moment but it won’t be… It is perfect for providing feedback and improving the eLearning experience – that crappy gamification or read-then-quiz experience… We are in a funny transitional phase…

MM: There has been an interesting trend recently where specialist kit is becoming mainstream… touch screens for instance, or speech to text… So, I think that is closing the gap between our minds and our machines… The gap is closing… The latest thing in special educational needs has been eye-gaze games – your eyes are the controller… That is moving into mainstream gaming so that will become bigger… So I see a bigger convergence there… And the other thing I see happening is VR. That will allow children to go places they can’t go – for all kids, but that has particular benefits and relevance for, say, a child in a wheelchair. For autistic children you can put them in environments so they can understand size, lights, noise, and deal with the anxiety… before they visit…

KH: What are the challenges of implementing that in the classroom?

JG: The tech – and costs, the space… But also the creativity… A lot of what’s created is not particularly engaging or educational. I’d like to see teachers able to make things themselves… And then we need to think about pedagogy… But that’s the big issue…

DC: I can give you an example in the context of teaching Newton’s laws with kids… We downloaded a bunch of VR apps… And the NASA apps were great for understanding and really feeling Newton’s three laws… Couldn’t do that with a blackboard… And that’s all free…

KH: How accessible is that… ?

DC: Almost every kid has a smartphone… Google Cardboard is maybe £5… It’s very cheap… It won’t replace a teacher, at least not yet. I wouldn’t teach basic mathematics with VR, but I wouldn’t teach Newton’s three laws any other way…

MM: We are piloting a thing called RocketFund and one of the first people to use VR used it in history… After that ran we had about 10 more projects because they’d seen what was possible…

DC: “Fieldtrips” can be free… I’ve also seen a brilliant project with a 360 degree camera in a classroom used in a teaching space – a £250 camera – and brilliant for showing issues with behaviour, managing the classroom etc.

NK: Now if something is free, I would have no objection at all!

KH: How do you measure impact?

NK: Well, if someone has a really old PC that runs slowly, replacing it is a quick and clear impact. But it’s about how they will use it, what studies are there and whether they are reliable… Could you do this any other way? What’s different?

MM: A lot of these technologies do not have evidence behind them… But you will have toolkits, ideas that are well grounded in peer instruction, or tutoring… If you can take pedagogical approaches and link them to a tool you are using, that’s great. There’s work on online tutoring, and there is a company which provides tutoring from India… And I want to know how they ensure that they follow established criteria…

DC: I think we’ve had a lot of device fetishism… We’ve seen huge amounts of tablets imposed… and abandoned… You have to regard tech as a medium – not a gadget. I think we’ve had disastrous experiences with iPads in secondary schools… They work in primary schools, but actually writing on iPads doesn’t work well… It’s a disaster… And it’s a consumer device, not one enabling higher order writing, coding and creation skills… I recommend that you look at Audrey Mullen’s work – she was a school kid when she started a company called Kite Reviews… She said we don’t want tablets or mobiles, that laptops were better…

Comment: What about iPads in schools… I did a David Hockney project with Year 10 students that riffed off his use of iPads, and the students really engaged with it… I’ve also used it in a portrait project as well… And one of the things I’m interested in is how you use it for more than writing and literacy…

JG: I just want to come back to measuring impact… It depends what you want to use it for… Donald gave us an example of using an iPad for the wrong thing, and from the audience that example of using iPads in the right ways… No-one in industry would code on an iPad… We have to use technology appropriate to the context and the wider world.

KH: How would you know that?

JG: As a teacher you have to gain expertise and transfer that to your teaching…

KH: You might be an expert in history but not in ICT…

JG: As a teacher you have to understand the technology you are being given to use… You have to understand the pedagogy… And you have to prove to teachers that the technology will improve their practice… I’m not sure any teacher has ever taught the perfect lesson; you can always think of ways to improve it… And that’s how you consider your work… One of the best innovations in teaching has been TeachMeets – informal exchanges of practice, experiences, etc. The reasons technology in classrooms is not as successful as it should be are complex…

NK: I know of someone who purchased an app, bought into it, sent people off to training… But it was the wrong app for what they were trying to do… So do the research first before you purchase anything…

DC: I think that the key word here is procurement… And teachers shouldn’t be doing that with hardware… You have to start with teaching needs, but actually general school software too – website, comms with parents, VLEs etc… It’s back end stuff… Take the art example… I know lots of artists… none using iPads… They use more sophisticated computers that enable the same stuff and more… Buying iPads because of David Hockney is the tail wagging the dog… It’s general needs… Most kids have devices… I’d spend money on topping up for inclusion… And you have to do that cost benefit analysis first…

MM: Cost benefit analysis and expert approaches aren’t realistic in many schools… Often it’s more realistic to do small scale trialling… If it works, guide your peers; if not, then quit there… Practical experimentation, test and learn, is the way forward I would say…

JG: I think that the challenge is often the enthusiast… You need to give things to the cynic!

DC: There is a role for sensible professional advice. In higher education we have Jisc, which is quite sensible… But we don’t have that advice available for schools… It all goes a bit odd… It’s all anecdotal rather than evidence based… Otherwise we are just pottering about… And we end up with the lowest common denominator in terms of skills and understanding…

JG: I’m getting a bit nostalgic for BECTA, and NESTA FutureLab… doing interesting stuff. A lot of research now is funded by companies engaged in the research…

MM: I agree… but there is no evidence for whiteboards, tablets, whatever, as they don’t work on their own… It has to be evidence informed…

DC: Cost effectiveness is always about tech as an intervention in education… The evidence for schools is that writing accuracy goes down 31% and is a huge problem on tablets… Unless…

NK: There’s good evidence that typing notes in class doesn’t work…

DC: Absolutely… Although there is plenty of evidence that lectures don’t work and we still do that… They have power devolved and in my view they are not really teachers… That happens every day…

Comment from audience: That doesn’t happen every day…

MM: We have to be careful about how we use the word evidence… Lectures may not be correlated with success but that may be to do with the quality of teaching staff, of lecturers…

KH: One of you talked about giving technology to the cynic… How do you overcome this…

JG: I think that the doubter, the cynic… will ask all the questions, find all the faults… But also see what works if it works…

KH: Often use of tech comes down to the enthusiasts and evangelists… But teachers lack space to be creative… How can we adopt technology if we lack that time and opportunity…

JG: We have so much more technology now, it has permeated our lives more… Our thinking, our discussion, potentially our classrooms… But I haven’t seen smartphones in schools much yet… We haven’t talked about bring your own device… There is an element of risk.. potential for videoing, for sharing bad practice, for bullying and harassment… But there is a lot of nervousness there…

DC: I think we have to move away from just thinking about technology in the classroom. I’m dead against it. Bringing tech into a room is a one-to-many context… I’d rather use learner technology… Good teachers are teachers in the classroom… Kids really use tech at home, with homework… When I was a kid, if you struggled you got stuck… but now you can use devices… to find the answer but also the method… And we have adaptive learning that can tailor to every kid. I think learner technology, away from the classroom, is where it needs to be… Rather than the smart board debacle… where one minister brought that in and Promethean made millions…

JG: I don’t recognise the classroom you are describing… I see teachers using technology, with big changes over the last twenty years… It is the appropriate use of technology in the appropriate places in learning… And thinking about the right technology for the job… If we took technology out of the classroom we’d just have lectures wouldn’t we?!

DC: The issue of collaboration is interesting… There is work from Stanford on group work and collaborative technology-driven activities in the classroom… showing that most kids aren’t doing anything, although it looks collaborative… versus a good teacher doing the Socratic thing…

MM: I don’t think the in/outside the classroom thing is as important as the issue of what works, how things adapt, immediate feedback as with FitTech… But it all comes back to pedagogy…

NK: It all comes back to what the problem is that you are trying to solve…

KH: What about the right way to do this… There’s the start-up like run fast, fail fast approach… Then the procurement approach…

NK: We want evidence based procurement… I don’t want to fund trials… Schools are poor…

KH: Start-ups don’t just throw it out and see if it works… They use data to change their approach… And that’s what I’m talking about… Trialling, then using evidence to inform decisions…

DC: The last thing I want to do is to waste time or money with start-ups going into schools… I think taking risks in schools like that is very risky… I’m also not sure governors should be procuring… The senior team should… But often there is no digital strategy… It needs to be strategic, not just tactical…

JG: Suppose we get the kids to assess the start up product… There is a great project called Apps For Good… It gets kids to engage in the idea, the design process, the entrepreneurial aspect… There is a role for start ups for teaching kids about how this happens… I think education is a risky business anyway… We think something good will happen, kids have to trust the teacher… I think risk can be quite a healthy thing, and managing risk… Introducing something new can be edgy and can be quite invigorating…

NK: As a governor I don’t want my school going into the red financially… We need to operate within our means…

KH: It wasn’t about start ups in the classrooms… Even a small spend…. Can be risky…

MM: Isn’t there a risk of a big roll out of something that doesn’t work for your school? Some risks will feel riskier than others… School culture and character all matter…

JG: We do have examples of technologies that didn’t work but now do… VLEs didn’t take off… Schools don’t use them… It was an expensive risk… But many use Google Classroom which is essentially the same thing… It’s free but needs maintenance…

DC: Actually with new start ups… you want evidence, you want research to prove the usefulness. 50% of start ups fail, and you don’t want to adopt stuff that will fail…

JG: But someone has to try things first, to try new things, to bring something new into the classroom.

KH: How do we take Ed Tech forward… ?

DC: At risk of repeating myself… Professional procurement, technology strategy, strategic leadership in this…

Comment from crowd: Where do you get the evidence if you don’t test it in the classroom…

DC: I am involved in a big adaptive learning company… We are doing research with Cambridge University…

Comment from crowd: so for the schools taking part, that is a risk!

DC: No, it’s all carefully set up, with control groups… Not just by recommendation by colleagues…

JG: Setting up trials in schools is incredibly difficult, especially with control groups… Even if you do that you have to look at who was teaching, who was unwell at the time, etc. It’s very, very hard to compare… And if it is showing improvement, then morally should you withhold that technology from some pupils? One of the trials I can think of was around use of iPads… Give pupils their own budget for apps, but give them free choice… And then have them talk about that… It’s a trial, but it’s very low cost, it’s very effective, and it’s judging the fit of tech to the space…

NK: I’ve known schools go for the iPad whether or not it works… Why go for the most expensive tablets… to try them!

DC: In the US there was a $1.3bn deal with Apple in California… And iPads are not there now… They now use Chromebooks…

JG: But that was imposed from the top.. And that’s an important issue…

Comment: I want to take issue with something Donald was talking about… I am all in favour of evidence based research and everything… But it is hard to find time to find the research, and a lot of effort to actually read through it… three pages of methodology before the conclusion… By the time it’s published it’s out of date anyway… I write about evidence on my website and often no firm conclusions come out of this… Ultimately anecdotal evidence matters… Asking questions of what was this trying to solve, what worked, what didn’t… Question: does Donald agree with me?

DC: No!

Comment: We all know the digital age is coming, kids have to work with computers, how can schools prepare children for that work and keep traditional teaching too..?

MM: For me there are two aspects: digital skills like code clubs, programming… The other side is that when we are in this world with automation, what sort of jobs will survive… We have a report at Nesta called Creativity vs Robots… The skills that are most robust are creative, collaborative, dexterous… Preparing kids for the future still requires factual knowledge, but also collaborative and problem solving skills… It’s not that it doesn’t exist, we just really need to focus on that…

JG: Maybe controversially I will say that we don’t… We should teach flexibility and how to learn. A few years back I wrote for the Times Ed… I visited Harrow – relatively unlimited funding… They don’t teach computing… They don’t get there until Year 9… Prep schools don’t teach it… Not “academic” enough for A-level or GCSE. They do some ICT skills… I guess they will get jobs, good ones… But they don’t prepare them for that… They prepare them to be leaders and the elite… I’m not necessarily sold on the idea that you have to prepare kids to be the makers… We teach reading and writing, but not digital literacy… Or how to read a film or a computer game, or why failure is important… We don’t teach that… We might teach them how to create the game… So in part “don’t” and in part “expand the curriculum”.

Comment: For Mr Galloway… Why did you go to Harrow not Eton… They invest in innovation and you get to be amused at top hats and tails?

JG: Tube ride!

DC: It would be madness to ignore technology in schools… But coding is this year’s thing… ! Kids need skills when they leave school…

NK: I have great problems with the idea of 21st century skills… We can’t train kids for jobs that don’t exist yet… Jobs from hundreds of years ago…

MM: There is a social justice aspect here… Mark Zuckerberg went to one of the top schools… If we don’t expose all children to technology opportunities they can miss out…

JG: At Harrow they don’t impose technology on teachers… but they get it if they ask for it. They also give kids Facebook accounts and teach them how to use them…

Comment: When we think about technology in schools, when do we think about the teachers’ perspective… can we motivate and engage students with 21st century skills and possibilities…

NK: With all the money in the world, yes. We are in a position where schools can barely afford the teachers… We have to live within our means…

DC: Are teachers the right people to teach these skills… Is that what teachers are best suited to?… I’m not sure subject orientated teachers are well placed for that.

JG: Teachers do teach collaboration. Social media is about relationships… It’s just a form of that… CPD for teachers is outside of school time and that means keen teachers engage there…

MM: You have some teachers who are into smartphones, some who are not… Some teachers are into outdoor education and camping… others are not… You wouldn’t want to exclude kids from the experience of camping… That’s how you can think about the idea of digital literacy here… Finding the enthusiasm and a route in…

Comment: A lot of what we, in this room, know of technology is through past exposure and experience of technology. Children are sponges… They can often teach the teachers, with scaffolding from the teachers, about this era of technology… The kids are often better and quicker at using the technology… We have to think about where this might lead them…

Comment: On procurement and evidence… Michael talked about small trials… Do we think the specific and unique contexts of schools justify that type of small scale trialling…

MM: I think context is key in trials… Even outside of tech… Approaches like peer learning have great evidence… But the actual implementation can make a big difference… But you have to weigh up whether your context is as unique as you think…

DC: That can also be an excuse… Having been involved in procurement in tech… You don’t throw tech about… You think about what the context is, do serious homework before spending the money… You need the strategy and change management to roll things out and sustaining the effort… That’s almost invariably absent in the school context… Quite haphazard… “everyone’s unique… Let’s just play with this stuff”

Comment (from the director of a start-up using augmented reality to encourage primary aged girls into STEM subjects): In terms of costs and being a governor… Start-ups are obsessed with evidence. One of the best things you can do is work with start-ups; they really want that evidence… If you are worried about costs you can trial things… But it is a risk when you are teaching… You were also talking about jobs that don’t exist at the moment… That means new jobs in new fields… One thing that strikes me this evening is that no one has talked about science, technology, arts and maths… And teachers don’t come in from that route into schools… We’ve been talking to Jim Knight. In primary schools you don’t get labs, but you can use AR to do experiments… to look in this area… My point is you’ve been talking about technology, is it worth it… It would have been great to hear from someone with positive experiences, or an EdTech company… This feels like a lot of slamming down of technology…

JG: Can I talk about positive experiences… Technology is life changing and amazing… Removing technology from classrooms would be horrendous… Your point about not having enough well qualified science teachers is an important one…

DC: I am not sure about AR and VR… I’d be careful with some of these things… HoloLens isn’t there yet… Leading edge tech is a bit of a honeytrap… I raise VR as it’s on every phone… and free…

Commenter: AR is on phones… !

KH: Thank you for a really lively discussion!

And with that the rather spirited discussions came to an end! Some interesting things to consider, but I felt like there was so much that wasn’t discussed properly because of the direction the conversation took – issues like access to wifi; measures to make technology safe to use – and what they mean for information literacy; technology beyond devices… So, I’d love to hear your comments below on Ed Tech in schools.

Jun 16 2017

It’s the final day of the IIPC/RESAW conference in London. See my day one and day two posts for more information on this. I’m back in the main track today and, as usual, these are live notes so comments, additions, corrections, etc. are all welcome.

Collection development panel (Chair: Nicola Bingham)

James R. Jacobs, Pamela M. Graham & Kris Kasianovitz: What’s in your web archive? Subject specialist strategies for collection development

We’ve been archiving the web for many years but the need for web archiving really hit home for me in 2013 when NASA took down every one of their technical reports – for review on various grounds. And the web archiving community was very concerned. Michael Nelson said in a post “NASA information is too important to be left on nasa.gov computers”. And I wrote about when we rely on pointing not archiving.

So, as we planned for this panel we looked back on previous IIPC events and we didn’t see a lot about collection curation. We posed three topics all around these areas. So for each theme we’ll watch a brief screen cast by Kris to introduce them…

  1. Collection development and roles

Kris (via video): I wanted to talk about my role as a subject specialist and how collection development fits into that. As a subject specialist that is a core part of the role, and I use various tools to develop the collection. I see web archiving as absolutely being part of this. Our collection is books, journals, audio visual content, quantitative and qualitative data sets… Web archives are just another piece of the pie. And when we develop our collection we are looking at what is needed now but in anticipation of what will be needed 10 or 20 years in the future, building a solid historical record that will persist in collections. And we think about how our archives fit into the bigger context of other archives around the country and around the world.

For the two web archives I work on – CA.gov and the Bay Area Governments archives – I am the primary person engaged in planning, collecting, describing and making available that content. And when you look at the web capture life cycle you need to ensure the subject specialist is included and their role understood and valued.

The CA.gov archive involves a group from several organisations including the government library. We have been archiving since 2007 in the California Digital Library initially. We moved into Archive-It in 2013.

The Bay Area Governments archives includes materials on 9 counties, but primarily and comprehensively focused on two key counties here. We bring in regional governments and special districts where policy making for these areas occur.

Archiving these collections has been incredibly useful for understanding government, their processes, how to work with government agencies and the dissemination of this work. But as the sole responsible person that is not ideal. We have had really good technical support from Internet Archive around scoping rules, problems with crawls, thinking about writing regular expressions, how to understand and manage what we see from crawls. We’ve also benefitted from working with our colleague Nicholas Taylor here at Stanford who wrote a great QA report which has helped us.

We are heavily reliant on crawlers, on tools and technologies created by you and others, to gather information for our archive. And since most subject selectors have pretty big portfolios of work – outreach, instruction, as well as collection development – we have to have good ties to developers, and to the wider community with whom we can share ideas and questions is really vital.

Pamela: I’m going to talk about two Columbia archives, the Human Rights Web Archive (HRWA) and Historic Preservation and Urban Planning. I’d like to echo Kris’ comments about the importance of subject specialists. The Historic Preservation and Urban Planning archive is led by our architecture subject specialist and we’d reached a point where we had to collect web materials to continue that archive – and she’s done a great job of bringing that together. Human Rights seems to have long been networked – using the idea of the “internet” long before the web and hypertext. We work closely with Alex Thurman, and have an additional specially supported web curator, but there are many more ways to collaborate and work together.

James: I will also reflect on my experience. The FDLP – Federal Depository Library Program – involves libraries receiving absolutely every government publication in order to ensure a comprehensive archive. There is a wider programme allowing selective collection. At Stanford we are 85% selective – we only weed out content (after five years) very lightly, and usually flyers etc. As a librarian I curate content. As an FDLP library we have to think of our collection as part of the wider set of archives, and I like that.

As archivists we also have to understand provenance… How do we do that with the web archive? And at this point I have to shout out to Jefferson Bailey and colleagues for the “End of Term” collection – archiving all government sites at the end of government terms. This year has been the most expansive, and the most collaborative – including FTP and social media. And, due to the Trump administration’s hostility to science and technology, we’ve had huge support – proposals of seed sites, data capture events etc.

2. Collection Development approaches to web archives, perspectives from subject specialists

As subject specialists we all have to engage in collection development – there are no vendors in this space…

Kris: Looking again at the two government archives I work on, there are Depository Program Statuses to act as a starting point… But these haven’t been updated for the web. However, this is really a continuation of the print collection programme. And web archiving actually lets us collect more – we are no longer reliant on agencies putting content into the Depository Program.

So, for CA.gov we really treat this as a domain collection. And no-one is really doing this except some UCs, myself, and the state library and archives – not the other depository libraries. However, we don’t collect think tanks, or the not-for-profit players that influence policy – this is for clarity, although this content provides important context.

We also had to think about granularity… For instance, for California transport there is a top level domain and sub domains for each regional transport group, and so we treat all of these as seeds.

Scoping rules matter a great deal, partly as our resources are not unlimited. We have been fortunate that with the CA.gov archive that we have about 3TB space for this year, and have been able to utilise it all… We may not need all of that going forwards, but it has been useful to have that much space.
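
Scoping rules of the kind Kris describes are often expressed as URL patterns or regular expressions applied to every candidate URL a crawler discovers. As a rough illustration only – the patterns and the `in_scope` helper below are invented, not taken from the actual CA.gov configuration:

```python
import re

# Hypothetical scope rules for a state-government domain crawl:
# accept anything under *.ca.gov, but skip calendar pages and very
# deep paginated listings that would waste crawl budget.
ACCEPT = re.compile(r"^https?://([a-z0-9-]+\.)*ca\.gov(/|$)")
REJECT = [
    re.compile(r"/calendar/"),
    re.compile(r"[?&]page=\d{3,}"),  # page 100+ of a listing
]

def in_scope(url: str) -> bool:
    """Return True if a candidate URL should be queued for crawling."""
    if not ACCEPT.match(url):
        return False
    return not any(r.search(url) for r in REJECT)

print(in_scope("https://dot.ca.gov/programs/transit"))  # True
print(in_scope("https://example.com/ca.gov"))           # False
print(in_scope("https://dot.ca.gov/calendar/2017"))     # False
```

In practice Archive-It and Heritrix have their own scoping configuration; the point is just that each rule is a pattern tested against every discovered URL, which is why getting the expressions right matters so much for crawl budget.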

Pamela: Much of what Kris has said reflects our experience at Columbia. Our web archiving strengths mirror many of our other collection strengths and indeed I think web archiving is this important bridge from print to fully digital. I spent some time talking with our librarian (Chris) recently, and she will add sites as they come up in discussion, she monitors the news for sites that could be seeds for our collection… She is very integrated in her approach to this work.

For the human rights work one of the challenges is the time that we have to contribute. And this is a truly interdisciplinary area with unclear boundaries, and those are both challenging aspects. We do look at subject guides and other practice to improve and develop our collections. And each fall we sponsor about two dozen human rights scholars to visit and engage, and that feeds into what we collect… The other thing that I hope to do in the future is to do more assessment, looking at more authoritative lists in order to compare with other places… Colleagues look at a site called Idealist which lists opportunities and funding in these types of spaces. We also try to capture sites that look more vulnerable – small activist groups – although it is not clear if they actually are that at risk.
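
The kind of assessment Pamela mentions – checking a collection’s seed list against an authoritative subject list – can start as a simple set comparison. A minimal sketch, with invented seed URLs:

```python
# Hypothetical assessment: compare our archive's seed list against an
# authoritative subject list to find gaps and distinctive coverage.
our_seeds = {
    "https://www.hrw.org/",
    "https://www.amnesty.org/",
    "https://smallactivistgroup.example.org/",
}
authoritative_list = {
    "https://www.hrw.org/",
    "https://www.amnesty.org/",
    "https://www.omct.org/",
}

gaps = sorted(authoritative_list - our_seeds)        # candidate seeds to add
local_only = sorted(our_seeds - authoritative_list)  # sites only we collect

print("Candidate seeds to review:", gaps)
print("Distinctive local coverage:", local_only)
```

Anything in `gaps` becomes a candidate seed to review, while `local_only` can highlight the collection’s distinctive coverage (or, equally, the vulnerable sites no one else is capturing).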

Cost wise, the expensive parts of collecting are the human effort to catalogue, and the permission process that is part of collecting. And yesterday’s discussion raised the possible need for ethics groups as part of the permissions process.

In the web archiving space we have to be clearer on scope and boundaries as there is such a big, almost limitless, set of materials to pick from. But otherwise plenty of parallels.

James: For me the material we collect is in the public domain so permissions are not part of my challenge here. But there are other aspects to my work, including LOCKSS. In the case of the Fugitive US Agencies Collection we take entire sites (e.g. CBO, GAO, EPA) plus sites at risk (e.g. Census, Current Industrial Reports). These “fugitive” agency publications should be in the depository programme but are not, and those lost documents that fail to make it out are what this collection is about. When a library notes a lost document I will share that on the Lost Docs Project blog, and then I am also able to collect and seed the cloud and web archive – using the WordPress Amber plugin – for links. For instance, the CBO report on the health bill, aka Trump Care, was missing… In fact many CBO publications were missing, so I have added it as a seed for our Archive-It collection.

3. Discovery and use of web archives

Discovery and use of web archives is becoming increasingly important as we look for needles in ever larger haystacks. So, firstly, over to Kris:

Kris: One way we get archives out there is in our catalogue, and into WorldCat. That’s one place to help other libraries know what we are collecting, and how to find and understand it… So I would be interested to do some work with users around what they want to find and how… I suspect it will be about a specific request – e.g. a city council in one place over a ten year period… But they won’t be looking for a web archive per se… We have to think about that, and what kind of intermediaries are needed to make that work… Can we also provide better seed lists and documentation for this? In the social sciences we have the codebook, and I think we need to share the equivalent information for web archives, to expose documentation on how the archive was built… and to link seeds to other parts of collections.

One other thing we have to think about is how we process and document the ingest mechanism. We are trying to do this for CA.gov to better describe what we do… But maybe there is a standard way to produce that sort of documentation – like the Codebook…

Pamela: Very quickly… At Columbia we catalogue individual sites. We also have a customised portal for the Human Rights collection. That has facets for “search as research”, so you can search and develop and learn by working through facets – that’s often more useful than item searches… And, in terms of collecting for the web, we do have to think of what we collect as data for analysis as part of larger data sets…

James: In the interests of time we have to wrap up, but there was one comment I wanted to make, which is that there are tools we use but also gaps that we see for subject specialists [see slide]… And Andrew’s comments about the catalogue struck home with me…


Q1) Can you expand on that issue of the catalogue?

A1) Yes, I think we have to see web archives both as bulk data AND collections as collections. We have to be able to pull out the documents and reports – the traditional materials – and combine them with other material in the catalogue… So it is exciting to think about that, about the workflow… And about web archives working into the normal library work flows…

Q2) Pamela, you commented about a permissions framework as possibly vital for IRB considerations for web research… Is that from conversations with your IRB, or speculative?

A2) That came from Matt Webber’s comment yesterday on IRB becoming more concerned about web archive-based research. We have been looking for faster processes… But I am always very aware of the ethical concern… People do wonder about ethics and permissions when they see the archive… Interesting to see how we can navigate these challenges going forward…

Q3) Do you use LCSH and are there any issues?

A3) Yes, we do use LCSH for some items and the collections… Luckily someone from our metadata team worked with me. He used Dublin Core, with LCSH within that. He hasn’t indicated issues. Government documents in the US (and at state level) typically use LCSH so no, no issues that I’m aware of.

Plenary (Macmillan Hall): Posters with lightning talks (Chair: Olga Holownia)

Olga: I know you will be disappointed that it is the last day of Web Archiving Week! Maybe next year it should be Web Archiving Month… And then a year!

So, we have lightning talks that go with posters that you can explore during the break, and speak to the presenters as well.

Tommi Jauhiainen, Heidi Jauhiainen, & Petteri Veikkolainen: Language identification for creating national web archives

Petteri: I am web archivist at the National Library of Finland. But this is really about Tommi’s PhD research on native Finno-Ugric languages and the internet. This work began in 2013 as part of the Kone Foundation Language Programme. It gathers texts in small languages on the web… They had to identify that content in order to capture it.

We extracted the web links on Finnish web pages, and also crawled Russian, Estonian, Swedish, and Norwegian domains for these languages, using HeLI and Heritrix. We used the list of Finnish URLs in the archive, rather than transferring the WARC files directly. HeLI is the Helsinki language identification method, one of the best in the world. It can be found on GitHub – and can be used for your language as well! The full service will be out next year, but you can ask for HeLI if you want it earlier.

Martin Klein: Robust links – a proposed solution to reference rot in scholarly communication

I work at Los Alamos, I have two short talks and both are work with my boss Herbert Van de Sompel, who I’m sure you’ll be aware of.

So, the problem robust links address is that links break and referenced content changes. It is hard to ensure the author’s intention is honoured. So, you write a paper last year, point to the EPA, and the link this year doesn’t work…

So, there are two ways to do this… You can create a snapshot of a referenced resource… with Perma.cc, Internet Archive, Archive.is, WebCite. That’s great… But the citation people use is then the URI of the archived copy… Sometimes the original URI is included… But what if the URI-M is a copy elsewhere – archive.is or the no longer present mummy.it?

So, second approach: decorate your links by referencing the URI of the snapshot, the datetime of archiving, and the resource’s original URI. That makes your link more robust, meaning you can find the live version. The original URI allows finding captures in all web archives. The capture datetime lets you identify when/what version of the site is used.

How do you do this? With HTML5 link decoration, alongside the href attribute (data-original and data-versiondate attributes). We described this in a D-Lib article, along with some JavaScript that makes the decoration actionable!
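A decorated link of this kind can be sketched in a few lines (an illustration, not the speakers’ code; the attribute names here follow the published Robust Links proposal, and the URIs are made up):

```python
def robust_link(original_uri: str, memento_uri: str, version_date: str, text: str) -> str:
    """Build an HTML anchor decorated per the Robust Links proposal:
    href points at the archived copy, data-originalurl records the
    original URI, and data-versiondate the datetime of archiving."""
    return (
        f'<a href="{memento_uri}" '
        f'data-originalurl="{original_uri}" '
        f'data-versiondate="{version_date}">{text}</a>'
    )

# Illustrative example (URIs are made up):
link = robust_link(
    "https://www.epa.gov/report",
    "https://web.archive.org/web/20160615000000/https://www.epa.gov/report",
    "2016-06-15",
    "EPA report",
)
```

With both attributes present, a reader (or script) can fall back from the archived copy to the original URI, or look for other captures near the recorded datetime in any archive.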

So, come talk to me upstairs about this!

Herbert Van de Sompel, Michael L. Nelson, Lyudmila Balakireva, Martin Klein, Shawn M. Jones & Harihar Shankar: Uniform access to raw mementos

Martin: Hello, it’s still me, I’m still from Los Alamos! But this is a more collaborative project…

The problem here… Most web archives augment their mementos with custom banners and links… So, in the Internet Archive there is a banner from them, and a pointer on links to a copy in the archive. There are lots of reasons – legal, convenience… BUT that enhancement doesn’t represent the website at the time of capture… As a researcher those enhancements are detrimental, as you have to rewrite links again.

For us and our Memento Reconstruct, and other replay systems that’s a challenge. Also makes it harder to check the veracity of content.

Currently some systems do support this… OpenWayback and pywb allow it – you can add {datetime}im_/URI-R to the archive URL, for instance. But that is quite dependent on the individual archive.

So, we propose using the Prefer Header in HTTP Request…

Option 1: Request header sent against Time Gate

Option 2: Request header sent against Memento
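Concretely, the archive-specific im_ convention can be built as below, while the proposed Prefer header would make the same request archive-independent (a sketch; the archive URL is illustrative, and the Prefer token is taken from the proposal under discussion, not a settled standard):

```python
def raw_memento_url(archive_prefix: str, datetime14: str, uri_r: str) -> str:
    """Build a raw-capture URL using the {datetime}im_/URI-R convention
    supported by OpenWayback and pywb (archive-specific)."""
    return f"{archive_prefix}/{datetime14}im_/{uri_r}"

url = raw_memento_url(
    "https://web.archive.org/web", "20170612000000", "http://example.com/"
)

# Proposed alternative: send an HTTP Prefer header against the TimeGate
# or Memento instead (token name per the proposal, not standardised):
headers = {"Prefer": "original-content"}
```

The point of the Prefer approach is that a client would not need to know each archive’s URL rewriting convention; the same request header would work everywhere.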

So come talk to us… Both versions work, I have a preference, Ilya has a different preference, so it should be interesting!

Sumitra Duncan: NYARC discovery: Promoting integrated access to web archive collections

NYARC is a consortium formed in 2006 from research libraries at the Brooklyn Museum, The Frick Collection and the Museum of Modern Art. There is a two year Mellon grant to implement the program. And there are 10 collections in Archive-It devoted to scholarly art resources – including artist websites, gallery sites, catalogues, lists of lost and looted art. There is a seed list of 3900+ sites.

To put this in place we asked for proof of concept discovery sites – we only had two submitted. We selected Primo from Ex Libris. This brings in materials using the OpenSearch API. The set up does also let us pull in other archives if we want to. And you can choose whether to include the web archive (or not). The access points are through MARC records and full record search, and are in both the catalogue and WorldCat. We don’t, however, have faceted results for the web archive as that’s not in the API.

And recently, after discussion with Martin, we integrated Memento into the archive, which lets users explore all captured content with Memento Time Travel.

In the future we will be doing usability testing of the discovery interface, promoting use of the web archive collections, and encouraging use in new digital art projects.

Find NYARC’s Archive-It Collections: www.nywarc.org/webarchive. Documentation at http://wiki.nyarc.??

João Gomes: Arquivo.pt

Olga: Many of you will be aware of Arquivo. We couldn’t go to Lisbon to mark the 10th anniversary of the Portuguese web archive, but we welcome Joao to talk about it.

Joao: We have had ten years of preserving the Portuguese web, collaborating, researching and getting closer to our researchers, and ten years celebrating a lot.

Hello I am Joao Gomes, the head of Arquivo.pt. We are celebrating ten years of our archive. We are having our national event in November – you are all invited to attend and party a lot!

But what about the next 10 years? We want to be one of the best archives in the world… With improvements to full text search, and launching new services – like image searching and high quality archiving services. Launching an annual prize for research projects over the Arquivo.pt. And at the same time increasing our collection and user community.

So, thank you to all in this community who have supported us since 2007. And long live Arquivo.pt!

Changing records for scholarship & legal use cases (Chair: Alex Thurman)

Martin Klein & Herbert Van de Sompel: Using the Memento framework to assess content drift in scholarly communication

This project is to address both link rot and content drift – as I mentioned earlier in my lightning talk. I talked about link rot there; content drift is where the URI persists but the content there changes, perhaps out of all recognition, so that what I cite is not reproducible.

You may or may not have seen this, but there was a Supreme Court case referencing a website, and someone thought it would be really funny to purchase that domain and put up a very custom 404 error. But you can also see pages that change between submission and publication. By contrast, if you look at arXiv for instance you see an example of a page with no change over 20 years!

This matters partly as we reference URIs increasingly, hugely so since 2008.

So, some of this I talked about three years ago when I introduced the Hiberlink project, a collaborative project with the University of Edinburgh where we coined the term “reference rot”. This issue is a threat to the integrity of the web-based scholarly record. Resources do not have the same sense of fixity as, e.g., a journal article. And custodianship is also not as long term – custodians are not always as interested.

We wrote about link rot in PLoS One. But now we want to focus on content drift. We published a new article on this in PLoS One a few months ago. This is actually based on the same corpus – the entirety of arXiv, of PubMed Central, and also over 2 million articles from Elsevier. This covered publications from January 1997 to December 2012. We only looked at URIs for non-scholarly resources – not the DOIs but the blog posts, the Wikipedia pages, etc. We ended up with a total of around 1 million URIs for these corpora. And we also kept the publication date of the article with our data.

So, what is our approach for assessing content drift? We take the publication date of the URI as t. Then we try to find a Memento pre-dating the referenced URI (t-1) and a Memento post-dating it (t+1). Two-thirds of the URIs we looked at have this pair across archives. So now we do text analysis, looking at textual similarity between t-1 and t+1. We use computed normalised scores (values 0 to 100) for:

  • simhash
  • Jaccard – sets of character changes
  • Sorensen-Dice
  • Cosine – contextual changes
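As an illustration of this kind of normalised scoring, here are Jaccard and Sørensen-Dice over character shingles (a sketch; the shingle size and the 0–100 scaling are my assumptions, not necessarily the parameters the authors used):

```python
def shingles(text: str, k: int = 4) -> set:
    """Character k-shingles of a string (overlapping substrings of length k)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets, scaled 0-100."""
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)

def sorensen_dice(a: set, b: set) -> float:
    """Sørensen-Dice coefficient of two shingle sets, scaled 0-100."""
    if not a and not b:
        return 100.0
    return 100.0 * 2 * len(a & b) / (len(a) + len(b))
```

Running both measures on the t-1 and t+1 captures of the same URI then gives the kind of 0–100 agreement scores the talk describes; a perfect 100 on every measure marks the pair as representative.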

So we defined a Memento as representative if it gets a perfect score across all four measures. And we did some sanity checks too, via HTTP headers – E-Tag and Last-Modified being the same are a good measure. And that sanity check passed: 98.88% of those Mementos were representative.

Out of the 650k pairs we found, about 313k URIs have representative Mementos. There wasn’t any big difference across the three collections.

Now, with these 313k links, over 200k had a live site. And that allowed us to analyse and compare the live and archived versions. We used those same four measures to check similarity. Those vary so we aggregate. And we find that 23.7% of URIs have not drifted. But that means that over 75% have drifted and may not be representative of author intent.

In our work 25% of the most recent papers we looked at (2012) have not drifted at all. That gets worse going back in time, as is intuitive. Again, the differences across the corpora aren’t huge. PMC isn’t quite the same – as there were fewer articles initially. But the trend is common… In Elsevier’s 1997 works only 5% of content has not drifted.

So, take aways:

  1. Scholarly articles increasingly contain URI references to web-at-large resources
  2. Such resources are subject to reference rot (link rot and content drift)
  3. Custodians of these resources are typically not overly concerned with archiving of their content and the longevity of the scholarly record
  4. Spoiler: Robust links are one way to address this at the outset.


Q1) Have you had any thought on site redesigns where human readable content may not have changed, but pages have.

A1) Yes. We used those four measures to address that… We strip out all of the HTML and formatting. Cosine ignores very minor “and” vs. “or” changes, for instance.

Q1) What about Safari readability mode?

A1) No. We used something like Beautiful Soup to strip out code. Of course you could also do visual analysis to compare pages.

Q2) You are systematically underestimating the problem… You are looking at publication date… It will have been submitted earlier – generally 6-12 months.

A2) Absolutely. For the sake of the experiment it’s the best we can do… Ideally you’d be as close as possible to the authoring process… When published, as you say, it may already have drifted…

Q3) A comment and a question… 

Preprints versus publication… 

A3) No, we didn’t look explicitly at pre-prints. In arXiv those are…

The URIs in articles in Elsevier seem to rot more than those in arXiv.org articles… We think that could be because Elsevier articles tend to reference more .coms whereas arXiv references more .org URIs but we need more work to explore that…

Nicholas Taylor: Understanding legal use cases for web archives

I am going to talk about the use of web archives in litigation. Out of scope here are the areas of preservation of web citations; terms of service and API agreements for social media collection; copyright; right to be forgotten.

So, why web archives? Well, it’s where the content is. In some cases social media may only be available in web archives. Courts do now accept web archive evidence. The earliest IAWM (Internet Archive Wayback Machine) evidence was accepted as early as 2004. Litigants routinely challenge this evidence, but courts often accept IAWM evidence – commonly through affidavit or testimony, through judicial notice, sometimes through expert testimony.

The IA have affidavit guidance and they suggest asking the court to ensure they will accept that evidence, making that the issue for the courts not the IA. And interpretation is down to the parties in the case. There is also information on how the IAWM works.

Why should we care about this? Well legal professionals are our users too. Often we have unique historical data. And we can help courts and juries correctly interpret web archive evidence leading to more informed outcomes. Other opportunities may be to broaden the community of practice by bringing in legal technology professionals. And this is also part of mainstreaming web archives.

Why might we hesitate here? Well, typically cases serve private interests rather than public goods. There is an immature open source software culture for legal technology. And market solutions for web and social media archiving in this context do already exist.

Use cases for web archiving in litigation mainly have to do with information on individual webpages at a point in time; information on individual webpages over a period of time; persistence of navigational paths over a period of time. And types of cases include civil litigation and intellectual property cases (which have a separate court in the US). I haven’t seen any criminal cases using the archive but that doesn’t mean it doesn’t exist.

Where archives are used there is a focus on authentication and validity of the record. Telewizja Polska USA Inc v. Echostar Video Inc. (2004) saw the parties arguing over the evidence, but the court accepted it. In Specht v. Google Inc (2010) the evidence was not admissible as it had not come through the affidavit rule.

Another important rule in the US context is judicial notice (FRE 201), which allows a fact to be entered into evidence. Archives have been used in this context – for instance Martins v 3PD, Inc (2013) and Pond Guy, Inc. v. Aguascape Designs (2011). And in Tompkins v 23andme, Inc (2014) both parties used IAWM screenshots, and the court went out and found further screenshots that countered both of these to an extent.

Expert testimony (FRE 702) has included Khoday v Symantec Corp et al (2015), where the expert on navigational paths was queried but the court approved that testimony.

In terms of reliability factors, things raised as concerns include the IAWM disclaimer, incompleteness, provenance, and temporal coherence. I have not seen any examples on discreteness, temporal coherence with HTTP headers, etc.

Nassar v Nassar (2017) was a defamation case where the IAWM disclaimer saw the court not accept evidence from the archive.

Stabile v. Paul Smith Ltd. (2015) saw incomplete archives used, with the court acknowledging but accepting the relevance of what was entered.

Marten Transport Ltd v Plattform Advertising Inc. (2016) also involved incomplete archives, with discussion of banners and ads, but the court understood that IAWM does account for some of this. Objections have included issues with crawlers, and concern that a human/witness wasn’t directly involved in capturing the pages. The literature includes different perceptions of incompleteness. We also have issues of live site “leakage” via AJAX – where new ads leaked into archived pages…

Temporal coherence can be complicated. Web archive captures can include mementos that are embedded and archived at different points in time, so that the composite does not totally make sense.

The Memento Time Travel service shows you temporal coherence. See also Scott Ainsworth’s work. That kind of visualisation can help courts to understand temporal coherence. Other datetime estimation strategies include “Carbon Dating” (and constituent services), comparing X-Archive-Orig-last-modified with the Memento datetime, etc.

Interpreting datetimes is complicated, and of great importance in legal cases. These can be interpreted from the static datetime of text in an archived page, the Memento datetime, the headers, etc.

In Servicenow, Inc. v Hewlett-Packard Co. (2015), a patent case where things must be published a year prior to count as “prior art”, the archive showed an earlier date than other documentation.

In terms of IAWM provenance… Cases have questioned this. Sources for IAWM include a range of different crawls, but what does that mean for reliable provenance? There are other archives out there too, but I haven’t seen evidence of these being used in court yet. Canonicality is also an interesting issue… Personalisation of content served to the archival agent is an unanswered question. What about client artifacts?

So, what’s next? If we want to better serve legal and research use cases, then we need to surface more provenance information, and to improve interfaces to understand temporal coherence and make volatile aspects visible…

So, some questions for you,

  1. why else might we care, or not, about legal use cases?
  2. what other reliability factors are relevant?
    1. What is the relative importance of different reliability factors?
    2. For what use cases are different reliability factors relevant?


Q1) Should we save WhoIs data alongside web archives?

A1) I haven’t seen that use case but it does provide context and provenance information

Q2) Is the legal status of IA relevant – it’s not a publicly funded archive. What about security certificates or similar to show that this is from the archive and unchanged?

A2) To the first question, courts have typically been more accepting of web evidence from .gov websites. They treat that as reliable or official. Not sure if that means they are more inclined to use it… On the security side, there were some really interesting issues raised by Ilya and Jack. As courts become more concerned, they may increasingly look for those signs. But there may be more of those concerns…

Q3) I work with one of those commercial providers… A lot of lawyers want to be able to submit WARCs captured by Webrecorder or similar to courts.

A3) The legal system is very document-centric… Much of their data coming in is PDF, and that does raise those temporal issues.

Q3) Yes, but they do also want to render WARC, to bring that in to their tools…

Q4) Did you observe any provenance work outside the archive – developers, GitHub commits… Stuff beyond the WARC?

A4) I didn’t see examples of that… Maybe has to do with… These cases often go back a way… Sites created earlier…

Anastasia Aizman & Matt Phillips: Instruments for web archive comparison in Perma.cc

Matt: We are here to talk to you about some web archiving work we are doing. We are from the Harvard innovation lab. We have learnt so much from what you are doing, thank you so much. Perma.cc is creating tools to help you cite stuff on the web, to capture the WARC, organises those things…

We got started on this work when examining documents from the Supreme Court corpus, 1996 to present. We saw that Zittrain et al, in the Harvard Law Review, found more than 70% of references had rotted. So we wanted to build tools to help with that…

Anastasia: So, we have some questions…

  1. How do we know a website has changed?
  2. How do we know which changes are important?

So, what is a website made of? There are a lot of different resources: a Washington Post article, say, will have perhaps 90 components. Some are visual, some are hidden… So, again, how can we tell if the site has changed, and whether the change is significant? And how do you convey that to the user?

In 1997, Andrei Broder wrote about syntactic clustering of the web. In that work he looked at every site on the world wide web. Things have changed a great deal since then… Websites are more dynamic now, and we need more ways to compare pages…

Matt: So we have three types of comparison…

  • image comparison – we flatten the page down… If we compare two shots of Hacker News a few minutes apart there is a lot of similarity, but difference too… So we create a third image showing/highlighting the differences and can see where those changes are…

Why do image comparison? It’s kind of a dumb way to understand difference… but it’s a mental model the human brain can take in. The HCI is pretty simple here – users regularly experience that sort of layering – and we are talking general web users here. And it’s easy to have images on hand.
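The third, difference-highlighting image can be illustrated on toy grayscale pixel arrays (a sketch only; Perma’s actual pipeline works on full page screenshots and is more involved):

```python
def diff_mask(img_a, img_b, threshold: int = 10):
    """Given two equal-size grayscale images as 2D lists of 0-255 ints,
    return a mask marking pixels whose values differ by more than
    `threshold` (255 = changed, 0 = unchanged)."""
    return [
        [255 if abs(pa - pb) > threshold else 0 for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]
```

Overlaying such a mask on one of the screenshots produces exactly the kind of “third image” described above, with the threshold suppressing minor rendering noise.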

So, sometimes it works well… Here’s an example… A silly one… A post that is the same but we have a cup of coffee with and without coffee in the mug, and small text differences. Comparisons like this work well…

But it works less well where we see banner ads on webpages and they change all the time… But what does that mean for the content? How do we fix that? We need more fidelity, we need more depth.

Anastasia: So we need another way to compare… Looking at a Washington Post page from 2016 and 2017… Here we can see what has been deleted, and we can see what has been added… And the tagline of the paper itself has changed in this case.

The pros of this highlighting approach are that it’s in use in lots of places, and it’s intuitive… BUT it has to ignore invisible-to-the-user tags. And it is kind of stupid… With two totally different headlines, both saying “Supreme Court”, it sees similarity where there is none.

So what about other similarity measures… ? Maybe a score would be nice, rather than an overlay highlighting change. So, for that we are looking at:

  • Jaccard Coefficient (MinHash) – this is essentially like applying a Venn diagram to two archives.
  • Hamming distance (SimHash) – this hashes content into strings of 1s and 0s and figures out where the differences are – the difference ratio.
  • Sequence Matcher (Baseline/Truth) – this looks for sequences of words… It is good but hard to use as it is slow.
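These three measures can be sketched as follows; the SimHash here is a minimal illustration (MD5 token hashing is my choice, not necessarily Perma’s), and the slow-but-accurate baseline uses Python’s stdlib difflib:

```python
import hashlib
from difflib import SequenceMatcher

def simhash(text: str, bits: int = 64) -> int:
    """Minimal SimHash: hash each token, sum +/-1 per bit position,
    and keep the sign of each sum as the fingerprint bit."""
    weights = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def sequence_similarity(a: str, b: str) -> float:
    """Slow-but-accurate baseline: longest-matching-subsequence ratio."""
    return SequenceMatcher(None, a, b).ratio()
```

Near-duplicate pages get fingerprints a small Hamming distance apart, which is what makes SimHash cheap to compare at archive scale, while SequenceMatcher serves as the ground-truth check on smaller samples.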

So, we took Washington Post archives (2000+) and resources (12,000) and looked at SimHash – big gaps. MinHash was much closer…

When we can calculate those changes… does it matter? If it’s ads, do you care? Some people will. Human eyes are needed…

Matt: So, how do we convey this information to the user… Right now in Perma we have a banner, we have highlighting, or you can choose image view. And you can see changes highlighted in “File Changes” panel on top left hand side of the screen. You can click to view a breakdown of where those changes are and what they mean… You can get to an HTML diff (via Javascript).

So, those are our three measures sitting in our Perma container..

Anastasia: So future work – coming soon – will look at weighted importance. We’d love your ideas on what is important – is HTML more important than text? We want a command line (CLI) tool as well. And then we want to look at a similarity measure for images – there is other research on this out there, and we need to look at that. We want a “Paranoia” heuristic – to see EVERY change, but with a tickbox to show only the important changes. And we need to work together!

Finally we’d like to thank you, and our colleagues at Harvard who support this work.


Q1) Nerdy questions… How tightly bound are these similarity measures to the Perma.cc tool?

A1 – Anastasia) Not at all – should be able to use on command line

A1 – Matt) Perma is a Python Django stack and it’s super open source so you should be able to use this.

Comment) This looks super awesome and I want to use it!

Matt) These are really our first steps into this… So we welcome questions, comments, discussion. Come connect with us.

Anastasia) There is so much more work we have coming up that I’m excited about… Cutting up website to see importance of components… Also any work on resources here…

Q2) Do you primarily serve legal scholars? What about litigation stuff Nicholas talked about?

A2) We are in the law school but Perma is open to all. The litigation stuff is interesting..

A2 – Anastasia) It is a multi purpose school and others are using it. We are based in the law school but we are spreading to other places!

Q3) Thank you… There were HTML comparison tools that existed… But they go away and then we have nothing. A CLI will be really useful… And a service comparing any two URLs would be useful… Maybe worth looking at work on Memento damage – missing elements and their impact on the page (CSS, colour, alignment, missing images, etc.) and relative importance. How do you highlight invisible changes?

A3 – Anastasia) This is really the complexity of this… And of the UI… Showing the users the changes… Many of our users are not from a technical background… Educating by showing changes is one way. The list with the measures is just very simple… But if a hyperlink has changed, that potentially is more important… So, do we organise the list to indicate importance? Or do we calculate that another way? We welcome ideas about that.

Q3) We have a service running in Memento showing scores on various levels that shows some of that, which may be useful.

Q4) So, a researcher has a copy of what they were looking at… Can other people look at their copy? So, researchers can use this tool as proof that it is what they cited… Can links be shared?

A4 – Matt) Absolutely. We have a way to do that from the Bluebook. Some folks make these private but that’s super, super rare…

Understanding user needs (Chair: Nicola Bingham)

Peter Webster, Chris Fryer & Jennifer Lynch: Understanding the users of the Parliamentary Web Archive: a user research project

Chris: We are here to talk about some really exciting user needs work we’ve been doing. The Parliamentary Archives holds several million historical records relating to Parliament, dating from 1497. My role is to ensure that archive continues, in the form of digital records as well. One aspect of that is the Parliamentary Web Archive. This captures around 30 URLs – the official Parliamentary web sphere content from 2009. But we also capture official social media feeds – Twitter, Facebook and Instagram. This work is essential as it captures our relationship with the public. But we didn’t have a great idea of our users’ needs, and we wanted to find out more and understand what they use and what they need.

Peter: The objectives of the study were:

  • assess levels and patterns of use – what areas of the sites they are using, etc.
  • gauge levels of user understanding of the archive
  • understand the value of each kind of content in the web archive – to understand curation effort in the future.
  • test UI for fit with user needs – and how satisfied they were.
  • identify most favoured future developments – what directions should the archive head in next.

The research method was an analysis of usage data, then a survey questionnaire – and we threw lots of effort at engaging people in that. There were then 16 individual user observations, where we sat with the users, asked them to carry out tests and narrate their work. And then we had group workshops with parliamentary staff and public engagement staff, as well as four workshops with the external user community tailored to particular interests.

So we had a rich set of data from this. We identified important areas of the site. We also concluded that the archive and the relationship to the Parliament website, and that website itself, needed rethinking from the ground up.

So, what did we find of interest to this community?

Well, we found users are hard to find and engage – despite engaging the social media community – and staff similarly, not least as the internal workshop was just after the EU Ref. Users are largely ignorant about what web archives are – we asked about the UK Web Archive, the Government Archive, and the Parliamentary Archive… It appeared that survey respondents understood what these are, BUT in the workshops most were thinking about the online version of Hansard – a kind of archive, but not what was intended. We also found that users are not always sure what they’re doing – particularly when engaging in a live browser with snapshots of the site from previous dates, and that several snapshots might exist from different points in time. There were also some issues with understanding the Wayback Machine surround for the archived content – difficulty understanding what was content and what was the frame. There was a particular challenge around using URL search. People tried everything they could to avoid it… We asked them to find archived pages for the homepage of parliament.uk… And had many searches for “homepage” – there was a real lack of understanding of the browser and the search functionality. There is also no correlation between how well users did with the task and how well they felt they did. I take from that that a lack of feedback, requests, or issues does not mean there is not an issue.

Second group of findings… We struggled to find academic participants for this work. But our users prioritised in their own way. It became clear that users wanted discovery mechanisms that match their mental map – and actually the archive mapped more to an internal view of how parliament worked… And browsing taxonomies and structures didn’t work for them. That led to a card sorting exercise to rethink this. We also found users liked structures and wanted discovery based on entities: people, acts, publications – so search connected with that structure works well. Also users were very interested to engage in their own curation, tagging and folksonomy, make their own collections, share materials. Teachers particularly saw potential here.

So, what don’t users want? They have a variety of real needs, but they were less interested in derived data sets like link browse; I demonstrated data visualisation, including things like ngrams and work on WARCs; API access; take home data… No interest from them!

So, three general lessons coming out of this… If you are engaging in this sort of research, spend as much resource on it as possible. We need to cultivate the users we do know – they are hard to find but great when you find them. And remember the diversity of the groups of users you deal with…

Chris: So the picture Peter is painting is complex, and can feel quite disheartening. But his work has uncovered issues in some of our assumptions, and really highlights the needs of users in the public. We now have a much better understanding, so we can start to address these concerns.

What we’ve done internally is raise the profile of the Parliamentary Web Archive amongst colleagues. We got delayed with procurement… But we have a new provider (MirrorWeb) and they have really helped here too. So we are now in a good place to deliver a user-centred resource at: webarchive.parliament.uk.

We would love to keep the discussion going… Just not about #goatgate! (contact them on @C_Fryer and @pj_webster)


Q1) Do you think there will be tangible benefits for the service and/or the users, and how will you evidence that?

A1 – Chris) Yes. We are redeveloping the web archive, and as part of that we are looking at how we can connect the archive to the catalogue – that is all part of the new online services project. We have tangible results to work on… It’s early days, but we want to translate it into tangible benefits.

Q2) I imagine the parliament is a very conservative organisation that doesn’t delete content very often. Do you have a sense of what people come to the archive for?

A2 – Chris) Right now it is mainly people who are very aware of the archive, what it is and why it exists. But the research highlighted that many of the people less familiar with the archive wanted the archived versions of content on the live site, and the older content was more of interest.

A2 – Peter) One thing we did was to find out what the difference was between what was on the live website and what was on the archive… And looking ahead… The archive started in 2009… But demand seems to be quite consistent in terms of type of materials.

A2 – Chris) But it will take us time to develop and make use of this.

Q3) Can you say more about the interface and design… So interesting that they avoid the URL search.

A3 – Peter) The outsourced provider was Internet Memory Research… When you were in the archive there was an A-Z browse, a keyword search and a URL search. Above that, the parliament.uk site had a taxonomy that linked out, and that didn’t work. I asked users to use that browse and it was clear that their thought process directed them to the wrong places… So the recommendation was that it needs to be elsewhere, and more visible.

Q4) You were talking about users wanting to curate their own collections… Have you been considering setting up user dashboards to create and curate collections?

A4 – Chris) We are hoping to do that with our website and service, but it may take a while. But it’s a high priority for us.

Q5) I was interested to understand: the users that you selected for the survey… Were they connected before and part of the existing user base, or did you find them through your own efforts?

A5 – Peter) A bit of both… We knew more about those who took the survey, and they were the ones we had in the observations. But this was a self-selecting group, and they did have a particular interest in parliament.

Emily Maemura, Nicholas Worby, Christoph Becker & Ian Milligan: Origin stories: documentation for web archives provenance

Emily: We are going to talk about origin stories and it comes out of interest in web archives, provenance, trust. This has been a really collaborative project, and working with Ian Milligan from Toronto. So, we have been looking at two questions really: How are web archives made? How can we document or communicate this?

We wanted to look at the choices and decisions made in creating collections. We have been studying the creation of University of Toronto Libraries (UTL) Archive-It collections:

  • Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving.
  • Toronto 2015 Pan Am games (crawled regularly for one month; a one-off event)
  • Global Summitry Archive

So, thinking about web archives and how they are made we looked at the Web Archiving Life Cycle Model (Bragg et al 2013), which suggests a linear process… But the reality is messier… and iterative as test crawls are reviewed, feed into production crawls… But are also patched as part of QA work.

From this work, then, we have four things you should document for provenance:

  1. Scoping is iterative and regularly reviewed, and the data budget is a key part of this.
  2. The process of crawls is important to document, as the influence of live web content and actors can be unpredictable.
  3. There may be different considerations for access, choices for mode of access can impact discovery, and may be particularly well suited to particular users or use cases.
  4. The fourth thing is context, and the organisational or environmental factors that influence web archiving program – that context is important to understand those decision spaces and choices.
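The four documentation areas above could be captured in even a lightweight structured record per collection; a hypothetical sketch (the field names are illustrative, not the authors’ proposed schema):

```python
# Hypothetical provenance record for one collection, with one field
# group per documentation area named above. All values are invented.
provenance = {
    "scoping": {               # 1. iterative, regularly reviewed
        "seed_review_cycle": "quarterly",
        "data_budget_gb": 100,
    },
    "process": {               # 2. crawl process and surprises
        "crawl_notes": "robots.txt changes affected capture mid-campaign",
    },
    "access": {                # 3. modes of access shape discovery
        "modes": ["full-text search", "URL lookup"],
    },
    "context": {               # 4. organisational / environmental factors
        "policy": "collection development policy",
        "staff_changes": 4,
    },
}

print(sorted(provenance))
```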

Nick: So, in order to understand these collections we had to look at the organisational history of web archiving. For us web archiving began in 2005, and we piloted what became Archive-It in 2006. It was in a liminal state for about 8 years… There were few statements around collection development until last year, really, but the new policy talks about scoping, policy, permissions, etc.

So that transition towards a service is reflected in staffing. It is still a part-time commitment but it is written into several people’s job descriptions now; it is higher profile. But there are resourcing challenges around crawling platforms – the earliest archives had to be automatic; data budgets; storage limits. There are policies, permissions, robots.txt policy, access restrictions. And there is the legal context… Copyright law changed a lot in 2012… It started with permissions, then opt-outs, but now it’s take-down based…

Looking in turn at these collections:

Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving. Covers main parties and ever changing group of loosely defined interest groups. This was hard to understand as there were four changes of staff in the time period.

Toronto 2015 Pan Am games (crawled regularly for one month one-off event) – based around a discrete event.

Global Summitry Archive – this is a collaborative archive, developed by researchers. It is a hybrid and is an ongoing collection capturing specific events.

In terms of scoping we looked at motivation: whether a mandate, an identified need or use, or collaboration or coordination amongst institutions. These projects are shaped by technological budgets and limitations… In some cases we only really understand what’s taking place when we see crawling taking place. Researchers did think ahead but, for instance, video is excluded… There is no description of why text was prioritised over video or other storage. You can see evidence of a lack of explicit justifications for crawling particular sites… We have some information and detail, but it’s really useful to annotate content.

In the most recent elections the candidate sites had altered robots.txt… They weren’t trying to block us but the technology used and their measures against DDOS attacks had that effect.
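The effect described here can be reproduced with the standard library’s robots.txt parser. The rules below are an invented example of an anti-DDoS style configuration that whitelists one search crawler and, incidentally, shuts out archive crawlers:

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt of the kind anti-DDoS tooling often emits: it
# allows one named crawler and blocks everything else, which has the
# side effect of blocking archive crawlers too.
rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("archive.org_bot", "https://example-candidate.ca/"))  # False
print(rp.can_fetch("Googlebot", "https://example-candidate.ca/"))        # True
```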

In terms of access we needed metadata and indexes, but the metadata and how they are populated shapes how that happens. We need interfaces but also data formats and restrictions.

Emily: We tried to break out these interdependencies and interactions around what gets captured… Whether a site is captured is down to a mixture of organisational policies and permissions; legal context and copyright law for fair dealing, etc. The wider context elements also change over time… Including changes in staff, as well as changes in policy, in government, etc. This can all impact usage and clarity of how what is there came to be.

So, conclusions and future work… In telling the origin stories we rely on many different aspects, and it is very complex. We are working towards an extended paper. We believe a little documentation goes a long way… We have a proposal for structured documentation: goo.gl/CQwMt2


Q1) We did this exercise in the Netherlands… We needed to go further back in the history of our library… because in the ’90s we already collected interesting websites for clients – the first time we thought about the web as an important source… But there was a gap there between the main library work and the web archiving work…

Q2) I always struggle with what can be conveyed that is not in the archive… Sites not crawled, technical challenges, sites that it was decided early on not to crawl… That very initial thinking needs to be conveyed to pre-seed things… It’s hard to capture that…

A2 – Emily) There is so much in scoping that is before the seed list that gets into the crawl… Nick mentioned there are proposals for new collections that explains the thinking…

A2 – Nick) That’s about the best way to do it… We can capture pre-seeds and test crawls… but we need that “what should be in the collection” thinking.

A2 – Emily) The CPPP is actually based on a prior web list of suggested sites… Which should also have been archived.

Q3) In any kind of archive the same issues are hugely there… Decisions are rarely described… Though a whole area of post modern archive description around that… But a lot comes down to the creator of the collection. But I haven’t seen much work on what should be in the archive that is expected to be there… A different context I guess..

A3 – Emily) I’ve been reading a lot of post modern archive theory… It is challenging to document all of that, especially in a way that is useful for researchers… But have to be careful not to transfer over all those issues from the archive into the web archive…

Q4) You made the point that the liberal party candidate had blocked access to the Internet Archive crawler… That resonated for me as that’s happened a few times for our own collection… We have legal deposit legislation and that raises questions of whose responsibility it is to take that forward..

A4 – Nick) I found it fell to me… Once we got the right person on the phone it was an easy case to make – and it wasn’t one site but all the candidates for that party!

Q5) Have you had any positive or negative responses to opt-outs and take-downs?

A5 – Nick) We don’t host our own Wayback Machine, so we use their policy. We honour take-downs but get very, very few. Our communications team might have felt differently, but we had someone quite bullish in charge.

Nicola) As an institution there is a very variable appetite for risk – hard to communicate internally, let alone externally to our users.

Q6) In your research have you seen any web archive documenting themselves well? People we should follow? Or based mainly on your archives?

A6) It’s mainly based on our own archives… We haven’t done a comprehensive search of other archives’ documentation.

Jackie Dooley, Alexis Antracoli, Karen Stoll Farrell & Deborah Kempe: Developing web archiving metadata best practices to meet user needs

Alexis: We are going to present on the OCLC Research Library Partnership web archive working group. So, what was the problem? Well, web archives are not very easily discoverable in the ways people are used to discovering archives or library resources. This was the most widely shared issue across two OCLC surveys, and so a working group was formed.

At Princeton we use Archive-It, but you had to know we did that… It wasn’t in the catalogue, it wasn’t on the website… So you wouldn’t find it… Then we wanted to bring it into our discovery system, but that meant two different interfaces… So… if we take an example of one of our finding aids… We have the College Republican Records (2004-2016) – an on-campus group with websites… This was catalogued with DACS. But how to use the title and dates appropriately? Is the date the content, the seed, what?! And extent – documents, space, or…? We went for the number of websites, as that felt like something users would understand. We wrote Archive-It into the description… But we wanted guidelines…

So, the objective of this group was to develop best practices for web archiving descriptive metadata. We have undertaken a literature review, and looked at best practices for descriptive metadata across single and multiple sites.

Karen: For our literature review we looked at peer reviewed literature but also some other sources, and synthesised that. So, who are the end users of web archives… I was really pleased the UK Parliament work focused on public users, as the research tends to focus on academia. Where we can get some clarity on users is on their needs: to read specific web pages/site; data and text mining; technology development or systems analysis.

In terms of behaviours, Costa and Silva (2010) classify three groups, much cited by others: navigational, informational, and transactional.

Take-aways… A couple of things that we found – some beyond metadata… Raw data can be a high barrier, so users want accessible interfaces and unified searches, but they do want to engage directly with the metadata to understand the background and provenance of the data. We need to be thinking about flexible formats and engagement. To enable access we need re-use and rights statements. And we need to be very direct in indicating live versus archived material.

Users also want provenance: when and why was this created? They want context. They want to know the collection criteria and scope.

For metadata practitioners there are distinct approaches… archival and bibliographic approaches – RDA, MARC, Dublin Core, MODS, finding aids, DACS; Data elements vary widely, and change quite quickly.

Jackie: We analysed metadata standards and institutional guidelines; we evaluated existing metadata records in the wild… Our preparatory work raised a lot of questions about building a metadata description… Is the website creator/owner the publisher? Author? Subject? What is the title? Who is the host institution – and will it stay the same? Is it important to clearly state that the resource is a website (not a “web resource”)?

And what does the provenance actually refer to? We saw a lot of variety!

In terms of setting up the context, we have use cases for libraries, archives, research… Some comparisons between bibliographic and archival approaches to description; description of archived and live sites – mostly libraries catalogue live sites, not archived sites; and then you have different levels… Collection level, site level… And there might be document-level descriptions.

So, we wanted to establish data dictionary characteristics. We wanted something simple, not a major new cataloguing standard. So this is a lean 14-element standard, grounded in those cataloguing rules, so it can be part of wider systems. Among the categories, common elements are used for identification and discovery of all types of resources; other elements have to have clear applicability to the discovery of all types of resources. But some things aren’t included as they are not specific to web archives – e.g. audience.

So the 14 data elements are:

  • Access/rights*
  • Collector
  • Contributor*
  • Creator*
  • Date*
  • Description*…

Elements with asterisks are direct maps to Dublin Core fields.

So, Access Conditions (to be renamed “Rights”) is a direct mapping to Dublin Core “Rights”. This provides the circumstances that affect the availability and/or reuse of an archived website or collection – e.g. for Twitter. And it’s not just about rights, because so often we don’t actually know the rights, but we do know what can be done with the data.

Collector was the strangest element… There is no equivalent in Dublin Core… This is the organisation responsible for curation and stewardship of an archived website or collection. The only other place that uses Collector is the Internet Archive. We did consider “repository” but, while a repository may do all those things… for archived websites the site lives elsewhere – e.g. Princeton decides to collect those things.

We have a special case for Collector where Archive-It creates its own collection…
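The mappings described above can be summarised as a simple crosswalk; a sketch covering only the elements named in the talk (the `dcterms:` identifiers are standard Dublin Core, and `None` marks the absence of a direct equivalent):

```python
# Crosswalk for the elements named above; None marks no direct Dublin
# Core equivalent (Collector is the working group's own coinage, with
# only rough analogues such as DC Contributor or EAD <repository>).
dc_crosswalk = {
    "Access/rights": "dcterms:rights",
    "Collector": None,
    "Contributor": "dcterms:contributor",
    "Creator": "dcterms:creator",
    "Date": "dcterms:date",
    "Description": "dcterms:description",
}

# The starred elements are exactly those with a direct DC mapping.
direct = [k for k, v in dc_crosswalk.items() if v is not None]
print(direct)
```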

So, we have three publications on this work, due out in July…


Q1) I was a bit disappointed in the draft report – it wasn’t what I was expecting… We talked about complexities of provenance and wanted something better to convey that to researchers, and we have such detailed technical information we can draw from Archive-It.

A1 – Jackie) Our remit was about description, only. Provenance is bigger than that. Descriptive metadata was appropriate as scope. We did a third report on harvesting tools and whether metadata could be pulled from them… We should have had “descriptive” in our working group name too perhaps…

A1) It is maybe my fault too… But it’s that mapping to DACS that is not perfect… We are taking a different track at the University at Albany.

A1 – Jackie) This is NOT a standard; it addresses an absence of metadata that often exists for websites. Scalability of metadata creation is a real challenge… The average time available is 0.25 FTE. The provenance, the nuance of what was and was not crawled, is not doable at scale. This is intentionally lean. If you are using DACS then a lot of data goes straight in. All the standards, with the exception of Dublin Core, are more detailed…

Q2) How difficult is this to put into practice for MARC records? For us, we treat a website as a collector… You tend to describe the online publication… A lot of what we’d want to put in just can’t make it in…

A2 – Jackie) In MARC the 852 field is the closest to Collector that you can get. (Collector is comparable to Dublin Core’s Contributor; EAD’s <repository>; MARC’s 524, 852a and 852b; MODS’ location; or schema.org’s schema:OwnershipInfo.)

Researcher case studies (Chair: Alex Thurman)

Jane Winters: Moving into the mainstream: web archives in the press

This paper accompanies my article for the first issue of Internet Histories. I’ll be talking about the increasing visibility of web archives and much greater public knowledge of web archives.

So, who are the audiences for web archives? Well, they include researchers in the arts, humanities and social sciences – my area, and where some tough barriers are. They are also policymakers, particularly crucial in relation to legal deposit and access. Also the “general public” – though it is really many publics. And journalists, as mediators with the public.

What has changed with media? Well, there was an initial focus on technology, which reached an audience predisposed to it. But increasingly web archives come into discussion of politics and current affairs, and there are also social and cultural concerns starting to emerge. There is real interest around launches and anniversaries – a great way for web archives to get attention, like the Easter Rising archive we heard about this week. We do also get that “digital dark age” klaxon, which web archives can and do address. And with Brexit and Trump there is a silver lining… a real interest in archives as a result.

So in 2013 Niels Brügger arranged the first RESAW meeting in Aarhus. And at that time we had one of these big media moments…

Computer Weekly, 12th November 2013, reported on the Conservatives erasing official records of speeches from the Internet Archive as a serious breach. Coverage in computing media migrated swiftly to coverage in the mainstream press: the Guardian’s election coverage, BBC News… The hook was that a number of those speeches were about the importance of the internet to open public debate… That hook, that narrative, was obviously lovely for the media. Interestingly, the Conservatives then responded that many of those speeches were actually still available in the BL’s UK Web Archive. The speeches also made Channel 4 News – who used it as a hook to talk about broken promises.

Another lovely example was Dr Anat Ben-David from the Open University who got involved with BBC Click on restoring the lost .yu domain. This didn’t come from us trying to get something in the news… They knew our work and we could then point them in the direction of really interesting research… We can all do this highlighting and signposting which is why events like this are so useful for getting to know each others’ work.

When you make the tabloids you know you’ve done well… In 2016 the BBC Food website faced closure as part of cuts. The Independent didn’t lead with this, but with how to find recipes when the website goes… They directed everyone to the Internet Archive – as it’s open (unlike the British Library). The UK Web Archive blog did post about this, explaining what they are collecting, and why they collect important cultural materials. The BBC actually back-pedalled… maintaining the pages, but not updating them. But that message got out that web archiving is for everyone… building it into people’s daily lives.

The launch of the UK Web Archive in 2013 was covered by the BBC (including the fact that it is not online). The 20th anniversary of the BnF archive had a lot of French press coverage. That’s a great hook as well. Then I mentioned that Digital Dark Age set of stories… Bloomberg had the subtitle “if you want to preserve something, print it” in 2016. We saw similar from the Royal Society. But generally journalists do know who to speak to – at the BL, or DPC, or IA – to counter that view… It can be a really positive story. Even a negative story can be used as a positive thing if you have that connection with journalists…

So this story – “Raiders of the Lost Web: if a Pulitzer-finalist 34-part series can disappear from the web, anything can” – looks like it will be that sort of story again… But actually it is about the forensic reconstruction of the work. The article also talks about cinema at risk, again preserved thanks to the Internet Archive. The piece of journalism that had been “lost” was about the death of 23 children in a bus crash… It was lost twice – it wasn’t widely reported at the time, then the story disappeared… But the longer article talks about that case and the importance of web archiving as a whole.

Talking of traumatic incidents… Brexit: coverage of the NHS £350m per week saving claim on the Vote Leave website… It disappeared after the vote. BUT you can use the Internet Archive, and the structured referendum collection from the UK Legal Deposit libraries, so the promises are retained for the long term…

And finally, on to Trump! In an Independent article on Melania Trump’s website disappearing, the journalist treats the Internet Archive as another source, a way to track change over time…

And indeed all of the coverage of the IA in the last year, and their mirror site in Canada – that isn’t niche news, that’s mainstream coverage now. The more stories there are on data disappearing or being removed, the more opportunities web archives have to make their work clear to the world.


Q1) A fantastic talk and close to my heart as I try to communicate web archives. I think that web archives have fame when they get into fiction… The BBC series New Tricks had a denouement centred on finding a record on the Internet Archive… Are there other fictional representations of web archives?

A1) A really interesting suggestion! Tweet us both if you’ve seen that…

Q2) That coverage is great…

A2) Yes, being held to account is a risk… But that is a particular product of our time… Hopefully it will become clear that it is evidence for any set of politicians… The users may be partisan, even if the content isn’t… It’s a hard line to tread… Non-publicly-available archives mitigate that… But it’s absolutely a concern.

Q3) It is a big win when there are big press mentions… What happens… Is it more people aware of the tools, or specifically journalists using them?

A3) It’s both, but I think it’s how news travels… More people will read an article in the Guardian than will look at the BL website. But these stories really demonstrate the value and importance of the archive. You want – like the BBC recipe website’s 100k petition – that public support. We ran a workshop here on a random Saturday recently… It was pitched as tracing family or local history… And a couple were delighted to find their church community website from 15 years ago… It was that easy to show the value of the archive that way… We did a gaming event with late 1980s games in the IA… That’s brilliant – a kid’s birthday party was going to be inspired by that, a fab use we hadn’t thought of… But journalism is often the easy win…

Q4) Political press and journalistic use is often central… But I love that GifCities project… The nostalgia of the web… The historicity… That use… Highlighting the datedness of old web design is great… The way we can associate archives with a web vernacular that is not evidenced elsewhere is valuable and awesome… Leveraging that should be kept in mind.

A4) The GifCities always gets a “Wow” – it’s a great way to engage people in a teaching setting… Then lead them onto harder real history stuff..!

Q5) Last year when we celebrated the anniversary I had a chance to speak with journalists. They were intrigued that we collect blogs, forums, stuff that is off the radar… And they titled the article “Maybe your Skyblog is being archived in France” (Skyblog is a popular teen blog platform)… But what about not being able to forget the stupid things you wrote on the internet when you were 15…

A5) We’ve had three sessions so far, and only once did that question arise… Maybe people aren’t thinking like that. It’s more of an issue for a public archive… less of a worry for a closed archive… But so much of the embarrassing stuff is in Facebook, so not in the archive. It matters especially with right-to-be-forgotten legislation… But there is also that thing of having written something worth archiving…

Q6) The thing about The Crossing is interesting… Their font was copyrighted… They had to get specific permission from the designer… But that site is in Flash… And soon you’ll need Ilya Kreymer’s old web tools to see it at all.

A6) Absolutely. That’s a really fascinating article and they had to work to revive and play that content…

Q6) And six years old! Only six years!

Cynthia Joyce: Keyword ‘Katrina’: a deep dive through Hurricane Katrina’s unsearchable archive

I’ll be talking about how I use web archives, rather than engaging with the technology directly. I was a journalist for 20 years before teaching journalism, which I do at the University of Mississippi. Every year we take a study group to New Orleans to look at the outcome of Katrina. Katrina was 12 years ago. But there has been a lot of gentrification, so there are few physical scars there… It was weird to have to explain how hard things were to my 18-year-old students. I wanted to bring that to life… But not just through the news coverage, shown as an anniversary or update piece… The story is not a discrete event, but an era…

I found the best way to capture that era was through blogging. New Orleans was not a tech-savvy space; it was a poor, black city with high levels of illiteracy. Web 1.0 had skipped New Orleans and the Deep South in a lot of ways… It was pre-Twitter, Facebook was in its infancy, mobiles were primitive. Katrina was probably when many in New Orleans started texting – doable on struggling networks. There was also that Digital Divide – it’s out of trend to talk about this, but it is a real gap.

So, 80% of the city flooded, more than 800 people died, 70% of residents were displaced. The storm didn’t cause the problems here, it was the flooding and the failure of the levees. That is an important distinction, as that sparked the rage, the activism, the need for action was about the sense of being lied to and left behind.

I was working as a journalist for Salon.com from 1995 – very much web 1.0. I was an editor at Nola.com post-Katrina. And I was a resident of New Orleans 2001-2007. We had questions of what to do with comments, follow-ups, retention of content… A lot of content didn’t need preserving… But actually that set of comments should be the shame of Advance Digital and Condé Nast… It was interesting how little help they provided to Nola.com, one of their client papers…

I was conducting research as a citizen, but with journalistic principles and approaches… My method was madness, basically… I had instincts, stories to follow, high points, themes that had been missed in mainstream media. I interviewed a lot of people… I followed and used a cross-list of blogrolls… This was a lot of surfing, not just searching…

The Wayback Machine helped me so much there – to see those blogrolls, to see those pages… That idea of the vernacular, drilling down 10 years later, was very helpful and interesting… To experience it again… To go through, to see common experiences… I also did social media posts and call-outs – an affirmative action approach. African American people were on camera, but there was not a lot of first-person documentation… I posted something on Binders Full of Women Writers… I searched more than 300 blogs. I chose the entries… I did it for them… I picked out moving, provocative, profound content… Then let them opt out, or suggest something else… It was an ongoing dialogue with 70 people crowd-curating a collective diary. New Orleans Press produced a physical book, and I sent it to Jefferson and IA created a special collection for this.

In terms of choosing themes… The original TOC was based on categories that organically emerged… It’s not all sad, it’s often dark humour…

  • Forever days
  • An accounting
  • Led Astray (pets)
  • Re-entry
  • Kindness of Strangers
  • Indecision
  • Elsewhere = not New Orleans
  • Saute Pans of Mercy (food)
  • Guyville

Guyville, for instance… For months no schools were open, so it was a really male space, and then so much construction… But some women there thought that was great too. A really specific culture and space.

Some challenges… Some work was journalists writing off the record. We got permissions where we could – we have them for all of the people who survived.

I just want to talk about Josh Cousin, a former resident of the St Bernard projects. His nickname was the “Bookman” – he was an unusual, nerdy kid and was 18 when Katrina hit. They stayed… but were forced to leave eventually… It was very sad… They were forced onto a bus, not told where they were going; they took their dog… Someone on the bus complained, and Cheddar was turfed onto the highway… They got taken to Houston. The first post Josh posted was a defiant “I made it” type post… He first had online access when he was at the Astrodome. They had online machines that no-one was using… But he was… And he started getting mail, shoes, stuff in the post… He was training people to use these machines. This kid is a hero… At the book launch for contributors he brought Cheddar the dog… found through Petfinder… He had been adopted by a couple in Connecticut who had renamed him “George Michael” – they tried to make the family pay $3,000 as they didn’t want their dog going back to New Orleans…

In terms of other documentary evidence… material is all PDF only… The email record of Michael D. Brown… shows he’s concerned about dog sitting… And he later criticised people for not evacuating because of their pets… Two weeks later his emails do talk about pets… There were obviously other things going on… But this narrative, this diary of that time… really brings the reality to life.

I was in a newsroom during the Arab Spring… And that’s when they had no option but to run what was on Twitter; it was hard to verify, but it was there and no journalists could get in. And I think Katrina was that kind of moment for blogging…

On Archive-It you can find the Katrina collection… Ranging from resistance and suspicion to gratitude… Some people barely remembered writing stuff, certainly didn’t expect it to be archived. I was collecting 8-9 years later… I was reassured to read that a historian at the Holocaust museum (in the Chronicle of Higher Ed) wasn’t convinced about blogging, until Trump said something stupid and that triggered her to engage.


Q1 – David) In 2002 the LOCKSS program had a meeting with subject specialists at NY Public Library… And among those that were deemed worth preserving was The Exquisite Corpse. That was published out of New Orleans. After Katrina we were able to give Andrei Codrescu back his materials and that carried on publishing until 2015… A good news story of archiving from that time.

A1) There are dozens of examples… The thing that I found too is that there is no appointed steward… If there is no institutional support it can be passed around, forgotten… I’d get excited then realise just one person was the advocate, rather than an institution to preserve it for posterity.

Andrei wrote some amazing things, and captured that mood in the early days of the storm…

Q2) I love how your work shows blending of work and sources and web archives in conversation with each other… I have a mundane question… Did you go through any human subjects approval for this work from your institution.

A2) I was an independent journalist at the time… But went to University of New Orleans as the publisher had done a really interesting project with community work… I went to ask them if this project already existed… And basically I ended up creating it… He said “are you pitching it?” and that’s where it came from. Naivety benefited me.

Q3) Did anyone opt out of this project, given the traumatic nature of this time and work?

A3) Yes, a lot of people… But I went to people who were kind of thought leaders here, who were likely to see the benefit of this… So, for instance Karen Gadbois had a blog called Squandered Heritage (now The Lens, the Pro Publica of New Orleans)… And participation of people like that helped build confidence and validity for the project.

Colin Post: The unending lives of net-based artworks: web archives, browser emulations, and new conceptual frameworks

Framing an artwork is never easy… Art objects are “lumps” of the physical world to be described… But what about net-based artworks? How do we make these objects of art history… And they raise questions of how we define an artwork in the first place… I will talk about Homework by Alexei Shulgin (http://www.easylife.org/homework/) as an example of where we need techniques and practices of web archiving around net-based artworks. I want to suggest a new conceptualisation of net-based artworks as plural, proliferating, heterogeneous archives. Homework is typical, and includes pop-ups and self-conscious elements that make it challenging to preserve…

So, this came from a real assignment for Natalie Bookchin’s course in 1997. Alexei Shulgin encouraged artists to turn in homework for grading, and did so himself… And his piece was a single sentence followed by pop-up messages – something we use differently today, has different significance… Pop-ups proliferate across the screen like spam, making the user aware of the browser and its affordances and role… Homework replicates structures of authority and expertise – grading, organising, critiques, including or excluding artists… But rendered absurd…

Homework was intended to be ephemeral… But Shulgin curates assignments turned in, and late assignments. It may be tempting to think of these net works as performance art, with records only of a particular moment in time. But actually this is a full record of the artwork… Homework has entered into archives as well as Shulgin’s own space. It is heterogeneous… All acting on the work. The nature of pop-up messages may have changed from the conditions of its original creation, and it is still changing in the world today.

Shulgin, in conversation with Armin Medosch in 1997, felt “The net at present has few possibilities for self expression but there is unlimited possibility for communication. But how can you record this communicative element, how can you store it?”. There are so many ways and artists but how to capture them… One answer is web archiving… There are at least 157 versions of Homework in the Internet Archive.. This is not comprehensive, but his own site is well archived… But capacity of connections is determined by incidence rather than choice… The crawler only caught some of these. But these are not discrete objects… The works on Shulgin’s site, the captures others have made, the websites that are still available, is one big object. This structure reflects the work itself, archival systems sustain and invigorate through the same infrastructure…

To return to the communicative elements… Archives do not capture the performative aspects of the piece. But we must also attend to the way the object has transformed over time… In order to engage with complex net-based artworks… These cannot be easily separated into “original” and “archived” but exist more as a continuum…

Frank Upward (1996) describes the Records Continuum Model… This is built around four dimensions: Creation, Capture, Organisation, and Pluralisation. All of these are present in the archive of Homework… As copies appear in the Internet Archive, in Rhizome… And spread out… You could describe this as the vitalisation of the artwork on the web…

oldweb.today at Rhizome is a way to emulate the browser… This provides some assurance of the retention of old websites… But that is not a direct representation of the original work… The context and experience can vary – including the (now) speedy load of pages… And possible changes in appearance… When I load Homework here… I see 28 captures all combined, from records over 10 years… The piece wasn’t uniformly archived at any one time… I view the whole piece but actually it is emulated and artificial… It is disintegrated and inauthentic… But in the continuum it is another continuous layer in space and time.

Niels Brügger in Website History (2010) talks about “writing the complex strategic situation in which an artefact is entangled”. Digital archives and emulators preserve Homework, but are in themselves generative… But that isn’t exclusive to web archiving… It is something we see in Eugène Viollet-le-Duc (1854/1996), who talks about re-establishing a work in a finished state that may never in fact have existed at any point in time.

Q1) a really interesting and important work, particularly around plurality. I research at Rhizome and we have worked with Net Art Anthology – an online exhibition with emulators… is this faithful… should we present a plural version of the work?

A1) I have been thinking about this a lot… But I don’t think Rhizome should have to do all of this… Art historians should do this contextual work too… Net Art Anthology does the convenient access work but art historians need to do the context work too.

Q1) I agree completely. For an art historian what provenance metadata should we provide for works like this to make it most useful… Give me a while and I’ll have a wish list… 

Comment) a shout out for Gent in Belgium is doing work on online art so I’ll connect you up.

Q2) Is Homework still an active interactive work?

A2) The final list was really in 1997 – only on IA now… It did end at this time… So experiencing the piece is about looking back… It is artefactual, or a trace. But Shulgin has past work on his page… sort of a capture and framing as archive.

Q3) How does Homework fit in your research?

A3) I’m interested in 90s art, preservation, and their interactions.

Q4) Have you seen that job of contextualisation done well, presented with the work? I’m thinking of Eli Harrison’s quantified self work and how different that looked at the time from now… 

A4) Rhizome does this well, galleries collecting net artists… especially with emulated works.. The guggenheim showed originals and emulated and part of that work was foregrounding the preservation and archiving aspects of the work. 

Closing remarks: Emmanuelle Bermès & Jane Winters

Emmanuelle: Thank you all for being here. This was three very intense days. Five days for those at Archives Unleashed. To close, a few comments on IIPC. We were originally to meet in Lisbon, and I must apologise again to Portuguese colleagues; we hope to meet again there… But co-locating with RESAW was brilliant – I saw a tweet that we are creating archives in the room next door to those who use and research them. And researchers are our co-creators.

And so many of our questions this week have been about truth and reliability and trust. This is a sign of growth and maturity of the groups. 

IIPC has had a tough year. We are still a young and fragile group… we have to transition to a strong worldwide community. We need all the voices and inputs to grow and to transform into something more resilient. We will have an annual meeting at an event in Ottawa later this year.

Finally thank you so much to Jane and colleagues from RESAW, and to Nicholas and the WARC committee, and Olga and the BL for getting this all together so well.

Jane: you were saying how good it has been to bring archivists and researchers together, to see how we can help and not just ask… A few things struck me: discussion of context and provenance; and at the other end permanence and longevity. 

We will have a special issue of Internet Histories so do email us 

Thank you to Niels Brügger and NetLab, The Coffin Trust who funded our reception last night, the RESAW Programme Committee, and the really important people – the events team at University of London, and Robert Kelly who did our wonderful promotional materials. And Olga, who has made this all possible.

And we do intend to have another RESAW conference in June in two years.

And thank you to Nicholas and Niels for representing IIPC, and to all of you for sharing your fantastic work.

And with that a very interesting week of web archiving comes to an end. Thank you all for welcoming me along!

Jun 15 2017

I am again at the IIPC WAC / RESAW Conference 2017 and today I am in the very busy technical strand at the British Library. See my Day One post for more on the event and on the HiberActive project, which is why I’m attending this very interesting event.

These notes are live so, as usual, comments, additions, corrections, etc. are very much welcomed.

Tools for web archives analysis & record extraction (chair Nicholas Taylor)

Digging documents out of the archived web – Andrew Jackson

This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…

  • staff looked in an outlook calendar for reminders
  • looked for new updates since last check
  • download each to local folder and open
  • check catalogue to avoid re-submitting
  • upload to internal submission portal
  • add essential metadata
  • submit for ingest
  • clean up local files
  • update stats sheet
  • Then ingest usually automated (but can require intervention)
  • Updates catalogue once complete
  • New catalogue records processed or enhanced as necessary.

It was very manual, and very inefficient… So we have created a harvester:

  • Setup: specify “watched targets” then…
  • Harvest (harvester crawl targets as usual) –> Ingested… but also…
  • Document extraction:
    • spot documents in the crawl
    • find landing page
    • extract machine-readable metadata
    • submit to W3ACT (curation tool) for review
  • Acquisition:
    • check document harvester for new publications
    • edit essential metadata
    • submit to catalogue
  • Cataloguing
    • cataloguing records processed as necessary

This is better but there are challenges. Firstly, what is a “publication”? With the eprints team there was a one-to-one print and digital relationship. But now, no more one-to-one. For example, gov.uk publications… An original report will have an ISBN… But that landing page is a representation of the publication, that’s where the assets are… When stuff is catalogued, what can frustrate technical folk is that you take date and text from the page – honouring what is there rather than normalising it… We can dishonour intent by how we capture the pages… It is challenging…

MARC is initially alarming… For a developer used to current data formats, it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we do use MARC, and where we want to be now…

One of the intentions of the metadata extraction work was to provide an initial guess at the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that the authors’ names etc. in the document metadata are rarely correct. We use the worst extractor first, and layer up so we have the best shot. What works best is extracting from the HTML. Gov.uk is a big and consistent publishing space so it’s worth us working on extracting that.

What works even better is the gov.uk API data – it’s in JSON, it’s easy to parse, it’s worth coding as it is a bigger publisher for us.
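To make that concrete, here is a minimal sketch of the kind of metadata guess you can pull straight out of a JSON record – the field names, record shape, and attachment URL below are illustrative assumptions, not the actual gov.uk API schema:

```python
import json

# A trimmed record in the general shape of a content-API response.
# All field names and the attachment URL are illustrative assumptions.
sample = json.loads("""
{"title": "Example report",
 "first_published_at": "2016-03-01T00:00:00Z",
 "document_type": "official_statistics",
 "details": {"attachments": [
     {"url": "https://example.gov.uk/report.pdf",
      "content_type": "application/pdf"}]}}
""")

def extract_catalogue_guess(doc):
    """Pull out the fields a cataloguer would want as a first guess."""
    attachments = doc.get("details", {}).get("attachments", [])
    return {
        "title": doc.get("title"),
        "published": doc.get("first_published_at"),
        "type": doc.get("document_type"),
        "pdfs": [a["url"] for a in attachments
                 if a.get("content_type") == "application/pdf"],
    }

print(extract_catalogue_guess(sample))
```

Compare this with scraping the same fields out of HTML: no extraction rules to maintain, just key lookups.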

But now we have to resolve references… Multiple use cases for “records about this record”:

  • publisher metadata
  • third party data sources (e.g. Wikipedia)
  • Our own annotations and catalogues
  • Revisit records

We can’t ignore the revisit records… Have to do a great big join at some point… To get best possible quality data for every single thing….

And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But… When I retry document extraction I can accidentally run up another chain each time… If we do our Solr searches correctly it should be easy so will be correcting this…

We do need to do more future experimentation.. Multiple workflows brings synchronisation problems. We need to ensure documents are accessible when discoverable. Need to be able to re-run automated extraction.

We want to iteratively improve automated metadata extraction:

  • improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
  • Bring together different sources
  • Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals)

And we still have that tension over what a publication is… A tension between established practice and publisher output. We need to trial different approaches with cataloguers and users… Close that whole loop.


Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…

A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.

Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert Van de Sompel and Michael Nelson about this thing of describing a document…

A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently….

Q2) Geoffrey Bilder also working on this…

A2) And that’s the ideal… To improve the standards more broadly…

Q3) Are these all PDF files?

A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…

Q4) What does the user see at the end of this… Is it a PDF?

A4) This work ends up in our search service, and that metadata helps them find what they are looking for…

Q4) Do they know its from the website, or don’t they care?

A4) Officially, the way the library thinks about monographs and serials, would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too..

Q4) For me as an archivist all that data on where the document is from, what issues in accessing it they were, etc. would extremely useful…

Q5) You spoke yesterday about engaging with machine learning… Can you say more?

A5) This is where I’d like to do more user work. The library is keen on subject headings – that’s a big high-level challenge so it’s quite amenable to machine learning. We have a massive golden data set… There’s at least a masters thesis in there, right! And if we built something, then ran it over the 3 million ish items with little metadata, that could be incredibly useful. In my opinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…

Comment) That thing of everything ending up as a PDF is on the way out by the way… You should look at Distill.pub – a new journal from Google and Y Combinator – and that’s the future of these sorts of formats, it’s JavaScript and GitHub. Can you collect it? Yes, you can. You can visit the page, switch off the network, and it still works… And it’s there and will update…

A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.

Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform

Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.

So, we have fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But, whilst these are amazing collections, they aren’t accessed or used much. I think this is mainly down to two issues: people don’t know they are there; and the access mechanisms don’t fit well with their practices. Maybe when the Archive-It API is live that will fix it all… Right now though it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting, most use the Archive-It service. But, if you are a researcher… to use web archives you really have to be interested and engaged, you need to be an expert.

So, building this portal is about making this easier to use… We want web archives to be used on page 150 in some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, Simon Fraser University – that represents about half of the archive in Canada.

We work on workflow… We run workshops… We separated the collections so that post-docs can look at them.

We are using Warcbase (warcbase.org) and command line tools. We transferred data from the Internet Archive, generate checksums, and generate scholarly derivatives – plain text, hypertext graph, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at visualisations of the Canadian political parties and political interest group web crawls, which track changes, although that may include crawler issues.
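The checksum step in a transfer pipeline like this is easy to sketch – a hedged example (the file here is a tiny stand-in, not a real WARC), streamed so multi-gigabyte WARCs never need to fit in memory:

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Demonstrate on a tiny stand-in for a transferred WARC file:
with tempfile.NamedTemporaryFile(delete=False, suffix=".warc.gz") as f:
    f.write(b"WARC/1.0\r\n")
    path = f.name
print(sha256_of(path))
os.remove(path)
```

Recording these digests at transfer time is what lets you later prove a WARC arrived intact and hasn’t silently changed on disk.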

Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.

Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tight-knit, and I like how we build and share and develop on each others work. Last year we presented webarchives.ca. We’ve indexed 10 TB of warcs since then, representing 200+ M Solr docs. We have grown from one collection and we have needed additional facets: institution; collection name; collection ID, etc.

Then we have also dealt with scaling issues… from a 30–40 GB to a 1 TB sized index. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on OpenStack… Right now it isn’t at production scale, but it’s getting there. We are looking at SolrCloud and potentially using a shard per collection.

Last year we had a Solr index using the Shine front end… It’s great but… it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is Blacklight, which is in wide use in libraries. There is a bigger community, better APIs, bug fixes, etc… So we have set up a prototype called Warclight. It does almost all that Shine does, except the tree structure and the advanced searching…

Ian spoke about derivative datasets… For each collection, via Blacklight or ScholarsPortal we want domain/URL Counts; Full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.

So, that goal Ian talked about: one central hub for archived data and derivatives…


Q1) Do you plan to make graphs interactive, by using Kibana rather than Gephi?

A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That was great but didn’t look great in the browser. But it would be great if the casual user could look at drag and drop R type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…

A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kibana for stuff so in due course we may bring that in…

Q2) Interesting as we are doing similar things at the BnF. We did use Shine, looked at Blacklight, but built our own thing…. But we are looking at what we can do… We are interested in that web archive discovery collections approaches, useful in other contexts too…

A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet..

Q2) We tried to give people WARC and WARC files… Our actual users didn’t want that, they want full text…

A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…

Q3) Do you think in a few years time…

A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next 3 years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder…

Q4) What are some of the organisational, admin and social challenges of building this?

A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging…. “is an institution going to devote a person to this?”

A4 – Ian) This is about making this more accessible… People are more used to Blacklight than Shine. People respond poorly to WARC. But they can deal with PDFs and CSVs, those are familiar formats…

A4 – Nick) And when I get back I’m going to be doing some work and sharing to enable an actual community to work on this..

Gregory Wiedeman: Automating access to web archives with APIs and ArchivesSpace

A little bit of context here… At University at Albany, SUNY, we are a public university with state records laws that require us to archive. This is consistent with traditional collecting. But we have no dedicated web archives staff – so no capacity for lots of manual work.

One thing I wanted to note is that web archives are records. Some have paper equivalent, or which were for many years (e.g. Undergraduate Bulletin). We also have things like word documents. And then we have things like University sports websites, some of which we do need to keep…

The seed isn’t a good place to manage these as records. But archives theory and practices adapt well to web archives – they are designed to scale, they document and maintain context, with relationship to other content, and a strong emphasis on being a history of records.

So, we are using DACS: Describing Archives: A Content Standard to describe archives, so why not use that for web archives? It focuses on intellectual content, ignorant of formats, and is designed for pragmatic access to archives. We also use ArchivesSpace – a modern tool for aggregated records that allows curators to add metadata about a collection. And it interleaves with our physical archives.

So, for any record in our collection… You can specify a subject… A Python script goes to look at our CDX, looks at numbers, schedules processes, and then as we crawl a collection records the extents and dates collected… And then it shows in our catalogue… So we have our paper records, our digital captures… Users can then find an item, and only then do they need to think about format and context. And, there is an awesome article by David Graves(?) which talks about how that aggregation encourages new discovery…
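A minimal sketch of the kind of CDX-reading script described above – the index lines are fabricated and real CDX files vary in field count, but the urlkey-then-timestamp prefix is the common layout:

```python
from collections import Counter

# Fabricated index lines in the common CDX layout, where the first two
# fields are the canonicalised urlkey and a 14-digit capture timestamp.
cdx_lines = """\
edu,albany)/athletics 20150301120000 http://www.albany.edu/athletics text/html 200 ABC123 5120
edu,albany)/athletics 20160301120000 http://www.albany.edu/athletics text/html 200 DEF456 6150
edu,albany)/bulletin 20150901080000 http://www.albany.edu/bulletin text/html 200 GHI789 20480
""".splitlines()

def summarise(lines):
    """Count captures per urlkey and track first/last capture timestamps."""
    counts = Counter()
    spans = {}
    for line in lines:
        urlkey, timestamp = line.split()[:2]
        counts[urlkey] += 1
        lo, hi = spans.get(urlkey, (timestamp, timestamp))
        spans[urlkey] = (min(lo, timestamp), max(hi, timestamp))
    return counts, spans

counts, spans = summarise(cdx_lines)
print(counts["edu,albany)/athletics"], spans["edu,albany)/athletics"])
```

Those per-urlkey counts and date ranges are exactly the “extents and dates collected” that can be pushed into a catalogue record.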

Users need to understand where web archives come from. They need provenance to frame their research question – it adds weight to their research. So we need to capture what was attempted to be collected – collecting policies included. We have just started to do this with a statement on our website. We need a more standardised content source. This sort of information should be easy to use and comprehend, but it is hard to find the right format to do that.

We also need to capture what was collected. We are using the Archive-It Partner Data API, part of the Archive-It 5.0 system. That API captures:

  • type of crawl
  • unique ID
  • crawl result
  • crawl start, end time
  • recurrence
  • exact date, time, etc…

This looks like a big JSON file. Knowing what has been captured – and not captured – is really important to understand context. What can we do with this data? Well we can see what’s in our public access system, we can add metadata, we can present some start times, non-finish issues etc. on product pages. BUT… it doesn’t address issues at scale.
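To illustrate what you can do with that JSON once you have it – a small hedged sketch, with records fabricated to match the field list above rather than real Archive-It Partner Data API output – here is the “what was not captured” question in a few lines:

```python
import json

# Fabricated crawl records matching the field list above (type of crawl,
# unique ID, result, start/end time) – not real API output.
crawls = json.loads("""
[{"crawl_type": "scheduled", "id": 101, "result": "finished",
  "start": "2017-05-01T00:00:00Z", "end": "2017-05-01T06:00:00Z"},
 {"crawl_type": "test", "id": 102, "result": "aborted",
  "start": "2017-05-02T00:00:00Z", "end": null}]
""")

# Surface the crawls that did not finish – the gaps researchers care about:
problems = [c["id"] for c in crawls if c["result"] != "finished"]
print(problems)  # -> [102]
```

Flagging crawl IDs like these on access pages is one way to convey non-capture context without exposing the whole JSON dump.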

So, we are now working on a new open digital repository using the Hydra system – though not called that anymore! Possibly we will expose data in the API. We need standardised data structure that is independent of tools. And we also have a researcher education challenge – the archival description needs to be easy to use, re-share and understand.

Find our work – sample scripts, command line query tools – on Github:



Q1) Right now people describe collection intent, crawl targets… How could you standardise that?

A1) I don’t know… Need an intellectual definition of what a crawl is… And what the depth of a crawl is… They can produce very different results and WARC files… We need to articulate this in a way that is clear for others to understand…

Q1) Anything equivalent in the paper world?

A1) It is DACS but in the paper work we don’t get that granular… This is really specific data we weren’t really able to get before…

Q2) My impression is that ArchivesSpace isn’t built with discovery of archives in mind… What would help with that…

A2) I would actually put less emphasis on web archives… Long term you shouldn’t have all these things captured. We just need a good API access point really… I would rather it be modular I guess…

Q3) Really interesting… the definition of Archive-It, what’s in the crawl… And interesting to think about conveying what is in the crawl to researchers…

A3) From what I understand the Archive-It people are still working on this… With documentation to come. But we need granular way to do that… Researchers don’t care too much about the structure…. They don’t need all those counts but you need to convey some key issues, what the intellectual content is…

Comment) Looking ahead to the WASAPI presentation… Some steps towards vocabulary there might help you with this…

Comment) I also added that sort of issue for today’s panels – high level information on crawl or collection scope. Researchers want to know when crawlers don’t collect things, when to stop – usually to do with freak outs about what isn’t retained… But that idea of understanding absence really matters to researchers… It is really necessary to get some… There is a crapton of data in the partners API – most isn’t super interesting to researchers so some community effort to find 6 or 12 data points that can explain that crawl process/gaps etc…

A4) That issue of understanding users is really important, but also hard as it is difficult to understand who our users are…

Harvesting tools & strategies (Chair: Ian Milligan)

Jefferson Bailey: Who, what, when, where, why, WARC: new tools at the Internet Archive

Firstly, apologies for any repetition between yesterday and today… I will be talking about all sorts of updates…

So, WayBack Search… You can now search the WayBackMachine… Including keyword, host/domain search, etc. The index is built on inbound anchor text links to a homepage. It is pretty cool and it’s one way to access this content which is not URL based. We also wanted to look at domain and host routes into this… So, if you look at the page for, say, parliament.uk you can now see statistics and visualisations. And there is an API so you can make your own visualisations – for hosts or for domains.

We have done stat counts for specific domains or crawl jobs… The API is all in json so you can just parse this for, for example, how much of what is archived for a domain is in the form of PDFs.
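For instance, the PDF question might look like the sketch below – with the caveat that the JSON shape and field names here are assumptions for illustration, not the actual API response format:

```python
import json

# An illustrative per-domain stats response; the field names are
# assumptions for this sketch, not the actual Wayback API schema.
stats = json.loads("""
{"domain": "parliament.uk",
 "mime_counts": {"text/html": 820000,
                 "application/pdf": 160000,
                 "image/jpeg": 20000}}
""")

counts = stats["mime_counts"]
pdf_share = counts.get("application/pdf", 0) / sum(counts.values())
print(f"{pdf_share:.1%} of captures are PDFs")  # -> 16.0% of captures are PDFs
```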

We also now have search by format using the same idea, the anchor text, the file and URL path, and you can search for media assets. We don’t have exciting front end displays yet… But I can search for e.g. Puppy, mime type: video, 2014… And get lots of awesome puppy videos [the demo is the Puppy Bowl 2014!]. This media search is available for some of the WayBackMachine for some media types… And you can again present this in the format and display you’d like.

For search and profiling we have a new 14 column CDX including new language, simhash, sha256 fields. Language will help users find material in their local/native languages. The SIMHASH is pretty exciting… that allows you to see how much a page has changed. We have been using it on Archive It partners… And it is pretty good. For instance seeing government blog change month to month shows the (dis)similarity.
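The change-detection idea behind that simhash field can be sketched in a few lines: similar pages produce hashes that differ in only a few bits, so Hamming distance gives a cheap (dis)similarity score. The 64-bit hash values below are made up for illustration:

```python
def hamming(a, b):
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")

def similarity(a, b, bits=64):
    """Fraction of matching bits: 1.0 means identical simhashes."""
    return 1 - hamming(a, b) / bits

# Two made-up 64-bit simhash values for successive captures of a page;
# they differ in a single bit, i.e. the page barely changed.
march = 0x8F3A21C4D5E6F701
april = 0x8F3A21C4D5E6F741

print(hamming(march, april), round(similarity(march, april), 3))  # -> 1 0.984
```

This is also why the field could support deduplication, as raised in the questions: a distance of zero (or near zero) between a new capture and the previous one suggests a fresh copy may not be needed.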

For those that haven’t seen the capture tool – Brozzler is in production in Archive-It with three dozen organisations using it. This has also led to warcprox developments too. It was intended for AV and social media stuff. We have a chromium cluster… It won’t do domain harvesting, but it’s good for social media.

In terms of crawl quality assurance we are working with the Internet Memory Foundation to create quality tools. These are building on internal crawl priorities work at IA – crawler beans, comparison testing. And this is about quality at scale. And you can find reports on how we also did associated work on the WayBackMachine’s crawl quality. We are also looking at tools to monitor crawls for partners, trying to find large scale crawling quality issues as they happen… There aren’t great analytics… But there are domain-scale monitoring, domain-scale patch crawling, and Slack integrations.

For domain scale work, for patch crawling, we use WAT analysis for embeds and most-linked content. We rank by inbound links and add to the crawl. ArchiveSpark is a framework for cluster-based data extraction and derivation (WA+).

Although this is a technical presentation we are also doing an IMLS funded project to train public librarians in web archiving to preserve online local history and community memory, working with partners in various communities.

Other collaborations and research include our end of term web archive 2016/17 when the administration changes… No one is the official custodian for the .gov domain. And this year the widespread deletion of data has given this work greater profile than usual. This time the work was with IA, LOC, UNT, GWU, and others. 250+ TB of .gov/.mil as well as White House and Obama social media content.

There had already been discussion of the Partner Data API. We are currently re-building this, so come talk to me if you are interested. We are working with partners to make sure it is useful, makes sense, and is made more relevant.

We take a lot of WARC files from people to preserve… So we are looking at how we can do this with and for partners. We are developing a pipeline for automated WARC ingest from web services.

There will be more on WASAPI later, but this is part of work to ensure web archives are more accessible… And that uses API calls to connect up repositories.

We have also built a WAT API that allows you to query most of the metadata for a WARC file. You can feed it URLs, and get back what you want – except the page type.

We have new portals and searches now and coming. This is about putting new search layers on TLD content in the WayBackMachine… So you can pick media types, and just from one domain, and explore them all…

And with a statement on what archives should do – involving a gif of a centaur entering a rainbow room – that’s all… 


Q1) What are implications of new capabilities for headless browsing for Chrome for Brozzler…

A1 – audience) It changes how fast you can do things, not really what you can do…

Q2) What about HTTP POST for WASAPI?

A2) Yes, it will be in the Archive-It web application… We’ll change a flag and then you can go and do whatever… And there is reporting on the backend. It doesn’t usually affect crawl budgets, and it should be pretty automated… There is a UI… Right now we do a lot manually; the idea is to do it less manually…

Q3) What do you do with pages that don’t specify encoding… ?

A3) It doesn’t go into URL tokenisation… We would wipe character encoding in anchor text – it gets cleaned up before Elasticsearch.

Q4) The SIMHASH is before or after the capture? And can it be used for deduplication

A4) After capture before CDX writing – it is part of that process. Yes, it could be used for deduplication. Although we do already do URL deduplication… But we could compare to previous SIMHASH to work out if another copy is needed… We really were thinking about visualising change…

Q5) I’m really excited about WATS… What scale will it work on…

A5) The crawl is on 100 TB – we mostly use the existing WARC and JSON pipeline… It performs well on something large. But if there are a lot of URLs, it could be a lot to parse.

Q6) With quality analysis and improvement at scale, can you tell me more about this?

A6) We’ve given the IMF access to our own crawls… But so far we have been comparing our own crawls to our own crawls… Comparing to Archive-It is more interesting… And looking at the domain level… We need to share some similar-size crawls – BL and IA – and figure out how results look and differ. It won’t be content-based at that stage; it will be hotpads and URLs and things.

Michele C. Weigle, Michael L. Nelson, Mat Kelly & John Berlin: Archive what I see now – personal web archiving with WARCs

Mat: I will be describing tools here for web users. We want to enable individuals to create personal web archives in a self-contained way, without external services. Standard web archiving tools are difficult for non-IT experts, and “Save page as” is not suitable for web archiving. Why do this? It’s for people who don’t want to touch the command line, but also to ensure content is preserved that wouldn’t otherwise be. More archives are better.

It is also about creation and access, as both elements are important.

So, our goals involve advancing development of:

  • WARCreate – create WARC from what you see in your browser.
  • Web Archiving Integration Layer (WAIL)
  • Mink

WARCreate is… a Chrome browser extension to save WARC files from your browser; no credentials pass through third parties. It heavily leverages the Chrome webRequest API. But it was built in 2012, and APIs and libraries have evolved since, so we had to work on that. We also wanted three new modes for browser-based preservation: record mode – retain a buffer as you browse; countdown mode – preserve a reloading page on an interval; event mode – preserve a page when automatically reloaded.

So you simply click on the WARCreate button in the browser to generate WARC files – no technical expertise needed.
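WARCreate itself is a JavaScript browser extension, but as an illustration of what it produces, here is a stdlib-only Python sketch of a minimal WARC/1.0 response record – the essence of a "save this page" capture. Real tools add request records, payload digests, and more.

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(url: str, http_bytes: bytes) -> bytes:
    # minimal WARC/1.0 response record: named headers, blank line, payload;
    # Content-Length counts the HTTP response block only
    headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Target-URI: {url}",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ])
    return headers.encode() + b"\r\n\r\n" + http_bytes + b"\r\n\r\n"

record = warc_response_record(
    "http://example.com/",
    b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>")
```

In practice a library such as warcio handles this framing (and gzip compression) for you.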

Web Archiving Integration Layer (WAIL) is a stand-alone desktop application; it offers collection-based web archiving, and includes Heritrix for crawling, OpenWayback for replay, and Python scripts compiled to OS-native binaries (.app, .exe). One of the recent advancements was a new user interface. We ported from Python to Electron – using web technologies to create native apps. We also moved from a single archive to collection-based archiving, and ported from OpenWayback to pywb. And we started doing native Twitter integration – over time and by hashtags…

So, the original app was a tool to enter a URI and then get a notification. The new version is a little more complicated but provides that new collection-based interface. Right now both of these are out there… Eventually we’d like to merge functionality here. So, an example here, looking at the UK election as a collection… You can enter information, then crawl to within defined boundaries… You can kill processes, or restart an old one… And this process integrates with Heritrix to give status of a task here… And if you want to Archive Twitter you can enter a hashtag and interval, you can also do some additional filtering with keywords, etc. And then once running you’ll get notifications.

Mink… is a Google Chrome browser extension. It indicates the archival capture count as you browse, and quickly submits URIs to multiple archives from the UI. The name comes from Mink(owski) space. Our recent enhancements include adding the number of archived pages to the icon at the bottom of the page, allowing users to set preferences on how to view large sets of mementos, and communication with user-specified or local archives…
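Capture counts like Mink's can be derived from a Memento TimeMap (the RFC 7089 link-format list of captures that aggregators return). A hedged sketch of counting the mementos in such a response, using a hypothetical sample TimeMap:

```python
import re

# sample RFC 7089 link-format TimeMap (illustrative data, not a live response)
TIMEMAP = '''<http://example.com/>; rel="original",
<http://web.archive.org/web/timemap/link/http://example.com/>; rel="self",
<http://web.archive.org/web/20150101000000/http://example.com/>; rel="first memento"; datetime="Thu, 01 Jan 2015 00:00:00 GMT",
<http://web.archive.org/web/20170601000000/http://example.com/>; rel="memento"; datetime="Thu, 01 Jun 2017 00:00:00 GMT"'''

def count_mementos(timemap_text: str) -> int:
    # entries whose rel value includes the word "memento" are captures;
    # rel="original" and rel="self" entries are skipped
    return len(re.findall(r'rel="[^"]*\bmemento\b[^"]*"', timemap_text))
```

An extension would fetch the TimeMap for the current URI from its configured aggregator and badge the icon with this count.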

The old Mink interface could be affected by page CSS, as it sat in the DOM. So we have moved to the shadow DOM, making it more reliable and easy to use. And then you have more consistent, intuitive Miller columns for many captures. It’s an integration of the live and archived web, whilst you are viewing the live web. You can drill down by year, month, day, etc., refining to what you want to look at. And there is an icon in Mink to make a request to save the page now – with notification of status.

So, in terms of tool integration…. We want to ensure integration between Mink and WAIL so that Mink points to local archives. In the future we want to decouple Mink from external Memento aggregator – client-side customisable collection of archives instead.

See: http://bit.ly/iipcWAC2017 for tools and source code.


Q1) Do you see any qualitative difference in capture between WARCreate and Webrecorder?

A1) We capture the representation right at the moment you saw it… Not the full experience for others, but for you at a moment in time. And that’s our goal – what you last saw.

Q2) Who are your users, and do you have a sense of what they want?

A2) We have a lot of digital humanities scholars wanting to preserve Twitter and Facebook – the stream as it is now, exactly as they see it. So that’s a major use case for us.

Q3) You said it is watching as you browse… What happens if you don’t select a WARC?

A3) If you have hit record you could build up content as pages reload and are in that record mode… It will impact performance but you’ll have a better capture…

Q3) Just a suggestion but I often have 100 tabs open but only want to capture something once a week so I might want to kick it off only when I want to save it…

Q4) That real time capture/playback – are there cool communities you can see using this…

A4) Yes, I think with CNN coverage of a breaking storm allows you to see how that story evolves and changes…

Q5) Have you considered a mobile version for social media/web pages on my phone?

A5) Not currently supported… Chrome doesn’t support that… There is an app out there that lets you submit to archives, but not to create WARC… But there is a movement to making those types of things…

Q6) Personal archiving is interesting… But jailed in my laptop… great for personal content… But then can I share my WARC files with the wider community .

A6) That’s a good idea… And more captures is better… So there should be a way to aggregate these together… I am currently working on that, but you would need to be able to specify what is shared and what is not.

Q6) One challenge there is about organisations and what they will be comfortable with sharing/not sharing.

Lozana Rossenova and Ilya Kreymer, Rhizome: Containerised browsers and archive augmentation

Lozana: As you probably know, Webrecorder makes high-fidelity interactive recordings of any website you browse – and how you engage with it. And we have recently released a desktop app built on Electron.

Webrecorder is a worm’s-eye view of archiving, tracking how users actually move around the web… For instance, for Instagram and Twitter posts around #lovewins you can see the quality is high. Webrecorder uses symmetrical archiving – in the live browser and in a remote browser… And you can capture, then replay…

In terms of how we organise webrecorder: we have collections and sessions.

The thing I want to talk about today is on Remote browsers, and my work with Rhizome on internet art. And a lot of these works actually require old browser plugins and tools… So Webrecorder enables capture and replay even where technology no longer available.

To clarify: the programme says “containerised” but we now refer to this as “remote browsers” – still using Docker containers to run these various older browsers.

When you go to record a site you select the browser, and the site, and it begins the recording… The Java applet runs and shows you a visualisation of how it is being captured. You can do this with Flash as well… If we open the same multimedia in a normal (Chrome) browser, it isn’t working. Restoration is easier when it is just Flash; you need other approaches to capture Flash with other dependencies and interactions.

Remote browsers are really important for Rhizome work in general, as we use them to stage old artworks in new exhibitions.

Ilya: I will be showing some upcoming beta features, including ways to use Webrecorder to improve other archives…

Firstly, which other web archives? So I built a public web archives repository:


And with this work we are using WAM – the Web Archiving Manifest. And we have added WARC source URI and WARC creation date fields to the WARC header for the moment.

So, Jefferson already talked about patching – patching remote archives from the live web… Our approach patches either from the live web or from other archives, depending on what is available or missing. So, for instance, if I look at a Washington Post page in the archive from 2nd March… It shows how other archives are being patched in to deliver me a complete page… In the collection I have a thing called “patch” that captures this.

Once pages are patched, we introduce extraction… We are extracting again using remote archiving and automatic patching, so you combine the extraction and patching features. You create two patches and two WARC files. I’ll demo that as well… So, here’s a page from the CCA website and we can patch that… And then extract that… And then when we patch again we get the images, the richer content, a much better recording of the page. So we have two WARCs here – one from the British Library archive, one from the patching – that might be combined and used to enrich that partial UKWA capture.

Similarly we can look at a CNN page and take patches from e.g. the Portuguese archive. And once it is done we have a more complete archive… When we play this back you can display the page as it appeared, and patch files are available for archives to add to their copy.

So, this is all in beta right now but we hope to release it all in the near future…


Q1) Every web archive already has a temporal issue, where content may come from other dates than the page claims… But you could aggravate that problem. Have you considered this?

A1) Yes. There are timebounds for patching. And also around what you display to the user so they understand what they see… e.g. to patch only within the week or the month…

Q2) So it’s the closest date to what is in web recorder?

A2) The other sources are the closest successful result on/closest to the date from another site…
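The closest-date selection described in this answer can be sketched simply: given candidate captures from several archives, pick the one whose timestamp is nearest the requested date (the archive names below are illustrative).

```python
from datetime import datetime

def closest_capture(captures, target):
    # captures: (archive_name, capture_datetime) pairs from several archives;
    # choose the successful result nearest the requested date
    return min(captures, key=lambda c: abs((c[1] - target).total_seconds()))

captures = [
    ("arquivo.pt", datetime(2017, 2, 27)),
    ("webarchive.org.uk", datetime(2017, 3, 4)),
    ("web.archive.org", datetime(2016, 11, 1)),
]
best = closest_capture(captures, datetime(2017, 3, 2))
```

A time-bounded variant, as mentioned in A1, would first filter `captures` to a window (e.g. the same week or month) before taking the minimum.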

Q3) Rather than a fixed window for collection, seeing frequently of change might be useful to understand quality/relevance… But I think you are replaying

A3) Have you considered a headless browser… with the address bar…

A3 – Lozana) Actually for us the key use case is about highlighting and showcasing old art works to the users. It is really important to show the original page as it appeared – in the older browsers like Netscape etc.

Q4) This is incredibly exciting. But how difficult is the patching… What does it change?

A4) If you take a good capture and a static image is missing… Those are easy to patch in… If highly contextualised – like Facebook, that is difficult to do.

Q5) Can you do this in realtime… So you archive with Perma.cc then you want to patch something immediately…

A5) This will be in the new version I hope… So you can check other sources and fall back to other sources and scenarios…

Comment – Lozana) We have run UX work with an archiving organisation in Europe for cultural heritage, and their use case is that they use Archive-It and do QA the next day… A crawl might miss something highly dynamic, so they want to be able to patch it pretty quickly.

Ilya) If you have an archive that is not in the public archive list on Github please do submit it as a fork request and we’ll be able to add it…

Leveraging APIs (Chair: Nicholas Taylor)

Fernando Melo and Joao Nobre: Arquivo.pt API: enabling automatic analytics over historical web data

Fernando: We are a publicly available web archive, mainly of Portuguese websites from the .pt domain. So, what can you do with our API?

Well, we built our first image search using our API, for instance a way to explore Charlie Hebdo materials; another application enables you to explore information on Portuguese politicians.

We support the Memento protocol, and you can use the Memento API. We are one of the time gates for Time Travel searches. And we also have full-text search as well as URL search, through our OpenSearch API. We have extended our API to support temporal searches of the Portuguese web. Find this at: http://arquivo.pt/apis/opensearch/. Full-text search requests can be made through a URL query, e.g. http://arquivo.pt/opensearch?query=euro 2004 would search for mentions of euro 2004, and you can add parameters to this, or search as a phrase rather than keywords.

You can also search mime types – so just within PDFs for instance. And you can also run URL searches – e.g. all pages from the New York Times website… And if you provide time boundaries the search will look for the capture from the nearest date.
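A hedged sketch of building such OpenSearch queries. Only the `query` parameter is confirmed above; the `from`/`to`/`siteSearch` names are assumptions – check http://arquivo.pt/apis/opensearch/ for the authoritative parameter list.

```python
from urllib.parse import urlencode

def opensearch_url(query, phrase=False, frm=None, to=None, site=None):
    # build a full-text query against the Arquivo.pt OpenSearch endpoint;
    # quoting the query string turns a keyword search into a phrase search
    q = f'"{query}"' if phrase else query
    params = {"query": q}
    if frm:
        params["from"] = frm       # assumed name for the lower time bound
    if to:
        params["to"] = to          # assumed name for the upper time bound
    if site:
        params["siteSearch"] = site  # assumed name for per-site restriction
    return "http://arquivo.pt/opensearch?" + urlencode(params)
```

For example, `opensearch_url("euro 2004")` reproduces the keyword query given above, while `phrase=True` searches it as an exact phrase.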

Joao: I am going to talk about our image search API. This works based on keyword searches, you can include operators such as limiting to images from a particular site, to particular dates… Results are ordered by relevance, recency, or by type. You can also run advanced image searches, such as for icons, you can use quotation marks for names, or a phrase.

The request parameters include:

  • query
  • stamp – timestamp
  • start – first index of the search results
  • safeImage (yes; no; all) – restricts search to safe images only.

The response is returned in json with total results, URL, width, height, alt, score, timestamp, mime, thumbnail, nsfw, pageTitle fields.
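A sketch of calling the image search API and decoding a result. The endpoint path here is an assumption (see http://arquivo.pt/apis for the real one); the parameter names follow the list above, and the thumbnail decoding reflects the base64 encoding mentioned in the Q&A below.

```python
import base64
from urllib.parse import urlencode

def image_search_url(query, stamp=None, start=0, safe_image="yes"):
    # endpoint path "/imagesearch" is a guess; parameters match the API notes
    params = {"query": query, "start": start, "safeImage": safe_image}
    if stamp:
        params["stamp"] = stamp
    return "http://arquivo.pt/imagesearch?" + urlencode(params)

def decode_thumbnail(result: dict) -> bytes:
    # the "thumbnail" field of each JSON result is base64-encoded image data
    return base64.b64decode(result["thumbnail"])
```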

More on all of this: http://arquivo.pt/apis


Q1) How do you classify safe for work/not safe for work?

A1 – Fernando) This is a closed beta version. The safe/nsfw classification is based on a classifier built around a training set from Yahoo. We are not for blocking things, but we want to be able to exclude shocking images if needed.

Q1) We have this same issue in the GifCities project – we have a manually curated training set to handle that.

Comment) Maybe you need to have more options for that measure to provide levels of filtering…

Q2) With that json response, why did you include title and alt text…

A2) We process the image and extract, from the URL, the image text… So we capture the image and the alt text, but we thought that the page title would also be interesting, giving some sense of context. Maybe the text before/after would also be useful but that takes more time… We are trying to keep this working.

Q3) What is the thumbnail value?

A3) It is base64-encoded. But we can make that clearer in the next version…

Nicholas Taylor: Lots more LOCKSS for web archiving: boons from the LOCKSS software re-architecture

This is following on from the presentation myself and colleagues did at last year’s IIPC on APIs.

LOCKSS came about from a serials librarian and a computer scientist. They were thinking about emulating the best features of the system for preserving print journals, allowing libraries to conserve their traditional role as preserver. The LOCKSS boxes would sit in each library, collecting from publishers’ websites, providing redundancy, sharing with other libraries if and when a publication was no longer available.

18 years on this is a self-sustaining programme running out of Stanford, with tens of networks and hundreds of partners. Lots of copies isn’t exclusive to LOCKSS, but its decentralised replication model addresses the fact that long-term bit integrity is hard to solve: more (correlated) copies don’t necessarily keep things safe, and can leave content vulnerable to hackers. So this model is community approved, published on, and well established.

Last year we started re-architecting the LOCKSS software so that it becomes a series of web services. Why do this? Well, to reduce support and operation costs – taking advantage of other software from the web and web archiving communities; to de-silo components and enable external integration – we want components to find use in other systems, especially in web archiving; and to prepare to evolve with the web, adapting our technologies accordingly.

What that means is that LOCKSS systems will treat WARC as a storage abstraction, and more seamlessly handle processing layers, proxies, etc. We already integrate Memento, but this will also let us engage with WASAPI – of which more in the next talk.

We have built a service for bibliographic metadata extraction, for web harvest and file transfer content; we can map values in DOM tree to metadata fields; we can retrieve downloadable metadata from expected URL patterns; and parse RIS and XML by schema. That model shows our bias to bibliographic material.

We are also using plugins to make bibliographic objects and their metadata on many publishing platforms machine-intelligible. We mainly work with publishing/platform heuristics like Atypon, Digital Commons, HighWire, OJS and Silverchair. These vary so we have a framework for them.

The use cases for metadata extraction include applying it to consistent subsets of content in larger corpora; curating PA materials within broader crawls; retrieving faculty publications online; or retrieving from university CMSs. You can also undertake discovery via bibliographic metadata, with your institution’s OpenURL resolver.

As described in 2005 D-Lib paper by DSHR et al, we are looking at on-access format migration. For instance x-bitmap to GIF.

Probably the most important core preservation capability is the audit and repair protocol. Network nodes conduct polls to validate the integrity of distributed copies of data chunks. More nodes = more security – more nodes can be down, and more copies can be corrupted, without loss. The nodes do not trust each other in this model, and responses cannot be cached. And when copies do not match, the node audits and repairs.
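The un-cacheable responses work roughly like this (a simplified sketch, not the actual LOCKSS protocol): each poll carries a fresh random nonce, and every node hashes the nonce together with its copy, so a correct answer cannot be precomputed or replayed, and a minority digest flags a copy needing repair.

```python
import hashlib
import os

def poll_response(content: bytes, nonce: bytes) -> str:
    # hashing the fresh nonce with the content forces every node to redo
    # the work for every poll - responses cannot be cached or borrowed
    return hashlib.sha256(nonce + content).hexdigest()

nonce = os.urandom(16)  # a new nonce per poll
good = poll_response(b"au-chunk-0001", nonce)
votes = [good, good, poll_response(b"CORRUPTED", nonce), good, good]

# copies whose digest is in the minority disagree with the quorum
needs_repair = [i for i, v in enumerate(votes) if votes.count(v) <= len(votes) // 2]
```

Here node 2 holds a damaged copy and would be audited and repaired from its peers.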

We think that functionality may be useful in other distributed digital preservation networks, in repository storage replication layers. And we would like to support varied back-ends including tape and cloud. We haven’t built those integrations yet…

To date our progress has addressed the WARC work. By end of 2017 we will have Docker-ised components, have a web harvest framework, polling and repair web service. By end of 2018 we will have IP address and Shibboleth access to OpenWayBack…

By all means follow and plug in. Most of our work is in a private repository, which then copies to GitHub. And we are moving more towards a community-orientated software development approach, collaborating more, and exploring use of LOCKSS technologies in other contexts.

So, I want to end with some questions:

  • What potential do you see for LOCKSS technologies for web archiving, other use cases?
  • What standards or technologies could we use that we maybe haven’t considered?
  • How could we help you to use LOCKSS technologies?
  • How would you like to see LOCKSS plug in more to the web archiving community?


Q1) Will these work with existing LOCKSS software, and do we need to update our boxes?

A1) Yes, it is backwards compatible. And the new features are containerised so that does slightly change the requirements of the LOCKSS boxes but no changes needed for now.

Q2) Where do you store bibliographic metadata? Or is it in the WARC?

A2) It is separate from the WARC, in a database.

Q3) With the extraction of the metadata… We have some resources around translators that may be useful.

Q4 – David) Just one thing on your simplified example… For each node… they all have to calculate a new separate nonce… None of the answers are the same… They all have to do all the work… It’s actually a system where untrusted nodes are compared… And several nodes can’t gang up on the others… Each peer randomly decides when to poll on things… There is no leader here…

Q5) Can you talk about format migration…

A5) It’s a capability already built into LOCKSS but we haven’t had to use it…

A5 – David) It’s done on the requests in http, which include acceptable formats… You can configure this thing so that if an acceptable format isn’t found, then you transform it to an acceptable format… (see the paper mentioned earlier). It is based on mime type.

Q6) We are trying to use LOCKSS as a generic archive crawler… Is that still how it will work…

A6) I’m not sure I have a definitive answer… LOCKSS will still be web harvesting-based. It will still be interesting to hear about approaches that are not web harvesting based.

A6 – David) Also interesting for CLOCKSS which are not using web harvesting…

A6) For the CLOCKSS and LOCKSS networks – the big networks – the web harvesting portfolio makes sense. But other networks with other content types, that is becoming more important.

Comment) We looked at doing transformation that is quite straightforward… We have used an API

Q7) Can you say more about the community project work?

A7) We have largely run LOCKSS as more of an in-house project, rather than a community project. We are trying to move it more in the direction of say, Blacklight, Hydra….etc. A culture change here but we see this as a benchmark of success for this re-architecting project… We are also in the process of hiring a partnerships manager and that person will focus more on creating documentation, doing developer outreach etc.

David: There is a (fragile) demo that has a lot of this… The goal is to continue that through the LAWS project, as a way to try this out… You can (cautiously) engage with that at demo.laws.lockss.org but it will be published to GitHub at some point.

Jefferson Bailey & Naomi Dushay: WASAPI data transfer APIs: specification, project update, and demonstration

Jefferson: I’ll give some background on the APIs. This is an IMLS funded project in the US looking at Systems Interoperability and Collaborative Development for Web Archives. Our goals are to:

  • build WARC and derivative dataset APIs (AIT and LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
  • Seed and launch community modelled on characteristics of successful development and participation from communities ID’d by project
  • Sketch a blueprint and technical model for future web archiving APIs informed by project R&D
  • Technical architecture to support this.

So, we’ve already run WARC and Digital Preservation surveys. 15-20% of Archive-It users download and locally store their WARCs – for various reasons – and that number is small and hasn’t really moved, which is why data transfer was a core area. We are doing online webinars and demos. We ran a national symposium on API-based interoperability and digital preservation, and we have white papers to come from this.

Development wise we have created a general specification, a LOCKSS implementation, Archive-it implementation, Archive-it API documentation, testing and utility (in progress). All of this is on GitHub.

The WASAPI Archive-It Transfer API is written in Python, meets all gen-spec criteria, with Swagger YAML in the repos. Authorisation uses the AIT Django framework (same as the web app); it is not defined in the general specification. We are using browser cookies or HTTP basic auth. We have a basic endpoint (in production) which returns all WARCs for that account; base/all results are paginated. In terms of query parameters you can use: filename; filetype; collection (ID); crawl (ID for an AIT crawl job), etc.

So what do you get back? A JSON object with: pagination, count, request-url, includes-extra. You have fields including account (Archive-It ID); checksums; collection (Archive-It ID); crawl; crawl time; crawl start; filename; filetype; locations; size. And you can request these through simple HTTP queries.
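A hedged sketch of those simple HTTP queries (builds the request URL and walks a response; authentication and pagination handling are omitted, and the exact endpoint path and response field names should be checked against the Archive-It WASAPI documentation):

```python
from urllib.parse import urlencode

# assumed production endpoint for the Archive-It WASAPI transfer API
BASE = "https://partner.archive-it.org/wasapi/v1/webdata"

def webdata_query(**params):
    # supported filters include filename, filetype, collection, crawl (see above)
    qs = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}?{qs}" if qs else BASE

def warc_locations(response_json):
    # each entry in "files" carries checksums, size and download locations
    for f in response_json.get("files", []):
        yield f["filename"], f.get("locations", [])
```

For example, `webdata_query(collection=1234, filetype="warc")` yields the URL for one collection's WARCs, and `warc_locations` extracts the download URLs from the JSON body.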

You can also submit jobs for generating derivative datasets. We use existing query language.

In terms of what is to come, this includes:

  1. Minor AIT API features
  2. Recipes and utilities (testers welcome)
  3. Community building research and report
  4. A few papers on WA APIs
  5. Ongoing surveys and research
  6. Other APIs in WASAPI (past and future)

So we need some way to bring together these APIs regularly. And also an idea of what other APIs we need to support, and how to prioritise that.

Naomi: I’m talking about the Stanford take on this… These are the steps Nicholas, as project owner, takes to download WARC files from Archive-It at the moment… It is a 13-step process… And this grant-funded work focuses on simplifying the first six steps and making them more manageable and efficient. As a team we are really focused on not being dependent on bespoke software; things must be maintainable, with continuous integration set up, excellent test coverage, and automatable. There is a team behind this work, and this was their first touch of any of this code – you had three neophytes working on this with much to learn.

We are lucky to be just down the corridor from LOCKSS. Our preferred language is Ruby but Java would work best for LOCKSS. So we leveraged LOCKSS engineering here.

The code is at: https://github.com/sul-dlss/wasapi-downloader/.

You only need Java to run the code. And all arguments are documented on GitHub. You can also view a video demo (embedded in the original post).


These videos are how we share our progress at the end of each Agile sprint.

In terms of work remaining we have various tweaks, pull requests, etc. to ensure it is production-ready. One of the challenges so far has been thinking about crawls and patches, and the context of the WARC.


Q1) At Stanford are you working with the other WASAPI APIs, or just the downloads one.

A1) I hope the approach we are taking is a welcome one. But we have a lot of projects taking place, but we are limited by available software engineering cycles for archives work.

Note that we do need a new readme on GitHub

Q2) Jefferson, you mentioned plans to expand the API, when will that be?

A2 – Jefferson) I think that it is pretty much done and stable for most of the rest of the year… WARCs do not have crawl IDs or start dates – hence adding crawl time.

Naomi: It was super useful that the team that built the downloader was separate from the team building the WASAPI endpoint, as that surfaced a lot of the assumptions, issues, etc.

David: We have a CLOCKSS implementation pretty much building on the Swagger. I need to fix our ID… But the goal is that you will be able to extract stuff from a LOCKSS box using WASAPI using URL or Solr text search. But timing wise, don’t hold your breath.

Jefferson: We’d also like others feedback and engagement with the generic specification – comments welcome on GitHub for instance.

Web archives platforms & infrastructure (Chair: Andrew Jackson)

Jack Cushman & Ilya Kreymer: Thinking like a hacker: security issues in web capture and playback

Jack: We want to talk about securing web archives, and how web archives can get themselves into trouble with security… We want to share what we’ve learnt, and what we are struggling with… So why should we care about security as web archives?

Ilya: Well, web archives are not just a collection of old pages… No, high-fidelity web archives run untrusted software. And there is an assumption that a live site is “safe” so there is nothing to worry about… but that isn’t right either…

Jack: So, what could a page do that could damage an archive? Not just a virus or a hack… but more than that…

Ilya: Archiving local content… A capture system could have privileged access – to local ports, a network server, or local files. It is a real threat, and could capture private resources into a public archive. Mitigation: network filtering and sandboxing; don’t allow capture of local IP addresses…
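The "don't capture local IP addresses" mitigation can be sketched with the stdlib alone: resolve the target host and refuse anything that lands on a loopback, private, or link-local address. (A minimal sketch – real deployments also need to guard against DNS rebinding and redirects.)

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_capture_target(url: str) -> bool:
    # resolve the host and refuse internal addresses, so a capture system
    # with privileged network access can't be steered at local services
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_loopback or ip.is_private or ip.is_link_local:
            return False
    return True
```

A crawler would run this check before fetching each seed or discovered URL.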

Jack: Threat: hacking the headless browser. Modern captures may use PhantomJS or other browsers on the server, most browsers have known exploits. Mitigation: sandbox your VM

Ilya: Stealing user secrets during capture… Normal web flow… But you have other things open in the browser. Partial mitigation: rewriting – rewrite cookies to exact path only; rewrite JS to intercept cookie access. Mitigation: separate recording sessions – for webrecorder use separate recording sessions when recording credentialed content. Mitigation: Remote browser.

Jack: So assume we are running MyArchive.com… Threat: cross site scripting to steal archive login

Ilya: Well you can use a subdomain…

Jack: Cookies are separate?

Ilya: Not really… In IE 10 the archive within the archive might steal the login cookie. And in all browsers a site can wipe and replace cookies.

Mitigation: run web archive on a separate domain from everything else. Use iFrames to isolate web archive content. Load web archive app from app domain, load iFrame content from content domain. As Webrecorder and Perma.cc both do.

Jack: Now, in our content frame… how bad could it be if that content leaks… What if we have live web leakage on playback? This can happen all the time… It’s hard to stop entirely… JavaScript can send messages back and fetch new content… to mislead, track users, rewrite history. Bonus: for private archives – any of your captures could export any of your other captures.

The best mitigation is a Content-Security-Policy header, which can limit access to the web archive domain.
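A minimal sketch of such a header, assuming the app/content domain split described above (the domain name is illustrative; real replay systems tune the directives much further):

```python
def replay_headers(content_domain: str) -> dict:
    # restrict what a replayed page may load to the archive's own content
    # domain, cutting off live-web leakage during playback
    csp = (f"default-src 'self' https://{content_domain}; "
           f"img-src 'self' https://{content_domain} data:; "
           "frame-ancestors 'self'")
    return {"Content-Security-Policy": csp}

headers = replay_headers("content.myarchive.com")
```

The replay server would attach this header to every response served from the content frame's domain.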

Ilya: Threat: showing different page contents when archived… Pages can tell they’re in an archive and act differently. Mitigation: run the archive in a containerised/proxy-mode browser.

Ilya: Threat: Banner spoofing… This is a dangerous but quite easy to execute threat. Pages can dynamically edit the archives banner…

Jack: Suppose I copy the code of a page that was captured and change fake evidence, change the metadata of the date collected, and/or the URL bar…

Ilya: You can’t do that in Perma because we use frames. But if you don’t separate banner and content, this is a fairly easy exploit to do… So, Mitigation: use iFrames for replay; don’t inject the banner into the replay frame… It’s a fidelity/security trade-off…

Jack: That’s our top 7 tips… But what next… What we introduce today is a tool called http://warc.games. This is a version of webrecorder with every security problem possible turned on… You can run it locally on your machine to try all the exploits and think about mitigations and what to do about them!

And you can find some exploits to try, some challenges… Of course if you actually find a flaw in any real system please do be respectful


Q1) How much is the bug bounty?! [laughs] What do we do about the use of very old browsers…

A1 – Jack) If you use an old browser you may be compromised already… But we use the most robust solution possible… In many cases there are secure options that work with older browsers too…

Q2) Any trends in exploits?

A2 – Jack) I recommend the book The Tangled Web… And there is an aspect that when you run a web browser there will always be some sort of issue

A2 – Ilya) We have to get around security policies to archive the web… It wasn’t designed for archiving… But that raises its own issues.

Q3) Suggestions for browser makers to make these safer?

A3) Yes, but… How do you do this with current protocols and APIs?

Q4) Does running old browsers and escaping from containers keep you awake at night…

A4 – Ilya) Yes!

A4 – Jack) If anyone is good at container escapes please do write that challenge as we’d like to have it in there…

Q5) There’s a great article called “Familiarity breeds contempt” which notes that old browsers and software get more vulnerable over time… It is particularly a big risk where you need old software to archive things…

A5 – Jack) Thanks David!

Q6) Can you say more about the headers being used…

A6) The idea is we write the CSP header to only serve from the archive server… And they can be quite complex… May want to add something of your own…

Q7) May depend on what you see as a security issue… for me it may be about the authenticity of the archive… By building something in the website that shows different content in the archive…

A7 – Jack) We definitely think that changing the archive is a security threat…

Q8) How can you check the archives and look for arbitrary hacks?

A8 – Ilya) It’s pretty hard to do…

A8 – Jack) But it would be a really great research question…

Mat Kelly & David Dias: A collaborative, secure, and private InterPlanetary WayBack web archiving system using IPFS

David: Welcome to the session on going InterPlanetary… We are going to talk about peer to peer and other technologies to make web archiving better…

We’ll talk about the InterPlanetary File System (IPFS) and InterPlanetary WayBack (IPWB)…

IPFS is also known as the distributed web, moving from location-based to content-based addressing… As we are aware, the web has some problems… You are using a service, accessing email, working on a document… There is some break in connectivity… And suddenly all those essential services are gone… Why? Why do we need to have the services working in such a vulnerable way… Even a simple page: you lose a connection and you get a 404. Why?

There is a real problem with permanence… We have this URI, the URL, telling us the protocol, location and content path… But when we come back later – weeks or months – that content has moved elsewhere… Either somewhere else you can find, or somewhere you can’t. Sometimes it’s like the content has been destroyed… But every time people see a webpage, you download it to your machine… These issues come from location addressing…

In content addressing we tie content to a unique hash that identifies the item… So a Content Identifier (CID) allows us to do this… And then, in a network, when I look for that data… If there is a disruption to the network, we can ask any machine for the content… And a node near you can serve what it holds before you ever go to the wider network.
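The content-addressing idea above can be sketched in a few lines. This is a toy model, not the actual IPFS implementation – real CIDs are multihashes over a chunked Merkle DAG, while a flat SHA-256 digest just shows the principle:

```python
import hashlib

def content_id(data: bytes) -> str:
    """The address IS a fingerprint of the bytes, not a location."""
    return hashlib.sha256(data).hexdigest()

store = {}  # stands in for "any machine on the network" holding content

def put(data: bytes) -> str:
    cid = content_id(data)
    store[cid] = data
    return cid

def get(cid: str) -> bytes:
    data = store[cid]
    # unlike a URL, the address itself lets the receiver verify the content:
    assert content_id(data) == cid, "content failed verification"
    return data
```

Note the two properties the talk relies on: identical bytes always get the identical address (so any holder can serve them), and tampered bytes can never satisfy the old address.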

IPFS is already used in video streaming (inc. Netflix), legal documents, 3D models – with HoloLens for instance – for games, for scientific data and papers, blogs and webpages, and totally distributed web apps.

IPFS allows content to be distributed and available offline, saves space, optimises bandwidth usage, etc.

Mat: So I am going to talk about IPWB. The motivation here is that the persistence of archived web data depends on the resilience of the organisation and the availability of the data. The design extends the CDXJ format, with an indexing and IPFS dissemination procedure, and a replay and IPFS pull procedure. The adapted CDXJ adds the hash for the content to the metadata structure.
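A record in that adapted index might look something like this – illustrative only, with field names that are assumptions rather than the exact IPWB schema. A CDXJ line is keyed by SURT-form URL and timestamp, and the JSON block carries the IPFS hashes under which the capture's HTTP header and payload were pushed, so replay can pull the record back out of IPFS rather than a local WARC:

```python
import json

def cdxj_line(surt: str, timestamp: str,
              header_digest: str, payload_digest: str) -> str:
    """Build one index line: '<surt> <timestamp> <json metadata>'."""
    metadata = {
        # hypothetical locator pointing at the header and payload in IPFS:
        "locator": f"urn:ipfs/{header_digest}/{payload_digest}",
        "mime_type": "text/html",
        "status_code": "200",
    }
    return f"{surt} {timestamp} {json.dumps(metadata, sort_keys=True)}"
```

Keeping the digests in the index rather than the content means the small CDXJ can be shared freely while the (possibly encrypted) payloads travel over IPFS.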

Dave: One of the ways IPFS is pushing the boundary is running in a browser tab, in a browser extension, and as a service worker acting as a proxy for requests the browser makes, with no changes to the interface (that one is definitely in alpha!)…

So the IPWB can expose the content to the IPFS and then connect and do everything in the browser without needing to download and execute code on their machine. Building it into the browser makes it easy to use…

Mat: And IPWB enables privacy, collaboration and security, building the encryption method and key into the WARC. Similarly CDXJs may be transferred for our users’ replay… Ideally you won’t need a CDXJ on your own machine at all…

We are also rerouting, rather than rewriting, for archival replay… We’ll be presenting on that late this summer…

And I think we just have time for a short demo…

For more see: https://github.com/oduwsdl/ipwb


Q1) Mat, I think that you should tell that story of what you do…

A1) So, I looked for files on another machine…

A1 – Dave) When Mat has the archive file on a remote machine… Someone looks for this hash on the network, send my way as I have it… So when Mat looked, it replied… so the content was discovered… request issued, received content… and presented… And that also lets you capture pages appearing differently in different places and easily access them…

Q2) With the hash addressing, are there security concerns…

A2 – Dave) We use Multihash, using SHA… But you can use different hash functions, they just verify the link… In IPFS we prevent issues with self-describing data functions…

Q3) The problem is that the hash function does end up in the URL… and it will decay over time because the hash function will decay… It’s a really hard problem to solve – making a choice now that may be wrong… But there is no way of knowing the right choice.

A3) At least we can use the hash function to indicate whether it looks likely to be the right or wrong link…

Q4) Is hash functioning itself useful with or without IPFS… Or is content addressing itself inherently useful?

A4 – Dave) I think the IPLD is useful anyway… So with legal documents where links have to stay intact, and not be part of the open web, then IPFS can work to restrict that access but still make this more useful…

Q5) If we had a content addressable web, almost all these web archiving issues would be resolved really… It is hard to know if content is in Archive 1 or Archive 2. A content addressable web would make it easier to archive… Important to keep in mind…

A5 – Dave) I 100% agree! A content addressed web lets you understand what is important to capture. And IPFS saves a lot of bandwidth and a lot of storage…

Q6) What is the longevity of the hashes and how do I check that?

A6 – Dave) OK, you can check the integrity of the hash. And we have filecoin.io, which is a blockchain-based storage network and cryptocurrency and that does handle this information… Using an address in a public blockchain… That’s our solution for some of those specific problems.

Andrew Jackson (AJ), Jefferson Bailey (JB), Kristinn Sigurðsson (KS) & Nicholas Taylor (NT): IIPC Tools: autumn technical workshop planning discussion

AJ: I’ve been really impressed with what I’ve seen today. There is a lot of enthusiasm for open source and collaborative approaches and that has been clear today and the IIPC wants to encourage and support that.

Now, in September 2016 we had a hackathon but there were some who just wanted to get something concrete done… And we might therefore adjust the format… Perhaps pre-define a task well ahead of time… But also a parallel track for the next hackathon/more experimental side. Is that a good idea? What else might work?

JB: We looked at Archives Unleashed, and we did a White House Social Media Hackathon earlier this year… This is a technical track but… it’s interesting to think about what kind of developer skills/what mix will work best… We have lots of web archiving engineers… They don’t use the software that comes out of it… We find it useful to have archivists in the room…

Then, from another angle, is that at the hackathons… IIPC doesn’t have a lot of money and travel is expensive… The impact of that gets debated – it’s a big budget line for 8-10 institutions out of 53 members. The outcomes are obviously useful but… Expecting people to be totally funded for days on end across the world isn’t feasible… So maybe more little events, or fewer bigger events, can work…

Comment 1) Why aren’t these sessions recorded?

JB: Too much money. We have recorded some of them… Sometimes it happens, sometimes it doesn’t…

AJ: We don’t have in-house skills, so it’s third party… And that’s the issue…

JB: It’s a quality thing…

KS: But also, when we’ve done it before, it’s not heavily watched… And the value can feel questionable…

Comment 1) I have a camera at home!

JB: People can film whatever they want… But that’s on people to do… IIPC isn’t an enforcement agency… But we should make it clear that people can film them…

KS: For me… You guys are doing incredible things… And it’s things I can’t do at home. The other aspect is that… There are advancements that never quite happened… But I think there is value in the unconference side…

AJ: One of the things with unconference sessions is that…

NT: I didn’t go to the London hackathon… Now we have a technical team, it’s more appealing… The conference in general is good for surfacing issues we have in common… such as extraction of metadata… But there is also the question of when we sit down to deal with some specific task… That could be useful for taking things forward…

AJ: I like the idea of a counter conference, focused on the tools… I was a bit concerned that if there were really specific things… What does it need to be to be worth your organisation flying you to it… Too narrow and it’s exclusionary… Too broad and maybe it’s not helpful enough…

Comment 2) Worth seeing the model used by Python – they have a sprint after their conference. That isn’t an unconference but lets you come together. Mozilla Fest Sprint picks a topic and then next time you work on it… Sometimes looking at other organisations with less money are worth looking at… And for things like crowd sourcing coverage etc… There must be models…

AJ: This is cool… You will have to push on this…

Comment 3) I think that tacking on to a conference helps…

KS: But it’s challenging to be away from the office for more than 3-4 days…

Comment 4) Maybe look at NodeJS Community and how they organise… They have a website, NodeSchool.io with three workshops… People organise events pretty much monthly… And create material in local communities… Less travel but builds momentum… And you can see that that has impact through local NodeJS events now…

AJ: That would be possible to support as well… with IIPC or organisational support… Bootstrapping approaches…

Comment 5) Other than hackathon there are other ways to engage developers in the community… So you can engage with Google Summer of Code for instance – as mentors… That is where students look for projects to work on…

JB: We have two GSoC students and like 8 working without funding at the moment… But it’s non-trivial to manage that…

AJ: Onboarding new developers in any way would be useful…

Nick: Onboarding into the weird and wacky world of web archiving… If IIPC can curate a lot of onboarding stuff, that would be really good for potential… for getting started… Not relying on a small number of people…

AJ: We have to be careful as IIPC tools page is very popular, but hard to keep up to date… Benefits can be minor versus time…

Nick: Do you have GitHub? Just put up an awesome list!

AJ: That’s a good idea…

JB: Microfunding projects – sub $10k – are also an option for cost-recovered bought-out time for some of these sorts of tasks… That would be really interesting…

Comment 6) To expand on what Jefferson and Nick were saying… I’m really new… I went to IIPC in April. I am enjoying this and learning a lot… I’ve been talking to a lot of you… That would really help more people get the technical environment right… Organisations want to get into archiving on a small scale…

Olga: We do have a list on GitHub… but not up to date and well used…

AJ: We do have this document, we have GitHub… But we could refer to each other… and point to the getting started stuff (only). Rather get away from lists…

Comment 7) Google has an OpenSource.guide page – could take inspiration from that… Licensing, communities, etc… Very simple plain English getting started guide/documentation…

Comment 8) I’m very new to the community… And I was wondering to what extent you use Slack and Twitter between events to maintain these conversations and connections?

AJ: We have a Slack channel, but we haven’t publicised it particularly but it’s there… And Twitter you should tweet @NetPreserve and they will retweet then this community will see that…

Jun 14 2017

Following on from Day One of IIPC/RESAW I’m at the British Library for a connected Web Archiving Week 2017 event: Digital Conversations @BL, Web Archives: truth, lies and politics in the 21st century. This is a panel session chaired by Elaine Glaser (EG) with Jane Winters (JW), Valerie Schafer (VS), Jefferson Bailey (JB) and Andrew Jackson (AJ). 

As usual, this is a liveblog so corrections, additions, etc. are welcomed. 

EG: Really excited to be chairing this session. I’ll let everyone speak for a few minutes, then ask some questions, then open it out…

JB: I thought I’d talk a bit about our archiving strategy at the Internet Archive. We don’t archive the whole of the internet, but we aim to collect a lot of it. The approach is multi-pronged: to take entire web domains in a shallow but broad strategy; to work with other libraries and archives to focus on particular subjects or areas or collections; and then to work with researchers who are mining or scraping the web, but not necessarily having preservation strategies. So, when we talk about political archiving or web archiving, it’s about getting as much as possible, with different volumes and frequencies. We know we can’t collect everything, so we collect important things frequently, less important things less frequently. And we work with national governments, with national libraries…

The other thing I wanted to raise is T.R. Schellenberg, who was an important archivist at the National Archives in the US. He had an idea about archival strategies: that there is a primary documentation strategy, and a secondary strategy. The primary is for a government and its agencies to do for their own use, the secondary for future use in unknown ways… And including documentary and evidentiary material (the latter being how and why things are done). Those evidentiary elements become much more meaningful on the web, and have emerged as more meaningful in the context of our current political environment.

AJ: My role is to build a web archive for the United Kingdom. So I want to ask a question that comes out of this… “Can a web archive lie?”. Even putting to one side that it isn’t possible to archive the whole web… There is confusion because we can’t get every version of everything we capture… Then there are biases from our work. We choose all UK sites, but some are captured more than others… And our team isn’t as diverse as it could be. And what we collect is also constrained by technological capability. And we are limited by time issues… We don’t normally know when material was created… The crawler often finds things only when they become popular… So the academic paper is picked up after a BBC News item – they are out of order. We would like to use more structured data, such as Twitter, which has a clear publication date…

But can the archive lie? Well, with digital material it is much easier than print to make an untraceable change. As digital is increasingly predominant we need to be aware that our archive could be hacked… So we have to protect against that, and evidence that we haven’t been hacked… And we have to build systems that are secure and can maintain that trust. Libraries will have to take care of each other.

JW: The Oxford Dictionaries word of the year in 2016 was “post-truth” whilst the Australian dictionary went for “fake news”. Fake news for them is either disinformation on websites for political purposes, or for commercial benefit. Merriam-Webster went for “surreal” – their most searched-for word. It feels like we live in very strange times… There aren’t calls for resignation where there once were… Hasn’t it always been thus though…? For all the good citizens who point out the errors of a fake image circulated on Twitter, for many the truth never catches the lie. Fakes, lies and forgeries have helped change human history…

But modern fake news is different to that which existed before. Firstly there is the speed of fake news… Mainstream media can only counteract or address this slowly. Some newspapers and websites do public corrections, but that isn’t the norm. Once, publishing took time and means. Social media has made it much easier to self-publish. One can create, but one can also check accuracy and integrity – reverse image searching to see when a photo has been photoshopped or shows events from years before…

And we have politicians making claims that they believe can be deleted and disappear from our memory… We have web archives – on both sides of the Atlantic. The European Referendum NHS pledge claim is archived and lasts long beyond the bus – which was bought by Greenpeace and repainted. The archives have also been capturing political parties’ websites throughout our endless election cycle… The DUP website crashed after the announcement of the election results because of demand… But the archived copy was available throughout. There was also a rumour that a hacker was creating an Irish language version of the DUP website… But that wasn’t a new story, it was from 2011… And again the archive shows that, and archives of news websites show that.

Social Networks Responses to Terrorist Attacks in France – Valerie Schafer. 

Before 9/11 we had some digital archives of terrorist materials on the web. But this event challenged archivists and researchers. Charlie Hebdo, Paris Bataclan and Nice attacks are archived… People can search at the BNF to explore these archives, to provide users a way to see what has been said. And at the INA you can also explore the archive, including Twitter archives. You can search, see keywords, explore timelines crossing key hashtags… And you can search for images… including the emojis used in discussion of Charlie Hebdo and Bataclan.

We also have Archive-It collections for Charlie Hebdo. This raises some questions of what should and should not be collected… We do not normally collect newspaper and audiovisual sites, but decided to in this case as we faced a special event. But we still face challenges – it is easier to collect data from Twitter than from Facebook. And it is free to collect Twitter data in real time, but the archived/older data is charged for, so you have to capture it in the moment. And there are limits on API collection… INA captured more than 12 million tweets for Charlie Hebdo, for instance; it is very complete but not exhaustive.

We continue to collect for #jesuischarlie and #bataclan… They are continually used and added to, in similar or related attacks, etc. There is a time for exploring and reflecting on this data, and space for critics too…

But we also see that content gets deleted… It is hard to find fake news on social media, unless you are looking for it… Looking for #fakenews just won’t cut it… So, we had a study on fake news… And we recommend that authorities are cautious about material they share. But also there is a need for cross checking – the kinds of projects with Facebook and Twitter. Web archives are full of fake news, but also full of others’ attempts to correct and check fake news as well…

EG: I wanted to go back in time to the idea of the term “fake news”… In order to understand what “fake news” actually is, we have to understand how it differs from previous lies and mistruths… I’m from outside the web world… We are often looking at tactics to fight fire with fire, to use an unfortunate metaphor… How new is it? And who is to blame and why?

JW: Talking about it as a web problem, or a social media issue isn’t right. It’s about humans making decisions to critique or not that content. But it is about algorithmic sharing and visibility of that information.

JB: I agree. What is new is the way media is produced, disseminated and consumed – those have technological underpinnings. And they have been disruptive of publication and interpretation in a web world.

EG: Shouldn’t we be talking about a culture, not just technology… It’s not just the “vessel”… Doesn’t the dissemination have more of a role than perhaps we are suggesting…

AJ: When you build a social network or any digital space you build in different affordances… So Facebook and Twitter are different. And you can create automated accounts, with Twitter especially offering an affordance for robots etc. which allows you to give the impression of a movement. There are ways to change those affordances, but there will also always be fake news and issues…

EG: There are degrees of agency in fake news.. from bots to deliberate posts…

JW: I think there is also the aspect of performing your popularity – creating content for likes and shares, regardless of whether what you share is true or not.

VS: I know terrorism is different… But for any tweet sharing fake news you get four retweets denying it… You have more tweets denying than sharing fake news…

AJ: One wonders about the filter bubble impact here… Facebook encourages inward-looking discussion… Social media has helped like-minded people find each other, and perhaps they can be clipped off more easily from the wider discussion…

VS: I think also what is interesting is the game between social media and traditional media… You have questions and relationships there…

EG: All the internet can do is reflect the crooked timber of reality… We know that people have confirmation bias, that we are quite tolerant of untruths, and less tolerant of information that contradicts our perceptions, even if untrue. You have people and the net being equally tolerant of lies and mistruths… But isn’t there another factor here… The people demonised as gatekeepers… By putting in place structures of authority – which were journalism and academia… Their resources are reduced now… So what role do you see for those traditional gatekeepers…

VS: These gatekeepers are no longer the traditional gatekeepers that they were… They work in 24-hour news cycles and have to work to that. In France they are trying to rethink that role; there were a lot of questions about this… Whether that’s about how you react to changing events, and what happens during elections… People are thinking about that…

JB: There is an authority and responsibility for media still, but has the web changed that? Looking back it’s surprising now how few organisations controlled most of the media… But is that that different now?

EG: I still think you are being too easy on the internet… We’ve had investigative journalism by Carole Cadwalladr and others on Cambridge Analytica and others who deliberately manipulate reality… You talked about witness testimony in relation to terrorism… Isn’t there an immediacy and authenticity challenge there… Donald Trump’s tweets… They are transparent but not accountable… Haven’t we created a problem that we are now trying to fix?

AJ: Yes. But there are two things going on… It seems to be that people care less about lying… People see Trump lying, and they don’t care, and media organisations don’t care as long as advertising money comes in… A parallel for that in social media – the flow of content and ads takes priority over truth. There is an economic driver common to both mediums that is warping that…

JW: There is an unpopularity aspect too… a (nameless) newspaper here shares content to generate “I can’t believe this!” reactions, and then sharing generates advertising income… But on a positive note, there is scope and appetite for strong investigative journalism… and that is facilitated by the web and digital methods…

VS: Citizens do use different media and cross media… Colleagues are working on how TV is used… And different channels, to compare… Mainstream and social media are strongly crossed together…

EG: I did want to talk about temporal element… Twitter exists in the moment, making it easy to make people accountable… Do you see Twitter doing what newspapers did?

AJ: Yes… A substrate…

JB: It’s amazing how much of the web is archived… With “Save Page Now” we see all kinds of things archived – including pages that exposed the Russian downing of a Ukrainian plane… Citizen action, spotting the need to capture data whilst it is still there, and that happens all the time…

EG: I am still sceptical about citizen journalism… It’s a small group from a narrow demographic, and it’s time consuming… Perhaps there is still a need for journalist roles… We did talk about filter bubbles… We hear about newspapers and media as biased… But isn’t the issue that communities of misinformation are not penetrated by the other side, nor by the truth…

JW: I think bias in newspapers is quite interesting and different to unacknowledged bias… Most papers are explicit in their perspective… So you know what you will get…

AJ: I think so, but bias can be quite subtle… Different perspectives on a common issue allows comparison… But other stories only appear in one type of paper… That selection case is harder to compare…

EG: This really is a key point… There is a difference between facts and truth, and explicitly framed interpretation or commentary… Those things are different… That’s where I wonder about web archives… When I look at Wikipedia… It’s almost better to go to a source with an explicit bias where I can see a take on something, unlike Wikipedia which tries to focus on fact. Talking about politicians lying misses the point… It should be about a specific rhetorical position… That definition of truth comes up when we think of the role of the archive… How do you deal with that slightly differing definition of what truth is…

JB: I talked about different complementary collecting strategies… The archivist as a role has some political power in deciding what goes in the historical record… The volume of the web does undercut that power in a way that I think is good – archives have historically been about the rich and the powerful… So making archives non-exclusive somewhat addresses that… But there will be fake news in the archive…

JW: But that’s great! Archives aren’t about collecting truth. Things will be in there that are not true, partially true, or factual… It’s for researchers to sort that out later…

VS: Your comment on Wikipedia… They do try to be factual, neutral… But not truth… And to have a good balance of power… For us as researchers we can be surprised by the neutral point of view… Fortunately the web archive does capture a mixture of opinions…

EG: Yeah, so that captures what people believed at a point of time – true or not… So I would like to talk about the archive itself… Do you see your role as being successors to journalists… Or as being able to harvest the world’s record in a different way…

JB: I am an archivist with that training and background, as are a lot of people working on web archives and in interesting spaces. Certainly historic preservation drives a lot of collecting aspects… But also engineering and technological aspects. So it’s people interested in archiving and preservation, but also technology… And software engineers interested in web archiving.

AJ: I’m a physicist but I’m now running web archives. And for us it’s an extension of the legal deposit role… Anything made public on the web should go into the legal deposit… That’s the theory, in practice there are questions of scope, and where we expend quality assurance energy. That’s the source of possible collection bias. And I want tools to support archivists… And also to prompt for challenging bias – if we can recognise that taking place.

JW: There are also questions of what you foreground in Special Collections. There are decisions being made about collections that will be archived and catalogued more deeply…

VS: At the BNF my colleagues work in an area with a tradition, with legal deposit responsibility… There are politics of heritage and what it should be. I think that is the case for many places where that activity sits with other archivists and librarians.

EG: You do have this huge responsibility to curate the record of human history… How do you match the top-down requirements with the bottom-up nature of the web as we now talk about it?

JW: One way is to have others come in to your department to curate particular collections…

JB: We do have special collections – people can choose their own, public suggestions, feeds from researchers, all sorts of projects to get the tools in place for building web archives for their own communities… I think for the sake of longevity and use going forward, the curated collections will probably have more value… Even if they seem more narrow now.

VS: Also interesting that archives did not select bottom-up curation. In Switzerland they went top down – there are a variety of approaches across Europe.

JW: We heard about the 1916 Easter Rising archive earlier, which was through public nominations… Which is really interesting…

AJ: And social media can help us – by seeing links and hashtags. When we looked at this 4-5 years ago everyone linked to the BBC, but now we have more fake news sites etc…

VS: We do have this question of what should be archived… We see capture of the vernacular web – kitten or unicorn gifs etc… !

EG: I have a dystopian scenario in my head… Could you see a time years from now when newspapers are dead, public broadcasters are more or less dead… And we have flotsam and jetsam… We have all this data out there… And all kinds of actors who use this social media data… Can you reassure me?

AJ: No…

JW: I think academics are always ready to pick holes in things, I hope that that continues…

JB: I think more interesting is the idea that there may not be a web… Apps, walled gardens… Facebook is pretty hard to web archive – they make it intentionally more challenging than it should be. There are lots of communication tools that have disappeared… So I worry more about the loss of a web that allows the positive affordances of participation and engagement…

EG: There is the issue of privatising and sequestering the web… I am becoming increasingly aware of the importance of organisations – like the BL and Internet Archive… Those roles used to be taken on by publicly appointed organisations and bodies… How are they impacted by commercial privatisation… And how those roles are changing… How do you envisage that public sphere of collecting…

JW: For me more money for organisations like the British Library is important. Trust is crucial, and I trust that they will continue to do that in a trustworthy way. Commercial entities cannot be trusted to protect our cultural heritage…

AJ: A lot of people know what we do with physical material, but are surprised by our digital work. We have to advocate for ourselves. We are also constrained by the legal framework we operate within, and we have to challenge that over time…

JB: It’s super exciting to see libraries and archives recognised for their responsibility and trust… But that also puts them at higher risk by those who they hold accountable, and being recognised as bastions of accountability makes them more vulnerable.

VS: Recently we had 20th birthday of the Internet Archive, and 10 years of the French internet archiving… This is all so fast moving… People are more and more aware of web archiving… We will see new developments, ways to make things open… How to find and search and explore the archive more easily…

EG: The question then is how we access this data… The new masters of the universe will be those emerging gatekeepers who can explore the data… What is the role between them and the public’s ability to access data…

VS: It is not easy to explain everything around web archives but people will demand access…

JW: There are different levels of access… Most people will be able to access what they want. But there is also a great deal of expertise in organisations – it isn’t just commercial data work. And working with the Alan Turing Institute and cutting edge research helps here…

EG: One of the founders of the internet, Vint Cerf, says that “if you want to keep your treasured family pictures, print them out”. Are we overly optimistic about the permanence of the record?

AJ: We believe we have the skills and capabilities to maintain most if not all of it over time… There is an aspect of benign neglect… But if you are active about your digital archive you could have a copy in every continent… Digital allows you to protect content from different types of risk… I’m confident that the library can do this as part of its mission.


Q1) Coming back to fake news and journalists… There is a changing role between the web as a communications media, and web archiving… Web archives are about documenting this stuff for journalists for research as a source, they don’t build the discussion… They are not the journalism itself.

Q2) I wanted to come back to the idea of the Filter Bubble, in the sense that it mediates the experience of the web now… It is important to capture that in some way, but how do we archive that… And changes from one year to the next?

Q3) It’s kind of ironic to have nostalgia about journalism and traditional media as gatekeepers, in a country where Rupert Murdoch is traditionally that gatekeeper. Global funding for web archiving is tens of millions; the budget for the web is tens of billions… The challenges are getting harder – right now you can use robots.txt but we have DRM coming and that will make it illegal to archive the web – and the budgets have to increase to match that to keep archives doing their job.

AJ: To respond to Q3… Under the legislation it will not be illegal for us to archive that data… But it will make it more expensive and difficult to do, especially at scale. So your point stands, even with that. In terms of the Filter Bubble, they are out of our scope, but we know they are important… It would be good to partner with an organisation where the modern experience of media is explicitly part of its role.

JW: I think that idea of the data not being the only thing that matters is important. Ethnography is important for understanding that context around all that other stuff…  To help you with supplementary research. On the expense side, it is increasingly important to demonstrate the value of that archiving… Need to think in terms of financial return to digital and creative economies, which is why researchers have to engage with this.

VS: Regarding the first two questions… Archives reflect reality, so there will be lies there… Of course web archives must be cross-referenced and compared with other archives… And contextualisation matters – the digital environment in which the web was living… Contextualisation of the web environment is important… And with the terrorist archive we tried to document the process of how we selected content, and archive that too, for future researchers to have in mind and understand what is there and why…

JB: I was interested in the first question, this idea of what happens and preserving the conversation… That timeline was sometimes decades before but is now weeks or days or less… In terms of experience, websites are now personalised and our ability to capture that is impossible on a broad scale. So we need to capture that experience, and the emergent personalisation… The web wasn’t public before, as ARPAnet, then it became public, but it seems to be ebbing a bit…

JW: With a longer term view… I wonder if the open stuff which is easier to archive may survive beyond the gated stuff that traditionally was more likely to survive.

Q4) Today we are 24 years into advertising on the web. We take ad-driven models as a given, and we see fake news as a consequence of that… So, my question is, Minitel was a large system that ran on a different model… Are there different ways to change the revenue model to change fake or true news and how it is shared…

Q5) Theresa May has been outspoken on fake news and wants a crackdown… The way I interpret that is censorship and banning of sites she does not like… Jefferson said that he’s been archiving sites that she won’t like… What will you do if she asks you to delete parts of your archive?

JB: In the US?!

Q6) Do you think we have sufficient web literacy amongst policy makers, researchers and citizens?

JW: On that last question… Absolutely not. I do feel sorry for politicians who have to appear on the news to answer questions but… Some of the responses and comments, especially on encryption and cybersecurity have been shocking. It should matter, but it doesn’t seem to matter enough yet… 

JB: We have a tactic of “geopolitical redundancy” to ensure our collections are shielded from political endangerment by making copies – which is easy to do – and locate them in different political and geographical contexts. 

AJ: We can suppress content by access. But not deletion. We don’t do that… 

EG: Is there a further risk of data manipulation… Of Trump and Farage and data… a covert threat… 

AJ: We do have to understand and learn how to cope with potential attack… Any one domain is a single point of failure… so we need to share metadata, content where possible… But web archives are fortunate to have the strong social framework to build that on… 

Q7) Going back to that idea of what kinds of responsibilities we have to enable a broader range of people to engage in a rich way with the digital archive… 

Q8) I was thinking about questions in context, and trust in content in the archive… And realising that web archives are fairly young… Generally researchers are close to the resource they are studying… Can we imagine projects in 50-100 years time where we are more separate from what we should be trusting in the archive… 

Q9) My perspective comes from building a web archive for European institutions… And can the archive live… Do we need legal notice on the archive, disclaimers, our method… How do we ensure people do not misinterpret what we do. How do we make the process of archiving more transparent. 

JB: That question of who has resources to access web archives is important. It is a responsibility of institutions like ours… To ensure even small collections can be accessed, that researchers and citizens are empowered with skills to query the archive, and things like APIs to enable that too… The other question on evidencing curatorial decisions – we are notoriously poor at that historically… But there is a lot of technological mystery there that we should demystify for users… All sorts of complexity there… The web archiving needs to work on that provenance information over the next few years… 

AJ: We do try to record this but as Jefferson said much of this is computational and algorithmic… So we maybe need to describe that better for wider audiences… That’s a bigger issue anyway, that understanding of algorithmic process. At the British Library we are fortunate to have capacity for text mining our own archives… We will be doing more than that… It will be small at first… But as it’s hard to bring data to the queries, we must bring queries to the archive. 

JW: I think it is so hard to think ahead to the long term… You’ll never pre-empt all usage… You just have to do the best that you can. 

VS: You won’t collect everything, every time… The web archive is not an exact mirror… It is “reborn digital heritage”… We have to document everything, but we can try to give some digital literacy to students so they have a way to access the web archive and engage with it… 

EG: Time is up, Thank you our panellists for this fantastic session. 

Jun 14 2017

From today until Friday I will be at the International Internet Preservation Coalition (IIPC) Web Archiving Conference 2017, which is being held jointly with the second RESAW: Research Infrastructure for the Study of Archived Web Materials Conference. I’ll be attending the main strand at the School of Advanced Study, University of London, today and Friday, and at the technical strand (at the British Library) on Thursday. I’m here wearing my “Reference Rot in Theses: A HiberActive Pilot” – aka “HiberActive” – hat. HiberActive is looking at how we can better enable PhD candidates to archive web materials they are using in their research, and citing in their thesis. I’m managing the project and working with developers, library and information services stakeholders, and a fab team of five postgraduate interns who are, whilst I’m here, out and about around the University of Edinburgh talking to PhD students to find out how they collect, manage and cite their web references, and what issues they may be having with “reference rot” – content that changes, decays, disappears, etc. We will have a webpage for the project and some further information to share soon but if you are interested in finding out more, leave me a comment below or email me: nicola.osborne@ed.ac.uk. These notes are being taken live so, as usual for my liveblogs, I welcome corrections, additions, comment etc. (and, as usual, you’ll see the structure of the day appearing below with notes added at each session). 

Opening remarks: Jane Winters

This event follows the first RESAW event which took place in Aarhus last year. This year we again highlight the huge range of work being undertaken with web archives. 

This year a few things are different… Firstly we are holding this with the IIPC, which means we can run the event over 3 days, and means we can bring together librarians, archivists, and data scientists. The BL have been involved and we are very grateful for their input. We are also excited to have a public event this evening, highlighting the increasingly public nature of web archiving. 

Opening remarks: Nicholas Taylor

On behalf of the IIPC Programme Committee I am hugely grateful to colleagues here at the School of Advanced Studies and at the British Library for being flexible and accommodating us. I would also like to thank colleagues in Portugal, and hope a future meeting will take place there as had been originally planned for IIPC.

For us we have seen the Web Archiving Conference as an increasingly public way to explore web archiving practice. The programme committee saw a great increase in submissions, requiring a larger than usual commitment from the programming committee. We are lucky to have this opportunity to connect as an international community of practice, to build connections to new members of the community, and to celebrate what you do. 

Opening plenary: Leah Lievrouw – Web history and the landscape of communication/media research (Chair: Nicholas Taylor)

I intend to go through some context in media studies. I know this is a mixed audience… I am from the Department of Information Studies at UCLA and we have a very polyglot organisation – we can never assume that we all understand each others backgrounds and contexts. 

A lot about the web, and web archiving, is changing, so I am hoping that we will get some Q&A going about how we address some gaps in possible approaches. 

I’ll begin by saying that it has been some time now that computers, as communication devices, have been seen as a medium. This seems commonplace now, but when I was in college this was seen as fringe, in communication research, in the US at least. But for years documentarists, engineers, programmers and designers have seen information resources, data and computing as tools and sites for imagining, building, and defending “new” societies; enacting emancipatory cultures and politics… A sort of Alexandrian vision of “all the knowledge in the world”. This is still part of the idea that we have in web archiving. Back in the day the idea was that fostering this kind of knowledge would bring about internationality, world peace, modernity. When you look at old images you see artefacts – it is more than information, it is the materiality of artefacts. I am a contributor to Niels’ web archiving handbook, and he talks about history written of the web, and history written with the web. So there are attempts to write history with the web, but what about the tools themselves? 

So, this idea about connections between bits of knowledge… This goes back before browsers. Many of you will be familiar with H.G. Wells’ World Brain; Suzanne Briet’s Qu’est-ce que la documentation? (1951) is a very influential work in this space; Jennifer Light wrote a wonderful book on Cold War intellectuals, and their relationship to networked information… One of my lecturers was one of these in fact, thinking about networked cities… Vannevar Bush’s “As we may think” (1945) saw information as essential to order and society. 

Another piece I often teach, J.C.R. Licklider and Robert W. Taylor (1968) in “The computer as a communication device”, talked about computers communicating but not in the same ways that humans make meaning. In fact this graphic shows a man’s computer talking to an insurance salesman saying “he’s out” and the caption “your computer will know what is important to you and buffer you from the outside world”.

We then have this counterculture movement in California in the 1960s and 1970s… And that feeds into the emerging tech culture. We have The Well coming out of this. Stewart Brand wrote The Whole Earth Catalog (1968-78). And actually in 2012 someone wrote a new Whole Earth Catalog… 

Ted Nelson, Computer Lib/Dream Machines (1974), is known as the person who came up with the concept of the link, between computers, to information… He’s an inventor essentially. Computer Lib/Dream Machines was a self-published title, a manifesto… The subtitle for Computer Lib was “you can and must understand computers NOW”. Counterculture was another element, and this is way before the web, where people were talking about networked information… But these people were not thinking about preservation and archiving, though there was an assumption that information would be kept… 

And then as we see information utilities and wired cities emerging, mainly around cable TV but also local public access TV… There was a lot of capacity for information communication… In the UK you had teletext, in Canada there was Telidon… And you were able to start thinking about information distribution wider and more diverse than central broadcasters… With services like LexisNexis emerging we had these ideas of information utilities… There was a lot of interest in the 1980s, and back in the 1970s too. 

Harold Sackman and Norman Nie (eds.), The Information Utility and Social Choice (1970); H.G. Bradley, H.S. Dordick and B. Nanus, The Emerging Network Marketplace (1980); R.S. Block, “A global information utility”, The Futurist (1984); W.H. Dutton, J.G. Blumler and K.L. Kraemer, “Wired cities: shaping the future of communications” (1987).

This new medium looked more like point-to-point communication, like the telephone. But no-one was studying that. There were communications scholars looking at face to face communication, and at media, but not at this on the whole. 

Now, that’s some background, I want to periodise a bit here… And I realise that is a risk of course… 

So, we have the pre-browser internet (early 1980s-1990s). Here the emphasis was on access – to information, expertise and content – at the centre of early versions of “information utilities”, “wired cities” etc. This was about everyone having access – coming from that counterculture place. More people needed more access, more bandwidth, more information. There were a lot of digital materials already out there… But they were fiddly to get at. 

Now, when the internet became privatised – moved away from military and universities – the old model of markets and selling information to mass markets, the transmission model, reemerged. But there was also this idea that because the internet was point-to-point – and any point could get to any other point… And that everyone would eventually be on the internet… The vision was of the internet as “inherently democratic”. Now we recognise the complexity of that right now, but that was the vision then. 

Post-browser internet (early 1990s to mid-2000s) – this was about web 1.0. Browsers and the WWW were designed to search and retrieve documents, discrete kinds of files, to access online documents. I’ve said “Web 1.0” but had a good conversation with a colleague yesterday who isn’t convinced about these kinds of labels, but I find them useful shorthand for thinking about the web at particular points in time/use. In this era we had email still but other types of authoring tools arose… Encouraging a wave of “user generated content” – wikis, blogs, tagging, media production and publishing, social networking sites. This sounds such a dated term now but it did change who could produce and create media, and it was the term around this time. 

Then we began to see Web 2.0 with the rise of “smart phones” in the mid-2000s, merging mobile telephony and specialised web-based mobile applications, accelerating user content production and social media profiling. And the rise of social networking sounded a little weird to those of us with sociology training who were used to these terms from the real world, from social network analysis. But Facebook is a social network. Many of the tools, blogging for example, can be seen as having a kind of mass media quality – so instead of a movie studio making content… I can have my blog, which may have an audience of millions or maybe just, like, 12 people. But that is highly personal. Indeed one of the earliest so-called “killer apps” for the internet was email. Instead of shipping data around for processing – as the architecture originally got set up for – you could send a short note to your friend elsewhere… Email hasn’t changed much. That point-to-point communication suddenly and unexpectedly became more than half of the ARPANET. Many people were surprised by that. That pattern of interpersonal communication over networks continued to repeat itself – we see it with Facebook, Twitter, and even with blogs etc. that have feedback/comments etc. 

Web 2.0 is often talked about as social driven. But what is important from a sociology perspective, is the participation, and the participation of user generated communities. And actually that continues to be a challenge, it continues to be not the thing the architecture was for… 

In the last decade we’ve seen algorithmic media emerging, and the rise of “web 3.0”. Both access and participation are appropriated as commodities to be monitored, captured, analyzed, monetised and sold back to individuals, reconceived as data subjects. Everything is thought about as data, data that can be stored, accessed… Access itself, the action people take to stay in touch with each other… We all carry around monitoring devices every day… At UCLA we are looking at the concept of the “data subject”. Bruce ? used to talk about the “data footprint” or the “data cloud”. We are at a moment where we are increasingly aware of being data subjects. London is one of the most remarkable cities in the world in terms of surveillance… The UK in general, but London in particular… And that is ok culturally; I’m not sure it would be in the United States. 

We did some work at UCLA to get students to map the surveillance cameras – who controlled them, who had set them up, how many there were… Neither campus police nor the university knew. That was eye opening. Our students were horrified at this – but that’s an American cultural reaction. 

But if we conceive of our own connections to each other, to government, etc. as “data” we begin to think of ourselves, and everything, as “things”. Right now systems and governance are maximising the market: institutional and government surveillance; unrestricted access to user data; moves towards real-time flows rather than “stocks” of documents or content. Surveillance isn’t just about government – supermarkets are some of our most surveilled spaces. 

I currently have students working on a “name domain infrastructure” project. The idea is that data will be enclosed, that data is time-based, to replace the IP, the Internet Protocol. So that rather than packages, data is flowing all the time. So that it would be like opening the nearest tap to get water. One of the interests here is from the movie and television industry, particularly web streaming services who occupy significant percentages of bandwidth now… 

There are a lot of ways to talk about this, to conceive of this… 

1.0 tend to be about documents, press, publishing, texts, search, retrieval, circulation, access, reception, production-consumption: content. 

2.0 is about conversations, relationships, peers, interaction, communities, play – as a cooperative and flow experience, mobility, social media (though I rebel against that somewhat): social networks. 

3.0 is about algorithms, “clouds” (as fluffy benevolent things, rather than real and problematic, with physical spaces, server farms), “internet of things”, aggregation, sensing, visualisation, visibility, personalisation, self as data subject, ecosystems, surveillance, interoperability, flows: big data, algorithmic media. Surveillance is kind of the environment we live in. 

Now I want to talk a little about traditions in communication studies.. 

In communication, broadly and historically speaking, there has been one school of thought that is broadly social scientific, from sociology and communications research, that thinks about how technologies are “used” for expression, interaction, as data sources or analytic tools. Looking at media in terms of their effects on what people know or do, can look at media as data sources, but usually it is about their use. 

There are theories of interaction, group process and influence; communities and networks; semantic, topical and content studies; law, policy and regulation of systems/political economy. One key question we might ask here: “what difference does the web make as a medium/milieu for communicative action, relations, interaction, organising, institutional formation and change?” Those from a science and technology studies background might know about the issues of shaping – we shape technology and technology shapes us. 

Then there is the more cultural/critical/humanist or media studies approach. When I come to the UK people who do media studies still think of humanist studies as being different, “what people do”. However this approach of cultural/critical/etc. is about analyses of digital technologies and web; design, affordances, contexts, consequences – philosophical, historical, critical lens. How power is distributed are important in this tradition. 

In terms of theoretical schools, we have the Toronto School/media ecology – the Marshall McLuhan take – which is very much about the media itself; American cultural studies, and the work of James Carey and his students; the Birmingham school – the British take on media studies; and new materialism – that you see in Digital Humanities, German Media Studies, that says we have gone too far from the roles of the materials themselves. So, we might ask “What is the web itself (social and technical constituents) as both medium and product of culture, under what conditions, times and places?”

So, what are the implications for Web Archiving? Well I hope we can discuss this, thinking about a table of:

Web Phase | Soc sci/admin | Crit/Cultural

  • Documents: content + access
  • Conversation: Social nets + participation
  • Data/Algorithms: algorithmic media + data subjects

Comment: I was wondering about ArXiv and the move to sharing multiple versions, pre-prints, post prints…

Leah: That issue of changes in publication, what preprints mean for who is paid for what, that’s certainly changing things and an interesting question here…

Comment: If we think of the web moving from documents, towards fluid state, social networks… It becomes interesting… Where are the boundaries of web archiving? What is a web archiving object? Or is it not an object but an assemblage? Also ethics of this…

Leah: It is an interesting move from the concrete, the material… And then this whole cultural heritage question, what does it instantiate, what evidence is it, whose evidence is it? And do we participate in hardening those boundaries… Or do we keep them open… How porous are our boundaries…

Comment: What about the role of metadata?

Leah: Sure, arguably the metadata is the most important thing… What we say about it, what we define it as… And that issue of fluidity… We think of metadata as having some sort of fixity… One thing that has begun to emerge in surveillance contexts… Where law enforcement says “we aren’t looking at your content, just the metadata”, well it turns out that is highly personally identifiable, it’s the added value… What happens when that secondary data becomes the most important thing… In fact, where many of our data systems do not communicate with each other, those connections are through the metadata (only).

Comment: In terms of web archiving… As you go from documents, to conversations, to algorithms… Archiving becomes so much more complex. Particularly where interactions are involved… You can archive the data and the algorithm but you still can’t capture the interactions there…

Leah: Absolutely. As we move towards the algorithmic level its not a fixed thing. You can’t just capture the Google search algorithms, they change all the time. The more I look at this work through the lens of algorithms and data flows, there is no object in the classic sense…

Comment: Perhaps, like a movie, we need longer temporal snapshots…

Leah: Like the algorithmic equivalent of persistence of vision. Yes, I think that’s really interesting.

And with that the opening session is over, with organisers noted that those interested in surveillance may be interested to know that Room 101, said to have inspired the room of the same name in 1984, is where we are having coffee…

Session 1B (Chair: Marie Chouleur, National Library of France):

Jefferson Bailey (Deputy chair of IIPC, Director of Web Archiving, Internet Archive): Advancing access and interface for research use of web archives

I would like to thank all of the organisers again. I’ll be giving a broad rather than deep overview of what the Internet Archive is doing at the moment.

For those that don’t know, we are a non-profit Digital Library and Archive founded in 1996. We work in a former church and it’s awesome – you are welcome to visit, and we do open public lunches every Friday if you are ever in San Francisco. We have lots of open source technology and we are very technology-driven.

People always ask about stats… We are at 30 Petabytes plus multiple copies right now, including 560 billion URLs, 280 billion webpages. We archive about 1 billion URLs per week, and have partners and facilities around the world, including here in the UK where we have Wellcome Trust support.

So, searching… This is the WayBackMachine. Most of our traffic – 75% – is automatically directed to the new service. So, if you search for, say, UK Parliament, you’ll see the screenshots, the URLs, and some statistics on what is there and captured. So, how does it work? How do you do full text search with that much data? Even the raw text (not HTML) is 3-5 PB. So, we figured the most instructive and easiest text to work with is the anchor text of all in-bound links to a homepage. The index text covers 443 million homepages, drawn from 900B in-bound links from other cross-domain websites. Is that perfect? No, but it’s the best that works on this scale of data… And people tend to make keyword type searches, which this works for.
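The anchor-text approach described above can be sketched in miniature: index each homepage by the tokens of its in-bound link anchor text, then answer keyword queries from that index rather than from the full page text. The link data below is invented for illustration – the real index is built from around 900B in-bound links.

```python
from collections import defaultdict

# Toy in-bound link data: (target homepage, anchor text of a link to it).
# These examples are made up purely to illustrate the indexing idea.
inbound_links = [
    ("https://www.parliament.uk/", "UK Parliament"),
    ("https://www.parliament.uk/", "House of Commons"),
    ("https://www.bl.uk/", "British Library"),
]

def build_anchor_index(links):
    """Map each lower-cased anchor-text token to the set of target homepages."""
    index = defaultdict(set)
    for target, anchor in links:
        for token in anchor.lower().split():
            index[token].add(target)
    return index

index = build_anchor_index(inbound_links)
print(sorted(index["parliament"]))  # → ['https://www.parliament.uk/']
```

The appeal of the technique is that anchor text is effectively a crowd-sourced description of a page, so short keyword queries match it well even though no page body text is indexed.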

You can also now, in the new Way Back Machine, see a summary tab which includes a visualisation of data captured for that page, host, domain, MIME-type or MIME-type category. It’s really fun to play with. It’s really cool information to work with. That information is in the Way Back Machine (WBM) for 4.5 billion hosts; 256 million domains; 1,238 TLDs. There are also special collections – building this for specific crawls/collections such as our .gov collection. And there is an API – so you can create your own visualisations if you like.
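Capture data like this can be pulled from the public Wayback CDX server API. A minimal sketch: build a query URL from the documented `url`/`output`/`limit` parameters, then parse the JSON response, whose first row is the field names. The sample response here is hand-written to stand in for a live call.

```python
import json
from urllib.parse import urlencode

def cdx_query_url(url, limit=5):
    """Build a Wayback CDX server query URL for captures of `url`."""
    params = {"url": url, "output": "json", "limit": limit}
    return "http://web.archive.org/cdx/search/cdx?" + urlencode(params)

# With output=json the server returns a list of rows, the first row being
# the field names. This sample is hand-written for illustration only.
sample = json.loads("""[
  ["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
  ["uk,parliament)/","20170601120000","http://www.parliament.uk/","text/html","200","ABCDEF","12345"]
]""")

def rows_as_dicts(response):
    """Zip the header row onto each capture row."""
    header, *rows = response
    return [dict(zip(header, row)) for row in rows]

captures = rows_as_dicts(sample)
print(captures[0]["timestamp"])  # → "20170601120000"
```

Check the current CDX server documentation before relying on parameter names; the API also supports filtering and pagination options not shown here.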

We have also created a full text search for AIT (Archive-It). This was part of a total rebuild of full text search in Elasticsearch: 6.5 billion documents with a 52 TB full text index. In total AIT is 23 billion documents and 1 PB. Searches are across all 8000+ collections. We have improved relevance ranking, metadata search, and performance. And we have a Media Search coming – it’s still a test at present. So you can search non-textual content with a similar process.

So, how can we help people find things better… search, full text search… And APIs. The APIs power the detail charts: capture counts, year, size, new captures, domains/hosts. Explore that more and see what you can do. We’ve also been looking at Data Transfer APIs to standardise transfer specifications for web data exchange between repositories for preservation. For research use you can submit “jobs” to create derivative datasets from WARCs from specific collections. And it allows programmatic access to AIT WARCs, submission of a job, job status, derivative results list. More at: https://github.com/WASAPI-Community/data-transfer-apis.
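A rough sketch of what a WASAPI-style webdata query might look like. The base URL and the `crawl-start-after` parameter below reflect my reading of the Archive-It implementation of the spec linked above; treat both as assumptions and consult the repository for the authoritative endpoint names.

```python
from urllib.parse import urlencode

# Assumed Archive-It WASAPI endpoint - verify against the published spec.
BASE = "https://partner.archive-it.org/wasapi/v1/webdata"

def wasapi_webdata_url(collection, crawl_start_after=None):
    """Build a query URL listing WARC files for one Archive-It collection.

    `crawl-start-after` (an assumed filter parameter) restricts results to
    crawls started after the given date.
    """
    params = {"collection": collection}
    if crawl_start_after:
        params["crawl-start-after"] = crawl_start_after
    return BASE + "?" + urlencode(params)

print(wasapi_webdata_url(8000, "2016-07-01"))
```

The point of the standard is that a client written against one repository's WASAPI endpoint should work, with only the base URL changed, against another's – which is what makes it useful for preservation-oriented data exchange between archives.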

In other API news we have been working with WAT files – a sort of metadata file derived from a WARC. This includes headers and content (title, anchor/text, metas, links). We have API access to some capture content – a better way to get programmatic access to the content itself. So we have a test build on a 100 TB WARC set (EOT). It’s like the CDX API with a build – it replays WATs not WARCs (see: http://vinay-dev.us.archive.org:8080/eot2016/20170125090436/http://house.gov/). You can analyse, for example, term counts across the data.
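WAT records wrap that per-capture metadata (title, links, anchor text) in nested JSON. A minimal sketch of pulling link/anchor pairs out of one record follows; the sample payload is hand-written, following the `Envelope`/`Payload-Metadata` structure WAT files use as I understand it, and real records carry many more fields.

```python
import json

# Hand-written sample WAT payload for illustration; real WAT records are
# generated from WARCs and are much richer than this.
sample_wat = json.loads("""{
  "Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
    "HTML-Metadata": {
      "Head": {"Title": "Example page"},
      "Links": [
        {"path": "A@/href", "url": "http://house.gov/", "text": "US House"},
        {"path": "A@/href", "url": "http://senate.gov/", "text": "US Senate"}
      ]
    }}}}
}""")

def extract_links(wat_record):
    """Return (url, anchor text) pairs from one parsed WAT record."""
    html = (wat_record["Envelope"]["Payload-Metadata"]
            ["HTTP-Response-Metadata"]["HTML-Metadata"])
    return [(l.get("url"), l.get("text", "")) for l in html.get("Links", [])]

print(extract_links(sample_wat))
```

Because a WAT carries links and anchor text without the page bodies, derivatives like the anchor-text search index or term counts can be computed from WATs at a fraction of the size of the full WARC set.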

In terms of analysing language we have new CDX codes to help identify languages. You can visualise this data, see the language of the texts, etc. A lot of our content right now is in English – we need less focus on English in the archive.

We are always interested in working with researchers on building archives, not just using them. So we are working on the News Measures Research Project. We are looking at 663 local news sites representing 100 communities. 7 crawls for a composite week (July-September 2016).

We are also working with a Katrina blogs project: after the research was done and the project was published, we created a special collection of the sites used so that they can be accessed and explored.

And in fact we are generally looking at ways to create useful sub-collections and ways to explore content. For instance Gif Cities is a way to search for gifs from Geocities. We have a Military Industrial Powerpoint Complex, turning PPTs into PDFs and creating a special collection.

We did a new collection, with a dedicated portal (https://www.webharvest.gov), which archives the US Congress for NARA. We capture this every 2 years, and it also raised questions of indexing YouTube videos.

We are also looking at historical ccTLD Wayback Machines, built on IA global crawls with added historic web data, keyword and mime/format search, embed linkback, domain stats and special features. This gives a German view – from the .de domain – of the archive.

And we continue to provide data and datasets for people. We love Archives Unleashed – which ran earlier this week. We did an Obama Whitehouse data hackathon recently. We have a webinar on APIs coming very soon.


Q1) What is anchor text?

A1) That’s when you create a link to a page – the text that is associated with that page.

Q2) If you are using anchor text in that keyword search… What happens when the anchor text is just a URL…

A2) We are tokenising all the URLs too. And yes, we are using a kind of PageRank type understanding of popular anchor text.
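A sketch of the kind of URL tokenisation being described – my own illustration, not IA’s actual code:

```python
import re

def tokenise_url(url: str):
    """Split a URL used as anchor text into indexable keywords."""
    # Drop the scheme, then split on any non-alphanumeric character.
    no_scheme = re.sub(r"^[a-z]+://", "", url.lower())
    tokens = [t for t in re.split(r"[^a-z0-9]+", no_scheme) if t]
    # Filter out very common, uninformative tokens.
    return [t for t in tokens if t not in {"www", "com", "org", "html", "http", "https"}]

tokenise_url("https://www.house.gov/committees/index.html")
# → ["house", "gov", "committees", "index"]
```

The resulting tokens can then be weighted like any other anchor text, which is where the PageRank-style popularity signal mentioned above comes in.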

Q3) Is that TLD work.. Do you plan to offer that for all that ask for all top level domains?

A3) Yes! Because subsets are small enough that they allow search in a more manageable way… We basically build a new CDX for each of these…

Q4) What are issues you are facing with data protection challenges and archiving in the last few years… Concerns about storing data with privacy considerations.

A4) No problems for us. We operate as a library… The Wayback Machine is used in courts, but not by us – in US courts it’s recognised as something you can use in court.

Panel: Internet and Web Histories – Niels Bruger – Chair (NB); Marc Weber (MW); Steve Jones (SJ); Jane Winters (JW)

We are going to talk about the internet and the web, and also to talk about the new journal, Internet Histories, which I am editing. The new journal addresses what my colleagues and I saw as a gap. On the one hand there are journals like New Media and Society and Internet Studies which are great, but rarely focus on history. And media history journals are excellent but rarely look at web history. We felt there was a gap there… And Taylor & Francis Routledge agreed with us… The inaugural issue is a double issue 1-2, and people on our panel today are authors in our first journal, and we asked them to address six key questions from members of our international editorial board.

For this panel we will have an argument, counter-statement, and questions from the floor format.

A Common Language – Marc Weber

This journal has been a long time coming… I am Curatorial Director, Internet History Program, Computer History Museum. We have been going for a while now. This Internet History program was probably the first one of its kind in a museum.

When I first said I was looking at the history of the web in the mid ’90s, people were puzzled… Now most people have moved to incurious acceptance. Until recently there was also tepid interest from researchers. But in the last few years has reached critical mass – and this journal is a marker of this change.

We have this idea of a common language, the sharing of knowledge. For a long time my own perspective was mostly focused on the web, it was only when I started the Internet History program that I thought about the fuller sweep of cyberspace. We come in through one path or thread, and it can be (too) easy to only focus on that… The first major networks, the ARPAnet was there and has become the internet. Telenet was one of the most important commercial networks in the 1970s, but who here now remembers Anne Reid of Telenet? [no-one] And by contrast, what about Vint Cerf [some]. However, we need to understand what changed, what did not succeed in the long term, how things changed and shifted over time…

We are kind of in the Victorian era of the internet… We have 170 years of telephones, 60 years of going online… longer of imagining a connected world. Our internet history goes back to the 1840s and the telegraph. And a useful thought here: “The past isn’t over. It isn’t even past” – William Faulkner. Of this history only small portions are preserved properly. There are risks in not having a collective narrative, and in not understanding particular aspects in proper context. There is also scope for new types of approaches and work, not just applying traditional approaches to the web.

There is a risk of a digital dark age – we have a film to illustrate this at the museum, although I don’t think this crowd needs persuading of the importance of preserving the web.

So, going forward… We need to treat history and preservation as something to do quickly, we cannot go back and find materials later…

Response – Jane Winters

Marc makes, I think convincingly, the case for a common language, and for understanding the preceding and surrounding technologies – why they failed and their commercial, political and social contexts. And I agree with the importance of capturing that history, with oral history a key means to do this. Secondly, the call to look beyond your own interest or discipline – interdisciplinary research is always challenging, in the best sense, and can be hugely rewarding when done well.

Understanding the history of the internet and its context is important, although I think we see too many comparisons with early printing. Although some of those views are useful… I think there is real importance in getting to grips with these histories now, not in a decade or two. Key decisions will be made, from net neutrality to mass surveillance, and right now the understanding and analysis of the issues is not sophisticated – such as the incompatibility of “back doors” and secure internet use. And as researchers we risk focusing on the content, not the infrastructure. I think we need a new interdisciplinary research network, and we have all the right people gathered here…


Q1) Mark, as you are from a museum… Have you any thoughts about how you present the archived web, the interface between the visitor and the content you preserve.

A1) What we do now with the current exhibits… the star isn’t the objects, it is the screen. We do archive some websites – but we don’t try to replicate the Internet Archive, though we do work with them on some projects, including the GeoCities exhibition. When you get to things that require emulation or live data, we want live and interactive versions that can be accessed online.

Q2) I’m a linguist and was intrigued by the interdisciplinary collaboration suggested… How do you see linguists and the language of the web fitting in…

A2) Actually there is a postdoc – Naomi – looking at how different language communities in the UK have engaged through looking at the UK Web Archive, seeing how language has shaped their experience and change in moving to a new country. We are definitely thinking about this and it’s a really interesting opportunity.

Out from the PLATO Cave: Uncovering the pre-Internet history of social computing – Steve Jones, University of Illinois at Chicago

I think you will have gathered that there is no one history of the internet. PLATO was a space for education and for my interest it also became a social space, and a platform for online gaming. These uses were spontaneous rather than centrally led. PLATO was an acronym for Programmed Logic for Automatic Teaching Operations (see diagram in Ted Nelson’s Dream Machine publication and https://en.wikipedia.org/wiki/PLATO_(computer_system)).
There were two key interests in developing for PLATO – one was multi-player games, and the other was communication. And the latter was due to laziness… Originally the PLATO lab was in a large room, and we couldn’t be bothered to walk to each other’s desks. So “Talk” was created – and that saved standard messages so you didn’t have to say the same thing twice!

As time went on, I undertook undergraduate biology studies and engaged with the Internet, and saw that interaction as similar… At that time data storage was so expensive that storing content in perpetuity seemed absurd… If it was kept it’s because you hadn’t got to writing it yet. You would print out code – then rekey it – that was possible at the time given the number of lines per programme. So, in addition to the materials that were missing… there were boxes of ledger-size green bar printouts from a particular PLATO Notes group of developers. Having found this in the archive I took pictures to OCR – that didn’t work! I got – brilliantly and terribly – funding to preserve that text. That content can now be viewed side by side in the archive – images next to re-keyed text.

Now, PLATO wasn’t designed for lay users, it was designed for professionals, although it was also used by university and high school students who had the time to play with it. So you saw changes between developer and community values, seeing the development of affordances in the context of the discourse of the developers – that archived set of discussions. The value of that work is to describe and engage with this history not just from our current day perspective, but to understand the context, the people and their discourse at the time.

Response – Marc

PLATO sort of is the perfect example of a system that didn’t survive into the mainstream… Those communities knew each other, the idea of the flatscreen – which led to the laptop – came from PLATO. PLATO had a distinct messaging system, separate from the ARPAnet route. It’s a great corpus to see how this was used – were there flames? What does one-to-many communication look like? It is a wonderful example of the importance of preserving these different threads.. And PLATO was one of the very first spaces not full of only technical people.

PLATO was designed for education, and that meant users were mainly students, and that shaped community and usage. There was a small experiment with community time-sharing memory stores – with terminals in public places… But PLATO began in the late ’60s and ran through into the ’80s; it is the poster child for preserving earlier systems. PLATO Notes became Lotus Notes – that isn’t there now, but in its own domain PLATO was the progenitor of much of what we do with education online now, and that history is also very important.


Q1) I’m so glad, Steve, that you are working on PLATO. I used to work in Medical Education in Texas and we had PLATO terminals to teach basic science to first and second year medical education students, and ER simulations. And my colleagues and I were taught computer instruction around PLATO. I am interested that you wanted to look at discourse around UIC around PLATO – so, what did you find? I only experienced PLATO at the consumer end of the spectrum, so I wondered what the producer end was like…

A1) There are a few papers on this – search for it – but two basic things stand out… (1) the degree to which as a mainframe system PLATO was limited as system, and the conflict between the systems people and the gaming people. The gaming used a lot of the capacity, and although that taxed the system it did also mean they developed better code, showed what PLATO was capable of, and helped with the case for funding and support. So it wasn’t just shut PLATO down, it was a complex 2-way thing; (2) the other thing was around the emergence of community. Almost anyone could sit at a terminal and use the system. There were occasional flare ups and they mirrored community responses even later around flamewars, competition for attention, community norms… Hopefully others will mine that archive too and find some more things.

Digital Humanities – Jane Winters

I’m delighted to have an article in the journal, but I won’t be presenting on this. Instead I want to talk about digital humanities and web archives. There is a great deal of content in web archives but we still see little research engagement in web archives, there are numerous reasons including the continuing work on digitised traditional texts, and slow movement to develop new ways to research. But it is hard to engage with the history of the 21st century without engaging with the web.

The mismatch of the value of web archives and the use and research around the archive was part of what led us to set up a project here in 2014 to equip researchers to use web archives, and encourage others to do the same. For many humanities researchers it will take a long time to move to born-digital resources. And to engage with material that subtly differs for different audiences. There are real challenges to using this data – web archives are big data. As humanities scholars we are focused on the small, the detailed, we can want to filter down… But there is room for a macro historical view too. What Tim Hitchcock calls the “beautiful chaos” of the web.

Exploring the wider context one can see change on many levels – from the individual person or business, to widespread social and political change. How the web changes the language used between users and consumers. You can also track networks, the development of ideas… It is challenging but also offers huge opportunities. Web archives can include newspapers, media, and direct conversation – through social media. There is also visual content, gifs… The increase in use of YouTube and Instagram. Much of this sits outside the scope of web archives, but a lot still does make it in. And these media and archiving challenges will only become more challenging as we see more data… The larger and more uncontrolled the data, the harder the analysis. Keyword searches are challenging at scale. The selection of the archive is not easily understood but is important.

The absence of metadata is another challenge too. The absence of metadata or alternative text can render images, particularly, invisible. And the mix of formats and types, of the personal and the public, is most difficult but also most important. For instance the announcement of a government policy, the discussion around it, a petition perhaps, a debate in parliament… These are not easy to locate… Our histories are almost inherently online… But they only gain any real permanence through preservation in web archives, and that’s why humanists and historians really need to engage with them.

Response – Steve

I particularly want to talk about archiving in scholarship. In order to fit archiving into scholarly models… administrators increasingly make the case for scholarship in the context of employment and value. But archive work is important. Scholars are discouraged from this sort of work because it is not quick, and it’s harder to be published… Separately, you need organisations to engage in the preservation of their own online presences. The degree to which archive work is needed is not reflected by promotion committees, organisational support, local archiving processes. There are immense rhetorical challenges here, to persuade others of the value of this work. There have been successful cases made to encourage telephone providers to capture and share historical information. I was at a telephone museum recently and asked about the archive… The woman there handed me a huge book on the founding of Southwestern Bell, published in a very small run… She gave me a copy, but no-one had asked about this before… That’s wrong though, it should be captured. So we can do some preservation work ourselves just by asking!


Q1) Jane, you mentioned a skills gap for humanities researchers. What sort of skills do they need?

A1) I think the complete lack of quantitative data training, how to sample, how to make meaning from quantitative data. They have never been engaged in statistical training. They have never been required to do it – you specialise so early here. Also, basic command line stuff… People don’t understand that or why they have to engage that way. Those are two simple starting points. Those help them understand what they are looking at, what an ngram means, etc.

Session 2B (Chair: Tom Storrar)

Philip Webster, Claire Newing, Paul Clough & Gianluca Demartini: A temporal exploration of the composition of the UK Government Web Archive

I’m afraid I’ve come into this session a little late. I have come in at the point that Philip and Claire are talking about the composition of the archive – mostly 2008 onwards – and looking at status codes of UK Government Web Archive. 

Philip: The hypothesis in looking at HTTP status codes was that changes in government would show up as trends in the status codes. Actually, when we looked at post-2008 data we didn’t see what we expected there. However we did find that there was an increase in not finding what was requested – and thought this may be about moving to dynamic pages – but this is not a strong trend.

In terms of MIME types – media types – the analysis was restricted to the following categories:

Application – flash, java, Microsoft Office documents. Here we saw trends away from PDF as the dominant format. Microsoft Word increases, and we see the increased use of Atom – syndication – coming across.

Executable – we see quite a lot of javascript. The importance of Flash decreased over time – which we expected – and there was an increase in JavaScript (the javascript and x-javascript MIME types).

Document – PDF remains prevalent. Also MS Word, some MS Excel. Open formats haven’t really taken hold…

Claire: The Government Digital Strategy included guidance to use open document formats as much as possible, but that wasn’t mandated until late 2014 – a bit too late for our data set unfortunately. But the Government Digital Strategy in 2011 was, itself, published in Word and PDF itself!

Philip: If we take document type outside of PDFs you see that lack of open formats more clearly..

Image – This includes images appearing in documents, plus icons. And occasionally you see non-standard media types associated with the MIME types. Jpeg use is fairly consistent. Gif and Png are comparable… Gif was being phased out for IP reasons, with Png to replace it, and you see that change over time…

Text – Text is almost all HTML. You see a lot of plain text, stylesheets, XML…

Video – we saw compressed video formats… but these were gradually superseded by embedded YouTube links. However we do still see a lot of Flash video retained. And we see large, increasing use of MP4, used by Apple devices.

Another thing that is available over time is relative file sizes. However the CDX index only contains compressed size data and therefore is not a true representation of file size trends. So you can’t compare images to their pre-archiving version. That means for this work we’ve limited the data set to those where you can tell the before and after status of the image files. We saw some spikes in compressed image formats over time; it’s not clear if this shows departmental issues…

To finish on a high note… There is an increase in the use of https rather than http. I thought it might be the result of a campaign, but it seems to be a general trend..

The conclusion… Yes, it is possible to do temporal analysis of CDX index data but you have to be careful, looking at proportion rather than raw frequency. SQL is feasible, commonly available and low cost. Archive data has particular weaknesses – data cannot be assumed to be fully representative, but in some cases trends can be identified.
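As an illustration of that “proportions, not raw frequencies” point, here’s a minimal sketch with a toy CDX-like table in SQLite – the field names and figures are invented, and a real CDX index has many more columns and billions of rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdx (year INTEGER, mime TEXT)")
conn.executemany("INSERT INTO cdx VALUES (?, ?)", [
    (2010, "application/pdf"), (2010, "text/html"), (2010, "text/html"),
    (2014, "application/pdf"), (2014, "text/html"), (2014, "text/html"),
    (2014, "text/html"),
])

# Proportion of PDF captures per year, not the raw count - so growth in
# the overall archive doesn't masquerade as a format trend.
rows = conn.execute("""
    SELECT year,
           1.0 * SUM(mime = 'application/pdf') / COUNT(*) AS pdf_share
    FROM cdx GROUP BY year ORDER BY year
""").fetchall()
```

With the toy data, `rows` comes back as roughly a third PDF for 2010 and a quarter for 2014 – the per-year share, regardless of how many captures each year holds.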


Q1) Very interesting, thank you. Can I understand… You are studying the whole archive? How do you take account of having more than one copy of the same data over time?

A1) There is a risk of one website being overrepresented in the archive. There are checks that can be done… But that is more computationally expensive…

Q2) With the seed list, is that generating the 404 rather than actual broken links?

A2 – Claire) We crawl by asking the crawler to go out to find links and seed from that. It generally looks within the domain we’ve asked it to capture…

Q3) At various points you talked about peaks and trends… Have you thought about highlighting that to folks who use your archive so they understand the data?

A3 – Claire) We are looking at how we can do that more. I have read about historians’ interest in the origins of the collection, and we are thinking about this, but we haven’t done that yet.

Caroline Nyvang, Thomas Hvid Kromann & Eld Zierau: Capturing the web at large – a critique of current web citation practices

Caroline: We are all here as we recognise the importance and relevance of internet research. Our paper looks at web referencing and citation within the sciences. We propose a new format to replace the URL+date format usually recommended. We will talk about a study of web references in 35 Danish master’s theses from the University of Copenhagen, then further work on monograph referencing, then a new citation format.

The 35 master’s theses submitted to the University of Copenhagen included, as a set, 899 web references; there was an average of 26.4 web references per thesis – some had none, the max was 80. This gave us some insight into how students cite URLs. Of those students citing websites: 21% gave the date for all links; 58% had dates for some but not all sites; 22% had no dates. Some of those URLs pointed to homepages or search results.

We looked at web rot and web references – almost 16% could not be accessed by the reader, checked or reproduced. An error rate of 16% isn’t that remarkable – in 1992 a study of 10 journals found that a third of references were inaccurate enough to make it hard to find the source again. But web resources are dynamic and issues will vary, and likely increase over time.

The amount of web references does not seem to correlate with particular subjects. Students are also quite imprecise when they reference websites. And even when the correct format was used 15.5% of all the links would still have been dead.

Thomas: We looked at 10 Danish academic monographs published from 2010-2016. Although this is a small number of titles, it allowed us to see some key trends in the citation of web content. There was a wide range in the number of web citations used – 25% at the top, 0% at the bottom of these titles. The location of web references in these texts is not uniform. On the whole scholars rely on printed scholarly work… But web references are still important. This isn’t a systematic review of these texts… In theory these links should all work.

We wanted to see the status after five years… We used a traffic light system. 34.3% were red – broken, dead, a different page; 20?% were amber – critical links that either refer to changed or at risk material; 44.7% were green – working as expected.
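The study’s judgements were made by hand, but the traffic-light idea could be sketched as a simple rule – the conditions here are my invention, not the paper’s actual criteria:

```python
def traffic_light(status: int, redirected: bool, content_changed: bool) -> str:
    """Classify a checked link: red = broken/dead, amber = changed or
    at-risk material, green = working as expected."""
    if status >= 400 or status == 0:      # dead, broken, missing
        return "red"
    if redirected or content_changed:     # changed or at-risk material
        return "amber"
    return "green"

traffic_light(404, False, False)  # → "red"
traffic_light(200, True, False)   # → "amber"
traffic_light(200, False, False)  # → "green"
```

In practice the amber category is the hard one – deciding whether a page with the same URL still carries the content the author cited needs human judgement, which is exactly why the study did it manually.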

This work showed that web references turn into dead links within a limited number of years. In our work the URLs that go to the front page, with instructions of where to look, actually, ironically, lasted best. Long complex URLs were most at risk… So, what can we do about this…

Eld: We felt that we had to do something here, to address what is needed. We can see from the studies that today’s practice of URLs and date stamps does not work. We need a new standard, a way to reference something stable. The web is a marketplace and changes all the time. We need to look at the web archives… And we need precision and persistency. We felt there were four necessary elements, and we call it the PWID – Persistent Web IDentifier. The four elements are:

  • Archived URL
  • Time of archiving
  • Web archive – for precision, and an indication that you have verified this is what you expect; also for persistency. The researcher has to understand the archive – is it small or large, what is the contextual legislation?
  • Content coverage specification – is it part only? Is it the HTML? Is it the page including images as it appears in your browser? Is it the site including referred pages within the domain?

So we propose a form of reference which can be textually expressed as:

web archive: archive.org, archiving time: 2016-04-20 18:21:47 UTC, archived URL: http://resaw.en/, content coverage: webpage
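Assembling that textual form from the four elements is trivial to script – a small sketch, where the field layout just copies the example above (the separate pwid URI serialisation is still evolving, so I haven’t attempted it):

```python
def pwid_reference(archive, archived_url, archiving_time, coverage):
    """Render the four PWID elements in the proposed textual form."""
    return (f"web archive: {archive}, archiving time: {archiving_time}, "
            f"archived URL: {archived_url}, content coverage: {coverage}")

ref = pwid_reference("archive.org", "http://resaw.en/",
                     "2016-04-20 18:21:47 UTC", "webpage")
```

A citation manager plug-in (Zotero was mentioned) could generate this automatically from an archived snapshot’s metadata.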

But, why not use a web archive URL? Of the form:


Well, this can be hard to read, there is a lot of technology embedded in the URL. It is not as accessible.



This has now been put in as a suggestion for ISO 690 and proposed as a URI type.

To sum up, all research fields need to refer to the web. Good scientific practice cannot take place with current approaches.


Q1) I really enjoyed your presentation… I was wondering what citation format you recommend for content behind paywalls, and for dynamic content – things that are not in the archive.

A1 – Eld) We have proposed this for content in the web archive only. You have to put it into an archive to be sure, then you refer to it. But we haven’t tried to address those issues of paywall and dynamic content. BUT the URI suggestion could refer to closed archives too, not just open archives.

A1 – Caroline) We also wanted to note that this approach is to make web citations align with traditional academic publication citations.

Q2) I think perhaps what you present here is an idealised way to present archiving resources, but what about the marketing and communications challenge here – to better cite websites, and to use this convention when they aren’t even using best practice for web resources.

A2 – Eld) You are talking about marketing to get people to use this, yes? We are starting with the ISO standard… That’s one aspect. I hope also that this event is something that can help promote this and help to support it. We hope to work with different people, like you, to make sure it is used. We have had contact with Zotero for instance. But we are a library… We only have the resources that we have.

Q3) With some archives of the web there can be a challenge for students, for them to actually look at the archive and check what is there..

A3) Firstly citing correctly is key. There are a lot of open archives at the moment… But we hope the next step will be more about closed archives, and ways to engage with these more easily, to find common ground, to ensure we are citing correctly in the first place.

Comment – Nicola Bingham, BL) I like the idea of incentivising not just researchers but also publishers to support web archiving – another point of pressure for web archives… And making the case for openly accessible articles.

Q4) Have you come across Martin Klein and Herbert Van de Sompel’s work on robust links, and Memento?

A4 – Eld) Memento is excellent for finding things, but usually you do not have the archive in there… I don’t think a way of referencing without the archive is a precise reference…

Q5) When you compare to web archive URL, it was the content coverage that seems different – why not offer as an incremental update.

A5) As far as I know that would mean using a # in the URL, and that doesn’t offer that specificity…

Comment) I would suggest you could define the standard for after that # in the URLs to include the content coverage – I’ll take that offline.

Q6) Is there a proposal there… For persistence across organisations, not just one archive.

A6) I think from my perspective there should be a registry, so that when archives change or move you can find the new location. Our persistent identifier isn’t persistent if you can change something. And I think archives must be large organisations, with formal custodians, to ensure persistence.

Comment) I would like to talk offline about content addressing and Linked Data to directly address and connect to copies.

Andrew Jackson: The web archive and the catalogue

I wanted to talk about some bad experiences I had recently… There is a recent BL video of the journey of a (print) collection item – from posting to processing, cataloguing, etc… I have worked at the library for over 10 years, but this year for the first time I had to get to grips with the library catalogue… I’ll talk more about that tomorrow (in the technical strand) but we needed to update our catalogue… accommodating the different ways the catalogue and the archive see content.

Now, that video, the formation of teams, the structure of the organisations, the physical structure of our building is all about that print process, and that catalogue… So it was a surprise for me – maybe not you – that the catalogue isn’t just bibliographic data, it’s also a workflow management tool…

There is a chain of events here… Sometimes events are in a line, sometimes in circles… Always forwards…

Now, last year legal deposit came in for online items… The original digital processing workflow went from acquisition to ingest to cataloguing… But most of the content was already in the archive… We wanted to remove duplication, and make the process more efficient… So we wanted to automate this as a harvesting process.

For our digital work previously we also had a workflow, from nomination, to authorisation, etc… With legal deposit we have to get it all, all the time, all the stuff… So, we don’t collect news items, we want all news sites every day… We might specify crawl targets, but more likely that we’ll see what we’ve had before and draw them in… But this is a dynamic process….

So, our document harvester looks for “watched targets”, harvests, extracts documents for web archiving… and also ingest. There are relationships to acquisition, that feeds into cataloguing and the catalogue. But that is an odd mix of material and metadata. So that’s a process… But webpages change… For print matter things change rarely, it is highly unusual. For the web changes are regular… So how do we bring these things together…
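As I understand it, the watched-target logic boils down to something like this sketch – all the names, and the PDF-only rule, are my invention for illustration:

```python
# Hypothetical watched targets: URL prefixes whose captures should be
# checked for harvestable documents.
WATCHED_TARGETS = ["www.gov.uk/government/publications"]

def harvest_documents(captures):
    """Yield captured URLs that fall under a watched target and look like
    documents worth extracting for cataloguing."""
    for url, mime in captures:
        if any(url.startswith("https://" + t) or url.startswith("http://" + t)
               for t in WATCHED_TARGETS):
            if mime == "application/pdf":
                yield url

found = list(harvest_documents([
    ("https://www.gov.uk/government/publications/report.pdf", "application/pdf"),
    ("https://example.org/page.html", "text/html"),
]))
```

The real pipeline then passes the extracted documents on to ingest and cataloguing, which is where the mix of material and metadata mentioned above comes in.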

To borrow an analogy from our Georeferencing project… Users engage with an editor to help us understand old maps. So, imagine a modern web is a web archive… Then you need information, DOIs, places and entities – perhaps a map. This kind of process allows us to understand the transition from print to online. So we think about this as layers of transformation… Where we can annotate the web archive… Or the main catalogue… That can be replaced each time this is needed. And the web content can, with this approach, be reconstructed with some certainty, later in time…

Also this approach allows us to use rich human curation to better understand that which is being automatically catalogued and organised.

So, in summary: the catalogue tends to focus on chains of operation and backlogs, item by item. The web archive tends to focus on transformation (and re-transformation) of data. A layered data model can bring them together. It means revisiting the data (but fixity checking requires this anyway). It’s costly in terms of disk space required. And it allows rapid exploration and experimentation.
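A toy version of that layered model – an immutable base record plus replaceable annotation layers, merged on read. Purely illustrative; the real model sits over WARC records and catalogue entries:

```python
def merged_view(base: dict, layers: list) -> dict:
    """Later layers override earlier ones; the base record is never mutated,
    so any layer can be regenerated and the view rebuilt later in time."""
    view = dict(base)
    for layer in layers:
        view.update(layer)
    return view

base = {"url": "http://example.gov.uk/", "crawl_date": "20160420"}
catalogue_layer = {"title": "Example Department homepage"}
curation_layer = {"collection": "UK Government"}
view = merged_view(base, [catalogue_layer, curation_layer])
```

Because the base is untouched, an automated re-run (or a fixity check) can replace any layer without losing the human curation held in the others.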

Q1) To what extent is the drive for this your users, versus your colleagues?

A1) The business reason is that it will save us money… Taking away manual work. But, as a side effect we’ve been working with cataloguing colleagues in this area… And their expectations are being raised and changed by this project. I do now much better understand the catalogue. The catalogue tends to focus on tradition not output… So this project has been interesting from this perspective.

Q2) Are you planning to publish that layer model – I think it could be useful elsewhere?

A2) I hope to yes.

Q3) And could this be used in Higher Education research data management?

A3) I have noticed that with research data sets there are some tensions… Some communities use change management, functional programming etc… Hadoop, which we use, requires replacement of data… So yes, but this requires some transformation to do.

We’d like to use the same base data infrastructure for research… otherwise we’d have to maintain parallel patterns of work.

Q4) Your model… suggests WARC files and such archive documents might become part of new views and routes in for discovery.

A4) That’s the idea, for discovery to be decoupled from where you store the file.

Nicola Bingham, UK Web Archive: Resource not in archive: understanding the behaviour, borders and gaps of web archive collections

I will describe the shape and the scope of the UK Web Archive, to give some context for you to explore it… By way of introduction.. We have been archiving the UK Web since 2013, under UK non-print legal deposit. But we’ve also had the Open Archive (since 2004); Legal Deposit Archive (since 2013); and the Jisc Historical Archive (1996-2013).

The UK Web Archive includes around 400 TB of compressed data. And in the region of 11-12 billion records. We grow, on average 60-70 TB per year and 3 B records per year. We want to be comprehensive but, that said, we can’t collect everything and we don’t want to collect everything… Firstly we collect UK websites only. We carry out web archiving under 2013 regulations, and they state that only UK published web content – meaning content on a UK web domain, or by a person whose work occurs in the UK. So, we can automate harvesting from UK TLD (.uk, .scot, .cymru etc); UK hosting – geo-IP loook up to locate server. Then manual checks. So Facebook, WordPress, Twitter cannot be automated…

We only collect published content. Out of scope here are:

  • Film and recorded sound where AV content predominates, e.g. YouTube
  • Private intranets and emails.
  • Social networking sites only available to restricted groups – if you need a login or special permissions, they are out of scope.

Web archiving is expensive. We have to provide good value for money… We crawl the UK domain on an annual basis (only). Some sites are crawled more frequently, but annual misses a lot. We cap domains at 512 MB – which captures many sites in their entirety, but others we only capture part of (unless we override the automatic settings).

There are technical limitations too, around:

  • Database-driven sites – crawlers struggle with these
  • Programming scripts
  • Plug-ins
  • Proprietary file formats
  • Blockers – robots.txt or access denied.

So there are misrepresentations… For instance, with the One Hundred Women blog we capture the content but not the stylesheet – that’s a fairly common limitation.

We also have curatorial input to locate the “important stuff”. In the British Library web archiving is not performed universally by all curators, we rely on those who do engage, usually voluntarily. We try to onboard as many curators and specialist professionals as possible to widen coverage.

So, I’ve talked about gaps and boundaries, but I also want to talk about how the users of the archive find this information, so that even where there are gaps, it’s a little more transparent…

We have the Collection Scoping Document, which captures the scope, motivation, parameters and timeframe of a collection. This document could, in a pared-down form, be made available to end users of the archive.

We have run user testing of our current UK Web Archive website and our new version, and even quite general audiences really wanted as much contextual information as possible. That was particularly important on our current website – where we only share permission-cleared items. But this is one way in which contextual information can be shown in the interface with the collection.

The metadata can be browsed and searched, though users will be directed to come in to view the content itself.

So, an example of a collection would be 1000 Londoners, showing the context of the work.

We also gather information during the crawling process… We capture information on crawler configuration, seed list, exclusions… I understand this could be used and displayed to users to give statistics on the collection…

So, what do we know about what the researchers want? They want as much documentation as they can possibly get. We have engaged with the research community to understand how best to present the data to them. And indeed that’s where your feedback and insight is important. Please do get in touch.


Q1) You said you only collect “published” content… How do you define that?

A1) It’s defined by the legal deposit regulations… The legal deposit libraries may collect content openly available on the web, and also content that is paywalled or behind login credentials – UK publishers are obliged to provide credentials for crawling. BUT how we make that accessible is a different matter – we wouldn’t republish that on the open web without logins/credentials.

Q2) Do you have any ideas about packaging this type of information for users and researchers – more than crawler config files?

A2) The short answer is no… We’d like to invite researchers to access the collection in both a close reading sense, and a big data sense… But I don’t have that many details about that at the moment.

Q3) A practical question: if you know you have to collect something… If you have a web copy of a government publication, say, and the option of the original, older, (digital) document… Is the web archive copy enough, do you have the metadata to use that the right way?

A3) Yes, so on the official publications… This is where the document harvester tool comes into play, adding another layer of metadata to pass the document through various access elements appropriately. We are still dealing with this issue though.

Chris Wemyss – Tracing the Virtual community of Hong Kong Britons through the archived web

I’ve joined this a wee bit late after a fun adventure on the Senate House stairs… 

Looking at the Gwulo: Old Hong Kong site… User content is central to this site, which is centred on a collection of old photographs – buildings, people, landscapes… The website has started to add features to explore categorisations of images, and the site is led by an older British resident. He described subscribers as expats who have moved away, sharing an old version of Hong Kong that no longer exists – one user described it as an interactive photo album… There is clearly more to be done on this phenomenon of building collective resources to construct this type of place. The founder comments on Facebook groups – they are about the now: “you don’t build anything, you just have interesting conversations”.

A third example then, Swire Mariners Association. This site has been running, nearly unchanged, for 17 years, but they have a very active forum, a very active Facebook group. These are all former dockyard workers, they meet every year, it is a close knit community but that isn’t totally represented on the web – they care about the community that has been constructed, not the website for others.

So, in conclusion, archives are useful in some cases. Using oral history and web archives together is powerful where it is possible to speak to website founders or members, to understand how and why things have changed over time. Seeing that change over time already gives some idea of the futures people want to see. And these sites indicate the demand for communities, for active societies, long after they are formed, and illustrate how people utilise the web for community memory…


Q1) You’ve raised a problem I hadn’t really thought about. How can you tell if they are more active on Facebook or the website… How do you approach that?

A1) I have used web archiving as one source to arrange other things around… Looking for new websites, finding and joining the Facebook group, finding interviewees to ask about that. But I wouldn’t have been prompted to ask about the website and its change/lack of change without consulting the web archives.

Q2) Were participants aware that their pages were in the archive?

A2) No, not at all. The blog I showed first was started by two guys, and Gwulo is run by one guy… And he quite liked the idea that this site would live on in the future.

David Geiringer & James Baker: The home computer and networked technology: encounters in the Mass Observation Project archive, 1991-2004

I have been doing web history work on various communities, including some work on GeoCities which is coming out soon… And I heard about the Mass Observation Project which, from 1991 to 2004, asked respondents about computers and how they were using them in their lives… The archives capture comments like:

“I confess that sometimes I resort to using the computer, using the cut and paste technique to write several letters at once”

Confess is a strong word there.. Over this period of observation we saw production of text moving to computers, computers moving into most homes, the rebuilding of modernity. We welcome comment on this project, and hope to publish soon where you can find out more on our method and approach.

So, each year since 1981 the Mass Observation Project has issued directives to respondents to respond to key issues, e.g. football, or the AIDS crisis. They issued the technology directive in 1991. From that year we see several fans of the word processor – words like love, dream… Responses to the 1991 directive are overwhelmingly positive – something that was not the case for other technologies on the whole…

“There is a spell check on this machine. Also my mind works faster than my hand and I miss out letters. This machine picks up all my faults and corrects them. Thank you computer.”

After this positive response, though, we start to see etiquette issues and concerns about privacy… Some keep writing correspondence by hand; some use simulated handwriting… And we start to see concerns about adapting letters, whether that is cheating or not – ethical considerations appearing… It is apparent that the guilt around typing text is also sometimes slightly humorous… Some playful mischief there…

Altering the context of the issue of copy and paste… the time and effort to write a unique manuscript is a concern… Interestingly, the directive asked about printing and filing emails… And one respondent notes that what they printed wasn’t actually financial or business records, but emails from their ex…

Another comments that they wish they had printed more emails during their pregnancy, a way of situating yourself in time and remembering the experience…

I’m going to skip ahead to how computers fitted into their home… People talk about dining rooms, and offices, and living rooms.. Lots of very specific discussions about where computers are placed and why they are placed there… One person comments:

“Usually at the dining room at home which doubles as our office and our coffee room”

Others talk about quieter spaces… The positioning of a computer seems to create some competition for use of space. The home changing to make room for the computer or the network… We also start to see (in 2004) comments about home life and work life, the setting up of a hotmail account as a subtle act of resistance, the reassertion of the home space.

A Mass Observation Directive in 1996 asked about email and the internet:

“Internet – we have this at work and it’s mildly useful. I wouldn’t have it at home because it costs a lot to be quite sad and sit alone at home” (1996)

So, observers from 1991-2004 talked about the efficiencies of the computer and internet – copy, paste, ease… But this then reflected concerns about the act of creating texts, of engaging with others, of computers changing homes and spaces. Now, there are really specific findings around location, gender, class, age, sexuality… The overwhelming majority of respondents are white, middle class, cis-gendered, straight women over 50. But we do see that change of response to technology, a moment in time, from positive to concerned. That runs parallel to the rise of the World Wide Web… We think our work provides context to web archive work and web research, with textual production influenced by these wider factors.


Q1) I hadn’t realised mass observation picked up again in 1980. My understanding was that previously it was the observed, not the observers. Here people report on their own situations?

A1) They self report on themselves. At one point they are asked to draw their living room as well…

Q1) I was wondering about business machinery in the home – type writers for instance

A1) I don’t know enough about the wider archive. All of this newer material was done consistently… The older Mass Observation material was less consistent – people recorded on the street, or notes made in pubs. What is interesting is that in the newer responses you see a difference in the writing of the response… as respondents move from handwriting to typewriters to computers…

Q2) Partly you were talking about how people write and use computers, and a bit about how people archive themselves… But the only work I could find on how people archive themselves digitally was by Microsoft Research… Is there anything since then? In that paper, though, you could almost read regret between the lines… the loss of photo albums, letters, etc…

A2) My colleague David Geiringer, with whom I co-wrote the paper, was initially looking at self-archiving. There was very, very little. But printing stuff comes up… and the tensions there. There is enough there – people talking about worries and loss… There is lots in there… The great thing with Mass Obs is that you can have a question, but then you have to dig around a lot to find things…

Ian Milligan, University of Waterloo and Matthew Weber, Rutgers University – Archives Unleashed 4.0: presentation of projects (#hackarchives)

Ian: I’m here to talk about what happened on the first two days of Web Archiving Week. And I’d like to thank our hosts, supporters, and partners for this exciting event. We’ll do some lightning talks on the work undertaken… But why are historians organising data hackathons? Well, because we face problems in accessing our popular cultural history. Problems like GeoCities… Kids writing about Winnie the Pooh, people writing about their love of Buffy the Vampire Slayer, their love of cigars… We face a problem of huge scale – some 7 million users of GeoCities alone… It’s the scale that boggles the mind. Compare it to the Old Bailey – one of very few sources on ordinary people, who otherwise leave only birth, death, marriage or criminal justice records… At 197,745 trials across 239 years, between 1674 and 1913, it is the biggest collection of texts about ordinary people… But from just 7 years of GeoCities we have 413 million web documents.

So, we have a problem, and myself, Matt and Olga from the British Library came together to build community, to establish a common vision of web archiving documents, to find new ways of addressing some of these issues.

Matt: I’m going to quickly show you some of what we did over the last few days… and the amazing projects created. I’ve always joked that Archives Unleashed is letting folk run amok to see what they can do… We started around two years ago in Toronto, then the Library of Congress, then the Internet Archive in San Francisco, and we stepped it up a little for London! We had the most teams yet, with people from as far away as New Zealand.

We started with some socialising in a pub on Friday evening, so that when we gathered on Monday we’d already done some introductions. Then a formal overview, and quickly forming teams to work and develop ideas… continuing through day one and day two… We ended up with 8 complete projects:

  • Robots in the Archives
  • US Elections 2008 and 2010 – text and keyword analysis
  • Study of Gender Distribution in Olympic communities
  • Link Ranking Group
  • Intersection Analysis
  • Public Inquiries Implications (Shipman)
  • Image Search in the Portuguese Web Archive
  • Rhyzome Web Archive Discovery Archive

We will hear from the top three from our informal voting…

Intersection Analysis – Jess

We wanted to understand how we could find a cookbook methodology for understanding the intersections between different data sets. So, we looked at the Occupy Movement (2011/12) with a Web Archive, a Rutgers archive and a social media archive from one of our researchers.

We normalised the CDX files, crunched the WAT files for outlinks, and extracted links from tweets. We generated counts and descriptive data, and the union/intersection between every pair of datasets. We had over 74 million records across the datasets, but only 0.3% overlap between the collections… If you go to our website we have a visualisation of the overlaps, tree maps of the collections…
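As a rough sketch of that union/intersection step: the toy URLs and the crude normalise() function below are my own stand-ins for the team’s actual CDX canonicalisation, which is considerably more involved:

```python
from urllib.parse import urlparse

def normalise(url: str) -> str:
    """Crude URL canonicalisation standing in for proper CDX/SURT handling."""
    p = urlparse(url.lower())
    host = p.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host + p.path.rstrip("/")

# Invented example sets: one from a web archive crawl, one from tweet outlinks.
web_archive = {normalise(u) for u in ["http://www.occupy.org/about/",
                                      "http://example.org/a"]}
tweets = {normalise(u) for u in ["https://occupy.org/about",
                                 "http://example.com/b"]}

overlap = web_archive & tweets   # URLs captured in both collections
union = web_archive | tweets
print(len(overlap), len(union))  # → 1 3
```

Computed over every pair of collections, this gives exactly the kind of union/intersection counts the team visualised.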

We wanted to use the WAT files to explore Outlinks in the data sets, what they were linking to, how much of it was archived (not a lot).

Parting thoughts? Overlap is inversely proportional to the diversity of URIs – in other words, the more collectors, the better. Diversifying seed lists with social media is good.

Robots in the Archive 

We focused on robots.txt. And our question was “what do we miss when we respect robots.txt?”. At the National Library of Denmark we respect it… At the Internet Archive they’ve started to ignore it in some contexts. So, what did we do? We extracted robots.txt from the WARC collection, then applied it retroactively, then compared the results to the link graph.

Our data was from The National Archives and from the 2010 election. We started by looking at user-agent blocks. Four had specifically blocked the internet archive, but some robot names were very old and out of date.. And we looked at crawl delay… Looking specifically at the sub collection of the department for energy and climate change… We would have missed only 24 links that would have been blocked…

So, robots.txt is minimal for this collection. Our method can be applied to other collections and extended to further the discussion on ignore robots.txt. And our code is on GitHub.
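The retroactive step can be reproduced with the standard library’s robots.txt parser. The robots.txt body and captured URLs here are invented examples, not the team’s National Archives data:

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt resembling the patterns the team describes:
# one rule set blocking the Internet Archive's crawler, one generic rule.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

captured = ["http://example.gov.uk/", "http://example.gov.uk/private/report"]
# Which already-captured URLs would a generic polite crawler have skipped?
missed = [u for u in captured if not rp.can_fetch("*", u)]
print(missed)  # → ['http://example.gov.uk/private/report']

# The user-agent blocks can be checked the same way:
print(rp.can_fetch("ia_archiver", "http://example.gov.uk/"))  # → False
```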

Link Ranking Group 

We looked at link analysis to ask whether all links are treated the same… We wanted to test whether links in <li> elements are different from content links (in <p> or <div>). We used Warcbase scripts to export manageable raw HTML, loaded it into the BeautifulSoup library, and used this on the Rio Olympics sites…

So we started looking at the WARCs… We said, well, we should also test whether links are absolute or relative… Comparing hard (absolute) links to relative links, we didn’t see many differences…

But we started to look at a previous election data set… There we saw links in tables, and there relative links were about 3/4 of links, and the other 1/4 were hard links. We did some investigation about why we had more hard links (proportionally) than before… Turns out this is a mixture of SEO practice, but also use of CMS (Content Management Systems) which make hard links easier to generate… So we sort of stumbled on that finding…
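A minimal version of that classification pass might look like the following sketch. It uses the stdlib HTML parser in place of BeautifulSoup, and tagging each link by its nearest enclosing <li>/<p>/<div> plus absolute-vs-relative href is my reconstruction of the test described, not the team’s actual script:

```python
from html.parser import HTMLParser

class LinkClassifier(HTMLParser):
    """Tag each <a href> with its enclosing container and link type."""

    def __init__(self):
        super().__init__()
        self.stack, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag in ("li", "p", "div"):
            self.stack.append(tag)
        elif tag == "a":
            href = dict(attrs).get("href", "")
            context = self.stack[-1] if self.stack else None
            kind = "absolute" if href.startswith(("http://", "https://")) else "relative"
            self.links.append((context, kind, href))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

html = ('<ul><li><a href="/nav">Nav</a></li></ul>'
        '<p><a href="http://example.org/">Story</a></p>')
parser = LinkClassifier()
parser.feed(html)
print(parser.links)
# → [('li', 'relative', '/nav'), ('p', 'absolute', 'http://example.org/')]
```

Aggregating those (context, kind) pairs over a collection gives the hard-vs-relative proportions the team compared between datasets.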

And with that the main programme for today is complete. There is a further event tonight and battery/power sockets permitting I’ll blog that too. 

Apr 092017
Digital Footprint MOOC logo

Last Monday we launched the new Digital Footprint MOOC, a free three week online course (running on Coursera) led by myself and Louise Connelly (Royal (Dick) School of Veterinary Studies). The course builds upon our work on the Managing Your Digital Footprints research project, campaign and also draws on some of the work I’ve been doing in piloting a Digital Footprint training and consultancy service at EDINA.

It has been a really interesting and demanding process working with the University of Edinburgh MOOCs team to create this course, particularly focusing in on the most essential parts of our Digital Footprints work. Our intention for this MOOC is to provide an introduction to the issues and equip participants with appropriate skills and understanding to manage their own digital tracks and traces. Most of all we wanted to provide a space for reflection and for participants to think deeply about what their digital footprint means to them and how they want to manage it in the future. We don’t have a prescriptive stance – Louise and I manage our own digital footprints quite differently but both of us see huge value in public online presence – but we do think that understanding and considering your online presence and the meaning of the traces you leave behind online is an essential modern life skill and want to contribute something to that wider understanding and debate.

MOOCs – Massive Open Online Courses – are courses which people tend to take in their own time for pleasure and interest, but also as part of their CPD and personal development, so the format seemed like a good fit for digital footprint skills and reflection, along with some of the theory and emerging trends from our research work. We also think the course has potential to be used in supporting digital literacy programmes and activities, and for those looking for skills for transitioning into and out of education, and in developing their careers. On that note we were delighted to see the All Aboard: Digital Skills in Higher Education 2017 event programme running last week – their website, created to support digital skills in Ireland, is a great complementary resource to our course, and one we made a (small) contribution to during their development phase.

Over the last week it has been wonderful to see our participants engaging with the Digital Footprint course, sharing their reflections on the #DFMOOC hashtag, and really starting to think about what their digital footprint means for them. From the discussion so far the concept of the “Uncontainable Self” (Barbour & Marshall 2012) seems to have struck a particular chord for many of our participants, which is perhaps not surprising given the degree to which our digital tracks and traces can propagate through others’ posts, tags, listings, etc. whether or not we are sharing content ourselves.

When we were building the MOOC we were keen to reflect the fact that our own work sits in a context of, and benefits from, the work of many researchers and social media experts both in our own local context and the wider field. We were delighted to be able to include guest contributors including Karen Gregory (University of Edinburgh), Rachel Buchanan (University of Newcastle, Australia), Lilian Edwards (Strathclyde University), Ben Marder (University of Edinburgh), and David Brake (author of Sharing Our Lives Online).

The usefulness of making these connections across disciplines and across the wider debate on digital identity seems particularly pertinent given recent developments that emphasise how fast things are changing around us, and how our own agency in managing our digital footprints and digital identities is being challenged by policy, commercial and social factors. Those notable recent developments include…

On 28th March the US Government voted to remove restrictions on the sale of data by ISPs (Internet Service Providers), potentially allowing them to sell an incredibly rich picture of browsing, search, behavioural and intimate details without further consultation (you can read the full measure here). This came as the UK Government mooted the banning of encryption technologies – essential for private messaging, financial transactions, access management and authentication – claiming that terror threats justified such a wide ranging loss of privacy. Whilst that does not seem likely to come to fruition given the economic and practical implications of such a measure, we do already have the  Investigatory Powers Act 2016 in place which requires web and communications companies to retain full records of activity for 12 months and allows police and security forces significant powers to access and collect personal communications data and records in bulk.

On 30th March, a group of influential privacy researchers, including danah boyd and Kate Crawford, published Ten simple rules for responsible big data research in PLoSOne. The article/manifesto is an accessible and well argued guide to the core issues in responsible big data research. In many ways it summarises the core issues highlighted in the excellent (but much more academic and comprehensive) AoIR ethics guidance. The PLoSOne article is notably directed to academia as well as industry and government, since big data research is at least as much a part of commercial activity (particularly social media and data driven start ups, see e.g. Uber’s recent attention for profiling and manipulating drivers) as traditional academic research contexts. Whilst academic research does usually build ethical approval processes (albeit conducted with varying degrees of digital savvy) and peer review into research processes, industry is not typically structured in that way and often not held to the same standards, particularly around privacy and boundary crossing (see, e.g. Michael Zimmer’s work on both academic and commercial use of Facebook data).

The Ten simple rules… are also particularly timely given the current discussion of Cambridge Analytica and its role in the 2016 US Election and the UK’s EU Referendum. An article published in Das Magazin in December 2016, and a subsequent English language version published on Vice’s Motherboard, have been widely circulated on social media over recent weeks. These articles suggest that the company’s large scale psychometrics analysis of social media data essentially handed victory to Trump and the Leave/Brexit campaigns, which naturally raises personal data and privacy concerns as well as influence, regulation and governance issues. There remains some skepticism about just how influential this work was… I tend to agree with Aleks Krotoski (social psychologist and host of BBC’s The Digital Human) who – speaking with Pat Kane at an Edinburgh Science Festival event last night on digital identity and authenticity – commented that she thought the Cambridge Analytica work was probably a mix of significant hyperbole but also some genuine impact.

These developments focus attention on access, use and reuse of personal data and personal tracks and traces, and that is something we hope our MOOC participants will have opportunity to pause and reflect on as they think about what they leave behind online when they share, tag, delete, and particularly when they consider terms and conditions, privacy settings and how they curate what is available and to whom.

So, the Digital Footprint course is launched and open to anyone in the world to join for free (although Coursera will also prompt you with the – very optional – possibility of paying a small fee for a certificate), and we are just starting to get a sense of how our videos and content are being received. We’ll be sharing more highlights from the course, retweeting interesting comments, etc. throughout this run (which began on Monday 3rd April), but also future runs since this is an “on demand” MOOC which will run regularly every four weeks. If you do decide to take a look then I would love to hear your comments and feedback – join the conversation on #DFMOOC, or leave a comment here or email me.

And if you’d like to find out more about our digital footprint consultancy, or would be interested in working with the digital footprints research team on future work, do also get in touch. Although I’ve been working in this space for a while this whole area of privacy, identity and our social spaces seems to continue to grow in interest, relevance, and importance in our day to day (digital) lives.


Mar 152017

Today I’m still in Birmingham for the Jisc Digifest 2017 (#digifest17). I’m based on the EDINA stand (stand 9, Hall 3) for much of the time, along with my colleague Andrew – do come and say hello to us – but will also be blogging any sessions I attend. The event is also being livetweeted by Jisc and some sessions livestreamed – do take a look at the event website for more details. As usual this blog is live and may include typos, errors, etc. Please do let me know if you have any corrections, questions or comments. 

Part Deux: Why educators can’t live without social media – Eric Stoller, higher education thought-leader, consultant, writer, and speaker.

I’ve snuck in a wee bit late to Eric’s talk but he’s starting by flagging up his “Educators: Are you climbing the social media mountain?” blog post. 

Eric: People who are most reluctant to use social media are often those who are also reluctant to engage in CPD, to develop themselves. You can live without social media, but social media is useful and important. Why is it important? It is used for communication, for teaching and learning, in research, in activism… Social media gives us a lot of channels to do different things with, that we can use in our practice… And yes, they can be used in nefarious ways, but so can any other media. People are often keen to see particular examples of how they can use social media in their practice in specific ways, but how you use things in your practice is always going to be specific to you, different, and that’s ok.

So, thinking about digital technology… “Digital is people” – as Laurie Phipps is prone to say… Technology enhanced learning is often tied up with employability, but there is a balance to be struck between employability and critical thinking. So, what about social media and critical thinking? We have to teach students how to determine if an online source is reliable or legitimate – social media is the same way… And all of us can be caught out. There was a piece in the FT about the chairman of Tesco saying unwise things about gender, race, etc. And I tweeted about this – but I said he was the CEO – and it got retweeted and included in a Twitter Moment… But it was wrong. I did a follow-up tweet and apologised, but I was contributing to that…

Whenever you use technology in learning it is related to critical thinking so, of course, that means social media too. How many of us here did our educational experience completely online… Most of us did our education in the “sage on the stage” manner, that’s what was comfortable for us… And that can be uncomfortable (see e.g. tweets from @msementor).

If you follow the NHS on Twitter (@NHS) then you will know it is phenomenal – they have a different member of staff guest posting to the account. Including live tweeting an operation from the theatre (with permissions etc., of course) – if you are a medical student this is very interesting. Twitter is the delivery method now, but maybe in the future it will be HoloLens or Oculus Rift Live or something. Another thing I saw about a year ago was Phil Baty (Times Higher Education – @Phil_Baty) talking about Liz Barnes revealing that every academic at Staffordshire will use social media and that it will be built into performance management. That really shows an organisation that is looking forward and trying new things.

Any of you take part in the weekly #LTHEchat. They were having chats about considering participation in that chat as part of staff appraisal processes. That’s really cool. And why wouldn’t social media and digital be a part of that.

So I did a Twitter poll asking academics what they use social media for:

  • 25% teaching and learning
  • 26% professional development
  • 5% research
  • 44% posting pictures of cats

The cool thing is you can do all of those things and still be using it in appropriate educational contexts. Of course people post pictures of cats… Of course you do… But you use social media to build community. It can be part of building a professional learning environment… You can use social media to lurk and learn… To reach out to people… And it’s not even creepy… A few years back I could say “I follow you” and that would be weird and sinister… Now it’s like “That’s cool, that’s Twitter”. Some of you will have been using the event hashtag and connecting there…

Andrew Smith, at the Open University, has been using Facebook Live for teaching. How many of your students use Facebook? It’s important to try this stuff, to see if it’s the right thing for your practice.

We all have jobs… and when we think about professional networking we often think about LinkedIn… Any of you using LinkedIn? (yes, a lot of us are). How about blogging on LinkedIn? It’s a great platform to blog on, as your content reaches people who are really interested. But you can connect in all of these spaces. I saw @mdleast tweeting about one of Anglia Ruskin’s former students who was running the NHS account – how cool is that?

But, I hear some of you say, Eric, this blurs the social and the professional. Yes, of course it does. Any of you have two Facebook accounts? I’m sorry you violate the terms of service… And yes, of course social media blurs things… Expressing the full gamut of our personality is much more powerful. And it can be amazing when senior leaders model for their colleagues that they are a full human, talking about their academic practice, their development…

Santa J. Ono (@PrezOno/@ubcprez) is a really senior leader but has been having mental health difficulties and tweeting openly about that… And do you know how powerful that is for his staff and students that he is sharing like that?

Now, if you haven’t seen the Jisc Digital Literacies and Digital Capabilities models? You really need to take a look. You can use these to use these to shape and model development for staff and students.

I did another poll on Twitter asking “Agree/Disagree: Universities must teach students digital citizenship skills” (85% agree) – now we can debate what “digital citizenship” means… Have any of you ever gotten into it with a troll online? Those words matter, they affect us. And digital citizenship matters.

I would say that you should not fall in love with digital tools. I love Twitter but that’s a private company, with shareholders, with its own issues… And it could disappear tomorrow… And I’d have to shift to another platform to do the things I do there…

Do any of you remember YikYak? It was an anonymous geosocial app… and it was used controversially and for bullying… So they introduced handles… But their users rebelled! (and they reverted)

So, Twitter is great but it will change, it will go… Things change…

I did another Twitter poll – which tools do your students use on a daily basis?

  • 34% Snapchat
  • 9% WhatsApp
  • 19% Instagram
  • 36% use all of the above

A lot of people don’t use Snapchat because they are afraid of it… When Facebook first appeared the response was that it’s silly, we wouldn’t use it in education… But we have moved past that…

There is a lot of bias about Snapchat. @RosieHare posted “I’m wondering whether I should Snapchat #digifest17 next week or whether there’ll be too many proper grown ups there who don’t use it.” Perhaps we don’t use these platforms yet, maybe we’ll catch up… But will students have moved on by then? There is a professor in the US who was using Snapchat with his students every day… You take your practice to where your students are. According to GlobalWebIndex (Q2–Q3 2016) over 75% of teens use Snapchat. There are policy challenges there but students are there every day…

Instagram – 150 million people engage with stories daily so that’s a powerful tool and easier to start with than Snapchat. Again, a space where our students are.

But perfection leads to stagnation. You have to try and not be fixated on perfection. Being free to experiment, being rewarded for trying new things, that has to be embedded in the culture.

So, at the end of the day, the more engaged students are with their institution – at college or university – the more successful they will be. Social media can be about doing that, about the student experience. All parts of the organisation can be involved. There are so many social media channels you can use. Maybe you don’t recognise them all… Think about your students. A lot will use WhatsApp for collaboration, for coordination… Facebook Messenger, some of the Asian messaging spaces… Any of you use Reddit? Ah, the nerds have arrived! But again, these are all spaces you can develop your practice in.

The web used to involve having your birth year in your username (e.g. @purpledragon1982), it was open… But we see this move towards WhatsApp, Facebook Messenger, WeChat, these different types of spaces and there is huge growth predicted this year. So, you need to get into the sandbox of learning, get your hands dirty, make some stuff and learn from trying new things #alldayeveryday


Q1) What audience do you have in mind… Educators or those who support educators? How do I take this message back?

A1) You need to think about how you support educators, how you do sneaky teaching… How you do that education… So.. You use the channels, you incorporate the learning materials in those channels… You disseminate in Medium, say… And hopefully they take that with them…

Q2) I meet a strand of students who reject social media and some technology in a straight edge way… They are in the big outdoors, they are out there learning… Will they not be successful?

A2) Of course they will. You can survive, you can thrive without social media… But if you choose to engage in those channels and spaces… You can be successful… It’s not an either/or.

Q3) I wanted to ask about something you tweeted yesterday… That Prensky’s idea of digital natives/immigrants is rubbish…

A3) I think I said “#friendsdontletfriendsprensky”. He published that over ten years ago – 2001 – and people grasped onto that. And he’s walked it back to being about a spectrum that isn’t about age… Age isn’t a helpful factor. And people used it as an excuse… If you look at Dave White’s work on “visitors and residents” that’s much more helpful… Some people are great, some are not as comfortable but it’s not about age. And we do ourselves a disservice to grasp onto that.

Q4) From my organisation… One of my course leaders found their emails were not being read, asked students what they should use, and they said “Instagram” but then they didn’t read that person’s posts… There is a bump, a challenge to get over…

A4) In the professional world email is the communications currency. We say students don’t check email… Well you have to do email well. You send a long email and wonder why students don’t understand. You have to be good at communicating… You set norms and expectations about discourse and dialogue, you build that in from induction – and that can be email, discussion boards and social media. These are skills for life.

Q5) You mentioned that some academics feel there is too much blend between personal and professional. From work we’ve done in our library we find students feel the same way and don’t want the library to tweet at them…

A5) Yeah, it’s about expectations. Liverpool University has a brilliant Twitter account, Warwick too, they tweet with real personality…

Q6) What do you think about private social communities? We set up WordPress/BuddyPress thing for international students to push out information. It was really varied in how people engaged… It’s private…

A6) Communities form where they form. Maybe ask them where they want to be communicated with. Some WhatsApp groups flourish because that’s the cultural norm. And if it doesn’t work you can scrap it and try something else… And see what works.

Q7) I wanted to flag up a YikYak study at Edinburgh on how students talk about teaching, learning and assessment on YikYak, that started before the handles were introduced, and has continued as anonymity has returned. And we’ll have results coming from this soon…

A7) YikYak may rise and fall… But that functionality… There is a lot of beauty in those anonymous spaces… That functionality – the peers supporting each other through mental health… It isn’t tools, it’s functionality.

Q8) Our findings in a recent study was about where the students are, and how they want to communicate. That changes, it will always change, and we have to adapt to that ourselves… Do you want us to use WhatsApp or WeChat… It’s following the students and where they prefer to communicate.

A8) There is balance too… You meet students where they are, but you don’t ditch their need to understand email too… They teach us, we teach them… And we do that together.

And with that, we’re out of time… 

Are you future ready? Preparing students for living and working in the digital world

Introduction –  Lisa Gray, senior co-design manager, Jisc.

The Connected Curricula model is about ensuring that employability is built into the curriculum, in T-profile curricula; employer engagement; and assessment for learning. That assessment is about assessing throughout the student experience as they progress through the curriculum.

The Jisc employability toolkit talks more about how this can be put into action. Aspects of technology for employability include: enhanced authentic and simulated learning experiences; enhanced lifelong learning and employability; digital communications and engagement with employers; enhanced employability skills development – including learner skills diagnostics and self-led assessment; and employer-focused digital literacy development.

The employable student in the digital age model. The toolkit unpicks the capabilities that map into that context.

You can find out more, along with other resources, at: http://ji.sc/

The Employer View: Preparing students for a digital world – Deborah Edmondson, talent director, Cohesion Recruitment

We manage early talent recruitment processes. Whilst it is clear that automation is replacing some roles, it won’t replace creativity, emotional awareness, and similar skills and expertise.

Graduate vacancies are reducing this year – for the third time in the last four years. Some of that is associated with Brexit – especially in construction – but it also reflects a rise in apprentice roles. Many employers are moving existing training programmes to the new Apprenticeship model (and levy). Recruitment for early talent is typically: online application, video interview, psychometric testing, assessment centre. Some employers gamify that process. And we are also seeing a big parental influence as well.

Employers have had to up their own digital skills in order to recruit graduates. We’ve had to ensure application forms are online and mobile enabled. And we know that online forms are not the best predictor of who will succeed in graduate recruitment so we’ve reduced or removed them. Video interviews are becoming much more frequent as they give the best idea of a candidate’s skills, confidence and communication. We still see psychometric testing but there is less focus there; it’s more about contextual recruitment, focusing less on scores and more on the context of that student and their achievement. We are also starting to see virtual reality in final stages of recruitment – this is about understanding authentic reactions and responses rather than pre-prepared responses.

So, what do employers want in terms of digital skills? It’s not about skills a lot of the time, often it’s about willingness to use digital skills and capabilities. There are nine key attributes and I’d particularly like to draw your attention to business communications. Students often focus on immediacy… But the realities of business and their tools are that things can move slowly, so graduates need real flexibility. The other area I wanted to raise is etiquette: one client mentioned a graduate-recruited colleague sending multiple chasers in a single email – that’s just annoying. Similarly use of text speak – wholly inappropriate. Also hiding behind the screen – only emailing and reluctant to call or meet face to face…

Graduates have great skills but they are also described as entitled, hard to manage, etc. So, how can universities help? Well, expectations – around success and job satisfaction, as well as about the kinds of technologies they will be using. There isn’t immediacy or instant gratification in the world of work; patience is required. It is about business communication – that emails are long enough and professional enough, avoiding text speak, emoji, or phrases like “in my oils” which won’t mean much to employers! We also need graduates who are able and willing to have conversations, face to face conversations, phone conversations – they have to be able to talk about their work. And with digital footprint – this can come back to haunt you. We have recruiters for high-security roles who even check online purchase history – if it’s out there, we will find it. And it’s about perceptions too – those with ambitious career plans have to bear that in mind in how they present themselves from day one. And Excel – it’s important in business but not all students have experience of it. Research… graduates need to be professional on LinkedIn (including photographs) and be able to do the research, to understand the employer, but not to be too stalkery. And it’s about employer interaction – we receive abusive, sweary, etc. responses to rejections but graduates need to be asking for feedback and being graceful in dealing with rejection.

Note: for those interested in digital footprint you should take a look at our new #dfmooc which launches next month and is already open for registration: https://www.coursera.org/learn/digital-footprint.

SERC – Kieran McKenna, South Eastern Regional College

At SERC a student’s first few weeks are about entrepreneurship, with guest speakers, student volunteers, and project based learning built around PBL/Enterprise Fairs. We see success in a number of areas and skills contests because of this model. We use the CAST/CAPS approach – Conference for Advancement of Science and Technology – with students working with industry standard PBL and enterprise learning. We also take a “whole-brain learning” approach – ensuring students understand how they learn best.

So, now we will look at three ways we have enabled this. We created a Whole Brain eLearning resource – called EntreBRAINeur – where students understand typical skills of entrepreneurs, have information about the brain, and answer questions that report back to them on their left brain/right brain placement, their learning styles… One message to take home is the language we use.. That the following information “may be of benefit to your working styles” – encouraging the learner in a positive way. The learner knows best how they learn best. And we link results with activity planning – so you can look at a group with their right/left brain dominance.

So, with that, we are going to see a short video on this…

So, having created this tool we set up an enterprise portal. This has objectives including sharing enterprise and entrepreneurship best practice across multiple campuses. So the PBL activities create a web presence and they are explaining how they undertook the PBL design cycle, and they are looking for votes on their projects. They are then assessed against creativity; innovation; team working; and solutions matching the challenge.

So, are we future ready? Looking at students who completed the e-resource, we found that only about 10% of our students have an entrepreneurial mindset… But we are confident that the tools, the learning tools, the peer assessment will give our students the edge they need.

Self-designed learning and “future proofing” graduates – Ian Pirie, Emeritus Professor, University of Edinburgh

I am going to talk about self-designed learning. We are two years into a pilot programme in Edinburgh where students literally design their own project, it is approved, they manage it, it is assessed, and ends up in an eportfolio online. Edinburgh is a large university – 3 colleges, 22 schools – and we don’t always do things the same way. We had a number of factors colliding – we have a QAA Enhancement theme around learning and a large careers team which was looking for more self-led opportunities; and employers were also saying they valued graduates but felt some skills could be stronger; and for students in e.g. humanities your tutor would tell you what you must do, but you also have a choice of modules – from over 8.5k courses which is quite intimidating.. And staff also wanted to teach their specialist areas which is a challenge.

So I’ll talk in four areas here…

A rapidly changing world… Students can now access all information very quickly, globally, 24/7. It often isn’t the students’ ability to use technology that is the problem; it’s often universities and employers that fall behind. For education the challenge can be that the kind of teaching we are used to doing isn’t necessarily fit for purpose. Traditionally teaching is information rich and assessed a few times in a semester, and that isn’t what they need – and it is frustrating. And we also see a socially mobile environment – university and private coffee shops used socially and professionally by students. And in fact the Kaplan Graduate Recruitment Report 2014 suggests 1 in 2 will become future leaders – and 60% of businesses are looking for graduates with leadership skills.

Looking at the CBI survey data – as mentioned earlier – it really isn’t about the subject area. It is about having studied to a particular level… Not what you have learned in the course in terms of subject content. So how can that be taught? And when we survey our own students we find frustration amongst some students about the way they are taught. And indeed the importance of understanding that equality doesn’t mean treating everyone the same – there is a lot of literature here and it is hard to see how we implement this, particularly at scale.

Students are consistently very clear about what they would like… They would like to be treated professionally and individually, they want clarity about what is expected of them and what they can expect in return. They want clarity in assessment criteria with associated timely and effective feedback – an issue across the sector. They also want an academic community comprised of vertical peer groups and academic staff. They want 24/7 access to online information, ideally in one place. And they increasingly want assurance that they are being prepared for the future.

And, for so many reasons, there is a lot of change. HE can be slow to change… But we need to move away from a teaching model towards a learning model where the tutor supports that learning. It is about accepting responsibility for “future proofing” the whole person, and part of that is about ensuring that “digital literacy” is embedded in the curriculum, as well as the abstract skills.

So, three years ago we developed our future vision for a future curriculum. Some of the steps here look innocuous, but some will really radically upset academics – we wanted to design out passive learning. If a student can sleep through a lecture, hand in an essay, do an exam, and that’s them completed the course, that’s not good enough. We also wanted appropriate use of technology – there is no substitute for the face to face experience. Each student is also required to use online learning in some form, to prepare them for the future, for elearning, for their ongoing development…

And that takes us to the SLICCs. This is a university-wide framework contextualised to the discipline by each student. And there is one framework; the student then contextualises their own course. Students create, own, manage and are formatively assessed. There is deliberately minimal input and supervision from academic staff – it’s a lot of work, but for the student, not the staff. Inductions are done by Institute for Academic Development staff… the academic input is at the “front end” for induction and presentation of the proposal. But students then reflect on their experience.

In order to do this our inductions are face to face – not online – to make sure students are able to take on the SLICC. They also cannot take on a SLICC if they have any fails – academically they have to be solid to go into this phase of their learning. So, the process is for the student to identify and select a learning experience – often a work placement related project; they develop a proposal and work plan; and then engage in ongoing reflection – sometimes once a day. Then there is formative self-assessment by the student, and summative assessment by staff. Staff don’t see the formative assessment until they have marked the work but in our pilots we had over 96% correlation between those assessments.

We are used to seeing staff responsibility for returning marked work etc. But we also make it clear what the student expectations are in terms of giving and receiving feedback (separate from the SLICC), with students needing to submit that self-graded assessment constructively aligned to the LOs. A critically-selective web folio is submitted along with an (up to 2000 word) report. Initially there was concern that SLICCs were 20 credits and students wouldn’t do the work… But they have done mountains of work and really produced fantastic engaged pieces. Students gave us feedback on the courses, but the technology is barely mentioned – the staff struggled more – as the students learned most from the self-management and self-direction. Students from pilot 1 immediately signed up for pilot 2… And now it is mainstream. As one student says “it made me take control of my own learning”. I can’t show you all the portfolios now but if you look at our website, you’ll find out much more: http://www.ed.ac.uk/employability/slicc. Contact Simon Riley and Gavin McGabe for more information.


Q1) Coming back to the first speaker I was quite concerned about the phrase “early talent” as it implies all graduates are young.

A1 – DE) That’s fair. It is a collective term but employers tend to separate into apprenticeships and graduate programmes. But graduate programmes aren’t dependent on age.

Q2) On PebblePad and ePortfolios – do students use those with employers? Are they effective tools for jobs?

A2 – DE) From employers perspective we don’t see them in high volume. We follow it quite closely. We see more of universities encouraging students to use LinkedIn profiles instead.

A2 – IP) For many this approach is new to the students and staff. But in medicine the idea of portfolios is well embedded, and those courses have just adopted PebblePad for that purpose. But it’s discipline specific… And students thought about it before being asked, and staff seem enthusiastic.

Q3) About the neurological approach to learning… Isn’t there a real risk of thinking of learning being only for employment… What about motivation, what about changes in the market?

A3 – KM) We predominantly try to develop “whole brain” learners. We have electricians and plasterers taking that whole brain learning questionnaire – it’s interesting for them to look at that, to look back at their school experience and how their preference shapes that. The response from students has been quite positive.

Q4) We talked about this on Twitter already but I really hope that you use “left brain” and “right brain” and “learning styles” lightly – these have been debunked and perhaps give students a false sense of security… We are complex organisms… And maybe it’s just a way to articulate different potential… [Thank you to this person, it was a concern I had too!]

A4 – KM) We do try to address a lot of different learning styles… There is a wide variety of how that phrase is used… A real range of different skills that learners can have. It is important not to pigeon hole… But it is useful to raise awareness of how we can develop as people, regardless of how we label this. There are a range of approaches to this… This is the one that we are using.

Q5) There can be this sense of higher education as being to train the best people for employers – the best meat almost. What is the role and responsibility for employers to train graduates?

A5 – DE) There are training schemes; employers are aware of the need to train students and graduates – around 35% of students who complete a year long industrial placement will be offered a role with that employer, in recognition of the training investment and importance to employers.

Closing plenary and keynote from Lauren Sager Weinstein, chief data officer at Transport for London

The host for this session is Andy McGregor, deputy chief innovation officer, Jisc. He is introducing the session with the outcome of the start up competition that has been running over the last few days. The pitches took place last night. The winners will go into the Jisc Business Accelerator programme, providing support and some funding to take their ideas forward. And we are keen and happy to involve you in this programme so do get in touch… You’ll see us present the results digitally – an envelope seemed just too risky!

The winner of the public vote is Wildfire. And the further teams entering the project are Hubbub, Lumici Slate, Ublend, VineUp. We were hugely impressed with the quality of all of the entries – those who entered, those who were shortlisted, and the small cross section you’ve seen over the last two days.

And now… Lauren Sager Weinstein…

I wanted to start by talking about the “why”… TfL has a diverse offering of transport across London – trains, buses, bikes… What are we trying to achieve? We want to deliver transport and mobility services, and to deliver for the mayor. We want to keep London working and growing. And when we think about my team and the work that we do… Our goal is to do things that help influence and contribute to the goals of the wider organisation – putting our customers and users at the core of all of our decision making; to drive improvement in reliability and safety; to be cost effective; to improve what we do.

Our customers want to understand what we stand for: excellent reliability and customer experience; value for money; and progress and innovation. And they want to know that there is a level of trust that guides what we do and underpins how we use data. And I want to talk about how we use data that is personal, how we strip identifying data out. It is incredibly important that we respect our customers’ privacy. We tell our customers about how we collect data, and we also have more information online. We work closely with our Privacy and Data Protection team, all new data initiatives undergo a Privacy Impact Assessment, and we have regular engagement with the ICO and rely on their guidance. When we do share any sensitive data we make use of non-disclosure agreements.

So, our data – we are very lucky as we are data rich. We have 19 million smartcard ticketing transactions a day from 12 million active cards. We know where our buses are – capturing 4.5 million bus locations a day using iBus geo-located events. We have 500k rows of train diagnostic data on the Central Line alone. We have 250k train locations. We have data from the TfL website. That is brilliant, but how do we make that useful? How do we translate that data into something we can use – that’s where my role comes in.

So we take this data and we use it to create a lot of integrated travel information that is used on our website, in tailored emails, in 600 travel apps powered by open data and created by third party app developers. We also provide advice to customers on travel options… This is where we use data to see which data is most useful… We use data on areas that are busy in terms of entrances and exits – and use that in posters in stations to help customers shift behaviours… If we tell them they have the ability to make a change, whether or not they do.

We also look at customer patterns – based on taps from cards. We anonymise the users but keep a (new) unique id to understand patterns of travel… Some users follow clear commuter patterns – Monday to Friday, we can see where home and work are, etc. But others do not fit clear patterns – part time workers, occasional attenders etc. But understanding that data lets us understand demand, peaks, and planning of shops for an area too. We also use data to help us put things right when they go wrong – paying for delays on the underground or overground. If things go *really* wrong we will look at pattern analysis and automatically refund them – that shows customers that we value them and their time, and means we have fewer forms to process.
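As described above, the card number is replaced with a (new) unique ID so that travel patterns survive while the identifier does not. A minimal sketch of that kind of pseudonymisation in Python – the key, field names and card numbers here are all illustrative assumptions, not TfL’s actual scheme:

```python
import hashlib
import hmac

# Hypothetical secret key held by the data team (illustrative only);
# a keyed hash means known card numbers cannot be re-identified
# by anyone without the key.
SECRET_KEY = b"rotate-me-periodically"

def pseudonymise(card_number: str) -> str:
    """Replace a smartcard number with a stable pseudonymous ID.

    The same card always maps to the same ID, which preserves
    travel patterns while stripping the real identifier.
    """
    return hmac.new(SECRET_KEY, card_number.encode(), hashlib.sha256).hexdigest()[:16]

# Illustrative tap records (made-up card numbers and stations).
taps = [
    {"card": "0123456789", "station": "Mile End", "action": "in"},
    {"card": "0123456789", "station": "Holborn", "action": "out"},
    {"card": "9876543210", "station": "Brixton", "action": "in"},
]
anonymised = [dict(t, card=pseudonymise(t["card"])) for t in taps]
```

Note that, as the talk itself acknowledges later, a pseudonymous ID like this is still sensitive data: stable IDs allow pattern analysis, which is exactly why they are useful and why they still need protecting.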

We also use data to manage maintenance schedules, so that we can fix small things quickly to avoid bigger issues that would need fixing later on. We also use data to understand where our staff are deployed. If we know where hotspots for breakdowns are, we can deploy recovery teams more strategically. We also use data in real time operations so controllers can change the road network to manage the traffic flows most effectively.

We have also done work to consider the future and growth. We have created an algorithm to answer a question we used to have to answer with surveys… With the underground you tap on and off… But on the buses you only tap on… So we looked at inferring bus journeys… We take our bus boarding entry taps, plus other modal taps, and iBus event data to work out where passengers likely exited the bus. We use it to plan busy parts of the network – where more buses may be required at busy times. Also to plan interchanges – we are changing our road layout considerably to make it better for vulnerable road users. And to understand at a granular level how customers use our network.
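A toy illustration of that kind of exit inference: one common heuristic is to assume a passenger alighted at the stop on their boarded route closest to where their next tap of the day occurred. The route, stop names, coordinates and the nearest-stop rule below are all assumptions for illustration, not TfL’s actual algorithm:

```python
def infer_exit(route_stops, next_tap_location):
    """Return the (name, coords) stop on the route nearest the next observed tap."""
    def dist(a, b):
        # Plain Euclidean distance on toy coordinates.
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return min(route_stops, key=lambda stop: dist(stop[1], next_tap_location))

# Illustrative (made-up) coordinates for stops on one bus route.
route_25 = [
    ("Oxford Circus", (0.0, 0.0)),
    ("Holborn", (1.2, 0.3)),
    ("Bank", (2.5, 0.4)),
    ("Mile End", (4.8, 0.9)),
]

# The passenger's next tap of the day was at a station near (2.4, 0.5),
# so we infer they left the bus at the nearest stop on the route.
stop, _ = infer_exit(route_25, (2.4, 0.5))
```

Aggregating such inferred exits over millions of journeys is what makes the demand and interchange planning described above possible without exit taps.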

We are always looking to solve problems and do so in an innovative way… We are industry leaders in a number of areas. We have had wifi on the tube since 2012. We are currently looking to see if wifi data will enable us to plan better. In 2016 we ran a four week pilot to explore the value of wifi connection data. When devices tried to connect to wifi routers in stations we grabbed a timestamp, location and a (scrambled) device id. We are analysing that data… But the test was about an easier use case. The cases we are currently looking at are about what we can learn about customer patterns from wifi data… And we were deliberately very transparent in that trial, with posters in situ, information online, and a real push to ensure that people were informed about what we were collecting, and how to opt out.

Finally we have an open data policy. We support developers and the developer economy. This is delivered at very little cost, and our web presence is seen as industry leading. We also do work with universities around six key areas, and we then work with academics on proofs of concept with TfL support. That can become a TfL proof of concept and eventually end up being operational.

So, we are keen to engage with students to come and work with us. So we are planning for ways to support STEM/STEAM activities in schools, to create targeted interventions – it helps us develop the next generation and enables us to deliver the mayor’s education strategy. We’ve done coding events, work with the Science Museum, with local schools.

To finish, my big data principles: focus on protecting the privacy of our customers – that is paramount; focus on the right problems you face – “interesting” is not enough; and don’t start with the data… Instead we think of an approach along the lines of…

  • As a [my job title]
  • I need [big data insights]
  • So that I can [make a decision my job expects me to]

Operational infrastructure generates data… so it is crucial to interpret, translate and understand that data to make it useful. 


Q1) What have you done in terms of data from disabled travellers

A1) We have users with freedom passes… but it depends on what the disability is… so the data is hard to tease out. You need a combination of automatic data and talking to our users – so you can take patterns to small groups… and test and discuss those.

Q2) You mentioned that you provide open data for others. Have you thought about student projects… can you provide a databank of problems or projects that students could work on?

A2) We are just beginning this now. We have ongoing research projects that require in-depth knowledge of our work. We also have an opportunity around key questions and key samples – you can see that data today. It isn’t packaged for schools but there are opportunities on air quality, travel patterns, whether students can find local stops, etc. There is real opportunity but still more to do.

Q3) As cities become increasingly populated with self-driving autonomous vehicles the data may inform those, but Uber and Tesla already collect huge amounts of data…

A3) We have some data on cars but it’s high level. To understand our road customers though we are keen to work with the appropriate companies – some are more open than others – and to understand how we can work with our customers. Historical data is easier but real time analysis is really where we want to be. 

Q4) About information and data protection… you could argue that marginal impact is low for the individual… but compared to cost of security after a data breach… I was wondering how you decided on that balance, and the rights and expectations…

A4) Well, we asked our customers whether they were comfortable with the approach. They were asked tangible questions about how data could be used… When we focus on what is tangible and will improve the network for Londoners, that helps. And pseudonymous data means you have a hashed number, not the full card number, but it is still sensitive. Customers can opt into giving us more data – and with wifi we advised customers to switch off wifi if they did not want to be part of the study. It’s about customers being comfortable engaging with us at the level that they want.

Sincere apologies for the quality of my liveblogging for Laura’s talk – my computer decided to crash about two thirds of the way through and only part of the post was successfully autosaved, with the remaining notes made on my phone. Look at the tweets and others’ write-ups for further detail, or check out the excellent TfL site where I know there is already a lot of good information on their open data and their recent wifi work. 

And with that Digifest is over for another year. Particular thanks to all who dropped by EDINA’s stand and chatted with Andrew and me – we were delighted to catch up with so many EDINA customers and people interested in our project work and possible opportunities to work together in the future. We are always delighted to meet and hear from our colleagues across the sector, so do leave a comment here or drop us a line if you have any comments, questions or ideas you’d like to discuss.  

 March 15, 2017  Posted at 10:10 am  Digital Education, LiveBlogs  No Responses »
Mar 14, 2017

Today and tomorrow I’m in Birmingham for the Jisc Digifest 2017 (#digifest17). I’m based on the EDINA stand (stand 9, Hall 3) for much of the time, along with my colleague Andrew – do come and say hello to us – but will also be blogging any sessions I attend. The event is also being livetweeted by Jisc and some sessions livestreamed – do take a look at the event website for more details. As usual this blog is live and may include typos, errors, etc. Please do let me know if you have any corrections, questions or comments. 

Plenary and Welcome

Liam Earney is introducing us to the day, with the hope that we all take something away from the event – some inspiration, an idea, the potential to do new things. Over the past three Digifest events we’ve taken a broad view. This year we focus on technology expanding, enabling learning and teaching.

LE: So we will be talking about questions we asked through Twitter and through our conference app with our panel:

  • Sarah Davies (SD), head of change implementation support – education/student, Jisc
  • Liam Earney (LE), director of Jisc Collections
  • Andy McGregor (AM), deputy chief innovation officer, Jisc
  • Paul McKean (PM), head of further education and skills, Jisc

Q1: Do you think that greater use of data and analytics will improve teaching, learning and the student experience?

  • Yes 72%
  • No 10%
  • Don’t Know 18%

AM: I’m relieved at that result as we think it will be important too. And it is backed up by evidence emerging in the US and Australia around the use of data analytics in retention and attainment. There is a much bigger debate around AI and robots, and around learning analytics there is that debate about how human and data, and human and machine, can work together. We have several sessions in that space.
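The retention work referred to here typically starts from a very simple signal before any machine learning is involved. As an illustrative sketch only – the data, names, and threshold are invented – the most basic learning-analytics step is flagging students whose activity falls well below the cohort average so that a human tutor can follow up:

```python
from statistics import mean

# Invented weekly VLE login counts for a tiny cohort.
weekly_logins = {"alice": 14, "bala": 2, "carmen": 9, "dev": 1}

def flag_at_risk(logins: dict, fraction: float = 0.5) -> list:
    """Return students whose login count is below `fraction` of the cohort mean.
    A deliberately crude heuristic: real systems combine many signals."""
    threshold = mean(logins.values()) * fraction
    return sorted(name for name, n in logins.items() if n < threshold)

print(flag_at_risk(weekly_logins))  # → ['bala', 'dev']
```

The point the panel makes survives even in this toy version: the output is a prompt for a human conversation, not an automated decision.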

SD: Learning Analytics has already been around its own hype cycle… We had huge headlines about its potential about a year ago, but now we are seeing much more in-depth discussion, discussion around making sure that our decisions are data-informed… There is concern around the role of the human here, but the tutors, the staff, are the people who access this data and work with students, so it is about human and data together – and that’s why adoption is taking a while, as they work out how best to do that.

Q2: How important is organisational culture in the successful adoption of education technology?

  • Total make or break 55%
  • Can significantly speed it up or slow it down 45%
  • It can help but not essential 0%
  • Not important 0%

PM: Where we see education technology adopted we do often see that organisational culture can drive technology adoption. An open culture – for instance Reading College’s open door policy around technology – can really produce innovation and creative adoption, as people share experience and ideas.

SD: It can also be about what is recognised and rewarded. About making sure that technology is more than what the innovators do – it’s something for the whole organisation. It’s not something that you can do in small pockets. It’s often about small actions – sharing across disciplines, across role groups, about how technology can make a real difference for staff and for students.

Q3: How important is good quality content in delivering an effective blended learning experience?

  • Very important 75%
  • It matters 24%
  • Neither 1%
  • It doesn’t really matter 0%
  • It is not an issue at all 0%

LE: That’s reassuring, but I guess we have to talk about what good quality content is…

SD: I think materials – good quality primary materials – make a huge difference, there are so many materials we simply wouldn’t have had (any) access to 20 years ago. But also about good online texts and how they can change things.

LE: My colleague Karen Colbon and I have been doing some work on making more effective use of technologies… Paul you have been involved in FELTAG…

PM: With FELTAG, I was pleased when that came out 3 years ago, but I think only now have we moved past the myth of 10% online being blended learning… And we are moving towards a proper debate about what blended learning is – what is relevant, not just what is described. And the need for good quality support to enable that.

LE: What’s the role for Jisc there?

PM: I think it’s about bringing the community together, about focusing on the learner and their experience, rather than the content, to ensure that overall the learner gets what they need.

SD: It’s also about supporting people to design effective curricula too. There are sessions here, talking through interesting things people are doing.

AM: There is a lot of room for innovation around the content. If you are walking around the stands there is a group of students from UCL who are finding innovative ways to visualise research, and we’ll be hearing pitches later with some fantastic ideas.

Q4: Billions of dollars are being invested in edtech startups. What impact do you think this will have on teaching and learning in universities and colleges?

  • No impact at all 1%
  • It may result in a few tools we can use 69%
  • We will come to rely on these companies in our learning and teaching 21%
  • It will completely transform learning and teaching 9%

AM: I am towards the 9% here; there are risks, but there is huge reason for optimism. There are some great companies coming out, and working with them increases the chance that this investment will benefit the sector. Startups are really keen to work with universities, to collaborate with us.

LE: It is difficult for universities to take that punt, to take that risk on new ideas. Procurement, governance, are all essential to facilitating that engagement.

AM: I think so. But I think if we don’t engage then we do risk these companies coming in and building businesses that don’t take account of our needs.

LE: Now that’s a big spend taking place for that small potential change that many who answered this question perceive…

PM: I think there are savings that will come out of those changes potentially…

AM: And in fact that potentially means saving money on tools we currently use by adopting new ones, and investing that into staff…

Q5: Where do you think the biggest benefits of technology are felt in education?

  • Enabling or enhancing learning and teaching activities 55%
  • In the broader student experience 30%
  • In administrative efficiencies 9%
  • It’s hard to identify clear benefits 6%

SD: I think many of the big benefits we’ve seen over the last 8 years have been around things like online timetables – the wider student experience and administrative spaces. But we are also seeing that, when used effectively, technology can really enhance the learning experience. We have a few sessions here around that. Key here are the digital capabilities of staff and students – awareness, confidence, and understanding of how technology fits with disciplinary practice. Lots here at Digifest around digital skills. [sidenote: see also our new Digital Footprint MOOC which is now live for registrations]

I’m quite surprised that 6% thought it was hard to identify clear benefits… There are still lots of questions there, and we have a session on evidence based practice tomorrow, and how evidence feeds into institutional decision making.

PM: There is something here around the Apprenticeship Levy, which is about to come into place. A surprisingly high percentage of employers aren’t aware that they will be paying it! Technology has a really important role here for teaching, learning and assessment, but also tracking and monitoring around apprenticeships.

LE: So, with that, I encourage you to look around, chat to our exhibitors, craft the programme that is right for you. And to kick that off here is some of the brilliant work you have been up to. [we are watching a video – this should be shared on today’s hashtag #digifest17]
And with that, our session ended. For the next few hours I will mainly be on our stand but also sitting in on Martin Hamilton’s session “Loving the alien: robots and AI in education” – look out for a few tweets from me and many more from the official live tweeter for the session, @estherbarrett.

Plenary and keynote from Geoff Mulgan, chief executive, Nesta (host: Paul Feldman, chief executive, Jisc)

Paul Feldman: Welcome to Digifest 2017, and to our Stakeholder Meeting attendees who are joining us for this event. I am delighted to welcome Geoff Mulgan, chief executive of Nesta.

Geoff: Thank you all for being here. I work at Nesta. We are an investor in quite a few ed tech companies, and we run a lot of experiments in schools and universities… And I want to share with you two frustrations. The whole area of ed tech is, I think, one of the most exciting, perhaps ever! But the whole field is frustrating… In Britain we have phenomenal tech companies, and phenomenal universities high in the rankings… But too rarely do we bring these together, and we don’t see that vision from ministers either.

So, I’m going to talk about the promise – some of the things that are emerging and developing. I’ll talk about some of the pitfalls – some of the things that are going wrong. And some of the possibilities of where things could go.

So, first of all, the promise. We are going through yet another wave – or series of waves – of Google, Watson, DeepMind, Fitbits, sensors… We are at least 50 years into the “digital revolution” and yet the pace of change isn’t letting up – Moore’s Law still applies. So, finding the applications is as exciting and challenging as ever.

Last year Deep Mind defeated a champion of Go. People thought that it was impossible for a machine to win at Go, because of the intuition involved. That cutting edge technology is now being used in London with blood test data to predict who may be admitted to hospital in the next year.

We have also seen these free online bitesize platforms – Coursera, Udacity, etc. – these challenges to traditional courses. And we have Google Translate in November 2016 adopting a neural machine translation engine that can translate whole sentences… Google Translate may be a little clunky still, but we are moving toward that Hitchhiker’s Guide to the Galaxy idea of the Babel fish. In January 2017 a machine-learning powered poker bot outcompeted 20 of the world’s best. We are seeing more of these events… The Go contest was observed by 280 million people!

Much of this technology is feeding into this emerging Ed Tech market. There are MOOCs, there are learning analytics tools, there is a huge range of technologies. The UK does well here… When you talk about education you have to talk about technology, not just bricks and mortar. This is a golden age but there are also some things not going as they should be…

So, the pitfalls. There is a lack of understanding of what works. NESTA did a review 3 years ago of school technologies and that was quite negative in terms of return on investment. And the OECD similarly compared spend with learning outcomes and found a negative correlation. One of the odd things about this market is that it has invested very little in using control groups, and gathering the evidence.

And where is the learning about learning? When the first MOOCs appeared I thought it was extraordinary that they showed little interest in decades of knowledge and understanding about elearning, distance learning, online learning. They just shared materials. It’s not just the cognitive elements: you need peers, you need someone to talk to. There is a common finding over decades that you need that combination of peer and social elements and content – that’s one of the reasons I like FutureLearn, as it combines that more directly.

The other thing that is missing is the business models. Few ed tech companies make money… They haven’t looked at who will pay, how much they should pay… And I think that reflects, to an extent, the world view of computer scientists…

And I think that, business model wise, some of the possibilities are quite alarming. Right now many of the digital tools we use are based on collecting our data – the advertiser is the customer, you are the product. And I think some of our ed tech providers, having failed to raise income from students, are somewhat moving in that direction. We are also seeing household data, the internet of things, and my guess is that the impact of these will raise much more awareness of privacy, security, and use of data.

The other thing is jobs and future jobs. Some of you will have seen these analyses of jobs and the impact of computerisation. Looking over the last 15 years we’ve seen big shifts here… Technical and professional knowledge has been relatively well protected. But there is also a study (Frey, C and Osborne, M 2013) that looks at those at low risk of computerisation and automation – dentists are safe! – and those at high risk which includes estate agents, accountants, but also actors and performers. We see huge change here. In the US one of the most popular jobs in some areas is truck drivers – they are at high risk here.

We are doing work with Pearson to look at job market requirements – this will be published in a few months’ time – to help educators prepare students for this world. The jobs likely to grow are around creativity, social intelligence, also dexterity – walking over uneven ground, fine manual skills. If you combine those skills with deep knowledge of technology, or specialised fields, you should be well placed. But we don’t see schools and universities shaping their curricula to these types of needs. Is there a conscious effort to look ahead and to think about what 16-22 year olds should be doing now to be well placed in the future?

In terms of more positive possibilities… Some of those I see coming into view… One of these is Skills Route, which was launched for teenagers. It’s an open data set which generates a data-driven guide for teenagers about which subjects to study, allowing teenagers to see what jobs they might get, what income they might attract, how happy they will be even, depending on their subject choices. These insights will be driven by data, including understanding of what jobs may be there in 10 years’ time. Students may have a better idea of what they need than many of their teachers, their lecturers etc.

We are also seeing a growth of adaptive learning. We are an investor in CogBooks which is a great example. This is a game changer in terms of how education happens. The way AI is built it makes it easier for students to have materials adapt to their needs, to their styles.

My colleagues are working with big cities in England, including Birmingham, to establish Offices of Data Analytics (and data marketplaces), which can enable understanding of e.g. buildings at risk of fire that can be mitigated before fire fighting is needed. I think there are, again, huge opportunities for education. Get into conversations with cities and towns, to use the data commons – which we have but aren’t (yet) using to the full extent of its potential.

We are doing a project called Arloesiadur in Wales which is turning big data into policy action. This allowed policy makers in Welsh Government to have a rich real time picture of what is taking place in the economy, including network analyses of investors, researchers, to help understand emerging fields, targets for new investment and support. This turns the hit and miss craft skill of investment into something more accurate, more data driven. Indeed work on the complexity of the economy shows that economic complexity maps to higher average annual earnings. This goes against some of the smart cities expectation – which wants to create more homogenous environments. Instead diversity and complexity is beneficial.
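The kind of network analysis described here – linking investors and researchers to spot emerging fields – can be illustrated with a very small sketch. Everything below is invented (funds, sectors, the ranking rule); it only shows the shape of the idea, not Nesta’s actual methodology:

```python
from collections import defaultdict

# Invented bipartite data: which funds have invested in which sectors.
investments = [
    ("fund_a", "medtech"), ("fund_b", "medtech"),
    ("fund_a", "agritech"), ("fund_c", "fintech"),
]

# Build sector → set-of-investors, then rank sectors by how many
# distinct investors they attract (a crude "emerging field" signal).
sector_investors = defaultdict(set)
for investor, sector in investments:
    sector_investors[sector].add(investor)

ranked = sorted(sector_investors, key=lambda s: len(sector_investors[s]), reverse=True)
print(ranked[0])  # prints "medtech" – the sector attracting the most distinct investors
```

Real versions of this work use far richer graphs (co-investment, co-authorship, hiring flows), but the principle is the same: structure in the network, not any single record, is what points at targets for investment and support.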

We host at NESTA the “Alliance for Useful Evidence”, which includes a network of around 200 people trying to ensure evidence is used and useful. Out of that we have a series of “What Works” centres – NICE (health and care); Education Endowment Foundation; Early Intervention Foundation; Centre for Ageing Better; College of Policing (crime reduction); Centre for Local Economic Growth; What Works Well-being… But bizarrely we don’t have one of these for education and universities. These centres help organisations to understand where evidence for particular approaches exists.

To try and fill the gap a bit for universities we’ve worked internationally with the Innovation Growth Lab to understand investment in research, what works properly. This is applying scientific methods to areas on the boundaries of university. In many ways our current environment does very little of that.

The other side of this is the issue of creativity. In China the principal of one university felt it wasn’t enough for students to be strong in engineering; they needed to solve problems. So we worked with them to create programmes for students to create new work, addressing problems and questions without existing answers. There are comparable programmes elsewhere – students facing challenges and problems, not starting with the knowledge. It’s part of the solution… But some work like this can work really well. At Harvard students are working with local authorities and there is a lot of creative collaboration across ages, experience, approaches. In the UK there isn’t any university doing this at serious scale, and I think this community can have a role here…

So, what to lobby for? I’ve worked a lot with government – we’ve worked with about 40 governments across the world – and I’ve seen vice chancellors and principals who have access to government, and they usually lobby for something that looks like the present – small changes. I have never seen them lobby for substantial change, for more connection with industry, for investment and ambition at the very top. The leaders argue for the needs of the past, not the present. That isn’t true in other industries: they look ahead, and make that central to their case. I think that’s part of why we don’t see this coming together in an act of ambition like we saw in the 1960s when the Open University was founded.

So, to end…

Tilt Brush is one of the most interesting things to emerge in the last few years – a tool that allows you to paint in a 3D virtual world. It is exciting as no-one knows how to do this. It’s exciting because it is uncharted territory. It will be, I think, a powerful learning tool. It’s a way to experiment and learn…

But the other side of the coin… The British public’s favourite painting is The Fighting Temeraire… An ugly steamboat pulls in a beautiful old sailing ship to be smashed up. It is about technological change… But also about why change is hard. The old ship is more beautiful, tied up with woodwork and carpentry skills, culture, songs… There is a real poetry… But its message is that if we don’t go through that, we don’t create space for the new. We are too attached to the old models to let them go – especially the leaders who came through those old models. We need to create those Tilt Brushes, but we also have to create space for the new to breathe as well.


Q1 – Amber Thomas, Warwick) Thinking about the use of technology in universities… There is research on technology in education and I think you point to a disconnect between the big challenges from research councils and how research is disseminated, a disconnect between policy and practice, and a lack of availability of information to practitioners. But also I wanted to say that BECTA used to have some of that role for experimentation and that went in the “bonfire of the quangos”. And what should Jisc’s role be here?

A1) There is all of this research taking place but it is often not used. That emphasis on “Useful Evidence” is important. Academics are not always good at this… What will enable a busy head teacher, a busy tutor, to actually understand and use that evidence? There are some spaces for education at schools level but there is a gap for universities. BECTA was a loss. There is a lack of Ed Tech strategy. There is real potential. To give an example… We have been working with finance, forcing banks to open up data, with banks required by the regulator to fund creative use of that data to help small firms understand their finance. That’s a very different role for the regulator… But I’d like to see institutions willing to do more of that.

A1 – PF) And I would say we are quietly activist.

Q2) To go back to the Hitchhikers Guide issue… Are we too timid in universities?

A2) There is a really interesting history of radical universities – some with no lectures, some no walls, in Paris a short-lived experiment handing out degrees to strangers on buses! Some were totally student driven. My feeling is that that won’t work; it’s like music, and you need some structure, some grammars… I like challenge-driven universities as they aren’t *that* groundbreaking… You have some structure and content, you have interdisciplinary teams, you have assessment there… It is a space for experimentation. You need some systematic experimentation on the boundaries… Some creative laboratories on the edge to inform the centre, with some of that quite radical. And I think that we lack those… Things like the Coventry SONAR (?) course for photography which allowed input from the outside, a totally open course including discussion and community… But those sorts of experiments tend not to be in a structure… And I’d like to see systematic experimentation.

Q3 – David White, UAL) When you put up your ed tech slide, a lot of students wouldn’t recognise that as they use lots of free tools – Google etc. Maybe your old warship is actually the market…

A3) That’s a really difficult question. In any institution of any sense, students will make use of the cornucopia of free things – Google Hangouts and YouTube. That’s probably why the Ed Tech industry struggles so much – people are used to free things. Google isn’t free – you indirectly pay through the sale of your data, as with Facebook. Wikipedia is free but philanthropically funded. I don’t know if that model of Google etc. can continue as we become more aware of data and data use concerns. We don’t know where the future is going… We’ve just started a new project with Barcelona and Amsterdam around the idea of the Data Commons, which doesn’t depend on the sale of data to advertisers etc., but that faces the issue of who will pay. My guess is that the free data-based model may last up to 10 years, but then something will change…

How can technology help us meet the needs of a wider range of learners?

Pleasing Most of the People Most of the Time – Julia Taylor, subject specialist (accessibility and inclusion), Jisc.

I want to tell you a story about buying LEGO for a young child… My kids loved LEGO and it’s changed a lot since then… I bought a child this pack with lots of little LEGO people with lots of little hats… And this child just sort of left all the people on the carpet, because they wanted the LEGO people to choose their own hats and toys… And that was disappointing… And I use that example because there is an important role in helping individuals find the right tools. The ultimate goal of digital skills and inclusion is about giving people the skills and confidence to use the appropriate tools. The idea is that the technology fades into the background and simply becomes a set of tools…

We’ve never had more tools for giving people independence… But what is the potential of technology, and how can it be selected and used? We’ll hear more about delivery and use of technology in this context. But I want to talk about what technology is capable of delivering…

Technology gives us the tools for digital diversity, allowing the student to be independent in how they access and engage with our content. That kind of collaboration can be as meaningful internationally as it is for learners who have to fit studies around, say, shift work. It allows learners to do things the way they want to do them. That idea of independent study through digital technology is really important. So these tools afford digital skills; they remove barriers and/or enable students to overcome them. Technology allows learners with different needs to overcome challenges – perhaps of physical disability, perhaps remote location, perhaps little free time. Technology can help people take those small steps to start or continue their education. It’s as much about that as those big global conversations.

It is also the case that technology can be a real motivator and attraction for some students. And the technology can be about overcoming a small step, dealing with potential intimidation at new technology, through to much more radical forms that keep people engaged… So when you have tools aimed at the larger end of the scale, you also enable people at the smaller end of the scale. Students do have expectations, and for some, technology is a lifestyle, a life line, that supports their independence… They are using apps and tools to run their lives. That is the direction of travel with people, and with young people. Technology is an embedded part of their life. And we should work with that, perhaps even encourage more use of technology, more dependence on it. Many of us in this room won’t have met a young visually impaired person who doesn’t have an iPhone, as those devices allow them to read, to engage, to access their learning materials. Technology is a lifeline here. That’s one example, but there are others… Autistic students may be using an app like “Brain in Hand” to help them engage with travel, with people, with education. We should encourage this use, and we do encourage this use of technology.

We encourage learners to check if they can:

  • Personalise and customise the learning environment
  • Get text books in alternative formats – that they can adapt and adjust as they need
  • Find out about the access features of loan devices and platforms – there are features built into the devices and platforms you use and require students to use. How much do you know about the accessibility of the learning platforms that you buy into?
  • Get accessible course notes in advance of lectures – notes that can be navigated and adapted easily, taking away unnecessary barriers. Ensuring documents are accessible for the maximum number of people.
  • Use productivity tools and personal devices everywhere – many people respond well to text to speech, it’s useful for visually impaired students, but also for dyslexic students too.

Now, we encourage organisations to make their work accessible to the most people possible. For instance, a free and available text-to-speech tool provides technology that we know works for some learners, across the wide range of learners. That helps those with real needs, but will also benefit other learners, including some who would never disclose a challenge or disability.

So, when you think about technology, think about how you can reach the widest possible range of learners. This should be part of course design, staff development… All areas should include accessible and inclusive technologies.

And I want you now to think about the people and infrastructure required and involved in these types of decisions…  So I have some examples here about change…

What would you need to do to enable a change in practice like this learner statement:

“Usually I hate fieldwork. I’m disorganised, make illegible notes, can’t make sense of the data because we’ve only got little bits of the picture until the evening write up…” 

This student isn’t benefitting from the fieldwork until the information is all brought together. The teacher dealt with this by combining data, information, etc. on the learner’s phone, including QR codes to help them learn… That had an impact and the student continues:

“But this was easy – Google forms. Twitter hashtags. Everything on the phone. To check a technique we scanned the QR code to watch videos. I felt like a proper biologist… not just a rubbish notetaker.”

In another example a student who didn’t want to speak in a group and was able to use a Text Wall to enable their participation in a way that worked for them.

In another case a student who didn’t want to blog but it was compulsory in their course. But then the student discovered they could use voice recognition in GoogleDocs and how to do podcasts and link them in… That option was available to everyone.

Comment: We are a sixth form college. We have a student who is severely dyslexic and he really struggled with classwork. Using voice recognition software has been transformative for that student and now they are achieving the grades and achievements they should have been.

So, what is needed to make this stuff happen? How can we make it easy for change to be made? Is inclusion part of your student induction? It’s hard to gauge from the room how much of this is endemic in your organisations. You need to think about how far down the road you are, and what else needs to be done so that the majority of learners can access podcasts, productivity tools, etc.

[And with that we are moving to discussion.]

It’s great to hear you all talking, and I thought it might be useful to finish by asking you to share some of the good things that are taking place…

Comment: We have an accessibility unit – a central unit – and that unit provides workshops on technologies for all of the institution, and we promote those heavily in all student inductions. Also I wanted to say that note taking sometimes is the skill that students need…

JT: I was thinking someone would say that! But I wanted to make the point that we should be providing these tools and communicating that they are available… There are things we can do, but it requires us to understand what technology can do to lower the barrier, and to engage staff properly. Everyone needs to be able to use technology, and to promote its use…

The marker by which we are all judged is the success of our students. Technology must be inclusive for that to work.

You can find more resources here:

  • Chat at Todaysmeet.com/DF1734
  • Jisc A&I Offer: TinyURL.com/hw28e42
  • Survey: TinyURL.com/jd8tb5q

How can technology help us meet the needs of a wider range of learners? – Mike Sharples, Institute of Educational Technology, The Open University / FutureLearn

I wanted to start with the idea of accessibility and inclusion. As you may already know, the Open University was established in the 1970s to open up university to a wider range of learners… In 1970 19% of our students hadn’t been to university before; now it’s 90%. We’re rather pleased with that! For a diverse and inclusive university, accessibility and inclusivity are essential. As we move towards more interactive courses, we have to work hard to make fieldtrips accessible to people who are not mobile, to ensure all of our astronomy students have access to telescopes, etc.

So, how do we do this? The learning has to be future orientated, and suited to what learners will need in the future. I like the kinds of jobs you see on Careers 2030 – Organic Voltaics Engineer, Data Wrangler, Robot Counsellor – the kinds of work roles that may be there in the future. At the same time as looking to the future we need to also think about what it means to be in a “post truth era” – with accessibility of materials, and access to the educational process too. We need a global open education.

So, FutureLearn is a separate but wholly owned company of the Open University. There are 5.6 million learners, 400 free courses. We have 70 partner institutions, with 70% of learners from outside the UK, 61% are female, and 22% have had no other tertiary education.

When we came to build FutureLearn we had a pretty blank slate. We had EdX and similar but they weren’t based on any particular pedagogy – built around extending the lectures, and around personalised quizzes etc. And as we set up FutureLearn we wanted to encourage a social constructivist model, and the idea of “Learning as Conversation”, based on the idea that all learning is based on conversation – with ourselves, with our teachers and their expertise, and with other learners, to try and reach shared understanding. And that’s the brief our software engineers took on. We wanted it to be scalable, for every piece of content to have conversation around it – so that rather than sending you to forums, the conversation sat with the content. And also the idea of peer review, of study groups, etc.
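[Aside: the “conversation sits with the content” design can be sketched as a simple data model. This is purely my own illustrative sketch – not FutureLearn’s actual schema – and all of the names here are hypothetical:]

```python
from dataclasses import dataclass, field

@dataclass
class Comment:
    author: str
    text: str

@dataclass
class Step:
    """One piece of course content (a video, article, quiz...)."""
    title: str
    content: str
    # The conversation lives WITH the content, not in a separate forum.
    comments: list = field(default_factory=list)

    def add_comment(self, author: str, text: str) -> Comment:
        comment = Comment(author, text)
        self.comments.append(comment)
        return comment

# Discussion hangs directly off each step of the course:
step = Step("Welcome", "An introduction to logical and critical thinking…")
step.add_comment("mentor", "What flawed arguments have you spotted this week?")
step.add_comment("learner", "A headline confusing correlation with causation.")
print(len(step.comments))  # → 2
```

[The point of the design is simply that the comment thread is a property of the content itself, so learners never navigate away from the material to join the conversation.]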

So, for example, the University of Auckland have a course on Logical and Critical thinking. Linked to a video introducing the course is a conversation, and that conversation includes facilitative mentors… And engagement there is throughout the conversation… Our participants have a huge range of backgrounds and locations and that’s part of the conversation you are joining.

Now 2012 was the year of the MOOC, but now they are becoming embedded, and MOOCs need to be taken seriously as part of campus activities, as part of blended learning. In 2009 the US DoE undertook a major meta-study of comparisons of online and face to face teaching in higher education. On average students in online learning conditions performed better than those receiving face to face teaching, but those undertaking a blend of campus and online did better still.

So, we are starting to blend campus and online, with campus students accessing MOOCs, with projects and activities that follow up MOOCs, and we now have the idea of hybrid courses. For example FutureLearn has just offered its first full postgraduate course with Deakin University. MOOCs are no longer far away from campus learning, they are blending together in new ways of accessing content and accessing conversation. And it’s the flexibility of study that is so important here. There are also new modes of learning (e.g. flipped learning), as well as global access to higher education, including free courses, global conversation and knowledge sharing. The idea of credit transfer and a broader curriculum enabled by that. And the concept of disaggregation – affordable education, pay for use? At the OU only about a third of our students use the tutoring they are entitled to, so perhaps only those that use tutoring should pay.

As Geoff Mulgan said, we do lack evidence – though that is being addressed. But we also really need new learning platforms that will support free as well as accredited courses, that enable accreditation, credit transfer, badging, etc.


Q1) How do you ensure the quality of the content on your platform?

A1) There are a couple of ways… One was in our selective choice of which universities (and other organisations) we work with. So that offers some credibility and assurance. The other way is through the content team who advise every partner, every course, who creates content for FutureLearn. And there are quite a few quality standards – quite a lot of people on FutureLearn came from the BBC and they come with a very clear idea of quality – there is diversity of the offer but the quality is good.

Q2) What percentage of FutureLearn learners “complete” the course?

A2) In general it’s about 15-20%. Those 15% ish have opportunities they wouldn’t otherwise have had. We’ve also done research on who drops out and why… Most (95%) say “it’s not you, it’s me”. Some of those are personal and quite emotional reasons. But mainly life has just gotten in the way and they want to return. Of the remaining 5% about half felt the course wasn’t at quite the right level for them, the other half just didn’t enjoy the platform, it wasn’t right for them.

So, now over to you to discuss…

  1. What pedagogy – ways of doing teaching and learning – would you bring in?
  2. What evidence? What would constitute success in terms of teaching and learning?


Comments: MOOCs are quite different from modules and programmes of study.. Perhaps there is a branching off… More freestyle learning… The learner gets value from whatever paths they go through…

Comments: SLICCs at Edinburgh enable students to design their own module, reflecting and graded against core criteria, but in a project of their own shaping. [read more here]

Comments: Adaptive learning can be a solution to that freestyle learning process… That allows branching off, the algorithm to learn from the learners… There is also the possibility to break a course down to smallest components and build on that.

I want to focus a moment on technology… Is there something that we need?

Comments: We ran a survey of our students about technologies… Overwhelmingly our students wanted their course materials available, they weren’t that excited by e.g. social media.

Let me tell you a bit about what we do at the Open University… We run lots of courses, each looks different, and we have a great idea of retention, student satisfaction, exam scores. We find that overwhelmingly students like content – video, text and a little bit of interactivity. But students are retained more if they engage in collaborative learning. In terms of student outcomes… The lowest outcomes are for courses that are content heavy… There is a big mismatch between what students like and what they do best with.

Comment: There is some research on learning games that also shows satisfaction at the time doesn’t always map to attainment… Stretching our students is effective, but it’s uncomfortable.

Julia Taylor: Please do get in touch if you have more feedback or comments on this.

Dec 05 2016
Image credit: Brian Slater

This is a very wee blog post/aside to share the video of my TEDxYouth@Manchester talk, “What do your digital footprints say about you?”:

You can read more on the whole experience of being part of this event in my blog post from late November.

It would appear that my first TEDx, much like my first Bright Club, was rather short and sweet (safely within my potential 14 minutes). I hope you enjoy it and I would recommend catching up with my fellow speakers’ talks:

Kat Arney


Ben Smith


VV Brown


Ben Garrod


I gather that the videos of the incredible teenage speakers and performers will follow soon.


Oct 08 2016

Today is the last day of the Association of Internet Researchers Conference 2016 – with a couple fewer sessions but I’ll be blogging throughout.

As usual this is a liveblog so corrections, additions, etc. are welcomed. 

PS-24: Rulemaking (Chair: Sandra Braman)

The DMCA Rulemaking and Digital Legal Vernaculars – Olivia G Conti, University of Wisconsin-Madison, United States of America

Apologies, I joined this session late so these notes miss the first few minutes of what seems to have been an excellent presentation from Olivia. The work she was presenting on – the John Deere DMCA case – is part of her PhD work on how lay communities feed into lawmaking. You can see a quick overview of the case on NPR All Tech Considered and a piece on the ruling at IP Watchdog. The DMCA is the Digital Millennium Copyright Act (1998). My notes start about half-way through Olivia’s talk…

Property and ownership claims are made of distinctly American values… Grounded in general ideals, evocations of the Bill of Rights, or asking what Ben Franklin would say… Bringing in the idea of the DMCA as being contrary to the very foundations of the United States. Another theme was the idea that once you buy something you should be able to edit it as you like. Indeed a theme here is the idea of “tinkering as a liberatory endeavour”. And you see people claiming that it is a basic human right to make changes and tinker, to tweak your tractor (or whatever). Commentators are not trying to appeal to the nation state, they are trying to perform the state – to make rights claims, to enact the rights of the citizen in a digital world.

So, John Deere made a statement that tractor buyers have an “implied license” to their tractor – they don’t own it outright. And that raised controversies as well.

So, the final register rule was that the farmers won: they could repair their own tractors.

But the vernacular legal formations allow us to see the tensions that arise between citizens and the rights holders. And that also raises interesting issues of citizenship – and of citizenship of the state versus citizenship of the digital world.

The Case of the Missing Fair Use: A Multilingual History & Analysis of Twitter’s Policy Documentation – Amy Johnson, MIT, United States of America

This paper looks at the multilingual history and analysis of Twitter’s policy documentation. Or policies as uneven scalar tools of power alignment. And this comes from the idea of thinking of Twitter as more than just one whole, complete, overarching platform. There is much research now on moderation, but understanding this type of policy allows you to understand some of the distributed nature of platforms. Platforms draw lines when they decide which laws to transform into policies, and then again when they think about which policies to translate.

If you look across at a list of Twitter policies, there is an English language version. Of this list it is only the Fair Use policy and the Twitter API limits that appear only in English. The API policy makes some sense, but the Fair Use policy does not. And Fair Use only appears really late – in 2014. It sets up in 2005, and many other policies come in in 2013… So what is going on?

So, here is the Twitter Fair Use Policy… Now, before I continue here, I want to say that this translation (and lack of it) for this policy is unusual. Generally all companies – not just tech companies – translate into the FIGS languages: French, Italian, German, Spanish. And Twitter does not do this. But this is in contrast to the translations of the platform itself. And I wanted to talk particularly about translations into Japanese and Arabic. Now the Japanese translation came about through collaboration with a company that gave Twitter opportunities to expand out into Japan. Arabic is not put in place until 2011, around the Arab Spring. And the translation isn’t done by Twitter itself but by another organisation set up to do this. So you can see that there are other actors here playing into translations of platform and policies. So these iconic platforms are shaped in some unexpected ways.

So… I am not a lawyer but… Fair Use is a phenomenon that creates all sorts of internet lawyering. And typically there are four factors of fair use (Section 107 of the US Copyright Act of 1976): purpose and character of use; nature of the copyrighted work; amount and substantiality of the portion used; effect of use on the potential market for or value of the copyrighted work. And this is very much an American law, from a legal-economic point of view. And the US is the only country that has Fair Use law.

Now there is a concept of “Fair Dealing” – mentioned in passing in Fair Use – which shares some characteristics. There are other countries with Fair Use law: Poland, Israel, South Korea… Well, they point to the English language version. What about Japanese, which has a rich reuse community on Twitter? It also points to the English policy.

So, policies are not equal in their policyness. But why does this matter? Because this is where the rule of law starts to break down… We cannot assume that the same policies apply universally.

But what about parody? Why bring this up? Well parody is tied up with the idea of Fair Use and creative transformation. Comedy is a protected Fair Use category. And Twitter has a rich seam of parody. And indeed, if you Google for the fair use policy, the “People also ask” section has as the first question: “What is a parody account”.

Whilst Fair Use wasn’t there as a policy until 2014, parody unofficially had a policy in 2009, an official one in 2010, then updates, and another version in 2013 for the IPO. Biz Stone writes about lawyers, when he was at Google, saying about fake accounts “just say it is parody!”, and about the importance of parody. And indeed the parody policy has been translated much more widely than the Fair Use policy.

So, policies select bodies of law and align platforms to these bodies of law, in varying degree and depending on specific legitimation practices. Fair Use is strongly associated with US law, and embedding that in the translated policies aligns Twitter more to US law than they want to be. But parody has roots in free speech, and that is something that Twitter wishes to align itself with.

Visual Arts in Digital and Online Environments: Changing Copyright and Fair Use Practice among Institutions and Individuals – Patricia Aufderheide, Aram Sinnreich, American University, United States of America

Patricia: Aram and I have been working with the College Art Association, which brings together a wide range of professionals and practitioners in art across colleges in the US. They had a new code of conduct and we wanted to speak to them, a few months after that code of conduct was released, to see if that had changed practice and understanding. This is a group that uses copyrighted work very widely. And indeed one-third of respondents avoid, abandon, or are delayed in work because of copyright concerns.

Aram: Four-fifths of CAA members use copyrighted materials in their work, but only one fifth employ fair use to do that – most mostly or always seek permission. And of those that use fair use there are some that always or usually use Fair Use. So there are real differences here. So, Fair Use is valued if you know about it and understand it… but a quarter of this group aren’t sure if Fair Use is useful or not. Now there is that code of conduct. There is also some use of Creative Commons and open licenses.

Of those that use copyrighted materials… 47% never use open licenses for their own work – there is a real reciprocity gap. Only 26% never use others’ openly licensed work, and only 10% never use others’ public domain work. Respondents value creative copying… 19 out of 20 CAA members think that creative appropriation can be “original”, and despite this group seeking permissions they also feel that creative appropriation shouldn’t necessarily require permission. This really points to an education gap within the community.

And 43% said that uncertainty about the law limits creativity. They think they would appropriate works more, they would publish more, they would share work online… These mirror fair use usage!

Patricia: We surveyed this group twice, in 2013 and in 2016. Much stays the same but there have been changes… In 2016, two-thirds had heard about the code, and a third had shared that information – with peers, in teaching, with colleagues. Their associations with the concept of Fair Use are very positive.

Aram: The good news is that code use does lead to change, even within 10 months of launch. This work was done to try and show how much impact a code of conduct has on understanding… And really there were dramatic differences here. From the 2016 data, those who are not aware of the code look a lot like those who are aware but have not used the code. But for those who use the code, there is a real difference… And more are using fair use.

Patricia: There is one thing we did outside of the survey… There have been dramatic changes in the field. A number of universities have changed journal policies to be default Fair Use – Yale, Duke, etc. There has been a lot of change in the field. Several museums have internally changed how they create and use their materials. So, we have learned that education matters – behaviour changes with knowledge confidence. Peer support matters and validates new knowledge. Institutional action, well publicized, matters. The newest are most likely to change quickly, but the most veteran are in the best position – it is important to have those influencers on board… And teachers need to bring this into their teaching practice.

Panel Q&A

Q1) How many are artists versus other roles?

A1 – Patricia) About 15% are artists, and they tend to be more positive towards fair use.

Q2) I was curious about changes that took place…

A2 – Aram) We couldn’t ask whether the code made you change your practice… But we could ask whether they had used fair use before and after…

Q3) You’ve made this code for the US CAA, have you shared that more widely…

A3 – Patricia) Many of the CAA members work internationally, but the effectiveness of this code in the US context is that it is about interpreting US Fair Use law – it is not a legal document but it has been reviewed by lawyers. But copyright is territorial which makes this less useful internationally as a document. If copyright was more straightforward, that would be great. There are rights of quotation elsewhere, there is fair dealing… And Canadian law looks more like Fair Use. But the US is very litigious so if something passes Fair Use checking, that’s pretty good elsewhere… But otherwise it is all quite territorial.

A3 – Aram) You can see in data we hold that international practitioners have quite different attitudes to American CAA members.

Q4) You talked about the code, and changes in practice. When I talk to filmmakers and documentary makers in Germany, they were aware of Fair Use rights but didn’t use them, as they are dependent on TV companies buying their work and wanting every part of rights cleared… They don’t want to hurt relationships.

A4 – Patricia) We always do studies before changes and it is always about reputation and relationship concerns… Fair Use only applies if you can obtain the materials independently… But then the question may be whether rights holders will be pissed off next time you need to licence content. What everyone told me was that we can do this but it won’t make any difference…

Chair) I understand that, but that question is about use later on, and demonstration of rights clearance.

A4 – Patricia) This is where change in US errors and omissions insurance makes a difference – that protects them. The film and television makers code of conduct helped insurers engage and feel confident to provide that new type of insurance clause.

Q5) With US platforms, as someone in Norway, it can be hard to understand what you can and cannot access and use on, for instance, in YouTube. Also will algorithmic filtering processes of platforms take into account that they deal with content in different territories?

A5 – Aram) I have spoken to Google Counsel about that issue of filtering by law – there is no difference there… But monitoring

A5 – Amy) I have written about legal fictions before… They are useful for thinking about what a “reasonable person” would do – and that can vary by jury and location, so writing that into policies helps to shape it.

A5 – Patricia) The jurisdiction is where you create, not where the work is from…

Q6) There is an indecency case in France which they want to try in French court, but Facebook wants it tried in US court. What might the impact on copyright be?

A6 – Arem) A great question but this type of jurisdictional law has been discussed for over 10 years without any clear conclusion.

A6 – Patricia) This is a European issue too – Germany has good exceptions and limitations, France has horrible exceptions and limitations. There is a real challenge for pan European law.

Q7) Did you look at all at the impact of advocacy groups who encouraged writing in/completion of replies on the DMCA? And was there any big difference between the farmers and car owners?

A7) There was a lot of discussion on the digital right to repair site, and that probably did have an impact. I did work on Net Neutrality before. But in any of those cases I take out boilerplate and see what they add directly – but there is a whole other paper to be done on boilerplate texts and how they shape responses and the terms of additional comments. It wasn’t that easy to distinguish between farmers and car owners, but it was interesting how individuals established credibility. Farmers talked about the value of fixing their own equipment, of being independent, of history of ownership. Car mechanics, by contrast, establish technical expertise.

Q8) As a follow up: farmers will have had a long debate over genetically modified seeds – and the right to tinker in different ways…

A8) I didn’t see that reflected in the comments, but there may well be a bigger issue around micromanagement of practices.

Q9) Olivia, I was wondering if you were considering not only the rhetorical arguments of users, but also the way the techniques and tactics they used are received on the other side… What are the effective tactics there? Can you locate the limits of the effectiveness of layperson vernacular strategies?

A9) My goal was to see which frames of argument looked most effective. I think in the case of the John Deere DMCA case that wasn’t conclusive. It can be really hard to separate the NGO from the individual – especially when NGOs submit huge collections of individual responses. I did a case study on non-consensual pornography which was more conclusive in terms of strategies that were effective. The discourses I look at don’t look like legal discourse, but I look at the tone and content people use. So, on revenge porn, the law doesn’t really reflect user practice for instance.

Q10) For Amy, I was wondering… Is the problem that Fair Use isn’t translated… Or the law behind that?

A10 – Amy) I think Twitter in particular have found themselves in a weird middle space… Then the exceptions wouldn’t come up. But having it in English is the odd piece. That policy seems to speak specifically to Americans… But you could argue they are trying to impose (maybe that’s a bit too strong) on all English speaking territory. On YouTube all of the policies are translated into the same languages, including Fair Use.

Q11) I’m fascinated in vernacular understanding and then the experts who are in the round tables, who specialise in these areas. How do you see vernacular discourse use in more closed/smaller settings?

A11 – Olivia) I haven’t been able to take this up as so many of those spaces are opaque. But in the 2012 rulemaking there were some direct quotes from remixers. And there was a suggestion around DVD use that people should videotape the TV screen… and that seemed unreasonably onerous…

Chair) Do you foresee a next stage where you get to be in those rooms and do more on that?

A11 – Olivia) I’d love to do some ethnographic studies, to get more involved.

A11 – Patricia) I was in Washington for the DMCA hearings and those are some of the most fun things I go to. I know that the documentary filmmakers have complained about cost of participating… But a technician from the industry gave 30 minutes of evidence on the 40 technical steps to handle analogue film pieces of information… And to show that it’s not actually broadcast quality. It made them gasp. It was devastating and very visual information, and they cited it in their ruling… And similarly in John Deere case the car technicians made impact. By contrast a teacher came in to explain why copying material was important for teaching, but she didn’t have either people or evidence of what the difference is in the classroom.

Q12) I have an interesting case if anyone wants to look at it, around Wikipedia’s Fair Use issues around multimedia. Volunteers take pre-emptively being stricter as they don’t want lawyers to come in on that… And the Wikipedia policies there. There is also automation through bots to delete content without clear Fair Use exception.

A12 – Aram) I’ve seen Fair Use misappropriated on Wikipedia… Copyrighted images used at low resolution and claimed as Fair Use…

A12 – Patricia) Wikimania has all these people who don’t want to deal with law on copyright at all! Wikimedia lawyers are in a really difficult position.

Intersections of Technology and Place (panel): Erika Polson, University of Denver, United States of America; Rowan Wilken, Swinburne Institute for Social Research, Australia; Germaine Halegoua, University of Kansas, United States of America; Bryce Renninger, Rutgers University, United States of America; Adrienne Russell, University of Denver, United States of America (Chair: Jessica Lingel)

Traces of our passage: Locative media and the capture of place data – Rowan Wilken

This is a small part of a book that I’m working on. And I am looking at how technologies are geolocating us… in space, in time, but more so in the ways that they reveal our complex socio-technical context through place. And I’m seeing this from an anthropological point of view of places as having particular

José van Dijck, in her work on social media business models, talks about the use of “location intelligence” as part of the social media ecosystem and economic system.

I want to focus particularly on FourSquare… It has changed significantly since repositioning in 2014, and those changes in their own and the Swarm app seek to generate real time and even predictive recommendations. They do this through combining social data/social graph and location/Places Graph data. They look to understand people as nodes with edges of proximity, co-location, etc. And in places, the places are nodes and the edges are menus, recommendations, etc. So they have these two graphs, and the engineers seek to understand: “What are the underlying properties and dynamics of these networks? How can we predict new connections? How do we measure influence?”. Their work now builds up this rich database of places and data around them.
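[Aside: one way to read “how can we predict new connections?” is as classic link prediction over the social graph – e.g. scoring candidate edges by how many neighbours two nodes share. The toy graph and function below are entirely my own sketch of that general idea, not Foursquare’s actual algorithm:]

```python
from itertools import combinations

# People are nodes; edges record proximity / co-location (a toy stand-in
# for the "people graph" described in the talk).
friends = {
    "ana": {"ben", "caz"},
    "ben": {"ana", "caz", "dev"},
    "caz": {"ana", "ben"},
    "dev": {"ben"},
}

def common_neighbour_score(graph, a, b):
    """Classic link-prediction heuristic: count the neighbours a and b share."""
    return len(graph[a] & graph[b])

# Rank every missing edge as a candidate "predicted connection".
predictions = sorted(
    ((common_neighbour_score(friends, a, b), a, b)
     for a, b in combinations(friends, 2)
     if b not in friends[a]),
    reverse=True,
)
print(predictions[0])  # the highest-scoring candidate connection
```

[The same scoring idea applies to a places graph: two venues visited by many of the same people become candidate recommendations for each other.]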

And these changes have led to new repositioning… This has seen FourSquare selling advertising through predictive analysis… A second service, called Pinpoint, allows marketers to target users of FourSquare – and users beyond FourSquare. This is done through GPS locations, finding patterns and tracking shopping and dining routes…

In the last part of this talk I want to talk about Tim Ingold’s work… For Ingold our perception of place is less about the bird’s eye view of maps, and more about the walked and experienced route, based on the course of moving about in it, of ambulatory knowing. This is perceptual and wayfinding: less about co-ordinates, more about situating position in the context of moving, of what one knows about routing and moving.

So, my contention is that it’s wayfinding or mapping, not map making or map use, that is primarily of interest to these social platforms going forward. Ingold talks about how new maps come from replacement and changes over time… I think that is no longer the case, as what is of interest to companies like Foursquare is the digital trace of our passage, not the map itself.

“We know that right now we are not funky”: Placemaking practices in smart cities – Germaine Halegoua, University of Kansas

I am looking at attempts to use underused urban spaces, based on interviews with planners, architects, developers, about how they were developing these spaces – often on reclaimed land or infill – and about what makes them special and unique.

Placemaking is almost always defined as a bottom up process, often linked to home or making somewhere feel like home… But theories of placemaking are less often thought of as strategic – thinking of Kirkpatrick, or Le Corbusier – and the idea that these are spaces for dominant players: military, powerful people. So in these urban settings strategic placemaking connects to powerful people, connected and valued around these international players.

I wanted to look at the differences between the planning behind these spaces and smart cities versus the lived experiences and processes. Smart cities are about urban imaginaries: sustainable urbanism – everything is LEED certified!; technoscientific urbanism – data capture is built in, and data and technology are thought of as progressive solutions to our problems; and urban triumphalism (Brenner & Schmid 2015). These smart cities are purported as visionary designs, as coming from the modern needs of people… taking the best of global cities around the world, with named locations and designs coming in as fragments from other places. Digital media are used to show that this place works – as a place for ideas, a place to get things done… that they are like campus-based communities, like Silicon Valley, a better place than before…

There is this statistic that 70% of all people live in cities, and growing… But cities are seen as dumb, problematic, in need of updating… They need order, and smart cities are seen as a solution. There is an ordered view of the city as a lab – a showroom and demonstration space as well as a petri dish for transforming technology. And these are cities built of systems on top of systems – literally (Le Corbusier-like but with a flowing soft aesthetic) – bringing things together. So, in Songdo you see this range of services in the space. And in TechCity we see apps and connectedness within the home… Smart cities are monitoring traffic with centralised systems, monitoring biosigns, climate, etc… But the green spaces or sustainable urbanism are about getting you to live and linger… So you have this odd mixture of not spending time in the streets, and these green spaces to linger in…

But these are quite cold spaces… Vacancies are extremely high. They are seen as artificial. My talk quote is from a developer who feels that the solution is to bring in some funk… To programme serendipity into their lives… The answer is always more technology…

So a few themes here… There is the “people problem”: attracting people to the place – it’s not “funky”; and placing people within the union of technology and physical design – the claim is that tech puts man first and serves the needs of the end user… but there is also a sense of people as “bugs”. And I am producing all this data that isn’t about my experience of the city, but which shapes that experience.

Geo-social media and the quest for place on-the-go – Erika Polson, University of Denver

This is coming out of my latest book, a multi-site ethnographic project. In the recent work I have developed an idea of digital place making… And this has been about how location technology can be used to shape the space of mobile people.

Expatriation was previously a post WWII experience, and a family affair… Often those assignments failed, sometimes because one partner (often female) couldn’t work. So, as corporations try to globalise there is a move to send younger, single assignees, replacing families – they are cheaper and easier to relocate, they are more used to the idea of a global professional life, and they are enthusiastic.

And we don’t just see people moving once, we see serial global workers… The international experience can be seen as one where “a global lifestyle is seen as attractive and exciting” (Anne Marie Fetcher 2009(?)) but that may not reflect reality. There can be deep feelings of loneliness, the experience doesn’t match expectations, they miss out on families, and they lack social connections and possibilities to socialise. Margaret Malewski writes in Generation Expatriot (2005) about how there can be an increasing dependency on friends at home, and the need for these expatriates to get out and meet people…

So, my work is based on a range of meetup apps, from Grindr and Tinder, to MeetUp, InterNations and (less of my focus) Couch Surfing… Tools to build connections and find each other. I have studied use of these apps in Paris, Bangalore and Singapore. So this image is of a cafe in Paris full of people – the first meetup that I went to; it was intimidating to walk into but immediately someone approached me… And I started to think about digital place-making about two months into the Paris experience when a friend wanted to meet for dinner and I was at a MeetUp. He was floored by his discomfort with talking to a bar full of strangers in Paris – he's a local guy, he speaks perfect English, he's very sociable… On any other night he would have owned the space, but he was thrown by these expats making the space their own, through MeetUp, through their profiles, through a discourse of "who we are" and a pre-articulation of some of the expectations and norms.

This made me think about the idea of Place and the feelings of belonging and place attachment (Coulthard and Ledema 2009), about shared meanings of place. We’ve seen lots of work on online world and how to create that sense of place, of attachment, or shared meaning.

So, if everyone is able to drop in and feel part of a place… And if professionals can do this, who else can? So, I'm excited to hear the next paper on Grindr. But it's interesting to think about who is out-of-place, and about the quality of place and place relations. And about the fact that even as these people maintain a positive narrative of working globally, there is also a feeling of following a common template or script. And there are problems with place-on-the-go for social commitments and community building… A willingness to meet up again, to drop in, rather than to create anything.

Grindr – Bryce Renninger, Rutgers University, United States of America

I work on open government issues and the site of my work is Grindr – a location based, mainly male, mainly gay and bi casual dating space. And where I am starting from is the idea that Grindr is killing the gay bar (or gayborhood or the gay resort town), which is part of the gay press narrative, for instance in articles on the Pines neighbourhood of Fire Island from New York Magazine. These quote Ghaziani, author of There Goes the Gayborhood, saying that having the app means they don't need Boystown any more… And I think this narrative comes from concerns about valuing or not valuing these gay towns, resorts and bars, and about the willingness to defend those spaces. Bumgarner (2013) argues that the app does the same thing as the bar… But that assumes that the bar/place is only there to introduce people to each other for a narrow range of purposes…

And my way of thinking about this is to think of technologies in democratic ways… Sclove talks about design criteria for democratic technologies, mainly to do with local labour and contribution, but this can also be overlaid on social factors as well. And I think there is a space for democratically deliberating as sex publics. Michael Warner responds to Andrew Sullivan by problematizing his idea that "normal" is the place for queer people to exist. There are also authors writing on design in public sex spaces as a way to improve health outcomes.

The founder of Grindr says it isn't killing the gay bar, and indeed provides a platform for bars to advertise on. And a quote shown here about how it is used illustrates the wide range of uses of Grindr (beyond the obvious). I don't think that Ghaziani's writing talks enough about what the gayborhoods and LGBT spaces are, how they can be class- and race-exclusive, fitting into gentrification of public spaces… And therefore I recommend Christina Lagazzi's book.

One of the things I want to do with this work is to think about how narratives in which platforms play a part can be written about and spoken about in ways that allow challenges to popular discourses of technological disruption. The idea that technological disruption is exciting is prevalent, and we aren't doing enough to challenge that. The AirBnB billboard campaign – a kind of "Fuck You" to the San Francisco authorities and the legal changes to limit their business – is a reminder that we can respond to disruption…

I’m out of time but I think we need to think critically, about social roles of technology and how technological organisations figure into that… And to acknowledge ethnography and press.

Defining space through activism (and journalism): the Paris climate summit – Adrienne Russell, University of Denver

I've been working with researchers around the world on the G8 and climate summits for around ten years, and on the coverage around them. I've been looking at activists and how they liven up the spaces where meetings take place…

But let me start with an image of Black Lives Matter protestors from the Daily Mail commenting on protestors using mobile phones. It exemplifies the idea that being on your phone means that you are not fully present… If they are on their phone, they aren't that serious. This fits a long-term pattern of protest coverage that seems to suggest that in-person protest is more effective and authentic than social media – although the literature shows that it is both approaches in combination that are most effective. And then there is the issue of official versus unofficial action. Activists at the 2015 Paris climate summit were especially reliant on online work as protests were banned, public spaces were closed, and activists were placed under house arrest… So they had been preparing for years but their action was restricted.

So, one of the ways that protestors took action was through tools like Climate Games, a real-time game which enabled you to see real-time photography, but also to highlight surveillance… It was non-violent but called police and law enforcement "team blue", and lobbyists and greenwashers "team grey"!

Probably many of you saw the posters across Paris – mocking corporate ad campaigns – e.g. a VW ad saying "we are sorry we got caught". So you saw these really interesting alternative narratives and interpretations. There was also a hostel called Place to B which became a de facto media centre for protestors, with interviews being given throughout the event. There was a hub of artists who raised issues faced in their own countries. And outside the city there was a venue where they held a mock trial of Exxon vs the People with prominent campaigners from across the globe; this was on the heels of releases showing Exxon had evidence of climate change twenty years back and ignored it. The mock trial made a real media event.

So all these events helped create an alternative narrative. And that crackdown on protest reflects how we are coming to understand this type of top-down event… And resistance through media, and counter-narratives to mainstream media running predictable official lines.

Panel Q&A
Q1) I have a question, maybe a pushback to you Germaine… Or maybe not… Who are the “they” you are talking about… You talk about city planners… I admire the critique so I want to know who “they” are, and should we problematise that, especially in contemporary smart cities discourses…
A1 – Germaine) It's CISCO, Siemens, IBM… Those with smart cities labs… Those are the "they". And I've seen the networking of the expert – it is always the same people… The language is really specific and consistent. Everyone is using this term "solutions"… This is the language used to talk about the problems… So "they" are transnational, often US-based tech corporations with in-house smart cities labs.
Q1) But “they” are also in meetings across the world with lots of different stakeholders, including those people, but others are there. It looks like you are pulling from corporate discourses… Have you traced how that is translating into everyday city planners who host conferences and events they all meet at… And how that plays out and adopt it…
A1 – Germaine) The furthest I've gone with this is CIOs and city planners… But it's a really interesting question…
Q1) I think it would be interesting, and a direction we need to take… How discourses are played out and adopted.
Q2) So I was wanting to follow up that question by asking about the role of governments and funders. In the UK right now there is a big push from Government to engage in smart cities, and that offers local authorities a source of capital income that they are keen to take, but then they need providers to deliver that work and are turning to these private sector players…
A2) The cities I have looked at claim no vacancies, or very low vacancy rates – and a need to build more units because all are already sold. Some are dormitories for international schools… That lack of join-up between the ownership and real estate narrative and lived experience is really striking. In Kansas they are retrofitting the city as a smart city, and taking on that discourse of efficiencies and cost effectiveness…
Q3) How do narratives here fit versus what we used to have as the Cultural Cities narrative…. Who is pushing this? It’s not the same people from civil society perhaps?
A3 – Erika) When I was in Singapore I had this sense of an almost sterile environment. And I learned that the red light district had been cleaned up, moving the transvestites and sex workers out… People thought it was too boring… And they started hiring women to dress as men dressed as women to liven it up…
Q4 – Germaine) I wanted to ask about the discourse around the gaybourhood and where they come from…
A4 – Bryce) I think there are particular stakeholders… So one of the articles I showed was about the closure of one of the oldest gay bars in New York, and the idea that Grindr caused that, but someone pointed out in the comments that actually real estate prices are the issue. And there is also this change that came from Mayor Giuliani wanting Christopher Street to be more consistent with the rest of New York…
Q5) I was wondering how that location data and tracking data from Rowan’s paper connects with Smart Cities work…
A5 – Germaine) That idea of tracing is common, but the idea of relational space, whilst there, doesn't really work as it isn't made yet… There isn't sufficient density of people for that… They need the people for that data. The social media layer is relatively invisible, but it's there… And there really is something connected there.
A5 – Rowan) The move to pinpoint technology at FourSquare – they may be interested in Smart Cities… But quite a lot of the critiques I've read say it's just about consumption… I'm tired of that… I think they are trying to do something more interesting, to get at the complexity of everyday life… In Melbourne there was a planned development called Docklands… There is nothing there on Foursquare…
A5 – Erika) I am surprised that they aren’t hiring people to be people…
A5 – Rowan) I was thinking about that William Gibson comment about street signs. One of the things about Docklands was that it had high technology and good connections but low population so it did become a centre for crime.
Q6) I work with low income/low socio-economic groups. How are people ensuring that those communities are part of smart cities, or that their interests are voiced?
A6 – Germaine) In Kansas City Google wired neighbourhoods, but that also raised issues around neighbourhoods that were not reached… And that came from activists. Cable wasn't fitted for poor and middle-income communities, but data centres were located in them. You also see small mesh and line-of-sight networks emerging as a countermeasure in some neighbourhoods. In that place it was activists and the press… And in Kansas City it is being picked up as a story.
A6 – Rowan) In my field Jordan Frick does great work in this area, particularly on issues of monolingualism and how that excludes communities.
A6 – Erika) Tim Cresswell does really interesting work in this space… As I've thought about place and whose place a particular space is, I've been thinking about activists and police in the US. It would be interesting to look at.
A6 – Adrienne) People who use Tor, who resist surveillance, are almost exclusively well off and tech savvy…
PS-32: Power (chair: Lina Dencik)
Lina: We have another #allfemalepanel for you! On power. 
The Terms Of Service Of Online Protest – Stefania Milan, University of Amsterdam, The Netherlands.
This is part of a bigger project which is slowly approaching book stage, so I won’t sum everything up here but I will give an overview of the theoretical position.
So, one of our starting points is the materiality and broker role of semiotechnologies, and particularly the mediation of social media and the ways that materiality contributes here. I am a sociologist and I'm looking at change. I have been accused of being a techno-determinist… Yes, to an extent – I play with this. And I am working from the perspective that the algorithmically mediated environment of social media has the ability to create change.
I look at the micro level and meso level, looking at interactions between individuals and how those make a difference. Collective action is a social construct – the result of interactions between social actors (Melucci 1996) – not a huge surprise. Organisation is a communicative and expressive activity. And there is a centrality of sense-making activities (i.e. how people make sense of what they do); meaning construction is embedded here. That shouldn't be a surprise either. Media tech and the internet are not just tools but both metaphors and enablers of a new configuration of collective action: cloud protesting. That's a term I stick with – despite much criticism – as I like the contradiction that it captures… the direct, on-the-ground, individual, and the large, opaque, inaccessible.
So, one feature of "cloud protesting" is the cloud as an "imagined online space" where resources are stored. In social movements there is something important there around shared resources. In this case resources are soft resources – information and meaning-making resources. Resources are the "ingredients" of mobilisation. Cyberspace gives these soft resources an (immaterial) body.

The cloud is a metaphor for organisational forms… And I relate that back to organisational forms of the 1960s, and to later movements, and now the idea of the cloud protest. The cloud is also an analogy for individualisation – many of the nodes are individuals, who reject pre-packaged, non-negotiable identities and organisations. The cloud is a platform where the movement's resources can be stored… But a cloud movement does not require commitment and can be quite hard to activate and mobilise.

Collective identity, in these spaces, has some particular aspects. The “cloud” is an enabler, and you can identify “we” and “them”. But social media spaces overly emphasise visibility over collective identity.

The consequences of the materiality of social media are seen in four mechanisms: centrality of performance; interpellation of fellows and opponents; expansion of the temporality of the protest; reproducibility of social action. Now, much of that enables new forms of collective action… But there are both positive and negative aspects. Something I won't mention here is surveillance and its consequences for collective action.

So, what's the role of social media? Social media act as intermediaries, enabling speed in protest organisation and diffusion – shaping and constraining collective action too. The cloud is grounded in everyday technology – everyone has it in his/her pocket. The cloud has the power to deeply influence not only the nature of the protest but also its tactics. Social media enables the creation of a customisable narrative.

Hate Speech and Social Media Platforms – Eugenia Siapera, Paloma Viejo Otero, Dublin City University, Ireland

Eugenia: My narrative is also not hugely positive. We wanted to look at how social media platforms themselves understand, regulate and manage hate speech on their platforms. We did this through an analysis of terms of service, and through in-depth interviews with key informants at Facebook, Twitter, and YouTube. These platforms are happy to talk to researchers but not to be quoted. We have permission from Facebook and Twitter. YouTube have told us to re-record interviews with lawyers and PR people present.

So, we had three analytical steps – starting with what "hate speech" means.

We found that there is no use of definitions of hate speech based on law. Instead they put in reporting mechanisms and use that to determine what is/is not hate speech.

Now, we spoke to people from Twitter and Facebook (indeed there are a number of staff members who move from one to the other). The tactic at Facebook was to make rules about what will be taken down (and what won't), hiring teams to apply them, and then to help ensure the rules are appropriate. Twitter took a similar approach. So, the definition largely comes from what users report as hate speech rather than from external definitions or understandings.

We had assumed that the content would be assessed manually and algorithmically, but actually reports are reviewed by real people. Facebook has four teams across the world. These are native speakers – to ensure that they understand context – and they prioritise self-harm and some other categories.

Platforms are reactively rather than proactively positioned. Take-downs are not based on the number of reports. Hate speech is considered in context – a compromising selfie of a young woman in the UK isn't hate speech… unless in India, where it may impact on marriage (see Hot Girls of Mumbai – in that case they didn't take it down on that basis but did deal with it directly with the …). And if in doubt they leave the content up.

Twitter talk about a reluctance to share information with law enforcement – protective of users, protective of freedom of speech. They are not keen to remove someone; they would prefer counter-arguments. And there are also tensions created by different local regulations and the global operations of the platforms – tension that is resolved by compromise (not the case for YouTube).

A Twitter employee talked about the challenges of meeting with representatives from government, where there is tension between legislation and commercial needs, and the need for consistent handling.

There is also a tension in the principled stance assumed by social media corporations, which sends the user to block and protect themselves first – a focus on safety, security and personal responsibility. And they want users to feel happy and secure.

Some conclusions… Social media corporations are increasingly acquiring state-like powers. Users are conditioned to behave in ways conforming to social media corporations’ liberal ideology. Posts are “rewarded” by staying online but only if they conform to social media corporations’ version of what constitutes acceptable hate speech.

#YesAllWomen (have a collective story to tell): Feminist hashtags and the intersection of personal narratives, networked publics, and intimate citizenship – Jacqueline Ryan Vickery, University of North Texas, United States of America

The original idea here was to think about second wave feminism and the idea of sharing personal stories to make the personal political, and how that looks online. I am building on Plummer's (2003) work in this area. All was well… And then I went down the rabbit hole of publics and the public discourses that are created when people share personal stories in public spaces… So I have tried to map these aspects. Thinking about the goals of hashtags and who started them as well – not something non-academics tend to look at. I will also be talking about the hashtags themselves.

So I tried to think about and map goals, political and affective aspects, and the affordances and relationships around these. The affordances of hashtags include: Curational – immediacy, reciprocity and conversationality (Papacharissi 2015); Polysemic – plurality, open signifiers, diverse meanings (Fiske 1987); Memetic – replicable, ever-evolving, remixable, spreadable cultural information (Knobel and Lankshear 2007); Duality in communities of practice – opposing forces that drive change and creativity, local and broader for instance (Wenger 1998); Articulated subjectivities – momentarily jumping in and out of hashtags without really engaging beyond brief usage.

And how can I understand political hashtags on Twitter and their impact? Are we just sharing amongst ourselves, or can we measure that? So I want to think about agenda setting and re-framing – the hashtags I am looking at speak to a public event, or speak back to a media event that is taking place another way. We see co-option by organisations etc. And we see (strategic) essentialism; awareness/mobilisation; amplification/silencing of (privileged/marginalised) narratives. So #YesAllWomen was adopted by many privileged white feminists but was started by a biracial Muslim woman. Indeed all of the hashtags I study were started by non-white women.

So, looking at #YesAllWomen: it was in response to a terrible shooting by a man who had written a diatribe about women denying him. The person who created the hashtag left Twitter for a while but has now returned. And we do see lots of tweets that use the hashtag, responding with relevant experiences and comments. But it became problematic, too open… That memetic affordance – a controversial male monologist used it as a title for his show, it was used abusively and for trolling, and beauty brands were there too.

The #WhyIStayed hashtag was started by Beverley Gooden in response to commentary that a woman should have left her partner – the media wasn't asking why that man had beaten and abused his partner. So people shared real stories… But a pizza company also used it – though they apologised and admitted not researching it first. Some found the hashtag traumatic… But others shared resources for other women here…

So, I wanted to talk about how these spaces are creating networked publics, and they do have power to effect change. I also think about that idea of openness, of lack of control, and the consequences of that openness. #YesAllWomen has lost its meaning to an extent, and is now a very white hashtag. But if we look at these hashtags and think about them with social theories, we can think about what this means for future movements and publicness.

Internet Rules, Media Logics and Media Grammar: New Perspectives on the Relation between Technology, Organization and Power – Caja Thimm, University of Bonn, Germany

I’m going to report briefly on a long term project on Twitter funded by a range of agencies. There is also a book coming on Twitter and the European Election. So, where do we start… We have Twitter. And we have tweets in French – e.g. from Marine Le Pen – but we see Tweets in other languages too – emoticons, standard structures, but also visual storytelling – images from events.

We have politicians, witnesses, and we see other players, e.g. the police. So first of all we wanted a model for tweets and how we can understand them. So we used the Functional Operator Model (Thimm et al 2014) – but that's descriptive – great for organising data but not for analysing and understanding platforms.

So, we started with a conference on Media Logic, an old concept from the 1970s. Media Logic offers an approach to develop parameters for a better analysis of such new forms of "media". It defines players, objectives and power, and how players interact and what they do (e.g. how do others conquer a hashtag, for instance). Consequently, media logic can be considered as a network of parameters.

So, what are the parameters of Media Logics that we should understand?

  1. Media Logic and communication cultures. For instance how politicians and political parties take into account media logic of television – production routines, presentation formats (Schulz 2004)
  2. Media Logic and media institutions – institutional and technological modus operandi (Hjarvard 2014)
  3. Media Grammar – a concept drawn from analogy of language.

So, let's think about the constituents of "Media Grammar". Periscope came out of a need, a gap… So you have Surface Grammar – visible and accessible to the user (language, semiotic signs, sounds etc). Surface Grammar is (sometimes) open to the creativity of users. It guides use of the medium.

(Constitutive) Property Grammar is different. It is constitutive for the medium itself, and determines the rules at the functional level of the surface grammar. It consists of algorithms (though not exclusively). It is not accessible to anyone but the platform itself. And surface grammar and property grammar form a reflexive feedback loop.

We also see van Dijck and Poell (2013) talking about social media as powerful institutions, so there is the idea of connecting social media grammar here to understand that… This opens up the focus on the open and hidden properties of social media and their interplay with communicative practices. Social media are differentiated, segmented and diverse to such a degree that it seems necessary to focus in more closely to gain a better idea of how we understand them as technology and society…

Panel Q&A

Q1) A general question to start off. You presented a real range of methodologies, but I didn’t hear a lot about practices and what people actually do, and how that fits into your models.

A1 – Caja) We have a six-year project, millions of tweets, and we are trying to find patterns of what users do, and who does what. There are real differences in usage but we are still working on what those mean.

A1 – Jacqueline) I think that when you look at very large hashtags, even #blacklivesmatter, you do see community participation. But the tags I'm looking at are really personal, not "Political"; these use hashtags as a momentary act in some way, and are not really a community of practice in a sustainable movement – though some are triggering bigger movements and engagement…

A1 – Eugenia) We see hate speech being gamed… People put outrageous posts out there to see what will happen, whether they will be taken down…

Q2) I've been trying to find an appropriate framework… The field is so multidisciplinary… For a study I did on Native American activists, we saw interest groups – discursive groups – loosely stitched together with #indigenous. I'm sort of using the phrase "translator" to capture this. I was wondering if you had any thoughts on how we navigate this…

A2 – Caja) It's a good question… This conference is very varied, there are so many fields… Socio-linguistics has some interesting frameworks for accommodation on Twitter. No-one seems to have published on that.

A2 – Jacqueline) I think understanding the network, the echo chamber effects, mapping of that network and how the hashtag moves, might be the way in there…

Q2) That’s what we did, but that’s also a problem… But hashtag seems to have a transformative impact too…

Q3) I wonder, if we say Social Media Logic, do we lose sight of the overarching issue…

A3 – Caja) I think that Media Logic is at a really early stage… It was founded in the 1970s when media were so different. But there are real power asymmetries… And I hope we find a real way to bridge the two.

Q4) Many of these arguments come down to how much we trust the idea of structure in action. Eugenia talked about creating rules iteratively around the issue. Jacqueline talked about the contested rules of play… It's not clear who defines those terms in the end…

A4 – Eugenia) There are certain media logics in place now… But they are fast moving as social media move to monetise, to develop, to change. Twitter launches Periscope, Facebook then launches Facebook Live! The grammar keeps on moving, aimed towards the users… Everything keeps moving…

A4 – Caja) But that's the model. The dynamics are at the core. I do believe in media grammar at the level of small elements that are magic – like the hashtag, which has transcended the platform and even the written form. But it's about how they work, and whether there are logics inscribed.

A4 – Stefania) There are, of course, attempts made by the platform to hide the logic, and to hide the dynamics of the logic… I have seen this even at a radical activist conference, among activists who cannot imagine their activism without the platform – and that statement also comes from a belief that they understand the platform.

Q5) I study hate speech too… I came with my top five criticisms but you covered them all in your presentation! You talked about location (IP address) as a factor in hate speech, rather than jurisdiction.

A5 – Eugenia) I think they (nameless social platform) take this approach in the same way that they do for take down notices… But they only do that for France and Germany where hate speech law is very different.

A5 – Caja) There is a study that has been taking place on take-downs and the impact of pressure, politics, and effects across platforms when dealing with issues in different countries.

A5 – Eugenia) Twitter has relationships with NGOs, and prioritises their requests, sometimes automatically. They give guidance on how to do that, but they are outsourcing that process to these users…

Q6) I was thinking about platform logics and business logics… And how business models are part of these logics. I was wondering if you could talk about some of the methodological issues there… And the issue of the growing powers of governments – for instance Benjamin Netanyahu meeting Mark Zuckerberg and talking to him about taking down Arabic journalists.

A6 – Eugenia) This is challenging… We want to research them and we want to critique them… But we don’t want to find ourselves blacklisted for doing this. Some of the people I spoke to are very sensitive about, for instance, Palestinian content and when they can take it down. Sometimes though platforms are keen to show they have the power to take down content…

Q7) For Eugenia: you had very good access to people at these platforms. I'm not surprised they are reluctant to be quoted… But that access is quite difficult in our experience – how did you do it?

A7) These people live in Dublin so you meet them at conferences; there are crossovers through shared interests. Once you get in it's easier to meet and speak to them… Speaking is OK; quoting and identifying names in our work is different. But it's not just in social media…

Comment) These people really are restricted in who they can talk to… There are PR people at one platform… You ask for comparative roles and use that as a way in… You can start to sidle inside. But mainly it's the PR people you can access… I've had some luck referring to a role area at a given company, rather than to a name.

Q8 – Stefania) I was wondering about our own roles, in this room, and the issue of agency and publics…

A8 – Jacqueline) I don't think publics take agency away; in the communities I look at these women benefit from the publics, and from sharing… But actually what we understand as publics varies… So in some publics there is talk of the exclusion of, e.g., women or people of colour, but there are counter-publics…

A8 – Caja) Like you were saying there are mini publics and they can be public, and extend out into media and coverage. I think we have to look beyond the idea of the bubble… It’s really fragmented and we shouldn’t overlook that…

And with that, the conference is finished. 

You can read the rest of my posts from this week here:

Thanks to all at AoIR for a really excellent week. I have much to think about, lots of contacts to follow up with, and lots of ideas for taking forward my own work, particularly our new YikYak project.