Jun 16 2017
 

It’s the final day of the IIPC/RESAW conference in London. See my day one and day two posts for more information on this. I’m back in the main track today and, as usual, these are live notes so comments, additions, corrections, etc. are all welcome.

Collection development panel (Chair: Nicola Bingham)

James R. Jacobs, Pamela M. Graham & Kris Kasianovitz: What’s in your web archive? Subject specialist strategies for collection development

We’ve been archiving the web for many years but the need for web archiving really hit home for me in 2013 when NASA took down every one of their technical reports – for review on various grounds. And the web archiving community was very concerned. Michael Nelson said in a post “NASA information is too important to be left on nasa.gov computers”. And I wrote about when we rely on pointing not archiving.

So, as we planned for this panel we looked back on previous IIPC events and we didn’t see a lot about collection curation. We posed three topics all around these areas. So for each theme we’ll watch a brief screen cast by Kris to introduce them…

  1. Collection development and roles

Kris (via video): I wanted to talk about my role as a subject specialist and how collection development fits into that. As a subject specialist that is a core part of the role, and I use various tools to develop the collection. I see web archiving as absolutely being part of this. Our collection is books, journals, audio visual content, quantitative and qualitative data sets… Web archives are just another piece of the pie. And when we develop our collection we are looking at what is needed now but in anticipation of what will be needed 10 or 20 years in the future, building a solid historical record that will persist in collections. And we think about how our archives fit into the bigger context of other archives around the country and around the world.

For the two web archives I work on – CA.gov and the Bay Area Governments archives – I am the primary person engaged in planning, collecting, describing and making available that content. And when you look at the web capture life cycle you need to ensure the subject specialist is included and their role understood and valued.

The CA.gov archive involves a group from several organisations including the government library. We have been archiving since 2007 in the California Digital Library initially. We moved into Archive-It in 2013.

The Bay Area Governments archives includes materials on 9 counties, but is primarily and comprehensively focused on two key counties here. We bring in regional governments and special districts where policy making for these areas occurs.

Archiving these collections has been incredibly useful for understanding government, their processes, how to work with government agencies and the dissemination of this work. But as the sole responsible person that is not ideal. We have had really good technical support from Internet Archive around scoping rules, problems with crawls, thinking about writing regular expressions, how to understand and manage what we see from crawls. We’ve also benefitted from working with our colleague Nicholas Taylor here at Stanford who wrote a great QA report which has helped us.

We are heavily reliant on crawlers, on tools and technologies created by you and others, to gather information for our archive. And since most subject selectors have pretty big portfolios of work – outreach, instruction, as well as collection development – having good ties to developers, and to the wider community with whom we can share ideas and questions, is really vital.

Pamela: I’m going to talk about two Columbia archives, the Human Rights Web Archive (HRWA) and Historic Preservation and Urban Planning. I’d like to echo Kris’ comments about the importance of subject specialists. The Historic Preservation and Urban Planning archive is led by our architecture subject specialist and we’d reached a point where we had to collect web materials to continue that archive – and she’s done a great job of bringing that together. Human Rights seems to have long been networked – using the idea of the “internet” long before the web and hypertext. We work closely with Alex Thurman, and have an additional specially supported web curator, but there are many more ways to collaborate and work together.

James: I will also reflect on my experience. The FDLP – Federal Depository Library Program – involves libraries receiving absolutely every government publication in order to ensure a comprehensive archive. There is a wider programme allowing selective collection. At Stanford we are 85% selective – we only weed out content (after five years) very lightly, and usually flyers etc. As a librarian I curate content. As an FDLP library we have to think of our collection as part of the wider set of archives, and I like that.

As archivists we also have to understand provenance… How do we do that with the web archive. And at this point I have to shout out to Jefferson Bailey and colleagues for the “End of Term” collection – archiving all gov sites at the end of government terms. This year has been the most expansive, and the most collaborative – including FTP and social media. And, due to the Trump administration’s hostility to science and technology we’ve had huge support – proposals of seed sites, data capture events etc.

2. Collection Development approaches to web archives, perspectives from subject specialists

As subject specialists we all have to engage in collection development – there are no vendors in this space…

Kris: Looking again at the two government archives I work on, there are Depository Program Statuses to act as a starting point… But these haven’t been updated for the web. However, this is really a continuation of the print collection programme. And web archiving actually lets us collect more – we are no longer reliant on agencies putting content into the Depository Program.

So, for CA.gov we really treat this as a domain collection. And no-one is really doing this except some UCs, myself, and the state library and archives – not the other depository libraries. However, we don’t collect think tanks, or the not-for-profit players that influence policy – this is for clarity, although this content provides important context.

We also had to think about granularity… For instance for the CA transport there is a top level domain and sub domains for each regional transport group, and so we treat all of these as seeds.

Scoping rules matter a great deal, partly as our resources are not unlimited. We have been fortunate that with the CA.gov archive that we have about 3TB space for this year, and have been able to utilise it all… We may not need all of that going forwards, but it has been useful to have that much space.

Pamela: Much of what Kris has said reflects our experience at Columbia. Our web archiving strengths mirror many of our other collection strengths and indeed I think web archiving is this important bridge from print to fully digital. I spent some time talking with our librarian (Chris) recently, and she will add sites as they come up in discussion, she monitors the news for sites that could be seeds for our collection… She is very integrated in her approach to this work.

For the human rights work one of the challenges is the time that we have to contribute. And this is a truly interdisciplinary area with unclear boundaries, and those are both challenging aspects. We do look at subject guides and other practice to improve and develop our collections. And each fall we sponsor about two dozen human rights scholars to visit and engage, and that feeds into what we collect… The other thing that I hope to do in the future is to do more assessment, to look at more authoritative lists in order to compare with other places… Colleagues look at a site called ideallist which lists opportunities and funding in these types of spaces. We also try to capture sites that look more vulnerable – small activist groups – although it is not clear if they actually are that risky.

Cost wise, the expensive parts of collecting are the human effort to catalogue and the permission process. And yesterday’s discussion raised the possible need for ethics groups as part of the permissions process.

In the web archiving space we have to be clearer on scope and boundaries as there is such a big, almost limitless, set of materials to pick from. But otherwise plenty of parallels.

James: For me the material we collect is in the public domain so permissions are not part of my challenge here. But there are other aspects of my work, including LOCKSS. In the case of the Fugitive US Agencies Collection we take entire sites (e.g. CBO, GAO, EPA) plus sites at risk (e.g. Census, Current Industrial Reports). These “fugitive” agencies include publications that should be in the depository programme but are not. And those lost documents that fail to make it out, they are what this collection is about. When a library notes a lost document I will share that on the Lost Docs Project blog, and then I am also able to collect and seed the cloud and web archive – using the WordPress Amber plugin – for links. For instance the CBO report that looked at the health bill, aka Trump Care, was missing… In fact many CBO publications were missing so I have added it as a seed for our Archive-it

3. Discovery and use of web archives

Discovery and use of web archives is becoming increasingly important as we look for needles in ever larger haystacks. So, firstly, over to Kris:

Kris: One way we get archives out there is in our catalogue, and into WorldCat. That’s one place to help other libraries know what we are collecting, and how to find and understand it… So I would be interested to do some work with users around what they want to find and how… I suspect it will be about a specific request – e.g. city council in one place over a ten year period… But they won’t be looking for a web archive per se… We have to think about that, and what kind of intermediaries are needed to make that work… Can we also provide better seed lists and documentation for this? In Social Sciences we have the Code Book and I think we need to share the equivalent information for web archives, to expose documentation on how the archive was built… And linking to seeds and other parts of collections.

One other thing we have to think about is the process and document ingest mechanism. We are trying to do this for CA.gov to better describe what we do… But maybe there is a standard way to produce that sort of documentation – like the Codebook…

Pamela: Very quickly… At Columbia we catalogue individual sites. We also have a customised portal for the Human Rights. That has facets for “search as research” so you can search and develop and learn by working through facets – that’s often more useful than item searches… And, in terms of collecting for the web we do have to think of what we collect as data for analysis as part of a larger data sets…

James: In the interests of time we have to wrap up, but there was one comment I wanted to make, which is that there are tools we use but also gaps that we see for subject specialists [see slide]… And Andrew’s comments about the catalogue struck home with me…

Q&A

Q1) Can you expand on that issue of the catalogue?

A1) Yes, I think we have to see web archives both as bulk data AND collections as collections. We have to be able to pull out the documents and reports – the traditional materials – and combine them with other material in the catalogue… So it is exciting to think about that, about the workflow… And about web archives working into the normal library work flows…

Q2) Pamela, you commented about a permissions framework as possibly vital for IRB considerations for web research… Is that from conversations with your IRB or speculative?

A2) That came from Matt Webber’s comment yesterday on IRB becoming more concerned about web archive-based research. We have been looking for faster processes… But I am always very aware of the ethical concern… People do wonder about ethics and permissions when they see the archive… Interesting to see how we can navigate these challenges going forward…

Q3) Do you use LCSH and are there any issues?

A3) Yes, we do use LCSH for some items and the collections… Luckily someone from our metadata team worked with me. He used Dublin Core, with LCSH within that. He hasn’t indicated issues. Government documents in the US (and at state level) typically use LCSH so no, no issues that I’m aware of.

Plenary (Macmillan Hall): Posters with lightning talks (Chair: Olga Holownia)

Olga: I know you will be disappointed that it is the last day of Web Archiving Week! Maybe next year it should be Web Archiving Month… And then year!

So, we have lightning talks that go with posters that you can explore during the break, and you can speak to the presenters as well.

Tommi Jauhiainen, Heidi Jauhiainen, & Petteri Veikkolainen: Language identification for creating national web archives

Petteri: I am web archivist at the National Library of Finland. But this is really about Tommi’s PhD research on native Finno-Ugric languages and the internet. This work began in 2013 as part of the Kone Foundation Language Programme. It gathers texts in small languages on the web… They had to identify that content to capture them.

We extracted the web links on Finnish web pages, and also crawled Russian, Estonian, Swedish, and Norwegian domains for these languages. They used HeLI and Heritrix. We used the list of Finnish URLs in the archive, rather than transferring the WARC files directly. So HeLI is the Helsinki language identification method, one of the best in the world. It can be found on GitHub. And it can be used for your language as well! The full service will be out next year, but you can ask HeLI if you want that earlier.

Martin Klein: Robust links – a proposed solution to reference rot in scholarly communication

I work at Los Alamos, I have two short talks and both are work with my boss Herbert Van de Sompel, who I’m sure you’ll be aware of.

So, the problem of robust links is that links break and reference content changes. It is hard to ensure the author’s intention is honoured. So, you write a paper last year, point to the EPA, the DOI this year doesn’t work…

So, there are two ways to do this… You can create a snapshot of a referenced resource… with Perma.cc, Internet Archive, Archive.is, WebCite. That’s great… But the citation people use is then the URI of the archive copy… Sometimes the original URI is included… But what if the URI-M is a copy elsewhere – archive.is or the no longer present mummy.it.

So, second approach, decorate your links by referencing: the URI of the archived copy, the datetime of archiving, and the resource’s original URI. That makes your link more robust, meaning you can still find the live version. The original URI allows finding captures in all web archives. The capture datetime lets you identify when/what version of the site is used.

How do you do this? With HTML5 link decoration: data- attributes (data-original and data-versiondate) sit alongside the href attribute. And we talked about this in a D-Lib article, with some JavaScript that makes that actionable!
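As a rough illustration of link decoration, here is a minimal Python sketch that builds a decorated anchor. The attribute names follow these notes (data-original and data-versiondate); the function name, the example URIs, and the choice to point href at the archived copy are my own assumptions, so check the published Robust Links documentation for the exact attribute names before relying on this.

```python
from html import escape


def robust_link(archived_uri, original_uri, version_date, text):
    """Build an <a> element carrying the original URI and capture date.

    Hypothetical helper: attribute names are taken from the talk notes,
    not verified against the Robust Links specification.
    """
    return (
        f'<a href="{escape(archived_uri, quote=True)}" '
        f'data-original="{escape(original_uri, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


# Example values are placeholders, not real archive URIs.
link = robust_link(
    "https://archive.example/20170616/http://www.example.org/report",
    "http://www.example.org/report",
    "2017-06-16",
    "Example report",
)
print(link)
```

With all three pieces of information in the markup, a script on the page can offer the reader the live page, the cited snapshot, or other captures near that date.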

So, come talk to me upstairs about this!

Herbert Van de Sompel, Michael L. Nelson, Lyudmila Balakireva, Martin Klein, Shawn M. Jones & Harihar Shankar: Uniform access to raw mementos

Martin: Hello, it’s still me, I’m still from Los Alamos! But this is a more collaborative project…

The problem here… Most web archives augment their mementos with custom banners and links… So, in the Internet Archive there is a banner from them, and links are rewritten to point to a copy in the archive. There are lots of reasons, legal, convenience… But that enhancement doesn’t represent the website at the time of capturing… As a researcher those enhancements are detrimental as you have to rewrite links again.

For us and our Memento Reconstruct, and other replay systems that’s a challenge. Also makes it harder to check the veracity of content.

Currently some systems do support this… OpenWayback and pywb do allow this – you can add the {datetime}im_/URI-R to do this, for instance. But that is quite dependent on the individual archive.

So, we propose using the Prefer Header in HTTP Request…

Option 1: Request header sent against Time Gate

Option 2: Request header sent against Memento

So come talk to us… Both versions work, I have a preference, Ilya has a different preference, so it should be interesting!
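To make the idea concrete, here is a small Python sketch of a client attaching a Prefer header (the general mechanism from RFC 7240) to a request for a memento or TimeGate. The preference token "raw" and the archive URI are placeholders I have invented for illustration; the talk did not record the actual token values proposed, and no request is sent here.

```python
from urllib.request import Request


def raw_memento_request(uri, preference="raw"):
    """Prepare (but do not send) a request asking for an unaltered memento.

    The default preference token is hypothetical; an archive would need
    to document which tokens it actually honours.
    """
    return Request(uri, headers={"Prefer": preference})


# Placeholder URI standing in for a TimeGate (option 1) or memento (option 2).
req = raw_memento_request("https://archive.example/timegate/http://www.example.org/")
print(req.get_header("Prefer"))
```

The same helper covers both options, since the only difference is whether the URI passed in is a TimeGate or a specific memento.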

Sumitra Duncan: NYARC discovery: Promoting integrated access to web archive collections

NYARC is a consortium formed in 2006 from research libraries at Brooklyn Museum, The Frick Collection and the Museum of Modern Art. There is a two year Mellon grant to implement the program. And there are 10 collections in Archive-It devoted to scholarly art resources – including artist websites, gallery sites, catalogues, lists of lost and looted art. There is a seed list of 3900+ sites.

To put this in place we asked for proof of concept discovery sites – we only had two submitted. We selected Primo from Ex Libris. This brings in materials using the OpenSearch API. The set up does also let us pull in other archives if we want to. And you can choose whether to include the web archive (or not). The access points are through MARC Records and Full Records Search, and are in both the catalogue and WorldCat. We don’t, however, have faceted results for the web archive as it’s not in the API.

And recently, after discussion with Martin, we integrated Memento into the archive, which lets users explore all captured content with Memento Time Travel.

In the future we will be doing usability testing of the discovery interface, we will promote use of web archive collections, and encouraging use in new digital art projects.

Find NYARC’s Archive-It Collections: www.nywarc.org/webarchive. Documentation at http://wiki.nyarc.??

João Gomes: Arquivo.pt

Olga: Many of you will be aware of Arquivo. We couldn’t go to Lisbon to mark the 10th anniversary of the Portuguese web archive, but we welcome Joao to talk about it.

Joao: We have had ten years of preserving the Portuguese web, collaborating, researching and getting closer to our researchers, and ten years celebrating a lot.

Hello I am Joao Gomes, the head of Arquivo.pt. We are celebrating ten years of our archive. We are having our national event in November – you are all invited to attend and party a lot!

But what about the next 10 years? We want to be one of the best archives in the world… With improvements to full text search, and the launch of new services – like image searching and high quality archiving services. We are launching an annual prize for research projects over Arquivo.pt. And at the same time we want to increase our collection and user community.

So, thank you to all in this community who have supported us since 2007. And long live Arquivo.pt!

Changing records for scholarship & legal use cases (Chair: Alex Thurman)

Martin Klein & Herbert Van de Sompel: Using the Memento framework to assess content drift in scholarly communication

This project is to address both link rot and content drift – as I mentioned earlier in my lightning talk. I talked about link rot there; content drift is where the content at a URI changes, perhaps out of all recognition, so that what I cite is not reproducible.

You may or may not have seen this but there was a Supreme Court case referencing a website, and someone thought it would be really funny to purchase that, put up a very custom 404 error. But you can see pages that change between submission and publication. By contrast if you look at arxiv for instance you see an example of a page with no change over 20 years!

This matters partly as we reference URIs increasingly, hugely so since 2008.

So, some of this I talked about three years ago where I introduced the Hiberlink project, a collaborative project with the University of Edinburgh where we coined the term “reference rot”. This issue is a threat to the integrity of the web-based scholarly record. Resources do not have the same sense of fixity as e.g. a journal article. And custodianship is also not as long term; custodians are not always as interested.

We wrote about link rot in PLoS ONE. But now we want to focus on content drift. We published a new article on this in PLoS ONE a few months ago. This is actually based on the same corpus – the entirety of arXiv, of PubMed Central, and also over 2 million articles from Elsevier. This covered publications from January 1997 to December 2012. We only looked at URIs for non scholarly articles – not the DOIs but the blog posts, the Wikipedia page, etc. We ended up with a total of around 1 million URIs for these corpora. And we also kept the start date of the article with our data.

So, what is our approach for assessing content drift? We take the publication date of the article referencing the URI as t. Then we try to find a Memento Pre of the referenced URI (t-1) and a Memento Post of the referenced URI (t+1). Two thirds of the URIs we looked at have this pair across archives. So now we do text analysis, looking at textual similarity between t-1 and t+1. We compute normalised scores (values 0 to 100) for:

  • simhash
  • Jaccard – sets of character changes
  • Sorensen-Dice
  • Cosine – contextual changes

So we defined a Memento as a perfect Representative Memento if it gets a perfect score across all four measures. And we did some sanity checks too, via HTTP headers – ETag and Last-Modified being the same are a good measure. And that sanity check passed! 98.88% of the Mementos in that check were representative.
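A rough sketch of three of those measures, scaled to 0 to 100, might look like the Python below. This is my own simplified illustration, not the authors’ code: it works on whitespace-separated word tokens (the talk describes Jaccard in terms of character changes), it omits simhash, and the function names are assumptions.

```python
import math
from collections import Counter


def tokens(text):
    """Lower-case word tokens; a simplification of real text preprocessing."""
    return text.lower().split()


def jaccard(a, b):
    """Set overlap: |A ∩ B| / |A ∪ B|, scaled to 0-100."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    """Set overlap: 2|A ∩ B| / (|A| + |B|), scaled to 0-100."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    """Cosine of the angle between term-frequency vectors, scaled to 0-100."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    if norm == 0:
        return 100.0 if ca == cb else 0.0
    return 100.0 * dot / norm


def is_representative(pre_text, post_text):
    """True only if every measure gives a perfect score (rounded for float error)."""
    measures = (jaccard, sorensen_dice, cosine)
    return all(round(m(pre_text, post_text), 6) == 100.0 for m in measures)


print(jaccard("a b c", "a b d"), is_representative("a b c", "a b c"))
```

Requiring a perfect score on every measure is deliberately strict: any textual change between the pre and post captures disqualifies the pair.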

Out of the 650k pairs we found, about 313k URIs have representative Mementos. There wasn’t any big difference across the three collections.
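The pairing step described earlier (one capture before the publication date t, one after) can be sketched as follows. I am assuming here that the closest capture on each side is chosen and that capture times are already available as a list; the function name is hypothetical.

```python
from datetime import datetime


def memento_pair(capture_times, t):
    """Return (closest capture before t, closest capture after t), or None.

    None means the URI has no pre/post pair and cannot be assessed this way.
    """
    before = [c for c in capture_times if c < t]
    after = [c for c in capture_times if c > t]
    if not before or not after:
        return None
    return max(before), min(after)


# Illustrative capture dates, not real archive data.
captures = [datetime(2010, 1, 5), datetime(2011, 3, 1), datetime(2012, 7, 9)]
print(memento_pair(captures, datetime(2011, 6, 1)))
```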

Now, with these 313k links, over 200k had a live site. And that allowed us to analyse and compare the live and archived versions. We used those same four measures to check similarity. Those vary so we aggregate. And we find that 23.7% of URIs have not drifted. But that means that over 75% have drifted and may not be representative of author intent.

In our work 25% of the most recent papers we looked at (2012) have not drifted at all. That gets worse going back in time, as is intuitive. Again, the differences across the corpora aren’t huge. PMC isn’t quite the same – as there were fewer articles initially. But the trend is common… In Elsevier’s 1997 works only 5% of content has not drifted.

So, take aways:

  1. Scholarly articles increasingly contain URI references to web at large resources
  2. Such resources are subject to reference rot (link rot and content drift)
  3. Custodians of these resources are typically not overly concerned with archiving of their content and longevity of the scholarly record
  4. Spoiler: Robust links are one way to address this at the outset.

Q&A

Q1) Have you had any thoughts on site redesigns, where human readable content may not have changed but pages have?

A1) Yes. We used those four measures to address that… We strip out all of the HTML and formatting. Cosine ignores very minor “and” vs. “or” changes for instance.

Q1) What about Safari readability mode?

A1) No. We used something like Beautiful Soup to strip out code. Of course you could also do visual analysis to compare pages.
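For a sense of what stripping out code involves, here is a minimal sketch using only Python’s standard library html.parser rather than Beautiful Soup. It keeps visible text and drops tags plus the contents of script and style elements; the class and function names are my own, and the speakers’ actual preprocessing may well differ.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the page's text content with markup and code removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x=1;</script></body></html>"
)
print(visible_text(page))
```

Comparing this extracted text, rather than the raw HTML, is what lets a redesign with unchanged wording score as similar.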

Q2) You are systematically underestimating the problem… You are looking at publication date… It will have been submitted earlier – generally 6-12 months.

A2) Absolutely. For the sake of the experiment it’s the best we can do… Ideally you’d be as close as possible to the authoring process… When published, as you say, it may already

Q3) A comment and a question… 

Preprints versus publication… 

A3) No, we didn’t look explicitly at pre-prints. In arXiv those are

The URIs in articles in Elsevier seem to rot more than those in arXiv.org articles… We think that could be because Elsevier articles tend to reference more .coms whereas arXiv references more .org URIs but we need more work to explore that…

Nicholas Taylor: Understanding legal use cases for web archives

I am going to talk about use of web archives in litigation. Out of scope here are the areas of preservation of web citations; terms of service and API agreements for social media collection; copyright; right to be forgotten.

So, why web archives? Well, it’s where the content is. In some cases social media may only be available in web archives. Courts do now accept web archive evidence. The earliest use of IAWM (Internet Archive Wayback Machine) evidence was in 2004. Litigants routinely challenge this evidence but courts often accept IAWM evidence – commonly through affidavit or testimony, through judicial notice, sometimes through expert testimony.

The IA have affidavit guidance and they suggest asking the court to ensure they will accept that evidence, making that the issue for the courts not the IA. And interpretation is down to the parties in the case. There is also information on how the IAWM works.

Why should we care about this? Well legal professionals are our users too. Often we have unique historical data. And we can help courts and juries correctly interpret web archive evidence leading to more informed outcomes. Other opportunities may be to broaden the community of practice by bringing in legal technology professionals. And this is also part of mainstreaming web archives.

Why might we hesitate here? Well, typically cases serve private interests rather than public goods. There is an immature open source software culture for legal technology. And market solutions for web and social media archiving for this context do already exist.

Use cases for web archiving in litigation mainly have to do with information on individual webpages at a point in time; information on individual webpages over a period of time; persistence of navigational paths over a period of time. And types of cases include civil litigation and intellectual property cases (which are a separate court in the US). I haven’t seen any criminal cases using the archive but that doesn’t mean they don’t exist.

Where archives are used there is a focus on authentication and validity of the record. Telewizja Polska USA Inc v. Echostar Video Inc. (2004) saw arguing over the evidence, but the court accepted it. In Specht v. Google Inc (2010) the evidence was not admissible as it had not come through the affidavit rule.

Another important rule in the US context is judicial notice (FRE 201), which is a rule that allows a fact to be entered into evidence. And archives have been used in this context. For instance Martins v 3PD, Inc (2013). And Pond Guy, Inc. v. Aguascape Designs (2011). And in Tompkins v 23andme, Inc (2014) both parties used IAWM screenshots and the courts went out and found further screenshots that countered both of these to an extent.

Expert testimony (FRE 702) has included Khoday v Symantec Corp et al (2015), where the expert on navigational paths was queried but the court approved that testimony.

In terms of reliability factors, things that are raised as concerns include the IAWM disclaimer, incompleteness, provenance, temporal coherence. I have not seen any examples on discreteness, temporal coherence with HTTP headers, etc.

Nassar v Nassar (2017) was a defamation case where the IAWM disclaimer saw the court not accept evidence from the archive.

Stabile v. Paul Smith Ltd. (2015) saw incomplete archives used, with the court acknowledging this but accepting the relevance of what was entered.

In Marten Transport Ltd v Plattform Advertising Inc. (2016) the archive was also incomplete, with discussion of banners and ads, but the court understood that IAWM does account for some of this. Objections had included issues with crawlers, and concern that a human/witness wasn’t directly involved in capturing the pages. The literature includes different perceptions of incompleteness. We also have issues of live site “leakage” via AJAX – where new ads leaked into archive pages…

Temporal coherence can be complicated. Web archive captures can include mementos that are embedded and archived at different points in time so that the composite does not totally make sense.

The Memento Time Travel service shows you temporal coherence. See also Scott Ainsworth’s work. That kind of visualisation can help courts to understand temporal coherence. Other datetime estimation strategies include “Carbon Dating” (and constituent services); comparing X-Archive-Orig-last-modified with Memento datetime, etc.
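The header comparison mentioned above can be sketched in a few lines of Python. This assumes the replayed response exposes the origin server’s Last-Modified value under an X-Archive-Orig-Last-Modified header and the capture time under Memento-Datetime, both as HTTP dates; the header values below are invented for illustration and the function name is hypothetical.

```python
from email.utils import parsedate_to_datetime


def capture_lag(headers):
    """Time between the page's reported last modification and its capture."""
    last_modified = parsedate_to_datetime(headers["X-Archive-Orig-Last-Modified"])
    memento_dt = parsedate_to_datetime(headers["Memento-Datetime"])
    return memento_dt - last_modified


# Made-up example headers from a hypothetical archived response.
example_headers = {
    "X-Archive-Orig-Last-Modified": "Thu, 15 Jun 2017 10:00:00 GMT",
    "Memento-Datetime": "Fri, 16 Jun 2017 12:30:00 GMT",
}
print(capture_lag(example_headers))
```

A small positive lag suggests the content existed in that form shortly before capture; a negative lag would be a signal that one of the two datetimes needs closer scrutiny.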

Interpreting datetimes is complicated, and of great importance in legal cases. These can be interpreted from a static datetime in the text of the archived page, the Memento datetime, the headers, etc.

Servicenow, Inc. v Hewlett-Packard Co. (2015) was a patent case, where things must have been published a year earlier to be “prior art”, and in this case the archive showed an earlier date than other documentation.

In terms of IAWM provenance… Cases have questioned this. Sources for IAWM include a range of different crawls, but what does that mean for reliable provenance? There are other archives out there too, but I haven’t seen evidence of these being used in court yet. Canonicality is also an interesting issue… Personalisation of content served to the archival agent is an unanswered question. What about client artifacts?

So, what’s next? If we want to better serve legal and research use cases, then we need to surface more provenance information; to improve interfaces to understand temporal coherence and make volatile aspects visible…

So, some questions for you,

  1. Why else might we care, or not, about legal use cases?
  2. What other reliability factors are relevant?
    1. What is the relative importance of different reliability factors?
    2. For what use cases are different reliability factors relevant?

Q&A

Q1) Should we save WhoIs data alongside web archives?

A1) I haven’t seen that use case but it does provide context and provenance information

Q2) Is the legal status of IA relevant – it’s not a publicly funded archive. What about security certificates or similar to show that this is from the archive and unchanged?

A2) To the first question, courts have typically been more accepting of web evidence from .gov websites. They treat that as reliable or official. Not sure if that means they are more inclined to use it… On the security side, there were some really interesting issues raised by Ilya and Jack. As courts become more concerned, they may increasingly look for those signs. But there may be more of those concerns…

Q3) I work with one of those commercial providers… A lot of lawyers want to be able to submit WARCs captured by Webrecorder or similar to courts.

A3) The legal system is very document centric… Much of their data coming in is PDF and that does raise those temporal issues.

Q3) Yes, but they do also want to render WARC, to bring that in to their tools…

Q4) Did you observe any provenance work outside the archive – developers, GitHub commits… Stuff beyond the WARC?

A4) I didn’t see examples of that… Maybe has to do with… These cases often go back a way… Sites created earlier…

Anastasia Aizman & Matt Phillips: Instruments for web archive comparison in Perma.cc

Matt: We are here to talk to you about some web archiving work we are doing. We are from the Harvard innovation lab. We have learnt so much from what you are doing, thank you so much. Perma.cc is creating tools to help you cite stuff on the web, to capture the WARC, to organise those things…

We got started on this work when examining documents looking at the Supreme Court corpus from 1996 to present. We saw that Zittrain et al, Harvard Law Review, found more than 70% of references had rotted. So we wanted to build tools to help that…

Anastasia: So, we have some questions…

  1. How do we know a website has changed?
  2. How do we know which are important changes?

So, what is a website made of… There are a lot of different resources: say, a Washington Post article will have perhaps 90 components. Some are visual, some are hidden… So, again, how can we tell if the site has changed, and if the change is significant… And how do you convey that to the user?

In 1997, Andrei Broder wrote about syntactic clustering of the web. In that work he looked at every site on the world wide web. Things have changed a great deal since then… Websites are more dynamic now, we need more ways to compare pages…

Matt: So we have three types of comparison…

  • image comparison – we flatten the page down… If we compare two shots of Hacker News a few minutes apart there is a lot of similarity, but difference too… So we create a third image showing/highlighting the differences and can see where those changes are…
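To make the "third image" idea concrete, here is a minimal sketch in plain Python. It is not Perma.cc's code: the function name `diff_image`, the red highlight colour and the representation of a screenshot as a grid of RGB tuples are all assumptions made for illustration. A real tool would read the screenshots with an imaging library.

```python
def diff_image(before, after, highlight=(255, 0, 0)):
    """Compare two equal-sized screenshots given as rows of RGB tuples.

    Returns (overlay, changed_fraction): the overlay copies unchanged
    pixels and paints every changed pixel in the highlight colour.
    """
    same_shape = len(before) == len(after) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(before, after)
    )
    if not same_shape:
        raise ValueError("screenshots must have the same dimensions")

    overlay, changed, total = [], 0, 0
    for row_a, row_b in zip(before, after):
        out_row = []
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            if px_a == px_b:
                out_row.append(px_b)       # unchanged pixel, keep as-is
            else:
                changed += 1
                out_row.append(highlight)  # changed pixel, mark it
        overlay.append(out_row)
    return overlay, (changed / total if total else 0.0)
```

The changed fraction gives a crude score, but it treats a rotating banner ad exactly like a rewritten headline.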

Why do image comparison? It’s kind of a dumb way to understand difference… Well it’s a mental model the human brain can take in. The HCI is pretty simple here – users regularly experience that sort of layering – and we are talking general web users here. And it’s easy to have images on hand.

So, sometimes it works well… Here’s an example… A silly one… A post that is the same but we have a cup of coffee with and without coffee in the mug, and small text differences. Comparisons like this work well…

But it works less well where we see banner ads on webpages and they change all the time… But what does that mean for the content? How do we fix that? We need more fidelity, we need more depth.

Anastasia: So we need another way to compare… Looking at a Washington Post page from 2016 and 2017… Here we can see what has been deleted, and we can see what has been added…. And the tagline of the paper itself has changed in this case.

The pros of this highlighting approach are that it’s in use in lots of places, it’s intuitive… BUT it has to ignore invisible-to-the-user tags. And it is kind of stupid… With two totally different headlines, both saying “Supreme Court”, it sees similarity where there is none.
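A rough sketch of that kind of added/deleted highlighting, using Python's standard `difflib`. This is an illustration rather than the Perma.cc implementation: the helper names `visible_words` and `word_changes` are made up here, and the regular-expression tag stripper is deliberately crude (a real tool would use an HTML parser).

```python
import difflib
import re

TAG = re.compile(r"<[^>]+>")


def visible_words(html):
    """Crudely drop anything between < and >, then split into words."""
    return TAG.sub(" ", html).split()


def word_changes(old_html, new_html):
    """Return (deleted_words, added_words) between two HTML snippets."""
    old, new = visible_words(old_html), visible_words(new_html)
    deleted, added = [], []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            deleted.extend(old[i1:i2])   # words only in the old capture
        if op in ("insert", "replace"):
            added.extend(new[j1:j2])     # words only in the new capture
    return deleted, added
```

Because tags are stripped first, a change that only touches markup (a new class attribute, say) produces no highlighted words. And two unrelated headlines that both contain "Supreme Court" will still share those words as "unchanged", which is the weakness described above.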

So what about other similarity measures… ? Maybe a score would be nice, rather than an overlay highlighting change. So, for that we are looking at:

  • Jaccard Coefficient (MinHash) – this is essentially like applying a Venn diagram to two archives.
  • Hamming distance (SimHash) – This turns strings into sequences of 1s and 0s and figures out where the differences are… The difference/ratio
  • Sequence Matcher (Baseline/Truth) – this looks for sequences of words… It is good but hard to use as it is slow.
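The three measures above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the Perma.cc code: the Jaccard coefficient is computed exactly on word shingles (MinHash is a way of approximating it cheaply at scale), the SimHash here is a basic unweighted variant built on SHA-256, and all function names are invented for the example.

```python
import difflib
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b, k=3):
    """Size of the overlap divided by the size of the union (the Venn diagram)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def simhash(text, bits=64):
    """Basic SimHash: each word votes on every bit of the fingerprint."""
    counts = [0] * bits
    for word in text.split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(x, y):
    """Number of bit positions where two fingerprints differ."""
    return bin(x ^ y).count("1")


def sequence_ratio(a, b):
    """Baseline: difflib's similarity ratio over word sequences (slow on big pages)."""
    return difflib.SequenceMatcher(a=a.split(), b=b.split(), autojunk=False).ratio()
```

Jaccard and the sequence ratio run from 0 (nothing shared) to 1 (identical), while the Hamming distance runs the other way: 0 means identical fingerprints, and larger numbers mean more difference.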

So, we took Washington Post archives (2000+) and resources (12,000) and looked at SimHash – big gaps. MinHash was much closer…

When we can calculate those changes… does it matter? If it’s ads, do you care? Some people will. Human eyes are needed…

Matt: So, how do we convey this information to the user… Right now in Perma we have a banner, we have highlighting, or you can choose image view. And you can see changes highlighted in the “File Changes” panel on the top left hand side of the screen. You can click to view a breakdown of where those changes are and what they mean… You can get to an HTML diff (via Javascript).

So, those are our three measures sitting in our Perma container..

Anastasia: So future work – coming soon – will look at weighted importance. We’d love your ideas of what is important – is HTML more important than text? We want a Command Line (CLI) tool as well. And then we want to look at a similarity measure for images – there is other research on this out there, we need to look at that. We want a “Paranoia” heuristic – to see EVERY change, but with a tickbox to show only the important changes. And we need to work together!

Finally we’d like to thank you, and our colleagues at Harvard who support this work.

Q&A

Q1) Nerdy questions… How tightly bound are these similarity measures to the Perma.cc tool?

A1 – Anastasia) Not at all – should be able to use on command line

A1 – Matt) Perma is a Python Django stack and it’s super open source so you should be able to use this.

Comment) This looks super awesome and I want to use it!

Matt) These are really our first steps into this… So we welcome questions, comments, discussion. Come connect with us.

Anastasia) There is so much more work we have coming up that I’m excited about… Cutting up website to see importance of components… Also any work on resources here…

Q2) Do you primarily serve legal scholars? What about litigation stuff Nicholas talked about?

A2) We are in the law school but Perma is open to all. The litigation stuff is interesting..

A2 – Anastasia) It is a multi purpose school and others are using it. We are based in the law school but we are spreading to other places!

Q3) Thank you… There were HTML comparison tools that exist… But they go away and then we have nothing. A CLI will be really useful… And a service comparing any two URLs would be useful… Maybe worth looking at work on Memento damage – missing elements, and impact on the page – CSS, colour, alignment, images missing, etc. and relative importance. How do you highlight invisible changes?

A3 – Anastasia) This is really the complexity of this… And of the UI… Showing the users the changes… Many of our users are not from a technical background… Educating by showing changes is one way. The list with the measures is just very simple… But if a hyperlink has changed, that potentially is more important… So, do we organise the list to indicate importance? Or do we calculate that another way? We welcome ideas about that.

Q3) We have a service running in Memento showing scores on various levels that shows some of that, which may be useful.

Q4) So, a researcher has a copy of what they were looking at… Can other people look at their copy? So, researchers can use this tool as proof that it is what they cited… Can links be shared?

A4 – Matt) Absolutely. We have a way to do that from the Blue Book. Some folks make these private but that’s super super rare…

Understanding user needs (Chair Nicola Bingham)

Peter Webster, Chris Fryer & Jennifer Lynch: Understanding the users of the Parliamentary Web Archive: a user research project

Chris: We are here to talk about some really exciting user needs work we’ve been doing. The Parliamentary Archives holds several million historical records relating to Parliament, dating from 1497. My role is to ensure that archive continues, in the form of digital records as well. One aspect of that is the Parliamentary Web Archive. This captures around 30 URLs – the official Parliamentary websphere content from 2009. But we also capture official social media feeds – Twitter, Facebook and Instagram. This work is essential as it captures our relationship with the public. But we don’t have a great idea of our users’ needs and we wanted to find out more and understand what they use and what they need.

Peter: The objectives of the study were:

  • assess levels and patterns of use – what areas of the sites they are using, etc.
  • gauge levels of user understanding of the archive
  • understand the value of each kind of content in the web archive – to understand curation effort in the future.
  • test UI for fit with user needs – and how satisfied they were.
  • identify most favoured future developments – what directions should the archive head in next.

The research method was an analysis of usage data, then a survey questionnaire – and we threw lots of effort at engaging people in that. There were then 16 individual user observations, where we sat with the users, asked them to carry out tests and narrate their work. And then we had group workshops with parliamentary staff and public engagement staff, as well as four workshops with the external user community tailored to particular interests.

So we had a rich set of data from this. We identified important areas of the site. We also concluded that the archive and the relationship to the Parliament website, and that website itself, needed rethinking from the ground up.

So, what did we find of interest to this community?

Well, we found users are hard to find and engage – despite engaging the social media community – and staff similarly, not least as the internal workshop was just after the EURef. We found that they are largely ignorant about what web archives are – we asked about the UK Web Archive, the Government Archive, and the Parliamentary Archive… It appeared that survey respondents understood what these are BUT in the workshops most were thinking about the online version of Hansard – a kind of archive but not what was intended. We also found that users are not always sure what they’re doing – particularly when engaging, in a live browser, with snapshots of the site from previous dates, and that several snapshots might exist from different points in time. There were also some issues with understanding the Wayback Machine surround for the archived content – difficulty understanding what was content, what was the frame. There was a particular challenge around using URL search. People tried everything they could to avoid that… We asked them to find archived pages for the homepage of parliament.uk… And had many searches for “homepage” – there was a real lack of understanding of the browser and the search functionality. There is also no correlation between how well users did with the task and how well they felt they did. I take from that that a lack of feedback, requests, issues, does not mean there is not an issue.

Second group of findings… We struggled to find academic participants for this work. But our users prioritised in their own way. It became clear that users wanted discovery mechanisms that match their mental map – and actually the archive mapped more to an internal view of how parliament worked… And browsing taxonomies and structures didn’t work for them. That led to a card sorting exercise to rethink this. We also found users liked structures and wanted discovery based on entities: people, acts, publications – so search connected with that structure works well. Also users were very interested to engage in their own curation, tagging and folksonomy, make their own collections, share materials. Teachers particularly saw potential here.

So, what don’t users want? They have a variety of real needs but they were less interested in derived data sets like link browse; I demonstrated data visualisation, including things like ngrams, work on WARCs; API access; take home data… No interest from them!

So, three general lessons coming out of this… If you are engaging in this sort of research, spend as much resource as possible. We need to cultivate users that we do know, they are hard to find but great when you find them. Remember the diversity of groups of users you deal with…

Chris: So the picture Peter is painting is complex, and can feel quite disheartening. But his work has uncovered issues in some of our assumptions, and really highlights needs of users in the public. We now have a much better understanding so can start to address these concerns.

What we’ve done internally is raise the profile of the Parliamentary Web Archive amongst colleagues. We got delayed with procurement… But we have a new provider (MirrorWeb) and they have really helped here too. So we are now in a good place to deliver a user-centred resource at: webarchive.parliament.uk.

We would love to keep the discussion going… Just not about #goatgate! (contact them on @C_Fryer and @pj_webster)

Q&A

Q1) Do you think there will be tangible benefits for the service and/or the users, and how will you evidence that?

A1 – Chris) Yes. We are redeveloping the web archive. And as part of that we are looking at how we can connect the archive to the catalogue and that is all part of a new online services project. We have tangible results to work on… It’s early days but we want to translate it to tangible benefits.

Q2) I imagine the parliament is a very conservative organisation that doesn’t delete content very often. Do you have a sense of what people come to the archive for?

A2 – Chris) Right now it is mainly people who are very aware of the archive, what it is and why it exists. But the research highlighted that many of the people less familiar with the archive wanted the archived versions of content on the live site, and the older content was more of interest.

A2 – Peter) One thing we did was to find out what the difference was between what was on the live website and what was on the archive… And looking ahead… The archive started in 2009… But demand seems to be quite consistent in terms of type of materials.

A2 – Chris) But it will take us time to develop and make use of this.

Q3) Can you say more about the interface and design… So interesting that they avoid the URL search.

A3 – Peter) The outsourced provider was Internet Memory Research… When you were in the archive there was an A-Z browser, a keyword search and a URL search. Above that, the parliament.uk site had a taxonomy that linked out, and that didn’t work. I asked them to use that browse and it was clear that their thought process directed them to the wrong places… So the recommendation was that it needs to be elsewhere, and more visible.

Q4) You were talking about users wanting to curate their own collections… Have you been considering setting up user dashboards to create and curate collections?

A4 – Chris) We are hoping to do that with our website and service, but it may take a while. But it’s a high priority for us.

Q5) I was interested to understand, the users that you selected for the survey… Were they connected before and part of the existing user base, or did you find them through your own efforts?

A5 – Peter) A bit of both… We knew more about those who took the survey and they were the ones we had in the observations. But this was a self selecting group, and they did have a particular interest in the parliament.

Emily Maemura, Nicholas Worby, Christoph Becker & Ian Milligan: Origin stories: documentation for web archives provenance

Emily: We are going to talk about origin stories and it comes out of interest in web archives, provenance, trust. This has been a really collaborative project, and working with Ian Milligan from Toronto. So, we have been looking at two questions really: How are web archives made? How can we document or communicate this?

We wanted to look at choices and decisions in creating collections. We have been studying the creation of University of Toronto Libraries (UTL) Archive-It collections:

  • Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving.
  • Toronto 2015 Pan Am games (crawled regularly for one month; a one-off event)
  • Global Summitry Archive

So, thinking about web archives and how they are made we looked at the Web Archiving Life Cycle Model (Bragg et al 2013), which suggests a linear process… But the reality is messier… and iterative as test crawls are reviewed, feed into production crawls… But are also patched as part of QA work.

From this work then we have four things you should document for provenance:

  1. Scoping is iterative and regularly reviewed, and the data budget is a key part of this.
  2. The Process of crawls is important to document as the influence of live web content and actors can be unpredictable
  3. There may be different considerations for access, choices for mode of access can impact discovery, and may be particularly well suited to particular users or use cases.
  4. The fourth thing is context, and the organisational or environmental factors that influence web archiving program – that context is important to understand those decision spaces and choices.

Nick: So, in order to understand these collections we had to look at the organisational history of web archiving. For us web archiving began in 2005, and we piloted what became Archive-It in 2006. It was in a liminal state for about 8 years… There were few statements around collection development until last year really. But the new policy talks about scoping, policy, permissions, etc.

So that transition towards service is reflected in staffing. It is still a part time commitment but is written into several people’s job descriptions now, it is higher profile. But there are resourcing challenges around crawling platforms – the earliest archives had to be automatic; data budgets; storage limits. There are policies, permissions, robots.txt policy, access restrictions. And there is the legal context… Copyright laws changed a lot in 2012… Started with permissions, then opt outs, but now it’s take down based…

Looking in turn at these collections:

Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving. Covers main parties and ever changing group of loosely defined interest groups. This was hard to understand as there were four changes of staff in the time period.

Toronto 2015 Pan Am games (crawled regularly for one month; a one-off event) – based around a discrete event.

Global Summitry Archive – this is a collaborative archive, developed by researchers. It is a hybrid and is an ongoing collection capturing specific events.

In terms of scoping we looked at motivation whether mandate, an identified need or use, collaboration or coordination amongst institutions. These projects are based around technological budgets and limitations… In cases we only really understand what’s taking place when we see crawling taking place. Researchers did think ahead but, for instance, video is excluded… But there is no description of why text was prioritised over video or other storage. You can see evidence of a lack of explicit justifications for crawling particular sites… We have some information and detail, but it’s really useful to annotate content.

In the most recent elections the candidate sites had altered robots.txt… They weren’t trying to block us but the technology used and their measures against DDoS attacks had that effect.
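One way to spot this sort of accidental block before a crawl is to test the site's robots.txt against the crawler's user agent. A minimal sketch using Python's standard `urllib.robotparser` follows; the helper name `is_blocked`, the user agent string and the example.org URL are placeholders, not anything used by UTL or Archive-It.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, user_agent, url):
    """Return True if the given robots.txt text disallows this agent for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return not parser.can_fetch(user_agent, url)


# A blanket rule, like one shipped by default with anti-DDoS tooling,
# shuts out every crawler, including an archive's.
blanket = "User-agent: *\nDisallow: /\n"
print(is_blocked(blanket, "example-archive-bot", "https://example.org/candidate"))
```

Recording the result of a check like this alongside the crawl is also a cheap piece of provenance: it documents why a seed is missing from a given capture.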

In terms of access we needed metadata and indexes, but the metadata and how they are populated shapes how that happens. We need interfaces but also data formats and restrictions.

Emily: We tried to break out these interdependencies and interactions around what gets captured… Whether a site is captured is down to a mixture of organisational policies and permissions; legal context and copyright law for fair dealing, etc. The wider context elements also change over time… Including changes in staff, as well as changes in policy, in government, etc. This can all impact usage and clarity of how what is there came to be.

So, conclusions and future work… In telling the origin stories we rely on many different aspects and it is very complex. We are working towards an extended paper. We believe a little documentation goes a long way… We have a proposal for structured documentation: goo.gl/CQwMt2

Q&A

Q1) We did this exercise in the Netherlands… We needed to go further in the history of our library… Because in the ’90s we already collected interesting websites for clients – the first time we thought about the web as an important stance.. But there was a gap there between the main library work and the web archiving work…

Q2) I always struggle with what can be conveyed that is not in the archive… Sites not crawled, technical challenges, sites that it is decided not to crawl early on… That very initial thinking needs to be conveyed to pre-seed things… Hard to capture that…

A2 – Emily) There is so much in scoping that is before the seed list that gets into the crawl… Nick mentioned there are proposals for new collections that explains the thinking…

A2 – Nick) That’s about the best way to do it… Can capture pre-seeds and test crawls… But need that “what should be in the collection”

A2 – Emily) The CPPP is actually based on a prior web list of suggested sites… Which should also have been archived.

Q3) In any kind of archive the same issues are hugely there… Decisions are rarely described… Though a whole area of post modern archive description around that… But a lot comes down to the creator of the collection. But I haven’t seen much work on what should be in the archive that is expected to be there… A different context I guess..

A3 – Emily) I’ve been reading a lot of post modern archive theory… It is challenging to document all of that, especially in a way that is useful for researchers… But have to be careful not to transfer over all those issues from the archive into the web archive…

Q4) You made the point that the liberal party candidate had blocked access to the Internet Archive crawler… That resonated for me as that’s happened a few times for our own collection… We have legal deposit legislation and that raises questions of whose responsibility it is to take that forward..

A4 – Nick) I found it fell to me… Once we got the right person on the phone it was an easy case to make – and it wasn’t one site but all the candidates for that party!

Q5) Have you had any positive or negative responses to opt-outs and take downs?

A5 – Nick) We don’t host our own Wayback Machine so use their policy. We honour take downs but get very very few. Our communications team might have felt differently but we had someone quite bullish in charge.

Nicola) As an institution there is a very variable appetite for risk – hard to communicate internally, let alone externally to our users.

Q6) In your research have you seen any web archive documenting themselves well? People we should follow? Or based mainly on your archives?

A6) It’s mainly based on our own archives… We haven’t done a comprehensive search of other archives’ documentation.

Jackie Dooley, Alexis Antracoli, Karen Stoll Farrell & Deborah Kempe: Developing web archiving metadata best practices to meet user needs

Alexis: We are going to present on the OCLC Research Library Partnership web archive working group. So, what was the problem? Well, web archives are not very easily discoverable in the ways people are usually used to discovering archives or library resources. This was the most widely shared issue across two OCLC surveys and so a working group was formed.

At Princeton we use Archive-It, but you had to know we did that… It wasn’t in the catalogue, it wasn’t on the website… So you wouldn’t find it… Then we wanted to bring it into our discovery system but that meant two different interfaces… So… If we take an example of one of our finding aids… We have the College Republican Records (2004-2016) and they are an on-campus group with websites… This was catalogued with DACS. But how to use the title and dates appropriately? Is the date the content, the seed, what?! And extent – documents, space, or… we went for the number of websites as that felt like something users would understand. We wrote Archive-It into the description… But we wanted guidelines…

So, the objective of this group is to find best practices for web archiving metadata. We have undertaken a literature review, and looked at best practices for descriptive metadata across single and multiple sites.

Karen: For our literature review we looked at peer reviewed literature but also some other sources, and synthesised that. So, who are the end users of web archives… I was really pleased the UK Parliament work focused on public users, as the research tends to focus on academia. Where we can get some clarity on users is on their needs: to read specific web pages/site; data and text mining; technology development or systems analysis.

In terms of behaviours Costa and Silva (2010) classify three groups, much cited by others: Navigational, Informational or Transactional.

Take aways… A couple of things that we found – some beyond metadata… Raw data can be a high barrier so they want accessible interfaces, unified searches, but the user does want to engage directly with the metadata to understand the background and provenance of the data. We need to be thinking about flexible formats, engagement. And to enable access we need re-use and rights statements. And we need to be very direct in indicating live versus archived material.

Users also want provenance: when and why was this created? They want context. They want to know the collection criteria and scope.

For metadata practitioners there are distinct approaches… archival and bibliographic approaches – RDA, MARC, Dublin Core, MODS, finding aids, DACS; Data elements vary widely, and change quite quickly.

Jackie: We analysed metadata standards and institutional guidelines; we evaluated existing metadata records in the wild… Our preparatory work raised a lot of questions about building a metadata description… Is the website creator/owner the publisher? author? subject? What is the title? Who is the host institution – and will it stay the same? Is it important to clearly state that the resource is a website (not a “web resource”)?

And what does the provenance actually refer to? We saw a lot of variety!

In terms of setting up the context we have use cases for library, archives, research… Some comparisons between bibliographic and archival approaches to description; description of archived and live sites – mostly libraries catalogue live not archived sites; and then you have different levels… Collection level, site level… And there might be document-level descriptions.

So, we wanted to establish data dictionary characteristics. We wanted something simple, not a major new cataloguing standard. So this is a lean 14 element standard, which is grounded in those cataloguing rules, so can be part of wider systems. The categories we have include common elements that are used for identification and discovery of types of resources; other elements have to have clear applicability in the discovery of all types of resources. But some things aren’t included as they are not super specific to web archives – e.g. audience.

So the 14 data elements are:

  • Access/rights*
  • Collector
  • Contributor*
  • Creator*
  • Date*
  • Description*…

Elements with asterisks are direct maps to Dublin Core fields.

So, Access Conditions (to be renamed as “Rights”) is a direct mapping to Dublin Core “Rights”. This provides the circumstances that affect the availability and/or reuse of an archived website or collection. E.g. for Twitter. And it’s not just about rights because so often we don’t actually know the rights, but we know what can be done with the data.

Collector was the strangest element… There is no equivalent in Dublin Core… This is about the organisation responsible for curation and stewardship of an archived website or collection. The only other place that uses Collector is the Internet Archive. We did consider “repository” but, it may do all those things but… for archived websites… the site lives elsewhere but e.g. Princeton decides to collect those things.

We have a special case for Collector where Archive-It creates its own collection…

So, we have three publications, due out in July on this work..

Q&A

Q1) I was a bit disappointed in the draft report – it wasn’t what I was expecting… We talked about complexities of provenance and wanted something better to convey that to researchers, and we have such detailed technical information we can draw from Archive-It.

A1 – Jackie) Our remit was about description, only. Provenance is bigger than that. Descriptive metadata was appropriate as scope. We did a third report on harvesting tools and whether metadata could be pulled from them… We should have had “descriptive” in our working group name too perhaps…

A1) It is maybe my fault too… But it’s that mapping of DACS that is not perfect… We are taking a different track at University of Albany

A1 – Jackie) This is NOT a standard, it addresses an absence of metadata that often exists for websites. Scalability of metadata creation is a real challenge… The average time available is 0.25 FTE looking at this. The provenance, the nuance of what was and was not crawled is not doable at scale. This is intentionally lean. If you will be using DACS then a lot of data goes straight in. All standards, with the exception of Dublin Core, are more detailed…

Q2) How difficult is this to put in practice for MARC records. For us we treat a website as a collector… You tend to describe the online publication… A lot of what we’d want to put in just can’t make it in…

A2 – Jackie) In MARC the 852 field is the closest to Collector that you can get. (Collector is comparable to Dublin Core’s Contributor; EAD’s <repository>; MARC’s 524, 852 a and 852 b; MODS’ location; or schema.org’s schema:OwnershipInfo.)

Researcher case studies (Chair: Alex Thurman)

Jane Winters: Moving into the mainstream: web archives in the press

This paper accompanies my article for the first issue of Internet Histories. I’ll be talking about the increasing visibility of web archives and much greater public knowledge of web archives.

So, who are the audiences for web archives? Well they include researchers in the arts, humanities and social sciences – my area and where some tough barriers are. They are also policymakers, particularly crucial in relation to legal deposit and access. Also “general public” – though it is really many publics. And journalists as a mediator with the public.

What has changed with media? Well there was an initial focus on technology which reached an audience predisposed to that. But increasingly web archives come into discussion of politics and current affairs but there are also social and cultural concerns starting to emerge. There is real interest around launches and anniversaries – a great way for web archives to get attention, like the Easter Rising archive we heard about this week. We do also get that “digital dark age” klaxon which web archives can and do address. And with Brexit and Trump there is a silver lining… And a real interest in archives as a result.

So in 2013 Niels Brügger arranged the first RESAW meeting in Aarhus. And at that time we had one of these big media moments…

Computer Weekly, 12th November 2013, reported on Conservatives erasing official records of speeches from the Internet Archive as a serious breach. Coverage in computing media migrated swiftly to coverage in the mainstream press, the Guardian’s election coverage; BBC News… The hook was that a number of those speeches were about the importance of the internet to open public debate… That hook, that narrative was obviously lovely for the media. Interestingly the Conservatives then responded that many of those speeches were actually still available in the BL’s UK Web Archives. The speeches also made Channel 4 News – and they used it as a hook to talk about broken promises.

Another lovely example was Dr Anat Ben-David from the Open University who got involved with BBC Click on restoring the lost .yu domain. This didn’t come from us trying to get something in the news… They knew our work and we could then point them in the direction of really interesting research… We can all do this highlighting and signposting which is why events like this are so useful for getting to know each others’ work.

When you make the tabloids you know you’ve done well… In 2016 coverage of the BBC Food website was faced with closure as part of cuts. The Independent didn’t lead with this, but with how to find recipes when the website goes… They directed everyone to the Internet Archive – as it’s open (unlike the British Library). Although the UK Web Archive blog did post about this, explained what they are collecting, and why they collect important cultural materials. The BBC actually back peddled… Maintaining the pages, but not updating it. But that message got out that web archiving is for everyone… Building it into people’s daily lives.

The launch of the UK Web Archive in 2013 went live – BBC covered this (and fact that it is not online). The 20th anniversary of the BnF archive had a lot of French press coverage. That’s a great hook as well.  Then I mentioned that Digital Dark Age set of stories… Bloomberg had the subtitle “if you want to preserve something, print it” in 2016. We saw similar from the Royal Society. But generally journalists do know who to speak to from BL, or DPC, or IA to counter that view… Can be a really positive story. Even that negative story can be used as a positive thing if you have that connection with journalists…

So this story: “Raiders of the Lost Web: If a Pulitzer-finalist 34 part series can disappear from the web, anything can” looks like it will be that sort of story again… But actually this is about the forensic reconstruction of the work. And the article also talks about cinema at risk, again also preserved thanks to the Internet Archive. This piece of journalism that had been “lost” was about the death of 23 children in a bus crash… It was lost twice as it wasn’t reported, then the story disappeared… But the longer article here talks about that case and the importance of web archiving as a whole.

Talking of traumatic incidents… Brexit coverage of the NHS £350m per week saving on the Vote Leave website… But it disappeared after the vote. BUT you can use the Internet Archive, and the structured referendum collection from the UK Legal Deposit libraries, so the promises are retained into the long term…

And finally, on to Trump! In an Independent article on Melania Trump’s website disappearing, the journalist treats the Internet Archive as another source, a way to track change over time…

And indeed all of the coverage of IA in the last year, and their mirror site in Canada, that isn’t niche news, that’s mainstream coverage now. The more we have stories on data disappearing, or removed, the more opportunities web archives have to make their work clear to the world.

Q&A

Q1) A fantastic talk and close to my heart as I try to communicate web archives. I think that web archives have fame when they get into fiction… The BBC series New Tricks had a denouement centred on finding a record on the Internet Archive… Are there other fictional representations of web archives?

A1) A really interesting suggestion! Tweet us both if you’ve seen that…

Q2) That coverage is great…

A2) Yes, being held to account is a risk… But that is a particular product of our time… Hopefully when it is clear that it is evidence for any set of politicians… The users may be partisan, even if the content is… It’s a hard line to tread… Non publicly available archives mitigate that… But absolutely a concern.

Q3) It is a big win when there are big press mentions… What happens… Is it more people aware of the tools, or specifically journalists using them?

A3) It’s both but I think it’s how news travels… More people will read an article in the Guardian than will look at the BL website. But they really demonstrate the value and importance of the archive. You want – like the BBC recipe website 100k petition – that public support. We ran a workshop here on a random Saturday recently… It was pitched as tracing family or local history… And a couple were delighted to find their church community website from 15 years ago… It was that easy to know about the value of the archive that way… We did a gaming event with late 1980s games in the IA… That’s brilliant, a kid’s birthday party was going to be inspired by that – that’s a fab use we hadn’t thought of… But journalism is often the easy win…

Q4) Political press and journalistic use is often central… But I love that GifCities project… The nostalgia of the web… The historicity… That use… The way they highlight the datedness of old web design is great… The way we can associate archives with web vernacular that is not evidenced elsewhere is valuable and awesome… Leveraging that should be kept in mind.

A4) The GifCities always gets a “Wow” – it’s a great way to engage people in a teaching setting… Then lead them onto harder real history stuff..!

Q5) Last year when we celebrated the anniversary I had a chance to speak with journalists. They were intrigued that we collect blogs, forums, stuff that is off the radar… And they titled the article “Maybe your Sky Blog is being archived in France” (Sky Blogs is a popular teen blog platform)… But what about not forgetting the stupid things you wrote on the internet when you were 15…

A5) We’ve had three sessions so far, only once did that question arise… But maybe people aren’t thinking like that. More of an issue of the public archive… Less of a worry for closed archive… But so much of the embarrassing stuff is in Facebook so not in the archive. But it matters especially in the right to be forgotten legislation… But there is also that thing of having something worth archiving…

Q6) The thing of The Crossing is interesting… Their font was copyright… They had to get specific permission from the designer… But that site is in Flash… And soon you’ll need Ilya Kreymer’s old web tools to see it at all.

A6) Absolutely. That’s a really fascinating article and they had to work to revive and play that content…

Q6) And six years old! Only six years!

Cynthia Joyce: Keyword ‘Katrina’: a deep dive through Hurricane Katrina’s unsearchable archive

I’ll be talking about how I use – rather than engaging in the technology directly. I was a journalist for 20 years before teaching journalism, which I do at University of Mississippi. Every year we take a study group to New Orleans to look at the outcome of Katrina. Katrina was 12 years ago. But there is a lot of gentrification and so there are few physical scars there… It was weird to have to explain how hard things were to my 18 year old students. And I wanted to bring that to life… But not just the news coverage which is shown as anniversary, do an update piece… The story is not a discrete event, an era…

I found the best way to capture that era was through blogging. New Orleans was not a tech savvy space, it was a poor, black, high levels of illiteracy sort of space. Web 1.0 had skipped New Orleans and the Deep South in a lot of ways… It was pre-Twitter, Facebook was in its infancy, mobiles were primitive. Katrina was probably when many in New Orleans started texting – doable on struggling networks. There was also that Digital Divide – out of trend to talk about this but this is a real gap.

So, 80% of the city flooded, more than 800 people died, 70% of residents were displaced. The storm didn’t cause the problems here, it was the flooding and the failure of the levees. That is an important distinction, as that sparked the rage, the activism, the need for action was about the sense of being lied to and left behind.

I was working as a journalist for Salon.com from 1995 – very much web 1.0. I was an editor at Nola.com post Katrina. And I was a resident of New Orleans 2001-2007. We had questions of what to do with comments, follow up, retention of content… A lot of content didn’t need preserving… But actually that set of comments should be the shame of Advance Digital and Condé Nast… It was interesting how little help they provided to Nola.com, one of their client papers…

I was conducting research as a citizen, but with journalistic principles and approaches… My method was madness basically… I had instincts, stories to follow, high points, themes that had been missed in mainstream media. I interviewed a lot of people… I followed and used a cross-list of blog rolls… This was a lot of surfing, not just searching…

The Wayback Machine helped me so much there, to see that blogroll, seeing those pages… That idea of the vernacular, drill down 10 years later was very helpful and interesting… To experience it again… To go through, to see common experiences… I also did social media posts and call outs – an affirmative action approach. African American people were on camera, but not a lot of first party documentation… I posted something on Binders Full of Women Writers… I searched more than 300 blogs. I chose the entries… I did it for them… I picked out moving, provocative, profound content… Then let them opt out, or suggest something else… It was an ongoing dialogue with 70 people crowd curating a collective diary. New Orleans Press produced a physical book, and I sent it to Jefferson and IA created a special collection for this.

In terms of choosing themes… The original TOC was based on categories that organically emerged… It’s not all sad, it’s often dark humour…

  • Forever days
  • An accounting
  • Led Astray (pets)
  • Re-entry
  • Kindness of Strangers
  • Indecision
  • Elsewhere = not New Orleans
  • Saute Pans of Mercy (food)
  • Guyville

Guyville for instance… for months no schools were open, so it was a really male space, then so much construction… But some women there thought that was great too. A really specific culture and space.

Some challenges… Some work was journalists writing off the record. We got permissions where we could – we have them for all of the people who survived.

I just wanted to talk about Josh Cousin, a former resident of St Bernard projects. His nickname was the “Bookman” – he was an unusual nerdy kid and was 18 when Katrina hit. They stayed… But were forced to leave eventually… It was very sad… They were forced onto a bus, not told where they were going, they took their dog… Someone on the bus complained. Cheddar was turfed onto the highway… They got taken to Houston. The first post Josh posted was a defiant “I made it” type post… He first had online access when he was at the Astrodome. They had online machines that no-one was using… But he was… And he started getting mail, shoes, stuff in the post… He was training people to use these machines. This kid is a hero… At the sort of book launch for contributors he brought Cheddar the dog… Through pet finder… He had been adopted by a couple in Connecticut who had renamed him “George Michael” – they tried to make him pay $3000 as they didn’t want their dog going back to New Orleans…

In terms of other documentary evidence… Material is all as PDF only… The email record of Michael D. Brown… shows he’s concerned about dog sitting… And later criticised people for not evacuating because of their pets… Two weeks later his emails do talk about pets… There were obviously other things going on… But this narrative, this diary of that time… really brings this reality to life.

I was in a newsroom during Arab Spring… And that’s when they had no option but to run what’s on Twitter, it was hard to verify but it was there and no journalists could get in. And I think Katrina was that kind of moment for blogging…

On Archive-it you can find the Katrina collection… Ranging from resistance and suspicion to gratitude… Some people barely remembered writing stuff, certainly didn’t expect it to be archived. I was collecting 8-9 years later… I was reassured to read that a historian at the Holocaust museum (in Chronicle of Higher Ed) who wasn’t convinced about blogging, until Trump said something stupid and that had triggered her to engage.

Q&A

Q1 – David) In 2002 the LOCKSS program had a meeting with subject specialists at NY Public Library… And among those that were deemed worth preserving was The Exquisite Corpse. That was published out of New Orleans. After Katrina we were able to give Andrei Codrescu back his materials and that carried on publishing until 2015… A good news story of archiving from that time.

A1) There are dozens of examples… The things that I found too is that there is no appointed steward… If no institutional support it can be passed round, forgotten… I’d get excited then realise just one person was the advocate, rather than an institution to preserve it for posterity.

Andrei wrote some amazing things, and captured that mood in the early days of the storm…

Q2) I love how your work shows blending of work and sources and web archives in conversation with each other… I have a mundane question… Did you go through any human subjects approval for this work from your institution.

A2) I was an independent journalist at the time… But went to University of New Orleans as the publisher had done a really interesting project with community work… I went to ask them if this project already existed… And basically I ended up creating it… He said “are you pitching it?” and that’s where it came from. Naivety benefited me.

Q3) Did anyone opt out of this project, given the traumatic nature of this time and work?

A3) Yes, a lot of people… But I went to people who were kind of thought leaders here, who were likely to see the benefit of this… So, for instance Karen Gadbois had a blog called Squandered Heritage (now The Lens, the Pro Publica of New Orleans)… And participation of people like that helped build confidence and validity to the project.

Colin Post: The unending lives of net-based artworks: web archives, browser emulations, and new conceptual frameworks

Framing an artwork is never easy… Art objects are “lumps” of the physical world to be described… But what about net-based artworks? How do we make these objects of art history… And they raise questions of what we define as an artwork in the first place… I will talk about Homework by Alexei Shulgin (http://www.easylife.org/homework/) as an example of where we need techniques and practices of web archiving around net-based artworks. I want to suggest a new conceptualisation of net-based artworks as plural, proliferating, heterogeneous archives. Homework is typical, and includes pop ups and self-conscious elements that make it challenging to preserve…

So, this came from a real assignment for Natalie Bookchin’s course in 1997. Alexei Shulgin encouraged artists to turn in homework for grading, and did so himself… And his piece was a single sentence followed by pop up messages – something we use differently today, has different significance… Pop ups proliferate across the screen like spam, making the user aware of the browser and its affordances and role… Homework replicates structures of authority and expertise, grading, organising, critiques, including or excluding artists… But rendered absurd…

Homework was intended to be ephemeral… But Shulgin curates assignments turned in, and late assignments. It may be tempting to think of these net works as performance art, with records only of a particular moment in time. But actually this is a full record of the artwork… Homework has entered into archives as well as Shulgin’s own space. It is heterogeneous… All acting on the work. The nature of pop up messages may have changed but the conditions of its original creation and it is still changing the world today.

Shulgin, in conversation with Armin Medosch in 1997, felt “The net at present has few possibilities for self expression but there is unlimited possibility for communication. But how can you record this communicative element, how can you store it?”. There are so many ways and artists but how to capture them… One answer is web archiving… There are at least 157 versions of Homework in the Internet Archive.. This is not comprehensive, but his own site is well archived… But capacity of connections is determined by incidence rather than choice… The crawler only caught some of these. But these are not discrete objects… The works on Shulgin’s site, the captures others have made, the websites that are still available, is one big object. This structure reflects the work itself, archival systems sustain and invigorate through the same infrastructure…

To return to the communicative elements… Archives do not capture the performative aspects of the piece. But we must also attend to the way the object has transformed over time… In order to engage with complex net-based artworks… They cannot be easily separated into “original” and “archived” but are more of a continuum…

Frank Upward (1996) described the Records Continuum Model… This is around four dimensions: Creation, Capture, Organisation, and Pluralisation. All of these are present in the archive of Homework… As copies appear in the Internet Archive, in Rhizome… And spread out… You could describe this as the vitalisation of the artwork on the web…

oldweb.today at Rhizome is a way to emulate the browser… This provides some assurance of the retention of old websites… But that is not the direct representation of the original work… The context and experience can vary – including the (now) speedy load of pages… And possible changes in appearance… When I load Homework here… I see 28 captures all combined, from records over 10 years… The piece wasn’t uniformly archived at any one time… I view the whole piece but actually it is emulated and artificial… It is disintegrated and inauthentic… But in the continuum it is another continuous layer in space and time.

Niels Brügger in “website history” (2010) talks about “Writing the complex strategic situation in which an artefact is entangled”. Digital archives and emulators preserve Homework, but are in themselves generative… But that isn’t exclusive to web archiving… It is something we see in Eugène Viollet-le-Duc (1854/1996), who talks about reestablishing a work in a finished state that may never in fact have existed at any point in time.

Q1) a really interesting and important work, particularly around plurality. I research at Rhizome and we have worked with Net Art Anthology – an online exhibition with emulators… is this faithful… should we present a plural version of the work?

A1) I have been thinking about this a lot… but i don’t think Rhizome should have to do all of this… art historians should do this contextual work too… Net Art Anthology does the convenience access work but art historians need to do the context work too.

Q1) I agree completely. For an art historian what provenance metadata should we provide for works like this to make it most useful… Give me a while and I’ll have a wish list… 

Comment) a shout out for Gent in Belgium is doing work on online art so I’ll connect you up.

Q2) Is Homework still an active interactive work?

A2) The final list was really in 1997 – only on IA now… It did end at this time… so experiencing the piece is about looking back… that is artefactual, or a trace. But Shulgin has past work on his page… sort of a capture and framing as archive.

Q3) How does Homework fit in your research?

A3) I’m interested in 90s art, preservation, and that interactions

Q4) Have you seen that job of contextualisation done well, presented with the work? I’m thinking of Eli Harrison’s quantified self work and how different that looked at the time from now… 

A4) Rhizome does this well, galleries collecting net artists… especially with emulated works.. The guggenheim showed originals and emulated and part of that work was foregrounding the preservation and archiving aspects of the work. 

Closing remarks: Emmanuelle Bermès & Jane Winters

Emmanuelle: Thank you all for being here. This was three very intense days. Five days for those at Archives Unleashed. To close, a few comments on IIPC. We were originally to meet in Lisbon, and I must apologise again to Portuguese colleagues, we hope to meet again there… But colocating with RESAW was brilliant – I saw a tweet that we are creating archives in the room next door to those who use and research them. And researchers are our co-creators.

And so many of our questions this week have been about truth and reliability and trust. This is a sign of growth and maturity of the groups. 

IIPC has had a tough year. We are still a young and fragile group… we have to transition to a strong world wide community. We need all the voices and inputs to grow and to transform into something more resilient. We will have an annual meeting at an event in Ottawa later this year.

Finally thank you so much to Jane and colleagues from RESAW, and to Nicholas and WARC committee, and Olga and BL to get this all together so well.

Jane: you were saying how good it has been to bring archivists and researchers together, to see how we can help and not just ask… A few things struck me: discussion of context and provenance; and at the other end permanence and longevity. 

We will have a special issue of Internet Histories so do email us 

Thank you to Niels Brügger and NetLab, The Coffin Trust who funded our reception last night, RESAW Programme Committee, and the really important people – the events team at University of London, and to Robert Kelly who did our wonderful promotional materials. And Olga who has made this all possible.

And we do intend to have another Resaw conference in June in 2 years.

And thank you to Nicholas and Niels for representing IIPC, and to all of you for sharing your fantastic work.

And with that a very interesting week of web archiving comes to an end. Thank you all for welcoming me along!

Jun 152017
 

I am again at the IIPC WAC / RESAW Conference 2017 and today I am in the very busy technical strand at the British Library. See my Day One post for more on the event and on the HiberActive project, which is why I’m attending this very interesting event.

These notes are live so, as usual, comments, additions, corrections, etc. are very much welcomed.

Tools for web archives analysis & record extraction (chair Nicholas Taylor)

Digging documents out of the archived web – Andrew Jackson

This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…

  • staff looked in an outlook calendar for reminders
  • looked for new updates since last check
  • download each to local folder and open
  • check catalogue to avoid re-submitting
  • upload to internal submission portal
  • add essential metadata
  • submit for ingest
  • clean up local files
  • update stats sheet
  • Then ingest usually automated (but can require intervention)
  • Updates catalogue once complete
  • New catalogue records processed or enhanced as necessary.

It was very manual, and very inefficient… So we have created a harvester:

  • Setup: specify “watched targets” then…
  • Harvest (harvester crawl targets as usual) –> Ingested… but also…
  • Document extraction:
    • spot documents in the crawl
    • find landing page
    • extract machine-readable metadata
    • submit to W3ACT (curation tool) for review
  • Acquisition:
    • check document harvester for new publications
    • edit essential metadata
    • submit to catalogue
  • Cataloguing
    • cataloguing records processed as necessary
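As a rough illustration of the "spot documents in the crawl" and "find landing page" steps above, here is a minimal sketch. It assumes, purely for illustration, that each crawl log entry records the URL, its content type, and the page the crawler followed the link from, and that this referring page is a reasonable guess for the landing page. The `CrawlEntry` type and `spot_documents` function are hypothetical names, not the British Library's actual code.

```python
from dataclasses import dataclass


@dataclass
class CrawlEntry:
    url: str
    content_type: str
    via: str  # assumed field: the page the crawler followed the link from


def spot_documents(entries, watched_prefixes):
    """Return PDF documents that were linked from pages under a watched target."""
    found = []
    for entry in entries:
        # Ignore parameters such as "; charset=..." when checking the type.
        mime = entry.content_type.split(";")[0].strip().lower()
        watched = any(entry.via.startswith(p) for p in watched_prefixes)
        if mime == "application/pdf" and watched:
            found.append({"document_url": entry.url, "landing_page_url": entry.via})
    return found


entries = [
    CrawlEntry("https://example.org/pubs/report.pdf", "application/pdf",
               "https://example.org/pubs/report"),
    CrawlEntry("https://example.org/pubs/report", "text/html; charset=utf-8",
               "https://example.org/pubs/"),
    CrawlEntry("https://other.example/file.pdf", "application/pdf",
               "https://other.example/"),
]
documents = spot_documents(entries, ["https://example.org/pubs/"])
```

Each item in `documents` could then be queued for human review, which is the role the notes give to W3ACT.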

This is better but there are challenges. Firstly, what is a “publication?”. With the eprints team there was a one-to-one print and digital relationship. But now, no more one-to-one. For example, gov.uk publications… An original report will have an ISBN… But that landing page is a representation of the publication, that’s where the assets are… When stuff is catalogued, what can frustrate technical folk… You take date and text from the page – honouring what is there rather than normalising it… We can dishonour intent by capturing the pages… It is challenging…

MARC is initially alarming… For a developer used to current data formats, it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we do use MARC, and where we want to be now…

One of the intentions of the metadata extraction work was to provide an initial guess of the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that the authors’ names etc. in the document metadata are rarely correct. We use the worst extractor first, and layer up so we have the best shot. What works best is extracting from the HTML. Gov.uk is a big and consistent publishing space so it’s worth us working on extracting that.
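The layering idea can be sketched in a few lines. This is a minimal sketch, assuming each extractor returns a plain dictionary and that layers are applied from least to most trusted, with empty values never overwriting an earlier guess. The function name and the sample values are hypothetical.

```python
def layer_metadata(*layers: dict) -> dict:
    """Merge metadata guesses; later (more trusted) layers win when non-empty."""
    merged: dict = {}
    for layer in layers:
        for field, value in layer.items():
            # Skip blanks so a better source with a gap cannot erase a guess.
            if value not in (None, "", []):
                merged[field] = value
    return merged


# Made-up extractor outputs, ordered from least to most trusted.
pdf_embedded = {"title": "Microsoft Word - report_final_v3.doc",
                "author": "jsmith", "date": None}
landing_page_html = {"title": "Annual Report 2016", "author": "",
                     "date": "2017-03-01"}

record = layer_metadata(pdf_embedded, landing_page_html)
```

Here the HTML title replaces the embedded PDF title, while the PDF author survives because the HTML layer had nothing better to offer.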

What works even better is the gov.uk API data – it’s in JSON, it’s easy to parse, it’s worth coding as it is a bigger publisher for us.
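To show why JSON is so much easier to work with than scraped HTML, here is a small sketch that maps an API-style payload onto a few candidate catalogue fields. The payload is invented for illustration, and the field names used (`title`, `first_published_at`, `links.organisations`) are assumptions about the shape of the response rather than a statement of what the gov.uk API guarantees.

```python
import json

# Invented sample payload; the structure is an assumption for illustration.
SAMPLE_PAYLOAD = """
{
  "title": "Example annual report 2016",
  "first_published_at": "2017-03-01T09:30:00.000+00:00",
  "links": {
    "organisations": [{"title": "Example Department"}]
  }
}
"""


def guess_catalogue_fields(payload: str) -> dict:
    """Pull an initial guess at catalogue fields out of a JSON payload."""
    data = json.loads(payload)
    organisations = data.get("links", {}).get("organisations", [])
    published = data.get("first_published_at") or ""
    return {
        "title": data.get("title"),
        # Keep only the date part (YYYY-MM-DD) of the timestamp, if present.
        "published": published[:10] or None,
        "publishers": [o["title"] for o in organisations if "title" in o],
    }


fields = guess_catalogue_fields(SAMPLE_PAYLOAD)
```

The result is only a starting point for a cataloguer to review, in line with the "initial guess" framing in the talk.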

But now we have to resolve references… Multiple use cases for “records about this record”:

  • publisher metadata
  • third party data sources (e.g. Wikipedia)
  • Our own annotations and catalogues
  • Revisit records

We can’t ignore the revisit records… Have to do a great big join at some point… To get best possible quality data for every single thing….

And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But… When I retry document extraction I can accidentally run up another chain each time… If we do our Solr searches correctly it should be easy so will be correcting this…

We do need to do more future experimentation.. Multiple workflows brings synchronisation problems. We need to ensure documents are accessible when discoverable. Need to be able to re-run automated extraction.

We want to iteratively improve automated metadata extraction:

  • improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
  • Bring together different sources
  • Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals)

And we still have that tension between what a publication is… A tension between established practice and publisher output. Need to trial different approaches with catalogues and users… Close that whole loop.

Q&A

Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…

A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.

Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert Van de Sompel and Michael Nelson about this thing of describing a document…

A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently….

Q2) Geoffrey Bilder also working on this…

A2) And that’s the ideal… To improve the standards more broadly…

Q3) Are these all PDF files?

A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…

Q4) What does the user see at the end of this… Is it a PDF?

A4) This work ends up in our search service, and that metadata helps them find what they are looking for…

Q4) Do they know its from the website, or don’t they care?

A4) Officially, the way the library thinks about monographs and serials, would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too..

Q4) For me as an archivist all that data on where the document is from, what issues there were in accessing it, etc. would be extremely useful…

Q5) You spoke yesterday about engaging with machine learning… Can you say more?

A5) This is where I’d like to do more user work. The library is keen on subject headings – that’s a big high level challenge so that’s quite amenable to machine learning. We have a massive golden data set… There’s at least a master’s thesis in there, right! And if we built something, then running it over the 3 million ish items with little metadata could be incredibly useful. In my opinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…

Comment) That thing of everything ending up as a PDF is on the way out by the way… You should look at Distill.pub – a new journal from Google and Y Combinator – and that’s the future of these sorts of formats, it’s JavaScript and GitHub. Can you collect it? Yes, you can. You can visit the page, switch off the network, and it still works… And it’s there and will update…

A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.

Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform

Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.

So, we have fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But, whilst these are amazing collections, they aren’t accessed or used much. I think this is mainly down to two issues: people don’t know they are there; and the access mechanisms don’t fit well with their practices. Maybe when the Archive-It API is live that will fix it all… Right now though it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting, most use the Archive-It service. But, if you are a researcher… to use web archives you really have to be interested and engaged, you need to be an expert.

So, building this portal is about making this easier to use… We want web archives to be used on page 150 in some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, Simon Fraser University – that represents about half of the archive in Canada.

We work on workflow… We run workshops… We separated the collections so that post docs can look at this

We are using Warcbase (warcbase.org) and command line tools: we transfer data from the Internet Archive, generate checksums, and generate scholarly derivatives – plain text, hypertext graph, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at visualisation of the Canadian political parties and political interest group web crawls which track changes, although that may include crawler issues.
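The checksum step is the easiest part of that workflow to illustrate. The sketch below is not the WALK project's tooling, just a generic way to hash a large file in chunks so that a multi-gigabyte WARC never has to fit in memory. The choice of SHA-256 and the function name are assumptions.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary stream chunk by chunk and return the hex digest."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for open("collection.warc.gz", "rb"); the bytes are made up.
fake_warc = io.BytesIO(b"WARC/1.0\r\n")
checksum = sha256_of_stream(fake_warc)
```

Comparing the digest computed after transfer with the one published by the source is what confirms the copy arrived intact.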

Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.

Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tight-knit, and I like how we build and share and develop on each others work. Last year we presented webarchives.ca. We’ve indexed 10 TB of warcs since then, representing 200+ M Solr docs. We have grown from one collection and we have needed additional facets: institution; collection name; collection ID, etc.
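For anyone unfamiliar with Solr facets, here is a minimal sketch of how a faceted search request might be put together. The `facet`, `facet.field` and `fq` parameters are standard Solr query parameters, but the host, core name and the field names (`institution`, `collection_name`, `collection_id`) are assumptions based on the facets mentioned above, not the real webarchives.ca schema.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical Solr endpoint; replace with a real host and core.
SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"


def build_facet_query(text, facet_fields, filters=None, rows=10):
    """Build a Solr select URL that searches `text` and facets on the given fields."""
    params = [("q", text), ("rows", str(rows)), ("facet", "true")]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Filter queries narrow the results without affecting relevance scoring.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = build_facet_query(
    "pipeline",
    ["institution", "collection_name", "collection_id"],
    filters={"institution": "University of Toronto"},
)
query = parse_qs(urlsplit(url).query)
```

A front end such as Shine or Blacklight issues requests of roughly this shape and renders the returned facet counts as clickable filters.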

Then we have also dealt with scaling issues… 30-40GB to 1TB sized index. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on OpenStack… Right now it isn’t at production scale, but getting there. We are looking at SolrCloud and potentially using a Shard2 per collection.

Last year we had a Solr index using the Shine front end… It’s great but… it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is Blacklight which is in wide use in libraries. There is a bigger community, better APIs, bug fixes, etc… So we have set up a prototype called WARCLight. It does almost all that Shine does, except the tree structure and the advanced searching…

Ian spoke about derivative datasets… For each collection, via Blacklight or ScholarsPortal we want domain/URL Counts; Full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.
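Two of those derivatives, domain counts and a link graph, are simple enough to sketch. The project itself uses Warcbase for this, so the plain Python below is only a stand-in to show what the outputs look like. The input shapes (a list of captured URLs and a list of source/target link pairs) are assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count captures per host name."""
    return Counter(urlparse(u).netloc.lower() for u in urls)


def domain_graph(links):
    """Count links between distinct hosts, given (source_url, target_url) pairs."""
    edges = Counter()
    for source, target in links:
        src = urlparse(source).netloc.lower()
        dst = urlparse(target).netloc.lower()
        # Drop malformed URLs and links that stay within one host.
        if src and dst and src != dst:
            edges[(src, dst)] += 1
    return edges


captures = [
    "http://partya.example/news/1",
    "http://partya.example/news/2",
    "http://partyb.example/",
]
links = [
    ("http://partya.example/news/1", "http://partyb.example/"),
    ("http://partya.example/news/2", "http://partyb.example/about"),
    ("http://partya.example/news/1", "http://partya.example/news/2"),
]
counts = domain_counts(captures)
graph = domain_graph(links)
```

The edge counts can be exported as CSV and loaded into a tool such as Gephi, which matches the familiar-formats point made later in the Q&A.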

So, that goal Ian talked about: one central hub for archived data and derivatives…

Q&A

Q1) Do you plan to make graphs interactive, by using Kibana rather than Gephi?

A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That was great but didn’t look great in the browser. But it would be great if the casual user could look at drag and drop R type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…

A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kibana for stuff so in due course we may bring that in…

Q2) Interesting as we are doing similar things at the BnF. We did use Shine, looked at Blacklight, but built our own thing…. But we are looking at what we can do… We are interested in that web archive discovery collections approaches, useful in other contexts too…

A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet..

Q2) We tried to give people WARC and WARC files… Our actual users didn’t want that, they want full text…

A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…

Q3) Do you think in few years time

A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next 3 years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder…

Q4) What are some of the organisational, admin and social challenges of building this?

A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging…. “is an institution going to devote a person to this?”

A4 – Ian) This is about making this more accessible… People are more used to Blacklight than Shine. People respond poorly to WARC. But they can deal with PDFs with CSV, those are familiar formats…

A4 – Nick) And when I get back I’m going to be doing some work and sharing to enable an actual community to work on this.

Gregory Wiedeman: Automating access to web archives with APIs and ArchivesSpace

A little bit of context here… At the University at Albany, SUNY, we are a public university with state records laws that require us to archive. This is consistent with traditional collecting. But we have no dedicated web archives staff – so no capacity for lots of manual work.

One thing I wanted to note is that web archives are records. Some have a paper equivalent, or did for many years (e.g. the Undergraduate Bulletin). We also have things like Word documents. And then we have things like University sports websites, some of which we do need to keep…

The seed isn’t a good place to manage these as records. But archives theory and practices adapt well to web archives – they are designed to scale, they document and maintain context, with relationship to other content, and a strong emphasis on being a history of records.

So, we are using DACS: Describing Archives: A Content Standard to describe archives, so why not use that for web archives? It focuses on intellectual content, is agnostic about formats, and is designed for pragmatic access to archives. We also use ArchivesSpace – a modern tool for aggregated records that allows curators to add metadata about a collection. And it is interleaved with our physical archives.

So, for any record in our collection… You can specify a subject… a Python script goes to look at our CDX, looks at numbers, schedules processes, and then as we crawl a collection the extents and data collected are recorded… And that then shows in our catalogue… So we have our paper records, our digital captures… Users can then find an item, and only then do they need to think about format and context. And there is an awesome article by David Graves(?) which talks about how aggregation encourages new discovery…

Users need to understand where web archives come from. They need provenance to frame their research question – it adds weight to their research. So we need to capture what was attempted to be collected – collecting policies included. We have just started to do this with a statement on our website. We need a more standardised content source. This sort of information should be easy to use and comprehend, but it is hard to find the right format to do that.
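
As a rough illustration of the "look at our CDX, look at numbers" step, here is a minimal sketch. It assumes a common space-separated CDX layout (urlkey, 14-digit timestamp, original URL, MIME type, status, digest, length). The sample lines and the function name `summarise_cdx` are hypothetical and are not taken from the Albany scripts.

```python
from datetime import datetime

# Hypothetical sample lines: urlkey, timestamp, original URL, MIME type,
# status code, digest, length (a common space-separated CDX layout).
SAMPLE_CDX = """\
edu,albany)/ 20170101120000 http://www.albany.edu/ text/html 200 AAAA 1234
edu,albany)/ 20170301120000 http://www.albany.edu/ text/html 200 BBBB 1300
edu,albany)/news 20170215080000 http://www.albany.edu/news text/html 200 CCCC 900
"""


def summarise_cdx(cdx_text):
    """Count captures and distinct URLs, and find the first and last capture."""
    timestamps = []
    urls = set()
    for line in cdx_text.splitlines():
        fields = line.split()
        # Skip blank lines and an optional header line beginning with "CDX".
        if len(fields) < 3 or fields[0] == "CDX":
            continue
        timestamps.append(datetime.strptime(fields[1], "%Y%m%d%H%M%S"))
        urls.add(fields[2])
    return {
        "captures": len(timestamps),
        "distinct_urls": len(urls),
        "first": min(timestamps) if timestamps else None,
        "last": max(timestamps) if timestamps else None,
    }


summary = summarise_cdx(SAMPLE_CDX)
print(summary["captures"], "captures of", summary["distinct_urls"], "URLs")
```

Figures like these (capture count, date range) are the sort of "extent" information that could then be written into an archival description.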

We also need to capture what was collected. We are using the Archive-It Partner Data API, part of the Archive-It 5.0 system. That API captures:

  • type of crawl
  • unique ID
  • crawl result
  • crawl start, end time
  • recurrence
  • exact date, time, etc…

This looks like a big JSON file. Knowing what has been captured – and not captured – is really important to understand context. What can we do with this data? Well we can see what’s in our public access system, we can add metadata, we can present some start times, non-finish issues etc. on product pages. BUT… it doesn’t address issues at scale.
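
To make the "start times, non-finish issues" idea concrete, here is a minimal sketch that works over crawl records of the kind listed above. The record structure and field names (`id`, `type`, `result`, `start`, `end`) are assumptions for illustration only; the actual API response may be shaped quite differently.

```python
from datetime import datetime

# Hypothetical crawl records; real field names in the API may differ.
SAMPLE_JOBS = [
    {"id": 101, "type": "ONE_TIME", "result": "FINISHED",
     "start": "2017-05-01T00:00:00", "end": "2017-05-01T06:00:00"},
    {"id": 102, "type": "WEEKLY", "result": "TIME_LIMIT",
     "start": "2017-05-08T00:00:00", "end": "2017-05-09T00:00:00"},
]


def describe_crawl(job):
    """Reduce one crawl record to the few facts a researcher might need."""
    start = datetime.fromisoformat(job["start"])
    end = datetime.fromisoformat(job["end"])
    return {
        "id": job["id"],
        "hours": (end - start).total_seconds() / 3600,
        "complete": job["result"] == "FINISHED",
    }


def incomplete_crawls(jobs):
    """Return the IDs of crawls that did not finish normally."""
    return [d["id"] for d in map(describe_crawl, jobs) if not d["complete"]]


print(incomplete_crawls(SAMPLE_JOBS))
```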

So, we are now working on a new open digital repository using the Hydra system – though not called that anymore! Possibly we will expose data in the API. We need a standardised data structure that is independent of tools. And we also have a researcher education challenge – the archival description needs to be easy to use, re-share and understand.

Find our work – sample scripts, command line query tools – on Github:

http://github.com/UAlbanyArchives/describingWebArchives

Q&A

Q1) Right now people describe collection intent, crawl targets… How could you standardise that?

A1) I don’t know… Need an intellectual definition of what a crawl is… And what the depth of a crawl is… They can produce very different results and WARC files… We need to articulate this in a way that is clear for others to understand…

Q1) Anything equivalent in the paper world?

A1) It is DACS but in the paper world we don’t get that granular… This is really specific data we weren’t really able to get before…

Q2) My impression is that ArchivesSpace isn’t built with discovery of archives in mind… What would help with that…

A2) I would actually put less emphasis on web archives… Long term you shouldn’t have all these things captured. We just need a good API access point really… I would rather it be modular I guess…

Q3) Really interesting… the definition of Archive-It, what’s in the crawl… And interesting to think about conveying what is in the crawl to researchers…

A3) From what I understand the Archive-It people are still working on this… With documentation to come. But we need granular way to do that… Researchers don’t care too much about the structure…. They don’t need all those counts but you need to convey some key issues, what the intellectual content is…

Comment) Looking ahead to the WASAPI presentation… Some steps towards vocabulary there might help you with this…

Comment) I also added that sort of issue for today’s panels – high level information on crawl or collection scope. Researchers want to know when crawlers don’t collect things, when to stop – usually to do with freak outs about what isn’t retained… But that idea of understanding absence really matters to researchers… It is really necessary to get some… There is a crapton of data in the partners API – most isn’t super interesting to researchers so some community effort to find 6 or 12 data points that can explain that crawl process/gaps etc…

A4) That issue of understanding users is really important, but also hard as it is difficult to understand who our users are…

Harvesting tools & strategies (Chair: Ian Milligan)

Jefferson Bailey: Who, what, when, where, why, WARC: new tools at the Internet Archive

Firstly, apologies for any repetition between yesterday and today… I will be talking about all sorts of updates…

So, WayBack Search… You can now search WayBackMachine… Including keyword, host/domain search, etc. The index is built on inbound anchor text links to a homepage. It is pretty cool and it’s one way to access this content which is not URL based. We also wanted to look at domain and host routes into this… So, if you look at the page for, say, parliament.uk you can now see statistics and visualisations. And there is an API so you can make your own visualisations – for hosts or for domains.

We have done stat counts for specific domains or crawl jobs… The API is all in JSON so you can just parse this for, for example, how much of what is archived for a domain is in the form of PDFs.
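
A minimal sketch of that kind of parsing, assuming a JSON response that reports capture counts per MIME type. The response shape, the `captures_by_mime` key and the numbers are invented for illustration and do not describe the actual API.

```python
import json

# Hypothetical response body; the real API's structure may differ.
SAMPLE_STATS = json.loads("""
{"domain": "example.org",
 "captures_by_mime": {"text/html": 800, "application/pdf": 150, "image/png": 50}}
""")


def mime_share(stats, mime_type):
    """Fraction of a domain's captures that have the given MIME type."""
    counts = stats["captures_by_mime"]
    total = sum(counts.values())
    return counts.get(mime_type, 0) / total if total else 0.0


print("PDF share:", mime_share(SAMPLE_STATS, "application/pdf"))
```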

We also now have search by format using the same idea, the anchor text, the file and URL path, and you can search for media assets. We don’t have exciting front end displays yet… But I can search for e.g. Puppy, mime type: video, 2014… And get lots of awesome puppy videos [the demo is the Puppy Bowl 2014!]. This media search is available for some of the WayBackMachine for some media types… And you can again present this in the format and display you’d like.

For search and profiling we have a new 14 column CDX including new language, simhash, sha256 fields. Language will help users find material in their local/native languages. The SIMHASH is pretty exciting… that allows you to see how much a page has changed. We have been using it on Archive-It partners… And it is pretty good. For instance seeing a government blog change month to month shows the (dis)similarity.
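
For readers unfamiliar with simhash, here is a minimal sketch of the general technique: each token votes on every bit of a fingerprint, and two fingerprints are compared by Hamming distance. This is a generic textbook version using whitespace tokens and SHA-256, not the Internet Archive's implementation, whose tokenisation and hash choices were not described.

```python
import hashlib

BITS = 64  # fingerprint width used in this sketch


def simhash(text):
    """Compute a 64-bit simhash fingerprint from whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits; smaller means more similar pages."""
    return bin(a ^ b).count("1")


march = simhash("Welcome to the department blog with news for March")
april = simhash("Welcome to the department blog with news for April")
print(hamming_distance(march, april), "of", BITS, "bits differ")
```

Comparing the fingerprint of a new capture against the previous one is also how a simhash could support the deduplication idea raised in the Q&A.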

For those that haven’t seen the capture tool – Brozzler is in production in Archive-It with three dozen organisations using it. This has also led to warcprox developments. It was intended for AV and social media stuff. We have a Chromium cluster… It won’t do domain harvesting, but it’s good for social media.

In terms of crawl quality assurance we are working with the Internet Memory Foundation to create quality tools. These are building on internal crawl priorities work at IA crawler beans, comparison testing. And this is about quality at scale. And you can find reports on how we also did associated work on the WayBackMachine’s crawl quality. We are also looking at tools to monitor crawls for partners, trying to find large scale crawling quality as it happens… There aren’t great analytics… But there are domain-scale monitoring, domain scale patch crawling, and Slack integrations.

For domain scale work, for patch crawling we use WAT analysis for embeds and most linked. We rank by inbound links and add to crawl. ArchiveSpark is a framework for cluster-based data extraction and derivation (WA+).
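
The "rank by inbound links and add to crawl" step could be sketched roughly as follows. This assumes link pairs have already been extracted from WAT records; the function name, the data and the `top_n` cut-off are all hypothetical.

```python
from collections import Counter

# Hypothetical (source page, linked URL) pairs, e.g. derived from WAT records.
LINKS = [
    ("a.example/1", "x.example/"),
    ("a.example/2", "x.example/"),
    ("b.example/1", "x.example/"),
    ("a.example/1", "y.example/"),
    ("b.example/1", "y.example/"),
    ("a.example/1", "z.example/"),
    ("a.example/1", "w.example/"),
]


def patch_candidates(links, already_captured, top_n=2):
    """Rank uncaptured link targets by their number of inbound links."""
    inbound = Counter(
        target for _source, target in links if target not in already_captured
    )
    return [url for url, _count in inbound.most_common(top_n)]


print(patch_candidates(LINKS, already_captured={"w.example/"}))
```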

Although this is a technical presentation we are also doing an IMLS funded project to train public librarians in web archiving to preserve online local history and community memory, working with partners in various communities.

Other collaborations and research include our End of Term web archive 2016/17, when the administration changes… No one is the official custodian for the .gov domain. And this year the widespread deletion of data has given this work greater profile than usual. This time the work was with IA, LOC, UNT, GWU, and others. 250+ TB of .gov/.mil as well as White House and Obama social media content.

There had already been discussion of the Partner Data API. We are currently re-building this so come talk to me if you are interested in this. We are working with partners to make sure this is useful, makes sense, and is made more relevant.

We take a lot of WARC files from people to preserve… So we are looking to see how we can get partners to do this with and for it. We are developing a pipeline for automated WARC ingest for web services.

There will be more on WASAPI later, but this is part of work to ensure web archives are more accessible… And that uses API calls to connect up repositories.

We have also built a WAT API that allows you to query most of the metadata for a WARC file. You can feed it URLs, and get back what you want – except the page type.

We have new portals and searches now and coming. This is about putting new search layers on TLD content in the WayBackMachine… So you can pick media types, and just from one domain, and explore them all…

And with a statement on what archives should do – involving a gif of a centaur entering a rainbow room – that’s all… 

Q&A

Q1) What are implications of new capabilities for headless browsing for Chrome for Brozzler…

A1 – audience) It changes how fast you can do things, not really what you can do…

Q2) What about http post for WASAPI

A2) Yes, it will be in the Archive-It web application… We’ll change a flag and then you can go and do whatever… And there is reporting on the backend. Doesn’t usually affect crawl budgets, it should be pretty automated… There is a UI… Right now we do a lot manually, the idea is to do it less manually…

Q3) What do you do with pages that don’t specify encoding… ?

A3) It doesn’t go into URL tokenisation… We would wipe character encoding in anchor text – it gets cleaned up before Elasticsearch…

Q4) The SIMHASH is before or after the capture? And can it be used for deduplication

A4) After capture before CDX writing – it is part of that process. Yes, it could be used for deduplication. Although we do already do URL deduplication… But we could compare to previous SIMHASH to work out if another copy is needed… We really were thinking about visualising change…

Q5) I’m really excited about WATS… What scale will it work on…

A5) The crawl is on 100 TB – we mostly use existing WARC and JSON pipeline… It performs well on something large. But if a lot of URLs, it could be a lot to parse.

Q6) With quality analysis and improvement at scale, can you tell me more about this?

A6) We’ve given the IMF access to our own crawls… But we have been comparing our own crawls to our own crawls… Comparing to Archive-It is more interesting… And looking at domain level… We need to share some similar size crawls – BL and IA – and figure out how results look and differ. It won’t be content based at that stage, it will be hotpads and URLs and things.

Michele C. Weigle, Michael L. Nelson, Mat Kelly & John Berlin: Archive what I see now – personal web archiving with WARCs

Mat: I will be describing tools here for web users. We want to enable individuals to create personal web archives in a self-contained way, without external services. Standard web archiving tools are difficult for non IT experts. “Save page as” is not suitable for web archiving. Why do this? It’s for people who don’t want to touch the command line, but also to ensure content is preserved that wouldn’t otherwise be. More archives are more better.

It is also about creation and access, as both elements are important.

So, our goals involve advancing development of:

  • WARCreate – create WARC from what you see in your browser.
  • Web Archiving Integration Layer (WAIL)
  • Mink

WARCreate is… a Chrome browser extension to save WARC files from your browser; no credentials pass through 3rd parties. It heavily leverages the Chrome webRequest API. But it was built in 2012, and APIs and libraries have evolved, so we had to work on that. We also wanted three new modes for browser based preservation: record mode – retain buffer as you browse; countdown mode – preserve reloading page on an interval; event mode – preserve page when automatically reloaded.

So you simply click on the WARCreate button in the browser to generate WARC files for non technical people.

Web Archiving Integration Layer (WAIL) is a stand-alone desktop application, it offers collection-based web archiving, and includes Heritrix for crawling, OpenWayback for replay, and Python scripts compiled to OS-native binaries (.app, .exe). One of the recent advancements was a new user interface. We ported Python to Electron – using web technologies to create native apps. And that means you can use native languages to help you to preserve. We also moved from a single archive to collection-based archiving. We also ported OpenWayback to pywb. And we also started doing native Twitter integration – over time and hashtags…
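
For anyone who has not looked inside a WARC file, here is a minimal sketch of what writing one record involves: a version line, named header fields, a blank line, the payload, then two CRLFs. It writes a single `resource` record with the standard library only; WARCreate itself is a JavaScript extension and captures requests and responses, so this shows the file format rather than its implementation. The function name is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri, payload, content_type="text/html"):
    """Serialise one WARC/1.0 'resource' record as bytes."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join("%s: %s\r\n" % header for header in headers)
    head += "\r\n"  # blank line separates the headers from the content block
    # Every record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


record = build_warc_resource_record("http://example.org/", b"<html>hi</html>")
print(record.decode("utf-8"))
```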

So, the original app was a tool to enter a URI and then get a notification. The new version is a little more complicated but provides that new collection-based interface. Right now both of these are out there… Eventually we’d like to merge functionality here. So, an example here, looking at the UK election as a collection… You can enter information, then crawl to within defined boundaries… You can kill processes, or restart an old one… And this process integrates with Heritrix to give status of a task here… And if you want to Archive Twitter you can enter a hashtag and interval, you can also do some additional filtering with keywords, etc. And then once running you’ll get notifications.

Mink… is a Google Chrome browser extension. It indicates archival capture count as you browse. It quickly submits a URI to multiple archives from the UI. The name comes from Mink(owski) space. Our recent enhancements include changes to the interface to add the number of archived pages to the icon at the bottom of the page. It also allows users to set preferences on how to view a large set of mementos, and supports communication with user-specified or local archives…

The old Mink interface could be affected by page CSS as it was in the DOM. So we have moved to shadow DOM, making it more reliable and easy to use. And then you have more consistent, intuitive Miller columns for many captures. It’s an integration of live and archive web, whilst you are viewing the live web. And you can see year, month, day, etc. And it is refined to what you want to look at. And you have an icon in Mink to make a request to save the page now – and notification of status.
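
A capture count like the one Mink shows can be derived from a Memento TimeMap. Below is a minimal sketch that counts mementos in a link-format TimeMap. The archive host in the sample is made up, and Mink itself is a JavaScript extension, so this only illustrates the idea.

```python
import re

# Hypothetical link-format TimeMap, as a Memento-compliant archive or
# aggregator might return for http://example.org/.
SAMPLE_TIMEMAP = """\
<http://example.org/>; rel="original",
<http://archive.example/timemap/link/http://example.org/>; rel="self"; type="application/link-format",
<http://archive.example/20100101000000/http://example.org/>; rel="first memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",
<http://archive.example/20150101000000/http://example.org/>; rel="memento"; datetime="Thu, 01 Jan 2015 00:00:00 GMT",
<http://archive.example/20170101000000/http://example.org/>; rel="last memento"; datetime="Sun, 01 Jan 2017 00:00:00 GMT"
"""


def count_mementos(timemap_text):
    """Count entries whose rel attribute includes the 'memento' relation."""
    rels = re.findall(r'rel="([^"]*)"', timemap_text)
    return sum(1 for rel in rels if "memento" in rel.split())


print(count_mementos(SAMPLE_TIMEMAP), "captures")
```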

So, in terms of tool integration…. We want to ensure integration between Mink and WAIL so that Mink points to local archives. In the future we want to decouple Mink from external Memento aggregator – client-side customisable collection of archives instead.

See: http://bit.ly/iipcWAC2017 for tools and source code.

Q&A

Q1) Do you see any qualitative difference in capture between WARCreate and Webrecorder?

A1) We capture the representation right at the moment you saw it.. Not the full experience for others, but for you in a moment of time. And that’s our goal – what you last saw.

Q2) Who are your users, and do you have a sense of what they want?

A2) We have a lot of digital humanities scholars wanting to preserve Twitter and Facebook – the stream as it is now, exactly as they see it. So that’s a major use case for us.

Q3) You said it is watching as you browse… What happens if you don’t select a WARC

A3) If you have hit record you could build up content as pages reload and are in that record mode… It will impact performance but you’ll have a better capture…

Q3) Just a suggestion but I often have 100 tabs open but only want to capture something once a week so I might want to kick it off only when I want to save it…

Q4) That real time capture/playback – are there cool communities you can see using this…

A4) Yes, I think with CNN coverage of a breaking storm allows you to see how that story evolves and changes…

Q5) Have you considered a mobile version for social media/web pages on my phone?

A5) Not currently supported… Chrome doesn’t support that… There is an app out there that lets you submit to archives, but not to create WARC… But there is a movement to making those types of things…

Q6) Personal archiving is interesting… But jailed in my laptop… great for personal content… But then can I share my WARC files with the wider community .

A6) That’s a good idea… And more captures is better… So there should be a way to aggregate these together… I am currently working on that, but you would need to be able to specify what is shared and what is not.

Q6) One challenge there is about organisations and what they will be comfortable with sharing/not sharing.

Lozana Rossenova and Ilya Kreymer, Rhizome: Containerised browsers and archive augmentation

Lozana: As you probably know Webrecorder is a high fidelity interactive recording of any web site you browse – and how you engage. And we have recently released an app in Electron format.

Webrecorder is a worm’s eye view of archiving, tracking how users actually move around the web… For instance for Instagram and Twitter posts around #lovewins you can see the quality is high. Webrecorder uses symmetrical archiving – in the live browser and in a remote browser… And you can capture then replay…

In terms of how we organise webrecorder: we have collections and sessions.

The thing I want to talk about today is on Remote browsers, and my work with Rhizome on internet art. And a lot of these works actually require old browser plugins and tools… So Webrecorder enables capture and replay even where technology no longer available.

To clarify: the programme says “containerised” but we now refer to this as “remote browsers” – still using Docker containers to run these various older browsers.

When you go to record a site you select the browser, and the site, and it begins the recording… The Java Applet runs and shows you a visualisation of how it is being captured. You can do this with Flash as well… If we open a multimedia work in your normal (Chrome) browser, it isn’t working. Restoration is easier with just Flash; you need other things to capture Flash with other dependencies and interactions.

Remote browsers are really important for Rhizome work in general, as we use them to stage old artworks in new exhibitions.

Ilya: I will be showing some upcoming beta features, including ways to use Webrecorder to improve other archives…

Firstly, which other web archives? So I built a public web archives repository:

https://github.com/webrecorder/public-web-archives

And with this work we are using WAM – the Web Archiving Manifest. And added a WARC source URI and WARC creation date field to the WARC Header at the moment.

So, Jefferson already talked about patching – patching remote archives from the live web… is an approach where we patch either from live web or from other archives, depending on what is available or missing. So, for instance, if I look at a Washington Post page in the archive from 2nd March… It shows how other archives are being patched in to deliver me a page… In the collection I have a thing called “patch” that captures this.

Once pages are patched, then we introduce extraction… We are extracting again using remote archiving and automatic patching. So you combine extraction and patching features. You create two patches and two WARC files. I’ll demo that as well… So, here’s a page from the CCA website and we can patch that… And then extract that… And then when we patch again we get the images, the richer content, a much better recording of the page. So we have 2 WARCs here – one from the British Library archive, one from the patching that might be combined and used to enrich that partial UKWA capture.

Similarly we can look at a CNN page and take patches from e.g. the Portuguese archive. And once it is done we have a more complete archive… When we play this back you can display the page as it appeared, and patch files are available for archives to add to their copy.

So, this is all in beta right now but we hope to release it all in the near future…

Q&A

Q1) Every web archive already has a temporal issue where the content may come from other dates than the page claims to have… But you could aggravate that problem. Have you considered this?

A1) Yes. There are timebounds for patching. And also around what you display to the user so they understand what they see… e.g. to patch only within the week or the month…

Q2) So it’s the closest date to what is in web recorder?

A2) The other sources are the closest successful result on/closest to the date from another site…
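
The time-bounded "closest successful result" idea from these answers can be sketched as below. The archive names, dates and the seven-day default window are illustrative assumptions; Webrecorder's actual selection logic was not shown.

```python
from datetime import datetime, timedelta


def choose_patch(target, candidates, window=timedelta(days=7)):
    """Pick the capture closest to the target date, within a time bound.

    candidates is a list of (archive_name, capture_datetime) pairs.
    Returns None when no archive has a capture inside the window.
    """
    in_window = [c for c in candidates if abs(c[1] - target) <= window]
    if not in_window:
        return None
    return min(in_window, key=lambda c: abs(c[1] - target))


target = datetime(2017, 3, 2)
candidates = [
    ("archive-a", datetime(2017, 2, 20)),  # outside the window
    ("archive-b", datetime(2017, 3, 5)),   # three days later
    ("archive-c", datetime(2017, 3, 1)),   # one day earlier
]
print(choose_patch(target, candidates))
```

Keeping the chosen archive name and capture date alongside the patched resource is what would let the replay interface tell users where each piece came from.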

Q3) Rather than a fixed window for collection, seeing frequency of change might be useful to understand quality/relevance… But I think you are replaying

A3) Have you considered a headless browser… with the address bar…

A3 – Lozana) Actually for us the key use case is about highlighting and showcasing old art works to the users. It is really important to show the original page as it appeared – in the older browsers like Netscape etc.

Q4) This is incredibly exciting. But how difficult is the patching… What does it change?

A4) If you take a good capture and a static image is missing… Those are easy to patch in… If highly contextualised – like Facebook, that is difficult to do.

Q5) Can you do this in realtime… So you archive with Perma.cc then you want to patch something immediately…

A5) This will be in the new version I hope… So you can check other sources and fall back to other sources and scenarios…

Comment – Lozana) We have run UX work with an archiving organisation in Europe for cultural heritage and their use case is that they use Archive-It and do QA the next day… The crawl might miss something that is highly dynamic, so they want to be able to patch it pretty quickly.

Ilya) If you have an archive that is not in the public archive list on Github please do submit it as a fork request and we’ll be able to add it…

Leveraging APIs (Chair: Nicholas Taylor)

Fernando Melo and Joao Nobre: Arquivo.pt API: enabling automatic analytics over historical web data

Fernando: We are a publicly available web archive, mainly of Portuguese websites from the .pt domain. So, what can you do with our API?

Well, we built our first image search using our API, for instance a way to explore Charlie Hebdo materials; another application enables you to explore information on Portuguese politicians.

We support the Memento protocol, and you can use the Memento API. We are one of the time gates for the time travel searches. And we also have full text search as well as URL search, through our OpenSearch API. We have extended our API to support temporal searches in the Portuguese web. Find this at: http://arquivo.pt/apis/opensearch/. Full text search requests can be made through a URL query, e.g. http://arquivo.pt/opensearch?query=euro 2004 would search for mentions of euro 2004, and you can add parameters to this, or search as a phrase rather than keywords.

You can also search mime types – so just within PDFs for instance. And you can also run URL searches – e.g. all pages from the New York Times website… And if you provide time boundaries the search will look for the capture from the nearest date.
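
A minimal sketch of building such a query URL safely. Only the base address and the `query` parameter come from the talk. Wrapping the terms in quotation marks for a phrase search and the `extra` parameters argument are assumptions, so check the API documentation for the real parameter names.

```python
from urllib.parse import urlencode

BASE_URL = "http://arquivo.pt/opensearch"  # endpoint as given in the talk


def build_search_url(terms, phrase=False, extra=None):
    """Build a full text search URL, percent-encoding spaces and quotes."""
    # Assumption: a phrase search is expressed by quoting the terms.
    query = '"%s"' % terms if phrase else terms
    params = {"query": query}
    params.update(extra or {})  # any further parameters are hypothetical here
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("euro 2004"))
print(build_search_url("euro 2004", phrase=True))
```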

Joao: I am going to talk about our image search API. This works based on keyword searches, you can include operators such as limiting to images from a particular site, to particular dates… Results are ordered by relevance, recency, or by type. You can also run advanced image searches, such as for icons, you can use quotation marks for names, or a phrase.

The request parameters include:

  • query
  • stamp – timestamp
  • Start – first index of search
  • safe Image (yes; no; all) – restricts search only to safe images.

The response is returned in JSON with total results, URL, width, height, alt, score, timestamp, mime, thumbnail, nsfw, pageTitle fields.
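
As a sketch of consuming such a response: the example below filters results by an NSFW score and decodes the base 64 thumbnail mentioned in the Q&A. The exact key names, the `items` wrapper, the treatment of `nsfw` as a 0 to 1 score and the threshold are all assumptions made for illustration.

```python
import base64

# Hypothetical response; key names follow the field list above but the real
# structure of the closed beta API may differ.
SAMPLE_RESPONSE = {
    "totalResults": 2,
    "items": [
        {"url": "http://example.pt/a.png", "width": 10, "height": 10,
         "alt": "logo", "score": 0.9, "timestamp": "20040612000000",
         "mime": "image/png", "nsfw": 0.01, "pageTitle": "Euro 2004",
         "thumbnail": base64.b64encode(b"thumb-a").decode("ascii")},
        {"url": "http://example.pt/b.jpg", "width": 20, "height": 20,
         "alt": "", "score": 0.5, "timestamp": "20040613000000",
         "mime": "image/jpeg", "nsfw": 0.97, "pageTitle": "Other",
         "thumbnail": base64.b64encode(b"thumb-b").decode("ascii")},
    ],
}


def safe_thumbnails(response, max_nsfw=0.5):
    """Map image URL to decoded thumbnail bytes for results under the threshold."""
    return {
        item["url"]: base64.b64decode(item["thumbnail"])
        for item in response["items"]
        if item["nsfw"] <= max_nsfw
    }


print(list(safe_thumbnails(SAMPLE_RESPONSE)))
```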

More on all of this: http://arquivo.pt/apis

Q&A

Q1) How do you classify safe for work/not safe for work

A1 – Fernando) This is a closed beta version. Safe for work/nsfw is based on a classifier built around a training set from Yahoo. We are not for blocking things but we want to be able to exclude shocking images if needed.

Q1) We have this same issue in the GifCities project – we have a manually curated training set to handle that.

Comment) Maybe you need to have more options for that measure to provide levels of filtering…

Q2) With that json response, why did you include title and alt text…

A2) We process image and extract from URL, the image text… So we capture the image, the alt text, but we thought that perhaps the page title would be interesting, giving some sense of context. Maybe the text before/after would also be useful but that takes more time… We are trying to keep this working

Q3) What is the thumbnail value?

A3) It is in base 64. But we can make that clearer in the next version…

Nicholas Taylor: Lots more LOCKSS for web archiving: boons from the LOCKSS software re-architecture

This is following on from the presentation myself and colleagues did at last year’s IIPC on APIs.

LOCKSS came about from a serials librarian and a computer scientist. They were thinking about emulating the best features of the system for preserving print journals, allowing libraries to conserve their traditional role as preserver. The LOCKSS boxes would sit in each library, collecting from publishers’ website, providing redundancy, sharing with other libraries if and when that publication was no longer available.

18 years on this is a self-sustaining programme running out of Stanford, with 10s of networks and hundreds of partners. Lots of copies isn’t exclusive to LOCKSS, but it is the decentralised replication model that addresses the fact that long term bit integrity is hard to solve, and that more (correlated) copies don’t necessarily keep things safe and can make them vulnerable to hackers. So this model is community approved, published on, and well established.

Last year we started re-architecting the LOCKSS software so that it becomes a series of web services. Why do this? Well, to reduce support and operation costs – taking advantage of other software on the web and in web archiving; to de-silo components and enable external integration – we want components to find use in other systems, especially in web archiving; and we are preparing to evolve with the web, to adapt our technologies accordingly.

What that means is that LOCKSS systems will treat WARC as a storage abstraction, and more seamlessly do this, processing layers, proxies, etc. We also already integrate Memento but this will also let us engage WASAPI – on which there will be more in our next talk.

We have built a service for bibliographic metadata extraction, for web harvest and file transfer content; we can map values in DOM tree to metadata fields; we can retrieve downloadable metadata from expected URL patterns; and parse RIS and XML by schema. That model shows our bias to bibliographic material.

We are also using plugins to make bibliographic objects and their metadata on many publishing platforms machine-intelligible. We mainly work with publishing/platform heuristics like Atypon, Digital Commons, HighWire, OJS and Silverchair. These vary so we have a framework for them.

The use cases for metadata extraction would include applying to consistent subsets of content in larger corpora; curating PA materials within broader crawls; retrieving faculty publications online; or retrieving from University CMSs. You can also undertake discovery via bibliographic metadata, with your institution’s OpenURL resolver.

As described in 2005 D-Lib paper by DSHR et al, we are looking at on-access format migration. For instance x-bitmap to GIF.

Probably the most important core preservation capability is the audit and repair protocol. Network nodes conduct polls to validate integrity of distributed copies of data chunks. More nodes = more security – more nodes can be down; more copies can be corrupted… The nodes do not trust each other in this model and responses cannot be cached. And when copies do not match, the node audits and repairs.
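
To illustrate why a fresh nonce means responses cannot be cached, here is a heavily simplified sketch of a nonce-based poll with majority repair. It simulates peers in-process and uses one nonce per poll; the real LOCKSS protocol is considerably more involved (for example in how nonces, votes and peer selection are handled), so treat every name here as hypothetical.

```python
import hashlib
import os
from collections import Counter


def vote(nonce, content):
    """A peer's vote: a hash over the poll's nonce plus its own copy."""
    return hashlib.sha256(nonce + content).hexdigest()


def audit_and_repair(local_copy, peer_copies):
    """Poll peers with a fresh nonce; repair only on a clear majority."""
    if not peer_copies:
        return local_copy
    nonce = os.urandom(16)  # fresh per poll, so old votes cannot be replayed
    votes = [vote(nonce, copy) for copy in peer_copies]
    winning_vote, count = Counter(votes).most_common(1)[0]
    agrees = vote(nonce, local_copy) == winning_vote
    if agrees or count <= len(votes) / 2:
        # Local copy matches the most common vote, or the poll is inconclusive.
        return local_copy
    # Otherwise fetch a repair from a peer that voted with the majority.
    return peer_copies[votes.index(winning_vote)]


good = b"journal article bytes"
damaged = b"journal artXcle bytes"
print(audit_and_repair(damaged, [good, good, b"something else"]) == good)
```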

We think that functionality may be useful in other distributed digital preservation networks, in repository storage replication layers. And we would like to support varied back-ends including tape and cloud. We haven’t built those integrations yet…

To date our progress has addressed the WARC work. By end of 2017 we will have Docker-ised components, have a web harvest framework, polling and repair web service. By end of 2018 we will have IP address and Shibboleth access to OpenWayBack…

By all means follow and plug in. Most of our work is in a private repository, which then copies to GitHub. And we are moving more towards a community orientated software development approach, collaborating more, and exploring use of LOCKSS technologies in other contexts.

So, I want to end with some questions:

  • What potential do you see for LOCKSS technologies for web archiving, other use cases?
  • What standards or technologies could we use that we maybe haven’t considered
  • How could we help you to use LOCKSS technologies?
  • How would you like to see LOCKSS plug in more to the web archiving community?

Q&A

Q1) Will these work with existing LOCKSS software, and do we need to update our boxes?

A1) Yes, it is backwards compatible. And the new features are containerised so that does slightly change the requirements of the LOCKSS boxes but no changes needed for now.

Q2) Where do you store bibliographic metadata? Or is it in the WARC?

A2) It is separate from the WARC, in a database.

Q3) With the extraction of the metadata… We have some resources around translators that may be useful.

Q4 – David) Just one thing on your simplified example… For each node… They all have to calculate a new separate nonce… None of the answers are the same… They all have to do all the work… It’s actually a system where untrusted nodes are compared… And several nodes can’t gang up on the other… Each peer randomly decides on when to poll on things… There is no leader here…

Q5) Can you talk about format migration…

A5) It’s a capability already built into LOCKSS but we haven’t had to use it…

A5 – David) It’s done on the requests in http, which include acceptable formats… You can configure this thing so that if an acceptable format isn’t found, then you transform it to an acceptable format… (see the paper mentioned earlier). It is based on mime type.
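The migrate-on-access idea David describes can be sketched roughly as follows. This is an illustrative Python sketch only: the `CONVERTERS` table, the function names and the stand-in "conversion" are hypothetical and are not taken from LOCKSS.

```python
def parse_accept(header: str) -> list[str]:
    """Return the media types listed in an Accept header, ignoring q-values."""
    return [part.split(";")[0].strip() for part in header.split(",") if part.strip()]


# Hypothetical converters keyed by (stored MIME type, acceptable MIME type).
CONVERTERS = {
    # Stand-in transformation for illustration, not a real image conversion.
    ("image/x-xbitmap", "image/png"): lambda data: b"PNG:" + data,
}


def serve(stored_type: str, data: bytes, accept_header: str) -> tuple[str, bytes]:
    """Serve the stored bytes as-is if acceptable, otherwise migrate on the fly."""
    acceptable = parse_accept(accept_header)
    if stored_type in acceptable or "*/*" in acceptable:
        return stored_type, data
    for target in acceptable:
        convert = CONVERTERS.get((stored_type, target))
        if convert is not None:
            return target, convert(data)
    raise LookupError(f"no converter from {stored_type} to any of {acceptable}")


migrated = serve("image/x-xbitmap", b"bits", "image/png, image/gif;q=0.8")
unchanged = serve("text/html", b"<p>hi</p>", "text/html, */*;q=0.1")
```

The preserved copy stays in its original format; only the response is transformed, and only when the client's request does not list the stored MIME type as acceptable.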

Q6) We are trying to use LOCKSS as a generic archive crawler… Is that still how it will work…

A6) I’m not sure I have a definitive answer… LOCKSS will still be web harvesting-based. It will still be interesting to hear about approaches that are not web harvesting based.

A6 – David) Also interesting for CLOCKSS which are not using web harvesting…

A6) For the CLOCKSS and LOCKSS networks – the big networks – the web harvesting portfolio makes sense. But other networks with other content types, that is becoming more important.

Comment) We looked at doing transformation that is quite straightforward… We have used an API

Q7) Can you say more about the community project work?

A7) We have largely run LOCKSS as more of an in-house project, rather than a community project. We are trying to move it more in the direction of say, Blacklight, Hydra….etc. A culture change here but we see this as a benchmark of success for this re-architecting project… We are also in the process of hiring a partnerships manager and that person will focus more on creating documentation, doing developer outreach etc.

David: There is a (fragile) demo that you can have a lot of this… The goal is to continue that through the laws project, as a way to try this out… You can (cautiously) engage with that at demo.laws.lockss.org but it will be published to GitHub at some point.

Jefferson Bailey & Naomi Dushay: WASAPI data transfer APIs: specification, project update, and demonstration

Jefferson: I’ll give some background on the APIs. This is an IMLS funded project in the US looking at Systems Interoperability and Collaborative Development for Web Archives. Our goals are to:

  • build WARC and derivative dataset APIs (AIT and LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
  • Seed and launch community modelled on characteristics of successful development and participation from communities ID’d by project
  • Sketch a blueprint and technical model for future web archiving APIs informed by project R&D
  • Technical architecture to support this.

So, we’ve already run WARC and Digital Preservation Surveys. 15-20% of Archive-it users download and locally store their WARCS – for various reasons – that is small and hasn’t really moved, that’s why data transfer was a core area. We are doing online webinars and demos. We ran a national symposium on API based interoperability and digital preservation and we have white papers to come from this.

Development wise we have created a general specification, a LOCKSS implementation, Archive-it implementation, Archive-it API documentation, testing and utility (in progress). All of this is on GitHub.

The WASAPI Archive-it Transfer API is written in Python, meets all gen-spec criteria, and has Swagger YAML in the repos. Authorisation uses the AIT Django framework (same as the web app), and is not defined in the general specification. We are using browser cookies or HTTP basic auth. We have a basic endpoint (in production) which returns all WARCs for that account; base/all results are paginated. In terms of query parameters you can use: filename; filetype; collection (ID); crawl (ID for AIT crawl job) etc.

So what do you get back? A JSON object has: pagination, count, request-url, includes-extra. You have fields including account (Archive-it ID); checksums; collection (Archive-It ID); crawl; crawl time; crawl start; filename; filetype; locations; size. And you can request these through simple HTTP queries.
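To make the shape of such a query and response concrete, here is a small Python sketch. The base URL is a placeholder rather than the real endpoint, the sample response is invented, and the `files` key and the `sha1` checksum name are assumptions; only the parameter and field names listed above come from the talk.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://example.org/wasapi/v1/webdata"  # placeholder, not a real endpoint


def build_query(**params) -> str:
    """Build a query URL from parameters such as collection, crawl or filetype."""
    return BASE_URL + "?" + urlencode(sorted(params.items()))


# Invented sample response using the field names mentioned in the talk.
SAMPLE_RESPONSE = json.loads("""
{
  "count": 1,
  "includes-extra": false,
  "request-url": "https://example.org/wasapi/v1/webdata?collection=1234",
  "files": [
    {
      "filename": "EXAMPLE-20170616000000-00000.warc.gz",
      "filetype": "warc",
      "collection": 1234,
      "size": 1024,
      "checksums": {"sha1": "0000000000000000000000000000000000000000"},
      "locations": ["https://example.org/webdatafile/EXAMPLE-20170616000000-00000.warc.gz"]
    }
  ]
}
""")


def download_plan(response: dict) -> list[tuple[str, str]]:
    """List (filename, first location) pairs that a downloader would fetch."""
    return [(f["filename"], f["locations"][0]) for f in response["files"]]


url = build_query(collection=1234, filetype="warc")
plan = download_plan(SAMPLE_RESPONSE)
```

A real client would also authenticate, follow pagination until all results are retrieved, and verify each downloaded file against its checksum.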

You can also submit jobs for generating derivative datasets. We use existing query language.

In terms of what is to come, this includes:

  1. Minor AIT API features
  2. Recipes and utilities (testers welcome)
  3. Community building research and report
  4. A few papers on WA APIs
  5. Ongoing surveys and research
  6. Other APIs in WASAPI (past and future)

So we need some way to bring together these APIs regularly. And also an idea of what other APIs we need to support, and how to prioritise that.

Naomi: I’m talking about the Stanford take on this… These are the steps Nicholas, as project owner, does to download WARC files from Archive-it at the moment… It is a 13 step process… And this grant funded work focuses on simplifying the first six steps and making it more manageable and efficient. As a team we are really focused on not being dependent on bespoke software: things must be maintainable, with continuous integration set up, excellent test coverage, and automate-able. There is a team behind this work, and this was their first touching of any of this code – you had 3 neophytes working on this with much to learn.

We are lucky to be just down the corridor from LOCKSS. Our preferred language is Ruby but Java would work best for LOCKSS. So we leveraged LOCKSS engineering here.

The code is at: https://github.com/sul-dlss/wasapi-downloader/.

You only need Java to run the code. And all arguments are documented on GitHub. A video demo is also available.

These videos are how we share our progress at the end of each Agile sprint.

In terms of work remaining we have various tweaks, pull requests, etc. to ensure it is production ready. One of the challenges so far has been about thinking crawls and patches, and the context of the WARC.

Q&A

Q1) At Stanford are you working with the other WASAPI APIs, or just the downloads one.

A1) I hope the approach we are taking is a welcome one. But we have a lot of projects taking place, but we are limited by available software engineering cycles for archives work.

Note that we do need a new readme on GitHub

Q2) Jefferson, you mentioned plans to expand the API, when will that be?

A2 – Jefferson) I think that it is pretty much done and stable for most of the rest of the year… WARCs do not have crawl IDs or start dates – hence adding crawl time.

Naomi: It was super useful that the team building the downloader was separate from the team building the WASAPI, as that surfaced a lot of the assumptions, issues, etc.

David: We have a CLOCKSS implementation pretty much building on the Swagger. I need to fix our ID… But the goal is that you will be able to extract stuff from a LOCKSS box using WASAPI using URL or Solr text search. But timing wise, don’t hold your breath.

Jefferson: We’d also like others feedback and engagement with the generic specification – comments welcome on GitHub for instance.

Web archives platforms & infrastructure (Chair: Andrew Jackson)

Jack Cushman & Ilya Kreymer: Thinking like a hacker: security issues in web capture and playback

Jack: We want to talk about securing web archives, and how web archives can get themselves into trouble with security… We want to share what we’ve learnt, and what we are struggling with… So why should we care about security as web archives?

Ilya: Well, web archives are not just a collection of old pages… No, high fidelity web archives run untrusted software. And there is an assumption that a live site is “safe” so nothing to worry about… but that isn’t right either…

Jack: So, what could a page do that could damage an archive? Not just a virus or a hack… but more than that…

Ilya: Archiving local content… Well a capture system could have privileged access – on local ports or network server or local files. It is a real threat. And could capture private resources into a public archive. So. Mitigation: network filtering and sandboxing, don’t allow capture of local IP addresses…
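A minimal Python sketch of the "don’t allow capture of local IP addresses" mitigation, using the standard `ipaddress` module. The function names and the exact set of blocked ranges are assumptions for illustration, not what any particular capture tool does.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_address(address: str) -> bool:
    """True for loopback, private, link-local and other non-public addresses."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_capture_allowed(url: str, resolve=socket.gethostbyname) -> bool:
    """Resolve the URL's host and refuse capture if it points at a local address.

    `resolve` is injectable so the check can be exercised without network access.
    """
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        address = resolve(host)
    except OSError:
        return False
    return not is_blocked_address(address)
```

A check like this is only a first line of defence: a hostname can resolve differently between the check and the actual fetch, so the same rule also needs enforcing at the network layer (firewall rules or a sandboxed capture environment).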

Jack: Threat: hacking the headless browser. Modern captures may use PhantomJS or other browsers on the server, most browsers have known exploits. Mitigation: sandbox your VM

Ilya: Stealing user secrets during capture… Normal web flow… But you have other things open in the browser. Partial mitigation: rewriting – rewrite cookies to exact path only; rewrite JS to intercept cookie access. Mitigation: separate recording sessions – for webrecorder use separate recording sessions when recording credentialed content. Mitigation: Remote browser.

Jack: So assume we are running MyArchive.com… Threat: cross site scripting to steal archive login

Ilya: Well you can use a subdomain…

Jack: Cookies are separate?

Ilya: Not really.. In IE10 the archive within the archive might steal login cookie. In all browsers a site can wipe and replace cookies.

Mitigation: run web archive on a separate domain from everything else. Use iFrames to isolate web archive content. Load web archive app from app domain, load iFrame content from content domain. As Webrecorder and Perma.cc both do.

Jack: Now, in our content frame… how bad could it be if that content leaks… What if we have live web leakage on playback? This can happen all the time… It’s hard to stop that entirely… Javascript can send messages back and fetch new content… to mislead, track users, rewrite history. Bonus: for private archives – any of your captures could export any of your other captures.

The best mitigation is a Content-Security-Policy header, which can limit access to the web archive domain.
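As a sketch of what such a header might look like, the Python below builds a policy that only allows resources to load from the archive's own content origin. The origin and the exact source list are assumptions; real replay systems use longer, carefully tuned policies.

```python
def replay_csp_header(content_origin: str) -> tuple[str, str]:
    """Return a (name, value) header pair restricting loads to the archive origin.

    'unsafe-inline' and 'unsafe-eval' are included on the assumption that archived
    pages rely on inline scripts and styles; data: and blob: cover embedded resources.
    """
    sources = f"'unsafe-inline' 'unsafe-eval' {content_origin} data: blob:"
    return "Content-Security-Policy", f"default-src {sources}"


# Hypothetical content domain, kept separate from the archive's app domain.
name, value = replay_csp_header("https://content.myarchive.example")
```

With this header on replayed pages, a browser that enforces CSP will refuse requests to any origin not listed, which blocks most live web leakage from archived Javascript.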

Ilya: Threat: Show different page contents when archived… Pages can tell they’re in an archive and act differently. Mitigation: Run archive in containerised/proxy mode browser.

Ilya: Threat: Banner spoofing… This is a dangerous but quite easy to execute threat. Pages can dynamically edit the archive’s banner…

Jack: Suppose I copy the code of a page that was captured and fake evidence: change the metadata of the date collected, and/or the URL bar…

Ilya: You can’t do that in Perma because we use frames. But if you don’t separate banner and content, this is a fairly easy exploit to do… So, Mitigation: Use iFrames for replay; don’t inject banner into replay frame… It’s a fidelity/security trade off…

Jack: That’s our top 7 tips… But what next… What we introduce today is a tool called http://warc.games. This is a version of webrecorder with every security problem possible turned on… You can run it locally on your machine to try all the exploits and think about mitigations and what to do about them!

And you can find some exploits to try, some challenges… Of course if you actually find a flaw in any real system please do be respectful

Q&A

Q1) How much is the bug bounty?! [laughs] What do we do about the use of very old browsers…

A1 – Jack) If you use an old browser you may be compromised already… But we use the most robust solution possible… In many cases there are secure options that work with older browsers too…

Q2) Any trends in exploits?

A2 – Jack) I recommend the book The Tangled Web… And there is an aspect that when you run a web browser there will always be some sort of issue

A2 – Ilya) We have to get around security policies to archive the web… It wasn’t designed for archiving… But that raises its own issues.

Q3) Suggestions for browser makers to make these safer?

A3) Yes, but… How do you do this with current protocols and APIs

Q4) Does running old browsers and escaping from containers keep you awake at night…

A4 – Ilya) Yes!

A4 – Jack) If anyone is good at container escapes please do write that challenge as we’d like to have it in there…

Q5) There’s a great article called “Familiarity Breeds Contempt” which notes that old browsers and software get more vulnerable over time… It is particularly a big risk where you need old software to archive things…

A5 – Jack) Thanks David!

Q6) Can you say more about the headers being used…

A6) The idea is we write the CSP header to only serve from the archive server… And they can be quite complex… May want to add something of your own…

Q7) May depend on what you see as a security issue… for me it may be about the authenticity of the archive… By building something in the website that shows different content in the archive…

A7 – Jack) We definitely think that changing the archive is a security threat…

Q8) How can you check the archives and look for arbitrary hacks?

A8 – Ilya) It’s pretty hard to do…

A8 – Jack) But it would be a really great research question…

Mat Kelly & David Dias: A collaborative, secure, and private InterPlanetary WayBack web archiving system using IPFS

David: Welcome to the session on going InterPlanetary… We are going to talk about peer to peer and other technology to make web archiving better…

We’ll talk about InterPlanetary File System (IPFS) and InterPlanetary WayBack (IPWB)…

IPFS is also known as the distributed web, moving from location based to content based… As we are aware, the web has some problems… You have experience of using a service, accessing email, using a document… There is some break in connectivity… And suddenly all those essential services are gone… Why? Why do we need to have the services working in such a vulnerable way… Even a simple page, you lose a connection and you get a 404. Why?

There is a real problem with permanence… We have this URI, the URL, telling us the protocol, location and content path… But when we come back later – weeks or months – and that content has moved elsewhere… Either somewhere else you can find, or somewhere you can’t. Sometimes it’s like the content has been destroyed… But every time people see a webpage, you download it to your machine… These issues come from location addressing…

In content addressing we tie content to a unique hash that identifies the item… So a Content Identifier (CID) allows us to do this… And then, in a network, when I look for that data… If there is a disruption to the network, we can ask any machine where the content is… And the node near you can show you what is available before you ever go to the network.
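The core of content addressing can be shown in a few lines of Python. This is a toy sketch: the `ContentStore` class is hypothetical and uses a bare SHA-256 hex digest as the identifier, whereas real IPFS content identifiers are encoded differently.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the bytes themselves."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store the data and return its identifier (here, a SHA-256 hex digest)."""
        content_id = hashlib.sha256(data).hexdigest()
        self._blocks[content_id] = data
        return content_id

    def get(self, content_id: str) -> bytes:
        """Fetch by identifier and verify the bytes still match it."""
        data = self._blocks[content_id]
        if hashlib.sha256(data).hexdigest() != content_id:
            raise ValueError("stored content does not match its identifier")
        return data


# Two independent nodes derive the same identifier for the same page.
node_a, node_b = ContentStore(), ContentStore()
page = b"<html>archived page</html>"
id_a = node_a.put(page)
id_b = node_b.put(page)
```

Because the identifier depends only on the bytes, any node holding the content can answer a request for it, and the requester can verify the answer without trusting that node.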

IPFS is already used in video streaming (inc. Netflix), legal documents, 3D models – with HoloLens for instance, for games, for scientific data and papers, blogs and webpages, and totally distributed web apps.

IPFS allows this to be distributed, offline, saves space, optimise bandwidth usage, etc.

Mat: So I am going to talk about IPWB. The motivation here is that the persistence of archived web data is dependent on the resilience of the organisation and the availability of the data. The design extends the CDXJ format, with an indexing and IPFS dissemination procedure, and a replay and IPFS pull procedure. So an adapted CDXJ adds a header with the hash for the content to the metadata structure.

Dave: One of the ways IPFS is making changes in the boundary is in browser tab, in browser extension and service worker as a proxy for requests the browser makes, with no changes to the interface (that one is definitely in alpha!)…

So the IPWB can expose the content to the IPFS and then connect and do everything in the browser without needing to download and execute code on their machine. Building it into the browser makes it easy to use…

Mat: And IPWB enables privacy, collaboration and security, building encryption method and key into the WARC. Similarly CDXJs may be transferred for our users’ replay… Ideally you won’t need a CDXJ on your own machine at all…

We are also rerouting, rather than rewriting, for archival replay… We’ll be presenting on that late this summer…

And I think we just have time for a short demo…

For more see: https://github.com/oduwsdl/ipwb

Q&A

Q1) Mat, I think that you should tell that story of what you do…

A1) So, I looked for files on another machine…

A1 – Dave) When Mat has the archive file on a remote machine… Someone looks for this hash on the network, send my way as I have it… So when Mat looked, it replied… so the content was discovered… request issued, received content… and presented… And that also lets you capture pages appearing differently in different places and easily access them…

Q2) With the hash addressing, are there security concerns…

A2 – Dave) We use Multihash, using Shard… But you can use different hash functions, they just verify the link… In IPFS we prevent issue with self-describable data functions..
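The self-describing idea behind Multihash can be sketched in Python: the digest is prefixed with a code for the hash function and the digest length, so a verifier knows which function to apply. The sketch below handles SHA-256 only (code 0x12, length 0x20) and treats both prefix values as single bytes, which is a simplification of the full format.

```python
import hashlib

SHA2_256_CODE = 0x12  # multihash function code for SHA-256


def multihash_sha256(data: bytes) -> bytes:
    """Return <function code><digest length><digest> for the given data."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def verify(multihash: bytes, data: bytes) -> bool:
    """Read the prefix to choose the hash function, then compare digests."""
    code, length = multihash[0], multihash[1]
    if code != SHA2_256_CODE:
        raise ValueError(f"unsupported hash function code: {code:#x}")
    return multihash[2:2 + length] == hashlib.sha256(data).digest()


mh = multihash_sha256(b"archived content")
```

Since the function is named inside the identifier, a network can introduce a stronger hash later without making old identifiers ambiguous, although links created with the old function remain only as strong as that function.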

Q3) The problem is that the hash function does end up in the URL… and it will decay over time because the hash function will decay… It’s a really hard problem to solve – making a choice now that may be wrong… But there is no way of choosing the right choice.

A3) At least we can use the hash function to indicate whether it looks likely to be the right or wrong link…

Q4) Is hash functioning itself useful with or without IPFS… Or is content addressing itself inherently useful?

A4 – Dave) I think the IPLD is useful anyway… So with legal documents where links have to stay intact, and not be part of the open web, then IPFS can work to restrict that access but still make this more useful…

Q5) If we had a content addressable web, almost all these web archiving issues would be resolved really… It is hard to know if content is in Archive 1 or Archive 2. A content addressable web would make it easier to be archived… Important to keep in mind…

A5 – Dave) I 100% agree! Content addressed web lets you understand what is important to capture. And IPFS saves a lot of bandwidth and a lot of storage…

Q6) What is the longevity of the hashes and how do I check that?

A6 – Dave) OK, you can check the integrity of the hash. And we have filecoin.io which is a blockchain based storage network and cryptocurrency and that does handle this information… Using an address in a public blockchain… That’s our solution for some of those specific problems.

Andrew Jackson (AJ), Jefferson Bailey (JB), Kristinn Sigurðsson (KS) & Nicholas Taylor (NT): IIPC Tools: autumn technical workshop planning discussion

AJ: I’ve been really impressed with what I’ve seen today. There is a lot of enthusiasm for open source and collaborative approaches and that has been clear today and the IIPC wants to encourage and support that.

Now, in September 2016 we had a hackathon but there were some who just wanted to get something concrete done… And we might therefore adjust the format… Perhaps pre-define a task well ahead of time… But also a parallel track for the next hackathon/more experimental side. Is that a good idea? What else may be?

JB: We looked at Archives Unleashed, and we did a White House Social Media Hackathon earlier this year… This is a technical track but… it’s interesting to think about what kind of developer skills/what mix will work best… We have lots of web archiving engineers… They don’t use the software that comes out of it… We find it useful to have archivists in the room…

Then, from another angle, is that at the hackathons… IIPC doesn’t have a lot of money and travel is expensive… The impact of that gets debated – it’s a big budget line for 8-10 institutions out of 53 members. The outcomes are obviously useful but… Expecting to be totally funded for days on end across the world isn’t feasible… So maybe more little events, or fewer bigger events can work…

Comment 1) Why aren’t these sessions recorded?

JB: Too much money. We have recorded some of them… Sometimes it happens, sometimes it doesn’t…

AJ: We don’t have in-house skills, so it’s third party… And that’s the issue…

JB: It’s a quality thing…

KS: But also, when we’ve done it before, it’s not heavily watched… And the value can feel questionable…

Comment 1) I have a camera at home!

JB: People can film whatever they want… But that’s on people to do… IIPC isn’t an enforcement agency… But we should make it clear that people can film them…

KS: For me… You guys are doing incredible things… And it’s things I can’t do at home. The other aspect is that… There are advancements that never quite happened… But I think there is value in the unconference side…

AJ: One of the things with unconference sessions is that

NT: I didn’t go to the London hackathon… Now we have a technical team, it’s more appealing… The conference in general is good for surfacing issues we have in common… such as extraction of metadata… But there is also the question of when we sit down to deal with some specific task… That could be useful for taking things forward…

AJ: I like the idea of a counter conference, focused on the tools… I was a bit concerned that if there were really specific things… What does it need to be to be worth your organisations flying you to them… Too narrow and it’s exclusionary… Too broad and maybe it’s not helpful enough…

Comment 2) Worth seeing the model used by Python – they have a sprint after their conference. That isn’t an unconference but lets you come together. Mozilla Fest Sprint picks a topic and then next time you work on it… Sometimes other organisations with less money are worth looking at… And for things like crowd sourcing coverage etc… There must be models…

AJ: This is cool.. You will have to push on this…

Comment 3) I think that tacking on to a conference helps…

KS: But challenging to be away from office more than 3/4 days…

Comment 4) Maybe look at NodeJS Community and how they organise… They have a website, NodeSchool.io with three workshops… People organise events pretty much monthly… And create material in local communities… Less travel but builds momentum… And you can see that that has impact through local NodeJS events now…

AJ: That would be possible to support as well… with IIPC or organisational support… Bootstrapping approaches…

Comment 5) Other than hackathon there are other ways to engage developers in the community… So you can engage with Google Summer of Code for instance – as mentors… That is where students look for projects to work on…

JB: We have two GSoC and like 8 working without funding at the moment… But it’s non trivial to manage that…

AJ: Onboarding new developers in any way would be useful…

Nick: Onboarding into the weird and wacky world of web archiving… If IIPC can curate a lot of onboarding stuff, that would be really good for potential… for getting started… Not relying on a small number of people…

AJ: We have to be careful as IIPC tools page is very popular, but hard to keep up to date… Benefits can be minor versus time…

Nick: Do you have GitHub? Just put up an awesome list!

AJ: That’s a good idea…

JB: Microfunding projects – sub $10k – is also an option, for cost-recovered bought-out time for some of these sorts of tasks… That would be really interesting…

Comment 6) To expand on what Jefferson and Nick were saying… I’m really new… Went to IIPC in April. I am enjoying this and learning a lot… I’ve been talking to a lot of you… That would really help more people get the technical environment right… Organisations want to get into archiving on a small scale…

Olga: We do have a list on GitHub… but not up to date and well used…

AJ: We do have this document, we have GitHub… But we could refer to each other… and point to the getting started stuff (only). Rather get away from lists…

Comment 7) Google has an OpenSource.guide page – could take inspiration from that… Licensing, communities, etc… Very simple plain English getting started guide/documentation…

Comment 8) I’m very new to the community… And I was wondering to what extent you use Slack and Twitter between events to maintain these conversations and connections?

AJ: We have a Slack channel, but we haven’t publicised it particularly but it’s there… And Twitter you should tweet @NetPreserve and they will retweet then this community will see that…

Jun 142017
 

Following on from Day One of IIPC/RESAW I’m at the British Library for a connected Web Archiving Week 2017 event: Digital Conversations @BL, Web Archives: truth, lies and politics in the 21st century. This is a panel session chaired by Elaine Glaser (EG) with Jane Winters (JW), Valerie Schafer (VS), Jefferson Bailey (JB) and Andrew Jackson (AJ). 

As usual, this is a liveblog so corrections, additions, etc. are welcomed. 

EG: Really excited to be chairing this session. I’ll let everyone speak for a few minutes, then ask some questions, then open it out…

JB: I thought I’d talk a bit about our archiving strategy at Internet Archive. We don’t archive the whole of the internet, but we aim to collect a lot of it. The approach is multi-pronged: to take entire web domains in a shallow but broad strategy; to work with other libraries and archives to focus on particular subjects or areas or collections; and then to work with researchers who are mining or scraping the web, but not necessarily having preservation strategies. So, when we talk about political archiving or web archiving, it’s about getting as much as possible, with different volumes and frequencies. I think we know we can’t collect everything, but we collect important things frequently, less important things less frequently. And we work with national governments, with national libraries…

The other thing I wanted to raise is T.R. Schellenberg, who was an important archivist at the National Archives in the US. He had an idea about archival strategies: that there is a primary documentation strategy, and a secondary strategy. The primary is for a government and agencies to do for their own use, the secondary for future use in unknown ways… And including documentary and evidentiary material (the latter being how and why things are done). Those evidentiary elements become much more meaningful on the web, and that has emerged and become more meaningful in the context of our current political environment.

AJ: My role is to build a Web Archive for the United Kingdom. So I want to ask a question that comes out of this… “Can a web archive lie?”. Even putting to one side that it isn’t possible to archive the whole web.. There is confusion because we can’t get every version of everything we capture… Then there are biases from our work. We choose all UK sites, but some are captured more than others… And our team isn’t as diverse as it could be. And what we collect is also constrained by technology capability. And we are limited by time issues… We don’t normally know when material is created… The crawler often finds things only when they become popular… So the academic paper is picked up after a BBC News item – they are out of order. We would like to use more structured data, such as Twitter which has clear publication date…

But can the archive lie? Well, with digital material it is much easier than with print to make an untraceable change. As digital is increasingly predominant we need to be aware that our archive could be hacked… So we have to protect against that, and have evidence that we haven’t been hacked… And we have to build systems that are secure and can maintain that trust. Libraries will have to take care of each other.

JW: The Oxford Dictionary word of the year in 2016 was “post truth” whilst the Australian dictionary went for “Fake News”. Fake News for them is either disinformation on websites for political purposes, or commercial benefit. Merriam-Webster went for “surreal” – their most searched for word. It feels like we live in very strange times… There aren’t calls for resignation where there once were… Hasn’t it always been thus though… ? For all the good citizens who point out the errors of a fake image circulated on Twitter, for many the truth never catches the lie. Fakes, lies and forgeries have helped change human history…

But modern fake news is different to that which existed before. Firstly there is the speed of fake news… Mainstream media only counteracts or addresses this. Some newspapers and websites do public corrections, but that isn’t the norm. Once publishing took time and means. Social media has made it much easier to self-publish. One can create, but also one can check accuracy and integrity – reverse image searching to see when a photo has been photoshopped or shows events of two things before…

And we have politicians making claims that they believe can be deleted and disappear from our memory… We have web archives – on both sides of the Atlantic. The European Referendum NHS pledge claim is archived and lasts long beyond the bus – which was bought by Greenpeace and repainted. The archives have also been capturing political parties’ websites throughout our endless election cycle… The DUP website crashed after announcement of the election results because of demand… But the archive copy was available throughout. Also a rumour that a hacker was creating an Irish language version of the DUP website… But that wasn’t a new story, it was from 2011… And again the archive shows that, and archives of news websites do that.

Social Networks Responses to Terrorist Attacks in France – Valerie Schafer. 

Before 9/11 we had some digital archives of terrorist materials on the web. But this event challenged archivists and researchers. Charlie Hebdo, Paris Bataclan and Nice attacks are archived… People can search at the BNF to explore these archives, to provide users a way to see what has been said. And at the INA you can also explore the archive, including Twitter archives. You can search, see keywords, explore timelines crossing key hashtags… And you can search for images… including the emojis used in discussion of Charlie Hebdo and Bataclan.

We also have Archive-It collections for Charlie Hebdo. This raises some questions of what should and should not be collected… We did not normally collect newspapers and audio visual sites, but decided to in this case as we faced a special event. But we still face challenges – it is easier to collect data from Twitter than from Facebook. It is free to collect Twitter data in real time, but the archived/older data is charged for so you have to capture it in the moment. And there are limits on API collection… INA captured more than 12 Million tweets for Charlie Hebdo, for instance; it is very complete but not exhaustive.

We continue to collect for #jesuischarlie and #bataclan… They are continually used and added to, in similar or related attacks, etc. There is a time for exploring and reflecting on this data, and space for critics too…

But we also see that content gets deleted… It is hard to find fake news on social media, unless you are looking for it… Looking for #fakenews just won’t cut it… So, we had a study on fake news… And we recommend that authorities are cautious about material they share. But also there is a need for cross checking – the kinds of projects with Facebook and Twitter. Web archives are full of fake news, but also full of others’ attempts to correct and check fake news as well…

EG: I wanted to go back in time to the idea of the term “fake news”… In order to understand what “Fake News” actually is, we have to understand how it differs from previous lies and mistruths… I’m from outside the web world… We are often looking at tactics to fight fire with fire, to use an unfortunate metaphor… How new is it? And who is to blame and why?

JW: Talking about it as a web problem, or a social media issue isn’t right. It’s about humans making decisions to critique or not that content. But it is about algorithmic sharing and visibility of that information.

JB: I agree. What is new is the way media is produced, disseminated and consumed – those have technological underpinnings. And they have been disruptive of publication and interpretation in a web world.

EG: Shouldn’t we be talking about a culture not just technology… It’s not just the “vessel”… Doesn’t the dissemination have more of a role than perhaps we are suggesting…

AJ: When you build a social network or any digital space you build in different affordances… So that Facebook and Twitter is different. And you can create automated accounts, with Twitter especially offering an affordance for robots etc which allows you to give the impression of a movement. There are ways to change those affordances, but there will also always be fake news and issues…

EG: There are degrees of agency in fake news.. from bots to deliberate posts…

JW: I think there is also the aspect of performing your popularity – creating content for likes and shares, regardless of whether what you share is true or not.

VS: I know terrorism is different… But for any tweet sharing fake news you get 4 retweets denying it… You have more tweets denying than sharing fake news…

AJ: One wonders about the filter bubble impact here… Facebook encourages inward looking discussion… Social media has helped like minded people find each other, and perhaps they can be clipped off more easily from the wider discussion…

VS: I think also what is interesting is the game between social media and traditional media… You have questions and relationships there…

EG: All the internet can do is reflect the crooked timber of reality… We know that people have confirmation bias, we are quite tolerant of untruths, and less tolerant of information that contradicts our perceptions, even if untrue. You have people and the net being equally tolerant of lies and mistruths… But isn’t there another factor here… The people demonised as gatekeepers… By putting in place structures of authority – which were journalism and academics… Their resources are reduced now… So what role do you see for those traditional gatekeepers…

VS: These gatekeepers are no longer the traditional gatekeepers that they were… They work in 24 hour news cycles and have to work to that. In France they are trying to rethink that role, there were a lot of questions about this… Whether that’s about how you react to changing events, and what happens during elections… People thinking about that…

JB: There is an authority and responsibility for media still, but has the web changed that? Looking back it’s surprising now how few organisations controlled most of the media… But is that that different now?

EG: I still think you are being too easy on the internet… We’ve had investigative journalism by Carole Cadwalladr and others on Cambridge Analytica and others who deliberately manipulate reality… You talked about witness testimony in relation to terrorism… Isn’t there an immediacy and authenticity challenge there… Donald Trump’s tweets… They are transparent but not accountable… Haven’t we created a problem that we are now trying to fix?

AJ: Yes. But there are two things going on… It seems to be that people care less about lying… People see Trump lying, and they don’t care, and media organisations don’t care as long as advertising money comes in… A parallel for that in social media – the flow of content and ads takes priority over truth. There is an economic driver common to both mediums that is warping that…

JW: There is an unpopularity aspect too… a (nameless) newspaper here that shares content to generate “I can’t believe this!” and then sharing and generating advertising income… But on a positive note, there is scope and appetite for strong investigative journalism… and that is facilitated by the web and digital methods…

VS: Citizens do use different media and cross media… Colleagues are working on how TV is used… And different channels, to compare… Mainstream and social media are strongly crossed together…

EG: I did want to talk about temporal element… Twitter exists in the moment, making it easy to make people accountable… Do you see Twitter doing what newspapers did?

AJ: Yes… A substrate…

JB: It’s amazing how much of the web is archived… With “Save Page Now” we see all kinds of things archived – including pages that exposed the whole Russian downing a Ukrainian plane… Citizen action, spotting the need to capture data whilst it is still there and that happens all the time…

EG: I am still sceptical about citizen journalism… It’s a small group of people from narrow demographics, it’s time consuming… Perhaps there is still a need for journalist roles… We did talk about filter bubbles… We hear about newspapers and media as biased… But isn’t the issue that communities of misinformation are not penetrated by the other side, but by the truth…

JW: I think bias in newspapers is quite interesting and different to unacknowledged bias… Most papers are explicit in their perspective… So you know what you will get…

AJ: I think so, but bias can be quite subtle… Different perspectives on a common issue allows comparison… But other stories only appear in one type of paper… That selection case is harder to compare…

EG: This really is a key point… There is a difference between facts and truth, and explicitly framed interpretation or commentary… Those things are different… That’s where I wonder about web archives… When I look at Wikipedia… It’s almost better to go to a source with an explicit bias where I can see a take on something, unlike Wikipedia which tries to focus on fact. Talking about politicians lying misses the point… It should be about a specific rhetorical position… That definition of truth comes up when we think of the role of the archive… How do you deal with that slightly differing definition of what truth is…

JB: I talked about different complementary collecting strategies… The Archivist as a thing has some political power in deciding what goes in the historical record… The volume of the web does undercut that power in a way that I think is good – archives have historically been about the rich and the powerful… So making archives non-exclusive somewhat addresses that… But there will be fake news in the archive…

JW: But that’s great! Archives aren’t about collecting truth. Things will be in there that are not true, partially true, or factual… It’s for researchers to sort that out later…

VS: Your comment on Wikipedia… They do try to be factual, neutral… But not truth… And to have a good balance of power… For us as researchers we can be surprised by the neutral point of view… Fortunately the web archive does capture a mixture of opinions…

EG: Yeah, so that captures what people believed at a point of time – true or not… So I would like to talk about the archive itself… Do you see your role as being successors to journalists… Or as being able to harvest the world’s record in a different way…

JB: I am an archivist with that training and background, as are a lot of people working on web archives and interesting spaces. Certainly historic preservation drives a lot of collecting aspects… But also engineering and technological aspects. So it’s people interested in archiving, preservation, but also technology… And software engineers interested in web archiving.

AJ: I’m a physicist but I’m now running web archives. And for us it’s an extension of the legal deposit role… Anything made public on the web should go into the legal deposit… That’s the theory, in practice there are questions of scope, and where we expend quality assurance energy. That’s the source of possible collection bias. And I want tools to support archivists… And also to prompt for challenging bias – if we can recognise that taking place.

JW: There are also questions of what you foreground in Special Collections. There are decisions being made about collections that will be archived and catalogued more deeply…

VS: At the BNF my colleagues work in an area with a tradition, with legal deposit responsibility… There are politics of heritage and what it should be. I think that is the case for many places where that activity sits with other archivists and librarians.

EG: You do have this huge responsibility to curate the record of human history… How do you match the top down requirements with the bottom up nature of the web as we now talk about it?

JW: One way is to have others come in to your department to curate particular collections…

JB: We do have special collections – people can choose their own, public suggestions, feeds from researchers, all sorts of projects to get the tools in place for building web archives for their own communities… I think for the sake of longevity and use going forward, the curated collections will probably have more value… Even if they seem more narrow now.

VS: Also interesting that archives did not select bottom-up curation. In Switzerland they went top down – there are a variety of approaches across Europe.

JW: We heard about the 1916 Easter Rising archive earlier, which was through public nominations… Which is really interesting…

AJ: And social media can help us – by seeing links and hashtags. When we looked at this 4-5 years ago everyone linked to the BBC, but now we have more fake news sites etc…

VS: We do have this question of what should be archived… We see capture of the vernacular web – kitten or unicorn gifs etc… !

EG: I have a dystopian scenario in my head… Could you see a time years from now when newspapers are dead, public broadcasters are more or less dead… And we have flotsam and jetsam… We have all this data out there… And kinds of data who use all this social media data… Can you reassure me?

AJ: No…

JW: I think academics are always ready to pick holes in things, I hope that that continues…

JB: I think more interesting is the idea that there may not be a web… Apps, walled gardens… Facebook is pretty hard to web archive – they make it intentionally more challenging than it should be. There are lots of communication tools that disappeared… So I worry more about loss of a web that allows the positive affordances of participation and engagement…

EG: There is the issue of privatising and sequestering the web… I am becoming increasingly aware of the importance of organisations – like the BL and Internet Archive… Those roles did used to be taken on by publicly appointed organisations and bodies… How are they impacted by commercial privatisation… And how those roles are changing… How do you envisage that public sphere of collecting…

JW: For me more money for organisations like the British Library is important. Trust is crucial, and I trust that they will continue to do that in a trustworthy way. Commercial entities cannot be trusted to protect our cultural heritage…

AJ: A lot of people know what we do with physical material, but are surprised by our digital work. We have to advocate for ourselves. We are also constrained by the legal framework we operate within, and we have to challenge that over time…

JB: It’s super exciting to see libraries and archives recognised for their responsibility and trust… But that also puts them at higher risk by those who they hold accountable, and being recognised as bastions of accountability makes them more vulnerable.

VS: Recently we had 20th birthday of the Internet Archive, and 10 years of the French internet archiving… This is all so fast moving… People are more and more aware of web archiving… We will see new developments, ways to make things open… How to find and search and explore the archive more easily…

EG: The question then is how we access this data… The new masters of the universe will be those emerging gatekeepers who can explore the data… What is the role between them and the public’s ability to access data…

VS: It is not easy to explain everything around web archives but people will demand access…

JW: There are different levels of access… Most people will be able to access what they want. But there is also a great deal of expertise in organisations – it isn’t just commercial data work. And working with the Alan Turing Institute and cutting edge research helps here…

EG: One of the founders of the internet, Vint Cerf, says that “if you want to keep your treasured family pictures, print them out”. Are we overly optimistic about the permanence of the record?

AJ: We believe we have the skills and capabilities to maintain most if not all of it over time… There is an aspect of benign neglect… But if you are active about your digital archive you could have a copy in every continent… Digital allows you to protect content from different types of risk… I’m confident that the library can do this as part of its mission.

Q&A

Q1) Coming back to fake news and journalists… There is a changing role between the web as a communications media, and web archiving… Web archives are about documenting this stuff for journalists for research as a source, they don’t build the discussion… They are not the journalism itself.

Q2) I wanted to come back to the idea of the Filter Bubble, in the sense that it mediates the experience of the web now… It is important to capture that in some way, but how do we archive that… And changes from one year to the next?

Q3) It’s kind of ironic to have nostalgia about journalism and traditional media as gatekeepers, in a country where Rupert Murdoch is traditionally that gatekeeper. Global funding for web archiving is tens of millions; the budget for the web is tens of billions… The challenges are getting harder – right now you can use robots.txt but we have DRM coming and that will make it illegal to archive the web – and the budgets have to increase to match that to keep archives doing their job.

AJ: To respond to Q3… Under the legislation it will not be illegal for us to archive that data… But it will make it more expensive and difficult to do, especially at scale. So your point stands, even with that. In terms of the Filter Bubble, they are out of our scope, but we know they are important… It would be good to partner with an organisation where the modern experience of media is explicitly part of its role.

JW: I think that idea of the data not being the only thing that matters is important. Ethnography is important for understanding that context around all that other stuff…  To help you with supplementary research. On the expense side, it is increasingly important to demonstrate the value of that archiving… Need to think in terms of financial return to digital and creative economies, which is why researchers have to engage with this.

VS: Regarding the first two questions… Archives reflect reality, so there will be lies there… Of course web archives must be crossed and compared with other archives… And contextualisation matters, the digital environment in which the web was living… Contextualisation of web environment is important… And with terrorist archive we tried to document the process of how we selected content, and archive that too for future researchers to have in mind and understand what is there and why…

JB: I was interested in the first question, this idea of what happens and preserving the conversation… That timeline was sometimes decades before but is now weeks or days or less… In terms of experience websites are now personalised and our ability to capture that is impossible on a broad question. So we need to capture that experience, and the emergent personalisation… The web wasn’t public before, as ARPAnet, then it became public, but it seems to be ebbing a bit…

JW: With a longer term view… I wonder if the open stuff which is easier to archive may survive beyond the gated stuff that traditionally was more likely to survive.

Q4) Today we are 24 years into advertising on the web. We take ad-driven models as a given, and we see fake news as a consequence of that… So, my question is, Minitel was a large system that ran on a different model… Are there different ways to change the revenue model to change fake or true news and how it is shared…

Q5) Theresa May has been outspoken on fake news and wants a crackdown… The way I interpret that is censorship and banning of sites she does not like… Jefferson said that he’s been archiving sites that she won’t like… What will you do if she asks you to delete parts of your archive…

JB: In the US?!

Q6) Do you think we have sufficient web literacy amongst policy makers, researchers and citizens?

JW: On that last question… Absolutely not. I do feel sorry for politicians who have to appear on the news to answer questions but… Some of the responses and comments, especially on encryption and cybersecurity have been shocking. It should matter, but it doesn’t seem to matter enough yet… 

JB: We have a tactic of “geopolitical redundancy” to ensure our collections are shielded from political endangerment by making copies – which is easy to do – and locate them in different political and geographical contexts. 

AJ: We can suppress content by access. But not deletion. We don’t do that… 

EG: Is there a further risk of data manipulation… Of Trump and Farage and data… a covert threat… 

AJ: We do have to understand and learn how to cope with potential attack… Any one domain is a single point of failure… so we need to share metadata, content where possible… But web archives are fortunate to have the strong social framework to build that on… 

Q7) Going back to that idea of what kinds of responsibilities we have to enable a broader range of people to engage in a rich way with the digital archive… 

Q8) I was thinking about questions in context, and trust in content in the archive… And realising that web archives are fairly young… Generally researchers are close to the resource they are studying… Can we imagine projects in 50-100 years time where we are more separate from what we should be trusting in the archive… 

Q9) My perspective comes from building a web archive for European institutions… And can the archive live… Do we need legal notice on the archive, disclaimers, our method… How do we ensure people do not misinterpret what we do. How do we make the process of archiving more transparent. 

JB: That question of who has resources to access web archives is important. It is a responsibility of institutions like ours… To ensure even small collections can be accessed, that researchers and citizens are empowered with skills to query the archive, and things like APIs to enable that too… The other question on evidencing curatorial decisions – we are notoriously poor at that historically… But there is a lot of technological mystery there that we should demystify for users… All sorts of complexity there… Web archiving needs to work on that provenance information over the next few years…

AJ: We do try to record this but as Jefferson said much of this is computational and algorithmic… So we maybe need to describe that better for wider audiences… That’s a bigger issue anyway, that understanding of algorithmic process. At the British Library we are fortunate to have capacity for text mining our own archives… We will be doing more than that… It will be small at first… But as it’s hard to bring data to the queries, we must bring queries to the archive. 

JW: I think it is so hard to think ahead to the long term… You’ll never pre-empt all usage… You just have to do the best that you can. 

VS: You won’t collect everything, every time… The web archive is not an exact mirror… It is “reborn digital heritage”… We have to document everything, but we can try to give some digital literacy to students so they have a way to access the web archive and engage with it… 

EG: Time is up. Thank you to our panellists for this fantastic session.

Jun 14 2017
 

From today until Friday I will be at the International Internet Preservation Consortium (IIPC) Web Archiving Conference 2017, which is being held jointly with the second RESAW: Research Infrastructure for the Study of Archived Web Materials Conference. I’ll be attending the main strand at the School of Advanced Study, University of London, today and Friday, and at the technical strand (at the British Library) on Thursday. I’m here wearing my “Reference Rot in Theses: A HiberActive Pilot” – aka “HiberActive” – hat. HiberActive is looking at how we can better enable PhD candidates to archive web materials they are using in their research, and citing in their thesis. I’m managing the project and working with developers, library and information services stakeholders, and a fab team of five postgraduate interns who are, whilst I’m here, out and about around the University of Edinburgh talking to PhD students to find out how they collect, manage and cite their web references, and what issues they may be having with “reference rot” – content that changes, decays, disappears, etc. We will have a webpage for the project and some further information to share soon but if you are interested in finding out more, leave me a comment below or email me: nicola.osborne@ed.ac.uk. These notes are being taken live so, as usual for my liveblogs, I welcome corrections, additions, comments etc. (and, as usual, you’ll see the structure of the day appearing below with notes added at each session).

Opening remarks: Jane Winters

This event follows the first RESAW event which took place in Aarhus last year. This year we again highlight the huge range of work being undertaken with web archives. 

This year a few things are different… Firstly we are holding this with the IIPC, which means we can run the event over 3 days, and means we can bring together librarians, archivists, and data scientists. The BL have been involved and we are very grateful for their input. We are also excited to have a public event this evening, highlighting the increasingly public nature of web archiving.

Opening remarks: Nicholas Taylor

On behalf of the IIPC Programme Committee I am hugely grateful to colleagues here at the School of Advanced Study and at the British Library for being flexible and accommodating us. I would also like to thank colleagues in Portugal, and hope a future meeting will take place there as had been originally planned for IIPC.

For us we have seen the Web Archiving Conference as an increasingly public way to explore web archiving practice. The programme committee saw a great increase in submissions, requiring a larger than usual commitment from the committee. We are lucky to have this opportunity to connect as an international community of practice, to build connections to new members of the community, and to celebrate what you do.

Opening plenary: Leah Lievrouw – Web history and the landscape of communication/media research Chair: Nicholas Taylor

I intend to go through some context in media studies. I know this is a mixed audience… I am from the Department of Information Studies at UCLA and we have a very polyglot organisation – we can never assume that we all understand each other’s backgrounds and contexts.

A lot about the web, and web archiving, is changing, so I am hoping that we will get some Q&A going about how we address some gaps in possible approaches. 

I’ll begin by saying that it has been some time now that computing, computers as communication devices, has been seen as a medium. This seems commonplace now, but when I was in college this was seen as fringe, in communication research, in the US at least. But for years documentarists, engineers, programmers and designers have seen information resources, data and computing as tools and sites for imagining, building, and defending “new” societies; enacting emancipatory cultures and politics… A sort of Alexandrian vision of “all the knowledge in the world”. This is still part of the idea that we have in web archiving. Back in the day the idea was that fostering this kind of knowledge would bring about internationality, world peace, modernity. When you look at old images you see artefacts – it is more than information, it is the materiality of artefacts. I am a contributor to Niels’ web archiving handbook, and he talks about history written of the web, and history written with the web. So there are attempts to write history with the web, but what about the tools themselves?

So, this idea about connections between bits of knowledge… This goes back before browsers. Many of you will be familiar with H.G. Wells’s World Brain; Suzanne Briet’s Qu’est-ce que la documentation? (1951) is a very influential work in this space; Jennifer Light wrote a wonderful book on Cold War Intellectuals, and their relationship to networked information… One of my lecturers was one of these in fact, thinking about networked cities… Vannevar Bush “As we may think” (1945) saw information as essential to order and society.

Another piece I often teach, J.C.R. Licklider and Robert W. Taylor (1968) in “the computer as a communication device” talked about computers communicating but not in the same ways that humans make meaning. In fact this graphic shows a man’s computer talking to an insurance salesman saying “he’s out” and the caption “your computer will know what is important to you and buffer you from the outside world”.

We then have this counterculture movement in California in the 1960s and 1970s… And that feeds into the emerging tech culture. We have The Well coming out of this. Stewart Brand wrote The Whole Earth Catalog (1968-78). And actually in 2012 someone wrote a new Whole Earth Catalog…

Ted Nelson, author of Computer Lib/Dream Machines (1974), is known as the person who came up with the concept of the link, between computers, to information… He’s an inventor essentially. Computer Lib/Dream Machines was a self-published title, a manifesto… The subtitle for Computer Lib was “you can and must understand computers NOW”. Counterculture was another element, and this is way before the web, where people were talking about networked information… These people were not thinking about preservation and archiving, but there was an assumption that information would be kept…

And then as we see information utilities and wired cities emerging, mainly around cable TV but also local public access TV… There was a lot of capacity for information communication… In the UK you had teletext, in Canada there was Telidon… And you were able to start thinking about information distribution wider and more diverse than central broadcasters… With services like LexisNexis emerging we had these ideas of information utilities… There was a lot of interest in the 1980s, and back in the 1970s too.

Harold Sackman and Norman Nie (eds.) The Information Utility and Social Choice (1970); H.G. Bradley, H.S. Dordick and B. Nanus, The Emerging Network Marketplace (1980); R.S. Block “A global information utility”, The Futurist (1984); W.H. Dutton, J.G. Blumler and K.L. Kraemer “Wired cities: shaping the future of communications” (1987).

This new medium looked more like point-to-point communication, like the telephone. But no-one was studying that. There were communications scholars looking at face to face communication, and at media, but not at this on the whole. 

Now, that’s some background, I want to periodise a bit here… And I realise that is a risk of course… 

So, we have the Pre-browser internet (early 1980s-1990s). Here the emphasis was on access – to information, expertise and content at centre of early versions of “information utilities”, “wired cities” etc. This was about everyone having access – coming from that counter culture place. More people needed more access, more bandwidth, more information. There were a lot of digital materials already out there… But they were fiddly to get at. 

Now, when the internet became privatised – moved away from military and universities – the old model of markets and selling information to mass markets, the transmission model, reemerged. But there was also this idea that because the internet was point-to-point – and any point could get to any other point… And that everyone would eventually be on the internet… The vision was of the internet as “inherently democratic”. Now we recognise the complexity of that right now, but that was the vision then.

Post-browser internet (early 1990s to mid-2000s) – was about web 1.0. Browsers and WWW were designed to search and retrieve documents, discrete kinds of files, to access online documents. I’ve said “Web 1.0” but had a good conversation with a colleague yesterday who isn’t convinced about these kinds of labels, but I find them useful shorthand for thinking about the web at particular points in time/use. In this era we had email still but other types of authoring tools arose… Encouraging a wave of “user generated content” – wikis, blogs, tagging, media production and publishing, social networking sites. This sounds such a dated term now but it did change who could produce and create media, and it was the term around LA around this time.

Then we began to see Web 2.0 with the rise of “smart phones” in the mid-2000s, merging mobile telephony and specialised web-based mobile applications, accelerating user content production and social media profiling. And the rise of social networking sounded a little weird to those of us with sociology training who were used to these terms from the real world, from social network analysis. But Facebook is a social network. Many of the tools, blogging for example, can be seen as having a kind of mass media quality – so instead of a movie studio making content… But I can have my blog which may have an audience of millions or maybe just, like, 12 people. But that is highly personal. Indeed one of the earliest so-called “killer apps” for the internet was email. Instead of shipping data around for processing – as the architecture originally got set up for – you could send a short note to your friend elsewhere… Email hasn’t changed much. That point-to-point communication suddenly and unexpectedly became more than half of the ARPANET. Many people were surprised by that. That pattern of interpersonal communication over networks continued to repeat itself – we see it with Facebook, Twitter, and even with Blogs etc. that have feedback/comments etc.

Web 2.0 is often talked about as social driven. But what is important from a sociology perspective, is the participation, and the participation of user generated communities. And actually that continues to be a challenge, it continues to be not the thing the architecture was for… 

In the last decade we’ve seen algorithmic media emerging, and the rise of “web 3.0”. Both access and participation appropriated as commodities to be monitored, captured, analyzed, monetised and sold back to individuals, reconceived as data subjects. Everything is thought about as data, data that can be stored, accessed… Access itself, the action people take to stay in touch with each other… We all carry around monitoring devices every day… At UCLA we are looking at the concept of the “data subjects”. Bruce ? used to talk about the “data footprint” or the “data cloud”. We are at a moment where we are increasingly aware of being data subjects. London is one of the most remarkable cities in the world in terms of surveillance… The UK in general, but London in particular… And that is ok culturally, I’m not sure it would be in the United States.

We did some work in UCLA to get students to mark up how many surveillance cameras there were, who controlled them, who had set them up, how many there were… Neither Campus police nor university knew. That was eye opening. Our students were horrified at this – but that’s an American cultural reaction. 

But if we conceive of our own connections to each other, to government, etc. as “data” we begin to think of ourselves, and everything, as “things”. Right now systems and governance maximising the market, institutional government surveillance; unrestricted access to user data; moves towards real-time flows rather than “stocks” of documents or content. Surveillance isn’t just about government – supermarkets are some of our most surveilled spaces. 

I currently have students working on a “name domain infrastructure” project. The idea is that data will be enclosed, that data is time-based, to replace the IP, the Internet Protocol. So that rather than packages, data is flowing all the time. So that it would be like opening the nearest tap to get water. One of the interests here is from the movie and television industry, particularly web streaming services who occupy significant percentages of bandwidth now… 

There are a lot of ways to talk about this, to conceive of this… 

1.0 tends to be about documents, press, publishing, texts, search, retrieval, circulation, access, reception, production-consumption: content.

2.0 is about conversations, relationships, peers, interaction, communities, play – as a cooperative and flow experience, mobility, social media (though I rebel against that somewhere): social networks. 

3.0 is about algorithms, “clouds” (as fluffy benevolent things, rather than real and problematic, with physical spaces, server farms), “internet of things”, aggregation, sensing, visualisation, visibility, personalisation, self as data subject, ecosystems, surveillance, interoperability, flows: big data, algorithmic media. Surveillance is kind of the environment we live in. 

Now I want to talk a little about traditions in communication studies.. 

In communication, broadly and historically speaking, there has been one school of thought that is broadly social scientific, from sociology and communications research, that thinks about how technologies are “used” for expression, interaction, as data sources or analytic tools. Looking at media in terms of their effects on what people know or do, can look at media as data sources, but usually it is about their use. 

There are theories of interaction, group process and influence; communities and networks; semantic, topical and content studies; law, policy and regulation of systems/political economy. One key question we might ask here: “what difference does the web make as a medium/milieu for communicative action, relations, interaction, organising, institutional formation and change?” Those from a science and technology background might know about the issues of shaping – we shape technology and technology shapes us.

Then there is the more cultural/critical/humanist or media studies approach. When I come to the UK people who do media studies still think of humanist studies as being different, “what people do”. However this approach of cultural/critical/etc. is about analyses of digital technologies and web; design, affordances, contexts, consequences – philosophical, historical, critical lens. How power is distributed is important in this tradition.

In terms of theoretical schools, we have the Toronto School/media ecology – the Marshall McLuhan take – which is very much about the media itself; American cultural studies, and the work of James Carey and his students; Birmingham school – the British take on media studies; and new materialism – that you see in Digital Humanities, German Media Studies, that says we have gone too far from the roles of the materials themselves. So, we might ask “What is the web itself (social and technical constituents) as both medium and product of culture, under what conditions, times and places?”

So, what are the implications for Web Archiving? Well I hope we can discuss this, thinking about a table of:

Web Phase | Soc sci/admin | Crit/Cultural

  • Documents: content + access
  • Conversation: Social nets + participation
  • Data/Algorithms: algorithmic media + data subjects

Comment: I was wondering about ArXiv and the move to sharing multiple versions, pre-prints, post prints…

Leah: That issue of changes in publication, what preprints mean for who is paid for what, that’s certainly changing things and an interesting question here…

Comment: If we think of the web moving from documents, towards fluid state, social networks… It becomes interesting… Where are the boundaries of web archiving? What is a web archiving object? Or is it not an object but an assemblage? Also ethics of this…

Leah: It is an interesting move from the concrete, the material… And then this whole cultural heritage question, what does it instantiate, what evidence is it, whose evidence is it? And do we participate in hardening those boundaries… Or do we keep them open… How porous are our boundaries…

Comment: What about the role of metadata?

Leah: Sure, arguably the metadata is the most important thing… What we say about it, what we define it as… And that issue of fluidity… We think of metadata as having some sort of fixity… One thing that has begun to emerge in surveillance contexts… Where law enforcement says “we aren’t looking at your content, just the metadata”, well it turns out that is highly personally identifiable, it’s the added value… What happens when that secondary data becomes the most important thing… In fact, where many of our data systems do not communicate with each other, those connections are through the metadata (only).

Comment: In terms of web archiving… As you go from documents, to conversations, to algorithms… Archiving becomes so much more complex. Particularly where interactions are involved… You can archive the data and the algorithm but you still can’t capture the interactions there…

Leah: Absolutely. As we move towards the algorithmic level it’s not a fixed thing. You can’t just capture the Google search algorithms, they change all the time. The more I look at this work through the lens of algorithms and data flows, there is no object in the classic sense…

Comment: Perhaps, like a movie, we need longer temporal snapshots…

Leah: Like the algorithmic equivalent of persistence of vision. Yes, I think that’s really interesting.

And with that the opening session is over, with organisers noting that those interested in surveillance may be interested to know that Room 101, said to have inspired the room of the same name in 1984, is where we are having coffee…

Session 1B (Chair: Marie Chouleur, National Library of France):

Jefferson Bailey (Deputy chair of IIPC, Director of Web Archiving, Internet Archive): Advancing access and interface for research use of web archives

I would like to thank all of the organisers again. I’ll be giving a broad rather than deep overview of what the Internet Archive is doing at the moment.

For those that don’t know, we are a non-profit Digital Library and Archive founded in 1996. We work in a former church and it’s awesome – we do open public lunches every Friday, so you are welcome to visit if you are ever in San Francisco. We have lots of open source technology and we are very technology-driven.

People always ask about stats… We are at 30 Petabytes plus multiple copies right now, including 560 billion URLs, 280 billion webpages. We archive about 1 billion URLs per week, and have partners and facilities around the world, including here in the UK where we have Wellcome Trust support.

So, searching… This is the Wayback Machine. Most of our traffic – 75% – is automatically directed to the new service. So, if you search for, say, UK Parliament, you’ll see the screenshots, the URLs, and some statistics on what is there and captured. So, how does it work, with that much data to do full text search? Even the raw text (not HTML) is 3-5 PB. So, we figured the most instructive and easiest to work with text is the anchor text of all in-bound links to a homepage. The index text covers 443 million homepages, drawn from 900 billion in-bound links from other cross-domain websites. Is that perfect? No, but it’s the best that works on this scale of data… And people tend to make keyword type searches which this works for.
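To make the anchor-text idea concrete, here is a minimal sketch of how such an index could work. It is not the Internet Archive's implementation: the function names, the sample links and the simple count-based scoring are all assumptions for illustration, and "cross-domain" is approximated by comparing hostnames.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_anchor_index(links):
    """Map each anchor-text token to the hosts it points at.

    links: iterable of (source_url, target_url, anchor_text) tuples.
    Links within the same host are skipped, so only in-bound links
    from other sites describe a homepage (an assumed simplification).
    """
    index = defaultdict(lambda: defaultdict(int))
    for source_url, target_url, anchor_text in links:
        source_host = urlparse(source_url).netloc
        target_host = urlparse(target_url).netloc
        if source_host == target_host:
            continue
        for token in tokenize(anchor_text):
            index[token][target_host] += 1
    return index


def search(index, query):
    """Rank hosts by how often the query tokens appear in their anchor text."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for host, count in index.get(token, {}).items():
            scores[host] += count
    return sorted(scores, key=lambda host: (-scores[host], host))


# Hypothetical sample data, not real crawl output.
sample_links = [
    ("http://news.example/a", "http://www.parliament.uk/", "UK Parliament"),
    ("http://blog.example/b", "http://www.parliament.uk/", "Parliament homepage"),
    ("http://www.parliament.uk/x", "http://www.parliament.uk/", "home"),
    ("http://news.example/c", "http://www.gov.uk/", "UK government"),
]
anchor_index = build_anchor_index(sample_links)
results = search(anchor_index, "uk parliament")
```

The point of the design is that the index only stores the short text other sites use to describe a page, which is far smaller than the page text itself and suits keyword-style queries.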

You can also now, in the new Wayback Machine, see a summary tab which includes a visualisation of data captured for that page, host, domain, MIME-type or MIME-type category. It’s really fun to play with. It’s really cool information to work with. That information is in the Wayback Machine (WBM) for 4.5 billion hosts; 256 million domains; 1238 TLDs. There are also special collections – we are building this for specific crawls/collections such as our .gov collection. And there is an API – so you can create your own visualisations if you like.

We have also created a full text search for AIT (Archive-It). This was part of a total rebuild of full text search in Elasticsearch. 6.5 billion documents with a 52 TB full text index. In total AIT is 23 billion documents and 1 PB. Searches are across all 8000+ collections. We have improved relevance ranking, metadata search, performance. And we have a Media Search coming – it’s still a test at present. So you can search non-textual content with a similar process.

So, how can we help people find things better… search, full text search… And APIs. The APIs power the details charts, capture counts, year, size, new, domain/hosts. Explore that more and see what you can do. We’ve also been looking at Data Transfer APIs to standardise transfer specifications for web data exchange between repositories for preservation. For research use you can submit “jobs” to create derivative datasets from WARCs from specific collections. And it allows programmatic access to AIT WARCs, submission of jobs, job status, derivative results list. More at: https://github.com/WASAPI-Community/data-transfer-apis.

In other API news we have been working with WAT files – a sort of metadata file derived from a WARC. This includes Headers and content (title, anchor/text, metas, links). We have API access to some capture content – a better way to get programmatic access to the content itself. So we have a test build on a 100 TB WARC set (EOT). It’s like CDX API with a build – replays WATs not WARCs (see: http://vinay-dev.us.archive.org:8080/eot2016/20170125090436/http://house.gov/). You can analyse, for example, term counts across the data.

In terms of analysing language we have a new CDX code to help identify languages. You can visualise this data, see the language of the texts, etc. A lot of our content right now is in English – we need less focus on English in the archive.

We are always interested in working with researchers on building archives, not just using them. So we are working on the News Measures Research Project. We are looking at 663 local news sites representing 100 communities. 7 crawls for a composite week (July-September 2016).

We are also working with a Katrina Blogs project. The research was done and the project was published, but we created a special collection of the sites used so that it can be accessed and explored.

And in fact we are generally looking at ways to create useful sub collections and ways to explore content. For instance Gif Cities is a way to search for gifs from Geocities. We have a Military Industrial Powerpoint Complex, turning PPT into PDFs and creating a special collection.

We did a new collection, with a dedicated portal (https://www.webharvest.gov) which archives US congress for NARA. We capture this every 2 years, and it has also raised questions of indexing YouTube videos.

We are also looking at historical ccTLD Wayback Machines. Built on IA global crawls and added historic web data with keyword and mime/format search, embed linkback, domain stats and special features. This gives a German view – from the .de domain – of the archive.

And we continue to provide data and datasets for people. We love Archives Unleashed – which ran earlier this week. We did an Obama Whitehouse data hackathon recently. We have a webinar on APIs coming very soon.

Q&A

Q1) What is anchor text?

A1) That’s when you create a link to a page – the text that is associated with that page.

Q2) If you are using anchor text in that keyword search… What happens when the anchor text is just a URL…

A2) We are tokenising all the URLs too. And yes, we are using a kind of PageRank type understanding of popular anchor text.

Q3) Is that TLD work.. Do you plan to offer that for all that ask for all top level domains?

A3) Yes! Because subsets are small enough that they allow search in a more manageable way… We basically build a new CDX for each of these…

Q4) What are issues you are facing with data protection challenges and archiving in the last few years… Concerns about storing data with privacy considerations.

A4) No problems for us. We operate as a library… The Wayback Machine is used in courts, but not by us – in US courts it’s recognised as a thing you can use in court.

Panel: Internet and Web Histories – Niels Brügger – Chair (NB); Marc Weber (MW); Steve Jones (SJ); Jane Winters (JW)

We are going to talk about the internet and the web, and also to talk about the new journal, Internet Histories, which I am editing. The new journal addresses what my colleagues and I saw as a gap. On the one hand there are journals like New Media and Society and Internet Studies which are great, but rarely focus on history. And media history journals are excellent but rarely look at web history. We felt there was a gap there… And Taylor & Francis Routledge agreed with us… The inaugural issue is a double issue 1-2, and people on our panel today are authors in our first issue, and we asked them to address six key questions from members of our international editorial board.

For this panel we will have an argument, counter statement, and questions from the floor type format.

A Common Language – Marc Weber

This journal has been a long time coming… I am Curatorial Director, Internet History Program, Computer History Museum. We have been going for a while now. This Internet History program was probably the first one of its kind in a museum.

When I first said I was looking at the history of the web in the mid ’90s, people were puzzled… Now most people have moved to incurious acceptance. Until recently there was also tepid interest from researchers. But in the last few years has reached critical mass – and this journal is a marker of this change.

We have this idea of a common language, the sharing of knowledge. For a long time my own perspective was mostly focused on the web; it was only when I started the Internet History program that I thought about the fuller sweep of cyberspace. We come in through one path or thread, and it can be (too) easy to only focus on that… Of the first major networks, the ARPAnet was there and has become the internet. Telenet was one of the most important commercial networks in the 1970s, but who here now remembers Anne Reid of Telenet? [no-one] And by contrast, what about Vint Cerf [some]. However, we need to understand what changed, what did not succeed in the long term, how things changed and shifted over time…

We are kind of in the Victorian era of the internet… We have 170 years of telephones, 60 years of going on line… longer of imagining a connected world. Our internet history goes back to the 1840s and the telegraph. And a useful thought here, “The past isn’t over. It isn’t even past” William Faulkner. Of this history only small portions are preserved properly. Some of the risks are not having a collective narrative… And not understanding particular aspects in proper context. There is also scope for new types of approaches and work, not just applying traditional approaches to the web.

There is a risk of a digital dark age – we have a film to illustrate this at the museum although I don’t think this crowd needs persuading of the importance of preserving the web.

So, going forward… We need to treat history and preservation as something to do quickly, we cannot go back and find materials later…

Response – Jane Winters

Marc makes, I think convincingly, the case for a common language, and for understanding the preceding and surrounding technologies, why they failed and their commercial, political and social contexts. And I agree with the importance of capturing that history, with oral history a key means to do this. Secondly the call to look beyond your own interest or discipline – interdisciplinary research is always challenging, but in the best sense, and can be hugely rewarding when done well.

Understanding the history of the internet and its context is important, although I think we see too many comparisons with early printing. Although some of those views are useful… I think there is real importance in getting to grips with these histories now, not in a decade or two. Key decisions will be made, from net neutrality to mass surveillance, and right now the understanding and analysis of the issues is not sophisticated – such as the incompatibility of “back doors” and secure internet use. And as researchers we risk focusing on the content, not the infrastructure. I think we need a new interdisciplinary research network, and we have all the right people gathered here…

Q&A

Q1) Marc, as you are from a museum… Have you any thoughts about how you present the archived web, the interface between the visitor and the content you preserve?

A1) What we do now with the current exhibits… the star isn’t the objects, it is the screen. We do archive some websites – we don’t try to replicate the Internet Archive but we do work with them on some projects, including the GeoCities exhibition. When you get to things that require emulation or live data, we want live and interactive versions that can be accessed online.

Q2) I’m a linguist and was intrigued by the interdisciplinary collaboration suggested… How do you see linguists and the language of the web fitting in…

A2) Actually there is a postdoc – Naomi – looking at how different language communities in the UK have engaged through looking at the UK Web Archive, seeing how language has shaped their experience and change in moving to a new country. We are definitely thinking about this and it’s a really interesting opportunity.

Out from the PLATO Cave: Uncovering the pre-Internet history of social computing – Steve Jones, University of Illinois at Chicago

I think you will have gathered that there is no one history of the internet. PLATO was a space for education and for my interest it also became a social space, and a platform for online gaming. These uses were spontaneous rather than centrally led. PLATO was an acronym for Programmed Logic for Automatic Teaching Operations (see diagram in Ted Nelson’s Dream Machine publication and https://en.wikipedia.org/wiki/PLATO_(computer_system)).
There were two key interests in developing for PLATO – one was multi-player games, and the other was communication. And the latter was due to laziness… Originally the PLATO lab was in a large room, and we couldn’t be bothered to walk to each other’s desks. So “Talk” was created – and that saved standard messages so you didn’t have to say the same thing twice!

As time went on, I undertook undergraduate biology studies and engaged in the Internet and saw that interaction as similar… At that time data storage was so expensive that storing content in perpetuity seemed absurd… If it was kept it’s because you hadn’t got to writing it yet. You would print out code – then rekey it – that was possible at the time given the number of lines per programme. So, in addition to the materials that were missing… There were boxes of Ledger-size green bar print outs from a particular PLATO Notes group of developers. Having found this in the archive I took pictures to OCR – that didn’t work! I got – brilliantly and terribly – funding to preserve that text. That content can now be viewed side by side in the archive – images next to re-keyed text.

Now, PLATO wasn’t designed for lay users, it was designed for professionals although also used by university and high school students who had the time to play with it. So you saw changes between developer and community values, seeing development of affordances in the context of the discourse of the developers – that archived set of discussions. The value of that work is to describe and engage with this history not just from our current day perspective, but to understand the context, the people and their discourse at the time.

Response – Marc

PLATO sort of is the perfect example of a system that didn’t survive into the mainstream… Those communities knew each other, the idea of the flatscreen – which led to the laptop – came from PLATO. PLATO had a distinct messaging system, separate from the ARPAnet route. It’s a great corpus to see how this was used – were there flames? What does one-to-many communication look like? It is a wonderful example of the importance of preserving these different threads.. And PLATO was one of the very first spaces not full of only technical people.

PLATO was designed for education, and that meant users were mainly students, and that shaped community and usage. There was a small experiment with community time sharing memory stores – with terminals in public places… But PLATO began in the late ’60s and ran through into the 80s, it is the poster child for preserving earlier systems. PLATO notes became Lotus Notes – that isn’t there now but in its own domain, PLATO was the progenitor of much of what we do with education online now, and that history is also very important.

Q&A

Q1) I’m so glad, Steve, that you are working on PLATO. I used to work in Medical Education in Texas and we had PLATO terminals to teach basic science first and second year medical education students and ER simulations. And my colleagues and I were taught computer instruction around PLATO. I am interested that you wanted to look at discourse around UIC around PLATO – so, what did you find? I only experienced PLATO at the consumer end of the spectrum, so I wondered what the producer end was like…

A1) There are a few papers on this – search for it – but two basic things stand out… (1) the degree to which as a mainframe system PLATO was limited as a system, and the conflict between the systems people and the gaming people. The gaming used a lot of the capacity, and although that taxed the system it did also mean they developed better code, showed what PLATO was capable of, and helped with the case for funding and support. So it wasn’t just shut PLATO down, it was a complex 2-way thing; (2) the other thing was around the emergence of community. Almost anyone could sit at a terminal and use the system. There were occasional flare ups and they mirrored community responses even later around flamewars, competition for attention, community norms… Hopefully others will mine that archive too and find some more things.

Digital Humanities – Jane Winters

I’m delighted to have an article in the journal, but I won’t be presenting on this. Instead I want to talk about digital humanities and web archives. There is a great deal of content in web archives but we still see little research engagement in web archives, there are numerous reasons including the continuing work on digitised traditional texts, and slow movement to develop new ways to research. But it is hard to engage with the history of the 21st century without engaging with the web.

The mismatch of the value of web archives and the use and research around the archive was part of what led us to set up a project here in 2014 to equip researchers to use web archives, and encourage others to do the same. For many humanities researchers it will take a long time to move to born-digital resources. And to engage with material that subtly differs for different audiences. There are real challenges to using this data – web archives are big data. As humanities scholars we are focused on the small, the detailed, we can want to filter down… But there is room for a macro historical view too. What Tim Hitchcock calls the “beautiful chaos?” of the web.

Exploring the wider context one can see change on many levels – from the individual person or business, to wide spread social and political change. How the web changes the language used between users and consumers. You can also track networks, the development of ideas… It is challenging but also offers huge opportunities. Web archives can include newspapers, media, and direct conversation – through social media. There is also visual content, gifs… The increase in use of YouTube and Instagram. Much of this sits outside the scope of web archives, but a lot still does make it in. And these media and archiving challenges will only become more challenging as we see more data… The larger and more uncontrolled the data, the harder the analysis. Keyword searches are challenging at scale. The selection of the archive is not easily understood but is important.

The absence of metadata is another challenge too. The absence of metadata or alternative text can render images, particularly, invisible. And the mix of formats and types of personal and the public is most difficult but also most important. For instance the announcement of a government policy, the discussion around it, a petition perhaps, a debate in parliament… These are not easy to locate… Our history is almost inherently online… But it only gains any real permanence through preservation in web archives, and that’s why humanists and historians really need to engage with them.

Response – Steve

I particularly want to talk about archiving in scholarship. In order to fit archiving into scholarly models… administrators increasingly make the case for scholarship in the context of employment and value. But archive work is important. Scholars are discouraged from this sort of work because it is not quick, it’s harder to be published… Separately you need organisations to engage in preservation of their online presences. The degree to which archive work is needed is not reflected by promotion committees, organisational support, local archiving processes. There are immense rhetorical challenges here, to persuade others of the value of this work. There had been successful cases made to encourage telephone providers to capture and share historical information. I was at a telephone museum recently and asked about the archive… She handed me a huge book on the founding of Southwestern Bell, published in a very small run… She gave me a copy but no-one had asked about this before… That’s wrong though, it should be captured. So we can do some preservation work ourselves just by asking!

Q&A

Q1) Jane, you mentioned a skills gap for humanities researchers. What sort of skills do they need?

A1) I think the complete lack of quantitative data training, how to sample, how to make meaning from quantitative data. They have never been engaged in statistical training. They have never been required to do it – you specialise so early here. Also, basic command line stuff… People don’t understand that or why they have to engage that way. Those are two simple starting points. Those help them understand what they are looking at, what an ngram means, etc.

Session 2B (Chair: Tom Storrar)

Philip Webster, Claire Newing, Paul Clough & Gianluca Demartini: A temporal exploration of the composition of the UK Government Web Archive

I’m afraid I’ve come into this session a little late. I have come in at the point that Philip and Claire are talking about the composition of the archive – mostly 2008 onwards – and looking at status codes of UK Government Web Archive. 

Philip: The hypothesis for looking at http status codes was to see if changes in government raised trends in the http status code. Actually, when we looked at post-2008 data we didn’t see what we expected there. However we did find that there was an increase in not finding what was requested – and thought this may be about moving to dynamic pages – but this is not a strong trend.

In terms of MIME types – media types – which are restricted to:

Application – flash, java, Microsoft Office Documents. Here we saw trends away from PDF as the dominant format. Microsoft word increases, and we see the increased use of Atom – syndication – coming across.

Executable – we see quite a lot of javascript. The importance of flash decreased over time – which we expected – and the use of javascript increased (javascript and javascript x).

Document – PDF remains prevalent. Also MS Word, some MS Excel. Open formats haven’t really taken hold…

Claire: The Government Digital Strategy included guidance to use open document formats as much as possible, but that wasn’t mandated until late 2014 – a bit too late for our data set unfortunately. But the Government Digital Strategy in 2011 was, itself, published in Word and PDF!

Philip: If we take document type outside of PDFs you see that lack of open formats more clearly..

Image – This includes images appearing in documents, plus icons. And occasionally you see non-standard media types associated with the MIME-types. Jpegs are fairly consistent over time. Gif and Png are comparable… Gif was being phased out for IP reasons, with Png to replace it, and you see that change over time…

Text – Text is almost all HTML. You see a lot of plain text, stylesheets, XML…

Video – we saw compressed video formats… but gradually superseded by embedded YouTube links. However we do still see a lot of flash video retained. And we see a large, increasing amount of MP4, used by Apple devices.

Another thing that is available over time is relative file sizes. However the CDX index only contains compressed size data and therefore is not a true representation of file size trends. So you can’t compare images to their pre-archiving version. That means for this work we’ve limited the data set to those where you can tell the before and after status of the image files. We saw some spikes in compressed image formats over time, not clear if this shows departmental issues…

To finish on a high note… There is an increase in the use of https rather than http. I thought it might be the result of a campaign, but it seems to be a general trend..

The conclusion… Yes, it is possible to do temporal analysis of CDX index data but you have to be careful, looking at proportion rather than raw frequency. SQL is feasible, commonly available and low cost. Archive data has particular weaknesses – data cannot be assumed to be fully representative, but in some cases trends can be identified.
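As an illustration of the "proportion rather than raw frequency" point, here is a small sketch using SQL via Python's built-in `sqlite3` module. The table layout, column names and sample rows are hypothetical and stand in for fields that would first be extracted from a CDX index; this is not the presenters' actual query.

```python
import sqlite3

# Hypothetical (capture year, MIME type) pairs standing in for CDX records.
sample_records = (
    [(2008, "application/pdf")] * 2
    + [(2008, "text/html")] * 2
    + [(2012, "application/pdf")] * 2
    + [(2012, "text/html")] * 6
)


def mime_share_by_year(records, mime):
    """Return {year: (raw_count, share_of_that_year)} for one MIME type."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cdx (year INTEGER, mime TEXT)")
    conn.executemany("INSERT INTO cdx VALUES (?, ?)", records)
    query = """
        SELECT year,
               SUM(CASE WHEN mime = ? THEN 1 ELSE 0 END) AS hits,
               COUNT(*) AS total
        FROM cdx
        GROUP BY year
        ORDER BY year
    """
    # Divide by the yearly total so that a larger crawl does not look like a trend.
    result = {
        year: (hits, hits / total)
        for year, hits, total in conn.execute(query, (mime,))
    }
    conn.close()
    return result


pdf_shares = mime_share_by_year(sample_records, "application/pdf")
```

In this made-up sample the raw PDF count is the same in both years, but the share halves because the later crawl captured more pages overall, which is exactly the distortion that normalising by year guards against.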

Q&A

Q1) Very interesting, thank you. Can I understand… You are studying the whole archive? How do you take account of having more than one copy of the same data over time?

A1) There is a risk of one website being overrepresented in the archive. There are checks that can be done… But that is more computationally expensive…

Q2) With the seed list, is that generating the 404 rather than actual broken links?

A2 – Claire) We crawl by asking the crawler to go out to find links and seed from that. It generally looks within the domain we’ve asked it to capture…

Q3) At various points you talked about peaks and trends… Have you thought about highlighting that to folks who use your archive so they understand the data?

A3 – Claire) We are looking at how we can do that more. I have read about historians’ interest in the origins of the collection, and we are thinking about this, but we haven’t done that yet.

Caroline Nyvang, Thomas Hvid Kromann & Eld Zierau: Capturing the web at large – a critique of current web citation practices

Caroline: We are all here as we recognise the importance and relevance of internet research. Our paper looks at web referencing and citation within the sciences. We propose a new format to replace the URL+date format usually recommended. We will talk about a study of web references in 35 Danish master’s theses from the University of Copenhagen, then further work on monograph referencing, then a new citation format.

The work on 35 master’s theses submitted to the University of Copenhagen included, as a set, 899 web references. There was an average of 26.4 web references – some had none, the max was 80. This gave us some insight into how students cite URLs. Of those students citing websites: 21% gave the date for all links; 58% had dates for some but not all sites; 22% had no dates. Some of those URLs pointed to homepages or search results.

We looked at web rot and web references – almost 16% could not be accessed by the reader, checked or reproduced. An error rate of 16% isn’t that remarkable – in 1992 a study of 10 journals found that a third of references were inaccurate enough to make it hard to find the source again. But web resources are dynamic and issues will vary, and likely increase over time.

The amount of web references does not seem to correlate with particular subjects. Students are also quite imprecise when they reference websites. And even when the correct format was used 15.5% of all the links would still have been dead.

Thomas: We looked at 10 Danish academic monographs published from 2010-2016. Although this is a small number of titles, it allowed us to see some key trends in the citation of web content. There was a wide range in the number of web citations used – 25% at the top, 0% at the bottom of these titles. The location of web references in these texts is not uniform. On the whole scholars rely on printed scholarly work… But web references are still important. This isn’t a systematic review of these texts… In theory these links should all work.

We wanted to see the status after five years… We used a traffic light system. 34.3% were red – broken, dead, a different page; 20?% were amber – critical links that either refer to changed or at risk material; 44.7% were green – working as expected.

This work showed that web references turn into dead links within a limited number of years. In our work the URLs that go to the front page, with instructions of where to look, actually, ironically, lasted best. Long complex URLs were most at risk… So, what can we do about this…

Eld: We felt that we had to do something here, to address what is needed. We can see from the studies that today’s practice of URL plus date stamp does not work. We need a new standard, a way to reference something stable. The web is a marketplace and changes all the time. We need to look at the web archives… And we need precision and persistency. We felt there were four necessary elements, and we call it the PWID – Persistent Web IDentifier. The four elements are:

  • Archived URL
  • Time of archiving
  • Web archive – precision and indication that you verified this is what you expect. Also persistency. Researcher has to understand that – is it a small or large archive, what is contextual legislation.
  • Content coverage specification – is it part only? Is it the html? Is it the page including images as it appears in your browser? Is it a page? Is it the site including referred pages within the domain?

So we propose a form of reference which can be textually expressed as:

web archive: archive.org, archiving time: 2016-04-20 18:21:47 UTC, archived URL: http://resaw.en/, content coverage: webpage

But, why not use web archive URL? Of the form:

https://web.archive.org/web/20160420182147/http://resaw.en/

Well, this can be hard to read, there is a lot of technology embedded in the URL. It is not as accessible.

So, a PWID URI:

pwid:archive.org:2016-04-20_18.21.47Z:page:http://resaw.en/

This is now in as an ISO 690 suggestion and proposed as a URI type.
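
(An aside, not from the talk: a minimal Python sketch of how the PWID string above could be split into its four elements and, for archive.org only, turned back into a web archive URL. The names `Pwid`, `parse_pwid` and `to_wayback_url` are my own, and the timestamp layout is assumed from the single example shown, so treat this as an illustration rather than the proposed standard.)

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Pwid(NamedTuple):
    """The four PWID elements from the talk (field names are hypothetical)."""
    archive: str           # web archive, e.g. archive.org
    archived_at: datetime  # time of archiving, in UTC
    coverage: str          # content coverage, e.g. page
    url: str               # archived URL


def parse_pwid(pwid: str) -> Pwid:
    """Split 'pwid:archive:time:coverage:url' into its parts."""
    # maxsplit=4 keeps the colons inside the archived URL intact.
    scheme, archive, stamp, coverage, url = pwid.split(":", 4)
    if scheme != "pwid":
        raise ValueError(f"not a PWID: {pwid!r}")
    # Assumed timestamp layout, copied from the example: 2016-04-20_18.21.47Z
    archived_at = datetime.strptime(stamp, "%Y-%m-%d_%H.%M.%SZ").replace(
        tzinfo=timezone.utc
    )
    return Pwid(archive, archived_at, coverage, url)


def to_wayback_url(p: Pwid) -> str:
    """Build an Internet Archive URL; other archives need their own pattern."""
    if p.archive != "archive.org":
        raise ValueError("no URL pattern known for this archive")
    return f"https://web.archive.org/web/{p.archived_at:%Y%m%d%H%M%S}/{p.url}"


example = parse_pwid("pwid:archive.org:2016-04-20_18.21.47Z:page:http://resaw.en/")
print(example.archive, example.coverage, example.url)
print(to_wayback_url(example))
```

The point of the sketch is that the PWID keeps archive, time, coverage and URL as separate, readable fields, whereas the archive URL fuses them into one technology-specific string.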

To sum up, all research fields need to refer to the web. Good scientific practice cannot take place with current approaches.

Q&A

Q1) I really enjoyed your presentation… I was wondering what citation format you recommend for content behind paywalls, and for dynamic content – things that are not in the archive.

A1 – Eld) We have proposed this for content in the web archive only. You have to put it into an archive to be sure, then you refer to it. But we haven’t tried to address those issues of paywall and dynamic content. BUT the URI suggestion could refer to closed archives too, not just open archives.

A1 – Caroline) We also wanted to note that this approach is to make web citations align with traditional academic publication citations.

Q2) I think perhaps what you present here is an idealised way to present archiving resources, but what about the marketing and communications challenge here – to better cite websites, and to use this convention when they aren’t even using best practice for web resources.

A2 – Eld) You are talking about marketing to get people to use this, yes? We are starting with the ISO standard… That’s one aspect. I hope also that this event is something that can help promote this and help to support it. We hope to work with different people, like you, to make sure it is used. We have had contact with Zotero for instance. But we are a library… We only have the resources that we have.

Q3) With some archives of the web there can be a challenge for students, for them to actually look at the archive and check what is there..

A3) Firstly citing correctly is key. There are a lot of open archives at the moment… But we hope the next step will be more about closed archives, and ways to engage with these more easily, to find common ground, to ensure we are citing correctly in the first place.

Comment – Nicola Bingham, BL) I like the idea of incentivising not just researchers but also publishers to incentivise web archiving, another point of pressure to web archives… And making the case for openly accessible articles.

Q4) Have you come across Martin Klein and Herbert Van de Sompel’s work on robust links, and Memento?

A4 – Eld) Memento is excellent to find things, but usually you do not have the archive in there… I don’t think the way of referencing without the archive is a precise reference…

Q5) When you compare to web archive URL, it was the content coverage that seems different – why not offer as an incremental update.

A5) As far as I know the existing approach is to use a # in the URL, and that doesn’t offer that specificity…

Comment) I would suggest you could define the standard for after that # in the URLs to include the content coverage – I’ll take that offline.

Q6) Is there a proposal there… For persistence across organisations, not just one archive.

A6) I think from my perspective there should be a registry when archives change/move to find the new registry. Our persistent identifier isn’t persistent if you can change something. And I think archives must be large organisations, with formal custodians, to ensure it is persistent.

Comment) I would like to talk offline about content addressing and Linked Data to directly address and connect to copies.

Andrew Jackson: The web archive and the catalogue

I wanted to talk about some bad experiences I had recently… There is a recent BL video of the journey of a (print) collection item… From posting to processing, cataloguing, etc… I have worked at the library for over 10 years, but this year for the first time I had to get to grips with the library catalogue… I’ll talk more about that tomorrow (in the technical strand) but we needed to update our catalogue… Accommodating the different ways the catalogue and the archive see content.

Now, that video, the formation of teams, the structure of the organisations, the physical structure of our building is all about that print process, and that catalogue… So it was a surprise for me – maybe not you – that the catalogue isn’t just bibliographic data, it’s also a workflow management tool…

There is a chain of events here… Sometimes events are in a line, sometimes in circles… Always forwards…

Now, last year legal deposit came in for online items… The original digital processing workflow went from acquisition to ingest to cataloguing… But most of the content was already in the archive… We wanted to remove duplication, and make the process more efficient… So we wanted to automate this as a harvesting process.

For our digital work previously we also had a workflow, from nomination, to authorisation, etc… With legal deposit we have to get it all, all the time, all the stuff… So, we don’t collect news items, we want all news sites every day… We might specify crawl targets, but more likely that we’ll see what we’ve had before and draw them in… But this is a dynamic process….

So, our document harvester looks for “watched targets”, harvests, extracts documents for web archiving… and also ingest. There are relationships to acquisition, that feeds into cataloguing and the catalogue. But that is an odd mix of material and metadata. So that’s a process… But webpages change… For print matter things change rarely, it is highly unusual. For the web changes are regular… So how do we bring these things together…

To borrow an analogy from our Georeferencing project… Users engage with an editor to help us understand old maps. So, imagine a modern web is a web archive… Then you need information, DOIs, places and entities – perhaps a map. This kind of process allows us to understand the transition from print to online. So we think about this as layers of transformation… Where we can annotate the web archive… Or the main catalogue… That can be replaced each time this is needed. And the web content can, with this approach, be reconstructed with some certainty, later in time…

Also this approach allows us to use rich human curation to better understand that which is being automatically catalogued and organised.

So, in summary: the catalogue tends to focus on chains of operation and backlogs, item by item. The web archive tends to focus on transformation (and re-transformation) of data. A layered data model can bring them together. It means revisiting the data (but fixity checking requires this anyway). It’s costly in terms of disk space required. And it allows rapid exploration and experimentation.

Q1) To what extent is the drive for this your users, versus your colleagues?

A1) The business reason is that it will save us money… Taking away manual work. But, as a side effect we’ve been working with cataloguing colleagues in this area… And their expectations are being raised and changed by this project. I do now much better understand the catalogue. The catalogue tends to focus on tradition not output… So this project has been interesting from this perspective.

Q2) Are you planning to publish that layer model – I think it could be useful elsewhere?

A2) I hope to yes.

Q3) And could this be used in Higher Education research data management?

A3) I have noticed that with research data sets there are some tensions… Some communities use change management, functional programming etc… Hadoop, which we use, requires replacement of data… So yes, but this requires some transformation to do.

We’d like to use the same base data infrastructure for research… Otherwise it’s hard to maintain this pattern of work.

Q4) Your model… suggests WARC files and such archive documents might become part of new views and routes in for discovery.

A4) That’s the idea, for discovery to be decoupled from where you store the file.

Nicola Bingham, UK Web Archive: Resource not in archive: understanding the behaviour, borders and gaps of web archive collections

I will describe the shape and the scope of the UK Web Archive, to give some context for you to explore it… By way of introduction.. We have been archiving the UK Web since 2013, under UK non-print legal deposit. But we’ve also had the Open Archive (since 2004); Legal Deposit Archive (since 2013); and the Jisc Historical Archive (1996-2013).

The UK Web Archive includes around 400 TB of compressed data. And in the region of 11-12 billion records. We grow, on average, 60-70 TB per year and 3 B records per year. We want to be comprehensive but, that said, we can’t collect everything and we don’t want to collect everything… Firstly we collect UK websites only. We carry out web archiving under the 2013 regulations, and they state that we may collect only UK published web content – meaning content on a UK web domain, or by a person whose work occurs in the UK. So, we can automate harvesting from UK TLDs (.uk, .scot, .cymru etc); UK hosting – geo-IP lookup to locate the server. Then manual checks. So Facebook, WordPress, Twitter cannot be automated…
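
(An aside, not from the talk: the first, automatable part of that scoping rule can be sketched in a few lines of Python. The TLD list holds only the three examples named above, and the function name `in_automatic_scope` is mine; the geo-IP lookup and the manual checks are deliberately left out, so this is a sketch of the idea and not the UK Web Archive’s actual crawler configuration.)

```python
from urllib.parse import urlsplit

# Only the TLDs named in the talk; the real list is longer ("etc").
UK_TLDS = (".uk", ".scot", ".cymru")


def in_automatic_scope(url: str) -> bool:
    """True if the host sits under one of the UK TLDs listed above."""
    host = (urlsplit(url).hostname or "").lower()
    # str.endswith accepts a tuple of suffixes.
    return host.endswith(UK_TLDS)


for candidate in (
    "http://www.bl.uk/",
    "https://example.cymru/newyddion",
    "https://www.facebook.com/somepage",
):
    print(candidate, in_automatic_scope(candidate))
```

A page on facebook.com fails this test even when its author is UK-based, which is why those sites fall through to the manual checks.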

We only collect published content. Out of scope here are:

  • Film and recorded sound where AV content predominates, e.g. YouTube
  • Private intranets and emails.
  • Social networking sites only available to restricted groups – if you need a login, special permissions they are out of scope.

Web archiving is expensive. We have to provide good value for money… We crawl the UK domain on an annual basis (only). Some sites are more frequent but annual misses a lot. We cap domains at 512 MB – which captures many sites in their entirety, but there are others that we only capture part of (unless we override automatic settings).

There are technical limitations too, around:

  • Database driven sites – crawlers struggle with these
  • Programming scripts
  • Plug-ins
  • Proprietary file formats
  • Blockers – robots.txt or access denied.

So there are misrepresentations… For instance the One Hundred Women blog captures the content but not the stylesheet – that’s a fairly common limitation.

We also have curatorial input to locate the “important stuff”. In the British Library web archiving is not performed universally by all curators, we rely on those who do engage, usually voluntarily. We try to onboard as many curators and specialist professionals as possible to widen coverage.

So, I’ve talked about gaps and boundaries, but I also want to talk about how the users of the archive find this information, so that even where there are gaps, it’s a little more transparent…

We have the Collection Scoping Document, this captures scope, motivation, parameters and timeframe of collection. This document could, in a pared-down form, be made available to end users of the archive.

We have run user testing of our current UK Web Archive website, and our new version. And even more general audiences really wanted as much contextual information as possible. That was particularly important on our current website – where we only shared permission-cleared items. But this is one way in which contextual information can be shown in the interface with the collection.

The metadata can be browsed and searched, though users will be directed to come in to view the content.

So, an example of a collection would be 1000 Londoners, showing the context of the work.

We also gather information during the crawling process… We capture information on crawler configuration, seed list, exclusions… I understand this could be used and displayed to users to give statistics on the collection…

So, what do we know about what the researchers want to know? They want as much documentation as they possibly can. We have engaged with the research community to understand how best to present data to the community. And indeed that’s where your feedback and insight is important. Please do get in touch.

Q&A

Q1) You said you only collect “published” content… How do you define that?

A1) With legal deposit regulations… The legal deposit libraries may collect content openly available on the web… And also content that is paywalled or behind login credentials. UK publishers are obliged to provide credentials for crawling. BUT how we make that accessible… Is a different matter – we wouldn’t republish that on the open web without logins/credentials.

Q2) Do you have any ideas about packaging this type of information for users and researchers – more than crawler config files?

A2) The short answer is no… We’d like to invite researchers to access the collection in both a close reading sense, and a big data sense… But I don’t have that many details about that at the moment.

Q3) A practical question: if you know you have to collect something… If you have a web copy of a government publication, say, and the option of the original, older, (digital) document… Is the web archive copy enough, do you have the metadata to use that the right way?

A3) Yes, so on the official publications… This is where the document harvester tool comes into play, adding another layer of metadata to pass the document through various access elements appropriately. We are still dealing with this issue though.

Chris Wemyss – Tracing the Virtual community of Hong Kong Britons through the archived web

I’ve joined this a wee bit late after a fun adventure on the Senate House stairs… 

Looking at the Gwulo: Old Hong Kong site.. User content is central to this site which is centred on a collection of old photographs, buildings, people, landscapes… The website starts to add features to explore categorisations of images.. And the site is led by an older British resident. He described subscribers as being expats who have moved away, for whom this is an old version of Hong Kong that no longer exists – one user described it as an interactive photo album… There is clearly more to be done on this phenomenon of building these collective resources to construct this type of place. The founder comments on Facebook groups – they are about the now, “you don’t build anything, you just have interesting conversations”.

A third example then, Swire Mariners Association. This site has been running, nearly unchanged, for 17 years, but they have a very active forum, a very active Facebook group. These are all former dockyard workers, they meet every year, it is a close knit community but that isn’t totally represented on the web – they care about the community that has been constructed, not the website for others.

So, in conclusion archives are useful in some cases. Using oral history and web archives together is powerful, however, where it is possible to speak to website founders or members, to understand how and why things have changed over time. Seeing that change over time already gives some idea of the futures people want to see. And these sites indicate the demand for communities, active societies, long after they are formed. And illustrates how people utilise the web for community memory…

Q&A

Q1) You’ve raised a problem I hadn’t really thought about. How can you tell if they are more active on Facebook or the website… How do you approach that?

A1) I have used web archiving as one source to arrange other things around… Looking for new websites, finding and joining the Facebook group, finding interviewees to ask about that. But I wouldn’t have been prompted to ask about the website and its change/lack of change without consulting the web archives.

Q2) Were participants aware that their pages were in the archive?

A2) No, not at all. The blog I showed first was started by two guys, Gwulo is run by one guy… And he quite liked the idea that this site would live on in the future.

David Geiringer & James Baker: The home computer and networked technology: encounters in the Mass Observation Project archive, 1991-2004

I have been doing work on various web communities, including some work on GeoCities which is coming out soon… And I heard about the Mass Observation project which, from 1991 – 2004, asked respondents about computers and how they were using them in their lives… The archives capture comments like:

“I confess that sometimes I resort to using the computer using the cut and paste technique to write several letters at once”

Confess is a strong word there.. Over this period of observation we saw production of text moving to computers, computers moving into most homes, the rebuilding of modernity. We welcome comment on this project, and hope to publish soon where you can find out more on our method and approach.

So, each year since 1981 the Mass Observation project has issued directives to respondents to respond to key issues like e.g. football, or the AIDS crisis. They issued the technology directive in 1991. From that year we see several fans of the word processor – words like love, dream… Responses to the 1991 directive are overwhelmingly positive… Something that was not the case for other technologies on the whole…

“There is a spell check on this machine. Also my mind works faster than my hand and I miss out letters. This machine picks up all my faults and corrects them. Thank you computer.”

After this positive response though we start to see etiquette issues, concerns about privacy… Writing some correspondence by hand. Some use simulated hand writing… And start to have concerns about adapting letters, whether that is cheating or not… Ethical considerations appearing.. It is apparent that sometimes guilt around typing text is also slightly humorous… Some playful mischief there…

Altering the context of the issue of copy and paste… the time and effort to write a unique manuscript is at concern… Interestingly the directive asked about printing and filing emails… And one respondent notes that actually it wasn’t financial or business records, but emails from their ex…

Another comments that they wish they had printed more emails during their pregnancy, a way of situating yourself in time and remembering the experience…

I’m going to skip ahead to how computers fitted into their home… People talk about dining rooms, and offices, and living rooms.. Lots of very specific discussions about where computers are placed and why they are placed there… One person comments:

“Usually at the dining room at home which doubles as our office and our coffee room”

Others talk about quieter spaces… The positioning of a computer seems to create some competition for use of space. The home changing to make room for the computer or the network… We also start to see (in 2004) comments about home life and work life, the setting up of a hotmail account as a subtle act of resistance, the reassertion of the home space.

A Mass Observation Directive in 1996 asked about email and the internet:

“Internet – we have this at work and it’s mildly useful. I wouldn’t have it at home because it costs a lot to be quite sad and sit alone at home” (1996)

So, observers from 1991-2004 talked about efficiencies of the computer and internet, copy, paste, ease… But this then reflected concerns about the act of creating texts, of engaging with others, computers as changing homes and spaces. Now, there are really specific findings around location, gender, class, age, sexuality… The overwhelming majority of respondents are white middle class cis-gendered straight women over 50. But we do see that change of response to technology, a moment in time, from positive to concerned. That runs parallel to the rise of the World Wide Web… We think our work does provide context to web archive work and web research, with textual production influenced by these wider factors.

Q&A

Q1) I hadn’t realised mass observation picked up again in 1980. My understanding was that previously it was the observed, not the observers. Here people report on their own situations?

A1) They self report on themselves. At one point they are asked to draw their living room as well…

Q1) I was wondering about business machinery in the home – type writers for instance

A1) I don’t know enough about the wider archive. All of this newer material was done consistently… The older mass observation material was less consistent – people recorded on the street, or notes made in pubs. What is interesting is that in the newer responses you see a difference in the writing of the response… As they move from hand written to type writers to computer…

Q2) Partly you were talking about how people write and use computers. And a bit about how people archive themselves… But the only work I could find on how people archive themselves digitally was by Microsoft Research… Is there anything since then… In that paper though you could almost read regret between the lines… the loss of photo albums, letters, etc…

A2) My colleague David Geiringer, with whom I co-wrote the paper, was initially looking at self-archiving. There was very very little. But printing stuff comes up… And the tensions there. There is enough there, people talking about worries and loss… There is lots in there… The great thing with Mass Obvs is that you can have a question but then you have to dig around a lot to find things…

Ian Milligan, University of Waterloo and Matthew Weber, Rutgers University – Archives Unleashed 4.0: presentation of projects (#hackarchives)

Ian: I’m here to talk about what happened on the first two days of Web Archiving Week. And I’d like to thank our hosts, supporters, and partners for this exciting event. We’ll do some lightning talks on the work undertaken… But why are historians organising data hackathons? Well, because we face problems in our popular cultural history. Problems like GeoCities… Kids write about Winnie the Pooh, people write about the love of Buffy the Vampire Slayer, their love of cigars… We face a problem of huge scale… 7 million users of the web now online… It’s the scale that boggles the mind, and compare it to the Old Bailey – one of very few sources on ordinary people. They leave birth, death, marriage or criminal justice records… 197,745 trials over 239 years, between 1674 and 1913, is the biggest collection of texts about ordinary people… But from 7 years of GeoCities we have 413 million web documents.

So, we have a problem, and myself, Matt and Olga from the British Library came together to build community, to establish a common vision of web archiving documents, to find new ways of addressing some of these issues.

Matt: I’m going to quickly show you some of what we did over the last few days… and the amazing projects created. I’ve always joked that Archives Unleashed is letting folk run amok to see what they can do… We started around 2 years ago, in Toronto, then Library of Congress, then at Internet Archive in San Francisco, and we stepped it up a little for London! We had the most teams, we had people from as far as New Zealand.

We started with some socialising in a pub on Friday evening, so that when we gathered on Monday we’d already done some introductions. Then a formal overview and quickly forming teams to work and develop ideas… And continuing through day one and day two… We ended up with 8 complete projects:

  • Robots in the Archives
  • US Elections 2008 and 2010 – text and keyword analysis
  • Study of Gender Distribution in Olympic communities
  • Link Ranking Group
  • Intersection Analysis
  • Public Inquiries Implications (Shipman)
  • Image Search in the Portuguese Web Archive
  • Rhizome Web Archive Discovery Archive

We will hear from the top three from our informal voting…

Intersection Analysis – Jess

We wanted to understand how we could find a cookbook methodology for understanding the intersections between different data sets. So, we looked at the Occupy Movement (2011/12) with a Web Archive, a Rutgers archive and a social media archive from one of our researchers.

We normalised CDX, crunched WAT files for outlinks and extracted links from tweets. We generated counts and descriptive data, union/intersection between every data set. We had over 74 million datasets, but only 0.3% overlap between the collections… If you go to our website we have a visualisation of overlaps, tree maps of the collections…
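
(An aside, not from the talk: the union/intersection step can be sketched with plain Python sets once each collection has been reduced to normalised URLs. The collection names and URLs below are made up for illustration, and measuring overlap as intersection over union is my assumption; the team did not say which denominator sits behind their 0.3% figure.)

```python
from itertools import combinations


def overlap_report(collections: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Share of URLs two collections have in common (intersection / union), per pair."""
    report = {}
    for (name_a, urls_a), (name_b, urls_b) in combinations(collections.items(), 2):
        union = urls_a | urls_b
        shared = urls_a & urls_b
        report[(name_a, name_b)] = len(shared) / len(union) if union else 0.0
    return report


# Hypothetical, already-normalised URLs standing in for the real collections.
collections = {
    "web_archive": {"a.example.org/", "b.example.org/", "c.example.org/"},
    "social_media": {"c.example.org/", "d.example.org/"},
}

for pair, share in overlap_report(collections).items():
    print(pair, f"{share:.0%}")
```

With real data the expensive part is the normalisation that comes first, since two collections only intersect where their URLs have been reduced to the same form.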

We wanted to use the WAT files to explore Outlinks in the data sets, what they were linking to, how much of it was archived (not a lot).

Parting thoughts? Overlap is inversely proportional to the diversity of URIs – in other words, the more collectors, the better. Diversifying seed lists with social media is good.

Robots in the Archive 

We focused on robots.txt. And our question was “what do we miss when we respect robots.txt?”. At National Library of Denmark we respect this… At Internet Archive they’ve started to ignore that in some contexts. So, what did we do? We extracted robots.txt from the WARC collection. Then applied it retroactively. Then we wanted to compare to the link graph.

Our data was from The National Archives and from the 2010 election. We started by looking at user-agent blocks. Four had specifically blocked the Internet Archive, but some robot names were very old and out of date.. And we looked at crawl delay… Looking specifically at the sub collection of the Department of Energy and Climate Change… We would have missed only 24 links that would have been blocked…

So, the impact of robots.txt is minimal for this collection. Our method can be applied to other collections and extended to further the discussion on ignoring robots.txt. And our code is on GitHub.
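
(An aside, not the team’s code, which is on their GitHub: a minimal sketch of “applying robots.txt retroactively” using Python’s standard `urllib.robotparser`. The robots.txt text, the URLs and the function name `blocked_urls` are made up for illustration; in the real project both would come out of the WARC files.)

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the archived URLs this robots.txt would have kept a crawler from."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt as it might be extracted from a WARC record.
ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical URLs that were captured because robots.txt was ignored.
captured = [
    "http://example.gov.uk/index.html",
    "http://example.gov.uk/private/report.pdf",
]

print(blocked_urls(ROBOTS, captured))                            # generic crawler
print(blocked_urls(ROBOTS, captured, user_agent="ia_archiver"))  # named crawler
```

Running the check per user agent matters here, since the team found some sites that blocked the Internet Archive’s crawler by name.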

Link Ranking Group 

We looked at link analysis to ask if all links are treated the same… We wanted to test if links in <li> are different from content links (<p> or <div>). We used Warcbase scripts to export manageable raw HTML, and loaded it into the BeautifulSoup library. Used this on the Rio Olympic sites…

So we started looking at WARCs… We said, well, we should test whether links are absolute or relative… And we compared hard links to relative links but didn’t see lots of differences…

But we started to look at a previous election data set… There we saw links in tables, and there relative links were about 3/4 of links, and the other 1/4 were hard links. We did some investigation about why we had more hard links (proportionally) than before… Turns out this is a mixture of SEO practice, but also use of CMS (Content Management Systems) which make hard links easier to generate… So we sort of stumbled on that finding…
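
(An aside, not the group’s code: they used Warcbase and BeautifulSoup, whereas this sketch swaps in Python’s standard `html.parser` so it runs without extra installs. It counts links by where they sit, inside an <li> or not, and by whether they are absolute (“hard”) or relative. The class name `LinkClassifier` and the sample HTML are made up.)

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkClassifier(HTMLParser):
    """Count <a href> links by container (list vs content) and link type."""

    def __init__(self) -> None:
        super().__init__()
        self.li_depth = 0
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                container = "list" if self.li_depth > 0 else "content"
                # A link with a host part counts as absolute; anything else
                # (including mailto: links) is lumped in with relative.
                kind = "absolute" if urlsplit(href).netloc else "relative"
                self.counts[(container, kind)] += 1

    def handle_endtag(self, tag):
        # Simplification: assumes every <li> is explicitly closed.
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


# Hypothetical page fragment standing in for HTML exported from a WARC.
SAMPLE = """
<ul><li><a href="/medals">Medals</a></li></ul>
<p>See <a href="https://www.example.org/news">the news</a>
and the <a href="schedule.html">schedule</a>.</p>
"""

classifier = LinkClassifier()
classifier.feed(SAMPLE)
for key, count in sorted(classifier.counts.items()):
    print(key, count)
```

Real archived pages often leave <li> unclosed, which is one reason a forgiving parser like BeautifulSoup is the more practical choice at scale.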

And with that the main programme for today is complete. There is a further event tonight and battery/power sockets permitting I’ll blog that too.