From today until Friday I will be at the International Internet Preservation Consortium (IIPC) Web Archiving Conference 2017, which is being held jointly with the second RESAW: Research Infrastructure for the Study of Archived Web Materials Conference. I’ll be attending the main strand at the School of Advanced Study, University of London, today and Friday, and at the technical strand (at the British Library) on Thursday. I’m here wearing my “Reference Rot in Theses: A HiberActive Pilot” – aka “HiberActive” – hat. HiberActive is looking at how we can better enable PhD candidates to archive web materials they are using in their research, and citing in their thesis. I’m managing the project and working with developers, library and information services stakeholders, and a fab team of five postgraduate interns who are, whilst I’m here, out and about around the University of Edinburgh talking to PhD students to find out how they collect, manage and cite their web references, and what issues they may be having with “reference rot” – content that changes, decays, disappears, etc. We will have a webpage for the project and some further information to share soon but if you are interested in finding out more, leave me a comment below or email me: email@example.com. These notes are being taken live so, as usual for my liveblogs, I welcome corrections, additions, comment etc. (and, as usual, you’ll see the structure of the day appearing below with notes added at each session).
Opening remarks: Jane Winters
This event follows the first RESAW event which took place in Aarhus last year. This year we again highlight the huge range of work being undertaken with web archives.
This year a few things are different… Firstly we are holding this with the IIPC, which means we can run the event over 3 days, and means we can bring together librarians, archivists, and data scientists. The BL have been involved and we are very grateful for their input. We are also excited to have a public event this evening, highlighting the increasingly public nature of web archiving.
Opening remarks: Nicholas Taylor
On behalf of the IIPC Programme Committee I am hugely grateful to colleagues here at the School of Advanced Study and at the British Library for being flexible and accommodating us. I would also like to thank colleagues in Portugal, and hope a future meeting will take place there as had been originally planned for IIPC.
For us we have seen the Web Archiving Conference as an increasingly public way to explore web archiving practice. The programme committee saw a great increase in submissions, requiring a larger than usual commitment from the programming committee. We are lucky to have this opportunity to connect as an international community of practice, to build connections to new members of the community, and to celebrate what you do.
Opening plenary: Leah Lievrouw – Web history and the landscape of communication/media research Chair: Nicholas Taylor
I intend to go through some context in media studies. I know this is a mixed audience… I am from the Department of Information Studies at UCLA and we have a very polyglot organisation – we can never assume that we all understand each other’s backgrounds and contexts.
A lot about the web, and web archiving, is changing, so I am hoping that we will get some Q&A going about how we address some gaps in possible approaches.
I’ll begin by saying that it has been some time now that computing, computers as communication devices, has been seen as a medium. This seems commonplace now, but when I was in college this was seen as fringe, in communication research, in the US at least. But for years documentarists, engineers, programmers and designers have seen information resources, data and computing as tools and sites for imagining, building, and defending “new” societies; enacting emancipatory cultures and politics… A sort of Alexandrian vision of “all the knowledge in the world”. This is still part of the idea that we have in web archiving. Back in the day the idea was that fostering this kind of knowledge would bring about internationality, world peace, modernity. When you look at old images you see artefacts – it is more than information, it is the materiality of artefacts. I am a contributor to Niels’ web archiving handbook, and he talks about history written of the web, and history written with the web. So there are attempts to write history with the web, but what about the tools themselves?
So, this idea about connections between bits of knowledge… This goes back before browsers. Many of you will be familiar with H.G. Wells’ World Brain; Suzanne Briet’s Qu’est-ce que la documentation? (1951) is a very influential work in this space; Jennifer Light wrote a wonderful book on Cold War intellectuals, and their relationship to networked information… One of my lecturers was one of these in fact, thinking about networked cities… Vannevar Bush’s “As We May Think” (1945) saw information as essential to order and society.
Another piece I often teach, J.C.R. Licklider and Robert W. Taylor (1968) in “The Computer as a Communication Device”, talked about computers communicating but not in the same ways that humans make meaning. In fact this graphic shows a man’s computer talking to an insurance salesman saying “he’s out” and the caption “your computer will know what is important to you and buffer you from the outside world”.
We then have this counterculture movement in California in the 1960s and 1970s… And that feeds into the emerging tech culture. We have The WELL coming out of this. Stewart Brand wrote The Whole Earth Catalog (1968-78). And actually in 2012 someone wrote a new Whole Earth Catalog…
Ted Nelson, author of Computer Lib/Dream Machines (1974), is known as the person who came up with the concept of the link, between computers, to information… He’s an inventor essentially. Computer Lib/Dream Machines was a self-published title, a manifesto… The subtitle for Computer Lib was “you can and must understand computers NOW”. Counterculture was another element, and this is way before the web, where people were talking about networked information… These people were not thinking about preservation and archiving, but there was an assumption that information would be kept…
And then we see information utilities and wired cities emerging, mainly around cable TV but also local public access TV… There was a lot of capacity for information communication… In the UK you had teletext, in Canada there was Telidon… And you were able to start thinking about information distribution wider and more diverse than central broadcasters… With services like LexisNexis emerging we had these ideas of information utilities… There was a lot of interest in the 1980s, and back in the 1970s too.
Harold Sackman and Norman Nie (eds.) The Information Utility and Social Choice (1970); H.G. Bradley, H.S. Dordick and B. Nanus, The Emerging Network Marketplace (1980); R.S. Block “A global information utility”, The Futurist (1984); W.H. Dutton, J.G. Blumler and K.L. Kraemer “Wired cities: shaping the future of communications” (1987).
This new medium looked more like point-to-point communication, like the telephone. But no-one was studying that. There were communications scholars looking at face to face communication, and at media, but not at this on the whole.
Now, that’s some background, I want to periodise a bit here… And I realise that is a risk of course…
So, we have the Pre-browser internet (early 1980s-1990s). Here the emphasis was on access – to information, expertise and content at centre of early versions of “information utilities”, “wired cities” etc. This was about everyone having access – coming from that counter culture place. More people needed more access, more bandwidth, more information. There were a lot of digital materials already out there… But they were fiddly to get at.
Now, when the internet became privatised – moved away from military and universities – the old model of markets and selling information to mass markets, the transmission model, reemerged. But there was also this idea that because the internet was point-to-point – and any point could get to any other point… And that everyone would eventually be on the internet… The vision was of the internet as “inherently democratic”. Now we recognise the complexity of that right now, but that was the vision then.
Post-browser internet (early 1990s to mid-2000s) – was about web 1.0. Browsers and WWW were designed to search and retrieve documents, discrete kinds of files, to access online documents. I’ve said “Web 1.0” but had a good conversation with a colleague yesterday who isn’t convinced about these kinds of labels, but I find them useful shorthand for thinking about the web at particular points in time/use. In this era we had email still but other types of authoring tools arose.. Encouraging a wave of “user generated content” – wikis, blogs, tagging, media production and publishing, social networking sites. This sounds such a dated term now but it did change who could produce and create media, and it was the team around LA around this time.
Then we began to see Web 2.0 with the rise of “smart phones” in the mid-2000s, merging mobile telephony and specialised web-based mobile applications, accelerating user content production and social media profiling. And the rise of social networking sounded a little weird to those of us with sociology training who were used to these terms from the real world, from social network analysis. But Facebook is a social network. Many of the tools, blogging for example, can be seen as having a kind of mass media quality – so instead of a movie studio making content… But I can have my blog which may have an audience of millions or maybe just, like, 12 people. But that is highly personal. Indeed one of the earliest so-called “killer apps” for the internet was email. Instead of shipping data around for processing – as the architecture originally got set up for – you could send a short note to your friend elsewhere… Email hasn’t changed much. That point-to-point communication suddenly and unexpectedly became more than half of the traffic on the ARPANET. Many people were surprised by that. That pattern of interpersonal communication over networks continued to repeat itself – we see it with Facebook, Twitter, and even with blogs etc. that have feedback/comments etc.
Web 2.0 is often talked about as social driven. But what is important from a sociology perspective, is the participation, and the participation of user generated communities. And actually that continues to be a challenge, it continues to be not the thing the architecture was for…
In the last decade we’ve seen algorithmic media emerging, and the rise of “web 3.0”. Both access and participation appropriated as commodities to be monitored, captured, analyzed, monetised and sold back to individuals, reconceived as data subjects. Everything is thought about as data, data that can be stored, accessed… Access itself, the action people take to stay in touch with each other… We all carry around monitoring devices every day… At UCLA we are looking at the concept of the “data subjects”. Bruce ? used to talk about the “data footprint” or the “data cloud”. We are at a moment where we are increasingly aware of being data subjects. London is one of the most remarkable cities in the world in terms of surveillance… The UK in general, but London in particular… And that is ok culturally, I’m not sure it would be in the United States.
We did some work in UCLA to get students to mark up how many surveillance cameras there were, who controlled them, who had set them up, how many there were… Neither Campus police nor university knew. That was eye opening. Our students were horrified at this – but that’s an American cultural reaction.
But if we conceive of our own connections to each other, to government, etc. as “data” we begin to think of ourselves, and everything, as “things”. Right now systems and governance maximising the market, institutional government surveillance; unrestricted access to user data; moves towards real-time flows rather than “stocks” of documents or content. Surveillance isn’t just about government – supermarkets are some of our most surveilled spaces.
I currently have students working on a “name domain infrastructure” project. The idea is that data will be enclosed, that data is time-based, to replace the IP, the Internet Protocol. So that rather than packets, data is flowing all the time. So that it would be like opening the nearest tap to get water. One of the interests here is from the movie and television industry, particularly web streaming services who occupy significant percentages of bandwidth now…
There are a lot of ways to talk about this, to conceive of this…
1.0 tends to be about documents, press, publishing, texts, search, retrieval, circulation, access, reception, production-consumption: content.
2.0 is about conversations, relationships, peers, interaction, communities, play – as a cooperative and flow experience, mobility, social media (though I rebel against that somewhere): social networks.
3.0 is about algorithms, “clouds” (as fluffy benevolent things, rather than real and problematic, with physical spaces, server farms), “internet of things”, aggregation, sensing, visualisation, visibility, personalisation, self as data subject, ecosystems, surveillance, interoperability, flows: big data, algorithmic media. Surveillance is kind of the environment we live in.
Now I want to talk a little about traditions in communication studies..
In communication, broadly and historically speaking, there has been one school of thought that is broadly social scientific, from sociology and communications research, that thinks about how technologies are “used” for expression, interaction, as data sources or analytic tools. Looking at media in terms of their effects on what people know or do, can look at media as data sources, but usually it is about their use.
There are theories of interaction, group process and influence; communities and networks; semantic, topical and content studies; law, policy and regulation of systems/political economy. One key question we might ask here: “what difference does the web make as a medium/milieu for communicative action, relations, interaction, organising, institutional formation and change?” Those from a science and technology background might know about the issues of shaping – we shape technology and technology shapes us.
Then there is the more cultural/critical/humanist or media studies approach. When I come to the UK people who do media studies still think of humanist studies as being different, “what people do”. However this approach of cultural/critical/etc. is about analyses of digital technologies and web; design, affordances, contexts, consequences – philosophical, historical, critical lens. How power is distributed are important in this tradition.
In terms of theoretical schools, we have the Toronto School/media ecology – the Marshall McLuhan take – which is very much about the media itself; American cultural studies, and the work of James Carey and his students; Birmingham school – the British take on media studies; and new materialism – that you see in Digital Humanities, German Media Studies, that says we have gone too far from the roles of the materials themselves. So, we might ask “What is the web itself (social and technical constituents) as both medium and product of culture, under what conditions, times and places?”
So, what are the implications for Web Archiving? Well I hope we can discuss this, thinking about a table of:
Web Phase | Soc sci/admin | Crit/Cultural
- Documents: content + access
- Conversation: Social nets + participation
- Data/Algorithms: algorithmic media + data subjects
Comment: I was wondering about ArXiv and the move to sharing multiple versions, pre-prints, post prints…
Leah: That issue of changes in publication, what preprints mean for who is paid for what, that’s certainly changing things and an interesting question here…
Comment: If we think of the web moving from documents, towards fluid state, social networks… It becomes interesting… Where are the boundaries of web archiving? What is a web archiving object? Or is it not an object but an assemblage? Also ethics of this…
Leah: It is an interesting move from the concrete, the material… And then this whole cultural heritage question, what does it instantiate, what evidence is it, whose evidence is it? And do we participate in hardening those boundaries… Or do we keep them open… How porous are our boundaries…
Comment: What about the role of metadata?
Leah: Sure, arguably the metadata is the most important thing… What we say about it, what we define it as… And that issue of fluidity… We think of metadata as having some sort of fixity… One thing that has begun to emerge in surveillance contexts… Where law enforcement says “we aren’t looking at your content, just the metadata”, well it turns out that is highly personally identifiable, it’s the added value… What happens when that secondary data becomes the most important thing… In fact, where many of our data systems do not communicate with each other, those connections are through the metadata (only).
Comment: In terms of web archiving… As you go from documents, to conversations, to algorithms… Archiving becomes so much more complex. Particularly where interactions are involved… You can archive the data and the algorithm but you still can’t capture the interactions there…
Leah: Absolutely. As we move towards the algorithmic level it’s not a fixed thing. You can’t just capture the Google search algorithms, they change all the time. The more I look at this work through the lens of algorithms and data flows, there is no object in the classic sense…
Comment: Perhaps, like a movie, we need longer temporal snapshots…
Leah: Like the algorithmic equivalence of persistence of vision. Yes, I think that’s really interesting.
And with that the opening session is over, with the organisers noting that those interested in surveillance may be interested to know that Room 101, said to have inspired the room of the same name in 1984, is where we are having coffee…
Session 1B (Chair: Marie Chouleur, National Library of France):
Jefferson Bailey (Deputy chair of IIPC, Director of Web Archiving, Internet Archive): Advancing access and interface for research use of web archives
I would like to thank all of the organisers again. I’ll be giving a broad rather than deep overview of what the Internet Archive is doing at the moment.
For those that don’t know, we are a non-profit Digital Library and Archive founded in 1996. We work in a former church and it’s awesome – you are welcome to visit, and we do open public lunches every Friday if you are ever in San Francisco. We have lots of open source technology and we are very technology-driven.
People always ask about stats… We are at 30 Petabytes plus multiple copies right now, including 560 billion URLs, 280 billion webpages. We archive about 1 billion URLs per week, and have partners and facilities around the world, including here in the UK where we have Wellcome Trust support.
So, searching… This is the Wayback Machine. Most of our traffic – 75% – is automatically directed to the new service. So, if you search for, say, UK Parliament, you’ll see the screenshots, the URLs, and some statistics on what is there and captured. So, how does it work? With that much data to do full text search! Even the raw text (not HTML) is 3-5 PB. So, we figured the most instructive and easiest to work with text is the anchor text of all in-bound links to a homepage. The index text covers 443 million homepages, drawn from 900B in-bound links from other cross-domain websites. Is that perfect? No, but it’s the best that works on this scale of data… And people tend to make keyword type searches which this works for.
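To make the anchor text idea concrete for myself, here is a minimal sketch in Python. It is my own illustration and not the Internet Archive’s code: the function names, the sample hosts and the simple “count matching in-bound links” scoring are all assumptions. It indexes each homepage only by the words other sites use when linking to it, and skips same-host links to mirror the “cross-domain” point above.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_anchor_index(links):
    """Build token -> Counter(target_host -> number of in-bound links using that token).

    `links` is an iterable of (source_host, target_host, anchor_text) tuples.
    This tuple layout is a hypothetical one chosen for the sketch.
    """
    index = defaultdict(Counter)
    for source_host, target_host, anchor_text in links:
        if source_host == target_host:
            continue  # keep only cross-domain in-bound links
        for token in set(tokenize(anchor_text)):
            index[token][target_host] += 1
    return index


def search(index, query):
    """Rank target hosts by how many in-bound anchor texts match the query tokens."""
    scores = Counter()
    for token in tokenize(query):
        scores.update(index.get(token, {}))
    return [host for host, _ in scores.most_common()]


# Made-up sample data for illustration only.
SAMPLE_LINKS = [
    ("news.example", "parliament.example", "UK Parliament"),
    ("blog.example", "parliament.example", "Parliament website"),
    ("parliament.example", "parliament.example", "home"),
    ("blog.example", "library.example", "national library"),
]
SAMPLE_INDEX = build_anchor_index(SAMPLE_LINKS)
```

With the sample data, `search(SAMPLE_INDEX, "uk parliament")` returns only the hypothetical `parliament.example`, even though no page text was ever indexed, which is the trade-off described: far less text to index, at the cost of only finding pages by what others call them.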
You can also now, in the new Wayback Machine, see a summary tab which includes a visualisation of data captured for that page, host, domain, MIME-type or MIME-type category. It’s really fun to play with. It’s really cool information to work with. That information is in the Wayback Machine (WBM) for 4.5 billion hosts; 256 million domains; 1238 TLDs. There are also special collections – we are building this for specific crawls/collections such as our .gov collection. And there is an API – so you can create your own visualisations if you like.
We have also created a full text search for AIT (Archive-It). This was part of a total rebuild of full text search in Elasticsearch. 6.5 billion documents with a 52 TB full text index. In total AIT is 23 billion documents and 1 PB. Searches are across all 8000+ collections. We have improved relevance ranking, metadata search, performance. And we have a Media Search coming – it’s still a test at present. So you can search non-textual content with a similar process.
So, how can we help people find things better… search, full text search… And APIs. The APIs power the details charts, capture counts, year, size, new, domain/hosts. Explore that more and see what you can do. We’ve also been looking at Data Transfer APIs to standardise transfer specifications for web data exchange between repositories for preservation. For research use you can submit “jobs” to create derivative datasets from WARCs from specific collections. And it allows programmatic access to AIT WARCs, submission of jobs, job status, derivative results list. More at: https://github.com/WASAPI-Community/data-transfer-apis.
In other API news we have been working with WAT files – a sort of metadata file derived from a WARC. This includes headers and content (title, anchor/text, metas, links). We have API access to some capture content – a better way to get programmatic access to the content itself. So we have a test build on a 100 TB WARC set (EOT). It’s like CDX API with a build – replays WATs not WARCs (see: http://vinay-dev.us.archive.org:8080/eot2016/20170125090436/http://house.gov/). You can analyse, for example, term counts across the data.
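As a rough illustration of the “term counts across the data” idea, here is a small Python sketch. It does not use the real WAT layout or the test API mentioned above: the record shape (a dict with a `title` and a list of `links`, each with `url` and `text`) is a simplified assumption of mine, standing in for the title and anchor text metadata that a WAT file carries.

```python
import re
from collections import Counter


def terms(text):
    """Return the lowercase alphanumeric terms in a piece of text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_counts(records):
    """Count terms across page titles and link anchor text.

    Each record is a simplified, hypothetical stand-in for WAT-style
    metadata: {"title": str, "links": [{"url": str, "text": str}, ...]}.
    """
    counts = Counter()
    for record in records:
        counts.update(terms(record.get("title", "")))
        for link in record.get("links", []):
            counts.update(terms(link.get("text", "")))
    return counts


# Made-up sample records for illustration only.
SAMPLE_RECORDS = [
    {
        "title": "House Committee on the Budget",
        "links": [{"url": "http://budget.example/hearings", "text": "Budget hearings"}],
    },
    {"title": "Budget Office", "links": []},
]
SAMPLE_COUNTS = term_counts(SAMPLE_RECORDS)
```

The point of working from metadata records like this rather than full WARCs is that the titles and link text are already extracted, so an analysis of this kind never has to parse the archived HTML itself.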
In terms of analysing language we have a new CDX code to help identify languages. You can visualise this data, see the language of the texts, etc. A lot of our content right now is in English – we need less focus on English in the archive.
We are always interested in working with researchers on building archives, not just using them. So we are working on the News Measures Research Project. We are looking at 663 local news sites representing 100 communities. 7 crawls for a composite week (July-September 2016).
We are also working with a Katrina Blogs project: after the research was done and the project was published, we created a special collection of the sites used so that it can be accessed and explored.
And in fact we are generally looking at ways to create useful sub collections and ways to explore content. For instance Gif Cities is a way to search for gifs from Geocities. We have a Military Industrial Powerpoint Complex, turning PPT into PDFs and creating a special collection.
We did a new collection, with a dedicated portal (https://www.webharvest.gov) which archives US congress for NARA. We capture this every 2 years, and it also raised questions of indexing YouTube videos.
We are also looking at historical ccTLD Wayback Machines. Built on IA global crawls and added historic web data with keyword and mime/format search, embed linkback, domain stats and special features. This gives a German view – from the .de domain – of the archive.
And we continue to provide data and datasets for people. We love Archives Unleashed – which ran earlier this week. We did an Obama Whitehouse data hackathon recently. We have a webinar on APIs coming very soon.
Q1) What is anchor text?
A1) When you create a link to a page, that’s the text that is associated with that link.
Q2) If you are using anchor text in that keyword search… What happens when the anchor text is just a URL…
A2) We are tokenising all the URLs too. And yes, we are using a kind of PageRank type understanding of popular anchor text.
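For what “tokenising the URLs” might look like, here is a minimal Python sketch. This is my assumption of the general technique, not the Internet Archive’s tokeniser: it splits the host and path of a URL into lowercase word-like tokens and drops a leading “www”, so a link whose anchor text is just a URL still contributes searchable keywords.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url):
    """Split a URL's host and path into lowercase keyword tokens.

    A scheme is added when missing so that bare hosts such as
    "house.gov" are parsed as hosts rather than as paths.
    """
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    raw = (parts.hostname or "") + " " + parts.path
    tokens = [t for t in re.split(r"[^a-z0-9]+", raw.lower()) if t]
    # "www" carries no meaning as a keyword, so it is dropped (an assumption).
    return [t for t in tokens if t != "www"]
```

For example, `tokenize_url("http://www.parliament.uk/business/news/")` gives `["parliament", "uk", "business", "news"]`.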
Q3) That TLD work… Do you plan to offer that to all who ask, for all top level domains?
A3) Yes! Because subsets are small enough that they allow search in a more manageable way… We basically build a new CDX for each of these…
Q4) What are issues you are facing with data protection challenges and archiving in the last few years… Concerns about storing data with privacy considerations.
A4) No problems for us. We operate as a library… The Wayback Machine is used in courts, but not by us – in US courts it’s recognised as a thing you can use in court.
Panel: Internet and Web Histories – Niels Brügger – Chair (NB); Marc Weber (MW); Steve Jones (SJ); Jane Winters (JW)
We are going to talk about the internet and the web, and also to talk about the new journal, Internet Histories, which I am editing. The new journal addresses what my colleagues and I saw as a gap. On the one hand there are journals like New Media and Society and Internet Studies which are great, but rarely focus on history. And media history journals are excellent but rarely look at web history. We felt there was a gap there… And Taylor & Francis Routledge agreed with us… The inaugural issue is a double issue 1-2, and people on our panel today are authors in our first journal, and we asked them to address six key questions from members of our international editorial board.
For this panel we will have an argument, counter statement, and questions from the floor type format.
A Common Language – Marc Weber
This journal has been a long time coming… I am Curatorial Director, Internet History Program, Computer History Museum. We have been going for a while now. This Internet History program was probably the first one of its kind in a museum.
When I first said I was looking at the history of the web in the mid ’90s, people were puzzled… Now most people have moved to incurious acceptance. Until recently there was also tepid interest from researchers. But in the last few years it has reached critical mass – and this journal is a marker of this change.
We have this idea of a common language, the sharing of knowledge. For a long time my own perspective was mostly focused on the web, it was only when I started the Internet History program that I thought about the fuller sweep of cyberspace. We come in through one path or thread, and it can be (too) easy to only focus on that… Of the first major networks, the ARPAnet was there and has become the internet. Telenet was one of the most important commercial networks in the 1970s, but who here now remembers Anne Reid of Telenet? [no-one] And by contrast, what about Vint Cerf? [some] However, we need to understand what changed, what did not succeed in the long term, how things changed and shifted over time…
We are kind of in the Victorian era of the internet… We have 170 years of telephones, 60 years of going on line… longer of imagining a connected world. Our internet history goes back to the 1840s and the telegraph. And a useful thought here, “The past isn’t over. It isn’t even past” William Faulkner. Of this history only small portions are preserved properly. There are some risks in not having a collective narrative… And in not understanding particular aspects in proper context. There is also scope for new types of approaches and work, not just applying traditional approaches to the web.
There is a risk of a digital dark age – we have film to illustrate this at the museum although I don’t think this crowd needs persuading of the importance of preserving the web.
So, going forward… We need to treat history and preservation as something to do quickly, we cannot go back and find materials later…
Response – Jane Winters
Marc makes, I think convincingly, the case for a common language, and for understanding the preceding and surrounding technologies, why they failed and their commercial, political and social contexts. And I agree with the importance of capturing that history, with oral history a key means to do this. Secondly the call to look beyond your own interest or discipline – interdisciplinary research is always challenging, but in the best sense, and can be hugely rewarding when done well.
Understanding the history of the internet and its context is important, although I think we see too many comparisons with early printing. Although some of those views are useful… I think there is real importance in getting to grips with these histories now, not in a decade or two. Key decisions will be made, from net neutrality to mass surveillance, and right now the understanding and analysis of the issues is not sophisticated – such as the incompatibility of “back doors” and secure internet use. And as researchers we risk focusing on the content, not the infrastructure. I think we need a new interdisciplinary research network, and we have all the right people gathered here…
Q1) Marc, as you are from a museum… Have you any thoughts about how you present the archived web, the interface between the visitor and the content you preserve.
A1) What we do now with the current exhibits… the star isn’t the objects, it is the screen. We do archive some websites – we don’t try to replicate the Internet Archive but we do work with them on some projects, including the GeoCities exhibition. When you get to things that require emulation or live data, we want live and interactive versions that can be accessed online.
Q2) I’m a linguist and was intrigued by the interdisciplinary collaboration suggested… How do you see linguists and the language of the web fitting in…
A2) Actually there is a postdoc – Naomi – looking at how different language communities in the UK have engaged through looking at the UK Web Archive, seeing how language has shaped their experience and change in moving to a new country. We are definitely thinking about this and it’s a really interesting opportunity.
Out from the PLATO Cave: Uncovering the pre-Internet history of social computing – Steve Jones, University of Illinois at Chicago
I think you will have gathered that there is no one history of the internet. PLATO was a space for education and for my interest it also became a social space, and a platform for online gaming. These uses were spontaneous rather than centrally led. PLATO was an acronym for Programmed Logic for Automatic Teaching Operations (see diagram in Ted Nelson’s Dream Machines publication and https://en.wikipedia.org/wiki/PLATO_(computer_system)).
There were two key interests in developing for PLATO – one was multi-player games, and the other was communication. And the latter was due to laziness… Originally the PLATO lab was in a large room, and we couldn’t be bothered to walk to each other’s desks. So “Talk” was created – and that saved standard messages so you didn’t have to say the same thing twice!
As time went on, I undertook undergraduate biology studies and engaged in the Internet and saw that interaction as similar… At that time data storage was so expensive that storing content in perpetuity seemed absurd… If it was kept it’s because you hadn’t got to writing it yet. You would print out code – then rekey it – that was possible at the time given the number of lines per programme. So, in addition to the materials that were missing… There were boxes of Ledger-size green bar print outs from a particular PLATO Notes group of developers. Having found this in the archive I took pictures to OCR – that didn’t work! I got – brilliantly and terribly – funding to preserve that text. That content can now be viewed side by side in the archive – images next to re-keyed text.
Now, PLATO wasn’t designed for lay users, it was designed for professionals although also used by university and high school students who had the time to play with it. So you saw changes between developer and community values, seeing development of affordances in the context of the discourse of the developers – that archived set of discussions. The value of that work is to describe and engage with this history not just from our current day perspective, but to understand the context, the people and their discourse at the time.
Response – Marc
PLATO sort of is the perfect example of a system that didn’t survive into the mainstream… Those communities knew each other, the idea of the flatscreen – which led to the laptop – came from PLATO. PLATO had a distinct messaging system, separate from the ARPAnet route. It’s a great corpus to see how this was used – were there flames? What does one-to-many communication look like? It is a wonderful example of the importance of preserving these different threads.. And PLATO was one of the very first spaces not full of only technical people.
PLATO was designed for education, and that meant users were mainly students, and that shaped community and usage. There was a small experiment with community time sharing memory stores – with terminals in public places… But PLATO began in the late ’60s and ran through into the 80s, it is the poster child for preserving earlier systems. PLATO notes became Lotus Notes – that isn’t there now but in its own domain, PLATO was the progenitor of much of what we do with education online now, and that history is also very important.
Q1) I’m so glad, Steve, that you are working on PLATO. I used to work in Medical Education in Texas and we had PLATO terminals to teach basic science first and second year medical education students and ER simulations. And my colleagues and I were taught computer instruction around PLATO. I am interested that you wanted to look at discourse around UIC around PLATO – so, what did you find? I only experienced PLATO at the consumer end of the spectrum, so I wondered what the producer end was like…
A1) There are a few papers on this – search for it – but two basic things stand out… (1) the degree to which as a mainframe system PLATO was limited as a system, and the conflict between the systems people and the gaming people. The gaming used a lot of the capacity, and although that taxed the system it did also mean they developed better code, showed what PLATO was capable of, and helped with the case for funding and support. So it wasn’t just shut PLATO down, it was a complex 2-way thing; (2) the other thing was around the emergence of community. Almost anyone could sit at a terminal and use the system. There were occasional flare ups and they mirrored community responses even later around flamewars, competition for attention, community norms… Hopefully others will mine that archive too and find some more things.
Digital Humanities – Jane Winters
I’m delighted to have an article in the journal, but I won’t be presenting on this. Instead I want to talk about digital humanities and web archives. There is a great deal of content in web archives but we still see little research engagement in web archives, there are numerous reasons including the continuing work on digitised traditional texts, and slow movement to develop new ways to research. But it is hard to engage with the history of the 21st century without engaging with the web.
The mismatch of the value of web archives and the use and research around the archive was part of what led us to set up a project here in 2014 to equip researchers to use web archives, and encourage others to do the same. For many humanities researchers it will take a long time to move to born-digital resources. And to engage with material that subtly differs for different audiences. There are real challenges to using this data – web archives are big data. As humanities scholars we are focused on the small, the detailed, we can want to filter down… But there is room for a macro historical view too. What Tim Hitchcock calls the “beautiful chaos?” of the web.
Exploring the wider context one can see change on many levels – from the individual person or business, to widespread social and political change. How the web changes the language used between users and consumers. You can also track networks, the development of ideas… It is challenging but also offers huge opportunities. Web archives can include newspapers, media, and direct conversation – through social media. There is also visual content, gifs… The increase in use of YouTube and Instagram. Much of this sits outside the scope of web archives, but a lot still does make it in. And these media and archiving challenges will only become more challenging as we see more data… The larger and more uncontrolled the data, the harder the analysis. Keyword searches are challenging at scale. The selection of the archive is not easily understood but is important.
The absence of metadata is another challenge too. The absence of metadata or alternative text can render images, particularly, invisible. And the mix of formats and types of personal and the public is most difficult but also most important. For instance the announcement of a government policy, the discussion around it, a petition perhaps, a debate in parliament… These are not easy to locate… Our histories are almost inherently online… But they only gain any real permanence through preservation in web archives, and that’s why humanists and historians really need to engage with them.
Response – Steve
I particularly want to talk about archiving in scholarship. In order to fit archiving into scholarly models… administrators increasingly make the case for scholarship in the context of employment and value. But archive work is important. Scholars are discouraged from this sort of work because it is not quick, it’s harder to be published… Separately you need organisations to engage in preservation of their online presences. The degree to which archive work is needed is not reflected by promotion committees, organisational support, local archiving processes. There are immense rhetorical challenges here, to persuade others of the value of this work. There had been successful cases made to encourage telephone providers to capture and share historical information. I was at a telephone museum recently and asked about the archive… She handed me a huge book on the founding of Southwestern Bell, published in a very small run… She gave me a copy but no-one had asked about this before… That’s wrong though, it should be captured. So we can do some preservation work ourselves just by asking!
Q1) Jane, you mentioned a skills gap for humanities researchers. What sort of skills do they need?
A1) I think the complete lack of quantitative data training, how to sample, how to make meaning from quantitative data. They have never been engaged in statistical training. They have never been required to do it – you specialise so early here. Also, basic command line stuff… People don’t understand that or why they have to engage that way. Those are two simple starting points. Those help them understand what they are looking at, what an ngram means, etc.
Session 2B (Chair: Tom Storrar)
Philip Webster, Claire Newing, Paul Clough & Gianluca Demartini: A temporal exploration of the composition of the UK Government Web Archive
I’m afraid I’ve come into this session a little late. I have come in at the point that Philip and Claire are talking about the composition of the archive – mostly 2008 onwards – and looking at status codes of UK Government Web Archive.
Philip: The hypothesis for looking at http status codes was to see if changes in government raised trends in the http status code. Actually, when we looked at post-2008 data we didn’t see what we expected there. However we did find that there was an increase in not finding what was requested – and thought this may be about moving to dynamic pages – but this is not a strong trend.
In terms of MIME types – media types – which are restricted to:
Application – flash, java, Microsoft Office Documents. Here we saw trends away from PDF as the dominant format. Microsoft Word increases, and we see the increased use of Atom – syndication – coming across.
Document – PDF remains prevalent. Also MS Word, some MS Excel. Open formats haven’t really taken hold…
Claire: The Government Digital Strategy included guidance to use open document formats as much as possible, but that wasn’t mandated until late 2014 – a bit too late for our data set unfortunately. But the Government Digital Strategy in 2011 was, itself, published in Word and PDF!
Philip: If we take document type outside of PDFs you see that lack of open formats more clearly..
Image – This includes images appearing in documents, plus icons. And occasionally you see non-standard media types associated with the MIME-types. Jpegs are fairly consistent over time. Gif and Png are comparable… Gif was being phased out for IP reasons, with Png to replace it, and you see that change over time…
Text – Text is almost all HTML. You see a lot of plain text, stylesheets, XML…
Video – we saw compressed video formats… but gradually superseded with embedded YouTube links. However we do still see a lot of flash video retained. And we see a large, increasing amount of MP4, used by Apple devices.
Another thing that is available over time is relative file sizes. However CDX index only contains compressed size data and therefore is not a true representation of file size trends. So you can’t compare images to their pre-archiving version. That means for this work we’ve limited the data set to those where you can tell the before and after status of the image files. We saw some spikes in compressed image formats over time, not clear if this shows departmental issues..
To finish on a high note… There is an increase in the use of https rather than http. I thought it might be the result of a campaign, but it seems to be a general trend..
The conclusion… Yes, it is possible to do temporal analysis of CDX index data but you have to be careful, looking at proportion rather than raw frequency. SQL is feasible, commonly available and low cost. Archive data has particular weaknesses – data cannot be assumed to be fully representative, but in some cases trends can be identified.
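The point about looking at proportion rather than raw frequency can be illustrated with a small SQL query. This is only a minimal sketch and not the speakers’ actual analysis: the table name `cdx`, its two columns and the helper `status_proportions` are hypothetical, and it assumes the CDX index has already been reduced to one (capture year, HTTP status) pair per record.

```python
import sqlite3


def status_proportions(rows):
    """Return (year, status, proportion) tuples for an iterable of
    (capture_year, http_status) pairs.

    The input shape is an assumption: a simplified extract, not a real CDX line.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cdx (year INTEGER, status INTEGER)")
    conn.executemany("INSERT INTO cdx VALUES (?, ?)", rows)
    # Divide each (year, status) count by that year's total, so years with
    # more captures do not dominate the trend.
    query = """
        SELECT c.year, c.status, COUNT(*) * 1.0 / t.total AS proportion
        FROM cdx AS c
        JOIN (SELECT year, COUNT(*) AS total FROM cdx GROUP BY year) AS t
          ON t.year = c.year
        GROUP BY c.year, c.status
        ORDER BY c.year, c.status
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    sample = [(2008, 200)] * 3 + [(2008, 404)] + [(2012, 200)] * 6 + [(2012, 404)] * 4
    for year, status, proportion in status_proportions(sample):
        print(year, status, round(proportion, 2))
```

In this made-up sample the raw number of 404s quadruples between the two years, but the share of 404s only rises from a quarter to two fifths, which is the kind of difference the speakers are warning about.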
Q1) Very interesting, thank you. Can I understand… You are studying the whole archive? How do you take account of having more than one copy of the same data over time?
A1) There is a risk of one website being overrepresented in the archive. There are checks that can be done… But that is more computationally expensive…
Q2) With the seed list, is that generating the 404 rather than actual broken links?
A2 – Claire) We crawl by asking the crawler to go out to find links and seed from that. It generally looks within the domain we’ve asked it to capture…
Q3) At various points you talked about peaks and trends… Have you thought about highlighting that to folks who use your archive so they understand the data?
A3 – Claire) We are looking at how we can do that more. I have read about historians’ interest in the origins of the collection, and we are thinking about this, but we haven’t done that yet.
Caroline Nyvang, Thomas Hvid Kromann & Eld Zierau: Capturing the web at large – a critique of current web citation practices
Caroline: We are all here as we recognise the importance and relevance of internet research. Our paper looks at web referencing and citation within the sciences. We propose a new format to replace the URL+date format usually recommended. We will talk about a study of web references in 35 Danish master’s theses from the University of Copenhagen, then further work on monograph referencing, then a new citation format.
The work on 35 masters theses submitted to Copenhagen university, included, as a set: 899 web references, there was an average of 26.4 web references – some had none, the max was 80. This gave us some insight into how students cite URL. Of those students citing websites: 21% gave the date for all links; 58% had dates for some but not all sites; 22% had no dates. Some of those URLs pointed to homepages or search results.
We looked at web rot and web references – almost 16% could not be accessed by the reader, checked or reproduced. An error rate of 16% isn’t that remarkable – in 1992 a study of 10 journals found that a third of references were inaccurate enough to make it hard to find the source again. But web resources are dynamic and issues will vary, and likely increase over time.
The amount of web references does not seem to correlate with particular subjects. Students are also quite imprecise when they reference websites. And even when the correct format was used 15.5% of all the links would still have been dead.
Thomas: We looked at 10 Danish academic monographs published from 2010-2016. Although this is a small number of titles, it allowed us to see some key trends in the citation of web content. There was a wide range of number of web citations used – 25% at the top, 0% at the bottom of these titles. Location of web references in these texts are not uniform. On the whole scholars rely on printed scholarly work… But web references are still important. This isn’t a systematic review of these texts… In theory these links should all work.
We wanted to see the status after five years… We used a traffic light system. 34.3% were red – broken, dead, a different page; 20?% were amber – critical links that either refer to changed or at risk material; 44.7% were green – working as expected.
This work showed that web references turn into dead links within a limited number of years. In our work the URLs that go to the front page, with instructions of where to look, actually, ironically, lasted best. Long complex URLs were most at risk… So, what can we do about this…
Eld: We felt that we had to do something here, to address what is needed. We can see from the studies that today’s practices of URLs and date stamp does not work. We need a new standard, a way to reference something stable. The web is a marketplace and changes all the time. We need to look at the web archives… And we need precision and persistency. We felt there were four necessary elements, and we call it the PWID – Persistent Web IDentifier. The four elements are:
- Archived URL
- Time of archiving
- Web archive – precision and indication that you verified this is what you expect. Also persistency. Researcher has to understand that – is it a small or large archive, what is contextual legislation.
- Content coverage specification – is part only? Is it the html? Is it the page including images as it appears in your browser? Is it a page? Is it the site including referred pages within the domain?
So we propose a form of reference which can be textually expressed as:
web archive: archive.org, archiving time: 2016-04-20 18:21:47 UTC, archived URL: http://resaw.en/, content coverage: webpage
But, why not use web archive URL? Of the form:
Well, this can be hard to read, there is a lot of technology embedded in the URL. It is not as accessible.
So, a PWID URI:
This is now in as an ISO 690 suggestion and proposed as a URI type.
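The textual form of the reference quoted above can be assembled mechanically from the four elements. The following is a minimal sketch of that idea only: the class `WebArchiveReference` and its field names are hypothetical, and it deliberately reproduces just the textual form from the talk, not the PWID URI syntax, which is not recorded in these notes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class WebArchiveReference:
    """The four elements named in the talk (field names are illustrative)."""
    web_archive: str
    archiving_time: datetime
    archived_url: str
    content_coverage: str

    def as_text(self) -> str:
        # Require a timezone-aware time so the UTC conversion is unambiguous.
        if self.archiving_time.tzinfo is None:
            raise ValueError("archiving_time must be timezone-aware")
        stamp = self.archiving_time.astimezone(timezone.utc).strftime(
            "%Y-%m-%d %H:%M:%S UTC"
        )
        return (
            f"web archive: {self.web_archive}, "
            f"archiving time: {stamp}, "
            f"archived URL: {self.archived_url}, "
            f"content coverage: {self.content_coverage}"
        )


if __name__ == "__main__":
    ref = WebArchiveReference(
        web_archive="archive.org",
        archiving_time=datetime(2016, 4, 20, 18, 21, 47, tzinfo=timezone.utc),
        archived_url="http://resaw.en/",
        content_coverage="webpage",
    )
    print(ref.as_text())
```

Keeping the four elements as separate fields, rather than as one opaque string, is what makes it possible to render the same reference in more than one form later.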
To sum up, all research fields need to refer to the web. Good scientific practice cannot take place with current approaches.
Q1) I really enjoyed your presentation… I was wondering what citation format you recommend for content behind paywalls, and for dynamic content – things that are not in the archive.
A1 – Eld) We have proposed this for content in the web archive only. You have to put it into an archive to be sure, then you refer to it. But we haven’t tried to address those issues of paywall and dynamic content. BUT the URI suggestion could refer to closed archives too, not just open archives.
A1 – Caroline) We also wanted to note that this approach is to make web citations align with traditional academic publication citations.
Q2) I think perhaps what you present here is an idealised way to present archiving resources, but what about the marketing and communications challenge here – to better cite websites, and to use this convention when they aren’t even using best practice for web resources.
A2 – Eld) You are talking about marketing to get people to use this, yes? We are starting with the ISO standard… That’s one aspect. I hope also that this event is something that can help promote this and help to support it. We hope to work with different people, like you, to make sure it is used. We have had contact with Zotero for instance. But we are a library… We only have the resources that we have.
Q3) With some archives of the web there can be a challenge for students, for them to actually look at the archive and check what is there..
A3) Firstly citing correctly is key. There are a lot of open archives at the moment… But we hope the next step will be more about closed archives, and ways to engage with these more easily, to find common ground, to ensure we are citing correctly in the first place.
Comment – Nicola Bingham, BL) I like the idea of incentivising not just researchers but also publishers to incentivise web archiving, another point of pressure to web archives… And making the case for openly accessible articles.
Q4) Have you come across Martin Klein and Herbert Van de Sompel’s work on robust links, and Memento?
A4 – Eld) Memento is excellent to find things, but usually you do not have the archive in there… I don’t think the way of referencing without the archive is a precise reference…
Q5) When you compare to web archive URL, it was the content coverage that seems different – why not offer as an incremental update.
A5) As far as I know that would mean using a # in the URL and that doesn’t offer that specificity…
Comment) I would suggest you could define the standard for after that # in the URLs to include the content coverage – I’ll take that offline.
Q6) Is there a proposal there… For persistence across organisations, not just one archive.
A6) I think from my perspective there should be a registry when archives change/move to find the new registry. Our persistent identifier isn’t persistent if you can change something. And I think archives must be large organisations, with formal custodians, to ensure it is persistent.
Comment) I would like to talk offline about content addressing and Linked Data to directly address and connect to copies.
Andrew Jackson: The web archive and the catalogue
I wanted to talk about some bad experiences I had recently… There is a recent BL video of the journey of a (print) collection item… From posting to processing, cataloguing, etc… I have worked at the library for over 10 years, but this year for the first time I had to get to grips with the library catalogue… I’ll talk more about that tomorrow (in the technical strand) but we needed to update our catalogue… Accommodating the different ways the catalogue and the archive see content.
Now, that video, the formation of teams, the structure of the organisations, the physical structure of our building is all about that print process, and that catalogue… So it was a surprise for me – maybe not you – that the catalogue isn’t just bibliographic data, it’s also a workflow management tool…
There is a chain of events here… Sometimes events are in a line, sometimes in circles… Always forwards…
Now, last year legal deposit came in for online items… The original digital processing workflow went from acquisition to ingest to cataloguing… But most of the content was already in the archive… We wanted to remove duplication, and make the process more efficient… So we wanted to automate this as a harvesting process.
For our digital work previously we also had a workflow, from nomination, to authorisation, etc… With legal deposit we have to get it all, all the time, all the stuff… So, we don’t collect news items, we want all news sites every day… We might specify crawl targets, but more likely that we’ll see what we’ve had before and draw them in… But this is a dynamic process….
So, our document harvester looks for “watched targets”, harvests, extracts documents for web archiving… and also ingest. There are relationships to acquisition, that feeds into cataloguing and the catalogue. But that is an odd mix of material and metadata. So that’s a process… But webpages change… For print matter things change rarely, it is highly unusual. For the web changes are regular… So how do we bring these things together…
To borrow an analogy from our Georeferencing project… Users engage with an editor to help us understand old maps. So, imagine a modern web is a web archive… Then you need information, DOIs, places and entities – perhaps a map. This kind of process allows us to understand the transition from print to online. So we think about this as layers of transformation… Where we can annotate the web archive… Or the main catalogue… That can be replaced each time this is needed. And the web content can, with this approach, be reconstructed with some certainty, later in time…
Also this approach allows us to use rich human curation to better understand that which is being automatically catalogued and organised.
So, in summary: the catalogue tends to focus on chains of operation and backlogs, item by item. The web archive tends to focus on transformation (and re-transformation) of data. Layered data model can bring them together. Means revisiting the data (but fixity checking requires this anyway). It’s costly in terms of disk space required. And it allows rapid exploration and experimentation.
Q1) To what extent is the drive for this your users, versus your colleagues?
A1) The business reason is that it will save us money… Taking away manual work. But, as a side effect we’ve been working with cataloguing colleagues in this area… And their expectations are being raised and changed by this project. I do now much better understand the catalogue. The catalogue tends to focus on tradition not output… So this project has been interesting from this perspective.
Q2) Are you planning to publish that layer model – I think it could be useful elsewhere?
A2) I hope to yes.
Q3) And could this be used in Higher Education research data management?
A3) I have noticed that with research data sets there are some tensions… Some communities use change management, functional programming etc… Hadoop, which we use, requires replacement of data… So yes, but this requires some transformation to do.
We’d like to use the same base data infrastructure for research… Otherwise it is hard to maintain this pattern of work.
Q4) Your model… suggests WARC files and such archive documents might become part of new views and routes in for discovery.
A4) That’s the idea, for discovery to be decoupled from where you store the file.
Nicola Bingham, UK Web Archive: Resource not in archive: understanding the behaviour, borders and gaps of web archive collections
I will describe the shape and the scope of the UK Web Archive, to give some context for you to explore it… By way of introduction.. We have been archiving the UK Web since 2013, under UK non-print legal deposit. But we’ve also had the Open Archive (since 2004); Legal Deposit Archive (since 2013); and the Jisc Historical Archive (1996-2013).
The UK Web Archive includes around 400 TB of compressed data. And in the region of 11-12 billion records. We grow, on average 60-70 TB per year and 3 B records per year. We want to be comprehensive but, that said, we can’t collect everything and we don’t want to collect everything… Firstly we collect UK websites only. We carry out web archiving under 2013 regulations, and they state that only UK published web content – meaning content on a UK web domain, or by a person whose work occurs in the UK. So, we can automate harvesting from UK TLD (.uk, .scot, .cymru etc); UK hosting – geo-IP look-up to locate server. Then manual checks. So Facebook, WordPress, Twitter cannot be automated…
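The first of those automated checks, the TLD test, is simple enough to sketch. This is an illustration only, not the UK Web Archive’s implementation: the function name `in_scope_by_tld` is hypothetical, the suffix list contains just the three examples named in the talk (the “etc” is not expanded), and the geo-IP and manual steps are left out entirely.

```python
from urllib.parse import urlsplit

# Only the TLDs mentioned in the talk; the real list is longer ("etc").
UK_TLDS = (".uk", ".scot", ".cymru")


def in_scope_by_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the listed UK TLDs.

    A False result does not mean out of scope: UK-hosted or UK-authored
    sites on other domains need the geo-IP look-up or a manual check.
    """
    host = urlsplit(url).hostname or ""  # hostname is already lower-cased
    return host.endswith(UK_TLDS)


if __name__ == "__main__":
    for candidate in ("http://www.bl.uk/", "https://example.scot/page", "https://twitter.com/britishlibrary"):
        print(candidate, in_scope_by_tld(candidate))
```

A URL given without a scheme (for example `www.bl.uk` on its own) has no parsed hostname and so returns False here, which is one reason a check like this can only ever be the first filter.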
We only collect published content. Out of scope here are:
- Film and recorded sound where AV content predominates, e.g. YouTube
- Private intranets and emails.
- Social networking sites only available to restricted groups – if you need a login, special permissions they are out of scope.
Web archiving is expensive. We have to provide good value for money… We crawl the UK domain on an annual basis (only). Some sites are more frequent but annual misses a lot. We cap domains at 512 MB – which captures many sites in their entirety, but others that we only capture part of (unless we override automatic settings).
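To make the effect of a per-domain cap concrete, here is a minimal sketch of a byte budget with manual overrides. It is an assumption-laden illustration rather than a description of the crawler’s real configuration: the class `DomainBudget`, its method `allow`, and the choice of binary megabytes for “512 MB” are all hypothetical.

```python
from collections import defaultdict

# Assumes "512 MB" means binary megabytes; the talk does not say.
DEFAULT_CAP_BYTES = 512 * 1024 * 1024


class DomainBudget:
    """Track bytes captured per domain and refuse downloads past the cap."""

    def __init__(self, cap_bytes=DEFAULT_CAP_BYTES, overrides=None):
        self.cap_bytes = cap_bytes
        # overrides maps a domain to its own cap, or to None for "no cap".
        self.overrides = dict(overrides or {})
        self.used = defaultdict(int)

    def allow(self, domain, size_bytes):
        """Return True and record the bytes if the download fits the cap."""
        cap = self.overrides.get(domain, self.cap_bytes)
        if cap is not None and self.used[domain] + size_bytes > cap:
            return False  # this resource would be missing from the archive
        self.used[domain] += size_bytes
        return True


if __name__ == "__main__":
    budget = DomainBudget(overrides={"large-site.example.uk": None})
    print(budget.allow("small-site.example.uk", 300 * 1024 * 1024))
    print(budget.allow("small-site.example.uk", 300 * 1024 * 1024))
    print(budget.allow("large-site.example.uk", 2 * 1024 ** 3))
```

Which resources end up excluded depends on crawl order, so two sites of the same size can have quite different gaps; that is part of why the gaps are hard for users to predict.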
There are technical limitations too, around:
- Database driven sites – crawlers struggle with these
- Programming scripts
- Proprietary file formats
- Blockers – robots.txt or access denied.
So there are misrepresentations… For instance the One Hundred Women blog captures the content but not the stylesheet – that’s a fairly common limitation.
We also have curatorial input to locate the “important stuff”. In the British Library web archiving is not performed universally by all curators, we rely on those who do engage, usually voluntarily. We try to onboard as many curators and specialist professionals as possible to widen coverage.
So, I’ve talked about gaps and boundaries, but I also want to talk about how the users of the archive find this information, so that even where there are gaps, it’s a little more transparent…
We have the Collection Scoping Document, this captures scope, motivation, parameters and timeframe of collection. This document could, in a pared-down form, be made available to end users of the archive.
We have run user testing of our current UK Web Archive website, and our new version. And even more general audiences really wanted as much contextual information as possible. That was particularly important on our current website – where we only shared permission-cleared items. But this is one way in which contextual information can be shown in the interface with the collection.
The metadata can be browsed and searched, though users will be directed to come in to view the content.
So, an example of a collection would be 1000 Londoners, showing the context of the work.
We also gather information during the crawling process… We capture information on crawler configuration, seed list, exclusions… I understand this could be used and displayed to users to give statistics on the collection…
So, what do we know about what the researchers want to know? They want as much documentation as they possibly can. We have engaged with the research community to understand how best to present data to the community. And indeed that’s where your feedback and insight is important. Please do get in touch.
Q1) You said you only collect “published” content… How do you define that?
A1) With legal deposit regulations… The legal deposit libraries may collect content openly available on the web… For content that is paywalled or behind login credentials, UK publishers are obliged to provide credentials for crawling. BUT how we make that accessible… Is a different matter – we wouldn’t republish that on the open web without logins/credentials.
Q2) Do you have any ideas about packaging this type of information for users and researchers – more than crawler config files?
A2) The short answer is no… We’d like to invite researchers to access the collection in both a close reading sense, and a big data sense… But I don’t have that many details about that at the moment.
Q3) A practical question: if you know you have to collect something… If you have a web copy of a government publication, say, and the option of the original, older, (digital) document… Is the web archive copy enough, do you have the metadata to use that the right way?
A3) Yes, so on the official publications… This is where the document harvester tool comes into play, adding another layer of metadata to pass the document through various access elements appropriately. We are still dealing with this issue though.
Chris Wemyss – Tracing the Virtual community of Hong Kong Britons through the archived web
I’ve joined this a wee bit late after a fun adventure on the Senate House stairs…
Looking at the Gwulo: Old Hong Kong site.. User content is central to this site which is centred on a collection of old photographs, buildings, people, landscapes… The website starts to add features to explore categorisations of images.. And the site is led by an older British resident. He described subscribers as being expats who have moved away, remembering an old version of Hong Kong that no longer exists – one user described it as an interactive photo album… There is clearly more to be done on this phenomenon of building these collective resources to construct this type of place. The founder comments on Facebook groups – they are about the now, “you don’t build anything, you just have interesting conversations”.
A third example then, Swire Mariners Association. This site has been running, nearly unchanged, for 17 years, but they have a very active forum, a very active Facebook group. These are all former dockyard workers, they meet every year, it is a close knit community but that isn’t totally represented on the web – they care about the community that has been constructed, not the website for others.
So, in conclusion archives are useful in some cases. Using oral history and web archives together is powerful, however, where it is possible to speak to website founders or members, to understand how and why things have changed over time. Seeing that change over time already gives some idea of the futures people want to see. And these sites indicate the demand for communities, active societies, long after they are formed. And illustrates how people utilise the web for community memory…
Q1) You’ve raised a problem I hadn’t really thought about. How can you tell if they are more active on Facebook or the website… How do you approach that?
A1) I have used web archiving as one source to arrange other things around… Looking for new websites, finding and joining the Facebook group, finding interviewees to ask about that. But I wouldn’t have been prompted to ask about the website and its change/lack of change without consulting the web archives.
Q2) Were participants aware that their pages were in the archive?
A2) No, not at all. The blog I showed first was started by two guys, Gwulo is run by one guy… And he quite liked the idea that this site would live on in the future.
David Geiringer & James Baker: The home computer and networked technology: encounters in the Mass Observation Project archive, 1991-2004
I have been doing web research on various communities, including some work on GeoCities which is coming out soon… And I heard about the Mass Observation project which, from 1991 – 2004, asked respondents about computers and how they are using them in their lives… The archives capture comments like:
“I confess that sometimes I resort to using the computer using the cut and paste technique to write several letters at once”
Confess is a strong word there.. Over this period of observation we saw production of text moving to computers, computers moving into most homes, the rebuilding of modernity. We welcome comment on this project, and hope to publish soon where you can find out more on our method and approach.
So, each year since 1981 the mass observation project has issued directives to respondents to respond to key issues like e.g. Football, or the AIDS crisis. They issued the technology directive in 1991. From that year we see several fans of word processor – words like love, dream… Responses to the 1991 directive are overwhelmingly positive… Something that was not the case for other technologies on the whole…
“There is a spell check on this machine. Also my mind works faster than my hand and I miss out letters. This machine picks up all my faults and corrects them. Thank you computer.”
After this positive response though we start to see etiquette issues, concerns about privacy… Writing some correspondence by hand. Some use simulated hand writing… And start to have concerns about adapting letters, whether that is cheating or not… Ethical considerations appearing.. It is apparent that sometimes guilt around typing text is also slightly humorous… Some playful mischief there…
Altering the context of the issue of copy and paste… the time and effort to write a unique manuscript is of concern… Interestingly the directive asked about printing and filing emails… And one respondent notes that actually it wasn’t financial or business records, but emails from their ex…
Another comments that they wish they had printed more emails during their pregnancy, a way of situating yourself in time and remembering the experience…
I’m going to skip ahead to how computers fitted into their home… People talk about dining rooms, and offices, and living rooms.. Lots of very specific discussions about where computers are placed and why they are placed there… One person comments:
“Usually at the dining room at home which doubles as our office and our coffee room”
Others talk about quieter spaces… The positioning of a computer seems to create some competition for use of space. The home changing to make room for the computer or the network… We also start to see (in 2004) comments about home life and work life, the setting up of a hotmail account as a subtle act of resistance, the reassertion of the home space.
A Mass Observation Directive in 1996 asked about email and the internet:
“Internet – we have this at work and it’s mildly useful. I wouldn’t have it at home because it costs a lot to be quite sad and sit alone at home” (1996)
So, observers from 1991-2004 talked about efficiencies of the computer and internet, copy, paste, ease… But this then reflected concerns about the act of creating texts, of engaging with others, computers as changing homes and spaces. Now, there are really specific findings around location, gender, class, age, sexuality… The overwhelming majority of respondents are white middle class cis-gendered straight women over 50. But we do see that change of response to technology, a moment in time, from positive to concerned. That runs parallel to the rise of the World Wide Web… We think our work does provide context to web archive work and web research, with textual production influenced by these wider factors.
Q1) I hadn’t realised mass observation picked up again in 1980. My understanding was that previously it was the observed, not the observers. Here people report on their own situations?
A1) They self report on themselves. At one point they are asked to draw their living room as well…
Q1) I was wondering about business machinery in the home – typewriters for instance
A1) I don’t know enough about the wider archive. All of this newer material was done consistently… The older mass observation material was less consistent – people recorded on the street, or notes made in pubs. What is interesting is that in the newer responses you see a difference in the writing of the response… As they move from handwritten to typewriters to computer…
Q2) Partly you were talking about how people write and use computers. And a bit about how people archive themselves… But the only work I could find on how people archive themselves digitally was by Microsoft Research… Is there anything since then… In that paper though you could almost read regret between the lines… the loss of photo albums, letters, etc…
A2) My colleague David Geiringer, with whom I co-wrote the paper, was initially looking at self-archiving. There was very very little. But printing stuff comes up… And the tensions there. There is enough there, people talking about worries and loss… There is lots in there… The great thing with Mass Observation is that you can have a question but then you have to dig around a lot to find things…
Ian Milligan, University of Waterloo and Matthew Weber, Rutgers University – Archives Unleashed 4.0: presentation of projects (#hackarchives)
Ian: I’m here to talk about what happened on the first two days of Web Archiving Week. And I’d like to thank our hosts, supporters, and partners for this exciting event. We’ll do some lightning talks on the work undertaken… But why are historians organising data hackathons? Well, because we face problems in our popular cultural history. Problems like GeoCities… Kids write about Winnie the Pooh, people write about the love of Buffy the Vampire Slayer, their love of cigars… We face a problem of huge scale… 7 million users of the web now online… It’s the scale that boggles the mind, and compare it to the Old Bailey – one of very few sources on ordinary people. They leave birth, death, marriage or criminal justice records… The 197,745 trials recorded over 239 years, between 1674 and 1913, are the biggest collection of texts about ordinary people… But from 7 years of GeoCities we have 413 million web documents.
So, we have a problem, and myself, Matt and Olga from the British Library came together to build community, to establish a common vision of web archiving documents, to find new ways of addressing some of these issues.
Matt: I’m going to quickly show you some of what we did over the last few days… and the amazing projects created. I’ve always joked that Archives Unleashed is letting folk run amok to see what they can do… We started around 2 years ago, in Toronto, then Library of Congress, then at Internet Archive in San Francisco, and we stepped it up a little for London! We had the most teams, we had people from as far as New Zealand.
We started with some socialising in a pub on Friday evening, so that when we gathered on Monday we’d already done some introductions. Then a formal overview and quickly forming teams to work and develop ideas… And continuing through day one and day two… We ended up with 8 complete projects:
- Robots in the Archives
- US Elections 2008 and 2010 – text and keyword analysis
- Study of Gender Distribution in Olympic communities
- Link Ranking Group
- Intersection Analysis
- Public Inquiries Implications (Shipman)
- Image Search in the Portuguese Web Archive
- Rhizome Web Archive Discovery Archive
We will hear from the top three from our informal voting…
Intersection Analysis – Jess
We wanted to understand how we could find a cookbook methodology for understanding the intersections between different data sets. So, we looked at the Occupy Movement (2011/12) with a Web Archive, a Rutgers archive and a social media archive from one of our researchers.
We normalised the CDX files, crunched the WAT files for outlinks and extracted links from tweets. We generated counts and descriptive data, and the union/intersection between every data set. We had over 74 million datasets, but only 0.3% overlap between the collections… If you go to our website we have a visualisation of overlaps, tree maps of the collections…
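To make the union/intersection step concrete, here is a minimal sketch of the idea, not the team's actual code. The `normalise` rules, the collection names and the toy URLs are all assumptions made for illustration; the real pipeline worked from CDX and WAT files and extracted tweet links.

```python
from itertools import combinations
from urllib.parse import urlsplit


def normalise(url):
    """Crude, assumed normalisation: drop scheme, 'www.' and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""          # hostname is already lowercased
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def pairwise_overlap(collections):
    """Return |A ∩ B| / |A ∪ B| for every pair of named URL sets."""
    result = {}
    for (name_a, a), (name_b, b) in combinations(collections.items(), 2):
        union = a | b
        result[(name_a, name_b)] = len(a & b) / len(union) if union else 0.0
    return result


# Hypothetical toy data standing in for the three Occupy collections.
raw = {
    "web_archive": ["http://www.occupy.example/", "http://occupy.example/about"],
    "rutgers": ["https://occupy.example", "http://news.example/story"],
    "tweets": ["http://news.example/story/", "http://blog.example/post"],
}
collections = {name: {normalise(u) for u in urls} for name, urls in raw.items()}
overlaps = pairwise_overlap(collections)

for pair, share in overlaps.items():
    print(pair, round(share, 3))
```

How aggressively URLs are normalised changes the overlap figure a lot, so any real comparison would need to state its normalisation rules.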
We wanted to use the WAT files to explore Outlinks in the data sets, what they were linking to, how much of it was archived (not a lot).
Parting thoughts? Overlap is inversely proportional to the diversity of URIs – in other words, the more collectors, the better. Diversifying seed lists with social media is good.
Robots in the Archive
We focused on robots.txt. And our question was “what do we miss when we respect robots.txt?”. At National Library of Denmark we respect this… At Internet Archive they’ve started to ignore that in some contexts. So, what did we do? We extracted robots.txt from the WARC collection. Then we applied it retroactively. Then we wanted to compare that to the link graph.
Our data was from The National Archives and from the 2010 election. We started by looking at user-agent blocks. Four had specifically blocked the Internet Archive, but some robot names were very old and out of date… And we looked at crawl delay… Looking specifically at the sub-collection of the Department of Energy and Climate Change… We would have missed only 24 links that would have been blocked…
So, the impact of robots.txt is minimal for this collection. Our method can be applied to other collections and extended to further the discussion on ignoring robots.txt. And our code is on GitHub.
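As a rough illustration of applying robots.txt retroactively, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the team's code (theirs is on GitHub); the robots.txt content, the URLs and the `examplebot` user agent are all made up for the example.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent):
    """Return the captured URLs that this robots.txt would have disallowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt, as it might be extracted from a WARC file.
ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

# Hypothetical URLs that the crawl actually captured.
CAPTURED = [
    "http://dept.example/index.html",
    "http://dept.example/private/report.pdf",
    "http://dept.example/news/",
]

missed_generic = blocked_urls(ROBOTS, CAPTURED, "examplebot")
missed_ia = blocked_urls(ROBOTS, CAPTURED, "ia_archiver")

print(len(missed_generic), "of", len(CAPTURED), "blocked for a generic crawler")
print(len(missed_ia), "of", len(CAPTURED), "blocked for ia_archiver")
```

Comparing the blocked list against the outlinks in the link graph then shows what a robots-respecting crawl would have lost.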
Link Ranking Group
We looked at link analysis to ask if all links are treated the same… We wanted to test if links in <li> are different from content links (<p> or <div>). We used Warcbase scripts to export manageable raw HTML, then loaded it into the BeautifulSoup library. We used this on the Rio Olympic sites…
So we started looking at WARCs… We said, well, we should test whether links are absolute or relative… We compared hard (absolute) links to relative links but didn’t see lots of differences…
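A minimal sketch of the "where does the link sit" test follows. To keep it dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup, which the team used, so it illustrates the idea rather than reproducing their script. The sample HTML and the "nearest enclosing tag" rule are assumptions.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Count <a href> links by their nearest enclosing <li>, <p> or <div>."""

    CONTEXT_TAGS = {"li", "p", "div"}

    def __init__(self):
        super().__init__()
        self.stack = []            # open context tags, innermost last
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTEXT_TAGS:
            self.stack.append(tag)
        elif tag == "a" and dict(attrs).get("href"):
            self.counts[self.stack[-1] if self.stack else "other"] += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unclosed tags are only
        # cleaned up when an outer tag closes, so sloppy HTML is approximate.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def count_links_by_context(html):
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical page fragment: two navigation links, one content link, one bare link.
SAMPLE = """
<div><ul><li><a href="/sports">Sports</a></li><li><a href="/news">News</a></li></ul>
<p>Read the <a href="http://example.org/report">full report</a>.</p>
<a href="/home">Home</a></div>
"""
counts = count_links_by_context(SAMPLE)
print(dict(counts))
```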
But we started to look at a previous election data set… There we saw links in tables, and there relative links were about 3/4 of links, and the other 1/4 were hard links. We did some investigation about why we had more hard links (proportionally) than before… Turns out this is a mixture of SEO practice, but also use of CMS (Content Management Systems) which make hard links easier to generate… So we sort of stumbled on that finding…
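The hard versus relative split can be sketched in a few lines. This is an illustration only: the classification rule (a link is "hard" if it names a host) and the toy hrefs are assumptions, chosen so the toy output mirrors the roughly 3/4 to 1/4 split described above.

```python
from collections import Counter
from urllib.parse import urlsplit


def link_kind(href):
    """Classify an href as 'hard' (absolute), 'relative' or 'other'."""
    parts = urlsplit(href.strip())
    if parts.netloc:
        return "hard"          # names a host, e.g. http://example.org/x
    if parts.scheme:
        return "other"         # mailto:, javascript:, tel: and similar
    return "relative"


def hard_link_share(hrefs):
    """Return the per-kind counts and the share of hard links among web links."""
    kinds = Counter(link_kind(h) for h in hrefs)
    total = kinds["hard"] + kinds["relative"]
    return kinds, (kinds["hard"] / total if total else 0.0)


# Hypothetical hrefs pulled from an election site's tables.
hrefs = [
    "/results",
    "candidates.html",
    "../about/",
    "http://example.org/press",
    "mailto:press@example.org",
]
kinds, share = hard_link_share(hrefs)
print(dict(kinds), share)
```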
And with that the main programme for today is complete. There is a further event tonight and battery/power sockets permitting I’ll blog that too.