Jun 16 2017
 

It’s the final day of the IIPC/RESAW conference in London. See my day one and day two posts for more information on this. I’m back in the main track today and, as usual, these are live notes so comments, additions, corrections, etc. all welcome.

Collection development panel (Chair: Nicola Bingham)

James R. Jacobs, Pamela M. Graham & Kris Kasianovitz: What’s in your web archive? Subject specialist strategies for collection development

We’ve been archiving the web for many years but the need for web archiving really hit home for me in 2013 when NASA took down every one of their technical reports – for review on various grounds. And the web archiving community was very concerned. Michael Nelson said in a post “NASA information is too important to be left on nasa.gov computers”. And I wrote about when we rely on pointing not archiving.

So, as we planned for this panel we looked back on previous IIPC events and we didn’t see a lot about collection curation. We posed three topics all around these areas. So for each theme we’ll watch a brief screen cast by Kris to introduce them…

  1. Collection development and roles

Kris (via video): I wanted to talk about my role as a subject specialist and how collection development fits into that. As a subject specialist that is a core part of the role, and I use various tools to develop the collection. I see web archiving as absolutely being part of this. Our collection is books, journals, audiovisual content, quantitative and qualitative data sets… Web archives are just another piece of the pie. And when we develop our collection we are looking at what is needed now but in anticipation of what will be needed 10 or 20 years in the future, building a solid historical record that will persist in collections. And we think about how our archives fit into the bigger context of other archives around the country and around the world.

For the two web archives I work on – CA.gov and the Bay Area Governments archives – I am the primary person engaged in planning, collecting, describing and making available that content. And when you look at the web capture life cycle you need to ensure the subject specialist is included and their role understood and valued.

The CA.gov archive involves a group from several organisations including the government library. We have been archiving since 2007, initially in the California Digital Library; we moved into Archive-It in 2013.

The Bay Area Governments archives includes materials on 9 counties, but is primarily and comprehensively focused on two key counties. We bring in regional governments and special districts where policy making for these areas occurs.

Archiving these collections has been incredibly useful for understanding government, their processes, how to work with government agencies and the dissemination of this work. But as the sole responsible person that is not ideal. We have had really good technical support from Internet Archive around scoping rules, problems with crawls, thinking about writing regular expressions, how to understand and manage what we see from crawls. We’ve also benefitted from working with our colleague Nicholas Taylor here at Stanford who wrote a great QA report which has helped us.

We are heavily reliant on crawlers, on tools and technologies created by you and others, to gather information for our archive. And since most subject selectors have pretty big portfolios of work – outreach, instruction, as well as collection development – having good ties to developers, and to a wider community with whom we can share ideas and questions, is really vital.

Pamela: I’m going to talk about two Columbia archives, the Human Rights Web Archive (HRWA) and Historic Preservation and Urban Planning. I’d like to echo Kris’ comments about the importance of subject specialists. The Historic Preservation and Urban Planning archive is led by our architecture subject specialist and we’d reached a point where we had to collect web materials to continue that archive – and she’s done a great job of bringing that together. Human Rights seems to have long been networked – using the idea of the “internet” long before the web and hypertext. We work closely with Alex Thurman, and have an additional specially supported web curator, but there are many more ways to collaborate and work together.

James: I will also reflect on my experience. The FDLP – Federal Depository Library Program – involves libraries receiving absolutely every government publication in order to ensure a comprehensive archive. There is a wider programme allowing selective collection. At Stanford we are 85% selective – we only weed out content (after five years) very lightly, and usually flyers etc. As a librarian I curate content. As an FDLP library we have to think of our collection as part of the wider set of archives, and I like that.

As archivists we also have to understand provenance… How do we do that with the web archive. And at this point I have to shout out to Jefferson Bailey and colleagues for the “End of Term” collection – archiving all gov sites at the end of government terms. This year has been the most expansive, and the most collaborative – including FTP and social media. And, due to the Trump administration’s hostility to science and technology we’ve had huge support – proposals of seed sites, data capture events etc.

2. Collection Development approaches to web archives, perspectives from subject specialists

As subject specialists we all have to engage in collection development – there are no vendors in this space…

Kris: Looking again at the two government archives I work on, there are Depository Program Statuses to act as a starting point… But these haven’t been updated for the web. However, this is really a continuation of the print collection programme. And web archiving actually lets us collect more – we are no longer reliant on agencies putting content into the Depository Program.

So, for CA.gov we really treat this as a domain collection. And no-one is really doing this except some UCs, myself, and the state library and archives – not the other depository libraries. However, we don’t collect think tanks, or the not-for-profit players that influence policy – that is for clarity of scope, although this content provides important context.

We also had to think about granularity… For instance for the CA transport there is a top level domain and sub domains for each regional transport group, and so we treat all of these as seeds.

Scoping rules matter a great deal, partly as our resources are not unlimited. We have been fortunate that with the CA.gov archive that we have about 3TB space for this year, and have been able to utilise it all… We may not need all of that going forwards, but it has been useful to have that much space.

Pamela: Much of what Kris has said reflects our experience at Columbia. Our web archiving strengths mirror many of our other collection strengths and indeed I think web archiving is this important bridge from print to fully digital. I spent some time talking with our librarian (Chris) recently, and she will add sites as they come up in discussion, she monitors the news for sites that could be seeds for our collection… She is very integrated in her approach to this work.

For the human rights work one of the challenges is the time that we have to contribute. And this is a truly interdisciplinary area with unclear boundaries, and those are both challenging aspects. We do look at subject guides and other practice to improve and develop our collections. And each fall we sponsor about two dozen human rights scholars to visit and engage, and that feeds into what we collect… The other thing that I hope to do in the future is more assessment, looking at more authoritative lists in order to compare with other places… Colleagues look at a site called Idealist which lists opportunities and funding in these types of spaces. We also try to capture sites that look more vulnerable – small activist groups – although it is not clear if they actually are that at risk.

Cost-wise, the expensive parts of collecting are the human effort to catalogue, and the permissions element of the collecting process. And there was yesterday’s discussion of a possible need for ethics groups as part of the permissions process.

In the web archiving space we have to be clearer on scope and boundaries as there is such a big, almost limitless, set of materials to pick from. But otherwise plenty of parallels.

James: For me the material we collect is in the public domain so permissions are not part of my challenge here. But there are other aspects of my work, including LOCKSS. In the case of the Fugitive US Agencies Collection we take entire sites (e.g. CBO, GAO, EPA) plus sites at risk (e.g. Census, Current Industrial Reports). These “fugitive” agencies produce publications that should be in the depository programme but are not. Those lost documents that fail to make it out are what this collection is about. When a library notes a lost document I will share that on the Lost Docs Project blog, and then am also able to collect it and seed the web archive – using the WordPress Amber plugin for links. For instance a CBO report on the health bill, aka Trump Care, was missing… In fact many CBO publications were missing, so I have added CBO as a seed for our Archive-It collection.

3. Discovery and use of web archives

Discovery and use of web archives is becoming increasingly important as we look for needles in ever larger haystacks. So, firstly, over to Kris:

Kris: One way we get archives out there is in our catalogue, and into WorldCat. That’s one place to help other libraries know what we are collecting, and how to find and understand it… So I would be interested to do some work with users around what they want to find and how… I suspect it will be about a specific request – e.g. a city council in one place over a ten year period… But they won’t be looking for a web archive per se… We have to think about that, and what kind of intermediaries are needed to make that work… Can we also provide better seed lists and documentation for this? In the social sciences we have the codebook and I think we need to share the equivalent information for web archives, to expose documentation on how the archive was built… And linking to seeds and other parts of collections.

One other thing we have to think about is processing and documenting our ingest mechanisms. We are trying to do this for CA.gov to better describe what we do… But maybe there is a standard way to produce that sort of documentation – like the codebook…

Pamela: Very quickly… At Columbia we catalogue individual sites. We also have a customised portal for the Human Rights Web Archive. That has facets for “search as research” so you can search, develop and learn by working through facets – that’s often more useful than item searches… And, in terms of collecting for the web, we do have to think of what we collect as data for analysis as part of larger data sets…

James: In the interests of time we have to wrap up, but there was one comment I wanted to make, which is that there are tools we use but also gaps that we see for subject specialists [see slide]… And Andrew’s comments about the catalogue struck home with me…

Q&A

Q1) Can you expand on that issue of the catalogue?

A1) Yes, I think we have to see web archives both as bulk data AND collections as collections. We have to be able to pull out the documents and reports – the traditional materials – and combine them with other material in the catalogue… So it is exciting to think about that, about the workflow… And about web archives working into the normal library work flows…

Q2) Pamela, you commented about a permissions framework as possibly vital for IRB considerations for web research… Is that from conversations with your IRB or speculative?

A2) That came from Matt Webber’s comment yesterday on IRB becoming more concerned about web archive-based research. We have been looking for faster processes… But I am always very aware of the ethical concern… People do wonder about ethics and permissions when they see the archive… Interesting to see how we can navigate these challenges going forward…

Q3) Do you use LCSH and are there any issues?

A3) Yes, we do use LCSH for some items and the collections… Luckily someone from our metadata team worked with me. He used Dublin Core, with LCSH within that. He hasn’t indicated issues. Government documents in the US (and at state level) typically use LCSH so no, no issues that I’m aware of.

Plenary (Macmillan Hall): Posters with lightning talks (Chair: Olga Holownia)

Olga: I know you will be disappointed that it is the last day of Web Archiving Week! Maybe next year it should be Web Archiving Month… And then a year!

So, we have lightning talks that go with posters that you can explore during the break, and speak to the presenters as well.

Tommi Jauhiainen, Heidi Jauhiainen, & Petteri Veikkolainen: Language identification for creating national web archives

Petteri: I am a web archivist at the National Library of Finland. But this is really about Tommi’s PhD research on native Finno-Ugric languages and the internet. This work began in 2013 as part of the Kone Foundation Language Programme. It gathers texts in small languages on the web… They had to identify that content in order to capture it.

We extracted the web links on Finnish web pages, and also crawled Russian, Estonian, Swedish, and Norwegian domains for these languages, using HeLI and Heritrix. We used the list of Finnish URLs in the archive, rather than transferring the WARC files directly. HeLI is the Helsinki language identification method, one of the best in the world. It can be found on GitHub. And it can be used for your language as well! The full service will be out next year, but you can ask for HeLI if you want it earlier.
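The general idea behind character n-gram language identification, of which HeLI is a sophisticated variant, can be sketched roughly as follows. This is a toy illustration only, not HeLI’s actual algorithm, and the sample profile sentences are invented:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Score a text against per-language n-gram profiles; return the
    language whose profile shares the most n-gram mass with the text."""
    grams = char_ngrams(text)
    def score(profile):
        return sum(min(c, profile.get(g, 0)) for g, c in grams.items())
    return max(profiles, key=lambda lang: score(profiles[lang]))

# Tiny illustrative profiles built from made-up sample sentences;
# a real system trains on large corpora per language.
profiles = {
    "fin": char_ngrams("hyvää päivää kuinka voit tänään"),
    "eng": char_ngrams("good day how are you doing today"),
}
print(identify("hyvää huomenta", profiles))  # prints "fin"
```

A real identifier would smooth the scores and use several n-gram lengths, but the shape of the problem is the same: match the character statistics of unseen text against per-language models.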

Martin Klein: Robust links – a proposed solution to reference rot in scholarly communication

I work at Los Alamos, I have two short talks and both are work with my boss Herbert Van de Sompel, who I’m sure you’ll be aware of.

So, the problem robust links address is that links break and referenced content changes. It is hard to ensure the author’s intention is honoured. So, you write a paper last year, point to the EPA, and the DOI this year doesn’t work…

So, there are two ways to do this… You can create a snapshot of a referenced resource… with Perma.cc, Internet Archive, Archive.is, WebCite. That’s great… But the citation people use is then the URI of the archived copy… Sometimes the original URI is included… But what if the URI-M is a copy elsewhere – archive.is or the no longer present mummy.it.

So, second approach: decorate your links by referencing the URI of the snapshot, the datetime of archiving, and the resource’s original URI. That makes your link more robust, meaning you can find the live version. The original URI allows finding captures in all web archives. The capture datetime lets you identify when/what version of the site is used.

How do you do this? With HTML5 link decoration: alongside the href attribute you add data attributes (data-original and data-versiondate). We talked about this in a D-Lib article, with some JavaScript that makes those links actionable!
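As a rough sketch, generating such a decorated link might look like this. The attribute names follow the talk as noted above (the published Robust Links spec may differ slightly), and the URIs are hypothetical:

```python
def robust_link(original_uri, memento_uri, version_date, link_text):
    """Build an HTML5 anchor decorated per the Robust Links idea:
    href points at a copy, data attributes carry the original URI and
    the archival datetime so any web archive can be consulted later."""
    return (
        f'<a href="{memento_uri}" '
        f'data-original="{original_uri}" '
        f'data-versiondate="{version_date}">{link_text}</a>'
    )

# Hypothetical example: a citation decorated with an Internet Archive copy.
print(robust_link(
    "https://www.epa.gov/report",
    "https://web.archive.org/web/20160401000000/https://www.epa.gov/report",
    "2016-04-01",
    "EPA report",
))
```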

So, come talk to me upstairs about this!

Herbert Van de Sompel, Michael L. Nelson, Lyudmila Balakireva, Martin Klein, Shawn M. Jones & Harihar Shankar: Uniform access to raw mementos

Martin: Hello, it’s still me, I’m still from Los Alamos! But this is a more collaborative project…

The problem here… Most web archives augment their mementos with custom banners and links… So, in the Internet Archive there is a banner from them, and links are rewritten to point to copies in the archive. There are lots of reasons, legal, convenience… But that enhancement doesn’t represent the website at the time of capturing… As a researcher those enhancements are detrimental as you have to rewrite links again.

For us and our Memento Reconstruct, and other replay systems, that’s a challenge. It also makes it harder to check the veracity of content.

Currently some systems do support this… OpenWayback and pywb allow it – you can add {datetime}im_/URI-R to request the raw capture, for instance. But that is quite dependent on the individual archive.
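For example, building such a raw memento URI with the im_ flag could be sketched like this. The archive prefix shown is the Internet Archive’s Wayback endpoint; other OpenWayback/pywb deployments use their own prefixes:

```python
def raw_memento_uri(archive_prefix, datetime14, uri_r):
    """Build an unmodified ("raw") memento URI using the im_ flag
    understood by OpenWayback and pywb, e.g.
    {prefix}/{YYYYMMDDhhmmss}im_/{original URI}."""
    return f"{archive_prefix}/{datetime14}im_/{uri_r}"

print(raw_memento_uri(
    "https://web.archive.org/web",  # archive-specific replay prefix
    "20170616120000",               # 14-digit capture datetime
    "http://example.com/",          # URI-R, the original resource
))
# prints https://web.archive.org/web/20170616120000im_/http://example.com/
```

This is exactly the archive-specific dependence mentioned above: the prefix and flag support vary per deployment, which is what the proposed Prefer header is meant to standardise.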

So, we propose using the Prefer Header in HTTP Request…

Option 1: Request header sent against Time Gate

Option 2: Request header sent against Memento

So come talk to us… Both versions work, I have a preference, Ilya has a different preference, so it should be interesting!

Sumitra Duncan: NYARC discovery: Promoting integrated access to web archive collections

NYARC is a consortium formed in 2006 from the research libraries at the Brooklyn Museum, The Frick Collection and the Museum of Modern Art. There is a two-year Mellon grant to implement the program. And there are 10 collections in Archive-It devoted to scholarly art resources – including artist websites, gallery sites, catalogues, lists of lost and looted art. There is a seed list of 3900+ sites.

To put this in place we asked for proof-of-concept discovery sites – we only had two submitted. We selected Primo from Ex Libris. This brings in materials using the OpenSearch API. The set-up also lets us pull in other archives if we want to. And you can choose whether to include the web archive (or not). The access points are through MARC records and full-record search, and are in both the catalogue and WorldCat. We don’t, however, have faceted results for the web archive as that’s not in the API.

And recently, after discussion with Martin, we integrated Memento into the archive, which lets users explore all captured content with Memento Time Travel.

In the future we will be doing usability testing of the discovery interface, we will promote use of web archive collections, and encouraging use in new digital art projects.

Find NYARC’s Archive-It Collections: www.nywarc.org/webarchive. Documentation at http://wiki.nyarc.??

João Gomes: Arquivo.pt

Olga: Many of you will be aware of Arquivo. We couldn’t go to Lisbon to mark the 10th anniversary of the Portuguese web archive, but we welcome João to talk about it.

João: We have had ten years of preserving the Portuguese web, collaborating, researching and getting closer to our researchers, and ten years celebrating a lot.

Hello, I am João Gomes, the head of Arquivo.pt. We are celebrating ten years of our archive. We are having our national event in November – you are all invited to attend and party a lot!

But what about the next 10 years? We want to be one of the best archives in the world… With improvements to full text search, and new services to launch – like image searching and high quality archiving services. Launching an annual prize for research projects over Arquivo.pt. And at the same time growing our collection and user community.

So, thank you to all in this community who have supported us since 2007. And long live Arquivo.pt!

Changing records for scholarship & legal use cases (Chair: Alex Thurman)

Martin Klein & Herbert Van de Sompel: Using the Memento framework to assess content drift in scholarly communication

This project addresses both link rot and content drift – as I mentioned earlier in my lightning talk. I talked about link rot there; content drift is where the URI stays the same but the content there changes, perhaps out of all recognition, so that what I cite is not reproducible.

You may or may not have seen this, but there was a Supreme Court case referencing a website, and someone thought it would be really funny to purchase that domain and put up a very custom 404 error. But you can see pages that change between submission and publication. By contrast if you look at arXiv, for instance, you see an example of a page with no change over 20 years!

This matters partly as we reference URIs increasingly, hugely so since 2008.

So, some of this I talked about three years ago when I introduced the Hiberlink project, a collaborative project with the University of Edinburgh, where we coined the term “reference rot”. This issue is a threat to the integrity of the web-based scholarly record. Resources do not have the same sense of fixity as, e.g., a journal article. And custodianship is also not as long term; custodians are not always as interested.

We wrote about link rot in PLoS ONE. But now we want to focus on content drift. We published a new article on this in PLoS ONE a few months ago. This is actually based on the same corpus – the entirety of arXiv, of PubMed Central, and also over 2 million articles from Elsevier. This covered publications from January 1997 to December 2012. We only looked at URIs for non-scholarly content – not the DOIs but the blog posts, the Wikipedia pages, etc. We ended up with a total of around 1 million URIs for these corpora. And we also kept the publication date of each article with our data.

So, what is our approach for assessing content drift? We take the publication date of the URI as t. Then we try to find a memento of the referenced URI from just before publication (t-1) and one from just after (t+1). Two thirds of the URIs we looked at have this pair across archives. So now we do text analysis, looking at textual similarity between t-1 and t+1. We use computed normalised scores (values 0 to 100) for:

  • simhash
  • Jaccard – sets of character changes
  • Sorensen-Dice
  • Cosine – contextual changes
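Simplified versions of three of those normalised measures might look like this (a sketch of the general idea only; the paper’s exact preprocessing, shingle sizes, and simhash implementation will differ):

```python
from collections import Counter
import math

def shingles(text, k=4):
    """Set of overlapping character k-grams (shingles) of a text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard: shared shingles over all shingles, scaled to 0-100."""
    sa, sb = shingles(a), shingles(b)
    return 100 * len(sa & sb) / len(sa | sb)

def dice(a, b):
    """Sorensen-Dice: twice the overlap over the summed set sizes."""
    sa, sb = shingles(a), shingles(b)
    return 100 * 2 * len(sa & sb) / (len(sa) + len(sb))

def cosine(a, b):
    """Cosine similarity over word-count vectors, scaled to 0-100."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return 100 * dot / norm if norm else 0.0

t1 = "the agency published the report in march"
t2 = "the agency published the revised report in april"
print(round(jaccard(t1, t2)), round(dice(t1, t2)), round(cosine(t1, t2)))
```

Identical texts score 100 on all three measures, which is the intuition behind requiring a perfect score on every measure before trusting a memento pair.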

So we define a perfect Representative Memento as one that gets a perfect score across all four measures. And we did some sanity checks too, via HTTP headers – E-Tag and Last-Modified being the same are a good measure. And that sanity check passed: 98.88% of such mementos were representative.

Out of the 650k pairs we found, about 313k URIs have Representative Mementos. There wasn’t any big difference across the three collections.

Now, with these 313k links, over 200k had a live site. And that allowed us to analyse and compare the live and archived versions. We used those same four measures to check similarity. Those vary so we aggregate. And we find that 23.7% of URIs have not drifted. But that means that over 75% have drifted and may not be representative of author intent.

In our work 25% of the most recent papers we looked at (2012) have not drifted at all. That gets worse going back in time, as is intuitive. Again, the differences across the corpora aren’t huge. PMC isn’t quite the same – as there were fewer articles initially. But the trend is common… In Elsevier’s 1997 works only 5% of content has not drifted.

So, take aways:

  1. Scholarly articles increasingly contain URI references to web at large resources
  2. Such resources are subject to reference rot (link rot and content drift)
  3. Custodians of these resources are typically not overly concerned with archiving of their content and the longevity of the scholarly record
  4. Spoiler: Robust links are one way to address this at the outset.

Q&A

Q1) Have you had any thoughts on site redesigns where human-readable content may not have changed, but pages have?

A1) Yes. We used those four measures to address that… We strip out all of the HTML and formatting. Cosine ignores very minor “and” vs. “or” changes, for instance.

Q1) What about Safari readability mode?

A1) No. We used something like Beautiful Soup to strip out code. Of course you could also do visual analysis to compare pages.

Q2) You are systematically underestimating the problem… You are looking at publication date… It will have been submitted earlier – generally 6-12 months.

A2) Absolutely. For the sake of the experiment it’s the best we can do… Ideally you’d be as close as possible to the authoring process… When published, as you say, it may already have drifted…

Q3) A comment and a question… 

Preprints versus publication… 

A3) No, we didn’t look explicitly at pre-prints. In arXiv those are…

The URIs in articles in Elsevier seem to rot more than those in arXiv.org articles… We think that could be because Elsevier articles tend to reference more .coms whereas arXiv references more .org URIs but we need more work to explore that…

Nicholas Taylor: Understanding legal use cases for web archives

I am going to talk about use of web archives in litigation. Out of scope here are the areas of preservation of web citations; terms of service and API agreements for social media collection; copyright; and the right to be forgotten.

So, why web archives? Well, it’s where the content is. In some cases social media may only be available in web archives. Courts do now accept web archive evidence. The earliest IAWM (Internet Archive Wayback Machine) evidence was as early as 2004. Litigants routinely challenge this evidence but courts often accept IAWM evidence – commonly through affidavit or testimony, through judicial notice, sometimes through expert testimony.

The IA have affidavit guidance and they suggest asking the court to ensure they will accept that evidence, making that the issue for the courts not the IA. And interpretation is down to the parties in the case. There is also information on how the IAWM works.

Why should we care about this? Well legal professionals are our users too. Often we have unique historical data. And we can help courts and juries correctly interpret web archive evidence leading to more informed outcomes. Other opportunities may be to broaden the community of practice by bringing in legal technology professionals. And this is also part of mainstreaming web archives.

Why might we hesitate here? Well, typically cases serve private interests rather than public goods. There is an immature open source software culture for legal technology. And market solutions for web and social media archiving in this context do already exist.

Use cases for web archiving in litigation mainly have to do with information on individual webpages at a point in time; information on individual webpages over a period of time; and persistence of navigational paths over a period of time. Types of cases include civil litigation and intellectual property cases (which have a separate court in the US). I haven’t seen any criminal cases using the archive but that doesn’t mean it doesn’t exist.

Where archives are used there is a focus on authentication and validity of the record. Telewizja Polska USA Inc. v. Echostar Video Inc. (2004) saw the parties arguing over the evidence but the court accepting it. In Specht v. Google Inc. (2010) the evidence was not admissible as it had not come in under the affidavit rule.

Another important rule in the US context is judicial notice (FRE 201), a rule that allows a fact to be entered into evidence. Archives have been used in this context, for instance Martins v. 3PD, Inc. (2013) and Pond Guy, Inc. v. Aguascape Designs (2011). And in Tompkins v. 23andMe, Inc. (2014) both parties used IAWM screenshots, and the court went out and found further screenshots that countered both of these to an extent.

Expert testimony (FRE 702) has included Khoday v. Symantec Corp. et al. (2015), where the expert on navigational paths was queried but the court approved that testimony.

In terms of reliability factors, things that are raised as concerns include the IAWM disclaimer, incompleteness, provenance, and temporal coherence. I have not seen any examples on discreteness, temporal coherence with HTTP headers, etc.

Nassar v. Nassar (2017) was a defamation case where the IAWM disclaimer saw the court not accept evidence from the archive.

Stabile v. Paul Smith Ltd. (2015) saw incomplete archives used, with the court acknowledging but accepting the relevance of what was entered.

Marten Transport Ltd. v. Plattform Advertising Inc. (2016) also involved incomplete archives, with discussion of banners and ads, but the court understood that IAWM does account for some of this. Objections have included issues with crawlers, and concern that a human/witness wasn’t directly involved in capturing the pages. The literature includes different perceptions of incompleteness. We also have issues of live site “leakage” via AJAX – where new ads leaked into archived pages…

Temporal coherence can be complicated. Web archive captures can include mementos that are embedded and archived at different points in time, so that the composite does not totally make sense.

The Memento Time Travel service shows you temporal coherence. See also Scott Ainsworth’s work. That kind of visualisation can help courts to understand temporal coherence. Other datetime estimation strategies include “Carbon Dating” (and constituent services), comparing X-Archive-Orig-Last-Modified with the Memento datetime, etc.

Interpreting datetimes is complicated, and of great importance in legal cases. These can be interpreted from static datetime text in the archived page, the Memento datetime, the headers, etc.

Servicenow, Inc. v. Hewlett-Packard Co. (2015) was a patent case where things must have been published a year earlier to count as “prior art”, and in this case the archive showed an earlier date than other documentation.

In terms of IAWM provenance… Cases have questioned this. Sources for IAWM include a range of different crawls, but what does that mean for reliable provenance? There are other archives out there too, but I haven’t seen evidence of these being used in court yet. Canonicality is also an interesting issue… Personalisation of content served to the archival agent is an unanswered question. What about client artifacts?

So, what’s next? If we want to better serve legal and research use cases, then we need to surface more provenance information, and to improve interfaces to understand temporal coherence and make volatile aspects visible…

So, some questions for you,

  1. why else might we care, or not, about legal use cases?
  2. what other reliability factors are relevant?
    1. What is the relative importance of different reliability factors?
    2. For what use cases are different reliability factors relevant?

Q&A

Q1) Should we save WhoIs data alongside web archives?

A1) I haven’t seen that use case but it does provide context and provenance information

Q2) Is the legal status of IA relevant – it’s not a publicly funded archive. What about security certificates or similar to show that this is from the archive and unchanged?

A2) To the first question, courts have typically been more accepting of web evidence from .gov websites. They treat that as reliable or official. Not sure if that means they are more inclined to use it… On the security side, there were some really interesting issues raised by Ilya and Jack. As courts become more concerned, they may increasingly look for those signs. But there may be more of those concerns…

Q3) I work with one of those commercial providers… A lot of lawyers want to be able to submit WARCs captured by Webrecorder or similar to courts.

A3) The legal system is very document-centric… Much of their data coming in is PDF and that does raise those temporal issues.

Q3) Yes, but they do also want to render WARC, to bring that in to their tools…

Q4) Did you observe any provenance work outside the archive – developers, GitHub commits… Stuff beyond the WARC?

A4) I didn’t see examples of that… Maybe has to do with… These cases often go back a way… Sites created earlier…

Anastasia Aizman & Matt Phillips: Instruments for web archive comparison in Perma.cc

Matt: We are here to talk to you about some web archiving work we are doing. We are from the Harvard Library Innovation Lab. We have learnt so much from what you are doing, thank you so much. Perma.cc is creating tools to help you cite stuff on the web, to capture the WARC, and to organise those things…

We got started on this work when examining documents in the Supreme Court corpus from 1996 to the present. Zittrain et al., in the Harvard Law Review, found more than 70% of references had rotted. So we wanted to build tools to help with that…

Anastasia: So, we have some questions…

  1. How do we know a website has changed?
  2. How do we know which are important changes?

So, what is a website made of? There are a lot of different resources on any page; say, a Washington Post article will have perhaps 90 components. Some are visual, some are hidden… So, again, how can we tell if the site has changed, whether the change is significant, and how do we convey that to the user?

In 1997, Andrei Broder wrote about syntactic clustering of the web. In that work he looked at every site on the world wide web. Things have changed a great deal since then… Websites are more dynamic now, and we need more ways to compare pages…

Matt: So we have three types of comparison…

  • image comparison – we flatten the page down… If we compare two shots of Hacker News a few minutes apart there is a lot of similarity, but difference too… So we create a third image showing/highlighting the differences and can see where those changes are…

Why do image comparison? It’s kind of a dumb way to understand difference… Well it’s a mental model the human brain can take in. The HCI is pretty simple here – users regularly experience that sort of layering – and we are talking general web users here. And it’s easy to have images on hand.
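The "third image" idea can be sketched in a few lines. This is a hypothetical illustration, not Perma.cc's implementation: two captures are treated as tiny grayscale pixel grids, and the per-pixel absolute difference becomes the overlay that lights up where the page changed.

```python
# Hypothetical 4x4 grayscale "screenshots": 0 = white, 255 = dark ink.
before = [
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0,   0,   0, 0],
    [0, 255,   0, 0],
]
after = [
    [0, 255, 255, 0],
    [0, 255,   0, 0],  # one pixel of the headline changed
    [0,   0,   0, 0],
    [0, 255, 255, 0],  # banner region redrawn
]

# The "third image": per-pixel absolute difference, so unchanged
# regions go to 0 and changed regions stand out for the user.
diff = [[abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(before, after)]

changed = sum(p > 0 for row in diff for p in row)
print(f"{changed} of {len(diff) * len(diff[0])} pixels differ")
```

With real screenshots the same operation is what Pillow's `ImageChops.difference` computes, which can then be thresholded and tinted to produce the highlight layer.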

So, sometimes it works well… Here’s an example… A silly one… A post that is the same but we have a cup of coffee with and without coffee in the mug, and small text differences. Comparisons like this work well…

But it works less well where we see banner ads on webpages and they change all the time… But what does that mean for the content? How do we fix that? We need more fidelity, we need more depth.

Anastasia: So we need another way to compare… Looking at a Washington Post page from 2016 and 2017… Here we can see what has been deleted, and we can see what has been added… And the tagline of the paper itself has changed in this case.

The pros of this highlighting approach are that it’s in use in lots of places, and it’s intuitive… BUT it has to ignore invisible-to-the-user tags. And it is kind of stupid… With two totally different headlines, both saying “Supreme Court”, it sees similarity where there is none.
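The added/deleted view can be reproduced with Python's standard `difflib`; a minimal sketch with invented headlines, not the Perma.cc code:

```python
from difflib import ndiff

old = "Supreme Court upholds the travel ban"
new = "Supreme Court blocks the travel ban"

# ndiff marks tokens removed from the old capture ('-') and tokens
# added in the new one ('+') -- the same added/removed view the
# highlighting UI paints over the page text.
for line in ndiff(old.split(), new.split()):
    if line.startswith(("-", "+")):
        print(line)
```

This is exactly where the "two different headlines both saying Supreme Court" problem comes from: a token-level diff happily matches the shared words regardless of meaning.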

So what about other similarity measures… ? Maybe a score would be nice, rather than an overlay highlighting change. So, for that we are looking at:

  • Jaccard Coefficient (MinHash) – this is essentially like applying a Venn diagram to two archives.
  • Hamming distance (SimHash) – this turns strings into 1s and 0s and figures out where the differences are… a difference ratio.
  • Sequence Matcher (Baseline/Truth) – this looks for sequences of words… It is good but hard to use as it is slow.
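The three measures above can be approximated with standard-library Python. A hedged sketch: exact Jaccard on word shingles stands in for the MinHash estimate, and the SimHash here is the textbook construction, not necessarily what Perma.cc ships.

```python
import hashlib
from difflib import SequenceMatcher

def shingles(text, k=3):
    """Overlapping k-word shingles -- the unit Broder's syntactic
    clustering paper compares between documents."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b, k=3):
    """Exact Jaccard coefficient on shingle sets -- the 'Venn diagram'
    overlap of two captures. MinHash approximates this cheaply."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

def simhash(text, bits=64):
    """Classic SimHash: sum +/-1 per bit position over token hashes;
    the sign of each sum gives one fingerprint bit."""
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(x, y):
    """Number of differing bits between two SimHash fingerprints."""
    return bin(x ^ y).count("1")

old = "Supreme Court strikes down state law in landmark ruling"
new = "Supreme Court strikes down federal law in landmark ruling"

print(jaccard(old, new))                         # shingle overlap, 0..1
print(hamming(simhash(old), simhash(new)))       # differing bits, 0..64
print(SequenceMatcher(None, old, new).ratio())   # slow but exact baseline
```

`SequenceMatcher` is quadratic in the worst case, which matches the talk's point that it works as ground truth but is too slow to run over thousands of captured resources.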

So, we took Washington Post archives (2000+) and resources (12,000) and looked at SimHash – big gaps. MinHash was much closer…

When we can calculate those changes… does it matter? If it’s ads, do you care? Some people will. Human eyes are needed…

Matt: So, how do we convey this information to the user… Right now in Perma we have a banner, we have highlighting, or you can choose image view. And you can see changes highlighted in “File Changes” panel on top left hand side of the screen. You can click to view a breakdown of where those changes are and what they mean… You can get to an HTML diff (via Javascript).

So, those are our three measures sitting in our Perma container..

Anastasia: So future work – coming soon – will look at weighted importance. We’d love your idea of what is important – is HTML more important than text? We want a Command Line (CLI) tool as well. And then we want to look at a similarity measure for images – other research on this out there, we need to look at that. We want a “Paranoia” heuristic – to see EVERY change, but with a tickbox to allow only the important change. And we need to work together!

Finally we’d like to thank you, and our colleagues at Harvard who support this work.

Q&A

Q1) Nerdy questions… How tightly bound are these similarity measures to the Perma.cc tool?

A1 – Anastasia) Not at all – should be able to use on command line

A1 – Matt) Perma is a Python Django stack and it’s super open source so you should be able to use this.

Comment) This looks super awesome and I want to use it!

Matt) These are really our first steps into this… So we welcome questions, comments, discussion. Come connect with us.

Anastasia) There is so much more work we have coming up that I’m excited about… Cutting up website to see importance of components… Also any work on resources here…

Q2) Do you primarily serve legal scholars? What about litigation stuff Nicholas talked about?

A2) We are in the law school but Perma is open to all. The litigation stuff is interesting..

A2 – Anastasia) It is a multi purpose school and others are using it. We are based in the law school but we are spreading to other places!

Q3) Thank you… There were HTML comparison tools that exist… But they go away and then we have nothing. A CLI will be really useful… And a service comparing any two URLs would be useful… Maybe worth looking at work on Memento damage – missing elements, and impact on the page – CSS, colour, alignment, images missing, etc. and relative importance. How do you highlight invisible changes?

A3 – Anastasia) This is really the complexity of this… And of the UI… Showing the users the changes… Many of our users are not from a technical background… Educating by showing changes is one way. The list with the measures is just very simple… But if a hyperlink has changed, that potentially is more important… So, do we organise the list to indicate importance? Or do we calculate that another way? We welcome ideas about that.

Q3) We have a service running in Memento showing scores on various levels that shows some of that, which may be useful.

Q4) So, a researcher has a copy of what they were looking at… Can other people look at their copy? So, researchers can use this tool as proof that it is what they cited… Can links be shared?

A4 – Matt) Absolutely. We have a way to do that from the Blue Book. Some folks make these private but that’s super super rare…

Understanding user needs (Chair: Nicola Bingham)

Peter Webster, Chris Fryer & Jennifer Lynch: Understanding the users of the Parliamentary Web Archive: a user research project

Chris: We are here to talk about some really exciting user needs work we’ve been doing. The Parliamentary Archives holds several million historical records relating to Parliament, dating from 1497. My role is to ensure that archive continues, in the form of digital records as well. One aspect of that is the Parliamentary Web Archive. This captures around 30 URLs – the official Parliamentary websphere content from 2009. But we also capture official social media feeds – Twitter, Facebook and Instagram. This work is essential as it captures our relationship with the public. But we don’t have a great idea of our users’ needs and we wanted to find out more and understand what they use and what they need.

Peter: The objectives of the study were:

  • assess levels and patterns of use – what areas of the sites they are using, etc.
  • gauge levels of user understanding of the archive
  • understand the value of each kind of content in the web archive – to understand curation effort in the future.
  • test UI for fit with user needs – and how satisfied they were.
  • identify most favoured future developments – what directions should the archive head in next.

The research method was an analysis of usage data, then a survey questionnaire – and we threw lots of effort at engaging people in that. There were then 16 individual user observations, where we sat with the users, asked them to carry out tests and narrate their work. And then we had group workshops with parliamentary staff and public engagement staff, as well as four workshops with the external user community tailored to particular interests.

So we had a rich set of data from this. We identified important areas of the site. We also concluded that the archive, its relationship to the Parliament website, and that website itself needed rethinking from the ground up.

So, what did we find of interest to this community?

Well, we found users are hard to find and engage – despite engaging the social media community – and staff similarly, not least as the internal workshop was just after the EU Ref; that they are largely ignorant about what web archives are – we asked about the UK Web Archive, the Government Archive, and the Parliamentary Archive… It appeared that survey respondents understood what these are BUT in the workshops most were thinking about the online version of Hansard – a kind of archive but not what was intended. We also found that users are not always sure what they’re doing – particularly when viewing, in a live browser, snapshots of the site from previous dates, and that several snapshots might exist from different points in time. There were also some issues with understanding the Wayback Machine surround for the archived content – difficulty understanding what was content and what was the frame. There was a particular challenge around using URL search. People tried everything they could to avoid that… We asked them to find archived pages for the homepage of parliament.uk… And had many searches for “homepage” – there was a real lack of understanding of the browser and the search functionality. There is also no correlation between how well users did with the task and how well they felt they did. I take from that that a lack of feedback, requests or issues does not mean there is not an issue.

Second group of findings… We struggled to find academic participants for this work. But our users prioritised in their own way. It became clear that users wanted discovery mechanisms that match their mental map – and actually the archive mapped more to an internal view of how parliament worked… And browsing taxonomies and structures didn’t work for them. That led to a card sorting exercise to rethink this. We also found users liked structures and wanted discovery based on entities: people, acts, publications – so search connected with that structure works well. Also users were very interested to engage in their own curation, tagging and folksonomy, make their own collections, share materials. Teachers particularly saw potential here.

So, what don’t users want? They have a variety of real needs but they were less interested in derived data sets like link browse; I demonstrated data visualisation, including things like ngrams, work on WARCS; API access; take home data… No interest from them!

So, three general lessons coming out of this… If you are engaging in this sort of research, spend as much resource as possible. We need to cultivate users that we do know, they are hard to find but great when you find them. Remember the diversity of groups of users you deal with…

Chris: So the picture Peter is painting is complex, and can feel quite disheartening. But this work has uncovered issues in some of our assumptions, and really highlights the needs of users in the public. We now have a much better understanding so we can start to address these concerns.

What we’ve done internally is raise the profile of the Parliamentary Web Archive amongst colleagues. We got delayed with procurement… But we have a new provider (MirrorWeb) and they have really helped here too. So we are now in a good place to deliver a user-centred resource at: webarchive.parliament.uk.

We would love to keep the discussion going… Just not about #goatgate! (contact them on @C_Fryer and @pj_webster)

Q&A

Q1) Do you think there will be tangible benefits for the service and/or the users, and how will you evidence that?

A1 – Chris) Yes. We are redeveloping the web archive. And as part of that we are looking at how we can connect the archive to the catalogue, and that is all part of the new online services project. We have tangible results to work on… It’s early days but we want to translate it into tangible benefits.

Q2) I imagine the parliament is a very conservative organisation that doesn’t delete content very often. Do you have a sense of what people come to the archive for?

A2 – Chris) Right now it is mainly people who are very aware of the archive, what it is and why it exists. But the research highlighted that many of the people less familiar with the archive wanted the archived versions of content on the live site, and the older content was more of interest.

A2 – Peter) One thing we did was to find out what the difference was between what was on the live website and what was on the archive… And looking ahead… The archive started in 2009… But demand seems to be quite consistent in terms of type of materials.

A2 – Chris) But it will take us time to develop and make use of this.

Q3) Can you say more about the interface and design… So interesting that they avoid the URL search.

A3 – Peter) The outsourced provider was Internet Memory Research… When you were in the archive there was an A-Z browser, a keyword search and a URL search. Above that, the parliament.uk site had a taxonomy that linked out, and that didn’t work. I asked them to use that browse and it was clear that their thought process directed them to the wrong places… So the recommendation was that it needs to be elsewhere, and more visible.

Q4) You were talking about users wanting to curate their own collections… Have you been considering setting up user dashboards to create and curate collections.

A4 – Chris) We are hoping to do that with our website and service, but it may take a while. But it’s a high priority for us.

Q5) I was interested to understand: the users that you selected for the survey… Were they connected before and part of the existing user base, or did you find them through your own efforts?

A5 – Peter) a bit of both… We knew more about those who took the survey and they were the ones we had in the observations. But this was a self selecting group, and they did have a particular interest in the parliament.

Emily Maemura, Nicholas Worby, Christoph Becker & Ian Milligan: Origin stories: documentation for web archives provenance

Emily: We are going to talk about origin stories and it comes out of interest in web archives, provenance, trust. This has been a really collaborative project, and working with Ian Milligan from Toronto. So, we have been looking at two questions really: How are web archives made? How can we document or communicate this?

We wanted to look at choices and decisions in creating collections. We have been studying the creation of University of Toronto Libraries (UTL) Archive-It collections:

  • Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving.
  • Toronto 2015 Pan Am games (crawled regularly for one month one-off event)
  • Global Summitry Archive

So, thinking about web archives and how they are made we looked at the Web Archiving Life Cycle Model (Bragg et al 2013), which suggests a linear process… But the reality is messier… and iterative as test crawls are reviewed, feed into production crawls… But are also patched as part of QA work.

From this work then we have four things you should document for provenance:

  1. Scoping is iterative and regularly reviewed, and the data budget is a key part of this.
  2. The Process of crawls is important to document as the influence of live web content and actors can be unpredictable
  3. There may be different considerations for access, choices for mode of access can impact discovery, and may be particularly well suited to particular users or use cases.
  4. The fourth thing is context, and the organisational or environmental factors that influence web archiving program – that context is important to understand those decision spaces and choices.

Nick: So, in order to understand these collections we had to look at the organisational history of web archiving. For us web archiving began in 2005, and we piloted what became Archive-It in 2006. It was in a liminal state for about 8 years… There were few statements around collection development until last year, really. But the new policy talks about scoping, policy, permissions, etc.

So that transition towards service is reflected in staffing. It is still a part-time commitment but is written into several people’s job descriptions now; it is higher profile. But there are resourcing challenges around crawling platforms – the earliest archives had to be automatic; data budgets; storage limits. There are policies, permissions, robots.txt policy, access restrictions. And there is the legal context… Copyright laws changed a lot in 2012… Started with permissions, then opt-outs, but now it’s take-down based…

Looking in turn at these collections:

Canadian Political Parties and Political Interest Groups (crawled quarterly) – long running, continually collected and ever-evolving. Covers main parties and ever changing group of loosely defined interest groups. This was hard to understand as there were four changes of staff in the time period.

Toronto 2015 Pan Am games (crawled regularly for one month one-off event) – based around a discrete event.

Global Summitry Archive – this is a collaborative archive, developed by researchers. It is a hybrid and is an ongoing collection capturing specific events.

In terms of scoping we looked at motivation – whether a mandate, an identified need or use, or collaboration or coordination amongst institutions. These projects are based around technological budgets and limitations… In some cases we only really understand what’s taking place when we see crawling happening. Researchers did think ahead but, for instance, video is excluded… And there is no description of why text was prioritised over video or other storage. You can see evidence of a lack of explicit justifications for crawling particular sites… We have some information and detail, but it’s really useful to annotate content.

In the most recent elections the candidate sites had altered robots.txt… They weren’t trying to block us but the technology used and their measures against DDOS attacks had that effect.

In terms of access we needed metadata and indexes, but the metadata and how they are populated shapes how that happens. We need interfaces but also data formats and restrictions.

Emily: We tried to break out these interdependencies and interactions around what gets captured… Whether a site is captured is down to a mixture of organisational policies and permissions; legal context and copyright law for fair dealing, etc. The wider context elements also change over time… Including changes in staff, as well as changes in policy, in government, etc. This can all impact usage and clarity of how what is there came to be.

So, conclusions and future work… In telling the origin stories we rely on many different aspects and it is very complex. We are working towards an extended paper. We believe a little documentation goes a long way… We have a proposal for structured documentation: goo.gl/CQwMt2

Q&A

Q1) We did this exercise in the Netherlands… We needed to go further in the history of our library… Because in the ’90s we already collected interesting websites for clients – the first time we thought about the web as an important stance.. But there was a gap there between the main library work and the web archiving work…

Q2) I always struggle with what can be conveyed that is not in the archive… Sites not crawled, technical challenges, sites that it is decided not to crawl early on… That very initial thinking needs to be conveyed to pre-seed things… Hard to capture that…

A2 – Emily) There is so much in scoping that is before the seed list that gets into the crawl… Nick mentioned there are proposals for new collections that explains the thinking…

A2 – Nick) That’s about the best way to do it… Can capture pre-seeds and test crawls… But need that “what should be in the collection”

A2 – Emily) The CPPP is actually based on a prior web list of suggested sites… Which should also have been archived.

Q3) In any kind of archive the same issues are hugely present… Decisions are rarely described… Though there is a whole area of postmodern archival description around that… But a lot comes down to the creator of the collection. But I haven’t seen much work on what should be in the archive that is expected to be there… A different context I guess…

A3 – Emily) I’ve been reading a lot of post modern archive theory… It is challenging to document all of that, especially in a way that is useful for researchers… But have to be careful not to transfer over all those issues from the archive into the web archive…

Q4) You made the point that the liberal party candidate had blocked access to the Internet Archive crawler… That resonated for me as that’s happened a few times for our own collection… We have legal deposit legislation and that raises questions of whose responsibility it is to take that forward..

A4 – Nick) I found it fell to me… Once we got the right person on the phone it was an easy case to make – and it wasn’t one site but all the candidates for that party!

Q5) Have you had any positive or negative responses to opt-outs and take-downs?

A5 – Nick) We don’t host our own Wayback Machine so use their policy. We honour take-downs but get very very few. Our communications team might have felt differently but we had someone quite bullish in charge.

Nicola) As an institution there is a very variable appetite for risk – hard to communicate internally, let alone externally to our users.

Q6) In your research have you seen any web archive documenting themselves well? People we should follow? Or based mainly on your archives?

A6) It’s mainly based on our own archives… We haven’t done a comprehensive search of other archives’ documentation.

Jackie Dooley, Alexis Antracoli, Karen Stoll Farrell & Deborah Kempe: Developing web archiving metadata best practices to meet user needs

Alexis: We are going to present on the OCLC Research Library Partnership web archive working group. So, what was the problem? Well, web archives are not very easily discoverable in the ways people are usually used to discovering archives or library resources. This was the most widely shared issue across two OCLC surveys, and so a working group was formed.

At Princeton we use Archive-It, but you had to know we did that… It wasn’t in the catalogue, it wasn’t on the website… So you wouldn’t find it… Then we wanted to bring it into our discovery system but that meant two different interfaces… So… If we take an example of one of our finding aids… We have the College Republican Records (2004-2016) – an on-campus group with websites… This was catalogued with DACS. But how to use the title and dates appropriately? Is the date the content, the seed, what?! And extent – documents, space, or… we went for the number of websites as that felt like something users would understand. We wrote Archive-It into the description… But we wanted guidelines…

So, the objective of this group is to define best practices for web archiving metadata. We have undertaken a literature review, and looked at best practices for descriptive metadata across single and multiple sites.

Karen: For our literature review we looked at peer-reviewed literature but also some other sources, and synthesised that. So, who are the end users of web archives? I was really pleased the UK Parliament work focused on public users, as the research tends to focus on academia. Where we can get some clarity on users is on their needs: to read specific web pages/sites; data and text mining; technology development or systems analysis.

In terms of behaviours, Costa and Silva (2010) classify three groups, much cited by others: navigational, informational and transactional.

Takeaways… A couple of things that we found – some beyond metadata… Raw data can be a high barrier, so users want accessible interfaces and unified searches, but the user does want to engage directly with the metadata to understand the background and provenance of the data. We need to be thinking about flexible formats and engagement. And to enable access we need re-use and rights statements. And we need to be very direct in indicating live versus archived material.

Users also want provenance: when and why was this created? They want context. They want to know the collection criteria and scope.

For metadata practitioners there are distinct approaches… archival and bibliographic approaches – RDA, MARC, Dublin Core, MODS, finding aids, DACS; Data elements vary widely, and change quite quickly.

Jackie: We analysed metadata standards and institutional guidelines; we evaluated existing metadata records in the wild… Our preparatory work raised a lot of questions about building a metadata description… Is the website creator/owner the publisher? Author? Subject? What is the title? Who is the host institution – and will it stay the same? Is it important to clearly state that the resource is a website (not a “web resource”)?

And what does the provenance actually refer to? We saw a lot of variety!

In terms of setting up the context we have use cases for libraries, archives and research… Some comparisons between bibliographic and archival approaches to description; description of archived and live sites – mostly libraries catalogue live, not archived, sites; and then you have different levels… Collection level, site level… And there might be document-level descriptions.

So, we wanted to establish data dictionary characteristics. We wanted something simple, not a major new cataloguing standard. So this is a lean 14-element standard, grounded in those cataloguing rules, so it can be part of wider systems. Among the categories, common elements are used for identification and discovery of types of resources; other elements have to have clear applicability in the discovery of all types of resources. But some things aren’t included as they are not specific to web archives – e.g. audience.

So the 14 data elements are:

  • Access/rights*
  • Collector
  • Contributor*
  • Creator*
  • Date*
  • Description*…

Elements with asterisks are direct maps to Dublin Core fields.

So, Access Conditions (to be renamed as “Rights”) is a direct mapping to Dublin Core “Rights”. This provides the circumstances that affect the availability and/or reuse of an archived website or collection. E.g. for Twitter. And it’s not just about rights because so often we don’t actually know the rights, but we know what can be done with the data.

Collector was the strangest element… There is no equivalent in Dublin Core… This is the organisation responsible for curation and stewardship of an archived website or collection. The only other place that uses Collector is the Internet Archive. We did consider “repository” but, while it may do all those things… for archived websites… the site lives elsewhere but e.g. Princeton decides to collect those things.

We have a special case for Collector where Archive-It creates its own collection…
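To make the mapping concrete, here is a hedged sketch of what a lean record might look like, serialised with Python's stdlib. The element values are invented for illustration (loosely based on the College Republican Records example above) and this is not the working group's official serialisation; the point is that asterisked elements map to Dublin Core namespaced fields while "collector" has no Dublin Core home.

```python
import xml.etree.ElementTree as ET

# Hypothetical lean record for an archived website; values invented.
record = {
    "title": "College Republican Records website (archived)",
    "creator": "Princeton University College Republicans",
    "date": "2004-2016",
    "rights": "Open for research use",
    "collector": "Princeton University Library",  # no Dublin Core equivalent
}

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)
root = ET.Element("record")
for name, value in record.items():
    # Elements with a direct Dublin Core mapping get the dc: namespace;
    # "collector" has none, so it stays a local element.
    tag = f"{{{DC}}}{name}" if name != "collector" else "collector"
    ET.SubElement(root, tag).text = value

print(ET.tostring(root, encoding="unicode"))
```

In practice the same record could equally be expressed in MARC or EAD; the dictionary-plus-mapping shape is just the leanest way to show which fields travel to Dublin Core unchanged.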

So, we have three publications, due out in July, on this work…

Q&A

Q1) I was a bit disappointed in the draft report – it wasn’t what I was expecting… We talked about complexities of provenance and wanted something better to convey that to researchers, and we have such detailed technical information we can draw from Archive-It.

A1 – Jackie) Our remit was about description, only. Provenance is bigger than that. Descriptive metadata was the appropriate scope. We did a third report on harvesting tools and whether metadata could be pulled from them… We should perhaps have had “descriptive” in our working group name too…

A1) It is maybe my fault too… But it’s that mapping of DACS that is not perfect… We are taking a different track at the University at Albany.

A1 – Jackie) This is NOT a standard; it addresses an absence of metadata that often exists for websites. Scalability of metadata creation is a real challenge… The average time available is 0.25 FTE looking at this. The provenance, the nuance of what was and was not crawled, is not doable at scale. This is intentionally lean. If you will be using DACS then a lot of data goes straight in. All standards, with the exception of Dublin Core, are more detailed…

Q2) How difficult is this to put into practice for MARC records? For us we treat a website as a collector… You tend to describe the online publication… A lot of what we’d want to put in just can’t make it in…

A2 – Jackie) In MARC the 852 field is the closest to Collector that you can get. (Collector is comparable to Dublin Core’s Contributor; EAD’s <repository>; MARC’s 524, 852a and 852b; MODS’ location; or schema.org’s schema:OwnershipInfo.)

Researcher case studies (Chair: Alex Thurman)

Jane Winters: Moving into the mainstream: web archives in the press

This paper accompanies my article for the first issue of Internet Histories. I’ll be talking about the increasing visibility of web archives and much greater public knowledge of web archive.

So, who are the audiences for web archives? Well, they include researchers in the arts, humanities and social sciences – my area, and where some tough barriers are. They are also policymakers, particularly crucial in relation to legal deposit and access. Also the “general public” – though it is really many publics. And journalists, as mediators with the public.

What has changed with media? Well, there was an initial focus on technology which reached an audience predisposed to that. But increasingly web archives come into discussion of politics and current affairs, and there are also social and cultural concerns starting to emerge. There is real interest around launches and anniversaries – a great way for web archives to get attention, like the Easter Rising archive we heard about this week. We do also get that “digital dark age” klaxon, which web archives can and do address. And with Brexit and Trump there is a silver lining… And a real interest in archives as a result.

So in 2013 Niels Brügger arranged the first RESAW meeting in Aarhus. And at that time we had one of these big media moments…

Computer Weekly, 12th November 2013, reported on Conservatives erasing official records of speeches from the Internet Archive as a serious breach. Coverage in computing media migrated swiftly to coverage in the mainstream press, the Guardian’s election coverage; BBC News… The hook was that a number of those speeches were about the importance of the internet to open public debate… That hook, that narrative was obviously lovely for the media. Interestingly the Conservatives then responded that many of those speeches were actually still available in the BL’s UK Web Archives. The speeches also made Channel 4 News – and they used it as a hook to talk about broken promises.

Another lovely example was Dr Anat Ben-David from the Open University who got involved with BBC Click on restoring the lost .yu domain. This didn’t come from us trying to get something in the news… They knew our work and we could then point them in the direction of really interesting research… We can all do this highlighting and signposting which is why events like this are so useful for getting to know each others’ work.

When you make the tabloids you know you’ve done well… In 2016 the BBC Food website was faced with closure as part of cuts. The Independent didn’t lead with this, but with how to find recipes when the website goes… They directed everyone to the Internet Archive – as it’s open (unlike the British Library). Although the UK Web Archive blog did post about this, explaining what they are collecting and why they collect important cultural materials. The BBC actually back-pedalled – maintaining the pages, but not updating them. But that message got out that web archiving is for everyone… Building it into people’s daily lives.

The launch of the UK Web Archive in 2013 was covered by the BBC (including the fact that it is not online). The 20th anniversary of the BnF archive had a lot of French press coverage. That’s a great hook as well. Then I mentioned that Digital Dark Age set of stories… Bloomberg had the subtitle “if you want to preserve something, print it” in 2016. We saw similar from the Royal Society. But generally journalists do know who to speak to from the BL, or DPC, or IA to counter that view… It can be a really positive story. Even that negative story can be used as a positive thing if you have that connection with journalists…

So this story: “Raiders of the Lost Web: If a Pulitzer-finalist 34-part series can disappear from the web, anything can” looks like it will be that sort of story again… But actually this is about the forensic reconstruction of the work. And the article also talks about cinema at risk, again also preserved thanks to the Internet Archive. This piece of journalism that had been “lost” was about the death of 23 children in a bus crash… It was lost twice – it was barely reported, then the story disappeared… But the longer article here talks about that case and the importance of web archiving as a whole.

Talking of traumatic incidents… Brexit coverage of the NHS £350m per week saving on the Vote Leave website… But it disappeared after the vote. BUT you can use the Internet Archive, and the structured referendum collection from the UK Legal Deposit libraries, so the promises are retained into the long term…

And finally, on to Trump! In an Independent article on Melania Trump’s website disappearing, the journalist treats the Internet Archive as another source, a way to track change over time…

And indeed all of the coverage of IA in the last year, and their mirror site in Canada, that isn’t niche news, that’s mainstream coverage now. The more we have stories on data disappearing, or removed, the more opportunities web archives have to make their work clear to the world.

Q&A

Q1) A fantastic talk and close to my heart as I try to communicate web archives. I think that web archives have fame when they get into fiction… The BBC series New Tricks had a denouement centred on finding a record on the Internet Archive… Are there other fictional representations of web archives?

A1) A really interesting suggestion! Tweet us both if you’ve seen that…

Q2) That coverage is great…

A2) Yes, being held to account is a risk… But that is a particular product of our time… Hopefully when it is clear that it is evidence for any set of politicians… The users may be partisan, even if the content is… It’s a hard line to tread… Non publicly available archives mitigate that… But absolutely a concern.

Q3) It is a big win when there are big press mentions… What happens… Is it more people aware of the tools, or specifically journalists using them?

A3) It’s both, but I think it’s how news travels… More people will read an article in the Guardian than will look at the BL website. But they really demonstrate the value and importance of the archive. You want that public support – like the 100k petition over the BBC recipe website. We ran a workshop here on a random Saturday recently… It was pitched as tracing family or local history… And a couple were delighted to find their church community website from 15 years ago… It was easy to show the value of the archive that way… We did a gaming event with late 1980s games in the IA… That’s brilliant – a kid’s birthday party was going to be inspired by that – that’s a fab use we hadn’t thought of… But journalism is often the easy win…

Q4) Political press and journalistic use is often central… But I love that GifCities project… The nostalgia of the web… The historicity… That use… The way they highlight the datedness of old web design is great… The way we can associate archives with web vernacular that is not evidenced elsewhere is valuable and awesome… Leveraging that should be kept in mind.

A4) The GifCities always gets a “Wow” – it’s a great way to engage people in a teaching setting… Then lead them onto harder real history stuff..!

Q5) Last year when we celebrated the anniversary I had a chance to speak with journalists. They were intrigued that we collect blogs, forums, stuff that is off the radar… And they titled the article “Maybe your Sky Blog is being archived in France” (Sky Blogs is a popular teen blog platform)… But what does it mean not to forget the stupid things you wrote on the internet when you were 15…

A5) We’ve had three sessions so far, and only once did that question arise… But maybe people aren’t thinking like that. It’s more of an issue for the public archive… Less of a worry for a closed archive… But so much of the embarrassing stuff is in Facebook so not in the archive. It matters especially with the right to be forgotten legislation… But there is also that thing of having something worth archiving…

Q6) The thing about The Crossing is interesting… Their font was copyrighted… They had to get specific permission from the designer… But that site is in Flash… And soon you’ll need Ilya Kreymer’s oldweb tools to see it at all.

A6) Absolutely. That’s a really fascinating article and they had to work to revive and play that content…

Q6) And six years old! Only six years!

Cynthia Joyce: Keyword ‘Katrina’: a deep dive through Hurricane Katrina’s unsearchable archive

I’ll be talking about how I use web archives, rather than engaging with the technology directly. I was a journalist for 20 years before teaching journalism, which I do at the University of Mississippi. Every year we take a study group to New Orleans to look at the outcome of Katrina. Katrina was 12 years ago. But there has been a lot of gentrification and so there are few physical scars there… It was weird to have to explain how hard things were to my 18-year-old students. And I wanted to bring that to life… But not just through the news coverage, which is framed as anniversary or update pieces… The story is not a discrete event but an era…

I found the best way to capture that era was through blogging. New Orleans was not a tech-savvy space; it was a poor, black, high-levels-of-illiteracy sort of space. Web 1.0 had skipped New Orleans and the Deep South in a lot of ways… It was pre-Twitter, Facebook was in its infancy, mobiles were primitive. Katrina was probably when many in New Orleans started texting – doable on struggling networks. There was also that Digital Divide – it’s out of fashion to talk about this but it is a real gap.

So, 80% of the city flooded, more than 800 people died, 70% of residents were displaced. The storm didn’t cause the problems here, it was the flooding and the failure of the levees. That is an important distinction, as that sparked the rage, the activism, the need for action was about the sense of being lied to and left behind.

I was working as a journalist for Salon.com from 1995 – very much web 1.0. I was an editor at Nola.com post-Katrina. And I was a resident of New Orleans 2001-2007. We had questions of what to do with comments, follow-up, retention of content… A lot of content didn’t need preserving… But actually that set of comments should be the shame of Advance Digital and Condé Nast… It was interesting how little help they provided to Nola.com, one of their client papers…

I was conducting research as a citizen, but with journalistic principles and approaches… My method was madness, basically… I had instincts, stories to follow, high points, themes that had been missed in mainstream media. I interviewed a lot of people… I followed and used a cross-list of blogrolls… This was a lot of surfing, not just searching…

The Wayback Machine helped me so much there, to see that blogroll, seeing those pages… That idea of the vernacular, drilled down into 10 years later, was very helpful and interesting… To experience it again… To go through, to see common experiences… I also did social media posts and call-outs – an affirmative action approach. African American people were on camera, but there wasn’t a lot of first-party documentation… I posted something on Binders Full of Women Writers… I searched more than 300 blogs. I chose the entries… I did it for them… I picked out moving, provocative, profound content… Then let them opt out, or suggest something else… It was an ongoing dialogue with 70 people crowd-curating a collective diary. New Orleans Press produced a physical book, and I sent it to Jefferson and IA created a special collection for this.

In terms of choosing themes… The original TOC was based on categories that organically emerged… It’s not all sad, it’s often dark humour…

  • Forever days
  • An accounting
  • Led Astray (pets)
  • Re-entry
  • Kindness of Strangers
  • Indecision
  • Elsewhere = not New Orleans
  • Saute Pans of Mercy (food)
  • Guyville

Guyville for instance… for months no schools were open, so it was a really male space, then so much construction… But some women there though that was great too. A really specific culture and space.

Some challenges… Some work was journalists writing off the record. We got permissions where we could – we have them for all of the people who survived.

I just wanted to talk about Josh Cousin, a former resident of the St Bernard projects. His nickname was the “Bookman” – he was an unusual, nerdy kid and was 18 when Katrina hit. They stayed… But were forced to leave eventually… It was very sad… They were forced onto a bus, not told where they were going; they took their dog… Someone on the bus complained. Cheddar was turfed onto the highway… They got taken to Houston. The first post Josh posted was a defiant “I made it” type post… He first had online access when he was at the Astrodome. They had online machines that no-one was using… But he was… And he started getting mail, shoes, stuff in the post… He was training people to use these machines. This kid is a hero… At the sort-of book launch for contributors he brought Cheddar the dog… Through Petfinder… Cheddar had been adopted by a couple in Connecticut who had renamed him “George Michael” – they tried to make Josh pay $3,000 as they didn’t want their dog going back to New Orleans…

In terms of other documentary evidence… The material is all PDF only… The email record of Michael D. Brown… shows he’s concerned about dog sitting… And he later criticised people for not evacuating because of their pets… Two weeks later his emails do talk about pets… There were obviously other things going on… But this narrative, this diary of that time… really brings this reality to life.

I was in a newsroom during Arab Spring… And that’s when they had no option but to run what’s on Twitter, it was hard to verify but it was there and no journalists could get in. And I think Katrina was that kind of moment for blogging…

On Archive-It you can find the Katrina collection… Reactions ranged from resistance and suspicion to gratitude… Some people barely remembered writing stuff, and certainly didn’t expect it to be archived. I was collecting 8-9 years later… I was reassured to read about a historian at the Holocaust museum (in the Chronicle of Higher Ed) who wasn’t convinced about blogging, until Trump said something stupid and that triggered her to engage.

Q&A

Q1 – David) In 2002 the LOCKSS program had a meeting with subject specialists at New York Public Library… And among those publications deemed worth preserving was The Exquisite Corpse, published out of New Orleans. After Katrina we were able to give Andrei Codrescu back his materials and it carried on publishing until 2015… A good news story of archiving from that time.

A1) There are dozens of examples… The things that I found too is that there is no appointed steward… If no institutional support it can be passed round, forgotten… I’d get excited then realise just one person was the advocate, rather than an institution to preserve it for posterity.

Andrei wrote some amazing things, and captured that mood in the early days of the storm…

Q2) I love how your work shows blending of work and sources and web archives in conversation with each other… I have a mundane question… Did you go through any human subjects approval for this work from your institution.

A2) I was an independent journalist at the time… But went to the University of New Orleans as the publisher had done a really interesting project with community work… I went to ask them if this project already existed… And basically I ended up creating it… He said “are you pitching it?” and that’s where it came from. Naïveté benefited me.

Q3) Did anyone opt out of this project, given the traumatic nature of this time and work?

A3) Yes, a lot of people… But I went to people who were kind of thought leaders here, who were likely to see the benefit of this… So, for instance, Karen Gadbois had a blog called Squandered Heritage (now The Lens, the ProPublica of New Orleans)… And participation of people like that helped build confidence and validity for the project.

Colin Post: The unending lives of net-based artworks: web archives, browser emulations, and new conceptual frameworks

Framing an artwork is never easy… Art objects are “lumps” of the physical world to be described… But what about net-based artworks? How do we make these objects of art history… And they raise questions of how we define an artwork in the first place… I will talk about Homework by Alexei Shulgin (http://www.easylife.org/homework/) as an example of where we need techniques and practices of web archiving around net-based artworks. I want to suggest a new conceptualisation of net-based artworks as plural, proliferating, heterogeneous archives. Homework is typical, and includes pop-ups and self-conscious elements that make it challenging to preserve…

So, this came from a real assignment for Natalie Bookchin’s course in 1997. Alexei Shulgin encouraged artists to turn in homework for grading, and did so himself… And his piece was a single sentence followed by pop-up messages – something we use differently today, with different significance… Pop-ups proliferate across the screen like spam, making the user aware of the browser and its affordances and role… Homework replicates structures of authority and expertise – grading, organising, critiques, including or excluding artists… But rendered absurd…

Homework was intended to be ephemeral… But Shulgin curates assignments turned in, and late assignments. It may be tempting to think of these net works as performance art, with records only of a particular moment in time. But actually this is a full record of the artwork… Homework has entered into archives as well as Shulgin’s own space. It is heterogeneous… All acting on the work. The nature of pop-up messages may have changed since the conditions of its original creation, and the work is still changing in the world today.

Shulgin, in conversation with Armin Medosch in 1997, felt “The net at present has few possibilities for self expression but there is unlimited possibility for communication. But how can you record this communicative element, how can you store it?”. There are so many ways and artists but how to capture them… One answer is web archiving… There are at least 157 versions of Homework in the Internet Archive.. This is not comprehensive, but his own site is well archived… But capacity of connections is determined by incidence rather than choice… The crawler only caught some of these. But these are not discrete objects… The works on Shulgin’s site, the captures others have made, the websites that are still available, is one big object. This structure reflects the work itself, archival systems sustain and invigorate through the same infrastructure…

To return to the communicative elements… Archives do not capture the performative aspects of the piece. But we must also attend to the way the object has transformed over time… In order to engage with complex net-based artworks… They cannot be easily separated into “original” and “archived” but exist more as a continuum…

Frank Upward (1996) describes the Records Continuum Model… This is around four dimensions: Creation, Capture, Organisation, and Pluralisation. All of these are present in the archive of Homework… As copies appear in the Internet Archive, in Rhizome… And spread out… You could describe this as the vitalisation of the artwork on the web…

oldweb.today at Rhizome is a way to emulate the browser… This provides some assurance of the retention of old websites… But that is not a direct representation of the original work… The context and experience can vary – including the (now) speedy load of pages… And possible changes in appearance… When I load Homework here… I see 28 captures all combined, from records over 10 years… The piece wasn’t uniformly archived at any one time… I view the whole piece but actually it is emulated and artificial… It is disintegrated and inauthentic… But in the continuum it is another continuous layer in space and time.

Niels Brügger in “Website History” (2010) talks about “writing the complex strategic situation in which an artefact is entangled”. Digital archives and emulators preserve Homework, but are in themselves generative… That isn’t exclusive to web archiving… We see it in Eugène Viollet-le-Duc (1854/1996), who talks about re-establishing a work in a finished state that may never in fact have existed at any point in time.

Q1) a really interesting and important work, particularly around plurality. I research at Rhizome and we have worked with Net Art Anthology – an online exhibition with emulators… is this faithful… should we present a plural version of the work?

A1) I have been thinking about this a lot… but i don’t think Rhizome should have to do all of this… art historians should do this contextual work too… Net Art Anthology does the convenience access work but art historians need to do the context work too.

Q1) I agree completely. For an art historian what provenance metadata should we provide for works like this to make it most useful… Give me a while and I’ll have a wish list… 

Comment) a shout out for Gent in Belgium is doing work on online art so I’ll connect you up.

Q2) Is Homework still an active interactive work?

A2) The final submissions were really in 1997 – it’s only in the IA now… It did end at that time… so experiencing the piece is about looking back… that is artefactual, a trace. But Shulgin has past work on his page… a sort of capture and framing as archive.

Q3) How does Homework fit in your research?

A3) I’m interested in 90s art, preservation, and their interactions.

Q4) Have you seen that job of contextualisation done well, presented with the work? I’m thinking of Ellie Harrison’s quantified self work and how different that looked at the time from now…

A4) Rhizome does this well, as do galleries collecting net artists… especially with emulated works… The Guggenheim showed originals and emulated versions, and part of that work was foregrounding the preservation and archiving aspects of the work.

Closing remarks: Emmanuelle Bermès & Jane Winters

Emmanuelle: Thank you all for being here. These were three very intense days. Five days for those at Archives Unleashed. To close, a few comments on IIPC. We were originally to meet in Lisbon, and I must apologise again to our Portuguese colleagues; we hope to meet again there… But co-locating with RESAW was brilliant – I saw a tweet that we are creating archives in the room next door to those who use and research them. And researchers are our co-creators.

And so many of our questions this week have been about truth and reliability and trust. This is a sign of growth and maturity of the groups. 

IIPC has had a tough year. We are still a young and fragile group… we have to transition to a strong worldwide community. We need all the voices and inputs to grow and to transform into something more resilient. We will have an annual meeting at an event in Ottawa later this year.

Finally thank you so much to Jane and colleagues from RESAW, and to Nicholas and WARC committee, and Olga and BL to get this all together so well.

Jane: you were saying how good it has been to bring archivists and researchers together, to see how we can help and not just ask… A few things struck me: discussion of context and provenance; and at the other end permanence and longevity. 

We will have a special issue of Internet Histories so do email us 

Thank you to Niels Brügger and NetLab, The Coffin Trust who funded our reception last night, the RESAW Programme Committee, and the really important people – the events team at University of London, and Robert Kelly who did our wonderful promotional materials. And Olga who has made this all possible.

And we do intend to have another Resaw conference in June in 2 years.

And thank you to Nicholas and Niels for representing IIPC, and to all of you for sharing your fantastic work.

And with that a very interesting week of web archiving comes to an end. Thank you all for welcoming me along!

Jun 15 2017
 

I am again at the IIPC WAC / RESAW Conference 2017 and today I am in the very busy technical strand at the British Library. See my Day One post for more on the event and on the HiberActive project, which is why I’m attending this very interesting event.

These notes are live so, as usual, comments, additions, corrections, etc. are very much welcomed.

Tools for web archives analysis & record extraction (chair Nicholas Taylor)

Digging documents out of the archived web – Andrew Jackson

This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…

  • staff looked in an outlook calendar for reminders
  • looked for new updates since last check
  • download each to local folder and open
  • check catalogue to avoid re-submitting
  • upload to internal submission portal
  • add essential metadata
  • submit for ingest
  • clean up local files
  • update stats sheet
  • Then ingest is usually automated (but can require intervention)
  • Updates catalogue once complete
  • New catalogue records processed or enhanced as necessary.

It was very manual, and very inefficient… So we have created a harvester:

  • Setup: specify “watched targets” then…
  • Harvest (harvester crawl targets as usual) –> Ingested… but also…
  • Document extraction:
    • spot documents in the crawl
    • find landing page
    • extract machine-readable metadata
    • submit to W3ACT (curation tool) for review
  • Acquisition:
    • check document harvester for new publications
    • edit essential metadata
    • submit to catalogue
  • Cataloguing
    • cataloguing records processed as necessary

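As a rough illustration of the “spot documents in the crawl / find landing page” steps above, here is a minimal Python sketch. The filename heuristics and the landing-page guess are my own assumptions for illustration, not the BL harvester’s actual logic:

```python
from urllib.parse import urlparse

# File extensions we treat as "publication documents" (an assumption).
DOC_EXTENSIONS = (".pdf", ".epub", ".doc", ".docx")

def spot_documents(crawled_urls):
    """Return (document_url, landing_page_guess) pairs from a list of crawled URLs."""
    docs = []
    for url in crawled_urls:
        path = urlparse(url).path.lower()
        if path.endswith(DOC_EXTENSIONS):
            # Crude landing-page guess: the parent path of the asset.
            landing = url.rsplit("/", 1)[0] + "/"
            docs.append((url, landing))
    return docs

found = spot_documents([
    "https://www.gov.uk/government/publications/report/annual-report.pdf",
    "https://www.gov.uk/government/publications/report/",
])
print(found)
```

In the real workflow the candidate pairs would then go to W3ACT for curatorial review rather than straight into the catalogue.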
This is better but there are challenges. Firstly, what is a “publication”? With the eprints team there was a one-to-one print and digital relationship. But now there is no more one-to-one. For example, gov.uk publications… An original report will have an ISBN… But the landing page is a representation of the publication; that’s where the assets are… When stuff is catalogued, what can frustrate technical folk is… You take data and text from the page – honouring what is there rather than normalising it… We can dishonour intent by capturing the pages… It is challenging…

MARC is initially alarming… For a developer used to current data formats, it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we do use MARC, and where we want to be now…

One of the intentions of the metadata extraction work was to provide an initial guess of the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that the authors’ names etc. in the document metadata are rarely correct. We use the worst extractor first, and layer up, so we have the best shot. What works best is extracting from the HTML. Gov.uk is a big and consistent publishing space so it’s worth us working on extracting that.
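The “layer up” approach can be sketched like this – a hypothetical chain where each successive extractor is more trusted and its non-empty values overwrite earlier guesses (the extractor functions and field names are invented for illustration):

```python
def layered_extract(doc, extractors):
    """Run extractors from least to most trusted; later non-empty values win."""
    metadata = {}
    for extract in extractors:
        metadata.update({k: v for k, v in extract(doc).items() if v})
    return metadata

# Invented example extractors, from worst (embedded PDF metadata) to best (HTML landing page).
pdf_metadata = lambda doc: {"title": doc.get("pdf_title"), "author": doc.get("pdf_author")}
html_metadata = lambda doc: {"title": doc.get("html_title"), "author": None}

record = layered_extract(
    {"pdf_title": "untitled1.docx", "pdf_author": "jsmith", "html_title": "Annual Report 2016"},
    [pdf_metadata, html_metadata],
)
print(record)  # the HTML title wins; the PDF author survives as a fallback
```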

What works even better is the gov.uk API data – it’s in JSON, it’s easy to parse, it’s worth coding as it is a bigger publisher for us.
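A sketch of why the JSON route is attractive: mapping a publisher-API payload into catalogue-ready fields is only a few lines of code. The payload shape and field names below are simplified assumptions, not the gov.uk API’s exact schema:

```python
import json

# An assumed, simplified publisher-API payload.
payload = json.loads("""{
  "title": "Annual Report 2016",
  "description": "Departmental annual report and accounts.",
  "first_published_at": "2016-07-01T09:30:00+01:00"
}""")

def api_to_record(doc):
    """Map an assumed publisher-API payload to simple catalogue fields."""
    return {
        "title": doc.get("title"),
        "abstract": doc.get("description"),
        "date": (doc.get("first_published_at") or "")[:10],  # keep YYYY-MM-DD
    }

print(api_to_record(payload))
```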

But now we have to resolve references… Multiple use cases for “records about this record”:

  • publisher metadata
  • third party data sources (e.g. Wikipedia)
  • Our own annotations and catalogues
  • Revisit records

We can’t ignore the revisit records… Have to do a great big join at some point… To get best possible quality data for every single thing….

And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But… when I retry document extraction I can accidentally run up another chain each time… If we do our Solr searches correctly it should be easy, so we will be correcting this…

We do need to do more future experimentation… Multiple workflows bring synchronisation problems. We need to ensure documents are accessible when discoverable. And we need to be able to re-run automated extraction.

We want to iteratively improve automated metadata extraction:

  • improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
  • Bring together different sources
  • Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals)

And we still have that tension over what a publication is… A tension between established practice and publisher output. We need to trial different approaches with catalogues and users… and close that whole loop.

Q&A

Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…

A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.

Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert Van de Sompel and Michael Nelson about this thing of describing a document…

A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently….

Q2) Geoffrey Bilder also working on this…

A2) And that’s the ideal… To improve the standards more broadly…

Q3) Are these all PDF files?

A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…

Q4) What does the user see at the end of this… Is it a PDF?

A4) This work ends up in our search service, and that metadata helps them find what they are looking for…

Q4) Do they know its from the website, or don’t they care?

A4) Officially, the way the library thinks about monographs and serials, would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too..

Q4) For me as an archivist all that data on where the document is from, what issues in accessing it they were, etc. would extremely useful…

Q5) You spoke yesterday about engaging with machine learning… Can you say more?

A5) This is where I’d like to do more user work. The library is keen on subject headings – that’s a big high-level challenge so it’s quite amenable to machine learning. We have a massive golden data set… There’s at least a master’s thesis in there, right! And if we built something, then ran it over the 3 million-ish items with little metadata, it could be incredibly useful. In my opinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…

Comment) That thing of everything ending up as a PDF is on the way out by the way… You should look at Distill.pub – a new journal from Google and Y Combinator – and that’s the future of these sorts of formats; it’s JavaScript and GitHub. Can you collect it? Yes, you can. You can visit the page, switch off the network, and it still works… And it’s there and will update…

A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.

Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform

Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.

So, we have fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But, whilst these are amazing collections, they aren’t accessed or used much. I think this is mainly down to two issues: people don’t know they are there; and the access mechanisms don’t fit well with their practices. Maybe when the Archive-It API is live that will fix it all… Right now though it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting, most using the Archive-It service. But, if you are a researcher… to use web archives you really have to be interested and engaged; you need to be an expert.

So, building this portal is about making this easier to use… We want web archives to be used on page 150 in some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, Simon Fraser University – that represents about half of the archive in Canada.

We work on workflow… We run workshops… We separated the collections so that postdocs can look at them…

We are using Warcbase (warcbase.org) and command line tools; we transferred data from the Internet Archive and generate checksums; we generate scholarly derivatives – plain text, hypertext graph, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at visualisations of the Canadian political parties and political interest group web crawls which track changes, although that may include crawler issues.
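The checksum step in that pipeline might look like the following – a generic fixity check over transferred WARC files, not the WALK project’s actual code:

```python
import hashlib

def file_checksum(path, algorithm="sha1", chunk_size=1 << 20):
    """Stream a (potentially multi-GB) WARC file and return its hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in 1 MB chunks so large WARCs never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Recording these digests at transfer time means later copies of the same WARC can be verified against the original before derivatives are generated.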

Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.

Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tight-knit, and I like how we build and share and develop on each others work. Last year we presented webarchives.ca. We’ve indexed 10 TB of warcs since then, representing 200+ M Solr docs. We have grown from one collection and we have needed additional facets: institution; collection name; collection ID, etc.
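Adding those facets is mostly a matter of extending the Solr query parameters. A hedged sketch – the field names here are my assumptions about their schema, not webarchives.ca’s actual fields:

```python
def facet_params(query, facet_fields=("institution", "collection_name", "collection_id")):
    """Build Solr select-handler parameters faceting over collection-level fields."""
    return {
        "q": query,
        "wt": "json",
        "rows": 10,
        "facet": "true",
        "facet.field": list(facet_fields),  # repeated facet.field params in the request
        "facet.mincount": 1,                # hide facet values with zero hits
    }

params = facet_params("pipeline")
print(params)
```

These parameters would be sent to a `/select` handler; a front end like Blacklight generates essentially the same thing from its configuration.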

Then we have also dealt with scaling issues… from a 30–40 GB to a 1 TB sized index. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on OpenStack… But right now it isn’t at production scale, though it’s getting there. We are looking at SolrCloud and potentially using a shard per collection.

Last year we had a Solr index using the Shine front end… It’s great but… it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is Blacklight, which is in wide use in libraries. There is a bigger community, better APIs, bug fixes, etc… So we have set up a prototype called WARCLight. It does almost all that Shine does, except the tree structure and the advanced searching…

Ian spoke about derivative datasets… For each collection, via Blacklight or ScholarsPortal we want domain/URL Counts; Full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.
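The domain/URL-count derivative is the simplest of these: conceptually it is just a frequency count over the URLs in a collection. A toy sketch (not Warcbase itself, which does this at scale over WARC records; the URLs are invented examples):

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    """Count captured URLs per host – the 'domain count' derivative."""
    return Counter(urlparse(u).netloc for u in urls)

counts = domain_counts([
    "http://www.liberal.ca/platform",
    "http://www.liberal.ca/news",
    "http://www.ndp.ca/",
])
print(counts.most_common())
```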

So, that goal Ian talked about: one central hub for archived data and derivatives…

Q&A

Q1) Do you plan to make graphs interactive, by using Kibana rather than Gephi?

A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That was great but didn’t look great in the browser. But it would be great if the casual user could look at drag and drop R type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…

A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kibana for stuff so in due course we may bring that in…

Q2) Interesting, as we are doing similar things at the BnF. We used Shine, looked at Blacklight, but built our own thing… But we are looking at what we can do… We are interested in those web archive discovery collection approaches, useful in other contexts too…

A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet..

Q2) We tried to give people ARC and WARC files… Our actual users didn’t want that, they want full text…

A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…

Q3) Do you think in a few years’ time…

A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next 3 years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder…

Q4) What are some of the organisational, admin and social challenges of building this?

A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging…. “is an institution going to devote a person to this?”

A4 – Ian) This is about making this more accessible… People are more used to Blacklight than Shine. People respond poorly to WARC. But they can deal with PDFs and CSVs, those are familiar formats…

A4 – Nick) And when I get back I’m going to be doing some work and sharing to enable an actual community to work on this..

Gregory Wiedeman: Automating access to web archives with APIs and ArchivesSpace

A little bit of context here… At University at Albany, SUNY, we are a public university with state records laws that require us to archive. This is consistent with traditional collecting. But we have no dedicated web archives staff – so no capacity for lots of manual work.

One thing I wanted to note is that web archives are records. Some have a paper equivalent, or had one for many years (e.g. the Undergraduate Bulletin). We also have things like Word documents. And then we have things like university sports websites, some of which we do need to keep…

The seed isn’t a good place to manage these as records. But archives theory and practices adapt well to web archives – they are designed to scale, they document and maintain context, with relationship to other content, and a strong emphasis on being a history of records.

So, we are using DACS – Describing Archives: A Content Standard – to describe archives, so why not use that for web archives? It focuses on intellectual content, agnostic of format, and is designed for pragmatic access to archives. We also use ArchivesSpace – a modern tool for aggregated records that allows curators to add metadata about a collection. And it interleaves with our physical archives.

So, for any record in our collection… You can specify a subject… a Python script goes to look at our CDX, looks at the numbers, schedules processes, and then, as we crawl a collection, records the extents and dates collected… And then this shows in our catalogue… So we have our paper records, our digital captures… Users can then find an item, and only then do they need to think about format and context. And there is an awesome article by David Graves(?) which talks about how that aggregation encourages new discovery…
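The CDX-scanning step he describes could be sketched like this – a toy script that tallies capture counts, date range and size for a seed. The 11-field CDX layout assumed here is the common one; the exact fields their script reads are an assumption.

```python
def summarise_cdx(lines, seed_prefix):
    """Count captures, date range and bytes for CDX entries under a seed prefix.

    Assumes the common 11-field CDX layout:
    urlkey timestamp original mimetype status digest redirect meta length offset filename
    """
    count, size, timestamps = 0, 0, []
    for line in lines:
        fields = line.split()
        if len(fields) < 11 or not fields[0].startswith(seed_prefix):
            continue
        count += 1
        timestamps.append(fields[1])
        size += int(fields[8])  # compressed record length in bytes
    return {
        "captures": count,
        "first": min(timestamps) if timestamps else None,
        "last": max(timestamps) if timestamps else None,
        "bytes": size,
    }

sample = [
    "edu,albany)/bulletin 20160101000000 http://albany.edu/bulletin text/html 200 AAAA - - 2048 0 a.warc.gz",
    "edu,albany)/bulletin 20170101000000 http://albany.edu/bulletin text/html 200 BBBB - - 4096 0 b.warc.gz",
]
print(summarise_cdx(sample, "edu,albany)/bulletin"))
```

The resulting summary is the sort of extent statement that can then be pushed into an ArchivesSpace record alongside the paper holdings.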

Users need to understand where web archives come from. They need provenance to frame their research question – it adds weight to their research. So we need to capture what was attempted to be collected – collecting policies included. We have just started to do this with a statement on our website. We need a more standardised content source. This sort of information should be easy to use and comprehend, but it is hard to find the right format to do that.

We also need to capture what was collected. We are using the Archive-It Partner Data API, part of the Archive-It 5.0 system. That API captures:

  • type of crawl
  • unique ID
  • crawl result
  • crawl start, end time
  • recurrence
  • exact date, time, etc…

This looks like a big JSON file. Knowing what has been captured – and not captured – is really important to understand context. What can we do with this data? Well we can see what’s in our public access system, we can add metadata, we can present some start times, non-finish issues etc. on product pages. BUT… it doesn’t address issues at scale.
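Consuming that big JSON file might look something like the sketch below. The field names mirror the list above but are assumptions, not the Partner Data API's documented schema.

```python
import json

def crawl_summary(payload):
    """Pull the provenance fields out of a Partner-Data-style JSON payload.
    Field names here are assumed, not the documented API schema."""
    jobs = json.loads(payload)
    return [
        {
            "id": j.get("id"),
            "type": j.get("type"),
            "result": j.get("crawl_result"),
            "start": j.get("start_date"),
            "end": j.get("end_date"),
        }
        for j in jobs
    ]

sample = json.dumps([
    {"id": 101, "type": "TEST", "crawl_result": "FINISHED",
     "start_date": "2017-05-01T00:00:00", "end_date": "2017-05-02T00:00:00"},
])
print(crawl_summary(sample))
```

A reduced record like this is the kind of thing that can be surfaced on a public access page without exposing the whole raw payload.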

So, we are now working on a new open digital repository using the Hydra system – though it's not called that anymore (now Samvera)! Possibly we will expose data in the API. We need a standardised data structure that is independent of tools. And we also have a researcher education challenge – the archival description needs to be easy to use, re-share and understand.

Find our work – sample scripts, command line query tools – on Github:

http://github.com/UAlbanyArchives/describingWebArchives

Q&A

Q1) Right now people describe collection intent, crawl targets… How could you standardise that?

A1) I don’t know… Need an intellectual definition of what a crawl is… And what the depth of a crawl is… They can produce very different results and WARC files… We need to articulate this in a way that is clear for others to understand…

Q1) Anything equivalent in the paper world?

A1) It is DACS but in the paper world we don’t get that granular… This is really specific data we weren’t really able to get before…

Q2) My impression is that ArchivesSpace isn’t built with discovery of archives in mind… What would help with that?

A2) I would actually put less emphasis on web archives… Long term you shouldn’t have all these things captured. We just need a good API access point really… I would rather it be modular I guess…

Q3) Really interesting… the definition of Archive-It, what’s in the crawl… And interesting to think about conveying what is in the crawl to researchers…

A3) From what I understand the Archive-It people are still working on this… With documentation to come. But we need granular way to do that… Researchers don’t care too much about the structure…. They don’t need all those counts but you need to convey some key issues, what the intellectual content is…

Comment) Looking ahead to the WASAPI presentation… Some steps towards vocabulary there might help you with this…

Comment) I also raised that sort of issue for today’s panels – high-level information on crawl or collection scope. Researchers want to know when crawlers don’t collect things, when they stop – usually to do with freak-outs about what isn’t retained… But that idea of understanding absence really matters to researchers… It is really necessary to get some… There is a crapton of data in the partners API – most isn’t super interesting to researchers, so some community effort to find 6 or 12 data points that can explain that crawl process/gaps etc. would help…

A4) That issue of understanding users is really important, but also hard as it is difficult to understand who our users are…

Harvesting tools & strategies (Chair: Ian Milligan)

Jefferson Bailey: Who, what, when, where, why, WARC: new tools at the Internet Archive

Firstly, apologies for any repetition between yesterday and today… I will be talking about all sorts of updates…

So, WayBack Search… You can now search the WayBackMachine… Including keyword, host/domain search, etc. The index is built on inbound anchor text links to a homepage. It is pretty cool and it’s one way to access this content which is not URL based. We also wanted to look at domain and host routes into this… So, if you look at the page for, say, parliament.uk you can now see statistics and visualisations. And there is an API so you can make your own visualisations – for hosts or for domains.

We have done stat counts for specific domains or crawl jobs… The API is all in json so you can just parse this for, for example, how much of what is archived for a domain is in the form of PDFs.
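The parsing he describes – working out what share of a domain's captures are PDFs – could be done in a couple of lines once you have the mime-type counts. The shape of the counts map is an assumption about the API response, not its documented format.

```python
def mime_share(counts, mime):
    """Fraction of captures with a given MIME type, from a {mime: count} map
    like the JSON the stats API is said to return (shape assumed here)."""
    total = sum(counts.values())
    return counts.get(mime, 0) / total if total else 0.0

counts = {"text/html": 9000, "application/pdf": 1000}
print(mime_share(counts, "application/pdf"))  # → 0.1
```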

We also now have search by format using the same idea – the anchor text, the file and URL path – and you can search for media assets. We don’t have exciting front-end displays yet… But I can search for e.g. puppy, mime type: video, 2014… and get lots of awesome puppy videos [the demo is the Puppy Bowl 2014!]. This media search is available in the WayBackMachine for some media types… And you can again present this in the format and display you’d like.

For search and profiling we have a new 14-column CDX including new language, simhash, and sha256 fields. Language will help users find material in their local/native languages. The SIMHASH is pretty exciting… it allows you to see how much a page has changed. We have been using it on Archive-It partners… And it is pretty good. For instance, seeing a government blog change month to month shows the (dis)similarity.
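To illustrate the idea: simhash produces a fingerprint where similar pages get similar bit patterns, so the Hamming distance between two fingerprints approximates how much a page has changed. The toy implementation below is not the one in the CDX pipeline, just the standard technique.

```python
import hashlib

def simhash(text, bits=64):
    """Toy simhash: hash each token, sum signed bits, collapse to a fingerprint."""
    vector = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(vector):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints: the (dis)similarity."""
    return bin(a ^ b).count("1")

v1 = simhash("budget statement for the fiscal year 2017")
v2 = simhash("budget statement for the fiscal year 2018")
print(hamming(v1, v2))  # a small distance: the pages are near-duplicates
```

Comparing a capture's fingerprint to the previous month's gives exactly the month-to-month change signal described above.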

For those that haven’t seen the capture tool – Brozzler is in production in Archive-It with three dozen organisations using it. This has also led to warcprox developments too. It was intended for AV and social media stuff. We have a chromium cluster… It won’t do domain harvesting, but it’s good for social media.

In terms of crawl quality assurance we are working with the Internet Memory Foundation to create quality tools. These build on internal crawl priorities work at IA – crawler beans, comparison testing. And this is about quality at scale. And you can find reports on how we also did associated work on the WayBackMachine’s crawl quality. We are also looking at tools to monitor crawls for partners, trying to find large-scale crawling quality issues as they happen… There aren’t great analytics… But there is domain-scale monitoring, domain-scale patch crawling, and Slack integrations.

For domain-scale work, for patch crawling, we use WAT analysis for embeds and most-linked resources. We rank by inbound links and add to the crawl. ArchiveSpark is a framework for cluster-based data extraction and derivation (WA+).

Although this is a technical presentation we are also doing an IMLS funded project to train public librarians in web archiving to preserve online local history and community memory, working with partners in various communities.

Other collaborations and research include our End of Term web archive 2016/17, capturing the change of administration… No one is the official custodian for .gov. And this year the widespread deletion of data has given this work greater profile than usual. This time the work was with IA, LOC, UNT, GWU, and others. 250+ TB of .gov/.mil as well as White House and Obama social media content.

There has already been discussion of the Partner Data API. We are currently re-building this, so come talk to me if you are interested. We are working with partners to make sure this is useful, makes sense, and is made more relevant.

We take a lot of WARC files from people to preserve… So we are looking to see how we can get partners to do this with and for us. We are developing a pipeline for automated WARC ingest for web services.

There will be more on WASAPI later, but this is part of work to ensure web archives are more accessible… And that uses API calls to connect up repositories.

We have also built a WAT API that allows you to query most of the metadata for a WARC file. You can feed it URLs and get back what you want – except the page type.

We have new portals and searches now and coming. This is about putting new search layers on TLD content in the WayBackMachine… So you can pick media types, and just from one domain, and explore them all…

And with a statement on what archives should do – involving a gif of a centaur entering a rainbow room – that’s all… 

Q&A

Q1) What are the implications of Chrome’s new headless browsing capabilities for Brozzler?

A1 – audience) It changes how fast you can do things, not really what you can do…

Q2) What about HTTP POST for WASAPI?

A2) Yes, it will be in the Archive-It web application… We’ll change a flag and then you can go and do whatever… And there is reporting on the backend. It doesn’t usually affect crawl budgets, and it should be pretty automated… There is a UI… Right now we do a lot manually; the idea is to do it less manually…

Q3) What do you do with pages that don’t specify encoding… ?

A3) It doesn’t go into url tokenisation… We would wipe character encoding in anchor text – it gets cleaned up before elastic search..

Q4) The SIMHASH is before or after the capture? And can it be used for deduplication

A4) After capture before CDX writing – it is part of that process. Yes, it could be used for deduplication. Although we do already do URL deduplication… But we could compare to previous SIMHASH to work out if another copy is needed… We really were thinking about visualising change…

Q5) I’m really excited about WATS… What scale will it work on…

A5) The crawl is on 100 TB – we mostly use existing WARC and Json pipeline… It performs well on something large. But if a lot of URLs, it could be a lot to parse.

Q6) With quality analysis and improvement at scale, can you tell me more about this?

A6) We’ve given the IMF access to our own crawls… But we have been comparing our own crawls to our own crawls… Comparing to Archive-It is more interesting… And looking at domain level… We need to share some similar-size crawls – BL and IA – and figure out how results look and differ. It won’t be content-based at that stage, it will be hotpads and URLs and things.

Michele C. Weigle, Michael L. Nelson, Mat Kelly & John Berlin: Archive what I see now – personal web archiving with WARCs

Mat: I will be describing tools here for web users. We want to enable individuals to create personal web archives in a self-contained way, without external services. Standard web archiving tools are difficult for non-IT experts. “Save page as” is not suitable for web archiving. Why do this? It’s for people who don’t want to touch the command line, but also to ensure content is preserved that wouldn’t otherwise be. More archives are better.

It is also about creation and access, as both elements are important.

So, our goals involve advancing development of:

  • WARCreate – create WARC from what you see in your browser.
  • Web Archiving Integration Layer (WAIL)
  • Mink

WARCreate is… a Chrome browser extension to save WARC files from your browser; no credentials pass through 3rd parties. It heavily leverages the Chrome webRequest API. But it was built in 2012, and APIs and libraries have evolved since, so we had to work on that. We also wanted three new modes for browser-based preservation: record mode – retain the buffer as you browse; countdown mode – preserve a reloading page on an interval; event mode – preserve a page when automatically reloaded.

So you simply click on the WARCreate button in the browser to generate WARC files – for non-technical people.

Web Archiving Integration Layer (WAIL) is a stand-alone desktop application; it offers collection-based web archiving, and includes Heritrix for crawling, OpenWayback for replay, and Python scripts compiled to OS-native binaries (.app, .exe). One of the recent advancements was a new user interface. We ported from Python to Electron – using web technologies to create native apps. And that means you can use native languages to help you to preserve. We also moved from a single archive to collection-based archiving. We also ported from OpenWayback to pywb. And we started doing native Twitter integration – over time and hashtags…

So, the original app was a tool to enter a URI and then get a notification. The new version is a little more complicated but provides that new collection-based interface. Right now both of these are out there… Eventually we’d like to merge functionality here. So, an example here, looking at the UK election as a collection… You can enter information, then crawl to within defined boundaries… You can kill processes, or restart an old one… And this process integrates with Heritrix to give the status of a task here… And if you want to archive Twitter you can enter a hashtag and interval, you can also do some additional filtering with keywords, etc. And then once running you’ll get notifications.

Mink… is a Google Chrome browser extension. It indicates the archival capture count as you browse, and quickly submits a URI to multiple archives from the UI. The name comes from Mink(owski) space. Our recent enhancements include adding the number of archived pages to an icon at the bottom of the page, allowing users to set preferences on how to view large sets of mementos, and communication with user-specified or local archives…

The old Mink interface could be affected by page CSS, as it sat in the DOM. So we have moved to a shadow DOM, making it more reliable and easy to use. And then you have more consistent, intuitive Miller columns for many captures. It’s an integration of the live and archived web, whilst you are viewing the live web. And you can see year, month, day, etc. And it is refined to what you want to look at. And you have an icon in Mink to make a request to save the page now – with notification of status.

So, in terms of tool integration…. We want to ensure integration between Mink and WAIL so that Mink points to local archives. In the future we want to decouple Mink from external Memento aggregator – client-side customisable collection of archives instead.

See: http://bit.ly/iipcWAC2017 for tools and source code.

Q&A

Q1) Do you see any qualitative difference in capture between WARCreate and Webrecorder?

A1) We capture the representation right at the moment you saw it.. Not the full experience for others, but for you in a moment of time. And that’s our goal – what you last saw.

Q2) Who are your users, and do you have a sense of what they want?

A2) We have a lot of digital humanities scholars wanting to preserve Twitter and Facebook – the stream as it is now, exactly as they see it. So that’s a major use case for us.

Q3) You said it is watching as you browse… What happens if you don’t select a WARC

A3) If you have hit record you could build up content as pages reload and are in that record mode… It will impact performance but you’ll have a better capture…

Q3) Just a suggestion but I often have 100 tabs open but only want to capture something once a week so I might want to kick it off only when I want to save it…

Q4) That real time capture/playback – are there cool communities you can see using this…

A4) Yes, I think with CNN coverage of a breaking storm allows you to see how that story evolves and changes…

Q5) Have you considered a mobile version for social media/web pages on my phone?

A5) Not currently supported… Chrome doesn’t support that… There is an app out there that lets you submit to archives, but not to create WARC… But there is a movement to making those types of things…

Q6) Personal archiving is interesting… But it’s jailed in my laptop… great for personal content… But then can I share my WARC files with the wider community?

A6) That’s a good idea… And more captures is better… So there should be a way to aggregate these together… I am currently working on that, but you would need to be able to specify what is shared and what is not.

Q6) One challenge there is about organisations and what they will be comfortable with sharing/not sharing.

Lozana Rossenova and Ilya Kreymer, Rhizome: Containerised browsers and archive augmentation

Lozana: As you probably know, Webrecorder is a high-fidelity interactive recording of any web site you browse – and how you engage with it. And we have recently released an app built with Electron.

Webrecorder takes a worm’s eye view of archiving, tracking how users actually move around the web… For instance, for Instagram and Twitter posts around #lovewins you can see the quality is high. Webrecorder uses symmetrical archiving – in the live browser and in a remote browser… And you can capture, then replay…

In terms of how we organise webrecorder: we have collections and sessions.

The thing I want to talk about today is on Remote browsers, and my work with Rhizome on internet art. And a lot of these works actually require old browser plugins and tools… So Webrecorder enables capture and replay even where technology no longer available.

To clarify: the programme says “containerised” but we now refer to this as “remote browsers” – still using Docker containers to run these various older browsers.

When you go to record a site you select the browser, and the site, and it begins the recording… The Java applet runs and shows you a visualisation of how it is being captured. You can do this with Flash as well… If we open the same multimedia in your normal (Chrome) browser, it doesn’t work. Restoration is easier when it is just Flash; capturing Flash with other dependencies and interactions needs more.

Remote browsers are really important for Rhizome work in general, as we use them to stage old artworks in new exhibitions.

Ilya: I will be showing some upcoming beta features, including ways to use Webrecorder to improve other archives…

Firstly, which other web archives? So I built a public web archives repository:

https://github.com/webrecorder/public-web-archives

And with this work we are using WAM – the Web Archiving Manifest. And we have added a WARC source URI and a WARC creation date field to the WARC header at the moment.

So, Jefferson already talked about patching – patching remote archives from the live web… This is an approach where we patch either from the live web or from other archives, depending on what is available or missing. So, for instance, if I look at a Washington Post page in the archive from 2nd March… it shows how other archives are being patched in to deliver me a page… In the collection I have a thing called “patch” that captures this.

Once pages are patched, then we introduce extraction… We are extracting again using remote archiving and automatic patching. So you combine extraction and patching features. You create two patches and two WARC files. I’ll demo that as well… So, here’s a page from the CCA website and we can patch that… And then extract that… And then when we patch again we get the images, the richer content, a much better recording of the page. So we have 2 WARCs here – one from the British Library archive, one from the patching that might be combined and used to enrich that partial UKWA capture.

Similarly we can look at a CNN page and take patches from e.g. the Portuguese archive. And once it is done we have a more complete archive… When we play this back you can display the page as it appeared, and patch files are available for archives to add to their copy.
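The patching step has to decide which archive's capture to use for each missing resource; a minimal sketch of a nearest-datetime selection rule follows (my simplification, not Webrecorder's actual code – the archive names and record shape are made up for illustration).

```python
from datetime import datetime

def closest_memento(target, mementos):
    """Pick the capture nearest the requested 14-digit timestamp."""
    fmt = "%Y%m%d%H%M%S"
    want = datetime.strptime(target, fmt)
    return min(
        mementos,
        key=lambda m: abs(datetime.strptime(m["timestamp"], fmt) - want),
    )

candidates = [
    {"archive": "webarchive.org.uk", "timestamp": "20170301120000"},
    {"archive": "arquivo.pt", "timestamp": "20170310090000"},
]
print(closest_memento("20170302000000", candidates)["archive"])  # → webarchive.org.uk
```

As the Q&A notes, a real implementation would also bound the window, rejecting candidates too far from the requested date rather than always taking the minimum.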

So, this is all in beta right now but we hope to release it all in the near future…

Q&A

Q1) Every web archive already has a temporal issue where the content may come from other dates than the page claims… But you could aggravate that problem. Have you considered this?

A1) Yes. There are timebounds for patching. And also around what you display to the user so they understand what they see… e.g. to patch only within the week or the month…

Q2) So it’s the closest date to what is in web recorder?

A2) The other sources are the closest successful result on/closest to the date from another site…

Q3) Rather than a fixed window for collection, seeing frequently of change might be useful to understand quality/relevance… But I think you are replaying

A3) Have you considered a headless browser… with the address bar…

A3 – Lozana) Actually for us the key use case is about highlighting and showcasing old art works to the users. It is really important to show the original page as it appeared – in the older browsers like Netscape etc.

Q4) This is incredibly exciting. But how difficult is the patching… What does it change?

A4) If you take a good capture and a static image is missing… Those are easy to patch in… If highly contextualised – like Facebook, that is difficult to do.

Q5) Can you do this in realtime… So you archive with Perma.cc then you want to patch something immediately…

A5) This will be in the new version I hope… So you can check other sources and fall back to other sources and scenarios…

Comment – Lozana) We have run UX work with an archiving organisation in Europe for cultural heritage and their use case is that they use Archive-It and do QA the next day… A crawl might miss something highly dynamic, so they want to be able to patch it pretty quickly.

Ilya) If you have an archive that is not in the public archive list on Github please do submit it as a fork request and we’ll be able to add it…

Leveraging APIs (Chair: Nicholas Taylor)

Fernando Melo and Joao Nobre: Arquivo.pt API: enabling automatic analytics over historical web data

Fernando: We are a publicly available web archive, mainly of Portuguese websites from the .pt domain. So, what can you do with our API?

Well, we built our first image search using our API, for instance a way to explore Charlie Hebdo materials; another application enables you to explore information on Portuguese politicians.

We support the Memento protocol, and you can use the Memento API. We are one of the time gates for Time Travel searches. And we also have full-text search as well as URL search, through our OpenSearch API. We have extended our API to support temporal searches over the Portuguese web. Find this at: http://arquivo.pt/apis/opensearch/. Full-text search requests can be made through a URL query, e.g. http://arquivo.pt/opensearch?query=euro 2004 would search for mentions of euro 2004, and you can add parameters to this, or search as a phrase rather than keywords.
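A query URL like the one above can be assembled with a couple of lines of Python. This is a sketch of the request construction only; any parameter names beyond `query` follow the talk and may differ from the published API docs.

```python
from urllib.parse import urlencode

BASE = "http://arquivo.pt/opensearch"

def opensearch_url(query, **params):
    """Build an Arquivo.pt OpenSearch request URL (parameter names assumed)."""
    params["query"] = query
    return BASE + "?" + urlencode(params)

print(opensearch_url("euro 2004"))
# → http://arquivo.pt/opensearch?query=euro+2004
```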

You can also search mime types – so just within PDFs for instance. And you can also run URL searches – e.g. all pages from the New York Times website… And if you provide time boundaries the search will look for the capture from the nearest date.

Joao: I am going to talk about our image search API. This works based on keyword searches, you can include operators such as limiting to images from a particular site, to particular dates… Results are ordered by relevance, recency, or by type. You can also run advanced image searches, such as for icons, you can use quotation marks for names, or a phrase.

The request parameters include:

  • query
  • stamp – timestamp
  • start – first index of the search results
  • safeImage (yes; no; all) – restricts the search only to safe images

The response is returned in json with total results, URL, width, height, alt, score, timestamp, mime, thumbnail, nsfw, pageTitle fields.
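A quick illustration of consuming such a response – the wrapper key, field order and sample values below are all made up; only the field names come from the talk.

```python
import json

# A fabricated response body exercising the fields listed above.
sample = json.dumps({
    "totalResults": 1,
    "responseItems": [  # wrapper key is an assumption
        {"url": "http://example.pt/img.png", "width": 400, "height": 300,
         "alt": "badge", "score": 0.9, "timestamp": "20040612000000",
         "mime": "image/png", "thumbnail": "iVBORw0KGgo", "nsfw": False,
         "pageTitle": "Euro 2004"},
    ],
})

data = json.loads(sample)
for item in data["responseItems"]:
    print(item["pageTitle"], item["mime"], item["width"], "x", item["height"])
```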

More on all of this: http://arquivo.pt/apis

Q&A

Q1) How do you classify safe for work/not safe for work

A1 – Fernando) This is a closed beta version. Safe/not safe for work is based on a classifier built around a training set from Yahoo. We are not for blocking things but we want to be able to exclude shocking images if needed.

Q1) We have this same issue in the GifCities project – we have a manually curated training set to handle that.

Comment) Maybe you need to have more options for that measure to provide levels of filtering…

Q2) With that json response, why did you include title and alt text…

A2) We process the image and extract, from the URL, the image text… So we capture the image and the alt text, but we thought that perhaps the page title would be interesting too, giving some sense of context. Maybe the text before/after would also be useful but that takes more time… We are trying to keep this working.

Q3) What is the thumbnail value?

A3) It is in base64. But we can make that clearer in the next version…

Nicholas Taylor: Lots more LOCKSS for web archiving: boons from the LOCKSS software re-architecture

This is following on from the presentation myself and colleagues did at last year’s IIPC on APIs.

LOCKSS came about from a serials librarian and a computer scientist. They were thinking about emulating the best features of the system for preserving print journals, allowing libraries to conserve their traditional role as preserver. The LOCKSS boxes would sit in each library, collecting from publishers’ websites, providing redundancy, sharing with other libraries if and when a publication was no longer available.

18 years on this is a self-sustaining programme running out of Stanford, with tens of networks and hundreds of partners. Lots of copies isn’t exclusive to LOCKSS, but it is the decentralised replication model that addresses the fact that long-term bit integrity is hard to solve – more (correlated) copies doesn’t necessarily keep things safe and can leave them vulnerable to hackers. So this model is community approved, published on, and well established.

Last year we started re-architecting the LOCKSS software as a series of web services. Why do this? Well, to reduce support and operation costs – taking advantage of other software on the web and in web archiving; to de-silo components and enable external integration – we want components to find use in other systems, especially in web archiving; and to prepare to evolve with the web, adapting our technologies accordingly.

What that means is that LOCKSS systems will treat WARC as a storage abstraction, and more seamlessly handle processing layers, proxies, etc. We also already integrate Memento, but this will also let us engage with WASAPI – on which there will be more in the next talk.

We have built a service for bibliographic metadata extraction, for web harvest and file transfer content; we can map values in the DOM tree to metadata fields; we can retrieve downloadable metadata from expected URL patterns; and parse RIS and XML by schema. That model shows our bias towards bibliographic material.
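The "map values in the DOM tree to metadata fields" step might be sketched like this, using the HighWire-style `citation_*` meta tags many publishing platforms emit. This is a toy extractor, not the LOCKSS plugin framework; the field mapping is illustrative.

```python
from html.parser import HTMLParser

# Mapping from meta-tag names to bibliographic fields (illustrative subset).
FIELDS = {"citation_title": "title", "citation_doi": "doi",
          "citation_date": "date"}

class MetaExtractor(HTMLParser):
    """Collect citation_* meta tag values into a bibliographic record."""
    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        if name in FIELDS:
            self.record[FIELDS[name]] = attrs.get("content")

page = """<html><head>
<meta name="citation_title" content="Lots of Copies Keep Stuff Safe">
<meta name="citation_doi" content="10.1000/xyz123">
</head></html>"""
parser = MetaExtractor()
parser.feed(page)
print(parser.record)
```

Per-platform heuristics (Atypon, Digital Commons, etc.) would then amount to different tag names and URL patterns plugged into the same framework.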

We are also using plugins to make bibliographic objects and their metadata on many publishing platforms machine-intelligible. We mainly work with publishing/platform heuristics like Atypon, Digital Commons, HighWire, OJS and Silverchair. These vary so we have a framework for them.

The use cases for metadata extraction include applying it to consistent subsets of content in larger corpora; curating PA materials within broader crawls; retrieving faculty publications online; or retrieving from university CMSs. You can also undertake discovery via bibliographic metadata, with your institution’s OpenURL resolver.

As described in a 2005 D-Lib paper by DSHR et al, we are looking at on-access format migration. For instance x-bitmap to GIF.
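On-access migration rests on HTTP content negotiation: serve the stored format if the client's Accept header allows it, otherwise convert to an acceptable target. A toy decision function, where the converter registry is hypothetical:

```python
def pick_representation(stored_mime, accept_header, converters):
    """Decide what MIME type to serve: the stored one if acceptable,
    else the first acceptable type we know how to convert to, else None."""
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if stored_mime in accepted or "*/*" in accepted:
        return stored_mime
    for target in accepted:
        if (stored_mime, target) in converters:
            return target
    return None

# Hypothetical registry: we can turn x-bitmap into GIF on the fly.
converters = {("image/x-xbitmap", "image/gif")}
print(pick_representation("image/x-xbitmap", "image/gif,image/png", converters))
# → image/gif
```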

Probably the most important core preservation capability is the audit and repair protocol. Network nodes conduct polls to validate the integrity of distributed copies of data chunks. More nodes = more security – more nodes can be down; more copies can be corrupted… The nodes do not trust each other in this model and responses cannot be cached. And when copies do not match, the node audits and repairs.
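A deliberately simplified version of the poll idea: each comparison uses a fresh nonce, so a peer must hash its actual copy every time and cannot cache or copy an answer. The real protocol (see the published work) is far more involved; this sketch only shows why nonces defeat cached responses.

```python
import hashlib
import secrets

def vote(nonce, content):
    """A peer's poll response: hash of its copy under the fresh nonce."""
    return hashlib.sha256(nonce + content).hexdigest()

def poll(my_content, peer_contents):
    """Compare my copy against each peer's under per-comparison nonces;
    repair if a majority disagrees with me."""
    agreements = 0
    for peer in peer_contents:
        nonce = secrets.token_bytes(16)  # fresh nonce per comparison
        if vote(nonce, peer) == vote(nonce, my_content):
            agreements += 1
    return "ok" if agreements > len(peer_contents) / 2 else "repair"

good = b"<html>issue 42</html>"
bad = b"<html>corrupted</html>"
print(poll(good, [good, good, bad]))  # majority agrees → ok
```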

We think that functionality may be useful in other distributed digital preservation networks, in repository storage replication layers. And we would like to support varied back-ends including tape and cloud. We haven’t built those integrations yet…

To date our progress has addressed the WARC work. By end of 2017 we will have Docker-ised components, have a web harvest framework, polling and repair web service. By end of 2018 we will have IP address and Shibboleth access to OpenWayBack…

By all means follow and plug in. Most of our work is in a private repository, which then copies to GitHub. And we are moving more towards a community-orientated software development approach, collaborating more, and exploring use of LOCKSS technologies in other contexts.

So, I want to end with some questions:

  • What potential do you see for LOCKSS technologies for web archiving, other use cases?
  • What standards or technologies could we use that we maybe haven’t considered?
  • How could we help you to use LOCKSS technologies?
  • How would you like to see LOCKSS plug in more to the web archiving community?

Q&A

Q1) Will these work with existing LOCKSS software, and do we need to update our boxes?

A1) Yes, it is backwards compatible. And the new features are containerised so that does slightly change the requirements of the LOCKSS boxes but no changes needed for now.

Q2) Where do you store bibliographic metadata? Or is it in the WARC?

A2) It is separate from the WARC, in a database.

Q3) With the extraction of the metadata… We have some resources around translators that may be useful.

Q4 – David) Just one thing on your simplified example… For each node… They all have to calculate a new separate nonce… None of the answers are the same… They all have to do all the work… It’s actually a system where untrusted nodes are compared… And several nodes can’t gang up on the others… Each peer randomly decides when to poll on things… There is no leader here…

Q5) Can you talk about format migration…

A5) It’s a capability already built into LOCKSS but we haven’t had to use it…

A5 – David) It’s done on the requests in http, which include acceptable formats… You can configure this thing so that if an acceptable format isn’t found, then you transform it to an acceptable format… (see the paper mentioned earlier). It is based on mime type.
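The mime-type-based migration described here can be sketched as simple content negotiation (the transform registry and function names are invented for illustration, not LOCKSS’s actual API): if the stored format is in the request’s acceptable list, serve it unchanged; otherwise look for a registered transform into an acceptable format.

```python
def serve(stored_mime, acceptable, transforms):
    """Return (mime, transform_fn) for one replay request. If the
    stored mime type is already acceptable, serve it unchanged;
    otherwise look for a registered migration into an acceptable one."""
    if stored_mime in acceptable:
        return stored_mime, None
    for target in acceptable:
        if (stored_mime, target) in transforms:
            return target, transforms[(stored_mime, target)]
    raise ValueError(f"no migration from {stored_mime} to {acceptable}")

# Hypothetical registry: migrate legacy XBM images to PNG on request.
transforms = {("image/x-xbitmap", "image/png"): lambda body: b"...png bytes..."}
mime, fn = serve("image/x-xbitmap", ["image/png", "image/gif"], transforms)
```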

Q6) We are trying to use LOCKSS as a generic archive crawler… Is that still how it will work…

A6) I’m not sure I have a definitive answer… LOCKSS will still be web harvesting-based. It will still be interesting to hear about approaches that are not web harvesting based.

A6 – David) Also interesting for CLOCKSS which are not using web harvesting…

A6) For the CLOCKSS and LOCKSS networks – the big networks – the web harvesting portfolio makes sense. But other networks with other content types, that is becoming more important.

Comment) We looked at doing transformation that is quite straightforward… We have used an API

Q7) Can you say more about the community project work?

A7) We have largely run LOCKSS as more of an in-house project, rather than a community project. We are trying to move it more in the direction of say, Blacklight, Hydra….etc. A culture change here but we see this as a benchmark of success for this re-architecting project… We are also in the process of hiring a partnerships manager and that person will focus more on creating documentation, doing developer outreach etc.

David: There is a (fragile) demo where you can try a lot of this… The goal is to continue that through the laws project, as a way to try this out… You can (cautiously) engage with that at demo.laws.lockss.org but it will be published to GitHub at some point.

Jefferson Bailey & Naomi Dushay: WASAPI data transfer APIs: specification, project update, and demonstration

Jefferson: I’ll give some background on the APIs. This is an IMLS funded project in the US looking at Systems Interoperability and Collaborative Development for Web Archives. Our goals are to:

  • build WARC and derivative dataset APIs (AIT and LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
  • Seed and launch community modelled on characteristics of successful development and participation from communities ID’d by project
  • Sketch a blueprint and technical model for future web archiving APIs informed by project R&D
  • Technical architecture to support this.

So, we’ve already run WARC and Digital Preservation Surveys. 15-20% of Archive-it users download and locally store their WARCS – for various reasons – that is small and hasn’t really moved, that’s why data transfer was a core area. We are doing online webinars and demos. We ran a national symposium on API based interoperability and digital preservation and we have white papers to come from this.

Development wise we have created a general specification, a LOCKSS implementation, Archive-it implementation, Archive-it API documentation, testing and utility (in progress). All of this is on GitHub.

The WASAPI Archive-it Transfer API is written in Python, meets all gen-spec criteria, swagger yaml in the repos. Authorisation uses the AIT Django framework (same as the web app), not defined in the general specification. We are using browser cookies or http basic auth. We have a basic endpoint (in production) which returns all WARCs for that account; base/all results are paginated. In terms of query parameters you can use: filename; filetype; collection (ID); crawl (ID for AIT crawl job); etc.

So what do you get back? A JSON object has: pagination, count, request-url, includes-extra. You have fields including account (Archive-It ID); checksums; collection (Archive-It ID); crawl; crawl time; crawl start; filename; filetype; locations; size. And you can request these through simple http queries.
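A hedged sketch of consuming such a response: the talk lists the fields above, but the exact payload layout used here (a top-level `files` array, a hypothetical request URL) is an assumption rather than the published specification.

```python
def collect_warc_locations(pages):
    """Walk paginated WASAPI-style JSON objects and map each WARC
    filename to its download locations."""
    out = {}
    for page in pages:
        for f in page.get("files", []):
            out[f["filename"]] = f["locations"]
    return out

# One page of a hypothetical response, shaped after the fields above.
sample = [{
    "count": 1,
    "request-url": "https://example.org/wasapi/v1/webdata?collection=123",
    "files": [{
        "account": 42,
        "filename": "EXAMPLE-20170601-00000.warc.gz",
        "filetype": "warc",
        "checksums": {"sha1": "deadbeef"},
        "locations": ["https://example.org/webdata/EXAMPLE-20170601-00000.warc.gz"],
        "size": 1073741824,
    }],
}]
locs = collect_warc_locations(sample)
```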

You can also submit jobs for generating derivative datasets. We use existing query language.

In terms of what is to come, this includes:

  1. Minor AIT API features
  2. Recipes and utilities (testers welcome)
  3. Community building research and report
  4. A few papers on WA APIs
  5. Ongoing surveys and research
  6. Other APIs in WASAPI (past and future)

So we need some way to bring together these APIs regularly. And also an idea of what other APIs we need to support, and how to prioritise that.

Naomi: I’m talking about the Stanford take on this… These are the steps Nicholas, as project owner, does to download WARC files from Archive-it at the moment… It is a 13 step process… And this grant funded work focuses on simplifying the first six steps and making it more manageable and efficient. As a team we are really focused on not being dependent on bespoke software; things must be maintainable, continuous integration set up, excellent test coverage, automatable. There is a team behind this work, and this was their first time touching any of this code – you had 3 neophytes working on this with much to learn.

We are lucky to be just down the corridor from LOCKSS. Our preferred language is Ruby but Java would work best for LOCKSS. So we leveraged LOCKSS engineering here.

The code is at: https://github.com/sul-dlss/wasapi-downloader/.

You only need Java to run the code. And all arguments are documented in Github. You can also view a video demo:

YouTube Preview Image

These videos are how we share our progress at the end of each Agile sprint.

In terms of work remaining we have various tweaks, pull requests, etc. to ensure it is production ready. One of the challenges so far has been about thinking crawls and patches, and the context of the WARC.

Q&A

Q1) At Stanford are you working with the other WASAPI APIs, or just the downloads one?

A1) I hope the approach we are taking is a welcome one. But we have a lot of projects taking place, but we are limited by available software engineering cycles for archives work.

Note that we do need a new readme on GitHub

Q2) Jefferson, you mentioned plans to expand the API, when will that be?

A2 – Jefferson) I think that it is pretty much done and stable for most of the rest of the year… WARCs do not have crawl IDs or start dates – hence adding crawl time.

Naomi: It was super useful that the downloader was built by a different team from the team building the WASAPI, as that surfaced a lot of the assumptions, issues, etc.

David: We have a CLOCKSS implementation pretty much building on the Swagger. I need to fix our ID… But the goal is that you will be able to extract stuff from a LOCKSS box using WASAPI using URL or Solr text search. But timing wise, don’t hold your breath.

Jefferson: We’d also like others feedback and engagement with the generic specification – comments welcome on GitHub for instance.

Web archives platforms & infrastructure (Chair: Andrew Jackson)

Jack Cushman & Ilya Kreymer: Thinking like a hacker: security issues in web capture and playback

Jack: We want to talk about securing web archives, and how web archives can get themselves into trouble with security… We want to share what we’ve learnt, and what we are struggling with… So why should we care about security as web archives?

Ilya: Well web archives are not just a collection of old pages… No, high fidelity web archives run untrusted software. And there is an assumption that a live site is “safe” so nothing to worry about… but that isn’t right either…

Jack: So, what could a page do that could damage an archive? Not just a virus or a hack… but more than that…

Ilya: Archiving local content… Well a capture system could have privileged access – on local ports or network server or local files. It is a real threat. And could capture private resources into a public archive. So. Mitigation: network filtering and sandboxing, don’t allow capture of local IP addresses…
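The “don’t allow capture of local IP addresses” mitigation can be sketched with Python’s standard library (the policy a real capture system applies is more involved – think redirects and DNS rebinding – so this is illustrative only):

```python
import ipaddress
import socket
from urllib.parse import urlsplit

def is_capture_allowed(url: str) -> bool:
    """Refuse to crawl URLs that resolve to private, loopback or
    link-local addresses, so a privileged capture process cannot pull
    internal resources into a public archive."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```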

Jack: Threat: hacking the headless browser. Modern captures may use PhantomJS or other browsers on the server, most browsers have known exploits. Mitigation: sandbox your VM

Ilya: Stealing user secrets during capture… Normal web flow… But you have other things open in the browser. Partial mitigation: rewriting – rewrite cookies to exact path only; rewrite JS to intercept cookie access. Mitigation: separate recording sessions – for webrecorder use separate recording sessions when recording credentialed content. Mitigation: Remote browser.

Jack: So assume we are running MyArchive.com… Threat: cross site scripting to steal archive login

Ilya: Well you can use a subdomain…

Jack: Cookies are separate?

Ilya: Not really.. In IE10 the archive within the archive might steal login cookie. In all browsers a site can wipe and replace cookies.

Mitigation: run web archive on a separate domain from everything else. Use iFrames to isolate web archive content. Load web archive app from app domain, load iFrame content from content domain. As Webrecorder and Perma.cc both do.

Jack: Now, in our content frame… how bad could it be if that content leaks… What if we have live web leakage on playback? This can happen all the time… It’s hard to stop that entirely… Javascript can send messages back and fetch new content… to mislead, track users, rewrite history. Bonus: for private archives – any of your captures could export any of your other captures.

The best mitigation is a Content-Security-Policy header, which can limit access to the web archive domain.
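As a rough illustration, such a header might be constructed like this (the domain names and the exact directive list are invented; a production policy needs tuning per archive, and this is not Webrecorder’s or Perma’s actual configuration):

```python
def replay_csp_header(content_domain: str, app_domain: str):
    """Build a Content-Security-Policy header for archived-page replay
    that confines scripts, images, frames and fetches to the archive's
    content domain, and only lets the app domain frame the replay."""
    policy = "; ".join([
        f"default-src 'self' https://{content_domain}",
        f"script-src 'unsafe-inline' 'unsafe-eval' https://{content_domain}",
        f"img-src data: https://{content_domain}",
        f"frame-ancestors https://{app_domain}",
    ])
    return "Content-Security-Policy", policy

header = replay_csp_header("content.myarchive.example", "myarchive.example")
```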

Ilya: Threat: Show different page contents when archived… Pages can tell they’re in an archive and act differently. Mitigation: Run archive in containerised/proxy mode browser.

Ilya: Threat: Banner spoofing… This is a dangerous but quite easy to execute threat. Pages can dynamically edit the archives banner…

Jack: Suppose I copy the code of a page that was captured and change fake evidence, change the metadata of the date collected, and/or the URL bar…

Ilya: You can’t do that in Perma because we use frames. But if you don’t separate banner and content, this is a fairly easy exploit to do… So, Mitigation: Use iFrames for replay; don’t inject banner into replay frame… It’s a fidelity/security trade-off…

Jack: That’s our top 7 tips… But what next… What we introduce today is a tool called http://warc.games. This is a version of webrecorder with every security problem possible turned on… You can run it locally on your machine to try all the exploits and think about mitigations and what to do about them!

And you can find some exploits to try, some challenges… Of course if you actually find a flaw in any real system please do be respectful

Q&A

Q1) How much is the bug bounty?! [laughs] What do we do about the use of very old browsers…

A1 – Jack) If you use an old browser you may be compromised already… But we use the most robust solution possible… In many cases there are secure options that work with older browsers too…

Q2) Any trends in exploits?

A2 – Jack) I recommend the book The Tangled Web… And there is an aspect that when you run a web browser there will always be some sort of issue

A2 – Ilya) We have to get around security policies to archive the web… It wasn’t designed for archiving… But that raises its own issues.

Q3) Suggestions for browser makers to make these safer?

A3) Yes, but… How do you do this with current protocols and APIs

Q4) Does running old browsers and escaping from containers keep you awake at night…

A4 – Ilya) Yes!

A4 – Jack) If anyone is good at container escapes please do write that challenge as we’d like to have it in there…

Q5) There’s a great article called “Familiarity breeds contempt” which notes that old browsers and software get more vulnerable over time… It is particularly a big risk where you need old software to archive things…

A5 – Jack) Thanks David!

Q6) Can you say more about the headers being used…

A6) The idea is we write the CSP header to only serve from the archive server… And they can be quite complex… May want to add something of your own…

Q7) May depend on what you see as a security issue… for me it may be about the authenticity of the archive… By building something in the website that shows different content in the archive…

A7 – Jack) We definitely think that changing the archive is a security threat…

Q8) How can you check the archives and look for arbitrary hacks?

A8 – Ilya) It’s pretty hard to do…

A8 – Jack) But it would be a really great research question…

Mat Kelly & David Dias: A collaborative, secure, and private InterPlanetary WayBack web archiving system using IPFS

David: Welcome to the session on going InterPlanetary… We are going to talk about peer to peer and other technology to make web archiving better…

We’ll talk about the InterPlanetary File System (IPFS) and InterPlanetary WayBack (IPWB)…

IPFS is also known as the distributed web, moving from location-based to content-based addressing… As we are aware, the web has some problems… You have experience of using a service, accessing email, using a document… There is some break in connectivity… And suddenly all those essential services are gone… Why? Why do we need to have the services working in such a vulnerable way… Even a simple page, you lose a connection and you get a 404. Why?

There is a real problem with permanence… We have this URI, the URL, telling us the protocol, location and content path… But when we come back later – weeks or months – and that content has moved elsewhere… Either somewhere else you can find, or somewhere you can’t. Sometimes it’s like the content has been destroyed… But every time people see a webpage, you download it to your machine… These issues come from location addressing…

In content addressing we tie content to a unique hash that identifies the item… So a Content Identifier (CID) allows us to do this… And then, in a network, when I look for that data… If there is a disruption to the network, we can ask any machine where the content is… And the node near you can show you what is available before you ever go to the network.
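A minimal sketch of the content-addressing idea just described (real IPFS CIDs use multihash and multibase encoding; plain hex SHA-256 here is purely illustrative): the identifier is derived from the bytes, any node can serve them, and retrieval verifies itself.

```python
import hashlib

def cid(content: bytes) -> str:
    # The identifier is derived from the bytes themselves, not a location.
    return hashlib.sha256(content).hexdigest()

store = {}  # stand-in for "any node on the network"

def put(content: bytes) -> str:
    key = cid(content)
    store[key] = content
    return key

def get(key: str) -> bytes:
    content = store[key]
    assert cid(content) == key  # retrieval is self-verifying
    return content

key = put(b"<html>archived page</html>")
```

The same content always yields the same identifier, which is why a nearby node can answer for a distant one.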

IPFS is already used in video streaming (inc. Netflix), legal documents, 3D models – with HoloLens for instance – for games, for scientific data and papers, blogs and webpages, and totally distributed web apps.

IPFS allows this to be distributed, to work offline, to save space, to optimise bandwidth usage, etc.

Mat: So I am going to talk about IPWB. The motivation here is that the persistence of archived web data is dependent on the resilience of the organisation and the availability of the data. The design extends the CDXJ format, with an indexing and IPFS dissemination procedure, and a replay and IPFS pull procedure. So an adapted CDXJ record adds a header with the hash for the content to the metadata structure.
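A hedged sketch of that extended-CDXJ idea: the usual surt-ordered key and timestamp, plus JSON metadata whose locator names content-addressed hashes for the record’s WARC header and payload instead of a WARC file offset. The field names and hash values below are illustrative, not IPWB’s exact schema.

```python
import json

def ipwb_cdxj_line(surt: str, timestamp: str,
                   header_hash: str, payload_hash: str) -> str:
    """One index line: lookup key, capture time, then JSON metadata
    whose locator names content-addressed WARC header and payload
    blocks rather than a byte offset into a WARC file."""
    meta = {
        "locator": f"ipfs/{header_hash}/{payload_hash}",
        "mime_type": "text/html",
        "status_code": "200",
    }
    return f"{surt} {timestamp} {json.dumps(meta)}"

line = ipwb_cdxj_line("org,example)/", "20170616120000",
                      "QmHeaderHashExample", "QmPayloadHashExample")
```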

Dave: One of the ways IPFS is pushing the boundary is in the browser tab, in a browser extension, and in a service worker acting as a proxy for requests the browser makes, with no changes to the interface (that one is definitely in alpha!)…

So the IPWB can expose the content to the IPFS and then connect and do everything in the browser without needing to download and execute code on their machine. Building it into the browser makes it easy to use…

Mat: And IPWB enables privacy, collaboration and security, building the encryption method and key into the WARC. Similarly CDXJs may be transferred for our users’ replay… Ideally you won’t need a CDXJ on your own machine at all…

We are also rerouting, rather than rewriting, for archival replay… We’ll be presenting on that late this summer…

And I think we just have time for a short demo…

For more see: https://github.com/oduwsdl/ipwb

Q&A

Q1) Mat, I think that you should tell that story of what you do…

A1) So, I looked for files on another machine…

A1 – Dave) When Mat has the archive file on a remote machine… Someone looks for this hash on the network, send my way as I have it… So when Mat looked, it replied… so the content was discovered… request issued, received content… and presented… And that also lets you capture pages appearing differently in different places and easily access them…

Q2) With the hash addressing, are there security concerns…

A2 – Dave) We use Multihash, using SHA-256… But you can use different hash functions, they just verify the link… In IPFS we prevent issues with self-describing hash functions…

Q3) The problem is that the hash function does end up in the URL… and it will decay over time because the hash function will decay… It’s a really hard problem to solve – making a choice now that may be wrong… But there is no way of choosing the right choice.

A3) At least we can use the hash function to indicate whether it looks likely to be the right or wrong link…

Q4) Is hash functioning itself useful with or without IPFS… Or is content addressing itself inherently useful?

A4 – Dave) I think the IPLD is useful anyway… So with legal documents where links have to stay in tact, and not be part of the open web, then IPFS can work to restrict that access but still make this more useful…

Q5) If we had a content addressable web, almost all these web archiving issues would be resolved really… It is hard to know if content is in Archive 1 or Archive 2. A content addressable web would make it easier to be archived… Important to keep in mind…

A5 – Dave) I 100% agree! A content addressed web lets you understand what is important to capture. And IPFS saves a lot of bandwidth and a lot of storage…

Q6) What is the longevity of the hashes and how do I check that?

A6 – Dave) OK, you can check the integrity of the hash. And we have filecoin.io, which is a blockchain-based storage network and cryptocurrency and that does handle this information… Using an address in a public blockchain… That’s our solution for some of those specific problems.

Andrew Jackson (AJ), Jefferson Bailey (JB), Kristinn Sigurðsson (KS) & Nicholas Taylor (NT): IIPC Tools: autumn technical workshop planning discussion

AJ: I’ve been really impressed with what I’ve seen today. There is a lot of enthusiasm for open source and collaborative approaches and that has been clear today and the IIPC wants to encourage and support that.

Now, in September 2016 we had a hackathon but there were some who just wanted to get something concrete done… And we might therefore adjust the format… Perhaps pre-define a task well ahead of time… But also a parallel track for the next hackathon/more experimental side. Is that a good idea? What else may be?

JB: We looked at Archives Unleashed, and we did a White House Social Media Hackathon earlier this year… This is a technical track but… it’s interesting to think about what kind of developer skills/what mix will work best… We have lots of web archiving engineers… They don’t use the software that comes out of it… We find it useful to have archivists in the room…

Then, from another angle, at the hackathons… IIPC doesn’t have a lot of money and travel is expensive… The impact of that gets debated – it’s a big budget line for 8-10 institutions out of 53 members. The outcomes are obviously useful but… Expecting people to be totally funded for days on end across the world isn’t feasible… So maybe more little events, or fewer bigger events, can work…

Comment 1) Why aren’t these sessions recorded?

JB: Too much money. We have recorded some of them… Sometimes it happens, sometimes it doesn’t…

AJ: We don’t have in-house skills, so it’s third party… And that’s the issue…

JB: It’s a quality thing…

KS: But also, when we’ve done it before, it’s not heavily watched… And the value can feel questionable…

Comment 1) I have a camera at home!

JB: People can film whatever they want… But that’s on people to do… IIPC isn’t an enforcement agency… But we should make it clear that people can film them…

KS: For me… You guys are doing incredible things… And it’s things I can’t do at home. The other aspect is that… There are advancements that never quite happened… But I think there is value in the unconference side…

AJ: One of the things with unconference sessions is that…

NT: I didn’t go to the London hackathon… Now we have a technical team, it’s more appealing… The conference in general is good for surfacing issues we have in common… such as extraction of metadata… But there is also the question of when we sit down to deal with some specific task… That could be useful for taking things forward…

AJ: I like the idea of a counter conference, focused on the tools… I was a bit concerned that if there were really specific things… What does it need to be to be worth your organisation flying you to it… Too narrow and it’s exclusionary… Too broad and maybe it’s not helpful enough…

Comment 2) Worth seeing the model used by Python – they have a sprint after their conference. That isn’t an unconference but lets you come together. Mozilla Fest Sprint picks a topic and then next time you work on it… Sometimes looking at other organisations with less money are worth looking at… And for things like crowd sourcing coverage etc… There must be models…

AJ: This is cool.. You will have to push on this…

Comment 3) I think that tacking on to a conference helps…

KS: But challenging to be away from office more than 3/4 days…

Comment 4) Maybe look at NodeJS Community and how they organise… They have a website, NodeSchool.io with three workshops… People organise events pretty much monthly… And create material in local communities… Less travel but builds momentum… And you can see that that has impact through local NodeJS events now…

AJ: That would be possible to support as well… with IIPC or organisational support… Bootstrapping approaches…

Comment 5) Other than hackathon there are other ways to engage developers in the community… So you can engage with Google Summer of Code for instance – as mentors… That is where students look for projects to work on…

JB: We have two GSoC and like 8 working without funding at the moment… But it’s non trivial to manage that…

AJ: Onboarding new developers in any way would be useful…

Nick: Onboarding into the weird and wacky world of web archiving… If IIPC can curate a lot of onboarding stuff, that would be really good for potential… for getting started… Not relying on a small number of people…

AJ: We have to be careful as IIPC tools page is very popular, but hard to keep up to date… Benefits can be minor versus time…

Nick: Do you have GitHub? Just put up an awesome list!

AJ: That’s a good idea…

JB: Microfunding projects – sub $10k is also an option for cost recovered brought out time for some of these sorts of tasks… That would be really interesting…

Comment 6) To expand on Jefferson and Nick were saying… I’m really new… Went to IIPC in April. I am enjoying this and learning this a lot… I’ve been talking to a lot of you… That would really help more people get the technical environment right… Organisations want to get into archiving on a small scale…

Olga: We do have a list on GitHub… but not up to date and well used…

AJ: We do have this document, we have GitHub… But we could refer to each other… and point to the getting started stuff (only). Rather get away from lists…

Comment 7) Google has an OpenSource.guide page – could take inspiration from that… Licensing, communities, etc… Very simple plain English getting started guide/documentation…

Comment 8) I’m very new to the community… And I was wondering to what extent you use Slack and Twitter between events to maintain these conversations and connections?

AJ: We have a Slack channel, but we haven’t publicised it particularly but it’s there… And Twitter you should tweet @NetPreserve and they will retweet then this community will see that…

Jun 14 2017


Following on from Day One of IIPC/RESAW I’m at the British Library for a connected Web Archiving Week 2017 event: Digital Conversations @BL, Web Archives: truth, lies and politics in the 21st century. This is a panel session chaired by Elaine Glaser (EG) with Jane Winters (JW), Valerie Schafer (VS), Jefferson Bailey (JB) and Andrew Jackson (AJ). 

As usual, this is a liveblog so corrections, additions, etc. are welcomed. 

EG: Really excited to be chairing this session. I’ll let everyone speak for a few minutes, then ask some questions, then open it out…

JB: I thought I’d talk a bit about our archiving strategy at Internet Archive. We don’t archive the whole of the internet, but we aim to collect a lot of it. The approach is multi-pronged: to take entire web domains in a shallow but broad strategy; to work with other libraries and archives to focus on particular subjects or areas or collections; and then to work with researchers who are mining or scraping the web, but not necessarily having preservation strategies. So, when we talk about political archiving or web archiving, it’s about getting as much as possible, with different volumes and frequencies. I think we know we can’t collect everything, but important things frequently, less important things less frequently. And we work with national governments, with national libraries…

The other thing I wanted to raise is T.R. Schellenberg, who was an important archivist at the National Archives in the US. He had an idea about archival strategies: that there is a primary documentation strategy, and a secondary strategy. The primary is for a government and its agencies to do for their own use, the secondary for future use in unknown ways… And including documentary and evidentiary material (the latter being how and why things are done). Those evidentiary elements become much more meaningful on the web; that has emerged and become more meaningful in the context of our current political environment.

AJ: My role is to build a Web Archive for the United Kingdom. So I want to ask a question that comes out of this… “Can a web archive lie?”. Even putting to one side that it isn’t possible to archive the whole web.. There is confusion because we can’t get every version of everything we capture… Then there are biases from our work. We choose all UK sites, but some are captured more than others… And our team isn’t as diverse as it could be. And what we collect is also constrained by technology capability. And we are limited by time issues… We don’t normally know when material is created… The crawler often finds things only when they become popular… So the academic paper is picked up after a BBC News item – they are out of order. We would like to use more structured data, such as Twitter which has clear publication date…

But can the archive lie? Well, with digital material it is much easier than with print to make an untraceable change. As digital is increasingly predominant we need to be aware that our archive could be hacked… So we have to protect against that, and evidence that we haven’t been hacked… And we have to build systems that are secure and can maintain that trust. Libraries will have to take care of each other.

JW: The Oxford Dictionaries word of the year in 2016 was “post-truth”, whilst the Australian dictionary went for “fake news”. Fake news for them is either disinformation on websites for political purposes, or for commercial benefit. Merriam-Webster went for “surreal” – their most searched-for word. It feels like we live in very strange times… There aren’t calls for resignation where there once were… Hasn’t it always been thus though…? For all the good citizens who point out the errors of a fake image circulated on Twitter, for many the truth never catches the lie. Fakes, lies and forgeries have helped change human history…

But modern fake news is different to that which existed before. Firstly there is the speed of fake news… Mainstream media can only counteract or address this after the fact. Some newspapers and websites do public corrections, but that isn’t the norm. Once publishing took time and means; social media has made it much easier to self-publish. One can create, but one can also check accuracy and integrity – reverse image searching to see when a photo has been photoshopped or actually shows events from years before…

And we have politicians making claims that they believe can be deleted and disappear from our memory… We have web archives – on both sides of the Atlantic. The European Referendum NHS pledge claim is archived and lasts long beyond the bus – which was bought by Greenpeace and repainted. The archives have also been capturing political parties’ websites throughout our endless election cycle… The DUP website crashed after the announcement of the election results because of demand… But the archive copy was available throughout. There was also a rumour that a hacker was creating an Irish language version of the DUP website… But that wasn’t a new story, it was from 2011… And again the archive shows that, and archives of news websites do that too.

Social Networks Responses to Terrorist Attacks in France – Valerie Schafer. 

Before 9/11 we had some digital archives of terrorist materials on the web. But this event challenged archivists and researchers. The Charlie Hebdo, Paris Bataclan and Nice attacks are archived… People can search at the BnF to explore these archives, to provide users a way to see what has been said. And at the INA you can also explore the archive, including Twitter archives. You can search, see keywords, explore timelines crossing key hashtags… And you can search for images… including the emojis used in discussion of Charlie Hebdo and Bataclan.

We also have Archive-It collections for Charlie Hebdo. This raises some questions of what should and should not be collected… We do not normally collect newspapers and audiovisual sites, but decided to in this case as we faced a special event. But we still face challenges – it is easier to collect data from Twitter than from Facebook. And it is free to collect Twitter data in real time, but the archived/older data is charged for, so you have to capture it in the moment. And there are limits on API collection… INA captured more than 12 million tweets for Charlie Hebdo, for instance; it is very complete but not exhaustive.

We continue to collect for #jesuischarlie and #bataclan… They are continually used and added to, in similar or related attacks, etc. There is a time for exploring and reflecting on this data, and space for critique too…

But we also see that content gets deleted… It is hard to find fake news on social media, unless you are looking for it… Looking for #fakenews just won’t cut it… So, we had a study on fake news… And we recommend that authorities are cautious about material they share. But also there is a need for cross checking – the kinds of projects with Facebook and Twitter. Web archives are full of fake news, but also full of others’ attempts to correct and check fake news as well…

EG: I wanted to go back in time to the idea of the term “fake news”… In order to understand what “fake news” actually is, we have to understand how it differs from previous lies and mistruths… I’m from outside the web world… We are often looking at tactics to fight fire with fire, to use an unfortunate metaphor… How new is it? And who is to blame and why?

JW: Talking about it as a web problem, or a social media issue isn’t right. It’s about humans making decisions to critique or not that content. But it is about algorithmic sharing and visibility of that information.

JB: I agree. What is new is the way media is produced, disseminated and consumed – those have technological underpinnings. And they have been disruptive of publication and interpretation in a web world.

EG: Shouldn’t we be talking about a culture, not just technology… It’s not just the “vessel”… Doesn’t the dissemination have more of a role than perhaps we are suggesting…

AJ: When you build a social network or any digital space you build in different affordances… So Facebook and Twitter are different. And you can create automated accounts, with Twitter especially offering an affordance for robots etc. which allows you to give the impression of a movement. There are ways to change those affordances, but there will also always be fake news and issues…

EG: There are degrees of agency in fake news.. from bots to deliberate posts…

JW: I think there is also the aspect of performing your popularity – creating content for likes and shares, regardless of whether what you share is true or not.

VS: I know terrorism is different… But for any tweet sharing fake news you get four retweets denying it… You have more tweets denying than sharing fake news…

AJ: One wonders about the filter bubble impact here… Facebook encourages inward looking discussion… Social media has helped like-minded people find each other, and perhaps they can be clipped off more easily from the wider discussion…

VS: I think also what is interesting is the game between social media and traditional media… You have questions and relationships there…

EG: All the internet can do is reflect the crooked timber of reality… We know that people have confirmation bias; we are quite tolerant of untruths, and less tolerant of information that contradicts our perceptions, even if that information is true. You have people and the net being equally tolerant of lies and mistruths… But isn’t there another factor here… The people demonised as gatekeepers… By putting in place structures of authority – which were journalism and academia… Their resources are reduced now… So what role do you see for those traditional gatekeepers…

VS: These gatekeepers are no longer the traditional gatekeepers that they were… They work in 24 hour news cycles and have to work to that. In France they are trying to rethink that role, there were a lot of questions about this… Whether that’s about how you react to changing events, and what happens during elections… People are thinking about that…

JB: There is an authority and responsibility for media still, but has the web changed that? Looking back it’s surprising now how few organisations controlled most of the media… But is that so different now?

EG: I still think you are being too easy on the internet… We’ve had investigative journalism by Carole Cadwalladr and others on Cambridge Analytica and others who deliberately manipulate reality… You talked about witness testimony in relation to terrorism… Isn’t there an immediacy and authenticity challenge there… Donald Trump’s tweets… They are transparent but not accountable… Haven’t we created a problem that we are now trying to fix?

AJ: Yes. But there are two things going on… It seems to be that people care less about lying… People see Trump lying, and they don’t care, and media organisations don’t care as long as advertising money comes in… A parallel for that in social media – the flow of content and ads takes priority over truth. There is an economic driver common to both mediums that is warping that…

JW: There is an unpopularity aspect too… a (nameless) newspaper here that shares content to generate “I can’t believe this!” reactions, and then the sharing generates advertising income… But on a positive note, there is scope and appetite for strong investigative journalism… and that is facilitated by the web and digital methods…

VS: Citizens do use different media and cross media… Colleagues are working on how TV is used… And different channels, to compare… Mainstream and social media are strongly crossed together…

EG: I did want to talk about temporal element… Twitter exists in the moment, making it easy to make people accountable… Do you see Twitter doing what newspapers did?

AJ: Yes… A substrate…

JB: It’s amazing how much of the web is archived… With “Save Page Now” we see all kinds of things archived – including pages that exposed the Russian downing of a Ukrainian plane… Citizen action, spotting the need to capture data whilst it is still there, and that happens all the time…

EG: I am still sceptical about citizen journalism… It’s a small group of people from a narrow demographic, and it’s time consuming… Perhaps there is still a need for journalist roles… We did talk about filter bubbles… We hear about newspapers and media as biased… But isn’t the issue that communities of misinformation are not penetrated by the other side, nor by the truth…

JW: I think bias in newspapers is quite interesting and different to unacknowledged bias… Most papers are explicit in their perspective… So you know what you will get…

AJ: I think so, but bias can be quite subtle… Different perspectives on a common issue allows comparison… But other stories only appear in one type of paper… That selection case is harder to compare…

EG: This really is a key point… There is a difference between facts and truth, and explicitly framed interpretation or commentary… Those things are different… That’s where I wonder about web archives… When I look at Wikipedia… It’s almost better to go to a source with an explicit bias where I can see a take on something, unlike Wikipedia which tries to focus on fact. Talking about politicians lying misses the point… It should be about a specific rhetorical position… That definition of truth comes up when we think of the role of the archive… How do you deal with that slightly differing definition of what truth is…

JB: I talked about different complementary collecting strategies… The archivist has some political power in deciding what goes in the historical record… The volume of the web does undercut that power in a way that I think is good – archives have historically been about the rich and the powerful… So making archives non-exclusive somewhat addresses that… But there will be fake news in the archive…

JW: But that’s great! Archives aren’t about collecting truth. Things will be in there that are not true, partially true, or factual… It’s for researchers to sort that out later…

VS: Your comment on Wikipedia… They do try to be factual, neutral… But not truth… And to have a good balance of power… For us as researchers we can be surprised by the neutral point of view… Fortunately the web archive does capture a mixture of opinions…

EG: Yeah, so that captures what people believed at a point of time – true or not… So I would like to talk about the archive itself… Do you see your role as being successors to journalists… Or as being able to harvest the world’s record in a different way…

JB: I am an archivist with that training and background, as are a lot of people working on web archives and in interesting spaces. Certainly historic preservation drives a lot of collecting aspects… But also engineering and technological aspects. So it’s people interested in archiving and preservation, but also technology… And software engineers interested in web archiving.

AJ: I’m a physicist but I’m now running web archives. And for us it’s an extension of the legal deposit role… Anything made public on the web should go into the legal deposit… That’s the theory, in practice there are questions of scope, and where we expend quality assurance energy. That’s the source of possible collection bias. And I want tools to support archivists… And also to prompt for challenging bias – if we can recognise that taking place.

JW: There are also questions of what you foreground in Special Collections. There are decisions being made about collections that will be archived and catalogued more deeply…

VS: At the BnF my colleagues work in an area with a tradition, with legal deposit responsibility… There are politics of heritage and what it should be. I think that is the case for many places where that activity sits with other archivists and librarians.

EG: You do have this huge responsibility to curate the record of human history… How do you match the top down requirements with the bottom-up nature of the web as we now talk about it?

JW: One way is to have others come in to your department to curate particular collections…

JB: We do have special collections – people can choose their own, public suggestions, feeds from researchers, all sorts of projects to get the tools in place for building web archives for their own communities… I think for the sake of longevity and use going forward, the curated collections will probably have more value… Even if they seem more narrow now.

VS: Also interesting that not all archives selected bottom-up curation. In Switzerland they went top down – there are a variety of approaches across Europe.

JW: We heard about the 1916 Easter Rising archive earlier, which was through public nominations… Which is really interesting…

AJ: And social media can help us – by seeing links and hashtags. When we looked at this 4-5 years ago everyone linked to the BBC, but now we have more fake news sites etc…

VS: We do have this question of what should be archived… We see capture of the vernacular web – kitten or unicorn gifs etc… !

EG: I have a dystopian scenario in my head… Could you see a time, years from now, when newspapers are dead, public broadcasters are more or less dead… And we have flotsam and jetsam… We have all this data out there… And all kinds of actors who use all this social media data… Can you reassure me?

AJ: No…

JW: I think academics are always ready to pick holes in things, I hope that that continues…

JB: I think more interesting is the idea that there may not be a web… Apps, walled gardens… Facebook is pretty hard to web archive – they make it intentionally more challenging than it should be. There are lots of communication tools that disappeared… So I worry more about loss of a web that allows the positive affordances of participation and engagement…

EG: There is the issue of privatising and sequestering the web… I am becoming increasingly aware of the importance of organisations – like the BL and Internet Archive… Those roles did used to be taken on by publicly appointed organisations and bodies… How are they impacted by commercial privatisation… And how those roles are changing… How do you envisage that public sphere of collecting…

JW: For me more money for organisations like the British Library is important. Trust is crucial, and I trust that they will continue to do that in a trustworthy way. Commercial entities cannot be trusted to protect our cultural heritage…

AJ: A lot of people know what we do with physical material, but are surprised by our digital work. We have to advocate for ourselves. We are also constrained by the legal framework we operate within, and we have to challenge that over time…

JB: It’s super exciting to see libraries and archives recognised for their responsibility and trust… But that also puts them at higher risk from those whom they hold accountable, and being recognised as bastions of accountability makes them more vulnerable.

VS: Recently we had 20th birthday of the Internet Archive, and 10 years of the French internet archiving… This is all so fast moving… People are more and more aware of web archiving… We will see new developments, ways to make things open… How to find and search and explore the archive more easily…

EG: The question then is how we access this data… The new masters of the universe will be those emerging gatekeepers who can explore the data… What is the role between them and the public’s ability to access data…

VS: It is not easy to explain everything around web archives but people will demand access…

JW: There are different levels of access… Most people will be able to access what they want. But there is also a great deal of expertise in organisations – it isn’t just commercial data work. And working with the Alan Turing Institute and cutting edge research helps here…

EG: One of the founders of the internet, Vint Cerf, says that “if you want to keep your treasured family pictures, print them out”. Are we overly optimistic about the permanence of the record?

AJ: We believe we have the skills and capabilities to maintain most if not all of it over time… There is an aspect of benign neglect… But if you are active about your digital archive you could have a copy in every continent… Digital allows you to protect content from different types of risk… I’m confident that the library can do this as part of its mission.

Q&A

Q1) Coming back to fake news and journalists… There is a changing role between the web as a communications medium, and web archiving… Web archives are about documenting this stuff for journalists and researchers as a source; they don’t build the discussion… They are not the journalism itself.

Q2) I wanted to come back to the idea of the Filter Bubble, in the sense that it mediates the experience of the web now… It is important to capture that in some way, but how do we archive that… And changes from one year to the next?

Q3) It’s kind of ironic to have nostalgia about journalism and traditional media as gatekeepers, in a country where Rupert Murdoch is traditionally that gatekeeper. Global funding for web archiving is tens of millions; the budget for the web is tens of billions… The challenges are getting harder – right now you can use robots.txt but we have DRM coming and that will make it illegal to archive the web – and the budgets have to increase to match that to keep archives doing their job.

AJ: To respond to Q3… Under the legislation it will not be illegal for us to archive that data… But it will make it more expensive and difficult to do, especially at scale. So your point stands, even with that. In terms of the Filter Bubble, they are out of our scope, but we know they are important… It would be good to partner with an organisation where the modern experience of media is explicitly part of its role.

JW: I think that idea of the data not being the only thing that matters is important. Ethnography is important for understanding that context around all that other stuff…  To help you with supplementary research. On the expense side, it is increasingly important to demonstrate the value of that archiving… Need to think in terms of financial return to digital and creative economies, which is why researchers have to engage with this.

VS: Regarding the first two questions… Archives reflect reality, so there will be lies there… Of course web archives must be crossed and compared with other archives… And contextualisation matters, the digital environment in which the web was living… Contextualisation of web environment is important… And with terrorist archive we tried to document the process of how we selected content, and archive that too for future researchers to have in mind and understand what is there and why…

JB: I was interested in the first question, this idea of what happens and preserving the conversation… That timeline was sometimes decades before, but is now weeks or days or less… In terms of experience, websites are now personalised and our ability to capture that broadly is impossible. So we need to capture that experience, and the emergent personalisation… The web wasn’t public before, as ARPAnet, then it became public, but it seems to be ebbing a bit…

JW: With a longer term view… I wonder if the open stuff which is easier to archive may survive beyond the gated stuff that traditionally was more likely to survive.

Q4) Today we are 24 years into advertising on the web. We take ad-driven models as a given, and we see fake news as a consequence of that… So, my question is, Minitel was a large system that ran on a different model… Are there different ways to change the revenue model to change fake or true news and how it is shared…

Q5) Theresa May has been outspoken on fake news and wants a crackdown… The way I interpret that is censorship and banning of sites she does not like… Jefferson said that he’s been archiving sites that she won’t like… What will you do if she asks you to delete parts of your archive…

JB: In the US?!

Q6) Do you think we have sufficient web literacy amongst policy makers, researchers and citizens?

JW: On that last question… Absolutely not. I do feel sorry for politicians who have to appear on the news to answer questions but… Some of the responses and comments, especially on encryption and cybersecurity have been shocking. It should matter, but it doesn’t seem to matter enough yet… 

JB: We have a tactic of “geopolitical redundancy” to ensure our collections are shielded from political endangerment by making copies – which is easy to do – and locate them in different political and geographical contexts. 

AJ: We can suppress access to content, but not delete it. We don’t do that…

EG: Is there a further risk of data manipulation… Of Trump and Farage and data… a covert threat… 

AJ: We do have to understand and learn how to cope with potential attack… Any one domain is a single point of failure… so we need to share metadata, content where possible… But web archives are fortunate to have the strong social framework to build that on… 

Q7) Going back to that idea of what kinds of responsibilities we have to enable a broader range of people to engage in a rich way with the digital archive… 

Q8) I was thinking about questions in context, and trust in content in the archive… And realising that web archives are fairly young… Generally researchers are close to the resource they are studying… Can we imagine projects in 50-100 years time where we are more separate from what we should be trusting in the archive… 

Q9) My perspective comes from building a web archive for European institutions… And can the archive live… Do we need legal notice on the archive, disclaimers, our method… How do we ensure people do not misinterpret what we do. How do we make the process of archiving more transparent. 

JB: That question of who has resources to access web archives is important. It is a responsibility of institutions like ours… To ensure even small collections can be accessed, that researchers and citizens are empowered with skills to query the archive, and things like APIs to enable that too… The other question on evidencing curatorial decisions – we are notoriously poor at that historically… But there is a lot of technological mystery there that we should demystify for users… All sorts of complexity there… The web archiving community needs to work on that provenance information over the next few years…

AJ: We do try to record this but as Jefferson said much of this is computational and algorithmic… So we maybe need to describe that better for wider audiences… That’s a bigger issue anyway, that understanding of algorithmic process. At the British Library we are fortunate to have capacity for text mining our own archives… We will be doing more than that… It will be small at first… But as it’s hard to bring data to the queries, we must bring queries to the archive. 

JW: I think it is so hard to think ahead to the long term… You’ll never pre-empt all usage… You just have to do the best that you can. 

VS: You won’t collect everything, every time… The web archive is not an exact mirror… It is “reborn digital heritage”… We have to document everything, but we can try to give some digital literacy to students so they have a way to access the web archive and engage with it… 

EG: Time is up, Thank you our panellists for this fantastic session. 

Jun 14 2017

From today until Friday I will be at the International Internet Preservation Coalition (IIPC) Web Archiving Conference 2017, which is being held jointly with the second RESAW: Research Infrastructure for the Study of Archived Web Materials Conference. I’ll be attending the main strand at the School of Advanced Study, University of London, today and Friday, and at the technical strand (at the British Library) on Thursday. I’m here wearing my “Reference Rot in Theses: A HiberActive Pilot” – aka “HiberActive” – hat. HiberActive is looking at how we can better enable PhD candidates to archive web materials they are using in their research, and citing in their thesis. I’m managing the project and working with developers, library and information services stakeholders, and a fab team of five postgraduate interns who are, whilst I’m here, out and about around the University of Edinburgh talking to PhD students to find out how they collect, manage and cite their web references, and what issues they may be having with “reference rot” – content that changes, decays, disappears, etc. We will have a webpage for the project and some further information to share soon but if you are interested in finding out more, leave me a comment below or email me: nicola.osborne@ed.ac.uk. These notes are being taken live so, as usual for my liveblogs, I welcome corrections, additions, comment etc. (and, as usual, you’ll see the structure of the day appearing below with notes added at each session). 

Opening remarks: Jane Winters

This event follows the first RESAW event which took place in Aarhus last year. This year we again highlight the huge range of work being undertaken with web archives. 

This year a few things are different… Firstly we are holding this with the IIPC, which means we can run the event over 3 days, and means we can bring together librarians, archivists, and data scientists. The BL have been involved and we are very grateful for their input. We are also excited to have a public event this evening, highlighting the increasingly public nature of web archiving.

Opening remarks: Nicholas Taylor

On behalf of the IIPC Programme Committee I am hugely grateful to colleagues here at the School of Advanced Study and at the British Library for being flexible and accommodating us. I would also like to thank colleagues in Portugal, and hope a future meeting will take place there, as had been originally planned for IIPC.

For us we have seen the Web Archiving Conference as an increasingly public way to explore web archiving practice. The programme committee saw a great increase in submissions, requiring a larger than usual commitment from the programming committee. We are lucky to have this opportunity to connect as an international community of practice, to build connections to new members of the community, and to celebrate what you do. 

Opening plenary: Leah Lievrouw – Web history and the landscape of communication/media research Chair: Nicholas Taylor

I intend to go through some context in media studies. I know this is a mixed audience… I am from the Department of Information Studies at UCLA and we have a very polyglot organisation – we can never assume that we all understand each other’s backgrounds and contexts.

A lot about the web, and web archiving, is changing, so I am hoping that we will get some Q&A going about how we address some gaps in possible approaches. 

I’ll begin by saying that it has been some time now that computers, as communication devices, have been seen as a medium. This seems commonplace now, but when I was in college this was seen as fringe in communication research, in the US at least. But for years documentarists, engineers, programmers and designers have seen information resources, data and computing as tools and sites for imagining, building, and defending “new” societies; enacting emancipatory cultures and politics… A sort of Alexandrian vision of “all the knowledge in the world”. This is still part of the idea that we have in web archiving. Back in the day the idea was that fostering this kind of knowledge would bring about internationality, world peace, modernity. When you look at old images you see artefacts – it is more than information, it is the materiality of artefacts. I am a contributor to Niels’ web archiving handbook, and he talks about history written of the web, and history written with the web. So there are attempts to write history with the web, but what about the tools themselves?

So, this idea about connections between bits of knowledge… This goes back before browsers. Many of you will be familiar with H.G. Wells’ ? Brain; Suzanne Briet’s Qu’est-ce que la documentation? (1951) is a very influential work in this space; Jennifer Light wrote a wonderful book on Cold War intellectuals and their relationship to networked information… One of my lecturers was one of these in fact, thinking about networked cities… Vannevar Bush’s “As we may think” (1945) saw information as essential to order and society.

Another piece I often teach, J.C.R. Licklider and Robert W. Taylor (1968) in “The computer as a communication device”, talked about computers communicating, but not in the same ways that humans make meaning. In fact this graphic shows a man’s computer talking to an insurance salesman, saying “he’s out”, and the caption “your computer will know what is important to you and buffer you from the outside world”.

We then have this counterculture movement in California in the 1960s and 1970s… And that feeds into the emerging tech culture. We have The Well coming out of this. Stewart Brand wrote The Whole Earth Catalog (1968-78). And actually in 2012 someone wrote a new Whole Earth Catalog…

Ted Nelson, Computer Lib/Dream Machines (1974), is known as being the person who came up with the concept of the link, between computers, to information… He’s an inventor essentially. Computer Lib/Dream Machines was a self-published title, a manifesto… The subtitle for Computer Lib was “you can and must understand computers NOW”. Counterculture was another element, and this is way before the web, where people were talking about networked information… But these people were not thinking about preservation and archiving, though there was an assumption that information would be kept…

And then as we see information utilities and wired cities emerging, mainly around cable TV but also local public access TV… There was a lot of capacity for information communication… In the UK you had teletext, in Canada there was Telidon… And you were able to start thinking about information distribution wider and more diverse than central broadcasters… With services like LexisNexis emerging we had these ideas of information utilities… There was a lot of interest in the 1980s, and back in the 1970s too.

Harold Sackman and Norman Nie (eds.), The Information Utility and Social Choice (1970); H.G. Bradley, H.S. Dordick and B. Nanus, The Emerging Network Marketplace (1980); R.S. Block, “A global information utility”, The Futurist (1984); W.H. Dutton, J.G. Blumler and K.L. Kraemer, “Wired cities: shaping the future of communications” (1987).

This new medium looked more like point-to-point communication, like the telephone. But no-one was studying that. There were communications scholars looking at face to face communication, and at media, but not at this on the whole. 

Now, that’s some background, I want to periodise a bit here… And I realise that is a risk of course… 

So, we have the Pre-browser internet (early 1980s-1990s). Here the emphasis was on access – to information, expertise and content at centre of early versions of “information utilities”, “wired cities” etc. This was about everyone having access – coming from that counter culture place. More people needed more access, more bandwidth, more information. There were a lot of digital materials already out there… But they were fiddly to get at. 

Now, when the internet became privatised – moved away from military and universities – the old model of markets and selling information to mass markets, the transmission model, re-emerged. But there was also this idea that because the internet was point-to-point – and any point could get to any other point… And that everyone would eventually be on the internet… The vision was of the internet as “inherently democratic”. Now we recognise the complexity of that, but that was the vision then.

Post-browser internet (early 1990s to mid-2000s) – was about web 1.0. Browsers and WWW were designed to search and retrieve documents, discrete kinds of files, to access online documents. I’ve said “Web 1.0” but had a good conversation with a colleague yesterday who isn’t convinced about these kinds of labels, but I find them useful shorthand for thinking about the web at particular points in time/use. In this era we had email still, but other types of authoring tools arose… Encouraging a wave of “user generated content” – wikis, blogs, tagging, media production and publishing, social networking sites. This sounds such a dated term now, but it did change who could produce and create media, and it was the term in use around this time.

Then we began to see Web 2.0 with the rise of “smart phones” in the mid-2000s, merging mobile telephony and specialised web-based mobile applications, accelerating user content production and social media profiling. And the rise of “social networking” sounded a little weird to those of us with sociology training who were used to these terms from the real world, from social network analysis. But Facebook is a social network. Many of the tools, blogging for example, can be seen as having a kind of mass media quality – so instead of a movie studio making content… I can have my blog, which may have an audience of millions or maybe just, like, 12 people. But that is highly personal. Indeed one of the earliest so-called “killer apps” for the internet was email. Instead of shipping data around for processing – which is what the architecture was originally set up for – you could send a short note to your friend elsewhere… Email hasn’t changed much. That point-to-point communication suddenly and unexpectedly became more than half of the traffic on ARPANET. Many people were surprised by that. That pattern of interpersonal communication over networks has continued to repeat itself – we see it with Facebook, Twitter, and even with blogs etc. that have feedback/comments etc.

Web 2.0 is often talked about as social driven. But what is important from a sociology perspective, is the participation, and the participation of user generated communities. And actually that continues to be a challenge, it continues to be not the thing the architecture was for… 

In the last decade we’ve seen algorithmic media emerging, and the rise of “web 3.0”. Both access and participation are appropriated as commodities to be monitored, captured, analysed, monetised and sold back to individuals, reconceived as data subjects. Everything is thought about as data, data that can be stored, accessed… Access itself, the action people take to stay in touch with each other… We all carry around monitoring devices every day… At UCLA we are looking at the concept of the “data subject”. Bruce ? used to talk about the “data footprint” or the “data cloud”. We are at a moment where we are increasingly aware of being data subjects. London is one of the most remarkable cities in the world in terms of surveillance… The UK in general, but London in particular… And that is ok culturally; I’m not sure it would be in the United States.

We did some work at UCLA to get students to mark up how many surveillance cameras there were, who controlled them, and who had set them up… Neither campus police nor the university knew. That was eye opening. Our students were horrified at this – but that’s an American cultural reaction.

But if we conceive of our own connections to each other, to government, etc. as “data” we begin to think of ourselves, and everything, as “things”. Right now systems and governance are maximising the market: institutional and government surveillance; unrestricted access to user data; moves towards real-time flows rather than “stocks” of documents or content. Surveillance isn’t just about government – supermarkets are some of our most surveilled spaces.

I currently have students working on a “named data networking” project. The idea is that data will be enclosed, that data is time-based, to replace IP, the Internet Protocol. So that rather than packets being routed, data is flowing all the time. So it would be like opening the nearest tap to get water. One of the interests here is from the movie and television industry, particularly web streaming services, who occupy significant percentages of bandwidth now…

There are a lot of ways to talk about this, to conceive of this… 

1.0 tends to be about documents, press, publishing, texts, search, retrieval, circulation, access, reception, production-consumption: content.

2.0 is about conversations, relationships, peers, interaction, communities, play – as a cooperative and flow experience, mobility, social media (though I rebel against that term somewhat): social networks.

3.0 is about algorithms, “clouds” (as fluffy benevolent things, rather than real and problematic, with physical spaces, server farms), “internet of things”, aggregation, sensing, visualisation, visibility, personalisation, self as data subject, ecosystems, surveillance, interoperability, flows: big data, algorithmic media. Surveillance is kind of the environment we live in. 

Now I want to talk a little about traditions in communication studies.. 

In communication, broadly and historically speaking, there has been one school of thought that is broadly social scientific, from sociology and communications research, that thinks about how technologies are “used” for expression, interaction, as data sources or analytic tools. Looking at media in terms of their effects on what people know or do, can look at media as data sources, but usually it is about their use. 

There are theories of interaction, group process and influence; communities and networks; semantic, topical and content studies; law, policy and regulation of systems/political economy. One key question we might ask here: “what difference does the web make as a medium/milieu for communicative action, relations, interaction, organising, institutional formation and change?” Those from a science and technology background might know about the issues of shaping – we shape technology and technology shapes us. 

Then there is the more cultural/critical/humanist or media studies approach. When I come to the UK people who do media studies still think of humanist studies as being different, “what people do”. However this approach of cultural/critical/etc. is about analyses of digital technologies and web; design, affordances, contexts, consequences – philosophical, historical, critical lens. How power is distributed are important in this tradition. 

In terms of theoretical schools, we have the Toronto School/media ecology – the Marshall McLuhan take – which is very much about the media itself; American cultural studies, and the work of James Carey and his students; the Birmingham School – the British take on media studies; and new materialism – which you see in Digital Humanities and German Media Studies, and which says we have gone too far from the roles of the materials themselves. So, we might ask “What is the web itself (social and technical constituents) as both medium and product of culture, under what conditions, times and places?”

So, what are the implications for Web Archiving? Well I hope we can discuss this, thinking about a table of:

Web Phase | Soc sci/admin | Crit/Cultural

  • Documents: content + access
  • Conversation: Social nets + participation
  • Data/Algorithms: algorithmic media + data subjects

Comment: I was wondering about ArXiv and the move to sharing multiple versions, pre-prints, post prints…

Leah: That issue of changes in publication, what preprints mean for who is paid for what, that’s certainly changing things and an interesting question here…

Comment: If we think of the web moving from documents, towards fluid state, social networks… It becomes interesting… Where are the boundaries of web archiving? What is a web archiving object? Or is it not an object but an assemblage? Also ethics of this…

Leah: It is an interesting move from the concrete, the material… And then this whole cultural heritage question, what does it instantiate, what evidence is it, whose evidence is it? And do we participate in hardening those boundaries… Or do we keep them open… How porous are our boundaries…

Comment: What about the role of metadata?

Leah: Sure, arguably the metadata is the most important thing… What we say about it, what we define it as… And that issue of fluidity… We think of metadata as having some sort of fixity… One thing that has begun to emerge in surveillance contexts… where law enforcement says “we aren’t looking at your content, just the metadata” – well, it turns out that is highly personally identifiable, it’s the added value… What happens when that secondary data becomes the most important thing… In fact, where many of our data systems do not communicate with each other, those connections are through the metadata (only).

Comment: In terms of web archiving… As you go from documents, to conversations, to algorithms… Archiving becomes so much more complex. Particularly where interactions are involved… You can archive the data and the algorithm but you still can’t capture the interactions there…

Leah: Absolutely. As we move towards the algorithmic level it’s not a fixed thing. You can’t just capture the Google search algorithms; they change all the time. The more I look at this work through the lens of algorithms and data flows, there is no object in the classic sense…

Comment: Perhaps, like a movie, we need longer temporal snapshots…

Leah: Like the algorithmic equivalence of persistence of vision. Yes, I think that’s really interesting.

And with that the opening session is over, with the organisers noting that those interested in surveillance may be interested to know that Room 101, said to have inspired the room of the same name in 1984, is where we are having coffee…

Session 1B (Chair: Marie Chouleur, National Library of France):

Jefferson Bailey (Deputy chair of IIPC, Director of Web Archiving, Internet Archive): Advancing access and interface for research use of web archives

I would like to thank all of the organisers again. I’ll be giving a broad rather than deep overview of what the Internet Archive is doing at the moment.

For those that don’t know, we are a non-profit Digital Library and Archive founded in 1996. We work in a former church and it’s awesome – you are welcome to visit; we do open public lunches every Friday if you are ever in San Francisco. We have lots of open source technology and we are very technology-driven.

People always ask about stats… We are at 30 Petabytes plus multiple copies right now, including 560 billion URLs, 280 billion webpages. We archive about 1 billion URLs per week, and have partners and facilities around the world, including here in the UK where we have Wellcome Trust support.

So, searching… This is the WayBack Machine. Most of our traffic – 75% – is automatically directed to the new service. So, if you search for, say, UK Parliament, you’ll see the screenshots, the URLs, and some statistics on what is there and captured. So, how does it work? With that much data, how do you do full text search? Even the raw text (not HTML) is 3-5 PB. So, we figured the most instructive and easiest text to work with is the anchor text of all in-bound links to a homepage. The index covers 443 million homepages, drawn from 900B in-bound links from other cross-domain websites. Is that perfect? No, but it’s the best that works on this scale of data… And people tend to make keyword type searches, which this works for.
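The anchor-text approach described here can be sketched as a tiny inverted index – a toy illustration only, with invented function names and sample links, not the Internet Archive’s actual implementation:

```python
from collections import defaultdict

def build_anchor_index(inbound_links):
    """Index each homepage under the words used in links pointing at it.

    inbound_links: iterable of (anchor_text, target_homepage) pairs.
    """
    index = defaultdict(set)
    for anchor_text, target in inbound_links:
        for token in anchor_text.lower().split():
            index[token].add(target)
    return index

def search(index, query):
    """Rank homepages by how many query terms link to them."""
    hits = defaultdict(int)
    for token in query.lower().split():
        for target in index.get(token, ()):
            hits[target] += 1
    return sorted(hits, key=hits.get, reverse=True)

links = [
    ("UK Parliament", "parliament.uk"),
    ("Parliament homepage", "parliament.uk"),
    ("BBC News", "bbc.co.uk"),
]
idx = build_anchor_index(links)
print(search(idx, "uk parliament"))  # → ['parliament.uk']
```

The appeal at petabyte scale is that the index only ever touches short anchor strings, never full page text.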

You can also now, in the new Way Back Machine, see a summary tab which includes a visualisation of data captured for that page, host, domain, MIME-type or MIME-type category. It’s really fun to play with. It’s really cool information to work with. That information is in the Way Back Machine (WBM) for 4.5 billion hosts; 256 million domains; 1238 TLDs. There are also special collections – building this for specific crawls/collections such as our .gov collection. And there is an API – so you can create your own visualisations if you like.

We have also created a full text search for AIT (Archive-It). This was part of a total rebuild of full text search in Elasticsearch: 6.5 billion documents with a 52 TB full text index. In total AIT is 23 billion documents and 1 PB. Searches are across all 8000+ collections. We have improved relevance ranking, metadata search and performance. And we have a Media Search coming – it’s still a test at present. So you can search non-textual content with a similar process.

So, how can we help people find things better… search, full text search… and APIs. The APIs power the detail charts: capture counts, year, size, new content, domains/hosts. Explore that more and see what you can do. We’ve also been looking at Data Transfer APIs to standardise transfer specifications for web data exchange between repositories for preservation. For research use you can submit “jobs” to create derivative datasets from WARCs from specific collections. And it allows programmatic access to AIT WARCs: submission of a job, job status, derivative results list. More at: https://github.com/WASAPI-Community/data-transfer-apis.

In other API news we have been working with WAT files – a sort of metadata file derived from a WARC. This includes headers and content (title, anchor/text, metas, links). We have API access to some capture content – a better way to get programmatic access to the content itself. So we have a test build on a 100 TB WARC set (EOT). It’s like the CDX API with a build – it replays WATs not WARCs (see: http://vinay-dev.us.archive.org:8080/eot2016/20170125090436/http://house.gov/). You can analyse, for example, term counts across the data.
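The CDX index records that these APIs are built on have a simple space-separated layout. A minimal parsing sketch – the field order follows the common 11-field CDX convention used by Wayback-style tools, but layouts vary, and the sample line here is invented:

```python
# Common 11-field CDX ordering (assumption; check your archive's header line).
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "redirect", "metatags",
              "length", "offset", "filename"]

def parse_cdx_line(line):
    """Turn one space-separated CDX line into a field->value dict."""
    return dict(zip(CDX_FIELDS, line.split()))

line = ("gov,house)/ 20170125090436 http://house.gov/ text/html 200 "
        "ABCDEF123456 - - 5123 1024 EOT-2016.warc.gz")
record = parse_cdx_line(line)
print(record["mimetype"], record["statuscode"])  # → text/html 200
```

Once parsed like this, per-capture metadata (MIME type, status code, timestamp) can be counted or loaded into a database without touching the WARCs themselves.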

In terms of analysing language we have a new CDX code to help identify languages. You can visualise this data, see the language of the texts, etc. A lot of our content right now is in English – we need less focus on English in the archive.

We are always interested in working with researchers on building archives, not just using them. So we are working on the News Measures Research Project. We are looking at 663 local news sites representing 100 communities. 7 crawls for a composite week (July-September 2016).

We are also working with a Katrina blogs project: after the research was done and the project was published, we created a special collection of the sites used so that it can be accessed and explored.

And in fact we are generally looking at ways to create useful sub-collections and ways to explore content. For instance Gif Cities is a way to search for gifs from GeoCities. We have a Military Industrial Powerpoint Complex, turning PPTs into PDFs and creating a special collection.

We did a new collection, with a dedicated portal (https://www.webharvest.gov), which archives the US Congress for NARA. We capture this every 2 years, and it also raised questions of indexing YouTube videos.

We are also looking at historical ccTLD Wayback Machines, built on IA global crawls with added historic web data, with keyword and MIME/format search, embed linkback, domain stats and special features. This gives, for example, a German view – from the .de domain – of the archive.

And we continue to provide data and datasets for people. We love Archives Unleashed – which ran earlier this week. We did an Obama White House data hackathon recently. We have a webinar on APIs coming very soon.

Q&A

Q1) What is anchor text?

A1) That’s when you create a link to a page – the text that is associated with that page.

Q2) If you are using anchor text in that keyword search… What happens when the anchor text is just a URL…

A2) We are tokenising all the URLs too. And yes, we are using a kind of PageRank type understanding of popular anchor text.

Q3) Is that TLD work.. Do you plan to offer that for all that ask for all top level domains?

A3) Yes! Because subsets are small enough that they allow search in a more manageable way… We basically build a new CDX for each of these…

Q4) What are issues you are facing with data protection challenges and archiving in the last few years… Concerns about storing data with privacy considerations.

A4) No problems for us. We operate as a library… The Way Back Machine is used in courts, but not by us – in US courts its recognised as a thing you can use in court.

Panel: Internet and Web Histories – Niels Brügger – Chair (NB); Marc Weber (MW); Steve Jones (SJ); Jane Winters (JW)

We are going to talk about the internet and the web, and also about the new journal, Internet Histories, which I am editing. The new journal addresses what my colleagues and I saw as a gap. On the one hand there are journals like New Media and Society and Internet Studies which are great, but rarely focus on history. And media history journals are excellent but rarely look at web history. We felt there was a gap there… and Taylor & Francis Routledge agreed with us… The inaugural issue is a double issue, 1-2, and the people on our panel today are authors in that first issue; we asked them to address six key questions from members of our international editorial board.

For this panel we will have an argument, counter-statement, and questions from the floor type format.

A Common Language – Marc Weber

This journal has been a long time coming… I am Curatorial Director, Internet History Program, Computer History Museum. We have been going for a while now. This Internet History program was probably the first one of its kind in a museum.

When I first said I was looking at the history of the web in the mid ’90s, people were puzzled… Now most people have moved to incurious acceptance. Until recently there was also tepid interest from researchers. But in the last few years it has reached critical mass – and this journal is a marker of that change.

We have this idea of a common language, the sharing of knowledge. For a long time my own perspective was mostly focused on the web; it was only when I started the Internet History program that I thought about the fuller sweep of cyberspace. We come in through one path or thread, and it can be (too) easy to only focus on that… The first major network, the ARPAnet, was there and has become the internet. Telenet was one of the most important commercial networks in the 1970s, but who here now remembers Anne Reid of Telenet? [no-one] And by contrast, what about Vint Cerf? [some] However, we need to understand what changed, what did not succeed in the long term, how things changed and shifted over time…

We are kind of in the Victorian era of the internet… We have 170 years of telephones, 60 years of going online… longer of imagining a connected world. Our internet history goes back to the 1840s and the telegraph. And a useful thought here: “The past isn’t over. It isn’t even past” – William Faulkner. Of this history only small portions are preserved properly. Some of the risks are of not having a collective narrative… and of not understanding particular aspects in proper context. There is also scope for new types of approaches and work, not just applying traditional approaches to the web.

There is a risk of a digital dark age – we have a film to illustrate this at the museum, although I don’t think this crowd needs persuading of the importance of preserving the web.

So, going forward… We need to treat history and preservation as something to do quickly, we cannot go back and find materials later…

Response – Jane Winters

Marc makes, I think convincingly, the case for a common language, and for understanding the preceding and surrounding technologies, why they failed, and their commercial, political and social contexts. And I agree with the importance of capturing that history, with oral history a key means to do this. Secondly, the call to look beyond your own interest or discipline – interdisciplinary research is always challenging, but in the best sense, and can be hugely rewarding when done well.

Understanding the history of the internet and its context is important, although I think we see too many comparisons with early printing. Although some of those views are useful… I think there is real importance in getting to grips with these histories now, not in a decade or two. Key decisions will be made, from net neutrality to mass surveillance, and right now the understanding and analysis of the issues is not sophisticated – such as the incompatibility of “back doors” and secure internet use. And as researchers we risk focusing on the content, not the infrastructure. I think we need a new interdisciplinary research network, and we have all the right people gathered here…

Q&A

Q1) Mark, as you are from a museum… Have you any thoughts about how you present the archived web, the interface between the visitor and the content you preserve.

A1) What we do now with the current exhibits… the star isn’t the objects, it is the screen. We do archive some websites – but don’t try to replicate the internet archive but we do work with them on some projects, including the GeoCities exhibition. When you get to things that require emulation or live data, we want live and interactive versions that can be accessed online.

Q2) I’m a linguist and was intrigued by the interdisciplinary collaboration suggested… How do you see linguists and the language of the web fitting in…

A2) Actually there is a postdoc – Naomi – looking at how different language communities in the UK have engaged through looking at the UK Web Archive, seeing how language has shaped their experience and change in moving to a new country. We are definitely thinking about this and it’s a really interesting opportunity.

Out from the PLATO Cave: Uncovering the pre-Internet history of social computing – Steve Jones, University of Illinois at Chicago

I think you will have gathered that there is no one history of the internet. PLATO was a space for education and, for my interests, it also became a social space, and a platform for online gaming. These uses were spontaneous rather than centrally led. PLATO was an acronym for Programmed Logic for Automatic Teaching Operations (see the diagram in Ted Nelson’s Dream Machines and https://en.wikipedia.org/wiki/PLATO_(computer_system)).
There were two key interests in developing for PLATO – one was multi-player games, and the other was communication. And the latter was due to laziness… Originally the PLATO lab was in a large room, and we couldn’t be bothered to walk to each others desks. So “Talk” was created – and that saved standard messages so you didn’t have to say the same thing twice!

As time went on, I undertook undergraduate biology studies and engaged with the Internet, and saw that interaction as similar… At that time data storage was so expensive that storing content in perpetuity seemed absurd… If it was kept, it was because you hadn’t got to rewriting it yet. You would print out code – then rekey it – that was possible at the time given the number of lines per programme. So, in addition to the materials that were missing… there were boxes of ledger-size green bar print outs from a particular PLATO Notes group of developers. Having found these in the archive I took pictures to OCR – that didn’t work! I got – brilliantly and terribly – funding to preserve that text. That content can now be viewed side by side in the archive – images next to re-keyed text.

Now, PLATO wasn’t designed for lay users; it was designed for professionals, although it was also used by university and high school students who had the time to play with it. So you saw changes between developer and community values, seeing the development of affordances in the context of the discourse of the developers – that archived set of discussions. The value of that work is to describe and engage with this history not just from our current day perspective, but to understand the context, the people and their discourse at the time.

Response – Marc

PLATO sort of is the perfect example of a system that didn’t survive into the mainstream… Those communities knew each other, the idea of the flatscreen – which led to the laptop – came from PLATO. PLATO had a distinct messaging system, separate from the ARPAnet route. It’s a great corpus to see how this was used – were there flames? What does one-to-many communication look like? It is a wonderful example of the importance of preserving these different threads.. And PLATO was one of the very first spaces not full of only technical people.

PLATO was designed for education, and that meant users were mainly students, and that shaped community and usage. There was a small experiment with community time sharing memory stores – with terminals in public places… But PLATO began in the late ’60s and ran through into the ’80s; it is the poster child for preserving earlier systems. PLATO Notes became Lotus Notes – that isn’t there now, but in its own domain PLATO was the progenitor of much of what we do with education online now, and that history is also very important.

Q&A

Q1) I’m so glad, Steve, that you are working on PLATO. I used to work in Medical Education in Texas and we had PLATO terminals to teach basic science first and second year medical education students and ER simulations. And my colleagues and I were taught computer instruction around PLATO. I am intereted that you wanted to look at discourse around UIC around PLATO – so, what did you find? I only experienced PLATO at the consumer end of the spectrum, so I wondered what the producer end was like…

A1) There are a few papers on this – search for it – but two basic things stand out… (1) the degree to which as a mainframe system PLATO was limited as system, and the conflict between the systems people and the gaming people. The gaming used a lot of the capacity, and although that taxed the system it did also mean they developed better code, showed what PLATO was capable of, and helped with the case for funding and support. So it wasn’t just shut PLATO down, it was a complex 2-way thing; (2) the other thing was around the emergence of community. Almost anyone could sit at a terminal and use the system. There were occasional flare ups and they mirrored community responses even later around flamewars, competition for attention, community norms… Hopefully others will mine that archive too and find some more things.

Digital Humanities – Jane Winters

I’m delighted to have an article in the journal, but I won’t be presenting on this. Instead I want to talk about digital humanities and web archives. There is a great deal of content in web archives but we still see little research engagement in web archives, there are numerous reasons including the continuing work on digitised traditional texts, and slow movement to develop new ways to research. But it is hard to engage with the history of the 21st century without engaging with the web.

The mismatch of the value of web archives and the use and research around the archive was part of what led us to set up a project here in 2014 to equip researchers to use web archives, and encourage others to do the same. For many humanities researchers it will take a long time to move to born-digital resources. And to engage with material that subtly differs for different audiences. There are real challenges to using this data – web archives are big data. As humanities scholars we are focused on the small, the detailed, we can want to filter down… But there is room for a macro historical view too. What Tim Hitchcock calls the “beautiful chaos?” of the web.

Exploring the wider context one can see change on many levels – from the individual person or business, to wide spread social and political change. How the web changes the language used between users and consumers. You can also track networks, the development of ideas… It is challenging but also offers huge opportunities. Web archives can include newspapers, media, and direct conversation – through social media. There is also visual content, gifs… The increase in use of YouTube and Instagram. Much of this sits outside the scope of web archives, but a lot still does make it in. And these media and archiving challenges will only become more challenging as see more data… The larger and more uncontrolled the data, the harder the analysis. Keyword searches are challenging at scale. The selection of the archive is not easily understood but is important.

The absence of metadata is another challenge too. The absence of metadata or alternative text can render images, particularly, invisible. And the mix of formats and types, of the personal and the public, is most difficult but also most important. For instance the announcement of a government policy, the discussion around it, a petition perhaps, a debate in parliament… These are not easy to locate… Our histories are almost inherently online… but they only gain any real permanence through preservation in web archives, and that’s why humanists and historians really need to engage with them.

Response – Steve

I particularly want to talk about archiving in scholarship. In order to fit archiving into scholarly models… administrators increasingly make the case for scholarship in the context of employment and value. But archive work is important. Scholars are discouraged from this sort of work because it is not quick, and it’s harder to get published… Separately, you need organisations to engage in preservation of their own online presences. The degree to which archive work is needed is not reflected by promotion committees, organisational support, or local archiving processes. There are immense rhetorical challenges here, to persuade others of the value of this work. There have been successful cases made to encourage telephone providers to capture and share historical information. I was at a telephone museum recently and asked about the archive… She handed me a huge book on the founding of Southwestern Bell, published in a very small run… She gave me a copy but no-one had asked about this before… That’s wrong though, it should be captured. So we can do some preservation work ourselves just by asking!

Q&A

Q1) Jane, you mentioned a skills gap for humanities researchers. What sort of skills do they need?

A1) I think the complete lack of quantitative data training – how to sample, how to make meaning from quantitative data. They have never engaged in statistical training. They have never been required to – you specialise so early here. Also, basic command line stuff… People don’t understand that, or why they have to engage that way. Those are two simple starting points. They help them understand what they are looking at, what an n-gram means, etc.

Session 2B (Chair: Tom Storrar)

Philip Webster, Claire Newing, Paul Clough & Gianluca Demartini: A temporal exploration of the composition of the UK Government Web Archive

I’m afraid I’ve come into this session a little late. I have come in at the point that Philip and Claire are talking about the composition of the archive – mostly 2008 onwards – and looking at status codes of UK Government Web Archive. 

Philip: The hypothesis for looking at HTTP status codes was to see if changes in government produced trends in the HTTP status codes. Actually, when we looked at post-2008 data we didn’t see what we expected there. However we did find that there was an increase in not finding what was requested – we thought this may be about moving to dynamic pages – but this is not a strong trend.

In terms of MIME types – media types – which are restricted to:

Application – flash, java, Microsoft Office Documents. Here we saw trends away from PDF as the dominant format. Microsoft word increases, and we see the increased use of Atom – syndication – coming across.

Executable – we see quite a lot of javascript. The importance of flash decreased over time – which we expected – and the increased in javascript (javascript and javascript x).

Document – PDF remains prevalent. Also MS Word, some MS Excel. Open formats haven’t really taken hold…

Claire: The Government Digital Strategy included guidance to use open document formats as much as possible, but that wasn’t mandated until late 2014 – a bit too late for our data set unfortunately. And the Government Digital Strategy in 2011 was, itself, published in Word and PDF!

Philip: If we take document types outside of PDF, you see that lack of open formats more clearly…

Image – This includes images appearing in documents, plus icons. And occasionally you see non-standard media types associated with the MIME-types. JPEG use is fairly consistent. GIF and PNG are comparable… GIF was being phased out for IP reasons, with PNG to replace it, and you see that change over time…

Text – Text is almost all HTML. You see a lot of plain text, stylesheets, XML…

Video – we saw compressed video formats… but these were gradually superseded by embedded YouTube links. However we do still see a lot of flash video retained. And we see a large, increasing amount of MP4, used by Apple devices.

Another thing that is available over time is relative file size. However the CDX index only contains compressed size data and therefore is not a true representation of file size trends, so you can’t compare images to their pre-archiving versions. That means for this work we’ve limited the data set to those where you can tell the before and after status of the image files. We saw some spikes in compressed image formats over time; it’s not clear if this shows departmental issues…

To finish on a high note… There is an increase in the use of https rather than http. I thought it might be the result of a campaign, but it seems to be a general trend..

The conclusion… Yes, it is possible to do temporal analysis of CDX index data, but you have to be careful, looking at proportions rather than raw frequencies. SQL is feasible, commonly available and low cost. Archive data has particular weaknesses – the data cannot be assumed to be fully representative – but in some cases trends can be identified.
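The proportion-not-frequency point can be sketched in a few lines of SQL. This is a hedged toy, not the authors’ actual pipeline: the table name, columns and sample rows are invented, with SQLite standing in for whatever database they used:

```python
import sqlite3

# Toy CDX-derived table: one row per capture, with year and MIME type.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE captures (year INTEGER, mimetype TEXT)")
con.executemany("INSERT INTO captures VALUES (?, ?)", [
    (2008, "application/pdf"), (2008, "application/pdf"), (2008, "text/html"),
    (2012, "application/pdf"), (2012, "application/msword"), (2012, "text/html"),
])

# Per-year *share* of each MIME type: raw counts mislead because crawl
# sizes differ year to year, so divide by the year's total captures.
rows = con.execute("""
    SELECT c.year, c.mimetype, 1.0 * COUNT(*) / t.total AS share
    FROM captures c
    JOIN (SELECT year, COUNT(*) AS total
          FROM captures GROUP BY year) t
      ON c.year = t.year
    GROUP BY c.year, c.mimetype
    ORDER BY c.year, share DESC
""").fetchall()

for year, mimetype, share in rows:
    print(year, mimetype, round(share, 2))
```

With real data the INSERTs would be replaced by a bulk load of parsed CDX lines, but the aggregation query stays the same shape.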

Q&A

Q1) Very interesting, thank you. Can I understand… You are studying the whole archive? How do you take account of having more than one copy of the same data over time?

A1) There is a risk of one website being overrepresented in the archive. There are checks that can be done… But that is more computationally expensive…

Q2) With the seed list, is that generating the 404 rather than actual broken links?

A2 – Claire) We crawl by asking the crawler to go out to find links and seed from that. It generally looks within the domain we’ve asked it to capture…

Q3) At various points you talked about peaks and trends… Have you thought about highlighting that to folks who use your archive so they understand the data?

A3 – Claire) We are looking at how we can do that more. I have read about historians’ interest in the origins of the collection, and we are thinking about this, but we haven’t done that yet.

Caroline Nyvang, Thomas Hvid Kromann & Eld Zierau: Capturing the web at large – a critique of current web citation practices

Caroline: We are all here as we recognise the importance and relevance of internet research. Our paper looks at web referencing and citation within the sciences. We propose a new format to replace the URL+date format usually recommended. We will talk about a study of web references in 35 Danish master’s theses from the University of Copenhagen, then further work on monograph referencing, then a new citation format.

The work on 35 master’s theses submitted to the University of Copenhagen included, as a set, 899 web references; there was an average of 26.4 web references per thesis – some had none, the max was 80. This gave us some insight into how students cite URLs. Of those students citing websites: 21% gave the date for all links; 58% had dates for some but not all sites; 22% had no dates. Some of those URLs pointed to homepages or search results.

We looked at web rot and web references – almost 16% could not be accessed by the reader, checked or reproduced. An error rate of 16% isn’t that remarkable – in 1992 a study of 10 journals found that a third of references were inaccurate enough to make it hard to find the source again. But web resources are dynamic, and the issues will vary, and likely increase, over time.

The number of web references does not seem to correlate with particular subjects. Students are also quite imprecise when they reference websites. And even when the correct format was used, 15.5% of all the links would still have been dead.

Thomas: We looked at 10 Danish academic monographs published from 2010-2016. Although this is a small number of titles, it allowed us to see some key trends in the citation of web content. There was a wide range in the number of web citations used – 25% at the top, 0% at the bottom of these titles. The location of web references in these texts is not uniform. On the whole scholars rely on printed scholarly work… but web references are still important. This isn’t a systematic review of these texts… In theory these links should all work.

We wanted to see the status after five years… We used a traffic light system: 34.3% were red – broken, dead, or a different page; 20?% were amber – critical links that refer to changed or at-risk material; 44.7% were green – working as expected.

This work showed that web references turn into dead links within a limited number of years. In our work the URLs that go to the front page, with instructions of where to look, actually, ironically, lasted best. Long, complex URLs were most at risk… So, what can we do about this?

Eld: We felt that we had to do something here, to address what is needed. We can see from the studies that today’s practice of URL plus date stamp does not work. We need a new standard, a way to reference something stable. The web is a marketplace and changes all the time. We need to look to the web archives… And we need precision and persistency. We felt there were four necessary elements, and we call it the PWID – Persistent Web IDentifier. The four elements are:

  • Archived URL
  • Time of archiving
  • Web archive – precision, and an indication that you have verified this is what you expect. Also persistency: the researcher has to understand that – is it a small or large archive, what is the contextual legislation?
  • Content coverage specification – is it a part only? Is it the HTML only? Is it the page including images, as it appears in your browser? Or is it the site including referred pages within the domain?

So we propose a form of reference which can be textually expressed as:

web archive: archive.org, archiving time: 2016-04-20 18:21:47 UTC, archived URL: http://resaw.en/, content coverage: webpage

But, why not use web archive URL? Of the form:

https://web.archive.org/web/20160420182147/http://resaw.en/

Well, this can be hard to read, there is a lot of technology embedded in the URL. It is not as accessible.

So, a PWID URI:

pwid:archive.org:2016-04-20_18.21.47Z:page:http://resaw.en/

This has now been submitted as a suggestion for ISO 690, and proposed as a URI type.
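As a rough sketch of how the proposed format might be handled programmatically – based only on the example above, so the exact grammar in the final ISO 690 proposal may well differ:

```python
# Sketch parser for the proposed pwid: URI scheme, based on the example
# pwid:archive.org:2016-04-20_18.21.47Z:page:http://resaw.en/
# The exact grammar in the final ISO 690 proposal may differ.
from datetime import datetime

def parse_pwid(uri):
    """Split a pwid: URI into its four elements: web archive,
    time of archiving, content coverage, and archived URL."""
    scheme, rest = uri.split(":", 1)
    if scheme != "pwid":
        raise ValueError("not a pwid URI")
    # maxsplit=3 keeps any ':' inside the archived URL intact
    archive, timestamp, coverage, archived_url = rest.split(":", 3)
    return {
        "archive": archive,
        "archived_at": datetime.strptime(timestamp, "%Y-%m-%d_%H.%M.%SZ"),
        "coverage": coverage,   # e.g. "page" or "part"
        "url": archived_url,
    }

pwid = parse_pwid("pwid:archive.org:2016-04-20_18.21.47Z:page:http://resaw.en/")
```

Note how the four elements map directly onto the textual reference form given above.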

To sum up, all research fields need to refer to the web, and good scientific practice cannot take place with current approaches.

Q&A

Q1) I really enjoyed your presentation… I was wondering what citation format you recommend for content behind paywalls, and for dynamic content – things that are not in the archive.

A1 – Eld) We have proposed this for content in the web archive only. You have to put it into an archive to be sure, then you refer to it. But we haven’t tried to address those issues of paywall and dynamic content. BUT the URI suggestion could refer to closed archives too, not just open archives.

A1 – Caroline) We also wanted to note that this approach is to make web citations align with traditional academic publication citations.

Q2) I think perhaps what you present here is an idealised way to present archiving resources, but what about the marketing and communications challenge here – to better cite websites, and to use this convention when they aren’t even using best practice for web resources.

A2 – Eld) You are talking about marketing to get people to use this, yes? We are starting with the ISO standard… That’s one aspect. I hope also that this event is something that can help promote this and help to support it. We hope to work with different people, like you, to make sure it is used. We have had contact with Zotero for instance. But we are a library… We only have the resources that we have.

Q3) With some archives of the web there can be a challenge for students, for them to actually look at the archive and check what is there..

A3) Firstly citing correctly is key. There are a lot of open archives at the moment… But we hope the next step will be more about closed archives, and ways to engage with these more easily, to find common ground, to ensure we are citing correctly in the first place.

Comment – Nicola Bingham, BL) I like the idea of incentivising not just researchers but also publishers to incentivise web archiving, another point of pressure to web archives… And making the case for openly accessible articles.

Q4) Have you come across Martin Klein and Herbert Van de Sompel’s work on robust links, and Memento?

A4 – Eld) Memento is excellent to find things, but usually you do not have the archive in there… I don’t think the way of referencing without the archive is a precise reference…

Q5) When you compare to web archive URL, it was the content coverage that seems different – why not offer as an incremental update.

A5) As far as I know, using a # in the URL doesn’t offer that specificity…

Comment) I would suggest you could define the standard for after that # in the URLs to include the content coverage – I’ll take that offline.

Q6) Is there a proposal there… For persistence across organisations, not just one archive.

A6) I think from my perspective there should be a registry when archives change/move to find the new registry. Our persistent identifier isn’t persistent if you can change something. And I think archives must be large organisations, with formal custodians, to ensure it is persistent.

Comment) I would like to talk offline about content addressing and Linked Data to directly address and connect to copies.

Andrew Jackson: The web archive and the catalogue

I wanted to talk about some bad experiences I had recently… There is a recent BL video of the journey of a (print) collection item – from posting to processing, to cataloguing, etc. I have worked at the library for over 10 years, but this year, for the first time, I had to get to grips with the library catalogue… I’ll talk more about that tomorrow (in the technical strand), but we needed to update our catalogue… accommodating the different ways the catalogue and the archive see content.

Now, that video, the formation of teams, the structure of the organisation, the physical structure of our building are all about that print process, and that catalogue… So it was a surprise for me – maybe not for you – that the catalogue isn’t just bibliographic data, it’s also a workflow management tool…

There is a chain of events here… Sometimes events are in a line, sometimes in circles… Always forwards…

Now, last year legal deposit came in for online items… The original digital processing workflow went from acquisition to ingest to cataloguing… But most of the content was already in the archive… We wanted to remove duplication, and make the process more efficient… So we wanted to automate this as a harvesting process.

For our digital work previously we also had a workflow, from nomination, to authorisation, etc… With legal deposit we have to get it all, all the time, all the stuff… So, we don’t collect news items, we want all news sites every day… We might specify crawl targets, but more likely that we’ll see what we’ve had before and draw them in… But this is a dynamic process….

So, our document harvester looks for “watched targets”, harvests, extracts documents for web archiving… and also ingest. There are relationships to acquisition, that feeds into cataloguing and the catalogue. But that is an odd mix of material and metadata. So that’s a process… But webpages change… For print matter things change rarely, it is highly unusual. For the web changes are regular… So how do we bring these things together…
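The watched-targets flow described here might be sketched roughly like this – all of these names are hypothetical, invented for illustration, not taken from the BL’s actual codebase:

```python
# Hypothetical sketch of the "watched targets" document-harvesting flow:
# crawl each watched target, extract the documents of interest, ingest
# them into the web archive, and pass them on to the cataloguing workflow.
# All function names here are invented for illustration.

def harvest_documents(watched_targets, crawl, extract_documents,
                      ingest, send_to_cataloguing):
    for target in watched_targets:
        capture = crawl(target)                 # harvest the live site
        for doc in extract_documents(capture):  # e.g. PDFs of official publications
            ingest(doc)                         # store in the web archive
            send_to_cataloguing(doc)            # feeds acquisition / the catalogue
```

The point of the sketch is the split Andrew describes: one automated pass produces both the archived material and the metadata stream that the catalogue workflow consumes.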

To borrow an analogy from our Georeferencing project… Users engage with an editor to help us understand old maps. So, imagine a modern web is a web archive… Then you need information, DOIs, places and entities – perhaps a map. This kind of process allows us to understand the transition from print to online. So we think about this as layers of transformation… Where we can annotate the web archive… Or the main catalogue… That can be replaced each time this is needed. And the web content can, with this approach, be reconstructed with some certainty, later in time…

Also this approach allows us to use rich human curation to better understand that which is being automatically catalogued and organised.

So, in summary: the catalogue tends to focus on chains of operation and backlogs, item by item. The web archive tends to focus on transformation (and re-transformation) of data. A layered data model can bring them together. It means revisiting the data (but fixity checking requires this anyway). It’s costly in terms of disk space required. And it allows rapid exploration and experimentation.

Q1) To what extend is the drive for this your users, versus your colleagues?

A1) The business reason is that it will save us money… Taking away manual work. But, as a side effect we’ve been working with cataloguing colleagues in this area… And their expectations are being raised and changed by this project. I do now much better understand the catalogue. The catalogue tends to focus on tradition not output… So this project has been interesting from this perspective.

Q2) Are you planning to publish that layer model – I think it could be useful elsewhere?

A2) I hope to yes.

Q3) And could this be used in Higher Education research data management?

A3) I have noticed that with research data sets there are some tensions… Some communities use change management, functional programming etc… Hadoop, which we use, requires replacement of data… So yes, but this requires some transformation to do.

We’d like to use the same based data infrastructure for research… Otherwise had to maintain this pattern of work.

Q4) Your model… suggests WARC files and such archive documents might become part of new views and routes in for discovery.

A4) That’s the idea – for discovery to be decoupled from where you store the file.

Nicola Bingham, UK Web Archive: Resource not in archive: understanding the behaviour, borders and gaps of web archive collections

I will describe the shape and the scope of the UK Web Archive, to give some context for you to explore it… By way of introduction.. We have been archiving the UK Web since 2013, under UK non-print legal deposit. But we’ve also had the Open Archive (since 2004); Legal Deposit Archive (since 2013); and the Jisc Historical Archive (1996-2013).

The UK Web Archive includes around 400 TB of compressed data, in the region of 11-12 billion records. We grow, on average, 60-70 TB and 3 billion records per year. We want to be comprehensive but, that said, we can’t collect everything and we don’t want to collect everything… Firstly, we collect UK websites only. We carry out web archiving under the 2013 regulations, which state that only UK published web content is in scope – meaning content on a UK web domain, or by a person whose work occurs in the UK. So, we can automate harvesting from the UK TLDs (.uk, .scot, .cymru etc) and from UK hosting – a geo-IP look-up to locate the server. Then manual checks. So Facebook, WordPress, Twitter content cannot be automated…
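The two-stage scoping described (automatable TLD check, then geo-IP, then manual curator review) could be sketched as below. Note the assumptions: the TLD tuple beyond .uk/.scot/.cymru is my guess, and the geo-IP look-up is stubbed as a callable, since a real check needs a GeoIP database:

```python
# Sketch of UK-scope triage: UK TLDs can be auto-included; anything else
# needs a geo-IP look-up of the hosting server (stubbed as a callable here,
# since a real check needs a GeoIP database), and failing that a manual check.
from urllib.parse import urlparse

UK_TLDS = (".uk", ".scot", ".cymru", ".wales", ".london")  # .wales/.london assumed

def uk_scope(url, geoip_country=None):
    """Return 'auto' if in scope by TLD, 'auto-geoip' if the server
    resolves to the UK, otherwise 'manual' for curator review."""
    host = urlparse(url).hostname or ""
    if host.endswith(UK_TLDS):
        return "auto"
    if geoip_country is not None and geoip_country(host) == "GB":
        return "auto-geoip"
    return "manual"
```

This is why Facebook, WordPress and Twitter content falls through to the manual path: neither the TLD nor the hosting location identifies it as UK-published.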

We only collect published content. Out of scope here are:

  • Film and recorded sound where AV content predominates, e.g. YouTube
  • Private intranets and emails.
  • Social networking sites only available to restricted groups – if you need a login or special permissions, they are out of scope.

Web archiving is expensive. We have to provide good value for money… We crawl the UK domain on an annual basis (only). Some sites are crawled more frequently, but annual misses a lot. We cap domains at 512 MB – which captures many sites in their entirety, but others we only capture part of (unless we override the automatic settings).

There are technical limitations too, around:

  • Database-driven sites – crawlers struggle with these
  • Programming scripts
  • Plug-ins
  • Proprietary file formats
  • Blockers – robots.txt or access denied.

So there are misrepresentations… For instance the One Hundred Women blog captures the content but not the stylesheet – that’s a fairly common limitation.

We also have curatorial input to locate the “important stuff”. In the British Library web archiving is not performed universally by all curators, we rely on those who do engage, usually voluntarily. We try to onboard as many curators and specialist professionals as possible to widen coverage.

So, I’ve talked about gaps and boundaries, but I also want to talk about how users of the archive find this information, so that even where there are gaps, it’s a little more transparent…

We have the Collection Scoping Document, which captures the scope, motivation, parameters and timeframe of a collection. This document could, in a pared-down form, be made available to end users of the archive.

We have run user testing of our current UK Web Archive website, and our new version. And even more general audiences really wanted as much contextual information as possible. That was particularly important on our current website – where we only shared permission-cleared items. But this is one way in which contextual information can be shown in the interface with the collection.

The metadata can be browsed and searched, though users will be directed to come in to view the content.

So, an example of a collection would be 1000 Londoners, showing the context of the work.

We also gather information during the crawling process… We capture information on crawler configuration, seed list, exclusions… I understand this could be used and displayed to users to give statistics on the collection…

So, what do we know about what the researchers want to know? They want as much documentation as they possibly can. We have engaged with the research community to understand how best to present data to the community. And indeed that’s where your feedback and insight is important. Please do get in touch.

Q&A

Q1) You said you only collect “published” content… How do you define that?

A1) Under the legal deposit regulations… the legal deposit libraries may collect content openly available on the web, and also content that is paywalled or behind login credentials – UK publishers are obliged to provide credentials for crawling. BUT how we make that accessible is a different matter – we wouldn’t republish that on the open web without logins/credentials.

Q2) How do you have any ideas about packaging this type of information for users and researchers – more than crawler config files.

A2) The short answer is no… We’d like to invite researchers to access the collection in both a close reading sense, and a big data sense… But I don’t have that many details about that at the moment.

Q3) A practical question: if you know you have to collect something… If you have a web copy of a government publication, say, and the option of the original, older, (digital) document… Is the web archive copy enough, do you have the metadata to use that the right way?

A3) Yes, so on the official publications… This is where the document harvester tool comes into play, adding another layer of metadata to pass the document through various access elements appropriately. We are still dealing with this issue though.

Chris Wemyss – Tracing the Virtual community of Hong Kong Britons through the archived web

I’ve joined this a wee bit late after a fun adventure on the Senate House stairs… 

Looking at the Gwulo: Old Hong Kong site… User content is central to this site, which is centred on a collection of old photographs, buildings, people, landscapes… The website has started to add features to explore categorisations of images… And the site is led by an older British resident. He described subscribers as expats who have moved away, for whom it preserves an old version of Hong Kong that no longer exists – one user described it as an interactive photo album. There is clearly more to be done on this phenomenon of building these collective resources to construct this type of place. The founder comments on Facebook groups – they are about the now: “you don’t build anything, you just have interesting conversations”.

A third example then, Swire Mariners Association. This site has been running, nearly unchanged, for 17 years, but they have a very active forum, a very active Facebook group. These are all former dockyard workers, they meet every year, it is a close knit community but that isn’t totally represented on the web – they care about the community that has been constructed, not the website for others.

So, in conclusion archives are useful in some cases. Using oral history and web archives together is powerful, however, where it is possible to speak to website founders or members, to understand how and why things have changed over time. Seeing that change over time already gives some idea of the futures people want to see. And these sites indicate the demand for communities, active societies, long after they are formed. And illustrates how people utilise the web for community memory…

Q&A

Q1) You’ve raised a problem I hadn’t really thought about. How can you tell if they are more active on Facebook or the website… How do you approach that?

A1) I have used web archiving as one source to arrange other things around… Looking for new websites, finding and joining the Facebook group, finding interviewees to ask about that. But I wouldn’t have been prompted to ask about the website and its change/lack of change without consulting the web archives.

Q2) Were participants aware that their pages were in the archive?

A2) No, not at all. The blog I showed first was started by two guys, Gwulo is run by one guy… And he quite liked the idea that this site would live on in the future.

David Geiringer & James Baker: The home computer and networked technology: encounters in the Mass Observation Project archive, 1991-2004

I have been doing work on various web communities, including some work on GeoCities which is coming out soon… And I heard about the Mass Observation Project which, from 1991 to 2004, asked respondents about computers and how they were using them in their lives… The archives capture comments like:

“I confess that sometimes I resort to using the computer, using the cut and paste technique to write several letters at once”

Confess is a strong word there… Over this period of observation we saw the production of text moving to computers, computers moving into most homes, the rebuilding of modernity. We welcome comment on this project, and hope to publish soon, where you can find out more on our method and approach.

So, each year since 1981 the Mass Observation Project has issued directives asking respondents to respond to key issues, e.g. football, or the AIDS crisis. They issued the technology directive in 1991. From that year we see several fans of the word processor – words like love, dream… Responses to the 1991 directive are overwhelmingly positive – something that was not the case for other technologies on the whole…

“There is a spell check on this machine. Also my mind works faster than my hand and I miss out letters. This machine picks up all my faults and corrects them. Thank you computer.”

After this positive response, though, we start to see etiquette issues, concerns about privacy… writing some correspondence by hand; some use simulated handwriting… And respondents start to have concerns about adapting letters, whether that is cheating or not… ethical considerations appearing… It is apparent that sometimes the guilt around typing text is also slightly humorous… some playful mischief there…

Altering the context of the issue of copy and paste… the time and effort to write a unique manuscript is a concern… Interestingly the directive asked about printing and filing emails… And one respondent notes that actually it wasn’t financial or business records they kept, but emails from their ex…

Another comments that they wish they had printed more emails during their pregnancy, a way of situating yourself in time and remembering the experience…

I’m going to skip ahead to how computers fitted into their home… People talk about dining rooms, and offices, and living rooms.. Lots of very specific discussions about where computers are placed and why they are placed there… One person comments:

“Usually at the dining room at home which doubles as our office and our coffee room”

Others talk about quieter spaces… The positioning of a computer seems to create some competition for use of space. The home changing to make room for the computer or the network… We also start to see (in 2004) comments about home life and work life, the setting up of a hotmail account as a subtle act of resistance, the reassertion of the home space.

A Mass Observation Directive in 1996 asked about email and the internet:

“Internet – we have this at work and it’s mildly useful. I wouldn’t have it at home because it costs a lot to be quite sad and sit alone at home” (1996)

So, observers from 1991 to 2004 talked about the efficiencies of the computer and internet – copy, paste, ease… But this then reflected concerns about the act of creating texts, of engaging with others, computers changing homes and spaces. Now, there are really specific findings around location, gender, class, age, sexuality… The overwhelming majority of respondents are white middle-class cis-gendered straight women over 50. But we do see that change of response to technology, a moment in time, from positive to concerned. That runs parallel to the rise of the World Wide Web… We think our work provides context to web archive work and web research, with textual production influenced by these wider factors.

Q&A

Q1) I hadn’t realised mass observation picked up again in 1980. My understanding was that previously it was the observed, not the observers. Here people report on their own situations?

A1) They self report on themselves. At one point they are asked to draw their living room as well…

Q1) I was wondering about business machinery in the home – type writers for instance

A1) I don’t know enough about the wider archive. All of this newer material was done consistently… The older mass observation material was less consistent – people recorded on the street, or notes made in pubs. What is interesting is that in the newer responses you see a difference in the writing of the response… as they move from handwriting to typewriters to computers…

Q2) Partly you were talking about how people write and use computers, and a bit about how people archive themselves… But the only work I could find on how people archive themselves digitally was by Microsoft Research… Is there anything since then? In that paper, though, you could almost read regret between the lines… the loss of photo albums, letters, etc…

A2) My colleague David Geiringer, who I co-wrote the paper with, was initially looking at self-archiving. There was very, very little. But printing stuff comes up… and the tensions there. There is enough there, people talking about worries and loss… There is lots in there… The great thing with Mass Obs is that you can have a question but then you have to dig around a lot to find things…

Ian Milligan, University of Waterloo and Matthew Weber, Rutgers University – Archives Unleashed 4.0: presentation of projects (#hackarchives)

Ian: I’m here to talk about what happened on the first two days of Web Archiving Week. And I’d like to thank our hosts, supporters, and partners for this exciting event. We’ll do some lightning talks on the work undertaken… But why are historians organising data hackathons? Well, because we face problems in our popular cultural history. Problems like GeoCities… Kids write about Winnie the Pooh, people write about their love of Buffy the Vampire Slayer, their love of cigars… We face a problem of huge scale – GeoCities alone had some 7 million users… It’s the scale that boggles the mind. Compare it to the Old Bailey – one of very few sources on ordinary people, who otherwise leave only birth, death, marriage or criminal justice records. Its 197,745 trials between 1674 and 1913 – 239 years – are the biggest collection of texts about ordinary people. But from just 7 years of GeoCities we have 413 million web documents.

So, we have a problem, and myself, Matt and Olga from the British Library came together to build community, to establish a common vision of web archiving documents, to find new ways of addressing some of these issues.

Matt: I’m going to quickly show you some of what we did over the last few days… and the amazing projects created. I’ve always joked that Archives Unleashed is letting folk run amok to see what they can do… We started around 2 years ago, in Toronto, then Library of Congress, then at Internet Archive in San Francisco, and we stepped it up a little for London! We had the most teams, we had people from as far as New Zealand.

We started with some socialising in a pub on Friday evening, so that when we gathered on Monday we’d already done some introductions. Then a formal overview, and quickly forming teams to work and develop ideas… continuing through day one and day two… We ended up with 8 complete projects:

  • Robots in the Archives
  • US Elections 2008 and 2010 – text and keyword analysis
  • Study of Gender Distribution in Olympic communities
  • Link Ranking Group
  • Intersection Analysis
  • Public Inquiries Implications (Shipman)
  • Image Search in the Portuguese Web Archive
  • Rhizome Web Archive Discovery Archive

We will hear from the top three from our informal voting…

Intersection Analysis – Jess

We wanted to understand how we could find a cookbook methodology for understanding the intersections between different data sets. So, we looked at the Occupy Movement (2011/12) with a Web Archive, a Rutgers archive and a social media archive from one of our researchers.

We normalised the CDX files, crunched WATs for outlinks, and extracted links from tweets. We generated counts and descriptive data, and the union/intersection between every pair of data sets. We had over 74 million records in total, but only 0.3% overlap between the collections… If you go to our website we have a visualisation of the overlaps, tree maps of the collections…
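The pairwise union/intersection counting the team describes might look like this minimal sketch – the collection names and URLs are invented for illustration:

```python
# Minimal sketch of pairwise union/intersection over normalised URL sets,
# one set per collection. Collection names and URLs are invented.
from itertools import combinations

def overlap_stats(collections):
    """collections: dict of name -> set of normalised URLs.
    Returns, per pair, the intersection size and its share of the union."""
    stats = {}
    for a, b in combinations(collections, 2):
        inter = collections[a] & collections[b]
        union = collections[a] | collections[b]
        stats[(a, b)] = (len(inter), len(inter) / len(union) if union else 0.0)
    return stats

stats = overlap_stats({
    "web_archive":   {"http://occupy.org/", "http://example.org/a"},
    "rutgers":       {"http://occupy.org/", "http://example.org/b"},
    "twitter_links": {"http://example.org/c"},
})
```

At the team’s scale the same idea would run over normalised CDX/WAT output rather than in-memory sets, but the overlap arithmetic is the same.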

We wanted to use the WAT files to explore outlinks in the data sets – what they were linking to, and how much of it was archived (not a lot).

Parting thoughts? Overlap is inversely proportional to the diversity of URIs – in other words, the more collectors, the better. And diversifying seed lists with social media is good.

Robots in the Archive 

We focused on robots.txt. Our question was: “what do we miss when we respect robots.txt?”. At the National Library of Denmark we respect it… At the Internet Archive they’ve started to ignore it in some contexts. So, what did we do? We extracted the robots.txt files from the WARC collection, then applied them retroactively, then compared to the link graph.
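The retroactive check can be sketched with Python’s standard-library robots parser – the example robots.txt and URLs below are invented, and the step of pulling robots.txt out of the WARCs is assumed to have already happened:

```python
# Sketch of applying a robots.txt retroactively: parse a robots.txt
# captured in the collection and test which archived URLs would have
# been blocked for a given crawler user-agent.
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt, urls, user_agent="ia_archiver"):
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if not rp.can_fetch(user_agent, u)]

robots = """User-agent: ia_archiver
Disallow: /private/
"""
hits = blocked_urls(robots,
                    ["http://example.gov.uk/report.pdf",
                     "http://example.gov.uk/private/draft.pdf"])
```

Running this over every (robots.txt, URL list) pair in a collection gives exactly the kind of “what would we have missed?” count the team reports.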

Our data was from The National Archives and from the 2010 election. We started by looking at user-agent blocks. Four sites had specifically blocked the Internet Archive, but some robot names were very old and out of date… And we looked at crawl delay… Looking specifically at the sub-collection of the Department for Energy and Climate Change, we would have missed only 24 links that would have been blocked…

So, the impact of robots.txt is minimal for this collection. Our method can be applied to other collections, and extended to further the discussion on ignoring robots.txt. And our code is on GitHub.

Link Ranking Group 

We looked at link analysis to ask if all links are treated the same… We wanted to test if links in <li> elements are different from content links (in <p> or <div>). We used a WarcBase script to export manageable raw HTML, loaded it into the BeautifulSoup library, and used this on the Rio Olympics sites…

So we started looking at the WARCs… We said, well, we should test whether links were absolute or relative… Comparing hard links to relative links, we didn’t see a lot of difference…

But then we looked at a previous election data set… There we saw links in tables, and there relative links were about 3/4 of links, the other 1/4 hard links. We did some investigation of why we had more hard links (proportionally) than before… It turns out this is a mixture of SEO practice, but also the use of CMSs (Content Management Systems), which make hard links easier to generate… So we sort of stumbled on that finding…
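The team used WarcBase plus BeautifulSoup; this standard-library-only sketch does the same kind of classification – which element a link sits in (<li> versus <p>/<div>) and whether its href is absolute (“hard”) or relative:

```python
# Classify links by the element they sit in and by absolute vs relative
# href. Uses only the stdlib HTML parser; simplified (void elements like
# <br> are not special-cased), so this is a sketch, not production code.
from html.parser import HTMLParser

class LinkClassifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently-open ancestor tags
        self.links = []   # (parent_tag, href, is_absolute)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            parent = self.stack[-1] if self.stack else None
            self.links.append((parent, href,
                               href.startswith(("http://", "https://"))))
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

p = LinkClassifier()
p.feed('<ul><li><a href="/results">Results</a></li></ul>'
       '<p><a href="http://rio2016.com/">Rio</a></p>')
```

Tallying `p.links` by parent tag and absoluteness over a crawl’s pages gives the kind of <li>-versus-content and hard-versus-relative proportions the group compared.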

And with that the main programme for today is complete. There is a further event tonight and battery/power sockets permitting I’ll blog that too. 

Oct 08 2016
 

Today is the last day of the Association of Internet Researchers Conference 2016 – with a couple fewer sessions but I’ll be blogging throughout.

As usual this is a liveblog so corrections, additions, etc. are welcomed. 

PS-24: Rulemaking (Chair: Sandra Braman)

The DMCA Rulemaking and Digital Legal Vernaculars – Olivia G Conti, University of Wisconsin-Madison, United States of America

Apologies, I’ve joined this session late so you miss the first few minutes of what seems to have been an excellent presentation from Olivia. The work she was presenting on – the John Deere DMCA case – is part of her PhD work on how lay communities feed into lawmaking. You can see a quick overview of the case on NPR All Tech Considered and a piece on the ruling at IP Watchdog. The DMCA is the Digital Millennium Copyright Act (1998). My notes start about half-way through Olivia’s talk…

Property and ownership claims are made of distinctly American values… grounded in general ideals, evocations of the Bill of Rights, or asking what Ben Franklin would say… bringing in the idea of the DMCA as being contrary to the very foundations of the United States. Another theme was the idea that once you buy something you should be able to edit it as you like – indeed, the idea of “tinkering as a liberatory endeavour”. And you see people claiming that it is a basic human right to make changes and tinker, to tweak your tractor (or whatever). Commentators are not trying to appeal to the nation state; they are trying to perform the state, to make rights claims, to enact the rights of the citizen in a digital world.

So, John Deere made a statement that tractor buyers have an “implied license” to their tractor – they don’t own it outright. And that raised controversies as well.

So, the final register rule was that the farmers won: they could repair their own tractors.

But the vernacular legal formations allow us to see the tensions that arise between citizens and the rights holders. And that also raises interesting issues of citizenship – and of citizenship of the state versus citizenship of the digital world.

The Case of the Missing Fair Use: A Multilingual History & Analysis of Twitter’s Policy Documentation – Amy Johnson, MIT, United States of America

This paper looks at the multilingual history and analysis of Twitter’s policy documentation – or, policies as uneven scalar tools of power alignment. This comes from the idea of thinking of Twitter as more than just one whole, complete, overarching platform. There is much research now on moderation, but understanding this type of policy allows you to understand some of the distributed nature of platforms. Platforms draw lines when they decide which laws to transform into policies, and then again when they decide which policies to translate.

If you look across the list of Twitter policies, there is an English language version of each. Of this list it is only the Fair Use policy and the Twitter API limits that appear only in English. The API policy makes some sense, but the Fair Use policy does not. And Fair Use only appears really late – in 2014. The platform was set up in 2006, and many other policies came in in 2013… So what is going on?

So, here is the Twitter Fair Use Policy… Now, before I continue, I want to say that this translation situation (and lack of translation) for this policy is unusual. Generally all companies – not just tech companies – translate into FIGS: the French, Italian, German and Spanish languages. Twitter does not do this here, in contrast to the translations of the platform itself. And I wanted to talk particularly about translations into Japanese and Arabic. The Japanese translation came about through collaboration with a company that gave Twitter opportunities to expand into Japan. Arabic was not put in place until 2011, around the Arab Spring, and the translation wasn’t done by Twitter itself but by another organisation set up to do this. So you can see that there are other actors here playing into translations of the platform and its policies. So these iconic platforms are shaped in some unexpected ways.

So… I am not a lawyer but… Fair Use is a phenomenon that creates all sorts of internet lawyering. Typically there are four factors of fair use (Section 107 of the US Copyright Act of 1976): purpose and character of the use; nature of the copyrighted work; amount and substantiality of the portion used; and effect of the use on the potential market for or value of the copyrighted work. This is very much an American law, from a legal-economic point of view. And for a long time the US was the only country with a Fair Use law.

Now there is the concept of “Fair Dealing” – mentioned in passing in the Fair Use policy – which shares some characteristics. There are other countries with Fair Use laws: Poland, Israel, South Korea… yet their versions of Twitter point to the English-language policy. What about Japanese, given the rich reuse community on Twitter in Japan? It also points to the English policy.

So, policies are not equal in their policyness. But why does this matter? Because this is where the rule of law starts to break down… We cannot assume that the same policies apply universally.

But what about parody? Why bring this up? Well, parody is tied up with the idea of Fair Use and creative transformation – comedy is a protected Fair Use category. And Twitter has a rich seam of parody. Indeed, if you Google for the fair use policy, the “People also ask” section has as its first question: “What is a parody account?”.

Whilst Fair Use wasn’t there as a policy until 2014, parody unofficially had a policy in 2009, an official one in 2010, then updates, and another version in 2013 for the IPO. Biz Stone writes about lawyers at Google saying of fake accounts, “just say it is parody!”, and about the importance of parody. And indeed the parody policy has been translated much more widely than the Fair Use policy.

So, policies select bodies of law and align platforms to these bodies of law, in varying degree and depending on specific legitimation practices. Fair Use is strongly associated with US law, and embedding that in the translated policies aligns Twitter more to US law than they want to be. But parody has roots in free speech, and that is something that Twitter wishes to align itself with.

Visual Arts in Digital and Online Environments: Changing Copyright and Fair Use Practice among Institutions and Individuals – Patricia Aufderheide, Aram Sinnreich, American University, United States of America

Patricia: Aram and I have been working with the College Art Association, which brings together a wide range of professionals and practitioners in art across colleges in the US. They had a new code of conduct, and we wanted to speak to members a few months after that code was released to see if it had changed practice and understanding. This is a group that uses copyrighted work very widely. And indeed one-third of respondents avoid, abandon, or delay work because of copyright issues.

Aram: Four-fifths of CAA members use copyrighted materials in their work, but only one-fifth employ fair use to do so – most always or usually seek permission. And among those that do use fair use, there are some that always or usually rely on it. So there are real differences here. Fair Use is valued if you know about it and understand it… but a quarter of this group aren’t sure if Fair Use is useful or not. Now there is that code of conduct. There is also some use of Creative Commons and open licenses.

Of those that use copyrighted materials… 47% never use open licenses for their own work – there is a real reciprocity gap. Only 26% never use others’ openly licensed work, and only 10% never use others’ public domain work. Respondents value creative copying… 19 out of 20 CAA members think that creative appropriation can be “original”, and despite this group seeking permissions, they also don’t feel that creative appropriation should necessarily require permission. This really points to an education gap within the community.

And 43% said that uncertainty about the law limits creativity. They think they would appropriate works more, they would publish more, they would share work online… These mirror fair use usage!

Patricia: We surveyed this group twice, in 2013 and in 2016. Much stays the same but there have been changes… In 2016, two-thirds had heard about the code, and a third had shared that information – with peers, in teaching, with colleagues. Their associations with the concept of Fair Use are very positive.

Aram: The good news is that use of the code does lead to change, even within 10 months of launch. This work was done to try to show how much impact a code of conduct has on understanding… And there really were dramatic differences. In the 2016 data, those who are not aware of the code look a lot like those who are aware but have not used it. But for those who use the code there is a real difference… And more are using fair use.

Patricia: There is one thing we did outside of the survey… There have been dramatic changes in the field. A number of universities have changed journal policies to default to Fair Use – Yale, Duke, etc. Several museums have internally changed how they create and use their materials. So, we have learned that education matters – behaviour changes with knowledge and confidence. Peer support matters and validates new knowledge. Institutional action, well publicized, matters. The newest members are most likely to change quickly, but the most veteran are in the best position – it is important to have those influencers on board… And teachers need to bring this into their teaching practice.

Panel Q&A

Q1) How many are artists versus other roles?

A1 – Patricia) About 15% are artists, and they tend to be more positive towards fair use.

Q2) I was curious about changes that took place…

A2 – Aram) We couldn’t ask whether the code made you change your practice… But we could ask whether they had used fair use before and after…

Q3) You’ve made this code for the US CAA, have you shared that more widely…

A3 – Patricia) Many of the CAA members work internationally, but the effectiveness of this code in the US context is that it is about interpreting US Fair Use law – it is not a legal document but it has been reviewed by lawyers. But copyright is territorial which makes this less useful internationally as a document. If copyright was more straightforward, that would be great. There are rights of quotation elsewhere, there is fair dealing… And Canadian law looks more like Fair Use. But the US is very litigious so if something passes Fair Use checking, that’s pretty good elsewhere… But otherwise it is all quite territorial.

A3 – Aram) You can see in the data we hold that international practitioners have quite different attitudes to American CAA members.

Q4) You talked about the code, and changes in practice. When I talk to filmmakers and documentary makers in Germany, they were aware of Fair Use rights but didn’t use them, as they depend on TV companies buying their work – and those companies want every part of the rights cleared… They don’t want to hurt relationships.

A4 – Patricia) We always do studies before changes, and it is always about reputation and relationship concerns… Fair Use only applies if you can obtain the materials independently… But then the question may be whether rights holders will be pissed off next time you need to licence content. What everyone told me was that we can do this but it won’t make any difference…

Chair) I understand that, but that question is about use later on, and demonstration of rights clearance.

A4 – Patricia) This is where change in US errors and omissions insurance makes a difference – that protects them. The film and television makers code of conduct helped insurers engage and feel confident to provide that new type of insurance clause.

Q5) With US platforms, as someone in Norway, it can be hard to understand what you can and cannot access and use on, for instance, YouTube. Also, will the algorithmic filtering processes of platforms take into account that they deal with content in different territories?

A5 – Aram) I have spoken to Google counsel about that issue of filtering by law – there is no difference there… But monitoring…

A5 – Amy) I have written about legal fictions before… They are useful for thinking about what a “reasonable person” – and that can be vulnerable by jury and location so writing that into policies helps to shape that.

A5 – Patricia) The jurisdiction is where you create, not where the work is from…

Q6) There is an indecency case in France which they want to try in French court, but Facebook wants it tried in US court. What might the impact on copyright be?

A6 – Aram) A great question, but this type of jurisdictional law has been discussed for over 10 years without any clear conclusion.

A6 – Patricia) This is a European issue too – Germany has good exceptions and limitations, France has horrible exceptions and limitations. There is a real challenge for pan European law.

Q7) Did you look at all at the impact of advocacy groups who encouraged writing in/completion of replies on the DMCA? And was there any big difference between the farmers and car owners?

A7) There was a lot of discussion on the digital right to repair site, and that probably did have an impact. I did work on Net Neutrality before. In any of those cases I take out the boilerplate and see what people add directly – though there is a whole other paper to be done on boilerplate texts and how they shape responses and the terms of additional comments. It wasn’t that easy to distinguish between farmers and car owners, but it was interesting how individuals established credibility. Farmers talked about the value of fixing their own equipment, of being independent, of a history of ownership. Car mechanics, by contrast, established technical expertise.

Q8) As a follow up: farmers will have had a long debate over genetically modified seeds – and the right to tinker in different ways…

A8) I didn’t see that reflected in the comments, but there may well be a bigger issue around micromanagement of practices.

Q9) Olivia, I was wondering if you were considering not only the rhetorical arguments of users, but also the way the techniques and tactics they used are received on the other side… What are the effective tactics there, and where are the limits of the effectiveness of these layperson vernacular strategies?

A9) My goal was to see which frames of argument looked most effective. In the case of the John Deere DMCA case that wasn’t conclusive. It can be really hard to separate the NGO from the individual – especially when NGOs submit huge collections of individual responses. A case study I did on non-consensual pornography was more conclusive in terms of which strategies were effective. The discourses I look at don’t look like legal discourse, but I look at the tone and content people use. So, on revenge porn, the law doesn’t really reflect user practice, for instance.

Q10) For Amy, I was wondering… Is the problem that Fair Use isn’t translated… Or the law behind that?

A10 – Amy) I think Twitter in particular have found themselves in a weird middle space… Then the exceptions wouldn’t come up. But having it only in English is the odd piece. That policy seems to speak specifically to Americans… But you could argue they are trying to impose it (maybe that’s a bit too strong) on all English-speaking territories. On YouTube all of the policies are translated into the same languages, including Fair Use.

Q11) I’m fascinated in vernacular understanding and then the experts who are in the round tables, who specialise in these areas. How do you see vernacular discourse use in more closed/smaller settings?

A11 – Olivia) I haven’t been able to take this up, as so many of those spaces are opaque. But in the 2012 rulemaking there were some direct quotes from remixers. And there was a suggestion around DVD use that people should videotape the TV screen… and that seemed unreasonably onerous…

Chair) Do you foresee a next stage where you get to be in those rooms and do more on that?

A11 – Olivia) I’d love to do some ethnographic studies, to get more involved.

A11 – Patricia) I was in Washington for the DMCA hearings and those are some of the most fun things I go to. I know that the documentary filmmakers have complained about cost of participating… But a technician from the industry gave 30 minutes of evidence on the 40 technical steps to handle analogue film pieces of information… And to show that it’s not actually broadcast quality. It made them gasp. It was devastating and very visual information, and they cited it in their ruling… And similarly in John Deere case the car technicians made impact. By contrast a teacher came in to explain why copying material was important for teaching, but she didn’t have either people or evidence of what the difference is in the classroom.

Q12) I have an interesting case if anyone wants to look at it, around Wikipedia’s Fair Use issues around multimedia. Volunteers take pre-emptively being stricter as they don’t want lawyers to come in on that… And the Wikipedia policies there. There is also automation through bots to delete content without clear Fair Use exception.

A12 – Aram) I’ve seen Fair Use misappropriated on Wikipedia… Copyrighted images used at low resolution and claimed as Fair Use…

A12 – Patricia) Wikimania has all these people who don’t want to deal with copyright law at all! Wikimedia lawyers are in a really difficult position.

Intersections of Technology and Place (panel): Erika Polson, University of Denver, United States of America; Rowan Wilken, Swinburne Institute for Social Research, Australia; Germaine Halegoua, University of Kansas, United States of America; Bryce Renninger, Rutgers University, United States of America; Adrienne Russell, University of Denver, United States of America (Chair: Jessica Lingel)

Traces of our passage: Locative media and the capture of place data – Rowan Wilken

This is a small part of a book that I’m working on. I am looking at how technologies are geolocating us… in space, in time, but more so the ways that they reveal our complex socio-technical context through place. And I’m seeing this from an anthropological point of view of places as having particular…

José van Dijck, in her work on social media business models, talks about the use of “location intelligence” as part of the social media ecosystem and economic system.

I want to focus in particular on FourSquare… It has changed significantly since its repositioning in 2014, and those changes – in both its own app and the Swarm app – seek to generate real time and even predictive recommendations. They do this by combining social data (the social graph) and location data (the Places graph). In the social graph, people are nodes, with edges of proximity, co-location, etc. In the Places graph, places are nodes and the edges are menus, recommendations, etc. So they have these two graphs, and the engineers seek to understand: “What are the underlying properties and dynamics of these networks? How can we predict new connections? How do we measure influence?”. Their work now builds up this rich database of places and data around them.

And these changes have led to new repositioning… FourSquare now sells advertising through predictive analysis… A second service, called PinPoint, allows marketers to target users of FourSquare – and users beyond FourSquare. This is done through GPS locations, finding patterns and tracking shopping and dining routes…

In the last part of this talk I want to turn to Tim Ingold’s work. For Ingold, our perception of place is less about the bird’s eye view of maps, and more about the walked and experienced route, based on the course of moving about in it – of ambulatory knowing. This is perceptual wayfinding: less about co-ordinates, more about situating position in the context of moving, of what one knows about routing and moving.

So, my contention is that it is wayfinding or mapping, not map making or map use, that is primarily of use and interest to these social platforms going forward. Ingold talks about how new maps come from replacement and changes over time… I think that is no longer the case, as what is of interest to companies like Foursquare is the digital trace of our passage, not the map itself.

“We know that right now we are not funky”: Placemaking practices in smart cities – Germaine Halegoua, University of Kansas

I am looking at attempts to use underused urban spaces, based on interviews with planners, architects, developers, about how they were developing these spaces – often on reclaimed land or infill – and about what makes them special and unique.

Placemaking is almost always defined as a bottom-up process, often linked to home or making somewhere feel like home… But theories of placemaking are less often thought of as strategic – thinking of Kirkpatrick, or Le Corbusier – and the idea that these are spaces for dominant players: military, powerful people. So in these urban settings, strategic placemaking connects to powerful people, connected and valued around international players.

I wanted to look at the differences between the planning behind these spaces and smart cities versus the lived experiences and processes. Smart cities are about urban imaginaries: sustainable urbanism – everything is LEED certified!; technoscientific urbanism – data capture is built in, and data and technology are thought of as progressive, as solutions to our problems; and urban triumphalism (Brenner & Schmid 2015). These smart cities are purported to be visionary designs, arising from the modern needs of people… taking the best of global cities around the world, with named locations and designs coming in as fragments from other places. Digital media are used to show that this place works – as a place for ideas, a place to get things done… That they are campus-based communities, like Silicon Valley, a better place than before…

There is this statistic that 70% of all people live in cities, and growing… But cities are seen as dumb, problematic, in need of updating… They need order, and smart cities are seen as the solution. There is an ordered view of the city as a lab – a showroom and demonstration space, as well as a petri dish for transforming technology. And these are cities built of systems on top of systems – literally (Le Corbusier-like, but with a flowing, soft aesthetic) and bringing things together. So, in Songdo you see this range of services in the space. And in TechCity we see apps and connectedness within the home… Smart cities have centralised systems to monitor traffic, biosigns, climate, etc. But then there are the green spaces of sustainable urbanism, trying to get you to live and linger… So you have this odd mixture of not spending time in the streets, and these green spaces designed for lingering…

But these are quite cold spaces… Vacancies are extremely high. They are seen as artificial. My talk quote is from a developer who feels that the solution is to bring in some funk… To programme serendipity into their lives… The answer is always more technology…

So, a few themes here… There is the People Problem: attracting people to the place – it’s not “funky”; and placing people within the union of technology and physical design – there is a claim that tech puts man first and serves the needs of the end user… but there is also a sense of people as “bugs”. And I am producing all this data that isn’t about my experience of the city, but which shapes that experience.

Geo-social media and the quest for place on-the-go – Erika Polson, University of Denver

This is coming out of my latest book, a multi-site ethnographic project. In the recent work I have developed an idea of digital place making… And this has been about how location technology can be used to shape the space of mobile people.

Expatriation was previously a post-WWII experience, and a family affair… Often those assignments failed, sometimes because one partner (often female) couldn’t work. So, as corporations try to globalise, there is a move to send younger, single assignees to replace families – they are cheaper and easier to relocate, they are more used to the idea of a global professional life, and they are enthusiastic.

And we don’t just see people moving once; we see serial global workers… The international experience can be seen as appealing – “a global lifestyle is seen as attractive and exciting” (Anne Marie Fetcher 2009(?)) – but that may not reflect reality. There can be deep feelings of loneliness, the experience doesn’t match expectations, they miss out on family, and they lack social connections and possibilities to socialise. Margaret Malewski writes in Generation Expatriot (2005) about how there can be an increasing dependency on friends at home, and the need for these expatriates to get out and meet people…

So, my work is based on a range of meetup apps, from Grindr and Tinder, to MeetUp, InterNations and (less of my focus) Couch Surfing… Tools to build connections and find each other. I have studied use of apps in Paris, Bangalore and Singapore. So this image is of a cafe in Paris full of people – the first meetup that I went to and it was intimidating to walk into but immediately someone approached… And I started to think about Digital Place-making about two months into the Paris experience when a friend wanted to meet for dinner and I was at a MeetUp, and he was super floored by his discomfort with talking to a bar full of strangers in Paris – he’s a local guy, he speaks perfect English, he’s very sociable… On any other night he would have owned the space but he was thrown by these expats making the space their own, through Meetup, through their profiles, through discourse of “who we are” and pre-articulation of some of the expectations and norms.

This made me think about the idea of Place and the feelings of belonging and place attachment (Coulthard and Ledema 2009), about shared meanings of place. We’ve seen lots of work on online world and how to create that sense of place, of attachment, or shared meaning.

So, if everyone is able to drop in and feel part of a place… And if professionals can do this, who else can? So, I’m excited to hear the next paper on Grindr. But it’s interesting to think about who is out-of-place, of the quality of place and place relations. And the fact that even as these people maintain this positive narrative of working globally, but also a feeling of following a common template or script. And problems with place-on-the-go for social commitments, community building… Willingness to meet up again, to drop in rather than create anything.

Grindr – Bryce Renninger, Rutgers University, United States of America

I work on open government issues, and the site of my work is Grindr – a location-based, mainly male, mainly gay and bi casual dating space. Where I am starting from is the idea that Grindr is killing the gay bar (or the gayborhood, or the gay resort town) – a narrative found in the gay press, for instance in New York Magazine articles on the Pines neighbourhood of Fire Island. One quotes Ghaziani, author of There Goes the Gayborhood?, saying that having the app means they don’t need Boystown any more… I think this narrative comes from concerns about valuing or not valuing these gay towns, resorts and bars, and the willingness to defend those spaces. Bumgarner (2013) argues that the app does the same thing as the bar… But that assumes that the bar/place is only there to introduce people to each other for a narrow range of purposes…

And my way of thinking about this is to think about technologies in democratic terms… Sclove talks about design criteria for democratic technologies, mainly to do with local labour and contribution, but this can be overlaid on social factors as well. And I think there is a space for democratic deliberation around sex publics. Michael Warner responds to Andrew Sullivan by problematizing his idea that “normal” is the place for queer people to exist. There are also authors writing on design in public sex spaces as a way to improve health outcomes.

The founder of Grindr says it isn’t killing the gay bar, and indeed provides a platform for them to advertise on. And a quote here on how it is used shows the wide range of uses of Grindr (beyond the obvious). I don’t think Ghaziani’s writing talks enough about what the gayborhoods and LGBT spaces are, how they can be class and race exclusive, fitting into the gentrification of public spaces… And on that I recommend Christina Lagazzi’s book.

One of the things I want to do with this work is to think about how the narratives in which platforms play a part can be written and spoken about, in ways that allow challenges to popular discourses of technological disruption. The idea that technological disruption is exciting is prevalent, and we aren’t doing enough to challenge it. The AirBnB billboard campaign – a kind of “Fuck You” to the San Francisco authorities and the legal changes to limit their business – is a reminder that we can respond to disruption…

I’m out of time, but I think we need to think critically about the social roles of technology and how technological organisations figure into that… And to acknowledge ethnography and the press.

Defining space through activism (and journalism): the Paris climate summit – Adrienne Russell, University of Denver

I’ve been working with researchers around the world on the G8 and climate summits for around ten years, and the coverage around them. I’ve been looking at activists and how they spunk up the spaces where meetings take place…

But let me start with an image from the Daily Mail of Black Lives Matter protestors, commenting on protestors using mobile phones. It exemplifies the idea that being on your phone means that you are not fully present… If they are on their phones, they aren’t that serious. This fits a long-running type of protest coverage that suggests in-person protest is more effective and authentic than social media – although our literature shows that it is both approaches in combination that is most effective. And then there is the issue of official versus unofficial action. Activists at the 2015 Paris summit were especially reliant on online work, as protests were banned, public spaces were closed, and activists were placed under house arrest… So they had been preparing for years but their action was restricted.

So, one of the ways protestors took action was through tools like Climate Games, a real time game which let you see real time photography, but also highlight surveillance… It was non-violent but called police and law enforcement “team blue”, while lobbyists and greenwashers were “team grey”!

Probably many of you saw the posters across Paris mocking corporate ad campaigns – e.g. a VW ad saying “we are sorry we got caught”. So you saw these really interesting alternative narratives and interpretations. There was also a hostel called Place to B which became a de facto media centre for protestors, with interviews being given throughout the event. There was a hub of artists who raised issues faced in their own countries. And outside the city there was a venue that held a mock trial of Exxon vs the People, with prominent campaigners from across the globe – this was on the heels of revelations that Exxon had evidence of climate change twenty years back and ignored it. This mock trial made a real media event.

So all these events helped create an alternative narrative. And the crackdown on protest reflects how we are coming to understand this type of top-down event… and resistance, through media and counter-narratives, to mainstream media running predictable official lines.

Panel Q&A
Q1) I have a question, maybe a pushback to you Germaine… Or maybe not… Who are the “they” you are talking about… You talk about city planners… I admire the critique so I want to know who “they” are, and should we problematise that, especially in contemporary smart cities discourses…
A1 – Germaine) It’s CISCO, Seimans, IBM… Those with smart cities labs… Those are the “they”. And I’ve seen the networking of the expert – it is always the same people… The language is really specific and consistent. Everyone is using this term “solutions”… This is the language to talk about the problems… So “they” are transnational, often US based tech corporation with in-house smart cities labs.
Q1) But “they” are also in meetings across the world with lots of different stakeholders, including those people, but others are there. It looks like you are pulling from corporate discourses… Have you traced how that is translating into everyday city planners who host conferences and events they all meet at… And how that plays out and adopt it…
A1 – Germaine) The most I’ve gone with this is to CIOs and City Planners… But it’s a really interesting questions…
Q1) I think it would be interesting and a direction we need to take… How discourses played out and adopted.
Q2) So I was wanting to follow up that question by asking about the role of governments and funders. In the UK right now there is a big push from Government to engage in smart cities, and that offers local authorities a source of capital income that they are keen to take, but then they need providers to deliver that work and are turning to these private sector players…
A2) With cities I have looked at show no vacancy rates, or very low vacancy rates… Of the need to build more units because all are already sold. Some are dormitories for international schools… That lack of join up between ownership and real estate narrative really differs from lived experience. In Kansas they are retrofitting as a smart cities, and taking on that discourse of efficiencies and costs effectiveness…
Q3) How do narratives here fit versus what we used to have as the Cultural Cities narrative…. Who is pushing this? It’s not the same people from civil society perhaps?
A3 – Erika) When I was in Singapore I had this sense of an almost sterile environment. And I learned that the red light district was cleaned up, moved the transvestities and sex workers out… People thought it was too boring… And they started hiring women to dress as men dressed as women to liven it up…
Q4 – Germaine) I wanted to ask about the discourse around the gaybourhood and where they come from…
A4 – Bryce) I think there are particular stakeholders… So one of the articles I showed was about closure of one of the oldest gay bars in New York, and the idea that Grindr caused that, but someone pointed out in the comments that actually real estate prices is the issue. And there is also this change that came from Mayor Giuliani wanting Christopher Street to be more consistent with the rest of New York…
Q5) I was wondering how that location data and tracking data from Rowan’s paper connects with Smart Cities work…
A5 – Germaine) That idea of tracing is common, but the idea of relational space, whilst there, doesn’t really work as it isn’t made yet… There isn’t sufficient density of people to do that… They need the people for that data. In the social media layer it’s relatively invisible, it’s there… But there really is something connected there.
A5 – Rowan) The move to pinpoint technology at FourSquare, they may be interested in Smart Cities… But quite a lot of the critiques I’ve read is that its just about consumption… I’m tired of that… I think they are trying to do something more interesting, to get at the complexity of everyday life… In Melbourne there was a planned development called Docklands… There is nothing there on Foursquare…
A5 – Erika) I am surprised that they aren’t hiring people to be people…
A5 – Rowan) I was thinking about that William Gibson comment about street signs. One of the things about Docklands was that it had high technology and good connections but low population so it did become a centre for crime.
Q6) I work with low income/low socio-economic groups, and how are people ensuring that those communities are part of smart cities, or how their interests are voiced.
A6 – Germaine) In Kansas City Google wired neighbourhoods, but that also raised issues around neighbourhoods that were not reached… And that came from activists. Cable wasn’t fitted for poor and middle income communities, but data centres were also located in them. You also see small MESH and Line of Sight networks emerging as a countermeasure in some neighbourhoods. In that place it was activists and the press… But in Kansas City it is being picked up as a story.
A6 -Rowan) In my field Jordan Frick does great work on this area, particularly on issues of monolingualism and how that excludes communities.
A6 – Erika) Tim Cresswell does really interesting work in this space… As I’ve thought about place and whose place a particular space it, I’ve been thinking about activists and police in the US. Would be interesting to look at.
A6 – Adreinne) People who have Tor, who resist surveillance, are well off and tech savvy, almost exclusively…
PS-32: Power (chair: Lina Dencik)
Lina: We have another #allfemalepanel for you! On power. 
The Terms Of Service Of Online Protest – Stefania Milan, University of Amsterdam, The Netherlands.
This is part of a bigger project which is slowly approaching book stage, so I won’t sum everything up here but I will give an overview of the theoretical position.
So, one of our starting points is the materiality and broker role of semiotechnologies, and particularly the mediation of social media and the ways that materiality contributes here. I am a sociologist and I’m looking at change. I have been accused of being a techno-determinist… Yes, to an extent. I play with this. And I am working from the perspective that the algorithmically mediated environment of social media has the ability to create change.
I look at a micro level and meso level, looking at interactions between individuals and how that makes differences. Collective action is a social construct – the result of interactions between social actors (Melucci 1996) – not a huge surprise. Organisation is a communicative and expressive activity. And centrality of sense-making activities (i.e. how people make sense of what they do). Meaning construction is embedded here. That shouldn’t be a surprise either. Media tech and the internet are not just tools but both metaphors and enablers of a new configuration of collective action: cloud protesting. That’s a term I stick with – despite much criticism – as I like the contradiction that it captures… the direct, on the ground, individual, and the large, opaque, inaccessible.
So, features of “cloud protesting” is about the cloud as an “imagined online space” where resources are stored. In social movements there is something important there around shared resources. In this case resources are soft resources – information and meaning making resources. Resources are the “ingredients” of mobilisation. Cyberspace gives these soft resources an (immaterial) body.

The cloud is a metaphor for organisational forms… And I relate that back to organisational forms of the 1960s, and to later movements, and now the idea of the cloud protest.  The cloud is also an analogy for individualisation – many of the nodes are individuals, who reject pre-packaged non-negotiable identities and organisations. The cloud is a platform where the movement’s resources can be… But a cloud movement does not require commitment and can be quite hard to activate and mobilise.

Collective identity, in these spaces, has some particular aspects. The “cloud” is an enabler, and you can identify “we” and “them”. But social media spaces overly emphasise visibility over collective identity.

The consequences of the materiality of social media are seen in four mechanisms: centrality of performance; interpellation to fellows and opponents; expansion of the temporality of the protest; reproducibility of social action. Now much of that enables new forms of collective action… But there are both positive and negative aspects. Something I won’t mention here is surveillance and the consequences of that on collective action.

So, what’s the role of social media? Social media act as intermediaries, enabling speed in protest organisation and diffusion – shaping and constraining collective action too. The cloud is grounded on everyday technology; everyone has it right there in his/her pocket. The cloud has the power to deeply influence not only the nature of the protest but also the tactics. Social media enables the creation of a customisable narrative.

Hate Speech and Social Media Platforms – Eugenia Siapera, Paloma Viejo Otero, Dublin City University, Ireland

Eugenia: My narrative is also not hugely positive. We wanted to look at how social media platforms themselves understand, regulate and manage hate speech on their platforms. We did this through an analysis of terms of service. And we did in-depth interviews with key informants – Facebook, Twitter, and YouTube. These platforms are happy to talk to researchers but not to be quoted. We have permission from Facebook and Twitter. YouTube have told us to re-record interviews with lawyers and PR people present.

So, we had three analytical steps – looking at what constitutes hate speech means.

We found that there is no use of definitions of hate speech based on law. Instead they put in reporting mechanisms and use that to determine what is/is not hate speech.

Now, we spoke to people from Twitter and Facebook (indeed there are a number of staff members who move from one to another). The tactic at Facebook was to make rules about what will be taken down (and what won’t), hiring teams to work to apply them, and then helping ensure the rules are appropriate. Twitter took a similar approach. So, the definition largely comes from what users report as hate speech rather than from external definitions or understandings.

We had assumed that the content would be manually and algorithmically assessed, but actually reports are reviewed by real people. Facebook have four teams across the world. There are native speakers – to ensure that they understand context – and they prioritise self-harm and some other categories.

Platforms are reactively rather than proactively positioned. Take downs are not based on number of reports. Hate speech is considered in context – a compromising selfie of a young woman in the UK isn’t hate speech… Unless in India where that may impact on marriage (See Hot Girls of Mumbai – in that case they didn’t take down on that basis but did remove it directly with the ). And if in doubt they keep the content up.

Twitter talk about reluctance to share information with law enforcement, protective of users, protective of freedom of speech. They are not keen to remove someone, would prefer counter arguments. And there are also tensions created by different local regulations and the global operations of the platforms – tension is resolved by compromise (not the case for YouTube).

A Twitter employee talked about the challenges of meeting with representatives from government, where there is tension between legislation and commercial needs, and the need for consistent handling.

There is also a tension between the principled stance assumed by social media corporations that sends the user to block and protect themselves first – a focus on safety and security and personal responsibility. And they want users to feel happy and secure.

Some conclusions… Social media corporations are increasingly acquiring state-like powers. Users are conditioned to behave in ways conforming to social media corporations’ liberal ideology. Posts are “rewarded” by staying online but only if they conform to social media corporations’ version of what constitutes acceptable hate speech.

#YesAllWomen (have a collective story to tell): Feminist hashtags and the intersection of personal narratives, networked publics, and intimate citizenship – Jacqueline Ryan Vickery, University of North Texas, United States of America

The original idea here was to think about second wave feminism and the idea of sharing personal stories to make the personal political, and how that looks online. Working on Plummer’s work (2003) in this area. All was well… And then I got stuck down the rabbit hole of publics and public discourses that are created when people share personal stories in public spaces… So I have tried to map these aspects. Thinking about the goals of hashtags and who started them as well – not something non-academics tend to look at. I also will be talking about hashtags themselves.

So I tried to think about and mapping goals, political, affective aspects, and affordances and relationships around these. The affordances of hashtags include: Curational – immediacy, reciprocity and conversationality (Papacharissi 2015); they are Polysemic – plurality, open signifiers, diverse meanings (Fiske 1987); Memetic – replicable, ever-evolving, remix, spreadable cultural information (Knobel and Lankshear 2007); Duality in communities of practice – opposing forces that drive change and creativity, local and broader for instance (Wenger 1988); Articulated subjectivities – momentarily jumping in and out of hashtags without really engaging beyond brief usage.

And how can I understand political hashtags on Twitter and their impact? Are we just sharing amongst ourselves, or can we measure that? So I want to think about agenda setting and re-framing – the hashtags I am looking at speak to a public event, or speak back to a media event that is taking place another way. We see co-option by organisations etc. And we see (strategic) essentialism. Awareness/mobilisation. Amplification/silencing of (privileged/marginalised) narratives. So #YesAllWomen is adopted by many privileged white feminists but was started by a biracial Muslim woman. Indeed all of the hashtags I study were started by non-white women.

So, looking at #YesAllWomen: it came in response to a terrible shooting, where the shooter had written a diatribe about women denying him. The person who created that hashtag left Twitter for a while but has now returned. So we do see lots of tweets that use that hashtag, responding with relevant experiences and comments.  But it became problematic, too open… This memetic affordance – a controversial male monologist used it as a title for his show, it was used abusively and for trolling, and beauty brands were there too.

The #WhyIStayed hashtag was started by Beverley Gooden in response to commentary that a woman should have left her partner, and that media weren’t asking why that man had beaten and abused his partner. So people shared real stories… But also a pizza company used it – though they apologised and admitted not researching first. Some found the hashtag traumatic… But others shared resources for other women here…

So, I wanted to talk about how these spaces are creating these networked publics, and they do have power to deal with change. I also want to highlight that idea of openness, of lack of control, and the consequences of that openness. #YesAllWomen has lost its meaning to an extent, and is now a very white hashtag. But if we look at these and think of them with social theories we can think about what this means for future movements and publicness.

Internet Rules, Media Logics and Media Grammar: New Perspectives on the Relation between Technology, Organization and Power – Caja Thimm, University of Bonn, Germany

I’m going to report briefly on a long term project on Twitter funded by a range of agencies. There is also a book coming on Twitter and the European Election. So, where do we start… We have Twitter. And we have tweets in French – e.g. from Marine Le Pen – but we see Tweets in other languages too – emoticons, standard structures, but also visual storytelling – images from events.

We have politicians, witnesses, and we see other players, e.g. the police. So first of all we wanted a model for Tweets and how we can understand them. So we used the Functional Operator Model (Thimm et al 2014) – but that’s descriptive – great for organising data but not for analysing and understanding platforms.

So, we started with a conference on Media Logic, an old concept from the 1970s. Media Logic offers an approach to develop parameters for a better analysis of such new forms of “media”. It defines players, objectives and power. And how players interact and what do they do (e.g. how do others conquer a hashtag for instance). Consequently you can consider media logics that are to be considered as a network of parameters.

So, what are the parameters of Media Logics that we should understand?

  1. Media Logic and communication cultures. For instance how politicians and political parties take into account media logic of television – production routines, presentation formats (Schulz 2004)
  2. Media Logic and media institutions – institutional and technological modus operandi (Hjarvard 2014)
  3. Media Grammar – a concept drawn from analogy of language.

So, lets think about constituents of “Media Grammar”? Periscope came out of a need, a gap… So you have Surface Grammar – visible and accessible to the user (language, semiotic signs, sounds etc). Surface Grammar is (sometimes) open to the creativity of users. It guides use through media.

(Constitutive) Property Grammar is difference. They are constitutive for the medium itself, determines the rules the functional level of the surface power. Constitutes of algorithms (not exclusively). Not accessible but for the platform itself. And surface grammar and property grammar form a reflexive feedback loop.

We also see van Dijk and Poell (2013) talking about social media as powerful institutions, so the idea of connecting social media grammar here to understand that… This opens up the focus on the open and hidden properties of social media and its interplay with communicative practices. Social media are differentiated, segmented and diverse to such a degree that it seems necessary to focus in more to gain a better idea of how we understand them as technology and society…

Panel Q&A

Q1) A general question to start off. You presented a real range of methodologies, but I didn’t hear a lot about practices and what people actually do, and how that fits into your models.

A1 – Caja) We have a six year project, millions of tweets, and we are trying to find patterns of what they do, and who does what.  There are real differences in usage but we are still working on what those mean.

A1 – Jacqueline) I think that when you look at very large hashtags, even #blacklivesmatter, you do see community participation. But the tags I’m looking at are really personal, not “Political”, these are using hashtags as a momentary act in some way, but is not really a community of practice in a sustainable movements, but some are triggering bigger movements and engagement though…

A1 – Eugenia) We see hate speech being gamed… People put outrageous posts out there to see what will happen, if they will be taken down…

Q2) I’ve been trying to find an appropriate framework… The field is so multidisciplinary… For a study I did on native american activists. We saw interest groups – discursive groups – were loosely stitched together with #indigenous. I’m sort of using the phrase “translator” to capture this. I was wondering if you had any thoughts on how we navigate this…

A2 – Caja) It’s a good question… This conference is very varied, there are so many fields… Socio-linguistics has some interesting frameworks for accommodations in Twitter. No-one seems to have published on that.

A2 – Jacqueline) I think understanding the network, the echo chamber effects, mapping of that network and how the hashtag moves, might be the way in there…

Q2) That’s what we did, but that’s also a problem… But hashtag seems to have a transformative impact too…

Q3) I wonder if we say Social Media Logic, do we lose sight of the overarching issue…

A3 – Caja) I think that Media Logic is in really early stages… It was founded in the 1970s when media was so different. But there are real power symmetries… And I hope we find a real way to bridge the two.

Q4) Many of these arguments come down to how much we trust the idea of the structure in action. Eugenia talks about creating rules iteratively around the issue. Jacqueline talked about the contested rules of play… It’s not clear of who defines those terms in the end…

A4 – Eugenia) There are certain media logics in place now… But they are fast moving as social media move to monetise, to develop, to change. Twitter launches Periscope, Facebook then launches Facebook Live! The grammar keeps on moving, aimed towards the users… Everything keeps moving…

A4 – Caja) But that’s the model. The dynamics are at the core. I do believe that media grammar on the level of small nitpicks that are magic – like the hashtag which has transgressed the platform and even the written form. But it’s about how they work, and whether there are logics inscribed.

A4 – Stefania) There are, of course, attempts made by the platform to hide the logic, and to hide the dynamics of the logic… Even at a radical activist conference there are those who cannot imagine their activism without the platform – and that statement also comes from a belief that they understand the platform.

Q5) I study hate speech too… I came with my top five criticisms but you covered them all in your presentation! You talked about location (IP address) as a factor in hate speech, rather than jurisdiction.

A5 – Eugenia) I think they (nameless social platform) take this approach in the same way that they do for take down notices… But they only do that for France and Germany where hate speech law is very different.

A5 – Caja) There is a study that has been taking place about take downs and the impact of pressure, politics, and effect across platforms when dealing with issues in different countries.

A5 – Eugenia) Twitter has a relationship with NGOs. and have a priority to deal with their requests, sometimes automatically. But they give guidance on how to do that, but they are outsourcing that process to these users…

Q6) I was thinking about platform logics and business logics… And how the business models are part of these logics. And I was wondering if you could talk to some of the methodological issues there… And the issue of the growing powers of governments – for instance Benjamin Netanahu meeting Mark Zuckerberg and talking to him about taking down arabic journalists.

A6 – Eugenia) This is challenging… We want to research them and we want to critique them… But we don’t want to find ourselves blacklisted for doing this. Some of the people I spoke to are very sensitive about, for instance, Palestinian content and when they can take it down. Sometimes though platforms are keen to show they have the power to take down content…

Q7) For Eugenia, you had very good access to people at these platforms. Not surprised they are reluctant to be quoted… But that access is quite difficult in our experience – how did you do it.

A7) These people live in Dublin so you meet them at conferences, there are cross overs through shared interests. Once you get in it’s easier to meet and speak to them… Speaking is ok, quoting and identifying names in our work is different. But it’s not just in social media

Comment) These people really are restricted in who they can talk to… There are PR people at one platform… You ask for comparative roles and use that as a way in… You can start to sidle inside. But mainly it’s the PR people you can access… I’ve had some luck referring to role area at a given company, rather than by name.

Q8 – Stefania) I was wondering about our own roles, in this room, and the issue of agency and publics…

A8 – Jacqueline) I don’t think publics take agency away; in the communities I look at these women benefit from the publics, and from sharing… But actually what we understand as publics varies… So in some publics some talk about exclusion of, e.g., women or people of colour, but there are counter publics…

A8 – Caja) Like you were saying there are mini publics and they can be public, and extend out into media and coverage. I think we have to look beyond the idea of the bubble… It’s really fragmented and we shouldn’t overlook that…

And with that, the conference is finished. 

You can read the rest of my posts from this week here:

Thanks to all at AoIR for a really excellent week. I have much to think about, lots of contacts to follow up with, and lots of ideas for taking forward my own work, particularly our new YikYak project

Oct 072016
 

PS-15: Divides (Chair: Christoph Lutz)

The Empowered Refugee: The Smartphone as a Tool of Resistance on the Journey to Europe – Katja Kaufmann

For those of you from other continents: we had a great number of refugees coming to Europe last year, from Turkey, Syria, etc., travelling to Germany and Sweden – and Vienna, where I am from, was also a hub. Some of these refugees had smartphones and that was covered in the (right wing) press, criticising this group’s ownership of devices, but it was not clear how many had smartphones or how they were being used, and that’s what I wanted to look at.

So we undertook interviews with refugees to see if they used them, how they used them. We were researching empowerment by mobile phones, following Svensson and Wamala Larsson (2015) on the role of the mobile phone in transforming capabilities of users. Also with reference to N. Kabeer (1999), A. Sen (1999) etc. on meanings of empowerment in these contexts. Smith, Spence and Rashid (2011) describe mobiles and their networks altering users’ capability sets, and phones increasing access to flows of information (Castells 2012).

So, I wanted to identify how smartphones were empowering refugees through: gaining an advantage in knowledge by the experiences of other refugees; sensory information; cross-checking information; and capabilities to oppose the actions of others.

In terms of an advantage in knowledge refugees described gaining knowledge from previous refugees on reports, routes, maps, administrative processes, warnings, etc. This was through social networks and Facebook groups in particular. So, a male refugee (age 22) described which people smugglers cannot be trusted, and which can. And another (same age) felt that smart phones were essential to being able to get to Europe – because you find information, plan, check, etc.

So, there was retrospective knowledge here, but also engagement with others during their refugee experience and with those ahead on their journey. This was mainly in WhatsApp. So a male refugee (aged 24) described being in Macedonia and speaking to refugees in Serbia, finding out the situation. This was particularly important last year when approaches were changing and border access changed on an hour by hour basis.

In terms of Applying Sensory Abilities, this was particularly manifested in identifying own GPS position – whilst crossing the Aegean or woods. Finding the road with their GPS, or identifying routes and maps. They also used GPS to find other refugees – friends, family members… Using location based services was also very important as they could share data elsewhere – sending GPS location to family members in Sweden for instance.

In terms of Cross-checking information and actions, refugees were able to track routes whilst in the hands of smugglers. A male Syrian refugee (aged 30) checked information every day whilst with people smugglers, to make sure that they were being taken in the right direction – he wanted to head west. But it wasn’t just routes: it was also rumours, and cross-checking weather conditions before entering a boat. A female Syrian refugee downloaded an app to check conditions and ensure her smuggler was honest and her trip would be safer.

In terms of opposing actions of others, this was about being capable of opposing actions of others – orders of authorities, potential acts of (police) violence, risks, fraud attempts, etc. Also disobedience by knowledge – the Greek government gave orders about the borders, but smartphones allowed annotated map sharing that allowed orders to be disobeyed. And access to timely information – exchange rates for example – a refugee described negotiating the price of changing money down by Google searching for this. And opposition was also about a means to apply pressure – threatening with or publishing photos. A male refugee (aged 25) described holding up phones to threaten to document police violence, and that was impactful. Also some refugees took pictures of people smugglers as a form of personal protection and information exchange, particularly with publication of images as a threat held in case of mistreatment.

So, in summary the smartphones

Q&A

Q1) Did you have any examples of privacy concerns in your interviews, or was this a concern for later perhaps?

A1) Some mentioned this; some felt some apps and spaces are more scrutinised than others. There was concern that others may have been identified through Facebook – a feeling rather than proof. One said that they do not send their parents any pictures in case she was mistaken by the Syrian government as a fighter. But mostly privacy wasn’t an immediate concern, access to information was – and it was very successful.

Q2) I saw two women in the data here, were there gender differences?

A2) We tried to get more women but there were difficulties there. On the journey they were using smartphones in similar ways – but I did talk to them and they described differences in use before their journey and talked about picture taking and sharing, the hijab effect, etc.

Social media, participation, peer pressure, and the European refugee crisis: a force awakens? – Nils Gustafsson, Lund university, Sweden

My paper is about receiving/host nations. Sweden took in 160,000 refugees during the crisis in 2015. I wanted to look at this as it was a strange time to live in. A lot of people started coming in late summer and early autumn… Numbers were rising. At first response was quite enthusiastic and welcoming in host populations in Germany, Austria, Sweden. But as it became more difficult to cope with larger groups of people, there were changes and organising to address challenge.

And the organisation will remind you of Alexander (??) on the “logic of collective action” – where groups organise around shared ideas that can be joined, ideas, almost a brand, e.g. “refugees welcome”. And there were strange collaborations between government, NGOs, and then these ad hoc networks. But there was also a boom and bust aspect here… In Sweden there were statements about opening hearts, of not shutting borders… But people kept coming through autumn and winter… By December Denmark, Sweden, etc. did a 180 degree turn, closing borders. There were border controls between Denmark and Sweden for the first time in 60 years. And that shift had popular support. And I was intrigued about this. And this work is all part of a longer 3 year project on young people in Sweden and their political engagement – how they choose to engage, how they respond to each other. We draw on Bennett & Segerberg (2013), social participation, social psychology, and the notion of “latent participation” – where people are waiting to engage so just need asking to mobilise.

So, this is work in progress and I don’t know where it will go… But I’ll share what I have so far. And I tried to focus on recruitment – I am interested in when young people are recruited into action by their peers. I am interested in peer pressure here – friends encouraging behaviours, particularly important given that we develop values as young people that have lasting impacts. But also information sharing through young people’s networks…

So, as part of the larger project, we have a survey, so we added some specific questions about the refugee crisis to that. So we asked, “you remember the refugee crisis, did you discuss it with your friends?” – 93.5% had, and this was not surprising as it is a major issue. When we asked if they had discussed it on social media it was around 33.3% – much lower perhaps due to controversy of subject matter, but this number was also similar to those in the 16-25 year old age group.

We also asked whether they did “work” around the refugee crisis – volunteering or work for NGOs, traditional organisations. Around 13.8% had. We also asked about work with non-traditional organisations and 26% said that they had (and in the 16-25 age group, it was 29.6%), which seems high – but we have nothing to compare this to.

Colleagues and I looked at Facebook refugee groups in Sweden – those that were open – and I scraped these (n=67) and coded them as being either set up as groups by NGOs, churches, mosques and other traditional organisations, or as ad hoc networks… Looking across the autumn and winter of 2015 the posts to these groups looked consistent across traditional groups, but there was a major spike from the networks around the crisis.
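The coding-and-counting step described above could be sketched roughly as follows. This is a hypothetical illustration, not the study's actual pipeline: the group labels, weeks, and post counts are invented, and the "surge" threshold (network posts exceeding three times the traditional groups' volume in the same week) is an assumption made for the example.

```python
from collections import defaultdict

# (group_type, iso_week, n_posts) records as they might come out of a scrape
# of coded Facebook groups. All values here are illustrative only.
posts = [
    ("traditional", "2015-W36", 40), ("network", "2015-W36", 55),
    ("traditional", "2015-W37", 42), ("network", "2015-W37", 310),
    ("traditional", "2015-W38", 38), ("network", "2015-W38", 280),
]

# Aggregate posts per group type per week
weekly = defaultdict(lambda: defaultdict(int))
for group_type, week, n in posts:
    weekly[group_type][week] += n

# Flag weeks where ad hoc networks out-post traditional groups by 3x or more
weeks = sorted(weekly["network"])
surges = [w for w in weeks
          if weekly["network"][w] > 3 * weekly["traditional"][w]]
print(surges)  # → ['2015-W37', '2015-W38']
```

With made-up numbers like these, the consistent volume from traditional groups against the network spike mirrors the pattern described in the talk.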

We have also been conducting interviews in Malmo, with 16-19 and 19-25 year olds. They commented on media coverage, and the degree to which the media influences them, even with social media. Many commented on volunteering at the central station, receiving refugees. Some felt it was inspiring to share stories, but others talked about their peers doing it as part of peer pressure, and critical commenting about “bragging” in Facebook posts. Then as the mood changed, the young people talked about going to the central station being less inviting, seeing fewer Facebook posts… about feeling that “maybe it’s ok then”. One of our participants was from a refugee background and …

Q&A

Q1) I think you should focus on where interest drops off – there is a real lack of research there. But on the discussion question, I wasn’t surprised that only 30% discussed the crisis there really.

A1) I wasn’t too surprised either here as people tend to be happier to let others engage in the discussion, and to stand back from posting on social media themselves on these sorts of issues.

Q2) I am from Finland, and we also helped in the crisis, but I am intrigued at the degree of public turnaround as it hasn’t shifted like that in Finland.

A2) Yeah, I don’t know… The middleground changed. Maybe something Swedish about it… But also perhaps to do with the numbers…

Q2) I wonder… There was already a strong anti-immigrant movement from 2008, I wonder if it didn’t shift in the same way.

A2) Yes, I think that probably is fair, but I think how the Finnish media treated the crisis would also have played a role here too.

An interrupted history of digital divides – Bianca Christin Reisdorf, Whisnu Triwibowo, Michael Nelson, William Dutton, Michigan State University, United States of America

I am going to switch gears a bit with some more theoretical work. We have been researching internet use and how it changes over time – from a period where there was very little knowledge of or use of the internet to the present day. And I’ll give some background then talk about survey data – but that is an issue in itself… I’ll be talking about quantitative survey data as it’s hard to find systematic collection of qualitative research instruments that I could use in my work.

So we have been asking about internet use for over 20 years… And right now I have data from Michigan, the UK, and the US… I have also just received further data from South Africa (this week!).

When we think about Digital Inequality, the idea of the digital divide emerged in the late 1990s – there was government interest, data collection, academic work. This was largely about the haves vs. have-nots; on vs. off. And we saw a move to digital inequalities (Hargittai) in the early 2000s… Then it went quiet, aside from work from Neil Selwyn in the UK, from Helsper and Livingstone… But the discussion has moved onto skills…

Policy wise we have also seen a shift… Lots of policies around digital divide up to around 2002, then a real pause as there was an assumption that problems would be solved. Then, in the US at least, Obama refocused on that divide from 2009.

So, I have been looking at data from questionnaires from Michigan State of the State Survey (1997-2016); questionnaires from digital future survey in the US (2000, 2002, 2003, 2014); questionnaires from the Oxford Internet Surveys in the UK (2003, 2005, 2007, 2009, 2013); Hungarian World Internet Project (2009); South African World Internet Project (2012).

Across these data sets we have looked at the questionnaires and the frequency of use of particular questions on use, on lack of use, etc. When internet penetration was lower there was a lot of explanation in the questions, but we have shifted away from that, assuming that people understand what the internet is… And we’ve never returned to it. We’ve shifted to questions about devices, but we don’t ask much beyond that. We used to ask about the number of hours online… But that increasingly made less sense – the answer is essentially “all day” – so we do that less, shifting instead to how frequently people go online.

Now the State of the State Survey in Michigan is different from the other data here – all the others are World Internet Project surveys, but SOSS is not designed by internet researchers and doesn’t necessarily look at the same areas. In Hungary (2009 data) similar patterns of question use emerged, but with a particular focus on mobile use. The South African questionnaire was very different – they ask how many people in the household are using the internet – we ask about the individual but not others in the house, or others coming to the house. South Africa had around 40% internet penetration (at least in 2012, when we have data), which is a very different context. There they ask about lack of access and use, and the reasons for that. We ask about use/non-use rather than reasons.

So there is this gap in the literature; there is a need for both quantitative and qualitative methods here. We also need to consider other factors, particularly technology itself being a moving target – in South Africa they ask about internet use and also Facebook, as people don’t always identify Facebook as internet use. Indeed so many devices are connected – maybe we need…

Q&A

Q1) I have a question about the questionnaires – do any ask about costs? I was in Peru and lack of connections, but phones often offer free WhatsApp and free Pokemon Go.

A1) Only the South African one asks that… It’s a great question though…

Q2) You can get Pew questionnaires and also Ofcom questionnaires from their website. And you can contact the World Internet Project directly… And there is an issue with people not knowing if they are on the internet or not – increasingly you ask a battery of questions… and then filtering on that – e.g. if you use email you get counted as an internet user.

A2) I have done that… Trying to locate those questionnaires isn’t always proving that straightforward.
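Note: the “battery of questions” approach mentioned in Q2 – deriving internet-user status from concrete activity items rather than asking directly – can be sketched roughly as follows. The variable names and activity list are invented for illustration, not taken from any of the surveys discussed:

```python
# Hypothetical battery of activity items; a real survey would have many more.
ACTIVITY_ITEMS = ["uses_email", "uses_social_media",
                  "watches_video_online", "banks_online"]

def is_internet_user(respondent: dict) -> bool:
    """A respondent counts as an internet user if they report doing
    any of the listed online activities - e.g. someone who says they
    use email is counted even if they'd say 'no' to 'do you use the
    internet?'."""
    return any(respondent.get(item, False) for item in ACTIVITY_ITEMS)

respondents = [
    {"uses_email": True},                       # counted via email use
    {"uses_social_media": True},                # counted via social media
    {},                                         # nothing reported -> non-user
]
print([is_internet_user(r) for r in respondents])  # [True, True, False]
```

The filtering step is the point: classification happens in analysis, not in the respondent’s self-identification.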

Q3) In terms of instruments – maybe there is a need to develop more nuanced questionnaires there.

A3) Yes.

Levelling the socio-economic playing field with the Internet? A case study in how (not) to help disadvantaged young people thrive online – Huw Crighton Davies, Rebecca Eynon, Sarah Wilkin, Oxford Internet Institute, United Kingdom

This is about a scheme called the “Home Access Scheme” and I’m going to talk about why we could not make it work. The origins here were a city council’s initiative – they came to us. DCLG (2016) data showed 20-30% of the population were below the poverty line, and we knew around 7-8% locally had no internet access (known through survey responses). And the players here were researchers, local government, schools, and also an (unnamed) ISP.

The aim of the scheme was to raise attainment in GCSEs, to build confidence, and to improve employability skills. The schools had a responsibility to identify students in need, to procure laptops, memory sticks and software, and to provide regular, structured in-school pastoral skills and opportunities – not just in computing class. The ISP was to provide set-up help, technical support, and free internet connections for 2 years.

This scheme has been running two years, so where are we? Well, we’ve had successes: preventing arguments and conflict; helping with schoolwork and job hunting; saving money; and improving access to essential services – this is partly because cost-cutting local authorities have moved transactions online, like bidding for council housing, repeat prescriptions, etc. There was also some intergenerational bonding as families shared interests. Families commented on the success and opportunities.

We did 25 interviews, 84 1-1 sessions in schools, 3 group workshops, 17 ethnographic visits, plus many more informal meet-ups. So we have lots of data about these families, their context, their lives. But…

Only three families had consistent internet access throughout. Only 8 families are still in the programme. It fell apart… Why?

Some schools were so nervous about use that they filtered and locked down their laptops. One school used the scheme money to buy teacher laptops, gave students old laptops instead. Technical support was low priority. Lead teachers left/delegated/didn’t answer emails. Very narrow use of digital technology. No in-house skills training. Very little cross-curriculum integration. Lack of ICT classes after year 11. And no matter how often we asked about it we got no data from schools.

The ISP didn’t set up connections, didn’t support the families, didn’t do what they had agreed to. They tried to bill families, and one was threatened with debt collectors!

So, how did this happen? Well, maybe these are neoliberal currents? I use that term cautiously, but… we can offer an emergent definition of neoliberalism from this experience.

There is a neoliberalist disfigurement of schools: teachers under intense pressure to meet auditable targets; the scheme’s students subject to a range of targets used to problematise a school’s performance – exclusions, attendance, C grades; the scheme shuffled down the priorities; ICT not deemed academic enough under Govian school changes; and learning stripped back to a narrow range of subjects and focused towards these targets.

There were effects of neoliberalism on the city council: targets and a “more for less” culture; the scheme disincentivised; erosion of the authority of democratic institutions – schools beyond authority controls, and high turnover of staff.

There were neoliberalist practices at the ISP: commodifying philanthropy; they could not help treating families as customers. And there were dysfunctional mini-markets: they subcontracted delivery and set-up; they subcontracted support; they charged for support and charged for internet even when they couldn’t help…

Q&A

Q1) Is the problem digital divides, or wider divides? Any attempt to overcome class separation and marketisation is working against the attempts to fix this issue here.

A1) We have a paper coming and yes, there were big issues here for policy and a need to be holistic… We found parents unable to attend parents’ evenings due to shift work, and nothing in the school processes to accommodate this. And the measure of poverty for children is “free school meals”, but many do not want to apply as it is stigmatising, and many don’t qualify even on very low incomes… That leads to children and parents being labelled disengaged or problematic.

Q2) Isn’t the whole basis of this work neoliberal though?

A2) I agree. We didn’t set the terms of this work…

Panel Q&A

Q1/comment) RSE and access

A1 – Huw) Other companies the same

Q2) Did the refugees in your work, Katja, have access to SIM cards and internet?

A2 – Katja) It was a challenge. Most downloaded maps and resources… And actually they preferred Apple to Android, as the GPS is more accurate without an internet connection – that makes a big difference in the Aegean Sea, for instance. So refugees shared SIM cards, and used power banks for energy.

Q3) I had a sort of reflection on Nils’ paper and where to take this next… It occurs to me that you have quite a few different arguments… You have this survey data, the interviews, and then a different sort of participation from the Facebook groups… I have students in Berlin looking at the boom and bust – and I wondered about connecting that Facebook group work up to that type of work – it seems quite separate from the youth participation section.

A3 – Nils) I wasn’t planning on talking about that, but yes.

Comment) I think there is a really interesting aspect of these campaigns and how they become part of social media and the everyday life online… The way they are becoming engaged… And the latent participation there…

Q3) I can totally see that, though it is challenging to cover in one article.

Q4) I think it might be interesting to talk to the people who created the surveys to understand motivations…

A4) Absolutely, that is one of the reasons I am so keen to hear about other surveys.

Q5) You said you were struggling to find qualitative data?

A5 – Katja) You can usually download quantitative instruments, but that is harder for qualitative instruments including questions and interview guides…

XP-02: Carnival of Privacy and Security Delights – Jason Edward Archer, Nathanael Edward Bassett, Peter Snyder, University of Illinois at Chicago, United States of America

Note: I’m not quite sure how to write up this session… So these are some notes from the more presentation parts of the session and I’ll add further thoughts and notes later… 

Nathanael: We have prepared three interventions for you today and this is going to be a kind of gallery exploration space. And we are experimenting with wearables…

Fitbits on a Hamster Wheel and Other Oddities, oh my!

Nathanael: I have been wearing a Fitbit this week… but these aren’t new ideas… People used to have beads for counting; there are self-training books for wrestling published in the 16th century. Pedometers were conceived of in Leonardo da Vinci’s drawings… These devices are old, and tie into ideas of posture, and mastering control of our physical selves… And we see the pedometer being connected with regimes of fitness – like the Manpo-Meter (“10,000 steps meter”) (1965). This narrative takes us to the 1970s running boom and the idea of recreational discipline. And now the world of smart devices… Wearables are taking us to biometric analysis as a mental model (Neff – preprint).

So, these are ways to track, but what happens with insurance companies, with those monitoring you? At Oral Roberts University students have to track their fitness as part of their role as students. What does that mean? I encourage you all to check out “unfitbit” – interventions to undermine tracking. Or we could, rather than going to the gym with a Fitbit, give it to Terry Crews – he’s going anyway! – and he could earn money… Are fitness slaves in our future?

So, use my FitBit – it’s on my account

And so, that’s the first part of our session…

?: Now, you might like to hear about the challenges of running this session… We had to think about how to make things uncomfortable… But then how do you get people to take part? We considered a man-in-the-middle site that was ethically far too problematic! And no one was comfortable participating in that way… Certainly raising the privacy and security issue… But as we talk of data as a proxy for us… As internet researchers a lot of us are more aware of privacy and security issues than the general population, particularly around metadata. But this would have been one day… I was curious whether people might have faked their data for that one-day capture…

Nathanael: And the other issue is why we are so much more comfortable sharing information with Fitbit, and other sharing platforms – faceless entities – versus people you meet at a conference… And we didn’t think about the gender aspect here… We are three white guys, and we are less sensitive to that data being publicised rather than privatised. Men talk about how much they can bench press… but personal metadata can make you feel under scrutiny.

Me: I wouldn’t want to share my data and personal data collection tools…

Borrowing laptop vs borrowing phone…

?: In the US there have been a few cases where Fitbits have been submitted as evidence in court… But that data is easier to fake… In one case a woman claimed to have been raped, and they used her Fitbit to suggest that…

Nathanial: You talked about not being comfortable handing someone your phone… It is really this blackbox… Is it a wearable? It has all that stuff, but you wear it on your body…

??: On cellphones there is FOMO – Fear Of Missing Out… What you might miss…

Me: Device as security

Comment: Ableism is embedded in devices… I am a cancer survivor and I first used step counts as part of a research project on chemotherapy and activity… When I see a low-step day on my phone now… I can feel the stress those triggers put on someone going through that…

Nathanael: Fitbits vibrate when you have/have not done a number of steps… Trying to put you in an ideological state apparatus…

Jh: That nudge… can be good for the able-bodied… But if you can’t move, that is a very different experience… How does that add to their stress load?

Interperspectival Goggles

Again looking at the condition of virtuality – Hayles 2006(?)

Vision is constructed… Think of higher resolution… From small phone to big phone… From lower-resolution to higher-resolution TV… We have spectacles, quizzing glasses and monocles… And there is the strange idea of training ourselves to see better (William Horatio Bates, 1920s)… And emotional state interfering with how you do something… Then we have optometry and x-rays as a concept of seeing what could not be seen before… And you have special goggles and helmets… Like the idea of the Image Accumulator in Videodrome (1983), or the idea of the memory recorder and playback device in Brainstorm (1983). We see embodied work stations – the da Vinci surgical robot (2000) – divorcing what is seen from what is in front of them…

There are also playful ideas: binocular football; the Decelerator Helmet; the Meta-perceptional Helmet (Cleary and Donnelly 2014); and most recently Google Glass – what is there, plus extra layers… Finally we have Oculus Rift and VR devices – seeing something else entirely… We can divorce what we see from what we are perceiving… We want to swap people’s vision…

1. Raise awareness about the complexity of electronic privacy and security issues.

2. Identify potential gaps in the research agenda through playful interventions, subversions, and moments of the absurd.

3. Be weird, have fun!

Mathias

“Cell phones are tracking devices that make phone calls” (Appelbaum, 2012)

I am interested in IMSI catchers, which masquerade as wireless base stations, prompting phones to communicate with them. They are used by police, law enforcement, etc. They can be small and handheld, or they can be drone-mounted. And they can track people, people in crowds, etc. There is always a different way to use one – you can scan for people in crowds; if you know someone is there, you can scan for them in a different way. So these tools are simple and disruptive and problematic, especially in activism contexts.

But these tools are also capable of capturing transmitted content, and all the data in your phone. These devices are problematic and have raised all sorts of issues about their use, and who uses them and how. I’d like to think of this a different way… Is there a right to protest? And to protest anonymously? We do have anti-masking laws in some places – that suggests no right to anonymous protest. But that’s still a different privacy right – covering my face is different from participating at all…

Protests are generally about a minority persuading a majority about some sort of change. There is no legal right to protest anonymously, but there are lots of protected anonymous spaces. In the 19th century there was a big debate on whether or not the voting ballot should be anonymous – democracy is really the C19th killer app. There is a lovely quote here about “The Australian system” by Bernheim (1889) and the introduction of anonymous voting. It wasn’t brought in to preserve privacy. At the time politicians bought votes – buying a keg of beer or whatever – and anonymity was there to stop that, not to preserve individual privacy. But Jill Lepore (2008) writes about how our forebears considered casting a “secret ballot” to be “cowardly, underhanded and despicable”.

So, back to these devices… There can be an idea that “if you have nothing to hide, you have nothing to fear”, but many of us understand that it is not true. And this type of device silences uncomfortable discourse.

Mathias Klang, University of Massachusetts Boston

Q1) How do you think these devices fit into moves to allow law enforcement to block/“switch off” the cameras on protestors’/individuals’ phones?

A1) Well, people can resist these surveillance efforts, and you will see subversive moves. People can cover cameras, conceal devices, etc. But with these devices it may be that the phone becomes unusable, requiring protestors to disable phones or leave them at home… And phones are really popular and well used for coordinating protests…

Bryce Newell, Tilburg Institute for Law, Technology, and Society

I have been working on research in Washington State, working with law enforcement on license plate recognition systems and public disclosure law, and looking at what you can tell from the data. So, here is a map of license plate data from Seattle, showing vehicle activity. In Minneapolis, similar data being released led to mapping of the governor’s registered vehicles…

The second area is about law enforcement and body cameras. Several years ago peaceful protestors at UC Davis were pepper-sprayed. Even in the cropped version of that image you can see a vast number of phones out, recording the event. And indeed there are a range of police surveillance apps that allow you to capture police encounters without that being visible on the phone, including ACLU Police Tape, Stop and Frisk Watch, OpenWatch, and CopRecorder2. Some of these apps upload the recording to the cloud right away to ensure capture. And there have certainly been a number of incidents, from Rodney King to Oscar Grant (BART), Eric Garner, Ian Tomlinson, Michael Brown. Of these only the Michael Brown case featured law enforcement with bodycams. There has been a huge call for more cameras on law enforcement… During a training meeting some officers told me “Where’s the direct-to-YouTube button?” and “If citizens can do it, why can’t we also benefit from the ability to record in public places?”. There is a real awareness of control and of citizen videos. I also heard a lot about there being “a witch hunt about to begin…”.

So, I’m in the middle of focused coding on police attitudes to body cameras. Police are concerned that citizen video is edited, out of context, distorting. And they are concerned that it doesn’t show wider contexts – when recording starts, perspective, the wider scene, the fact that provocation usually occurs before filming. But there is also the issue of control, immediate physical interaction, framing, disclosure, visibility – around their own safety, around how visible they are on the web. They don’t know why it is being recorded, where it will go…

There have been a number of regulatory responses to this challenge: (1) restrict collection – not many, usually budgetary and rarely on privacy grounds; (2) restrict access – going back to the Minneapolis case, within two weeks of the map of the governor’s vehicles being published in the paper they had an exemption to public disclosure law, which is now permanent for this sort of data. In the North Carolina protests recently the call was “release the tapes” – and they released only some – then the cry was “release all the tapes”… But on 1st October the law changed to again restrict access to this type of data.

But different states provide different access. Some provide access. In Oakland, California, data was released on how many license plates had been scanned. In Seattle, because the data from many scans of one licence plate over 90 days is quite specific, you can almost identify the householder. But granularity varies.
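Note: the re-identification risk here follows from simple aggregation – repeated scans of the same plate cluster at one location overnight. A hypothetical sketch (invented data, invented threshold; not the actual Seattle analysis):

```python
from collections import Counter

# Each scan is (plate, location, hour_of_day). Over ~90 days of scans,
# the modal overnight location of a plate is a strong guess at the
# registered owner's home address - which is why granular scan data
# is sensitive even without names attached.

def likely_home(scans, plate, night_hours=range(0, 6)):
    """Return the most common overnight location for a given plate,
    or None if the plate was never scanned overnight."""
    overnight = [loc for p, loc, hour in scans
                 if p == plate and hour in night_hours]
    return Counter(overnight).most_common(1)[0][0] if overnight else None

scans = [
    ("ABC123", "5th & Pine", 2), ("ABC123", "5th & Pine", 3),
    ("ABC123", "Downtown", 14),  ("ABC123", "5th & Pine", 1),
]
print(likely_home(scans, "ABC123"))  # 5th & Pine
```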

Now, we do see body camera footage of sobriety tests, foot chases, and a half-hour-long interview with a prostitute that discloses a lot of data. Washington shares a lot of video on YouTube. We see police in Rotterdam, Netherlands, doing this too.

But one patrol officer told me that he would never give his information to an officer with a camera. Another noted that police choose when to start recording, with little guidance on when and how to do this.

And we see a “collateral visibility” issue for police around these technologies.

Q&A

Q1) Is there any process where police have to disclose that they are filming with a body cam?

A1) Interesting question… Initially they didn’t know. We used to have a two-party consent process – as for tapings – to ensure consent/implied consent. But the State Attorney General described this as outside of that privacy regulation, saying that a conversation with a police officer is a public conversation. But police are starting to have policies that officers should disclose that they have cameras – partly as they hope it may reduce violence towards police.

Data Privacy in commercial users of municipal location data – Meg Young, University of Washington

My work looks at how companies use Seattle’s location data. I wanted to look at how data privacy is enacted by Seattle municipal government. And I am drawing on the work of Annemarie Mol and John Law (2004), ethnographers working on health, which focuses on lived experience. My data draws on ethnographic work as well as focus groups and interviews with municipal government and local civic technology communities. I really wanted to present the role of commercial actors in data privacy in city government.

We know that cities collect location data to provide services, and share it with third parties to do so. In Washington we have a state freedom of information (FOI) law, which states “The people of this state do not yield their sovereignty to the government…”, making data requestable.

In Seattle, traffic data is collected by a company called Acyclica. The city is growing and the infrastructure is struggling, so they are gathering data to deal with this, to shape traffic signals. This is a large-scale, longitudinal data collection process. Acyclica do it with wi-fi sensors that sniff MAC addresses, the location traces then being sent to Acyclica (with MAC addresses salted). The data is aggregated and sent to the city – the city doesn’t see the detailed, creepy tracking, but the company does. And this is where the FOI law comes in. The raw data sits on the company side. If the raw data were a public record, it would be requestable. The company becomes a shield for collecting sensitive data – it is proprietising it.
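Note: “salting” here means hashing each MAC address together with a secret value, so the raw hardware identifier is replaced while the same device still matches across sensors (which is what makes travel-time measurement work). A minimal sketch of the idea – not Acyclica’s actual scheme, and the salt value is invented:

```python
import hashlib
import hmac

SECRET_SALT = b"example-secret"  # invented; held by the vendor, not the city

def pseudonymise(mac: str) -> str:
    """Replace a raw MAC address with a keyed (salted) hash.
    Whoever holds the salt can still link sightings of the same
    device across sensors - which is exactly the privacy point
    made in the talk."""
    digest = hmac.new(SECRET_SALT, mac.lower().encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

a = pseudonymise("AA:BB:CC:DD:EE:FF")
b = pseudonymise("aa:bb:cc:dd:ee:ff")
print(a == b)  # True: same device links across sensors, despite case
```

The design point stands either way: salting hides the identifier from the data recipient, but not from the party holding the salt.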

So you can collect data, and have service needs met, but without it becoming public to you and me. But analysing the contract, the terms do not preclude the resale of the data – though a Seattle Dept. of Transportation (DOT) worker notes that right now people trust companies more than government. Now, I did ask about this data collection – which hasn’t been approved elsewhere – and was told that having wifi switched on in public makes you open to data collection, as you are in public space.

My next example is the data from parking meters/pay stations. This shows only the start and end times, no credit card numbers, etc. The DOT is happy to make this available via public records requests. But you can track each individual, and they are using this data to model parking needs.

The third example is the Open Data Portal for Seattle. They pay Socrata to host that public-facing data portal. Socrata also sell access to cleaned, aggregated data to companies through a separate API called the Open Data Network. The Seattle Open Data Manager didn’t see this situation as different from any other reseller. But there is little thought about third-party data users – they rarely come up in conversations – who may combine this data with other data sets for analysis.

So, in summary, municipal government data is as much by and for commercial actors as it is for the public. Proprietary protections around data are a strategy for protecting sensitive data. And government transfers data to third parties…

Q&A

Q1) Seattle has a wifi-for-all programme…

A1) Promisingly this data isn’t being held side by side… But the routers that we connect to collect so much data… Seeing an Oracle database of the websites folks visit…

Q2) What are you policy recommendations based on your work?

A2) We would recommend licensing data with some restrictions on use, so that if the data is used inappropriately their use could be cut off…

Q2) So activists could be blocked by that recommendation?

A2) That is a tension… Activists are keen for no licensing here for that reason… It is challenging, particularly when data brokers can do problematic profiling…

Q2) But that restricts activists from questioning the state as well.

Response – Sandra Braman

I think that these presentations highlight many of the issues that raise questions about values we hold as key as humans. And I want to start from an aggressive position, thinking about how and why you might effectively be an activist in this sort of environment. And I want to say that any concerns about algorithmically driven processes should be evaluated in the same way as we would social processes. So, for instance, we need to think about how the press and media interrogate data and politicians…

? “Decoding the Social” (coming soon) is looking at social data and the analysis of social data in the context of big data. She argues that social life is too big and complex to be predictable from data. Everything that people who use big data “do” to understand patterns, activists can do too. We can be just as sophisticated as corporations.

The two things I am thinking about are how to mask the local, and how to use the local… When I talk of masking the local I look back to work I did several years back on local broadcasting. There is a mammoth literature on TV as locale, and on how production is separate and misrepresenting – the assumptions versus the actual information provided, versus actual decision-making. My perception of social activism is that there is some brilliant activity taking place – brilliance at moments, specific apps often. And I think that if you look at the essays Julian Assange wrote before he founded WikiLeaks, particularly on weak links and how those work… He uses sophisticated social theory in a political manner.

But anonymity is practically impossible… What can we learn from local broadcast? You can use phones in organised ways – there was training in phone cameras for the Battle of Seattle, for instance. You can fight with indistinguishable actions – everyone doing the same things. Encryption is cat and mouse… Often we have activists presenting themselves as mice, although we did see discussed at the plenary an app to alert you to protests and risk. And I have written before on tactical memory.

In terms of using the local… If you know you will be sensed all the time, there are things you can do as an activist to use that. It is useful to think about how we can conceive of ourselves, as activists, as part of the network. And I was inspired by US libel laws – if a journalist has transmission/recording devices but is a neutral observer, they are not “repeating” the libel and can share that footage. That goes back to 1970s law, but it can be useful to us.

We are at risk of being censored, but that means that you have choices about what to share, being deliberate in giving signals. We have witnessing, which can be taken as a serious commitment. That can happen with people with phones; you can train witnessing. There are many moments where leakage can be an opportunity – maybe not with the volume or content of Snowden, but we can do that. There are also ways to learn and shape learning. But we can also be routers, and be critically engaged in that – what we share, the acceptable error rate. National security agencies are concerned about where in the stream they should target misinformation – activists can adopt that too. The server functions – see my strategic memory piece. We certainly have community-based wifi and MESH networks, and that is useful politically and socially. We have responsibilities to build the public that is appropriate, and the networking infrastructure that enables those freedoms. We can use more computational power to resolve issues. Information can be an enabler as well as influencing your own activism. Thank you to Anne and her group in Amsterdam for triggering thinking here – but we should be engaging critically with big data. If you can’t make decisions in some way, there’s no point to doing it.

I think there needs to be more robustness in managing and working with data. If you go far, then you need a very high level of methodological trust. Information has to stand up in court, to respect activist contributions to data. Use as your standard what would be acceptable in court. And in a Panspectrum (not Panopticon) environment, where data is collected all the time, you absolutely have to ask the right questions.

Panel Q&A

Q1) I was really interested in that idea of witnessing as being part of being a modern digital citizen… Is there more on protections, or on that, which you can say?

A1 – Sandra) We’ve seen all protections for whistle-blowing in government disappear under Bush (II)… We still have protections for private sector whistle-blowers. But there would be an interesting research project in there…

Q2) I wondered about that idea of cat and mouse use of technology… Isn’t that potentially making access a matter of securitisation…?

A2) I don’t think that “securitisation” makes you a military force… One thing I forgot to say was about network relations… If a system is interacting with another system – the principle of requisite variety – it has to be as complex as the system it is dealing with. You have to be at least as sophisticated as the other guy…

Q3) For Bryce and Meg: there are so many tensions over when data should be public and when it should be private… And police desires to show the good things they do. Also Meg, this idea of privatising data to ensure the privacy of data – it’s problematic for us to collect data, but now a third party can do it.

A3 – Bryce) One thing I didn’t explain well enough is that video online comes from police, and from activists – it depends on the video. Some videos are accessed via public records requests and published to a YouTube channel – in fact in Washington you can make requests for free and you can do it anonymously. The police department also posts public video itself. When they did a pilot in 2014 they held a hackathon to consider how to deal with redaction issues… detect faces, blur them, etc. And proactive posting of – only some – video. There is a narrative of sharing everything, but that isn’t the case. The rhetoric has been about being open, driven by privacy rights and the new police chief. A lot of it was administrative cost concerns… In the hackathon they asked whether posting video in blurred form would do away with blanket requests and focus the specific ones. At that time they dealt with all requests for email too. They were receiving so many requests, and under state law they had to give up all the data, for free. But state law varies – in Charlotte they gave up less data. In some states there is a different approach, with press conferences and narratives around the footage as they release parts of videos…

A3 – Meg) The city has worked on how to release data… They have a privacy screening process. They try to provide data in a way that is embedded. But they still have a hard-core central value that any public record is requestable. Collection limitation is an important and essential part of what cities should be doing… In a way, private companies collecting data results in large data sets that will end up insecure… Going back to what Bryce was saying, the bodycam initiative was really controversial… There was so much footage, and it was unclear what should be public and when… And the faultlines have been pretty deep. We have the Coalition for Open Government advocating for full access, and the ACLU worried that these become surveillance cameras… This was really contentious… They passed a version of a compromise, but the bottom line is that the PRA is still a core value for the state.

A3 – Bryce) Much of the ACLU, nationally certainly, supported bodycams, but individuals and local ACLU chapters change and vary… They were very pro, then backed off, then there was local variance… It’s a very different picture, hence that variance.

Q4) For Matthias: you talked about anti-masking laws. Are there cases where people have been brought in for jamming signals under that law?

A4 – Matthias) Right now, for the American cases, I am searching on keywords – manufacturers of devices, the ways data is discussed. I haven’t seen cases like that, but perhaps it is too new… I am a Swedish lawyer, and in Sweden that jamming would be illegal at a protest…

A4 – Sandra) Would that be under anti-masking or under jamming law?

A4 – Matthias) It would be under hacking laws…

Q4) If you counter with information… But not if switching phone off…

A4 – Matthias) That’s still allowed right now.

Q5) Do you do work comparing US and UK body cameras?

A5 – Bryce) I don’t, but I have come across the Rotterdam footage. One of my colleagues has looked at this… The impetus for adoption in the Netherlands has been different: in the US it is transparency; in the Netherlands the narrative was protection of public servants. A number of co-authors have just published on the use of cameras and how they may increase assaults on officers… We are seeing some counter-intuitive results… But the why question is interesting.

Comment) Is there any aspect of cameras being used in higher risk areas that makes that more likely perhaps?

A5 – Sandra) It’s the YouTube on-air question – everyone imagines themselves on air.

Q6) Two speakers quoted individuals accused of serious sexual assault… And I was wondering how we account for the fact that activists are not homogeneous here… Particularly when tech activists are often white males, they can be problematic…

A6) Techies don’t tend to be the most politically correct people – to generalise a great deal…

A6 – Sandra) I think they are separate issues; if I didn’t engage with people whose behaviour is problematic it would be hard to do any job at all. Those things have to be fought, and as a woman you should also challenge and call out those white male activists on their actions.

Q7 – me) I was wondering about the retention of data. In Europe there is a lot of use of CCTV and the model there is to record everything and retain any incident. In the US CCTV is not in widespread use, I think, and the bodycam model is to record incidents in progress only… So I was wondering about that choice in practice, and about the retention of those videos and the data after capture.

A7 – Bryce) The ACLU has looked at retention of data. It is a state-based issue. In Washington there are mandatory minimum periods… They are interesting: due to findings on conduct, they are under requirements to keep everything for as long as possible so auditors from the DOJ can access and audit. In Bellingham and Spokane, officers can flag items, and supervisors can too… And that is what dictates the retention schedule. There are issues there, of course. The default when I was there was two years. If it is publicly available and hits YouTube then it will be far more long-lasting – it can pop up again… There is perpetual memory there… So the actual retention schedule won’t matter.

A7 – Sandra) A small follow-up – you may have answered this with that metadata point… Do they treat bodycam data like other types of police data, or is it a separate class of data?

A7 – Bryce) Generally it is being thought of as data collection… And there is no difference for public disclosure, but they are really worried about public access. And how they share that with prosecutors… They could share on DVD… And they wanted to use the share function of the software… But they didn’t want emails with that link to be publicly disclosable… So it is being thought about like email.

Q8 – Sandra) On behalf of colleagues working on visual evidence in court.

Comment – Michael) There is work on video and how it can be perceived as “truth” without awareness of the potential for manipulation.

A8 – Bryce) One of the interesting things in Bellingham was the release of that video I showed of a suspect running away… The footage followed a police pick-up for suspected drug dealing, but it showed evasion of arrest and the whole encounter… And in that case, whether or not he was guilty of the drug charge, that video told a story of the encounter. In preparing for the court case the police shared the video with his defence team, and almost immediately they entered a guilty plea in response… And I think we will see more of that kind of invisible use of footage that never goes to court.

And with that this session ends… 

PA-31:Caught in a feedback loop? Algorithmic personalization and digital traces (Chair: Katrin Weller)

Wiebke Loosen1, Marco T Bastos2, Cornelius Puschmann3, Uwe Hasebrink1, Sascha Hölig1, Lisa Merten1, Jan­-Hinrik Schmidt1, Katharina E Kinder­-Kurlanda4, Katrin Weller4

1Hans Bredow Institute for Media Research; 2University of California, Davis; 3Alexander von Humboldt Institute for Internet and Society; 4GESIS Leibniz Institute for the Social Sciences

?? – Marco T Bastos, University of California, Davis  and Cornelius Puschmann, Alexander von Humboldt Institute for Internet and Society

Marco: This is a long-running project that Cornelius and I have been working on. At the time we started, in 2012, it wasn’t clear what impact social media might have on the filtering of news, but they are now huge mediators of news and news content in Western countries.

Since then there has been some challenge and conflict between journalists, news editors and audiences, and that raises the issue of how to monitor and understand this through digital trace data. We want to think about which topics are emphasised by news editors, and which are most shared on social media, etc.

So we will talk about taking two weeks of content from the NYT and The Guardian across a range of social media sites – that’s work I’ve been doing. And Cornelius has tracked 1.5 to 4 years’ worth of content from four German newspapers (Süddeutsche Zeitung, Die Zeit, FAZ, Die Welt).

With the Guardian we accessed data from the API which tells you which articles were published in print, and which have not – that is baseline data for the emphasis editors place on different types of content.

So, I’ll talk about my data from the NY Times and the Guardian, from 2013, though we now have 2014 and 2015 data too. This two-week data set covers 16k+ articles. The Guardian runs around 800 articles per day, the NYT around 1,000. And we could track the items on Twitter, Facebook, Google+, Delicious, Pinterest and StumbleUpon. We do that by grabbing the unique identifier for the news article, then using the social media endpoints of the platforms to find sharing. But we had a challenge with Twitter – in 2014 they killed the endpoint we and others had been using to track sharing of URLs. The other sites are active, but relatively irrelevant in the sharing of news items! And there are considerable differences across the ecosystems; some of these services are not immediately identifiable as social networks – will Delicious or Pinterest impact popularity?
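The collection approach described here could be sketched roughly as follows. This is a minimal, hedged illustration, not the project’s actual code: the platform list matches the talk, but the share-count fetch is a stub returning canned numbers, since the real endpoints (notably Twitter’s URL-count endpoint, which the speakers note was retired) would require live HTTP calls.

```python
from collections import defaultdict

# Platforms whose public endpoints once returned share counts for a URL.
PLATFORMS = ["twitter", "facebook", "googleplus", "delicious", "pinterest", "stumbleupon"]

def fetch_share_count(platform, article_url):
    """Stand-in for a call to a platform's share-count endpoint.

    In a real pipeline this would be an HTTP request per platform
    (Twitter's count endpoint no longer exists); here we return
    canned numbers so the sketch is runnable.
    """
    sample = {"twitter": 120, "facebook": 340, "pinterest": 15}
    return sample.get(platform, 0)

def collect_shares(articles):
    """articles: dicts with 'url' and 'section' (e.g. from the newspaper's API).

    Returns section -> platform -> total shares, the shape needed to
    contrast editorial emphasis with social sharing.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for art in articles:
        for platform in PLATFORMS:
            totals[art["section"]][platform] += fetch_share_count(platform, art["url"])
    return totals

articles = [
    {"url": "https://example.com/opinion/1", "section": "Opinion"},
    {"url": "https://example.com/world/2", "section": "World"},
]
totals = collect_shares(articles)
print(totals["Opinion"]["facebook"])  # 340
```

Aggregating by section like this is what makes the later per-section comparisons (opinion on Facebook, fashion on Pinterest, etc.) possible.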

This data allows us to contrast the differences in topics identified by news editors and social media users.

So, looking at the NYT there is a lot of world news, local news, opinion. Twitter maps relatively well onto the range of articles (with higher sharing of national news, opinion and technology news), but Facebook is really different – there is huge sharing of opinion, as people share what aligns with their interests. We see outliers in every section – some articles skew the data here.

If we look at everything that appeared in print, we can look at a horrible diagram that shows all shares… When you look here you see how big Pinterest is, but in fashion and lifestyle areas. The sharing there doesn’t really reflect the ratio of articles published, though. Google+ has sharing in science and technology in the Guardian, and in environment, jobs, local news, opinion and technology in the NYT.

Interestingly, news and sports are real staples of newspapers but barely feature here. Economics is even worse. Now, the articles are in English but they are available globally… But what about differences in Germany? Over to Cornelius…

Cornelius: So Marco’s work is ahead of mine – he’s already published some of this work. But I have been applying his approach to German newspapers. I’ve been looking at usage metrics, at the relationship between audiences and publishers, and at how that relationship changes over time.

So, I’ve looked at Facebook engagement with articles in four German newspapers. I have compared comments, likes and shares and how contribution varies… Opinion is important for newspapers but not necessarily where the action is. And in some areas people engage but don’t share – in economics they like and comment, but they don’t share. So it is interesting to think about the social perception of shareability.

So, a graph of Die Zeit here shows articles published and articles shared on Facebook… You see a real change in 2014 to greater numbers (in both). I have also looked at types of articles and print vs. web versions.

So, some observations: niche social networks (e.g. Pinterest) are more relevant to news sharing than expected. Reliance on Facebook at Die Zeit grew suddenly in 2014. Social norms of liking, sharing and discussing differ significantly across news desks. Some sections (e.g. sports) see a mismatch between importance and use versus liking and sharing.

In the future we want to look at temporal shifts in social media feedback and newspaper coverage, and at monitoring…

Q&A

Q1) Have you accounted for the possibility of bots sharing content?

A1 – Marco) No, we haven’t. We are looking across the board, but we cannot account for that with the data we have.

Q2) How did you define or find out that an article was shared, from the URLs?

A2) Tricky… We wrote a script for parsing shortened URLs to check that.

A2 – Cornelius) Read Marco’s excellent documentation.
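The URL-matching step Marco alludes to – resolving shortened links and normalising them so shares of the same article can be counted together – could look something like this. This is an illustrative sketch, not the project’s script; the set of query parameters treated as tracking noise is an assumption.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed to be tracking noise (illustrative list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "CMP"}

def expand_short_url(url, timeout=10):
    """Follow redirects to resolve a shortened URL (t.co, bit.ly, ...).

    urllib follows redirects by default; geturl() returns the final URL.
    Needs the network, so it is not exercised in the example below.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.geturl()

def canonicalize(url):
    """Drop tracking parameters, fragments and trailing slashes so URL variants match."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    return urlunsplit((scheme, netloc, path.rstrip("/"), urlencode(kept), ""))

print(canonicalize("https://example.com/story/?utm_source=twitter&page=2#comments"))
# https://example.com/story?page=2
```

With every shared URL reduced to a canonical form, matching shares back to the article identifiers from the newspaper APIs becomes a simple dictionary lookup.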

Q3) What do you make of how readers are engaging, what they like more, what they share more… and what influences that?

A3 – Cornelius) I think it is hard to judge. There are some indications, and have some idea of some functions that are marketed by the platforms being used in different ways… But wouldn’t want to speculate.

Twitter Friend Repertoires: Inferring sources of information management from digital traces – Jan-Hinrik Schmidt, Lisa Merten, Wiebke Loosen, Uwe Hasebrink, Katrin Weller

Our starting point was to shift the focus of Twitter research. Many studies treat Twitter – explicitly or implicitly – as a broadcast paradigm, but we want to conceive of it as an information tool, via the concept of “Twitter friend repertoires” – using “friend” in the Twitter sense of someone I follow. We are looking for patterns in the composition of friend sets.

So we take a user, take their friends list, and compare it to a list of accounts identified previously. Our index has 7,528 Twitter accounts: media outlets (20.8%); organisations – political parties, companies, civil society organisations (53.4%); and individuals – politicians, celebrities and journalists (25.8%) – all in Germany. We take our sample, compare it with a relational table, and then with our master index. And if an account isn’t found in the master index, we can’t say anything about it yet.
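That matching step is essentially a set lookup: join a user’s friend list against the master index and tally by category, with unmatched accounts counted as unidentified. A minimal sketch – the account IDs and index entries below are invented for illustration, not taken from the study:

```python
from collections import Counter

# Hypothetical master index: account handle -> category.
MASTER_INDEX = {
    "tagesschau": "media",
    "spdde": "organisation",
    "regsprecher": "individual",
}

def classify_friends(friend_ids, index=MASTER_INDEX):
    """Return the share of a user's friends per index category.

    Friends not found in the index fall into 'unidentified' - the
    situation where, as the talk puts it, we can't say anything yet.
    """
    counts = Counter(index.get(f, "unidentified") for f in friend_ids)
    total = len(friend_ids)
    return {cat: n / total for cat, n in counts.items()}

shares = classify_friends(["tagesschau", "spdde", "someuser1", "someuser2"])
print(shares)  # {'media': 0.25, 'organisation': 0.25, 'unidentified': 0.5}
```

Run over a whole sample (random users, MdB, BPK, audiences), these per-user shares are what allow the group-level comparisons that follow.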

To demonstrate the answers we can find with this approach, we have looked at five different samples:

  • Audience_TS – sample following PSB TV News
  • Audience_SZ – sample following quality daily newspapers
  • MdB – members of federal parliament
  • BPK – political journalists registered for the Bundespressekonferenz
  • Random – random sample of German Twitter users (via Axel Bruns)

We can look at the friends here, and categorise the accounts. In our random sample 77.8% are not identifiable and 22.2% are in our index (around 13% are individual accounts). That is lower than the percentage of friends in our index for all the other samples – for MdB and BPK a high percentage of their friends are in our index. Across the groups there is less following of organisational accounts (in our index) – with the exception of MdB and political parties. If we look at media accounts, the two audience samples follow more media accounts than the others, including MdB and BPK… When it comes to individual public figures in our index, celebrities are prominent for audiences, much less so for MdB and BPK; MdB follow other politicians, and journalists tend to follow other journalists. Journalists do follow politicians, and politicians – to a lesser extent – follow journalists.

In terms of patterns of preference we can use a model of a fictional user to understand preference between our three categories (organisational accounts, media accounts, individual accounts). And we can compare that example profile with our own data, to see how others’ behaviours fit the typology. In our random sample, 37.9% didn’t follow any organisational accounts. Amongst MdB and BPK there is a real preference for individual accounts.
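That preference typology could be operationalised along these lines: look at how a user’s identified friends distribute across the three categories and assign the profile they best match. The labels and the "dominant category" rule below are invented for illustration; the study’s actual model may differ.

```python
def preference_profile(counts):
    """counts: friend counts per category ('organisation', 'media', 'individual').

    Returns a coarse, illustrative profile label: which indexed
    categories the user follows at all, and which one dominates.
    """
    followed = {cat for cat, n in counts.items() if n > 0}
    if not followed:
        return "follows none of the indexed categories"
    if len(followed) == 1:
        return f"only {next(iter(followed))} accounts"
    dominant = max(counts, key=counts.get)
    return f"prefers {dominant} accounts"

print(preference_profile({"organisation": 0, "media": 2, "individual": 9}))
# prefers individual accounts
print(preference_profile({"organisation": 0, "media": 0, "individual": 0}))
# follows none of the indexed categories
```

Mapping every sampled user onto such a label is what lets you say, for example, that a given share of the random sample follows no organisational accounts at all.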

So, this is what we are measuring right now… I am still not quite happy yet. It is complex to explain, but hard to also show the detail behind that… We have 20 categories in our master index but only three are shown here… Some frequently asked questions that I will ask and answer based on previous talks…

  1. Around 40% identified accounts is not very much, is it?
    Yes and no! We have increased this over time. But initially we did not include international accounts; if we did, we’d increase the share, especially with celebrities and international media outlets. However, there is always a trade-off, and there will always be a long tail… And we are interested in specific categorisations and in public speakers as sources on Twitter.
  2. What does friending mean on Twitter anyway?
    Good question! More qualitative research is needed to understand that – but there is some work on journalists (only). Maybe people friend people for information management reasons, reciprocity norms, public signal of connection, etc. And also how important are algorithmic recommendations in building your set of friends?

Q&A

Q1 – me) I’m glad you raised the issue of recommendation algorithms – the celebrity issue you identified is something Twitter really pushes as a platform now. I was wondering, though, whether you have been looking at how long the people you are studying have been on Twitter – as behavioural norms shift over time.

A1) It would be possible to collect that, but we don’t now. For journalists and politicians we do gather the list of friends each month, to get a longitudinal idea of changes. Over a year, there haven’t been many changes yet…

Q2) Really interesting talk. Could you go further with the repertoire? Could there be a discrepancy between the repertoire and its use, in terms of retweeting, replying etc.?

A2) We haven’t so far… We could see which types of tweets accounts are favouriting or retweeting – but we are not there yet.

Q3) A problem here…

A3) I am not completely happy establishing preference based on indexes… But I am not sure how else to do this, so maybe you can help me with it.

Analysing digital traces: The epistemological dimension of algorithms and (big) internet data – Katharina Kinder-Kurlanda and Katrin Weller

Katharina: We are interested in the epistemological aspects of algorithms, and in how we research these. So, our research subjects are researchers themselves.

So we are seeing a real focus on algorithms in internet research, and we need to understand the (hidden) influence of algorithms on all kinds of research, including on researchers themselves. So we have researchers interested in algorithms… and in platforms, users and data… But all of these aspects are totally intertwined.

So let’s take a Twitter profile… A user of Twitter gets recommendations of who to follow at a given moment in time, and they see newsfeeds at a given moment in time. That user has context that, as a researcher, I cannot see; nor can I interpret the impact of that context on the user’s choices, e.g. who they then follow.

So, algorithms observe, count, sort and rank information on the basis of a variety of different data sources – they are highly heterogeneous and transient. Online data can be user-generated content or activity, traces or location data from various internet platforms. That promises new possibilities, but also raises significant challenges, not least because of its heterogeneity.

Social media data has uncertain origins; there is uncertainty about users and their motivations, and often uncertain provenance of the data. The “users that we see are not users” but highly structured profiles and the result of careful image management. And we see renewed discussion of methods and epistemology, particularly within the social sciences – for instance, suggestions around “messiness” (Knupf 2014), and ? (Kitchin 2012).

So, what does this mean for algorithms? Algorithms operate on an uncertain basis and present real challenges for internet research. I’m now going to talk about a qualitative study of social media researchers that Katrin and I conducted (Kinder-Kurlanda and Weller 2014). We carried out interviews at conferences – highly varied ones – speaking to those working with data obtained from social media. There were 40 interviews in total, and we focused on research data management.

We found that researchers found very individual ways to address epistemological challenges in order to realise the potential of this data for research. And there were three real concerns here: accessibility, methodology, research ethics.

  1. Data access and quality of research

Here there were challenges of data access: restrictions on the privacy of social media data, technical skills, and adjusting research questions due to data availability; the struggle for data access often consumes much effort. Researchers talked about difficulty in finding publication outlets, recognition and jobs in the disciplinary “mainstream” – it is getting better, but it is a big issue. There was also comment on this being a computer-science-dominated field – with highly formalised review processes and few high-ranking conferences – which enforces highly strategic planning of resources and research topics. So researchers’ attempts to achieve validity and good research quality are constrained. This is really challenging for researchers.

2. New Methodologies for “big data”

Methodologies in this research often defy traditional ways of achieving research validity – through ensuring reproducibility or sharing of data sets (ethically not possible). There is a need to find patterns in large data sets by analysis of keywords, or automated analysis. It is hard for others to understand the process and validate it. Data sets cannot be shared…

3. Research ethics

There is a lack of users’ informed consent to studies based on online data (Hutton and Henderson 2015). There is ethical complexity. Data cannot really be anonymised…

So, how do algorithms influence our research data and what does this mean for researchers who want to learn something about the users? Algorithms influence what content users interact with. For example: how do we study user networks without knowing the algorithms behind follower/friend suggestions? How do we study populations?

To get back to the question of observing algorithms: the problem is that various actors, in the most diverse situations, react out of different interests to the results of algorithmic calculations, and may even try to influence algorithms. You see that with tactics around trending hashtags as part of protest, for instance. And the results of algorithmic analyses are presented to internet users with little information on how the algorithms take part.

In terms of next steps, researchers need to be aware that online environments are influenced by algorithms, and so are the users and the data they leave behind. It may mean capturing the “look and feel” of the platform as part of research.

Q&A

Q1) One thing I wasn’t sure about… Is your sense when you were interviewing researchers that they were unaware of algorithmic shaping… Or was it about not being sure how to capture that?

A1) “Algorithms” wasn’t the terminology when we started our work… They talked about big data… The framing and terminology are shifting… So we are adding the algorithms now… But we did find varying levels of understanding of platform function – some were very aware of platform dynamics, but some felt that if they have a Twitter dataset, that’s a representation of the real world.

Q1) I would think that if we think about recognising how algorithms and platform function come in as an object… Presumably some working on interfaces were aware, but others looking at, e.g., friendship groups took the data and weren’t thinking about platform function – and that is something they should be thinking about…

A1) Yes.

Q2) What do you mean by the term “algorithm” now, and how that term is different from previously…

A2) I’m sure there is a messiness to this term. I do believe that looking at programs wouldn’t solve that problem. You have the algorithm in itself, gaining attention… from researchers and industry… So you have programmers tweaking algorithms here… as part of different structures, pressures and contexts… But algorithms are part of a lot of people’s everyday practice… It makes sense to focus on those.

Q3) You started at the beginning with an illustration of the researcher in the middle, then moved onto the agency of the user… and the changes to the analytical capacities of working with this type of data… But how much awareness is there amongst researchers of the data and the tools they work with, and of how these are inscribed into the research?

A3) Thank you for making that distinction. The problem in a way is that we saw what we might expect – highly varied awareness… This was determined by disciplinary background – whether an STS researcher in sociology, or a computer scientist, say. We didn’t find too many disciplinary trends, but we looked across many disciplines… There were huge ranges of approach and attitude here – our data was too broad.

Panel Q&A

Q1 – Cornelius) I think that we should say that if you are wondering about “feedback” here, it’s about thinking about metrics and how they then feedback into practice, if there is a feedback loop… From very different perspectives… I would like to return to that – maybe next year when research has progressed. More qualitative understanding is needed. But a challenge is that stakeholder groups vary greatly… What if one finding doesn’t hold for other groups…

Q2) I am from the Wikimedia Foundation… I’m someone who does data analysis a lot. I am curious whether, in looking at these problems, you have looked at recommender systems research, which has been researching this space for 10 years – work on messy data and cleaning messy data… There are so many tiny differences that can really make a difference. I work on predictive algorithms, but that’s a new bit of turbulence in a turbulent sea… How much of this do you want to bring into this space?

A2 – Katrin) These communities have not come together yet. I know people who work in socio-technical studies who do study interface changes… There is another community that is aware that this exists, but not so closely… They see these as tiny bits of the same puzzle… And it can be harder to understand for historical data, and to get an idea of what factors influence your data set. In our interviews we have interviewees more like you, and some more like the people at sessions like this… There is some connection, but not all of those areas are coming together…

A2 – Cornelius) I think there is a clash between computational social science data work and this stuff here… That predictive aspect screws with big claims about society… Maybe an awareness but not a keenness. There is older computer science research that we are not engaging with, but should be… But often there is a conflict of interests – I saw a presentation that showed changes to the interface changing behaviour… But companies don’t want to disclose that manipulation…

Comment) We’ve gone through a period – and I am disheartened to see it is still there – in which researchers are so excited to trace human activities that they treat hashtags as the political debate… This community helpfully problematises or contextualises this… But I think these papers raise the question of people orientating practices towards the platform, and towards machine learning… I find it hard to talk about that… and about how behaviour feeds into machine learning… The system tips towards behaviour, and technology shifts and reacts to that, which is hard.

Q3) I wanted to agree with that idea of the need to document. But I want to push at your implicit position that this is messy and difficult and hard to measure… I think that applies to *any* method… Standards of data removal, and messiness, arise elsewhere too… Some of those issues apply across all kinds of research…

A3 – Cornelius) Christian would have had an example on his algorithm audit work that might have been helpful there.

Comment) I wanted to comment on social media research versus traditional social science research… We don’t have much power over our data set – that’s quite different in comparison with those running surveys or undertaking interviews, where I have control of the tool… And I think that argument isn’t just about survey analysis but about other qualitative analysis too… Your research design can fit your purposes…

 


Oct 062016
 

Today I am again at the Association of Internet Researchers AoIR 2016 Conference in Berlin. Yesterday we had workshops, today the conference kicks off properly. Follow the tweets at: #aoir2016.

As usual this is a liveblog so all comments and corrections are very much welcomed. 

PA-02 Platform Studies: The Rules of Engagement (Chair: Jean Burgess, QUT)

How affordances arise through relations between platforms, their different types of users, and what they do to the technology – Taina Bucher (University of Copenhagen) and Anne Helmond (University of Amsterdam)

Taina: Hearts on Twitter: In 2015 Twitter moved from stars to hearts, changing the affordances of the platform. They stated that they wanted to make the platform more accessible to new users, but that impacted on existing users.

Today we are going to talk about conceptualising affordances. In its original meaning an affordance is conceived of as a relational property (Gibson). For Norman, perceived affordances were more the concern – thinking about how objects can exhibit or constrain particular actions. Affordances are not just visual clues or possibilities; they can be felt. Gaver talks about these technology affordances. There are also social affordances – talked about by many – mainly about how poor technological affordances have an impact on societies. It is mainly about the impact of technology and how it can contain and constrain sociality. And finally we have communicative affordances (Hutchby): how technological affordances impact on communities and communication practices.

So, what about platform changes? If we think about design affordances, there are different ways to understand this. The official reason for the design change was given as being about the audience – affording sociality of community and practices.

Affordances continues to play an important role in media and social media research. They tend to be conceptualised as either high-level or low-level affordances, with ontological and epistemological differences:

  • High: affordance in the relation – actions enabled or constrained
  • Low: affordance in the technical features of the user interface – reference to Gibson but they vary in where and when affordances are seen, and what features are supposed to enable or constrain.

Anne: We now want to turn to a platform-sensitive approach, expanding the notion of the user to different types of platform users – end-users, developers, researchers and advertisers – there is a real diversity of users, user needs and experiences here (see Gillespie on platforms). So, in the case of Twitter there are many users and many agendas – and multiple interfaces. Platforms are dynamic environments – and that differentiates social media platforms from Gibson’s environments. Computational systems driving media platforms are different: social media platforms adjust interfaces to their users through personalisation, A/B testing, and algorithmic organisation (e.g. Twitter recommending people to follow based on interests and actions).

In order to take a relational view of affordances, and do it justice, we also need to understand what users afford to the platforms – as they contribute, create content, and provide data that enables use, development and income (through advertisers) for the platform. Returning to Twitter… the platform affords different things for different people.

Taking the medium-specificity of platforms into account, we can revisit earlier conceptions of affordance and critically analyse how they may be employed or translated to platform environments. Platform users are diverse and multiple, and relationships are multidirectional, with users contributing back to the platform. And those different users have different agendas around affordances – in our Twitter case study, for instance, that includes developers and advertisers interested in affordances to measure user engagement.

How the social media APIs that scholars so often use for research are—for commercial reasons—skewed positively toward ‘connection’ and thus make it difficult to understand practices of ‘disconnection’ – Nicolas John (Hebrew University of Jerusalem) and Asaf Nissenbaum (Hebrew University of Jerusalem)

Consider this… On Facebook, if you add someone as a friend they are notified; if you unfriend them, they are not. If you post something you see it in your feed; if you delete it, the deletion is not broadcast. Facebook has a page called World of Friends – they don’t have one called World of Enemies. And Facebook does not take kindly to app creators who seek to surface unfriending and removal of content. Facebook is, like other social media platforms, therefore significantly biased towards positive friending and sharing actions. And that has implications for norms and for our research in these spaces.

One of our key questions here is: what can’t we know about?

Agnotology is defined as the study of ignorance. Robert Proctor talks about it in three terms: native state – childhood, for instance; strategic ploy – e.g. the tobacco industry on health for years; lost realm – the knowledge that we cease to hold, that we lose.

I won’t go into detail on critiques of APIs for social science research, but as an overview the main critiques are:

  1. APIs are restrictive – they can cost money, we are limited to a percentage of the whole – Burgess and Bruns 2015; Bucher 2013; Bruns 2013; Driscoll and Walker
  2. APIs are opaque
  3. APIs can change with little notice (and do)
  4. Omitted data – Baym 2013 – now our point is that these platforms collect this data but do not share it.
  5. Bias to present – boyd and Crawford 2012
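Critique 1 has a concrete methodological consequence: when an API exposes only a percentage of the whole, rare “negative” actions can all but vanish from what researchers see. A minimal sketch (a simulated, invented API, not any platform’s real endpoint) of how sampling hides rare events:

```python
import random

def sampled_stream(events, rate=0.01, seed=42):
    """Simulate a rate-limited API that exposes only a fraction of all events."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() < rate]

# 10,000 actions, of which 1 in 50 is a rare "negative" action
events = [{"id": i, "action": "unlike" if i % 50 == 0 else "like"}
          for i in range(10_000)]
sample = sampled_stream(events)

true_unlikes = sum(e["action"] == "unlike" for e in events)   # 200 in reality
seen_unlikes = sum(e["action"] == "unlike" for e in sample)   # only a handful survive
```

A researcher working from `sample` sees only a fraction of the unlikes that actually occurred – and, per critique 2, has no way of knowing how the sampling was done.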

Asaf: Our methodology was to look at some of the most popular social media spaces and their APIs. We were looking at connectivity in these spaces – liking, sharing, etc. And we also looked for the opposite traits – unliking, deletion, etc. We found that social media had very little data, if any, on “negative” traits – and we’ll look at this across three areas: other people and their content; me and my content; commercial users and their crowds.

Other people and their content – APIs tend to supply basic connectivity – friends/following, grouping, likes. Almost no historical content – except Facebook, which shares when a user has liked a page. Current state only – disconnections are not accounted for. There may be a reason not to share this data – privacy concerns perhaps – but that doesn’t explain why I cannot find this sort of information about my own profile.

Me and my content – negative traits and actions are hidden even from ourselves. Success is measured – likes and sharing, of you or by you. Decline is not – disconnections are simply lost connections… except on Twitter, where you can see analytics of followers – but no names there, and not in the API. So we are losing who we once were but are not anymore. Social network sites do not see fit to share this information over time… Lacking disconnection data is an ideological and commercial issue.

Commercial users and their crowds – these users can see much more of their histories, and of the negative actions online. They have a different regime of access in many cases, with the ups and downs revealed – though you may need to pay for access. Negative feedback receives special attention. Facebook offers the most detailed information on usage – including blocking and unliking information. Customers know more than users do – compare what Pages can see versus Groups.

Nicholas: So, implications. What Asaf has shared shows the risk of API-based research, where researchers’ work may be shaped by the affordances of the API being used. Any attempt to capture negative actions – unlikes, choices to leave or unfriend – is frustrated. If we can’t use APIs to measure social media phenomena, we have to use other means. So unfriending is understood through surveys – time consuming and problematic. And that can put you off exploring these spaces – it limits research. The advertiser-friendly user experience distorts the space – it’s like the stock market only reporting the rises, except for a few super wealthy users who get the full picture.
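One of the “other means” researchers resort to for the current-state-only problem is repeated snapshotting: since the API will not report disconnections, you poll the friend/follower list at intervals and diff the snapshots yourself. A minimal sketch with invented data, not a real API call:

```python
def infer_disconnections(snapshot_t0, snapshot_t1):
    """Infer unfollows/unfriendings by diffing two snapshots of a contact list.

    The API only ever reports current state, so any history must be
    reconstructed by the researcher from repeated polls.
    """
    t0, t1 = set(snapshot_t0), set(snapshot_t1)
    return {"lost": t0 - t1, "gained": t1 - t0}

before = ["ana", "ben", "carol", "dev"]   # poll at time t0
after  = ["ana", "carol", "eve"]          # poll at time t1
diff = infer_disconnections(before, after)
# diff["lost"] -> {"ben", "dev"}; diff["gained"] -> {"eve"}
```

Anything that happens between two polls – an unfriend followed by a quiet re-friend – remains invisible, which is exactly the kind of blind spot the speakers describe.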

A biography of Twitter (a story told through the intertwined stories of its key features and the social norms that give them meaning, drawing on archival material and oral history interviews with users) – Jean Burgess (Queensland University of Technology) and Nancy Baym (Microsoft Research)

I want to start by talking about what I mean by platforms, and what I mean by biographies. Here platforms are these social media platforms that afford particular possibilities, they enable and shape society – we heard about the platformisation of society last night – but their governance and affordances are shaped by their own economic existence. They are shaping and mediating socio-cultural experience and we need to better understand the values and socio-cultural concerns of the platforms. By platform studies we mean treating social media platforms as spaces to study in their own right: as institutions, as mediating forces in the environment.

So, why “biography” here? First we argue that whilst biographical forms tend to be reserved for individuals (occasionally companies and race horses), they are about putting the subject in context of relationships, place in time, and that the context shapes the subject. Biographies are always partial though – based on unreliable interviews and information, they quickly go out of date, and just as we cannot get inside the heads of those who are subjects of biographies, we cannot get inside many of the companies at the heart of social media platforms. But (after Richard Rogers) understanding changes helps us to understand the platform.

So, in our forthcoming book, Twitter: A Biography (NYU 2017), we will look at competing and converging desires around e.g. the @, RT, #. Twitter’s key feature set are key characters in its biography. Each has been a rich site of competing cultures and norms. We drew extensively on the Internet Archive, bloggers, and interviews with a range of users of the platform.

Nancy: When we interviewed people we downloaded their archive with them and talked through their behaviour and how it had changed – and many of those features and changes emerged from that. What came out strongly is that no one knows what Twitter is for – not just amongst users but also amongst the creators – you see that today with Jack Dorsey and Anne Richards. The heart of this issue is whether Twitter is about sociality and fun, or a very important site for sharing important news and events. Users try to negotiate why they need this space, what it is for… They start squabbling, saying “Twitter, you are doing it wrong!”… Changes come with backlash and response, changed decisions from Twitter… But that is also accompanied by the media coverage of Twitter, and by the third-party platforms built on Twitter.

So the “@” is at the heart of Twitter for sociality and Twitter for information distribution. It was imported from other spaces – IRC most obviously – as with other features. One of the earliest things Twitter incorporated was the @ and the links back… Originally you could see everyone’s @ replies, and that led to feed clutter – although some liked seeing unexpected messages like this. So Twitter made a change so you could choose. And then they changed again so you automatically did not see replies from those you don’t follow. So people worked around that with “.@” – which created conflict between the needs of the users, the ways they make the platform usable, and the way the platform wants to make the space less confusing to new users.

The “RT” gave credit to people for their words, and preserved integrity of words. At first this wasn’t there and so you had huge variance – the RT, the manually spelled out retweet, the hat tip (HT). Technical changes were made, then you saw the number of retweets emerging as a measure of success and changing cultures and practices.

The “#” is hugely disputed – it emerged through hashtag.org: you couldn’t follow hashtags in Twitter at first, but Twitter incorporated them to fend off third party tools. They are beloved by techies, and hated by user experience designers. And they are useful, but they are also easily co-opted by trolls – as we’ve seen on our own hashtag.

Insights into the actual uses to which audience data analytics are put by content creators in the new screen ecology (and the limitations of these analytics) – Stuart Cunningham (QUT) and David Craig (USC Annenberg School for Communication and Journalism)

The algorithmic culture is well understood as a part of our culture. There are around 150 items on Tarleton Gillespie and Nick Seaver’s recent reading list and the literature is growing rapidly. We want to bring back a bounded sense of agency in the context of online creatives.

What do I mean by “online creatives”? Well we are looking at social media entertainment – a “new screen ecology” (Cunningham and Silver 2013; 2015) shaped by new online creatives who are professionalising and monetising on platforms like YouTube, as opposed to professional spaces, e.g. Netflix. YouTube has more than 1 billion users, with revenue in 2015 estimated at $4 billion per year. And there are a large number of online creatives earning significant incomes from their content in these spaces.

Previously online creatives were bound up with ideas of democratic participative cultures, but we want to offer an immanent critique of the limits of data analytics/algorithmic culture in shaping SME from within the industry, on both the creator (bottom up) and platform (top down) side. This is an approach to social criticism that exposes the way reality conflicts not with some “transcendent” concept of rationality but with its own avowed norms, drawing on Foucault’s work on power and domination.

We undertook a large number of interviews and from that I’m going to throw some quotes at you… There is talk of information overload – of what one might do as an online creative presented with a wealth of data. Creatives talk about the “non-scalable practices” – the importance and time required to engage with fans and subscribers. Creatives talk about at least half of a working week being spent on high touch work like responding to comments, managing trolls, and dealing with challenging responses (especially with creators whose kids are engaged in their content).

We also see cross-platform engagement – and an associated major scaling in workload. There is a volume issue on Facebook, and the use of Twitter to manage that. There is also a sense of unintended consequences – scale has destroyed value. Income might be $1 or $2 for 100,000s or millions of views. There are inherent limits to algorithmic culture… But people enjoy being part of it and reflect a real entrepreneurial culture.

In one or two sentences, the history of YouTube can be seen as a sort of clash of NorCal and SoCal cultures. Again, no-one knows what it is for. And that conflict has been there for ten years. And you also have the MCNs (Multi-Channel Networks) who are caught like the meat in the sandwich here.

Panel Q&A

Q1) I was wondering about user needs and how that factors in. You all drew upon it to an extent… And the dissatisfaction of users around whether needs are listened to or not was evident in some of the case studies here. I wanted to ask about that.

A1 – Nancy) There are lots of users, and users have different needs. When platforms change and users are angry, others are happy. We have different users with very different needs… Both of those perspectives are user needs, they both call for responses to make their needs possible… The conflict and challenges, how platforms respond to those tensions and how efforts to respond raise new tensions… that’s really at the heart here.

A1 – Jean) In our historical work we’ve also seen that some users voices can really overpower others – there are influential users and they sometimes drown out other voices, and I don’t want to stereotype here but often technical voices drown out those more concerned with relationships and intimacy.

Q2) You talked about platforms and how they developed (and I’m afraid I didn’t catch the rest of this question…)

A2 – David) There are multilateral conflicts about what features to include and exclude… And what is interesting is thinking about what ideas fail… With creators you see economic dependence on platforms and affordances – e.g. versus PGC (Professionally Generated Content).

A2 – Nicholas) I don’t know what user needs are in a broader sense, but everyone wants to know who unfriended them, who deleted them… And a dislike button, or an unlike button… The response was strong but “this post makes me sad” doesn’t answer that and there is no “you bastard for posting that!” button.

Q3) Would it be beneficial to expose unfriending/negative traits?

A3 – Nicholas) I can think of a use case for why unfriending would be useful – for instance, wouldn’t it be useful to understand unfriending around the US elections? That data is captured – Facebook knows – but we cannot access it to research it.

A3 – Stuart) It might be good for researchers, but is it in the public good? In Europe and with the Right to be Forgotten should we limit further the data availability…

A3 – Nancy) I think the challenge is that mismatch of only sharing good things, not sharing and allowing exploration of negative contact and activity.

A3 – Jean) There are business reasons for positivity versus negativity, but it is also about how the platforms imagine their customers and audiences.

Q4) I was intrigued by the idea of the “medium specificity of platforms” – what would that be? I’ve been thinking about devices and interfaces and how they are accessed… We have what we think of as a range but actually we are used to using really one or two platforms – e.g. Apple iPhone – in terms of design, icons, etc., and what the possibilities of the interface are, and what happens when something is made impossible by the interface.

A4 – Anne) By “medium specificity” we mean the platform itself as medium – moving beyond the end user and user experience. We wanted to take into account the role of the user – the platform also has interfaces for developers, for advertisers, etc., and we wanted to think about those multiple interfaces, where they connect, how they connect, etc.

A4 – Taina) It’s a great point about medium specificity, but for me it’s more about platform specificity.

A4 – Jean) The integration of mobile web means the phone iOS has a major role here…

A4 – Nancy) We did some work with couples who brought in their phones, and when one had an Apple and one had an Android phone we actually found that they often weren’t aware of what was possible in the social media apps as the interfaces are so different between the different mobile operating systems and interfaces.

Q5) Can you talk about algorithmic content and content innovation?

A5 – David) In our work with YouTube we see forms of innovation that are very platform specific around things like Vine and Instagram. And we also see counter-industrial forms and practices. So, in the US, we see blogging and first person accounts of lives… beauty, unboxing, etc. But if you map content innovation you see (similarly) this taking the form of gaps in mainstream culture – in India that’s stand up comedy, for instance. Algorithms are then looking for qualities and connections based on what else is being accessed – creating a virtuous circle…

Q6) Can we think of platforms as unstable, as having not quite such a uniform sense of purpose and direction…

A6 – Stuart) Most platforms are very big in terms of their finance… If you compare that to 20 years ago the big companies knew what they were doing! Things are much more volatile…

A6 – Jean) That’s very common in the sector, except maybe on Facebook… Maybe.

PA-05: Identities (Chair: Tero Jukka Karppi)

The Bot Affair: Ashley Madison and Algorithmic Identities as Cultural Techniques – Tero Karppi, University at Buffalo, USA

As of 2012 Ashley Madison is the biggest online dating site targeted at those already in a committed relationship. Users are asked to share their gender, their sexuality, and to share images. Some aspects are free but message and image exchange are limited to paid accounts.

The site was hacked in 2015, and the stolen user data was then shared publicly. Security experts who analysed the data assessed it as real, associated with real payment details etc. The hackers’ intention was to expose cheaters, but my paper is focused on a different aspect of the aftermath. Analysis showed 43 male bots and 70,000 female bots, and that is the focus of my paper. And I want to think about this space and connectivity by removing the human user from the equation.

The method for me was about thinking about the distinction between human and non-human user, the individual and the bot. Drawing on German media theory, I wanted to use the concept of cultural techniques – with materials, symbolic values, rules and places. So I am seeking out elements of difference between different materials in the context of the hack and its aftermath.

So, looking at a news item: “Ashley Madison, the dating website for cheaters, has admitted that some women on its site were virtual computer programmes instead of real women” (CNN Money), which goes on to say that users thought that they were cheating, but they weren’t after all! These bots interacted with users in a variety of ways, from “winking” to messaging. The role of the bot is to engage users in the platform and transform them into paying customers. A blogger talked about the space as all fake – the men are cheaters, the women are bots, and only the credit card payments are real!

The fact that the bots are so gender imbalanced tells us the difference in how the platform imagines male and female users. In another commentary they comment on the ways in which fake accounts drew men in – both by implying real women were on the site, and by using real images on fake accounts… The lines between what is real and what is fake have been blurred. Commentators noted the opaqueness of connectivity here, and of the role of the bots. Who knows how many of the 4 million users were real?

The bots are designed to engage users, to appear as human to the extent that we understand human appearance. Santine Olympo talked about bots, whilst others look at algorithmic spaces and what can be imagined and created from our wants and needs. According to Ashley Madison employees the bots – or “angels” – were created to match the needs of users, recycling old images from real user accounts. This case brings together the “angel” and human users. A quote from a commentator imagines this as a science fiction fantasy where real women are replaced by perfect, interested bots. We want authenticity in social media sites, but bots are part of our mundane everyday existence and part of these spaces.

I want to finish by quoting from Ashley Madison’s terms and conditions, in which users agree that “some of the accounts and users you may encounter on the site may be fiction”.

Facebook algorithm ruins friendship – Taina Bucher, University of Copenhagen

“Rachel”, a Facebook user/informant, states this in a tweet. She has a Facebook account that she doesn’t use much. She posts something and old school friends she has forgotten comment on it. She feels out of control… And what I want to focus on today are the ordinary affects of algorithmic life, taking that idea from ?’s work and Catherine Stewart’s approach to using it in the context of understanding the encounters between people and algorithmic processes. I want to think about the encounter and how the encounter itself becomes generative.

I think that the fetish could be one place to start in knowing algorithms… And how people become attuned to them. We don’t want to treat algorithms as a fetish. The fetishist doesn’t care about the object, just about how the object makes them feel. And so the algorithm as fetish can be a mood maker, using the “power of engagement”. The power does not reside in the algorithm, but in the types of ways people imagine the algorithm to exist and impact upon them.

So, I have undertaken a study of people’s personal algorithm stories about the Facebook algorithm, monitoring and querying Twitter for comments and stories (through keywords) relating to Facebook algorithms. A total of 25 interviews were undertaken via email, chat and Skype.

So, when Rachel tweeted about Facebook and friendship, that gave me the starting point to understand stories and the context for these positions through interviews. And what repeatedly arose was the uncanny nature of Facebook algorithms. Take, for instance, Michael, a musician in LA. He shares a post and usually the likes come in rapidly, but this time nothing… He tweets that the algorithm is “super frustrating” and he believes that Facebook only shows paid-for posts. Like others he has developed his own strategy to surface posts more reliably. He says:

“If the status doesn’t build buzz (likes, comments, shares) within the first 10 minutes or so it immediately starts moving down the news feed and eventually gets lost.”
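Michael’s folk theory amounts to a time-decay ranking model. Facebook’s actual News Feed ranking is proprietary, but the behaviour he describes can be sketched as engagement discounted by age – a toy model with invented numbers, not the real algorithm:

```python
import math

def feed_score(likes, minutes_old, decay=0.15):
    """Toy time-decay ranking: engagement counts for less as a post ages."""
    return likes * math.exp(-decay * minutes_old)

# The same 20 likes are worth far more in the first ten minutes
early = feed_score(likes=20, minutes_old=10)
late  = feed_score(likes=20, minutes_old=60)
```

Under any such decay, a post that fails to “build buzz” early can never recover – which is why users develop optimisation strategies around posting times and early engagement.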

Adapting behaviour to social media platforms and their operation can be seen as a form of “optimisation”. Users aren’t just updating their profile or hoping to be seen, they are trying to change behaviours to be better seen by the algorithm. And this takes us to the algorithmic imaginary: the ways of thinking about what algorithms are, what they should be, how they function, and what these imaginations in turn make possible. Many of our participants talked about changing behaviours for the platform. When Rachel talks about “clicking every day to change what will show up on her feed”, she is not only using the platform but thinking and behaving differently in the space. Adverts can also suggest algorithmic intervention and, whether or not the user has actually been profiled (e.g. for anti-wrinkle cream), users can feel profiled regardless.

So, people do things to algorithms – disrupting liking practices, commenting more frequently to increase visibility, emphasising positively charged words, etc. These are not just interpreted by the algorithm but also shape it. Critiquing the algorithm is not enough; people are also part of the algorithm and impact upon its function.

Algorithmic identity – Michael Stevenson, University of Groningen, Netherlands

Michael is starting with a poster of Blade Runner… Algorithmic identity brings to mind cyberpunk and science fiction. But day to day algorithmic identity is often about ads for houses, credit scores… And I’m interested in this connection between this clash of technological cool vs mundane instruments of capitalism.

For critics the “cool” is seen as an ideological cover for the underlying political economy. We can look at the rhetoric around technology – “rupture talk”, digital utopianism as that covering of business models etc. Evgeny Morozov writes entertainingly of this issue. I think this critique is useful but I also think that it can be too easy… We’ve seen Morozov tear into Jeff Jarvis and Tim O’Reilly, describing the latter as a spin doctor for Silicon Valley. I think that’s too easy…

My response is this… An image of Christopher Walken saying “needs more Bourdieu”. I think we need to take seriously the values and cultures and the effort it takes to create them. Bourdieu talks about the new media field with areas of “web native”, open, participatory, transparent at one end of the spectrum – the “autonomous pole”; and the “heteronomous pole” of mass/traditional media, closed, controlled, opaque. The idea is that actors locate themselves between these poles… There is also competition to be seen as the most open, the most participatory – you may remember a post from a few years back on Google’s idea of open versus that of Facebook. Bourdieu talks of the autonomous pole as being about downplaying income and economic value, whereas the heteronomous pole is much more directly about that…

So, I am looking at “Everything” – a site designed in the 1990s. It was built by the guys behind Slashdot. It was intended as a compendium of knowledge to support that site and accompany it – items of common interest, background knowledge that wasn’t news. If we look at the site we see implicit and explicit forms of impact… Voting forms on articles (e.g. “I like this write up”), and soft links at the bottom of the page – generated by these types of feedback and engagement. This was the first version in the 1990s. Then in 1999 Nathan Dussendorf(?) developed Everything2, built with the Everything Development Engine. This is still online. Here you see that techniques of algorithmic identity and datafication of users are very explicitly presented – very much unlike Facebook. Among the geeks here the technology is put on top, showing reputation on the site. And being open source, if you wanted to understand the recommendation engine you could just look it up.
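The notes don’t document exactly how Everything’s soft links were computed, but links “generated by these types of feedback and engagement” suggest something like co-visitation counting: pages that users engage with in the same sessions become linked. A speculative sketch of that general technique, with invented data:

```python
from collections import Counter
from itertools import combinations

def soft_links(sessions, top_n=3):
    """Derive 'soft links' between pages from co-visits within user sessions."""
    co = Counter()
    for session in sessions:
        # Count every pair of distinct pages visited together
        for a, b in combinations(sorted(set(session)), 2):
            co[(a, b)] += 1
    links = {}
    for (a, b), n in co.items():
        links.setdefault(a, Counter())[b] = n
        links.setdefault(b, Counter())[a] = n
    # For each page, keep its most strongly co-visited neighbours
    return {page: [p for p, _ in c.most_common(top_n)]
            for page, c in links.items()}

sessions = [["monty python", "spam", "holy grail"],
            ["spam", "holy grail"],
            ["monty python", "holy grail"]]
related = soft_links(sessions)["holy grail"]
```

Because the whole pipeline is just engagement counts, it is exactly the kind of mechanism that an open source site could expose for inspection, as Everything2 did.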

If we think of algorithms as talk makers, and we look back at 1999 Everything2, you see the tracking and datafication in place but the statement around it talks about web 2.0/social media type ideas of democracy, meritocracy, conflations of cultural values and social actions with technologies and techniques. Aspects of this are bottom up and you also talk about the role of cookies, and the addressing of privacy. And it directly says “the more you participate, the greater the opportunity for you to mold it your way”.

Thinking about Field Theory we can see some symbolic exclusion – of Microsoft, of large organisations – as a way to position Everything2 within the field. This continues throughout the documentation across the site. And within this field “making money is not a sin” – that developers want to do cool stuff, but that can sit alongside making money.

So, I don’t want to suggest this is a utopian space… Everything2 had a business model, but this was of its time for open source software. The idea was to demonstrate capabilities of the development framework, to get them to use it, and to then get them to pay for services… But this was 2001 and the bubble burst… So the developers turned to “real jobs”. But Everything2 is still out there… And you can play with the first version on an archived version if you are curious!

The Algorithmic Listener – Robert Prey, University of Groningen, Netherlands

This is a version of a paper I am working on – feedback appreciated. And this was sparked by re-reading Raymond Williams, who wrote that “there are in fact no masses, but only ways of seeing people as masses” (1958/2011). I think that in the current environment Williams might now say “there are in fact no individuals, but only ways of seeing people as individuals”. And for me, I’m looking at this through the lens of music platforms.

In an increasingly crowded and competitive sector, platforms like Spotify, SoundCloud, Apple Music, Deezer, Pandora and Tidal are increasingly trying to differentiate themselves through recommendation engines. And I’ll go on to talk about recommendations as individualisation.

Pandora internet radio calls itself the “music genome project” and sees music as genes. It seeks to provide recommendations that are outside the distorting impact of cultural information: e.g. you might like “The Colour of My Love” but be put off by the fact that Celine Dion is not cool. They market themselves against the crowd. They play on the individual as the part separated from the whole. However…

Many of you will be familiar with Spotify, and will therefore be familiar with Discover Weekly. The core of Spotify is the “taste profile”. Every interaction you have is captured and recorded in real time – selected artists, songs, behaviours, what you listen to and for how long, what you skip. Discover Weekly uses both the taste profile and aspects of collaborative filtering – selecting songs you haven’t discovered that fit your taste profile. So whilst it builds a unique identity for each user, it also relies heavily on other people’s taste. Pandora treats other people as distortion; Spotify sees them as more information. Discover Weekly also understands the user based on current and previous behaviours. Ajay Kalia (Spotify) says:

“We believe that it’s important to recognise that a single music listener is usually many listeners… [A] person’s preference will vary by the type of music, by their current activity, by the time of day, and so on. Our goal then is to come up with the right recommendation…”

This treats identity as being in context, as being the sum of our contexts. Previously fixed categories, like gender, are not assigned at the beginning but emerge from behaviours and data. Pagano talks about this, whilst Cheney-Lippold (2011) talks about “cybernetic relationship to individual” and the idea of individuation (Simondon). For Simondon we are not individuals, individuals are an effect of individuation, not the cause. A focus on individuation transforms our relationship to recommendation systems… We shouldn’t be asking if they understand who we are, but the extent to which the person is an effect of personalisation. Personalisation is seen as about you and your need. From a Simondonian perspective there is no “you” or “want” outside of technology. In taking this perspective we have to acknowledge the political economy of music streaming systems…
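The collaborative-filtering side of Discover Weekly – other people’s taste treated as information rather than distortion – can be sketched in a few lines. This is a bare user-based filter over invented listening data, nothing like Spotify’s production system:

```python
def recommend(target, listens, k=2):
    """User-based collaborative filtering: score tracks the target hasn't
    heard by the listening overlap of the users who have heard them."""
    mine = listens[target]
    scores = {}
    for user, tracks in listens.items():
        if user == target:
            continue
        overlap = len(mine & tracks)   # shared listens = similarity weight
        if not overlap:
            continue                   # no shared taste, no influence
        for track in tracks - mine:
            scores[track] = scores.get(track, 0) + overlap
    return sorted(scores, key=scores.get, reverse=True)[:k]

listens = {
    "rachel": {"a", "b", "c"},
    "sam":    {"a", "b", "d"},   # high overlap with rachel, so "d" is suggested
    "tessa":  {"x", "y"},        # no overlap, contributes nothing
}
picks = recommend("rachel", listens)
```

Even this toy version makes the paper’s point concrete: “your” recommendation is an effect of other people’s data, so the personalised individual is produced by, not prior to, the system.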

And the reality is that streaming services are increasingly important to industry and advertisers, particularly as many users use the free variants. And a developer of Pandora talks about the importance of understanding profiles for advertisers. Pandora boasts that it has 700 audience segments to date: “Whether you want to reach fitness-driven moms in Atlanta or mobile Gen X-er… “. The Echo Nest, now owned by Spotify, had created highly detailed consumer profiling before it was bought up. That idea isn’t new, but the detail is. The range of segments here is highly granular… And this brings us to the point that we need to take seriously what Nick Seaver (2015) says we need to think of: “contextualisation as a practice in its own right”.

This matters as the categories that emerge online have profound impacts on how we discover and encounter our world.

Panel Q&A

Q1) I think it’s about music category but also has wider relevance… I had an introduction to the NLP process of Topic Modelling – where you label categories after the factor… The machine sorts without those labels and takes it from the data. Do you have a sense of whether the categorisation is top down, or is it emerging from the data? And if there is similar top down or bottom up categorisation in the other presentations, that would be interesting.

A1 – Robert) I think that’s an interesting question. Many segments are impacted by advertisers, and by identifying groups they want to reach… But they may also…

Michael) You talked about the Ashley Madison bots – did they have categorisation, A/B testing, etc. to find successful bots?

Tero) I don’t know, but I think looking at machine learning and machine learning history…

Michael) The idea of content filtering from the bottom to the top was part of the thinking behind Everything…

Q2) I wanted to ask about the feedback loop between the platforms and the users, who are implicated here, in formation of categories and shaping platforms.

A2 – Taina) Not so much in the work I showed, but I have had some in-depth Skype interviews with school children, and they all had awareness of some of these (Facebook algorithm) issues, press coverage and particularly the review-of-the-year type videos… People pick up on this, and on the power of the algorithm. One of the participants has emailed me since the study, noting how much she sees writing about the algorithm, and about algorithms in other spaces. Awareness of the algorithms shaping these spaces is growing; it is more prominent than it was.

Q3) I wanted to ask Michael about that idea of positioning Everything2 in relation to other sites… And also the idea of the individual being transformed by platforms like Spotify…

A3 – Michael) I guess the Bourdieusian vision is that anyone who wants to position themselves on the spectrum can do so. With Everything you had this moment during the Internet Bubble, a form of utopianism… You see it come together somewhat… And the gap between Wired – traditional mass media – and smaller players, but then also a coming together around shared interests and common enemies.

A3 – Robert) There were segments that did come from media, from radio and for advertisers and that’s where the idea of genre came in… That has real effects… When I was at High School there were common groups around particular genres… But right now the move to streaming and online music means there are far more mixed listening and people self-organise in different ways. There has been de-bunking of Bourdieu, but his work was at a really different time.

Q4) I wanted to ask about interactions between humans and non-human. Taina, did people feel positive impacts of understanding Facebook algorithms… Or did you see frustrations with the Twitter algorithms. And Tero, I was wondering how those bots had been shaped by humans.

A4 – Taina) On the human and non-human, and whether people felt more or less frustrated by understanding the algorithm: even if they felt they knew it, it changes all the time; their strategies might help but then become obsolete… And practices of concealment and misinformation were tactics here. But just knowing what is taking place, and trying to figure it out, is something that I get the sense is helpful… though maybe that isn’t the whole answer to it. And that notion of the human and the non-human is interesting, particularly when we see something as human, and when we see things as non-human. In terms of some of the controversies… when is an algorithm blamed versus a human… Well, there is no necessary link/consistency there… So when do we assign humanness and non-humanness to the system, and does it make a difference?

A4 – Tero) I think that’s a really interesting question… Looking at social media now from this perspective helps us to understand that, and the idea of how we understand what is human and what is non-human agency… And what it is to be a human.

Q5) I’m afraid I couldn’t hear this question

A5 – Richard) Spotify supports what Deleuze wrote about in terms of the individual, and how aspects of our personality are highlighted at the points where that is convenient. And how does that affect how we regulate ourselves? Maybe the individual isn’t the most appropriate unit any more?

A5 – Taina) For users, the sense that they are being manipulated, or can be summed up by the algorithm, is what can upset or disconcert them… They don’t like to feel summed up by that…

Q6) I really like the idea of the imagined… and perceptions of non-human actors… In the Ashley Madison case we assume that men thought the bots were real… but maybe not everyone did. I think that moment of how and when people imagine and ascribe human or non-human status matters here. In one way we aren’t concerned by the imaginary… and in another way we might need to consider different imaginaries – the imaginary of the platform creators vs. that of users, for instance.

A6 – Tero) Right now I’m thinking about two imaginaries here… Ashley Madison’s imaginary around the bots, and the users encountering them and how they imagine those bots…

A6 – Taina) A good question… How many imaginaries do you think?! It is about understanding more about who you encounter, who you engage with. Imaginaries are tied to how people conceive of their practice in their context, which varies widely, in terms of practices and what you might post…

And with that session finished – and much to think about in terms of algorithmic roles in identity – it’s off to lunch… 

PS-09: Privacy (Chair: Michael Zimmer)

Unconnected: How Privacy Concerns Impact Internet Adoption – Eszter Hargittai, Ashley Walker, University of Zurich

The literature in this area seems to target the usual suspects – age, socio-economic status… But the literature does not tend to talk about privacy. I think one of the reasons may be the idea that you can’t compare users and non-users of the internet on privacy. But we have located a data set that does address this issue.

The U.S. Federal Communications Commission fielded its National Consumer Broadband Service Capability Survey in 2009 – when about 24% of Americans were still not yet online. This work is some years old, but our interest is in the comparison rather than the numbers/percentages. And it questioned both internet users and non-users.

One of the questions was: “It is too easy for my personal information to be stolen online” and participants were asked whether they strongly agreed, somewhat agreed, somewhat disagreed, or strongly disagreed. We treated that as a bivariate variable – strongly agreed or not. Analysing that, we found that among internet users 63.3% strongly agreed versus 81% of non-users. Now, we did analyse this demographically… It is what you would expect generally – more older people are not online (though interestingly more female respondents are online). But even then, the internet non-users were again more likely to strongly agree with that privacy-concern question.
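The recode described here – collapsing the four-point agreement scale into a binary “strongly agreed” indicator and comparing the two groups – might be sketched as follows. This is a minimal illustration with made-up records, not the actual FCC dataset; the field names and values are assumptions.

```python
# Hypothetical survey records: (is_internet_user, answer on four-point scale)
responses = [
    (True, "strongly agree"), (True, "somewhat agree"), (True, "somewhat disagree"),
    (False, "strongly agree"), (False, "strongly agree"), (False, "somewhat agree"),
]

def strongly_agree_rate(records, is_user):
    """Share of one group (users or non-users) who strongly agreed."""
    group = [ans for user, ans in records if user == is_user]
    # Bivariate recode: "strongly agree" vs everything else
    return sum(ans == "strongly agree" for ans in group) / len(group)

print(f"users: {strongly_agree_rate(responses, True):.1%}")      # 33.3% in this toy data
print(f"non-users: {strongly_agree_rate(responses, False):.1%}") # 66.7% in this toy data
```

In the toy data, as in the reported finding, non-users show the higher rate of strong agreement.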

So, what does that mean? Well, efforts to get people online should address their concerns about privacy. There is also a methodological takeaway – there is value in asking non-users internet-related questions, as their answers may explain their reasons for staying offline.

Q&A

Q1) Was it asked whether they had previously been online?

A1) There is data on drop outs, but I don’t know if that was captured here.

Q2) Is there a differentiation in how internet use is done – frequently or not?

A2) No, I think it was use or non-use. But we have a paper coming out on those with disabilities and detailed questions on internet skills and other factors – that is a strength of the dataset.

Q3) Are there security or privacy questions in the dataset?

A3) I don’t think there are, or we would have used them. It’s a big national dataset… There is a lot on type of internet connection and quality of access in there, if that is of interest.

Note, there is more on some of the issues around access, motivations and skills in the Royal Society of Edinburgh Spreading the Benefits of Digital Participation in Scotland Inquiry report (Fourman et al 2014). I was a member of this inquiry so if anyone at AoIR2016 is interested in finding out more, let me know. 

Enhancing online privacy at the user level: the role of internet skills and policy implications – Moritz Büchi, Natascha Just, Michael Latzer, U of Zurich, Switzerland

Natascha: This presentation is connected with a paper we just published and where you can read more if you are interested.

So, why do we care about privacy protection? Well there is increased interest in/availability of personal data. We see big data as a new asset class, we see new methods of value extraction, we see growth potential of data-driven management, and we see platformisation of internet-based markets. Users have to continually balance the benefits with the risks of disclosure. And we see issues of online privacy and digital inequality – those with fewer digital skills are more vulnerable to privacy risks.

We see governance becoming increasingly important, and there is an issue of understanding appropriate measures. Market solutions through industry self-regulation are problematic because of a lack of incentives – industry benefits from the data. At the same time states are not well placed to regulate, given limited knowledge and the dynamic nature of the tech sector. There is also a route through users’ self-help, which can be an effective method to protect privacy – whether opting out, or using privacy-enhancing technology. Yet although we are increasingly concerned, we still share our data and engage in behaviour that could threaten our privacy online. Understanding that is crucial to understanding what can trigger users towards self-help behaviour. To do that we need evidence, and we have been collecting it through the World Internet Project.

Moritz: We can empirically address issues of attitudes, concerns and skills. The literature finds all of these important, but usually at most two of the factors are covered in any one study. Our research design and contributions use general population data, nationally representative so that they can feed into policy. The data was collected in the World Internet Project, though many questions were only asked in Switzerland. Participants were approached on landline and mobile phones. About 88% of our participants were internet users – which maps to the approximate share of the population using the internet in Switzerland.

We found a positive effect of privacy attitudes on behaviours – but a small effect. There was a strong effect of privacy breaches and engaging in privacy protection behaviours. And general internet skills also had an effect on privacy protection. Privacy breaches – learning the hard way – do predict privacy self-protection. Caring is not enough – that pro-privacy attitudes do not really predict privacy protection behaviours. But skills are central – and that can mean that digital inequalities may be exacerbated because users with low general internet skills do not tend to engage in privacy protection behaviour.

Q&A

Q1) What do you mean by internet skills?

A1 – Moritz): In this case participants were asked a set of questions, following a model developed by Alexander van Deursen and colleagues, that asks for agreement or disagreement with statements about one’s internet skills

Navigating between privacy settings and visibility rules: online self-disclosure in the social web – Manuela Farinosi1,Sakari Taipale2, 1: University of Udine; 2: University of Jyväskylä

Our work is focused on online self-disclosure, and particularly whether young people are concerned about privacy in relation to other internet users, to Facebook itself, or to other parties.

Facebook offers complex privacy settings allowing users to adopt a range of strategies for managing their information and sharing online. Waters and Ackerman (2011) discuss the practice of managing privacy settings and the factors that play a role, including culture, motivation, risk-taking ratio, etc. Other factors are at play here too. Fuchs (2012) talks about Facebook as a commercial organisation and concerns around that. But only some users are aware of the platform’s access to their data; others may believe their content is (relatively) private. And for many users privacy from other people is more crucial than privacy from Facebook.

And there are differences in privacy management… Women are less likely to share their phone number, sexual orientation or book preferences. Men are more likely to share corporate information and political views. Several scholars have found that women are more cautious about sharing their information online. Nosko et al (2010) found no significant difference in information disclosure except for political information (which men still disclose more of).

Sakari: Manuela conducted an online survey in 2012 in Italy with single- and multiple-choice questions. It was issued to university students, and 1125 responses were collected. We focused on 18-38 year old respondents, and only those using Facebook. We have slightly more female than male participants, mainly 18-25 years old. Mostly single (but not all). And most use Facebook every day.

So, a quick reminder of Facebook’s privacy settings… (a screenshot reminder, you’ve seen these if you’ve edited yours).

To the results… We found that the data most often kept private and not shared are mobile phone number, postal address or residence, and usernames of instant messaging services. The only one of these they do share is email address. But disclosure is high for other types of data – birth date, for instance. And they were not using friends lists to manage data. Our research also confirmed that women are more cautious about sharing their data, and men are more likely to share political views. The only non-gender-related items were disclosure of email and date of birth.

Concerns were mainly about other users, rather than Facebook, and this was no different in Italy. We found very consistent gender effects across our study. We also checked factors related to concerns, but age, marital status, education, and perceived level of expertise as a Facebook user did not have a significant impact. The more time you spend on Facebook, the less likely you are to care about privacy issues. Respondents’ privacy concerns were also related to disclosures made by others on their wall.

So, to conclude: women are more aware of online privacy protection than men, and of protection of the private sphere; they take more active self-protection measures. And we can speculate on the reasons… There are differences in the sense of security/insecurity and risk perception between men and women, and the more sociological understanding of women as maintainers of social labour – used to taking more care of their material… Future research is needed though.

Q&A

Q1) When you asked users about privacy settings on Facebook how did you ask that?

A1) They could go and check, or they could remember.

Whose Privacy? Lobbying for the Free Flow of European Personal Data – Jockum Philip Hildén, University of Helsinki, Finland

My focus is related to political science… And my topic is lobbying for the free flow of European personal data – how the General Data Protection Regulation came into being, and which lobbyists influenced the legislators. This is a new piece of regulation coming in next year. It was the subject of a great deal of lobbying – it became visible when the regulation was in parliament, but the lobbying started much earlier than that.

So, a quick description of EU law-making. The European Commission proposes legislation, which goes both to the Council of the European Union and to the Parliament. Both draw up amendments based on the proposal, and then that becomes the final regulation. In this particular case there was public consultation before the final regulation, so I looked at a wide range of publicly available position papers. Looking across these I could see 10 types of stakeholders offering replies – far more in 2011 than to the first version in 2009. Companies in the US participated to a very high degree – almost as much as those in the UK and France. That’s interesting… And that’s partly to do with the extended scope of the new regulation, which covers the EU but also service providers in the US and other locations. This idea is not exclusive to this regulation; it is known as “the Brussels effect”.

In terms of sector I have categorised the stakeholders – dividing IP and Node communications, for instance – to understand their interests. But I am interested in what they are saying, so I draw on Kluver (2013) and the “preference attainment model” to compare the policy preferences of interest groups with the Commission’s preliminary draft proposal, the Commission’s final proposal, and the final legislative act adopted by the Council. So, what interests did the Council take into account? Well, almost every article changed – which makes those changes hard to pin down. But…

There is an EU power struggle. The Commission draft contained 26 cases where it was empowered to adopt delegated acts. All but one of these were removed from the Council’s draft. And there were 48 exceptions for member states, most of them “in the public interest”… which could mean anything! And thus the role of nation states comes into question. The idea of European law is to have consistent policy – that amount of variance undermines it.

We also see a degree of user disempowerment. Here we see responses from Digital Europe – a group of organisations doing any sort of surveillance; but we also see the American Chamber of Commerce submitting responses. In these responses both are lobbying for “implicit consent” – the original draft required explicit consent. And the Commission sort of bought into this, using a concept of unambiguous consent… which is itself very ambiguous. I compared the Council vs free data advocates, and then the Council vs privacy advocates. The free data advocates are pro free movement of data, and pro privacy – as that’s useful to them too – but they are not keen on greater Commission powers. Privacy advocates are pro privacy and more supportive of Commission powers.

In Search of Safe Harbors – Privacy and Surveillance of Refugees in Europe – Paula Kift, New York University, United States of America

Over 2015 a million refugees and migrants arrived at the borders of Europe. One of the ways the EU attempted to manage this influx was to gather information on these people – in particular satellite surveillance, and data collection on individuals on arrival.

The EU does acknowledge that biometric data raises privacy issues, but holds that satellite and drone data is not personally identifiable and so not an issue. I will argue that the right to privacy does not require the presence of personally identifiable information.

As background there are two pieces of legislation. Eurosur regulates the gathering and sharing of satellite and drone data across Member States. Although the EU justifies this on the basis of helping refugees in distress, that isn’t written into the regulation. Refugee and human rights organisations say that this surveillance is likely to enable the turning back of migrants before they enter EU waters.

If they do reach the EU, then under Eurodac (2000) refugees must give fingerprints (if over 14 years old) and can only apply for asylum in one country. But in 2013 this regulation was updated so that fingerprints can be used in law enforcement – which goes against EU human rights and data protection law. It is also demeaning and suggests that migrants are more likely to be criminal, something not backed up by evidence. It has also been proposed that photography and fingerprinting be extended to everyone over 6 years old. There are legitimate reasons for this… Refugees come into Southern Europe, where opportunities are not as good, so some have burned off their fingerprints to avoid registration there; some of these measures are attempts to register migrants, and to avoid losing children once in the EU.

The EU does not dispute that biometric data is private data. But with Eurodac and Eurosur the right to data protection does not apply – they monitor boats, not individuals. I argue that the right to private life is nevertheless jeopardised here, through prejudice, reachability and classifiability… The bigger issue may actually be the lack of personal data being collected… The EU should approach boats and identify those with asylum claims, and manage others differently, but that is not what is done.

So, how is big data relevant? Well, big data can turn non-personally-identifiable information into PII through aggregation and combination. And classifying individuals also has implications for the design of data protection laws. Data protection is a procedural right, but privacy is a substantive right, less dependent on personally identifiable information. Ultimately the right to privacy protects the person, rather than the integrity of the data.
Q&A

Q1) In your research have you encountered any examples of policy makers engaging with research here?

A1 – Paula) I have not conducted any on-the-ground interviews or ethnographic work with policy makers, but I would suggest that the increasing focus on national security is driving this activity, whereas data protection is shrinking in priority.

A1 – Jockum) It’s fairly clear that the Council engaged with digital rights groups, and that the Commission did too. But then for every one of those groups there are 10 lobby groups. So you have Privacy International and European Digital Rights, who have some traction at European level but little traction at national level. My understanding is that researchers weren’t significantly consulted, though there was a position paper submitted by a research group at Oxford, submitted by lawyers, but their interest was more aligned with national rather than digital rights issues.

Q2) You talked about the ? being embedded in the new legislation… You talk about information and big data… But is there any hope? We’ve negotiated for 4 years, and it won’t be in force until 2018…

A2 – Paula) I totally agree… You spend years trying to come up with a framework, but it all rests on PII… So how do we create data protection law that respects personal privacy without being dependent on PII? Maybe the question is not about privacy but about profiles and discrimination.

A2 – Jockum) I looked at all the different sectors to understand surveillance logic – why surveillance is related to regulation. Data protection regulation is inherently problematic as it has opposing goals – to protect individuals and to enable the sharing of data… So, in that sense, surveillance logic is informing things here.

Q3) Could you outline again the threats here beyond PII?

A3 – Paula) Refugees who are aware of these issues don’t take their phones – that reduces the chance of identification but also stops potential help calls and rescues. But the risk is also about profiling… High-ranking job offers are more likely to be made to women than men… Google thinks I am between 60 and 80 years old and Jewish; I’m neither, yet it claims to detect who I am… And that’s where the risk is here… profiling… e.g. transactions being blocked under these proposals.

Q4) An interesting mixture of papers here… Many people are concerned about the social side of privacy… but know little of institutional privacy concerns. Some become more cynical… But how can we improve literacy? How can we influence people here about data protection laws and privacy measures?

A4 – Eszter) It varies by context. In the US the concern is with government surveillance; in the EU it’s more about corporate surveillance… You may need to target differently. Myself and a colleague wrote a paper on privacy apathy… There are issues of trust, but also work to do on skills. There are bigger conversations, not just with users, to be had. There are conversations to have generally with the population… Where do you infuse that, I don’t know… How do you reach adults, I don’t know.

A4 – Natascha) It is not enough to strengthen awareness and rights… Skills are important here too… You really need to ensure that skills are developed to adapt to policies and changes. Skills are key.

Q5) You talked about exclusion and registration… And I was wondering about exclusion from registration versus exclusion within registration (e.g. the dead are not registered).

A5 – Paula) They collect how many are registered… But that can lead to threat inflation and very flawed data. In terms of data that is excluded there is a capacity issue… That may be the issue with deaths. The EU isn’t responsible for saving lives, but doesn’t want to be seen as responsible for those deaths either.

Q6) I wanted to come back to what you see as the problematic implications of the boat surveillance.

A6 – Paula) For many, data collection is fine until something happens to you… But if you know it takes place it can have an impact on your behaviours… So there is work to be done to understand whether refugees are aware of that surveillance. But the other issue here is the use of drone surveillance to turn people back: that has a clear impact on private lives, particularly as EU states have bilateral agreements with nations that have not all ratified refugee law – meaning turned-back boats may result in significantly different rights and opportunities.
RT-07: IR (Chair: Victoria Nash)

The Politics of Internet Research: Reflecting on the challenges and responsibilities of policy engagement

Victoria Nash (University of Oxford, United Kingdom), Wolfgang Schulz (Hans-Bredow-Institut für Medienforschung, Germany), Juan-Carlos De Martin (Politecnico di Torino, Italy), Ivan Klimov, New Economic School, Russia (not attending), Bianca C. Reisdorf (representing Bill Dutton, Quello Center, Michigan State University), Kate Coyer, Central European University, Hungary (not attending)

Victoria: I am Vicky Nash and I have convened a round table of members of the international network of internet research centres.

Juan-Carlos: I am director of the Nexa Center for Internet and Society in Italy, and we are mainly computer scientists like myself, and lawyers. We are ten years old.

Wolfgang: I am associated with two centres, primarily in Humboldt, and our interest is mainly in governance and surveillance. We are celebrating our fifth birthday this year. I also work with the Hans-Bredow-Institut, a traditional, multidisciplinary media institute, and we increasingly focus on the internet and internet studies as part of our work.

Bianca: I am representing Bill Dutton. I am Assistant Director of the Quello Center at Michigan State University. We were more focused on traditional media but have moved towards internet policy in the last few years as Bill moved to join us. There are three of us right now, but we are currently recruiting for a policy post-doc.

Victoria: Thanks for that, I should talk about the department I am representing… We are in a very traditional institution but our focus has explicitly always been involvement in policy and real world impact.

Victoria: So, over the last five or so years, it does feel like there are particular challenges arising now, especially working with politicians. And I was wondering if other types of researchers are facing those same challenges – is it about politics, or is it specific to internet studies. So, can I kick off and ask you to give me an example of a policy your centre has engaged in, how you were involved, and the experience of that.

Juan-Carlos: There are several examples. One was with the regional government in our region of Italy. We were aware of data and participatory information issues in Europe. We reached out and asked if they were aware. We wanted to make them aware of opportunities to open up data, and to build on OECD work, but we were also doing some research ourselves. Everybody agreed on the technical infrastructure and at the political level… We assisted them in creating the first open data portal in Italy, and one of the first in Europe. And that was great; it was satisfying at the time. Nothing was controversial, we were following a path in Europe… But with a change of regional government that portal has been somewhat neglected, so that is frustrating…

Victoria: What motivated that approach you made?

JC: We had a chance to do something new and exciting. We had the know-how and the way it could be, at least in Italy, and that seemed like a great opportunity.

Wolfgang: Of my centres – I’m kind of an outsider in political governance, as I’m mostly concerned with media. But in internet governance it feels like this is our space and we are invested in how it is governed – more so than in other areas. The example I have is from more traditional media work, at the Hans-Bredow-Institut. We were asked to investigate, for a report, how changes in usage patterns and technology put strain on governance structures in Germany… and where solutions are needed to make federal and state law in Germany more convergent and able to cope with those changes. But you have to be careful when providing options, because of course you can make some options more appealing than others… So you have to be clear about whether you will present the options neutrally, or whether you prefer an option and present it differently. And that’s interesting and challenging as an academic, and for the role of an academic and institution.

Victoria: So did you consciously present options you did not support?

Wolfgang: Yes, we did. And there were two reasons for this… They were convinced we would come up with suggestions and a basis to start working from… And they accepted that we would not be taking a side – for the federal or the state governments. They were also confident we wouldn’t attempt to mess up the system… We didn’t just present the ideal; we understood the other dependencies and factors, and they trusted us to only put forward suggestions that would enhance the system and work in practice, not replace the whole thing…

Victoria: And did they use your options?

Wolfgang: They ignored some suggestions, but where they acted they did take our options.

Bianca: I’ll talk about a semi-successful project. We were looking at detailed postcode-level data on internet access and quality, and the reasons behind it. We submitted to the National Science Foundation and were rejected; then two weeks later we were invited to an event on just that topic by the NPIA. So we are now collectively drafting suggestions with the NPIA and a wide range of research centres. It was nice to be invited by policy makers… and interesting to see the idea picked up through that process in some way…

Victoria: That’s maybe an unintended consequences aspect there… And that suggestion to work with others was right for you?

Bianca: We were already keen to work with other research centres but actually we also now have policy makers and other stakeholders around the table and that’s really useful.

Victoria: those were all very positive… Maybe you could reflect on more problematic examples…

JC: Ministers often want to show that they are consulting on policy, but often that is a gesture – a political move to appear to listen while policy is made in an entirely different way… After a while you get used to that. And then you have to calculate whether to participate or not – there is a time cost there.

Victoria: And for conflict of interest reasons you pay those costs of participating…

JC: Absolutely, the costs are on you.

Wolfgang: We have had contact from ministries in Germany and then discovered they were interested in the process as a public relations tool rather than having a genuine interest in the outcome. So now we assess that interest and engage – or don’t – accordingly. We try to say at the beginning “no, please speak to someone else” when needed. Humboldt is reluctant to engage in policy making – that’s a historical thing – but people expect us to get involved. We are one of the few places that can deliver monitoring of the internet, and there is an expectation that we do that… And when ministries design new programmes we are often asked to be engaged, and we have learned to be cautious about when we engage. Experience helps, but you see different ways of approaching academia: it can be PR, sometimes they want support for a position or political backing, or they can actually be engaged in research to learn and draw on expertise and information. If you can see which approach it is, you can handle it appropriately.

Victoria: I think as a general piece of advice – to always question “why am I being approached” in the framing of “what are their motivations?”, that is very useful.

Wolfgang: I think starting in terms of research questions and programmes that you are concerned with gives you a counterpoint in your own thinking to dealing with requests. Then when good opportunities come up you can take it and make use of it… But academic value can be limited of some approaches so you need a good reason to engage in those projects and they have to align with your own priorities.

Bianca: My bad example is related to that. The Net Neutrality debate is a big part of our work… There are a lot of partisan opinions on it, and not a lot of neutral research. We wanted to do a big project there, but when we tried to get funding for it we were steered away. We’ve been told that talking about policy with policy makers is seen very negatively; it is taken poorly. This debate has been bouncing around for 10 years; we want to see whether, where Net Neutrality is imposed, we see changes in investment… But we need funding to do that… And funders don’t want to fund it and are usually very cosy with policy makers…

Victoria: This is absolutely an issue, these concerns are in the minds of policy makers as well and that’s important.

Wolfgang: When we talk about research in our field and policy makers, it’s not just about policy makers approaching you to do something… When you have a term like Net Neutrality at the centre that requires you to be either neutral or not neutral, that really shapes how you handle it as an academic… You can become, without wanting it, someone promoting one side. On the protection of minors we did some work on co-regulation with Australia that seemed to solve a problem… But then, after this debate in Germany, when drafting of the inter-state treaty on media regulation started, the policy makers were interested… And then we felt that we should support it… and I entered the stage, but it’s not my research question anymore… So you end up with an opinion about how you want something done…

JC: As a coordinator of a European project, there was a call that included the topic of “Net Neutrality” – we made a proposal, but what happened afterwards clearly showed how politically charged that whole area was. It was in the call… but we should have framed it differently. Again, at European level you see the Commission fund research, you see the outcomes, and then they put out a call that entirely contradicts the work they funded, for political reasons. There is such a drive for evidence-based policy making, but it is selective: it is evidence-based when it fits their agenda, not when it doesn’t.

Victoria: I did some work with the Department for Culture, Media and Sport last year, again on protection of minors, and we were told at the outset to assume porn caused harm to minors. And the terms of reference were shaped to be technical – about access etc. They did bring in a range of academic expertise, but the terms of reference really constrained the contribution that was possible. So, there are real bear traps out there!

Wolfgang: A few years back the European Commission asked researchers to look at broadcasters, interruptions to broadcasts and the role of advertising. Even though we need money, we did not do that – it wasn't answering interesting research questions for us.

Victoria: I raised a question earlier about the specific stakes that academia has in the internet, it isn't just what we study. Do you want to say more about that?

Wolfgang: Yes, at the pre-conference we had an STS stream… People said "of course we engage with policy" and I was wondering why that is the main position… But the internet comes from academia and there is a long-standing tradition of engagement in policy making. Academics do engage with media policy, though they wouldn't class it as "our domain", but they were there as part of the beginning – academia was part of the beginning of the internet.

Q&A

Q1) I wonder if you are mistaking the "of-ness" for the fact that the internet is still being formed, still in the making. Broadcast is established; the internet is in constant construction.

A1 – Wolfgang) I see that

Q1) I don’t know about Europe but in the US since the 1970s there have been deliberate efforts to reduce the power of decision makers and policy makers to work with researchers…

A1 – Bianca) The Federal Communications Commission is mainly made up of economists…

Q1) Requirements and roles constrain activities. The assumption of evidence-based decisions is no longer there.

Q2) I think that there is also the issue of shifting governance. Internet governance is changing and so many academics are researching the governance of the internet, we reflect greatly on that. The internet and also the governance structure are still in the making.

Victoria: Do you feel like if you were sick of the process tomorrow, you’d still want to engage with policy making?

A2 – Phoebe) We are a publicly funded university and we are focused on digital inequalities… We feel a real responsibility to get involved, to offer advice and opinions based on our research. On other topics we'd feel less responsible, depending on the impact it would have. It is a public interest thing.

A2 – Wolfgang) When we look at our mission at the Hans-Bredow-Institute we have a vague and normative mission – we think a functioning public sphere is important for democracy… Our tradition is research into public spheres… We have a responsibility there. There is also the issue that evaluation of academic research becomes more and more important, yet there is no mechanism to ensure researchers answer the problems that society has… We have a completely divided set of research councils and their yardsticks are academic excellence. State broadcasters do research but with no peer review at all… There are some calls from the Ministry of Science that are problem-orientated, but on the whole there isn't that focus on social issues and relevance in the reward process, in the understanding of prestige.

Victoria: In the UK we have a bizarre dichotomy where research is measured against two measures: impact – where policy impact has real value, and that applies in all fields – but there is also regulation that you cannot use project funds to "lobby" government, which means you potentially cannot communicate research to politicians who disagree. This happened because a research organisation (not a university) opposed government policy with research funded by them… The implications for universities are currently unclear.

JC: Italy is implementing a similar system to the UK. Often there is no actual mandate on a topic, so individuals come up with ideas without numbers and plans… We think there is a gap – but it is government and ministry work. We are funded to work in the national interest… but we need resources for that. We are filling gaps in a way that is not sustainable in the long term – you are evaluated on other criteria.

Q3) I wanted to ask about policy research… I was wondering if there is policy research we do not want to engage in. In Europe, and elsewhere, there is increasing pressure to attract research funding… What are the guidelines or principles around what we do or do not go for, funding-wise?

A3 – Bianca) We are small so we go for what interests us… But we have an advisory board that guides us.

A3 – Wolfgang) I’m not sure that there are overarching guidelines – there may be for other types of special centres – but it’s an interesting thing to have a more formalised exchange like we have right now…

A3 – JC) No, no blockers for us.

A3 – Victoria) Academic freedom is vigorously held up at Oxford but that can mean we have radically different research agendas in the same centre.

Q4) With that lack of guidance, isn't there a need for academics to show that they can be trusted, especially in the public sphere, and especially when getting funding from, say, Google or Microsoft? And how can you embed that trust?

A4 – Wolfgang) I think peer review as a system functions to support that trust. But we have to think about other institutional settings, and whether there is enough oversight… Many associations, like Leibniz, require an institutional review board to look over the research agenda and ensure some outside scrutiny. I wouldn't say every organisation or research centre needs that – it can be helpful but costly, in terms of time in particular. And you cannot rely on the general public to do that, you need it to be peers. An interesting question though, especially as Humboldt has national funding from Google… In this network academics play a role, and organisations play a role, and you have to understand the networks and relationships of the partners you work with, and their interests.

A4 – Bianca) That’s a question that we’ve faced recently… That concern that corporate funding may sway result and the best way to face that is to publish methodology, questionnaires, process… to ensure the work is understood in that context that enables trust in the work.
A4 – JC) We spent years trying to deal with the issue of independence and it is very important, as academia has a responsibility to provide research that is independent and unbiased by funding etc. And it is not just about the work itself, but also perceptions of the work… It is quite a local/contextual issue. So, getting money from Google is perceived differently in different countries, and at different times…

Victoria: This is something we have to have more conversations about. In medicine there is far more conversation about codes of conduct around funding. I am also concerned that PhD funding is now requiring something like a third of PhDs to be co-funded by industry, without any understanding from the UK Government about what that means, including for peer review… That's something we need to think about far more stringently.

Q5) For companies there are requirements to review outputs before publication, to check for proprietary information and ensure it is not released. That makes industry the final arbiter here. In Canada our funding is also increasingly coming from industry, and there proprietary data gives them the final say…

A5 – Bianca) Sometimes it has to be about negotiating contracts and being clear what is and is not acceptable.

Victoria) That's my concern with new PhD funding models, and also with the use of industry data. It will be non-negotiable that the research is not compromised, but how you make that process clear is important.
Q6) What are your models here – are you academic or outside academia?

A6 – JC) Academic and policy are part of the work we are funded to do.

A6 – Bianca) We are 99% endowment funded, hence having a lot of freedom but also advisory board guidance.

A6 – Wolfgang) Our success is assessed by academic publication. The Humboldt Institute is funded largely by private companies, but a range of them, and also by grants. The Hans-Bredow-Institute is mainly directly funded by the Hamburg Ministry of Science, but we'd like to be funded by other funders across Germany.

A6 – Victoria) Our income is research income, plus teaching income from masters degrees… We are a department of the university. Our projects are usually policy related, but not always government related.
Q7) I was wondering if others in the room have been funded for policy work – my experience has been that policy makers had expectations and an idea of how much control they wanted… By contrast, money from Google comes with a "research something on the internet" type of freedom. This is not what I would have expected, so I just wondered how others' experiences compared.

Comment) I was asked to do work across Europe with public sector broadcasters… I don't know how well my report was seen by policy makers, but it was well received by the public sector broadcaster organisations.

Comment) I've had public sector funding, foundation funding… but I've never had corporate money… My cynical take is that corporations are maybe doing this as PR, hence not minding what you work on!

Comment) I receive money from funding agencies. I did a joint project that I proposed to a think tank… which was orientated to government… but with a real push for impact… Numbers needed to be in the title. I had to be an objective researcher but present it the right way… and that worked with impact… And then the government offered me a contract to continue the research – working for them, not against them. The funding was coming from a position close to my own ideas… I felt it was a bit instrumentalised in this way…

A7 – Wolfgang) I think that it is hard to generalise… Companies as funders do sometimes make demands and expect control over the publishing of results… and over whether it is published at all. We don't do that – our work is always in the public domain. It's case by case… But there is one aspect we haven't talked about, and that is the relationship between the individual researcher and their political engagement (or not), and how that impacts upon the neutrality of the organisation. As a lawyer I'm very aware of that… For instance, if giving expert evidence in court, the importance of appearing as an individual, not the organisation. Especially if partners/funders, past or future, are on the opposite side. I was an expert for Germany in a court case, with private broadcasters on the other side, and you have to be careful there…

A7 – JC) There is so little money for research in Italy… Regarding corporations… We got some money from Google to write an open source library; it's out there, it's public… There was no conflict there. But money from companies for policy work is really difficult. And there are lots of case-by-case issues in between.
Q8) But companies often fund social science work that isn't about policy but has an impact on policy.

A8 – JC) We don't do social science research, so we don't face that issue.

A8 – Victoria) Finding ways to make that work while guaranteeing independence is often the best way forward – you cannot, and often do not want to, say no… But you work with codes of conduct, with advisory boards, with processes to ensure appropriate freedoms.

JC: A question to the audience… A controversial topic arises, one side owns the debate and a private company approaches to support your voice… Do you take their funding?

Comment) I was asked to do that and I kind of stalled so that I didn't have to refuse or take part, but in that case I didn't feel…

Comment) If having your voice in public triggers the conversation, you do make it visible and participate, to progress the issue…

Comment) Maybe this comes down to personal versus institutional points of view. I would need to talk to colleagues to help me make that decision, to decide if this would be important or not… Then I would say yes… The better solution is to say "no, I'm talking in a private capacity".

JC) I think the point about separating individuals and centres here is important. Generally centres like ours do not take a position… And there is an added element: if a corporation wants to be involved, a track record of past behaviour makes it less troublesome. Saying something for 10 years gives you credibility in a way that suddenly engaging does not.

Wolfgang) In Germany it is general practice that if your arguments are not being heard, then you engage expertise – it is general practice in German legal academic practice. It is OK, I think.

Comment) In the Bundestag they bring in experts… But of course the choice of expert reflects values and opinions expressed in articles. So you have a range of academics supporting politics… If I am invited to talk to parliament, I say what I always say: "this is not a problem".

Victoria: And I think that nicely reminds us why this is the politics of internet research! Thank you.
Plenary Panel: Who Rules the Internet? Kate Crawford (Microsoft Research NYC), Fieke Jansen (Tactical Tech), Carolin Gerlitz (University of Siegen) – Chair: Cornelius Puschmann

Jennifer Stromer-Galley, President of the Association of Internet Researchers: For those of you who are new to AoIR, this is our 17th conference and we are an international organisation that looks at issues around the internet – now including the things that have grown out of the internet, such as mobile apps. In our panel today we will be focusing on governance issues. Before that I would like to acknowledge this marvellous city of Berlin, to thank all of my colleagues in Germany who have taken such care, and to thank Humboldt University for hosting us in this beautiful venue. And now, I'd like to hand over to Herr Matthias Graf von Kielmansegg, representing Professor Dr Elizabeth Wacker, Federal Minister of Labour and Social Affairs.

Matthias Graf von Kielmansegg: I am here representing Professor Wacker, who takes a great interest in the internet and society, including the issues that you are looking at here this week. If you are not familiar with our digitisation policy, the German government published a digital agenda for the first time two years ago, covering all areas of government operation. In terms of activities it concentrates on the 2013-2017 term and will need to be extended, but strategically it reaches far into the next decade. Additionally we have a regular summit bringing together the private sector, unions, government and the academic world to look at key issues.

You all know that digital is a fundamental game changer – in the way goods and services are used and the ways we communicate and collaborate – and digital loosens our ties to time and place… And we aren't at the end but in the middle of this process. Wikipedia was founded 16 years ago, the iPhone launched 9 years ago, and now we talk about Blockchain… So we do not know where we will be in 10 or 20 years' time. Good education and research are key to that, and we need to engage proactively. In Germany we are incorporating the Internet of Things into our industries. We used to have a technology-driven view of these things, but now we look at economic and cultural contexts, or ecosystems, to understand digital systems.

Research is one driver; the other is that science, education and research are users in their own right. Let me focus first on education… Here we must address some major issues: what will drive change here, technology or pedagogy? Who will be the change agents? And what of the role of teachers and schools? They must take the lead in change and secure the dominance of pedagogy, using digital tools to support our key education goals – and not vice versa. And that means digital education must offer more opportunities, more flexibility, and better preparation for tomorrow's world of work. With this in mind we plan to launch a digital education campaign to help young people find their place in an ever-changing digital world, and to be ready to adapt to the changes that arise. There is also the question of how education can support our economic model and higher education. And we will need to address issues of technical infrastructure and governance – and, for us, how this plays out with our 16 federal states. Closer to your world is the world of science. Digital tools create huge amounts of new data and big data. The challenge organisations face is not just infrastructure but how to access and use this data. We call our approach Securing the Life Cycle of Data, concerned with access, use, reuse and interoperability. How will we decide what we save, and what we delete? And who will decide how third parties use this data? Big data also goes alongside other aspects such as high powered computing. We plan to launch an initiative in this area next year. To oversee this we have a Scientific Oversight Body with stakeholders. We are also keen to embrace Open Data and the resources to support that. We have added new conditions to our own funding: any publication based on research funded by us must be published open access.

More needs to be known about the internet and society, and there is research to be done. So, the federal government has decided to establish a German Internet Institute. It will address a number of areas of importance: access and use of the digital world; work and value creation; and our democracy. We want an interdisciplinary team of social scientists, economists and information scientists. The competitive selection process is just underway, and we expect the winner to be announced next spring. There is readiness to spend up to €15M over the first five years. And this highlights the importance of the digital world in Germany.

Let me just make one comment. The overall title of this conference is Internet Rules! It is still up to us to be the fool or the wise… We need to understand what might happen if politics, economics and society do not find the answers to the challenges we face. And so hopefully we will find that it's not the internet that rules, but that democracy rules!
Kate Crawford

When Cornelius asked me to look at the idea of "Who rules the internet?" I looked up at my bookshelf and found lots of books written by people in this community – many of you in this room – looking at just this question. And we have moved from the '90s utopianism to the world of infrastructure, socio-technical aspects, the Internet of Things layer – and zombie webcams being co-opted by hackers. So many of you have enhanced my understanding of this issue.

Right now we see machine learning and AI being rapidly built into our world without the implications being fully understood… I am talking narrowly about AI here… Sometimes these systems have lovely feminine names: Siri, Alexa, etc… But they are embedded in our phones, and we have AI analysing images on Facebook. AI will never be separate from humans, but it is distinct and significant, and we see it moving beyond the internet and into systems – deciding who gets released from jail, hospital stays, etc. I am sure all of us were surprised by the fact that Facebook, last month, censored a Pulitzer Prize winning image of a girl being napalmed in Vietnam… We don't know the processes that triggered this, though an image of a nude girl likely triggers them… Once that had attention, the Government of Norway accused Facebook of erasing our shared history. The image was restored, but this is the tip of the iceberg – most images and actions are not so apparent to us…

This lack of visibility is important but it isn't new… There are many organisational and procedural aspects that are opaque… I think we are having a moment around AI where we don't know what is taking place… So what do we do?

We could make these systems transparent… But this doesn't seem likely to work. A colleague and I have written about the history of transparency, and how the availability of code does not necessarily tell you exactly what is happening and how it is used. Y Combinator has installed a system – brilliantly called HAL 9000 – and have boasted that they don't know how it filters applications, only the system could do that. That's fine until that system causes issues, denies you rights, gets in your way…

So we need to understand these algorithms from the outside… We have to poke them… And I think of Christian Salmand(?)'s work on algorithmic auditing. Christian couldn't be here this evening and my thoughts are with him. But he is also part of a group who are trying to pursue legal rights to enable this type of research.

And there are people who say that AI can fix this system… This is something that the finance sector talks about. They have an environment of predatory machine learning systems hunting each other – Terry Cary has written about this. It's tempting to create a "police AI" to watch these… I've been going back to 1970s books on AI, and the work of Joseph Weizenbaum, who created ELIZA. He suggested that if we continue to ascribe AI to human-acting systems it might be a slow-acting poison. It is a reminder not to be seduced by these new forms of AI.
Carolin Gerlitz, University of Siegen

I think, after the last few days, the answer to the question "who rules the internet?" is "platforms"!

Their rules about who users are and what they can do can seem very rigid. Before Facebook introduced the emotions, the Like button was used in a range of ways. With the introduction of emotions they have rigidly defined responses, creating discrete data points that are advertiser-ready and available to be recombined.

There are also rules around programmability, which dictate what data can be extracted, how, by whom, and in what ways… And platforms also like to keep the interpretation of data in their control, and adjust the rules of APIs. Some of you have been working to extract data from platforms where things are changing rapidly – Twitter API changes, Facebook API and research changes, Instagram API changes – all increasingly restricting access, all dictating who can participate. And limiting the opportunity to hold platforms to account, as my colleague Anne Helmond argues.

Increasingly platforms are accessed indirectly through intermediaries which create their own rules – a cascade of rules for users to engage with. These rules don't just extend to the platforms themselves but also to apps… as many of you have been writing about in regard to platforms and apps… And Christian, if he were here today, would talk about the increasing role of platforms in this way…

And platforms reach out not only to users but also to non-users. These spaces are also contextual – with place, temporality and the role of commercial content all important here.

These rules can be characterised in different ways… There is a dichotomy of openness and closedness. Much of what takes place is hidden and dictated by cascading rule sets. And then there is the issue of evaluation – what counts, for whom, and in what way? Taylorism refers to the mass production of small tasks – and platforms work in this fine-grained, algorithmic way. But platforms don't earn money just from users' repetitive actions… or from the use of platform data by third parties. They "put life to work" (Lazlo), using data points and raising questions of who counts and what counts.
Fieke Jansen, Tactical Tech

I work at an NGO, on the ground in real world scenarios. And we are concerned with the Big Five: Apple, Amazon, Google, Microsoft and Facebook. How did we get like this? People we work with are uncomfortable with it. When we ask activists to draw the internet, they mostly draw a cloud. We asked at a session "what happens if the government bans Facebook?" and they cannot imagine it – and if Facebook is beyond government, then where are we at here? And I work with an open source company who use Google Apps for Business – and that seems like an odd situation to me…

But I'll leave the Big Five for now and turn to Bitnik… They used their Random Darknet Shopper to buy random stuff for $50… and then placed the items in a gallery…

Then there was ICWatch… After Wikileaks, an activist in Berlin looked at all the NSA services spying on us and worked out who was working for the secret services… But that triggered a real debate… There was real discussion of it being anti-patriotic, of it putting people in danger… But the data he used, from LinkedIn, is sold every day… He just used it in a way that raised debate. We allow that selling use… but this coder's work was not allowed… Isn't that debate needed?

So, back to the Big Five. In 2014 Google (now Alphabet) was the second biggest company in the world – with the equivalent of a GDP bigger than Austria's. We choose to use many of their services every day… but many of their services are less in our face. In the sensor world we have fewer choices about data… And with the big companies it is political too… In Brussels you have to register lobbyists – there are 9 for Google, 7 of whom used to work for the European Parliament… There is a revolving door here.

There is also an issue of skill… Google has wealth and power and knowledge that are very hard to counter. Facebook has around 400m active users a month and 300m likes a day, and is worth $190bn… And here we miss the political influence. They have an enormous drive to conquer the global south… They want to roll out Facebook Zero as "the internet"…

So, who rules the internet? It's the 1% of the 1%… It is the Big Five, but also the venture capitalists who back them… Sequoia and Kleiner Perkins Caufield & Byers, and you have Peter Thiel… It is very few people behind many of the biggest companies, including some of the Big Five…

People use these services because they work well and easily… I only use open source… Yes, it is harder… Why are so few questioning and critiquing that? We feed the beast on an everyday basis… It is our universities – also moving to Big Five platforms in preference to their own – it is our government… and if we are not critical, what happens?
Panel Discussion

Cornelius: Many here study internet governance… So I want to ask, Kate, does AI rule the internet?

Kate: I think it is really hard to think about who rules the internet. The interesting thing is that automated decision-making networks have been with us for a while… It's less about ruling, and who… and more about the entanglements, fragmentation and governance. We talk about the Big Five… I would probably say there are seven companies here, deciding how we get into university, healthcare, housing – filtering far beyond the internet… And governments do have a role to play.

Cornelius: How do we govern what we don't understand?

Kate: That's a hard question… That question keeps me up at night. Governments look to us academics, to the technology sector, to NGOs, trying to work out what to do. We need really strong research groups to look at this – we have tried to do this with AI Now. Interdisciplinarity is crucial – these issues cannot be solved by computer science alone or social science alone… This is the biggest challenge of the next 50 years.

Cornelius: What about how national governments can legislate for Facebook, say? (I'm simplifying a longer question that I didn't catch in time here, correction welcome!)

Carolin: I'm not sure about Facebook, but in our digital methods workshop we talked about how on Twitter content can be deleted, yet then be exposed in other locations via the API. And it is also the case that these services are specific and localised… We expect national governments to have some governance, when what you understand and how you access information varies by location… increasing that uncanny notion. I also wanted to comment on something you asked Kate – thinking about the actors here, they all require the engagement of users, something Fieke pointed to. Those actors involved in ruling are dependent on the actions of other actors.

Cornelius: So how else might we run these things? The Chinese option, the Russian option – are there better options?

Carolin: I think I cannot answer that – I'd want to put it to these 570 smart people over the next two days. My answer would be to acknowledge the distributedness to which we have to respond and react… We cannot understand algorithms and AI without understanding context…

Carolin: Fieke, what you talked about… being extreme… Are we whining because, as Europeans, we are being colonised by other areas of the world, even as we use and are obsessed by our devices and tools – complaining, then checking our iPhones? I'm serious… If we did care that much, maybe actions would change… You said people have the power here; maybe it's not a big enough issue…

Fieke: Is it Europeans concerned about Americans from a libertarian point of view? Yes. I work mainly in non-European parts of the world, particularly in North America… For many the internet is seen as magical and neutral – but those of us who research it know it is not. But when you ask why people use tools, it's their friends or community. If you ask them who owns those tools, that raises questions that are framed in a relevant way. The framing has to fit people's reality. In South America, if you talk of Facebook Zero as the new colonialism, you will have a political conversation… But we also don't always know why we are uncomfortable… It can feel abstract, distant, and the concern is momentary. Outside of this field, people don't think about it.

Kate: Your provocation is that we could just step away and move to open source. But the reality includes opportunity costs to employment, to friends and family… And even if you do none of those things, you walk down the street and you are tracked by sensors, by other devices…

Fieke: I absolutely agree. All the data collected beyond our control is the concern… But we can't just roll over and die; we have to try and provoke and find mechanisms to play…

Kate: I think that idea of what the political levers may be… Those conversations about legal, ethical and technical parameters seem crucial, more than consumer choice. But I don't think we have sufficient collective models for changing information ecologies… and they are changing so rapidly.
Q&A
Q1) Thank you for this wonderful talk and perspectives here. You talked about the infrastructure layer… What about that question. You say this 1% of 1% own the internet, but do they own the infrastructure? Facebook is trying to balloon in the internet so that they cannot be cut off… It also – second question – used to be that YOU owns the internet that changed the dominance of big companies… This happens in history quite often… So what about that?
A1 – Fieke) I think that Kate talked about the many levels of ownership… Facebook piggy backs on other infrastructures, Google does the balloons. It used to be that government owned the infrastructure. There are new cables rolling out… EU funding, governments, private companies, rich people… The infrastructure is mainly owned by companies now.
A1 – Kate) I think infrastructure studies has been extraordinarily rich – work of Nicole Serafichi for instance – but also we have art responses. Infrastructure is very of the moment… But what happens next… It is not just about infrastructures and their ownerships, but also surveillance access to these. There are things like MESH networks… And there are people working here in Berlin to flag up faux police networks during protests to help protestors protect themselves.
A1 – Carolyn) I think that platforms would have argued differently ten years ago about who owned the internet – but “you” probably wouldn’t have been the answer…
Q2) I wonder if the real issue is that we are running on very vague ideas of government that have been established for a very different world. People are responding to elections and referenda in very irrational ways that suggest that model is not fit for purpose. Is there a better form of governance or democracy that we should move towards? Can AI help us there?
A2 – Kate) What a beautiful and impossible to answer question! Obviously I cannot answer that properly but part of the reason I do AI research is to try to inform and shape that… Hence my passion for building research in this space. We don’t have much data to go on but the imaginative space here has been dominated by those with narrow ideas. I want to think about how communities can develop and contribute to AI, and what potential there is.
Q3) Do we need to rethink what we mean by democratic control and regulation… Regulation is closely associated with nation states, but that’s not the context in which most of the internet operates. Do we need to re-engage with the question of globalisation?
A3) As Carolyn said, who is the “you” in web 2.0, and whose narrative is there? Globalisation is similar. I pay taxes to a nation state that has rules of law and governance… Denying that buys into the narrative of, mainly, internet companies and huge multinational organisations.
Cornelius: I have the declaration of the independence of cyberspace by John Perry Barlow, which I was tempted to quote to you… But it is interesting to reflect on how we have moved from utopian positions to where we are today.
Q4 – participant from Google!) There is an interesting question here… If this question was pointing to deeper truth… A clear ruler, an internet, would allow this question of who rules to be answered. I would ask how we have agency over how the proliferation of internet technologies and how we benefit from them… ?
A4 – Kate) A great title, but long for the programme! But your phrasing is so interesting – if it is so diverse and complex then how we engage is crucial. I think that is important but, the optimistic part, I think we can do this.
A4 – Carolyn) One way to engage is through descent… and negotiating on a level that ensures platforms work beyond economic values…
Q5) The last time I was forced to give away my data it was by the Australian state (where I live), in completing the census… I had to complete it or I would be fined over AU$1,000 – Facebook, Twitter, etc. never did that… I rule this kind of internet; I am still free in my choices. But on the other hand, why is it that the states that are best at governing platforms are the ones I want to live in least? Maybe without the platforms no one would use the internet, so we’d have one problem fewer… If we as academics think about platforms in these mythic ways, maybe we end up governing in a way that is more controlled and has undesirable effects.
A5 – Kate) Many questions there; I’ll address two of them. On the census I’d refer you to articles on this. A University of Cambridge study showed huge accuracy in determining marital status, sexuality and whether someone is a drug or alcohol user based on Facebook likes alone… You may feel free, but those data patterns are being built. And we have to move beyond thinking that only active participation contributes to these platforms…
A5 – Fieke) The census issue you brought up is interesting… In the UK, US and Australia the census is conducted under contract by one of the world’s biggest arms manufacturers… You don’t give data to the Big Five… But… So, we do need to question the politics behind our actions… There is also a perception that having technical skills makes you superior to those without, and if we go down that route we create a whole new class system, and that raises whole new questions.
Q6) The question of internet raises issues of boundaries, and how we do governance and work of governance and rule-making. Ideally when we do that governance and rule-making there are values behind that… So what are the values that you think need to underlie those structures and systems…
A6 – Carolyn) I think values that do not discriminate people through algorithmic processing, AI, etc. Those tools should allow people to not be discriminated on the basis of things they have done in the past… But that requires understanding of how that discrimination is taking place now…
A6 – Kate) I love that question… All of these layers of control come with values baked in, we just don’t know what they are… I would be interested to see what values drop out of those systems, that don’t fit the easy metricisation of our world. Some great things to fall out of feminist and race theory and values from that…
A6 – Fieke) I would add that values should not just be about the individual, and should ensure that the collective is also considered…
Cornelius: Thank you for offering a glimmer of hope! Thank you all!
Oct 05 2016
 

If you’ve been following my blog today you will know that I’m in Berlin for the Association of Internet Researchers (AoIR) 2016 conference (#aoir2016), at Humboldt University. As this first day has mainly been about workshops – and I’ve been in a day-long Digital Methods workshop – we have our first conference keynote this evening. And as it looks a bit different to my workshop blog, I thought a new post was in order.

As usual, this is a live blog post so corrections, comments, etc. are all welcomed. This session is also being videoed so you will probably want to refer to that once it becomes available as the authoritative record of the session. 

Keynote: The Platform Society – José van Dijck (University of Amsterdam) with Session Chair: Jennifer Stromer-Galley

We are having an introduction from Wolfgang (?) from Humboldt University, welcoming us and noting that AoIR 2016 has made the front page of a Berlin newspaper today! He also notes the hunger for internet governance information and understanding from the German government and from Europe.

Wolfgang: The theme of “Internet Rules!” provides lots of opportunities for keynotes, discussions, etc. and it allows us to connect the ideas of internet and society without deterministic structures. I will now hand over to the session chair Cornelius Puschmann.

Cornelius: It falls to me to do the logistical stuff… But first: we have 570 people registered for AoIR 2016, so we have a really big conference. And now the boring details… which I won’t blog in detail here, other than to note the hashtag list:

  • Official: #aoir2016
  • Rebel: #aoir16
  • Retro: #ir17
  • Tim Highfield: #itistheseventeenthassociationofinternetresearchersconferenceanditishappeningin2016

And with that came a reminder of some of the more experimental parts of the programme to come.

Jennifer: Huge thanks to all of my colleagues here for turning this crazy idea into this huge event with a record number of attendees! Thank you to Cornelius, our programme chair.

Now to introduce our speaker… José van Dijck is professor at the University of Amsterdam and has held visiting posts across the world. She is the first woman to hold the Presidency of the Royal Netherlands Academy of Arts and Sciences. Her most recent book is The Culture of Connectivity: A Critical History of Social Media. It takes a critical look back at social media and social networking, not only as social spaces but as business spaces. And her lecture tonight will give a preview of her forthcoming work on public values in a platform society.

José: It is lovely to be here, particularly on this rather strange day… I became President of the Royal Academy this year and today my colleague won the Nobel Prize in Chemistry – so instead of preparing for my keynote today I was dealing with press inquiries. It is nice to focus back on my real job…

So a few years ago Thomas Poell wrote an article on the politics of social platforms. His work on platforms inspired my thinking on networked platforms as interwoven into an ecology, economically and socially. When I wrote that book the last chapter was on platforms, many of which have now become the main players… I talked about Google (now Alphabet), Facebook, Amazon, Microsoft, LinkedIn (now owned by Microsoft), Apple… And since then we’ve seen other players coming in and creating change – like Uber, AirBnB, Coursera. These platforms have become the gateways to our social life… And they have consolidated and expanded…

So a platform is an online site that deploys automated technologies and business models to organise data streams, economic interactions and social exchanges between users of the internet. That’s the core of the social theory I am using. Platforms ARE NOT simple facilitators, and they are not stand-alone systems – they are interconnected.

And a Platform Ecosystem is an assemblage of networked platforms, governed by its own dynamics and operating on a set of mechanisms…

Now a couple of years ago Thomas and I wrote about platform mechanisms and the very important idea of “datafication”. Commodification means that a platform’s business model and governance define the way in which datafied information is transformed into (economic, societal) value. There are many business models and many governance models – they vary, and governance models are maybe more important than business models, but they can be hard to pin down. Selection is about data flows filtered by algorithms and bots, allowing for automated selection such as personalisation, rankings and reputation. Those mechanisms are not visible right now, and we need to make them explicit so that we can talk about them and their implications. Can we hold Facebook accountable for the Newsfeed in the way that traditional media are accountable? That’s an important question for us to consider…

The platform ecosystem is not a level playing field. Platforms gain traction not through money but through the number of users, and network effects mean that user numbers are how we understand the size of a network. There is platformisation (thanks Anna?) across sectors… That power is gained through cross-ownership and cross-platform working, but also through their architecture and shared platforms. In our book we’ll cover both private and public sectors and how they are penetrated by platform ecosystems. We used to have big oil companies, or big manufacturing companies… But now big companies operate across sectors.

So transport for instance… Uber is huge, partly financed by Google and also in competition with Google. If we look at News as a sector we have Huffington Post, Buzzfeed, etc. they are also used as content distribution and aggregators for Google, Facebook, etc.

In health – a sector where platformisation is proliferating fast – we see fitness and health apps, with Google and Apple major players here. And in your neighbourhood there are apps available; some of these are global apps localised to your neighbourhood, sitting alongside massive players.

In Education we’ve seen the rise of Massive Online Open Courses, with Microsoft and Google investing heavily alongside players like EdX, Coursera, Udacity, FutureLearn, etc.

All of the sectors are undergoing platformisation… And if you look across them all – across all areas of private and public life – the activity revolves around the big five: Google, Facebook, Apple, Amazon and Microsoft, with LinkedIn and Twitter also important. And take, for example, AirBnB…

A platform society is a society in which social, economic and interpersonal traffic is largely channelled by an (overwhelmingly corporate) global online platform ecosystem that is driven by algorithms and fuelled by data. That’s not a revolution; it’s something we are part of and see every day.

Now we had the promise of “participatory culture” and the euphoria of web 2.0, the idea of individuals contributing. More recently that idea has shifted to the “sharing economy”… But sharing has shifted in its meaning too. It is now about sharing resources or services for some sort of fee – a transaction-based idea. And from 2015 we see awareness of the negative sides of the sharing economy. So a February 2015 Time cover read: “Strangers crashed my car, ate my food and wore my pants. Tales from the sharing economy” – about the personal discomfort of the downsides. And we see Technology Quarterly writing about “When it’s not so good to share” – from the perspective of securing the property we share. But there is more at stake than personal discomfort…

We have started to see disruptive protest against private platforms, like posters against AirBnB. City councils have to hire more inspectors to regulate AirBnB hosts for safety reasons – a huge debate in Amsterdam now, with public values changing as a consequence of so many AirBnB hosts in the city. And there are more protests about changing values… saying people are citizens not entrepreneurs, that the city is not for sale…

In another sector we see Uber protests, by various stakeholders. We see these from licensed taxi drivers, accusing Uber of undermining safety and social values; but also protests by drivers. Uber do not call themselves a “transportation” company, instead calling themselves a connectivity company. And Uber drivers have complained that Uber doesn’t pay insurance or pensions…

So, AirBnB and Uber are changing public values, they haven’t anchored existing values in their own design and development. There are platform promises and paradoxes here… They offer personalised services whilst contributing to the public good… The idea is that they are better at providing services than existing players. They promote community and connectedness whilst bypassing cumbersome institutions – based on the idea that we can do without big government or institutions, and without those values. These platforms also emphasize public values, whilst obscuring private gain. These are promises claiming that they are in the public interest… But that’s a paradox with hidden private gains.

And so how do we anchor collective, public values in a platform society, and how do we govern this? ? has the idea of governance of platforms as opposed to governance by platforms. Our governments are mainly concerned with governing platforms – regulation, privacy, etc. – and that is appropriate, but there are public values like fairness, accuracy, safety, privacy, transparency and democracy… Those values are increasingly being governed by platforms, and that governance is hidden from us in algorithms and design decisions…

Who rules the platform society? Who are the stakeholders here? There are many platform societies of course, but who can be held accountable? Well, it is an intense ideological battleground… with private stakeholders like (global) corporations, businesses, (micro-)entrepreneurs, consumer groups and consumers; and public stakeholders like citizens, co-ops and collectives, NGOs, public institutions, governments and supra-national bodies… And matching those needs up is never really going to happen…

Who uses health apps here? (Many do.) In 2015 there were 165,000 health apps in the Google Play store. Most of them promise personalised health and, whilst that is in the future, they track data… They take data straight from the individual to companies, bypassing other actors and health providers… They manage a wide variety of data flows (patients, doctors, companies). There is a variety of business models, most of them particularly unclear. There is a site called “Patients Like Me” which says that it is “not just for profit” – so it is for profit, but not just for profit… Data has become currency in our health economy. And private gain is hiding behind the public good argument. A few months ago in Holland we started to have insurance discounts (5%) if you send in your FitBit scores… But I think the next step will be paying more if you do not send your scores… That’s how public values change…

Finally we have regulation – government should be regulating security, safety, accuracy and privacy. It takes the Dutch FDA six months to check the safety and accuracy of one app – and if it is updated, you have to start again! In the US, the Department of Health and Human Services, the Office of the National Coordinator for Health Information Technology (ONC), the Office for Civil Rights (OCR) and the Food and Drug Administration (FDA) released a guide called “Developing a mobile health app?” providing guidance on which federal laws need to be followed. And we see not just insurers using apps, but insurers and healthcare providers having to buy data services from providers, changing the impact of these apps. You have things like 23andMe, and those are global – which raises global regulation issues, so it is hard to govern around that. Our platform ecosystem is transnational, but governments are national. We also see platforms coming from technology companies – Philips was building physical kit, MRI machines, but it now models itself as a data company. What you see here is that the big five internet and technology players are also big players in this field – Google Health, 23andMe (financed by Sergey Brin, run by his ex-wife), Apple HealthKit, etc. And even then you have small independent apps like mPower, but they are distributed via the app stores, led by big players, and again hard to govern.

 

We used to build trust in society through institutions and institutional norms and codes, which were subject to democratic controls. But these are increasingly bypassed… That may be subtle but it is going uncontrolled. So, how can we build trust in a platformed world? Well, we have to understand who rules the platform ecosystem, by understanding how it is governed. And when you look at this globally you see competing ideological hemispheres… You see the US model of commercial values, and those are literally imposed on others. And you have Yandex and the Chinese model, and that’s an interesting model…

I think coming back to my main question: what do we do here to help? We can make visible how this platformised society works… So I did a presentation a few weeks ago and shared recommendations there for users:

  • Require transparency in platforms
  • Do not trade convenience for public values
  • Be vigilant, be informed

But can you expect individuals to understand how each app works and what its implications are? I think governments have a key role in protecting citizens’ rights here.

In terms of owners and developers my recommendations are:

  • Put long-term trust over short-term gain
  • Be transparent about data flows, business models, and governance structure
  • Help encode public values in platform architecture (e.g. privacy by design)

A few weeks back the New York Times ran an article on holding algorithms accountable, and I think that that is a useful idea.

I think my biggest recommendations are for governments, and they are:

  • Defend public values and the common good; negotiate public interests with platforms. Government could also, for instance, legislate to manage demands and needs in how platforms work.
  • Upgrade regulatory institutions to deal with the digital constellations we are facing.
  • Develop (inter)national blueprint for a democratic platform society.

And we, as researchers, can help expose and explain the platform society so that it is understood and engaged with in a more knowledgeable way. Governments have a special responsibility to govern the networked society – right now it is a Wild West. We are struggling to resolve these issues, so how can we help govern the platforms that shape society, when the platforms themselves are so enormous and powerful? In Europe we see platforms that are mainly US-based private sector spaces, and they are threatening public sector organisations… It is important to think about how we build trust in that platform society…

Q&A

Q1) You talked about private interests being concealed by public values, but you didn’t talk about private interests of incumbents…

A1) That is important of course. Those protests that I mentioned do raise some of those issues – undercutting prices by not paying for taxi drivers’ insurance, pensions, etc. In Europe those costs can be up to 50% of total costs, so what do we do with those public values, how do we pay for this? We’ll pay for it one way or the other. The incumbents do have their own vested interests… But there are also social values there… If we want to retain those values we need to find a model for that… European economic models have had collective values inscribed in them… If that is outmoded, then fine, but how do we build those values in in other ways…

Q2) I think in my context in Australia at least the Government is in cahoots with private companies, with public-private partnerships and security arms of government heavily benefitting from data collection and surveillance… I think that government regulating these platforms is possible, I’m not sure that they will.

A2) A lot of governments are heavily invested in private industries… I am not anti-company or anti-government… My first goal is to make them aware of how this works… I am always surprised how little governments are aware of what runs underneath the promises and paradoxes… There is reluctance among regulators to work with companies, but there is also exhaustion and a lack of understanding about how to update regulations and processes. How can you update health regulations with 165k health apps out there? I probably am an optimist… But I want to ensure governments are aware of and understand how this is transforming society. There is so much ignorance in the field, and there is naivety about how this will play out. Yes, I’m an optimist. But, no – there is something we can do to shape the direction in which the platform society will develop.

Q3) You have great faith in regulation, but there are real challenges and issues… There are many cases where governments have colluded with industry to inflate the costs of delivery. There is the idea of regulatory capture. Why should we expect regulators to act in public interest when historically they act in the interest of private companies.

A3) It’s not that I put all my trust there… But I’m looking for a dialogue with whoever is involved in this space, in the contested play of where we start… Regulation is one of many actors in this whole contested battlefield. I don’t think we have the answers, but it is our job to explain the underlying mechanisms… And I’m pretty shocked by how little regulators know about the platforms and the underlying mechanisms. Sometimes it’s hard to know where to start… But you have to make a start somewhere…

Oct 05 2016
 

After a few weeks of leave I’m now back and spending most of this week at the Association of Internet Researchers (AoIR) Conference 2016. I’m hugely excited to be here as the programme looks excellent with a really wide range of internet research being presented and discussed. I’ll be liveblogging throughout the week starting with today’s workshops.

This is a liveblog so all corrections, updates, links, etc. are very much welcomed – just leave me a comment, drop me an email or similar to flag them up!

I am booked into the Digital Methods in Internet Research: A Sampling Menu workshop, although I may be switching session at lunchtime to attend the Internet rules… for Higher Education workshop this afternoon.

The Digital Methods workshop is being chaired by Patrik Wikstrom (Digital Media Research Centre, Queensland University of Technology, Australia) and the speakers are:

  • Erik Borra (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Axel Bruns (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Jean Burgess (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Carolin Gerlitz (University of Siegen, Germany),
  • Anne Helmond (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Ariadna Matamoros Fernandez (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Peta Mitchell (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Richard Rogers (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Fernando N. van der Vlist (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Esther Weltevrede (Digital Methods Initiative, University of Amsterdam, the Netherlands).

I’ll be taking notes throughout but the session materials are also available here: http://tinyurl.com/aoir2016-digmethods/.

Patrik: We are in for a long and exciting day! I won’t introduce all the speakers as we won’t have time!

Conceptual Introduction: Situating Digital Methods (Richard Rogers)

My name is Richard Rogers, I’m professor of new media and digital culture at the University of Amsterdam and I have the pleasure of introducing today’s session. So I’m going to do two things, I’ll be situating digital methods in internet-related research, and then taking you through some digital methods.

I would like to situate digital methods as a third era of internet research… I think all of these eras thrive and overlap but they are differentiated.

  1. Web of Cyberspace (1994-2000): Cyberstudies was an effort to see difference in the internet, the virtual as distinct from the real. I’d situate this largely in the 90’s and the work of Steve Jones and Steve (?).
  2. Web as Virtual Society? (2000-2007) saw the virtual as part of the real. Offline as baseline, with “virtual methods” and work around the digital economy, the digital divide…
  3. Web as societal data (2007-) is about the virtual as an indication of the real. Online as baseline.

Right now we use online data about society and culture to make “grounded” claims.

So, if we look at Allrecipes.com Thanksgiving recipe searches on a map we get some idea of regional preference, or we look at Google data in more depth, we get this idea of internet data as grounding for understanding culture, society, tastes.

So, we had this turn around 2008 to “web as data” as a concept. When this idea was first introduced not everyone was comfortable with the concept. Mike Thelwall et al (2005) talked about the importance of grounding data from the internet. So, for instance, Google Flu Trends can be compared to Wikipedia traffic, etc. And with these trends we also get the idea of “the internet knows first”, with the web predicting other sources of data.

Now I do want to talk about digital methods in the context of digital humanities data and methods. Lev Manovich talks about Cultural Analytics. It is concerned with digitised cultural materials, with materials clusterable in a sort of art-historical way – by hue, style, etc. This is a sort of big data approach that substitutes “continuous change” for periodisation and categorisation. So this approach can, for instance, be applied to Instagram (Selfiexploration), looking at mood, aesthetics, etc. And then we have Culturomics, mainly through the Google Ngram Viewer. A lot of linguists use this to understand subtle differences as part of distant reading of large corpora.

And I also want to talk about e-social sciences data and methods. Here we have Webometrics (Thelwall et al), with links as reputational markers. The other tradition here is Altmetrics (Priem et al), which uses online data, including social media data, to do citation analysis.

So, at least initially, the idea behind digital methods was to be in a different space: the study of online digital objects, and also natively online methods – methods developed for the medium. And “natively digital” is meant in a computing sense here: in computing, software has a native mode when it is written for a specific processor, so these are methods specifically created for the digital medium. We also have digitised methods, those which have been imported and migrated, adapted slightly to the online.

Generally speaking there is a sort of protocol for digital methods: Which objects and data are available? (links, tags, timestamps); how do dominant devices handle them? etc.

I will talk about some methods here:

1. Hyperlink

For hyperlink analysis there are several methods. The Issue Crawler software, still running and working, enables you to see links between pages, the direction of linking, aspirational linking… For example, a visualisation of an Armenian NGO shows the dynamics of an issue network, showing the politics of association.
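Note: not from Richard’s talk, but to make the method concrete, here’s a minimal sketch of the kind of inlink counting that underlies issue network analysis. The crawl data (site names and outlinks) is entirely hypothetical:

```python
from collections import Counter

# Hypothetical crawl results: each page mapped to the outlinks found on it.
crawl = {
    "ngo-a.org": ["ngo-b.org", "gov-site.org"],
    "ngo-b.org": ["gov-site.org", "ngo-a.org"],
    "blog.example": ["ngo-a.org", "gov-site.org"],
}

def inlink_counts(crawl):
    """Count how often each site is linked to by the others -
    inlinks as a rough reputational marker."""
    counts = Counter()
    for source, outlinks in crawl.items():
        for target in set(outlinks):   # ignore duplicate links on one page
            if target != source:       # ignore self-links
                counts[target] += 1
    return counts

print(inlink_counts(crawl).most_common())
# gov-site.org is linked by all three sources
```

Tools like Issue Crawler do far more (co-link analysis, snowballing, visualisation), but the inlink count is the simplest version of “links as reputational markers”.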

The other method that can be used here takes a list of sensitive sites, crawled using Issue Crawler, and then parses them through an internet censorship service – and variations on this indicate how successful attempts at internet censorship are. We do work on Iran and China, and I should say that we are always quite thoughtful about how we publish these results because of their sensitivity.

2. The website as archived object

We have the Internet Archive and we have individually archived web sites. Both are useful, but researcher use is not terribly significant, so we have been doing work on this. See also a YouTube video called “Google and the politics of tabs” – a technique to create a movie of the evolution of a webpage in the style of timelapse photography. I will be publishing soon about this technique.

But we have also been looking at historical hyperlink analysis – giving you context that you won’t see represented in archives directly. This shows the connections between sites at a previous point in time. We also discovered that the “Ghostery” plugin can be used with archived websites – for trackers and for code. So you can see the evolution and use of trackers on any website or set of websites.
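Note: this wasn’t in the talk, but if you want to work with archived versions of a site yourself, the Internet Archive exposes a CDX API that lists the captures of a URL. Here’s a small sketch that just builds the query URL (no network call made); the parameter choices are my own, so treat them as an illustration rather than a recipe:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(site, year_from, year_to):
    """Build a query against the Internet Archive's CDX API, which lists
    captures of a page - raw material for tracking how a site (and, say,
    its embedded trackers) evolved over time."""
    params = {
        "url": site,
        "output": "json",
        "from": str(year_from),  # timestamps are (prefixes of) YYYYMMDDhhmmss
        "to": str(year_to),
        "collapse": "digest",    # skip captures identical to the previous one
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

print(cdx_query_url("example.org", 2006, 2010))
```

Fetching that URL returns one row per capture (timestamp, original URL, digest, etc.), which you can then pair with tracker detection on each snapshot.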

6. Wikipedia as cultural reference

Note: the numbering is from a headline list of 10, hence the odd numbering… 

We have been looking at the evolution of Wikipedia pages, understanding how they change. It seems that pages shift from neutral to national points of view… So we looked at Srebrenica and how that is represented. The pages here have different names, indicating differences in the politics of memory and reconciliation. We have developed a triangulation tool that grabs links and references and compares them across different pages. We also developed comparative image analysis that lets you see which images are shared across articles.
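Note: a rough sketch of the comparison step behind that kind of triangulation – given reference lists extracted from different language versions of an article, split them into shared and version-specific references. The reference data here is made up for illustration:

```python
# Hypothetical reference lists extracted from three language versions.
refs = {
    "en": {"icty.org/judgement", "un.org/report-1999", "bbc.co.uk/article"},
    "nl": {"un.org/report-1999", "nrc.nl/artikel"},
    "sr": {"rts.rs/clanak", "un.org/report-1999"},
}

def shared_and_unique(refs):
    """Split references into those cited by every language version
    and those unique to a single one."""
    shared = set.intersection(*refs.values())
    unique = {
        lang: {r for r in rs
               if not any(r in other
                          for l, other in refs.items() if l != lang)}
        for lang, rs in refs.items()
    }
    return shared, unique

shared, unique = shared_and_unique(refs)
print(shared)  # only the UN report appears in all three versions
```

The divergence between the `unique` sets is exactly the kind of signal that points at different “politics of memory” across language versions.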

7. Facebook and other social networking sites

Facebook is, as you probably well know, a social media platform that is relatively difficult to pin down at a moment in time. Trying to pin down the history of Facebook is very hard – it hasn’t been in the Internet Archive for four years, and the site changes all the time. We have developed two approaches: one uses social media profiles and interest data as a means of studying cultural taste and political preference, or “postdemographics”; and “networked content analysis” uses social media activity data as a means of studying the “most engaged with content” – that helps with the fact that profiles are no longer available via the API. To some extent the API drives the research, but taking a digital methods approach we work with the medium and find which possibilities are there for research.

So, one of the projects undertaken in this space was elFriendo, a MySpace-based project which looked at the cultural tastes of “friends” of Obama and McCain during their presidential race. For instance, Obama’s friends best liked Lost and The Daily Show on TV; McCain’s liked Desperate Housewives, America’s Next Top Model, etc. Very different cultures and interests.

Now the networked content analysis approach, where you quantify and then analyse, works well with Facebook. You can look at pages and use data from the API to understand the pages and groups that liked each other, to compare memberships of groups, etc. (at the time you were able to do this). In this process you could see specific administrator names, and we did this with right-wing data, working with a group called Hope not Hate, who recognised many of the names that emerged. Looking at the most liked content from groups you also see shared values, cultural issues, etc.
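Note: to illustrate the “pages that liked each other” step, here’s a minimal sketch that finds reciprocal page-like relationships in a small, made-up dataset (when the API still exposed this, each page’s likes could be fetched and fed into something like this):

```python
# Hypothetical page-like data: which pages "like" which other pages.
likes = {
    "PageA": {"PageB", "PageC"},
    "PageB": {"PageA"},
    "PageC": {"PageA", "PageB"},
}

def mutual_likes(likes):
    """Find reciprocal page-like relationships - the basis for mapping
    clusters of pages that signal shared affiliation."""
    pairs = set()
    for page, liked in likes.items():
        for other in liked:
            if page in likes.get(other, set()):   # is the like reciprocated?
                pairs.add(frozenset((page, other)))
    return {tuple(sorted(p)) for p in pairs}

print(mutual_likes(likes))
```

Reciprocal likes are a stronger tie signal than one-way likes, which is why they work as the edges of a page network.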

So, you could see two eras of Facebook studies: Facebook I (2006-2011), about presentation of self – profiles and interests studies (with ethics); and Facebook II (2011-), which is more about social movements. I think many social media platforms are following this shift – or would like to. So in Instagram studies, Instagram I (2010-2014) was about selfie culture, but has shifted to Instagram II (2014-), concerned with antagonistic hashtag use for instance.

Twitter has done this and gone further… Twitter I (2006-2009) was about Twitter as an urban lifestyle tool (its origins) and “banal” lunch tweets – its own tagline of “what are you doing?”, a connectivist space; Twitter II (2009-2012) moved to elections, disasters and revolutions – the tagline is “what’s happening?” and we have metrics like “trending topics”; Twitter III (2012-) sees Twitter as a generic resource tool, with commodification of data, stock market predictions, elections, etc.

So, I want to finish by talking about work on Twitter as a storytelling machine for remote event analysis. This is an approach we developed some years ago around the 2009 Iran election crisis. We made a tweet collection around a single Twitter hashtag – which is no longer done – then ordered the tweets by most retweeted (top 3 for each day) and presented them in chronological (not reverse) order. And we then showed those in huge displays around the world…

To take you back to June 2009… Mousavi holds an emergency press conference. Voter turnout is 80%. SMS is down. Mousavi’s website and Facebook are blocked. Police use pepper spray… The first 20 days of most popular tweets make a good succinct summary of the events.

So, I’ve taken you on a whistle stop tour of methods. I don’t know if we are coming to the end of this. I was having a conversation the other day about how the Web 2.0 days are over really – the idea that the web is readily accessible, that APIs and data are there to be scraped… That’s really changing. This is one of the reasons the app space is so hard to research. We are moving again to user studies to an extent. What the Chinese researchers are doing involves convoluted processes to get at the data, for instance. But there are so many areas of research that can still be done. Issue Crawler is still out there and other tools are available at tools.digitalmethods.net.

Twitter studies with DMI-TCAT (Fernando van der Vlist and Emile den Tex)

Fernando: I’m going to be talking about how we can use the DMI-TCAT tool to do Twitter Studies. I am here with Emile den Tex, one of the original developers of this tool, alongside Eric Borra.

So, what is DMI-TCAT? It is the Digital Methods Initiative Twitter Capture and Analysis Toolset, a server-side tool which aims at robust and reproducible data capture and analysis. The design is based on two ideas: that captured datasets can be refined in different ways; and that the datasets can be analysed in different ways. Although we developed this tool, it is also in use elsewhere, particularly in the US and Australia.

So, how do we actually capture Twitter data? Some of you will have some experience of trying to do this. As researchers we don’t just want the data, we also want to look at the platform in itself. If you are in industry you get Twitter data through a “data partner”, the biggest of which by far is GNIP – owned by Twitter as of the last two years – then you just pay for it. But it is pricey. If you are a researcher you can go to an academic data partner – DiscoverText or Hexagon – and they are also resellers but they are less costly. And then the third route is the publicly available data – REST APIs, Search API, Streaming APIs. These are, to an extent, the authentic user perspective as most people use these… We have built around these but the available data and APIs shape and constrain the design and the data.

For instance the “Search API” prioritises “relevance” over “completeness” – but as academics we don’t know how “relevance” is being defined here. If you want to do representative research then completeness may be most important. If you want to look at how Twitter prioritises the data, then that Search API may be most relevant. You also have to understand rate limits… This can constrain research, as different data has different rate limits.

So there are many layers of technical mediation here, across three big actors: Twitter platform – and the APIs and technical data interfaces; DMI-TCAT (extraction); Output types. And those APIs and technical data interfaces are significant mediators here, and important to understand their implications in our work as researchers.

So, onto the DMI-TCAT tool itself – more on this in Borra & Reider (2014) (doi:10.1108/AJIM-09-2013-0094). They talk about “programmed method” and the idea of the methodological implications of the technical architecture.

What can one learn if one looks at Twitter through this “programmed method”? Well: (1) Twitter users can change their Twitter handle, but their ids will remain identical – sounds basic but it’s important to understand when collecting data; (2) the length of a Tweet may vary beyond the maximum of 140 characters (mentions and urls); (3) native retweets may have their top-level text property shortened; (4) there are unexpected limitations – support for new emoji characters can be problematic; (5) it is possible to retrieve a deleted tweet.

So, for example, a tweet can vary beyond 140 characters. The Retweet of an original post may be abbreviated… Now we don’t want that, we want it to look as it would to a user. So, we capture it in our tool in the non-truncated version.
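This un-truncation step can be sketched in Python, assuming tweets arrive as dicts in the shape of Twitter’s REST API payloads – a native retweet carries the original status under `retweeted_status`, whose `text` is not truncated. The sample tweet below is made up for illustration.

```python
def full_text(tweet):
    """Return the non-truncated text of a tweet.

    Native retweets carry the original status under 'retweeted_status',
    so we rebuild "RT @user: ..." from the original's full text.
    """
    rt = tweet.get("retweeted_status")
    if rt is not None:
        return "RT @{}: {}".format(rt["user"]["screen_name"], rt["text"])
    return tweet["text"]

# A truncated native retweet as it might arrive from the API (made up):
tweet = {
    "text": "RT @example: This long status has been cut off at the 140-char lim\u2026",
    "retweeted_status": {
        "user": {"screen_name": "example"},
        "text": "This long status has been cut off at the 140-char limit, but the original is intact.",
    },
}
print(full_text(tweet))
```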

And there is the issue of deletion and withholding. There are tweets deleted by users, and there are tweets which are withheld by the platform – withholding is a country-by-country issue, so you can see tweets that are only available in some countries. A project that uses this information is “Politwoops” (http://politwoops.sunlightfoundation.com/), which captures tweets deleted by US politicians and lets you filter to specific states, party, position. Now there is an ethical discussion to be had here… We don’t know why tweets are deleted… We could at least talk about it.

So, the tool captures Twitter data in two ways. Firstly there are the direct capture capabilities (via the web front-end), which allow tracking of users and capture of public tweets posted by these users; tracking of particular terms or keywords, including hashtags; and getting a small random sample (approx. 1%) of all public statuses. Secondary capture capabilities (via scripts) allow further exploration, including user ids, deleted tweets etc.

Twitter as a platform has a very formalised idea of sociality, the types of connections, parameters, etc. When we use the term “user” we mean it in the platform defined object meaning of the word.

Secondary analytical capabilities, via script, also allows further work:

  1. support for geographical polygons to delineate geographical regions for tracking particular terms or keywords, including hashtags.
  2. Built-in URL expander, following shortened URLs to their destination. Allowing further analysis, including of which statuses are pointing to the same URLs.
  3. Download media (e.g. videos and images) attached to particular Tweets.
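The URL expander idea (item 2 above) can be sketched as redirect-following. In this sketch the resolver is injected as a function so the logic runs without live HTTP; with the `requests` library one could instead use `requests.head(url, allow_redirects=True).url`. The shortener table is a hypothetical stand-in for real Location headers.

```python
def expand_url(url, resolve, max_hops=10):
    """Follow shortened URLs hop by hop; `resolve(url)` returns the
    redirect target, or None when `url` is already final."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:          # guard against redirect loops
            break
        seen.add(url)
        target = resolve(url)
        if target is None:
            break
        url = target
    return url

# Hypothetical shortener table standing in for live redirects:
redirects = {"https://t.co/abc": "https://bit.ly/xyz",
             "https://bit.ly/xyz": "https://example.org/article"}
print(expand_url("https://t.co/abc", redirects.get))
```

Once URLs are expanded like this, statuses pointing to the same destination can be grouped, as the tool does.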

So, we have this tool but what sort of studies might we do with Twitter? Some ideas to get you thinking:

  1. Hashtag analysis – users, devices etc. Why? They are often embedded in social issues.
  2. Mentions analysis – users mentioned in contexts, associations, etc. allowing you to e.g. identify expertise.
  3. Retweet analysis – most retweeted per day.
  4. URL analysis – the content that is most referenced.

So Emile will now go through the tool and how you’d use it in this way…

Emile: I’m going to walk through some main features of the DMI TCAT tool. We are going to use a demo site (http://tcatdemo.emiledentex.nl/analysis/) and look at some Trump tweets…

Note: I won’t blog everything here as it is a walkthrough, but we are playing with timestamps (the tool uses UTC), search terms etc. We are exploring hashtag frequency… In that list you can see Benghazi, tpp, etc. Now, once you see a common hashtag, you can go back and query the dataset again for that hashtag/search terms… And you can filter down… And look at “identical tweets” to find the most retweeted content.

Emile: Eric called this a list making tool – it sounds dull but it is so useful… And you can then put the data through other tools. You can put tweets into Gephi. Or you can do exploration… We looked at Getty Parks project, scraped images, reverse Google image searched those images to find the originals, checked the metadata for the camera used, and investigated whether the cost of a camera was related to the success in distributing an image…

Richard: It was a critique of user generated content.

Analysing Social Media Data with TCAT and Tableau (Axel Bruns)

My talk should be a good follow on from the previous presentation as I’ll be looking at what you can do with TCAT data outside and beyond the tool. Before I start I should say that both Amsterdam and QUT are holding summer schools – and we have different summers! – so do have a look at those.

You’ve already heard about TCAT so I won’t talk more about that except to talk about the parts of TCAT I have been using.

TCAT Data Export allows you to export all tweets from a selection – containing all of the tweets and information about them. You can also export a table of hashtags – tweet ids from your selection and hashtags; and mentions – tweet ids from your selection with mentions and mention type. You can export other things as well – known users (politicians, celebrities, etc); URLs; etc. And the structure that emerges is the main TCAT export file (“full export”) with associated hashtags, mentions, and any other additional data. If you are familiar with SQL you are essentially joining databases here. If not then that’s fine, Tableau does this for you.
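The same join logic can be sketched with pandas (column names here are illustrative, not TCAT’s exact headers): the full export is the main table, and the hashtags table is attached with a left join on tweet id. Note how the join duplicates rows per hashtag – which is why counting distinct tweet ids matters.

```python
import pandas as pd

# Toy stand-ins for the TCAT full export and hashtags export:
tweets = pd.DataFrame({"id": [1, 2], "text": ["a #x #y tweet", "plain tweet"]})
hashtags = pd.DataFrame({"id": [1, 1], "hashtag": ["x", "y"]})

# Left join: keep every tweet, attach matching hashtag rows.
joined = tweets.merge(hashtags, on="id", how="left")
print(len(joined))              # 3 rows: tweet 1 is duplicated, once per hashtag
print(joined["id"].nunique())   # 2 distinct tweets, like Tableau's COUNTD
```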

In terms of processing the data there are a number of tools here. Excel just isn’t good enough at scale – limited to around a million rows, and that Trump dataset was 2.8M already. So a tool that I and many others have been working with is Tableau. It copes with scale; it’s a user-friendly, intuitive, all-purpose data analytics tool, but the downside is that it is not free (unless you are a student or are using it in teaching). Alongside that, for network visualisation, Gephi is the main tool at the moment. That’s open source and free, and a new version came out in December.

So, into Tableau and an idea of what we can do with the data… Tableau enables you to work with data sources of any form – databases, spreadsheets, etc. So I have connected the full export I’ve got from TCAT… I have linked the main file to the hashtag and mention files. Then I have also generated an additional file that expands the URLs in that data source (you can now do this in TCAT too). This is a left join – one main table that other tables are connected to. I’ve connected based on (tweet) id. And the dataset I’m showing here is from the Paris 2015 UN Climate Change conference. And all the steps I’m going through today are in a PDF guidebook that is available in that session resources link (http://tinyurl.com/aoir2016-digmethods/).

Tableau then tries to make sense of the data… Dimensions are the datasets which have been brought in, clicking on those reveals columns in the data, and then you see Measures – countable features in the data. Tableau makes sense of the file itself, although it won’t always guess correctly.

Now, we’ve joined the data here so that can mean we get repetition… If a tweet has 6 hashtags, it might seem to be 6 tweets. So I’m going to use the unique tweet ids as a measure. And I’ll also right click to ensure this is a distinct count.

Having done that I can begin to visualise my data and see a count of tweets in my dataset… And I can see when they were created – using Created at but also then finessing that to Hour (rather than default of Year). Now when I look at that dataset I see a peak at 10pm… That seems unlikely… And it’s because TCAT is running on Brisbane time, so I need to shift to CET time as these tweets were concerned with events in Paris. So I create a new Formula called CET, and I’ll set it to be “DateAdd (‘hour’, -9, [Created at])” – which simply allows us to take 9 hours off the time to bring it to the correct timezone. Having done that the spike is 3.40pm, and that makes a lot more sense!
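The same correction can be sketched in Python: Brisbane is UTC+10 and CET is UTC+1, so subtracting 9 hours mirrors the Tableau formula DATEADD('hour', -9, [Created at]). The timestamp below is illustrative, not from the dataset.

```python
from datetime import datetime, timedelta

created_at = datetime(2015, 11, 30, 22, 0)   # a tweet at 10 pm Brisbane time
cet = created_at - timedelta(hours=9)        # shift UTC+10 -> UTC+1
print(cet.strftime("%H:%M"))                 # 13:00
```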

Having generated that graph I can click on, say, the peak activity and see the number of tweets and the tweets that appeared. You can see some spam there – of course – but also widely retweeted tweet from the White House, tweets showing that Twitter has created a new emoji for the summit, a tweet from the Space Station. This gives you a first quick visual inspection of what is taking place… And you can also identify moments to drill down to in further depth.

I might want to compare Twitter activity with the number of participating users, comparing the counts of unique users (synchronising axes for scale). Doing that we do see that there are more tweets when more users are active… But there is also a spike that is independent of that. And that spike seems to be generated by Twitter users tweeting more – around something significant perhaps – something that triggers attention and activity.

So, this tool enables quantitative data analysis as a starting point or related route into qualitative analysis, the approaches are really inter-related. Quickly assessing this data enables more investigation and exploration.

Now I’m going to look at hashtags, seeing the volume against activity. By default the hashtags are ordered alphabetically, but that isn’t that useful, so I’m going to reorder by use. When I do that you can see that COP21 – the official hashtag – is by far the most popular. These tweets were generated from that hashtag but also from several search terms for the conference – official abbreviations for the event. And indeed some tweets have “Null” hashtags – no hashtags, just the search terms. You also see variance in spelling and capitalisation. Unlike Twitter, Tableau is case sensitive, so I need to use a formula to resolve this – combining terms into one hashtag. A quick way to do that is to use “LOWER([Hashtag])”, which converts all data in the hashtag field to lower case. That clustering shows COP21 as an even bigger hashtag, but also identifies other popular terms. We do see spikes in a given hashtag – often very brief – and these are often related to one very popular, heavily retweeted tweet. So, e.g. a prominent actor/figure has tweeted – e.g. in this data set Cara Delevingne (a British supermodel) triggers a short sharp spike in tweets/retweets.

And we can see these hashtags here, their relative popularity. But remember that my dataset is just based on what I asked TCAT to collect… TCOT might be a really big hashtag, but maybe its users don’t usually mention my search terms, hence it being smaller in my data set. So, don’t be fooled into assuming some of the hashtags are small/low use just because they are not prominent in a collected dataset.

Turning now to Mentions… We can see several Mention Types: original/null (no mentions); mentions; retweet. You also see that mentions and retweets spikes at particular moments – tweets going viral, key figures getting involved in the event or the tweeting, it all gives you a sense of the choreography of the event…

So, we can now look at who is being mentioned. I’m going to take all Twitter users in my dataset… I’ll see how many tweets mention them. I have a huge Null group here – no mentions – so I’ll start by removing that. Among the most mentioned accounts we see COP21 being the biggest, and others such as Narendra Modi (chair of event?), POTUS, UNFCCC, Francois Hollande, the UN, Mashi Rafael, COP21en – the English language event account; EPN (Enrique Peña Nieto); Justin Trudeau; StationCDRKelly; C Figueres; India4Climate; Barack Obama’s personal account, etc. And I can also see what kind of mention they get. You see that POTUS gets mentions but no retweets, whilst Barack Obama has a few retweets but mainly mentions. That doesn’t mean he doesn’t get retweets, just not in this dataset/search terms. By contrast Station Commander Kelly gets almost exclusively retweets… The balance of mentions, how people are mentioned, what gets retweeted etc… That is all a starting point for closer reading and qualitative analysis.

And now I want to look at who tweets the most… And you’ll see that there is very little overlap between the people who tweet the most, and the people who are mentioned and retweeted. The one account there that appears in both is COP21 – the event itself. Now some of the most active users are spammers and bots… But others will be obsessive, super-active users… Further analysis lets you dig further. Having looked at this list, I can look at what sort of tweets these users are sending… And that may look a bit different… This uses the Mention type and it may be that one tweet mentions multiple users, so get counted multiple times… So, for instance, DiploMix puts out 372 tweets… But when re-looked at for mentions and retweets we see a count of 636. That’s an issue you have to get your head around a bit… And the same issue occurs with hashtags. Looking at the types of tweets put out show some who post only or mainly original tweets, some who do mention others, some only or mainly retweet – perhaps bots or automated accounts. For instance DiploMix retweets diplomats and politicians. RelaxinParis is a bot retweeting everything on Paris – not useful for analysis, but part of lived experience of Twitter of course.
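This per-user breakdown by mention type can be sketched as a simple tally – toy rows below, with field names that only approximate TCAT’s mentions export:

```python
from collections import Counter

# Toy mentions-table rows (illustrative field names):
mentions = [
    {"mentioned": "POTUS", "type": "mention"},
    {"mentioned": "POTUS", "type": "mention"},
    {"mentioned": "StationCDRKelly", "type": "retweet"},
    {"mentioned": "StationCDRKelly", "type": "retweet"},
    {"mentioned": "StationCDRKelly", "type": "mention"},
]

# Tally (account, mention type) pairs, as in the Tableau view:
breakdown = Counter((m["mentioned"], m["type"]) for m in mentions)
print(breakdown[("POTUS", "mention")])             # 2
print(breakdown[("StationCDRKelly", "retweet")])   # 2
```

Because one tweet can mention several users, summing these counts can exceed the number of tweets – the same double-counting caveat noted above.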

So, I have lots of views of data, and sheets saved here. You can export tables and graphs for publications too, which is very helpful.

I’m going to finish by looking at URLs mentioned… I’ve expanded these myself, and I’ve got the domain/path as well as the domain captured. I remove the NULL group here. And the most popular linked to domain is Twitter – I’m going to combine http and https versions in Tableau – but Youtube, UN, Leader of Iran, etc. are most popular. If I dig further into the Twitter domains, looking at Path, I can see whose accounts/profiles etc. are most linked to. If I dig into Station Commander Kelly you see that the most shared of these URLs are images… And we can look at that… And that’s a tweet we had already seen all day – a very widely shared image of a view of earth.

My time is up but I’m hoping this has been useful… This is the sort of approach I would take – exploring the data, using this as an entry point for more qualitative data analysis.

Analysing Network Dynamics with Agent Based Models (Patrik Wikström)

I will be talking about network dynamics and how we can understand some of the theory of network dynamics. And before I start a reminder that you can access and download all these materials at the URL for the session.

So, what are network dynamics? Well we’ve already seen graphs and visualisations of things that change over time. Network dynamics are very much about things that change and develop over time… So when we look at a corpus of tweets they are not all simultaneous, there is a dimension of time… And we have people responding to each other, to what they see around them, etc. So, how can we understand what goes on? We are interested in human behaviour, social behaviour, the emergence of norms and institutions, information diffusion patterns across multiple networks, etc. And these are complex and related to time, we have to take time into account. We also have to understand how macro level patterns emerge from local interactions between heterogenous agents, and how macro level patterns influence and impact upon those interactions. But this is hard…

It is difficult to capture complexity of such dynamic phenomena with verbal or conceptual models (or with static statistical models). And we can be seduced by big data. So I will be talking about using particular models, agent-based models. But what is that? Well it’s essentially a computer program, or a computer program for each agent… That allows it to be heterogeneous, autonomous and to interact with the environment and with other agents; that means they can interact in a (physical) space or as nodes in a network; and we can allow them to have (limited) perception, memory and cognition, etc. That’s something it is very hard for us to do and imagine with our own human brains when we look at large data sets.

The fundamental goal is to develop a model that represents theoretical constructs, logics and assumptions, and that can replicate the observed real-world behaviour. This is the same kind of approach that we use in most of our work.

So, a simple example…

Let’s assume that we start with some inductive idea. So we want to explain the emergence of the different social media network structures we observe. We might want some macro-level observations of Structure – clusters, path lengths, degree distributions, size; Time – growth, decline, cyclic; Behaviours – contagion, diffusion. So we want to build some kind of model to transfer or take our assumptions of what is going on, and translate that into a computer model…

So, what are our assumptions?

Well, let’s say we think people use different strategies when they decide which accounts to follow, with factors such as familiarity, similarity, activity, popularity, randomness… These may all be different explanations of why I connect with one person rather than another… And let’s also assume that when a user joins Twitter they immediately start following a set of accounts, and once part of the network they add more. And let’s also assume that people are different – that’s really important! People are interested in different things – they have different passions, topics that interest them; some are more active, some are more passive. And that’s something we want to capture.

So, to do this I’m going to use something called NetLogo – which some of you may have already played with – a tool developed maybe 25 years back at Northwestern University. You can download it – or use a limited browser-based version – from: http://ccl.northwestern.edu/netlogo/.

In NetLogo we start with a 3 node network… I initialise the network and get three new nodes. Then I can add a new node… In this model I have a slider for “randomness” – if I set it to less random, it picks existing popular nodes, in the middle it combines popularity with randomness, and at most random it just adds nodes randomly…

So, I can run a simulation with about 200 nodes with randomness set to maximum… You can see how many nodes are present, how many friends the most popular node has, and how many nodes have very few friends (3 being the minimum number of connections in this model). If I now change the formation strategy here to set randomness to zero… then we see the nodes connecting back to the same most popular nodes… A more broadcast-like network. This is a totally different kind of network.
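The formation logic behind this can be sketched in plain Python (parameter names are mine, not NetLogo’s): each new node attaches to three existing nodes, chosen either uniformly at random or proportionally to current degree (preferential attachment), depending on a randomness parameter.

```python
import random

def grow_network(n_nodes, m=3, randomness=0.0, seed=42):
    """Grow a network from a seed triangle; each new node links to m
    existing nodes, picked at random (randomness=1.0) or proportionally
    to degree, i.e. preferential attachment (randomness=0.0)."""
    rng = random.Random(seed)
    degree = {0: 2, 1: 2, 2: 2}            # seed triangle
    edges = [(0, 1), (0, 2), (1, 2)]
    for new in range(3, n_nodes):
        targets = set()
        while len(targets) < min(m, len(degree)):
            if rng.random() < randomness:
                targets.add(rng.choice(list(degree)))          # random pick
            else:
                # weighted pick: each node appears once per link it has
                pool = [n for n, d in degree.items() for _ in range(d)]
                targets.add(rng.choice(pool))
        for t in targets:
            edges.append((new, t))
            degree[t] += 1
        degree[new] = len(targets)
    return degree

hubs = grow_network(200, randomness=0.0)
flat = grow_network(200, randomness=1.0)
# Preferential attachment typically concentrates links on a few hubs:
print(max(hubs.values()), max(flat.values()))
```

Comparing the maximum degree under the two settings reproduces the broadcast-like vs. flat structures seen in the NetLogo demo.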

Now, another simulation here toggles the size of nodes to represent number of followers… Larger blobs represent really popular nodes… So if I run this in random mode again, you’ll see it looks very different…

So, why am I showing you this? Well, I wanted to show a really simple model. This is maybe 50 lines of code – you could build it in a few hours. The first message is that it is easy to build this kind of model. And even though we have a simple model we have at least 200 agents… We normally work with thousands or at much greater scale, but you can still learn something here. You can see how to replicate the structure of a network. Maybe it is a starting point that requires more data to be added, but it is a place to start and explore. Even though it is a simple model, you can use it to build theory, to guide data collection and so forth.

So, having developed a model you can set up a simulation to run hundreds of times, to analyse with your data analytics tools… So I’ve run my 200-node network through 5000 simulations, comparing randomness against the maximum number of links to a node – helping understand how different formation strategies create different structures. And that’s interesting but it doesn’t take us all the way. So I’d like to show you a different model that takes this a little bit further…

This model is an extension of the previous model – with all the previous assumptions – so you have two formation strategies, but also the other assumptions we were talking about: that I am more inclined to connect with accounts with shared interests. With that we generate a simulation which is perhaps a better representation of the kinds of network we might see. And this accommodates the idea that this network has content, sharing, and other aspects that inform what is going on in the formation of that network. This visualisation looks pretty, but the useful part is the output you can get at an aggregate level… We are looking at the population level, seeing how interactions at local levels influence macro level patterns and behaviours… We can look at in-degree distribution, we can look at out-degree… We can look at local clustering coefficients, longest/shortest path, etc. And my assumptions might be plausible and reasonable…

So you can build models that give a much deeper understanding of real world dynamics… We are building an artificial network BUT you can combine this with real world data – load a real world network structure into the model and look at diffusion within that network, and understand what happens when one node posts something, what impact would that have, what information diffusion would that have…

So I’ve shown you NetLogo to play with these models. If you want to play around, that’s a great first step. It’s easy to get started with and it has been developed for use in educational settings. There is a big community and lots of models to use. And if you download NetLogo you can download that library of models. Pretty soon, however, I think you’ll find it too limited. There are many other tools you can use… But in general you can use any programming language that you want… Repast and Mason are very common tools. And they are based on Java or C++. You can also use an ABM Python module.

In the folder for this session there are some papers that give a good introduction to agent-based modelling… If we think about agent-based modelling and network theory there are some books I would recommend: Namatame & Chen: Agent-Based Modelling and Network Dynamics. For ABM, look at Miller & Page; Gilbert & Troitzsch; Epstein. For network theory, look at Jackson; Watts (& Strogatz); Barabási.

So, three things:

Simplify! – You don’t need millions of agents. A simple model can be more powerful than a realistic one

Iterate! – Start simple and, as needed, build up complexity, add more features, but only if necessary.

Validate? – You can build models in a speculative way to guide research, to inform data collection… You don’t always have to validate that model as it may be a tool for your thinking. But validation is important if you want to be able to replicate and ensure relevance in the real world.

We started talking about data collection, analysis, and how we build theory based on the data we collect. After lunch we will continue with Carolin, Anne and Fernando on Tracking the Trackers. At the end of the day we’ll have a full panel Q&A for any questions.

And we are back after lunch and a little exposure to the Berlin rain!

Tracking the Trackers (Anne Helmond, Carolin Gerlitz, Esther Weltevrede and Fernando van der Vlist)

Carolin: Having talked about tracking users and behaviours this morning, we are going to talk about studying the media themselves, and of tracking the trackers across these platforms. So what are we tracking? Berry (2011) says:

“For every explicit action of a user, there are probably 100+ implicit data points from usage; whether that is a page visit, a scroll etc.”

Whenever a user makes an action on the web, a series of tracking features are enabled – things like cookies, widgets, advertising trackers, analytics, beacons etc. Cookies are small pieces of text that are placed on the user’s computer indicating that they have visited a site before. These are first party trackers and can be accessed by the platforms and webmasters. There are now many third party trackers such as Facebook, Twitter and Google, and many websites now place third party cookies on the devices of users. And there are widgets that enable this functionality with third party trackers – e.g. Disqus.

So we have first party tracker files – text files that remember, e.g. what you put in a shopping cart; third party tracker files used by marketers and data-gathering companies to track your actions across the web; you have beacons; and you have flash cookies.

The purpose of tracking varies, from useful functionality (e.g. the shopping basket example) to the increasingly prevalent profiling of users and behaviours. The increasing use of trackers has resulted in them becoming more visible. There is lots of research looking at the prevalence of tracking across the web, from the Continuum project and the Guardian’s Tracking the Trackers project. One of the most famous plugins that allows you to see the trackers in your own browser is Ghostery – a browser plugin that you can install and that immediately detects different kinds of trackers, widgets, cookies and analytics tracking on the sites that you browse to… It shows these in a pop-up. It allows you to see the trackers and to block them, or selectively block them. You may want to block selectively, as whole parts of websites disappear when you switch off trackers.

Ghostery detects via tracker library/code snippets (regular expressions). It currently detects around 2295 trackers – across many different varieties. The tool is not uncontroversial. It started as an NGO but was bought by analytics company Evidon in 2010, using the data for marketing and advertising.

So, we thought that if we, as researchers, want to look at trackers and there are existing tools, let’s repurpose existing tools. So we did that, creating a Tracker Tracker tool based on Ghostery. It takes up a logic of Digital Methods, working with lists of websites. The Tracker Tracker tool was created by the Digital Methods Initiative (2012). It allows us to detect which trackers are present on lists of websites and create a network view. And we are “repurposing analytical capabilities”. So, what sort of project can we use this with?
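The Ghostery-style detection idea can be sketched as matching a page’s script URLs against a small library of tracker patterns (regular expressions). The two patterns below are illustrative stand-ins written for this sketch, not Ghostery’s actual signatures.

```python
import re

# Illustrative tracker signatures (not Ghostery's real pattern library):
tracker_library = {
    "Google Analytics": re.compile(r"google-analytics\.com/(ga|analytics)\.js"),
    "Facebook Connect": re.compile(r"connect\.facebook\.net"),
}

def detect_trackers(script_urls):
    """Return the sorted names of trackers whose pattern matches any URL."""
    found = set()
    for url in script_urls:
        for name, pattern in tracker_library.items():
            if pattern.search(url):
                found.add(name)
    return sorted(found)

# Script URLs as they might be scraped from one page:
page_scripts = ["https://www.google-analytics.com/analytics.js",
                "https://connect.facebook.net/en_US/sdk.js",
                "https://example.org/site.js"]
print(detect_trackers(page_scripts))   # ['Facebook Connect', 'Google Analytics']
```

Run over a list of sites, the per-site results become the site-to-tracker network that the tool visualises.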

One of our first projects was on the Like Economy. Our starting point was the fact that social media widgets place cookies (Gerlitz and Helmond 2013) – where are they present? These cookies track both platform users and website users. We wanted to see how pervasive these cookies were on the web, and on the most used sites on the web.

We started by using Alexa to identify a collection of the 1000 most-visited websites. We input it into the Tracker Tracker tool (it’s only one button so options are limited!). Then we visualised the results with Gephi. And what did we get? Well, in 2012 only 18% of top websites had Facebook trackers – if we did it again today it would probably be different. This data may be connected to personal user profiles – when a user has been previously logged in and has a profile – but it is also being collected for non-users of Facebook: anonymous profiles are created, and if those people subsequently join Facebook that tracking data can be fed into their account/profile.

Since we did this work we have used this method on other projects. Now I’ll hand over to Anne to do a methods walkthrough.

Anne: Now you’ve had a sense of the method I’m going to do a dangerous walkthrough thing… And then we’ll look at some other projects here.

So, a quick methodological summary:

  1. Research question: type of tracker and sites
  2. Website (URL) collection making: existing expert list.
  3. Input list for Tracker Tracker
  4. Run Tracker Tracker
  5. Analyse in Gephi

So we always start with a research question… Perhaps we start with websites we wouldn’t want to find trackers on – where privacy issues are heightened, e.g. children’s websites, porn websites, etc. So, homework here – work through some research question ideas.

Today we’ll walk through what we will call “adult sites”. So, we will go to Alexa – which is great for locating top sites in categories, in specific countries, etc. We take that list, we put it into Tracker Tracker – choosing whether or not to look at the first level of subpages – and press the button. The tool then scans those websites against the Ghostery database of around 2600 possible trackers.

Carolyn: Maybe some of you are wondering if it’s ok to do this with Ghostery? Well, yes, we developed Tracker Tracker in collaboration with Ghostery when it was an NGO, with one of their developers visiting us in Amsterdam. One other note here: if you use Ghostery on your machine, it may show different trackers to your neighbour’s machine. Trackers vary by machine, by location, by context. That’s something we have to take into account when requesting data. So for news websites you may, for instance, have more and more trackers generated the longer the site is open – this tool only captures a short window of time so may not gather all of the trackers.

Anne: Also in Europe you may encounter so-called cookie walls. You have to press OK to accept cookies… And the tool can’t emulate the user experience of clicking beyond the cookie wall… So zero trackers may indicate that issue, rather than no trackers.

Q: Is it server side or client side?

A: It is server side.

Q: And do you cache the tracker data?

A: Once you run the tool you can save the CSV and Gephi files, but we don’t otherwise cache.

Anne: Ghostery updates very frequently, which makes it most useful to always check against the most up-to-date list of trackers.

So, once we’ve run the Tracker Tracker tool you get outputs that can be used in a variety of flexible formats. We will download the “exhaustive” CSV – which has all of the data we’ve found here.

If I open that CSV (in Excel) we can see the site, the scheme, the pattern that was used to find the tracker, the name of the tracker… This is very detailed information. So for these adult sites we see things like Google Analytics, the Porn Ad network, Facebook Connect. So, already, there is analysis you could do with this data. But you could also do further analysis using Gephi.
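As a sketch of that first, pre-Gephi analysis: the snippet below counts how many sites each tracker appears on from a CSV shaped loosely like the tool’s “exhaustive” export. The column names and rows here are invented for illustration; the real export has more columns (scheme, pattern, tracker type, and so on):

```python
import csv
import io
from collections import Counter

# Stand-in for the "exhaustive" CSV export; these rows are made up.
csv_text = """site,tracker,type
example1.com,Google Analytics,analytics
example1.com,Facebook Connect,widget
example2.com,Google Analytics,analytics
example3.com,DoubleClick,ad
"""

def tracker_frequencies(csv_file):
    """Count on how many rows (site-tracker pairs) each tracker appears."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[row["tracker"]] += 1
    return counts

freqs = tracker_frequencies(io.StringIO(csv_text))
```

Even this crude frequency table answers the first question one might ask of the data (which trackers dominate?) before moving to the network view.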

Now, we have steps of this procedure in the tutorial that goes with today’s session. So here we’ve coloured the sites in grey, and we’ve highlighted the trackers in different colours. The purple lines/nodes are advertising trackers for instance.

If you want to recreate this tracker map at home, you have all the steps here. And doing this work we’ve found trackers we’d never seen before – for instance the porn industry ad network DoublePimp (a play on DoubleClick) – and been able to see regional and geographic differences between trackers, which of course has interesting implications.

So, some more examples… We have taken this approach looking at Jihadi websites, working with e.g. governments to identify the trackers. And found that they are financially dependent on advertising, including SkimLinks, DoubleClick, Google AdSense.

Carolyn: And in almost all networks we encounter DoubleClick, AdSense, etc. And it’s important to know that webmasters enable these trackers, they have picked these services. But there is an issue of who selects you as a client – something journalists collaborating on this work raised with Google.

Anne: The other usage of these trackers has been in historical tracking analysis using the Internet Archive. This enables you to see the website in the context of a techno-commercial configuration, and to analyse it in that context. So for instance looking at New York Times trackers and the website as an ecosystem embedded in the wider context – in this case the number of trackers decreased, but that reflected commercial concentration: companies buying each other, thereby reducing the range of trackers.

Carolyn: We did some work called the Trackers Guide. We wanted to look not only at trackers, but also at Content Delivery Networks, to visualise how websites are not single items, but collections of data with inflows and outflows. The result became part artwork, part biological field guide. We imagined content and trackers as little biological cell-like clumps on the site, creating a whole booklet of this guide. So the image here shows the content from other spaces, content flowing in and connected…

Anne: We were also interested in what kind of data is being collected by these trackers. And also who owns these trackers. And also the countries these trackers are located in. So, we used this method with Ghostery. And then we dug further into those trackers. For Ghostery you can click on a tracker and see what kind of data it collects. We then looked at privacy policies of trackers to see what it claims to collect… And then we manually looked up ownership – and nationality – of the trackers to understand rules, regulations, etc. – and seeing where your data actually ends up.

Carolyn: Working with Ghostery, and repurposing their technology, was helpful but their database is not complete. And it is biased to the English-speaking world – so it is particularly lacking in Chinese contexts for instance. So there are limits here. It is not always clear what data is actually being collected. BUT this work allows us to study invisible participation in data flows – that cannot be found in other ways; to study media concentration and the emergence of specific tracking ecologies. And in doing so it allows us to imagine alternative spatialities of the web – tracker origins and national ecologies. And it provides insights into the invisible infrastructures of the web.

Slides for this presentation: http://www.slideshare.net/cgrltz/aoir-2016-digital-methods-workshop-tracking-the-trackers-66765013

Multiplatform Issue Mapping (Jean Burgess & Ariadna Matamoros Fernandez)

Jean: I’m Jean Burgess and I’m Professor of Digital Media and Director of the DRMC at QUT. Ariadna is one of our excellent PhD students at QUT but she was previously at DMI so she’s a bridge to both organisations. And I wanted to say how lovely it is to have the DRMC and DMI connected like this today.

So we are going to talk about issue mapping, and the idea of using issue mapping to teach digital research methods, particularly with people who may not be interested in social media outside of their specific research area. And about issue mapping as an approach that sits outside the “influencers” narrative that dominates the marketing side of social media.

We are in the room with people who have been working in this space for a long time but I just want to raise that we are making connections to AMT and cultural and social studies. So, a few ontological things… Our approach combines digital methods and controversy analysis. We understand controversies to be discrete, acute, often temporally bounded sites of intersectionality, bringing together different issues in new combinations. And drawing on Latour, Callon etc. we see controversies as generative. They can reveal the dynamics of issues, bring them together in new combinations, transform them and move them forward. And we undertake network and content analysis to understand relations among stakeholders, arguments and objects.

There are both very practical applications and more critical-reflexive possibilities of issue mapping. And we bring our own media studies viewpoint to that, with an interest in the vernacular of the space.

So, issue mapping with social media frequently starts with topical Twitter hashtags/hashtag communities. We then build iterative “issue inventories” – actors, hashtags, media objects from one dataset used as seeds in their own right. We then undertake some hybrid network/thematic analysis – e.g. associations among hashtags; thematic network clusters. And we inevitably meet the issue of multi-platform/cross-platform engagement. And we’ll talk more about that.

One project we undertook on the #agchatoz, which is a community in Australia around weekly Twitter chats, but connected to a global community, explored the hashtag as a hybrid community. So here we looked at, for instance, the network of followers/followees in this network. And within that we were able to identify clusters of actors (across: Left-leaning Twitterati (30%); Australian ag, farmers (29%); Media orgs, politicians (13%); International ag, farmers (12%); Foodies (10%); Right-wing Australian politics and others), and this reveals some unexpected alliances or crossovers – e.g. between animal rights campaigners and dairy farmers. That suggests opportunities to bridge communities, to raise challenges, etc.

We have linked, in the files for this session, to various papers. One of these, Burgess and Matamoros-Fernandez (2016) looks at Gamergate and I’m going to show a visualisation of the YouTube video network (Rieder 2015; Gephi), which shows videos mentioned in tweets around that controversy, showing those that were closely related to each other.

Ariadna: My PhD is looking at another controversy, this one concerned with Adam Goodes, an Australian Rules footballer who was a high profile player until he retired last year. He has been a high profile campaigner against racism, and has called out racism on the field. He has been criticised for that by one part of society. And in 2014 he performed an Indigenous war dance on the pitch, which again received booing from the crowd and backlash. So, I start with Twitter, follow the links, and then move to those linked platforms and onwards…

Now I’m focusing on visual material, because the controversy was visual, it was about a gesture. So visual content (images, videos, GIFs) acts as a mediator of race and racism on social media. I have identified key media objects through qualitative analysis – important gestures, different image genres. And the next step has been to reflect on the differences between platform traces – YouTube related videos, the Facebook like network, Twitter filters, automatic notice-and-takedown messages. That gives a sense of the community, the discourse, the context, exploring their specificities and how they contribute to the cultural dynamics of race and racism online.

Jean: And if you want to learn more, there’s a paper later this week!

So, we usually do training on this at DMRC #CCISS16 workshops. We usually ask participants to think about YouTube and related videos – as a way to encourage people to think about networks other than social networks, and also to get to grips with Gephi.

Ariadna: Usually we split people into small groups and actually it is difficult to identify a current controversy that is visible and active in digital media – we look at YouTube and Tumblr (Twitter really requires prior collection of data). So, we go to YouTube to look for a key term, and we can then filter and find results changing… Usually you don’t reflect that much. So, if you look at “Black Lives Matter”, you get a range of content… And we ask participants to pick out relevant results – and what is relevant will depend on the research question you are asking. That first choice of what to select is important. Once this is done we get participants to use the YouTube Data Tools: https://tools.digitalmethods.net/netvizz/youtube/. This tool enables you to explore the network… You can use a video as a “seed”, or you can use a crawler that finds related videos… And that can be interesting… So if you see an Anti-Islamic video, does YouTube recommend more, or other videos related in other ways?

That seed leads you to related videos, and, depending on the depth you are interested in, videos related to the related videos… You can make selections of what to crawl, what the relevance should be. The crawler runs and outputs a Gephi file. So, this is an undirected network. Here nodes are videos, edges are relationships between videos. We generally use the layout: Force Atlas 2. And we run the Modularity Report to colour code the relationships on thematic or similar basis. Gephi can be confusing at first, but you can configure and use options to explore and better understand your network. You can look at the Data Table – and begin to understand the reasons for connection…
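The crawler’s logic – start from a seed, follow “related video” links to a chosen depth, and emit an undirected edge list for Gephi – can be sketched roughly as below. The related-video data here is canned and hypothetical; the real YouTube Data Tools crawler queries the YouTube API:

```python
from collections import deque

# Canned "related videos" data standing in for API responses;
# the IDs and relations are invented for illustration.
RELATED = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}

def crawl_related(seed, depth, related=RELATED):
    """Breadth-first crawl of related videos up to a given depth,
    returning undirected (video, video) edges for a Gephi network."""
    edges, seen = set(), {seed}
    queue = deque([(seed, 0)])
    while queue:
        video, d = queue.popleft()
        if d == depth:
            continue  # don't expand beyond the chosen crawl depth
        for other in related.get(video, []):
            edges.add(tuple(sorted((video, other))))  # undirected edge
            if other not in seen:
                seen.add(other)
                queue.append((other, d + 1))
    return edges

network = crawl_related("seed", depth=2)
```

The depth parameter is the “videos related to the related videos” choice described above; each extra level grows the network considerably.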

So, I have done this for Adam Goodes videos, to understand the clusters and connections.

So, we have looked at YouTube. Normally we move to Tumblr. But sometimes a controversy does not resonate on a different social media platform… So maybe a controversy on Twitter, doesn’t translate on Facebook; or one on YouTube doesn’t resonate on Tumblr… Or keywords will vary greatly. It can be a good way to start to understand the cultures of the platforms. And the role of main actors etc. on response in a given platform.

With Tumblr we start with the interface – e.g. looking at BlackLivesMatter. We look at the interface, functionality, etc. And then, again, we have a tool that can be used: https://tools.digitalmethods.net/netvizz/tumblr/. We usually encourage use of the same timeline across Tumblr and YouTube so that they can be compared.

So we can again go to Gephi, visualise the network. And in this case the nodes and edges can look different. So in this example we see 20 posts that connect 141 nodes, reflecting the particular reposting nature of that space.

Jean: The very specific cultural nature of the different online spaces can make for very interesting stuff when looking at controversies. And those are really useful starting points into further exploration.

And finally, a reminder, we run our summer schools in DMRC in February. When it is summer! And sunny! Apply now at: http://dmrcss.org/!

Analysing and visualising geospatial data (Peta Mitchell)

Normally when I would do this as a workshop I’d give some theoretical and historical background of the emergence of geospatial data, and then move onto the practical workshop on Carto (was CartoDB). Today though I’m going to talk about a case study, around the G20 meeting in Brisbane, and then talk about using Carto to create a social media map.

My own background is a field increasingly known as the geo humanities or the spatial humanities. And I did a close reading project of novels and films to create a Cultural Atlas of Australia. And how locations relate to narrative. For instance almost all films are made in South Australia, regardless of where they are set, mapping patterns of representation. We also created a CultureMap – an app that went with a map to alert you to literary or filmic places nearby that related back to that atlas.

I’ll talk about that G20 stuff. I now work on rapid spatial analytics; participatory geovisualisation and crowdsourced data; VGI – Volunteered Geographic Information; placemaking etc. But today I’ll be talking about emerging forms of spatial information/geodata, neogeographical tools etc.

So Gordon and de Souza e Silva (2011) talk about us witnessing the increasing proliferation of geospatial data. And this is sitting alongside a geospatial revolution – GPS enabled devices, geospatial data permeating social media, etc. So GPS emerged in the late ’90s/early 00’s with a slight social friend-finder function. But the geospatial web really begins around 2000, the beginning of the end of the idea of the web as a “placeless space”. To an extent this came from a legal case brought by a French individual against Yahoo!, who were allowing Nazi memorabilia to be sold. That was illegal in France; Yahoo! argued that the internet is global and that geographic filtering wasn’t possible. A French judge found in favour of the individual, Yahoo! were told it was both doable and easy, and Yahoo! went on to financially benefit from IP-based location information. As Richard Rogers put it, that case was the “revenge of geography against the idea of cyberspace”.

Then in 2005 Google Maps was described by Jon Udell as having the potential to be a “service factory for the geospatial web”. So in 2005 the “geospatial web” really is there as a term. By 2006 the concept of “Neogeography” was defined by Andrew (?) to describe the kind of non-professional, user-orientated, web 2.0-enabled mapping. There are critiques in cultural geography and in the geospatial literature about this term, and the use of the “neo” part of it. But there are multiple applications here, from humanities to humanitarianism; from cultural mapping to crisis mapping. An example here is Ushahidi maps, where individuals can send in data and contribute to mapping of crises. Now Ushahidi is more of a platform for crisis mapping, and other tools have emerged.

So there are lots of visualisation tools and platforms. There are traditional desktop GIS – ArcGIS, QGIS. There is basic web-mapping (e.g. Google Maps); Online services (E.g. CARTO, Mapbox); Custom map design applications (e.g. MapMill); and there are many more…

Spatial data is not new, but there is a growth in ambient and algorithmic spatial data. So for instance ABC (TV channel in Australia) did some investigation, inviting audiences to find out as much as they could based on their reporter Will Ockenden’s metadata. So, his phone records, for instance, revealed locations, a sensitive data point. And geospatial data is growing too.

We now have a geospatial sub stratum underpinning all social media networks. So this includes check-in/recommendation platforms: Foursquare, Swarm, Gowalla (now defunct), Yelp; Meetup/hookup apps: Tinder, Grindr, Meetup; YikYak; Facebook; Twitter; Instagram; and Geospatial Gaming: Ingress; Pokemon Go (from which Google has been harvesting improvements for its pedestrian routes).

Geospatial media data is generated from sources ranging from VGI (volunteered geographic information) to AGI (ambient geographic information), where users are not always aware that they are sharing data. That type of data doesn’t feel like crowdsourced data or VGI, hence the challenges, potential and ethical complexity of AGI.

So, the promises of geosocial analysis include a focus on real-time dynamics – people working with geospatial data aren’t used to this… And we also see social media as a “sensor network” for crisis events. There is also potential to provide new insights into spatio-temporal spread of ideas and actions; human mobilities and human behaviours.

People do often start with Twitter – because it is easier to gather data from it – but only between 1% and 3% of tweets are located. But when we work at festivals we see around 10% carrying location data – partly the nature of the event, partly because tweets are often coming through Instagram… On Instagram we see between 20% and 30% of images georeferenced, but based on upload location, not where the image was taken.
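Those percentages are easy to check on your own dataset; a minimal sketch, assuming tweets are dicts with an optional `coordinates` field (the records below are toy examples):

```python
# Toy tweet records; only the presence of 'coordinates' matters here.
tweets = [
    {"text": "at the festival!", "coordinates": (-27.47, 153.02)},
    {"text": "watching from home", "coordinates": None},
    {"text": "great show", "coordinates": None},
]

def geotagged_share(tweets):
    """Fraction of tweets carrying native lat/long coordinates."""
    located = sum(1 for t in tweets if t.get("coordinates"))
    return located / len(tweets)

share = geotagged_share(tweets)  # 1 of 3 tweets is geotagged
```

On a real collection this is the quick diagnostic that tells you whether native geotags alone will suffice, or whether you need geoparsing as in the G20 case below.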

There is also the challenge of geospatial granularity. On a tweet with Lat Long, that’s fairly clear. When we have a post tagged with a place we essentially have a polygon. And then when you geoparse, what is the granularity – street, city? Then there are issues of privacy and the extent to which people are happy to share that data.

So, in 2014 Brisbane hosted the G20, at a cost of $140 AUS for one highly disruptive weekend. In preceding G20 meetings there had been large scale protests. At the time the premier was former military and he put the whole central business district in lockdown, designated a “declared area” – under new laws made for this event. And hotels for G20 world leaders were inside the zone. So, Twitter mapping is usually done during crisis events – but you don’t know where those will happen, where to track them, etc. In this case we knew in advance where to look. So, a Safety and Security Act (2013) was put in place for this event, requiring prior approval for protests; arrests for the duration of the event; on the spot strip searches; banning of eggs in the Central Business District, no manure, no kayaks or flotation devices, no remote control cars or reptiles!

So we had these fears of violent protests, given all of these draconian measures. We had elevated terror levels. And we had war threatened after Abbott said he would “shirtfront” Vladimir Putin over MH17. But all that concern made city leaders worried that the city might be a ghost town, when they wanted it marketed as a new world city. They were offering free parking etc. to incentivise people to come in. And tweets reinforced the ghost town trope. So, what geosocial mapping enabled was a close-to-realtime sensor network of what might be happening during the G20.

So, the map we did was the first close-to-real-time, public-facing social media map of its kind, using CartoDB, and it was never more than an hour behind reality. We had clear locations and clear keywords – e.g. G20 – to focus on, so we had few false matches: a very few tweets like “the meeting will now be held in G20”, but otherwise none. We tracked the data through the meeting… which ran over a weekend and bank holiday. This map parses around 17,000(?) tweets, most of which were not geotagged but geoparsed. Only 10% represent where someone was when they tweeted; the remaining 90% come from geoparsing the subjects of tweets.

Now, even though that declared area isn’t huge, there are over 300 streets there. I had to build a manually constructed gazetteer, using Open Street Map (OSM) data, and then new data. Picking a bounding box that included that area generated a whole range of features – but I wasn’t that excited about fountains, benches etc. I was looking for places people might mention. And I wanted to know about features people might actually mention in their tweets. So, I had a bounding box, and the declared area from before… It would have been ideal if the G20 had given me their bounding polygon but we didn’t especially want to draw attention to what we were doing.

So, at the end we had lat, long, amenity (using OSM terms), name (e.g. Obama was at the Marriott so tweets about that), associated search terms – including local/vernacular versions of names of amenities; status (declared or restricted); and confidence (of location/coordinates – a score of 1 for geospatially tagged tweets, 0.8 for buildings, etc.). We could also create category maps of different data sets. On our map we showed geotagged and parsed tweets inside the area, but we only used geotweets outside the declared area. One of my colleagues created a Python script to “read” and parse tweets, and that generated a CSV. That CSV could then be fed into CartoDB. CartoDB has a time dimension, could update directly every half hour, and could use a Dropbox source to do that.
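A gazetteer-driven geoparser of the kind described might look roughly like this. The place names, coordinates and confidence values below are illustrative stand-ins, not the project’s actual data, though the scoring (1.0 for native geotags, 0.8 for buildings) follows the scheme described above:

```python
# Hand-built gazetteer in the spirit described above; entries, coordinates
# and scores are illustrative, not the project's real data.
GAZETTEER = {
    "marriott": {"lat": -27.4648, "lon": 153.0312, "confidence": 0.8},
    "south bank": {"lat": -27.4748, "lon": 153.0200, "confidence": 0.7},
}

def geoparse(tweet):
    """Return (lat, lon, confidence): 1.0 for a native geotag, the
    gazetteer score for a matched place name, or None if neither."""
    if tweet.get("coordinates"):
        lat, lon = tweet["coordinates"]
        return lat, lon, 1.0
    text = tweet["text"].lower()
    for name, place in GAZETTEER.items():
        if name in text:
            return place["lat"], place["lon"], place["confidence"]
    return None
```

Each row the script emits (lat, lon, confidence, plus amenity and status fields) is then ready for the CSV that CartoDB ingests.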

So, did we see much disruption? Well no… Mostly celebrity spotting – the two most tweeted images were Obama with a koala and Putin with a koala. It was very hot and very secured so little disruption happened. We did see selfies with Angela Merkel, and images of the phallic motorcade. And after the G20 there was a complaint filed to the corruption board about the chilling effect of security on participation, particularly in environmental protests. There was still engagement on social media, but not in-person. Disruption, protest and criticism were replaced by spectacle and distant viewing of the event.

And, with that, we turn to an 11 person panel session to wrap up, answer questions, etc.

Panel Session

Q1) Each of you presented different tools and approaches… Can you comment on how they are connected and how we can take advantage of that.

A1 – Jean) Implicitly or explicitly we’ve talked about possibilities of combining tools together in bigger projects. And tools that Peta and I have been working on are based on DMI tools for instance… It’s sharing tools, shared fundamental techniques for analytics for e.g. a Twitter dataset…

A1 – Richard) We’ve never done this sort of thing together… The fact that so much has been shared has been remarkable. We share quite similar outlooks on digital methods, and also on “to what end” – largely for the study of social issues and mapping social issues. But also other social research opportunities available when looking at a variety of online data, including geodata. It’s online web data analysis using digital methods for issue mapping and also other forms of social research.

A1 – Carolyn) All of these projects are using data that hasn’t been generated by research, but which has been created for other purposes… And that’s pushing the analysis in their own way… And tools that we combine bring in levels, encryptions… Digital methods use these, but also a need to step back and reflect – present in all of the presentations.

Q2) A question especially for Carolyn and Anne: what do you think about the study of proprietary algorithms. You talked a bit about the limitations of proprietary algorithms – for mobile applications etc? I’m having trouble doing that…

A2 – Anne) I think in the case of the tracker tool, it doesn’t try to engage with the algorithm, it looks at the presence of trackers. But here we have encountered proprietary issues… So for Ghostery, if you download the Firefox plugin you can access the content. We took the library of trackers from that to use as a database, we took that apart. We did talk to Ghostery, to make them aware… The question of algorithms… Of how you get into these blackboxed things… We are developing methods to do this… One way in is to observe the outputs, and compare them. Also Christian Sandvig is doing the auditing algorithms work.

A2 – Carolyn) Was just a discussion on Twitter about currency of algorithms and research on them… We’ve tried to ride on them, to implement that… Otherwise difficult. One element was on studying mobile applications. We are giving a presentation on this on Friday. Similar approach here, using infrastructures of app distribution and description etc. to look into this… Using existing infrastructures in which apps are built or encountered…

A2 – Anne) We can’t screenscrape and we are moving to this more closed world.

A2 – Richard) One of the best ways to understand algorithms is to save the outputs – e.g. we’ve been saving Google search outputs for years. Trying to save newsfeeds on Facebook, or other sorts of web apps can be quite difficult… You can use the API but you don’t necessarily get what the user has seen. The interface outputs are very different from developer outputs. So people think about recording rather than saving data – an older method in a way… But then you have the problem of only capturing a small sample of data – like analysing TV News. The new digital methods can mean resorting to older media methods… Data outputs aren’t as friendly or obtainable…

A2 – Carolyn) So one strand is accessing algorithms via transparency; you can also think of them as situated and in context, seeing them in operation and in action in relation to the data, associated with outputs. I’d recommend Solon Barocas on the impact of big data, whose work sits in legal studies.

A2 – Jean) One of the ways we approach this is the “App Walkthrough”, a method Ben Light and I have worked on that will shortly be published in New Media & Society; the idea is to think about those older media approaches, with user studies part of that…

Q3) What is your position as researchers on opening up data, and doing ethically acceptable data on the other side? Do you take a stance, even a public stance on these issues.

A3 – Anne) For many of these tools, like the YouTube tool and the Facebook tools, our developer took the conscious decision to anonymise the data.

A3 – Jean) I do have public positions. I’ve published on the political economy of Twitter… One interesting thing is that privacy discourses were used by Twitter to shut down TwapperKeeper at a time it was seeking to monetise… But you can’t just publish an archive of tweets with usernames; I don’t think anyone would find that acceptable…

A3 – Richard) I think it is important to respect or understand contextual privacy. People posting, on Twitter say, don’t have an expectation of its use in commercial or research contexts. Awareness of that is important for a researcher, no matter what terms of service the user has signed/consented to, or even if you have paid for that data. You should be aware of and concerned about contextual privacy… Which leads to a number of different steps. And that’s why, for instance, in Netvizz – the Facebook tool – usernames are not available for comments made, even though FacePager does show them. Tools vary in that understanding. Those issues need to be thought about, but not necessarily uniformly thought about by our field.

A3 – Carolyn) But that becomes more difficult in spaces that require you to take part in order to research them – WhatsApp, for instance – researchers start pretending to be regular users… to generate insights.

Comment (me): on native vs web apps and approaches and potential for applying Ghostery/Tracker Tracker methods to web apps which are essentially pointing to URLs.

Q4) Given that we are beholden to commercial companies, changes to algorithms, APIs etc, and you’ve all spoken about that to an extent, how do you feel about commercial limitations?

A4 – Richard) Part of my idea of digital methods is to deal with ephemerality… And my ideal is to follow the medium… rather than to follow good data prescripts… If you insist on those prescripts, then you won’t be able to use web data or social media data… unless you either work with the corporation or a corporate data scientist – many issues there of course. We did work with Yahoo! on political insights… categorising search queries around a US election, which was hard to do from outside. But the point is that even on the inside, you don’t have all the insight or full access to all the data… The question arises of what can we still do… What web data work can we still do… We constantly ask ourselves that; I think digital methods is in part an answer to it, otherwise we wouldn’t be able to do any of this.

A4 – Jean) All research has limitations, and describing them is part of the role here… But also when Axel and I started doing this work we got criticism for not having a “representative sample”… And we have people from across the humanities and social sciences who seem to be using the same approaches and techniques but actually we are doing really different things…

Q5) Digital methods in social sciences looks different from anthropology where this is a classical “informant” problem… This is where digital ethnography is there and understood in a way that it isn’t in the social sciences…

Resources from this workshop: