After a few weeks of leave I’m now back and spending most of this week at the Association of Internet Researchers (AoIR) Conference 2016. I’m hugely excited to be here as the programme looks excellent with a really wide range of internet research being presented and discussed. I’ll be liveblogging throughout the week starting with today’s workshops.
This is a liveblog so all corrections, updates, links, etc. are very much welcomed – just leave me a comment, drop me an email or similar to flag them up!
I am booked into the Digital Methods in Internet Research: A Sampling Menu workshop, although I may be switching session at lunchtime to attend the Internet rules… for Higher Education workshop this afternoon.
The Digital Methods workshop is being chaired by Patrik Wikstrom (Digital Media Research Centre, Queensland University of Technology, Australia) and the speakers are:
- Erik Borra (Digital Methods Initiative, University of Amsterdam, the Netherlands),
- Axel Bruns (Digital Media Research Centre, Queensland University of Technology, Australia),
- Jean Burgess (Digital Media Research Centre, Queensland University of Technology, Australia),
- Carolin Gerlitz (University of Siegen, Germany),
- Anne Helmond (Digital Methods Initiative, University of Amsterdam, the Netherlands),
- Ariadna Matamoros Fernandez (Digital Media Research Centre, Queensland University of Technology, Australia),
- Peta Mitchell (Digital Media Research Centre, Queensland University of Technology, Australia),
- Richard Rogers (Digital Methods Initiative, University of Amsterdam, the Netherlands),
- Fernando N. van der Vlist (Digital Methods Initiative, University of Amsterdam, the Netherlands),
- Esther Weltevrede (Digital Methods Initiative, University of Amsterdam, the Netherlands).
I’ll be taking notes throughout but the session materials are also available here: http://tinyurl.com/aoir2016-digmethods/.
Patrik: We are in for a long and exciting day! I won’t introduce all the speakers as we won’t have time!
Conceptual Introduction: Situating Digital Methods (Richard Rogers)
My name is Richard Rogers, I’m professor of new media and digital culture at the University of Amsterdam and I have the pleasure of introducing today’s session. So I’m going to do two things, I’ll be situating digital methods in internet-related research, and then taking you through some digital methods.
I would like to situate digital methods as a third era of internet research… I think all of these eras thrive and overlap but they are differentiated.
- Web of Cyberspace (1994-2000): Cyberstudies was an effort to see difference in the internet, the virtual as distinct from the real. I’d situate this largely in the 90’s and the work of Steve Jones and Steve (?).
- Web as Virtual Society? (2000-2007) saw virtual as part of the real. Offline as baseline and “virtual methods” with work around the digital economy, the digital divide…
- Web as societal data (2007-) is about “virtual as indication of the real”. Online as baseline.
Right now we use online data about society and culture to make “grounded” claims.
So, if we look at Allrecipes.com Thanksgiving recipe searches on a map we get some idea of regional preference, or we look at Google data in more depth, we get this idea of internet data as grounding for understanding culture, society, tastes.
So, we had this turn in around 2008 to “web as data” as a concept. When this idea was first introduced not all were comfortable with the concept. Mike Thelwall et al (2005) talked about the importance of grounding the data from the internet. So, for instance, Google’s flu trends can be compared to Wikipedia traffic etc. And with these trends we also get the idea of “the internet knows first”, with the web predicting other sources of data.
Now I do want to talk about digital methods in the context of digital humanities data and methods. Lev Manovich talks about Cultural Analytics. It is concerned with digitised cultural materials with materials clusterable in a sort of art historical way – by hue, style, etc. And so this is a sort of big data approach that substitutes “continuous change” for periodisation and categorisation for continuation. So, this approach can, for instance, be applied to Instagram (Selfiexploration), looking at mood, aesthetics, etc. And then we have Culturomics, mainly through the Google Ngram Viewer. A lot of linguists use this to understand subtle differences as part of distant reading of large corpuses.
And I also want to talk about e-social sciences data and method. Here we have Webometrics (Thelwall et al) with links as reputational markers. The other tradition here is Altmetrics (Priem et al), which uses online data to do citation analysis, with social media data.
So, at least initially, the idea behind digital methods was to be in a different space. The study of online digital objects, and also natively online method – methods developed for the medium. And natively digital is meant in a computing sense here. In computing, software has a native mode when it is written for a specific processor, so these are methods specifically created for the digital medium. We also have digitized methods: those which have been imported and migrated, adapted slightly to the online.
Generally speaking there is a sort of protocol for digital methods: Which objects and data are available? (links, tags, timestamps); how do dominant devices handle them? etc.
I will talk about some methods here:
For the hyperlink analysis there are several methods. The Issue Crawler software, still running and working, enables you to see links between pages, direction of linking, aspirational linking… For example a visualisation of an Armenian NGO shows the dynamics of an issue network showing politics of association.
The other method that can be used here takes a list of sensitive sites, using Issue Crawler, and then runs it through an internet censorship service. There are variations on this that indicate how successful attempts at internet censorship are. We do work on Iran and China and I should say that we are always quite thoughtful about how we publish these results because of their sensitivity.
2. The website as archived object
We have the Internet Archive and we have individual archived web sites. Both are useful but researcher use is not terribly significant so we have been doing work on this. See also a YouTube video called “Google and the politics of tabs” – a technique to create a movie of the evolution of a webpage in the style of timelapse photography. I will be publishing soon about this technique.
But we have also been looking at historical hyperlink analysis – giving you that context that you won’t see represented in archives directly. This shows the connections between sites at a previous point in time. We also discovered that the “Ghostery” plugin can also be used with archived websites – for trackers and for code. So you can see the evolution and use of trackers on any website/set of websites.
6. Wikipedia as cultural reference
Note: the numbering is from a headline list of 10, hence the odd numbering…
We have been looking at the evolution of Wikipedia pages, understanding how they change. It seems that pages shift from neutral to national points of view… So we looked at Srebrenica and how that is represented. The pages here have different names, indicating difference in the politics of memory and reconciliation. We have developed a triangulation tool that grabs links and references and compares them across different pages. We also developed comparative image analysis that lets you see which images are shared across articles.
7. Facebook and other social networking sites
Facebook, as you probably well know, is a social media platform that is relatively difficult to pin down at a moment in time. Anyone trying to pin down the history of Facebook will find that very hard – it hasn’t been in the Internet Archive for four years, and the site changes all the time. We have developed two approaches: one uses social media profiles and interest data as a means of studying cultural taste and political preference, or “Postdemographics”; the other, “Networked content analysis”, uses social media activity data as a means of studying “most engaged with content” – that helps with the fact that profiles are no longer available via the API. To some extent the API drives the research, but taking a digital methods approach we need to work with the medium and find which possibilities are there for research.
So, one of the projects undertaken within this space was elFriendo, a MySpace-based project which looked at the cultural tastes of “friends” of Obama and McCain during their presidential race. For instance Obama’s friends best liked Lost and The Daily Show on TV, McCain’s liked Desperate Housewives, America’s Next Top Model, etc. Very different cultures and interests.
Now the Networked Content Analysis approach, where you quantify and then analyse, works well with Facebook. You can look at pages and use data from the API to understand the pages and groups that liked each other, to compare memberships of groups etc. (at the time you were able to do this). In this process you could see specific administrator names, and we did this with right wing data working with a group called Hope not Hate, who recognised many of the names that emerged here. Looking at most liked content from groups you also see the shared values, cultural issues, etc.
So, you could see two areas of Facebook Studies, Facebook I (2006-2011) about presentation of self: profiles and interests studies (with ethics); Facebook II (2011-) which is more about social movements. I think many social media platforms are following this shift – or would like to. So in Instagram Studies the Instagram I (2010-2014) was about selfie culture, but has shifted to Instagram II (2014-) concerned with antagonistic hashtag use for instance.
Twitter has done this and gone further… Twitter I (2006-2009) was about Twitter as an urban lifestyle tool (its origins) and “banal” lunch tweets – their own tagline of “what are you doing?”, a connectivist space; Twitter II (2009-2012) has moved to elections, disasters and revolutions. The tagline is “what’s happening?” and we have metrics such as “trending topics”; Twitter III (2012-) sees this as a generic resource tool with commodification of data, stock market predictions, elections, etc.
So, I want to finish by talking about work on Twitter as a storytelling machine for remote event analysis. This is an approach we developed some years ago around the Iran event crisis. We made a tweet collection around a single Twitter hashtag – which is no longer done – and then ordered by most retweeted (top 3 for each day) and presented in chronological (not reverse) order. And we then showed those in huge displays around the world…
To take you back to June 2009… Mousavi holds an emergency press conference. Voter turn out is 80%. SMS is down. Mousavi’s website and Facebook are blocked. Police use pepper spray… The first 20 days of most popular tweets is a good succinct summary of the events.
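Note: the ordering described above (top 3 most retweeted tweets per day, days in chronological order) can be sketched in a few lines. This is only an illustration under my own assumptions, not the DMI code: the `created_at` and `retweets` field names and the sample tweets are invented.

```python
from collections import defaultdict
from datetime import datetime


def top_retweeted_per_day(tweets, n=3):
    """Return {day: [tweets]} with the n most retweeted tweets per day.

    Days are inserted in chronological (not reverse) order.
    Each tweet is a dict with hypothetical keys 'created_at' and 'retweets'.
    """
    by_day = defaultdict(list)
    for tweet in tweets:
        by_day[tweet["created_at"].date()].append(tweet)

    story = {}
    for day in sorted(by_day):  # oldest day first
        ranked = sorted(by_day[day], key=lambda t: t["retweets"], reverse=True)
        story[day] = ranked[:n]
    return story


# Invented sample data, purely for illustration.
sample = [
    {"created_at": datetime(2009, 6, 14, 9, 0), "text": "second day, minor", "retweets": 2},
    {"created_at": datetime(2009, 6, 13, 8, 0), "text": "first day, big", "retweets": 40},
    {"created_at": datetime(2009, 6, 13, 9, 0), "text": "first day, small", "retweets": 1},
    {"created_at": datetime(2009, 6, 14, 7, 0), "text": "second day, major", "retweets": 90},
]

for day, top in top_retweeted_per_day(sample, n=3).items():
    print(day, [t["text"] for t in top])
```

Reading the output day by day gives the "storytelling" view: a short chronological summary built from what was most amplified.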
So, I’ve taken you on a whistle stop tour of methods. I don’t know if we are coming to the end of this. I was having a conversation the other day that the Web 2.0 days are over really, the idea that the web is readily accessible, that APIs and data are there to be scraped… That’s really changing. This is one of the reasons the app space is so hard to research. We are moving again to user studies to an extent. What the Chinese researchers are doing involves convoluted processes for getting the data for instance. But there are so many areas of research that can still be done. Issue Crawler is still out there and other tools are available at tools.digitalmethods.net.
Twitter studies with DMI-TCAT (Fernando van der Vlist and Emile den Tex)
Fernando: I’m going to be talking about how we can use the DMI-TCAT tool to do Twitter Studies. I am here with Emile den Tex, one of the original developers of this tool, alongside Erik Borra.
So, what is DMI-TCAT? It is the Digital Methods Initiative Twitter Capture and Analysis Toolset, a server side tool which aims to support robust and reproducible data capture and analysis. The design is based on two ideas: that captured datasets can be refined in different ways; and that the datasets can be analysed in different ways. Although we developed this tool, it is also in use elsewhere, particularly in the US and Australia.
So, how do we actually capture Twitter data? Some of you will have some experience of trying to do this. As researchers we don’t just want the data, we also want to look at the platform in itself. If you are in industry you get Twitter data through a “data partner”, the biggest of which by far is GNIP – owned by Twitter as of the last two years – then you just pay for it. But it is pricey. If you are a researcher you can go to an academic data partner – DiscoverText or Hexagon – and they are also resellers but they are less costly. And then the third route is the publicly available data – REST APIs, Search API, Streaming APIs. These are, to an extent, the authentic user perspective as most people use these… We have built around these but the available data and APIs shape and constrain the design and the data.
For instance the “Search API” prioritises “relevance” over “completeness” – but as academics we don’t know how “relevance” is being defined here. If you want to do representative research then completeness may be most important. If you want to look at how Twitter prioritises the data, then that Search API may be most relevant. You also have to understand rate limits… This can constrain research, as different data has different rate limits.
So there are many layers of technical mediation here, across three big actors: Twitter platform – and the APIs and technical data interfaces; DMI-TCAT (extraction); Output types. And those APIs and technical data interfaces are significant mediators here, and it is important to understand their implications in our work as researchers.
So, onto the DMI-TCAT tool itself – more on this in Borra & Rieder (2014) (doi:10.1108/AJIM-09-2013-0094). They talk about “programmed method” and the idea of the methodological implications of the technical architecture.
What can one learn if one looks at Twitter through this “programmed method”? Well (1) Twitter users can change their Twitter handle, but their ids will remain identical – sounds basic but it’s important to understand when collecting data. (2) The length of a Tweet may vary beyond the maximum of 140 characters (mentions and urls). (3) Native retweets may have their top level text property shortened. (4) Unexpectedly limited support for new emoji characters can be problematic. (5) It is possible to retrieve a deleted tweet.
So, for example, a tweet can vary beyond 140 characters. The Retweet of an original post may be abbreviated… Now we don’t want that, we want it to look as it would to a user. So, we capture it in our tool in the non-truncated version.
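Note: as a rough illustration of points (1) and (3), here is a minimal sketch working on tweet JSON in the shape of the Twitter v1.1 API (a top-level `text`, a `user` object with `id_str` and `screen_name`, and an embedded `retweeted_status` for native retweets). I am assuming that shape; this is not how DMI-TCAT itself is implemented, and the function names are my own.

```python
def user_key(tweet):
    """Use the stable numeric user id, not the handle, which users can change."""
    return tweet["user"]["id_str"]


def display_text(tweet):
    """Return the untruncated text of a tweet.

    For a native retweet the top-level 'text' may be cut short, so rebuild it
    from the embedded original tweet instead.
    """
    original = tweet.get("retweeted_status")
    if original is None:
        return tweet["text"]
    author = original["user"]["screen_name"]
    return f"RT @{author}: {original['text']}"


# Invented example of a truncated native retweet.
retweet = {
    "text": "RT @some_account: A fairly long original message that gets cut o…",
    "user": {"id_str": "42", "screen_name": "retweeter"},
    "retweeted_status": {
        "text": "A fairly long original message that gets cut off when retweeted",
        "user": {"id_str": "7", "screen_name": "some_account"},
    },
}

print(user_key(retweet))
print(display_text(retweet))
```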
And, on the issue of deletion and withholding. There are tweets deleted by users, and there are tweets which are withheld by the platform – and the withholding is a country by country issue. But you can see tweets only available in some countries. A project that uses this information is “Politwoops” (http://politwoops.sunlightfoundation.com/) which captures tweets deleted by US politicians, and lets you filter to specific states, party, position. Now there is an ethical discussion to be had here… We don’t know why tweets are deleted… We could at least talk about it.
So, the tool captures Twitter data in two ways. Firstly there are the direct capture capabilities (via web front-end) which allow tracking of users and capture of public tweets posted by these users; tracking particular terms or keywords, including hashtags; and getting a small random sample (approx 1%) of all public statuses. Secondary capture capabilities (via scripts) allow further exploration, including user ids, deleted tweets etc.
Twitter as a platform has a very formalised idea of sociality, the types of connections, parameters, etc. When we use the term “user” we mean it in the platform defined object meaning of the word.
Secondary analytical capabilities, via script, also allow further work:
- support for geographical polygons to delineate geographical regions for tracking particular terms or keywords, including hashtags.
- Built-in URL expander, following shortened URLs to their destination. Allowing further analysis, including of which statuses are pointing to the same URLs.
- Download media (e.g. videos and images) attached to particular Tweets.
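Note: the URL expansion idea in the list above can be sketched as follows. This is my own illustration, not TCAT's expander: `group_by_destination` and `resolve_over_network` are hypothetical names, and the resolver is passed in so the grouping logic can be tried without touching the network.

```python
from collections import defaultdict
from urllib.request import urlopen


def resolve_over_network(url, timeout=10):
    """Follow redirects and return the final URL (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def group_by_destination(tweet_urls, resolve):
    """Group tweet ids by the destination their (shortened) URLs point to.

    tweet_urls: iterable of (tweet_id, url) pairs.
    resolve: callable mapping a URL to its final destination.
    """
    cache = {}  # resolve each distinct short URL only once
    groups = defaultdict(set)
    for tweet_id, url in tweet_urls:
        if url not in cache:
            cache[url] = resolve(url)
        groups[cache[url]].add(tweet_id)
    return dict(groups)


# Offline demo with an invented lookup table standing in for the network.
fake_redirects = {
    "http://short.example/a": "http://news.example/story",
    "http://tiny.example/b": "http://news.example/story",
    "http://short.example/c": "http://other.example/page",
}
pairs = [(1, "http://short.example/a"), (2, "http://tiny.example/b"), (3, "http://short.example/c")]
print(group_by_destination(pairs, fake_redirects.get))
```

Swapping `fake_redirects.get` for `resolve_over_network` would resolve real links, subject to timeouts and to whatever the shortening services allow.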
So, we have this tool but what sort of studies might we do with Twitter? Some ideas to get you thinking:
- Hashtag analysis – users, devices etc. Why? They are often embedded in social issues.
- Mentions analysis – users mentioned in contexts, associations, etc. allowing you to e.g. identify expertise.
- Retweet analysis – most retweeted per day.
- URL analysis – the content that is most referenced.
So Emile will now go through the tool and how you’d use it in this way…
Emile: I’m going to walk through some main features of the DMI TCAT tool. We are going to use a demo site (http://tcatdemo.emiledentex.nl/analysis/) and look at some Trump tweets…
Note: I won’t blog everything here as it is a walkthrough, but we are playing with timestamps (the tool uses UTC), search terms etc. We are exploring hashtag frequency… In that list you can see Bengazi, tpp, etc. Now, once you see a common hashtag, you can go back and query the dataset again for that hashtag/search terms… And you can filter down… And look at “identical tweets” to find the most retweeted content.
Emile: Erik called this a list making tool – it sounds dull but it is so useful… And you can then put the data through other tools. You can put tweets into Gephi. Or you can do exploration… We looked at Getty Parks project, scraped images, reverse Google image searched those images to find the originals, checked the metadata for the camera used, and investigated whether the cost of a camera was related to the success in distributing an image…
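Note: for anyone wanting to try the same kind of "list making" outside the tool, here is a minimal sketch of hashtag frequency and identical-tweet counting over plain tweet texts. The regular expression is a simplification of my own and will not match Twitter's hashtag rules exactly; the sample texts are invented.

```python
import re
from collections import Counter

# Simplified hashtag pattern (an assumption, not Twitter's full definition).
HASHTAG = re.compile(r"#(\w+)")


def hashtag_frequency(texts):
    """Count how often each hashtag appears across a list of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(HASHTAG.findall(text))
    return counts


def identical_tweets(texts):
    """Return (text, count) pairs, most repeated first.

    Repeated identical texts are a rough proxy for heavily retweeted content.
    """
    return Counter(texts).most_common()


texts = [
    "Big speech tonight #tpp",
    "RT @someone: Read this #Benghazi",
    "RT @someone: Read this #Benghazi",
]
print(hashtag_frequency(texts).most_common())
print(identical_tweets(texts)[0])
```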
Richard: It was a critique of user generated content.
Analysing Social Media Data with TCAT and Tableau (Axel Bruns)
My talk should be a good follow on from the previous presentation as I’ll be looking at what you can do with TCAT data outside and beyond the tool. Before I start I should say that both Amsterdam and QUT are holding summer schools – and we have different summers! – so do have a look at those.
You’ve already heard about TCAT so I won’t talk more about that except to talk about the parts of TCAT I have been using.
TCAT Data Export allows you to export all tweets from selection – containing all of the tweets and information about them. You can also export a table of hashtags – tweet ids from your selection and hashtags; and mentions – tweet ids from your selection with mentions and mention type. You can export other things as well – known users (politicians, celebrities, etc); URLs; etc. And the structure that emerges are the Main TCAT export file (“full export”) and associating Hashtags; Mentions; Any other additional data. If you are familiar with SQL you are essentially joining databases here. If not then that’s fine, Tableau does this for you.
In terms of processing the data there are a number of tools here. Excel just isn’t good enough at scale – a worksheet is limited to just over a million rows and that Trump dataset was 2.8 M already. So a tool that I and many others have been working with is Tableau. It’s a tool that copes with scale, it’s a user-friendly, intuitive, all-purpose data analytics tool, but the downside is that it is not free (unless you are a student or are using it in teaching). Alongside that, for network visualisation, Gephi is the main tool at the moment. That’s open source and free and a new version came out in December.
So, into Tableau and an idea of what we can do with the data… Tableau enables you to work with data sources of any form, databases, spreadsheets, etc. So I have connected the full export I’ve gotten from TCAT… I have linked the main file to hashtag and mention files. Then I have also generated an additional file that expands the URLs in that data source (you can now do this in TCAT too). This is a left join – one main table that other tables are connected to. I’ve connected based on (tweet) id. And the dataset I’m showing here is from the Paris 2015 UN Climate Change Conference. And all the steps I’m going through today are in a PDF guidebook that is available in that session resources link (http://tinyurl.com/aoir2016-digmethods/).
Tableau then tries to make sense of the data… Dimensions are the datasets which have been brought in, clicking on those reveals columns in the data, and then you see Measures – countable features in the data. Tableau makes sense of the file itself, although it won’t always guess correctly.
Now, we’ve joined the data here so that can mean we get repetition… If a tweet has 6 hashtags, it might seem to be 6 tweets. So I’m going to use the unique tweet ids as a measure. And I’ll also right click to ensure this is a distinct count.
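Note: the repetition problem is easy to reproduce outside Tableau. Below is a small sketch using an in-memory SQLite database; the table and column names are invented stand-ins for the TCAT export files, not the real export schema.

```python
import sqlite3


def join_counts():
    """Left-join tweets to hashtags and compare row count with distinct tweet count."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT);
        CREATE TABLE hashtags (tweet_id INTEGER, hashtag TEXT);
        INSERT INTO tweets VALUES (1, 'a #cop21 #climate');
        INSERT INTO tweets VALUES (2, 'no hashtags here');
        INSERT INTO tweets VALUES (3, 'b #cop21');
        INSERT INTO hashtags VALUES (1, 'cop21');
        INSERT INTO hashtags VALUES (1, 'climate');
        INSERT INTO hashtags VALUES (3, 'cop21');
        """
    )
    # The left join keeps tweet 2 (with a NULL hashtag) and repeats tweet 1 twice.
    rows, distinct_tweets = conn.execute(
        """
        SELECT COUNT(*), COUNT(DISTINCT t.id)
        FROM tweets AS t
        LEFT JOIN hashtags AS h ON h.tweet_id = t.id
        """
    ).fetchone()
    conn.close()
    return rows, distinct_tweets


rows, distinct_tweets = join_counts()
print(f"joined rows: {rows}, distinct tweets: {distinct_tweets}")
```

Three tweets become four joined rows, which is why the distinct count on tweet id is the safer measure.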
Having done that I can begin to visualise my data and see a count of tweets in my dataset… And I can see when they were created – using Created at but also then finessing that to Hour (rather than default of Year). Now when I look at that dataset I see a peak at 10pm… That seems unlikely… And it’s because TCAT is running on Brisbane time, so I need to shift to CET time as these tweets were concerned with events in Paris. So I create a new Formula called CET, and I’ll set it to be “DATEADD('hour', -9, [Created at])” – which simply allows us to take 9 hours off the time to bring it to the correct timezone. Having done that the spike is 3.40pm, and that makes a lot more sense!
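Note: the same nine hour shift can be checked in Python. This sketch assumes fixed offsets (Brisbane at UTC+10 with no daylight saving, Paris on winter CET at UTC+1), which holds for a December event but would need proper timezone data for other dates.

```python
from datetime import datetime, timedelta, timezone

BRISBANE = timezone(timedelta(hours=10))  # AEST, no daylight saving
CET = timezone(timedelta(hours=1))        # Paris in winter


def to_cet(created_at_brisbane):
    """Convert a naive Brisbane timestamp to a naive CET timestamp."""
    aware = created_at_brisbane.replace(tzinfo=BRISBANE)
    return aware.astimezone(CET).replace(tzinfo=None)


stamp = datetime(2015, 12, 1, 0, 40)  # invented example timestamp
print(to_cet(stamp))
# Equivalent to simply subtracting nine hours:
print(stamp - timedelta(hours=9))
```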
Having generated that graph I can click on, say, the peak activity and see the number of tweets and the tweets that appeared. You can see some spam there – of course – but also widely retweeted tweet from the White House, tweets showing that Twitter has created a new emoji for the summit, a tweet from the Space Station. This gives you a first quick visual inspection of what is taking place… And you can also identify moments to drill down to in further depth.
I might want to compare Twitter activity with number of participating users, comparing the unique number of counts (synchronising axes for scale). Doing that we do see that there are more tweets when more users are active… But there is also a spike that is independent of that. And that spike seems to be generated by Twitter users tweeting more – around something significant perhaps – that triggers attention and activity.
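Note: the comparison of tweet volume with the number of active users can be approximated with a small grouping function. This is a sketch on invented data; the `(created_at, user_id)` input shape is my assumption rather than the TCAT export format.

```python
from collections import defaultdict
from datetime import datetime


def activity_by_hour(tweets):
    """Return {hour: (tweet_count, unique_user_count)} in chronological order.

    tweets: iterable of (created_at, user_id) pairs.
    """
    counts = defaultdict(int)
    users = defaultdict(set)
    for created_at, user_id in tweets:
        hour = created_at.replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
        users[hour].add(user_id)
    return {hour: (counts[hour], len(users[hour])) for hour in sorted(counts)}


sample_activity = [
    (datetime(2015, 11, 30, 15, 5), "u1"),
    (datetime(2015, 11, 30, 15, 40), "u1"),
    (datetime(2015, 11, 30, 15, 41), "u2"),
    (datetime(2015, 11, 30, 16, 2), "u3"),
]

for hour, (n_tweets, n_users) in activity_by_hour(sample_activity).items():
    print(hour, n_tweets, n_users, round(n_tweets / n_users, 2))
```

An hour where tweets per user jumps while the user count stays flat is the kind of spike described above: the same people tweeting more.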
So, this tool enables quantitative data analysis as a starting point or related route into qualitative analysis, the approaches are really inter-related. Quickly assessing this data enables more investigation and exploration.
Now I’m going to look at hashtags, seeing the volume against activity. By default the hashtags are ordered alphabetically, but that isn’t that useful, so I’m going to reorder by use. When I do that you can see that COP21 – the official hashtag – is by far the most popular. These tweets were collected from that hashtag but also from several search terms for the conference – official abbreviations for the event. And indeed some tweets have “Null” hashtags – no hashtags, just the search terms. You also see variance in spelling and capitalisation. Unlike Twitter, Tableau is case sensitive so I would need to use some sort of Formula to resolve this – combining terms to one hashtag. A quick way to do that is to use “LOWER([Hashtag])” which converts all data in the hashtag field to lower case. That clustering shows COP21 as an even bigger hashtag, but also identifies other popular terms. We do see spikes in a given hashtag – often very brief – and these are often related to one very popular and heavily retweeted tweet having emerged. So, e.g. a prominent actor/figure has tweeted – e.g. in this data set Cara Delevingne (a British supermodel) triggers a short sharp spike in tweets/retweets.
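Note: the lower-casing trick works the same way in code. A minimal sketch, with invented hashtag values and `None` standing in for the “Null” rows:

```python
from collections import Counter


def merge_case_variants(hashtags):
    """Count hashtags case-insensitively, skipping empty ("Null") values."""
    merged = Counter()
    for tag in hashtags:
        if tag:  # ignore None and empty strings
            merged[tag.lower()] += 1
    return merged


tags = ["COP21", "cop21", "Cop21", None, "climate", "ClimateChange"]
print(merge_case_variants(tags).most_common())
```

Note that this only merges capitalisation variants; genuinely different spellings would still need to be combined by hand.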
And we can see these hashtags here, their relative popularity. But remember that my dataset is just based on what I asked TCAT to collect… TCOT might be a really big hashtag but maybe they don’t usually mention my search terms, hence being smaller in my data set. So, don’t be fooled into assuming some of the hashtags are small/low use just because they may not be prominent in a collected dataset.
Turning now to Mentions… We can see several Mention Types: original/null (no mentions); mentions; retweet. You also see that mentions and retweets spikes at particular moments – tweets going viral, key figures getting involved in the event or the tweeting, it all gives you a sense of the choreography of the event…
So, we can now look at who is being mentioned. I’m going to take all Twitter users in my dataset… I’ll see how many tweets mention them. I have a huge Null group here – no mentions – so I’ll start by removing that. Among the most mentioned accounts we see COP21 as the biggest, and others such as Narendra Modi (chair of event?), POTUS, UNFCCC, Francois Hollande, the UN, Mashi Rafael, COP21en – the English language event account; EPN – Justin Trudeau; StationCDRKelly; C Figueres; India4Climate; Barack Obama’s personal account, etc. And I can also see what kind of mention they get. And you see that POTUS gets mentions but no retweets, whilst Barack Obama has a few retweets but mainly mentions. That doesn’t mean he doesn’t get retweets, but not in this dataset/search terms. By contrast Station Commander Kelly gets almost exclusively retweets… The balance of mentions, how people are mentioned, what gets retweeted etc… That is all a starting point for closer reading and qualitative analysis.
And now I want to look at who tweets the most… And you’ll see that there is very little overlap between the people who tweet the most, and the people who are mentioned and retweeted. The one account there that appears in both is COP21 – the event itself. Now some of the most active users are spammers and bots… But others will be obsessive, super-active users… Further analysis lets you dig further. Having looked at this list, I can look at what sort of tweets these users are sending… And that may look a bit different… This uses the Mention type and it may be that one tweet mentions multiple users, so it gets counted multiple times… So, for instance, DiploMix puts out 372 tweets… But when re-looked at for mentions and retweets we see a count of 636. That’s an issue you have to get your head around a bit… And the same issue occurs with hashtags. Looking at the types of tweets put out shows some who post only or mainly original tweets, some who do mention others, some only or mainly retweet – perhaps bots or automated accounts. For instance DiploMix retweets diplomats and politicians. RelaxinParis is a bot retweeting everything on Paris – not useful for analysis, but part of lived experience of Twitter of course.
So, I have lots of views of data, and sheets saved here. You can export tables and graphs for publications too, which is very helpful.
I’m going to finish by looking at URLs mentioned… I’ve expanded these myself, and I’ve got the domain/path as well as the domain captured. I remove the NULL group here. And the most popular linked to domain is Twitter – I’m going to combine http and https versions in Tableau – but Youtube, UN, Leader of Iran, etc. are most popular. If I dig further into the Twitter domains, looking at Path, I can see whose accounts/profiles etc. are most linked to. If I dig into Station Commander Kelly you see that the most shared of these URLs are images… And we can look at that… And that’s a tweet we had already seen all day – a very widely shared image of a view of earth.
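Note: splitting expanded URLs into domain and path, and treating http and https as the same site, can be sketched with the standard library. Stripping a leading "www." is an extra simplification of my own, and the sample URLs are invented.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_and_path(url):
    """Return (domain, path), ignoring the scheme so http and https combine."""
    parts = urlsplit(url)
    domain = (parts.hostname or "").lower()
    if domain.startswith("www."):  # assumption: treat www.x and x as one site
        domain = domain[4:]
    return domain, parts.path


def domain_counts(urls):
    """Count links per domain, skipping empty (NULL) values."""
    return Counter(domain_and_path(url)[0] for url in urls if url)


urls = [
    "http://twitter.com/StationCDRKelly/status/1",
    "https://twitter.com/COP21/status/2",
    "https://www.youtube.com/watch?v=abc",
    None,
]
print(domain_counts(urls).most_common())
print(domain_and_path(urls[0]))
```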
My time is up but I’m hoping this has been useful… This is the sort of approach I would take – exploring the data, using this as an entry point for more qualitative data analysis.
Analysing Network Dynamics with Agent Based Models (Patrik Wikström)
I will be talking about network dynamics and how we can understand some of the theory of network dynamics. And before I start a reminder that you can access and download all these materials at the URL for the session.
So, what are network dynamics? Well we’ve already seen graphs and visualisations of things that change over time. Network dynamics are very much about things that change and develop over time… So when we look at a corpus of tweets they are not all simultaneous, there is a dimension of time… And we have people responding to each other, to what they see around them, etc. So, how can we understand what goes on? We are interested in human behaviour, social behaviour, the emergence of norms and institutions, information diffusion patterns across multiple networks, etc. And these are complex and related to time, we have to take time into account. We also have to understand how macro level patterns emerge from local interactions between heterogeneous agents, and how macro level patterns influence and impact upon those interactions. But this is hard…
It is difficult to capture complexity of such dynamic phenomena with verbal or conceptual models (or with static statistical models). And we can be seduced by big data. So I will be talking about using particular models, agent-based models. But what is that? Well it’s essentially a computer program, or a computer program for each agent… That allows it to be heterogeneous, autonomous and to interact with the environment and with other agents; that means they can interact in a (physical) space or as nodes in a network; and we can allow them to have (limited) perception, memory and cognition, etc. That’s something it is very hard for us to do and imagine with our own human brains when we look at large data sets.
The fundamental goal here is to develop a model that represents theoretical constructs, logics and assumptions, and that is able to replicate the observed real-world behaviour. This is the same kind of approach that we use in most of our work.
So, a simple example…
Let’s assume that we start with some inductive idea. So we want to explain the emergence of the different social media network structures we observe. We might want some macro-level observations of Structure – clusters, path lengths, degree distributions, size; Time – growth, decline, cyclic; Behaviours – contagion, diffusion. So we want to build some kind of model to transfer or take our assumptions of what is going on, and translate that into a computer model…
So, what are our assumptions?
Well let’s say we think people use different strategies when they decide which accounts to follow, with factors such as familiarity, similarity, activity, popularity, random… They may all be different explanations of why I connect with one person rather than another… And let’s also assume that when a user joins Twitter they immediately start following a set of accounts, and once part of the network they add more. And let’s also assume that people are different – that’s really important! People are interested in different things – they have different passions, topics that interest them, some are more active, some are more passive. And that’s something we want to capture.
So, to do this I’m going to use something called NetLogo – which some of you may have already played with – it is a tool developed maybe 25 years back at Northwestern University. You can download it – or use a limited browser-based version – from: http://ccl.northwestern.edu/netlogo/.
In NetLogo we start with a 3 node network… I initialise the network and get three new nodes. Then I can add a new node… In this model I have a slider for “randomness” – if I set it to less random, it picks existing popular nodes, in the middle it combines popularity with randomness, and at most random it just adds nodes randomly…
So, I can run a simulation with about 200 nodes with randomness set to maximum… You can see how many nodes are present, how many friends the most popular node has, and how many nodes have very few friends (with 3 which is minimum connections in this model). If I now change the formation strategy here to set randomness to zero… then we see the nodes connecting back to the same most popular nodes… A more broadcast-like network. This is a totally different kind of network.
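To make the idea concrete, here is a minimal Python sketch of this kind of growth model. It is not the NetLogo model shown in the session: the function name `grow_network`, the way the `randomness` setting mixes the two strategies, and the "+1" weighting that lets brand new nodes be chosen are all assumptions made for illustration.

```python
import random

def grow_network(n_nodes, randomness, links_per_node=3, seed=None):
    """Grow a follower network one node at a time.

    randomness=1.0: each new node picks accounts to follow uniformly at random.
    randomness=0.0: each new node picks accounts in proportion to popularity.
    Returns a dict mapping node id -> number of followers.
    """
    rng = random.Random(seed)
    # Start with three nodes that all follow each other (2 followers each).
    followers = {0: 2, 1: 2, 2: 2}
    for new in range(3, n_nodes):
        existing = list(followers)
        targets = set()
        while len(targets) < links_per_node:
            if rng.random() < randomness:
                # Random strategy: any existing account is equally likely.
                targets.add(rng.choice(existing))
            else:
                # Popularity strategy (assumed form): weight by followers + 1.
                weights = [followers[n] + 1 for n in existing]
                targets.add(rng.choices(existing, weights=weights, k=1)[0])
        for t in targets:
            followers[t] += 1
        followers[new] = 0
    return followers

random_net = grow_network(200, randomness=1.0, seed=1)
popular_net = grow_network(200, randomness=0.0, seed=1)
print("max followers (random):", max(random_net.values()))
print("max followers (popularity-driven):", max(popular_net.values()))
```

Under these assumptions one would expect the popularity-driven run to produce a few very heavily followed nodes, in line with the more broadcast-like structure described above.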
Now, another simulation here toggles the size of nodes to represent number of followers… Larger blobs represent really popular nodes… So if I run this in random mode again, you’ll see it looks very different…
So, why am I showing you this? Well, I like to show a really simple model. This is maybe 50 lines of code – you could build it in a few hours. The first message is that it is easy to build this kind of model. And even though we have a simple model we have at least 200 agents… We normally work with thousands or much greater scale, but you can still learn something here. You can see how to replicate the structure of a network. Maybe it is a starting point that requires more data to be added, but it is a place to start and explore. Even though it is a simple model, you can use it to build theory, to guide data collection and so forth.
So, having developed a model you can set up a simulation to run hundreds of times, to analyse with your data analytics tools… So I’ve run my 200 node network, 5000 simulations, comparing randomness and maximum links to a node – helping understand that different formation strategies create different structures. And that’s interesting but it doesn’t take us all the way. So I’d like to show you a different model that takes this a little bit further…
This model is an extension of the previous model – with all the previous assumptions – so you have two formation strategies, but also other assumptions we were talking about… That I am more likely to connect to accounts with shared interests, and with that we generate a simulation which is perhaps a better representation of the kinds of network we might see. And this accommodates the idea that this network has content, sharing, and other aspects that inform what is going on in the formation of that network. This visualisation looks pretty but the useful part is the output you can get at an aggregate level… We are looking at population level, seeing how interactions at the local level influence macro level patterns and behaviours… We can look at in-degree distribution, we can look at out-degree… We can look at local clustering coefficients, longest/shortest path, etc. And my assumptions might be plausible and reasonable…
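As a rough illustration of running a simulation many times and comparing settings, the sketch below repeats a small growth model at several randomness levels and averages the follower count of the most popular node. The model, the function names `max_followers` and `sweep`, and the run counts are assumptions for illustration, not the setup used in the session.

```python
import random
import statistics

def max_followers(n_nodes, randomness, rng, links=3):
    """Grow one network and return the follower count of its most popular node."""
    followers = [2, 2, 2]  # three seed nodes that follow each other
    for _ in range(3, n_nodes):
        targets = set()
        while len(targets) < links:
            if rng.random() < randomness:
                # Random strategy: pick any existing node.
                targets.add(rng.randrange(len(followers)))
            else:
                # Popularity strategy (assumed form): weight by followers + 1.
                weights = [f + 1 for f in followers]
                targets.add(rng.choices(range(len(followers)), weights=weights)[0])
        for t in targets:
            followers[t] += 1
        followers.append(0)
    return max(followers)

def sweep(levels, runs, n_nodes=200, seed=0):
    """Run `runs` simulations per randomness level; return the mean maximum."""
    rng = random.Random(seed)
    return {
        level: statistics.mean(max_followers(n_nodes, level, rng) for _ in range(runs))
        for level in levels
    }

results = sweep(levels=[0.0, 0.5, 1.0], runs=20)
for level, mean_max in results.items():
    print(f"randomness={level:.1f}  mean max followers={mean_max:.1f}")
```

The resulting table of settings and outcomes is the kind of output that can then be taken into a statistics or data analytics tool.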
So you can build models that give a much deeper understanding of real world dynamics… We are building an artificial network BUT you can combine this with real world data – load a real world network structure into the model and look at diffusion within that network, and understand what happens when one node posts something, what impact would that have, what information diffusion would that have…
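A minimal sketch of that last idea, diffusion over a loaded network, might look like the following. The toy network, the function name `simulate_cascade` and the single resharing probability are all hypothetical; a real study would load an observed follower network and use a behavioural rule grounded in theory or data.

```python
import random
from collections import deque

def simulate_cascade(followers_of, seed_node, share_prob, rng):
    """Simple cascade: each follower who sees a post reshares it with
    probability `share_prob`. Returns the set of accounts that shared it."""
    shared = {seed_node}
    queue = deque([seed_node])
    while queue:
        poster = queue.popleft()
        for follower in followers_of.get(poster, []):
            if follower not in shared and rng.random() < share_prob:
                shared.add(follower)
                queue.append(follower)
    return shared

# Hypothetical toy network: account -> accounts that follow it.
followers_of = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["e", "f"],
    "d": [],
    "e": ["g"],
    "f": [],
    "g": [],
}

rng = random.Random(42)
sizes = [len(simulate_cascade(followers_of, "a", 0.5, rng)) for _ in range(1000)]
print("mean cascade size from 'a':", sum(sizes) / len(sizes))
```

Repeating the cascade from different starting nodes gives a rough sense of how much a node's position in the network matters for diffusion.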
So I’ve shown you NetLogo to play with these models. If you want to play around, that’s a great first step. It’s easy to get started with and it has been developed for use in educational settings. There is a big community and lots of models to use. And if you download NetLogo you can download that library of models. Pretty soon, however, I think you’ll find it too limited. There are many other tools you can use… But in general you can use any programming language that you want… Repast and Mason are very common tools. And they are based on Java or C++. You can also use an ABM Python module.
In the folder for this session there are some papers that give a good introduction to agent-based modelling… If we think about agent-based modelling and network theory there are some books I would recommend: Namatame & Chen: Agent-based modelling and Network dynamics. For ABM, look at Miller & Scott; Gilbert and Troitzsch; Epstein. For network theory, look at Jackson, Watts (& Strogatz), Barabasi.
So, three things:
Simplify! – You don’t need millions of agents. A simple model can be more powerful than a realistic one
Iterate! – Start simple and, as needed, build up complexity, add more features, but only if necessary.
Validate? – You can build models in a speculative way to guide research, to inform data collection… You don’t always have to validate that model as it may be a tool for your thinking. But validation is important if you want to be able to replicate and ensure relevance in the real world.
We started talking about data collection, analysis, and how we build theory based on the data we collect. After lunch we will continue with Carolin, Anne and Fernando on Tracking the Trackers. At the end of the day we’ll have a full panel Q&A for any questions.
And we are back after lunch and a little exposure to the Berlin rain!
Tracking the Trackers (Anne Helmond, Carolin Gerlitz, Esther Weltevrede and Fernando van der Vlist)
Carolin: Having talked about tracking users and behaviours this morning, we are going to talk about studying the media themselves, and of tracking the trackers across these platforms. So what are we tracking? Berry (2011) says:
“For every explicit action of a user, there are probably 100+ implicit data points from usage; whether that is a page visit, a scroll etc.”
Whenever a user makes an action on the web, a series of tracking features are enabled, things like cookies, widgets, advertising trackers, analytics, beacons etc. Cookies are small pieces of text that are placed on the user’s computer indicating that they have visited a site before. These are 1st party trackers and can be accessed by the platforms and webmasters. There are now many third party trackers such as Facebook, Twitter, Google, and many websites now place third party cookies on the devices of users. And there are widgets that enable this functionality with third party trackers – e.g. Disqus.
So we have first party tracker files – text files that remember, e.g. what you put in a shopping cart; third party tracker files used by marketers and data-gathering companies to track your actions across the web; you have beacons; and you have flash cookies.
The purpose of tracking varies: some of it provides useful functionality (e.g. the shopping basket example), but it is increasingly prevalent for profiling users and behaviours. The increasing use of trackers has resulted in them becoming more visible. There is lots of research looking at the prevalence of tracking across the web, from the Continuum project and the Guardian’s Tracking the Trackers project. One of the most famous plugins that allows you to see the trackers in your own browser is Ghostery – a browser plugin that you can install and that immediately detects different kinds of trackers, widgets, cookies, analytics tracking on the sites that you browse to… It shows these in a pop up. It allows you to see the trackers and to block trackers, or selectively block trackers. You may want to selectively block trackers as whole parts of websites disappear when you switch off trackers.
Ghostery detects via tracker library/code snippets (regular expressions). It currently detects around 2295 trackers – across many different varieties. The tool is not uncontroversial. It started as an NGO but was bought by analytics company Evidon in 2010, using the data for marketing and advertising.
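The general idea of detecting trackers by matching regular expressions against page source can be sketched as below. The pattern library here is hypothetical and purely illustrative: it is not Ghostery's database, and the function name `detect_trackers` is an assumption.

```python
import re

# Hypothetical pattern library: tracker name -> regular expression.
# These patterns are illustrative only, not entries from a real tracker database.
TRACKER_PATTERNS = {
    "Google Analytics": re.compile(r"google-analytics\.com/(ga|analytics)\.js"),
    "Facebook Connect": re.compile(r"connect\.facebook\.net/[^/]+/(all|sdk)\.js"),
    "DoubleClick": re.compile(r"doubleclick\.net"),
}

def detect_trackers(html):
    """Return the sorted names of all trackers whose pattern occurs in the page source."""
    return sorted(
        name for name, pattern in TRACKER_PATTERNS.items() if pattern.search(html)
    )

sample_page = """
<html><head>
<script src="https://www.google-analytics.com/analytics.js"></script>
<script src="https://connect.facebook.net/en_US/sdk.js"></script>
</head><body>Hello</body></html>
"""
print(detect_trackers(sample_page))  # ['Facebook Connect', 'Google Analytics']
```

A static scan like this only sees what is in the fetched source, which is one reason results can differ from what a browser plugin reports during a live visit.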
So, we thought that if we, as researchers, want to look at trackers and there are existing tools, let’s repurpose existing tools. So we did that, creating a Tracker Tracker tool based on Ghostery. It takes up a logic of Digital Methods, working with lists of websites. So the Tracker Tracker tool has been created by the Digital Methods Initiative (2012). It allows us to detect which trackers are present on lists of websites and create a network view. And we are “repurposing analytical capabilities”. So, what sort of project can we use this with?
One of our first projects was on the Like Economy. Our starting point was the fact that social media widgets place cookies (Gerlitz and Helmond 2013), and the question of where they are present. These cookies track both platform users and website users. We wanted to see how pervasive these cookies were on the web, and on the most used sites on the web.
We started by using Alexa to identify a collection of 1000 most-visited websites. We inputted it into the Tracker Tracker tool (it’s only one button so options are limited!). Then we visualised the results with Gephi. And what did we get? Well, in 2012 only 18% of top websites had Facebook trackers – if we did it again today it would probably be different. This data may be connected to personal user profiles – when a user has been previously logged in and has a profile – but it is also being collected for non-users of Facebook, they create anonymous profiles but if they subsequently join Facebook that tracking data can be fed into their account/profile.
Since we did this work we have used this method on other projects. Now I’ll hand over to Anne to do a methods walkthrough.
Anne: Now you’ve had a sense of the method I’m going to do a dangerous walkthrough thing… And then we’ll look at some other projects here.
So, a quick methodological summary:
- Research question: type of tracker and sites
- Website (URL) collection making: existing expert list.
- Input list for Tracker Tracker
- Run Tracker Tracker
- Analyse in Gephi
So we always start with a research question… Perhaps we start with websites we wouldn’t want to find trackers on – where privacy issues are heightened e.g. children’s websites, porn websites, etc. So, homework here – work through some research question ideas.
Today we’ll walk through what we will call “adult sites”. So, we will go to Alexa – which is great for locating top sites in categories, in specific countries, etc. We take that list, we put it into Tracker Tracker – choosing whether or not to look at the first level of subpages – and press the button. The tool looks at the Ghostery database, which now scans those websites for the possible 2600 trackers that may exist.
Carolin: Maybe some of you are wondering if it’s ok to do this with Ghostery? Well, yes, we developed Tracker Tracker in collaboration with Ghostery when it was an NGO, with one of their developers visiting us in Amsterdam. One other note here: if you use Ghostery on your machine, the trackers you see may be different to your neighbour’s. Trackers vary by machine, by location, by context. That’s something we have to take into account when requesting data. So for news websites you may, for instance, have more and more trackers generated the longer the site is open – this tool only captures a short window of time so may not gather all of the trackers.
Anne: Also in Europe you may encounter so-called cookie walls. You have to press OK to accept cookies… And the tool can’t emulate user experience in clicking beyond the cookie walls… So zero trackers may indicate that issue, rather than no trackers.
Q: Is it server side or client side?
A: It is server side.
Q: And do you cache the tracker data?
A: Once you run the tool you can save the CSV and Gephi files, but we don’t otherwise cache.
Anne: Ghostery updates very frequently so that makes it most useful to always use the most up to date list of trackers to check against.
So, once we’ve run the Tracker Tracker tool you get outputs that can be used in a variety of flexible formats. We will download the “exhaustive” CSV – which has all of the data we’ve found here.
If I open that CSV (in Excel) we can see the site, the scheme, the pattern that was used to find the tracker, the name of the tracker… This is very detailed information. So for these adult sites we see things like Google Analytics, the Porn Ad network, Facebook Connect. So, already, there is analysis you could do with this data. But you could also do further analysis using Gephi.
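As an example of the kind of quick analysis possible before moving to Gephi, the sketch below counts how many sites each tracker appears on. The column names `site` and `tracker` and the sample rows are assumptions for illustration; the real export may use different headers and contain more columns.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample in the spirit of the export; real column names may differ.
sample_csv = """site,tracker
example-one.com,Google Analytics
example-one.com,DoubleClick
example-two.com,Google Analytics
example-three.com,Facebook Connect
example-three.com,Google Analytics
"""

def summarise(csv_file):
    """Return (sites per tracker, trackers per site) from a site/tracker CSV."""
    sites_per_tracker = Counter()
    trackers_per_site = defaultdict(set)
    for row in csv.DictReader(csv_file):
        # Count each site/tracker pair once, even if several patterns matched.
        if row["tracker"] not in trackers_per_site[row["site"]]:
            trackers_per_site[row["site"]].add(row["tracker"])
            sites_per_tracker[row["tracker"]] += 1
    return sites_per_tracker, trackers_per_site

sites_per_tracker, trackers_per_site = summarise(io.StringIO(sample_csv))
for tracker, n_sites in sites_per_tracker.most_common():
    print(f"{tracker}: found on {n_sites} site(s)")
```

The same site/tracker pairs are effectively the edge list of the two-mode network (sites connected to trackers) that gets visualised in Gephi.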
Now, we have steps of this procedure in the tutorial that goes with today’s session. So here we’ve coloured the sites in grey, and we’ve highlighted the trackers in different colours. The purple lines/nodes are advertising trackers for instance.
If you want to create this tracker network at home, you have all the steps here. And doing this work we’ve found trackers we’d never seen before – for instance the porn industry ad network DoublePimp (a play on DoubleClick) – and we could see regional and geographic differences between trackers, which of course has interesting implications.
So, some more examples… We have taken this approach looking at Jihadi websites, working with e.g. governments to identify the trackers. And found that they are financially dependent on advertising, including SkimLinks, DoubleClick, Google AdSense.
Carolin: And in almost all networks we encounter DoubleClick, AdSense, etc. And it’s important to know that webmasters enable these trackers, they have picked these services. But there is an issue of who selects you as a client – something journalists collaborating on this work raised with Google.
Anne: The other usage of these trackers has been in historical tracking analysis using the Internet Archive. This enables you to see the website in the context of a techno-commercial configuration, and to analyse it in that context. So for instance looking at New York Times trackers and the website as an ecosystem embedded in the wider context – in this case trackers decreased but that was commercial concentration, from companies buying each other therefore reducing the range of trackers.
Carolin: We did some work called the Trackers Guide. We wanted to look not only at trackers, but also look at Content Delivery Networks, to visualise on a website how websites are not single items, but collections of data with inflows and outflows. The result became part artwork, part biological field guide. We imagined content and trackers as little biological cell-like clumps on the site, creating a whole booklet of this guide. So the image here shows the content from other spaces, content flowing in and connected…
Anne: We were also interested in what kind of data is being collected by these trackers. And also who owns these trackers. And also the countries these trackers are located in. So, we used this method with Ghostery. And then we dug further into those trackers. For Ghostery you can click on a tracker and see what kind of data it collects. We then looked at privacy policies of trackers to see what it claims to collect… And then we manually looked up ownership – and nationality – of the trackers to understand rules, regulations, etc. – and seeing where your data actually ends up.
Carolin: Working with Ghostery, and repurposing their technology, was helpful but their database is not complete. And it is biased to the English-speaking world – so it is particularly lacking in Chinese contexts for instance. So there are limits here. It is not always clear what data is actually being collected. BUT this work allows us to study invisible participation in data flows – that cannot be found in other ways; to study media concentration and the emergence of specific tracking ecologies. And in doing so it allows us to imagine alternative spatialities of the web – tracker origins and national ecologies. And it provides insights into the invisible infrastructures of the web.
Slides for this presentation: http://www.slideshare.net/cgrltz/aoir-2016-digital-methods-workshop-tracking-the-trackers-66765013
Multiplatform Issue Mapping (Jean Burgess & Ariadna Matamoros Fernandez)
Jean: I’m Jean Burgess and I’m Professor of Digital Media and Director of the DMRC at QUT. Ariadna is one of our excellent PhD students at QUT but she was previously at DMI so she’s a bridge to both organisations. And I wanted to say how lovely it is to have the DMRC and DMI connected like this today.
So we are going to talk about issue mapping, and the idea of using issue mapping to teach digital research methods, particularly with people who may not be interested in social media outside of their specific research area. And about issue mapping as an approach that sits outside the “influencers” narrative that is dominant in the marketing side of social media.
We are in the room with people who have been working in this space for a long time but I just want to raise that we are making connections to ANT and cultural and social studies. So, a few ontological things… Our approach combines digital methods and controversy analysis. We understand there to be Controversies which are discrete, acute and often temporary, and which are sites of intersectionality, bringing together different issues in new combinations. And drawing on Latour, Callon etc. we see controversies as generative. They can reveal the dynamics of issues, bring them together in new combinations, transform them and move them forward. And we undertake network and content analysis to understand relations among stakeholders, arguments and objects.
There are both very practical applications and more critical-reflexive possibilities of issue mapping. And we bring our own media studies viewpoint to that, with an interest in the vernacular of the space.
So, issue mapping with social media frequently starts with topical Twitter hashtags/hashtag communities. We then have iterative “issue inventories” – actors, hashtags, media objects from one dataset used as seeds on their own. We then undertake some hybrid network/thematic analysis – e.g. associations among hashtags; thematic network clusters. And we inevitably meet the issue of multi-platform/cross-platform engagement. And we’ll talk more about that.
One project we undertook on the #agchatoz, which is a community in Australia around weekly Twitter chats, but connected to a global community, explored the hashtag as a hybrid community. So here we looked at, for instance, the network of followers/followees in this network. And within that we were able to identify clusters of actors (Left-leaning Twitterati (30%); Australian ag, farmers (29%); Media orgs, politicians (13%); International ag, farmers (12%); Foodies (10%); Right-wing Australian politics and others), and this reveals some unexpected alliances or crossovers – e.g. between animal rights campaigners and dairy farmers. That suggests opportunities to bridge communities, to raise challenges, etc.
We have linked, in the files for this session, to various papers. One of these, Burgess and Matamoros-Fernandez (2016) looks at Gamergate and I’m going to show a visualisation of the YouTube video network (Rieder 2015; Gephi), which shows videos mentioned in tweets around that controversy, showing those that were closely related to each other.
Ariadna: My PhD is looking at another controversy, this one concerning Adam Goodes, an Australian Rules Footballer who was a high profile player until he retired last year. He has been a high profile campaigner against racism, and has called out racism on the field. He has been criticised for that by one part of society. And in 2014 he performed an Indigenous war dance on the pitch, which again received booing from the crowd and backlash. So, I start with Twitter, follow the links, and then move to those linked platforms and onwards…
Now I’m focusing on visual material, because the controversy was visual, it was about a gesture. So visual content (images, videos, gifs) acts as a mediator of race and racism on social media. I have identified key media objects through qualitative analysis – important gestures, different image genres. And the next step has been to reflect on the differences between platform traces – YouTube related videos, Facebook like network, Twitter filters, notice and take down automatic messages. That gives a sense of the community, the discourse, the context, exploring their specificities and how they contribute to the cultural dynamics of race and racism online.
Jean: And if you want to learn more, there’s a paper later this week!
So, we usually do training on this at DMRC #CCISS16 Workshops. We usually ask participants to think about YouTube and related videos – as a way to encourage people to think about networks other than social networks, and also to get to grips with Gephi.
Ariadna: Usually we split people into small groups and actually it is difficult to identify a current controversy that is visible and active in digital media – we look at YouTube and Tumblr (Twitter really requires prior collection of data). So, we go to YouTube to look for a key term, and we can then filter and find results changing… Usually you don’t reflect that much. So, if you look at “Black Lives Matter”, you get a range of content… And we ask participants to pick out relevant results – and what is relevant will depend on the research question you are asking. That first choice of what to select is important. Once this is done we get participants to use the YouTube Data Tools: https://tools.digitalmethods.net/netvizz/youtube/. This tool enables you to explore the network… You can use a video as a “seed”, or you can use a crawler that finds related videos… And that can be interesting… So if you see an Anti-Islamic video, does YouTube recommend more, or other videos related in other ways?
That seed leads you to related videos, and, depending on the depth you are interested in, videos related to the related videos… You can make selections of what to crawl, what the relevance should be. The crawler runs and outputs a Gephi file. So, this is an undirected network. Here nodes are videos, edges are relationships between videos. We generally use the layout: Force Atlas 2. And we run the Modularity Report to colour code the relationships on thematic or similar basis. Gephi can be confusing at first, but you can configure and use options to explore and better understand your network. You can look at the Data Table – and begin to understand the reasons for connection…
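The crawl logic described here (start from a seed, follow related videos to a chosen depth, record the links as edges) can be sketched as a breadth-first search. This is not the YouTube Data Tools code: the `RELATED` dictionary is a hypothetical stand-in for a real related-videos lookup, and the function name `crawl_related` is an assumption.

```python
from collections import deque

# Hypothetical stand-in for a "related videos" lookup; a real crawl would
# query the platform instead of this hard-coded dictionary.
RELATED = {
    "seed": ["v1", "v2"],
    "v1": ["v2", "v3"],
    "v2": ["v4"],
    "v3": [],
    "v4": ["seed"],
}

def crawl_related(seeds, depth):
    """Breadth-first crawl: follow related-video links up to `depth` steps
    from the seeds. Returns a list of (source, target) edges between videos."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    edges = []
    while queue:
        video, level = queue.popleft()
        if level == depth:
            continue  # do not expand videos at the maximum depth
        for related in RELATED.get(video, []):
            edges.append((video, related))
            if related not in seen:
                seen.add(related)
                queue.append((related, level + 1))
    return edges

print(crawl_related(["seed"], depth=1))  # [('seed', 'v1'), ('seed', 'v2')]
print(len(crawl_related(["seed"], depth=2)), "edges at depth 2")  # 5 edges at depth 2
```

An edge list like this is what a network file for Gephi encodes, with videos as nodes and related-video links as edges; the number of videos grows quickly with depth, which is why the crawl depth matters.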
So, I have done this for Adam Goodes videos, to understand the clusters and connections.
So, we have looked at YouTube. Normally we move to Tumblr. But sometimes a controversy does not resonate on a different social media platform… So maybe a controversy on Twitter, doesn’t translate on Facebook; or one on YouTube doesn’t resonate on Tumblr… Or keywords will vary greatly. It can be a good way to start to understand the cultures of the platforms. And the role of main actors etc. on response in a given platform.
With Tumblr we start with the interface – e.g. looking at BlackLivesMatter. We look at the interface, functionality, etc. And then, again, we have a tool that can be used: https://tools.digitalmethods.net/netvizz/tumblr/. We usually encourage use of the same timeline across Tumblr and YouTube so that they can be compared.
So we can again go to Gephi, visualise the network. And in this case the nodes and edges can look different. So in this example we see 20 posts that connect 141 nodes, reflecting the particular reposting nature of that space.
Jean: The very specific cultural nature of the different online spaces can make for very interesting stuff when looking at controversies. And those are really useful starting points into further exploration.
And finally, a reminder, we run our summer schools in DMRC in February. When it is summer! And sunny! Apply now at: http://dmrcss.org/!
Analysing and visualising geospatial data (Peta Mitchell)
Normally when I do this as a workshop I’d give some theoretical and historical background of the emergence of geospatial data, and then move onto the practical workshop on Carto (was CartoDB). Today though I’m going to talk about a case study, around the G20 meeting in Brisbane, and then talk about using Carto to create a social media map.
My own background is a field increasingly known as the geo humanities or the spatial humanities. And I did a close reading project of novels and films to create a Cultural Atlas of Australia. And how locations relate to narrative. For instance almost all films are made in South Australia, regardless of where they are set, mapping patterns of representation. We also created a CultureMap – an app that went with a map to alert you to literary or filmic places nearby that related back to that atlas.
I’ll talk about that G20 work. I now work on rapid spatial analytics; participatory geovisualisation and crowdsourced data; VGI – Volunteered Geographic Information; placemaking etc. But today I’ll be talking about emerging forms of spatial information/geodata, neogeographical tools etc.
So Gordon and de Souza e Silva (2011) talk about us witnessing the increasing proliferation of geospatial data. And this is sitting alongside a geospatial revolution – GPS enabled devices, geospatial data permeating social media, etc. So GPS emerged in the late ’90s/early 00’s with a slight social friend-finder function. But the geospatial web really begins around 2000, the beginning of the end of the idea of the web as a “placeless space”. To an extent this came from a legal case brought by a French individual against Yahoo!, who were allowing Nazi memorabilia to be sold. That was illegal in France, and Yahoo! claimed that the internet is global, and claimed that it wasn’t possible to block by location. A French judge found in favour of the individual, Yahoo! were told it was both doable and easy, and Yahoo! went on to financially benefit from IP based location information. As Richard Rogers put it, that case was the “revenge of geography against the idea of cyberspace”.
Then in 2005 Jon Udell described Google Maps as having the potential to be a “service factory for the geospatial web”. So in 2005 the “geospatial web” really is there as a term. By 2006 the concept of “Neogeography” was defined by Andrew (?) to describe the kind of non-professional, user-orientated, web 2.0-enabled mapping. There are critiques in cultural geography, in geospatial literature about this term, and the use of the “neo” part of it. But there are multiple applications here, from humanities to humanitarianism; from cultural mapping to crisis mapping. An example here is Ushahidi maps, where individuals can send in data and contribute to mapping of crisis. Now Ushahidi is more of a platform for crisis mapping, and other tools have emerged.
So there are lots of visualisation tools and platforms. There are traditional desktop GIS – ArcGIS, QGIS. There is basic web-mapping (e.g. Google Maps); Online services (E.g. CARTO, Mapbox); Custom map design applications (e.g. MapMill); and there are many more…
Spatial data is not new, but there is a growth in ambient and algorithmic spatial data. So for instance ABC (TV channel in Australia) did some investigation, inviting audiences to find out as much as they could based on their reporter Will Ockenden’s metadata. So, his phone records, for instance, revealed locations, a sensitive data point. And geospatial data is growing too.
We now have a geospatial sub stratum underpinning all social media networks. So this includes check-in/recommendation platforms: Foursquare, Swarm, Gowalla (now defunct), Yelp; Meetup/hookup apps: Tinder, Grindr, Meetup; YikYak; Facebook; Twitter; Instagram; and Geospatial Gaming: Ingress; Pokemon Go (from which Google has been harvesting improvements for its pedestrian routes).
Geospatial media data is generated from sources ranging from VGI (Volunteered geographic information) to AGI (ambient geographic information), where users are not always aware that they are sharing data. That type of data doesn’t feel like crowd sourced data or VGI, hence the potential challenges, potential and ethical complexity of AGI.
So, the promises of geosocial analysis include a focus on real-time dynamics – people working with geospatial data aren’t used to this… And we also see social media as a “sensor network” for crisis events. There is also potential to provide new insights into spatio-temporal spread of ideas and actions; human mobilities and human behaviours.
People do often start with Twitter – because it is easier to gather data from it – but only between 1% and 3% of Tweets are located. But when we work at festivals we see around 10% being location data – partly a nature of the event, partly as Tweets are often coming through Instagram… On Instagram we see between 20% and 30% of images georeferenced, but based on upload location, not where image is taken.
There is also the challenge of geospatial granularity. On a tweet with Lat Long, that’s fairly clear. When we have a post tagged with a place we essentially have a polygon. And then when you geoparse, what is the granularity – street, city? Then there are issues of privacy and the extent to which people are happy to share that data.
So, in 2014 Brisbane hosted the G20, at a cost of $140 AUS for one highly disruptive weekend. In preceding G20 meetings there had been large scale protests. At the time the premier of the city was former military and he put the whole central business district in lockdown, designated as a “declared area” – under new laws made for this event. And hotels for G20 world leaders were inside the zone. So, Twitter mapping is usually during crisis events – but you don’t know where this will happen, where to track it, etc. In this case we knew in advance where to look. So, a Safety and Security Act (2013) was put in place for this event, requiring prior approval for protests; arrests for the duration of the event; on the spot strip search; banning of eggs in the central business district, no manure, no kayaks or floatation devices, no remote control cars or reptiles!
So we had these fears of violent protests, given all of these draconian measures. We had elevated terror levels. And we had war threatened after Abbott said he would “shirtfront” Vladimir Putin over MH17. But all that concern made city leaders concerned that the city might be a ghost town, when they wanted it marketed as a new world city. They were offering free parking etc. to incentivise them to come in. And tweets reinforced the ghost town trope. So, what geosocial mapping enabled was a close to realtime sensor network of what might be happening during the G20.
So, the map we did was the first close to real time social media map that was public facing, using CARTODB, and it was never more than an hour behind reality. We had few false matches. But we had clear locations and clear keywords – e.g. G20 – to focus on. A very few “the meeting will now be held in G20” but otherwise no false matches. We tracked the data through the meeting… Which ran over a weekend and bank holiday. This map parses around 17,000(?) tweets, most of which were not geotagged but parsed. Only 10% represent where someone was when they tweeted, the remaining 90% are subjects of posts from geoparsing of tweets.
Now, even though that declared area isn’t huge, there are over 300 streets there. I had to build a manually constructed gazetteer, using Open Street Map (OSM) data, and then new data. Picking a bounding box that included that area generated a whole range of features – but I wasn’t that excited about fountains, benches etc. I was looking for places and features people might actually mention in their tweets. So, I had a bounding box, and the declared area before… Would have been ideal if the G20 had given me their bounding polygon but we didn’t especially want to draw attention to what we were doing.
So, at the end we had lat, long, amenity (using OSM terms), name (e.g. Obama was at the Marriott so tweets about that), associated search terms – including local/vernacular versions of names of amenities; Status (declared or restricted); and confidence (of location/coordinates – score of 1 for geospatially tagged tweets, 0.8 for buildings, etc.). We could also create category maps of different data sets. On our map we showed geospatial and parsed tweets inside the area, but we only used geotweets outside the declared area. One of my colleagues created a Python script to “read” and parse tweets, and that generated a CSV. That CSV could then be fed into CARTODB. CARTODB has a time dimension, could update directly every half hour, and could use a Dropbox source to do that.
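The gazetteer-plus-parser workflow can be sketched roughly as follows. This is not the script used in the project: the gazetteer rows, coordinates, field names and function names (`locate`, `to_csv`) are hypothetical, and the matching is a simple substring check on lower-cased text.

```python
import csv
import io

# Hypothetical gazetteer rows; names, coordinates and scores are illustrative only.
GAZETTEER = [
    {"name": "Marriott", "lat": -27.4640, "lon": 153.0310,
     "terms": ["marriott", "the marriott"], "status": "declared", "confidence": 0.8},
    {"name": "South Bank", "lat": -27.4810, "lon": 153.0230,
     "terms": ["south bank", "southbank"], "status": "restricted", "confidence": 0.6},
]

def locate(tweet):
    """Return a location record for a tweet, or None if no location can be assigned."""
    if tweet.get("lat") is not None and tweet.get("lon") is not None:
        # Geotagged tweets keep their own coordinates with full confidence.
        return {"lat": tweet["lat"], "lon": tweet["lon"], "place": "", "confidence": 1.0}
    text = tweet["text"].lower()
    for entry in GAZETTEER:
        # Geoparsing step: look for any known search term in the tweet text.
        if any(term in text for term in entry["terms"]):
            return {"lat": entry["lat"], "lon": entry["lon"],
                    "place": entry["name"], "confidence": entry["confidence"]}
    return None

def to_csv(tweets):
    """Write located tweets to CSV text suitable for upload to a mapping service."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["text", "lat", "lon", "place", "confidence"])
    writer.writeheader()
    for tweet in tweets:
        location = locate(tweet)
        if location:
            writer.writerow({"text": tweet["text"], **location})
    return out.getvalue()

tweets = [
    {"text": "Motorcade outside the Marriott #G20", "lat": None, "lon": None},
    {"text": "Quiet in the city today #G20", "lat": -27.47, "lon": 153.02},
    {"text": "Watching #G20 on TV", "lat": None, "lon": None},
]
print(to_csv(tweets))
```

Keeping a confidence column alongside each point makes it possible to map geotagged and geoparsed tweets differently, or to filter out the less certain matches.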
So, did we see much disruption? Well no… About celebrity spotting – the two most tweeted images were Obama with a koala and Putin with a koala. It was very hot and very secured so little disruption happened. We did see selfies with Angela Merkel, images of phallic motorcade. And after the G20 there was a complaint filed to board of corruption about the cooling effect of security on participation, particularly in environmental protests. There was still engagement on social media, but not in-person. Disruption, protest, criticism were replaced by spectacle and distant viewing of the event.
And, with that, we turn to an 11 person panel session to answer questions, wrap up, etc.
Q1) Each of you presented different tools and approaches… Can you comment on how they are connected and how we can take advantage of that?
A1 – Jean) Implicitly or explicitly we’ve talked about possibilities of combining tools together in bigger projects. And tools that Peta and I have been working on are based on DMI tools for instance… It’s sharing tools, shared fundamental techniques for analytics for e.g. a Twitter dataset…
A1 – Richard) We’ve never done this sort of thing together… The fact that so much has been shared has been remarkable. We share quite similar outlooks on digital methods, and also on “to what end” – largely for the study of social issues and mapping social issues. But also other social research opportunities available when looking at a variety of online data, including geodata. It’s online web data analysis using digital methods for issue mapping and also other forms of social research.
A1 – Carolin) All of these projects are using data that hasn’t been generated by research, but which has been created for other purposes… And that’s pushing the analysis in their own way… And tools that we combine bring in levels, encryptions… Digital methods use these, but also a need to step back and reflect – present in all of the presentations.
Q2) A question especially for Carolin and Anne: what do you think about the study of proprietary algorithms? You talked a bit about the limitations of proprietary algorithms – for mobile applications etc? I’m having trouble doing that…
A2 – Anne) I think in the case of the tracker tool, it doesn’t try to engage with the algorithm, it looks at presence of trackers. But here we have encountered proprietary issues… So for Ghostery, if you download a Firefox plugin you can access the content. We took the library of trackers from that to use as a database, we took that apart. We did talk to Ghostery, to make them aware… The question of algorithms… Of how you get to the blackbox things… We are developing methods to do this… One way in is to see the outputs, and compare that. Also Christian Sandvig is doing the auditing algorithms work.
A2 – Carolin) There was just a discussion on Twitter about currency of algorithms and research on them… We’ve tried to ride on them, to implement that… Otherwise difficult. One element was on studying mobile applications. We are giving a presentation on this on Friday. Similar approach here, using infrastructures of app distribution and description etc. to look into this… Using existing infrastructures in which apps are built or encountered…
A2 – Anne) We can’t screenscrape and we are moving to this more closed world.
A2 – Richard) One of the best ways to understand algorithms is to save the outputs – e.g. we’ve been saving Google search outputs for years. Trying to save newsfeeds on Facebook, or other sorts of web apps can be quite difficult… You can use the API but you don’t necessarily get what the user has seen. The interface outputs are very different from developer outputs. So people think about recording rather than saving data – an older method in a way… But then you have the problem of only capturing a small sample of data – like analysing TV News. The new digital methods can mean resorting to older media methods… Data outputs aren’t as friendly or obtainable…
A2 – Carolin) This one strand is accessing algorithms via transparency; you can also think of them as situated and in context, seeing it in operation and in action in relation to the data, associated with outputs. I’d recommend Solon Barocas on Big Data’s Disparate Impact, which sits in legal studies.
A2 – Jean) One of the ways we approach this is the “App Walkthrough”, a method Ben Light and I have worked on which will shortly be published in New Media & Society; it is a way to think about those older media approaches, with user studies part of that…
Q3) What is your position as researchers on opening up data, and doing ethically acceptable research on the other side? Do you take a stance, even a public stance, on these issues?
A3 – Anne) Many of these tools, like the YouTube tool, and his Facebook tools, our developer took the conscious decision to anonymise that data.
A3 – Jean) I do have public positions. I’ve published on the political economy of Twitter… One interesting thing is that privacy discourses were used by Twitter to shut down TwapperKeeper at a time it was seeking to monetise… But you can’t just publish an archive of tweets with usernames, I don’t think anyone would find that acceptable…
A3 – Richard) I think it is important to respect or understand contextual privacy. People posting, on Twitter say, don’t have an expectation of its use in commercial or research contexts. Awareness of that is important for a researcher, no matter what terms of service the user has signed/consented to, or even if you have paid for that data. You should be aware and concerned about contextual privacy… Which leads to a number of different steps. And that’s why, for instance, in Netvizz – the Facebook tool – usernames are not available for comments made, even though Facepager does show that. Tools vary in that understanding. Those issues need to be thought about, but not necessarily uniformly thought about by our field.
A3 – Carolin) But that becomes more difficult in spaces that require you to take part to research them – WhatsApp, for instance – where researchers start pretending to be regular users… to generate insights.
Comment (me): on native vs web apps and approaches and potential for applying Ghostery/Tracker Tracker methods to web apps which are essentially pointing to URLs.
Q4) Given that we are beholden to commercial companies, changes to algorithms, APIs etc, and you’ve all spoken about that to an extent, how do you feel about commercial limitations?
A4 – Richard) Part of my idea of digital methods is to deal with ephemerality… And my ideal is to follow the medium… Rather than to follow good data prescripts… If you follow that methodology, then you won’t be able to use web data or social media data… Unless you either work with the corporation or corporate data scientist – many issues there of course. We did work with Yahoo! on political insights… categorising search queries around a US election, which was hard to do from outside. But the point is that even on the inside, you don’t have all the insight or the full access to all the data… The question arises of what can we still do… What web data work can we still do… We constantly ask ourselves, I think digital methods is in part an answer to that, otherwise we wouldn’t be able to do any of that.
A4 – Jean) All research has limitations, and describing that is part of the role here… But also when Axel and I started doing this work we got criticism for not having a “representative sample”… And we have people from across humanities and social sciences who seem to be using the same approaches and techniques but actually we are doing really different things…
Q5) Digital methods in social sciences looks different from anthropology where this is a classical “informant” problem… This is where digital ethnography is there and understood in a way that it isn’t in the social sciences…
Resources from this workshop: