This afternoon I am attending a talk on the Privacy of Online Social Networks which has been arranged by the Social Network Analysis in Scotland Group (SNAS) and is taking place at the University of Edinburgh. The speakers are Jordi Herrera-Joancomarti, Cristina Perez-sola, and Jordi Casas-Roma, all from Universitat Autonoma de Barcelona (UAB). I’ll be taking notes throughout, although I think that the talk is also being recorded so may be available later. As ever this is a liveblog so corrections, comments, etc. welcome. (I will also be adding some images from the event later today as some of the processes discussed were quite complex and require illustration!)
We are opening with an introduction to the SNAS group, which meets at the University of Edinburgh on the last Tuesday of every month… They have a mailing list and I’ll add a link here later. Dr Jordi Herrera-Joancomarti is leading the talk, and is an expert on privacy and security.
Dr Jordi H-J: This is collaborative work with my colleagues Cristina and Jordi. My background is not social sciences but mathematics, so it is a good challenge for me to speak to a non technical audience here… Hopefully there are no scary mathematical equations here! I’ll open with an introduction, talk about Online Social Networks and graph theory, talk about the data you can mine, and I will talk about Online Social network Data anonimisation, and how you can release data from networks without compromising privacy, before coming to my conclusions.
So, to start with the definition of Online Social Network I am using is an “online service, platform or site that allos to create a user profle which can be connected with other user profiles of the network… ” – a very computer science definition.
So this can be about specialisms like Flickr, LastFM, WikiLoc… specialised format (e.g. Twitter); Scope limited (e.g. LinkedIn); General purpose (e.g. Faebook, Google+) etc. The denomination of connectivity can be network dependent (e.g. Facebook: friends; Twitter: followers). An dinteractions between user profiles are also network ependent (e.g. Facebook: “like” action, post a message; Twitter: tweet, Retweet etc).
So, why are OSN interesting or important? Well they have become an important part of people’s everyday communications, with huge volumes of users. But there is also a book, Big Data (Viktor Mayer-Schonberger and Kenneth Cukier) which includes chapter 5 “Datafication” talking about the quantification of the world along the time from differnt aspects. So, when words became data (Google books in 2004); when localization becomes data (GPS); and when relationships become data (OSN). For instance Facebook datafied relationships, and most notably with the introduction of “Social graph”.
To graph theory then. A graph is a mathematical tool used to represent objects (nodes) that can be connected by links (edges). OSN can be modeled using graphs and analysed with graph theory. So… You can represented connections between individuals etc.
There are different OSN propoerties that dertmine the type of the corresponding social graph:
– Undirected graphs are those with no meaning on the incidence of an edge in the node. Facebook social graph is an undirected graph. So, no arrows between individuals, no value to that edge.
– Directed graphs (digraph) are those in which the edges have a direction associated with them. Twitter social graph is a directed graph. For instance you can follow someone, they don’t have to follow you… So we have arrows here to indicate connection and direction.
– Weighted graphs assign a weight to every edge in a graph.
So, when you add direction to a graph you can borrow many analysis tools from graph theory. So if we try with a degree of a node in an undirected graph… The degree of a node is the number of edges incident to that node, denoted as deg(vi).
In a directed graph the same concept applies but it is more complex… We have In-degree of a node and that is the number of head endpoints adjacent to that node denoted as deg-(vi). Similarly we can have out-degree for number of tail endpoints, denoted as deg+(vi).
So, in a facebook social graph the degree of a node is the number of friends of that user. In Twitter social graph, the in-degree can be seen as the number of followers of that user. High in-degree may indicate a popular user. And the out degree can be seen as the number of users that person follows.
We can also talk about the clustering coefficient. We see local clustering coefficient of a node – the proportion of edges between the nodes within its neighbourhood divided by the number of edges that could possible exist between them… So it measures how far are the neighbourood of a node to become a clique. So this is how well the friends of a node are connected. These kinds of technical techniques can be used to understand user connections and relationships.
We study OSN privacy from an information-fundamental point of view, analysing OSN privacy from a graph mining perspective. We do not study specific OSN services, configurations or vulnerbailities. In some cases we do make some assumptions about the type of OSN: open vs closed profiles. For instance Facebook is more difficult to extract data from than Twitter, an open social network.
So there are two kinds of users information that can be extracted:
1) Node information – data about a specific user, details contaied in the users profile on a specific OSN
2) Edge information – data about the relationship between members of the network – and that is what we are most interested in.
Edge information can, however, directly disclose node attributes – e.g. an edge representing a sentimental relationships between two individuals of the same sex would be revealing about their sexual orientation. It is more difficult to protect edge information than node information – as it depends on behaviour of connected people whereas node information is controlled by just one user. Relations between users can also detect communities, and more node attributes.
So, I wanted to explain about data retrieval. How do you ontain social network information? Well you can ask OSN providers – but many are not that cooperative or put a great deal of restrictions/agreements to do that. They provide local and/or anonimised data. OR you can take the data from the OSN providers – that is not always possible adn depends on the open degree of the OSN service. And it is very important to take care on the mechanism used to obtain information as that may determine the bias of the data you collect.
You can gather data several ways. You can use a web crawler to gather daya from an open OSN (like Twitter). Web crawlers are computer programs that retrieve web pages starting from a single (or multiple) page and exploring all its linked pages and also the pages linked to those ones and so on. Since most of OSN interact through the web, you can use web crawlers for OSN data retrieval… The process is iterative…
A download is the interface between the OSN and the crawler – it downloads the users profiles and passes it to the parser, which then parses that data. You draw out the friends of that user and add them to the queue, which contains all the users that are awaiting to be explored, found when crawling every user. And the scheduler selects which user, from the ones in the queue will be explore and sends the decision to the downloader. The scheduler impacts on both performance and data quality.
If you are exploring the whole network then it is not so important to consider the crawler details… if I am crawling every member I will find all of the connections at the end… the order you gather data in doesn’t matter in that case. BUT you cannot crawl all of the network available now… So you will have to, at some point, decide to take a partial view of the network. So to do that we have to think about notification and assumptions…
Users can be crawled (one that all his profiles information and all friends are known to the crawler (v E Vcrawl). A discovered user (connected to the user crawled), and an explored user (discovered by relationship to discoverd user)?
So… for instance a Breath-First Search (BFS) Algorithm would start with one user (h)… you find they have two friends (d and j)… I crawl j and then discover they connect to users l and k and g (and I’ve already crawled d and h)… Then I crawl user d, finding connections to f, e, b, c… others are already found… Then I crawl l, find connections etc…
So, that is your schedule, the order you crawl. And the idea is that you can end up with all the elements of the network… This is quite a linear process. So, this is one approach, and this BFS algorithm produces graphs quite dissimilar to other algorithms you could use.
An alternative approach is the Depth-First Search (DFS) which works as a traditional stack, the first nodes to be crawled are the last ones that have been discovered (LIFO management). So, in this approach… If you start with user h… you discover j and d… But the next node you explore is d… then when you find connections to f, g, e, b, c… and you next explore node c. At the end you will end up with all the nodes as well… But in a different order than you had before… So, again, if you do this with a group of users (example here being 162 flickr nodes) it looks quite different…
Then you can do more intelligent things… You can use “greedy” algorithms:
– Real-degree greedy (hypothetical greedy or higherst-degree-crawler) takes its decisions based on the real degree (which may be unknown to the crawler before the node is crawled) of the nodes in the OSN. So a user has degree 5, degree 7 etc. based on the edges between different nodes… You can gather the whole network, or you may have restrictions and only capture part of the network…
– Explored-degree greedy (greedy) uses the actual known degree of the nodes in the OSN… So if you graph that you see many many connections, you look more conciously to the mode connected nodes.
You can also choose to select more variance in the network, to randomise your sample to an extent. This can be done with a lottery algorithm…
So, if you take information from a social network or a social network graph you have to be really well aware of what you are getting. When you do your sampling from different profiles, etc. that you understand what your sample is of. As far as you can see you can just adjust the scheduler to get what you want… you can do that to focus on particular users, types of users.
Schedulers have implications on privacy… depending on the level you select that has different implications… So your scheduler can have different objectives for the crawler – taking the privacy attackers point of view. So you can then understand which scheduler algorithm fits those objectives most appropriately…
You can also do more tricky things… For instance the classification of users from a graph point of views. So, I want to classify users, identifying the set of categories a new observation belongs to. The decision is made on the basis of a training set of data containing observations whose category membership is already known. When you try to classify users within the network, you can see link information which may help you to classify a user – connections to a community for instance.
The idea is that you can see classification as a privacy attack – user classification allows an attacker to infer private attributes from the user. Attributes may be sensitive by themselbes, attribute disclosuer may have undesirable consequences for the user. So the design of a user (node) classifer that uses the graph structure alone (no semantic infomation needed)… So, for instance… We may classify the user, with a neighborhood analysis to better classify the user… So the classifer analyses the graph structure and maps each node to a 2-dimensional sample using degree and clustering coefficient. The output is an initial assignation of nodes to categories…
And you can make that neighborhood information to classify the node… You can also have a relational classifier, which maps users to n-dimensional samples, using both degree and clustering coefficient and the neighborhood information to classify users…
So coming to the issue of data and data release… When you obtain a collection of data… you may have a more anonymised data view… You may see connections etc. but without user names, for instance. The intention is to preserve the privacy of users. But is this enough? Well no… this nieve anonimisation potentially reveals huge amounts about the user… if you know other data (other than names), you may be able to deduce who is in the network, you might find one user in the network and thus expose others. Removing the identifiers is not enough… So, you have to do something more elaborate…
One approach is to modify the edges – adding or deleting edges to hinder re-identification… But the problem is that you have two opposite objectives: On th eone hand you want to maximise the data utility and you want to minimise noise in that data. But you also want to preserve users privacy…
So, there are different ways to quantify the objective…. There are generic information loss measures (GIL) – measures like average distance, diameter, harmonic mean of shortest distance, etc… You want to preserve that in your data. So… you have the original network, you do one metric… and end up with a different network that is anonimised, and you can apply a similar metric afterwards to use it… In statistical databases you can preserve the mean of all the registers that sold boots (say)… If you know the questions to ask of that data, you know the process to keep that anonimised data close to the original data set…
You can also use specific information loss measures (clustering process)… Similar problem here… You have the original clusters, you use a clustering method to get to an anonimised (perturbed) version.
So, some measures behave in a similar way independently of the data in which they are gathered.
And then you have the idea of k-anonimity. A model that indicates that an attacker can not distinguish between different k records although he managed to find a group of quasi-identifiers. Therefore the attackers can not re-identify an individual. So, node degree can be the quasi-identifier… We can presume the attacker may know some of the nodes in the network… We can preserve the degree sequence, and the ordered degree sequence. And you can measure the k degree by understanding how many nodes have the same degree. So if two nodes in the network have degree 4, then the k-degree anonymity is 2. You can then make use of this to preserve the graph…
To modify the graph you can use edge modification (adding and/or deleting); node modification (adding and/or deleting). You can use uncertain graphs – adding or removing edges “particially” by assigning a probabiity to each edge. The set of all possible edges is considered and a probability is assigned to each edge.
Edge modification can include edge rotation, random perturbation, relevant edge identification, k-anonymity orientated anonimisation. These can allow you to keep data you want to keep, whilst preserving user privacy.
So, in conclusion, OSN can be modeled with social graph and analysed using graph mining techniques. Web crawlers may retrieve sensitive information from OSNs but the quality of the collected information will depend on the scheduler algorithm specitifities. Relational classifiers may provide relevant user information by just analyzing the graph structure information… Data anonimisation is needed for releasing OSN data without compromising the user’s privacy. This is a research field that is quite new and quite difficult… unlike statistical databases, where you can change one user without impacting on others, any change here does effect the network. And anonymisation algorithms need a trade-off between information loss and user anonymity loss.
Q1) You talked about how much stuff is being datafied… Soon with smart watches we’ll have health data available. Because crawlers take some time… things could change whilst you are crawling.
A1) One of the problems in social networks and graph theory, is that algorithms for this sort of data are complex and time consuming… And that is a problem… Especially at scale. And sometimes you have the information, you make a lot of computation but the information is not static… so not only a lot of work not only on algorithms but also on understanding different and changes in the network – what happens when a node is removed for instance. There are people working on algorithms for dynamic data… But much m
Q2) What kind of research questions have you been using this with?
A2) There are two different issues for me in terms of social sciences… We don’t start with research questions… we start with problem and try to start it… So when AOL released data about lots of servers… you could identify individuals from the data… but you shouldn’t be able to… That happens because they don’t understand or care about anonymising data. So we are trying to provide tools to enable that anonymisation. We also have ideas about the crawling approach… So as a social network provider you might want to avoid this type of crawler… you might use this approach to trap or mislead the crawler… So the crawler end up in a dead end… and cannot crawl the network.
Q3) Some of the techniques you showed there were about anonymisation… do you use removal of nodes for that purpose
A3) There are several approaches for adding or removing nodes… Sometimes those approaches collapse those nodes… So you anonymise all the nodes too… But the general techniques that are more used are those that perturb and move the nodes.
Q4) One of the last things you said was about that trade off of utility of analysis and user privacy. My question is who makes that decision about the trade off? Would the people being studied agree with those decisions for instance, in the real world?
A4) The real world is much more complex of course. The problem is about deciding level of usefulness of the data… At the present time these methods are not used as far as they could be done… For statistical data this is often fixed by government… for instance in Census data you can see the method by which data has been anonimised. But for OSN there is nothing of that type, and nobody is telling… and basically no-one is releasing data… Data is money… So if we can try to give good algorithms to enable that, then maybe the OSN companies can release some of this kind of data. But at this moment, nobody is putting that idea of privacy there… Generally privacy level tends to be low, information level is high…
Q5) I didn’t totally understand how you set the boundaries of the network… Is it the crawling process?
A5) The idea is that there are no boundaries… Crawler goes… Maybe it completes within 1000 nodes, or 3 hours… or similar. You won’t crawl everything and you want some data. So 10 million users might be the boundary for instance… Then you have data to look at… So I have 10 million users out of a pool of 500 million… But which ones do I have? How representative? That needs consideration…
Q6) The crawler gathers a model of relationships and behaviours, and I’m sure that marketers are very interested. Is there potential to predict connections, behaviours, intentions etc.
A6) Yes, there are lots of techniques of graph theory that allow that sort of interpretation and prediction. OSN use these sorts of approaches for recommendations and so on…
Q6) How reliable is that data?
A6) Understanding similarities there can help make it more reliable… similarity rather than distance between nodes can be helpful for understanding behaviour… But I will say that they are quite accurate… And the more information they gather, the more accurate they are…
Q7) I was wondering when you were talking about measuring the effectiveness of different anonymisation methods… Is there a way to take account of additional data that could effect anonimisation
A7) In computer security in general, when you model someone you have to define the adversary model… What the adversary is able to do… So, what is the attacker able to have… The available information… So the more information is available, the harder it is to protect the individual. It is a complex scenario.
Q8) Is there a user friendly web crawler that can be used by non technicians…
A8) No. Sorry about that… No, because there are some frameworks… But you don’t have one solution to fit all… But the idea is that there are some frameworks that are more suited to computer science people… Tomorrow in the workshop we will explain extracting information from Twitter… And those techniques will let us explore how we could develop a crawler on Twitter… So exploring connections and followers, etc.
Q9) What are the ethics of web crawling in social sciences? And what are the positions of the OSN on that?
A9) You can crawl OSN because the information is public. So you can crawl Twitter, as information is public. If you want to crawl Facebook, you have to be authorised by the user to look at the profile… And you need to develop an algorithm to run as an app in Facebook… and authorise that… But that doesn’t mean the user understands that… But for instance in last US Election, Obama campaign did an application on Facebook that did that… graphing their supporters and friends… And use that in the campaign…
Q9) I was wondering about the crawling of discussion forums… where you cannot get authorisation. But you also mentioned that providers not keen… is it legitimate to do that…
A9) I think that it is… If you are crawling public information… There is another thing of the OSN not liking it – then they can make some restrictions. If I do things that avoid OSN restrictions that is fine… You can do that
Q10) I wanted to follow up on that… There are legal and ethical issues associated with crawling websites. You have to consider it extremely carefully. If I use a website that says it does not allow crawlers, I don’t expect it to be crawled and that would not be legal under data protection law. And there was some research about 10 years ago a research project found that bloggers, although posting in public, didn’t expect to be analysed and interpreted… And you do have to think about the ethics here… And you need to think about the user’s expectation when they put the data up.
A – Christina) Everyone uses Google, you can’t expect that when you put something on the internet you have to expect it to be crawled
A – Jordi) From my perspective, as a researcher doing cryptography what you say is quite strange… My work is about protecting information… It assumes people will be trustworthy with your information…
Q10) No, I’m saying an ethical researcher should not be breaking the law.
Comment) There can be an expectation of privacy in a “public” space…
Comment – from me) I would recommend the Association of Internet Researchers Ethics Guide for more on how you can mediate expectations of users in your research. For your cryptography work that may not be as relevant, but for others in this audience that guide is very helpful for understanding ethical research processes, and for thinking about appropriate research methods and approaches for ethical approval.
And with a gracious close from Jordi, we are done! There is a workshop running tomorrow on this type of analysis – I won’t be there but others may be tweeting or blogging from it.