Authors Arvind Narayanan and Vitaly Shmatikov
Conference SoSP 2009
Summary The paper demonstrates the possibility of breaching privacy by re-identification of entities across networks
Motivation:
Increasing use of social networks
Sharing of data with advertisers, developers, researchers
Show how privacy is not the same as anonymization
Major Result
Re-identified 1/3 of users who verifiably have accounts on both Twitter and Flickr with 12% error rate
(Very reasonable) Assumptions
Purely based on network topology
Doesn't require creation of Sybil nodes
Robust to noise and existing defenses
Works even with small overlap between target network and auxiliary network
Survey of current state of data sharing
Academic and govt. data mining - phone call data, health datasets
Advertising - Facebook and third party application data sharing** Aggregation from multiple sources
Definition of privacy and its relation to node anonymity
Node attributes are sensitive information
Eg. Higher accuracy of SSN prediction from Facebook profiles than random guessing
Even existence of edges in a connection graph can be sensitive information
Past attempts: Attacks and defenses
Example attack for breaking node anonymity based on Sybil nodes - model is rather weak
Defense often ignores auxiliary information available to attacker
A computationally expensive and functionally limiting idea - scramble and unscramble data whenever reads and writes occur
Auxiliary information available depends on the attacker's position
Strongest adversary - government-level agency: large auxiliary network
Abusive marketing - commercial enterprise. phishing and spamming
Individual node information - stalkers, investigators etc.
Model
Graph based model of social network, with nodes and edges having attributes that are classified as private/public
Released data - publish subset of attributes that are insufficient for re-identification by themselves; remove some edges, add random edges
Attacker model -has an auxiliary graph which can be subset of the target graph too
Aggregate auxiliary information (good model - includes not only binary but also probabilistic knowledge)
Individual auxiliary information (detailed information about small number of nodes - 30 to 150 seeds)
Privacy breach -a threshold increase in information above what auxiliary graph had
Measuring success of attack - since singletons would negatively impact performance for no good reason, add weights to nodes in relation to their importance (centrality - they just use node-degree as measure) in the network; Weighted probability of mapping a node correctly from auxiliary to target network
Algorithm
Seed ID - map small number of seed nodes from auxiliary to target graph based on search for a k-clique with matching degrees, common neighbors
Propagation - Iteratively use already mapped nodes to compute scores for unmapped nodes and if above a threshold, take note of a match; Use of eccentricity - only 'exceptional' matches are recorded; Other similar heuristics for better matching
Experimental results on graphs used - 'follow' relations on Twitter; 'contact' relations on Flickr; 'friend' on LiveJournal
Seed ID results - Tested on LiveJournal
Synthetically create auxiliary information by randomly sampling cliques
Identify the clique again in the graph with given error rate for matching
Identification success decreases with increase in error rate
Propagation
Demonstrate robustness against perturbations in data
Add noise by removing edges (only the given procedure B is tested)*** Results
Number of seeds decides whether propagation step dies out or not
Imprecision of auxiliary information decreases precentage of nodes re-identified
Mapping Flickr nodes to Twitter
27000 Ground truth mappings were produced by exact matches in name or username and heuristics to decide whether it was a match
150 seed mappings selected at random from the 27000 were used
30.8 % of the 27000 mappings were re-identified correctly, 12.1 % incorrectly and the rest were not identified
Even in erroneous mappings, there was some sense of closeness
Pros Good attack model based on discussion of adversary's strength
Good survey of existing literature
Interesting framework and use of auxiliary information
Real evaluation
Robustness to noise
Cons Should have showed results with more noise in the sanitized data
There is disconnect between the framework and the evaluation - the adversary models are never used etc.
Only 27000 mappings from among millions were used for evaluation
No execution time statistics
Discussion Points
Given the results, where does that leave us with respect to countermeasures?
Does the average user understand the privacy risk?
Is user expectation of privacy reasonable?
Could a business model be developed for P2P social networking?
Could bulk of computation and data could be moved to user side?
Votes Strong accept - 5