Classifying Online Social Network Users Through the Social Graph


Cristina Pérez-Solà and Jordi Herrera-Joancomartí
Departament d'Enginyeria de la Informació i les Comunicacions, Universitat Autònoma de Barcelona
October 25th, 2012

1 Introduction
2 Classifier proposal
3 The experiments
4 Conclusions and further work

About the title: Classifying...

Definition: Classification is the problem of identifying to which of a set of categories a new observation belongs. The decision is made on the basis of a training set of data containing observations whose category membership is already known.

About the title: ...Online Social Network Users...

About the title: ...through the Social Graph

Definition: A social graph is a graph where nodes represent users in a social network and edges represent relationships between these users.

What do we want to do?

Goals:
- Design a user (node) classifier that uses the graph structure alone (no semantic information is needed).
- Apply the previously designed classifier to label OSN users.
- Demonstrate that OSN user classification is possible with naively anonymized graphs.

Why is it interesting?

Motivation: user classification as a privacy attack.
- User classification allows an attacker to infer (private) attributes of the user.
- Attributes may be sensitive by themselves.
- Attribute disclosure may have undesirable consequences for the user.
- In any case, the user can no longer control the disclosure of information about themselves.

1 Introduction
2 Classifier proposal
   Architecture overview
   Classifier modules
   Specific design details
3 The experiments
4 Conclusions and further work

Architecture overview

Classifier architecture: the proposed classifier is implemented with a 5-module architecture, which includes two different classifiers: an initial classifier and a relational classifier.

[Figure: classifier architecture. Blocks, in order: clustering coefficient & degrees; data preprocessing; initial classifier; class labels; neighborhood analysis; data preprocessing; relational classifier; new class labels.]

Classifier modules: Initial classifier

The initial classifier analyzes the graph structure and maps each node to a 2-dimensional sample: degree and clustering coefficient. The output is an initial assignment of nodes to categories.
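The mapping from nodes to 2-dimensional samples can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an undirected graph stored as a dictionary of neighbour sets, and the name `structural_features` and the convention of a zero clustering coefficient for nodes with fewer than two neighbours are assumptions made here.

```python
from itertools import combinations


def structural_features(adj):
    """Map each node to (degree, local clustering coefficient).

    adj: dict mapping node -> set of neighbours (undirected, no self-loops).
    """
    features = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            # Assumed convention: no pair of neighbours exists, so use 0.
            cc = 0.0
        else:
            # Count the pairs of neighbours that are themselves connected.
            links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
            cc = 2.0 * links / (k * (k - 1))
        features[node] = (k, cc)
    return features


# Toy graph: a triangle (a, b, c) plus a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
feats = structural_features(toy)
```

In the toy graph, node `a` sits in a closed triangle, while node `c` has three neighbours of which only one pair is connected, so its coefficient is lower.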

Classifier modules: Neighborhood analysis

The neighborhood analysis module reports which kinds of nodes each node is connected to, using the labels assigned by the initial classifier.
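One possible reading of this module is a per-node summary of neighbour labels. The sketch below assumes that summary is the fraction of neighbours in each category; the slides do not say whether counts or fractions are used, and `neighborhood_profile` is a hypothetical name.

```python
from collections import Counter


def neighborhood_profile(adj, labels, categories):
    """For each node, return the fraction of its neighbours in each category.

    adj: dict node -> set of neighbours.
    labels: dict node -> category currently assigned to that node.
    categories: ordered sequence of category names (fixes the output order).
    """
    profile = {}
    for node, nbrs in adj.items():
        counts = Counter(labels[n] for n in nbrs if n in labels)
        total = sum(counts.values())
        profile[node] = tuple(
            counts[c] / total if total else 0.0 for c in categories
        )
    return profile


toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
initial_labels = {"a": "individual", "b": "company",
                  "c": "individual", "d": "company"}
profile = neighborhood_profile(toy, initial_labels, ("individual", "company"))
```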

Classifier modules: Relational classifier

The relational classifier maps users to n-dimensional samples, using both degree and clustering coefficient and the neighborhood information to classify users. The output is a new assignment of nodes to categories, which can differ from the initial classification.
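As a rough illustration of the n-dimensional sample, the structural features and the neighborhood information can simply be concatenated. The layout below is an assumption: in this undirected sketch n equals 2 plus the number of categories, and `relational_samples` is a hypothetical helper.

```python
def relational_samples(structural, neighborhood):
    """Concatenate structural features with the neighbourhood information.

    structural: dict node -> (degree, clustering coefficient).
    neighborhood: dict node -> tuple with one entry per category.
    """
    return {
        node: tuple(structural[node]) + tuple(neighborhood[node])
        for node in structural
    }


# Example values for two nodes and two categories (individual, company).
structural = {"c": (3, 1 / 3), "d": (1, 0.0)}
neighborhood = {"c": (1 / 3, 2 / 3), "d": (1.0, 0.0)}
samples = relational_samples(structural, neighborhood)
```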

Specific design details

Some details about the classifier:
- The graph is directed, so we distinguish between indegree and outdegree (instead of having just degree). This distinction increases by 2 the number of dimensions in the neighborhood analysis.
- We can have as many categories as we want: we just have to add more dimensions.
- Classifiers are instantiated with Support Vector Machines with soft margins.
- The relational classifier is applied iteratively.
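The iterative application of the relational classifier might look like the loop below. This is a sketch under several assumptions, not the authors' method: the graph is treated as undirected, the number of rounds is fixed (the slides do not state a stopping rule), and a nearest-centroid classifier is swapped in for the soft-margin SVM so that the example has no dependencies. The names `nearest_centroid_fit`, `iterate_relational` and `rounds` are hypothetical.

```python
from collections import Counter


def nearest_centroid_fit(samples, labels):
    """Stand-in trainer (NOT an SVM): one mean vector per category."""
    sums, counts = {}, Counter()
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Pick the category whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def iterate_relational(adj, structural, known, categories, rounds=10):
    """Re-label unknown nodes for a fixed number of rounds.

    known: dict node -> true label of the training nodes (never overwritten).
    """
    train = list(known)
    # Round 0: initial classifier on structural features only.
    predict = nearest_centroid_fit(
        [structural[n] for n in train], [known[n] for n in train]
    )
    labels = {n: known.get(n, predict(structural[n])) for n in adj}

    for _ in range(rounds):
        samples = {}
        for n, nbrs in adj.items():
            # Neighbourhood information from the current labelling.
            share = [
                sum(labels[m] == c for m in nbrs) / len(nbrs) if nbrs else 0.0
                for c in categories
            ]
            samples[n] = tuple(structural[n]) + tuple(share)
        predict = nearest_centroid_fit(
            [samples[n] for n in train], [known[n] for n in train]
        )
        labels = {n: known.get(n, predict(samples[n])) for n in adj}
    return labels


# Toy star graph: one hub connected to four leaves.
adj = {"h": {"l1", "l2", "l3", "l4"},
       "l1": {"h"}, "l2": {"h"}, "l3": {"h"}, "l4": {"h"}}
structural = {"h": (4, 0.0), "l1": (1, 0.0), "l2": (1, 0.0),
              "l3": (1, 0.0), "l4": (1, 0.0)}
known = {"h": "hub", "l1": "leaf"}
final = iterate_relational(adj, structural, known, ("hub", "leaf"), rounds=3)
```

Training labels are kept fixed in every round, so only the nodes with unknown labels can change category between iterations.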

1 Introduction
2 Classifier proposal
3 The experiments
   Experiment design
   Experiment results
4 Conclusions and further work

Experiment design: The main goal

Research question: Is an attacker able to recover attributes of OSN users knowing just the social graph structure and the attributes of a small subset of the nodes in the graph?

We are facing a within-network classification problem, where nodes whose labels are unknown are linked to nodes whose labels are known.

Experiment design: Data used in the experiments

- We collected data from 936,423 Twitter users, which were all the neighbors of a subset of 300 nodes.
- We constructed two disjoint graphs G1 = (V1, E1) and G2 = (V2, E2) with users and their relationships.
- We labeled the nodes of the graphs to obtain the ground truth:
  - Binary classification: individual or company.
  - Multiclass classification: normal user, blogger, celebrity, media and organization.

Experiment design: An experiment

Each of the experiments consisted of:
- Randomly selecting a subset of nodes (V_train) to be used as training samples: 65%, 50%, 35% and 20% of nodes.
- Training the classifiers with those samples.
- Classifying the rest of the nodes (V_test = V \ V_train).
- Evaluating the overall performance using the ground truth.

We performed 100 experiments for each of the training set sizes and for both classification problems.
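The protocol above can be sketched as a small evaluation harness. This is an illustrative outline only: `run_experiments`, the `classify` callback and the majority-label baseline used to exercise it are assumptions, and the toy labels are made up for the example rather than taken from the Twitter data.

```python
import random
from collections import Counter


def run_experiments(nodes, truth, classify,
                    fractions=(0.65, 0.50, 0.35, 0.20),
                    repetitions=100, seed=0):
    """Return the mean correct rate for each training-set fraction.

    classify(train_labels, test_nodes) must return dict node -> predicted label.
    Assumes every split leaves at least one test node.
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    results = {}
    for frac in fractions:
        rates = []
        for _ in range(repetitions):
            size = max(1, round(frac * len(nodes)))
            train = set(rng.sample(nodes, size))          # V_train
            test = [n for n in nodes if n not in train]   # V \ V_train
            predicted = classify({n: truth[n] for n in train}, test)
            correct = sum(predicted[n] == truth[n] for n in test)
            rates.append(correct / len(test))
        results[frac] = sum(rates) / len(rates)
    return results


def majority_baseline(train_labels, test_nodes):
    """Hypothetical placeholder classifier: predict the most common label."""
    top = Counter(train_labels.values()).most_common(1)[0][0]
    return {n: top for n in test_nodes}


# Toy ground truth: 20 nodes, one in four labelled "company".
nodes = range(20)
truth = {n: "individual" if n % 4 else "company" for n in nodes}
rates = run_experiments(nodes, truth, majority_baseline, repetitions=5)
```

Passing the classifier as a callback keeps the split-and-score logic independent of whether the initial or the relational classifier is being evaluated.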

Experiment results: Binary classification results

[Figure: correct rate (y-axis, 0.5 to 0.75) versus iteration (x-axis, 0 to 10). One curve per dataset and training-set size: D1 and D2, each with 65%, 50%, 35% and 20% training.]

Experiment results: Multiclass classification results

[Figure: correct rate (y-axis, 0.3 to 0.6) versus iteration (x-axis, 0 to 10). One curve per training-set size: Cat a with 65%, 50%, 35% and 20% training.]


Conclusions

- Information found in the social graph is enough to perform classification.
- It is possible to classify OSN users using a naively anonymized copy of a social graph.
- Naive anonymization does not protect OSN users from attribute disclosure.
- Success rate varies depending on the training set size.

Further work

- Integrate both structural and semantic information to improve classification.
- Study the impact of different graph anonymization techniques (other than the naive anonymization) on the classification.
- Analyze the performance of other classification techniques for relational data.


Linear SVM

Non-linear SVM