Generalization Algorithm For Prevent Inference Attacks In Social Network Data


Chethana Nair, Neethu Krishna, Siby Abraham
1 Dept. of Computer Science and Engg., Christ Knowledge City, Mannoor
2,3 Dept. of Computer Science and Engg., Musaliar College of Engg. & Technology, Pathanamthitta, Kerala
cnchethananair@gmail.com

Abstract - Online social networking has become one of the most popular activities on the web. Online social networks (OSNs), such as Facebook, are used by a growing number of people. OSNs allow users to publish details about themselves, to connect to their friends, and to control and customize what personal information is available to other users. Some of the information revealed inside these networks is meant to be private. A privacy breach occurs when sensitive information about a user, that is, information the individual wants to keep from the public, is disclosed to an adversary. Yet it is possible to use learning algorithms on released data to predict private information, so private information leakage can be an important issue. We explore how to launch inference attacks that use released social networking data to predict private information. The tension between the desired use of data and individual privacy presents an opportunity for privacy-preserving social network data mining. We then devise three possible sanitization techniques that could be used in various situations and study the effect of removing details and links on preventing sensitive information leakage. Removing details and friendship links together is the best way to reduce classifier accuracy, but it is probably infeasible if the social network is to remain useful. We explore the effectiveness of these techniques and attempt to use methods of collective inference to discover sensitive attributes of the data set. We decrease the effectiveness of both local and relational classification algorithms by using the sanitization methods.

I. INTRODUCTION

The rapid growth and ubiquity of online social media services have changed the way people interact with each other. Online social networking has become one of the most popular activities on the web. Social network analysis has been a key technique in modern sociology, geography, economics, and information science. The data generated by social media services is often referred to as social network data, and in many situations it needs to be published and shared with others. Social networks are online applications that allow their users to connect by means of various link types. Some users join as part of their professional network and therefore specify details related to their professional life. Because these sites gather extensive personal information, social network application providers have a rare opportunity: direct use of this information could be useful to advertisers for direct marketing. Data owners can publish data for others to analyze, even though doing so may create severe privacy threats, or they can withhold data because of privacy concerns, even though that makes the analysis impossible. A privacy breach occurs when sensitive information about a user, that is, information the individual wants to keep from the public, is disclosed to an adversary.

For example, business companies analyze the social connections in social network data to uncover customer relationships that can benefit their services and product sales. The analysis of social network data is believed to provide an alternative view of real-world phenomena because of the strong connection between the actors behind the network data and real-world entities. Social network data makes commerce much more profitable. On the other hand, requests to use the data can also come from third-party applications embedded in the social media application itself. For instance, Facebook has thousands of third-party applications, and the number is growing exponentially.
Even though the process of data sharing in this case is implicit, the data is indeed passed from the data owner (the service provider) to a different party (the application).

Published by IJRCCT, Page 60

The data given to these applications is usually not sanitized to protect users' privacy. The tension between the desired use of data and individual privacy presents an opportunity for privacy-preserving social network data mining, that is, the discovery of information and relationships from social network data without violating privacy.

II. RELATED WORK

The area of privacy inside a social network encompasses a large breadth, depending on how privacy is defined. The work on anonymized social networks [2] considers an attack against an anonymized network. In that model, the network consists of only nodes and edges; detail values are not included. The goal of the attacker is simply to identify people. That problem is very different from the one considered here, because it ignores details and does not consider the effect of the existence of details on privacy.

Other papers have tried to infer private information inside social networks. Inference Attacks by Third-Party Extensions to Social Network Systems [1] identifies the threat of social network site API inference attacks, provides a taxonomy of these attacks, and proposes a risk assessment scheme to help users understand the risk of subscribing to a third-party application in an extensible SNS. Related directions include the extension of the metric to account for the uneven popularity of authentication questions and the design of a secure API for extensible SNSs. The authors create a benchmark, formulate the feasibility predicates, and empirically assess the inference accuracy of the inference algorithms in the benchmark, which allows the effectiveness of the risk assessment scheme to be evaluated empirically. One limitation of the risk assessment scheme is that it assumes all authentication questions in the benchmark are equally popular. An improvement is to reformulate the metric so that it takes into account the uneven popularity of the authentication questions. An interesting research question would be to determine which version of the risk metric is actually more effective in steering users' privacy expectations.

Inferring Privacy Information from Social Networks [3] considers ways to infer private information via friendship links by creating a Bayesian network from the links inside a social network. While the authors crawl a real social network, LiveJournal, they use hypothetical attributes to analyze their learning algorithm. This paper additionally considers techniques that help with choosing the most effective details or links to remove for protecting privacy, as well as the effect of collective inference techniques in possible inference attacks.

Preserving the Privacy of Sensitive Relationships in Graph Data [4] describes a method of link reidentification. That is, the authors assume that the social network has various link types embedded and that some of these link types are sensitive. They propose several methods of social graph anonymization, focusing mainly on the idea that by anonymizing both the nodes in the group and the link structure, one thereby anonymizes the graph as a whole. However, their methods all focus on anonymity in the structure itself. For example, through the use of k-anonymity or t-closeness, depending on the quasi-identifiers that are chosen, much of the uniqueness in the data may be lost. Through our method of anonymity preservation, we maintain the full uniqueness of each node, which allows more information in the data after release. The general method by which they hide links is either random elimination or link aggregation. Instead of attempting to identify sensitive links between individuals, we attempt to identify sensitive traits of individuals by using a graph that initially has a full listing of friendship links. Also, instead of randomly eliminating links between nodes, we develop a heuristic for removing those links between individuals that will reduce the accuracy of our classifiers the most.
The work in [5] uses automatic crawlers to gather users' profile information for the purpose of launching profile cloning attacks. Once enough personal information is harvested, the attacker can clone the profile of the victim, either within the same SNS or in an SNS in which the victim is not registered. Armed with the cloned profile, the attacker then attempts to befriend the friends of the victim. Empirical data suggests that the victim's friends will likely consent to the forged friendship, thus granting the attacker access to their personal information. Both our work and theirs attempt to gauge the impact of inference attacks through some form of operationalization: gaining the trust of the victim's friends in theirs, and subverting authentication in ours. However, third-party applications are controlled by a different access control model than a user-mimicking crawler, and the data made available in the two contexts are also different.

III. EXISTING SYSTEM

Existing work considers only ways to infer private information inside social networks via friendship links, by creating a Bayesian network from the links inside a social network. While the authors crawl a real social network, LiveJournal, they use hypothetical attributes to analyze their learning algorithm.

Other existing work identifies the threat of social network site API inference attacks, provides a taxonomy of these attacks, and proposes a risk assessment scheme to help users understand the risk of subscribing to a third-party application in an extensible SNS. Related directions include the extension of the metric to account for the uneven popularity of authentication questions and the design of a secure API for extensible SNSs. The authors create a benchmark, formulate the feasibility predicates, and empirically assess the inference accuracy of the inference algorithms in the benchmark, which allows the effectiveness of the risk assessment scheme to be evaluated empirically. One limitation of the risk assessment scheme is that it assumes all authentication questions in the benchmark are equally popular. An improvement is to reformulate the metric so that it takes into account the uneven popularity of the authentication questions. An interesting research question would be to determine which version of the risk metric is actually more effective in steering users' privacy expectations.

In the anonymized-network setting, the network consists of only nodes and edges; detail values are not included. The goal of the attacker is simply to identify people.
Their problem is very different from the one considered here, because they ignore details and do not consider the effect of the existence of details on privacy. Other papers have tried to infer private information inside social networks. Automatic crawlers have been used to gather users' profile information for the purpose of launching profile cloning attacks. Once enough personal information is harvested, the attacker can clone the profile of the victim, either within the same SNS or in an SNS in which the victim is not registered. Armed with the cloned profile, the attacker then attempts to befriend the friends of the victim. Empirical data suggests that the victim's friends will likely consent to the forged friendship, thus granting the attacker access to their personal information. Both our work and theirs attempt to gauge the impact of inference attacks through some form of operationalization: gaining the trust of the victim's friends in theirs, and subverting authentication in ours. However, third-party applications are controlled by a different access control model than a user-mimicking crawler, and the data made available in the two contexts are also different.

We consider the problem of inferring private traits using real-life social network data, together with possible sanitization approaches to prevent such inference. We use a modification of naive Bayes classification that is suitable for classifying large amounts of social network data. The modified naive Bayes algorithm predicts privacy-sensitive trait information using both node traits and link structure, and we compare the accuracy of our learning method based on link structure against the accuracy of our learning method based on node traits.

Existing work can also model and analyze access control requirements with respect to collaborative authorization management of shared data in OSNs. The need for joint management of data sharing, especially photo sharing, in OSNs has been recognized by recent work that provides a solution for collective privacy management in OSNs. That work considers access control policies for content that is co-owned by multiple users in an OSN, such that each co-owner may separately specify her or his own privacy preference for the shared content.

Disadvantages of Existing System:
Private information leakage can be an important issue in some cases. The attacker's goal is simply to identify people.

IV. SOCIAL NETWORK ARCHITECTURE

A high-level view of the system components of a social network is shown in Figure 1. In the architecture, there are users, social media services, data owners, and third-party data recipients. Online social media services have been provided in many forms. Generally, there are six different forms of social media: collaborative projects, blogs, content communities, social networking sites, virtual game worlds, and virtual communities. Social media service users can be any real-world entity or organization that uses the service. When a user uses an online social media service, they are usually asked to create a profile and to give information about themselves. This information includes personally identifiable information such as social security number, name, and phone number, which uniquely identify a person. Sensitive information can include religion, political view, type of disease (as in a healthcare network), or generated income (as in a financial network). There is also data generated from social activity on the services. In many situations, the data needs to be published and shared with others. The data usually contains valuable information that can enable better social targeting of advertisements.

Social networking sites, the most famous form of social media, are applications that enable participants to connect by creating personal information profiles, inviting friends and colleagues to have access to those profiles, and sending e-mails and instant messages between each other. These personal profiles can include any type of information, including photos, video, audio files, and blogs. Indeed, this form mixes several social media types into one package. Facebook (facebook.com) is the most popular application of this kind; it currently has more than 500 million active users, who spend over 700 billion minutes per month using the application.

Privacy concerns of individuals in a social network can be classified into two categories: privacy after data release, and private information leakage. Instances of privacy after data release involve the identification of specific individuals in a data set subsequent to its release to the general public or to paying customers for a specific usage. Private information leakage, conversely, is related to details about an individual that are not explicitly stated but are instead inferred through other released details and/or relationships to individuals who may express that detail.

We explore how online social network data could be used to predict some individual private detail that a user is not willing to disclose (e.g., political or religious affiliation, sexual orientation), and we explore the effect of possible data sanitization approaches on preventing such private information leakage while allowing the recipient of the sanitized data to do inference on non-private details.

4.1 Learning Methods On Social Networks

Social network data could be used to predict some individual private detail that a user is not willing to disclose. We address the problem of private information leakage for individuals as a direct result of their actions as part of an online social network. A privacy breach occurs when sensitive information about a user, that is, information the individual wants to keep from the public, is disclosed to an adversary. Yet it is possible to use learning algorithms on released data to predict private information, so private information leakage can be an important issue. We explore how to launch inference attacks that use released social networking data to predict private information.

We model an attack scenario as follows. Suppose Facebook wishes to release data to Electronic Arts for use in advertising games to interested people. However, once Electronic Arts has this data, they want to identify the political affiliation of users in the data for lobbying efforts. Because they would not only use the names of those individuals who explicitly list their affiliation, but could also determine through inference the affiliation of other users in the data, this would obviously be a privacy violation of hidden details.

We explore the effectiveness of these techniques and attempt to use methods of collective inference to discover sensitive attributes of the data set. We decrease the effectiveness of both local and relational classification algorithms by using the sanitization methods. We study the problem of sanitizing a social network to prevent inference of social network data and then examine the effectiveness of those approaches on a real-world data set. In order to protect privacy, we sanitize both details and the underlying link structure of the graph. That is, we delete some information from a user's profile and remove some links between friends. We also examine the effects of generalizing detail values to more generic values.

Figure 2 illustrates an example of a social network as a graph. The vertices usually represent real-world actors or entities such as individuals or organizations. Each vertex has a profile that usually contains personal attributes, such as name, gender, birth date, political view, and religion. These individuals are usually connected by edges that represent some sort of social tie or link between them. For example, in social networking sites, these edges represent the friends each member is connected to. An edge can also have its own attributes to describe the properties of the connection.

Definition 1: A social network is represented as a graph $G = \{\mathcal{V}, \mathcal{E}, \mathcal{D}\}$, where $\mathcal{V}$ is the set of nodes in the graph and each node $n_i$ represents a unique user of the social network. $\mathcal{E}$ represents the set of edges in the graph, which are the links defined in the social network. For any friendship link between user $n_i$ and user $n_j$, we assume that both $(n_i, n_j) \in \mathcal{E}$ and $(n_j, n_i) \in \mathcal{E}$. $\mathcal{D}$ is the set of details from the social network. The set of all detail types is represented by $\mathcal{H}$. A detail value is a string defined over an alphabet $\Sigma$ that represents a user's input for a detail type. A detail is a (detail type, detail value) pair, represented uniquely by an identifier. $D_i^j$ is the $j$th (detail type, detail value) pair specified by user $n_i$. $D_i$ is the set of all $D_i^j$ for a node $n_i$, and $\mathcal{D}$ is the set of $D_i$ for all $i$.
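As a rough illustration of Definition 1, the graph can be held as a set of nodes, a symmetric set of edges, and a per-node set of (detail type, detail value) pairs. The class and method names below (`SocialGraph`, `add_detail`, `add_friendship`, `friends`, `detail_types`) and the sample data are hypothetical and are not taken from the paper; this is a minimal sketch only.

```python
from dataclasses import dataclass, field


@dataclass
class SocialGraph:
    """Sketch of a graph with nodes, edges and details; names are illustrative."""
    nodes: set = field(default_factory=set)      # one entry per user
    edges: set = field(default_factory=set)      # directed pairs, stored both ways
    details: dict = field(default_factory=dict)  # node -> set of (type, value) pairs

    def add_detail(self, node, detail_type, detail_value):
        self.nodes.add(node)
        self.details.setdefault(node, set()).add((detail_type, detail_value))

    def add_friendship(self, a, b):
        # A friendship link is assumed symmetric, so both directions are stored.
        self.nodes.update((a, b))
        self.edges.add((a, b))
        self.edges.add((b, a))

    def friends(self, node):
        return {b for (a, b) in self.edges if a == node}

    def detail_types(self):
        # The set of all detail types that appear anywhere in the graph.
        return {t for ds in self.details.values() for (t, _) in ds}


g = SocialGraph()
g.add_detail("n1", "favorite book", "harry potter")
g.add_detail("n2", "group", "chess club")
g.add_friendship("n1", "n2")
```

Storing each link in both directions keeps the symmetry assumption explicit, at the cost of doubling the edge set.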
To evaluate the effect that changing a person's details has on their privacy, we first create a learning method that can predict a person's private details (for the sake of example, assume that political affiliation is unspecified for some subset of our population). To understand the feasibility of possible inference attacks and the effectiveness of various sanitization techniques in combating those attacks, we initially used a simple naive Bayes classifier. Using naive Bayes as our learning algorithm allowed us to scale our implementation easily to the large size and diversity of the Facebook data set. It also has the added advantage of allowing simple selection techniques to remove detail and link information when trying to hide the class of a network node. Finally, it has shown itself to be extremely effective in these classification tasks.

4.1.1 Naïve Bayes Classification

Determining an individual's political affiliation is an exercise in graph classification. Given a node $n_i$ with $m$ details and $p$ potential classification labels $C_1$ to $C_p$, the probability of $n_i$ being in class $C_x$ is given by the naive Bayes rule,

$P(C_x \mid D_i^1, \ldots, D_i^m) \propto P(C_x) \prod_{j=1}^{m} P(D_i^j \mid C_x)$,

where $D_i^j$ denotes the $j$th detail of $n_i$.

4.1.2 Naïve Bayes on Friendship Links

The same naive Bayes model can be applied to the problem of determining the class detail value of person $n_i$ given their friendship links, that is, calculating the class probability from the friendship links from person $n_i$ to each friend $n_j$.

Definition 2: A detail type is a string defined over an alphabet $\Sigma$ that represents a specific category name within the social network details set.

4.1.3 Weighing Friendships

There are many ways to weigh friendship links. The method used here is very easy to calculate and is based on the assumption that the more public details two people share, the more private details they are likely to share. The weight of a friendship link from node $n_i$ to node $n_j$ is denoted $W_{i,j}$.

4.2 Network Classification

Collective inference is a method of classifying social network data using a combination of node details and connecting links in the social graph. Each of these classifiers consists of three components: a local classifier, a relational classifier, and a collective inference algorithm.

4.2.1 Local Classifiers

Local classifiers are a type of learning method applied in the initial step of collective inference. Typically, a local classifier is a classification technique that examines the details of a node and constructs a classification scheme based on the details it finds there. The naive Bayes classifier, for example, builds a model based on the details of nodes in the training set and then applies this model to nodes in the testing set to classify them.

4.2.2 Relational Classifiers

The relational classifier is a separate type of learning algorithm that looks at the link structure of the graph and uses the labels of nodes in the training set to develop a model, which it uses to classify the nodes in the test set.

4.2.3 Collective Inference Methods

Collective inference attempts to make up for the deficiencies of using either classifier alone by using both local and relational classifiers in a precise manner to increase the classification accuracy of nodes in the network. By using a local classifier in the first iteration, collective inference ensures that every node has an initial probabilistic classification, referred to as a prior. The algorithm then uses a relational classifier to reclassify nodes. At each of these steps $i \geq 2$, the relational classifier uses the fully labeled graph from step $i - 1$ to classify each node in the graph.
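The three components described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the local classifier is a naive Bayes model with add-one smoothing over (detail type, detail value) pairs, the friendship weight is assumed to be the fraction of a node's details that its friend also lists (the source does not give the formula for $W_{i,j}$), and the relational step is a simple weighted average of friends' class probabilities. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(details, labels):
    """Count classes and, per class, how many labeled nodes list each detail."""
    class_counts = Counter(labels.values())
    detail_counts = defaultdict(Counter)
    for node, cls in labels.items():
        for d in details.get(node, ()):
            detail_counts[cls][d] += 1
    return class_counts, detail_counts


def local_probs(node_details, model):
    """Naive Bayes posterior over classes with add-one smoothing (an assumption)."""
    class_counts, detail_counts = model
    total = sum(class_counts.values())
    log_scores = {}
    for cls, n_cls in class_counts.items():
        score = math.log(n_cls / total)
        for d in node_details:
            score += math.log((detail_counts[cls][d] + 1) / (n_cls + 2))
        log_scores[cls] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def friendship_weight(details, i, j):
    """Assumed weight: share of n_i's details that n_j also lists."""
    di, dj = details.get(i, set()), details.get(j, set())
    return len(di & dj) / len(di) if di else 0.0


def collective_inference(details, friends, labels, iterations=10):
    """Local prior in the first step, then repeated relational reclassification."""
    model = train_naive_bayes(details, labels)
    classes = list(model[0])
    probs = {}
    for node in details:
        if node in labels:  # known labels stay fixed
            probs[node] = {c: float(c == labels[node]) for c in classes}
        else:               # prior from the local classifier
            probs[node] = local_probs(details[node], model)
    for _ in range(iterations):
        updated = {}
        for node in probs:
            if node in labels:
                continue
            acc = {c: 0.0 for c in classes}
            weight_sum = 0.0
            for f in friends.get(node, ()):
                w = friendship_weight(details, node, f)
                if f in probs and w > 0:
                    weight_sum += w
                    for c in classes:
                        acc[c] += w * probs[f][c]
            if weight_sum > 0:  # nodes without weighted friends keep their estimate
                updated[node] = {c: acc[c] / weight_sum for c in classes}
        probs.update(updated)  # estimates from the previous step feed the next one
    return probs


details = {
    "a": {("group", "young conservatives"), ("book", "x")},
    "b": {("group", "green party"), ("book", "y")},
    "c": {("group", "young conservatives"), ("book", "y")},
    "d": {("book", "x"), ("music", "z")},
}
friends = {"d": ["a"], "c": ["a", "b"]}
labels = {"a": "conservative", "b": "liberal"}
result = collective_inference(details, friends, labels)
```

The update is synchronous: each pass reads only the probability estimates of the previous pass, which mirrors the use of estimates rather than hard labels between steps.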
The collective inference method also controls the length of time the algorithm runs. Some algorithms specify a number of iterations to run, while others converge after a general length of time. At each step $i$, the algorithm uses the probability estimates, not a single classified label, from step $i - 1$ to calculate new probability estimates. Further, to account for the possibility that the algorithm may not converge, there is a decay rate, called α and set to 0.99, that discounts the weight of each subsequent iteration compared to the previous iterations.

4.3 Hiding Private Information

The result of a differentially private algorithm is very similar with or without the data of any single user. Differential privacy guarantees that a change in one record does not change the result too much. On the other hand, this definition does not protect against the building of an accurate data mining model that can predict sensitive information. In fact, many differentially private data mining algorithms have been developed that have accuracy similar to their non-differentially private versions. Since our goal is to release a rich social network data set while preventing sensitive detail disclosure through data mining techniques, the differential privacy definition is not directly applicable in our scenario.

Two issues must be addressed. The first is understanding the information that the adversary can use to launch an inference attack; it is impossible to provide absolute privacy guarantees with respect to all background knowledge. The second is analyzing the potential success of an inference attack, with the aim of limiting the success of an adversary with respect to a given set of classifiers.

4.3.1 Formal Privacy Definition

The privacy definition focuses on preventing inference attacks. Background knowledge, $K$, is some data that is not necessarily directly related to the social network but that can be obtained through various means by an attacker. The additional accuracy gained by the attacker is represented by

$\max_{c \in C} \left[ P_c(G, K) - P_c(K) \right]$,

where $C$ is the set of given classifiers, $P_c(K)$ is the accuracy of classifier $c$ in predicting the sensitive hidden data using only the background knowledge $K$, and $P_c(G, K)$ is the prediction accuracy of the classifier when the released graph $G$ is also available. When this quantity is 0, the attacker does not gain additional accuracy in predicting sensitive hidden data.

4.3.2 Manipulating Details

Details can be manipulated in three ways:
1. Adding details to nodes.
2. Modifying existing details.
3. Removing details from nodes.
We classify these into two categories: perturbation and anonymization.

Choosing Details: We must choose which details to remove. We globally remove the most representative details, i.e., those whose probability on a network level has the highest correlation with a protected class label. These details are the most highly indicative of a class.

4.3.3 Manipulating Link Information

Another option for anonymizing social networks is altering links. Unlike details, there are only two methods of altering the link structure: adding or removing links. We evaluate the effects on privacy of removing friendship links instead of adding fake links. This builds on the determination of detail type using friendship links.

4.4 Detail Generalization

To combat inference attacks on privacy, we provide detail anonymization for social networks. By doing this, we aim to reduce the attacker's additional accuracy to an acceptable threshold value that matches the desired utility/privacy tradeoff for a release of data. A detail generalization hierarchy (DGH) is an anonymization technique that generates a hierarchical ordering of the details expressed within a given category. The resulting hierarchy is structured as a tree, and the generalization scheme guarantees that every substituted value is an ancestor of the original and thus is at most as specific as the detail the user initially defined.
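A detail generalization hierarchy can be pictured as a child-to-parent map in which generalizing a value means replacing it with one of its ancestors. The hierarchy contents and the names `PARENT`, `generalize`, and `generalize_details` below are made up for illustration and are not the hierarchies used in the paper; this is a small sketch of the idea only.

```python
# Hypothetical hierarchy for one detail type ("favorite music"): child -> parent.
PARENT = {
    "the beatles": "british rock",
    "british rock": "rock",
    "rock": "music",
    "miles davis": "jazz",
    "jazz": "music",
}


def generalize(value, levels=1):
    """Replace a detail value with its ancestor `levels` steps up the tree."""
    for _ in range(levels):
        if value not in PARENT:  # already at the root, nothing more general exists
            break
        value = PARENT[value]
    return value


def generalize_details(details, detail_type, levels=1):
    """Generalize every value of one detail type; other details are left untouched."""
    return {
        (t, generalize(v, levels) if t == detail_type else v)
        for (t, v) in details
    }


profile = {("favorite music", "the beatles"), ("favorite book", "dune")}
sanitized = generalize_details(profile, "favorite music", levels=2)
```

Because a value can only move toward the root, the sanitized profile never states anything more specific than what the user originally entered.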
Detail value decomposition (DVD )is a process by which an attribute is divided into a series of representative tags. These tags do not necessarily reassemble into a unique match to the original attribute Generalization Algorithm Generalize(,G) G While Classify(G) Classify(G ) <= do S all details that can be further generalized s gehighestinfogainattrib(s) Gen(s,G ) end while return g Generalization algorithm determining which attributes can be further generalized without complete removal and keeps a list of the accuracy of this generalization. At the end of each round, we permanently store the individual detail type that provides the greatest privacy. the changed graph,, meets the chosen privacy requirement, savings. V. EXPERIMENTS 5.1 Data Gathering A program to crawl the Facebook network together data for the experiments. Written in Java 1.6, the crawler loaded a profile, parsed the details out of the HTML, and stored the details inside a MySQL database. Then, the crawler loaded all friends of the current profile and stored the friends inside the Published by IJRCCT ( Page 66

8 database both as friendship links and as possible profiles to later crawl. Because of the sheer size of Facebook s social network, the crawler was in limited small network. This means that if two people share a common friend that is outside the network, this is not reflected inside the database. Also, some people have enabled privacy restrictions on their profile which prevented the crawler from seeing their profile details. The total time for the crawl was seven days. Because the data inside a Face book profile is free form text, it is critical that the input be normalized.. The normalization method use is based upon a Porter stemmer. To normalize a detail, it was broken into words and each word was stemmed with a Porter stemmer then recombined. Two details that normalized to the same value were considered the same for the purposes of the learning algorithm. Total crawl resulted in over 167,000 profiles, almost 4.5 million profile details, and over 3 million friendship links.in the graph representation, one large central group of connected nodes that had a maximum path length of 16. Only 22 of the collected users were not inside this group. Some general statistics of our Facebook data set, including the diameter mentioned above. Common knowledge leads us to expect a small diameter in social networks. Note that, although popular, not every person in society has a Facebook account and even those who do still do not have friendship links to every person they know. Additionally, given the limited scope of crawl, it is possible that some connecting individuals maybe outside thenetwork. This consideration allows us to reconcile the information presented in observed network diameter. change. This can account for the decrease in accuracy of the links classifier. Additionally, there is a severe drop in the classification accuracy after the removal of a single detail. 
However, when looking at the data, this can be explained by the removal of a detail that is very indicative of the conservative class value. When we remove this detail, the probability of being conservative drastically decreases, which leads to a higher number of incorrect classifications. When remove the second detail, which has a similar likelihood for the Liberal classification, then the class value probabilities begin to trend downward at a much smoother rate. Much more volatile classification accuracy. This appears to be as a result of the wider class size disparity in the underlying data.. For instance when remove five details, have lowered the classification accuracy, but for the sixth and seventh details, see an increase in classification accuracy. Then, again see another decrease in accuracy when remove the eighth detail. Link remove generally more stable downward trend, with only a few exceptions. Combined Removal While each measure provides a decrease in classification accuracy, also test what happens in data set if we remove both details and links. To do this, conduct further experiments where we test classification accuracy after removing 0 details and 0 links (the baseline accuracy),0 details and 10 links, 10 details and 0 links, and 10 detailsand 10 links The original class likelihood for those details which will be used as experimental class values. 5.2 Experimental Setup Implemented Detail Removal can be seen from the results, methods are generally successful at reducing the accuracy of classification tasks. Removing the details most highly connected with a class is accurate across the details and average classifiers. Counterintuitively, perhaps, is that the accuracy of our links classifier is also decreased as we remove details. The details of two nodes are compared to find a similarity. Remove details from the network, the set of similar nodes to any given node will also Published by IJRCCT ( Page 67

We refer to these sets as 0 details, 0 links; 10 details, 0 links; 0 details, 10 links; and 10 details, 10 links removed, respectively. Following this, we want to gauge the accuracy of the classifiers for various ratios of labeled versus unlabeled graphs. To do this, we collect a list of all of the available nodes, as discussed above. We then obtain a random permutation of this list using the Java function built into the Collections class. Next, we divide the list into a test set and a training set, based on the desired ratio.

The Average Only algorithm substantially outperformed traditional naive Bayes and the Links Only algorithm. Additionally, the Average Only algorithm generally performed better than the Details Only algorithm, with the exception of the (0 details, 10 links) experiments. Also, as a verification of expected results, the Details Only classification accuracy only decreased significantly when we removed details from nodes, while the (0 details, *) accuracies are approximately equivalent. Similarly, the Links Only accuracies were mostly affected by the removal of links between nodes, while the (*, 0 links) points of interest are approximately equal. The difference in accuracy between (0 details, 0 links) and (10 details, 0 links) can be accounted for by the weighting portion of the Links Only calculations, which depends on the similarity between two nodes.

These results indicate that the Average and Details classifiers generally perform at approximately the same accuracy level. The Links Only classifier, however, generally performs significantly worse, except in the case where 10 details and no links are removed. In this situation, all three classifiers perform similarly.

The greatest variance occurs when we remove links, because after removing 12 links we create a number of isolated groups of a few nodes or single, disconnected nodes. Additionally, when we removed 13 details ...
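The labeled/unlabeled split described above (a random permutation of the node list, then a cut at the desired ratio) can be sketched as follows. The text uses the shuffle function of the Java Collections class; this sketch swaps in Python's `random` module for the same purpose. The function name `split_nodes` and the `seed` parameter are hypothetical and are included only to make the example reproducible.

```python
import random


def split_nodes(nodes, test_ratio, seed=0):
    """Shuffle the node list and cut it into (training set, test set).

    test_ratio is the fraction of nodes placed in the test set.
    A fixed seed makes the permutation repeatable (an assumption of
    this sketch; the text does not say how the shuffle was seeded).
    """
    shuffled = list(nodes)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

Repeating the call with different seeds and averaging the resulting accuracies would give one estimate per ratio of labeled to unlabeled nodes.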
It may be unexpected that the Links Only classifier has such varied accuracies as a result of removing details, but since our calculation of probabilities for that classifier uses a measure of similarity between people, the removal of details may affect that classifier.

To generate the DGH for each activity, book, and show/movie, we used Google Directories. To generate the DGH for Music, we used the Last.fm tagging system. To generate the hierarchy for Groups, we used the classification criteria from the Facebook page of that group. To account for the free-form tagging that Last.fm allows, we also store the popularity of each tag that a particular detail has. Last.fm indicates this through the presentation of tags on the page: the font size of a tag is representative of how many users across the system have defined that particular tag for the music type. We then keep a list of tag recurrence (weighted by strength) for each user. For Music anonymization, we eliminate the lowest-scoring tags. We use a naive Bayes classifier and the implementation of SVM from Weka.

Our findings from domain generalization compare simply using K to guess the most populated class from background knowledge, the result of generalizing all trait types, generalizing no trait types, and generalizing the best single performing trait type (Activities). Our method of generalization (seen through the All and Activities lines) does indeed decrease the accuracy of classification on the data set. Interestingly, while previous work indicates that group membership is the dominant detail in classification, we see the most benefit here from generalizing only the Activities detail. This is because Activities generally have a far larger range of generalization values, since the trees for these detail types are taller than those of Groups.

Next, we show that, given a desired increase in privacy, we are able to determine what level to anonymize the data set to. As we require less privacy from the anonymized graph, fewer categories are generalized to any degree. Groups is most consistently anonymized completely until the required privacy allowance is 20 percent. This may be because the nature of the Music detail allows us to more easily include or remove details to fit a required privacy value. Unlike, say, the Activities detail type, which has a fixed hierarchy, Music has a loosely collected group of tags.

Collective Inference Results

In the Facebook data, there are a limited number of groups that are highly indicative of an individual's political affiliation. When removing details, these are the first that are removed. We assume that running the collective inference classifiers after removing only one detail may generate results that are specific to the particular detail we classify for. For that reason, we consider only the removal of 0 details and 10 details, the other lowest point on the classification accuracy.
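The two generalization ideas described earlier in this section, climbing a domain generalization hierarchy (DGH) for fixed-hierarchy details and dropping the lowest-scoring tags for Music, can be sketched as follows. The hierarchy is assumed to be stored as a child-to-parent mapping and the tag scores as a plain dictionary; the function names and these data layouts are hypothetical and serve only as an illustration.

```python
def generalize(value, parent, levels):
    """Replace a detail value by its ancestor `levels` steps up the DGH.

    parent: dict mapping each value to its parent in the hierarchy
            (the root has no entry). Climbing stops at the root.
    """
    for _ in range(levels):
        if value not in parent:
            break
        value = parent[value]
    return value


def generalize_profile(details, parent, levels):
    """Generalize every detail of one profile by the same number of levels."""
    return {generalize(d, parent, levels) for d in details}


def prune_tags(tag_scores, keep):
    """Keep only the `keep` highest-scoring tags of a Music detail.

    tag_scores: dict mapping tag -> popularity weight (assumed layout).
    """
    ranked = sorted(tag_scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:keep])
```

A taller hierarchy gives `generalize` more distinct levels to choose from, which is consistent with the observation above that Activities benefit most from generalization.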
For each, we store the predictions made by the Details Only, Links Only, and Average classifiers and use those as the priors for the NetKit toolkit. For each of those priors, we test the final accuracy of the cdrn, wvrn, nlb, and nbc classifiers. We do this for each of the five sets generated for each of the four points of interest, and then take the average of their accuracies as the final accuracy.

We now turn to the results of our experiments using relaxation labeling. Previously reported experiments, which do not separate the local classifier and iterative classification steps, indicate that relaxation labeling almost always performs better than merely predicting the most frequent class. Generally, it performs at near 80 percent accuracy, which is an increase of approximately 30 percent in their data sets. In our experiments, however, relaxation labeling typically performed no more than approximately 5 percent better than predicting the majority class for political affiliation. This is also substantially less accurate than using only the local classifier. This performance is at least partially because our data set is not densely connected.

There is very little significant difference among the collective inference classifiers except for cdrn, which performs significantly worse on data sets where there is a small training set. These results also indicate that our Average classifier consistently outperforms relaxation labeling on the pre- and post-anonymized data sets.

Additionally, while the local classifier's accuracy is directly affected by the removal of details and/or links, this relationship is not shown by using relaxation labeling with the local classifiers as a prior. For each pair of the figures mentioned, the relational classifier portion of the graph remains constant; only the local classifier accuracy changes. From these, the most anonymous graph, meaning the graph structure that has the lowest predictive accuracy, is achieved when we remove both details and links from the graph.
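To make the collective inference step above more concrete, the following is a minimal sketch of a weighted-vote relational neighbor (wvrn) classifier combined with relaxation labeling, using local classifier predictions as priors. It is not the NetKit implementation. It assumes a binary class, a symmetric weighted adjacency map, and a decaying update weight with starting value `k` and decay `alpha`; all names and default values are assumptions made for illustration.

```python
def wvrn_relaxation(adj, priors, known, iterations=50, k=1.0, alpha=0.99):
    """Estimate P(class = 1) for every node by relaxation labeling.

    adj:    dict node -> {neighbor: edge weight}; every neighbor must
            itself be a key of adj
    priors: dict node -> P(class = 1) from a local classifier, needed
            for every node that is not in `known`
    known:  dict node -> 0 or 1 for nodes with a training label
    """
    est = {n: float(known[n]) if n in known else priors[n] for n in adj}
    beta = k
    for _ in range(iterations):
        beta *= alpha  # trust in the neighborhood vote decays over time
        new_est = dict(est)
        for node, nbrs in adj.items():
            if node in known:
                continue  # labeled nodes stay fixed
            total = sum(nbrs.values())
            if total == 0:
                continue  # isolated nodes keep their prior
            vote = sum(w * est[m] for m, w in nbrs.items()) / total
            new_est[node] = beta * vote + (1 - beta) * est[node]
        est = new_est  # synchronous update of all nodes
    return est
```

In this sketch an isolated node never moves away from its prior, which matches the observation above that collective inference helps little when the data set is not densely connected.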
Effect of Sanitization on Other Attack Techniques

We further test the removal of details as an anonymization technique by using a variety of different classification algorithms to test the effectiveness of our method. For each number of details removed, we began by removing the indicated number of details in accordance with the method described earlier. We then perform tenfold cross-validation on this set 100 times, and conduct this for 0 to 20 details removed.

Our technique is effective at reducing the classification accuracy of networks for those details which we have classified as sensitive. While the specific accuracy reduction varies with the number of details removed and the specific algorithm used for classification, we do in fact reduce the accuracy across a broad range of classifiers. We also see that decision trees are affected the most, with a roughly 35 percent reduction in classification accuracy. This indicates that by using a Bayesian classifier to perform sanitization, which makes it easier to identify the individual details that make a class label more likely, we can decrease the accuracy of a far larger set of classifiers.

We also see similar results with our generalization method. While the specific value of privacy that was defined for naive Bayes does not exactly hold, we still see that by performing generalization, we are able to decrease classification accuracy across multiple types of classifier.

VI. CONCLUSION

Balancing the desired use of data against individual privacy presents an opportunity for privacy-preserving social network data mining, that is, the discovery of information and relationships from social network data without violating privacy. We devised three possible sanitization techniques that could be used in various situations. Using both friendship links and details together gives better predictability than details alone. In addition, we explored the effect of removing details and links in preventing sensitive information leakage. In the process, we discovered situations in which collective inferencing does not improve on using a simple local classification method to identify nodes. When we combine the results from the collective inference implications with the individual results, we find that removing details and friendship links together is the best way to reduce classifier accuracy. This is probably infeasible in maintaining the use of social networks. However, by removing only details, we greatly reduce the accuracy of local classifiers, which give us the maximum accuracy that we were able to achieve through any combination of classifiers.

We assumed full use of the graph information when deciding which details to hide. Useful research could be done on how individuals with limited access to the network could pick which details to hide. This work addresses the problem of sanitizing a social network to prevent inference of social network data and then examines the effectiveness of those approaches on a real-world data set.
In order to protect privacy, we sanitize both details and the underlying link structure of the graph. That is, we delete some information from a user's profile and remove some links between friends.

VII. FUTURE ENHANCEMENT

Future work could be conducted in identifying key nodes of the graph structure to see if removing or altering these nodes can decrease information leakage.

VIII. REFERENCES

[1] S. H. Ahmadinejad, M. Anwar, and P. W. L. Fong (2010), Inference Attacks by Third-Party Extensions to Social Network Systems.
[2] L. Backstrom, C. Dwork, and J. Kleinberg (2010), Wherefore Art Thou R3579X?: Anonymized Social Networks, Hidden Patterns, and Structural Steganography, Proc. 16th Int'l Conf. World Wide Web (WWW 07).
[3] J. He, W. Chu, and V. Liu (2006), Inferring Privacy Information from Social Networks, Proc. Intelligence and Security Informatics.
[4] E. Zheleva and L. Getoor (2008), Preserving the Privacy of Sensitive Relationships in Graph Data, Proc. First ACM SIGKDD Int'l Conf. Privacy, Security, and Trust in KDD.
[5] L. Bilge, T. Strufe, D. Balzarotti, and E. Kirda (2009), All Your Contacts Are Belong to Us, in Proceedings of WWW 09, Madrid, Spain.
[6] R. Dey, C. Tang, K. Ross, and N. Saxena (2009), Estimating Age Privacy Leakage in Online Social Networks.
[7] L. Sweeney (2002), K-Anonymity: A Model for Protecting Privacy, Int'l J. Uncertainty, Fuzziness and Knowledge-Based Systems.
[8] A. Friedman and A. Schuster (2010), Data Mining with Differential Privacy, Proc. 16th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining.
[9] C. Clifton, Using Sample Size to Limit Exposure to Data Mining, J. Computer Security, vol. 8, Dec.
[10] K. Tumer and J. Ghosh, Bayes Error Rate Estimation Using Classifier Ensembles, Int'l J. Smart Eng. System Design, vol. 5, no. 2.
[11] C. van Rijsbergen, S. Robertson, and M. Porter, New Models in Probabilistic Information Retrieval, Technical Report 5587, British Library.
[12] D.J. Watts and S.H. Strogatz, Collective Dynamics of Small-World Networks, Nature, vol. 393, no. 6684, June.
E. Steel and G. A. Fowler, Facebook in Privacy Breach, The Wall Street Journal, Oct.
[13] J. He, W. W. Chu, and Z. V. Liu, Inferring Privacy Information from Social Network, in Proceedings of ISI 06, ser. LNCS, San Diego, CA, USA: Springer, May 2006.
[14] W. Xu, X. Zhou, and L. Li, Inferring Privacy Information via Social Relations, in Proceedings of the 24th IEEE ICDE Workshop, Cancun, Mexico, Apr.
[15] L. Bilge, T. Strufe, D. Balzarotti, and E. Kirda, All Your Contacts Are Belong to Us, in Proceedings of WWW 09, Madrid, Spain, Apr. 2009.
[16] E. Zheleva and L. Getoor, To Join or Not to Join, in Proc. WWW 09, Madrid, Spain, Apr. 2009.

Published by IJRCCT
