Digital Footprints: Disambiguation of name entities in Online Social Networks


MSc Thesis Artificial Intelligence
December 17th, 2013
Author: Evangelia Paraskevi Nastou
Supervisor: Dr. Wouter Weerkamp

Abstract

A person's identity is distributed over multiple platforms, each of which was created to provide different services to its users. To build a complete picture of an individual, it is therefore important to be able to aggregate this information. In our project we investigate whether it is possible to find a certain person among all the people who share the same name. The input we have is the name of the person (sometimes only the last name and the first letter of the first name), the age, the organization where the person works or used to work, and the city of residence. Social media profiles are textually sparse, which makes it difficult to extract good textual features for the disambiguation. In this work we present a detailed approach to gathering this type of information and try to achieve better performance in the disambiguation phase than existing methods. We selected the Online Social Networks based on their popularity, measured by their number of users and by their ranking among other Web 2.0 websites in the Netherlands. This led to the following seven social networks: Facebook, LinkedIn, Twitter, Hyves, Google+, Flickr and MySpace.

Contents

Introduction
Growth of Social Media
Name disambiguation
Research questions
Related work
WePS
Name disambiguation above WePS campaign
Matching profiles and social engineering
Linkage Records
Data
Privacy
Data Collection
First run
Nickname extraction
Find profiles based on the extracted nicknames
Pre Processing
Retrieved data set
Extraction of features
Ground Truth Data Set
Training Data Set
Disambiguation method
Problems to overcome
Model description
Matching methods
Levenshtein metric
Cosine similarity
Q-grams
Jaro-Winkler
Jaccard distance
Monge Elkan distance
Euclidean distance
Birthday similarity measure
Location similarity and Website similarity
About and Introduction similarity
Friend's list similarity
Profile pictures similarity
Multiple references on same category
Experimental Setup
Evaluation metrics for clustering
Parameters
Name similarity threshold and Pairwise threshold
SVM weights
Support Vector Machines
Imbalanced data
Results
Comparison with previous work
Results from our method
Nicknames and missing profiles
Ranking
Discussion
Feature Work
Bibliography
Appendix

1. Introduction

The rapid growth of the information that can be stored and retrieved on the Web 2.0 has raised the need for new methods of extracting documents and pages on a desired subject. One of the most common activities of Internet users is finding information about other people, who may be in their circle of friends or be well known from media, history and so on. The results obtained from a search engine contain a mixture of all documents that refer to a name; they do not take into account the different people who share that name, and they are not clustered by the person they refer to. This makes it harder for the user to find the information that applies to a certain person: going through all the results and manually selecting the documents that belong to the same person is time consuming. To make things worse, over the last decade more and more Online Social Networks have appeared, each offering different features and possibilities for how and with whom to connect. The world of Online Social Networks has become our virtual reality. Our whole life can be found on social media pages: our preferences in movies, music and sports, our friends and our family. We share our private and professional life with our friends through Online Social Networks. In most cases, however, it is not only our own social network that has access to this shared information but also people who do not belong to it. The continuous growth in the use of Online Social Networks, and the variety of information that users publish on them, make it easier for third parties and advertising companies to target specific persons by sending them advertisements for products that suit their taste. Moreover, employers nowadays check the public profiles of potential employees, not only on the professional network LinkedIn but also on Facebook.
Forensic investigators, on the other hand, can benefit from this publicly available information when looking for evidence about suspects. When they need more information about a suspect, Online Social Networks can give insight into what a certain person was doing at a certain time, or reveal useful connections in her social network. A person's identity is distributed over multiple platforms, each of which was created to provide different services to its users. Through LinkedIn people share their professional life, whereas Facebook is a platform for sharing events and personal life. To build a complete picture of an individual, it is therefore important to be able to aggregate this information. In our project we investigate whether it is possible to find a certain person among all the people who share the same name, also known as namesakes, given limited extra information. The input we have is the name of the person (sometimes only the last name and the first letter of the first name), the age, the organization in which she works and the city of residence. Social media profiles are textually sparse, which makes it difficult to extract good textual features for the disambiguation.

1.1. Growth of Social Media

According to Steven Van Belleghem [1], seven out of ten internet users are members of at least one social network. This implies that almost 1.5 billion people worldwide use Online Social Networks. As mentioned in the same presentation, people join 2.1 social networks on average, and the most popular combinations are Facebook with either Twitter or LinkedIn. Other top-ranking social networks are Hyves, Flickr, Google+ and MySpace; their rankings among other websites in the Netherlands are shown in Table 1. Google+ is not in the ranked list because the website from which we obtained the rankings has no relevant information for it.

Website                Rank
Google (.nl & .com)    1 & 2
Facebook               3
YouTube                4
LinkedIn               5
Twitter                13
Flickr                 57
Hyves                  79
Instagram              38
MySpace                478
Foursquare             1649

Table 1: Ranking of Social Media Web sites in the Netherlands according to the Web Information Company Alexa

The information that a user shares on a social media platform can be categorized by the privacy level that the user sets. First, the data can be public, meaning that anyone can access the user's profile through the Internet. Second, the profile can be open only to the users of the same social media platform. Finally, the user can share her personal and professional life only within her own network. This makes the extraction of good textual features, and therefore the disambiguation, even more challenging.

It is important for this research to understand how each social network handles the information that a user shares on the platform. Each platform has different default privacy settings, which we have to take into consideration when collecting the features that will support the methods used in the disambiguation phase. Some settings can be changed to a less or more restrictive privacy level and some cannot be changed at all, which usually leaves the information public either to the users of the social network or to all Internet users. The default privacy settings of the features that we are interested in extracting from each social network can be found in the Appendix, Tables 16-22.

Finding a specific person on Online Social Networks can be really difficult when the username of that person is not known. The username can be the real name, but it can also be a nickname, a diminutive of the real name or even a fake name. Some social networks, like LinkedIn and Facebook, prompt users to create an account with their real personal information so that it is easier for their friends to find them. Twitter, on the other hand, prompts new members to use a nickname in addition to their real name.

In this work we present a detailed approach to gathering this type of information and try to achieve better performance in the disambiguation phase than existing methods. We selected the Online Social Networks based on their popularity, measured by their number of users and by their ranking among other Web 2.0 websites in the Netherlands. This led to the following seven social networks: Facebook, LinkedIn, Twitter, Hyves, Google+, Flickr and MySpace.
Detailed information on the popularity of each Online Social Network and on the information shared in it can be found in the Appendix.

1.2. Name disambiguation

Identifying certain people on the Internet from the information that they release publicly is a challenging and interesting task. The ability to automatically point to certain profiles on different social media platforms can be helpful not only for marketing campaigns and the analysis of user behavior but also for investigations and for detecting fraud in banks. The data collected from different Online Social Networks can be structured or unstructured, so processing is needed before we try to match the profiles.

The reasons mentioned above for automatically matching Online Social Network accounts bring to light another important aspect: privacy. Although this information is publicly available, users are not willing to share it with researchers, and using and integrating their data from different sources is a sensitive topic. For this reason we decided to ask volunteers to participate in our research. Eighty people from the same educational background gave us permission to download and process their personal, publicly available data from seven different Online Social Networks. From each of these social networks we downloaded all the information related to the volunteers' names, using a crawler to retrieve the data. Before trying different methods for matching the profiles, preprocessing is needed: knowing the structure of an OSN, we can extract certain features and exclude all irrelevant data. The last step is to integrate the profiles that belong to the volunteer behind our initial query name, using more specific knowledge about that person, such as the city of residence, the company where she works and the year of birth. For matching the profiles we mainly use textual information, together with the profile picture as a strong indication of whether two profiles belong to the same person. After the profiles have been matched, the results can be used in different applications. This procedure is visualized in Figure 1.

In our work we do not use a single search engine to collect the results; instead we query the search engine of each of the social media platforms mentioned before. Moreover, as stated in the terms of each Online Social Network, each user can have only one account. We therefore expect every profile retrieved from a given Online Social Network to belong to a distinct individual.

1.3. Research questions

The purpose of our research can be summarized in the following question: Can we develop a model that retrieves and identifies a target person in different Online Social Networks, and matches profiles that belong to the same person without knowing her username on those networks, using only the limited textual representation of a person in the virtual world?

In order to answer this question we have to answer the following questions:

- What data are available in each Online Social Network?
- Which are the most informative features for matching a profile to the target person?
- Can we create new queries for the search engine of an Online Social Network from the extracted textual features that will return profiles of the target person?

[Figure 1: Visualization of research. Profiles from OSN 1 to OSN 4 on the World Wide Web pass through three stages: Web crawling, Pre processing and Profile matching.]

2. Related work

In this section we discuss previous work on name disambiguation and profile matching, and how the problem has been approached in different fields of research such as engineering and information retrieval. In section 2.1 we review the WePS campaigns, in section 2.2 we review approaches that have been used outside the WePS campaigns, and in section 2.3 we take a brief look at how profile matching and name disambiguation are approached in engineering.

2.1. WePS

With the growth of social media platforms, individuals appear more and more often in web search results. As a consequence, finding an individual in a list of results becomes increasingly difficult, because more than one person may share the same name. In the Netherlands, for example, family names are shared among 16 million Netherlanders, as stated by the Meertens Institute. People name disambiguation, the task of finding all web results that belong to the same person, has been at the center of research in the last few years. Although there is little work on name disambiguation that uses only social media results as input, there has been a lot of research on people name disambiguation with results obtained from search engines, including both social and non-social documents. Three campaigns [2] [3] [4] have taken place in the last five years; their main tasks were the clustering of ambiguous name entities and the extraction of good textual features. Sixteen, nineteen and thirteen research teams took part in the respective campaigns. WePS-I [2] asked systems to estimate the number of namesakes for a given person name and to group documents referring to the same individual. The work of [5] extended the token-based information to a web corpus and also used noun phrase information as features.
After preprocessing the retrieved web pages, they extracted a variety of features for the disambiguation phase: token-based features (local tokens, full tokens, URL tokens and title tokens in the root page) and phrase-based features (noun phrases and named entities). Their clustering method was hierarchical agglomerative clustering with single linkage, and the similarity matrix was computed with the standard SoftTFIDF measure [6]. For the evaluation they used the Purity and Inverse Purity metrics, and they achieved the highest performance among the participants of SemEval-2007 [2]. [7] combined a variety of features extracted from the pages, such as tokens, named entities, hostnames, domains and URLs, as input to hierarchical agglomerative clustering with single linkage, and conducted a set of experiments on the different data sets (ECDL, Wikipedia and Census) available from [2]. The results showed that named entities achieve the best performance among the feature combinations they tried. [8] extracted a number of HTML-based features that allowed them to run Agglomerative Vector Space Clustering, comparing the features of each document and clustering documents together according to a minimum number of similar pairs. A semi-supervised method for the clustering task was applied in [9]: as a seed node they used the corresponding Wikipedia page or, if none existed for a certain name, the top-ranked web pages. Each time two pages were merged, the centroid of the cluster was recalculated in order to control the fluctuation. [10] presents a combination of a supervised and an unsupervised method for the name disambiguation task. They used classification and clustering algorithms, and the best results were merged according to the intersection between the initial set and the results obtained from the algorithms. The features they used were biographic facts, such as date and place of birth, the URL, a list of weighted keywords and metadata about the web page. A different approach to the WePS task was introduced by [11]: they used language modeling tools and different representations of each document (snippet, title and body text) and experimented with Single Pass Clustering and Latent Semantic Analysis.

WePS-II [3] focuses on two tasks: personal name disambiguation and attribute extraction. [12] extracted features beyond the web corpus and achieved the highest performance in the disambiguation task. The features they used are tokens appearing in the same sentence as the ambiguous name, tokens appearing in a given web page, URL tokens and title tokens in the root pages, which are more or less the same as in the work of [5]. They showed that the robustness of a disambiguation system can be improved effectively by collecting features from resources broader than the web corpus.
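Several of the systems above cluster pages with hierarchical agglomerative clustering under single linkage. The following is a minimal sketch of that technique, not a reconstruction of any cited system: the Jaccard token overlap stands in for SoftTFIDF, and the function names, the toy pages and the stopping threshold of 0.4 are all illustrative assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two token lists (a simple stand-in for SoftTFIDF)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def single_link_hac(docs, similarity, threshold):
    """Merge clusters until the closest pair is less similar than `threshold`.

    Under single linkage, the similarity of two clusters is the similarity
    of their most similar pair of members.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            score = max(similarity(docs[i], docs[j])
                        for i in clusters[x] for j in clusters[y])
            if score > best_score:
                best_score, best_pair = score, (x, y)
        if best_score < threshold:
            break
        x, y = best_pair  # x < y, so deleting y keeps index x valid
        clusters[x].extend(clusters[y])
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy token lists standing in for four web pages about the same name.
pages = [
    ["amsterdam", "university", "ai"],
    ["amsterdam", "university", "thesis"],
    ["thesis", "ai", "university"],
    ["football", "rotterdam"],
]
clusters = single_link_hac(pages, jaccard, threshold=0.4)
```

With these toy pages the first three documents end up in one cluster and the fourth stays alone. The threshold plays the role of the stopping criterion that the cited systems tune on training data.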
The results of [13] indicate that hierarchical agglomerative clustering performs best; they also examined the impact that preprocessing of the HTML page has on clustering. In their method they converted the HTML pages and tried two systems, one with stemming and one without. For the clustering they used single pass clustering and probabilistic latent analysis. [14] developed a two-stage clustering algorithm in which the results of a first run of the hierarchical agglomerative algorithm are used to extract features (named entities, compound keyword features, URLs) for the second-stage clustering. In this way they improved on the low recall of the first-stage clustering. Professional categorization is the method that [15] used for the name disambiguation task. They categorized the namesakes by a professional taxonomy, which was extracted from Freebase and from evidence pages in which the name and the profession appear in the same sentence. Each web page was represented as a vector of features, such as tokens, URL tokens, snippet tokens and named entity tokens. Finally, a kNN classifier was used to disambiguate the namesakes and create clusters represented by the name and the profession. The feature set used by [16] consisted of all unigrams by sentence and paragraph, combined with the title and the URL; the similarity between documents was calculated with six different measures: cosine similarity, Euclidean distance, Jaccard coefficient, Manhattan distance, weighted sum of common unigrams and Jaro similarity. For the machine-learned model a support vector machine with a polynomial kernel was used. [17] used a web search engine as an external data source for the disambiguation task. Named entity features were extracted from each web page and a TF-IDF similarity was computed for the clustering procedure. The new query would consist of the initially ambiguous name and a category of named entities found in the web page. The resulting pages were then clustered with a single-link hierarchical agglomerative algorithm.

WePS-III [4] is the last of the three campaigns that addressed the name disambiguation and attribute extraction tasks. In this campaign the two tasks were merged into one: systems must return both the documents and the attributes that belong to each namesake. The method of [18] outperformed the existing ones on the task of name disambiguation by extending the bag-of-words feature with Wikipedia concepts. As a similarity measure they used a model that weights the query relevance and the content relevance of each feature in the vector. They evaluated their model with different features and similarity measures (cosine similarity and overlap), and in all cases the combination of bag-of-words and Wikipedia concepts gave the best results. The AXIS research team [19] used the Web graph structure for person name disambiguation and achieved relatively high precision. Their system works as follows: first, person pages that share related pages are clustered together, which they call Web structure clustering; then, for the remaining pages, they create frequency vectors of the terms contained in each page and run the hierarchical agglomerative algorithm. Their results show that this system is competitive with the other systems that took part in the third campaign.
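To make the vector-based measures listed for [16] concrete, here is a small sketch of four of them (cosine similarity, Euclidean distance, Manhattan distance and Jaccard coefficient) over unigram count vectors. It is only an illustration under our own assumptions: the whitespace tokenization, the function names and the two example sentences are not taken from [16].

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two unigram count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a, b):
    """Straight-line distance; a Counter returns 0 for missing unigrams."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


def manhattan_distance(a, b):
    """Sum of absolute count differences over all unigrams."""
    return sum(abs(a[t] - b[t]) for t in a.keys() | b.keys())


def jaccard_coefficient(a, b):
    """Shared unigram types divided by all unigram types."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


# Hypothetical profile snippets, tokenized on whitespace.
doc_a = Counter("researcher at the university of amsterdam".split())
doc_b = Counter("student at the university of utrecht".split())

scores = {
    "cosine": cosine_similarity(doc_a, doc_b),
    "euclidean": euclidean_distance(doc_a, doc_b),
    "manhattan": manhattan_distance(doc_a, doc_b),
    "jaccard": jaccard_coefficient(doc_a, doc_b),
}
```

Note that cosine and Jaccard are similarities (higher means closer), while Euclidean and Manhattan are distances (lower means closer), so they need to be put on a common scale before being combined as features for a classifier.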
Finally, the third-ranked team [20] of this campaign compared the results obtained from three different clustering algorithms (Lingo, HAC and 2-step HAC), with HAC performing best. As they state, the preprocessing procedures play an important role in the clustering results: cleaning the noise from HTML tags and applying NLP processing improve the clustering task.

2.2. Name disambiguation above WePS campaign

In the WePS campaigns the input data for the disambiguation phase were obtained by querying web search engines and collecting the first one hundred results. These web pages included both social web pages and non-social web pages such as biographies. In our work we want to focus only on the social pages, that is, pages from Online Social Networks. The work of [21] used different approaches for social pages and non-social pages and merged the results at the end. Of the clustering methods they tried (one-in-one, all-in-one and hierarchical agglomerative clustering), the last one gave the best results. For the social pages they used additional methods such as co-clicks, clicks in the same burst and cross links. At the end they merged the results with two methods. In the first method they simply take the union of the two clusters. In the second method they use a similarity threshold and also penalize clusters that contain a social media page.

Another team [22] worked on the person disambiguation problem using only social media pages. In their work all the data shared a common organization, which helped in the disambiguation procedure: they knew a priori that every participant was a student of a certain university, and connections between participants could help the disambiguation phase. Their work is divided into two phases, the discovery phase and the disambiguation phase. In the discovery phase they try to find as many profiles as possible that can be candidates for the disambiguation phase. In order to find possible usernames, Rapportive and Google's API are used, and in each step they query the web search engine with the new usernames. After collecting all the possible profiles for each candidate name, they start the disambiguation. The heuristics that they use are a combination of keyword matching, community structure analysis and extraction of semantic and feature data from profiles.

A vector space model for document co-reference was used by [23]. In their approach they used summaries of the documents to create the term vectors and experimented with different scoring metrics, namely MUC co-reference scoring and the B-Cubed scoring algorithm; two names are predicted to be the same person if their similarity score is above a threshold. [24] extended the vector space representation by using biographic details, such as birth day and year and occupation, as features. For the person name disambiguation task, the work of [25] tried a different approach, using graph walk methods and re-ranking the outputs based on features derived from the graph walk.
More precisely, they represented different types of real-world entities, such as file, person, address, term and date, as nodes. Correspondingly, they defined edge types, such as sent-to, sent-from and includes-term, which denote a direct connection between two nodes. The similarity between two nodes is calculated by a lazy walk process, controlled by a parameter θ. In their walk method they created vectors with the weight of each node and started the graph walk from the term, which propagates to the person node where that term appears most frequently.

The work of [26] introduced a two-phase method for identity disambiguation that uses social circles to detect Sybil and non-Sybil users. The two phases of ranking are the Static and the Dynamic phase. The Static phase is based on the initial parameters given by the users and on the results obtained from the web crawler, which are then verified by the Matcher and the Score Generator. The Dynamic phase changes the values of the first phase according to the polling results from each user; for instance, accepting or rejecting a friend request answers the question of whether or not this person is fake. The rank of each user is re-calculated based on the acceptance or rejection of the friend request and on the positive or negative vote regarding the genuineness of the requester.

[27] presents a semi-supervised approach for identity disambiguation of Web references using Web 2.0 data and semantics. First, an RDF model is generated for the digital identity of a person within a specific platform. In the second stage they create metadata models from the results obtained by querying the Semantic Web and the World Wide Web with a certain person's name. Finally, the seed data are represented in a Linked Data social graph. For the disambiguation phase two methods were used: a semi-supervised machine learning technique and a graph-based technique with random walks, which uses agglomerative clustering to cluster the instances that are closest to the social graph.

The approach of [28] to the name disambiguation problem was a more general one and was not designed only for this specific task. Their focus is on the different strategies that can be used inside hierarchical agglomerative clustering. Among the successful strategies they tried are using the similarity between the closest documents and stemming, both of which improve the results, and using a term window, which gives high B-Cubed results.

The work of [29] shows how to match profiles on social media networks that belong to the same person. Their work is based on a certain person's list of friends in two different social networks, Facebook and Twitter. They created the social graph of each profile and tried to project each node onto the other graph; they then used a joint link-attribute model in which they create connections between nodes based on the similarity of profile fields.
More precisely, they used a probabilistic model for finding the optimal configuration of profile projections.

2.3. Matching profiles and social engineering

Matching profiles from different social media platforms has been an intriguing subject in social engineering in recent years. The research project in [30] set out to show that it is possible to match the Facebook and LinkedIn profiles of the employees of a certain company. They used several techniques for collecting their data and matching the profiles: using only the publicly available data, using the friends lists of profiles already recognized as matches, and finally creating zombie profiles, fake accounts with the sole purpose of making as many connections with other users as possible in order to gain access to their private information. The purpose of the project was to prove that a company's hierarchy of employees can be reconstructed just by matching profiles on different Online Social Networks, and to show the consequences this would have for the company.

2.4. Linkage Records

Our problem of matching profiles based on certain features is similar to the record linkage problem. Record linkage is the task of accurately finding and linking records from different sources that refer to the same entity, and it has applications in customer systems for marketing, fraud detection, data warehousing and government administration [31]. More precisely, record linkage is used for data cleaning and removing duplicates, and for merging two or more data sets into one. Record linkage faces two main problems: accuracy, and the requirement that records contain common identifying information [32]. In an ideal world each record would have one unique identifier, but when we try to match records from different databases this identifier might be a random number created by the system. The same holds in our case: each member of an OSN has a unique identifier for that OSN but a different one for another OSN. Two distinct methodologies have been used to solve the data linkage problem: deterministic and probabilistic methods. Deterministic methods require exact, character-by-character matches and work better when there are several representative identifiers [33]. Probabilistic linkage, on the other hand, is a supervised method in which weights are estimated from the observed agreements and disagreements of the data values [34]. Record pairs with a probability above a certain threshold are a match; all others are a non-match. Our work is closest to the probabilistic linkage approach, with the difference that the fields given as features are not the same across the different sets, that is, across the Online Social Networks. In other words, some fields might be present in two or three Online Social Networks but not in the rest. This makes the integration of the records a more challenging task.
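The probabilistic linkage idea described above can be sketched in the style of the classic Fellegi-Sunter model. This is only an illustration, not the formulation of [34] nor the model used later in this work: the field names, the m/u probabilities, the threshold and the example profiles are all hypothetical, and the score is a summed log-likelihood weight rather than a probability. Fields that are missing on either side are skipped, which mirrors the situation where different Online Social Networks expose different fields.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not estimated
# from any data set):
#   m = P(field agrees | profiles belong to the same person)
#   u = P(field agrees | profiles belong to different people)
FIELD_PROBS = {
    "city": (0.90, 0.10),
    "employer": (0.80, 0.02),
    "birth_year": (0.95, 0.03),
}


def match_weight(profile_a, profile_b, field_probs=FIELD_PROBS):
    """Sum agreement/disagreement weights over fields present in both."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        value_a, value_b = profile_a.get(field), profile_b.get(field)
        if value_a is None or value_b is None:
            continue  # field not available on one network: no evidence
        if value_a == value_b:
            total += math.log2(m / u)              # agreement weight (> 0)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (< 0)
    return total


def is_match(profile_a, profile_b, threshold=5.0):
    """Declare a match when the total weight reaches the threshold."""
    return match_weight(profile_a, profile_b) >= threshold


# Made-up profiles from three networks.
linkedin = {"city": "amsterdam", "employer": "acme", "birth_year": 1988}
twitter = {"city": "amsterdam"}
facebook = {"city": "utrecht", "employer": "acme", "birth_year": 1988}

linked_to_facebook = is_match(linkedin, facebook)
linked_to_twitter = is_match(linkedin, twitter)
```

In this toy setting the LinkedIn and Facebook profiles are linked despite the differing city, because employer and birth year agree, while the Twitter profile that only shares a city does not carry enough evidence on its own.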

3. Data

When collecting the data from the different Online Social Networks we faced some legal issues about which data are considered private, which data are considered public, and to what extent we can download and process them in an anonymous way. As shown in the Appendix, different Online Social Network platforms have different default settings for which data are shared publicly, which data remain private and which data are shared among the connections of the user. The privacy of data published on the Web has been discussed extensively in the last decade, since privacy regulations vary between countries. As mentioned in [35], European law allows the analysis of data collected from the Web as long as the records are identified by a pseudonym. US law states that the record holder can assign a code to the record so that it can be identified.

3.1. Privacy

[35] raises the very important question of how we can define privacy in an environment that cannot be controlled, such as the World Wide Web. Moreover, what one user considers private might not be private for another user. For example, a user might share a photo of himself and some of his friends on an online social platform, but one of those friends might consider that photo private and not want it to be shared publicly; as stated in [36], what poses a threat to users' privacy is their links with friends. The same paper also shows that being a member of a group in a Social Media Network can reveal information that is private in the user's own profile. One sentence that summarizes the paper of [35] is: "Privacy is not only about hiding certain information, but also about controlling information and its uses (e.g., by constructing different identities), and is finally a dynamic practice involving negotiations and tradeoffs between hiding and disclosing/sharing."

Because it is unclear which laws apply, which data are considered private and whether it is legal to download and experiment with publicly available data, we decided to ask people for permission to search for and download their publicly available data from the different Online Social Networks, and to keep their data private and not expose them in our report.

3.2. Data Collection

Each Online Social Network has a different structure and uses different technologies to present the data that each user decides to upload. The full collection of data was obtained in multiple steps, which can be seen in Figure 2. For downloading the data we used the WGET command in Python scripts. Only for Twitter did we make full use of the API offered by the Online Social Network. The main reason why we did not use the APIs of the other OSNs as well is that the information we could get was very limited, because our account in each of these networks was friend-free, meaning that we did not have any contacts. In order to return more details about a certain user, the APIs need authentication by that user. In Google Plus and in Hyves we used the API offered by the network only for querying it and getting the URLs of all users who share a given name.

[Figure 2 (flowchart). Boxes: Names of volunteers; Create all diminutives of the official name; Query and download data from the Online Social Networks; Process .html pages to extract the features; Process data to find possible nicknames; Search and download if a profile exists with the given user-id/nickname; Assess against criteria (Yes/No); Structured documents per profile; Clustering / Disambiguation algorithm.]
Figure 2: Process followed to download the data set and extract features

As shown in Figure 2, we first created all the diminutives that can be derived from the formal first name, using the database of Dutch names from the Meertens Institute. A formal name might be a combination of more than one name, so we created all possible diminutives for all of its component names. After deriving all the diminutives of a volunteer's official first name, we ran our algorithm for downloading all profiles from each Online Social Network. The second step is to obtain all nicknames that can be found and to search for new profiles. We continue in the same way until we have searched all the possible names derived from the formal name of the user. The order of the last name and the first name in the queries does not make any difference on Twitter, LinkedIn, Google Plus, Hyves, Flickr and MySpace: both combinations return the same results. Only on Facebook might the order of the words in the query return different results. In order not to lose any valuable information, we decided to run the reversed query as well.

First run

On the first run we collect all the profiles returned by the query consisting of the casual first name or the diminutive and the last name of a person. From the returned profiles we try to find a nickname that could possibly help us find more profiles of the same person on other Online Social Networks. The only network that we do not query again is LinkedIn, since its users are not allowed to use nicknames and can only use their first and last name.

Nickname extraction

To decide whether a user-name or user-id is a nickname, we retrieved the real name and the user-name available on the user's page. The most common way for a system to create a user-id is either to use a unique random number or to create a user-id (Table 2) that consists of the given real name of the user and a unique number.
In that case the first name is usually separated from the last name by a dot and, if the resulting id is already in use, a dot and a number are added at the end. For that reason we discarded symbols such as underscores, hyphens, dots and spaces, which commonly appear in nicknames or user-ids, by replacing them with the empty string. We then calculated the longest common substring and, only if the difference between the length of the actual name and the length of the longest common substring was beyond a certain threshold, calculated the Levenshtein distance, also known as the edit distance. Only if the resulting ratio was below a certain threshold did we add the nickname to our list for checking whether a profile exists on the other Online Social Networks.
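The nickname test described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the two thresholds (`max_uncovered`, `max_similarity`) are placeholder values, since the text does not report the ones actually used, and the "ratio" is assumed here to be a normalised similarity, 1 minus the edit distance divided by the longer string length. The function names and example names are hypothetical.

```python
import re


def normalise(text):
    """Lower-case and drop underscores, hyphens, dots and whitespace."""
    return re.sub(r"[_\-.\s]", "", text.lower())


def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_nickname(real_name, user_id, max_uncovered=3, max_similarity=0.5):
    """Treat user_id as a nickname only if it differs enough from real_name."""
    name, uid = normalise(real_name), normalise(user_id)
    if not name or not uid:
        return False
    uncovered = len(name) - longest_common_substring(name, uid)
    if uncovered <= max_uncovered:
        return False  # the user-id is essentially the real name
    similarity = 1 - levenshtein(name, uid) / max(len(name), len(uid))
    return similarity < max_similarity
```

Under these assumptions a system-generated id such as `jan.jansen.3` is rejected for the made-up name "Jan Jansen" because the normalised real name is fully contained in it, whereas an unrelated handle passes both checks and would be added to the nickname list.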

Facebook: firstname.lastname.#; system's ascription with the possibility to change by user
LinkedIn: unique number; system's ascription
Twitter: screen name; user's selection
Google Plus: unique number; system's ascription
Hyves: username; user's selection
Flickr: unique number@n#; system's ascription with the possibility to change by user
MySpace: unique number; system's ascription with the possibility to change by user
Table 2: ID ascription for each OSN

In Facebook, Flickr, Hyves and MySpace, the comparison used to obtain a possible nickname is between the name of the user and the user-id. On Google Plus we try to find a nickname through URLs from Youtube and/or Picasa and from the "other names" field. If they are available, we again compare the username extracted from the URL with the actual name of the user, as given on the user's personal web page. LinkedIn users do not use nicknames because of the professional nature of the network. What can be extracted from the provided data, however, is the Twitter name of a user, if it is available and publicly visible. From LinkedIn we also get profiles with no name; these belong to LinkedIn members who have chosen not to make their profile publicly available. In this case we use the query name as the name of the profile. On Twitter we compare the screen name with the actual given name of the user and, only if they differ according to the same criteria that we used for the previously mentioned networks, we add it to the list of nicknames.

Find profiles based on the extracted nicknames

We do not run a new query for the nicknames in the list, as we are not interested in finding derivations of a nickname; that would return too many irrelevant results. What we are interested in is whether a user profile exists that has the specific nickname as its user-id. For that reason we only download, with WGET, the pages that do not return a 404 (page not found) error.
For the same reason we do not search the returned pages again for new nicknames, since we assume that the previously extracted nickname can already give us the additional information on user profiles that we might have missed.
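The existence check described above can be illustrated with a short Python sketch. The downloads in this work were made with WGET; the sketch below substitutes the standard library's `urllib` for it, and the URL patterns in `URL_TEMPLATES` are hypothetical placeholders, not the real addresses of any network. The `opener` parameter is an assumption added so that the check can be exercised without network access.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Hypothetical URL patterns; the real ones differ per network.
URL_TEMPLATES = {
    "twitter": "https://twitter.example/{nick}",
    "flickr": "https://flickr.example/people/{nick}",
}


def profile_exists(url, opener=urlopen):
    """Return True if the page loads, False if the server answers 404."""
    try:
        with opener(url) as response:
            return response.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise  # other errors (403, 500, ...) need separate handling


def find_profiles(nickname, templates=URL_TEMPLATES, opener=urlopen):
    """Map each network to the profile URL that exists for this nickname."""
    found = {}
    for network, template in templates.items():
        url = template.format(nick=nickname)
        if profile_exists(url, opener):
            found[network] = url
    return found
```

Only exact user-id URLs are probed, which mirrors the decision not to issue a new search query for each nickname.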

Many of the results returned by the queries might not be of interest to us. We are not aware of the similarity functions that each Online Social Network uses but, for example, many of the results returned by Facebook may have a totally different last name, with only the first name being the same. This usually happens for uncommon combinations of first and last names. The same can be the case on Twitter, since for a given query it searches not only the given name but also the screen name and returns possible matches.

Pre Processing

From each Online Social Network we decided to download certain categories of data that help us match the existing profiles with the specific person we are looking for. After obtaining the pages with all the existing information, we had to work with the different structures of the OSNs in order to extract only the data that we would be able to use in the disambiguation phase. Table 4 shows which specific information was extracted from each OSN. We present the fields by their original names in each OSN and have grouped them by category (name, city of origin, date of birth, etc.) by assigning a different color to each category. Table 23 in the Appendix lists the whole set of features extracted from each OSN; the uncolored ones are those that we did not use as features in the end, either because the information does not make a good feature for matching profiles or because it is available in only one OSN and not in the others.

Retrieved data set

The information downloaded from each network was limited to the sub-pages of the static profile of the user. By static profile we refer to the information that might stay unchanged for a certain period of time, namely the personal information that a user publicly shares.
On the other hand, by dynamic information we refer to data that an active user might change daily, such as the posts on her wall, as the place where users post events and thoughts of their everyday life is usually called. We are interested only in the static profile of a user. Moreover, we decided to store only the profile picture of each user and not all the albums that might be available online, since the profile picture is usually the most representative picture of an individual. Finally, we retrieved their friends list under the assumption that the same person might have a number of the same contacts in more than one OSN. Table 3 lists the pages that we are interested in from each Online Social Network.

Facebook: Domain_url/about; Domain_url/friend; Domain_url/profile picture; Domain_url/cover phot
LinkedIn: Domain_url/profile; Domain_url/people also viewed; Domain_url/profile picture
Twitter: details returned from API; Domain_url/following; Domain_url/followers
Hyves: Domain_url/profile; Domain_url/friends; Domain_url/profile picture
Google Plus: Domain_url/about; Domain_url/have circles; Domain_url/in circles; Domain_url/profile picture
Flickr: Domain_url/profile; Domain_url/contacts; Domain_url/profile picture
MySpace: Domain_url/profile; Domain_url/friend; Domain_url/profile picture
Table 3: Sub-pages extracted from each Online Social Network

Extraction of features

As we have already mentioned, the diversity of information available in each Online Social Network depends on the purpose of the network (whether it covers personal or professional life events) and on the settings that each individual user selects for how her information is shared with users and non-users of the OSN. For this reason, in order to extract the features that help us in the disambiguation task, we took the default settings of each platform as a guideline: we extracted only the information that is publicly available by default, where it existed (detailed tables of the default settings per OSN can be found in the Appendix, Table 14 to Table 20). Table 4 lists all the features per OSN. The different colors of the cells indicate the possible matches. For the extraction of these features we used different techniques per OSN, depending on the source code of the platform. We mainly used Beautiful Soup, a Python library that parses HTML and XML documents, since this is the main format of our data sources. In some cases, when the data we wanted to extract were inside JavaScript code and Beautiful Soup could not recognize them, we used regular expressions. In this manner we ended up with one structured document file per user of each Online Social Network, in which all the existing features were listed.
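The two extraction routes described above, parsing the HTML tree and falling back to regular expressions for values embedded in JavaScript, can be sketched as follows. To keep the sketch free of third-party dependencies it substitutes the standard library's `html.parser` for Beautiful Soup. The class names (`name`, `hometown`), the JSON key (`screen_name`) and the sample markup are hypothetical; real profile pages are structured differently on every network.

```python
import re
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # field whose text we are waiting for
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()
            self.current = None

    def handle_endtag(self, tag):
        self.current = None


def script_value(html, key):
    """Regex fallback for a quoted value assigned to `key` inside script code."""
    match = re.search(r'"%s"\s*:\s*"([^"]*)"' % re.escape(key), html)
    return match.group(1) if match else None


def extract_features(html, wanted=("name", "hometown"), script_keys=("screen_name",)):
    """Return one structured record (a dict) for a downloaded profile page."""
    parser = FieldParser(wanted)
    parser.feed(html)
    features = dict(parser.fields)
    for key in script_keys:
        value = script_value(html, key)
        if value is not None:
            features[key] = value
    return features
```

Fields that are absent from a page are simply left out of the returned dictionary, which corresponds to the missing values that have to be handled in the matching phase.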

Facebook LinkedIn Twitter Google Plus Hyves Flickr MySpace
Name Name Screen name Name Name Nickname Name Nickname Headline Name Works at Nickname Name name (info) About Location Locations Attends Relationship status Hometown age (info) Employers Industry Description Lives in Living Currently current (info) city Graduate School Twitter Status Gender About me I am Nickname High School Working Id Looking for Birthday Occupation Zodiac Sign College Title of work Statuses count Birthday Age Website Status Birthday School name Replies Id Relationship Companies Gender Here for Gender Work duration Screen reply Other names Weblog Groups Hometown Interested In Languages Urls Url Religion Joined Orientation Relationship Status Groups Current City Hometown About me Religious Views Interests Cities Lived Study/werk Status Website Occupation Relatie Education Current City Employment What's on my mind Schools Hometown Education Website Major Anniversary Introduction Schools Minor Languages Colleges Companies Political Views Bragging rights Religion
Table 4: Features extracted from each profile page for each Online Social Network. Different colors indicate possible matches of information across the different platforms

3.4. Ground Truth Data Set

For this research project we asked people to volunteer and give us permission to download their publicly available data. After running our model (described in a separate section) in an unsupervised manner in order to get an idea of how it works, we asked the volunteers to evaluate the results by pointing us to their true profiles in each Online Social Network. Of the eighty volunteers that we gathered at the beginning of our project only seventy replied; these finally composed our ground truth data set.

Training Data Set

For finding the parameters of our model we needed a training data set. Our training data set is constructed by computing all the per-feature distances between profiles that belong to the same person and assigning these to class one, and by computing the per-feature distances between profiles that do not belong to the same person and assigning those to class zero. The feature vectors thus contain all per-feature distances between profiles of the same person, which form the positive class, and all per-feature distances between profiles of different people, which form the negative class. Table 5 shows the number of positive and negative profiles found by querying all the OSNs per instance for the training data set. There were two main problems that we had to deal with:
1. As already mentioned, almost no profile contains all the extracted information per feature, so the number of missing values is relatively large. In these cases we decided to assign the value zero (0), as if the two values were not similar at all, both when only one of them exists and when neither exists.
2. Our training data set is imbalanced, since the number of positive instances is relatively small compared to the number of negative instances. This results in the misclassification of elements of the different classes and has a negative effect on the classification performance of machine learning algorithms.
In order to overcome these two problems we decided to follow well-known techniques from the literature. For missing values, different approaches have been proposed in the literature; we discuss them in more detail in a separate section.

Columns: Volunteer instance; Matching profiles (positive); Non-matching profiles (negative)
Table 5: Positive and negative instances for training data set
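The construction of the training set described above can be sketched in Python as follows. This is a minimal sketch, not the exact procedure of this work: the feature names are placeholders, the per-feature score is computed with `difflib.SequenceMatcher` as a stand-in for the matching methods discussed later, and the score is treated as a similarity so that the value 0 assigned to missing fields means "not similar at all", as in item 1 of the list.

```python
from difflib import SequenceMatcher
from itertools import combinations

FEATURES = ("name", "city", "employer")  # hypothetical feature names


def similarity(a, b):
    """Per-feature score in [0, 1]; 0 when either value is missing."""
    if not a or not b:
        return 0.0  # one or both values absent: treated as not similar at all
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_training_set(profiles):
    """profiles: list of (person_id, feature_dict) pairs.

    Returns feature vectors X and labels y, where y is 1 when both profiles
    of a pair belong to the same person and 0 otherwise.
    """
    X, y = [], []
    for (pid_a, feats_a), (pid_b, feats_b) in combinations(profiles, 2):
        X.append([similarity(feats_a.get(f), feats_b.get(f)) for f in FEATURES])
        y.append(1 if pid_a == pid_b else 0)
    return X, y
```

Because every profile of a volunteer is paired with many profiles of namesakes, the label list produced this way contains far more zeros than ones, which is the imbalance noted in item 2.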

4. Disambiguation method

4.1. Problems to overcome

The main difficulty in trying to disambiguate users' profiles across different OSNs is the scarcity of data. On top of this, there are further problems that have to be considered beforehand in order to have a clear view of the matching problem. To begin with, each OSN ascribes its own user ID, so there is no universal ID shared by the profiles of the same person in different OSNs. The default IDs per OSN can be seen in Table 2. Only in cases where the member selects the user ID, and under the assumption that she uses the same ID on different OSNs, can we conclude with some probability that two profiles might belong to the same person. Another fact that we have to take into consideration is that personal information is not presented in the same way in each network. Moreover, even the same person might represent her information in a different format in different Online Social Networks. For example, on one OSN the current city of residence might be given in English and on another in Dutch: The Hague and Den Haag, respectively. Additionally, even if someone presents the same personal information in different OSNs, spelling errors should also be taken into account, for example Amsterdam and Amterdam. Finally, since people do not use all the Online Social Networks with the same frequency, some data might be outdated or even missing. In this case, although two profiles might belong to the same person, the limited number of shared features would not allow us to conclude that the profiles match. Taking these problems into consideration, we approach the matching and disambiguation problem as a problem of matching textual information from the personal information found on Online Social Network profile pages, matching information found in the social network of the user and, finally, comparing the profile pictures of users.
Here we describe in more detail how our disambiguation method works and how we estimate the importance of each feature, and we define the thresholds that we set for selecting the final set of profiles that belong to the target person.
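The language and spelling problems mentioned above behave differently, which a small Python sketch can illustrate. A misspelling such as Amterdam is still close to the intended string and can be caught by approximate matching, whereas The Hague and Den Haag share too few characters for that and need an explicit mapping. The alias table, the list of known cities and the cutoff below are assumptions made for this illustration, and `difflib.get_close_matches` is used as a stand-in for the matching methods of this work.

```python
from difflib import get_close_matches

# Hypothetical alias table mapping other-language names to one canonical form.
ALIASES = {"the hague": "den haag", "'s-gravenhage": "den haag"}
# Hypothetical list of canonical city names.
KNOWN_CITIES = ["amsterdam", "den haag", "rotterdam", "utrecht"]


def canonical_city(raw, known=KNOWN_CITIES, aliases=ALIASES, cutoff=0.85):
    """Map a free-text city field to a canonical name, or None if unknown."""
    city = raw.strip().lower()
    city = aliases.get(city, city)  # resolve translations first
    close = get_close_matches(city, known, n=1, cutoff=cutoff)  # then typos
    return close[0] if close else None
```

After this normalisation the two spellings of each city compare as equal, so the city field can contribute to a match instead of counting as a disagreement.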
