Digital Footprints: Disambiguation of name entities in Online Social Networks


MSc Thesis Artificial Intelligence
December 17th, 2013
Author: Evangelia Paraskevi Nastou
Supervisor: Dr. Wouter Weerkamp

Abstract

A person's identity is distributed over multiple platforms, each of which was created to provide different services to its users. To form a complete picture of an individual, it is therefore important to be able to aggregate this information. In our project we investigate whether it is possible to find a certain person among all the people who share the same name. The input that we have is the name of the person (sometimes only the last name and the first letter of the first name), the age, the organization where the person works or used to work, and the city of residence. Social media profiles are textually sparse, which makes it difficult to extract good textual features for disambiguation. In this paper we present a detailed approach to gathering this type of information and try to achieve better performance in the disambiguation phase than previously published methods. The choice of Online Social Networks was based on their popularity, measured by the number of users of each network and by their ranking among other web sites in Web 2.0¹ in the Netherlands. We settled on the following seven social networks: Facebook, LinkedIn, Twitter, Hyves, Google+, Flickr and MySpace.

Contents

Introduction
Growth of Social Media
Name disambiguation
Research questions
Related work
WePS
Name disambiguation above WePS campaign
Matching profiles and social engineering
Linkage Records
Data
Privacy
Data Collection
First run
Nickname extraction
Find profiles based on the extracted nicknames
Pre Processing
Retrieved data set
Extraction of features
Ground Truth Data Set
Training Data Set
Disambiguation method
Problems to overcome
Model description
Matching methods
Levenshtein metric
Cosine similarity
Q-grams
Jaro-Winkler
Jaccard distance
Monge-Elkan distance
Euclidean distance
Birthday similarity measure
Location similarity and Website similarity
About and Introduction similarity
Friend's list similarity
Profile pictures similarity
Multiple references on same category
Experimental Setup
Evaluation metrics for clustering
Parameters
Name similarity threshold and Pairwise threshold
SVM weights
Support Vector Machines
Imbalanced data
Results
Comparison with previous work
Results from our method
Nicknames and missing profiles
Ranking
Discussion
Future Work
Bibliography
Appendix

1. Introduction

The rapid growth of the information that can be stored and retrieved on the Web 2.0 has raised the need for new methods of extracting documents and pages on a desired subject. One of the most common activities of Internet users is finding information about other people, who may be in their circle of friends or be well-known figures from the media, history and so on. The results obtained from a search engine contain a mixture of all documents that refer to a name; they do not take into account the different people who share that name, and they are not clustered into pages that refer to different people. This makes it harder for the user to find the information that applies to a certain person. Going through all the results and selecting manually which documents belong to the same person is time consuming.

To make things worse, over the last decade more and more Online Social Networks have appeared, each offering different features and possibilities for how and with whom to connect. The world of Online Social Networks has become our virtual reality. Our whole life can be found through social media pages: our preferences in movies, music and sports, our friends and our family. We share our private and professional life with our friends through online social networks. In most cases, however, it is not only our own social network that has access to this shared information but also people who do not belong to it. The continuous growth in the use of online social networks, and the variety of information that users publish on these platforms, make it easier for third parties and advertising companies to target specific persons by sending them product advertisements that suit their taste. Moreover, employers nowadays check the public profiles of potential employees, not only on the professional network LinkedIn but also on Facebook.
Forensic investigators, on the other hand, can benefit from this publicly available information when looking for evidence about suspects. When they need more information about a suspect, Online Social Networks can give more insight into what a certain person was doing at a certain time, or reveal useful connections in her social network.

A person's identity is distributed over multiple platforms; each of these Online Social Networks was created to provide different services to its users. Through LinkedIn people share their professional life, whereas Facebook is a social media platform for sharing events and personal life. To form a complete picture of an individual, it is therefore important to be able to aggregate this information. In our project we investigate whether it is possible to find a certain person among all the people who share the same name, also known as namesakes, given only limited extra information. The input that we have is the name of the person (sometimes only the last name and the first letter of the first name), the age,

the organization in which she works and the city of residence. Social media profiles are textually sparse, which makes it difficult to extract good textual features for disambiguation.

1.1. Growth of Social Media

According to Steven Van Belleghem [1], seven out of ten internet users are members of at least one social network. This implies that almost 1.5 billion people worldwide use Online Social Networks. As mentioned in the same presentation, people join 2.1 social networks on average, and the most popular combinations are Facebook with either Twitter or LinkedIn. Other top-ranking social networks are Hyves, Flickr, Google+ and MySpace; their rankings among other websites in the Netherlands are shown in Table 1. Google+ is not in the ranked list, since the website¹ from which we obtained the ranking information has no relevant data for it.

Website               Rank
Google (.nl & .com)   1 & 2
Facebook              3
YouTube               4
LinkedIn              5
Twitter               13
Flickr                57
Hyves                 79
Instagram             38
MySpace               478
Foursquare            1649

Table 1: Ranking of Social Media Web sites in the Netherlands according to the Web Information Company Alexa¹

The information that a user shares through each social media platform can be categorized by the privacy level that the user sets. First, the data can be public, meaning that anyone can access the user's profile through the Internet. Second, the profile can be open only to users of the same social media platform. Finally, the user can share her personal and professional life only

within her network. That makes the extraction of good textual features, and the disambiguation, even more challenging.

It is important for this research to understand how each social network handles the information that a user shares on the platform. Each platform has different default privacy settings, which we have to take into consideration when collecting the features that will support the methods used in the disambiguation phase. Some settings can be changed to a less or more restrictive privacy level; others cannot be changed, which usually leaves the information public either to the users of the social network or to all Internet users. The default privacy settings of the features that we are interested in extracting from each social network can be found in the Appendix, Tables [16-22].

Finding a specific person through Online Social Networks can be really difficult when the username of that person is not known. The username can be the real name, but it can also be a nickname, a diminutive of the real name or even a fake name. Some social networks, like LinkedIn and Facebook, prompt users to create an account with their real personal information so that their friends can find them more easily. Twitter, on the other hand, prompts new members to use a nickname in addition to their real name.

In this paper we present a detailed approach to gathering this type of information and try to achieve better performance in the disambiguation phase than previously published methods. The choice of Online Social Networks was based on their popularity, measured by the number of users of each network and by their ranking among other web sites in Web 2.0¹ in the Netherlands. We settled on the following seven social networks: Facebook, LinkedIn, Twitter, Hyves, Google+, Flickr and MySpace.
Detailed information on the popularity of each Online Social Network and on the information shared in it can be found in the Appendix.

1.2. Name disambiguation

Being able to identify certain people on the Internet from the information that they release publicly is a challenging and interesting task. The ability to point automatically to certain profiles on different social media platforms can be helpful not only for marketing campaigns and the analysis of user behavior but also for investigations and for detecting fraud in banks. The data collected from different Online Social Networks can be structured or unstructured, so processing is needed before trying to match the profiles.

These reasons why someone might want to match Online Social Network accounts automatically bring to light another important aspect: privacy. Although this information is publicly available, users are not necessarily willing to share it with researchers, and using and integrating their data from different sources is a sensitive topic. For this reason we decided to ask volunteers to participate in our research. Eighty people from the same educational background gave us permission to download and process their personal publicly available data from seven different Online Social Networks.

From these social networks we downloaded all the information related to the volunteers' names, using a crawler to retrieve the data. Before trying different methods for matching the profiles, preprocessing is needed: knowing the structure of an OSN, certain features can be extracted, and in this process all irrelevant data are excluded. The last step is to integrate the profiles that belong to the volunteer whose name we initially queried, using more specific knowledge about that person, such as the city of residence, the company where she works and the year of birth. For matching the profiles we mainly use textual information, and we use the profile picture as a strong indication that two profiles belong to the same person. After the profiles have been matched, the results can be used in different applications. This procedure is visualized in Figure 1.

In our work we do not use a single search engine to collect the results; instead we query the search engine of each of the social media platforms mentioned before. Moreover, as stated in the terms of each Online Social Network, each user may have only one account. We therefore expect each profile retrieved from an Online Social Network to belong to a distinct individual.

1.3. Research questions

The purpose of our research can be summarized in the following question:

Can we develop a model that retrieves and identifies a target person in different Online Social Networks, matching profiles that belong to the same person without knowing her username on the different Online Social Networks, using only the limited textual representation of a person in the virtual world?

To answer this question we first have to answer the following questions:

- What data are available in each Online Social Network?
- Which are the most informative features for matching a profile to the target person?
- Can we create new queries for the search engine of an Online Social Network from the extracted textual features that will return profiles of the target person?

[Figure 1: Visualization of research. Stages shown: World Wide Web, Web crawling, Pre processing, Profile matching.]

2. Related work

In this section we discuss previous work on name disambiguation and profile matching, and how the problem has been approached in different fields of research, such as engineering and information retrieval. In section 2.1 we review the WePS campaigns, in section 2.2 we review approaches that have been used outside the WePS campaigns, and in section 2.3 we take a brief look at how profile matching and name disambiguation are approached in engineering.

2.1. WePS

With the growth of social media platforms, the appearance of individuals in web search results grows as well. As a consequence, finding an individual in the list of results becomes more and more difficult, because more than one person may share the same name. In the Netherlands, family names are shared among 16 million Netherlanders, as stated by the Meertens Institute². People name disambiguation, the task of finding all the web results that belong to the same person, has been at the center of research in the last few years. Although there is not much relevant work on name disambiguation using only social media results as input, there has been a lot of research on people name disambiguation with results obtained from search engines, including social and non-social documents. Three campaigns [2] [3] [4] have taken place in the last five years, with the clustering of ambiguous name entities and the extraction of good textual features as main tasks. Sixteen, nineteen and thirteen research teams took part in the respective campaigns.

WePS-I [2] tries to estimate the number of namesakes for a given person name and to group documents referring to the same individual. The work of [5] extended the token-based information to a web corpus and also used noun phrase information as features.
After preprocessing the retrieved web pages, a variety of features were extracted for the disambiguation phase: token-based features (local tokens, full tokens, URL tokens and title tokens in the root page) and phrase-based features (noun phrases and named entities). Their clustering method was a hierarchical agglomerative algorithm with single linkage, and the similarity matrix was computed with the standard SoftTFIDF measure [6]. For the evaluation they used the Purity and Inverse Purity metrics, and they achieved the highest performance among the participants of SemEval-2007 [2]. [7] combined a variety of features extracted from the pages, such as tokens, named entities, hostnames, domains and URLs, as input to hierarchical agglomerative clustering with single linkage, and conducted a set of experiments on the different data sets (ECDL, Wikipedia and Census) available from [2]. The results showed that Named

Entities achieve the best performance among the feature combinations that were tried. [8] extracted a number of HTML-based features, which allowed them to run Agglomerative Vector Space Clustering, compare the features of each document and cluster documents together according to a minimum number of similar pairs. A semi-supervised method for the clustering task was applied in [9]; as a seed node they used the corresponding Wikipedia page or, if none existed for a certain name, the top-ranked Web pages. Each time two pages were merged, the centroid of the cluster was recalculated in order to control the fluctuation. [10] presents a combination of a supervised and an unsupervised method for the name disambiguation task. They used classification and clustering algorithms, and the best results were merged according to the intersection between the initial set and the results obtained from the algorithms. The features they used were biographic facts, such as date and place of birth, URLs, a list of weighted keywords and metadata about the web page. A different approach to the WePS task was introduced by [11]; they used language modeling tools and different representations of each document (snippet, title and body text) and experimented with Single Pass Clustering and Latent Semantic Analysis.

WePS-II [3] focuses on two tasks: personal name disambiguation and attribute extraction. [12] extracted features beyond the web corpus and achieved the highest performance in the disambiguation task. The features they used are tokens appearing in the same sentence as the ambiguous name, tokens appearing in a given webpage, URL tokens and title tokens in the root pages, more or less the same as in the work of [5]. They showed that the robustness of a disambiguation system can be improved effectively by collecting features from broader resources than the Web corpus alone.
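Several of the systems surveyed here cluster pages with a hierarchical agglomerative algorithm and single linkage. As a rough illustration only, and not a reconstruction of any cited system, the idea can be sketched as follows. The function name, the stopping threshold and the toy similarity matrix are assumptions made for this example; the cited works compute their similarity matrices with measures such as SoftTFIDF.

```python
def single_link_hac(similarity, threshold):
    """Cluster items 0..n-1 given a symmetric similarity matrix.

    Repeatedly merges the two clusters whose most similar pair of members
    (single linkage) has the highest similarity, and stops as soon as no
    pair of clusters reaches the threshold.
    """
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > 1:
        best_score, best_pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster similarity is the maximum over member pairs.
                score = max(similarity[i][j] for i in clusters[a] for j in clusters[b])
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if best_pair is None or best_score < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]


# Toy similarity matrix for four pages (hypothetical values).
toy_similarity = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
clusters = single_link_hac(toy_similarity, threshold=0.5)
```

With this toy matrix and a threshold of 0.5, pages 0 and 1 end up in one cluster and pages 2 and 3 in another. The threshold plays the role of the stopping criterion that decides how many namesakes are distinguished.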
The results of [13] indicate that hierarchical agglomerative clustering outperforms the other methods; they also examined the impact that preprocessing of the HTML page has on clustering. In their method they converted the HTML pages and tried two systems, one with stemming and one without. For the clustering they used single pass clustering and probabilistic latent analysis. [14] developed a two-stage clustering algorithm in which the results of the first run of the hierarchical agglomerative algorithm are used to extract features (named entities, compound keyword features, URLs) for the second-stage clustering. In that way they improved the low recall of the first-stage clustering. Professional categorization is the method that [15] used for the name disambiguation task. They categorized the namesakes by a professional taxonomy, which was extracted from Freebase and from evidence pages in which the name and the profession appear in the same sentence. Each web page was represented as a vector of features, such as tokens, URL tokens, snippet tokens and named entity tokens. Finally, a kNN classifier was used to disambiguate the namesakes and create clusters represented by the name and the profession. The feature set used by [16] was all unigrams by sentence and paragraph, combined with the title and URL; they calculated the similarity between the documents with six different measures: cosine

similarity, Euclidean distance, Jaccard coefficient, Manhattan distance, weighted sum of common unigrams and Jaro similarity. For the machine-learned model a support vector machine with a polynomial kernel was used. [17] used a web search engine as an external data source for the disambiguation task. Named entity features were extracted from each web page and a TF-IDF similarity was computed for the clustering procedure. A new query would consist of the initially ambiguous name and a category of named entities found in the web page. The resulting pages were then clustered with a single-link hierarchical agglomerative algorithm.

WePS-III [4] is the last of the three campaigns that addressed the name disambiguation and attribute extraction tasks. In this campaign the two tasks were merged into one: systems must return both the documents and the attributes that belong to each namesake. The method of [18] outperformed the existing ones on the task of name disambiguation by extending the bag-of-words features with Wikipedia concepts. As a similarity measure they used a model that weights the query relevance and the content relevance of each feature in the vector. They evaluated their model with different features and similarity measures (cosine similarity and overlap), and in all cases the combination of bag-of-words and Wikipedia concepts gave the best results. The AXIS research team [19] used the Web graph structure for person name disambiguation, achieving relatively high precision. Their system works as follows: person pages that share related pages are first clustered together, which they call Web structure clustering; for the remaining pages they create frequency vectors of the terms contained in each page and run the hierarchical agglomerative algorithm. Their results show that the system is competitive with the other systems that took part in the third campaign.
Finally, the third-ranked team [20] of this campaign compared the results obtained from three different clustering algorithms (Lingo, HAC and 2-step HAC), with HAC performing best. As they state, the pre-processing procedures play an important role in the clustering results: cleaning the noise from HTML tags and applying NLP processing improve the clustering.

2.2. Name disambiguation above WePS campaign

In the WePS campaigns the input data for the disambiguation phase were obtained by querying web search engines and collecting the first one hundred results. These web pages included both social web pages and non-social web pages, such as biographies. In our work we want to focus only on social pages, that is, pages from Online Social Networks. [21] used different approaches for social and non-social pages and merged the results at the end. Of the clustering methods tried (one-in-one, all-in-one and hierarchical agglomerative clustering), the last one performed best. For the social pages they used additional methods such as co-clicks, clicks in the same burst and cross links. At

the end they merged the results with two methods. In the first method they simply take the union of the two clusters. In the second method they use a similarity threshold and also penalize clusters that contain a social media page. One more team [22] worked on the person disambiguation problem using only social media pages. All persons in their data had an organization in common, which was given as input and helped in the disambiguation procedure: they knew a priori that every participant was a student of a certain university, and connections between the participants could help the disambiguation phase. Their work is divided into two phases, the discovery phase and the disambiguation phase. In the discovery phase they try to find as many profiles as possible that can be candidates for the disambiguation phase. To find possible usernames, Rapportive and Google's API are used, and in each step they query the web search engine with the new usernames. After collecting all the possible profiles for each candidate name, they start the disambiguation. The heuristics that they use are a combination of keyword matching, community structure analysis and extraction of semantic and feature data from profiles. A vector space model for document co-reference was used by [23]. In their approach they used summaries of the documents to create the term vectors and experimented with different scoring methods, namely MUC co-reference scoring and the B-Cubed scoring algorithm; two names are predicted to refer to the same person if their similarity score is above a threshold. [24] extended the vector space representation by using biographic details, such as birth day and year and occupation, as features. For the person name disambiguation task, [25] tried a different approach, using graph walk methods and re-ranking the outputs based on graph-walk features.
More precisely, they represented different types of real-world entities, such as file, person, address, term and date, as nodes. Correspondingly, they defined edge types, such as sent-to, sent-from and includes-term, which denote a direct connection between two nodes. The similarity between two nodes is calculated by a lazy walk process controlled by a parameter θ. In their walk method they created vectors with the weight of each node and started the graph walk from the term column, which propagates to the person node where the specific term appears most frequently. [26] introduced a two-phase method for identity disambiguation that uses social circles to detect Sybil and non-Sybil users. The two phases of ranking are the Static phase, which is based on the initial parameters given by the users and on the results obtained from the web crawler, which are then verified by the Matcher and the Score Generator, and the Dynamic phase, which changes the values of the first phase according to the polling results from each user; for instance, accepting or rejecting a friend request answers the question of whether or not this person is fake. Re-calculating the rank of each user is based on the

acceptance or rejection of the friend request and the positive or negative vote regarding the genuineness of the requester. [27] presents a semi-supervised approach for identity disambiguation of Web references using Web 2.0 data and semantics. First, an RDF model is generated for the digital identity of a person within a specific platform. In the second stage they create metadata models from the results obtained by querying the Semantic Web and the World Wide Web with a certain person's name. Finally, the seed data are represented in a Linked Data social graph. For the disambiguation phase two methods were used: a semi-supervised machine learning technique, and a graph-based technique with random walks that uses agglomerative clustering to cluster the instances that are closest to the social graph. The approach of [28] to the name disambiguation problem was more general and not designed only for this specific task. Their focus is on the different strategies that can be used within hierarchical agglomerative clustering. Among their findings are that using the similarity between the closest documents is successful, that stemming improves the results and that a term window gives high B-cubed results. [29] shows how to match profiles on social media networks that belong to the same person. Their work is based on a person's list of friends in two different social networks, Facebook and Twitter. They created the social graph of each profile and tried to project each node onto the other graph; they then used a joint link-attribute model in which connections between nodes are created based on the similarity of profile fields.
More precisely, they used a probabilistic model to find the optimal configuration of profile projections.

2.3. Matching profiles and social engineering

Matching profiles from different social media platforms has been an intriguing subject in social engineering in recent years. The research project in [30] set out to show that it is possible to match the Facebook and LinkedIn profiles of the employees of a certain company. They used several techniques for collecting data and matching profiles: using only the publicly available data, using the friends lists of profiles already recognized as matches, and finally creating zombie profiles, fake accounts with the sole purpose of making as many connections with other users as possible in order to gain access to their private information. The purpose of the project was to show that a company's employee hierarchy can be reconstructed just by matching profiles on different Online Social Networks, and to show the consequences this would have for the company.

2.4. Linkage Records

Our problem of matching profiles based on certain features is similar to the record linkage problem. Record linkage is the task of accurately finding and linking records from different sources that refer to the same entity; it has applications in customer systems for marketing, fraud detection, data warehousing and government administration [31]. More precisely, record linkage is used for data cleaning, for removing duplicates and for merging two or more datasets into one. Record linkage has two main problems: accuracy, and the requirement that records contain common identifying information [32]. In an ideal world each record would have one unique identifier, but when we try to match records from different databases this identifier may be a random number created by each system. The same holds in our case: each member of an OSN has a unique identifier for that OSN but a different one for another OSN.

Two distinct methodologies have been used to solve the data linkage problem: deterministic and probabilistic methods. Deterministic methods require an exact character-by-character match and work better when there are several representative identifiers [33]. Probabilistic linkage, on the other hand, is a supervised method in which weights are estimated from the observed agreements and disagreements of the data values [34]. Record pairs with a probability above a certain threshold are a match; all others are a non-match. Our work is closest to the probabilistic linkage approach, with the difference that the fields available as features are not the same across the different sets, that is, across the Online Social Networks. In other words, some fields may be present in two or three Online Social Networks but not in the rest. This makes the integration of the records a more challenging task.
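To make the probabilistic linkage idea concrete, the following minimal sketch scores a pair of profiles with agreement and disagreement weights and skips fields that are missing on one of the networks. It is a generic illustration and not the model used later in this thesis. The field names, the m and u probabilities and the threshold are all hypothetical values chosen for the example, and the score is a sum of log-likelihood-ratio weights rather than a calibrated probability.

```python
import math

# Hypothetical per-field probabilities, for illustration only:
# m = P(field agrees | same person), u = P(field agrees | different persons).
FIELD_PROBS = {
    "city": (0.90, 0.10),
    "employer": (0.80, 0.02),
    "birth_year": (0.95, 0.03),
}


def match_weight(profile_a, profile_b, field_probs=FIELD_PROBS):
    """Sum agreement/disagreement weights over fields present in both profiles."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        value_a, value_b = profile_a.get(field), profile_b.get(field)
        if value_a is None or value_b is None:
            continue  # Field missing on one network: it contributes no evidence.
        if value_a == value_b:
            total += math.log2(m / u)  # Agreement weight (positive).
        else:
            total += math.log2((1 - m) / (1 - u))  # Disagreement weight (negative).
    return total


def is_match(profile_a, profile_b, threshold=3.0):
    """Declare a match when the summed weight reaches the (assumed) threshold."""
    return match_weight(profile_a, profile_b) >= threshold


# Toy profiles; the values are invented for the example.
linkedin_profile = {"city": "Amsterdam", "employer": "ACME", "birth_year": 1985}
facebook_profile = {"city": "Amsterdam", "employer": "ACME"}  # no birth year shown
other_profile = {"city": "Utrecht", "birth_year": 1970}
```

In this toy setting the first two profiles agree on both shared fields and are linked, while the third disagrees on city and birth year and is rejected. Skipping missing fields is one simple way to handle networks that expose different sets of fields.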

3. Data

When collecting the data from the different Online Social Networks we faced some legal issues: which data are considered private, which are considered public, and to what extent we can download and process them anonymously. As shown in the Appendix, different Online Social Network platforms have different default settings for which data are shared publicly, which remain private and which are shared among the connections of the user. The privacy of data published on the Web has been discussed extensively in the last decade, since privacy regulations vary between countries. As mentioned in [35], European law allows the analysis of data collected from the Web as long as the records are identified by a pseudonym. US law states that the record holder can assign a code to a record so that it can be identified.

3.1. Privacy

[35] raises a very important question: how can we define privacy in an environment that cannot be controlled, like the World Wide Web? Moreover, what one user considers private may not be private for another user. For example, a user might share a photo of himself and some of his friends on an online social platform, while one of the friends considers that photo private and does not want it to be shared publicly; as stated in [36], what poses a threat to users' privacy is their links with friends. The same paper shows that being a member of a group in a Social Media Network can reveal information that is private in the user's own profile. One sentence that summarizes [35] is: Privacy is not only about hiding certain information, but also about controlling information and its uses (e.g., by constructing different identities), and is finally a dynamic practice involving negotiations and tradeoffs between hiding and disclosing/sharing.
Because of the uncertainty surrounding the laws on which data are considered private, and on whether it is legal to download and experiment with publicly available data, we decided to ask people for their permission to search for and download their publicly available data from the different Online Social Networks. We keep their data private and do not expose them in this report.

Data Collection

Each Online Social Network is built differently and uses different technologies to present the data that each user decides to upload. The full collection of data was obtained in multiple steps, which are shown in Figure 2. To download the data we called the WGET command from Python scripts. Only for Twitter did we make full use of the API offered by the Online Social Network. The main reason we did not use the APIs of the other OSNs is that the information we could get was very limited, because our account in each of these networks was friend-free, meaning that we did not have any contacts. To return more details about a certain user, the APIs need authentication by that user. For Google Plus and Hyves we used the API offered by the network only to query it and get the URLs of all the users who share a given name.

[Figure 2, flow chart boxes: Names of volunteers; Create all diminutives of the official name; Query and download data from the Online Social Networks; Process the .html pages to extract the features; Process data to find possible nicknames; Structured documents per profile; Assess against criteria (Yes / No); Clustering / Disambiguation algorithm; Search & download if a profile exists with the certain user-id / nickname]

Figure 2: Process followed to download the data set and extract features
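The first boxes of Figure 2, creating the name variants and turning them into queries, can be sketched as follows. This is a minimal illustration under stated assumptions: the `DIMINUTIVES` table is a hypothetical stand-in for the Meertens Institute name database, the function names are invented for the example, and the only network treated as order-sensitive is Facebook, for which the text reports that the reversed query may return different results.

```python
# Hypothetical diminutive table; a stand-in for the Meertens Institute
# database of Dutch names, which is not reproduced here.
DIMINUTIVES = {
    "johannes": ["jan", "hans", "johan"],
    "maria": ["marie", "ria"],
}

# Networks where the word order of the query may change the results.
ORDER_SENSITIVE = {"facebook"}


def first_name_variants(formal_first_name, table=DIMINUTIVES):
    """Return every part of a (possibly compound) first name plus its diminutives."""
    variants = []
    for part in formal_first_name.lower().split():
        for name in [part] + table.get(part, []):
            if name not in variants:
                variants.append(name)
    return variants


def build_queries(formal_first_name, last_name, network):
    """Build the 'first last' queries, adding 'last first' where order matters."""
    last = last_name.lower()
    queries = []
    for first in first_name_variants(formal_first_name):
        queries.append(f"{first} {last}")
        if network in ORDER_SENSITIVE:
            queries.append(f"{last} {first}")
    return queries
```

For a compound name such as "Johannes Maria", every part and every diminutive of every part yields its own query, so the number of downloads grows quickly; this is one reason for filtering the returned profiles afterwards.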

As shown in Figure 2, we first created all the diminutives that can be derived from the formal first name, using the database of Dutch names of the Meertens Institute. A formal name might be a combination of more than one name; we created all possible diminutives for each of these names. After deriving all the diminutives of a volunteer's official first name, we ran our algorithm to download all profiles from each Online Social Network. The second step is to obtain all nicknames that can be found and to search for new profiles. We continue in the same way until we have searched all the possible names derived from the formal name of the user. The order of the last name and the first name in the query makes no difference on Twitter, LinkedIn, Google Plus, Hyves, Flickr and MySpace: both combinations return the same results. Only on Facebook might the order of the words in the query return different results. To avoid losing valuable information we decided to run the reversed query as well.

First run

On the first run we collect all the profiles returned by querying the casual first name or a diminutive together with the last name of a person. In the returned profiles we look for a nickname that could possibly help us find more profiles of the same person on other Online Social Networks. The only network that we do not query again is LinkedIn, since its users are not allowed to use nicknames and can only use their first and last name.

Nickname extraction

To decide whether a user-name or user-id is a nickname, we retrieved the real name and the user-name available on the user's page. The most common way for a system to create a user-id (Table 2) is either to use a unique random number or to combine the given real name of the user with a unique number.
In that case the system usually separates the first name from the last name with a dot and, if the resulting user-id is already in use, adds a dot and a number at the end. For that reason we discarded the symbols that commonly appear in nicknames or user-ids, such as underscores, minus signs, dots and spaces, by removing them from the string. We then calculated the longest common substring of the real name and the user-id. Only if the difference between the length of the actual name and the length of the longest common substring was beyond a certain threshold did we calculate the Levenshtein distance, which is also known as the edit distance. Only if the resulting ratio was below a certain threshold did we add the nickname to our list, in order to search whether a profile with it exists on the other Online Social Networks.
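The nickname test described above can be sketched roughly as follows. The two thresholds (`lcs_gap` and `max_similarity`) and the exact definition of the ratio are assumptions, since the text does not state them; here the ratio is taken to be one minus the Levenshtein distance divided by the length of the longer string.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and drop underscores, minus signs, dots and whitespace."""
    return re.sub(r"[\s._\-]+", "", text.lower())


def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def is_nickname(real_name, user_id, lcs_gap=3, max_similarity=0.6):
    """Return True when user_id looks unrelated to real_name (assumed thresholds)."""
    name, uid = normalize(real_name), normalize(user_id)
    if not name or not uid:
        return False
    matcher = SequenceMatcher(None, name, uid, autojunk=False)
    lcs = matcher.find_longest_match(0, len(name), 0, len(uid)).size
    if len(name) - lcs <= lcs_gap:
        return False  # the user-id essentially contains the real name
    similarity = 1 - levenshtein(name, uid) / max(len(name), len(uid))
    return similarity < max_similarity
```

A system-generated id such as "jan.jansen.3" is rejected at the longest-common-substring step, so the more expensive edit distance is only computed for ids that do not obviously contain the name.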

OSN           User-id form            Ascribed by
Facebook      firstname.lastname.#    System's ascription, with the possibility to change by user
LinkedIn      Unique number           System's ascription
Twitter       Screen name             User's selection
Google Plus   Unique number           System's ascription
Hyves         Username                User's selection
Flickr        Unique number@n#        System's ascription, with the possibility to change by user
MySpace       Unique number           System's ascription, with the possibility to change by user

Table 2: ID ascription for each OSN

In Facebook, Flickr, Hyves and MySpace, the comparison used to obtain a possible nickname is between the name of the user and the user-id. On Google Plus we try to find a nickname through the URLs of YouTube and/or Picasa and through the "other names" field. If these are available, we again compare the username extracted from the URL with the actual name of the user, as given on the user's personal web page. LinkedIn users do not use nicknames because of the professional nature of the network. What can be extracted from the LinkedIn data, however, is the Twitter name of a user, if it is available and visible to the public. LinkedIn also returns profiles with no name; these are LinkedIn Members who have chosen not to make their profile publicly available. In this case we use the query name as the name of the profile. On Twitter we compare the screen name with the actual given name of the user, and only if they differ according to the same criteria that we used for the previously mentioned networks do we add the screen name to the list of nicknames.

Find profiles based on the extracted nicknames

We do not run a new query for the entries in the list of nicknames, because we are not interested in finding derivations of a nickname; that would return too many irrelevant results. What we are interested in is whether a user profile exists that has the specific nickname as its user-id. For that reason we download with WGET only the pages that do not return a 404 (page not found) error.
For the same reason we do not search the returned pages again for new nicknames: we assume that the previously extracted nickname already gives us the possible additional user profiles that we might have missed.
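The existence check for a nickname can be sketched as below. The thesis calls WGET for this step; the sketch swaps in Python's standard `urllib` so that it is self-contained. The URL templates are hypothetical placeholders (the real profile URL patterns are not given in the text), and the `status_of` parameter is an assumption added so that the lookup can be replaced by a stub.

```python
import urllib.error
import urllib.request

# Hypothetical URL templates, for illustration only.
PROFILE_URL_TEMPLATES = {
    "twitter": "https://twitter.example/{user_id}",
    "flickr": "https://flickr.example/people/{user_id}",
}


def http_status(url, timeout=10):
    """Return the HTTP status code for url (stand-in for the WGET call).

    Network failures other than HTTP error responses are not handled here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def existing_profiles(nicknames, templates=PROFILE_URL_TEMPLATES,
                      status_of=http_status):
    """Return (network, url) pairs whose page does not answer with 404."""
    found = []
    for network, template in sorted(templates.items()):
        for nickname in nicknames:
            url = template.format(user_id=nickname)
            if status_of(url) != 404:
                found.append((network, url))
    return found
```

Note that the pages found this way are not searched again for further nicknames, in line with the choice described above.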

Many of the results returned by the queries might not be of interest to us. We do not know the similarity functions that each Online Social Network uses, but many of the results returned by Facebook, for example, may have a totally different last name, with only the first name being the same. This usually happens for uncommon combinations of first names and last names. The same can be the case on Twitter, since for a given query it searches not only the given name but also the screen name and returns possible matches.

Pre Processing

From each Online Social Network we decided to download certain categories of data that help us match the existing profiles with the specific person we are looking for. After obtaining the pages with all the existing information, we had to work with the different structures of the OSNs to find out how to extract only the data that we can use in the disambiguation phase. Table 4 shows which specific information was extracted from each OSN. We present the fields under their original names in each OSN and have grouped them by category (name, city of origin, date of birth etc.) by assigning a different color to each category. Table 23 in the Appendix gives the whole set of features that were extracted from each OSN; the ones left uncolored are those that we did not use as features in the end, either because the information is not a good feature for matching profiles or because it is available in only one OSN and not in the others.

Retrieved data set

The information downloaded from each network was limited to the sub-pages of the static profile of the user. By static profile we refer to the information that might stay unchanged for a certain period of time, namely the personal information that a user shares publicly.
By dynamic information, on the other hand, we refer to data that an active user might change daily, such as the posts on her wall, which is the usual name for the place where users post events and thoughts from their everyday life. We are interested only in the static profile of a user. Moreover, we decided to store only the profile picture of each user and not all the albums that might be available online, since the profile picture is usually the most representative picture of an individual. Finally, we retrieved the friends list, under the assumption that the same person might have a number of the same contacts in more than one OSN. Table 3 lists the pages that we are interested in for each Online Social Network.

OSN           Sub-pages
Facebook      Domain_url/about; Domain_url/friend; Domain_url/profile picture; Domain_url/cover phot
LinkedIn      Domain_url/profile; Domain_url/people also viewed; Domain_url/profile picture
Twitter       details returned from API; Domain_url/following; Domain_url/followers
Hyves         Domain_url/profile; Domain_url/friends; Domain_url/profile picture
Google Plus   Domain_url/about; Domain_url/have circles; Domain_url/in circles; Domain_url/profile picture
Flickr        Domain_url/profile; Domain_url/contacts; Domain_url/profile picture
MySpace       Domain_url/profile; Domain_url/friend; Domain_url/profile picture

Table 3: Sub-pages extracted from each Online Social Network

Extraction of features

As we have already mentioned, the diversity of the information available in each Online Social Network depends on the purpose of the network (whether it covers personal or professional life events) and on the settings that each user selects for how her information is shared with users and non-users of the OSN. For this reason, in order to extract the features that help us in the disambiguation task, we took the default settings of each platform as a guideline: we extracted only the information that is publicly available by default, where it existed (detailed tables of the default settings per OSN can be found in the Appendix, Table 14 to Table 20). Table 4 lists all the features per OSN. The different colors of the cells indicate the possible matches. To extract these features we used different techniques per OSN, depending on the source code of the platform. We mainly used Beautiful Soup, a Python library that parses HTML and XML documents, since this is the main format of our data sources. In some cases, when the data we wanted to extract were inside JavaScript code and Beautiful Soup could not recognize them, we used regular expressions. In this way we ended up with structured document files per user of each Online Social Network, in which all the existing features were listed.
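The two extraction routes described above, parsing the HTML and falling back to a regular expression for data embedded in JavaScript, can be illustrated with the following sketch. To keep it self-contained it uses the standard-library `html.parser` in place of Beautiful Soup. The class names, the `var profile = {...};` pattern and the assumption that the embedded object is valid JSON are all hypothetical; real pages differ per network.

```python
import json
import re
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the first text found inside elements whose class names a field."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        hits = self.wanted.intersection(classes)
        if hits:
            self.current = sorted(hits)[0]

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields.setdefault(self.current, data.strip())
            self.current = None

    def handle_endtag(self, tag):
        self.current = None  # an empty element yields no value


def extract_profile(html, wanted=("name", "hometown")):
    """Return a flat dict of features found in markup and in embedded script data."""
    parser = FieldParser(wanted)
    parser.feed(html)
    fields = dict(parser.fields)
    # Fallback for data that only appears inside a script block.
    match = re.search(r"var\s+profile\s*=\s*(\{.*?\});", html, re.DOTALL)
    if match:
        for key, value in json.loads(match.group(1)).items():
            fields.setdefault(key, value)  # markup values take precedence
    return fields
```

The result is one flat dictionary per profile, which corresponds to the structured document per user mentioned above.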

Columns: Facebook, LinkedIn, Twitter, Google Plus, Hyves, Flickr, MySpace

Name Name Screen name Name Name Nickname Name Nickname Headline Name Works at Nickname Name name (info) About Location Locations Attends Relationship status Hometown age (info) Employers Industry Description Lives in Living Currently current (info) city Graduate School Twitter Status Gender About me I am Nickname High School Working Id Looking for Birthday Occupation Zodiac Sign College Title of work Statuses count Birthday Age Website Status Birthday School name Replies Id Relationship Companies Gender Here for Gender Work duration Screen reply Other names Weblog Groups Hometown Interested In Languages Urls Url Religion Joined Orientation Relationship Status Groups Current City Hometown About me Religious Views Interests Cities Lived Study/werk Status Website Occupation Relatie Education Current City Employment What's on my mind Schools Hometown Education Website Major Anniversary Introduction Schools Minor Languages Colleges Companies Political Views Bragging rights Religion

Table 4: Features extracted from each profile page for each Online Social Network. Different colors for possible matches of information in the different platforms

3.4. Ground Truth Data Set

For this research project we asked people to volunteer and give us their permission to download their publicly available data. We first ran our model, which is described later in this thesis, in an unsupervised manner to get an idea of how it works; we then asked the volunteers to evaluate the results by pointing us to their true profiles in each Online Social Network. Of the eighty volunteers that we gathered at the beginning of the project only seventy replied; these compose our ground truth data set.

Training Data Set

To find the parameters of our model we needed a training data set. We construct it by computing the per-feature distances between profiles that belong to the same person and assigning these to class one, and by computing the per-feature distances between profiles that do not belong to the same person and assigning those to class zero. Each feature vector thus contains the per-feature distances of one pair of profiles; pairs of profiles of the same person form the positive class and pairs of profiles of different persons form the negative class. Table 5 shows, per instance, the number of positive and negative profiles found by querying all the OSNs for the training data set. We had to deal with two main problems:

1. As we have already mentioned, almost no profile contains all the extracted information per feature. This means that the number of missing values is relatively large. In these cases we assign the value zero (0), as if the two profiles were not similar at all on that feature, both when only one of the two values exists and when neither exists.
2. Our training data set is imbalanced, since the number of positive instances is small compared to the number of negative instances. This leads to the misclassification of elements of the different classes and has a negative effect on the classification performance of machine learning algorithms.

To overcome these two problems we decided to follow well-known techniques from the literature. For missing values, different approaches have been proposed in the literature; we discuss this in more detail in a later section.

[Table 5 columns: Volunteer instance (Vn); Matching profiles (positive); Non-matching profiles (negative)]

Table 5: Positive and negative instances for training data set
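The construction of the training set can be illustrated with the sketch below. It is a simplified reading of the description above, under several assumptions: each profile comes with a known owner id, a token-level Jaccard score stands in for the per-feature measures, and the feature names are invented for the example. Missing values receive the score zero, as stated in the text. The `class_weights` helper shows inverse-frequency weighting, which is one common way to counter class imbalance and is not necessarily the technique used in this work.

```python
from itertools import combinations

FEATURES = ("name", "city", "employer")  # hypothetical feature names


def token_jaccard(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def feature_vector(p, q, features=FEATURES):
    """Per-feature scores for a profile pair; a missing value scores 0.0."""
    vector = []
    for feature in features:
        a, b = p.get(feature), q.get(feature)
        vector.append(token_jaccard(a, b) if a and b else 0.0)
    return vector


def build_training_set(profiles):
    """profiles: list of (owner_id, feature_dict); returns (X, y) over all pairs."""
    X, y = [], []
    for (owner_a, a), (owner_b, b) in combinations(profiles, 2):
        X.append(feature_vector(a, b))
        y.append(1 if owner_a == owner_b else 0)  # class one = same person
    return X, y


def class_weights(y):
    """Inverse-frequency weights so that both classes count equally overall."""
    positives = sum(y)
    negatives = len(y) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    return {0: len(y) / (2 * negatives), 1: len(y) / (2 * positives)}
```

Because every pair of profiles is labelled, the negative pairs quickly outnumber the positive ones, which is exactly the imbalance described in the second problem.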

4. Disambiguation method

4.1. Problems to overcome

The main difficulty in disambiguating user profiles across different OSNs is the scarcity of data. On top of this, several other problems have to be considered beforehand in order to have a clear view of the matching problem. To begin with, each OSN ascribes its own user ID, so there is no universal ID shared by the profiles of the same person in different OSNs. The default IDs per OSN can be seen in Table 2. Only when the member selects the user ID herself, and under the assumption that she uses the same ID on different OSNs, can we conclude with some probability that two profiles might belong to the same person. Another fact that we have to take into consideration is that personal information is not given in the same way in each network. Moreover, even the same person might present her information in a different format in different Online Social Networks. For example, the current city of residence might be given in English on one OSN and in Dutch on another: The Hague and Den Haag, respectively. Additionally, even if someone presents the same personal information in different OSNs, spelling errors have to be taken into account, for example Amsterdam and Amterdam. Finally, since people do not use all the Online Social Networks with the same frequency, some data might be outdated or even missing. In this case, although two profiles might belong to the same person, the limited number of shared features would not allow us to conclude that the profiles match. Taking these problems into consideration, we approach the matching and disambiguation problem as a problem of matching textual information from three sources: the personal information that can be found on the profile pages of the Online Social Networks, the information that can be found in the social network of the user, and the profile picture of the user.
In this chapter we describe in more detail how our disambiguation method works, how we estimate the importance of each feature, and how we define the thresholds that we set for selecting the final set of profiles that belong to the target person.
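Two of the problems named above, language variants of the same city and spelling errors, can be illustrated with a small sketch. The alias table is a hypothetical example rather than a resource used in this work, and `difflib.SequenceMatcher.ratio` is used here only as a convenient stand-in for a string similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical alias table mapping known variants to one canonical spelling.
CITY_ALIASES = {
    "the hague": "den haag",
    "'s-gravenhage": "den haag",
}


def canonical_city(city):
    """Lower-case, collapse whitespace and map known variants to one form."""
    key = " ".join(city.lower().split())
    return CITY_ALIASES.get(key, key)


def city_similarity(a, b):
    """Similarity in [0, 1]; a missing value gives 0.0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, canonical_city(a), canonical_city(b)).ratio()
```

With this sketch "The Hague" and "Den Haag" are treated as identical, "Amsterdam" and "Amterdam" remain highly similar, and two different cities that merely share an ending, such as Amsterdam and Rotterdam, score clearly lower.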


More information

Facebook Tutorial. An Introduction to Today s Most Popular Online Community

Facebook Tutorial. An Introduction to Today s Most Popular Online Community Facebook Tutorial An Introduction to Today s Most Popular Online Community Introduction to Facebook Facebook is the most popular social network, in the U.S. and internationally. In October 2011, more than

More information

Lecture 8 May 7, Prabhakar Raghavan

Lecture 8 May 7, Prabhakar Raghavan Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of

More information

Kristina Lerman University of Southern California. This lecture is partly based on slides prepared by Anon Plangprasopchok

Kristina Lerman University of Southern California. This lecture is partly based on slides prepared by Anon Plangprasopchok Kristina Lerman University of Southern California This lecture is partly based on slides prepared by Anon Plangprasopchok Social Web is a platform for people to create, organize and share information Users

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Part 11: Collaborative Filtering. Francesco Ricci

Part 11: Collaborative Filtering. Francesco Ricci Part : Collaborative Filtering Francesco Ricci Content An example of a Collaborative Filtering system: MovieLens The collaborative filtering method n Similarity of users n Methods for building the rating

More information

You are Who You Know and How You Behave: Attribute Inference Attacks via Users Social Friends and Behaviors

You are Who You Know and How You Behave: Attribute Inference Attacks via Users Social Friends and Behaviors You are Who You Know and How You Behave: Attribute Inference Attacks via Users Social Friends and Behaviors Neil Zhenqiang Gong Iowa State University Bin Liu Rutgers University 25 th USENIX Security Symposium,

More information

SEO: SEARCH ENGINE OPTIMISATION

SEO: SEARCH ENGINE OPTIMISATION SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that

More information

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? In our experience, we find we can get over-excited when talking to clients or family or friends and sometimes we forget that not everyone

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA

CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA by Esra Atescelik In partial fulfillment of the requirements for the degree of Master of Computer Science Department of Computer Science VU University Amsterdam

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 3, March -2017 A Facebook Profile Based TV Shows and Movies Recommendation

More information

Unstructured Data. CS102 Winter 2019

Unstructured Data. CS102 Winter 2019 Winter 2019 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for patterns in data

More information

Open Source Software Recommendations Using Github

Open Source Software Recommendations Using Github This is the post print version of the article, which has been published in Lecture Notes in Computer Science vol. 11057, 2018. The final publication is available at Springer via https://doi.org/10.1007/978-3-030-00066-0_24

More information

Information Retrieval CSCI

Information Retrieval CSCI Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Social Networking in Action

Social Networking in Action Social Networking In Action 1 Social Networking in Action I. Facebook Friends Friends are people on Facebook whom you know, which can run the range from your immediate family to that person from high school

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

UCEAP Connect User Guide October 2017

UCEAP Connect User Guide October 2017 UCEAP Connect User Guide October 2017 1 P a g e Contents Introduction... 3 How to access the platform... 3 Registration... 3 Approval... 4 Using the platform... 4 Logging In... 4 Updating your profile...

More information

Plagiarism Detection Using FP-Growth Algorithm

Plagiarism Detection Using FP-Growth Algorithm Northeastern University NLP Project Report Plagiarism Detection Using FP-Growth Algorithm Varun Nandu (nandu.v@husky.neu.edu) Suraj Nair (nair.sur@husky.neu.edu) Supervised by Dr. Lu Wang December 10,

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Module 1: Internet Basics for Web Development (II)

Module 1: Internet Basics for Web Development (II) INTERNET & WEB APPLICATION DEVELOPMENT SWE 444 Fall Semester 2008-2009 (081) Module 1: Internet Basics for Web Development (II) Dr. El-Sayed El-Alfy Computer Science Department King Fahd University of

More information

Ambiguity Handling in Mobile-capable Social Networks

Ambiguity Handling in Mobile-capable Social Networks Ambiguity Handling in Mobile-capable Social Networks Péter Ekler Department of Automation and Applied Informatics Budapest University of Technology and Economics peter.ekler@aut.bme.hu Abstract. Today

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Assigning Vocation-Related Information to Person Clusters for Web People Search Results

Assigning Vocation-Related Information to Person Clusters for Web People Search Results Global Congress on Intelligent Systems Assigning Vocation-Related Information to Person Clusters for Web People Search Results Hiroshi Ueda 1) Harumi Murakami 2) Shoji Tatsumi 1) 1) Graduate School of

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS Nilam B. Lonkar 1, Dinesh B. Hanchate 2 Student of Computer Engineering, Pune University VPKBIET, Baramati, India Computer Engineering, Pune University VPKBIET,

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using

More information

How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects?

How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects? How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects? Saraj Singh Manes School of Computer Science Carleton University Ottawa, Canada sarajmanes@cmail.carleton.ca Olga

More information

A Generic Statistical Approach for Spam Detection in Online Social Networks

A Generic Statistical Approach for Spam Detection in Online Social Networks Final version of the accepted paper. Cite as: F. Ahmad and M. Abulaish, A Generic Statistical Approach for Spam Detection in Online Social Networks, Computer Communications, 36(10-11), Elsevier, pp. 1120-1129,

More information

Birkbeck (University of London)

Birkbeck (University of London) Birkbeck (University of London) MSc Examination for Internal Students Department of Computer Science and Information Systems Information Retrieval and Organisation (COIY64H7) Credit Value: 5 Date of Examination:

More information

Creating a Classifier for a Focused Web Crawler

Creating a Classifier for a Focused Web Crawler Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

Social Networking Applied

Social Networking Applied Social Networking Applied 1 I. Facebook Social Networking Applied Uses: An address book: Facebook users can share their current city, e-mail address, phone number, screen name, street address, and birthday

More information

Online Communication. Chat Rooms Instant Messaging Blogging Social Media

Online Communication.  Chat Rooms Instant Messaging Blogging Social Media Online Communication E-mail Chat Rooms Instant Messaging Blogging Social Media Advantages: Reduces cost of postage Fast and convenient Eliminates phone charges Disadvantages: May be difficult to understand

More information

WHAT IS GOOGLE+ AND WHY SHOULD I USE IT?

WHAT IS GOOGLE+ AND WHY SHOULD I USE IT? CHAPTER ONE WHAT IS GOOGLE+ AND WHY SHOULD I USE IT? In this chapter: + Discovering Why Google+ Is So Great + What Is the Difference between Google+ and Other Social Networks? + Does It Cost Money to Use

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

6 WAYS Google s First Page

6 WAYS Google s First Page 6 WAYS TO Google s First Page FREE EBOOK 2 CONTENTS 03 Intro 06 Search Engine Optimization 08 Search Engine Marketing 10 Start a Business Blog 12 Get Listed on Google Maps 15 Create Online Directory Listing

More information

Natural Language Processing with PoolParty

Natural Language Processing with PoolParty Natural Language Processing with PoolParty Table of Content Introduction to PoolParty 2 Resolving Language Problems 4 Key Features 5 Entity Extraction and Term Extraction 5 Shadow Concepts 6 Word Sense

More information

USING RECURRENT NEURAL NETWORKS FOR DUPLICATE DETECTION AND ENTITY LINKING

USING RECURRENT NEURAL NETWORKS FOR DUPLICATE DETECTION AND ENTITY LINKING USING RECURRENT NEURAL NETWORKS FOR DUPLICATE DETECTION AND ENTITY LINKING BRUNO MARTINS, RUI SANTOS, RICARDO CUSTÓDIO SEPTEMBER 20 TH, 2016 GOLOCAL WORKSHOP WHAT THIS TALK IS ABOUT Approximate string

More information

The Ultimate Social Media Setup Checklist. fans like your page before you can claim your custom URL, and it cannot be changed once you

The Ultimate Social Media Setup Checklist. fans like your page before you can claim your custom URL, and it cannot be changed once you Facebook Decide on your custom URL: The length can be between 5 50 characters. You must have 25 fans like your page before you can claim your custom URL, and it cannot be changed once you have originally

More information

LEARN IT 1. Digital Identity Management Community Platform

LEARN IT 1. Digital Identity Management Community Platform LEARN IT 1 Digital Identity Management Community Platform Note: This document is for Fox BBA in MIS majors (only). The instructions and software described below will not work for others. Please contact

More information

Privacy Policy. Last Updated: August 2017

Privacy Policy. Last Updated: August 2017 Privacy Policy Last Updated: August 2017 Here at ConsenSys we know how much you value privacy, and we realize that you care about what happens to the information you provide to us through our website,

More information