Ontological Topic Modeling to Extract Twitter users' Topics of Interest


Ounas Asfari, Lilia Hannachi, Fadila Bentayeb and Omar Boussaid

Abstract--Twitter, as the most notable micro-blogging service, has become a significant means by which people communicate with the world and describe their current activities, opinions and status in short text snippets. Tweets can be analyzed automatically to derive much potential information, such as interesting topics, social influence, prediction analysis and users' communities. In this paper, we describe an approach for modeling users' interests as topics extracted from their Tweets. The proposed approach differs from existing ones in that it combines LDA (Latent Dirichlet Allocation), used to extract topics from Tweets, with a taxonomy (in this case, ODP) as an external knowledge source. A semantic hierarchy is defined for each topic, which allows detecting common topics between users that would not have been detected with LDA alone. Thus, our aim is to derive from users' Tweets the high-level topics and the categories of topics. We show, in our experimentation, that the proposed model can extract the main topics and categories from users' Tweets. We also compute the distances between users based on their topics of interest.

Index Terms--Data mining, Ontology, Semantic processing, Topic Model, Tweets.

I. INTRODUCTION

Micro-blogging services such as Twitter have grown explosively in recent years. Twitter is an online social networking service that enables its users to send and read text-based posts of up to 140 characters, known as "Tweets". It enables its users to communicate with the world, share current activities, opinions and spontaneous ideas, and organize large communities of people. The service rapidly gained worldwide popularity, with over 300 million users as of 2011, generating over 300 million Tweets and handling over 1.6 billion search queries per day (http://twitter.com/about).
This fast evolution has led researchers to study the characteristics of Tweet content and to extract information such as opinions on a specific topic or users' topics of interest. Studies of Tweets have perspectives in many domains, such as friend recommendation, opinion analysis, users' topics, etc. However, the text of Tweets is generally noisy, unstructured data; it is nevertheless a rich data set to analyze, and users most likely try to pack substantial meaning into a short space. Thus, it is important to understand the information behind Tweets and to detect the topics they present. To detect these topics, many applications use topic models such as PLSA (Probabilistic Latent Semantic Analysis) [1] or LDA (Latent Dirichlet Allocation) [2], which try to detect the different latent topics by representing each one as a distribution over words. However, these models do not extract semantic concepts for the latent topics. Our contribution in this paper is to attach semantics to the word distributions produced by the topic model, in order to detect automatically the high-level topics presented by the Tweets. Thus, we can extract users' topics of interest by examining the terms they mention in their Tweets.

O. Asfari is with the ERIC Laboratory, University of Lyon, Bron 69676, France (telephone: 33 (04) 78 77 30 49, e-mail: ounas.asfari@univ-lyon.fr). L. Hannachi is with the LRDSI Laboratory, University of Blida, Blida, Algeria (e-mail: hannachi.lilia@yahoo.fr). F. Bentayeb is with the ERIC Laboratory, University of Lyon, Bron 69676, France (e-mail: fadila.bentayeb@univ-lyon.fr). O. Boussaid is with the ERIC Laboratory, University of Lyon, Bron 69676, France (e-mail: omar.boussaid@univ-lyon.fr). ISBN: 978-0-980367-5-8
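The "topic as a distribution over words" representation produced by PLSA/LDA can be made concrete with a minimal Python sketch; the topic, its words and their probabilities below are invented for illustration and are not taken from the paper's data.

```python
# Toy sketch (illustrative only): a latent topic from PLSA/LDA is just a
# probability distribution over words, with no semantic label attached.
topic = {"agriculture": 0.094, "farmland": 0.082, "wheat": 0.061,
         "farm": 0.020, "forestry": 0.019}

def top_words(word_dist, k=3):
    """Return the k highest-probability words of a topic."""
    ranked = sorted(word_dist.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

print(top_words(topic))  # ['agriculture', 'farmland', 'wheat']
```

Such a ranked word list is all that a standard topic model exposes; nothing in it says that these words belong to an "agriculture" concept, which is the gap the semantic layer proposed here fills.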
To achieve this goal, we propose an ontological topic clustering model based on the ODP (Open Directory Project, www.dmoz.org) taxonomy as an external knowledge source, in order to derive from users' Tweets the high-level topics and the categories of topics. Thus, unlike the works that apply topic models to Tweet data and represent topics only as word distributions, we attach semantics to each topic by constructing for it a multi-level semantic hierarchy. The use of the ODP taxonomy is motivated by the fact that topics discovered with a standard topic model like LDA are based only on statistical word distributions and do not account for semantic relationships. The relations between users, as defined in social network models, are based on the communication between them. Even when a friendship exists between two users, we cannot extract the interests they have in common. This is a problem when we look for communities of users related to the same topics in order to recommend information to them; that is why the study of micro-blog content has attracted much attention in recent years. Here, we propose to create relations between users based on their common topics, or their common topic categories, derived from their Tweets. For instance, consider a user who always writes Tweets in the sport domain, such as the following real-world Tweet: "Contracts for Top College Football Coaches Grow Complicated". Our proposed ontological topic model will automatically assign the topic football and the category Sport to this Tweet. Now, consider another user who writes the following real-world Tweet: "Barcelona wins 2-0 at Real Mallorca but Real Madrid return to form and smash Real Sociedad 5-1". The detected topic for this Tweet will also be football, although the word football is not mentioned in it. This shows the importance of adding a semantic layer to the classical topic model. We note that the two users treat the same topic, football; thus we can create or strengthen the relation between them.

The rest of this paper is organized as follows. Section II reviews related work. Section III presents in detail our proposed ontological topic model, which is composed of four steps for detecting the high-level topics in users' Tweets. Section IV describes the relationship between users and the high-level topics. Section V presents our experimentation, including the calculation of distances between users based on the detected high-level topics. Finally, Section VI concludes the paper and presents future work.

II. RELATED WORK

As mentioned previously, we propose an ontological topic model to discover Twitter users' topics. In this model, we use one general probabilistic topic model, LDA (Latent Dirichlet Allocation) [2], in which the text collection of a user's Tweets is represented as a distribution over topics, and each topic is represented as a distribution over words. The LDA topic model is used in many applications, such as text mining and, recently, Twitter data. Here, we present some works which apply topic models to Twitter. For instance, the researchers in [3] use the standard LDA topic model in micro-blogging environments in order to identify influential users; the measure proposed in their work is based on the number of Tweets. In [4], the authors empirically compare Tweet content with traditional news media using a new Twitter-LDA model. They consider each Tweet to be a document that usually treats a single topic.
However, this is not always the case: despite the limited size of a Tweet, it can address several topics. Moreover, these works [3], [4] do not take into consideration that Twitter users usually publish a large number of noisy posts. The researchers in [5] suggest organizing Tweet content into four dimensions in order to improve the discovery and following of new users and topics; however, this approach is very general and does not help in choosing friends based on a topic distance. Another work which uses the LDA topic model is [6]: the authors propose a new framework to discover a user's topics of interest on Twitter. They consider each Tweet as a document and then distinguish Tweets relevant to their users' interests from noisy Tweets. However, they do not consider the semantic side of the topics generated by the LDA model. Some works combine the LDA topic model with an ontology. For instance, [7] propose to tag web pages automatically by combining ontological concepts with probabilistic LDA topic models; they define the concepts before applying the LDA topic model, which limits the number of relevant concepts. [8] present an approach to discover a Twitter user's profile by extracting the entities contained in Tweets; they then determine a common set of high-level categories covering these entities by using Wikipedia's user-defined categories. In this paper, we propose a new ontological topic model which uses the LDA topic model together with the ODP taxonomy as an external source in order to discover Twitter users' topics of interest. In addition, we define a semantic hierarchy for each topic. After that, we construct semantic relations between users based on the detected topics.

III. MODEL OF DISCOVERING TWITTER USERS' TOPICS

In this section, we present our proposed ontological topic model to discover Twitter users' topics.
The architecture of this model is illustrated in Fig. 1. We can divide it into four steps: Cleaning Database; ODP-Based Adapted LDA; ODP-Based Topics Semantics; High-Level Topics Inferring.

Fig. 1. Architecture of the ontological topic model

A. Cleaning Database

In this step, we index the data collection; the Tweets corpus is stored in a database. We then organize the Tweet data with the following attributes: Id, User, Time, Content, Hashtag, URL and At_user. In order to clean our database, we use both linguistic knowledge and semantic knowledge to process the Tweets corpus, because linguistic knowledge does not capture the semantic relationships between terms, and semantic knowledge does not represent the linguistic relationships of the terms. In the linguistic processing phase, we perform the following steps:
- Remove stop-words such as: the, is, at, which, on, etc.
- Remove user names, hashtags and URLs.
- Stemming: reduce each word to its root (e.g., plays, playing, etc. become play).

- Spelling correction.
- Remove noisy words using the WordNet dictionary.

After this first cleaning, based only on linguistic processing, we notice that many noisy or unrecognized words remain. To solve this problem, we clean the corpus using semantic knowledge, namely the ODP taxonomy, as an instance of a general ontology. Here, we use the ODP categories as a stop-word filtering mechanism before applying the LDA model. Thus, we perform the following steps to keep only the relevant words:
- We verify that the word exists in the results of the ODP indexing. If it does not, we remove it from the Tweets. For example, noise words such as suprkkbwp, mirsku, etc. are removed in this step.
- If the word exists, we compute the number of documents (Web pages) which support it, N_D(w_i). As stated on the DMOZ site, a word may be helpful if it is supported by more than 10 Web pages; we therefore use the threshold N_D(w_i) >= 10 and remove all words supported by fewer documents. For example, the word "awry" is supported by only 5 pages, so it is removed in this cleaning step.

Fig. 2 illustrates a small example of this task. We notice that many irrelevant words are removed after the cleaning step based on semantic knowledge.

Fig. 2. Cleaning step

B. ODP-Based Adapted LDA

In this step, we apply the LDA topic model to the cleaned Tweet data. We have to define the number of iterations, the number of words allocated to each topic and the number of topics. In this phase, each user's Tweets are represented as a distribution over topics, and each topic is represented as a distribution over words. For instance, in our experimentation, if we consider a sample of Tweets and specify five LDA topics with five words each, the resulting LDA word distributions P(w_i | T_k) are the following:

Topic 1: Agriculture (0.094), Farmland (0.08), Wheat (0.08), Farm (0.0), Forestry (0.09).
Topic 2: Gallery (0.33), Sculpture (0.03), Art (0.09), Photograph (0.09), Graphic (0.0).
Topic 3: News (0.3), Politics (0.03), Iran (0.05), Nuclear (0.05), Obama (0.06).
Topic 4: Architecture (0.), Design (0.034), Style (0.03), Decor (0.03), Archaeology (0.05).
Topic 5: Sport (0.7), College (0.030), Football (0.030), Baseball (0.05), Golf (0.05).

C. ODP-Based Topics Semantics

As mentioned previously, the LDA model represents a topic as a distribution over words. In this form, users cannot observe their different orientations, because a word is a very specific unit and may be connected to several topic categories. Users therefore interpret the results according to their personal background and experience, which decreases the performance of the model. To solve this problem, we propose to provide the user with a topical hierarchy for each latent topic.

Thus, in this phase, we construct concept trees in order to detect the semantic relations between the words of each latent topic obtained from the LDA model. For each word in an unsupervised (latent) topic, denoted T_k, we generate semantic sub-trees (fragments) from the ODP taxonomy. Here, we consider only the first five categories with their top three levels; this choice is made because the first five categories are the most specific ones (see http://www.dmoz.org/guidelines/subcategories.html) and, on the other hand, it simplifies the implementation of the model. We repeat the same process for all latent topics. Next, we construct an XML file for each topic T_k, called Topics-XML, and represent each topic by its fragments. Fig. 3 presents an example of this process.

Fig. 3. Extracted topics and their XML file

Thus, we generate for each word in the unsupervised topic its semantic fragments from the ODP taxonomy. Based on these fragments, we construct the global tree which characterizes this topic.
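The fragment generation and tree construction just described can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: ODP_PATHS, the per-word table of category paths, is a hypothetical stand-in for the real ODP lookup, and the node weighting anticipates the occurrence-ratio weighting detailed in the algorithm below.

```python
from collections import defaultdict

# ODP_PATHS is a hypothetical stand-in for querying the ODP taxonomy:
# each word maps to category paths truncated to their top three levels.
ODP_PATHS = {
    "agriculture": [("Business", "Agriculture and Forestry", "Field Crops")],
    "forestry":    [("Business", "Agriculture and Forestry", "Field Crops"),
                    ("Science", "Environment", "Forests")],
}

def build_topical_tree(topic_words):
    """Merge the per-word ODP fragments of one latent topic into a single
    tree; topic_words maps word -> P(word | topic) from LDA."""
    weights = defaultdict(float)  # (level, category) -> accumulated weight
    children = defaultdict(set)   # category -> set of sub-categories
    for word, p_word in topic_words.items():
        paths = ODP_PATHS.get(word, [])
        n_categories = sum(len(path) for path in paths)  # total for this word
        for path in paths:
            for level, category in enumerate(path, start=1):
                # share of this category in the word * weight of the word
                # in the topic, summed over all the topic's words
                weights[(level, category)] += (1.0 / n_categories) * p_word
                if level < len(path):
                    children[category].add(path[level])
    return weights, children

w, ch = build_topical_tree({"agriculture": 0.094, "forestry": 0.019})
print(ch["Business"])  # {'Agriculture and Forestry'}
```

Note how a category shared by several words, such as Field Crops here, accumulates weight from each of them, which is exactly the behavior the Topic 1 example below relies on.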
Then we calculate the weight of each node in this tree. These steps are detailed in the following algorithm. Firstly, for each latent topic T_k in the Topics-XML file, we create two empty sets, lines (4) and (5). The elements of the first one represent the global tree generated for this topic, while the elements of the second set represent the categories (nodes of the tree) of the same topic. For each word w_i in an unsupervised topic T_k, we create a new set called SetCategory (8). We insert into this SetCategory two types of elements: the categories generated for the word w_i and the weight of each category. For each hierarchy H_x generated for w_i, we apply the Create-Hierarchy function to extend SetHierarchy with H_x, line (). This function verifies whether there are nodes in H_x that do not exist in the topical tree contained in SetHierarchy. If so, we insert this hierarchy into the tree. Otherwise, a node Nd from hierarchy H_x is found in the tree, and the function maps the child nodes of Nd onto their corresponding parent nodes in the topical tree.

In line (4), C is the category associated with level L_y in the hierarchy H_x. We verify whether C already exists in SetCategory, line (5). If it does not, the weight of C in the word w_i is computed (8). This weight is based on two probabilities. The first represents the importance of this category in the word w_i; it is defined by the number of occurrences of C in w_i divided by the total number of categories in this word, denoted N. The second represents the importance of the word w_i in the topic T_k; this probability is obtained from the result produced previously by the LDA topic model. In line (9), the category C with its weight is inserted into SetCategory. Otherwise, the weight of C in the word w_i has already been calculated. In line (3), the SetCategory of word w_i is included in the SetTopic of topic T_k. In order to define the weight of each category C in T_k, the algorithm checks whether C is mentioned several times in SetTopic and, if so, computes the sum of all of C's weights (9). For instance, if we consider Topic 1, the category Field Crops is generated for the words Agriculture and Forestry; in this case, the weight of Field Crops in this topic is the sum of its weights in the two words. Finally, each node C in the topical tree contained in SetHierarchy is labeled with both the name and the weight of its category.

Fig. 4 presents the resulting semantic tree for Topic 1, mentioned previously in Section III-B. In this tree, the nodes represent the categories generated using the ODP taxonomy, while the links between the nodes represent relationships of the type supercategory-subcategory. For instance, the node Antiques has the supercategory Recreation and the subcategory Farm and Ranch Equipment. We notice in the tree the different levels of this hierarchy; each node is labeled with both a category name and a category weight.

Fig. 4. The topical tree of Topic 1

D. High-Level-Topics Inferring

The last phase of our model infers the high-level topics at the several levels. At each topic T_k and each level L, we select the node (representing a category C) which has the maximum weight; the weight of each node is obtained from the results of the previous section. To do this, we propose the following formula:

    SC_{k,L} = arg max_C ( sum_{i=1}^{I} (N_br / N) * p(w_i | T_k) )    (1)

where
SC_{k,L}: the selected node for the topic T_k at level L.
I: the number of words of the latent topic T_k.
N_br: the number of occurrences of the category C in the word w_i at level L.
N: the total number of occurrences of the category C for the word w_i in the topic T_k.

P(w_i | T_k): the probability of the word w_i in the latent LDA topic T_k.

For instance, in the semantic tree of Topic 1 shown in Fig. 4, the high-level topics inferred for each level are:
Level 1: Business
Level 2: Agriculture and Forestry
Level 3: Field Crops

In our model, the topic inferred at the last level (here, the third level) is considered the representative title of the topic. Therefore, the title which characterizes Topic 1 is Field Crops.

IV. RELATIONSHIP BETWEEN USERS AND CATEGORIES

We have detected users' topics of interest on Twitter. These topics can be linked to different domains or categories; usually, users treat a set of topics in different categories with different percentages. Thus we can define the relation between users and categories. Here, we propose to combine the importance of topic T_k for user U_s, taken from the user-topic distribution produced by the LDA model, with the importance of category C_j for this topic T_k, taken from the topic-category weights produced by our model. With formula (2), we calculate the probability that user s is related to category j; it is computed as a sum of products of two measures: the probability that user s treats topic T_k, p(U_{s,k}), and the weight of category C_j in topic T_k, W(C_{j,k}):

    P(U_s, C_j) = sum_{k=0}^{K} p(U_{s,k}) * W(C_{j,k})    (2)

where K is the number of topics. Table I presents the category weights for five users selected from our data set, as an instance of users.

Table I. Category weights for experimental users

Users - Categories   User 1   User 2   User 3   User 4   User 5
Arts                 0.009    0.0607   0.0866   0.007    0.05
Business             0.0559   0.078    0.030    0.0084   0.006
Science              0.080    0.0034   0.0077   0.0033   0.003
Computers            0.0085   0.034    0.0059   0.0004   0.055

V. EXPERIMENTATION

A. Data Set

We applied our proposed model to a data set collected by crawling one week of public Tweets using the 140dev Twitter framework (http://140dev.com/free-twitter-api-source-code-library/). Our collection is constructed from the Tweets of the 5th through the 12th of January 2012 (OneWeek). We selected the first 00 relevant users according to follower count, among those having more than 0 Tweets. For each user's Tweets, we applied the four steps of our proposed model in order to define the high-level topics and the categories. In the second step, we used the GibbsLDA++ C++ code (http://gibbslda.sourceforge.net/) to apply the LDA topic model.

B. Distance Between Users

In our experimentation, we calculate the distance between users according to their common topics; the results are then used to construct a social graph which presents the closeness relations between users. The distance between user i and user j can be defined from the Jensen-Shannon divergence between the users' topic distributions [3]:

    dist_T(i, j) = 2 * D_JS(i, j)    (3)

where D_JS(i, j) is the Jensen-Shannon divergence between the two topic distributions DT_i and DT_j, defined as:

    D_JS(i, j) = 1/2 * ( D_KL(DT_i || M) + D_KL(DT_j || M) )    (4)

where M is the average of the two probability distributions, M = 1/2 * (DT_i + DT_j), and D_KL is the Kullback-Leibler divergence.

Moreover, two users may treat the same topic but not necessarily with the same orientation. For example, two users i and j may treat the same topic, President Obama, with probabilities 0.8 for user i and 0.86 for user j; however, they do not have the same orientation, because user i talks about Obama's politics and user j about Obama's health care. Conversely, two users may treat different topics but with the same orientation. For example, consider two users, user 1 and user 2, who treat two different topics, the first about Football and the second about Water sports. The probability that user 1 treats the first topic is 0.90 and the second topic 0.1; the probability that user 2 treats the first topic is 0.8 and the second topic 0.87. Nevertheless, they have the same orientation, which is sport. Thus, if we calculate the distance between users according to the Jensen-Shannon divergence, users i and j are considered very close, while user 1 and user 2 are considered very distant; the category orientation is completely ignored. That is why we propose two distance measures. In the first measure (categories), we calculate the distance between users as the Jensen-Shannon divergence between the category weights over users, as in formulas (3) and (4).
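The Jensen-Shannon divergence of formula (4) can be sketched in a few lines of Python; base-2 logarithms are assumed here, and the two topic distributions are invented for illustration.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q), base-2 logarithms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence, using the average distribution M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

dt_i = [0.7, 0.2, 0.1]  # user i's distribution over three LDA topics
dt_j = [0.1, 0.2, 0.7]  # user j's distribution
print(round(js(dt_i, dt_j), 3))  # 0.365; identical distributions give 0.0
```

With base-2 logarithms the divergence lies in [0, 1]; formula (3) then scales it to obtain the distance dist_T(i, j), and applying the same routine to category-weight vectors gives the first alternative measure.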

The second measure combines the two previous measures (topics, categories). This new measure decreases the distance between users who treat not only the same topics but also the same categories. The distance between users in this measure is computed with the following formula:

    dist_TD(i, j) = 2 * D_JS(i, j)    (5)

where D_JS(i, j) is the Jensen-Shannon divergence computed over both the category weights WC_i, WC_j and the topic distributions DT_i, DT_j:

    D_JS(i, j) = 1/2 * ( 1/2 * ( D_KL(DT_i || M) + D_KL(DT_j || M) ) ) + 1/2 * ( 1/2 * ( D_KL(WC_i || M) + D_KL(WC_j || M) ) )    (6)

Table II shows the distances between users based on both topics and categories; we consider in this table only five users, as an instance of our data set. For example, the distance between user 3 and user 5 according to topics-categories is .49, which is the minimum distance; thus these two users are the closest in comparison with the others.

Table II. Distance between users based on the topic-category measure

User - User   User 1   User 2   User 3   User 4   User 5
User 1        0.0000   .608     .633     .6       .438
User 2        .608     0.0000   .37      .644     .470
User 3        .633     .37      0.0000   .630     .49
User 4        .6       .644     .630     0.0000   .309
User 5        .438     .470     .49      .309     0.0000

C. Graph Construction Based on the Distance Between Users

Graphs have great expressive power in modeling; a graph is based on two concepts, nodes and edges. In the social network area, nodes represent a set of social entities, such as users or social organizations, while edges between nodes indicate that a direct relationship has been created during social interactions. Here, we present a new social graph whose nodes represent the users and whose edges represent the semantic closeness between them according to the measures mentioned in the previous section. Thus the relations between users in our graph are not communication relationships, as in existing works.

Accordingly, we can define several types of graphs depending on the selected measure: topics, categories, and topics-categories. Thus, we can construct a graph that represents the closeness between users according to one topic, one category, one topic-category, all topics, all categories, or all topics-categories. We create an edge from user i to user j if user i is the closest to user j for the topic T_k; the weight of this link is the distance between them for the selected topic T_k. Table II shows the distances between the five experimental users based on the topic-category measure, and Fig. 5 shows the corresponding graph, named the topics-categories graph. For instance, an edge is created from user 1 to user 3 based on the minimal distance, which is .633.

Fig. 5. Topics-Categories Graph

VI. Conclusion

In this paper, we proposed a new ontological model for clustering the topics of Tweets, based on the ODP taxonomy as an external source, in order to derive from users' Tweets the high-level topics and the categories. Existing works apply a generative probabilistic model such as LDA to Tweet data and identify topics as word distributions, without considering their semantics. The originality of our proposed model is the use of a semantic hierarchy (in this case, the ODP taxonomy) to detect users' topics of interest on Twitter. The advantage of our proposition is that it bases the relations between users on their topics of interest, and not only on the communications between them, as in existing works. Thus, we strengthen the relations between users based on their common high-level topics and categories; the result is a social graph based on the users' common topics. This study can be used to create user communities based on the semantic relations between users, and then to propose a new OLAP (On-Line Analytical Processing) operator for Tweet data that includes the semantic aspect. In future work, we will consider other Tweet attributes, such as location, to calculate the similarities between users.
REFERENCES

[1] T. Hofmann, "Probabilistic latent semantic indexing", SIGIR 1999, pp. 50-57.
[2] D.M. Blei, A. Ng, M. Jordan, "Latent Dirichlet allocation", JMLR 2003, pp. 993-1022.
[3] J. Weng, E.P. Lim, J. Jiang, Q. He, "TwitterRank: finding topic-sensitive influential twitterers", WSDM 2010, pp. 261-270.
[4] W.X. Zhao, J. Jiang, J. Weng, J. He, E.P. Lim, H. Yan, X. Li, "Comparing Twitter and traditional media using topic models", ECIR 2011, pp. 338-349.
[5] D. Ramage, S.T. Dumais, D.J. Liebling, "Characterizing microblogs with topic models", ICWSM 2010.
[6] Z. Xu, R. Lu, L. Xiang, Q. Yang, "Discovering user interest on Twitter with a modified author-topic model", Web Intelligence 2011, pp. 422-429.
[7] C. Chemudugunta, A. Holloway, P. Smyth, M. Steyvers, "Modeling documents by combining semantic concepts with unsupervised statistical learning", International Semantic Web Conference 2008, pp. 229-244.
[8] M. Michelson, S.A. Macskassy, "Discovering users' topics of interest on Twitter: a first look", AND 2010, pp. 73-80.