Ontological Topic Modeling to Extract Twitter users' Topics of Interest


Ounas Asfari, Lilia Hannachi, Fadila Bentayeb and Omar Boussaid

O. Asfari, F. Bentayeb and O. Boussaid are with the ERIC Laboratory, University of Lyon, Bron, France (e-mail: ounas.asfari@univ-lyon.fr, fadila.bentayeb@univ-lyon.fr, omar.boussaid@univ-lyon.fr). L. Hannachi is with the LRDSI Laboratory, University of Blida, Blida, Algeria (e-mail: hannachi.lilia@yahoo.fr).

Abstract--Twitter, one of the most notable micro-blogging services, has become a significant means by which people communicate with the world and describe their current activities, opinions and status in short text snippets. Tweets can be analyzed automatically to derive useful information such as topics of interest, social influence, predictions and user communities. In this paper, we describe an approach for modeling users' interests as topics extracted from their Tweets. The proposed approach differs from existing ones in that it combines LDA (Latent Dirichlet Allocation), used to extract topics from Tweets, with a taxonomy (in this case, ODP) used as an external knowledge source. A semantic hierarchy is defined for each topic, which allows detecting common topics between users that would not have been detected with LDA alone. Our aim is thus to derive the high-level topics and the categories of topics from users' Tweets. Our experiments show that the proposed model can extract the main topics and categories from users' Tweets. We also compute the distances between users based on their topics of interest.

Index Terms--Data mining, Ontology, Semantic processing, Topic Model, Tweets.

I. INTRODUCTION

Micro-blogging services such as Twitter have grown explosively in recent years. Twitter is an online social networking service that enables its users to send and read text-based posts of up to 140 characters, known as "Tweets". It enables its users to communicate with the world, share current activities, opinions and spontaneous ideas, and organize large communities of people. The service rapidly gained worldwide popularity, with over 300 million users as of 2011, generating over 300 million Tweets and handling over 1.6 billion search queries per day. This fast evolution has led researchers to study the characteristics of Tweet content and to extract information such as opinions on a specific topic or users' topics of interest. Studies of Tweets have applications in many domains, such as friend recommendation, opinion analysis and the detection of users' topics.

The text of Tweets is generally noisy and unstructured, but it is a rich data set to analyze, because users tend to pack substantial meaning into a short space. It is therefore important to understand the information behind Tweets and to detect the topics they present. To detect these topics, many applications use topic models such as PLSA (Probabilistic Latent Semantic Analysis) [1] or LDA (Latent Dirichlet Allocation) [2], which detect the latent topics and represent each one as a distribution over words. However, these models do not extract semantic concepts for the latent topics. Our contribution in this paper is to attach semantics to the word distributions produced by the topic model, in order to detect automatically the high-level topics presented by the Tweets. We can thus extract users' topics of interest by examining the terms they mention in their Tweets.
To achieve this goal, we propose an ontological topic clustering model based on the ODP (Open Directory Project) taxonomy as an external knowledge source, in order to derive the high-level topics and the categories of topics from users' Tweets. Unlike the works that apply a topic model to Tweets and represent the topics only as word distributions, we attach semantics to each topic by constructing a multi-level semantic hierarchy for it. The use of the ODP taxonomy is motivated by the fact that topics discovered with a standard topic model like LDA are based only on statistical word distributions and do not account for semantic relationships.

The relations between users, as defined in social network models, are based on the communication between them. Even when a friendship exists between users, we cannot extract the information that interests them both. This is a problem when we look for a community of users related to the same topics in order to recommend information to them, which is why the study of micro-blogging content has attracted much attention in recent years. Here, we propose to create relations between users based on their common topics, or their common topic categories, derived from their Tweets.

For instance, consider a user who always writes Tweets in the sport domain, like the following real-world Tweet: "Contracts for Top College Football Coaches Grow Complicated". Our proposed ontological topic model will automatically assign the topic football and the category Sport to this Tweet. Now, consider another user who writes the following real-world Tweet:

"Barcelona wins -0 at Real Mallorca but Real Madrid return to form and smash Real Sociedad 5- ". The topic detected for this Tweet will also be football, although the word football is not mentioned in it. This shows the importance of adding a semantic layer to the classical topic model. Since the two users treat the same topic, football, we can create or strengthen the relation between them.

The rest of this paper is organized as follows. Section 2 reviews the related works. In Section 3, we present in detail our proposed ontological topic model, which is composed of four steps that detect the high-level topics from users' Tweets. Section 4 describes the relationship between users and the high-level topics. Section 5 presents our experimentation, which includes the calculation of the distances between users based on the detected high-level topics. Finally, Section 6 concludes the paper and presents future work.

II. RELATED WORK

As mentioned previously, we propose an ontological topic model to discover Twitter users' topics. In this model we use a general probabilistic topic model, LDA (Latent Dirichlet Allocation) [2], in which the text collection of the users' Tweets is represented as a distribution of topics, and each topic is represented as a distribution of words. The LDA topic model is used in many applications, such as text mining and, more recently, the analysis of Twitter data. Here, we present some works that apply topic models to Twitter.

The authors of [3] use the standard LDA topic model in micro-blogging environments in order to identify influential users. The measure proposed in their work is based on the number of Tweets. In [4], the authors compare Tweet content empirically with traditional news media by using a new Twitter-LDA model. They consider each Tweet as a document that usually treats a single topic.
However, this is not always the case: despite its limited size, a Tweet can cover several topics. Moreover, the works [3] and [4] do not take into consideration that Twitter users usually publish a large number of noisy posts. The authors of [5] suggest organizing Tweet content into four dimensions in order to improve finding and following new users and topics. However, this approach is very general and does not address the choice of friends based on a topic-based distance.

Another work that uses the LDA topic model is [6], which proposes a new framework to discover users' topics of interest on Twitter. The authors consider each Tweet as a document and distinguish between Tweets that are relevant to their users' interests and other, noisy Tweets. However, they do not consider the semantic side of the topics generated by the LDA model.

Some works combine the LDA topic model with an ontology. For instance, [7] proposes to tag web pages automatically by combining ontological concepts with probabilistic LDA topic models. The concepts are defined before the LDA topic model is applied, which limits the number of relevant concepts. Also, [8] presents an approach to discover a Twitter user's profile by extracting the entities contained in Tweets and then determining a common set of high-level categories that covers these entities, using Wikipedia's user-defined categories.

In this paper, we propose a new ontological topic model that uses the LDA topic model together with the ODP taxonomy as an external source in order to discover Twitter users' topics of interest. In addition, we define a semantic hierarchy for each topic. We then construct semantic relations between users based on the detected topics.

III. MODEL OF DISCOVERING TWITTER USERS' TOPICS

In this section, we present our proposed ontological topic model to discover Twitter users' topics.
The architecture of this model is illustrated in Fig. 1. We can divide it into four steps: Cleaning Database; ODP-Based Adapted LDA; ODP-Based Topics Semantic; High-Level Topics inferring.

Fig. 1. Architecture of the ontological topic model

A. Cleaning Database

In this step, we index the data collection; the Tweets corpus is stored in a database. We organize the Tweets data with the following attributes: Id, User, Time, Content, hashtag, URL and at_user. In order to clean our database, we use both linguistic knowledge and semantic knowledge to process the Tweets corpus, because linguistic knowledge does not capture the semantic relationships between terms, and semantic knowledge does not represent the linguistic relationships between terms. In the linguistic processing phase, we perform the following steps:

- Remove stop-words such as: the, is, at, which, on, etc.
- Remove user names, hashtags and URLs.
- Stemming: reduce each word to its root (e.g. plays, playing, etc. become play).

- Spelling correction.
- Use the WordNet dictionary to remove noisy words.

After this first cleaning, based only on linguistic processing, we notice that many noisy or unrecognized words still remain. To solve this problem, we clean the corpus by using semantic knowledge, namely the ODP taxonomy as an instance of a general ontology. Here, we use the ODP categories as a stop-word filtering mechanism before applying the LDA model. We perform the following steps to keep only the relevant words:

- We verify whether the word exists in the results of the ODP indexing. If it does not exist, we remove it from the Tweets. For example, noise words such as suprkkbwp, mirsku, etc. are removed in this step.
- If the word exists, we compute the number of documents (Web pages) that support it, $N_D(w_i)$. As stated on the DMOZ site, a word may be helpful if the number of its Web pages is more than 0. We therefore set a threshold $N_D(w_i) \geq 0$ and remove all words whose number of supporting documents is below it, i.e. $N_D(w_i) < 0$. For example, the word "awry" has 5 supporting pages, so it is removed in this cleaning step.

Fig. 2 illustrates a small example of this task. We notice that many irrelevant words are removed after the cleaning step based on semantic knowledge.

Fig. 2. Cleaning step

B. ODP-Based Adapted LDA

In this step, we apply the LDA topic model to the cleaned Tweets data. We have to define the number of iterations, the number of words allocated to each topic and the number of topics. In this phase, each user's Tweets are represented as a distribution of topics, and each topic is represented as a distribution of words. For instance, in our experimentation, we consider a sample of Tweets and specify five LDA topics and five words for each topic. The resulting LDA word distributions $P(w_i \mid T_k)$ for each topic are the following:

Topic 1: Agriculture (0.094), Farmland (0.08), Wheat (0.08), Farm (0.0), Forestry (0.09).
Topic 2: Gallery (0.33), Sculpture (0.03), Art (0.09), Photograph (0.09), Graphic (0.0).
Topic 3: News (0.3), Politics (0.03), Iran (0.05), Nuclear (0.05), Obama (0.06).
Topic 4: Architecture (0.), Design (0.034), Style (0.03), Decor (0.03), Archaeology (0.05).
Topic 5: Sport (0.7), College (0.030), Football (0.030), Baseball (0.05), Golf (0.05).

C. ODP-Based Topics Semantic

As mentioned previously, the LDA model represents a topic as a distribution of words. In this form, users cannot discern the different orientations of a topic, because a word is a very specific unit that may be connected to several topic categories. Users therefore interpret the results according to their personal background and experience, which decreases the performance of the model. To solve this problem, we propose providing the user with a topical hierarchy for each latent topic.

In this phase, we construct concept trees in order to detect the semantic relations between the words of each latent topic produced by the LDA model. For each word in the unsupervised (latent) topic, denoted $T_k$, we generate the semantic sub-trees (fragments) from the ODP taxonomy. Here, we consider only the first five categories with their top three levels. This choice is made because the first five categories are the more specific ones and, on the other hand, to simplify the implementation of the model. We repeat the same process for all latent topics. Next, we construct an XML file for each topic $T_k$, called Topics-XML, and represent each topic by its fragments. Fig. 3 presents an example of this process.

Fig. 3. Extracted topics and their XML file

Thus, for each word in the unsupervised topic, we generate its semantic fragments from the ODP taxonomy. Based on these fragments, we construct the global tree that characterizes the topic. Then we calculate the weight of each node in this tree.
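As a rough illustration of how category weights could be accumulated from ODP fragments, and of how the heaviest category per level could then be picked (the rule that Section III.D formalizes), the following Python sketch may help. It is not the authors' implementation: the function names, the toy fragments (apart from the category names quoted in this paper) and the word probabilities are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict

# Hypothetical word probabilities p(w_i | T_k) for one latent topic
# (illustrative values, not the figures reported in the paper).
TOPIC_WORDS = {"agriculture": 0.4, "farm": 0.3, "forestry": 0.3}

# Hypothetical ODP fragments: each word maps to category paths,
# truncated to the top three levels of the taxonomy.
ODP_FRAGMENTS = {
    "agriculture": [
        ("Business", "Agriculture and Forestry", "Field Crops"),
        ("Science", "Agriculture", "Soil"),
    ],
    "farm": [
        ("Business", "Agriculture and Forestry", "Livestock"),
        ("Recreation", "Antiques", "Farm and Ranch Equipment"),
    ],
    "forestry": [
        ("Business", "Agriculture and Forestry", "Field Crops"),
    ],
}


def category_weights(topic_words, fragments, levels=3):
    """Return one {category: weight} mapping per hierarchy level.

    A category's weight in a word is (occurrences of the category among
    the word's fragments at that level / number of fragments of the word)
    multiplied by p(w_i | T_k); weights are summed over the topic's words.
    """
    weights = [defaultdict(float) for _ in range(levels)]
    for word, p_word in topic_words.items():
        paths = fragments.get(word, [])
        if not paths:
            continue  # word has no ODP fragment, so it contributes nothing
        n_total = len(paths)
        for level in range(levels):
            counts = defaultdict(int)
            for path in paths:
                if level < len(path):
                    counts[path[level]] += 1
            for category, n_br in counts.items():
                weights[level][category] += (n_br / n_total) * p_word
    return weights


def high_level_topics(weights):
    """Select, at each level, the category with the maximum weight."""
    return [max(level, key=level.get) for level in weights if level]


if __name__ == "__main__":
    w = category_weights(TOPIC_WORDS, ODP_FRAGMENTS)
    print(high_level_topics(w))
```

With these toy inputs, the category shared by several words of the topic accumulates the largest weight at each level, which is the behaviour the topical tree is meant to capture.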
The construction of the topical tree and the computation of the node weights are detailed in the following algorithm. Firstly, for each latent topic $T_k$ in the Topics-XML file, we create two empty sets, lines (4) and (5). The elements of the first set represent the global tree generated for this topic, while the elements of the second set represent the categories (nodes in the tree) of the same topic. For each word $w_i$ in an unsupervised topic $T_k$, we create a new set called SetCategory (8). We insert two types of elements into SetCategory: the categories generated for the word $w_i$ and the weight of each category.

For each hierarchy $H_x$ generated for $w_i$, we apply a Create-Hierarchy function to extend SetHierarchy with $H_x$, line (). This function checks whether the nodes of $H_x$ are absent from the topical tree contained in SetHierarchy. If so, we insert this hierarchy into the tree. Otherwise, i.e. when a node Nd of the hierarchy $H_x$ is already found in the tree, the function maps the child nodes of Nd against their corresponding parent nodes in the topical tree.

In line (4), C is the category associated with level $L_y$ in the hierarchy $H_x$. We verify whether C does not exist in SetCategory, line (5). If so, the weight of C in the word $w_i$ is computed (8). This weight is based on two probabilities. The first one represents the importance of the category in the word $w_i$; it is defined as the number of occurrences of C in $w_i$ divided by the total number of categories in this word, denoted N. The second represents the importance of the word $w_i$ in the topic $T_k$; this probability is obtained from the result produced previously by the LDA topic model. In line (9), the category C and its weight are inserted into SetCategory. Otherwise, the weight of C in the word $w_i$ has already been calculated.

In line (3), the SetCategory of word $w_i$ is included in the SetTopic of topic $T_k$. To define the weight of each category C in $T_k$, the algorithm checks whether C is mentioned several times in SetTopic and, if so, computes the sum of all the weights of C (9). For instance, in Topic 1, the category Field Crops is generated for the words Agriculture and Forestry. In this case, the weight of Field Crops in this topic is the sum of its weights in the two words. Finally, each node C in the topical tree contained in SetHierarchy is labeled with both the name and the weight of its category.

Fig. 4 presents the resulting semantic tree for Topic 1, which was introduced in Section B. In this tree, the nodes represent the categories generated by using the ODP taxonomy, while the links between the nodes represent relationships of the type supercategory-subcategory. For instance, the node Antiques has the supercategory Recreation and the subcategory Farm and Ranch Equipment. The tree shows the different levels of the hierarchy, and each node carries both a category name and a category weight.

Fig. 4. The topical tree of Topic 1

D. High-Level-Topics inferring

The last phase of our model infers the high-level topics at the several levels. For each topic $T_k$ and each level L, we select the node (representing the category C) that has the maximum weight. The weight of each node is obtained from the results of the previous section. To do this, we propose the following formula:

$$SC_{k,L} = \arg\max_{C} \sum_{i=1}^{I} \frac{N_{br}}{N} \cdot p(w_i \mid T_k) \qquad (1)$$

where
$SC_{k,L}$: the selected node for the topic $T_k$ at the level L.
$N_{br}$: the number of occurrences of the category C in the word $w_i$ at the level L.
$N$: the total number of category occurrences for the word $w_i$ in the topic $T_k$.

$P(w_i \mid T_k)$: the probability of the word $w_i$ in the latent LDA topic $T_k$.

For instance, in the semantic tree of Topic 1, shown in Fig. 4, the high-level topics inferred for each level are:

Level 1: Business
Level 2: Agriculture and Forestry
Level 3: Field Crops

In our model, the topic inferred from the last level (here, the third level) is considered the representative title of the topic. Therefore, the title that characterizes this topic is Field Crops.

IV. RELATIONSHIP BETWEEN USERS AND CATEGORIES

We have detected users' topics of interest on Twitter. These topics can be linked to different domains or categories. Usually, users treat a set of topics in different categories with different percentages, so we can define a relation between users and categories. Here, we propose to combine the importance of the topic $T_k$ for the user $U_s$, taken from the user-topic distribution produced by the LDA model, with the importance of the category j for this topic $T_k$, taken from the topic-category weights produced by our model. With formula (2), we calculate the probability that user s is related to category j. It is computed as the sum, over all topics, of the product of two measures:

- The probability that user s treats the topic $T_k$: $P(U_{s,k})$
- The weight of the category j in the topic $T_k$: $W(C_{j,k})$

$$P(U_s, C_j) = \sum_{k=0}^{K} P(U_{s,k}) \cdot W(C_{j,k}) \qquad (2)$$

where K is the number of topics.

Table I presents the category weights for five users selected from our data set, as a sample of users.

Table I. Categories weights for experimental users (columns: User 1 to User 5; rows: Arts, Business, Science, Computers)

V. EXPERIMENTATION

A. Data Set

We applied our proposed model to a data set collected by crawling one week of public Tweets with the 140dev Twitter framework. Our collection is constructed from the Tweets of the 5th through the th of January 0 (OneWeek).
We select the first 00 relevant users, ranked by their number of followers, among those who have more than 0 Tweets. For each user's Tweets, we applied the four steps of our proposed model in order to define the high-level topics and the categories. In the second step, we used the GibbsLDA++ C++ code to apply the LDA topic model.

B. Distance between users

In our experimentation, we calculate the distance between users according to their common topics; the results are then used to construct the social graph, which presents the different closeness relations between users. The distance between user i and user j can be defined from the Jensen-Shannon Divergence between the users' topic distributions [3]:

$$dist_T(i, j) = \sqrt{2 \cdot D_{JS}(i, j)} \qquad (3)$$

where $D_{JS}(i, j)$ is the Jensen-Shannon Divergence between the two topic distributions $DT_i$ and $DT_j$. It is defined as:

$$D_{JS}(i, j) = \frac{1}{2}\left( D_{KL}(DT_i \,\|\, M) + D_{KL}(DT_j \,\|\, M) \right) \qquad (4)$$

where
$M$: the average of the two probability distributions, $M = \frac{1}{2}(DT_i + DT_j)$.
$D_{KL}$: the Kullback-Leibler Divergence.

Moreover, two users may treat the same topic without necessarily sharing the same orientation. For example, two users i and j treat the same topic, President Obama, with the following probabilities: 0.8 for user i and 0.86 for user j. However, they do not have the same orientation, because user i talks about Obama's politics and user j talks about Obama's health care. Conversely, two users may treat different topics with the same orientation. For example, consider two users, user 1 and user 2, who treat two different topics, the first about Football and the second about Water sports. The probability that user 1 treats the first topic is 0.90 and the second topic is: 0.. The probability that user 2 treats the first topic is 0.8 and the second topic is: However, they have the same orientation, which is sport.
Thus, if we calculate the distance between the users according to the Jensen-Shannon Divergence measure, we find that users i and j are considered very close, while user 1 and user 2 are considered very distant. The category orientation is completely ignored. That is why we propose two further distance measures.

The first measure (categories) computes the distance between users as the Jensen-Shannon Divergence between the users' category weights, in the same way as in formulas (3) and (4).
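The contrast between the topic-based and the category-based distances can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the square-root form of the distance follows our reading of formula (3) and of [3], the category weights follow formula (2) and are renormalized so that they form a distribution (a step the paper does not spell out), and all numeric values, topic names and function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, as in formula (4)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))


def distance(p, q):
    """Distance derived from the divergence (assumed form of formula (3))."""
    return math.sqrt(2 * js_divergence(p, q))


def user_category_weights(user_topics, topic_categories):
    """Formula (2): sum over topics of P(U_s,k) * W(C_j,k), then normalized.

    The normalization is an assumption made here so that the result can be
    compared with a divergence defined on probability distributions.
    """
    n_categories = len(topic_categories[0])
    raw = [
        sum(p_topic * row[j] for p_topic, row in zip(user_topics, topic_categories))
        for j in range(n_categories)
    ]
    total = sum(raw)
    return [value / total for value in raw]


# Hypothetical topics: Football, Water sports, Politics.
# Hypothetical categories: Sports, News (one row of weights per topic).
TOPIC_CATEGORIES = [
    [0.90, 0.10],  # Football
    [0.95, 0.05],  # Water sports
    [0.10, 0.90],  # Politics
]
USER_1 = [0.85, 0.10, 0.05]  # mostly Football
USER_2 = [0.10, 0.85, 0.05]  # mostly Water sports

topic_dist = distance(USER_1, USER_2)
category_dist = distance(
    user_category_weights(USER_1, TOPIC_CATEGORIES),
    user_category_weights(USER_2, TOPIC_CATEGORIES),
)

if __name__ == "__main__":
    print(f"topic distance:    {topic_dist:.3f}")
    print(f"category distance: {category_dist:.3f}")
```

In this toy setting the two users look distant when compared on topics and close when compared on categories, which is the situation that motivates a category-aware measure.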

The second measure combines the two previous measures (topics and categories). This new measure decreases the distance between users who treat not only the same topics but also the same categories. The distance between users in this measure is computed with the following formula:

$$dist_{TD}(i, j) = \sqrt{2 \cdot D_{JS}(i, j)} \qquad (5)$$

where $D_{JS}(i, j)$ is the Jensen-Shannon Divergence computed over both the category weights $WC_i$, $WC_j$ and the topic distributions $DT_i$, $DT_j$. Here, the divergence is computed with the following formula:

$$D_{JS}(i, j) = \left( \frac{1}{2}\left( D_{KL}(DT_i \,\|\, M) + D_{KL}(DT_j \,\|\, M) \right) \right) + \left( \frac{1}{2}\left( D_{KL}(WC_i \,\|\, M) + D_{KL}(WC_j \,\|\, M) \right) \right) \qquad (6)$$

where, in each term, M is the average of the two distributions being compared.

Table II shows the distance between users based on both topics and categories. In this table we consider only five users, as a sample of our data set. For example, the distance between user 3 and user 5 according to topics-categories is .49, which is the minimum distance. These two users are therefore the closest in comparison with the others.

Table II. Distance between users based on topic-category measure (rows and columns: User 1 to User 5)

C. Graph construction based on the distance between users

A graph has great expressive power in modeling. It is based on two concepts, nodes and edges. In the social network area, the nodes represent a set of social entities, such as users or social organizations, while the edges between nodes indicate that a direct relationship has been created during social interactions. Here, we present a new social graph whose nodes represent the users and whose edges represent the semantic closeness between them, according to the measures described in the previous section. The relations between users in our graph are therefore not communication relationships, as they are in existing works.

Accordingly, we can define several types of graphs depending on the selected measure: topics, categories and topics-categories. We can construct a graph that represents the closeness between users according to one topic, one category, one topic-category, all topics, all categories, or all topics-categories. We create an edge from a user i to a user j if user i is the closest to user j for the topic $T_k$; the weight of this link is the distance between them for the selected topic $T_k$. Table II shows the distance between the five experimental users based on the topic-category measure, and Fig. 5 shows the corresponding graph, named the topics-categories graph. For instance, an edge is created from user to user 3 based on the minimal distance, which is .633.

Fig. 5. Topics-Categories Graph

VI. CONCLUSION

In this paper, we proposed a new ontological model for clustering the topics of Tweets, based on the ODP taxonomy as an external source, in order to derive the high-level topics and the categories from users' Tweets. Existing works apply a generative probabilistic model such as LDA to Tweets data to identify topics as word distributions, without considering the semantic notion. The originality of our proposed model lies in using a semantic hierarchy (in this case, the ODP taxonomy) to detect users' topics of interest on Twitter. The advantage of our proposal is that it focuses on the relations between users based on their topics of interest, and not only on the communications between them, as existing works do. We thus strengthen the relations between users based on their common high-level topics and categories. The result is a social graph based on the common topics of users. This study can be used to create user communities based on the semantic relations between users, and then to propose a new OLAP (On-Line Analytical Processing) operator for Tweets data that includes the semantic aspect. In future work, we will try to consider other attributes of Tweets, such as location, to calculate the similarities between users.
REFERENCES

[1] T. Hofmann, "Probabilistic latent semantic indexing", SIGIR 1999.
[2] D.M. Blei, A. Ng, M. Jordan, "Latent dirichlet allocation", JMLR 2003.
[3] J. Weng, E.P. Lim, J. Jiang, Q. He, "Twitterrank: finding topic-sensitive influential twitterers", WSDM 2010.
[4] W.X. Zhao, J. Jiang, J. Weng, J. He, E.P. Lim, H. Yan, X. Li, "Comparing Twitter and traditional media using topic models", ECIR 2011.
[5] D. Ramage, S.T. Dumais, D.J. Liebling, "Characterizing microblogs with topic models", ICWSM 2010.
[6] Z. Xu, R. Lu, L. Xiang, Q. Yang, "Discovering user interest on Twitter with a modified author-topic model", Web Intelligence 2011.
[7] C. Chemudugunta, A. Holloway, P. Smyth, M. Steyvers, "Modeling documents by combining semantic concepts with unsupervised statistical learning", International Semantic Web Conference 2008.
[8] M. Michelson, S.A. Macskassy, "Discovering users' topics of interest on Twitter: a first look", AND 2010.


Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining 1 Vishakha D. Bhope, 2 Sachin N. Deshmukh 1,2 Department of Computer Science & Information Technology, Dr. BAM

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

Efficient integration of data mining techniques in DBMSs

Efficient integration of data mining techniques in DBMSs Efficient integration of data mining techniques in DBMSs Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex, FRANCE {bentayeb jdarmont

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

arxiv: v1 [cs.db] 10 May 2007

arxiv: v1 [cs.db] 10 May 2007 Decision tree modeling with relational views Fadila Bentayeb and Jérôme Darmont arxiv:0705.1455v1 [cs.db] 10 May 2007 ERIC Université Lumière Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France

More information

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR P.SHENBAGAVALLI M.E., Research Scholar, Assistant professor/cse MPNMJ Engineering college Sspshenba2@gmail.com J.SARAVANAKUMAR B.Tech(IT)., PG

More information

Mining Social Media Users Interest

Mining Social Media Users Interest Mining Social Media Users Interest Presenters: Heng Wang,Man Yuan April, 4 th, 2016 Agenda Introduction to Text Mining Tool & Dataset Data Pre-processing Text Mining on Twitter Summary & Future Improvement

More information

Hierarchical Location and Topic Based Query Expansion

Hierarchical Location and Topic Based Query Expansion Hierarchical Location and Topic Based Query Expansion Shu Huang 1 Qiankun Zhao 2 Prasenjit Mitra 1 C. Lee Giles 1 Information Sciences and Technology 1 AOL Research Lab 2 Pennsylvania State University

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Extracting Information from Social Networks

Extracting Information from Social Networks Extracting Information from Social Networks Reminder: Social networks Catch-all term for social networking sites Facebook microblogging sites Twitter blog sites (for some purposes) 1 2 Ways we can use

More information

Bitmap index-based decision trees

Bitmap index-based decision trees Bitmap index-based decision trees Cécile Favre and Fadila Bentayeb ERIC - Université Lumière Lyon 2, Bâtiment L, 5 avenue Pierre Mendès-France 69676 BRON Cedex FRANCE {cfavre, bentayeb}@eric.univ-lyon2.fr

More information

A New Tool for Textual Aggregation in OLAP Context

A New Tool for Textual Aggregation in OLAP Context A New Tool for Textual Aggregation in OLAP Context Mustapha BOUAKKAZ 1, Sabine LOUDCHER 2 and Youcef OUINTEN 1 1 LIM Laboratory,University of Laghouat, Algeria 2 ERIC Laboratory, University of Lyon2, France

More information

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---

More information

Module 3: GATE and Social Media. Part 4. Named entities

Module 3: GATE and Social Media. Part 4. Named entities Module 3: GATE and Social Media Part 4. Named entities The 1995-2018 This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs Licence Named Entity Recognition Texts frequently

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

Ontology based Web Page Topic Identification

Ontology based Web Page Topic Identification Ontology based Web Page Topic Identification Abhishek Singh Rathore Department of Computer Science & Engineering Maulana Azad National Institute of Technology Bhopal, India Devshri Roy Department of Computer

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities

Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities Omar Qawasmeh 1, Maxime Lefranois 2, Antoine Zimmermann 2, Pierre Maret 1 1 Univ. Lyon, CNRS, Lab. Hubert Curien UMR

More information

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection CSE 255 Lecture 6 Data Mining and Predictive Analytics Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of

More information

Probabilistic Graphical Models Part III: Example Applications

Probabilistic Graphical Models Part III: Example Applications Probabilistic Graphical Models Part III: Example Applications Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2014 CS 551, Fall 2014 c 2014, Selim

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Finding Influencers within Fuzzy Topics on Twitter

Finding Influencers within Fuzzy Topics on Twitter 1 1. Introduction Finding Influencers within Fuzzy Topics on Twitter Tal Stramer, Stanford University In recent years, there has been an explosion in usage of online social networks such as Twitter. This

More information

EECS 545: Project Final Report

EECS 545: Project Final Report EECS 545: Project Final Report Querying Methods for Twitter Using Topic Modeling Fall 2011 Lei Yang, Yang Liu, Eric Uthoff, Robert Vandermeulen 1.0 Introduction In the area of information retrieval, where

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI 1 KAMATCHI.M, 2 SUNDARAM.N 1 M.E, CSE, MahaBarathi Engineering College Chinnasalem-606201, 2 Assistant Professor,

More information

EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES

EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES B. GEETHA KUMARI M. Tech (CSE) Email-id: Geetha.bapr07@gmail.com JAGETI PADMAVTHI M. Tech (CSE) Email-id: jageti.padmavathi4@gmail.com ABSTRACT:

More information

DIGIT.B4 Big Data PoC

DIGIT.B4 Big Data PoC DIGIT.B4 Big Data PoC DIGIT 01 Social Media D02.01 PoC Requirements Table of contents 1 Introduction... 5 1.1 Context... 5 1.2 Objective... 5 2 Data SOURCES... 6 2.1 Data sources... 6 2.2 Data fields...

More information

A Novel deep learning models for Cold Start Product Recommendation using Micro blogging Information

A Novel deep learning models for Cold Start Product Recommendation using Micro blogging Information A Novel deep learning models for Cold Start Product Recommendation using Micro blogging Information Chunchu.Harika, PG Scholar, Department of CSE, QIS College of Engineering and Technology, Ongole, Andhra

More information

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 93-94

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 93-94 ه عا ی Semantic Web Ontology Engineering and Evaluation Morteza Amini Sharif University of Technology Fall 93-94 Outline Ontology Engineering Class and Class Hierarchy Ontology Evaluation 2 Outline Ontology

More information

TempWeb rd Temporal Web Analytics Workshop

TempWeb rd Temporal Web Analytics Workshop TempWeb 2013 3 rd Temporal Web Analytics Workshop Stuff happens continuously: exploring Web contents with temporal information Omar Alonso Microsoft 13 May 2013 Disclaimer The views, opinions, positions,

More information

Proxy Server Systems Improvement Using Frequent Itemset Pattern-Based Techniques

Proxy Server Systems Improvement Using Frequent Itemset Pattern-Based Techniques Proceedings of the 2nd International Conference on Intelligent Systems and Image Processing 2014 Proxy Systems Improvement Using Frequent Itemset Pattern-Based Techniques Saranyoo Butkote *, Jiratta Phuboon-op,

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS A Friend Recommender System for Social Networks by Life Style Extraction Using Probabilistic Method - Friendtome Namrata M.Eklaspur [1], Anand S.Pashupatimath [2] M.Tech P.G

More information

Event Detection in Czech Twitter

Event Detection in Czech Twitter Event Detection in Czech Twitter Václav Rajtmajer 1 and Pavel Král 1,2 1 Dept. of Computer Science & Engineering Faculty of Applied Sciences University of West Bohemia Plzeň, Czech Republic 2 NTIS - New

More information

Diversionary Comments under Political Blog Posts

Diversionary Comments under Political Blog Posts Diversionary Comments under Political Blog Posts Jing Wang jwang69@uic.edu Bing Liu liub@cs.uic.edu Clement T. Yu yu@cs.uic.edu Philip S. Yu psyu@uic.edu Weiyi Meng SUNY at Binghamton meng@cs.binghamton.edu

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

Marketing & Back Office Management

Marketing & Back Office Management Marketing & Back Office Management Menu Management Add, Edit, Delete Menu Gallery Management Add, Edit, Delete Images Banner Management Update the banner image/background image in web ordering Online Data

More information

Using Latent Dirichlet Allocation to Incorporate Domain Knowledge with Concept based Approach for Automatic Topic Detection

Using Latent Dirichlet Allocation to Incorporate Domain Knowledge with Concept based Approach for Automatic Topic Detection Using Latent Dirichlet Allocation to Incorporate Domain Knowledge with Concept based Approach for Automatic Topic Detection A.Mekala, MCA, MSC, MPhil, Research Scholar Manonmaniam Sundaranar University,

More information

Un-moderated real-time news trends extraction from World Wide Web using Apache Mahout

Un-moderated real-time news trends extraction from World Wide Web using Apache Mahout Un-moderated real-time news trends extraction from World Wide Web using Apache Mahout A Project Report Presented to Professor Rakesh Ranjan San Jose State University Spring 2011 By Kalaivanan Durairaj

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Tag Based Image Search by Social Re-ranking

Tag Based Image Search by Social Re-ranking Tag Based Image Search by Social Re-ranking Vilas Dilip Mane, Prof.Nilesh P. Sable Student, Department of Computer Engineering, Imperial College of Engineering & Research, Wagholi, Pune, Savitribai Phule

More information

Clustering. Bruno Martins. 1 st Semester 2012/2013

Clustering. Bruno Martins. 1 st Semester 2012/2013 Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

More information

A Recommendation Model Based on Site Semantics and Usage Mining

A Recommendation Model Based on Site Semantics and Usage Mining A Recommendation Model Based on Site Semantics and Usage Mining Sofia Stamou Lefteris Kozanidis Paraskevi Tzekou Nikos Zotos Computer Engineering and Informatics Department, Patras University 26500 GREECE

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

The Open University s repository of research publications and other research outputs. Search Personalization with Embeddings

The Open University s repository of research publications and other research outputs. Search Personalization with Embeddings Open Research Online The Open University s repository of research publications and other research outputs Search Personalization with Embeddings Conference Item How to cite: Vu, Thanh; Nguyen, Dat Quoc;

More information

arxiv: v1 [cs.mm] 12 Jan 2016

arxiv: v1 [cs.mm] 12 Jan 2016 Learning Subclass Representations for Visually-varied Image Classification Xinchao Li, Peng Xu, Yue Shi, Martha Larson, Alan Hanjalic Multimedia Information Retrieval Lab, Delft University of Technology

More information

Mining User - Aware Rare Sequential Topic Pattern in Document Streams

Mining User - Aware Rare Sequential Topic Pattern in Document Streams Mining User - Aware Rare Sequential Topic Pattern in Document Streams A.Mary Assistant Professor, Department of Computer Science And Engineering Alpha College Of Engineering, Thirumazhisai, Tamil Nadu,

More information

BUPT at TREC 2009: Entity Track

BUPT at TREC 2009: Entity Track BUPT at TREC 2009: Entity Track Zhanyi Wang, Dongxin Liu, Weiran Xu, Guang Chen, Jun Guo Pattern Recognition and Intelligent System Lab, Beijing University of Posts and Telecommunications, Beijing, China,

More information

Collecting social media data based on open APIs

Collecting social media data based on open APIs Collecting social media data based on open APIs Ye Li With Qunyan Zhang, Haixin Ma, Weining Qian, and Aoying Zhou http://database.ecnu.edu.cn/ Outline Social Media Data Set Data Feature Data Model Data

More information

A Novel Approach for Restructuring Web Search Results by Feedback Sessions Using Fuzzy clustering

A Novel Approach for Restructuring Web Search Results by Feedback Sessions Using Fuzzy clustering A Novel Approach for Restructuring Web Search Results by Feedback Sessions Using Fuzzy clustering R.Dhivya 1, R.Rajavignesh 2 (M.E CSE), Department of CSE, Arasu Engineering College, kumbakonam 1 Asst.

More information

A Navigation-log based Web Mining Application to Profile the Interests of Users Accessing the Web of Bidasoa Turismo

A Navigation-log based Web Mining Application to Profile the Interests of Users Accessing the Web of Bidasoa Turismo A Navigation-log based Web Mining Application to Profile the Interests of Users Accessing the Web of Bidasoa Turismo Olatz Arbelaitz, Ibai Gurrutxaga, Aizea Lojo, Javier Muguerza, Jesús M. Pérez and Iñigo

More information

Spatial Latent Dirichlet Allocation

Spatial Latent Dirichlet Allocation Spatial Latent Dirichlet Allocation Xiaogang Wang and Eric Grimson Computer Science and Computer Science and Artificial Intelligence Lab Massachusetts Tnstitute of Technology, Cambridge, MA, 02139, USA

More information

The Curated Web: A Recommendation Challenge. Saaya, Zurina; Rafter, Rachael; Schaal, Markus; Smyth, Barry. RecSys 13, Hong Kong, China

The Curated Web: A Recommendation Challenge. Saaya, Zurina; Rafter, Rachael; Schaal, Markus; Smyth, Barry. RecSys 13, Hong Kong, China Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available. Title The Curated Web: A Recommendation Challenge

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab M. Duwairi* and Khaleifah Al.jada'** * Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110,

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

Part 1. Learn how to collect streaming data from Twitter web API.

Part 1. Learn how to collect streaming data from Twitter web API. Tonight Part 1. Learn how to collect streaming data from Twitter web API. Part 2. Learn how to store the streaming data to files or a database so that you can use it later for analyze or representation

More information

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics An Oracle White Paper October 2012 Oracle Social Cloud Platform Text Analytics Executive Overview Oracle s social cloud text analytics platform is able to process unstructured text-based conversations

More information

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Romain Deveaud 1 and Florian Boudin 2 1 LIA - University of Avignon romain.deveaud@univ-avignon.fr

More information

Research and Design of Key Technology of Vertical Search Engine for Educational Resources

Research and Design of Key Technology of Vertical Search Engine for Educational Resources 2017 International Conference on Arts and Design, Education and Social Sciences (ADESS 2017) ISBN: 978-1-60595-511-7 Research and Design of Key Technology of Vertical Search Engine for Educational Resources

More information

Topic Model Visualization with IPython

Topic Model Visualization with IPython Topic Model Visualization with IPython Sergey Karpovich 1, Alexander Smirnov 2,3, Nikolay Teslya 2,3, Andrei Grigorev 3 1 Mos.ru, Moscow, Russia 2 SPIIRAS, St.Petersburg, Russia 3 ITMO University, St.Petersburg,

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16 Federated Search Jaime Arguello INLS 509: Information Retrieval jarguell@email.unc.edu November 21, 2016 Up to this point... Classic information retrieval search from a single centralized index all ueries

More information

Computing Similarity between Cultural Heritage Items using Multimodal Features

Computing Similarity between Cultural Heritage Items using Multimodal Features Computing Similarity between Cultural Heritage Items using Multimodal Features Nikolaos Aletras and Mark Stevenson Department of Computer Science, University of Sheffield Could the combination of textual

More information

An improved PageRank algorithm for Social Network User s Influence research Peng Wang, Xue Bo*, Huamin Yang, Shuangzi Sun, Songjiang Li

An improved PageRank algorithm for Social Network User s Influence research Peng Wang, Xue Bo*, Huamin Yang, Shuangzi Sun, Songjiang Li 3rd International Conference on Mechatronics and Industrial Informatics (ICMII 2015) An improved PageRank algorithm for Social Network User s Influence research Peng Wang, Xue Bo*, Huamin Yang, Shuangzi

More information

NetMapper User Guide

NetMapper User Guide NetMapper User Guide Eric Malloy and Kathleen M. Carley March 2018 NetMapper is a tool that supports extracting networks from texts and assigning sentiment at the context level. Each text is processed

More information

LITERATURE SURVEY ON SEARCH TERM EXTRACTION TECHNIQUE FOR FACET DATA MINING IN CUSTOMER FACING WEBSITE

LITERATURE SURVEY ON SEARCH TERM EXTRACTION TECHNIQUE FOR FACET DATA MINING IN CUSTOMER FACING WEBSITE International Journal of Civil Engineering and Technology (IJCIET) Volume 8, Issue 1, January 2017, pp. 956 960 Article ID: IJCIET_08_01_113 Available online at http://www.iaeme.com/ijciet/issues.asp?jtype=ijciet&vtype=8&itype=1

More information

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Clustering Results. Result List Example. Clustering Results. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

CSE 316: SOCIAL NETWORK ANALYSIS INTRODUCTION. Fall 2017 Marion Neumann

CSE 316: SOCIAL NETWORK ANALYSIS INTRODUCTION. Fall 2017 Marion Neumann CSE 316: SOCIAL NETWORK ANALYSIS Fall 2017 Marion Neumann INTRODUCTION Contents in these slides may be subject to copyright. Some materials are adopted from: http://www.cs.cornell.edu/home /kleinber/ networks-book,

More information

Mobile Web User Behavior Modeling

Mobile Web User Behavior Modeling Mobile Web User Behavior Modeling Bozhi Yuan 1,2,BinXu 1,2,ChaoWu 1,2, and Yuanchao Ma 1,2 1 Department of Computer Science and Technology, Tsinghua University, China 2 Tsinghua National Laboratory for

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor

More information

Theme Identification in RDF Graphs

Theme Identification in RDF Graphs Theme Identification in RDF Graphs Hanane Ouksili PRiSM, Univ. Versailles St Quentin, UMR CNRS 8144, Versailles France hanane.ouksili@prism.uvsq.fr Abstract. An increasing number of RDF datasets is published

More information

Measuring Diversity of a Domain-Specic Crawl

Measuring Diversity of a Domain-Specic Crawl Measuring Diversity of a Domain-Specic Crawl Pattisapu Nikhil Priyatam, Ajay Dubey, Krish Perumal, Dharmesh Kakadia, and Vasudeva Varma Search and Information Extraction Lab, IIIT-Hyderabad, India {nikhil.pattisapu,

More information