Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets


2016 IEEE 16th International Conference on Data Mining Workshops

Teruaki Hayashi
Department of Systems Innovation, School of Engineering, the University of Tokyo, Tokyo, Japan

Abstract

Data Jacket (DJ) is a technique for sharing information about data and for considering the potential value of datasets while keeping the data itself hidden, by describing a summary of the data in natural language. In DJs, the variables in datasets are described as variable labels (VLs), which are the names or meanings of the variables. In the context of data utilization and exchange, the utility of data can be discussed on the basis of VLs when considering combinations of data stored in different domains. However, because some DJs lack VLs, DJs that are essentially related to each other cannot be linked through string matching of VLs, which makes it difficult to devise feasible plans for data analysis and combination. In this paper, we propose a method for inferring VLs in DJs whose VLs are missing or unknown, using the content of the outlines of DJs written in free text. Specifically, we focus on the co-occurrence of VLs in DJs. The co-occurrence of VLs is the feature that some pairs of VLs frequently appear together, e.g., "year" and "day", or "name" and "gender". By focusing on the co-occurrence of VLs and the similarity of the outlines of data in the training data of DJs, we demonstrate that our proposed method works significantly better than a method that uses only the similarity of the outlines of datasets.

Keywords: Market of Data, Data Jacket, Variable Label, metadata, Innovators Marketplace

I. INTRODUCTION

In recent years, the potential benefits of reusing and analyzing massive quantities of data have been discussed among various stakeholders from diverse domains. The discussion involves the privacy and security of data.
Acquisti and Gross reveal that the combination of public databases may cause a serious violation of privacy [1]. Xu et al. review the privacy issues related to data mining by differentiating the responsibilities of different users [2]. A survey of the current state of data utilization shows that the cost of data management and security concerns discourage private companies and individuals from opening or sharing their datasets, which hinders data utilization and exchange.

In order to overcome these problems, the Data Jacket (DJ) has been developed as a technique for sharing information about data and for considering the potential value of datasets while keeping the data itself hidden [3, 4]. The idea of a DJ is to share a summary of data as metadata without sharing the data itself, which enables stakeholders in data utilization to discuss the combination of data while reducing data management costs and privacy risks.

In the communication about data utilization and combination using DJs, stakeholders start by discussing the variable labels (VLs) in the data. A VL is the name or meaning of a variable in a dataset. In DJs, the variables in data are summarized as VLs, which are the metadata of variables. For example, the dataset "daily weather data in March 2016 in Tokyo" includes the variable labels "year", "month", "day", "highest temperature", "lowest temperature", and "weather". Suppose that you are going to visit Tokyo in March and want to choose your clothes in view of the temperature there. If you learn that the dataset "daily weather data in March 2016 in Tokyo" includes the VLs "day", "highest temperature", and "lowest temperature", you may choose this dataset for decision making. On the other hand, if the data has only the VLs "day" and "the average temperature in Tokyo", you may not choose it, because the highest and lowest temperatures are the information you need to decide on your clothes. Even if the data itself is not open, we can learn and make a decision from the summary of data described in DJs.
Some data include private information, such as name, address, or ID, as variables. The private information itself cannot be shared, but the VLs "name", "address", or "ID" may be shared. By introducing DJs with VLs, stakeholders can learn the meaning of the variables in data and form hypotheses about possible combinations of VLs, while reducing the risks of data management and privacy.

Workshop methods introducing DJs have been proposed for discussing and generating feasible plans of data analysis. Once the utility of data has been revealed to different stakeholders, they can negotiate conditions for exchanging their data. In the gamified workshops Innovators Marketplace on Data Jackets (IMDJ) [4] and Action Planning (AP) [5], data owners provide DJs representing their own data, and data analysts create solutions for solving data users' problems, which are stated as requirements. In the process of IMDJ and AP, participants negotiate for data exchange or buying and selling to create new businesses. As a result of this discussion and evaluation among participants, data owners are expected to learn how to use their own data from the combinations of DJs proposed by data analysts. Users are expected to learn how their requirements can be satisfied by the proposed plans.

However, DJs do not always contain VLs, because the description rule of DJs cannot force data owners to enter all the information about their data. In other words, only the information written by data owners is registered in DJs. Therefore, due to the lack of description of VLs, DJs that are essentially related to each other may not be linked via VLs, which makes it difficult to devise plans for data analysis and combination. In this paper, we propose a method for inferring variable labels not explicitly included in the outline of data.

II. INFERENCE OF VARIABLE LABELS

A. Our Approach

The purpose of this study is to infer the VLs of data whose variable labels are missing or unknown. Because the data itself is hidden, it is impossible to learn the VLs by observing the data. Therefore, we tackle the problem using the information about the data described in DJs. We assume that 1) datasets are similar when the information explaining them is similar, and 2) datasets have similar variable labels when that similarity is higher. That is, the higher the similarity between DJs, the more similar the VLs included in them.

In this study, we introduce the outline of data (OD) as an indicator of the similarity of DJs. An OD is a description explaining the data. Given the amount of description they carry among the entry items of DJs, we consider ODs appropriate as the characteristic of datasets. In order to infer the VLs of data whose VLs are unknown, we propose a method to obtain a set of likely VLs from ODs whose VLs are unknown. The expected function is to obtain sets of likely VLs stored in the training data of DJs by inputting an OD ($q$) as a query. To achieve this function, our models are constructed as follows.

1. The similarity of DJs from the outlines (Model 1): This model is based on the assumption that when the ODs of a pair of datasets are similar, the pair of datasets has similar VLs. With this model, a scored set of VLs is obtained from the similarity between the DJs with VLs (training data) and an OD whose VLs are unknown (a query).
2. The co-occurrence of variable labels (Model 2): This model takes into account the co-occurrence of VLs. The co-occurrence of VLs is the feature that some pairs of VLs frequently appear together, e.g., "year" and "day", or "name" and "gender".

B. Inference Process for Obtaining VLs

Based on the two models described in the previous subsection, we show the process of inferring VLs from ODs. In this study, we introduce bag-of-words and the vector space model [6, 7]. In the pre-processing steps, we conduct morphological analysis of the text of ODs: 1) extracting words, 2) removing stop words, and 3) restoring words to their original forms.

1) Term-OD matrix

Based on Model 1, we consider an algorithm to calculate the similarity among the ODs in the training data. After the pre-processing steps, the ODs are converted into a matrix representation (a Term-OD matrix). In other words, using the outlines of data as a corpus, a Term-OD matrix ($\mathbf{A}$) is obtained, consisting of $n$-dimensional term vectors as rows and $m$-dimensional OD vectors as columns, where $m$ is the number of terms and $n$ is the number of ODs. Each element $a_{ij}$ of an OD vector ($\mathbf{d}_j$) corresponds to the frequency with which a term (row $i$) occurs in an OD (column $j$), as shown in (1) and (2). Note that the superscript $\mathsf{T}$ on the upper-right corner of a vector represents transposition, and vectors are written in bold.

$$\mathbf{A} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_n] \tag{1}$$

$$\mathbf{d}_j = (a_{1j}, a_{2j}, \ldots, a_{mj})^{\mathsf{T}} \tag{2}$$

2) VL-OD matrix

In the second step, the set of VLs included in DJs is converted into a VL-OD matrix. In the training data of DJs, an OD and a VL are linked when they appear in the same DJ. A VL-OD matrix ($\mathbf{B}$) consists of $n$-dimensional VL vectors as rows and $l$-dimensional OD vectors as columns, where $l$ is the number of VLs. Each element $b_{kj}$ of the $j$th OD vector ($\mathbf{o}_j$) corresponds to the frequency (0 or 1) with which the $k$th VL occurs in the $j$th OD, as shown in (3) and (4).

$$\mathbf{B} = [\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_n] \tag{3}$$

$$\mathbf{o}_j = (b_{1j}, b_{2j}, \ldots, b_{lj})^{\mathsf{T}} \tag{4}$$
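The first two steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the toy Data Jackets, the function names, and the matrix names `A` (Term-OD) and `B` (VL-OD) are all assumptions made for the example, and the outlines are taken to be already tokenized, stop-word filtered, and lemmatized.

```python
from collections import Counter

# Hypothetical toy training data (not from the paper). Each Data Jacket has
# a pre-processed outline (OD) and a list of variable labels (VLs).
training_djs = [
    {"od": ["population", "city", "year", "population"],
     "vls": ["year", "population"]},
    {"od": ["weather", "city", "day", "temperature"],
     "vls": ["day", "highest temperature"]},
    {"od": ["population", "birth", "death", "year"],
     "vls": ["year", "number of births", "number of deaths"]},
]


def build_index(items):
    """Map each distinct item to a row index, in order of first appearance."""
    index = {}
    for item in items:
        index.setdefault(item, len(index))
    return index


def term_od_matrix(djs, term_index):
    """Rows are terms, columns are ODs; entry = frequency of the term in the OD."""
    matrix = [[0] * len(djs) for _ in term_index]
    for j, dj in enumerate(djs):
        for term, count in Counter(dj["od"]).items():
            matrix[term_index[term]][j] = count
    return matrix


def vl_od_matrix(djs, vl_index):
    """Rows are VLs, columns are ODs; entry = 1 if the VL is linked to the OD."""
    matrix = [[0] * len(djs) for _ in vl_index]
    for j, dj in enumerate(djs):
        for vl in dj["vls"]:
            matrix[vl_index[vl]][j] = 1
    return matrix


term_index = build_index(t for dj in training_djs for t in dj["od"])
vl_index = build_index(v for dj in training_djs for v in dj["vls"])
A = term_od_matrix(training_djs, term_index)  # Term-OD matrix
B = vl_od_matrix(training_djs, vl_index)      # VL-OD matrix

print(A[term_index["population"]])  # frequency of "population" in each OD
print(B[vl_index["year"]])          # which ODs are linked to the VL "year"
```

Plain nested lists keep the sketch dependency-free; with a corpus of realistic size a sparse matrix representation would be the more natural choice.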
3) Term-VL matrix (Model 1)

In the third step, we create a Term-VL matrix $\mathbf{W}$ ($m \times l$) from the Term-OD matrix $\mathbf{A}$ (with elements $a_{ij}$) and the VL-OD matrix $\mathbf{B}$ (with elements $b_{kj}$) obtained in the previous steps, where $m$, $n$, and $l$ are the numbers of terms, ODs, and VLs. The Term-VL matrix is represented as follows:

$$\mathbf{W} = \mathbf{A}\mathbf{B}^{\mathsf{T}} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_l] \tag{5}$$

$$\mathbf{w}_k = (w_{1k}, w_{2k}, \ldots, w_{mk})^{\mathsf{T}} \tag{6}$$

where the element $w_{ik}$ of the Term-VL matrix is calculated as follows:

$$w_{ik} = \sum_{j=1}^{n} a_{ij}\, b_{kj} \tag{7}$$

which is the sum, over all ODs, of the product of the frequency ($a_{ij}$) with which the $i$th term ($t_i$) occurs in the $j$th OD ($d_j$) and the frequency ($b_{kj}$) with which the $k$th VL ($v_k$) links with the $j$th OD ($d_j$). Through the above process, Model 1 is implemented as the Term-VL matrix $\mathbf{W}$. With this matrix, a scored set of VLs is obtained from the similarity between the ODs in the matrix and a query $q$ whose VLs are unknown. When $q$ is given, an $m$-dimensional feature vector of $q$ ($\mathbf{q}$) is obtained after the pre-processing by morphological analysis. By comparing the similarity of $\mathbf{q}$ and each $m$-dimensional feature vector of a VL ($\mathbf{w}_k$) in the matrix $\mathbf{W}$, a scored set of VLs is obtained.

4) VL co-occurrence matrix

We combine Model 2 with Model 1 to consider the co-occurrence of VLs. First, we assume that any pair of VLs in the same DJ co-occurs once. In order to combine it with the Term-VL matrix created in Model 1, we construct the VL co-occurrence matrix $\mathbf{C}$ ($l \times l$), whose element $c_{kk'}$ represents the number of DJs that include the pair of VLs $v_k$ and $v_{k'}$ (8).

$$\mathbf{C} = (c_{kk'}), \qquad c_{kk'} = \left|\{\text{DJs that include both } v_k \text{ and } v_{k'}\}\right| \tag{8}$$

5) Term-VL matrix (Model 1 and 2)

Finally, a Term-VL matrix $\mathbf{W}' = \mathbf{W}\mathbf{C}$ is generated as the product of the Term-VL matrix $\mathbf{W}$ (5) and the VL co-occurrence matrix $\mathbf{C}$, considering the co-occurrence of VLs. The Term-VL matrix $\mathbf{W}'$ consists of $l$-dimensional term vectors as rows and $m$-dimensional VL vectors as columns, which is the same structure as the Term-VL matrix $\mathbf{W}$. The difference between $\mathbf{W}$ and $\mathbf{W}'$ is whether the co-occurrences of VLs (Model 2), i.e., the elements of the matrix $\mathbf{C}$, are considered. The element $w'_{ik}$ of the matrix $\mathbf{W}'$ is given as follows:

$$w'_{ik} = \sum_{k'=1}^{l} w_{ik'}\, c_{k'k} \tag{9}$$

which represents a value that reflects both the similarity of ODs and queries (the function of the matrix $\mathbf{W}$) and the co-occurrence of VLs (the function of the matrix $\mathbf{C}$). When a query $q$ whose VLs are unknown is given, an $m$-dimensional feature vector of $q$ ($\mathbf{q}$) is obtained. By comparing the similarity of $\mathbf{q}$ and each $m$-dimensional feature vector of a VL ($\mathbf{w}'_k$) in the matrix $\mathbf{W}'$, a scored set of VLs is obtained.

C. Example

Table I shows the list of the top 10 inferred VLs for the OD "This data represents the transition of the population of each year in Barcelona, Spain.", whose VLs are unknown (the experimental conditions for obtaining the inferred result are explained in detail in the following section). Moreover, the OD does not exist in the training data of DJs.
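The effect of adding the co-occurrence step can be seen on a very small hand-made case. The sketch below is an illustration under stated assumptions, not the authors' code: the three terms, four VLs, and the matrices `A` and `B` are invented; the Model 1 matrix is taken to be the product of the Term-OD matrix with the transposed VL-OD matrix; and the co-occurrence matrix is computed as `B` times its transpose, which puts each VL's own DJ count on the diagonal (the text does not say how the diagonal is defined, so that choice is an assumption).

```python
import math


def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    """Plain nested-list matrix product X @ Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Hypothetical toy data: 3 terms, 3 ODs, 4 VLs.
terms = ["population", "birth", "weather"]
vls = ["year", "population", "number of births", "day"]
A = [[1, 0, 0],   # "population" occurs only in OD 1
     [0, 1, 0],   # "birth" occurs only in OD 2
     [0, 0, 1]]   # "weather" occurs only in OD 3
B = [[1, 1, 0],   # VL "year" is linked to ODs 1 and 2
     [1, 0, 0],   # VL "population" is linked to OD 1
     [0, 1, 0],   # VL "number of births" is linked to OD 2
     [0, 0, 1]]   # VL "day" is linked to OD 3

W = matmul(A, transpose(B))        # Term-VL matrix, similarity only
C = matmul(B, transpose(B))        # VL co-occurrence counts (assumed form)
W_cooc = matmul(W, C)              # Term-VL matrix with co-occurrence


def score_vls(query_vector, term_vl):
    """Return {VL: cosine similarity between the query and the VL column}."""
    columns = transpose(term_vl)
    return {vl: cosine(query_vector, columns[k]) for k, vl in enumerate(vls)}


query = [1, 0, 0]                  # an outline that mentions only "population"
scores_m1 = score_vls(query, W)
scores_m12 = score_vls(query, W_cooc)
for vl in vls:
    print(f"{vl:18s} {scores_m1[vl]:.3f} {scores_m12[vl]:.3f}")
```

In this toy case "number of births" never shares an outline term with the query, so the similarity-only matrix scores it at zero. Because it co-occurs with "year", which does share a DJ with the query term, the combined matrix gives it a positive score, while the unrelated "day" stays at zero in both.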
The inference by the Term-VL matrix that considers both the co-occurrence of VLs and the similarity of ODs seems to yield VLs that are highly related to the OD. On the other hand, the inference by the Term-VL matrix that uses only the similarity of ODs yields some VLs that do not seem related to the OD, e.g., "total population of" or "total population of agricultural workforce", because of the influence of highly similar training data on agricultural population. Looking at the example result in Table I, it may be possible to infer related VLs that may be included in ODs whose VLs are unknown, e.g., "the number of births" or "the number of deaths", by considering not only the similarity of ODs but also the co-occurrence of VLs.

TABLE I. THE EXAMPLE OF INFERRED RESULT
Columns: Inferred VL and Similarity for the Term-VL matrix considering both the co-occurrence of VLs and the similarity of ODs; Inferred VL and Similarity for the Term-VL matrix considering only the similarity of ODs.
Inferred VLs in extraction order (similarity values and column assignment not preserved): the number of births; the number of deaths; total population of; total population of; in-migrants; the number of births; fatalities; the number of deaths; out-migrants; population; the number of households; population (male); (male); (female); the number of full-time; the number of part-time; population (female); every 5 years; fertilities; the number of increases and decreases.

III. EXPERIMENTAL DETAILS

A. Datasets

In this paper, we used 799 DJs including both ODs and VLs, which were collected from business persons, researchers, and data holders who are interested in data utilization in various domains. Each DJ consists of an OD and several VLs. There are 3,215 unique VLs in total. The corpus and the dictionary were constructed from all the words in the OD texts. We removed punctuation marks and symbols in the texts as stop words, restored words to their original forms, and extracted the nouns, verbs, adverbs, and adjectives that appear more than once. The OD corpus consists of approximately 2,... unique words. We used MeCab for the morphological analysis [8], which is one of the common tools for analyzing morphemes of Japanese texts. Detailed information on the training data is shown in Table II.

To weight the discriminative terms in DJs, we adopted the tf-idf weighting scheme [9], which is reliable in identifying the distinctive terms in each DJ. The term frequency (tf) is the number of times a term appears in a document, and the inverse document frequency (idf) diminishes the weight of terms that are frequent across all the documents and increases the weight of terms that appear rarely.

As the test data, we collected 50 DJs from the Open Data of Sizuoka prefecture in Japan, which publishes governmental records on the web. We collected DJs together with their ODs and VLs. Detailed information on the test data is shown in Table III.

TABLE II. TRAINING DATA (CORPUS) STATISTICS
Number of Data Jackets: 799
Average number of terms in each OD: 39.5
Average number of VLs in each Data Jacket: 5.34

TABLE III. TEST DATA STATISTICS
Number of Data Jackets: 50
Average number of terms in each OD: 36.7
Average number of VLs in each Data Jacket: 4.70

B. Experimental Setting

The purpose of this experiment is to evaluate the ability to infer VLs from ODs whose VLs are unknown, not only using the similarity of ODs but also considering the co-occurrence of VLs. In this experiment, we compare the performance of the Term-VL matrix introducing Model 1 and Model 2 with that of the Term-VL matrix introducing only Model 1. We prepare the 50 DJs as test data and extract the ODs from them. Using these ODs as queries whose VLs are unknown, we compare the feature vector $\mathbf{q}$ of each OD with the feature vectors $\mathbf{w}_k$ of the VLs contained in each of the two Term-VL matrices, and obtain the sets of VLs in descending order of similarity. The similarity score of $\mathbf{q}$ and $\mathbf{w}_k$ is calculated as the cosine similarity shown in (10).

$$\mathrm{sim}(\mathbf{q}, \mathbf{w}_k) = \frac{\mathbf{q} \cdot \mathbf{w}_k}{\lVert \mathbf{q} \rVert\, \lVert \mathbf{w}_k \rVert} \tag{10}$$

For the evaluation of this experiment, we define Average Similarity (AS), which considers the relationships between ODs and VLs through their similarities (11).

$$\mathrm{AS}(q) = \frac{1}{|V_q|} \sum_{k} I(v_k)\, \mathrm{sim}(\mathbf{q}, \mathbf{w}_k) \tag{11}$$

Here, $V_q$ is the set of correct VLs included in $q$, and $I(v_k)$ is an indicator function equal to 1 if $v_k$ is a correct VL, i.e., $v_k \in V_q$, and 0 otherwise. Calculating the AS of each query, we compare the performance of the two Term-VL matrices using a paired t-test. Although MAP (Mean Average Precision) exists for evaluating ranked inferred results [10], it is a method for evaluating the order of the results, whereas our measure AS focuses on the similarity of the results. In a DJ, each VL is equally linked with an OD. In other words, there is no order among the VLs in a DJ. For example, the VLs "day", "month", and "weather" exist equally in the weather data. Therefore, in this experiment we evaluate the inferred results not with MAP but with AS.

IV. RESULT AND DISCUSSION

Table IV shows the evaluation values of the results. Comparing the averages of AS for evaluating the similarities of the inferred VLs, we found that the Term-VL matrix introducing Model 1 and Model 2 scored higher than the matrix introducing only Model 1, and that the difference between the two matrices is significant (marked ** in Table IV, i.e., p < 0.01). Moreover, the similarity of the correct sets of VLs increases in 48 of the 50 test data when the matrix introducing both models is used. This result suggests that the inference method considering the co-occurrence of variable labels may increase the similarities of variable labels to the outline of data.

TABLE IV. THE EVALUATION OF RESULTS (AVERAGE SCORES ± STANDARD DEVIATION)
Columns: Matrix; Mean AS; p value (numeric values not preserved; the difference between the two matrices is marked **).
**: p < 0.01, *: p < 0.05, unmarked: non significance

V. CONCLUSION

In this paper, we proposed a method for inferring variable labels from the outline of data whose variable labels are missing or unknown. Focusing on the co-occurrence of variable labels and the similarity of the outlines of data in DJs, we constructed two models according to the features of DJs. By modeling the features of variable labels and the outlines of data, we found that even if a new DJ lacks variable labels, it is possible to infer them from the outline of the DJ.
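The evaluation procedure described in the experimental setting can be sketched as follows. This is a hedged illustration, not the authors' code: `average_similarity` implements one reading of AS, namely the mean similarity over a query's correct VLs, with correct VLs that are absent from the training data contributing zero; `paired_t_statistic` computes only the t statistic of a paired t-test, without a p-value; and all scores below are invented numbers, not results from the paper.

```python
import math
from statistics import mean, stdev


def average_similarity(scores, correct_vls):
    """Mean similarity over the correct VLs of one query.

    `scores` maps each training VL to its similarity with the query.
    A correct VL with no score (unseen in training) counts as 0.0,
    which is an assumption of this sketch.
    """
    if not correct_vls:
        return 0.0
    return mean(scores.get(vl, 0.0) for vl in correct_vls)


def paired_t_statistic(xs, ys):
    """t statistic for paired samples: mean(d) / (sd(d) / sqrt(n))."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# One hypothetical query: similarity scores and its correct VLs.
example_scores = {"year": 0.8, "day": 0.4, "name": 0.9}
example_as = average_similarity(example_scores, ["year", "day"])

# Hypothetical per-query AS values for the two Term-VL matrices.
as_similarity_only = [0.10, 0.20, 0.15, 0.05]
as_with_cooccurrence = [0.30, 0.35, 0.40, 0.20]
t_value = paired_t_statistic(as_with_cooccurrence, as_similarity_only)

print(round(example_as, 3))
print(round(t_value, 3))
```

A positive t statistic here means the matrix with co-occurrence has the higher AS on average across queries; turning it into a p-value requires the t distribution with n - 1 degrees of freedom, which a statistics library would normally supply.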
The results suggest that the similarities of the correct sets of variable labels with ODs increase when not only the similarity of outlines but also the co-occurrence of variable labels is considered.

In this study, because the outlines of data are short but include a certain number of terms, it was possible to discuss and compare the similarities in the vector space model by creating the term-document matrix. However, a variable label is a very small element composed of one or several words. Although some variable labels have similar meanings, they are described in different representations. In future work, we aim to construct a model that considers the meaning of variable labels, even if their descriptions are short. In addition, this study has been developed as a technique for supporting human decision making in data utilization and exchange. It is important to validate the performance of an application introducing our proposed method in the workshops of IMDJ or AP.

ACKNOWLEDGEMENT

This study was partially supported by JST-CREST and JSPS KAKENHI Grant Number JP16J. Also, we would like to thank all the staff members of KKE (Kozo Keikaku Engineering Inc.) for supporting our research. The present research was partially supported by the Leading Graduates Schools Program, Global Leader Program for Social Design and Management, by the Ministry of Education, Culture, Sports, Science and Technology.

REFERENCES

[1] A. Acquisti and R. Gross, "Predicting social security numbers from public data," Proceedings of the National Academy of Sciences, Vol.106, No.27.
[2] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, "Information Security in Big Data: Privacy and Data Mining," IEEE Access, Vol.2, IEEE.
[3] Y. Ohsawa, C. Liu, Y. Suda, and H. Kido, "Innovators Marketplace on Data Jackets for Externalizing the Value of Data via Stakeholders Requirement Communication," AAAI 2014 Spring Symposium on Big data becomes personal: Knowledge into Meaning, AAAI Technical Report, pp.45-50.
[4] Y. Ohsawa, H. Kido, T. Hayashi, C. Liu, and K. Komoda, "Innovators Marketplace on Data Jackets, for Valuating, Sharing, and Synthesizing Data," Knowledge-based Information Systems in Practice, Smart Innovation, Systems and Technologies, W.J. Tweedale, C.L. Jain, J. Watada, and R. Howlett (eds), Springer International Publishing, Vol.30, pp.83-97.
[5] T. Hayashi and Y. Ohsawa, "Processing Combinatorial Thinking: Innovators Marketplace as Role-based Game plus Action Planning," International Journal of Knowledge and Systems Science, Vol.4, No.3, pp.14-38.
[6] G. Salton, A. Wong, and C.S. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, Vol.18, No.11.
[7] P.D. Turney and P. Pantel, "From Frequency to Meaning: Vector Space Models of Semantics," Journal of Artificial Intelligence Research, Vol.37.
[8] T. Kudo and Y. Matsumoto, "Japanese Dependency Structure Analysis Based on Support Vector Machines," In Proc. EMNLP, pp.18-25.
[9] G. Salton and C. Buckley, "Term-weighting Approaches in Automatic Text Retrieval," Information Processing and Management, Vol.24, No.5.
[10] C. Buckley and E. M. Voorhees, "Evaluating Evaluation Measure Stability," In Proc. SIGIR, pp.33-40.


Record Linkage using Probabilistic Methods and Data Mining Techniques Doi:10.5901/mjss.2017.v8n3p203 Abstract Record Linkage using Probabilistic Methods and Data Mining Techniques Ogerta Elezaj Faculty of Economy, University of Tirana Gloria Tuxhari Faculty of Economy, University

More information

Improving Recognition through Object Sub-categorization

Improving Recognition through Object Sub-categorization Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,

More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Information Extraction based Approach for the NTCIR-10 1CLICK-2 Task

Information Extraction based Approach for the NTCIR-10 1CLICK-2 Task Information Extraction based Approach for the NTCIR-10 1CLICK-2 Task Tomohiro Manabe, Kosetsu Tsukuda, Kazutoshi Umemoto, Yoshiyuki Shoji, Makoto P. Kato, Takehiro Yamamoto, Meng Zhao, Soungwoong Yoon,

More information

Browsing Support by Hilighting Keywords based on a User s Browsing History

Browsing Support by Hilighting Keywords based on a User s Browsing History Browsing Support by Hilighting Keywords based on a User s Browsing History Yutaka Matsuo Hayato Fukuta Mitsuru Ishizuka National Institute of Advanced Industrial Science and Technology Aomi 2-41-6, Tokyo

More information

Survey Result on Privacy Preserving Techniques in Data Publishing

Survey Result on Privacy Preserving Techniques in Data Publishing Survey Result on Privacy Preserving Techniques in Data Publishing S.Deebika PG Student, Computer Science and Engineering, Vivekananda College of Engineering for Women, Namakkal India A.Sathyapriya Assistant

More information

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,

More information

Detecting Near-Duplicates in Large-Scale Short Text Databases

Detecting Near-Duplicates in Large-Scale Short Text Databases Detecting Near-Duplicates in Large-Scale Short Text Databases Caichun Gong 1,2, Yulan Huang 1,2, Xueqi Cheng 1, and Shuo Bai 1 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing,

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Knowledge Engineering in Search Engines

Knowledge Engineering in Search Engines San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:

More information

Evaluating Three Scrutability and Three Privacy User Privileges for a Scrutable User Modelling Infrastructure

Evaluating Three Scrutability and Three Privacy User Privileges for a Scrutable User Modelling Infrastructure Evaluating Three Scrutability and Three Privacy User Privileges for a Scrutable User Modelling Infrastructure Demetris Kyriacou, Hugh C Davis, and Thanassis Tiropanis Learning Societies Lab School of Electronics

More information

Weighted Powers Ranking Method

Weighted Powers Ranking Method Weighted Powers Ranking Method Introduction The Weighted Powers Ranking Method is a method for ranking sports teams utilizing both number of teams, and strength of the schedule (i.e. how good are the teams

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Social Voting Techniques: A Comparison of the Methods Used for Explicit Feedback in Recommendation Systems

Social Voting Techniques: A Comparison of the Methods Used for Explicit Feedback in Recommendation Systems Special Issue on Computer Science and Software Engineering Social Voting Techniques: A Comparison of the Methods Used for Explicit Feedback in Recommendation Systems Edward Rolando Nuñez-Valdez 1, Juan

More information

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Linking Entities in Chinese Queries to Knowledge Graph

Linking Entities in Chinese Queries to Knowledge Graph Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn

More information

Enriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle

Enriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle Enriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle Demetris Kyriacou Learning Societies Lab School of Electronics and Computer Science, University of Southampton

More information

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System Takashi Yukawa Nagaoka University of Technology 1603-1 Kamitomioka-cho, Nagaoka-shi Niigata, 940-2188 JAPAN

More information

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Paper / Case Study Available online at: www.ijarcsms.com

More information

Retrieval of Highly Related Documents Containing Gene-Disease Association

Retrieval of Highly Related Documents Containing Gene-Disease Association Retrieval of Highly Related Documents Containing Gene-Disease Association K. Santhosh kumar 1, P. Sudhakar 2 Department of Computer Science & Engineering Annamalai University Annamalai Nagar, India. santhosh09539@gmail.com,

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Reliability Verification of Search Engines Hit Counts: How to Select a Reliable Hit Count for a Query

Reliability Verification of Search Engines Hit Counts: How to Select a Reliable Hit Count for a Query Reliability Verification of Search Engines Hit Counts: How to Select a Reliable Hit Count for a Query Takuya Funahashi and Hayato Yamana Computer Science and Engineering Div., Waseda University, 3-4-1

More information

Static Pruning of Terms In Inverted Files

Static Pruning of Terms In Inverted Files In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Efficient Mining Algorithms for Large-scale Graphs

Efficient Mining Algorithms for Large-scale Graphs Efficient Mining Algorithms for Large-scale Graphs Yasunari Kishimoto, Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka Abstract This article describes efficient graph mining algorithms designed

More information

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft

More information

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Sheetal K. Labade Computer Engineering Dept., JSCOE, Hadapsar Pune, India Srinivasa Narasimha

More information

Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao

Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao Large Scale Chinese News Categorization --based on Improved Feature Selection Method Peng Wang Joint work with H. Zhang, B. Xu, H.W. Hao Computational-Brain Research Center Institute of Automation, Chinese

More information

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining Masaharu Yoshioka Graduate School of Information Science and Technology, Hokkaido University

More information

R 2 D 2 at NTCIR-4 Web Retrieval Task

R 2 D 2 at NTCIR-4 Web Retrieval Task R 2 D 2 at NTCIR-4 Web Retrieval Task Teruhito Kanazawa KYA group Corporation 5 29 7 Koishikawa, Bunkyo-ku, Tokyo 112 0002, Japan tkana@kyagroup.com Tomonari Masada University of Tokyo 7 3 1 Hongo, Bunkyo-ku,

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Supervised classification of law area in the legal domain

Supervised classification of law area in the legal domain AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

More information

Extraction of Context Information from Web Content Using Entity Linking

Extraction of Context Information from Web Content Using Entity Linking 18 IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.2, February 2013 Extraction of Context Information from Web Content Using Entity Linking Norifumi Hirata, Shun Shiramatsu,

More information

Available online at ScienceDirect. Procedia Computer Science 52 (2015 )

Available online at  ScienceDirect. Procedia Computer Science 52 (2015 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 52 (2015 ) 1071 1076 The 5 th International Symposium on Frontiers in Ambient and Mobile Systems (FAMS-2015) Health, Food

More information

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Web Service Recommendation Using Hybrid Approach

Web Service Recommendation Using Hybrid Approach e-issn 2455 1392 Volume 2 Issue 5, May 2016 pp. 648 653 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com Web Service Using Hybrid Approach Priyanshi Barod 1, M.S.Bhamare 2, Ruhi Patankar

More information

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500

More information

Navigation Retrieval with Site Anchor Text

Navigation Retrieval with Site Anchor Text Navigation Retrieval with Site Anchor Text Hideki Kawai Kenji Tateishi Toshikazu Fukushima NEC Internet Systems Research Labs. 8916-47, Takayama-cho, Ikoma-city, Nara, JAPAN {h-kawai@ab, k-tateishi@bq,

More information

SENTIMENT ESTIMATION OF TWEETS BY LEARNING SOCIAL BOOKMARK DATA

SENTIMENT ESTIMATION OF TWEETS BY LEARNING SOCIAL BOOKMARK DATA IADIS International Journal on WWW/Internet Vol. 14, No. 1, pp. 15-27 ISSN: 1645-7641 SENTIMENT ESTIMATION OF TWEETS BY LEARNING SOCIAL BOOKMARK DATA Yasuyuki Okamura, Takayuki Yumoto, Manabu Nii and Naotake

More information

Self-Organized Similarity based Kernel Fuzzy Clustering Model and Its Applications

Self-Organized Similarity based Kernel Fuzzy Clustering Model and Its Applications Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Self-Organized Similarity based Kernel Fuzzy

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method

Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method Dr.K.P.Kaliyamurthie HOD, Department of CSE, Bharath University, Tamilnadu, India ABSTRACT: Automated

More information

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department

More information

Automatic Discovery of Association Orders between Name and Aliases from The Web using Anchor Texts-Based Co-Occurrences

Automatic Discovery of Association Orders between Name and Aliases from The Web using Anchor Texts-Based Co-Occurrences 64 IJCSNS International Journal of Computer Science and Network Security, VOL.14 No.6, June 2014 Automatic Discovery of Association Orders between Name and Aliases from The Web using Anchor Texts-Based

More information

SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA

SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA Research Manuscript Title SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA Dr.B.Kalaavathi, SM.Keerthana, N.Renugadevi Professor, Assistant professor, PGScholar Department of

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

A New Technique to Optimize User s Browsing Session using Data Mining

A New Technique to Optimize User s Browsing Session using Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND 41 CHAPTER 5 TEXT W. Bruce Croft BACKGROUND Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia

More information

OneView. User s Guide

OneView. User s Guide OneView User s Guide Welcome to OneView. This user guide will show you everything you need to know to access and utilize the wealth of information available from OneView. The OneView program is an Internet-based

More information

Unstructured Data. CS102 Winter 2019

Unstructured Data. CS102 Winter 2019 Winter 2019 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for patterns in data

More information

Query Answering Using Inverted Indexes

Query Answering Using Inverted Indexes Query Answering Using Inverted Indexes Inverted Indexes Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 2 Document-at-a-time Evaluation

More information

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

An Empirical Performance Comparison of Machine Learning Methods for Spam  Categorization An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University

More information

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Collaborative Filtering using Euclidean Distance in Recommendation Engine Indian Journal of Science and Technology, Vol 9(37), DOI: 10.17485/ijst/2016/v9i37/102074, October 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Collaborative Filtering using Euclidean Distance

More information

Proceedings of NTCIR-9 Workshop Meeting, December 6-9, 2011, Tokyo, Japan

Proceedings of NTCIR-9 Workshop Meeting, December 6-9, 2011, Tokyo, Japan Read Article Management in Document Search Process for NTCIR-9 VisEx Task Yasufumi Takama Tokyo Metropolitan University 6-6 Asahigaoka, Hino Tokyo 191-0065 ytakama@sd.tmu.ac.jp Shunichi Hattori Tokyo Metropolitan

More information

Assigning Vocation-Related Information to Person Clusters for Web People Search Results

Assigning Vocation-Related Information to Person Clusters for Web People Search Results Global Congress on Intelligent Systems Assigning Vocation-Related Information to Person Clusters for Web People Search Results Hiroshi Ueda 1) Harumi Murakami 2) Shoji Tatsumi 1) 1) Graduate School of

More information

An Automatic Reply to Customers Queries Model with Chinese Text Mining Approach

An Automatic Reply to Customers  Queries Model with Chinese Text Mining Approach Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 15-17, 2007 71 An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach

More information

Information Gathering Support Interface by the Overview Presentation of Web Search Results

Information Gathering Support Interface by the Overview Presentation of Web Search Results Information Gathering Support Interface by the Overview Presentation of Web Search Results Takumi Kobayashi Kazuo Misue Buntarou Shizuki Jiro Tanaka Graduate School of Systems and Information Engineering

More information

with Twitter, especially for tweets related to railway operational conditions, television programs and landmarks

with Twitter, especially for tweets related to railway operational conditions, television programs and landmarks Development of Real-time Services Offering Daily-life Information Real-time Tweet Development of Real-time Services Offering Daily-life Information The SNS provided by Twitter, Inc. is popular as a medium

More information

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Chirag Shah Dept. of CSE IIT Madras Chennai - 600036 Tamilnadu, India. chirag@speech.iitm.ernet.in A. Nayeemulla Khan Dept. of CSE

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Extraction of Web Image Information: Semantic or Visual Cues?

Extraction of Web Image Information: Semantic or Visual Cues? Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus

More information