Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets
2016 IEEE 16th International Conference on Data Mining Workshops

Teruaki Hayashi
Department of Systems Innovation, School of Engineering, the University of Tokyo, Tokyo, Japan

Abstract
A Data Jacket (DJ) is a technique for sharing information about data and for considering the potential value of datasets while keeping the data itself hidden, by describing a summary of the data in natural language. In DJs, the variables in a dataset are described as variable labels (VLs), which are the names or meanings of the variables. In the context of data utilization and exchange, the utility of data can be discussed on the basis of VLs in order to consider combinations of data stored in different domains. However, because some DJs lack VLs, DJs that are essentially related to each other cannot be linked through string matching of VLs, which makes it difficult to think of feasible plans for data analyses and combinations. In this paper, we propose a method for inferring VLs in DJs whose VLs are missing or unknown, using the content of the outlines of DJs written in free text. Specifically, we focus on the co-occurrence of VLs in DJs. The co-occurrence of VLs is the feature that some pairs of VLs frequently appear together, e.g., "year" and "day", or "name" and "gender". By focusing on the co-occurrence of VLs and the similarity of the outlines of data in the training data of DJs, we demonstrate that our proposed method works significantly better than a method that uses only the similarity of the outlines of datasets.

Keywords
Market of Data, Data Jacket, Variable Label, metadata, Innovators Marketplace

I. INTRODUCTION
In recent years, the potential benefits of reusing and analyzing massive quantities of data have been discussed among various stakeholders from diverse domains. The discussion involves the privacy and security of data.
Acquisti and Gross reveal that the combination of public databases may cause a serious violation of privacy [1]. Xu et al. review the privacy issues related to data mining by differentiating the responsibilities of different users [2]. In practice, the cost of data management and security issues discourage private companies and individuals from opening or sharing their datasets, which hinders data utilization and exchange.

In order to overcome these problems, the Data Jacket (DJ) has been developed as a technique for sharing information about data and for considering the potential value of datasets while keeping the data itself hidden [3, 4]. The idea of a DJ is to share a summary of the data as metadata without sharing the data itself, which enables stakeholders to discuss combinations of data while reducing data management costs and privacy risks.

In communication about data utilization and combination using DJs, stakeholders start by discussing the variable labels (VLs) in the data. A VL is the name or meaning of a variable in a dataset. In DJs, each variable in the data is summarized as a VL, which is the metadata of the variable. For example, the dataset "daily weather data in March 2016 in Tokyo" includes the variable labels "year", "month", "day", "highest temperature", "lowest temperature", and "weather". Suppose that you are going to visit Tokyo in March and want to choose your clothes in view of the temperature there. If you learn that the dataset "daily weather data in March 2016 in Tokyo" includes the VLs "day", "highest temperature", and "lowest temperature", you may choose this dataset for decision making. On the other hand, if the data has only the VLs "day" and "the average temperature in Tokyo", you may not choose it, because the highest and lowest temperatures are the information you need to decide on clothes. Even if the data itself is not open, we can learn and make a decision from the summary of the data described in DJs.
Some data include private information, such as name, address, or ID, as variables. The private information itself cannot be shared, but the VLs "name", "address", or "ID" may be shared. By introducing DJs with VLs, stakeholders can learn the meaning of the variables in the data and form hypotheses about possible combinations of VLs, while reducing the risks of data management and privacy.

Workshop methods that introduce DJs have been proposed for discussing and generating feasible plans for data analyses. Once the utility of data is revealed to different stakeholders, they can negotiate the conditions for exchanging their data. In the gamified workshops Innovators Marketplace on Data Jackets (IMDJ) [4] and Action Planning (AP) [5], data owners provide DJs representing their own data, and data analysts create solutions for solving data users' problems, which are stated as requirements. In the process of IMDJ and AP, participants negotiate data exchange or buying and selling in order to create new businesses. As a result of this discussion and evaluation among participants, data owners are expected to learn how to use their own data from the possible combinations of DJs proposed by data analysts. Users are expected to learn how their requirements can be satisfied with the proposed plans.
However, DJs do not always contain VLs, because the description rules of DJs cannot force data owners to enter all the information about their data. In other words, only the information written by data owners is registered in DJs. Therefore, when descriptions of VLs are missing, DJs that are essentially related to each other may not be linked via VLs, which makes it difficult to think of plans for data analyses and combinations. In this paper, we propose a method for inferring variable labels that are not explicitly included in the outline of data.

II. INFERENCE OF VARIABLE LABELS

A. Our Approach
The purpose of this study is to infer the VLs of data whose variable labels are missing or unknown. Because the data itself is hidden, it is impossible to learn the VLs by observing the data. Therefore, we tackle the problem using the information about the data described in DJs. We assume that 1) datasets are similar when the information explaining them is similar, and 2) the more similar the datasets are, the more similar their variable labels are. That is, when the similarity between DJs is higher, the VLs included in those DJs are more similar. In this study, we introduce the outline of data (OD) as an indicator of the similarity of DJs. An OD is a description that explains the data. Given the amount of description it contains among the entry items of DJs, we consider the OD an appropriate characteristic of a dataset.

In order to infer the VLs of data whose VLs are unknown, we propose a method for obtaining a set of likely VLs from an OD whose VLs are unknown. The expected function is to obtain sets of likely VLs stored in the training data of DJs by inputting an OD $q$ as a query. To achieve this function, our models are constructed as follows.

1. The similarity of DJs from the outlines (Model 1): This model is based on the assumption that when the ODs of a pair of datasets are similar, the pair of datasets has similar VLs.
   By this model, a scored set of VLs is obtained by considering the similarity between the DJs with VLs (training data) and an OD whose VLs are unknown (a query).
2. The co-occurrence of variable labels (Model 2): This model takes into account the co-occurrence of VLs. The co-occurrence of VLs is the feature that some pairs of VLs frequently appear together, e.g., "year" and "day", or "name" and "gender".

B. Inference Process for Obtaining VLs
Based on the two models described in the previous subsection, we show the process of inferring VLs from ODs. In this study, we introduce the bag-of-words and vector space models [6, 7]. In the pre-processing steps, we conduct morphological analysis of the text of the ODs: 1) extracting words, 2) removing stop words, and 3) restoring words to their original forms.

1) Term-OD matrix
Based on Model 1, we consider an algorithm for calculating the similarity among the ODs in the training data. After the pre-processing steps are applied, the ODs are converted into a matrix representation (a Term-OD matrix). In other words, using the outlines of data as a corpus, a Term-OD matrix $A$ is obtained, consisting of $N$-dimensional term vectors as rows and $M$-dimensional OD vectors as columns, where $M$ is the number of terms and $N$ is the number of ODs. Each element $a_{ij}$ of an OD vector $\mathbf{d}_j$ corresponds to the frequency with which a term (row $i$) occurs in an OD (column $j$), as shown in (1) and (2). Note that the superscript $\mathsf{T}$ denotes transposition, and vectors are set in bold.

$$A = (\mathbf{d}_1, \ldots, \mathbf{d}_N) \tag{1}$$
$$\mathbf{d}_j = (a_{1j}, \ldots, a_{Mj})^{\mathsf{T}} \tag{2}$$

2) VL-OD matrix
In the second step, the set of VLs included in the DJs is converted into a VL-OD matrix. In the training data of DJs, ODs and VLs are linked when they appear in the same DJ. A VL-OD matrix $B$ consists of $N$-dimensional VL vectors as rows and $L$-dimensional OD vectors as columns, where $L$ is the number of VLs. Each element $b_{kj}$ of the $j$th OD vector $\mathbf{v}_j$ corresponds to the frequency (0 or 1) with which the $k$th VL occurs in the $j$th OD, as shown in (3) and (4).

$$B = (\mathbf{v}_1, \ldots, \mathbf{v}_N) \tag{3}$$
$$\mathbf{v}_j = (b_{1j}, \ldots, b_{Lj})^{\mathsf{T}} \tag{4}$$
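The first two steps can be sketched on toy data as follows. This is a minimal illustration, not the authors' implementation: the three Data Jackets, the stop-word list, and the helper names `tokenize` and `build_matrices` are all hypothetical, and whitespace tokenization stands in for the morphological analysis used in the paper.

```python
from collections import Counter

# Hypothetical toy training data: each Data Jacket has an outline (OD) and VLs.
TRAINING_DJS = [
    {"od": "daily weather data in tokyo", "vls": ["year", "day", "temperature"]},
    {"od": "population of tokyo by year", "vls": ["year", "population"]},
    {"od": "customer list of a shop", "vls": ["name", "gender"]},
]

# Assumed stop-word list; the paper's actual list is not given.
STOP_WORDS = {"in", "of", "a", "by", "the"}


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words (a stand-in for MeCab)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_matrices(djs):
    """Return (terms, vls, A, B): A is Term-OD (M x N), B is VL-OD (L x N)."""
    terms = sorted({t for dj in djs for t in tokenize(dj["od"])})
    vls = sorted({v for dj in djs for v in dj["vls"]})
    counts = [Counter(tokenize(dj["od"])) for dj in djs]
    # a_ij: frequency of term i in OD j
    A = [[counts[j][t] for j in range(len(djs))] for t in terms]
    # b_kj: 1 if VL k is linked with OD j, else 0
    B = [[1 if v in djs[j]["vls"] else 0 for j in range(len(djs))] for v in vls]
    return terms, vls, A, B


terms, vls, A, B = build_matrices(TRAINING_DJS)
```

Each row of `A` is a term and each column an OD, so `A[terms.index("tokyo")]` lists how often "tokyo" occurs in each outline; `B` has the same column layout with VLs as rows.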
3) Term-VL matrix (Model 1)
In the third step, we create a Term-VL matrix $C$ ($M \times L$) from the Term-OD matrix $A$ ($M \times N$) and the VL-OD matrix $B$ ($L \times N$) obtained in the preceding steps. The Term-VL matrix is represented as follows:

$$C = (\mathbf{c}_1, \ldots, \mathbf{c}_L) \tag{5}$$
$$\mathbf{c}_k = (c_{1k}, \ldots, c_{Mk})^{\mathsf{T}} \tag{6}$$

where the element $c_{ik}$ of the Term-VL matrix is calculated as follows:

$$c_{ik} = \sum_{j=1}^{N} a_{ij}\, b_{kj} \tag{7}$$

which is the sum, over all ODs, of the product of the frequency $a_{ij}$ with which the $i$th term occurs in the $j$th OD and the frequency $b_{kj}$ with which the $k$th VL is linked with the $j$th OD; that is, $C = A B^{\mathsf{T}}$. Through the above process, Model 1 is implemented as the Term-VL matrix $C$. With this matrix, a scored set of VLs is obtained by considering the similarity between the ODs in the matrix and a query OD $q$ whose VLs are unknown. When $q$ is given, an $M$-dimensional feature vector $\mathbf{q}$ is obtained after the pre-processing by morphological analysis. By comparing $\mathbf{q}$ with each $M$-dimensional feature vector $\mathbf{c}_k$ of a VL in the matrix $C$, a scored set of VLs is obtained.

4) VL co-occurrence matrix
We combine Model 2 with Model 1 by considering the co-occurrence of VLs. First, we assume that any pair of VLs in the same DJ co-occurs once. In order to combine it with the Term-VL matrix created in Model 1, we construct the VL co-occurrence matrix $E$ ($L \times L$), whose element $e_{kl}$ represents the number of DJs that include the pair of VLs $k$ and $l$ (8).

$$e_{kl} = \sum_{j=1}^{N} b_{kj}\, b_{lj} \tag{8}$$

5) Term-VL matrix (Model 1 and 2)
Finally, a Term-VL matrix $C'$ is generated as the product of the Term-VL matrix $C$ (5) and the VL co-occurrence matrix $E$, so that the co-occurrence of VLs is taken into account. The Term-VL matrix $C'$ consists of $L$-dimensional term vectors as rows and $M$-dimensional VL vectors as columns, which is the same structure as the Term-VL matrix $C$. The difference between $C$ and $C'$ is whether the co-occurrences of VLs (Model 2) are reflected in the elements of the matrix. The element $c'_{il}$ of the matrix $C'$ is given as follows:

$$c'_{il} = \sum_{k=1}^{L} c_{ik}\, e_{kl} \tag{9}$$

which represents a value that reflects both the similarity between ODs and queries (the function of the matrix $C$) and the co-occurrence of VLs (the function of the matrix $E$). When a query OD $q$ whose VLs are unknown is given, an $M$-dimensional feature vector $\mathbf{q}$ is obtained. By comparing $\mathbf{q}$ with each $M$-dimensional feature vector $\mathbf{c}'_l$ of a VL in the matrix $C'$, a scored set of VLs is obtained.

C. Example
Table I shows the list of the top 10 inferred VLs for the OD "This data represents the transition of the population of each year in Barcelona, Spain.", whose VLs are unknown (the experimental conditions for obtaining the inferred result are explained in detail in the following section). Moreover, this OD does not exist in the training data of DJs.
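As a rough illustration of the inference process in Section II.B, the sketch below builds the Term-VL matrix, the VL co-occurrence matrix, and their product on hand-made toy matrices, then ranks VLs for a query by cosine similarity. The matrices, the query, and the names `matmul`, `transpose`, `cosine`, and `rank_vls` are assumptions made for this example; raw counts are used here instead of the tf-idf weighting described later in the paper.

```python
import math

# Hypothetical toy data: 4 terms, 4 VLs, 3 training ODs.
TERMS = ["population", "tokyo", "weather", "year"]
VLS = ["day", "population", "temperature", "year"]
A = [[0, 1, 0],   # Term-OD matrix: frequency of each term in each OD
     [1, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
B = [[1, 0, 1],   # VL-OD matrix: 1 if the VL is linked with the OD
     [0, 1, 0],
     [1, 0, 1],
     [1, 1, 0]]


def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_vls(query_vec, term_vl, vls):
    """Score every VL column of a Term-VL matrix against the query, best first."""
    scored = [(v, cosine(query_vec, col)) for v, col in zip(vls, transpose(term_vl))]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


C = matmul(A, transpose(B))        # Model 1: term-by-VL counts
E = matmul(B, transpose(B))        # Model 2: number of DJs sharing each VL pair
C_PRIME = matmul(C, E)             # Model 1 and 2 combined

QUERY = [1, 0, 0, 1]               # bag-of-words vector for "population year"
ranking_model1 = rank_vls(QUERY, C, VLS)
ranking_combined = rank_vls(QUERY, C_PRIME, VLS)
```

In this toy setting, "day" and "temperature" share no terms with the query, so they score zero under `C`; under `C_PRIME` they receive a small positive score because they co-occur with "year" in one training DJ, which is the effect Model 2 is meant to capture.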
The inference with the matrix $C'$, which considers both the co-occurrence of VLs and the similarity of ODs, appears to yield VLs that are highly related to the OD. On the other hand, the inference with the matrix $C$, which uses only the similarity of ODs, yields some VLs that appear unrelated to the OD, e.g., "total population of" or "total population of agricultural workforce", because of the influence of highly similar training data on the agricultural population. Looking at the example result in Table I, it may be possible to infer related VLs that are likely to be included in ODs whose VLs are unknown, e.g., "the number of births" or "the number of deaths", by introducing not only the similarity of ODs but also the co-occurrence of VLs.

TABLE I. THE EXAMPLE OF INFERRED RESULT
Left: The Term-VL Matrix (considering both the co-occurrence of VLs and the similarity of ODs). Right: The Term-VL Matrix (considering only the similarity of ODs). Columns on each side: Inferred VL, Similarity.
Entries as extracted (column assignment and similarity values lost): the number of births the number of deaths total population of total population of in-migrants the number of births fatalities the number of deaths out-migrants population the number of households population (male) (male) (female) the number of full-time the number of parttime population (female) every 5 years fertilities the number of increases and decreases

III. EXPERIMENTAL DETAILS

A. Datasets
In this paper, we used 799 DJs including both ODs and VLs, which were collected from business persons, researchers, and data holders who are interested in data utilization in various domains. Each DJ consists of an OD and several VLs. There are 3,215 unique VLs in total. The corpus and the dictionary were constructed from all the words in the OD texts. We removed punctuation marks and symbols in the texts as stop words, restored words to their original forms, and extracted the nouns, verbs, adverbs, and adjectives that appear more than once. The OD corpus consists of approximately 2, unique words. We used MeCab for the morphological analysis [8], which is one of the common tools for analyzing morphemes of Japanese texts. Detailed information on the training data is shown in Table II.

For weighting the discriminative terms in DJs, we introduced the tf-idf weighting scheme [9], which is reliable in identifying distinctive terms in each DJ. The term frequency (tf) is the number of times a term appears in a document, and the inverse document frequency (idf) diminishes the weight of terms that are frequent across all the documents and increases the weight of terms that appear rarely.

As the test data, we collected 50 DJs from the Open Data of Shizuoka prefecture in Japan, which publishes governmental records on the web. We collected the DJs together with their ODs and VLs. Detailed information on the test data is shown in Table III.

TABLE II. TRAINING DATA (CORPUS) STATISTICS
Number of Data Jackets: 799
Average number of terms in each OD: 39.5
Average number of VLs in each Data Jacket: 5.34

TABLE III. TEST DATA STATISTICS
Number of Data Jackets: 50
Average number of terms in each OD: 36.7
Average number of VLs in each Data Jacket: 4.70

B. Experimental Setting
The purpose of this experiment is to evaluate the ability to infer VLs from ODs whose VLs are unknown, not only using the similarity of ODs but also considering the co-occurrence of VLs. In this experiment, we compare the performance of the Term-VL matrix $C'$, which introduces Model 1 and Model 2, with that of the Term-VL matrix $C$, which introduces only Model 1. We prepare the 50 DJs as test data and extract the ODs from them. Using these ODs as queries whose VLs are unknown, we compare the feature vector $\mathbf{q}$ of each OD with the feature vectors of the VLs contained in the Term-VL matrices $C'$ and $C$, and obtain the sets of VLs in descending order of similarity. The similarity score between a query vector $\mathbf{q}$ and a VL vector $\mathbf{c}_k$ is calculated as the cosine similarity shown in (10).

$$\mathrm{sim}(\mathbf{q}, \mathbf{c}_k) = \frac{\mathbf{q} \cdot \mathbf{c}_k}{\lVert \mathbf{q} \rVert\, \lVert \mathbf{c}_k \rVert} \tag{10}$$

For the evaluation in this experiment, we define Average Similarity (AS), which considers the relationships between ODs and VLs through their similarities (11). Here, $V_q$ denotes the set of correct VLs included in the query $q$, and $\delta(v)$ is an indicator function equal to 1 if $v$ is a correct VL, i.e., $v \in V_q$, and 0 otherwise. Calculating the AS of each query, we compare the performance of the matrix $C'$ and the matrix $C$ using a paired t-test. Although MAP (Mean Average Precision) exists for evaluating ranked inferred results [10], it evaluates the order of the results, whereas AS focuses on the similarity of the results. In a DJ, each VL is equally linked with the OD. In other words, there is no order among the VLs in a DJ. For example, the VLs "day", "month", and "weather" exist on an equal footing in the weather data. Therefore, in this experiment we evaluate the inferred results not with MAP but with AS.

IV. RESULT AND DISCUSSION
Table IV shows the evaluation values of the results. Comparing the averages of AS for evaluating the similarities of the inferred VLs, we found that the Term-VL matrix $C'$ scored higher than the matrix $C$, and that there is a significant difference between the Term-VL matrices $C'$ and $C$ (Table IV). Moreover, the similarity of the correct sets of VLs increases in 48 of the 50 test data when the matrix $C'$ is introduced. This result suggests that the inference method considering the co-occurrence of variable labels may increase the similarities of the variable labels to the outline of data.

TABLE IV. THE EVALUATION OF RESULTS (AVERAGE SCORES ± STANDARD DEVIATION)
Columns: Matrix $C'$, Matrix $C$. Rows: Mean AS, p-value (**).
**: p < 0.01, *: p < 0.05, otherwise: non-significant

V. CONCLUSION
In this paper, we proposed a method for inferring variable labels from the outline of data whose variable labels are missing or unknown. Focusing on the co-occurrence of variable labels and the similarity of the outlines of data in DJs, we constructed two models according to the features of DJs. By modeling the features of variable labels and the outlines of data, we found that even if a new DJ lacks variable labels, it is possible to infer them from the outline of the DJ.
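The evaluation procedure of Section III.B can be sketched roughly as below. The exact form of AS is not recoverable here, so `average_similarity` implements one plausible reading (the mean similarity score over the correct VLs of a query) and should be treated as an assumption. The function names and the toy AS values are hypothetical, and only the paired t statistic and its degrees of freedom are computed; looking up the p-value is left out.

```python
import math
import statistics


def average_similarity(scores, correct_vls):
    """Assumed reading of AS: mean similarity score over a query's correct VLs.

    scores maps each candidate VL to its similarity with the query;
    VLs absent from scores count as 0.0.
    """
    if not correct_vls:
        return 0.0
    return sum(scores.get(v, 0.0) for v in correct_vls) / len(correct_vls)


def paired_t_statistic(xs, ys):
    """Paired t statistic for two equal-length samples, with degrees of freedom."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical scores for one query and its correct VLs.
example_as = average_similarity(
    {"year": 0.4, "population": 0.6, "day": 0.1},
    {"year", "population"},
)

# Hypothetical per-query AS values under the combined matrix and Model 1 only.
as_combined = [0.50, 0.60, 0.70, 0.65]
as_model1 = [0.40, 0.45, 0.50, 0.60]
t_value, dof = paired_t_statistic(as_combined, as_model1)
```

A paired test fits this setting because both matrices are scored on the same queries, so the per-query differences are what carry the comparison.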
The results suggest that the similarities of the correct sets of variable labels to ODs increase when not only the similarity of outlines but also the co-occurrence of variable labels is considered. In this study, because the outlines of data are short but include a certain number of terms, it was possible to discuss and compare the similarities in the vector space model by creating the term-document matrix. However, a variable label is a very small element composed of one or several words. As mentioned in the last section, although some variable labels have similar meanings, they are described in different representations. In future work, we aim to construct a model that considers the meaning of variable labels, even if their descriptions are short. In addition, this study has been developed as a technique for supporting human decision making in data utilization and exchange. It is important to validate the performance of an application introducing our proposed method in the workshops of IMDJ or AP.

ACKNOWLEDGEMENT
This study was partially supported by JST-CREST and JSPS KAKENHI Grant Number JP16J. Also, we would like to thank all the staff members of KKE (Kozo Keikaku Engineering Inc.) for supporting our research. The present research was partially supported by the Leading Graduates Schools Program, Global Leader Program for Social Design and Management, by the Ministry of Education, Culture, Sports, Science and Technology.

REFERENCES
[1] A. Acquisti and R. Gross, Predicting social security numbers from public data, Proceedings of the National Academy of Sciences, Vol.106, No.27.
[2] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, Information Security in Big Data: Privacy and Data Mining, IEEE Access, Vol.2, IEEE.
[3] Y. Ohsawa, C. Liu, Y. Suda, and H. Kido, Innovators Marketplace on Data Jackets for Externalizing the Value of Data via Stakeholders Requirement Communication, AAAI 2014 Spring Symposium on Big data becomes personal: Knowledge into Meaning, AAAI Technical Report, pp.45-50.
[4] Y. Ohsawa, H. Kido, T. Hayashi, C. Liu, and K. Komoda, Innovators Marketplace on Data Jackets, for Valuating, Sharing, and Synthesizing Data, Knowledge-based Information Systems in Practice, Smart Innovation, Systems and Technologies, W.J. Tweedale, C.L. Jain, J. Watada, and R. Howlett (eds), Springer International Publishing, Vol.30, pp.83-97.
[5] T. Hayashi and Y. Ohsawa, Processing Combinatorial Thinking: Innovators Marketplace as Role-based Game plus Action Planning, International Journal of Knowledge and Systems Science, Vol.4, No.3, pp.14-38.
[6] G. Salton, A. Wong, and C.S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, Vol.18, No.11.
[7] P.D. Turney and P. Pantel, From Frequency to Meaning: Vector Space Models of Semantics, Journal of Artificial Intelligence Research, Vol.37.
[8] T. Kudo and Y. Matsumoto, Japanese Dependency Structure Analysis Based on Support Vector Machines, In Proc. EMNLP, pp.18-25.
[9] G. Salton and C. Buckley, Term-weighting Approaches in Automatic Text Retrieval, Information Processing and Management, Vol.24, No.5.
[10] C. Buckley and E. M. Voorhees, Evaluating Evaluation Measure Stability, In Proc. SIGIR, pp.33-40.
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:
More informationEvaluating Three Scrutability and Three Privacy User Privileges for a Scrutable User Modelling Infrastructure
Evaluating Three Scrutability and Three Privacy User Privileges for a Scrutable User Modelling Infrastructure Demetris Kyriacou, Hugh C Davis, and Thanassis Tiropanis Learning Societies Lab School of Electronics
More informationWeighted Powers Ranking Method
Weighted Powers Ranking Method Introduction The Weighted Powers Ranking Method is a method for ranking sports teams utilizing both number of teams, and strength of the schedule (i.e. how good are the teams
More informationInstructor: Stefan Savev
LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information
More informationSocial Voting Techniques: A Comparison of the Methods Used for Explicit Feedback in Recommendation Systems
Special Issue on Computer Science and Software Engineering Social Voting Techniques: A Comparison of the Methods Used for Explicit Feedback in Recommendation Systems Edward Rolando Nuñez-Valdez 1, Juan
More informationA Patent Retrieval Method Using a Hierarchy of Clusters at TUT
A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationLinking Entities in Chinese Queries to Knowledge Graph
Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn
More informationEnriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle
Enriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle Demetris Kyriacou Learning Societies Lab School of Electronics and Computer Science, University of Southampton
More informationA Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System
A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System Takashi Yukawa Nagaoka University of Technology 1603-1 Kamitomioka-cho, Nagaoka-shi Niigata, 940-2188 JAPAN
More informationISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Paper / Case Study Available online at: www.ijarcsms.com
More informationRetrieval of Highly Related Documents Containing Gene-Disease Association
Retrieval of Highly Related Documents Containing Gene-Disease Association K. Santhosh kumar 1, P. Sudhakar 2 Department of Computer Science & Engineering Annamalai University Annamalai Nagar, India. santhosh09539@gmail.com,
More informationShrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent
More informationReliability Verification of Search Engines Hit Counts: How to Select a Reliable Hit Count for a Query
Reliability Verification of Search Engines Hit Counts: How to Select a Reliable Hit Count for a Query Takuya Funahashi and Hayato Yamana Computer Science and Engineering Div., Waseda University, 3-4-1
More informationStatic Pruning of Terms In Inverted Files
In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationEfficient Mining Algorithms for Large-scale Graphs
Efficient Mining Algorithms for Large-scale Graphs Yasunari Kishimoto, Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka Abstract This article describes efficient graph mining algorithms designed
More informationClean Living: Eliminating Near-Duplicates in Lifetime Personal Storage
Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft
More informationResults and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets
Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Sheetal K. Labade Computer Engineering Dept., JSCOE, Hadapsar Pune, India Srinivasa Narasimha
More informationLarge Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao
Large Scale Chinese News Categorization --based on Improved Feature Selection Method Peng Wang Joint work with H. Zhang, B. Xu, H.W. Hao Computational-Brain Research Center Institute of Automation, Chinese
More informationHUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining
HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining Masaharu Yoshioka Graduate School of Information Science and Technology, Hokkaido University
More informationR 2 D 2 at NTCIR-4 Web Retrieval Task
R 2 D 2 at NTCIR-4 Web Retrieval Task Teruhito Kanazawa KYA group Corporation 5 29 7 Koishikawa, Bunkyo-ku, Tokyo 112 0002, Japan tkana@kyagroup.com Tomonari Masada University of Tokyo 7 3 1 Hongo, Bunkyo-ku,
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationSupervised classification of law area in the legal domain
AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms
More informationExtraction of Context Information from Web Content Using Entity Linking
18 IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.2, February 2013 Extraction of Context Information from Web Content Using Entity Linking Norifumi Hirata, Shun Shiramatsu,
More informationAvailable online at ScienceDirect. Procedia Computer Science 52 (2015 )
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 52 (2015 ) 1071 1076 The 5 th International Symposium on Frontiers in Ambient and Mobile Systems (FAMS-2015) Health, Food
More informationFeature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News
Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationWeb Service Recommendation Using Hybrid Approach
e-issn 2455 1392 Volume 2 Issue 5, May 2016 pp. 648 653 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com Web Service Using Hybrid Approach Priyanshi Barod 1, M.S.Bhamare 2, Ruhi Patankar
More informationMODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS
MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500
More informationNavigation Retrieval with Site Anchor Text
Navigation Retrieval with Site Anchor Text Hideki Kawai Kenji Tateishi Toshikazu Fukushima NEC Internet Systems Research Labs. 8916-47, Takayama-cho, Ikoma-city, Nara, JAPAN {h-kawai@ab, k-tateishi@bq,
More informationSENTIMENT ESTIMATION OF TWEETS BY LEARNING SOCIAL BOOKMARK DATA
IADIS International Journal on WWW/Internet Vol. 14, No. 1, pp. 15-27 ISSN: 1645-7641 SENTIMENT ESTIMATION OF TWEETS BY LEARNING SOCIAL BOOKMARK DATA Yasuyuki Okamura, Takayuki Yumoto, Manabu Nii and Naotake
More informationSelf-Organized Similarity based Kernel Fuzzy Clustering Model and Its Applications
Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Self-Organized Similarity based Kernel Fuzzy
More informationIncorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches
Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance
More informationAutomated Information Retrieval System Using Correlation Based Multi- Document Summarization Method
Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method Dr.K.P.Kaliyamurthie HOD, Department of CSE, Bharath University, Tamilnadu, India ABSTRACT: Automated
More informationIMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL
IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department
More informationAutomatic Discovery of Association Orders between Name and Aliases from The Web using Anchor Texts-Based Co-Occurrences
64 IJCSNS International Journal of Computer Science and Network Security, VOL.14 No.6, June 2014 Automatic Discovery of Association Orders between Name and Aliases from The Web using Anchor Texts-Based
More informationSECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA
Research Manuscript Title SECURE MULTI-KEYWORD TOP KEY RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA Dr.B.Kalaavathi, SM.Keerthana, N.Renugadevi Professor, Assistant professor, PGScholar Department of
More informationA RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH
A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationTEXT CHAPTER 5. W. Bruce Croft BACKGROUND
41 CHAPTER 5 TEXT W. Bruce Croft BACKGROUND Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia
More informationOneView. User s Guide
OneView User s Guide Welcome to OneView. This user guide will show you everything you need to know to access and utilize the wealth of information available from OneView. The OneView program is an Internet-based
More informationUnstructured Data. CS102 Winter 2019
Winter 2019 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for patterns in data
More informationQuery Answering Using Inverted Indexes
Query Answering Using Inverted Indexes Inverted Indexes Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 2 Document-at-a-time Evaluation
More informationAn Empirical Performance Comparison of Machine Learning Methods for Spam Categorization
An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University
More informationCollaborative Filtering using Euclidean Distance in Recommendation Engine
Indian Journal of Science and Technology, Vol 9(37), DOI: 10.17485/ijst/2016/v9i37/102074, October 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Collaborative Filtering using Euclidean Distance
More informationProceedings of NTCIR-9 Workshop Meeting, December 6-9, 2011, Tokyo, Japan
Read Article Management in Document Search Process for NTCIR-9 VisEx Task Yasufumi Takama Tokyo Metropolitan University 6-6 Asahigaoka, Hino Tokyo 191-0065 ytakama@sd.tmu.ac.jp Shunichi Hattori Tokyo Metropolitan
More informationAssigning Vocation-Related Information to Person Clusters for Web People Search Results
Global Congress on Intelligent Systems Assigning Vocation-Related Information to Person Clusters for Web People Search Results Hiroshi Ueda 1) Harumi Murakami 2) Shoji Tatsumi 1) 1) Graduate School of
More informationAn Automatic Reply to Customers Queries Model with Chinese Text Mining Approach
Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 15-17, 2007 71 An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach
More informationInformation Gathering Support Interface by the Overview Presentation of Web Search Results
Information Gathering Support Interface by the Overview Presentation of Web Search Results Takumi Kobayashi Kazuo Misue Buntarou Shizuki Jiro Tanaka Graduate School of Systems and Information Engineering
More informationwith Twitter, especially for tweets related to railway operational conditions, television programs and landmarks
Development of Real-time Services Offering Daily-life Information Real-time Tweet Development of Real-time Services Offering Daily-life Information The SNS provided by Twitter, Inc. is popular as a medium
More informationSpoken Document Retrieval (SDR) for Broadcast News in Indian Languages
Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Chirag Shah Dept. of CSE IIT Madras Chennai - 600036 Tamilnadu, India. chirag@speech.iitm.ernet.in A. Nayeemulla Khan Dept. of CSE
More informationEnhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationExtraction of Web Image Information: Semantic or Visual Cues?
Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus
More information