Text Mining for Documents Annotation and Ontology Support

Jan Paralic and Peter Bednar
Department of Cybernetics and Artificial Intelligence, Technical University of Kosice, Letná 9, Kosice, Slovakia

Abstract
This paper surveys basic concepts of text data mining and some of the methods used to elicit useful knowledge from collections of textual data. Three text data mining techniques (clustering/visualisation, association rules and classification models) are analysed, and their possible uses within the Webocracy project¹ are shown. Clustering and association rule discovery are well suited as supporting tools for ontology management. Classification models are used for automatic document annotation.

1 Introduction
Text data mining methods can be applied to obtain new, useful information (or knowledge) from a possibly large collection of textual documents. Because the concept of knowledge discovery in texts (KDT) is quite new, Section 2 describes the basic KDT process and its particular steps. The process can be divided into two main phases. In the first phase, free-form text documents are transformed into an internal or intermediate form, which is structured data suitable for text data mining, the second phase of the process. Section 3 describes internal representations of a text document collection together with the pre-processing steps needed to obtain an efficient and useful representation. Section 4 analyses text data mining approaches and the algorithms that support them: first clustering/visualisation and association rules as unsupervised approaches, then supervised approaches used for building classification models. Section 5 sketches how some of the described methods can be used within the WEBOCRAT system.
2 Knowledge Discovery in Texts
Knowledge discovery in texts (KDT), or text data mining, can be defined in the same way as knowledge discovery in databases (KDD), except that the data are textual. This is a significant difference from KDD, which takes its data from well-structured databases. KDT usually works with plain textual documents. There have also been some attempts to use partially or fully structured textual documents, such as HTML or XML documents, in order to exploit structural information in addition to the plain text.

¹ IST Webocracy: Web Technologies Supporting Direct Participation in Democratic Processes

Despite this simple adaptation of the KDD definition, there is considerable confusion about what KDT really is. For example, Marti Hearst [2] claims that it is important to distinguish between text data mining and information retrieval. The goal of information retrieval is to help users find documents that satisfy their information needs. Hearst describes information retrieval as a way to pull out the documents you are interested in and push away the others. In other words, information retrieval finds information that is already known and has been put into a document by its author. In text data mining, by contrast, a collection of documents is examined with the aim of discovering information (knowledge) that is not contained in any individual document of the collection.

Yves Kodratoff [1] distinguishes between inductive and deductive text mining. The better-known deductive text mining is called information extraction and amounts to finding instances of a predefined pattern in a set of texts. Inductive text mining, on the other hand, looks for previously unknown patterns or rules within a set of texts. In the rest of this paper, text data mining always refers to inductive text mining.

2.1 Particular steps of the KDT process
Text data mining is a much more complex task than data mining [7], because it involves text data that is inherently unstructured and fuzzy. The KDT process can be divided into two main phases:
1. Transformation of (free-form) text documents into an internal or intermediate form (the analogue of data pre-processing in the KDD process).
2. Text mining itself (A. H. Tan [7] calls it knowledge distillation), which deduces patterns or knowledge from the intermediate form.

In greater detail, the steps of the KDT approach can be compared with the steps of the KDD process [8]:
1. Understanding the application domain and the goals of the KDT process: the user must define which concepts are interesting.
2. Acquiring or selecting a target data set: texts must be gathered using information retrieval tools or manually.
3. Data cleaning, pre-processing and transformation: concepts must be described, and texts need to be analysed and stored in the internal representation form, usually after eliminating stop-words and possibly after stemming and excluding overly frequent terms.
4. Model development and hypothesis building: identifying concepts in the collection.
5. Choosing and executing suitable data mining algorithms: e.g. the application of statistical techniques (the text data mining task).
6. Result interpretation and visualisation: a human must interpret the results.

2.2 Text Data Mining Tasks
Mining the internal representation of a document collection induces patterns and relationships across documents [7]. Some examples of unsupervised text mining tasks are:

- Clustering/visualisation of documents
- Association rules

A typical example of a supervised text mining task is:
- Predictive modelling (classification models)

3 Representation of Textual Documents
For the internal representation of textual documents we can use an information retrieval model (see the formal definition in [2]). The classic models in information retrieval assume that each document is described by a set of representative keywords called index terms. An index term is simply a (document) word whose semantics helps in remembering the document's main themes [2]. Different index terms clearly have varying relevance when used to describe document contents in a particular document collection. This effect is captured by assigning a numerical weight to each index term of a document. Let $t_j$ be an index term, $d_i$ a document, and $w_{ij} \geq 0$ a weight associated with the pair $(d_i, t_j)$. This weight quantifies the importance of the index term $t_j$ for describing the semantic content of document $d_i$. Depending on how these weights are calculated and treated, there are three classic information retrieval models: the Boolean, the vector [5], and the probabilistic model [2].

3.1 Classical information retrieval models
The Boolean model is a simple retrieval model based on set theory and Boolean algebra. It considers index terms to be either present or absent in a document, so the index term weights are binary, i.e. $w_{ij} \in \{0,1\}$. A query is composed of index terms linked by three logical connectives: not, and, or. A query is thus essentially a Boolean expression with precise semantics. Because a document either satisfies the expression or does not, this model cannot recognise partial matches, which frequently leads to poor performance. Another weakness is that, by considering only the presence or absence of a term, the binary weighting scheme ignores the information inherent in term frequencies.
A related problem concerns document length. As a document gets longer, the number of distinct terms it uses will in general increase. Many of the term usages in very long documents are unrelated to the core content of the document, yet they are treated as being as significant as similar occurrences in short documents.

The vector model removes these disadvantages by assigning non-binary weights to index terms in queries and in documents. The frequency of occurrence of a term in a document (term frequency, tf) [5] [6] is a common weighting scheme and is generally used as the basis of the weighted document vector. Term frequency can be combined with a collection frequency factor, which is used to discriminate one document from another. Most schemes used for this factor, e.g. the inverse document frequency (idf), assume that the importance of a term is inversely proportional to the number of documents the term appears in. Combining the two factors gives the tfidf scheme, the most widely used weighting scheme, defined as:

$$w(i,j) = \mathrm{tfidf}(d_i, t_j) = N_{d_i,t_j} \cdot \log\frac{|C|}{N_{t_j}} \qquad (1)$$

where $N_{d_i,t_j}$ denotes the number of times the term $t_j$ occurs in the document $d_i$ (term frequency factor), $N_{t_j}$ denotes the number of documents in collection $C$ in which $t_j$ occurs at least once (document frequency of the term $t_j$), and $|C|$ denotes the number of documents in collection $C$. This weighting scheme embodies two intuitive assumptions:
- the more often a term occurs in a document, the more representative it is of the content of the document, and
- the more documents the term occurs in, the less discriminating it is.

In order to fit the weights into the interval $[0, 1]$ and to represent documents by vectors of equal length, the document vectors resulting from tfidf weighting are often normalized to length 1, so that the final normalized term weight is computed as:

$$w(i,j) = \mathrm{tfidf}_{norm}(d_i, t_j) = \frac{\mathrm{tfidf}(d_i, t_j)}{\sqrt{\sum_{t_k \in T} \mathrm{tfidf}(d_i, t_k)^2}} \qquad (2)$$

where $T$ is the set of terms used for the vector representation of document $d_i$.

The probabilistic model attempts to capture the IR problem within a probabilistic framework [2]. The index term weights are all binary, and a query is a subset of index terms. Given a user query, there is a set of documents that contains exactly the relevant documents and no others (the so-called ideal answer set). The querying process can be seen as a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess has to be made about what they could be. This initial guess makes it possible to generate a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set. Using several different measures, Salton and Buckley [6] showed that the vector space model is expected to outperform the probabilistic model on general collections.
This also seems to be the dominant view among researchers, practitioners, and the Web community, where the popularity of the vector model runs high [2].

3.2 Term selection/reduction
Documents can be described by thousands of terms, and this high dimensionality of the document space can cause efficiency problems. Terms that do not describe the content of documents introduce noise, which can degrade the performance of the resulting text mining model. For these reasons, the selection of relevant terms is a very important step in text processing. The appropriate term selection method generally depends on the text mining algorithm used: either a supervised text data mining algorithm (i.e. information about the classes of particular documents is available) or an unsupervised one (i.e. no such information is available). The main difference is that methods for supervised learning can use information about document categories, so that the relevance of a term can be determined by how well it separates documents into categories. The classification accuracy of the generated model (classifier), estimated on testing examples, can be used as a guide for finding an optimal set of terms. Note that unsupervised term selection methods can generally also be used for supervised learning.
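As an illustration of the weighting in equations (1) and (2) of Section 3.1, the following minimal sketch computes normalized tfidf vectors for a toy collection. It also produces the document frequency counts on which term selection relies. The function name `tfidf_vectors`, the sample documents and the use of the natural logarithm are assumptions made for this example; the paper does not fix a logarithm base or prescribe an implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, normalized to unit length.

    docs is a list of token lists. Weights follow equations (1) and (2);
    the natural logarithm is an assumption of this sketch.
    """
    n_docs = len(docs)                                # |C|
    term_counts = [Counter(doc) for doc in docs]      # N_{d_i,t_j}
    doc_freq = Counter()                              # N_{t_j}
    for counts in term_counts:
        doc_freq.update(counts.keys())

    vectors = []
    for counts in term_counts:
        # Equation (1): term frequency times log(|C| / document frequency).
        raw = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        # Equation (2): divide by the Euclidean length of the raw vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical toy collection of three tokenized documents.
docs = [
    ["council", "budget", "budget", "tax"],
    ["council", "school", "teacher"],
    ["council", "tax", "road"],
]
vectors = tfidf_vectors(docs)
```

In this toy collection the term "council" occurs in every document, so its idf factor is log(3/3) = 0 and it receives zero weight, which matches the intuition that a term found everywhere does not discriminate between documents.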

Of the unsupervised term selection methods, two can be mentioned:
- Document frequency threshold [7] is the simplest technique for term selection. The document frequency of every term in the training collection is computed, and terms whose document frequency is lower than a specified threshold are removed from the set of terms used for document representation.
- Over the years, alternative modelling paradigms have been proposed for each type of classic model. For the vector model, as a representative of algebraic models, a very interesting extension, latent semantic indexing (LSI), has been proposed in [13].

Of the supervised term selection methods, information gain is frequently employed. Another approach uses the χ² statistic.

4 Text Mining Methods

4.1 Clustering/visualization
The self-organizing map (SOM) [9] is very often used for clustering textual documents in vector representation. SOM is an unsupervised neural network that provides a mapping from a high-dimensional feature space onto a two-dimensional space such that similar data are mapped close to each other. This allows a very intuitive representation and analysis of clusters. A comparison of the SOM approach with a statistical one on one particular domain can be found in [12]. The comparison showed that the statistical approach was not powerful enough to deal with larger text collections and that its results were quite difficult to interpret.

Of particular interest for text mining is the combination of the basic SOM algorithm with the LabelSOM method, which automatically extracts a classification from the trained SOM [10]. This method has been used, e.g., within the SOMLib system [11]. The SOMLib Digital Library System provides methods for organizing large collections of electronic documents to allow topic-oriented browsing and orientation.

SOM provides only a flat, i.e. two-dimensional, representation of document clusters, which can be hard to interpret when the document collection is very large. Moreover, documents usually cover this representation very irregularly because of the unbalanced distribution of topics. To overcome these limitations, the Growing Hierarchical SOM (GHSOM) [11], which automatically creates a hierarchical organization of a set of documents, has been developed. It allows the network architecture to determine the topical structure of the given document repository during the training process, creating a hierarchy of self-organizing maps, each of which provides a topologically sorted representation of a topical subset. Starting from a rather small high-level SOM, which provides a coarse overview of the various topics present in the collection, subsequent layers are added where necessary to display a finer subdivision of topics. Each map in turn grows in size until it represents its topic at a sufficient level of granularity. Since not all topics are usually present equally strongly in a collection, this leads to an unbalanced hierarchy that assigns more map space to topics that are more prominent in the collection. The user can thus approach and intuitively browse a document collection in a way similar to a conventional library.
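The basic SOM training loop underlying these systems can be sketched in a few lines. This is a minimal, pure-Python illustration of a flat SOM only, not of LabelSOM, SOMLib or GHSOM. The function names, the linear decay of the learning rate and neighbourhood radius, the Gaussian neighbourhood and the toy document vectors are all assumptions of this example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid cell whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda cell: sum((wk - xk) ** 2 for wk, xk in zip(weights[cell], x)),
    )


def train_som(data, rows, cols, epochs=200, lr0=0.5, radius0=None, seed=0):
    """Train a rows x cols map on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 or max(rows, cols) / 2.0
    # One randomly initialised weight vector per grid cell.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        for x in rng.sample(data, len(data)):       # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for cell, w in weights.items():
                grid_dist2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                h = math.exp(-grid_dist2 / (2.0 * radius * radius))
                # Pull the cell towards x, more strongly the nearer it is to the BMU.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


# Hypothetical document vectors: two documents per topic, four terms.
docs = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
som = train_som(docs, rows=2, cols=2)
cells = [best_matching_unit(som, d) for d in docs]
```

After training, `cells` holds the grid position assigned to each document. Documents on similar topics are expected to land in the same or neighbouring cells, although the exact layout depends on the random initialisation.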

The resulting hierarchy of maps is presented through a web interface. Each map is represented by an HTML page with links to lower-level (expanded) maps or, at the lowest level, to the particular documents associated with a cell (see Figure 1).

Figure 1: A partial view of the 2nd-layer map for a text collection from the Austrian newsletter Standard

4.2 Association Rules
This text data mining task discovers associations between concepts and expresses them as rules of the form B ⇒ H [support, confidence], where B as well as H may be a set of concepts or a single concept [8]. The rule means that if B is present in a text, then H is present with a certain confidence and a certain support. Following the usual definition (e.g. in [4]), confidence is the proportion of texts containing both B and H among the texts containing B, and support is the proportion of texts containing both B and H among all texts in the collection. Such rules allow the presence of one or more concepts to be predicted from the presence of others. Moreover, more complex rules may be discovered when combinations of concepts and/or words are allowed, e.g. WORD_1 AND WORD_2 AND CONCEPT_1 AND CONCEPT_2 ⇒ CONCEPT_3. This kind of rule can be used to select sub-collections of text documents in which certain words are present.
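The two measures can be made concrete with a small sketch. Each document is represented as a set of concepts or words, and the function name `support_confidence` and the sample data are hypothetical.

```python
def support_confidence(docs, body, head):
    """Support and confidence of the rule body => head over a list of item sets."""
    body, head = set(body), set(head)
    n_body = sum(1 for d in docs if body <= d)            # texts containing B
    n_both = sum(1 for d in docs if (body | head) <= d)   # texts containing B and H
    support = n_both / len(docs)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


# Hypothetical collection: each document is the set of concepts found in it.
docs = [
    {"budget", "tax", "council"},
    {"budget", "tax"},
    {"budget", "school"},
    {"school", "teacher"},
]
support, confidence = support_confidence(docs, {"budget"}, {"tax"})
```

For the rule budget ⇒ tax, two of the four documents contain both items, so the support is 0.5. Three documents contain "budget", so the confidence is 2/3.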

The most commonly used approach to mining associations is the Apriori algorithm [4], which runs in two steps:
1. Find all frequent itemsets (i.e. tuples of concepts or terms). Each of these itemsets must occur at least as frequently as a pre-defined minimum support count.
2. Generate strong association rules from the frequent itemsets. These rules must satisfy pre-defined minimum support and minimum confidence.

The name of the Apriori algorithm reflects the fact that it uses prior knowledge of frequent itemset properties. It employs an iterative approach known as level-wise search, in which k-itemsets are used to explore (k+1)-itemsets, starting with k=1 and continuing until no more frequent k-itemsets can be found. Each new k requires one full scan of the database. To improve efficiency, the algorithm applies the anti-monotone property: all subsets of a frequent itemset must themselves be frequent, so any candidate with an infrequent subset can be discarded. This reduces the number of candidate (k+1)-itemsets generated from the frequent itemsets already identified. Even so, the Apriori algorithm does not scale well to large databases, and document collections are such a case. An interesting method that generates the set of frequent itemsets without candidate generation is frequent-pattern growth, or simply FP-growth [4], which adopts a divide-and-conquer strategy.

4.3 Classification Models
The term classification can cover any procedure in which some decision or forecast is made on the basis of currently available information. In the context of data mining a more restrictive interpretation is used. First, we may aim to establish the existence of classes (or clusters) in the data from a set of observations.
Or we may know for certain that there are some classes, and the aim is to establish a rule by which a new observation can be classified into one (or more) of the existing classes. The first case is known as unsupervised learning (clustering). In this chapter we use the term classification for the second case, supervised learning.

Rule induction algorithms. Any form of inference in which the premises do not deductively imply the conclusions can be thought of as induction. Text categorisation involves one special form of inductive inference: supervised concept learning, or classification learning. To learn a concept means to infer its general definition from a number of specific examples of it. A concept can be formally considered as a function from the set of all examples (in our case the set of all documents) to the Boolean set {true, false} or, equivalently, to the set {document is assigned to category; document is not assigned to category}. For text categorization, a concept is equivalent to a category, so the problem is to find a definition of this category from a number of documents. Rule induction algorithms are characterized by representing the target category as a set of "if-then" rules. These rules have the form "if <complex> then predict <category>", where <complex> is a conjunction of attribute tests (selectors). Rule induction algorithms have several advantages; the most noteworthy is that, of all representations currently used in concept learning, these rules are the easiest for humans to understand.
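The rule form "if <complex> then predict <category>" can be illustrated with a small sketch in which each selector tests the presence or absence of a term in a document. The `Rule` class, the `classify` function and the two hand-written rules are hypothetical. The sketch only shows how such a rule set is represented and applied, not how an induction algorithm learns it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """if <complex> then predict <category>, with binary term selectors."""
    present: frozenset   # terms that must occur in the document
    absent: frozenset    # terms that must not occur in the document
    category: str

    def covers(self, terms):
        # The complex is a conjunction: every selector has to hold.
        return self.present <= terms and not (self.absent & terms)


def classify(rules, terms, default="other"):
    """Apply an ordered rule list; the first covering rule decides."""
    terms = set(terms)
    for rule in rules:
        if rule.covers(terms):
            return rule.category
    return default


# Hand-written rules for illustration only (not the output of a learner).
rules = [
    Rule(frozenset({"tax", "budget"}), frozenset(), "finance"),
    Rule(frozenset({"school"}), frozenset({"budget"}), "education"),
]
label = classify(rules, ["school", "teacher", "holiday"])
```

Treating the rules as an ordered list with a default category is one common convention; an unordered rule set with conflict resolution would be an equally valid design.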

Examples of rule induction algorithms are CN2 [15] (see Figure 2 for a visualisation of the results in KDD Package* [12]), REP, IREP [16] and RIPPER [17]. Another possibility is to mine decision trees, e.g. using Quinlan's algorithm C4.5.

Figure 2: Example of a visual representation of decision rules produced by the CN2 implementation within the KDD package.

* We have developed KDD Package within the EU funded INCO Copernicus project No GOAL - Geographic Information On-Line Analysis (GIS Data Warehouse Integration)

5 Exploitation of KDT in Webocracy
The Webocracy project responds to an urgent need for efficient systems that provide effective, secure and user-friendly tools, working methods, and support mechanisms for the exchange of information between citizens and administrations. The project addresses the problem of providing new types of communication flows and services from public institutions to citizens, and improves citizens' access to public administration services and information. The new types of services will increase the efficiency, transparency and accountability of public administration institutions and their policies toward citizens.

Within this project the WEBOCRAT system is being designed and developed. It is a Web-based system comprising Web publishing, computer-mediated discussion, virtual communities, discussion forums, organizational memories, text data mining, and knowledge modeling. The WEBOCRAT system will support communication and discussion, publication of documents on the Internet, browsing and navigation, opinion polling on questions of public interest, intelligent retrieval, analytical tools, alerting services, and convenient access to information based on individual needs.

5.1 Clustering/visualisation
We think that clustering/visualisation does not fit the functionality of the WEBOCRAT system as defined in [3], because documents in the WEBOCRAT system are organized primarily by their links to the knowledge model, so it is primarily the knowledge model that is used for document retrieval and topic-oriented browsing.

On the other hand, techniques like GHSOM, with a hierarchical structure tailored to the actual text collection, could be useful as a supporting tool in the initial phase, when the knowledge model of a local authority is being constructed. This holds when the local authority has a representative set of text documents available in electronic form for this purpose. It is assumed that these documents will later be published using the WEBOCRAT system and linked to the knowledge model for intelligent retrieval.

Users must be aware, however, that GHSOM does not produce an ontology. It produces only a hierarchical structure in which documents about similar topics should be topologically close to each other and documents about different topics should be topologically far from each other. Each node in this hierarchical structure is labelled by the (stemmed) terms that occur most often in the cluster of documents represented by the node. This list of terms can give some idea of the concept(s) that could be represented in the knowledge model being designed. Finally, the particular documents form the leaf nodes of the hierarchical structure. In our opinion it is necessary to look carefully through the whole structure, including the particular documents, in order to draw reasonable conclusions about the concepts proposed for the knowledge model and the relations among them.

5.2 Association rules
Association rules as presented in the previous section have been used, e.g., in [8] in two different experiments.
The first one analysed political newspaper articles from two different periods in order to find out what they were saying about the mayor of a big Brazilian city (the first period was before a corruption scandal and the second after it). The second experiment analysed articles about various text mining tools as a form of competitive intelligence analysis. In both experiments, concepts were first defined by different approaches. Identifying concept terms commonly requires natural language processing. In the WEBOCRAT system, the ontology provides this information instead, and we can represent each document directly as a binary vector. This vector has one element for each concept, equal to 1 if the document is linked to that concept.

Another advantage is that the WEBOCRAT system provides a hierarchy of concepts from the particular domain (the local authority's area of work). We can use it to mine associations not only at the level of leaf concepts but also at higher levels of the hierarchy, obtaining so-called multilevel association rules [4].

Association rules can be exploited, e.g., for automatic improvement of the knowledge model in the following way. When the only input attributes for the association rule mining algorithm are the concepts to which documents are linked, we are looking for frequently occurring linking patterns. These patterns can be compared with the actual ontology. When, e.g., the algorithm finds an association between concepts X, Y and Z, and the ontology contains no relation between them, we can suspect that a relation is missing. This approach is suitable mainly for documents that were not linked automatically using a pre-defined template.
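This check can be sketched by combining the level-wise search of Section 4.2 with a lookup against the relations already present in the ontology. The sketch below is an illustration under stated assumptions, not the WEBOCRAT implementation. The function names, the concept names, the restriction to pairs in `missing_relations` and the representation of the ontology as a list of concept pairs are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support count}."""
    min_count = min_support * len(transactions)
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    k = 1
    while current:
        # One scan of the collection per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates that have an
        # infrequent subset (anti-monotone property).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def missing_relations(frequent, ontology_relations):
    """Frequent concept pairs that have no relation in the ontology."""
    related = {frozenset(p) for p in ontology_relations}
    return sorted(
        tuple(sorted(s)) for s in frequent if len(s) == 2 and s not in related
    )


# Hypothetical data: the set of concepts each document is linked to.
links = [
    {"waste", "environment", "fees"},
    {"waste", "environment"},
    {"waste", "fees"},
    {"waste", "environment", "fees"},
    {"schools"},
]
frequent = frequent_itemsets(links, min_support=0.5)
suspects = missing_relations(frequent, [("waste", "environment")])
```

With a minimum support of 0.5, the pairs (waste, environment) and (waste, fees) are frequent. Only the first is already related in the toy ontology, so `suspects` contains the pair ("fees", "waste") as a candidate for a missing relation that a human editor would then review.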

5.3 Classification models
In the WEBOCRAT system, an ontology is used as the knowledge model of the domain; it is composed of the concepts occurring in this domain and the relationships between them. Information is stored in the system in the form of text documents, which are annotated with the set of concepts relevant to the document content. One strategy for document retrieval is based on concepts: the user selects concepts of interest and asks for information related to them. The decision about the relevance of a document to the user query is based on the similarity between the set of query concepts and the set of concepts with which the document is annotated. This retrieval task can be viewed as a classification task in which the decision is whether or not the document is relevant for the user. With an appropriate ontology that models the domain well, use of this knowledge model can yield better results than, e.g., retrieval based on the vector representation of documents.

Retrieval accuracy depends on the quality of document annotation. Data mining methods can be very useful for guiding the user in annotating a new document. Annotation of a new document is a classification (text categorization) task in which we need to decide which concepts (each concept represents a category) are relevant to the content of the document. The system must propose relevant concepts for a new document in real time, so an important requirement on the algorithm used is execution time efficiency. The user can add or delete links between the new document and concepts, and these changes can be integrated into the classifier immediately. This requires the ability to learn incrementally. Weighting the relevance of concepts to the new document is better than a simple binary decision: concepts can be ordered by their relevance weight, and the user can search for additional relevant concepts according to this ordering.
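These requirements (fast proposals, incremental updates and a ranked rather than binary output) can be illustrated with a deliberately simple sketch. It keeps one term-count profile per concept and ranks concepts by cosine similarity to the new document. This is a hypothetical stand-in chosen for brevity, not the classifier used in WEBOCRAT, and the class name, method names and sample data are assumptions.

```python
import math
from collections import Counter, defaultdict


class ConceptRanker:
    """Incrementally updated term profiles, one per concept."""

    def __init__(self):
        self.profiles = defaultdict(Counter)

    def add_link(self, concept, terms):
        """The user links a document (given as a token list) to a concept."""
        self.profiles[concept].update(terms)

    def remove_link(self, concept, terms):
        """The user deletes a link; the document's terms leave the profile."""
        self.profiles[concept].subtract(terms)
        # Unary plus keeps only the terms with positive counts.
        self.profiles[concept] = +self.profiles[concept]

    def rank(self, terms):
        """Return (concept, weight) pairs ordered by decreasing relevance."""
        doc = Counter(terms)
        scored = [
            (concept, self._cosine(doc, profile))
            for concept, profile in self.profiles.items()
            if profile
        ]
        return sorted(scored, key=lambda cs: (-cs[1], cs[0]))

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)   # a Counter returns 0 for missing terms
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0


# Hypothetical annotations made so far.
ranker = ConceptRanker()
ranker.add_link("waste management", ["waste", "collection", "fees"])
ranker.add_link("waste management", ["waste", "recycling"])
ranker.add_link("education", ["school", "teacher", "fees"])

proposals = ranker.rank(["waste", "fees", "schedule"])
```

Because each update touches only the profile of one concept, adding or deleting a link takes effect on the very next call to `rank`, and the user sees a list ordered by weight instead of a yes/no decision.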
Acknowledgments
This work is done within the Webocracy project, which is supported by European Commission DG INFSO under the IST program, contract No. IST; within the TEXAS Project supported by Austrian Institute for East- and Southeast Europe; and within the VEGA project 1/8131/01 "Knowledge Technologies for Information Acquisition and Retrieval" of the Scientific Grant Agency of the Ministry of Education of the Slovak Republic. The content of this publication is the sole responsibility of the authors, and in no way represents the view of the European Commission or its services.

References
[1] Kodratoff, Y. (2001) Rating the Interest of Rules Induced from Data and within Texts. In Proc. of the 12th IEEE International Conference on Database and Expert Systems Applications (DEXA 2001), Munich. (Long version of the paper, submitted for publication in Knowledge and Information Systems: An International Journal.)
[2] Baeza-Yates, R. and Ribeiro-Neto, B. (1999) Modern Information Retrieval. Addison-Wesley Longman Publishing Company.
[3] Mach, M. and Furdik, K. (2001) Webocrat system architecture and functionality. Webocracy deliverable R2.4, Technical University of Kosice, April 2001.

[4] Han, J. and Kamber, M. (2000) Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
[5] Salton, G., Wong, A. and Yang, C. (1975) A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11).
[6] Salton, G. and Buckley, C. (1988) Term Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5).
[7] Tan, A.-H. (1999) Text Mining: The state of the art and the challenges. In Proc. of the PAKDD'99 Workshop on Knowledge Discovery from Advanced Databases, Beijing.
[8] Loh, S., Wives, L. K. and Palazzo, J. (2000) Concept-Based Knowledge Discovery in Texts Extracted from the Web. SIGKDD Explorations, Vol. 2, Issue 1, July 2000.
[9] Kohonen, T. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, Springer Verlag, Berlin, Heidelberg, New York.
[10] Rauber, A. (1999) On the labeling of self-organizing maps. In Proc. of the International Joint Conference on Neural Networks, Washington, DC.
[11] Rauber, A., Dittenbach, M. and Merkl, D. (2000) Automatically Detecting and Organizing Documents into Topic Hierarchies: A Neural Network Based Approach to Bookshelf Creation and Arrangement. In Proc. of the 4th European Conference on Research and Advanced Technologies for Digital Libraries (ECDL2000), Springer LNCS 1923, Lisboa, Portugal.
[12] Rauber, A. and Paralic, J. (2000) Cluster Analysis as a First Step in the Knowledge Discovery Process. Journal of Advanced Computational Intelligence, Fuji Technology Press Ltd., Vol. 4, No. 4.
[13] Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6).
[14] Lewis, D. D. (1998) Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning, 4-15, Chemnitz.
[15] Clark, P. and Niblett, T. (1989) The CN2 induction algorithm. Machine Learning Journal, 3(4).
[16] Fürnkranz, J. and Widmer, G. (1994) Incremental Reduced Error Pruning. In Machine Learning: Proceedings of the 11th Annual Conference, New Brunswick, New Jersey.
[17] Cohen, W. W. (1995) Fast effective rule induction. In Proceedings of the 12th International Conference (ML95), Morgan Kaufmann, San Mateo, California.

About the Authors
Jan Paralic received his Master's degree in technical cybernetics in 1992 and his PhD degree in artificial intelligence in 1998 from the Technical University of Kosice. His research currently focuses mainly on knowledge discovery (from databases as well as from texts) and knowledge management.
Peter Bednar received a Master's degree in cybernetics in 2001 from the Technical University of Kosice. His research interests include text categorisation, data mining and text mining.


More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Mining High Order Decision Rules

Mining High Order Decision Rules Mining High Order Decision Rules Y.Y. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 e-mail: yyao@cs.uregina.ca Abstract. We introduce the notion of high

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

Knowledge Engineering in Search Engines

Knowledge Engineering in Search Engines San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:

More information

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA SEQUENTIAL PATTERN MINING FROM WEB LOG DATA Rajashree Shettar 1 1 Associate Professor, Department of Computer Science, R. V College of Engineering, Karnataka, India, rajashreeshettar@rvce.edu.in Abstract

More information

Content-based Management of Document Access. Control

Content-based Management of Document Access. Control Content-based Management of Document Access Control Edgar Weippl, Ismail Khalil Ibrahim Software Competence Center Hagenberg Hauptstr. 99, A-4232 Hagenberg, Austria {edgar.weippl, ismail.khalil-ibrahim}@scch.at

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Towards Rule Learning Approaches to Instance-based Ontology Matching

Towards Rule Learning Approaches to Instance-based Ontology Matching Towards Rule Learning Approaches to Instance-based Ontology Matching Frederik Janssen 1, Faraz Fallahi 2 Jan Noessner 3, and Heiko Paulheim 1 1 Knowledge Engineering Group, TU Darmstadt, Hochschulstrasse

More information

A New Technique to Optimize User s Browsing Session using Data Mining

A New Technique to Optimize User s Browsing Session using Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

An Efficient Hash-based Association Rule Mining Approach for Document Clustering

An Efficient Hash-based Association Rule Mining Approach for Document Clustering An Efficient Hash-based Association Rule Mining Approach for Document Clustering NOHA NEGM #1, PASSENT ELKAFRAWY #2, ABD-ELBADEEH SALEM * 3 # Faculty of Science, Menoufia University Shebin El-Kom, EGYPT

More information

Text mining on a grid environment

Text mining on a grid environment Data Mining X 13 Text mining on a grid environment V. G. Roncero, M. C. A. Costa & N. F. F. Ebecken COPPE/Federal University of Rio de Janeiro, Brazil Abstract The enormous amount of information stored

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Structure of Association Rule Classifiers: a Review

Structure of Association Rule Classifiers: a Review Structure of Association Rule Classifiers: a Review Koen Vanhoof Benoît Depaire Transportation Research Institute (IMOB), University Hasselt 3590 Diepenbeek, Belgium koen.vanhoof@uhasselt.be benoit.depaire@uhasselt.be

More information

The use of frequent itemsets extracted from textual documents for the classification task

The use of frequent itemsets extracted from textual documents for the classification task The use of frequent itemsets extracted from textual documents for the classification task Rafael G. Rossi and Solange O. Rezende Mathematical and Computer Sciences Institute - ICMC University of São Paulo

More information

A Hierarchical Document Clustering Approach with Frequent Itemsets

A Hierarchical Document Clustering Approach with Frequent Itemsets A Hierarchical Document Clustering Approach with Frequent Itemsets Cheng-Jhe Lee, Chiun-Chieh Hsu, and Da-Ren Chen Abstract In order to effectively retrieve required information from the large amount of

More information

Using Decision Boundary to Analyze Classifiers

Using Decision Boundary to Analyze Classifiers Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision

More information

Optimization using Ant Colony Algorithm

Optimization using Ant Colony Algorithm Optimization using Ant Colony Algorithm Er. Priya Batta 1, Er. Geetika Sharmai 2, Er. Deepshikha 3 1Faculty, Department of Computer Science, Chandigarh University,Gharaun,Mohali,Punjab 2Faculty, Department

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Challenges and Interesting Research Directions in Associative Classification

Challenges and Interesting Research Directions in Associative Classification Challenges and Interesting Research Directions in Associative Classification Fadi Thabtah Department of Management Information Systems Philadelphia University Amman, Jordan Email: FFayez@philadelphia.edu.jo

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

Correlation Based Feature Selection with Irrelevant Feature Removal

Correlation Based Feature Selection with Irrelevant Feature Removal Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Clustering of Data with Mixed Attributes based on Unified Similarity Metric Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1

More information

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá INTRODUCTION TO DATA MINING Daniel Rodríguez, University of Alcalá Outline Knowledge Discovery in Datasets Model Representation Types of models Supervised Unsupervised Evaluation (Acknowledgement: Jesús

More information

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials *

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Galina Bogdanova, Tsvetanka Georgieva Abstract: Association rules mining is one kind of data mining techniques

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

An Automatic Reply to Customers Queries Model with Chinese Text Mining Approach

An Automatic Reply to Customers  Queries Model with Chinese Text Mining Approach Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 15-17, 2007 71 An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach

More information

Fig 1. Overview of IE-based text mining framework

Fig 1. Overview of IE-based text mining framework DiscoTEX: A framework of Combining IE and KDD for Text Mining Ritesh Kumar Research Scholar, Singhania University, Pacheri Beri, Rajsthan riteshchandel@gmail.com Abstract: Text mining based on the integration

More information

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 02, February -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Survey

More information

Reading group on Ontologies and NLP:

Reading group on Ontologies and NLP: Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.

More information

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Overview of Web Mining Techniques and its Application towards Web

Overview of Web Mining Techniques and its Application towards Web Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining P.Subhashini 1, Dr.G.Gunasekaran 2 Research Scholar, Dept. of Information Technology, St.Peter s University,

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Efficient SQL-Querying Method for Data Mining in Large Data Bases

Efficient SQL-Querying Method for Data Mining in Large Data Bases Efficient SQL-Querying Method for Data Mining in Large Data Bases Nguyen Hung Son Institute of Mathematics Warsaw University Banacha 2, 02095, Warsaw, Poland Abstract Data mining can be understood as a

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti Information Systems International Conference (ISICO), 2 4 December 2013 The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria

More information

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

Data Mining Technology Based on Bayesian Network Structure Applied in Learning , pp.67-71 http://dx.doi.org/10.14257/astl.2016.137.12 Data Mining Technology Based on Bayesian Network Structure Applied in Learning Chunhua Wang, Dong Han College of Information Engineering, Huanghuai

More information

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Impact of Term Weighting Schemes on Document Clustering A Review

Impact of Term Weighting Schemes on Document Clustering A Review Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Ontology-Based Web Query Classification for Research Paper Searching

Ontology-Based Web Query Classification for Research Paper Searching Ontology-Based Web Query Classification for Research Paper Searching MyoMyo ThanNaing University of Technology(Yatanarpon Cyber City) Mandalay,Myanmar Abstract- In web search engines, the retrieval of

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Data Mining Concepts

Data Mining Concepts Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

1. Inroduction to Data Mininig

1. Inroduction to Data Mininig 1. Inroduction to Data Mininig 1.1 Introduction Universe of Data Information Technology has grown in various directions in the recent years. One natural evolutionary path has been the development of the

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information

Multimodal Information Spaces for Content-based Image Retrieval

Multimodal Information Spaces for Content-based Image Retrieval Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due

More information

Knowledge Enhanced E-government Portal

Knowledge Enhanced E-government Portal Knowledge Enhanced E-government Portal Jan Paralic 1, Tomas Sabol 2, and Marian Mach 1 1 Dept. of Cybernetics and AI, Technical University of Kosice, Letna 9, 042 00 Kosice, Slovakia Jan.Paralic@tuke.sk

More information

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining D.Kavinya 1 Student, Department of CSE, K.S.Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India 1

More information

Supervised classification of law area in the legal domain

Supervised classification of law area in the legal domain AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

More information

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document BayesTH-MCRDR Algorithm for Automatic Classification of Web Document Woo-Chul Cho and Debbie Richards Department of Computing, Macquarie University, Sydney, NSW 2109, Australia {wccho, richards}@ics.mq.edu.au

More information

A Novel method for Frequent Pattern Mining

A Novel method for Frequent Pattern Mining A Novel method for Frequent Pattern Mining K.Rajeswari #1, Dr.V.Vaithiyanathan *2 # Associate Professor, PCCOE & Ph.D Research Scholar SASTRA University, Tanjore, India 1 raji.pccoe@gmail.com * Associate

More information

Automatic Modularization of ANNs Using Adaptive Critic Method

Automatic Modularization of ANNs Using Adaptive Critic Method Automatic Modularization of ANNs Using Adaptive Critic Method RUDOLF JAKŠA Kyushu Institute of Design 4-9-1 Shiobaru, Minami-ku, Fukuoka, 815-8540 JAPAN Abstract: - We propose automatic modularization

More information
