Information Extraction Techniques in Terrorism Surveillance
Roman Tekhov

Abstract. The article gives a brief overview of information extraction and of how it might be used for counter-terrorism surveillance. It also describes a proposed way of retrieving information by means of frequent pattern mining and lists the results of the experiments conducted.

Keywords: information extraction, terrorism, frequent itemset mining

1 Introduction

Terrorism is a major threat, and it is therefore essential to develop tools that help reveal and explore information about future attacks, unknown organizations, their members and so on. Even if the information itself is available in the form of surveillance reports, news articles, posts published on the web or social network data, it is still hard to process efficiently because bits of relevant information are scattered across huge amounts of irrelevant noise. Some pieces of information may also be duplicated, which makes collecting the relevant data harder.

To allow counter-terrorism experts to process information efficiently, at least part of the routine has to be automated. If the data is extracted, filtered and classified automatically, domain experts can concentrate on analyzing what is really important.

2 Information Extraction

The field of information extraction (IE) is concerned with retrieving certain data of interest from initially unstructured natural sources, mostly text documents but also various multimedia streams (e.g. video). In the case of text documents the extraction is performed using natural language processing (NLP) tools. E.g. one might be interested in extracting the following information from text documents:

- entities of a certain type (location names, people and/or organizations involved)
- new relations between known entities, or occurrences of concrete relations of interest (connections between persons, members of organizations)
- co-references and mentions (what words like "he" or "it" really refer to, based on previously extracted named entities)

Due to the high complexity of the challenge many systems have a restricted scope, e.g. they only consider semi-structured documents or documents that are known to be concerned with some specific domain (such as a criminal news feed).

Probably the simplest form of IE is applying hand-written regular expressions. Even though this might be efficient for finding certain named entities or for filtering out data that is known to be irrelevant, the technique has very limited capabilities. More sophisticated systems might use Hidden Markov Models or machine learning techniques.

Pattern-oriented systems. This class of IE systems is concerned with determining structural patterns which help to distinguish target entities and/or relations [1] [3]. Some systems require initial seed values for bootstrapping the extraction process. Suppose that in the beginning we have a number of seed entity instances and some target relation. E.g. if the goal is to retrieve the relations between events and locations, then the seed value for events might be "party" and for locations "Tallinn". Next we search the texts for occurrences of these seed instances and retrieve the constructions that express the target relation (e.g. "party was held in Tallinn"). If we replace the seed values with placeholders we get a pattern ("X was held in Y"), and we can search the texts again for other occurrences of this pattern. Suppose we find the construction "meeting was held in Tartu". This gives us another pair of seed instances, "meeting" and "Tartu", that we might use to find new patterns, and so on.

Other systems require manually annotated training data at the beginning. Patterns are extracted from that data and then matched against new, unseen data.

3 Experiments: Frequent Pattern Based IE

The main goal of this work is to apply a pattern-based technique for entity recognition that is based on frequent itemset mining principles:

1. Choose a training data set. In this work the training set consisted of 200 articles from the Postimees news site about various accidents that took place in Estonia: traffic accidents, fires, murders, assaults etc.
2. Annotate the entities of interest by hand. This work is concerned with geographical locations: counties, towns, villages etc.
3. Use the annotated training data to create an extractor function that can retrieve entities of interest from general texts. This was accomplished by applying a frequent itemset mining approach: first we search the training data set for frequent structural patterns containing our entities of interest. Then the extractor searches for these patterns in the given texts.
4. Test the extractor function on an annotated test data set and analyze its quality. In this work a set of 25 articles (different from the training set) was used.
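
For contrast with the annotation-based approach used in this work, the seed-based bootstrapping described in Section 2 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a pattern is taken to be the literal text between the two entities, entities are single words, and the function name `bootstrap` and the example sentences are hypothetical.

```python
import re


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between inducing patterns from known pairs and applying them."""
    pairs = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Induce patterns: keep the literal text found between a known X and Y.
        for x, y in pairs:
            rx = re.compile(re.escape(x) + r"\s+(.+?)\s+" + re.escape(y))
            for sentence in sentences:
                match = rx.search(sentence)
                if match:
                    patterns.add(match.group(1))
        # Apply patterns: every "X <pattern> Y" occurrence yields a new pair.
        for pattern in patterns:
            rx = re.compile(r"(\w+)\s+" + re.escape(pattern) + r"\s+(\w+)")
            for sentence in sentences:
                for match in rx.finditer(sentence):
                    pairs.add((match.group(1), match.group(2)))
    return patterns, pairs


sentences = [
    "party was held in Tallinn",
    "meeting was held in Tartu",
    "meeting took place in Tartu",
    "concert took place in Narva",
]
patterns, pairs = bootstrap(sentences, {("party", "Tallinn")})
print(sorted(patterns))
print(sorted(pairs))
```

In the first round the seed pair yields the pattern "was held in", which in turn finds the pair ("meeting", "Tartu"). In the second round that pair yields a further pattern, which is how the process grows from a single seed.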

3.1 Preprocessing

Natural language corpora contain a lot of redundant and noisy information, which is an obstacle to efficient pattern retrieval and matching. Data preprocessing is therefore essential.

Tokenization. It makes sense to apply the divide and conquer principle and to restrict pattern searching to certain sub-structures instead of processing the entire text corpus at once. E.g. one natural way is to split the text into sentences and process them separately. In this work the texts were divided into even smaller phrases by splitting at punctuation marks (periods, commas and so on).

Noise elimination. Natural texts obtained from the web often contain formatting elements such as <span class="..."> or <p>, which need to be cleaned up before the analysis. For this work all HTML tags were removed from the text.

Normalization. Semantically equivalent things might have very different representations in natural language. For pattern recognition it can be a good idea to normalize the data by replacing such common constructions with standard representations. Consider some examples:

Character normalization. Characters might be encoded differently in different sources. E.g. the character ä might be expressed as &auml; in an HTML document. For the experiments all characters were represented using UTF-8 encoding.

Date and time normalization. Dates and times can be expressed in many ways: "6. aprillil" and "kuuendal aprillil" are both valid ways to describe the same date in Estonian, just as "kell 9.00" and "kella 9.00 ajal" are both valid ways to give the same time. For the experiments described in this work the date and time patterns were normalized as ##.## and ##:## respectively.

3.2 Annotating

In order to find frequent patterns that contain location entities, one has to annotate the training data manually, specifying which words represent location names and which do not. E.g. one could use tags:

    juhtus liiklusõnnetus <location>Põlvamaal</location>

In the course of this work the information was encoded and annotated using the JSON format:

    {"loc": false, "word": "juhtus"},
    {"loc": false, "word": "liiklusõnnetus"},
    {"loc": true, "word": "Põlvamaal"}
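
The preprocessing steps of Section 3.1 and the annotation format above can be sketched in Python roughly as follows. This is a minimal sketch, not the code used in the experiments: the function names `preprocess` and `annotate`, the regular expressions, and the rule that a time is recognized only after "kell" or "kella" are all assumptions made for illustration, and ordinal words such as "kuuendal" are not handled.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a time is a number like 9.00 or 9:00 preceded by "kell"/"kella".
TIME_RE = re.compile(r"\b(kella?)\s+\d{1,2}[.:]\d{2}\b")
# Assumption: a date is "<day>. <Estonian month name in any case form>".
MONTHS = ("jaanuar|veebruar|märts|aprill|mai|juuni|juuli|august|"
          "september|oktoober|november|detsember")
DATE_RE = re.compile(r"\b\d{1,2}\.\s*(?:" + MONTHS + r")\w*", re.IGNORECASE)
# Split at punctuation that is followed by whitespace or the end of the text,
# so that the placeholders ##.## and ##:## stay in one piece.
SPLIT_RE = re.compile(r"[.,;:!?]+(?:\s+|$)")
LOC_RE = re.compile(r"<location>(.*?)</location>")


def preprocess(text):
    """Noise elimination, character/date/time normalization, tokenization."""
    text = TAG_RE.sub(" ", text)            # remove HTML tags
    text = html.unescape(text)              # &auml; -> ä and similar
    text = TIME_RE.sub(r"\1 ##:##", text)   # normalize times
    text = DATE_RE.sub("##.##", text)       # normalize dates
    return [p.strip() for p in SPLIT_RE.split(text) if p.strip()]


def annotate(tagged_phrase):
    """Turn a phrase with <location> tags into the JSON-style word records."""
    records, pos = [], 0
    for match in LOC_RE.finditer(tagged_phrase):
        before = tagged_phrase[pos:match.start()]
        records += [{"loc": False, "word": w} for w in before.split()]
        records += [{"loc": True, "word": w} for w in match.group(1).split()]
        pos = match.end()
    records += [{"loc": False, "word": w} for w in tagged_phrase[pos:].split()]
    return records


raw = "<p>Eile kell 9.00 juhtus liiklus&otilde;nnetus P&otilde;lvamaal, kaks inimest sai viga.</p>"
print(preprocess(raw))
print(annotate("juhtus liiklusõnnetus <location>Põlvamaal</location>"))
```

The two functions are independent: `annotate` expects the hand-inserted `<location>` tags to be still present, whereas `preprocess` strips every tag it sees, so in a real pipeline the annotation records would have to be built before the tags are removed.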

3.3 Generalization with Morphological Attributes

The most straightforward way would be to treat phrases, words or characters as patterns without generalizing their structure. That approach, however, would suffer even from slight variations in the text structure. In addition, the training data set would need to be very large in order to find such exact constructs frequently.

One possible solution is to generalize the data by performing linguistic analysis on it. E.g. we can treat the text as a sequence of morphological attributes: stem, suffix, part of speech (POS), grammatical case, tense, quantity, etc. Consider again the phrase

    juhtus liiklusõnnetus Põlvamaal

After applying morphological analysis to each word we obtain the following information:

- juhtus: POS = verb, tense = past, stem = juhtu
- liiklusõnnetus: POS = noun, singular, case = nominative, stem = liiklusõnnetus
- Põlvamaal: POS = noun, proper, singular, case = adessive, stem = Põlvamaa

For the purpose of this work the morphological analyzer ESTMORF was used [6]. It represents the above sequence in the following form:

    juhtu+s //_V_ s, //
    liiklus_õnnetus+0 //_S_ sg n, //
    Põlva_maa+l //_H_ sg ad, //

For the sake of simplicity only the following attribute subset was used: stem (ignoring the fact that some words are really compounds, e.g. liiklus_õnnetus here), POS, quantity (where appropriate), tense (for verbs) and grammatical case (for other types). The last three will further be referred to as the form. So the above phrase would be treated as the following sequence:

    [ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ] [ Põlva_maa, _H_, sg, ad ]

3.4 Abstract Representation

After running the morphological analysis each word is expressed as a vector of features. It might, however, be the case that the entire word sequence is not frequent while some part of it is. E.g. a complete pattern

    [ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]

might be non-frequent, but the sequence of its partial elements

    [ _V_, s ] [ _S_, sg, n ]

might be. In other words, different abstract representations of the same initial construction might have different supports, so they have to be considered separately. For these experiments the following combinations were used:

- Complete sequence, e.g. [ liiklus_õnnetus, _S_, sg, n ]
- POS + form, e.g. [ _S_, sg, n ]
- Form only, e.g. [ sg, n ]

This means that the sequence provided above would in fact be processed as a set of multiple sequences containing different abstract representations of the initial phrase:

    [ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]
    [ _V_, s ] [ _S_, sg, n ]
    [ _V_, s ] [ sg, n ]
    ... all other combinations

3.5 Finding Frequent Itemsets

Once we have the sequences we can start looking for frequent patterns in them. Some definitions first:

- In classical frequent itemset mining the data is usually represented as a set of transactions, and each transaction is a separate set of one or more items, usually unordered. In our case each phrase is a transaction and it is ordered.
- The support of an itemset is a measure that shows how many transactions (in our case phrases) contain it. It can be absolute (the number of transactions which contain the itemset) or relative (the fraction of all transactions which contain the itemset).
- An itemset is considered frequent if its support reaches some user-defined threshold. In the current experiments a relative support of 5 percent was used, i.e. each pattern must occur in at least 5% of all phrases in order to be considered frequent.

Apriori is a well-known and relatively simple algorithm for finding frequent itemsets, and it was chosen for this work. It is based on the principle that all subsets of a frequent itemset are also frequent and, conversely, that all supersets of a non-frequent itemset are also non-frequent. The algorithm starts by finding frequent itemsets of size 1.
Then, during each iteration of the main loop, it generates candidate itemsets of size k + 1 from the previously retrieved frequent itemsets of size k and checks whether these candidates are really frequent. The algorithm terminates when no more frequent itemsets are found. E.g. suppose we have found two frequent sets of size 2:

    [ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]
    [ liiklus_õnnetus, _S_, sg, n ] [ sg, ad ]

Then the next candidate of size 3 would be

    [ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ] [ sg, ad ]

and the algorithm will check its support. If the set turns out to be frequent, we will use it to find frequent patterns of size 4, and so on.

Once all the frequent patterns are obtained we can keep only those that contain the entities of interest, i.e. locations in this case. This is easy because the annotations initially attached to the words are preserved in the morphological sequences.

3.6 Determining Significant Patterns

Not all of the obtained frequent patterns are really significant: it might be the case that we are dealing with a coincidence rather than a real rule. E.g. individual words might simply be very common, which increases their chance of being encountered in higher-level itemsets. It is therefore important to measure the interestingness of each pattern and to filter out the insignificant ones.

One way to do this is to use p-values. Suppose we have an assumption that we want to either support or reject. The default (and uninteresting) position is called the null hypothesis, and the new idea which is opposite to it is called the alternative hypothesis. We collect a sample of data and calculate some statistic based on it. The p-value is then the probability of obtaining a statistic value at least as extreme as the observed one if the null hypothesis is true. If it is lower than some user-defined threshold, we can reject the null hypothesis in favour of the alternative hypothesis.

In our case the null hypothesis states that all items are independent of each other (and the alternative states that there is a dependency). The test statistic is the support of an itemset. The p-value is the probability of the itemset having at least the same support under the assumption that the null hypothesis is true, i.e. the p-value shows how probable it would be to encounter that particular chain of items at least this often if all items were independent. If this probability is smaller than some user-defined threshold, the null hypothesis is rejected and the pattern is considered significant. [5] describes the formula for the p-value calculation:

\[
P(I) = \sum_{s=\mathrm{sup}(I)}^{n} \binom{n}{s}\, p_I^{\,s}\,(1 - p_I)^{\,n-s},
\qquad p_I = \prod_{i \in I} f_i,
\qquad f_i = \frac{\mathrm{sup}(i)}{n}
\]

where sup(I) is the support of the itemset I, sup(i) is the support of a single item i, f_i is the relative frequency of item i and n is the total number of transactions. For these experiments the traditional significance level of 0.05 was used.
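
The level-wise search of Section 3.5 and the significance test of Section 3.6 can be combined in a short Python sketch. It rests on several assumptions that are not stated in the text: a pattern is taken to be a contiguous run of tokens within a phrase, each token is a plain string standing in for one abstract word representation, and the function names and the toy data are hypothetical. The p-value follows the binomial tail formula given above.

```python
from math import comb, prod


def support(pattern, phrases):
    """Absolute support: number of phrases containing `pattern` contiguously."""
    k = len(pattern)
    return sum(
        any(tuple(ph[i:i + k]) == pattern for i in range(len(ph) - k + 1))
        for ph in phrases
    )


def frequent_patterns(phrases, min_rel_support=0.05):
    """Apriori-style level-wise search; returns {pattern: absolute support}."""
    n = len(phrases)
    level = {(tok,) for ph in phrases for tok in ph}   # candidates of size 1
    frequent = {}
    while level:
        counted = {p: support(p, phrases) for p in level}
        kept = {p: s for p, s in counted.items() if s / n >= min_rel_support}
        frequent.update(kept)
        # Join step: (a1..ak) and (a2..ak, b) give the candidate (a1..ak, b).
        level = {a + b[-1:] for a in kept for b in kept if a[1:] == b[:-1]}
    return frequent


def p_value(pattern, frequent, n):
    """P(support >= observed) if all items of the pattern were independent."""
    p_i = prod(frequent[(tok,)] / n for tok in pattern)
    observed = frequent[pattern]
    return sum(
        comb(n, s) * p_i ** s * (1 - p_i) ** (n - s)
        for s in range(observed, n + 1)
    )


# Toy data: 6 phrases "verb, noun, location" and 14 unrelated one-word phrases.
phrases = [["V", "S", "H"]] * 6 + [["A"]] * 14
frequent = frequent_patterns(phrases, min_rel_support=0.25)
significant = {
    p for p in frequent if p_value(p, frequent, len(phrases)) < 0.05
}
print(sorted(significant))
```

On this toy data every single token is frequent, but none of them is significant on its own, because a one-item pattern occurs exactly as often as its own frequency predicts. Only the multi-token patterns survive the filter, which is the behaviour the significance test is meant to provide.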

3.7 Extractor function

Once all significant frequent patterns are retrieved from the training data, we can use them to search for entities in previously unseen texts. The extractor function first preprocesses the text using the same routines that were used for the training data (noise cleaning, tokenization, normalization). Then it applies the morphological analyzer and transforms the phrases into the same format that was used for processing the training data. The last step is to match the frequent patterns against each encoded phrase, starting from the longest patterns and continuing with shorter ones until a match is found.

3.8 Testing phase

Two common measures of accuracy are precision and recall. In the context of this work they can be defined as follows.

Precision is the number of correctly extracted entities (true positives) divided by the total number of extracted entities (true positives and false positives). In the described experimental setting the precision turned out to be [value missing in the source].

Recall is the number of true positives divided by the total number of entities (true positives and false negatives). In the described experiment it was equal to [value missing in the source].

4 Conclusion

The experiments have shown that in principle it is possible to apply frequent itemset mining techniques for finding relevant entities in text with relatively high precision. Recall, on the other hand, turned out to be quite low. This can be explained by the fact that the variety of possible forms in natural language can be very high even when the input is restricted to very specific domains and sources. Perhaps a larger training sample would help to remedy this shortcoming.

Another useful technique that might be applied is introducing known false patterns, i.e. counterexamples [1]. These might increase the precision by reducing the number of false positives. E.g. highway names (like "Tallinn-Tartu-Võru") often seem to be included in phrases that are very similar to those containing point location names.
So it would be a good idea to introduce a pattern filtering routine based on known counterexamples.

References

1. Fabian M. S.: Automated Construction and Growth of a Large Ontology (2009)
2. Zamin N., Oxley A.: Information Extraction for Counter-Terrorism: A Survey on Link Analysis (2010)
3. Sun Z., Lim E., Chang K., Ong T., Gunaratna R.K.: Event-Driven Document Selection for Terrorism Information Extraction (2005)
4. Chang C., Kayed M., Girgis M.R., Shaalan K.: A Survey of Web Information Extraction Systems
5. Gallo A., Bie T., Cristianini N.: MINI: Mining Informative Non-redundant Itemsets
6. ESTMORF -
More informationNatural Language Processing. SoSe Question Answering
Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation
More informationNatural Language Processing with PoolParty
Natural Language Processing with PoolParty Table of Content Introduction to PoolParty 2 Resolving Language Problems 4 Key Features 5 Entity Extraction and Term Extraction 5 Shadow Concepts 6 Word Sense
More informationCHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING
94 CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING 5.1 INTRODUCTION Expert locator addresses the task of identifying the right person with the appropriate skills and knowledge. In large organizations, it
More informationAUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS
AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS Nilam B. Lonkar 1, Dinesh B. Hanchate 2 Student of Computer Engineering, Pune University VPKBIET, Baramati, India Computer Engineering, Pune University VPKBIET,
More informationEmpirical Analysis of Single and Multi Document Summarization using Clustering Algorithms
Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department
More informationINTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & 6367(Print), ISSN 0976 6375(Online) Volume 3, Issue 1, January- June (2012), TECHNOLOGY (IJCET) IAEME ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume
More informationLecture 23: Domain-Driven Design (Part 1)
1 Lecture 23: Domain-Driven Design (Part 1) Kenneth M. Anderson Object-Oriented Analysis and Design CSCI 6448 - Spring Semester, 2005 2 Goals for this lecture Introduce the main concepts of Domain-Driven
More informationPrivacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras
Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,
More informationError annotation in adjective noun (AN) combinations
Error annotation in adjective noun (AN) combinations This document describes the annotation scheme devised for annotating errors in AN combinations and explains how the inter-annotator agreement has been
More informationA Review on Identifying the Main Content From Web Pages
A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,
More informationMemory issues in frequent itemset mining
Memory issues in frequent itemset mining Bart Goethals HIIT Basic Research Unit Department of Computer Science P.O. Box 26, Teollisuuskatu 2 FIN-00014 University of Helsinki, Finland bart.goethals@cs.helsinki.fi
More informationRelevance Feature Discovery for Text Mining
Relevance Feature Discovery for Text Mining Laliteshwari 1,Clarish 2,Mrs.A.G.Jessy Nirmal 3 Student, Dept of Computer Science and Engineering, Agni College Of Technology, India 1,2 Asst Professor, Dept
More informationMachine Learning in GATE
Machine Learning in GATE Angus Roberts, Horacio Saggion, Genevieve Gorrell Recap Previous two days looked at knowledge engineered IE This session looks at machine learned IE Supervised learning Effort
More informationChallenge. Case Study. The fabric of space and time has collapsed. What s the big deal? Miami University of Ohio
Case Study Use Case: Recruiting Segment: Recruiting Products: Rosette Challenge CareerBuilder, the global leader in human capital solutions, operates the largest job board in the U.S. and has an extensive
More informationCS101 Introduction to Programming Languages and Compilers
CS101 Introduction to Programming Languages and Compilers In this handout we ll examine different types of programming languages and take a brief look at compilers. We ll only hit the major highlights
More informationAustralian Journal of Basic and Applied Sciences. Named Entity Recognition from Biomedical Abstracts An Information Extraction Task
ISSN:1991-8178 Australian Journal of Basic and Applied Sciences Journal home page: www.ajbasweb.com Named Entity Recognition from Biomedical Abstracts An Information Extraction Task 1 N. Kanya and 2 Dr.
More informationChapter 6 Evaluation Metrics and Evaluation
Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific
More informationNatural Language Processing Is No Free Lunch
Natural Language Processing Is No Free Lunch STEFAN WAGNER UNIVERSITY OF STUTTGART, STUTTGART, GERMANY ntroduction o Impressive progress in NLP: OS with personal assistants like Siri or Cortan o Brief
More informationForm Identifying. Figure 1 A typical HTML form
Table of Contents Form Identifying... 2 1. Introduction... 2 2. Related work... 2 3. Basic elements in an HTML from... 3 4. Logic structure of an HTML form... 4 5. Implementation of Form Identifying...
More informationA Framework for Ontology Life Cycle Management
A Framework for Ontology Life Cycle Management Perakath Benjamin, Nitin Kumar, Ronald Fernandes, and Biyan Li Knowledge Based Systems, Inc., College Station, TX, USA Abstract - This paper describes a method
More informationWeb Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web
Web Usage Mining Overview Session 1 This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web 1 Outline 1. Introduction 2. Preprocessing 3. Analysis 2 Example
More informationData Mining for Knowledge Management. Association Rules
1 Data Mining for Knowledge Management Association Rules Themis Palpanas University of Trento http://disi.unitn.eu/~themis 1 Thanks for slides to: Jiawei Han George Kollios Zhenyu Lu Osmar R. Zaïane Mohammad
More informationContent Based Key-Word Recommender
Content Based Key-Word Recommender Mona Amarnani Student, Computer Science and Engg. Shri Ramdeobaba College of Engineering and Management (SRCOEM), Nagpur, India Dr. C. S. Warnekar Former Principal,Cummins
More informationWEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE
WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE *Vidya.V.L, **Aarathy Gandhi *PG Scholar, Department of Computer Science, Mohandas College of Engineering and Technology, Anad **Assistant Professor,
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationChapter 4: Association analysis:
Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily
More informationA Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet
A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet Joerg-Uwe Kietz, Alexander Maedche, Raphael Volz Swisslife Information Systems Research Lab, Zuerich, Switzerland fkietz, volzg@swisslife.ch
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Beyond Bag of Words Bag of Words a document is considered to be an unordered collection of words with no relationships Extending
More informationImproving Suffix Tree Clustering Algorithm for Web Documents
International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal
More informationCCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems. Leigh M. Smith Humtap Inc.
CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems Leigh M. Smith Humtap Inc. leigh@humtap.com Basic system overview Segmentation (Frames, Onsets, Beats, Bars, Chord Changes, etc) Feature
More informationQuestion Answering Systems
Question Answering Systems An Introduction Potsdam, Germany, 14 July 2011 Saeedeh Momtazi Information Systems Group Outline 2 1 Introduction Outline 2 1 Introduction 2 History Outline 2 1 Introduction
More informationHebei University of Technology A Text-Mining-based Patent Analysis in Product Innovative Process
A Text-Mining-based Patent Analysis in Product Innovative Process Liang Yanhong, Tan Runhua Abstract Hebei University of Technology Patent documents contain important technical knowledge and research results.
More informationA tool for Cross-Language Pair Annotations: CLPA
A tool for Cross-Language Pair Annotations: CLPA August 28, 2006 This document describes our tool called Cross-Language Pair Annotator (CLPA) that is capable to automatically annotate cognates and false
More informationTERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES
TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.
More informationNews-Oriented Keyword Indexing with Maximum Entropy Principle.
News-Oriented Keyword Indexing with Maximum Entropy Principle. Li Sujian' Wang Houfeng' Yu Shiwen' Xin Chengsheng2 'Institute of Computational Linguistics, Peking University, 100871, Beijing, China Ilisujian,
More informationRevealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization
Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Katsuya Masuda *, Makoto Tanji **, and Hideki Mima *** Abstract This study proposes a framework to access to the
More informationInformation Retrieval. Chap 7. Text Operations
Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing
More informationMining Frequent Patterns without Candidate Generation
Mining Frequent Patterns without Candidate Generation Outline of the Presentation Outline Frequent Pattern Mining: Problem statement and an example Review of Apriori like Approaches FP Growth: Overview
More information