Information Extraction Techniques in Terrorism Surveillance


Roman Tekhov

Abstract. This article gives a brief overview of what information extraction is and how it can be used for counter-terrorism surveillance. It also describes a proposed way of retrieving information using frequent pattern mining and lists the results of the conducted experiments.

Keywords: information extraction, terrorism, frequent itemset mining

1 Introduction

Terrorism is a major threat, so it is essential to develop tools that help reveal and explore information concerning future attacks, unknown organizations, their members and so on. Even when the information itself is available in the form of surveillance reports, news articles, web posts or social network data, it is still hard to process efficiently because bits of relevant information are scattered across huge amounts of irrelevant noise. Some pieces of information may also be duplicated, which makes collecting relevant data even harder. To allow counter-terrorism experts to process information efficiently, at least part of the routine has to be automated. If the data is extracted, filtered and classified automatically, domain experts can concentrate on analyzing what is really important.

2 Information Extraction

The field of information extraction (IE) is concerned with retrieving certain data of interest from initially unstructured natural sources, mostly text documents but also various multimedia streams (e.g. video). For text documents the extraction is performed using natural language processing (NLP) tools. For example, one might be interested in extracting the following information from text documents:

- entities of a certain type (location names, people and/or organizations involved);
- new relations between known entities, or occurrences of concrete relations of interest (connections between people, members of organizations);
- co-references and mentions (what words like "he" or "it" really refer to, based on previously extracted named entities).

Due to the high complexity of the challenge, many systems have a restricted scope, e.g. they only consider semi-structured documents or documents that are known to be concerned with some specific domain (such as a criminal news feed). Probably the simplest form of IE is applying hand-written regular expressions. Even though this can be effective for finding certain named entities or for filtering out data that is known to be irrelevant, the technique still has very limited capabilities. More sophisticated systems use Hidden Markov Models or other machine learning techniques.

Pattern-oriented systems. This class of IE systems is concerned with determining structural patterns which help to distinguish target entities and/or relations [1] [3]. Some systems require initial seed values for bootstrapping the extraction process. Suppose that in the beginning we have a number of seed entity instances and some target relation. If the goal is to retrieve relations between events and locations, the seed value for events might be "party" and for locations "Tallinn". Next we search the texts for occurrences of these seed instances and retrieve the constructions that express the target relation (e.g. "party was held in Tallinn"). If we replace the seed values with placeholders we get a pattern ("X was held in Y") and can search the texts again for other occurrences of this pattern. Suppose we find the construction "meeting was held in Tartu"; this gives us another pair of seed instances, "meeting" and "Tartu", which we can use to find new patterns, and so on. Other systems instead require manually annotated training data: patterns are extracted from the annotations and then matched against new, unseen data.
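To make the bootstrapping loop concrete, here is a minimal sketch over a toy in-memory corpus; the corpus, the seed pair and the way templates are induced are all hypothetical illustrations, not part of any system described above.

```python
import re

# A minimal sketch of seed-driven pattern bootstrapping; all data here is a
# toy example and the single-slot template induction is an assumption.
corpus = [
    "party was held in Tallinn",
    "meeting was held in Tartu",
    "concert was held in Narva",
]

def bootstrap(seeds, corpus, rounds=2):
    patterns, pairs = set(), set(seeds)
    for _ in range(rounds):
        # 1. Induce patterns from sentences containing a known (event, location) pair.
        for event, loc in list(pairs):
            for sentence in corpus:
                if event in sentence and loc in sentence:
                    patterns.add(
                        re.escape(sentence).replace(event, r"(\w+)", 1)
                                           .replace(loc, r"(\w+)", 1))
        # 2. Apply the induced patterns to harvest new seed pairs.
        for pattern in patterns:
            for sentence in corpus:
                m = re.fullmatch(pattern, sentence)
                if m and len(m.groups()) == 2:
                    pairs.add(m.groups())
    return patterns, pairs

patterns, pairs = bootstrap({("party", "Tallinn")}, corpus)
print(pairs)  # eventually includes ("meeting", "Tartu") and ("concert", "Narva")
```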
3 Experiments: Frequent Pattern Based IE

The main goal of this work is to apply a pattern-based technique for entity recognition that is based on frequent itemset mining principles:

1. Choose a training data set. In this work the training set consisted of 200 articles from the Postimees news site about various accidents that took place in Estonia: traffic accidents, fires, murders, assaults, etc.
2. Annotate the entities of interest by hand. This work is concerned with geographical locations: counties, towns, villages, etc.
3. Use the annotated training data to create an extractor function that can retrieve entities of interest from general texts. This was accomplished by applying a frequent itemset mining approach: first we search the training data set for frequent structural patterns containing the entities of interest, then the extractor searches for these patterns in the given texts.
4. Test the extractor function on an annotated test data set and analyze the quality of the extractor function. In this work a set of 25 articles (disjoint from the training set) was used.
3.1 Preprocessing

Natural language corpora contain a lot of redundant and noisy information, which is an obstacle to efficient pattern retrieval and matching. Therefore data preprocessing is essential.

Tokenization. It makes sense to apply the divide and conquer principle and restrict pattern searching to certain sub-structures instead of processing the entire text corpus at once. One natural way is to split the text into sentences and process them separately. In this work the texts were divided into even smaller units by splitting on punctuation marks (periods, commas and so on).

Noise elimination. Natural texts obtained from the web often contain formatting elements such as <span> or <p> tags, which need to be cleaned up before the analysis can be performed. For this work all HTML tags were removed from the text.

Normalization. Semantically equivalent things might have very different representations in natural language. For pattern recognition it is a good idea to normalize the data by replacing common constructions with standard representations. Consider some examples:

Character normalization. Characters might be encoded differently in different sources; e.g. the character ä might be expressed as the entity &auml; in an HTML document. For the experiments all characters were represented using UTF-8 encoding.

Date and time normalization. Dates and times can be expressed very differently: "06.04.", "6. aprillil" and "kuuendal aprillil" are all valid ways to describe the same date in Estonian, just as "kell 9.00", "kella 9.00 ajal" and "09.00" are all valid times. For the experiments described in this work, date and time patterns were normalized to ##.## and ##:## respectively.
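As a rough illustration, the preprocessing steps above might be combined as in the following sketch; the regular expressions are assumptions made for this toy example, not the exact rules used in the experiments.

```python
import re

# A minimal sketch of the preprocessing pipeline from Section 3.1:
# noise elimination, date/time normalization and punctuation tokenization.
def preprocess(html_text):
    text = re.sub(r"<[^>]+>", " ", html_text)                    # drop HTML tags
    text = re.sub(r"\b\d{1,2}\.\d{1,2}\.(?!\d)", "##.##", text)  # dates like "06.04."
    text = re.sub(r"\b\d{1,2}[.:]\d{2}\b", "##:##", text)        # times like "9.00"
    # tokenize: split into small phrase units on punctuation between words
    return [p.strip() for p in re.split(r"[.,;:!?](?:\s|$)", text) if p.strip()]

print(preprocess("<p>Eile kell 9.00 juhtus liiklusõnnetus Põlvamaal.</p>"))
# ['Eile kell ##:## juhtus liiklusõnnetus Põlvamaal']
```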
3.2 Annotating

In order to find frequent patterns that contain location entities, the training data has to be annotated manually, specifying which words represent location names and which do not. E.g. one could use tags:

juhtus liiklusõnnetus <location>Põlvamaal</location>

("a traffic accident happened in Põlvamaa"). In the course of this work the information was encoded and annotated in JSON format:

{"loc": false, "word": "juhtus"},
{"loc": false, "word": "liiklusõnnetus"},
{"loc": true, "word": "Põlvamaal"}
3.3 Generalization with Morphological Attributes

The most straightforward approach would be to treat phrases, words or characters as patterns without generalizing their structure. That approach, however, would suffer from even slight variations in the text structure. In addition, the training data set would need to be very large in order to find such exact constructs frequently enough. One possible solution is to generalize the data by performing linguistic analysis on it: we can treat the text as a sequence of morphological attributes such as stem, suffix, part of speech (POS), grammatical case, tense and quantity. Consider again the phrase

juhtus liiklusõnnetus Põlvamaal

After applying morphological analysis to each word we obtain the following information:

juhtus: POS = verb, tense = past, stem = juhtu
liiklusõnnetus: POS = noun, singular, case = nominative, stem = liiklusõnnetus
Põlvamaal: POS = noun, proper, singular, case = adessive, stem = Põlvamaa

For the purposes of this work the morphological analyzer ESTMORF was used [6]. It represents the above sequence in the following form:

juhtu+s //_V_ s, //
liiklus_õnnetus+0 //_S_ sg n, //
Põlva_maa+l //_H_ sg ad, //

For the sake of simplicity only the following attribute subset was used: stem (ignoring the fact that some words are composite, e.g. liiklus_õnnetus here), POS, quantity (where appropriate), tense (for verbs) and grammatical case (for other word types). The last three will further be referred to as the form. The above phrase is thus treated as the following sequence:

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ] [ Põlva_maa, _H_, sg, ad ]
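The following sketch shows one way to turn ESTMORF-style analyses into such attribute vectors; the parsing rule is inferred from the example output shown above and is an assumption about the format, not the actual code used in the experiments.

```python
import re

# A rough sketch converting ESTMORF-style lines into [stem, POS, form...] vectors.
def to_feature_vector(line):
    # e.g. "Põlva_maa+l //_H_ sg ad, //"
    m = re.match(r"(\S+?)\+\S*\s+//(_\w_)\s+([^,/]*),\s*//", line)
    stem, pos, form = m.group(1), m.group(2), m.group(3).split()
    return [stem, pos] + form

analysis = [
    "juhtu+s //_V_ s, //",
    "liiklus_õnnetus+0 //_S_ sg n, //",
    "Põlva_maa+l //_H_ sg ad, //",
]
print([to_feature_vector(l) for l in analysis])
# [['juhtu', '_V_', 's'], ['liiklus_õnnetus', '_S_', 'sg', 'n'],
#  ['Põlva_maa', '_H_', 'sg', 'ad']]
```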

3.4 Abstract Representation

After running the morphological analysis each word is expressed as a vector of features. It might, however, be the case that the entire word sequence is not frequent while some part of it is. E.g. the complete pattern

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]

might be non-frequent while the sequence of its partial elements

[ _V_, s ] [ _S_, sg, n ]

is frequent. In other words, different abstract representations of the same initial construction might have different supports, so they have to be considered separately. For these experiments the following combinations were used:

- complete vector, i.e. [ liiklus_õnnetus, _S_, sg, n ];
- POS + form, i.e. [ _S_, sg, n ];
- form only, i.e. [ sg, n ].

This means that the sequence above is in fact processed as a set of multiple sequences containing the different abstract representations of the initial phrase:

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]
[ _V_, s ] [ _S_, sg, n ]
[ _V_, s ] [ sg, n ]
... all other combinations
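A small sketch of this expansion step, assuming each word is already a [stem, POS, form...] vector as above; variable names are illustrative.

```python
from itertools import product

# Expand one morphological sequence into all of its abstract representations
# (complete, POS + form, form only), one variant chosen per word.
def variants(word):
    stem, pos, form = word[0], word[1], word[2:]
    return [[stem, pos] + form, [pos] + form, form]

def abstract_sequences(phrase):
    return [list(combo) for combo in product(*(variants(w) for w in phrase))]

phrase = [["juhtu", "_V_", "s"], ["liiklus_õnnetus", "_S_", "sg", "n"]]
for seq in abstract_sequences(phrase):
    print(seq)
# 9 sequences, from the fully complete one down to forms only
```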
3.5 Finding Frequent Itemsets

Once we have the sequences we can start looking for frequent patterns in them. Some definitions first:

In classical frequent itemset mining the data is represented as a set of transactions, where each transaction is a separate set of one or more items, usually unordered. In our case each phrase is a transaction, and it is ordered.

The support of an itemset measures how many transactions (in our case phrases) contain it. It can be absolute (the number of transactions which contain the itemset) or relative (the fraction of all transactions that contain it). An itemset is considered frequent if its support is above some user-defined threshold. In the current experiments a relative support threshold of 5 percent was used, i.e. each pattern must occur in at least 5% of all phrases in order to be considered frequent.

Apriori is a well-known and relatively simple algorithm for finding frequent itemsets, and it was chosen for this work. It is based on the principle that all subsets of a frequent itemset are also frequent and, conversely, all supersets of a non-frequent itemset are also non-frequent. The algorithm starts by finding frequent itemsets of size 1. Then, during each iteration of the main loop, it generates candidate itemsets of size k + 1 from the previously retrieved frequent itemsets of size k and checks whether these candidates are really frequent. The algorithm terminates when no more frequent itemsets are found.
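The sketch below adapts the Apriori idea to ordered phrases, treating patterns as contiguous subsequences; since the text does not specify the original candidate and matching rules, those details here are assumptions.

```python
from collections import Counter

# An Apriori-style search for frequent contiguous subsequences of the
# encoded phrases; it mirrors the join step illustrated below.
def frequent_patterns(phrases, min_rel_support=0.05):
    min_count = max(1, int(min_rel_support * len(phrases)))

    def support_filter(candidates, k):
        counts = Counter()
        for phrase in phrases:
            windows = {tuple(phrase[i:i + k]) for i in range(len(phrase) - k + 1)}
            for w in windows & candidates:
                counts[w] += 1            # count transactions, not occurrences
        return {p: s for p, s in counts.items() if s >= min_count}

    k = 1
    frequent = support_filter({(item,) for ph in phrases for item in ph}, k)
    result = dict(frequent)
    while frequent:
        # join step: two frequent k-patterns overlapping in k - 1 items
        # yield one candidate of size k + 1
        candidates = {a + (b[-1],) for a in frequent for b in frequent
                      if a[1:] == b[:-1]}
        k += 1
        frequent = support_filter(candidates, k)
        result.update(frequent)
    return result
```

Each item must be hashable here, e.g. a word represented as the tuple ("liiklus_õnnetus", "_S_", "sg", "n").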
For example, suppose we have found two frequent sequences of size 2:

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]
[ liiklus_õnnetus, _S_, sg, n ] [ sg, ad ]

Then the next candidate of size 3 is

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ] [ sg, ad ]

and the algorithm will check its support. If the set turns out to be frequent, it is used to find frequent patterns of size 4, and so on. Once all the frequent patterns are obtained we can keep only those that contain the entities of interest, i.e. locations in this case. This is easy because the annotations initially attached to the words are still preserved in the morphological sequences.

3.6 Determining Significant Patterns

Not all of the obtained frequent patterns are really significant: we might be dealing with a coincidence rather than a real rule. E.g. individual words might simply be very common, which increases their chance of appearing in larger itemsets. Therefore it is important to measure the interestingness of each pattern and to filter out the insignificant ones.

One way to do this is to use p-values. Suppose we have an assumption that we want to either support or reject. The default (and uninteresting) position is called the null hypothesis, and the opposing new idea is called the alternative hypothesis. We collect a sample of data and calculate some test statistic from it. The p-value is then the probability of obtaining a statistic value at least as extreme, assuming the null hypothesis is true. If the p-value is lower than some user-defined threshold we can reject the null hypothesis (which means that the alternative hypothesis holds).

In our case the null hypothesis states that all items are independent of each other (and the alternative states that there is a dependency). The test statistic is the support of an itemset. The p-value is the probability of the itemset having at least the observed support under the assumption that the null hypothesis is true, i.e. it shows how probable it would be to encounter that particular chain of items if all items were independent. If this probability is smaller than some user-defined threshold, the null hypothesis is rejected and the pattern is considered significant. [5] describes the formula for the p-value calculation:

P(I) = \sum_{s=\sup(I)}^{n} \binom{n}{s} p_I^s (1 - p_I)^{n-s}, \qquad p_I = \prod_{i \in I} f_i, \qquad f_i = \frac{\sup(i)}{n}

where n is the number of transactions, sup(I) is the support of the itemset I and f_i is the relative frequency of item i. For these experiments the traditional significance level of 0.05 was used.
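The formula translates directly into code as a binomial tail probability; the counts in the usage line below are hypothetical.

```python
from math import comb, prod

# The p-value formula above: the probability of support at least sup_I in
# n transactions if the items occurred independently with their observed
# relative frequencies.
def pattern_p_value(item_supports, sup_I, n):
    p_I = prod(s / n for s in item_supports)      # p_I = product of f_i
    return sum(comb(n, s) * p_I**s * (1 - p_I)**(n - s)
               for s in range(sup_I, n + 1))

# e.g. two items each seen in 100 of 1000 phrases, co-occurring 25 times
print(pattern_p_value([100, 100], 25, 1000))      # tiny => pattern is significant
```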
3.7 Extractor Function

Once all significant frequent patterns are retrieved from the training data, we can use them to search for entities in previously unseen texts. The extractor function first preprocesses the text using the same routines that were applied to the training data (noise cleaning, tokenization, normalization). Then it applies the morphological analyzer and transforms the phrases into the same format that was used for the training data. The last step is to match the frequent patterns against each encoded phrase, starting with the longest patterns and continuing with shorter ones until a match is found.
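A sketch of this matching step follows. Storing each pattern together with the index of its location slot, and matching abstract items as suffixes of the word vectors, are assumptions chosen to be consistent with the abstraction scheme of Section 3.4.

```python
# Match significant patterns longest-first; on a match, the word filling the
# pattern position annotated as a location is returned as the entity.
def extract_location(phrase, patterns):
    # patterns: iterable of (pattern_tuple, location_index)
    for pattern, loc_idx in sorted(patterns, key=lambda p: len(p[0]), reverse=True):
        k = len(pattern)
        for i in range(len(phrase) - k + 1):
            # an abstract item matches a word vector that ends with it
            if all(tuple(w)[-len(p):] == tuple(p)
                   for w, p in zip(phrase[i:i + k], pattern)):
                return phrase[i + loc_idx]
    return None

phrase = [("juhtu", "_V_", "s"), ("liiklus_õnnetus", "_S_", "sg", "n"),
          ("Põlva_maa", "_H_", "sg", "ad")]
pattern = (("_V_", "s"), ("_S_", "sg", "n"), ("sg", "ad"))
print(extract_location(phrase, [(pattern, 2)]))
# ('Põlva_maa', '_H_', 'sg', 'ad')
```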
3.8 Testing Phase

Two common measures of accuracy are precision and recall. In the context of this work they can be defined as follows:

Precision is the number of correctly extracted entities (true positives) divided by the total number of extracted entities (true positives plus false positives). In the described experiment setting the precision turned out to be 0.92.

Recall is the number of true positives divided by the total number of entities (true positives plus false negatives). In the described experiment the recall was 0.45.
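For reference, here are hypothetical counts that reproduce the reported scores under the usual definitions; the paper does not give the raw counts.

```python
# Sanity check of the reported scores with invented, illustrative counts.
def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)

# e.g. 45 correctly extracted locations out of 49 extracted, 100 locations total
print(round(precision(45, 4), 2), round(recall(45, 55), 2))  # 0.92 0.45
```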
4 Conclusion

The experiments have shown that it is in principle possible to apply frequent itemset mining techniques for finding relevant entities in text with relatively high precision. Recall, on the other hand, turned out to be quite low, which can be explained by the fact that the variety of possible forms in natural language is very high, even when restricting to very specific domains and sources. A larger training sample might help with this shortcoming. Another useful technique would be to introduce known false patterns as counterexamples [1]. These might increase the precision by reducing the number of false positives. E.g. highway names (like "Tallinn-Tartu-Võru") often appear in phrases that are very similar to those containing point location names, so it would be a good idea to introduce a pattern filtering routine based on known counterexamples.

References

1. Fabian M. S.: Automated Construction and Growth of a Large Ontology (2009)
2. Zamin N., Oxley A.: Information Extraction for Counter-Terrorism: A Survey on Link Analysis (2010)
3. Sun Z., Lim E., Chang K., Ong T., Gunaratna R.K.: Event-Driven Document Selection for Terrorism Information Extraction (2005)
4. Chang C., Kayed M., Girgis M.R., Shaalan K.: A Survey of Web Information Extraction Systems
5. Gallo A., De Bie T., Cristianini N.: MINI: Mining Informative Non-redundant Itemsets
6. ESTMORF: http://www.eki.ee/keeletehnoloogia/projektid/estmorf/