Information Extraction Techniques in Terrorism Surveillance


Roman Tekhov

Abstract. This article gives a brief overview of what information extraction is and how it can be used for counter-terrorism surveillance. It also describes a proposed way of retrieving information using frequent pattern mining and lists the results of the conducted experiments.

Keywords: information extraction, terrorism, frequent itemset mining

1 Introduction

Terrorism is a major threat, so it is essential to develop tools that help reveal and explore information concerning future attacks, unknown organizations, their members and so on. Even when the information itself is available in the form of surveillance reports, news articles, web posts or social network data, it is still hard to process efficiently because bits of relevant information are scattered across huge amounts of irrelevant noise. Some pieces of information may also be duplicated, which makes collecting relevant data even harder. To allow counter-terrorism experts to process information efficiently, at least part of the routine has to be automated. If the data is extracted, filtered and classified automatically, domain experts can concentrate on analyzing what is really important.

2 Information Extraction

The field of information extraction (IE) is concerned with retrieving certain data of interest from initially unstructured natural sources, mostly text documents but also various multimedia streams (e.g. video). For text documents the extraction is performed using natural language processing (NLP) tools. For example, one might be interested in extracting the following information from text documents:

- entities of a certain type (location names, people and/or organizations involved);
- new relations between known entities, or occurrences of concrete relations of interest (connections between people, members of organizations);
- co-references and mentions (what words like "he" or "it" really refer to, based on previously extracted named entities).

Due to the high complexity of the challenge, many systems have a restricted scope, e.g. they only consider semi-structured documents or documents that are known to be concerned with some specific domain (such as a criminal news feed). Probably the simplest form of IE is applying hand-written regular expressions. Even though this can be effective for finding certain named entities or for filtering out data that is known to be irrelevant, the technique still has very limited capabilities. More sophisticated systems use Hidden Markov Models or other machine learning techniques.

Pattern-oriented systems. This class of IE systems is concerned with determining structural patterns which help to distinguish target entities and/or relations [1] [3]. Some systems require initial seed values for bootstrapping the extraction process. Suppose that in the beginning we have a number of seed entity instances and some target relation. If the goal is to retrieve relations between events and locations, the seed value for events might be "party" and for locations "Tallinn". Next we search the texts for occurrences of these seed instances and retrieve the constructions that express the target relation (e.g. "party was held in Tallinn"). If we replace the seed values with placeholders we get a pattern ("X was held in Y") and can search the texts again for other occurrences of this pattern. Suppose we find the construction "meeting was held in Tartu"; this gives us another pair of seed instances, "meeting" and "Tartu", which we can use to find new patterns, and so on. Other systems instead require manually annotated training data: patterns are extracted from the annotations and then matched against new, unseen data.
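To make the bootstrapping loop concrete, here is a minimal sketch over a toy in-memory corpus; the corpus, the seed pair and the way templates are induced are all hypothetical illustrations, not part of any system described above.

```python
import re

# A minimal sketch of seed-driven pattern bootstrapping; all data here is a
# toy example and the single-slot template induction is an assumption.
corpus = [
    "party was held in Tallinn",
    "meeting was held in Tartu",
    "concert was held in Narva",
]

def bootstrap(seeds, corpus, rounds=2):
    patterns, pairs = set(), set(seeds)
    for _ in range(rounds):
        # 1. Induce patterns from sentences containing a known (event, location) pair.
        for event, loc in list(pairs):
            for sentence in corpus:
                if event in sentence and loc in sentence:
                    patterns.add(
                        re.escape(sentence).replace(event, r"(\w+)", 1)
                                           .replace(loc, r"(\w+)", 1))
        # 2. Apply the induced patterns to harvest new seed pairs.
        for pattern in patterns:
            for sentence in corpus:
                m = re.fullmatch(pattern, sentence)
                if m and len(m.groups()) == 2:
                    pairs.add(m.groups())
    return patterns, pairs

patterns, pairs = bootstrap({("party", "Tallinn")}, corpus)
print(pairs)  # eventually includes ("meeting", "Tartu") and ("concert", "Narva")
```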
3 Experiments: Frequent Pattern Based IE

The main goal of this work is to apply a pattern-based technique for entity recognition that is based on frequent itemset mining principles:

1. Choose a training data set. In this work the training set consisted of 200 articles from the Postimees news site about various accidents that took place in Estonia: traffic accidents, fires, murders, assaults, etc.
2. Annotate the entities of interest by hand. This work is concerned with geographical locations: counties, towns, villages, etc.
3. Use the annotated training data to create an extractor function that can retrieve entities of interest from general texts. This was accomplished by applying a frequent itemset mining approach: first we search the training data set for frequent structural patterns containing the entities of interest, then the extractor searches for these patterns in the given texts.
4. Test the extractor function on an annotated test data set and analyze the quality of the extractor function. In this work a set of 25 articles (disjoint from the training set) was used.
3.1 Preprocessing

Natural language corpora contain a lot of redundant and noisy information, which is an obstacle to efficient pattern retrieval and matching. Therefore data preprocessing is essential.

Tokenization. It makes sense to apply the divide and conquer principle and restrict pattern searching to certain sub-structures instead of processing the entire text corpus at once. One natural way is to split the text into sentences and process them separately. In this work the texts were divided into even smaller units by splitting on punctuation marks (periods, commas and so on).

Noise elimination. Natural texts obtained from the web often contain formatting elements such as <span> or <p> tags, which need to be cleaned up before the analysis can be performed. For this work all HTML tags were removed from the text.

Normalization. Semantically equivalent things might have very different representations in natural language. For pattern recognition it is a good idea to normalize the data by replacing common constructions with standard representations. Consider some examples:

Character normalization. Characters might be encoded differently in different sources; e.g. the character ä might be expressed as the entity &auml; in an HTML document. For the experiments all characters were represented using UTF-8 encoding.

Date and time normalization. Dates and times can be expressed very differently: "06.04.", "6. aprillil" and "kuuendal aprillil" are all valid ways to describe the same date in Estonian, just as "kell 9.00", "kella 9.00 ajal" and "09.00" are all valid times. For the experiments described in this work, date and time patterns were normalized to ##.## and ##:## respectively.
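As a rough illustration, the preprocessing steps above might be combined as in the following sketch; the regular expressions are assumptions made for this toy example, not the exact rules used in the experiments.

```python
import re

# A minimal sketch of the preprocessing pipeline from Section 3.1:
# noise elimination, date/time normalization and punctuation tokenization.
def preprocess(html_text):
    text = re.sub(r"<[^>]+>", " ", html_text)                    # drop HTML tags
    text = re.sub(r"\b\d{1,2}\.\d{1,2}\.(?!\d)", "##.##", text)  # dates like "06.04."
    text = re.sub(r"\b\d{1,2}[.:]\d{2}\b", "##:##", text)        # times like "9.00"
    # tokenize: split into small phrase units on punctuation between words
    return [p.strip() for p in re.split(r"[.,;:!?](?:\s|$)", text) if p.strip()]

print(preprocess("<p>Eile kell 9.00 juhtus liiklusõnnetus Põlvamaal.</p>"))
# ['Eile kell ##:## juhtus liiklusõnnetus Põlvamaal']
```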
3.2 Annotating

In order to find frequent patterns that contain location entities, the training data has to be annotated manually, specifying which words represent location names and which do not. E.g. one could use tags:

juhtus liiklusõnnetus <location>Põlvamaal</location>

("a traffic accident happened in Põlvamaa"). In the course of this work the information was encoded and annotated in JSON format:

{"loc": false, "word": "juhtus"},
{"loc": false, "word": "liiklusõnnetus"},
{"loc": true, "word": "Põlvamaal"}
3.3 Generalization with Morphological Attributes

The most straightforward approach would be to treat phrases, words or characters as patterns without generalizing their structure. That approach, however, would suffer from even slight variations in the text structure. In addition, the training data set would need to be very large in order to find such exact constructs frequently enough. One possible solution is to generalize the data by performing linguistic analysis on it: we can treat the text as a sequence of morphological attributes such as stem, suffix, part of speech (POS), grammatical case, tense and quantity. Consider again the phrase

juhtus liiklusõnnetus Põlvamaal

After applying morphological analysis to each word we obtain the following information:

juhtus: POS = verb, tense = past, stem = juhtu
liiklusõnnetus: POS = noun, singular, case = nominative, stem = liiklusõnnetus
Põlvamaal: POS = noun, proper, singular, case = adessive, stem = Põlvamaa

For the purposes of this work the morphological analyzer ESTMORF was used [6]. It represents the above sequence in the following form:

juhtu+s //_V_ s, //
liiklus_õnnetus+0 //_S_ sg n, //
Põlva_maa+l //_H_ sg ad, //

For the sake of simplicity only the following attribute subset was used: stem (ignoring the fact that some words are composite, e.g. liiklus_õnnetus here), POS, quantity (where appropriate), tense (for verbs) and grammatical case (for other word types). The last three will further be referred to as the form. The above phrase is thus treated as the following sequence:

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ] [ Põlva_maa, _H_, sg, ad ]
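The following sketch shows one way to turn ESTMORF-style analyses into such attribute vectors; the parsing rule is inferred from the example output shown above and is an assumption about the format, not the actual code used in the experiments.

```python
import re

# A rough sketch converting ESTMORF-style lines into [stem, POS, form...] vectors.
def to_feature_vector(line):
    # e.g. "Põlva_maa+l //_H_ sg ad, //"
    m = re.match(r"(\S+?)\+\S*\s+//(_\w_)\s+([^,/]*),\s*//", line)
    stem, pos, form = m.group(1), m.group(2), m.group(3).split()
    return [stem, pos] + form

analysis = [
    "juhtu+s //_V_ s, //",
    "liiklus_õnnetus+0 //_S_ sg n, //",
    "Põlva_maa+l //_H_ sg ad, //",
]
print([to_feature_vector(l) for l in analysis])
# [['juhtu', '_V_', 's'], ['liiklus_õnnetus', '_S_', 'sg', 'n'],
#  ['Põlva_maa', '_H_', 'sg', 'ad']]
```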

3.4 Abstract Representation

After running the morphological analysis each word is expressed as a vector of features. It might, however, be the case that the entire word sequence is not frequent while some part of it is. E.g. the complete pattern

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]

might be non-frequent while the sequence of its partial elements

[ _V_, s ] [ _S_, sg, n ]

is frequent. In other words, different abstract representations of the same initial construction might have different supports, so they have to be considered separately. For these experiments the following combinations were used:

- complete vector, i.e. [ liiklus_õnnetus, _S_, sg, n ];
- POS + form, i.e. [ _S_, sg, n ];
- form only, i.e. [ sg, n ].

This means that the sequence above is in fact processed as a set of multiple sequences containing the different abstract representations of the initial phrase:

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]
[ _V_, s ] [ _S_, sg, n ]
[ _V_, s ] [ sg, n ]
... all other combinations
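A small sketch of this expansion step, assuming each word is already a [stem, POS, form...] vector as above; variable names are illustrative.

```python
from itertools import product

# Expand one morphological sequence into all of its abstract representations
# (complete, POS + form, form only), one variant chosen per word.
def variants(word):
    stem, pos, form = word[0], word[1], word[2:]
    return [[stem, pos] + form, [pos] + form, form]

def abstract_sequences(phrase):
    return [list(combo) for combo in product(*(variants(w) for w in phrase))]

phrase = [["juhtu", "_V_", "s"], ["liiklus_õnnetus", "_S_", "sg", "n"]]
for seq in abstract_sequences(phrase):
    print(seq)
# 9 sequences, from the fully complete one down to forms only
```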
3.5 Finding Frequent Itemsets

Once we have the sequences we can start looking for frequent patterns in them. Some definitions first:

In classical frequent itemset mining the data is represented as a set of transactions, where each transaction is a separate set of one or more items, usually unordered. In our case each phrase is a transaction, and it is ordered.

The support of an itemset measures how many transactions (in our case phrases) contain it. It can be absolute (the number of transactions which contain the itemset) or relative (the fraction of all transactions that contain it). An itemset is considered frequent if its support is above some user-defined threshold. In the current experiments a relative support threshold of 5 percent was used, i.e. each pattern must occur in at least 5% of all phrases in order to be considered frequent.

Apriori is a well-known and relatively simple algorithm for finding frequent itemsets, and it was chosen for this work. It is based on the principle that all subsets of a frequent itemset are also frequent and, conversely, all supersets of a non-frequent itemset are also non-frequent. The algorithm starts by finding frequent itemsets of size 1. Then, during each iteration of the main loop, it generates candidate itemsets of size k + 1 from the previously retrieved frequent itemsets of size k and checks whether these candidates are really frequent. The algorithm terminates when no more frequent itemsets are found.
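The sketch below adapts the Apriori idea to ordered phrases, treating patterns as contiguous subsequences; since the text does not specify the original candidate and matching rules, those details here are assumptions.

```python
from collections import Counter

# An Apriori-style search for frequent contiguous subsequences of the
# encoded phrases; it mirrors the join step illustrated below.
def frequent_patterns(phrases, min_rel_support=0.05):
    min_count = max(1, int(min_rel_support * len(phrases)))

    def support_filter(candidates, k):
        counts = Counter()
        for phrase in phrases:
            windows = {tuple(phrase[i:i + k]) for i in range(len(phrase) - k + 1)}
            for w in windows & candidates:
                counts[w] += 1            # count transactions, not occurrences
        return {p: s for p, s in counts.items() if s >= min_count}

    k = 1
    frequent = support_filter({(item,) for ph in phrases for item in ph}, k)
    result = dict(frequent)
    while frequent:
        # join step: two frequent k-patterns overlapping in k - 1 items
        # yield one candidate of size k + 1
        candidates = {a + (b[-1],) for a in frequent for b in frequent
                      if a[1:] == b[:-1]}
        k += 1
        frequent = support_filter(candidates, k)
        result.update(frequent)
    return result
```

Each item must be hashable here, e.g. a word represented as the tuple ("liiklus_õnnetus", "_S_", "sg", "n").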
For example, suppose we have found two frequent sequences of size 2:

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ]
[ liiklus_õnnetus, _S_, sg, n ] [ sg, ad ]

Then the next candidate of size 3 is

[ juhtu, _V_, s ] [ liiklus_õnnetus, _S_, sg, n ] [ sg, ad ]

and the algorithm will check its support. If the set turns out to be frequent, it is used to find frequent patterns of size 4, and so on. Once all the frequent patterns are obtained we can keep only those that contain the entities of interest, i.e. locations in this case. This is easy because the annotations initially attached to the words are still preserved in the morphological sequences.

3.6 Determining Significant Patterns

Not all of the obtained frequent patterns are really significant: we might be dealing with a coincidence rather than a real rule. E.g. individual words might simply be very common, which increases their chance of appearing in larger itemsets. Therefore it is important to measure the interestingness of each pattern and to filter out the insignificant ones.

One way to do this is to use p-values. Suppose we have an assumption that we want to either support or reject. The default (and uninteresting) position is called the null hypothesis, and the opposing new idea is called the alternative hypothesis. We collect a sample of data and calculate some test statistic from it. The p-value is then the probability of obtaining a statistic value at least as extreme, assuming the null hypothesis is true. If the p-value is lower than some user-defined threshold we can reject the null hypothesis (which means that the alternative hypothesis holds).

In our case the null hypothesis states that all items are independent of each other (and the alternative states that there is a dependency). The test statistic is the support of an itemset. The p-value is the probability of the itemset having at least the observed support under the assumption that the null hypothesis is true, i.e. it shows how probable it would be to encounter that particular chain of items if all items were independent. If this probability is smaller than some user-defined threshold, the null hypothesis is rejected and the pattern is considered significant. [5] describes the formula for the p-value calculation:

P(I) = \sum_{s=\sup(I)}^{n} \binom{n}{s} p_I^s (1 - p_I)^{n-s}, \qquad p_I = \prod_{i \in I} f_i, \qquad f_i = \frac{\sup(i)}{n}

where n is the number of transactions, sup(I) is the support of the itemset I and f_i is the relative frequency of item i. For these experiments the traditional significance level of 0.05 was used.
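The formula translates directly into code as a binomial tail probability; the counts in the usage line below are hypothetical.

```python
from math import comb, prod

# The p-value formula above: the probability of support at least sup_I in
# n transactions if the items occurred independently with their observed
# relative frequencies.
def pattern_p_value(item_supports, sup_I, n):
    p_I = prod(s / n for s in item_supports)      # p_I = product of f_i
    return sum(comb(n, s) * p_I**s * (1 - p_I)**(n - s)
               for s in range(sup_I, n + 1))

# e.g. two items each seen in 100 of 1000 phrases, co-occurring 25 times
print(pattern_p_value([100, 100], 25, 1000))      # tiny => pattern is significant
```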
3.7 Extractor Function

Once all significant frequent patterns are retrieved from the training data, we can use them to search for entities in previously unseen texts. The extractor function first preprocesses the text using the same routines that were applied to the training data (noise cleaning, tokenization, normalization). Then it applies the morphological analyzer and transforms the phrases into the same format that was used for the training data. The last step is to match the frequent patterns against each encoded phrase, starting with the longest patterns and continuing with shorter ones until a match is found.
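A sketch of this matching step follows. Storing each pattern together with the index of its location slot, and matching abstract items as suffixes of the word vectors, are assumptions chosen to be consistent with the abstraction scheme of Section 3.4.

```python
# Match significant patterns longest-first; on a match, the word filling the
# pattern position annotated as a location is returned as the entity.
def extract_location(phrase, patterns):
    # patterns: iterable of (pattern_tuple, location_index)
    for pattern, loc_idx in sorted(patterns, key=lambda p: len(p[0]), reverse=True):
        k = len(pattern)
        for i in range(len(phrase) - k + 1):
            # an abstract item matches a word vector that ends with it
            if all(tuple(w)[-len(p):] == tuple(p)
                   for w, p in zip(phrase[i:i + k], pattern)):
                return phrase[i + loc_idx]
    return None

phrase = [("juhtu", "_V_", "s"), ("liiklus_õnnetus", "_S_", "sg", "n"),
          ("Põlva_maa", "_H_", "sg", "ad")]
pattern = (("_V_", "s"), ("_S_", "sg", "n"), ("sg", "ad"))
print(extract_location(phrase, [(pattern, 2)]))
# ('Põlva_maa', '_H_', 'sg', 'ad')
```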
3.8 Testing Phase

Two common measures of accuracy are precision and recall. In the context of this work they can be defined as follows:

Precision is the number of correctly extracted entities (true positives) divided by the total number of extracted entities (true positives plus false positives). In the described experiment setting the precision turned out to be 0.92.

Recall is the number of true positives divided by the total number of entities (true positives plus false negatives). In the described experiment the recall was 0.45.
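For reference, here are hypothetical counts that reproduce the reported scores under the usual definitions; the paper does not give the raw counts.

```python
# Sanity check of the reported scores with invented, illustrative counts.
def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)

# e.g. 45 correctly extracted locations out of 49 extracted, 100 locations total
print(round(precision(45, 4), 2), round(recall(45, 55), 2))  # 0.92 0.45
```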
4 Conclusion

The experiments have shown that it is in principle possible to apply frequent itemset mining techniques for finding relevant entities in text with relatively high precision. Recall, on the other hand, turned out to be quite low, which can be explained by the fact that the variety of possible forms in natural language is very high, even when restricting to very specific domains and sources. A larger training sample might help with this shortcoming. Another useful technique would be to introduce known false patterns as counterexamples [1]. These might increase the precision by reducing the number of false positives. E.g. highway names (like "Tallinn-Tartu-Võru") often appear in phrases that are very similar to those containing point location names, so it would be a good idea to introduce a pattern filtering routine based on known counterexamples.

References

1. Fabian M. S.: Automated Construction and Growth of a Large Ontology (2009)
2. Zamin N., Oxley A.: Information Extraction for Counter-Terrorism: A Survey on Link Analysis (2010)
3. Sun Z., Lim E., Chang K., Ong T., Gunaratna R.K.: Event-Driven Document Selection for Terrorism Information Extraction (2005)
4. Chang C., Kayed M., Girgis M.R., Shaalan K.: A Survey of Web Information Extraction Systems
5. Gallo A., De Bie T., Cristianini N.: MINI: Mining Informative Non-redundant Itemsets
6. ESTMORF: http://www.eki.ee/keeletehnoloogia/projektid/estmorf/