INFORMATION EXTRACTION USING SVM UNEVEN MARGIN FOR MULTI-LANGUAGE DOCUMENT


Dwi Hendratmo Widyantoro*, Ayu Purwarianti*, Paramita*
* School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Ganesha 10, Bandung, Indonesia
email: dwi@if.itb.ac.id, ayu@informatika.org

ABSTRACT

We applied SVM with Uneven Margin (SVMUM) to a cross-lingual Information Extraction (IE) task. SVM has been shown to perform well in IE for English documents. So far, no IE research has been conducted on multi-language documents in which the majority of the text is written in a low-resource language such as Indonesian. In our experiments, we compared several training-data compositions to see their impact on the token classification of the IE task for multi-language documents. The results show that even though the linguistic information for Indonesian is unsatisfactory, the IE performance can still be improved by adding training data. We also compared SVMUM with kNN and Naïve Bayes as the learning algorithm in the token classification of the IE task. The experimental results showed that SVMUM is suitable for multi-language documents: its IE accuracy was better than that of the other two algorithms.

Keywords: Information Extraction, SVM with Uneven Margin, multi-language documents.

1 INTRODUCTION

Information Extraction (IE) aims at extracting snippets of information automatically. The process takes unstructured text as input and produces fixed-format, unambiguous data as output. The IE task is mainly supported by MUC (Message Understanding Conference) and ACE (Automatic Content Extraction, the successor of MUC). Both events supply data and evaluation for the IE task; all data are provided in English.

Extraction rules are the key ingredient of an IE system. These rules can be hand-crafted by human experts or generated automatically using a machine learning approach. Hand-crafted rules may have high extraction accuracy, but they are not scalable and are very expensive to create.
For this reason, researchers tend to use machine learning algorithms to construct IE systems, where the system automatically learns extraction rules from training data (i.e., texts annotated with the pieces of information to be extracted). There are two categories of machine learning approach here: rule learning and statistical learning. Rule learning induces a set of extraction rules from training examples; IE systems such as SRV[8], RAPIER[1], WHISK[15], BWI[10], and (LP)2[5] belong to this category. The second category, statistical learning, aims to classify tokens into Named Entity (NE) categories using a statistical model such as an HMM[9], Maximum Entropy[4], SVM[7,11,12,13], or a Perceptron[2]. These systems employ different machine learning techniques as the classifier and also use different features, such as linguistic or document-structure features.

SVM, as a general supervised machine learning algorithm, has achieved good performance on many classification tasks, including Named Entity (NE) recognition. Isozaki and Kazawa[11] compared three methods for NE recognition (SVM with a quadratic kernel, a maximum entropy method, and a rule-based learning system); their experiments showed that the SVM-based system outperformed the others. Mayfield et al.[13] employed an SVM with a cubic kernel to compute transition probabilities in a lattice-based approach to NE recognition; their results were comparable to others on the CoNLL-2003 shared task. ELIE[7] used two-level boundary classification with an SVM. The first level classifies each word as a start or end tag; it gives high precision but low recall. The second classifier addresses this by detecting the end of a fragment given a start tag, and the start of a fragment given an end tag. GATE-SVM[12] used SVM with uneven margins to classify each word as a start or end tag in a single classification pass.
4th International Conference Information & Communication Technology and System

They used post-processing to guarantee tag consistency and to improve the results by exploiting other information. To guarantee consistency, the document is scanned from left to right, removing start tags without a matching end tag and end tags without a matching start tag. They also filtered candidates by length: if a candidate's length is not equal to the length of any entity of the same type in the training set, it is removed from the candidate list. In the last step, they put together all possible tags for a sequence of tokens and chose the best one according to the probability computed from the classifier output via a sigmoid function. Based on the experimental results[7,12], ELIE achieved a better average score (78.8%) than GATE-SVM (77.9%).

All the systems mentioned above use English documents as training and testing data. With English documents, linguistic features can easily be obtained from available language tools such as GATE, the Collins parser, etc. Meanwhile, there is also demand for information extraction systems for other languages. For a major language such as Japanese, linguistic features can likewise be obtained from available language processing tools such as ChaSen, MeCab, the EDR dictionary, etc. However, for a language such as Indonesian, no language processing tool is available. This paper explores the use of the available English language processing tools to perform IE on documents written in multiple languages (English and Indonesian); we found that such documents are widely available on the Internet. In particular, we address the IE task on multi-language documents as a classification problem, using SVMUM (Support Vector Machine with Uneven Margin) to learn the classification model.

2 INFORMATION EXTRACTION AS TOKEN CLASSIFICATION

There are three main components in modeling IE as a token classification problem[14]: context representation, classification algorithm, and tagging strategy. Each component is described in the following sections.

2.1 Context Representation

Before token classification is performed, each token in a document is preprocessed into several features. In this work, we used ANNIE-GATE[6] as the natural language processing tool.
Even though ANNIE was designed for English, it can still be used to obtain linguistic features in our multi-language documents, because English and Indonesian share similar characteristics. Specifically, there are four linguistic features for each token:

1. Lexical form. Words are processed by a morphological analyzer; for example, "are" is transformed into "be". Indonesian words are treated as OOV (out-of-vocabulary) words, so their lexical form is left unchanged.
2. Orthographic information. Indonesian and English share the same orthographic conventions: name words (such as persons, locations, organizations) usually begin with an uppercase letter. This feature takes the values upperInitial, allCaps, and standard; for example, the orthographic feature of the word "PM" is allCaps.
3. Semantic gazetteer information. A word has this feature if it exists in ANNIE's gazetteer; for example, the gazetteer feature of the word "PM" is Time. Indonesian words have poor gazetteer information, since ANNIE's gazetteer contains only English words.
4. ANNIE's predicted named entity (NE). Besides the gazetteer, ANNIE also has an NE predictor. The NE produced by this predictor is used to help token classification in our multi-language IE system.

Using only the feature vector of the current token is not sufficient for token classification, because of context effects from surrounding tokens. Therefore, to classify a particular token, the feature vector includes the linguistic features of the current token, the three previous tokens, and the three following tokens.

2.2 SVMUM as Classification Algorithm

SVM has been shown to perform well on classification tasks. To handle imbalanced training data, Li et al.[12] proposed SVM with Uneven Margin for document categorization and for token classification in IE. Their experimental results showed that SVM with Uneven Margin outperformed other SVM modifications on the IE task.
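The seven-token context window of Section 2.1 (the current token plus the three before and three after it) can be sketched as follows. This is a minimal sketch: the feature names and the padding marker are illustrative assumptions, not the paper's actual encoding.

```python
def window_features(token_feats, index, window=3):
    """Build the classification input for one token by combining the
    feature dicts of the token itself, the `window` previous tokens,
    and the `window` following tokens.

    `token_feats` is a list of per-token feature dicts (lexical form,
    orthography, gazetteer class, predicted NE); positions outside the
    document are padded with a null marker.
    """
    combined = {}
    for offset in range(-window, window + 1):
        j = index + offset
        feats = token_feats[j] if 0 <= j < len(token_feats) else {"lex": "<PAD>"}
        for name, value in feats.items():
            # Prefix each feature with its relative position, e.g. "lex@-1".
            combined[f"{name}@{offset:+d}"] = value
    return combined
```

Each combined dict would then be encoded (e.g. one-hot) into the sparse vector fed to the classifier.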
Given a training set Z = ((x_1, y_1), ..., (x_m, y_m)), where x_i is an n-dimensional input vector and y_i ∈ {-1, +1} is its label, the SVM with uneven margin is obtained by solving the quadratic optimization problem:

    min_{w,b,ξ}  ⟨w, w⟩ + C Σ_{i=1..m} ξ_i                       (1)
    s.t.  ⟨w, x_i⟩ + ξ_i + b ≥ 1     if y_i = +1
          ⟨w, x_i⟩ - ξ_i + b ≤ -τ    if y_i = -1
          ξ_i ≥ 0                    for i = 1, ..., m

Here τ is the ratio of the negative margin to the positive margin of the classifier; it is equal to 1 in the standard SVM. In this paper, we used the Java version of LibSVM[3].
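The uneven-margin formulation itself is not available in common SVM libraries. As a rough stand-in, a sketch assuming scikit-learn: asymmetric class weights penalize slack on the rare positive class more heavily, which biases the decision boundary in a way qualitatively similar to choosing τ < 1 (it is not the same optimization problem). The data here is a synthetic toy set.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Toy imbalanced data: 200 negative examples, 20 positive examples.
X = np.vstack([rng.normal(-1.0, 1.0, (200, 5)),
               rng.normal(1.0, 1.0, (20, 5))])
y = np.array([-1] * 200 + [1] * 20)

# Up-weighting the positive class shifts the boundary toward the majority
# class, easing the recall problem that uneven margins also target.
clf = LinearSVC(C=1.0, class_weight={-1: 1.0, 1: 10.0})
clf.fit(X, y)
print(clf.score(X, y))
```

For the genuine SVMUM objective above, the transformation described by Li et al.[12] reduces it to a standard SVM problem, which is why an off-the-shelf solver such as LibSVM can be used.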

2.3 Tagging Strategy

Each token is classified as a start tag or an end tag of one of the classes of information to be extracted. There are 14 classes in our job-advertisement domain: industry (company domain), company_name, job_category, job_title, location, education_level (minimum desired education degree), foreign_language, description (job description), salary, contact (email address), deadline, posting_date, needed_experience, and experience_duration. After each token is classified as a start tag, an end tag, or neither, tag-consistency checking is performed: a token with a start tag but no matching end tag for a particular class is not considered a candidate of that class. If one text snippet has more than one class candidate, the candidate with the highest probability is selected.

3 EXPERIMENTS

3.1 Experimental Data and Setup

To test the IE system on multi-language documents, we collected multi-language job advertisements from the Internet; we also collected English-only job ads. In total there are 90 English documents and 90 Indonesian-English documents, downloaded from several company web sites. Three training data sets were used in the experiments: (1) 80 English + 10 Indonesian-English documents, (2) 45 English + 45 Indonesian-English documents, and (3) 10 English + 80 Indonesian-English documents. Each training set is tested twice: on 10 English documents and on 10 Indonesian-English documents.

3.2 Experimental Results

The overall IE performance on the three types of training data is shown in Figure 1. The score shown in the figure is the average of the lenient and strict F-measures, where the F-measure is calculated using equation 2:

    F-measure = (2 × precision × recall) / (precision + recall)    (2)

Figure 1. Performance Comparison of the IE Task on Three Types of Training Data for Monolingual and Multi-language Documents

Figure 1 shows that although some information is missing from the feature vectors of the Indonesian-English documents, the IE performance can still be improved by increasing the size of the training data. The performance on Indonesian-English documents with Data-3 (80 Indonesian-English) is 14% better than with Data-1 (10 Indonesian-English); the additional training data enriches the token-class vocabulary. The performance on English documents is 20% higher than that on Indonesian-English documents with Data-1 (80 English + 10 English-Indonesian). When the composition is reversed, as in Data-3 (10 English + 80 English-Indonesian), the performance on English-Indonesian documents is only 3.7% higher than that on English documents. This indicates that although the amount of training data affects IE accuracy, linguistic features also play an important role in improving IE performance.

Table 1. IE Performance for Each Class Using SVM Uneven Margin with Three Training Data Types

                        English                   Indonesian-English
Class                   Data-1  Data-2  Data-3    Data-1  Data-2  Data-3
industry                1.00    1.00    0.91      0.59    0.59    0.67
company_name            0.93    0.90    0.90      0.67    0.80    0.77
job_category            0.64    0.50    0.38      0.25    0.40    0.60
job_title               0.57    0.50    0.35      0.48    0.69    0.67
location                0.90    0.90    0.63      0.61    0.58    0.67
education_level         0.91    0.83    0.87      0.90    0.95    0.93
foreign_language        1.00    0.92    0.92      0.72    0.81    0.76
description             0.19    0.21    0.43      0.22    0.35    0.55
salary                  1.00    1.00    1.00      1.00    1.00    1.00
contact                 0.84    0.84    0.84      0.94    0.86    0.81
deadline                1.00    1.00    1.00      0.71    0.80    0.75
posting_date            0.91    0.91    0.81      0.83    0.91    0.91
needed_experience       0.73    0.63    0.44      0.00    0.00    0.29
experience_duration     0.90    0.91    0.87      0.00    0.46    0.67

The detailed result for each class is shown in Table 1. In general, the larger the training data, the higher the achieved performance, with the exception of the description class in the testing data set. Looking into the testing data, we found that the description class contains a wide variety of text snippets. Unlike classes such as posting_date, salary, or company_name, the description class cannot be classified by its orthographic or token-type information, and the lexical forms and semantic information of its start and end words also vary. Further analysis is needed to find representative features of a text snippet that allow good classification of the description class.

In the experiments, we also compared the performance of the SVM algorithm with the kNN and Naïve Bayes algorithms; the IE accuracies are shown in Figure 2. The figure shows that SVM achieved higher IE performance than kNN and Naïve Bayes, especially when using Indonesian-English documents as the testing set. On English test documents, the performance of SVM and kNN is quite comparable, but on the Indonesian-English documents SVM outperforms kNN by about 22.6% accuracy. This confirms that the SVM with Uneven Margin algorithm is suitable for imbalanced data such as the Indonesian-English documents, where linguistic information is incomplete.
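The strict and lenient span scoring averaged in the results above can be sketched as follows. This is a sketch under assumptions: equation 2's F-measure is standard, but the exact overlap rule GATE uses for "lenient" matching is assumed here to be "same class and overlapping offsets".

```python
def f1(p, r):
    """Equation 2: harmonic mean of precision and recall."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def span_scores(gold, predicted, lenient=False):
    """Precision, recall, and F over extracted spans, each span being a
    (start, end, class) triple. Strict scoring requires exact matches;
    lenient scoring also credits partial overlaps of the same class."""
    def match(p, g):
        if p == g:
            return True
        if lenient:
            # Same class and character/token ranges overlap.
            return p[2] == g[2] and p[0] < g[1] and g[0] < p[1]
        return False

    tp = sum(1 for p in predicted if any(match(p, g) for g in gold))
    precision = tp / len(predicted) if predicted else 0.0
    hit = sum(1 for g in gold if any(match(p, g) for p in predicted))
    recall = hit / len(gold) if gold else 0.0
    return precision, recall, f1(precision, recall)
```

For example, a predicted salary span that covers only part of the gold salary span counts under lenient scoring but not under strict scoring.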
Figure 2. Performance Comparison of the IE Task with the SVMUM, kNN, and Naïve Bayes Algorithms

4 CONCLUSION

Our work shows that although the linguistic features of multi-language documents are not as adequate as those of English documents, SVMUM still gives good IE performance on these multi-language documents, and the performance can be improved by providing more training data. The experimental results also confirmed that SVM with Uneven Margin is suitable for imbalanced data such as that found in multi-language documents: in the experiments, it outperformed two other machine learning algorithms, kNN and Naïve Bayes. However, token classification with the start and end tagging strategy does not work well for long, highly variable text snippets such as the description class. This kind of more general information requires more global context than is contained in the three-token window used in our research. Our future work will explore this problem further.

ACKNOWLEDGEMENT

This research is supported by the Insentif Program of the Ministry of Research & Technology under contract #53/RT/Insentif/PPK/II/2008 and by a Research Grant from Institut Teknologi Bandung under contract #0045/K01.03.2/PL2.1.5/I/2008.

REFERENCES

[1] Califf, M. E. Relational Learning Techniques for Natural Language Information Extraction. Ph.D. thesis, University of Texas at Austin. 1998.
[2] Carreras, X., L. Marquez, and L. Padro. Learning a Perceptron-Based Named Entity Chunker via Online Recognition Feedback. In Proceedings of CoNLL-2003, pages 156-159. Edmonton, Canada. 2003.
[3] Chang, Chih-Chung and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. 2001.
[4] Chieu, H. L. and H. T. Ng. A Maximum Entropy Approach to Information Extraction from Semi-structured and Free Text. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 786-791. 2002.
[5] Ciravegna, F. An Adaptive Algorithm for Information Extraction from Web-related Texts. In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, Seattle. 2001.
[6] Cunningham, H., D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). 2002.
[7] Finn, Aidan. A Multi-Level Boundary Classification Approach to Information Extraction. 2006.
[8] Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. thesis, Carnegie Mellon University. 1998.
[9] Freitag, D. and A. K. McCallum. Information Extraction with HMMs and Shrinkage. In Proceedings of the Workshop on Machine Learning for Information Extraction, pages 31-36. 1999.
[10] Freitag, D. and N. Kushmerick. Boosted Wrapper Induction. In Proceedings of AAAI-2000. 2000.
[11] Isozaki, H. and H. Kazawa. Efficient Support Vector Classifiers for Named Entity Recognition. In Proceedings of the 19th International Conference on Computational Linguistics (COLING'02), pages 390-396. Taipei, Taiwan. 2002.
[12] Li, Yaoyong, Kalina Bontcheva, and Hamish Cunningham. Using Uneven Margins SVM and Perceptron for Information Extraction. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 72-79. 2005.
[13] Mayfield, J., P. McNamee, and C. Piatko. Named Entity Recognition Using Hundreds of Thousands of Features. In Proceedings of CoNLL-2003, pages 184-187. Edmonton, Canada. 2003.
[14] Siefkes, Christian. An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models. Ph.D. thesis, Freie Universität Berlin, Berlin-Brandenburg Graduate School in Distributed Information Systems. 2007.
[15] Soderland, S. Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 34(1): 233-272. 1999.