New Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites


1 New Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites
Tomas Krilavičius, Žygimantas Medelis, Jurgita Kapočiūtė-Dzikienė, Tomas Žalandauskas

2 Problem
- How can Lithuanian news media be monitored to identify the main topics and facts?
- Potential users: business intelligence, political campaigns, military intelligence, police

Krilavičius, Medelis, Kapočiūtė-Dzikienė, Žalandauskas: News Media Analysis using Focused Crawl and NLP


3 Purpose of the System
- Collect relevant information from the Internet
- Prepare text for analysis
- Extract relevant information
- Store it
- Provide tools to
  - Search it (e.g., faceted search)
  - Analyse it (e.g., visualisation, word frequency)

4 Novelty and Scientific Problems
- Problem is not new
- There exist solutions
  - Reference architecture
  - For selected languages (e.g., English, French, Russian)
- Problem: text analysis is language-dependent
- Our results: case-based analysis of media in Lithuanian

5 General Architecture


6 Focused Crawl
- Web crawler that downloads only relevant pages
  - Removes ads and topics of no interest
- Uses several types of filters
  - Link filters
    - URL filters [href element analysis]
    - Anchor text filters
  - Content filters
    - Regexp-based term filters
    - Classifiers employing word lists, topic maps, ontologies
- Highly configurable tools exist, e.g. Apache Nutch
  - Mostly language-agnostic
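
The filter types listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the configuration used in the system (which relies on Apache Nutch): the host name, the blocked path segments, the anchor stoplist, and the topic terms are all hypothetical values chosen for the example.

```python
import re
from urllib.parse import urlparse

# All values below are hypothetical and exist only for illustration.
ALLOWED_HOSTS = {"news.example.lt"}
BLOCKED_PATH = re.compile(r"/(reklama|horoskopai)/", re.IGNORECASE)  # ads, horoscopes
ANCHOR_STOPLIST = {"reklama", "prisijungti"}  # "advertisement", "log in"
TOPIC_TERMS = re.compile(r"\b(seim\w*|vyriausyb\w*|rinkim\w*)\b", re.IGNORECASE)


def url_passes(url):
    """URL filter: keep links on allowed hosts whose path is not blocked."""
    parsed = urlparse(url)
    return parsed.hostname in ALLOWED_HOSTS and not BLOCKED_PATH.search(parsed.path)


def anchor_passes(anchor_text):
    """Anchor text filter: drop links whose visible text is on the stoplist."""
    return anchor_text.strip().lower() not in ANCHOR_STOPLIST


def content_passes(text, min_hits=2):
    """Regexp-based term filter: keep pages that mention enough topic terms."""
    return len(TOPIC_TERMS.findall(text)) >= min_hits


def should_follow(url, anchor_text):
    """Link filter: a link is followed only if both link-level filters accept it."""
    return url_passes(url) and anchor_passes(anchor_text)
```

A crawler would call `should_follow` before fetching a link and `content_passes` after downloading the page, so that irrelevant pages are neither stored nor used as a source of further links.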

7 Text Preprocessing
[Figure: text preprocessing pipeline. Labels: text document; text + structure; accents, spacing, etc.; stopwords; noun groups; stemming; indexing; structure recognition; structure; full text; terms; index]
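
The first stages of such a pipeline (spacing, accents, stopwords) might look roughly like the sketch below. It is a simplified stand-in, not the GATE-based pipeline of the system: the stopword list is a tiny illustrative sample, and the optional accent folding removes all combining marks, which also folds Lithuanian letters such as ė or č to their base letters. That is an assumption made for the example and may not be what the slide means by "accents".

```python
import re
import unicodedata

# Tiny illustrative stopword list ("and", "that", "but", "and/but", "with").
STOPWORDS = {"ir", "kad", "bet", "o", "su"}


def strip_accents(text):
    """Decompose characters and drop combining marks (e.g. 'ė' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_spacing(text):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text, fold_accents=False):
    """Normalise spacing and case, optionally fold accents, then drop stopwords."""
    text = normalise_spacing(text).lower()
    if fold_accents:
        text = strip_accents(text)
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if tok not in STOPWORDS]
```

Stemming and indexing would follow these steps; they are left out here because they depend on the language-specific tools discussed in the following slides.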


8 Linguistic Infrastructure: Corpora
- Stop-words
  - Easy to build
  - Available as a part of TokenMill's language pack
- General corpus
  - Hunspell
- Domain vocabulary
- Synonyms
  - sinonimai.lt
- Taxonomy
- Commercial usage of university/institute owned corpus is not clear


9 Linguistic Infrastructure: Small Tools
- Language identifiers
  - LT language identifier (TokenMill language pack)
- Sentence splitters (TokenMill language pack)
- Structure recognition
- Tilde, Fotonija, Leksinova, etc. probably have some in-house tools


10 Linguistic Infrastructure: Word Level Analyzers
- Stemmers
  - Porter stemmer: prototype for Lithuanian language is available (Krilavičius, Medelis, 2010)
- Phonetic algorithms
  - Lithuanian Soundex (Paliulionis, 2009; Krilavičius, Kuliešienė, 2010)
- Morphological analysis / lemmatization / POS
  - Morfolema, Zinkevičius
  - Morfologinis anotatorius (VMU, Computer Ling. Center)
  - Tilde?
  - Petkevičius (in progress, VMU, Informatics fac.)
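
To show what a phonetic algorithm of this family does, here is a sketch of the classic English (American) Soundex. It is not the Lithuanian Soundex cited above, whose letter groups and rules are not given in these slides; it only illustrates the general technique of mapping similar-sounding names to the same short code.

```python
# Consonant groups of the classic English Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character English Soundex code of a word."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code to ""
    return (result + "000")[:4]
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, which is what makes the approach useful for fuzzy matching of person names in search.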


11 Linguistic Infrastructure: Text Analytics
- Classification (categorization)
  - Some EuroVoc-based results by Daudaravičius, 2012
  - Report and publications in progress: Kapočiūtė-Dzikienė, Krilavičius
- Clustering
  - No published results for Lithuanian language
  - Report and publications in progress
  - Some preliminary results in Zuokas, Medelis, Kaušas, Krilavičius, 2010

12 Linguistic Infrastructure: Text Analytics
- NER tools
  - RegExp-based: person names, dates, locations, etc. (Kapočiūtė, Raškinis, 2005)
  - GATE (JAPE)-based: citations, person names, dates (Zuokas, Medelis, Kaušas, Krilavičius, 2010; Krilavičius, Medelis, Balčas, Širvinskas, 2012)
  - AI/ML-based methods: just starting (Zuokas, Medelis, Kaušas, Krilavičius, 2010)
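
A regexp-based recogniser of the kind mentioned above can be sketched as follows. The patterns are assumptions made for this example and are not taken from the cited work: dates are matched only in the numeric YYYY-MM-DD form, and a person-name candidate is simply two adjacent capitalised words written with Lithuanian letters.

```python
import re
from datetime import date

# Hypothetical patterns for illustration only.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LT_UPPER = "A-ZĄČĘĖĮŠŲŪŽ"
LT_LOWER = "a-ząčęėįšųūž"
NAME_RE = re.compile(rf"\b[{LT_UPPER}][{LT_LOWER}]+ [{LT_UPPER}][{LT_LOWER}]+\b")


def find_dates(text):
    """Return valid calendar dates written as YYYY-MM-DD."""
    found = []
    for year, month, day in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            pass  # matches the pattern but is not a real date, e.g. month 13
    return found


def find_person_candidates(text):
    """Return pairs of adjacent capitalised words as person-name candidates."""
    return NAME_RE.findall(text)
```

Such patterns are cheap to write but over-generate (any two capitalised words at the start of a sentence match), which is one reason rule engines such as GATE's JAPE, and later ML-based methods, are used on top of them.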


13 Experimental Implementation
- Crawler: Apache Nutch
- Text preprocessing and NER
  - GATE (General Architecture for Text Engineering)
  - Hunspell
  - Lithuanian (Porter-based) stemmer
- Classification and clustering: Apache Mahout
- Faceted search: Apache Solr (Apache Lucene)
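
As an illustration of the faceted search component, the sketch below builds a Solr select URL that requests facet counts. The parameters `q`, `rows`, `wt`, `facet`, `facet.field` and `facet.mincount` are standard Solr query parameters; the core name `news` and the field names `text`, `source` and `person` are hypothetical and do not come from the slides.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, query, facet_fields, rows=10):
    """Build a Solr /select URL that returns documents plus facet counts."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", 1),  # hide facet values with zero matching documents
    ]
    # One facet.field parameter per field to be counted.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names.
url = build_facet_query(
    "http://localhost:8983/solr/news",
    "text:rinkimai",
    ["source", "person"],
)
```

The response to such a request contains, next to the matching documents, the number of hits per value of each facet field, which a user interface can present as clickable filters.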

14 Experimental Results

15 Results
- Overview of existing/missing NLP/IR tools for Lithuanian language
- Experimental implementation
  - Running in production
- Based on that, future research plans
  - Classification
  - Clustering
  - Corpora, e.g. TREC-annotated
  - Stemmer
  - NER
  - Ontologies

16 Conclusions
- Some tools exist, but more are missing
  - In some cases they are just well hidden
- Plenty of things to do, but not all of them are very interesting research-wise (e.g., EN Soundex is over 100 years old; Porter stemmer, 1979)
- Enough tools to build media monitoring systems, but a lot of improvements are possible

17 THANKS
