How SPICE Language Modeling Works


Abstract

Enhancing the language model is a first step towards enhancing the performance of an Automatic Speech Recognition (ASR) system. This report describes an integrated scheme for data collection and language model building in SPICE. Starting with a small corpus, SPICE helps the user gather more relevant data from the web in order to build a better language model. The techniques involved in achieving this goal and the details of their implementation are presented for the benefit of future developers.

1. Introduction

SPICE is a web-based interface aimed at the rapid development of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems in target languages of interest. SPICE can be used to build systems that are large and open-domain, as well as systems specific to a domain. However, it has been our experience [1][2] that users of SPICE often develop small-scale, domain-specific ASR and TTS systems. The performance of an ASR system depends on its acoustic models (AM) and on the coverage and quality of its language model (LM). For domain-specific ASR systems, the amount of data available for training a language model is typically smaller than for an open-domain ASR. Further, for languages other than English, Chinese, Spanish, French and German, the amount of text data available when building a domain-specific system is smaller still.

Improving the performance of an ASR system requires enhancing both the language model and the acoustic model. A first step is to boost the performance of the language model. The LM training component of SPICE acquires text resources from the web in addition to the user-supplied corpus and helps build a better language model. The user of SPICE initially specifies a web link or uploads a small text corpus in the target language of interest. Domain-specific key terms and phrases extracted from the web pages/corpus are presented to the user as a list. The user selects the terms pertinent to the domain of interest, and these are sent as queries to an external search engine. Web links returned by the search engine are used to obtain additional text data, which is then filtered to retain only the relevant content. Thus, starting with a small corpus, the user of SPICE is aided in gathering a large amount of domain-specific data for training better LMs.

Some research on using the web as a corpus for building language models has been done in the past. In [3], the authors compute the most frequent bigrams in a corpus and submit them as queries to a search engine; bigram counts in the search results are then used to improve trigram models for speech recognition. The authors of [4] proceed a step further: they crawl the websites listed by the search engine and retain those web pages that contain few out-of-vocabulary words. A detailed analysis of the performance of web-based models in natural language processing is presented in [5]. At the same time, there is an entire body of research in the databases community [6] on topic-specific crawling (also referred to as focused crawling), which deals with collecting data specific to a particular topic.

While the purpose of focused crawling is to build topic-specific indexes, our purpose is to enhance the language model. Our ideas [2] are closest to the work of [3][4], but there are key differences, which are elaborated later.

The succeeding sections discuss in detail the ideas behind and the implementation of the new SPICE language modeling tools. Section 2 outlines the algorithms and steps involved in data collection and language model building. Section 3 details aspects of the implementation from a developer's perspective. Section 4 concludes with a discussion of open issues along with pointers to future work.

2. Approach

When collecting text data for a particular domain, the user of SPICE usually knows one or a few URLs (Uniform Resource Locators) related to the domain, or has a text corpus that can be uploaded into the SPICE system. For English and other widely spoken languages, the available domain-specific corpus may be large, but for languages with less digital representation it can be quite small. In that case the user would like to obtain additional corpora from the web to augment the existing corpus. This can be achieved by typing queries into a search engine and using the resulting URLs to get more data, but non-Roman-script languages require specialized keyboards to submit such queries. To address these issues, we have developed the following scheme for collecting additional data.

Step 1: Specifying the initial data source

The data source can be a plain-text file or a URL related to the domain of interest. SPICE assumes that the URL or corpus represents a particular domain. Figure 1 gives an example.

Figure 1: Uploading a text corpus or specifying a URL related to the domain of interest.

Step 2: Identifying domain terms

As discussed earlier, additional text data can be obtained by using the URLs returned by a search engine. Instead of having the user specify the queries, we propose to identify domain terms from the text corpus specified in Step 1 and to use these domain terms as queries to a search engine. Since good queries produce good search results, SPICE attempts to extract good domain terms from the data initially specified by the user.

Figure 2: Terms related to the domain of interest are presented to the user, who can select the terms that are relevant.

The domain terms are obtained using the following algorithm (a minimal Perl sketch of the scoring appears after this list):

1. If the user specifies a URL, the documents (HTML, PDF, DOC and TXT formats are supported) from the URL are downloaded and cleaned to obtain the text data.

2. A frequency analysis of the text data is performed in order to identify the stop-words (e.g., in English, stop-words such as "the" and "in"). The K (K=10) most frequent tokens are taken to be the stop-words.

3. Statistics for bigrams are then collected from the cleaned-up documents:
   a. The frequency of the bigram over all the documents downloaded.
   b. The document frequency of the bigram, i.e. the number of documents in which the bigram occurs.
   c. The bigrams are scored according to the Term-Frequency Inverse-Document-Frequency (TF-IDF) measure

         tfidf(b) = f(b) * ln(N / df(b))

      where f(b) is the frequency and df(b) the document frequency of bigram b, and N is the total number of documents downloaded. The bigrams are also sorted in decreasing order of their frequency of occurrence in the corpus.
   d. The bigrams must not contain stop-words. For example, suppose the trigram "cut the vegetables" occurs with high frequency. The bigrams "cut the" and "the vegetables" are not good queries to a search engine if the user wishes to collect more data in the cookery domain; "cut vegetables", obtained by removing the stop-word "the", may be. Thus, the stop-words obtained in step 2 are removed while the bigram statistics are collected.

4. A model of the domain is also learned. Currently, the model has been kept simple and represents the domain as a single vector of terms and their (normalized) frequencies. There is scope for using clustering or methods like Latent Semantic Analysis (LSA) to discover latent topics in the data.

5. The bigrams with the highest TF-IDF scores and the highest frequencies are presented to the user as a list, as shown in Figure 2. Currently, the 15 bigrams with the highest TF-IDF scores and the 15 bigrams with the highest frequencies are interleaved: since SPICE does not know a priori whether the URL is homogeneous or heterogeneous in topical content, the highest-frequency and highest-TF-IDF bigrams are presented alternately. If the user uploads a single text corpus instead of specifying a URL, only the bigrams with the highest frequencies are shown.

6. The user selects the relevant domain terms (bigrams) and initiates a re-crawl.
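The following Perl sketch illustrates the stop-word detection and TF-IDF scoring described above. It is not the actual procdocs.pl code: it assumes the documents have already been cleaned to plain text and are passed as file arguments, and the variable names are illustrative; only K=10 and the top-15 cut-offs come from the description above.

```perl
#!/usr/bin/perl
# Sketch of Step 2: find stop-words, then score stop-word-free bigrams
# by TF-IDF and by raw frequency. Assumes cleaned plain-text documents.
use strict;
use warnings;

my @docs;
for my $file (@ARGV) {
    open my $fh, '<', $file or die "$file: $!";
    local $/;                              # slurp mode
    push @docs, scalar <$fh>;
}

# 1. The K most frequent tokens over the collection are the stop-words.
my $K = 10;
my %unigram;
for my $doc (@docs) {
    $unigram{$_}++ for split /\s+/, lc $doc;
}
my %stop = map { $_ => 1 }
           grep { defined }
           (sort { $unigram{$b} <=> $unigram{$a} } keys %unigram)[0 .. $K - 1];

# 2. Bigram term frequency (tf) and document frequency (df); stop-words
#    are removed from the token stream first, so "cut the vegetables"
#    contributes the bigram "cut vegetables".
my (%tf, %df);
for my $doc (@docs) {
    my @tok = grep { !$stop{$_} } split /\s+/, lc $doc;
    my %seen;
    for my $i (0 .. $#tok - 1) {
        my $bg = "$tok[$i] $tok[$i + 1]";
        $tf{$bg}++;
        $df{$bg}++ unless $seen{$bg}++;    # count each document only once
    }
}

# 3. tfidf(b) = f(b) * ln(N / df(b)), with N the number of documents.
my $N = @docs;
my %score = map { $_ => $tf{$_} * log($N / $df{$_}) } keys %tf;

# Interleave the top 15 bigrams by TF-IDF with the top 15 by frequency.
my @by_tfidf = (sort { $score{$b} <=> $score{$a} } keys %score)[0 .. 14];
my @by_freq  = (sort { $tf{$b}    <=> $tf{$a}    } keys %tf)[0 .. 14];
for my $i (0 .. 14) {
    print "$by_tfidf[$i]\n" if defined $by_tfidf[$i];
    print "$by_freq[$i]\n"  if defined $by_freq[$i];
}
```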

Step 3: Re-crawl and domain filtering

1. The domain terms selected by the user are sent as queries to a search engine (Google in our case). For each query, the top 5 URLs returned by the search engine are assumed to be relevant sources of data and are added to the list of URLs to be crawled.

2. A shallow crawl (1 level deep) is initiated for each of the URLs in the list. Deep crawls take much more time and may result in topical drift; for example, a deep crawl starting from a sports URL may also pull in documents from a URL related to politics.

3. The documents collected from the URLs are cleaned of tags and other formatting. Each document is represented as a feature vector of terms and their normalized frequencies, and the cosine similarity of this document vector to the domain model learned in Step 2.4 is computed (see the sketch after this list). If the similarity lies above a preset threshold, the document is considered in-domain and retained; otherwise it is discarded. This approach is similar to [4], with the difference that the authors of [4] threshold on the OOV rate instead; our technique, on the other hand, allows more sophisticated models to be used in place of a one-centroid cluster.

4. All the documents classified as in-domain are appended together and presented as additional data for enhancing the language model.
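A minimal sketch of the filtering in item 3, assuming the text from which the domain model was learned and the candidate documents are available as plain-text files; the threshold value and file-argument convention are illustrative and not those of the actual moredata.pl.

```perl
#!/usr/bin/perl
# Sketch of Step 3.3: keep a crawled document only if its term vector is
# close enough, by cosine similarity, to the domain model from Step 2.4.
use strict;
use warnings;

my $THRESHOLD = 0.3;   # illustrative; the preset SPICE threshold may differ

sub slurp {
    my ($file) = @_;
    open my $fh, '<', $file or die "$file: $!";
    local $/;
    return scalar <$fh>;
}

# Build a unit-normalized term-frequency vector (term => weight).
sub vectorize {
    my ($text) = @_;
    my %tf;
    $tf{$_}++ for split /\s+/, lc $text;
    my $norm = 0;
    $norm += $_ ** 2 for values %tf;
    $norm = sqrt($norm) || 1;
    $tf{$_} /= $norm for keys %tf;
    return \%tf;
}

# The dot product of two unit vectors is their cosine similarity.
sub cosine {
    my ($u, $v) = @_;
    my $dot = 0;
    $dot += $u->{$_} * ($v->{$_} // 0) for keys %$u;
    return $dot;
}

# First argument: text the domain model is built from; the rest: documents.
my ($model_file, @doc_files) = @ARGV;
my $domain = vectorize(slurp($model_file));

for my $file (@doc_files) {
    my $sim = cosine(vectorize(slurp($file)), $domain);
    print "$file\n" if $sim > $THRESHOLD;   # list the in-domain documents
}
```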

Step 4: Building and improving the language model

When building a language model, the user of SPICE is presented with three options in case additional data has been collected in Step 3. The options (Figure 3) are:

1. Interpolating the LM built from the additional data with the LM built from the user-uploaded corpus or URL. If the additional corpus contains considerable out-of-domain data, an LM built by simply pooling it with the initially uploaded data may be corrupted; interpolation avoids this domain drift. A sketch of this option using the SRILM toolkit appears at the end of this step.

2. Adding the additional corpus to the initially crawled data and building an improved (hopefully) language model.

3. Not using the additional data. The user can choose this if the initially uploaded corpus is itself large enough.

In all three cases the performance of the resulting language model is presented to the user in terms of perplexity and OOV rate, as shown in Figure 3.

Figure 3: Improving the language model using additional data. The first table shows the performance of the LM built from the initially uploaded data; the second shows the result of an interpolated language model that also uses the additionally crawled data. Lower perplexity and OOV rate indicate an improvement in the quality of the language model.
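A minimal sketch of option 1, assuming the SRILM tools are on the PATH. The file names are illustrative, and the interpolation weight is fixed here, whereas buildlm estimates it by 5-fold cross-validation on development data (avgweights.pl).

```perl
#!/usr/bin/perl
# Sketch of Step 4, option 1: train an LM on each corpus with SRILM's
# ngram-count, interpolate them with ngram, and report perplexity.
use strict;
use warnings;

my $lambda = 0.5;   # illustrative weight; buildlm picks it by cross-validation

sub run { system(@_) == 0 or die "command failed: @_\n" }

# Train a trigram LM on the user corpus and on the crawled additional data.
run('ngram-count', '-order', '3', '-text', 'user_corpus.txt', '-lm', 'user.lm');
run('ngram-count', '-order', '3', '-text', 'web_extra.txt',   '-lm', 'web.lm');

# Interpolate the two models and write the merged LM.
run('ngram', '-lm', 'user.lm', '-mix-lm', 'web.lm',
    '-lambda', $lambda, '-write-lm', 'interpolated.lm');

# Evaluate both models on held-out test data (prints perplexity and OOVs).
run('ngram', '-lm', $_, '-ppl', 'test.txt') for ('user.lm', 'interpolated.lm');
```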

3. Implementation Details

As with the rest of the SPICE modules, the data collection and LM training components are organized in a three-layer hierarchy:

1. Web-interface code: This is the highest-level, user-facing code, which handles user interaction and shows the progress and performance of LM training. It is written in PHP and invokes the middle-level SPICE shell scripts. The related files, with their relative locations in the SPICE package, are:
   a. spice/web/text.php - Forms the interface shown in Figure 1.
   b. spice/web/code/h_text.php - Serves as a header file used by text.php to dynamically change the content of the page shown in Figure 1 based on the progress of the crawling/re-crawling processes.
   c. spice/web/2col_leftnav.css - A cascading style sheet (CSS) common to all SPICE components, governing the look and feel of the SPICE pages.

2. SPICE LM scripts: These are the main tools used in the background for data collection and LM training. Written primarily in shell script and Perl, they are as follows:
   a. spice/lm/scripts/crawl.sh - Top-level script which uses the wget command for crawling URLs (a sketch of such an invocation appears after this section) and calls the Perl scripts responsible for cleaning and obtaining domain terms (bigrams). Takes a URL as input and returns a list of domain terms and a domain model.
   b. spice/lm/scripts/recrawl.sh - Top-level shell script that calls the Perl scripts responsible for querying Google, getting the additional URLs, crawling them and filtering the data based on cosine similarity.
   c. spice/lm/scripts/procupload.sh - Top-level script that functions in almost the same way as crawl.sh, except that it works on a single uploaded corpus instead of web documents. This script could be merged with crawl.sh in the future.
   d. spice/lm/scripts/killprocs.sh - Responsible for aborting the crawl.sh/recrawl.sh processes if the user so desires. Not fully tested yet.
   e. spice/lm/scripts/buildlm - Top-level shell script responsible for LM training and comprehensive performance testing by 5-fold cross-validation. Calls the ngram and ngram-count commands of the SRILM toolkit for LM evaluation and training, respectively.
   f. spice/lm/scripts/procdocs.pl - Used by crawl.sh, recrawl.sh and procupload.sh. Cleans up the web documents, computes the statistics of the domain terms and writes the best domain terms to a file. Takes a URL as input and outputs a domain model and domain terms. Essentially performs Step 2 of the previous section.
   g. spice/lm/scripts/moredata.pl - Used by recrawl.sh. Takes a domain model as input along with a directory containing the additional documents obtained by re-crawling, finds the relevant documents and outputs them as a single file of additional domain-related data.
   h. spice/lm/scripts/search.pl - Used by recrawl.sh. Queries Google, given a file containing the list of domain terms selected by the user, and outputs a file containing the URLs to crawl data from.
   i. spice/lm/scripts/htmlproc.pm - Perl module used by procdocs.pl and moredata.pl. Responsible for handling document formats such as HTML, DOC, PDF and TXT, cleaning and extracting the body text, and converting the text data to UTF-8 encoding.
   j. spice/lm/scripts/charfreq.pl - Computes character frequencies given a text corpus. Used later for Grapheme-to-Phoneme (G2P) rule learning and learning the pronunciation lexicon.
   k. spice/lm/scripts/split.pl - Used to split a text corpus into training, testing and development portions in the proportions 80%:10%:10%, respectively.
   l. spice/lm/scripts/clean.pl - Used to remove empty lines and extra spaces in the corpus collected at the end of the data collection process. Important: this module serves as a placeholder for text normalization code which can remove punctuation and resolve numerals and other non-standard words (NSWs).
   m. spice/lm/scripts/avgweights.pl - Obtains the interpolation weights as the result of 5-fold cross-validation over the development data.
   n. spice/lm/scripts/compile_results.pl - Compiles the results of the LM performance evaluation (perplexity, OOV rate) and outputs them as an HTML table for display to the user.
   Important: Scripts (j)-(n) are used by buildlm.

3. External dependencies: These toolkits are external to SPICE and consist of:
   a. SRILM Toolkit - Used by buildlm for LM training and evaluation.
   b. Antiword - Used by htmlproc.pm for converting DOC files to plain text.
   c. XPDF - Used by htmlproc.pm for extracting text from PDF files.
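As referenced above, the following is a minimal sketch of a depth-1 wget crawl of the kind crawl.sh performs. The wget flags and the output directory name are assumptions, not the actual crawl.sh command line.

```perl
#!/usr/bin/perl
# Sketch of a shallow (one level deep) crawl as done by crawl.sh/recrawl.sh.
use strict;
use warnings;

my $outdir = 'crawl_data';                   # illustrative output directory
for my $url (@ARGV) {
    system('wget', '-q',                     # quiet
           '-r', '-l', '1',                  # recursive, but only 1 level deep
           '-nd', '-P', $outdir,             # flat layout under $outdir
           '-A', 'html,htm,pdf,doc,txt',     # the formats SPICE can process
           $url) == 0 or warn "wget failed for $url\n";
}
# procdocs.pl then cleans the downloaded files (htmlproc.pm strips markup
# and converts the text to UTF-8) before collecting bigram statistics.
```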

4. Issues and Future Work

While the data collection tools have been designed to work with a wide variety of data sources, some issues remain to be addressed. These issues provide pointers to future work:

1. Handling document formats:
   a. The PDF-to-text conversion needs to be tested thoroughly with different documents.
   b. The Office Open XML formats used in Microsoft documents are collections of compressed (.zip) XML files. These need to be handled too.
   c. Compressed corpora (.tar.gz, .gz, .zip) also need to be handled; one possible dispatch is sketched after this list.
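A possible way to handle compressed corpora (item 1c). This is not implemented in SPICE; the extension-based dispatch and the unpacking tools are just one option.

```perl
#!/usr/bin/perl
# Sketch for issue 1c: unpack a compressed corpus into a working directory
# by dispatching on the file extension. Illustrative only.
use strict;
use warnings;
use File::Basename qw(basename);

sub unpack_corpus {
    my ($file, $dir) = @_;
    mkdir $dir unless -d $dir;
    if ($file =~ /\.(?:tar\.gz|tgz)$/) {
        system('tar', 'xzf', $file, '-C', $dir) == 0 or die "tar failed\n";
    }
    elsif ($file =~ /\.zip$/) {
        system('unzip', '-o', '-q', $file, '-d', $dir) == 0 or die "unzip failed\n";
    }
    elsif ($file =~ /\.gz$/) {          # plain gzip: decompress to a text file
        my $out = basename($file);
        $out =~ s/\.gz$//;
        system("gzip -dc \Q$file\E > \Q$dir\E/\Q$out\E") == 0 or die "gzip failed\n";
    }
    else {
        warn "unrecognized corpus format: $file\n";
    }
}

unpack_corpus($_, 'unpacked_corpus') for @ARGV;
```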

2. Improving the quality of domain terms:
   a. Removing out-of-language data - Web pages often contain data that does not belong to the language of interest. For example, a Chinese web page can contain English strings like "Contact Us" or "All Rights Reserved", which are not desirable. While the current SPICE data collection tools filter out some of this data (and numerals) while finding domain terms, out-of-language data still shows up in the list of candidate domain terms. For example, a crawl of an English website produced the frequent bigram "am pm" (referring to times of day) among the candidate domain terms.
   b. Metric for ranking domain terms - TF-IDF and bigram frequency are currently used to score the domain terms presented to the user. When the text corpus is homogeneous, bigram frequency works well for identifying domain terms; if the corpus is heterogeneous (covers more than one topic), TF-IDF works well. It is difficult to determine beforehand whether the user-supplied URL is homogeneous or not. Instead, the data initially crawled by the user could be clustered using methods like Latent Dirichlet Allocation (LDA) to obtain a set of latent topics, and a frequency analysis performed on the documents belonging to the dominant topics to obtain domain terms. Such a model could also be used during the re-crawling stage to filter out irrelevant documents.
   c. Handling word segmentation - Languages like Chinese that do not use whitespace to segment words require additional processing. Chinese websites in particular use BIG5 or GB-2312 encoding to represent data. Once the encoding is identified, appropriate methods can be used to handle word segmentation.

3. Text normalization - Non-standard words such as numerals are often language specific and need to be handled separately. Currently, punctuation symbols are not removed before LM training because their role depends on the language being worked on. Normalization issues like lower-casing need to be addressed, especially for European languages. A sketch of such a normalization pass follows.
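An illustrative normalization pass of the kind clean.pl is a placeholder for (issue 3). The specific rules shown are assumptions, and the language-specific steps are left as comments; none of this is in SPICE yet.

```perl
#!/usr/bin/perl
# Sketch of a text-normalization filter: reads a corpus on stdin or from
# file arguments, writes normalized lines to stdout.
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    $line =~ s/[[:punct:]]+/ /g;   # strip punctuation (a language-dependent choice)
    $line = lc $line;              # lower-case; mainly meaningful for European scripts
    # TODO: expand numerals ("42" -> "forty two") per target language
    # TODO: resolve other non-standard words (dates, currency, abbreviations)
    $line =~ s/\s+/ /g;            # collapse runs of whitespace
    $line =~ s/^\s+|\s+$//g;
    print "$line\n" if length $line;
}
```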

References

[1] SPICE: Web-based Tools for Rapid Language Adaptation in Speech Processing Systems. Tanja Schultz, Alan Black, Sameer Badaskar, Matthew Hornyak, John Kominek. Proceedings of Interspeech 2007, Antwerp, Belgium.

[2] Improving Speech Systems Built from Very Little Data. John Kominek, Sameer Badaskar, Alan Black, Tanja Schultz. To appear in Proceedings of Interspeech 2008, Brisbane, Australia.

[3] Improving Trigram Language Modeling with the World Wide Web. Xiaojin Zhu, Roni Rosenfeld. Proceedings of ICASSP 2001, Vol. 1, Salt Lake City, USA.

[4] Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling Using Class-Dependent Mixtures. Ivan Bulyko, Mari Ostendorf, Andreas Stolcke. Proceedings of HLT-NAACL 2003, pp. 7-9, Edmonton, Canada.

[5] Web-based Models for Natural Language Processing. Mirella Lapata, Frank Keller. ACM Transactions on Speech and Language Processing, Vol. 2, Issue 1, Article 3, 2005.

[6] Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. Soumen Chakrabarti, Martin van den Berg, Byron Dom. Proceedings of WWW8, Toronto, Canada.
