How SPICE Language Modeling Works

Abstract

Enhancement of the language model is a first step towards improving the performance of an Automatic Speech Recognition system. This report describes the integrated scheme of data collection and language-model building in SPICE. Starting from a small corpus, SPICE helps the user gather more relevant data from the web in order to build a better language model. The techniques involved and the details of their implementation are presented for future developers.

1. Introduction

SPICE is a web-based interface aimed at rapid development of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems in target languages of interest. SPICE can be used to build large, open-domain systems as well as systems specific to a domain. However, it has been our experience [1][2] that users of SPICE often develop small-scale, domain-specific ASR and TTS systems.

The performance of an ASR system depends on its acoustic models (AM) and on the coverage and quality of its language model (LM). For domain-specific ASR systems, the amount of data available for training a language model is typically smaller than for an open-domain ASR. Further, for languages other than English, Chinese, Spanish, French and German, even less text data is available for building a domain-specific system. Improving ASR performance requires enhancing both the language model and the acoustic model, and boosting the language model is a natural first step. The LM training component of SPICE acquires text resources from the web in addition to the user-supplied corpus and helps build a better language model.

The user of SPICE initially specifies a web link or uploads a small text corpus in the target language of interest. Domain-specific key terms and phrases extracted from the web pages or corpus are presented to the user as a list.
Terms that are pertinent to the domain of interest are selected by the user and sent as queries to an external search engine. The web links returned by the search engine are used to obtain additional text data, which is then filtered to retain only the relevant content. Thus, starting from a small corpus, the user of SPICE is aided in gathering a large amount of domain-specific data for training better LMs.

Some research on using the web as a corpus for building language models has been done in the past. In [3], the authors compute the most frequent bigrams in a corpus and submit them as queries to a search engine; bigram counts in the search results are used to improve trigram models for speech recognition. The authors of [4] go a step further: they crawl the websites listed by the search engine and retain those web pages that have a low out-of-vocabulary rate. A detailed analysis of the performance of web-based models in natural language processing is presented in [5]. There is also an entire body of research in the databases community [6] on topic-specific crawling (also referred to as focused crawling), which deals with collecting data specific to a particular topic. While the purpose of focused crawling is to build topic-specific indexes, our purpose is to enhance the language model. Our ideas [2] are closest to the work of [3][4], but there are key differences which will be elaborated later.

The succeeding sections discuss the ideas and implementation of the new SPICE language-modeling tools in detail. Section 2 outlines the algorithms and steps involved in data collection and language-model building. Section 3 details aspects of the implementation from a developer's perspective. Section 4 concludes with a discussion of open issues along with pointers to future work.

2. Approach

When collecting text data for a particular domain, the user of SPICE usually knows one or a few URLs (Uniform Resource Locators) related to the domain, or has a text corpus which can be uploaded into the SPICE system. For English and other widely spoken languages, the available domain-specific corpus may be large, but for languages with less digital representation it can be quite small. In this case, the user would like to obtain additional corpora from the web to augment the existing corpus. This could be achieved by typing queries into a search engine and using the resulting URLs to get more data, but non-Roman-script languages require specialized keyboards to submit such queries. To address these issues, we have developed the following scheme for collecting additional data.

Step 1: Specifying the initial data source

The data source can be a plain-text file or a URL (e.g., http://www.tagesschau.de/) related to the domain of interest. SPICE assumes that the URL or corpus represents a particular domain. Figure 1 gives an example.

Figure 1: Uploading a text corpus or specifying a URL related to the domain of interest.
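The download-and-clean operation behind Step 1 (wget plus the cleaning scripts described in Section 3) can be approximated in a few lines; this is an illustrative sketch and not the code SPICE actually uses, and the function names are hypothetical:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from an HTML page, skipping script/style blocks."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # >0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Strip tags and scripts, returning the body text of a downloaded page."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In SPICE itself this cleaning is done by htmlproc.pm, which additionally handles DOC and PDF inputs and converts the result to UTF-8.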

Step 2: Identifying domain terms

As discussed earlier, additional text data can be obtained by using the URLs returned by a search engine. Instead of having the user specify the queries, we identify domain terms from the text corpus specified in Step 1 and use them as queries to a search engine. Since good queries yield good search results, SPICE attempts to obtain good domain terms from the data initially specified by the user.

Figure 2: Terms related to the domain of interest presented to the user, who can select the terms that are relevant.

The domain terms are obtained using the following algorithm:

1. If the user specifies a URL, the documents (HTML, PDF, DOC and TXT formats are supported) from the URL are downloaded and cleaned to obtain the text data.

2. A frequency analysis of the text data is performed in order to identify the stop-words (e.g., in English, words such as "the" and "in"). The K (K=10) most frequent tokens are identified as stop-words.

3. Statistics for bigrams are then collected from the cleaned-up documents:

a. The frequency of the bigram across all the downloaded documents.

b. The document frequency of the bigram, i.e., the number of documents in which the bigram occurs.

c. The bigrams are scored according to the Term-Frequency Inverse-Document-Frequency (TF-IDF) measure f_i * ln(N / d_i), where f_i is the frequency and d_i the document frequency of bigram i, and N is the total number of documents downloaded. Bigrams are also sorted in decreasing order of their frequency of occurrence in the corpus. The bigrams with the highest TF-IDF scores and the highest frequencies are presented to the user as a list; currently, the 15 bigrams with the highest TF-IDF and the 15 bigrams with the highest frequency are shown in an interleaved fashion. Since SPICE does not know a priori whether the URL is homogeneous or heterogeneous in topical content, both the highest-frequency and the highest-TF-IDF bigrams are presented to the user alternately.

d. The bigrams do not contain stop-words. Suppose, for example, that the trigram "cut the vegetables" occurs with high frequency. The bigrams "cut the" and "the vegetables" are not good queries to a search engine if the user wishes to collect more data in the cookery domain; "cut vegetables", obtained after removing the stop-word "the", may be a good query. Thus, the stop-words obtained in step 2 are removed during collection of the bigram statistics.

4. A model of the domain is also learned. Currently, the model is kept simple and represents the domain as a single vector of terms and their normalized frequencies. There is scope for using clustering or methods like Latent Semantic Analysis (LSA) to discover latent topics in the data.

5. The highest-TF-IDF and highest-frequency bigrams are presented alternately to the user as a list, as shown in Figure 2. If the user uploads a single text corpus instead of specifying a URL, the bigrams with the highest frequencies are shown.

6. The user selects relevant domain terms (bigrams) and initiates a re-crawl.

Step 3: Re-crawl and domain filtering

1. The domain terms selected by the user are sent as queries to a search engine (Google in our case). For each query, the top 5 URLs returned by the search engine are assumed to be relevant sources of data and are added to the list of URLs to be crawled.

2. A shallow crawl (one level deep) is initiated for each of the URLs in the list. Deep crawls take much more time and may result in topical drift; for example, a deep crawl into a sports URL may also pull in documents from a URL related to politics.

3. The documents collected from the URLs are cleaned of tags and other formatting. Each document is represented as a feature vector of terms and their normalized frequencies.
The cosine similarity of this document vector to the domain model learned in Step 2.4 is computed. If the similarity lies above a preset threshold, the document is considered in-domain and retained; otherwise it is discarded. This approach is similar to [4], with the difference that [4] thresholds on the OOV rate instead. Our technique, on the other hand, allows more sophisticated models to be used in place of a one-centroid cluster.

4. All the documents classified as in-domain are concatenated and presented as additional data for enhancing the language model.

Step 4: Building and improving the language model

When building a language model, the user of SPICE is presented with three options in case additional data has been collected in Step 3. The options (Figure 3) are:

1. Interpolating the LM built from the additional data with the LM built from the user-uploaded corpus or URL. If the additional corpus contains considerable out-of-domain data, an LM built by simply pooling it with the initially uploaded data may be corrupted; interpolation avoids this domain drift.

2. Adding the additional corpus to the initially crawled data and building an improved (hopefully) language model.

3. Not using the additional data. The user can choose this if the initially uploaded corpus is itself large enough.

In all three cases, the performance of the resulting language model is reported to the user in terms of perplexity and OOV rate, as shown in Figure 3.

Figure 3: Improving the language model using additional data. The performance of the LM built from the initially uploaded data is shown in the first table; the second table shows the result of using an interpolated language model built with the additional crawled data. Lower perplexity and OOV indicate an improvement in the quality of the language model.
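The bigram scoring of Step 2 and the domain filtering of Step 3 can be illustrated with a compact sketch. This is a minimal illustration rather than the actual Perl implementation; the function names, the value of K and the similarity threshold are illustrative:

```python
import math
from collections import Counter

def top_bigrams(docs, k_stop=10, n_terms=15):
    """Step 2: score stop-word-free bigrams by TF-IDF = f_i * ln(N / d_i)."""
    tokenized = [d.lower().split() for d in docs]
    # Step 2.2: the K most frequent tokens are treated as stop-words.
    word_freq = Counter(w for doc in tokenized for w in doc)
    stopwords = {w for w, _ in word_freq.most_common(k_stop)}
    tf, df = Counter(), Counter()
    for doc in tokenized:
        # Remove stop-words first, so "cut the vegetables" yields "cut vegetables".
        content = [w for w in doc if w not in stopwords]
        bigrams = list(zip(content, content[1:]))
        tf.update(bigrams)          # frequency over all documents
        df.update(set(bigrams))     # number of documents containing the bigram
    n_docs = len(docs)
    scores = {bg: tf[bg] * math.log(n_docs / df[bg]) for bg in tf}
    return sorted(scores, key=scores.get, reverse=True)[:n_terms]

def term_vector(text):
    """Normalized term-frequency vector (the one-centroid domain model of Step 2.4)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_in_domain(docs, domain_model, threshold=0.3):
    """Step 3: keep documents whose similarity to the domain model exceeds the threshold."""
    return [d for d in docs if cosine(term_vector(d), domain_model) > threshold]
```

In SPICE the two halves are carried out by procdocs.pl (term extraction) and moredata.pl (filtering), described in Section 3.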

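Option 1 of Step 4 and the perplexity measure reported in Figure 3 can be illustrated on a toy unigram probability table. SPICE delegates the real work to the SRILM Toolkit, so this is only a conceptual sketch with made-up probabilities:

```python
import math

def interpolate(p_base, p_web, lam):
    """Mix two language-model probability tables: lam*base + (1-lam)*web."""
    vocab = set(p_base) | set(p_web)
    return {w: lam * p_base.get(w, 0.0) + (1 - lam) * p_web.get(w, 0.0)
            for w in vocab}

def perplexity(model, tokens):
    """Perplexity of a token sequence under a probability table (no OOV handling)."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))
```

A lower perplexity on held-out text indicates that the interpolated model predicts the domain better than the base model alone.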
3. Implementation Details

As with the rest of the SPICE modules, the data collection and LM training components are organized in a three-layer hierarchy:

1. Web-interface code: This is the highest-level, user-facing code, which handles user interaction and shows the progress of LM training and its performance. It is written in PHP and invokes the middle-level SPICE shell scripts. The related files, with their relative locations in the SPICE package, are:

a. spice/web/text.php - Forms the interface shown in Figure 1.

b. spice/web/code/h_text.php - Serves as a header file used by text.php to dynamically change the content of the page shown in Figure 1 based on the progress of the crawling/re-crawling processes.

c. spice/web/2col_leftnav.css - A cascading style sheet (CSS) common to all SPICE components, related to the look and feel of the SPICE pages.

2. SPICE LM scripts: These are the main tools used in the background for data collection and LM training. Written primarily in shell script and Perl, they are:

a. spice/lm/scripts/crawl.sh - Top-level script which uses the wget command for crawling URLs and calls the Perl scripts responsible for cleaning and obtaining domain terms (bigrams). Takes a URL as input and returns a list of domain terms and a domain model.

b. spice/lm/scripts/recrawl.sh - Top-level shell script that calls the Perl scripts responsible for querying Google, getting the additional URLs, crawling them and filtering the data based on cosine similarity.

c. spice/lm/scripts/procupload.sh - Top-level script that functions in almost the same way as crawl.sh except that it works on a single uploaded corpus instead of web documents. This script can be merged with crawl.sh in the future.

d. spice/lm/scripts/killprocs.sh - Responsible for aborting the crawl.sh/recrawl.sh processes if the user so desires. Not fully tested yet.

e. spice/lm/scripts/buildlm - Top-level shell script responsible for LM training and comprehensive performance testing by 5-fold cross-validation. Calls the ngram and ngram-count commands of the SRILM Toolkit for LM evaluation and training, respectively.

f. spice/lm/scripts/procdocs.pl - Used by crawl.sh, recrawl.sh and procupload.sh. Cleans up the web documents, computes the statistics of domain terms and writes the best domain terms to a file. Takes a URL as input and outputs a domain model and domain terms. Essentially performs Step 2 of the previous section.

g. spice/lm/scripts/moredata.pl - Used by recrawl.sh; takes a domain model as input along with a directory containing the additional documents obtained by re-crawling. Finds the relevant documents and outputs them as a single file of additional domain-related data.

h. spice/lm/scripts/search.pl - Used by recrawl.sh; queries Google given a file containing the list of domain terms selected by the user and outputs a file containing the URLs to crawl.

i. spice/lm/scripts/htmlproc.pm - Perl module used by procdocs.pl and moredata.pl; responsible for handling document formats such as HTML, DOC, PDF and TXT, cleaning and extracting the body text, and converting the text data to UTF-8 encoding.

j. spice/lm/scripts/charfreq.pl - Computes character frequencies for a given text corpus. Used later for learning Grapheme-to-Phoneme (G2P) rules and the pronunciation lexicon.

k. spice/lm/scripts/split.pl - Splits a text corpus into training, testing and development portions in the proportion 80%:10%:10%.

l. spice/lm/scripts/clean.pl - Removes empty lines and extra spaces in the corpus collected at the end of the data collection process. Important: this module serves as a placeholder for text normalization code which could remove punctuation and resolve numerals and other non-standard words (NSWs).

m. spice/lm/scripts/avgweights.pl - Obtains interpolation weights as a result of 5-fold cross-validation over the development data.

n. spice/lm/scripts/compile_results.pl - Compiles the results of LM performance (perplexity, OOV) and outputs them as an HTML table for display to the user.

Important: scripts (j)-(n) are used by buildlm.

3. External dependencies: These toolkits are external to SPICE and consist of:

a. SRILM Toolkit - http://www.speech.sri.com/projects/srilm/

b. Antiword - Used by htmlproc.pm for converting a DOC file to plain text. http://www.winfield.demon.nl/

c. XPDF - Used by htmlproc.pm for extracting text from a PDF file. http://www.foolabs.com/xpdf/download.html

4. Issues and Future Work

While the data collection tools have been designed to work with a wide variety of data sources, some issues remain to be addressed. These issues provide pointers to future work:

1. Handling document formats:

a. The PDF-to-text conversion needs to be tested thoroughly with different documents.

b. The Open XML document formats used by Microsoft documents are collections of compressed (.zip) XML files. These need to be handled too.

c. Compressed corpora (.tar.gz, .gz, .zip) also need to be handled.
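Item (c) could be handled along the following lines. This is a sketch of one possible approach, not existing SPICE functionality, and the function name is illustrative:

```python
import gzip
import tarfile
import zipfile
from pathlib import Path

def extract_corpus(path, out_dir):
    """Unpack a compressed corpus (.zip, .tar.gz, .gz) into out_dir.

    Returns the list of extracted file paths, sorted.
    """
    path, out_dir = Path(path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(out_dir)
    elif tarfile.is_tarfile(path):          # covers .tar and .tar.gz
        with tarfile.open(path) as tf:
            tf.extractall(out_dir)
    elif path.suffix == ".gz":              # a single gzip-compressed file
        target = out_dir / path.stem        # strip the .gz suffix
        with gzip.open(path, "rb") as src:
            target.write_bytes(src.read())
    else:
        raise ValueError(f"unrecognized corpus format: {path}")
    return sorted(p for p in out_dir.rglob("*") if p.is_file())
```

The tar check must come before the plain-gzip branch so that .tar.gz archives are expanded as archives rather than copied as single files.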

2. Improving the quality of domain terms:

a. Removing out-of-language data. Web pages often contain data that does not belong to the language of interest; for example, a Chinese web page can contain English strings like "Contact Us" or "All Rights Reserved", which are not desirable. While the current SPICE data collection tools filter out some of this data, as well as numerals, when finding domain terms, out-of-language data does show up in the list of candidate domain terms. For example, a crawl of an English website produced the frequent bigram "am pm" (referring to time) among the candidate domain terms.

b. Metric for ranking domain terms. TF-IDF and bigram frequency are currently used to score the domain terms presented to the user. When the text corpus is homogeneous, bigram frequency works well for identifying domain terms; if the corpus is heterogeneous (more than one topic), TF-IDF works well. It is difficult to determine beforehand whether the user-supplied URL is homogeneous or not. Instead, the data initially crawled by the user could be clustered using methods like Latent Dirichlet Allocation (LDA) to obtain a set of latent topics; a frequency analysis could then be performed on the documents belonging to the dominant topics to obtain domain terms. Further, such a model could also be used during the re-crawling stage to filter out irrelevant documents.

c. Handling word segmentation. Languages like Chinese that do not use whitespace to segment words require additional processing. Chinese websites in particular use the BIG5 or GB-2312 encoding to represent data; once the encoding is identified, appropriate methods can be used to handle word segmentation.

3. Text normalization: Non-standard words like numerals are often language-specific and need to be handled separately. Currently, punctuation symbols are not removed before LM training because their role depends on the language being worked on. Normalization issues like lower-casing also need to be addressed, especially for European languages.

References

[1] SPICE: Web-based Tools for Rapid Language Adaptation in Speech Processing Systems, Tanja Schultz, Alan Black, Sameer Badaskar, Matthew Hornyak, John Kominek. In Proceedings of Interspeech 2007, Antwerp, Belgium.

[2] Improving Speech Systems Built from Very Little Data, John Kominek, Sameer Badaskar, Alan Black, Tanja Schultz. To appear in Proceedings of Interspeech 2008, Brisbane, Australia.

[3] Improving Trigram Language Modeling with the World Wide Web, Xiaojin Zhu, Roni Rosenfeld. In Proceedings of ICASSP 2001, Vol. 1, pp. 533-536, Salt Lake City, USA.

[4] Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling Using Class-Dependent Mixtures, Ivan Bulyko, Mari Ostendorf, Andreas Stolcke. In Proceedings of HLT-NAACL 2003, pp. 7-9, Edmonton, Canada.

[5] Web-based Models for Natural Language Processing, Mirella Lapata, Frank Keller. ACM Transactions on Speech and Language Processing, Vol. 2, Issue 1, Article No. 3, 2005.

[6] Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, Soumen Chakrabarti, Martin van den Berg, Byron Dom. In Proceedings of WWW8, Toronto, Canada.