Enhancing Language Models for ASR using RSS Feeds


Diploma thesis (Diplomarbeit) at the Cognitive Systems Lab
Prof. Dr.-Ing. Tanja Schultz
Fakultät für Informatik, Karlsruher Institut für Technologie

by cand. inform. Lukasz Gren

Advisors:
Dipl.-Inform. Ngoc Thang Vu
Dipl.-Inform. Tim Schlippe
Prof. Dr.-Ing. Tanja Schultz

Date of registration: July 1, 2011
Date of submission: September 5, 2011

KIT, Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der Helmholtz-Gemeinschaft


I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.

Karlsruhe, September 5, 2011


Abstract

In this work, we improve the automatic speech recognition of broadcast news with time- and topic-relevant text data. Our previous method for collecting large amounts of text data for language modelling was to use the crawler in the Rapid Language Adaptation Toolkit (RLAT) with its recursive crawling implementation. This implementation is well suited for crawling large amounts of text data. However, it has shortcomings when it comes to picking out exclusively text material that is relevant for the up-to-date broadcast news shows which we intend to transcribe. To provide text data that fits our shows better, we propose crawling methods based on RSS Feeds. RSS Feeds are small, automatically generated XML files that contain time-stamped URLs of newly published updates. They can easily be found on almost all online news websites. We implemented an RSS parser in RLAT which takes RSS Feeds, extracts the URLs together with their publishing dates and collects them while preserving the time information. Then exclusively the pages corresponding to these URLs are crawled. Text data collected based on the information in the RSS Feeds of four French online news websites improved the performance of our base LM, which had been used in the Quaero programme: The word error rates of five tested broadcast news shows from Europe1 are reduced by 0.9% absolute on average with our new text data. The highest improvement is 1.5% absolute (4.4% relative). Inserting new words that occurred frequently in the RSS Feed-related webpages into our search vocabulary and pronunciation dictionary gives an additional improvement of 0.4% absolute on average compared to the results with the Quaero dictionary. The best result was 0.5% absolute (1.74% relative).


Contents

1 Introduction
  1.1 Motivation and purpose of this thesis
  1.2 Structure of this Thesis
2 Automatic Speech Recognition Basics
  2.1 ASR Basics
  2.2 Tools
    2.2.1 Rapid Language Adaptation Toolkit
    2.2.2 SRI Language Model Toolkit
    2.2.3 JANUS
    2.2.4 Sequitur G2P
3 Shortcomings of the RLAT crawler and proposed improvements
  3.1 Shortcomings
  3.2 Related work
  3.3 Proposed improvements
4 Our Work
  4.1 RSS Feeds Basics
  4.2 Implementation
    4.2.1 Implemented scripts
    4.2.2 Parsing the RSS Feed
    4.2.3 Output of the parsed links into files
    4.2.4 Adaptation of the RLAT user interface
5 Experimental Setup
6 Experiments and Results
  6.1 Experiments without dictionary adaptation
  6.2 Experiments with dictionary expansion
    6.2.1 Oracle experiments
    6.2.2 Real world results
  6.3 Summary
7 Summary and future prospects
Bibliography


1. Introduction

1.1 Motivation and purpose of this thesis

In today's globalized and connected world, there is a constant flow of information. In particular, the rapidly growing internet generates more and more useful information. The enormous amount of information generated and published every second from all over the world contains valuable data, not only text but also sounds, images and video, that can be used in many different fields of science. Not only computer science but also the social sciences are profiting. For example, by examining the social graph generated from the connections between people from all over the world on social networks, it is possible to predict epidemics [10] or analyse other health issues like obesity [9].

In this work, we focus on broadcast news. In general, news broadcast on television or on the radio is still the main source of information for many people. But being present only as spoken language in most cases, these broadcasts have some disadvantages: First, they cannot reach hearing-impaired people. Second, audio data cannot be archived and searched for a given topic like text data. Automatic Speech Recognition (ASR) software tries to solve these problems. ASR systems, as the name already states, recognize spoken language and transform it into text. Two of the main components of an ASR system are the language model (LM), which contains the probabilities of word sequences that may be spoken, and the acoustic model, which contains the statistical representation of the sounds of which spoken language consists [8]. After the spoken utterance is processed by the acoustic model, it generates a range of different possible words. This task needs a pronunciation dictionary which contains words and the corresponding phoneme sequences. With the LM, the most probable word sequence is chosen from the offered possibilities. With these elements, it is possible to convert speech to text.

As broadcast news always covers the latest developments, new words emerge frequently and different topics come into the focus of attention. To adapt an ASR system for broadcast news, it is necessary to update it with text data that is in close temporal proximity to the date of the broadcast news show, is part of the same domain

and comes from the same language. Close temporal and topical proximity of the text data ensures that the words and sentences contained in the news show have a higher probability of fitting than when using text data from long before the show. In this work, the crawling functionality of the CSL's system for building speech processing systems, the Rapid Language Adaptation Toolkit (RLAT) [14], is improved to make this possible. Currently, the RLAT crawler can only ensure that the language of the crawled texts fits the target language. We propose to use RSS Feeds [6] to crawl text data which features time- and topic-relevance. RSS Feeds are small XML files that are automatically published on many websites. They are updated every time a new article is published and contain not only the URL of the article but also its publishing date. Based on this information, we parse them and create chronologically sorted files that contain the URLs of the new articles. The pages corresponding to the URLs are then crawled and LMs are built. These are then interpolated, in other words mixed together, adapting the word sequence probabilities with an already evaluated LM (the Quaero LM). The performance of the resulting LM is tested and compared to the results using solely the Quaero LM. Running experiments with a fixed pronunciation dictionary and an extended dictionary, we also evaluated the impact of adding new words found in the text data crawled with the help of the RSS Feeds.

1.2 Structure of this Thesis

In the previous section, the motivation behind this thesis and the steps to achieve its goal have been explained. In Chapter 2, a brief overview of the basics of speech recognition is given and the tools that were used are described. Short overviews of the Rapid Language Adaptation Toolkit (RLAT), the SRI Language Model Toolkit, Sequitur G2P and JANUS are given. In Chapter 3, the shortcomings of the currently used system are described and ways to fix them are sought in related work. In Chapter 4, an overview of our implementations is given. In Chapter 5, the requirements for running the experiments are presented. In Chapter 6, our system is evaluated in various experiments. Upcoming problems are described and the results are presented. Chapter 7 finishes the thesis with a conclusion.

2. Automatic Speech Recognition Basics

2.1 ASR Basics

The term Automatic Speech Recognition (ASR) denotes the process by which human speech is converted to written text. The three main components of an ASR system are the acoustic model, the pronunciation dictionary and the language model. After speech is produced in the speaker's vocal tract, a signal processing component transforms it into a sequence of acoustic vectors. These vectors model acoustic phonemes. The acoustic model contains phonetic knowledge and has to deal with the characteristics of acoustic differences between genders or dialects. Given the knowledge present in the acoustic model, the most probable phonetic sequences are chosen and transformed into a sequence of words. This happens with the help of a pronunciation dictionary, which contains the mapping of phonetic sequences to words.

The acoustic model is not always able to recognize words correctly. For example, it may offer the candidates "Merry Christmas" and "Very Christmas". Now the language model has to choose the correct candidate. To be able to make the correct decision, a language model holds information about the probabilities of word sequences. To build a language model, a large amount of text data is analysed. The language model would choose "Merry Christmas", as the word "Merry" stands in front of the word "Christmas" much more often than the word "Very" in English texts. By combining the knowledge of phonetics (acoustic model) and linguistics (language model), speech can be recognized.

There are different metrics to evaluate the performance of an ASR system. One of them is the out-of-vocabulary (OOV) rate, which states the percentage of words in a reference text that are not part of the language model or the search vocabulary. Another criterion is the perplexity, which is a measure of how many successors a word can have on average in a language model. If the perplexity is high, the recognizer can choose from a high number of possible word sequences, leading to a greater probability that errors occur. If the OOV rate and/or the perplexity are high, this usually leads to a high word error rate (WER), which states the fraction of words that have to be inserted, deleted or substituted to achieve a perfect recognition.
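In the standard formulation, for a test text $W = w_1 \dots w_N$ the perplexity of a language model $P$ is

\[ PP(W) = P(w_1 \dots w_N)^{-\frac{1}{N}} \]

and, with $S$ substitutions, $D$ deletions and $I$ insertions needed to turn the recognizer output into a reference of $N$ words, the word error rate is

\[ \mathrm{WER} = \frac{S + D + I}{N} \]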

2.2 Tools

The tools which we used are the Rapid Language Adaptation Toolkit (RLAT), acting as the front-end for collecting text data from the internet; the SRI Language Modeling Toolkit, which is responsible for language model related tasks; JANUS, which is used to recognize speech; and Sequitur G2P, which generates the pronunciations needed when adding new words to the dictionary.

2.2.1 Rapid Language Adaptation Toolkit

The Rapid Language Adaptation Toolkit (RLAT) [14] is an extension of the Speech Processing Interactive Creation and Evaluation Toolkit for new Languages (SPICE) [15] developed at Carnegie Mellon University. RLAT is based on GlobalPhone [16] and Festvox. The main purpose of RLAT is to reduce the time needed for building speech processing systems for new languages. The system is web-based and offers all tools needed to complete this task. The required time is reduced, and even novice users are enabled to build a speech processing system. One of the major features of RLAT is the use of data sharing between languages and system components. The creation process is divided into nine steps. From collecting the required text data to selecting phonemes, everything needed to build a complete ASR system is present.

In this work, only the component of RLAT that collects text data is used. This component offers different ways to add text data, which can then be used to build language models. Text data might be uploaded directly if already present, or obtained by crawling the web for text resources. If the text data is crawled, a URL and the depth of the links to be followed are entered (see Figure 2.1). The RLAT crawler gets the text from the URL entered before and checks if any URLs are present on this page. If so, the URLs are followed and the corresponding text is downloaded. This process continues until the configured link depth is reached. By choosing a high link depth, it is possible to get huge amounts of text data. After the crawling finishes, the text data is normalized. This means that things like punctuation or the format of numbers are converted to a consistent format. This text can then be used for the next steps of the process to build an ASR system.

Figure 2.1: Standard text crawling in RLAT

2.2.2 SRI Language Model Toolkit

The SRI Language Modeling Toolkit (SRILM), developed by the SRI Speech Technology and Research Laboratory (STAR Lab) [17], is a freely available collection of C++ libraries, executable programs and helper scripts. They allow the creation of statistical language models for speech recognition and their evaluation. The main tasks that SRILM is used for in this work are the creation of language models from text data, the interpolation of the models and the computation of the perplexity. These tasks are performed by two tools included in the toolkit: ngram and ngram-count.

ngram-count processes text data by counting how often words and word sequences occur in the text to build a language model. It can be customized by a number of options. The n-gram order (meaning the maximum length of word sequences), a custom vocabulary (if the number of words in the dictionary should be restricted), the discounting algorithm (Good-Turing, absolute, Witten-Bell, and modified Kneser-Ney are supported) and how to treat unknown words are the most commonly used options.

ngram is responsible for the evaluation of language models. It takes a language model and a file containing test data and computes the perplexity of the language model on this test data. Another important task is the interpolation of language models. Given two or more LMs, they can be combined using linear interpolation [7]. The optimal interpolation ratio between those LMs can be computed when appropriate text data is given, on which the performance should be optimized.
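To illustrate how these two tools interact, the following Perl fragment sketches the typical sequence of calls; the file names, the trigram order and the fixed interpolation weight are placeholders, not the exact settings of our setup.

#!/usr/bin/perl
use strict;
use warnings;

# Build a language model from normalized text with ngram-count.
system('ngram-count',
    '-order', 3,                   # maximum n-gram length
    '-text',  'crawled.txt',       # normalized training text
    '-vocab', 'vocab.txt',         # restrict to a fixed vocabulary
    '-kndiscount', '-interpolate', # modified Kneser-Ney discounting
    '-lm', 'rss.lm') == 0 or die "ngram-count failed: $?";

# Linearly interpolate with the baseline LM; -lambda is the weight
# of the first model, here 0.8 for the baseline.
system('ngram', '-lm', 'quaero.lm', '-mix-lm', 'rss.lm',
    '-lambda', '0.8', '-write-lm', 'mixed.lm') == 0 or die "mixing failed: $?";

# Compute the perplexity of the result on reference text.
system('ngram', '-lm', 'mixed.lm', '-ppl', 'reference.txt');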

2.2.3 JANUS

The JANUS [19] speech recognition software has been developed jointly at the Universität Karlsruhe and Carnegie Mellon University and is commonly used at both institutions. In this work, JANUS is used to convert speech to text. The toolkit provides objects for all common speech recognition approaches. It can deal with different Hidden Markov models (HMMs) and different acoustic model and speech recognizer architectures. The data structures of these objects can be modified with scripts. Therefore it is possible to test new ideas easily.

2.2.4 Sequitur G2P

Sequitur G2P [5] has been developed at RWTH Aachen University. Sequitur is a grapheme-to-phoneme converter. A grapheme is the fundamental unit of written language (such as an alphabetic letter); a phoneme is the fundamental linguistic unit of speech. Sequitur G2P transforms any sequence of graphemes into a sequence of phonemes, using statistical models to achieve this task. To build these models, a training has to be performed: An already existing pronunciation dictionary is taken and a model is trained in an iterative process. This model can then be used to generate pronunciations for a list of words without known pronunciations.
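As a sketch of this workflow, using the command-line options from the Sequitur G2P documentation (file names are placeholders):

# Train a G2P model on an existing pronunciation dictionary and apply
# it to a list of new words; part of the training data is held out
# with --devel to monitor convergence.
system('g2p.py --train base.dict --devel 5% --write-model model-1') == 0
    or die "G2P training failed";
system('g2p.py --model model-1 --apply new_words.txt > new_words.dict') == 0
    or die "G2P application failed";

In practice, the model is usually refined in several --ramp-up iterations over the previous model before it is applied.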

3. Shortcomings of the RLAT crawler and proposed improvements

The functionality of the currently used system is not designed to collect small chunks of data fitting well-defined periods. In this chapter, the shortcomings of the current system are described and improvements based on related work are proposed.

3.1 Shortcomings

The main shortcoming of the RLAT crawler is that it is not possible to automatically crawl websites for news that fit accurately into a time interval. After entering a start URL, all pages up to a defined depth are crawled. This leads to a big overhead, especially when choosing a high link depth, as pages are collected that are much older. Limiting the crawl to just one day is only possible by manually collecting links published on a certain date and then crawling and processing them with the RLAT crawler. This is not only time-consuming; it is also not always possible to access older articles, since they may have been removed or moved to an unreachable destination.

It is also difficult to crawl exclusively text on a certain topic. For example, one can start the crawl with a sports news story to get some sports-related text data, but after reaching a certain depth, other topics may be reached if the website contains different kinds of news.

This crawling method also collects data that might already have been crawled earlier, as no information is stored about which web pages were visited in an already finished crawl. For example, a new crawl is started with a link depth greater than one and finishes after some time. A few days later, we decide to crawl text from this website again. After it finishes, we will probably see that some text crawled in this session was already crawled a few days ago in the previous session. This method obviously leads to more overhead than crawling websites incrementally on a regular basis.

Collecting text data with this method also consumes much more space and bandwidth. The cause of this is that when a high link depth is chosen, this method follows all the links and fetches all the corresponding text until it finds no more links or the link depth is reached. If a website stores its old articles such that they can be reached just by following links, the collected text data may consume a great deal of space and bandwidth.

The massive amounts of text data that can be crawled with the current method can be used as a base for building language models that cope relatively well with different kinds of speech (depending on the topic and domain of the crawled text). But to improve the performance of language models used for the recognition of current news, it is essential to get text data that fits the time of the broadcast.

The difficulty of collecting time-relevant text data with the current approach also leads to problems when the dictionary has to be expanded to recognize new words. When crawling massive amounts of text data, the number of potential new words can grow very large and, if not somehow reduced before being added to an existing dictionary, can lower the performance of the language model.

3.2 Related work

To deal with these problems, we searched for a method to realize incremental crawling. One particular solution is described by Adam et al. [2]. They propose a way to collect fresh content from the web based on RSS Feeds. These feeds can be found on almost every website and contain data about newly published content. Furthermore, they monitor a large number of those feeds and check for new content in self-adapting time intervals.

To deal with the problem of new words in broadcast news that are not part of the lexicon of the ASR system, Ohtsuki et al. [13] propose a method for adding words to the lexicon. They expand an approach by Auzanne et al. [4], who add and remove words from the lexicon based on how frequently a word appears and on how many days it appeared. Ohtsuki et al., besides these two factors, also consider when a word last appeared. They combine all three factors using probabilistic methods and call the result a probabilistic rolling language model. With this method they could improve the OOV rates and word error rates on broadcast news.

Thang et al. [18] describe the building of ASR systems using RLAT. By collecting massive amounts of text data and interpolating the language models on a day-wise basis, OOV rate and perplexity can be lowered in contrast to language models without interpolation. Moreover, they describe that using multiple sources of text data improves the performance.

In this work, we use the incremental crawling method described by Adam et al. Because we work with only a few RSS Feeds, we do not implement the self-adapting time intervals. Additionally, we adapt the lexicon, but in contrast to Ohtsuki et al. we do not use probabilistic methods and do not remove words from the lexicon. We add words to the lexicon based on how frequently they appeared recently. The day-wise interpolation described in [18] cannot be realized in our work, as we lack the data to compute the necessary interpolation ratios. As proposed, we use multiple sources of text data.

3.3 Proposed improvements

Following Adam et al., a new crawling strategy is proposed to reduce the overhead. Almost every website offers at least one RSS Feed that covers all changes to the website or just the changes in a certain part of it. This feed can be used to collect only the text of new articles. This crawling method significantly reduces duplicate content when crawling regularly, compared to the method described in 3.1, because it allows us to collect text incrementally. With the help of this technique, it is also possible to collect data that was published on a specific day, which may help to improve the performance of language models or keep the vocabulary up-to-date.

One downside of using RSS Feeds is that they have to be monitored constantly for updates, as no archive is openly available. Archives like the one built by Google for their Reader service cannot be accessed, because no public API is provided that would enable us to collect that data. It also takes longer to collect larger amounts of text compared to crawling everything until a certain link depth is reached.

Additionally, as Ohtsuki et al. and Thang et al. have shown, adapting the vocabulary improves the general performance. Having time-tagged text data available through RSS Feeds provides the opportunity to extract words that appeared in a certain time interval. They can be used to update an existing vocabulary with words that fit not only a certain time interval but also a certain topic, if the appropriate RSS Feed is chosen. In contrast to the work of Ohtsuki et al., we add words to our system depending on the number of occurrences and how recently a word appeared.

As described in [18], using multiple sources to increase text diversity leads to a better performance of the language models. This is easy to achieve with RSS Feeds, as they are available on most websites. In this work, four text sources will be used.

But diversity is not the only concern. Reducing the noise remaining in the text data after text normalization is another task we address in this work. We define noise as text that is not part of the news article, like website navigation elements, advertising or user comments. Adding such text may harm not only the n-gram probabilities but also the perplexity of the language model, as words containing spelling mistakes are added. Therefore, a better method to remove at least some of this unnecessary text data is analysed.

All suggested improvements are evaluated in this work. We start with the implementation of the new crawling strategy in the next chapter.


4. Our Work

4.1 RSS Feeds Basics

News websites and blogs in particular produce a constant flow of new content, and users often struggle to cope with all the information published. Typically, the contents of the front pages can change quickly, depending on the events that dominate the news. To inform users about changes on a website, version 0.91 of the RSS format was introduced in 1999 by Netscape. Since then, its popularity has grown, and more and more websites (especially blogs and news websites) are using this technology.

RSS (Really Simple Syndication) is a standardized format for publishing frequently updated content. As it is based on the Extensible Markup Language (XML), it is very easy to parse and view with a broad range of applications. Since its introduction the format has evolved, and today RSS 2.0 is the most used version [6].

An RSS Feed contains information about updates and can also contain links to enclosures or pictures. A basic feed is shown in Listing 4.1. Every feed begins with information about which encoding and which XML and RSS version is used. It is always followed by the <channel> tag, which has to contain the following elements: a title, which should be the title of the website the RSS Feed is based on, the URL of the corresponding website, and a short description of the contents of the feed. There is also a number of optional channel elements. Most of them are not helpful for our task, for example a copyright notice or the address of the webmaster. However, the <lastBuildDate> element, which states the date of the last change to the channel, may be used for information retrieval purposes. Rarely used are the <skipHours> and <skipDays> elements, which indicate hours or days in which the feed is not updated.

The most important part of the feed is located between the <item> tags. A feed can contain any number of items, but almost all feeds contain only a fixed number of items.

Listing 4.1: Sample RSS feed

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>RSS Title</title>
    <description>This is an example of an RSS feed</description>
    <link>http://www.someexamplerssdomain.com/main.html</link>
    <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000</lastBuildDate>
    <pubDate>Mon, 06 Sep 2010 16:45:00 +0000</pubDate>
    <item>
      <title>Example entry</title>
      <description>Here is some text containing an interesting description.</description>
      <link>http://www.wikipedia.org/</link>
      <guid>unique string per item</guid>
      <pubDate>Mon, 06 Sep 2010 16:45:00 +0000</pubDate>
    </item>
  </channel>
</rss>

All elements of <item> are optional, but either a title or a description has to be given. Most feeds contain at least a title and a link to the content. The description element is commonly used to give a short synopsis of the main article or contains its first few sentences. In general, it is also possible that the description element contains the full text of the article, but this is rarely done by companies that earn money by showing advertising on their websites. Offering the whole text via RSS Feeds, they would lose visitors and therefore valuable page views. Most important for us is that not only a link to the main content is given, but also the <pubDate> element, which states the time when the article was published. Often there is a <guid> tag (globally unique identifier) which contains a string that uniquely identifies the item for the publisher. In some cases, the <guid> element carries the attribute isPermaLink, which can be set to either true or false. If true, the URL can be seen as a permanent link to the content.

If time-relevant text data is to be crawled, the essential pieces of information that need to be parsed from the RSS Feed are the publishing date (found in the <pubDate> element) and the corresponding URL (inside the <link> element).

4.2 Implementation

In this chapter, an overview of the implementation needed to fulfil the required functionality is given. First, the scripts running constantly in the background to fetch and parse the feeds are described. Afterwards, the changes to the user interface of RLAT are characterized.

4.2.1 Implemented scripts

Our goal is to collect the desired RSS Feeds constantly, as they contain only a fixed number of articles. This means that if one new article is added, the information about

the formerly oldest article present in the feed is pruned. To achieve this goal, a script was written that downloads the XML file in certain time intervals and checks whether any news items are present that were not already processed and saved in an earlier run. Due to the low complexity of the task, and as most of the other RLAT tasks are realized with Perl scripts, Perl was chosen for this implementation. Our main script performs the following tasks:

- downloading the RSS Feed
- parsing the feed
- output of the parsed links to files

All tasks are repeated until they are cancelled by the user or until a given number of days has passed. This is achieved by putting the tasks in a loop. A wait time of 30 minutes between two loop iterations proved sufficient to catch the updates issued throughout the day. The script requires:

- the URL of the RSS Feed
- the time in days the script should run

The file the RSS Feed is located in is downloaded with the help of curl [11]. After downloading the feed, the parsing begins, which we describe in greater detail in the next section.

4.2.2 Parsing the RSS Feed

To parse the XML file, the Perl module XML::Simple [20] is used. It builds a tree representation of the XML file. All items can be parsed by following the branches of the tree until the desired item is reached. The elements required for our task are:

- the date when the article was published
- the URL of the corresponding article

The publishing date can be found in the element <pubDate> of each item. The date follows the specifications of RFC 822 [1]. A typical date in this format looks like this: Fri, 08 Oct 2010 14:27:00 +0200. To work with it more easily, we convert it into a simpler format with the Perl module Date::Parse [12], which transforms the date into the following format: YYYY-MM-DD.

The link to the published article can be stored in two places. Usually the link can be found in the <link> element of the item. However, it is better to look for a permalink, which usually has a longer lifetime. Permalinks are stored in the <guid> element with the attribute isPermaLink; if the attribute is true, that link is used. After downloading and parsing all items of a feed, the script waits 30 minutes and repeats the process.
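The following condensed sketch shows the core of this parsing step; the feed URL, file names and the output layout are placeholders, not the exact ones used in RLAT.

#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;
use Date::Parse;
use POSIX qw(strftime);

my $feed_url = 'http://www.example.com/rss.xml';   # placeholder

# Download the feed with curl, as in our implementation.
system('curl', '-s', '-o', 'feed.xml', $feed_url) == 0
    or die "curl failed: $?";

# Build a tree representation of the XML file.
my $feed  = XMLin('feed.xml', ForceArray => ['item']);
my $items = $feed->{channel}{item} || [];

foreach my $item (@$items) {
    # Prefer the permalink in <guid isPermaLink="true"> over <link>.
    my $url = $item->{link};
    if (ref($item->{guid}) eq 'HASH'
        and ($item->{guid}{isPermaLink} || '') eq 'true') {
        $url = $item->{guid}{content};
    }

    # Convert the RFC 822 date in <pubDate> to YYYY-MM-DD.
    my $epoch = str2time($item->{pubDate}) or next;
    my $day   = strftime('%Y-%m-%d', localtime($epoch));

    # Append the URL to the file collecting links for that day.
    open my $fh, '>>', "links-$day.txt" or die $!;
    print {$fh} "$url\n";
    close $fh;
}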

4.2.3 Output of the parsed links into files

For each item of the feed, the publication date and the URL are parsed and written to a file. After all <item> elements of the RSS Feed are parsed, the contents of the file containing the publication dates and URLs are sorted. Sometimes, shortly after publishing, the headlines of articles are changed, and with the change of the headline the URL may change too. As every change made on the website is visible in the RSS Feed, this results in collecting different URLs leading to the same article. To eliminate these duplicates, we compare the URLs and discard those that coincide in a certain number of characters. This number was optimized for the RSS Feeds used in this work; when using other RSS Feeds, it must be adapted. The file is then processed again: the links are sorted according to the day they were published, and every link with a specific timestamp is written to separate files that contain only URLs belonging to one day.
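The duplicate filter can be sketched in a few lines of Perl; the prefix length of 60 characters is a placeholder for the feed-specific value we tuned, and @urls stands for the list of URLs collected for one day.

# Keep only the first URL for each prefix; later URLs sharing the
# same leading characters are treated as renamed duplicates.
my $prefix_len = 60;    # placeholder, tuned per feed
my %seen;
my @unique = grep { !$seen{ substr($_, 0, $prefix_len) }++ } @urls;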

4.2.4 Adaptation of the RLAT user interface

All described procedures were integrated into the existing RLAT. Figure 4.1 shows our new interface. A new subitem called RSS functionality was added to the navigation bar. All files containing URLs that were collected based on the RSS Feed information are displayed and can be selected to be crawled with the RLAT crawler.

Figure 4.1: Adapted RLAT User Interface for RSS Feed based Text Collection

To start monitoring a new feed, its URL and the time the RSS Feed should be watched have to be entered. After clicking the Crawl RSS button, the scripts start downloading the feed, parsing it for URLs to new content and writing those to files. The files that can be seen in Figure 4.1 are updated and created every time the RSS Feed is parsed again to check for new articles (every thirty minutes).

Each file containing collected URLs can be selected and sent to the standard RLAT crawling script described earlier via the CopyCrawl RSS button. Then all links are crawled, cleaned, and the crawled text data is normalized. This crawled data can then be found in the Text management area, from where it can later be used for building language models.


5. Experimental Setup

To check whether language models can be improved with RSS-based crawled text, we conducted several experiments. Before these experiments could be started, some data had to be collected. As we want to evaluate the performance on broadcast news, the RSS Feeds of the websites of four major French news services were chosen to be monitored for new articles. Then all articles that fit the time interval were crawled with the RLAT crawler. The following feeds were downloaded over a period of six months:

- Le Parisien: approximately 1,400 updates per week
- Le Point: approximately 2,300 updates per week
- Le Monde: approximately 400 updates per week
- France24: approximately 170 updates per week

These four sources were chosen as they provide a huge number of newly published articles compared to other news websites. The number of updates was obtained with the help of statistics provided by Google's Reader service. The number of tokens collected daily from these sources varies heavily from day to day, as shown in Figure 5.1 and Figure 5.2. This is usual, as the amount of news depends not only on the weekday (for example, fewer new articles on Sundays) but also on the nature of the news: some events lead to a greater number of articles than others.

Figure 5.1: Tokens per day - collected between 1/17/11 and 2/16/11

Figure 5.2: Tokens per day - collected between 1/31/11 and 3/2/11

To evaluate the impact of the new time- and topic-relevant texts collected with the help of RSS Feeds, French broadcast news shows needed to be obtained. We downloaded radio broadcasts of the 7 am news from Europe1. Each show usually has a duration of about 10 minutes. The shows could be downloaded easily, as the URLs leading to the MP3 files of the episodes were published through RSS Feeds. In preparation for decoding with JANUS, the downloaded episodes were first transcoded from MP3 stereo to WAVE 16 kHz mono files. After that, a speaker segmentation and clustering was performed and database files were created. These database files contain information about the speakers and the time codes of the utterances in the audio file.
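Assuming a standard converter such as ffmpeg (the concrete tool is interchangeable), the transcoding step can be sketched as:

# Convert every downloaded episode from MP3 to 16 kHz mono WAVE;
# -ar sets the sample rate, -ac the number of channels.
foreach my $mp3 (glob '*.mp3') {
    (my $wav = $mp3) =~ s/\.mp3$/.wav/;
    system('ffmpeg', '-i', $mp3, '-ar', '16000', '-ac', '1', $wav) == 0
        or warn "conversion failed for $mp3\n";
}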

We collected almost all episodes since the beginning of 2011. Five of these episodes were chosen to be transcribed by a French native speaker. To ensure that the news differ in their topics, a period of about two weeks lies between each episode except one (kept for testing purposes). To reduce the transcription work, each episode was initially transcribed roughly using the French Quaero P3 evaluation system with our baseline language model. This first transcription was then checked and corrected by a native speaker. The resulting word error rates of the decodings using the unchanged Quaero LM (baseline LM) are illustrated in Figure 5.3.

Figure 5.3: Baseline Word Error Rates

These corrected transcriptions were used as references to compute the perplexities and OOV rates of the language models built with the downloaded texts. The amount of collected text data is not very big; therefore, language models built exclusively from these texts do not perform well on our shows. Consequently, language models created from these texts were interpolated with our good baseline model.

Although the text data was normalized after it was crawled, it still contained text we did not need (see 3.3 for details). Therefore, we applied additional text processing to the crawled texts. The main goal of this text processing was to reduce the number of words that would qualify to be added to the dictionary, since we later search these texts for new words to add to the dictionary. One example: after crawling and normalizing the text data, some words were present in different representations (e.g. the correct form "politicien" vs. "POliticien" vs. "polliticien"). Many such typos were present in our crawled texts, because the RLAT crawler often did not filter out user comments. Also, as the RLAT normalization is rule-based, it can make errors. All of these factors lead to a large number of new words that are in fact not part of the official form of the language. Therefore, they should not become part of the dictionary, as they may lower the speech recognition performance. Another reason for a good normalization is that pronunciation generation for malformed words is not possible with Sequitur G2P.

The crawled text included lines with falsely capitalized words, or lines that were just sequences of capitalized words which did not form valid sentences, for example "TWITTER FACEBOOK". To remove those, all lines with fewer than three words and all lines containing more than 50% capitalized words were deleted. Because the previous RLAT text normalization did not remove all user comments on some pages, text was also collected that in some cases contained a larger number of typos or was written in other languages. This non-standard text might later enlarge the pronunciation dictionary with misspelled words. To avoid this, a spell check with GNU Aspell [3] was performed. Aspell used a French dictionary to mark those words in the crawled text that are not in the dictionary, listing similar dictionary words where they exist. To get rid of low-quality text, all lines containing more than 50% marked words were eliminated.
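The line filters described above can be sketched as follows (file names are placeholders; the Aspell-based filter works analogously on Aspell's list of marked words):

# Drop very short lines and lines that consist mostly of fully
# capitalized words, following the heuristics described above.
open my $in,  '<', 'normalized.txt' or die $!;
open my $out, '>', 'cleaned.txt'    or die $!;
while (my $line = <$in>) {
    my @words = grep { length } split /\s+/, $line;
    next if @words < 3;                           # fewer than three words
    my $caps = grep { /^[[:upper:]]+$/ } @words;  # all-capital words
    next if $caps / @words > 0.5;                 # more than 50% capitals
    print {$out} $line;
}
close $in;
close $out;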
Including our new text processing steps resulted in some improvement, as shown in Figure 5.4. Not only was the perplexity reduced, but also the number of new words in the crawled text that were not part of the baseline dictionary. For example, before our new method there were about 40k words in the texts from one crawling period that were not part of the baseline dictionary; after the improved text preprocessing, only 32k such words remained. All contained OOV words survived this additional cleaning process. This not only helps to achieve better speech recognition results but also helps to cope with the large amount of potentially new vocabulary that has to be considered.

To compute a good interpolation weight between the baseline LM and the LM built from our texts, the SRILM tools need a development set. In real-world situations, a matching development set is not known in advance. That is why we performed a range of experiments to estimate this ratio. In these experiments, the impact of different interpolation weights between the base LM and the LM generated from the collected text data was evaluated via the perplexity. After building LMs from the text data of different periods, they were interpolated with the base LM using 0.3, 0.5 and 0.8 as weights. The experiments have shown that a good weight for the baseline is 0.8 (see Figures 5.5 and 5.6). This is the standard weight for the rest of the experiments.

With the corrected transcriptions and the crawled texts, the number of OOV words in the transcriptions with respect to the baseline dictionary could be determined. The results in Figure 5.7 show that not all of those words could be found in the crawled texts. The number of unknown words is quite low, but high enough to examine the impact of adding those new words to the dictionary on the speech recognition performance.
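When a development text is available, SRILM can also estimate the weight directly instead of via a grid search; a sketch, assuming a development file dev.txt and the two LMs from above:

# Write detailed per-n-gram log-probabilities for both LMs (-debug 2),
# then let SRILM's compute-best-mix helper script search for the
# interpolation weight that minimizes the perplexity.
system('ngram -lm quaero.lm -ppl dev.txt -debug 2 > ppl.quaero') == 0 or die;
system('ngram -lm rss.lm -ppl dev.txt -debug 2 > ppl.rss') == 0 or die;
system('compute-best-mix ppl.quaero ppl.rss');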

Figure 5.4: Possible improvement with vocabulary adaptation

Figure 5.5: Interpolation

Figure 5.6: Interpolation

Figure 5.7: Number of OOV words in transcriptions with baseline dictionary and number of those occurring in our crawled texts

6. Experiments and Results

In this chapter, the experiments are described. In the experiments, we checked how time-relevant text data and expanding the dictionary improve the ASR performance.

6.1 Experiments without dictionary adaptation

To estimate how texts downloaded from different periods of time impact the performance of language models, several experiments were conducted. The daily collected texts from the four RSS Feed sources (Le Parisien, France24, Le Point, Le Monde) were concatenated. To check for performance improvements, LMs built from the crawled text data of different periods were interpolated with the baseline Quaero LM. For each chosen broadcast news show, we took all texts published at most thirty days before the day of the transmission into account. With t0 being the day of the transmission and t30 the day thirty days before it, we built the following LMs from:

- all texts from t30 until t20 before the transmission (t30->t20)
- all texts from t30 until t10 before the transmission (t30->t10)
- all texts from t30 until t5 before the transmission (t30->t5)
- all texts from t30 until t0 before the transmission (t30->t0)

These language models were interpolated with the base language model using a weight of 0.8 for the baseline (see Chapter 5). Then the vocabulary was reduced to the size used in the baseline language model (around 170k). Only a small percentage (around 2%) of the words used in the news broadcasts was not covered by the baseline dictionary. The impact of an expanded vocabulary is examined later. The resulting language models were evaluated on the reference transcriptions via the perplexity and via the WER after decoding with JANUS. The results are illustrated in Figures 6.1 and 6.2.

Figure 6.1: Perplexity of interpolated LMs

Figure 6.2: Word Error Rate of interpolated LMs

All results are obtained with interpolated LMs, except for the results labeled "only Quaero LM". It can be seen that the perplexity declines more when text data closer to the transmission date is added to the language model (see the change between t30->t5 and t30->t0). We also decoded with text data from the date of the transmission and the five days before it (t5->t0). As the graphs show, the performance of the interpolated language models with these smaller time intervals was equal to or better than the performance with greater amounts of data. The differences in perplexity and word error rate between the broadcast shows can be explained by their different content: the broadcasts sometimes contained music samples or interviews with people from other countries having a strong dialect, which harms the performance. To improve our results, vocabulary adaptation has to be considered. Therefore, the possible gain of vocabulary adaptation was evaluated.

6.2 Experiments with dictionary expansion

We showed that the baseline dictionary does not contain all words that occur in the news broadcasts. These words are therefore recognized incorrectly, leading to a higher WER. Consequently, adding new words to the dictionary may improve the speech recognition performance.

6.2.1 Oracle experiments

To check the best-case scenario of an enlarged dictionary, oracle experiments were performed. This means that all OOV words are known, because we have transcriptions of the broadcasts. Because the interpolated language models with texts from t5->t0 performed best, and the number of OOV words covered in t5->t0 was, for three of the five shows, not much lower than the optimum (see Figure 5.7), these language models were chosen to be enriched with the OOV words. All OOV words present in t5->t0 were also added to the baseline dictionary and a decoding was started. The results in Figure 6.3 show that a considerable improvement in WER is possible by adapting the vocabulary. Therefore, we decoded the shows with a vocabulary that was enriched with new words from up to five days before each show. Moreover, we added unknown words found in the text data from up to thirty days before each broadcast.

6.2.2 Real world results

Despite the additional text preprocessing steps (described in Chapter 5), there were still many words that were misspelled or falsely capitalized. To further reduce this number, we counted all words in t5->t0. As the text data of the day of the transmission (t0) contained almost all OOV words found in the texts from t5->t0, we selected only these texts for further processing. All words that were not part of the baseline dictionary and appeared only once in the texts from t5->t0 were eliminated. This reduced the vocabulary used for the expansion of the base dictionary (see Table 6.1). These new words were added to the base vocabulary and to the individual language models for each show. In addition, the pronunciations for the new words were generated using Sequitur G2P and added to the baseline dictionary.
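The selection of dictionary candidates can be sketched as follows; file names are placeholders, and the threshold implements the rule that singletons are discarded.

# Load the baseline vocabulary.
my %base;
open my $dict, '<', 'baseline.vocab' or die $!;
while (my $w = <$dict>) { chomp $w; $base{$w} = 1; }
close $dict;

# Count all words in the texts from t5->t0.
my %count;
open my $txt, '<', 't5_t0.txt' or die $!;
while (my $line = <$txt>) {
    $count{$_}++ for grep { length } split /\s+/, $line;
}
close $txt;

# Keep words that are unknown to the baseline and occur more than once.
my @candidates = sort grep { !$base{$_} && $count{$_} > 1 } keys %count;
print "$_\n" for @candidates;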

Figure 6.3: Possible improvement with vocabulary adaptation

Table 6.1: New words after cleaning (rows: words not in the dictionary, and words with occurrence > 1; columns: the shows from 1/16/11, 1/19/11, 2/2/11, 2/16/11 and 3/2/11)

Figure 6.4: Results after vocabulary adaptation

The results with the vocabulary expansion are shown in Figure 6.4. It can be seen that including these additional words (see Figure 6.5 for a comparison of the OOV rates) leads to improvements in the speech recognition performance. However, even with this comparatively small expansion of the dictionary, the perplexity begins to rise.

Figure 6.5: Results after vocabulary adaptation

The next step was to evaluate whether an additional improvement can be achieved by expanding the dictionary with all words from t30->t0 that are not part of the dictionary. By cleaning the text data, the vocabulary of words not found in the base dictionary could be reduced, but the expansion still increases the base dictionary by about 20%. The results for t30->t0 are shown in Figure 6.6. Two results improved significantly (up to 1.4% absolute better); the other three stayed the same or even got worse (up to 0.9% absolute worse). Those results can be explained by additional confusions introduced by inadequate new words (extending the baseline dictionary by 20%). If these confusions can be lowered by decreasing the number of words added to the dictionary, further improvement might be achieved.

6.3 Summary

Our experiments have shown that the performance of language models can be improved by using text data crawled with the help of RSS Feed information. There is also a strong correlation between the perplexity and the temporal proximity of the text data used to build the LMs for interpolation to the date of transmission. By using exclusively text data from a few days before the transmission, the perplexity and WER could be improved to the same level as or better than when using text data from the full thirty days before the broadcast. In the first experiments, the baseline dictionary was used for decoding to see the impact of time-relevant text data alone. In this case, an improvement of 0.9% absolute on average compared to the unchanged Quaero LM could be achieved (even 1.5% absolute, 4.4% relative, in the best case).

Using small amounts of data made it easier to extend the vocabulary with previously unknown words, which leads to further improvements. Despite normalizing the crawled text data in various ways, it still contained a lot of unnecessary data. This unneeded data leads to a high number of previously unknown words that are either misspelled or not from the target language. Adding those to the dictionary increases the perplexity. Their number was reduced by adding

Figure 6.6: Results after expanding the baseline dictionary with all the unknown words from t30->t0

only new words that occur at least twice in the texts collected up to five days before the broadcasts. After extending the dictionary, an additional improvement of 0.4% absolute on average was achieved compared to the former results without vocabulary adaptation. Combining time-relevant text data and an expanded dictionary, the overall performance of the Quaero LM could be improved by 1.16% absolute on average (about 4% relative).

7. Summary and future prospects

Our work has shown that the proposed approach for crawling text data helps to boost the performance of already existing LMs. Not only can a lot of time be saved by collecting exclusively new data; it also becomes possible to use only text data that is time- and topic-relevant for the transmission to be recognized. We showed that even if the amount of text data is relatively small, the results can be as good as or better than using a huge amount of data that does not fit the time frame or the topic.

The time-relevant text data enables us to extract words fitting a certain time interval and therefore fitting better to the broadcast news shows to be recognized. Adding too many unnecessary words to the base dictionary produced mixed results. Addressing this issue, for example by fully implementing the probabilistic rolling language model proposed in [13], could probably improve the performance further. Another approach to this issue is to find better ways to clean the text data. The additional cleaning we performed lowered the perplexities and reduced the number of unknown words in the text.

Another improvement may be gained by finding a fitting development set that is then used to interpolate the LMs day-wise, as suggested in [18]. Using already finished transcriptions may also improve the performance of the LMs. For example, we may use the transcriptions from 2/3/11 to improve a language model used for a transcription a few days later, when the news has not changed much. We may either build LMs from the transcriptions and interpolate them with later LMs, or concatenate the transcriptions with the crawled text data.

As broadcast news shows often contain interviews, or the moderators speak with each other, integrating text data from other sources like blogs or social media such as Twitter, which contain a kind of speech similar to that used in interviews, may lead to improvements.


More information

Social Media and Masonry

Social Media and Masonry Social Media and Masonry What is social media? Social media describes the various ways of using technology to connect with an audience. Every Lodge should have a social media or outreach program that connects

More information

Repurposing Your Podcast. 3 Places Your Podcast Must Be To Maximize Your Reach (And How To Use Each Effectively)

Repurposing Your Podcast. 3 Places Your Podcast Must Be To Maximize Your Reach (And How To Use Each Effectively) Repurposing Your Podcast 3 Places Your Podcast Must Be To Maximize Your Reach (And How To Use Each Effectively) What You ll Learn What 3 Channels (Besides itunes and Stitcher) Your Podcast Should Be On

More information

User Guide Contents The Toolbar The Menus The Spell Checker and Dictionary Adding Pictures to Documents... 80

User Guide Contents The Toolbar The Menus The Spell Checker and Dictionary Adding Pictures to Documents... 80 User Guide Contents Chapter 1 The Toolbar... 40 Unique Talking Toolbar Features... 40 Text Navigation and Selection Buttons... 42 Speech Buttons... 44 File Management Buttons... 45 Content Buttons... 46

More information

GOOGLE ANALYTICS 101 INCREASE TRAFFIC AND PROFITS WITH GOOGLE ANALYTICS

GOOGLE ANALYTICS 101 INCREASE TRAFFIC AND PROFITS WITH GOOGLE ANALYTICS GOOGLE ANALYTICS 101 INCREASE TRAFFIC AND PROFITS WITH GOOGLE ANALYTICS page 2 page 3 Copyright All rights reserved worldwide. YOUR RIGHTS: This book is restricted to your personal use only. It does not

More information

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used

More information

PreFeed: Cloud-Based Content Prefetching of Feed Subscriptions for Mobile Users. Xiaofei Wang and Min Chen Speaker: 饒展榕

PreFeed: Cloud-Based Content Prefetching of Feed Subscriptions for Mobile Users. Xiaofei Wang and Min Chen Speaker: 饒展榕 PreFeed: Cloud-Based Content Prefetching of Feed Subscriptions for Mobile Users Xiaofei Wang and Min Chen Speaker: 饒展榕 Outline INTRODUCTION RELATED WORK PREFEED FRAMEWORK SOCIAL RSS SHARING OPTIMIZATION

More information

Intelligent Hands Free Speech based SMS System on Android

Intelligent Hands Free Speech based SMS System on Android Intelligent Hands Free Speech based SMS System on Android Gulbakshee Dharmale 1, Dr. Vilas Thakare 3, Dr. Dipti D. Patil 2 1,3 Computer Science Dept., SGB Amravati University, Amravati, INDIA. 2 Computer

More information

Comprehensive Tool for Generation and Compatibility Management of Subtitles for English Language Videos

Comprehensive Tool for Generation and Compatibility Management of Subtitles for English Language Videos International Journal of Computational Intelligence Research ISSN 0973-1873 Volume 12, Number 1 (2016), pp. 63-68 Research India Publications http://www.ripublication.com Comprehensive Tool for Generation

More information

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? In our experience, we find we can get over-excited when talking to clients or family or friends and sometimes we forget that not everyone

More information

Loquendo TTS Director: The next generation prompt-authoring suite for creating, editing and checking prompts

Loquendo TTS Director: The next generation prompt-authoring suite for creating, editing and checking prompts Loquendo TTS Director: The next generation prompt-authoring suite for creating, editing and checking prompts 1. Overview Davide Bonardo with the collaboration of Simon Parr The release of Loquendo TTS

More information

Estimating I/O Memory Bandwidth

Estimating I/O Memory Bandwidth Estimating I/O Memory Bandwidth Diplomarbeit von cand. inform. Jos Ewert an der Fakultät für Informatik Erstgutachter: Zweitgutachter: Betreuender Mitarbeiter: Prof. Dr. Frank Bellosa Prof. Dr. Wolfgang

More information

Firespring Analytics

Firespring Analytics Firespring Analytics What do my website statistics mean? To answer this question, let's first consider how a web page is loaded. You've just typed in the address of a web page and hit go. Depending on

More information

Context-based Navigational Support in Hypermedia

Context-based Navigational Support in Hypermedia Context-based Navigational Support in Hypermedia Sebastian Stober and Andreas Nürnberger Institut für Wissens- und Sprachverarbeitung, Fakultät für Informatik, Otto-von-Guericke-Universität Magdeburg,

More information

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Chirag Shah Dept. of CSE IIT Madras Chennai - 600036 Tamilnadu, India. chirag@speech.iitm.ernet.in A. Nayeemulla Khan Dept. of CSE

More information

Google Analytics 101

Google Analytics 101 Copyright GetABusinessMobileApp.com All rights reserved worldwide. YOUR RIGHTS: This book is restricted to your personal use only. It does not come with any other rights. LEGAL DISCLAIMER: This book is

More information

Core Publisher Content Import

Core Publisher Content Import Core Publisher Content Import As part of the Core Publisher on-boarding process, stations can import archived content into their new Core Publisher site. Because of the station effort required to prepare

More information

2015 Search Ranking Factors

2015 Search Ranking Factors 2015 Search Ranking Factors Introduction Abstract Technical User Experience Content Social Signals Backlinks Big Picture Takeaway 2 2015 Search Ranking Factors Here, at ZED Digital, our primary concern

More information

BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES

BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES O.O. Iakushkin a, G.A. Fedoseev, A.S. Shaleva, O.S. Sedova Saint Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg,

More information

CMU Sphinx: the recognizer library

CMU Sphinx: the recognizer library CMU Sphinx: the recognizer library Authors: Massimo Basile Mario Fabrizi Supervisor: Prof. Paola Velardi 01/02/2013 Contents 1 Introduction 2 2 Sphinx download and installation 4 2.1 Download..........................................

More information

Business Forum Mid Devon. Optimising your place on search engines

Business Forum Mid Devon. Optimising your place on search engines Optimising your place on search engines What do I know? Professional copywriter since 1996 Words inform Google and Bing Content is now king on Google Work on SEO campaigns for clients Who are Oxygen? Who

More information

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0 Website Name Project Code: #10001 Version: 1.0 DocID: SEO/site/rec Issue Date: DD-MM-YYYY Prepared By: - Owned By: Rave Infosys Reviewed By: - Approved By: - 3111 N University Dr. #604 Coral Springs FL

More information

Going to Another Board from the Welcome Board. Conference Overview

Going to Another Board from the Welcome Board. Conference Overview WebBoard for Users Going to Another Board from the Welcome Board Many WebBoard sites have more than one board, each with its own set of conferences and messages. When you click on Boards on the WebBoard

More information

Top 3 Marketing Metrics You Should Measure in Google Analytics

Top 3 Marketing Metrics You Should Measure in Google Analytics Top 3 Marketing Metrics You Should Measure in Google Analytics Presented By Table of Contents Overview 3 How to Use This Knowledge Brief 3 Metric to Measure: Traffic 4 Direct (Acquisition > All Traffic

More information

The Late Night Blog Post SEO Checklist

The Late Night Blog Post SEO Checklist The Late Night Blog Post SEO Checklist ü Do you know what keywords your post is targeting? When you write a blog post, you need to know WHY you are writing it. What action do you want the reader to take?

More information

Khmer OCR for Limon R1 Size 22 Report

Khmer OCR for Limon R1 Size 22 Report PAN Localization Project Project No: Ref. No: PANL10n/KH/Report/phase2/002 Khmer OCR for Limon R1 Size 22 Report 09 July, 2009 Prepared by: Mr. ING LENG IENG Cambodia Country Component PAN Localization

More information

Fraunhofer IAIS Audio Mining Solution for Broadcast Archiving. Dr. Joachim Köhler LT-Innovate Brussels

Fraunhofer IAIS Audio Mining Solution for Broadcast Archiving. Dr. Joachim Köhler LT-Innovate Brussels Fraunhofer IAIS Audio Mining Solution for Broadcast Archiving Dr. Joachim Köhler LT-Innovate Brussels 22.11.2016 1 Outline Speech Technology in the Broadcast World Deep Learning Speech Technologies Fraunhofer

More information

Search Enginge Optimization (SEO) Proposal

Search Enginge Optimization (SEO) Proposal Search Enginge Optimization (SEO) Proposal Proposal Letter Thank you for the opportunity to provide you with a quotation for the search engine campaign proposed by us for your website as per your request.our

More information

1. Create your website. 2. Choose a template

1. Create your website. 2. Choose a template WEBSELF TUTORIAL Are you a craftsman or an entrepreneur? Having a strong web presence today is critical. A website helps let your visitors, prospects, customers and partners know who you are and what services

More information

Reducing Overhead in Microkernel Based Multiserver Operating Systems through Register Banks

Reducing Overhead in Microkernel Based Multiserver Operating Systems through Register Banks Reducing Overhead in Microkernel Based Multiserver Operating Systems through Register Banks Studienarbeit von Sebastian Ottlik an der Fakultät für Informatik Verantwortlicher Betreuer: Betreuender Mitarbeiter:

More information

Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System

Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System Eun Ji Kim and Mun Yong Yi (&) Department of Knowledge Service Engineering, KAIST, Daejeon,

More information

Website Authority Checklist

Website Authority Checklist Website Authority Checklist A 20-point checklist for winning the lion s share of traffic and conversions in your industry. By David Jenyns A word from the wise You d have to agree, the web is forever changing

More information

Automated Tagging to Enable Fine-Grained Browsing of Lecture Videos

Automated Tagging to Enable Fine-Grained Browsing of Lecture Videos Automated Tagging to Enable Fine-Grained Browsing of Lecture Videos K.Vijaya Kumar (09305081) under the guidance of Prof. Sridhar Iyer June 28, 2011 1 / 66 Outline Outline 1 Introduction 2 Motivation 3

More information

Speech Tuner. and Chief Scientist at EIG

Speech Tuner. and Chief Scientist at EIG Speech Tuner LumenVox's Speech Tuner is a complete maintenance tool for end-users, valueadded resellers, and platform providers. It s designed to perform tuning and transcription, as well as parameter,

More information

Web Site Documentation Eugene School District 4J

Web Site Documentation Eugene School District 4J Eugene School District 4J Using this Documentation Revision 1.3 1. Instruction step-by-step. The left column contains the simple how-to steps. Over here on the right is the color commentary offered to

More information

Traffic Overdrive Send Your Web Stats Into Overdrive!

Traffic Overdrive Send Your Web Stats Into Overdrive! Traffic Overdrive Send Your Web Stats Into Overdrive! Table of Contents Generating Traffic To Your Website... 3 Optimizing Your Site For The Search Engines... 5 Traffic Strategy #1: Article Marketing...

More information

Traffic Triggers Domain Here.com

Traffic Triggers   Domain Here.com www.your Domain Here.com - 1 - Table of Contents INTRODUCTION TO TRAFFIC TRIGGERS... 3 SEARCH ENGINE OPTIMIZATION... 4 PAY PER CLICK MARKETING... 6 SOCIAL MARKETING... 9 PRESS RELEASES... 11 ARTICLE MARKETING...

More information

SCHULICH MEDICINE & DENTISTRY Website Updates August 30, Administrative Web Editor Guide v6

SCHULICH MEDICINE & DENTISTRY Website Updates August 30, Administrative Web Editor Guide v6 SCHULICH MEDICINE & DENTISTRY Website Updates August 30, 2012 Administrative Web Editor Guide v6 Table of Contents Chapter 1 Web Anatomy... 1 1.1 What You Need To Know First... 1 1.2 Anatomy of a Home

More information

Strong signs your website needs a professional redesign

Strong signs your website needs a professional redesign Strong signs your website needs a professional redesign Think - when was the last time that your business website was updated? Better yet, when was the last time you looked at your website? When the Internet

More information

Frequently Asked Questions- Communication, the Internet, Presentations Question 1: What is the difference between the Internet and the World Wide Web?

Frequently Asked Questions- Communication, the Internet, Presentations Question 1: What is the difference between the Internet and the World Wide Web? Frequently Asked Questions- Communication, the Internet, Presentations Question 1: What is the difference between the Internet and the World Wide Web? Answer 1: The Internet and the World Wide Web are

More information

OCR Interfaces for Visually Impaired

OCR Interfaces for Visually Impaired OCR Interfaces for Visually Impaired TOPIC ASSIGNMENT 2 Author: Sachin FERNANDES Graduate 8 Undergraduate Team 2 TOPIC PROPOSAL Instructor: Dr. Robert PASTEL March 4, 2016 LIST OF FIGURES LIST OF FIGURES

More information

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible.

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible. Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible. CS 224N / Ling 237 Programming Assignment 1: Language Modeling Due

More information

Conclusions. Chapter Summary of our contributions

Conclusions. Chapter Summary of our contributions Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web

More information

How To Construct A Keyword Strategy?

How To Construct A Keyword Strategy? Introduction The moment you think about marketing these days the first thing that pops up in your mind is to go online. Why is there a heck about marketing your business online? Why is it so drastically

More information

QUALITY SEO LINK BUILDING

QUALITY SEO LINK BUILDING QUALITY SEO LINK BUILDING Developing Your Online Profile through Quality Links TABLE OF CONTENTS Introduction The Impact Links Have on Your Search Profile 02 Chapter II Evaluating Your Link Profile 03

More information

VOX TURBO QUESTIONS AND ANSWER

VOX TURBO QUESTIONS AND ANSWER VOX TURBO QUESTIONS AND ANSWER While the dropdown rate is a must-have feature, I have also seen it become the source of some new problems. The most significant of these problems are punctuation and numbers

More information

Worldnow Producer. Stories

Worldnow Producer. Stories Worldnow Producer Stories Table of Contents Overview... 4 Getting Started... 4 Adding Stories... 5 Story Sections... 5 Toolbar... 5 Copy Live URL... 6 Headline... 6 Abridged Title... 6 Abridged Clickable

More information

Next Level Marketing Online techniques to grow your business Hudson Digital

Next Level Marketing Online techniques to grow your business Hudson Digital Next Level Marketing Online techniques to grow your business. 2019 Hudson Digital Your Online Presence Chances are you've already got a web site for your business. The fact is, today, every business needs

More information

Help! My Birthday Reminder Wants to Brick My Phone!

Help! My Birthday Reminder Wants to Brick My Phone! Institut für Technische Informatik und Kommunikationsnetze Master Thesis Help! My Birthday Reminder Wants to Brick My Phone! Student Name Advisor: Dr. Stephan Neuhaus, neuhaust@tik.ee.ethz.ch Professor:

More information

Activity Report at SYSTRAN S.A.

Activity Report at SYSTRAN S.A. Activity Report at SYSTRAN S.A. Pierre Senellart September 2003 September 2004 1 Introduction I present here work I have done as a software engineer with SYSTRAN. SYSTRAN is a leading company in machine

More information

How to Get Your Web Maps to the Top of Google Search

How to Get Your Web Maps to the Top of Google Search How to Get Your Web Maps to the Top of Google Search HOW TO GET YOUR WEB MAPS TO THE TOP OF GOOGLE SEARCH Chris Brown CEO & Co-founder of Mango SEO for web maps is particularly challenging because search

More information

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for. 1 OnCrawl Metrics What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for. UNLEASH YOUR SEO POTENTIAL Table of content 01 Crawl Analysis 02 Logs Monitoring

More information

Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition

Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition Kartik Audhkhasi, Abhinav Sethy Bhuvana Ramabhadran Watson Multimodal Group IBM T. J. Watson Research Center Motivation

More information

If most of the people use Google by writing a few words in the Google windows and perform their search it should be note:

If most of the people use Google by writing a few words in the Google windows and perform their search it should be note: The «Google World» Henri Dou, Competitive Intelligence Think Tank douhenri@yahoo.fr In chapter 1 we presented the key role of information in competitive intelligence and also the necessity to go through

More information

Getting Started With Google Analytics Detailed Beginner s Guide

Getting Started With Google Analytics Detailed Beginner s Guide Getting Started With Google Analytics Detailed Beginner s Guide Copyright 2009-2016 FATbit - All rights reserved. The number of active websites on the internet could exceed the billionth mark by the end

More information

A Letting agency s shop window is no longer a place on the high street, it is now online

A Letting agency s shop window is no longer a place on the high street, it is now online A Letting agency s shop window is no longer a place on the high street, it is now online 1 Let s start by breaking down the two ways in which search engines will send you more traffic: 1. Search Engine

More information

EBOOK. On-Site SEO Made MSPeasy Everything you need to know about Onsite SEO

EBOOK. On-Site SEO Made MSPeasy Everything you need to know about Onsite SEO EBOOK On-Site SEO Made MSPeasy Everything you need to know about Onsite SEO K SEO easy ut Onsite SEO What is SEO & How is it Used? SEO stands for Search Engine Optimisation. The idea of SEO is to improve

More information

Whitepaper Italy SEO Ranking Factors 2012

Whitepaper Italy SEO Ranking Factors 2012 Whitepaper Italy SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com

More information

Marketing & Back Office Management

Marketing & Back Office Management Marketing & Back Office Management Menu Management Add, Edit, Delete Menu Gallery Management Add, Edit, Delete Images Banner Management Update the banner image/background image in web ordering Online Data

More information

COPYRIGHTED MATERIAL. Introduction. 1.1 Introduction

COPYRIGHTED MATERIAL. Introduction. 1.1 Introduction 1 Introduction 1.1 Introduction One of the most fascinating characteristics of humans is their capability to communicate ideas by means of speech. This capability is undoubtedly one of the facts that has

More information

File: SiteExecutive 2013 Content Intelligence Modules User Guide.docx Printed January 20, Page i

File: SiteExecutive 2013 Content Intelligence Modules User Guide.docx Printed January 20, Page i File: SiteExecutive 2013 Content Intelligence Modules User Guide.docx Page i Contact: Systems Alliance, Inc. Executive Plaza III 11350 McCormick Road, Suite 1203 Hunt Valley, Maryland 21031 Phone: 410.584.0595

More information

Lecture 7: Neural network acoustic models in speech recognition

Lecture 7: Neural network acoustic models in speech recognition CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 7: Neural network acoustic models in speech recognition Outline Hybrid acoustic modeling overview Basic

More information

Search Engine Optimization. Rebecca Blanchette SEO & Analytics Specialist Carnegie Communications

Search Engine Optimization. Rebecca Blanchette SEO & Analytics Specialist Carnegie Communications Search Engine Optimization Rebecca Blanchette SEO & Analytics Specialist Carnegie Communications What is SEO anyway? The short answer: search engine optimization refers to the process of optimizing your

More information

PODCASTS, from A to P

PODCASTS, from A to P PODCASTS, from A to P Basics of Podcasting 1) What are podcasts all About? 2) Where do I get podcasts? 3) How do I start receiving a podcast? Art Gresham UCHUG Editor July 18 2009 Seniors Computer Group

More information

the magazine of the Marketing Research and Intelligence Association YEARS OF RESEARCH INTELLIGENCE A FUTURESPECTIVE

the magazine of the Marketing Research and Intelligence Association YEARS OF RESEARCH INTELLIGENCE A FUTURESPECTIVE the magazine of the Marketing Research and Intelligence Association vuemay 2010 5 YEARS OF RESEARCH INTELLIGENCE A FUTURESPECTIVE If You Want to Rank in Google, Start by Fixing Your Site You have an informative,

More information

Whitepaper US SEO Ranking Factors 2012

Whitepaper US SEO Ranking Factors 2012 Whitepaper US SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics Inc. 1115 Broadway 12th Floor, Room 1213 New York, NY 10010 Phone: 1 866-411-9494 E-Mail: sales-us@searchmetrics.com

More information

FUNCTIONAL BEST PRACTICES ORACLE USER PRODUCTIVITY KIT

FUNCTIONAL BEST PRACTICES ORACLE USER PRODUCTIVITY KIT FUNCTIONAL BEST PRACTICES ORACLE USER PRODUCTIVITY KIT Purpose Oracle s User Productivity Kit (UPK) provides functionality that enables content authors, subject matter experts, and other project members

More information

For more info on Cloud9 see their documentation:

For more info on Cloud9 see their documentation: Intro to Wordpress Cloud 9 - http://c9.io With the free C9 account you have limited space and only 1 private project. Pay attention to your memory, cpu and disk usage meter at the top of the screen. For

More information

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de Previous Lecture Audio Retrieval - Query by Humming

More information

Implementation and Verification of a Pollingbased MAC Layer Protocol for PLC

Implementation and Verification of a Pollingbased MAC Layer Protocol for PLC Implementation and Verification of a Pollingbased Layer Protocol for PLC Project Work by Luis Lax Cortina at the Institute of the Industrial Information Technology Zeitraum: 11.12.2009 12.06.2010 Hauptreferent:

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information