Enhancing Language Models for ASR using RSS Feeds


Diploma thesis (Diplomarbeit) at the Cognitive Systems Lab
Prof. Dr.-Ing. Tanja Schultz
Fakultät für Informatik, Karlsruher Institut für Technologie

by cand. inform. Lukasz Gren

Advisors:
Dipl.-Inform. Ngoc Thang Vu
Dipl.-Inform. Tim Schlippe
Prof. Dr.-Ing. Tanja Schultz

Date of registration: July 1, 2011
Date of submission: September 5, 2011

KIT, Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der Helmholtz-Gemeinschaft


I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.

Karlsruhe, September 5, 2011


Abstract

In this work, we improve the automatic speech recognition of broadcast news with time- and topic-relevant text data. Our previous method for collecting large amounts of text data for language modelling was to use the crawler in the Rapid Language Adaptation Toolkit (RLAT) with its recursive crawling implementation. This implementation is well suited for crawling large amounts of text data. However, it has shortcomings when it comes to picking out exclusively text material that is relevant for the up-to-date broadcast news shows which we intend to transcribe. To provide text data that fits our shows better, we propose crawling methods based on RSS Feeds. RSS Feeds are small, automatically generated XML files that contain time-stamped URLs of newly published updates. They can easily be found on almost all online news websites. We implemented an RSS parser in RLAT which takes RSS Feeds, extracts the URLs together with their publishing dates and collects them while preserving the time information. Then exclusively the pages corresponding to these URLs are crawled. Text data collected based on the information in the RSS Feeds of four French online news websites improved the performance of our base LM, which had been used in the Quaero programme: The word error rates of five tested broadcast news shows from Europe1 are reduced by 0.9% absolute on average with our new text data. The highest improvement is 1.5% absolute (4.4% relative). Inserting new words that occurred frequently in the RSS Feed-related webpages into our search vocabulary and pronunciation dictionary gives an additional improvement of 0.4% absolute on average compared to the results with the Quaero dictionary. The best result was 0.5% absolute (1.74% relative).


Contents

1 Introduction
  1.1 Motivation and purpose of this thesis
  1.2 Structure of this Thesis
2 Automatic Speech Recognition Basics
  2.1 ASR Basics
  2.2 Tools
    2.2.1 Rapid Language Adaptation Toolkit
    2.2.2 SRI Language Model Toolkit
    2.2.3 JANUS
    2.2.4 Sequitur G2P
3 Shortcomings of the RLAT crawler and proposed improvements
  3.1 Shortcomings
  3.2 Related work
  3.3 Proposed improvements
4 Our Work
  4.1 RSS Feeds Basics
  4.2 Implementation
    4.2.1 Implemented scripts
    4.2.2 Parsing the RSS Feed
    4.2.3 Output of the parsed links into files
    4.2.4 Adaptation of the RLAT user interface
5 Experimental Setup
6 Experiments and Results
  6.1 Experiments without dictionary adaptation
  6.2 Experiments with dictionary expansion
    6.2.1 Oracle experiments
    6.2.2 Real world results
  6.3 Summary
7 Summary and future prospects
Bibliography


1. Introduction

1.1 Motivation and purpose of this thesis

In today's globalized and connected world, there is a constant flow of information. In particular, the rapidly growing internet generates more and more useful information. The enormous amount of information generated and published every second from all over the world contains valuable data, not only text but also sounds, images and video, that can be used in many different fields of science. Not only computer science but also the social sciences are profiting. For example, by examining the social graph generated from the connections between people from all over the world on social networks, it is possible to predict epidemics [10] or analyse other health issues like obesity [9].

In this work, we focus on broadcast news. In general, news broadcast on television or on the radio is still the main source of information for many people. But being present only as spoken language in most cases, these broadcasts have some disadvantages: First, they cannot reach hearing-impaired people. Second, audio data cannot be archived and searched for a given topic like text data. Automatic Speech Recognition (ASR) software tries to solve these problems. ASR systems, as the name already states, recognize spoken language and transform it into text. Two of the main components of an ASR system are the language model (LM), which contains the probabilities of word sequences that may be spoken, and the acoustic model, which contains the statistical representation of the sounds of which spoken language consists [8]. After the spoken utterance is processed by the acoustic model, it generates a range of different possible words. This task needs a pronunciation dictionary which contains words and the corresponding phoneme sequences. With the LM, the most probable word sequence is chosen from the offered possibilities. With these elements, it is possible to convert speech to text.

As broadcast news always covers the latest developments, new words emerge frequently and different topics come into the focus of attention. To adapt an ASR system for broadcast news, it is necessary to update it with text data that is in close temporal proximity to the date of the broadcast news show, is part of the same domain

and comes from the same language. Close temporal and topical proximity of the text data ensures that the words and sentences contained in the news show have a higher probability of fitting than when using text data from long before the show. In this work, the crawling functionality of the CSL's system for building speech processing systems, the Rapid Language Adaptation Toolkit (RLAT) [14], is improved to make this possible. Currently, the RLAT crawler can only ensure that the language of the crawled texts fits the target language. We propose to use RSS Feeds [6] to crawl text data which features time- and topic-relevance. RSS Feeds are small XML files that are automatically published on many websites. They are updated every time a new article is published and contain not only the URL of the article but also its publishing date. Based on this information, we parse them and create chronologically sorted files that contain the URLs of the new articles. The pages corresponding to the URLs are then crawled and LMs are built. These are then interpolated, in other words mixed together, adapting the word sequence probabilities with an already evaluated LM (the Quaero LM). The performance of the resulting LM is tested and compared to the results using solely the Quaero LM. Running experiments with a fixed pronunciation dictionary and an extended dictionary, we also evaluated the impact of adding new words found in the text data crawled with the help of the RSS Feeds.

1.2 Structure of this Thesis

In the previous section, the motivation behind this thesis and the steps to achieve its goal have been explained. In Chapter 2, a brief overview of the basics of speech recognition is given and the tools that were used are described. Short overviews of the Rapid Language Adaptation Toolkit (RLAT), the SRI Language Model Toolkit, Sequitur G2P and JANUS are given. In Chapter 3, the shortcomings of the currently used system are described and ways to fix them are sought in related work. In Chapter 4, an overview of our implementations is given. In Chapter 5, the requirements for running the experiments are presented. In Chapter 6, our system is evaluated in various experiments. Upcoming problems are described and the results are presented. Chapter 7 finishes the thesis with a conclusion.

2. Automatic Speech Recognition Basics

2.1 ASR Basics

The term Automatic Speech Recognition (ASR) denotes the process by which human speech is converted to written text. The three main components of an ASR system are the acoustic model, the pronunciation dictionary and the language model. After speech is produced in the speaker's vocal tract, a signal processing component transforms it into a sequence of acoustic vectors. These vectors model acoustic phonemes. The acoustic model contains phonetic knowledge and has to deal with the characteristics of acoustic differences between genders or dialects. Given the knowledge present in the acoustic model, the most probable phonetic sequences are chosen and transformed into a sequence of words. This happens with the help of a pronunciation dictionary, which contains the mapping of phonetic sequences to words.

The acoustic model is not always able to recognize words correctly. For example, it may offer the candidates "Merry Christmas" and "Very Christmas". Now the language model has to choose the correct candidate. To be able to make the correct decision, a language model holds information about the probabilities of word sequences. To build a language model, a large amount of text data is analysed. The language model would choose "Merry Christmas", as the word "Merry" stands in front of the word "Christmas" much more often than the word "Very" in English texts. By combining the knowledge of phonetics (acoustic model) and linguistics (language model), speech can be recognized.

There are different metrics to evaluate the performance of an ASR system. One of them is the out-of-vocabulary (OOV) rate, which states the percentage of words in a reference text that are not part of the language model or the search vocabulary. Another criterion is the perplexity, which is a measure of how many successors a word can have on average in a language model. If the perplexity is high, the recognizer can choose from a high number of possible word sequences, leading to a greater probability that errors occur. If the OOV rate and/or the perplexity are high, this usually leads to a high word error rate (WER), which states the fraction of words that have to be inserted, deleted or substituted to achieve a perfect recognition.
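In the standard formulation, for a test text $W = w_1 \dots w_N$ the perplexity of a language model $P$ is

\[ PP(W) = P(w_1 \dots w_N)^{-\frac{1}{N}} \]

and, with $S$ substitutions, $D$ deletions and $I$ insertions needed to turn the recognizer output into a reference of $N$ words, the word error rate is

\[ \mathrm{WER} = \frac{S + D + I}{N} \]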

2.2 Tools

The tools which we used are the Rapid Language Adaptation Toolkit (RLAT), acting as the front-end for collecting text data from the internet; the SRI Language Modeling Toolkit, which is responsible for language model related tasks; JANUS, which is used to recognize speech; and Sequitur G2P, which generates the pronunciations needed when adding new words to the dictionary.

2.2.1 Rapid Language Adaptation Toolkit

The Rapid Language Adaptation Toolkit (RLAT) [14] is an extension of the Speech Processing Interactive Creation and Evaluation Toolkit for new Languages (SPICE) [15] developed at Carnegie Mellon University. RLAT is based on GlobalPhone [16] and Festvox. The main purpose of RLAT is to reduce the time needed for building speech processing systems for new languages. The system is web-based and offers all tools needed to complete this task. The required time is reduced, and even novice users are enabled to build a speech processing system. One of the major features of RLAT is the use of data sharing between languages and system components. The creation process is divided into nine steps. From collecting the required text data to selecting phonemes, everything needed to build a complete ASR system is present.

In this work, only the component of RLAT that collects text data is used. This component offers different ways to add text data, which can then be used to build language models. Text data might be uploaded directly if already present, or obtained by crawling the web for text resources. If the text data is crawled, a URL and the depth of the links to be followed are entered (see Figure 2.1). The RLAT crawler gets the text from the URL entered before and checks if any URLs are present on this page. If so, the URLs are followed and the corresponding text is downloaded. This process continues until the configured link depth is reached. By choosing a high link depth, it is possible to get huge amounts of text data. After the crawling finishes, the text data is normalized. This means that things like punctuation or the format of numbers are converted to a consistent format. This text can then be used for the next steps of the process to build an ASR system.

Figure 2.1: Standard text crawling in RLAT

2.2.2 SRI Language Model Toolkit

The SRI Language Modeling Toolkit (SRILM), developed by the SRI Speech Technology and Research Laboratory (STAR Lab) [17], is a freely available collection of C++ libraries, executable programs and helper scripts. They allow the creation of statistical language models for speech recognition and their evaluation. The main tasks that SRILM is used for in this work are the creation of language models from text data, the interpolation of the models and the computation of the perplexity. These tasks are performed by two tools included in the toolkit: ngram and ngram-count.

ngram-count processes text data by counting how often words and word sequences occur in the text to build a language model. It can be customized by a number of options. The n-gram order (meaning the maximum length of word sequences), a custom vocabulary (if the number of words in the dictionary should be restricted), the discounting algorithm (Good-Turing, absolute, Witten-Bell, and modified Kneser-Ney are supported) and how to treat unknown words are the most commonly used options.

ngram is responsible for the evaluation of language models. It takes a language model and a file containing test data and computes the perplexity of the language model on this test data. Another important task is the interpolation of language models. Given two or more LMs, they can be combined using linear interpolation [7]. The optimal interpolation ratio between those LMs can be computed when appropriate text data is given, on which the performance should be optimized.
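To illustrate how these two tools interact, the following Perl fragment sketches the typical sequence of calls; the file names, the trigram order and the fixed interpolation weight are placeholders, not the exact settings of our setup.

#!/usr/bin/perl
use strict;
use warnings;

# Build a language model from normalized text with ngram-count.
system('ngram-count',
    '-order', 3,                   # maximum n-gram length
    '-text',  'crawled.txt',       # normalized training text
    '-vocab', 'vocab.txt',         # restrict to a fixed vocabulary
    '-kndiscount', '-interpolate', # modified Kneser-Ney discounting
    '-lm', 'rss.lm') == 0 or die "ngram-count failed: $?";

# Linearly interpolate with the baseline LM; -lambda is the weight
# of the first model, here 0.8 for the baseline.
system('ngram', '-lm', 'quaero.lm', '-mix-lm', 'rss.lm',
    '-lambda', '0.8', '-write-lm', 'mixed.lm') == 0 or die "mixing failed: $?";

# Compute the perplexity of the result on reference text.
system('ngram', '-lm', 'mixed.lm', '-ppl', 'reference.txt');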

2.2.3 JANUS

The JANUS [19] speech recognition software has been developed jointly at the Universität Karlsruhe and Carnegie Mellon University and is commonly used at both institutions. In this work, JANUS is used to convert speech to text. The toolkit provides objects for all common speech recognition approaches. It can deal with different Hidden Markov models (HMMs) and different acoustic model and speech recognizer architectures. The data structures of these objects can be modified with scripts. Therefore it is possible to test new ideas easily.

2.2.4 Sequitur G2P

Sequitur G2P [5] has been developed at RWTH Aachen University. Sequitur is a grapheme-to-phoneme converter. A grapheme is the fundamental unit of written language (such as an alphabetic letter); a phoneme is the fundamental linguistic unit of speech. Sequitur G2P transforms any sequence of graphemes into a sequence of phonemes, using statistical models to achieve this task. To build these models, a training has to be performed: An already existing pronunciation dictionary is taken and a model is trained in an iterative process. This model can then be used to generate pronunciations for a list of words without known pronunciations.
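As a sketch of this workflow, using the command-line options from the Sequitur G2P documentation (file names are placeholders):

# Train a G2P model on an existing pronunciation dictionary and apply
# it to a list of new words; part of the training data is held out
# with --devel to monitor convergence.
system('g2p.py --train base.dict --devel 5% --write-model model-1') == 0
    or die "G2P training failed";
system('g2p.py --model model-1 --apply new_words.txt > new_words.dict') == 0
    or die "G2P application failed";

In practice, the model is usually refined in several --ramp-up iterations over the previous model before it is applied.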

3. Shortcomings of the RLAT crawler and proposed improvements

The functionality of the currently used system is not designed to collect small chunks of data fitting well-defined periods. In this chapter, the shortcomings of the current system are described and improvements based on related work are proposed.

3.1 Shortcomings

The main shortcoming of the RLAT crawler is that it is not possible to automatically crawl websites for news that fit accurately into a time interval. After entering a start URL, all pages up to a defined depth are crawled. This leads to a big overhead, especially when choosing a high link depth, as pages are collected that are much older. Limiting the crawl to just one day is only possible by manually collecting links published on a certain date and then crawling and processing them with the RLAT crawler. This is not only time-consuming; it is also not always possible to access older articles, since they may have been removed or moved to an unreachable destination.

It is also difficult to crawl exclusively text on a certain topic. For example, one can start the crawl with a sports news story to get some sports-related text data, but after reaching a certain depth, other topics may be reached if the website contains different kinds of news.

This crawling method also collects data that might already have been crawled earlier, as no information is stored about which web pages were visited in an already finished crawl. For example, a new crawl is started with a link depth greater than one and finishes after some time. A few days later, we decide to crawl text from this website again. After it finishes, we will probably see that some text crawled in this session was already crawled a few days ago in the previous session. This method obviously leads to more overhead than crawling websites incrementally on a regular basis.

Collecting text data with this method also consumes much more space and bandwidth. The cause of this is that when a high link depth is chosen, this method follows all the links and fetches all the corresponding text until it finds no more links or the link depth is reached. If a website stores its old articles such that they can be reached just by following links, the collected text data may consume a great deal of space and bandwidth.

The massive amounts of text data that can be crawled with the current method can be used as a base for building language models that cope relatively well with different kinds of speech (depending on the topic and domain of the crawled text). But to improve the performance of language models used for the recognition of current news, it is essential to get text data that fits the time of the broadcast.

The difficulty of collecting time-relevant text data with the current approach also leads to problems when the dictionary has to be expanded to recognize new words. When crawling massive amounts of text data, the number of potential new words can grow very large and, if not somehow reduced before being added to an existing dictionary, can lower the performance of the language model.

3.2 Related work

To deal with these problems, we searched for a method to realize incremental crawling. One particular solution is described by Adam et al. [2]. They propose a way to collect fresh content from the web based on RSS Feeds. These feeds can be found on almost every website and contain data about newly published content. Furthermore, they monitor a large number of those feeds and check for new content in self-adapting time intervals.

To deal with the problem of new words in broadcast news that are not part of the lexicon of the ASR system, Ohtsuki et al. [13] propose a method for adding words to the lexicon. They expand an approach by Auzanne et al. [4], who add and remove words from the lexicon based on how frequently a word appears and on how many days it appeared. Ohtsuki et al., besides these two factors, also consider when a word last appeared. They combine all three factors using probabilistic methods and call the result a probabilistic rolling language model. With this method they could improve the OOV rates and word error rates on broadcast news.

Thang et al. [18] describe the building of ASR systems using RLAT. By collecting massive amounts of text data and interpolating the language models on a day-wise basis, OOV rate and perplexity can be lowered in contrast to language models without interpolation. Moreover, they describe that using multiple sources of text data improves the performance.

In this work, we use the incremental crawling method described by Adam et al. Because we work with only a few RSS Feeds, we do not implement the self-adapting time intervals. Additionally, we adapt the lexicon, but in contrast to Ohtsuki et al. we do not use probabilistic methods and do not remove words from the lexicon. We add words to the lexicon based on how frequently they appeared recently. The day-wise interpolation described in [18] cannot be realized in our work, as we lack the data to compute the necessary interpolation ratios. As proposed, we use multiple sources of text data.

3.3 Proposed improvements

Following Adam et al., a new crawling strategy is proposed to reduce the overhead. Almost every website offers at least one RSS Feed that covers all changes to the website or just the changes in a certain part of it. This feed can be used to collect only the text of new articles. This crawling method significantly reduces duplicate content when crawling regularly, compared to the method described in 3.1, because it allows us to collect text incrementally. With the help of this technique, it is also possible to collect data that was published on a specific day, which may help to improve the performance of language models or keep the vocabulary up-to-date.

One downside of using RSS Feeds is that they have to be monitored constantly for updates, as no archive is openly available. Archives like the one built by Google for their Reader service cannot be accessed, because no public API is provided that would enable us to collect that data. It also takes longer to collect larger amounts of text compared to crawling everything until a certain link depth is reached.

Additionally, as Ohtsuki et al. and Thang et al. have shown, adapting the vocabulary improves the general performance. Having time-tagged text data available through RSS Feeds provides the opportunity to extract words that appeared in a certain time interval. They can be used to update an existing vocabulary with words that fit not only a certain time interval but also a certain topic, if the appropriate RSS Feed is chosen. In contrast to the work of Ohtsuki et al., we add words to our system depending on the number of occurrences and how recently a word appeared.

As described in [18], using multiple sources to increase text diversity leads to a better performance of the language models. This is easy to achieve with RSS Feeds, as they are available on most websites. In this work, four text sources will be used.

But diversity is not the only concern. Reducing the noise remaining in the text data after text normalization is another task we address in this work. We define noise as text that is not part of the news article, like website navigation elements, advertising or user comments. Adding such text may harm not only the n-gram probabilities but also the perplexity of the language model, as words containing spelling mistakes are added. Therefore, a better method to remove at least some of this unnecessary text data is analysed.

All suggested improvements are evaluated in this work. We start with the implementation of the new crawling strategy in the next chapter.


4. Our Work

4.1 RSS Feeds Basics

News websites and blogs in particular produce a constant flow of new content, and users often struggle to cope with all the information published. Typically, the contents of the front pages can change quickly, depending on the events that dominate the news. To inform users about changes on a website, version 0.91 of the RSS format was introduced in 1999 by Netscape. Since then, its popularity has grown, and more and more websites (especially blogs and news websites) are using this technology.

RSS (Really Simple Syndication) is a standardized format for publishing frequently updated content. As it is based on the Extensible Markup Language (XML), it is very easy to parse and view with a broad range of applications. Since its introduction the format has evolved, and today RSS 2.0 is the most used version [6].

An RSS Feed contains information about updates and can also contain links to enclosures or pictures. A basic feed is shown in Listing 4.1. Every feed begins with information about which encoding and which XML and RSS version is used. It is always followed by the <channel> tag, which has to contain the following elements: a title, which should be the title of the website the RSS Feed is based on, the URL of the corresponding website, and a short description of the contents of the feed. There is also a number of optional channel elements. Most of them are not helpful for our task, for example a copyright notice or the address of the webmaster. However, the <lastBuildDate> element, which states the date of the last change to the channel, may be used for information retrieval purposes. Rarely used are the <skipHours> and <skipDays> elements, which indicate hours or days in which the feed is not updated.

The most important part of the feed is located between the <item> tags. A feed can contain any number of items, but almost all feeds contain only a fixed number of items.

Listing 4.1: Sample RSS feed

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>RSS Title</title>
    <description>This is an example of an RSS feed</description>
    <link>http://www.someexamplerssdomain.com/main.html</link>
    <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000</lastBuildDate>
    <pubDate>Mon, 06 Sep 2010 16:45:00 +0000</pubDate>
    <item>
      <title>Example entry</title>
      <description>Here is some text containing an interesting description.</description>
      <link>http://www.wikipedia.org/</link>
      <guid>unique string per item</guid>
      <pubDate>Mon, 06 Sep 2010 16:45:00 +0000</pubDate>
    </item>
  </channel>
</rss>

All elements of <item> are optional, but either a title or a description has to be given. Most feeds contain at least a title and a link to the content. The description element is commonly used to give a short synopsis of the main article or contains its first few sentences. In general, it is also possible that the description element contains the full text of the article, but this is rarely done by companies that earn money by showing advertising on their websites. Offering the whole text via RSS Feeds, they would lose visitors and therefore valuable page views. Most important for us is that not only a link to the main content is given, but also the <pubDate> element, which states the time when the article was published. Often there is a <guid> tag (globally unique identifier) which contains a string that uniquely identifies the item for the publisher. In some cases, the <guid> element carries the attribute isPermaLink, which can be set to either true or false. If true, the URL can be seen as a permanent link to the content.

If time-relevant text data is to be crawled, the essential pieces of information that need to be parsed from the RSS Feed are the publishing date (found in the <pubDate> element) and the corresponding URL (inside the <link> element).

4.2 Implementation

In this chapter, an overview of the implementation needed to fulfil the required functionality is given. First, the scripts running constantly in the background to fetch and parse the feeds are described. Afterwards, the changes to the user interface of RLAT are characterized.

4.2.1 Implemented scripts

Our goal is to collect the desired RSS Feeds constantly, as they contain only a fixed number of articles. This means that if one new article is added, the information about

the formerly oldest article present in the feed is pruned. To achieve this goal, a script was written that downloads the XML file in certain time intervals and checks whether any news items are present that were not already processed and saved in an earlier run. Due to the low complexity of the task, and as most of the other RLAT tasks are realized with Perl scripts, Perl was chosen for this implementation. Our main script performs the following tasks:

- downloading the RSS Feed
- parsing the feed
- output of the parsed links to files

All tasks are repeated until they are cancelled by the user or until a given number of days has passed. This is achieved by putting the tasks in a loop. A wait time of 30 minutes between two loop iterations proved sufficient to catch the updates issued throughout the day. The script requires:

- the URL of the RSS Feed
- the time in days the script should run

The file the RSS Feed is located in is downloaded with the help of curl [11]. After downloading the feed, the parsing begins, which we describe in greater detail in the next section.

4.2.2 Parsing the RSS Feed

To parse the XML file, the Perl module XML::Simple [20] is used. It builds a tree representation of the XML file. All items can be parsed by following the branches of the tree until the desired item is reached. The elements required for our task are:

- the date when the article was published
- the URL of the corresponding article

The publishing date can be found in the element <pubDate> of each item. The date follows the specifications of RFC 822 [1]. A typical date in this format looks like this: Fri, 08 Oct 2010 14:27:00 +0200. To work with it more easily, we convert it into a simpler format with the Perl module Date::Parse [12], which transforms the date into the following format: YYYY-MM-DD.

The link to the published article can be stored in two places. Usually the link can be found in the <link> element of the item. However, it is better to look for a permalink, which usually has a longer lifetime. Permalinks are stored in the <guid> element with the attribute isPermaLink; if the attribute is true, that link is used. After downloading and parsing all items of a feed, the script waits 30 minutes and repeats the process.
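The following condensed sketch shows the core of this parsing step; the feed URL, file names and the output layout are placeholders, not the exact ones used in RLAT.

#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;
use Date::Parse;
use POSIX qw(strftime);

my $feed_url = 'http://www.example.com/rss.xml';   # placeholder

# Download the feed with curl, as in our implementation.
system('curl', '-s', '-o', 'feed.xml', $feed_url) == 0
    or die "curl failed: $?";

# Build a tree representation of the XML file.
my $feed  = XMLin('feed.xml', ForceArray => ['item']);
my $items = $feed->{channel}{item} || [];

foreach my $item (@$items) {
    # Prefer the permalink in <guid isPermaLink="true"> over <link>.
    my $url = $item->{link};
    if (ref($item->{guid}) eq 'HASH'
        and ($item->{guid}{isPermaLink} || '') eq 'true') {
        $url = $item->{guid}{content};
    }

    # Convert the RFC 822 date in <pubDate> to YYYY-MM-DD.
    my $epoch = str2time($item->{pubDate}) or next;
    my $day   = strftime('%Y-%m-%d', localtime($epoch));

    # Append the URL to the file collecting links for that day.
    open my $fh, '>>', "links-$day.txt" or die $!;
    print {$fh} "$url\n";
    close $fh;
}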

4.2.3 Output of the parsed links into files

For each item of the feed, the publication date and the URL are parsed and written to a file. After all <item> elements of the RSS Feed are parsed, the contents of the file containing the publication dates and URLs are sorted. Sometimes, shortly after publishing, the headlines of articles are changed, and with the change of the headline the URL may change too. As every change made on the website is visible in the RSS Feed, this results in collecting different URLs leading to the same article. To eliminate these duplicates, we compare the URLs and discard those that coincide in a certain number of characters. This number was optimized for the RSS Feeds used in this work; when using other RSS Feeds, it must be adapted. The file is then processed again: the links are sorted according to the day they were published, and every link with a specific timestamp is written to separate files that contain only URLs belonging to one day.
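The duplicate filter can be sketched in a few lines of Perl; the prefix length of 60 characters is a placeholder for the feed-specific value we tuned, and @urls stands for the list of URLs collected for one day.

# Keep only the first URL for each prefix; later URLs sharing the
# same leading characters are treated as renamed duplicates.
my $prefix_len = 60;    # placeholder, tuned per feed
my %seen;
my @unique = grep { !$seen{ substr($_, 0, $prefix_len) }++ } @urls;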

4.2.4 Adaptation of the RLAT user interface

All described procedures were integrated into the existing RLAT. Figure 4.1 shows our new interface. A new subitem called RSS functionality was added to the navigation bar. All files containing URLs that were collected based on the RSS Feed information are displayed and can be selected to be crawled with the RLAT crawler.

Figure 4.1: Adapted RLAT User Interface for RSS Feed based Text Collection

To start monitoring a new feed, its URL and the time the RSS Feed should be watched have to be entered. After clicking the Crawl RSS button, the scripts start downloading the feed, parsing it for URLs to new content and writing those to files. The files that can be seen in Figure 4.1 are updated and created every time the RSS Feed is parsed again to check for new articles (every thirty minutes).

Each file containing collected URLs can be selected and sent to the standard RLAT crawling script described earlier via the CopyCrawl RSS button. Then all links are crawled, cleaned, and the crawled text data is normalized. This crawled data can then be found in the Text management area, from where it can later be used for building language models.


5. Experimental Setup

To check whether language models can be improved with RSS-based crawled text, we conducted several experiments. Before these experiments could be started, some data had to be collected. As we want to evaluate the performance on broadcast news, the RSS Feeds of the websites of four major French news services were chosen to be monitored for new articles. Then all articles that fit the time interval were crawled with the RLAT crawler. The following feeds were downloaded over a period of six months:

- Le Parisien: approximately 1,400 updates per week
- Le Point: approximately 2,300 updates per week
- Le Monde: approximately 400 updates per week
- France24: approximately 170 updates per week

These four sources were chosen as they provide a huge number of newly published articles compared to other news websites. The number of updates was obtained with the help of statistics provided by Google's Reader service. The number of tokens collected daily from these sources varies heavily from day to day, as shown in Figure 5.1 and Figure 5.2. This is usual, as the amount of news depends not only on the weekday (for example, fewer new articles on Sundays) but also on the nature of the news: some events lead to a greater number of articles than others.

Figure 5.1: Tokens per day - collected between 1/17/11 and 2/16/11

Figure 5.2: Tokens per day - collected between 1/31/11 and 3/2/11

To evaluate the impact of the new time- and topic-relevant texts collected with the help of RSS Feeds, French broadcast news shows needed to be obtained. We downloaded radio broadcasts of the 7 am news from Europe1. Each show usually has a duration of about 10 minutes. The shows could be downloaded easily, as the URLs leading to the MP3 files of the episodes were published through RSS Feeds. In preparation for decoding with JANUS, the downloaded episodes were first transcoded from MP3 stereo to WAVE 16 kHz mono files. After that, a speaker segmentation and clustering was performed and database files were created. These database files contain information about the speakers and the time codes of the utterances in the audio file.
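Assuming a standard converter such as ffmpeg (the concrete tool is interchangeable), the transcoding step can be sketched as:

# Convert every downloaded episode from MP3 to 16 kHz mono WAVE;
# -ar sets the sample rate, -ac the number of channels.
foreach my $mp3 (glob '*.mp3') {
    (my $wav = $mp3) =~ s/\.mp3$/.wav/;
    system('ffmpeg', '-i', $mp3, '-ar', '16000', '-ac', '1', $wav) == 0
        or warn "conversion failed for $mp3\n";
}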

We collected almost all episodes since the beginning of 2011. Five of these episodes were chosen to be transcribed by a French native speaker. To ensure that the news differ in their topics, a period of about two weeks lies between each episode except one (kept for testing purposes). To reduce the transcription work, each episode was initially transcribed roughly using the French Quaero P3 evaluation system with our baseline language model. This first transcription was then checked and corrected by a native speaker. The resulting word error rates of the decodings using the unchanged Quaero LM (baseline LM) are illustrated in Figure 5.3.

Figure 5.3: Baseline Word Error Rates

These corrected transcriptions were used as references to compute the perplexities and OOV rates of the language models built with the downloaded texts. The amount of collected text data is not very big; therefore, language models built exclusively from these texts do not perform well on our shows. Consequently, language models created from these texts were interpolated with our good baseline model.

Although the text data was normalized after it was crawled, it still contained text we did not need (see 3.3 for details). Therefore, we applied additional text processing to the crawled texts. The main goal of this text processing was to reduce the number of words that would qualify to be added to the dictionary, since we later search these texts for new words to add to the dictionary. One example: after crawling and normalizing the text data, some words were present in different representations (e.g. the correct form "politicien" vs. "POliticien" vs. "polliticien"). Many such typos were present in our crawled texts, because the RLAT crawler often did not filter out user comments. Also, as the RLAT normalization is rule-based, it can make errors. All of these factors lead to a large number of new words that are in fact not part of the official form of the language. Therefore, they should not become part of the dictionary, as they may lower the speech recognition performance. Another reason for a good normalization is that pronunciation generation for malformed words is not possible with Sequitur G2P.

The crawled text included lines with falsely capitalized words, or lines that were just sequences of capitalized words which did not form valid sentences, for example "TWITTER FACEBOOK". To remove those, all lines with fewer than three words and all lines containing more than 50% capitalized words were deleted. Because the previous RLAT text normalization did not remove all user comments on some pages, text was also collected that in some cases contained a larger number of typos or was written in other languages. This non-standard text might later enlarge the pronunciation dictionary with misspelled words. To avoid this, a spell check with GNU Aspell [3] was performed. Aspell used a French dictionary to mark those words in the crawled text that are not in the dictionary, listing similar dictionary words where they exist. To get rid of low-quality text, all lines containing more than 50% marked words were eliminated.
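The line filters described above can be sketched as follows (file names are placeholders; the Aspell-based filter works analogously on Aspell's list of marked words):

# Drop very short lines and lines that consist mostly of fully
# capitalized words, following the heuristics described above.
open my $in,  '<', 'normalized.txt' or die $!;
open my $out, '>', 'cleaned.txt'    or die $!;
while (my $line = <$in>) {
    my @words = grep { length } split /\s+/, $line;
    next if @words < 3;                           # fewer than three words
    my $caps = grep { /^[[:upper:]]+$/ } @words;  # all-capital words
    next if $caps / @words > 0.5;                 # more than 50% capitals
    print {$out} $line;
}
close $in;
close $out;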
Including our new text processing steps resulted in some improvement, as shown in Figure 5.4. Not only was the perplexity reduced, but also the number of new words in the crawled text that were not part of the baseline dictionary. For example, before our new method there were about 40k words in the texts from one crawling period that were not part of the baseline dictionary; after the improved text preprocessing, only 32k such words remained. All contained OOV words survived this additional cleaning process. This not only helps to achieve better speech recognition results but also helps to cope with the large amount of potentially new vocabulary that has to be considered.

To compute a good interpolation weight between the baseline LM and the LM built from our texts, the SRILM tools need a development set. In real-world situations, a matching development set is not known in advance. That is why we performed a range of experiments to estimate this ratio. In these experiments, the impact of different interpolation weights between the base LM and the LM generated from the collected text data was evaluated via the perplexity. After building LMs from the text data of different periods, they were interpolated with the base LM using 0.3, 0.5 and 0.8 as weights. The experiments have shown that a good weight for the baseline is 0.8 (see Figures 5.5 and 5.6). This is the standard weight for the rest of the experiments.

With the corrected transcriptions and the crawled texts, the number of OOV words in the transcriptions with respect to the baseline dictionary could be determined. The results in Figure 5.7 show that not all of those words could be found in the crawled texts. The number of unknown words is quite low, but high enough to examine the impact of adding those new words to the dictionary on the speech recognition performance.
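When a development text is available, SRILM can also estimate the weight directly instead of via a grid search; a sketch, assuming a development file dev.txt and the two LMs from above:

# Write detailed per-n-gram log-probabilities for both LMs (-debug 2),
# then let SRILM's compute-best-mix helper script search for the
# interpolation weight that minimizes the perplexity.
system('ngram -lm quaero.lm -ppl dev.txt -debug 2 > ppl.quaero') == 0 or die;
system('ngram -lm rss.lm -ppl dev.txt -debug 2 > ppl.rss') == 0 or die;
system('compute-best-mix ppl.quaero ppl.rss');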

Figure 5.4: Possible improvement with vocabulary adaptation

Figure 5.5: Interpolation

Figure 5.6: Interpolation

Figure 5.7: Number of OOV words in transcriptions with baseline dictionary and number of those occurring in our crawled texts

6. Experiments and Results

In this chapter, the experiments are described. In the experiments, we checked how time-relevant text data and expanding the dictionary improve the ASR performance.

6.1 Experiments without dictionary adaptation

To estimate how texts downloaded from different periods of time impact the performance of language models, several experiments were conducted. The daily collected texts from the four RSS Feed sources (Le Parisien, France24, Le Point, Le Monde) were concatenated. To check for performance improvements, LMs built from the crawled text data of different periods were interpolated with the baseline Quaero LM. For each chosen broadcast news show, we took all texts published at most thirty days before the day of the transmission into account. With t0 being the day of the transmission and t30 the day thirty days before it, we built the following LMs from:

- all texts from t30 until t20 before the transmission (t30->t20)
- all texts from t30 until t10 before the transmission (t30->t10)
- all texts from t30 until t5 before the transmission (t30->t5)
- all texts from t30 until t0 before the transmission (t30->t0)

These language models were interpolated with the base language model using a weight of 0.8 for the baseline (see Chapter 5). Then the vocabulary was reduced to the size used in the baseline language model (around 170k). Only a small percentage (around 2%) of the words used in the news broadcasts was not covered by the baseline dictionary. The impact of an expanded vocabulary is examined later. The resulting language models were evaluated on the reference transcriptions via the perplexity and via the WER after decoding with JANUS. The results are illustrated in Figures 6.1 and 6.2.

Figure 6.1: Perplexity of interpolated LMs

Figure 6.2: Word Error Rate of interpolated LMs

All results are obtained with interpolated LMs, except for the results labeled "only Quaero LM". It can be seen that the perplexity declines more when text data closer to the transmission date is added to the language model (see the change between t30->t5 and t30->t0). We also decoded with text data from the date of the transmission and the five days before it (t5->t0). As the graphs show, the performance of the interpolated language models with these smaller time intervals was equal to or better than the performance with greater amounts of data. The differences in perplexity and word error rate between the broadcast shows can be explained by their different content: the broadcasts sometimes contained music samples or interviews with people from other countries having a strong dialect, which harms the performance. To improve our results, vocabulary adaptation has to be considered. Therefore, the possible gain of vocabulary adaptation was evaluated.

6.2 Experiments with dictionary expansion

We showed that the baseline dictionary does not contain all words that occur in the news broadcasts. These words are therefore recognized incorrectly, leading to a higher WER. Consequently, adding new words to the dictionary may improve the speech recognition performance.

6.2.1 Oracle experiments

To check the best-case scenario of an enlarged dictionary, oracle experiments were performed. This means that all OOV words are known, because we have transcriptions of the broadcasts. Because the interpolated language models with texts from t5->t0 performed best, and the number of OOV words covered in t5->t0 was, for three of the five shows, not much lower than the optimum (see Figure 5.7), these language models were chosen to be enriched with the OOV words. All OOV words present in t5->t0 were also added to the baseline dictionary and a decoding was started. The results in Figure 6.3 show that a considerable improvement in WER is possible by adapting the vocabulary. Therefore, we decoded the shows with a vocabulary that was enriched with new words from up to five days before each show. Moreover, we added unknown words found in the text data from up to thirty days before each broadcast.

6.2.2 Real world results

Despite the additional text preprocessing steps (described in Chapter 5), there were still many words that were misspelled or falsely capitalized. To further reduce this number, we counted all words in t5->t0. As the text data of the day of the transmission (t0) contained almost all OOV words found in the texts from t5->t0, we selected only these texts for further processing. All words that were not part of the baseline dictionary and appeared only once in the texts from t5->t0 were eliminated. This reduced the vocabulary used for the expansion of the base dictionary (see Table 6.1). These new words were added to the base vocabulary and to the individual language models for each show. In addition, the pronunciations for the new words were generated using Sequitur G2P and added to the baseline dictionary.
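The selection of dictionary candidates can be sketched as follows; file names are placeholders, and the threshold implements the rule that singletons are discarded.

# Load the baseline vocabulary.
my %base;
open my $dict, '<', 'baseline.vocab' or die $!;
while (my $w = <$dict>) { chomp $w; $base{$w} = 1; }
close $dict;

# Count all words in the texts from t5->t0.
my %count;
open my $txt, '<', 't5_t0.txt' or die $!;
while (my $line = <$txt>) {
    $count{$_}++ for grep { length } split /\s+/, $line;
}
close $txt;

# Keep words that are unknown to the baseline and occur more than once.
my @candidates = sort grep { !$base{$_} && $count{$_} > 1 } keys %count;
print "$_\n" for @candidates;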

Figure 6.3: Possible improvement with vocabulary adaptation

Table 6.1: New words after cleaning (rows: words not in the dictionary, and words with occurrence > 1; columns: the shows from 1/16/11, 1/19/11, 2/2/11, 2/16/11 and 3/2/11)

Figure 6.4: Results after vocabulary adaptation

The results with the vocabulary expansion are shown in Figure 6.4. It can be seen that including these additional words (see Figure 6.5 for a comparison of the OOV rates) leads to improvements in the speech recognition performance. However, even with this comparatively small expansion of the dictionary, the perplexity begins to rise.

Figure 6.5: Results after vocabulary adaptation

The next step was to evaluate whether an additional improvement can be achieved by expanding the dictionary with all words from t30->t0 that are not part of the dictionary. By cleaning the text data, the vocabulary of words not found in the base dictionary could be reduced, but the expansion still increases the base dictionary by about 20%. The results for t30->t0 are shown in Figure 6.6. Two results improved significantly (up to 1.4% absolute better); the other three stayed the same or even got worse (up to 0.9% absolute worse). Those results can be explained by additional confusions introduced by inadequate new words (extending the baseline dictionary by 20%). If these confusions can be lowered by decreasing the number of words added to the dictionary, further improvement might be achieved.

6.3 Summary

Our experiments have shown that the performance of language models can be improved by using text data crawled with the help of RSS Feed information. There is also a strong correlation between the perplexity and the temporal proximity of the text data used to build the LMs for interpolation to the date of transmission. By using exclusively text data from a few days before the transmission, the perplexity and WER could be improved to the same level as or better than when using text data from the full thirty days before the broadcast. In the first experiments, the baseline dictionary was used for decoding to see the impact of time-relevant text data alone. In this case, an improvement of 0.9% absolute on average compared to the unchanged Quaero LM could be achieved (even 1.5% absolute, 4.4% relative, in the best case).

Using small amounts of data made it easier to extend the vocabulary with previously unknown words, which leads to further improvements. Despite normalizing the crawled text data in various ways, it still contained a lot of unnecessary data. This unneeded data leads to a high number of previously unknown words that are either misspelled or not from the target language. Adding those to the dictionary increases the perplexity. Their number was reduced by adding

Figure 6.6: Results after expanding the baseline dictionary with all the unknown words from t30->t0

only new words that occur at least twice in the texts collected up to five days before the broadcasts. After extending the dictionary, an additional improvement of 0.4% absolute on average was achieved compared to the former results without vocabulary adaptation. Combining time-relevant text data and an expanded dictionary, the overall performance of the Quaero LM could be improved by 1.16% absolute on average (about 4% relative).

7. Summary and future prospects

Our work has shown that the proposed approach for crawling text data helps to boost the performance of already existing LMs. Not only can a lot of time be saved by collecting exclusively new data; it also becomes possible to use only text data that is time- and topic-relevant for the transmission to be recognized. We showed that even if the amount of text data is relatively small, the results can be as good as or better than using a huge amount of data that does not fit the time frame or the topic.

The time-relevant text data enables us to extract words fitting a certain time interval and therefore fitting better to the broadcast news shows to be recognized. Adding too many unnecessary words to the base dictionary produced mixed results. Addressing this issue, for example by fully implementing the probabilistic rolling language model proposed in [13], could probably improve the performance further. Another approach to this issue is to find better ways to clean the text data. The additional cleaning we performed lowered the perplexities and reduced the number of unknown words in the text.

Another improvement may be gained by finding a fitting development set that is then used to interpolate the LMs day-wise, as suggested in [18]. Using already finished transcriptions may also improve the performance of the LMs. For example, we may use the transcriptions from 2/3/11 to improve a language model used for a transcription a few days later, when the news has not changed much. We may either build LMs from the transcriptions and interpolate them with later LMs, or concatenate the transcriptions with the crawled text data.

As broadcast news shows often contain interviews, or the moderators speak with each other, integrating text data from other sources like blogs or social media such as Twitter, which contain a kind of speech similar to that used in interviews, may lead to improvements.


More information

Social Media and Masonry

Social Media and Masonry Social Media and Masonry What is social media? Social media describes the various ways of using technology to connect with an audience. Every Lodge should have a social media or outreach program that connects

More information

Repurposing Your Podcast. 3 Places Your Podcast Must Be To Maximize Your Reach (And How To Use Each Effectively)

Repurposing Your Podcast. 3 Places Your Podcast Must Be To Maximize Your Reach (And How To Use Each Effectively) Repurposing Your Podcast 3 Places Your Podcast Must Be To Maximize Your Reach (And How To Use Each Effectively) What You ll Learn What 3 Channels (Besides itunes and Stitcher) Your Podcast Should Be On

More information

User Guide Contents The Toolbar The Menus The Spell Checker and Dictionary Adding Pictures to Documents... 80

User Guide Contents The Toolbar The Menus The Spell Checker and Dictionary Adding Pictures to Documents... 80 User Guide Contents Chapter 1 The Toolbar... 40 Unique Talking Toolbar Features... 40 Text Navigation and Selection Buttons... 42 Speech Buttons... 44 File Management Buttons... 45 Content Buttons... 46

More information

GOOGLE ANALYTICS 101 INCREASE TRAFFIC AND PROFITS WITH GOOGLE ANALYTICS

GOOGLE ANALYTICS 101 INCREASE TRAFFIC AND PROFITS WITH GOOGLE ANALYTICS GOOGLE ANALYTICS 101 INCREASE TRAFFIC AND PROFITS WITH GOOGLE ANALYTICS page 2 page 3 Copyright All rights reserved worldwide. YOUR RIGHTS: This book is restricted to your personal use only. It does not

More information

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used

More information

PreFeed: Cloud-Based Content Prefetching of Feed Subscriptions for Mobile Users. Xiaofei Wang and Min Chen Speaker: 饒展榕

PreFeed: Cloud-Based Content Prefetching of Feed Subscriptions for Mobile Users. Xiaofei Wang and Min Chen Speaker: 饒展榕 PreFeed: Cloud-Based Content Prefetching of Feed Subscriptions for Mobile Users Xiaofei Wang and Min Chen Speaker: 饒展榕 Outline INTRODUCTION RELATED WORK PREFEED FRAMEWORK SOCIAL RSS SHARING OPTIMIZATION

More information

Intelligent Hands Free Speech based SMS System on Android

Intelligent Hands Free Speech based SMS System on Android Intelligent Hands Free Speech based SMS System on Android Gulbakshee Dharmale 1, Dr. Vilas Thakare 3, Dr. Dipti D. Patil 2 1,3 Computer Science Dept., SGB Amravati University, Amravati, INDIA. 2 Computer

More information

Comprehensive Tool for Generation and Compatibility Management of Subtitles for English Language Videos

Comprehensive Tool for Generation and Compatibility Management of Subtitles for English Language Videos International Journal of Computational Intelligence Research ISSN 0973-1873 Volume 12, Number 1 (2016), pp. 63-68 Research India Publications http://www.ripublication.com Comprehensive Tool for Generation

More information

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? In our experience, we find we can get over-excited when talking to clients or family or friends and sometimes we forget that not everyone

More information

Loquendo TTS Director: The next generation prompt-authoring suite for creating, editing and checking prompts

Loquendo TTS Director: The next generation prompt-authoring suite for creating, editing and checking prompts Loquendo TTS Director: The next generation prompt-authoring suite for creating, editing and checking prompts 1. Overview Davide Bonardo with the collaboration of Simon Parr The release of Loquendo TTS

More information

Estimating I/O Memory Bandwidth

Estimating I/O Memory Bandwidth Estimating I/O Memory Bandwidth Diplomarbeit von cand. inform. Jos Ewert an der Fakultät für Informatik Erstgutachter: Zweitgutachter: Betreuender Mitarbeiter: Prof. Dr. Frank Bellosa Prof. Dr. Wolfgang

More information

Firespring Analytics

Firespring Analytics Firespring Analytics What do my website statistics mean? To answer this question, let's first consider how a web page is loaded. You've just typed in the address of a web page and hit go. Depending on

More information

Context-based Navigational Support in Hypermedia

Context-based Navigational Support in Hypermedia Context-based Navigational Support in Hypermedia Sebastian Stober and Andreas Nürnberger Institut für Wissens- und Sprachverarbeitung, Fakultät für Informatik, Otto-von-Guericke-Universität Magdeburg,

More information

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Chirag Shah Dept. of CSE IIT Madras Chennai - 600036 Tamilnadu, India. chirag@speech.iitm.ernet.in A. Nayeemulla Khan Dept. of CSE

More information

Google Analytics 101

Google Analytics 101 Copyright GetABusinessMobileApp.com All rights reserved worldwide. YOUR RIGHTS: This book is restricted to your personal use only. It does not come with any other rights. LEGAL DISCLAIMER: This book is

More information

Core Publisher Content Import

Core Publisher Content Import Core Publisher Content Import As part of the Core Publisher on-boarding process, stations can import archived content into their new Core Publisher site. Because of the station effort required to prepare

More information

2015 Search Ranking Factors

2015 Search Ranking Factors 2015 Search Ranking Factors Introduction Abstract Technical User Experience Content Social Signals Backlinks Big Picture Takeaway 2 2015 Search Ranking Factors Here, at ZED Digital, our primary concern

More information

BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES

BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES BUILDING CORPORA OF TRANSCRIBED SPEECH FROM OPEN ACCESS SOURCES O.O. Iakushkin a, G.A. Fedoseev, A.S. Shaleva, O.S. Sedova Saint Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg,

More information

CMU Sphinx: the recognizer library

CMU Sphinx: the recognizer library CMU Sphinx: the recognizer library Authors: Massimo Basile Mario Fabrizi Supervisor: Prof. Paola Velardi 01/02/2013 Contents 1 Introduction 2 2 Sphinx download and installation 4 2.1 Download..........................................

More information

Business Forum Mid Devon. Optimising your place on search engines

Business Forum Mid Devon. Optimising your place on search engines Optimising your place on search engines What do I know? Professional copywriter since 1996 Words inform Google and Bing Content is now king on Google Work on SEO campaigns for clients Who are Oxygen? Who

More information

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0 Website Name Project Code: #10001 Version: 1.0 DocID: SEO/site/rec Issue Date: DD-MM-YYYY Prepared By: - Owned By: Rave Infosys Reviewed By: - Approved By: - 3111 N University Dr. #604 Coral Springs FL

More information

Going to Another Board from the Welcome Board. Conference Overview

Going to Another Board from the Welcome Board. Conference Overview WebBoard for Users Going to Another Board from the Welcome Board Many WebBoard sites have more than one board, each with its own set of conferences and messages. When you click on Boards on the WebBoard

More information

Top 3 Marketing Metrics You Should Measure in Google Analytics

Top 3 Marketing Metrics You Should Measure in Google Analytics Top 3 Marketing Metrics You Should Measure in Google Analytics Presented By Table of Contents Overview 3 How to Use This Knowledge Brief 3 Metric to Measure: Traffic 4 Direct (Acquisition > All Traffic

More information

The Late Night Blog Post SEO Checklist

The Late Night Blog Post SEO Checklist The Late Night Blog Post SEO Checklist ü Do you know what keywords your post is targeting? When you write a blog post, you need to know WHY you are writing it. What action do you want the reader to take?

More information

Khmer OCR for Limon R1 Size 22 Report

Khmer OCR for Limon R1 Size 22 Report PAN Localization Project Project No: Ref. No: PANL10n/KH/Report/phase2/002 Khmer OCR for Limon R1 Size 22 Report 09 July, 2009 Prepared by: Mr. ING LENG IENG Cambodia Country Component PAN Localization

More information

Fraunhofer IAIS Audio Mining Solution for Broadcast Archiving. Dr. Joachim Köhler LT-Innovate Brussels

Fraunhofer IAIS Audio Mining Solution for Broadcast Archiving. Dr. Joachim Köhler LT-Innovate Brussels Fraunhofer IAIS Audio Mining Solution for Broadcast Archiving Dr. Joachim Köhler LT-Innovate Brussels 22.11.2016 1 Outline Speech Technology in the Broadcast World Deep Learning Speech Technologies Fraunhofer

More information

Search Enginge Optimization (SEO) Proposal

Search Enginge Optimization (SEO) Proposal Search Enginge Optimization (SEO) Proposal Proposal Letter Thank you for the opportunity to provide you with a quotation for the search engine campaign proposed by us for your website as per your request.our

More information

1. Create your website. 2. Choose a template

1. Create your website. 2. Choose a template WEBSELF TUTORIAL Are you a craftsman or an entrepreneur? Having a strong web presence today is critical. A website helps let your visitors, prospects, customers and partners know who you are and what services

More information

Reducing Overhead in Microkernel Based Multiserver Operating Systems through Register Banks

Reducing Overhead in Microkernel Based Multiserver Operating Systems through Register Banks Reducing Overhead in Microkernel Based Multiserver Operating Systems through Register Banks Studienarbeit von Sebastian Ottlik an der Fakultät für Informatik Verantwortlicher Betreuer: Betreuender Mitarbeiter:

More information

Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System

Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System Eun Ji Kim and Mun Yong Yi (&) Department of Knowledge Service Engineering, KAIST, Daejeon,

More information

Website Authority Checklist

Website Authority Checklist Website Authority Checklist A 20-point checklist for winning the lion s share of traffic and conversions in your industry. By David Jenyns A word from the wise You d have to agree, the web is forever changing

More information

Automated Tagging to Enable Fine-Grained Browsing of Lecture Videos

Automated Tagging to Enable Fine-Grained Browsing of Lecture Videos Automated Tagging to Enable Fine-Grained Browsing of Lecture Videos K.Vijaya Kumar (09305081) under the guidance of Prof. Sridhar Iyer June 28, 2011 1 / 66 Outline Outline 1 Introduction 2 Motivation 3

More information

Speech Tuner. and Chief Scientist at EIG

Speech Tuner. and Chief Scientist at EIG Speech Tuner LumenVox's Speech Tuner is a complete maintenance tool for end-users, valueadded resellers, and platform providers. It s designed to perform tuning and transcription, as well as parameter,

More information

Web Site Documentation Eugene School District 4J

Web Site Documentation Eugene School District 4J Eugene School District 4J Using this Documentation Revision 1.3 1. Instruction step-by-step. The left column contains the simple how-to steps. Over here on the right is the color commentary offered to

More information

Traffic Overdrive Send Your Web Stats Into Overdrive!

Traffic Overdrive Send Your Web Stats Into Overdrive! Traffic Overdrive Send Your Web Stats Into Overdrive! Table of Contents Generating Traffic To Your Website... 3 Optimizing Your Site For The Search Engines... 5 Traffic Strategy #1: Article Marketing...

More information

Traffic Triggers Domain Here.com

Traffic Triggers   Domain Here.com www.your Domain Here.com - 1 - Table of Contents INTRODUCTION TO TRAFFIC TRIGGERS... 3 SEARCH ENGINE OPTIMIZATION... 4 PAY PER CLICK MARKETING... 6 SOCIAL MARKETING... 9 PRESS RELEASES... 11 ARTICLE MARKETING...

More information

SCHULICH MEDICINE & DENTISTRY Website Updates August 30, Administrative Web Editor Guide v6

SCHULICH MEDICINE & DENTISTRY Website Updates August 30, Administrative Web Editor Guide v6 SCHULICH MEDICINE & DENTISTRY Website Updates August 30, 2012 Administrative Web Editor Guide v6 Table of Contents Chapter 1 Web Anatomy... 1 1.1 What You Need To Know First... 1 1.2 Anatomy of a Home

More information

Strong signs your website needs a professional redesign

Strong signs your website needs a professional redesign Strong signs your website needs a professional redesign Think - when was the last time that your business website was updated? Better yet, when was the last time you looked at your website? When the Internet

More information

Frequently Asked Questions- Communication, the Internet, Presentations Question 1: What is the difference between the Internet and the World Wide Web?

Frequently Asked Questions- Communication, the Internet, Presentations Question 1: What is the difference between the Internet and the World Wide Web? Frequently Asked Questions- Communication, the Internet, Presentations Question 1: What is the difference between the Internet and the World Wide Web? Answer 1: The Internet and the World Wide Web are

More information

OCR Interfaces for Visually Impaired

OCR Interfaces for Visually Impaired OCR Interfaces for Visually Impaired TOPIC ASSIGNMENT 2 Author: Sachin FERNANDES Graduate 8 Undergraduate Team 2 TOPIC PROPOSAL Instructor: Dr. Robert PASTEL March 4, 2016 LIST OF FIGURES LIST OF FIGURES

More information

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible.

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible. Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible. CS 224N / Ling 237 Programming Assignment 1: Language Modeling Due

More information

Conclusions. Chapter Summary of our contributions

Conclusions. Chapter Summary of our contributions Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web

More information

How To Construct A Keyword Strategy?

How To Construct A Keyword Strategy? Introduction The moment you think about marketing these days the first thing that pops up in your mind is to go online. Why is there a heck about marketing your business online? Why is it so drastically

More information

QUALITY SEO LINK BUILDING

QUALITY SEO LINK BUILDING QUALITY SEO LINK BUILDING Developing Your Online Profile through Quality Links TABLE OF CONTENTS Introduction The Impact Links Have on Your Search Profile 02 Chapter II Evaluating Your Link Profile 03

More information

VOX TURBO QUESTIONS AND ANSWER

VOX TURBO QUESTIONS AND ANSWER VOX TURBO QUESTIONS AND ANSWER While the dropdown rate is a must-have feature, I have also seen it become the source of some new problems. The most significant of these problems are punctuation and numbers

More information

Worldnow Producer. Stories

Worldnow Producer. Stories Worldnow Producer Stories Table of Contents Overview... 4 Getting Started... 4 Adding Stories... 5 Story Sections... 5 Toolbar... 5 Copy Live URL... 6 Headline... 6 Abridged Title... 6 Abridged Clickable

More information

Next Level Marketing Online techniques to grow your business Hudson Digital

Next Level Marketing Online techniques to grow your business Hudson Digital Next Level Marketing Online techniques to grow your business. 2019 Hudson Digital Your Online Presence Chances are you've already got a web site for your business. The fact is, today, every business needs

More information

Help! My Birthday Reminder Wants to Brick My Phone!

Help! My Birthday Reminder Wants to Brick My Phone! Institut für Technische Informatik und Kommunikationsnetze Master Thesis Help! My Birthday Reminder Wants to Brick My Phone! Student Name Advisor: Dr. Stephan Neuhaus, neuhaust@tik.ee.ethz.ch Professor:

More information

Activity Report at SYSTRAN S.A.

Activity Report at SYSTRAN S.A. Activity Report at SYSTRAN S.A. Pierre Senellart September 2003 September 2004 1 Introduction I present here work I have done as a software engineer with SYSTRAN. SYSTRAN is a leading company in machine

More information

How to Get Your Web Maps to the Top of Google Search

How to Get Your Web Maps to the Top of Google Search How to Get Your Web Maps to the Top of Google Search HOW TO GET YOUR WEB MAPS TO THE TOP OF GOOGLE SEARCH Chris Brown CEO & Co-founder of Mango SEO for web maps is particularly challenging because search

More information

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for. 1 OnCrawl Metrics What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for. UNLEASH YOUR SEO POTENTIAL Table of content 01 Crawl Analysis 02 Logs Monitoring

More information

Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition

Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition Kartik Audhkhasi, Abhinav Sethy Bhuvana Ramabhadran Watson Multimodal Group IBM T. J. Watson Research Center Motivation

More information

If most of the people use Google by writing a few words in the Google windows and perform their search it should be note:

If most of the people use Google by writing a few words in the Google windows and perform their search it should be note: The «Google World» Henri Dou, Competitive Intelligence Think Tank douhenri@yahoo.fr In chapter 1 we presented the key role of information in competitive intelligence and also the necessity to go through

More information

Getting Started With Google Analytics Detailed Beginner s Guide

Getting Started With Google Analytics Detailed Beginner s Guide Getting Started With Google Analytics Detailed Beginner s Guide Copyright 2009-2016 FATbit - All rights reserved. The number of active websites on the internet could exceed the billionth mark by the end

More information

A Letting agency s shop window is no longer a place on the high street, it is now online

A Letting agency s shop window is no longer a place on the high street, it is now online A Letting agency s shop window is no longer a place on the high street, it is now online 1 Let s start by breaking down the two ways in which search engines will send you more traffic: 1. Search Engine

More information

EBOOK. On-Site SEO Made MSPeasy Everything you need to know about Onsite SEO

EBOOK. On-Site SEO Made MSPeasy Everything you need to know about Onsite SEO EBOOK On-Site SEO Made MSPeasy Everything you need to know about Onsite SEO K SEO easy ut Onsite SEO What is SEO & How is it Used? SEO stands for Search Engine Optimisation. The idea of SEO is to improve

More information

Whitepaper Italy SEO Ranking Factors 2012

Whitepaper Italy SEO Ranking Factors 2012 Whitepaper Italy SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com

More information

Marketing & Back Office Management

Marketing & Back Office Management Marketing & Back Office Management Menu Management Add, Edit, Delete Menu Gallery Management Add, Edit, Delete Images Banner Management Update the banner image/background image in web ordering Online Data

More information

COPYRIGHTED MATERIAL. Introduction. 1.1 Introduction

COPYRIGHTED MATERIAL. Introduction. 1.1 Introduction 1 Introduction 1.1 Introduction One of the most fascinating characteristics of humans is their capability to communicate ideas by means of speech. This capability is undoubtedly one of the facts that has

More information

File: SiteExecutive 2013 Content Intelligence Modules User Guide.docx Printed January 20, Page i

File: SiteExecutive 2013 Content Intelligence Modules User Guide.docx Printed January 20, Page i File: SiteExecutive 2013 Content Intelligence Modules User Guide.docx Page i Contact: Systems Alliance, Inc. Executive Plaza III 11350 McCormick Road, Suite 1203 Hunt Valley, Maryland 21031 Phone: 410.584.0595

More information

Lecture 7: Neural network acoustic models in speech recognition

Lecture 7: Neural network acoustic models in speech recognition CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 7: Neural network acoustic models in speech recognition Outline Hybrid acoustic modeling overview Basic

More information

Search Engine Optimization. Rebecca Blanchette SEO & Analytics Specialist Carnegie Communications

Search Engine Optimization. Rebecca Blanchette SEO & Analytics Specialist Carnegie Communications Search Engine Optimization Rebecca Blanchette SEO & Analytics Specialist Carnegie Communications What is SEO anyway? The short answer: search engine optimization refers to the process of optimizing your

More information

PODCASTS, from A to P

PODCASTS, from A to P PODCASTS, from A to P Basics of Podcasting 1) What are podcasts all About? 2) Where do I get podcasts? 3) How do I start receiving a podcast? Art Gresham UCHUG Editor July 18 2009 Seniors Computer Group

More information

the magazine of the Marketing Research and Intelligence Association YEARS OF RESEARCH INTELLIGENCE A FUTURESPECTIVE

the magazine of the Marketing Research and Intelligence Association YEARS OF RESEARCH INTELLIGENCE A FUTURESPECTIVE the magazine of the Marketing Research and Intelligence Association vuemay 2010 5 YEARS OF RESEARCH INTELLIGENCE A FUTURESPECTIVE If You Want to Rank in Google, Start by Fixing Your Site You have an informative,

More information

Whitepaper US SEO Ranking Factors 2012

Whitepaper US SEO Ranking Factors 2012 Whitepaper US SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics Inc. 1115 Broadway 12th Floor, Room 1213 New York, NY 10010 Phone: 1 866-411-9494 E-Mail: sales-us@searchmetrics.com

More information

FUNCTIONAL BEST PRACTICES ORACLE USER PRODUCTIVITY KIT

FUNCTIONAL BEST PRACTICES ORACLE USER PRODUCTIVITY KIT FUNCTIONAL BEST PRACTICES ORACLE USER PRODUCTIVITY KIT Purpose Oracle s User Productivity Kit (UPK) provides functionality that enables content authors, subject matter experts, and other project members

More information

For more info on Cloud9 see their documentation:

For more info on Cloud9 see their documentation: Intro to Wordpress Cloud 9 - http://c9.io With the free C9 account you have limited space and only 1 private project. Pay attention to your memory, cpu and disk usage meter at the top of the screen. For

More information

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de Previous Lecture Audio Retrieval - Query by Humming

More information

Implementation and Verification of a Pollingbased MAC Layer Protocol for PLC

Implementation and Verification of a Pollingbased MAC Layer Protocol for PLC Implementation and Verification of a Pollingbased Layer Protocol for PLC Project Work by Luis Lax Cortina at the Institute of the Industrial Information Technology Zeitraum: 11.12.2009 12.06.2010 Hauptreferent:

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information