Modeling Slang-style Word Formation for Retrieving Evaluative Information


Atsushi Fujii
Graduate School of Library, Information and Media Studies
University of Tsukuba
1-2 Kasuga, Tsukuba, Japan

Abstract

The volume of evaluative information for a specific item, such as opinions about an organization or reviews of a product, has been increasing rapidly on the World Wide Web. This trend has had a significant impact on both producers and consumers. However, the information overload problem on the Web makes it time consuming to identify relevant information. We propose a method for retrieving evaluative documents for an item. In evaluative documents, the item in question is often represented by a slang-style coined name, such as "Micro$oft" referring to "Microsoft", chosen for the purposes of euphemism or wordplay. To generate coined names for an item automatically, we have modeled slang-style word formation for Japanese. Coined names are used to query the Web, enabling us to retrieve evaluative documents that cannot be retrieved by existing methods. We show the effectiveness of our method experimentally.

1 Introduction

The volume of evaluative information for a specific item, such as opinions about an organization or reviews for a product, has been increasing rapidly on the World Wide Web. This trend has a significant impact on both producers and consumers. While a company may assess its reputation by analyzing customers' opinions, a customer may compare reviews before choosing a product. However, because a simple method for Web search, such as using the name of a target item as a query, usually retrieves a large number of extraneous pages, it is time consuming for users to identify pages that satisfy their information needs. Because evaluative information usually contains subjective descriptions, a number of methods for sentiment analysis can potentially alleviate this information overload problem.
Existing methods for sentiment analysis can be divided into three approaches: distinguishing between subjective and objective descriptions in texts (Eguchi and Lavrenko, 2006; Riloff and Wiebe, 2003), classifying subjective descriptions into bipolar categories (Hu and Liu, 2004; Turney, 2002) or multipoint scale categories (Pang and Lee, 2005), and summarizing subjective descriptions (Fujii and Ishikawa, 2006; Hu and Liu, 2004; Liu et al., 2005). Among the above approaches, the distinction between subjective and objective descriptions is the most straightforward solution to retrieving evaluative information. However, because existing methods rely on evaluative expressions associated with sentiment or subjectivity, such as "excellent" or "service is bad", as the clues, subjective descriptions that do not contain these expressions cannot be retrieved.

We propose a new method for retrieving evaluative documents for an item. The contribution of our research is that to overcome the limitation of existing methods, we explore a new feature for sentiment analysis. In evaluative documents on the Web, the item in question is often represented by a slang-style coined name, such as "Micro$oft" referring to "Microsoft", for the purposes of euphemism or wordplay. This implies that by using a slang-style coined name for an item as a query, we can identify evaluative documents that cannot be retrieved by existing methods.

In brief, given the name of a target item, our method automatically generates slang-style alternative names for that item and uses those names to query the Web. To realize this method, we need to model slang-style word formation, for which we currently target only Japanese. Because Japanese uses different types of characters, such as the Kanji, Katakana, and Hiragana alphabets, and other characters such as numerals, the mechanism of word formation in Japanese is complicated and thus our method can potentially be applied to other languages in the future.

Although slang has been a subject of linguistics, the purpose of past research was to analyze and classify slang words in terms of specific properties, such as usage and word formation. Our research is the first exploration of utilizing slang words for information retrieval and sentiment analysis.

Section 2 outlines our method for retrieving evaluative documents. Section 3 describes our method for generating slang-style names. Section 4 describes the experiments and discusses the results obtained.

2 Retrieval Method for Evaluative Documents

Given the name of a target item, our retrieval method performs the following two steps.
(1) We generate slang-style names for the input item.
(2) For each generated name, we search the Web for pages in which that name appears.

Our method does not classify the retrieved pages into semantic orientations, such as positive or negative. For this purpose, the existing classification methods in Section 1 can be utilized. However, because slang words are often used for criticism, our method tends to retrieve negative documents. We discuss this tendency in Section 4. In the current retrieval interface, users are allowed to select slang-style names to be used as a query; otherwise all the generated names are automatically used for retrieval purposes.

For step (1), we have modeled slang-style word formation in Japanese, which will be elaborated in Section 3. Our method for generating slang-style names uses both a target name and its pronunciation. If a target name consists of only Katakana or Hiragana, both of which comprise phonograms, its pronunciation is represented by the name. However, if a target name contains Kanji, which comprises ideograms, in principle we consult a dictionary for the pronunciation.
However, because target names are usually proper nouns and the out-of-vocabulary problem is therefore crucial, in practice a user is requested to provide the pronunciation of a target name in Katakana or Hiragana.

For step (2), an existing search engine on the Web can be used without any modification. However, after the retrieval, we discard the pages that do not include the generated query name verbatim. Irrelevant pages are often retrieved if the query name contains a special symbol that is used as a wildcard character, such as an asterisk (*), in the search engine used.

3 Generating Slang-style Names

3.1 Overview

To generate slang-style names for an item, we have identified types of slang-style word formation in Japanese and developed an automatic generation method for each type. Although a number of traditional references in linguistics have identified types of word formations in Japanese slang (Maeda, 1922; Nomura and Koike, 1992), our focus is so-called Internet slang and there are thus a number of word formation types that are not identified in traditional studies. Certain types of Internet slang are associated with word processing methods in computers. For example, many users make deliberate typographical errors to generate an unusual sequence of characters. In view of this background, we performed a preliminary study, in which we collected slang words from Japanese Web sites and identified their types in terms of the word formation.

However, not all slang types are desirable for our purpose. For example, abbreviation, which is typically used to generate both general and slang words, cannot be effective in retrieving only evaluative documents. In addition, not all slang types can be realized in an automatic method with high accuracy. For example, "Micro$oft" looks similar to "Microsoft" and may be associated with a company chasing a profit.
Again, a slang word of a company s name may be associated with the personality or physical characteristics of the president. To model such highly intelligent association or inference accurately, we require a knowledge-intensive method using a number of rules and heuristics. However, as the first step of our research, we currently focus only on word formation types that can be realized with straightforward algorithms and dictionaries available to the public. As a result, our initial work targets the following six types of word formation: blank, partial romanization, typographical similarity, character-type conversion, input-mode er-

3 ror, and Japanese-conversion error. For each word formation type, we describe the definition and the method for generating slang-style names in the following sections. 3.2 Blank One or more characters in an original name are not printed and are replaced with a special symbol, for which in English * (an asterisk) is often used, but in Japanese (a circle) is often used. placement is unidirectional ( ) or bidirectional ( ). Using Figure 1, for we can generate (/sofutopanku/) or (/nfutohasoku/). In principle, we can replace an arbitrary number of characters in the original name with any symbol. However, in practice we replace only a sin- gle character with, to restrict the number of names generated. Thus, for a name consisting of N characters, we generate N slang-style names. For example, for (/sofuto- banku/), which is the name of an information technology company in Japan, we generate six Figure 1: Correspondences for typographically names, such as and similar Katakana characters.. This method can also be used for names in English. For example, for softbank, which is the English name for, we can generate oftbank and s ftbank. Throughout this paper, we use slashes to indicate the pronunciation of Japanese words in Roman characters. 3.3 Partial Romanization One or more segments in an original name are romanized and only the first Roman character for each segment is used. Example names for are S and F. However, as in Section 3.2, to restrict the number of names generated, we romanize only a single character in the original name. To convert a Japanese character into its Roman representation, we use correspondences between Japanese and Roman characters 1. If the target character in question is a Katakana or Hiragana character, we simply consult these correspondences for its Roman representation. However if the target character is a Kanji character, we use the pronunciation of the target name. 
3.4 Typographical Similarity One or more characters in an original name are replaced with another character that is typographically similar to the character in question. For example, (/n/) and (/shi/) may be replaced with (/so/) and (/tsu/), respectively. 1 utashiro/perl/scripts/romkan pl/ We have empirically identified 44 pairs of typographically similar Katakana characters, and replace an arbitrary number of Katakana characters in the original name with their counterpart characters. Figure 1 shows the 44 typographically similar pairs, in which arrows denote whether the re- 3.5 Character-type Conversion Katakana characters in an original name are entirely or partially replaced with Hiragana characters and vice versa. An example name for is, in which is written in Hiragana characters. We use the pronunciation of the input name represented by Hiragana. We segment the original name into two segments with an arbitrary position and convert one of the segments into Katakana. Thus, for a name consisting of N characters, we consider N 1 segmentations. We use the EUC-JP code, in which Hiragana and Katakana can mutually be converted based on the character codes. 3.6 Input-mode Error The background of this word formation type should perhaps be explained. In typical front-end processors (FEPs) for Japanese, which help users to input Japanese characters, users are requested to choose Japanese or non-japanese mode. In either mode, users are allowed to input ASCII characters as indicated by the keyboard. However, in the Japanese mode, an input string is regarded as the romanization of the pronunciation of a Japanese word and will be converted into a plausible word in the Japanese alphabets. If a user intends to input an English word, but mistakenly chooses the Japanese mode, an input string is entirely or partially converted into

4 Japanese and the resultant string may look like an unusual combination of characters. Thus, users can purposefully choose the wrong mode to generate a slang-style name. If a user chooses the Japanese mode to input softbank, the resultant string can be ft k, in which so, ba, and n coincidently correspond to romanized Japanese moras. If a user chooses the non- Japanese mode to input Japanese characters, the resultant string is simply a romanization of the intended Japanese word. We use two different methods independently. If a target name is in Japanese, such as candidates for each segment. We retain the segments that do not correspond to a Japanese word, we simply convert the target name into its Roman representation, for which we use the romanization method in Section 3.3. However, if a Unlike the other methods in Sections , as Hiragana characters. target name is not in Japanese, we read the constituent characters in the target name sequentially this method often generates a plausible name in which usually generate unusual Japanese strings, and convert combinations of characters that are the Japanese that corresponds to an existing item. If same as the Roman representation for Japanese we use these names to query the Web, we cannot retrieve evaluative documents for a target item, moras into their corresponding Hiragana characters. Although in most cases a mora is a single but retrieve homepages for different items. For example, for (/fujiya/), which is a confec- vowel, such as (/a/), or a combination of one or more consonants followed by a vowel, such as tionery company in Japan, our method generated (/so/) and (/kya/), (/n/) consists of a single consonant. If we read the character n, we must read the next character to determine the resultant Hiragana character. If the next character is a vowel, we convert a combination of n and the vowel into the corresponding Hiragana characters, such as (/na/) or (/ni/) ; otherwise we convert n into (/n/). 
3.7 Japanese-conversion Error As explained in Section 3.6, in the Japanese mode, an input string is converted into one or more Japanese words with the same pronunciation as the input characters. However, because more than one Japanese word often corresponds to the same pronunciation, most Japanese FEPs use disambiguation methods and also allow users to choose a correct Japanese word from more than one candidate. This problem is crucial because Kanji comprises ideograms. Users can choose incorrect Japanese words purposefully, to generate unusual words and often play on the double meaning. For, an example word is, in which (/sofu/), (/to/), and (/banku/) mean grand father, and, and great pain, respectively. We use the SKK dictionary 2 for Japanese FEPs, which defines Japanese words and their pronunciation in Hiragana. We use this dictionary to segment the input Hiragana pronunciation and to derive possible Japanese words for each segment. In principle, we consider all possible segmentations of the input Hiragana pronunciation by consulting the SKK dictionary, and derive all possible Japanese words for each segment. However, to restrict the number of names generated, we currently segment the input into only two segments and use up to three Japanese word (/fujiya/), which is the name of a hotel. To resolve this problem, we check whether a generated name corresponds to an existing item, and if it does, we discard the name. We use a query classification method (Fujii, 2008), which automatically identifies whether a query is informational or navigational. While an informational query is used to obtain information in general, a navigational query is used to retrieve one or more representative pages for a known item, such as a homepage for a company or product. In other words, a navigational query is usually the name of an existing item and thus we discard the generated names classified as navigational. 
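Four of the dictionary-free generation rules above (Sections 3.2, 3.4, 3.5, and 3.6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: all function names are hypothetical, the look-alike table holds only the two pairs named in Section 3.4 (treated here as bidirectional) rather than the 44 pairs of Figure 1, the mora table is a small Kunrei-style subset of what a real FEP accepts, and Hiragana is converted to Katakana through the Unicode code-point offset instead of the EUC-JP codes mentioned in Section 3.5.

```python
import itertools

# Illustrative sketch only: names and tables are assumptions, not the paper's code.

# Small romaji-to-Hiragana table (a real FEP table is much larger).
ROWS = {
    "": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
    "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ",
    "g": "がぎぐげご", "z": "ざじずぜぞ", "d": "だぢづでど", "b": "ばびぶべぼ",
    "p": "ぱぴぷぺぽ",
}
MORA = {c + v: kana for c, row in ROWS.items() for v, kana in zip("aiueo", row)}
MORA["n"] = "ん"  # the moraic nasal is the only mora without a vowel

# Assumed subset of the 44 look-alike pairs, treated as bidirectional.
SIMILAR = {"ン": "ソ", "ソ": "ン", "シ": "ツ", "ツ": "シ"}


def blank_names(name, symbol="○"):
    """Section 3.2: replace exactly one character, giving N names."""
    return [name[:i] + symbol + name[i + 1:] for i in range(len(name))]


def similar_names(name):
    """Section 3.4: swap any non-empty subset of characters for look-alikes."""
    options = [(c, SIMILAR[c]) if c in SIMILAR else (c,) for c in name]
    variants = ("".join(chars) for chars in itertools.product(*options))
    return [v for v in variants if v != name]


def to_katakana(text):
    """Shift Hiragana code points (U+3041..U+3096) to Katakana (+0x60)."""
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c for c in text
    )


def character_type_names(hiragana):
    """Section 3.5: N - 1 split points, one side converted to Katakana."""
    names = []
    for i in range(1, len(hiragana)):
        head, tail = hiragana[:i], hiragana[i:]
        names += [to_katakana(head) + tail, head + to_katakana(tail)]
    return names


def input_mode_error(ascii_name):
    """Section 3.6: type a non-Japanese name while in Japanese mode."""
    out, i = [], 0
    while i < len(ascii_name):
        for size in (2, 1):  # longest match first, so "na" wins over "n"
            chunk = ascii_name[i:i + size]
            if chunk in MORA:
                out.append(MORA[chunk])
                i += len(chunk)
                break
        else:  # no mora starts here: keep the ASCII character unchanged
            out.append(ascii_name[i])
            i += 1
    return "".join(out)
```

Trying two-character chunks before one-character chunks is one way to realize the look-ahead rule for "n" in Section 3.6: "n" followed by a vowel is consumed as a single mora, and a lone "n" falls back to ん. Under these assumptions, `input_mode_error("softbank")` yields the Section 3.6 example string, and `blank_names("softbank", "*")` starts with `*oftbank` and `s*ftbank`.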
Because the above query classification method requires a collection of Web pages, we used the test collection produced for NTCIR-5. The target document set for NTCIR-5 consists of pages collected from the JP domain.

4 Experiments

4.1 Method

To evaluate the effectiveness of our retrieval method, we used the Japanese names of the following three companies as targets: Softbank, Amazon, and Fujiya. We refer to each target by its English name. Although in principle our method can be used with any type of item, such as a company or a product, we targeted only company names for the following two reasons. First, because the number of pages for a company is usually larger than that for a single product, the information overload problem is crucial for company names compared with product names. Second, because the cost of human judgment is prohibitive, it was necessary to restrict the number of target names.

While evaluative documents for Softbank and Fujiya are associated with a variety of their products, evaluative documents for Amazon are associated with its service for online shopping. Thus, an experiment for Amazon can be seen as either for a company or for a product.

For each target, we used the six methods in Section 3 to generate slang-style names. The numbers of names generated for Softbank, Amazon, and Fujiya were 44, 40, and 35, respectively. For each generated name, we used Yahoo! Japan to query the Web and we retrieved up to 20 top pages that contained the generated name. For each target, we also discarded duplicate pages. As a result, the numbers of pages retrieved for Softbank, Amazon, and Fujiya were 524, 474, and 416, respectively.

As the baseline method, for each target, we used its original name as a query. Because the number of pages retrieved for slang-style names for each target was approximately 500, we retrieved up to 500 pages for each target. We also discarded duplicate pages and the pages that did not contain the query. As a result, the numbers of pages retrieved for Softbank, Amazon, and Fujiya were 489, 428, and 449, respectively.

For each retrieved page, an assessor assigned one of three categories: positive evaluation (Pos), negative evaluation (Neg), and no evaluation (No). For each target, we calculated the retrieval accuracy (Acc), which is the ratio of the number of Pos and Neg pages to the total number of pages retrieved (Total).

Because our method does not determine priorities of different generated names, it produces more than one ranked document list for each target. Thus, our method cannot be evaluated by such common measures as Mean Average Precision and Mean Reciprocal Rank, which use the rank of each document in a single list.

4.2 Results and Discussion

Tables 1 and 2 show the retrieval accuracy and other figures for slang-style and original names, respectively. In Table 1, Fujiya and Fujiya* denote the results with or without the query classification method (Fujii, 2008) for filtering purposes. Using the filtering method, we discarded six generated names and successfully reduced the number of irrelevant pages while maintaining the number of evaluative pages. We did not use the filtering method for the other two targets, whose names consist of only Katakana characters.

Table 1: Retrieval accuracy for slang-style names.
Columns: Target, Pos, Neg, No, Total, Acc (%). Rows: Softbank, Amazon, Fujiya, Fujiya*.

Table 2: Retrieval accuracy for original names.
Columns: Target, Pos, Neg, No, Total, Acc (%). Rows: Softbank, Amazon, Fujiya.

Table 3: Retrieval accuracy for each slang type.
Columns: Type, #Names, Pos, Neg, No, Total, Acc (%). Rows: BL, PR, TS, CC, IE, JE.

Comparing the results in Tables 1 and 2, our method retrieved more evaluative pages and achieved a higher accuracy than the baseline method, irrespective of the target. Comparing the results for Fujiya and Fujiya* in Table 1, our filtering method was effective in improving the retrieval accuracy.

We investigated the number of evaluative pages retrieved by our method that could not be retrieved by existing methods. Our method retrieved 188 evaluative pages in total, of which 166 pages did not contain the original target name. These 166 pages could not be retrieved by existing methods that use the original name of a target and additional evaluative expressions.

Table 3 shows the number of names generated and the retrieval accuracy for each slang type: blank (BL), partial romanization (PR), typographical similarity (TS), character-type conversion (CC), input-mode error (IE), and Japanese-conversion error (JE). The number of names by IE is only two, because for two of the targets we did not use the names by IE, which are identical to their Japanese or English official names. Because we distinguish pages retrieved by different slang types in Table 3, the total number of pages in Table 3 is more than that in Table 1. In Table 3, BL retrieved the largest number of evaluative pages and achieved the highest accuracy.

For each negative page, we analyzed whether the description was emotional or rational. The numbers of emotional negative descriptions retrieved by the baseline method and our method were 5 and 39, respectively. If a user intends to find slanders against a company, retrieving emotional descriptions is crucial, and our method does so effectively.

To identify the reasons for errors using our method, we analyzed irrelevant pages retrieved by our method and found that in those pages generated names used as queries often matched typographic errors, handle names, or Kanji characters in Chinese pages. The irrelevant pages that include typographical errors of a query are divided into intentional and unintentional. Those who try to increase the page views of a specific page often embed high-frequency typographic errors for a company or product in that page, so that a user who mistakenly uses an incorrect query may reach that page. We will analyze the characteristics of these pages for filtering purposes in the future.

5 Conclusion

We have proposed a method for retrieving evaluative documents for a specific item. Because evaluative documents often include slang-style coined names, to retrieve these pages, we modeled slang-style word formation in Japanese. We also showed the effectiveness of our method experimentally.

Acknowledgments

This research was supported in part by MEXT Grant-in-Aid Scientific Research on Priority Area of New IT Infrastructure for the Information-explosion Era.

References

Koji Eguchi and Victor Lavrenko. 2006. Sentiment retrieval using generative models. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing.

Atsushi Fujii. 2008. Modeling anchor text and classifying queries to enhance Web document retrieval. In Proceedings of the 17th International World Wide Web Conference.

Atsushi Fujii and Tetsuya Ishikawa. 2006. A system for summarizing and visualizing arguments in subjective documents: Toward supporting decision making. In Proceedings of COLING-ACL Workshop on Sentiment and Subjectivity in Text.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Bing Liu, Minqing Hu, and Junsheng Cheng. 2005. Opinion observer: Analyzing and comparing opinions on the Web. In Proceedings of the 14th International World Wide Web Conference.

Taro Maeda. 1922. Gairaigonokenkyuu (A Study on Loanwords). Iwanami Shoten publisher. (In Japanese).

Masaaki Nomura and Seiji Koike. 1992. Nihongojiten (An Encyclopedia for Japanese). Tokyodo publisher. (In Japanese).

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.

Ellen Riloff and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.

Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
