

Collecting Encyclopedic Knowledge Using the World Wide Web

Atsushi Fujii, Tetsuya Ishikawa
University of Library and Information Science

Abstract: This paper proposes a method to collect encyclopedic knowledge from the World Wide Web. For this purpose, we use linguistic patterns and text structures provided with HTML tags to extract text fragments containing term descriptions from Web pages. We then use a language model to discard extraneous descriptions, and a clustering method to summarize the resultant descriptions. We show the effectiveness of our method by way of experiments.

1 / World Wide Web Web Web [17, 18] HTML World Wide Web 20 50% [18] [8] [4] [13] Web Web [11, 12] [10, 24] Web [2, 9] Web Web

2 Web [5, 15] Web (1) (2) (3) (4) 4 1 (1) Web Web (2) Web HTML Web a) b) c) HTML <X>...</X> (3) Web 1 (4) 2 1 Web Web Web (1) (4) Web Web 2 2
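The abstract states that text structures provided with HTML tags are used to extract text fragments containing term descriptions. The concrete rules are not recoverable from this text, so the following is only a minimal Python sketch under an assumed rule: when a heading element contains the target term, the text up to the next heading is taken as a candidate description. The class `HeadingDescriptionParser` and the function `extract_descriptions` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class HeadingDescriptionParser(HTMLParser):
    """Collect text that follows a heading mentioning the target term.

    Assumed rule for illustration only; not the rule set of the paper.
    """

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, term):
        super().__init__()
        self.term = term
        self.in_heading = False   # currently inside <Hn>...</Hn>
        self.heading_text = []    # text pieces of the current heading
        self.capturing = False    # collecting text after a matching heading
        self.current = []         # text pieces of the fragment in progress
        self.fragments = []       # finished candidate descriptions

    def _flush(self):
        # Normalize whitespace and store the fragment if one was being captured.
        text = " ".join(" ".join(self.current).split())
        if self.capturing and text:
            self.fragments.append(text)
        self.current = []
        self.capturing = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()         # a new heading ends the previous fragment
            self.in_heading = True
            self.heading_text = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self.in_heading:
            self.in_heading = False
            # Start capturing only if the heading mentions the term.
            self.capturing = self.term in "".join(self.heading_text)

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text.append(data)
        elif self.capturing:
            self.current.append(data)

    def close(self):
        super().close()
        self._flush()             # store a fragment that runs to end of page


def extract_descriptions(html, term):
    """Return candidate description fragments for `term` found in `html`."""
    parser = HeadingDescriptionParser(term)
    parser.feed(html)
    parser.close()
    return parser.fragments
```

A fragment extracted this way is only a candidate; the abstract notes that extraneous descriptions are discarded afterwards with a language model.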

3 Web Web Web 1: 2: 3.3 a) b) [16, 21] [17] CD-ROM 8 [20] 2 EDR 12 [19] 2,259 [23] HTML Web HTML 2 <H> ? <A> HTML

4 HTML HTML 1 1. ... 1 2. <UL>...</UL> 3. N. 3 N =3 3.4 (1)/ (0) 2 N N Web CMU-Cambridge [1] 2 [22] N perplexity 1, [14] 8 5 Hierarchical Bayesian Clustering: HBC [6] HBC HBC : 4.1 Web
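The abstract says a language model is used to discard extraneous descriptions, and the surviving text refers to perplexity and to the CMU-Cambridge toolkit [1]. As a rough illustration of perplexity-based filtering, and not a reproduction of the setup used in the paper, the sketch below trains a word bigram model with add-one smoothing on reference descriptions and keeps only candidates whose perplexity is below a threshold. The bigram order, the whitespace tokenization, the smoothing, and the names `BigramModel` and `filter_descriptions` are all assumptions made for this example; Japanese text would first need word segmentation.

```python
import math
from collections import Counter


class BigramModel:
    """Word bigram model with add-one smoothing (illustrative assumption)."""

    def __init__(self, sentences):
        self.history = Counter()   # counts of words used as bigram history
        self.bigrams = Counter()   # counts of (previous word, word) pairs
        self.vocab = {"<unk>"}     # words that can be predicted
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.history.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
            self.vocab.update(tokens[1:])

    def _tokens(self, sentence):
        # Map words unseen in training to a single <unk> symbol.
        words = [w if w in self.vocab else "<unk>"
                 for w in sentence.lower().split()]
        return ["<s>"] + words + ["</s>"]

    def perplexity(self, sentence):
        tokens = self._tokens(sentence)
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.history[prev] + len(self.vocab)
            log_prob += math.log(numerator / denominator)
        # Perplexity is the inverse geometric mean of the word probabilities.
        return math.exp(-log_prob / (len(tokens) - 1))


def filter_descriptions(model, candidates, max_perplexity):
    """Keep candidates that the model finds sufficiently description-like."""
    return [c for c in candidates if model.perplexity(c) <= max_perplexity]
```

The threshold `max_perplexity` would have to be tuned on held-out data; no particular value is implied by the source.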

5 NACSIS [7] NACSIS goo Web % 27/ goo Web goo 1, Zipf % 292/ % 266/ NACSIS : NACSIS % 67.9% World Wide Web [3]

6 2: 27 Web Zipf 15 6, , ,399 3, ,3899, , ,6861,6943,141 1,682 10,078 1, , ,938 2, CD-ROM NACSIS

[1] Philip Clarkson and Ronald Rosenfeld. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of EuroSpeech'97, pp. 2707–2710.
[2] Oren Etzioni. Moving up the information food chain. AI Magazine, Vol. 18, No. 2, pp. 11–18.
[3] Atsushi Fujii and Tetsuya Ishikawa. Applying machine translation to two-stage cross-language information retrieval. In Proceedings of the 4th Conference of the Association for Machine Translation in the Americas.
[4] Vasileios Hatzivassiloglou and Kathleen R. McKeown. Towards the automatic identification of adjectival scales: Clustering adjectives according to meaning. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 172–182.
[5] Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda, Kouhei Kumasawa, and Naohide Arai. Basket analysis for graph structured data. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 420–431.
[6] Makoto Iwayama and Takenobu Tokunaga. Hierarchical Bayesian clustering for automatic text classification. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1322–1327.
[7] Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue. NACSIS test collection workshop (NTCIR-1). In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 299–300.
[8] Julian Kupiec and John Maxwell. Training stochastic grammars from unlabelled text corpora. In Workshop on Statistically-Based Natural Language Programming Techniques, AAAI Technical Reports WS
[9] Hougen Lu, Leon Sterling, and Alex Wyatt. Knowledge discovery in SportsFinder: An agent to extract sports results from the Web. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 469–473.
[10] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. A machine learning approach to building domain-specific search engines. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 662–667.
[11] Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 74–81.
[12] Philip Resnik. Mining the Web for bilingual texts. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 527–534.
[13] Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, Vol. 22, No. 1, pp. 1–38.
[14] ..,
[15] ,,,.., Vol. 15, No. 3, pp. 485–494,
[16] ,,,.., Vol. 7, No. 2, pp. 336–345,
[17] ,. Quest. 5, pp. 353–356,
[18] ,. Web. 6, pp. 296–299,
[19] .,
[20] . CD-ROM,
[21] ,,.. 5, pp. 124–127,
[22] . CD- '94-'95,
[23] ,,,,. version 1.5. Technical Report NAIST-IS-TR97007,,
[24] .. 6, pp. 447–450, 2000.


Applying Machine Translation to Two-Stage Cross-Language Information Retrieval
Atsushi Fujii and Tetsuya Ishikawa
arXiv:cs/0011003v1 [cs.CL] 2 Nov 2000
University of Library and Information Science 1-2