Collecting Encyclopedic Knowledge Using the World Wide Web

Atsushi Fujii, Tetsuya Ishikawa
University of Library and Information Science

Abstract: This paper proposes a method to collect encyclopedic knowledge from the World Wide Web. For this purpose, we use linguistic patterns and text structures provided with HTML tags to extract text fragments containing term descriptions from Web pages. We then use a language model to discard extraneous descriptions, and a clustering method to summarize the resultant descriptions. We show the effectiveness of our method by way of experiments.

1 / World Wide Web Web Web [17, 18] HTML World Wide Web 20 50% [18] [8] [4] [13] Web Web [11, 12] [10, 24] Web [2, 9] Web Web
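The filtering step named in the abstract ("a language model to discard extraneous descriptions") can be illustrated with a small sketch. The surviving text mentions an N-gram model, perplexity, and the CMU-Cambridge toolkit [1]; the code below does not use that toolkit. It substitutes a hand-rolled bigram model with add-one smoothing, and the class name `BigramModel`, the function `filter_descriptions`, the training sentences, and the threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


class BigramModel:
    """Word bigram model with add-one smoothing (an illustrative stand-in)."""

    def __init__(self, sentences):
        self.contexts = Counter()  # how often each token appears as a left context
        self.bigrams = Counter()   # how often each (previous, current) pair appears
        vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            vocab.update(tokens)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        # One extra slot stands for all unseen words (a simplification).
        self.vocab_size = len(vocab) + 1

    def perplexity(self, sentence):
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(tokens) - 1))


def filter_descriptions(model, candidates, threshold):
    """Keep candidates whose perplexity does not exceed the threshold."""
    return [c for c in candidates if model.perplexity(c) <= threshold]


# Hypothetical training data written in the style of dictionary definitions.
reference = [
    "a protocol is a set of rules",
    "a compiler is a program that translates code",
]
model = BigramModel(reference)
kept = filter_descriptions(
    model,
    ["a parser is a program", "click here to buy now"],
    threshold=12.0,
)
```

With this toy data, text that resembles the reference definitions scores a lower perplexity than navigation-style text, so a threshold between the two separates them. In practice the threshold would have to be tuned on held-out data.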

2 Web [5, 15] Web (1) (2) (3) (4) 4 1 (1) Web Web (2) Web HTML Web a) b) c) HTML <X>...</X> (3) Web 1 (4) 2 1 Web Web Web (1) (4) Web Web 2 2

3 Web Web Web 1: 2: 3.3 a) b) [16, 21] [17] CD-ROM 8 [20] 2 EDR 12 [19] 2,259 [23] HTML Web HTML 2 <H> <B>? <A> HTML

4 HTML HTML 1 1. <P>...</P> </P> <P> 1 2. <UL>...</UL> 3. N. 3 N =3 3.4 (1)/ (0) 2 N N Web CMU-Cambridge [1] 2 [22] N perplexity 1, [14] 8 5 Hierarchical Bayesian Clustering: HBC [6] HBC HBC : 4.1 Web
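The three fragment types that survive in the text above (a `<P>...</P>` paragraph, a `<UL>...</UL>` itemization, and a run of N sentences with N = 3) can be sketched as a simple extraction routine. This is a minimal illustration, not the authors' implementation. The function name `extract_fragment`, the priority order of the rules, the regex-based tag handling, and the choice to start the sentence window at the sentence containing the term are all assumptions.

```python
import re
from typing import List, Optional

TAG = re.compile(r"<[^>]+>")


def _plain(fragment: str) -> str:
    """Replace tags with spaces and collapse runs of whitespace."""
    return " ".join(TAG.sub(" ", fragment).split())


def _blocks(html: str, tag: str) -> List[str]:
    """Return the inner HTML of every <tag>...</tag> region (non-nested)."""
    pattern = rf"<{tag}\b[^>]*>(.*?)</{tag}>"
    return re.findall(pattern, html, flags=re.IGNORECASE | re.DOTALL)


def extract_fragment(html: str, term: str, n_sentences: int = 3) -> Optional[str]:
    """Return a text fragment that contains `term`, or None if there is none.

    Rules are tried in this (assumed) order:
      1. a <P>...</P> paragraph containing the term,
      2. a <UL>...</UL> itemization containing the term,
      3. n_sentences consecutive sentences starting at the term's sentence.
    """
    for tag in ("p", "ul"):
        for block in _blocks(html, tag):
            text = _plain(block)
            if term in text:
                return text
    # Fallback: naive sentence split on ., ! and ? over the tag-free text.
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", _plain(html))]
    sentences = [s for s in sentences if s]
    for i, sentence in enumerate(sentences):
        if term in sentence:
            return " ".join(sentences[i:i + n_sentences])
    return None


# Hypothetical page used only to exercise the sketch.
page = "<html><body><p>Unrelated text.</p><p>A <b>perplexity</b> is a measure of fit.</p></body></html>"
fragment = extract_fragment(page, "perplexity")
```

The regex approach assumes well-formed, non-nested tags. Real pages often omit closing tags, so a production version would more likely use a tolerant HTML parser.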

5 NACSIS [7] NACSIS goo Web % 27/ goo Web goo 1, Zipf % 292/ % 266/ NACSIS : NACSIS % 67.9% World Wide Web [3]

6 2: 27 Web Zipf 15 6, , ,399 3, ,3899, , ,6861,6943,141 1,682 10,078 1, , ,938 2, CD-ROM NACSIS

[1] Philip Clarkson and Ronald Rosenfeld. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of EuroSpeech'97, pp. 2707-2710.
[2] Oren Etzioni. Moving up the information food chain. AI Magazine, Vol. 18, No. 2, pp. 11-18.
[3] Atsushi Fujii and Tetsuya Ishikawa. Applying machine translation to two-stage cross-language information retrieval. In Proceedings of the 4th Conference of the Association for Machine Translation in the Americas.
[4] Vasileios Hatzivassiloglou and Kathleen R. McKeown. Towards the automatic identification of adjectival scales: Clustering adjectives according to meaning. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 172-182.
[5] Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda, Kouhei Kumasawa, and Naohide Arai. Basket analysis for graph structured data. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 420-431.
[6] Makoto Iwayama and Takenobu Tokunaga. Hierarchical Bayesian clustering for automatic text classification. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1322-1327.
[7] Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue. NACSIS test collection workshop (NTCIR-1). In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 299-300.
[8] Julian Kupiec and John Maxwell. Training stochastic grammars from unlabelled text corpora. In Workshop on Statistically-Based Natural Language Programming Techniques, AAAI Technical Reports WS
[9] Hougen Lu, Leon Sterling, and Alex Wyatt. Knowledge discovery in SportsFinder: An agent to extract sports results from the Web. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 469-473.
[10] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. A machine learning approach to building domain-specific search engines. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 662-667.
[11] Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 74-81.
[12] Philip Resnik. Mining the Web for bilingual texts. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 527-534.
[13] Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, Vol. 22, No. 1, pp. 1-38.
[14]
[15] Vol. 15, No. 3, pp. 485-494.
[16] Vol. 7, No. 2, pp. 336-345.
[17] Quest. 5, pp. 353-356.
[18] Web. 6, pp. 296-299.
[19]
[20] CD-ROM.
[21] 5, pp. 124-127.
[22] CD- '94-'95.
[23] version 1.5. Technical Report NAIST-IS-TR97007.
[24] 6, pp. 447-450, 2000.


More information

Graph Classification in Heterogeneous

Graph Classification in Heterogeneous Title: Graph Classification in Heterogeneous Networks Name: Xiangnan Kong 1, Philip S. Yu 1 Affil./Addr.: Department of Computer Science University of Illinois at Chicago Chicago, IL, USA E-mail: {xkong4,

More information

Bitextor s participation in WMT 16: shared task on document alignment

Bitextor s participation in WMT 16: shared task on document alignment Bitextor s participation in WMT 16: shared task on document alignment Miquel Esplà-Gomis, Mikel L. Forcada Departament de Llenguatges i Sistemes Informàtics Universitat d Alacant, E-03690 Sant Vicent del

More information

Text clustering based on a divide and merge strategy

Text clustering based on a divide and merge strategy Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and

More information

Adaptive Web Sites: Conceptual Cluster Mining

Adaptive Web Sites: Conceptual Cluster Mining Adaptive Web Sites: Conceptual Cluster Mining Mike Perkowitz Oren Etzioni Department of Computer Science and Engineering, Box 352350 University of Washington, Seattle, WA 98195 {map, etzioni}@s.washington.edu

More information

Text Assisted Defence Information Extractor

Text Assisted Defence Information Extractor International Journal of Computational Engineering Research Vol, 03 Issue, 6 Text Assisted Defence Information Extractor Nishant Kumar 1, Shikha Suman 2, Anubhuti Khera 3, Kanika Agarwal 4 1 Scientist

More information

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,

More information

MaSMT: A Multi-agent System Development Framework for English-Sinhala Machine Translation

MaSMT: A Multi-agent System Development Framework for English-Sinhala Machine Translation MaSMT: A Multi-agent System Development Framework for English-Sinhala Machine Translation B. Hettige #1, A. S. Karunananda *2, G. Rzevski *3 # Department of Statistics and Computer Science, University

More information

Experiments on Patent Retrieval at NTCIR-4 Workshop

Experiments on Patent Retrieval at NTCIR-4 Workshop Working Notes of NTCIR-4, Tokyo, 2-4 June 2004 Exeriments on Patent Retrieval at NTCIR-4 Worksho Hironori Takeuchi Λ Naohiko Uramoto Λy Koichi Takeda Λ Λ Tokyo Research Laboratory, IBM Research y National

More information

Support System- Pioneering approach for Web Data Mining

Support System- Pioneering approach for Web Data Mining Support System- Pioneering approach for Web Data Mining Geeta Kataria 1, Surbhi Kaushik 2, Nidhi Narang 3 and Sunny Dahiya 4 1,2,3,4 Computer Science Department Kurukshetra University Sonepat, India ABSTRACT

More information

Navigation Retrieval with Site Anchor Text

Navigation Retrieval with Site Anchor Text Navigation Retrieval with Site Anchor Text Hideki Kawai Kenji Tateishi Toshikazu Fukushima NEC Internet Systems Research Labs. 8916-47, Takayama-cho, Ikoma-city, Nara, JAPAN {h-kawai@ab, k-tateishi@bq,

More information

Curriculum Vitae Lidong BING

Curriculum Vitae Lidong BING Curriculum Vitae Lidong BING Senior Researcher, Tencent AI Lab Contact Information Address: Langke Building, High Tech South 6th Road, Nanshan District, Shenzhen city, China. 518000 Tel: +86-18018715657

More information

Evaluation of the Document Categorization in Fixed-point Observatory

Evaluation of the Document Categorization in Fixed-point Observatory Evaluation of the Document Categorization in Fixed-point Observatory Yoshihiro Ueda Mamiko Oka Katsunori Houchi Service Technology Development Department Fuji Xerox Co., Ltd. 3-1 Minatomirai 3-chome, Nishi-ku,

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

A New Context Based Indexing in Search Engines Using Binary Search Tree

A New Context Based Indexing in Search Engines Using Binary Search Tree A New Context Based Indexing in Search Engines Using Binary Search Tree Aparna Humad Department of Computer science and Engineering Mangalayatan University, Aligarh, (U.P) Vikas Solanki Department of Computer

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Middleware for Ubiquitous Computing

Middleware for Ubiquitous Computing Middleware for Ubiquitous Computing Software Testing for Mobile Computing National Institute of Informatics Ichiro Satoh Abstract When a portable computing device is moved into and attached to a new local

More information

IE in Context. Machine Learning Problems for Text/Web Data

IE in Context. Machine Learning Problems for Text/Web Data Machine Learning Problems for Text/Web Data Lecture 24: Document and Web Applications Sam Roweis Document / Web Page Classification or Detection 1. Does this document/web page contain an example of thing

More information

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm Markov Cluster Algorithm Web Web Web Kleinberg HITS Web Web HITS Web Markov Cluster Algorithm ( ) Web The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm Kazutami KATO and Hiroshi

More information

Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot

Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot Ruslan Salakhutdinov Word Sense Disambiguation Word sense disambiguation (WSD) is defined as the problem of computationally

More information