Clustering and Information Retrieval


Network Theory and Applications, Volume 11
Managing Editors:
Ding-Zhu Du, University of Minnesota, U.S.A.
Cauligi Raghavendra, University of Southern California, U.S.A.

Clustering and Information Retrieval

Weili Wu
Department of Computer Science, The University of Texas at Dallas, Mail Station EC 31, Box 830688, Richardson, TX 75083, U.S.A.

Hui Xiong and Shashi Shekhar
Department of Computer Science and Engineering, University of Minnesota - Twin Cities, EECS BLDG 4-192, 200 Union Street SE, Minneapolis, MN 55455, U.S.A.

Distributors for North, Central and South America:
Kluwer Academic Publishers
101 Philip Drive
Assinippi Park
Norwell, Massachusetts 02061 USA
Telephone (781) 871-6600
Fax (781) 871-6528
E-Mail <kluwer@wkap.com>

Distributors for all other countries:
Kluwer Academic Publishers Group
Post Office Box 322
3300 AH Dordrecht, THE NETHERLANDS
Telephone 31 786576000
Fax 31 786576474
E-Mail <orderdept@wkap.nl>

Electronic Services <http://www.wkap.nl>

Library of Congress Cataloging-in-Publication
Wu, Weili / Xiong, Hui / Shekhar, Shashi
Clustering and Information Retrieval
ISBN-13: 978-1-4613-7949-2
e-ISBN-13: 978-1-4613-0227-8
DOI: 10.1007/978-1-4613-0227-8

Copyright 2004 by Kluwer Academic Publishers
Softcover reprint of the hardcover 1st edition 2004

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photo-copying, microfilming, recording, or otherwise, without the prior written permission of the publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Permissions for books published in the USA: permissions@wkap.com
Permissions for books published in Europe: permissions@wkap.nl
Printed on acid-free paper.

Contents

Foreword

Clustering in Metric Spaces with Applications to Information Retrieval
Ricardo Baeza-Yates, Benjamin Bustos, Edgar Chavez, Norma Herrera, and Gonzalo Navarro

Techniques for Clustering Massive Data Sets
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim

Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach
Levent Ertöz, Michael Steinbach, and Vipin Kumar

On Quantitative Evaluation of Clustering Systems
Ji He, Ah-Hwee Tan, Chew-Lim Tan, and Sam-Yuan Sung

Techniques for Textual Document Indexing and Retrieval via Knowledge Sources and Data Mining
Wesley W. Chu, Victor Zhenyu Liu, and Wenlei Mao

Document Clustering, Visualization, and Retrieval via Link Mining
Steven Noel, Vijay Raghavan, and C.-H. Henry Chu

Query Clustering in the Web Context
Ji-Rong Wen and Hong-Jiang Zhang

Clustering Techniques for Large Database Cleansing
Sam Y. Sung, Zhao Li, and Tok W. Ling

A Science Data System Architecture for Information Retrieval
Daniel J. Crichton, J. Steven Hughes, and Sean Kelly

Granular Computing for the Design of Information Retrieval Support Systems
Y. Y. Yao

Foreword

Clustering is an important technique for discovering relatively dense sub-regions or sub-spaces of a multidimensional data distribution. Clustering has been used in information retrieval for many different purposes, such as query expansion, document grouping, document indexing, and visualization of search results. In this book, we address issues of clustering algorithms, evaluation methodologies, applications, and architectures for information retrieval.

The first two chapters discuss clustering algorithms. The chapter by Baeza-Yates et al. describes a clustering method for general metric spaces, a common model of data relevant to information retrieval. The chapter by Guha, Rastogi, and Shim presents a survey as well as a detailed discussion of two clustering algorithms, CURE and ROCK, for numeric and categorical data respectively.

Evaluation methodologies are addressed in the next two chapters. Ertöz et al. demonstrate the use of text retrieval benchmarks, such as TREC, to evaluate clustering algorithms. He et al. provide objective measures of clustering quality in their chapter.

Applications of clustering methods to information retrieval are addressed in the next four chapters. Chu et al. and Noel et al. explore feature selection using word stems, phrases, and link associations for document clustering and indexing. Wen et al. and Sung et al. discuss applications of clustering to user queries and data cleansing.

Finally, we consider the problem of designing architectures for information retrieval. Crichton, Hughes, and Kelly elaborate on the development of a scientific data system architecture for information retrieval. Their approach is to build a system solution that allows for the clustering and retrieval of information to support scientific research. In the final chapter of the book, Yao presents the design of information retrieval support systems (IRSS) using granular computing. IRSS is expected to be another general framework for supporting scientific research.

We wish to thank all those who contributed or reviewed articles for this book. We believe this collection of articles will serve as a useful reference in bridging the gap between clustering and information retrieval.

Weili Wu
Hui Xiong
Shashi Shekhar