Clustering and Information Retrieval

Network Theory and Applications
Volume 11

Managing Editors:
Ding-Zhu Du, University of Minnesota, U.S.A.
Cauligi Raghavendra, University of Southern California, U.S.A.

Weili Wu
Department of Computer Science, The University of Texas at Dallas, Mail Station EC 31, Box 830688, Richardson, TX 75083, U.S.A.

Hui Xiong and Shashi Shekhar
Department of Computer Science and Engineering, University of Minnesota - Twin Cities, EECS BLDG 4-192, 200 Union Street SE, Minneapolis, MN 55455, U.S.A.

Distributors for North, Central and South America:
Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061 USA
Telephone (781) 871-6600, Fax (781) 871-6528, E-Mail <kluwer@wkap.com>

Distributors for all other countries:
Kluwer Academic Publishers Group, Post Office Box 322, 3300 AH Dordrecht, THE NETHERLANDS
Telephone 31 786576000, Fax 31 786576474, E-Mail <orderdept@wkap.nl>

Electronic Services <http://www.wkap.nl>

Library of Congress Cataloging-in-Publication
Wu, Weili / Xiong, Hui / Shekhar, Shashi
Clustering and Information Retrieval
ISBN-13: 978-1-4613-7949-2
e-ISBN-13: 978-1-4613-0227-8
DOI: 10.1007/978-1-4613-0227-8

Copyright 2004 by Kluwer Academic Publishers
Softcover reprint of the hardcover 1st edition 2004

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photo-copying, microfilming, recording, or otherwise, without the prior written permission of the publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Permissions for books published in the USA: permissions@wkap.com
Permissions for books published in Europe: permissions@wkap.nl

Printed on acid-free paper.

Contents

Foreword

Clustering in Metric Spaces with Applications to Information Retrieval
    Ricardo Baeza-Yates, Benjamin Bustos, Edgar Chavez, Norma Herrera, and Gonzalo Navarro

Techniques for Clustering Massive Data Sets
    Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim

Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach
    Levent Ertöz, Michael Steinbach, and Vipin Kumar

On Quantitative Evaluation of Clustering Systems
    Ji He, Ah-Hwee Tan, Chew-Lim Tan, and Sam-Yuan Sung

Techniques for Textual Document Indexing and Retrieval via Knowledge Sources and Data Mining
    Wesley W. Chu, Victor Zhenyu Liu, and Wenlei Mao

Document Clustering, Visualization, and Retrieval via Link Mining
    Steven Noel, Vijay Raghavan, and C.-H. Henry Chu

Query Clustering in the Web Context
    Ji-Rong Wen and Hong-Jiang Zhang

Clustering Techniques for Large Database Cleansing
    Sam Y. Sung, Zhao Li, and Tok W. Ling

A Science Data System Architecture for Information Retrieval
    Daniel J. Crichton, J. Steven Hughes, and Sean Kelly

Granular Computing for the Design of Information Retrieval Support Systems
    Y. Y. Yao

Foreword

Clustering is an important technique for discovering relatively dense sub-regions or sub-spaces of a multidimensional data distribution. Clustering has been used in information retrieval for many different purposes, such as query expansion, document grouping, document indexing, and visualization of search results. In this book, we address clustering algorithms, evaluation methodologies, applications, and architectures for information retrieval.

The first two chapters discuss clustering algorithms. The chapter by Baeza-Yates et al. describes a clustering method for general metric spaces, a common model of data relevant to information retrieval. The chapter by Guha, Rastogi, and Shim presents a survey as well as a detailed discussion of two clustering algorithms: CURE for numeric data and ROCK for categorical data.

Evaluation methodologies are addressed in the next two chapters. Ertöz et al. demonstrate the use of text retrieval benchmarks, such as TREC, to evaluate clustering algorithms. He et al. provide objective measures of clustering quality in their chapter.

Applications of clustering methods to information retrieval are addressed in the next four chapters. Chu et al. and Noel et al. explore feature selection using word stems, phrases, and link associations for document clustering and indexing. Wen and Zhang and Sung et al. discuss applications of clustering to user queries and data cleansing, respectively.

Finally, we consider the problem of designing architectures for information retrieval. Crichton, Hughes, and Kelly elaborate on the development of a scientific data system architecture for information retrieval. Their approach is to build a system solution that allows for the clustering and retrieval of information to support scientific research. In the final chapter of the book, Yao presents the design of an information retrieval support system (IRSS) using granular computing. IRSS is expected to be another general framework for supporting scientific research.

We wish to thank all those who contributed articles or reviewed articles for this book. We believe this collection of articles will serve as a useful reference in bridging the gap between clustering and information retrieval.

Weili Wu
Hui Xiong
Shashi Shekhar