MINING VERY LARGE DATABASES WITH PARALLEL PROCESSING

The Kluwer International Series on
ADVANCES IN DATABASE SYSTEMS

Series Editor
Ahmed K. Elmagarmid
Purdue University
West Lafayette, IN 47907

Other books in the Series:

DATABASE CONCURRENCY CONTROL: Methods, Performance, and Analysis
by Alexander Thomasian
ISBN: 0-7923-9741-X

TIME-CONSTRAINED TRANSACTION MANAGEMENT: Real-Time Constraints in Database Transaction Systems
by Nandit R. Soparkar, Henry F. Korth, Abraham Silberschatz
ISBN: 0-7923-9752-5

SEARCHING MULTIMEDIA DATABASES BY CONTENT
by Christos Faloutsos
ISBN: 0-7923-9777-0

REPLICATION TECHNIQUES IN DISTRIBUTED SYSTEMS
by Abdelsalam A. Helal, Abdelsalam A. Heddaya, Bharat B. Bhargava
ISBN: 0-7923-9800-9

VIDEO DATABASE SYSTEMS: Issues, Products, and Applications
by Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi, Magdy Ahmed
ISBN: 0-7923-9872-6

DATABASE ISSUES IN GEOGRAPHIC INFORMATION SYSTEMS
by Nabil R. Adam and Aryya Gangopadhyay
ISBN: 0-7923-9924-2

INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES
by Thomas A. Mueck and Martin L. Polaschek
ISBN: 0-7923-9971-4

INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS
by Elisa Bertino, Beng Chin Ooi, Ron Sacks-Davis, Kian-Lee Tan, Justin Zobel, Boris Shidlovsky and Barbara Catania
ISBN: 0-7923-9985-4

MINING VERY LARGE DATABASES WITH PARALLEL PROCESSING

by

Alex A. Freitas
University of Essex
Colchester, United Kingdom

and

Simon H. Lavington
University of Essex
Colchester, United Kingdom

SPRINGER SCIENCE+BUSINESS MEDIA, LLC

Library of Congress Cataloging-in-Publication Data

Freitas, Alex A., 1964-
Mining very large databases with parallel processing / by Alex A. Freitas and Simon H. Lavington.
p. cm. -- (The Kluwer international series on advances in database systems)
Includes bibliographical references and index.
ISBN 978-1-4613-7523-4
ISBN 978-1-4615-5521-6 (ebook)
DOI 10.1007/978-1-4615-5521-6
1. Database management. 2. Data mining. 3. Transaction systems (Computer systems) 4. Parallel processing (Electronic computers) I. Lavington, S. H. (Simon Hugh), 1939-. II. Title. III. Series.
QA76.9.D3F745 1998
006.3--dc21
97-41615 CIP

Copyright by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 2000
Softcover reprint of the hardcover 1st edition 2000

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.

This book is dedicated to all the people who believe that learning is not only one of the most necessary but also one of the noblest human activities.

CONTENTS

PREFACE
ACKNOWLEDGMENTS

INTRODUCTION
    The Motivation for Data Mining and Knowledge Discovery
    The Inter-disciplinary Nature of Knowledge Discovery in Databases (KDD)
    The Challenge of Efficient Knowledge Discovery in Large Databases and Data Warehouses
    Organization of the Book

Part I KNOWLEDGE DISCOVERY AND DATA MINING

1 KNOWLEDGE DISCOVERY TASKS
    1.1 Discovery of Association Rules
    1.2 Classification
    1.3 Other KDD Tasks

2 KNOWLEDGE DISCOVERY PARADIGMS
    2.1 Rule Induction (RI)
    2.2 Instance-Based Learning (IBL)
    2.3 Neural Networks (NN)
    2.4 Genetic Algorithms (GA)
    2.5 On-Line Analytical Processing (OLAP)
    2.6 Focus on Rule Induction

3 THE KNOWLEDGE DISCOVERY PROCESS
    3.1 An Overview of the Knowledge Discovery Process
    3.2 Data Warehouse (DW)
    3.3 Attribute Selection
    3.4 Discretization
    3.5 Rule-Set Refinement

4 DATA MINING
    4.1 Decision-Tree Building
    4.2 Overfitting
    4.3 Data-Mining-Algorithm Bias
    4.4 Improved Representation Languages
    4.5 Integrated Data Mining Architectures

5 DATA MINING TOOLS
    5.1 Clementine
    5.2 Darwin
    5.3 MineSet
    5.4 Intelligent Miner
    5.5 Decision-Tree-Building Tools

Part II PARALLEL DATABASE SYSTEMS

6 BASIC CONCEPTS ON PARALLEL PROCESSING
    6.1 Temporal and Spatial Parallelism
    6.2 Granularity, Level and Degree of Parallelism
    6.3 Shared and Distributed Memory
    6.4 Evaluating the Performance of a Parallel System
    6.5 Communication Overhead
    6.6 Load Balancing
    6.7 Approaches for Exploiting Parallelism

7 DATA PARALLELISM, CONTROL PARALLELISM AND RELATED ISSUES
    7.1 Data Parallelism and Control Parallelism
    7.2 Ease of Use and Automatic Parallelization
    7.3 Machine-Architecture Independence
    7.4 Scalability
    7.5 Data Partitioning
    7.6 Data Placement (Declustering)

8 PARALLEL DATABASE SERVERS
    8.1 Architectures of Parallel Database Servers
    8.2 From the Teradata DBC 1012 to the NCR WorldMark 5100
    8.3 ICL Goldrush Running Oracle Parallel Server
    8.4 IBM SP2 Running DB2 Parallel Edition (DB2-PE)
    8.5 Monet

Part III PARALLEL DATA MINING

9 APPROACHES TO SPEED UP DATA MINING
    9.1 Overview of Approaches to Speed up Data Mining
    9.2 Discretization
    9.3 Attribute Selection
    9.4 Sampling and Related Approaches
    9.5 Fast Algorithms
    9.6 Distributed Data Mining
    9.7 Parallel Data Mining
    9.8 Discussion

10 PARALLEL DATA MINING WITHOUT DBMS FACILITIES
    10.1 Parallel Rule Induction
    10.2 Parallel Decision-Tree Building
    10.3 Parallel Instance-Based Learning
    10.4 Parallel Genetic Algorithms
    10.5 Parallel Neural Networks
    10.6 Discussion

11 PARALLEL DATA MINING WITH DATABASE FACILITIES
    11.1 An Overview of Integrated Data Mining/Data Warehouse Frameworks
    11.2 The Case for Integrating Data Mining and the Data Warehouse
    11.3 Server-Based KDD Systems
    11.4 Hybrid Client/Server-Based KDD Systems
    11.5 Generic, Set-Oriented Primitives for the Hybrid Client/Server-Based KDD Framework
    11.6 A Generic, Set-Oriented Primitive for Candidate-Rule (CR) Evaluation in Rule Induction
    11.7 A Generic, Set-Oriented Primitive for Computing Distance Metrics in Instance-Based Learning
    11.8 Parallel Data Mining with Specialized-Hardware Parallel Database Servers

12 SUMMARY AND SOME OPEN PROBLEMS
    12.1 Data-Parallel vs. Control-Parallel Data Mining
    12.2 Client/Server Frameworks for Parallel Data Mining
    12.3 Open Problems

REFERENCES
INDEX

PREFACE

This book addresses the problem of large-scale data mining. It is an interdisciplinary text, describing advances in the integration of three computer science areas, namely "intelligent" (machine learning-based) data mining techniques, relational databases, and parallel processing. The basic idea is to use concepts and techniques of the latter two areas, particularly parallel processing, to speed up and scale up data mining algorithms.

The book is divided into three parts. The first part presents a comprehensive review of intelligent data mining techniques such as rule induction, instance-based learning, neural networks and genetic algorithms. Likewise, the second part presents a comprehensive review of parallel processing and parallel databases. Each of these parts includes an overview of commercially available, state-of-the-art tools. The third part deals with the application of parallel processing to data mining. The emphasis is on finding generic, cost-effective solutions for realistic data volumes. Two parallel computational environments are discussed: the first excludes the use of a commercial-strength DBMS, and the second uses parallel DBMS servers.

It is assumed that the reader has knowledge roughly equivalent to a first degree (B.Sc.) in the exact sciences, so that (s)he is reasonably familiar with basic concepts of statistics and computer science.

The primary audience for this book is industry data miners and practitioners in general, who would like to apply intelligent data mining techniques to large amounts of data. The book will also be of interest to academic researchers and post-graduate students, particularly database researchers interested in advanced, intelligent database applications and artificial intelligence researchers interested in industrial, real-world applications of machine learning.

ACKNOWLEDGMENTS

Since we started to work on data mining we have had the help of several good people. We are grateful to all of them for their support. In particular, we would like to express our thanks to the following people:

To Dominicus R. Thoen and Neil E.J. Dewhurst, for their help in some data mining experiments and for their support in general.

To Paul Scott, for interesting discussions about data mining and machine learning.

To Steve Hassan, for his help in using the White Cross WX9010 parallel database server.

To Foster Provost, Richard Kufrin, and Sarabjot Anand, for interesting discussions about parallel data mining and for their encouragement.

During the project that led to the writing of this book, the first author was financially supported by a grant from the Brazilian government's National Council of Scientific and Technological Development (CNPq), process number 200384/93-7.