Table of Contents: Data Mining: Concepts and Techniques

Rafe Lane
5 years ago
Data Mining: Concepts and Techniques
Table of Contents

Foreword
Foreword to Second Edition
Preface
Acknowledgments
About the Authors

Chapter 1 Introduction
1.1 Why Data Mining?
  1.1.1 Moving toward the Information Age
  1.1.2 Data Mining as the Evolution of Information Technology
1.2 What Is Data Mining?
1.3 What Kinds of Data Can Be Mined?
  1.3.1 Database Data
  1.3.2 Data Warehouses
  1.3.3 Transactional Data
  1.3.4 Other Kinds of Data
1.4 What Kinds of Patterns Can Be Mined?
  1.4.1 Class/Concept Description: Characterization and Discrimination
  1.4.2 Mining Frequent Patterns, Associations, and Correlations
  1.4.3 Classification and Regression for Predictive Analysis
  1.4.4 Cluster Analysis
  1.4.5 Outlier Analysis
  1.4.6 Are All Patterns Interesting?
1.5 Which Technologies Are Used?
  1.5.1 Statistics
  1.5.2 Machine Learning
  1.5.3 Database Systems and Data Warehouses
  1.5.4 Information Retrieval
1.6 Which Kinds of Applications Are Targeted?
  1.6.1 Business Intelligence
  1.6.2 Web Search Engines
1.7 Major Issues in Data Mining
  1.7.1 Mining Methodology
  1.7.2 User Interaction
  1.7.3 Efficiency and Scalability
  1.7.4 Diversity of Database Types
  1.7.5 Data Mining and Society
1.8 Summary
1.9 Exercises
1.10 Bibliographic Notes

Chapter 2 Getting to Know Your Data
2.1 Data Objects and Attribute Types
  2.1.1 What Is an Attribute?
  2.1.2 Nominal Attributes
  2.1.3 Binary Attributes
  2.1.4 Ordinal Attributes
  2.1.5 Numeric Attributes
  2.1.6 Discrete versus Continuous Attributes
2.2 Basic Statistical Descriptions of Data
  2.2.1 Measuring the Central Tendency: Mean, Median, and Mode
  2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
  2.2.3 Graphic Displays of Basic Statistical Descriptions of Data
2.3 Data Visualization
  2.3.1 Pixel-Oriented Visualization Techniques
  2.3.2 Geometric Projection Visualization Techniques
  2.3.3 Icon-Based Visualization Techniques
  2.3.4 Hierarchical Visualization Techniques
  2.3.5 Visualizing Complex Data and Relations
2.4 Measuring Data Similarity and Dissimilarity
  2.4.1 Data Matrix versus Dissimilarity Matrix
  2.4.2 Proximity Measures for Nominal Attributes
  2.4.3 Proximity Measures for Binary Attributes
  2.4.4 Dissimilarity of Numeric Data: Minkowski Distance
  2.4.5 Proximity Measures for Ordinal Attributes
  2.4.6 Dissimilarity for Attributes of Mixed Types
  2.4.7 Cosine Similarity
2.5 Summary
2.6 Exercises
2.7 Bibliographic Notes

Chapter 3 Data Preprocessing
3.1 Data Preprocessing: An Overview
  3.1.1 Data Quality: Why Preprocess the Data?
  3.1.2 Major Tasks in Data Preprocessing
3.2 Data Cleaning
  3.2.1 Missing Values
  3.2.2 Noisy Data
  3.2.3 Data Cleaning as a Process
3.3 Data Integration
  3.3.1 Entity Identification Problem
  3.3.2 Redundancy and Correlation Analysis
  3.3.3 Tuple Duplication
  3.3.4 Data Value Conflict Detection and Resolution
3.4 Data Reduction
  3.4.1 Overview of Data Reduction Strategies
  3.4.2 Wavelet Transforms
  3.4.3 Principal Components Analysis
  3.4.4 Attribute Subset Selection
  3.4.5 Regression and Log-Linear Models: Parametric Data Reduction
  3.4.6 Histograms
  3.4.7 Clustering
  3.4.8 Sampling
  3.4.9 Data Cube Aggregation
3.5 Data Transformation and Data Discretization
  3.5.1 Data Transformation Strategies Overview
  3.5.2 Data Transformation by Normalization
  3.5.3 Discretization by Binning
  3.5.4 Discretization by Histogram Analysis
  3.5.5 Discretization by Cluster, Decision Tree, and Correlation Analyses
  3.5.6 Concept Hierarchy Generation for Nominal Data
3.6 Summary
3.7 Exercises
3.8 Bibliographic Notes

Chapter 4 Data Warehousing and Online Analytical Processing
4.1 Data Warehouse: Basic Concepts
  4.1.1 What Is a Data Warehouse?
  4.1.2 Differences between Operational Database Systems and Data Warehouses
  4.1.3 But, Why Have a Separate Data Warehouse?
  4.1.4 Data Warehousing: A Multitiered Architecture
  4.1.5 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
  4.1.6 Extraction, Transformation, and Loading
  4.1.7 Metadata Repository
4.2 Data Warehouse Modeling: Data Cube and OLAP
  4.2.1 Data Cube: A Multidimensional Data Model
  4.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
  4.2.3 Dimensions: The Role of Concept Hierarchies
  4.2.4 Measures: Their Categorization and Computation
  4.2.5 Typical OLAP Operations
  4.2.6 A Starnet Query Model for Querying Multidimensional Databases
4.3 Data Warehouse Design and Usage
  4.3.1 A Business Analysis Framework for Data Warehouse Design
  4.3.2 Data Warehouse Design Process
  4.3.3 Data Warehouse Usage for Information Processing
  4.3.4 From Online Analytical Processing to Multidimensional Data Mining
4.4 Data Warehouse Implementation
  4.4.1 Efficient Data Cube Computation: An Overview
  4.4.2 Indexing OLAP Data: Bitmap Index and Join Index
  4.4.3 Efficient Processing of OLAP Queries
  4.4.4 OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP
4.5 Data Generalization by Attribute-Oriented Induction
  4.5.1 Attribute-Oriented Induction for Data Characterization
  4.5.2 Efficient Implementation of Attribute-Oriented Induction
  4.5.3 Attribute-Oriented Induction for Class Comparisons
4.6 Summary
4.7 Exercises
4.8 Bibliographic Notes

Chapter 5 Data Cube Technology
5.1 Data Cube Computation: Preliminary Concepts
  5.1.1 Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Cube Shell
  5.1.2 General Strategies for Data Cube Computation
5.2 Data Cube Computation Methods
  5.2.1 Multiway Array Aggregation for Full Cube Computation
  5.2.2 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
  5.2.3 Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure
  5.2.4 Precomputing Shell Fragments for Fast High-Dimensional OLAP
5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology
  5.3.1 Sampling Cubes: OLAP-Based Mining on Sampling Data
  5.3.2 Ranking Cubes: Efficient Computation of Top-k Queries
5.4 Multidimensional Data Analysis in Cube Space
  5.4.1 Prediction Cubes: Prediction Mining in Cube Space
  5.4.2 Multifeature Cubes: Complex Aggregation at Multiple Granularities
  5.4.3 Exception-Based, Discovery-Driven Cube Space Exploration
5.5 Summary
5.6 Exercises
5.7 Bibliographic Notes

Chapter 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
6.1 Basic Concepts
  6.1.1 Market Basket Analysis: A Motivating Example
  6.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules
6.2 Frequent Itemset Mining Methods
  6.2.1 Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
  6.2.2 Generating Association Rules from Frequent Itemsets
  6.2.3 Improving the Efficiency of Apriori
  6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets
  6.2.5 Mining Frequent Itemsets Using Vertical Data Format
  6.2.6 Mining Closed and Max Patterns
6.3 Which Patterns Are Interesting? Pattern Evaluation Methods
  6.3.1 Strong Rules Are Not Necessarily Interesting
  6.3.2 From Association Analysis to Correlation Analysis
  6.3.3 A Comparison of Pattern Evaluation Measures
6.4 Summary
6.5 Exercises
6.6 Bibliographic Notes

Chapter 7 Advanced Pattern Mining
7.1 Pattern Mining: A Road Map
7.2 Pattern Mining in Multilevel, Multidimensional Space
  7.2.1 Mining Multilevel Associations
  7.2.2 Mining Multidimensional Associations
  7.2.3 Mining Quantitative Association Rules
  7.2.4 Mining Rare Patterns and Negative Patterns
7.3 Constraint-Based Frequent Pattern Mining
  7.3.1 Metarule-Guided Mining of Association Rules
  7.3.2 Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space
7.4 Mining High-Dimensional Data and Colossal Patterns
  7.4.1 Mining Colossal Patterns by Pattern-Fusion
7.5 Mining Compressed or Approximate Patterns
  7.5.1 Mining Compressed Patterns by Pattern Clustering
  7.5.2 Extracting Redundancy-Aware Top-k Patterns
7.6 Pattern Exploration and Application
  7.6.1 Semantic Annotation of Frequent Patterns
  7.6.2 Applications of Pattern Mining
7.7 Summary
7.8 Exercises
7.9 Bibliographic Notes

Chapter 8 Classification: Basic Concepts
8.1 Basic Concepts
  8.1.1 What Is Classification?
  8.1.2 General Approach to Classification
8.2 Decision Tree Induction
  8.2.1 Decision Tree Induction
  8.2.2 Attribute Selection Measures
  8.2.3 Tree Pruning
  8.2.4 Scalability and Decision Tree Induction
  8.2.5 Visual Mining for Decision Tree Induction
8.3 Bayes Classification Methods
  8.3.1 Bayes' Theorem
  8.3.2 Naive Bayesian Classification
8.4 Rule-Based Classification
  8.4.1 Using IF-THEN Rules for Classification
  8.4.2 Rule Extraction from a Decision Tree
  8.4.3 Rule Induction Using a Sequential Covering Algorithm
8.5 Model Evaluation and Selection
  8.5.1 Metrics for Evaluating Classifier Performance
  8.5.2 Holdout Method and Random Subsampling
  8.5.3 Cross-Validation
  8.5.4 Bootstrap
  8.5.5 Model Selection Using Statistical Tests of Significance
  8.5.6 Comparing Classifiers Based on Cost-Benefit and ROC Curves
8.6 Techniques to Improve Classification Accuracy
  8.6.1 Introducing Ensemble Methods
  8.6.2 Bagging
  8.6.3 Boosting and AdaBoost
  8.6.4 Random Forests
  8.6.5 Improving Classification Accuracy of Class-Imbalanced Data
8.7 Summary
8.8 Exercises
8.9 Bibliographic Notes

Chapter 9 Classification: Advanced Methods
9.1 Bayesian Belief Networks
  9.1.1 Concepts and Mechanisms
  9.1.2 Training Bayesian Belief Networks
9.2 Classification by Backpropagation
  9.2.1 A Multilayer Feed-Forward Neural Network
  9.2.2 Defining a Network Topology
  9.2.3 Backpropagation
  9.2.4 Inside the Black Box: Backpropagation and Interpretability
9.3 Support Vector Machines
  9.3.1 The Case When the Data Are Linearly Separable
  9.3.2 The Case When the Data Are Linearly Inseparable
9.4 Classification Using Frequent Patterns
  9.4.1 Associative Classification
  9.4.2 Discriminative Frequent Pattern-Based Classification
9.5 Lazy Learners (or Learning from Your Neighbors)
  9.5.1 k-Nearest-Neighbor Classifiers
  9.5.2 Case-Based Reasoning
9.6 Other Classification Methods
  9.6.1 Genetic Algorithms
  9.6.2 Rough Set Approach
  9.6.3 Fuzzy Set Approaches
9.7 Additional Topics Regarding Classification
  9.7.1 Multiclass Classification
  9.7.2 Semi-Supervised Classification
  9.7.3 Active Learning
  9.7.4 Transfer Learning
9.8 Summary
9.9 Exercises
9.10 Bibliographic Notes

Chapter 10 Cluster Analysis: Basic Concepts and Methods
10.1 Cluster Analysis
  10.1.1 What Is Cluster Analysis?
  10.1.2 Requirements for Cluster Analysis
  10.1.3 Overview of Basic Clustering Methods
10.2 Partitioning Methods
  10.2.1 k-means: A Centroid-Based Technique
  10.2.2 k-medoids: A Representative Object-Based Technique
10.3 Hierarchical Methods
  10.3.1 Agglomerative versus Divisive Hierarchical Clustering
  10.3.2 Distance Measures in Algorithmic Methods
  10.3.3 BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
  10.3.4 Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
  10.3.5 Probabilistic Hierarchical Clustering
10.4 Density-Based Methods
  10.4.1 DBSCAN: Density-Based Clustering Based on Connected Regions with High Density
  10.4.2 OPTICS: Ordering Points to Identify the Clustering Structure
  10.4.3 DENCLUE: Clustering Based on Density Distribution Functions
10.5 Grid-Based Methods
  10.5.1 STING: STatistical INformation Grid
  10.5.2 CLIQUE: An Apriori-like Subspace Clustering Method
10.6 Evaluation of Clustering
  10.6.1 Assessing Clustering Tendency
  10.6.2 Determining the Number of Clusters
  10.6.3 Measuring Clustering Quality
10.7 Summary
10.8 Exercises
10.9 Bibliographic Notes

Chapter 11 Advanced Cluster Analysis
11.1 Probabilistic Model-Based Clustering
  11.1.1 Fuzzy Clusters
  11.1.2 Probabilistic Model-Based Clusters
  11.1.3 Expectation-Maximization Algorithm
11.2 Clustering High-Dimensional Data
  11.2.1 Clustering High-Dimensional Data: Problems, Challenges, and Major Methodologies
  11.2.2 Subspace Clustering Methods
  11.2.3 Biclustering
  11.2.4 Dimensionality Reduction Methods and Spectral Clustering
11.3 Clustering Graph and Network Data
  11.3.1 Applications and Challenges
  11.3.2 Similarity Measures
  11.3.3 Graph Clustering Methods
11.4 Clustering with Constraints
  11.4.1 Categorization of Constraints
  11.4.2 Methods for Clustering with Constraints
11.5 Summary
11.6 Exercises
11.7 Bibliographic Notes

Chapter 12 Outlier Detection
12.1 Outliers and Outlier Analysis
  12.1.1 What Are Outliers?
  12.1.2 Types of Outliers
  12.1.3 Challenges of Outlier Detection
12.2 Outlier Detection Methods
  12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods
  12.2.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods
12.3 Statistical Approaches
  12.3.1 Parametric Methods
  12.3.2 Nonparametric Methods
12.4 Proximity-Based Approaches
  12.4.1 Distance-Based Outlier Detection and a Nested Loop Method
  12.4.2 A Grid-Based Method
  12.4.3 Density-Based Outlier Detection
12.5 Clustering-Based Approaches
12.6 Classification-Based Approaches
12.7 Mining Contextual and Collective Outliers
  12.7.1 Transforming Contextual Outlier Detection to Conventional Outlier Detection
  12.7.2 Modeling Normal Behavior with Respect to Contexts
  12.7.3 Mining Collective Outliers
12.8 Outlier Detection in High-Dimensional Data
  12.8.1 Extending Conventional Outlier Detection
  12.8.2 Finding Outliers in Subspaces
  12.8.3 Modeling High-Dimensional Outliers
12.9 Summary
12.10 Exercises
12.11 Bibliographic Notes

Chapter 13 Data Mining Trends and Research Frontiers
13.1 Mining Complex Data Types
  13.1.1 Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences
  13.1.2 Mining Graphs and Networks
  13.1.3 Mining Other Kinds of Data
13.2 Other Methodologies of Data Mining
  13.2.1 Statistical Data Mining
  13.2.2 Views on Data Mining Foundations
  13.2.3 Visual and Audio Data Mining
13.3 Data Mining Applications
  13.3.1 Data Mining for Financial Data Analysis
  13.3.2 Data Mining for Retail and Telecommunication Industries
  13.3.3 Data Mining in Science and Engineering
  13.3.4 Data Mining for Intrusion Detection and Prevention
  13.3.5 Data Mining and Recommender Systems
13.4 Data Mining and Society
  13.4.1 Ubiquitous and Invisible Data Mining
  13.4.2 Privacy, Security, and Social Impacts of Data Mining
13.5 Data Mining Trends
13.6 Summary
13.7 Exercises
13.8 Bibliographic Notes

Bibliography
Index
