Contents. Preface to the Second Edition

Preface to the Second Edition

1 Introduction
    What Is Data Mining?
    Motivating Challenges
    The Origins of Data Mining
    Data Mining Tasks
    Scope and Organization of the Book
    Bibliographic Notes
    Exercises

2 Data
    Types of Data
    Attributes and Measurement
    Types of Data Sets
    Data Quality
    Measurement and Data Collection Issues
    Issues Related to Applications
    Data Preprocessing
    Aggregation
    Sampling
    Dimensionality Reduction
    Feature Subset Selection
    Feature Creation
    Discretization and Binarization
    Variable Transformation
    Measures of Similarity and Dissimilarity
    Basics
    Similarity and Dissimilarity between Simple Attributes
    Dissimilarities between Data Objects
    Similarities between Data Objects
    Examples of Proximity Measures
    Mutual Information
    Kernel Functions*
    Bregman Divergence*
    Issues in Proximity Calculation
    Selecting the Right Proximity Measure
    Bibliographic Notes
    Exercises

3 Classification: Basic Concepts and Techniques
    Basic Concepts
    General Framework for Classification
    Decision Tree Classifier
    A Basic Algorithm to Build a Decision Tree
    Methods for Expressing Attribute Test Conditions
    Measures for Selecting an Attribute Test Condition
    Algorithm for Decision Tree Induction
    Example Application: Web Robot Detection
    Characteristics of Decision Tree Classifiers
    Model Overfitting
    Reasons for Model Overfitting
    Model Selection
    Using a Validation Set
    Incorporating Model Complexity
    Estimating Statistical Bounds
    Model Selection for Decision Trees
    Model Evaluation
    Holdout Method
    Cross-Validation
    Presence of Hyper-parameters
    Hyper-parameter Selection
    Nested Cross-Validation
    Pitfalls of Model Selection and Evaluation
    Overlap between Training and Test Sets
    Use of Validation Error as Generalization Error
    Model Comparison
    Estimating the Confidence Interval for Accuracy
    Comparing the Performance of Two Models
    Bibliographic Notes
    Exercises

4 Classification: Alternative Techniques
    Types of Classifiers
    Rule-Based Classifier
    How a Rule-Based Classifier Works
    Properties of a Rule Set
    Direct Methods for Rule Extraction
    Indirect Methods for Rule Extraction
    Characteristics of Rule-Based Classifiers
    Nearest Neighbor Classifiers
    Algorithm
    Characteristics of Nearest Neighbor Classifiers
    Naïve Bayes Classifier
    Basics of Probability Theory
    Naïve Bayes Assumption
    Bayesian Networks
    Graphical Representation
    Inference and Learning
    Characteristics of Bayesian Networks
    Logistic Regression
    Logistic Regression as a Generalized Linear Model
    Learning Model Parameters
    Characteristics of Logistic Regression
    Artificial Neural Network (ANN)
    Perceptron
    Multi-layer Neural Network
    Characteristics of ANN
    Deep Learning
    Using Synergistic Loss Functions
    Using Responsive Activation Functions
    Regularization
    Initialization of Model Parameters
    Characteristics of Deep Learning
    Support Vector Machine (SVM)
    Margin of a Separating Hyperplane
    Linear SVM
    Soft-margin SVM
    Nonlinear SVM
    Characteristics of SVM
    Ensemble Methods
    Rationale for Ensemble Method
    Methods for Constructing an Ensemble Classifier
    Bias-Variance Decomposition
    Bagging
    Boosting
    Random Forests
    Empirical Comparison among Ensemble Methods
    Class Imbalance Problem
    Building Classifiers with Class Imbalance
    Evaluating Performance with Class Imbalance
    Finding an Optimal Score Threshold
    Aggregate Evaluation of Performance
    Multiclass Problem
    Bibliographic Notes
    Exercises

5 Association Analysis: Basic Concepts and Algorithms
    Preliminaries
    Frequent Itemset Generation
    The Apriori Principle
    Frequent Itemset Generation in the Apriori Algorithm
    Candidate Generation and Pruning
    Support Counting
    Computational Complexity
    Rule Generation
    Confidence-Based Pruning
    Rule Generation in Apriori Algorithm
    An Example: Congressional Voting Records
    Compact Representation of Frequent Itemsets
    Maximal Frequent Itemsets
    Closed Itemsets
    Alternative Methods for Generating Frequent Itemsets*
    FP-Growth Algorithm*
    FP-Tree Representation
    Frequent Itemset Generation in FP-Growth Algorithm
    Evaluation of Association Patterns
    Objective Measures of Interestingness
    Measures beyond Pairs of Binary Variables
    Simpson's Paradox
    Effect of Skewed Support Distribution
    Bibliographic Notes
    Exercises

6 Association Analysis: Advanced Concepts
    Handling Categorical Attributes
    Handling Continuous Attributes
    Discretization-Based Methods
    Statistics-Based Methods
    Non-discretization Methods
    Handling a Concept Hierarchy
    Sequential Patterns
    Preliminaries
    Sequential Pattern Discovery
    Timing Constraints
    Alternative Counting Schemes
    Subgraph Patterns
    Preliminaries
    Frequent Subgraph Mining
    Candidate Generation
    Candidate Pruning
    Support Counting
    Infrequent Patterns
    Negative Patterns
    Negatively Correlated Patterns
    Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
    Techniques for Mining Interesting Infrequent Patterns
    Techniques Based on Mining Negative Patterns
    Techniques Based on Support Expectation
    Bibliographic Notes
    Exercises

7 Cluster Analysis: Basic Concepts and Algorithms
    Overview
    What Is Cluster Analysis?
    Different Types of Clusterings
    Different Types of Clusters
    K-means
    The Basic K-means Algorithm
    K-means: Additional Issues
    Bisecting K-means
    K-means and Different Types of Clusters
    Strengths and Weaknesses
    K-means as an Optimization Problem
    Agglomerative Hierarchical Clustering
    Basic Agglomerative Hierarchical Clustering Algorithm
    Specific Techniques
    The Lance-Williams Formula for Cluster Proximity
    Key Issues in Hierarchical Clustering
    Outliers
    Strengths and Weaknesses
    DBSCAN
    Traditional Density: Center-Based Approach
    The DBSCAN Algorithm
    Strengths and Weaknesses
    Cluster Evaluation
    Overview
    Unsupervised Cluster Evaluation Using Cohesion and Separation
    Unsupervised Cluster Evaluation Using the Proximity Matrix
    Unsupervised Evaluation of Hierarchical Clustering
    Determining the Correct Number of Clusters
    Clustering Tendency
    Supervised Measures of Cluster Validity
    Assessing the Significance of Cluster Validity Measures
    Choosing a Cluster Validity Measure
    Bibliographic Notes
    Exercises

8 Cluster Analysis: Additional Issues and Algorithms
    Characteristics of Data, Clusters, and Clustering Algorithms
    Example: Comparing K-means and DBSCAN
    Data Characteristics
    Cluster Characteristics
    General Characteristics of Clustering Algorithms
    Prototype-Based Clustering
    Fuzzy Clustering
    Clustering Using Mixture Models
    Self-Organizing Maps (SOM)
    Density-Based Clustering
    Grid-Based Clustering
    Subspace Clustering
    DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
    Graph-Based Clustering
    Sparsification
    Minimum Spanning Tree (MST) Clustering
    OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
    Chameleon: Hierarchical Clustering with Dynamic Modeling
    Spectral Clustering
    Shared Nearest Neighbor Similarity
    The Jarvis-Patrick Clustering Algorithm
    SNN Density
    SNN Density-Based Clustering
    Scalable Clustering Algorithms
    Scalability: General Issues and Approaches
    BIRCH
    CURE
    Which Clustering Algorithm?
    Bibliographic Notes
    Exercises

9 Anomaly Detection
    Characteristics of Anomaly Detection Problems
    A Definition of an Anomaly
    Nature of Data
    How Anomaly Detection is Used
    Characteristics of Anomaly Detection Methods
    Statistical Approaches
    Using Parametric Models
    Using Non-parametric Models
    Modeling Normal and Anomalous Classes
    Assessing Statistical Significance
    Strengths and Weaknesses
    Proximity-based Approaches
    Distance-based Anomaly Score
    Density-based Anomaly Score
    Relative Density-based Anomaly Score
    Strengths and Weaknesses
    Clustering-based Approaches
    Finding Anomalous Clusters
    Finding Anomalous Instances
    Strengths and Weaknesses
    Reconstruction-based Approaches
    Strengths and Weaknesses
    One-class Classification
    Use of Kernels
    The Origin Trick
    Strengths and Weaknesses
    Information Theoretic Approaches
    Strengths and Weaknesses
    Evaluation of Anomaly Detection
    Bibliographic Notes
    Exercises

10 Avoiding False Discoveries
    Preliminaries: Statistical Testing
    Significance Testing
    Hypothesis Testing
    Multiple Hypothesis Testing
    Pitfalls in Statistical Testing
    Modeling Null and Alternative Distributions
    Generating Synthetic Data Sets
    Randomizing Class Labels
    Resampling Instances
    Modeling the Distribution of the Test Statistic
    Statistical Testing for Classification
    Evaluating Classification Performance
    Binary Classification as Multiple Hypothesis Testing
    Multiple Hypothesis Testing in Model Selection
    Statistical Testing for Association Analysis
    Using Statistical Models
    Using Randomization Methods
    Statistical Testing for Cluster Analysis
    Generating a Null Distribution for Internal Indices
    Generating a Null Distribution for External Indices
    Enrichment
    Statistical Testing for Anomaly Detection
    Bibliographic Notes
    Exercises

Author Index
Subject Index
Copyright Permissions