Data mining techniques for actuaries: an overview

Similar documents
Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Supervised and Unsupervised Learning (II)

SOCIAL MEDIA MINING. Data Mining Essentials

Machine Learning Duncan Anderson Managing Director, Willis Towers Watson

Data Mining Clustering

Lecture 25: Review I

Unsupervised Learning and Clustering

Jarek Szlichta

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Machine Learning (BSMC-GA 4439) Wenke Liu

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Unsupervised Learning

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

Unsupervised Learning and Clustering

Overview and Practical Application of Machine Learning in Pricing

Machine Learning (BSMC-GA 4439) Wenke Liu

Network Traffic Measurements and Analysis

CSE 5243 INTRO. TO DATA MINING

Applying Supervised Learning

CSE 5243 INTRO. TO DATA MINING

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

Understanding Clustering Supervising the unsupervised

Lecture on Modeling Tools for Clustering & Regression

Supervised vs. Unsupervised Learning

ECG782: Multidimensional Digital Signal Processing

K-Means Clustering 3/3/17

Artificial Intelligence. Programming Styles

Clustering CS 550: Machine Learning

Table Of Contents: xix Foreword to Second Edition

Search Engines. Information Retrieval in Practice

Using Machine Learning to Optimize Storage Systems

Based on Raymond J. Mooney s slides

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Clustering and The Expectation-Maximization Algorithm

Variable Selection 6.783, Biomedical Decision Support

Random Forest A. Fornaser

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

Hierarchical Clustering

ECLT 5810 Clustering

Gene Clustering & Classification

Predict Outcomes and Reveal Relationships in Categorical Data

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

ECLT 5810 Clustering

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

CHAPTER 4: CLUSTER ANALYSIS

Network Traffic Measurements and Analysis

Road map. Basic concepts

1 Case study of SVM (Rob)

Unsupervised Learning

Business Intelligence. Tutorial for Performing Market Basket Analysis (with ItemCount)

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

CSE 158. Web Mining and Recommender Systems. Midterm recap

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

Machine Learning / Jan 27, 2010

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Data Mining and Knowledge Discovery: Practice Notes

Chapter 9. Classification and Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Chapter 3: Supervised Learning

Supervised vs unsupervised clustering

Lecture 9: Support Vector Machines

Hierarchical Clustering

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Slides for Data Mining by I. H. Witten and E. Frank

Clustering and Visualisation of Data

Semi-supervised Learning

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology

Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler

Clustering. Supervised vs. Unsupervised Learning

Contents. Preface to the Second Edition

Machine Learning - Clustering. CS102 Fall 2017

Kapitel 4: Clustering

Semi-Supervised Clustering with Partial Background Information

Data Mining and Knowledge Discovery: Practice Notes

Oracle9i Data Mining. Data Sheet August 2002

Chapter 6: Cluster Analysis

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Workload Characterization Techniques

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Hierarchical Clustering 4/5/17

The exam is closed book, closed notes except your one-page cheat sheet.

Unsupervised Learning and Data Mining

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Cluster Analysis for Microarray Data

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Chapter DM:II. II. Cluster Analysis

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Classification Algorithms in Data Mining

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

Transcription:

Data mining techniques for actuaries: an overview

Emiliano A. Valdez, joint work with Banghee So and Guojun Gan
University of Connecticut

Advances in Predictive Analytics (APA) Conference
University of Waterloo, Canada, 1-2 December 2017

Introduction: Data mining

Data mining refers to the computational process of exploring and analyzing enormous amounts of data to uncover hidden and useful information.

Information for what: to process and efficiently reduce data into a more summarized, analytical and interpretable representation.

The data mining process involves several steps: data acquisition, data preparation, preliminary exploration, data modeling and model evaluation.

What is it generally used for: to deliver predictive models applicable to new data.

Predictive models are becoming increasingly important for actuaries in all areas of insurance and financial security programs: life, health, pensions, property and casualty.

Introduction: Goals and advantages

Our goals and the advantages of data mining

Not surprisingly, data mining has been applied in several industries: social media, manufacturing, retail, banks and lenders, and government.

Our goal is to give an overview of data mining techniques and show how some of these techniques can be applied to actuarial problems. Previous work: Guo (2003).

Some advantages of data mining techniques (Fürnkranz, et al., 2012):
(i) Many data mining techniques can easily and efficiently handle huge amounts of data.
(ii) Data mining does not make assumptions about the probability distribution of the data.
(iii) It aims to find patterns that can help to construct and formulate suitable hypotheses.

Overview of data mining techniques

The scope of data mining techniques encompasses algorithms derived from machine learning, pattern recognition, statistics and database theory.

Types of machine learning:
- supervised learning: the process of approximating a functional relationship to predict an output variable using input data.
- unsupervised learning: the goal is to understand the underlying pattern, structure or distribution of the input data (there is no corresponding output variable).
- deep structured learning: a new and emerging development of algorithms for multiple layers or hierarchical representations of data.

Overview of data mining techniques: Data mining tasks

Types of data mining tasks

Generally there are four (4) main types we have identified as useful for our purposes: regression, classification, association rule learning, and clustering.

Labelled vs unlabelled data
- Labelled data has a specifically designated attribute, and the aim is to use the given data to predict the value of that attribute for new data. Regression and classification work with labelled data and are considered supervised learning.
- Unlabelled data, on the other hand, does not have such a designated attribute. Association rule learning and clustering work with unlabelled data and are considered unsupervised learning.

Supervised learning: Regression

Regression: widely popular, easy to interpret, and well understood.
Actuarial applications: claims prediction, pricing, risk classification, claims reserving.

Linear regression models and least squares

Given an output vector y and p input features with design matrix X, the regression function is assumed linear in the coefficient parameter β:

    E(y \mid X) = f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j

The least squares method minimizes the residual sum of squares RSS(\beta) = \|y - X\beta\|^2:

    \hat{\beta} = \arg\min_{\beta} \left\{ \|y - X\beta\|^2 \right\}
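As a small illustration of the least squares fit above (not part of the original slides), the sketch below solves for \hat{\beta} with NumPy on synthetic data; the variable names and the data itself are made up for the example.

```python
import numpy as np

# Synthetic data: n observations, p features (illustrative only)
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ beta_true + rng.normal(scale=0.3, size=n)

# Add an intercept column and solve beta_hat = argmin ||y - X beta||^2
X1 = np.column_stack([np.ones(n), X])
beta_hat, rss, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("estimated coefficients:", beta_hat)
```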

Supervised learning: Regularization or penalized least squares

Problems with least squares estimates

There are many situations in which the least squares estimator \hat{\beta} is not considered a very good estimator:
- a large number of parameters (especially problematic for big data)
- the predictor variables are highly correlated
- it produces unstable, highly variable estimates

One approach is to use shrinkage methods, or regularization, by introducing a penalty function. These methods also help to reduce the variability of the prediction error.

Supervised learning: Regularization or penalized least squares

Regularization or penalized least squares

The L_q penalty function:

    \hat{\beta} = \arg\min_{\beta} \left\{ \|y - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\},

where λ is the regularization or penalty parameter. Special cases include:
- LASSO (Least Absolute Shrinkage and Selection Operator): q = 1
- Ridge regression: q = 2

The interpretation is to penalize unreasonable values of β. A variation is the elastic net penalty:

    \lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right)
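A minimal sketch of ridge, LASSO and elastic-net fits, assuming scikit-learn is available; note that scikit-learn calls the penalty weight λ `alpha`, and its ElasticNet mixes the L1 and L2 penalties through `l1_ratio` rather than the α parameterization shown above. The data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
n, p = 100, 20                        # only a few features are truly informative
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
y = X @ beta + rng.normal(scale=0.5, size=n)

# The penalty weight (lambda above) is called `alpha` in scikit-learn.
models = {
    "ridge (q=2)": Ridge(alpha=1.0),
    "lasso (q=1)": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    nonzero = np.sum(np.abs(model.coef_) > 1e-8)
    print(f"{name}: {nonzero} non-zero coefficients")
```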

Supervised learning: Classification

Classification and various methods

Classification: the problem deals with an output or attribute that is categorical or nominal, for example, assigning the attribute to one of the classes {1, 2, ..., K}.
Actuarial applications: fraud detection, policy lapses/persistency, whether or not a claim occurs, risk classification.

Regression for a binary dependent variable:
- Logistic regression: Cox (1958); the linear predictors are linked to a logistic distribution.
- Probit regression: Bliss (1934); the linear predictors are linked to the normal distribution.

Classification using decision trees involves splitting on attribute values according to some specified criteria and then applies pruning. Pruning is the procedure of removing sections of the tree to refine the classification for predictive accuracy.
- CART algorithm: Breiman, et al. (1984); binary branches, with bottom-up pruning based on training data (a sketch follows below).
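The sketch below, assuming scikit-learn, grows a binary classification tree in the spirit of CART and prunes it via cost-complexity pruning (the `ccp_alpha` argument); the synthetic "claim / no claim" style data is illustrative only, not an actuarial dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a binary tree, then prune with cost-complexity pruning;
# a larger ccp_alpha removes more branches.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```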

Supervised learning: Classification - continued

Some methods are based on Bayes' theorem:

    \Pr(G = k \mid X = x) = \frac{f_k(x)\, \pi_k}{\sum_{j=1}^{K} f_j(x)\, \pi_j},

where f_j(x) is the class-conditional density of X given G = j and π_j is the prior probability of being in class j, for j = 1, ..., K.
- Linear discriminant analysis (LDA): assumes Gaussian class-conditional densities.
- Naive Bayes: assumes very strong independence, so that each class density is a product of marginal densities.

Support vector classifier: the aim is to find a hyperplane that can separate the classes (linearly separable data).

K-nearest neighbors: classification based on some distance measure of nearness (e.g. Euclidean).
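A short sketch, again assuming scikit-learn, comparing three of the classifiers above (LDA, naive Bayes and K-nearest neighbors) on synthetic data; the dataset and the choice of 5-fold cross-validation are illustrative assumptions, not part of the slides.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class data (illustrative only)
X, y = make_classification(n_samples=400, n_features=6, n_classes=3,
                           n_informative=4, random_state=0)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "naive Bayes": GaussianNB(),
    "3-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```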

Unsupervised learning: Association rule learning

Association rule learning (Agrawal, et al., 1993) is a method to find meaningful relations among the incidence of events.

Most useful for marketing products, but it has large potential in actuarial applications, e.g. fraud detection, policy lapse analysis.

Start with a set of attributes A = {A_1, A_2, ..., A_n} for which we can observe the occurrence:
- Our dataset can be written as a set of T observations D = {d_1, d_2, ..., d_T}, where d_t ⊆ A for t = 1, ..., T.
- Next define i_t = (1_{A_1 \in d_t}, ..., 1_{A_n \in d_t}), so that I = {i_1, i_2, ..., i_T} contains the same information as D in the form of a data frame. The dataset is then stored and handled in the form of I (see the sketch below).
- If X ⊆ A, then we call X an itemset.
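A minimal sketch (not from the slides) of the indicator encoding i_t described above, using a toy grocery-style attribute set:

```python
# Attributes A and transactions d_t (toy data, illustrative only)
attributes = ["ground beef", "onions", "beer", "diaper", "baby food"]
transactions = [
    {"ground beef", "onions", "beer"},
    {"diaper", "baby food"},
    {"ground beef", "beer"},
]

# i_t = (1_{A_1 in d_t}, ..., 1_{A_n in d_t})
indicator = [[int(a in d) for a in attributes] for d in transactions]
for row in indicator:
    print(row)
```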

Association rule - continued

An association rule is an if/then statement of the form X ⇒ Y, for itemsets X and Y. We call X the precedent and Y the consequent. For example, in a basket of groceries, {ground beef, onions} ⇒ {beer}.

Types of association rule:
- single dimension: buys{diaper} ⇒ buys{baby foods}
- multiple dimensions: location (< 1), age (< 22) ⇒ buys{auto}
- hybrid dimension: career (< 1), buys{car} ⇒ buys{insurance}

Association rule: Measurement concepts

Support of an itemset X is the proportion of observations containing X in the whole dataset:

    \mathrm{supp}(X) = \frac{|\{d_t \in D : X \subseteq d_t\}|}{T}

It can be interpreted as the frequency with which you observe X in the dataset.

Confidence indicates how often the rule between precedent and consequent is found among observations containing the precedent:

    \mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} = \frac{|\{d_t \in D : X \subseteq d_t,\ Y \subseteq d_t\}|}{|\{d_t \in D : X \subseteq d_t\}|}

It can be interpreted as the probability that a transaction containing X also contains Y.

Association rule: Measurement concepts - continued

Lift is the ratio of the observed support to that expected if X and Y were independent:

    \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}

If X and Y are independent, no meaningful association rule can be drawn; the lift measures the degree of dependence.

Conviction is the ratio of how often the rule would make an incorrect prediction if X and Y were independent to the observed frequency of incorrect predictions:

    \mathrm{conv}(X \Rightarrow Y) = \frac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)}

To illustrate, a conviction value of 1.35 indicates that the rule would be incorrect 35% more often if the association between X and Y were purely random.
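The measures above translate directly into code. The sketch below computes support, confidence, lift and conviction from a small toy transaction list; the groceries and the resulting numbers are illustrative only.

```python
# Toy transaction dataset D (illustrative only)
transactions = [
    {"ground beef", "onions", "beer"},
    {"ground beef", "onions"},
    {"ground beef", "beer"},
    {"onions", "beer"},
    {"ground beef", "onions", "beer", "chips"},
]
T = len(transactions)

def supp(itemset):
    """Proportion of transactions containing every item in `itemset`."""
    return sum(itemset <= d for d in transactions) / T

def conf(X, Y):
    return supp(X | Y) / supp(X)

def lift(X, Y):
    return supp(X | Y) / (supp(X) * supp(Y))

def conviction(X, Y):
    c = conf(X, Y)
    return float("inf") if c == 1 else (1 - supp(Y)) / (1 - c)

X, Y = {"ground beef", "onions"}, {"beer"}
print("support:   ", supp(X | Y))
print("confidence:", conf(X, Y))
print("lift:      ", lift(X, Y))
print("conviction:", conviction(X, Y))
```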

Unsupervised learning: Data clustering

Data clustering, a form of unsupervised learning, is the process of dividing a group of data points into homogeneous groups, or clusters, such that data points in the same cluster share similar features while data points from different clusters are dissimilar. Gan, et al. (2007), Gan (2011).

[Scatter plot of a clustered dataset omitted.]

Clustering has many applications in big data analytics: astronomy, medical science, marketing. It has large potential in actuarial applications: general insurance (see Frees, et al. (2016), Ch. 6), valuation of large portfolios of variable annuities (VAs), mortality deviation for claims monitoring.

Many clustering algorithms have emerged in the last few decades.

Data clustering: Clustering algorithms

Some common clustering methods

Most clustering algorithms are formulated as an optimization problem. The well-known K-means algorithm performs clustering by minimizing the objective function

    P(U, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il} \, \|x_i - z_l\|^2,

where {x_1, ..., x_n} is the dataset, k is the desired number of clusters, U = (u_{il})_{n \times k} is an n × k partition matrix, Z = {z_1, ..., z_k} is a set of cluster centers, and \|\cdot\| is the L_2 norm or Euclidean distance.

This results in an iterative process that partitions the dataset into the pre-specified number of clusters k. At each iteration, each data point is assigned to its closest centroid (in Euclidean distance); a minimal sketch follows below.
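A minimal sketch of the iterative K-means procedure just described, written with NumPy on synthetic two-dimensional data; it omits refinements such as multiple restarts and empty-cluster handling.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest center (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == l].mean(axis=0) for l in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated groups (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print("cluster centers:\n", centers)
```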

Data clustering: Clustering algorithms - continued

K-medoids is a variation proposed by Kaufman and Rousseeuw (1987). In contrast to using the mean as the center of a cluster, K-medoids uses an actual data point in the cluster. This avoids the sensitivity of the K-means algorithm to extreme data points or outliers.

The algorithm is also an iterative optimization process (steps quoted directly from Hastie, et al. (2009)):
- For a given cluster assignment C, find the observation in the cluster that minimizes the total distance to other points in that cluster. This gives us the current estimates of the cluster centers.
- Given a current set of cluster centers, minimize the total error by assigning each observation to the closest (current) cluster center.
- Repeat the iterative process until the assignments do not change.
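A rough sketch of the two alternating steps quoted above, assuming a full pairwise Euclidean distance matrix fits in memory; it illustrates the idea rather than the PAM algorithm of Kaufman and Rousseeuw, and it does not guard against empty clusters.

```python
import numpy as np

def kmedoids(X, k, n_iter=50, seed=0):
    """Minimal sketch of the alternating K-medoids iteration described above."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each observation to its closest current medoid
        labels = D[:, medoids].argmin(axis=1)
        # Within each cluster, pick the point minimizing total distance to the rest
        new_medoids = medoids.copy()
        for l in range(k):
            members = np.where(labels == l)[0]
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[l] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Toy data (illustrative only)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels, medoids = kmedoids(X, k=2)
print("medoid indices:", medoids)
```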

Data clustering: Clustering algorithms - continued

Hierarchical clustering, as the name implies, has the goal of building a hierarchy of clusters. This requires the user to measure dissimilarity between groups of observations.
- The linkage criterion determines the measure of dissimilarity used: complete (max), single (min), average (mean), centroid (many variations).
- There is no need to pre-specify the number of clusters K.
- Two paradigms: agglomerative (bottom-up) clustering and divisive (top-down) clustering.

[Figure omitted: cluster dendrogram of a sample dataset, produced by hclust with complete linkage on dist(samp.data).]

The dendrogram is used to visualize the resulting clusters.
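The dendrogram in the slides appears to come from R's hclust with complete linkage; the sketch below reproduces the same idea with SciPy on synthetic data (an assumption, not the authors' code).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data: two groups of points (illustrative only)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])

# Agglomerative clustering with complete (maximum) linkage
Z = linkage(X, method="complete")

# Cut the tree into 2 clusters and draw the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")
print("cluster labels:", labels)
dendrogram(Z)
plt.show()
```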

Concluding remarks

Our goal is to survey data mining tasks associated with supervised and unsupervised learning, and to explore their potential use in actuarial problems for predictive analytics. We have only accomplished this mid-way in this presentation.

We still have lots of work to do:
- Demonstrate their usefulness for actuarial problems: we have begun exploring actuarial data to be used for each data mining task.
- Show how the models can be used for predictive analytics and assess their performance, especially when compared to classical methods.
- Provide a short survey of the emerging development of a new class of machine learning algorithms called deep learning.

Concluding remarks: Acknowledgment

We thank the Society of Actuaries for its financial support through our CAE grant on data mining.