GLM II. Basic Modeling Strategy CAS Ratemaking and Product Management Seminar by Paul Bailey. March 10, 2015

1 GLM II Basic Modeling Strategy 2015 CAS Ratemaking and Product Management Seminar by Paul Bailey March 10, 2015

2 Building predictive models is a multi-step process
Process steps:
- Set project goals and review background
- Gather and prepare data
- Explore data
- Build component predictive models
- Validate component models
- Combine component models
- Incorporate constraints
Ernesto walked us through the first three steps. We will now go through an example of the remaining steps:
- Building component predictive models: we will illustrate how to build a frequency model
- Validating component models: we will illustrate how to validate your component model
- We will also briefly discuss combining models and incorporating implementation constraints
The goal should be to build the best predictive models now and incorporate constraints later.

3 Building component predictive models can be separated into two steps
Initial modeling:
- Selecting error structure and link function
- Building a simple initial model
- Testing basic modeling assumptions and methodology
Iterative modeling:
- Refining your initial models through a series of iterative steps: complicating the model, then simplifying the model, then repeating

4 Initial modeling
Initial modeling is done to test basic modeling methodology:
- Is my link function appropriate?
- Is my error structure appropriate?
- Is my overall modeling methodology appropriate (e.g., do I need to cap losses? Exclude expense-only claims? Model by peril?)

5 Examples of error structures
Error functions reflect the variability of the underlying process and can be any distribution within the exponential family, for example:
- Gamma: consistent with severity modeling; may want to try Inverse Gaussian
- Tweedie: consistent with pure premium modeling
- Poisson: consistent with frequency modeling
- Normal: useful for a variety of applications

6 Generally accepted error structure and link functions
Use generally accepted standards as a starting point for link functions and error structures.

Observed Response   Most Appropriate Link Function   Most Appropriate Error Structure   Variance Function
--                  --                               Normal                             µ^0
Claim Frequency     Log                              Poisson                            µ^1
Claim Severity      Log                              Gamma                              µ^2
Claim Severity      Log                              Inverse Gaussian                   µ^3
Pure Premium        Log                              Gamma or Tweedie                   µ^T
Retention Rate      Logit                            Binomial                           µ(1-µ)
Conversion Rate     Logit                            Binomial                           µ(1-µ)
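The variance functions in the table above can be written out directly. The following is a minimal sketch in plain Python, not part of the presentation; the function names are illustrative assumptions, and the power p is taken from the table (0 normal, 1 Poisson, 2 gamma, 3 inverse Gaussian).

```python
import math


def power_variance(mu, p):
    """Variance function V(mu) = mu ** p from the table's last column.

    p = 0 normal, 1 Poisson, 2 gamma, 3 inverse Gaussian.
    """
    return mu ** p


def binomial_variance(mu):
    """Variance function mu * (1 - mu) used with the logit link."""
    return mu * (1.0 - mu)


def log_link(mu):
    """Log link: maps a positive mean to the linear predictor."""
    return math.log(mu)


def logit_link(mu):
    """Logit link: maps a rate in (0, 1) to the linear predictor."""
    return math.log(mu / (1.0 - mu))


def inverse_logit(eta):
    """Inverse of the logit link: maps the linear predictor back to a rate."""
    return 1.0 / (1.0 + math.exp(-eta))
```

With a log link the fitted mean is the exponential of the linear predictor, which is why the fitted effects combine multiplicatively as relativities.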

7 Build an initial model
Reasonable starting points for model structure:
- Prior model
- Stepwise regression
- General insurance knowledge
- CART (Classification and Regression Trees) or similar algorithms

8 Test model assumptions
A plot of all residuals tests the selected error structure/link function.
Figure panels:
- Studentized Standardized Deviance Residuals vs. Fitted Value
- Normal Error Structure/Log Link (Studentized Standardized Deviance Residuals) vs. Transformed Fitted Value
- Crunched Residuals (Group Size: 72) vs. Fitted Value
Annotations:
- Elliptical pattern is ideal
- Two concentrations suggest two perils: split or use joint modeling
- Asymmetrical appearance suggests the power of the variance function is too low
- Use crunched residuals for frequency

9 Example: initial frequency model
[Chart: Gender Relativity]
Link function: Log
Error structure: Poisson
Initial variables selected based on industry knowledge:
- Gender
- Driver age
- Vehicle value
- Area (territory)
Variables NOT in initial model:
- Vehicle body
- Vehicle age

10 Example: initial frequency model
[Chart: Driver Age Relativity]
Link function: Log
Error structure: Poisson
Initial variables selected based on industry knowledge:
- Gender
- Driver age
- Vehicle value
- Area (territory)
Variables NOT in initial model:
- Vehicle body
- Vehicle age

11 Example: initial frequency model
[Chart: Vehicle Value Relativity]
Link function: Log
Error structure: Poisson
Initial variables selected based on industry knowledge:
- Gender
- Driver age
- Vehicle value
- Area (territory)
Variables NOT in initial model:
- Vehicle body
- Vehicle age

12 Example: initial frequency model
[Chart: Area Relativity]
Link function: Log
Error structure: Poisson
Initial variables selected based on industry knowledge:
- Gender
- Driver age
- Vehicle value
- Area (territory)
Variables NOT in initial model:
- Vehicle body
- Vehicle age
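To make the log link/Poisson structure concrete, here is a minimal sketch for just two rating factors. It is not the presenter's code or software: it swaps in the multiplicative marginal-totals (minimum bias) iteration, whose fixed point satisfies the same likelihood equations as a Poisson GLM with log link and exposure offset, so that no third-party library is needed. The exposure and claim figures are made up for illustration.

```python
def fit_two_factor_poisson(exposure, claims, n_iter=200):
    """Fit claims[i][j] ~ exposure[i][j] * row[i] * col[j].

    Alternately rescales the row and column factors so that fitted claim
    totals match actual totals in each margin. At convergence this is the
    Poisson / log-link solution for a two-factor main-effects model.
    """
    n_rows, n_cols = len(exposure), len(exposure[0])
    row = [1.0] * n_rows
    col = [1.0] * n_cols
    for _ in range(n_iter):
        for i in range(n_rows):
            row[i] = sum(claims[i]) / sum(
                exposure[i][j] * col[j] for j in range(n_cols))
        for j in range(n_cols):
            col[j] = sum(claims[i][j] for i in range(n_rows)) / sum(
                exposure[i][j] * row[i] for i in range(n_rows))
    return row, col


# Hypothetical data: rows are two gender levels, columns are three areas.
exposure = [[1000.0, 800.0, 600.0], [900.0, 700.0, 500.0]]
claims = [[100.0, 96.0, 90.0], [108.0, 98.0, 88.0]]

row, col = fit_two_factor_poisson(exposure, claims)
fitted = [[exposure[i][j] * row[i] * col[j] for j in range(3)]
          for i in range(2)]

# Relativities expressed against the first level of each factor.
gender_relativity = [r / row[0] for r in row]
area_relativity = [c / col[0] for c in col]
```

A full model with four variables would normally be fitted with GLM software; the sketch only shows why the fitted effects are multiplicative relativities.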

13 Example: initial frequency model - residuals
Frequency residuals are hard to interpret without crunching.
Two clusters:
- Data points with claims
- Data points without claims

14 Example: initial frequency model - residuals
- Order observations from smallest to largest predicted value
- Group residuals into 500 buckets
- The graph plots the average residual in the bucket
Crunched residuals look good!
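The crunching steps above can be sketched as follows. This is an illustrative plain-Python version, not the presenter's tool; the function name, the equal-count bucketing rule, and the tiny demo data are assumptions.

```python
def crunch_residuals(fitted, residuals, n_buckets):
    """Average residuals within buckets of observations ordered by fitted value.

    Returns a list of (mean fitted value, mean residual) pairs, one per
    non-empty bucket, ordered from smallest to largest fitted value.
    """
    order = sorted(range(len(fitted)), key=lambda k: fitted[k])
    n = len(order)
    crunched = []
    for b in range(n_buckets):
        idx = order[b * n // n_buckets:(b + 1) * n // n_buckets]
        if not idx:
            continue  # more buckets than observations
        mean_fitted = sum(fitted[k] for k in idx) / len(idx)
        mean_residual = sum(residuals[k] for k in idx) / len(idx)
        crunched.append((mean_fitted, mean_residual))
    return crunched


# Tiny made-up example with 2 buckets (the slide uses 500 on real data).
demo = crunch_residuals([0.3, 0.1, 0.4, 0.2], [1.0, -1.0, 3.0, -3.0], 2)
```

Plotting the second element of each pair against the first gives the crunched residual plot; averaging within buckets removes the two clusters caused by records with and without claims.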

15 Building component predictive models can be separated into two steps
Initial modeling:
- Selecting error structure and link function
- Building a simple initial model
- Testing basic modeling assumptions and methodology
Iterative modeling:
- Refining your initial models through a series of iterative steps: complicating the model, then simplifying the model, then repeating

16 Iterative Modeling
Initial models are refined using an iterative modeling approach.
Iterative modeling involves many decisions to complicate and simplify the models.
Your modeling toolbox can help you make these decisions. We will discuss your tools shortly.
[Diagram: Review Model cycle]
- Simplify: exclude, group, curves
- Complicate: include, interactions

17 Ideal Model Structure
To produce a sensible model that explains recent historical experience and is likely to be predictive of future experience.
[Diagram: Model Complexity (number of parameters), running from the overall mean to one parameter per observation, with "Best" marked in between]
- Underfit: predictive, poor explanatory power
- Overfit: poor predictive power, explains history

18 Your modeling toolbox
Model decisions include:
- Simplification: excluding variables, grouping levels, fitting curves
- Complication: including variables, adding interactions
Your modeling toolbox will help you make these decisions. Your tools include:
- Judgment (e.g., do the trends make sense?)
- Balance tests (i.e., actual vs. expected tests)
- Parameters/standard errors
- Consistency of patterns over time or random data sets
- Type III statistical tests (e.g., chi-square tests, F-tests)

19 Modeling toolbox: judgment
[Chart: Modeled Frequency Relativity - Vehicle Value]
The modeler should also ask: does this pattern make sense?
Patterns may often be counterintuitive, but become reasonable after investigation.
Uses:
- Inclusion/exclusion
- Grouping
- Fitting curves
- Assessing interactions

20 Modeling toolbox: balance test
[Chart: Actual vs. Expected Frequency - Vehicle Age]
- A balance test is essentially an actual vs. expected comparison
- It can identify variables that are not in the model and across which the model is not in balance
- An imbalance indicates the variable may be explaining something not in the model
Uses:
- Inclusion
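A balance test of this kind can be sketched as a simple aggregation by level of the candidate variable. The code below is an illustrative assumption, not the presenter's implementation; the level names and claim figures are made up.

```python
from collections import defaultdict


def actual_vs_expected(levels, actual_claims, expected_claims):
    """Sum actual and model-expected claims by level of a candidate variable.

    Returns {level: (actual, expected, actual / expected)}. Ratios far from
    1.0 suggest the variable explains something the model does not.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for level, actual, expected in zip(levels, actual_claims, expected_claims):
        totals[level][0] += actual
        totals[level][1] += expected
    return {level: (a, e, a / e) for level, (a, e) in totals.items()}


# Made-up records: vehicle age band, observed claims, model-expected claims.
balance = actual_vs_expected(
    ["new", "new", "old", "old"],
    [3.0, 1.0, 2.0, 6.0],
    [2.5, 1.5, 3.0, 3.0],
)
```

In this toy data the "new" level is in balance while "old" shows more actual than expected claims, the kind of pattern that would prompt testing the variable for inclusion.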

21 Modeling toolbox: parameters/standard errors
[Chart: Modeled Frequency Relativities With Standard Errors - Vehicle Body]
Parameters and standard errors provide confidence in the pattern exhibited by the data.
Uses:
- Horizontal line test for exclusion
- Plateaus for grouping
- A measure of credibility

22 Modeling toolbox: consistency of patterns
[Chart: Modeled Frequency Relativity - Age Category]
Checking for consistency of patterns over time or across random parts of a data set is a good practical test.
Uses:
- Validating modeling decisions
- Including/excluding factors
- Grouping levels
- Fitting curves
- Adding interactions

23 Modeling toolbox: type III tests
A chi-square test and/or F-test is a good statistical test to compare nested models:
- H0: the two models are essentially the same
- H1: the two models are not the same
Principle of parsimony: if two models are the same, choose the simpler model.
Uses:
- Inclusion/exclusion

Chi-Square Percentage   Meaning      Action*
<5%                     Reject H0    Use more complex model
5%-15%                  Grey area    ???
15%-30%                 Grey area    ???
>30%                    Accept H0    Use simpler model
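The comparison of nested models can be sketched as a chi-square test on the change in deviance. This is a hedged illustration, not the presenter's software: it assumes the deviance difference (divided by a scale parameter, 1.0 for Poisson) is approximately chi-square with degrees of freedom equal to the number of extra parameters, and it uses closed-form series for integer degrees of freedom so that no statistics library is needed. The thresholds in `suggested_action` are the ones from the table above; the function names are illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def nested_model_p_value(deviance_simple, deviance_complex,
                         extra_parameters, scale=1.0):
    """P-value for H0: the simpler model is as good as the complex one."""
    statistic = (deviance_simple - deviance_complex) / scale
    return chi2_sf(statistic, extra_parameters)


def suggested_action(p_value):
    """Map a p-value to the action bands shown in the table above."""
    if p_value < 0.05:
        return "use more complex model"
    if p_value > 0.30:
        return "use simpler model"
    return "grey area"
```

For example, `suggested_action(0.013)` falls in the "reject H0" band, while a p-value of 0.974 points to the simpler model.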

24 Example: frequency model iteration 1 - simplification
Modeling decision: grouping Age Category and Area
Tools used: judgment, parameter estimates/standard deviations, type III test
[Charts: Age Category Relativity; Area Relativity]
Chi Sq P Val = 97.4%; Chi Sq P Val = 99.9%

25 Example: frequency model iteration 1 - simplification
Modeling decision: fitting a curve to vehicle value
Tools used: judgment, type III test, consistency test
[Charts: Vehicle Value Relativity - Initial Model; Vehicle Value Relativity - Curve Fit]
Chi Sq P Val = 100.0%

26 Example: frequency model iteration 2 - complication
Modeling decision: adding vehicle body type
Tools used: balance test, parameter estimates/standard deviations, type III test
[Chart: Balance Test: Actual vs. Expected Across Vehicle Body Type - Vehicle Body Type Not In Model]
[Chart: Vehicle Body Type Relativities - Vehicle Body Type Included in Model]
Chi Sq P Val = 1.3%

27 Example: iterative modeling continued
- Iteration 3 - simplification: group vehicle body type
- Iteration 4 - complication: add vehicle age
- Iteration 5 - simplification: group vehicle age levels

28 Example: frequency model iteration 6 - complication
Action: adding age x gender interaction
Tools used: balance test, type III test, consistency test, judgment
[Chart: Balance Test: Two Way Actual vs. Expected Across Age x Gender (M, F) - Age x Gender Interaction NOT in model]
[Chart: Vehicle Body Type Relativities - Vehicle Body Type Included in Model]
Chi Sq P Val = 47.5%

29 Predictive models must be validated to have confidence in the predictive power of the models
Model validation techniques include:
- Examining residuals
- Examining gains curves
- Examining hold out samples
  - Changes in parameter estimates
  - Actual vs. expected on hold out sample
Component models and the combined risk premium model should be validated.

30 Model validation: residual analysis
Recheck residuals to ensure appropriate shape.
[Chart: Studentized Standardized Deviance Residuals by Policyholder Age]
- Crunched residuals are symmetric
- For severity: does the box-whisker plot show symmetry across levels?

31 Model validation: residual analysis (cont'd)
Common issues with residual plots.
Figure panels:
- Studentized Standardized Deviance Residuals vs. Fitted Value
- Normal Error Structure/Log Link (Studentized Standardized Deviance Residuals) vs. Transformed Fitted Value
- Crunched Residuals (Group Size: 72) vs. Fitted Value
Annotations:
- Elliptical pattern is ideal
- Two concentrations suggest two perils: split or use joint modeling
- Asymmetrical appearance suggests the power of the variance function is too low
- Use crunched residuals for frequency

32 Model validation: gains curves
Gains curves are good for comparing the predictiveness of models:
- Order observations from largest to smallest predicted value on the X axis
- Plot cumulative actual claim counts (or losses) on the Y axis
- As you move from left to right, the better model should accumulate actual losses faster
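The construction described above can be sketched in a few lines. This is an illustrative version, not the presenter's code; the function names, the use of the area under the curve as a one-number summary, and the toy data are assumptions.

```python
def gains_curve(predicted, actual):
    """Points (share of observations, share of actual claims).

    Observations are ordered from largest to smallest predicted value and
    actual claims are accumulated in that order.
    """
    order = sorted(range(len(predicted)), key=lambda k: predicted[k],
                   reverse=True)
    total = float(sum(actual))
    points = [(0.0, 0.0)]
    running = 0.0
    for rank, k in enumerate(order, start=1):
        running += actual[k]
        points.append((rank / len(order), running / total))
    return points


def area_under(points):
    """Trapezoidal area under a gains curve; larger means faster accumulation."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up holdout data: two competing models scored on the same records.
actual = [0.0, 2.0, 1.0, 1.0]
better = gains_curve([0.1, 0.4, 0.2, 0.3], actual)
worse = gains_curve([0.4, 0.1, 0.3, 0.2], actual)
```

Here the first model ranks the records with claims ahead of the record without, so its curve rises faster and encloses more area.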

33 Model validation: hold out samples
Holdout samples are effective at validating models:
- Determine estimates based on part of the data set
- Use the estimates to predict the other part of the data set
[Diagram: Full test/training for large data sets] Split the data; build on the train data; compare predictions to actual on the test data.
[Diagram: Partial test/training for smaller data sets] Build on all data; split the data; refit parameters on the train data; compare predictions to actual on the test data.
Predictions should be close to actuals for heavily populated cells.
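A random train/test split of the kind described above might look like the following sketch. The 30% test share, the seed, and the function name are illustrative assumptions, not values from the presentation.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=2015):
    """Randomly split records into (train, test) lists.

    A fixed seed keeps the split reproducible so that refits during
    iterative modeling are always validated on the same holdout records.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up policy identifiers.
train, test = split_train_test(range(10))
```

In practice the split is often made by policy or by time period rather than by individual record, so that related records do not appear on both sides.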

34 Model validation: lift charts on hold out data
- Actual vs. expected on holdout data is an intuitive validation technique
- Good for communicating model performance to non-technical audiences
- Can also create actual vs. expected across predictor dimensions
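One common way to build such a chart is to band the holdout records by predicted value and compare average predicted with average actual in each band. The sketch below assumes equal-count bands and made-up data; it is an illustration, not the presenter's method.

```python
def lift_table(predicted, actual, n_bands=10):
    """Rows of (band number, mean predicted, mean actual) on holdout data.

    Records are ordered from smallest to largest predicted value and cut
    into equal-count bands; empty bands are skipped.
    """
    order = sorted(range(len(predicted)), key=lambda k: predicted[k])
    n = len(order)
    rows = []
    for b in range(n_bands):
        idx = order[b * n // n_bands:(b + 1) * n // n_bands]
        if not idx:
            continue
        rows.append((
            b + 1,
            sum(predicted[k] for k in idx) / len(idx),
            sum(actual[k] for k in idx) / len(idx),
        ))
    return rows


# Made-up holdout frequencies and claim counts, cut into two bands.
lift = lift_table([0.05, 0.20, 0.10, 0.15], [0.0, 1.0, 0.0, 0.0], n_bands=2)
```

A model with good lift shows actual averages that rise with the predicted averages across bands and stay close to them.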

35 Component frequency and severity models can be combined to create pure premium models
Component models can be constructed in many different ways.
The standard model:
- Component models: Frequency (Poisson/Negative Binomial) and Severity (Gamma)
- Combine: Frequency x Severity
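With log links on both component models, the combination is a product of base levels and relativities. The sketch below is illustrative only; the base frequency, base severity, and relativities are made-up numbers, and the function name is an assumption.

```python
def pure_premium(base_frequency, base_severity,
                 frequency_relativities, severity_relativities):
    """Modeled pure premium = modeled frequency x modeled severity.

    Each component is its base level times the product of the relativities
    that apply to the risk (log-link models are multiplicative).
    """
    frequency = base_frequency
    for relativity in frequency_relativities:
        frequency *= relativity
    severity = base_severity
    for relativity in severity_relativities:
        severity *= relativity
    return frequency * severity


# Hypothetical risk: base frequency 10%, base severity 2,000.
example = pure_premium(0.10, 2000.0, [1.2, 0.9], [1.1])
```

Computing this for every record gives the modeled pure premiums on which a further model can be built.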

36 Building a model on modeled pure premium
When using modeled pure premiums, select the gamma/log link (not the Tweedie).
[Chart: Density: Severity] Modeled pure premiums will not have a point mass at zero.
[Chart: Density: Pure Premium] Raw pure premiums are bimodal (i.e., have a point mass at zero) and require a distribution such as the Tweedie.

37 Various constraints often need to be applied to the modeled pure premiums
Goal: convert modeled pure premiums into indications after consideration of internal and external constraints.
It is not always possible or desirable to charge the fully indicated rates in the short run:
- Marketing decisions
- Regulatory constraints
- Systems constraints
Need to adjust the indications for known constraints.

38 Constraints to give desired subsidies
- Offsetting one predictor changes the parameters of other correlated predictors to make up for the restrictions
- The stronger the exposure correlation, the more that can be made up through the other variable
- Consequently, the modeler should not refit models when a desired subsidy is incorporated into the rating plan

Example                                                                      Potential action
Insurer-desired subsidy: Sr. mgmt wants subsidy to attract drivers 65+      Do not refit models with constraint
Regulatory subsidy: regulatory constraint requires subsidy of drivers 65+   Consider implication of refitting and make a business decision

Result of refitting with constraint: correlated factors will adjust to partially make up for the difference. For example, territories with retirement communities will increase.

Fathom Dynamic Data TM Version 2 Specifications

Fathom Dynamic Data TM Version 2 Specifications Data Sources Fathom Dynamic Data TM Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other

More information

Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers

Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers Why enhance GLM? Shortcomings of the linear modelling approach. GLM being

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG

More information

Enterprise Miner Tutorial Notes 2 1

Enterprise Miner Tutorial Notes 2 1 Enterprise Miner Tutorial Notes 2 1 ECT7110 E-Commerce Data Mining Techniques Tutorial 2 How to Join Table in Enterprise Miner e.g. we need to join the following two tables: Join1 Join 2 ID Name Gender

More information

Overview and Practical Application of Machine Learning in Pricing

Overview and Practical Application of Machine Learning in Pricing Overview and Practical Application of Machine Learning in Pricing 2017 CAS Spring Meeting May 23, 2017 Duncan Anderson and Claudine Modlin (Willis Towers Watson) Mark Richards (Allstate Insurance Company)

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

8. MINITAB COMMANDS WEEK-BY-WEEK

8. MINITAB COMMANDS WEEK-BY-WEEK 8. MINITAB COMMANDS WEEK-BY-WEEK In this section of the Study Guide, we give brief information about the Minitab commands that are needed to apply the statistical methods in each week s study. They are

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming

More information

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc.

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C is one of many capability metrics that are available. When capability metrics are used, organizations typically provide

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered

More information

SAS (Statistical Analysis Software/System)

SAS (Statistical Analysis Software/System) SAS (Statistical Analysis Software/System) SAS Adv. Analytics or Predictive Modelling:- Class Room: Training Fee & Duration : 30K & 3 Months Online Training Fee & Duration : 33K & 3 Months Learning SAS:

More information

Stat 4510/7510 Homework 4

Stat 4510/7510 Homework 4 Stat 45/75 1/7. Stat 45/75 Homework 4 Instructions: Please list your name and student number clearly. In order to receive credit for a problem, your solution must show sufficient details so that the grader

More information

And the benefits are immediate minimal changes to the interface allow you and your teams to access these

And the benefits are immediate minimal changes to the interface allow you and your teams to access these Find Out What s New >> With nearly 50 enhancements that increase functionality and ease-of-use, Minitab 15 has something for everyone. And the benefits are immediate minimal changes to the interface allow

More information

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures STA 2023 Module 3 Descriptive Measures Learning Objectives Upon completing this module, you should be able to: 1. Explain the purpose of a measure of center. 2. Obtain and interpret the mean, median, and

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Lecture: Simulation. of Manufacturing Systems. Sivakumar AI. Simulation. SMA6304 M2 ---Factory Planning and scheduling. Simulation - A Predictive Tool

Lecture: Simulation. of Manufacturing Systems. Sivakumar AI. Simulation. SMA6304 M2 ---Factory Planning and scheduling. Simulation - A Predictive Tool SMA6304 M2 ---Factory Planning and scheduling Lecture Discrete Event of Manufacturing Systems Simulation Sivakumar AI Lecture: 12 copyright 2002 Sivakumar 1 Simulation Simulation - A Predictive Tool Next

More information

Multiple Regression White paper

Multiple Regression White paper +44 (0) 333 666 7366 Multiple Regression White paper A tool to determine the impact in analysing the effectiveness of advertising spend. Multiple Regression In order to establish if the advertising mechanisms

More information

Why is Statistics important in Bioinformatics?

Why is Statistics important in Bioinformatics? Why is Statistics important in Bioinformatics? Random processes are inherent in evolution and in sampling (data collection). Errors are often unavoidable in the data collection process. Statistics helps

More information

Cognalysis TM Reserving System User Manual

Cognalysis TM Reserving System User Manual Cognalysis TM Reserving System User Manual Return to Table of Contents 1 Table of Contents 1.0 Starting an Analysis 3 1.1 Opening a Data File....3 1.2 Open an Analysis File.9 1.3 Create Triangles.10 2.0

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

Building Better Parametric Cost Models

Building Better Parametric Cost Models Building Better Parametric Cost Models Based on the PMI PMBOK Guide Fourth Edition 37 IPDI has been reviewed and approved as a provider of project management training by the Project Management Institute

More information

Exploratory model analysis

Exploratory model analysis Exploratory model analysis with R and GGobi Hadley Wickham 6--8 Introduction Why do we build models? There are two basic reasons: explanation or prediction [Ripley, 4]. Using large ensembles of models

More information

Selected Introductory Statistical and Data Manipulation Procedures. Gordon & Johnson 2002 Minitab version 13.

Selected Introductory Statistical and Data Manipulation Procedures. Gordon & Johnson 2002 Minitab version 13. Minitab@Oneonta.Manual: Selected Introductory Statistical and Data Manipulation Procedures Gordon & Johnson 2002 Minitab version 13.0 Minitab@Oneonta.Manual: Selected Introductory Statistical and Data

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

1 RefresheR. Figure 1.1: Soy ice cream flavor preferences

1 RefresheR. Figure 1.1: Soy ice cream flavor preferences 1 RefresheR Figure 1.1: Soy ice cream flavor preferences 2 The Shape of Data Figure 2.1: Frequency distribution of number of carburetors in mtcars dataset Figure 2.2: Daily temperature measurements from

More information

Correctly Compute Complex Samples Statistics

Correctly Compute Complex Samples Statistics SPSS Complex Samples 15.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample

More information

This chapter will show how to organize data and then construct appropriate graphs to represent the data in a concise, easy-to-understand form.

This chapter will show how to organize data and then construct appropriate graphs to represent the data in a concise, easy-to-understand form. CHAPTER 2 Frequency Distributions and Graphs Objectives Organize data using frequency distributions. Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives.

More information

Package ToTweedieOrNot

Package ToTweedieOrNot Type Package Package ToTweedieOrNot December 1, 2014 Title Code for the paper Generalised linear models for aggregate claims; to Tweedie or not? Version 1.0 Date 2014-11-27 Author Oscar Alberto Quijano

More information

Artificial Neural Networks (Feedforward Nets)

Artificial Neural Networks (Feedforward Nets) Artificial Neural Networks (Feedforward Nets) y w 03-1 w 13 y 1 w 23 y 2 w 01 w 21 w 22 w 02-1 w 11 w 12-1 x 1 x 2 6.034 - Spring 1 Single Perceptron Unit y w 0 w 1 w n w 2 w 3 x 0 =1 x 1 x 2 x 3... x

More information

Correctly Compute Complex Samples Statistics

Correctly Compute Complex Samples Statistics PASW Complex Samples 17.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample

More information

Lesson 18-1 Lesson Lesson 18-1 Lesson Lesson 18-2 Lesson 18-2

Lesson 18-1 Lesson Lesson 18-1 Lesson Lesson 18-2 Lesson 18-2 Topic 18 Set A Words survey data Topic 18 Set A Words Lesson 18-1 Lesson 18-1 sample line plot Lesson 18-1 Lesson 18-1 frequency table bar graph Lesson 18-2 Lesson 18-2 Instead of making 2-sided copies

More information

Statistics Lecture 6. Looking at data one variable

Statistics Lecture 6. Looking at data one variable Statistics 111 - Lecture 6 Looking at data one variable Chapter 1.1 Moore, McCabe and Craig Probability vs. Statistics Probability 1. We know the distribution of the random variable (Normal, Binomial)

More information

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010 THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL STOR 455 Midterm September 8, INSTRUCTIONS: BOTH THE EXAM AND THE BUBBLE SHEET WILL BE COLLECTED. YOU MUST PRINT YOUR NAME AND SIGN THE HONOR PLEDGE

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

Chapter 2: Modeling Distributions of Data

Chapter 2: Modeling Distributions of Data Chapter 2: Modeling Distributions of Data Section 2.2 The Practice of Statistics, 4 th edition - For AP* STARNES, YATES, MOORE Chapter 2 Modeling Distributions of Data 2.1 Describing Location in a Distribution

More information
