1 Japan Discovery Summit, 11/18/2016
Shaping up Big Data: A data workout with JMP
Michèle Boulanger, Rollins College, FL, Chair of ISO/Technical Committee on Applications of Statistics
Mia Stephens, JMP Academic Ambassador, SAS, NC

2 ISO - World of International Standards
- ISO/TC69: Applications of Statistical Methods
  - Current presence of JMP
- ISO/JTC1: Joint Technical Committee on Information Systems
  - WG9: Big data
  - NIST (Nat'l Institute of Standards and Technology): US lead for Big Data standardization
- Partnership between TC69 and JTC1/WG9
  - Future role of JMP

3 What is Big Data?
- Observational/transactional
- 5Vs (Volume/Variety/Veracity/Velocity/Variation)
- Organization (centralized, distributed)
- Structure
  - Data model (strict schema, flat schema)
  - Data relationship (complex relationships, almost flat with few relationships)
- NoSQL, Hadoop as a way to handle distributed storage and manage initial summaries

4 Medicare Fraud Case Study
- Medicare is the American universal insurance program for people over 65 years old
- Covers millions of people
- Served by hundreds of thousands of practitioners
- Why do we care about fraud?

5 39 Medicare Fraud Cases Settled in 2016!

6 Medicare makes the data submitted by practitioners publicly available
Dataset is located at: Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier.html

7 Data Curation Challenges
1. Where do we start?
- Access the database
- Sample from the full data set
2. How do we deal with dirty data in sample?
- Missing values
- Recoding
- Text mining and latent class analysis
- Multiple response variable
3. How do we transfer what we learn to the full dataset?
- Reproducibility/Codification/Scalability
- One sample, two samples, more samples?

8 Data Curation Challenges Cont'd
4. How do we augment the data with relevant information?
- Virtual joining
5. How do we deal with the speed at which the data arrive (velocity)?
- Input data process and authoritative list
6. How do we transform the data?
7. How do we detect suspicious patterns?
- Outliers and scoring
- Clustering and scoring
- Association analysis (Correspondence analysis)

9 1. Where do we start? Two approaches
a. Use Query Builder (outside JMP)
- Go to Query Builder, use the ODBC Manager and the appropriate drivers to access the .txt file for CY2012
- Use SQL to select, as randomly as possible, a subset of size 10,000 from the original file (about 0.1% of the original file)
- Load the sampled dataset of size 10K into JMP

10 Where do we start? 2nd approach
b. Use JMP Query Builder (within JMP)
- Load the full .txt file into JMP using text import
- Use JMP Query Builder to randomly select a sample of size 10K (about 0.1%)
- No challenge, easy... but limited by size
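Both approaches come down to drawing a uniform random subset from a file that may be too large to load comfortably. As a hedged illustration only (this is not how Query Builder or ODBC sampling works internally), a one-pass reservoir sample can be sketched as follows. The function name, the seed, and the stand-in records are assumptions made for the example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = row
    return sample

# Hypothetical stand-in for the lines of a large text file.
rows = (f"record-{n}" for n in range(100_000))
subset = reservoir_sample(rows, k=100)
print(len(subset))
```

Because the rows are consumed as a stream, the same function could be pointed at an open file handle, so memory use depends on the sample size rather than on the file size.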

12 Description of the data
- 9,153,272 records, 30 columns, and 2.36 GB
- 9861 providers (some providers conduct more than one procedure)
- 81 specialties
- Over 1271 procedure codes
- Continuous variables very skewed
- Correlations and 3 big outliers in terms of number of services or beneficiaries
- Main variables: NPI, CREDENTIALS, PROVIDER_TYPE, HCPCS, STATE, GENDER, OFFICE OR NOT, LINE_SRVC_CNT, BENE_UNIQUE_CNT, AVERAGE_SUBMITTED_CHRG_AMT, AVERAGE MEDICARE ALLOWED

13 2. How do we deal with dirty data in sample?
Cleaning up CREDENTIALS
- Text explorer
- Substitute and JSL script
- Recode and formula
- Virtual join with authoritative list
- Multiple response and distribution
- Informative missing or code missing
- Text explorer on cleaner data
- Latent class analysis (LCA)
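To make the recoding idea concrete, here is a minimal sketch of normalizing a free-text credentials field into a multiple-response list with an explicit code for missing values. It is not the JSL script or Recode formula used in the talk. The example strings, the separator set, and the "MISSING" code are assumptions.

```python
import re
from collections import Counter

def split_credentials(raw):
    """Normalize a free-text credentials string into a list of tokens."""
    if raw is None or not raw.strip():
        return ["MISSING"]  # explicit code for missing values
    cleaned = raw.upper().replace(".", "")  # "M.D." becomes "MD"
    tokens = [t for t in re.split(r"[\s,;/&]+", cleaned) if t]
    return tokens or ["MISSING"]

# Hypothetical raw values of the kind a credentials column might contain.
examples = ["M.D.", "MD", "md, phd", "D.O.", "", None, "M.D., Ph.D."]
counts = Counter(tok for raw in examples for tok in split_credentials(raw))
print(counts.most_common())
```

A rule this simple would split a value such as "M. D." into two tokens, which is one reason the slides pair recoding with an authoritative list of accepted values.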

14 LCA on CREDENTIALS

15 3. How do we transfer what we learn on the sample to the full dataset?
- Pass formulae to the full dataset
  - JSL script
  - Will have to be translated
- Pass authoritative list to full dataset
- Iterative process: Resample and redo analyses
- Reproducibility/Codification/Scalability
  - Need for capturing the cleaning formulae
- One sample, two samples, more samples?

16 Initial Authoritative list selected

17 Next Three Steps
4. How do we augment the data with relevant information?
- Join and virtual join
- Dataset on quality metrics
5. How do we deal with the speed at which the data arrive (velocity)?
- Input data process standardization
6. How do we transform the data?

18 7. How do we detect suspicious patterns? Approach #1: Outliers Platform
1. Transform the data
- Convert continuous variables into 2 meaningful ratios
2. Standardize ratios by specialty type
3. Identify outliers
- Explore Outliers platform
- Multivariate platform
- Interpretation of biggest outliers in sample
4. Score full dataset by Mahalanobis distance
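For two ratios, the Mahalanobis distance can be written out with the closed-form inverse of a 2x2 covariance matrix. The sketch below illustrates the scoring idea under that assumption. It is not the output of the JMP platforms, and the data points are made up for the example.

```python
import math

def fit_2d(points):
    """Estimate the mean and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(point, mean, cov):
    """Distance of one point from the mean, using the 2x2 inverse covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # must be non-zero
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical standardized ratios for a handful of sampled records.
sample = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (2.5, 2.5)]
mean, cov = fit_2d(sample)
print(mahalanobis_2d((10.0, 0.0), mean, cov))
```

Keeping `mean` and `cov` as explicit values is what allows the same parameters to be reused later when scoring records that were not in the sample.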

19 Outliers Analysis - Mahalanobis Distance

20 Outlier Analysis Cont'd

21 Apply Procedure to Complete Dataset
- Virtual join with mean-std by provider type
- Use same mean and std.dev calculated on sample to standardize the full dataset
- Use same formula for Mahalanobis distance as obtained in sample
- Look at the results
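The procedure above amounts to a keyed lookup: each full-dataset row picks up the mean and standard deviation of its provider type as estimated on the sample. A rough sketch, assuming rows are (provider_type, value) pairs and using invented provider types and values:

```python
from statistics import mean, stdev

def fit_reference(sample_rows):
    """Build {provider_type: (mean, std)} from (provider_type, value) rows."""
    groups = {}
    for ptype, value in sample_rows:
        groups.setdefault(ptype, []).append(value)
    # A standard deviation needs at least two values per group.
    return {p: (mean(v), stdev(v)) for p, v in groups.items() if len(v) > 1}

def standardize(full_rows, reference):
    """Standardize full-dataset rows with the sample parameters."""
    scored, unmatched = [], set()
    for ptype, value in full_rows:
        if ptype not in reference:
            unmatched.add(ptype)  # level never seen in the sample
            continue
        m, s = reference[ptype]
        scored.append((ptype, (value - m) / s))
    return scored, unmatched

# Hypothetical sample and full dataset.
sample = [("Cardiology", 10.0), ("Cardiology", 14.0),
          ("Dermatology", 3.0), ("Dermatology", 5.0)]
full = sample + [("Cardiology", 20.0), ("Podiatry", 7.0)]
reference = fit_reference(sample)
scored, unmatched = standardize(full, reference)
print(unmatched)
```

Collecting the unmatched levels instead of failing on them makes it visible when the full dataset contains provider types that the sample did not cover. A group whose sampled values are all identical would have a zero standard deviation and would need separate handling.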

22 Challenges
- 88 provider specialty levels versus 78 in sample!
  - Add 10 missing levels to referenced dataset after checking them
- Obtain Mahalanobis distances with parameters from sample
- My 1st outlier in sample has a rank of 114!

23 Outliers on full dataset

24 7. How do we detect suspicious patterns? Approach #2: Clustering Platform
1. Identify set of qualitative and continuous variables for cluster analysis
2. Run the hierarchical clustering platform
3. Identify outliers
- Explore clusters
- Interpretation of small outstanding clusters in sample
4. Score full dataset by distance to closest cluster
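Step 4 can be pictured as a nearest-centroid rule: summarize each cluster found on the sample by its centroid, then assign every new record to the closest one and keep the distance as its score. The sketch below assumes continuous variables only and Euclidean distance, which is a simplification of the mixed qualitative/continuous setup described above. The points and labels are invented.

```python
import math

def centroids(points, labels):
    """Average the points of each cluster label into one centroid per label."""
    sums, counts = {}, {}
    for p, lab in zip(points, labels):
        acc = sums.setdefault(lab, [0.0] * len(p))
        for i, x in enumerate(p):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: tuple(x / counts[lab] for x in acc) for lab, acc in sums.items()}

def closest_cluster(point, cents):
    """Return (distance, label) of the centroid nearest to the point."""
    return min((math.dist(point, c), lab) for lab, c in cents.items())

# Hypothetical clustered sample: two clusters in two dimensions.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
labels = [1, 1, 2, 2]
cents = centroids(points, labels)
print(closest_cluster((1.0, 1.0), cents))
```

Records that end up far from every centroid are the ones this approach would flag for a closer look.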

25 Hierarchical Results - 20 clusters

26 Identify Associations between the 2 Approaches
1. Multiple level correspondence analysis
- Mahalanobis distance and hierarchical clusters
2. Bin Mahalanobis distances
3. Run correspondence analysis
4. Look at results on sample
- Interpret results of association analysis
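Correspondence analysis starts from a contingency table, so the binning step can be sketched as building a cross-tabulation of distance bin by cluster. The bin edges, distances, and cluster labels below are assumptions chosen for illustration; the correspondence analysis itself is not implemented here.

```python
from collections import Counter

def bin_distance(d, edges=(1.0, 2.0, 3.0)):
    """Return the index of the first bin whose upper edge exceeds d."""
    for i, edge in enumerate(edges):
        if d < edge:
            return i
    return len(edges)  # open-ended top bin

def crosstab(distances, clusters, edges=(1.0, 2.0, 3.0)):
    """Count records in each (distance bin, cluster) cell."""
    return Counter((bin_distance(d, edges), c) for d, c in zip(distances, clusters))

# Hypothetical Mahalanobis distances and cluster labels for five records.
distances = [0.4, 1.5, 2.2, 5.0, 0.9]
clusters = [1, 1, 2, 3, 1]
table = crosstab(distances, clusters)
print(sorted(table.items()))
```

Cells where a high distance bin lines up with a small cluster are where the two approaches agree that a record is unusual.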

27 Correspondence Analysis

28 Final Results
- We have identified a list of transactions as potential candidates for investigation by provider type. Why not do it by procedure?
- The process is very iterative in nature and requires teamwork between analysts, domain experts, and IT experts
- All analyses applied to the full dataset need to be recorded
- This exercise is not about fraud per se, but about standardization of process and procedures to allow the team of experts to be most effective

29 IT Standardized Structure for Big data from NIST/JTC1/WG9

30 Some Conclusions in Approaching Big Data Analytics
1. Strategy
- Work on a sample and ask IT to apply scalable procedures to the rest of the data set
- Standardized processes are then required to work together
2. The role of JMP
- The latest developments of JMP 13 greatly support the strategy above
- JMP is still the best software application for discovery, and discovery is the name of the game in Analytics

31 Thank you
Acknowledgements
- M. Johnson and T. Kubiack
- William Zhou and Bryan Yan from JMP Shanghai office
- Ricky Sluder from SAS
- Wo Chang and Dan Samarov from NIST
- Nancy Grady from SAIC
