Mia Stephens JMP Academic Ambassador, SAS, NC
1 Japan Discovery Summit, 11/18/2016 - Shaping up Big Data: A data workout with JMP - Michèle Boulanger, Rollins College, FL, Chair of ISO/Technical Committee on Applications of Statistics - Mia Stephens, JMP Academic Ambassador, SAS, NC
2 ISO - World of International Standards - ISO/TC69: Applications of Statistical Methods (current presence of JMP) - ISO/JTC1: Joint Technical Committee on Information Systems - WG9: Big data - NIST (Nat'l Institute of Standards and Technology): US lead for Big Data standardization - Partnership between TC69 and JTC1/WG9 (future role of JMP)
3 What is Big Data? - Observational/transactional - 5Vs (Volume/Variety/Veracity/Velocity/Variation) - Organization (centralized, distributed) - Structure: data model (strict schema, flat schema); data relationship (complex relationships, almost flat with few relationships) - NoSQL, Hadoop as a way to handle distributed storage and manage initial summaries
4 Medicare Fraud Case Study - Medicare is the American universal insurance program for people over 65 years old - Covers millions of people - Served by hundreds of thousands of practitioners - Why do we care about fraud?
5 39 Medicare Fraud Cases Settled in 2016!
6 Medicare makes the data submitted by practitioners publicly available. Dataset is located at: Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier.html
7 Data Curation Challenges 1. Where do we start? - Access the database - Sample from the full data set 2. How do we deal with dirty data in the sample? - Missing values - Recoding - Text mining and latent class analysis - Multiple response variables 3. How do we transfer what we learn to the full dataset? - Reproducibility/Codification/Scalability - One sample, two samples, more samples?
8 Data Curation Challenges Cont'd 4. How do we augment the data with relevant information? - Virtual joining 5. How do we deal with the speed at which the data arrive (velocity)? - Input data process and authoritative list 6. How do we transform the data? 7. How do we detect suspicious patterns? - Outliers and scoring - Clustering and scoring - Association analysis (Correspondence analysis)
9 1. Where do we start? Two approaches a. Use Query Builder (outside JMP) - Go to Query Builder, use the ODBC Manager and the appropriate drivers to access the .txt file for CY2012 - Use SQL to select, as randomly as possible, a subset of 10,000 rows (about 0.1% of the original file) - Import the 10K-row sample into JMP
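The sampling step above can be sketched outside of SQL as well. The following is a minimal illustration, not the presenters' method: it uses reservoir sampling to draw a fixed-size uniform random sample from a file too large to hold in memory. The function name `reservoir_sample` and the file name in the usage note are hypothetical.

```python
import random
from typing import Iterable, List


def reservoir_sample(rows: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Return k rows drawn uniformly at random from a stream of unknown length.

    If the stream holds fewer than k rows, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample: List[str] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = row
    return sample
```

Usage, assuming a hypothetical local copy of the file with one header line: open it with `open("medicare_cy2012.txt")`, read the header with `next(f)`, then call `reservoir_sample(f, 10_000)` on the remaining lines. Recording the seed supports the reproducibility goal raised later in the deck.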
10 Where do we start? 2nd approach b. Use JMP Query Builder (within JMP) - Import the full .txt file into JMP using text import - Use JMP Query Builder to randomly select a sample of size 10K (about 0.1%) - No challenge, easy... but limited by size
12 Description of the data - 9,153,272 records, 30 columns, and 2.36 GB - 9861 providers (some providers conduct more than one procedure) - 81 specialties - Over 1271 procedure codes - Continuous variables very skewed - Correlations and 3 big outliers in terms of number of services or beneficiaries - Main variables: NPI, CREDENTIALS, PROVIDER_TYPE, HCPCS, STATE, GENDER, OFFICE OR NOT, LINE_SRVC_CNT, BENE_UNIQUE_CNT, AVERAGE_SUBMITTED_CHRG_AMT, AVERAGE MEDICARE ALLOWED
13 2. How do we deal with dirty data in the sample? Cleaning up CREDENTIALS - Text explorer - Substitute and JSL script - Recode and formula - Virtual join with authoritative list - Multiple response and distribution - Informative missing or code missing - Text explorer on cleaner data - Latent class analysis (LCA)
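To make the recoding and multiple-response ideas above concrete, here is a minimal sketch of one way a free-text credentials field could be normalized. It is not the JSL script used in the talk; the function name `clean_credentials`, the separator set, and the example values are assumptions for illustration only.

```python
import re
from typing import FrozenSet, Optional


def clean_credentials(raw: Optional[str]) -> FrozenSet[str]:
    """Turn a free-text credentials entry into a set of normalized tokens.

    A missing entry yields an empty set, so "missing" stays visible as its
    own category instead of being silently dropped.
    """
    if raw is None:
        return frozenset()
    # Upper-case and drop periods so "M.D." and "md" recode to the same value.
    text = raw.upper().replace(".", "")
    # Split on common separators to obtain a multiple-response value.
    tokens = [t for t in re.split(r"[,;/&\s]+", text) if t]
    return frozenset(tokens)
```

For example, `clean_credentials("M.D., Ph.D.")` and `clean_credentials("md phd")` give the same set, which could then be matched against an authoritative list of valid credentials.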
14 LCA on CREDENTIALS
15 3. How do we transfer what we learn on the sample to the full dataset? - Pass formulae to the full dataset: the JSL script will have to be translated - Pass authoritative list to full dataset - Iterative process: resample and redo analyses - Reproducibility/Codification/Scalability: need for capturing the cleaning formulae - One sample, two samples, more samples?
16 Initial Authoritative list selected
17 Next Three Steps 4. How do we augment the data with relevant information? - Join and virtual join - Dataset on quality metrics 5. How do we deal with the speed at which the data arrive (velocity)? - Input data process standardization 6. How do we transform the data?
18 7. How do we detect suspicious patterns? Approach #1: Outliers Platform 1. Transform the data - Convert continuous variables into 2 meaningful ratios 2. Standardize ratios by specialty type 3. Identify outliers - Explore Outliers platform - Multivariate platform - Interpretation of biggest outliers in sample 4. Score full dataset by Mahalanobis distance
19 Outlier Analysis - Mahalanobis Distance
20 Outlier Analysis Cont'd
21 Apply Procedure to Complete Dataset - Virtual join with mean-std by provider type - Use same mean and std. dev. calculated on sample to standardize the full dataset - Use same formula for Mahalanobis distance as obtained in sample - Look at the results
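The scoring procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the formula exported from JMP: it assumes exactly two ratios per record, a hypothetical reference table mapping each provider type to the sample means and standard deviations, and a 2x2 covariance matrix of the standardized ratios estimated on the sample.

```python
import math
from typing import Dict, Tuple

# Hypothetical reference table estimated on the sample:
# provider type -> (mean of ratio 1, sd of ratio 1, mean of ratio 2, sd of ratio 2)
RefStats = Dict[str, Tuple[float, float, float, float]]
Vec2 = Tuple[float, float]
Matrix2 = Tuple[Vec2, Vec2]


def standardize(provider_type: str, ratio1: float, ratio2: float,
                ref: RefStats) -> Vec2:
    """Standardize both ratios with the sample mean and sd of the provider type."""
    m1, s1, m2, s2 = ref[provider_type]  # KeyError if the type was not in the sample
    return ((ratio1 - m1) / s1, (ratio2 - m2) / s2)


def mahalanobis_2d(z: Vec2, center: Vec2, cov: Matrix2) -> float:
    """Mahalanobis distance of z from center for a 2x2 covariance matrix."""
    dx, dy = z[0] - center[0], z[1] - center[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Quadratic form with the closed-form inverse (1/det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)
```

Every record in the full dataset is scored with the same `ref`, `center`, and `cov` that were computed on the sample, so the scores stay comparable between the sample and the full data.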
22 Challenges - 88 provider specialty levels versus 78 in sample! Add 10 missing levels to referenced dataset after checking them - Obtain Mahalanobis distances with parameters from the sample - My 1st outlier in the sample has a rank of 114!
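The level mismatch above is easy to detect before scoring. A minimal sketch, with hypothetical names, that lists the levels present in the full dataset but absent from the reference table built on the sample:

```python
from typing import Iterable, List


def missing_levels(full_levels: Iterable[str],
                   reference_levels: Iterable[str]) -> List[str]:
    """Return, sorted, the levels seen in the full data but not in the reference table."""
    return sorted(set(full_levels) - set(reference_levels))
```

Each level this returns would need to be checked and added to the reference table, as the slide describes, before the join and standardization can cover every record.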
23 Outliers on full dataset
24 7. How do we detect suspicious patterns? Approach #2: Clustering Platform 1. Identify set of qualitative and continuous variables for cluster analysis 2. Run the hierarchical clustering platform 3. Identify outliers - Explore clusters - Interpretation of small outstanding clusters in sample 4. Score full dataset by distance to closest cluster
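Step 4 above can be illustrated with a short sketch. It is a simplification, not the JMP scoring formula: it assumes the clusters found on the sample are summarized by centroids over continuous variables only (the qualitative variables mentioned in step 1 are left out) and uses Euclidean distance. The function name and centroid table are hypothetical.

```python
import math
from typing import Dict, Hashable, Sequence, Tuple


def nearest_cluster(point: Sequence[float],
                    centroids: Dict[Hashable, Sequence[float]]
                    ) -> Tuple[Hashable, float]:
    """Return the label of the closest centroid and the distance to it."""
    best_label, best_dist = None, math.inf
    for label, centroid in centroids.items():
        dist = math.dist(point, centroid)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist
```

Records whose closest cluster is one of the small outstanding clusters, or whose distance to every centroid is large, would be the candidates to review.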
25 Hierarchical Results: 20 clusters
26 Identify Associations between the 2 Approaches 1. Multiple level correspondence analysis: Mahalanobis distance and hierarchical clusters 2. Bin Mahalanobis distances 3. Run correspondence analysis 4. Look at results on sample - Interpret results of association analysis
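The binning step above produces the contingency table that a correspondence analysis takes as input. A minimal sketch, assuming hypothetical bin edges and cluster labels (the correspondence analysis itself is not implemented here):

```python
from bisect import bisect_right
from collections import Counter
from typing import Hashable, Iterable, Sequence, Tuple


def bin_distance(distance: float, edges: Sequence[float]) -> int:
    """Return the bin index of a distance given sorted bin edges.

    Bin 0 is below edges[0]; bin len(edges) is at or above the last edge.
    """
    return bisect_right(edges, distance)


def contingency(distances: Iterable[float],
                clusters: Iterable[Hashable],
                edges: Sequence[float]) -> Counter:
    """Count records per (distance bin, cluster) cell."""
    table: Counter = Counter()
    for dist, cluster in zip(distances, clusters):
        table[(bin_distance(dist, edges), cluster)] += 1
    return table
```

A cell that pairs the highest distance bin with a small cluster marks records flagged by both approaches.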
27 Correspondence Analysis
28 Final Results - We have identified a list of transactions as potential candidates for investigation by provider type. Why not do it by procedure? - The process is very iterative in nature and requires teamwork between analysts, domain experts, and IT experts - All analyses applied to the full dataset need to be recorded - This exercise is not about fraud per se, but about standardization of processes and procedures to allow the team of experts to be most effective
29 IT Standardized Structure for Big data from NIST/JTC1/WG9
30 Some Conclusions in Approaching Big Data Analytics 1. Strategy - Work on a sample and ask IT to apply scalable procedures to the rest of the data set - Standardized processes are then required to work together 2. The role of JMP - The latest developments of JMP 13 greatly support the strategy above - JMP is still the best software application for discovery, and discovery is the name of the game in Analytics
31 Thank you Acknowledgements - M. Johnson and T. Kubiack - William Zhou and Bryan Yan from JMP Shanghai office - Ricky Sluder from SAS - Wo Chang and Dan Samarov from NIST - Nancy Grady from SAIC