Optimize Data Quality for Input to Critical Projects


Data Quality Service
Optimize Data Quality for Input to Critical Projects
Harness Your Most Valuable Asset for Successful Outcomes

Table of Contents
- Fuel Applications to Take Advantage of Machine Learning Technologies
- Inspecting and Analyzing the Data
- Establishing Data Readiness
- Finishing with the Implementation Phase

Fuel Applications to Take Advantage of Machine Learning Technologies

Your success is our primary goal at SAP. The Data Network organization at SAP recognizes that positive results in a data-driven project depend on reliable data quality. Before you kick off your project, we analyze your data and deliver a data quality and data profiling report on the completeness, uniqueness, timeliness, validity, accuracy, and consistency of your data and its machine learning readiness.

CAPITALIZING ON A THREE-STEP APPROACH

Our data quality and data profiling efforts follow a three-step approach. First, we examine your data in the inspection phase to check whether a proof of concept is feasible. Next comes the readiness phase, in which we lead an ideation workshop to help you define a key use case. We also estimate the current fitness of your data to support a proof of concept for your use case. Finally, in the implementation phase, we recommend and introduce specific measures to improve the quality of your data to make it suitable for prototyping and production. As Figure 1 illustrates, we can iterate phases as necessary until your data is fully production ready.

Figure 1: Phases of the Data Quality Assessment
- Inspection: Understand the use case. Load data. Assess data quality. Perform quantitative assessment of data dimensions.
- Data readiness: Does the data quality meet the target product requirement? Can we execute on the data science task? Can we or the customer fix any data quality issues?
- Conclusion: Are we ready to start the data service, either as a proof of concept or in production?

Inspecting and Analyzing the Data

In the inspection phase, we vet the data for accessibility, loadability, consistency, and completeness (see Figure 2).

Figure 2: Summary of the Inspection Phase and Deliverables
Steps shown in the figure: Load; Understand data; Understand use case; Prepare use case for data science; Reiterate to understand better; Perform quantitative assessment of data quality dimensions.

Checklist items shown in the figure:
- Access: database, flat files?
- Column description?
- Focus: Visualization? Prediction? Optimization? Monitoring? Proscription? Anomaly detection? Forecast?
- Check whether the state of the data is sufficient for prototyping.
- Return to Load at least once.
- Uniquely identified?
- Anonymization necessary? Personally identifiable information with or without consent?
- Does the data set contain every variable the use case needs?
- If anything is unclear, check with the data science team immediately.
- Data sources properly linked: schema?
- What are the target variables or outputs? What are the features or inputs?
- Is data timely enough for the use case?
- What should the result look like: mere predictions? An effect study?
- Consistent or fuzzy format: date, address, currency, symbols.
- Obscurities? Consult a domain expert? Add external data?
- Is the sample size appropriate for the chosen model?
- Data size: Full data or subsample? If subsample, why? Infinity starts at n=30.
- Understand contextual relation of variables.
- Work with a subsample for prototyping if the data is big.
- Provide data profiling: Facets, DescTools.
- Produce correlation-dependence matrixes.
- Something odd? Transformations needed?
- Narrow down a set of predictive features if the data is big. Lambda architecture.
- Missing data? Can we impute?
- Are correlations in line with presumed column relations?

EXAMINING QUALITY ALONG STANDARD DIMENSIONS

Specific questions we ask during data inspection include:
- How does your data rate along the traditional data quality dimensions (see Figure 3)?
- Is your data consistently formatted?
- Is there missing data, and if so, why? Can we impute any missing data?
- Does your data contain personally identifying information that we should anonymize in order to comply with data protection rules?
- Is the sample size sufficient to make machine learning possible?
- What type of external data could improve the quality of the machine learning model?

Figure 3: Standard Data Quality Dimensions

Completeness
  Description: Proportion of data that is 100% complete
  Measure: 1 - (# rows with NA, NULL, or empty values)/(# total rows)
  Example: 100 persons with 3 columns; 2 empty rows and 5 rows with only 2 filled columns each: 93% complete

Uniqueness
  Description: Proportion of unique rows that are stored once
  Measure: (# unique rows/# total rows)%
  Example: 110 persons stored, but the population is only 100: 100/110 = 90.9% unique

Timeliness
  Description: Degree to which date or time of measured data deviates from reality or use case
  Measure: Mean of time deviation
  Example: It might take 2 days from data entry to the update of the database: 2-day delay

Validity
  Description: Proportion of data that conforms to superformat
  Measure: (# rows with at least one invalid column/# total rows)%
  Example: Clear

Accuracy
  Description: Degree to which data correctly describes the real world
  Measure: (# rows with at least one inaccurate column/# total rows)%
  Example: U.S. dates in the EU: 10/01 is Oct 1 and not Jan 10

Consistency
  Description: Identity of an entity
  Measure: (# consistent rows/# total rows)%
  Example: Dates with identical formats
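To make the measures in Figure 3 concrete, here is a minimal Python sketch of three of them: completeness, uniqueness, and format consistency for dates. It is an illustration only, not the service's actual tooling. The function names, the set of values treated as missing, and the ISO date format are assumptions, and uniqueness is computed as unique rows divided by total rows, which is the reading that matches the worked example of 100/110 = 90.9%.

```python
from datetime import datetime

# Values treated as "missing"; this set is an assumption for the sketch.
MISSING = (None, "", "NA", "NULL")


def completeness(rows):
    """Share of rows with no missing cell: 1 - incomplete rows / total rows."""
    incomplete = sum(1 for row in rows if any(cell in MISSING for cell in row))
    return 1 - incomplete / len(rows)


def uniqueness(rows):
    """Share of distinct rows among all stored rows: unique rows / total rows."""
    return len(set(rows)) / len(rows)


def date_consistency(values, fmt="%Y-%m-%d"):
    """Share of values that parse under one agreed date format."""
    def parses(value):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            return False
    return sum(parses(v) for v in values) / len(values)


# Completeness example from Figure 3: 100 persons, 3 columns,
# 2 empty rows and 5 rows with only 2 filled columns.
people = (
    [(f"id{i}", "name", "city") for i in range(93)]
    + [(None, None, None)] * 2
    + [(f"id{93 + i}", "name", None) for i in range(5)]
)

# Uniqueness example from Figure 3: 110 stored rows for a population of 100.
stored = [(i,) for i in range(100)] + [(i,) for i in range(10)]

# Toy date column with one entry in a different format.
dates = ["2018-06-01", "2018-06-02", "01/06/2018", "2018-06-04"]

print(f"completeness: {completeness(people):.1%}")
print(f"uniqueness:   {uniqueness(stored):.1%}")
print(f"consistency:  {date_consistency(dates):.1%}")
```

Rows are tuples so that duplicates can be detected with a set. A real profile would also need a rule for near-duplicates, which this sketch does not attempt.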

PRIORITIZING VARIABLES

We also perform the following steps during the inspection phase:
- Identify the most important variables for the use case
- Undertake a sound statistical analysis
- Estimate statistical distributions and correlation matrixes
- Recheck the results with domain experts
- Identify features that are essential elements of a machine learning model

EMPOWERING YOU TO EXPLORE YOUR DATA

We keep you constantly updated during the inspection phase and provide interactive applications that help you dive into your data. Our deliverables for this phase include a simple interactive data profile that you can explore on your own. See Figure 4 for an example of interactive data profiling.

Figure 4: Interactive Data Profiling with Open-Source Facets Software
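The correlation matrixes mentioned under "Prioritizing Variables" can be sketched in a few lines of plain Python. This is only an illustration of the idea, using the Pearson coefficient as one possible dependence measure; the column names and toy values are hypothetical, and the brochure does not state which statistic the service actually uses.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences.

    Raises ZeroDivisionError for a constant column, which has no variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict that maps column name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy columns, invented for this sketch.
columns = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue": [2.1, 3.9, 6.2, 8.0, 9.8],
    "returns": [5.0, 4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)

for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Entries close to +1 or -1 flag column pairs worth rechecking with domain experts, either as strong candidate features or as redundant ones.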

Establishing Data Readiness

After the inspection phase comes the readiness phase, in which we help make sure the data is ready to support a proof of concept.

PREPARING FOR THE PROOF OF CONCEPT

As soon as the data supports work on a prototype, we start the readiness phase with four weeks of discovery that culminate in the delivery of the proof of concept. If data quality meets requirements, and there are no detectable data issues, we then continue with production in the final implementation phase.

IMPROVING CLEANLINESS AS NEEDED

If the data is not ready for prototyping or production, we work out a clear road map for improving data cleanliness in the implementation phase. This can include such steps as:
- Formatting data with SAP HANA smart data quality software to improve data consistency by putting key identifiers into a single consistent format
- Inferring data using machine learning techniques to interpolate missing data points

Finishing with the Implementation Phase

Our data quality service concludes with the implementation phase, during which we check whether your data meets the data quality criteria for production.

IMPUTING MISSING DATA POINTS

If there are missing data points, the corresponding rows cannot be fed to the machine learning model. However, we can predict the values of missing entries by taking the dependencies between your columns into account, as shown in Figure 5. The resulting rows can help train the model by enhancing its predictive capabilities. If your data set is fairly sizeable, we also investigate whether it is larger than the minimum size necessary for the machine learning model you're considering. If so, we can simply filter out the rows with missing data points. If both options are viable, we examine and report on the effects each would have on your final results.
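As a minimal sketch of the two options described above (predicting missing entries from other columns versus filtering out incomplete rows), the following Python uses a one-variable least-squares line as a stand-in for the machine learning model; the brochure does not say which model the service uses. The function names, the toy values, and the `min_rows` threshold are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


def impute(rows):
    """Option 1: predict each missing y from x, fitting on complete rows only."""
    complete = [(x, y) for x, y in rows if y is not None]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [(x, y if y is not None else slope * x + intercept) for x, y in rows]


def drop_incomplete(rows, min_rows):
    """Option 2: drop incomplete rows, but only if enough rows remain.

    Returns None when fewer than min_rows complete rows are left (an
    assumed reading of "larger than the minimum size necessary").
    """
    complete = [(x, y) for x, y in rows if y is not None]
    return complete if len(complete) >= min_rows else None


# Toy data invented for this sketch: the third row lacks its y value.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, None), (4.0, 8.0), (5.0, 10.0)]

imputed = impute(rows)
filtered = drop_incomplete(rows, min_rows=4)

print("imputed: ", imputed)
print("filtered:", filtered)
```

Running both variants on the same data makes it possible to compare their effect on the downstream model, which is the comparison the text describes when both options are viable.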

Figure 5: Filling in Missing Data Points with Machine Learning

ADDING EXTERNAL DATA AS A PROXY FOR KEY MISSING VARIABLES

When your data set lacks essential variables, we can replace them using external variables that exhibit a sufficiently high correlation with the original missing variables. Examples are the gross domestic product as an approximation of a nation's wealth, or SAT and ACT test scores as proxies for a student's cognitive abilities.

BOOSTING QUALITY FOR PRODUCTION

If your data does not yet appear fit for eventual production, we list for your data experts the tasks remaining to prepare it for the machine learning model. Once the data issues are resolved, our team starts modeling and prototyping. See Figure 6.

Figure 6: Preparations for Prototyping in the Implementation Phase
Steps shown in the figure: Document issues from findings step; Decision to fix or ignore the problems; Can we execute?; Wish list for customer.

Items shown in the figure:
- Missing data in rows or too few data in subcategories: imputation?
- Missing data columns: necessary for proof of concept or production?
- Inconsistent data formatting: manual coding? SAP HANA smart data quality software?
- Missing identifiers: alternative identifiers? Identifiers reconstructible?
- Typical issues: missing values, meaning of zeros, duplicated rows and columns, inconsistent date formats, unspecified units, inconsistent spelling, ambiguous naming, data too coarse, text as numbers, numbers as text, biased sample
- Discuss issues with data science team
- Discuss issues with product team
- Integrate comments into the issue report
- Precise description of problems and action items for the customer
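The proxy idea from "Adding External Data as a Proxy for Key Missing Variables" can be sketched as a simple selection rule: on a reference sample where the original variable is known, pick the external variable with the highest absolute correlation, provided it clears a threshold. This is an illustrative sketch only. The 0.8 threshold, the function and variable names, and the toy numbers are all assumptions made for the example, not values from the service.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def pick_proxy(target, candidates, threshold=0.8):
    """Return (name, r) of the best-correlated candidate, or (None, 0.0).

    target: known values of the variable on a reference sample.
    candidates: dict mapping external variable names to values on that sample.
    threshold: minimum absolute correlation accepted (assumed value).
    """
    best_name, best_r = None, 0.0
    for name, values in candidates.items():
        r = pearson(target, values)
        if abs(r) >= threshold and abs(r) > abs(best_r):
            best_name, best_r = name, r
    return best_name, best_r


# Made-up reference sample: a wealth index and two external candidates.
wealth_index = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "gdp_per_capita": [10.0, 19.0, 31.0, 42.0, 50.0],
    "rainfall": [3.0, 1.0, 4.0, 1.0, 5.0],
}

proxy_name, proxy_r = pick_proxy(wealth_index, candidates)
print(proxy_name, round(proxy_r, 3))
```

A high correlation on the reference sample does not guarantee that the proxy behaves the same way on the full data set, so the choice is still worth confirming with a domain expert.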

LEARN MORE

For additional information about the data quality service from the Data Network organization at SAP, visit us online.

www.sap.com/contactsap
Studio SAP 57174enUS (18/06)

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. The information contained herein may be changed without prior notice. Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary. These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty. In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation, and SAP SE's or its affiliated companies' strategy and possible future developments, products, and/or platforms, directions, and functionality are all subject to change and may be changed by SAP SE or its affiliated companies at any time for any reason without notice. The information in this document is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, and they should not be relied upon in making purchasing decisions.
SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. All other product and service names mentioned are the trademarks of their respective companies. See https://www.sap.com/copyright for additional trademark information and notices.