Optimize Data Quality for Input to Critical Projects

Data Quality Service
Harness Your Most Valuable Asset for Successful Outcomes

Table of Contents
- Fuel Applications to Take Advantage of Machine Learning Technologies
- Inspecting and Analyzing the Data
- Establishing Data Readiness
- Finishing with the Implementation Phase

Fuel Applications to Take Advantage of Machine Learning Technologies

Your success is our primary goal at SAP. The Data Network organization at SAP recognizes that positive results in a data-driven project depend on reliable data quality. Before you kick off your project, we analyze your data and deliver a data quality and data profiling report on the completeness, uniqueness, timeliness, validity, accuracy, and consistency of your data and its machine learning readiness.

CAPITALIZING ON A THREE-STEP APPROACH

Our data quality and data profiling efforts follow a three-step approach. First, we examine your data in the inspection phase to check whether a proof of concept is feasible. Next comes the readiness phase, in which we lead an ideation workshop to help you define a key use case. We also estimate the current fitness of your data to support a proof of concept for your use case. Finally, in the implementation phase, we recommend and introduce specific measures to improve the quality of your data to make it suitable for prototyping and production. As Figure 1 illustrates, we can iterate phases as necessary until your data is fully production-ready.

Figure 1: Phases of the Data Quality Assessment
- Inspection: Understand the use case. Load data. Assess data quality. Perform quantitative assessment of data dimensions.
- Data readiness: Does the data quality meet the target product requirement? Can we execute on the data science task? Can we or the customer fix any data quality issues?
- Conclusion: Are we ready to start the data service either as proof of concept or in production?

Inspecting and Analyzing the Data

In the inspection phase, we vet the data for accessibility, loadability, consistency, and completeness (see Figure 2).

Figure 2: Summary of the Inspection Phase and Deliverables

Steps: Load; Understand data; Understand use case; Prepare use case for data science; Reiterate to understand better; Perform quantitative assessment of data quality dimensions.

Questions and checks:
- Access: database, flat files?
- Column description?
- Focus: Visualization? Prediction? Optimization? Monitoring? Proscription? Anomaly detection? Forecast?
- Check whether the state of the data is sufficient for prototyping.
- Return to Load at least once.
- Uniquely identified?
- Anonymization necessary? Personally identifiable information with or without consent?
- Does the data set contain every variable the use case needs?
- If anything is unclear, check with the data science team immediately.
- Data sources properly linked (schema)?
- What are the target variables or outputs? What are the features or inputs?
- Is data timely enough for the use case?
- What should the result look like: mere predictions? An effect study?
- Consistent or fuzzy format: date, address, currency, symbols
- Obscurities? Consult a domain expert? Add external data?
- Is the sample size appropriate for the chosen model?
- Data size: Full data or subsample? If subsample, why? Infinity starts at n=30.
- Understand contextual relation of variables.
- Work with a subsample for prototyping if the data is big.
- Provide data profiling: Facets, DescTools
- Produce correlation and dependence matrixes
- Something odd? Transformations needed?
- Narrow down a set of predictive features if the data is big (Lambda architecture).
- Missing data? Can we impute?
- Are correlations in line with presumed column relations?

EXAMINING QUALITY ALONG STANDARD DIMENSIONS

Specific questions we ask during data inspection include:
- How does your data rate along the traditional data quality dimensions (see Figure 3)?
- Is your data consistently formatted?
- Is there missing data, and if so, why? Can we impute any missing data?
- Does your data contain personally identifying information that we should anonymize in order to comply with data protection rules?
- Is the sample size sufficient to make machine learning possible?
- What type of external data could improve the quality of the machine learning model?

Figure 3: Standard Data Quality Dimensions

Completeness
  Description: Proportion of data that is 100% complete
  Measure: 1 - (# rows with NA, NULL, or empty values) / (# rows)
  Example: 100 persons, with 3 columns; 2 empty rows, 5 rows with only 2 filled columns each: 93% complete

Uniqueness
  Description: Proportion of unique rows that are stored once
  Measure: (# unique rows) / (# total rows), as a percentage
  Example: 110 persons, but population is only 100: 100/110 = 90.9% unique

Timeliness
  Description: Degree to which date or time of measured data deviates from reality or use case
  Measure: Mean of time deviation
  Example: It might take 2 days from data entry to the update of the database: 2-day delay

Validity
  Description: Proportion of data that conforms to superformat
  Measure: 1 - (# rows with at least one invalid column) / (# total rows), as a percentage
  Example: Clear

Accuracy
  Description: Degree to which data correctly describes the real world
  Measure: 1 - (# rows with at least one inaccurate column) / (# total rows), as a percentage
  Example: U.S. dates in the EU: 10/01 is Oct 1 and not Jan 10

Consistency
  Description: Identity of an entity
  Measure: (# consistent rows) / (# total rows), as a percentage
  Example: Dates with identical formats
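The measures in Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling the service actually uses: the function names, the list-of-dictionaries row layout, and the choice to treat None and empty strings as missing are all hypothetical.

```python
from datetime import datetime

MISSING = (None, "")  # assumption: these values count as NA, NULL, or empty


def completeness(rows, columns):
    """Share of rows in which every listed column is filled."""
    incomplete = sum(
        1 for row in rows if any(row.get(col) in MISSING for col in columns)
    )
    return 1 - incomplete / len(rows)


def uniqueness(rows, columns):
    """Share of distinct rows among all stored rows."""
    distinct = {tuple(row.get(col) for col in columns) for row in rows}
    return len(distinct) / len(rows)


def mean_delay_days(events):
    """Mean time between data entry and database update, in days."""
    delays = [
        (updated - entered).total_seconds() / 86400 for entered, updated in events
    ]
    return sum(delays) / len(delays)


# Completeness example from Figure 3: 100 persons, 3 columns,
# 2 empty rows and 5 rows with only 2 filled columns each.
columns = ["name", "city", "birth_year"]
people = [{"name": f"p{i}", "city": "Berlin", "birth_year": 1980} for i in range(93)]
people += [{"name": None, "city": None, "birth_year": None}] * 2
people += [{"name": f"q{i}", "city": "", "birth_year": 1990} for i in range(5)]
print(round(completeness(people, columns), 2))  # 0.93

# Uniqueness example: 110 stored rows describing 100 distinct persons.
stored = [{"name": f"p{i % 100}"} for i in range(110)]
print(round(uniqueness(stored, ["name"]), 3))  # 0.909

# Timeliness example: a single record updated 2 days after entry.
print(mean_delay_days([(datetime(2018, 6, 1), datetime(2018, 6, 3))]))  # 2.0
```

Validity and accuracy follow the same pattern as completeness, with a rule check or a comparison against a trusted reference in place of the missing-value test.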

PRIORITIZING VARIABLES

We also perform the following steps during the inspection phase:
- Identify the most important variables for the use case
- Undertake a sound statistical analysis
- Estimate statistical distributions and correlation matrixes
- Recheck the results with domain experts
- Identify features that are essential elements of a machine learning model

EMPOWERING YOU TO EXPLORE YOUR DATA

We keep you constantly updated during the inspection phase and provide interactive applications that help you dive into your data. Our deliverables for this phase include a simple interactive data profile that you can explore on your own. See Figure 4 for an example of interactive data profiling.

Figure 4: Interactive Data Profiling with Open-Source Facets Software
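Estimating a correlation matrix and using it to shortlist important variables, two of the inspection steps listed above, might look like the following sketch. It uses the Pearson coefficient as one possible dependence measure; the function names and the toy columns (ad_spend, rainfall, revenue) are hypothetical, and a real analysis would still be rechecked with domain experts.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists.

    Raises ZeroDivisionError if either column is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(table):
    """Map each pair of column names to its Pearson correlation."""
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}


def rank_features(table, target):
    """Order the non-target columns by absolute correlation with the target."""
    matrix = correlation_matrix(table)
    others = [name for name in table if name != target]
    return sorted(others, key=lambda name: abs(matrix[target][name]), reverse=True)


# Hypothetical toy data: two candidate features and one target variable.
data = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rainfall": [3.0, 1.0, 4.0, 1.0, 5.0],
    "revenue": [2.1, 3.9, 6.2, 8.0, 9.8],
}
print(rank_features(data, "revenue"))  # ['ad_spend', 'rainfall']
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a variable on its own.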

Establishing Data Readiness

After the inspection phase comes the readiness phase, in which we help make sure the data is ready to support a proof of concept.

PREPARING FOR THE PROOF OF CONCEPT

As soon as the data supports work on a prototype, we start the readiness phase with four weeks of discovery that culminate in the delivery of the proof of concept. If data quality meets requirements, and there are no detectable data issues, we then continue with production in the final implementation phase.

IMPROVING CLEANLINESS AS NEEDED

If the data is not ready for prototyping or production, we work out a clear road map for improving data cleanliness in the implementation phase. This can include such steps as:
- Formatting data with SAP HANA smart data quality software to improve data consistency by putting key identifiers into a single consistent format
- Inferring data using machine learning techniques to interpolate missing data points
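To make the idea of putting values into a single consistent format concrete, here is a small stand-in written in plain Python. It does not use or represent SAP HANA smart data quality software; the function names and the list of accepted date layouts are assumptions chosen for the example.

```python
from datetime import datetime

# Assumption: the source systems are known to emit only these layouts.
# The first entry is the target format. Listing "%m/%d/%Y" means that
# "10/01/2018" is read the U.S. way, as October 1.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def to_iso_date(value, formats=KNOWN_FORMATS):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = value.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def consistency(values, formats=KNOWN_FORMATS):
    """Share of values already stored in the first (target) format."""
    target = formats[0]
    consistent = 0
    for value in values:
        try:
            datetime.strptime(value.strip(), target)
            consistent += 1
        except ValueError:
            pass
    return consistent / len(values)


raw = ["2018-06-01", "01.06.2018", "10/01/2018", "June 1st"]
print([to_iso_date(v) for v in raw])
# ['2018-06-01', '2018-06-01', '2018-10-01', None]
print(consistency(raw))  # 0.25
```

Values that match none of the known layouts come back as None so that they can be listed for manual review instead of being guessed.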

Finishing with the Implementation Phase

Our data quality service concludes with the implementation phase, during which we check whether your data meets the data quality criteria for production.

IMPUTING MISSING DATA POINTS

If there are missing data points, the corresponding rows cannot be fed to the machine learning model. However, we can predict the values of missing entries by taking the dependencies between your columns into account, as shown in Figure 5. The resulting rows can help train the model by enhancing its predictive capabilities.

If your data set is fairly sizeable, we also investigate whether it is larger than the minimum size necessary for the machine learning model you're considering. If so, we can simply filter out the rows with missing data points. If both options are viable, we examine and report on the effects each would have on your final results.
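The two options described here, predicting missing entries from other columns or filtering out incomplete rows, can be illustrated with a deliberately simple sketch. It swaps in a one-variable least-squares line for whatever machine learning model a real project would use; the function names, the height and weight columns, and the min_rows threshold of 30 are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute(rows, feature, target):
    """Fill missing target values from a line fitted on the complete rows."""
    complete = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[feature] for r in complete], [r[target] for r in complete]
    )
    return [
        r if r[target] is not None
        else {**r, target: slope * r[feature] + intercept}
        for r in rows
    ]


def drop_incomplete(rows, target, min_rows):
    """Filter out incomplete rows, but only if enough rows remain."""
    kept = [r for r in rows if r[target] is not None]
    return kept if len(kept) >= min_rows else None


# Hypothetical data: weight depends on height; one weight is missing.
rows = [
    {"height": 150, "weight": 50.0},
    {"height": 160, "weight": 60.0},
    {"height": 170, "weight": None},
    {"height": 180, "weight": 80.0},
]
print(impute(rows, "height", "weight")[2])  # weight is filled with roughly 70
print(drop_incomplete(rows, "weight", min_rows=30))  # None: too few rows remain
```

When both functions return usable data, training on each result and comparing the outcomes is one way to report the effect of either choice.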

Figure 5: Filling in Missing Data Points with Machine Learning
We can predict the values of missing entries by taking the dependencies between your columns into account. We use the resulting rows to help train the machine learning model.

ADDING EXTERNAL DATA AS A PROXY FOR KEY MISSING VARIABLES

When your data set lacks essential variables, we can replace them using external variables that exhibit a sufficiently high correlation with the original missing variables. Examples are the gross domestic product as an approximation of a nation's wealth, or SAT and ACT test scores as proxies for a student's cognitive abilities.

BOOSTING QUALITY FOR PRODUCTION

If your data does not yet appear fit for eventual production, we list for your data experts the tasks remaining to prepare it for the machine learning model. Once the data issues are resolved, our team starts modeling and prototyping. See Figure 6.

Figure 6: Preparations for Prototyping in the Implementation Phase

Steps: Can we execute?; Decision to fix or ignore the problems; Wish list for customer.

Activities and questions:
- Document issues from findings step
- Missing data in rows or too few data in subcategories: imputation?
- Precise description of problems and action items for the customer
- Discuss issues with data science team
- Discuss issues with product team
- Missing data columns: necessary for proof of concept or production?
- Inconsistent data formatting: manual coding? SAP HANA smart data quality software?
- Typical issues: missing values, meaning of zeros, duplicated rows and columns, inconsistent date formats, unspecified units, inconsistent spelling, ambiguous naming, data too coarse, text as numbers, numbers as text, biased sample
- Integrate comments into the issue report
- Missing identifiers: alternative identifiers? Identifiers reconstructible?
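A few of the typical issues named in Figure 6 (missing values, duplicated rows, numbers stored as text) are easy to detect automatically. The sketch below shows one hypothetical way to collect such findings for an issue report; the function names, the row layout, and the wording of the findings are assumptions, and issues such as biased samples or ambiguous naming still need human judgment.

```python
from collections import Counter


def _is_number(text):
    """True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_issues(rows, columns):
    """Return a list of human-readable findings for a small issue report."""
    issues = []

    # Duplicated rows: identical values in every listed column.
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in counts.values() if count > 1)
    if duplicates:
        issues.append(f"{duplicates} duplicated row(s)")

    for col in columns:
        values = [row.get(col) for row in rows]

        # Missing values: None or empty string (assumption).
        missing = sum(1 for v in values if v is None or v == "")
        if missing:
            issues.append(f"{col}: {missing} missing value(s)")

        # Numbers stored as text: strings that parse as floats.
        as_text = sum(1 for v in values if isinstance(v, str) and _is_number(v))
        if as_text:
            issues.append(f"{col}: {as_text} number(s) stored as text")

    return issues


# Hypothetical extract with one duplicate, one gap, and text-typed amounts.
records = [
    {"id": 1, "amount": 10.5},
    {"id": 2, "amount": "12.0"},
    {"id": 2, "amount": "12.0"},
    {"id": 3, "amount": None},
]
for finding in find_issues(records, ["id", "amount"]):
    print(finding)
```

Each finding can then be turned into a precise action item for the customer's data experts.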

LEARN MORE

For additional information about the data quality service from the Data Network organization at SAP, visit us online.

www.sap.com/contactsap
Studio SAP 57174enUS (18/06)

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. The information contained herein may be changed without prior notice. Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary. These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty. In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation, and SAP SE's or its affiliated companies' strategy and possible future developments, products, and/or platforms, directions, and functionality are all subject to change and may be changed by SAP SE or its affiliated companies at any time for any reason without notice. The information in this document is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, and they should not be relied upon in making purchasing decisions.
SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. All other product and service names mentioned are the trademarks of their respective companies. See https://www.sap.com/copyright for additional trademark information and notices.