Using the Random Sampling Option in Profiles

Copyright Informatica LLC 2017. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html.

Abstract

You can choose to run a profile on all the rows in a data object, on the first N rows, or on a random sample of the data in the data object. This article discusses the random sampling options in profiling and how to use the options based on your requirements.

Supported Versions

Data Quality 10.1.1

Table of Contents

Overview
Random Sampling Computation
Using Random Sampling Option in Informatica Analyst
Using Random Sampling in Informatica Developer

Overview

You can run a profile on all the rows in the data object to perform a complete data quality analysis of the data source. You can also run the profile on the first few rows, or run the profile on a random sample of rows based on your business requirement.

In Informatica Analyst and Informatica Developer, when you create or edit a column profile, you can select a sampling option in the profile wizard. After you choose to run the profile on a random sample of rows, the random sample algorithm chooses rows at random in the data object to run the profile on.

When you choose a random sampling option for column profiles, the Analyst tool and the Developer tool perform drilldown on the staged data. This can impact the drilldown performance. When you choose a random sampling option for data domain discovery profiles, the Analyst tool and the Developer tool perform drilldown on live data.

A data analyst can use random sampling to predict the distribution of data in a source system or to quickly assess the data quality of a source. You can use random sampling when the data source has a skewed or asymmetrical distribution of data. You cannot use the random sampling option for unstructured data sources.

Random Sampling Computation

The random sampling algorithm retrieves the total row count from the data source and computes the number of random sample rows.
If the data source is a statistical database, such as Oracle, Microsoft SQL Server, or IBM DB2, then the algorithm gets the row count from the statistics API. For non-statistical databases, the Data Integration Service runs the ROW_COUNT mapping to retrieve the row count. The algorithm computes the number of random sample rows based on the random sampling option that you choose in the profile wizard.

If the data source is a relational data source, such as Oracle, Microsoft SQL Server, or IBM DB2, and supports random sampling of data, then the Data Integration Service pushes the SQL query to the database. For example, to select random rows in the Customers table for profiling, a sample query is Select * from Customers SAMPLE (X), where X is the approximate percentage of random rows. The query returns approximately X percent of the rows, and the profile runs on those rows.

For example, assume that the estimated source row count for the Customers table is 100 rows and that the computed approximate percentage of random rows X is 35 (a fraction of 0.35 of the rows). The query might return 33 or 36 rows, because Select * from Customers SAMPLE (35) does not necessarily return exactly 35 rows. A small difference can exist between the number of rows that the query returns and the computed percentage of random rows.
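The sampling query and its approximate row count can be illustrated with a short Python sketch. This is not Informatica code: the helper names build_sample_query and simulate_row_sampling are hypothetical, X is treated as a percentage (35 means 35 percent), the range check is an assumed guard, and the simulation assumes that each row is kept independently with probability X/100, which is one common way that row-level sampling behaves.

```python
import random


def build_sample_query(table: str, percent: float) -> str:
    """Build a SAMPLE query like the one quoted in this article.

    Hypothetical helper: percent is a percentage (35 means 35 percent).
    The range check is an assumption added for illustration.
    """
    if not 0 < percent < 100:
        raise ValueError("percent must be greater than 0 and less than 100")
    return f"SELECT * FROM {table} SAMPLE ({percent:g})"


def simulate_row_sampling(row_count: int, percent: float, seed=None) -> int:
    """Count the rows kept when each row is chosen independently.

    Assumption: every row is kept with probability percent / 100, so the
    result is only approximately percent of row_count.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(row_count) if rng.random() < percent / 100)


if __name__ == "__main__":
    print(build_sample_query("Customers", 35))
    # The count varies from run to run and is close to 35 only on average.
    print(simulate_row_sampling(100, 35))
```

Running the simulation several times for 100 rows and 35 percent typically gives counts a few rows above or below 35, which mirrors the small difference described above.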

If the data source does not support the random sampling option, then the Data Integration Service runs a profiling custom transformation after it runs the source transformation. The profiling custom transformation passes the random sample rows downstream for column profile or data domain discovery profile computation.

You can choose one of the following types of random sampling options in the Analyst tool or the Developer tool:

Random sample (auto)
The random sample algorithm computes the percentage of sample rows based on the total row count in the data source. If the total row count is less than 1000, then the profile runs on 100% of the rows. The following table shows the random sample algorithm computation based on the number of rows in the data object:

Data Source Row Count    Computed Percentage of Rows for Random Sampling
<1K                      100%
1K to 10K                90%, 80%, 70%...10%
10K to 100K              10%
100K to 1M               10%, 9%, 8%...1%
>1M                      1%

Random sample
You can configure the number of random rows when you choose the Random sample option. The random sampling algorithm converts the absolute number of rows to a percentage based on the source row count.

Using Random Sampling Option in Informatica Analyst

You can choose the random sampling option in the Analyst tool when you create or edit a column profile.

1. In the Analyst tool, click New > Profile. The profile wizard appears.
2. Choose Single source to create a column profile. Click Next.
3. In the Specify General Properties screen, enter a name for the profile, and choose a location to save the profile. Click Next.
4. In the Select Source screen, choose a data object. Click Next.
5. In the Specify Settings screen, choose the sampling option as Random sample or Random sample (auto) based on your requirements. The following image shows the sampling option in the Specify Settings screen in the Analyst tool:

6. Choose a drilldown option and the run-time environment for the column profile. Click Next.
7. In the Specify Rules and Filters screen, you can choose to add rules or filters.
8. Click Save and Run to run the profile, or click Save and Finish to save the profile.

Using Random Sampling in Informatica Developer

You can choose to run the column profile on a random sample of data in the data source in the Developer tool.

1. In the Developer tool, click File > New > Profile. The profile wizard appears.
2. In the profile wizard, choose Profile to create a column profile. Click Next.
3. In the Configure general properties screen, enter a name for the profile, and click Add to choose a data object. Select the Run Profile on Finish option to run the profile after you create the profile. Click Next.
4. Click Sampling Options in the Column Profiling and Domain Discovery section. The sampling options for the column profile appear.
5. Choose the Random Sample of or Random Sample (Auto) option. If you choose the Random Sample of option, then choose the number of random rows to run the profile on. The following image shows the sampling options for a column profile in the Developer tool:

6. Choose a drilldown option and the run-time environment to run the profile. Click Finish. The profile runs on a random sample of data in the data source.

Author

Lavanya S
Senior Technical Writer

Acknowledgements

The author would like to thank Manasjyoti Sharma for his contributions to this article.
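
As an illustration of the Random sample (auto) table and the Random sample conversion described in the Random Sampling Computation section, the following Python sketch maps a source row count to a sampling percentage and converts an absolute number of rows to a percentage. The function names are hypothetical. The article gives only the bands, so the exact step boundaries inside the 1K to 10K and 100K to 1M bands are assumptions (one step per 1,000 rows and one step per 100,000 rows), and the 100 percent cap in the conversion is also an assumption.

```python
def auto_sample_percent(row_count: int) -> int:
    """Approximate the Random sample (auto) table from this article.

    The bands come from the table. The step boundaries inside the
    1K to 10K and 100K to 1M bands are assumptions for illustration.
    """
    if row_count < 0:
        raise ValueError("row_count must not be negative")
    if row_count < 1_000:
        return 100
    if row_count < 10_000:
        # Assumed steps: 1K-2K -> 90%, 2K-3K -> 80%, ..., 9K-10K -> 10%.
        return 100 - 10 * (row_count // 1_000)
    if row_count < 100_000:
        return 10
    if row_count <= 1_000_000:
        # Assumed steps: 100K-200K -> 10%, 200K-300K -> 9%, ..., 1M -> 1%.
        return 11 - row_count // 100_000
    return 1


def rows_to_percent(sample_rows: int, source_row_count: int) -> float:
    """Convert an absolute number of sample rows to a percentage.

    Assumption: a request for more rows than the source holds is capped
    at 100 percent.
    """
    if source_row_count <= 0:
        raise ValueError("source_row_count must be positive")
    return min(100.0, 100.0 * sample_rows / source_row_count)
```

Under these assumptions, a source with 5,500 rows maps to 50 percent in auto mode, and a request for 35 random rows from a 100-row source converts to 35 percent.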