Sizing Guidelines and Performance Tuning for Intelligent Streaming

Copyright Informatica LLC 2017. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html. Other company and product names may be trade names or trademarks of their respective owners.

Abstract

You can tune Intelligent Streaming for better performance. This article provides recommendations that you can use to tune the hardware, the Spark configuration, the mapping configuration, and the Kafka cluster.

Supported Versions

Informatica Intelligent Streaming 10.2

Table of Contents

Overview
Determine Your Hardware
Intelligent Streaming Deployment Types
Deployment Criteria
Deployment Type Comparison
Tune the Spark Engine
Tune Spark Parameters
Tune the Kafka Cluster
Tune the Mapping
Sizing Recommendations for Overhead Properties
Recommendations for Tuning the Kafka Cluster
Documentation Reference

Overview

Use Informatica Intelligent Streaming mappings to collect streaming data, build the business logic for the data, and push the logic to a Spark engine for processing. The Spark engine uses Spark Streaming to process the data: it reads the data, divides it into micro batches, and publishes the results. Streaming mappings run continuously. When you create and run a streaming mapping, a Spark application is created on the Hadoop cluster and runs until it is killed or canceled through the Data Integration Service.

To optimize the performance of Intelligent Streaming and your system, perform the following tasks:

- Determine your hardware requirements.
- Tune the Spark engine.
- Tune the mapping.
- Tune the Kafka cluster.

Determine Your Hardware

To optimize performance, acquire the right type of hardware for your Intelligent Streaming environment. Consider the following points when determining your hardware:

- The criterion for determining a deployment type is the number of messages processed, not the size of the data in gigabytes.
- Choose hardware such that batches do not get queued. Batches are queued when the batch processing time is greater than the batch interval.

Use the following approach to determine hardware capacity (a worked example follows this list):

- Implement a proof of concept (POC) to determine the throughput for each core.
- Determine the peak load in terms of the anticipated number of messages processed per second.
- Divide the peak load by the per-core throughput to get the total number of cores required, and size the disk, memory, and network accordingly.
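For example, here is a hypothetical calculation; the throughput and load figures are illustrative, not measured benchmarks. If the POC shows that one core sustains 10,000 messages per second and the anticipated peak load is 500,000 messages per second:

    required cores = peak load / throughput per core
                   = 500,000 / 10,000
                   = 50 cores

You would then size the disk, memory, and network to support those 50 cores.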

Intelligent Streaming Deployment Types

Sizing and tuning recommendations vary based on the deployment type. Based on deployment factors in the domain and Hadoop environments, Informatica categorizes Intelligent Streaming deployments into the following types:

- Sandbox deployment
- Small deployment
- Medium deployment
- Large deployment

Deployment Criteria

The following criteria determine the Intelligent Streaming deployment type:

Number of messages processed
The total number of messages per second processed from a well-tuned Kafka cluster and written back to a well-tuned Kafka cluster.

Total cores
The total number of cores for each container.

Total memory
The maximum RAM available for each container.

Deployment Type Comparison

The following table compares the Intelligent Streaming deployment types based on the standard values for each deployment factor:

Deployment   Messages Per Second   CPU (cores)               Memory
Sandbox      100,000               Domain: 4, Hadoop: 24     Hadoop: 40 GB
Small        500,000               Domain: 4, Hadoop: 64     Hadoop: 120 GB
Medium       1 million             Domain: 4, Hadoop: 116    Hadoop: 224 GB
Large        10 million            Domain: 4, Hadoop: 1056   Hadoop: 2104 GB

Tune the Spark Engine

When you develop mappings in the Developer tool to run on the Spark engine, consider the following tuning recommendations and performance best practices:

Tune Spark parameters
To optimize Intelligent Streaming performance, tune the Spark parameters in the hadoopEnv.properties file. To tune Spark parameters for specific mappings, configure the execution parameters in the run-time properties of the streaming mapping in the Developer tool. If you tune the parameters in the hadoopEnv.properties file, the configuration applies to all mappings that you create.

Tune the Kafka cluster
To optimize the performance of the Kafka cluster, configure the number of nodes and the number of brokers per node in the Kafka cluster.

Tune Spark Parameters

Tune the Spark parameters in the hadoopEnv.properties file, which is located in the following directory:

    <Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>/infaConf

If you tune the parameters in the hadoopEnv.properties file, the configuration applies to all mappings that you create. You can configure the following parameters based on the input data rate, the mapping complexity, and the concurrency of mappings:

spark.executor.cores
The number of cores to use on each executor.
Recommended value: Specify 3 to 4 cores for each executor. Specifying a higher number of cores might degrade performance.

spark.executor.memory
The amount of memory to use for each executor process.
Recommended value: 8 GB.

spark.driver.memory
The amount of memory to use for the driver process.
Recommended value: 8 GB.

spark.driver.cores
The number of cores to use for the driver process.
Recommended value: 8 cores.

spark.executor.instances
The total number of executors to start. This number depends on the number of machines in the cluster, the memory allocated, and the number of cores per machine. Configure the number of executor instances based on the deployment type:

- Sandbox deployment: 4
- Small deployment: 14
- Medium deployment: 27
- Large deployment: 262

spark.sql.shuffle.partitions
The total number of partitions used for an SQL shuffle operation.
Recommended value: Specify a value equal to the total number of executor cores if the total number of executor cores allocated is less than 200. The maximum value is 200. Configure the partitions based on the deployment type:

- Sandbox deployment: 16
- Small deployment: 56
- Medium deployment: 108
- Large deployment: 200

spark.kryo.registrationRequired
Indicates whether registration with Kryo is required.
Recommended value: true

spark.kryo.classesToRegister
The comma-separated list of custom class names to register with Kryo if you use Kryo serialization. Specify the following value for all deployment types:

    org.apache.spark.sql.catalyst.expressions.GenericRow,
    [Ljava.lang.Object;,
    org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema,
    org.apache.spark.sql.types.StructType,
    [Lorg.apache.spark.sql.types.StructField;,
    org.apache.spark.sql.types.StructField,
    org.apache.spark.sql.types.StringType$,
    org.apache.spark.sql.types.Metadata,
    scala.collection.immutable.Map$EmptyMap$,
    [Lorg.apache.spark.sql.catalyst.InternalRow;,
    scala.reflect.ClassTag$$anon$1,
    java.lang.Class
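As an illustration, the recommendations above for a small deployment could be written as the following entries. This is a sketch that assumes standard key=value properties syntax and the Spark property names listed in this article; verify the exact entry names and format against the hadoopEnv.properties file shipped with your distribution.

    # Spark tuning for a small Intelligent Streaming deployment (illustrative sketch)
    spark.executor.cores=3
    spark.executor.memory=8G
    spark.driver.memory=8G
    spark.driver.cores=8
    spark.executor.instances=14
    spark.sql.shuffle.partitions=56
    spark.kryo.registrationRequired=true
    # spark.kryo.classesToRegister takes the full comma-separated class list shown above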

Tune the Kafka Cluster

The following table lists the Kafka cluster properties that you can tune based on the deployment type:

Property                             Sandbox   Small   Medium   Large
Number of Kafka brokers per node     1         2       4        8
Number of Kafka nodes                1         2       4        8

Note: Kafka brokers can be consumers or producers.

Tune the Mapping

To tune the mapping, use the sizing recommendations based on mapping complexity. Mapping complexity is defined by the number of sources, targets, and transformations in the mapping. Transformations can be CPU bound (for example, the Expression transformation), memory bound (for example, the Lookup transformation), or disk bound. Mappings can be grouped into the following categories of complexity:

Complexity   Sources   Transformations                     Targets
Simple       3         6 (3 memory bound, 3 CPU bound)     1
Standard     7         10 (8 memory bound, 2 CPU bound)    1
Complex      6         13 (8 memory bound, 3 CPU bound)    1

Sizing Recommendations for Overhead Properties

The following table lists the overhead requirements based on mapping complexity:

Mapping Complexity   vCPU Overhead   Memory Overhead
Simple               33%             33%
Standard             120%            120%
Complex              150%            150%

For example, if a simple mapping requires 12 cores and 24 GB of memory at the anticipated load, provision approximately 16 cores and 32 GB of memory to accommodate the 33% overhead.

Recommendations for Tuning the Kafka Cluster

Consider the following recommendations to tune the Kafka cluster:

- Configure the Kafka cluster so that Intelligent Streaming can produce and consume messages at the required ingestion rate. To increase the rate of message consumption in Intelligent Streaming, increase the number of Kafka brokers in the Kafka cluster and in the Kafka connection.
- Increase the number of partitions on the Kafka topic. Ideally, the number of partitions equals the number of CPU cores allocated to the executors. For example, if you set spark.executor.instances to 6 and spark.executor.cores to 3, 18 cores are allocated. Set the number of Kafka partitions to 18 so that 18 parallel tasks read from the Kafka source. For example, you can use the following command to specify the number of partitions:

    ./kafka-topics.sh --create --zookeeper zookeeper_host_name1:zookeeper_port_number,zookeeper_host_name2:zookeeper_port_number,zookeeper_host_name3:zookeeper_port_number --replication-factor 1 --partitions 18 --topic NewOSConfigSrc

- Ensure that the Kafka producer publishes messages to every partition in a load-balanced manner.
- Reduce the number of network hops between Intelligent Streaming and the Kafka cluster. Ideally, the Kafka broker runs on the same machine as the data node, or the Kafka cluster runs on its own machines over a zero-latency network.
- Configure the batch.size and linger.ms properties to increase throughput, as shown in the sketch after this list. For each partition, the producer maintains a buffer of unsent records; the batch.size property specifies the size of that buffer. To accumulate as many messages as possible in the buffer, configure a high value for the batch.size property. By default, the producer sends messages immediately. To increase the time that the producer waits before sending a batch of messages, set the linger.ms property to 5 milliseconds.
- Kafka scalability depends on disk and network performance. The test setup included 12 disks per node on a 10 Gbps network with an open file limit of 65000.
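For example, a minimal Kafka producer configuration that applies these two settings; the batch.size value is illustrative (in bytes, per partition buffer), while linger.ms follows the 5 millisecond recommendation above:

    # Producer batching settings to increase throughput
    batch.size=65536
    linger.ms=5

Increasing linger.ms trades a few milliseconds of latency for larger, more efficient batches; tune both values against your latency budget.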

Documentation Reference

See the following performance-related How-To Library articles for Informatica big data products:

- Tuning the Hardware and Hadoop Clusters for Informatica Big Data Products. Provides tuning recommendations for the hardware and the Hadoop cluster for better performance of Informatica big data products.
- Performance Tuning and Sizing Guidelines for Big Data Management 10.2. Provides sizing recommendations for the Hadoop cluster and the Informatica domain, tuning recommendations for various Big Data Management components, best practices for designing efficient mappings, troubleshooting tips, and case studies.

Authors

Vidya Vasudevan
Lead Technical Writer

Shahbaz Hussain
Principal Performance Engineer