Luncheon Webinar Series December 18th, 2015


1 Luncheon Webinar Series December 18th, 2015 How to get started with DataStage (aka IBM InfoSphere Information Server) running natively on Hadoop presented by Beate Porst Sponsored By:

2 How to get started with DataStage (aka IBM InfoSphere Information Server) running natively on Hadoop Questions and suggestions regarding presentation topics? - send to editor@dsxchange.com Downloading the presentation Replay will be available within one day with details Pricing and configuration - send to editor@dsxchange.net Subject line: Pricing For those that stay through the entire presentation, we have an extra giveaway! Bonus Offer Free premium membership for your DataStage Management! Submit your management's address and we will offer them access on your behalf. Info@dsxchange.net subject line Managers special. Join us all at LinkedIn

3 How to get started with DataStage v11.5 running natively on Hadoop December 2015 Beate Porst Product Manager IBM InfoSphere Information Server IBM Corporation

4 Agenda Quick Introduction into InfoSphere Information Server v11.5 Architecture and System topologies for Information Server on Hadoop Installation & Setup Performance Observations Q&A

5 Information Empowerment for your Data Ecosystem... powered by Information Server Integrating and transforming data and content to deliver accurate, consistent, timely and complete information on a single platform unified by a common metadata layer Information Governance Catalog Understand & Collaborate Catalog technical metadata & align w/ business language Manage (big) data lineage New compliance reporting Data Quality Cleanse & Monitor Analyze & validate w/ enhanced classification Cleanse & standardize Define, manage & monitor data rules + exceptions Data Integration Transform & Deliver Massive scalability Power for any complexity Deliver in batch and/or realtime with change capture common connectivity shared metadata security (new data privacy functions included) common execution engine with flexible deployments (new native MPP runtime on Hadoop)

6 Information Server Release History New GA: 9/25: EOS: 9/2016 EOS: 4/2017

7 Information Server Recent Activity FP Business Driven Governance - Policy and rules support for information governance - Web-based blueprints - Integrated metadata mgmt enhancements Sustainable Quality - Data Quality Console - Standardization Rules Designer - Data Rules Advancements Agile integration - InfoSphere Data Click - Enhanced Workload Mgmt - ODM Integration - Hadoop Balanced Optimization - HDFS Extensions Business Driven Governance - IDA Additional Workflow Roles - Data Rules Metadata - Bulk metadata import Sustainable Quality - Profiling Big Data - Exception Stage - New QS standardization rulesets Agile Integration - Big Data Features * JSON support * JDBC connector - DB2 on z/os load optimization - Data Click new data sources/targets Business Driven Governance - Info Governance Catalog - Shop for Data - Smart Hover - Collect & Share - Lineage@Scale Sustainable Quality - Governance Dashboard integration - Performance Optimizations - Productivity Enhancements - Global Geocoding Agile Integration - Self-service Data integration - Cloud Connectors - MDM Integration - Sort compress - Hadoop currency - Greenplum Connector Business Driven Governance - Subscription Manager - Stewardship Center (w/bpm) - Term Custom Attributes - Customizable attribute display - Lineage Admin Console - Prebuilt Governance Content - IGC Data Classification Sustainable Quality - Data Quality Exception Management Updates - Exception SQL Views - Stewardship Center Data Remediation Workflow - Data Classification - Global Geocoding support Agile Integration - Cognos TM1 Connector and Metadata Import - HDFS Secure Connector - IDAA pushdown support - Hypervisor support for v BigInsights v4 support

8 Summary Information Server v FP1 FP2 Platform Extensions - Native execution on Hadoop - In-place upgrade v v11.5 Business Driven Governance - Governance Catalog Extensible Framework - Column-level lineage for Hadoop files - Multi-language support - XML Schema Definition support - Data class definitions - Asset interchange for extended lineage content Sustainable Quality - Enhanced Data Classification - Address Verification and Enrichment Advancements Agile Integration - Data Integration running natively on Hadoop - Automatic HDFS metadata import - Comprehensive and fast HDFS Connectivity - Out of the Box Database Pushdown - Out of the Box ERP Pack support - Embedded sensitive data protection

9 V11.5 Detailed Capability Comparison InfoSphere Information Governance Catalog InfoSphere Information Server For Data Integration InfoSphere Information Server For Data Quality InfoSphere Information Server Enterprise Edition BigInsights BigIntegrate BigInsights BigQuality Business Glossary Metadata Management and Lineage Logical and Physical Data Modeling Data Cleansing and Enrichment Data Quality Validation & Monitoring Data Stewardship SOA Deployment Data Specification Mapping Extract, transform, load (ETL) Change Data Delivery 2 2 Self Serve Data Access Data Masking View reports in Cognos IBM BigInsights included (see notes) 4 4 Runs natively in Hadoop 1 Limited to 250 assets (any combination of glossary terms, categories, information governance policies and information governance rules) 2 One database Source or Capture Agent excluding z/os and must be used with DataStage as target 3 View only access for any pre-defined report provided for Information Server 4 Maximum of 5-node cluster of IBM BigInsights Data Scientist v4.1 install in support of Information Server 5 Requires additional entitlement for Optim ODPP Separate add-on purchases: data replication, ERP connectors (SAP, SAS), Postal address verification / geo-coding New offering

10 Key Use Cases for Data Integration on Hadoop Data Reservoir & Logical Warehouse Warehouse Offloading Modernize warehouse architecture through the Data Reservoir improving efficiency (TCO) and extending analytics warehouse Integrate Transform Cleanse Govern HDFS Improve efficiency of existing warehouse investments by offloading dark data or augmenting it with sandboxes warehouse Integrate Transform Cleanse Govern HDFS Enhanced 360º view Enhance insight of key business entities (e.g. customer) by integrating and correlating new data sources and building an integrated view MDM Integrate Transform Cleanse Govern HDFS Exploratory Analysis Discover & explore new insights more rapidly and in a more agile & iterative manner Integrate Transform Cleanse Govern HDFS

11 Information Server BigIntegrate Ingest, transform, process and deliver any data into & within Hadoop Satisfy the most complex transformation requirements with the most scalable runtime available in batch or real-time Connect Connect to wide range of traditional enterprise data sources as well as Hadoop data sources Native connectors with highest level of performance and scalability for key data sources Design & Transform Transform and aggregate any data volume Benefit from hundreds of built-in transformation functions Leverage metadata-driven productivity and enable collaboration Manage & Monitor Use a simple, web-based dashboard to manage your runtime environment

12 Information Server BigQuality Analyze, cleanse and monitor your big data Most comprehensive data quality capabilities that run natively on Hadoop Analyze Discovers data of interest to the org based on business defined data classes Analyzes data structure, content and quality Automates your data analysis process Cleanse Investigate, standardize, match and survive data at scale and with the full power of common data integration processes Monitor Assess and monitor the quality of your data in any place and across systems Align quality indicators to business policies Engage data steward team when issues exceed thresholds of the business

13

14 Information Server on Hadoop Offering The most scalable Transformation and Data Integration and Quality engine now runs natively on Hadoop Runs 10x-20x faster than MapReduce Get enterprise-class transformation and cleansing for your Hadoop data Use the power of your Hadoop cluster to integrate, transform & cleanse data without writing a single line of code Hadoop distribution currency: BigInsights 4.0 & 4.1 HortonWorks 2.2 & 2.3 Cloudera 5.3 &

15 Native Hadoop Runtime Optimize your Integration/Transformation and Data Quality workload based on data locality and resources availability Design your integration, data preparation or cleansing once and run it on your Hadoop Cluster, on your traditional engine or optimize to run on your database

16 Information Server on Hadoop Features Full support for Information Analyzer, QualityStage, DataStage and DataClick jobs Support for Kerberos enabled cluster Full Edge/Client node support for Engine Tier install Automatic binary distribution (if not detected) to data nodes or NFS mount Data locality support for HDFS file reads (e.g. BDFS, DataSet etc.) Container size estimation Visibility in DS Job log (Hadoop tracking URL) & YARN Job browser Support for Hadoop Node Labels Support for YARN scheduler queues Support for ODP distributions (BigInsights, HortonWorks, Pivotal etc.) and Cloudera

17 RUNTIME ARCHITECTURE & DEPLOYMENT OPTIONS

18 System Topology IS Engine Tier Installed on Hadoop Edge Node All other IS Tiers can be on the Edge Node or outside the cluster Information Server binaries live on all data nodes that will run DataStage jobs Information Server binaries are copied to data nodes at job run time using HDFS if binaries don't already exist IS Client Tier /opt/ibm/informationserver IS Engine Tier /opt/ibm/informationserver Hadoop Cluster IS Service Tier /opt/ibm/informationserver Hadoop Edge Node IS Metadata Repository Tier /opt/ibm/informationserver

19 Grid Deployments on and off Hadoop Stand-alone Information Server Grid Information Server Grid on Hadoop

20 Deployment Models Information Server on Hadoop: Typical Hadoop Environment 3 different deployment models for Information Server within a typical Hadoop Environment

21 One Information Server Instance Multiple Engines On and off Hadoop Requirement: needs to be v11.5 (no version mix between components) Services & Repository DS Project A PX Engine Stand-alone DS Project B PX Engine On Hadoop

22 DataStage Job Runtime Architecture on Hadoop Jobs are submitted from an IS Client (1) Conductor asks IS YARN Client for an Application Master (AM) to run the job (2) IS YARN Client manages IS AM pool, starts new ones when necessary (3) Conductor passes IS AM resource requirements and commands to start Section Leaders (4) IS AM gets containers from YARN Resource Manager (not pictured) YARN Node Managers (NM) on data nodes start YARN containers with Section Leaders (5) Section Leaders connect back to Conductor and start players (6) 5 IS Client Tier Submit Job 1 Section Leader Player 1 Player 2 Player N /opt/ibm/informationserver IS Engine Tier Hadoop Cluster Conductor /opt/ibm/informationserver IS Service Tier Section Leader Player 1 Player 2 Player N 2 /opt/ibm/informationserver Hadoop Edge Node IS YARN Client YARN Containers 4 IS Metadata Repository Tier IS Application Master 3 /opt/ibm/informationserver

23 INSTALLATION & SETUP

24 Installation Edge Node Provisioning Provisioned through Ambari (pictured), Cloudera Manager, or manually. Required clients to install are HDFS and YARN Validate by running yarn and hdfs commands Hadoop Cluster Hadoop Edge Node
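As a quick sanity check before installing the engine tier, something like the following could confirm that both clients are on the edge node's PATH (a minimal sketch; the exact client packages depend on your distribution):

```shell
# Sketch: verify the HDFS and YARN command-line clients are installed
# on the provisioned edge node.
check_hadoop_clients() {
  for cmd in hdfs yarn; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd: found"
    else
      echo "$cmd: missing"
    fi
  done
}
check_hadoop_clients
```

On a correctly provisioned node, commands such as `hdfs dfs -ls /` and `yarn node -list` should then also succeed against the cluster.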

25 Installation Information Server on Hadoop Information Server Tiers are installed in the typical fashion through the IBM Information Server install. IS Client Tier IS Engine Tier IS Service Tier Hadoop Edge Node IS Metadata Repository Tier /opt/ibm/informationserver Hadoop Cluster

26 Validate Engine Tier Install Make sure a simple job with Transform can compile and run locally Run with default config file on local node Don't run on Hadoop yet! APT_YARN_CONFIG
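A hedged sketch of the environment for this local validation run; the install root and the `default.apt` configuration file name are the usual DataStage defaults, but verify them against your install:

```shell
# Sketch: keep the validation job off the cluster by disabling YARN mode
# and pointing at the default single-node configuration file.
# Paths are illustrative; adjust to your install root.
export APT_YARN_MODE=false
export APT_CONFIG_FILE=/opt/ibm/informationserver/server/configurations/default.apt
echo "APT_YARN_MODE=$APT_YARN_MODE"
```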

27 Creating local Information Server Binary Paths IS Client Tier IS Service Tier IS Metadata Repository Tier Currently a manual step since jobs don't run as root Be careful to create with correct permissions Cluster settings affect who the owner should be Hadoop Cluster IS Engine Tier Hadoop Edge Node /opt/ibm/informationserver /opt/ibm/informationserver /opt/ibm/informationserver /opt/ibm/informationserver

28 Setting up Users on Hadoop Gather the User & Group names that will run Jobs Create HDFS permissions for those users:
sudo -u hdfs hadoop fs -mkdir /user/InfoSphere_Information_Server_user_name
sudo -u hdfs hadoop fs -chown InfoSphere_Information_Server_user_name:InfoSphere_Information_Server_user_group /user/InfoSphere_Information_Server_user_name
E.g., to create a user folder for the user dsadm, issue:
sudo -u hdfs hadoop fs -mkdir /user/dsadm
sudo -u hdfs hadoop fs -chown dsadm:dstage /user/dsadm
Additional settings might be required if not running on an Edge node
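The two commands generalize to any user/group pair; a small dry-run helper like this (hypothetical, written for this write-up) prints the commands for review before you pipe them to a shell on the edge node:

```shell
# Sketch: emit the HDFS home-directory setup commands for a given
# user and group that will run Information Server jobs.
make_hdfs_user_cmds() {
  user="$1"; group="$2"
  echo "sudo -u hdfs hadoop fs -mkdir /user/$user"
  echo "sudo -u hdfs hadoop fs -chown $user:$group /user/$user"
}
# Review the output first, then e.g.: make_hdfs_user_cmds dsadm dstage | sh
make_hdfs_user_cmds dsadm dstage
```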

29 Starting the Information Server YARN Client Can be started manually using PXEngine/etc/yarn_conf/start-pxyarn.sh Will be started automatically with first job run on Hadoop Will start 2 ApplicationMasters by default Tunable with APT_YARN_AM_POOL_SIZE Troubleshoot with PXEngine/logs/yarn_logs/yarn_client_out.0 IS Client Tier IS Engine Tier Hadoop Cluster IS Service Tier /opt/ibm/informationserver Hadoop Edge Node IS YARN Client IS Metadata Repository Tier /opt/ibm/informationserver /opt/ibm/informationserver /opt/ibm/informationserver IS Application Master IS Application Master

30 Create Static Configuration File with All Cluster Nodes This will localize binaries on all nodes with first job run
node "conductor_node" {
  fastname "myconductor.mycompany.com"
  pools "conductor" "export"
  resource disk "/data" {pool "" "export" "conductor_node"}
  resource scratchdisk "/scratch" {}
}
node "node0" {
  fastname "compute1.mycompany.com"
  pools ""
  resource disk "/data" {pool "" "export" "node0"}
  resource scratchdisk "/scratch" {}
}
node "node1" {
  fastname "compute2.mycompany.com"
  pools ""
  resource disk "/data" {pool "" "export" "node1"}
  resource scratchdisk "/scratch" {}
}

31 Validate Running on Hadoop Make sure a simple job with Transform can run on Hadoop Run with static config file on all nodes
APT_YARN_CONFIG=/opt/ibm/informationserver/server/pxengine/etc/yarn_conf/yarnconfig.cfg
In yarnconfig.cfg: APT_YARN_MODE=true

32 How Binary Localization Works Cached in HDFS by IS YARN Client on startup Localized by jobs from HDFS cache if they don't exist at job run time Requires ~4GB of space in /tmp Tunable with APT_YARN_BINARY_COPY_MODE IS Client Tier IS Engine Tier Hadoop Cluster IS Service Tier /opt/ibm/informationserver Hadoop Edge Node IS YARN Client IS Metadata Repository Tier /opt/ibm/informationserver /opt/ibm/informationserver /opt/ibm/informationserver IS Application Master IS Application Master

33 Dynamic Configuration Files Dynamic configuration files take advantage of resource management and HDFS for DataSets Predefined dynamic config file: /opt/ibm/informationserver/server/dynamic_config
node "conductor_node" {
  fastname "myconductor.mycompany.com"
  pools "conductor" "export"
  resource disk "/data" {pool "" "export" "conductor_node"}
  resource scratchdisk "/scratch" {}
}
node "node0" {
  fastname "$host"
  pools ""
  resource disk "/data" {pool "" "export" "node0"}
  resource scratchdisk "/scratch" {}
}
node "node1" {
  fastname "$host"
  pools ""
  resource disk "/data" {pool "" "export" "node1"}
  resource scratchdisk "/scratch" {}
}
HDFS Local Disk

34 The Information Server Yarn Config File yarnconfig.cfg Located in: /opt/ibm/informationserver/server/pxengine/etc/yarn_conf/yarnconfig.cfg
APT_YARN_MODE=true - If defined and set to 1 or true, runs the given PX job on the local Hadoop install in YARN mode.
APT_YARN_CONTAINER_SIZE=64 - Defines the size in MB of the containers that will be requested to run PX Section Leader and Player processes. The default is 64 MB if not set.
APT_YARN_CONTAINER_VCORES=0 - Defines the number of virtual cores that the containers will request to run PX Section Leader and Player processes. The default is 0, which means "Don't set it".
APT_YARN_AM_CONTAINER_SIZE=256 - Defines the size in MB of the container that will be requested to run the PX Application Master process. The default is 256 MB if not set.
APT_YARN_AM_POOL_SIZE=2 - The number of pre-started Application Masters; the default is 2.
APT_YARN_NODE_LABEL_EXPR= - Defines the node label that Information Server jobs should use when being submitted to the YARN scheduler.
APT_YARN_SCHEDULER_QUEUE= - Defines the default queue that Information Server jobs should use when being submitted to the YARN scheduler. The default is empty, which will use the default scheduler queue.
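Collected into one place, a minimal yarnconfig.cfg using the defaults above might look like this on disk (values taken straight from this slide; tune per cluster):

```
APT_YARN_MODE=true
APT_YARN_CONTAINER_SIZE=64
APT_YARN_CONTAINER_VCORES=0
APT_YARN_AM_CONTAINER_SIZE=256
APT_YARN_AM_POOL_SIZE=2
APT_YARN_NODE_LABEL_EXPR=
APT_YARN_SCHEDULER_QUEUE=
```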

35 DataStage Job Run time logs YARN Client Connection Hadoop tracking URL Application Master Connection YARN Container Allocation Job Processes Running

36 DataStage Job Runtime Hadoop Console DataStage Application Master Information Application Run Time Container Allocated Resources

37 Using Hadoop Node Labels Separate application workloads Supported by Apache Hadoop 2.6, HDP 2.2, CDH 5.4, IOP 4.0 IIS node label can be controlled by Hadoop scheduler queue or passed with jobs Unlabelled nodes available to any application dependent on queue configuration Not supported for Fair Scheduler yet (YARN-2497) Apache Hadoop 2.8 allows borrowing nodes to increase cluster utilization IS Client Tier IISNode /opt/ibm/informationserver GPUNode IS Engine Tier /opt/ibm/informationserver Hadoop Cluster IS Service Tier IISNode /opt/ibm/informationserver GPUNode Hadoop Edge Node IS Metadata Repository Tier
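To pin Information Server work onto labelled nodes, the two yarnconfig.cfg variables from the earlier slide would be set along these lines (the label matches the IISNode label shown in the diagram; the queue name is illustrative):

```
APT_YARN_NODE_LABEL_EXPR=IISNode
APT_YARN_SCHEDULER_QUEUE=iisqueue
```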

38 HDFS Data Replication IIS Job writes two partition data files P1 and P2 One block will always reside local to the writing node Other blocks replicated based on HDFS rack awareness algorithm Number of replicas depends on HDFS configuration, Default=3 IIS Job that reads P1 and P2 requests to run local to the blocks Job will read block from another node if locality isn't possible IS Client Tier IISNode /opt/ibm/informationserver GPUNode IS Engine Tier /opt/ibm/informationserver P1 Hadoop Cluster IS Service Tier IISNode /opt/ibm/informationserver GPUNode Hadoop Edge Node P2 IS Metadata Repository Tier 1 2

39 HADOOP / YARN Environment Settings
yarn.log-aggregation-enable: Manages YARN log files. Set this parameter to false if you want the log files stored in the local file system. Default: true.
yarn.nodemanager.log.retain-seconds: Specifies the duration in seconds that Hadoop retains container logs.
yarn.nodemanager.pmem-check-enabled: Determines if physical memory limits exist for containers. If set to true, the job is stopped if a container uses more than the physical memory limit that you specify. Set this parameter to false if you do not want jobs to fail when containers consume more memory than they are allocated. Default: true.
yarn.nodemanager.resource.memory-mb: Sets the amount of physical memory that can be allocated for containers. Default: 8192 MB.
yarn.nodemanager.vmem-check-enabled: Determines if virtual memory limits exist for containers. If set to true, the job is stopped if a container uses more than the virtual limit that you specify. Set this parameter to false if you do not want jobs to fail when containers consume more memory than they are allocated. Default: true. Recommended: false.
yarn.nodemanager.vmem-pmem-ratio: Sets the ratio of virtual memory to physical memory limits for containers. If yarn.nodemanager.vmem-check-enabled is set to true, jobs might be stopped by YARN if the ratio of the virtual memory that a container consumes compared to the physical memory is greater than the ratio that you specify. Default: 2.1.
yarn.resourcemanager.nodemanagers.heartbeat-interval-ms: Controls the start time for parallel jobs. For clusters that have fewer than 50 nodes, 1000 ms is often too long and leads to a longer start time for parallel jobs. You can set this value to 50 milliseconds to ensure parallel jobs start in a timely manner. Default: 1000 ms. Recommended: 50 ms.
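Applied through Ambari or directly in yarn-site.xml, the recommended overrides from this slide would look roughly like this (a sketch; the property names are standard YARN, the values come from the table):

```xml
<!-- yarn-site.xml overrides suggested above -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
  <value>50</value>
</property>
```

Restart the affected YARN daemons after changing these values.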

40 HADOOP / YARN Environment Settings (continued)
yarn.scheduler.capacity.maximum-am-resource-percent: Specifies the maximum percentage of resources for all queues in the cluster that can be used to run application masters, and controls the number of concurrent active applications. Defaults vary between distributions of Hadoop.
yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent: Specifies the maximum percentage of resources for a single queue in the cluster that can be used to run application masters, and controls the number of concurrent active applications. Defaults vary between distributions of Hadoop.
yarn.scheduler.increment-allocation-mb: Indicates how much the container size can be incremented. Default: 512 MB on Cloudera.
yarn.scheduler.minimum-allocation-mb: If you submit tasks with resource requests lower than the minimum-allocation value, the requests are set to the minimum-allocation value. This parameter helps conserve resources on the cluster by setting the minimum amount of memory that can be requested for a container. The default container size for parallel processes is 64 MB. Default: 1024 MB for most Hadoop distributions. Recommended: 256 MB or less.
Note: If changing the yarn.scheduler.minimum-allocation-mb value with Ambari 2.1, you must specify whether the changes should be applied to the MapReduce-specific resource settings. If you are significantly reducing the value of yarn.scheduler.minimum-allocation-mb, do not change the MapReduce values based on the new value, because it could cause MapReduce jobs to fail.
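The minimum-allocation change is a single yarn-site.xml property; a sketch of the override (value from the recommendation above, since 64 MB parallel-process containers would otherwise be rounded up to the 1024 MB default minimum):

```xml
<!-- yarn-site.xml: lower the minimum container allocation so small
     PX player containers are not rounded up to 1024 MB -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>
```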

41 PERFORMANCE OBSERVATIONS

42 Performance Observations Running Information Server jobs natively on Hadoop / YARN Running Information Server jobs natively under YARN scales out linearly! Throughput doubles if the number of Hadoop data nodes doubles YARN introduces some overhead for job startup time Job startup time is slightly slower than a non-YARN startup Storing data on HDFS is up to 13% slower than native OS storage Observations when running a realistic DataStage workload on a YARN-managed Hadoop cluster: Using static configuration files, performance running on/off Hadoop would be similar (for similar resources) This is mostly because it doesn't need to store DataStage-specific files on HDFS, as jobs will run on statically defined nodes Using dynamic configuration files: We observed a performance penalty on Hadoop of up to 13% due to the HDFS usage Storing data on HDFS is significantly slower than native OS storage due to things such as the replication factor

43 Test System Topology BigInsights Cluster DB2 Server Master Node Data Node 1... Data Node N Data Warehouse For the TPC-DI Workload Information Server Services, Repository Engine Number of Systems: 11 The specs for each box are identical (IBM xSeries High Volume Racks x3630 M4) CPU: 32 cores (4 Sandy Bridge EP, each with 8 cores) Memory: 64 GB Disk: 14 x 1TB Network: interconnected with 10GbE

44 Scale Out Test DataStage throughput doubled when doubling the number of Hadoop data nodes.

45 TPC-DI Workload Performance in Different Modes

46 Q&A

47 Where to get more Information? Product Documentation: IBM Information Server Knowledge Center: 01.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.ishadoop.nav.doc/containers/cont_iisinfsrv_hadoop.html?lang=en Remember: BigIntegrate / BigQuality are only offerings; the actual product is Information Server Tutorial on How to setup Information Server on Hadoop on a Cloudera CDH Contact: Beate Porst (porst@us.ibm.com) -- Product Manager Data Integration

48 Q&A What are IBM BigInsights BigIntegrate & IBM BigInsights BigQuality? These are offerings (specific bundles/licenses/prices) for your Hadoop Data Integration & Data Quality needs. These offerings are powered by InfoSphere Information Server now running natively on Hadoop / YARN. Which Hadoop Distributions are supported? ODP distributions (e.g. IBM BigInsights, HortonWorks, Pivotal) and Cloudera, running on Linux OS (x86). Can I connect (read/write) to data sources outside of Hadoop? Yes, you can connect to pretty much any data source accessible by Information Server (from mainframe to cloud). Where will data transformation / quality processes run? Processes will run on any/all of the Data Nodes in the Hadoop distribution on which the product is installed. The number of data nodes utilized to run a particular job depends on the partitioning level associated with a job during job start up (configuration file). Do I need to know how to write Java, HiveQL, Pig or any other programming language to create Data Integration or quality processes? No, data integration and quality processes are designed using an intuitive graphical design interface. You compose your transformation logic out of pre-built operators (think of them as LEGO bricks) that you hook together to form a final flow of data

49 Q&A Will I be able to get Data Lineage or Impact Analysis for jobs running on Hadoop? Yes, Information Server on Hadoop utilizes Information Server's shared metadata feature, which automatically captures design & operational metadata and deduces data lineage and dependency analysis no matter where the job runs. Is Information Server on Hadoop using Map/Reduce? No, jobs are processed by the Information Server Parallel Execution Engine, which is a highly scalable MPP (cluster) engine. Each data node has a copy of the PX engine libraries and therefore a job can run in parallel on multiple data nodes. Are BigIntegrate & BigQuality offerings the only option to license Information Server on Hadoop? No, any of the Information Server v11.5 offerings can be deployed on Hadoop. Is the Information Server Parallel Execution Engine (PX) faster than Spark? The IBM PX engine and Spark are both high-performance cluster computing MPP engines. Based on internal tests, we have seen many use cases, specifically when processing large volumes of data, where the IBM PX engine was more performant than Spark.

50 THANK YOU

51 How to get started with DataStage (aka IBM InfoSphere Information Server) running natively on Hadoop Questions and suggestions regarding presentation topics? - send to editor@dsxchange.com Downloading the presentation Replay will be available within one day with details Pricing and configuration - send to editor@dsxchange.net Subject line: Pricing For those that stay through the entire presentation, we have an extra giveaway! Bonus Offer Free premium membership for your DataStage Management! Submit your management's address and we will offer them access on your behalf. Info@dsxchange.net subject line Managers special. Join us all at LinkedIn


More information

SAP Agile Data Preparation Simplify the Way You Shape Data PUBLIC

SAP Agile Data Preparation Simplify the Way You Shape Data PUBLIC SAP Agile Data Preparation Simplify the Way You Shape Data Introduction SAP Agile Data Preparation Overview Video SAP Agile Data Preparation is a self-service data preparation application providing data

More information

Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP

Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP 07.29.2015 LANDING STAGING DW Let s start with something basic Is Data Lake a new concept? What is the closest we can

More information

IBM Data Replication for Big Data

IBM Data Replication for Big Data IBM Data Replication for Big Data Highlights Stream changes in realtime in Hadoop or Kafka data lakes or hubs Provide agility to data in data warehouses and data lakes Achieve minimum impact on source

More information

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Achieving Horizontal Scalability. Alain Houf Sales Engineer Achieving Horizontal Scalability Alain Houf Sales Engineer Scale Matters InterSystems IRIS Database Platform lets you: Scale up and scale out Scale users and scale data Mix and match a variety of approaches

More information

A Examcollection.Premium.Exam.47q

A Examcollection.Premium.Exam.47q A2090-303.Examcollection.Premium.Exam.47q Number: A2090-303 Passing Score: 800 Time Limit: 120 min File Version: 32.7 http://www.gratisexam.com/ Exam Code: A2090-303 Exam Name: Assessment: IBM InfoSphere

More information

HDP Security Overview

HDP Security Overview 3 HDP Security Overview Date of Publish: 2018-07-15 http://docs.hortonworks.com Contents HDP Security Overview...3 Understanding Data Lake Security... 3 What's New in This Release: Knox... 5 What's New

More information

HDP Security Overview

HDP Security Overview 3 HDP Security Overview Date of Publish: 2018-07-15 http://docs.hortonworks.com Contents HDP Security Overview...3 Understanding Data Lake Security... 3 What's New in This Release: Knox... 5 What's New

More information

Tuning Intelligent Data Lake Performance

Tuning Intelligent Data Lake Performance Tuning Intelligent Data Lake Performance 2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without

More information

IBM Software IBM InfoSphere Information Server for Data Quality

IBM Software IBM InfoSphere Information Server for Data Quality IBM InfoSphere Information Server for Data Quality A component index Table of contents 3 6 9 9 InfoSphere QualityStage 10 InfoSphere Information Analyzer 12 InfoSphere Discovery 13 14 2 Do you have confidence

More information

WHITEPAPER. MemSQL Enterprise Feature List

WHITEPAPER. MemSQL Enterprise Feature List WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure

More information

IBM InfoSphere Information Analyzer

IBM InfoSphere Information Analyzer IBM InfoSphere Information Analyzer Understand, analyze and monitor your data Highlights Develop a greater understanding of data source structure, content and quality Leverage data quality rules continuously

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

IBM InfoSphere Information Server V11.3 and InfoSphere Data Replication V11.3 support agile information integration

IBM InfoSphere Information Server V11.3 and InfoSphere Data Replication V11.3 support agile information integration IBM United States Software Announcement 214-243, dated June 24, 2014 V11.3 and InfoSphere Data Replication V11.3 support agile information integration Table of contents 1 Overview 9 Technical information

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

DriveScale-DellEMC Reference Architecture

DriveScale-DellEMC Reference Architecture DriveScale-DellEMC Reference Architecture DellEMC/DRIVESCALE Introduction DriveScale has pioneered the concept of Software Composable Infrastructure that is designed to radically change the way data center

More information

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 From Single Purpose to Multi Purpose Data Lakes Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 Agenda Data Lakes Multiple Purpose Data Lakes Customer Example Demo Takeaways

More information

Rickard Linck Client Technical Professional Core Database and Lifecycle Management Common Analytic Engine Cloud Data Servers On-Premise Data Servers

Rickard Linck Client Technical Professional Core Database and Lifecycle Management Common Analytic Engine Cloud Data Servers On-Premise Data Servers Rickard Linck Client Technical Professional Core Database and Lifecycle Management Common Analytic Engine Cloud Data Servers On-Premise Data Servers Watson Data Platform Reference Architecture Business

More information

Saving ETL Costs Through Data Virtualization Across The Enterprise

Saving ETL Costs Through Data Virtualization Across The Enterprise Saving ETL Costs Through Virtualization Across The Enterprise IBM Virtualization Manager for z/os Marcos Caurim z Analytics Technical Sales Specialist 2017 IBM Corporation What is Wrong with Status Quo?

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Apache HAWQ (incubating)

Apache HAWQ (incubating) HADOOP NATIVE SQL What is HAWQ? Apache HAWQ (incubating) Is an elastic parallel processing SQL engine that runs native in Apache Hadoop to directly access data for advanced analytics. Why HAWQ? Hadoop

More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop HAWQ: A Massively Parallel Processing SQL Engine in Hadoop Lei Chang, Zhanwei Wang, Tao Ma, Lirong Jian, Lili Ma, Alon Goldshuv Luke Lonergan, Jeffrey Cohen, Caleb Welton, Gavin Sherry, Milind Bhandarkar

More information

Passit4sure.P questions

Passit4sure.P questions Passit4sure.P2090-045.55 questions Number: P2090-045 Passing Score: 800 Time Limit: 120 min File Version: 5.2 http://www.gratisexam.com/ P2090-045 IBM InfoSphere Information Server for Data Integration

More information

What's New in SAS Data Management

What's New in SAS Data Management Paper SAS1390-2015 What's New in SAS Data Management Nancy Rausch, SAS Institute Inc., Cary, NC ABSTRACT The latest releases of SAS Data Integration Studio and DataFlux Data Management Platform provide

More information

ELASTIC DATA PLATFORM

ELASTIC DATA PLATFORM SERVICE OVERVIEW ELASTIC DATA PLATFORM A scalable and efficient approach to provisioning analytics sandboxes with a data lake ESSENTIALS Powerful: provide read-only data to anyone in the enterprise while

More information

Oracle Big Data Fundamentals Ed 1

Oracle Big Data Fundamentals Ed 1 Oracle University Contact Us: +0097143909050 Oracle Big Data Fundamentals Ed 1 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big Data

More information

Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Copyright 2011, Oracle and/or its affiliates. All rights reserved. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

Additional License Authorizations

Additional License Authorizations Additional License Authorizations For HPE Cloud Center and HPE Helion Cloud Suite software products Products and suites covered PRODUCTS E-LTU OR E-MEDIA AVAILABLE * NON-PRODUCTION USE CATEGORY ** HPE

More information

Plan, Install, and Configure IBM InfoSphere Information Server

Plan, Install, and Configure IBM InfoSphere Information Server Version 8 Release 7 Plan, Install, and Configure IBM InfoSphere Information Server on Windows in a Single Computer Topology with Bundled DB2 Database and WebSphere Application Server GC19-3614-00 Version

More information

Security and Performance advances with Oracle Big Data SQL

Security and Performance advances with Oracle Big Data SQL Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,

More information

Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You

Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You Özgür Yiğit Oracle Data Integration, Senior Manager, ECEMEA Safe Harbor Statement The following

More information

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

MAPR DATA GOVERNANCE WITHOUT COMPROMISE MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance

More information

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. reserved. Insert Information Protection Policy Classification from Slide 8

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. reserved. Insert Information Protection Policy Classification from Slide 8 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may

More information

Oracle Enterprise Data Quality - Roadmap

Oracle Enterprise Data Quality - Roadmap Oracle Enterprise Data Quality - Roadmap Mike Matthews Martin Boyd Director, Product Management Senior Director, Product Strategy Copyright 2014 Oracle and/or its affiliates. All rights reserved. Oracle

More information

Oracle Big Data Fundamentals Ed 2

Oracle Big Data Fundamentals Ed 2 Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

IBM Information Governance Catalog (IGC) Partner Application Validation Quick Guide

IBM Information Governance Catalog (IGC) Partner Application Validation Quick Guide IBM Information Governance Catalog (IGC) Partner Application Validation Quick Guide VERSION: 2.0 DATE: Feb 15, 2018 EDITOR: D. Rangarao Table of Contents 1 Overview of the Application Validation Process...

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

Oracle Data Integrator 12c: Integration and Administration

Oracle Data Integrator 12c: Integration and Administration Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 67863102 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive

More information

Datacenter Management and The Private Cloud. Troy Sharpe Core Infrastructure Specialist Microsoft Corp, Education

Datacenter Management and The Private Cloud. Troy Sharpe Core Infrastructure Specialist Microsoft Corp, Education Datacenter Management and The Private Cloud Troy Sharpe Core Infrastructure Specialist Microsoft Corp, Education System Center Helps Deliver IT as a Service Configure App Controller Orchestrator Deploy

More information

<Insert Picture Here> MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure

<Insert Picture Here> MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure Mario Beck (mario.beck@oracle.com) Principal Sales Consultant MySQL Session Agenda Requirements for

More information

Private Cloud Database Consolidation Name, Title

Private Cloud Database Consolidation Name, Title Private Cloud Database Consolidation Name, Title Agenda Cloud Introduction Business Drivers Cloud Architectures Enabling Technologies Service Level Expectations Customer Case Studies Conclusions

More information

Oracle GoldenGate for Big Data

Oracle GoldenGate for Big Data Oracle GoldenGate for Big Data The Oracle GoldenGate for Big Data 12c product streams transactional data into big data systems in real time, without impacting the performance of source systems. It streamlines

More information

Additional License Authorizations

Additional License Authorizations Additional License Authorizations For HPE Cloud Center software products Products and suites covered PRODUCTS E-LTU OR E-MEDIA AVAILABLE * NON-PRODUCTION USE CATEGORY ** HPE Cloud Service Automation (previously

More information

Optimizing Data Integration Solutions by Customizing the IBM InfoSphere Information Server Deployment Architecture IBM Redbooks Solution Guide

Optimizing Data Integration Solutions by Customizing the IBM InfoSphere Information Server Deployment Architecture IBM Redbooks Solution Guide Optimizing Data Integration Solutions by Customizing the IBM InfoSphere Information Server Deployment Architecture IBM Redbooks Solution Guide IBM InfoSphere Information Server provides a unified data

More information

Combine Native SQL Flexibility with SAP HANA Platform Performance and Tools

Combine Native SQL Flexibility with SAP HANA Platform Performance and Tools SAP Technical Brief Data Warehousing SAP HANA Data Warehousing Combine Native SQL Flexibility with SAP HANA Platform Performance and Tools A data warehouse for the modern age Data warehouses have been

More information

Informatica Developer Tips for Troubleshooting Common Issues PowerCenter 8 Standard Edition. Eugene Gonzalez Support Enablement Manager, Informatica

Informatica Developer Tips for Troubleshooting Common Issues PowerCenter 8 Standard Edition. Eugene Gonzalez Support Enablement Manager, Informatica Informatica Developer Tips for Troubleshooting Common Issues PowerCenter 8 Standard Edition Eugene Gonzalez Support Enablement Manager, Informatica 1 Agenda Troubleshooting PowerCenter issues require a

More information

SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS

SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS 1. What is SAP Vora? SAP Vora is an in-memory, distributed computing solution that helps organizations uncover actionable business insights

More information

@Pentaho #BigDataWebSeries

@Pentaho #BigDataWebSeries Enterprise Data Warehouse Optimization with Hadoop Big Data @Pentaho #BigDataWebSeries Your Hosts Today Dave Henry SVP Enterprise Solutions Davy Nys VP EMEA & APAC 2 Source/copyright: The Human Face of

More information

1. Which programming language is used in approximately 80 percent of legacy mainframe applications?

1. Which programming language is used in approximately 80 percent of legacy mainframe applications? Volume: 59 Questions 1. Which programming language is used in approximately 80 percent of legacy mainframe applications? A. Visual Basic B. C/C++ C. COBOL D. Java Answer: C 2. An enterprise customer's

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Oracle Data Integrator 12c: Integration and Administration

Oracle Data Integrator 12c: Integration and Administration Oracle University Contact Us: +34916267792 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive data integration platform

More information

Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing

Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing IBM Software Group Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing George Wang Lead Software Egnineer, DB2 for z/os IBM 2014 IBM Corporation Disclaimer and Trademarks

More information

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti Solution Overview Cisco UCS Integrated Infrastructure for Big Data with the Elastic Stack Cisco and Elastic deliver a powerful, scalable, and programmable IT operations and security analytics platform

More information

Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data

Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data THE RISE OF BIG DATA BIG DATA: A REVOLUTION IN ACCESS Large-scale data sets are nothing

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

Transformer Looping Functions for Pivoting the data :

Transformer Looping Functions for Pivoting the data : Transformer Looping Functions for Pivoting the data : Convert a single row into multiple rows using Transformer Looping Function? (Pivoting of data using parallel transformer in Datastage 8.5,8.7 and 9.1)

More information

QUESTION 1 Assume you have before and after data sets and want to identify and process all of the changes between the two data sets. Assuming data is

QUESTION 1 Assume you have before and after data sets and want to identify and process all of the changes between the two data sets. Assuming data is Vendor: IBM Exam Code: C2090-424 Exam Name: InfoSphere DataStage v11.3 Q&As: Demo https://.com QUESTION 1 Assume you have before and after data sets and want to identify and process all of the changes

More information

Oracle Enterprise Manager. 1 Before You Install. System Monitoring Plug-in for Oracle Unified Directory User's Guide Release 1.0

Oracle Enterprise Manager. 1 Before You Install. System Monitoring Plug-in for Oracle Unified Directory User's Guide Release 1.0 Oracle Enterprise Manager System Monitoring Plug-in for Oracle Unified Directory User's Guide Release 1.0 E24476-01 October 2011 The System Monitoring Plug-In for Oracle Unified Directory extends Oracle

More information

DATA INTEGRATION PLATFORM CLOUD. Experience Powerful Data Integration in the Cloud

DATA INTEGRATION PLATFORM CLOUD. Experience Powerful Data Integration in the Cloud DATA INTEGRATION PLATFORM CLOUD Experience Powerful Integration in the Want a unified, powerful, data-driven solution for all your data integration needs? Oracle Integration simplifies your data integration

More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

5 Fundamental Strategies for Building a Data-centered Data Center

5 Fundamental Strategies for Building a Data-centered Data Center 5 Fundamental Strategies for Building a Data-centered Data Center June 3, 2014 Ken Krupa, Chief Field Architect Gary Vidal, Solutions Specialist Last generation Reference Data Unstructured OLTP Warehouse

More information

MDM Partner Summit 2015 Oracle Enterprise Data Quality Overview & Roadmap

MDM Partner Summit 2015 Oracle Enterprise Data Quality Overview & Roadmap MDM Partner Summit 2015 Oracle Enterprise Data Quality Overview & Roadmap Steve Tuck Senior Director, Product Strategy Todd Blackmon Senior Director, Sales Consulting David Gengenbach Sales Consultant

More information

Smart Data Catalog DATASHEET

Smart Data Catalog DATASHEET DATASHEET Smart Data Catalog There is so much data distributed across organizations that data and business professionals don t know what data is available or valuable. When it s time to create a new report

More information

Qlik Sense Enterprise architecture and scalability

Qlik Sense Enterprise architecture and scalability White Paper Qlik Sense Enterprise architecture and scalability June, 2017 qlik.com Platform Qlik Sense is an analytics platform powered by an associative, in-memory analytics engine. Based on users selections,

More information

Introduction to Federation Server

Introduction to Federation Server Introduction to Federation Server Alex Lee IBM Information Integration Solutions Manager of Technical Presales Asia Pacific 2006 IBM Corporation WebSphere Federation Server Federation overview Tooling

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Netezza The Analytics Appliance

Netezza The Analytics Appliance Software 2011 Netezza The Analytics Appliance Michael Eden Information Management Brand Executive Central & Eastern Europe Vilnius 18 October 2011 Information Management 2011IBM Corporation Thought for

More information

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved. Apache Hadoop 3 Balazs Gaspar Sales Engineer CEE & CIS balazs@cloudera.com 1 We believe data can make what is impossible today, possible tomorrow 2 We empower people to transform complex data into clear

More information

Tuning Intelligent Data Lake Performance

Tuning Intelligent Data Lake Performance Tuning Intelligent Data Lake 10.1.1 Performance Copyright Informatica LLC 2017. Informatica, the Informatica logo, Intelligent Data Lake, Big Data Mangement, and Live Data Map are trademarks or registered

More information

Oracle BDA: Working With Mammoth - 1

Oracle BDA: Working With Mammoth - 1 Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Working With Mammoth.

More information

Was ist dran an einer spezialisierten Data Warehousing platform?

Was ist dran an einer spezialisierten Data Warehousing platform? Was ist dran an einer spezialisierten Data Warehousing platform? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Data warehousing, Exadata, specialized hardware proprietary hardware Introduction

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

ASG WHITE PAPER DATA INTELLIGENCE. ASG s Enterprise Data Intelligence Solutions: Data Lineage Diving Deeper

ASG WHITE PAPER DATA INTELLIGENCE. ASG s Enterprise Data Intelligence Solutions: Data Lineage Diving Deeper THE NEED Knowing where data came from, how it moves through systems, and how it changes, is the most critical and most difficult task in any data management project. If that process known as tracing data

More information

Data Management Glossary

Data Management Glossary Data Management Glossary A Access path: The route through a system by which data is found, accessed and retrieved Agile methodology: An approach to software development which takes incremental, iterative

More information

EMC Documentum xdb. High-performance native XML database optimized for storing and querying large volumes of XML content

EMC Documentum xdb. High-performance native XML database optimized for storing and querying large volumes of XML content DATA SHEET EMC Documentum xdb High-performance native XML database optimized for storing and querying large volumes of XML content The Big Picture Ideal for content-oriented applications like dynamic publishing

More information