Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Size: px

Start display at page:

Download "Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016"

Herbert Barnett
6 years ago
Views:

1 Khadija Souissi Auf z Systems November IBM z Systems Mainframe Event 2016

2 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation IBM Corporation 2

3 Agenda What is Spark Spark Details The ecosystem for Spark on z/os Use Cases References 2016 IBM Corporation 3

4 Big Data Maturity Most Businesses Are Here Operations Data Warehousing Line of Business and Analytics New Business Imperatives Value Lower the Cost of Storage Warehouse Modernization Data lake Data offload ETL offload Queryable archive and staging Data-informed Decision Making Full dataset analysis (no more sampling) Extract value from nonrelational data 360 o view of all enterprise data Exploratory analysis and discovery Business Transformation Create new business models Risk-aware decision making Fight fraud and counter threats Optimize operations Attract, grow, retain customers 4 Big Data Maturity 2016 IBM Corporation 4

5 What Spark is (and is not)

6 What Spark Is, What it Is Not An Apache Foundation open source project Not a product An in-memory compute engine that works with data Not a data store Enables highly sophisticated analysis on huge volumes of data at scale Unified environment for data scientists, developers and data engineers Radically simplifies the process of developing intelligent apps fueled by data 2015 IBM Corporation 2016 IBM Corporation 6

7 What Spark on z/os is Not What it is not A data cache for all data in DB2, IMS, IDAA, VSAM Just a different SQL engine or query optimizer An effective mechanism to access a single data source for analytics Why isn t it the same as a query acceleration / IBM DB2 Analytics Accelerator? Spark does not optimize SQL queries Spark is not a mechanism to store data, it rather provides interfaces to uniformly access data & to apply analytics using a unified interface DB2 Analytics Accelerator interaction with applications is via DB2; Spark interaction with applications is via Spark interfaces (Stream, MLlib, Graphx, SQL), driven through REST or Java Spark analytics can access data in DB2, DB2 Analytics Accelerator, VSAM, IMS, off platform, etc IBM Corporation 7

Java APIs Scala and Python interactive shells Runs on Hadoop, Mesos, standalone or cloud General purpose Covers a wide range of

8 Apache Spark a compute Engine Fast Leverages aggressively cached in-memory distributed computing and JVM threads Faster than MapReduce for some workloads Ease of use (for programmers) Written in Scala, an object-oriented, functional programming language Scala, Python and Java APIs Scala and Python interactive shells Runs on Hadoop, Mesos, standalone or cloud General purpose Covers a wide range of workloads Provides SQL, streaming and complex analytics Logistic regression in Hadoop and Spark from IBM Corporation 8

9 Spark History Over 750 contributors from over 250 companies including IBM, Google, Amazon, SAP, Databricks 2016 IBM Corporation 9

10 Why does Spark matter to the business 2016 IBM Corporation 10

11 Who uses Spark Build models quickly. Iterate faster. Apply intelligence everywhere IBM Corporation 11

12 IBM s Involvement and Commitment IBM is one of the four founding members of the UC Berkeley AMPLab Work closely with AMPLab research on projects of mutual interest June 2015, IBM announced: 3,500 researchers and developers to work on Spark-related projects at IBM labs worldwide Donate IBM SystemML machine learning technology to the Spark open source ecosystem Spark Technology Center established in San Francisco for the data science and developer community IBM supports the Spark community Code contributions Partnerships with AMPLab Galvanize and Big Data University (MOOC) Education for data scientists and developers Visit the IBM Spark Technology Center at

13 Spark Details

14 The Spark Stack, Architectural Overview 2016 IBM Corporation 14

15 How to use Spark? 2016 IBM Corporation 15

16 What is Spark used for? 2016 IBM Corporation 16

17 Spark Streaming 2016 IBM Corporation 17

involves lots of disk I/O New Approach: Keep more data

18 Why Apache Spark Existing approach: MapReduce for complex jobs, interactive query, and online event-hub processing involves lots of disk I/O New Approach: Keep more data in-memory with a new distributed execution engine 2016 IBM Corporation 18

19 How Does Spark Process Your Data - RDDs Resilient Distributed Data (RDD): Spark s basic unit of data Immutable, modifications create new RDDs Caching, persistence (memory, spilling to disk) Fault tolerance If data in memory is lost it can be recreated from their original lineage RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes An RDD is physically distributed across the cluster, but manipulated as one logical entity: Spark will automatically distribute any required processing to all partitions Perform necessary redistributions and aggregations as well Names 2016 IBM Corporation 19

20 How Does Spark Process Your Data Resilient Distributed Data Sets (RDDs) Spark s basic unit of data Immutable collection of elements that can be operated on in parallel across a cluster Fault tolerance If data in memory is lost it will be recreated from lineage Caching, persistence (memory, spilling, disk) and check-pointing Many database or file types can be supported An RDD is physically distributed across the cluster, but manipulated as one logical entity: Spark will distribute any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well. Example: Consider a distributed RDD Names made of names Names 2016 IBM Corporation 20

How Does Spark Process Your Data Resilient Distributed Data Sets (RDDs) Key idea: write programs in terms of transformations on distributed datasets An RDD is Spark s basic unit of

21 How Does Spark Process Your Data Resilient Distributed Data Sets (RDDs) Key idea: write programs in terms of transformations on distributed datasets An RDD is Spark s basic unit of data Fault tolerance If data in memory is lost it will be recreated from lineage Caching, persistence (memory or disk) Many databases or file types can be supported 2016 IBM Corporation 21

22 Spark Programming Model Operations on RDDs (datasets) Transformation Action Transformations use lazy evaluation Executed only if an action requires it An application consist of a Directed Acyclic Graph (DAG) Each action results in a separate batch job Parallelism is determined by the number of RDD partitions 2016 IBM Corporation 22

23 Resilient Distributed Datasets (RDDs) 2016 IBM Corporation 23

24 Spark lazy execution At a high level, any Spark application creates RDDs out of some input..... runs (lazy) transformations of these RDDs to some other form (shape)... and finally perform actions to collect or store data Execution (data access) only starts when an action is encountered IBM Corporation 24

25 Common, Popular Methods to Access Data Spark SQL Provide for relational queries expressed in SQL, HiveQL and Scala Seamlessly mix SQL queries with Spark programs Provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files Standard connectivity through JDBC/ODBC Integration of Spark z/os with Rocket Software provides unique functionality to access data across a wide variety of environments with very high performance and flexibility Spark R Spark R is an R package that provides a light-weight front-end to use Apache Spark from R Spark R exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster. Goal is to make Spark R production ready Rocket Software has announced intent to support R on z/os 2016 IBM Corporation 25

26 The ecosystem Spark on z Systems

Spark Support on z Systems Spark processing of z data Access to z data JDBC access to DB2 z/os, IMS Rocket Mainframe Data Service for Apache Spark (MDSS) Spark running on z Systems z/os Linux on z

27 Spark Support on z Systems Spark processing of z data Access to z data JDBC access to DB2 z/os, IMS Rocket Mainframe Data Service for Apache Spark (MDSS) Spark running on z Systems z/os Linux on z Systems Operating system Hardware Bitness z/os 2.1 z Systems 64 SUSE Linux Enterprise Server 12 z Systems 64 Red Hat Enterprise Linux 7.1 z Systems 64 Ubuntu LTS z Systems IBM Corporation 27

Partner Provided Spark Stream MLib Apache Spark Core GraphX Java z/os z/os.

28 IBM z/os Platform for Apache Spark External Data Sources Twitter HDFS DB2 LUW Oracle JSON/ XML Tera OLTP Applications CICS, IMS, WAS Analysis routines Spark SQL Spark Analytic Applications: Customer Provided IBM Provided Partner Provided Spark Stream MLib Apache Spark Core GraphX Java z/os z/os... RDD RDD RDD RDD IBM DB2 Analytics Accelerator DB2 z/os Type 2 Type 4 Type 4 IMS Mainframe Data Service for Apache Spark z/os VSAM SMF And more 2016 IBM Corporation 28

29 IBM z/os Platform for Apache Spark... Almost any data source from any location can be processed in the Spark environment on z/os Mainframe Data Service for Apache Spark (MDSS) is key to providing a single, optimized view of heterogeneous data sources MDSS can integrate off-platform data sources as well Large majority of cycles used by MDSS are ziip-eligible Possible to use Spark on z/os without it, but MDSS is recommended Integration in Online Transaction Processing (OLTP) is possible, but response time may be an issue Spark is the high performance solution for processing big data IBM DB2 Analytics Accelerator optimization with DB2 for z/os can be integrated 2016 IBM Corporation 29

z Systems Analytics and Spark: Well-matched Stream GraphX SQL MLlib Compression RDDcache File System Network Scheduler Serialization Faster framework for in-memory analytics, both batch and real-time

30 z Systems Analytics and Spark: Well-matched Stream GraphX SQL MLlib Compression RDDcache File System Network Scheduler Serialization Faster framework for in-memory analytics, both batch and real-time Federated data-in-place analytics access while preserving consistent APIs Why Spark on z Systems Optimized performance with zedc for compression SMT2 for better thread performance SIMD for optimized floating point performance and accelerate analytics processing for mathematical models via vector processing Leverage IBM Java optimizations and enhancements ziip eligibility 2016 IBM Corporation 30

31 Why and when use Spark on z/os? The environment where Apache Spark z/os makes sense: Running real-time or batch analytics over a variety of heterogeneous data sources Efficient real-time access to current and historical transactions Where a majority of data is z/os resident When data contains sensitive information Don't scatter across several distributed nodes to be held in memory for some unknown period of time When implementing common analytic interfaces that are shared with users on distributed platforms z/os strengths are valuable in a Spark environment: Intra-SQL and intra-partition parallelism for optimal data access to: nearly all z/os data environments distributed data sources 2016 IBM Corporation 31

32 The Ecosystem for Spark on z/os relevance to the new consumers Data Scientist The convincer The primary customer Creates the spark application(s) that produce insights with business value Probably doesn't know or care where all of the Spark resources come from Close to the platform, probably a Z- based person Works with the IM specialist to associate a view of the data with the actual on-platform assets Data Engineer The wrangler App Developer The builder Helps the Data scientist assemble and clean the data, write applications Probably better awareness about resource details, but still is primarily concerned with the problem to solve, not the platform 2016 IBM Corporation 32

33 Use Cases

Business Use Case illustrated in a showcase Financial Institution: offers both retail /

organization Stock Price History public data Credit Card Info Trade Transaction Info Customer

34 Business Use Case illustrated in a showcase Financial Institution: offers both retail / consumer banking as well as investment services Business Critical Information owned by the organization Stock Price History public data Credit Card Info Trade Transaction Info Customer Info Social Media Data: Twitter GOAL Produce right-time offers for cross sell or upsell, tailored for individual customers based on: customer profile, trade transaction history, credit card purchases and social media sentiment and information Spark zos Client Insight Use Case IBM Corporation 34

35 Spark z/os Showcase: Reference Architecture Cloudant Linux on z REST API CICS Transaction system DB2 z/os JDBC IMS Transaction system Credit Card Info Spark Job Server JDBC Open Source Visualization Libraries Linux on z z/os JDBC IMS VSAM Trade Transaction Info Customer Info 2016 IBM Corporation 35

Analyzing SMF Data with Dump Data Sets LPARn Spark Application using SparkSQL Spark application is agnostic to data source and number of sources LPAR1 MDSS Client SMF Realtime JDBC Mainframe Data

36 Analyzing SMF Data with Dump Data Sets LPARn Spark Application using SparkSQL Spark application is agnostic to data source and number of sources LPAR1 MDSS Client SMF Realtime JDBC Mainframe Data Service for Spark z/os (MDSS) Logstream Logstream Logstream LPAR2 MDSS Client SMF Realtime Logstream Logstream Logstream SMF Realtime LPAR3 MDSS Client SMF Realtime Logstream Logstream Logstream Logstream Logstream Logstream MDSS required on at least one system, MDSS agents required on all systems. No IPL required for installation Logstream recording mode required for real time interfaces 2016 IBM Corporation 36

Streaming SMF into Spark supports custom input streams Custom Receiver support PoC built with a custom SMF data receiver using real-time SMF services SMF30 records stored in a DStream in Spark Time

37 Streaming SMF into Spark supports custom input streams Custom Receiver support PoC built with a custom SMF data receiver using real-time SMF services SMF30 records stored in a DStream in Spark Time series of RDDs DStream map() method translates raw SMF30 records into formatted strings Simple example to show the data flow -- Formatted records printed to stdout Custom SMF Receiver byte[] DStream String DStream time 1 time 2 time 3 time 4 SMF30s time 0 to 1 Formatted SMF30s time 0 to 1 map() Operation SMF30s time 1 to 2 Formatted SMF30s time 1 to 2 SMF30s time 2 to 3 Formatted SMF30s time 2 to 3 SMF30s time 3 to 4 Formatted SMF30s time 3 to IBM Corporation 37

38 SMF Analytics Example: SMF Streaming into Spark is the analytics operating system Supports input streams Custom Receiver support PoC built with a custom SMF data receiver using real-time SMF services This can be an interesting way to learn and get comfortable with Spark technologies Existing SMFEWTM Simple Spark SMF Application Spark Custom SMF Data Receiver Java JNI Interfaces SMF Read In-Memory Data Service Existing SMF Buffering Technology SMF Java Classes Code Running in Spark on z/os Base z/os Support New z/os Support PoC Code Existing z/os Support 2016 IBM Corporation 38

39 SMF Real-Time Streaming into 1. Run the command to start the Spark job./bin/spark-submit --master local /spark/spark15/lib/javasmfreceiver.jar IFASMF.INMEM 2. See the status of the SMF real-time resource on the MVS console SY1 d smf,m IFA714I SMF STATUS 082 IN MEMORY CONNECTIONS Resource: IFASMF.INMEM Con#: 0001 Connect Time: :20:11 ASID: 002F RECORDS: Look at the Spark job output in Unix SMF30.2 on 15:21: SMF30.2 on 15:21: SMF30.2 on 15:21: IBM Corporation 39

40 New Redbook on Apache Spark implementation on IBM z/os 2016 IBM Corporation 40 40

41 Want to see and know more about Spark on z Systems? Then be our Guest for the Spark on z Systems Event For more information click here Contact: Khadija Souissi, souissik@de.ibm.com 2016 IBM Corporation 41

42 References Spark Communities Spark SQL Programming Guide: IBM SystemML Open Source: June 2015, we announced to open source SystemML SystemML has been accepted as an Apache Incubator project Source code: Published paper: Big Data University Spark Fundamentals course IBM paper on Fueling Insight Economy with Apache www-01.ibm.com/marketing/iwm/dre/signup?source=ibmanalytics&s_pkg=ov37158&dynform=19170&s_tact=m161001w A Deeper Understanding of Spark Internals IBM z/os Platform for Apache Spark documentation IBM 4 2

43 Thank you!

Analytics on z Systems, what s new?

Khadija Souissi Analytics on z Systems, what s new? IBM Architektentage 16.11.2016 Agenda Spark on z Systems What is Spark Spark Details The ecosystem for Spark on z/os Use Cases New Dimension for the