Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Size: px
Start display at page:

Download "Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016"

Transcription

1 Khadija Souissi Auf z Systems November IBM z Systems Mainframe Event 2016

2 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation IBM Corporation 2

3 Agenda What is Spark Spark Details The ecosystem for Spark on z/os Use Cases References 2016 IBM Corporation 3

4 Big Data Maturity Most Businesses Are Here Operations Data Warehousing Line of Business and Analytics New Business Imperatives Value Lower the Cost of Storage Warehouse Modernization Data lake Data offload ETL offload Queryable archive and staging Data-informed Decision Making Full dataset analysis (no more sampling) Extract value from nonrelational data 360 o view of all enterprise data Exploratory analysis and discovery Business Transformation Create new business models Risk-aware decision making Fight fraud and counter threats Optimize operations Attract, grow, retain customers 4 Big Data Maturity 2016 IBM Corporation 4

5 What Spark is (and is not)

6 What Spark Is, What it Is Not An Apache Foundation open source project Not a product An in-memory compute engine that works with data Not a data store Enables highly sophisticated analysis on huge volumes of data at scale Unified environment for data scientists, developers and data engineers Radically simplifies the process of developing intelligent apps fueled by data 2015 IBM Corporation 2016 IBM Corporation 6

7 What Spark on z/os is Not What it is not A data cache for all data in DB2, IMS, IDAA, VSAM Just a different SQL engine or query optimizer An effective mechanism to access a single data source for analytics Why isn t it the same as a query acceleration / IBM DB2 Analytics Accelerator? Spark does not optimize SQL queries Spark is not a mechanism to store data, it rather provides interfaces to uniformly access data & to apply analytics using a unified interface DB2 Analytics Accelerator interaction with applications is via DB2; Spark interaction with applications is via Spark interfaces (Stream, MLlib, Graphx, SQL), driven through REST or Java Spark analytics can access data in DB2, DB2 Analytics Accelerator, VSAM, IMS, off platform, etc IBM Corporation 7

8 Apache Spark a compute Engine Fast Leverages aggressively cached in-memory distributed computing and JVM threads Faster than MapReduce for some workloads Ease of use (for programmers) Written in Scala, an object-oriented, functional programming language Scala, Python and Java APIs Scala and Python interactive shells Runs on Hadoop, Mesos, standalone or cloud General purpose Covers a wide range of workloads Provides SQL, streaming and complex analytics Logistic regression in Hadoop and Spark from IBM Corporation 8

9 Spark History Over 750 contributors from over 250 companies including IBM, Google, Amazon, SAP, Databricks 2016 IBM Corporation 9

10 Why does Spark matter to the business 2016 IBM Corporation 10

11 Who uses Spark Build models quickly. Iterate faster. Apply intelligence everywhere IBM Corporation 11

12 IBM s Involvement and Commitment IBM is one of the four founding members of the UC Berkeley AMPLab Work closely with AMPLab research on projects of mutual interest June 2015, IBM announced: 3,500 researchers and developers to work on Spark-related projects at IBM labs worldwide Donate IBM SystemML machine learning technology to the Spark open source ecosystem Spark Technology Center established in San Francisco for the data science and developer community IBM supports the Spark community Code contributions Partnerships with AMPLab Galvanize and Big Data University (MOOC) Education for data scientists and developers Visit the IBM Spark Technology Center at

13 Spark Details

14 The Spark Stack, Architectural Overview 2016 IBM Corporation 14

15 How to use Spark? 2016 IBM Corporation 15

16 What is Spark used for? 2016 IBM Corporation 16

17 Spark Streaming 2016 IBM Corporation 17

18 Why Apache Spark Existing approach: MapReduce for complex jobs, interactive query, and online event-hub processing involves lots of disk I/O New Approach: Keep more data in-memory with a new distributed execution engine 2016 IBM Corporation 18

19 How Does Spark Process Your Data - RDDs Resilient Distributed Data (RDD): Spark s basic unit of data Immutable, modifications create new RDDs Caching, persistence (memory, spilling to disk) Fault tolerance If data in memory is lost it can be recreated from their original lineage RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes An RDD is physically distributed across the cluster, but manipulated as one logical entity: Spark will automatically distribute any required processing to all partitions Perform necessary redistributions and aggregations as well Names 2016 IBM Corporation 19

20 How Does Spark Process Your Data Resilient Distributed Data Sets (RDDs) Spark s basic unit of data Immutable collection of elements that can be operated on in parallel across a cluster Fault tolerance If data in memory is lost it will be recreated from lineage Caching, persistence (memory, spilling, disk) and check-pointing Many database or file types can be supported An RDD is physically distributed across the cluster, but manipulated as one logical entity: Spark will distribute any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well. Example: Consider a distributed RDD Names made of names Names 2016 IBM Corporation 20

21 How Does Spark Process Your Data Resilient Distributed Data Sets (RDDs) Key idea: write programs in terms of transformations on distributed datasets An RDD is Spark s basic unit of data Fault tolerance If data in memory is lost it will be recreated from lineage Caching, persistence (memory or disk) Many databases or file types can be supported 2016 IBM Corporation 21

22 Spark Programming Model Operations on RDDs (datasets) Transformation Action Transformations use lazy evaluation Executed only if an action requires it An application consist of a Directed Acyclic Graph (DAG) Each action results in a separate batch job Parallelism is determined by the number of RDD partitions 2016 IBM Corporation 22

23 Resilient Distributed Datasets (RDDs) 2016 IBM Corporation 23

24 Spark lazy execution At a high level, any Spark application creates RDDs out of some input..... runs (lazy) transformations of these RDDs to some other form (shape)... and finally perform actions to collect or store data Execution (data access) only starts when an action is encountered IBM Corporation 24

25 Common, Popular Methods to Access Data Spark SQL Provide for relational queries expressed in SQL, HiveQL and Scala Seamlessly mix SQL queries with Spark programs Provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files Standard connectivity through JDBC/ODBC Integration of Spark z/os with Rocket Software provides unique functionality to access data across a wide variety of environments with very high performance and flexibility Spark R Spark R is an R package that provides a light-weight front-end to use Apache Spark from R Spark R exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster. Goal is to make Spark R production ready Rocket Software has announced intent to support R on z/os 2016 IBM Corporation 25

26 The ecosystem Spark on z Systems

27 Spark Support on z Systems Spark processing of z data Access to z data JDBC access to DB2 z/os, IMS Rocket Mainframe Data Service for Apache Spark (MDSS) Spark running on z Systems z/os Linux on z Systems Operating system Hardware Bitness z/os 2.1 z Systems 64 SUSE Linux Enterprise Server 12 z Systems 64 Red Hat Enterprise Linux 7.1 z Systems 64 Ubuntu LTS z Systems IBM Corporation 27

28 IBM z/os Platform for Apache Spark External Data Sources Twitter HDFS DB2 LUW Oracle JSON/ XML Tera OLTP Applications CICS, IMS, WAS Analysis routines Spark SQL Spark Analytic Applications: Customer Provided IBM Provided Partner Provided Spark Stream MLib Apache Spark Core GraphX Java z/os z/os... RDD RDD RDD RDD IBM DB2 Analytics Accelerator DB2 z/os Type 2 Type 4 Type 4 IMS Mainframe Data Service for Apache Spark z/os VSAM SMF And more 2016 IBM Corporation 28

29 IBM z/os Platform for Apache Spark... Almost any data source from any location can be processed in the Spark environment on z/os Mainframe Data Service for Apache Spark (MDSS) is key to providing a single, optimized view of heterogeneous data sources MDSS can integrate off-platform data sources as well Large majority of cycles used by MDSS are ziip-eligible Possible to use Spark on z/os without it, but MDSS is recommended Integration in Online Transaction Processing (OLTP) is possible, but response time may be an issue Spark is the high performance solution for processing big data IBM DB2 Analytics Accelerator optimization with DB2 for z/os can be integrated 2016 IBM Corporation 29

30 z Systems Analytics and Spark: Well-matched Stream GraphX SQL MLlib Compression RDDcache File System Network Scheduler Serialization Faster framework for in-memory analytics, both batch and real-time Federated data-in-place analytics access while preserving consistent APIs Why Spark on z Systems Optimized performance with zedc for compression SMT2 for better thread performance SIMD for optimized floating point performance and accelerate analytics processing for mathematical models via vector processing Leverage IBM Java optimizations and enhancements ziip eligibility 2016 IBM Corporation 30

31 Why and when use Spark on z/os? The environment where Apache Spark z/os makes sense: Running real-time or batch analytics over a variety of heterogeneous data sources Efficient real-time access to current and historical transactions Where a majority of data is z/os resident When data contains sensitive information Don't scatter across several distributed nodes to be held in memory for some unknown period of time When implementing common analytic interfaces that are shared with users on distributed platforms z/os strengths are valuable in a Spark environment: Intra-SQL and intra-partition parallelism for optimal data access to: nearly all z/os data environments distributed data sources 2016 IBM Corporation 31

32 The Ecosystem for Spark on z/os relevance to the new consumers Data Scientist The convincer The primary customer Creates the spark application(s) that produce insights with business value Probably doesn't know or care where all of the Spark resources come from Close to the platform, probably a Z- based person Works with the IM specialist to associate a view of the data with the actual on-platform assets Data Engineer The wrangler App Developer The builder Helps the Data scientist assemble and clean the data, write applications Probably better awareness about resource details, but still is primarily concerned with the problem to solve, not the platform 2016 IBM Corporation 32

33 Use Cases

34 Business Use Case illustrated in a showcase Financial Institution: offers both retail / consumer banking as well as investment services Business Critical Information owned by the organization Stock Price History public data Credit Card Info Trade Transaction Info Customer Info Social Media Data: Twitter GOAL Produce right-time offers for cross sell or upsell, tailored for individual customers based on: customer profile, trade transaction history, credit card purchases and social media sentiment and information Spark zos Client Insight Use Case IBM Corporation 34

35 Spark z/os Showcase: Reference Architecture Cloudant Linux on z REST API CICS Transaction system DB2 z/os JDBC IMS Transaction system Credit Card Info Spark Job Server JDBC Open Source Visualization Libraries Linux on z z/os JDBC IMS VSAM Trade Transaction Info Customer Info 2016 IBM Corporation 35

36 Analyzing SMF Data with Dump Data Sets LPARn Spark Application using SparkSQL Spark application is agnostic to data source and number of sources LPAR1 MDSS Client SMF Realtime JDBC Mainframe Data Service for Spark z/os (MDSS) Logstream Logstream Logstream LPAR2 MDSS Client SMF Realtime Logstream Logstream Logstream SMF Realtime LPAR3 MDSS Client SMF Realtime Logstream Logstream Logstream Logstream Logstream Logstream MDSS required on at least one system, MDSS agents required on all systems. No IPL required for installation Logstream recording mode required for real time interfaces 2016 IBM Corporation 36

37 Streaming SMF into Spark supports custom input streams Custom Receiver support PoC built with a custom SMF data receiver using real-time SMF services SMF30 records stored in a DStream in Spark Time series of RDDs DStream map() method translates raw SMF30 records into formatted strings Simple example to show the data flow -- Formatted records printed to stdout Custom SMF Receiver byte[] DStream String DStream time 1 time 2 time 3 time 4 SMF30s time 0 to 1 Formatted SMF30s time 0 to 1 map() Operation SMF30s time 1 to 2 Formatted SMF30s time 1 to 2 SMF30s time 2 to 3 Formatted SMF30s time 2 to 3 SMF30s time 3 to 4 Formatted SMF30s time 3 to IBM Corporation 37

38 SMF Analytics Example: SMF Streaming into Spark is the analytics operating system Supports input streams Custom Receiver support PoC built with a custom SMF data receiver using real-time SMF services This can be an interesting way to learn and get comfortable with Spark technologies Existing SMFEWTM Simple Spark SMF Application Spark Custom SMF Data Receiver Java JNI Interfaces SMF Read In-Memory Data Service Existing SMF Buffering Technology SMF Java Classes Code Running in Spark on z/os Base z/os Support New z/os Support PoC Code Existing z/os Support 2016 IBM Corporation 38

39 SMF Real-Time Streaming into 1. Run the command to start the Spark job./bin/spark-submit --master local /spark/spark15/lib/javasmfreceiver.jar IFASMF.INMEM 2. See the status of the SMF real-time resource on the MVS console SY1 d smf,m IFA714I SMF STATUS 082 IN MEMORY CONNECTIONS Resource: IFASMF.INMEM Con#: 0001 Connect Time: :20:11 ASID: 002F RECORDS: Look at the Spark job output in Unix SMF30.2 on 15:21: SMF30.2 on 15:21: SMF30.2 on 15:21: IBM Corporation 39

40 New Redbook on Apache Spark implementation on IBM z/os 2016 IBM Corporation 40 40

41 Want to see and know more about Spark on z Systems? Then be our Guest for the Spark on z Systems Event For more information click here Contact: Khadija Souissi, souissik@de.ibm.com 2016 IBM Corporation 41

42 References Spark Communities Spark SQL Programming Guide: IBM SystemML Open Source: June 2015, we announced to open source SystemML SystemML has been accepted as an Apache Incubator project Source code: Published paper: Big Data University Spark Fundamentals course IBM paper on Fueling Insight Economy with Apache www-01.ibm.com/marketing/iwm/dre/signup?source=ibmanalytics&s_pkg=ov37158&dynform=19170&s_tact=m161001w A Deeper Understanding of Spark Internals IBM z/os Platform for Apache Spark documentation IBM 4 2

43 Thank you!

Analytics on z Systems, what s new?

Analytics on z Systems, what s new? Khadija Souissi Analytics on z Systems, what s new? IBM Architektentage 16.11.2016 Agenda Spark on z Systems What is Spark Spark Details The ecosystem for Spark on z/os Use Cases New Dimension for the

More information

zspotlight: Spark on z/os

zspotlight: Spark on z/os zspotlight: Spark on z/os Avijit Chatterjee, Ph.D. achatter@us.ibm.com, @ChatterAvijit STSM, IBM Competitive Project Office 1 CEOs are increasingly focused on customers as individuals leveraging contextual

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

DATA SCIENCE USING SPARK: AN INTRODUCTION

DATA SCIENCE USING SPARK: AN INTRODUCTION DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Saving ETL Costs Through Data Virtualization Across The Enterprise

Saving ETL Costs Through Data Virtualization Across The Enterprise Saving ETL Costs Through Virtualization Across The Enterprise IBM Virtualization Manager for z/os Marcos Caurim z Analytics Technical Sales Specialist 2017 IBM Corporation What is Wrong with Status Quo?

More information

ANY Data for ANY Application Exploring IBM Data Virtualization Manager for z/os in the era of API Economy

ANY Data for ANY Application Exploring IBM Data Virtualization Manager for z/os in the era of API Economy ANY Data for ANY Application Exploring IBM for z/os in the era of API Economy Francesco Borrello francesco.borrello@it.ibm.com IBM z Analytics Traditional Data Integration Inadequate No longer Viable to

More information

Analyzing Flight Data

Analyzing Flight Data IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

The Evolution of Big Data Platforms and Data Science

The Evolution of Big Data Platforms and Data Science IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

IBM Data Virtualization Manager for z/os Leverage data virtualization synergy with API economy to evolve the information architecture on IBM Z

IBM Data Virtualization Manager for z/os Leverage data virtualization synergy with API economy to evolve the information architecture on IBM Z IBM for z/os Leverage data virtualization synergy with API economy to evolve the information architecture on IBM Z IBM z Analytics Agenda Big Data vs. Dark Data Traditional Data Integration Mainframe Data

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Specialist ICT Learning

Specialist ICT Learning Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.

More information

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Apache Spark 2.0. Matei

Apache Spark 2.0. Matei Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and

More information

Data Virtualization for the Enterprise

Data Virtualization for the Enterprise Data Virtualization for the Enterprise New England Db2 Users Group Meeting Old Sturbridge Village, 1 Old Sturbridge Village Road, Sturbridge, MA 01566, USA September 27, 2018 Milan Babiak Client Technical

More information

Data Platforms and Pattern Mining

Data Platforms and Pattern Mining Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,

More information

IBM Data Replication for Big Data

IBM Data Replication for Big Data IBM Data Replication for Big Data Highlights Stream changes in realtime in Hadoop or Kafka data lakes or hubs Provide agility to data in data warehouses and data lakes Achieve minimum impact on source

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

Fast Innovation requires Fast IT

Fast Innovation requires Fast IT Fast Innovation requires Fast IT Cisco Data Virtualization Puneet Kumar Bhugra Business Solutions Manager 1 Challenge In Data, Big Data & Analytics Siloed, Multiple Sources Business Outcomes Business Opportunity:

More information

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 From Single Purpose to Multi Purpose Data Lakes Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 Agenda Data Lakes Multiple Purpose Data Lakes Customer Example Demo Takeaways

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Apache Spark and Scala Certification Training

Apache Spark and Scala Certification Training About Intellipaat Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over

More information

IBM IMS Tools Keynote

IBM IMS Tools Keynote IBM IMS TECHNICAL SYMPOSIUM 2016 IBM IMS Tools Keynote Janet LeBlanc IMS Tools Offering Manager 2016 IBM Corporation Agenda Our journey where we have been A couple of products you should see this week:

More information

Comparative Study of Apache Hadoop vs Spark

Comparative Study of Apache Hadoop vs Spark International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 7 ISSN : 2456-3307 Comparative Study of Apache Hadoop vs Spark Varsha

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark Agenda Spark Platform Spark Core Spark Extensions Using Apache Spark About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing

More information

About Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals

More information

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F. Applied Spark From Concepts to Bitcoin Analytics Andrew F. Hart ahart@apache.org @andrewfhart My Day Job CTO, Pogoseat Upgrade technology for live events 3/28/16 QCON-SP Andrew Hart 2 Additionally Member,

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years

More information

The Never Ending Value of z Systems Focus on Analytics & Big Data

The Never Ending Value of z Systems Focus on Analytics & Big Data The Never Ending Value of z Systems Focus on Analytics & Big Data Hélène Lyon Distinguished Engineer & CTO, Analytics on z Systems for Europe Europe IMS SWAT Technical Executive IBM Systems, zsoftware

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc. An Introduction to Apache Spark Big Data Madison: 29 July 2014 William Benton @willb Red Hat, Inc. About me At Red Hat for almost 6 years, working on distributed computing Currently contributing to Spark,

More information

Syncsort DMX-h. Simplifying Big Data Integration. Goals of the Modern Data Architecture SOLUTION SHEET

Syncsort DMX-h. Simplifying Big Data Integration. Goals of the Modern Data Architecture SOLUTION SHEET SOLUTION SHEET Syncsort DMX-h Simplifying Big Data Integration Goals of the Modern Data Architecture Data warehouses and mainframes are mainstays of traditional data architectures and still play a vital

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

IBM DB2 Analytics Accelerator Trends and Directions

IBM DB2 Analytics Accelerator Trends and Directions March, 2017 IBM DB2 Analytics Accelerator Trends and Directions DB2 Analytics Accelerator for z/os on Cloud Namik Hrle IBM Fellow Peter Bendel IBM STSM Disclaimer IBM s statements regarding its plans,

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark Anastasios Skarlatidis @anskarl Software Engineer/Researcher IIT, NCSR "Demokritos" Outline Part I: Getting to know Spark Part II: Basic programming Part III: Spark under

More information

An Overview of Apache Spark

An Overview of Apache Spark An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using

More information

빅데이터기술개요 2016/8/20 ~ 9/3. 윤형기

빅데이터기술개요 2016/8/20 ~ 9/3. 윤형기 빅데이터기술개요 2016/8/20 ~ 9/3 윤형기 (hky@openwith.net) D4 http://www.openwith.net 2 Hive http://www.openwith.net 3 What is Hive? 개념 a data warehouse infrastructure tool to process structured data in Hadoop. Hadoop

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud Microsoft Azure Databricks for data engineering Building production data pipelines with Apache Spark in the cloud Azure Databricks As companies continue to set their sights on making data-driven decisions

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Turning Relational Database Tables into Spark Data Sources

Turning Relational Database Tables into Spark Data Sources Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following

More information

Reactive App using Actor model & Apache Spark. Rahul Kumar Software

Reactive App using Actor model & Apache Spark. Rahul Kumar Software Reactive App using Actor model & Apache Spark Rahul Kumar Software Developer @rahul_kumar_aws About Sigmoid We build realtime & big data systems. OUR CUSTOMERS Agenda Big Data - Intro Distributed Application

More information

Analytics with IMS and QMF

Analytics with IMS and QMF Analytics with IMS and QMF Steve Mink Worldwide z System Analytics Client Success mink@us.ibm.com March 2015 * IMS Technical Symposium 2015 QMF for z/os 11 QMF for z/os is a visual business intelligence

More information

WHITEPAPER. MemSQL Enterprise Feature List

WHITEPAPER. MemSQL Enterprise Feature List WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure

More information

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

Principal Software Engineer Red Hat Emerging Technology June 24, 2015 USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging

More information

2/4/2019 Week 3- A Sangmi Lee Pallickara

2/4/2019 Week 3- A Sangmi Lee Pallickara Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1

More information

Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP

Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP 07.29.2015 LANDING STAGING DW Let s start with something basic Is Data Lake a new concept? What is the closest we can

More information

Bringing Data to Life

Bringing Data to Life Bringing Data to Life Data management and Visualization Techniques Benika Hall Rob Harrison Corporate Model Risk March 16, 2018 Introduction Benika Hall Analytic Consultant Wells Fargo - Corporate Model

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Oracle GoldenGate for Big Data

Oracle GoldenGate for Big Data Oracle GoldenGate for Big Data The Oracle GoldenGate for Big Data 12c product streams transactional data into big data systems in real time, without impacting the performance of source systems. It streamlines

More information

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 22. Spark Paul Krzyzanowski Rutgers University Fall 2016 November 26, 2016 2015-2016 Paul Krzyzanowski 1 Apache Spark Goal: generalize MapReduce Similar shard-and-gather approach to

More information

Evolution From Shark To Spark SQL:

Evolution From Shark To Spark SQL: Evolution From Shark To Spark SQL: Preliminary Analysis and Qualitative Evaluation Xinhui Tian and Xiexuan Zhou Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017 UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in

More information

Backtesting with Spark

Backtesting with Spark Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 CSC 261/461 Database Systems Lecture 24 Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 Announcements Term Paper due on April 20 April 23 Project 1 Milestone 4 is out Due on 05/03 But I would

More information

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics

More information

Deploying Applications on DC/OS

Deploying Applications on DC/OS Mesosphere Datacenter Operating System Deploying Applications on DC/OS Keith McClellan - Technical Lead, Federal Programs keith.mcclellan@mesosphere.com V6 THE FUTURE IS ALREADY HERE IT S JUST NOT EVENLY

More information

Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing

Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing IBM Software Group Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing George Wang Lead Software Egnineer, DB2 for z/os IBM 2014 IBM Corporation Disclaimer and Trademarks

More information

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications

More information

Outline. CS-562 Introduction to data analysis using Apache Spark

Outline. CS-562 Introduction to data analysis using Apache Spark Outline Data flow vs. traditional network programming What is Apache Spark? Core things of Apache Spark RDD CS-562 Introduction to data analysis using Apache Spark Instructor: Vassilis Christophides T.A.:

More information

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

Blurring the Line Between Developer and Data Scientist

Blurring the Line Between Developer and Data Scientist Blurring the Line Between Developer and Data Scientist Notebooks with PixieDust va barbosa va@us.ibm.com Developer Advocacy IBM Watson Data Platform WHY ARE YOU HERE? More companies making bet-the-business

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Netezza The Analytics Appliance

Netezza The Analytics Appliance Software 2011 Netezza The Analytics Appliance Michael Eden Information Management Brand Executive Central & Eastern Europe Vilnius 18 October 2011 Information Management 2011IBM Corporation Thought for

More information

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How

More information

Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd + MongoDB + Spark = Awesome Sauce Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision

More information

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau CSE 6242 / CX 4242 Data and Visual Analytics Georgia Tech Spark & Spark SQL High- Speed In- Memory Analytics over Hadoop and Hive Data Instructor: Duen Horng (Polo) Chau Slides adopted from Matei Zaharia

More information

/ Cloud Computing. Recitation 13 April 14 th 2015

/ Cloud Computing. Recitation 13 April 14 th 2015 15-319 / 15-619 Cloud Computing Recitation 13 April 14 th 2015 Overview Last week s reflection Project 4.1 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 18 Project 4.2 Demo

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

Scalable Tools - Part I Introduction to Scalable Tools

Scalable Tools - Part I Introduction to Scalable Tools Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session

More information