USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015

ABOUT ME
Distributed systems and data science in Red Hat's Emerging Technology group
Active open-source and Fedora developer
Before Red Hat: programming language research

FORECAST
Distributed data processing: history and mythology
Data processing in the cloud
Introducing Apache Spark
How we use Spark for data science at Red Hat

DATA PROCESSING Recent history & persistent mythology

CHALLENGES What makes distributed data processing difficult?

MAPREDUCE (2004)
A novel application of some very old functional programming ideas to distributed computing
All data are modeled as key-value pairs
Mappers transform pairs; reducers merge several pairs with the same key into one new pair
Runtime system shuffles data to improve locality

WORD COUNT "a b" "c e" "a b" "d a" "d b"

MAPPED INPUTS (a, 1) (c, 1) (a, 1) (d, 1) (d, 1) (b, 1) (e, 1) (b, 1) (a, 1) (b, 1)

SHUFFLED RECORDS (a, 1) (b, 1) (c, 1) (d, 1) (e, 1) (a, 1) (b, 1) (d, 1) (a, 1) (b, 1)

REDUCED RECORDS (a, 3) (b, 3) (c, 1) (d, 2) (e, 1)
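To make the flow above concrete, here is a minimal sketch in plain Python (not the deck's code) that reproduces the map, shuffle, and reduce steps on the same five input lines:

from collections import defaultdict

def mapper(line):
    # emit a (word, 1) pair for every word in the line
    for word in line.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # group values by key, as the runtime system would
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # merge several pairs with the same key into one new pair
    return (key, sum(values))

lines = ["a b", "c e", "a b", "d a", "d b"]
mapped = [pair for line in lines for pair in mapper(line)]
reduced = [reducer(k, vs) for k, vs in shuffle(mapped).items()]
print(sorted(reduced))  # [('a', 3), ('b', 3), ('c', 1), ('d', 2), ('e', 1)]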

HADOOP (2005)
Open-source implementation of MapReduce, a distributed filesystem, and more
Inexpensive way to store and process data with scale-out on commodity hardware
Motivates many of the default assumptions we make about big data today

FACTS
You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

"...at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB and 90% of jobs on a Facebook cluster have input sizes under 100 GB."
Appuswamy et al., "Nobody ever got fired for buying a cluster." Microsoft Research Tech Report.

Takeaway #1: you may need petascale storage, but you probably don't even need terascale compute.

Takeaway #2: moderately sized workloads benefit more from scale-up than scale-out.

FACTS
You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

"Contrary to our expectations... CPU (and not I/O) is often the bottleneck [and] improving network performance can improve job completion time by a median of at most 2%"
Ousterhout et al., "Making Sense of Performance in Data Analytics Frameworks." USENIX NSDI '15.

Takeaway #3: I/O is not the bottleneck (especially in moderately-sized jobs); focus on CPU performance.

FACTS
You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

Takeaway #4: collocated data and compute was a sensible choice for petascale jobs in 2005, but shouldn't necessarily be the default today.

FACTS (REVISED)
You probably don't need an architecture that will scale out to many nodes to handle real-world data analytics (and might be better served by scaling up)
Your network and disks probably aren't the problem
You have enormous flexibility to choose the best technologies for storage and compute

HADOOP IN 2015
MapReduce is low-level, verbose, and not an obvious fit for many interesting problems
No unified abstractions: Hive or Pig for query, Giraph for graph, Mahout for machine learning, etc.
Fundamental architectural assumptions need to be revisited along with the facts motivating them

DATA PROCESSING IN THE CLOUD How our assumptions should change

COLLOCATED DATA AND COMPUTE

ELASTIC RESOURCES

DISTINCT STORAGE AND COMPUTE Combine the best storage system for your application with elastic compute resources.
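As a hedged illustration of this idea (the bucket name and mount path below are hypothetical, and reading s3n:// URLs requires the Hadoop S3 connector on the classpath), Spark can pull input straight from whichever storage system fits the application, with no persistent copy on the compute nodes:

from pyspark import SparkContext

spark = SparkContext(appName="distinct-storage-demo")

# read directly from S3 (s3n:// or s3a://, depending on your Hadoop version)
logs = spark.textFile("s3n://my-bucket/logs/*.gz")

# the same job could point at a Gluster volume mounted on each node instead:
# logs = spark.textFile("file:///mnt/glusterfs/logs/")

print(logs.count())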

INTRODUCING SPARK

Apache Spark is a framework for distributed computing based on a high-level, expressive abstraction.

[Stack diagram: the Spark core sits beneath query, ML, graph, and streaming libraries; jobs run ad hoc or under cluster managers like Mesos and YARN; language bindings are available for Scala, Java, Python, and R; data is accessible from JDBC, Gluster, HDFS, S3, and more.]

A resilient distributed dataset (RDD) is a partitioned, immutable, lazy collection. The PARTITIONS making up an RDD can be distributed across multiple machines. TRANSFORMATIONS create new (lazy) collections; ACTIONS force computations and return results.
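A minimal sketch of what lazy means in practice (variable names are illustrative): transformations merely record a lineage of work, and nothing executes until an action runs.

from pyspark import SparkContext

spark = SparkContext(appName="laziness-demo")

numbers = spark.parallelize(range(1, 1000))    # lazy: nothing computed yet
doubled = numbers.map(lambda x: x * 2)         # still lazy: just records the transformation
evens = doubled.filter(lambda x: x % 4 == 0)   # still lazy

# the action finally forces the whole pipeline to run
print(evens.count())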

CREATING AN RDD
# from an in-memory collection
spark.parallelize(range(1, 1000))

# from the lines of a text file
spark.textFile("hamlet.txt")

# from a Hadoop-format binary file
spark.hadoopFile("...")
spark.sequenceFile("...")
spark.objectFile("...")

TRANSFORMING RDDS
# transform each element independently
numbers.map(lambda x: x + 1)

# turn each element into zero or more elements
lines.flatMap(lambda s: s.split(" "))

# reject elements that don't satisfy a predicate
vowels = ['a', 'e', 'i', 'o', 'u']
words.filter(lambda s: s[0] in vowels)

# keep only one copy of duplicate elements
words.distinct()

TRANSFORMING RDDS
# return an RDD of key-value pairs, sorted by
# the key of each pair
pairs.sortByKey()

# combine every two pairs having the same key,
# using the given reduce function
pairs.reduceByKey(lambda x, y: max(x, y))

# join together two RDDs of pairs so that
# [(a, b)] join [(a, c)] == [(a, (b, c))]
pairs.join(other_pairs)

CACHING RESULTS
# tell Spark to cache this RDD in cluster
# memory after we compute it
sorted_pairs = pairs.sortByKey()
sorted_pairs.cache()

# as above, except also store a copy on disk
from pyspark import StorageLevel
sorted_pairs.persist(StorageLevel.MEMORY_AND_DISK)

# uncache and free this result
sorted_pairs.unpersist()

COMPUTING RESULTS
# compute this RDD and return a
# count of elements
numbers.count()

# compute this RDD and materialize it
# as a local collection
counts.collect()

# compute this RDD and write each
# partition to stable storage
words.saveAsTextFile("...")

WORD COUNT EXAMPLE
# create an RDD backed by the lines of a file
f = spark.textFile("...")

# ...mapping from lines of text to words
words = f.flatMap(lambda line: line.split(" "))

# ...mapping from words to occurrences
occs = words.map(lambda word: (word, 1))

# ...reducing occurrences to counts
counts = occs.reduceByKey(lambda a, b: a + b)

# POP QUIZ: what have we computed so far?
# (Answer: nothing! Everything above is a lazy
# transformation; this action forces the work.)
counts.saveAsTextFile("...")

PETASCALE STORAGE, IN-MEMORY COMPUTE
[Diagram: storage/compute nodes alongside compute-only nodes, which primarily operate on cached post-ETL data.]

DATA SCIENCE AT RED HAT

THE EMERGING TECH DATA SCIENCE TEAM
Engineers with distributed systems, data science, and scientific computing expertise
Goal: help internal customers solve data problems and make data-driven decisions
Principles: identify best practices, question outdated assumptions, use best-of-breed technology

DEVELOPMENT
Six compute-only nodes
Two nodes for Gluster storage
Apache Spark running under Apache Mesos
Open-source notebook interfaces to analyses

DATA SOURCES
FTP, S3, SQL, MongoDB, ElasticSearch

INTERACTIVE QUERY

TWO CASE STUDIES

ROLE ANALYSIS
Data source: historical configuration and telemetry data for internal machines from ElasticSearch
Data size: hundreds of GB
Analysis: identify machine roles based on the packages each has installed
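The deck doesn't show this pipeline's code, but a plausible sketch of the approach (the hostnames, package lists, numFeatures, and k below are all hypothetical) is to featurize each machine's installed-package list and cluster the results with MLlib:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import KMeans

spark = SparkContext(appName="role-analysis-sketch")

# hypothetical input: one record per machine, listing its installed packages
machines = spark.parallelize([
    ("web-01", ["httpd", "mod_ssl", "php"]),
    ("web-02", ["httpd", "mod_ssl", "php"]),
    ("db-01",  ["mariadb-server", "mariadb"]),
])

# hash each package list into a fixed-width feature vector
tf = HashingTF(numFeatures=1024)
vectors = tf.transform(machines.values())

# cluster machines with similar package profiles; k is a guess here
model = KMeans.train(vectors, k=2, maxIterations=10)

# pair each hostname with its cluster assignment (cluster ids are arbitrary)
roles = machines.keys().zip(model.predict(vectors))
print(roles.collect())  # e.g. [('web-01', 0), ('web-02', 0), ('db-01', 1)]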

BUDGET FORECASTING
Data sources: operational log data for OpenShift Online (from MariaDB), actual costs incurred by OpenShift
Data size: over 120 GB
Analysis: identify operational metrics most strongly correlated with operating expenses; model daily operating expense as a function of these metrics
Aggregating performance metrics: 17 hours in MariaDB, 15 minutes in Spark!
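As a hedged sketch of the kind of job involved (the record layout and numbers below are invented for illustration), the metric rollup and correlation might look like this in PySpark:

from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

spark = SparkContext(appName="budget-forecast-sketch")

# hypothetical records: (day, gear_hours, cost_usd) parsed from operational logs
records = spark.parallelize([
    ("2015-06-01", 1200.0, 310.0),
    ("2015-06-02", 1350.0, 342.0),
    ("2015-06-03",  980.0, 255.0),
])

# aggregate a daily operational metric
# (the kind of rollup that took 17 hours in MariaDB)
daily_metric = records.map(lambda r: (r[0], r[1])).reduceByKey(lambda a, b: a + b)

# measure how strongly the metric tracks daily operating expense
metric = records.map(lambda r: r[1])
cost = records.map(lambda r: r[2])
print(Statistics.corr(metric, cost, method="pearson"))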

NEXT STEPS

DEMO VIDEO See a video demo of Continuum Analytics, PySpark, and Red Hat Storage: h.264 or Ogg Theora

WHERE FROM HERE
Check out the Emerging Technology Data Science team's library to help build your own data-driven applications: https://github.com/willb/silex/
See my blog for articles about open source data science: http://chapeau.freevariable.com
Questions?

THANKS