Trends and Challenges in Big Data

Ion Stoica, November 14, 2016, PDSW-DISCS '16, UC Berkeley

Before starting
- Disclaimer: I know little about HPC and storage
- More collaboration than ever between the HPC, Distributed Systems, and Big Data / Machine Learning communities
- Hope this talk will help a bit in bringing us even closer

Big Data Research at Berkeley
- AMPLab (Jan 2011 - Dec 2016); 8 faculty, 60+ students
- Mission: make sense of big data, combining Machines, Algorithms, and People
- Goal: next generation of open source data analytics stack for industry & academia, the Berkeley Data Analytics Stack (BDAS)

BDAS Stack
[Diagram - Processing: Spark Core with Spark Streaming, SparkSQL, BlinkDB, SampleClean, SparkR, GraphX, MLbase, MLlib, Succinct, Velox, and 3rd-party libraries; Resource management: Mesos, Hadoop YARN; Storage: Tachyon, HDFS, S3, Ceph]

Several Successful Projects
- Apache Spark: most popular big data execution engine; 1000+ contributors, 1000+ orgs; offered by all major clouds and distributors
- Apache Mesos: cluster resource manager; manages 10,000+ node clusters; used by 100+ organizations (e.g., Twitter, Verizon, GE)
- Alluxio (a.k.a. Tachyon): in-memory distributed store; used by 100+ organizations (e.g., IBM, Alibaba)

This Talk
- Reflect on how application trends (i.e., user needs & requirements) and hardware trends have impacted the design of our systems
- How we can use these lessons to design new systems

2009

2009: State-of-the-Art in Big Data
- Apache Hadoop: large-scale, flexible data processing engine; batch computation (e.g., tens of minutes to hours); open source
- Getting rapid industry traction: high-profile users (Facebook, Twitter, Yahoo!, ...) and distributions (Cloudera, Hortonworks)
- Many companies still in austerity mode

2009: Application Trends
- Iterative computations, e.g., machine learning: more and more people aiming to get insights from data
- Interactive computations, e.g., ad-hoc analytics: SQL engines like Hive and Pig drove this trend

2009: Application Trends
- Despite huge amounts of data, many working sets in big data clusters fit in memory

2009: Application Trends

Memory (GB) | Facebook (% jobs) | Microsoft (% jobs) | Yahoo! (% jobs)
8           | 69                | 38                 | 66
16          | 74                | 51                 | 81
32          | 96                | 82                 | 97.5
64          | 97                | 98                 | 99.5
128         | 98.8              | 99.4               | 99.8
192         | 99.5              | 100                | 100
256         | 99.6              | 100                | 100

*G. Ananthanarayanan, A. Ghodsi, S. Shenker, I. Stoica, "Disk-Locality in Datacenter Computing Considered Irrelevant," HotOS 2011

2009: Hardware Trends
- Memory still riding Moore's law
[Chart: memory cost ($/GB), log scale, 1955-2010; source: http://www.jcmit.com/memoryprice.htm]

2009: Hardware Trends
- Memory still riding Moore's law
- I/O throughput and latency stagnant; HDDs dominating data clusters as storage of choice, with many deployments as low as 20 MB/sec per drive

2009 in summary
- Applications - requirements: ad-hoc queries, ML algorithms; enabler: working sets fit in memory
- Hardware - memory growing with Moore's law; I/O performance stagnant (HDDs)
- Resulting design: in-memory processing, multi-stage BSP model

2009: Our Solution: Apache Spark
- In-memory processing: great for ad-hoc queries
- Generalizes MapReduce to multi-stage computations: implements the BSP model and shares data between stages via memory; great for iterative computations, e.g., ML algorithms

2009: Technical Solutions
- Low-overhead resilience mechanisms → Resilient Distributed Datasets (RDDs)
- Efficient support for ML algorithms → powerful and flexible APIs: map/reduce are just two of 80+ APIs
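
A minimal PySpark sketch of both points (the input path and data are purely hypothetical): a cached RDD keeps the working set in memory, so each pass of an iterative algorithm (here a toy mean-finding loop) reuses memory instead of re-reading from disk, and failures are recovered via lineage rather than replication.

from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Load once; cache() keeps the parsed working set in memory.
points = sc.textFile("hdfs://.../points.txt") \
           .map(lambda line: [float(v) for v in line.split()]) \
           .cache()

n = points.count()            # first action materializes the cache
w = [0.0, 0.0]
for _ in range(10):           # every later pass hits memory, not disk
    step = points.map(lambda p: [p[0] - w[0], p[1] - w[1]]) \
                 .reduce(lambda a, b: [a[0] + b[0], a[1] + b[1]])
    w = [w[0] + step[0] / n, w[1] + step[1] / n]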

2012

2012: Application Trends
- People started to assemble end-to-end (e2e) data analytics pipelines: Raw Data → ETL → Ad-hoc exploration → Advanced Analytics → Data Products
- Need to stitch together a hodgepodge of systems: difficult to manage, learn, and use

2009 → 2012 in summary
- Applications - 2009 requirements: ad-hoc queries, ML algorithms (enabler: working sets fit in memory); 2012 requirement: build e2e big data pipelines
- Hardware - memory growing with Moore's law; I/O performance stagnant (HDDs)
- Resulting designs - 2009: in-memory processing, multi-stage BSP model; 2012: unified platform (SQL, ML, graphs, streaming)

2012: Our Solution: Unified Platform
- Support a variety of workloads: Spark SQL (interactive), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), all on Spark Core
- Support a variety of input sources
- Provide a variety of language bindings: Python, Java, Scala, R
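
A hedged sketch of what "unified" buys you, with invented file, table, and column names: SQL exploration and machine learning in one program, on one engine, with no data export between systems.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

events = spark.read.json("events.json")     # hypothetical input
events.createOrReplaceTempView("events")

# SQL for ad-hoc exploration...
recent = spark.sql("SELECT user, duration FROM events WHERE year = 2016")

# ...feeding straight into MLlib, with no system boundary to cross.
features = VectorAssembler(inputCols=["duration"], outputCol="features") \
               .transform(recent)
model = KMeans(k=3).fit(features)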

2014

2014: Application Trends
- New users, new requirements
- Spark early adopters were users who understood MapReduce & functional APIs
- Broader set of users now: data engineers, data scientists, statisticians, R users, the PyData community

2014: Hardware Trends
- Memory capacity still growing fast
[Chart: memory cost ($/GB), log scale, 1955-2015; source: http://www.jcmit.com/memoryprice.htm]

2014: Hardware Trends
- Memory capacity still growing fast
- Many clusters and datacenters transitioning to SSDs: orders-of-magnitude improvements in I/O throughput and latency (DigitalOcean: SSD-only instances since 2013)
- CPU performance growth slowing down

2009 → 2012 → 2014 in summary
- 2009 - requirements: ad-hoc queries, ML algorithms (enabler: working sets fit in memory); hardware: memory growing with Moore's law, I/O performance stagnant (HDDs); design: in-memory processing, multi-stage BSP model
- 2012 - requirement: build e2e big data pipelines; design: unified platform (SQL, ML, graphs, streaming)
- 2014 - requirements: new users (data scientists & analysts), improved performance; hardware: memory still growing fast, I/O perf. improving, CPU stagnant; design: DataFrame API, binary columnar storage representation, code generation

The same computation (average age by department) in the two APIs:

# RDD API
data.map(lambda x: (x.dept, [x.age, 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

# DataFrame API
data.groupBy("dept").avg("age")

DataFrame API
- A DataFrame is logically equivalent to a relational table
- Operators are mostly relational, with additional ones for statistical analysis, e.g., quantile, std, skew
- Popularized by R and Python/pandas, the languages of choice for data scientists
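
A small sketch of those operators in PySpark, assuming a SparkSession named spark and invented file and column names:

from pyspark.sql import functions as F

df = spark.read.parquet("people.parquet")        # hypothetical input

df.groupBy("dept").agg(
    F.avg("age"),                                # relational aggregate
    F.stddev("age"),                             # statistical extensions
    F.skewness("age"),
).show()

median = df.approxQuantile("age", [0.5], 0.01)   # approximate quantile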

DataFrames in Spark
- Make the DataFrame API declarative; unify DataFrames and SQL
- DataFrames and SQL share the same query optimizer and execution engine: Python, Java/Scala, and R DataFrames all compile to the same logical plan, so every optimization automatically applies to SQL and to Scala, Python, and R DataFrames
- Tightly integrated with the rest of Spark: the ML library takes DataFrames as input & output; RDDs convert easily to and from DataFrames
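
You can see the shared pipeline directly: for any DataFrame df (hypothetical here), explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan, the same stages an equivalent SQL query goes through.

df.filter(df.age > 21).groupBy("dept").count().explain(True)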

One Query Plan, One Execution Engine
[Chart: time for an aggregation benchmark in seconds (0-10), comparing DataFrame SQL, DataFrame R, DataFrame Python, DataFrame Scala, RDD Python, and RDD Scala]

What else does DataFrame enable?
- Typical DB optimizations across operators: join reordering, pushdown, etc.
- Compact binary representation: columnar, compressed format for caching
- Whole-stage code generation: removes expensive iterator calls and fuses across multiple operators
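
A sketch of where these surface in Spark 2.x, with an invented input path and columns:

df = spark.read.parquet("sales.parquet")

# Predicate pushdown: the filter is evaluated inside the Parquet scan;
# look for "PushedFilters" in the printed physical plan.
df.filter(df.amount > 100).select("region").explain()

# Columnar caching: a cached DataFrame is stored in a compressed,
# in-memory columnar format rather than as Java objects.
df.cache()

# Whole-stage code generation appears as operators marked with '*'
# (WholeStageCodegen), fused into a single generated function.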

TPC-DS: Spark 2.0 vs 1.6
[Chart: per-query runtime in seconds (0-600) for Spark 1.6 vs Spark 2.0; lower is better]

2016 (What's Next?)

What's Next?
- Application trends
- Hardware trends
- Challenges and techniques

Application Trends
- Data is only as valuable as the decisions and actions it enables
- What does that mean? Faster decisions beat slower decisions; decisions on fresh data beat decisions on stale data; decisions on personal data beat decisions on aggregate data

Application Trends
- Real-time decisions: decide in milliseconds, on live data (the current state of the environment), with strong security (privacy, confidentiality, integrity)

Example applications and their requirements:

Application           | Quality                         | Update latency | Decision latency | Security
Zero-time defense     | sophisticated, accurate, robust | sec            | sec              | privacy, integrity
Parking assistant     | sophisticated, robust           | sec            | sec              | privacy
Disease discovery     | sophisticated, accurate         | hours          | sec/min          | privacy, integrity
IoT (smart buildings) | sophisticated, robust           | min/hour       | sec              | privacy, integrity
Earthquake warning    | sophisticated, accurate, robust | min            | ms               | integrity
Chip manufacturing    | sophisticated, accurate, robust | min            | sec/min          | confidentiality, integrity
Fraud detection       | sophisticated, accurate         | min            | ms               | privacy, integrity
Fleet driving         | sophisticated, accurate, robust | sec            | sec              | privacy, integrity
Virtual assistants    | sophisticated, robust           | min/hour       | sec              | integrity
Video QoS at scale    | sophisticated                   | min            | ms/sec           | privacy, integrity

[Diagram overlaid on the table above: Data → Preprocess (e.g., train) → Intermediate data (e.g., model) → query engine / automatic decision engine → Decision System]

Addressing these challenges is the goal of the next Berkeley lab: the RISE (Real-time Intelligent Secure Execution) Lab

What's Next?
- Application trends
- Hardware trends
- Challenges and techniques

Moore's Law is Slowing Down

What Does It Mean?
- CPUs affected most: just 20-30%/year performance improvement; more complex layouts → harder to scale; gains come mostly from more cores → harder to take advantage of
- Memory: still grows at 30-40%/year; regular layouts, stacked technologies
- Network: grows at 30-50%/year; 100/200/400 GbE NICs on the horizon; full-bisection-bandwidth network topologies
- The CPU is the bottleneck, and it's getting worse!
- The memory-to-core ratio is increasing, e.g., AWS: 7-8 GB/vcore → 17 GB/vcore (X1)
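Back-of-envelope arithmetic on what those compounding rates imply (a sketch; the growth rates above are the only inputs):

import math

def doubling_time(annual_rate):
    # Years to double capacity/performance at a fixed annual growth rate.
    return math.log(2) / math.log(1 + annual_rate)

print(doubling_time(0.25))   # CPU at ~25%/yr: doubles every ~3.1 years
print(doubling_time(0.35))   # memory at ~35%/yr: doubles every ~2.3 years
print(doubling_time(0.40))   # network at ~40%/yr: doubles every ~2.1 years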

Unprecedented Hardware Innovation
- From CPUs to specialized chips: GPUs, FPGAs, ASICs/co-processors (e.g., TPU); tightly integrated, e.g., Intel's latest Xeon integrates a CPU & an FPGA
- New, disruptive memory technologies: HBM (High Bandwidth Memory), in the same package as the CPU

High Bandwidth Memory (HBM)
[Diagram: 2 channels @ 128 bits per DRAM die; 8 channels = 1024 bits per stack; source: http://www.amd.com/en-us/innovations/software-technologies/hbm]

High Bandwidth Memory (HBM)
[Diagram: 4 stacks = 4096 bits → ~500 GB/sec; source: http://www.amd.com/en-us/innovations/software-technologies/hbm]

3D XPoint Technology
- Developed by Intel and Micron; announced last year, with products released this year
- Characteristics: non-volatile memory; 2-5x DRAM latency; 8-10x the density of DRAM; 1000x more resilient than SSDs

Unprecedented Hardware Innovation
- From CPUs to specialized chips: GPUs, FPGAs, ASICs/co-processors (e.g., TPU); tightly integrated (e.g., Intel's latest Xeon integrates a CPU & an FPGA)
- New, disruptive memory technologies: HBM2 (8 DRAM chips/package → 1 TB/sec), 3D XPoint
- A "renaissance of hardware design" - David Patterson

2009 → 2012 → 2014 → 2016 in summary
- 2009 - requirements: ad-hoc queries, ML algorithms (enabler: working sets fit in memory); hardware: memory growing with Moore's law, I/O performance stagnant (HDDs); design: in-memory processing, multi-stage BSP model
- 2012 - requirement: build e2e big data pipelines; design: unified platform (SQL, ML, graphs, streaming)
- 2014 - requirements: new users (data scientists & analysts), improved performance; hardware: memory still growing fast, I/O perf. improving, CPU stagnant; design: DataFrame API, binary columnar storage representation, code generation
- 2016 - requirements: real-time decisions on fresh data, strong security; hardware: memory rapidly evolving, specialized processing (GPUs, FPGAs, ASICs, SGX, 100s-of-cores CPUs); design: ?

What's Next?
- Application trends
- Hardware trends
- Challenges and techniques

Complexity: Computation
[Diagram: software that once targeted the CPU alone now spans CPU, GPU, FPGA, and ASIC, plus SGX]

Complexity: Memory

2015:
- L1/L2 cache: ~1 ns
- L3 cache: ~10 ns
- Main memory: ~100 ns / ~80 GB/s / ~100 GB
- NAND SSD: ~100 usec / ~10 GB/s / ~1 TB
- Fast HDD: ~10 msec / ~100 MB/s / ~10 TB

2020:
- L1/L2 cache: ~1 ns
- L3 cache: ~10 ns
- HBM: ~10 ns / ~1 TB/s / ~10 GB
- Main memory: ~100 ns / ~80 GB/s / ~100 GB
- NVM (3D XPoint): ~1 usec / ~10 GB/s / ~1 TB
- NAND SSD: ~100 usec / ~10 GB/s / ~10 TB
- Fast HDD: ~10 msec / ~100 MB/s / ~100 TB

Complexity: More and More Choices
- Amazon EC2: t2.nano, t2.micro, t2.small, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m3.medium, c4.large, c4.xlarge, c4.2xlarge, c3.large, c3.xlarge, c3.4xlarge, r3.large, r3.xlarge, r3.4xlarge, i2.2xlarge, i2.4xlarge, d2.xlarge, d2.2xlarge, d2.4xlarge, ...
- Microsoft Azure: basic tier (A0, A1, A2, A3, A4); optimized compute (D1-D4, D11-D13, D1v2, D2v2, D3v2, D11v2, ...); latest CPUs (G1, G2, G3, ...); network optimized (A8, A9); compute intensive (A10, A11, ...)
- Google Compute Engine: n1-standard-1/2/4/8/16, n1-highmem-2/4/8, n1-highcpu-2/4/8/16/32, f1-micro, g1-small

Complexity: More and More Constraints
- Latency, accuracy, cost, security

Techniques for Conquering Complexity
- Use the additional choices to simplify!
- Expose and control tradeoffs
- Don't forget tried & true techniques

Use Choices to Simplify System Design
[Diagram of storage hierarchies over time - <2010: L1/L2 cache, L3 cache, main memory, fast HDD; 2011-2014: L1/L2 cache, L3 cache, main memory, NAND SSD, fast HDD; >2014: L1/L2 cache, L3 cache, main memory, NAND SSD, fast HDD]

Use Choices to Simplify System Design
- Example: NVIDIA DGX-1, a supercomputer for deep learning
[Diagram: Pascal P100 GPUs, each with HBM (720 GB/s / 16 GB), linked at 100 GB/s to main memory, backed by NVM (3D XPoint), NAND SSD, and fast HDD]

Use Choices to Simplify System Design
- Possible datacenter architecture (e.g., FireBox, UC Berkeley)
[Diagram: many nodes, each with L1/L2 cache, L3 cache, HBM, and main memory, sharing ultra-fast persistent disaggregated storage (~10 usec / ~10 GB/s / ~1 PB) in place of the per-node NVM (3D XPoint) / NAND SSD / fast HDD layers]

Use Choices to Simplify App Design
- Maybe there is no need to optimize every algorithm for every specialized processor: if you run in the cloud, just pick the best instance types for your app!

Expose and Control Tradeoffs
- Latency vs. accuracy: approximate query processing (e.g., BlinkDB); ensembles and correction of ML models (e.g., Clipper)
- Job completion time vs. cost: predict response times for a given configuration (e.g., Ernest)
- Security vs. latency vs. functionality: e.g., CryptDB, Opaque
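
Not BlinkDB itself, just a hedged illustration of the latency/accuracy dial using plain DataFrame sampling (df and its columns are hypothetical): answer on 1% of the data and trade error for response time.

fast_estimate = df.sample(withReplacement=False, fraction=0.01, seed=42) \
                  .groupBy("dept").avg("age")
fast_estimate.show()   # approximate averages, at a fraction of the cost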

Expose and Control Tradeoffs
- Caching vs. memory: HBM can be configured either as a cache or as an explicit memory region
- Declarative vs. procedural: enable users to pick specific query plans for complex declarative programs & complex environments
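
One existing example of that declarative/procedural dial in Spark, with invented DataFrame names: a broadcast hint overrides whatever join strategy the optimizer would have picked.

from pyspark.sql.functions import broadcast

joined = large_df.join(broadcast(small_df), "key")   # force a broadcast join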

Tried & True Techniques
- Sampling: scheduling (e.g., Sparrow), querying (e.g., BlinkDB), storage (e.g., KMN)
- Speculation: replicate time-sensitive requests/jobs (e.g., Dolly)
- Incremental updates: storage (e.g., IndexedRDDs) and ML models (e.g., Clipper)
- Cost-based optimization: pick target hardware at run time
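
Speculation is already a built-in knob in Spark itself (this is Spark's speculative execution, not Dolly); a minimal sketch:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.speculation", "true") \
                  .set("spark.speculation.quantile", "0.90")
sc = SparkContext(conf=conf)   # slow tasks now get speculative backup copies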

Summary
- Application and hardware trends often determine the solution
- We are at an inflection point in terms of both application and hardware trends
- Many research opportunities
- Be aware of complexity: use the myriad of choices to simplify!

Thanks