The Potential of Cloud Computing: Challenges, Opportunities, Impact


The Potential of Cloud Computing: Challenges, Opportunities, Impact
Tim Kraska, UC Berkeley, Reliable Adaptive Distributed Systems Laboratory
Image: John Curley http://www.flickr.com/photos/jay_que/1834540/

11/10/10, Systems Group/ETH Zurich [Analogy from IM 2/09 / Daniel Abadi]

Do you want milk?
Buy bottled milk: pay-per-use, lower maintenance cost, linear scaling, fault-tolerant.
Buy a cow: high upfront investment, high maintenance cost, produces a fixed amount of milk, stepwise scaling.

Traditional DB vs. Cloud DB
Traditional DB: high upfront investment, fixed amount of transactions (TRX), stepwise scaling, high maintenance cost.
Cloud DB: pay-per-use, linear scaling, fault-tolerant, lower maintenance cost.

Get Your Own Supercomputer
41.88 Teraflops on the LINPACK benchmark: faster than the #145 entry on the current Top500 list
880 EC2 cluster nodes
No batch queues: start your job in minutes
No grant writing: just a credit card ($1400/hour, 1 hour minimum purchase)
No need to predict usage in advance: pay only for what you use, get more on demand
Lease several simultaneously
MPI and Matlab available, among other packages

Warehouse-Scale Computers
Built to support consumer demand for Web services (email, social networking, etc.)
Private clouds of Google, Microsoft, Amazon, ...
Warehouse-scale buying power = 5-7x cheaper hardware, networking, and administration cost
Photos: Cnet News, Sun Microsystems, datacenterknowledge.com

2007: Public Cloud Computing Arrives!

Type & Price/Hour         1GHz core eqv.   RAM      Disk and/or I/O
Small - $0.085            1                1.7 GB   160 GB
Large - $0.34             4                7.5 GB   850 GB, 2 spindles
XLarge - $0.68            8                15 GB    1690 GB, 3 spindles
Cluster ("HPC") - $1.60   32*              23 GB    1690 GB + 10Gig Ethernet
* 2x quad-core Intel Xeon (Nehalem)

Unlike the Grid, computing on-demand for the public
Virtual machines from $0.085 to $1.60/hr
Pay as you go with a credit card, 1 hr minimum
Cheaper if willing to share or risk getting kicked off
Machines provisioned & booted in a few minutes
1,000 machines for 1 hr = 1 machine for 1,000 hrs

What's In It For You?
Cloud computing progress, 2008-2010:
Web startups
CS researchers
Enterprise apps, e.g. email
Educators
Scientific & high-performance computing (HPC)

What's in it for you?
What: low cost + on-demand + cost-associativity = democratization of supercomputing
How: sophisticated software & operational expertise mask the unreliability of commodity components
Example: MapReduce
A LISP language construct (~1960) for elegantly expressing data-parallel operations with sequential code
MapReduce (Google) and Hadoop (open source) provide failure masking & resource management for performing this on commodity clusters
Warehouse-scale software engineering issues are hidden from the application programmer

Map/Reduce in a Nutshell
Given: a (very large) dataset, and a well-defined computation task to be performed on elements of this dataset (preferably in a parallel fashion on a large cluster)
The MapReduce programming model/abstraction/framework:
Just express what you want to compute (map() & reduce())
Don't worry about parallelization, fault tolerance, data distribution, or load balancing (MapReduce takes care of these)
What changes from one application to another is the actual computation; the programming structure stays similar [OSDI 04]

Map/Reduce in a Nutshell
Here is the framework in simple terms:
Read lots of data
Map: extract something that you care about from each record (similar to map in functional programming)
Shuffle and sort
Reduce: aggregate, summarize, filter, or transform (similar to fold in functional programming)
Write the results
One can use as many Maps and Reduces as required to model a given problem [OSDI 04]

Map/Reduce Example: Count Word Occurrences in a Document
map(k1, v1) → list(k2, v2)          (similar to map in FP)
reduce(k2, list(v2)) → list(v2)     (similar to fold in FP)

map (String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce (String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
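For readers who want to run the word-count example, here is a minimal single-process Python sketch of the same map/shuffle/reduce structure. The function and variable names are ours, not from the OSDI paper, and a real MapReduce runtime would distribute each phase across a cluster:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair per word, like the slide's map().
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum the counts for one word, like the slide's reduce().
    return sum(counts)

def mapreduce(documents):
    # Shuffle/sort phase: group intermediate values by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
print(mapreduce(docs))  # {'the': 3, 'quick': 1, ...}
```

The point of the abstraction is that only `map_fn` and `reduce_fn` change between applications; the grouping machinery in `mapreduce` stays the same.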

Google's MapReduce Model
[Diagram: input key/value pairs are read from data stores 1..n by parallel map tasks, each emitting (key, values) pairs; a barrier then aggregates intermediate values by output key; one reduce task per key produces the final values.] [OSDI 04]

Parallel Dataflow Systems and Languages
High-level dataflow languages: Sawzall, Pig Latin, HiveQL, DryadLINQ
Parallel dataflow systems: MapReduce, Apache Hadoop, Dryad, ...
A high-level language provides:
more transparent program structure
easier program development and maintenance
automatic optimization opportunities

Yahoo! Pig & Pig Latin
[Diagram: a dataflow program written in the Pig Latin language is compiled by the Pig system into a physical dataflow job that runs on Hadoop.]

Example: Web Data Analysis
Find the top 10 most visited pages in each category.

Visits:
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

Url Info:
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9

Example: Data Flow Diagram
Load Visits → Group by url → Foreach url generate count
Load Url Info → Join on url → Group by category → Foreach category generate top10 urls

Example in Pig Latin

visits      = load '/data/visits' as (user, url, time);
gvisits     = group visits by url;
visitcounts = foreach gvisits generate url, count(visits);

urlinfo     = load '/data/urlInfo' as (url, category, prank);
visitcounts = join visitcounts by url, urlinfo by url;

gcategories = group visitcounts by category;
topurls     = foreach gcategories generate top(visitcounts, 10);

store topurls into '/data/topUrls';
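To make the semantics of the Pig Latin script concrete, here is a small in-memory Python sketch of the same query over the toy Visits and Url Info tables from the slides (the variable names mirror the script; this is our illustration, not how Pig executes it):

```python
from collections import Counter, defaultdict

# Toy inputs mirroring the Visits and Url Info tables.
visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
urlinfo = {"cnn.com": ("News", 0.9), "bbc.com": ("News", 0.8),
           "flickr.com": ("Photos", 0.7), "espn.com": ("Sports", 0.9)}

# group visits by url; foreach url generate count
visitcounts = Counter(url for _user, url, _time in visits)

# join visitcounts by url, urlinfo by url; group by category
by_category = defaultdict(list)
for url, count in visitcounts.items():
    category, _prank = urlinfo[url]
    by_category[category].append((url, count))

# foreach category generate top(visitcounts, 10)
topurls = {cat: sorted(pairs, key=lambda p: p[1], reverse=True)[:10]
           for cat, pairs in by_category.items()}
print(topurls)  # {'News': [('cnn.com', 2), ('bbc.com', 1)], 'Photos': [('flickr.com', 1)]}
```

Pig runs the same logical plan, but over HDFS files and as a pipeline of MapReduce jobs rather than in-memory dictionaries.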

Pig System Overview
[Diagram: the user writes a Pig Latin (or SQL) query; the system automatically rewrites and optimizes it, Pig compiles it to Hadoop MapReduce jobs, and the jobs run on the cluster.]
ETH Zurich, Fall 2008: Massively Parallel Data Analysis with MapReduce

Compilation into MapReduce
Every (co)group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases.
[Diagram: Load Visits and Group by url form Map 1/Reduce 1; Foreach url generate count, Load Url Info, and Join on url form Map 2/Reduce 2; Group by category and Foreach category generate top10(urls) form Map 3/Reduce 3.]

MapReduce in Practice
Example: spam classification
Training: 10^7 URLs x 64 KB data each = 640 GB of data
One heavy-duty server: ~270 hours
100 servers in the cloud: ~3 hours (= ~$255)
Rapid uptake in other scientific research:
Large-population genetic risk analysis & simulation (Harvard Medical School)
Genome sequencing (UNC Chapel Hill Cancer Center)
Information retrieval research (U. Glasgow Terrier)
Compact Muon Solenoid experiment (U. Nebraska Lincoln)
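The spam-classification numbers above can be checked with back-of-envelope arithmetic; the per-server-hour price below is our assumption, inferred from the slide's ~$255 total, not a quoted rate:

```python
# Data volume: 10^7 URLs at 64 KB each (decimal kilobytes).
urls = 10**7
bytes_per_url = 64_000
total_gb = urls * bytes_per_url / 1e9
print(f"training data: ~{total_gb:.0f} GB")   # ~640 GB, as on the slide

# Time: assuming ideal linear speedup across 100 servers.
single_server_hours = 270
servers = 100
cloud_hours = single_server_hours / servers
print(f"{servers} servers: ~{cloud_hours:.1f} h")  # ~2.7 h, the slide's "~3 hours"

# Cost: assumed rate implied by the slide's ~$255 for ~3 server-hours x 100.
price_per_server_hour = 0.85
print(f"cost: ~${servers * 3 * price_per_server_hour:.0f}")
```

The key point is cost-associativity: 100 servers for 3 hours costs the same as 1 server for 300 hours, but the answer arrives two weeks sooner.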

Challenge: Cloud Programming
Challenge: exposing parallelism
MapReduce relies on embarrassing parallelism
Programmers must (re)write problems to expose this parallelism, if it's there to be found
Tools still primitive, though progressing rapidly:
MapReduce: early success story
Pig (Yahoo! Research), Hive (Apache Foundation)
Pregel: new framework for graph computations
Mesos: share a cluster among Hadoop, MPI, and interactive jobs (UC Berkeley)
Challenge for tool authors: parallel software is hard to debug and operate reliably (YY Zhou)

Challenge: Big Data

Application                                Data generated per day
DNA sequencing (Illumina HiSeq machine)    1 TB
Large Synoptic Survey Telescope            30 TB; 400 Mbps sustained data rate between Chile and NCSA
Large Hadron Collider                      60 TB

Challenge: long-haul networking is the most expensive cloud resource, and the one improving most slowly.
Copy 8 TB to Amazon over a ~20 Mbps network => ~35 days, ~$800 in transfer fees
How about shipping an 8 TB drive to Amazon instead? => 1 day, ~$150 (shipping + transfer fees)
Source: Ed Lazowska, escience 2010, Microsoft Cloud Futures Workshop, lazowska.cs.washington.edu/cloud2010.pdf
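The network-transfer estimate above also checks out arithmetically; the per-GB inbound fee below is our assumption (roughly the public cloud pricing of that era), chosen to match the slide's ~$800:

```python
# 8 TB over a ~20 Mbps sustained link (decimal TB and Mbps).
tb = 8
data_bits = tb * 1e12 * 8
link_bps = 20e6
days = data_bits / link_bps / 86400
print(f"transfer time: ~{days:.0f} days")  # ~37 days, close to the slide's "~35 days"

# Assumed inbound transfer fee of ~$0.10/GB.
fee_per_gb = 0.10
print(f"transfer fees: ~${tb * 1000 * fee_per_gb:.0f}")  # ~$800
```

At these rates, physically shipping a disk beats the network by more than a month, which is why bulk-import-by-courier was a serious option.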

An Analogy: Clusters of Commodity PCs vs. Symmetric Multiprocessors, c. 1995
Potential advantages of clusters: incremental scaling, absolute capacity, economy of using commodity HW & SW
1995: SaaS ran on SMPs; software architecture was SMP-centric
2010: Berkeley undergrads prototype SaaS in 6-8 weeks and deploy it on cloud computing

Conclusion
Democratization of supercomputing capability
Time to answer may be faster even if the hardware isn't
Writing a grant proposal around a supercomputer?
Software & hardware are rapidly improving as vendors listen to scientific/HPC customers
HPC has an opportunity to influence the design of commodity equipment
Impact: a potential democratizing effect comparable to the microprocessor

Acknowledgments
Colleagues at the RAD Lab, UC Berkeley and the Systems Group, ETH Zurich
Colleagues at our industrial collaborators

Thank you!