Data Intensive Scalable Computing

Thanks to: Randal E. Bryant, Carnegie Mellon University
http://www.cs.cmu.edu/~bryant

Big Data Sources: Seismic Simulations
- Wave propagation during an earthquake
- Large-scale parallel numerical simulations
- Pipeline: 500 GB geological model (3D volume structure) -> mesh generation (10-100 GB unstructured mesh) -> numerical simulation (12+ hours on 4000+ processors) -> 1-10 TB spatio-temporal earthquake dataset
- The output ("Quake 4D wavefields", O(TB)) uses a space-time representation: 3D space sampled at time steps t0, t1, t2, ..., tn-1

Big Data Sources: LSST
- LSST: Large Synoptic Survey Telescope (2016)
- Pixel count: 3.2 Gpixels; dynamic range: 16 bits; readout time: 2 sec.
- Nightly data generation rate: 15 TB (at 16 bits)
- Average yearly data generation rate: 6.8 PB
- Total disk storage: 200 PB

Examples of Big Data Sources
- Wal-Mart: 267 million items/day, sold at 6,000 stores; HP is building them a 4 PB data warehouse; data is mined to manage the supply chain, understand market trends, and formulate pricing strategies
- Sloan Digital Sky Survey: New Mexico telescope captures 200 GB of image data per day; latest dataset release: 10 TB, 287 million celestial objects; SkyServer provides SQL access

Our Data-Driven World
- Science: databases from astronomy, genomics, natural languages, seismic modeling, ...
- Humanities: scanned books, historic documents, ...
- Commerce: corporate sales, stock market transactions, census data, airline traffic, ...
- Entertainment: Internet images, Hollywood movies, MP3 files, ...
- Medicine: MRI & CT scans, patient records, ...

Why So Much Data?
- We can get it: automation + the Internet
- We can keep it: 1 TB disk @ $199 (20¢/GB)
- We can use it: scientific breakthroughs, business process efficiencies, realistic special effects, better health care
Could we do more?
- Apply more computing power to this data

Oceans of Data, Skinny Pipes
1 Terabyte: easy to store, hard to move

Disks                      MB/s      Time to move 1 TB
Seagate Barracuda            78      3.6 hours
Seagate Cheetah             125      2.2 hours

Networks                   MB/s      Time to move 1 TB
Home Internet             < 0.625    > 18.5 days
Gigabit Ethernet          < 125      > 2.2 hours
PSC Teragrid connection   < 3,750    > 4.4 minutes
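These times follow directly from time = size / bandwidth. A quick Python sketch reproducing the table's numbers (assuming 1 TB = 10^6 MB, which is what the original figures imply):

    # Transfer time = data size / sustained bandwidth.
    # Assumes 1 TB = 10^6 MB, matching the times quoted in the table.
    ONE_TB_MB = 1_000_000

    def hours_to_move(size_mb, bandwidth_mb_per_s):
        return size_mb / bandwidth_mb_per_s / 3600.0

    for name, bw in [("Seagate Barracuda", 78), ("Seagate Cheetah", 125),
                     ("Home Internet", 0.625), ("Gigabit Ethernet", 125),
                     ("PSC Teragrid", 3750)]:
        print(f"{name}: {hours_to_move(ONE_TB_MB, bw):,.1f} hours")
    # Barracuda: 3.6 h, Cheetah: 2.2 h, Home Internet: 444.4 h (~18.5 days),
    # Gigabit Ethernet: 2.2 h, Teragrid: 0.1 h (~4.4 minutes)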

Data-Intensive System Challenge
For computation that accesses 1 TB in 5 minutes:
- Data distributed over 100+ disks
  - Assuming uniform data partitioning
- Compute using 100+ processors
- Connected by at least gigabit Ethernet
System requirements:
- Lots of disks
- Lots of processors
- Located in close proximity
  - Within reach of a fast, local-area network
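A back-of-the-envelope check of the disk count (the per-disk rate below is an assumed figure for a commodity disk, not from the slide):

    # Back-of-the-envelope: disks needed to stream 1 TB in 5 minutes.
    # The per-disk bandwidth is an assumed commodity-disk figure.
    size_mb = 1_000_000            # 1 TB as 10^6 MB
    window_s = 5 * 60              # 5 minutes
    per_disk_mb_per_s = 35         # assumed sustained rate per disk

    aggregate_mb_per_s = size_mb / window_s       # ~3,333 MB/s in total
    disks = aggregate_mb_per_s / per_disk_mb_per_s
    print(f"Need ~{aggregate_mb_per_s:,.0f} MB/s, i.e. ~{disks:.0f} disks")
    # ~95 disks, consistent with the slide's "100+ disks"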

Google's Computing Infrastructure
System:
- ~3 million processors, in clusters of ~2,000 processors each
- Commodity parts
  - x86 processors, IDE disks, Ethernet communications
  - Gain reliability through redundancy & software management
- Partitioned workload
  - Data: web pages and indices distributed across processors
  - Function: crawling, index generation, index search, document retrieval, ad placement
A Data-Intensive Scalable Computer (DISC):
- Large-scale computer centered around data
  - Collecting, maintaining, indexing, computing
- Similar systems at Microsoft & Yahoo
Barroso, Dean, Hölzle, "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro, 2003

MapReduce Programming Model
Dean & Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004
[Figure: inputs x1, x2, x3, ..., xn each pass through a Map task M, producing key-value pairs that Reduce aggregates under keys k1, ..., kr]
- Map computation across many objects
  - E.g., 10^10 Internet web pages
- Aggregate results in many different ways
- System deals with issues of resource allocation & reliability
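A minimal single-machine sketch of the programming model, using the classic word-count example; this illustrates only the model, not Google's distributed implementation:

    # Minimal single-process sketch of the MapReduce model: word count.
    # A real system shards the input, runs many workers in parallel, and
    # handles failures; none of that appears here.
    from collections import defaultdict

    def map_fn(document):
        # Emit a (word, 1) pair for every word in the input.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_fn(key, values):
        # Aggregate all counts for one key.
        return (key, sum(values))

    def mapreduce(documents):
        groups = defaultdict(list)
        for doc in documents:                 # "map phase"
            for key, value in map_fn(doc):
                groups[key].append(value)     # shuffle: group by key
        return [reduce_fn(k, vs) for k, vs in groups.items()]  # "reduce phase"

    print(mapreduce(["the quick brown fox", "the lazy dog"]))
    # -> [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]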

Comparing Parallel Computation Models
[Figure: models placed on a spectrum from low-communication, coarse-grained (MapReduce, SETI@home) to high-communication, fine-grained (Threads, MPI, PRAM)]
DISC + MapReduce provides coarse-grained parallelism:
- Computation done by independent processes
- File-based communication
Observations:
- Relatively natural programming model
- Research issue to explore full potential and limits
  - Dryad project at MSR
  - Pig project at Yahoo!

Desiderata for DISC Systems
- Focus on data: terabytes, not teraflops
- Problem-centric programming: platform-independent expression of data parallelism
- Interactive access: from simple queries to massive computations
- Robust fault tolerance: component failures are handled as routine events

System Comparison: Programming Models
Conventional supercomputers:
- Stack: application programs -> software packages -> machine-dependent programming model -> hardware
- Programs described at a very low level
  - Specify detailed control of processing & communications
- Rely on a small number of software packages
  - Written by specialists
  - Limits classes of problems & solution methods
DISC:
- Stack: application programs -> machine-independent programming model -> runtime system -> hardware
- Application programs written in terms of high-level operations on data
- Runtime system controls scheduling, load balancing, ...

Compare to Transaction Processing
- Main commercial use of large-scale computing
  - Banking, finance, retail transactions, airline reservations
- Stringent functional requirements
  - Only one person gets the last $1 from a shared bank account
  - Must not lose money when transferring between accounts
- Small number of high-performance, high-reliability servers
Our needs are different:
- More relaxed consistency requirements
- Fewer sources of updates; write-once, read-many data
- Individual computations access more data

CS Research Issues
- Applications: language translation, image processing, ...
- Application support: machine learning over very large data sets; web crawling
- Programming: abstract programming models to support large-scale computation; distributed databases
- System design: error detection & recovery mechanisms; resource scheduling and load balancing; distribution and sharing of data across the system

HPC Fault Tolerance
[Figure: timeline of processes P1-P5; state is checkpointed periodically, and a failure forces a restore to the last checkpoint, wasting the intervening computation]
- Checkpoint
  - Periodically store the state of all processes
  - Significant I/O traffic
- Restore
  - When a failure occurs, reset state to that of the last checkpoint
  - All intervening computation is wasted
- Performance scaling
  - Very sensitive to the number of failing components
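The sensitivity can be made concrete with a standard rule of thumb: a machine of N nodes fails roughly N times as often as one node, and Young's approximation puts the optimal checkpoint interval at sqrt(2 x checkpoint cost x MTBF). A sketch with illustrative, assumed numbers:

    # Rough model of checkpoint/restart overhead at scale.
    # The MTBF and checkpoint cost are illustrative assumptions.
    import math

    node_mtbf_h = 10_000       # assumed mean time between failures of one node
    checkpoint_cost_h = 0.1    # assumed time to write one global checkpoint

    for n_nodes in [100, 1_000, 10_000, 100_000]:
        system_mtbf_h = node_mtbf_h / n_nodes
        # Young's approximation for the optimal checkpoint interval:
        interval_h = math.sqrt(2 * checkpoint_cost_h * system_mtbf_h)
        # Fraction of time spent checkpointing (ignoring restart/rework):
        overhead = checkpoint_cost_h / interval_h
        print(f"{n_nodes:>7} nodes: system MTBF {system_mtbf_h:8.2f} h, "
              f"checkpoint every {interval_h:5.2f} h, overhead ~{overhead:.0%}")
    # Overhead grows from ~2% at 100 nodes to ~71% at 100,000 nodes.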

Map/Reduce Operation
[Figure: a sequence of jobs, each a Map stage feeding a Reduce stage]
Characteristics:
- Computation broken into many short-lived tasks (mapping, reducing)
- Uses disk storage to hold intermediate results
Strengths:
- Great flexibility in placement, scheduling, and load balancing
- Handles failures by recomputation
- Can access large data sets
Weaknesses:
- Higher overhead
- Lower raw performance
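A minimal sketch of "handles failures by recomputation": because tasks are short-lived and deterministic over their input split, a failed task can simply be rerun, possibly on another worker. The function names here are invented for illustration, not any real framework's API:

    # Illustrative sketch of failure handling by recomputation: a failed
    # short-lived task is simply retried, possibly on another worker.
    import random

    def run_map_task(split):
        if random.random() < 0.2:                  # simulated worker failure
            raise RuntimeError("worker died")
        return [(w, 1) for w in split.split()]     # deterministic over its input

    def run_with_retries(task, split, max_attempts=4):
        for attempt in range(1, max_attempts + 1):
            try:
                return task(split)
            except RuntimeError:
                print(f"attempt {attempt} failed; rescheduling task")
        raise RuntimeError("task failed too many times")

    print(run_with_retries(run_map_task, "the quick brown fox"))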

Choosing Execution Models
Message passing / shared memory:
- Achieves high performance when everything works well
- Requires careful tuning of programs
- Vulnerable to single points of failure
Map/Reduce:
- Allows for an abstract programming model
- More flexible, adaptable, and robust
- Performance limited by disk I/O
Alternatives?
- Is there some way to combine the two and get the strengths of both?

Getting Started
Goal: get faculty & students active in DISC
Hardware: rent from Amazon
- Elastic Compute Cloud (EC2): generic Linux cycles for $0.10/hour ($877/yr)
- Simple Storage Service (S3): network-accessible storage for $0.15/GB/month ($1,800/TB/yr)
- Example: maintain a crawled copy of the web (50 TB, 100 processors, 0.5 TB/day refresh) for ~$250K/year
Software: Hadoop project
- Open-source project providing a file system and MapReduce
- Supported and used by Yahoo
- Prototype on a single machine, then map onto a cluster
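The dollar figures are simple unit conversions; a quick check (the breakdown of the ~$250K example below is an assumption, since the slide does not itemize it):

    # Reproducing the slide's cost arithmetic (2008 AWS prices quoted above).
    ec2_per_hour = 0.10
    s3_per_gb_month = 0.15

    ec2_per_year = ec2_per_hour * 24 * 365.25        # ~$877 per processor-year
    s3_per_tb_year = s3_per_gb_month * 1000 * 12     # ~$1,800 per TB-year
    print(f"EC2: ${ec2_per_year:,.0f}/yr   S3: ${s3_per_tb_year:,.0f}/TB/yr")

    # Example workload: 100 processors year-round plus 50 TB stored.
    compute = 100 * ec2_per_year                     # ~$87,700
    storage = 50 * s3_per_tb_year                    # ~$90,000
    print(f"Compute + storage: ~${compute + storage:,.0f}/yr")
    # ~$178K; the slide's ~$250K/yr presumably also covers the 0.5 TB/day
    # refresh traffic and other data-transfer charges.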

Rely on Kindness of Others
- Google is setting up a dedicated cluster for university use
  - Loaded with open-source software, including Hadoop
- IBM is providing additional software support
- NSF will determine how the facility should be used

More Sources of Kindness
- Yahoo! is a major supporter of Hadoop
- Yahoo! plans to work with other universities

Beyond the U.S.

Testbed for system research in DISC systems
From "HP, Yahoo and Intel Create Compute Cloud" (Stacey Higginbotham, July 29, 2008):
"At long last, Hewlett-Packard is stepping up with an answer to cloud computing by inking a partnership with two other big technology vendors and three universities to create a cloud computing testbed. Through its R&D unit, HP Labs, the computing giant has teamed up with Intel, Yahoo, the Infocomm Development Authority of Singapore (IDA), the University of Illinois at Urbana-Champaign, the National Science Foundation (NSF) and the Karlsruhe Institute of Technology in Germany."

More Information
- "Data-Intensive Supercomputing: The Case for DISC," Tech Report CMU-CS-07-128
- Available from http://www.cs.cmu.edu/~bryant