Data Intensive Scalable Computing


Candace Christiana Fitzgerald
5 years ago

1 Data Intensive Scalable Computing
Randal E. Bryant, Carnegie Mellon University

2 Examples of Big Data Sources
Wal-Mart
  - 267 million items/day, sold at 6,000 stores
  - HP built them 4 PB data warehouse
  - Mine data to manage supply chain, understand market trends, formulate pricing strategies
LSST
  - Chilean telescope will scan entire sky every 3 days
  - A 3.2 gigapixel digital camera
  - Generate 30 TB/day of image data

3 Why So Much Data?
We Can Get It
  - Automation + Internet
We Can Keep It
  - Seagate Barracuda 1.5 TB @ $150 (10¢ / GB)
We Can Use It
  - Scientific breakthroughs
  - Business process efficiencies
  - Realistic special effects
  - Better health care
Could We Do More?
  - Apply more computing power to this data

4 Google Data Center
  - Dalles, Oregon
  - Hydroelectric power @ 2¢ / kWh
  - 50 Megawatts
  - Enough to power 6,000 homes

5 Varieties of Cloud Computing
"I don't want to be a system administrator. You handle my data & applications."
  - Hosted services
  - Documents, web-based email, etc.
  - Can access from anywhere
  - Easy sharing and collaboration
"I've got terabytes of data. Tell me what they mean."
  - Very large, shared data repository
  - Complex analysis
  - Data-intensive scalable computing (DISC)

6 Oceans of Data, Skinny Pipes
1 Terabyte
  - Easy to store
  - Hard to move

Disks                       MB / s     Time
Seagate Barracuda           115        2.3 hours
Seagate Cheetah             125        2.2 hours

Networks                    MB / s     Time
Home Internet               < 0.625    > 18.5 days
Gigabit Ethernet            < 125      > 2.2 hours
PSC Teragrid Connection     < 3,750    > 4.4 minutes
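The times on this slide follow from dividing the data size by the sustained transfer rate. The sketch below reproduces the network figures, assuming a decimal terabyte (10^12 bytes) and a home Internet rate of 0.625 MB/s (about 5 Mbit/s), which is the rate consistent with the 18.5-day figure. The function names are illustrative, not taken from the slides.

```python
# Transfer time for 1 TB at a sustained rate: time = size / bandwidth.
TERABYTE = 10**12  # bytes; a decimal terabyte is an assumption


def transfer_seconds(size_bytes, mb_per_s):
    """Seconds needed to move size_bytes at mb_per_s megabytes per second."""
    return size_bytes / (mb_per_s * 10**6)


def pretty(seconds):
    """Format a duration in the largest convenient unit."""
    if seconds >= 86400:
        return f"{seconds / 86400:.1f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    return f"{seconds / 60:.1f} minutes"


# Upper-bound rates for the three network rows (MB/s).
LINKS = {
    "Home Internet": 0.625,
    "Gigabit Ethernet": 125,
    "PSC Teragrid Connection": 3750,
}

for name, rate in LINKS.items():
    print(f"{name}: {pretty(transfer_seconds(TERABYTE, rate))}")
```

Because the slide gives the rates as upper bounds, the computed times are lower bounds on the real transfer time.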

7 Data-Intensive System Challenge
For Computation That Accesses 1 TB in 5 minutes
  - Data distributed over 100+ disks
  - Assuming uniform data partitioning
  - Compute using 100+ processors
  - Connected by gigabit Ethernet (or equivalent)
System Requirements
  - Lots of disks
  - Lots of processors
  - Located in close proximity
  - Within reach of fast, local-area network
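A quick calculation shows why the slide asks for 100+ disks. The sketch below assumes a decimal terabyte and exactly 100 disks with perfectly uniform partitioning; `required_rates` is a hypothetical helper name.

```python
def required_rates(size_bytes, seconds, n_disks):
    """Return (aggregate MB/s, MB/s per disk) needed to read size_bytes in seconds."""
    aggregate_mb_s = size_bytes / seconds / 10**6
    return aggregate_mb_s, aggregate_mb_s / n_disks


# 1 TB in 5 minutes, spread uniformly over 100 disks.
aggregate, per_disk = required_rates(10**12, 5 * 60, 100)
print(f"aggregate: {aggregate:.0f} MB/s, per disk: {per_disk:.1f} MB/s")
```

The aggregate rate of roughly 3,300 MB/s is far beyond a single gigabit Ethernet link (under 125 MB/s), while about 33 MB/s per disk is modest. That gap is the reason the data must be spread across many disks with processors close to them.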

8 Desiderata for DISC Systems
Focus on Data
  - Terabytes, not tera-flops
Problem-Centric Programming
  - Platform-independent expression of data parallelism
Interactive Access
  - From simple queries to massive computations
Robust Fault Tolerance
  - Component failures are handled as routine events
Contrast to existing supercomputer / HPC systems

9 System Comparison: Programming Models
Conventional Supercomputers
  Layers (top to bottom): Application Programs, Software Packages, Machine-Dependent Programming Model, Hardware
  - Programs described at very low level
  - Specify detailed control of processing & communications
  - Rely on small number of software packages
  - Written by specialists
  - Limits classes of problems & solution methods
DISC
  Layers (top to bottom): Application Programs, Machine-Independent Programming Model, Runtime System, Hardware
  - Application programs written in terms of high-level operations on data
  - Runtime system controls scheduling, load balancing, ...

10 System Comparison: Reliability
Runtime errors commonplace in large-scale systems
  - Hardware failures
  - Transient errors
  - Software bugs
Conventional Supercomputers: Brittle Systems
  - Main recovery mechanism is to recompute from most recent checkpoint
  - Must bring down system for diagnosis, repair, or upgrades
DISC: Flexible Error Detection and Recovery
  - Runtime system detects and diagnoses errors
  - Selective use of redundancy and dynamic recomputation
  - Replace or upgrade components while system running
  - Requires flexible programming model & runtime environment

11 Exploring Parallel Computation Models
[Figure: models shown are MapReduce, Threads, MPI, PRAM; axis runs from Low Communication, Coarse-Grained to High Communication, Fine-Grained]
DISC + MapReduce Provides Coarse-Grained Parallelism
  - Computation done by independent processes
  - File-based communication
Observations
  - Relatively natural programming model
  - Research issue to explore full potential and limits
  - Dryad project at MSR
  - Pig project at Yahoo!

12 Existing HPC Machines
[Figure: Message Passing (processors P1-P5) and Shared Memory (processors P1-P5 attached to a common Memory)]
Characteristics
  - Long-lived processes
  - Make use of spatial locality
  - Hold all program data in memory
  - High bandwidth communication
Strengths
  - High utilization of resources
  - Effective for many scientific applications
Weaknesses
  - Very brittle: relies on everything working correctly and in close synchrony

13 HPC Fault Tolerance
[Figure: timeline of processes P1-P5 marked Checkpoint, Restore, Checkpoint, with the interval between the last checkpoint and the failure labeled Wasted Computation]
Checkpoint
  - Periodically store state of all processes
  - Significant I/O traffic
Restore
  - When failure occurs
  - Reset state to that of last checkpoint
  - All intervening computation wasted
Performance Scaling
  - Very sensitive to number of failing components
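The cost of checkpoint/restore can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a model of any real HPC system: time advances in unit ticks, each tick completes one step of work unless a failure hits, checkpoints cost nothing (the slide notes they actually cause significant I/O traffic), and a failure rolls progress back to the last checkpoint.

```python
def run_with_checkpoints(total_steps, interval, failures):
    """Simulate one job under checkpoint/restore.

    failures is a set of wall-clock ticks at which a failure occurs.
    Returns (wall_clock_ticks, wasted_steps).
    """
    done = 0    # steps of useful work completed so far
    saved = 0   # progress captured by the most recent checkpoint
    tick = 0    # wall-clock time
    wasted = 0  # steps that had to be redone
    while done < total_steps:
        tick += 1
        if tick in failures:
            wasted += done - saved  # everything since the checkpoint is lost
            done = saved            # restore: reset to the checkpointed state
            continue
        done += 1
        if done % interval == 0:
            saved = done            # checkpoint: store state
    return tick, wasted


print(run_with_checkpoints(10, 4, set()))   # no failures
print(run_with_checkpoints(10, 4, {7}))     # one failure
print(run_with_checkpoints(10, 4, {3, 7}))  # two failures
```

Even in this small model, each additional failure adds both the failed tick and all work since the last checkpoint, which is the sensitivity to failing components that the slide describes.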

14 Map/Reduce Operation
[Figure: Map/Reduce shown as repeated Map and Reduce stages]
Characteristics
  - Computation broken into many, short-lived tasks
  - Mapping, reducing
  - Use disk storage to hold intermediate results
Strengths
  - Great flexibility in placement, scheduling, and load balancing
  - Handle failures by recomputation
  - Can access large data sets
Weaknesses
  - Higher overhead
  - Lower raw performance
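A minimal single-process sketch of the idea, using word counting as the example. It is not Hadoop's or Google's API: all names are hypothetical, intermediate results sit in an in-memory dictionary rather than on disk, and a failure is modeled as a `RuntimeError` raised by a task. Because each task depends only on its input, a failed task can simply be run again.

```python
from collections import defaultdict


def map_words(chunk):
    """Map task: emit (word, 1) for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(word, counts):
    """Reduce task: sum all counts emitted for one word."""
    return word, sum(counts)


def run_task(task, *args, attempts=3):
    """Run a short-lived task, recomputing it if it fails."""
    for attempt in range(attempts):
        try:
            return task(*args)
        except RuntimeError:
            if attempt == attempts - 1:
                raise


def map_reduce(chunks, mapper, reducer):
    """Apply mapper to each chunk, group by key, then apply reducer per key."""
    intermediate = defaultdict(list)  # stands in for on-disk intermediate files
    for chunk in chunks:
        for key, value in run_task(mapper, chunk):
            intermediate[key].append(value)
    return dict(run_task(reducer, key, values)
                for key, values in sorted(intermediate.items()))


print(map_reduce(["big data big", "data pipes"], map_words, reduce_counts))
```

In a real deployment the chunks would be processed by independent tasks on different machines; the loop here only shows the data flow and the retry-by-recomputation policy.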

15 Generalizing Map/Reduce
E.g., Microsoft Dryad Project
[Figure: inputs x1, x2, x3, ..., xn each flow through a column of operators Op 1, Op 2, ..., Op k]
Computational Model
  - Acyclic graph of operators
  - But expressed as textual program
  - Each takes collection of objects and produces objects
  - Purely functional model
Implementation Concepts
  - Objects stored in files or memory
  - Any object may be lost; any operator may fail
  - Replicate & recompute for fault tolerance
  - Dynamic scheduling
  - # Operators >> # Processors
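The recompute side of this model can be sketched in a few lines. This is an illustration only and does not reflect Dryad's actual interface: `OperatorGraph` and its methods are hypothetical names, objects are kept in an in-memory cache, and replication and scheduling are left out. Since operators are purely functional, any lost object can be rebuilt by re-running its operator on its inputs.

```python
class OperatorGraph:
    """Acyclic graph of pure operators with recomputation of lost objects."""

    def __init__(self):
        self.ops = {}    # object name -> (operator function, input names)
        self.cache = {}  # object name -> computed object

    def add(self, name, fn, inputs=()):
        """Declare that object `name` is produced by fn applied to `inputs`."""
        self.ops[name] = (fn, list(inputs))

    def get(self, name):
        """Return the object, computing it (and any missing inputs) on demand."""
        if name not in self.cache:
            fn, inputs = self.ops[name]
            self.cache[name] = fn(*[self.get(i) for i in inputs])
        return self.cache[name]

    def lose(self, name):
        """Simulate losing a stored object."""
        self.cache.pop(name, None)


g = OperatorGraph()
g.add("x", lambda: [1, 2, 3])
g.add("squares", lambda xs: [v * v for v in xs], ["x"])
g.add("total", sum, ["squares"])

print(g.get("total"))
g.lose("squares")
g.lose("total")
print(g.get("total"))  # rebuilt from the surviving object "x"
```

The second call produces the same result as the first because nothing in the graph depends on hidden state, which is what makes recomputation a safe recovery strategy.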

16 Concluding Thoughts
Data-Intensive Computing Becoming Commonplace
  - Facilities available from Google/IBM, Yahoo!
  - Hadoop becoming platform of choice
Lots of applications are fairly straightforward
  - Use Map to do embarrassingly parallel execution
  - Make use of load balancing and reliable file system of Hadoop
What Remains
  - Integrating more demanding forms of computation
  - Computations over large graphs
  - Sparse numerical applications
  - Challenges: programming, implementation efficiency
