Managing and Executing Loosely-Coupled Large-Scale Applications on Clusters, Grids, and Supercomputers

Size: px

Start display at page:

Download "Managing and Executing Loosely-Coupled Large-Scale Applications on Clusters, Grids, and Supercomputers"

Horace Richardson
5 years ago
Views:

Managing and Executing Loosely-Coupled Large-Scale Applications on Clusters, Grids, and Supercomputers Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago

1 Managing and Executing Loosely-Coupled Large-Scale Applications on Clusters, Grids, and Supercomputers Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago Collaborators: Ian Foster (UC/CI/ANL), Yong Zhao (MS), Mike Wilde (CI/ANL), Zhao Zhang (CI), Alex Szalay (JHU), Jerry Yan (NASA/ARC), Catalin Dumitrescu (FANL), many others from Swift and Falkon teams GlobusWorld08 May 14 th, 2008

2 What will we do with 1+ Exaflops and 1M+ cores?

3 1) Tackle Bigger and Bigger Problems Computational Scientist as Hero 5/19/2008 Managing and Executing Loosely-Coupled Large-Scale Applications on 3

4 2) Tackle Increasingly Complex Problems Computational Scientist as Logistics Officer 4

5 Problem Types Input data size Hi Med Data analysis, mining Big data and many tasks Lo Heroic MPI tasks Many loosely coupled tasks 1 1K 1M Number of tasks 5

6 Challenge #1: Long Queue Times Wait queue times are typically longer than the job duration times SDSC DataStar 1024 Processor Cluster

7 Challenge #2: Slow Job Dispatch Rates Production LRMs ~1 job/sec dispatch rates 100% What job durations are 90% 80% 70% needed for 90% efficiency: Production LRMs: 900 sec Development LRMs: 100 sec Experimental LRMs: 50 sec 1~10 sec should be possible Medium Size Grid Site (1K processors) 0% Task Length (sec) 1 task/sec (i.e. PBS, Condor 6.8) 10 tasks/sec (i.e. Condor 6.9.2) 100 tasks/sec 500 tasks/sec (i.e. Falkon) 1K tasks/sec 10K tasks/sec 100K tasks/sec 1M tasks/sec System Comments Throughput (tasks/sec) Condor (v6.7.2) - Production Dual Xeon 2.4GHz, 4GB 0.49 PBS (v2.1.8) - Production Dual Xeon 2.4GHz, 4GB 0.45 Condor (v6.7.2) - Production Quad Xeon 3 GHz, 4GB 2 Condor (v6.8.2) - Production /19/2008 Condor Managing (v6.9.3) and Executing - Development Loosely-Coupled Large-Scale Applications on 11 7 Condor-J2 - Experimental Quad Xeon 3 GHz, 4GB 22 Efficiency 60% 50% 40% 30% 20% 10%

8 Challenge #3: Poor Scalability of Shared File Systems Throughput (Mb/s) GPFS R LOCAL R GPFS R+W LOCAL R+W Number of Nodes GPFS vs. LOCAL Read Throughput 1 node: 0.48Gb/s vs. 1.03Gb/s 2.15x 160 nodes: 3.4Gb/s vs. 165Gb/s 48x Read+Write Throughput: 1 node: 0.2Gb/s vs. 0.39Gb/s 1.95x 160 nodes: 1.1Gb/s vs. 62Gb/s 55x Metadata (mkdir / rm -rf) 1 node: 151/sec vs. 199/sec 1.3x 160 nodes: 21/sec vs /sec 1516x 8

9 Hypothesis Significant performance improvements can be obtained in the analysis of large dataset by leveraging information about data analysis workloads rather than individual data analysis tasks. Important concepts related to the hypothesis Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions Data locality is crucial to the efficient use of large scale distributed systems for scientific and data-intensive applications Allocate computational and caching storage resources, co-scheduled to optimize workload performance 9

10 Falkon: a Fast and Light-weight task execution framework Goal: enable the rapid and efficient execution of many independent jobs on large compute clusters Combines three components: a streamlined task dispatcher able to achieve order-ofmagnitude higher task dispatch rates than conventional schedulers Challenge #1 resource provisioning through multi-level scheduling techniques Challenge #2 data diffusion and data-aware scheduling to leverage the co-located computational and storage resources Challenge #3 10

11 Falkon Overview 11

Dispatcher Throughput Fast: Up to 3700 tasks/sec Scalable: 54,000 processors 1,500,000 tasks queued Efficiency 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% ANL/UC, Java, 200 CPUs ANL/UC, C, 200 CPUs

12 Dispatcher Throughput Fast: Up to 3700 tasks/sec Scalable: 54,000 processors 1,500,000 tasks queued Efficiency 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% ANL/UC, Java, 200 CPUs ANL/UC, C, 200 CPUs SiCortex, C, 5760 CPUs BG/P, C, 2048 CPUs Task Length (sec) Throughput (tasks/sec) ANL/UC, Java 200 CPUs 3773 ANL/UC, Java Bundling CPUs 2534 ANL/UC, C 200 CPUs 3186 SiCortex, C 5760 CPUs 1758 BlueGene/P, C 1024 CPUs Executor Implementation and Various Systems Efficient: High efficiency with second long tasks on 1000s of processors 12

13 Falkon Integration with Swift Application #Tasks/workflow #Stages ATLAS: High Energy Physics Event Simulation 500K 1 fmri DBIC: AIRSN Image Processing 100s 12 FOAM: Ocean/Atmosphere Model GADU: Genomics 40K 4 HNL: fmri Aphasia Study NVO/NASA: Photorealistic Montage/Morphology 1000s 16 QuarkNet/I2U2: Physics Science Education 10s 3 ~ 6 RadCAD: Radiology Classifier Training 1000s 5 SIDGrid: EEG Wavelet Processing, Gaze Analysis 100s 20 SDSS: Coadd, Cluster Search 40K, 500K 2, 8 SDSS: Stacking, AstroPortal 10Ks ~ 100Ks 2 ~ 4 MolDyn: Molecular Dynamics 1Ks ~ 20Ks 8 MARS: Economic Modeling 1M~1B 1 DOCK: : Molecular Dynamics 1B 1 Specification Abstract computation SwiftScript Compiler Virtual Data Catalog SwiftScript Swift Architecture Scheduling Execution Engine (Karajan w/ Swift Runtime) C C C C Swift runtime callouts Status reporting Provenance collector Execution Virtual Node(s) Provenance data launcher Provenance data launcher file1 App F1 file2 App F2 file3 Provisioning Falkon Resource Provisioner Amazon EC

14 Functional MRI (fmri) Wide range of analyses Testing, interactive analysis, production runs Data mining Parameter studies 14

15 fmri Application GRAM vs. Falkon: 85%~90% lower run time GRAM/Clustering vs. Falkon: 40%~74% lower run time GRAM GRAM/Clustering Falkon Time (s) /19/2008 Managing Falkon: and a Executing Fast and Light-weight Loosely-Coupled task Large-Scale execution framework Applications on Input Data Size (Volumes) 15

16 B. Berriman, J. Good (Caltech) J. Jacob, D. Katz (JPL) 16

17 Montage Application GRAM/Clustering vs. Falkon: 57% lower application run time MPI* vs. Falkon: 4% higher application run time * MPI should be lower bound GRAM/Clustering MPI Falkon Time (s) mproject mdiff/fit mbackground madd(sub) 5/19/2008 Managing Falkon: and a Executing Fast and Light-weight Loosely-Coupled task Large-Scale execution framework Applications on Components madd total 17

18 MolDyn Application 244 molecules jobs seconds on 216 CPUs CPU hours Efficiency: 99.8% Speedup: 206.9x 8.2x faster than GRAM/PBS 50 molecules w/ GRAM (4201 jobs) 25.3 speedup Task ID Time (sec) waitqueuetime exectime resultsqueuetime 18

19 MARS Economic Modeling on IBM BG/P CPU Cores: 2048 Tasks: Micro-tasks: Elapsed time: 1601 secs CPU Hours: 894 Speedup: 1993X (ideal 2048) Efficiency: 97.3% 19

20 Many Many Tasks: Identifying Potential Drug Targets 200+ Protein x target(s) 5M+ ligands (Mike Kubal, Benoit Roux, and others) 20

21 DOCK on SiCortex CPU cores: 5760 Tasks: Elapsed time: sec Compute time: 1.94 CPU years Average task time: sec Speedup: 5650X (ideal 5760) Efficiency: 98.2% 5/19/2008 Managing and Executing Loosely-Coupled Large-Scale Applications on 21

22 Falkon: Data Diffusion Resource acquired in response to demand Data and applications diffuse from archival storage to newly acquired resources Resource caching allows faster responses to subsequent requests Cache Eviction Strategies: RANDOM, FIFO, LRU, LFU Resources are released when demand drops Task Dispatcher Data-Aware Scheduler Idle Resources Persistent Storage Shared File System text Provisioned Resources 22

AstroPortal Stacking Service Purpose On-demand stacks of random

random files Time-varying load Sample Workloads Locality Number of

38 154345 111699 2 97999 49000 3 88857 29620 4 76575 19145 5 60590

23 AstroPortal Stacking Service Purpose On-demand stacks of random locations within ~10TB dataset Challenge Rapid access to 10-10K random files Time-varying load Sample Workloads Locality Number of Objects Number of Files S 4 Web page or Web Service Sloan Data =

24 AstroPortal Stacking Service with Data Diffusion Time (ms) per stack per CPU Low data locality Similar (but better) performance to GPFS Data Diffusion (GZ) Data Diffusion (FIT) GPFS (GZ) GPFS (FIT) Number of CPUs Time (ms) per stack per CPU Data Diffusion (GZ) Data Diffusion (FIT) GPFS (GZ) GPFS (FIT) Number of CPUs High data locality Near perfect scalability 24

25 AstroPortal Stacking Service with Data Diffusion Aggregate throughput: 39Gb/s 10X higher than GPFS Reduced load on GPFS 0.49Gb/s 1/10 of the original load Time (ms) per stack per CPU Ideal Locality Data Diffusion (GZ) Data Diffusion (FIT) GPFS (GZ) GPFS (FIT) Aggregate Throughput (Gb/s) Data Diffusion Throughput Local Data Diffusion Throughput Cache-to-Cache Data Diffusion Throughput GPFS GPFS Throughput (FIT) GPFS Throughput (GZ) Locality Big performance gains as locality increases 25

26 Number of Nodes Throughput (Gb/s) Queue Length (x1k) Data Diffusion: Data-Intensive Workload 250K tasks on 128 processors 10MB read, 10ms compute Comparing GPFS with data diffusion sec vs sec (ideal is 1415 sec) Time (sec) Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue Length Number of Nodes Number of Nodes Aggregate Throughput (Gb/s) Throughput per Node (MB/s) Queue Length (x1k) Time (sec) Cache Hit Local % Cache Hit Global % Cache Miss % Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue Length Number of Nodes Cache Hit/Miss %

27 Data Diffusion: Data-Intensive Workload Throughput (Gb/s) Local Worker Caches (Gb/s) Remote Worker Caches (Gb/s) GPFS Throughput (Gb/s) Throughput: Average: 14Gb/s vs 4Gb/s Peak: 100Gb/s vs. 6Gb/s Ideal firstavailable goodcachecompute, 1GB goodcachecompute, 1.5GB goodcachecompute, 2GB Response Time goodcachecompute, 4GB maxcache-hit, 4GB maxcomputeutil, 4GB 3 sec vs 1569 sec 506X Average Response Time (sec) firstavailable goodcache- 5/19/2008 Managing and Executing Loosely-Coupled Large-Scale compute, Applications compute, compute, on 1GB 1.5GB 2GB goodcache- goodcache- 4GB goodcachecompute, max-cachehit, 4GB 27 maxcomputeutil, 4GB

28 Hadoop vs. Swift Time (sec) Classic benchmarks for MapReduce Word Count Sort Swift performs similar or better than Hadoop (on 32 processors) 100 Swift+PBS Hadoop Word Count Time (sec) Swift+Falkon Hadoop Sort /19/2008 Managing and Executing Loosely-Coupled 1 Large-Scale Applications on 75MB 350MB 703MB Data Size 10MB 100MB 1000MB Data Size 28

29 Mythbusting Embarrassingly Happily parallel apps are trivial to run Logistical problems can be tremendous Loosely coupled apps do not require supercomputers Total computational requirements can be enormous Individual tasks may be tightly coupled Workloads frequently involve large amounts of I/O Loosely coupled apps do not require specialized system software Shared file systems are good all around solutions 29 They don t scale proportionally with the compute resources

30 Solutions Falkon A Fast and Light-weight task execution framework Globus Incubator Project Swift Parallel programming tool for rapid and reliable specification, execution, and management of large-scale science workflows Environments: Clusters: TeraPort (TP) Grids: Open Science Grid (OSG), TeraGrid (TG) Specialized large machines: SiCortex 5732 Supercomputers: IBM BlueGene/P (BG/P) 30

31 More Information More information: Personal research page: Falkon: Swift: Collaborators (relevant to this proposal): Ian Foster, The University of Chicago & Argonne National Laboratory Alex Szalay, The Johns Hopkins University Yong Zhao, Microsoft Mike Wilde, Computation Institute, University of Chicago & Argonne National Laboratory Catalin Dumitrescu, Fermi National Laboratory Zhao Zhang, The University of Chicago Jerry C. Yan, NASA, Ames Research Center Funding: NASA: Ames Research Center, Graduate Student Research Program (GSRP) DOE: Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy NSF: TeraGrid 31

A Data Diffusion Approach to Large Scale Scientific Exploration

A Data Diffusion Approach to Large Scale Scientific Exploration Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago Joint work with: Yong Zhao: Microsoft Ian Foster: