A Data Diffusion Approach to Large Scale Scientific Exploration

Size: px

Start display at page:

Download "A Data Diffusion Approach to Large Scale Scientific Exploration"

Giles Reynolds
5 years ago
Views:

of Chicago, CS & CI, Argonne National Laboratory, MCS Alex Szalay: The Johns Hopkins Univ., Dept.

1 A Data Diffusion Approach to Large Scale Scientific Exploration Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago Joint work with: Yong Zhao: Microsoft Ian Foster: Univ. of Chicago, CS & CI, Argonne National Laboratory, MCS Alex Szalay: The Johns Hopkins Univ., Dept. of Physics and Astronomy Funded by: NSF: TeraGrid DOE: Advanced Scientific Computing Research NASA: Ames Research Center 27 Microsoft escience Workshop at RENCI October 21 st, 27

Data Diffusion Example: AstroPortal Stacking Service

dataset Challenge Rapid access to 1-1K random files

storage 1/22/27 A Data Diffusion Approach to Large Scale

2 Data Diffusion Example: AstroPortal Stacking Service Purpose On-demand stacks of random locations within ~1TB dataset Challenge Rapid access to 1-1K random files Time-varying load Solution Dynamic acquisition of compute, storage 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 2 S 4 Web page or Web Service = Sloan Data

3 Falkon: a Fast and Light-weight task execution framework Goal: enable the rapid and efficient execution of many independent jobs on large compute clusters Combines three techniques: multi-level scheduling techniques to enable separate treatments of resource provisioning a streamlined task dispatcher able to achieve order-of-magnitude higher task dispatch rates than conventional schedulers performs data diffusion and uses a data-aware scheduler to leverage the co-located computational and storage resources 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 3

4 Execution Model Dispatch Policy first-available, max-compute-util, max-cache-hit* Caching Policy random, FIFO, LRU, LFU Replay policy Resource Acquisition Policy one-at-a-time, additive, exponential, all-at-once, optimal* Resource Release Policy distributed, centralized* 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 4

dynamic resource provisioning WS Dispatcher Provisioner Compute Resource 1 WS Clients Compute Resource m

5 Falkon 2-Tier Architecture Tier 1: Dispatcher GT4 Web Service accepting task submissions from clients and sending them to available executors Tier 2: Executor Run tasks on local resources Provisioner Static and dynamic resource provisioning WS Dispatcher Provisioner Compute Resource 1 WS Clients Compute Resource m Compute Resources Executor 1 Executor n 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 5

6 Dispatch Throughput Throughput (tasks/sec) WS Calls (no security) Falkon (no security) Falkon (GSISecureConversation) Number of Executors 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 7

7 Execution Efficiency Efficiency Sleep 64 Sleep 32 Sleep 16 Sleep 8 Sleep 4 Sleep 2 Sleep Number of Executors 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 8

8 Resource Provisioning. provisioner registration 1. task(s) submit 2. resource allocation to GRAM 3. resource allocation to LRM 4. executor registration 5. notification for work 6. pick up task(s) 7. deliver task(s) results 8. notification for task result 9. pick up task(s) results 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 1

9 Dynamic Resource Provisioning End-to-end execution time: 126 sec in ideal case 494 sec 1276 sec Average task queue time: 42.2 sec in ideal case 611 sec 43.5 sec Trade-off: Resource Utilization for Execution Efficiency GRAM +PBS Ideal (32 nodes) Falkon-15 Falkon-6 Falkon-12 Falkon-18 Falkon- Queue Time (sec) Execution Time (sec) Execution Time % 8.5% 17.% 17.6% 19.3% 28.7% 29.2% 29.7% GRAM +PBS Ideal (32 nodes) Falkon-15 Falkon-6 Falkon-12 Falkon-18 Falkon- Time to complete (sec) Resouce Utilization 3% 89% 75% 65% 59% 44% 1% Execution Efficiency 26% 72% 75% 84% 85% 99% 1% Resource Allocations /22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 13

10 Data Diffusion Data caching at executor local disk RANDOM FIFO LRU LFU Avoid the need for a shared file system Theoretical linear Throughput (Mb/s) scalability with compute resources Local Read Local Read+Write GPFS Read GPFS Read+Write File Size (MB) /22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 14

11 Shared File System Performance Read Throughput (Mb/s) Nodes 32 Nodes 16 Nodes 8 Nodes 4 Nodes 2 Nodes 1 Node Read+Write Throughput (Mb/s) Nodes 32 Nodes 16 Nodes 8 Nodes 4 Nodes 2 Nodes 1 Node 1E File Size (MB) 1 1 1E File Size (MB) 1 1 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 15

12 Local Disk Theoretical Performance Read Throughput (Mb/s) Nodes 32 Nodes 16 Nodes 8 Nodes 4 Nodes 2 Nodes 1 Node Read+Write Throughput (Mb/s) Nodes 32 Nodes 16 Nodes 8 Nodes 4 Nodes 2 Nodes 1 Node 1E File Size (MB) 1 1 1E File Size (MB) 1 1 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 16

13 Data Diffusion: Micro-Benchmarks 1. Model (local disk): local disk performance model 2. Model (shared file system): shared file system (GPFS) performance model 3. Falkon (next-available policy): load balancing across available executors, operates on shared file system 4. Falkon (next-available policy) + Wrapper: same as (3), but tasks execute through a wrapper that creates a temp scratch directory on the shared file system, makes symbolic links, executes task, and removes the temp scratch directory and symbolic links 5. Falkon (first-available policy % locality): load balancing across available executors with data caching; % data locality with data read from the shared file system 6. Falkon (first-available policy 1% locality): same as (5), but with the workload from (5) repeating four times; the caches are first populated (not as part of the timed experiment) 7. Falkon (max-compute-util policy % locality): same as (5), but uses the dataaware scheduler to send tasks to the executors that have the most data cached; the workload has % data locality 8. Falkon (max-compute-util policy 1% locality): same as (7), but with the workload repeated four times 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 19

14 Micro-Benchmarks Results: 1-64 Nodes Read Throughput (Mb/s) Model (local disk) 2. Model (shared file system) 3. Falkon (next-available policy) 4. Falkon (next-available policy) + Wrapper 5. Falkon (first-available policy % locality) 6. Falkon (first-available policy 1% locality) 7. Falkon (max-compute-util policy % locality) 8. Falkon (max-compute-util policy 1% locality) Read+Write Throughput (Mb/s) Model (local disk) 2. Model (shared file system) 3. Falkon (next-available policy) 4. Falkon (next-available policy) + Wrapper 5. Falkon (first-available policy % locality) 6. Falkon (first-available policy 1% locality) 7. Falkon (max-compute-util policy % locality) 8. Falkon (max-compute-util policy 1% locality) Number of Nodes Number of Nodes 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 2

15 Micro-Benchmarks Results: 1 Node Read Throughput (Mb/s) Model (local disk) 2. Model (shared file system) 3. Falkon (next-available policy) 4. Falkon (next-available policy) + Wrapper 5. Falkon (first-available policy % locality) 6. Falkon (first-available policy 1% locality) 7. Falkon (max-compute-util policy % locality) 8. Falkon (max-compute-util policy 1% locality) Read+Write Throughput (Mb/s) Model (local disk) 2. Model (shared file system) 3. Falkon (next-available policy) 4. Falkon (next-available policy) + Wrapper 5. Falkon (first-available policy % locality) 6. Falkon (first-available policy 1% locality) 7. Falkon (max-compute-util policy % locality) 8. Falkon (max-compute-util policy 1% locality) 1B 1KB 1KB 1KB 1MB 1MB 1MB 1GB File Size 1B 1KB 1KB 1KB 1MB 1MB 1MB 1GB File Size 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 21

16 Micro-Benchmarks Results: 8 Nodes Read Throughput (Mb/s) Model (local disk) 2. Model (shared file system) 3. Falkon (next-available policy) 4. Falkon (next-available policy) + Wrapper 5. Falkon (first-available policy % locality) 6. Falkon (first-available policy 1% locality) 7. Falkon (max-compute-util policy % locality) 8. Falkon (max-compute-util policy 1% locality) Read+Write Throughput (Mb/s) Model (local disk) 2. Model (shared file system) 3. Falkon (next-available policy) 4. Falkon (next-available policy) + Wrapper 5. Falkon (first-available policy % locality) 6. Falkon (first-available policy 1% locality) 7. Falkon (max-compute-util policy % locality) 8. Falkon (max-compute-util policy 1% locality) 1B 1KB 1KB 1KB 1MB 1MB 1MB 1GB File Size 1B 1KB 1KB 1KB 1MB 1MB 1MB 1GB File Size 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 22

17 Micro-Benchmarks Results: 64 Nodes Read Throughput (Mb/s) Model (local disk) 2. Model (shared file system) 3. Falkon (next-available policy) 4. Falkon (next-available policy) + Wrapper 5. Falkon (first-available policy % locality) 6. Falkon (first-available policy 1% locality) 7. Falkon (max-compute-util policy % locality) 8. Falkon (max-compute-util policy 1% locality) Read+Write Throughput (Mb/s) Model (local disk) 2. Model (shared file system) 3. Falkon (next-available policy) 4. Falkon (next-available policy) + Wrapper 5. Falkon (first-available policy % locality) 6. Falkon (first-available policy 1% locality) 7. Falkon (max-compute-util policy % locality) 8. Falkon (max-compute-util policy 1% locality) 1B 1KB 1KB 1KB 1MB 1MB 1MB 1GB File Size 1B 1KB 1KB 1KB 1MB 1MB 1MB 1GB File Size 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 23

18 Task Dispatch Throughput: 64 Nodes, 1 Byte files on GPFS Read Read+Write Throughput (tasks/sec) No Data Access 3. Falkon (nextavailable policy) 4. Falkon (nextavailable policy) + Wrapper 5. Falkon (firstavailable policy % locality) 6. Falkon (firstavailable policy 1% locality) 7. Falkon (maxcompute-util policy compute-util policy 8. Falkon (max- % locality) 1% locality) 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 24

19 Dispatch Policies 1% 1%1%1% 1%1%1% 99% 1% 99% 1% 99% 1% 1% 97% 1% 97% Cache Hit Rate (local disk) 9% 8% 7% 6% 5% 4% 3% 63% 31% first-available max-compute-util max-cache-hit 2% 16% 1% % 8% 4% Number of Nodes 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 25

Applications via Swift Application #Tasks/workflow #Stages ATLAS: High Energy Physics Event Simulation 5K 1 fmri DBIC: AIRSN Image Processing 1s 12 FOAM: Ocean/Atmosphere Model 2 3 GADU: Genomics 4K

20 Applications via Swift Application #Tasks/workflow #Stages ATLAS: High Energy Physics Event Simulation 5K 1 fmri DBIC: AIRSN Image Processing 1s 12 FOAM: Ocean/Atmosphere Model 2 3 GADU: Genomics 4K 4 HNL: fmri Aphasia Study 5 4 NVO/NASA: Photorealistic Montage/Morphology 1s 16 QuarkNet/I2U2: Physics Science Education 1s 3 ~ 6 RadCAD: Radiology Specification 1s 5 Classifier Training Abstract SIDGrid: EEG Wavelet 1s 2 computation Processing, Gaze Analysis SDSS: Coadd, Cluster Search 4K, 5K 2, 8 SDSS: Stacking, AstroPortal 1Ks ~ 1Ks 2 ~ 4 SwiftScript MolDyn: Molecular Dynamics 1Ks ~ 2Ks 8 Compiler Virtual Data Catalog SwiftScript Swift Architecture Scheduling Execution Engine (Karajan w/ Swift Runtime) C C C C Swift runtime callouts Status reporting Provenance collector Execution Virtual Node(s) Provenance data launcher Provenance data launcher file1 App F1 file2 App F2 file3 Provisioning Falkon Resource Provisioner Amazon EC2 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 29 13

21 Applications Medicine: fmri (Falkon vs. GRAM) Astronomy: Montage (Falkon vs. GRAM vs. MPI) Stacking (Falkon with Data Diffusion vs. Falkon using shared file system) 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 3

22 Applications: fmri GRAM vs. Falkon: 85%~9% lower run time GRAM/Clustering vs. Falkon: 4%~74% lower run time GRAM GRAM/Clustering Falkon Time (s) /22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 31 Input Data Size (Volumes)

23 Applications: Montage GRAM/Clustering vs. Falkon: 57% lower application run time MPI* vs. Falkon: 4% higher application run time * MPI should be lower bound GRAM/Clustering MPI Falkon Time (s) mproject mdiff/fit mbackground madd(sub) 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 32 Components madd total

Applications: Stacking Object Count per File 275 25 225 2 175 15

24 Applications: Stacking Object Count per File Files 32M+ objects 1.5M+ files 3.3 TB compressed / 9TB uncompressed 1E+6 1E+6 1E+6 SDSS DR5 Dataset 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 33

25 Stacking Workloads Uniform sample from 32M objects ZIPF distribution of object popularity alpha =.1, 1., 2. Stacking Size (# of objects) Object Distribution Working Set Size (# of files) Working Set Size (# of objects) Locality (accesses per file) 1 ZIPF - alpha ZIPF - alpha ZIPF - alpha UNIFORM /22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 34

26 Stacking Performance Evaluation Time (sec) ZIPF - alpha.1 ZIPF - alpha 1. ZIPF - alpha 2. UNIFORM Web Server Shared File System (WAN) Shared File System (LAN) Data Diffusion Original Dataset Location 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 35

27 Object Size Effect 35 Shared File System (LAN) Data Diffusion 3 25 Time (sec) x1 1x1 5x5 Object Size 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 36

28 Data Diffusion Effect Shared File System (LAN - ZIPF 2.) Shared File System (LAN - ZIPF 1.) Shared File System (LAN - ZIPF.1) Shared File System (LAN - UNIFORM) Data Diffusion (ZIPF 2.) Data Diffusion (ZIPF 1.) 1st Time 2nd Time Data Diffusion (ZIPF.1) Data Diffusion (UNIFORM) Time (sec) 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 37

29 1K*2 Stacking Workload 1 9 1K Objects - UNIFORM 1st Time: Instantenous Throughput 1K Objects - UNIFORM 1st Time: Average Throughput 1K Objects - UNIFORM 2nd Time: Instantenous Throughput 1K Objects - UNIFORM 2nd Time: Average Throughput Throughput (Cutouts/sec) Time (sec) 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 38

30 1K Stacking Workload 25 1K Objects - UNIFORM: Instantenous Throughput 1K Objects - UNIFORM: Average Throughput Throughput (Cutouts/sec) Time (sec) 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 39

31 Future Work Explore data pre-fetching policies Update Swift to use data diffusion 3-Tier Architecture (essential on BG/P) Extend provisioner to support the Virtual Workspace Service (opens door to EC2) Globus Incubator Project Alternate languages and technologies 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 4

32 More Information Web: Related Projects: Swift: AstroPortal: Data Diffusion Collaborators: Ioan Raicu, Computer Science Dept., The University of Chicago Yong Zhao, Microsoft Ian Foster, Math and Computer Science Div., Argonne National Laboratory & Computer Science Dept., The University of Chicago Alex Szalay, Department of Physics and Astronomy, The Johns Hopkins University Other Falkon Collaborators: Catalin Dumitrescu, Computer Science Dept., The University of Chicago Mike Wilde, Computation Institute, University of Chicago & Argonne National Laboratory Funding: NASA: Ames Research Center, Graduate Student Research Program (GSRP) DOE: Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy NSF: TeraGrid 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 44

33 More Information: Related Documents Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, Mike Wilde. Falkon: a Fast and Light-weight task execution framework, to appear at IEEE/ACM SuperComputing 27. Ioan Raicu, Catalin Dumitrescu, Ian Foster. Dynamic Resource Provisioning in Grid Environments, to appear TeraGrid Conference 27. Yong Zhao, Mihael Hategan, Ben Clifford, Ian Foster, Gregor von Laszewski, Ioan Raicu, Tiberiu Stef-Praun, Mike Wilde. Swift: Fast, Reliable, Loosely Coupled Parallel Computation, to appear at IEEE Workshop on Scientific Workflows 27. Yong Zhao, Mihael Hategan, Ioan Raicu, Mike Wilde, Ian Foster. Swift: a Parallel Programming Tool for Large Scale Scientific Computations, under review at Scientific Programming Journal, Special Issue on Dynamic Computational Workflows: Discovery, Optimization, and Scheduling. Ioan Raicu, Ian Foster, Alex Szalay. Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets, poster presentation, IEEE/ACM SuperComputing 26. Ioan Raicu, Ian Foster, Alex Szalay, Gabriela Turcu. AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis, TeraGrid Conference 26, June 26. Alex Szalay, Julian Bunn, Jim Gray, Ian Foster, Ioan Raicu. The Importance of Data Locality in Distributed Computing Applications, NSF Workflow Workshop 26. 1/22/27 A Data Diffusion Approach to Large Scale Scientific Exploration 45

Managing and Executing Loosely-Coupled Large-Scale Applications on Clusters, Grids, and Supercomputers

Managing and Executing Loosely-Coupled Large-Scale Applications on Clusters, Grids, and Supercomputers Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago Collaborators: