Workflow System for Data-intensive Many-task Computing

Size: px

Start display at page:

Download "Workflow System for Data-intensive Many-task Computing"

Lynn Hicks
6 years ago
Views:

1 Workflow System for Data-intensive Many-task Computing Masahiro Tanaka and Osamu Tatebe University of Tsukuba, JST/CREST Japan Science and Technology Agency ISP2S2 Dec 4,

2 Outline Background Pwrake Workflow System IO-aware Task Scheduling Locality-aware Task Scheduling using Multi-Constraint Graph Partitioning (MCGP) Disk cache-aware Tasks Scheduling Conclusion ISP2S2 Dec 4,

3 Background: Data-intensive Science Many Scientific Fields Astronomy, Bioinformatics, Earth Science, Particle Physics, Data I/O > Computation Interaction through File System Handles huge amount of data HSC for Subaru Telescope generates ~300GB/night. Requires Parallel processing on Distributed Computer Systems ISP2S2 Dec 4,

4 Example of scientific workflow: Montage (Astronomy image processing) Workflow DAG Input images mprojectpp mdiff+mfitplane mbgmodel mbackground madd mshrink madd mjpeg Output image Process Task output file input ISP2S2 Dec 4,

5 Pwrake Workflow System ISP2S2 Dec 4,

6 Pwrake Workflow System Parallel Workflow extension to Rake Target: data-intensive and many-task scientific workflow Pwrake is based on: Rake : Ruby version of UNIX Make Workflow definition language for Many-Task Scientific Workflows Gfarm : Distributed File System Scalable I/O performance Use local storage of compute nodes ISP2S2 Dec 4,

7 Workflow Definition Language Task definition format e.g. DAX Need script to define many tasks. Design a new language e.g. Swift (Wilde et al. 2011) Learning cost, Niche community. Use an existing language e.g. GXP Make (Taura et al. 2013) Extension rule is not enough for scientific workflows. ISP2S2 Dec 4,

8 Our solution: Rake Ruby Make Build tool written in Ruby Widely-used tool in the Ruby community Rake is an internal DSL Ruby is an host language reduces learning cost able to use Ruby language features ISP2S2 Dec 4,

9 Useful features of Rake For-Loop BASENAMES = Array of basenames for i in BASENAMES file "out/#{i}.fits" => "src/#{i}.fits" do t sh "mprojectpp #{t.prerequisites[0]} #{t.name} region.hdr" end end Complex Rule using script FILEMAP = Mapping input files to output files rule /^d /.*.fits$/ => proc{ x FILEMAP[x]} do t p1,p2 = t.prerequisites sh "mdiff #{p1} #{p2} #{t.name} region.hdr" end ISP2S2 Dec 4,

10 Design of Pwrake Inherit Rake Workflow (task) definition language File-based task dependency Resume/Restart Implementation, e.g., Task class, Application module Implement Pwrake extension remote process execution parallel task execution task queue (includes scheduling) find file location using Gfarm API ISP2S2 Dec 4,

11 Pwrake Archetecture Master node Worker nodes Pwrake process Task Graph worker thread Task process process files files Queue worker thread SSH Gfarm worker thread enq deq worker thread process process files files ISP2S2 Dec 4,

12 Design of Task Queue Task NodeQueue TaskQueue enq Node 1 Node 2 deq worker thread Node 3 Other nodes ISP2S2 Dec 4,

13 MB/s I/O-aware Task Scheduling Issues: Read performance of File Locality Disk cache Gfarm file (HDD, GbE) (buffer/page cache) disk process cache Local file cache file cache Remote 0 Remote Local file disk Gfarm file disk ISP2S2 Dec 4,

14 Locality-aware Scheduling based on MCGP (Multi-Constraint Graph Partitioning) (CCGrid 2012) ISP2S2 Dec 4,

15 Locality-aware Scheduling Methods 1. Naïve locality scheduling Assign a task to a node where its input file is stored. 2. Method using MCGP (Multi-Constraint Graph Partitioning) Our proposal (CCGrid 2012) (Idle workers steal tasks) ISP2S2 Dec 4,

16 Graph Partitioning Task Scheduling Is Graph Partitioning also applicable to Workflow DAG? Vertex Computation Edge Communication Minimize: Edge-cut Data movement ISP2S2 Dec 4,

17 Graph Partitioning on DAG Standard Graph Partitioning Ideal Partitioning for Scheduling Node-A Node-B Node-C Node-D Former Tasks Latter Tasks Standard GP is not aware of parallelizable tasks ISP2S2 Dec 4,

18 Multi-Constraint Graph Partitioning (MCGP) Vertex Weight Vectors w 1 VA w 4 (1,3,1) VB w i w 5 i V A w i j 1 w1 j V (1,2,4) (2,2,2) i j 2 nd dim: w w 6 i i i w1, w2, w3 1 st dim: i V V A A B 2 2 j V B V B 6 w 2 w 3 (2,3,1) (3,1,1) w 6 (3,1,3) 3 rd dim: i V A w i j 3 w3 j V B 6 Balance the sum of Vertex Weights at each dimension ISP2S2 Dec 4,

19 Proposed method: MCGP (Multi-Constraint Graph Partitioning) w 1 w 1 w 1 w 1 w 2 w 2 w 2 w 3 w 4 w 4 w 4 w 4 w 5 w 5 w 6 w 6 w 7 w 1 = (1,0,0,0,0) w 2 = (0,1,0,0,0) w 3 = (0,0,0,0,0) w 4 = (0,0,1,0,0) w 5 = (0,0,0,1,0) w 6 = (0,0,0,0,1) w 7 = (0,0,0,0,0) N task N group : set 1 at i th dim set 0 at others N task < N group : set 0 to all dims ISP2S2 Dec 4,

20 Result of MCGP Node-A Node-B w 1 = (1,0,0,0,0) w 2 = (0,1,0,0,0) w 3 = (0,0,0,0,0) w 4 = (0,0,1,0,0) Uniform Distribution in each phase Minimize Edge-cut applied to Entire Graph w 5 = (0,0,0,1,0) w 6 = (0,0,0,0,1) w 7 = (0,0,0,0,0) ISP2S2 Dec 4,

21 Image Position and Task Nodes (Initially, input files are stored in one node) Naïve locality MCGP ISP2S2 Dec 4,

22 Data Movement between nodes Data Size Ratio (%) A(Unconcern) B(Naïve locality) C(MCGP) ISP2S2 Dec 4,

23 Workflow Execution Time Elapsed Time (sec) A(Unconcern) B(Naïve locality) C(MCGP) 31% down 22% down Includes time to solve MCGP (30 ms) ISP2S2 Dec 4,

24 Disk cache-aware Task Scheduling (Cluster 2014) ISP2S2 Dec 4,

25 Disk Cache-aware Task Scheduling LIFO queue: Intermediate files are read soon. High probability that the file is cached. Trailing task problem (Armstrong et al. MTAGS 2010) Workflow DAG LIFO queue Execution order Last-in First-out A1 A2 A3 A4 A5 A6 A6 A6 B6 B6 A5 B1 B2 B3 B4 B5 B6 A5 A4 B5 A3 C A2 time First-in A1 ISP2S2 Dec 4,

26 Proposed Scheduling methods Workflow DAG A1 A2.. An B1 B2.. Bn C A1 A3 An-2 An B2 Bn-3 Bn-1 A2 A4 An-1 B1 B3 Bn-2 Bn Trailing tasks A1 B1 A3 B3 An-2 Bn-2 An Bn : A2 B2 A4 B4 An-1 Bn-1 idle core A1 B1 A3 B3 An-2 Bn-2 Bn-1 : A2 B2 A4 B4 An-1 An Bn LIFO HRF A1 B1 B2 B3 An-1 Bn-1 Bn-2 HRF: Highest Rank First : A2 A3 A4 A5 An-2 An Bn Bn-3 FIFO (HRF) LIFO LIFO+HRF Rank Equalization+HRF Disk Cache Trailing Task Task Overlap Rank Overlap HRF ISP2S2 Dec 4,

27 Core Utilization FIFO mprojectpp mdiff LIFO Trailing Task Sequential tasks Rank Eq+HRF (different setting) mdiff overlaps mprojectpp LIFO+HRF No Trailing Task ISP2S2 Dec 4,

28 Measurement of Strong Scaling (1-12 nodes, Logarithmic) 1node 8cores x1.9 speedup from FIFO 1/n cores 4nodes 32cores Next Slide ISP2S2 Dec 4,

29 Measurement of Strong Scaling (4-12 nodes, Linear) 4nodes 32cores 12nodes 96cores 12 % speedup from LIFO (trailing task) 1/n cores ISP2S2 Dec 4,

30 Conclusion We developed Pwrake workflow system for data-intensive, many-task workflows. I/O-aware workflow scheduling: Locality-aware scheduling using MCGP remote file access: 88% 14% workflow execution: 31% speedup Disk cache-aware scheduling LIFO: 1.9x speedup HRF: ~12% speedup ISP2S2 Dec 4,

Disk Cache-Aware Task Scheduling

Disk Cache-Aware Task Scheduling Disk ache-aware Task Scheduling For Data-Intensive and Many-Task Workflow Masahiro Tanaka and Osamu Tatebe University of Tsukuba, JST/REST Japan Science and Technology Agency IEEE luster 2014 2014-09-24