Slide 1: Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse
Thanh-Chung Dao and Shigeru Chiba, The University of Tokyo
Slide 2: Supercomputers
- Expensive clusters: multi-core processors, large main memory, high-speed networks
- Focus mainly on compute-intensive applications
- Data-intensive workloads are emerging as supercomputing problems: graph processing, pre-processing of simulation data
Slide 3: MapReduce
- A simple parallel paradigm for processing large datasets
- Parallelization and communication are hidden and done automatically (users can ignore them)
- Pipeline: Input, Splitting, Mapping, Shuffling, Reducing, Result
- PageRank example (rank contributions; see the sketch below):
  - Input: PageA → PageB, PageC; PageB → PageC; PageC → PageA, PageB
  - Mapper: given a page with N outbound links, output <Page, 1/N> for each linked page, e.g. PageA emits <PageB, 0.5> and <PageC, 0.5>
  - Shuffling: groups contributions by page, e.g. PageB receives 0.5 from PageA and 0.5 from PageC
  - Reducer: given <Page, x_1>, ..., <Page, x_n>, output <Page, rank> where rank = x_1 + ... + x_n
  - Result: PageA = 0.5, PageB = 1, PageC = 1.5
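To make the example concrete, here is a minimal sketch of this mapper and reducer in Hadoop's Java MapReduce API; the class names and the "PageA -> PageB, PageC" input-line format are illustrative assumptions, not taken from the slides.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parses a line like "PageA -> PageB, PageC" (assumed format) and
// emits <link, 1/N> for each of the page's N outbound links.
public class RankContributionMapper extends Mapper<Object, Text, Text, DoubleWritable> {
  @Override
  protected void map(Object key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("->");
    String[] links = parts[1].split(",");
    double share = 1.0 / links.length;   // each outbound link gets 1/N of the rank
    for (String link : links) {
      ctx.write(new Text(link.trim()), new DoubleWritable(share));
    }
  }
}

// Reducer: sums the contributions <Page, x_1> ... <Page, x_n> into <Page, rank>.
class RankSumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  @Override
  protected void reduce(Text page, Iterable<DoubleWritable> contribs, Context ctx)
      throws IOException, InterruptedException {
    double rank = 0;
    for (DoubleWritable x : contribs) {
      rank += x.get();
    }
    ctx.write(page, new DoubleWritable(rank));
  }
}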
Slide 4: Hadoop MapReduce
- The standard MapReduce implementation
- Provides easy-to-use MapReduce APIs
- TCP/IP-based communication
- Designed to run on commodity clusters: lab clusters, or Amazon EC2
- Scalability (32,000 nodes at Yahoo) and resilience
- Written in Java
Slide 5: Improving Hadoop MapReduce Performance on Supercomputers
- Hadoop MapReduce is a good choice on supercomputers: maturity, productivity
- But the two sides make different assumptions:

                                      Supercomputer        Hadoop
  Resource allocation at runtime      static               dynamic
  (# of processes, memory, CPU)
  Communication                       MPI                  TCP/IP
  Workload                            compute-intensive    data-intensive
Slide 6: Our Approach: JVM Reuse
- Statically create JVM processes and dynamically allocate them to Hadoop tasks
  - Enables efficient MPI communication by Hadoop tasks: statically created processes can exploit efficient MPI
  - Dynamic allocation makes it possible to keep using the original Hadoop implementation
- Shortens process start-up time
- Technique: a process pool implements JVM Reuse, minimizing changes to the original Hadoop engine
Slide 7: Why MPI is Required for Hadoop
- The de facto high-speed communication layer on supercomputers
- Improves slow MapReduce shuffling
- Enables Hadoop to co-host traditional MPI applications
- Combining the MPI and MapReduce models yields a rich data-analysis workflow, with efficient data sharing between the two models, e.g. MPI can access data located in the Hadoop file system (HDFS)
[Figure: throughput (Mbps, 0 to 30000) vs. message size (2^0 to 2^26 bytes) for MPI and TCP on the FX10 supercomputer; MPI is up to 10 times faster]
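For reference, a minimal point-to-point exchange using the Java MPI binding mentioned later in the talk; this is a sketch only, and the camelCase method names follow the current OpenMPI Java API, which may differ from the 1.6-era binding used in the slides.

import mpi.MPI;
import mpi.MPIException;

// Sends one message from rank 0 to rank 1 over the supercomputer's
// high-speed interconnect (e.g. Tofu on FX10) instead of TCP/IP.
// Run with at least two ranks, e.g. mpirun -np 2 java MpiPingExample.
public class MpiPingExample {
  public static void main(String[] args) throws MPIException {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.getRank();
    byte[] buf = new byte[1 << 20];   // 1 MB payload
    if (rank == 0) {
      MPI.COMM_WORLD.send(buf, buf.length, MPI.BYTE, 1, 0);
    } else if (rank == 1) {
      MPI.COMM_WORLD.recv(buf, buf.length, MPI.BYTE, 0, 0);
    }
    MPI.Finalize();
  }
}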
Slide 8: Slow MapReduce Shuffling on Hadoop
- Shuffling uses TCP/IP-based communication, a bottleneck also targeted by JVM-Bypass (Wang et al., IPDPS 2013)
- Flow: map tasks write their outputs to local disk on each node; an HTTP servlet server serves them; reduce tasks fetch with multiple requests at once, then sort & merge before reducing
- Phases: mapping, shuffling, reducing
Slide 9: Dynamic Process Creation in MPI
- Discouraged on supercomputers, for performance reasons:
  - Collective spawn mechanism (MPI_Comm_spawn)
  - Gang scheduling (error-prone if there are not enough resources)
- Gerbil (Xu et al., CCGrid 2015): co-hosts MPI applications on Hadoop by creating processes dynamically; its experiments showed significant overhead
- With MPI, resources should be specified before the application runs: the number of processes is known (static), as are memory and CPU cores
Slides 10-16: Dynamic Process Creation in Hadoop (animation)
- Dynamic process creation is required: resources are allocated on demand to run MapReduce applications, so the number of processes is unknown (dynamic)
- Sequence shown across the slides (master node plus worker nodes 1..n):
  1. The user submits a job to the master node
  2. The master sends task requests to the nodes (e.g. 6, 6, and 8 tasks)
  3. Each node creates one new JVM process per task
  4. Each task runs on its own process
  5. When the tasks finish, the processes are terminated
  6. Job completion is reported to the user
Slides 17-23: Idea of Reusing Added Idle JVM Processes (animation)
- Idle JVM processes are added; the number of processes is statically fixed
- Sequence shown across the slides (master node plus worker nodes 1..n):
  1. Each node starts with a pool of idle JVM processes
  2. The user submits a job; the master sends task requests to the nodes (e.g. 6, 6, and 8 tasks)
  3. Allocation: idle processes are marked busy and assigned to tasks
  4. The tasks run inside the reused processes
  5. After the tasks finish, the processes are cleaned up
  6. The processes return to the idle state, ready for the next job
  7. Job completion is reported to the user
Slide 24: JVM Reuse Enables MPI Communication
- MPI communication is established once, at the beginning
- JVM Reuse keeps the processes running, so the MPI connection is always available
Slide 25: JVM Reuse Shortens Start-up Time
- JVM start-up flow of a program A (cmd: java A):
  1. OS-level process creation
  2. Class loader subsystem: class loading, then class linking (verification & initialization)
  3. Invoke A's main() method; the execution engine (with the JIT compiler) executes A's instructions
- After A finishes, a program B can reuse A's JVM:
  - Process creation and class loading are skipped
  - Invoke B's main() method and execute B's instructions (see the sketch below)
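A minimal sketch of the reuse step: the resident JVM loads program B through a fresh class loader and calls its main() reflectively. The class names and classpath handling here are assumptions, not the actual Hadoop code.

import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// Runs program B inside the already-running JVM: no new OS process and
// no JVM bootstrap; only B's own classes still need to be loaded.
public class ReuseLauncher {
  public static void runMain(URL[] classpath, String mainClass, String[] args)
      throws Exception {
    // A fresh class loader isolates B's classes from the previous program's.
    try (URLClassLoader loader =
        new URLClassLoader(classpath, ReuseLauncher.class.getClassLoader())) {
      Class<?> cls = Class.forName(mainClass, true, loader);
      Method main = cls.getMethod("main", String[].class);
      main.invoke(null, (Object) args);   // invoke B's static main(String[])
    }
  }
}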
Slide 26: Iterative Jobs Benefit from JVM Reuse
- Iterative jobs create many short-running JVM processes; PageRank is an example
- Iterative job flow: initial data feeds a pre-processing job, then job A (Map + Reduce) runs repeatedly; each iteration's maps use the results of the previous iteration; when the stop condition holds, quit (a driver-loop sketch follows)
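A sketch of such an iterative driver loop using Hadoop's Job API; the job wiring, paths, and the fixed-count stop condition are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Each iteration launches a full MapReduce job whose maps read the previous
// iteration's output; without JVM Reuse, every iteration pays the task-JVM
// start-up cost again.
public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    int maxIterations = 8;                               // placeholder stop condition
    for (int i = 0; i < maxIterations; i++) {
      Job job = Job.getInstance(conf, "pagerank-iter-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(RankContributionMapper.class);  // from the slide-3 sketch
      job.setReducerClass(RankSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(DoubleWritable.class);
      Path output = new Path(args[1] + "/iter-" + i);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);
      if (!job.waitForCompletion(true)) System.exit(1);
      input = output;   // the next iteration's maps read this iteration's result
    }
  }
}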
Slide 27: Implementation: Process Pool
- A process pool runs on each node
- Hadoop YARN (the node manager) sends process-creation requests to the pool manager, which allocates pooled processes with round-robin scheduling
- Allocation: set the process busy, export environment variables, create a new class loader, invoke the task's main() method
- De-allocation: clean static fields, unset busy (a pool-manager sketch follows)
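A minimal sketch of a pool manager along these lines; the Slot interface, the queue-based design, and all names are assumptions rather than the actual implementation.

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Each slot stands for one resident JVM process; allocation hands an idle
// slot to a task, de-allocation cleans it and returns it to the pool.
public class ProcessPool {
  public interface Slot {
    void runTask(Runnable task);   // export env., new class loader, invoke main()
    void cleanup();                // reset static fields (user info, job conf)
  }

  private final BlockingQueue<Slot> idle;

  public ProcessPool(List<Slot> slots) {
    // FIFO queue rotation approximates the round-robin scheduling on the slide.
    idle = new ArrayBlockingQueue<>(slots.size(), true, slots);
  }

  public void submit(Runnable task) throws InterruptedException {
    Slot slot = idle.take();       // allocation: removing from the queue marks it busy
    try {
      slot.runTask(task);
    } finally {
      slot.cleanup();              // de-allocation: clean static fields
      idle.put(slot);              // back to idle for the next job
    }
  }
}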
Slide 28: Our MPI Shuffling Design
- Map tasks write their outputs to local disk on each node
- A shuffle manager serves map outputs to reduce tasks via MPI send/recv, handling one request at a time
- Reduce tasks sort & merge the fetched outputs before reducing
- Phases: mapping, shuffling, reducing (a fetch-side sketch follows)
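A hedged sketch of the reduce-side fetch over the OpenMPI Java bindings; the tag scheme, message framing, and rank layout are assumptions, and the method names follow the current OpenMPI Java API rather than the 1.6-era binding.

import mpi.MPI;
import mpi.MPIException;
import mpi.Status;

// Reduce-side fetch: ask the shuffle manager on the mapper's node for one
// map-output partition, then receive it as a byte buffer over MPI.
public class MpiShuffleClient {
  private static final int TAG_REQ = 1, TAG_DATA = 2;   // assumed tag scheme

  public static byte[] fetchPartition(int managerRank, int partition)
      throws MPIException {
    int[] req = { partition };
    MPI.COMM_WORLD.send(req, 1, MPI.INT, managerRank, TAG_REQ);

    // Probe first so the receive buffer can be sized to the actual payload.
    Status st = MPI.COMM_WORLD.probe(managerRank, TAG_DATA);
    byte[] data = new byte[st.getCount(MPI.BYTE)];
    MPI.COMM_WORLD.recv(data, data.length, MPI.BYTE, managerRank, TAG_DATA);
    return data;                  // caller sorts & merges fetched partitions
  }
}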
Slide 29: Technical Issues of JVM Reuse
- Loading the user's classes:
  - The original flow exports CLASSPATH before the JVM starts; with reuse, the user's classes must be loaded at runtime via reflection
  - A new class loader is created for each user to avoid class conflicts
- Cleaning up static fields:
  - Stale static state is a security problem, e.g. the UserGroup static field must be reset
  - Current design: reset the user-information and job-configuration static fields (a clean-up sketch follows)
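A hedged sketch of the clean-up step; the class and field names below are placeholders, not Hadoop's actual internals.

import java.lang.reflect.Field;

// Between tasks, reset leftover static state so the next user's task cannot
// observe the previous user's identity or job configuration.
public class StaticStateCleaner {
  // Nulls a static field by name; works only for non-final fields.
  static void resetStaticField(ClassLoader loader, String className, String fieldName)
      throws ReflectiveOperationException {
    Class<?> cls = Class.forName(className, false, loader);
    Field f = cls.getDeclaredField(fieldName);
    f.setAccessible(true);
    f.set(null, null);            // static field: the instance argument is null
  }

  static void cleanupBetweenTasks(ClassLoader loader)
      throws ReflectiveOperationException {
    // Hypothetical examples of state that must not leak across users:
    resetStaticField(loader, "org.example.UserGroup", "currentUser");
    resetStaticField(loader, "org.example.JobConfHolder", "conf");
  }
}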
Slide 30: Other Technical Issues
- Enabling Hadoop YARN (a resource manager) to host traditional MPI applications (work in progress):
  - An MPI AppMaster monitors the MPI ranks; each MPI container hosts one rank
- Avoiding gang scheduling of MPI (work in progress)
Slide 31: Evaluation
- Hadoop version: v2.2.0
- Changes in our implementation:
  - Lines of code: ~1,000 out of Hadoop's 1,851,473
  - Classes: 9 out of Hadoop's 35,142
Slide 32: Cluster Setup
- FX10 supercomputer: SPARC64 IXfx at 1.848 GHz (16 cores) and 32 GB RAM per node; MPI over the Tofu interconnect (5 GB/s); central storage
- Hadoop setup: one master and many slaves; OpenJDK 7; HDFS runs on the central storage
- OpenMPI 1.6 with the Java MPI binding (Vega-Gisbert et al.); MCA parameter: plm_ple_cpu_affinity = 0
Slide 33: Evaluation of JVM Reuse
- MPI benefit: MPI vs. TCP/IP shuffling on a Tera-sort job, run on 32 FX10 nodes with a 4-slot pool and -Xmx4096
- Start-up time: JVM Reuse vs. the original Hadoop on an iterative PageRank job over 400 GB of Wikipedia data, run on 8 FX10 nodes with a 6-slot pool and -Xmx4096
Slide 34: MPI vs. TCP/IP Shuffling (Tera-sort)
[Figure: execution time (s, 0 to 250) at input sizes of 4, 8, and 16 GB for non-blocking TCP shuffle, blocking TCP shuffle, and blocking MPI shuffle; blocking MPI shuffle is up to 10% faster]
- Non-blocking: accepts multiple connections at once; blocking: accepts one connection at a time
Slide 35: Shortened Start-up Time (PageRank)
[Figure: per-task timelines (task IDs 1 to ~223, time 0 to 400 s) over two iterations for the original Hadoop and the JVM Reuse Hadoop, broken down into start-up (JVM & user info), task initializing, shuffling (at reducers), data reading, task running, and task finishing (map-output writing)]
- With JVM Reuse, the 2nd iteration starts about 50 seconds earlier because task start-up time is eliminated
Slide 36: More Iterations
[Figures: sum of start-up time (s) and total execution time (s) for 1, 2, 4, 6, and 8 iterations, comparing the original Hadoop against the JVM Reuse-based Hadoop; a further breakdown splits execution time into start-up time and computation time, and a summary table lists, per approach, the sum of start-up time, the execution time, and their ratio]
Slide 37: Related Work
- M3R (VLDB 2012): also applies JVM reuse, to enable in-memory MapReduce, but provides no optimization or evaluation of JVM reuse itself; its Hadoop MapReduce (HMR) engine is rewritten in X10, whereas we keep the original HMR engine with minimal changes
- JVM reuse in Hadoop v1 (2012): limited to a single job; JVM processes are terminated after their job completes
- Gerbil: MPI + YARN (CCGrid 2015): Hadoop YARN co-hosts MPI applications, but with long start-up time and significant overhead
- DataMPI (IPDPS 2014): a Hadoop-like MapReduce implementation built on MPI and C
- JVM-Bypass (IPDPS 2013): a C-based shuffling engine with RDMA support; we focus instead on using MPI from within the Hadoop processes
Slide 38: Summary
- Goal: improving Hadoop MapReduce performance on supercomputers
- Approach: JVM Reuse - statically create JVM processes and dynamically allocate them to Hadoop tasks
  - Enables efficient MPI communication in Hadoop
  - Shortens start-up time
  - Requires minimal changes to the original Hadoop
- Future work:
  - Address a JVM Reuse drawback: it can affect CPU-bound tasks
  - Co-host MPI applications more efficiently
  - Full clean-up