HiTune: Dataflow-Based Performance Analysis for Big Data Cloud
Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu
Intel Asia-Pacific Research and Development Ltd., Shanghai, China, 200241
USENIX ATC '11
Big Data
- The "Industrial Revolution" of data
  - The heartbeat of mobile, cloud, and social computing
  - Expanding faster than Moore's law (e.g., the Internet of Things)
- What is Big Data?
  - Too large to work with using traditional tools (e.g., RDBMS)
  - Requires a new architecture: massively parallel software running on 100s~1000s of servers
Dataflow Model for Big Data Analytics
- User
  - Applications are modeled as dataflow graphs
  - Writes subroutines running on the vertices
  - Abstracted away from the messy details of distributed computing
- System runtime
  - Dynamically maps dataflow graphs to the cluster
  - Handles all the low-level details: data partitioning, task distribution, load balancing, node communications, fault tolerance, ...
[Figure: the MapReduce (Hadoop) dataflow — partitioned input → map tasks (map, spill) → shuffle (copier, merge) → sort → reduce tasks (reduce) → aggregated output, with streaming and sequential dataflow edges; Dryad shown as another dataflow runtime]
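To make the map → shuffle → sort → reduce dataflow above concrete, here is a minimal single-process sketch of it. The class and method names are illustrative only, not Hadoop's actual API, and the "shuffle" is just an in-memory group-by rather than a distributed copier/merge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Minimal in-memory sketch of the MapReduce dataflow:
// map over input records, shuffle (group intermediate pairs by key), then reduce.
public class MiniMapReduce {

    public static <I, K, V, R> Map<K, R> run(List<I> input,
                                             Function<I, List<Map.Entry<K, V>>> mapFn,
                                             BiFunction<K, List<V>, R> reduceFn) {
        // "Shuffle": group intermediate values by key, standing in for the
        // distributed copier/merge/sort stages of the real runtime.
        Map<K, List<V>> groups = new HashMap<>();
        for (I record : input) {
            for (Map.Entry<K, V> pair : mapFn.apply(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        // Reduce: aggregate each key's values into one output value.
        Map<K, R> output = new HashMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reduceFn.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // The classic word-count example: map emits (word, 1); reduce sums the counts.
    public static Map<String, Integer> wordCount(List<String> lines) {
        return run(lines,
                line -> {
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String word : line.split("\\s+")) pairs.add(Map.entry(word, 1));
                    return pairs;
                },
                (word, counts) -> counts.stream().mapToInt(Integer::intValue).sum());
    }
}
```

The point of the abstraction is visible here: the user writes only `mapFn` and `reduceFn`; everything between them is the runtime's job.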
What Worked
- Parallel programming is hard; distributed programming is harder
- The dataflow model makes it a lot easier
  - An appropriately high level of abstraction
  - The user is only required to consider the data parallelisms exposed by the dataflow
  - The runtime distributes the execution of subroutines by exploiting the data dependencies encoded in the dataflow
- "Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans." — Edward A. Lee
- (See also: Auto-Partitioning Compiler for the Intel Network Processor (IXP), CGO 2007, March 2007)
What Didn't Work
- The dataflow abstraction makes a Big Data system appear as a black box
  - Very difficult for the user to understand runtime behaviors
  - Performance analysis & tuning remain a big challenge
- Key challenges of performance analysis for Big Data
  - Massively distributed system: how to correlate concurrent performance activities (across 10,000s of programs and machines)?
  - High-level dataflow abstraction: how to relate low-level performance activities to the high-level dataflow model?
HiTune: VTune for Hadoop
- Distributed instrumentation
  - Lightweight sampling using binary instrumentation; no source code modifications
  - Implemented using Java programming language agents
  - Generic sampling information collected
- Dataflow-driven analysis
  - Reconstructs the dataflow execution process from the low-level sampling information, based on a dataflow specification
  - Implemented as several Hadoop jobs
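As an aside, the lightweight-sampling idea can be sketched in a few lines of plain Java. The real HiTune tracer is a `java.lang.instrument` agent using binary instrumentation; this illustrative stand-in (all names are mine, not HiTune's) simply samples live thread stacks at a fixed interval, which is the same statistical principle: with uniform sampling, how often a frame appears approximates how much time was spent in it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stack sampler, standing in for HiTune's agent-based tracer.
public class StackSampler {

    // One sample: record the topmost stack frame of every live thread.
    public static Map<String, String> sampleOnce() {
        Map<String, String> sample = new HashMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] frames = entry.getValue();
            if (frames.length > 0) {
                sample.put(entry.getKey().getName(),
                           frames[0].getClassName() + "." + frames[0].getMethodName());
            }
        }
        return sample;
    }

    // Collect n samples at a fixed interval. Sampling (rather than tracing every
    // event) is what keeps the runtime overhead low.
    public static List<Map<String, String>> collect(int n, long intervalMillis) {
        List<Map<String, String>> samples = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            samples.add(sampleOnce());
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return samples;
    }
}
```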
HiTune 0.9 Status
- Used intensively both inside Intel and by several external customers
- Open sourced under Apache License 2.0; available at https://github.com/hitune/hitune
[Figure: HiTune architecture — instrumentation writes local HiTune data on each node; adaptors feed Chukwa agents; Chukwa collectors aggregate the data into HDFS; post-processing (Chukwa demux, HiTune parser) prepares it; the HiTune analyzer runs as Hadoop jobs and produces reports (.csv), viewed via queries (QL) or an Excel visual-report spreadsheet (.xlsm)]
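A toy version of the analyzer's core step can make the "dataflow-driven analysis" idea tangible: attribute each low-level sample to a high-level dataflow stage, driven by a specification. The spec format below (method-name prefix → stage name) is hypothetical, purely for illustration; the real analyzer runs this kind of aggregation at scale as Hadoop jobs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy dataflow-driven analysis: count samples per dataflow stage.
// With uniform sampling, per-stage sample counts approximate per-stage time.
public class StageBreakdown {

    // sampledMethods: the method a task thread was executing at each sample tick.
    // prefixToStage: hypothetical dataflow specification mapping code to stages.
    public static Map<String, Integer> breakdown(List<String> sampledMethods,
                                                 Map<String, String> prefixToStage) {
        Map<String, Integer> perStage = new HashMap<>();
        for (String method : sampledMethods) {
            String stage = "other";  // samples matching no rule fall into "other"
            for (Map.Entry<String, String> rule : prefixToStage.entrySet()) {
                if (method.startsWith(rule.getKey())) {
                    stage = rule.getValue();
                    break;
                }
            }
            perStage.merge(stage, 1, Integer::sum);
        }
        return perStage;
    }
}
```

Changing the specification (the `prefixToStage` map here) rather than the tool is what makes the approach extensible to new dataflow engines.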
Overhead
- Ratio of instrumented vs. uninstrumented clusters
- Less than 2% runtime overhead due to instrumentation
[Figure: ratios of completion time, throughput, CPU utilization, and memory utilization (instrumented vs. uninstrumented), all within 98%–102%, for Sort, WordCount, and Nutch indexing on 5-, 10-, and 20-slave clusters]
The Hadoop Dataflow Model
[Figure: partitioned input → map tasks (map, spill) → shuffle (copier, merge) → sort → reduce tasks (reduce) → aggregated output, with streaming and sequential dataflow edges]
Case Study: Limitation of Traditional Tools
- Sorting many small files (3,200 files of 500 KB each) using Hadoop 0.20.1
- Cluster very lightly utilized (extremely low CPU, disk I/O, and network utilization)
- No obvious bottlenecks or hotspots in the cluster
- Traditional tools (e.g., system monitors and program profilers) fail to reveal the root cause
Case Study: Limitation of Traditional Tools
- HiTune results (the dataflow execution chart) reveal the root cause
- Upgrading to Fair Scheduler 2.0 fixes the issue
[Figure: dataflow execution charts (map/reduce tasks over the time line, with bootstrap, shuffle, sort, reduce, and idle stages) — the low-utilization issue vs. the fix (labels 0.19.1, 0.20.1)]
Case Study: Limitation of Hadoop Logs
- TeraSort: large gap between the end of map and the end of shuffle
- None of CPU, disk I/O, or network bandwidth is bottlenecked during the gap
- The "Shuffle Fetchers Busy Percent" metric reported by Hadoop is always 100%
- Increasing the number of copier threads brings no improvement
- Traditional tools and Hadoop logs fail to reveal the root cause
[Figure: dataflow execution chart showing the gap]
Case Study: Limitation of Hadoop Logs
- HiTune results (dataflow-based hotspot breakdown) reveal the root cause
  - Copier threads are idle 80% of the time (reserve 79%, others 1%), waiting for the memory merge thread; busy only 20%
  - The memory merge thread is busy 87% of the time, mostly due to compression (compress 81%, others 6%); idle 13%
- Changing the compression codec to LZO fixes the issue
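For reference, the LZO fix above amounts to a small configuration change. A sketch of what it looks like in Hadoop 0.20-era `mapred-site.xml` (the property names are from that era, and the `LzoCodec` class ships in the separately installed hadoop-lzo library, not in Hadoop itself):

```xml
<!-- Compress intermediate (map output) data with the faster LZO codec
     instead of the default, CPU-heavy codec. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

The trade-off is a slightly lower compression ratio in exchange for much less CPU time in the merge threads, which is exactly where the hotspot breakdown located the bottleneck.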
Case Study: Extensibility
- Easily extended to support Hive, simply by changing the dataflow specification
[Figure: Hive dataflow stage timelines (stage init → active → close, with the processing period) overlaid on the map/reduce stage timelines]
- Aggregation query in the Hive performance benchmarks
  - 68% of time is spent on data input/output and Hadoop/Hive initialization & cleanup
  - Critical to reduce intermediate results, improve data input/output, and reduce Hadoop/Hive overheads
[Figure: time breakdown for map and reduce tasks — init 2.1%, active 77.6% (input 42.4%, operations 32.0%, output 3.2%), stage close 20.2%, close 0.2%]
Summary
- HiTune: VTune for Hadoop
  - Better insights into Hadoop runtime behaviors
  - Dataflow-based analysis
  - Extremely low runtime overheads
  - Very good scalability & extensibility
- v0.9 open sourced under Apache License 2.0; see https://github.com/hitune/hitune