Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems


Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Md. Wasi-ur-Rahman, M.Sc.

Graduate Program in Computer Science and Engineering

The Ohio State University

2016

Dissertation Committee:
Dr. Dhabaleswar K. (DK) Panda, Advisor
Dr. Ponnuswamy Sadayappan
Dr. Radu Teodorescu
Dr. Xiaoyi Lu

© Copyright by Md. Wasi-ur-Rahman 2016

Abstract

Big Data processing and High-Performance Computing (HPC) are two disruptive technologies that are converging to meet the challenges exposed by large-scale data analysis. MapReduce, a popular parallel programming model for data-intensive applications, is being used extensively through different execution frameworks (e.g., batch processing, Directed Acyclic Graph or DAG) on modern HPC systems because of its ease-of-programming, fault-tolerance, and scalability. However, as these applications begin scaling to terabytes of data, the socket-based communication model, which is the default implementation in the open-source MapReduce execution frameworks, becomes a performance bottleneck. Moreover, because of the synchronized nature of staging the data in various execution phases, the default Hadoop MapReduce framework cannot leverage the full potential of the underlying interconnect. MapReduce frameworks also rely heavily on the availability of local storage media, which introduces space inadequacy for applications that generate a large amount of intermediate data. On the other hand, most leadership-class HPC systems follow the traditional Beowulf architecture with a separate parallel storage system and either no, or very limited, local storage. The storage architectures in these HPC systems are not natively conducive to default MapReduce. Also, modern high-performance interconnects (e.g., InfiniBand) used to access the parallel storage in these systems can provide extremely low latency and high bandwidth. Additionally, advanced storage architectures, such as Non-Volatile Memories (NVM), can provide byte-addressability as

well as data persistence. Efficient utilization of all these resources through enhanced designs of execution frameworks with a tuned parameter space is crucial for MapReduce in terms of performance and scalability. This work addresses several of the shortcomings that the current MapReduce execution frameworks hold. It presents an enhanced Big Data execution framework, HOMR (Hybrid Overlapping in MapReduce), which improves the MapReduce job execution pipeline by maximizing the overlapping among execution phases. HOMR also introduces an RDMA (Remote Direct Memory Access) based shuffle engine with advanced shuffle algorithms to leverage the benefits of the high-performance interconnects used in HPC systems. It minimizes the large number of disk accesses in the MapReduce execution frameworks through in-memory operations combined with a fast execution pipeline. This work also proposes different deployment architectures utilizing Lustre as the underlying storage and provides fast shuffle strategies with dynamic adjustments. Priority-based storage selection for intermediate data ensures the best storage usage at any point of job execution. This work also presents a variant of HOMR that can exploit the byte-addressability of NVM to provide fast execution of MapReduce applications. Finally, a generalized advising framework is presented in this work that can provide optimum configuration recommendations for any MapReduce system with profiling and prediction capabilities. Through performance modeling of this MapReduce execution framework, techniques for predicting job execution performance are demonstrated on leadership-class HPC clusters at large scale.

To my family, friends, and mentors.

Acknowledgments

This work was made possible because of the love and support of many people throughout my doctoral years. I would like to take this opportunity to thank all of them.

I would like to express my special appreciation and thanks to my advisor, Professor Dhabaleswar K. Panda. I would like to thank him for encouraging my research and for allowing me to grow as a scientist. His advice on both research and my career has been priceless, and I am highly grateful to work with him and have his suggestions and guidance throughout my career.

I would also like to thank my dissertation committee members, Dr. P. Sadayappan and Dr. R. Teodorescu, for agreeing to serve as committee members and for their valuable feedback during and after my candidacy proposal, which helped me improve my dissertation. I would also like to thank Dr. Xiaoyi Lu for providing me useful feedback and guiding me throughout my Ph.D. Working with him helped me grow as a researcher. I would also like to thank all my present and past colleagues who have helped me in different ways throughout my studies.

From my family, I would like to thank my parents - my father, Mr. M. A. Mannan, and my mother, Mrs. Niger Sultana. It is their love, determination, and sacrifices that helped me pursue my career goals. Their prayers have sustained me thus far, and I consider myself a lucky person to have their guidance throughout my life.

I would like to thank my dearest wife, Nusrat Sharmin Islam, for her love, support, and patience with me, which helped me move forward in every moment of my life. Because we work in the same field, we used to have a lot of discussions related to our research that helped me think beyond my usual scope and limits, and I consider myself a lucky person to have shared this journey with her.

I would also like to thank my elder sister, Dr. Nusrat Jahan Chhanda, and my brother-in-law, Dr. Muhannad Mustafa, for their constant love and appreciation for whatever I do. My sister's presence in my life has helped me go through difficult times, and words are not enough to express my gratitude for them. I would also like to thank my younger brother, Md. Wali-ur-Rahman, for his eagerness to know my whereabouts every now and then.

I would like to thank all my friends back at home for their support, particularly Susmita, Amit, Asad, Tareq, Pavel, Nusrat, Rashed, Drabir, Nadia, Uzzal, and Sabrina. I would also like to thank my friends here - Mehmet, Pelin, Ayan, Erdem, and many others.

Lastly, and definitely not the least, I am thankful to my little princess, Nayirah Sahar Rahman. Her smile always makes my day and helps me forget all the stresses I might have, so that I can start fresh again with all my strength, knowing that she would be there at the end. I would like to thank her for being a part of my life.

Vita

B.Sc., Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
Research Engineer, Institute of Information and Communication Technology, Bangladesh University of Engineering and Technology, Bangladesh
Lecturer, Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
2010-Present: Ph.D., Computer Science and Engineering, The Ohio State University, USA
Graduate Teaching Associate, The Ohio State University, USA
2012-Present: Graduate Research Associate, The Ohio State University, USA
Student Research Co-op/Intern, IBM T. J. Watson Research Center, USA

Publications

M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Accelerating MapReduce and DAG Execution Frameworks to Leverage Non-Volatile Memory on HPC Systems, In 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS '17), [Under Review].
N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage, In 2016 IEEE International Conference on Big Data (IEEE BigData '16), December 2016.

M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?, In 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS '16), in conjunction with SC '16, November 2016.
M. W. Rahman, N. S. Islam, X. Lu, D. Shankar, and D. K. Panda, MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers, In 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD '16), October 2016.
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, A Comprehensive Study of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters, In IEEE Transactions on Parallel and Distributed Systems (TPDS), August.
X. Lu, M. W. Rahman, N. S. Islam, D. Shankar, and D. K. Panda, Accelerating Big Data Processing on Modern HPC Clusters, In Conquering Big Data with High Performance Computing, Springer International Publishing, July.
N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, In 24th International Conference on Supercomputing (ICS '16), June 2016.
D. Shankar, X. Lu, M. W. Rahman, N. S. Islam, and D. K. Panda, Characterizing and Benchmarking Stand-alone Hadoop MapReduce on Modern HPC Clusters, In The Journal of Supercomputing, Springer, June.
D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, In 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), May 2016.
N. S. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, In 2015 IEEE International Conference on Big Data (IEEE BigData '15), October 2015.
D. Shankar, X. Lu, M. W. Rahman, N. S. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads, In 2015 IEEE International Conference on Big Data (IEEE BigData '15), October 2015 (Short Paper).

N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-based Key-Value Store, In 44th International Conference on Parallel Processing (ICPP '15), September 2015.
A. Bhat, N. S. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS, In 6th Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE-6), August 2015.
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, In 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS '15), May 2015.
N. S. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15), May 2015.
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. S. Islam, and D. K. Panda, Can RDMA Benefit Online Data Processing Workloads on Memcached and MySQL?, In 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '15), March 2015 (Poster Paper).
N. S. Islam, X. Lu, M. W. Rahman, R. Rajachandrasekar, and D. K. Panda, In-Memory I/O and Replication for HDFS with Memcached: Early Experiences, In 2014 IEEE International Conference on Big Data (IEEE BigData '14), October 2014.
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, Performance Modeling for RDMA-Enhanced Hadoop MapReduce, In 43rd International Conference on Parallel Processing (ICPP '14), September 2014.
D. Shankar, X. Lu, M. W. Rahman, N. S. Islam, and D. K. Panda, A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, In 5th Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE-5), in conjunction with VLDB, September 2014.
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, In 20th International European Conference on Parallel Processing (Euro-Par '14), August 2014.

X. Lu, M. W. Rahman, N. S. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, In International Symposium on High-Performance Interconnects (HotI '14), August 2014.
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, In International Conference on Supercomputing (ICS '14), June 2014.
N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Exploit Maximum Overlapping in RDMA-Enhanced HDFS over InfiniBand, In International Symposium on High Performance and Distributed Computing (HPDC '14), June 2014 (Short Paper).
R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, M. W. Rahman, and D. K. Panda, MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture, In International Symposium on High Performance and Distributed Computing (HPDC '14), June 2014 (Short Paper).
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, Does RDMA-based Enhanced Hadoop MapReduce Need a New Performance Model?, In ACM Symposium on Cloud Computing (SoCC '13), October 2013 (Poster Paper).
X. Lu, N. S. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, In International Conference on Parallel Processing (ICPP '13), October 2013.
N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda, Can Parallel Replication Benefit HDFS for High-Performance Interconnects?, In International Symposium on High-Performance Interconnects (HotI '13), August 2013 (Short Paper).
X. Lu, M. W. Rahman, N. S. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, In International Workshop on Big Data Benchmarking (WBDB '13), July 2013.
M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, In International Workshop on High Performance Data Intensive Computing (HPDIC '13), in conjunction with IPDPS '13, May 2013.

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-Benchmark Suite for Evaluating HDFS Operations on Modern Clusters, In Special Issue of LNCS on papers from the WBDB '12 Workshop, December.
N. S. Islam, X. Lu, M. W. Rahman, J. Jose, H. Wang, and D. K. Panda, A Micro-Benchmark Suite for Evaluating HDFS Operations on Modern Clusters, In International Workshop on Big Data Benchmarking (WBDB '12), December 2012.
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, In International Conference for High Performance Computing, Networking, Storage and Analysis (SC '12), November 2012.
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, In International Parallel & Distributed Processing Symposium (IPDPS '12), May 2012.
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters using Hybrid Transports, In International Symposium on Cluster, Cloud, and Grid Computing (CCGrid '12), May 2012.
M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. S. Islam, H. Subramoni, C. Murthy, and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, In International Symposium on Performance Analysis of Systems and Software (ISPASS '12), April 2012 (Poster Paper).
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. S. Islam, X. Ouyang, H. Wang, S. Sur, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, In International Conference on Parallel Processing (ICPP '11), September 2011.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   Problem Statement
   Organization of the Thesis

2. Background
   Hadoop: Hadoop Distributed File System (HDFS); Hadoop MapReduce; Detailed Design of Hadoop MapReduce; Overview of YARN
   MapReduce Execution Frameworks: Batch Processing; DAG Processing
   Other Big Data Middleware: Spark; Tachyon; Hive
   Hadoop Benchmarks and Workloads: TeraSort; Sort; PUMA; SWIM; Intel HiBench
   InfiniBand: InfiniBand Verbs Layer; InfiniBand IP Layer; Unified Communication Runtime (UCR)
   High Performance Storage Technologies: Solid State Drive (SSD); Non-Volatile Memory (NVM); Lustre

3. High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand
   Introduction
   Design of RDMA-based Hadoop MapReduce: Architectural Overview; Detailed Design; RDMA based Shuffle; Faster Merge; Intermediate Data Pre-fetching and Caching; Overlap of Shuffle, Merge, and Reduce
   Performance Evaluation: Experimental Setup (OSU-RI Compute, OSU-RI Storage); Evaluation with the TeraSort Benchmark; Evaluation with the Sort Benchmark; Benefits of Caching
   Related Work
   Summary

4. Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects
   Introduction
   Hybrid Overlapping in MapReduce: Architectural Overview; Pipelined Design of HOMR; Efficient Shuffle Algorithms (All-Average Algorithm, Greedy Shuffle Algorithm); On-demand Shuffle Adjustment; Implementation
   Performance Evaluation: Experimental Setup (OSU-RI Compute, OSU-RI Storage, TACC Stampede); Performance Analysis of HOMR (Shuffle Algorithms: All Average vs Greedy; Benefits from On-demand Shuffle Adjustment; HOMR Profiling); Evaluation of Micro-benchmarks (Comparison with Hadoop over Sockets, Comparison with Hadoop over RDMA); Evaluation of Macro-benchmarks (Evaluation using SWIM, Evaluation using PUMA)
   Related Work
   Summary

5. High-Performance Designs of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies
   Introduction
   HOMR over Lustre: Architectural Overview; System Deployment Architectures; Shuffle Strategies (Read-based Shuffle (HOMR-Read), RDMA-based Shuffle (HOMR-RDMA)); Optimization in Shuffle Strategies (Tuning stripe size and stripe count, Optimizing the number of readers and writers); Dynamic Adaptation (HOMR-Adaptive): Profiler; Priority Directory Selection (Dynamic Selection (HOMR-Dynamic), Static Selection (HOMR-Static))
   Performance Evaluation: Experimental Setup (OSU-RI Compute, OSU-RI Storage, TACC Stampede, SDSC Gordon); Evaluation with SIO Architecture (Evaluation of Micro-benchmarks, Comparison of progress in different phases, Evaluation with Hadoop-1.x, Resource Utilization: CPU, Memory, Network, Disk I/O, Evaluation of Macro-Benchmarks); Evaluation with SS Architecture (Evaluation with Micro-benchmarks, Resource Utilization); Evaluation with Hyb Architecture (Evaluation with Micro-benchmarks, Evaluation with Concurrent MapReduce Jobs, Evaluation for Priority Storage Selection)
   Related Work
   Summary

6. Accelerating MapReduce with Non-Volatile Memory
   Introduction
   Leveraging NVM for MapReduce: NVRAM-Assisted Map Spilling; Overview of NVRMR; Implementation Details of NVRMR; Dynamic Adaptation to the Network Congestion
   Performance Evaluation: Experimental Setup (OSU-RI, SDSC Comet); Performance Comparison using NVRAM and DRAM; Performance Comparison with Sort and TeraSort; Performance Comparison with Intel HiBench and PUMA; Performance Comparison with Large Workloads; Performance Comparison with TPC-H Queries
   Related Work
   Summary

7. High-Performance DAG Execution Framework with NVRAM and RDMA
   Introduction
   High-Performance Design of Apache Tez: Design of NVRTMR; Tez over Lustre
   Performance Evaluation: Experimental Setup (OSU-RI, SDSC Comet); Performance Comparison using NVRAM and DRAM; Performance Comparison with Sort; Performance Comparison with Intel HiBench workloads; Performance Comparison with TPC-H Queries; Evaluation with Tez over Lustre
   Related Work
   Summary

8. Performance Modeling for RDMA-enhanced Hadoop MapReduce
   Introduction
   Proposed Model: Configuration Parameters; Phase Cost Parameters; Operation Cost Parameters; Modeling MapTask; Modeling ReduceTask (Modeling Shuffle Phase, Modeling Merge Phase, Modeling Reduce Phase); Modeling a MapReduce Job
   Model Simplification and Characterization: Model Simplification; Model Characterization (Varying Total Data Size; Varying maximum concurrent reducers, r_max; Varying cluster size, n; Varying total number of Reducers, r; Varying multiple configuration parameters)
   Model Validation: Experimental Setup (OSU-RI Compute, OSU-WCI, TACC Stampede); Model Validation for TeraSort; Model Validation for Other Workloads; Model Validation for Enhanced MapReduce over TCP/IP
   Related Work
   Summary

9. A Comprehensive Tuning, Profiling, and Prediction Tool for Advising HPC Users to Accelerate MapReduce Applications
   Introduction
   Design Details of MR-Advisor: Information Collection; Workload Preparation and Deployment (Parameter Mapping, Workload Preparation, Deployment and Monitoring); Analysis and Recommendation (Black-box Approach, Prediction through Profiling)
   Performance Evaluation: Experimental Setup (OSU-RI Storage, TACC Stampede, SDSC Gordon, OSU-RI); Comparison with Existing Frameworks; Case Studies on Different Clusters (Case Study I - TACC Stampede with User-Space Parameters; Case Study II - OSU-RI Storage with User-Space Parameters; Case Study III - TACC Stampede and SDSC Gordon with System-Space Parameters; Case Study IV - OSU-RI2 with System-Space Parameters); Tuned Performance Comparison; Performance Profiling and Prediction
   Related Work
   Summary

10. Future Research Directions
   Study of the Proposed Designs on Applications
   Study of Power-Performance Trade-offs
   Designing Advanced MapReduce Engines for Next-Generation Processors
   Impact of HPC Resources on Real-Time Stream/Graph Processing

11. Software Release and Its Impact

12. Conclusion and Contributions

Bibliography

List of Tables

4.1 Comparison of average shuffle times
Benefits of on-demand shuffling with HOMR
Normalized execution times for 100 GB TeraSort
Summary of benefits for SWIM
Storage Capacity Comparison on Typical HPC Clusters
Average break-down times for different stages in map
Benchmarks used and the data sizes from Intel HiBench and PUMA
Performance comparison of NVRMR and DRMR
Benchmarks used and the data sizes from Intel HiBench
Performance comparison of NVRTMR and DRTMR
A small subset of Hadoop configuration parameters
Cost parameters representing execution times for different phases
Cost parameters representing execution times for different operations
Configuration and analysis of profiled data with variation in a single parameter for TeraSort (OSU-RI Compute)
8.5 Configuration and analysis of profiled data with varying data size, n, and r for TeraSort (TACC Stampede)
Workload-specific reduce time and key-value size
High-level Comparison between MR-Advisor and Existing Tuning Frameworks for MapReduce
Dimensions of system-space parameters in MR-Advisor
Comparison between MR-Advisor and Starfish (OSU-RI Storage, 8 nodes)
TeraSort tuning and comparison with best-practice values for <Number of Reducers, Block Size> = <96, 128 MB>
Tuning with system-space parameters in RHMR

List of Figures

1.1 MapReduce execution model and modern HPC resources
The Proposed Research framework
Overview of MapReduce and its components
Work-flows for Hive running over MapReduce and Tez (Courtesy: [120])
Architectural overview of RDMA-based MapReduce
Detailed design and components of RDMA-based MapReduce
Overlapping of different phases in RDMA-based MapReduce execution compared to the default
Job execution time evaluation and comparison for TeraSort with RDMA-based MapReduce (OSU-RI Storage)
Evaluation of TeraSort on OSU-RI Compute
Job execution time evaluation and comparison for Sort with RDMA-based MapReduce (OSU-RI Storage)
Sort benchmark evaluation with SSD-backed HDFS storage
Benefits of pre-fetching and caching in RDMA-based MapReduce
Profiling of data transmission in shuffle phase
Architecture of HOMR
4.3 Overlapping of different phases in MapReduce job: (a) Default architecture, (b) RDMA-based architecture (Chapter 3), (c) HOMR architecture
Sample weight assignment for both algorithms
Average response time for shuffle algorithms
Merge-reduce progress for skewed data distribution
Profiling of HOMR
Comparison of TeraSort between default architecture and HOMR
Comparison of Sort between default architecture and HOMR
Comparison of TeraSort across RDMA-based designs
Comparison of Sort across RDMA-based designs
Facebook SWIM workload evaluation
Evaluating PUMA workloads
YARN MapReduce running over a typical Lustre setup on modern HPC clusters
Design of YARN MapReduce over Lustre
Job execution flow on different deployment architectures for HOMR over Lustre
Tuning of stripe size and stripe count in Lustre
Optimization in Lustre read and write threads
Performance evaluation and comparison with different local directory configuration
Priority directory selection in HOMR
5.8 Evaluation of Sort with variation in cluster and data sizes for SIO
Map and reduce phase progress comparison in different clusters
Sort benchmark evaluation with variation in cluster and data size for Hadoop-1.x
Profiling CPU usage
Profiling memory usage
Profiling network throughput
Profiling I/O operations in local disk
Macro benchmark evaluation and summary
Evaluation of Sort with variation in cluster and data sizes for SS
Resource utilization in TACC Stampede
Evaluation of Sort with variation in cluster and data sizes for Hyb
Performance benefits for concurrent job execution
Data distribution in different storage for Hyb architecture (OSU, 32 cores with SSD, HDD, and Lustre)
Evaluation of Sort with different intermediate data directory
High-level execution overview of RDMA-enhanced MapReduce frameworks and comparison with NVRMR
NVRMR design and implementation in Hadoop
RDMA buffer pools for non-blocking operations in NVRMR ShuffleHandler
Network congestion profiling using PerfQuery with NVRMR
Comparison of Map and Reduce time in the overall execution for Hadoop MapReduce (OSU-RI2)
6.7 Comparison of Map and Reduce time in the overall execution for Hadoop MapReduce (SDSC Comet)
Performance benefits for Intel HiBench and PUMA workloads (SDSC Comet)
Evaluation of larger HiBench datasets with Hadoop on 32 nodes (SDSC Comet)
TPC-H query evaluation with Hive over Hadoop MapReduce
High-level overview for Tez and MapReduce applications
NVRTMR design and implementation
Comparison of Map and Reduce time in the overall execution for Tez MapReduce
Performance benefits for NVRTMR with Intel HiBench workloads (SDSC Comet)
TPC-H query evaluation with Hive over Hadoop and Tez MapReduce
Performance benefits for NVRTMR over Lustre with different deployment architectures (SDSC Comet)
Evaluation with existing model and comparison using Sort (20 GB)
Model validation for TeraSort in different clusters
Model validation using different workloads in OSU-RI Compute
Configuration dimensions for typical deployments
Design and architectural details of MR-Advisor
Job submission and monitoring of WPDU
Performance speedup using MR-Advisor with different benchmarks over HDFS on 8 nodes
9.5 Tuning for total maps and file system block size with TeraGen (TACC Stampede)
Performance characteristics with total number of reducers and file system block size using TeraSort on 16 nodes (TACC Stampede)
Study of total number of reducers and file system block size over HDFS on 8 nodes (OSU-RI Storage)
System-space parameter tuning for HMR and RHMR over HDFS with 8 nodes (OSU-RI Storage)
Performance comparison for different MapReduce frameworks using TeraSort with 32 nodes (TACC Stampede)
Job profiling using MR-Advisor (OSU-RI Storage)
Prediction for TeraSort with MR-Advisor (TACC Stampede)

Chapter 1: Introduction

In recent times, Big Data has been one of the most used terminologies in both academia and industry. The term Big Data symbolizes the amount of data that is produced and consumed around us at each moment in time. Although it started as a hot IT buzzword, it has since posed many new challenges and questions that have opened up new research opportunities and use cases. The 2011 International Data Corporation (IDC) study [1] on the Digital Universe indicated the beginning of the Information Age, where the foundation of economic value is largely derived from information rather than physical things. The amount of data around the world seems to be exploding; the rate of this information growth appears to be exceeding Moore's Law, and it is expected that 35 zettabytes of data will be generated and consumed by the end of this decade. According to the Market Strategy and BI Research group [2], data volumes are doubling every year. Moreover, 42.6% of respondents are saving more than three years of data for analytical purposes. The term Big Data has become more viable as various cost-efficient approaches and solutions have begun to emerge to deal with this data deluge: handling the volume with faster storage, taming the high velocity of data growth with fast processing, and using advanced technologies that match the variability of the data. The expansion of usage of different Big Data technologies is also creating novel convergence opportunities between Big Data and other fields - in general, the convergence of

machines, data, and analytics. Big Data technologies are already providing DNA strand decoding to predict disease patterns as well as suggesting what movies we might want to watch online [35]. The ongoing challenge of extracting knowledge from soaring volumes of unstructured data using analysis tools and processing technology is now applicable to everything in our day-to-day life - weather forecasting [74], social media marketing [59], virtual gaming [128], and even maintaining law and security through crime prediction [40]. According to Forbes [88], even political data analytics is advancing from simple micro-targeting to true predictive data science, which is playing a key role in many decision-making processes for the 2016 U.S.A. national election candidates. Such an enormous application field for Big Data technologies cannot be fully exploited unless these technologies are upgraded with the advances in the related convergent fields.

On the other hand, High-Performance Computing (HPC) is an active area of research providing ground-breaking solutions to many scientific and engineering problems with advanced networking, storage, and computation capabilities. Since technological advancements have reduced the cost and complexity of implementing an HPC cluster [29] these days, HPC is now providing solutions to numerous applications from the business domain as well. For this reason, the two most disruptive technologies in modern computer science - namely, Big Data processing and High-Performance Computing - are slowly converging to meet the challenges exposed by large-scale data analytics.

For several years now, Hadoop [46] has been acting as the backbone of many different Big Data technologies. The data processing capabilities provided by the MapReduce [41] programming model, implemented with different execution frameworks, are considered to be the most powerful technology Big Data has ever produced. Similar to other scientific parallel

programming models, MapReduce is also being extensively used for data-intensive computing on modern HPC systems because of its fault-tolerance and scalability. However, as applications begin scaling to terabytes of data, the socket-based communication model, which is the default interface in the open-source implementation of MapReduce (Hadoop), becomes a performance bottleneck. MapReduce also relies heavily on the availability of local storage media, which introduces space inadequacy for applications with a large amount of temporary intermediate data. Furthermore, frequent local disk accesses in different phases of MapReduce job execution lead to significant performance degradation.

Although MapReduce can be configured to run on top of any underlying file system for storage, the most common of them is HDFS [109] (Hadoop Distributed File System). Running over HDFS, MapReduce takes advantage of local disks on compute nodes to achieve better data-locality. However, most modern HPC clusters [20, 49, 110] tend to follow the traditional Beowulf architecture [112, 113] model, where the compute nodes are provisioned with a lightweight operating system and either disk-less or limited-capacity local storage [43]. At the same time, they are connected to a sub-cluster of dedicated I/O nodes with enhanced parallel file systems, such as Lustre, which can provide fast and scalable data storage solutions. Furthermore, the introduction of Non-Volatile Memory (NVM) in these high-end systems also opens up new opportunities to re-think the present architectures of MapReduce execution frameworks.

[Figure 1.1: MapReduce execution model and modern HPC resources]

Figure 1.1 shows an example of the execution flow for current MapReduce applications. It also highlights the HPC resources that these execution frameworks are unable to take advantage of because of their intrinsic design considerations. For example, the deployment of parallel global file systems in HPC clusters is not similar to the distributed shared-nothing architecture that the MapReduce frameworks expect. This deployment architecture is not natively conducive to default MapReduce because these clusters usually have small-capacity local disks, which prevents default MapReduce jobs with large-scale data sizes from running. Also, modern high-performance interconnects (e.g., InfiniBand) used in these clusters can provide extremely low latency and high bandwidth. Efficient utilization of these resources is crucial for MapReduce in terms of performance and scalability.

Although MapReduce is well suited for batch processing, its synchronized nature of staging data in various phases of a job makes it inadequate for multi-step or iterative job processing. For such workloads, Directed Acyclic Graph (DAG) execution engines (e.g., Spark [121], Tez [120]) provide faster job scheduling compared to MapReduce. In this model, jobs are represented as vertices in a graph where the order of execution is specified

by the directionality of the edges in the graph [9]. Such execution models can schedule all the different jobs at once rather than in a synchronized manner. However, DAG engines such as Tez also need to be re-designed to take advantage of different HPC resources (e.g., RDMA over InfiniBand, parallel file systems, NVM) on HPC clusters.

To leverage the benefits of advanced HPC cluster features, the default frameworks (MapReduce and Tez) must be re-designed and remodeled while keeping their fault-tolerance and usability intact. Also, these frameworks consist of a huge number of performance-sensitive configuration- and run-time parameters. Efficient parameter tuning for such large frameworks is necessary to obtain good performance across different HPC clusters. To summarize, we address the following important challenges in this thesis:

1. Can we re-design Hadoop MapReduce to take advantage of high-performance interconnects such as InfiniBand and exploit advanced features such as RDMA?

2. Can this RDMA-enhanced design of MapReduce guarantee the maximum possible overlapping across all phases of job execution while utilizing in-memory processing as much as possible?

3. Can we devise alternate deployments and shuffle strategies for MapReduce running over parallel file systems, such as Lustre, based on the intermediate data storage? How can we extract the best performance behavior for each of these deployments?

4. How can we re-design the MapReduce execution frameworks to leverage the byte-addressability of NVRAM and extract performance benefits?

5. Can the DAG execution engine Tez be enhanced with similar design concepts used for high-performance MapReduce?

6. How can we formulate a comprehensive performance model for a high-performance MapReduce framework that can be used to predict the performance characteristics of any MapReduce job running at large scale?

7. Can a framework be proposed that optimizes the job execution performance of MapReduce and DAG engines through comprehensive tuning and profiling over the entire parameter space?

1.1 Problem Statement

The designs of default Hadoop and many of the data analytics frameworks (DAG engines, e.g., Tez [120]) running on top of it mainly focus on commodity servers, which are typically equipped with low-bandwidth interconnects. These clusters often have multiple large-capacity local HDDs to achieve better data-locality for MapReduce jobs. In contrast, modern HPC clusters [49, 110] have quite different execution environments, where high-speed interconnects, like InfiniBand and 10 Gigabit Ethernet (10 GigE), and high-performance but smaller-capacity local disks, like SSDs, are commonly used. In addition, a global storage system, like Lustre [129], is often shared by all the compute nodes to meet the storage requirements of HPC applications. If we directly run default Hadoop on HPC clusters, it is hard to achieve optimal performance. Recent studies [37, 57, 66, 84, 126] have also shown that default Hadoop components cannot efficiently leverage HPC cluster features, like Remote Direct Memory Access (RDMA)-enabled high-performance interconnects and high-throughput, large-capacity parallel file systems. InfiniBand is the most popular RDMA-enabled high-performance interconnect on the TOP500 list [26], while Lustre [129] is widely deployed on modern HPC clusters.

This leads us to the following broad challenge: Can a high-performance MapReduce and DAG execution framework be designed to take advantage of HPC resources and improve the performance and scalability of different data analytics workloads and applications on leadership-class HPC systems? In this era of Big Data and HPC convergence, such a high-performance solution for MapReduce and DAG execution will greatly impact the usage of different Big Data middleware in the communities.

[Figure 1.2: The Proposed Research Framework]

Figure 1.2 shows the overall scope of this thesis. We aim to address the following challenges using this research framework:

1. Can we re-design Hadoop MapReduce to take advantage of high-performance interconnects such as InfiniBand and exploit advanced features such as RDMA? To avoid the potential bottleneck from Java sockets based communication, designers of high-end enterprise data centers and clouds have been looking towards high-performance interconnects such as InfiniBand to allow the unimpeded scaling

of their big data applications [33]. The current Hadoop middleware components do not leverage the high-performance communication features offered by InfiniBand. Without high-performance InfiniBand support in the vanilla Hadoop system, big data analytics applications using Hadoop cannot get the best performance on such systems. Recent research works [57, 66, 71, 114, 126] analyze the huge performance improvements possible for different cloud computing middleware using InfiniBand networks. The research project Hadoop-A [126] provides native IB verbs support for data shuffle in Hadoop MapReduce on InfiniBand networks. However, it does not exploit all the potential performance benefits of high-performance networks.

2. Can this RDMA-enhanced design of MapReduce guarantee maximum possible overlapping across all phases of job execution while utilizing in-memory processing as much as possible? The RDMA-based MapReduce framework can gain significant performance benefits compared to the default architecture running over high-performance interconnects such as 10 GigE and IPoIB. However, transmitting the data using RDMA without enhancing the default algorithms for shuffle and merge does not take into account the possibility of increased overlapping among the map, shuffle, merge, and reduce phases. Also, differences in the network resource utilization between the default framework and the RDMA-based design need to be investigated to improve the utilization further. To achieve better overlapping across all job execution phases while utilizing in-memory processing, advanced shuffle algorithms are needed. For example, in the initial stage of job execution, the RDMA-enhanced design can shuffle only a small portion of data from each map location, which minimizes network utilization. As soon as all maps finish execution, this design can start fetching more data and utilize the

network with almost no stalls. This can be improved by an intelligent shuffle algorithm that, at each shuffle request, fetches the data that is needed sooner than the rest, so that the merge and reduce phases also progress faster by processing that data. We also believe that the overlapping across different phases can be optimized further to achieve significant performance benefits for any underlying interconnect.

3. Can we devise alternate deployments and shuffle strategies for MapReduce running over parallel file systems, such as Lustre, based on the intermediate data storage? How can we extract the best performance behavior for each of these deployments? For most of the high-performance Lustre installations on HPC clusters, using Lustre as the local storage for MapReduce jobs should be straightforward, since a shuffle can be completed in only two steps of file system write and read operations, which is intuitively seen as a high-speed data shuffle path because of the high-throughput read/write operations of Lustre. However, the transmission time inside Lustre (among the Object Storage Servers) depends on many factors, such as the cluster interconnect, workload variation, etc. These factors may introduce read and write overhead in Lustre for a MapReduce job running with many concurrent maps and reduces, each of them reading from or writing to the file system. On the other hand, high-performance interconnects, such as InfiniBand and 10/40 Gigabit Ethernet, are commonly used for data movement amongst the compute nodes on modern HPC clusters. They can provide extremely low latency and high bandwidth. For example, the latest InfiniBand network provides around 1 µs point-to-point data transfer latency and up to 100 Gbps

bandwidth. Such a high-speed data movement path can also be utilized for data shuffling from mappers to reducers as an alternative strategy, where significantly fewer processes read from Lustre, thereby reducing the contention. Also, since different HPC clusters provide different Lustre installations in terms of read/write performance and the Lustre interconnect, utilizing only Lustre as the intermediate data storage may not yield optimum performance for every scenario. For example, SDSC Gordon [49] provides local SSDs on each compute node while Lustre is accessible through 10 GigE links; whereas TACC Stampede provides a fast Lustre setup through InfiniBand FDR with slow local HDDs per node. Thus, utilizing Lustre for all intermediate data on TACC Stampede is better than utilizing the local HDDs. However, this is not always true for SDSC Gordon. For these reasons, a priority-based intermediate storage selection is necessary to dynamically detect the best storage at runtime.

4. How can we re-design MapReduce execution frameworks to leverage the byte-addressability of NVRAM and extract performance benefits? Because of their non-volatile property, NVM devices provide fast read/write access as well as data persistence in the event of a power failure. These features are ideal for large data processing engines, such as Hadoop, for which fault-tolerance is not just a desirable property, but a necessity. Moreover, because of their byte-addressability, NVM devices can also be used as Non-Volatile Random Access Memory (NVRAM), which provides DRAM-like performance characteristics. For these reasons, utilizing NVRAM for MapReduce frameworks in the most efficient manner is a critical challenge.

5. Can the DAG execution engine Tez be enhanced with similar design concepts used for high-performance MapReduce? DAG execution engines also depend on bulk data transfer across compute nodes as well as intermediate data reads/writes during the course of job execution. Thus, similar performance-enhancing features from RDMA-enhanced MapReduce on HPC clusters can improve the job execution performance of the DAG engines. However, the I/O and data communication patterns can vary significantly between MapReduce and DAGs, which brings new scope to improve the design and architecture of such engines as well.

6. How can we formulate a comprehensive performance model for a high-performance MapReduce framework that can be used to predict the performance characteristics of any MapReduce job running at large scale? Several performance modeling works [54, 82, 131] have been carried out to deeply analyze the default MapReduce framework. The model proposed in [54] can predict the performance of MapReduce jobs for the Apache Hadoop distribution that supports the default MapReduce architecture with traditional TCP/IP-based communication. In default MapReduce, only the shuffle and merge phases are overlapped, whereas the RDMA-enhanced design increases the efficiency of the framework by overlapping maximally among all three phases (shuffle, merge, and reduce) of a MapReduce job execution pipeline. Unlike the default distribution, the RDMA-enhanced design reduces the number of disk operations via pre-fetching and caching of map outputs as well as in-memory merge operations for the entire dataset. Therefore, the parameters in the existing models are not sufficient to capture all of the

aspects of the RDMA-enhanced design and thus cannot make a good prediction of the performance for this new design.

7. Can a framework be proposed that optimizes the job execution performance of MapReduce and DAG engines through comprehensive tuning and profiling over the entire parameter space? Advanced designs of MapReduce leveraging the benefits of HPC resources can provide superior solutions through careful parameter tuning. The set of parameters governing the performance of such advanced designs of MapReduce may vary depending on the underlying system resources utilized. However, none of the existing tuning frameworks has considered parameter tuning for RDMA-enhanced MapReduce implementations. Moreover, performance tuning for the DAG execution engines (e.g., Spark [121], Tez [120]) has not been investigated yet. Designing a single efficient tuning framework for any MapReduce implementation is hard, not only because the parameter configuration space is huge, but also because parameter tuning for any particular MapReduce job is an arduous task. It is also difficult for users to determine suitable values for the parameters without having a good understanding of the MapReduce application characteristics. Thus, we believe that a generalized performance tuning framework for any MapReduce implementation is of utmost importance to provide performance insights for different MapReduce applications and workloads.

1.2 Organization of the Thesis

The rest of the thesis is organized as follows. Chapter 2 introduces the topics and concepts that are relevant to this thesis. Chapter 3 provides the RDMA-based design of

Hadoop MapReduce over InfiniBand, while Chapter 4 discusses the hybrid overlapping design on top of RDMA-based MapReduce. Chapter 5 explores the design studies for MapReduce over Lustre with different deployment architectures. Chapter 6 highlights the accelerated framework for MapReduce with Non-Volatile Memory. Chapter 7 discusses similar design enhancements for the DAG execution engine, Tez. We provide a detailed performance model for the RDMA-based design of MapReduce in Chapter 8. Chapter 9 introduces MR-Advisor, a comprehensive tuning and prediction tool for MapReduce execution frameworks. After that, some future research directions are highlighted in Chapter 10. The thesis is concluded in Chapter 12 after highlighting the broader impact of this thesis in Chapter 11.

Chapter 2: Background

In this chapter, we provide some necessary background details.

2.1 Hadoop

Hadoop [46] is a popular open-source implementation of the MapReduce [41] programming model. The Hadoop Distributed File System (HDFS) [109] is the primary storage for a Hadoop cluster.

2.1.1 Hadoop Distributed File System (HDFS)

An HDFS cluster consists of two types of nodes: NameNode and DataNode. The NameNode manages the file system namespace, while the DataNodes act as the storage system for the HDFS files. In a generic HDFS cluster, usually one NameNode and multiple DataNode processes are run to provide the file system functionality. HDFS divides large files into blocks, 64 MB in size at default settings. Each block is stored as an independent file in the local file system of the DataNodes. HDFS guarantees data availability and fault-tolerance by replicating each block to multiple DataNodes, with the default replication factor being three.
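For concreteness, block size and replication can also be set on the client side through the standard Hadoop configuration and file system API. The sketch below is a minimal illustration using the org.apache.hadoop.fs API; the file path and the chosen values are hypothetical, and dfs.blocksize/dfs.replication are the Hadoop 2.x property names (older releases use dfs.block.size).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2.x property names; values here are illustrative.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks
        conf.setInt("dfs.replication", 3);                   // three replicas per block

        // Connects to the NameNode configured as fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/sample.txt");        // hypothetical path
        try (FSDataOutputStream stream = fs.create(out)) {
            stream.writeBytes("hello hdfs\n");               // data is split into blocks on the DataNodes
        }
        fs.close();
    }
}
```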

2.1.2 Hadoop MapReduce

Figure 2.1(a) shows a typical first-generation MapReduce (MRv1) cluster. The NameNode has a JobTracker process, and all the DataNodes can run one or more TaskTracker processes. Together, these processes act as a master-slave architecture for a MapReduce job. A MapReduce job usually consists of three basic stages: map, shuffle/merge/sort, and reduce. Figure 2.1(b) shows these stages in detail.

[Figure 2.1: Overview of MapReduce and its components: (a) high-level overview of the Hadoop MapReduce (MRv1) architecture; (b) MapReduce data flow with multiple map and reduce tasks]

The JobTracker is the service within Hadoop that farms out tasks to specific nodes in the cluster. A single JobTracker and a number of TaskTrackers are responsible for the successful completion of a MapReduce job. Each TaskTracker can launch several MapTasks, one per split of data. The map() function converts the original records into intermediate results and stores them on the local file system. Each of these files is sorted into many data partitions, one per ReduceTask. The JobTracker then launches the ReduceTasks as soon as map outputs are available from the MapTasks. A TaskTracker can spawn several concurrent MapTasks or ReduceTasks. Each ReduceTask starts fetching the map outputs from the map output locations that have already completed.

This fetching stage is the shuffle/merge period, where the data from various map output locations are sent and received via HTTP requests and responses. While receiving these data from various locations, a merge algorithm is run to merge these data to be used as input for the reduce operation. Then each ReduceTask loads and processes the merged outputs using the user-defined reduce() function. The final result is then stored into HDFS.

2.1.3 Detailed Design of Hadoop MapReduce

As discussed in Section 2.1.2, in the default MapReduce workflow shown in Figure 2.1(b), map outputs are kept in files residing on local disks which can be accessed by TaskTrackers. During shuffle, ReduceTasks ask TaskTrackers to provide these map outputs over HTTP. Most of the components that are responsible for these operations are shown on the left side of both ReduceTask and TaskTracker in Figure 3.2. Some of these components are described below.

HTTP Servlet: HTTP Servlets run in the TaskTracker to handle incoming requests from the ReduceTasks. Upon an HTTP request, the Servlets get the appropriate map output file from local disk and send the output in an HTTP response message, dividing the content into packets of 64 KB.

Copier: Inside the ReduceTask, the copiers are responsible for requesting appropriate map outputs from the TaskTracker in an HTTP request message. Upon arrival of the data, a copier keeps the data in memory if a sufficient amount of memory is available, or on local disk otherwise.

In-Memory Merger: Inside the ReduceTask, there are two types of merge operations. If the data received during shuffle can be held in memory, the in-memory merger merges these data in memory and then writes the merged output to local disk. Otherwise, only the Local FS Merger is used.

Local FS Merger: The Local FS Merger gets the data from local disk that has been kept there either by the Copier or by the In-Memory Merger. It merges and stores these data on local disk in an iterative manner, minimizing the total number of merged output files on local disk each time.

Map Completion Fetcher: Inside the ReduceTask, a Map Completion Fetcher keeps track of currently completed map tasks. After a map completion event occurs, it signals the Copiers to start copying the data for this map output by sending out an HTTP request.

In the existing implementation of MapReduce, the TaskTracker and ReduceTask exchange their information over Java Sockets. Each HTTP Servlet sends an acknowledgment to the appropriate ReduceTask with the meta-information for the data requested and then begins sending data packets one after another. The ReduceTask, after receiving the acknowledgment, extracts the meta-information and sets appropriate parameters and buffers for receiving the data packets.

2.1.4 Overview of YARN

In a typical Hadoop-1.x (MRv1) cluster, the master, a.k.a. the JobTracker, is responsible for accepting jobs from clients as well as for job scheduling and resource management. Hadoop-2.x improves on this scalability limitation by introducing separate node and resource managers. In Hadoop-2.x, YARN (Yet Another Resource Negotiator) [123], also known as MapReduce 2.0 (MRv2), decouples the resource management and scheduling functionality of the JobTracker in Hadoop-1.x. There is a global Resource Manager (RM) responsible for assigning resources to all the jobs running on the Hadoop cluster. The Node Manager (NM) is similar to the TaskTracker in MRv1, with one NM running per node. For each application, there is an Application Master (AM). The AM coordinates with the RM and NMs to execute the corresponding tasks.
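To make the user-visible side of the execution flow described in Sections 2.1.2 and 2.1.3 concrete, the sketch below shows a minimal word-count job written against the org.apache.hadoop.mapreduce Java API (Hadoop 2.x style); the class names and directory arguments are illustrative. The map() function emits one (word, 1) pair per token, the framework shuffles and merges the intermediate pairs by key, and the reduce() function sums the counts for each key.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(): converts each input record into intermediate (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // intermediate output, partitioned per ReduceTask
            }
        }
    }

    // reduce(): merges all values that share the same key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new IntWritable(sum));  // final output is written to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```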

2.2 MapReduce Execution Frameworks

The MapReduce programming model is supported by different execution frameworks, as described below.

2.2.1 Batch Processing

Batch processing usually refers to running a set of jobs serially, one after another. In terms of the MapReduce [41] model, Hadoop provides the batch execution framework for MapReduce programs. Hadoop does not keep any information about co-dependent jobs; rather, it executes each job one after another, and for co-dependent jobs the first job's output is used as the second job's input, as sketched below. Because of this, each job's output must be stored in persistent storage after successful completion.
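The sketch below illustrates this chaining with a hypothetical two-stage driver built on the standard org.apache.hadoop.mapreduce.Job API: the first job persists its output to an intermediate directory, and the second job only starts after the first completes, reading that directory as its input. The job names and paths are illustrative, and the application-specific map/reduce classes are omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // persisted output of stage 1, input of stage 2
        Path output = new Path(args[2]);

        // Stage 1: runs to completion and persists its output before stage 2 starts.
        Job first = Job.getInstance(conf, "stage-1");
        first.setJarByClass(TwoStageDriver.class);
        // first.setMapperClass(...); first.setReducerClass(...);  // application-specific classes
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
            System.exit(1);                      // stage 2 never starts if stage 1 fails
        }

        // Stage 2: consumes the persisted intermediate directory.
        Job second = Job.getInstance(conf, "stage-2");
        second.setJarByClass(TwoStageDriver.class);
        // second.setMapperClass(...); second.setReducerClass(...);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```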

2.2.2 DAG Processing

DAG (Directed Acyclic Graph) execution engines are similar to Hadoop MapReduce in terms of job execution phases. However, they differ in the scheduling approach and in the execution work-flow. Current popular DAG frameworks include Tez [106, 120], which provides MapReduce developers APIs to write YARN applications, both batch and interactive. Figure 2.2 presents the difference between the MapReduce and Tez work-flows when a Hive [4] query is run on top of these frameworks. As shown in Figure 2.2(a), a typical Hive query can launch several MapReduce jobs, each of which writes its data to the underlying file system for storage and replication purposes. This causes the end user to wait for a long time before the result of such a query is produced. Apache Tez, on the other hand, converts such a set of MapReduce jobs into a DAG and executes the DAG with multiple reduce phases (shown in Figure 2.2(b)). This minimizes the runtime considerably because of the smaller number of file system access operations. This execution model is more suitable for interactive applications such as database queries and streaming applications.

Figure 2.2 Work-flows for Hive running over MapReduce and Tez (Courtesy: [120]): (a) Hive with MapReduce, (b) Hive with Tez.

2.3 Other Big Data Middleware

2.3.1 Spark

Spark [121, 134] is an open source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. It was designed for specific types of cluster computing workloads, namely iterative workloads, such as machine learning algorithms that reuse a working set of data across parallel operations, and interactive data mining. To optimize for these types of workloads, Spark employs the concept of in-memory cluster computing, where datasets can be cached in memory to reduce their access latency. Spark's architecture revolves around the concept of a Resilient Distributed Dataset (RDD) [134], which is a fault-tolerant collection of objects distributed across a set of nodes that can be operated on in parallel. These collections are resilient, because they can be rebuilt if a portion of the dataset is lost.
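As a minimal illustration of this in-memory model using Spark's Java API, the sketch below caches an RDD so that repeated actions over the same working set avoid re-reading the input; the input path and master setting are placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkCacheExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("rdd-cache-example")
                .setMaster("local[*]");   // placeholder; normally supplied by spark-submit
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Build an RDD from a (placeholder) HDFS path and keep it in memory
            JavaRDD<String> lines = sc.textFile("hdfs:///user/data/input").cache();

            // Both actions reuse the cached partitions instead of re-reading the input
            long total = lines.count();
            long errors = lines.filter(l -> l.contains("ERROR")).count();

            System.out.println(total + " lines, " + errors + " error lines");
            sc.stop();
        }
    }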

2.3.2 Tachyon

Tachyon is a memory-centric distributed storage system that enables reliable data sharing at memory speed across Spark or MapReduce cluster frameworks. It achieves high performance by leveraging in-memory lineage information. Tachyon caches working-set files in memory, thereby avoiding going to disk to load datasets that are frequently read.

2.3.3 Hive

Apache Hive [4] is a data warehouse engine that is utilized to perform various database operations when the database is backed by distributed storage, such as HDFS. It translates SQL into Hadoop MapReduce jobs, through which it collects the desired output of the queries.
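To make this SQL-to-MapReduce translation concrete, the sketch below submits a query to Hive through its standard JDBC (HiveServer2) interface; the host, database, table, and credentials are placeholders, and this interface is shown only as an example of how Hive is typically driven.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 JDBC endpoint (placeholder host and database)
            String url = "jdbc:hive2://hive-server.example.com:10000/default";
            try (Connection con = DriverManager.getConnection(url, "user", "");
                 Statement stmt = con.createStatement()) {
                // Hive compiles this query into one or more MapReduce jobs on the cluster
                ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM access_log GROUP BY page");
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }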
2.4 Hadoop Benchmarks and Workloads

There are several benchmarks and workloads that are available to run on top of Hadoop. In this thesis, we present evaluations with the following benchmarks and workloads.

2.4.1 TeraSort

TeraSort [25] is probably the most well-known Hadoop benchmark. It is a benchmark that combines testing of the HDFS and MapReduce layers of a Hadoop cluster. The input data for TeraSort can be generated by the TeraGen [24] tool, which writes the desired number of rows of data into the input directory. By default, the key and value sizes are fixed for this benchmark at 100 bytes. TeraSort takes the data from the input directory and sorts it into another directory. The output of TeraSort can be validated by a TeraValidate tool, which is also available in the Hadoop codebase. Recently, according to [28], TeraSort has been included in the TPC [27] benchmark suite.
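For illustration, the benchmark can also be driven programmatically, as in the following sketch; it assumes the TeraGen, TeraSort, and TeraValidate example classes shipped with the Hadoop distribution, and the row count and directories are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.examples.terasort.TeraGen;
    import org.apache.hadoop.examples.terasort.TeraSort;
    import org.apache.hadoop.examples.terasort.TeraValidate;
    import org.apache.hadoop.util.ToolRunner;

    public class TeraSortExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Generate 10 million 100-byte rows (~1 GB) into the input directory
            ToolRunner.run(conf, new TeraGen(), new String[]{"10000000", "/tera/in"});
            // Sort the generated data into the output directory
            ToolRunner.run(conf, new TeraSort(), new String[]{"/tera/in", "/tera/out"});
            // Verify that the output is globally sorted; the report goes to /tera/report
            ToolRunner.run(conf, new TeraValidate(), new String[]{"/tera/out", "/tera/report"});
        }
    }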
2.4.2 Sort

The Sort [22] benchmark uses the MapReduce framework to sort the input directory into the output directory. The inputs and outputs must be Sequence files in which the keys and values are stored. In this benchmark, the key and value lengths can range from as small as 10 bytes to as large as thousands of bytes. The input of this benchmark can be generated by RandomWriter [18], which writes random-sized key-value pairs into HDFS. This benchmark is a very useful tool to measure the performance efficiency of a MapReduce cluster with random key-value sizes.

2.4.3 PUMA

The PUrdue MApReduce Benchmark Suite (PUMA) [32, 100] represents a range of MapReduce applications exhibiting different application characteristics. It has a total of 13 benchmarks: some exhibit high computation with respect to shuffle, some produce high shuffle volume, and others have an equal distribution of these two phases.

2.4.4 SWIM

Statistical Workload Injector for MapReduce (SWIM) [111] contains suites of workloads consisting of thousands of jobs with complex data, arrival, and computation patterns. It also contains workload synthesis tools to generate representative test workloads by sampling historical MapReduce cluster traces. Most of the workloads in SWIM are generated from historical Hadoop traces on large clusters at Facebook.

2.4.5 Intel HiBench

Intel HiBench [62] is a benchmark suite provided by Intel [12] to evaluate different big data frameworks, such as Hadoop and Spark [121], and streaming frameworks, such as Flink [116] or Storm [119].

2.5 InfiniBand

InfiniBand [61] is an industry-standard switched fabric that is designed for interconnecting nodes in HPC clusters. It is a high-speed, general-purpose I/O interconnect that is widely used by scientific computing centers world-wide. The TOP500 [26] rankings released in November 2016 indicate that more than 37.4% of the computing systems use InfiniBand as their primary interconnect. One of the main features of InfiniBand is Remote Direct Memory Access (RDMA). This feature allows software to read or update the memory contents of a remote process without any software involvement at the remote side. This feature is very powerful and can be used to implement high-performance communication protocols. InfiniBand has started making inroads into the commercial domain with the recent convergence around RDMA over Converged Enhanced Ethernet (RoCE) [19]. InfiniBand offers various software layers through which it can be accessed.

2.5.1 InfiniBand Verbs Layer

The verbs layer is the lowest access layer to InfiniBand. Verbs are used to transfer data and are completely OS-bypassed, i.e., there are no intermediate copies in the OS. The verbs interface follows the Queue Pair (or communication end-points) model. Upper-level software using verbs can place a work request on a queue pair. The work request is then processed by the Host Channel Adapter (HCA). When the work is completed, a completion notification is placed on the completion queue. Upper-level software can detect completion by polling the completion queue or by asynchronous events (interrupts).

The OpenFabrics interface [95] is the most popular verbs access layer due to its applicability to various InfiniBand vendors.

2.5.2 InfiniBand IP Layer

InfiniBand software stacks, such as OpenFabrics [95], provide a driver for implementing the IP layer. This makes it possible to use the InfiniBand device as just another network interface available from the system with an IP address. Such IB devices are presented as ib0, ib1, and so on, just like other Ethernet IP interfaces (e.g., eth0, eth1). Although the verbs layer in InfiniBand provides OS-bypass, the IP layer does not. This layer is often called IP-over-IB, or IPoIB for short; we use this terminology in this thesis. Of the two modes available for IPoIB (Unreliable Datagram or UD, and Reliable Connection or RC), RC is used more widely as it provides better performance by leveraging reliability from the hardware. In this thesis, we use connected-mode IPoIB, which has better point-to-point performance.

2.5.3 Unified Communication Runtime (UCR)

The Unified Communication Runtime (UCR) [70] is a light-weight, high-performance communication runtime designed and developed at The Ohio State University. It aims to unify the communication runtime requirements of scientific parallel programming models, such as MPI and Partitioned Global Address Space (PGAS), along with those of data-center middleware, such as Memcached [45], HBase [47], and MapReduce [46]. UCR is designed as a native library to extract high performance from advanced network technologies. The design of UCR has evolved from the MVAPICH2 software stack. MVAPICH2 and MVAPICH2-X [15] are popular implementations of the MPI-3 specification. These libraries are being used by more than 2,700 organizations in 83 countries and are also distributed by popular InfiniBand software stacks and Linux distributions such as RedHat and SUSE.

2.6 High Performance Storage Technologies

2.6.1 Solid State Drive (SSD)

Solid State Drives have attracted a lot of attention owing to significant data-throughput and efficiency gains over traditional spinning disks. The core intelligence of an SSD can be attributed to its Flash Translation Layer (FTL), which plays a vital role in the adoption of this technology. Major functionality of an SSD, such as wear leveling, garbage collection, and logical block mapping, is packed into the FTL. High bandwidth as well as low latency makes the SSD an ideal candidate to be used in the DataNodes of a Hadoop cluster or as intermediate data storage in order to lessen the I/O bottlenecks for MapReduce applications.

2.6.2 Non-Volatile Memory (NVM)

Non-Volatile Memory provides not only read and write latencies close to those of DRAM, but also persistence, which makes it an ideal choice for data-intensive applications. Because of this persistence, NVM is categorized as Storage Class Memory (SCM). With the introduction of such high-performance storage architectures, many I/O bottlenecks in today's software frameworks can be removed, which increases the possibility of new design changes.

2.6.3 Lustre

Lustre is a POSIX-compliant, stateful, object-based parallel file system that has been deployed on several large-scale supercomputing clusters and data centers. Its ability to handle simultaneous I/O requests from hundreds of thousands of clients, and the raw throughput it offers by leveraging high-bandwidth network interconnects like InfiniBand, have made it the predominant choice of parallel file system. The Lustre file system has a scalable architecture with three primary components: the Meta Data Server (MDS), the Object Storage Server (OSS), and the clients that issue the I/O requests. To access a file, a client first obtains its metadata from the primary MDS, including file attributes, file permissions, and the layout of file objects in the form of Extended Attributes (EA). Subsequent file I/O operations are performed directly between the client and the OSS. By decoupling metadata operations from I/O operations when fulfilling I/O requests, Lustre is able to provide good I/O throughput.

Chapter 3: High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand

3.1 Introduction

The Hadoop MapReduce framework provides application developers with several critical functions, such as key-based sorting, multi-way merging, and data shuffling. When the MapTasks finish the user-defined map operation and set key-value pairs into the context for the ReduceTasks, the framework executes the shuffle operation, which sends key-value pairs from the MapTasks to the appropriate ReduceTask. It also performs the merge of the key-value pairs from multiple mappers on the reducer side. As the size of the dataset increases, the entire operation becomes very communication-intensive. As big data applications begin scaling to terabytes of data, the socket-based communication model, which is the default implementation in Hadoop MapReduce, becomes a performance bottleneck. To avoid this potential bottleneck, designers of high-end enterprise data centers and clouds have been looking towards high-performance interconnects such as InfiniBand to allow the unimpeded scaling of their big data applications [33]. The Greenplum Analytics Workbench [11], an InfiniBand-based cluster, is one of the latest in a series of clusters being deployed to design, develop, and test InfiniBand-based solutions for the Hadoop ecosystem. Although such systems are being used, the Hadoop middleware components do not leverage the high-performance communication features offered by InfiniBand.

Without high-performance InfiniBand support in the vanilla Hadoop system, big data analytics applications using Hadoop cannot get the best performance on such systems. Prior research [57, 66, 71, 114, 126] analyzes the substantial performance improvements possible for different cloud computing middleware using InfiniBand networks. The research project Hadoop-A [126] provides native IB verbs support for data shuffle in Hadoop MapReduce on InfiniBand networks. However, its design does not exploit the full performance potential of high-performance networks. As a result, it is necessary to rethink the Hadoop MapReduce system design when a high-performance network is available. This leads to the following challenges:

1. How do we re-design Hadoop MapReduce to take advantage of high-performance networks such as InfiniBand and exploit advanced features such as RDMA?

2. What will be the performance improvement of MapReduce applications with the new RDMA-based design in Hadoop MapReduce over InfiniBand?

3. Can the new design lead to consistent performance improvements on different hardware configurations, such as with multiple HDDs or SSDs on the compute nodes of a Hadoop cluster?

3.2 Design of RDMA-based Hadoop MapReduce

In this section, we propose our design for MapReduce that takes advantage of RDMA over InfiniBand. We discuss the key components of this design and describe the related challenges and our approach to solving them.

3.2.1 Architectural Overview

Implementing the shuffle engine inside MapReduce with RDMA opens up new design opportunities for other job execution phases (e.g., merge and reduce) as well. The major reason behind this is the fast data transfer offered by RDMA, which does not require any CPU involvement at the remote side. For this reason, our RDMA-based MapReduce implementation not only contains an RDMA-based shuffle engine, but also introduces new pre-fetching and caching of intermediate data and overlapping of the merge and reduce phases. A high-level overview of RDMA-based MapReduce (MRoIB) is presented in Figure 3.1. Here, we keep the existing design and add our RDMA-based communication features as a configurable option with a selection parameter, mapred.ib.enabled. With this parameter set to true, the user can enable the RDMA-based features, and vice versa.

Figure 3.1 Architectural overview of RDMA-based MapReduce (design features: prefetch/caching of MOF, in-memory merge, overlap of merge and reduce; the MRoIB design sits over IB verbs on RDMA-capable networks such as IB, 10GigE/iWARP, and RoCE, alongside the default Java sockets interface)
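For illustration, the following sketch shows how a job could set this selection parameter; apart from the mapred.ib.enabled name introduced above, the class and method names here are hypothetical.

    import org.apache.hadoop.mapred.JobConf;

    public class MRoIBToggleExample {
        public static JobConf enableRdmaShuffle(JobConf conf, boolean useRdma) {
            // true: use the RDMA-based MRoIB communication path;
            // false: fall back to the default socket-based shuffle
            conf.setBoolean("mapred.ib.enabled", useRdma);
            return conf;
        }
    }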
3.2.2 Detailed Design

In this section, we introduce the major components in our RDMA-based MapReduce design and then explain the design details. Figure 3.2 shows a hybrid version of MapReduce consisting of both the default vanilla design and our proposed RDMA-based design. We first discuss our updated Shuffle design, followed by the Merge design.