FuxiSort

Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang
Alibaba Group Inc
{jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv, bin.lub, yangyu.taoyy, li.chao, jingren.zhou, hongtang}@alibaba-inc.com

Overview

Apsara is a large-scale general-purpose distributed computing system developed at Alibaba Cloud Computing Inc (also called Aliyun). It is responsible for managing cluster resources within a datacenter, as well as scheduling parallel execution for a wide range of distributed online and offline applications. It is the common foundation of a majority of the public cloud services offered by Aliyun, and supports all data processing workloads within Alibaba as well. Some key components of the Apsara system include: (a) a distributed file system, called Pangu; (b) a distributed computation framework, called Fuxi, for cluster resource management and job scheduling; and (c) a parallel DAG-based data computation framework on top of Fuxi to process large datasets. Apsara is mostly written in C/C++ to achieve high performance and efficiency. More details about Apsara can be found in [1].

Apsara has been deployed on hundreds of thousands of physical servers across tens of data centers at Aliyun, with the largest clusters consisting of more than 5000 servers each. Any DAG-based distributed data processing can be implemented as a Fuxi job, including simple Map/Reduce jobs and more sophisticated machine learning jobs. Source inputs and outputs, along with intermediate data, are placed in Pangu, with a variety of replication and locality configurations to achieve the optimal balance between performance and reliability and to accommodate varying requirements from different applications.

In this paper, we introduce FuxiSort, a distributed sort implementation on top of Apsara, and present our Daytona GraySort and MinuteSort results, with details of the technical implementation, performance improvements, and efficiency analysis. FuxiSort is able to complete the 100TB Daytona GraySort benchmark in 377 seconds on the random non-skewed dataset and 510 seconds on the skewed dataset, and the Indy GraySort benchmark in 329 seconds. For the Daytona MinuteSort benchmark, FuxiSort sorts 7.7 TB in 58 seconds on the random non-skewed dataset and in 79 seconds on the skewed dataset. For the Indy MinuteSort benchmark, FuxiSort sorts 11 TB in 59 seconds.

System Configuration

Before describing the overall architecture and detailed implementation, we present our system configuration as follows.

1. Hardware

We run FuxiSort on a relatively homogeneous cluster with a total of 3377 machines, classified into 2 groups:

3134 machines: 2 * Intel(R) Xeon(R) E5-2630 @ 2.30GHz (6 cores per CPU), 96GB memory, 12 * 2TB SATA hard drives, 10Gb/s Ethernet, 3:1 subscription ratio
243 machines: 2 * Intel(R) Xeon(R) E5-2650v2 @ 2.60GHz (8 cores per CPU), 128GB memory, 12 * 2TB SATA hard drives, 10Gb/s Ethernet, 3:1 subscription ratio

2. Software

OS: Red Hat Enterprise Linux Server release 5.7
Apsara distributed computing system
FuxiSort implemented on top of Apsara
Occasional background workloads (the cluster is shared among multiple groups, and we chose relatively idle periods of the cluster to conduct our experiments)

FuxiSort Overview

In this section, we present an overview of the FuxiSort design on top of the Apsara system. The same implementation is used to perform both GraySort and MinuteSort, with various system optimizations outlined in the next section.

[Figure 1. An overview of FuxiSort: input partitions are read by map tasks, range-partitioned, and transferred over the network to sort tasks, which merge-sort dump files into the final output files.]

We first perform a sampling of the input to determine the range partition boundaries, and divide the rest of the sorting process into two phases, map and sort, as shown in Figure 1. Each phase is carried out with multiple parallel tasks. During the map phase, each input data segment is read from local disks (via the Pangu chunkserver daemon) and range-partitioned by a map task. Each partitioned result of a map task is then transferred via the network to the corresponding sort task. During the sort phase, each sort task repeatedly collects records from the map tasks and keeps them in memory. It periodically performs an in-memory sort when its buffer becomes full, and then writes the sorted data into temporary files (also stored on Pangu, without replication). After all records are collected, it combines all the previously sorted data, both in memory and in temporary files, using a merge sort, and writes the final result to Pangu. FuxiSort finishes with a number of range-partitioned and internally-sorted data files as the final result by the time all sort tasks finish.
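To make the map-side routing concrete, the following is a minimal sketch, not Apsara code, of how a map task can pick the destination sort task for each record: a binary search over the sampled range boundaries described in the next section. The 10-byte key layout comes from gensort; PartitionOf and the Key alias are illustrative names.

```cpp
#include <algorithm>
#include <array>
#include <vector>

using Key = std::array<unsigned char, 10>;  // gensort records carry 10-byte keys

// boundaries: the S-1 sorted split keys produced by input sampling.
// Returns the index of the sort task (range partition) that owns `key`;
// partition i covers keys in [boundaries[i-1], boundaries[i]).
int PartitionOf(const Key& key, const std::vector<Key>& boundaries) {
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), key);
    return static_cast<int>(it - boundaries.begin());  // value in [0, S-1]
}
```

With on the order of 20,000 partitions, each lookup costs roughly log2(20,000), around 15 key comparisons per 100-byte record, which is small next to the I/O and network work of the map phase.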

Implementation and Optimizations

In this section, we describe the implementation details of FuxiSort and explain several key performance optimizations, covering data generation, sampling, I/O optimization, pipelined and overlapped execution, and network communication. Finally, we present additional improvements for the MinuteSort benchmark.

1. Input data generation

We prepare the input datasets in either random or skewed distribution using gensort, and copy them to Pangu. Each input dataset is split into multiple segments. The number of segments matches the number of map tasks, so that each task processes one data segment. In addition, each input dataset is two-way replicated by Pangu; that is, every segment has two copies. Each copy of a segment consists of a number of Pangu chunks distributed evenly across all the disks on one machine. Each Pangu chunk is stored by Pangu as a local file system file plus associated metadata (Pangu manages individual disks directly to maximize parallel I/O bandwidth, instead of bundling them via RAID). All segments of a dataset are uniformly distributed among all machines in a balanced fashion. We use 20,000 segments for the 100TB dataset in GraySort and 3,400 segments for the datasets in MinuteSort. As described below, the output data are also two-way replicated in a similar way.

2. Input sampling

In order to discover the data distribution and reduce data skew in the sort phase, we sample the input of Daytona runs to generate the range partition boundaries, so that each partition consists of a relatively equal amount of data to process. Given that the input data is split into X segments, we randomly select Y locations in each segment, and read Z continuous records from each location. In this way, we collect X * Y * Z sample records in total. We then sort these samples and calculate the range boundaries by dividing them evenly into S range partitions, where S is the number of sort tasks.

For the 100TB Daytona GraySort benchmark, with 20,000 data segments, we select 300 random locations in each segment, and read one record at each location. As a result, we collect 6,000,000 samples, sort them, and divide them into 20,000 ranges, the same as the number of sort tasks. The calculated range boundaries are used by each map task to partition records in the map phase. The entire sampling phase takes approximately 35 seconds.

For the Daytona MinuteSort benchmark, with 3,350 data segments, we select 900 random locations in each segment, and read one record at each location. A total of 3,015,000 samples are collected, sorted, and then divided into 10,050 ranges. The entire sampling phase takes about 4 seconds.

The sampling step is skipped for the Indy benchmarks.
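The boundary calculation itself reduces to sorting the samples and taking evenly spaced order statistics. A minimal sketch, assuming in-memory samples of 10-byte gensort keys (ComputeBoundaries is an illustrative name, not the production routine):

```cpp
#include <algorithm>
#include <array>
#include <vector>

using Key = std::array<unsigned char, 10>;

// samples: the X * Y * Z keys gathered above; num_sort_tasks: S.
// Returns the S-1 split keys dividing the sorted samples into S even groups.
std::vector<Key> ComputeBoundaries(std::vector<Key> samples, size_t num_sort_tasks) {
    std::sort(samples.begin(), samples.end());
    std::vector<Key> boundaries;
    boundaries.reserve(num_sort_tasks - 1);
    for (size_t i = 1; i < num_sort_tasks; ++i) {
        // With 6,000,000 samples and 20,000 ranges this selects every
        // 300th sample as a split key.
        boundaries.push_back(samples[i * samples.size() / num_sort_tasks]);
    }
    return boundaries;
}
```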

3. I/O dual buffering

We employ the standard dual buffering technique for I/O operations on each Pangu file. Specifically, FuxiSort processes data in one I/O buffer while instructing Pangu to carry out I/O operations with the other buffer. The roles of the two buffers switch periodically, so that I/O and data processing are done in parallel, thereby significantly reducing task latency.
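The pattern looks roughly as follows; this is a simplified sketch, with a toy in-memory Reader standing in for the Pangu client (whose API is not public) and std::async standing in for Pangu's asynchronous I/O:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Stand-in for a Pangu file reader; a real task reads chunks from local disks.
struct Reader {
    std::vector<char> data;  // pretend file contents
    size_t off = 0;
    size_t Read(std::vector<char>& buf) {  // returns bytes read, 0 at EOF
        size_t n = std::min(buf.size(), data.size() - off);
        std::copy(data.begin() + off, data.begin() + off + n, buf.begin());
        off += n;
        return n;
    }
};

// Process one buffer while the next read is in flight, then swap roles.
template <typename ProcessFn>
void DualBufferedRead(Reader& reader, size_t buf_size, ProcessFn process) {
    std::vector<char> bufs[2] = {std::vector<char>(buf_size),
                                 std::vector<char>(buf_size)};
    int cur = 0;
    size_t n = reader.Read(bufs[cur]);  // prime the first buffer
    while (n > 0) {
        // Issue the next read asynchronously into the idle buffer.
        auto next = std::async(std::launch::async, [&reader, &bufs, cur] {
            return reader.Read(bufs[1 - cur]);
        });
        process(bufs[cur].data(), n);  // computation overlaps the I/O
        n = next.get();
        cur = 1 - cur;                 // switch the roles of the two buffers
    }
}
```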

4. Overlapped execution

As shown in Figure 2, to further reduce the overall latency, we break each phase into several small steps and overlap the execution of these steps as much as possible. The steps are listed below:

(1) Data sampling
(2) Job startup
(3) Map task reads input
(4) Map task sends data to sort task
(5) Sort task receives data
(6) Sort task sorts in-memory data and dumps it to temporary files if necessary
(7) Sort task combines in-memory data and temporary files by merge sort
(8) Sort task writes the final output

We parallelize the execution of data sampling and job startup, the latter of which dispatches all the task processes and performs other bookkeeping chores, such as collecting the network addresses of all the sort tasks and informing the corresponding map tasks. When step (1) finishes, the sampling program stores the partition boundaries in Pangu and generates another flag file for notification. Once dispatched, all the map tasks wait for the notification by periodically checking the existence of the flag file. Once the partition boundaries become available, the map tasks immediately read them and perform partitioning accordingly.

Naturally, steps (3), (4) and (5) are pipelined during the map phase, and steps (7) and (8) are pipelined during the sort phase. In step (6), had the sorting and dumping been deferred until the entire memory assigned to a task filled up, step (5) would have been blocked, because the memory would remain fully occupied during sorting, leaving none for receiving new data. To mitigate this problem, we parallelize (5) and (6) by starting the sorting once a certain amount of memory is occupied, so that (6) is brought forward and (5) is not blocked. We start the merge and dump of sorted in-memory data to temporary files when memory is fully occupied. Obviously, the lower the memory threshold that triggers sorting, the earlier step (6) starts. In our experiments, as a tradeoff between concurrency and merge sort efficiency, we start (6) when approximately 1/10 of the memory assigned to a task is occupied by received data. As a result, we are able to parallelize I/O and computation without noticeable delay, at the expense of more temporary files to be merged later, which does not introduce significant overhead in our case. Figure 2 shows the elapsed time for each step and their overlapped execution.

[Figure 2. FuxiSort latency breakdown: elapsed time, in seconds, of steps (1) through (8) and their overlap across the job timeline.]
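The interaction of steps (5) and (6) can be sketched as follows. This is a toy model under stated assumptions, a single task-local buffer of string records and a background std::async sort; the real implementation dumps sorted runs to Pangu temporary files and performs the merge of steps (7) and (8) over those files:

```cpp
#include <algorithm>
#include <future>
#include <iterator>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class SortBuffer {
 public:
  explicit SortBuffer(size_t threshold_bytes) : threshold_(threshold_bytes) {}

  void Receive(std::string record) {  // step (5): collect from map tasks
    bytes_ += record.size();
    active_.push_back(std::move(record));
    if (bytes_ >= threshold_) SpillAsync();  // bring step (6) forward
  }

  std::vector<std::string> Finish() {  // steps (7)/(8), collapsed
    SpillAsync();
    std::vector<std::string> out;
    for (auto& f : pending_) {
      std::vector<std::string> run = f.get(), merged;
      std::merge(out.begin(), out.end(), run.begin(), run.end(),
                 std::back_inserter(merged));
      out = std::move(merged);
    }
    return out;
  }

 private:
  void SpillAsync() {  // step (6) on a background thread
    if (active_.empty()) return;
    auto run = std::make_shared<std::vector<std::string>>(std::move(active_));
    active_.clear();
    bytes_ = 0;
    pending_.push_back(std::async(std::launch::async, [run] {
      std::sort(run->begin(), run->end());
      return std::move(*run);  // a real task would dump this to a temp file
    }));
  }

  size_t threshold_, bytes_ = 0;
  std::vector<std::string> active_;
  std::vector<std::future<std::vector<std::string>>> pending_;
};
```

Keeping several small sorted runs instead of one large sort is exactly the tradeoff described above: more files to merge later in exchange for never stalling the receive path.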

5. Handling network communication

There is significant network communication traffic between map tasks and sort tasks. Each network packet arrival raises an interrupt for a CPU to process. If interrupt handling is affinitized to a specific CPU core, it may be delayed whenever that CPU is busy sorting, resulting in request timeouts and, even worse, packet loss. By balancing the load of network interrupts across all CPU cores via the smp_affinity system option, the rates of timeout events and packet loss are significantly decreased.

6. Additional improvements for MinuteSort

As the elapsed time of MinuteSort is required to be within 60 seconds, the scheduling overhead of ordinary batch data processing becomes non-negligible. To mitigate this overhead, we run MinuteSort in Fuxi's Real-time Job mode, which is designed for high performance with low scheduling latency and in-memory computation.

[Figure 3. Real-time mode architecture: a long-living service with a pool of persistent worker processes on each machine, to which users submit jobs through a scheduler.]

Figure 3 illustrates the architecture of Real-time Job mode. In a typical production environment, the system is set up as part of the cluster deployment process and runs as a long-living service, with a pool of persistent worker processes created on each machine. Users can later submit a variety of jobs to the scheduler of the system and be informed of the status of job execution. To comply with the requirements of the sort benchmark competition, where all setup and shutdown activities directly related to sorting must be included in the reported time, we modified our client program to start the worker process pool before submitting the MinuteSort job and to stop the workers after job completion, and we include the time for starting and stopping the worker pool in the reported elapsed times.

The target scenario of Real-time Job mode is latency-sensitive data processing on medium-sized datasets (no more than 20TB), since all records, including both input and output data, are likely to be cached in main memory. In our experiments, we use Real-time Job mode only for the MinuteSort benchmark.
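The benefit of Real-time Job mode comes from paying startup cost before the timed run, so that submitting a job costs only a queue operation rather than process creation. The sketch below shows the idea in-process with a thread pool; this is a hypothetical analogue, since Fuxi actually manages pools of persistent worker processes across machines through its scheduler:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class WorkerPool {
 public:
  explicit WorkerPool(size_t n) {  // workers start before any job arrives
    for (size_t i = 0; i < n; ++i) workers_.emplace_back([this] { Run(); });
  }
  ~WorkerPool() {  // analogous to stopping the pool after job completion
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }
  void Submit(std::function<void()> job) {  // cheap: no startup on this path
    {
      std::lock_guard<std::mutex> lk(mu_);
      jobs_.push(std::move(job));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
        if (stop_ && jobs_.empty()) return;
        job = std::move(jobs_.front());
        jobs_.pop();
      }
      job();  // execute one submitted job
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> jobs_;
  bool stop_ = false;
  std::vector<std::thread> workers_;
};
```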

Benchmark Results

We present our FuxiSort results for both the 100TB GraySort and the MinuteSort benchmarks below. Apsara has been deployed in our production environment for over six years, with non-interruptive progressive upgrades to avoid impacting jobs or online services during the process. FuxiSort is in fact a lightweight specialization of our general-purpose DAG-based data computation framework, and most of the aforementioned optimizations reside in Apsara and the DAG-based data computation framework. Many of our production jobs using the same framework process much larger datasets than 100TB, and it is fairly common for a job to run for multiple hours (Fuxi and Pangu have built-in mechanisms to recover from a variety of hardware and software failures).

For all benchmark runs described below, input, output, and temporary data are uncompressed. We ensure that data are not buffered in the OS page cache by running a data purge job that randomly reads from the local file system before each benchmark run. Pangu ensures data durability by flushing data to disk when an output file stream is closed.

1. GraySort

Table 1 shows the performance of FuxiSort in the different GraySort benchmarks. Daytona sort includes sampling at the start, while Indy sort does not. For Daytona, the final output data are stored with a replication factor of two, the same as the input data; Pangu ensures that the two copies are placed on different machines (and racks). For Indy, the final output data are stored with only one copy. Overall, FuxiSort performs Indy GraySort on the randomly distributed input dataset in 329 seconds, Daytona GraySort on the randomly distributed input dataset in 377 seconds, and Daytona GraySort on the skewed input dataset in 510 seconds, equivalent to sorting throughputs of 18.2 TB, 15.9 TB and 11.7 TB per minute, respectively.

Table 1. GraySort benchmark results

                       Indy GraySort        Daytona GraySort     Daytona GraySort (skewed)
Input Size (bytes)     100,000,000,000,000  100,000,000,000,000  100,000,000,000,000
Input File Copies      2                    2                    2
Output File Copies     1                    2                    2
Time (sec)             329                  377                  510
Throughput (TB/min)    18.2                 15.9                 11.7
Number of Nodes        3377                 3377                 3377
Checksum               746a5007040ea07ed    746a5007040ea07ed    746a50ec99390d87
Duplicate Keys         0                    0                    308537357

2. MinuteSort

We run FuxiSort in Real-time Job mode for the MinuteSort benchmark. Over 15 trials, the median time for sorting 7.7 TB is 58 seconds on the uniformly distributed dataset and 79 seconds in the Daytona category with skewed data distribution. For the Indy category, FuxiSort sorts the 11 TB randomly distributed input dataset in 59 seconds. The experiment details are shown in Table 2. As in GraySort, all input and output datasets are written in two copies, stored on two different machines.

Table 2. MinuteSort benchmark results

                   Indy MinuteSort   Daytona MinuteSort   Daytona MinuteSort (skewed)
Input Size         11 TB             7.7 TB               7.7 TB
Records            099564050         7696584500           7696584500
Trial 1 (sec)      58                59                   85
Trial 2            59                6                    8
Trial 3            59                6                    78
Trial 4            6                 60                   80
Trial 5            60                6                    8
Trial 6            60                6                    8
Trial 7            60                57                   79
Trial 8            6                 57                   77
Trial 9            59                58                   78
Trial 10           59                57                   77
Trial 11           6                 60                   79
Trial 12           58                58                   79
Trial 13           59                58                   80
Trial 14           59                57                   79
Trial 15           58                58                   80
Median Time (sec)  59                58                   79
Nodes              3377              3377                 3377
Checksum           ccccb43ea7b483c   8f5c5cd7fc7a47b      8f5cf6f4068d6876
Duplicate Keys     0                 0                    6577609

Acknowledgements

We would like to thank the members of the Fuxi team for their contributions to the Fuxi distributed computation framework, on top of which FuxiSort is developed and implemented. We would also like to thank the entire Apsara team at Aliyun for their support and collaboration.

Reference

[1] Zhang, Z., Li, C., Tao, Y., Yang, R., Tang, H., & Xu, J. (2014). Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. Proceedings of the VLDB Endowment, 7(13), 1393-1404.