Native-Task Performance Test Report

Intel Software
Wang, Huafeng, Huafeng.wang@intel.com
Zhong, Xiang, xiang.zhong@intel.com

1. Background

2. Related Work

3. Preliminary Experiments

3.1 Experimental Environment

Workbench             | Peculiarity
Wordcount             | CPU-intensive
Sort                  | IO-intensive
DFSIO                 | IO-intensive
Pagerank              | Map: CPU-intensive; Reduce: IO-intensive
Hivebench-Aggregation | Map: CPU-intensive; Reduce: IO-intensive
Hivebench-Join        | CPU-intensive
Terasort              | Map: CPU-intensive; Reduce: IO-intensive
K-Means               | Iteration stage: CPU-intensive; Classification stage: IO-intensive
Nutchindexing         | CPU-intensive & IO-intensive

Cluster settings

Hadoop version   | 1..3-Intel (patched with native task)
Cluster size     | 4
Disk per machine | 7 SATA Disk per node
Network          | GbE network
CPU              | E5-268 (32 core per node)
L3 Cache size    | 248 KB
Memory           | 64GB per node
Map Slots        | 3*32+1*26=122
Reduce Slots     | 3*16+1*13=61

Job Configuration
io.sort.mb             | 1GB
compression            | Enabled
Compression algo       | snappy
dfs.block.size         | 256MB
io.sort.record.percent | 0.2
Dfs replica            | 3
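For reference, the job-level settings above map onto standard Hadoop 1.x configuration properties. The sketch below is illustrative only: it assumes the stock property names (the patched 1.x build used in this report may expose additional native-task switches), and the class and method names are ours, not part of the benchmark harness.

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class NativeTaskJobSettings {
        public static JobConf apply(JobConf conf) {
            // Map-side sort buffer; the table above lists 1GB, and io.sort.mb takes a value in MB.
            conf.setInt("io.sort.mb", 1024);
            conf.setFloat("io.sort.record.percent", 0.2f);

            // Compress map output with Snappy.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.setClass("mapred.map.output.compression.codec",
                    SnappyCodec.class, CompressionCodec.class);

            // HDFS block size (256MB) and replication factor 3 for the benchmark data.
            conf.setLong("dfs.block.size", 256L * 1024 * 1024);
            conf.setInt("dfs.replication", 3);

            // The cluster in this report runs a patched 1.x build with the native task enabled.
            // In later Apache Hadoop releases that ship NativeTask, the native map output
            // collector is selected per job with (property shown as an upstream reference,
            // not the switch used by this patched build):
            // conf.set("mapreduce.job.map.output.collector.class",
            //         "org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator");
            return conf;
        }
    }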

3.2 Performance Metrics

Workload         | Data size before compression | Data size after compression | Native job run time (s) | Original job run time (s) | Job performance improvement | Map stage performance improvement
Wordcount        | 1TB                                 | 5GB   | 1523.43 | 3957.11  | 159.8% | 159.8%
Sort             | 5GB                                 | 249GB | 2662.43 | 366.97   | 15.2%  | 45.4%
DFSIO-Read       | 1TB                                 | NA    | 1249.68 | 1384.52  | 1.8%   | 26%
DFSIO-Write      | 1TB                                 | NA    | 6639.22 | 7165.97  | 7.9%   | 7.9%
Pagerank         | Pages:5M, Total:481GB               | 217GB | 615.71  | 11644.63 | 9.7%   | 133.8%
Hive-Aggregation | Uservisits:5G, Pages:6M, Total:82GB | 345GB | 1113.82 | 1662.74  | 49.3%  | 76.2%
Hive-Join        | Uservisits:5G, Pages:6M, Total:86GB | 382GB | 117.55  | 1467.8   | 32.5%  | 42.8%
Terasort         | 1TB                                 | NA    | 423.35  | 636.49   | 51.3%  | 19.1%
K-Means          | Clusters:5, Samples:2G, Inputfilesecondample:4M, Total:378GB | 35GB | 5734.11 | 745.62 | 22.9% | 22.9%
Nutchindexing    | Pages:4M, 22G                       | NA    | 4387.6  | 46.56    | 4.9%   | 13.2%

[Figure: Hi-Bench Job Execution Time Analysis -- Original mapred job runtime vs. Native-Task job runtime for each workload]
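The improvement columns are consistent with speedup measured relative to the native run time; assuming that definition (the report does not state it explicitly):

    improvement = (original job run time - native job run time) / native job run time * 100%

For example, for Wordcount: (3957.11 - 1523.43) / 1523.43 = 159.8%.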

3.3 Results

3.3.1 Wordcount

Job Details:
Name      | Maps | Reducers
wordcount | 1984 | 22

[Figure: Job Execution Time (Original Wordcount)]
[Figure: Job Execution Time (Native-Task Wordcount)]

Native-Task running state:
Start time: 9:14
Finish time: 9:37

Original running state:
Start time: 1:32
Finish time: 11:28

Analysis
Wordcount is a CPU-intensive workload, and its map stage runs through the whole job, so the native task gives a huge performance improvement.

3.3.2 Sort

Job Details:
Name   | Maps | Reducers
sorter | 124  | 33

[Figure: Job Execution Time (Original Sort)]
[Figure: Job Execution Time (Native-Task Sort)]

Native-Task running state:
Start time: :25
Finish time: 1:1

Analysis
Sort is IO-intensive at both the map and reduce stages. The IO-bound time accounts for most of the whole job running time, so the performance improvement is limited.

3.3.3 DFSIO-Read

Job Details:
Name            | Maps | Reducers
Datatools.jar   | 256  | 1
Result Analyzer | 5    | 63
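DFSIO stresses HDFS by having each map task stream a large file sequentially and then aggregating the per-task throughput. A rough sketch of that per-task access pattern using the standard FileSystem API is shown below; the class name, path, and buffer size are illustrative and are not taken from the benchmark source.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SequentialHdfsRead {
        public static void main(String[] args) throws IOException {
            // Illustrative path only; DFSIO generates and manages its own benchmark files.
            Path file = new Path(args.length > 0 ? args[0] : "/benchmarks/sample-file");
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            byte[] buffer = new byte[1 << 20];      // 1 MB read buffer
            long bytes = 0;
            long start = System.currentTimeMillis();
            FSDataInputStream in = fs.open(file);
            try {
                int n;
                while ((n = in.read(buffer)) > 0) { // sequential streaming read
                    bytes += n;
                }
            } finally {
                in.close();
            }
            double seconds = (System.currentTimeMillis() - start) / 1000.0;
            System.out.printf("read %.1f MB in %.1f s (%.1f MB/s)%n",
                    bytes / 1e6, seconds, bytes / 1e6 / seconds);
        }
    }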

[Figure: Job Execution Time (Original DFSIO-Read)]
[Figure: Job Execution Time (Native-Task DFSIO-Read)]

3.3.4 DFSIO-Write

Job Details:
Name            | Maps | Reducers
Datatools.jar   | 512  | 1
Result Analyzer | 5    | 63

[Figure: Job Execution Time (Original DFSIO-Write)]
[Figure: Job Execution Time (Native-Task DFSIO-Write)]

Native-Task running state:
Aggregation start time: 9:58
Aggregation finish time: 1:19
Join start time: 1:19
Join finish time: 12:1

Original running state:
Aggregation start time: 2:22
Aggregation finish time: 2:46
Join start time: 2:46
Join finish time: 22:45

Analysis
DFSIO is IO-intensive at both the read and write stages. Its bottleneck is network bandwidth, so the performance improvement is limited.

3.3.5 Pagerank

Job Details:
Name            | Maps | Reducers
Pagerank_Stage1 | 163  | 33
Pagerank_Stage2 | 33   | 33

[Figure: Job Execution Time (Original Pagerank)]
[Figure: Job Execution Time (Native-Task Pagerank)]

Native-Task running state:
Start time: 1:33
Finish time: 12:8

Original running state:
Start time: 1:59
Finish time: 14:6

Analysis
Pagerank is a CPU-intensive workload, and its map stage takes about 5% of the whole job running time, so the performance improvement is obvious.

3.3.6 Hive-Aggregation

Job Details:
Name                                                  | Maps | Reducers
INSERT OVERWRITE TABLE uservisits...sourceip(stage-1) | 1386 | 33

[Figure: Job Execution Time (Original Hive-Aggregation)]
[Figure: Job Execution Time (Native-Task Hive-Aggregation)]

Original running state:
Start time: 15:52
Finish time: 16:22

Analysis
Hive-Aggregation is CPU-intensive at the map stage and IO-intensive at the reduce stage. Its map stage occupies most of the running time, and in the reduce stage network bandwidth limits the performance, so the performance improvement at the map stage is obvious.

3.3.7 Hive-Join

Job Details:
Name                                                  | Maps | Reducers
INSERT OVERWRITE TABLE rankings_uservisi...1(stage-1) | 1551 | 33
INSERT OVERWRITE TABLE rankings_uservisi...1(stage-2) | 99   | 33
INSERT OVERWRITE TABLE rankings_uservisi...1(stage-3) | 99   | 1
INSERT OVERWRITE TABLE rankings_uservisi...1(stage-4) | 1    | 1

[Figure: Job Execution Time (Original Hive-Join)]
[Figure: Job Execution Time (Native-Task Hive-Join)]

Original running state:
Start time: 16:32
Finish time: 16:58

Analysis
Hive-Join is a CPU-intensive workload, and its map stage takes a large share of the whole running time, so the map-stage performance is clearly improved by the native task.

3.3.8 Terasort

Job Details:
Name     | Maps | Reducers
Terasort | 458  | 34

[Figure: Job Execution Time (Original Terasort)]
[Figure: Job Execution Time (Native-Task Terasort)]

Native-Task running state:
Original running state:
Start time: 8:39
Finish time: 1:24

Analysis
Terasort is CPU-intensive at the map stage and IO-intensive at the reduce stage. Its map stage occupies the majority of the running time, so there is a large performance improvement at the map stage.

3.3.9 K-Means

Job Details:
Name                                  | Maps | Reducers
Cluster Iterator running iteration 1  | 14   | 63
Cluster Iterator running iteration 2  | 14   | 63
Cluster Iterator running iteration 3  | 14   | 63
Cluster Iterator running iteration 4  | 14   | 63
Cluster Iterator running iteration 5  | 14   | 63
Cluster Classification Driver running | 14   |

[Figure: Job Execution Time (Original Kmeans)]
[Figure: Job Execution Time (Native Kmeans)]

Native-Task running state:
Start time: 2:38
Finish time: 22:23

Original running state:
Start time: 9:43
Finish time: 11:41

Analysis
From the running-state graph, we can see that the first 5 iterations are CPU-intensive and the final classification stage is IO-intensive. The two stages split the whole running time almost equally, so the performance improvement at the map stage is evident.

3.3.10 Nutchindexing

Job Details:
Name                                      | Maps | Reducers
index-lucene /HiBench/Nutch/Input/indexes | 1171 | 133

[Figure: Job Execution Time (Original Nutchindexing)]
[Figure: Job Execution Time (Native Nutchindexing)]

Native-Task running state:
Start time: 17:26
Finish time: 18:4

Original running state:
Start time: 18:4
Finish time: 19:56

Analysis
Nutchindexing is CPU-intensive at the map stage, but the reduce stage takes the majority of the whole running time, so the overall performance improvement is not as large.

3.4 Other related results

3.4.1 Cache miss hurts sorting performance

Sorting time increases rapidly as the cache miss rate increases. We divide a large buffer into several memory units; BlockSize is the size of the memory unit on which we do the sorting.

[Figure: sort time / log2(BlockSize) -- sort time rises sharply once the partition size reaches the CPU L3 cache limit]
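The "memory unit" idea can be illustrated with a simple block-then-merge sort: sort blocks small enough to stay inside the L3 cache, then merge the sorted blocks. The sketch below is only an illustration of the concept in Java (BlockSortSketch and blockSort are our names), not the NativeTask C++ sort implementation.

    import java.util.Arrays;
    import java.util.PriorityQueue;

    public class BlockSortSketch {

        /** Sorts keys by sorting each blockSize-sized block, then k-way merging the blocks. */
        public static long[] blockSort(long[] keys, int blockSize) {
            int numBlocks = (keys.length + blockSize - 1) / blockSize;

            // 1) Sort each block independently; a block that fits in the L3 cache
            //    sorts with few cache misses.
            for (int b = 0; b < numBlocks; b++) {
                int from = b * blockSize;
                int to = Math.min(from + blockSize, keys.length);
                Arrays.sort(keys, from, to);
            }

            // 2) K-way merge of the sorted blocks with a min-heap of {value, block, offset}.
            long[] out = new long[keys.length];
            PriorityQueue<long[]> heap = new PriorityQueue<>((a, c) -> Long.compare(a[0], c[0]));
            for (int b = 0; b < numBlocks; b++) {
                heap.add(new long[] {keys[b * blockSize], b, b * blockSize});
            }
            int i = 0;
            while (!heap.isEmpty()) {
                long[] top = heap.poll();
                out[i++] = top[0];
                int block = (int) top[1];
                int next = (int) top[2] + 1;
                int blockEnd = Math.min((block + 1) * blockSize, keys.length);
                if (next < blockEnd) {
                    heap.add(new long[] {keys[next], block, next});
                }
            }
            return out;
        }
    }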

3.4.2 Compare with BlockMapOutputBuffer

[Figure: Job Execution Time Breakdown (WordCount) -- no combiner, 16 mappers x 8 reducers, 4 nodes, 3 SATA/node, 6MB per task, compression ratio 2/1; BlockMapOutputBuffer collector vs. native task collector; avg job time (s), avg map task time (s), avg task time (s)]

The native task collector is 7% faster than the BlockMapOutputBuffer collector. BlockMapOutputBuffer supports ONLY BytesWritable.

3.4.3 Effect of JVM reuse

JVM reuse yields a 4.5% improvement for original Hadoop and an 8% improvement for Native-Task.
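Task JVM reuse is a stock Hadoop 1.x job setting; a minimal sketch of how it is turned on is shown below. The value -1 (unlimited reuse within a job) is an assumption, since the report does not state the exact value used, and the class and method names are illustrative.

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseSettings {
        public static void enableJvmReuse(JobConf conf) {
            // Hadoop 1.x: the default of 1 launches a fresh JVM for every task; -1 lets a
            // tasktracker reuse one JVM for an unlimited number of tasks of the same job.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        }
    }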

[Figure: Effect of Task JVM Reuse -- No JVM-Reuse vs. JVM-Reuse; Original Hadoop improves 4.5%, Native-Task improves 8%; 4 nodes, 4 map slots per node]

3.4.4 Hadoop does not scale well when the slot number increases

With 4 nodes (32 cores per node) and up to 16 map slots per node, CPU, memory, and disk are NOT fully used, yet performance drops unexpectedly as the number of slots increases.

[Figure: Performance of different map slots (Wordcount benchmark) -- total job time 741 vs. 718 and total map stage time 369 vs. 339 for 4 * 16 mapper slots vs. 4 * 4 mapper slots]
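The per-node slot counts compared here are tasktracker-side settings rather than per-job settings. A minimal sketch of the relevant Hadoop 1.x properties follows, with values matching the 4 * 4 mapper-slot case (not the full cluster setup of 3.1); in practice these go into mapred-site.xml on each node and take effect after a tasktracker restart.

    import org.apache.hadoop.conf.Configuration;

    public class SlotSettings {
        // Tasktracker-side knob for the number of concurrently running map tasks per node.
        // Shown programmatically only to make the configuration explicit.
        public static void fourMapSlotsPerNode(Configuration conf) {
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);   // 4 * 4 mapper-slot case
            // The reduce-side counterpart is mapred.tasktracker.reduce.tasks.maximum.
        }
    }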

3.4.5 Native-Task mode: full task optimization

Native-Task full task optimization is a further 2x faster than the native collector alone.

[Figure: Native Task modes: full task vs. collector only (Wordcount benchmark) -- total job time and total map stage time for Hadoop Original, Native-Task (Collector), and Native-Task (Full Task)]

4. Conclusions