Native-Task Performance Test Report


Intel Software
Wang, Huafeng; Zhong, Xiang

1. Background

2. Related Work

3. Preliminary Experiments

3.1 Experimental Environment

Workload                Characteristics
Wordcount               CPU-intensive
Sort                    IO-intensive
DFSIO                   IO-intensive
Pagerank                Map: CPU-intensive; Reduce: IO-intensive
Hivebench-Aggregation   Map: CPU-intensive; Reduce: IO-intensive
Hivebench-Join          CPU-intensive
Terasort                Map: CPU-intensive; Reduce: IO-intensive
K-Means                 Iteration stage: CPU-intensive; Classification stage: IO-intensive
Nutchindexing           CPU-intensive & IO-intensive

Cluster settings

Hadoop version     1..3-Intel (patched with native task)
Cluster size       4
Disk per machine   7 SATA disks per node
Network            GbE network
CPU                E5-268 (32 cores per node)
L3 cache size      248 KB
Memory             64GB per node
Map slots          3*32+1*26=122
Reduce slots       3*16+1*13=61

Job Configuration

io.sort.mb               1GB
compression              Enabled
Compression algo         snappy
dfs.block.size           256MB
io.sort.record.percent   .2
dfs replica

Performance Metrics

Workload          Data size before compression                                After compression  Native run time (s)  Original run time (s)  Job improvement  Map stage improvement
Wordcount         1TB                                                         5GB                1523.43              3957.11                159.8%           159.8%
Sort              5GB                                                         249GB              2662.43              3066.97                15.2%            45.4%
DFSIO-Read        1TB                                                         NA                 1249.68              1384.52                10.8%            26%
DFSIO-Write       1TB                                                         NA                 6639.22              7165.97                7.9%             7.9%
Pagerank          Pages:5M Total:481GB                                        217GB                                                                           133.8%
Hive-Aggregation  Uservisits:5G Pages:6M Total:82GB                           345GB                                                                           76.2%
Hive-Join         Uservisits:5G Pages:6M Total:86GB                           382GB                                                                           42.8%
Terasort          1TB                                                         NA                                                                              19.1%
K-Means           Clusters:5 Samples:2G Inputfilesecondample:4M Total:378GB   35GB                                                                            22.9%
Nutchindexing     Pages:4M 22G                                                NA                                                                              13.2%

Hi-Bench Job Execution Time Analysis

[Figure: original mapred job runtime vs. Native-Task job runtime]
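The report does not state how the improvement percentages are derived. The Wordcount and DFSIO-Write rows are consistent with improvement = original run time / native run time - 1, so the following is a minimal sketch under that assumption. The function name `improvement_pct` is illustrative and does not come from the report.

```python
def improvement_pct(original_s: float, native_s: float) -> float:
    """Improvement of the native run over the original run, in percent.

    Assumed definition (not stated in the report): original / native - 1.
    """
    if native_s <= 0:
        raise ValueError("native run time must be positive")
    return (original_s / native_s - 1.0) * 100.0


# (original run time, native run time) in seconds, from the Performance Metrics table.
runs = {
    "Wordcount": (3957.11, 1523.43),
    "DFSIO-Write": (7165.97, 6639.22),
}

for name, (original_s, native_s) in runs.items():
    print(f"{name}: {improvement_pct(original_s, native_s):.1f}%")
```

Under this definition the two rows come out at about 159.8% and 7.9%, which matches the job improvement column.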

Results

Wordcount

Job Details:
Name        Maps  Reducers
wordcount

[Figure: Job Execution Time (Original Wordcount)]
[Figure: Job Execution Time (Native-Task Wordcount)]

Native-Task running state:
Start time: 9:14
Finish time: 9:37

Original running state:
Start time: 1:32
Finish time: 11:28

Analysis
Wordcount is a CPU-intensive workload, and its map stage runs through the whole job, so Native-Task yields a large performance improvement.

Sort

Job Details:
Name     Maps  Reducers
sorter

[Figure: Job Execution Time (Original Sort)]
[Figure: Job Execution Time (Native-Task Sort)]

Native-Task running state:

Start time ::25
Finish time :1:1

Analysis
Sort is IO-intensive in both the map and reduce stages. Its reduce time takes up most of the job's running time, so the overall performance improvement is limited.

DFSIO-Read

Job Details:
Name                            Maps  Reducers
Datatools.jar Result Analyzer   5     63

[Figure: Job Execution Time (Original DFSIO-Read)]
[Figure: Job Execution Time (Native-Task DFSIO-Read)]

3.3.4 DFSIO-Write

Job Details:
Name                            Maps  Reducers
Datatools.jar Result Analyzer

[Figure: Job Execution Time (Original DFSIO-Write)]


[Figure: Job Execution Time (Native-Task DFSIO-Write)]

Native-Task running state:
Aggregation start time: 9:58
Aggregation finish time: 1:19
Join start time: 1:19
Join finish time: 12:1

Original running state:
Aggregation start time: 2:22
Aggregation finish time: 2:46
Join start time: 2:46
Join finish time: 22:45

Analysis
DFSIO is IO-intensive in both the read and write stages. Its bottleneck is network bandwidth, so the performance improvement is limited.

Pagerank

Job Details:
Name             Maps  Reducers
Pagerank_Stage


Pagerank_Stage

[Figure: Job Execution Time (Original Pagerank)]
[Figure: Job Execution Time (Native-Task Pagerank)]

Native-Task running state:
Start time: 1:33
Finish time: 12:8

Original running state:
Start time: 1:59
Finish time: 14:6

Analysis
Pagerank is a CPU-intensive workload, and its map stage takes about 5% of the whole job running time, so the performance improvement is obvious.

Hive-Aggregation

Job Details:
Name                                                    Maps  Reducers
INSERT OVERWRITE TABLE uservisits...sourceip(stage-1)

[Figure: Job Execution Time (Original Hive-Aggregation)]
[Figure: Job Execution Time (Native-Task Hive-Aggregation)]

Original running state:
Start time: 15:52
Finish time: 16:22

Analysis
Hive-Aggregation is CPU-intensive in the map stage and IO-intensive in the reduce stage. Its map stage takes up most of the running time, and in the reduce stage network bandwidth limits performance, so the improvement shows mainly in the map stage.

Hive-Join

Job Details:
Name                                                     Maps  Reducers
INSERT OVERWRITE TABLE rankings_uservisi...1(stage-1)
INSERT OVERWRITE TABLE rankings_uservisi...1(stage-2)
INSERT OVERWRITE TABLE rankings_uservisi...1(stage-3)    99    1
INSERT OVERWRITE TABLE rankings_uservisi...1(stage-4)    1     1


[Figure: Job Execution Time (Original Hive-Join)]
[Figure: Job Execution Time (Native-Task Hive-Join)]

Original running state:
Start time: 16:32
Finish time: 16:58


Analysis
Hive-Join is a CPU-intensive workload, and its map stage takes a high percentage of the whole running time, so Native-Task improves performance in the map stage.

Terasort

Job Details:
Name      Maps  Reducers
Terasort

[Figure: Job Execution Time (Original Terasort)]
[Figure: Job Execution Time (Native-Task Terasort)]

Native-Task running state:

Original running state:

Start time: 8:39
Finish time: 1:24

Analysis
Terasort is CPU-intensive in the map stage and IO-intensive in the reduce stage. Its map stage takes up the majority of the running time, so there is a large performance improvement in the map stage.

K-Means

Job Details:
Name                                     Maps  Reducers
Cluster Iterator running iteration 1
Cluster Iterator running iteration 2
Cluster Iterator running iteration 3     14    63
Cluster Iterator running iteration 4
Cluster Iterator running iteration 5
Cluster Classification Driver running    14

[Figure: Job Execution Time (Original Kmeans)]
[Figure: Job Execution Time (Native Kmeans)]

Native-Task running state:
Start time: 2:38
Finish time: 22:23

Original running state:
Start time: 9:43
Finish time: 11:41


Analysis
The running state graph shows that the first 5 iterations are CPU-intensive and the final classification stage is IO-intensive. The two stages split the whole running time almost equally, so the performance improvement in the map stage is evident.

Nutchindexing

Job Details:
Name                                        Maps  Reducers
index-lucene /HiBench/Nutch/Input/indexes   8

[Figure: Job Execution Time (Original Nutchindexing)]
[Figure: Job Execution Time (Native Nutchindexing)]

Native-Task running state:
Start time: 17:26
Finish time: 18:4

Original running state:
Start time: 18:4
Finish time: 19:56

Analysis
Nutchindexing is CPU-intensive in the map stage, but the reduce stage takes up the majority of the whole running time, so the performance improvement is modest.
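The per-workload analyses share one line of reasoning: the larger the share of the job spent in the map stage, the more a map-stage speedup shows up in total job time. Amdahl's law gives a rough way to quantify this. The sketch below is an illustration and not part of the report's methodology. It assumes the map stage and the rest of the job do not overlap, which is only approximately true in Hadoop because the shuffle overlaps with the map stage. The fractions and the 2x map speedup are hypothetical inputs, not measurements.

```python
def job_speedup(map_fraction: float, map_speedup: float) -> float:
    """Amdahl's law: overall job speedup when only the map stage gets faster.

    map_fraction: share of the original job time spent in the map stage (0..1).
    map_speedup:  factor by which the map stage alone is accelerated (> 0).
    """
    if not 0.0 <= map_fraction <= 1.0:
        raise ValueError("map_fraction must be between 0 and 1")
    if map_speedup <= 0.0:
        raise ValueError("map_speedup must be positive")
    return 1.0 / ((1.0 - map_fraction) + map_fraction / map_speedup)


# Hypothetical scenarios: the same 2x map-stage speedup applied to jobs
# with different map-stage shares.
for fraction in (0.9, 0.5, 0.1):
    speedup = job_speedup(fraction, 2.0)
    print(f"map share {fraction:.0%}: job speedup {speedup:.2f}x")
```

With these illustrative inputs a map-dominated job gains far more than a reduce-dominated one, which is the pattern the analyses describe for Wordcount versus Sort and Nutchindexing.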

3.4 Other related results

Cache miss hurts Sorting performance

Sorting time increases rapidly as the cache miss rate increases. We divide a large buffer into several memory units; BlockSize is the size of the memory unit being sorted.

[Figure: sort time/log(blocksize) plotted against log2(blocksize), annotated at the point where the partition size reaches the CPU L3 cache limit]
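Dividing sort time by log(blocksize) makes sense because comparison sorting N elements in blocks of B elements costs roughly (N/B) * B * log2(B) = N * log2(B) comparisons. If every comparison cost the same, sort time / log2(B) would stay flat as B grows, so a rise in the normalised value points to memory effects such as cache misses. The following is a minimal sketch of such a measurement, not the report's benchmark: the function names are hypothetical, and an interpreted language carries enough overhead that it is unlikely to show the L3 limit clearly.

```python
import math
import random
import time


def sort_in_blocks(buffer: list, block_size: int) -> float:
    """Sort each consecutive block of block_size elements in place.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for lo in range(0, len(buffer), block_size):
        buffer[lo:lo + block_size] = sorted(buffer[lo:lo + block_size])
    return time.perf_counter() - start


def normalized_sort_times(total_elems: int, block_sizes, seed: int = 0) -> dict:
    """Map each block size to sort time / log2(block size)."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(total_elems)]
    results = {}
    for block_size in block_sizes:
        if block_size < 2:
            raise ValueError("block_size must be at least 2 so log2 is positive")
        buffer = list(data)  # fresh unsorted copy for every block size
        elapsed = sort_in_blocks(buffer, block_size)
        results[block_size] = elapsed / math.log2(block_size)
    return results


if __name__ == "__main__":
    sizes = [1 << k for k in range(8, 17, 2)]  # 256 .. 65536 elements per block
    for block_size, value in normalized_sort_times(1 << 16, sizes).items():
        print(f"log2(blocksize)={int(math.log2(block_size)):2d}  "
              f"time/log2(blocksize)={value:.6f}")
```

To look for the effect the report describes, the same loop would need to run in native code over block sizes that straddle the L3 cache size.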

3.4.2 Compare with BlockMapOutputBuffer

[Figure: Job Execution Time Breakdown (WordCount); y-axis: Time /s; series: BlockMapOutputBuffer collector, native task collector; bars: Avg job time(s), Avg map task time(s), avg reduce task time(s)]

Test setup: no combiner, 16 mappers x 8 reducers, 4 nodes, 3 SATA/node, 6MB per task, compression ratio: 2/1.

The native task collector is 7% faster than the BlockMapOutputBuffer collector. BlockMapOutputBuffer supports ONLY BytesWritable.

Effect of JVM reuse

4.5% improvement for Original Hadoop, 8% improvement for Native-Task.

[Figure: Effect of Task JVM Reuse; 4 nodes, 4 map slots per node]

                  No JVM-Reuse  JVM-Reuse  Improvement
Original Hadoop   1381          1322       4.5%
Native-Task       510           472        8%

Hadoop doesn't scale well when the number of slots increases

With 4 nodes (32 cores per node) and 16 map slots max, CPU, memory, and disk are NOT fully used. Performance drops unexpectedly when the number of slots increases.

[Figure: Performance of different map slots (Wordcount benchmark); y-axis: time/s; bars: total job time, total map stage time(/s); configurations: 4 * 16 mapper slots, 4 * 4 mapper slots]

3.4.5 Native-Task mode: full task optimization

With full task optimization, Native-Task is a further 2x faster than with the native collector alone.

[Figure: Native Task modes: full task vs. collector only (Wordcount benchmark); bars: total job time, total map stage time; series: Hadoop Original, Native-Task (Collector), Native-Task (Full Task)]

4. Conclusions
