Native-Task Performance Test Report
Intel Software
Wang, Huafeng, Huafeng.wang@intel.com
Zhong, Xiang, xiang.zhong@intel.com
1. Background

2. Related Work

3. Preliminary Experiments

3.1 Experimental Environment

| Workload | Peculiarity |
| Wordcount | CPU-intensive |
| Sort | IO-intensive |
| DFSIO | IO-intensive |
| Pagerank | Map: CPU-intensive; Reduce: IO-intensive |
| Hivebench-Aggregation | Map: CPU-intensive; Reduce: IO-intensive |
| Hivebench-Join | CPU-intensive |
| Terasort | Map: CPU-intensive; Reduce: IO-intensive |
| K-Means | Iteration stage: CPU-intensive; Classification stage: IO-intensive |
| Nutchindexing | CPU-intensive & IO-intensive |

Cluster settings
| Hadoop version | 1.0.3-Intel (patched with native task) |
| Cluster size | 4 nodes |
| Disks per machine | 7 SATA disks per node |
| Network | GbE network |
| CPU | E5-2680 (32 cores per node) |
| L3 cache size | 20480 KB |
| Memory | 64 GB per node |
| Map slots | 3*32 + 1*26 = 122 |
| Reduce slots | 3*16 + 1*13 = 61 |

Job configuration:

| io.sort.mb | 1 GB |
| Compression | Enabled |
| Compression algorithm | snappy |
| dfs.block.size | 256 MB |
| io.sort.record.percent | 0.2 |
| DFS replicas | 3 |

3.2 Performance Metrics
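Expressed as Hadoop 1.x configuration, the job settings listed in 3.1 would look roughly like the fragment below. This is a sketch: the property names are the standard Hadoop 1.x keys; whether the native-task patch reads exactly the same keys is not stated in this report.

```xml
<!-- Sketch: job configuration from 3.1 as standard Hadoop 1.x properties -->
<property>
  <name>io.sort.mb</name>
  <!-- io.sort.mb is specified in MB; 1024 MB = the 1 GB sort buffer above -->
  <value>1024</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>io.sort.record.percent</name>
  <value>0.2</value>
</property>
<property>
  <name>dfs.block.size</name>
  <!-- 256 MB, in bytes -->
  <value>268435456</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```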
| Workload | Data size before compression | Data size after compression | Native job run time (s) | Original job run time (s) | Job performance improvement | Map stage performance improvement |
| Wordcount | 1TB | 500GB | 1523.43 | 3957.11 | 159.8% | 159.8% |
| Sort | 500GB | 249GB | 2662.43 | 3066.97 | 15.2% | 45.4% |
| DFSIO-Read | 1TB | NA | 1249.68 | 1384.52 | 10.8% | 26% |
| DFSIO-Write | 1TB | NA | 6639.22 | 7165.97 | 7.9% | 7.9% |
| Pagerank | Pages: 5M, Total: 481GB | 217GB | 6105.71 | 11644.63 | 90.7% | 133.8% |
| Hive-Aggregation | Uservisits: 5G, Pages: 6M, Total: 820GB | 345GB | 1113.82 | 1662.74 | 49.3% | 76.2% |
| Hive-Join | Uservisits: 5G, Pages: 6M, Total: 860GB | 382GB | 1107.55 | 1467.8 | 32.5% | 42.8% |
| Terasort | 1TB | NA | 4023.35 | 6036.49 | 51.3% | 19.1% |
| K-Means | Clusters: 5, Samples: 2G, Samples per input file: 4M, Total: 378GB | 35GB | 5734.11 | 7045.62 | 22.9% | 22.9% |
| Nutchindexing | Pages: 4M | 22GB | 4387.6 | 4600.56 | 4.9% | 13.2% |

[Figure: Hi-Bench job execution time analysis — original mapred job runtime vs. Native-Task job runtime for each workload]
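The improvement figures above can be reproduced with a small helper. This is a sketch: the formula improvement = original/native − 1 is inferred from the table, and the Amdahl-style estimate below is our own illustration of the map-fraction reasoning used in the per-workload analyses, not a metric the report defines.

```python
def improvement(native_s: float, original_s: float) -> float:
    """Percent improvement of Native-Task over the original job:
    how much longer the original run is relative to the Native-Task run."""
    return (original_s / native_s - 1.0) * 100.0

def job_improvement_from_map(map_fraction: float, map_improvement: float) -> float:
    """Amdahl-style estimate (our illustration): whole-job improvement when
    only the map stage, taking map_fraction of the job, improves by
    map_improvement percent."""
    map_speedup = 1.0 + map_improvement / 100.0
    job_speedup = 1.0 / ((1.0 - map_fraction) + map_fraction / map_speedup)
    return (job_speedup - 1.0) * 100.0

# Wordcount row: native 1523.43 s vs. original 3957.11 s -> ~159.8%
print(f"{improvement(1523.43, 3957.11):.1f}%")

# When the map stage is the whole job (map_fraction = 1), the job-level and
# map-level improvements coincide, as in the Wordcount row.
print(f"{job_improvement_from_map(1.0, 159.8):.1f}%")
```

This also explains why IO-bound workloads show small job-level gains: with a small map fraction, even a large map-stage improvement moves the whole-job number only slightly.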
3.3 Results

3.3.1 Wordcount

Job details:

| Name | Maps | Reducers |
| wordcount | 1984 | 22 |

[Figure: Job execution time, original Wordcount]
[Figure: Job execution time, Native-Task Wordcount]

Native-Task running state:
Start time: 9:14
Finish time: 9:37

Original running state:

Start time: 10:32
Finish time: 11:28

Analysis:
Wordcount is a CPU-intensive workload and its map stage runs through the whole job, so Native-Task delivers a large performance improvement.

3.3.2 Sort

Job details:

| Name | Maps | Reducers |
| sorter | 124 | 33 |

[Figure: Job execution time, original Sort]
[Figure: Job execution time, Native-Task Sort]

Native-Task running state:
Start time: 0:25
Finish time: 1:10

Analysis:
Sort is IO-intensive at both the map and reduce stages. Its IO time occupies most of the whole job running time; because of that, the performance improvement is limited.

3.3.3 DFSIO-Read

Job details:

| Name | Maps | Reducers |
| Datatools.jar | 256 | 1 |
| Result Analyzer | 5 | 63 |
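The DFSIO workloads stress HDFS throughput rather than CPU. A back-of-envelope bound from the cluster specs in 3.1 (4 nodes, GbE, DFS replication 3) sketches why network bandwidth caps the write benchmark in 3.3.4. This is our simplified model, not a measurement: it assumes the first replica is written locally and the other two cross the network.

```python
# Network-only lower bound for writing 1 TB to HDFS on this cluster (sketch).
NODES = 4
NIC_MB_PER_S = 125.0      # 1 GbE ~ 125 MB/s per node (full duplex assumed)
DATA_MB = 1024 * 1024     # 1 TB written by the benchmark
REPLICAS = 3              # dfs.replication from 3.1

# Bytes that must cross the network: (replicas - 1) extra copies of the data.
network_mb = DATA_MB * (REPLICAS - 1)

# Aggregate NIC capacity of the whole cluster.
cluster_mb_per_s = NODES * NIC_MB_PER_S

lower_bound_s = network_mb / cluster_mb_per_s
print(f"network-only lower bound: {lower_bound_s:.0f} s")  # prints 4194 s
```

The measured DFSIO-Write times (roughly 6600-7200 s in the table in 3.2) sit above this bound, which is consistent with the report's claim that the network, not the collector, is the bottleneck.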
[Figure: Job execution time, original DFSIO-Read]
[Figure: Job execution time, Native-Task DFSIO-Read]

3.3.4 DFSIO-Write

Job details:

| Name | Maps | Reducers |
| Datatools.jar | 512 | 1 |
| Result Analyzer | 5 | 63 |

[Figure: Job execution time, original DFSIO-Write]
[Figure: Job execution time, Native-Task DFSIO-Write]

Native-Task running state:

Aggregation start time: 9:58
Aggregation finish time: 10:19
Join start time: 10:19
Join finish time: 12:1

Original running state:
Aggregation start time: 20:22
Aggregation finish time: 20:46
Join start time: 20:46
Join finish time: 22:45

Analysis:
DFSIO is IO-intensive at both the read and write stages. Its bottleneck is network bandwidth, so the performance improvement is limited.

3.3.5 Pagerank

Job details:

| Name | Maps | Reducers |
| Pagerank_Stage1 | 163 | 33 |
| Pagerank_Stage2 | 33 | 33 |

[Figure: Job execution time, original Pagerank]
[Figure: Job execution time, Native-Task Pagerank]

Native-Task running state:

Start time: 10:33
Finish time: 12:08

Original running state:

Start time: 10:59
Finish time: 14:06

Analysis:
Pagerank is a CPU-intensive workload and its map stage takes about 50% of the whole job running time, so the performance improvement is obvious.

3.3.6 Hive-Aggregation

Job details:

| Name | Maps | Reducers |
| INSERT OVERWRITE TABLE uservisits...sourceip (stage-1) | 1386 | 33 |

[Figure: Job execution time, original Hive-Aggregation]
[Figure: Job execution time, Native-Task Hive-Aggregation]

Original running state:

Start time: 15:52
Finish time: 16:22

Analysis:
Hive-Aggregation is CPU-intensive at the map stage and IO-intensive at the reduce stage. Its map stage occupies most of the running time, and once the reduce stage starts, network bandwidth limits the performance. So the performance improvement at the map stage is obvious.

3.3.7 Hive-Join

Job details:

| Name | Maps | Reducers |
| INSERT OVERWRITE TABLE rankings_uservisi...1 (stage-1) | 1551 | 33 |
| INSERT OVERWRITE TABLE rankings_uservisi...1 (stage-2) | 99 | 33 |
| INSERT OVERWRITE TABLE rankings_uservisi...1 (stage-3) | 99 | 1 |
| INSERT OVERWRITE TABLE rankings_uservisi...1 (stage-4) | 1 | 1 |
[Figure: Job execution time, original Hive-Join]
[Figure: Job execution time, Native-Task Hive-Join]

Original running state:

Start time: 16:32
Finish time: 16:58
Analysis:
Hive-Join is a CPU-intensive workload and its map stage takes a high percentage of the whole running time. So we can see that the map-stage performance is improved by Native-Task.

3.3.8 Terasort

Job details:

| Name | Maps | Reducers |
| Terasort | 458 | 34 |

[Figure: Job execution time, original Terasort]
[Figure: Job execution time, Native-Task Terasort]

Native-Task running state:

Original running state:
Start time: 8:39
Finish time: 10:24

Analysis:
Terasort is CPU-intensive at the map stage and IO-intensive at the reduce stage. Its map stage occupies the majority of the running time, so there is a large performance improvement at the map stage.

3.3.9 K-Means

Job details:

| Name | Maps | Reducers |
| Cluster Iterator running iteration 1 | 14 | 63 |
| Cluster Iterator running iteration 2 | 14 | 63 |
| Cluster Iterator running iteration 3 | 14 | 63 |
| Cluster Iterator running iteration 4 | 14 | 63 |
| Cluster Iterator running iteration 5 | 14 | 63 |
| Cluster Classification Driver running | 14 | |

[Figure: Job execution time, original K-Means]
[Figure: Job execution time, Native-Task K-Means]

Native-Task running state:
Start time: 20:38
Finish time: 22:23

Original running state:

Start time: 9:43
Finish time: 11:41
Analysis:
From the running-state graph we can see that the first 5 iterations are CPU-intensive and the final classification stage is IO-intensive. The two stages split the whole running time almost equally, so the performance improvement at the map stage is evident.

3.3.10 Nutchindexing

Job details:

| Name | Maps | Reducers |
| index-lucene /HiBench/Nutch/Input/indexes | 1171 | 133 |

[Figure: Job execution time, original Nutchindexing]
[Figure: Job execution time, Native-Task Nutchindexing]

Native-Task running state:
Start time: 17:26
Finish time: 18:40

Original running state:

Start time: 18:40
Finish time: 19:56

Analysis:
Nutchindexing is CPU-intensive at the map stage, but the reduce stage takes the majority of the whole running time. So the performance improvement is not as large.

3.4 Other related results

3.4.1 Cache misses hurt sorting performance

Sorting time increases rapidly as the cache miss rate increases. We divide a large buffer into several memory units; BlockSize is the size of the memory unit on which we do the sorting.

[Figure: sort time / log(BlockSize) vs. log2(BlockSize) — the normalized sort time rises sharply once the partition size reaches the CPU L3 cache limit]
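The experiment above can be sketched as follows. This is an illustrative Python version, not the native C++ sorter: it sorts a large buffer block by block, mirroring how the collector divides its sort buffer into memory units; Python timings show the growth with block size but not the sharp cache cliff visible in the native measurement.

```python
import random
import time

def sort_in_blocks(data, block_size):
    """Sort a large buffer as a sequence of fixed-size blocks, the way the
    native collector divides its sort buffer into memory units (sketch)."""
    for start in range(0, len(data), block_size):
        data[start:start + block_size] = sorted(data[start:start + block_size])

random.seed(0)
buf = [random.randrange(1 << 30) for _ in range(1 << 20)]

for log2_block in (10, 14, 18, 20):
    block = 1 << log2_block
    work = list(buf)
    t0 = time.perf_counter()
    sort_in_blocks(work, block)
    elapsed = time.perf_counter() - t0
    # Larger blocks mean more comparisons per element; in the real native
    # sorter, blocks larger than the L3 cache additionally pay for cache
    # misses, which is the effect plotted in the figure above.
    print(f"block=2^{log2_block}: {elapsed:.3f} s")
```

Keeping the per-block working set inside the L3 cache is the design point: each memory unit is sorted while it is still cache-resident.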
3.4.2 Compare with BlockMapOutputBuffer

[Figure: Job execution time breakdown (WordCount): avg job time (s) and avg map task time (s), BlockMapOutputBuffer collector vs. native task collector. No combiner, 16 mappers x 8 reducers, 4 nodes, 3 SATA disks/node, 6MB per task, compression ratio 2:1]

The native task collector is 7% faster than the BlockMapOutputBuffer collector; note that BlockMapOutputBuffer supports ONLY BytesWritable.

3.4.3 Effect of JVM reuse

JVM reuse gives a 4.5% improvement for original Hadoop and an 8% improvement for Native-Task.
[Figure: Effect of task JVM reuse — original Hadoop: 1381 s without JVM reuse vs. 1322 s with reuse (4.5% improvement); Native-Task: 510 s vs. 472 s (8% improvement). 4 nodes, 4 map slots per node.]

3.4.4 Hadoop doesn't scale well when the slot number increases

4 nodes (32 cores per node), 16 map slots max; CPU, memory, and disk are NOT fully used, yet performance drops unexpectedly when the number of slots increases.

[Figure: Performance with different map slot counts (Wordcount benchmark): total job time 741 s vs. 718 s and total map stage time 369 s vs. 339 s, for 4 x 16 mapper slots vs. 4 x 4 mapper slots]
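The two knobs behind 3.4.3 and 3.4.4 are ordinary Hadoop 1.x settings. A sketch of the relevant mapred-site.xml fragment follows; the values are illustrative, matching the 16-slot and JVM-reuse runs above.

```xml
<!-- Sketch: Hadoop 1.x task-slot and JVM-reuse settings (illustrative values) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- map slots per TaskTracker; 3.4.4 varies this per node -->
  <value>16</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- -1 means unlimited reuse of a JVM within one job (3.4.3) -->
  <value>-1</value>
</property>
```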
3.4.5 Native-Task mode: full task optimization

Native-Task full task optimization is a further 2x faster compared with the native collector alone.

[Figure: Native-Task modes, full task vs. collector only (Wordcount benchmark): total job time (s) and total map stage time (s) for original Hadoop, Native-Task (collector), and Native-Task (full task)]

4. Conclusions