MVAPICH2 Project Update and Big Data Acceleration

1 MVAPICH2 Project Update and Big Data Acceleration
Presentation at HPC Advisory Council European Conference 2012
by
Dhabaleswar K. (DK) Panda, The Ohio State University
Hari Subramoni, The Ohio State University

2 Presentation Outline
MVAPICH Project Update
  1.8 Release
  Sample Performance Numbers
  Upcoming Features and Performance Numbers
Big Data Acceleration
  Motivation
  Solutions
    Memcached
    HBase
    HDFS
HPC Advisory - Hamburg

3 MVAPICH2 Software
High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), available since 2002
  Used by more than 1,930 organizations (HPC centers, industry and universities) in 68 countries
  More than 114,000 downloads directly from the OSU site
  Empowering many TOP500 clusters
    5th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    7th-ranked 111,104-core cluster (Pleiades) at NASA
    25th-ranked 62,976-core cluster (Ranger) at TACC
    and many others
  Available with the software stacks of many InfiniBand, high-speed Ethernet and server vendors, including Linux distros (RedHat and SuSE)
  Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) system

4 Latest MVAPICH2 1.8 Features
Released on 04/30/12
Major features (compared to MVAPICH2 1.7)
  Support for MPI communication from NVIDIA GPU device memory
    High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
    High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
    Taking advantage of CUDA IPC (available in CUDA 4.1) in intra-node communication for multiple GPU adapters/node
    Optimized and tuned collectives for GPU device buffers
    MPI datatype support for point-to-point and collective communication from GPU device buffers
  New shared memory design for enhanced intra-node small message performance
  Support for suspend/resume functionality with mpirun_rsh
  Enhanced support for CPU binding with socket and numanode level granularity
  Checkpoint-Restart and run-through stabilization support in the OFA-IB-Nemesis interface
  Enhanced OFA-IB-Nemesis interface to handle IB errors gracefully
  Enhanced integration with SLURM and PBS
  Reduced memory footprint of the library
  Enhanced one-sided communication design with reduced memory requirement
  Enhanced and tuned collectives (Bcast and Alltoallv)
  Support for iWARP interoperability between Intel NE020 and Chelsio T4 adapters
  The RoCE enable environment variable is renamed from MV2_USE_RDMAOE to MV2_USE_RoCE
  Update to hwloc v1.4

5 One-way Latency: MPI over IB
[Figure: two panels, Small Message Latency and Large Message Latency. Axes: Latency (us) vs. Message Size (bytes). Series: MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR.]
DDR, QDR: quad-core (Westmere) Intel, PCI Gen2, with IB switch
FDR: octa-core (Sandybridge) Intel, PCI Gen3, with IB switch

6 Bandwidth: MPI over IB
[Figure: two panels, Unidirectional Bandwidth and Bidirectional Bandwidth. Axes: Bandwidth (MBytes/sec) vs. Message Size (bytes). Series: MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR.]
DDR, QDR: quad-core (Westmere) Intel, PCI Gen2, with IB switch
FDR: octa-core (Sandybridge) Intel, PCI Gen3, with IB switch

7 Hybrid Transport Design (UD-RC/XRC)
Supports all the different modes: RC/XRC, UD, UD+RC or UD+XRC
Available in MVAPICH (since 1.1)
Available in MVAPICH2 (since 1.7) with more advanced, tuned and adaptive designs
Provides flexibility to tune the performance of critical applications on large-scale clusters with various modes
M. Koop, T. Jones and D. K. Panda, MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand, IPDPS '08

8 HPCC Random Ring Latency with MVAPICH2 (UD/RC/Hybrid)
[Figure: Time (us) vs. No. of Processes (128 to 1024). Series: UD, Hybrid, RC.]
Experimental platform: LLNL Hyperion Cluster (Intel Harpertown + QDR IB)
RC: default parameters
UD: MV2_HYBRID_ENABLE_THRESHOLD=1, MV2_HYBRID_MAX_RC_CONNECTIONS=
Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1

9 Shared-memory Aware Collectives
MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB)
  [Figure: MPI_Reduce (4096 cores), Latency (us) vs. Message Size (bytes). Series: Original, Shared-memory. Controlled by MV2_USE_SHMEM_REDUCE=0/1]
  [Figure: MPI_Allreduce (4096 cores), Latency (us) vs. Message Size (bytes). Series: Original, Shared-memory. Controlled by MV2_USE_SHMEM_ALLREDUCE=0/1]
MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB
  [Figure: Latency (us) vs. Number of Processes. Series: Pt2Pt Barrier, Shared-Mem Barrier. Controlled by MV2_USE_SHMEM_BARRIER=0/1]
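The slide contrasts the original and shared-memory variants of Reduce without showing how they differ. A minimal sketch of one common way to make a reduction shared-memory aware is a two-level scheme: ranks on the same node combine their values locally first, and only one leader per node then communicates over the network. This is an illustration under that assumption, not MVAPICH2's actual implementation; the function names and the message-counting model are hypothetical.

```python
from collections import defaultdict


def flat_reduce(values, ranks_per_node):
    """Baseline model: every rank outside the root's node sends to rank 0."""
    network_messages = sum(
        1 for rank in range(len(values)) if rank // ranks_per_node != 0
    )
    return sum(values), network_messages


def two_level_reduce(values, ranks_per_node):
    """Assumed two-level model: combine per node first, then across nodes."""
    per_node = defaultdict(list)
    for rank, value in enumerate(values):
        per_node[rank // ranks_per_node].append(value)
    # Stage 1: ranks on one node combine values locally (no network traffic).
    node_partials = [sum(per_node[node]) for node in sorted(per_node)]
    # Stage 2: one leader per remote node sends its partial sum to the root.
    network_messages = len(node_partials) - 1
    return sum(node_partials), network_messages


if __name__ == "__main__":
    data = list(range(16))  # 16 ranks, 4 ranks per node
    print(flat_reduce(data, 4))
    print(two_level_reduce(data, 4))
```

Both functions return the same sum; the second return value shows how the number of network messages drops from one per remote rank to one per remote node in this simplified model.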

10 MVAPICH2 vs. OpenMPI (GPU Device to GPU Device, Inter-node)
[Figure: three panels, Latency (us), Bandwidth (MB/s) and Bi-directional Bandwidth (MB/s), each vs. Message Size (bytes). Series: MVAPICH2, OpenMPI.]
MVAPICH2 1.8rc1 and OpenMPI (trunk nightly snapshot on Mar 26, 2012)
Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

11 Application-Level Evaluation (Lattice Boltzmann Method (LBM))
[Figure: LBM Step Time (S) vs. Domain Size X*Y*Z. Series: MPI, MPI-GPU. Improvement labels: 13.7%, 12.0%, 11.8%, 9.4%.]
LBM-CUDA (Courtesy: Carlos Rosale, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios
Node specification on the Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory
CUDA 4.1 and MVAPICH2 1.8rc1
16 nodes with one MPI process per node/GPU

12 MVAPICH2 Plans for Exascale
Performance and memory scalability toward 500K-1M cores
Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF)
Enhanced optimization for GPU support and accelerators
  Extending the GPGPU support
  Support for Intel MIC
UD-Multicast-aware collectives
Taking advantage of the Collective Offload framework
  Including support for non-blocking collectives (MPI 3.0)
Extended topology-aware collectives
Power-aware collectives
Enhanced multi-rail designs
Automatic optimization of collectives
  LiMIC2, XRC, Hybrid (UD-RC/XRC) and Multi-rail
Support for the MPI Tools Interface
Checkpoint-Restart and migration support with incremental checkpointing
Fault-tolerance with run-through stabilization (being discussed in MPI 3.0)
QoS-aware I/O and checkpointing
Automatic tuning and self-adaptation for different systems and applications

13 Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) designs
Based on the OpenSHMEM Reference Implementation
  Provides a design over GASNet
  Does not take advantage of all OFED features
Design a scalable and high-performance OpenSHMEM over OFED
Designing a hybrid MPI + OpenSHMEM model
  Current model: separate runtimes for OpenSHMEM and MPI
    Possible deadlock if both runtimes are not progressed
    Consumes more network resources
  Our approach: single runtime for MPI and OpenSHMEM
[Diagram: Hybrid (OpenSHMEM+MPI) Application; OpenSHMEM calls go to the OpenSHMEM Interface and MPI calls go to the MPI Interface; both interfaces sit on the OSU Design, which runs over the InfiniBand Network.]

14 Micro-Benchmark Performance (OpenSHMEM)
[Figure: four panels comparing OpenSHMEM-GASNet and OpenSHMEM-OSU. Atomic Operations (fadd, finc, cswap, swap, add, inc); Broadcast (256 bytes), Collect (256 bytes) and Reduce (256 bytes) vs. No. of Processes. Time axes in us or ms. Improvement labels on the plots: 16%, 41%, 36%, 35%.]

15 Performance of Hybrid (OpenSHMEM+MPI) Applications
[Figure: three panels comparing Hybrid-GASNet and Hybrid-OSU vs. No. of Processes: 2D Heat Transfer Modeling, Time (s); Graph500, Time (s); Network Resource Usage, Resource Utilization (MB).]
Improved performance for hybrid applications
  34% improvement for 2D Heat Transfer Modeling with 512 processes
  45% improvement for Graph500 with 256 processes
Our approach with a single runtime consumes 27% fewer network resources
Will be available with MVAPICH2 1.9
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

16 Many Integrated Core (MIC) Architecture
Intel's Many Integrated Core (MIC) architecture is geared for HPC
Critical part of their solution for exascale computing
Knights Ferry (KNF) and Knights Corner (KNC)
Many low-power x86 processor cores with hardware threads and wider vector units
Applications and libraries can run with minor or no modifications
Using MVAPICH2 for communication within a KNF coprocessor
  out-of-the-box (Default)
  shared memory designs tuned for KNF (Optimized V1)
  designs enhanced using lower-level API (Optimized V2)

17 Point-to-Point Communication
[Figure: four panels comparing Default, Optimized V1 and Optimized V2. Normalized Latency vs. Message Size (Bytes) for small and large messages; Normalized Bandwidth vs. Message Size (Bytes) for small and large messages.]
Knights Ferry - D 1.2GHz card - 32 cores
Alpha 9 MIC software stack with an additional pre-alpha patch

18 Collective Communication
[Figure: three panels comparing Default, Optimized V1 and Optimized V2, Normalized Latency vs. Message Size (Bytes): Broadcast (small messages), Broadcast (2K to 512K bytes) and Scatter (up to 1M bytes).]
S. Potluri, K. Tomko, D. Bureddy and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium, April 2012 (Best Student Paper Award)

19 Hardware Multicast-aware MPI_Bcast
[Figure: four panels comparing Default and Multicast. Small Messages (1,024 cores) and Large Messages (1,024 cores): Time (us) vs. Message Size (Bytes). 16-byte MPI_Bcast and 64KB MPI_Bcast: Time (us) vs. Number of Processes.]
Initial support was available in MVAPICH; enhanced support will be available in MVAPICH2 1.9

20 Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload
Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
  K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D. K. Panda, High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
  [Figure: Normalized Performance vs. HPL Problem Size (N) as % of Total Memory. Series: HPL-Offload, HPL-1ring, HPL-Host.]
  K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version
  K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

21 Topology-Aware Collectives
Default (Binomial) vs. Topology-Aware Algorithms with 296 Processes
  [Figure: Gather-Default vs. Gather-Topo-Aware, Latency (msec) vs. Message Size (Bytes), 2K to 512K; plot label: 54%]
  [Figure: Scatter-Default vs. Scatter-Topo-Aware, Latency (msec) vs. Message Size (Bytes), 2K to 512K; plot label: 22%]
  K. Kandalla, H. Subramoni, A. Vishnu and D. K. Panda, Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather, CAC '10
Impact of Network-Topology Aware Algorithms on Broadcast Performance
  [Figure: plot label: 14%]
  H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11
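The slide names a binomial algorithm as the default that the topology-aware versions are compared against. As a rough illustration of that baseline, the sketch below generates the communication schedule of a textbook binomial-tree scatter rooted at rank 0. It is a generic rendering of the technique, not code from MVAPICH2, and the function name is hypothetical. Partners are chosen purely by rank distance, with no regard for where ranks sit in the network, which is the property a topology-aware algorithm would change.

```python
def binomial_scatter_schedule(num_procs):
    """Return per-round lists of (src, dst, blocks) for a scatter from rank 0.

    Block i is the piece of data destined for rank i. In each round, every
    rank that currently holds data for a range of ranks hands the upper half
    of that range to the rank at distance `dist`.
    """
    # Start with the largest power of two strictly below num_procs.
    dist = 1
    while dist * 2 < num_procs:
        dist *= 2

    rounds = []
    while dist >= 1:
        sends = []
        for src in range(0, num_procs, 2 * dist):
            dst = src + dist
            if dst < num_procs:
                blocks = list(range(dst, min(src + 2 * dist, num_procs)))
                sends.append((src, dst, blocks))
        if sends:
            rounds.append(sends)
        dist //= 2
    return rounds


if __name__ == "__main__":
    for number, sends in enumerate(binomial_scatter_schedule(8), start=1):
        print(f"round {number}: {sends}")
```

The schedule finishes in about log2(P) rounds, and the largest transfers happen in the first round between the most distant ranks.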

22 Power and Energy Savings with Power-Aware Collectives
[Figure: Performance and Power Comparison: MPI_Alltoall with 64 processes on 8 nodes]
[Figure: CPMD Application Performance and Power Savings]
K. Kandalla, E. Mancini, Sayantan Sur and D. K. Panda, Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters, ICPP '10

23 Presentation Outline
MVAPICH Project Update
  1.8 Release
  Sample Performance Numbers
  Upcoming Features and Performance Numbers
Big Data Acceleration
  Motivation
  Solutions
    Memcached
    HBase
    HDFS

24 Can High-Performance Interconnects Benefit Big Data Acceleration?
Beginning to draw interest from the enterprise domain
  Oracle, IBM and Google are working along these directions
Performance in the enterprise domain remains a concern
  Where do the bottlenecks lie?
  Can these bottlenecks be alleviated with new designs?

25 Common Protocols using Open Fabrics
Path            Interface  Protocol implementation                  Adapter     Switch
1/10/40 GigE    Sockets    TCP/IP (kernel space, Ethernet driver)   Ethernet    Ethernet
IPoIB           Sockets    TCP/IP (kernel space, IPoIB)             InfiniBand  InfiniBand
10/40 GigE-TOE  Sockets    TCP/IP (hardware offload)                Ethernet    Ethernet
SDP             Sockets    SDP (RDMA)                               InfiniBand  InfiniBand
iWARP           Verbs      iWARP (user space)                       iWARP       Ethernet
RoCE            Verbs      RDMA (user space)                        RoCE        Ethernet
IB Native       Verbs      RDMA (user space)                        InfiniBand  InfiniBand
ISC '12

26 Can New Data Analysis and Management Systems be designed with High-Performance Networks and Protocols?
Current Design: Application, Sockets, 1/10 GigE Network
Enhanced Designs: Application, Accelerated Sockets, Verbs / Hardware Offload, 10 GigE or InfiniBand
Our Approach: Application, OSU Design, Verbs Interface, 10 GigE or InfiniBand
Sockets were not designed for high performance
  Stream semantics often mismatch the needs of upper layers (Memcached, HBase, Hadoop)
  Zero-copy is not available for non-blocking sockets

27 Memcached Architecture
[Diagram: three tiers of nodes (main memory, CPUs, SSD, HDD) connected by high-performance networks: Web Frontend Servers (Memcached Clients), Memcached Servers, and Database Servers; the web frontend tier faces the Internet.]
Distributed caching layer
  Allows spare memory from multiple nodes to be aggregated
  General purpose
Typically used to cache database queries and results of API calls
Scalable model, but typical usage is very network intensive
Native IB-verbs-level design and evaluation with
  Memcached Server
  Memcached Client (libmemcached)
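The slide says the caching layer is typically used to cache database queries and API results. A minimal sketch of that usage pattern (often called cache-aside) is shown below. A plain dictionary stands in for the Memcached server tier and a callable stands in for the database, so no real Memcached client API is used; the class and parameter names are hypothetical.

```python
class CacheAside:
    """Look in the cache first; fall back to the backend only on a miss."""

    def __init__(self, query_backend):
        self._cache = {}                  # stand-in for the Memcached servers
        self._query_backend = query_backend  # stand-in for a database query
        self.backend_calls = 0

    def get(self, key):
        if key in self._cache:
            # Cache hit: answered from memory, no database round trip.
            return self._cache[key]
        # Cache miss: query the backend and remember the result.
        self.backend_calls += 1
        value = self._query_backend(key)
        self._cache[key] = value
        return value


if __name__ == "__main__":
    store = CacheAside(lambda key: f"row-for-{key}")
    store.get("user:42")
    store.get("user:42")
    print(store.backend_calls)
```

In a real deployment each hit is still a network round trip from the web frontend to a Memcached server, which is why Get latency over the interconnect matters for this workload.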

28 Memcached Get Latency (Small Message)
[Figure: two panels, Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR). Axes: Time (us) vs. Message Size. Series: SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB, OSU-UD-IB.]
Memcached Get latency
  4 bytes, RC/UD: DDR 6.82/7.55 us; QDR 4.28/4.86 us
  2K bytes, RC/UD: DDR 12.31/12.78 us; QDR 8.19/8.46 us
Almost a factor of four improvement over 10GE (TOE) for 2K bytes on the DDR cluster

29 Memcached Get TPS (4 bytes)
[Figure: two panels, Thousands of Transactions per Second (TPS) vs. No. of Clients. Series: SDP, OSU-RC-IB, OSU-UD-IB, IPoIB, 1GigE.]
Memcached Get transactions per second for 4 bytes
  On IB QDR: 1.4M/s (RC), 1.3M/s (UD) for 8 clients
Significant improvement with native IB QDR compared to SDP and IPoIB

30 Application Level Evaluation: Real Application Workloads
[Figure: two panels, Time (ms) vs. No. of Clients. Series: SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, OSU-Hybrid-IB.]
Real application workload
  RC 32 ms, UD 318 ms, Hybrid 314 ms for 1024 clients
12X better than IPoIB for 8 clients
Hybrid design achieves performance comparable to that of the pure RC design
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, CCGrid '12

31 Overview of HBase Architecture
An open-source database project based on the Hadoop framework for hosting very large tables
Major components: HBaseMaster, HRegionServer and HBaseClient
HBase and HDFS are deployed in the same cluster to get better data locality

32 HBase Put/Get Detailed Analysis
[Figure: two stacked-bar panels, HBase Put 1KB and HBase Get 1KB, Time (us) per network (1GigE, IPoIB, 10GigE, OSU-IB). Components: Communication, Communication Preparation, Server Processing, Server Serialization, Client Processing, Client Serialization.]
HBase 1KB Put
  Communication time: 8.9 us
  A factor of 6X improvement over 1GE for communication time
HBase 1KB Get
  Communication time: 8.9 us
  A factor of 6X improvement over 1GE for communication time
W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS '12

33 HBase Single Server-Multi-Client Results
[Figure: two panels vs. No. of Clients: Latency, Time (us); Throughput, Ops/sec. Series: IPoIB, OSU-IB, 1GigE, 10GigE.]
HBase Get latency
  4 clients: 14.5 us; 16 clients: [...] us
HBase Get throughput
  4 clients: 37.1 Kops/sec; 16 clients: 53.4 Kops/sec
27% improvement in throughput for 16 clients over 1GE

34 HBase YCSB Read-Write Workload
[Figure: two panels, Read Latency and Write Latency, Time (us) vs. No. of Clients. Series: IPoIB, 1GigE, 10GigE, OSU-IB.]
HBase Get latency (Yahoo! Cloud Serving Benchmark)
  64 clients: 2.0 ms; 128 clients: 3.5 ms
  42% improvement over IPoIB for 128 clients
HBase Get latency
  64 clients: 1.9 ms; 128 clients: 3.5 ms
  4% improvement over IPoIB for 128 clients
J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12

35 Hadoop Architecture
Underlying Hadoop Distributed File System (HDFS)
Fault-tolerance by replicating data blocks
NameNode: stores information on data blocks
DataNodes: store blocks and host Map-reduce computation
JobTracker: tracks jobs and detects failures
The model scales, but there is a high amount of communication during intermediate phases
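To make the replication bullet concrete, here is a minimal sketch of how a block map held by a NameNode-like component can keep data readable after DataNode failures. The round-robin placement is an assumption chosen for brevity and is not HDFS's actual placement policy; the function names and the replication factor of 3 are likewise illustrative.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Map each block id to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    block_map = {}
    for block in range(num_blocks):
        block_map[block] = [
            datanodes[(block + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return block_map


def readable_blocks(block_map, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return [
        block
        for block, nodes in block_map.items()
        if any(node not in failed_nodes for node in nodes)
    ]


if __name__ == "__main__":
    nodes = ["dn0", "dn1", "dn2", "dn3"]
    block_map = place_blocks(8, nodes)
    print(len(readable_blocks(block_map, {"dn0", "dn1"})))
```

With three replicas on distinct nodes, any two simultaneous node failures leave every block readable in this model. The cost is that each written block crosses the network more than once, which is part of why write-path communication time is worth optimizing.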

36 RDMA-based Design for Native HDFS: Preliminary Results
[Figure: two panels, Time (s) vs. File Size (GB): File Write Time (SSD) and Communication Time. Series: 1GigE, IPoIB, OSU-Design.]
HDFS (Hadoop-0.20.2) file write experiment using four data nodes on an IB-QDR cluster
HDFS file write time
  2 GB: 7.78 s
  For 2 GB file size: 26% improvement over IPoIB
In communication, for 3 GB file size: 36% improvement over IPoIB, 87% improvement over 1GigE

37 Web Pointers
MVAPICH Web Page
