MVAPICH2 Project Update and Big Data Acceleration
|
|
- Shavonne Young
- 5 years ago
- Views:
Transcription
1 MVAPICH2 Project Update and Big Data Acceleration Presentation at HPC Advisory Council European Conference 212 by Dhabaleswar K. (DK) Panda The Ohio State University Hari Subramoni The Ohio State University
2 Presentation Outline MVAPICH Project Update 1.8 Release Sample Performance Numbers Upcoming Features and Performance Numbers Big Data Acceleration Motivation Solutions Memcached HBase HDFS HPC Advisory - Hamburg 2
3 MVAPICH2 Software High Performance open-source MPI Library for InfiniBand, 1Gig/iWARP and RDMA over Converged Enhanced Ethernet (RoCE) MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), Available since 22 Used by more than 1,93 organizations (HPC Centers, Industry and Universities) in 68 countries More than 114, downloads from OSU site directly Empowering many TOP5 clusters 5 th ranked 73,278-core cluster (Tsubame 2.) at Tokyo Institute of Technology 7 th ranked 111,14-core cluster (Pleiades) at NASA 25 th ranked 62,976-core cluster (Ranger) at TACC and many others Available with software stacks of many InfiniBand, High-speed Ethernet and server vendors including Linux Distros (RedHat and SuSE) Partner in the upcoming U.S. NSF-TACC Stampede (1-15 PFlop) System HPC Advisory - Hamburg 3
4 Latest MVAPICH2 1.8 Features Released on 4/3/12 Major features (Compared to MVAPICH2 1.7) Support for MPI communication from NVIDIA GPU device memory High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) High performance intra-node point-to-point communication for multi-gpu adapters/node (GPU-GPU, GPU-Host and Host- GPU) Taking advantage of CUDA IPC (available in CUDA 4.1) in intra-node communication for multiple GPU adapters/node Optimized and tuned collectives for GPU device buffers MPI datatype support for point-to-point and collective communication from GPU device buffers New shared memory design for enhanced intra-node small message performance Support suspend/resume functionality with mpirun_rsh Enhanced support for CPU binding with socket and numanode level granularity Checkpoint-Restart and run-through stabilization support in OFA-IB-Nemesis interface Enhancing OFA-IB-Nemesis interface to handle IB errors gracefully Enhanced integration with SLURM and PBS Reduced memory footprint of the library Enhanced one-sided communication design with reduced memory requirement Enhancements and tuned collectives (Bcast and Alltoallv) Support iwarp interoperability between Intel NE2 and Chelsio T4 Adapters RoCE enable environment variable name is changed from MV2_USE_RDMAOE to MV2_USE_RoCE Update to hwloc v1.4 HPC Advisory - Hamburg 4
5 Latency (us) Latency (us) One-way Latency: MPI over IB Small Message Latency Large Message Latency MVAPICH-Qlogic-DDR MVAPICH-Qlogic-QDR MVAPICH-ConnectX-DDR MVAPICH-ConnectX2-PCIe2-QDR MVAPICH-ConnectX3-PCIe3-FDR Message Size (bytes) Message Size (bytes) DDR, QDR GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch HPC Advisory - Hamburg 5
6 Bandwidth (MBytes/sec) Bandwidth (MBytes/sec) Bandwidth: MPI over IB 7 Unidirectional Bandwidth Bidirectional Bandwidth 15 MVAPICH-Qlogic-DDR MVAPICH-Qlogic-QDR MVAPICH-ConnectX-DDR MVAPICH-ConnectX2-PCIe2-QDR MVAPICH-ConnectX3-PCIe3-FDR Message Size (bytes) Message Size (bytes) DDR, QDR GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch HPC Advisory - Hamburg 6
7 Hybrid Transport Design (UD-RC/XRC) Supports all different modes: RC/XRC, UD, UD+RC or UD+XRC Available in MVAPICH (since 1.1) Available in MVAPICH2 (since 1.7) with more advanced, tuned and adaptive designs Provides flexibility to tune performance of critical applications on large-scale clusters with various modes M. Koop, T. Jones and D. K. Panda, MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand, IPDPS 8 HPC Advisory - Hamburg 7
8 Time (us) HPCC Random Ring Latency with MVAPICH2 (UD/RC/Hybrid) % 4% 38% 3% UD Hybrid RC No. of Processes Experimental Platform : LLNL Hyperion Cluster (Intel Harpertown + QDR IB) RC: Default Parameters UD: MV2_HYBRID_ENABLE_THRESHOLD=1, MV2_HYBRID_MAX_RC_CONNECTIONS= Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1 HPC Advisory - Hamburg 8
9 Latency (us) Latency (us) Shared-memory Aware Collectives MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB) Latency (us) MPI_Reduce (496 cores) Original Shared-memory Message Size (bytes) MV2_USE_SHMEM_REDUCE=/1 Pt2Pt Barrier Shared-Mem Barrier Number of Processes 5 MPI_ Allreduce (496 cores) 4 Original 3 2 Shared-memory K 8K Message Size (bytes) MV2_USE_SHMEM_ALLREDUCE=/1 MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB MV2_USE_SHMEM_BARRIER=/1 HPC Advisory - Hamburg 9
10 Bandwidth(MB/s) Bandwidth(MB/s) Latency (us) MVAPICH2 vs. OpenMPI (GPU Device- GPU Device, Inter-node) 4 Latency MVAPICH2 3 OpenMPI 2 1 B E T T E R K 64K 1M Message Size (bytes) 4 Bandwidth MVAPICH2 4 Bi-directional Bandwidth MVAPICH2 3 2 OpenMPI B E T T E R 3 2 OpenMPI B E T T E R K 64K 1M Message Size (bytes) K 64K 1M Message Size (bytes) MVAPICH2 1.8rc1 and OpenMPI (Trunk nightly snapshot on Mar 26, 212) Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C275 GPU and CUDA Toolkit 4.1 HPC Advisory - Hamburg 1
11 LBM Step Time (S) Applications-Level Evaluation (Lattice Boltzmann Method (LBM)) MPI MPI-GPU 13.7% 12.% 11.8% 256*256* *256* *512* *512*512 Domain Size X*Y*Z 9.4% LBM-CUDA (Courtesy: Carlos Rosale, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios Node specification on Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M27, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory CUDA 4.1 and MVAPICH2 1.8rc1 16 nodes with one MPI process per node/gpu HPC Advisory - Hamburg 11
12 MVAPICH2 Plans for Exascale Performance and Memory scalability toward 5K-1M cores Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF ) Enhanced Optimization for GPU Support and Accelerators Extending the GPGPU support Support for Intel MIC UD-Multicast-Aware Collectives Taking advantage of Collective Offload framework Including support for non-blocking collectives (MPI 3.) Extended topology-aware collectives Power-aware collectives Enhanced Multi-rail Designs Automatic optimization of collectives LiMIC2, XRC, Hybrid (UD-RC/XRC) and Multi-rail Support for MPI Tools Interface Checkpoint-Restart and migration support with incremental checkpointing Fault-tolerance with run-through stabilization (being discussed in MPI 3.) QoS-aware I/O and checkpointing Automatic tuning and self-adaptation for different systems and applications HPC Advisory - Hamburg 12
13 Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) designs Based on OpenSHMEM Reference Implementation Provides a design over GASNet Does not take advantage of all OFED features Design scalable and High-Performance OpenSHMEM over OFED Designing a Hybrid MPI +OpenSHMEM Model Current Model Separate Runtimes for OpenSHMEM and MPI Possible deadlock if both runtimes are not progressed Consumes more network resource Our Approach Single Runtime for MPI and OpenSHMEM Hybrid (OpenSHMEM+MPI) Application OpenSHMEM calls OpenSHMEM Interface OSU Design MPI calls MPI Interface InfiniBand Network Hybrid MPI+OpenSHMEM HPC Advisory - Hamburg 13
14 Time (ms) Time (ms) Time (us) Time (ms) Micro-Benchmark Performance (OpenSHMEM) % Atomic Operations OpenSHMEM-GASNet OpenSHMEM-OSU 16% fadd finc cswap swap add inc Broadcast (256 bytes) OpenSHMEM-GASNet OpenSHMEM-OSU No. of Processes 41% Collect (256 bytes) OpenSHMEM-GASNet OpenSHMEM-OSU 36% No. of Processes Reduce (256 bytes) OpenSHMEM-GASNet OpenSHMEM-OSU 35% No. of Processes HPC Advisory - Hamburg 14
15 Resource Utilization (MB) Time (s) Time (s) Performance of Hybrid (OpenSHMEM+MPI) Applications D Heat Transfer Modeling Hybrid-GASNet Hybrid-OSU 8 6 Graph-5 Hybrid-GASNet Hybrid-OSU No. of Processes 34% Improved Performance for Hybrid Applications 34% improvement for 2DHeat Transfer Modeling with 512 processes 45% improvement for Graph5 with 256 processes Our approach with single Runtime consumes 27% lesser network resources Will be available with MVAPICH2 1.9 Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 212 HPC Advisory - Hamburg No. of Processes Network Resource Usage Hybrid-GASNet Hybrid-OSU No. of Processes J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and 45% 27%
16 Many Integrated Core (MIC) Architecture Intel s Many Integrated Core (MIC) architecture geared for HPC Critical part of their solution for exascale computing Knights Ferry (KNF) and Knights Corner (KNC) Many low-power x86 processor cores with hardware threads and wider vector units Applications and libraries can run with minor or no modifications Using MVAPICH2 for communication within a KNF coprocessor out-of-the box (Default) shared memory designs tuned for KNF (Optimized V1) designs enhanced using lower level API (Optimized V2) HPC Advisory - Hamburg 16
17 Normalized Bandwidth Normalized Bandwidth Normalized Latency Normalized Latency Point-to-Point Communication 1. Latency 13% 17% Default.8 Optimized V1.6 Optimized V K Message Size (Bytes) 8K 1. 15% Bandwidth Default.8 Optimized V1.6 Optimized V Default Optimized V1 Optimized V2 Latency 32K 128K 512K 2M Message Size (Bytes) Default Optimized V1 Optimized V2 Bandwidth 7% 8% 4X 9.5X K 4K 16K Message Size (Bytes).2. Knights Ferry - D 1.2GHz card - 32 cores Alpha 9 MIC software stack with an additional pre-alpha patch 32K 128K 512K 2M Message Size (Bytes) HPC Advisory - Hamburg 17
18 Normalized Latency Normalized Latency Normalized Latency Collective Communication Broadcast Default Optimized V1 Optimized V2 17% Broadcast Default Optimized V1 Optimized V2 2% 26% Message Size (Bytes). 2K 8K 32K 128K 512K Message Size (Bytes) Scatter Default Optimized V1 Optimized V2 72% 8% K 128K 256K 512K 1M Message Size (Bytes) S. Potluri, K. Tomko, D. Bureddy and D. K. Panda Intra-MIC MPI Communication using MVPAICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium, April 212 Best Student Paper Award HPC Advisory - Hamburg 18
19 Time (us) Time (us) Time (us) Time (us) Hardware Multicast-aware MPI_Bcast Small Messages (1,24 Cores) Default Multicast K Message Size (Bytes) 16 byte MPI_Bcast Default Multicast Number of Processes HPC Advisory - Hamburg Large Messages (1,24 Cores) Default Multicast 2K 8K 32K 128K 512K Message Size (Bytes) Default 64KB MPI_Bcast Multicast Number of Processes Initial support was available in MVAPICH, enhanced will be available in MVAPICH2 1.9
20 Normalized Performance Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload 17% Modified P3DFFT with Offload-Alltoall does up to 17% better than default version (128 Processes) HPL-Offload HPL-1ring HPL-Host % HPL Problem Size (N) as % of Total Memory Modified HPL with Offload-Bcast does up to 4.5% better than default version (512 Processes) 21.8% K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D. K. Panda, High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 211 K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 211 Modified Pre-Conjugate Gradient Solver with Offload- Allreduce does up to 21.8% better than default version HPC Advisory - Hamburg K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS 12 2
21 2K 4K 8K 16K 32K 64K 128K 256K 512K 128K 2K 4K 8K 16K 32K Latency (msec) 64K 12K 256K 512K 128K Latency (msec) Topology-Aware Collectives Gather-Default Gather-Topo-Aware 54% Scatter-Default Scatter-Topo-Aware 22% Message Size (Bytes) Message Size (Bytes) Default (Binomial) Vs Topology-Aware Algorithms with 296 Processes K. Kandalla, H. Subramoni, A. Vishnu and D. K. Panda, Designing Topology-Aware Collective Communication Algorithms for Large Scale Infiniband Clusters: Case Studies with Scatter and Gather, CAC 1 14 % H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K.Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters, Cluster 11 HPC Advisory - Hamburg Impact of Network-Topology Aware Algorithms on Broadcast Performance 21
22 Power and Energy Savings with Power-Aware Collectives Performance and Power Comparison : MPI_Alltoall with 64 processes on 8 nodes 5% 32% CPMD Application Performance and Power Savings: K. Kandalla, E. Mancini, Sayantan Sur and D. K. Panda, Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters, ICPP 1 HPC Advisory - Hamburg 22
23 Presentation Outline MVAPICH Project Update 1.8 Release Sample Performance Numbers Upcoming Features and Performance Numbers Big Data Acceleration Motivation Solutions Memcached HBase HDFS HPC Advisory - Hamburg 23
24 Can High-Performance Interconnects Benefit Big Data Acceleration? Beginning to draw interest from the enterprise domain Oracle, IBM, Google are working along these directions Performance in the enterprise domain remains a concern Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs? HPC Advisory - Hamburg 24
25 Common Protocols using Open Fabrics Application Application Interface Sockets Verbs Protocol mplementation Kernel Space Ethernet Driver TCP/IP IPoIB TCP/IP Hardware Offload SDP RDMA iwarp User space RDMA User space RDMA User space Network Adapter Ethernet Adapter InfiniBand Adapter Ethernet Adapter InfiniBand Adapter iwarp Adapter RoCE Adapter InfiniBand Adapter Network Switch Ethernet Switch InfiniBand Switch Ethernet Switch InfiniBand Switch Ethernet Switch Ethernet Switch InfiniBand Switch 1/1/4 GigE IPoIB 1/4 GigE- TOE SDP iwarp RoCE IB Native ISC '12 25
26 Can New Data Analysis and Management Systems be designed with High-Performance Networks and Protocols? Current Design Enhanced Designs Our Approach Application Application Application Sockets Accelerated Sockets Verbs / Hardware Offload OSU Design Verbs Interface 1/1 GigE Network 1 GigE or InfiniBand 1 GigE or InfiniBand Sockets not designed for high-performance Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop) Zero-copy not available for non-blocking sockets HPC Advisory - Hamburg 26
27 Memcached Architecture Main memory CPUs Main memory CPUs SSD HDD... SSD HDD!"#$%"$#& High Performance Networks High Performance Networks Main memory SSD CPUs HDD High Performance Networks Main memory SSD CPUs HDD Web Frontend Servers (Memcached Clients) Distributed Caching Layer Allows to aggregate spare memory from multiple nodes General purpose (Memcached Servers) Typically used to cache database queries, results of API calls Scalable model, but typical usage very network intensive Native IB-verbs-level Design and evaluation with Memcached Server: (Database Servers) Memcached Client: (libmemcached) Main memory SSD CPUs HDD ISC '12 27
28 Time (us) Time (us) Memcached Get Latency (Small Message) SDP IPoIB 14 OSU-RC-IB 1GigE 1GigE OSU-UD-IB K 2K K 2K Message Size Message Size Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR) Memcached Get latency 4 bytes RC/UD DDR: 6.82/7.55 us; QDR: 4.28/4.86 us 2K bytes RC/UD DDR: 12.31/12.78 us; QDR: 8.19/8.46 us Almost factor of four improvement over 1GE (TOE) for 2K bytes on the DDR cluster HPC Advisory - Hamburg 28
29 Thousands of Transactions per second (TPS) Thousands of Transactions per second (TPS) Memcached Get TPS (4byte) SDP OSU-RC-IB OSU-UD-IB IPoIB 1GigE K No. of Clients No. of Clients Memcached Get transactions per second for 4 bytes On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients Significant improvement with native IB QDR compared to SDP and IPoIB HPC Advisory - Hamburg 29
30 Time (ms) Time (ms) Application Level Evaluation Real Application Workloads 12 SDP 1 IPoIB OSU-RC-IB 8 OSU-UD-IB OSU-Hybrid-IB No. of Clients Real Application Workload RC 32 ms, UD 318 ms, Hybrid 314 ms for 124 clients 12X times better than IPoIB for 8 clients Hybrid design achieves comparable performance to that of pure RC design HPC Advisory - Hamburg No. of Clients J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP 11 J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, CCGrid 12
31 Overview of HBase Architecture An open source database project based on Hadoop framework for hosting very large tables Major components: HBaseMaster, HRegionServer and HBaseClient HBase and HDFS are deployed in the same cluster to get better data locality HPC Advisory - Hamburg 31
32 Time (us) Time (us) HBase Put/Get Detailed Analysis 3 25 Communication 25 2 Communication Preparation Server Processing Server Serialization Client Processing Client Serialization GigE IPoIB 1GigE OSU-IB HBase Put 1KB HBase Get 1KB HBase 1KB Put Communication Time 8.9 us A factor of 6X improvement over 1GE for communication time HBase 1KB Get Communication Time 8.9 us A factor of 6X improvement over 1GE for communication time 1GigE IPoIB 1GigE OSU-IB W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS 12 HPC Advisory - Hamburg 32
33 Time (us) Ops/sec HBase Single Server-Multi-Client Results IPoIB OSU-IB 4 4 1GigE 3 3 1GigE No. of Clients Latency No. of Clients Throughput HBase Get latency 4 clients: 14.5 us; 16 clients: us HBase Get throughput 4 clients: 37.1 Kops/sec; 16 clients: 53.4 Kops/sec 27% improvement in throughput for 16 clients over 1GE HPC Advisory - Hamburg 33
34 Time (us) Time (us) HBase YCSB Read-Write Workload No. of Clients Read Latency 1 HBase Get latency (Yahoo! Cloud Service Benchmark) 64 clients: 2. ms; 128 Clients: 3.5 ms 42% improvement over IPoIB for 128 clients HBase Get latency 64 clients: 1.9 ms; 128 Clients: 3.5 ms 4% improvement over IPoIB for 128 clients IPoIB 1GigE OSU-IB 1GigE No. of Clients Write Latency J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS 12 HPC Advisory - Hamburg 34
35 Hadoop Architecture Underlying Hadoop Distributed File System (HDFS) Fault-tolerance by replicating data blocks NameNode: stores information on data blocks DataNodes: store blocks and host Map-reduce computation JobTracker: track jobs and detect failure Model scales but high amount of communication during intermediate phases ISC '12 35
36 Time (s) Time (s) RDMA-based Design for Native HDFS Preliminary Results File Write Time (SSD) 1GigE IPoIB OSU-design Communication Time 1GigE IPoIB OSU-Design HDFS (Hadoop-.2.2) File Write Experiment using four data nodes on IB-QDR Cluster HDFS File Write Time 2 GB 7.78s File Size (GB) For 2 GB File Size - 26% improvement over IPoIB FileSize (GB) In Communication, for 3 GB File Size, 36% improvement over IPoIB, 87% improvement over 1GigE HPC Advisory - Hamburg 36
37 Web Pointers MVAPICH Web Page HPC Advisory - Hamburg 37
In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.
In the multi-core age, How do larger, faster and cheaper and more responsive sub-systems affect data management? Panel at ADMS 211 Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory Department
More informationAcceleration for Big Data, Hadoop and Memcached
Acceleration for Big Data, Hadoop and Memcached A Presentation at HPC Advisory Council Workshop, Lugano 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationScaling with PGAS Languages
Scaling with PGAS Languages Panel Presentation at OFA Developers Workshop (2013) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationThe Future of Supercomputer Software Libraries
The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationAccelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached
Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Talk at HPC Advisory Council Lugano Conference (213) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationSoftware Libraries and Middleware for Exascale Systems
Software Libraries and Middleware for Exascale Systems Talk at HPC Advisory Council China Workshop (212) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationLatest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationMPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits
MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based
More informationSR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience
SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory
More informationMemcached Design on High Performance RDMA Capable Interconnects
Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan
More informationHigh Performance MPI Support in MVAPICH2 for InfiniBand Clusters
High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationImproving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters
Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of
More informationExploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR
Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication
More informationUnifying UPC and MPI Runtimes: Experience with MVAPICH
Unifying UPC and MPI Runtimes: Experience with MVAPICH Jithin Jose Miao Luo Sayantan Sur D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,
More informationIntra-MIC MPI Communication using MVAPICH2: Early Experience
Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University
More informationBig Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management
Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management SigHPC BigData BoF (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationMVAPICH2 and MVAPICH2-MIC: Latest Status
MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu
More informationPerformance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms
Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State
More informationDesigning Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi
More informationSupport for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth
Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU
More informationMVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand
MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 213 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationCan Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer
More informationA Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS
A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng
More informationStudy. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09
RDMA over Ethernet - A Preliminary Study Hari Subramoni, Miao Luo, Ping Lai and Dhabaleswar. K. Panda Computer Science & Engineering Department The Ohio State University Introduction Problem Statement
More informationDesigning High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning
5th ANNUAL WORKSHOP 209 Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning Hari Subramoni Dhabaleswar K. (DK) Panda The Ohio State University The Ohio State University E-mail:
More informationAccelerating Big Data Processing with RDMA- Enhanced Apache Hadoop
Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Keynote Talk at BPOE-4, in conjunction with ASPLOS 14 (March 2014) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationOverview of the MVAPICH Project: Latest Status and Future Roadmap
Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationUnified Runtime for PGAS and MPI over OFED
Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction
More informationHigh-Performance Training for Deep Learning and Computer Vision HPC
High-Performance Training for Deep Learning and Computer Vision HPC Panel at CVPR-ECV 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationOverview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap
Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) MeeKng by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h
More informationAccelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects
Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Keynote Talk at ADMS 214 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationEnabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters
Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationDesigning Power-Aware Collective Communication Algorithms for InfiniBand Clusters
Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,
More informationOptimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication
Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based
More informationCan Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?
Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu
More informationHow to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries
How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationDesigning High Performance Communication Middleware with Emerging Multi-core Architectures
Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationOptimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2
Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based
More informationGPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks
GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks Presented By : Esthela Gallardo Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan Perkins, Hari Subramoni,
More informationDesigning High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters
Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar
More informationSupporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters
Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters Presentation at IXPUG Meeting, July 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationPerformance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X
Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Intel Nerve Center (SC 7) Presentation Dhabaleswar K (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu Parallel
More informationHigh-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT
High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky
More informationAccelerating HPL on Heterogeneous GPU Clusters
Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline
More informationEfficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning
Efficient and Scalable Multi-Source Streaming Broadcast on Clusters for Deep Learning Ching-Hsiang Chu 1, Xiaoyi Lu 1, Ammar A. Awan 1, Hari Subramoni 1, Jahanzeb Hashmi 1, Bracy Elton 2 and Dhabaleswar
More informationAdvanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems
Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems A Tutorial at SC 13 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationDesigning MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach
Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Talk at OpenFabrics Workshop (April 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationHigh-Performance MPI Library with SR-IOV and SLURM for Virtualized InfiniBand Clusters
High-Performance MPI Library with SR-IOV and SLURM for Virtualized InfiniBand Clusters Talk at OpenFabrics Workshop (April 2016) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationMVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The
More informationOptimizing & Tuning Techniques for Running MVAPICH2 over IB
Optimizing & Tuning Techniques for Running MVAPICH2 over IB Talk at 2nd Annual IBUG (InfiniBand User's Group) Workshop (214) by Hari Subramoni The Ohio State University E-mail: subramon@cse.ohio-state.edu
More informationCUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters
CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty
More informationCoupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications
Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationMemory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand
Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand Matthew Koop, Wei Huang, Ahbinav Vishnu, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of
More informationCRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart
CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda Department of Computer
More informationHigh Performance File System and I/O Middleware Design for Big Data on HPC Clusters
High Performance File System and I/O Middleware Design for Big Data on HPC Clusters by Nusrat Sharmin Islam Advisor: Dhabaleswar K. (DK) Panda Department of Computer Science and Engineering The Ohio State
More informationProgramming Models for Exascale Systems
Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference and Exascale Workshop (214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationExploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters
Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth
More informationUsing MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming
Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming MVAPICH2 User Group (MUG) MeeFng by Jithin Jose The Ohio State University E- mail: jose@cse.ohio- state.edu h
More informationImpact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase
2 IEEE 8th International Conference on Cloud Computing Technology and Science Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase Xiaoyi Lu, Dipti Shankar, Shashank Gugnani,
More informationMPI: Overview, Performance Optimizations and Tuning
MPI: Overview, Performance Optimizations and Tuning A Tutorial at HPC Advisory Council Conference, Lugano 2012 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationDesigning Shared Address Space MPI libraries in the Many-core Era
Designing Shared Address Space MPI libraries in the Many-core Era Jahanzeb Hashmi hashmi.29@osu.edu (NBCL) The Ohio State University Outline Introduction and Motivation Background Shared-memory Communication
More informationEnhancing Checkpoint Performance with Staging IO & SSD
Enhancing Checkpoint Performance with Staging IO & SSD Xiangyong Ouyang Sonya Marcarelli Dhabaleswar K. Panda Department of Computer Science & Engineering The Ohio State University Outline Motivation and
More informationOptimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications
Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory
More informationHigh Performance Migration Framework for MPI Applications on HPC Cloud
High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,
More informationA Portable InfiniBand Module for MPICH2/Nemesis: Design and Evaluation
A Portable InfiniBand Module for MPICH2/Nemesis: Design and Evaluation Miao Luo, Ping Lai, Sreeram Potluri, Emilio P. Mancini, Hari Subramoni, Krishna Kandalla, Dhabaleswar K. Panda Department of Computer
More informationApplication-Transparent Checkpoint/Restart for MPI Programs over InfiniBand
Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering
More informationInterconnect Your Future
Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators
More informationMulti-Threaded UPC Runtime for GPU to GPU communication over InfiniBand
Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State
More informationSupport Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach
Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Talk at OSC theater (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:
More informationMVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning Talk at Mellanox Theater (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationHigh Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters
High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters Presentation at Mellanox Theatre (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationHigh Performance MPI on IBM 12x InfiniBand Architecture
High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction
More informationReducing Network Contention with Mixed Workloads on Modern Multicore Clusters
Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational
More informationEfficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning
Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, and Dhabaleswar K. Panda Network-Based Computing Laboratory Department
More informationHigh-Performance Broadcast for Streaming and Deep Learning
High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth - SC17 2 Outline Introduction
More information2008 International ANSYS Conference
2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,
More informationAdvanced RDMA-based Admission Control for Modern Data-Centers
Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline
More informationHigh Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters
High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters Presentation at Mellanox Theatre (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationAccelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.
Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Panda Department of Computer Science and Engineering The Ohio
More informationIntroduction to Xeon Phi. Bill Barth January 11, 2013
Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider
More informationDesigning Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges
Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Talk at Brookhaven National Laboratory (Oct 214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:
More informationInterconnect Your Future
Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer
More informationAccelerating and Benchmarking Big Data Processing on Modern Clusters
Accelerating and Benchmarking Big Data Processing on Modern Clusters Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationScalability and Performance of MVAPICH2 on OakForest-PACS
Scalability and Performance of MVAPICH2 on OakForest-PACS Talk at JCAHPC Booth (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationTECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016
TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016 MULTI GPU PROGRAMMING Node 0 Node 1 Node N-1 MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM
More informationDesigning High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing
Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Talk at Storage Developer Conference SNIA 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu
More informationOp#miza#on and Tuning of Hybrid, Mul#rail, 3D Torus Support and QoS in MVAPICH2
Op#miza#on and Tuning of Hybrid, Mul#rail, 3D Torus Support and QoS in MVAPICH2 MVAPICH2 User Group (MUG) Mee#ng by Hari Subramoni The Ohio State University E- mail: subramon@cse.ohio- state.edu h
More informationSolutions for Scalable HPC
Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End
More informationAccelerating and Benchmarking Big Data Processing on Modern Clusters
Accelerating and Benchmarking Big Data Processing on Modern Clusters Open RG Big Data Webinar (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationPerformance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA
Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to
More informationINAM 2 : InfiniBand Network Analysis and Monitoring with MPI
INAM 2 : InfiniBand Network Analysis and Monitoring with MPI H. Subramoni, A. A. Mathews, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and D. K. Panda Department of Computer Science and Engineering The
More informationMemcached Design on High Performance RDMA Capable Interconnects
211 International Conference on Parallel Processing Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman,
More informationCP2K Performance Benchmark and Profiling. April 2011
CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox
More informationHigh-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015)
High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationOPEN MPI WITH RDMA SUPPORT AND CUDA. Rolf vandevaart, NVIDIA
OPEN MPI WITH RDMA SUPPORT AND CUDA Rolf vandevaart, NVIDIA OVERVIEW What is CUDA-aware History of CUDA-aware support in Open MPI GPU Direct RDMA support Tuning parameters Application example Future work
More informationThe Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011
The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities
More informationThe MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15)
The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationMELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구
MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio
More informationUCX: An Open Source Framework for HPC Network APIs and Beyond
UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation
More informationCPMD Performance Benchmark and Profiling. February 2014
CPMD Performance Benchmark and Profiling February 2014 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the supporting
More information