Optimizing communications on clusters of multicores
1 Optimizing communications on clusters of multicores Alexandre DENIS with contributions from: Elisabeth Brunet, Brice Goglin, Guillaume Mercier, François Trahay Runtime Project-team INRIA Bordeaux - Sud-Ouest INRIA-Illinois workshop Paris, June 2009
2 The RUNTIME Team Mid-size research group headed by Raymond Namyst: 6 permanent researchers, 2 engineers, 6 PhD students. Part of the INRIA Bordeaux - Sud-Ouest Research Center and of the Computer Science Lab of University of Bordeaux 1 (LaBRI).
3 Current trends The world is going multicore. Currently: cores per node; tomorrow: hundreds of cores per node; sometimes: multiple NICs per node. New programming models: pure-MPI approach (one MPI process per core) and hybrid approach (MPI + threads/OpenMP, one MPI process per node/socket).
4 A thread-aware approach for communications Impact of multi-core on communications: the communication library must support multi-threading, since multiple flows from multiple threads share a NIC. Decoupling MPI_Send/Recv from NIC send/recv opens new opportunities for optimization, e.g. aggregating packets from different threads. The best optimization strategy depends on the capabilities and performance of the underlying network, the host architecture, and application requirements. Opportunistically use idle cores; decide optimization at run time.
5 Implementing data transfers efficiently [Plot: transfer time vs. message size for three transfer methods: PIO + copy, DMA + copy, DMA + rendezvous]
6 Implementing data transfers efficiently [Same plot as slide 5]
7 Implementing data transfers efficiently Sending the 3 chunks sequentially costs t1 + t2 + t3 (if there are no pipelining effects) [Plot: chunks 1, 2 and 3 shown as three separate transfers of times t1, t2, t3 on the transfer-time curve]
8 Implementing data transfers efficiently This second strategy is better if t1 + t3 > t4 + k * (sizeof(chunk 1) + sizeof(chunk 3)) [Plot: chunks 1 and 3 aggregated into a single transfer of time t4, chunk 2 sent separately]
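To make the trade-off concrete, here is a minimal sketch in C of the decision the scheduler has to make: aggregate chunks 1 and 3 only when the packing copy costs less than the transfer it saves. All numbers are illustrative assumptions, not measurements from the talk.

/* Minimal sketch of the aggregation trade-off; numbers are illustrative. */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    double t1 = 4.0, t3 = 5.0;   /* microseconds to send chunks 1 and 3 separately */
    double t4 = 6.0;             /* microseconds to send the aggregated chunk 1+3 */
    double k  = 0.0005;          /* microseconds per byte for the packing copy */
    size_t size1 = 2048, size3 = 4096;

    double sequential = t1 + t3;
    double aggregated = t4 + k * (double)(size1 + size3);

    /* Aggregate only when t1 + t3 > t4 + k * (sizeof(chunk 1) + sizeof(chunk 3)) */
    printf("sequential = %.2f us, aggregated = %.2f us -> %s\n",
           sequential, aggregated,
           aggregated < sequential ? "aggregate chunks 1 and 3" : "send chunks separately");
    return 0;
}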
9 NewMadeleine architecture: the interface layer The application submits requests as non-blocking operations; meta-data (reordering constraints: tag, sequence number, etc.) are associated to the data. User communication interfaces: a send-receive interface, an incremental message building (pack/send) interface, and a simple MPI implementation, Mad-MPI. [Diagram: the interfaces feed requests to the NewMadeleine optimizing scheduler and its policy, which drives the NICs]
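To illustrate what an incremental message-building interface looks like from the application side, here is a hypothetical sketch. The nm_example_* names are invented for illustration and are not the actual NewMadeleine API; the stubs only print what a real implementation would hand to the optimizing scheduler.

/* Hypothetical sketch of an incremental message-building interface. */
#include <stdio.h>
#include <stddef.h>

typedef struct { int dest; int tag; int nchunks; } nm_example_cnx_t;

static int nm_example_begin_pack(nm_example_cnx_t *cnx, int dest, int tag)
{
    cnx->dest = dest; cnx->tag = tag; cnx->nchunks = 0;
    return 0;
}

static int nm_example_pack(nm_example_cnx_t *cnx, const void *data, size_t len)
{
    (void)data;
    printf("chunk %d: %zu bytes queued for dest %d, tag %d\n",
           ++cnx->nchunks, len, cnx->dest, cnx->tag);
    return 0;   /* a real library would hand the chunk to the scheduler here */
}

static int nm_example_end_pack(nm_example_cnx_t *cnx)
{
    printf("message of %d chunks submitted (non-blocking)\n", cnx->nchunks);
    return 0;
}

int main(void)
{
    char header[64] = {0}, payload[4096] = {0};
    nm_example_cnx_t cnx;

    /* Build one logical message incrementally from two buffers; the
     * scheduler may aggregate or split the resulting network requests
     * as long as tag/sequence-number ordering is respected. */
    nm_example_begin_pack(&cnx, /*dest=*/1, /*tag=*/42);
    nm_example_pack(&cnx, header, sizeof header);
    nm_example_pack(&cnx, payload, sizeof payload);
    nm_example_end_pack(&cnx);
    return 0;
}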
10 NewMadeleine architecture: the network layer NewMadeleine activity is triggered by the NIC: while the NIC is busy, application requests are gathered; when the NIC becomes idle, the scheduler is invoked. The scheduler analyses the user request backlog, applies a strategy, and synthesizes a network request. Available drivers: Myrinet MX, InfiniBand verbs, Quadrics Elan, SCI SISCI, TCP sockets. [Diagram: pack/send requests flow through the NewMadeleine optimizing scheduler and its policy down to the NICs]
11 NewMadeleine architecture: strategies A strategy is invoked when a NIC becomes idle and can be combined with any network(s). Currently available: default (raw transmission), aggregation (aggressive aggregation of small packets), multirail (split packets over multiple networks), QoS (priority-based scheduling). [Diagram: the core scheduler applies the selected strategy and its tactics to pack/send requests before handing them to the NICs]
12 Mixing network and threads Multiple levels of thread support in the communication subsystem. Level 1: thread safety, allowing simultaneous access to the library. Level 2: background progression of communication, making communications (rendezvous, non-blocking operations) progress in the background. Level 3: parallel processing, using several cores to process communication.
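For reference, the thread-support level an application needs is negotiated at MPI startup. A minimal example using only standard MPI calls (nothing NewMadeleine-specific):

/* Minimal example of requesting full thread support from an MPI implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for MPI_THREAD_MULTIPLE: several threads may call MPI concurrently.
     * The library must at least serialize such calls safely (level 1 above). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: only thread level %d provided\n", provided);

    /* ... application threads may now issue MPI calls ... */

    MPI_Finalize();
    return 0;
}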
13 Ensuring thread-safety Level 1: thread safety, coarse grain. A library-wide mutex avoids simultaneous access to the library; overhead = 140 ns (negligible). [Diagram: MPI_Irecv and MPI_Test issued concurrently from two CPUs]
14 Ensuring thread-safety Level 1: thread safety, coarse grain vs. fine grain. Coarse grain: a library-wide mutex avoids simultaneous access to the library; overhead = 140 ns. Fine grain: action-wide mutexes provide local thread-safety and allow simultaneous access to the library; overhead = 230 ns. [Diagram: MPI_Irecv and MPI_Test issued concurrently from two CPUs]
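The two granularities can be pictured with plain pthreads. This is a generic sketch of the idea, not the library's actual locking code; the 140 ns / 230 ns figures are the overheads quoted on the slide.

/* Generic sketch of coarse-grain vs. fine-grain locking with pthreads. */
#include <pthread.h>

/* Coarse grain: one library-wide mutex serializes every entry point. */
static pthread_mutex_t library_lock = PTHREAD_MUTEX_INITIALIZER;

void lib_call_coarse(void (*action)(void *), void *arg)
{
    pthread_mutex_lock(&library_lock);   /* ~140 ns of overhead per call */
    action(arg);
    pthread_mutex_unlock(&library_lock);
}

/* Fine grain: each request carries its own mutex, so independent requests
 * (e.g. MPI_Irecv on one core, MPI_Test on another) can enter the library
 * simultaneously. */
struct request {
    pthread_mutex_t lock;                /* action-wide mutex (~230 ns overhead);
                                            initialize with pthread_mutex_init() */
    int completed;
};

void request_test(struct request *req)
{
    pthread_mutex_lock(&req->lock);
    /* ... poll the NIC, update req->completed ... */
    pthread_mutex_unlock(&req->lock);
}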
15 Background processing Level 2: background processing. A progression thread per NIC handles rendezvous handshake progression (as in MX/Myrinet and Open MPI/TCP). Drawback: priority issues on overloaded systems. Overhead: 400 ns (inter-core, same-chip synchronization) to 2-3 µs (inter-chip synchronization), depending on thread placement. [Diagram: MPI_Irecv and MPI_Test issued from two CPUs]
16 Parallel processing of communication flows Level 3: benefit from multi-core with the PIOMan communication manager. Communication processing is seen as a sequence of operations (tasklets); operations may be scheduled on any core through hooks in the thread scheduler (idle, timer, context switch). Load balancing of processing: idle cores 'help' working cores, and asynchronous operations are offloaded. No priority issue, reduced overhead (400 ns), controlled placement, fewer context switches.
17 Offloading small messages Sending small messages consumes CPU: a memcopy or PIO may monopolize a CPU for dozens of µs. Even an MPI_Isend can be split into basic operations: (a) register the MPI request, (b) submit the packet to the NIC. These operations are spread over the cores while the caller computes until MPI_Wait (a sketch of this splitting follows below).
18 Offloading small messages [Same content as slide 17, animation step showing the operations spread over core 1 and core 2]
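A generic sketch of the splitting idea with pthreads: step (a) registers the request, step (b) is handed to a worker running on another core so the caller can return to computation. This only illustrates the principle; it is not PIOMan's implementation.

/* Sketch: offloading the "submit to NIC" half of a non-blocking send. */
#include <pthread.h>
#include <stdio.h>

struct send_op { const void *buf; int len; int submitted; };

static struct send_op pending;             /* one-slot queue, for illustration */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static int q_full = 0;

static void *idle_core_worker(void *arg)   /* runs on an otherwise idle core */
{
    (void)arg;
    pthread_mutex_lock(&q_lock);
    while (!q_full)
        pthread_cond_wait(&q_cond, &q_lock);
    /* Step (b): submit the packet to the NIC (memcopy/PIO would happen here). */
    pending.submitted = 1;
    printf("worker: submitted %d bytes to the NIC\n", pending.len);
    pthread_mutex_unlock(&q_lock);
    return NULL;
}

static void my_isend(const void *buf, int len)
{
    /* Step (a): register the request, then hand step (b) to the worker
     * so this core can go back to computation immediately. */
    pthread_mutex_lock(&q_lock);
    pending.buf = buf; pending.len = len; pending.submitted = 0;
    q_full = 1;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

int main(void)
{
    pthread_t worker;
    char msg[256] = {0};

    pthread_create(&worker, NULL, idle_core_worker, NULL);
    my_isend(msg, sizeof msg);
    /* ... computation overlaps with the submission ... */
    pthread_join(worker, NULL);
    return 0;
}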
19 The PIOMan communication engine PIOMan, the PM2 I/O event manager: a thread-aware I/O event manager that offloads message submissions onto idle cores when available, uses interrupts and/or polling transparently depending on system load, and makes communication progress asynchronously. No thread is needed in the application, and there is no priority issue even on overloaded systems. Well integrated with the Marcel thread scheduler. [Diagram: the MPI interface and NewMadeleine call test_event()/wait_event() on the PIOMan I/O manager, which calls NIC_test()/NIC_wait() on the NIC and cooperates with the Marcel thread scheduler]
20 Bridging the gap with standard APIs Mad-MPI: a light MPI implementation on top of NewMadeleine, bringing the performance of NewMadeleine to simple MPI applications. NEMESIS/NewMadeleine: a new architecture for MPICH2, towards a powerful optimization engine for MPI. [Diagram: the MPICH2 stack — MPI, ADI3 interface, devices (CH3, BG/L, Quadrics), CH3 channels (sock, shm, Nemesis), with Nemesis handling intra-node communication and NewMad handling inter-node communication]
21 MPICH2 over NewMadeleine The preliminary version was straightforward to implement: MPI_xxx point-to-point communications are (almost) directly mapped onto nmad_xxx, and NEMESIS is used for intra-node (shared memory) communication (joint work with Argonne NL). Published at IPDPS 2009. Performance is good: low latency overhead, and bandwidth is the same. Scientific challenges: optimization of MPI datatype management, support for collective operations within NewMadeleine, full support of dynamicity over heterogeneous (or multi-rail) configurations.
22 Experimental testbed For point-to-point experiments: two dual quad-core nodes with Intel Xeon 3.16 GHz, 4 GB of memory per node, one Myrinet 10G NIC and one ConnectX InfiniBand HCA. For the NAS parallel benchmarks: ten quad dual-core nodes with AMD Opteron 2.6 GHz, 32 GB of memory per node, one InfiniBand 10G HCA.
23 PIOMan's impact on latency performance [Two plots: latency (microseconds) vs. MPI message size (bytes); left panel: shared memory, comparing MPICH2-Nemesis, MPICH2-Nemesis/PIOMan and Open MPI; right panel: Myrinet 10G, comparing MPICH2-NewMad, MPICH2-NewMad (PIOMan), Open MPI (BTL) and Open MPI (PML)]
24 Multi-threaded latency over Infiniband OMB multi-threaded latency benchmark: 1 sender thread, N receiver threads, 4-byte messages. MVAPICH2 1.2-p1: active waiting, concurrent polling. Open MPI: segmentation fault. NewMadeleine: fixed-spin waiting, no concurrent polling, no overloaded CPU. [Plot: latency (µs) vs. number of receiver threads for MVAPICH2 and MPICH2-NewMadeleine]
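A rough sketch of the benchmark pattern using standard MPI and pthreads: rank 0 runs a single sender, rank 1 runs several receiver threads, each pair exchanging 4-byte ping-pongs on its own tag. The thread count, iteration count and timing are simplified relative to the real OMB benchmark.

/* Simplified multi-threaded latency test: 1 sender, N_THREADS receivers. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define ITERS     1000

static void *receiver(void *arg)
{
    int tag = *(int *)arg;
    char buf[4];
    for (int i = 0; i < ITERS; i++) {
        MPI_Recv(buf, 4, MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, 4, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                       /* N concurrent receiver threads */
        pthread_t th[N_THREADS];
        int tags[N_THREADS];
        for (int t = 0; t < N_THREADS; t++) {
            tags[t] = t;
            pthread_create(&th[t], NULL, receiver, &tags[t]);
        }
        for (int t = 0; t < N_THREADS; t++)
            pthread_join(th[t], NULL);
    } else if (rank == 0) {                /* single sender thread */
        char buf[4] = {0};
        double t0 = MPI_Wtime();
        for (int t = 0; t < N_THREADS; t++)
            for (int i = 0; i < ITERS; i++) {
                MPI_Send(buf, 4, MPI_CHAR, 1, t, MPI_COMM_WORLD);
                MPI_Recv(buf, 4, MPI_CHAR, 1, t, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        printf("average round-trip: %.2f us\n",
               (MPI_Wtime() - t0) * 1e6 / (N_THREADS * ITERS));
    }
    MPI_Finalize();
    return 0;
}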
25 MPI overlap benchmark The measured time covers MPI_Isend(dest, &sreq); Computation(); MPI_Wait(&sreq); MPI_Recv(dest). Computation time: 20 µs for small messages (eager protocol), 400 µs for large messages (rendezvous protocol).
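The pseudo-code above can be fleshed out as follows. A minimal two-rank sketch; the Computation() placeholder and the 64 KB message size are illustrative choices, not the exact benchmark parameters.

/* Minimal sketch of the overlap benchmark: rank 0 posts a non-blocking
 * send, computes, then waits; rank 1 receives. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

static void computation(volatile double *x, int loops)
{
    for (int i = 0; i < loops; i++)        /* stand-in for real work */
        *x = *x * 1.0000001 + 1.0;
}

int main(int argc, char **argv)
{
    int rank;
    const int len = 64 * 1024;             /* large enough to use the rendezvous path */
    static char buf[64 * 1024];
    volatile double x = 1.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof buf);

    if (rank == 0) {
        MPI_Request sreq;
        double t0 = MPI_Wtime();
        MPI_Isend(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &sreq);
        computation(&x, 100000);           /* ideally overlapped with the transfer */
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        printf("send + computation + wait: %.1f us\n", (MPI_Wtime() - t0) * 1e6);
    } else if (rank == 1) {
        MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}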
26 Communication progress with PIOMan [Two plots of sending time (microseconds) vs. MPI message size (bytes); left panel: overlapping eager messages with MX, comparing a reference (no computation), Open MPI (BTL), Open MPI (PML), MPICH2-NewMad and MPICH2-NewMad (PIOMan); right panel: rendezvous progress with InfiniBand, comparing a reference, MVAPICH2, Open MPI, MPICH2-NewMad and MPICH2-NewMad (PIOMan)]
27 NAS Parallel Benchmarks [Two bar charts: execution times (seconds) of the Class C NAS kernels (BT, CG, EP, FT, SP, MG, LU) with 8/9 and 16 processes, comparing MVAPICH2, Open MPI, MPICH2-NewMad and MPICH2-NewMad (PIOMan)]
28 NAS Parallel Benchmarks (contd) [Two bar charts: execution times (seconds) of the Class C NAS kernels (BT, CG, EP, FT, SP, MG, LU) with 32/36 and 64 processes, comparing MVAPICH2, Open MPI, MPICH2-NewMad and, for 32/36 processes, MPICH2-NewMad (PIOMan)]
29 Heterogeneous multi-rail Long fragments: heterogeneous splitting, with the split ratio computed according to the performance of each network and the estimated availability of the NICs; network sampling is necessary. Short fragments: aggregation over the fastest NIC. [Diagram: traffic spread over Myri-10G NICs and a Quadrics NIC]
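A toy example of how such a split ratio could be derived from sampled bandwidths. The numbers are illustrative, and the real strategy also factors in the estimated availability of each NIC.

/* Toy computation of a split ratio for a large fragment across two rails,
 * proportional to their sampled bandwidths. */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    double bw_myri = 1200.0;   /* MB/s, sampled Myri-10G bandwidth (illustrative) */
    double bw_quad = 900.0;    /* MB/s, sampled Quadrics bandwidth (illustrative) */
    size_t fragment = 4 * 1024 * 1024;

    double ratio = bw_myri / (bw_myri + bw_quad);
    size_t on_myri = (size_t)(ratio * (double)fragment);
    size_t on_quad = fragment - on_myri;

    printf("split ratio %.2f: %zu bytes on Myri-10G, %zu bytes on Quadrics\n",
           ratio, on_myri, on_quad);
    return 0;
}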
30 Aggregating bandwidth through adaptive multi-rail NewMadeleine adaptive multi-rail: Myri-10G + QsNet. [Plot: bandwidth (MB/s) vs. data size (bytes), comparing adaptive multi-rail, iso-split Myri + QsNet, Myri-10G alone and Quadrics QsNet alone]
31 Point-to-point performance MPICH2-NewMad, point-to-point, Myrinet 10G NIC + ConnectX Infiniband HCA. [Two plots: latency (microseconds) and bandwidth (Mbit/s) vs. MPI message size (bytes), comparing MPICH2-NewMad over Myrinet alone, Infiniband alone, and adaptive multirail over Myrinet + Infiniband]
32 High-throughput intra-node communications The increasing number of cores requires efficient intra-node communication. Existing strategy: double buffering, which consumes CPU cycles and pollutes the cache. KNEM offers a cheap alternative for large messages: Linux kernel-assisted memory data transfers, with support for non-contiguous/asynchronous transfers and for I/OAT copy offload. A new backend in MPICH2-Nemesis improves point-to-point and collective performance significantly, especially when no cache is shared. Developed in collaboration with ANL; results to appear at ICPP (Vienna, Sep 2009).
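A conceptual sketch of why a kernel-assisted copy helps: double buffering moves every byte twice through a bounce buffer, while the kernel-assisted path copies once between the two processes' buffers. kernel_assisted_copy() below is a hypothetical placeholder, not the actual KNEM API, and the whole example runs in a single address space purely for illustration.

/* Conceptual comparison of double buffering vs. a single assisted copy. */
#include <string.h>
#include <stddef.h>

#define BOUNCE_SIZE (16 * 1024)
static char bounce[BOUNCE_SIZE];           /* stands in for a shared-memory bounce buffer */

/* Double buffering: sender copies into the bounce buffer, receiver copies
 * out of it; twice the traffic and cache pollution on both cores. */
static void double_buffered_copy(char *dst, const char *src, size_t len)
{
    for (size_t off = 0; off < len; off += BOUNCE_SIZE) {
        size_t chunk = (len - off < BOUNCE_SIZE) ? len - off : BOUNCE_SIZE;
        memcpy(bounce, src + off, chunk);   /* copy 1: sender side */
        memcpy(dst + off, bounce, chunk);   /* copy 2: receiver side */
    }
}

/* Kernel-assisted path: a single copy performed directly between the two
 * buffers (as KNEM does between address spaces for large messages). */
static void kernel_assisted_copy(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);                  /* placeholder for the real mechanism */
}

int main(void)
{
    static char src[64 * 1024], dst[64 * 1024];
    double_buffered_copy(dst, src, sizeof src);
    kernel_assisted_copy(dst, src, sizeof src);
    return 0;
}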
33 High-throughput intra-node communications
34 Conclusion The world is going multicore! Massively multicore clusters may arrive sooner than expected, and this has an impact on the communication subsystem: mixing MPI and threads does not work out of the box. Multicore is also an opportunity to optimize communications: optimization strategies over multiple flows, and idle cores used for communication progression. We need new communication engines, designed from the ground up with multicore/multithread in mind: fully parallel, NUMA-aware, with machine-wide optimization of traffic.
35 Thank you! More information: Software available on INRIA Gforge: