HA-PACS Project Challenge for Next Step of Accelerating Computing
- Brenda Armstrong
- 5 years ago
Transcription
1 HA-PACS Project: Challenge for Next Step of Accelerating Computing. Taisuke Boku, Center for Computational Sciences, University of Tsukuba
2 Outline of talk: Introduction of CCS, U. Tsukuba; HA-PACS project overview; HA-PACS base cluster; HA-PACS applications; TCA (Tightly Coupled Accelerators); Summary
3 CCS at University of Tsukuba. Center for Computational Sciences: established in years as Center for Computational Physics; reorganized as Center for Computational Sciences in 2004. Daily collaborative research between two kinds of researchers (about 30 in total): computational scientists who have NEEDS (applications) and computer scientists who have SEEDS (system & solution)
4 CCS (cont'd). Application fields: Particle Physics, Astrophysics, Nuclear Physics, Quantum Condensed Matter Physics, Life Science, Global Environment Science. Computer system fields: High Performance Computing Systems, Computational Informatics. Not a general Computer Service Center; a Collaborative Research Center for Computational Sciences and Computer Science
5 Project plan of HA-PACS. HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences): accelerating critical problems in various scientific fields at the Center for Computational Sciences, University of Tsukuba. The target application fields will be partially limited; current targets: QCD, Astro, QM/MM (quantum mechanics / molecular mechanics, for life science). Two parts. HA-PACS base cluster: for development of GPU-accelerated code for the target fields, and for performing production runs of them. HA-PACS/TCA (TCA = Tightly Coupled Accelerators): for elementary research on new technology for accelerated computing; our original communication system based on PCI-Express, named PEARL, and a prototype communication chip named PEACH2
6 GPU Computing: current trend of HPC. GPU clusters in TOP500 on Nov: 2nd 天河 Tianhe-1A (Rpeak=4.70 PFLOPS), 4th 星雲 Nebulae (Rpeak=2.98 PFLOPS), 5th TSUBAME2.0 (Rpeak=2.29 PFLOPS) (1st: K Computer, Rpeak=11.28 PFLOPS). Features: high peak performance / cost ratio; high peak performance / power ratio. Large scale applications with GPU acceleration don't yet run in production on GPU clusters. Our first target is to develop large scale applications accelerated by GPU in real computational sciences
7 Issues of GPU Cluster. Problems of GPGPU for HPC. Data I/O performance limitation, e.g. GPGPU: PCIe gen2 x16, peak performance 8 GB/s (I/O) vs. 665 GFLOPS (NVIDIA M2090). Memory size limitation, e.g. M2090: 6 GByte vs CPU: GByte. Communication between accelerators: no direct path (external), so communication latency via CPU becomes large, e.g. GPGPU: GPU mem -> CPU mem -> (MPI) -> CPU mem -> GPU mem. Research on direct communication between GPUs is required. Our other target is developing a direct communication system between external GPUs as a feasibility study for future accelerated computing
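The I/O limitation on this slide can be made concrete with a little arithmetic. The sketch below uses only the two figures quoted above (8 GB/s for PCIe gen2 x16 and 665 GFLOPS for the M2090) to estimate how many floating-point operations a kernel must perform per byte moved over PCIe before the link stops being the bottleneck. The function and constant names are illustrative assumptions, not part of the talk.

```python
# Rough balance estimate: peak compute rate versus PCIe transfer rate.
# Both inputs are the peak figures quoted on the slide; real kernels
# and real transfers achieve less than peak.

PCIE_GEN2_X16_GBYTES_PER_S = 8.0   # peak host<->device bandwidth
M2090_DP_GFLOPS = 665.0            # peak double-precision rate


def flops_per_byte(gflops: float, gbytes_per_s: float) -> float:
    """Operations the GPU can execute in the time one byte crosses the link."""
    return gflops / gbytes_per_s


ratio = flops_per_byte(M2090_DP_GFLOPS, PCIE_GEN2_X16_GBYTES_PER_S)
per_double = ratio * 8  # a double-precision value is 8 bytes

print(f"{ratio:.3f} flop per byte transferred")
print(f"{per_double:.1f} flop per double transferred")
```

Under these peak numbers, a kernel that does fewer operations per transferred value than this ratio is bound by PCIe rather than by the GPU, which is the motivation the slide gives for a direct GPU-to-GPU path.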
8 Project Formation. HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences): Apr Mar. 2014, 3-year project (the system will be maintained until Mar. 2016). Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura): develop large scale GPU applications; 14 members; Elementary Particle Physics, Astrophysics, Bioscience, Nuclear Physics, Quantum Matter Physics, Global Environmental Science, Computational Informatics, High Performance Computing Systems. Project Office for Exascale Computing System Development (Leader: Prof. T. Boku): develop two types of GPU cluster systems; 15 members
9 HA-PACS base cluster (Feb. 2012)
10 HA-PACS base cluster: front view, side view
11 HA-PACS base cluster: front view of 3 blade chassis; rear view of one blade chassis with 4 blades; rear view of InfiniBand switch and cables (yellow=fibre, black=copper)
12 HA-PACS: base cluster (computation node). [Node diagram] CPU: AVX (2.6 GHz x 8 flop/clock) = 20.8 GFLOPS x 16 = 332.8 GFLOPS; CPU memory: (16 GB, 12.8 GB/s) x 8 = 128 GB, 102.4 GB/s; GPU: 665 GFLOPS x 4 = 2660 GFLOPS; GPU memory: (6 GB, 177 GB/s) x 4 = 24 GB, 708 GB/s; 8 GB/s (PCIe); Total: 3 TFLOPS
13 HA-PACS: base cluster unit (CPU). Intel Xeon E5 (SandyBridge-EP) x 2: 8 cores/socket (16 cores/node) at 2.6 GHz; AVX (256-bit SIMD) on each core; peak perf./socket = 2.6 x 4 x 2 x 8 cores = 166.4 GFLOPS; peak perf./node = 332.8 GFLOPS. Each socket supports up to 40 lanes of PCIe gen3: great performance to connect multiple GPUs without I/O performance bottleneck; the current NVIDIA M2090 supports just PCIe gen2, but the next generation (Kepler) will support PCIe gen3. M2090 x4 can be connected to 2 SandyBridge-EP, still remaining: PCIe gen3 x8 x2 (InfiniBand QDR x 2)
14 HA-PACS: base cluster unit (GPU). NVIDIA M2090 x 4. Number of processor cores: 512; processor core clock: 1.3 GHz; DP 665 GFLOPS, SP 1331 GFLOPS; PCI Express gen2 x16 system interface; board power dissipation: <= 225 W; memory clock: 1.85 GHz, size: 6 GB with ECC, 177 GB/s; Shared/L1 Cache: 64 KB, L2 Cache: 768 KB
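The per-node peak figures on the three slides above follow from a short chain of multiplications, sketched here so the numbers can be checked. All inputs are taken from the slides; the variable names are illustrative. The last step divides the 802 TFLOPS system peak quoted in the summary by the per-node peak; the resulting node count is an inference from that arithmetic, not a figure stated in the talk.

```python
# Peak double-precision performance of one HA-PACS base cluster node,
# reconstructed from the figures on the slides.

CLOCK_GHZ = 2.6           # SandyBridge-EP core clock
FLOP_PER_CLOCK = 8        # AVX: 4 doubles wide x 2 (add + multiply)
CORES_PER_SOCKET = 8
SOCKETS_PER_NODE = 2
GPU_DP_GFLOPS = 665.0     # NVIDIA M2090
GPUS_PER_NODE = 4

core_gflops = CLOCK_GHZ * FLOP_PER_CLOCK              # per core
socket_gflops = core_gflops * CORES_PER_SOCKET        # per socket
cpu_node_gflops = socket_gflops * SOCKETS_PER_NODE    # both sockets
gpu_node_gflops = GPU_DP_GFLOPS * GPUS_PER_NODE       # four GPUs
node_gflops = cpu_node_gflops + gpu_node_gflops       # whole node

# Assumption: the 802 TFLOPS system peak is the sum of identical nodes.
SYSTEM_TFLOPS = 802.0
implied_nodes = round(SYSTEM_TFLOPS * 1000.0 / node_gflops)

print(f"core   {core_gflops:.1f} GFLOPS")
print(f"socket {socket_gflops:.1f} GFLOPS")
print(f"node   CPU {cpu_node_gflops:.1f} + GPU {gpu_node_gflops:.1f}"
      f" = {node_gflops:.1f} GFLOPS")
print(f"implied node count: {implied_nodes}")
```

The GPUs contribute roughly eight times the CPU peak per node, which is why the talk treats GPU-side data movement as the critical path.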
15 HA-PACS: base cluster unit (blade node). [Front view / rear view labels] 1x PCIe slot for HCA; 2x NVIDIA Tesla M2090; 2x 2.6 GHz 8-core SandyBridge-EP; air flow; 2x 2.5-inch HDD; 2x NVIDIA Tesla M2090; power supply unit and fan. 8U enclosure: 4 nodes, 3 PSUs (hot swappable), 6 fans (hot swappable)
16 Basic performance data. MPI pingpong: 6.4 GB/s (N 1/2 = 8 KB) with dual rail InfiniBand QDR (Mellanox ConnectX-3; actually FDR for HCA and QDR for switch). PCIe benchmark (Device -> Host memory copy), aggregated perf. for 4 GPUs simultaneously: 24 GB/s (N 1/2 = 20 KB); PCIe gen2 x16 x4, theoretical peak = 8 GB/s x 4 = 32 GB/s. Stream (memory): 74.6 GB/s; theoretical peak = 102.4 GB/s
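A quick way to read these measurements is as a fraction of theoretical peak. The sketch below does that for the three benchmarks. The PCIe peak (32 GB/s) is stated on this slide; the memory peak (102.4 GB/s) comes from the node diagram on slide 12; the MPI peak assumes 4 GB/s per QDR link, the figure quoted on slide 26, times two rails. The dictionary layout and names are illustrative.

```python
# Measured bandwidth as a fraction of theoretical peak (GB/s).
# Each entry: benchmark name -> (measured, theoretical peak).
BENCHMARKS = {
    "MPI pingpong (dual rail QDR)": (6.4, 2 * 4.0),   # assumes 4 GB/s per rail
    "PCIe D->H copy (4 GPUs)":      (24.0, 4 * 8.0),  # gen2 x16 per GPU
    "Stream (memory)":              (74.6, 102.4),    # 8 channels x 12.8 GB/s
}


def efficiency(measured: float, peak: float) -> float:
    """Return measured bandwidth as a fraction of peak."""
    return measured / peak


results = {name: efficiency(m, p) for name, (m, p) in BENCHMARKS.items()}

for name, eff in results.items():
    print(f"{name:32s} {eff * 100:5.1f} % of peak")
```

All three land between roughly 70 % and 80 % of peak under these assumptions, so none of the links in the base cluster is badly underperforming its specification.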
17 PCIe Host:Device communication performance: slower start on Host->Device compared with Device->Host
18 HA-PACS Application (1): Elementary Particle Physics. Multi-scale physics: investigate hierarchical properties via direct construction of nuclei in lattice QCD; GPU to solve large sparse linear systems of equations. Finite temperature and density: phase analysis of QCD at finite temperature and density; GPU to perform matrix-matrix product of dense matrices. [Figure labels: quark, proton, neutron, nucleus; expected QCD phase diagram]
19 HA-PACS Applications (2): Astrophysics. (A) Collisional N-body Simulation. Globular Clusters: formation of the most primordial objects, formed more than 10 giga years ago; fossil objects as a clue to investigate the primordial universe. Massive Black Holes in Galaxies: understanding of the formation of massive black holes in galaxies. Numerical simulations of complicated gravitational interactions between stars and multiple black holes in galaxy centers. Direct (brute force) calculations of accelerations and jerks are required to achieve the required numerical accuracy. Computations of the accelerations of particles and their time derivatives (jerks) are time consuming, so accelerations and jerks are computed on GPU. (B) Radiation Transfer. First Stars and Re-ionization of the Universe: understanding of the formation of the first stars in the universe and the subsequent re-ionization of the universe. Accretion Disks around Black Holes: study of the high temperature regions around black holes. Calculation of the physical effects of photons emitted by stars and galaxies onto the surrounding matter. So far poorly investigated due to its huge computational cost, though it is of critical importance in the formation of stars and galaxies. Computations of the radiation intensity and the resulting chemical reactions based on ray-tracing methods can be highly accelerated with GPUs owing to their high concurrency.
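To illustrate what the "direct (brute force)" step in part (A) computes, here is a minimal pure-Python sketch of the pairwise acceleration and jerk sums. It uses the standard expressions found in Hermite-type N-body integrators; the slide does not say which integrator or code the project uses, so that choice, the function name `acc_and_jerk`, the softening parameter `eps`, and the unit choice G = 1 are all assumptions made for illustration. A GPU version would run the same O(N^2) loop with one thread per particle i.

```python
# Direct-summation gravitational acceleration and jerk.
# For particle i, with r = x_j - x_i and v = v_j - v_i:
#   a_i = sum_j G m_j r / |r|^3
#   j_i = sum_j G m_j (v / |r|^3 - 3 (r.v) r / |r|^5)

def acc_and_jerk(masses, pos, vel, G=1.0, eps=0.0):
    """Return (acc, jerk) as lists of 3-vectors, by O(N^2) direct summation."""
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    jerk = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-interaction
            r = [pos[j][k] - pos[i][k] for k in range(3)]
            v = [vel[j][k] - vel[i][k] for k in range(3)]
            r2 = sum(c * c for c in r) + eps * eps   # softened distance^2
            rv = sum(r[k] * v[k] for k in range(3))  # r . v
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * masses[j] * r[k] * inv_r3
                jerk[i][k] += G * masses[j] * (v[k] - 3.0 * rv * r[k] / r2) * inv_r3
    return acc, jerk


# Two unit masses one unit apart; the second moves perpendicular to the
# separation, so r.v = 0 and the jerk on the first is simply v / |r|^3.
masses = [1.0, 1.0]
pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
acc, jerk = acc_and_jerk(masses, pos, vel)
print(acc[0], jerk[0])
```

Every pair is evaluated independently, which is why this kernel maps well onto a GPU and why its cost grows quadratically with the particle count.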
20 HA-PACS Application (3): Bioscience. GPU acceleration: direct Coulomb (Gromacs, NAMD, Amber); 2-electron integral. [Figure labels: DNA-protein complex; macroscale MD; QM region > 100 atoms; reaction mechanisms; QM/MM-MD]
21 HA-PACS Application (4): Other advanced research in the HPC Division of CCS. XcalableMP-dev (XMP-dev): an easy and simple programming language to support distributed memory & GPU accelerated computing for large scale computational sciences. G8 NuFuSE (Nuclear Fusion Simulation for Exascale) project: platform for porting Plasma Simulation Code with GPU technology. Climate simulation, especially LES (Large Eddy Simulation) for cloud-level resolution on city-model size simulation. Any other collaboration...
22 HA-PACS: TCA (Tightly Coupled Accelerator). TCA: direct connection between accelerators (GPUs), using PCIe as a communication device between accelerators. Most acceleration devices and other I/O devices are connected by PCIe as PCIe end-points (slave devices); an intelligent PCIe device logically enables an end-point device to directly communicate with other end-point devices. PEARL: PCI Express Adaptive and Reliable Link. We already developed such a PCIe device (PEACH, PCI Express Adaptive Communication Hub) in the JST-CREST project "low power and dependable network for embedded system"; it enables direct connection between nodes by a PCIe Gen2 x4 link. Improving PEACH for HPC to realize TCA
23 PEACH. PEACH: PCI-Express Adaptive Communication Hub. An intelligent PCI-Express communication switch to use the PCIe link directly for node-to-node interconnection; the edge of a PEACH PCIe link can be connected to any peripheral device, including a GPU. Prototype PEACH chip: 4-port PCI-E gen.2 with x4 lanes / port; PCI-E link edge control feature: root complex and end points are automatically switched (flipped) according to the connection handling; another fault-tolerant (reliability) function is implemented: flip network link to allow a single link fault. In HA-PACS/TCA prototype development, we will enhance the current PEACH chip -> PEACH2
24 HA-PACS/TCA (Tightly Coupled Accelerator). True GPU-direct: current GPU clusters require 3-hop communication (3-5 times memory copy). For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput. Enhanced version of PEACH -> PEACH2: x4 lanes -> x8 lanes; hardwired on main data path and PCIe interface fabric. [Diagram: two nodes, each with CPU, MEM, GPU (with MEM), IB HCA and PEACH2 attached by PCIe; an IB switch between the nodes]
25 Implementation of PEACH2: ASIC -> FPGA. FPGA based implementation: today's advanced FPGAs allow a PCIe hub with multiple ports; currently gen2 x 8 lanes x 4 ports are available, soon gen3 will be available (?); easy modification and enhancement; fits a standard (full-size) PCIe board; an internal multi-core general purpose CPU with programmability is available, so hardwired/firmware partitioning can easily be split at a certain level of the control layer. Controlling PEACH2 for GPU communication protocol: collaboration with NVIDIA for information sharing and discussion; based on the CUDA 4.0 device to device direct memory copy protocol
26 HA-PACS/TCA. Node Cluster = NC: node cluster with 16 nodes, GPU x64 (G), CPU x32 (C); each node has G x4, C x2 and PEACH2, joined by the PEARL ring network. GPU comm with PCIe; IB link / node. CPU: Xeon E5; GPU: Kepler. High speed GPU-GPU comm. by PEACH within NC (PCI-E gen2 x8 = 5 GB/s/link); InfiniBand QDR (x2) for NC-NC comm. (4 GB/s/link). 4 NC with 16 nodes, or 8 NC with 8 nodes = 360 TFLOPS extension to base cluster. [Diagram: node clusters connected by the InfiniBand network]
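The link figures on this slide and on slide 7 can be cross-checked against the PCI Express signalling rates. The sketch below computes raw and payload-level bandwidth per link from the per-lane transfer rate and line encoding (5 GT/s with 8b/10b for gen2, 8 GT/s with 128b/130b for gen3). Read this way, the 5 GB/s quoted here for gen2 x8 matches the raw rate, while the 8 GB/s quoted earlier for gen2 x16 matches the rate after encoding; that reading is an interpretation, not something the slides state. The table and function names are illustrative.

```python
# PCI Express link bandwidth from lane count and generation.
# Each entry: generation -> (GT/s per lane, encoding efficiency).
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_gbytes_per_s(gen: int, lanes: int, after_encoding: bool = True) -> float:
    """One-direction bandwidth in GB/s, ignoring packet/protocol overhead."""
    rate_gt, encoding_eff = PCIE_GENERATIONS[gen]
    gbits = rate_gt * lanes
    if after_encoding:
        gbits *= encoding_eff
    return gbits / 8.0  # bits -> bytes


for gen, lanes in [(2, 4), (2, 8), (2, 16), (3, 8), (3, 16)]:
    raw = pcie_gbytes_per_s(gen, lanes, after_encoding=False)
    eff = pcie_gbytes_per_s(gen, lanes)
    print(f"gen{gen} x{lanes:<2d} raw {raw:5.2f} GB/s, after encoding {eff:5.2f} GB/s")
```

On the after-encoding numbers, a gen2 x8 PEACH2 link and a QDR InfiniBand link both come out at about 4 GB/s, so the expected benefit of the PEARL path is lower latency from skipping the host rather than higher peak bandwidth.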
27 PEARL/PEACH2 variation (1). Option 1: performance of IB and PEARL can be evenly compared; additional latency by PCIe switch. [Diagram labels: QPI; PCIe; GPU x4; G3 x16; PCIe SW; G3 x8; IB HCA; PEACH2; G2 x8]
28 PEARL/PEACH2 variation (2). Option 2: requires only 72 lanes in total; asymmetric connection among 3 blocks of GPUs. [Diagram labels: QPI; PCIe; GPU x4, each on G3 x16; PCIe SW; IB HCA on G3 x8; PEACH2; G2 x8]
29 PEACH2 prototype board for TCA. [Photo labels] FPGA (Altera Stratix IV GX530); daughter board connector; PCIe external link connector x2 (one more on daughter board); PCIe edge connector (to host server); power regulators for FPGA
30 Summary. HA-PACS consists of two elements: the HA-PACS base cluster for application development, and HA-PACS/TCA for elementary study of advanced technology for direct communication among accelerating devices (GPUs). The HA-PACS base cluster started operation in Feb. 2012 with 802 TFLOPS peak performance. FPGA implementation of PEACH2 is finished for the prototype version on Mar and will be enhanced for the final version in the following 6 months. HA-PACS/TCA with at least 300 TFLOPS additional performance will be installed around Mar
More informationExperts in Application Acceleration Synective Labs AB
Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg
More informationCMAQ PARALLEL PERFORMANCE WITH MPI AND OPENMP**
CMAQ 5.2.1 PARALLEL PERFORMANCE WITH MPI AND OPENMP** George Delic* HiPERiSM Consulting, LLC, P.O. Box 569, Chapel Hill, NC 27514, USA 1. INTRODUCTION This presentation reports on implementation of the
More informationIntel Select Solutions for Professional Visualization with Advantech Servers & Appliances
Solution Brief Intel Select Solution for Professional Visualization Intel Xeon Processor Scalable Family Powered by Intel Rendering Framework Intel Select Solutions for Professional Visualization with
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationSolutions for Scalable HPC
Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End
More informationMELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구
MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio
More informationTSUBAME-KFC : Ultra Green Supercomputing Testbed
TSUBAME-KFC : Ultra Green Supercomputing Testbed Toshio Endo,Akira Nukada, Satoshi Matsuoka TSUBAME-KFC is developed by GSIC, Tokyo Institute of Technology NEC, NVIDIA, Green Revolution Cooling, SUPERMICRO,
More informationParallel Computer Architecture - Basics -
Parallel Computer Architecture - Basics - Christian Terboven 19.03.2012 / Aachen, Germany Stand: 15.03.2012 Version 2.3 Rechen- und Kommunikationszentrum (RZ) Agenda Processor
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationOverview of Parallel Computing. Timothy H. Kaiser, PH.D.
Overview of Parallel Computing Timothy H. Kaiser, PH.D. tkaiser@mines.edu Introduction What is parallel computing? Why go parallel? The best example of parallel computing Some Terminology Slides and examples
More informationThe rcuda technology: an inexpensive way to improve the performance of GPU-based clusters Federico Silla
The rcuda technology: an inexpensive way to improve the performance of -based clusters Federico Silla Technical University of Valencia Spain The scope of this talk Delft, April 2015 2/47 More flexible
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationNAMD Performance Benchmark and Profiling. November 2010
NAMD Performance Benchmark and Profiling November 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: HP, Mellanox Compute resource - HPC Advisory
More informationFujitsu HPC Roadmap Beyond Petascale Computing. Toshiyuki Shimizu Fujitsu Limited
Fujitsu HPC Roadmap Beyond Petascale Computing Toshiyuki Shimizu Fujitsu Limited Outline Mission and HPC product portfolio K computer*, Fujitsu PRIMEHPC, and the future K computer and PRIMEHPC FX10 Post-FX10,
More informationAccelerating high-performance computing with hybrid platforms
Accelerating high-performance computing with hybrid platforms October 2010 Dell THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE
More informationPerformance comparison between a massive SMP machine and clusters
Performance comparison between a massive SMP machine and clusters Martin Scarcia, Stefano Alberto Russo Sissa/eLab joint Democritos/Sissa Laboratory for e-science Via Beirut 2/4 34151 Trieste, Italy Stefano
More informationUniversity at Buffalo Center for Computational Research
University at Buffalo Center for Computational Research The following is a short and long description of CCR Facilities for use in proposals, reports, and presentations. If desired, a letter of support
More informationLS-DYNA Performance Benchmark and Profiling. October 2017
LS-DYNA Performance Benchmark and Profiling October 2017 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: LSTC, Huawei, Mellanox Compute resource
More informationANSYS HPC Technology Leadership
ANSYS HPC Technology Leadership 1 ANSYS, Inc. November 14, Why ANSYS Users Need HPC Insight you can t get any other way It s all about getting better insight into product behavior quicker! HPC enables
More informationHigh Performance Computing
High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and
More informationNAMD Performance Benchmark and Profiling. February 2012
NAMD Performance Benchmark and Profiling February 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource -
More informationLS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance
11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton
More informationChallenges in Developing Highly Reliable HPC systems
Dec. 1, 2012 JS International Symopsium on DVLSI Systems 2012 hallenges in Developing Highly Reliable HP systems Koichiro akayama Fujitsu Limited K computer Developed jointly by RIKEN and Fujitsu First
More informationArchitecting High Performance Computing Systems for Fault Tolerance and Reliability
Architecting High Performance Computing Systems for Fault Tolerance and Reliability Blake T. Gonzales HPC Computer Scientist Dell Advanced Systems Group blake_gonzales@dell.com Agenda HPC Fault Tolerance
More information