Accelerated Computing Unified with Communication Towards Exascale
1 Accelerated Computing Unified with Communication Towards Exascale Taisuke Boku Deputy Director, Center for Computational Sciences / Faculty of Systems and Information Engineering University of Tsukuba 1
2 CCS at University of Tsukuba: Center for Computational Sciences. Established as the former Center for Computational Physics; reorganized as the Center for Computational Sciences in 2004. Daily collaborative research between two kinds of researchers (about 30 in total): computational scientists who have NEEDS (applications), and computer scientists who have SEEDS (systems & solutions). IHPC Forum, Changsha, 2013/05/28
3 Current HPC and Supercomputer Research Program in Japan
4 Post-peta to exascale computing formation in Japan. MEXT Council on HPCI Plan and Promotion; Council for Science and Technology Policy (CSTP). HPCI Consortium established in April 2012: users and supercomputer centers (resource providers); the consortium will play an important role in future HPC R&D (WG for applications, WG for systems). JST (Japan Science and Technology Agency) Basic Research Programs, CREST: Development of System Software Technologies for post-peta Scale High Performance Computing. White paper for strategic direction/development of HPC in Japan. Feasibility Study of Advanced High Performance Computing: Workshop of SDHPC (Strategic Development of HPC), organized by Univ. of Tsukuba, Univ. of Tokyo, Tokyo Institute of Tech., Kyoto Univ., RIKEN supercomputer center and AICS, AIST, JST, Tohoku Univ. RIKEN (AICS) proposal to MEXT for exascale; evaluation by CSTP; budget approval by the Ministry of Finance.
5 The only information about the architecture in the RIKEN proposal
6 (slide courtesy of Y. Ishikawa, U. Tokyo) Mid-term plan on HPC research in Japan. Basic Research Programs, CREST: Development of System Software Technologies for post-peta Scale High Performance Computing. White paper for strategic direction/development of HPC in Japan (US$23M, US$17M, US$15M). Feasibility Study of Advanced High Performance Computing (US$10M): three teams will run; whether or not the national R&D project starts depends on the results of those feasibility studies. R&D of Advanced HPC. Deployments: 2011Q4: HA-PACS, 0.8PF, Univ. Tsukuba -> to be upgraded to 1.2PF (2013Q3); 2012: BG/Q, 1.2PF, KEK; 2012Q1: <1PF, Univ. Tokyo; 2012Q2: 0.7PF, Kyoto-U; 2015: 30PF? (Tokyo & Tsukuba)
7 Projects in FS: four projects (3 system-side + 1 application-side) were selected based on proposals in July 2012.
- General-purpose (base) system; leading PI: Y. Ishikawa (Univ. of Tokyo); members: U. Tokyo, Kyushu U., Fujitsu, Hitachi, NEC
- Accelerated computing; leading PI: M. Sato (Univ. of Tsukuba); members: U. Tsukuba, TITECH, Aizu U., AICS, Hitachi
- Vector processing; leading PI: H. Kobayashi (Tohoku Univ.); members: Tohoku U., JAMSTEC, NEC
- Mini applications; leading PIs: H. Tomita (AICS, RIKEN) & S. Matsuoka (TITECH); members: AICS, TITECH
8 Two activities in U. Tsukuba. HA-PACS + JST-CREST: developing a platform for a new concept of accelerated computing; hardware and software development for a testbed on the TCA (Tightly Coupled Accelerators) architecture; new algorithms and code based on this concept. Feasibility Study: a proposal for an accelerator-based exa-scale computing system, where interconnection is embedded at the instruction level with computation; the ultimate style of highly computable system design.
9 Key Issues on Next Generation's Accelerated Computing
10 Accelerated computing in the next generation. What do accelerators contribute? Flops: concentrating (almost) all the power just on computation. Simplicity: reducing control complexity; everything is data parallel. Vector: another style of vector computing (simple data parallelism). Energy: very high performance/power. We still need a x20 improvement in the performance/power ratio to achieve 1 EFlops at 20 MW, and accelerated computing is one of the shortest paths. It's good for computation and for horizontal access to the memory device. However...
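The x20 figure above follows from simple arithmetic on the stated target; a quick sketch (the ~2.5 GFlops/W baseline is an inference from the quoted gap, not a figure stated on the slide):

```python
# Power-efficiency target implied by "1 EFlops at 20 MW" (arithmetic sketch;
# the baseline value is inferred from the quoted x20 gap).
target_flops = 1e18                      # 1 EFlops
power_budget_w = 20e6                    # 20 MW

target_gf_per_w = target_flops / power_budget_w / 1e9
implied_baseline = target_gf_per_w / 20  # the "x20" starting point

print(target_gf_per_w)                   # 50.0 GFlops/W target
print(implied_baseline)                  # ~2.5 GFlops/W implied baseline
```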
11 Issues on accelerated computing in the near future. Limited memory capacity: current GPU = 6 GB : 1.3 TFlops = 1 : 200 -> co-design for memory capacity saving. Limited dynamism in computation: on a GPU, warp splitting causes heavy performance degradation; this depends on the application/algorithm, but it should be solved by the application/algorithm -> co-design for effective vector features. Limited capacity of fast storage (registers): the number of cores is large, but each core is very small -> co-design for loop-level parallelism. We need a large-scale parallel system even when based on accelerators; performance, bandwidth and capacity are covered by way of co-design.
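The 1:200 capacity ratio quoted above is bytes of memory capacity per flop/s; a minimal check:

```python
# Bytes-per-flop for a Kepler-class GPU (6 GB, 1.3 TFlops, as on the slide).
mem_bytes = 6e9
flops = 1.3e12

ratio = flops / mem_bytes   # flops per byte of capacity
print(round(ratio))         # ~217, i.e. roughly the quoted 1:200
```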
12 Issues on accelerated computing in the future (cont'd). Trade-off: power vs. dynamism (flexibility): fine-grained individual cores consume much power, so an ultra-wide SIMD feature is needed to exploit maximum Flops. Interconnection is a serious issue: current accelerators are hosted by a general CPU, and the system is not stand-alone; current accelerators are connected to the CPU by an interface bus, and only then to the interconnect; current accelerators communicate through a network interface attached to the host CPU. Latency is essential (not just bandwidth): given the memory capacity problem, strong scaling is required to solve the problems; weak scaling doesn't work in some cases because of time-to-solution limits; in many algorithms, reduction of just a scalar value over millions of nodes is required. Accelerators must be tightly coupled with each other, meaning they should be equipped with a communication facility of their own.
13 Research on Current Accelerators: TCA (Tightly Coupled Accelerators) (update from CO-DESIGN 2012) 13
14 TCA (Tightly Coupled Accelerators) Architecture. True GPU-direct: current GPU clusters require 3-hop communication (3-5 memory copies). For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput. Enhanced version of PEACH: PEACH2, x4 lanes -> x8 lanes, hardwired on the main data path and PCIe interface fabric. (Figure: two nodes, each with CPUs and GPUs on PCIe; the conventional path goes GPU -> CPU memory -> IB HCA -> IB switch, while PEACH2 connects the nodes' PCIe fabrics, and hence the GPUs, directly.)
15 TCA testbed node structure. The CPU can uniformly access all GPUs, and PEACH2 can access every GPU. Kepler architecture + CUDA 5.0: GPUDirect support for RDMA. Performance over QPI is quite bad => support only the two GPUs on the same socket. PEACH2 connects among 3 nodes. This configuration is similar to the HA-PACS base cluster except for PEACH2. All the PCIe lanes (80 lanes) provided by the CPUs are used. (Figure: two Xeon E5 v2 CPUs linked by QPI, four K20X GPUs on PCIe Gen2 x16, PEACH2 on Gen2 x8 links, and an IB HCA on Gen3 x8.)
16 PEACH2 board: PCI Express Gen2 x8 peripheral board, compatible with the PCIe spec. (Photos: side view and top view.)
17 PEACH2 board: main board + sub board. FPGA (Altera Stratix IV 530GX). PCI Express x8 card edge. Most parts operate at 250 MHz (the PCIe Gen2 logic runs at 250 MHz). DDR3 SDRAM; power supply for various voltages; PCIe x16 cable connector; PCIe x8 cable connector.
18 HA-PACS/TCA (computation node). CPUs: 2x Ivy Bridge, AVX (2.8 GHz x 8 flop/clock = 22.4 GFlops/core) x 20 cores = 448.0 GFlops, each socket with 4 channels of 1,866 MHz memory at 59.7 GB/sec; (16 GB, 14.9 GB/s) x8 = 128 GB, 119.2 GB/s. GPUs: 4x NVIDIA K20X on PCIe Gen2 x16, 1.31 TFlops x4 = 5.24 TFlops; (6 GB, 250 GB/s) x4 = 24 GB, 1 TB/s. Node total: 5.69 TFlops. PEACH2 board (TCA interconnect) on PCIe Gen2 x8 (8 GB/s), plus legacy devices.
19 HA-PACS Base Cluster + TCA (the TCA part starts operation on Nov. 1st, 2013). HA-PACS Base Cluster = 2.99 TFlops x 268 nodes = 802 TFlops. HA-PACS/TCA = 5.69 TFlops x 64 nodes = 364 TFlops. TOTAL: 1.166 PFlops.
20 HA-PACS/TCA computation node inside 20
21 Comparison with the traditional method: user-level remote GPU memory copy. Compared methods: CUDA intra-node cudaMemcpy() with (UVA) and without (no-UVA) Unified Virtual Addressing; MVAPICH2 1.9b with MV2_USE_CUDA over InfiniBand FDR10 (maybe reducible to 7 us at minimum?); PEACH2 GPU-to-GPU and GPU->CPU->GPU. PEACH2 shows better performance on both latency and bandwidth up to 4 KB, and is even faster than intra-node CUDA memory copy. (Charts: latency and bandwidth vs. data size for each method.)
22 Pingpong with up to 3 hops in an 8-node cluster. Hop count: 0 (direct) to 3; each hop adds 200-300 ns of latency; bandwidth is maintained for any hop count. (Charts: latency and bandwidth vs. data size for PIO and DMA transfers at 0-3 hops and GPU transfers, with MVAPICH2 GPU transfer for reference.)
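The per-hop cost above suggests a simple linear model, latency = base + hops x per-hop cost; a sketch with an assumed base latency (the 2.0 us direct latency is illustrative, not a measured figure from the slide):

```python
# Linear per-hop latency model for the TCA ring: each hop adds 200-300 ns.
# The 2.0 us direct (0-hop) latency here is an illustrative assumption.
BASE_US = 2.0
HOP_NS = 250                  # midpoint of the quoted 200-300 ns per hop

def latency_us(hops):
    return BASE_US + hops * HOP_NS / 1000.0

for h in range(4):            # 0 (direct) through 3 hops, as in the test
    print(h, latency_us(h))
```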
23 Simple example of stencil computation. The yellow-colored surfaces must be communicated: the data are contiguous along the k-dimension only; the others are strided or block-strided. The pattern of communication is fixed across the whole time-development loop. (Figure: 3-D decomposition in 1x2x2 over 4 nodes, with axes i, j, k.) PEACH2 supports block-stride transfer, and also RDMA chaining, in hardware.
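A small sketch of the memory layouts involved, assuming a C-ordered array a[i, j, k] with k fastest (sizes illustrative): the byte offsets show which halo faces are contiguous, block-strided, or fully strided.

```python
# Local subdomain a[i, j, k] stored in C order (k fastest). Computing byte
# offsets shows why only the j-k face transfers as one contiguous block,
# while the others need the (block-)stride DMA PEACH2 provides in hardware.
n, item = 8, 4                        # n^3 grid, 4-byte (single) values

def offset(i, j, k):
    return ((i * n + j) * n + k) * item

# j-k face (i = 0): consecutive elements are 'item' bytes apart -> contiguous.
jk_stride = offset(0, 0, 1) - offset(0, 0, 0)
# i-k face (j = 0): runs of n contiguous k-values, n*n*item bytes apart
# -> block-stride.
ik_block_stride = offset(1, 0, 0) - offset(0, 0, 0)
# i-j face (k = 0): every element is n*item bytes from the next -> strided.
ij_stride = offset(0, 1, 0) - offset(0, 0, 0)

print(jk_stride, ik_block_stride, ij_stride)   # 4 256 32
```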
24 Evaluation on a small problem (Himeno Benchmark, small size): a typical situation for strong scaling. TCA achieves up to 44% performance improvement over MPI (MVAPICH2). (Chart: speedup of MPI (MV2) vs. TCA for several node decompositions.)
25 Univ. of Tsukuba's Feasibility Study Proposal based on Extreme Accelerated Computing
26 Project organization (FS midterm report by Sato). Joint project with Titech (Makino), Aizu U. (Nakazato), RIKEN (Taiji), U. Tokyo, KEK, Hiroshima U., and Hitachi as the supercomputer vendor. Target apps: QCD in particle physics, tree N-body and HMD in astrophysics, MD in life science, FDM for earthquakes, FMO in chemistry, RS-DFT in materials, NICAM in climate science. Application study (U. Tsukuba, RIKEN, U. Tokyo, KEK, Hiroshima U.): global climate science (Tanaka, Yashiro), earth science (Okamoto), life science (Taiji, Umeda), nanomaterial science (Oshiyama), astrophysics (Umemura, Yoshikawa), particle physics (Kuramashi, Ishikawa, Matsufuru), simulator and evaluation tools (Kodama). Accelerator and network system design (Titech, U. Tsukuba, Aizu), with simulation feedback: processor core architecture (Nakazato), programming model (Sato), basic accelerator architecture (Makino), accelerator network (Boku), study on implementation and power (Hitachi), programming model and simulation tools (U. Tsukuba).
27 Between power-effective computing and dynamism of computation. (Chart: number of independent cores vs. SIMD width (operations/instruction), with constant-power and constant-Flops curves; the many-core CPU, MIC and GPU sit toward many independent cores with narrower SIMD, while GRAPE-DR and our solution (extreme SIMD) sit at large SIMD width with few independent cores.)
28 Basic concept of the extreme SIMD accelerator. Fully concentrated on Flops (compute-oriented & reduced memory): a number of SIMD cores (500-1K/chip) with very low dynamism; very limited memory capacity per core -> on-chip SRAM; high density of accelerating chips. Simple and small processing elements with some number of registers and on-chip SRAM, tightly coupled by an on-chip interconnection network. Interconnection network: 2-D torus on-chip (up to 4K cores), 2-D torus on-board (~16 chips), and an n-D torus network among boards, with a special broadcast/reduction tree feature. These chips are also tightly connected by an inter-chip network which can be operated directly from the controller. Importance of co-design: deciding memory capacity and the number of cores per chip (a trade-off), and network bandwidth and the number of ports, based on target applications and algorithms.
29 Global design of our proposal (FS midterm report by Sato). Two keys for exascale computing: power and strong scaling. We study exascale heterogeneous systems with many-core accelerators. We are interested in: architecture of accelerators (core and memory architecture, special-purpose functions); direct connection between accelerators in a group; power estimation and evaluation; programming model and computational science applications; requirements for the general-purpose system, etc. (Figure: system hierarchy — each node couples a general-purpose processor and memory with accelerator chips via controllers; an accelerator chip is a grid of cores, each with its own memory; nodes form groups linked by an accelerator network, and groups connect to storage through the system network.)
30 PACS-G: a straw man architecture (FS midterm report by Sato). SIMD architecture, for compute-oriented apps (N-body, MD) and stencil apps. 4096 cores (64x64), 2 FMA @ 1 GHz, 4 GFlops x 4096 = 16 TFlops/chip. 2-D mesh (+ broadcast/reduction) on-chip network for stencil apps. We expect 14 nm technology in the target years; chip die size: 20 mm x 20 mm. Mainly working on on-chip memory (size 512 MB/chip, 128 KB/core), with module memory by 3D-stack/Wide IO DRAM (via 2.5D TSV), bandwidth 1 TB/s, size 16-32 GB/chip. 64K chips are expected for 1 EFlops (at peak), with an inter-accelerator direct network. (Figure: host CPU and controller feed instructions and data to the PACS-G chip; the PE array is served by broadcast (BC) memories, communication buffers, a result-reduction network, and 3D-stack or Wide IO DRAM.)
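The per-chip and system peaks quoted above follow directly from the stated parameters; a sketch of the arithmetic:

```python
# PACS-G straw-man peak performance (figures from the slide).
cores = 64 * 64                  # 4096 PEs in a 64x64 array
gflops_per_core = 2 * 2 * 1.0    # 2 FMA units x 2 flops each @ 1 GHz

chip_tflops = cores * gflops_per_core / 1000.0
chips = 64 * 1024                # "64K chips expected for 1 EFlops"
system_eflops = chips * chip_tflops / 1e6

print(chip_tflops)               # 16.384, quoted as 16 TFlops/chip
print(round(system_eflops, 2))   # ~1.07 EFlops at peak
```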
31 PACS-G: a straw man architecture (FS midterm report by Sato). A group of chips is connected via the accelerator network (inter-chip network) at 25-50 Gbps/link. If we extend the 2-D mesh network to the (2-D mesh) external network of a group, we need 200-400 GB/s (= 32 ch. x 25-50 Gbps x 2 (bi-directional)). For 50 Gbps data transfer, we may need direct optical interconnect from the chip (?). I/O interface to the host: PCI Express Gen4 x16 (not enough!!!). Programming model: XcalableMP + OpenACC: use OpenACC to specify offloaded fragments of code and data movement; to align data and computation to cores, we use the "template" concept of XcalableMP (virtual index space), so we can generate code for each core (and a data-parallel language like C*). (Figure: an example 1U implementation — accelerator LSI and optical interconnect modules on a silicon interposer, with HMC / Wide IO DRAM memory modules, PSUs, and board connectors to other accelerator units; chips are linked in a 2-D mesh.)
32 Performance evaluation by simulation: FDTD (Finite Difference Time Domain), Himeno (fluid dynamics), matrix multiplication, N-body, and RS-DFT (Real-Space Density Functional Theory; the 2011 Gordon Bell Prize application on K).
33 FDTD problem mapping. Earthquake wave propagation simulation (also used in the application FS). 3-D space difference scheme (order 2 in time, order 4 in space). In every time step, velocity vectors and the stress tensor are calculated and updated. Each grid point has 12 single-precision variables. Each PE takes n x n x n grids with a stencil shadow of width 2 -> 83 KB of memory per PE is required for n=8. For 4096 PEs on a chip, 128x128x128 grids are available for computation.
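The 83 KB/PE figure above can be reproduced from the stated parameters:

```python
# Per-PE memory for the FDTD mapping: n x n x n interior grid plus a
# shadow (halo) of width 2 on every side, 12 single-precision variables.
n, shadow, nvars, bytes_per_var = 8, 2, 12, 4

points = (n + 2 * shadow) ** 3    # (8 + 4)^3 = 1728 grid points incl. halo
mem = points * nvars * bytes_per_var
print(mem)                        # 82944 bytes, i.e. the ~83 KB quoted
```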
34 FDTD estimated performance. FP operation counts and memory references for each grid point: velocity update, 67 FP ops + 73 memory refs; stress tensor update, 81 FP ops + 82 memory refs. Either case is bottlenecked by memory references: each PE's local memory bandwidth is 16 GB/s, giving 19.8 us for computation over all grids. Shadow-data exchange for the 3-D stencil computation uses the tightly coupled communication channels: the shadow area holds 6 variables for velocity and 3 variables for the tensor (cyclic boundary condition); each shadow exchange transfers 8x8x2 grids to 6 neighboring PEs. Inter-PE communication for the 3-D problem mapped onto 2-D is assumed at 2 GB/s, giving 13.8 us for the shadow-area exchange; the cyclic boundary exchange takes 18.4 us. Total computation + communication time: 52.0 us, i.e. 1.5 GFlops/PE and 6 TFlops/chip (37% of peak).
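The 19.8 us computation time above is consistent with a pure memory-bandwidth bound; a sketch of that arithmetic:

```python
# Memory-bound compute time per time step on one PE (figures from the slide).
grids = 8 ** 3             # 512 grid points per PE
refs = 73 + 82             # memory refs per grid point (velocity + stress)
bytes_per_ref = 4          # single precision
bw = 16e9                  # 16 GB/s local memory bandwidth per PE

t_us = grids * refs * bytes_per_ref / bw * 1e6
print(round(t_us, 1))      # 19.8 us, matching the slide's estimate
```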
35 Application and benchmark evaluation. FDTD: 5.2 TFlops (single precision). Himeno Benchmark (middle: 128x128x256): 7.4 TFlops (single precision). Matrix-matrix multiply kernel: 14.9 TFlops (double precision). Large-scale application: RS-DFT for 100,000 atoms (the same case as the Gordon Bell Prize run on the K Computer). Strong scaling up to 4096 chips (= 64 PFlops) can keep 35% efficiency, the same efficiency as the K Computer -> equivalent to having 6 entire K Computer systems. For a group of 4096 chips (= 64 PFlops): if data fit in on-chip memory, the ratio is 4 B/F with total memory size 2 TB; if data fit only in module memory, the B/F ratio drops, with total memory size 64 TB. Co-design is quite important for such an aggressive architecture with various limitations: co-design decides what we gain for what we give up.
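The B/F figures for the 4096-chip group follow from the per-PE and per-chip bandwidths on the earlier slides; a sketch (the module-memory B/F is derived here, since the slide leaves it blank):

```python
# Bytes-per-flop at the two memory levels of a 16 TFlops PACS-G chip.
# On-chip SRAM: 16 GB/s per PE (slide 34) against 4 GFlops per PE.
onchip_bf = 16e9 / 4e9
# Module memory: 1 TB/s per chip (slide 30) against 16 TFlops per chip;
# this 1/16 value is an inference, not a figure stated on the slide.
module_bf = 1e12 / 16e12

print(onchip_bf)           # 4.0 B/F, as quoted
print(module_bf)           # 0.0625 B/F
```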
36 Current status and plan. We are now working on performance estimation through the co-design process: 2012 (done): QCD, N-body, MD, HMD (RS-DFT ongoing); 2013: earthquake simulation, NICAM (climate), FMO (chemistry). Also: developing simulators (clock-level/instruction-level) for more precise and quantitative performance evaluation; compiler development (XMP and OpenACC); (re-)design and investigation of the network topology (is a 2-D mesh sufficient, or is there a better alternative?); code development for apps using host and accelerator, including I/O; precise and more detailed estimation of power consumption.
37 Advanced issues. Is a simple 4K-way SIMD on a chip enough? The PACS-G accelerator is a pure throughput-core solution: the algorithm must sustain very large occupancy (effective cores must not be masked); in some applications (e.g. MD) we need more fine-grained flexibility rather than one big SIMD; possibly we add some number of individual general cores for function-level parallelism. How can 3-D stacked memory be utilized effectively? Currently it is used only for temporary buffering and checkpointing; the inter-accelerator network directly accesses 3-D stacked memory for coarse-grained communication & buffering. What kind of general CPU should be attached?
38 Summary. The next-generation exa-scale (Flops) system is a challenge in power consumption (Flops/W). The shortest way is to reduce power by using minimum dynamism to control a large number of FP units. An on-chip SRAM solution is the only way to provide data throughput for thousands of cores, which leads to very small capacity per core. Strong scaling is essential, and direct interconnection between accelerators is necessary, within a chip and across chips. Co-design is quite important for such an aggressive architecture with various limitations. Accelerators are not magic, but they are needed for various applications (though they do not cover all).
More informationDesigning High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters
Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar
More informationSolutions for Scalable HPC
Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End
More informationImplicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC
Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma
More informationPreparing GPU-Accelerated Applications for the Summit Supercomputer
Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership
More informationAccelerated Computing Jun Makino Interactive Research Center of Science Tokyo Institute of Technology
Accelerated Computing Jun Makino Interactive Research Center of Science Tokyo Institute of Technology IEEE Cluster 2011 Sept 28, 2011 Satoshi s three questions 1. What does accelerated computing solve
More informationNVIDIA Update and Directions on GPU Acceleration for Earth System Models
NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,
More informationOverview of Supercomputer Systems. Supercomputing Division Information Technology Center The University of Tokyo
Overview of Supercomputer Systems Supercomputing Division Information Technology Center The University of Tokyo Supercomputers at ITC, U. of Tokyo Oakleaf-fx (Fujitsu PRIMEHPC FX10) Total Peak performance
More information7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT
7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT Draft Printed for SECO Murex S.A.S 2012 all rights reserved Murex Analytics Only global vendor of trading, risk management and processing systems focusing also
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationArchitectures for Scalable Media Object Search
Architectures for Scalable Media Object Search Dennis Sng Deputy Director & Principal Scientist NVIDIA GPU Technology Workshop 10 July 2014 ROSE LAB OVERVIEW 2 Large Database of Media Objects Next- Generation
More informationInterconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp
Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary
More informationOverview of Supercomputer Systems. Supercomputing Division Information Technology Center The University of Tokyo
Overview of Supercomputer Systems Supercomputing Division Information Technology Center The University of Tokyo Supercomputers at ITC, U. of Tokyo Oakleaf-fx (Fujitsu PRIMEHPC FX10) Total Peak performance
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all
More informationAn Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationUpdate of Post-K Development Yutaka Ishikawa RIKEN AICS
Update of Post-K Development Yutaka Ishikawa RIKEN AICS 11:20AM 11:40AM, 2 nd of November, 2017 FLAGSHIP2020 Project Missions Building the Japanese national flagship supercomputer, post K, and Developing
More informationTimothy Lanfear, NVIDIA HPC
GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision
More informationCRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar
CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION
More informationReal Application Performance and Beyond
Real Application Performance and Beyond Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400 Fax: 408-970-3403 http://www.mellanox.com Scientists, engineers and analysts
More informationVoltaire Making Applications Run Faster
Voltaire Making Applications Run Faster Asaf Somekh Director, Marketing Voltaire, Inc. Agenda HPC Trends InfiniBand Voltaire Grid Backbone Deployment examples About Voltaire HPC Trends Clusters are the
More informationPerformance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms
Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationTECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016
TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016 MULTI GPU PROGRAMMING Node 0 Node 1 Node N-1 MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM
More informationAn Introduction to OpenACC
An Introduction to OpenACC Alistair Hart Cray Exascale Research Initiative Europe 3 Timetable Day 1: Wednesday 29th August 2012 13:00 Welcome and overview 13:15 Session 1: An Introduction to OpenACC 13:15
More informationPost-Petascale Computing. Mitsuhisa Sato
Challenges on Programming Models and Languages for Post-Petascale Computing -- from Japanese NGS project "The K computer" to Exascale computing -- Mitsuhisa Sato Center for Computational Sciences (CCS),
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationFPGA-based Supercomputing: New Opportunities and Challenges
FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:
More informationExploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters
Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth
More informationIdentifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning
Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationCray XC Scalability and the Aries Network Tony Ford
Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?
More informationExperiences of the Development of the Supercomputers
Experiences of the Development of the Supercomputers - Earth Simulator and K Computer YOKOKAWA, Mitsuo Kobe University/RIKEN AICS Application Oriented Systems Developed in Japan No.1 systems in TOP500
More informationFUJITSU HPC and the Development of the Post-K Supercomputer
FUJITSU HPC and the Development of the Post-K Supercomputer Toshiyuki Shimizu Vice President, System Development Division, Next Generation Technical Computing Unit 0 November 16 th, 2016 Post-K is currently
More informationThe Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011
The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities
More informationPerformance Accelerated Mellanox InfiniBand Adapters Provide Advanced Data Center Performance, Efficiency and Scalability
Performance Accelerated Mellanox InfiniBand Adapters Provide Advanced Data Center Performance, Efficiency and Scalability Mellanox InfiniBand Host Channel Adapters (HCA) enable the highest data center
More informationOverview of Reedbush-U How to Login
Overview of Reedbush-U How to Login Information Technology Center The University of Tokyo http://www.cc.u-tokyo.ac.jp/ Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle FY 08 09 10 11 12 13 14 15
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationSystem Design of Kepler Based HPC Solutions. Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering.
System Design of Kepler Based HPC Solutions Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering. Introduction The System Level View K20 GPU is a powerful parallel processor! K20 has
More informationCoupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications
Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationResults from TSUBAME3.0 A 47 AI- PFLOPS System for HPC & AI Convergence
Results from TSUBAME3.0 A 47 AI- PFLOPS System for HPC & AI Convergence Jens Domke Research Staff at MATSUOKA Laboratory GSIC, Tokyo Institute of Technology, Japan Omni-Path User Group 2017/11/14 Denver,
More information