Accelerated Computing Unified with Communication Towards Exascale


1 Accelerated Computing Unified with Communication Towards Exascale. Taisuke Boku, Deputy Director, Center for Computational Sciences / Faculty of Systems and Information Engineering, University of Tsukuba.

2 CCS at University of Tsukuba. The Center for Computational Sciences was originally established as the Center for Computational Physics and was reorganized as the Center for Computational Sciences in 2004. It hosts daily collaborative research between two kinds of researchers (about 30 in total): computational scientists who have NEEDS (applications) and computer scientists who have SEEDS (systems & solutions). (IHPC Forum, Changsha, 2013/05/28)

3 Current HPC and Supercomputer Research Programs in Japan

4 Post-peta to exascale computing formation in Japan. The MEXT Council on HPCI Plan and Promotion and the Council for Science and Technology Policy (CSTP) oversee the effort. The HPCI Consortium, established in April 2012 by users and supercomputer centers (resource providers) with working groups for applications and for systems, will play an important role in future HPC R&D. JST (Japan Science and Technology Agency) runs the Basic Research Program CREST "Development of System Software Technologies for post-peta Scale High Performance Computing". A white paper on the strategic direction and development of HPC in Japan came out of the SDHPC (Strategic Development of HPC) workshops, organized by Univ. of Tsukuba, Univ. of Tokyo, Tokyo Institute of Technology, Kyoto Univ., the RIKEN supercomputer center and AICS, AIST, JST, and Tohoku Univ., and fed into the Feasibility Study of Advanced High Performance Computing. RIKEN (AICS) submits the proposal to MEXT for exascale, which is evaluated by CSTP and budgeted by the Ministry of Finance.

5 The only information about architecture in the RIKEN proposal

6 Mid-term plan on HPC research in Japan (slide courtesy of Y. Ishikawa, U. Tokyo). Basic Research Programs: CREST "Development of System Software Technologies for post-peta Scale High Performance Computing"; a white paper for the strategic direction/development of HPC in Japan; and the Feasibility Study of Advanced High Performance Computing (budgets noted on the slide: US$23M, US$17M, US$15M, and US$10M for the feasibility study). Three feasibility-study teams will run, and whether or not the national R&D project starts depends on the results of those studies. R&D of advanced HPC deployments: 2011Q4: HA-PACS, 0.8 PF, Univ. of Tsukuba (to be upgraded to 1.2 PF in 2013Q3); 2012: BG/Q, 1.2 PF, KEK; 2012Q1: <1 PF, Univ. of Tokyo; 2012Q2: 0.7 PF, Kyoto U.; 2015: 30 PF? (Tokyo & Tsukuba).

7 Projects in FS: four projects (three system-side and one application-side) were selected based on proposals in July 2012.
- General-purpose (base) system: leading PI Y. Ishikawa (Univ. of Tokyo); members: U. Tokyo, Kyushu U., Fujitsu, Hitachi, NEC
- Accelerated computing: leading PI M. Sato (Univ. of Tsukuba); members: U. Tsukuba, TITECH, Aizu U., AICS, Hitachi
- Vector processing: leading PI H. Kobayashi (Tohoku Univ.); members: Tohoku U., JAMSTEC, NEC
- Mini applications: leading PIs H. Tomita (AICS, RIKEN) & S. Matsuoka (TITECH); members: AICS, TITECH

8 Two activities at U. Tsukuba. HA-PACS + JST-CREST: developing a platform for a new concept of accelerated computing, hardware and software development for a testbed of the TCA (Tightly Coupled Accelerators) architecture, and new algorithms and codes based on this concept. Feasibility Study: a proposal for an accelerator-based exascale computing system in which the interconnect is embedded with computation at the instruction level, an ultimate style of highly computable system design.

9 Key Issues on Next Generation's Accelerated Computing

10 Accelerated computing in the next generation. What do accelerators contribute? Flops: concentrating (almost) all the power just on computation. Simplicity: reducing control complexity; everything is data parallel. Vector: another style of vector computing (simple data parallelism). Energy: very high performance/power. We still need roughly a 20x improvement in performance/power to achieve 1 EFlops within 20 MW, and accelerated computing is one of the shortest paths. It is good for computation and for horizontal access to the memory devices. However...

11 Issues on accelerated computing in the near future. Limited memory capacity: current GPU = 6 GB : 1.3 TFlops = 1 : 200 -> co-design for saving memory capacity. Limited dynamism of computation: on a GPU, warp splitting (divergence) causes heavy performance degradation -> it depends on the application/algorithm, but it must be solved at the application/algorithm level -> co-design for effective vector features. Limited capacity of fast storage (registers): the number of cores is large, but each core is very small -> co-design for loop-level parallelism. We need a large-scale parallel system even when it is based on accelerators; performance, bandwidth, and capacity are covered by the way of co-design.

12 Issues on accelerated computing in the future (cont'd). Trade-off: power vs. dynamism (flexibility). Fine-grained individual cores consume much power, so an ultra-wide SIMD feature is needed to exploit maximum Flops. Interconnection is a serious issue: current accelerators are hosted by a general CPU, so the system is not stand-alone; they are connected to the CPU by an interface bus and only then to the interconnect; they communicate through a network interface attached to the host CPU. Latency is essential (not just bandwidth): with the memory-capacity problem, strong scaling is required to solve the target problems; weak scaling does not work in some cases because of time-to-solution limits; and in many algorithms a reduction of just a scalar value over millions of nodes is required. Accelerators must therefore be tightly coupled with each other, meaning they should be equipped with a communication facility of their own.
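
To make the latency argument concrete, here is a minimal sketch (mine, not from the slide) of the kind of operation the last point refers to: an all-reduce of a single scalar, such as a residual norm, across every rank. The message is only 8 bytes, so its cost is dominated by latency and reduction-tree depth, not bandwidth, and neither weak scaling nor extra bandwidth helps.

    #include <mpi.h>
    #include <math.h>

    /* Global convergence check: reduce one scalar over all ranks.
     * With millions of nodes this 8-byte collective is pure latency,
     * which is why tightly coupled accelerator communication matters. */
    double global_residual(double local_sq_sum, MPI_Comm comm)
    {
        double total;
        MPI_Allreduce(&local_sq_sum, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
        return sqrt(total);
    }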

13 Research on Current Accelerators: TCA (Tightly Coupled Accelerators) (update from CO-DESIGN 2012) 13

14 TCA (Tightly Coupled Accelerators) architecture: true GPU-direct communication. Current GPU clusters require 3-hop communication (3-5 memory copies): GPU memory to host memory, across the InfiniBand network, and from the remote host memory into the remote GPU memory. For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput. PEACH2 is an enhanced version of PEACH: PCIe x4 lanes -> x8 lanes, with the main data path and the PCIe interface fabric hardwired. (Figure: the conventional path through CPU memory, the IB HCAs, and the IB switch, versus the direct PEACH2 path between the GPUs of neighboring nodes.)
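
For context, the sketch below (mine, not from the slides) shows what the conventional multi-hop path typically looks like with plain MPI plus the CUDA runtime, staging through host memory on both sides; buffer names and sizes are illustrative. TCA/PEACH2 aims to replace this whole sequence with a single direct transfer between GPU memories over PCIe.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Conventional GPU-to-GPU exchange staged through host memory:
     * device-to-host copy, MPI message over the network, host-to-device copy. */
    void staged_gpu_exchange(float *d_send, float *d_recv, size_t n,
                             int peer, MPI_Comm comm)
    {
        float *h_send = malloc(n * sizeof(float));
        float *h_recv = malloc(n * sizeof(float));

        /* hop 1: GPU memory -> host memory */
        cudaMemcpy(h_send, d_send, n * sizeof(float), cudaMemcpyDeviceToHost);

        /* hop 2: host memory -> remote host memory via the IB network */
        MPI_Sendrecv(h_send, n, MPI_FLOAT, peer, 0,
                     h_recv, n, MPI_FLOAT, peer, 0,
                     comm, MPI_STATUS_IGNORE);

        /* hop 3: remote host memory -> remote GPU memory */
        cudaMemcpy(d_recv, h_recv, n * sizeof(float), cudaMemcpyHostToDevice);

        free(h_send);
        free(h_recv);
    }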

15 TCA testbed node structure. The CPUs can uniformly access the GPUs, and PEACH2 can access every GPU. The Kepler architecture plus CUDA 5.0 provide GPUDirect support for RDMA, but performance across QPI is quite bad, so direct access is supported only between the two GPUs on the same socket. PEACH2 connects among 3 nodes. This configuration is similar to the HA-PACS base cluster except for PEACH2, and all 80 PCIe lanes provided by the two CPUs (Xeon E5 v2) are used. (Figure: two Xeon E5 v2 CPUs linked by QPI, each hosting two K20X GPUs on PCIe Gen2 x16, with PEACH2 attached via Gen2 x8 links and an InfiniBand HCA on Gen3 x8.)

16 PEACH2 board: a PCI Express Gen2 x8 peripheral board, compatible with the PCIe specification. (Photos: side view and top view.)

17 PEACH2 board: main board + sub-board. FPGA (Altera Stratix IV 530GX), PCI Express x8 card edge, DDR3 SDRAM, power supplies for the various voltages, one PCIe x16 cable connector, and one PCIe x8 cable connector. Most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz).

18 HA-PACS/TCA computation node. CPUs: two Ivy Bridge sockets (Xeon E5 v2), each with 4 memory channels at 1,866 MHz giving 59.7 GB/s; AVX at 2.8 GHz x 8 flops/clock = 22.4 GFlops per core, x 20 cores = 448.0 GFlops; memory (16 GB, 14.9 GB/s) x 8 = 128 GB, 119.4 GB/s; plus legacy devices. GPUs: 4 x NVIDIA K20X on PCIe Gen2 x16, 1.31 TFlops x 4 = 5.24 TFlops; (6 GB, 250 GB/s) x 4 = 24 GB, 1 TB/s. PEACH2 board (TCA interconnect) attached via PCIe Gen2 x8 links at 8 GB/s. Node total: 5.69 TFlops.

19 HA-PACS base cluster + TCA (the TCA part starts operation on Nov. 1st, 2013). HA-PACS base cluster: 2.99 TFlops x 268 nodes = 802 TFlops. HA-PACS/TCA: 5.69 TFlops x 64 nodes = 364 TFlops. Total: 1.166 PFlops.

20 HA-PACS/TCA computation node inside 20

21 Comparison with the traditional method: user-level remote GPU memory copy. Conditions: CUDA intra-node cudaMemcpy() with UVA (Unified Virtual Addressing) and without UVA; MVAPICH2 1.9b with MV2_USE_CUDA enabled over InfiniBand FDR10 (latency might be reduced to about 7 us at minimum); PEACH2 in GPU-to-GPU and GPU-to-CPU-to-GPU modes. PEACH2 shows better latency and bandwidth than MVAPICH2 up to 4 KB messages, and is even faster than intra-node CUDA memory copy. (Figure: latency and bandwidth vs. data size for CUDA (no-UVA), CUDA (UVA), MVAPICH2, PEACH2 (GPU-GPU), and PEACH2 (GPU->CPU->GPU).)
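
As background on how such latency numbers are typically measured (this sketch is mine, not from the slides), a minimal ping-pong loop over MPI looks roughly as follows; NITER and the message buffer are illustrative, and with a CUDA-aware MPI such as MVAPICH2 the buffer could be a device pointer instead of host memory.

    #include <mpi.h>
    #include <stdio.h>

    #define NITER 1000

    /* Minimal ping-pong between ranks 0 and 1: half the averaged
     * round-trip time gives the one-way latency for this message size. */
    int main(int argc, char **argv)
    {
        int rank;
        char buf[8];                      /* small message, latency-bound */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < NITER; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) / NITER / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }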

22 Ping-pong with up to 3 hops in an 8-node cluster (hop count from 0, i.e., direct, to 3). Each additional hop adds about 200-300 ns of latency, and bandwidth is maintained for any hop count. (Figure: latency and bandwidth vs. data size for PIO and DMA transfers over 0-3 hops, GPU-to-GPU transfers, and MVAPICH2 GPU transfers for reference.)

23 Simple example of stencil computation: the yellow-colored surfaces of each subdomain must be communicated; the data are contiguous only in the k-dimension, while the other directions are strided or block-strided, and the communication pattern is fixed across the whole time-development loop. The example uses a 3-D decomposition of 1x2x2 (4 nodes). PEACH2 supports block-stride transfers and RDMA chaining in hardware.
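
The following is a minimal sketch (mine, not from the slides) of what such a block-strided halo exchange looks like when expressed with standard MPI derived datatypes; the array dimensions and neighbor ranks are illustrative. PEACH2 would describe the same strided pattern once as a chain of DMA descriptors (the hardware RDMA chaining mentioned above) and simply re-launch it every time step.

    #include <mpi.h>

    #define NI 8
    #define NJ 8
    #define NK 8

    /* Exchange one j-plane halo of an NI x NJ x NK block (contiguous in k).
     * A face at fixed j is strided in memory, so it is described once as a
     * derived datatype and reused every time step: a block-stride transfer. */
    void exchange_j_face(double grid[NI][NJ][NK], int up, int down, MPI_Comm comm)
    {
        MPI_Datatype j_face;

        /* NI blocks of NK contiguous doubles, separated by NJ*NK elements:
         * exactly the j = const face of the 3-D block. */
        MPI_Type_vector(NI, NK, NJ * NK, MPI_DOUBLE, &j_face);
        MPI_Type_commit(&j_face);

        /* send our last interior j-plane up, receive the halo plane from below */
        MPI_Sendrecv(&grid[0][NJ - 2][0], 1, j_face, up, 0,
                     &grid[0][0][0],      1, j_face, down, 0,
                     comm, MPI_STATUS_IGNORE);

        MPI_Type_free(&j_face);
    }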

24 Evaluation on a small problem (Himeno benchmark), a typical strong-scaling situation: TCA gives about a 44% performance improvement over MPI (MVAPICH2) on the small problem size. (Figure: speedup of MPI (MV2) vs. TCA for several node decompositions.)

25 Univ. of Tsukuba's Feasibility Study Proposal based on Extreme Accelerated Computing

26 Project organization. This is a joint project with TITECH (Makino), Aizu U. (Nakazato), RIKEN (Taiji), U. Tokyo, KEK, Hiroshima U., and Hitachi as the supercomputer vendor. Target applications: QCD in particle physics, tree N-body and HMD in astrophysics, MD in life science, FDM earthquake simulation, FMO in chemistry, RS-DFT in materials science, and NICAM in climate science. Application study (U. Tsukuba, RIKEN, U. Tokyo, KEK, Hiroshima U.): global climate science (Tanaka, Yashiro), earth science (Okamoto), life science (Taiji, Umeda), nanomaterial science (Oshiyama), astrophysics (Umemura, Yoshikawa), particle physics (Kuramashi, Ishikawa, Matsufuru). Accelerator and network system design (TITECH, U. Tsukuba, Aizu U.): basic accelerator architecture (Makino), processor core architecture (Nakazato), accelerator network (Boku). Programming model and simulation tools (U. Tsukuba): programming model and API (Sato), simulator and evaluation tools (Kodama), with simulation results fed back to the design. Study on implementation and power (Hitachi). (FS midterm report by Sato)

27 Between power-effective computing and dynamism of computation. (Figure: a design-space chart with the number of independent cores on one axis and SIMD width (operations per instruction) on the other, with constant-power and constant-Flops contours. Many-core CPUs sit at a large number of independent cores with modest SIMD width, MIC and GPU push SIMD width further, and our solution, an extreme-SIMD design in the spirit of GRAPE-DR, pushes SIMD width to the maximum with few independent cores.)

28 Basic concept of the extreme SIMD accelerator. Fully concentrated on Flops (compute-oriented and reduced memory): a large number of SIMD cores (500-1K per chip) with very low dynamism, very limited memory capacity per core -> on-chip SRAM, and high density of the accelerating chip. Simple and small processing elements with some number of registers and on-chip SRAM, tightly coupled by an on-chip interconnection network. Interconnection network: 2-D torus on chip (up to 4K cores), 2-D torus on board (~16 chips), and an n-D torus network among boards, with a special feature for broadcast/reduction trees. The chips are also tightly connected by an inter-chip network that can be operated directly from the controller. Importance of co-design: deciding the memory capacity and the number of cores per chip (a trade-off), and the network bandwidth and number of ports, based on the target applications and algorithms.

29 Global design of our proposal (FS midterm report by Sato). Two keys for exascale computing: power and strong scaling. We study exascale heterogeneous systems with many-core accelerators. We are interested in: the architecture of the accelerators (core and memory architecture, special-purpose functions, direct connection between accelerators within a group), power estimation and evaluation, the programming model and computational science applications, and the requirements for the general-purpose system, etc. (Figure: each node consists of a general-purpose processor with memory and accelerators, each accelerator having its own controller; nodes form groups connected by an accelerator network, and groups connect through the system network to storage; an example accelerator chip is an array of cores, each with its own memory.)

30 PACS-G: a straw man architecture (FS midterm report by Sato). SIMD architecture for compute-oriented apps (N-body, MD) and stencil apps. 4096 cores (64x64), 2 FMA @ 1 GHz, 4 GFlops x 4096 = 16 TFlops/chip. 2-D mesh (+ broadcast/reduction) on-chip network for stencil apps. We expect 14 nm technology in the target time frame; chip die size 20 mm x 20 mm. Computation works mainly on on-chip memory (512 MB/chip, 128 KB/core), backed by module memory built from 3D-stacked / Wide-IO DRAM (via 2.5D TSV) with 1 TB/s bandwidth and 16-32 GB per chip. About 64K chips are expected for 1 EFlops (at peak), with the power per chip estimated accordingly. Chips are connected by an inter-accelerator direct network. (Figure: PACS-G chip block diagram: host CPU and controller, instruction & data memory, broadcast (BC) memories and communication buffers feeding the PE array, a result-reduction network, and 3D-stacked or Wide-IO DRAM.)

31 PACS-G: a straw man architecture (FS midterm report by Sato), continued. A group of chips is connected via the accelerator network (inter-chip network), with 25-50 Gbps per inter-chip link: if we extend the on-chip 2-D mesh to the external (2-D mesh) network within a group, the required off-chip bandwidth is 32 channels x the per-channel rate x 2 (bi-directional). For 50 Gbps data transfer we may need direct optical interconnect from the chip. I/O interface to the host: PCI Express Gen4 x16 (not enough!). Programming model: XcalableMP + OpenACC. OpenACC is used to specify the offloaded fragments of code and the data movement, and to align data and computation to cores we use the XcalableMP "template" concept (a virtual index space), from which code can be generated for each core (together with a data-parallel language like C*). (Figure: an example 1U implementation: accelerator LSIs and memory modules (HMC / Wide-IO DRAM) on a silicon interposer, optical interconnect modules, PSUs, board connectors to other accelerator units, and a 2-D mesh interconnect between chips.)
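
To illustrate the intended programming style, here is a rough sketch (mine, not from the slides; the directive details are only an approximation of what XcalableMP and OpenACC accept, and N, the node count, and the array names are assumptions): the index space is distributed over nodes with an XMP template, and each node's local iterations are offloaded with OpenACC.

    #define N 1024

    /* Illustrative only: XMP distributes the index space over nodes,
     * OpenACC offloads each node's local portion to its accelerator. */
    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p

    double a[N], b[N];
    #pragma xmp align a[i] with t(i)
    #pragma xmp align b[i] with t(i)

    int main(void)
    {
    #pragma xmp loop on t(i)
        for (int i = 0; i < N; i++)
            b[i] = (double)i;

        /* XMP maps iterations to nodes; OpenACC maps the node-local
         * iteration space onto that node's accelerator cores. */
    #pragma xmp loop on t(i)
    #pragma acc parallel loop copy(a, b)
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        return 0;
    }

A real stencil kernel would additionally need halo (shadow) declarations and exchanges on the distributed arrays; the point here is only the division of labor, with XMP handling inter-node distribution and OpenACC handling offloading within a node.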

32 Performance evaluation by simulation: FDTD (finite-difference time domain), Himeno (fluid dynamics), matrix multiplication, N-body, and RS-DFT (real-space density functional theory), the 2011 Gordon Bell Prize application on the K computer.

33 FDTD problem mapping: earthquake wave propagation simulation (also used in the application FS). 3-D spatial difference scheme (order 2 in time, order 4 in space). In every time step the velocity vectors and the stress tensor are calculated and updated, and each grid point holds 12 single-precision variables. Each PE takes an n x n x n block of grid points with a stencil shadow (halo) of width 2, so for n = 8 about 83 KB of memory per PE is required ((8+4)^3 points x 12 variables x 4 bytes). With 4096 PEs on a chip, a 128x128x128 grid is available for computation.

34 FDTD estimated performance. FP operation counts and memory references per grid point: velocity update, 67 FP ops + 73 memory refs; stress tensor update, 81 FP ops + 82 memory refs. Both are bottlenecked by memory references: with each PE's local memory bandwidth of 16 GB/s, computation over all grid points takes 19.8 us. Shadow data exchange for the 3-D stencil uses the tightly coupled communication channels: the shadow area carries 6 variables for velocity and 3 for the stress tensor (with a cyclic boundary condition), and each PE transfers 8x8x2 grid blocks to its 6 neighboring PEs. Assuming 2 GB/s for inter-PE communication with the 3-D domain mapped onto the 2-D network, the neighbor shadow exchange takes 13.8 us and the cyclic-boundary exchange takes 18.4 us. Total computation + communication time: 52.0 us, i.e., 1.5 GFlops/PE and about 6 TFlops/chip (37% of peak).
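
As a sanity check on these figures, the sketch below (mine, not from the slides) reproduces the estimate from the stated assumptions: an 8x8x8 block per PE, 73+82 memory references and 67+81 FP operations per grid point, 16 GB/s local memory bandwidth, and 2 GB/s inter-PE links; the cyclic-boundary cost is taken directly from the slide rather than modeled.

    #include <stdio.h>

    /* Back-of-envelope performance model for the PACS-G FDTD estimate. */
    int main(void)
    {
        const double pts_per_pe = 8.0 * 8.0 * 8.0;  /* n = 8 block per PE    */
        const double mem_refs   = 73.0 + 82.0;      /* per grid point        */
        const double flops      = 67.0 + 81.0;      /* per grid point        */
        const double mem_bw     = 16e9;             /* B/s, local SRAM       */
        const double link_bw    = 2e9;              /* B/s, inter-PE         */

        /* memory-bound compute time: every reference is 4 bytes */
        double t_comp = pts_per_pe * mem_refs * 4.0 / mem_bw;        /* ~19.8 us */

        /* shadow exchange: 8x8x2 points x 9 variables x 4 bytes to 6 neighbors */
        double t_halo = 6.0 * (8.0 * 8.0 * 2.0) * 9.0 * 4.0 / link_bw; /* ~13.8 us */

        double t_cyclic = 18.4e-6;  /* cyclic-boundary exchange, from the slide */

        double t_total = t_comp + t_halo + t_cyclic;                 /* ~52 us */
        double gflops_per_pe = pts_per_pe * flops / t_total / 1e9;   /* ~1.5   */

        printf("compute %.1f us, halo %.1f us, total %.1f us, %.2f GFlops/PE\n",
               t_comp * 1e6, t_halo * 1e6, t_total * 1e6, gflops_per_pe);
        return 0;
    }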

35 Application and benchmark evaluation. FDTD: 5.2 TFlops (single precision). Himeno benchmark (middle size, 128x128x256): 7.4 TFlops (single precision). Matrix-matrix multiply kernel: 14.9 TFlops (double precision). Large-scale application: RS-DFT for 100,000 atoms (the same case as the Gordon Bell Prize run on the K computer). Strong scaling up to 4096 chips (= 64 PFlops) can maintain 35% efficiency, the same efficiency as the K computer, which is equivalent to having six entire K computer systems. For a group of 4096 chips (= 64 PFlops): if the data fit in on-chip memory, the ratio is 4 B/F with a total memory size of 2 TB; if the data fit in module memory, the ratio is roughly 0.06 B/F (1 TB/s of module bandwidth against 16 TFlops per chip) with a total memory size of 64 TB. Co-design is quite important for such an aggressive architecture with its various limitations: co-design decides what we gain in exchange for what we give up.

36 Current status and plan. We are now working on performance estimation through the co-design process: 2012 (done): QCD, N-body, MD, HMD (RS-DFT ongoing); 2013: earthquake simulation, NICAM (climate), FMO (chemistry). We are also developing clock-level/instruction-level simulators for more precise and quantitative performance evaluation, developing the compilers (XMP and OpenACC), (re-)designing and investigating the network topology (is a 2-D mesh sufficient, or is some alternative needed?), developing application codes that use both the host and the accelerator, including I/O, and making more precise and detailed estimates of power consumption.

37 Advanced issues. Is a simple 4K-core SIMD array on a chip enough? The PACS-G accelerator is a pure throughput-core solution, so algorithms must sustain very high occupancy (most cores not masked out); in some applications (e.g., MD) we need finer-grained flexibility rather than one big SIMD array, so we may add some number of individual general-purpose cores for function-level parallelism. How do we utilize 3-D stacked memory effectively? Currently it serves only as a temporary buffer and for checkpointing; the inter-accelerator network could access the 3-D stacked memory directly for coarse-grained communication and buffering. And what kind of general-purpose CPU should be attached?

38 Summary. The next-generation exascale (Flops) system is a challenge in power consumption (Flops/W). The shortest way is to reduce power by using minimum dynamism to control a large number of FP units. An on-chip SRAM solution is the only way to provide enough data throughput for thousands of cores, which leads to very small capacity per core. Strong scaling is essential, and direct interconnection between accelerators is necessary, both within a chip and across chips. Co-design is quite important for such an aggressive architecture with its various limitations. Accelerators are not magic, but they are needed for a wide range of applications (though they do not cover all of them).
