Accelerated Computing Unified with Communication Towards Exascale
1 Accelerated Computing Unified with Communication Towards Exascale Taisuke Boku Deputy Director, Center for Computational Sciences / Faculty of Systems and Information Engineering University of Tsukuba 1
2 CCS at University of Tsukuba: Center for Computational Sciences. Established as the former Center for Computational Physics; reorganized as the Center for Computational Sciences in 2004. Daily collaborative research between two kinds of researchers (about 30 in total): computational scientists who have NEEDS (applications), and computer scientists who have SEEDS (systems & solutions). IHPC Forum, Changsha, 2013/05/28
3 Current HPC and Supercomputer Research Program in Japan
4 Post-peta to exascale computing formation in Japan. MEXT Council on HPCI Plan and Promotion; Council for Science and Technology Policy (CSTP). HPCI Consortium established in April 2012: users and supercomputer centers (resource providers); the consortium will play an important role in future HPC R&D (WG for applications, WG for systems). JST (Japan Science and Technology Agency) Basic Research Programs, CREST: Development of System Software Technologies for post-peta Scale High Performance Computing. White paper for strategic direction/development of HPC in Japan. Feasibility Study of Advanced High Performance Computing: Workshop of SDHPC (Strategic Development of HPC), organized by Univ. of Tsukuba, Univ. of Tokyo, Tokyo Institute of Tech., Kyoto Univ., RIKEN supercomputer center and AICS, AIST, JST, Tohoku Univ. RIKEN (AICS) proposal to MEXT for exascale; evaluation by CSTP; budget approval by the Ministry of Finance.
5 The only information about the architecture in the RIKEN proposal
6 (slide courtesy of Y. Ishikawa, U. Tokyo) Mid-term plan on HPC research in Japan. Basic Research Programs, CREST: Development of System Software Technologies for post-peta Scale High Performance Computing. White paper for strategic direction/development of HPC in Japan (US$23M, US$17M, US$15M). Feasibility Study of Advanced High Performance Computing (US$10M): three teams will run; whether or not the national R&D project starts depends on the results of those feasibility studies. R&D of Advanced HPC. Deployments: 2011Q4: HA-PACS, 0.8PF, Univ. Tsukuba -> to be upgraded to 1.2PF (2013Q3); 2012: BG/Q, 1.2PF, KEK; 2012Q1: <1PF, Univ. Tokyo; 2012Q2: 0.7PF, Kyoto-U; 2015: 30PF? (Tokyo & Tsukuba)
7 Projects in FS: four projects (3 system-side + 1 application-side) were selected based on proposals in July 2012.
- General-purpose (base) system; leading PI: Y. Ishikawa (Univ. of Tokyo); members: U. Tokyo, Kyushu U., Fujitsu, Hitachi, NEC
- Accelerated computing; leading PI: M. Sato (Univ. of Tsukuba); members: U. Tsukuba, TITECH, Aizu U., AICS, Hitachi
- Vector processing; leading PI: H. Kobayashi (Tohoku Univ.); members: Tohoku U., JAMSTEC, NEC
- Mini applications; leading PIs: H. Tomita (AICS, RIKEN) & S. Matsuoka (TITECH); members: AICS, TITECH
8 Two activities in U. Tsukuba. HA-PACS + JST-CREST: developing a platform for a new concept of accelerated computing; hardware and software development for a testbed on the TCA (Tightly Coupled Accelerators) architecture; new algorithms and code based on this concept. Feasibility Study: a proposal for an accelerator-based exa-scale computing system, where interconnection is embedded at the instruction level with computation; the ultimate style of highly computable system design.
9 Key Issues on Next Generation's Accelerated Computing
10 Accelerated computing in the next generation. What do accelerators contribute? Flops: concentrating (almost) all the power just on computation. Simplicity: reducing control complexity; everything is data parallel. Vector: another style of vector computing (simple data parallelism). Energy: very high performance/power. We still need a x20 improvement in the performance/power ratio to achieve 1 EFlops at 20 MW, and accelerated computing is one of the shortest paths. It's good for computation and for horizontal access to the memory device. However...
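The x20 figure above follows from simple arithmetic on the stated target; a quick sketch (the ~2.5 GFlops/W baseline is an inference from the quoted gap, not a figure stated on the slide):

```python
# Power-efficiency target implied by "1 EFlops at 20 MW" (arithmetic sketch;
# the baseline value is inferred from the quoted x20 gap).
target_flops = 1e18                      # 1 EFlops
power_budget_w = 20e6                    # 20 MW

target_gf_per_w = target_flops / power_budget_w / 1e9
implied_baseline = target_gf_per_w / 20  # the "x20" starting point

print(target_gf_per_w)                   # 50.0 GFlops/W target
print(implied_baseline)                  # ~2.5 GFlops/W implied baseline
```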
11 Issues on accelerated computing in the near future. Limited memory capacity: current GPU = 6 GB : 1.3 TFlops = 1 : 200 -> co-design for memory capacity saving. Limited dynamism in computation: on a GPU, warp splitting causes heavy performance degradation; this depends on the application/algorithm, but it should be solved by the application/algorithm -> co-design for effective vector features. Limited capacity of fast storage (registers): the number of cores is large, but each core is very small -> co-design for loop-level parallelism. We need a large-scale parallel system even when based on accelerators; performance, bandwidth and capacity are covered by way of co-design.
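The 1:200 capacity ratio quoted above is bytes of memory capacity per flop/s; a minimal check:

```python
# Bytes-per-flop for a Kepler-class GPU (6 GB, 1.3 TFlops, as on the slide).
mem_bytes = 6e9
flops = 1.3e12

ratio = flops / mem_bytes   # flops per byte of capacity
print(round(ratio))         # ~217, i.e. roughly the quoted 1:200
```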
12 Issues on accelerated computing in the future (cont'd). Trade-off: power vs. dynamism (flexibility): fine-grained individual cores consume much power, so an ultra-wide SIMD feature is needed to exploit maximum Flops. Interconnection is a serious issue: current accelerators are hosted by a general CPU, and the system is not stand-alone; current accelerators are connected to the CPU by an interface bus, and only then to the interconnect; current accelerators communicate through a network interface attached to the host CPU. Latency is essential (not just bandwidth): given the memory capacity problem, strong scaling is required to solve the problems; weak scaling doesn't work in some cases because of time-to-solution limits; in many algorithms, reduction of just a scalar value over millions of nodes is required. Accelerators must be tightly coupled with each other, meaning they should be equipped with a communication facility of their own.
13 Research on Current Accelerators: TCA (Tightly Coupled Accelerators) (update from CO-DESIGN 2012) 13
14 TCA (Tightly Coupled Accelerators) Architecture. True GPU-direct: current GPU clusters require 3-hop communication (3-5 memory copies). For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput. Enhanced version of PEACH: PEACH2, x4 lanes -> x8 lanes, hardwired on the main data path and PCIe interface fabric. (Figure: two nodes, each with CPUs and GPUs on PCIe; the conventional path goes GPU -> CPU memory -> IB HCA -> IB switch, while PEACH2 connects the nodes' PCIe fabrics, and hence the GPUs, directly.)
15 TCA testbed node structure. The CPU can uniformly access all GPUs, and PEACH2 can access every GPU. Kepler architecture + CUDA 5.0: GPUDirect support for RDMA. Performance over QPI is quite bad => support only the two GPUs on the same socket. PEACH2 connects among 3 nodes. This configuration is similar to the HA-PACS base cluster except for PEACH2. All the PCIe lanes (80 lanes) provided by the CPUs are used. (Figure: two Xeon E5 v2 CPUs linked by QPI, four K20X GPUs on PCIe Gen2 x16, PEACH2 on Gen2 x8 links, and an IB HCA on Gen3 x8.)
16 PEACH2 board: PCI Express Gen2 x8 peripheral board, compatible with the PCIe spec. (Photos: side view and top view.)
17 PEACH2 board: main board + sub board. FPGA (Altera Stratix IV 530GX). PCI Express x8 card edge. Most parts operate at 250 MHz (the PCIe Gen2 logic runs at 250 MHz). DDR3 SDRAM; power supply for various voltages; PCIe x16 cable connector; PCIe x8 cable connector.
18 HA-PACS/TCA (computation node). CPUs: 2x Ivy Bridge, AVX (2.8 GHz x 8 flop/clock = 22.4 GFlops/core) x 20 cores = 448.0 GFlops, each socket with 4 channels of 1,866 MHz memory at 59.7 GB/sec; (16 GB, 14.9 GB/s) x8 = 128 GB, 119.2 GB/s. GPUs: 4x NVIDIA K20X on PCIe Gen2 x16, 1.31 TFlops x4 = 5.24 TFlops; (6 GB, 250 GB/s) x4 = 24 GB, 1 TB/s. Node total: 5.69 TFlops. PEACH2 board (TCA interconnect) on PCIe Gen2 x8 (8 GB/s), plus legacy devices.
19 HA-PACS Base Cluster + TCA (the TCA part starts operation on Nov. 1st, 2013). HA-PACS Base Cluster = 2.99 TFlops x 268 nodes = 802 TFlops. HA-PACS/TCA = 5.69 TFlops x 64 nodes = 364 TFlops. TOTAL: 1.166 PFlops.
20 HA-PACS/TCA computation node inside 20
21 Comparison with the traditional method: user-level remote GPU memory copy. Compared methods: CUDA intra-node cudaMemcpy() with (UVA) and without (no-UVA) Unified Virtual Addressing; MVAPICH2 1.9b with MV2_USE_CUDA over InfiniBand FDR10 (maybe reducible to 7 us at minimum?); PEACH2 GPU-to-GPU and GPU->CPU->GPU. PEACH2 shows better performance on both latency and bandwidth up to 4 KB, and is even faster than intra-node CUDA memory copy. (Charts: latency and bandwidth vs. data size for each method.)
22 Pingpong with up to 3 hops in an 8-node cluster. Hop count: 0 (direct) to 3; each hop adds 200-300 ns of latency; bandwidth is maintained for any hop count. (Charts: latency and bandwidth vs. data size for PIO and DMA transfers at 0-3 hops and GPU transfers, with MVAPICH2 GPU transfer for reference.)
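The per-hop cost above suggests a simple linear model, latency = base + hops x per-hop cost; a sketch with an assumed base latency (the 2.0 us direct latency is illustrative, not a measured figure from the slide):

```python
# Linear per-hop latency model for the TCA ring: each hop adds 200-300 ns.
# The 2.0 us direct (0-hop) latency here is an illustrative assumption.
BASE_US = 2.0
HOP_NS = 250                  # midpoint of the quoted 200-300 ns per hop

def latency_us(hops):
    return BASE_US + hops * HOP_NS / 1000.0

for h in range(4):            # 0 (direct) through 3 hops, as in the test
    print(h, latency_us(h))
```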
23 Simple example of stencil computation. The yellow-colored surfaces must be communicated: the data are contiguous along the k-dimension only; the others are strided or block-strided. The pattern of communication is fixed across the whole time-development loop. (Figure: 3-D decomposition in 1x2x2 over 4 nodes, with axes i, j, k.) PEACH2 supports block-stride transfer, and also RDMA chaining, in hardware.
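A small sketch of the memory layouts involved, assuming a C-ordered array a[i, j, k] with k fastest (sizes illustrative): the byte offsets show which halo faces are contiguous, block-strided, or fully strided.

```python
# Local subdomain a[i, j, k] stored in C order (k fastest). Computing byte
# offsets shows why only the j-k face transfers as one contiguous block,
# while the others need the (block-)stride DMA PEACH2 provides in hardware.
n, item = 8, 4                        # n^3 grid, 4-byte (single) values

def offset(i, j, k):
    return ((i * n + j) * n + k) * item

# j-k face (i = 0): consecutive elements are 'item' bytes apart -> contiguous.
jk_stride = offset(0, 0, 1) - offset(0, 0, 0)
# i-k face (j = 0): runs of n contiguous k-values, n*n*item bytes apart
# -> block-stride.
ik_block_stride = offset(1, 0, 0) - offset(0, 0, 0)
# i-j face (k = 0): every element is n*item bytes from the next -> strided.
ij_stride = offset(0, 1, 0) - offset(0, 0, 0)

print(jk_stride, ik_block_stride, ij_stride)   # 4 256 32
```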
24 Evaluation on a small problem (Himeno Benchmark, small size): a typical situation for strong scaling. TCA achieves up to 44% performance improvement over MPI (MVAPICH2). (Chart: speedup of MPI (MV2) vs. TCA for several node decompositions.)
25 Univ. of Tsukuba's Feasibility Study Proposal based on Extreme Accelerated Computing
26 Project organization (FS midterm report by Sato). Joint project with Titech (Makino), Aizu U. (Nakazato), RIKEN (Taiji), U. Tokyo, KEK, Hiroshima U., and Hitachi as the supercomputer vendor. Target apps: QCD in particle physics, tree N-body and HMD in astrophysics, MD in life science, FDM for earthquakes, FMO in chemistry, RS-DFT in materials, NICAM in climate science. Application study (U. Tsukuba, RIKEN, U. Tokyo, KEK, Hiroshima U.): global climate science (Tanaka, Yashiro), earth science (Okamoto), life science (Taiji, Umeda), nanomaterial science (Oshiyama), astrophysics (Umemura, Yoshikawa), particle physics (Kuramashi, Ishikawa, Matsufuru), simulator and evaluation tools (Kodama). Accelerator and network system design (Titech, U. Tsukuba, Aizu), with simulation feedback: processor core architecture (Nakazato), programming model (Sato), basic accelerator architecture (Makino), accelerator network (Boku), study on implementation and power (Hitachi), programming model and simulation tools (U. Tsukuba).
27 Between power-effective computing and dynamism of computation. (Chart: number of independent cores vs. SIMD width (operations/instruction), with constant-power and constant-Flops curves; the many-core CPU, MIC and GPU sit toward many independent cores with narrower SIMD, while GRAPE-DR and our solution (extreme SIMD) sit at large SIMD width with few independent cores.)
28 Basic concept of the extreme SIMD accelerator. Fully concentrated on Flops (compute-oriented & reduced memory): a number of SIMD cores (500-1K/chip) with very low dynamism; very limited memory capacity per core -> on-chip SRAM; high density of accelerating chips. Simple and small processing elements with some number of registers and on-chip SRAM, tightly coupled by an on-chip interconnection network. Interconnection network: 2-D torus on-chip (up to 4K cores), 2-D torus on-board (~16 chips), and an n-D torus network among boards, with a special broadcast/reduction tree feature. These chips are also tightly connected by an inter-chip network which can be operated directly from the controller. Importance of co-design: deciding memory capacity and the number of cores per chip (a trade-off), and network bandwidth and the number of ports, based on target applications and algorithms.
29 Global design of our proposal (FS midterm report by Sato). Two keys for exascale computing: power and strong scaling. We study exascale heterogeneous systems with many-core accelerators. We are interested in: architecture of accelerators (core and memory architecture, special-purpose functions); direct connection between accelerators in a group; power estimation and evaluation; programming model and computational science applications; requirements for the general-purpose system, etc. (Figure: system hierarchy — each node couples a general-purpose processor and memory with accelerator chips via controllers; an accelerator chip is a grid of cores, each with its own memory; nodes form groups linked by an accelerator network, and groups connect to storage through the system network.)
30 PACS-G: a straw man architecture (FS midterm report by Sato). SIMD architecture, for compute-oriented apps (N-body, MD) and stencil apps. 4096 cores (64x64), 2 FMA @ 1 GHz, 4 GFlops x 4096 = 16 TFlops/chip. 2-D mesh (+ broadcast/reduction) on-chip network for stencil apps. We expect 14 nm technology in the target years; chip die size: 20 mm x 20 mm. Mainly working on on-chip memory (size 512 MB/chip, 128 KB/core), with module memory by 3D-stack/Wide IO DRAM (via 2.5D TSV), bandwidth 1 TB/s, size 16-32 GB/chip. 64K chips are expected for 1 EFlops (at peak), with an inter-accelerator direct network. (Figure: host CPU and controller feed instructions and data to the PACS-G chip; the PE array is served by broadcast (BC) memories, communication buffers, a result-reduction network, and 3D-stack or Wide IO DRAM.)
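The per-chip and system peaks quoted above follow directly from the stated parameters; a sketch of the arithmetic:

```python
# PACS-G straw-man peak performance (figures from the slide).
cores = 64 * 64                  # 4096 PEs in a 64x64 array
gflops_per_core = 2 * 2 * 1.0    # 2 FMA units x 2 flops each @ 1 GHz

chip_tflops = cores * gflops_per_core / 1000.0
chips = 64 * 1024                # "64K chips expected for 1 EFlops"
system_eflops = chips * chip_tflops / 1e6

print(chip_tflops)               # 16.384, quoted as 16 TFlops/chip
print(round(system_eflops, 2))   # ~1.07 EFlops at peak
```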
31 PACS-G: a straw man architecture (FS midterm report by Sato). A group of chips is connected via the accelerator network (inter-chip network) at 25-50 Gbps/link. If we extend the 2-D mesh network to the (2-D mesh) external network of a group, we need 200-400 GB/s (= 32 ch. x 25-50 Gbps x 2 (bi-directional)). For 50 Gbps data transfer, we may need direct optical interconnect from the chip (?). I/O interface to the host: PCI Express Gen4 x16 (not enough!!!). Programming model: XcalableMP + OpenACC: use OpenACC to specify offloaded fragments of code and data movement; to align data and computation to cores, we use the "template" concept of XcalableMP (virtual index space), so we can generate code for each core (and a data-parallel language like C*). (Figure: an example 1U implementation — accelerator LSI and optical interconnect modules on a silicon interposer, with HMC / Wide IO DRAM memory modules, PSUs, and board connectors to other accelerator units; chips are linked in a 2-D mesh.)
32 Performance evaluation by simulation: FDTD (Finite Difference Time Domain), Himeno (fluid dynamics), matrix multiplication, N-body, and RS-DFT (Real-Space Density Functional Theory; the 2011 Gordon Bell Prize application on K).
33 FDTD problem mapping. Earthquake wave propagation simulation (also used in the application FS). 3-D space difference scheme (order 2 in time, order 4 in space). In every time step, velocity vectors and the stress tensor are calculated and updated. Each grid point has 12 single-precision variables. Each PE takes n x n x n grids with a stencil shadow of width 2 -> 83 KB of memory per PE is required for n=8. For 4096 PEs on a chip, 128x128x128 grids are available for computation.
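The 83 KB/PE figure above can be reproduced from the stated parameters:

```python
# Per-PE memory for the FDTD mapping: n x n x n interior grid plus a
# shadow (halo) of width 2 on every side, 12 single-precision variables.
n, shadow, nvars, bytes_per_var = 8, 2, 12, 4

points = (n + 2 * shadow) ** 3    # (8 + 4)^3 = 1728 grid points incl. halo
mem = points * nvars * bytes_per_var
print(mem)                        # 82944 bytes, i.e. the ~83 KB quoted
```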
34 FDTD estimated performance. FP operation counts and memory references for each grid point: velocity update, 67 FP ops + 73 memory refs; stress tensor update, 81 FP ops + 82 memory refs. Either case is bottlenecked by memory references: each PE's local memory bandwidth is 16 GB/s, giving 19.8 us for computation over all grids. Shadow-data exchange for the 3-D stencil computation uses the tightly coupled communication channels: the shadow area holds 6 variables for velocity and 3 variables for the tensor (cyclic boundary condition); each shadow exchange transfers 8x8x2 grids to 6 neighboring PEs. Inter-PE communication for the 3-D problem mapped onto 2-D is assumed at 2 GB/s, giving 13.8 us for the shadow-area exchange; the cyclic boundary exchange takes 18.4 us. Total computation + communication time: 52.0 us, i.e. 1.5 GFlops/PE and 6 TFlops/chip (37% of peak).
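The 19.8 us computation time above is consistent with a pure memory-bandwidth bound; a sketch of that arithmetic:

```python
# Memory-bound compute time per time step on one PE (figures from the slide).
grids = 8 ** 3             # 512 grid points per PE
refs = 73 + 82             # memory refs per grid point (velocity + stress)
bytes_per_ref = 4          # single precision
bw = 16e9                  # 16 GB/s local memory bandwidth per PE

t_us = grids * refs * bytes_per_ref / bw * 1e6
print(round(t_us, 1))      # 19.8 us, matching the slide's estimate
```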
35 Application and benchmark evaluation. FDTD: 5.2 TFlops (single precision). Himeno Benchmark (middle: 128x128x256): 7.4 TFlops (single precision). Matrix-matrix multiply kernel: 14.9 TFlops (double precision). Large-scale application: RS-DFT for 100,000 atoms (the same case as the Gordon Bell Prize run on the K Computer). Strong scaling up to 4096 chips (= 64 PFlops) can keep 35% efficiency, the same efficiency as the K Computer -> equivalent to having 6 entire K Computer systems. For a group of 4096 chips (= 64 PFlops): if data fit in on-chip memory, the ratio is 4 B/F with total memory size 2 TB; if data fit only in module memory, the B/F ratio drops, with total memory size 64 TB. Co-design is quite important for such an aggressive architecture with various limitations: co-design decides what we gain for what we give up.
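The B/F figures for the 4096-chip group follow from the per-PE and per-chip bandwidths on the earlier slides; a sketch (the module-memory B/F is derived here, since the slide leaves it blank):

```python
# Bytes-per-flop at the two memory levels of a 16 TFlops PACS-G chip.
# On-chip SRAM: 16 GB/s per PE (slide 34) against 4 GFlops per PE.
onchip_bf = 16e9 / 4e9
# Module memory: 1 TB/s per chip (slide 30) against 16 TFlops per chip;
# this 1/16 value is an inference, not a figure stated on the slide.
module_bf = 1e12 / 16e12

print(onchip_bf)           # 4.0 B/F, as quoted
print(module_bf)           # 0.0625 B/F
```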
36 Current status and plan. We are now working on performance estimation through the co-design process: 2012 (done): QCD, N-body, MD, HMD (RS-DFT ongoing); 2013: earthquake simulation, NICAM (climate), FMO (chemistry). Also: developing simulators (clock-level/instruction-level) for more precise and quantitative performance evaluation; compiler development (XMP and OpenACC); (re-)design and investigation of the network topology (is a 2-D mesh sufficient, or is there a better alternative?); code development for apps using host and accelerator, including I/O; precise and more detailed estimation of power consumption.
37 Advanced issues. Is a simple 4K-way SIMD on a chip enough? The PACS-G accelerator is a pure throughput-core solution: the algorithm must sustain very large occupancy (effective cores must not be masked); in some applications (e.g. MD) we need more fine-grained flexibility rather than one big SIMD; possibly we add some number of individual general cores for function-level parallelism. How can 3-D stacked memory be utilized effectively? Currently it is used only for temporary buffering and checkpointing; the inter-accelerator network directly accesses 3-D stacked memory for coarse-grained communication & buffering. What kind of general CPU should be attached?
38 Summary. The next-generation exa-scale (Flops) system is a challenge in power consumption (Flops/W). The shortest way is to reduce power by using minimum dynamism to control a large number of FP units. An on-chip SRAM solution is the only way to provide data throughput for thousands of cores, which leads to very small capacity per core. Strong scaling is essential, and direct interconnection between accelerators is necessary, within a chip and across chips. Co-design is quite important for such an aggressive architecture with various limitations. Accelerators are not magic, but they are needed for various applications (though they do not cover all).
More informationDesigning High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters
Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar
More informationSolutions for Scalable HPC
Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End
More informationImplicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC
Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma
More informationPreparing GPU-Accelerated Applications for the Summit Supercomputer
Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership
More informationAccelerated Computing Jun Makino Interactive Research Center of Science Tokyo Institute of Technology
Accelerated Computing Jun Makino Interactive Research Center of Science Tokyo Institute of Technology IEEE Cluster 2011 Sept 28, 2011 Satoshi s three questions 1. What does accelerated computing solve
More informationNVIDIA Update and Directions on GPU Acceleration for Earth System Models
NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,
More informationOverview of Supercomputer Systems. Supercomputing Division Information Technology Center The University of Tokyo
Overview of Supercomputer Systems Supercomputing Division Information Technology Center The University of Tokyo Supercomputers at ITC, U. of Tokyo Oakleaf-fx (Fujitsu PRIMEHPC FX10) Total Peak performance
More information7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT
7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT Draft Printed for SECO Murex S.A.S 2012 all rights reserved Murex Analytics Only global vendor of trading, risk management and processing systems focusing also
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationArchitectures for Scalable Media Object Search
Architectures for Scalable Media Object Search Dennis Sng Deputy Director & Principal Scientist NVIDIA GPU Technology Workshop 10 July 2014 ROSE LAB OVERVIEW 2 Large Database of Media Objects Next- Generation
More informationInterconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp
Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary
More informationOverview of Supercomputer Systems. Supercomputing Division Information Technology Center The University of Tokyo
Overview of Supercomputer Systems Supercomputing Division Information Technology Center The University of Tokyo Supercomputers at ITC, U. of Tokyo Oakleaf-fx (Fujitsu PRIMEHPC FX10) Total Peak performance
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all
More informationAn Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationUpdate of Post-K Development Yutaka Ishikawa RIKEN AICS
Update of Post-K Development Yutaka Ishikawa RIKEN AICS 11:20AM 11:40AM, 2 nd of November, 2017 FLAGSHIP2020 Project Missions Building the Japanese national flagship supercomputer, post K, and Developing
More informationTimothy Lanfear, NVIDIA HPC
GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision
More informationCRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar
CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION
More informationReal Application Performance and Beyond
Real Application Performance and Beyond Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400 Fax: 408-970-3403 http://www.mellanox.com Scientists, engineers and analysts
More informationVoltaire Making Applications Run Faster
Voltaire Making Applications Run Faster Asaf Somekh Director, Marketing Voltaire, Inc. Agenda HPC Trends InfiniBand Voltaire Grid Backbone Deployment examples About Voltaire HPC Trends Clusters are the
More informationPerformance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms
Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationTECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016
TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016 MULTI GPU PROGRAMMING Node 0 Node 1 Node N-1 MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM
More informationAn Introduction to OpenACC
An Introduction to OpenACC Alistair Hart Cray Exascale Research Initiative Europe 3 Timetable Day 1: Wednesday 29th August 2012 13:00 Welcome and overview 13:15 Session 1: An Introduction to OpenACC 13:15
More informationPost-Petascale Computing. Mitsuhisa Sato
Challenges on Programming Models and Languages for Post-Petascale Computing -- from Japanese NGS project "The K computer" to Exascale computing -- Mitsuhisa Sato Center for Computational Sciences (CCS),
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationFPGA-based Supercomputing: New Opportunities and Challenges
FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:
More informationExploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters
Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth
More informationIdentifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning
Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationCray XC Scalability and the Aries Network Tony Ford
Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?
More informationExperiences of the Development of the Supercomputers
Experiences of the Development of the Supercomputers - Earth Simulator and K Computer YOKOKAWA, Mitsuo Kobe University/RIKEN AICS Application Oriented Systems Developed in Japan No.1 systems in TOP500
More informationFUJITSU HPC and the Development of the Post-K Supercomputer
FUJITSU HPC and the Development of the Post-K Supercomputer Toshiyuki Shimizu Vice President, System Development Division, Next Generation Technical Computing Unit 0 November 16 th, 2016 Post-K is currently
More informationThe Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011
The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities
More informationPerformance Accelerated Mellanox InfiniBand Adapters Provide Advanced Data Center Performance, Efficiency and Scalability
Performance Accelerated Mellanox InfiniBand Adapters Provide Advanced Data Center Performance, Efficiency and Scalability Mellanox InfiniBand Host Channel Adapters (HCA) enable the highest data center
More informationOverview of Reedbush-U How to Login
Overview of Reedbush-U How to Login Information Technology Center The University of Tokyo http://www.cc.u-tokyo.ac.jp/ Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle FY 08 09 10 11 12 13 14 15
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationSystem Design of Kepler Based HPC Solutions. Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering.
System Design of Kepler Based HPC Solutions Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering. Introduction The System Level View K20 GPU is a powerful parallel processor! K20 has
More informationCoupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications
Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationResults from TSUBAME3.0 A 47 AI- PFLOPS System for HPC & AI Convergence
Results from TSUBAME3.0 A 47 AI- PFLOPS System for HPC & AI Convergence Jens Domke Research Staff at MATSUOKA Laboratory GSIC, Tokyo Institute of Technology, Japan Omni-Path User Group 2017/11/14 Denver,
More information