Sunway TaihuLight: The system and applications. Zhao Liu Director of Application Support Department National Supercomputing Center in Wuxi

Size: px
Start display at page:

Download "Sunway TaihuLight: The system and applications. Zhao Liu Director of Application Support Department National Supercomputing Center in Wuxi"

Transcription

1 Sunway TaihuLight: The system and applications Zhao Liu Director of Application Support Department National Supercomputing Center in Wuxi

2 Outline Sunway Machine Applications and Programming Challenges Long Term Plans

3 Sunway TaihuLight 神威 太湖之光

4 Sunway TaihuLight City Rank in Top50 Shanghai 1 Suzhou 7 Wuxi 14 Nantong 24 Changzhou 34 Jiaxing 50

5 Sunway-I: - CMA service, commercial chip Tflops - 48 th of TOP500 Sunway BlueLight: - NSCC-Jinan, core processor - 1 Pflops - 14 th of TOP500 Sunway TaihuLight: - NSCC-Wuxi, core processor Pflops - 1 st of TOP500 The Sunway Machine Family

6 Overall Architecture of SW26010 MPE: management processing element 管理 MPE 核心 管理 MPE 核心 运算核 CPE 心簇 Clusters 运算核 CPE 心簇 Clusters CPE: computing processing element IMPE: intelligent memory processing element SI: system interface 智能存储核心 主存 智能存储核心 主存 片上网络 Network On Chip 智能存储核心 IMPE IMPE IMPE Main Memory Main Memory 主存 Main Memory 系统 SI 接口

7 communication row and column buses credit-based wormhole switch cluster controller The Architecture of the CPE Cluster streaming engine: processing of DMA instructions synchronization control: fast hardware synchronization for CPEs consistency processing unit: assuring cache coherency with the MPEs 运算核 CPE Clusters 心簇 Cluster 簇控制器 Controller... 运算运算运算运算 CPE CPE CPE CPE 核心核心核心核心... 运算运算运算运算 CPE CPE CPE CPE 核心核心核心核心... 运算运算运算运算 CPE CPE CPE CPE 核心核心核心核心... 运算运算运算运算 CPE CPE CPE CPE 核心核心核心核心 Cluster Communication 簇通信网络 Network Cluster Communication 簇通信网络接口 Network Interface CTLB L2 ICache 流传输引 Streaming Engine 擎 Network 片上网络接口 On Chip Interface Interruption 中断控制 Control Synchroniza 同步控制 tion Control 一致性处理 Consistency processing 部件 Unit Network On Chip 片上网络

8 The MPE microarchitecture MPE: L1 Instruction L1 指令 Cache Cache complete L1 and L2 caches PC Instruction 指令部件 Unit 3 pipes full support for management, service, and Pipe0 Execution 执行部件 Unit Pipe1 Pipe2 GPR computing functions mainly designed for running OS, runtime, and programs with less parallelism Memory 访存部件 Accessing Unit L1 Data L1 数据 Cache L2 Cache

9 The CPE microarchitecture CPE: L1 Instruction L1 指令 Cache Cache L1 cache for instruction only Scratch Pad Memory (i.e. Local Data Memory) 2 pipes On- 片上网络 Chip 路由器 NR Instruction PC 取指部件 Unit Execution 执行部件 Unit Pipe0 Pipe1 GPR mainly designed for highly parallel computation Memory 访存部件 Accessing Unit SPM/L1 数据 Data Cache

10 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

11 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

12 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

13 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

14 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

15 A System with Over 10 Million Cores

16 Outline Sunway Machine Applications and Programming Challenges Long Term Plans

17 Applications overview Over 200 applications in different HPC domains 15 full-scale applications More than 50 applications that can be scaled to half system More than 100 applications that can be scaled to one million cores Collaborate with more than 300 companies, research institutes and universities Porting open source applications to Sunway, including WRF, VASP, LAMMPS, etc. Develop highly efficient applications on Sunway

18 An (Incomplete) List of Full-Scale Applications 2016 Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

19 An (Incomplete) List of Full-Scale Applications 2016 Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires A graph processing application designed Run-away Electron for Trajectory local Internet Simulation companies in China Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

20 An (Incomplete) List of Full-Scale Applications 2016 Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Work of the Particle Simulator Genome Functional Annotation and Homeotic Research Team Gene Building of RIKEN AICS Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

21 An (Incomplete) List of Full-Scale Applications 2016 Gordon Bell Finalists Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

22 An (Incomplete) List of Full-Scale Applications 2016 Gordon Bell Prize Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

23 An (Incomplete) List of Full-Scale Applications 2016 Gordon Bell Prize Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Gordon Bell Finalists Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

24 An (Incomplete) List of Full-Scale Applications 2016 Gordon Bell Prize Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Gordon Bell Prize Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

25 Two Major Challenges The Memory Barrier The Application Barrier

26 Machine Capability Comparison TaihuLight Tianhe-2 Titan K Computer communication bandwidth Peak Performance Memory Size hpgmg TaihuLight Titan Linpack Tianhe-2 K Computer Graph memory bandwidth Gflops/Watt HPCG Tflops/m^3

27 Again, a Multi-Objective Optimization Problem with a Budget Constraint Peak /.1Pflops Mem / TB Price / Million USD Sunway TaihuLight Tianhe-2 Piz Daint Titan Sequoia K

28 Again, a Multi-Objective Optimization Problem with a Budget Constraint a budget machine with aggressive compute performance Peak /.1Pflops Mem / TB Price / Million USD Sunway TaihuLight Tianhe-2 Piz Daint Titan Sequoia K

29 Major Features to Consider Sunway TaihuLight 125 Pflops 10 million cores user-controlled 64 KB LDM 32 GB and 136GB/s per node 22 flops/byte MPE + CPE register communication among CPEs

30 Major Features to Consider Sunway TaihuLight 125 Pflops 10 million cores user-controlled 64 KB LDM 32 GB and 136GB/s per node 22 flops/byte MPE + CPE register communication among CPEs Intel KNL 7250 of Cori: NVIDIA P100 of Piz Daint: 6.5 flops/byte 7.2 flops/byte

31 Major Challenge : Memory Wall Sunway TaihuLight 125 Pflops 10 million cores user-controlled 64 KB LDM 32 GB and 136GB/s per node 22 flops/byte MPE + CPE register communication among CPEs Refactoring and Redesigning

32 MPI One MPI process runs on one management core (MPE) Sunway OpenACC Sunway OpenACC conducts data transfer between main memory and local data memory (LDM), and distributes the kernel workload to the computing cores (CPEs) Athread Programming Model on TaihuLight MPI + X X: (Sunway OpenACC / Athread) Athread is the threading library to manage threads on computing core (CPE), which is used in the Sunway OpenACC implementation

33 Two Major Challenges The Memory Barrier The Application Barrier

34 Example (I): Nonlinear Earthquake Simulation on Sunway TaihuLight Dynamic rupture source generator (originated from CG-FDM) Seismic wave propagation (originated from AWP-ODC) Other utilities: source partitioner 3D Model Interpolator Restart controller Fault Stress Init Dynamic Rupture Source Generator (Based on CG-FDM) Source Partitioner Seismic Wave Propagation (Based on AWP-ODC) Velocity Stress Update Update Next Timestep Stress Adjustment For Plasticity Friction Law Ctrl Source Injection Wave Eqn Solver 3D Vel/Den Model 3D Model Interpolator Snapshot/Sesimo Recorder Restart Controller LZ4 Compression, Group I/O, Balanced I/O Forwarding

35 Multi-Level Domain Decomposition

36 A Balanced Memory Scheme v 1 v 2 v n before v 1 v 2 v n after u 1 u 2 u w n 1 w 2 xx1 xx 2 yy1 yy 2 zz1 zz 2 w n xxn... yyn zzn u 1 u 2 u w n 1 w 2 u1 v1 w1 u2 v2 w2 fuse arrays w n xx1 xx 2 yy1 yy 2 zz1 zz 2 yz1 yz2 xxn... yyn zzn yzn (1) array fusion, (2) halo exchange through register communication, dvelcx yz1 yz2 yzn dvelcx xx1 yy1 zz1 xy1 xz1 yz1 xx 2yy 2 zz 2 xy 2 xz2 yz2 (3) and optimized blocking configuration guided by an analytical model v 1 v 2 v n u 1 u 2 un w 1 w 2 w n u1 v1 w1 u2 v2 w2 dstrqc dstrqc ID: 00 ID: 01 ID: 02 ID: 63 xx1 xx 2 xxn yz1 yz2 yy1 yy 2 zz1 zz 2... yzn zzn yyn xx1 yy1 zz1 xy1 xz1 yz1 xx 2yy 2 zz 2 xy 2 xz2 yz2 Left boundary Right boundary Inner part DMA transfer Register communication Register communication

37 Weak Scaling 18 Ideal (Linear) 14 Ideal (Non-linear) Ideal (Linear+Compress) 9 Ideal (Non-linear+Compress) 6 PFLOPS Linear (Peak: 10.7PFLops, Para. eff. 97.9%) Non-linear (Peak: 15.2PFlops, Para. eff. 80.1%) Linear+Compress (Peak: 14.2PFlops, Para. eff. 96.5%) Non-linear+Compress (Peak: 18.9PFlops, Para. eff. 79.5%) 8K 12K 16K 24K 32K 40K 48K 64K 80K 96K 120K 160K Number of processes

38 Weak Scaling PFLOPS 18 Ideal (Linear) 14 Ideal (Non-linear) Ideal (Linear+Compress) 9 Ideal (Non-linear+Compress) Peak Mem BW BYTE/flop Best Efficiency Linear (Peak: 10.7PFLops, Para. eff. 97.9%) 1 Titan 27 Pflops 5,475 TB/s Non-linear 0.2 half (Peak: machine: 15.2PFlops, 1.6 Pflops, Para. eff. 11.9% 80.1%) Linear+Compress (Peak: 14.2PFlops, Para. eff. 96.5%) 0.6 Non-linear+Compress (Peak: 18.9PFlops, Para. eff. 79.5%) Sunway without compression: 15.2 Pflops, 12.2% 8K K Pflops 16K4,473 TB/s24K 0.04 TaihuLight 32K 40K 48Kwith compression: 64K 80K K Pflops, 120K15.1% 160K Number of processes

39 Simulation Results

40 Two Major Challenges The Memory Barrier The Application Barrier

41 More and more component models marine biology ocean model oceanatmosphere boundary atmosphere model space weather ocean-ice boundary coupler land-atmosphere boundary atmospheric chemistry dynamic ice ice model ice-land boundary land model land biology solid earth hydrological process

42 The application barrier a high complexity in application, and a heavy legacy in the code base (millions lines of code) an extremely complicated MPMD program with no hotspots (or hundreds of hotspots) misfit between the in-place design philosophy and the new architecture lack of people with interdisciplinary knowledge and experience

43 Application (II): Porting CESM and Redesigning CAM-SE for Sunway TaihuLight CAM5.0 POP2.0 CPL7 CLM4.0 CICE4.0 CESM1.2.0 Tsinghua + BNU 30+ Professors and Students Four component models, millions lines of code Large-scale run on Sunway TaihuLight 24,000 MPI processes Over one million cores 10-20x speedup for kernels 2-3x speedup for the entire model Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer, in Proceedings of SC

44 OpenACC-based Refactoring of CAM Pass tracers (u, v) to dynamics CAM initial Dyn_run Phy_run1 Phy_run2 Pass state variables and tracers Pass state variables manual transformation of loops manual OpenACC parallelization and optimization on code and data structures Euler_step: do ie = nets, nete compute Q min/max values for lim8 compute Biharmonic mixing term f do ie = nets, nete 2D advection step data packing Bonundary exchange Data extracting 1 do ie = nets, nete do ie = nets, nete do k = 1, nlev do k = 1, nlev do q = 1, qsize do q = 1, qsize qmin(k,q,ie) = Qtens(k,q,ie) = qmax(k,q,ie) = Qtens(k,q,ie) = Data packing 4 do ie = nets, nete do ie = nets, nete do q = 1, qsize do q = 1, qsize do k = 1, nlev do k = 1, nlev qmin(k,q,ie) = Qtens(k,q,ie) = qmax(k,q,ie) = Qtens(k,q,ie) = Data packing 5!$ACC PARALLEL LOOP!$ACC PARALLEL LOOP do ie_q = 1, do ie_q = 1, qsize*(nete-nets) qsize*(nete-nets) do k = 1, nlev do k = 1, nlev q = func(ie_q) q = func(ie_q) ie = func(ie_q) ie = func(ie_q) qmin(k,q,ie) = Qtens(k,q,ie) = qmax(k,q,ie) = Qtens(k,q,ie) =!$ACC PARALLEL LOOP Data packing 6 do ie = nets, nete do ie = nets, nete do k = 1, nlev do q = 1, qsize dp(k) = func_1() do k = 1, nlev do q = 1, qsize. Qtens(k,q,ie) = func_2(dp(k)) do ie = nets, nete do ie = nets, nete do k = 1, nlev do k = 1, nlev dp0 = func_3() do q = 1, qsize dpdiss = func_4() qmin(k,q,ie) = do q = 1, qsize qmax(k,q,ie) = Qtens(k,q,ie) = func_5(dp0, dpdiss) do ie = nets, nete do k = 1, nlev do k = 1, nlev dp_star(k) = func_8(dp(k)) dp(k) = func_5() Vstar(k) = func_6() do k = 1, nlev Qtens(k,q,ie) = do q = 1, qsize func_9(dp_star(k)) do k = 1, nlev Qtens(k,q,ie) = func_7(dp(k), Vstar(k)) Data packing 2 do ie = nets, nete do ie = nets, nete do k = 1, nlev do q = 1, qsize do q = 1, qsize do k = 1, nlev Qtens(k,q,ie) = func_2(func_1()) optimized:. do ie = nets, nete do ie = nets, nete do k = 1, nlev do k = 1, nlev do q = 1, qsize do q = 1, qsize qmin(k,q,ie) = Qtens(k,q,ie) = qmax(k,q,ie) = func_5(func_3(),func_4()) do ie = nets, nete do k = 1, nlev do q = 1, qsize Qtens(k,q,ie) = do k = 1, nlev func_9(func_8(func_5())) Qtens(k,q,ie) = func_7(func_5(),func_6()) Data packing 3 do begin_chunk to end_chunk tphysbc() tphysbc() tphysbc() { { { do begin_chunk to end_chunk do begin_chunk to end_chunk convect_deep_tend(6.47%) convect_deep_tend(6.47%) convect_deep_tend(6.47%) convect_shallow_tend(15.57%) convect_shallow_tend(15.57%) enddo macrop_driver_tend(8.38%) macrop_driver_tend(8.38%) microp_aero_run(4.29%) microp_aero_run(4.29%) do begin_chunk to end_chunk microp_driver_tend(7.13%) microp_driver_tend(7.13%) microp_driver_tend(7.13%) aerosol_wet_intr(4.29%) aerosol_wet_intr(4.29%) enddo convect_deep_tend_2(0.51%) convect_deep_tend_2(0.51%) radiation_tend(54.07%) radiation_tend(54.07%) do begin_chunk to end_chunk } enddo radiation_tend(54.07%) enddo } enddo } convect_deep_tend(6.47%) { zm_conv_tend(6.47%) do begin_chunk to end_chunk { convect_deep_tend(6.47%) do begin_chunk to end_chunk { zm_convr(2.03%) zm_conv_tend(6.47%) enddo { do begin_chunk to end_chunk zm_convr(2.03%) zm_conv_evap() zm_conv_evap() enddo montran() do begin_chunk to end_chunk convtranc(0.06%) montran() } enddo } do begin_chunk to end_chunk enddo convtranc(0.06%) enddo } } tool based transformation of loops

45 Athread-based Fine-grained Redesign Step 1: rewrite of Fortran OpenACC code to Athread C code finer memory control through a specific DMA scheme more efficient vectorization elek C0,0 C7,0 C0, C7, elek+7 C0,7... C7,7 Step 2: register-communication based redesign remove data dependency expose more parallelism Stage 1 Ci, j a16*i a16*i+a16*i+1... a16*i+...+a16*i+15 Stage 2 C0,0 a0... a0+...+a15 C1,0 a16 a8+a9... a a31 Ck, 0 ak a0+a1 ak+ak+1... ak+...+ak p15 + p15 = p111 = p31 p127 Stage 3 Ci, j a16*i a16*i+a16*i+1... a16*i+...+a16*i+15 p16*i

46 1 Sunway CG (64 CPEs) could be equivalent to 0.1x Intel Core or 1.8x Intel Core or 7.2x Intel Core or in certain cases 43.1x Intel Core Performance Improvement through Redesign

47 Scaling the Dynamic Core to Millions of Cores PFlops elements in each process 192 elements in each process 768 elements in each process 650 elelments in each process 3.3 PFlops Para.eff 98.55% 2.4 PFlops Para.eff 92.2% 2.72 PFlops Para.eff 92.9% 1.76 PFlops Para.eff 88.3% Number of Processes

48 Simulation of Hurricane Katrina

49 Simulation of Hurricane Katrina

50 Outline Sunway Machine Applications and Programming Challenges Long Term Plans

51 Long Term Plan Traditional HPC Applications (Science -> Service) 52

52 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro tanshan.gif 15-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of Realistic 10 Hz Scenarios, Gordon Bell Prize Finalist, SC

53 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications 54

54 Average time per iteration Long Term Plan Training AlexNet with swcaffe Traditional HPC Applications (Science -> Service) x 1.0x Deep Learning Related Applications total convolution fully connected 2.2x 2.7x? 3.5x 4.5x 1.0x 7.1x 8.5x Intel(24 cores) swblas(1cg) swdnn(1cg) 55

55 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro 56

56 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro 57

57 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro Sunway Exa-Scale System 58

58 Sunway TaihuLight: - NSCC-Wuxi, core processor Pflops - 1 st of TOP500 Sunway Exa-Pilot System: ~ 10 Tflops per node - 10 ~ 20 Gflops/W Sunway Exa-Scale System ~ Pflops - 30 Gflops/W The Sunway Exa-Scale Plan

59

HPC and Big Data: Updates about China. Haohuan FU August 29 th, 2017

HPC and Big Data: Updates about China. Haohuan FU August 29 th, 2017 HPC and Big Data: Updates about China Haohuan FU August 29 th, 2017 1 Outline HPC and Big Data Projects in China Recent Efforts on Tianhe-2 Recent Efforts on Sunway TaihuLight 2 MOST HPC Projects 2016

More information

China s HPC development: a brief review and perspectives

China s HPC development: a brief review and perspectives China s HPC development: a brief review and perspectives Depei Qian Beihang University/Sun Yat-sen University International Symposium on Impact of extreme scale computing Tokyo, Japan Nov. 2, 2017 Outline

More information

The State and Opportunities of HPC Applications in China. Ruibo Wang National University of Defense Technology

The State and Opportunities of HPC Applications in China. Ruibo Wang National University of Defense Technology The State and Opportunities of HPC Applications in China Ruibo Wang National University of Defense Technology Outline Brief introduction to the Sites Applications Fusion Development of HPC, Cloud & Big

More information

CHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer

CHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer CHAO YANG Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy Sciences. His research interests include numerical

More information

Alexander Heinecke (Intel), Josh Tobin (UCSD), Alexander Breuer (UCSD), Charles Yount (Intel), Yifeng Cui (UCSD) Parallel Computing Lab Intel Labs

Alexander Heinecke (Intel), Josh Tobin (UCSD), Alexander Breuer (UCSD), Charles Yount (Intel), Yifeng Cui (UCSD) Parallel Computing Lab Intel Labs Alexander Heinecke (Intel), Josh Tobin (UCSD), Alexander Breuer (UCSD), Charles Yount (Intel), Yifeng Cui (UCSD) Parallel Computing Lab Intel Labs USA November 14 th 2017 Legal Disclaimer & Optimization

More information

Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P

Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P Stephen Wang 1, James Lin 1, William Tang 2, Stephane Ethier 2, Bei Wang 2, Simon See

More information

Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer

Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer Haohuan Fu 1,2, Junfeng Liao 1,2,3, Wei Xue 1,2,3, Lanning Wang 4, Dexun Chen 3, Long Gu 5, Jinxiu

More information

Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC

Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma

More information

Supercomputer Networking via IPv6. Xing Li, Congxiao Bao, Wusheng Zhang

Supercomputer Networking via IPv6. Xing Li, Congxiao Bao, Wusheng Zhang Supercomputer Networking via IPv6 Xing Li, Congxiao Bao, Wusheng Zhang 2018-08-08 Outline Sunway TaihuLight Supercomputer CERNET and CERNET2 Challenges Solution Performance Q&A Remarks 2 National Supercomputing

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

Report on the Sunway TaihuLight System. Jack Dongarra. University of Tennessee. Oak Ridge National Laboratory

Report on the Sunway TaihuLight System. Jack Dongarra. University of Tennessee. Oak Ridge National Laboratory Report on the Sunway TaihuLight System Jack Dongarra University of Tennessee Oak Ridge National Laboratory June 24, 2016 University of Tennessee Department of Electrical Engineering and Computer Science

More information

Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor

Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor Stephen Wang 1, James Lin 1,4, William Tang 2, Stephane Ethier 2, Bei Wang 2, Simon See 1,3

More information

swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu

swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu 1 2 3 Outline 1. Background 2. Sunway architecture

More information

Accelerating High Performance Computing.

Accelerating High Performance Computing. Accelerating High Performance Computing http://www.nvidia.com/tesla Computing The 3 rd Pillar of Science Drug Design Molecular Dynamics Seismic Imaging Reverse Time Migration Automotive Design Computational

More information

Towards Exascale Computing with the Atmospheric Model NUMA

Towards Exascale Computing with the Atmospheric Model NUMA Towards Exascale Computing with the Atmospheric Model NUMA Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo Department of Applied Mathematics Naval Postgraduate School, Monterey

More information

System Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited

System Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited System Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited 2018 IEEE 68th Electronic Components and Technology Conference San Diego, California May 29

More information

Parallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?

Parallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer? Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Understanding IO patterns of SSDs

Understanding IO patterns of SSDs 固态硬盘 I/O 特性测试 周大 众所周知, 固态硬盘是一种由闪存作为存储介质的数据库存储设备 由于闪存和磁盘之间物理特性的巨大差异, 现有的各种软件系统无法直接使用闪存芯片 为了提供对现有软件系统的支持, 往往在闪存之上添加一个闪存转换层来实现此目的 固态硬盘就是在闪存上附加了闪存转换层从而提供和磁盘相同的访问接口的存储设备 一方面, 闪存本身具有独特的访问特性 另外一方面, 闪存转换层内置大量的算法来实现闪存和磁盘访问接口之间的转换

More information

IHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing

IHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing : A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing Balazs Gerofi Exascale System Software Team, RIKEN Center for Computational Science 218/Nov/15 SC 18 Intel Extreme Computing

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC

A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC Hisashi YASHIRO RIKEN Advanced Institute of Computational Science Kobe, Japan My topic The study for Cloud computing My topic

More information

Analysis and Visualization Algorithms in VMD

Analysis and Visualization Algorithms in VMD 1 Analysis and Visualization Algorithms in VMD David Hardy Research/~dhardy/ NAIS: State-of-the-Art Algorithms for Molecular Dynamics (Presenting the work of John Stone.) VMD Visual Molecular Dynamics

More information

Analysis of Characteristic Features of HPC Applications. Lian Jin HPC Engineer, Inspur

Analysis of Characteristic Features of HPC Applications. Lian Jin HPC Engineer, Inspur Analysis of Characteristic Features of HPC Applications Lian Jin HPC Engineer, Inspur HPC applications How to make these applications much more efficient? Application optimization System optimization Understanding

More information

Fujitsu HPC Roadmap Beyond Petascale Computing. Toshiyuki Shimizu Fujitsu Limited

Fujitsu HPC Roadmap Beyond Petascale Computing. Toshiyuki Shimizu Fujitsu Limited Fujitsu HPC Roadmap Beyond Petascale Computing Toshiyuki Shimizu Fujitsu Limited Outline Mission and HPC product portfolio K computer*, Fujitsu PRIMEHPC, and the future K computer and PRIMEHPC FX10 Post-FX10,

More information

GPU-optimized computational speed-up for the atmospheric chemistry box model from CAM4-Chem

GPU-optimized computational speed-up for the atmospheric chemistry box model from CAM4-Chem GPU-optimized computational speed-up for the atmospheric chemistry box model from CAM4-Chem Presenter: Jian Sun Advisor: Joshua S. Fu Collaborator: John B. Drake, Qingzhao Zhu, Azzam Haidar, Mark Gates,

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

GPU Consideration for Next Generation Weather (and Climate) Simulations

GPU Consideration for Next Generation Weather (and Climate) Simulations GPU Consideration for Next Generation Weather (and Climate) Simulations Oliver Fuhrer 1, Tobias Gisy 2, Xavier Lapillonne 3, Will Sawyer 4, Ugo Varetto 4, Mauro Bianco 4, David Müller 2, and Thomas C.

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

Evolution of Transparent Computing with UEFI - Intel s Practice on TC. Lu Ju Director, Platform Software Infrastructure, Intel Asia Pacific R&D Ltd

Evolution of Transparent Computing with UEFI - Intel s Practice on TC. Lu Ju Director, Platform Software Infrastructure, Intel Asia Pacific R&D Ltd Evolution of Transparent Computing with UEFI - Intel s Practice on TC Lu Ju Director, Platform Software Infrastructure, Intel Asia Pacific R&D Ltd Agenda TC - the Retrospective UEFI - the Review & Update

More information

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Quinn Mitchell HPC UNIX/LINUX Storage Systems ORNL is managed by UT-Battelle for the US Department of Energy U.S. Department

More information

Alexander Heinecke. Parallel Computing Lab Intel Labs USA

Alexander Heinecke. Parallel Computing Lab Intel Labs USA Alexander Heinecke Parallel Computing Lab Intel Labs USA April 23 th 2018 Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL

More information

Introduction to National Supercomputing Centre in Guangzhou and Opportunities for International Collaboration

Introduction to National Supercomputing Centre in Guangzhou and Opportunities for International Collaboration Exascale Applications and Software Conference 21st 23rd April 2015, Edinburgh, UK Introduction to National Supercomputing Centre in Guangzhou and Opportunities for International Collaboration Xue-Feng

More information

Experiences of the Development of the Supercomputers

Experiences of the Development of the Supercomputers Experiences of the Development of the Supercomputers - Earth Simulator and K Computer YOKOKAWA, Mitsuo Kobe University/RIKEN AICS Application Oriented Systems Developed in Japan No.1 systems in TOP500

More information

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu Bigger GPUs and Bigger Nodes Carl Pearson (pearson@illinois.edu) PhD Candidate, advised by Professor Wen-Mei Hwu 1 Outline Experiences from working with domain experts to develop GPU codes on Blue Waters

More information

LCOC Lustre Cache on Client

LCOC Lustre Cache on Client 1 LCOC Lustre Cache on Client Xue Wei The National Supercomputing Center, Wuxi, China Li Xi DDN Storage 2 NSCC-Wuxi and the Sunway Machine Family Sunway-I: - CMA service, 1998 - commercial chip - 0.384

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Sadaf Alam & Thomas Schulthess CSCS & ETHzürich CUG 2014 * Timelines & releases are not precise Top 500

More information

China Next Generation Internet (CNGI) project and its impact. MA Yan Beijing University of Posts and Telecommunications 2009/08/06.

China Next Generation Internet (CNGI) project and its impact. MA Yan Beijing University of Posts and Telecommunications 2009/08/06. China Next Generation Internet (CNGI) project and its impact MA Yan Beijing University of Posts and Telecommunications 2009/08/06 Outline Next Generation Internet CNGI project in general CNGI-CERNET2 CERNET2

More information

云计算入门 Introduction to Cloud Computing GESC1001

云计算入门 Introduction to Cloud Computing GESC1001 Lecture #6 云计算入门 Introduction to Cloud Computing GESC1001 Philippe Fournier-Viger Professor School of Humanities and Social Sciences philfv8@yahoo.com Fall 2017 1 Introduction Last week: how cloud applications

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Moore's Law abandoned serial programming around 2004 Courtesy Liberty Computer Architecture Research Group

More information

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,

More information

GPU Developments for the NEMO Model. Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA

GPU Developments for the NEMO Model. Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA GPU Developments for the NEMO Model Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA NVIDIA HPC AND ESM UPDATE TOPICS OF DISCUSSION GPU PROGRESS ON NEMO MODEL 2 NVIDIA GPU

More information

COMP 633 Parallel Computing.

COMP 633 Parallel Computing. COMP 633 Parallel Computing http://www.cs.unc.edu/~prins/classes/633/ Parallel computing What is it? multiple processors cooperating to solve a single problem hopefully faster than a single processor!

More information

FUJITSU HPC and the Development of the Post-K Supercomputer

FUJITSU HPC and the Development of the Post-K Supercomputer FUJITSU HPC and the Development of the Post-K Supercomputer Toshiyuki Shimizu Vice President, System Development Division, Next Generation Technical Computing Unit 0 November 16 th, 2016 Post-K is currently

More information

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics

More information

Key Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED

Key Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED Key Technologies for 100 PFLOPS How to keep the HPC-tree growing Molecular dynamics Computational materials Drug discovery Life-science Quantum chemistry Eigenvalue problem FFT Subatomic particle phys.

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

Introduction of Oakforest-PACS

Introduction of Oakforest-PACS Introduction of Oakforest-PACS Hiroshi Nakamura Director of Information Technology Center The Univ. of Tokyo (Director of JCAHPC) Outline Supercomputer deployment plan in Japan What is JCAHPC? Oakforest-PACS

More information

Testbed Overview in China Xiaohong Huang Beijing University of Posts and Telecommunications

Testbed Overview in China Xiaohong Huang Beijing University of Posts and Telecommunications Testbed Overview in China Xiaohong Huang Beijing University of Posts and Telecommunications CJK FI Workshop, Nov. 2011 Agenda Why we need testbeds PlanetLab, Emulab, OpenFlow, Seattle, ORBIT,... Idea Standardization

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Moore's Law abandoned serial programming around 2004 Courtesy Liberty Computer Architecture Research Group

More information

CS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING Top 10 Supercomputers in the World as of November 2013*

CS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING Top 10 Supercomputers in the World as of November 2013* CS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING 2014 COMPUTERS : PRESENT, PAST & FUTURE Top 10 Supercomputers in the World as of November 2013* No Site Computer Cores Rmax + (TFLOPS) Rpeak (TFLOPS)

More information

Memory Footprint of Locality Information On Many-Core Platforms Brice Goglin Inria Bordeaux Sud-Ouest France 2018/05/25

Memory Footprint of Locality Information On Many-Core Platforms Brice Goglin Inria Bordeaux Sud-Ouest France 2018/05/25 ROME Workshop @ IPDPS Vancouver Memory Footprint of Locality Information On Many- Platforms Brice Goglin Inria Bordeaux Sud-Ouest France 2018/05/25 Locality Matters to HPC Applications Locality Matters

More information

: Operating System 计算机原理与设计

: Operating System 计算机原理与设计 0117401: Operating System 计算机原理与设计 Chapter 1-2: CS Structure 陈香兰 xlanchen@ustceducn http://staffustceducn/~xlanchen Computer Application Laboratory, CS, USTC @ Hefei Embedded System Laboratory, CS, USTC

More information

Toward Automated Application Profiling on Cray Systems

Toward Automated Application Profiling on Cray Systems Toward Automated Application Profiling on Cray Systems Charlene Yang, Brian Friesen, Thorsten Kurth, Brandon Cook NERSC at LBNL Samuel Williams CRD at LBNL I have a dream.. M.L.K. Collect performance data:

More information

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea

More information

Fujitsu s Technologies to the K Computer

Fujitsu s Technologies to the K Computer Fujitsu s Technologies to the K Computer - a journey to practical Petascale computing platform - June 21 nd, 2011 Motoi Okuda FUJITSU Ltd. Agenda The Next generation supercomputer project of Japan The

More information

GPU Developments in Atmospheric Sciences

GPU Developments in Atmospheric Sciences GPU Developments in Atmospheric Sciences Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA David Hall, Ph.D., Sr. Solutions Architect, NVIDIA, Boulder, CO, USA NVIDIA HPC UPDATE

More information

INSPUR and HPC Innovation

INSPUR and HPC Innovation INSPUR and HPC Innovation Dong Qi (Forrest) Product manager Inspur dongqi@inspur.com Contents 1 2 3 4 5 Inspur introduction HPC Challenge and Inspur HPC strategy HPC cases Inspur contribution to HPC community

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Building NVLink for Developers

Building NVLink for Developers Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized

More information

Parallel Programming Futures: What We Have and Will Have Will Not Be Enough

Parallel Programming Futures: What We Have and Will Have Will Not Be Enough Parallel Programming Futures: What We Have and Will Have Will Not Be Enough Michael Heroux SOS 22 March 27, 2018 Sandia National Laboratories is a multimission laboratory managed and operated by National

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all

More information

Accelerated Earthquake Simulations

Accelerated Earthquake Simulations Accelerated Earthquake Simulations Alex Breuer Technische Universität München Germany 1 Acknowledgements Volkswagen Stiftung Project ASCETE: Advanced Simulation of Coupled Earthquake-Tsunami Events Bavarian

More information

AvalonMiner Raspberry Pi Configuration Guide. AvalonMiner 树莓派配置教程 AvalonMiner Raspberry Pi Configuration Guide

AvalonMiner Raspberry Pi Configuration Guide. AvalonMiner 树莓派配置教程 AvalonMiner Raspberry Pi Configuration Guide AvalonMiner 树莓派配置教程 AvalonMiner Raspberry Pi Configuration Guide 简介 我们通过使用烧录有 AvalonMiner 设备管理程序的树莓派作为控制器 使 用户能够通过控制器中管理程序的图形界面 来同时对多台 AvalonMiner 6.0 或 AvalonMiner 6.01 进行管理和调试 本教程将简要的说明 如何把 AvalonMiner

More information

Introduction to the K computer

Introduction to the K computer Introduction to the K computer Fumiyoshi Shoji Deputy Director Operations and Computer Technologies Div. Advanced Institute for Computational Science RIKEN Outline ü Overview of the K

More information

Running the FIM and NIM Weather Models on GPUs

Running the FIM and NIM Weather Models on GPUs Running the FIM and NIM Weather Models on GPUs Mark Govett Tom Henderson, Jacques Middlecoff, Jim Rosinski, Paul Madden NOAA Earth System Research Laboratory Global Models 0 to 14 days 10 to 30 KM resolution

More information

ENDURING DIFFERENTIATION Timothy Lanfear

ENDURING DIFFERENTIATION Timothy Lanfear ENDURING DIFFERENTIATION Timothy Lanfear WHERE ARE WE? 2 LIFE AFTER DENNARD SCALING GPU-ACCELERATED PERFORMANCE 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 10 3 10 2 Single-threaded perf

More information

ENDURING DIFFERENTIATION. Timothy Lanfear

ENDURING DIFFERENTIATION. Timothy Lanfear ENDURING DIFFERENTIATION Timothy Lanfear WHERE ARE WE? 2 LIFE AFTER DENNARD SCALING 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 Transistors (thousands) 1.1X per year 10 3 10 2 Single-threaded

More information

Multiprotocol Label Switching The future of IP Backbone Technology

Multiprotocol Label Switching The future of IP Backbone Technology Multiprotocol Label Switching The future of IP Backbone Technology Computer Network Architecture For Postgraduates Chen Zhenxiang School of Information Science and Technology. University of Jinan (c) Chen

More information

赛灵思技术日 XILINX TECHNOLOGY DAY 用赛灵思 FPGA 加速机器学习推断 张帆资深全球 AI 方案技术专家

赛灵思技术日 XILINX TECHNOLOGY DAY 用赛灵思 FPGA 加速机器学习推断 张帆资深全球 AI 方案技术专家 赛灵思技术日 XILINX TECHNOLOGY DAY 用赛灵思 FPGA 加速机器学习推断 张帆资深全球 AI 方案技术专家 2019.03.19 Who is Xilinx? Why Should I choose FPGA? Only HW/SW configurable device 1 2 for fast changing networks High performance / low

More information

Petascale Computing Research Challenges

Petascale Computing Research Challenges Petascale Computing Research Challenges - A Manycore Perspective Stephen Pawlowski Intel Senior Fellow GM, Architecture & Planning CTO, Digital Enterprise Group Yesterday, Today and Tomorrow in HPC ENIAC

More information

Machine Vision Market Analysis of 2015 Isabel Yang

Machine Vision Market Analysis of 2015 Isabel Yang Machine Vision Market Analysis of 2015 Isabel Yang CHINA Machine Vision Union Content 1 1.Machine Vision Market Analysis of 2015 Revenue of Machine Vision Industry in China 4,000 3,500 2012-2015 (Unit:

More information

Mapping MPI+X Applications to Multi-GPU Architectures

Mapping MPI+X Applications to Multi-GPU Architectures Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under

More information

The Exascale Era Has Arrived

The Exascale Era Has Arrived Technology Spotlight The Exascale Era Has Arrived Sponsored by NVIDIA Steve Conway, Earl Joseph, Bob Sorensen, and Alex Norton November 2018 EXECUTIVE SUMMARY Earlier this year, scientists broke the exascale

More information

Stan Posey, NVIDIA, Santa Clara, CA, USA

Stan Posey, NVIDIA, Santa Clara, CA, USA Stan Posey, sposey@nvidia.com NVIDIA, Santa Clara, CA, USA NVIDIA Strategy for CWO Modeling (Since 2010) Initial focus: CUDA applied to climate models and NWP research Opportunities to refactor code with

More information

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 1 Last week We have discussed about:

More information

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.

More information

Post-K Development and Introducing DLU. Copyright 2017 FUJITSU LIMITED

Post-K Development and Introducing DLU. Copyright 2017 FUJITSU LIMITED Post-K Development and Introducing DLU 0 Fujitsu s HPC Development Timeline K computer The K computer is still competitive in various fields; from advanced research to manufacturing. Deep Learning Unit

More information

STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC. Stefan Maintz, Dr. Markus Wetzstein

STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC. Stefan Maintz, Dr. Markus Wetzstein STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC Stefan Maintz, Dr. Markus Wetzstein smaintz@nvidia.com; mwetzstein@nvidia.com Companies Academia VASP USERS AND USAGE 12-25% of CPU cycles @ supercomputing

More information

Overview of Tianhe-2

Overview of Tianhe-2 Overview of Tianhe-2 (MilkyWay-2) Supercomputer Yutong Lu School of Computer Science, National University of Defense Technology; State Key Laboratory of High Performance Computing, China ytlu@nudt.edu.cn

More information

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications TESLA P PERFORMANCE GUIDE Deep Learning and HPC Applications SEPTEMBER 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

High performance Computing and O&G Challenges

High performance Computing and O&G Challenges High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Computer Networks. Wenzhong Li. Nanjing University

Computer Networks. Wenzhong Li. Nanjing University Computer Networks Wenzhong Li Nanjing University 1 Chapter 3. Packet Switching Networks Network Layer Functions Virtual Circuit and Datagram Networks ATM and Cell Switching X.25 and Frame Relay Routing

More information

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015 ACCELERATED COMPUTING: THE PATH FORWARD Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015 COMMODITY DISRUPTS CUSTOM SOURCE: Top500 ACCELERATED COMPUTING: THE PATH FORWARD It s time to start

More information

Technologies and application performance. Marc Mendez-Bermond HPC Solutions Expert - Dell Technologies September 2017

Technologies and application performance. Marc Mendez-Bermond HPC Solutions Expert - Dell Technologies September 2017 Technologies and application performance Marc Mendez-Bermond HPC Solutions Expert - Dell Technologies September 2017 The landscape is changing We are no longer in the general purpose era the argument of

More information

Brand-New Vector Supercomputer

Brand-New Vector Supercomputer Brand-New Vector Supercomputer NEC Corporation IT Platform Division Shintaro MOMOSE SC13 1 New Product NEC Released A Brand-New Vector Supercomputer, SX-ACE Just Now. Vector Supercomputer for Memory Bandwidth

More information

NVIDIA GPU TECHNOLOGY UPDATE

NVIDIA GPU TECHNOLOGY UPDATE NVIDIA GPU TECHNOLOGY UPDATE May 2015 Axel Koehler Senior Solutions Architect, NVIDIA NVIDIA: The VISUAL Computing Company GAMING DESIGN ENTERPRISE VIRTUALIZATION HPC & CLOUD SERVICE PROVIDERS AUTONOMOUS

More information

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning Efficient and Scalable Multi-Source Streaming Broadcast on Clusters for Deep Learning Ching-Hsiang Chu 1, Xiaoyi Lu 1, Ammar A. Awan 1, Hari Subramoni 1, Jahanzeb Hashmi 1, Bracy Elton 2 and Dhabaleswar

More information

Simulating Stencil-based Application on Future Xeon-Phi Processor

Simulating Stencil-based Application on Future Xeon-Phi Processor Simulating Stencil-based Application on Future Xeon-Phi Processor PMBS workshop at SC 15 Chitra Natarajan Carl Beckmann Anthony Nguyen Intel Corporation Intel Corporation Intel Corporation Mauricio Araya-Polo

More information

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories HPC Benchmarking Presentations: Jack Dongarra, University of Tennessee & ORNL The HPL Benchmark: Past, Present & Future Mike Heroux, Sandia National Laboratories The HPCG Benchmark: Challenges It Presents

More information

INSPUR and HPC Innovation. Dong Qi (Forrest) Oversea PM

INSPUR and HPC Innovation. Dong Qi (Forrest) Oversea PM INSPUR and HPC Innovation Dong Qi (Forrest) Oversea PM dongqi@inspur.com Contents 1 2 3 4 5 Inspur introduction HPC Challenge and Inspur HPC strategy HPC cases Inspur contribution to HPC community Inspur

More information

Tianhe-2, the world s fastest supercomputer. Shaohua Wu Senior HPC application development engineer

Tianhe-2, the world s fastest supercomputer. Shaohua Wu Senior HPC application development engineer Tianhe-2, the world s fastest supercomputer Shaohua Wu Senior HPC application development engineer Inspur Inspur revenue 5.8 2010-2013 6.4 2011 2012 Unit: billion$ 8.8 2013 21% Staff: 14, 000+ 12% 10%

More information

如何查看 Cache Engine 缓存中有哪些网站 /URL

如何查看 Cache Engine 缓存中有哪些网站 /URL 如何查看 Cache Engine 缓存中有哪些网站 /URL 目录 简介 硬件与软件版本 处理日志 验证配置 相关信息 简介 本文解释如何设置处理日志记录什么网站 /URL 在 Cache Engine 被缓存 硬件与软件版本 使用这些硬件和软件版本, 此配置开发并且测试了 : Hardware:Cisco 缓存引擎 500 系列和 73xx 软件 :Cisco Cache 软件版本 2.3.0

More information

ECE 574 Cluster Computing Lecture 2

ECE 574 Cluster Computing Lecture 2 ECE 574 Cluster Computing Lecture 2 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 24 January 2019 Announcements Put your name on HW#1 before turning in! 1 Top500 List November

More information

The Future of the Message-Passing Interface. William Gropp

The Future of the Message-Passing Interface. William Gropp The Future of the Message-Passing Interface William Gropp www.cs.illinois.edu/~wgropp MPI and Supercomputing The Message Passing Interface (MPI) has been amazingly successful First released in 1992, it

More information