A Preliminary Evaluation of OpenPOWER through Optimizing Stencil-Based Algorithms


Speaker: Jingheng Xu, Tsinghua University
Revolutionizing the Datacenter. Join the Conversation #OpenPOWERSummit

Contents
1. About Us
2. POWER8 Processor
3. CAPI Technology
4. Preliminary Example

About Us
Tsinghua High Performance Geo-Computing Group (HPGC)
Algorithm, Application, Architecture: The Best Computational Solution
- Application: climate simulation, seismic modeling, etc.
- Platform: CPU, GPU, DFE, etc.
- Partner: (partner logos, not captured in the transcription)


POWER8 Processor
- Technology: 22 nm SOI, eDRAM, 15 ML, 650 mm²
- Cores: 12 cores (SMT8); 8 dispatch, 10 issue, 16 execution pipes; 2x internal data flows/queues; enhanced prefetching; 64 KB data cache, 32 KB instruction cache
- Accelerators: crypto and memory expansion; transactional memory; VMM assist
- Caches: 512 KB SRAM L2 per core; 96 MB eDRAM shared L3
- Memory: up to 230 GB/s sustained bandwidth
- Bus Interfaces: durable open memory attach interface; integrated PCIe Gen3; SMP interconnect; CAPI; data move/VM mobility
- Energy Management: on-chip power management microcontroller
[Figure: POWER8 Scale-Out Dual Chip Module, showing the cores with per-core L2, the L3 banks, the chip interconnect, the memory buses, and the SMP, PCIe, and CAPI links]

POWER8 Processor
[Figure: 3D stencil shapes drawn on x, y, z axes for the three test kernels: Jacobi, FD4, FD8]
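
For readers unfamiliar with the kernels named on this slide, here is a minimal C sketch of one 3D Jacobi sweep. The grid layout (x fastest), the names `idx` and `jacobi_sweep`, and the six-neighbour averaging weights are assumptions made for illustration; the slides do not give the coefficients. FD4 and FD8 presumably denote higher-order finite-difference stencils that reach further along each axis, but the talk does not spell this out.

```c
#include <stddef.h>

/* Flattened index into an nx * ny * nz grid stored with x fastest. */
static size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* One Jacobi sweep: every interior point of `out` becomes the average of
 * the six face neighbours of the same point in `in`. Boundary points are
 * copied unchanged. The weights are an illustrative assumption. */
void jacobi_sweep(const double *in, double *out,
                  size_t nx, size_t ny, size_t nz)
{
    for (size_t z = 0; z < nz; ++z)
        for (size_t y = 0; y < ny; ++y)
            for (size_t x = 0; x < nx; ++x) {
                size_t c = idx(x, y, z, nx, ny);
                int interior = x > 0 && x + 1 < nx &&
                               y > 0 && y + 1 < ny &&
                               z > 0 && z + 1 < nz;
                if (!interior) {
                    out[c] = in[c];
                    continue;
                }
                out[c] = (in[c - 1] + in[c + 1] +              /* x neighbours */
                          in[c - nx] + in[c + nx] +            /* y neighbours */
                          in[c - nx * ny] + in[c + nx * ny])   /* z neighbours */
                         / 6.0;
            }
}
```

Each output point reads several input points and does only a handful of floating-point operations, which is why stencil kernels of this kind tend to be limited by memory access rather than arithmetic.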

Lightweight Tuning (POWER8 Processor)
- OpenMP & SMT
- NUMA Ctrl
- GPR-oriented SIMD
- VSR-oriented SIMD
- 3D Cache Blocking
- 2D Cache Blocking
GPR: General-Purpose Register
VSR: Vector-Scalar Register

Results (unit: GFlops)
              Jacobi   FD4   FD8
3D Blocking       80    95   136
2D Blocking      102   175   161
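
To make the blocking and OpenMP steps concrete, the following is a rough C sketch of a six-neighbour Jacobi sweep tiled in two dimensions. It is not the authors' code: which two dimensions are tiled, the function name `jacobi_sweep_blocked_2d`, the tile-size parameters `bx` and `by`, and the stencil weights are all assumptions made for illustration.

```c
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Six-neighbour Jacobi sweep over the interior points, tiled in x and y
 * ("2D blocking") while z is streamed in full inside each tile.
 * bx and by are hypothetical tuning parameters. Boundary points of `out`
 * are left untouched by this routine. */
void jacobi_sweep_blocked_2d(const double *in, double *out,
                             size_t nx, size_t ny, size_t nz,
                             size_t bx, size_t by)
{
    if (nx < 3 || ny < 3 || nz < 3) return;   /* no interior points */
    if (bx == 0 || by == 0) return;           /* invalid tile size  */
    const size_t sy = nx, sz = nx * ny;       /* strides in y and z */

    /* Tiles write disjoint parts of `out`, so they can be shared among
     * OpenMP threads. Without OpenMP enabled the pragma is ignored. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t yy = 1; yy < ny - 1; yy += by)
        for (size_t xx = 1; xx < nx - 1; xx += bx) {
            size_t y_end = min_sz(yy + by, ny - 1);
            size_t x_end = min_sz(xx + bx, nx - 1);
            for (size_t z = 1; z < nz - 1; ++z)
                for (size_t y = yy; y < y_end; ++y)
                    for (size_t x = xx; x < x_end; ++x) {
                        size_t c = z * sz + y * sy + x;
                        out[c] = (in[c - 1] + in[c + 1] +
                                  in[c - sy] + in[c + sy] +
                                  in[c - sz] + in[c + sz]) / 6.0;
                    }
        }
}
```

A 3D-blocked variant would add a third tile loop over z. Tile sizes would normally be chosen so that the planes touched by one tile fit in the per-core cache; suitable values depend on the machine and have to be found by measurement.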



CAPI Technology

CAPI Technology
Without CAPI (device-driver path over PCIe):
- 3 versions of the data (not coherent).
- Thousands of instructions in the device driver.
[Figure: the application (App) and device driver (DD) run on a POWER8 core; variables, input data, and output data exist separately in the application's virtual address space, in the device driver storage area, and on the FPGA, which is reached through the memory subsystem and PCIe]

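CAPI Technology
With CAPI (PSL, the Power Service Layer, on the FPGA side):
- 1 coherent version of the data.
- No device driver calls or instructions.
[Figure: the application (App) on a POWER8 core and the FPGA share the same variables, input data, and output data in one virtual address space, connected through the memory subsystem, PCIe, and the PSL]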

Preliminary Example
Stencil-based RTM (Reverse Time Migration)
Three main challenges:
1. Memory access pressure
2. Computational pressure
3. I/O pressure

Preliminary Example
Three main challenges: memory access (MA); computational pressure (CP); file I/O

Original version
            MA & CP    Total Time   File I/O & Others
1 core      81.64 s    82.94 s      1.30 s
20 cores     4.54 s     7.88 s      3.34 s

POWER-optimized version
            MA & CP    Total Time   File I/O & Others
1 core      21.55 s    22.59 s      1.04 s
20 cores     1.51 s     4.54 s      3.03 s
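
The timings on this slide can be turned into speedups and I/O shares with simple division. The worked arithmetic below is derived only from the numbers shown; the symbols $S$ (speedup) and $f$ (fraction of total time) are notation introduced here, not taken from the talk.

```latex
\begin{align*}
% Scaling from 1 core to 20 cores
S^{\text{orig}}_{\text{MA\&CP}} &= \frac{81.64}{4.54} \approx 17.98, &
S^{\text{orig}}_{\text{total}}  &= \frac{82.94}{7.88} \approx 10.53, \\
S^{\text{opt}}_{\text{MA\&CP}}  &= \frac{21.55}{1.51} \approx 14.27, &
S^{\text{opt}}_{\text{total}}   &= \frac{22.59}{4.54} \approx 4.98, \\[4pt]
% Gain from the POWER-specific optimization at 20 cores
S^{\text{opt vs orig}}_{\text{MA\&CP}} &= \frac{4.54}{1.51} \approx 3.01, &
S^{\text{opt vs orig}}_{\text{total}}  &= \frac{7.88}{4.54} \approx 1.74, \\[4pt]
% Share of total time spent in file I/O and others at 20 cores
f^{\text{orig}}_{\text{I/O}} &= \frac{3.34}{7.88} \approx 0.42, &
f^{\text{opt}}_{\text{I/O}}  &= \frac{3.03}{4.54} \approx 0.67.
\end{align*}
```

Read this way, the compute part scales well with cores and with tuning, while the file I/O part does not shrink, so at 20 cores it accounts for roughly two thirds of the optimized run.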

Preliminary Example
Hybrid Algorithm
Host (POWER8):
- Takes charge of I/O and other parts
- Mainly file I/O
- Only one POWER8 core, to avoid write conflicts
Device (FPGA):
- Specifically takes charge of computations
- Adopts CAPI to avoid long-latency data transfer

Preliminary Example
- Original: 20 POWER8 processor cores, 7.9 s in total
- POWER Opt: 20 POWER8 processor cores, 4.5 s in total
- CAPI Version: 1 POWER8 processor core & 1 FPGA, 2.4 s in total*
*There is still some accuracy problem with this result.

Conclusion
Extremely High Performance
OpenPOWER system with CAPI:
- Powerful host
- Flexible device
- Low-latency interface

Jingheng Xu, Haohuan Fu, Yu Song, Hongbo Peng, etc.
18653236889@163.com
Tsinghua University, Beijing, China
IBM China Systems and Technology Laboratory
