Race to Exascale: Opportunities and Challenges
Avinash Sodani, Ph.D., Chief Architect, MIC Processor, Intel Corporation


Exascale Goal: 1 ExaFlops (10^18 Flops) within 20 MW by 2018
[Chart: HPC performance trend and forecast on a log scale, from 100 MFlops in 1993 toward 1 ZFlops by 2029, annotated with applications such as Medical Imaging, Genomics Research, and Weather Prediction]
Solve many as-yet-impossible, life-changing problems. Make PFlop HPC computing affordable and ubiquitous.

Trends to Exascale Performance: roughly 10x performance every 4 years. This predicts that we'll hit Exascale performance in 2018-19.
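As an illustrative check of that extrapolation (assuming a ~10 PFlops starting point in 2011, consistent with the system figures quoted later in this talk):

\[
10\ \text{PF} \times 10^{(2019-2011)/4} = 10\ \text{PF} \times 10^{2} = 1\ \text{EF}
\quad\Rightarrow\quad \text{Exascale around 2018-19.}
\]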

But that doesn't mean it is done! Many system-level challenges remain: power efficiency, compute density, memory technology, network technology, reliability, and software. Exascale systems will integrate multiple technology innovations: microprocessor, power management, parallel software, interconnect, memory, and reliability and resiliency, to name a few.

Today we'll talk about two challenges, power and memory: commonly held myths about power consumption, and the approach forward.

Challenge 1: Compute Power. At the system level: today, 10 PF at 12 MW is 1200 pJ/op; an Exaflop, 1000 PF within 20 MW, is 20 pJ/op. This needs improvements in all system components; the processor subsystem needs to reduce to 10 pJ/op. ~60x improvement is needed for Exascale.
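As a back-of-envelope check of those energy-per-operation figures (simple division of power by flop rate):

\[
\frac{12\ \text{MW}}{10\ \text{PFlops}} = \frac{1.2\times 10^{7}\ \text{W}}{10^{16}\ \text{op/s}} = 1200\ \text{pJ/op},
\qquad
\frac{20\ \text{MW}}{1\ \text{EFlops}} = \frac{2\times 10^{7}\ \text{W}}{10^{18}\ \text{op/s}} = 20\ \text{pJ/op},
\qquad
\frac{1200}{20} = 60\times .
\]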

Challenge 2: Memory. Memory bandwidth is fundamental to HPC performance, and it needs to be balanced with capacity and power. A 1 ExaFlop machine needs ~200-300 PB/sec at ~2-3 pJ/bit. GDDR will be out of steam, and periphery-connected solutions will run into pin-count issues. Existing technology trends leave a 3-4x gap on pJ/bit.
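A rough sense of why 2-3 pJ/bit is the target (a sketch that assumes the full ~200-300 PB/s of memory traffic must fit within a few megawatts of the 20 MW system budget; the exact memory power allocation is my assumption, not a figure from the talk):

\[
250\ \text{PB/s} \approx 2\times 10^{18}\ \text{bit/s};
\qquad
2\times 10^{18}\ \text{bit/s} \times 2.5\ \text{pJ/bit} = 5\ \text{MW},
\]
whereas at today's ~20 pJ/bit the same traffic would consume ~40 MW, more than the entire system budget.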

Power: Commonly Held Myths
Myth 1: IA cannot hit power efficiencies; a major shift in programming paradigm is needed.
Myth 2: OOO/speculation and other single-thread (ST) features are too power hungry.
Myth 3: The IA memory model and cache coherence are too power hungry.
Myth 4: Caches don't help for HPC; they waste power.

Purpose of the Exascale Pursuit: not just to achieve exascale flops. The expectation is that it will force efficiencies and innovations at multiple levels, and the technologies invented in its pursuit will benefit all forms of computing. Hence, it is important that we don't special-case the technologies just for High-Performance Linpack score goals.

My Musings: Given the aggressive goals, it is tempting to go for radical changes: new programming models, big steps back on microarchitecture (removal of coherency, caches, speculation, etc.), and big shifts in the HW/SW boundary. Yet typically, ubiquitous innovations do not rely on major behavior changes (they cause them), do not require major ecosystem enablement as a prerequisite, and simplify instead of complicate. Meeting the targets while retaining the features that make the solution easy to use: that's the real Exascale challenge.

Many Integrated Core (MIC) Architecture: the Knights line of products. Key attributes: many cores and many threads; high power efficiency; wide vectors and advanced FP capability to provide FLOPs; coherent caches; IA ISA and programming model; runs standard programs and toolset. Knights Corner is the 1st Intel MIC product: 22nm process, >50 Intel Architecture cores, with future Knights products to follow. Knights Corner was demoed at SC'11 running 1 TF DP.
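A minimal sketch of what "runs standard programs and toolset" means in practice: an ordinary OpenMP loop in C, with nothing accelerator-specific in the source, is the kind of code the Knights products are meant to run as-is. The SAXPY kernel and sizes below are illustrative, not taken from the talk.

#include <omp.h>
#include <stdio.h>

#define N (1 << 20)

/* Plain OpenMP SAXPY: standard C and standard threading,
   no special accelerator programming model required. */
int main(void) {
    static float x[N], y[N];
    const float a = 2.5f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma omp parallel for            /* many cores, many threads */
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];         /* a compiler can auto-vectorize this loop for wide vector units */

    printf("y[0] = %.1f, max threads = %d\n", y[0], omp_get_max_threads());
    return 0;
}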

Myth 1: IA cannot hit power efficiencies; a major shift in programming paradigm is needed.

Performance/Power Progression. Process: 1.3x-1.4x per generation. Arch/uarch: 1.1x-2.0x per generation. Multi-core to many-core improvement: 4x (one time). Recurring improvement: 1.4-3.0x every 2 years. [Process roadmap: 65nm (2005), 45nm (2007), 32nm (2009), 22nm (2011), then 15nm, 11nm, and 8nm projected for 2013, 2015, 2017, and 2019+, spanning manufacturing, development, and research]
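Multiplying the quoted per-generation factors shows where the recurring range comes from:

\[
1.3 \times 1.1 \approx 1.4\times \ \text{(low end)},
\qquad
1.4 \times 2.0 = 2.8 \approx 3.0\times \ \text{(high end), per 2-year generation.}
\]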

[Chart: picojoules per operation versus time, on a log scale from 10,000 down to 1 pJ/op, showing the multi-core to many-core trajectory approaching the 10-20 pJ/op target around the Exascale timeframe] The gap is reduced to 2-3x from 50x with existing techniques! We have 7-8 years to innovate to cover this gap, and we do not need a new programming paradigm to do that.
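To put the remaining 2-3x gap in perspective (an illustrative estimate assuming roughly four 2-year generations fit in the 7-8 year window):

\[
2^{1/4} \approx 1.19, \qquad 3^{1/4} \approx 1.32,
\]
i.e., on the order of 1.2-1.3x of additional efficiency per generation, beyond the existing trend, would close it.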

Myth 2: OOO/speculation and other single-thread (ST) features are too power hungry.

Core Power Distribution. [Pie charts: core power distribution for a non-compute-heavy application and a compute-heavy application, broken down into fetch+decode, OOO+speculation, integer execution, caches, TLBs, x86 legacy, FP execution, and others; FP execution dominates (about 75%) in the compute-heavy case] Power is dominated by compute, as should be the case. OOO/speculation/TLBs: < 10% of core power. x86 legacy + decode: ~1%.

Chip-level Power Distribution. [Pie chart: chip-level power distribution across the core components (fetch+decode, OOO+speculation, integer execution, caches, TLBs, x86 legacy, FP execution, others) and the uncore, with caches around 45% and the uncore around 40%] At the chip level, core power is an even smaller portion (~15%). x86 support, OOO, and TLBs are ~6% of the chip power; their benefits outweigh the gains from removing them.

Myth 3: The IA memory model and cache coherence are too power hungry.

Coherency Power Distribution. [Pie charts: chip power split across core+cache (~60%), memory (~20%), IO, and the on-die interconnect, with the interconnect traffic further broken down into data transfer, address, snoops/responses, and overhead] Typically, coherency traffic is ~4% of total power. The programming benefits outweigh the power cost.

Myth 4: Caches don't help for HPC; they waste power.

MPKI in HPC Workloads. [Chart: misses per thousand instructions (MPKI) across HPC workloads at various cache sizes] Most HPC workloads benefit from caches: less than 20 MPKI for 1 MB-4 MB caches.
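For reference, the miss-rate metric used here is defined as:

\[
\text{MPKI} = \frac{\text{cache misses}}{\text{instructions retired}} \times 1000 .
\]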

Caches Save Power. [Bar chart: relative bandwidth and relative BW/Watt for memory, L2 cache, and L1 cache] Caches save power because memory communication is avoided. Caches are 8x-45x better at BW/Watt than memory. The power break-even point is around an 11% hit rate (L2 cache).
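The ~11% break-even figure follows from a simple model (a sketch under assumptions: every access pays the L2 lookup energy E_L2, each hit avoids a memory access of energy E_mem, and the L2 is roughly 8-9x better than memory in energy per bit, per the BW/Watt comparison above):

\[
h \cdot E_{\text{mem}} \ge E_{\text{L2}}
\quad\Longrightarrow\quad
h \ge \frac{E_{\text{L2}}}{E_{\text{mem}}} \approx \frac{1}{9} \approx 11\%,
\]
where h is the cache hit rate.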

Challenge 1 (Power): Approach Forward. Achieve power/performance efficiency with a general-purpose core, with the existing programming model and ISA. Support caches and coherence for ease of programming. Drive efficiencies in the general-purpose core: continued reduction in power across the board, reduced power in computation (execution units), and reduced power in the uncore and memory.

Challenge 2 (Memory): Approach Forward. Significant power is consumed in memory; we need to drive 20 pJ/bit down to 2-3 pJ/bit. Balancing BW, capacity, and power is a hard problem. The path is more hierarchical memories and progressively more integration. [Diagram: progression from multi-package usage to multi-chip-package usage to direct-attach usage, with memory, a logic die, and the CPU die moving into the same package]

Exascale System-Level Challenge: a multi-disciplinary approach is necessary. Exascale systems will integrate multiple technology innovations: microprocessor, power management, interconnect, silicon photonics, memory, reliability and resiliency, programmability, and parallel software.

Summary: Many challenges remain on the way to Exascale; these are exciting times. It is important that the ensuing innovations have broader applicability. Exascale efficiencies are within reach with general-purpose cores, without changing the programming model or ISA: the gap to the Exascale pJ/op target is ~2x, with 7-8 years to bridge it. More integration over time reduces power, increases reliability, and increases scalability.

Backup