The Road from Peta to ExaFlop


The Road from Peta to ExaFlop Andreas Bechtolsheim June 23, 2009

HPC Driving the Computer Business
Chart: Server Unit Mix (IDC 2008) for Enterprise, HPC, and Web servers, 2003-2013
- HPC grew from 13% of units in 2003 to 33% in 2008
- HPC projected to be the highest % of servers by 2013

Changes in the Computer Business
- Virtualization consolidates Enterprise datacenters: reduces the number of servers required in the Enterprise
- Cloud Computing outsources applications: further reduces the size of the traditional Enterprise server market
- High Performance Computing gaining share: open-ended demand for more performance, does not virtualize
- Moore's Law benefits the HPC market: go faster, faster

High-speed Fabrics Everywhere
- 10 GigE and InfiniBand shipping in volume
- HPC, database, and storage clusters
- Outstanding scaling for a wide range of applications
- Predictable roadmaps to 100 Gbps and beyond

Clusters are Driving the HPC Market
Chart: cluster vs. non-cluster share of the HPC market by quarter, Q1 2003 through Q4 2007 (0-90%)

HPC Market Forecast (IDC 2008)

Top500 List November 2002: Projected Performance Development
Chart (20th list, Nov. 2002, www.top500.org): SUM, N=1 (ES), and N=500 trend lines from 100 Mflop/s to 10 Pflop/s, 1993-2009

Top 500 June 2008: First PF Result (performance development chart, projected to 2020)

Top500 Projected Performance (chart): extrapolation reaches 1000 PFlops in the 2016-2020 timeframe

Top 500 List Observations
- It took 11 years to get from 1 TF to 1 PF
- Performance doubled approximately every year
- Assuming the trend continues, 1 EF by 2020
- Moore's Law predicts 2X transistors every 2 years
- Need to double every year to achieve 1 EF in 2020
- Question: can this be achieved?
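A quick back-of-the-envelope check of these growth rates (a sketch; the 11-year and 12-year spans follow from the slide's dates of the first TF, the first PF in June 2008, and the 2020 target):

```python
# 1 TF -> 1 PF: a factor of 1000 in ~11 years
historical_growth = 1000 ** (1 / 11)      # ~1.87x per year

# 1 PF (2008) -> 1 EF (2020): another factor of 1000 in 12 years
required_growth = 1000 ** (1 / 12)        # ~1.78x per year

# Moore's Law alone: 2x transistors every 2 years
moore_growth = 2 ** (1 / 2)               # ~1.41x per year

print(f"historical:              {historical_growth:.2f}x/year")
print(f"required for 1 EF by 2020: {required_growth:.2f}x/year")
print(f"Moore's Law alone:       {moore_growth:.2f}x/year")
```

Transistor scaling by itself falls well short of the required rate, which is the point of the "More than Moore" math on the next slide.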

Challenges
- Semiconductor Roadmap
- Packaging Technology
- Power and Cooling
- Local Interconnect
- External Interconnect
- Storage
- System Software Scalability
- Exploiting Parallelism

The Basic Math: More than Moore
Aggregate Performance = N * C * F * I
- N: Number of Modules, +20% / year (Budget, Power)
- C: Cores per Module, +40% / year (Technology, Power)
- F: Frequency, +5% / year (Technology, Power)
- I: Instruction Efficiency, +15% / year (Architecture, Power)
- TOTAL: +100% / year
Must increase system size, cores per module, and instruction efficiency to double every year
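Multiplying the per-factor growth rates from this slide confirms that they compound to roughly a doubling per year (a sketch using the slide's numbers):

```python
# Annual growth factors from the slide
n_modules    = 1.20   # +20%/year: number of modules
c_cores      = 1.40   # +40%/year: cores per module
f_frequency  = 1.05   # +5%/year : clock frequency
i_efficiency = 1.15   # +15%/year: instruction efficiency

aggregate = n_modules * c_cores * f_frequency * i_efficiency
print(f"aggregate growth: {aggregate:.2f}x per year")  # ~2.03x, i.e. ~+100%/year
```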

Semiconductor Technology Roadmap
Chart (2007-2022 ITRS range): product functions per chip [Giga (10^9) bits or transistors] by year of production, 1995-2025, for Flash (4-bit MLC, 2-bit MLC, SLC), DRAM, and MPU (high-performance and cost-performance)
Average industry "Moore's Law": 2x functions/chip per 2 years

TeraFlops/CPU Socket over Time
Chart: throughput (TF/s) per socket, 0-10 TF/s, for 2010 (32nm), 2012 (22nm), 2015 (16nm), 2018 (11nm), and 2020 (8nm)

CPU Module [Socket] (2010 => 2020)

                      2010        2020        Ratio
Clock Rate            2.5 GHz     4 GHz       5%/Y
FLOPS/Clock           4           16          4X
FLOPS/Core            10 GF       64 GF       6.4X
Cores/Module          16          160         10X
FLOPS/Module          160 GF      10 TF       64X
Mem Bandwidth         30 GB/s     2 TB/s      64X
Mem Bandwidth/FLOP    0.2 B/F     0.2 B/F     =
IO Bandwidth          3 GB/s      192 GB/s    64X
IO Bandwidth/FLOP     0.02 B/F    0.02 B/F    =
Power / Module        250W        500W        2X
Power Efficiency      0.6 GF/W    20 GF/W     32X

Challenging but doable
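A small check of the bandwidth-per-FLOP and power-efficiency ratios implied by the table (all numbers taken from the table above):

```python
# Per-module parameters from the 2010 vs 2020 table
modules = {
    "2010": {"flops": 160e9, "mem_bw": 30e9, "io_bw": 3e9,   "power_w": 250},
    "2020": {"flops": 10e12, "mem_bw": 2e12, "io_bw": 192e9, "power_w": 500},
}

for year, m in modules.items():
    mem_bf = m["mem_bw"] / m["flops"]            # bytes of memory bandwidth per FLOP
    io_bf  = m["io_bw"] / m["flops"]             # bytes of I/O bandwidth per FLOP
    gf_per_w = m["flops"] / 1e9 / m["power_w"]   # GF per Watt
    print(f"{year}: {mem_bf:.2f} B/F memory, {io_bf:.3f} B/F I/O, {gf_per_w:.1f} GF/W")
# Both years keep ~0.2 B/F memory and ~0.02 B/F I/O, while efficiency rises from ~0.6 to 20 GF/W
```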

System Performance (2010 => 2020)

                      2010        2020        Ratio
FLOPS/Module          160 GF      10 TF       64X
Modules/System        16,000      100,000     6X
FLOPS/System          2.5 PF      1 EF        320X
Cores/Module          16          160         10X
Cores/System          256,000     16M         64X
Memory BW/Module      30 GB/s     2 TB/s      64X
Memory BW/FLOP        0.2 B/F     0.2 B/F     =
IO Bandwidth          3 GB/s      192 GB/s    64X
IO Bandwidth/FLOP     0.02 B/F    0.02 B/F    =
Power / Module        250W        500W        2X
Power / System        4 MW        50 MW       12X

Challenging but doable
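The system totals follow directly from the per-module figures; a sketch that reproduces them:

```python
# System-level totals implied by the 2010 vs 2020 module parameters
systems = {
    "2010": {"modules": 16_000,  "flops_per_module": 160e9, "watts_per_module": 250},
    "2020": {"modules": 100_000, "flops_per_module": 10e12, "watts_per_module": 500},
}

for year, s in systems.items():
    total_pf = s["modules"] * s["flops_per_module"] / 1e15
    total_mw = s["modules"] * s["watts_per_module"] / 1e6
    print(f"{year}: {total_pf:.1f} PF at {total_mw:.0f} MW")
# 2010: ~2.6 PF at ~4 MW; 2020: ~1000 PF (1 EF) at ~50 MW
```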

The Biggest Challenge: Memory Bandwidth
- Memory bandwidth must grow with throughput
- 2020 CPU needs > 64X the memory bandwidth
- Traditional package I/O pins are basically fixed
- Electrical signaling hitting speed limits
- How to scale memory bandwidth?
- Solution: Multi-Chip 3D Packaging

Multi-Chip 3D Packaging
- Wire-bonded stacked die
- Thru-silicon via (TSV) stacking
- Need to combine CPU + memory on one module

High-density 3D Multi-Chip Module (MCM)
Cross-section diagram (TSV stack, embedded die): mold resin thickness on top of die 0.10 mm, substrate thickness 0.16 mm, ball pitch 0.8 mm, die attach thickness 0.015 mm

Benefits of MCM Packaging
- Enables much higher memory bandwidth
- Greatly reduces memory I/O power
- Memory signals are local to the MCM
- Reduces system size and power
- More channels, wider interfaces, faster I/O

MCM Enables Fabric I/O Integration
- 2010: 1*4X QDR (32 Gbps / direction), support for global memory addressing
- 2020: 6*12X XDR (1.72 Tbps / direction), mesh or higher-radix fabric topologies
- 12X copper for module-module traces
- 12X optical for board-board, rack-rack
- Very high message rates (several billion/sec)
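The per-direction numbers on this slide are consistent with roughly 8 Gb/s of payload per lane in 2010 and roughly 24 Gb/s per lane in 2020; a hedged sketch of that arithmetic (the 2020 effective lane rate is my assumption, not stated on the slide):

```python
# 2010: one 4X QDR port; QDR signals at 10 Gb/s per lane with 8b/10b encoding
lanes_2010 = 4
payload_per_lane_2010 = 10e9 * 8 / 10                       # ~8 Gb/s per lane
print(f"2010: {lanes_2010 * payload_per_lane_2010 / 1e9:.0f} Gb/s per direction")   # ~32 Gb/s

# 2020: six 12X ports; assume ~24 Gb/s effective per lane (assumption)
lanes_2020 = 6 * 12
payload_per_lane_2020 = 24e9
print(f"2020: {lanes_2020 * payload_per_lane_2020 / 1e12:.2f} Tb/s per direction")  # ~1.73 Tb/s
```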

Benefits of Integrating the Router with the CPU
- Best way to get the highest message rate
- Match injection and link bandwidth
- No congestion on receive
- Avoids intermediate bus conversions
- Separate router chips are I/O bound
- Eliminates half of the I/O pins and power
- Lowest cost and lowest power design

What is the Best Fabric for Exascale?
- Optimal solution depends on economics: cost of NIC, router, optical interconnect
- Good global and local bandwidth
- A pure 3D torus for an Exascale system is too large
- A combination of mesh and tree looks promising
- Higher-radix meshes significantly reduce hop count
- Robust dynamic routing desirable: needed for load balancing and to recover from hardware failures
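To illustrate why higher-radix topologies reduce hop count, a rough sketch comparing torus networks of similar size but different radix (the topology parameters are illustrative assumptions, not from the slide):

```python
# Average hop count of a k-ary n-cube (torus) under uniform traffic: roughly n * k / 4
def avg_hops_torus(k, n):
    return n * k / 4.0

# ~100,000 nodes as a low-radix 3D torus vs a higher-radix 5D torus
print(f"3D torus, 46^3 = {46**3:,} nodes: ~{avg_hops_torus(46, 3):.1f} hops on average")
print(f"5D torus, 10^5 = {10**5:,} nodes: ~{avg_hops_torus(10, 5):.1f} hops on average")
# The higher-radix network cuts the average hop count by roughly 3x at the same scale
```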

Expected Link Data Rate
Chart: Gbps per lane, 2009-2020 (0-100 Gbps)
- 10 Gbps shipping today
- 25 Gbps expected 2012
- 50 Gbps in 2016
- Higher speeds will require an integrated optics interface

Next Challenge: Power
- Power efficiency is critical
- Design for greatest power efficiency
- Clock frequency versus power
- Minimize interface and I/O power
- Optimize CapEx and OpEx

Power per Core
Chart: power per core for a constant performance improvement of 20% per generation, at power densities of 200 W/cm^2, 100 W/cm^2, 50 W/cm^2, and 25 W/cm^2, versus a constant power density of 25 W/cm^2
Source: D. Frank, C. Tyberg, IBM Research

Power Efficiency (Power per Throughput)
Chart: power / throughput versus frequency
Power = Clock * Capacitance * Vdd^2
High-frequency designs consume much more power
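A small sketch of why power per unit of throughput grows so quickly with clock rate: with dynamic power P = C * f * Vdd^2 and supply voltage scaled roughly in proportion to frequency, power grows roughly as f^3 while throughput grows only as f (the proportional voltage scaling is an assumption for illustration):

```python
# Dynamic power: P = C * f * Vdd^2; assume Vdd must scale ~linearly with frequency
def relative_power(freq_ratio):
    vdd_ratio = freq_ratio              # assumption: Vdd proportional to frequency
    return freq_ratio * vdd_ratio ** 2

for freq_ratio in (1.0, 1.5, 2.0):
    p = relative_power(freq_ratio)
    print(f"{freq_ratio:.1f}x clock -> {p:.1f}x power, "
          f"{p / freq_ratio:.1f}x power per unit of throughput")
# 2x clock -> ~8x power -> ~4x worse power efficiency
```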

Power Efficiency Strategy
- Reduce I/O power as much as possible
  - Requires MCM packaging, lower-voltage interface levels
  - Saves more than 50% compared to power today
- Minimize leakage current
  - Lower temperature / liquid cooling helps
  - Optimize transistor designs and materials
- Simplify CPU architecture
  - Lower memory latency simplifies pipelines
  - Integrate NIC and I/O subsystems
- Most savings from better packaging

Microchannel Fluidic Heatsinks

Liquid Cooling is Essential
- Reduces power
  - Reduces power required to drive I/O
  - Reduces leakage currents
  - Avoids wasting power on moving air
- Increases rack density
  - Reduces number of racks
  - Reduces weight and structural costs
  - Reduces cabling
- Improves heat removal per socket
- Liquid cooling required to increase system density
  - Physics of air cooling are not changing
  - Liquid cooling with microchannels looks promising

Packaging Technology Summary
- Main new development is Multi-Chip Modules
  - Many more signals available on-module than off-module
  - Increases memory bandwidth while reducing power
  - Decouples memory bandwidth from I/O pins
- Fabric interface
  - Integrated NIC reduces power and improves performance
  - Integrated router supports mesh and fat-tree topologies
  - SERDES support copper and optical (fiber) interfaces
- Power and cooling
  - 480VAC to each rack
  - Liquid cooling to reduce power and increase density
  - Dramatic increase in throughput (and power) per rack

ExaScale Storage
- Storage bandwidth requirements
  - 0.1 GB/s per TF
  - 100 GB/s per PF
  - 100 TB/s per EF
- Forget hard disks
  - Disks are not going any faster
  - Useful as a tape replacement
  - At 100 MB/s per disk, 100 TB/s would require 1M disks
- Solid state storage
  - Arriving just in time
  - Rapid performance improvements
  - Rapid cost reduction expected
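The disk-count arithmetic from this slide, plus a rough comparison against PCIe flash (the per-card flash bandwidth is an assumption for illustration; the later slides only say "Multi-GB/sec per PCI Controller"):

```python
# Bandwidth target for a 1 EF system at 0.1 GB/s per TF
exaflop_tf = 1_000_000                  # 1 EF = 1,000,000 TF
target_bw = 0.1e9 * exaflop_tf          # bytes/s -> 100 TB/s

disk_bw = 100e6                         # 100 MB/s per hard disk
print(f"hard disks needed: {target_bw / disk_bw:,.0f}")         # ~1,000,000

flash_bw = 1e9                          # assumption: ~1 GB/s per PCIe flash card
print(f"PCIe flash cards needed: {target_bw / flash_bw:,.0f}")  # ~100,000
```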

Today's SSD vs HDD

Conventional 2.5" HDD        Solid State 2.5" SSD
15K RPM, 146 GB              0 RPM, 64 GB
180 Write IOPS               8K Write IOPS
320 Read IOPS                35K Read IOPS
$1 per IOPS                  $0.10 per IOPS

Sun Flash DIMMs 30,000 Read IOPS, 10,000 Write IOPS Single-Level Flash SATA Interface

PCI Flash Storage 100,000+ IOPS, Up to 1 TB Capacity Low Profile PCI Card Slot

Gartner Flash Forecast (August 2008)

Solid State Storage Summary
- Density doubling each year
- Cost falling by 50% per year
- Access times improving quickly
- Throughput improving quickly
- Phase-change technology looks promising
- Interface moving from SATA to PCI Express
- Multi-GB/s per PCI controller
- 100 TB/s suddenly looks possible

Scaling to ExaScale: CPU Throughput
- Three dimensions of scalability: frequency, cores, FLOPS/core
- Increasing frequency is most difficult
  - Expect modest increases going forward
  - Problem is power consumption per core
- Increasing the number of cores per chip
  - Proportional to technology improvements
  - Moore's Law predicts doubling every 2 years
- Increasing FLOPS per core
  - Increase instruction-set parallelism
  - More functional units and SIMD instructions

Scaling Throughput per Core (cores required)

           GF/Core:  10      16      32      64
1 PF                 100K    64K     32K     16K
10 PF                1M      640K    320K    160K
100 PF               10M     6.4M    3.2M    1.6M
1000 PF              100M    64M     32M     16M

Critical to increase throughput per core

Scaling Throughput per Power (system power in Watts)

           GF/W:     0.64    3       10      20
1 PF                 1.5M    300K    100K    50K
10 PF                15M     3M      1M      500K
100 PF               150M    30M     10M     5M
1000 PF              1500M   300M    100M    50M

Critical to improve power efficiency to reduce OpEx

Scaling Throughput per CPU Module (modules required)

           GF/Module: 160    640     2500    10000
1 PF                  6.4K   1.6K    400     100
10 PF                 64K    16K     4K      1K
100 PF                640K   160K    40K     10K
1000 PF               6.4M   1.6M    400K    100K

Number of CPU modules drives system size and cost
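Each of the three tables above is a single division: the system target divided by the per-core, per-Watt, or per-module figure. A short sketch that reproduces, for example, the 1000 PF (1 EF) rows:

```python
gf_per_core   = [10, 16, 32, 64]
gf_per_watt   = [0.64, 3, 10, 20]
gf_per_module = [160, 640, 2500, 10000]

def required(target_pf, gf_per_unit):
    # Units required = target FLOPS / FLOPS per unit (1 PF = 1e6 GF)
    return target_pf * 1e6 / gf_per_unit

# The 1000 PF (1 EF) row of each table:
print("cores:  ", [f"{required(1000, g):,.0f}" for g in gf_per_core])
print("watts:  ", [f"{required(1000, g):,.0f}" for g in gf_per_watt])
print("modules:", [f"{required(1000, g):,.0f}" for g in gf_per_module])
```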

Technology Summary
- Moore's Law will continue for at least 10 years
  - Transistors will double approximately every 2 years
  - Not enough to double performance every year
  - More than Moore required for 1 EF by 2020
- Frequency gains are very difficult
  - Power increases super-linearly with clock rate
  - Must exploit parallelism with more cores
- Need to increase FLOPS/core
  - Predictable way to increase performance
  - Multiply-add, multiple FPUs, SIMD extensions
- Need to increase memory and I/O bandwidth
  - Need to scale with throughput
  - Need a factor of 64X by 2020

The Software Challenge
- The limits of application parallelism
  - Instruction-set parallelism
  - Number of cores per CPU module
  - Number of CPU modules per system
- Need to exploit parallelism at all levels
  - Quality of compiler code generation
  - Functional parallelism within a node (SMP threads)
  - Data parallelism across nodes (MPI tasks)
- Ultimate question is application parallelism
  - Will require re-architecting of applications
  - Not all applications will scale to Exaflop size
- RAS-related challenges (see the sketch below)
  - MTBI declines with system size
  - Needs high-speed checkpoint restart
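A rough illustration of why the mean time between interrupts shrinks with system size and why fast checkpointing matters (the per-module MTBF figure is an assumption for illustration):

```python
# Assumption: each module has a mean time between failures of 5 years
module_mtbf_hours = 5 * 365 * 24

for modules in (16_000, 100_000):
    # With independent failures, system MTBI ~= module MTBF / module count
    system_mtbi_hours = module_mtbf_hours / modules
    print(f"{modules:,} modules -> system MTBI ~ {system_mtbi_hours * 60:.0f} minutes")
# At ~100,000 modules the system is interrupted roughly every 26 minutes,
# so checkpoints must complete in far less time than that.
```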

System Conclusions
- Expect throughput to double every year
  - Combination of Moore's Law and efficiency gains
  - First 1 EF system likely by 2020
  - Smallest Top 500 system likely 10 PF in 2020
- PetaFlop systems will be very small and affordable
  - One PetaFlop per rack by 2016
  - Personal PetaFlop computer by 2020
  - Greatly broadens the HPC market
- HPC market growth will continue for a long time
  - Big opportunity for small, medium, and large systems
- HPC is a major driver to advance server technology
  - CPUs, memory, fabric, storage, software
  - Ripple-down effect from the largest to the smallest systems