Heterogeneous Processing

Heterogeneous Processing Maya Gokhale maya@lanl.gov

Outline. This talk covers:
- Architectures: chip, node, system
- Programming models and compilers (overview)
- Applications
- Systems software

Clock rates and transistor counts (comparison chart): FPGA, Cell, Intel dual core, Opteron.

How are the transistors being used? (die-photo comparison: G4e, dual core Opteron, FPGA, Clearspeed, Cell, Mathstar) http://arstechnica.com/articles/paedia/cpu/p4andg4e.ars/3

Clearspeed CSX600 coprocessor layout (die diagram: debug, SRAM, memory interface, bus, processor core, ISU, system services, chip-to-chip bridge ports)
- Array of 96 Processor Elements at 250 MHz
- IBM 0.13µ process, 8-layer metal (copper)
- 47% logic, 53% memory: more logic than most processors!
- 15 mm x 15 mm die size, 128 million transistors
- Approximately 10 Watts
- Processor sparing allows high yields, currently ~40%

CSX600 processor core: Multi-Threaded Array Processing
- Programmed in high-level languages
- Hardware multi-threading for latency tolerance
- Asynchronous, overlapped I/O
- Run-time extensible instruction set; bi-endian
- Array of 96 Processor Elements (PEs); each is a VLIW core, not just an ALU
- 4-stage 32-bit or 64-bit multiply-add pipelines; divide/square root unit
- Built-in PE fault tolerance and resiliency
- Closely coupled 6KB local SRAM per PE; independent address pointers per PE; 128 bytes of registers per PE
- High performance, low power dissipation

Mathstar Field Programmable Object Array (FPOA): a coarse-granularity reprogrammable device
- Silicon Objects are 16-bit configurable machines, such as an Arithmetic Logic Unit (ALU), Multiply-Accumulator (MAC), or Register File (RF); both Silicon Object behavior and the interconnection among Silicon Objects are field-programmable
- 400 Silicon Objects operating at up to 1 GHz: 256 ALUs, 64 MACs, 80 RFs
- Two bi-directional 500 MHz DDR 16-bit LVDS ports (64 Gbps of bandwidth)
- 96 pins of LVCMOS GPIO, operating either synchronously or asynchronously at up to 100 MHz
- Twelve banks of 500 MHz internal SRAM (57 GBytes/sec)
- Two 266 MHz 36-bit DDR (72 bits per cycle) RLDRAM II controllers for external memory accesses (4.8 GBytes/sec)
- On-chip RAM, high-speed LVDS I/O and general purpose I/O
- Optimized for DSP applications

Cell (source: Wikipedia). The Cell Broadband Engine ASIC contains:
- A 64-bit PowerPC core with two hardware threads
- 8 single precision FP processors (SPEs) with a 4-way SIMD instruction set and 256KB of local memory each
- Element Interconnect Bus, 200 GB/s peak
- 4GB XDR DRAM (based on Rambus, lower latency than DDR), 25.6 GB/s
- Runs at 3.2 GHz

GRAPE gravity pipeline
- Series of ASICs specifically tailored to perform various sorts of force calculations
- Special purpose accelerator board on the PCI bus of a workstation
- GRAPE computes only the N^2 force calculations; the microprocessor does all the rest
- Communication cost is O(N) (N = number of particles), computation is O(N^2)
- GRAPE function also ported to reconfigurable hardware (FPGA): PROGRAPE
- http://grape.astron.s.u-tokyo.ac.jp/~makino/papers/gbp2000-full
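
To make the division of labor concrete, the sketch below shows the O(N^2) pairwise gravitational accelerations a GRAPE-style board would compute, and the O(N) integration step the host microprocessor keeps. This is a generic C illustration, not GRAPE's actual interface; the function and variable names are made up.

```c
#include <math.h>

#define G 6.674e-11  /* gravitational constant */

/* O(N^2) part: what a GRAPE-style accelerator computes.
 * For each particle i, accumulate acceleration from every other particle j. */
void compute_accelerations(int n, double pos[][3], double mass[], double acc[][3])
{
    for (int i = 0; i < n; i++) {
        acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = pos[j][0] - pos[i][0];
            double dy = pos[j][1] - pos[i][1];
            double dz = pos[j][2] - pos[i][2];
            double r2 = dx*dx + dy*dy + dz*dz;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            acc[i][0] += G * mass[j] * dx * inv_r3;
            acc[i][1] += G * mass[j] * dy * inv_r3;
            acc[i][2] += G * mass[j] * dz * inv_r3;
        }
    }
}

/* O(N) part: what stays on the host microprocessor
 * (simple leapfrog-style velocity/position update). */
void integrate(int n, double pos[][3], double vel[][3], double acc[][3], double dt)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 3; k++) {
            vel[i][k] += acc[i][k] * dt;
            pos[i][k] += vel[i][k] * dt;
        }
}
```

Only the particle positions and masses (O(N) data) cross the PCI bus each step, while the accelerator performs the O(N^2) arithmetic, which is why the accelerator pays off for large N.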

System on Chip (SoC) architectures: multiple, heterogeneous resources integrated onto the IC
- Memory: SRAM for L1/L2 cache; embedded DRAM (BG/L); configurable Block RAM on FPGA (configure width, number of ports)
- Computational resources:
  - Multiple complete CPUs (cache coherence is an issue: BG/L nodes are not coherent, Opterons are)
  - RISC processor + multiple SIMD/vector units: Cell PPU + SPUs
  - Vertex and fragment pipelines: GPUs
  - Multiple ALUs: Clearspeed's 96 SIMD processing elements
  - Multiple function units: 2 FP units on each BG/L processor
  - Arrays of hard multipliers and MAC units on FPGAs
  - Reconfigurable logic: create specialized DSP pipelines, floating point units, crypto processors
  - ASICs: GRAPE gravity pipeline for n-body computations

Heterogeneous system characteristics
- Specialized to an application class: floating point, multimedia, signal/image processing, cryptography
- Provides extremely high performance on kernel operations
- Relative power requirements:
  - Dual core Opteron: 68W (rev. F chip, 90nm) or 95W
  - Dual core Intel: 135W
  - Cell (PCI-E card): 150W
  - GPU: 28W idle, 50W 2D graphics, 120-130W 3D graphics
  - Clearspeed: 25W
  - FPGA: 8W
  - GRAPE-4 chip: 5-8W

Memory hierarchy
- Register set: O(100) bytes, at processor clock rate (2-3 GHz)
- On-chip SRAM: 8KB-16KB L1, 1MB L2; 100-200 MHz (can be DDR or QDR); multiple parallel banks, possibly dual ported
- Off-chip SRAM: 16MB; 100-200 MHz; multiple parallel banks
- Off-chip DRAM: 4-8GB, 16MHz
- The local memory hierarchy is implicit in processor-based architectures: write code to be cache friendly; minimize non-local memory accesses in NUMA parallel machines
- The memory hierarchy is exposed in accelerator architectures, and memory usage must be explicitly managed: Block RAM on FPGAs, local scratchpad and DRAM in Clearspeed, local memory and DRAM in Cell
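
A minimal sketch of what "explicitly managed" means, assuming a scratchpad of LOCAL_TILE floats and hypothetical dma_get/dma_put primitives (modeled with memcpy here so the sketch runs on a plain host): the program, not the cache hardware, stages each tile of a DRAM-resident array into local memory, computes on it, and writes it back. The same pattern applies to Cell local store, ClearSpeed PE memory, or FPGA Block RAM.

```c
#include <string.h>
#include <stddef.h>

#define LOCAL_TILE 1024  /* elements assumed to fit in the local scratchpad */

/* Hypothetical DMA primitives standing in for a platform API
 * (MFC DMA on Cell, memory-transfer calls on ClearSpeed, etc.). */
static void dma_get(void *local_dst, const void *remote_src, size_t bytes) {
    memcpy(local_dst, remote_src, bytes);
}
static void dma_put(void *remote_dst, const void *local_src, size_t bytes) {
    memcpy(remote_dst, local_src, bytes);
}

static float tile[LOCAL_TILE];  /* would live in fast local/scratchpad memory */

/* Scale a large DRAM-resident vector one tile at a time.
 * The programmer, not a cache, decides what is resident in local memory. */
void scale_vector(float *dram_data, size_t n, float alpha)
{
    for (size_t base = 0; base < n; base += LOCAL_TILE) {
        size_t count = (n - base < LOCAL_TILE) ? (n - base) : LOCAL_TILE;

        dma_get(tile, dram_data + base, count * sizeof(float));  /* stage in */

        for (size_t i = 0; i < count; i++)                       /* compute locally */
            tile[i] *= alpha;

        dma_put(dram_data + base, tile, count * sizeof(float));  /* write back */
    }
}
```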

Putting heterogeneous processors into systems

Workstation accelerator: I/O card
- Collection of interconnected SoC processors; each processor has a dedicated (or shared) memory subsystem
- The accelerator system attaches to the host workstation via an I/O bus
- Data acquisition channel for direct access to real-time data streams

Clusters with accelerators
- High performance, multi-level interconnect: Infiniband, Myrinet, GigE
- High performance microprocessors: 64-bit, multi-core, multi-socket
- Co-processors: graphics boards, floating point arrays, FPGAs
- The accelerator can be: a peer to the microprocessor on the network; a peer to the microprocessor across sockets on HyperTransport; on the I/O bus; on the memory bus

SRC Computer (system diagram: Hi-Bar switch connecting microprocessor/SNAP/PCI-X nodes, MAP modules with chaining and GPIO, common memory, and disk/LAN/SAN/WAN connections)
- FPGA board (MAP) augments the microprocessor
- MAP on DIMM interface, 2.8 GB/s
- 2 large FPGAs, multiple banks of on-board SRAM, on-board DRAM
- Provides for 20 simultaneous memory accesses @ 150 MHz
- FPGAs can be interconnected independently of the microprocessor

Cray XD1
- Special ASIC to talk HyperTransport
- Only one small FPGA co-processor
- 16MB QDR SRAM, 12 GB/s bandwidth

FPGA co-processors on Opteron motherboards
- HyperTransport connection between Opteron and FPGA
- Use DIMM slots on the motherboard for FPGA memory
- Include additional off-chip SRAM
- Two companies: DRC, XtremeData

Clusters augmented with floating point arrays
- Clearspeed board: recently partnered with IBM to build a cluster of FPA-accelerated nodes; the board contains two CS processors; each processor has 96 double precision SIMD PEs, a RISC control processor, and an I/O controller; they advertise 50 GF sustained DGEMM using 25W
- GPGPU: programmable graphics processors
- Cell Blade: Mercury and IBM have partnered on the Cell Blade architecture; two CBEs per blade; PCI-E or IB to connect the blade to the microprocessor
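
The DGEMM figure matters because dense matrix multiply is normally reached through a standard BLAS call, so an accelerated node can pick up that work without source changes if the vendor supplies a drop-in BLAS. The sketch below is just an ordinary host-side CBLAS call of the kind such a library could intercept; it shows no ClearSpeed-specific API.

```c
#include <cblas.h>

/* Computes C = alpha*A*B + beta*C with a standard BLAS call.
 * An accelerator vendor's drop-in BLAS could route this to the board;
 * from the application's point of view nothing changes. */
void matmul(int m, int n, int k, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0,          /* alpha */
                A, k,         /* A is m x k, leading dimension k */
                B, n,         /* B is k x n, leading dimension n */
                0.0,          /* beta  */
                C, n);        /* C is m x n, leading dimension n */
}
```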

BG/L
- Each ASIC contains two 700 MHz PPC 440 cores (32-bit), each with a dual-pipeline double precision FP unit
- On-chip DRAM controller
- On-chip interconnect network interface
- 3 communication networks: 3D torus for peer-to-peer, tree for collective communication, fast interrupt network for barriers
- From http://www.llnl.gov/asci/platforms/bluegenel/images/bgl_slide2.gif

Software environments
- Back when floating point co-processors were introduced, targeting the co-processor from the compiler was easy: differentiate by data type. Operations on integer data types translated to integer opcodes and mapped to the integer unit; operations on FP data types translated to floating point opcodes and mapped to the floating point unit.
- It was no longer straightforward when MMX/SSE were introduced: it is harder to determine whether an integer operation should use the multimedia unit or the integer unit, and the compiler is needed to re-factor loops and vectorize (see the SSE sketch below). Alternatives: use libraries written in assembly code (the programmer has to call the library), or add a new data type, e.g. poly, to refer to vector data.
- The problem is even harder with heterogeneous accelerators.
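
A minimal illustration of the vectorization problem, written with x86 SSE intrinsics: the scalar loop and the 4-wide SIMD loop below compute the same result, but someone (compiler, library writer, or programmer) has to produce the second form and handle the leftover elements.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Scalar version: what the programmer naturally writes. */
void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SSE version: 4 single-precision adds per instruction.
 * This is the re-factoring a vectorizing compiler (or a hand-written
 * library) has to do on the programmer's behalf. */
void add_sse(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                     /* scalar cleanup for the tail */
        c[i] = a[i] + b[i];
}
```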

Programming models and compilers
- With heterogeneous accelerators there are: multiple function units, multiple execution units, multiple threads of control, an exposed memory hierarchy, and asymmetrical parallelism (a control RISC processor plus an array of programmable datapath-oriented processors)
- When the accelerator is on an interconnection network or I/O bus, data reorganization and communication costs must be factored into the benefit afforded by the accelerator (see the break-even sketch below)
- The current state of practice is to leave it to the application programmer: partition the program between control and data path; re-organize and align data to match accelerator requirements; communicate and synchronize between multiple parallel processes
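
As a back-of-the-envelope way to "factor in" communication cost, the helper below compares host-only execution time against transfer-plus-accelerator time. The parameters and the simple linear model are illustrative assumptions, not measurements from any of the systems above.

```c
#include <stdbool.h>

/* Rough offload model: moving the data costs latency + bytes/bandwidth,
 * and the kernel then runs at the accelerator's rate instead of the host's.
 * Offloading only pays if transfer + accelerated compute beats host compute. */
bool offload_pays_off(double work_flops,      /* flops in the kernel           */
                      double bytes_moved,     /* data shipped over the I/O bus */
                      double host_flops_s,    /* host sustained flop rate      */
                      double accel_flops_s,   /* accelerator sustained rate    */
                      double bus_bytes_s,     /* I/O bus bandwidth             */
                      double bus_latency_s)   /* per-transfer latency          */
{
    double t_host  = work_flops / host_flops_s;
    double t_accel = bus_latency_s
                   + bytes_moved / bus_bytes_s
                   + work_flops / accel_flops_s;
    return t_accel < t_host;
}
```

With made-up numbers of a 2 GFLOP/s host, a 50 GFLOP/s accelerator, and a 1 GB/s I/O bus (latency ignored), the kernel needs a little over 2 floating point operations per byte moved before offloading wins; below that, the bus dominates and the accelerator sits idle.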

Programming models for heterogeneous computing
- Many opportunities for parallelism require multiple (possibly hierarchical) programming models (sketched below):
  - Process level: between multiple nodes or sockets
  - Thread level: among multiple, possibly heterogeneous, cores
  - Vector/SIMD/systolic/pipeline: within a core
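
A minimal sketch of how the three levels stack in one program, using MPI for the process level, OpenMP for the thread level, and an innermost loop left for SIMD (via a vectorizing compiler or intrinsics, as in the SSE example above). The decomposition shown is illustrative, not taken from the talk.

```c
#include <mpi.h>
#include <omp.h>

#define N 1048576
static float a[N], b[N], c[N];   /* initialization of a and b omitted */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                       /* process level: one rank per node/socket */

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int chunk = N / nranks;                        /* each rank owns a slice of the arrays */
    int lo = rank * chunk;
    int hi = (rank == nranks - 1) ? N : lo + chunk;

    #pragma omp parallel for                       /* thread level: cores within the node */
    for (int i = lo; i < hi; i++)
        c[i] = a[i] + b[i];                        /* innermost level: this loop is what a
                                                      vectorizing compiler turns into SIMD ops */

    /* gathering the slices back together is omitted */
    MPI_Finalize();
    return 0;
}
```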

Applications
- Accelerator architectures are driven by commercial forces: network routing (FPGA); signal and image processing (FPGA, FPOA); multimedia (GPU, Cell)
- The HPC community re-engineers applications to fit accelerators: cryptography; SAR; hyperspectral imagery and video; financial codes; seismic; scientific simulations

GPUs. Pat McCormick pat@lanl.gov

Why graphics processors?
- They're everywhere; most sit idle in desktop systems
- Performance and cost: ~240 GFLOPS peak vs. 12 for Pentium 4; 40+ GB/sec memory bandwidth vs. 6 for Pentium 4
- Designed for parallelism: lots of math units; local 4-way SIMD; dual/co-issue; transistors spent on compute rather than out-of-order execution, prediction, etc.

GPU Performance Trends

The Graphics Pipeline

Architecture (block diagram based on the GeForce 6800 architecture, courtesy of NVIDIA Corp.)
- MIMD engines and SIMD engines
- 48+ cores in the latest GPUs

Programming models: drive the GPU with a graphics-centric API, OpenGL or DirectX (see the OpenGL sketch below).
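
To show what "graphics-centric" means in practice, here is a heavily abridged C sketch of the classic GPGPU idiom of that era: load the input array as a texture, bind a fragment shader that does the per-element math, draw one full-screen quad so the shader runs once per output element, and read the pixels back. Context creation, texture and framebuffer setup, and shader compilation are omitted; input_tex, program, WIDTH, and HEIGHT are assumed to come from that omitted setup code.

```c
#include <GL/gl.h>

/* Assumed to exist from omitted setup code: an RGBA float texture holding
 * the input array, a compiled fragment shader program, and an off-screen
 * render target of WIDTH x HEIGHT pixels. */
extern GLuint input_tex, program;
enum { WIDTH = 1024, HEIGHT = 1024 };

void run_kernel_on_gpu(float *result /* WIDTH*HEIGHT*4 floats */)
{
    glViewport(0, 0, WIDTH, HEIGHT);

    glUseProgram(program);                    /* the "kernel" is a fragment shader */
    glBindTexture(GL_TEXTURE_2D, input_tex);  /* the input array is a texture */

    /* Drawing a full-screen quad makes the rasterizer invoke the fragment
     * shader once per output element. */
    glBegin(GL_QUADS);
    glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
    glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f, -1.0f);
    glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f,  1.0f);
    glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f,  1.0f);
    glEnd();

    /* The "output array" comes back as pixels. */
    glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_FLOAT, result);
}
```

The high-level systems on the next slide (Brook, Sh, PeakStream, Scout) exist precisely to hide this detour through textures, quads, and pixels.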

Programming models
- Low-level options: ATI's CTM (Close To the Metal), SIGGRAPH 2006 presentation; removes graphics state issues, but you code at the assembly level
- High-level options:
  - PeakStream (Matt's talk this afternoon)
  - Scout (Jeff's talk this afternoon)
  - Brook, from Stanford (http://graphics.stanford.edu/projects/brookgpu)
  - Sh, a C++-based metalanguage (http://libsh.org)

Agenda
- Introduction to heterogeneous computing (Maya Gokhale and Pat McCormick, LANL)
- Applications:
  - The Chances and Challenges of Parallelism: Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices (Robert Strzodka, Stanford University)
  - On the Acceleration of Graph Problems on FPGA (Zack Baker, LANL)
  - Speech Recognition on Cell Broadband Engine (Yang Liu, LLNL)
  - Transport Kernel on Cell Broadband Engine (Paul Henning, LANL)
- Systems software and tools:
  - Peakstream Development Environment (Matt Papakipos, Peakstream)
  - The Scout GPU compiler (Jeff Inman, LANL)
  - Array allocation in non-cached memory systems (Justin Tripp, LANL)
  - Compiler Support for Heterogeneous Computing in a CELL Processor (Yuan Zhao, Rice)
  - Program analysis tools for Heterogeneous Computing (Matt Sottile, LANL)
  - Operating Systems issues in Heterogeneous Computing (Ron Minnich, LANL)