
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Status Quo
- Previously, CPU vendors tried to use special-purpose units for targeted performance wins: FPUs, SSE, AltiVec, ...
- Now industry explores hybrid models:
  - Cell: 1 PowerPC core and 8 SPEs
  - AMD Fusion: CPU + GPU
  - Intel Sandy Bridge

Classification of Heterogeneous Multicore
- Same ISA, same cores: traditional multi/many cores (homogeneous multicore)
- Same ISA, different cores: cores of different capabilities
- Different ISAs: some systems-on-chip; multi/manycore CPUs combined with GPUs

How Do We Design Them?
- By designing a series of cores from scratch
- By reusing a series of previously implemented processor cores and changing the interface; the entire family of a microprocessor can be incorporated into a single chip (how about clock frequency?)
- By a combination of the two approaches

Why do we need them?
- Today's microprocessor design should balance high throughput and good single-thread performance.
- Haven't SMT and CMP achieved this? Heterogeneous multicore can do it more efficiently:
  - Match each application to the core(s) that fit its performance demand
  - Provide better area-efficient coverage of workload demands

Why do we need them?
- Different applications have different resource requirements during their execution. What are these resource requirements?
- The best core for execution may vary over time.
- What do we gain from assigning the right thread/process/application to the right core?
  - Performance enhancement
  - Power consumption reduction

Why do we need them?
- Applications with fewer, more sophisticated threads -> traditional multicore with latency-optimized cores
- Applications with high concurrency -> large number of throughput-optimized cores

Two Main Challenges
- Efficiency
- Locality: data movement costs more than computation.

Issues in Heterogeneous Multicore Processors
- Are they scalable?
- How is the scheduling done (objective function)?
- Who does the scheduling?
- What is the cost of task switching?
- What is the cost of task migration?
- When to switch core?
- Do we still need a shared address space? How about coherence?

Examples from real life

XBOX 360

[Xbox 360 system block diagram; bandwidth labels: 22.4 GB/s and 10.8 GB/s]

Specifications
- 3 CPU cores (3.2 GHz), each with 4-way SIMD vector units
- 8-way, 1 MB shared L2 cache
- 2-way SMT, in-order, 2 instructions/cycle
- ATI GPU with embedded EDRAM and 3D graphics units
- 512 MB DRAM main memory

Philosophy
- Value for 5-7 years
- Big performance increase over the last generation
- Support high-definition video; extremely high pixel fill rate (goal: 100+ million pixels/s)
- Flexible to suit the dynamic range of games: balanced hardware, homogeneous resources
- Programmability (easy to program)

Cell Processor

Overview
- Each Cell chip has:
  - One PowerPC core (PPE)
  - 8 compute cores (SPEs)
  - An on-chip memory controller (25.6 GB/s to XDR DRAM)
  - On-chip I/O
  - An on-chip network to connect them all
- A PS3 has 1 Cell chip (6 usable SPEs) and 256 MB of DRAM

PowerPC Core (PPE)
- 3.2 GHz, dual issue, in-order
- 2-way multithreaded
- 512 KB L2 cache, no hardware prefetching
- SIMD (AltiVec) + FMA
- 6.4 GFlop/s (double precision), 25.6 GFlop/s (single precision)
- The PPE serves 2 purposes:
  - Compatibility processor: runs legacy code, compilers, libraries, etc.
  - Performs all system-level functions, including starting the other cores

Synergistic Processing Element (SPE)
- Offload processor: dual issue, in-order, VLIW-inspired SIMD processor
- 128 x 128-bit register file
- 256 KB local store, no I$, no D$
- Runs a small program (placed in the local store); offloads system functions to the PPE
- MFC (memory flow controller): a programmable DMA engine and on-chip network interface
- Performance: 1.83 GFlop/s (double precision), 25.6 GFlop/s (single precision)
- 51.2 GB/s access to the local store; 25.6 GB/s access to the network

Threading Model
- For each SPE program the PPE must:
  - Create an SPE context
  - Load the SPE program (embedded in the binary)
  - Create a pthread to run it
- Each pthread can be as simple as: spe_context_run( ); pthread_exit( );
- Typically, split the work into 2 phases: code that runs on the SPEs and code that runs on the PPE, with a barrier in between (create pthreads, then alternate SPE phases and PPE phases separated by barriers, then join the pthreads).
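The sketch below is a minimal PPE-side illustration of this pattern, assuming the libspe2 API and pthreads; the embedded SPE program handle spe_example and the thread count are placeholders for the example, not code from the lecture.

```c
/* Minimal PPE-side sketch of the Cell threading model (libspe2 + pthreads).
   Assumes the SPE program is embedded in the binary as 'spe_example'
   (a hypothetical handle produced by the SPE toolchain). */
#include <libspe2.h>
#include <pthread.h>
#include <stdio.h>

extern spe_program_handle_t spe_example;   /* hypothetical embedded SPE program */

static void *spe_thread(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    /* Blocks until the SPE program stops; the pthread just hosts the run. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
        perror("spe_context_run");
    pthread_exit(NULL);
}

int main(void)
{
    enum { NUM_SPES = 6 };                 /* a PS3 exposes 6 usable SPEs */
    spe_context_ptr_t ctx[NUM_SPES];
    pthread_t tid[NUM_SPES];

    for (int i = 0; i < NUM_SPES; i++) {
        ctx[i] = spe_context_create(0, NULL);               /* create a context */
        spe_program_load(ctx[i], &spe_example);             /* load the program */
        pthread_create(&tid[i], NULL, spe_thread, ctx[i]);  /* pthread runs it  */
    }

    /* A PPE phase would go here, synchronized with the SPEs (e.g., barriers). */

    for (int i = 0; i < NUM_SPES; i++) {
        pthread_join(tid[i], NULL);
        spe_context_destroy(ctx[i]);
    }
    return 0;
}
```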

Threads Communication
- The PPE may communicate with the SPEs via:
  - Individual mailboxes (FIFO)
  - HW signals
  - DRAM
  - Direct PPE access to the SPEs' local stores
[Diagram: PPE with its cache and shared off-chip DRAM; SPEs with their local stores]
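As a small illustration of the mailbox option, the hedged sketch below uses the libspe2 PPE-side mailbox calls; the context ctx and the message values are assumptions made for the example.

```c
/* Sketch of PPE <-> SPE mailbox communication with libspe2 (PPE side).
   'ctx' is assumed to be an already-created and running SPE context. */
#include <libspe2.h>

void send_and_receive(spe_context_ptr_t ctx)
{
    unsigned int msg = 42;                  /* arbitrary example payload */

    /* Write one 32-bit word into the SPE's inbound mailbox, blocking if full. */
    spe_in_mbox_write(ctx, &msg, 1, SPE_MBOX_ALL_BLOCKING);

    /* Poll the SPE's outbound mailbox for a reply, then read it. */
    unsigned int reply;
    while (spe_out_mbox_status(ctx) == 0)
        ;                                   /* spin until a message is available */
    spe_out_mbox_read(ctx, &reply, 1);
}
```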

AMD'S LLANO FUSION APU

INTEL: SANDY BRIDGE

Sandy Bridge
- An Intel "Tock"
- Intel's first processor with the GPU on the processor die itself
- Improvement over its predecessor Nehalem
- Targets multimedia applications
- Introduced Advanced Vector Extensions (AVX)
- More power-efficient than Westmere

- The GPU can access the large L3 cache
- Intel's team totally redesigned the GPU
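Since AVX is mentioned above, here is a minimal, generic illustration (not taken from the lecture) of what AVX code looks like with C intrinsics; the function vec_add and the array handling are made up for the example.

```c
/* Minimal illustration of AVX (introduced with Sandy Bridge):
   add two float arrays 8 elements at a time using 256-bit vectors.
   Compile with an AVX-enabled flag (e.g., -mavx). */
#include <immintrin.h>

void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* load 8 floats (unaligned ok) */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)                        /* scalar tail */
        c[i] = a[i] + b[i];
}
```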

Example from research

Core Fusion: Accommodating Software Diversity in Chip Multiprocessors
- A single application can go from highly sequential to highly parallel in different phases.
- This is at odds with traditional multicore, or even traditional heterogeneous multicore: the die composition is set at design time, and at design time we do not know the different applications that may execute on the chip!
- What can we do?

Core Fusion: Accommodating Software Diversity in Chip Multiprocessors
- A group of independent cores
- Cores can dynamically morph into a larger CPU or be used as distinct CPUs.

Core Fusion: Accommodating Software Diversity in Chip Multiprocessors
- The application requests core fusion/split actions through a pair of FUSE and SPLIT ISA instructions, respectively.
- FUSE and SPLIT instructions are executed conditionally by hardware, based on the value of an OS-visible control register that indicates which cores within a fusion group are eligible for fusion.
- To enable core fusion, the OS allocates either two or four of the cores in a fusion group to the application when the application is context-switched in, and annotates the group's control register.

Challenges
- Programs' demands vary over time
- The objective function may vary over time
- The core capability may vary over time (e.g., if a core gets overheated)

Some Numbers
- In a paper published in 2005: "The Impact of Performance Asymmetry in Multicore Architectures," ISCA '05
- They experimented with 4 cores
- Used several configurations and many benchmarks
- What did they find?

[Methodology figure: a performance metric measured across core configurations ranging from "all slow" through "3 slow," "2 slow," and "1 slow" to "all fast," over the same or many runs, to study the impact on scalability and stability under asymmetry.]

Workloads evaluated
- Cjbb, CjAppServer: middle-tier business apps (throughput parallel)
- Apache, Zeus: web servers (throughput parallel)
- TPC-H, COMP, H.264, PMake: task-based parallelization and embarrassingly parallel workloads

Impact of asymmetry

Workload      Scalable  Stable
Cjbb          P         O
CjAppServer   P         P
Apache        P         O
Zeus          O         O
TPC-H         P         O
COMP          O         O
H.264         P         P
PMake         P         P

Fix: P P O P P

Programming Heterogeneous Multicore Systems
- When reasoning about the algorithm, correctness, and thread partitioning, programmers don't reason about asymmetry.
- Asymmetry negatively affects applications: unpredictable workload behavior has been observed.
- This can be fixed by:
  - Evaluating the threads' work partitioning
  - Scheduling threads with asymmetry in mind

As a Programmer
- Define your threads: as many threads as you can.
- Know your hardware and your system software:
  - How many cores? How heterogeneous are they?
  - How does the OS scheduling work?
- Now combine your threads depending on core strength.
- You can assign a thread to a core (many languages and libraries allow this: pthreads, OpenMP, ...), as in the sketch below.
- Load imbalance is your enemy!
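As a concrete illustration of assigning a thread to a core, this sketch pins a pthread to one CPU on Linux before it starts; the choice of core 2 as the "fast" core is purely an assumption for the example.

```c
/* Sketch: pinning a pthread to a specific core (Linux, GNU extension).
   On a heterogeneous system this is how a thread can be placed on a
   big or small core; the core number used here is hypothetical. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    printf("worker running on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);             /* hypothetical: assume core 2 is a fast core */

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);  /* pin before start */

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```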

Load Imbalance in Parallel Applications: Main Reasons
- Poor single-processor performance, typically in the memory system
- Too much parallelism overhead: thread creation, synchronization, communication
- Different amounts of work across processors (computation and communication)
- Different speeds (or available resources) for the processors

How To Recognize Load Imbalance?
- Hint: time spent at synchronization is high and uneven across cores, but it is not always so simple.
- Imbalance may change over phases!
- Insufficient parallelism always leads to load imbalance.
- Useful tools:
  - http://www.cs.uoregon.edu/research/tau/home.php
  - http://icl.cs.utk.edu/papi/
  - http://pages.cs.wisc.edu/~paradyn/
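One simple way to apply the synchronization-time hint is to time each thread's barrier wait, as in the sketch below; the thread count and the deliberately imbalanced dummy work are illustrative assumptions.

```c
/* Sketch: measuring per-thread time spent waiting at a barrier.
   Uneven wait times across threads suggest uneven work (load imbalance). */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4

static pthread_barrier_t barrier;
static double wait_time[NTHREADS];

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;

    /* Deliberately imbalanced dummy work: thread i does (i+1) units. */
    volatile double x = 0;
    for (long i = 0; i < (id + 1) * 20000000L; i++) x += 1.0;

    double t0 = now();
    pthread_barrier_wait(&barrier);   /* fast threads sit here waiting for slow ones */
    wait_time[id] = now() - t0;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d waited %.3f s at the barrier\n", i, wait_time[i]);
    return 0;
}
```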

Important Questions When Dealing With Load Imbalance
- Thread costs
  - Do all threads have equal costs?
  - If not, when are the costs known: before starting, when the thread is created, or only when it ends?
- Thread dependencies
  - Can all threads be run in any order (including in parallel)?
  - If not, when are the dependencies known: before starting, when the thread is created, or only when it ends?
- Locality
  - Is it important for some threads to be scheduled on the same processor (or nearby) to reduce communication cost?
  - When is the information about communication known?

Important Questions When Dealing With Load Imbalance
- Thread cost known before starting: matrix multiplication
- Cost known when the thread is created: N-Body
- Cost known only when the thread ends: search

Important Questions When Dealing With Load Imbalance
- Threads can execute in any order: dependence-free loops
- Threads have a known dependency graph: matrix computations
- Dependences unknown: search

Important Questions When Dealing With Load Imbalance
- Threads, once created, do not communicate: embarrassingly parallel applications
- Threads communicate in a predictable pattern: PDE solver
- Information about communication is unpredictable: discrete event simulation

Many Solutions of Different Flavors
- Task queue (centralized or distributed); see the sketch below
- Work stealing
- Work pushing
- Off-line scheduling
- Self-scheduling
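As a minimal sketch of the task-queue/self-scheduling idea (not a definitive implementation), the code below lets threads claim fixed-size chunks from a shared atomic counter, so faster cores naturally end up processing more chunks; the chunk size, thread count, and dummy work are arbitrary choices for the example.

```c
/* Sketch: a centralized task queue via self-scheduling.  Threads grab the
   next chunk of iterations with an atomic counter; faster cores take more
   chunks, which reduces load imbalance. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N        1000000
#define CHUNK    1024
#define NTHREADS 4

static atomic_long next_index = 0;
static double data[N];

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        long start = atomic_fetch_add(&next_index, CHUNK);  /* claim a chunk */
        if (start >= N)
            break;
        long end = start + CHUNK < N ? start + CHUNK : N;
        for (long i = start; i < end; i++)
            data[i] = i * 0.5;                              /* the "task" body */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("done: data[N-1] = %f\n", data[N - 1]);
    return 0;
}
```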

Question: Are homogeneous cores with different ISAs useful for anything? Justify your answer.

Conclusions
- Asymmetric systems may become the de facto standard in the future.
- They are good for energy and performance but may introduce unpredictability.
- To reduce unpredictability, load balancing is your friend!
- In a multiprogramming environment, the advantage of heterogeneous multicore is obvious!