Task Superscalar: Using Processors as Functional Units

Yoav Etsion, Alex Ramirez, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, Mateo Valero
HotPar, June 2010

Parallel Programming is Hard

A few key problems of parallel programming:
1. Exposing operations that can execute in parallel
2. Managing data synchronization
3. Managing data transfers

None of these exist in sequential programming, but they do exist in processors executing sequential programs.

Sequential Program, Parallel Execution Engine

Out-of-order pipelines automatically manage a parallel substrate: a heterogeneous substrate of functional units (FP, ALU, BR, ...). Yet the input instruction stream is sequential [Tomasulo 67][Patt 85].

The obvious questions:
1. How do out-of-order processors manage parallelism?
2. Why can't ILP out-of-order pipelines scale?
3. Can we apply the same principles to tasks?
4. Can task pipelines scale?

Outline

- Recap: How do OoO processors uncover parallelism?
- The StarSs programming model
- A high-level view of the task superscalar pipeline
- Can a task pipeline scale?
- Conclusions and Future Work

How Do Out-of-Order Processors Do It?

- Exposing parallelism: register renaming tables map consumers to producers, and an instruction window is observed to find independent instructions (see the sketch below).
- Data synchronization: data transfers act as synchronization tokens, and dataflow scheduling prevents data conflicts.
- Data transfers: tagged data is broadcast to waiting consumers.

The input is a sequential stream: these complexities are hidden from the programmer.
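
As a minimal sketch only (none of these structures come from the talk), the C fragment below illustrates how a renaming table maps consumers to producers in the style of Tomasulo's algorithm: each architectural register records which in-flight instruction will produce its next value, so a decoded consumer picks up that producer's tag instead of stalling. NUM_REGS, Tag, and the table layout are illustrative assumptions.

#include <stdbool.h>

#define NUM_REGS 32

typedef int Tag;   /* id of the in-flight instruction (reservation-station entry) producing a value */

typedef struct {
    bool busy;     /* true if an in-flight instruction will still write this register */
    Tag  producer; /* which in-flight instruction produces it */
} RenameEntry;

static RenameEntry rename_table[NUM_REGS];

/* At decode, each source operand is resolved either to an already-committed
 * register value (tag -1) or to the tag of its producer; the destination
 * register is re-mapped so future consumers wait on this instruction. */
void rename_instruction(int src1, int src2, int dst, Tag new_tag,
                        Tag *src1_tag, Tag *src2_tag)
{
    *src1_tag = rename_table[src1].busy ? rename_table[src1].producer : -1;
    *src2_tag = rename_table[src2].busy ? rename_table[src2].producer : -1;

    rename_table[dst].busy     = true;
    rename_table[dst].producer = new_tag;
}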

Can We Scale Out-of-Order Processors?

Building a large instruction window is difficult (latency related):
- Timing constraints require a global clock.
- Broadcast does not scale, but latency cannot tolerate switched networks.
- Broadcasting tags yields a large associative lookup in the reservation stations.

Utilizing a large instruction window is also difficult:
- Control-path speculation is a real problem, as most in-flight instructions are speculative (not latency related!).
- Most of the available parallelism is used to overcome the memory wall, not to exploit parallel resources (back to latency...).

But what happens if we operate on tasks rather than instructions?

Outline

- Recap: How do OoO processors uncover parallelism?
- The StarSs programming model: tasks as abstract instructions
- High-level view of the task superscalar pipeline
- Can a task pipeline scale?
- Conclusions and Future Work

The StarSs Programming Model

- Tasks are the basic unit of work.
- Operational flow: a master thread spawns tasks, which are dispatched to multiple worker processors (the functional units).
- The runtime system dynamically resolves dependencies, constructs the task graph, and schedules tasks.
- Programmers annotate the directionality of operands: input, output, or inout.
- Operands can consist of memory regions, not only scalar values, which further extends the pipeline's capabilities.
- Shameless plug: StarSs versions for SMPs and the Cell are freely available.

The StarSs Programming Model

- Simple annotations: all effects on shared state are explicitly expressed.
- Kernels can be compiled for different processors.

Intuitively annotated kernel functions (example: Cholesky decomposition):

#pragma css task input(a, b) inout(c)
void sgemm_t(float a[m][m], float b[m][m], float c[m][m]);

#pragma css task inout(a)
void spotrf_t(float a[m][m]);

#pragma css task input(a) inout(b)
void strsm_t(float a[m][m], float b[m][m]);

#pragma css task input(a) inout(b)
void ssyrk_t(float a[m][m], float b[m][m]);

The StarSs Programming Model

- The code is seemingly sequential and executes on the master thread.
- Invoking a kernel function generates a task, which is sent to the runtime; an s2s (source-to-source) filter injects the necessary code.
- The runtime dynamically constructs the task dependency graph.
- Easier to debug, since execution is similar to sequential execution.

Seemingly sequential code (example: Cholesky decomposition):

for (int j = 0; j < n; j++) {
    for (int k = 0; k < j; k++)
        for (int i = j+1; i < n; i++)
            sgemm_t(A[i][k], A[j][k], A[i][j]);
    for (int i = 0; i < j; i++)
        ssyrk_t(A[j][i], A[j][j]);
    spotrf_t(A[j][j]);
    for (int i = j+1; i < n; i++)
        strsm_t(A[j][j], A[i][j]);
}

The StarSs Programming Model

[Figure: resulting task graph for a 5x5 blocked matrix]

- It is not feasible to have a programmer express such a graph by hand.
- Out-of-order execution: no loop-level barriers (a la OpenMP), which facilitates distant parallelism; for instance, tasks 6 and 23 execute in parallel.
- The availability of data dependencies supports relaxed memory models: DAG consistency [Blumofe 96] and bulk consistency [Torrellas 09].

So Why Move to Hardware?

Problem: the software runtime does not scale beyond 32-64 processors.
- The software decode rate is 700 ns - 2.5 us per task (the range spans an Intel Xeon to a Cell PPU).
- Scaling therefore implies much longer tasks, and longer tasks imply larger datasets that do not fit in the cache.

Hardware offers inherent parallelism:
- Vertical: pipelining.
- Horizontal: distributing load over multiple units.
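
As a back-of-the-envelope illustration (assuming a single master thread decodes tasks serially): with 64 workers and a 700 ns decode per task, each task must run for at least 64 x 0.7 us = ~45 us just to keep the workers from starving; at 2.5 us per decode, the break-even task length grows to 160 us. This is why scaling the software runtime forces longer tasks and, with them, larger working sets.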

Outline

- Recap: How do OoO processors uncover parallelism?
- The StarSs programming model: tasks as abstract instructions
- A high-level view of the task superscalar pipeline
- Can a task pipeline scale?
- Conclusions and Future Work

Task Superscalar: a High-Level View

- Master processors send tasks to the pipeline.
- Object versioning tables (OVTs) map data consumers to producers; an OVT is a combination of a register file and a renaming table (sketched below).
- The task dependency graph is stored in multiplexed reservation stations (MRS).
- Heterogeneous backend: GPUs become the equivalent of the vector unit found in many processors.

[Block diagram: master processors feed a task decode unit (fetch, decode, dispatch) backed by a set of OVTs; decoded tasks are held in multiplexed reservation stations and handed to a HW/SW scheduler, which dispatches them to worker processors and GPUs acting as the functional units; workers can in turn generate nested tasks.]
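
As a minimal sketch (assumptions only, not the actual hardware design), the fragment below shows the role an OVT entry plays during task decode: it maps an operand, identified here simply by its base address, to the last in-flight task that produces it, so a newly decoded consumer can be linked to its producer with an explicit graph edge. OVT_SIZE, TaskId, and the linear lookup are illustrative; a real design would use a hashed table and reclaim entries when producers retire.

#include <stdint.h>

#define OVT_SIZE 1024
typedef int TaskId;

typedef struct {
    uintptr_t addr;      /* base address of the operand (object) */
    TaskId    producer;  /* last in-flight task that writes this object */
} OvtEntry;

static OvtEntry ovt[OVT_SIZE];
static int ovt_used = 0;

/* Decode one operand of a newly arrived task: a reader gets an edge to the
 * current producer (if any); a writer becomes the new producer, i.e. a new
 * version of the object. Returns the task to wait on, or -1 if none. */
TaskId ovt_decode_operand(uintptr_t addr, TaskId task, int is_output)
{
    int idx = -1;
    for (int i = 0; i < ovt_used; i++)          /* linear scan keeps the sketch simple */
        if (ovt[i].addr == addr) { idx = i; break; }

    TaskId dep = (idx >= 0) ? ovt[idx].producer : -1;

    if (is_output) {                            /* output or inout operand */
        if (idx < 0 && ovt_used < OVT_SIZE)
            idx = ovt_used++;
        if (idx >= 0) {
            ovt[idx].addr     = addr;
            ovt[idx].producer = task;           /* later readers will wait on this task */
        }
    }
    return dep;
}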

Result: Uncovering Large Amounts of Parallelism

[Figure: number of ready tasks (0-1000) over normalized execution time for MatMul 4Kx4K, FFT 256K, Cholesky 4Kx4K, Knn 50K samples, Jacobi 4Kx4K, and H264Dec 1 HD frame]

- The figure shows the number of ready tasks throughout the execution.
- Parallelism can be found even in complex dependency structures (Cholesky, H264, Jacobi).

Outline

- Recap: How do OoO processors uncover parallelism?
- The StarSs programming model: tasks as abstract instructions
- A high-level view of the task superscalar pipeline
- Can a task pipeline scale?
- Conclusions and Future Work

Overcoming the Limitations of ILP Pipelines

Task window timing:
- No need for a global clock: we can afford to cross clock domains.

Building a large pipeline:
- Multiplex the reservation stations into a single structure.
- Relaxed timing constraints on decode facilitate the creation of explicit graph edges, which eliminates the need for associative MRS lookups (see the sketch below).
- We estimate storing tens of thousands of in-flight tasks.
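
A minimal sketch of what "explicit graph edges instead of associative lookups" could mean in practice (the entry layout, MAX_CONSUMERS, and the scheduler hook are illustrative assumptions, not the published design): when a producer task completes, it walks its list of consumer entries and decrements each one's pending-input counter, so no tag is broadcast and no associative match is needed.

#define MAX_CONSUMERS 16

typedef struct MrsEntry {
    int pending_inputs;                        /* operands not yet produced */
    int num_consumers;
    struct MrsEntry *consumers[MAX_CONSUMERS]; /* explicit edges to dependent tasks */
} MrsEntry;

/* Hand a now-ready task to the HW/SW scheduler (stub for the sketch). */
static void enqueue_ready(MrsEntry *t) { (void)t; }

void task_completed(MrsEntry *producer)
{
    for (int i = 0; i < producer->num_consumers; i++) {
        MrsEntry *c = producer->consumers[i];
        if (--c->pending_inputs == 0)          /* last missing input just arrived */
            enqueue_ready(c);
    }
}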

Overcoming the Limitations of ILP Pipelines

Utilizing a large window:
- Tasks are non-speculative: we can afford to wait for branches to be resolved.

Overcoming the memory wall:
- The explicit data dependency graph facilitates data scheduling: overlap computation with communication, and schedule work to exploit locality (see the sketch below).
- This is already done in the Cell B.E. version of StarSs; the StarSs runtime tolerates memory latencies of thousands of cycles.
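
As an illustration only (Task, prefetch_operands, wait_operands, and the worker loop are hypothetical placeholders, not the StarSs runtime API), the sketch below shows how knowing the dependency graph lets a worker overlap communication with computation: the operands of the next ready task are fetched, e.g. by DMA on the Cell B.E., while the current task is still running.

#include <stddef.h>

typedef struct {
    void (*kernel)(void **args);   /* task body */
    void **args;                   /* operand list, known from the task graph */
} Task;

/* Start an asynchronous transfer of a task's operands; returns immediately (stub). */
static void prefetch_operands(Task *t) { (void)t; }

/* Block until the transfer started for t has completed (stub). */
static void wait_operands(Task *t) { (void)t; }

void worker_loop(Task **ready, size_t n)
{
    if (n == 0) return;
    prefetch_operands(ready[0]);
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            prefetch_operands(ready[i + 1]);   /* communication for task i+1 ...    */
        wait_operands(ready[i]);
        ready[i]->kernel(ready[i]->args);      /* ... overlaps computation of task i */
    }
}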

Outline

- Recap: How do OoO processors uncover parallelism?
- The StarSs programming model: tasks as abstract instructions
- A high-level view of the task superscalar pipeline
- Can a task pipeline scale?
- Conclusions and Future Work

Conclusions

- Dynamically scheduled out-of-order pipelines are very effective at managing parallelism; the mechanism works, but is limited by timing constraints.
- Task-based dataflow programming models uncover parallelism at runtime through an intuitive task abstraction: Intel RapidMind [McCool 06], Sequoia [Fatahalian 06], OoOJava [Jenista 10].
- Combining the two yields the Task Superscalar: a task-level out-of-order pipeline that uses cores as functional units. We are currently implementing a task superscalar simulator.
- It can also serve as an execution engine for high-level models: Ct, CnC, MapReduce.

Future Work

- Explore locality-based scheduling algorithms and the HW/SW scheduling interface.
- Scheduling for a heterogeneous backend: automatic grouping of tasks into GPU warps.
- Combine dynamic and static dependency analysis.
- Explore the effectiveness of known out-of-order optimizations; lazy renaming has already been shown to work in StarSs.

Thank You

Related Work

- Thread-level speculation (TLS): Multiscalar [Sohi 95], Trace Processors [Rotenberg 97], Hydra [Hammond 98]
- Dataflow: intermittent work in the 1970s, 1980s, and 1990s [Dennis, Watson, Arvind, Culler, Papadopoulos, ...]; TRIPS [Sankaralingam 06], WaveScalar [Swanson 03]
- Other task-level dataflow models: CellSs [Bellens 06], RapidMind [McCool 06], Sequoia [Fatahalian 06], OoOJava [Jenista 10]
- Hardware support for tasks: Carbon [Kumar 07], ADM [Sanchez 10], MLCA [Karim 04]