Dataflow Architectures. Karin Strauss


Introduction

Dataflow machines are programmable computers with hardware optimized for fine-grain, data-driven parallel computation:
- fine grain: parallelism at the instruction granularity
- data-driven: execution is activated by data availability; only data dependences constrain parallelism
- programs are usually represented as graphs

Terminology

Taking a node that computes x + y as the example:
- nodes: the operators (functional units)
- tokens: the data values
- arcs: the dependences between nodes
- input ports: where tokens arrive at a node
- enabling: a node is enabled when tokens are present on all its input ports
- firing: an enabled node consumes its input tokens and produces outputs (input data + firing = output data)
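To make the enabling and firing rules concrete, here is a minimal Python sketch (the Node class and its method names are invented for illustration, not taken from any real machine): a node is enabled once every input port holds a token, and firing consumes the inputs and produces an output.

```python
class Node:
    """A dataflow node: fires when every input port holds a token."""
    def __init__(self, op, num_inputs):
        self.op = op                      # e.g. lambda x, y: x + y
        self.ports = [None] * num_inputs  # one token slot per input port

    def receive(self, port, token):
        self.ports[port] = token

    def enabled(self):
        # enabling rule: all input tokens have arrived
        return all(t is not None for t in self.ports)

    def fire(self):
        # firing rule: consume the input tokens, produce an output token
        assert self.enabled()
        result = self.op(*self.ports)
        self.ports = [None] * len(self.ports)
        return result

add = Node(lambda x, y: x + y, 2)
add.receive(0, 3)   # token arrives on one arc; node not yet enabled
add.receive(1, 4)   # second token arrives; node becomes enabled
print(add.fire())   # input data + firing = output data -> 7
```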

Node Types

- functional: apply an operation to the input tokens (e.g. +, -, *)
- conditional: take a value token and a control token; one variant passes the value if control == true and nothing if control == false, the other passes the value if control == false and nothing if control == true
- merge: two inputs, one value out; its non-strict firing rule (it fires on whichever input arrives) makes it act as a serializer

if then else

if (u < v) z = x + y; else z = x - y;

The graph computes the control token u < v, steers x and y through conditional nodes to either the + node or the - node, and merges the two arms to produce z.
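A hedged sketch of how tokens could flow through this graph, with None standing for "no token on the arc"; the gate and merge helpers are hypothetical stand-ins for the conditional and merge nodes of the previous slide.

```python
def gate(value, control, on):
    # conditional node: pass the value token only when the control
    # token matches; otherwise emit nothing on this arc
    return value if control == on else None

def merge(a, b):
    # non-strict merge: forwards whichever input token is present
    return a if a is not None else b

def if_then_else(u, v, x, y):
    c = u < v                                       # comparison: control token
    xt, yt = gate(x, c, True), gate(y, c, True)     # tokens steered toward +
    xf, yf = gate(x, c, False), gate(y, c, False)   # tokens steered toward -
    plus  = xt + yt if xt is not None else None     # + fires only if enabled
    minus = xf - yf if xf is not None else None     # - fires only if enabled
    return merge(plus, minus)                       # merge produces z

print(if_then_else(1, 2, 10, 4))   # u < v, so z = x + y = 14
print(if_then_else(2, 1, 10, 4))   # u >= v, so z = x - y = 6
```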

Iteration, Recursion, Reuse and Reentrance

Reentrance: using the same graph to perform computation on different data sets. Assume no storage elements are used; data is present only on the wires.

while example

Suppose the + node takes 10x what a regular node takes to perform its computation. Positions of the x and y tokens over time (m = merge, < = comparison, c = conditional, +1 = increment):

t | x     | y
0 | m     | m
1 | <, c  | c
2 | c, c  | c
3 | +1    | +
4 | h, m  | +
5 | <, c  | +
6 | c, c  | +
7 | +1    | +
8 | h, m  | +

While the y token is held in the slow + node (from t = 3 onward), the x token keeps circulating through the loop-control nodes, illustrating how tokens from different iterations can be in flight in the same graph at once.

Synchronization

Static approaches:
- locks (compound branch and merge nodes): nodes only fire when all inputs are ready, which costs concurrency
- acknowledging (a control-flow protocol): extra arcs run from the consumer back to the producer; this increases the resources needed, though the overhead can be reduced with detailed analysis (see the sketch below)
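A minimal sketch of the acknowledging protocol, assuming one token of buffering per arc (the AckArc class is hypothetical): the producer may not fire again until the consumer's acknowledgment returns along the extra reverse arc.

```python
class AckArc:
    """Arc with a reverse acknowledge arc: holds at most one token,
    and the producer may only send again after the consumer acks."""
    def __init__(self):
        self.token = None
        self.credit = True     # ack token: one send is allowed initially

    def send(self, value):
        assert self.credit, "producer must wait for the consumer's ack"
        self.token, self.credit = value, False

    def receive(self):
        value, self.token = self.token, None
        self.credit = True     # ack flows back along the extra arc
        return value

arc = AckArc()
arc.send(1)            # first iteration's token
# arc.send(2)          # would fail: no ack yet, so the producer stalls
print(arc.receive())   # consumer fires; ack returns -> 1
arc.send(2)            # now the producer may fire again
```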

Synchronization II

Dynamic approaches: each iteration is executed in a separate instance of the graph.
- code copying: a new instance of the subgraph is created per iteration; tokens from earlier iterations must be directed to the inputs of the new instance
- tagged tokens: attach a tag to each token, associating it with an iteration; a node fires when its input tokens all carry the same tag
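A sketch of tagged-token matching (a simplified software stand-in for a hardware matching unit; all names are invented): tokens wait in a store keyed by (node, tag), and a node instance fires only when all operands carrying the same tag have arrived.

```python
from collections import defaultdict

class MatchingStore:
    """Tokens wait keyed by (node, tag); a node instance is enabled
    only when all operands carrying the same tag have arrived."""
    def __init__(self, arity):
        self.arity = arity             # node -> number of inputs
        self.wait = defaultdict(dict)  # (node, tag) -> {port: value}

    def arrive(self, node, tag, port, value):
        slot = self.wait[(node, tag)]
        slot[port] = value
        if len(slot) == self.arity[node]:   # all same-tag inputs present
            del self.wait[(node, tag)]
            return slot                     # enabled: fire this instance
        return None                         # keep waiting

store = MatchingStore({"add": 2})
print(store.arrive("add", tag=0, port=0, value=3))  # None: waits
print(store.arrive("add", tag=1, port=0, value=8))  # None: other iteration
print(store.arrive("add", tag=0, port=1, value=4))  # {0: 3, 1: 4}: fires
```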

Tagged Tokens

Tagged tokens create other problems:
- tag management: size and distribution of tags
- storage overhead: tags have to be stored with the tokens, and tokens that cannot be consumed at the moment may need to be stored for later use
- too much parallelism! storage overflow can lead to deadlock

Procedures

Procedures can be called from several distinct calling sites:
- the callee node address is sent in a special token
- a mechanism is needed to direct the return-value token back to the caller

Processing Element Architecture

The machine consists of several processing elements (PEs) that communicate with each other. Each PE contains an enabling unit, a memory holding tokens and nodes, and a functional unit:
1) the enabling unit receives a token
2) the enabling unit stores the token at the addressed node
3) if the node is now enabled, it is sent to the functional unit
4) the functional unit processes the node
5) the output and its destination address are sent back to the enabling unit
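Steps 1-5 can be read as a loop; below is a hypothetical software rendering of a single PE (program, slots, and tokens are invented names), with the functional unit's results routed back to the enabling unit as new tokens.

```python
from collections import deque

# hypothetical node memory: node -> (operation, #inputs, destinations);
# a destination is a (node, port) pair, or None for a final output
program = {
    "mul": (lambda a, b: a * b, 2, [("add", 0)]),
    "add": (lambda a, b: a + b, 2, [None]),
}
slots = {n: [None] * program[n][1] for n in program}  # per-node operand slots
tokens = deque([("mul", 0, 2), ("mul", 1, 3), ("add", 1, 10)])

while tokens:
    node, port, value = tokens.popleft()      # 1) enabling unit receives a token
    slots[node][port] = value                 # 2) store token at addressed node
    if all(v is not None for v in slots[node]):   # 3) node is now enabled
        op, n, dests = program[node]
        result = op(*slots[node])             # 4) functional unit processes node
        slots[node] = [None] * n
        for d in dests:                       # 5) output + destination sent back
            if d is None:
                print("output:", result)      # 2*3 + 10 = 16
            else:
                tokens.append((d[0], d[1], result))
```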

Tagged Architectures

A tagged PE contains a matching unit with a token memory, a fetching unit with a node memory, and a functional unit:
1) the matching unit receives a token
2) it checks its memory; if all other inputs with the same tag are present, all the tokens are sent to the fetching unit
3) the fetching unit retrieves the node from memory
4) the fetching unit assembles an executable packet and sends it to the functional unit
5) the functional unit executes the node with the inputs provided by the packet
6) the output is sent back to the matching unit

One-level Architecture

A functional unit delivers tokens to the enabling unit of the correct processing element.

Two-level Architecture

Each functional unit consists of several functional elements that can process packets in parallel.

Two-stage Architecture

Each enabling unit can send executable packets to any functional unit; this is good for heterogeneous functional units.

Manchester Dataflow Machine

Built by Gurd and Watson (1976-1981), a two-level machine organized as a pipelined ring: I/O switch, token queue, matching unit, instruction store, and processing unit, with an overflow unit attached to the matching unit.
- the matching unit pairs tokens; with large data sets, tokens overflow to the overflow unit
- the appropriate instruction is fetched from the instruction store
- the inputs and the instruction are forwarded to the processing unit

Underutilization

Poor performance comes from:
- underutilization of functional units: imbalance, overhead computation
- underutilization of storage space: large data sets, code replication, and tags and destination addresses stored alongside data

Dataflow Model

Benefits:
- exposes parallelism
- can tolerate latencies
- provides mechanisms for fine-grain synchronization

Drawbacks:
- loss of locality (interleaving of instructions)
- waste of resources
- space overhead

Trends

Convergence of dataflow architectures with conventional von Neumann architectures:
- Iannucci's hybrid approach
- decoupled architectures
- out-of-order processors

Iannucci's Hybrid Approach

Instructions are grouped into scheduling quanta, with little or no parallelism among the instructions inside a quantum:
- maximize run length (more locality)
- minimize the number of arcs between quanta
- increase resource utilization
- deadlock avoidance via dependence sets

Iannucci's Hybrid Approach (II)

[Block diagram: processing elements, each with its own local memory, registers, and ALU, connected through a network and a global memory.]

Decoupled Architectures and Out-of-Order Processors

Decoupled architectures:
- decentralized structures (distinct FUs)
- instruction steering based on input and output dependences and on the operation to be performed

Out-of-order processors:
- register renaming to identify the iteration
- instruction scheduling based on ready input operands
- predication?

Less Traditional Proposals

WaveScalar (U. Washington, Mark Oskin):
- PIM architecture
- supports traditional von Neumann-style memory semantics in a dataflow model
- any programming language

EDGE/TRIPS (U.T. Austin, Doug Burger):
- direct instruction communication (within blocks)
- groups of 128 instructions mapped to an array of execution units: dataflow within a block, sequential across blocks
- loads and stores still go through memory-ordering hardware

Related OoO Techniques

Instruction collapsing (MICRO 37):
- strands: dependent instructions whose intermediate computation does not need to be committed to architectural state (Wills, Georgia Tech), e.g. e = a + (b + (c + d)); see the sketch below
- mini-graphs: sequences of instructions with at most 2 inputs, 1 output, one memory operation, and 1 terminal control transfer (Roth, U. Pennsylvania)

The goal is to save processor resources:
- instruction queue entries
- reorder buffer entries
- registers / register file accesses
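An illustrative sketch of the strand idea for e = a + (b + (c + d)): the dependent chain collapses into one macro-op, so the intermediate values never occupy architectural registers or separate issue-queue/ROB entries of their own.

```python
# Without collapsing: three instructions and two architecturally
# visible intermediates (t1, t2), each needing a rename register
# plus issue-queue and ROB entries:
#   t1 = c + d
#   t2 = b + t1
#   e  = a + t2

# With strand collapsing: one macro-op; the intermediates live only
# inside the strand and are never committed to architectural state.
def strand_add3(a, b, c, d):
    t1 = c + d       # intermediate, not committed
    t2 = b + t1      # intermediate, not committed
    return t2 + a    # only e reaches architectural state

print(strand_add3(1, 2, 3, 4))   # e = 1 + (2 + (3 + 4)) = 10
```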