ESE532: System-on-a-Chip Architecture. Today. Process. Message FIFO. Thread. Dataflow Process Model Motivation Issues Abstraction Recommended Approach

Similar documents
ESE532: System-on-a-Chip Architecture. Today. Programmable SoC. Message. Process. Reminder

ESE532: System-on-a-Chip Architecture. Today. Message. Real Time. Real-Time Tasks. Real-Time Guarantees. Real Time Demands Challenges

Dataflow Languages. Languages for Embedded Systems. Prof. Stephen A. Edwards. March Columbia University

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM

Overview of Dataflow Languages. Waheed Ahmad

EECS 144/244: Fundamental Algorithms for System Modeling, Analysis, and Optimization

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.5. Krste Asanovic UC Berkeley Fall 2009

A Streaming Multi-Threaded Model

CS250 VLSI Systems Design Lecture 8: Introduction to Hardware Design Patterns

Petri Nets ee249 Fall 2000

Fundamental Algorithms for System Modeling, Analysis, and Optimization

Computational Process Networks a model and framework for high-throughput signal processing

SysteMoC. Verification and Refinement of Actor-Based Models of Computation

CS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links

Functional modeling style for efficient SW code generation of video codec applications

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

Compositionality in system design: interfaces everywhere! UC Berkeley

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Applying Models of Computation to OpenCL Pipes for FPGA Computing. Nachiket Kapre + Hiren Patel

Compositionality in Synchronous Data Flow: Modular Code Generation from Hierarchical SDF Graphs

LabVIEW Based Embedded Design [First Report]

Computational Process Networks

fakultät für informatik informatik 12 technische universität dortmund Data flow models Peter Marwedel TU Dortmund, Informatik /10/08

SDL. Jian-Jia Chen (slides are based on Peter Marwedel) TU Dortmund, Informatik 年 10 月 18 日. technische universität dortmund

FSMs & message passing: SDL

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Asynchronous Elastic Dataflows by Leveraging Clocked EDA

L2: Design Representations

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Modelling and simulation of guaranteed throughput channels of a hard real-time multiprocessor system

Embedded Systems CS - ES

High Performance Computing. University questions with solution

BeiHang Short Course, Part 2: Description and Synthesis

LabVIEW: Visual Programming Using a Dataflow Model Extended With Graphical Control Structures. Outline

Control Hazards. Branch Prediction

Contents Part I Basic Concepts The Nature of Hardware and Software Data Flow Modeling and Transformation

ESE532: System-on-a-Chip Architecture. Today. Message. Preclass 1. Computing Forms. Preclass 1

Implementation of Process Networks in Java

ES623 Networked Embedded Systems

Moore s Law. Computer architect goal Software developer assumption

Mapping Multiple Processes onto SPEs of the CELL BE Platform using the SCO Model of Computation

ESE534: Computer Organization. Previously. Today. Computing Requirements (review) Instruction Control. Instruction Taxonomy

HW/SW Codesign. Exercise 2: Kahn Process Networks and Synchronous Data Flows

mywbut.com GATE SOLVED PAPER - CS (A) 2 k (B) ( k+ (C) 3 logk 2 (D) 2 logk 3

Portland State University ECE 588/688. Dataflow Architectures

Modelling, Analysis and Scheduling with Dataflow Models

CS 471 Operating Systems. Yue Cheng. George Mason University Fall 2017

Computer Architecture: Dataflow/Systolic Arrays

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

ESE534: Computer Organization. Previously. Today. Preclass. Why Registers? Latches, Registers

Today. ESE532: System-on-a-Chip Architecture. Day 3: Memory Scaling. Message. Day 3: Lesson DRAM DRAM

Embedded Systems 7 BF - ES - 1 -

Control Hazards. Prediction

Outline. Petri nets. Introduction Examples Properties Analysis techniques. 1 EE249Fall04

Embedded Systems 8. Identifying, modeling and documenting how data moves around an information system. Dataflow modeling examines

Lecture 22: Buffering & Scheduling. CSE 123: Computer Networks Alex C. Snoeren

DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING

Internal Server Architectures

A Process Model suitable for defining and programming MpSoCs

AdaStreams : A Type-based Programming Extension for Stream-Parallelism with Ada 2005

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Node Prefetch Prediction in Dataflow Graphs

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines

HIGH-LEVEL SYNTHESIS

Embedded Systems 7. Models of computation for embedded systems

Precise Exceptions and Out-of-Order Execution. Samira Khan

Software Engineering using Formal Methods

Lecture 4: Synchronous Data Flow Graphs - HJ94 goal: Skiing down a mountain

Modeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

CS 426 Parallel Computing. Parallel Computing Platforms

CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.4. Krste Asanovic UC Berkeley Fall 2009

ESE534: Computer Organization. Previously. Today. Preclass. Preclass. Latches, Registers. Boolean Logic Gates Arithmetic Complexity of computations

ESE534 Computer Organization. Today. Temporal Programmable. Spatial Programmable. Needs? Spatially Programmable

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

5008: Computer Architecture

Lab 4 File System. CS140 February 27, Slides adapted from previous quarters

Register Transfer Methodology II

Outline. Register Transfer Methodology II. 1. One shot pulse generator. Refined block diagram of FSMD

Unit 2 Packet Switching Networks - II

Protocol Specification. Using Finite State Machines

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

Naiad (Timely Dataflow) & Streaming Systems

Under the Hood, Part 1: Implementing Message Passing

Dynamic Control Hazard Avoidance

Dataflow Architectures. Karin Strauss

Software Architecture in Practice

MODELING OF BLOCK-BASED DSP SYSTEMS

Deterministic Parallel Programming

SpecC Methodology for High-Level Modeling

SECTION A. (i) The Boolean function in sum of products form where K-map is given below (figure) is:

ESE535: Electronic Design Automation. Today. Question. Question. Intuition. Gate Array Evaluation Model

Relaxed Memory-Consistency Models

CS519: Computer Networks. Lecture 5, Part 5: Mar 31, 2004 Queuing and QoS

EE/CSCI 451 Midterm 1

Overview. Distributed Computing with Oz/Mozart (VRH 11) Mozart research at a glance. Basic principles. Language design.

Petri Nets ~------~ R-ES-O---N-A-N-C-E-I--se-p-te-m--be-r Applications.

Transcription:

ESE53: System-on-a-Chip Architecture Day 5: January 30, 07 Dataflow Process Model Today Dataflow Process Model Motivation Issues Abstraction Recommended Approach Message Parallelism can be natural Discipline to avoid common pitfalls Maintain determinism as much as possible Identify rich potential parallelism Abstract out implementation details Admit to many implementations 3 Process Abstraction of a processor Looks like each process is running on a separate processor Has own state, including Program Counter (PC) Memory Input/output May not actually run on processor Could be specialized hardware block 4 Thread Has a separate locus of control (PC) May share memory Run in common address space with other threads For today no shared memory (technically no threads) 5 FIFO First In First Out Delivers inputs to outputs in order Data presence Consumer knows when data available Back Pressure Producer knows when at capacity Typically stalls Decouples producer and consumer Hardware: maybe even different clocks 6

Preclass Preclass Value of separate processes? Compute Data Parallel GCD Why challenge for SIMD implementation? What benefit get from multiple processes (processors) running Data Parallel GCD? while(a!=b) a=max(a,b)-min(a,b) a=min(a,b) return(a); 7 8 Preclass 3 How long to process each input? Correlation in delays? What benefit from FIFO and processes? Process Processes allow expression of independent control Convenient for things that advance independently Performance optimization resource utilization 9 0 Issues Communication how move data between processes? Synchronization how define how they advance relative to each other? Determinism for the same inputs, do we get the same outputs? Today s Stand Communication FIFO-like channels Synchronization dataflow with FIFOs Determinism how to achieve until you must give it up.

Operation/Operator Dataflow Process Model Operation logic computation to be performed A process that communicates through dataflow inputs and outputs Operator physical block that performs an Operation 3 4 Dataflow / Control Flow Dataflow Program is a graph of operations Operation consumes tokens and produces tokens All operations run concurrently All processes Control flow (e.g. C) Program is a sequence of operations Operation reads inputs and writes outputs into common store One operation runs at a time defines successor 5 Token Data value with presence indication May be conceptual Only exist in high-level model Not kept around at runtime Or may be physically represented One bit represents presence/absence of data 6 Stream Logical abstraction of a persistent pointto-point communication link between operators Has a (single) source and sink Carries data presence / flow control Provides in-order (FIFO) delivery of data from source to sink Streams Captures communications structure Explicit producer!consumer link up Abstract communications Physical resources or implementation Delay from source to sink stream 7 8 3

Dataflow Process Network Collection of Operators Connected by Streams Communicating with Data Tokens 9 Dataflow Abstracts Timing Doesn t say on which cycle calculation occurs Does say What order operations occur in How data interacts i.e. which inputs get mixed together Permits Scheduling on different # and types of resources Operators with variable delay [examples?] Variable delay in interconnect [examples?] 0 Operations Can be implemented on different operators with different characteristics Small or large processor Hardware unit Different levels of internal Data-level parallelism Instruction-level parallelism May itself be described as Dataflow process network, sequential, hardware register transfer language Examples Operators with Variable Delay Cached memory or computation Shift-and-add multiply Iterative divide or square-root Variable delay interconnect Shared bus Distance changes Wireless, longer/shorter cables Computation placed on different cores Clock Independent Semantics Semantics Interconnect Takes n-clocks Need to implement semantics i.e. get same result as if computed as indicated But can implement any way we want That preserves the semantics Exploit freedom of implementation 3 4 4

Synchronous Dataflow (SDF) Dataflow Variants Particular, restricted form of dataflow Each operation Consumes a fixed number of input tokens Produces a fixed number of output tokens When full set of inputs are available Can produce output Can fire any (all) operations with inputs available at any point in time 5 6 Synchronous Dataflow SDF: Execution Semantics + k k k k + while (true) Pick up any operator If operation has full set of inputs Compute operation Produce outputs Send outputs to consumers 7 8 Multirate Synchronous Dataflow Rates can be different Allow lower frequency operations Communicates rates to tools Use in scheduling, provisioning Rates must be constant Data independent decimate 9 SDF Can validate flows to check legal Like KCL! token flow must be conserved No node should be starved of tokens Collect tokens Schedule operations onto processing elements Provisioning of operators Provide real-time guarantees Compute required depth of all buffers Model restrictions! analysis power Simulink is SDF model 30 5

SDF: good/bad graphs SDF: good/bad graphs 3 3 Dynamic Rates? Data Dependence When might static rates be limiting? (prevent useful optimizations?) Compress/decompress Lossless Even Run-Length-Encoding Filtering Discard all packets from spamrus Anything data dependent Add Two Operators Switch Select 33 34 Switch Filtering Example spamrus? dup switch discard 35 36 6

Select Constructing If-Then-Else 37 38 In-Order Merge Idiom to Selectively Consume Input Task: Merge to ordered streams in order onto a single output stream Hold onto current head on loop Key step in merge sort Shown left here With T-side control Use to illustrate switch/select 39 40 In-Order Merge In-Order Merge Use one for each of the two input streams Perform Comparison 4 4 7

In-Order Merge Act on result of comparison Looping for (i=0;i<limit;i++) 43 44 Universal Once we add switch and select, the dataflow model is as powerful as any other E.g. can do anything we could do in C Turing Complete in formal CS terms 45 Dynamic Challenges In general, cannot say If a graph is well formed Will not deadlock How many tokens may have to buffer in stream Will not bufferlock deadlock on finite buffer size When not deadlock on unbounded buffer Right proportion of operators for computation More powerful model! weaker analysis Larger burden to guarantee correctness 46 Deterministic No Peaking As described so far, Timing-independent Always gets same answer Regardless of operator and stream delays Execution time can be data and implementation dependent 47 Key to determinism: behavior doesn t depend on timing Cannot ask if a token is present If (not_empty(in)) Out.put(3); Else Out.put(); 48 8

When peaking necessary? What are cases where we need the ability to ask if a data item is present? Preclass User Input Server When Peak Optimization? What are cases where asking about data presence might allow performance optimization? Process data as soon as arrives 49 50 Peaking Removed model restriction (no peaking) Gained expressive power Can grab data as shows up Weaken our guarantees Possible to get non-deterministic behavior Data-Dependent Parallelism Sequential code With malloc(), new() Size of data determines memory footprint, size of structures Amount of computation performed Here process graph may want to match data structure Add a process spawn 5 5 Data-Dependent Parallelism Where might benefit having datadependent parallelism? Searching variable number of targets Simulation Data-Independent Parallelism What happens if we cannot spawn? Why likely to be less important for a particular SoC target? 53 54 9

Approach () Approach 55 Identify natural parallelism Convert to streaming flow Initially leave operators software Focus on correctness Identify flow rates, computation per operator, parallelism needed Refine operators Decompose further parallelism? E.g. SIMD changes making now model potential hardware 56 Approach () Refine coordination as necessary for implementation Map operators and streams to resources Provision hardware Scheduling: Map operations to operators Memories, interconnect Profile and tune 57 Big Ideas Capture gross parallel structure with Process Network Use dataflow synchronization for determinism Abstract out timing of implementations Give freedom to optimize implementation for performance Minimally use non-determinism as necessary 58 Admin Reading for Day 6 on web HW3 due Friday Expanded feedback incl. HW 59 0