Lecture 4: Synchronous Data Flow Graphs


Lecture 4: Synchronous Data Flow Graphs
I. Verbauwhede, 05-06, K.U.Leuven

HJ94 goal: Skiing down a mountain
(Figure: design flow from Specification (SPW, Matlab, C) through Algorithm Transformations (pipelining, unrolling, loop merging, compaction) and Memory Transformations and Optimizations (40-bit accumulator, floating-point to fixed-point) down to the implementation targets: ASIC, special purpose, retargetable coprocessor, DSP processor, DSP-RISC, RISC.)

Overview
- Lecture 1: what is a system-on-chip
- Lecture 2: terminology for the different steps
- Lecture 3: models of computation
- Lecture 4 (today): two MoCs: synchronous data flow graphs and control flow

Time Representations
Tag t is an abstraction of time (temporal order).
- Absolute time = global ordering = overspecification: cumbersome and harmful because it reduces the degrees of freedom.
- Order in t is order in events (t < t' <=> e < e').
Three representations:
- Absolute time: T = R (T is a totally ordered, closed, connected set).
- Discrete time: T is a totally ordered discrete set: for all t, t' in T, (t < t') or (t' < t) or (t = t').
- Precedences: T is a partially ordered discrete set.

Models for Time
Timed Models of Computation = total order:
- continuous time
- discrete event (simulation with zero-delay??)
- synchronous
- clocked discrete time = most used
- discrete time (synchronous/reactive)
Untimed MoCs = partial order:
- sequential processes with rendez-vous
- Kahn networks
- data-flow networks
Reality = a mixture of MoCs.

Today's reference: E. Lee, D. Messerschmitt, "Synchronous data flow," Proceedings of the IEEE, Vol. 75, No. 9, September 1987.
Other reference: E. Lee, D. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing," IEEE Transactions on Computers, Vol. C-36, No. 1, January 1987. (This reference includes the proofs for the first one.)
For multi-dimensional signal processing there is stream scheduling, very effective for video and image processing applications. Example: Phideo [Philips].

Data flow
A data flow representation of an algorithm is a directed graph:
- nodes are computations (actors),
- arcs (or edges) are paths over which the data ("samples") travels.
A DFG shows which computations to perform, not their sequence. The sequence is determined only by data dependencies; hence it exposes concurrency.

Data flow (cont.)
- Assume an infinite stream of input samples, so nodes perform their computations infinitely many times.
- A node will fire (start its computation) when its inputs are available; a node with no inputs can fire at any time.
- Numbers on the arcs indicate the number of samples (tokens) produced and consumed by one firing.
- Nodes fire when input data is available; this is called data-driven, and it exposes concurrency.
- Nodes must be free of side effects: e.g. a write to a memory location followed by a read is only allowed if there is an arc between the two nodes.

Data flow (cont.)
True data flow: the overhead of checking the availability of input tokens at runtime is too large. BUT in synchronous data flow the number of tokens produced/consumed is known beforehand (a priori)! Hence the scheduling can be done a priori, at compile time, and there is NO runtime overhead. For signal processing applications the number of tokens produced and consumed is independent of the data and known beforehand (= relative sample rates).

Synchronous Data Flow - definition
A synchronous data flow graph (SDF) is a network of synchronous nodes (also called blocks). A node is a function that is invoked whenever enough inputs are available; the inputs are consumed. For a synchronous node, the consumptions and productions are known a priori. A homogeneous SDF graph has only 1s on the graph: every firing produces and consumes a single token per arc.
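A minimal sketch (not from the slides) of how such a graph can be written down, assuming a representation where every arc records its producer, consumer, a-priori rates, and initial delay. The field names are illustrative; the rates are those of the three-node example used in the rest of these notes.

    # Minimal SDF graph representation: each arc stores which node produces
    # into it, which node consumes from it, the a-priori production and
    # consumption rates, and the number of initial tokens (delays).
    from dataclasses import dataclass

    @dataclass
    class Arc:
        src: str        # producing node
        dst: str        # consuming node
        produce: int    # tokens produced per firing of src
        consume: int    # tokens consumed per firing of dst
        delay: int = 0  # initial tokens on the arc

    # Example graph used below:
    # n1 -> n2 (1 produced, 1 consumed), n1 -> n3 (2 produced, 1 consumed),
    # n2 -> n3 (2 produced, 1 consumed).
    nodes = ["n1", "n2", "n3"]
    arcs = [Arc("n1", "n2", 1, 1),
            Arc("n1", "n3", 2, 1),
            Arc("n2", "n3", 2, 1)]
    print(arcs[1])  # Arc(src='n1', dst='n3', produce=2, consume=1, delay=0)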

Delay
Delay here is the delay of signal processing: a unit delay on the arc between A and B means that the n-th sample consumed by B is the (n-1)-th sample produced by A. A delay of d is initialized with d zero samples.

A synchronous compiler
Translation from an SDF graph to a sequential program on a processor. Two tasks:
- allocation of shared memory between blocks, or setting up communication between blocks;
- scheduling blocks onto processors such that all input data is available when a block is invoked.
Goal: create a Periodic Admissible Parallel Schedule (PAPS).

Precedence graph - Schedule
The precedence graph indicates the sequence of operations.
(Figure: precedence graph over nodes A, B, C, with A before B and B before C.)
The schedule determines when and where (on which processor or which data path unit) each node fires.
- Valid schedule: A B C.
- Invalid schedules: C A B; B A C.

Blocked Schedule
Blocked: one cycle terminates before the next one starts.
(Figure: a graph with nodes A, B, C, E, F, G scheduled on 3 processors/units.)
Static schedule on 3 processors/units, a valid blocked schedule:
  P1: A C G, P2: B F, P3: E
With pipelining (not blocked):
  P1: A C, P2: B F, P3: G E

Small vs. large grain
Iteration period = length of one cycle = 1/throughput. Goal: minimize the iteration period.
Iteration period bound = minimum achievable iteration period (assuming pipelining) = bounded by the total computation time of the operations in a loop divided by the number of delays in that loop.
- Atomic SDF graph: nodes are primitive operations.
- Large grain SDF graph: nodes are larger functions.
Example: an IIR filter = small grain; JPEG = large grain.

SDF graph implementation
Implementation requires:
- buffering of the data samples passing between nodes,
- scheduling nodes when their inputs are available.
A dynamic implementation (= at runtime) requires a runtime scheduler that checks when inputs are available and schedules nodes when a processor is free; this is usually expensive because of the overhead.
Contribution of Lee-87: SDF graphs can be scheduled at compile time, with no runtime overhead. The compiler will:
- determine the execution order of the nodes on one or multiple processors or data path units,
- determine the communication buffers between nodes.
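As a small illustration of the iteration period bound (the loops and numbers here are invented for the sketch, not taken from the slides):

    # Iteration period bound (IPB): for each directed loop, divide the total
    # computation time of the nodes in the loop by the number of delays in
    # the loop; the IPB is the maximum of this ratio over all loops.
    loops = [
        {"node_times": [1, 2, 3], "delays": 2},  # hypothetical loop 1: 6/2 = 3
        {"node_times": [2, 2],    "delays": 1},  # hypothetical loop 2: 4/1 = 4
    ]
    ipb = max(sum(l["node_times"]) / l["delays"] for l in loops)
    print(ipb)  # 4.0: loop 2 is the bottleneck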

Periodic schedule for an SDF graph
Assumptions:
- infinite stream of input data (the case for signal processing applications),
- periodic schedule: the same schedule is applied repetitively to the input stream.
Goal: check whether a schedule can be found:
- Periodic Admissible Sequential Schedule (PASS) for a single processor or data path unit,
- Periodic Admissible Parallel Schedule (PAPS) for multiple processors.
(Figures: a small graph with a PASS; a graph with a rate inconsistency; a consistent solution.)

Formal approach
Construct the topology matrix Γ:
- each node is a column,
- each arc is a row,
- entry (i,j) = number of tokens produced on arc i by node j; consumption appears as a negative entry.
Example (nodes n1, n2, n3; arcs e1: n1->n2, e2: n1->n3, e3: n2->n3):

    Γ = [ 1 -1  0
          2  0 -1
          0  2 -1 ]

Self loop entry?
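A sketch of how Γ can be built mechanically from the arc list of the example above (numpy assumed; the arc tuples are illustrative):

    # Build the topology matrix: one row per arc, one column per node.
    # Entry (i, j) = tokens produced on arc i per firing of node j
    # (negative when node j consumes from arc i).
    import numpy as np

    nodes = ["n1", "n2", "n3"]
    # (src, dst, produced per src firing, consumed per dst firing)
    arcs = [("n1", "n2", 1, 1), ("n1", "n3", 2, 1), ("n2", "n3", 2, 1)]

    gamma = np.zeros((len(arcs), len(nodes)), dtype=int)
    for i, (src, dst, prod, cons) in enumerate(arcs):
        gamma[i, nodes.index(src)] += prod   # production: positive entry
        gamma[i, nodes.index(dst)] -= cons   # consumption: negative entry
        # a self loop (src == dst) contributes prod - cons to one entry

    print(gamma)
    # [[ 1 -1  0]
    #  [ 2  0 -1]
    #  [ 0  2 -1]]

The self-loop question on the slide falls out of the same rule: production and consumption land in the same entry, giving a net contribution of produce - consume.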

FIFO queues
b(n) = vector of the queue sizes on each arc.
v(n) = firing vector: v(n) = (1,0,0), (0,1,0) or (0,0,1) indicates which node fires at step n.
Buffer evolution:

    b(n+1) = b(n) + Γ v(n)

Example: starting from empty buffers b(0) = (0,0,0) and firing node n1, b(1) = b(0) + Γ (1,0,0) = (1,2,0).

FIFO queues & delays
Delays are handled by initializing b(0) with the delay values. Example: two nodes n1 and n2 in a loop, with two delays on the arc from n2 to n1, so b(0) starts with 2 tokens on that arc. At start-up n1 can fire two times before n2 has to fire again. Hence every directed loop must have at least one delay for the graph to be able to start.
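A sketch that simulates the buffer recursion on the example graph; a candidate period is valid only if no buffer ever goes negative and b returns to its initial value after one period:

    # Simulate b(n+1) = b(n) + Gamma @ v(n) for a candidate firing sequence.
    import numpy as np

    gamma = np.array([[1, -1, 0],
                      [2, 0, -1],
                      [0, 2, -1]])

    def run(schedule, b0):
        b = np.array(b0)
        for node in schedule:              # nodes numbered 1..3
            v = np.zeros(3, dtype=int)
            v[node - 1] = 1
            b = b + gamma @ v
            if (b < 0).any():
                return None                # a buffer went negative: invalid
        return b

    print(run([1, 2, 3, 3], [0, 0, 0]))   # [0 0 0] -> valid period
    print(run([2, 1, 3, 3], [0, 0, 0]))   # None: n2 fires with empty input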

Identifying inconsistent sample rates
Necessary condition for the existence of a periodic schedule with bounded memory: the rank of Γ is s-1 (s is the number of nodes).
Inconsistent example (one consumption rate changed):

    Γ = [ 1 -1  0
          2  0 -1
          0  1 -1 ]    rank?

Consistent example:

    Γ = [ 1 -1  0
          2  0 -1
          0  2 -1 ]    rank?

Relative firing frequency
A topology matrix with the correct rank has a strictly positive (element-wise) integer vector q in its right nullspace. Thus:

    Γ q = 0

For the consistent example: rank = 2, q = (1, 1, 2).
q determines the number of times each node is invoked!
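A sketch of computing q mechanically (sympy assumed): take a rational basis vector of the right nullspace of Γ and scale it to the smallest strictly positive integer vector.

    # Smallest positive integer repetition vector q with Gamma * q = 0.
    from sympy import Matrix, lcm, gcd

    gamma = Matrix([[1, -1, 0],
                    [2, 0, -1],
                    [0, 2, -1]])

    ns = gamma.nullspace()
    # Consistent rates: rank is s-1, so the nullspace is one-dimensional.
    assert gamma.rank() == gamma.cols - 1 and len(ns) == 1
    v = ns[0]
    v = v * lcm([term.q for term in v])   # clear denominators (.q = denominator)
    v = v / gcd(list(v))                  # reduce to smallest integers
    if v[0] < 0:
        v = -v                            # make strictly positive
    print(v.T)                            # Matrix([[1, 1, 2]])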

Insufficient delays
Rank s-1 is a necessary but not a sufficient condition. Example: two nodes in a directed loop with no delay,

    Γ = [ 1 -1
         -1  1 ],    Γ q = 0 for q = (1, 1),

but neither node can ever fire, so no schedule exists.

Scheduling for a single processor
Given: a positive integer vector q such that Γ q = 0, and given b(0).
The i-th node is runnable if:
- it has not yet been run q_i times, and
- running it will not cause any buffer size to become negative.
A class S (sequential) algorithm creates a static schedule: it is an algorithm that
- schedules a node if it is runnable,
- updates b(n),
- stops when no more nodes are runnable.
If the class S algorithm terminates before it has scheduled each node the number of times specified in the q vector, then it is said to be deadlocked.
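A sketch of a class S scheduler as just defined, run on the example graph (numpy assumed; the node numbering 1..3 is illustrative):

    # Class S scheduler sketch: repeatedly scan the node list and fire any
    # runnable node (not yet fired q_i times, and firing keeps all buffers
    # non-negative). Deadlock if a full pass schedules nothing.
    import numpy as np

    def class_s(gamma, q, b0):
        s = len(q)
        fired = [0] * s
        b = np.array(b0)
        schedule = []
        while sum(fired) < sum(q):
            progress = False
            for i in range(s):                      # try each node once
                if fired[i] == q[i]:
                    continue
                nb = b + gamma[:, i]
                if (nb >= 0).all():                 # runnable: fire node i+1
                    b, fired[i] = nb, fired[i] + 1
                    schedule.append(i + 1)
                    progress = True
            if not progress:
                raise RuntimeError("deadlocked")    # class S deadlock
        return schedule

    gamma = np.array([[1, -1, 0], [2, 0, -1], [0, 2, -1]])
    print(class_s(gamma, [1, 1, 2], [0, 0, 0]))     # [1, 2, 3, 3]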

Example class S algorithm
- Solve for the smallest positive integer vector q.
- Form a list of all nodes in the system.
- For each node: schedule it if it is runnable; try each node once.
- If each node has been scheduled q_i times, STOP.
- If no node can be scheduled, indicate deadlock; else continue with the next node.
For the running example (q = (1, 1, 2)):
- 1-2-3-3 is a PASS,
- 1-2-3 is not a PASS (node 3 must fire twice per period),
- 2-1-3-3 is not a PASS (node 2 is not runnable before node 1 has fired).
(Complexity: traverse the graph once, visiting each edge once.)
Optimization: minimize the buffer (= memory) requirements.

Schedule for parallel processors
Assumptions: homogeneous processors, no overhead in communication. If a PASS exists, then a PAPS also exists (because we could run all nodes on one processor).
A blocked periodic admissible parallel schedule is a set of lists {X_i; i = 1, ..., M}:
- M is the number of processors,
- X_i is the periodic schedule for processor i,
- p is the smallest positive integer vector such that Γ p = 0.
A cycle of the schedule then invokes every node q = J p times. J is called the blocking factor (and can be different from 1).

Precedence graph
For the parallel schedule example (three nodes; Γ p = 0 with rank = 2 and p = (2, 1, 1)), node 1 fires twice per cycle. PASS? The precedence graph for unity blocking factor contains the two firings of node 1 and one firing each of nodes 2 and 3.

Schedule on two processors, J=1
Assumptions: node 1 takes 1 time unit, node 2 takes 2, node 3 takes 3.
X1 = {3}, X2 = {1, 1, 2}

    Time:        1  2  3  4
    processor 1: 3  3  3
    processor 2: 1  1  2  2

Iteration period = 4.

Schedule on two processors, J=2
Assumptions: node 1 takes 1 time unit, node 2 takes 2, node 3 takes 3; nodes have self loops (so a node cannot overlap with itself).
X1 = {3, 1, 1, 2}, X2 = {1, 1, 2, 3}
Each cycle now contains J p = (4, 2, 2) firings, and each processor is busy for 7 time units, so the iteration period is 7/2 = 3.5.

Why are we doing this?
- The principle of synchronous data flow is used in many simulators.
- Based on this, multi-dimensional data flow representations have been developed.
- Reality is always more complicated. Issues in practice:
  - choose the schedule to minimize memory requirements,
  - include non-data-flow nodes: if-then-else, data dependent calculations.
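A sketch of the iteration period arithmetic for blocked parallel schedules, assuming (as on the slides) that the listed per-processor orders already respect all precedence constraints:

    # Iteration period of a blocked parallel schedule: the cycle length is
    # the busiest processor's workload; dividing by the blocking factor J
    # gives the average period per graph iteration.
    times = {1: 1, 2: 2, 3: 3}          # execution time of each node

    def iteration_period(processor_lists, J):
        cycle = max(sum(times[n] for n in X) for X in processor_lists)
        return cycle / J

    print(iteration_period([[3], [1, 1, 2]], J=1))              # 4.0
    print(iteration_period([[3, 1, 1, 2], [1, 1, 2, 3]], J=2))  # 3.5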