EE213A - EE298-2 Lecture 8

Synchronous Data Flow
Ingrid Verbauwhede
Department of Electrical Engineering
University of California Los Angeles
ingrid@ee.ucla.edu

EE213A, Spring 2000, Ingrid Verbauwhede, UCLA - Lecture 8

Motivation
DSP representation & modeling:
- to expose the concurrency (parallelism)
- closer to the natural description of DSP applications
Specification: MATLAB, SPW, Cossap, C/C++
- Floating point (DSP Canvas)
- Fixed point
Algorithm transformations
Architecture alternatives:
- ASIC
- Bit parallel
- Bit serial
- Special purpose (Art Designer)
- Retargetable coprocessor (Target Compiler Technologies, Easics)
- DSP processors (TI TMS320C54)
- DSP extensions to RISC (Tensilica)

Introduction
Reference: E. Lee, D. Messerschmitt, "Synchronous data flow," Proceedings of the IEEE, Vol. 75, No. 9, September 1987.
Other reference: E. Lee, D. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing," IEEE Transactions on Computers, Vol. C-36, No. 1, Jan. 1987. (This reference includes the proofs for the first reference.)

Data flow
Data flow representation of an algorithm:
- is a directed graph
- nodes are computations
- arcs (or edges) are paths over which the data ("samples") travels.
DF shows which computations to perform, not their sequence. The sequence is determined only by data dependencies. Hence it exposes concurrency.

Data flow (cont.)
- Assume an infinite stream of input samples, so nodes perform their computations an infinite number of times.
- A node fires (starts its computation) when its inputs are available.
- A node with no inputs can fire at any time.
- Numbers on the arcs indicate the number of samples (tokens) produced or consumed by one firing.
- Nodes fire when input data is available; this is called data-driven. Hence it exposes concurrency.

Data flow (cont.)
True data flow: the overhead for checking the availability of input tokens is too large.
BUT, synchronous data flow: the number of tokens produced/consumed is known beforehand (a priori)!
Hence, the scheduling can be done a priori, at compile time. Thus there is NO runtime overhead!
For signal processing applications: the number of tokens produced and consumed is independent of the data and known beforehand (= relative sample rates).
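The data-driven firing rule described above can be sketched in a few lines of Python. This is only an illustration: the names `can_fire`, `fire`, and `average2`, and the two-input averaging node, are hypothetical and do not come from the lecture.

```python
from collections import deque

def can_fire(inputs):
    """inputs: list of (fifo, tokens_needed) pairs.
    A node with no inputs (empty list) can always fire."""
    return all(len(fifo) >= need for fifo, need in inputs)

def fire(inputs, outputs, func):
    """Consume the required tokens from every input FIFO, apply func,
    and append the produced tokens to the output FIFOs.
    func receives one list of tokens per input and must return
    one list of produced tokens per output."""
    if not can_fire(inputs):
        raise RuntimeError("not enough input tokens")
    args = [[fifo.popleft() for _ in range(need)] for fifo, need in inputs]
    for fifo, produced in zip(outputs, func(*args)):
        fifo.extend(produced)

# Hypothetical node: consumes 2 samples per firing, produces 1 (their mean).
def average2(samples):
    return [[sum(samples) / 2]]

arc_in = deque()    # FIFO on the input arc
arc_out = deque()   # FIFO on the output arc

arc_in.append(1.0)
enabled_with_one_token = can_fire([(arc_in, 2)])   # only 1 of 2 tokens present
arc_in.append(3.0)
fire([(arc_in, 2)], [arc_out], average2)           # now the node is enabled
```

The availability check in `can_fire` is exactly the runtime overhead that a synchronous data flow schedule removes, since the token counts are fixed and can be checked once at compile time.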

Synchronous Data Flow - definition
A synchronous data flow (SDF) graph is a network of synchronous nodes (also called blocks).
A node is a function that is invoked whenever enough inputs are available. The inputs are consumed.
For a synchronous node, the consumptions and productions are known a priori.
Homogeneous SDF graph: when there are only 1s on the graph.

Precedence graph - Schedule
The precedence graph indicates the sequence of operations.
[Figure: precedence graph over nodes A, B, C]
The schedule determines when and where (on which processor or data path unit) each node fires.
Valid schedules: A B C
Invalid schedule: C A B B A C

Blocked Schedule
The precedence graph indicates the sequence of operations.
[Figure: precedence graph over nodes A, B, C, E, F, G]
Static schedule on 3 processors/units.
Valid blocked schedule:
  P1: A C G
  P2: B F
  P3: E
With pipeline:
  P1: A C
  P2: B F
  P3: G E

Small - large grain
Iteration period = length of one cycle = 1/throughput
Goal: minimize the iteration period
Iteration period bound = ?
Atomic SDF graph: when nodes are primitive operations
Large grain SDF graph: when nodes are larger functions
Example: LPS speech processing (will be project)

SDF graph implementation
Implementation requires:
- buffering of the data samples passing between nodes
- scheduling nodes when inputs are available
Dynamic implementation (= runtime) requires a runtime scheduler, which:
- checks when inputs are available and schedules nodes when a processor is free
- is usually expensive because of the overhead
Contribution of Lee-87: SDF graphs can be scheduled at compile time, with no overhead.
The compiler will:
- determine the execution order of the nodes on one or multiple processors or data path units
- determine the communication buffers between nodes.

Periodic schedule for SDF graph
Assumptions:
- infinite stream of input data (the case for signal processing applications)
- periodic schedule: the same schedule is applied repetitively to the input stream
Goal: check whether a schedule can be found:
- Periodic admissible sequential schedule (PASS) for a single processor or data path unit
- Periodic admissible parallel schedule (PAPS) for multiple processors
[Figure: two example graphs, one with a rate inconsistency and one with a consistent solution]

Formal approach
Construct the topology matrix Γ:
- each node is a column
- each arc is a row
- entry (i,j) = number of samples produced by node j on arc i
- consumption is a negative entry
[Figure: example graph and its topology matrix Γ]
Self loop entry?

FIFO queues
b(n) = size of the queues on each arc
v(n) = [1 0 0]^T or [0 1 0]^T or [0 0 1]^T indicates the firing node
b(n+1) = b(n) + Γ v(n)
[Figure: example graph with its topology matrix Γ and the buffer states b(0), b(1)]
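As a small sketch of the buffer update b(n+1) = b(n) + Γ v(n), the Python below fires nodes one at a time on a hypothetical three-node chain. The graph, the matrix `GAMMA`, and the function names are assumptions made for illustration; they are not the example from the slides.

```python
# Hypothetical chain: n1 --e1--> n2 --e2--> n3
# n1 produces 1 sample on e1, n2 consumes 2 from e1;
# n2 produces 1 sample on e2, n3 consumes 1 from e2.
# Rows are arcs, columns are nodes; consumption is negative.
GAMMA = [
    [1, -2, 0],   # arc e1
    [0, 1, -1],   # arc e2
]

def fire(b, gamma, node):
    """Return b(n+1) = b(n) + Gamma * v(n), where v(n) is the unit
    vector selecting `node` (0-based column index)."""
    nxt = [b_i + row[node] for b_i, row in zip(b, gamma)]
    if any(x < 0 for x in nxt):
        raise ValueError(f"node {node + 1} is not runnable: a queue would go negative")
    return nxt

def run(b0, gamma, firings):
    """Apply a sequence of firings and return every buffer state."""
    states = [list(b0)]
    for node in firings:
        states.append(fire(states[-1], gamma, node))
    return states

# Fire n1, n1, n2, n3 starting from empty queues.
trace = run([0, 0], GAMMA, [0, 0, 1, 2])
```

After this sequence every queue is back at its initial size, which is what a periodic schedule needs.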

FIFO queues & delays
Delays are handled by initializing b(0) with the delay values.
[Figure: example graph with delays and the resulting b(0)]
So at start-up: the node that consumes the initial samples can fire two times before the node feeding it fires again.

Identifying inconsistent sample rates
Necessary condition for the existence of a periodic schedule:
Rank of Γ is s-1 (s is the number of nodes)
[Figure: two example graphs with their topology matrices; rank = ?]
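The rank condition can be checked mechanically. The sketch below uses exact Gaussian elimination on two hypothetical three-node graphs (both matrices are assumptions for illustration, not the slide examples): one with consistent rates and one where a third arc introduces a rate mismatch.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of an integer matrix via Gaussian elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivot rows found so far
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def rates_consistent(gamma):
    """Necessary condition for a periodic schedule: rank(Gamma) == s - 1."""
    s = len(gamma[0])  # number of nodes = number of columns
    return rank(gamma) == s - 1

# Hypothetical graphs: arcs n1->n2, n2->n3, n1->n3.
CONSISTENT = [[1, -1, 0], [0, 1, -1], [1, 0, -1]]
INCONSISTENT = [[1, -1, 0], [0, 1, -1], [2, 0, -1]]  # n1 produces 2 on the third arc
```

In the inconsistent graph the matrix has full rank 3, so only q = 0 satisfies Γq = 0 and samples pile up on one arc without bound.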

Relative firing frequency
A topology matrix with the correct rank has a strictly positive (element-wise) integer vector q in its right nullspace. Thus: Γq = 0
[Figure: example graph with its topology matrix Γ, its rank, and the vector q]
q determines the number of times each node is invoked!

Insufficient delays
Rank s-1 is a necessary but not a sufficient condition.
[Figure: example graph that satisfies Γq = 0 but has insufficient delays]
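One way to obtain q, sketched below, is to propagate relative firing rates along the arcs and then scale to the smallest integers. This is an illustrative approach under stated assumptions, not necessarily the method used in the lecture: the graph is connected, every arc has exactly one producer and one consumer (no self loops), and Python 3.9+ is available for `math.lcm`. The matrix `GAMMA` is a hypothetical chain.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+

def repetition_vector(gamma):
    """Return the smallest positive integer q with Gamma q = 0,
    or None if the sample rates are inconsistent."""
    n = len(gamma[0])
    q = [None] * n
    q[0] = Fraction(1)            # fix the rate of the first node, scale later
    changed = True
    while changed:
        changed = False
        for row in gamma:
            src = next(j for j, x in enumerate(row) if x > 0)   # producer
            dst = next(j for j, x in enumerate(row) if x < 0)   # consumer
            if q[src] is not None and q[dst] is None:
                q[dst] = q[src] * row[src] / -row[dst]
                changed = True
            elif q[dst] is not None and q[src] is None:
                q[src] = q[dst] * -row[dst] / row[src]
                changed = True
    if any(x is None for x in q):
        raise ValueError("graph is not connected")
    scale = lcm(*(x.denominator for x in q))
    q_int = [int(x * scale) for x in q]
    # Every arc must balance: samples produced == samples consumed per period.
    for row in gamma:
        if sum(a * b for a, b in zip(row, q_int)) != 0:
            return None
    return q_int

# Hypothetical chain: n1 -(1,2)-> n2 -(1,1)-> n3
GAMMA = [[1, -2, 0], [0, 1, -1]]
q = repetition_vector(GAMMA)
```

For this chain, n1 must fire twice for every firing of n2 and n3.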

Scheduling for single processor
Given:
- a positive integer vector q, such that Γq = 0
- b(0)
The i-th node is runnable if:
- it has not been run qi times
- it will not cause the buffer size to become negative
Class S (sequential) algorithm: an algorithm that
- schedules a node if it is runnable
- stops when no more nodes are runnable.
If the class S algorithm terminates before it has scheduled each node the number of times specified in the q vector, then it is said to be deadlocked.

Example Class S algorithm
- Solve for the smallest positive integer vector q.
- Form a list of all nodes in the system.
- For each node, schedule it if it is runnable; try each node once.
- If each node has been scheduled qi times, STOP.
- If no node can be scheduled, indicate deadlock; else continue with the next node.
[Example: three candidate schedules for a three-node graph; the first is a PASS, the other two are not]
(Complexity: traverse the graph once, visiting each edge once.)
Optimization: minimize buffer (= memory) requirements
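The class S steps above can be sketched as follows. The function name `class_s_schedule`, the chain graph, and the two-node cycle are hypothetical examples chosen for illustration; node numbers in the returned schedule are 1-based.

```python
def class_s_schedule(gamma, q, b0):
    """Build a periodic admissible sequential schedule (PASS).
    gamma: topology matrix (rows = arcs, columns = nodes)
    q:     positive integer vector with Gamma q = 0
    b0:    initial queue sizes (the delays)
    Returns the firing order as a list of 1-based node numbers,
    or None if the algorithm deadlocks."""
    b = list(b0)
    remaining = list(q)          # firings still owed to each node
    schedule = []
    progress = True
    while any(remaining) and progress:
        progress = False
        for node in range(len(q)):           # try each node once per pass
            if remaining[node] == 0:
                continue                     # already run q[node] times
            nxt = [b_i + row[node] for b_i, row in zip(b, gamma)]
            if all(x >= 0 for x in nxt):     # no queue goes negative
                b = nxt
                remaining[node] -= 1
                schedule.append(node + 1)
                progress = True
    return schedule if not any(remaining) else None

# Hypothetical chain: n1 -(1,2)-> n2 -(1,1)-> n3, with q = [2, 1, 1].
CHAIN = [[1, -2, 0], [0, 1, -1]]
chain_schedule = class_s_schedule(CHAIN, [2, 1, 1], [0, 0])

# Hypothetical two-node cycle: n1 -> n2 -> n1, one sample per firing.
CYCLE = [[1, -1], [-1, 1]]
no_delay = class_s_schedule(CYCLE, [1, 1], [0, 0])    # nothing can start
one_delay = class_s_schedule(CYCLE, [1, 1], [0, 1])   # one delay on the feedback arc
```

The cycle shows why the rank condition alone is not sufficient: its rates are consistent, yet without an initial sample on the feedback arc no node is ever runnable.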

Schedule for parallel processors
Assumptions: homogeneous processors, no overhead in communication.
If a PASS exists, then a PAPS also exists (because we could run all nodes on one processor).
A blocked periodic admissible parallel schedule is a set of lists {Xi; i = 1, ..., M}
- M is the number of processors
- Xi = periodic schedule for processor i
p is the smallest positive integer vector such that Γp = 0.
Then one cycle of the schedule invokes every node q = Jp times.
J is called the blocking factor (it can be different from 1).

Precedence graph
[Figure: example graph with its topology matrix Γ, Γp = 0, its rank, the vector p, a PASS, and the precedence graph for unity blocking factor]
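The relation q = Jp gives a simple consistency check on a set of processor lists: counted over all lists, node i must appear exactly J times p_i. The sketch below assumes a hypothetical vector p = [2, 1, 1] and hypothetical lists; the function name `blocking_factor` is also an assumption.

```python
from collections import Counter

def blocking_factor(lists, p):
    """Return J if the processor lists {Xi} together invoke node i
    exactly J * p[i-1] times for one common positive integer J;
    otherwise return None. Nodes are numbered from 1."""
    counts = Counter(node for xs in lists for node in xs)
    if any(node < 1 or node > len(p) for node in counts):
        return None                      # list mentions an unknown node
    factors = set()
    for i, p_i in enumerate(p, start=1):
        if counts[i] % p_i != 0:
            return None                  # not a whole number of periods
        factors.add(counts[i] // p_i)
    if len(factors) != 1:
        return None                      # nodes disagree on J
    J = factors.pop()
    return J if J > 0 else None

P = [2, 1, 1]                            # assumed smallest solution of Gamma p = 0
j_one = blocking_factor([[3], [1, 1, 2]], P)
j_two = blocking_factor([[3, 1, 1, 2], [1, 1, 2, 3]], P)
j_bad = blocking_factor([[3], [1, 2]], P)   # node 1 fires only once
```

This checks only the invocation counts; it does not verify that the lists respect the precedence graph.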

Schedule on two processors, J=1
Assumptions: node 1 takes 1 time unit, node 2 takes 2, node 3 takes 3
X1 = {3}
X2 = {1, 1, 2}
[Figure: timing diagram for processor 1 and processor 2]
Iteration period = 4

Schedule on two processors, J=2
Assumptions:
- node 1 takes 1 time unit, node 2 takes 2, node 3 takes 3
- nodes have self loops (so nodes cannot overlap with themselves)
X1 = {3, 1, 1, 2}
X2 = {1, 1, 2, 3}
[Figure: timing diagram for processor 1 and processor 2]
Iteration period is 7/2 = 3.5
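The two iteration periods can be reproduced with a short calculation: the length of one schedule cycle is the busy time of the most loaded processor, and dividing by the blocking factor J gives the period per iteration. The sketch below assumes that no processor has to idle while waiting for data, and the processor lists are assumptions chosen to be consistent with the stated execution times and periods.

```python
def iteration_period(lists, exec_time, J):
    """Cycle length divided by the blocking factor J.
    Assumes each processor runs its list back to back with no idle time,
    so this is a lower bound if precedence constraints force waiting."""
    busy = [sum(exec_time[node] for node in xs) for xs in lists]
    return max(busy) / J

EXEC_TIME = {1: 1, 2: 2, 3: 3}    # node number -> time units

# J = 1: processor 1 is busy for 3 units, processor 2 for 1 + 1 + 2 = 4.
period_j1 = iteration_period([[3], [1, 1, 2]], EXEC_TIME, J=1)

# J = 2: both processors are busy for 7 units, covering two iterations.
period_j2 = iteration_period([[3, 1, 1, 2], [1, 1, 2, 3]], EXEC_TIME, J=2)
```

With J = 1 processor 1 idles for one time unit per cycle; with J = 2 both processors are fully loaded, which is why the period per iteration drops from 4 to 3.5.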

Why are we doing this?
The principle of synchronous data flow is used in many simulators: Ocapi, SPW
We will model our speech processing project in this way.
Combined with bit-true simulations, real life is closely modeled.
Issues in practice:
- choose the schedule to minimize memory requirements
- include non data flow nodes: if-then-else, data dependent calculations