Lecture 4: Synchronous Data Flow Graphs - I. Verbauwhede, 05-06 K.U.Leuven
[Figure: design flow, "skiing down a mountain" (HJ94): Specification (SPW, Matlab, C) -> Algorithm Transformations (pipelining, unrolling, loop merging, compaction) -> Memory Transformations and Optimizations -> Floating-point to Fixed-point (40-bit accumulator) -> target choice: ASIC, Special Purpose coprocessor, Retargetable SP processor, SP-RISC, RISC]
Overview
Lecture 1: what is a system-on-chip
Lecture 2: terminology for the different steps
Lecture 3: models of computation
Lecture 4 (today): two MOCs
- Synchronous data flow graphs
- Control flow

Time Representations
Tag t is an abstraction of time (temporal order).
- Absolute time = global ordering = overspecification
- Cumbersome and harmful, because it reduces the degrees of freedom
- Order in t is order in events (t1 < t2 <=> e1 < e2)
3 representations:
- Absolute time: T = R (T is a totally ordered, closed, connected set)
- Discrete time: T is a totally ordered discrete set, with < in T such that t != t' => (t < t') or (t' < t)
- Precedences: T is a partially ordered discrete set
Models for Time
Timed Models of Computation = total order
- Continuous time
- Discrete event (simulation with zero-delay??)
- Synchronous / clocked discrete time = most used
- Discrete time (synchronous/reactive)
Untimed MOC = partial order
- Sequential Processes with Rendez-Vous
- Kahn Networks
- Data-flow networks
Reality = mixture of MOCs

Today:
Reference: E. Lee, D. Messerschmitt, "Synchronous data flow," Proceedings of the IEEE, Vol. 75, No. 9, September 1987.
Other reference: E. Lee, D. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing," IEEE Transactions on Computers, Vol. C-36, No. 1, Jan. 1987. (This reference includes the proofs for the first reference.)
For multi-dimensional signal processing: stream scheduling, very effective for video and image processing applications. Example: Phideo [Philips]
Data flow
A data flow representation of an algorithm is a directed graph:
- nodes are computations (actors)
- arcs (or edges) are paths over which the data ("samples") travels.
The graph shows which computations to perform, not their sequence. The sequence is determined only by data dependencies. Hence it exposes concurrency.

Data flow (cont.)
Assume an infinite stream of input samples, so nodes perform their computations an infinite number of times.
A node will fire (start its computation) when its inputs are available. A node with no inputs can fire anytime.
Numbers on the arcs indicate the number of samples (tokens) produced or consumed by one firing.
Nodes fire when input data is available: this is called data-driven, and it exposes concurrency.
Nodes must be free of side effects: e.g. a write to a memory location followed by a read is only allowed if there is an arc between them.
Data flow (cont.)
True data flow: the overhead of checking the availability of input tokens is too large.
BUT, in synchronous data flow the number of tokens produced/consumed is known beforehand (a priori)! Hence the scheduling can be done a priori, at compile time. Thus there is NO runtime overhead!
For signal processing applications: the number of tokens produced & consumed is independent of the data and known beforehand (= relative sample rates).

Synchronous Data Flow - definition
A synchronous data flow graph (SDF) is a network of synchronous nodes (also called blocks).
A node is a function that is invoked whenever there are enough inputs available. The inputs are consumed.
For a synchronous node, the consumptions and productions are known a priori.
Homogeneous SDF graph: when there are only 1's on the graph.
Delay
Delay as in signal processing: a unit delay on the arc between A and B means that the n-th sample consumed by B is the (n-1)-th sample produced by A. A delay of d is initialized with d zero samples.

A synchronous compiler
Translation from an SDF graph to a sequential program on a processor.
Two tasks:
- Allocation of shared memory between blocks, or setting up communication between blocks
- Scheduling blocks onto processors such that all input data is available when a block is invoked
Goal: create a Periodic Admissible Parallel Schedule (PAPS)
Precedence graph - Schedule
The precedence graph indicates the sequence of operations (e.g. A and B must precede C).
The schedule determines when and where (which processor or which data path unit) each node fires.
Valid schedules: A B C, B A C. Invalid schedule: C A B.

Blocked Schedule
Blocked: one cycle terminates before the next one starts.
[Figure: static schedule of nodes A, B, C, E, F, G on 3 processors/units: a valid blocked schedule, and a pipelined (not blocked) schedule overlapping consecutive cycles, e.g. P1: A C G, P2: B F, P3: E]
Small vs. large grain
Iteration period = length of one cycle = 1/throughput. Goal: minimize the iteration period.
Iteration period bound = minimum achievable iteration period (assuming pipelining) = bounded by the total computation time of the operations in a loop divided by the number of delays in that loop.
Atomic SDF graph: the nodes are primitive operations.
Large grain SDF graph: the nodes are larger functions.
Example: IIR filter = small grain; JPEG = large grain.

SDF graph implementation
Implementation requires:
- buffering of the data samples passing between nodes
- scheduling nodes when their inputs are available
A dynamic (= runtime) implementation requires a runtime scheduler that checks when inputs are available and schedules nodes when a processor is free. This is usually expensive because of the overhead.
Contribution of Lee-87: SDF graphs can be scheduled at compile time, with no runtime overhead.
The compiler will:
- determine the execution order of the nodes on one or multiple processors or data path units
- determine the communication buffers between nodes.
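The iteration period bound can be computed by enumerating the directed loops of the graph. A minimal Python sketch, where the graph, node execution times, and delay counts are illustrative assumptions, not taken from the slides:

```python
# Iteration period bound = max over directed loops of
# (total node execution time in loop) / (number of delays in loop).
# Hypothetical 3-node example graph.

exec_time = {"A": 1, "B": 2, "C": 3}
# arcs: (src, dst, number of delays on the arc)
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 1)]
delay = {(s, d): w for s, d, w in edges}

def simple_cycles(nodes, edges):
    """Enumerate simple directed cycles by DFS from each node."""
    succ = {}
    for s, d, _ in edges:
        succ.setdefault(s, []).append(d)
    cycles = []
    def dfs(start, node, path):
        for nxt in succ.get(node, []):
            if nxt == start:
                cycles.append(path[:])
            elif nxt not in path and nxt > start:  # each cycle found once
                dfs(start, nxt, path + [nxt])
    for n in sorted(nodes):
        dfs(n, n, [n])
    return cycles

def loop_bound(cycle):
    # every loop needs at least one delay, else this divides by zero
    total_time = sum(exec_time[n] for n in cycle)
    total_delay = sum(delay[(cycle[i], cycle[(i + 1) % len(cycle)])]
                      for i in range(len(cycle)))
    return total_time / total_delay

bound = max(loop_bound(c) for c in simple_cycles(exec_time, edges))
print(bound)  # (1 + 2 + 3) / 1 = 6.0 for this example
```

The zero-delay failure mode is deliberate: it mirrors the later slide stating that every directed loop needs at least one delay.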
Periodic schedule for SDF graph
Assumptions:
- infinite stream of input data (the case for signal processing applications)
- periodic schedule: the same schedule is applied repetitively to the input stream
Goal: check if a schedule can be found:
- Periodic admissible sequential schedule (PASS) for a single processor or data path unit
- Periodic admissible parallel schedule (PAPS) for multiple processors
[Figure: a rate-inconsistent graph with no PASS vs. a consistent graph]

Formal approach
Construct the topology matrix Γ:
- each node is a column
- each arc is a row
- entry (i, j) = number of tokens node j produces on arc i per firing; consumption is a negative entry.
Self loop entry? A self-loop contributes a single entry: tokens produced minus tokens consumed by the node on its own arc.
[Figure: example graph with nodes n1, n2, n3 and its topology matrix]
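Constructing Γ is mechanical. A minimal sketch in Python; the node names and production/consumption rates are illustrative assumptions:

```python
# Build the topology matrix Γ: one row per arc, one column per node.
# Entry (i, j) = tokens node j produces on arc i per firing
# (negative if it consumes them).

nodes = ["n1", "n2", "n3"]
# arcs: (producer, tokens produced, consumer, tokens consumed)
arcs = [("n1", 1, "n2", 1),
        ("n1", 2, "n3", 1),
        ("n2", 2, "n3", 1)]

def topology_matrix(nodes, arcs):
    col = {n: j for j, n in enumerate(nodes)}
    gamma = [[0] * len(nodes) for _ in arcs]
    for i, (src, p, dst, c) in enumerate(arcs):
        gamma[i][col[src]] += p   # production: positive entry
        gamma[i][col[dst]] -= c   # consumption: negative entry
        # a self-loop (src == dst) yields the net p - c in one entry
    return gamma

print(topology_matrix(nodes, arcs))
# [[1, -1, 0], [2, 0, -1], [0, 2, -1]]
```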
FIFO queues
b(n) = vector of queue sizes on each arc
v(n) = vector of 0's and 1's indicating which node fires at step n
Buffer evolution: b(n+1) = b(n) + Γ v(n)
[Figure: example graph with b(0) and b(1) after one firing]

FIFO queues & delays
Delays are handled by initializing b(0) with the delay values. So at start-up a node can already fire on the initial delay tokens before its predecessor fires again.
Consequence: every directed loop must have at least one delay to be able to start.
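The buffer-update relation b(n+1) = b(n) + Γ v(n) is easy to simulate directly. A sketch under assumed rates; the graph and the firing order are illustrative, with delays entering only through b(0):

```python
# Simulate FIFO sizes with b(n+1) = b(n) + Γ·v(n), where v(n) selects
# the node fired at step n. Delay tokens go into the initial b(0).

gamma = [[1, -1, 0],
         [2, 0, -1],
         [0, 2, -1]]
b = [0, 0, 0]          # b(0): put initial delay tokens here if any

def fire(b, gamma, node):
    """Fire `node` (a column index); refuse if a buffer would go negative."""
    nb = [bi + row[node] for bi, row in zip(b, gamma)]
    if any(x < 0 for x in nb):
        raise ValueError("node not runnable: buffer would go negative")
    return nb

for node in [0, 1, 2, 2]:   # one full period (repetitions q = (1, 1, 2))
    b = fire(b, gamma, node)
print(b)  # [0, 0, 0]: a complete period returns the buffers to b(0)
```

That the buffers return exactly to b(0) after one period is what makes the schedule repeatable with bounded memory.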
Identifying inconsistent sample rates
Necessary condition for the existence of a periodic schedule with bounded memory:
rank(Γ) = s - 1 (s is the number of nodes).
[Figure: two example graphs with their topology matrices; one has rank s-1, the other does not]

Relative firing frequency
A topology matrix with the correct rank has a strictly positive (element-wise) integer vector q in its right nullspace:
Γ q = 0
[Figure: example graph with rank(Γ) = s - 1 and its vector q]
q determines the number of times each node is invoked per period!
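The repetitions vector q can be computed by solving Γq = 0 over the rationals and scaling to the smallest integer solution. A sketch, assuming a consistent graph (rank s-1); the example matrix is the hypothetical one used earlier:

```python
from fractions import Fraction
from math import gcd, lcm

# Smallest strictly positive integer q with Γq = 0, via Gaussian
# elimination over the rationals, then scaling by the lcm of denominators.

def repetitions(gamma):
    rows = [[Fraction(x) for x in r] for r in gamma]
    n = len(gamma[0])
    pivots, r = [], 0
    for c in range(n):                       # reduced row echelon form
        piv = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        rows[r] = [x / rows[r][c] for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    free = [c for c in range(n) if c not in pivots]
    assert len(free) == 1, "rank must be s-1 for a consistent SDF graph"
    q = [Fraction(0)] * n
    q[free[0]] = Fraction(1)                 # set the one free variable
    for i, c in enumerate(pivots):
        q[c] = -rows[i][free[0]]             # back-substitute pivots
    scale = lcm(*(x.denominator for x in q)) # clear denominators
    q = [int(x * scale) for x in q]
    g = 0
    for x in q:                              # normalize to smallest vector
        g = gcd(g, x)
    return [x // g for x in q]

print(repetitions([[1, -1, 0], [2, 0, -1], [0, 2, -1]]))  # [1, 1, 2]
```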
Insufficient delays
Rank s-1 is a necessary but not a sufficient condition: a loop without enough delays still deadlocks.
[Figure: two-node loop with too few delay tokens]

Scheduling for a single processor
Given: a positive integer vector q such that Γq = 0, and given b(0).
The i-th node is runnable if:
- it has not yet been run q_i times
- running it will not cause any buffer size to become negative
A class S (sequential) algorithm creates a static schedule. It:
- schedules a node if it is runnable
- updates b(n)
- stops when no more nodes are runnable.
If the class S algorithm terminates before it has scheduled each node the number of times specified in the q vector, it is said to be deadlocked.
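A minimal illustration of why rank s-1 is not sufficient: a hypothetical two-node loop with unit rates satisfies Γq = 0 with q = (1, 1), but with no initial delay tokens neither node can ever fire.

```python
# Two nodes in a cycle, unit rates, no delays: rank(Γ) = s - 1 = 1
# and q = (1, 1) exists, yet the graph deadlocks immediately.
gamma = [[1, -1],   # arc n1 -> n2
         [-1, 1]]   # arc n2 -> n1
b0 = [0, 0]         # no delay tokens anywhere in the loop

def runnable(node):
    """A node is runnable if firing it keeps all buffers non-negative."""
    return all(bi + row[node] >= 0 for bi, row in zip(b0, gamma))

print([runnable(n) for n in (0, 1)])  # [False, False]: deadlock
```

One delay token in b0 (e.g. b0 = [0, 1]) would make node n1 runnable and break the deadlock, matching the rule that every directed loop needs at least one delay.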
Example Class S algorithm
- Solve for the smallest positive integer vector q.
- Form a list of all nodes in the system.
- For each node: schedule it if it is runnable; try each node once.
- If each node has been scheduled q_i times, STOP.
- If no node can be scheduled, indicate deadlock; else continue with the next node.
[Figure: example graph with candidate schedules; one ordering is a PASS, the others are not]
(Complexity: traverse the graph once, visiting each edge once.)
Optimization: minimize the buffer (= memory) requirements.
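The steps above can be sketched as a small Python scheduler. This is one possible class S algorithm, run on a hypothetical 3-node graph with q = (1, 1, 2); it returns a PASS or None on deadlock:

```python
# Class S scheduler sketch: repeatedly sweep the node list, scheduling
# any node that still has firings left (per q) and whose firing keeps
# all buffers non-negative. A full sweep with no progress = deadlock.

def class_s_schedule(gamma, q, b0):
    b = list(b0)
    runs_left = list(q)
    schedule = []
    while any(runs_left):
        progress = False
        for node in range(len(q)):
            if runs_left[node] == 0:
                continue
            nb = [bi + row[node] for bi, row in zip(b, gamma)]
            if all(x >= 0 for x in nb):        # runnable?
                b, runs_left[node] = nb, runs_left[node] - 1
                schedule.append(node)
                progress = True
        if not progress:
            return None                        # deadlocked
    return schedule                            # a PASS

gamma = [[1, -1, 0], [2, 0, -1], [0, 2, -1]]
print(class_s_schedule(gamma, [1, 1, 2], [0, 0, 0]))  # [0, 1, 2, 2]
```

Different sweep orders yield different (equally valid) PASSes with different buffer requirements, which is where the memory-minimization optimization comes in.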
Schedule for parallel processors
Assumptions: homogeneous processors, no overhead in communication.
If a PASS exists, then a PAPS also exists (because we could run all nodes on one processor).
A blocked periodic admissible parallel schedule is a set of lists {Xi; i = 1, ..., M}:
- M is the number of processors
- Xi = the periodic schedule for processor i
p is the smallest positive integer vector such that Γp = 0. Then one cycle of the schedule invokes every node q = Jp times. J is called the blocking factor (and can be different from 1).

Precedence graph
[Figure: example graph with its topology matrix, rank s-1, repetitions vector p, and the precedence graph for unity blocking factor]

Schedule on two processors, J=1
Assumptions: node 1 takes 1 time unit, node 2 takes 2, node 3 takes 3.
X1 = {3}
X2 = {1, 2, 1}
Iteration period = 4
Schedule on two processors, J=2
Assumptions: node 1 takes 1 time unit, node 2 takes 2, node 3 takes 3; the nodes have self loops (so a node cannot overlap with itself).
X1 = {3, 1, 2, 1}
X2 = {1, 2, 1, 3}
Iteration period = 7/2 = 3.5 (two iterations every 7 time units)

Why are we doing this?
The principle of synchronous data flow is used in many simulators.
Based on it, multi-dimensional data flow representations have been developed.
Reality is always more complicated. Issues in practice:
- choose the schedule to minimize memory requirements
- include non-data-flow nodes: if-then-else, data dependent calculations