Textbook: VLSI ARRAY PROCESSORS S.Y. Kung


1 1/34 Textbook: VLSI Array Processors, S.Y. Kung, Prentice-Hall, Inc. Instructor: Ching-Long Su (kevinsu@twins.ee.nctu.edu.tw)

2 Chapter 4 2/34 Systolic Array Processors

3 Outline of Chapter 4 3/34
4.1 Introduction
4.2 Systolic Array Processors
4.4 Performance Analysis and Design Optimization
4.5 Systolic Arrays for the Transitive Closure and Dynamic Programming Problems
4.6 Systolic Design for Artificial Neural Network
4.7 Concluding Remarks
4.8 Problems


5 4.1 Introduction 5/34
In Chapter 4:
1. Review the methodology for mapping algorithms onto signal flow graphs (SFGs).
2. Discuss the cut-set systolization (retiming) method for systolic array design.


7 4.2 Systolic Array Processors 7/34
Systolic Array Processor
1. A systolic system is a network of processors that rhythmically compute and pass data through the system.
2. A systolic array processor avoids the classic memory access bottleneck.
3. A systolic array processor can handle both compute-bound and I/O-bound computations.

8 4.2 Systolic Array Processors 8/34
Basic Configuration of Systolic Array
[Figure: a memory connected to an array of processing elements (PEs)]

9 4.2 Systolic Array Processors 9/34
Definition of Systolic Arrays
1. Synchrony: The data are rhythmically computed (timed by a global clock) and passed through the network.
2. Modularity and Regularity
3. Spatial Locality and Temporal Locality
4. Pipelinability

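10 4.2 Systolic Array Processors 10/34
Example 1: Systolic Array for Convolution
[Figure: linear array of PEs holding the weights W_i, with the input stream u_0, u_1, ... and the output stream y_0, y_1, ... passing through the array]
PE operation: a_out = a_in; b_out = b_in + a_in * W_i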
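The PE rule above can be exercised with a small clocked simulation. This is only a sketch: the delay placement (two registers on the a path and one on the b path between neighboring cells, both streams flowing left to right) is an assumption chosen so the operands meet at the right cells, and it is not necessarily the arrangement drawn on the slide. The function name `systolic_convolve` is hypothetical.

```python
def systolic_convolve(u, w):
    """Simulate a linear array whose cell i holds w[i] (w must be non-empty).

    Each cell applies the slide's rule: a_out = a_in, b_out = b_in + a_in * w[i].
    Assumed wiring: 2 registers per a link, 1 register per b link.
    """
    k = len(w)
    n_out = len(u) + k - 1
    a_pipe = [[0, 0] for _ in range(k - 1)]  # two registers between cell i and i+1
    b_pipe = [0] * (k - 1)                   # one register between cell i and i+1
    y = []
    for t in range(n_out + k - 1):
        a_in = u[t] if t < len(u) else 0     # input stream enters cell 0
        b_in = 0                             # partial sums start at zero
        new_a, new_b = [], []
        for i in range(k):
            a_out = a_in
            b_out = b_in + a_in * w[i]
            if i < k - 1:
                # The next cell reads what the link registers hold this cycle.
                next_a_in, next_b_in = a_pipe[i][1], b_pipe[i]
                new_a.append([a_out, a_pipe[i][0]])  # shift the a link
                new_b.append(b_out)                  # latch the b link
                a_in, b_in = next_a_in, next_b_in
        a_pipe, b_pipe = new_a, new_b
        if t >= k - 1:                       # pipeline latency of k - 1 cycles
            y.append(b_out)
    return y
```

With this wiring, cell i sees u[t - 2i] at clock t while the partial sum for output y[t - i] passes through it, so the last cell emits y[n] = sum over i of w[i] * u[n - i].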

11 4.2 Systolic Array Processors 11/34
Example 2: Hexagonal Systolic Array for Band Matrix Multiplication
[Figure: hexagonal array in which the a_ij and b_ij streams enter and the c_ij streams leave (C out) over time steps T = 1 to T = 4]

12 4.2 Systolic Array Processors 12/34
Properties of Systolic Array
1. Simple and Regular Design
2. Concurrency and Communication
3. Balancing Computation with I/O

13 4.2 Systolic Array Processors 13/34
Clock Distribution Scheme for Synchronization of the Systolic Array System: H-tree layout to balance the clock circuit delay
[Figure: H-tree layouts for a linear array, a square array, and a hexagonal array]

14 4.2 Systolic Array Processors 14/34
Systolic vs. SIMD vs. SFG Arrays
[Figure: SIMD Array. Labels: Control Unit (Central Control), Global Communication, Control Bus, Processing Units, Data Bus, Interconnection Network (Local)]

15 4.2 Systolic Array Processors 15/34
Systolic vs. SIMD vs. SFG Arrays
[Figure: Systolic Array. Labels: one Control Unit per Processing Unit, Interconnection Network (Local)]

16 4.2 Systolic Array Processors 16/34
Systolic vs. SIMD vs. SFG Arrays
[Figure: SFG Array. Labels: one Control Unit per Processing Unit, Data Bus, Global Communication, Interconnection Network (Local)]


18 18/34 Three Stages of the Canonical Mapping Algorithm for Systolic Array Design
1. Derive a (local) DG from the algorithm.
2. Map the DG to an SFG array.
3. Transform the SFG to a systolic array (i.e., retiming).

19 19/34 The major gap in systolic array design is that most SFGs are not given in temporally localized form.
Systolic Array = SFG Array + Pipeline Retiming

20 20/34 Cut-Set Retiming Procedure
1. Time Scaling: All delays may be scaled by a common factor; the input and output rates are then down-sampled accordingly.
2. Delay Transfer: Given a cut-set of the SFG, which partitions the graph into two components, the edges of the cut-set can be grouped into inbound edges and outbound edges.

21 21/34 Data Transfer Rule
[Figure: a cut through the SFG; each inbound edge gains k delays (+k) and each outbound edge loses k delays (-k)]
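The two retiming rules can be sketched on a toy graph. Everything here is illustrative: the dictionary representation of the SFG, the function names `rescale` and `transfer`, and the three-node example are assumptions, not material from the slides.

```python
def rescale(delays, alpha):
    """Time scaling: multiply every edge delay by the integer factor alpha."""
    return {edge: alpha * d for edge, d in delays.items()}


def transfer(delays, component, k):
    """Delay transfer across the cut-set separating `component` from the rest.

    Edges entering `component` (inbound) gain k delays and edges leaving it
    (outbound) lose k delays. A negative delay would be non-causal, so it is
    rejected.
    """
    result = {}
    for (src, dst), d in delays.items():
        if src not in component and dst in component:
            d += k                      # inbound edge
        elif src in component and dst not in component:
            d -= k                      # outbound edge
        if d < 0:
            raise ValueError(f"edge {(src, dst)} would get a negative delay")
        result[(src, dst)] = d
    return result


# Hypothetical 3-node SFG: the forward edges carry one delay each, the
# backward edges carry none, so the backward path ripples through all nodes.
sfg = {(0, 1): 1, (1, 2): 1, (2, 1): 0, (1, 0): 0}

scaled = rescale(sfg, 2)                  # forward edges now hold 2 delays
step1 = transfer(scaled, {0}, 1)          # cut between node 0 and nodes 1, 2
systolic = transfer(step1, {0, 1}, 1)     # cut between nodes 0, 1 and node 2
```

After the two transfers every edge holds exactly one delay, which is the temporally localized form a systolic array requires.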

22 22/34 Systolization Procedure
1. Selection of Basic Operation Modules
2. Applying Retiming Rules
3. Combination of Delay and Operation Modules

23 23/34 Systolization Procedure: Example of Lattice Filters
[Figure: SFG for AR Lattice Filter with inputs X_0, X_1, X_2, outputs Y_0, Y_1, Y_2, and coefficients k_i; Critical Path = 6 MAC]
[Figure: Step 1. Time-Rescaled SFG for AR Lattice Filter]

24 24/34 Systolization Procedure: Example of Lattice Filters
[Figure: Step 2. Retiming SFG for AR Lattice Filter]
[Figure: Step 3. Systolic Array for AR Lattice Filter]
Critical Path = 2 MAC

25 25/34 Systolization Procedure: Example of Matrix Multiplication
[Figure: array for 4 x 4 matrix multiplication with staggered streams of the elements a_11 ... a_44 and b_11 ... b_44 entering the array]
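A staggered-input matrix multiplication array can be simulated in a few lines. The sketch below assumes an output-stationary arrangement (each PE (i, j) accumulates c_ij while a values move right and b values move down); the slide shows only the staggered operand streams, so this arrangement and the name `systolic_matmul` are assumptions.

```python
def systolic_matmul(A, B):
    """Simulate an n x n mesh of PEs multiplying two n x n matrices."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]   # a value currently held in PE (i, j)
    b_reg = [[0] * n for _ in range(n)]   # b value currently held in PE (i, j)
    for t in range(3 * n - 2):            # last product happens at t = 3n - 3
        for i in range(n):
            # Shift row i one PE to the right, then feed the left edge.
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                      # row i is delayed by i cycles
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            # Shift column j one PE down, then feed the top edge.
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                      # column j is delayed by j cycles
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]  # one MAC per PE per cycle
    return C
```

At clock t, PE (i, j) holds A[i][t - i - j] and B[t - i - j][j], so the staggering guarantees that matching operands meet in the same cell.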

26 26/34 All systolic arrays obtained from linear projections of the DG can be derived by the following two steps:
1. Mapping the DG onto SFGs by the SFG Projection Procedure
2. Mapping the SFG onto a Systolic Array by the Cut-Set Retiming

27 27/34 Retiming in the Sorting Systolic Arrays: Insertion Sorter
[Figure: DG for sorting with inputs x_ij and outputs m_ij along index i; d = s = [1, 0]]
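The behavior of a linear insertion sorter can be sketched as a clocked simulation. The cell rule used here (keep the smaller value, pass the larger one to the right through one register) is an assumption for illustration; the slide's DG notation (x_ij, m_ij) is not modeled, and `systolic_insertion_sort` is a hypothetical name.

```python
def systolic_insertion_sort(xs):
    """Simulate a chain of len(xs) cells; one new input enters per clock."""
    n = len(xs)
    held = [None] * n          # value stored in each cell (None = empty)
    link = [None] * n          # link[i]: value arriving at cell i this clock
    for t in range(2 * n - 1): # the last displaced value arrives at t = 2n - 2
        nxt = [None] * n
        link[0] = xs[t] if t < n else None
        for i in range(n):
            x = link[i]
            if x is None:
                continue               # nothing arrives at this cell
            if held[i] is None:
                held[i] = x            # an empty cell captures the value
                continue
            if x < held[i]:
                held[i], x = x, held[i]  # keep the smaller, displace the larger
            if i + 1 < n:
                nxt[i + 1] = x         # pass the larger value to the right
        link = nxt
    return held
```

All cells work concurrently on different inputs, so n values are sorted in 2n - 1 clocks instead of the O(n^2) steps of sequential insertion sort.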

28 28/34 Retiming in the Sorting Systolic Arrays: Selection Sorter
[Figure: DG for sorting with inputs x_ij and outputs m_ij along index i, and the resulting linear array with inputs x_1 ... x_4; d = s = [0, 1]]

29 29/34 Retiming in the Sorting Systolic Arrays: Bubble-Sorter
[Figure: DG for sorting with inputs x_ij and outputs m_ij along index i, and the resulting linear array; d = s = [1, 1]]

30 30/34 Rotation of Schedule Vector for Insertion Sorter
[Figure: DG for sorting with inputs x_ij and outputs m_ij along index i, showing the Default Schedule and the Desired Schedule; d = s = [1, 0]]

31 31/34 Bit Level Systolic Arrays: Example of Inner Product of Two Vectors
The inner product c of two vectors a and b is computed as:
$c = \sum_{k=1}^{n} a_k b_k$
Assume that the elements of a and b are m-bit integers.
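Expanding each m-bit operand into its bits turns the inner product into a sum of single-bit partial products, which is what a bit-level array computes. The sketch below assumes unsigned integers and checks the expansion numerically; the helper names `bits` and `bit_level_inner_product` are hypothetical.

```python
def bits(x, m):
    """Return the m low-order bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(m)]


def bit_level_inner_product(a, b, m):
    """Sum the partial products a[k][i] AND b[k][j], each weighted by 2**(i + j)."""
    total = 0
    for a_k, b_k in zip(a, b):
        a_bits, b_bits = bits(a_k, m), bits(b_k, m)
        for i in range(m):
            for j in range(m):
                total += (a_bits[i] & b_bits[j]) << (i + j)
    return total
```

For example, `bit_level_inner_product([3, 5, 7], [2, 4, 6], 3)` agrees with the word-level result 3*2 + 5*4 + 7*6 = 68.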

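32 32/34 Bit Level Systolic Arrays: Example of Inner Product of Two Vectors
$c = \sum_{k=1}^{n} a_k b_k = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \left(\sum_{i=0}^{m-1} a_{1,i} 2^i\right)\left(\sum_{j=0}^{m-1} b_{1,j} 2^j\right) + \cdots + \left(\sum_{i=0}^{m-1} a_{n,i} 2^i\right)\left(\sum_{j=0}^{m-1} b_{n,j} 2^j\right)$
Operand bits: a_{1,m-1} ... a_{1,1} a_{1,0} and b_{1,m-1} ... b_{1,1} b_{1,0}
Top Layer (partial products of a_1 b_1):
b_{1,0} a_{1,m-1}   ...   b_{1,0} a_{1,1}   b_{1,0} a_{1,0}
b_{1,1} a_{1,m-1}   ...   b_{1,1} a_{1,1}   b_{1,1} a_{1,0}
...
b_{1,m-1} a_{1,m-1}   ...   b_{1,m-1} a_{1,1}   b_{1,m-1} a_{1,0}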

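33 33/34 Bit Level Systolic Arrays: Example of Inner Product of Two Vectors
$c = \sum_{k=1}^{n} a_k b_k = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \left(\sum_{i=0}^{m-1} a_{1,i} 2^i\right)\left(\sum_{j=0}^{m-1} b_{1,j} 2^j\right) + \cdots + \left(\sum_{i=0}^{m-1} a_{n,i} 2^i\right)\left(\sum_{j=0}^{m-1} b_{n,j} 2^j\right)$
Operand bits: a_{n,m-1} ... a_{n,1} a_{n,0} and b_{n,m-1} ... b_{n,1} b_{n,0}
Bottom Layer (partial products of a_n b_n):
b_{n,0} a_{n,m-1}   ...   b_{n,0} a_{n,1}   b_{n,0} a_{n,0}
b_{n,1} a_{n,m-1}   ...   b_{n,1} a_{n,1}   b_{n,1} a_{n,0}
...
b_{n,m-1} a_{n,m-1}   ...   b_{n,m-1} a_{n,1}   b_{n,m-1} a_{n,0}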

34 34/34 Bit Level Systolic Arrays: Example of Inner Product of Two Vectors
[Figure: bit-level array with input bits a_{k,i} and b_{k,j} for k, i, j = 0 to 2, arranged from the Top Layer to the Bottom Layer; half adders (HA) and full adders (FA) pass Sum and Carry signals and produce the output bits c_0 to c_4]
