Embedded Systems Design with Platform FPGAs


Embedded Systems Design with Platform FPGAs Spatial Design Ron Sass and Andrew G. Schmidt http://www.rcs.uncc.edu/~rsass University of North Carolina at Charlotte Spring 2011 Embedded Systems Design with Platform FPGAs 1/ 66

Chapter 5: Spatial Design Embedded Systems Design with Platform FPGAs 2/ 66

Chapter 5 Learning Objectives Topics principles of parallelism using principles to find parallelism practical aspects of implementing on an FPGA Embedded Systems Design with Platform FPGAs 3/ 66

Origins of Parallel in electronics, parallel and serial refer to the physical arrangement of components in space i.e., two resistors can be in parallel (a) or in series (b) however, applications are typically constructed from sequences of commands with the introduction of multiple processors, the hardware is in parallel even though each processor is still programmed sequentially [Figure: (a) resistors R1 and R2 in parallel; (b) R1 and R2 in series] Embedded Systems Design with Platform FPGAs 4/ 66

Parallel and Sequential This led to the two terms we use here: sequential the same components are re-used over time and the results are the accumulation of a sequence of operations parallel multiple instances of components are used and some operations may happen concurrently most applications written in the 20th Century are sequential most embedded systems have parallel components (as well as parallel programs) the goal of this chapter is to find parallelism in applications written sequentially Embedded Systems Design with Platform FPGAs 5/ 66

Motivating Example consider an application consisting of eight tasks each task is completed in one time-step with one unit, the application completes in eight cycles Embedded Systems Design with Platform FPGAs 6/ 66

Before and After Finding Parallelism [Figure: (a) a single unit u0 executes tasks t0 through t7 in eight cycles; (b) four units complete the same tasks in three cycles: u3 gets t3; u2 gets t2, t6; u1 gets t1, t5; u0 gets t0, t4, t7] Embedded Systems Design with Platform FPGAs 7/ 66

Motivating Example (cont'd) If we had four units, we might be able to complete the application in fewer cycles. some units may get two tasks some units may get one task another unit may get three tasks This results in the application finishing roughly 2.7× faster but requires 4× as many resources. Embedded Systems Design with Platform FPGAs 8/ 66

Motivating Question One might ask: Why not move task t7 to unit u3? the work will be evenly distributed, and the application will complete in 2 versus 3 cycles... however, the tasks may have ordering requirements t6 has to complete before t7 begins and t2 has to complete before t6 begins Finding the legal parallelism in an application is our goal. Embedded Systems Design with Platform FPGAs 9/ 66

High-Level Concepts Over the next few slides, we'll discuss some key parallel concepts. granularity degree of parallelism spatial organizations Embedded Systems Design with Platform FPGAs 10/ 66

Granularity The size of the parallel tasks influences many decisions. The granularity is typically determined by the frequency at which the tasks communicate: fine-grain tasks generally need to communicate more frequently. Embedded Systems Design with Platform FPGAs 11/ 66

Degree of Parallelism degree of parallelism (DOP) can be used in multiple senses: the maximum number of operations that could execute concurrently, or the number that are concurrent at a given instant in time, DOP(t). The DOP of an application is useful when considering resource utilization versus maximum performance. Embedded Systems Design with Platform FPGAs 12/ 66

DOP Example With the DOP illustrated, we can reason about potential solutions. For example, since the maximum DOP(t) is 6, having more than six units would be overkill using four units would achieve 78% of the maximum performance while only using 66% of the resources [Figure: DOP(t) plotted as the number of parallel threads (0 to 7) versus time (1 to 7)] Embedded Systems Design with Platform FPGAs 13/ 66

Flynn s Taxonomy One well-known way of classifying parallel architectures is by how many streams of instructions and how many streams of data are possible. Single Instruction Stream, Single Data Stream (SISD) Single Instruction Stream, Multiple Data Stream (SIMD) Multiple Instruction Stream, Single Data Stream (MISD) Multiple Instruction Stream, Multiple Data Stream (MIMD) Embedded Systems Design with Platform FPGAs 14/ 66

Single Instruction Stream, Single Data Stream Single Instruction Stream, Single Data Stream (SISD) is the single-core, von Neumann style processor model used in many general-purpose computers one stream of instructions one stream of data (reads/writes) Embedded Systems Design with Platform FPGAs 15/ 66

Single Instruction Stream, Multiple Data Streams Single Instruction Stream, Multiple Data Stream (SIMD) model uses simple Processing Elements (PEs) that are all performing instructions from a single instruction stream in lock step one stream of instructions broadcast to all PEs each PE has its own memory (so multiple streams of data are being simultaneously read and written) Embedded Systems Design with Platform FPGAs 16/ 66

Multiple Instruction Stream, Single Data Stream Multiple Instruction Stream, Single Data Stream (MISD) here parallel units each have their own, independent instruction stream but a single stream of data travels from unit to unit usually the data travels over an explicit communication channel sometimes called a systolic architecture Embedded Systems Design with Platform FPGAs 17/ 66

Multiple Instruction Stream, Multiple Data Streams Multiple Instruction Stream, Multiple Data Stream (MIMD) (probably the most common) multiple processing units, each with its own: memory hierarchy, stream of instructions, and stream of data to/from local memory Embedded Systems Design with Platform FPGAs 18/ 66

Everything Old is New Again At one point in the late 1990s, commodity off-the-shelf components made MIMD very economical and thus very popular. the volume of parallel processors was MIMD hardware but programmed as Single Program Multiple Data (SPMD) that is, one program was replicated to work on multiple data streams however, the replicated copies did not run in lock step also, internally GPGPUs and certain graphics instructions on modern processors are similar to SIMD Also, FPGAs are large enough now that it is feasible to implement any of these architectures. Embedded Systems Design with Platform FPGAs 19/ 66

Application Parallelism thus far, we have considered parallel hardware structures; to utilize multiple instruction streams or multiple data streams, we must find them in our software reference design vector and pipeline parallelism control parallelism data (regular) parallelism Embedded Systems Design with Platform FPGAs 20/ 66

Vector and Pipeline Parallelism suppose we have a task T that can be divided up into n subtasks {t0, t1,..., t(n-1)} (the granularity can vary) if we map these subtasks onto parallel units as illustrated below each task performs part of the processing and forwards its partial results to the next subtask via a buffer [Figure: T implemented as a pipeline t0 → t1 → ... → t(n-1)] Embedded Systems Design with Platform FPGAs 21/ 66

Pipeline Performance clearly, if we only need to produce one output, the performance is n cycles per T operation however, if we need a sequence of m outputs O = o0, o1,..., o(m-1), then the time for m T operations is m + n - 1 cycles and clearly as m → ∞, the average performance approaches 1 cycle per T Embedded Systems Design with Platform FPGAs 22/ 66
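The cycle count above can be sketched as a quick numeric check; the helper names below (completion_cycle, total_cycles) are illustrative, not from the text:

```c
#include <assert.h>

/* For an n-stage pipeline, the first result appears after n cycles and
 * one more result drains every cycle thereafter, so output o_k
 * (0-indexed) is complete at cycle n + k. */
static int completion_cycle(int n, int k) {
    return n + k;
}

/* Total cycles to produce m outputs: n + m - 1. */
static int total_cycles(int n, int m) {
    return completion_cycle(n, m - 1);
}
```

For large m, total_cycles(n, m) / m tends toward 1, matching the claim that the average performance approaches 1 cycle per T.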

Pipelines in Embedded Systems one may not think of large vectors of data in embedded systems however, many embedded systems do periodically sample their environment this results in a large stream of inputs (even though the data was never stored as a vector in primary or secondary memory) Embedded Systems Design with Platform FPGAs 23/ 66

Pipeline Balance in the simplest situation, all of the pipeline's parameters are in balance every task takes the same amount of time every partial result requires the same amount of memory to communicate it the arrival rate of inputs matches the rate at which the pipeline can process them if not, a time-space diagram may be helpful in identifying the bottleneck (see next slide) one may need to analyze the bandwidth (see next chapter) Embedded Systems Design with Platform FPGAs 24/ 66

Time-Space Diagram [Figure: threads 0 through 3 plotted against time, each alternating between Active, Comm(unicating), and Idle periods] Embedded Systems Design with Platform FPGAs 25/ 66

Control Parallelism control parallelism (also known as irregular or ad hoc parallelism) examines a single stream of instructions looking to separate them into parallel operations in some ways this can be considered a generalization of pipeline parallelism pipeline parallelism tends to be more uniform control parallelism typically has much more complex communication patterns the number of cycles to complete each parallel task in control parallelism tends to vary more than in pipeline parallelism Embedded Systems Design with Platform FPGAs 26/ 66

Control Parallelism Example Consider the following program that computes the roots of the polynomial ax² + bx + c with the Quadratic Formula: x0, x1 = (-b ± √(b² - 4ac)) / 2a The sequential code would be: (s1) descr = sqrt(b*b-4*a*c) (s2) denom = 2*a (s3) x0 = (-b+descr)/denom (s4) x1 = (-b-descr)/denom The inputs to our entity will be a, b, and c. The outputs will be x0 and x1. Embedded Systems Design with Platform FPGAs 28/ 66

Statement-Level Granularity [Figure: dataflow graph at statement granularity: inputs a, b, c feed s1 (descr) and s2 (denom); b, descr, and denom feed s3 (x0) and s4 (x1)] Embedded Systems Design with Platform FPGAs 29/ 66

Operational-Level Granularity [Figure: dataflow graph at the level of individual operations (mul, sub, neg, add, sqrt, div) computing x0 and x1 from a, b, and c] Embedded Systems Design with Platform FPGAs 30/ 66

Data (Regular) Parallelism in contrast to control parallelism, data parallelism looks at a single data stream and tries to separate it into multiple, parallel data streams advantage: dividing up large datasets can often lead to a higher degree of parallelism (and thus higher performance) disadvantage: often the entire dataset is required all at once Embedded Systems Design with Platform FPGAs 31/ 66

Example of Data Parallelism Starting with some array of data, we want to assign each parallel unit a portion of the data it can be divided up in a blocked arrangement or it can be divided up in a cyclic fashion [Figure: the same array partitioned among units in contiguous blocks versus round-robin (cyclic) stripes] Embedded Systems Design with Platform FPGAs 32/ 66
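The two distributions can be made concrete with small owner-computes helpers (hypothetical names, assuming n elements divided among p units):

```c
#include <assert.h>

/* Blocked: element i goes to the unit that owns the contiguous chunk
 * containing i (chunk size is ceil(n/p)). */
static int blocked_owner(int i, int n, int p) {
    int chunk = (n + p - 1) / p;
    return i / chunk;
}

/* Cyclic: elements are dealt out round-robin. */
static int cyclic_owner(int i, int p) {
    return i % p;
}
```

With n = 8 and p = 4, the blocked distribution assigns units {0,0,1,1,2,2,3,3} while the cyclic distribution assigns {0,1,2,3,0,1,2,3}.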

Identifying Parallelism Main challenges: software reference design most likely written with a sequential model in mind determining granularity of parallelism finding legal, alternative orderings so that some operations can execute concurrently [Figure: software reference design → (discover parallelism) → dependence graph → (synthesis) → parallel implementation, with the level of abstraction increasing in the middle] Usually this requires thinking about the algorithm more abstractly before implementing in hardware. Embedded Systems Design with Platform FPGAs 33/ 66

Orderings consider a task T composed of a sequence of n subtasks: T = t0, t1,..., t(n-1) we say that T is totally ordered because programs written for a single processor are written as a total order we define a legal order as any ordering of the subtasks that produces the same results as the original total order Embedded Systems Design with Platform FPGAs 34/ 66

Example of Orderings suppose task T is the following code a = (x+y) ; // (s1) b = (x-y) ; // (s2) r = a/b ; // (s3) the original total ordering is T = s1, s2, s3 another legal ordering is T = s2, s1, s3 this demonstrates the main concept s1 has to complete before s3 and s2 has to complete before s3 but s1 and s2 are independent independent operations can be executed in parallel Embedded Systems Design with Platform FPGAs 36/ 66

Partial Order we want to take a totally ordered, sequential software reference design and find a partial ordering of its subtasks formally, a partial order is a set (our subtasks) and a binary relation ( must complete before )¹ in our case, this relation has a specific name called dependence ¹ technically, this is a precedence relation (but they are very similar and the term partial order is more common) Embedded Systems Design with Platform FPGAs 37/ 66

Dependence given a sequence T we can define a relation R on the set T: R ⊆ T × T in English, the relation is (ti, tj) ∈ R if subtask ti comes before subtask tj and we could recreate the total order of a sequential program with (ti, tj) ∈ Rtot if i < j, or just {(ti, ti+1) | 0 ≤ i < n-1} however, to find other legal orderings what we want to do is find a subset of Rtot we determine if it is legal to remove an ordering by looking at the IN and OUT sets of the subtasks Embedded Systems Design with Platform FPGAs 38/ 66

IN/OUT Sets these concepts can be applied at any granularity but to illustrate, we'll use program statements here the OUT set of a task ti, denoted OUT(ti), is all of the values that ti outputs the IN set of a task ti, denoted IN(ti), is all of the values that ti requires for statements, the OUT set is (approximately) the memory location of the left-hand side of an assignment statement the IN set is the memory locations of all of the variables referenced on the right-hand side Embedded Systems Design with Platform FPGAs 39/ 66

Dependence (formally) with the IN sets and OUT sets of every subtask, we can formally define the dependence relation as (ti, tj) ∈ D if OUT(ti) ∩ IN(tj) ≠ {} note: we talked about values when introducing the IN sets and OUT sets variables may get rewritten over time we want to track the values not just variable names references and aliases can confound the problem because we cannot always tell what address a pointer has array references with variable indices are often solvable but can be tricky for humans Embedded Systems Design with Platform FPGAs 40/ 66

Other Dependence by tracking values, we find what is called true dependence consider the following code a = (x+y) ; // (s4) r = a ; // (s5) b = (x-y) ; // (s6) r = r/b ; // (s7) some places will say there is an output dependence between s5 and s7 (because s7 has to execute after s5 to produce the same results) we also have to be sure s5 is executed before s7 so that r holds the right value when s7 reads it these other forms of dependence can usually be avoided by simply renaming the variables or converting the code to another form Embedded Systems Design with Platform FPGAs 42/ 66

Dependence Example consider the previous code re-written in three statements a = (x+y) ; // (s1) b = (x-y) ; // (s2) r = a/b ; // (s3) we can calculate the IN sets and OUT sets for all three: subtask t | IN(t) | OUT(t) a = (x+y) ; | {x, y} | {a} b = (x-y) ; | {x, y} | {b} r = a/b ; | {a, b} | {r} Embedded Systems Design with Platform FPGAs 44/ 66

Dependence Example (cont'd) next, we can manually compute the dependence relation via brute-force: (ti, tj) ∈ D if OUT(ti) ∩ IN(tj) ≠ {} OUT(s1) ∩ IN(s2) = {a} ∩ {x, y} = {} OUT(s2) ∩ IN(s1) = {b} ∩ {x, y} = {} OUT(s1) ∩ IN(s3) = {a} ∩ {a, b} = {a} OUT(s2) ∩ IN(s3) = {b} ∩ {a, b} = {b} Embedded Systems Design with Platform FPGAs 45/ 66

Dependence Set All of the non-empty intersections indicate that there is a dependence from one subtask to another. In the previous example, this would be: D = {(s1, s3), (s2, s3)} If you treat x and y as source components and r as a sink, these dependences can be visualized [Figure: x and y feed s1 and s2, both of which feed s3, which produces r]. Embedded Systems Design with Platform FPGAs 46/ 66
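The brute-force computation of D is easy to mechanize; this sketch (not from the text) encodes each variable as a bit so that set intersection becomes a bitwise AND:

```c
#include <assert.h>

/* Variables x, y, a, b, r encoded as bits; IN and OUT sets of the
 * three statements s1-s3 represented as bitmasks. */
enum { X = 1, Y = 2, A = 4, B = 8, R = 16 };

static const unsigned IN_SET[3]  = { X|Y, X|Y, A|B };  /* s1, s2, s3 */
static const unsigned OUT_SET[3] = { A,   B,   R   };

/* (ti, tj) is in D iff OUT(ti) intersect IN(tj) is non-empty. */
static int depends(int i, int j) {
    return (OUT_SET[i] & IN_SET[j]) != 0;
}
```

Checking all pairs reproduces D = {(s1, s3), (s2, s3)} and confirms that s1 and s2 are independent.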

Control Dependence So far, we have only considered sequences of assignment statements; imperative programming languages also have conditional and iterative control structures. for example, in the pseudocode if( i<j ) then x = y ; whether x depends on y's value is indirectly influenced by the expression i < j when translating to hardware, we can use templates to handle this form of dependence without formalizing it Embedded Systems Design with Platform FPGAs 48/ 66

If-Then-Else and While rather than exhaustively go through every control structure that might be found in a software reference design, we focus on two general cases (the rest can be derived from these) if-then-else while loop for both, we illustrate with a single assignment statement but this could be a sequence of statements or other nested control statements Embedded Systems Design with Platform FPGAs 49/ 66

If-Then-Else an if-then-else structure can be created by instantiating the then statement, the else statement, and the conditional expression the OUT set values of the then and else parts are the inputs to a MUX and the conditional determines which values propagate forward Embedded Systems Design with Platform FPGAs 50/ 66

If-Then-Else (illustrated) [Figure: for if ( C ) then S0 else S1, both S0 and S1 are instantiated; their outputs feed a MUX selected by C, which drives the write-back (WB)] Embedded Systems Design with Platform FPGAs 51/ 66
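A software analogue of the template (a sketch, not the book's implementation) computes both branches unconditionally and lets the condition select, just as the hardware MUX does:

```c
#include <assert.h>

/* 2-to-1 mux: the condition picks which already-computed value
 * propagates to the write-back. */
static int mux2(int cond, int then_val, int else_val) {
    return cond ? then_val : else_val;
}

/* if ( i<j ) then x = y ; -- both candidate values for x exist in
 * parallel, and i<j drives the select line. */
static int if_then_else(int i, int j, int x, int y) {
    int then_val = y;   /* value of x if the then branch is taken */
    int else_val = x;   /* value of x if it is not */
    return mux2(i < j, then_val, else_val);
}
```

Note this only works when both branches are free of side effects, which is exactly the situation in hardware, where both branches are instantiated anyway.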

While-Loop a while structure can be created by instantiating the body and the conditional expression next, a state machine is needed to determine when the conditional evaluates to false during an iteration, some values will write back to an external store (not shown) or feed back for the next iteration Embedded Systems Design with Platform FPGAs 52/ 66

While-Loop (Graphically) [Figure: for while ( C(i) ) { S(i) }, the body S(i) is instantiated with its outputs fed back through a register to its inputs; the conditional C determines how long the loop iterates] Embedded Systems Design with Platform FPGAs 53/ 66

Loops In General loops are a rich source of parallelism in hardware there is a very large body of research into how to formally analyze loops a full discussion is beyond the scope of this chapter but a few techniques are highlighted Embedded Systems Design with Platform FPGAs 54/ 66

Loop Unrolling loop-carried dependences are dependences between the iterations of the loop if there are no loop-carried dependences, unrolling is simply mechanical while ( C(i) ) { S(i) i = i+1 } becomes while ( C(i+0) or C(i+1) or C(i+2) ) { if ( C(i+0) ) S(i+0) if ( C(i+1) ) S(i+1) if ( C(i+2) ) S(i+2) i = i+3 } sometimes the conditionals can be avoided Embedded Systems Design with Platform FPGAs 55/ 66
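With a concrete body (here S(i) doubles an element, an assumption for illustration), the unrolled form can be checked against the original:

```c
#include <assert.h>

/* Original: while (i < n) { out[i] = 2*in[i]; i = i+1; } */
static void rolled(const int *in, int *out, int n) {
    int i = 0;
    while (i < n) {
        out[i] = 2 * in[i];
        i = i + 1;
    }
}

/* Unrolled by three, guarding each copy of the body as on the slide. */
static void unrolled(const int *in, int *out, int n) {
    int i = 0;
    while (i + 0 < n || i + 1 < n || i + 2 < n) {
        if (i + 0 < n) out[i + 0] = 2 * in[i + 0];
        if (i + 1 < n) out[i + 1] = 2 * in[i + 1];
        if (i + 2 < n) out[i + 2] = 2 * in[i + 2];
        i = i + 3;
    }
}
```

When n is known to be a multiple of the unroll factor, the guards (and the extra terms in the loop condition) disappear, which is one case where "the conditionals can be avoided".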

Loop Induction Variables often programmers will introduce loop-carried dependences to reduce the amount of computation inside the loop often in embedded systems, the FPGA provides plenty of computation and we would prefer to avoid the dependence below shows an example of removing induction variables Before: for ( i=0; i<n; i++ ) { a[j] = b[k]; j++; k+=2; } After: _Kii1 = j; _Kii2 = k; for ( i=0; i<n; i++ ) a[_Kii1+i] = b[_Kii2+i*2]; Embedded Systems Design with Platform FPGAs 57/ 66
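A runnable version of the transformation (with concrete arrays a and b, which are assumptions for illustration) shows the two forms produce the same result while the rewritten loop has no loop-carried updates:

```c
#include <assert.h>

/* Before: j and k are induction variables updated every iteration,
 * creating loop-carried dependences. */
static void before(int *a, const int *b, int n, int j, int k) {
    for (int i = 0; i < n; i++) {
        a[j] = b[k];
        j++;
        k += 2;
    }
}

/* After: j and k are rewritten as affine functions of i, so every
 * iteration is independent of the others. */
static void after(int *a, const int *b, int n, int j, int k) {
    int _Kii1 = j, _Kii2 = k;
    for (int i = 0; i < n; i++)
        a[_Kii1 + i] = b[_Kii2 + i*2];
}
```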

Uniform Dependence Vectors as the number of iterations in a loop increases, the size of the set D increases if the relations between the iterations are uniform, then we can use a more compact form uniform dependence vectors also help the embedded system designer see the relationship between the iterations graphically for a singly-nested loop, the iterations are laid out on a line, doubly-nested loops on an x-y plane, and so on Embedded Systems Design with Platform FPGAs 58/ 66

Iteration Spaces [Figure: the iteration space of a singly-nested loop laid out as points on a line (i from 0 to 10); the iteration space of a doubly-nested loop laid out as a grid of points in the i-j plane] Embedded Systems Design with Platform FPGAs 59/ 66

Dependence Vectors (1D) sum = 0 ; for( i=0 ; i<10 ; i++ ) sum = sum + x[i] ; // (s3) [Figure: iterations 0 through 9 on a line; each iteration depends on the previous one through sum, giving the uniform dependence vector (1)] Embedded Systems Design with Platform FPGAs 61/ 66

Dependence Vectors (2D) for i=1 to 3 for j=1 to 5 x[i] = x[i] + samp[i][j] [Figure: the 3×5 iteration space in the i-j plane; each iteration depends on the previous j iteration through x[i], giving the uniform dependence vector (0,1)] Embedded Systems Design with Platform FPGAs 63/ 66
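In C form (with the arrays sized to cover the 1-based indices, an assumption for illustration), the dependence structure is visible: only the inner j loop carries a dependence through x[i], so the i iterations could be mapped onto parallel units:

```c
#include <assert.h>

/* Each x[i] accumulates its row of samp; the accumulation along j is
 * the uniform dependence vector (0,1), and different i iterations are
 * independent of one another. */
static void accumulate(int x[4], int samp[4][6]) {
    for (int i = 1; i <= 3; i++)       /* independent: parallelizable */
        for (int j = 1; j <= 5; j++)   /* sequential: carries x[i] */
            x[i] = x[i] + samp[i][j];
}
```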

Chapter-In-Review started with a basic understanding of parallel and serial described a number of concepts related to parallelism (DOP, granularity, etc.) described parallelism in hardware and software introduced some formal mathematics to help identify parallelism in a software reference design Embedded Systems Design with Platform FPGAs 64/ 66

Chapter 5 Terms Embedded Systems Design with Platform FPGAs 65/ 66

Chapter 5 Terms degree of parallelism (DOP) Embedded Systems Design with Platform FPGAs 66/ 66