CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.3. Krste Asanovic UC Berkeley Fall 2009


1 CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.3 Krste Asanovic UC Berkeley Fall 2009

2 Overall Problem Statement (figure: Application(s), MP3 bit string -> Audio; mapped through the (Berkeley) Hardware Pattern Language to Hardware (RTL), MP3 bit string -> Audio)

3 BHPL Goals
BHPL captures problem-solution pairs for creating hardware designs (machines) to execute applications.
BHPL Non-Goals
Doesn't describe applications themselves, only machines that execute applications and strategies for mapping applications onto machines.

4 BHPL Overview
Applications (including OPL patterns)
Structural Patterns: Pipelines, Agent&Repository, Model-View-Controller, Event Based, Process Control, Iteration, Map-Reduce, Layered Systems, Task Graphs
Computational Patterns: Circuits, Dense Linear Algebra, Dynamic Programming, N-Body Methods, Sparse Linear Algebra, Spectral Methods, Graph Algorithms, FSMs, Unstructured Grids, Graph Traversal, Structured Grids, Graphical Models
BHPL: Machines, Mapping Patterns

5 Machine Vocabulary

6 Machine Vocabulary
Machines described using a hierarchical structural decomposition:
Units (processing engines)
Memories
Networks (connect multiple entities)
Channels (point-to-point connections)
(Memories, Networks, and Channels are really just specialized Units)

7 Hierarchy within Unit (figure; legend: Input Port, Output Port, Input/Output Port)

8 Hierarchy within Memory

9 Hierarchy within Network

10 Hierarchy within Network (2)

11 Hierarchy within Channel

12 Transaction Scheduler
A single controller that executes a stream of transactions.
Each transaction is an atomic (i.e., indivisible) computation that reads and writes machine state and sends messages on inputs and outputs.
Where does this fit in taxonomy? How to show graphically?
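The transaction scheduler idea on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions: machine state is a Python dict, a transaction is a function that mutates a private copy of the state and returns a list of outgoing messages, and the names `run_transactions`, `increment`, and `move` are hypothetical helpers, not part of BHPL.

```python
def run_transactions(state, transactions):
    """Execute a stream of transactions one at a time on shared state.

    Each transaction runs on a scratch copy, so its reads and writes
    become visible all at once (atomically) when it completes.
    Returns the messages sent, in order.
    """
    sent = []
    for txn in transactions:
        scratch = dict(state)      # private view for this transaction
        messages = txn(scratch)    # may read and write scratch freely
        state.clear()
        state.update(scratch)      # commit the whole transaction at once
        sent.extend(messages)
    return sent


def increment(reg):
    """Hypothetical transaction: bump a register and report its new value."""
    def txn(s):
        s[reg] = s.get(reg, 0) + 1
        return [("out", reg, s[reg])]
    return txn


def move(src, dst):
    """Hypothetical transaction: copy one register to another, no message."""
    def txn(s):
        s[dst] = s.get(src, 0)
        return []
    return txn


state = {"a": 0, "b": 0}
messages = run_transactions(state, [increment("a"), increment("a"), move("a", "b")])
```

Because a single controller issues the transactions strictly in stream order, no transaction ever observes a partially updated state, which is the property the slide calls atomic.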

13 Controller Types
State Machine Controller: control lines generated by state machine
Microcoded Controller: single-cycle datapath, control lines in ROM/RAM
In-Order Pipeline Controller: pipelined control, dynamic interaction between stages
Out-of-Order Pipeline Controller: operations within a control stream might be reordered internally
Threaded Pipeline Controller: multiple control streams, one execution pipeline; can be either in-order (PPU) or out-of-order

14 Leaf-Level Hardware
Register, Combinational Logic, Wires, Memory, Multiplexer/ALU, Tristate driver, FIFO
Conventional schematic notation (Need additional notation for asynchronous logic?)

15 Hardware Patterns

16 Decoupled Units
Problem: Difficult to design a large unit with a single controller, especially when components have variable processing rates. Large controllers have long combinational paths.
Solution: Break large unit into smaller sub-units where each sub-unit has a separate controller and all channels between sub-units have some form of decoupling (i.e., no combinational path between units on each side of channel).
Applicability: Larger units where area and performance overhead of decoupling is small compared to benefits of simpler design and shorter controller critical paths.
Consequences: Decoupled channels generally have greater communication latency and area/power cost. Sub-unit controllers must cope with unknown arrival time of inputs and unknown time of availability of space on outputs. Sub-units must be synchronized explicitly.
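A cycle-level sketch may help make the decoupling idea concrete. It assumes the channel is a bounded FIFO and that each sub-unit's variable processing rate can be modeled by a function of the cycle number; `simulate`, `producer_ready`, and `consumer_ready` are hypothetical names introduced for illustration only.

```python
from collections import deque


def simulate(items, fifo_depth, producer_ready, consumer_ready, max_cycles=1000):
    """Model two sub-units joined by a bounded FIFO channel.

    producer_ready(cycle) and consumer_ready(cycle) stand in for each
    sub-unit's own controller and variable processing rate.
    Returns (items received in order, cycles simulated).
    """
    fifo = deque()
    pending = deque(items)
    received = []
    cycle = 0
    while len(received) < len(items) and cycle < max_cycles:
        # FIFO status is sampled at the start of the cycle, like a registered
        # full/empty flag, so neither side depends combinationally on what
        # the other side does in the same cycle.
        can_push = len(fifo) < fifo_depth
        can_pop = len(fifo) > 0
        if can_pop and consumer_ready(cycle):
            received.append(fifo.popleft())
        if pending and can_push and producer_ready(cycle):
            fifo.append(pending.popleft())
        cycle += 1
    return received, cycle


# A fast producer feeding a consumer that is only ready every third cycle.
received, cycles = simulate(
    items=list(range(8)),
    fifo_depth=2,
    producer_ready=lambda c: True,
    consumer_ready=lambda c: c % 3 == 0,
)
```

In this model each controller reacts only to local status (space available, data available), which reflects the consequence noted on the slide: sub-units must cope with unknown arrival times and unknown output availability, at the cost of at least one cycle of extra channel latency.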

17 Decoupled Units
Network: Channels to network are always decoupled in any case.
Shared Memory: Unless shared memory is truly multiported, channels to memory must be decoupled.

18 Pipelined Operator
Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is required.
Solution: Divide combinational function using pipeline registers such that logic in each stage has critical path below desired cycle time. Improve throughput by initiating new operation every clock cycle, overlapped with propagation of earlier operations down pipeline.
Applicability: Operators that require high throughput but where latency is not critical.
Consequences: Latency of function increases due to propagation through pipeline registers, which also adds energy/op. Any associated controller might have to track execution of operation across multiple cycles.
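The throughput and latency trade-off of a pipelined operator can be sketched with a small register-transfer model. This is an illustrative sketch, not a hardware description: `pipelined`, `g`, and `f` are hypothetical names, each stage is an ordinary Python function, and `None` is assumed to mark an empty pipeline slot (so neither inputs nor stage results may be `None`).

```python
def pipelined(inputs, stages):
    """Feed one input per cycle through stage functions separated by
    pipeline registers. Returns (cycle, result) pairs, where cycle is the
    clock cycle in which the result is written to the final register."""
    regs = [None] * len(stages)                   # regs[i]: output register of stage i
    results = []
    stream = list(inputs) + [None] * len(stages)  # trailing bubbles drain the pipe
    for cycle, x in enumerate(stream):
        # All registers update together on the clock edge, so every stage
        # reads the value its predecessor held before the edge.
        nxt = []
        for i, stage in enumerate(stages):
            src = x if i == 0 else regs[i - 1]
            nxt.append(None if src is None else stage(src))
        regs = nxt
        if regs[-1] is not None:
            results.append((cycle, regs[-1]))
    return results


def g(x):
    return x + 1


def f(x):
    return x * 2


results = pipelined([1, 2, 3], [g, f])
```

With two stages, the first result appears one cycle later than it would from a single combinational f(g(in)) block, but a new result then emerges every cycle, which is the trade the slide describes.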

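19 Pipelined Operator (figure: f(g(in)) as a single block between clocked registers, compared with g and f as separate stages with a clocked pipeline register between them)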

20 Multicycle Operator
Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is not required.
Solution: Hold input registers stable for multiple clock cycles of main system, and capture output after combinational function has settled.
Applicability: Operators where high throughput is not required, or if latency is critical (in which case, replicate to increase throughput).
Consequences: Associated controller has to track execution of operation across multiple cycles. CAD tools might detect false critical path in block.
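The controller bookkeeping that a multicycle operator requires can be sketched as a small behavioral model. The class name `MulticycleOperator` and its methods are assumptions made for illustration; the model simply counts system clock cycles while the input register is held stable and captures the output on the final cycle.

```python
class MulticycleOperator:
    """Behavioral model of an operator whose logic needs several system
    clock cycles to settle."""

    def __init__(self, func, cycles):
        self.func = func          # the slow combinational function
        self.cycles = cycles      # system clocks needed for it to settle
        self.input_reg = None
        self.output_reg = None
        self.remaining = 0        # cycles left before the output is valid

    @property
    def busy(self):
        return self.remaining > 0

    def start(self, value):
        """Load the input register; it must then stay stable until done."""
        if self.busy:
            raise RuntimeError("operator is busy; input must be held stable")
        self.input_reg = value
        self.remaining = self.cycles

    def tick(self):
        """Advance one system clock. Returns True on the cycle in which
        the settled output is captured into the output register."""
        if self.remaining == 0:
            return False
        self.remaining -= 1
        if self.remaining == 0:
            self.output_reg = self.func(self.input_reg)
            return True
        return False


op = MulticycleOperator(lambda x: (x + 1) * 2, cycles=2)
op.start(3)
done_flags = [op.tick(), op.tick()]
```

Unlike the pipelined version, no new operation can start while `busy` is set, so throughput is one result per `cycles` clocks unless the operator is replicated.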

21 Multicycle Operator (figure: f(g(in)) between registers clocked by Clock, compared with f(g(in)) between registers clocked by Clock/2)

22 Memory Patterns
True Multiport Memory
Banked Memory: Interleave lesser-ported banks to provide higher bandwidth
Cached Memory: Memory hierarchy to provide higher bandwidth, lower latency for predictable accesses
Bypassed Memory: Reduce latency of pipelined dependent memory accesses
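As one illustration of the banked memory pattern, the sketch below estimates how many cycles a group of simultaneous accesses takes. It assumes low-order address interleaving (bank = address mod number of banks) and single-ported banks, neither of which the slide specifies; `bank_of` and `cycles_for_accesses` are hypothetical names.

```python
from collections import Counter


def bank_of(addr, num_banks):
    """Low-order interleaving: consecutive addresses fall in consecutive banks."""
    return addr % num_banks


def cycles_for_accesses(addrs, num_banks):
    """Cycles needed to serve a group of simultaneous accesses when each
    bank can serve one access per cycle: the busiest bank sets the time."""
    if not addrs:
        return 0
    per_bank = Counter(bank_of(a, num_banks) for a in addrs)
    return max(per_bank.values())


unit_stride = cycles_for_accesses([0, 1, 2, 3], num_banks=4)
stride_two = cycles_for_accesses([0, 2, 4, 6], num_banks=4)
stride_four = cycles_for_accesses([0, 4, 8, 12], num_banks=4)
```

Under these assumptions, unit-stride accesses spread across all banks, while a stride equal to the number of banks sends every access to the same bank and behaves like a single-ported memory.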

23 Network Patterns
Connects multiple units using shared resources
Bus: Low-cost, ordered
Crossbar: High-performance
Multi-stage network: Trade cost/performance
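The cost/performance contrast between a bus and a crossbar can be sketched by counting cycles for a batch of transfers. The model below is a simplification with assumed rules: a bus carries one transfer per cycle, and a crossbar lets each source port send and each destination port receive at most one transfer per cycle. The function names are hypothetical, and the crossbar figure is the lower bound set by port contention, which an ideal scheduler is assumed to reach.

```python
from collections import Counter


def bus_cycles(transfers):
    """A shared bus serializes every transfer, one per cycle."""
    return len(transfers)


def crossbar_cycles(transfers):
    """Lower bound for a crossbar: the most heavily used source or
    destination port limits how quickly the batch can complete."""
    if not transfers:
        return 0
    by_src = Counter(src for src, _ in transfers)
    by_dst = Counter(dst for _, dst in transfers)
    return max(max(by_src.values()), max(by_dst.values()))


# Transfers are (source port, destination port) pairs.
permutation = [(0, 1), (1, 2), (2, 3), (3, 0)]
hotspot = [(0, 1), (2, 1), (3, 1)]
```

For the permutation traffic the crossbar finishes in one cycle where the bus needs four; for the hotspot traffic both need three, since contention for a single destination port removes the crossbar's advantage.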

24 Control+Datapath Problem: Solution: Applicability: Consequences:

25 Machine Types
If SCSD, SCMD, MCMD machines are patterns, what is the problem-solution? If they're solutions, what are the problems?

26 SCMD Distributed Memory (figure: one controller C, a network N, multiple datapaths D)
Examples: MPP, ICL DAP, CM-1, CM-2, MasPar, Sony Playstation-2 Graphics Engine, Vision processing chips

27 SCMD Shared Memory (figure: one controller C, multiple datapaths D)
Examples: STARAN, BSP, TI ASC, CDC Star-100, Multi-lane Vector Machines

28 MCMD Shared Memory (figure: multiple controllers C, multiple datapaths D)
Examples: Burroughs B5x00 series, Network Packet Routers

29 Homogeneous MCMD Distributed Memory (figure: Message Network, multiple controllers C, multiple datapaths D)
Examples: Caltech Cosmic Cube, Transputer, nCUBE, Clusters

30 Heterogeneous MCMD Distributed Memory (figure: multiple processors P, where P = C + D)
Examples: Signal Processing Pipelines,

31 Systolic (figure: multiple processors P, where P = C + D)
Examples: Warp, Raw, Motion Estimation Engines,

32 Channels
Control->Datapath: direct; pipelined? (maybe don't need pipelined controller?)
Datapath<->Memory: fixed latency (cannot have shared memory without true multiport); decoupled (in-order, out-of-order)
Control<->Network<->Control: fixed latency; FIFOs; addressable messaging

33 BHPL Version 0.3
Application Patterns (from OPL)
Structural Patterns: Iteration, Model-View-Controller, Map-Reduce, Pipelines, Event Based, Layered Systems, Agent&Repository, Process Control, Task Graphs
Computational Patterns: Circuits, N-Body Methods, Unstructured Grids, Dense Linear Algebra, Sparse Linear Algebra, Graph Traversal, Spectral Methods, Structured Grids, Dynamic Programming, Graph Algorithms, Graphical Models, FSMs
Machine Organizations: PNC Layer, Systolic, SCMD Shared Memory, SCMD Distributed Memory, MCMD Shared Memory, Homogeneous MCMD Distributed Memory, Heterogeneous MCMD Distributed Memory
Hardware Building Blocks
Processing: FSM, Microcoded Engine, Threaded Pipeline, In-Order Pipeline, Out-of-Order Pipeline
Memory: FIFO, Banked Memory, Cached Memory, Multiport Memory, Bypassed Memory, CAM
Networks: Arbiter, Bus, Multi-stage Networks, Crossbar
Channels: Communication Channel
