CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.4. Krste Asanovic UC Berkeley Fall 2009

Size: px

Start display at page:

Download "CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.4. Krste Asanovic UC Berkeley Fall 2009"

Claude McCoy
5 years ago
Views:

1 CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.4 Krste Asanovic UC Berkeley Fall 2009

2 Overall Problem Statement P3 bit string Audio Application(s) (Berkeley) Hardware Pattern Language P3 bit string Audio Hardware (RTL)

3 BHPL Goals BHPL captures problem-solution pairs for creating hardware designs (machines) to execute applications BHPL Non-Goals Doesn t describe applications themselves, only machines that execute applications and strategies for mapping applications onto machines

4 BHPL BHPL Overview Applications (including OPL patterns) Structural Patterns Iteration odel-view- Controller ap- Reduce Pipelines Event Based Layered Systems Agent&Repository Process Control Task Graphs Computational Patterns Circuits N-Body ethods Unstructured Grids apping Patterns Dense Linear Algebra Sparse Linear Algebra Graph Traversal Spectral ethods Structured Grids Dynamic Programming Graph Algorithms Graphical odels FSs achines

5 achine Vocabulary

6 achine Vocabulary achines described using a hierarchical structural decomposition Units (processing engines) emories Networks (connect multiple entities) Channels (point-to-point connections) (emories, Networks, and Channels are just specialized Units)

7 Hierarchy within Unit Input Port Output Port Input/Output Port

8 Hierarchy within emory

9 Hierarchy within Network (2)

10 Hierarchy within Network

11 Hierarchy within Channel

12 Units are FSs All units are digital hardware, i.e., describable as a finite-state machine (FS) Different ways of factoring out the FS description of a unit Structural decomposition into hierarchical sub-units Decompose functionality into control + datapath Can further decompose control into inter-transaction scheduling plus intra-transaction sequencing All factorings are equivalent, so pick factoring that best explains what unit does

13 Structural Decomposition

14 Control + Datapath

15 Controller Types State achine Controller control lines generated by state machine icrocoded Controller single-cycle datapath, control lines in RO/RA In-Order Pipeline Controller pipelined control, dynamic interaction between stages Out-of-Order Pipeline Controller operations within a control stream might be reordered internally Threaded Pipeline Controller multiple control streams one execution pipeline can be either in-order (PPU) or out-of-order

16 Leaf-Level Hardware Register Combinational Logic Wires emory ultiplexer/alu Tristate driver FIFO Conventional schematic notation (Need additional notation for asynchronous logic?)

17 Hardware Patterns

18 Decoupled Units Problem: Difficult to design a large unit with a single controller, especially when components have variable processing rates. Large controllers have long combinational paths. Solution: Break large unit into smaller sub-units where each sub-unit has a separate controller and all channels between subunits have some form of decoupling (i.e., no combinational path between units on each side of channel). Applicability: Larger units where area and performance overhead of decoupling is small compared to benefits of simpler design and shorter controller critical paths. Consequences: Decoupled channels generally have greater communication latency and area/power cost. Sub-unit controllers must cope with unknown arrival time of inputs and unknown time of availability of space on outputs. Sub-units must be synchronized explicitly.

19 Decoupled Units Network Channels to network are always decoupled in any case Shared emory Unless shared memory is truly multiported, channels to memory must be decoupled

20 Pipelined Operator Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is required. Solution: Divide combinational function using pipeline registers such that logic in each stage has critical path below desired cycle time. Improve throughput by initiating new operation every clock cycle overlapped with propagation of earlier operations down pipeline. Applicability: Operators that require high throughput but where latency is not critical. Consequences: Latency of function increases due to propagation through pipeline registers, adds energy/op. Any associated controller might have to track execution of operation across multiple cycles.

21 Pipelined Operator f(g(in)) Clock Clock g(in) f(in) Clock Clock Clock

22 ulticycle Operator Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is not required. Solution: Hold input registers stable for multiple clock cycles of main system, and capture output after combinational function has settled. Applicability: Operators where high throughput is not required, or if latency is critical (in which case, replicate to increase throughput). Consequences: Associated controller has to track execution of operation across multiple cycles. CAD tools might detect false critical path in block.

23 ulticycle Operator f(g(in)) Clock Clock Clock/2 f(g(in)) Clock/2

24 emory Patterns True ultiport emory Banked emory Interleave lesser-ported banks to provide higher bandwidth Cached emory emory hierarchy to provide higher-bandwidth, lower latency for predictable accesses Bypassed emory Reduce latency of pipelined dependent memory accesses

25 Network Patterns Connects multiple units using shared resources Bus Low-cost, ordered Crossbar High-performance ulti-stage network Trade cost/performance

26 Control+Datapath Problem: Solution: Applicability: Consequences:

27 achine Types If SCSD, SCD, CD machines are patterns, what is the problem-solution? If they re solutions, what re the problems?

28 SCD Distributed emory C N D D D D D Examples: PP, ICL DAP, C-1, C-2, aspar, Sony Playstation-2 Graphics Engine, Vision processing chips

29 SCD Shared emory C D D D D D Examples: STARAN, BSP, TI ASC, CDC Star-100, ulti-lane Vector achines

30 CD Shared emory C C C C C D D D D D Examples: Burroughs B5x00 series, Network Packet Routers

31 Homogeneous CD Distributed emory essage Network C C C C C D D D D D Examples: Caltech Cosmic Cube, Transputer, ncube, Clusters

32 Heterogeneous CD Distributed emory P = C + D P P P P P Examples: Signal Processing Pipelines,

33 Systolic P = C + D P P P P P P P Examples: Warp, Raw, otion Estimation Engines,

34 Channels Control->Datapath direct pipelined? (maybe don t need pipelined controller?) Datapath<->emory fixed latency cannot have shared memory without true multiport decoupled in-order out-of-order Control<->Network<->Control fixed latency FIFOs addressable messaging

35 Application Patterns (from OPL) BHPL Version 0.3 Structural Patterns Iteration odel-view- Controller ap- Reduce Pipelines Event Based Layered Systems Agent&Repository Process Control Task Graphs Computational Patterns Circuits N-Body ethods Unstructured Grids Dense Linear Algebra Sparse Linear Algebra Graph Traversal Spectral ethods Structured Grids Dynamic Programming Graph Algorithms Graphical odels FSs achine Organizations PNC Layer Systolic Hardware Building Blocks SCD Shared emory SCD Distributed emory Processing FS icrocoded Engine Threaded Pipeline In-Order Pipeline Out-of- Order Pipeline FIFO emory Banked emory Cached emory ultiport emory CD Shared emory Homogeneous CD Distributed emory Heterogeneous CD Distributed emory Bypassed emory Networks CA Arbiter Bus ulti-stage Networks Crossbar Channels Communication Channel

CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.3. Krste Asanovic UC Berkeley Fall 2009

CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.3 Krste Asanovic UC Berkeley Fall 2009 Overall Problem Statement P3 bit string Audio Application(s) (Berkeley) Hardware Pattern