SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik

Chapter 5 On-Chip Communication

Outline
1. Introduction
2. Shared media
3. Switched media
4. Network on Chip
5. NoC General Implementation
6. NoC Problem Oriented Implementation

1. Introduction
Communication channels are needed between components on a chip
A communication channel must provide
  o Communication media
  o Communication protocol
Implementation is an optimization problem with trade-offs between
  o Speed
  o Resource consumption
  o Reliability
Two implementation possibilities exist
  o Shared media
    Ex: bus
  o Switched media
    Ex: crossbar switch

2. Shared media
Use of a common communication link
Only one master at a time
  o The master writes on the bus and all components listen
Advantages
  o Resource efficient
  o Broadcast is easy
Drawbacks
  o Slow, since only one master can transmit at a time
  o Need for arbitration
    Centralized arbitration, Ex: PCI, CoreConnect, AMBA
    Decentralized arbitration, Ex: CAN, Ethernet
  o Not fault tolerant
    No communication possible on failure
[Figure: modules Mod1 to Mod4 and an arbiter attached to a shared bus]
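The centralized arbitration mentioned above can be sketched in a few lines. The slide does not say which policy the arbiter uses, so the round-robin policy and the names `RoundRobinArbiter` and `grant` are assumptions chosen for illustration only.

```python
class RoundRobinArbiter:
    """Central arbiter that grants the shared bus to one master per cycle.

    Round-robin is an assumed policy; the slides do not name one.
    """

    def __init__(self, num_masters):
        self.num_masters = num_masters
        # Start so that master 0 is the first one to be checked.
        self.last_granted = num_masters - 1

    def grant(self, requests):
        """Return the index of the granted master, or None if nobody requests.

        requests: list of booleans, one entry per master.
        """
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_granted + offset) % self.num_masters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(4)
first = arbiter.grant([True, False, True, False])
second = arbiter.grant([True, False, True, False])
```

With two masters requesting continuously, the grant alternates between them, so no master is starved. A fixed-priority arbiter would be simpler but could starve low-priority masters.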

2. SoC-Buses
ARM AMBA
  o Consists of two bus systems
    Advanced High-performance Bus (AHB): high-performance system interconnect for connecting the processor to high-performance modules
    Advanced Peripheral Bus (APB): used to connect the slower peripherals
  o Cross communication via a bridge

2. SoC-Buses
IBM CoreConnect
  o Processor Local Bus (PLB): high-performance bus, used to connect high-bandwidth devices such as processor cores, memory, and DMA controllers
  o On-Chip Peripheral Bus (OPB): a secondary bus used to decouple the peripherals from the PLB in order to avoid a loss of system performance
  o Device Control Register (DCR) bus: allows lower-performance status and configuration registers to be read and written

2. SoC-Buses
SoC buses are usually realized using large OR gates
  o Easy to implement
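A minimal sketch of an OR-based bus, assuming that every module that does not own the bus drives zeros so that the wide OR passes only the granted module's word. The function name `or_bus` and the grant-based gating are illustrative assumptions, not taken from the slides.

```python
from functools import reduce


def or_bus(outputs, grants):
    """Combine module outputs into a single bus value with a wide OR.

    outputs: list of integers, the data word each module would drive.
    grants:  list of booleans, True only for the module that owns the bus.
    A module that is not granted contributes 0 and does not disturb the OR.
    """
    gated = [word if granted else 0 for word, granted in zip(outputs, grants)]
    return reduce(lambda a, b: a | b, gated, 0)


bus_value = or_bus([0xA5, 0x3C, 0xFF], [False, True, False])
```

Compared with tri-state drivers, this structure needs only AND and OR gates, which is why it maps easily onto on-chip logic.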

3. Switched media
Supports point-to-point communication
Many masters at a time
Advantages
  o Performance
  o Tailored communication
  o No need for arbitration
Drawbacks
  o Not fault tolerant
  o Resource hungry
[Figure: modules Mod1 to Mod4 connected by point-to-point links]

3. SoC-Switched media
Altera Avalon: designed to provide greater flexibility and performance than a shared system bus while consuming minimal logic resources
Binds together the components of a system based on the Avalon interface
Connects the Avalon master and slave ports of the components in a system
Some features
  o Components of differing data widths
  o Components operating in different clock domains
  o Components using multiple Avalon ports

3. SoC-Switched media
Silicore Wishbone: designed to foster design reuse by alleviating system-on-a-chip integration problems
This is accomplished by creating a common, logical interface between IP cores
Improved portability and reliability of the system
Faster time-to-market for the end user
The Wishbone specification makes use of
  o RULES,
  o RECOMMENDATIONS,
  o SUGGESTIONS,
  o PERMISSIONS and
  o OBSERVATIONS
Simple, open, highly configurable interface

3. SoC-Switched media
[Figure: interconnection variants between master and slave interfaces: point-to-point, data flow, shared bus, crossbar switch. Annotations: created by the designer; concrete implementation by the system integrator]

3. SoC-Switched media
1-D switching: Reconfigurable Multiple Bus Network (RMB)
  o Connections among components are dynamically realized at run-time
  o Drawbacks:
    Time-consuming computation of the route
    Resource hungry
  o Advantage:
    Flexibility
[Figure: 1-D RMB; labels: Switch, 1 2 3 4 5]

3. SoC-Switched media
Controller:
  o Manages the switch at a local level
  o Receives requests from the left, right, and local ports
  o Four kinds of commands: REQUEST, REPLY, CANCEL, DESTROY
Data network:
  o Transports data and commands
FIFOs:
  o Buffer commands coming from the different sides of the crosspoint

3. SoC-Switched media
2-D switching: Reconfigurable Multiple Bus Network (RMB)
  o Drawbacks:
    Time-consuming computation of the route
    Resource hungry
  o Advantage:
    Flexibility
[Figure: 2-D RMB; label: Switches]

4. Network on Chip
Hemani Kumar & Jantsch [00], Benini & DeMicheli [02], Dally [01]
A Network on Chip consists of
  o A set of processing elements (PE)
    Processor
    Memory
    Custom hardware block
  o A set of routers
    Route messages to their destination
Communication is done by sending packets

4. Network on Chip
Router
  o Must be fast and efficient
  o 5 inputs and outputs
  o Input FIFOs
  o Data lines
  o Address lines
  o Additional control signals to neighbors
Router structure
  o Router control via messages
  o Messages are sent in packets consisting of the address of the destination router, control bits, and the payload (data)
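The packet layout described above (destination router address, control bits, payload) can be illustrated with a small pack/unpack sketch. The field widths (4-bit X, 4-bit Y, 8 control bits, 16 payload bits) and the function names are assumptions; the slides do not specify them.

```python
# Assumed field widths in bits; not given in the slides.
X_BITS, Y_BITS, CTRL_BITS, PAYLOAD_BITS = 4, 4, 8, 16


def pack_packet(dest_x, dest_y, control, payload):
    """Pack the fields into one integer: [dest_x | dest_y | control | payload]."""
    fields = ((dest_x, X_BITS), (dest_y, Y_BITS),
              (control, CTRL_BITS), (payload, PAYLOAD_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its assumed width")
    word = dest_x
    word = (word << Y_BITS) | dest_y
    word = (word << CTRL_BITS) | control
    word = (word << PAYLOAD_BITS) | payload
    return word


def unpack_packet(word):
    """Split a packed word back into (dest_x, dest_y, control, payload)."""
    payload = word & ((1 << PAYLOAD_BITS) - 1)
    word >>= PAYLOAD_BITS
    control = word & ((1 << CTRL_BITS) - 1)
    word >>= CTRL_BITS
    dest_y = word & ((1 << Y_BITS) - 1)
    word >>= Y_BITS
    dest_x = word & ((1 << X_BITS) - 1)
    return dest_x, dest_y, control, payload


packet = pack_packet(2, 3, 0x01, 0xBEEF)
```

Placing the destination address in the most significant bits lets a router inspect it before the rest of the packet has arrived.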

4. Network on Chip
[Figure: FIFO design]

4. Network on Chip
[Figure: output arbiter design]

4. Network on Chip - Routing
Circuit switching steps:
  1. A routing probe traverses the network from source to destination
  2. Upon reaching the destination address, an acknowledgment is sent back to the source address
  3. Data are transferred at the full bandwidth of the hardware
  4. The lock on the links is released at the end of the transmission
Store-and-Forward (SAF)
  1. The packets are temporarily stored in the nodes
  2. The routing information is examined to determine which output channel to direct the packet to
Virtual Cut-Through (VCT)
  o Addresses the deficiency of SAF (buffering the message at each node)
  o Does not store the packet in a node if an output channel is free
  o Packets cut through the router of the node to an available output channel

4. Network on Chip - Routing
Virtual Cut-Through (VCT), continued
  o Reduces the hop count (number of packet stations)
  o Alleviates the need for an excessive amount of memory along the path of a message
Wormhole routing
  o Conceived to address the deficiency of VCT: if an output channel is not available, the packet must be stored in the current node's memory
  o Wormhole routing divides a message into flow-control digits (flits), which are smaller than packets
  o Each message contains:
    one header flit, which carries the routing and control information
    data flits, which carry the remaining data of the message
  o The header flit always goes first to allocate a path for the data flits, which lowers the memory requirement
  o If an output channel is available, the header flit is routed and the remaining data flits follow in a pipelined fashion
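The following sketch illustrates the header/data flit split and why pipelining the flits pays off. It uses an idealized model that is an assumption for illustration: one flit crosses one link per cycle, there is no contention and no router delay. The names `to_flits`, `store_and_forward_cycles` and `wormhole_cycles`, and the flit size of 4 bytes, are hypothetical.

```python
def to_flits(dest, payload, flit_size=4):
    """Split a message into one header flit followed by data flits.

    dest:    destination identifier carried by the header flit.
    payload: bytes object holding the message data.
    """
    header = ("HEAD", dest)
    data = [("DATA", payload[i:i + flit_size])
            for i in range(0, len(payload), flit_size)]
    return [header] + data


def store_and_forward_cycles(hops, flits):
    """Each node receives the whole packet before forwarding it."""
    return hops * flits


def wormhole_cycles(hops, flits):
    """The header needs `hops` cycles; the other flits follow one per cycle."""
    return hops + flits - 1


flits = to_flits((2, 3), b"abcdefghij")
saf = store_and_forward_cycles(hops=3, flits=len(flits))
worm = wormhole_cycles(hops=3, flits=len(flits))
```

Under this model the store-and-forward latency grows with the product of hop count and packet length, while the pipelined latency grows only with their sum.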

4. Network on Chip - Routing
Deterministic routing
  o XY-Routing
    if (Xrouter < Xdest) the packet is forwarded in the east direction
    if (Xrouter > Xdest) the packet is forwarded in the west direction
    if (Xrouter = Xdest and Yrouter > Ydest) the packet is sent to the south of the current router
    if (Xrouter = Xdest and Yrouter < Ydest) the packet is sent to the north of the current router
    if (Xrouter = Xdest and Yrouter = Ydest) the packet is sent to the local PE
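The XY rules above translate directly into a routing function. The helper `xy_path`, which follows a packet hop by hop, additionally assumes that moving east increases X and moving north increases Y; this convention is implied by the rules but not stated on the slide. Both function names are illustrative.

```python
def xy_route(x_router, y_router, x_dest, y_dest):
    """Return the output direction for a packet at router (x_router, y_router)."""
    if x_router < x_dest:
        return "east"
    if x_router > x_dest:
        return "west"
    # From here on the X coordinates match.
    if y_router > y_dest:
        return "south"
    if y_router < y_dest:
        return "north"
    return "local"


# Assumed coordinate convention: east = +X, north = +Y.
STEP = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}


def xy_path(src, dest):
    """List the directions taken from src to dest, ending with 'local'."""
    x, y = src
    directions = []
    while True:
        direction = xy_route(x, y, dest[0], dest[1])
        directions.append(direction)
        if direction == "local":
            return directions
        dx, dy = STEP[direction]
        x, y = x + dx, y + dy


path = xy_path((1, 1), (3, 2))
```

Because the X offset is always resolved before the Y offset, every packet between the same two routers takes the same path.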

4. Network on Chip - Routing
Adaptive routing
  o The direction in which to send an incoming packet is not fixed a priori
  o The routing algorithm may decide to use more complex schemes for routing
  o Usually used to
    improve the performance in the presence of localized traffic
    provide fault tolerance in the network
  o Packets are not always routed along the shortest path
Ex: Q-routing
  o Adaptive routing algorithm based on Q-learning, a form of reinforcement learning
  o Initially builds a routing table based on the delivery times (Q values) of the packets to every router in the network
  o Delivery times are updated every time a router forwards a packet for a particular destination
  o Over time, the routers learn the efficient routes to all destinations
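A minimal sketch of the Q-value update described above, assuming the commonly used formulation in which a router moves its delivery-time estimate toward the sum of its own queueing delay, the link delay, and the neighbor's best remaining estimate. The class `QRouter`, its method names, and the learning rate of 0.5 are assumptions for illustration.

```python
class QRouter:
    """Keeps estimated delivery times per (destination, neighbor) pair."""

    def __init__(self, name, neighbors, destinations,
                 learning_rate=0.5, initial_estimate=0.0):
        self.name = name
        self.learning_rate = learning_rate
        # q[dest][neighbor] = estimated delivery time via that neighbor.
        self.q = {d: {n: initial_estimate for n in neighbors}
                  for d in destinations}

    def best_neighbor(self, dest):
        """Neighbor with the lowest estimated delivery time to dest."""
        return min(self.q[dest], key=self.q[dest].get)

    def estimate(self, dest):
        """Best remaining delivery time from this router to dest."""
        if dest == self.name:
            return 0.0
        return min(self.q[dest].values())

    def update(self, dest, neighbor, queue_delay, link_delay, neighbor_estimate):
        """Move the estimate toward the newly observed delivery time."""
        old = self.q[dest][neighbor]
        target = queue_delay + link_delay + neighbor_estimate
        self.q[dest][neighbor] = old + self.learning_rate * (target - old)


router = QRouter("A", ["B", "C"], ["D"], initial_estimate=10.0)
router.update("D", "B", queue_delay=1.0, link_delay=1.0, neighbor_estimate=2.0)
```

After this single update the estimate via B drops from 10.0 toward the observed 4.0, so `best_neighbor("D")` now prefers B over C.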

4. Network on Chip - Routing
Performance metrics
  o Latency: time a message needs to travel from its source to its destination
    Measured as the difference between the time at which the last packet of the message arrives at the destination and the time at which the first packet of the message leaves the source
  o Throughput: maximum traffic a network can accept per unit of time
    Typically measured in bytes or packets per node per cycle
Deadlock and livelock
  o Deadlock is a situation that occurs when a packet is waiting for an event that can never happen due to a circular dependence on resources
  o Livelock, on the other hand, is a configuration of the network in which packets continue to move, but never reach their destination
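The circular dependence behind a deadlock can be checked with a cycle search on a wait-for graph. This is a sketch under the assumption that each channel (or packet) records which other channels it is waiting for; the function name `has_circular_wait` and the channel labels are hypothetical.

```python
def has_circular_wait(waits_for):
    """Return True if the wait-for graph contains a cycle.

    waits_for: dict mapping each channel to the channels it is waiting for.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True  # reached a node on the current path: cycle
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n)
               for n in list(waits_for))


ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
chain = {"c0": ["c1"], "c1": ["c2"], "c2": []}
deadlocked = has_circular_wait(ring)
```

In the `ring` example every channel waits for the next one and the last waits for the first, so no packet can ever advance. The `chain` example has no cycle and can drain.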

5. 3x3 NoC Implementation
[Figure: 3x3 grid of routers with coordinates (1,1) to (3,3) and attached modules TC, LV, and VGA, shown twice: area constraints in PACE, and PAR in FPGA Editor]

6. NoC Efficient Implementation - ClusteRing
[Figure: ClusteRing architecture: Transceivers 1 to 4, local buses LB 1 to LB 4 with LB master and LB slave interfaces, ring master and ring slave interfaces, a router, Ring 1 and Ring 2]

6. NoC ClusteRing
Transceiver & Router
[Figure: ProcA and ProcB, each with RAM, registers (Reg), and FSM 1 / FSM 2 pairs, connected to the ring]

6. NoC ClusteRing
Data transfer protocol
[Figure: per-client entries (Client 0, Client 1, Client 2, ..., Client n), each consisting of "# of bytes" and "Status Code", followed by the received data]

6. NoC ClusteRing
Case study SVD: close-to-hardware implementation
  o 8x8 matrix
    1 processor: 149 µs
    2 processors: 151 µs
    4 processors: 160 µs
  o 200x32 matrix
    1 processor: 59707 µs
    2 processors: 36534 µs
      31839 µs computation (88 %)
      4694 µs communication (12 %)
    4 processors: 18150 µs
      12960 µs computation (71 %)
      5190 µs communication (29 %)
[Figure: MB Proc nodes with Block RAM connected through the ring; one MB Proc with DDR RAM and peripherals]
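The measurements for the 200x32 matrix can be turned into speedup and efficiency figures. The timing values below are the ones listed above; the function name `speedup_report` and the dictionary keys are illustrative assumptions. The total is recomputed from the computation and communication parts, so it can differ from the listed total by a microsecond of rounding.

```python
SINGLE_PROCESSOR_US = 59707  # 1 processor, 200x32 matrix

# processors -> (computation in µs, communication in µs), from the case study
RUNS = {2: (31839, 4694), 4: (12960, 5190)}


def speedup_report(single_time, computation, communication, processors):
    """Derive speedup, parallel efficiency and communication share."""
    total = computation + communication
    speedup = single_time / total
    return {
        "total_us": total,
        "speedup": speedup,
        "efficiency": speedup / processors,
        "communication_share": communication / total,
    }


reports = {p: speedup_report(SINGLE_PROCESSOR_US, comp, comm, p)
           for p, (comp, comm) in RUNS.items()}

for processors, report in sorted(reports.items()):
    print(processors, round(report["speedup"], 2),
          round(report["efficiency"], 2))
```

For the 8x8 matrix the run time grows with the number of processors (149, 151, 160 µs), which suggests that the communication overhead outweighs the gain from parallel computation for such a small problem.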

6. NoC
The Singular Value Decomposition (SVD)
A = U Σ V^T
[Figure: processors P1, P2, ..., Pn]

Computation of the SVD

Parallel implementation
Because the post-multiplication of A^(k) by Q^(k) affects only the columns i and j, a parallel implementation is possible.
Pairwise column orthogonalization (Brent & Luk)
Mapping of virtual processors to physical ones
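A minimal sketch of pairwise column orthogonalization, assuming the one-sided Jacobi (Hestenes) scheme: a plane rotation is applied to columns i and j until all column pairs are orthogonal, and the column norms are then the singular values. The slides' own equations are not available here, so the formulas, the function names, and the plain cyclic pair ordering (not the Brent & Luk ordering) are assumptions for illustration.

```python
import math


def orthogonalize_pair(A, i, j, tol=1e-12):
    """Rotate columns i and j of A (list of rows) until they are orthogonal.

    Only columns i and j are modified. Returns True if a rotation was applied.
    """
    alpha = sum(row[i] * row[i] for row in A)
    beta = sum(row[j] * row[j] for row in A)
    gamma = sum(row[i] * row[j] for row in A)
    if abs(gamma) <= tol * math.sqrt(alpha * beta):
        return False  # already orthogonal within tolerance
    zeta = (beta - alpha) / (2.0 * gamma)
    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    for row in A:
        ai, aj = row[i], row[j]
        row[i] = c * ai - s * aj
        row[j] = s * ai + c * aj
    return True


def singular_values(A, max_sweeps=30):
    """Sweep over all column pairs until no rotation is needed."""
    A = [list(map(float, row)) for row in A]
    n = len(A[0])
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                if orthogonalize_pair(A, i, j):
                    rotated = True
        if not rotated:
            break
    norms = [math.sqrt(sum(row[k] ** 2 for row in A)) for k in range(n)]
    return sorted(norms, reverse=True)


values = singular_values([[4, 0], [3, -5]])
```

Since `orthogonalize_pair` touches only columns i and j, rotations on disjoint column pairs are independent of each other and could run on different processors at the same time.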

Parallel Implementation
Block Orthogonalization: