The Design of the KiloCore Chip



Aaron Stillmaker*, Brent Bohnenstiehl, Bevan Baas
DAC 2017: Design Challenges of New Processor Architectures
University of California, Davis, VLSI Computation Laboratory (VCL)
June 20, 2017
*Currently with California State University, Fresno

Outline
  Introduction and Motivation
  Core Scaling Trend
  KiloCore Architectural Design
  Physical Chip Design
  Chip Flow
  Design Challenges
  Final Die
  Application Measurements

Figure credits
  Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten
  New plot and data collected for by K. Rupp
  New data added by B. Baas

Optimal Computational Tile Size
  The most efficient implementations (energy, throughput, circuit area) have processor sizes that capture computational kernels with few excess circuits
  Figure labels: interprocessor interconnect; unused or low benefit-per-cost circuits; energy efficiency, clock rate, area efficiency

Key Properties of KiloCore
  Very small processor tiles
  Vast numbers of processors per chip
  Processors can be used for non-traditional purposes in highly efficient implementations
  Processors dissipate very little energy per workload when active
  Processors dissipate very low power when idle
  Essentially no algorithm-specific hardware in the programmable processors


KiloCore Design
  Contains exactly 1,000 processors on one chip
  One of the first fabricated chips to contain 1,000 processors
  Fastest clock rate processor designed at a university
  Didn't receive all libraries until 34 days before taping out
  12 memories containing 64 KB each, for 768 KB of shared memory
  Memories are accessible by two processors directly above each
  Figure: KiloCore block diagram (labels: single processor, one 64 KB memory)

GALS Clocking
  KiloCore contains a fully independent Globally-Asynchronous Locally-Synchronous (GALS) clock domain in each of its 1000 processors, 1000 packet routers, and 12 independent memories
  Processor programmable clock oscillators are ~1% of tile area
  Router oscillators are simplified and very small
  Each of the 2012 clock oscillators is placed inside its own clock domain; there are no global clock signals (except three for configuration and testing)
  Each clock oscillator is fully unconstrained: oscillators may change frequency (below their f_max), halt, and restart arbitrarily to minimize power consumption
  Data transfer across clock domains is handled by dual-clock FIFOs
  Figure labels: Processor, Router
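The oscillator and shared-memory totals quoted above follow directly from the per-block counts. A minimal sketch of that arithmetic, assuming one oscillator per processor, per router, and per memory as the GALS slide states (the function names are illustrative, not from the slides):

```python
# Figures quoted on the slides.
NUM_PROCESSORS = 1000
NUM_ROUTERS = 1000
NUM_MEMORIES = 12
MEMORY_SIZE_KB = 64


def count_clock_domains(processors: int, routers: int, memories: int) -> int:
    """One independent oscillator (clock domain) per processor, router, and memory."""
    return processors + routers + memories


def total_shared_memory_kb(memories: int, size_kb: int) -> int:
    """Total shared memory across all independent memory blocks."""
    return memories * size_kb


clock_domains = count_clock_domains(NUM_PROCESSORS, NUM_ROUTERS, NUM_MEMORIES)
shared_kb = total_shared_memory_kb(NUM_MEMORIES, MEMORY_SIZE_KB)

print(f"clock domains: {clock_domains}")
print(f"shared memory: {shared_kb} KB")
```

Both results match the slide text: 2012 clock oscillators and 768 KB of shared memory.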

Inter-Processor Communication
  Source-synchronous 16-bit communication
  Two-layer source-synchronous circuit-switched network
    Up to 28 Gbps per link, 456 Gbps total tile I/O
  Single-layer dynamic packet routing network
    9.1 Gbps maximum per port
    45.5 Gbps maximum
  Figure labels: Processor Core, Circuit Switch x2, Packet Router
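The per-link figure can be cross-checked against the link width and the measured clock rate. The sketch below assumes one 16-bit word is transferred per clock cycle and uses the 1.78 GHz average oscillator frequency measured at 1.1 V (reported on the oscillator slide). The implied link and port counts are derived ratios, not numbers stated in the deck.

```python
LINK_WIDTH_BITS = 16        # from the slide: 16-bit source-synchronous links
MAX_CLOCK_GHZ = 1.78        # average measured oscillator frequency at 1.1 V
TOTAL_TILE_IO_GBPS = 456.0  # from the slide
PORT_MAX_GBPS = 9.1         # packet router, per port
ROUTER_MAX_GBPS = 45.5      # packet router, total


def link_bandwidth_gbps(width_bits: int, clock_ghz: float) -> float:
    """Peak bandwidth, assuming one word per clock cycle (an assumption)."""
    return width_bits * clock_ghz


link_gbps = link_bandwidth_gbps(LINK_WIDTH_BITS, MAX_CLOCK_GHZ)

# Derived ratios: how many links and ports the quoted totals correspond to.
implied_links = TOTAL_TILE_IO_GBPS / link_gbps
implied_ports = ROUTER_MAX_GBPS / PORT_MAX_GBPS

print(f"peak link bandwidth: {link_gbps:.2f} Gbps")
print(f"implied circuit-switched links per tile: {implied_links:.2f}")
print(f"implied packet router ports: {implied_ports:.2f}")
```

Under these assumptions the peak link rate is about 28.5 Gbps, consistent with the "up to 28 Gbps" on the slide, and the totals correspond to roughly 16 link-equivalents of circuit-switched I/O and 5 packet router ports.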


Bottom-Up Physical Design Flow
  Macro block(s) design flow, starting from HDL:
    Timing Constraints
    Synthesis
    Floorplanning / Power Gridding
    Placement
    Clock Tree Synthesis
    Routing
    Optimization
    DRC & LVS
    Timing Analysis
    Export Macro Block
  Bottom-up design flow
  Design aspects undecided when the physical design was started:
    Routers
    Inter-processor I/O
    Chip I/O
    Memories (both local and global)
    Oscillators
    Configuration
    Number of processors

Bottom-Up Physical Design Flow (flowchart)
  Step labels recovered from the figure:
    1. Design Flow (HDL oscillators)
    2. Design Flow (HDL proc. & accel.)
    3. Design Flow (HDL test chip)
    4. Meets timing? (no / yes)
    5. Design Flow (HDL chip)
    6. Chip Finishing
    7. To Fab (final GDS)
  Macro blocks passed between steps: oscillators; proc. & mem.; test chip
  Array sizes shown: 3x3, 32x32


Processor Tile Implementation
  Relatively simple processor design
    No hard macros
    Osc was pre-placed and routed
    I/O pin locations specified for optimal abutment
    Power rails specified
  Quick design iterations
    Easily changed design aspects such as memory size or tile size
    Verilog was changing up until 5 days before tapeout
  ~580k transistors per processor tile
  239 μm wide μm tall
  Figure: single processor with highlighted gates (percentage of tile area)
    Osc.         1%
    Dmem         29%
    Core         20%
    Imem         33%
    Fifo2        4%
    Fifo1        4%
    Router       9%
    Circ. Comm.  1%

Metal Plan (signals)
  Legend: Std. Cells, Inter-Std. Cell Signals, Local Power Grid, Global Signals, Power Grid, IO Pads
  Layer   Tile Metal Plan   Global Metal Plan
  m11                       0.13%
  m10                       2.50%
  m9      2.31%             18.63%
  m8      4.94%             6.52%
  m7      7.02%             21.49%
  m6      7.53%             7.27%
  m5      18.32%            15.40%
  m4      24.71%            13.56%
  m3      22.12%            6.07%
  m2      12.83%            8.42%
  m1      0.21%
  Signal percentages are length per layer, per total signal length
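The per-layer signal percentages are defined as length per layer over total signal length, so each column should sum to about 100%. The sketch below checks that, along with the tile-area breakdown. The assignment of values to the tile and global columns is a reconstruction from the extracted slide text, so treat it as an assumption.

```python
# Signal length per layer as a percentage of total signal length.
# Column assignment is reconstructed from the slide text (assumption).
TILE_SIGNAL_PCT = {
    "m9": 2.31, "m8": 4.94, "m7": 7.02, "m6": 7.53, "m5": 18.32,
    "m4": 24.71, "m3": 22.12, "m2": 12.83, "m1": 0.21,
}
GLOBAL_SIGNAL_PCT = {
    "m11": 0.13, "m10": 2.50, "m9": 18.63, "m8": 6.52, "m7": 21.49,
    "m6": 7.27, "m5": 15.40, "m4": 13.56, "m3": 6.07, "m2": 8.42,
}

# Tile area breakdown from the highlighted-gates figure (rounded values).
TILE_AREA_PCT = {
    "Osc.": 1, "Dmem": 29, "Core": 20, "Imem": 33,
    "Fifo2": 4, "Fifo1": 4, "Router": 9, "Circ. Comm.": 1,
}

tile_signal_total = round(sum(TILE_SIGNAL_PCT.values()), 2)
global_signal_total = round(sum(GLOBAL_SIGNAL_PCT.values()), 2)
tile_area_total = sum(TILE_AREA_PCT.values())

# Circuit-switched communication area including its two FIFOs.
circuit_switch_with_fifos = (
    TILE_AREA_PCT["Circ. Comm."] + TILE_AREA_PCT["Fifo1"] + TILE_AREA_PCT["Fifo2"]
)

print(f"tile signal total:   {tile_signal_total}%")
print(f"global signal total: {global_signal_total}%")
print(f"tile area total:     {tile_area_total}% (rounded inputs)")
print(f"circuit switch incl. FIFOs: {circuit_switch_with_fifos}%")
```

Both signal columns sum to 99.99%, which supports the reconstructed column assignment. The area figures sum to 101% because each is rounded to a whole percent, and the circuit switch plus its two FIFOs gives the 9% quoted on the communication-wires slide.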


Metal Plan (power grid)
  Legend: Std. Cells, Inter-Std. Cell Signals, Local Power Grid, Global Signals, Power Grid, IO Pads
  Layer        Power Grid
  m11, m10     11.95%, 9.47%
  m9           16.81%
  m8           6.49%
  m7           8.41%
  m6           6.49%
  m5           4.44%
  m4
  m3           4.44%
  m2           9.49%
  m1
  Power grid percentage is percentage of max allowable

Processor Level Power Rails
  Used 7 metal layers for local power grid
    39.6 μm horizontal pitch
    20.0 μm vertical pitch
  Connected local grid between all processors
    Including the standard cell level power rails
    Manually drew in the grid of metal at array level
  Figure: view focused on single core, showing array painted power


Global Power Rails
  Metal layers 10 and 11 are used for global power grid
  Only connected to metal layer 9 power rails (top level local power grid)
  Figure: global power rails at chip level

Processor Abutment
  With 1000 homogeneous processors, very tight abutment is important
  Communication wires were constrained with I/O pin placement
  0.4 μm horizontal spacing and 4.5 μm vertical spacing
  Required fabrication cells, such as alignment cells, cause interruption
    Removing a processor is non-ideal due to timing
    Specific rows required added vertical space to fit these cells
  Figure labels: Configuration; North Circuit Comm.; North Router Comm.; West, East, South; Config.; Processor Address


Processor Timing Paths
  196 independently timed datapaths for direct communication up to only two tile hops
  Possible timing paths
    Circuit switched (x2 layers):
      Each core has 2 origins: one output buffer (Obuf) and one long-distance buffer
      Each core has 3 destinations: one long-distance buffer, input buffer 0, and input buffer 1
      Signals can pass through cores
    Packet switched:
      Single-hop routers on each processor
  Figure: all possible inter-processor communication paths up to two tiles away (labels: Obuf, router)

Inter-Processor Communication Wires
  Circuit switch circuitry: 1% area (9% including FIFOs)
  Packet router circuitry: 9% area


C4 Routing and Chip I/O
  Premade flip-chip BGA package was used
  564 C4 bumps
    162 I/O (63 LVDS pairs)
    402 power (Vdd, Gnd, and VddIO)
  Non-ideal C4 placement and quantity
    All drivers and ESD clamps were on the periphery
    When all cores are at full speed and 100% activity, there is only sufficient current for the center processors
    Most applications average 50% activity

Oscillator and Processor Clock Tree
  Inverter ring oscillator
    Inverter chain cells were hand selected
    Placed and routed before the processor to maintain timing
  Clock tree with 5,522 leaves to drive the flip-flop memories
  Processor skew: 70 ps max.
  Average measured processor oscillator frequencies:
    1.1 V     1.78 GHz
    900 mV    1.24 GHz
    560 mV    115 MHz
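The bump budget and the measured frequency range can be checked with simple arithmetic. In the sketch below, the assumption that each LVDS pair occupies two I/O bumps is mine, as is the name `non_lvds_io`; the frequency ratios are derived from the measured values, not stated on the slide.

```python
TOTAL_BUMPS = 564
IO_BUMPS = 162
POWER_BUMPS = 402
LVDS_PAIRS = 63

# Assumption: each LVDS pair uses two of the I/O bumps.
lvds_bumps = 2 * LVDS_PAIRS
non_lvds_io = IO_BUMPS - lvds_bumps

# Average measured processor oscillator frequencies (supply in V -> GHz).
MEASURED_GHZ = {1.1: 1.78, 0.9: 1.24, 0.56: 0.115}

# Frequency at each supply relative to the 1.1 V measurement.
relative_freq = {
    volts: ghz / MEASURED_GHZ[1.1] for volts, ghz in MEASURED_GHZ.items()
}

print(f"I/O + power bumps: {IO_BUMPS + POWER_BUMPS} of {TOTAL_BUMPS}")
print(f"LVDS bumps: {lvds_bumps}, remaining I/O bumps: {non_lvds_io}")
for volts, ratio in sorted(relative_freq.items(), reverse=True):
    print(f"{volts:.2f} V: {ratio:.3f} of the 1.1 V frequency")
```

The I/O and power counts add up to the 564 total, and under the two-bumps-per-pair assumption 36 I/O bumps remain outside the LVDS pairs. The measured frequency at 560 mV is about 6.5% of the 1.1 V value.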

Array Placement with Memories
  Embedding memories in the array placement
  Hard SRAM memory blocks forced the shared memory to take a specific shape
  Memories placed on bottom to eliminate:
    Configuration signal pass-through
    Non-regular timing for signals over memory
    Wasted space
  Figure labels: Shared Memory, SRAM Cell

KiloCore Chip
  Technology     32 nm IBM PDSOI CMOS
  Num. Procs.    1000
  Num. Mems.     12
  Die Area       64 mm²
  Array Area     60 mm²
  Transistors    621 Million
  C4 Bumps       564 (162 I/O)
  Package        676 Pad Flip-Chip BGA
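A few derived quantities help put the chip summary in context. This is a rough sketch: the ~580k transistors per tile is a rounded figure from the tile implementation slide, so the remainder attributed to everything outside the processor tiles is only approximate, and none of the derived values appear in the deck.

```python
DIE_AREA_MM2 = 64.0
ARRAY_AREA_MM2 = 60.0
TOTAL_TRANSISTORS = 621_000_000
NUM_PROCESSORS = 1000
TRANSISTORS_PER_TILE = 580_000  # approximate ("~580k") per processor tile

# Fraction of the die occupied by the processor/memory array.
array_fraction = ARRAY_AREA_MM2 / DIE_AREA_MM2

# Average transistor density over the whole die.
density_per_mm2 = TOTAL_TRANSISTORS / DIE_AREA_MM2

# Transistors in processor tiles versus everything else (approximate).
tile_transistors = NUM_PROCESSORS * TRANSISTORS_PER_TILE
other_transistors = TOTAL_TRANSISTORS - tile_transistors

print(f"array share of die area: {array_fraction:.2%}")
print(f"average density: {density_per_mm2 / 1e6:.2f} M transistors/mm^2")
print(f"processor tiles: ~{tile_transistors / 1e6:.0f} M transistors")
print(f"remainder (memories, I/O, etc.): ~{other_transistors / 1e6:.0f} M transistors")
```

With these inputs the array covers about 94% of the die, and the processor tiles account for roughly 580 million of the 621 million transistors.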

Applications
  Several applications have been implemented for KiloCore:
    Fast Fourier Transform: 4096 length, 16-bit fixed-point complex data
    Advanced Encryption Standard: 128-bit keys
    Low Density Parity Check: 4095 code length
    Sorting: 100 Byte records with 10 Byte keys, 1850 records per sorted block

Application Comparison
  Figure: normalized throughput per area and normalized throughput per watt for CPU, GPU, and KiloCore on AES, FFT, LDPC, and Sort
  KiloCore operating at 1.1 V
  Data: B. Bohnenstiehl et al., JSSC
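The application parameters imply fixed data-set sizes, which can be worked out as follows. This sketch assumes each complex FFT sample stores a 16-bit real part and a 16-bit imaginary part; that reading of "16-bit fixed-point complex data" is an assumption, and the names are illustrative.

```python
FFT_LENGTH = 4096
FFT_COMPONENT_BITS = 16      # per real or imaginary part (assumed layout)
SORT_RECORD_BYTES = 100
SORT_KEY_BYTES = 10
SORT_RECORDS_PER_BLOCK = 1850


def fft_dataset_bytes(length: int, component_bits: int) -> int:
    """Bytes for `length` complex samples with separate real/imag parts."""
    return length * 2 * component_bits // 8


fft_bytes = fft_dataset_bytes(FFT_LENGTH, FFT_COMPONENT_BITS)
sort_block_bytes = SORT_RECORDS_PER_BLOCK * SORT_RECORD_BYTES
sort_key_bytes = SORT_RECORDS_PER_BLOCK * SORT_KEY_BYTES

print(f"FFT data set: {fft_bytes} bytes ({fft_bytes // 1024} KB)")
print(f"sorted block: {sort_block_bytes} bytes, of which keys: {sort_key_bytes} bytes")
```

Under the stated layout, one 4096-point FFT data set is 16 KB, and one sorted block holds 185,000 bytes of records.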

Acknowledgments
  Funding and Support
    DoD and ARL/ARO Grant W911NF
    TAPO
    NSF CAREER award
    CCF Grant No
    CCF Grant No
    CCF Grant No
    CCF Grant No
    SRC GRC Grant 1598, CSR Grant 1659, GRC Grant 1971, GRC Grant 2321
    ST Microelectronics
    C2S2
    Intel Corporation
    UCD Faculty Research Grant
    MOSIS
    Artisan
