A Scalable Processor Architecture for the Next Generation of Low Power Supercomputer. PRACE Workshop, October 2010 Andreas Olofsson

Size: px

Start display at page:

Download "A Scalable Processor Architecture for the Next Generation of Low Power Supercomputer. PRACE Workshop, October 2010 Andreas Olofsson"

Daniel Green
6 years ago
Views:

1 A Scalable Processor Architecture for the Next Generation of Low Power Supercomputer PRACE Workshop, October 2010 Andreas Olofsson

2 Company Introduction Company founded in 2008 with mission to produce programmable processors with 10x the energy efficient of existing products while using 1/10th of the development budget Veteran processor design team with 10 successful low power products, 100M units shipped, and over $200M in product related revenue Adapteva has completed development of a ground breaking programmable soft accelerator with energy efficiency of up to 50 GFLOPS/Watt Now sampling first product to lead customers and strategic partners 2

3 Why efficient supercomputing is a critical need! 10,000, ,000,000 Top500 1,000,000 10,000, ,000 1,000,000 (E) 10,000 FGPA/ MULTI x in 3 years 6x in 3 years 1 (T) Data from James Anderson at Lincoln Labs and Top , (P) MICRO 10 GFLOP WATT , TFLOP

4 DARPA UHPC Program Big Big Winners: Winners: INTEL, INTEL, NVIDIA, NVIDIA, MIT, MIT, SANDIA SANDIA 4

5 Bad News for Semi. Companies 5

6 More Bad News.. $2.36 in 10ku $4 VS. 6 $10M development cost per chip ARM RISC Core up to 120MHz USB 2.0, Ethernet, CAN, GPIO, etc 512KB flash, 65KB SRAM 12 bit A/D, 10 bit D/A Coffee beans Water..

7 Area vs Max Performance (65nm) AMD OPTERON 36 GFLOPS 283 mm^2 Ease of Use Efficient CPU IBM CELL 230 GFLOP/s 220 mm^2 What if we could borrow only the best features? VIRTEX5 LX50 48 GMACS 146mm^2 NVIDIA GTX GFLOPS 576mm^2 Accelerator Scale

8 A History of Embedded Processing MAC-DSP ERA VLIW/SIMD ERA FPGA ERA MANYCORE ERA RIOD STRENGTHS ENERGY EFFICIENCY ENERGY EFFICIENCY & EASE OF USE ENERGY EFFICIENCY & FLEXIBILITY & SCALABILITY ENERGY EFFICIENCY & SCALABILITY & EASE OF USE ACHILLES HEEL(s) PROGRAMMING MODEL NOT SCALABLE AREA INEFFICIENT?? NORMALIZED RFORMANCE (GFLOP/W) The The future future of of embedded embedded processing processing is is programmable programmable massively massively parallel parallel multicore multicore chips! chips! 8

9 Epiphany TM Signal Processing Fabric Easy Easy to to use use programming programming model model Scalable to 1000's of cores Scalable to 1000's of cores IEEE Floating Point Compliant IEEE Floating Point Compliant 9 CPU MEMORY ROUTER DMA GFLOP/W GFLOP/W at at 65nm 65nm 0.7mm per processing tile 0.7mm per processing tile 100 GFLOP/W at 28nm 100 GFLOP/W at 28nm

10 Architecture Summary Memory Memory Model: Model: HW HW driven driven caches caches kill kill energy energy efficiency! efficiency! Cache Cache misses misses very very expensive! expensive! So..NO So..NO HW HW caches. caches. Distributed Distributed Flat Flat Shared Shared Memory Memory All All memory memory accessible accessible by by all all cores cores 32KB 32KB multi multi bank bank SRAM SRAM per per core core CPU: CPU: Network Network On Chip: On Chip: 10 Wires Wires are are cheap! cheap! bit bit atomic atomic packets packets One One packet packet per per clock clock cycle cycle 11 ns ns per per hop hop Stability Stability is is expensive! expensive! Optimized Optimized for for math math Good Good enough enough for for control control code code 11 GHz GHz operation operation 22 FLOPS/cycle FLOPS/cycle entry entry register register file file

11 A Very Simple Flat Memory Model Completely Completely unprotected unprotected Examples Examples of of core core to to core core communication: communication: STR STR R0, R0, [R1,#42] [R1,#42] LDR LDR R0, R0, [R1,#42] [R1,#42] 0x000FFFFF 0x000F0000 MEMORY MAPD REGISTERS RESERVED 0x INTERNAL MEMORY BANK 3 0x INTERNAL MEMORY BANK 2 0x INTERNAL MEMORY BANK 1 0x INTERNAL MEMORY BANK 0 0x RESERVED CORE_0_1_3_3 CORE_0_0_0_3 0x05c00000 CORE_0_0_0_2 CORE_0_1_3_2 0x05b00000 CORE_0_0_0_1 CORE_0_1_3_1 0x05a00000 CORE_0_0_0_0 CORE_0_1_3_0 0x RESERVED 0x CORE_0_0_0_3 CORE_0_1_2_3 0x CORE_0_0_0_2 CORE_0_1_2_2 0x CORE_0_0_0_1 CORE_0_1_2_1 0x CORE_0_0_0_0 CORE_0_1_2_0 0x RESERVED 0x CORE_0_0_0_3 CORE_0_1_1_3 0x04c00000 CORE_0_0_0_2 CORE_0_1_1_2 0x04b00000 CORE_0_0_0_1 CORE_0_1_1_1 0x04a00000 CORE_0_0_0_0 CORE_0_1_1_0 RESERVED 0x x CORE_0_1_0_3 0x CORE_0_1_0_2 0x CORE_0_1_0_1 CORE_0_1_0_0 RESERVED INTERNAL MEMORY 0x x x x x

12 Network On Chip Routing Algorithm Each Eachprocessor processornode nodehas has x,y x,ycoordinate coordinate Destination Destinationaddress address compared comparedto tomesh meshnode node address addressto todetermine determine direction direction Reads Readsare are write writerequests requests X first, X first,then thenyy IF (1,1) IF IF IF (2,1) IF IF IF (3,1) 12 IF IF

13 Architecture Comparison Intel80[2] Tile64[3] Intel48[5] Renesas[4] (This work) Process 65nm 90nm 45nm 45nm 65nm Frequency 3.13GHz 850MHz 2GHz 648MHz 500MHz Cores Area (mm^2) 275 n/a Transistors (Million) n/a 40 Performance (GLFOPS) 1000 No Native Floating Point n/a Power (W) Area/Core(mm) 3.43 n/a Watt/ (GHZ*Cores)

14 A Programmer's Architecture 14 Task Channel Programming Model ANSI C Programmable using efficient GNU C compiler Runs floating point C programs out of the box! No special program constructs needed! Native single cycle floating point instruction support! Shared memory architecture Orthogonal register and instruction set

15 GNU COMPILER (AVAILABLE) WYSIWYG SIMULATOR

15 Robust Programming Environment MULTICORE PROGRAMMING MODEL/LIBS (IN DEVELOPMENT) ECLIPSE IDE (AVAILABLE) 15 GNU COMPILER (AVAILABLE) WYSIWYG SIMULATOR (AVAILABLE) GNU DEBUGGER (AVAILABLE) GNU BINUTILS (AVAILABLE)

16 FFT Mapping Example 16 s2 Approach: P0 P point FFT is spread over 16 processors s1,s2,s3,s4 refer to the four FFT stages for combining data with 64 point complex data movements Lower # procs transfer W0 to higher # procs. Lower # proc calculates Wj0+Wj1 x Cj, higher # proc calculates Wj0 Wj1 x Cj Results: s4 P4 P4 P8 P8 s3 <3us execution time High efficiency Work in progress, still room for improvement s1 s3 s4 s2 s1 s2 s1 s4 s3 P12 P12 s2 s1 P1 P1 P2 P2 s3 s4 P5 P5 P6 P6 P9 P9 P10 P10 s4 s3 P13 P13 P14 P14 s1 s2 s3 s4 s1 s2 s1 s2 s4 s3 s1 s2 P3 P3 s3 P7 P7 P11 P11 s4 P15 P15

Compromised Computing What users want: An easy to use C

consumption, and all interconnect standards Soft

Microprocessor takes care of the really complex stuff

17 Compromised Computing What users want: An easy to use C programmable device with infinite energy, zero power consumption, and all interconnect standards Soft Floating Point Accelerator FPGA 17 The compromise: Microprocessor takes care of the really complex stuff FPGA driven connectivity A soft accelerator takes care of the 10% code

18 A Radar Signal Chain Example A/D Mult channel High Data Rate FRONT END (ASIC/ FPGA) 10,000+ Giga ops Filterering Fixed Point Low Complexity NEW! MIDDLE END BACK END (up) ,000 Gigaflop Adaptive Beamforming Pulse Compression Floating Point Medium Complexity Gigaflop Clutter Target Detection Large Code High Complexity Implementing complex floating point algorithms in FPGA too complex Microprocessor based middle end processing far too power hungry The answer is a new device optimized for this type middle end floating point signal processing! 18

19 Arcitecture Application Sweet spot What What our our customers customers want! want! Floating Floating point point Massive Massive performance performance Lower Lower power power (<3 (<3 W) W) per per chip chip C programmability C programmability and and ease ease of of use use Scalability Scalability GREAT FIT APPLICATIONS BASED ON CUSTOMER INTERVIEWS Portable Ultrasound Antenna Pre-distorition Airborne Radar Data Center/HPC Automotive Radar Soft real time compression Antenna Beam-forming 19

20 E16xx Product Intro (samples available! ) Features: 16 Independent Processor Cores 4 full duplex FPGA LVDS links 4Mb On chip Shared Memory 65nm IBM Process Platform Performance LVDS LVDS 350mW core power at 500MHz! 32 GFLOPS/sec max performance 8 GB/sec Off chip Bandwidth 512 GB/sec On chip Bandwidth Availability: 20 LVDS LVDS LVDS LVDS LVDS LVDS Silicon in lab confirms power advantage Currently sampling to lead customers

21 Short Term Product Roadmap 1024 CPUs 1.5GHz 4 TFLOPS <40 Watt 28nm Roadmap SCALABLE >100 GFLOP/WATT Performance 64 CPUs 1.5GHz 256 GFLOPS <2.5 Watt 28nm 16 CPUs 1GHZ 32GFLOPS <2 Watt 65nm Q Q

22 Development ECO System State of the art Tool Chain: (available now!) Optimized GNU C compiler Multi core debugger Standard Eclipse IDE Standard C and Math library Unique guaranteed 100% accurate WYSIWYG chip simulator Evaluation Boards: (available now!) Plugin boards for standard Altera FPGA development kits FPGA emulation board available for SW/HW co development Software Libraries: (available November) Multi core FFT implementation 22 Multi core programming libraries

23 Adapteva HPC Capabilities GFLOP FMC (VITA 57) accelerator card available 3/15/11 1 TFLOP 50 Watt accelerator card in the works for Q Adapteva, Bittware, and Petapath brings together key expertise in chip, board, and software design enabling fast execution of prototype HPC systems. Will continue to grow library support base on customer feedback Can spin double precision floating point chip with very limited funding

Adapteva Board Offering Based on Altera Stratix IV Family Attaches to ATCA carriers or other cards equipped with AMC bays and used in MicroTCA system Advanced FPGA framework simplifies system design.

24 Adapteva Board Offering Based on Altera Stratix IV Family Attaches to ATCA carriers or other cards equipped with AMC bays and used in MicroTCA system Advanced FPGA framework simplifies system design. <50 Watts with daughter card Base card available today Daughter card available 3/15 FPGA handles back plane connectivity 24 Board in Production Today FMC Connector enables up to 128 GFLOPS of soft performance with Adapteva chips.

25 A straw man 1 TFLOP, 50 Watt AMC card Connectors 65mm 180mm 25 1 TFLOP of performance 16 MB of distributed memory X GB of on board memory Shows off performance density achievable Probably not practical for any real applications DRAM FPGA DRAM Connectivity Gateway

26 Conclusions 26 The next generation supercomputers will not be practical without a radically different architecture The only way to improve energy efficiency significantly is through compromise Adapteva has developed a software programmable floating point accelerator demonstrating 25 GFLOPS/W in silicon and 100 GFLOPs/W on the road map for 2011 The programming model needs to change..but not drastically

27 Contact Information General General Contact Contact Information Information infoadapteva.com infoadapteva.com Address: Address: Adapteva Adapteva Inc. Inc Massachusetts Massachusetts Ave, Ave, Suite Suite Lexington Lexington MA MA Mobile: Mobile: (781) (781) (Andreas) (Andreas) Office: Office: (781) (781)

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company