Two routes to specialisation: Loki and lowRISC. Robert Mullins, University of Cambridge. WEEE, September 2015, Espoo, Finland


1 Two routes to specialisation: Loki and lowRISC. Robert Mullins, University of Cambridge. WEEE, September 2015, Espoo, Finland


3 Specialisation: More transistors but end of Dennard scaling (dark silicon, utilisation wall, etc.). Specialisation is an answer, but not without problems. Some possible directions: (1) Many heterogeneous SoCs: tackle complexity with open source? (lowRISC); explore how to make SoC designs more flexible (target broader markets). (2) Homogeneous sea of resources: FPGA -> CGRA -> manycore/MPPA (Loki); specialise software for each application.

4 Loki: Simple tiled many-core processor. 8 cores per tile + 64KB SRAM, 40nm. Each core is a complete 32-bit processor. 40nm (<2W for 128 cores). Message-passing support at ISA level: every instruction can send its result to a remote location on chip; register-mapped FIFOs; fast multicast support within a tile. No cache coherency support between tiles at present (can share data via L2). Configurable on-chip memory system: each tile contains SRAM that may be dedicated as scratchpad, L1 or L2 cache.

5 Loki [diagram labels: inter-tile routers, 8 cores, local interconnects, 64KB]. Chip-wide networks: (1) L1$ to L2$ requests; (2) L2$ to memory requests; (3) core-to-core data; (4) memory/L2$ responses; (5) credit network.

6 A Loki tile

7 A Loki tile: Sequential consistency is retained within a tile, as operations arrive at each bank in the order they entered the network (crossbar) [see Zhang PDCN 05].

10 Loki's memory system: Each bank can service a miss and offers hit-under-miss support. Synchronization/atomics: Load-and-OP (AND, OR, XOR, ADD), Exchange, LL/SC. Can access command set at memory banks (sendconfig instruction): send cache line to another bank; flush, invalidate or prefetch cache lines; bypass L1/L2; memset cache line. Same mechanism can be used to form packets on the core-to-core network.
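
For readers unfamiliar with the atomic operations listed on this slide, their semantics can be illustrated with C11 atomics. This is only an analogy under stated assumptions: the function names below are hypothetical, the slide does not say how a compiler maps these operations onto Loki's memory banks, and the LL/SC pattern is approximated here with a compare-and-swap retry loop.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Load-and-ADD: return the old value and add delta in one atomic step. */
uint32_t load_and_add(_Atomic uint32_t *word, uint32_t delta)
{
    return atomic_fetch_add(word, delta);
}

/* Load-and-OR: return the old value and set the bits in mask atomically. */
uint32_t load_and_or(_Atomic uint32_t *word, uint32_t mask)
{
    return atomic_fetch_or(word, mask);
}

/* Exchange: store a new value and return the one it replaced. */
uint32_t exchange(_Atomic uint32_t *word, uint32_t value)
{
    return atomic_exchange(word, value);
}

/* LL/SC-style read-modify-write, approximated with a CAS retry loop
 * (an assumption made for portability, not Loki's mechanism).
 * Raises *word to candidate if candidate is larger; returns the value seen. */
uint32_t atomic_max(_Atomic uint32_t *word, uint32_t candidate)
{
    uint32_t seen = atomic_load(word);
    while (seen < candidate &&
           !atomic_compare_exchange_weak(word, &seen, candidate)) {
        /* On failure, seen has been refreshed with the current value. */
    }
    return seen;
}
```

A shared counter incremented with `load_and_add` by several threads ends at the exact total, which is the property the Load-and-OP operations provide when they are executed at the memory bank.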

13 Area (approx.): Cores ~50%; SRAM 40-45%; Routers 4-6%; Other ~2-3%.

14 Loki pipeline: Small custom ISA, incl. support for predicated execution. 6 register-mapped network FIFOs (blocking reads). Decoupled loads. Every instruction can send its result on the network. Can send instructions too! Channel map table: read in decode stage; 16-entry table that maps channel names to network addresses.
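
The channel map table described on this slide can be pictured as a small lookup structure. The sketch below is a software model under assumptions: only the table size (16 entries) and its purpose come from the slide, while the fields of `net_addr_t` (tile, component, channel) and the function names are hypothetical and do not describe Loki's actual address encoding.

```c
#include <stddef.h>
#include <stdint.h>

#define CHANNEL_MAP_ENTRIES 16  /* table size stated on the slide */

/* Hypothetical network address layout, for illustration only. */
typedef struct {
    uint8_t valid;      /* entry has been configured */
    uint8_t tile;       /* destination tile */
    uint8_t component;  /* core or memory bank within that tile */
    uint8_t channel;    /* input FIFO at the destination */
} net_addr_t;

typedef struct {
    net_addr_t entry[CHANNEL_MAP_ENTRIES];
} channel_map_t;

/* Configure one entry. Returns 0 on success, -1 if name is out of range. */
int channel_map_set(channel_map_t *map, unsigned name,
                    uint8_t tile, uint8_t component, uint8_t channel)
{
    if (name >= CHANNEL_MAP_ENTRIES)
        return -1;
    map->entry[name] = (net_addr_t){ .valid = 1, .tile = tile,
                                     .component = component,
                                     .channel = channel };
    return 0;
}

/* Model of the decode-stage lookup: channel name in, network address out.
 * Returns NULL for an out-of-range or unconfigured name. */
const net_addr_t *channel_map_lookup(const channel_map_t *map, unsigned name)
{
    if (name >= CHANNEL_MAP_ENTRIES || !map->entry[name].valid)
        return NULL;
    return &map->entry[name];
}
```

In this model an instruction carries only a 4-bit channel name, and the decode stage turns it into a full destination, which keeps the instruction encoding compact.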

15 Example

C source:

    uint32_t updatecrc32(uint8_t ch, uint32_t crc) {
        return table[(crc ^ ch) & 0xff] ^ (crc >> 8);
    }

Loki assembly:

    setchmapi 1, r15
    [...]
    fetch r10
    xor r11, r14, r13
    lli r12, %lo(table)
    lui r12, %hi(table)
    andi r11, r11, 255
    slli r11, r11, 2
    addu r11, r12, r11
    ldw 0(r11) -> 1
    srli r12, r14, 8
    xor.eop r11, r2, r12
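
The C function on this slide depends on a 256-entry `table` that is not shown. A minimal, self-contained sketch follows, assuming the common reflected CRC-32 polynomial 0xEDB88320 (the slide does not say which polynomial is used); `init_crc32_table` and `crc32_buffer` are helper names introduced here for illustration.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t table[256];

/* Build the lookup table for the reflected CRC-32 polynomial
 * 0xEDB88320 (an assumption; the slide does not name the polynomial). */
void init_crc32_table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
}

/* Per-byte update step, as on the slide. */
uint32_t updatecrc32(uint8_t ch, uint32_t crc)
{
    return table[(crc ^ ch) & 0xff] ^ (crc >> 8);
}

/* Hypothetical wrapper: CRC of a whole buffer, using the usual
 * initial value and final inversion. */
uint32_t crc32_buffer(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = updatecrc32(data[i], crc);
    return crc ^ 0xFFFFFFFFu;
}
```

After calling `init_crc32_table()`, the standard check string "123456789" yields 0xCBF43926 under these assumptions. The table load in `updatecrc32` corresponds to the `ldw` in the assembly listing, whose result arrives back at the core through a network FIFO.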

16 L0 I$ / scratchpad: Fetch stage contains a small (64-instruction), fully associative I$. Can skip tag checks with an in-buffer jmp. Instructions are just executed in FIFO order until end of packet (don't have an actual PC). Execute stage contains a small (256-word) local scratchpad.

17 Execution patterns (within a tile): MIMD; DLP (SIMD); DLP with helper core (scalarization), where one core is dedicated to providing common data over the multicast bus, which reduces the work done by the remaining data-parallel cores; worker farm; task-level pipelines; dataflow (single persistent instruction packet per core), which can support a single instruction per core. [See UCAM-CL-TR-846 for full details]

18 Example: JPEG colour conversion [Bates13]

19 DOACROSS loops [Campanoni et al. ISCA 2014]: Substantial speedup available from exploiting DOACROSS parallelism. 16 in-order cores (Atom-like). Much improved performance with a low-latency communication mechanism ("ring cache", RC) for signals and values.

20 Example: ADPCM (encoder): We can exploit some DOACROSS parallelism in the case of ADPCM; achieves 2X using 3 cores. Can do slightly better by simply splitting the loop body across two cores; the body then fits in each core's L0 I$; ~2.5X on 2 cores. Plan to explore a simplified HELIX implementation for our Loki LLVM port. Fast signals and shared L1 should make Loki a good target.

21 ILP splitting: Another approach to grouping/fusing cores. LLVM pass to automatically split a program across N cores in a tile using available ILP; communicates values over the local (within-tile) core-to-core network. Early results: Stencil2D (MachSuite) 1.78X (3 cores); Gemm/Blocked (MachSuite) 1.75X (3 cores); Matrix Multiply (2 cores): initial attempt 0.72X, with use of restrict 1.41X, exploiting commutativity 1.86X. Currently exploring optimisations that consider the order in which basic blocks are visited, and some microarchitectural enhancements (inp. FIFO issues). [Alex Bradbury]
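
The matrix multiply result improves markedly "with use of restrict". The slide does not show the kernel, so the following is a minimal sketch of what such a change might look like, assuming a plain row-major n x n multiply; the function name and signature are hypothetical.

```c
#include <stddef.h>

/* Row-major n x n matrix multiply: c = a * b.
 * The restrict qualifiers promise the compiler that a, b and c do not
 * overlap, so loads from a and b need not be ordered after stores to c.
 * That freedom is what an ILP-splitting pass needs in order to place
 * independent operations on different cores. */
void matmul(size_t n,
            const float *restrict a,
            const float *restrict b,
            float *restrict c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}
```

Without `restrict`, the compiler must assume a store to `c` could change a later load from `a` or `b`, which serialises the loop body; this is a plausible reason for the initial 0.72X slowdown, though the slide does not state the cause.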

22 AES case study: AES-128-CTR mode; 2 days' work for a recent graduate. Want to avoid running the same code on each core: it would have poor L0 I$ performance, and cores would produce less regular memory accesses. Instead, the AES code is mapped as a task pipeline. Loki results: 5.1 cycles/byte on one tile; 2.5 cycles/byte on two tiles; 11.5 Gbps at 450MHz for 128 cores. Comparison: ARM + NEON bitsliced implementation, lower bound is 13 cycles/byte.
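
The aggregate throughput figure can be reproduced from the per-tile numbers. The sketch below assumes that the 128-core figure comes from replicating the two-tile pipeline eight times across 16 tiles (128 cores at 8 cores per tile); the slide does not state this derivation, and `aggregate_gbps` is a hypothetical helper.

```c
/* Aggregate throughput in Gbit/s for a chip running independent copies
 * of a pipeline that spans tiles_per_pipeline tiles.
 * Assumes perfect scaling across copies (no shared-resource contention). */
double aggregate_gbps(double clock_hz, double cycles_per_byte,
                      unsigned tiles_per_pipeline, unsigned total_tiles)
{
    double bytes_per_second = clock_hz / cycles_per_byte;   /* one pipeline */
    unsigned pipelines = total_tiles / tiles_per_pipeline;  /* copies on chip */
    return bytes_per_second * 8.0 * (double)pipelines / 1e9;
}
```

With a 450 MHz clock, `aggregate_gbps(450e6, 2.5, 2, 16)` gives about 11.52, in line with the quoted figure, while the single-tile mapping, `aggregate_gbps(450e6, 5.1, 1, 16)`, gives about 11.29.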

23 AES example: single tile (8-core) mapping. Cores 1-5 address banks 2-5 using 4 separate channels to save 1 or 2 address manipulation instructions in the loop body.

24 Current status: Loki LLVM compiler implementation. ISS + complete SystemC model. SystemVerilog implementation is complete (<30K LOC); generates 128-core ASIC version and 32-core FPGA implementation. Test infrastructure, including random program generator. Promising single-tile and multi-tile results. Will tape out very soon! 4mm x 4mm die, TSMC 40nm (128 cores, 1MB on-chip cache). Off-chip I/O to FPGA "Northbridge": 4 x 13-bit length-matched, full-duplex, source-synchronous DDR channels.

25 Development boards: Dev. boards will be available next year. Package (352-ball BGA) and board from Michael Taylor's group at UCSD. See Community Aim to distribute boards to research groups or provide remote access. Support research in compilers, mapping, applications etc.

26 Subject: Redo BBC Micro (2008)

27 Subject: Development of an open-source SoC (2014). Create an open-source SoC capable of running Linux well. Make it real to encourage contributions and grow community: volume silicon manufacture; ability to purchase in small quantities; low-cost development board; regular updates to SoC; events, training and documentation. lowRISC C.I.C. (not-for-profit company).

28 Why create an open source SoC? Research and teaching. Serve the open-source community. Demand from industry: remove constraints on use of processor IP; use lots of cores freely to provide flexible implementation; lower costs, create proven base for derivatives. Why now?

29 Approach to design: Aim for simplicity: no backwards compatibility issues, no baggage, clean-sheet design. Think about security from the start. Free from commercial influences and release cycles. Cores are free and customisable (one ISA). Aim to maximise functionality and flexibility (no trade-offs to create product range).


30 RISC-V: RISC-V ISA from UC Berkeley. Aim to create open ISA standard for industry. Explicitly designed to be extensible. Simple base integer ISA (~40 instructions). 32-bit, 64-bit, 128-bit (!) variants. Rocket SoC: cores, L1, L2 cache, interconnect; silicon proven (45nm and 28nm). Chisel (open-source HW construction language).

31 lowRISC SoC

32 Current status

33 General-purpose tagged memory: Prevent control-flow hijacking attacks. Accelerate debug tools, use-after-free detection. Per-word locks, full/empty bits for synchronization. Control-flow integrity. Assist garbage collection. Dynamic information flow tracking (DIFT). Capabilities. Transactional memory. Provenance tracking.

34 General-purpose tagged memory: An LLVM pass has been implemented to tag sensitive pointers, i.e. code pointers, virtual function table pointers, function pointers, ... Every load of a sensitive pointer is replaced with a load that expects a particular tag to be read; if this is not the case, an exception is raised. Prevents classic buffer overflow attacks and return-oriented programming. Some other related attacks may remain if the code has the right (wrong!) bugs. Overheads and future work.
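
The checked-load idea on this slide can be modelled in software. The sketch below is an illustration under assumptions, not lowRISC's hardware interface: the tag values, the `tagged_word_t` layout and the function names are hypothetical, the exception is modelled as a `false` return, and it is assumed that an ordinary store clears the tag so that an overflowing write cannot forge a tagged pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tag values. */
enum { TAG_NONE = 0, TAG_CODE_PTR = 1 };

/* One memory word plus its tag bits (software model). */
typedef struct {
    uint64_t value;
    uint8_t  tag;
} tagged_word_t;

/* Store emitted by the compiler pass for a sensitive pointer. */
void store_tagged(tagged_word_t *w, uint64_t value, uint8_t tag)
{
    w->value = value;
    w->tag = tag;
}

/* Ordinary store, e.g. a buffer overflow overwriting the slot.
 * Assumption: untagged stores clear the tag. */
void store_plain(tagged_word_t *w, uint64_t value)
{
    w->value = value;
    w->tag = TAG_NONE;
}

/* Checked load: succeeds only if the word carries the expected tag.
 * Returning false stands in for the exception raised in hardware;
 * *out is left untouched in that case. */
bool load_checked(const tagged_word_t *w, uint8_t expected_tag, uint64_t *out)
{
    if (w->tag != expected_tag)
        return false;
    *out = w->value;
    return true;
}
```

In this model a return address written with `store_tagged` loads successfully, but once `store_plain` has overwritten the slot the checked load fails, so the corrupted value is never used as a jump target.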

35 Minion cores: Will initially support DMA and programmable I/O. Use minions to generate I/O signals, pre-process I/O data etc. Would like to also use minions to support tagged memory: particular tags trigger a message to a minion from the application processor; the minion executes the security policy in parallel with the app. processor. Plan to investigate implementing more of the SoC using minion cores + appropriate shims, e.g. memory controller. Will use the Pulpino core from Luca Benini's group at ETHZ.

36 Open source HW: Smaller community, higher barrier to entry. Fabricating chips is expensive. Verification effort is significant. Patching typically can't be done in the same way. Of course, these are all good reasons to produce an open, known-good SoC design and to promote a community effort.

37 Roadmap: Create untethered version of SoC with tagged memory. Complete core SoC implementation (no GPU initially). First test chips (40 or 28nm): 2 to 4 cores, most probably dual-issue. Integrate 3rd-party IP, e.g. mem controller, USB, Ethernet. Support early adopters in creating derivative designs. Third Party Design Starts 2017 Volume fab. run for community dev. board. Strengthen lowRISC IP offerings.

38 Research in the open: Have lots of ideas; collaborate and share from day one. Open development helps to attract the best people, even if they contribute remotely (huge amount of goodwill and enthusiasm for these projects if people know what you are trying to do!). Make it easy for people to get involved, reproduce, extend and improve (this requires significant effort). Work with industry. Provide a vehicle to evaluate/implement other research ideas.

39 Find out more and get involved: ORCONF 2015, October 9-11th, 2015, Ideasquare, Geneva. ORCONF began as an annual event for OpenRISC developers; it is now run as a Free and Open Source Silicon (FOSSi) event. lowRISC workshop on Friday. Talks on RISC-V.

40 Final thoughts: Exploring two different approaches to achieving energy efficiency through specialisation: Loki, a flexible processor array; lowRISC, an open source SoC. Opportunities to collaborate with others on both projects. More information about lowRISC at See also phab.lowrisc.org. Sign up to announcement and discussion lists.

41 Acknowledgements: Both lowRISC and Loki are team efforts. Loki team currently includes Daniel Bates, Alex Bradbury and Alex Chadwick. (Recent work on DNNs by Chihang Wang and Sam Tarver. Earlier work on configurable L1 memory system by Andreas Koltes.) lowRISC team currently includes Wei Song, Alex Bradbury and numerous external contributors. Contributions on tagged memory and minion core I/O shims by Hongyan Xia and Martin Papadopoulos. Recent work on tagged memory architecture and LLVM support by Lucas Sonnabend and Matthew Toseland. Loki is funded by an ERC starter grant (GA n ). This work was previously supported by UK EPSRC grant EP/G033110/1. lowRISC is kindly supported by a private donation and a donation from Google. Thank you for listening!
