GRVI Phalanx Update: A Massively Parallel RISC-V FPGA Accelerator Framework. Jan Gray CARRV2017: 2017/10/14


1 GRVI Phalanx Update: A Massively Parallel RISC-V FPGA Accelerator Framework
  Jan Gray, jan@fpga.org
  CARRV2017: 2017/10/14

2 FPGA Datacenter Accelerators Are Almost Mainstream
  Catapult v2. Intel += Altera. OpenPOWER CAPI. AWS F1. Baidu. Alibaba. Huawei
  FPGAs as computers
    Massively parallel, customized, connected, versatile
    High throughput, low latency, low energy
  Great, except for two challenges
    Software: C++ workload ??? FPGA accelerator?
    Hardware: tape out a complex SoC daily?

3 Hardware Challenge
  [Diagram: FPGA accelerators (A1) surrounded by PCI Express (HOST, PERIPH), 10G-100G networks (PHY, NIC 1-4), HBM high bandwidth memory (HBM channels 1-2), and DRAM channels 1-2]

4 GRVI Phalanx: FPGA Accelerator Framework
  For software-first accelerators:
    Run parallel software on 100s of soft processors
    Add custom logic as needed
    = More 5 second recompiles, fewer 5 hour PARs
  GRVI: FPGA-efficient RISC-V RV32I soft CPU
  Phalanx: processor/accelerator fabric
    Many clusters of PEs, RAMs, accelerators, I/O
    Message passing in a PGAS across a Hoplite NoC: FPGA-optimal fast/wide 2D torus

5 Why RISC-V?
  Open ISA, welcomes innovation
  Comprehensive infrastructure and ecosystem
    Specs, tests, simulators, cores, compilers, libs, FOSS
  As with LLVM, research will accrue to RISC-V
  Its simple ISA allows an efficient FPGA soft CPU

6 GRVI: Austere RISC-V Processing Element
  Simpler, smaller processors → more processors → more task and memory parallelism
  GRVI core
    RV32I, minus CSRs, exceptions, plus mul*, lr/sc
    3 stage pipeline (fetch, decode, execute)
    2 cycle loads; 3 cycle taken branches/jumps
    Typically MHz
    0.7 MIPS/LUT

7 GRVI RV32I Microarchitecture

8 GRVI RV32I Datapath: ~250 LUTs

9 GRVI Cluster: 0-8 PEs, KB Shared Memory
  [Diagram labels: IMEM 4-8 KB (x4), 2:1 (x4), 4:4 XBAR, 64, CMEM = 128 KB cluster data, accelerator(s)]

10 GRVI Cluster Tile: ~3500 LUTs

11 Composing Clusters with Message Passing on a Hoplite NoC
  Hoplite: rethink FPGA NoC router architecture
    No segmentation/flits, VCs, buffering, credits
    Unidirectional rings
    Deflecting dimension order routing of whole msgs
  Simple; frugal; wide; fast: Gbps/link
  1% area×delay of FPGA-tuned VC flit routers
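The routing rules listed on this slide can be sketched as a small Python model. This is a simplified illustration, not the Hoplite RTL: the function name `route`, the `(dest_x, dest_y)` message tuple, and the priority order (a message arriving from the north always continues south, a turning message from the west deflects east on conflict, the local client injects last) are assumptions made for this sketch.

```python
from typing import Optional, Tuple

Msg = Tuple[int, int]  # (dest_x, dest_y); payload omitted in this sketch


def route(x: int, y: int,
          west: Optional[Msg], north: Optional[Msg], client: Optional[Msg]):
    """One cycle of a hypothetical deflecting dimension-order router at (x, y).

    Returns (east_out, south_out, exits_here, client_accepted).
    A message leaves the NoC via the south output when exits_here is True.
    """
    east = south = None

    # No buffering: a message already on the Y ring must keep moving south.
    if north is not None:
        south = north

    if west is not None:
        if west[0] == x and south is None:
            south = west   # X dimension finished: turn onto the Y ring
        else:
            east = west    # continue in X, or deflect around the X ring

    # The local client injects only if the output it needs is still free.
    accepted = False
    if client is not None:
        wants_south = client[0] == x
        if wants_south and south is None:
            south, accepted = client, True
        elif not wants_south and east is None:
            east, accepted = client, True

    exits_here = south is not None and south == (x, y)
    return east, south, exits_here, accepted
```

In this model a deflected message simply travels once more around its unidirectional X ring and retries the turn, which is why no buffers or credits are needed.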


12 Example Hoplite NoC
  256b links @ 400 MHz = 100 Gb/s links; <3% of FPGA
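The link bandwidth quoted on this slide follows directly from link width times clock rate. A minimal check, assuming one 256-bit transfer per link per clock (the helper name `link_gbps` is made up for this sketch):

```python
def link_gbps(width_bits: int, clock_mhz: float) -> float:
    """Raw link bandwidth in Gb/s, assuming one transfer per clock cycle."""
    return width_bits * clock_mhz * 1e6 / 1e9


print(link_gbps(256, 400))  # 102.4, which the slide rounds to 100 Gb/s
```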

13 GRVI Cluster with NoC Interfaces
  300 = header + 32b msg dest addr + 256b msg data
  [Diagram labels: IMEM 4-8 KB (x4), 2:1 (x4), 4:4 XBAR, 64, CMEM = 128 KB cluster data, accelerator(s), NoC ITF, Hoplite router]
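The 300-bit message breakdown implies a 12-bit header (300 - 32 - 256). The sketch below packs and unpacks such a message as a Python integer. The field order (header in the top bits, then destination address, then data) and the names `pack` and `unpack` are assumptions for illustration; the slide gives only the field widths.

```python
LINK_BITS = 300
ADDR_BITS = 32
DATA_BITS = 256
HEADER_BITS = LINK_BITS - ADDR_BITS - DATA_BITS  # 12 bits remain for the header


def pack(header: int, dest_addr: int, data: int) -> int:
    """Pack one message; the field order here is an assumed layout."""
    for value, bits in ((header, HEADER_BITS), (dest_addr, ADDR_BITS), (data, DATA_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"value does not fit in {bits} bits")
    return (header << (ADDR_BITS + DATA_BITS)) | (dest_addr << DATA_BITS) | data


def unpack(word: int):
    """Split a packed message back into (header, dest_addr, data)."""
    data = word & ((1 << DATA_BITS) - 1)
    dest_addr = (word >> DATA_BITS) & ((1 << ADDR_BITS) - 1)
    header = word >> (ADDR_BITS + DATA_BITS)
    return header, dest_addr, data
```

Because a whole message travels in one 300-bit transfer, there is no segmentation into flits to model.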

14 10×5 Clusters × 8 GRVI PEs = 400 GRVI Phalanx (KU040, 12/2015)

15 Parallel Programming Models?
  Small kernels, local or PGAS shared memory, message passing, memcpy/RDMA DRAM
  Current: multithreaded C++ w/ message passing
    Uses GCC for RISC-V RV32IMA. Thank you!
  Future: OpenCL, KPNs, P4
  Accelerated with custom FUs, AXI cores, RAMs

16 11/30/16: Amazon AWS EC2 F1!

17 F1's UltraScale+ XCVU9P FPGAs
  1.2 M 6-LUTs
  Kb BRAMs (8 MB)
  Kb URAMs (30 MB)
  6840 DSPs

18 1680 RISC-Vs, 26 MB CMEM (VU9P, 12/2016)
  30×7 clusters of { 8 GRVI, 128 KB CMEM, router }
  First kilocore RISC-V, and the most 32b RISC cores on a chip in any technology
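The core and memory totals on this slide are products of the cluster array dimensions. A short sketch of that arithmetic (the function names are made up for illustration; 1 MB is taken as 1024 KB):

```python
def phalanx_cores(cols: int, rows: int, pes_per_cluster: int = 8) -> int:
    """Total PEs in a cols x rows array of clusters."""
    return cols * rows * pes_per_cluster


def total_cmem_mb(cols: int, rows: int, cmem_kb_per_cluster: int = 128) -> float:
    """Total cluster memory in MB across the array."""
    return cols * rows * cmem_kb_per_cluster / 1024


print(phalanx_cores(30, 7))  # 1680 cores in 210 clusters
print(total_cmem_mb(30, 7))  # 26.25, quoted as 26 MB
print(phalanx_cores(10, 5))  # 400, the earlier KU040 build
```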

19 1, 32, 1680 RISC-Vs

20 1680 Core GRVI Phalanx Statistics

  Resource          Use      Util. %
  Logical nets      3.2 M    -
  Routable nets     1.8 M    -
  CLB LUTs          795 K    67.2%
  CLB registers     744 K    31.5%
  BRAM              840      38.9%
  URAM              840      87.5%
  DSP               840      12.

  Frequency         250 MHz
  Peak MIPS         420 GIPS
  CRAM Bandwidth    2.5 TB/s
  NoC Bisection BW  900 Gb/s
  Power (IN26)      W
  Power/core        mW/core
  MAX VCU118 Temp   44C

  Vivado / ES1
  Max RAM use       ~32 GB
  Flat build time   11 hours
  Tools bugs        0

  >1000 BRAMs, DSPs available for accelerators
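The peak throughput figure is consistent with every core retiring one instruction per cycle at the stated clock. A minimal check under that assumption (the name `peak_gips` and the `ipc` parameter are hypothetical):

```python
def peak_gips(cores: int, clock_mhz: float, ipc: float = 1.0) -> float:
    """Peak giga-instructions per second; ipc=1.0 is an assumed upper bound."""
    return cores * clock_mhz * ipc / 1000.0


print(peak_gips(1680, 250))  # 420.0
```

Sustained throughput would be lower, since loads take 2 cycles and taken branches take 3 on the GRVI pipeline.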

21 Amazon F1.2xlarge Instance
  [Diagram labels: ENA, NVMe, XEON (x2), DRAM 122 GB, VU9P, DRAM 64 GB]

22 Amazon F1.16xlarge Instance
  [Diagram labels: ENA, NVMe (x4), XEON (x2), DRAM 976 GB, PCIe switch fabric, VU9P (x8), DRAM (x8), 64 GB]

23 Recent Work
  Bridge Phalanx and AXI4 system interfaces
    Message passing with host CPUs (x86 or ARM)
    DRAM channel RDMA request/response messaging
  SDK hardware targets
    1000-core AWS F1 (<$2/hr)
    80-core PYNQ (Z7020) ($65 edu)

24 GRVI Phalanx on Zynq with AXI Bridges

25 PYNQ-Z1 Demo: Parallel Burst DRAM Readback Test: 80 Cores B

26 GRVI Phalanx on AWS F1 (WIP) (Not yet 1240 bridged; 1240 F1.16XL)

27 4Q17 Work in Progress
  Complete initial F1.2XL and F1.16XL ports
  GRVI Phalanx SDK
    Specs, libraries, examples, tests
    As PYNQ Jupyter Python notebooks, bitstreams
    AMI+AFI in AWS Marketplace
  Full instrumentation: event counters; tracing
  Evaluate porting effort & perf on workloads TBD

28 In Conclusion
  Enable programmers to access massive reconfigurable / memory parallelism
  Frugal design enables competitive performance
  Value proposition unproven, awaits workloads
  SDK coming soon, enabling parallel RISC-V research and teaching on 80-8,000 core systems
  Thank you
