FPGA-based Evaluation Platform for Disaggregated Computing

1 FPGA-based Evaluation Platform for Disaggregated Computing. Dimitris Theodoropoulos, Nikolaos Alachiotis, and Dionisis Pnevmatikatos. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No.

2 Outline: Introduction (motivation, resource disaggregation, dReDBox); Evaluation platform (architecture overview, software support); Hardware prototype (multiple interconnected FPGAs); Experimental evaluation (matrix multiply test case); Conclusions

3 Current server infrastructure: Server trays typically comprise compute, memory, and acceleration (ACC) resources connected by an intra-tray network.

4 Current server infrastructure: Server trays typically comprise compute, memory, and acceleration (ACC) resources connected by an intra-tray network. Efficient SW communication. Low power consumption. Resource proportionality fixed at design time. Reduced granularity when allocating resources to virtual machines.

5 Resource allocation example: VM 1: 3 memory units, 1 CPU; VM 2: 2 memory units, 1 CPU; VM 3: 3 memory units, 1 CPU. Servers: Server 1, Server 2, Server 3.

6 Resource allocation example: VM 1: 3 memory units, 1 CPU; VM 2: 2 memory units, 1 CPU; VM 3: 3 memory units, 1 CPU. Servers: Server 1, Server 2, Server 3. Memory utilization: 66%; CPU utilization: 50%.

7 Resource allocation example: VM 1: 3 memory units, 1 CPU; VM 2: 2 memory units, 1 CPU; VM 3: 3 memory units, 1 CPU; VM 4: 4 memory units, 3 CPUs (IT DOES NOT FIT). Servers: Server 1, Server 2, Server 3. Memory utilization: 66%; CPU utilization: 50%.

8 Resource disaggregation: Server-centric approach. Each tray keeps its resources, including accelerators (ACC), behind its own intra-tray network.

9 Resource disaggregation: Server-centric approach versus resource-centric approach. Server-centric: each tray keeps its resources, including accelerators (ACC), behind its own intra-tray network. Resource-centric: resources of the same type (for example, accelerators) are grouped together.

11 Resource allocation example: VM 1: 3 memory units, 1 CPU; VM 2: 2 memory units, 1 CPU; VM 3: 3 memory units, 1 CPU; VM 4: 4 memory units, 3 CPUs (IT FITS). Server-centric allocation without VM 4: memory utilization 66%, CPU utilization 50%. Resource-centric allocation with VM 4: memory utilization 100%, CPU utilization 100%.
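The arithmetic behind the allocation example can be checked with a short sketch. The per-server capacity (4 memory units and 2 CPUs) is not stated on the slides; it is an assumption inferred from the 66% and 50% utilization figures, as is the placement of one VM per server. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    mem: int  # memory units
    cpu: int  # CPUs


# Assumed capacity per server, inferred from the utilization figures.
SERVER = Demand(mem=4, cpu=2)
NUM_SERVERS = 3

VMS = [Demand(3, 1), Demand(2, 1), Demand(3, 1)]  # VM 1..3, one per server
VM4 = Demand(4, 3)


def utilization(vms):
    """Return (memory, CPU) utilization over all servers."""
    mem = sum(v.mem for v in vms) / (SERVER.mem * NUM_SERVERS)
    cpu = sum(v.cpu for v in vms) / (SERVER.cpu * NUM_SERVERS)
    return mem, cpu


def fits_server_centric(vm, placed):
    """A VM must fit inside the free resources of one single server."""
    for i in range(NUM_SERVERS):
        used = placed[i] if i < len(placed) else Demand(0, 0)
        if SERVER.mem - used.mem >= vm.mem and SERVER.cpu - used.cpu >= vm.cpu:
            return True
    return False


def fits_disaggregated(vm, placed):
    """A VM may draw from the pooled free resources of all servers."""
    free_mem = SERVER.mem * NUM_SERVERS - sum(v.mem for v in placed)
    free_cpu = SERVER.cpu * NUM_SERVERS - sum(v.cpu for v in placed)
    return free_mem >= vm.mem and free_cpu >= vm.cpu


print(utilization(VMS))               # memory about 0.67, CPU 0.5
print(fits_server_centric(VM4, VMS))  # False
print(fits_disaggregated(VM4, VMS))   # True
print(utilization(VMS + [VM4]))       # (1.0, 1.0)
```

Under these assumptions VM 4 cannot be placed on any single server, because no server has 3 CPUs, yet the pooled free resources (4 memory units, 3 CPUs) match its demand exactly.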

12 The dReDBox Architecture: Paradigm shift from mainboard-as-a-unit to disaggregated function-block-as-a-unit.

13 Evaluation Platform

14 Contribution: We provide a tool for writing and optimizing code for disaggregated environments by enabling runs on real hardware. This facilitates the exploration of tradeoffs between the overhead of remote data accesses and the effect of caches, either local or remote, for real-world execution scenarios.

15 Platform architecture: Three types of resources, referred to as blocks (compute block, mblock, ablock), connected by a low-latency interconnect. Master-worker scheme, with a compute block serving as the master and memory blocks and/or accelerators being the workers. Diagram: each block contains a PU, a DMA engine, and a PHY; compute and memory blocks attach off-chip memory; the ablock attaches accelerators (Accel.) through an interconnect (I/C).

16 Platform architecture: compute block. Provides general-purpose processing capacity. PU + memory: instructions/data. DMA + PHY pairs: remote access to multiple workers.

17 Platform architecture: mblock. Provides memory resources. PU: manages access to local memories and enables computations near memory. DMA + PHY pair: facilitates remote data transfers.

18 Platform architecture: ablock. Provides the infrastructure to deploy custom hardware accelerators. PU: controls the accelerators and facilitates communication with the master. Interconnect: quick accelerator-based datapath construction.

19 Software stack for inter-block communication. Stack, top to bottom: application SW (user application code); SW API (remote memory allocation, memcpy from/to remote blocks, accelerator start/poll, local/remote block debug); block-specific firmware (compute-block firmware, mblock firmware with user task code, ablock firmware with user task code); PHY layer (block-to-block data transfers). The Physical Layer: PHY IP module (Aurora IP).

20 Software stack for inter-block communication (continued). The Block Layer: block-specific firmware for initialization, testing, and debugging; mblock and ablock task code execution.

21 Software stack for inter-block communication (continued). The API Layer: the platform's API exposed to the user; drives operations on the Block Layer.

22 Software stack for inter-block communication (continued). The User Layer: API-based user code.

23 Inter-block communication flow (xblock = mblock or ablock). Each entry lists the request sent to the xblock, then the reply.
1 allocatev: request: alloc, size, type; reply: ACK, addr, id
2 readv: request: read, id, idx; reply: ACK, data
3 writev: request: write, id, idx, data; reply: ACK
memcpy ToBlock: request: toblock, id; reply: ready; then: data
memcpy FromBlock: request: fromblock, id; reply: ACK, data
nearmemory TaskProcess: request: taskproc, tid, in ids, out ids; reply: ACK
nearmemory TaskAccel: request: taskacc, acid, in ids, out ids; reply: ACK, ACK
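The first three exchanges (allocatev, readv, writev) can be illustrated with a toy in-process model. This is only a sketch of the request/reply pattern listed above, not the platform's firmware or API: the class name, the tuple encoding of messages, and the made-up address values are all assumptions.

```python
class XBlock:
    """Toy worker block (mblock or ablock) that answers master requests."""

    def __init__(self):
        self._buffers = {}  # buffer id -> list of values
        self._next_id = 0

    def handle(self, msg):
        """Process one request tuple and return the reply tuple."""
        op = msg[0]
        if op == "alloc":
            _, size, elem_type = msg
            buf_id = self._next_id
            self._next_id += 1
            self._buffers[buf_id] = [elem_type()] * size
            addr = 0x1000 * (buf_id + 1)  # hypothetical address, illustration only
            return ("ACK", addr, buf_id)
        if op == "read":
            _, buf_id, idx = msg
            return ("ACK", self._buffers[buf_id][idx])
        if op == "write":
            _, buf_id, idx, data = msg
            self._buffers[buf_id][idx] = data
            return ("ACK",)
        raise ValueError(f"unsupported request: {op}")


# The master allocates a remote buffer, writes one element, and reads it back.
worker = XBlock()
ack, addr, buf_id = worker.handle(("alloc", 4, int))
assert worker.handle(("write", buf_id, 2, 42)) == ("ACK",)
print(worker.handle(("read", buf_id, 2)))  # ('ACK', 42)
```

In the real platform these messages travel over the PHY layer between boards; here the call to `handle` stands in for that transfer.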

24 Hardware Prototype

25 Hardware prototype: Three-block prototype: one SoC with reconfigurable logic per block type. Xilinx ZC706 boards with Zynq 7045 MP: two ARM Cortex-A9 cores and Kintex-7 programmable logic. Si570 clock source. SMA connectors + SMA bridge physical cables. Diagram labels: Zynq 7045 PS7, ACP, MGP0, peripheral interconnect, memory interconnect, QPLL, DMA engines, Aurora 64B66B over SFP (to the ablock), Aurora 64B66B over SMA (to the mblock), pins AF14, AG14, AD18, AD19, W8, W7.

26 Hardware prototype: The Processing Unit on the compute block is the master. Connected to local and memory interconnect through the ACP (Accelerator Coherency Port).

27 Hardware prototype: The Processing Unit on the compute block is the master. Connected to local and memory interconnect through the ACP. Cache-coherent remote data transfers.

28 Experimental Evaluation

29 Experimental evaluation: Matrix multiplication (HLS-based IP by Xilinx for acceleration, mmult). Evaluation of four different execution scenarios (application mappings). Scenario shown: PS.

30 Experimental evaluation (continued). Scenario added: PS- (mblock reached over the SMA IF).

31 Experimental evaluation (continued). Scenario added: PS-NEAR- (mblock reached over the SMA IF).

32 Experimental evaluation (continued). Scenario added: PS-ACCEL (mmult on the ablock, reached over the SFP IF). The four scenarios: PS, PS-, PS-NEAR-, PS-ACCEL.

34 Experimental evaluation: PS- code excerpt (mblock over the SMA IF). Code sections shown: Declaration, Allocation, Initialization, Execution.

35 Experimental evaluation: PS- code excerpt (mblock over the SMA IF). Highlighted: remote memory access.

36 Performance comparison: speedups. Plot: speedup versus matrix size. Reference: PS. Cache sizes: L1: 32KB, L2: 512KB.

37 Performance comparison: speedups (PS-, mblock over the SMA IF; reference: PS; L1: 32KB, L2: 512KB). PS outperforms PS- in all cases due to sufficient capacity of the ...

38 Performance comparison: speedups (PS-NEAR-, mblock over the SMA IF; reference: PS; L1: 32KB, L2: 512KB). Cache size sufficiently large to fit data.

39 Performance comparison: speedups (PS-NEAR-, mblock over the SMA IF; reference: PS; L1: 32KB, L2: 512KB). Better cache utilization on the mblock hides the transfer overhead due to ACP.

40 Performance comparison: speedups (PS-NEAR-, mblock over the SMA IF; reference: PS; L1: 32KB, L2: 512KB). Excessive memory requirements + transfer overhead.
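The cache-related remarks above depend on how the matrix working set compares with the 32KB L1 and 512KB L2 caches. A minimal sketch of that comparison follows, assuming square N x N matrices of 4-byte elements and counting all three matrices (two inputs, one output); the element size and the matrix sizes tried are assumptions, not values from the slides.

```python
L1_BYTES = 32 * 1024   # L1 size from the slides
L2_BYTES = 512 * 1024  # L2 size from the slides
ELEM_BYTES = 4         # assumption: 4-byte (single-precision) elements


def working_set_bytes(n, elem_bytes=ELEM_BYTES):
    """Bytes occupied by three n x n matrices (A, B, and the result C)."""
    return 3 * n * n * elem_bytes


def smallest_level_that_fits(n):
    """Name the smallest memory level that holds the whole working set."""
    ws = working_set_bytes(n)
    if ws <= L1_BYTES:
        return "L1"
    if ws <= L2_BYTES:
        return "L2"
    return "off-chip memory"


for n in (16, 32, 64, 128, 256, 512):
    print(n, working_set_bytes(n), smallest_level_that_fits(n))
```

Under these assumptions the working set stops fitting in L1 between N = 32 and N = 64, and stops fitting in L2 between N = 128 and N = 256.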

41 Performance comparison: speedups (PS-ACCEL, mmult on the ablock over the SFP IF; reference: PS; L1: 32KB, L2: 512KB). Transfer overhead comparable with computation.

42 Performance comparison: speedups (PS-ACCEL, mmult on the ablock over the SFP IF; reference: PS; L1: 32KB, L2: 512KB). Favorable computation to communication ratio.
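Why the computation-to-communication ratio improves with matrix size can be seen from operation counts. The sketch below assumes the textbook matrix multiplication (2N^3 arithmetic operations) and that each of the three N x N matrices crosses the link exactly once; these are modelling assumptions, not measurements from the prototype.

```python
def arithmetic_ops(n):
    """Multiply-add count for textbook n x n matrix multiplication."""
    return 2 * n ** 3


def elements_transferred(n):
    """Elements moved if A and B are sent once and C is returned once."""
    return 3 * n * n


def compute_to_comm_ratio(n):
    """Arithmetic operations per transferred element; equals 2n/3."""
    return arithmetic_ops(n) / elements_transferred(n)


for n in (16, 64, 256):
    print(n, compute_to_comm_ratio(n))
```

Computation grows with N^3 while transferred data grows with N^2, so the ratio grows linearly with N. This is consistent with transfer overhead being comparable with computation for small matrices and the ratio becoming favorable for larger ones.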

43 Conclusions: Presented an FPGA-based evaluation platform for code preparation and optimization for disaggregated environments.
- Multiple FPGA boards assume different roles, e.g., compute, memory, and acceleration
- Software support and a user-friendly API eliminate deployment/adoption overhead
The platform facilitates the exploration of tradeoffs between the overhead of remote data accesses and the effect of caches, either local or remote, for real-world execution scenarios.

44 Future work: Add functionality and flexibility to the software stack
- Support for task-based execution to facilitate parallelism
- Support for library-based legacy codes
Augment acceleration capabilities and flexibility
- Partial reconfiguration
- Task offloading to processors near accelerators for better performance

45 Thank you!
