Bridging the GPGPU-FPGA Efficiency Gap

Size: px

Start display at page:

Download "Bridging the GPGPU-FPGA Efficiency Gap"

Henry Ramsey
6 years ago
Views:

1 Bridging the GPGPU-FPGA Efficiency Gap Christopher W. Fletcher 1,2, Ilia Lebedev 2, Narges B. Asadi 3, Daniel R. Burke 2, John Wawrzynek 2 1 :M.I.T., 2 :UC Berkeley, 3 : Stanford February 28th FPGA

2 Design by µarch Template Idea: FPGAs designs from architectural templates February 28th FPGA

3 Design by µarch Template Idea: FPGAs designs from architectural templates February 28th FPGA

4 Design by µarch Template Idea: FPGAs designs from architectural templates 32x2 32x x2 32x2 x - February 28th FPGA

5 Design by µarch Template Idea: FPGAs designs from architectural templates 32x2 32x x2 32x2 x - Q: How do template-based designs on an FPGA compare to general purpose processors? February 28th FPGA

many proteins, of each type, there are in a cell Cell Signaling

6 Cell Signaling Networks Goal: Given flow cytometry data, learn the structure of cell signaling networks Flow Cytometry Raw count of how many proteins, of each type, there are in a cell Cell Signaling Networks Structures that model protein signaling pathways February 28th FPGA

7 FPGA & GPGPU (Top level) Once the application becomes compute-bound, FPGA an application-specific GPGPU February 28th FPGA

8 FPGA & GPGPU (Top level) Once the application becomes compute-bound, FPGA an application-specific GPGPU Local order thread dispatch Memory subsystem Result Arbiter Memory Interface Input/Output MCU (Host) Interface FPGA Shared L2 Cache GPGPU February 28th FPGA Host Interface Picture based on the Nvidia Fermi.

9 FPGA & GPGPU (Top level) Once the application becomes compute-bound, FPGA an application-specific GPGPU Local order thread dispatch Memory subsystem Result Arbiter Memory Interface Input/Output MCU (Host) Interface FPGA Shared L2 Cache GPGPU February 28th FPGA Host Interface Picture based on the Nvidia Fermi.

10 FPGA & GPGPU (Top level) Once the application becomes compute-bound, FPGA an application-specific GPGPU Local order thread dispatch Memory subsystem Result Arbiter Memory Interface Input/Output MCU (Host) Interface FPGA Shared L2 Cache GPGPU February 28th FPGA Host Interface Picture based on the Nvidia Fermi.

11 FPGA & GPGPU (Top level) Once the application becomes compute-bound, FPGA an application-specific GPGPU Local order thread dispatch Memory subsystem Result Arbiter Memory Interface I n p u t / O u t p u t MCU (Host ) Interface F P G A Shared L2 Cache G P G P U February 28th FPGA Host Interface Picture based on the Nvidia Fermi.

12 FPGA & GPGPU (Top level) Once the application becomes compute-bound, FPGA an application-specific GPGPU Local order thread dispatch Memory subsystem Result Arbiter Memory Interface Input/Output MCU (Host) Interface FPGA Shared L2 Cache GPGPU February 28th FPGA Host Interface Picture based on the Nvidia Fermi.

13 FPGA & GPGPU (Top level) Once the application becomes compute-bound, FPGA an application-specific GPGPU Local order thread dispatch Memory subsystem Result Arbiter Memory Interface Input/Output MCU (Host) Interface FPGA Shared L2 Cache GPGPU February 28th FPGA Host Interface Picture based on the Nvidia Fermi.

14 FPGA & GPGPU (per Node) FPGA GPGPU INIT Inputs Start! Data Valid? Control Block RAM Instruction Cache Thread Block Scheduler Thread Block Scheduler Dispatcher Dispatcher Register File Previous[0] 2 Ports Previous[2] Core LD/ST Control (data) Control (local orders) Post processing 2 Ports SFU Data Threads Serialization Interconnect Post Post Next[0] Next[1] Next[2] 3 Lanes Post Interconnect Network Shared L1 Cache February 28th FPGA

15 FPGA & GPGPU (per Node) FPGA GPGPU INIT Inputs Start! Data Valid? Control Block RAM GPGPU Core Instruction Cache Thread Block Scheduler Thread Block Scheduler Dispatcher Dispatcher Dispatch Port Operand Collector Register File Control (data) Control (local orders) Post processing Previous[0] 2 Ports Previous[2] 2 Ports Floating Point Order Sampler Result Queue From Previous Core Integer FPGA Core Data Core LD/ST SFU Local orders Data Threads Serialization Interconnect Post Post Next[0] Next[1] Next[2] 3 Lanes Post LOG Table To Next Core En GS Graph Sampler Interconnect Network Shared L1 Cache February 28th FPGA

16 FPGA & GPGPU (per Node) FPGA GPGPU INIT Inputs Start! Data Valid? Control Block RAM GPGPU Core Instruction Cache Thread Block Scheduler Thread Block Scheduler Dispatcher Dispatcher Dispatch Port Operand Collector Register File Control (data) Control (local orders) Post processing Previous[0] 2 Ports Previous[2] 2 Ports Floating Point Order Sampler Result Queue From Previous Core Integer FPGA Core Data Core LD/ST SFU Local orders Data Threads Serialization Interconnect Post Post Next[0] Next[1] Next[2] 3 Lanes Post LOG Table To Next Core En GS Graph Sampler Interconnect Network Shared L1 Cache February 28th FPGA

17 Summary of Results Our study compares the throughput of an FPGA core relative to a GPGPU core. Nvidia GT 330m (mobile) Nvidia GTX 480 Virtex 5, 155t x x Virtex 6, 240t x x Example: A Virtex-5 core achieves x throughput, relative to a GT 330m core. Despite the GPGPU s higher core clock frequency, the FPGA core maintains competitive throughput. February 28th FPGA

18 Acknowledgements For making this work possible, we thank: NIH / Cancer Research Institute Grant # Members of the Stanford Nolan Lab Berkeley Reconfigurable Computing Group M.I.T. Hornet Group UC Berkeley Wireless Research Center (BWRC) Contributors to the GateLib research library February 28th FPGA

Register File Organization

Register File Organization Sudhakar Yalamanchili unless otherwise noted (1) To understand the organization of large register files used in GPUs Objective Identify the performance bottlenecks and opportunities