Power Efficient Solutions w/ FPGAs. Bill Jenkins Altera Sr. Product Specialist for Programming Language Solutions

3 System Challenges
- CPU architecture is inefficient for most parallel computing applications (big data, search)
  Result: excessive power consumption
- Bottlenecks are starving the CPU for data
  Result: slow performance (high latency)
- Market reaction: growth of customized hardware and architectures
[Diagram: I/O, Memory, and CPU blocks, with the links between them marked "Bottleneck"]

4 Role of FPGA
Resource sharing
- Virtualization of computation, storage, networking
Accelerators
- Network acceleration, hypervisor offload
- Data access acceleration
- Algorithm acceleration
Cluster computing
- CPU and FPGA cluster fabric
[Diagram labels: Cluster Interconnect, Host CPU, FPGA, DRAM]

5 FPGAs Increase Efficiency in the Data Center
FPGAs can greatly enhance CPU-based data center processing by accelerating algorithms and minimizing bottlenecks
- 10X+ increase in performance per watt
- Massively parallel architecture
  - Has 10 to 100 times the number of computational units
  - Enables pipelined designs that perform multiple / different instructions in a single clock cycle
  - Better localized memory avoids bottlenecks
- Programmability enables application-specific accelerators
[Figure callouts: >5M logic elements; 3200 Mbps DDR4 SDRAM / 2.5 Tbps HMC; 1.5 TFLOPs floating-point DSP; programmable I/O]

6 Mapping a simple program to an FPGA
High-level code:
    Mem[100] += 42 * Mem[101]
CPU instructions:
    R0 <- Load Mem[100]
    R1 <- Load Mem[101]
    R2 <- Load #42
    R2 <- Mul R1, R2
    R0 <- Add R2, R0
    Store R0 -> Mem[100]

7 First let's take a look at execution on a simple CPU
[Diagram: simple CPU data-path. Labels: PC, Fetch, Instruction, Op, Val, Registers, Aaddr, Baddr, Caddr, A, B, C, CWriteEnable, CData, ALU, Load, LdAddr, LdData, Store, StAddr, StData]
Fixed and general architecture:
- General "cover-all-cases" data-paths
- Fixed data-widths
- Fixed operations

8 Load constant value into register
[Diagram: the same simple CPU data-path as on the previous slide]
Very inefficient use of hardware!

9 CPU activity, step by step
[Diagram: one step per instruction along a Time axis]
    R0 <- Load Mem[100]
    R1 <- Load Mem[101]
    R2 <- Load #42
    R2 <- Mul R1, R2
    R0 <- Add R2, R0
    Store R0 -> Mem[100]

10 On the FPGA we unroll the CPU hardware
[Diagram: one copy of the hardware per instruction along a Space axis]
    R0 <- Load Mem[100]
    R1 <- Load Mem[101]
    R2 <- Load #42
    R2 <- Mul R1, R2
    R0 <- Add R2, R0
    Store R0 -> Mem[100]

11 and specialize by position
1. Instructions are fixed. Remove Fetch
[Diagram stages: R0 <- Load Mem[100]; R1 <- Load Mem[101]; R2 <- Load #42; R2 <- Mul R1, R2; R0 <- Add R2, R0; Store R0 -> Mem[100]]

12 and specialize
1. Instructions are fixed. Remove Fetch
2. Remove unused ALU ops
[Diagram: the same six instruction stages]

13 and specialize
1. Instructions are fixed. Remove Fetch
2. Remove unused ALU ops
3. Remove unused Load / Store
[Diagram: the same six instruction stages]

14 and specialize
1. Instructions are fixed. Remove Fetch
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
[Diagram: the same six instruction stages]

15 and specialize
1. Instructions are fixed. Remove Fetch
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
[Diagram: the same six instruction stages]

16 and specialize
1. Instructions are fixed. Remove Fetch
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
[Diagram: the same six instruction stages]

17 Custom Data-Path on the FPGA Matches Your Algorithm!
High-level code:
    Mem[100] += 42 * Mem[101]
Custom data-path:
[Diagram labels: load, load, 42, store]
Build exactly what you need:
- Operations
- Data widths
- Memory size & configuration
Efficiency: Throughput / Latency / Power
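
As a rough software analogy for the slides above (not the tool's actual output), the sketch below models the six-instruction CPU sequence next to the single specialized expression that the custom data-path computes. The function names `run_cpu_sequence` and `run_custom_datapath` are hypothetical, and `mem` is assumed to be an array with at least 102 elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the CPU instruction sequence from the slides.
   Each statement stands for one fetch/decode/execute step in time. */
int32_t run_cpu_sequence(int32_t *mem)
{
    assert(mem != NULL);
    int32_t r0, r1, r2;
    r0 = mem[100];   /* R0 <- Load Mem[100]  */
    r1 = mem[101];   /* R1 <- Load Mem[101]  */
    r2 = 42;         /* R2 <- Load #42       */
    r2 = r1 * r2;    /* R2 <- Mul R1, R2     */
    r0 = r2 + r0;    /* R0 <- Add R2, R0     */
    mem[100] = r0;   /* Store R0 -> Mem[100] */
    return r0;
}

/* Hypothetical model of the specialized data-path: two loads, a constant
   multiplier, an adder and a store, with no fetch and no general registers. */
int32_t run_custom_datapath(int32_t *mem)
{
    assert(mem != NULL);
    mem[100] = mem[100] + 42 * mem[101];
    return mem[100];
}
```

Both functions produce the same memory contents; the difference the slides point at is in the hardware needed to get there, which plain C cannot show.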

18 Architectural Example: Image Processing
$$I_{new}(x, y) = \sum_{x'=-1}^{1} \sum_{y'=-1}^{1} I_{old}(x + x', y + y') \times F(x', y')$$
Convolutions: dataflow can proceed in pipelined fashion
- No need to wait until the entire execution is complete
- Start a new set of data calculations as soon as the first stage completes its execution

19 Processor (CPU/GPU) Implementation
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            // 3x3 window: offsets -1..1; the +1 keeps the filter index non-negative
            for (int y2 = -1; y2 <= 1; ++y2) {
                for (int x2 = -1; x2 <= 1; ++x2) {
                    i2[y][x] += i[y + y2][x + x2] * filter[y2 + 1][x2 + 1];
                }
            }
        }
    }
[Diagram: CPU, Cache, Main Memory]
A cache can hide poor memory access patterns

20 FPGA Implementation
Example performance point: 1 pixel per cycle
Cache requirements: 9 reads + 1 write per cycle
[Diagram: Memory, Cache, Custom Data-path; the cache needs 9 read ports!]
Expensive hardware!
- Power overhead
- Cost overhead: more built-in addressing flexibility than we need
Why not customize the cache for the application?

21 Optimizing the Cache
Start out with the initial picture that is W pixels wide

22 Optimizing the Cache
Let's remove all the lines that aren't in the neighborhood of the window

23 Optimizing the Cache
Take all of the lines and arrange them as a 1D array of pixels

24 Optimizing the Cache
Remove the pixels at the edges that we don't need for the computation

25 Optimizing the Cache
What happens when we move the window one pixel to the right?
We have created a shift register implementation

26 Shift Registers in Software
[Diagram: data_in enters at sr[0]; the register runs to sr[2*w+2]; nine taps form data_out[9]]
    pixel_t sr[2*w+3];    // w = image width in pixels
    while (keep_going) {
        // Shift data in, starting at the far end so each value moves exactly one slot
        #pragma unroll
        for (int i = 2*w+2; i > 0; --i)
            sr[i] = sr[i-1];
        sr[0] = data_in;

        // Tap output data
        data_out = {sr[0],   sr[1],     sr[2],
                    sr[w],   sr[w+1],   sr[w+2],
                    sr[2*w], sr[2*w+1], sr[2*w+2]};
        //...
    }
Managing data movement to match the FPGA's architectural strengths is key to obtaining high performance
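
To make the shift-register idea concrete, here is a minimal plain-C sketch (a software model, not FPGA code) that runs a 3x3 convolution twice: once with nine image reads per output pixel, and once with a `2*W+3` element shift register that reads one new pixel per step. The image size (`IMG_W`, `IMG_H`), the flat row-major layout, and the function names are assumptions made for this illustration.

```c
#include <assert.h>

#define IMG_W 8                  /* hypothetical image width (W in the slides) */
#define IMG_H 6                  /* hypothetical image height */
#define SR_LEN (2 * IMG_W + 3)   /* shift-register length used on the slides */

/* Direct form: nine reads of the input image per output pixel.
   in/out are row-major IMG_H x IMG_W arrays; f is a row-major 3x3 filter. */
void convolve_direct(const int *in, const int *f, int *out)
{
    assert(in && f && out);
    for (int y = 1; y < IMG_H - 1; ++y) {
        for (int x = 1; x < IMG_W - 1; ++x) {
            int acc = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    acc += in[(y + dy) * IMG_W + (x + dx)] * f[(dy + 1) * 3 + (dx + 1)];
            out[y * IMG_W + x] = acc;
        }
    }
}

/* Shift-register form: one read of the input image per step.
   After pixel n is shifted in, sr[k] holds pixel n - k of the stream. */
void convolve_shift_register(const int *in, const int *f, int *out)
{
    assert(in && f && out);
    int sr[SR_LEN] = {0};
    for (int n = 0; n < IMG_H * IMG_W; ++n) {
        /* Shift from the far end so no value is overwritten before it moves. */
        for (int i = SR_LEN - 1; i > 0; --i)
            sr[i] = sr[i - 1];
        sr[0] = in[n];

        /* The newest pixel is the bottom-right corner of the 3x3 window. */
        int y = n / IMG_W - 1;   /* row of the window centre    */
        int x = n % IMG_W - 1;   /* column of the window centre */
        if (y < 1 || x < 1)
            continue;            /* window is not fully inside the image yet */

        int acc = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                acc += sr[(1 - dy) * IMG_W + (1 - dx)] * f[(dy + 1) * 3 + (dx + 1)];
        out[y * IMG_W + x] = acc;
    }
}
```

The nine indices `(1 - dy) * IMG_W + (1 - dx)` are exactly the taps listed on the slide: 0, 1, 2, W, W+1, W+2, 2W, 2W+1, 2W+2. Both functions write only interior pixels, so the caller is expected to zero the output array first.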

27 Traditional OpenCL Implementation of a Pipeline (CPU/GPU)
[Diagram: Kernel 1, Kernel 2, Kernel 3 exchanging data through buffers in Global Memory (DDR)]
- High latency: requires access to global memory
- High memory bandwidth
- Requires host coordination to pass buffers from one kernel to another
With a particular design example we achieved 183 images/s on a Stratix V PCIe card

28 Leveraging Kernel-to-Kernel Channels
[Diagram: Global Memory (DDR) with two buffers; Kernel 1, Kernel 2, Kernel 3 linked by channels]
Create a queue (channel declaration):
    value_type channel();
    channel int my_channel;
Push data into the queue (channel write):
    void write_channel_altera(channel &ch, value_type data);
    write_channel_altera(my_channel, x);
Pop the first element from the queue (channel read):
    value_type read_channel_altera(channel &ch);
    int y = read_channel_altera(my_channel);
- Low-latency communication between kernels
- Significantly lower memory bandwidth requirements
- Host is not involved in coordinating communication between kernels
This implementation on the same Stratix V PCIe card resulted in 400 images/s
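
The queue behaviour described above can be modelled in ordinary C. The sketch below is only a software stand-in, assuming a fixed-depth FIFO: `channel_model`, `channel_write`, `channel_read`, `run_pipeline`, and `CHANNEL_DEPTH` are hypothetical names for this illustration and are not part of the Altera channel API. It chains three toy stages through two queues without any shared intermediate buffer.

```c
#include <assert.h>
#include <stdbool.h>

#define CHANNEL_DEPTH 4   /* hypothetical FIFO depth */

/* Software stand-in for a kernel-to-kernel channel: a small ring buffer. */
typedef struct {
    int data[CHANNEL_DEPTH];
    int head;    /* index of the oldest element */
    int count;   /* number of elements currently queued */
} channel_model;

/* Push one value. Returns false when the queue is full
   (a blocking channel write would wait instead). */
bool channel_write(channel_model *ch, int value)
{
    assert(ch != 0);
    if (ch->count == CHANNEL_DEPTH)
        return false;
    ch->data[(ch->head + ch->count) % CHANNEL_DEPTH] = value;
    ch->count++;
    return true;
}

/* Pop the oldest value. Returns false when the queue is empty. */
bool channel_read(channel_model *ch, int *value)
{
    assert(ch != 0 && value != 0);
    if (ch->count == 0)
        return false;
    *value = ch->data[ch->head];
    ch->head = (ch->head + 1) % CHANNEL_DEPTH;
    ch->count--;
    return true;
}

/* Three toy stages chained through two channels:
   stage 1 doubles, stage 2 adds one, stage 3 accumulates. */
int run_pipeline(const int *input, int n)
{
    channel_model c12 = {{0}, 0, 0};
    channel_model c23 = {{0}, 0, 0};
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        int v;
        channel_write(&c12, input[i] * 2);      /* stage 1 */
        if (channel_read(&c12, &v))
            channel_write(&c23, v + 1);         /* stage 2 */
        if (channel_read(&c23, &v))
            sum += v;                           /* stage 3 */
    }
    return sum;
}
```

On the FPGA the three stages would run concurrently as separate kernels; the single loop here only imitates the order in which values move from one stage to the next.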

29 FPGA Code
    #pragma OPENCL EXTENSION cl_altera_channels : enable

    // Declaration of Channel API data types
    channel float prod_k1_channel;
    channel float k1_k2_channel;
    channel float k2_k3_channel;
    channel float k3_res_channel;

    kernel void convolution_prod(
        int batch_id_begin,
        int batch_id_end,
        global const volatile float * restrict input_global)
    {
        for(...) {
            write_channel_altera(prod_k1_channel, input_global[...]);
        }
        write_channel_altera(k1_k2_channel, input_global[...]);
        ...
    }
- Kernels are written as standard building blocks that are connected together through channels
- The concept of having multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs
  - Offered as a vendor extension
  - Portable in OpenCL 2.0 through the concept of OpenCL Pipes

30 Migration Between FPGAs
- In OpenCL, a float uses soft logic in older FPGAs
- Gen10 FPGAs have hardened floating-point logic built into the DSP blocks
- On Arria 10, using the same code results in processing 6800 images/s
- Stratix 10 expectations:
  - Large increase in floating-point resources
  - Higher internal frequencies achievable
  - 1.6x-2x performance increase
  - 12x-16x performance/watt efficiency versus Stratix V
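
Taking the three throughput figures quoted on the slides at face value (183, 400, and 6800 images/s), the implied speedups work out roughly as follows. This is plain arithmetic on the quoted numbers, not additional measured data.

```latex
\begin{align*}
\text{Channels vs. global-memory buffers (Stratix V):}\quad & \frac{400}{183} \approx 2.2\times \\
\text{Arria 10 vs. Stratix V, both with channels:}\quad & \frac{6800}{400} = 17\times \\
\text{Arria 10 vs. the original Stratix V design:}\quad & \frac{6800}{183} \approx 37\times
\end{align*}
```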

31 Additional Improvements: IO Channels
- Kernel Channels are between OpenCL kernels
- IO Channels take data directly from and to IO interfaces in the FPGA
  - Camera or video feed could be processed directly in the FPGA without going through the host
  - Result could be passed out to the graphics card to be displayed or back to host memory for the host to use
- Private, Local and Global memory can now be used to buffer as needed
[Diagram: IO Channels at the FPGA boundary; Kernel 1, Kernel 2, Kernel 3 linked by Kernel Channels/Pipes]

32 Lessons Learned
- Exploiting pipelining on the FPGA requires some attention to coding style to overcome the inherent assumptions of writing software
  - FPGAs do not have caches
  - Need to exploit data reuse in a more explicit way
- The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory
  - Bandwidth limitations begin to dominate compute
  - Use direct kernel-to-kernel communication, called channels
- Native support for floating point on the FPGA allows an order-of-magnitude performance increase
- Code can be ported to newer FPGAs without modification to get a performance increase
- IO Channels can lower latency and improve performance even more by taking the host out of the processing chain
