FPGAs for Image Processing

1 FPGAs for Image Processing
A DSL and program transformations
Rob Stewart, Greg Michaelson, Idress Ibrahim, Deepayan Bhowmik, Andy Wallace, Paulo Garcia
Heriot-Watt University
10 May 2016

2 What I will say
1. The EPSRC Rathlin project is interested in remote image processing.
2. We've developed a DSL for FPGAs called RIPL.
3. Dataflow IR transformations between RIPL and the FPGA help.
Low-powered, accelerated remote image processing.

3 FPGAs vs GPUs
FPGAs
+ energy efficient
+ sometimes faster
- hard to program
- hard to optimise
GPUs
+ fast floating point
+ fast SIMD parallelism
- use lots of energy
- poor performance with irregular memory access

4 FPGAs vs CPUs
"A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation". D. Thomas et al. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009.

5 Block RAM on an FPGA

6 DSPs on an FPGA

7 RIPL in an FPGA

8 RIPL in an FPGA

9 Part 1 of 4: RIPL skeletons.

10 A RIPL program
program =
  image1 = imread 512 512;
  image2 = imap image1 (λ[.] -> ([.-1] + [.] + [.+1]) / 3);
  image3 = imap image2 (λ[.] -> ([.-1] + [.] + [.+1]) / 3);
  image4 = map image3 (λ[x] -> [min 255 (x + 50)]);
  out image4;
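To make the meaning of the program above concrete, here is a minimal reference sketch in Python (not RIPL, and not the RIPL compiler's output). It assumes an image is a flat stream of pixels, that edge pixels reuse their nearest neighbour, and that `/ 3` is integer division; the slides state none of these, and the names `imap`, `mean3`, `brighten` and `program` are only illustrative.

```python
from typing import Callable, List

Pixel = int
Image = List[Pixel]  # assumption: an image is handled as a flat stream of pixels


def imap(image: Image, f: Callable[[Image, int], Pixel]) -> Image:
    """Index-aware map: f sees the stream and the current index."""
    return [f(image, i) for i in range(len(image))]


def mean3(image: Image, i: int) -> Pixel:
    """([.-1] + [.] + [.+1]) / 3, with edge indices clamped (an assumption)."""
    left = image[max(i - 1, 0)]
    right = image[min(i + 1, len(image) - 1)]
    return (left + image[i] + right) // 3  # integer division is an assumption


def brighten(x: Pixel) -> Pixel:
    """min 255 (x + 50): add 50 and saturate at 255."""
    return min(255, x + 50)


def program(image1: Image) -> Image:
    image2 = imap(image1, mean3)             # first 3-point mean
    image3 = imap(image2, mean3)             # second 3-point mean
    image4 = [brighten(x) for x in image3]   # pointwise map
    return image4
```

For example, `program([0, 30, 60, 90])` smooths twice and then brightens, giving `[66, 83, 106, 123]` under the assumptions above.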

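11 Memory efficient skeletons
RIPL: λ[.] -> ([.-1] + [.] + [.+1]) / 3
State transitions:
  init: σ0 to σ1, consumes 2 pixels, produces none (Ø)
  stream: σ1 to σ1, consumes 1 pixel, produces 1 pixel
[Figure: three-element state buffer; index 0 holds [.-1], index 1 (the midpoint index) holds [.], index 2 holds [.+1].]
Images are just streams of pixels.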

12 RIPL skeletons
map : I_(M,N) → ([P]_A → [P]_A) → I_(M,N)
imap : I_(M,N) → (P_i → P) → I_(M,N)
scalerow : I_(M,N) → ([P]_A → [P]_B) → I_(M×(B/A),N)
scalecol : I_(M,N) → ([P]_A → [P]_B) → I_(M,N×(B/A))
filter2d : I_(M,N) → (x, y) : (Int, Int) → [K]_(x×y) → I_(M,N)
zipwith : I_(M,N) → I_(M,N) → ([P]_A → [P]_A → [P]_A) → I_(M,N)
unzip : I_(M,N) → (P_i → P) → (P_i → P) → (I_(M,N), I_(M,N))
foldscalar : I_(M,N) → Int → (P → Int → Int) → Int
foldvector : I_(M,N) → Int → a : Int → (P → [Int]_a → [Int]_a) → [Int]_a
transpose : I_(M,N) → I_(N,M)
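As a rough illustration of what a few of these skeletons compute, the following Python sketch models an image as a list of rows. It is a software analogue under simplifying assumptions (chunk size A = 1 for map and zipwith, integer pixels), not the RIPL implementation; `map_px` is a hypothetical name chosen to avoid clashing with Python's built-in `map`.

```python
from functools import reduce
from typing import Callable, List

Image = List[List[int]]  # assumption: an image is a row-major list of rows


def map_px(img: Image, f: Callable[[int], int]) -> Image:
    """map with chunk size 1: apply f to every pixel."""
    return [[f(p) for p in row] for row in img]


def zipwith(a: Image, b: Image, f: Callable[[int, int], int]) -> Image:
    """Combine two equally sized images pixel by pixel."""
    return [[f(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def foldscalar(img: Image, init: int, f: Callable[[int, int], int]) -> int:
    """Reduce all pixels to one Int; f takes (pixel, accumulator)."""
    pixels = (p for row in img for p in row)
    return reduce(lambda acc, p: f(p, acc), pixels, init)


def transpose(img: Image) -> Image:
    """Swap rows and columns: an M x N image becomes N x M."""
    return [list(col) for col in zip(*img)]
```

With `img = [[1, 2, 3], [4, 5, 6]]`, `transpose(img)` yields three rows of two pixels, and `foldscalar(img, 0, lambda p, acc: p + acc)` sums every pixel.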

13 RIPL to FPGAs
1. Use algorithmic skeletons.
2. Compile RIPL → pipelined parallel dataflow graphs.
3. Optimise: apply dataflow transformations.
4. Compile dataflow graph → hardware description with Verilog.
5. Synthesise Verilog for an FPGA.
6. Send bitstream to the FPGA.

14 Part 2 of 4: RIPL to dataflow.

15 RIPL to dataflow

16 RIPL's dataflow constraints
[Figure: the dataflow models SDF, CSDF and DPN compared on memory bound, runtime scheduling and expressiveness.]

17 RIPL's small step dataflow semantics
A skeleton implementation is a set of transition rules.
(σ_x, S) --[a,b] / [c,d]--> (σ_y, S')
Transition from σ_x to σ_y.
Start with internal state S, end with S'.
Consumes [a, b] pixels, generates [c, d] pixels.
"What" is computed is defined by the RIPL programmer.

18 RIPL's small step dataflow semantics
image2 = imap image1 (λ[.] -> ([.-1] + [.] + [.+1]) / 3);
RIPL: λ[.] -> ([.-1] + [.] + [.+1]) / 3
State transitions:
  init: σ0 to σ1, consumes 2 pixels, produces none (Ø)
  stream: σ1 to σ1, consumes 1 pixel, produces 1 pixel
[Figure: three-element state buffer; index 0 holds [.-1], index 1 (the midpoint index) holds [.], index 2 holds [.+1].]
Example transitions:
  (σ0, [0, 0, 0]) --[23,27]--> (σ1, [27, 23, 0])
  (σ1, [27, 23, 0]) --[28] / [27]--> (σ1, [27, 23, 28])
  (σ1, [23, 27, 28]) --[34] / [28]--> (σ1, [34, 23, 28])
  (σ1, [34, 23, 28]) --[92] / [51]--> (σ1, [34, 92, 28])
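The two-rule shape of this semantics (an init rule that consumes two pixels and produces none, then a stream rule that consumes one and produces one) can be sketched as a small state machine in Python. This is an illustrative model only: it keeps a two-pixel buffer rather than the three-element buffer on the slide, ignores image edges, assumes integer division, and the names `step`, `run` and `RATES` are hypothetical.

```python
from typing import Dict, List, Tuple

INIT, STREAM = 0, 1                          # the two states: sigma_0 and sigma_1
RATES: Dict[int, int] = {INIT: 2, STREAM: 1}  # pixels consumed per firing


def step(state: int, buf: List[int], pixels: List[int]) -> Tuple[int, List[int], List[int]]:
    """Fire one transition; returns (next state, next buffer, produced pixels)."""
    if state == INIT:
        # init rule: consume 2 pixels, produce nothing
        return STREAM, [pixels[0], pixels[1]], []
    # stream rule: consume 1 pixel, produce the mean of the 3-pixel window
    prev, mid = buf
    nxt = pixels[0]
    return STREAM, [mid, nxt], [(prev + mid + nxt) // 3]


def run(stream: List[int]) -> List[int]:
    """Fire transitions while enough input pixels remain."""
    state, buf, out, i = INIT, [], [], 0
    while i + RATES[state] <= len(stream):
        n = RATES[state]
        state, buf, produced = step(state, buf, stream[i:i + n])
        out.extend(produced)
        i += n
    return out
```

Calling `run([10, 20, 30, 40])` fires the init rule once and the stream rule twice. The point of the design is that the actor's memory stays at a few pixels regardless of image size.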

19 Part 3 of 4: optimising dataflow.

20 Dataflow profiling
Find bottlenecks using the open source TURNUS tool:
- critical dataflow path
- actors with high computational latency
- low clock frequency
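The idea of a critical dataflow path can be illustrated with a longest-path search over an acyclic actor graph. The sketch below is not TURNUS and does not use its API; the actor names and latency figures are made up for illustration, and real profiling works on execution traces rather than one fixed latency per actor.

```python
from functools import lru_cache
from typing import Dict, List, Tuple


def critical_path(latency: Dict[str, int], edges: Dict[str, List[str]]) -> Tuple[int, List[str]]:
    """Return (total latency, actors) for the longest path in an acyclic graph."""

    @lru_cache(maxsize=None)
    def longest_from(actor: str) -> Tuple[int, Tuple[str, ...]]:
        best: Tuple[int, Tuple[str, ...]] = (0, ())
        for succ in edges.get(actor, []):
            cand = longest_from(succ)
            if cand[0] > best[0]:
                best = cand
        return latency[actor] + best[0], (actor,) + best[1]

    total, path = max(longest_from(a) for a in latency)
    return total, list(path)


# Hypothetical actor graph: latencies in clock cycles per firing.
example_latency = {"source": 1, "blur": 5, "edges": 3, "merge": 2, "sink": 1}
example_edges = {
    "source": ["blur", "edges"],
    "blur": ["merge"],
    "edges": ["merge"],
    "merge": ["sink"],
}
```

Here `critical_path(example_latency, example_edges)` identifies the path through `blur` as the bottleneck, which is where a transformation would be tried first.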

21
                   Slice LUT  Slice registers  Block RAM/FIFO  DSP48E  FMax (MHz)
Naive                   3664             8777              88      49       55.41
Final_XY                  76               80               0       0      721.48
Centre_XY                182              199               0       0      530.81
Stream_to_YUV             90              287              24       0        420.
update_model
YUV2RGB
displacement
update_weight
karray_derv
karray_evaluation

22 Manual dataflow transformation
Profile Guided Dataflow Transformation for FPGAs & CPUs. R. Stewart, D. Bhowmik, G. Michaelson, A. Wallace. Special Issue on Dataflow, in The Journal of Signal Processing Systems, Springer,
Table columns: Functionality, Transformation, Registers, Slice LUTs, BRAM, DSP, Clock (MHz)
Functionality: Stream to YUV; YUV to RGB; Displacement; Update weight; k-array derive
Transformation entries, in slide order: None; Loop elimination; None; Actor fusion; None; Task parallelism; None; Fission; Just square root (none); Square root Lookup; Combined; None; Loop promotion

23 Interactive dataflow transformation
- Task parallel decomposition (video)
- Data parallel fan out/fan in (video)
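The two transformations named above can be sketched in software as follows. This is a sequential Python model under stated assumptions, not the graphical framework from the talk: `fan_out`, `fan_in`, `data_parallel` and `task_parallel` are hypothetical names, lanes are processed one after another rather than concurrently as on an FPGA, and the data parallel version is only valid for stateless, pointwise functions.

```python
from typing import Callable, List


def fan_out(stream: List[int], n: int) -> List[List[int]]:
    """Deal pixels round-robin into n lanes."""
    return [stream[i::n] for i in range(n)]


def fan_in(lanes: List[List[int]]) -> List[int]:
    """Interleave lanes round-robin, restoring the original order."""
    out: List[int] = []
    for i in range(max((len(lane) for lane in lanes), default=0)):
        for lane in lanes:
            if i < len(lane):
                out.append(lane[i])
    return out


def data_parallel(stream: List[int], f: Callable[[int], int], n: int) -> List[int]:
    """Fan out, apply f on every lane, fan in."""
    return fan_in([[f(p) for p in lane] for lane in fan_out(stream, n)])


def task_parallel(stream: List[int], f: Callable[[int], int], g: Callable[[int], int],
                  combine: Callable[[int, int], int]) -> List[int]:
    """Run two independent tasks on the same stream, then combine their results."""
    return [combine(a, b) for a, b in zip(map(f, stream), map(g, stream))]
```

For a pointwise function, `data_parallel(stream, f, n)` gives the same result as applying `f` directly for any lane count `n`; the transformation changes the structure of the graph, not its output.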

24 Part 4 of 4: evaluation.

26 Power performance
Sub-module                  Power (W)
Camera 24 MHz
Camera 100 MHz
Visual Saliency 50 MHz
Visual Saliency 85 MHz
Visual Saliency 100 MHz

27 Space performance
Resource                     Usage  Occupation
DSP48E1s                         3          1%
FIFO36E1s                        2          1%
External IOB33s                 80         40%
RAMB18E1s
RAMB36E1s                       26         18%
Slices
Slice Registers
Slice LUTs
Slice LUT-Flip Flop pairs

28 Throughput performance
        Processing time (ms)  Frame rate
FPGA
CPU
Current experiments show RIPL performance FPS.

29 Our contribution
- A new image processing DSL for FPGAs.
- Small step operational dataflow semantics for skeletons.
- Identified profiling metrics that matter for FPGAs.
- A graphical dataflow transformations framework.
- FPGA-based image processing system architecture.

30 Future work
- Evaluate RIPL's expressivity for real-world computer vision.
- Many dataflow implementations for each skeleton.
- Machine learning to construct and prune the search space of all possible dataflow representations of a single RIPL program.
- Integrate transformations with the dataflow profiling tool.
- Automated compiler-based transformation.
Thanks.
R.Stewart@hw.ac.uk
