ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
1 ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
Mingxing Tan (1,2), Gai Liu (1), Ritchie Zhao (1), Steve Dai (1), Zhiru Zhang (1)
(1) Computer Systems Laboratory, Electrical and Computer Engineering, Cornell University
(2) Google Inc.
2 Outline
Loop pipelining in HLS
Irregular loop nests
ElasticFlow architecture
ElasticFlow synthesis
Experimental results
3 Loop Pipelining
An important optimization in HLS: create a static schedule for the loop body so that successive loop iterations overlap in execution. The objective is to minimize the initiation interval (II), the number of clock cycles between the starts of consecutive iterations; in steady state, II=1 yields a throughput of one iteration per cycle.

    for (i = 0; i < 4; i++) {
      for (j = 0; j < 4; j++) {
        #pragma pipeline
        acc += A[j] * i;
      }
    }

[Figure: pipeline schedule for j=0..3, each iteration loading A, multiplying, and accumulating; with II=1, a new iteration starts every cycle.]
4 Pipelining Outer Loop
Option 1: pipeline only the inner loop.

    for (i = 0; i < 4; i++) {
      for (j = 0; j < 4; j++) {
        #pragma pipeline
        acc += A[j] * i;
      }
    }

Throughput: one inner loop iteration per cycle; requires a fixed inner loop bound.

Option 2: pipeline the outer loop by fully unrolling the inner loop.

    for (i = 0; i < 4; i++) {
      #pragma pipeline
      acc += A[0] * i;
      acc += A[1] * i;
      acc += A[2] * i;
      acc += A[3] * i;
    }

Throughput: one outer loop iteration per cycle.
5 Pipelining Irregular Loop Nest
An irregular loop nest contains one or more dynamic-bound inner loops: the number of inner loop iterations varies at run time. Such nests access less-regular data structures (e.g., sparse matrices, graphs, and hash tables) that are common in emerging applications.

Hash lookup example -- how do we pipeline this loop nest to achieve one lookup per cycle?

    for (k : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);
      p = hashtbl[hv].keys;
      while (p && p->key != k)
        p = p->next;
      format_output(p);
    }

[Figure: hash table with buckets 0..N, each bucket holding a chain of keys.]
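A behavioral sketch (ours, not the authors' HLS code; `build_hashtbl` and `lookup` are made-up names) of why this nest is irregular: each lookup's trip count through the chain is a property of the data, known only at run time.

```python
# Chained hash table: the inner loop's trip count depends on the length
# of the chain the key hashes to, which varies per lookup.

def build_hashtbl(pairs, num_buckets):
    """Bucket array, each bucket a list of (key, value) nodes."""
    buckets = [[] for _ in range(num_buckets)]
    for k, v in pairs:
        buckets[hash(k) % num_buckets].append((k, v))
    return buckets

def lookup(buckets, key):
    """Walk the chain for `key`; returns (value, trip_count)."""
    chain = buckets[hash(key) % len(buckets)]
    trips = 0
    for k, v in chain:              # dynamic-bound inner loop
        trips += 1
        if k == key:
            return v, trips
    return None, trips
```

Two lookups into the same table can take very different numbers of inner iterations, which is exactly what defeats a single static schedule for the outer loop.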
6 Aggressively Unrolling Inner Loop
Replace the while loop with a for loop bounded by the worst-case chain length and fully unroll it, so the outer loop can be pipelined:

    for (k : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);           // A
      p = hashtbl[hv].keys;
      for (j = 0; j < 6; j++) {       // B
        #pragma unroll
        if (p && p->key != k)
          p = p->next;
      }
      format_output(p);               // C
    }

[Figure: pipeline schedule in which every outer iteration executes all six unrolled steps j=0..5, achieving 1 lookup/cycle.]
7 Issues with Aggressive Unrolling
1. The worst-case bound may not be statically determinable.
2. The worst-case bound can be far larger than the common case (e.g., 99 vs. 2).
3. The result is an unnecessarily deep pipeline that is very inefficient in area.
[Figure: schedule in which every outer iteration pays for stages j=0..99 even when only a few are needed.]
8 Need for a New Approach
Irregular loop nests are prevalent: graph processing, scientific computation, image processing, etc. Naive approaches yield either low throughput or large area. We need resource-efficient pipelining of the outer loop of an irregular loop nest, targeting one outer loop iteration per cycle.
9 ElasticFlow Concept
ElasticFlow is an architecture, with associated synthesis techniques, that effectively accelerates irregular loop nests. It transforms the irregular loop nest into a multi-stage dataflow pipeline and dynamically distributes different outer-loop instances of the dynamic-bound inner loop to one or more processing units, so inner loops execute in a pipelined fashion across different outer loop iterations.
10 ElasticFlow Architecture
Each dynamic-bound inner loop is mapped to an application-specific loop processing array (LPA). An LPA contains one or more loop processing units (LPUs); each LPU executes an inner loop instance to completion, which automatically handles inner-loop carried dependences.
[Figure: stage A feeds a Distributor, which sends <i, val> tuples to LPU 1..K in the LPA for stage B; a Collector gathers <i, val> results and forwards them to stage C.]
11 Distributor and Collector
Distributor: dynamically distributes inner loop instances to the LPUs.
Collector: collects results from the LPUs and acts as a reorder buffer (ROB) to ensure that results are committed to the next stage in order.
[Figure: stages A, B, C with the Distributor, LPA, and Collector between A and C.]
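The collector's in-order commit can be sketched as follows (our construction; `commit_in_order` is a made-up name): results arrive from the LPUs out of order, are buffered in the ROB, and drain strictly in outer-iteration order.

```python
def commit_in_order(arrivals, total):
    """arrivals: (outer_iteration, result) pairs in completion order.
    Buffers out-of-order results and returns them in iteration order."""
    rob = {}          # outer iteration -> buffered result
    head = 0          # next iteration to commit
    committed = []
    for i, result in arrivals:
        rob[i] = result
        while head in rob:              # drain every in-order-ready result
            committed.append(rob.pop(head))
            head += 1
    assert head == total, "some iterations never completed"
    return committed
```

For example, arrivals [(2, 'c'), (0, 'a'), (3, 'd'), (1, 'b')] commit as ['a', 'b', 'c', 'd'].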
12 ElasticFlow on Hash Lookup
Stage A (hashing), stage B (the chain walk, mapped to LPUs specialized for B), and stage C (output formatting):

    for (k : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);          // A
      p = hashtbl[hv].keys;
      while (p && p->key != k)       // B
        p = p->next;
      format_output(p);              // C
    }

ElasticFlow dynamically overlaps inner loops across outer loop iterations to achieve a throughput of one outer loop iteration per cycle.
13 Execution with Single LPU
With a single LPU for stage B, execution in stages A and C can overlap in time, but inner loop iterations execute serially on stage B. Stages A and C repeatedly stall waiting for B, so throughput is bottlenecked by the inner loop latency in stage B.
[Figure: schedule for i=4..7 showing stalls in stages A and C while each inner loop occupies the lone LPU.]
14 Execution with Multiple LPUs
With multiple LPUs for stage B, inner loops are dynamically scheduled across the LPUs, so several outer iterations (i=4..7) make progress in stage B simultaneously and the stalls seen with a single LPU disappear.
[Figure: side-by-side schedules for a single LPU vs. four LPUs (LPU 1..4) for stage B.]
15 Dynamic Scheduling
Static scheduling of inner loop instances to LPUs is inefficient in throughput and resource utilization: the latency variation across different inner loops causes many stalls and idle cycles. A dynamic scheduling policy mitigates the effect of this unbalanced workload.
[Figure: schedules for i=4..7 on LPU 1..4 under dynamic vs. static scheduling.]
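The stall/idle argument can be illustrated with a toy simulation (our construction, not the paper's model; function names are made-up, and issue bandwidth is ignored). Static scheduling binds outer iteration i to LPU i % K at compile time; dynamic scheduling sends each inner loop instance to whichever LPU frees up first.

```python
import heapq

def static_makespan(latencies, k):
    """Compile-time binding: iteration i always runs on LPU i % k."""
    finish = [0] * k
    for i, lat in enumerate(latencies):
        finish[i % k] += lat          # queued behind its LPU's earlier work
    return max(finish)

def dynamic_makespan(latencies, k):
    """Run-time binding: each instance goes to the earliest-free LPU."""
    free = [0] * k                    # min-heap of LPU free times
    heapq.heapify(free)
    for lat in latencies:
        t = heapq.heappop(free)       # earliest-free LPU
        heapq.heappush(free, t + lat)
    return max(free)
```

With an unbalanced workload such as latencies [9, 1, 1, 1, 9, 1, 1, 1] on K=2 LPUs, static binding finishes in 20 cycles while dynamic binding finishes in 12.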
16 Multiple Dynamic-Bound Inner Loops
Example: database join, with two dynamic-bound inner loops B and D.

    for (i = 0; i < num_keys; i++) {
      #pragma pipeline
      // A: look up hashtbl1
      // B: dynamic-bound loop
      while (p && p->key != k)
        p = p->next;
      // C: look up hashtbl2
      // D: dynamic-bound loop
      while (q && q->key != k)
        q = q->next;
      // E: merge results
    }

Architecture with dedicated LPAs: each LPA is dedicated to a particular inner loop.
[Figure: stages A, C, E connected through Distributor/Collector pairs to one LPA of slpus for loop B and another for loop D, passing <i, val> tuples.]
17 Issues with Dedicated LPAs
If loop B incurs a much longer average latency than loop D, the LPA for loop D sits idle much of the time, resulting in poor resource utilization.
[Figure: executing dbjoin on dedicated LPUs -- slpa B keeps its three slpus busy on i_B=0..5 while slpa D finishes i_D=0..5 quickly and then idles.]
18 LPA Sharing
An LPA can be shared among multiple inner loops.
slpu: single-loop processing unit, dedicated to one loop.
mlpu: multi-loop processing unit, shared among multiple loops.
slpa: single-loop processing array, consisting of multiple slpus for a particular loop.
mlpa: multi-loop processing array, consisting of multiple mlpus, each shared among loops.
[Figure: architecture with shared LPUs -- stages A and C feed a Distributor with <s, i, val> tuples (s identifies the source loop); shared B/D mlpus execute either loop; a Collector forwards <s, i, val> results toward stage E. Compare mlpa_{B,D} vs. separate slpas.]
19 Execution with Shared LPUs
The mlpa improves resource utilization and performance by reducing pipeline stalls under unbalanced workloads. In the dbjoin example, the dedicated slpas leave slpa D idle while slpa B is the bottleneck; the shared mlpa_{B,D} keeps all units busy by running B and D instances on the same pool, and it even requires fewer LPUs (four mlpus vs. three slpus per loop).
[Figure: schedules of dbjoin on dedicated LPAs vs. the shared mlpa.]
20 ElasticFlow Synthesis
Maps an irregular loop nest to the ElasticFlow architecture:
1. Partition the loop nest into multiple stages.
2. Identify inner loop candidates to form the LPAs.
3. Synthesize these inner loops into slpus and mlpus.
Goal: optimize the LPU allocation to meet the expected throughput. Two questions per LPA: (1) how many LPUs? (2) shared or not shared?
[Figure: the dbjoin loop nest (stages A-E) mapped onto a shared mlpa_{B,D} behind a Distributor and Collector.]
21 slpu Allocation
Definitions:
TP: expected number of outer loop iterations per cycle.
II_i: achievable initiation interval (II) of inner loop i.
L_i: latency in cycles of a single iteration of loop i.
B_i: common-case bound of inner loop i (from profiling).

The common-case latency of one instance of inner loop i is II_i * (B_i - 1) + L_i cycles. To achieve the expected throughput, the slpa must hide that latency, which requires

    U_i = ceil( (II_i * (B_i - 1) + L_i) * TP )

slpus -- the number of outer loop iterations that must be simultaneously in flight.
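The allocation rule in code (the ceiling is our reading of the slide's bracket notation, since LPU counts are integers; the function name is made-up):

```python
from math import ceil

def slpus_needed(ii, lat, bound, tp):
    """U_i = ceil((II_i * (B_i - 1) + L_i) * TP): the common-case latency
    of one inner loop instance, times the target outer-loop throughput,
    equals the number of instances that must be in flight at once."""
    common_case_latency = ii * (bound - 1) + lat
    return ceil(common_case_latency * tp)
```

For instance, an inner loop with II=1, per-iteration latency 3, and common-case bound 2 takes 1*(2-1)+3 = 4 cycles per instance; sustaining one outer iteration per cycle (TP=1) therefore needs 4 slpus, while TP=0.5 needs 2.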
22 mlpu Allocation
Replace dedicated slpus with shared mlpus to improve performance and resource utilization. How many slpus should be replaced with mlpus? There is an inherent trade-off between performance and area: mlpus improve performance by allowing adaptive assignment of resources to different types of loops depending on the workload, but an mlpu typically consumes more area than an slpu.
23 LPU Allocation
The trade-off is optimized as an integer linear program.
Given: the resource usage of each type of LPU, and the area of the slpa architecture.
Objective: minimize the total area of the LPAs while meeting the performance target (sharing + number of LPUs).
Constraints: prevent over-allocation of LPUs; map each loop to a single type of LPA; map loops only to compatible LPAs.
24 ROB Buffer Sizing
The reorder buffer (ROB) must hold all results from the LPUs that are not yet ready to be committed. When the ROB is full, the distributor stalls: the LPUs cannot accept new outer loop iterations and become underutilized. In the example, the results for i=5..7 must be buffered because they finish before i=4.
Problem: how do we statically, yet suitably, size the ROB during synthesis?
[Figure: LPA with Distributor, LPU 1..K, and the Collector acting as the ROB; schedule in which i=5..7 complete on their LPUs while i=4 is still running.]
25 ROB Buffer Sizing
We estimate the ROB size based on profiling statistics of the inner loop latency: the maximum latency L_max, the minimum latency L_min, the average latency L_avg, and the latency standard deviation sigma. Our estimates for K LPUs, derived from an empirical formulation of these statistics, achieve good performance.
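The quantity being estimated can also be measured directly on a profiled trace (a sketch of ours, not the paper's formula, which combines L_max, L_min, L_avg, and sigma): a result occupies the ROB from the cycle it finishes until every earlier outer iteration has also finished.

```python
def peak_rob_occupancy(finish_times):
    """finish_times[i]: cycle at which outer iteration i's result is ready.
    Returns the peak number of results buffered awaiting in-order commit."""
    # Iteration i commits once all iterations 0..i have finished.
    commit, cmax = [], 0
    for f in finish_times:
        cmax = max(cmax, f)
        commit.append(cmax)
    # Sweep the [finish, commit) residency intervals for peak overlap.
    events = []
    for f, c in zip(finish_times, commit):
        if c > f:                       # buffered only if it finished early
            events.append((f, 1))
            events.append((c, -1))
    peak = cur = 0
    for _, delta in sorted(events):
        cur += delta
        peak = max(peak, cur)
    return peak
```

If iteration 0 is a long chain finishing at cycle 10 while iterations 1-3 finish at cycles 3-5, three results must wait in the ROB; a trace that completes in order needs no buffering at all.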
26 Deadlock Avoidance
Both the slpa and the mlpa are deadlock-free: the number of in-flight outer loop iterations is limited to be no greater than the number of available ROB entries. The entire dataflow architecture cannot deadlock if it forms a directed acyclic graph; additional care is needed if there is a data dependence between shared inner loops.
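The in-flight limit can be sketched as a credit scheme (our naming and code; the slide states the policy, not this implementation): the distributor reserves a ROB entry when it issues an outer iteration and gets the credit back at commit, so every result always has a slot to land in and a full collector can never wedge the pipeline.

```python
class CreditDistributor:
    """Admits a new outer iteration only if a ROB entry can be reserved."""

    def __init__(self, rob_entries):
        self.credits = rob_entries      # one credit per ROB slot

    def try_issue(self):
        if self.credits == 0:
            return False                # stall instead of overflowing the ROB
        self.credits -= 1               # reserve the slot at issue time
        return True

    def on_commit(self):
        self.credits += 1               # slot freed when its result commits
```

With two ROB entries, a third issue attempt stalls until one of the first two results commits.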
27 Experimental Setup
ElasticFlow's flow leverages a commercial HLS tool that uses the LLVM compiler as its front end. We compared ElasticFlow against the pipelining techniques employed in a state-of-the-art commercial HLS tool, targeting a Xilinx Virtex-7 FPGA with a 5 ns target clock period. Benchmark applications span graph processing, databases, scientific computing, and image processing.
28 Performance for Different Numbers of LPUs
Across the benchmark applications, performance improves close to proportionally as the number of LPUs increases from 1 to 2, 4, and 8.
[Figure: normalized speedup per benchmark for 1, 2, 4, and 8 LPUs.]
29 ElasticFlow vs. Aggressive Unrolling
ElasticFlow achieves comparable performance with significantly less resource usage. Moreover, unrolling is inapplicable when the worst-case loop bound cannot be statically determined.
[Table: latency, LUTs, and registers for dbjoin and spmv under Unroll vs. ElasticFlow -- comparable latency, with 15x and 45x reductions in resource usage.]
30 Effectiveness of LPU Sharing
Using an mlpa further improves performance by 21%-34% with similar area.
[Table: for cfd-a, cfd-b, dbjoin-a, and dbjoin-b, the number of slpus vs. mlpus, the latency reduction, and the slice overhead -- significant latency reduction at a small area overhead.]
31 Take-Away Points
Existing HLS tools rely on static pipelining techniques that extract parallelism only at compile time; they are not competitive for irregular programs with dynamic parallelism. Adaptive pipelining techniques are needed to dynamically extract parallelism at run time and efficiently handle statically unanalyzable program patterns. This work addresses the pipelining of irregular loop nests containing dynamic-bound inner loops, with a novel dataflow pipeline architecture and synthesis techniques that yield substantial performance improvements.
32 ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
Mingxing Tan (1,2), Gai Liu (1), Ritchie Zhao (1), Steve Dai (1), Zhiru Zhang (1)
(1) Computer Systems Laboratory, Electrical and Computer Engineering, Cornell University
(2) Google Inc.
33 Backup Slides
34 Coarse-Grained Pipelined Accelerators (CGPA)
Liu, Johnson, and August, DAC '14. CGPA generates coarse-grained pipelines for a loop nest by partitioning it into parallel and non-parallel sections; it employs replicated data-level parallelism to create multiple identical copies of the parallel section and applies decoupled pipeline parallelism to separate the parallel and sequential sections with a set of FIFOs. ElasticFlow achieves additional performance and resource efficiency: it enables out-of-order execution and dynamic scheduling, optimizes the allocation and sharing of LPUs with the mlpa architecture, and studies sizing for both the ROB and the delay line together with a runtime policy to prevent deadlock.
35 Comparison with CGPA
36 Widx
Kocberber, Grot, Picorel, Falsafi, Lim, and Ranganathan, MICRO '13. Widx is a reconfigurable accelerator for hash indexing in database systems that uses a decoupled pipeline architecture similar to ElasticFlow's: a hashing unit distributes work to a parallel array of walker units, whose results are combined in an output unit. ElasticFlow, by contrast, addresses the more general problem of pipelining irregular loop nests.
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationLab 1: CORDIC Design Due Friday, September 8, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 1: CORDIC Design Due Friday, September 8, 2017, 11:59pm 1 Introduction COordinate
More informationEarly Performance-Cost Estimation of Application-Specific Data Path Pipelining
Early Performance-Cost Estimation of Application-Specific Data Path Pipelining Jelena Trajkovic Computer Science Department École Polytechnique de Montréal, Canada Email: jelena.trajkovic@polymtl.ca Daniel
More informationOpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch
OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 OpenMP FPGA Device
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationDesign of Parallel Algorithms. Models of Parallel Computation
+ Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes
More informationPatterns of Parallel Programming with.net 4. Ade Miller Microsoft patterns & practices
Patterns of Parallel Programming with.net 4 Ade Miller (adem@microsoft.com) Microsoft patterns & practices Introduction Why you should care? Where to start? Patterns walkthrough Conclusions (and a quiz)
More informationMemory Consistency. Challenges. Program order Memory access order
Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationA Novel Design Framework for the Design of Reconfigurable Systems based on NoCs
Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction
More informationDeveloping Dynamic Profiling and Debugging Support in OpenCL for FPGAs
Developing Dynamic Profiling and Debugging Support in OpenCL for FPGAs ABSTRACT Anshuman Verma Virginia Tech, Blacksburg, VA anshuman@vt.edu Skip Booth, Robbie King, James Coole, Andy Keep, John Marshall
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationQuery Processing. Introduction to Databases CompSci 316 Fall 2017
Query Processing Introduction to Databases CompSci 316 Fall 2017 2 Announcements (Tue., Nov. 14) Homework #3 sample solution posted in Sakai Homework #4 assigned today; due on 12/05 Project milestone #2
More informationCore Fusion: Accommodating Software Diversity in Chip Multiprocessors
Core Fusion: Accommodating Software Diversity in Chip Multiprocessors Authors: Engin Ipek, Meyrem Kırman, Nevin Kırman, and Jose F. Martinez Navreet Virk Dept of Computer & Information Sciences University
More informationSparse Matrix-Vector Multiplication FPGA Implementation
UNIVERSITY OF CALIFORNIA, LOS ANGELES Sparse Matrix-Vector Multiplication FPGA Implementation (SID: 704-272-121) 02/27/2015 Table of Contents 1 Introduction... 3 2 Sparse Matrix-Vector Multiplication...
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More informationUltra-Fast NoC Emulation on a Single FPGA
The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationA Virtualized Quality of Service Packet Scheduler Accelerator. Kangtao Kendall Chuang
A Virtualized Quality of Service Packet Scheduler Accelerator A Thesis Presented to The Academic Faculty by Kangtao Kendall Chuang In Partial Fulfillment of the Requirements for the Degree Master of Science
More informationMT-SDF: Scheduled Dataflow Architecture with mini-threads
2013 Data-Flow Execution Models for Extreme Scale Computing MT-SDF: Scheduled Dataflow Architecture with mini-threads Domenico Pace University of Pisa Pisa, Italy col.pace@hotmail.it Krishna Kavi University
More informationEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management
More informationFCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA
1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
More information