Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling

Size: px

Start display at page:

Download "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling"

Alannah King
6 years ago
Views:

1 Cornell University Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling Tao Chen and G. Edward Suh Computer Systems Laboratory Cornell University

2 Accelerator-Rich Computing Systems Computing systems are becoming accelerator-rich General-purpose cores + a large number of accelerators Challenge: Design and verification complexity Non-recurring engineering (NRE) cost per accelerator Manual efforts are a major source of cost Create computation pipelines Manage data supply from memory High-Level Synthesis (HLS) This work: An automated framework for generating accelerators with efficient data supply 2

3 Inefficiencies in Accelerator Data Supply Scratchpad-based accelerators On-chip scratchpad memory (SPM) Manually designed logic to move data between SPM and main memory Pros: Good performance Cons: High design effort, accelerator-specific, not reusable Accelerator SPM Compute Logic Memory Bus Preload Logic 3

4 Inefficiencies in Accelerator Data Supply Scratchpad-based accelerators On-chip scratchpad memory (SPM) Manually designed logic to move data between SPM and main memory Pros: Good performance Cons: High design effort, accelerator-specific, not reusable Cache-based accelerators Accelerator Pros: Low design effort, cache can be reused Compute Logic Cache Memory Bus Cons: Uncertain memory latency impacts performance 3

5 Optimize Data Supply for Cache-Based Accelerators Approach: automated framework for generating accelerators with efficient data supply Accelerator Source Automated Framework Accelerator w/ Efficient Data Supply 4

6 Optimize Data Supply for Cache-Based Accelerators Approach: automated framework for generating accelerators with efficient data supply Accelerator Source Automated Framework Accelerator w/ Efficient Data Supply Techniques Prefetching Tagging memory accesses Accelerator Compute Logic Cache HW Prefetcher Memory Bus 4

7 Optimize Data Supply for Cache-Based Accelerators Approach: automated framework for generating accelerators with efficient data supply Accelerator Source Automated Framework Accelerator w/ Efficient Data Supply Techniques Accelerator Prefetching Tagging memory accesses Access/Execute Decoupling Program slicing + architecture template Access Logic Cache Execute Logic HW Prefetcher Memory Bus 4

8 Impact of Uncertain Memory Latency Example: Sparse Matrix Vector Multiplication (spmv) Pipeline generated with High-Level Synthesis (HLS) // inner loop of sparse matrix // vector multiplication for (j = begin; j < end; j++) { #pragma HLS pipeline Si = val[j] * vec[cols[j]]; sum = sum + Si; } HLS Time 5

9 Impact of Uncertain Memory Latency Example: Sparse Matrix Vector Multiplication (spmv) Pipeline generated with High-Level Synthesis (HLS) // inner loop of sparse matrix // vector multiplication for (j = begin; j < end; j++) { #pragma HLS pipeline Si = val[j] * vec[cols[j]]; sum = sum + Si; } Regular stride Irregular Regular stride Reduce cache misses for regular accesses Prefetch data into the cache Tolerate cache misses for irregular accesses Access/Execute Decoupling HLS 5 Time miss A cache miss stalls the entire accelerator pipeline

10 Hardware Prefetching Predict future memory accesses PC is often used as a hint Stream localization Spatial correlation prediction for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; } Global Addr Stream C C B8 6

11 Hardware Prefetching Predict future memory accesses PC is often used as a hint Stream localization Spatial correlation prediction for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; } Problem: accelerators lack a PC Solution: generate PC-like tags for accelerator memory accesses Global Addr Stream C C B8 PC x C Local Addr Streams PC y 541C regular strides PC z B8 irregular no pred 6

12 Hardware Prefetching Predict future memory accesses PC is often used as a hint Stream localization Spatial correlation prediction for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; } Problem: accelerators lack a PC Solution: generate PC-like tags for accelerator memory accesses Global Addr Stream C C B8 Local Addr Streams 6 CDFG PC x C PC y 541C regular strides BB2 x BB1 y + BB3 PC z B8 irregular no pred z

13 Decoupled Access/Execute (DAE) Limitations of Hardware Prefetching Not accurate for complex patterns / Needs warm-up time Fundamental reason: lack of semantic information Decoupled Access/Execute Allow memory accesses to run ahead to preload data Memory Access Cache w/ DAE memory latency Value Comp Accelerator Access Logic Execute Logic Time Manual Preload SPM w/ Manual Preload memory latency Compute Cache HW Prefetche r Memory Bus 7

14 Traditional DAE is not Effective for Accelerators miss Traditional DAE: access part forwards data to execute part Problem: access pipeline stalls on misses Throughput is limited by access pipeline Original Access Decoupled Execute Goal: allow access pipeline to continue to flow under misses 8

15 DAE Accelerator with Decoupled Loads Anatomy of a load AGen Req AGen Req Resp Solution: Delegate request/response handling Resp Req Resp AGen 9

16 Memory Unit Proxy for handling memory requests and responses Supports response reordering and store-to-load forwarding Store Addr Load Addr Load Data to AccU Store Data from ExeU Store Addr Queue Dep Check Fwd Data Store Data Queue Fwd Data Queue Mem Unit Load Queue to ExeU memreq memresp 10

17 Memory Unit Proxy for handling memory requests and responses Supports response reordering and store-to-load forwarding Store Addr Load Addr Load Data to AccU Store Data from ExeU Store Addr Queue Dep Check Fwd Data Store Data Queue Fwd Data Queue Mem Unit Load Queue to ExeU memreq ST memresp 10

18 Automated DAE Accelerator Generation Program slicing for generating access/execute slices Architectural template with configurable parameters access.c HLS slicing accel.c slicing execute.c HLS access.v Architectural Template execute.v Parameters Queue sizes Port width MemUnit config etc HW Generation Access/Execute Decoupled Accel Written in PyMTL 11

19 Evaluation Methodology Vertically integrated modeling methodology System components: cycle-level (gem5) Accelerators: register-transfer-level (Vivado HLS, PyMTL) Area, power and energy: gate-level (commercial ASIC flow) Benchmark accelerators from MachSuite Name Description bbgemm Blocked matrix multiplication bfsbulk Breadth-First Search gemm Dense matrix multiplication mdknn Molecular dynamics (K-Nearest Neighbor) nw Needleman-Wunsch algorithm spmvcrs Sparse matrix vector multiplication stencil2d 2D stencil computation viterbi Viterbi algorithm 12

20 Performance Comparison 2.28x speedup on average Prefetching and DAE work in synergy 13

21 Energy Comparison 15% energy reduction on average because of reduced stalls MemUnits/queues only consume a small amount of energy 14

22 More Details in the Paper Deadlock Avoidance Customization of Memory Units Baseline Validation Power and Area Comparison Energy, Power and Area Breakdown Sensitivity Study on Varying Queue Sizes Design Space Exploration: Queue Size Customization 15

23 Summary Cache-based accelerators Avoid the high design cost of manual data movement logic Problem: Inefficient in handling uncertain memory latency Approach: Automated program analysis and architectural template to generate accelerators with efficient data supply Tagging memory requests to enable prefetching Decoupling to enable memory accesses to run ahead Results: High-performance cache-based accelerators with minimal manual efforts 16

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling Tao Chen and G. Edward Suh Cornell University Ithaca, NY 14850, USA {tc466, gs272}@cornell.edu Abstract This