LogCA: A High-Level Performance Model for Hardware Accelerators


1 "Everything should be made as simple as possible, but not simpler." (Albert Einstein) LogCA: A High-Level Performance Model for Hardware Accelerators. Muhammad Shoaib Bin Altaf* and David A. Wood, University of Wisconsin-Madison. *Now at AMD Research, Austin, TX

2 Executive Summary. Accelerators do not always perform as expected. It is crucial for programmers and architects to understand the factors that affect performance; simple analytical models are beneficial early in the design stage. Our proposal: LogCA, a high-level performance model that helps identify design bottlenecks and possible optimizations. Validated across a variety of on-chip and off-chip accelerators; two retrospective case studies demonstrate the usefulness of the model.

3 Outline Motivation LogCA Results Conclusion 3

4 Why Do We Need a Model? "An accelerator is a separate architectural substructure ... that is architected using a different set of objectives than the base processor ... the accelerator is tuned to provide HIGHER PERFORMANCE ... than with the general-purpose base hardware." (S. Patel and W. Hwu, Accelerator Architectures, IEEE Micro.) Slide examples: M7: Next Generation SPARC (Hot Chips); POWER8 (Hot Chips).

5 Why a Model? Encryption algorithm on the UltraSPARC T2. [Figure: execution time (ms) vs. block size (bytes) for the host and the crypto accelerator; the host outperforms the accelerator at small block sizes, the accelerator outperforms the host at large block sizes, and the curves cross at a break-even point.] Amdahl's Law for Accelerators.

6 Why a Model? Advanced Encryption Standard (AES). [Figure: speedup vs. offloaded data (bytes) for the UltraSPARC T2, the SPARC T4, and a GPU, each crossing the speedup-of-1 line at a different break-even point.] Running the same kernel, accelerators can have different break-even points.

7 Outline Motivation LogCA Results Conclusion 7

8 The Performance Model. Inspired by LogP [CACM 1996], LogCA abstracts an accelerator using five parameters: L (Latency): cycles to move data; o (Overhead): setup cost; g (Granularity): size of the off-loaded data; C (Computational index): amount of work done per byte of data; A (Acceleration): speedup ignoring overheads. A sixth parameter, β, generalizes the model to kernels with non-linear complexity. [Figure: host connected to the accelerator through an interface.]

9 The Performance Model. Execution time without an accelerator: T_0(g) = C_0(g). Execution time with one accelerator: T_1(g) = o_1(g) + L_1(g) + C_1(g), where C_1(g) = C_0(g) / A. [Figure: timeline comparing T_0(g) on the host with o_1(g) + L_1(g) + C_1(g) on the accelerator; the difference between the two is the gain.]
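To make slide 9 concrete, the following Python sketch encodes these equations under the common LogCA assumptions of C_0(g) = C * g^β and, for now, granularity-independent overhead and latency; the class name and the numeric parameter values are illustrative, not taken from the paper's measurements.

from dataclasses import dataclass

@dataclass
class LogCA:
    L: float           # latency: cycles to move data across the interface
    o: float           # overhead: setup cost of an off-load
    C: float           # computational index: host cycles of work per byte
    A: float           # acceleration: speedup ignoring overheads
    beta: float = 1.0  # complexity exponent for non-linear kernels

    def T0(self, g):
        # Host-only execution time for granularity g (bytes): C_0(g) = C * g^beta
        return self.C * g ** self.beta

    def T1(self, g):
        # Accelerated execution time: o_1(g) + L_1(g) + C_1(g), with C_1(g) = C_0(g) / A
        return self.o + self.L + self.T0(g) / self.A

    def speedup(self, g):
        return self.T0(g) / self.T1(g)

# Illustrative parameters (hypothetical, for demonstration only)
m = LogCA(L=500, o=1500, C=10, A=20)
for g in (16, 256, 4096, 65536):
    print(g, round(m.speedup(g), 2))  # speedup rises with granularity and approaches A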

10 Granularity-Independent Latency. Speedup(g) = C_0(g) / (o_1(g) + L_1(g) + C_0(g)/A), which captures the effect of granularity on speedup. Speedup is bounded by acceleration: lim_{g→∞} Speedup(g) = A. Overheads dominate at smaller granularities. [Figure: Speedup(g) vs. granularity (bytes), rising toward the asymptote A.] Amdahl's Law for Accelerators.

11 Performance Metrics. What is the right amount of off-loaded data? Metrics inspired by the vector-machine metric N_{1/2}. g_1: the granularity for a speedup of 1; g_1 is essentially independent of acceleration and identifies the complexity of the interface. g_{A/2}: the granularity for a speedup of A/2; increasing A also increases g_{A/2}. [Figure: speedup vs. granularity (bytes); a small g_1 indicates a simple interface, a large g_1 a complex interface.]
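For the linear (β = 1), granularity-independent-latency case of slide 10, setting Speedup(g) equal to 1 and to A/2 gives closed forms for these two metrics. The sketch below is a derivation check with illustrative parameter values, not the paper's measured numbers.

def g1(L, o, C, A):
    # Break-even granularity (speedup = 1), with beta = 1 and fixed o and L:
    # C*g = o + L + C*g/A  =>  g_1 = A*(o + L) / (C*(A - 1)) ~= (o + L)/C for large A
    return A * (o + L) / (C * (A - 1))

def g_A_half(L, o, C, A):
    # Granularity for speedup = A/2:
    # C*g / (o + L + C*g/A) = A/2  =>  g_{A/2} = A*(o + L) / C
    return A * (o + L) / C

L, o, C = 500, 1500, 10  # illustrative values
for A in (10, 100, 1000):
    print(A, round(g1(L, o, C, A), 1), round(g_A_half(L, o, C, A), 1))
# g_1 stays near (o + L)/C = 200 regardless of A, while g_{A/2} grows linearly with A.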

12 Granularity-Dependent Latency. Speedup is bounded by the computational intensity C/L: lim_{g→∞} Speedup(g) < C/L (linear algorithms). For sub-linear algorithms, speedup asymptotically decreases as granularity increases. [Figure: speedup vs. granularity (bytes) for linearly and sub-linearly scaling kernels.]
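A small sketch of this case, assuming the latency scales with the off-loaded data as L_1(g) = L * g (an assumed form for illustration) and reusing the notation above; the parameter values are again hypothetical.

def speedup_dep(g, L, o, C, A, beta=1.0):
    # Granularity-dependent latency: L_1(g) = L * g (assumed), C_0(g) = C * g^beta
    T0 = C * g ** beta
    return T0 / (o + L * g + T0 / A)

L, o, C, A = 1, 1500, 10, 50  # illustrative values; computational intensity C/L = 10
for g in (1e3, 1e5, 1e7):
    lin = speedup_dep(g, L, o, C, A)             # approaches C/(L + C/A) ~ 8.3, below C/L
    sub = speedup_dep(g, L, o, C, A, beta=0.5)   # sub-linear kernel: speedup decays with g
    print(int(g), round(lin, 2), round(sub, 3))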

13 Granularity-Dependent Latency. The computational intensity C/L must be greater than 1 to achieve any speedup, and should be greater than the acceleration A (the peak speedup) to achieve a speedup of A/2. [Figure: speedup vs. granularity (bytes) for C/L > A and 1 < C/L < A, marking g_1 and g_{A/2}.] Performance metrics help programmers early in the design cycle.
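One way to see both conditions, under the same assumptions as the sketch above (β = 1, L_1(g) = L * g), is from the asymptotic speedup:

\lim_{g \to \infty} \mathrm{Speedup}(g) = \frac{C}{L + C/A}, \qquad
\frac{C}{L + C/A} \ge 1 \iff \frac{C}{L} \ge \frac{A}{A-1} > 1, \qquad
\frac{C}{L + C/A} \ge \frac{A}{2} \iff \frac{C}{L} \ge A.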

14 Bottleneck Analysis using LogCA. Example: a 10x change in a parameter => a 20% performance gain. The analysis helps focus on the performance bottlenecks. [Figure: speedup vs. granularity (bytes) for the baseline LogCA curve and for variants L_0.1x, o_0.1x, C_10x, and A_10x.]
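The curves on slide 14 can be reproduced in spirit with a simple sensitivity sweep: scale each parameter by 10x (down for the costs o and L, up for C and A) and compare speedups at a fixed granularity. A minimal sketch, using the granularity-independent model from slide 10 and illustrative parameter values:

def speedup(g, L, o, C, A, beta=1.0):
    # Granularity-independent latency model from slide 10
    T0 = C * g ** beta
    return T0 / (o + L + T0 / A)

base = dict(L=500, o=1500, C=10, A=20)  # hypothetical baseline parameters
g = 1024
ref = speedup(g, **base)
variants = {
    "L_0.1x": {"L": base["L"] * 0.1},
    "o_0.1x": {"o": base["o"] * 0.1},
    "C_10x":  {"C": base["C"] * 10},
    "A_10x":  {"A": base["A"] * 10},
}
for name, change in variants.items():
    gain = speedup(g, **{**base, **change}) / ref
    print(f"{name}: {gain:.2f}x over baseline")
# The parameter whose change moves speedup the most is the bottleneck at this granularity.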

15 Outline Motivation LogCA Results Conclusion 15

16 Experimental Methodology. Fixed-function and general-purpose accelerators: cryptographic accelerators on SPARC architectures; discrete and integrated GPUs. Kernels with varying complexities: encryption, hashing, matrix multiplication, FFT, search, radix sort. Retrospective case studies: the cryptographic interface in SPARC architectures and the memory interface in GPUs.

17 Case Study I: Cryptographic Interface in the SPARC Architecture. [Figure: evolution of the interface across generations: PCIe crypto accelerator, UltraSPARC T2, SPARC T3, SPARC T4 (engine), SPARC T4 (instructions).]

18 Conclusion. Simple models are effective in predicting the performance of accelerators. We proposed a high-level performance model for hardware accelerators; such models help programmers and architects visually identify bottlenecks and suggest optimizations. The performance metrics help programmers decide the right amount of off-loaded data. Limitations include the inability to model resource contention, caches, and irregular memory access patterns.

19 Questions?
