Scaling Datacenter Accelerators With Compute-Reuse Architectures

1 Scaling Datacenter Accelerators With Compute-Reuse Architectures. Adi Fuchs and David Wentzlaff. ISCA 2018, Session 5A. June 5, 2018, Los Angeles, CA.

2 Sources: "Cramming more components onto integrated circuits", G. E. Moore, Computer, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", DataCenter Knowledge, 2016.

7 Transistor scaling stops. Chip specialization runs out of steam. What's Next? Sources: "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018; "Cloud TPU", Google; "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017; "Microsoft unveils Project Brainwave for real-time AI", Doug Burger; "NVIDIA TESLA V100", NVIDIA.

8 Observation I: The Density of Emerging Memories Is Projected to Increase (ITRS Logic Roadmap)

10 Observation II: Datacenter Accelerators Perform Redundant Computations. Temporal locality introduces redundancy in video encoders (recurrent blocks in white): t=0 sec, 0% recurrence; t=2 sec, 38% recurrence; t=4 sec, 61% recurrence. Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR

13 Observation II: Datacenter Accelerators Perform Redundant Computations. Search term commonality retrieves similar content. Example queries: "intercontinental downtown los angeles"; "hotel in downtown los angeles near intercontinental". Source: Google

15 Observation II: Datacenter Accelerators Perform Redundant Computations. Power laws suggest highly recurrent processing of popular content. Source: Twitter

18 COREx: Compute-Reuse Architecture For Accelerators
Memoization: Tables store past computation outputs. Reuse outputs of recurring inputs instead of recomputing.
[Diagram: Host Processors; Shared LLC / NoC; Acceleration Fabric with Accelerator Core, Input Lookup, DMA Engine, Scratchpad Memory, and Compute-Reuse Storage. Labeled paths: input, lookup, hit, fetched result, core result, output.]
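The memoization idea on this slide can be illustrated in software. The following is a minimal sketch, not the COREx hardware: `make_memoized` and `toy_kernel` are hypothetical names, a Python dictionary stands in for the compute-reuse storage, and the kernel is a placeholder for an accelerator computation.

```python
from typing import Callable, Dict, Tuple


def make_memoized(
    kernel: Callable[[bytes], bytes],
) -> Tuple[Callable[[bytes], bytes], Dict[str, int]]:
    """Wrap a kernel with a table of past outputs (software analogue only)."""
    table: Dict[bytes, bytes] = {}  # stands in for the compute-reuse storage
    stats = {"hits": 0, "misses": 0}

    def run(block: bytes) -> bytes:
        if block in table:  # recurring input: reuse the stored output
            stats["hits"] += 1
            return table[block]
        stats["misses"] += 1
        result = kernel(block)  # new input: compute on the "core"
        table[block] = result  # record the result for later reuse
        return result

    return run, stats


def toy_kernel(block: bytes) -> bytes:
    """Placeholder computation: sum of the input bytes as a 4-byte value."""
    return sum(block).to_bytes(4, "little")


run, stats = make_memoized(toy_kernel)
outputs = [run(b) for b in (b"abc", b"xyz", b"abc")]
```

The third call repeats the first input, so it is served from the table and the kernel runs only twice.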

22 Architectural Guidelines
Accelerator Memoization is Natural
o Little or no additional programming effort
o Built-in input-compute-output flow
But Not Straightforward!
o High lookup costs
o Unnecessary accesses
o High access costs
COREx Key Ideas:
o Hashing (reduce lookup costs)
o Lookup filtering (fewer accesses)
o Banking (reduce access costs)
Goal: Extend Specialization with Workload-Specific Memoization
[Diagram: Accelerator Core; DMA Engine; Scratchpad; Specialized Compute Lanes; General-Purpose CMP; Shared LLC. Flow labels: Input, Compute, Output.]

27-28 Top Level Architecture
New Modules:
o Input Hashing Unit (IHU)
o Input Lookup Unit (ILU)
o Computation History Table (CHT)
[Diagram legend: Mem. Chip; Func. Block; Control; Datapath. Components: COREx Interconnect; IHU; ILU (Associative Cache, Cache Ctrl.); CHT (RAM-Array Table, RAM-Array Ctrl.); Accelerator Core; DMA Engine; Scratchpad; Specialized Compute Lanes; General-Purpose CMP; Shared LLC; SoC Interconnect. Highlighted steps: Hashes, Fetch, Match Input, Use Output.]
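The hash, lookup, fetch, match, and use-output steps named on these slides can be walked through with a toy software model. This is a sketch under stated assumptions, not the hardware design: `ReuseTables` and `run_with_reuse` are hypothetical names, an 8-byte BLAKE2b digest stands in for the IHU hash, the ILU is modeled as a small LRU set of hashes, and for simplicity the CHT here holds exactly the entries the ILU tracks.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Dict, Optional, Tuple


class ReuseTables:
    """Toy model of the hash -> lookup -> fetch -> match -> use-output flow."""

    def __init__(self, ilu_entries: int) -> None:
        self.ilu_entries = ilu_entries
        self.ilu: "OrderedDict[bytes, None]" = OrderedDict()  # recent input hashes
        self.cht: Dict[bytes, Tuple[bytes, bytes]] = {}  # hash -> (input, output)
        self.cht_fetches = 0

    @staticmethod
    def ihu_hash(data: bytes) -> bytes:
        # Assumption: a short software digest stands in for the hardware hash.
        return hashlib.blake2b(data, digest_size=8).digest()

    def lookup(self, data: bytes) -> Optional[bytes]:
        key = self.ihu_hash(data)
        if key not in self.ilu:  # lookup filtering: skip the CHT access
            return None
        self.ilu.move_to_end(key)  # mark the hash as recently used
        self.cht_fetches += 1  # Fetch
        stored_input, stored_output = self.cht[key]
        if stored_input != data:  # Match Input: guards against hash collisions
            return None
        return stored_output  # Use Output

    def record(self, data: bytes, output: bytes) -> None:
        key = self.ihu_hash(data)
        self.ilu[key] = None
        self.ilu.move_to_end(key)
        self.cht[key] = (data, output)
        if len(self.ilu) > self.ilu_entries:
            old_key, _ = self.ilu.popitem(last=False)  # evict least recently used
            self.cht.pop(old_key, None)  # keep both tables consistent in this toy


def run_with_reuse(
    tables: ReuseTables, kernel: Callable[[bytes], bytes], data: bytes
) -> bytes:
    """Return a stored output on a hit, otherwise compute and record it."""
    output = tables.lookup(data)
    if output is None:
        output = kernel(data)
        tables.record(data, output)
    return output
```

In this model, a miss in the small hash set never touches the larger table, which is the role the slides assign to lookup filtering.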

31 Building COREx
Case Study: Acceleration of Video Motion Estimation
Optimization Goals:
o Runtime, Energy, and Energy-Delay Product (EDP)
Baseline: highly-tuned accelerators
o Sweep space for design alternatives (Aladdin)
o Find optimal accelerator design for each goal
Baseline optima: Runtime OPT: 5.8 µs. EDP OPT: 148.7 pJ·s. Energy OPT: 6.2 µJ.
IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.
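The three optimization goals can pick different designs from the same sweep. The sketch below shows the arithmetic only: the `Design` type and the three design points are hypothetical and are not taken from the talk. It relies on the unit identity 1 µJ × 1 µs = 1 pJ·s.

```python
from typing import Dict, NamedTuple


class Design(NamedTuple):
    runtime_us: float  # execution time in microseconds
    energy_uj: float  # energy in microjoules


def edp_pjs(d: Design) -> float:
    """Energy-delay product in pJ*s (1 uJ * 1 us = 1e-12 J*s = 1 pJ*s)."""
    return d.energy_uj * d.runtime_us


# Hypothetical design points for illustration, NOT results from the talk.
designs: Dict[str, Design] = {
    "wide": Design(runtime_us=4.0, energy_uj=50.0),
    "balanced": Design(runtime_us=8.0, energy_uj=20.0),
    "narrow": Design(runtime_us=30.0, energy_uj=10.0),
}

# Each goal is a separate minimization over the same set of designs.
best_runtime = min(designs, key=lambda n: designs[n].runtime_us)
best_energy = min(designs, key=lambda n: designs[n].energy_uj)
best_edp = min(designs, key=lambda n: edp_pjs(designs[n]))
```

With these made-up points, the fastest design, the lowest-energy design, and the lowest-EDP design are three different configurations, which is why the slide reports a separate optimum per goal.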

33 Building COREx
Memoization-Layers Specialization
o Extract input traces, examine hit and miss rates of different ILU/CHT sizes.
o Integrate accelerators with emerging-memory-based ILU+CHT, and sweep the space of gains.
Example: Resistive RAM based COREx
o Runtime Optimization: 2.7x Speedup. 512KB ILU, 32GB CHT
o EDP Optimization: 63.5% EDP Saved. 512KB ILU, 2GB CHT
o Energy Optimization: 56.6% Energy Saved. 64KB ILU, 8MB CHT
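Examining hit and miss rates over an input trace for several table sizes can be sketched as follows. This is an illustration only: the slides do not state a replacement policy, so LRU is an assumption here, capacities are counted in entries rather than bytes, and `lru_hit_rate`, `sweep`, and the short trace are hypothetical.

```python
from collections import OrderedDict
from typing import Dict, Hashable, Iterable, List


def lru_hit_rate(trace: Iterable[Hashable], capacity: int) -> float:
    """Fraction of trace entries found in an LRU table of the given capacity."""
    table: "OrderedDict[Hashable, None]" = OrderedDict()
    hits = 0
    total = 0
    for item in trace:
        total += 1
        if item in table:
            hits += 1
            table.move_to_end(item)  # refresh recency on a hit
        else:
            table[item] = None
            if len(table) > capacity:
                table.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


def sweep(trace: List[Hashable], capacities: List[int]) -> Dict[int, float]:
    """Hit rate for each candidate table capacity."""
    return {c: lru_hit_rate(trace, c) for c in capacities}


trace = ["a", "b", "a", "c", "b", "a", "d", "a"]
rates = sweep(trace, [1, 2, 4])
```

On this toy trace the hit rate grows from 0.0 to 0.25 to 0.5 as the capacity grows from 1 to 2 to 4 entries.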

38 Experimental Setup
Workloads:
Kernel | Domain | Use-Case | App Source | Input Source and Description
DCT | Video Encoding | Video Server | x264 | YouTube Faces. 10 Videos, 10 Seconds, 24 FPS.
SAD | Video Encoding | Video Server | PARBOIL | YouTube Faces. 10 Videos, 10 Seconds, 24 FPS.
SNAPPY ("SNP") | Compression | Web-Server Traffic Compression | TailBench Snappy-C | Wikipedia Abstracts. 13 Million Search Queries.
SSSP ("SSP") | Graph Processing | Maps Service: Shortest Walking Route | Internal | DIMACS NYC Streets, 10 Million Zipfian Transactions.
BFS | Graph Processing | Online Retail | MachSuite | Amazon Co-Purchasing, 10 Million Zipfian Transactions.
RBM | Machine Learning | Collaborative Filtering | CortexSuite | Netflix Prize: 10 Million Zipfian Transactions.
Redundancy sources: Temporal Redundancy; Search Commonality; Content Popularity (75%, 90%, 95% Recurrence)
Methodology:
o Evaluate ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack (Destiny)
o Integrate with highly-tuned accelerators (Aladdin)
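The Zipfian transaction traces and recurrence percentages in this setup can be illustrated with a small generator. This is a hedged sketch: the slides do not give the generator or the exact definition of recurrence, so `zipfian_trace`, its parameters, and the definition used in `recurrence_rate` (the fraction of transactions whose input appeared earlier in the trace) are assumptions.

```python
import random
from typing import Hashable, List, Sequence


def zipfian_trace(num_items: int, length: int, exponent: float, seed: int) -> List[int]:
    """Draw item ids with probability proportional to 1 / rank**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_items + 1)]
    return rng.choices(range(num_items), weights=weights, k=length)


def recurrence_rate(trace: Sequence[Hashable]) -> float:
    """Fraction of entries that already appeared earlier in the trace."""
    seen = set()
    repeats = 0
    for item in trace:
        if item in seen:
            repeats += 1
        else:
            seen.add(item)
    return repeats / len(trace) if trace else 0.0


example = zipfian_trace(num_items=1000, length=5000, exponent=1.0, seed=0)
example_rate = recurrence_rate(example)
```

Because the trace is five times longer than the item set, at least 80% of its entries must be repeats regardless of the seed. A skewed distribution pushes the rate higher by concentrating transactions on popular items.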

41 Results
Runtime-OPT: Avg x Speedup
o Negligible differences between memories
EDP-OPT: Avg. 50%-68% Savings
o PCM/Racetrack: high write energy
o Smaller gains for low-bias apps (frequent updates)
Energy-OPT: Avg. 22%-50% Savings
o PCM not beneficial for 75% bias SSSP/RBM
General Trends:
o Large CHTs (MBs-TBs) for speedup, smaller (KBs-GBs) for EDP, smallest (KBs-MBs) for energy

43 Conclusions
Memoization is Fit for Accelerators
o Memoization-Ready Programming Environment+Interface
Memoization is Fit for Datacenters
o Temporal Redundancy, Search Commonality, Content Popularity

45 Conclusions
COREx Extends Hardware Specialization
o Memoization-layer specialization tailored for the workload
COREx Opens New Opportunities for Future Architectures
o Shift compute from non-scaling CMOS to still-scaling memories

46 Scaling Datacenter Accelerators With Compute-Reuse Architectures. Adi Fuchs, David Wentzlaff
