Practical Near-Data Processing for In-Memory Analytics Frameworks

Size: px

Start display at page:

Download "Practical Near-Data Processing for In-Memory Analytics Frameworks"

Maximilian Johns
6 years ago
Views:

1 Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University PACT Oct 19, 2015

Motivating Trends End of Dennard scaling systems are energy limited Emerging big data workloads o Massive datasets, limited temporal locality, irregular access patterns o They perform poorly on

2 Motivating Trends End of Dennard scaling systems are energy limited Emerging big data workloads o Massive datasets, limited temporal locality, irregular access patterns o They perform poorly on conventional cache hierarchies Need alternatives to improve energy efficiency Deep Neural Networks MapReduce Graphs Figs: 2

Near-Data Processing o Enabled by 3D integration o Practical technology solution o

3 PIM & NDP Improve performance & energy by avoiding data movement Processing-In-Memory (1990 s 2000 s) o Same-die integration is too expensive Near-Data Processing o Enabled by 3D integration o Practical technology solution o Processing on the logic die Hybrid Memory Cube (HMC) High Bandwidth Memory (HBM) Figs: 3

4 Base NDP Hardware Memory Stack High-Speed Serial Link Host Processor Stacks linked to host multi-core processor o Code with temporal locality: runs on host o Code without temporal locality: runs on NDP Vault... NoC Bank DRAM Die Vault Logic Logic Die Channel 3D memory stack o x10 bandwidth, x3-5 power improvement o 8-16 vaults per stack Vertical channel Dedicated vault controller o NDP cores General-purpose, in-order cores FPU, L1 caches I/D, no L2 Multithreaded for latency tolerance 4

5 Challenges and Contributions NDP for large-scale highly distributed analytics frameworks? General coherence maintaining is expensive Scalable and adaptive software-assisted coherence? Inefficient communication and synchronization through host processor Pull-based model to directly communicate, remote atomic operations? Hardware/software interface A lightweight runtime to hide low-level details to make program easier? Processing capability and energy efficiency Balanced and efficient hardware A general, efficient, balanced, practical-to-use NDP architecture 5

6 Example App: PageRank Edge-centric, scatter-gather, graph processing framework Other analytics frameworks have similar behaviors Edge-centric SG edge_scatter(edge_t e) src sends update over e update_gather(update_t u) apply u to dst while not done for e in all edges edge_scatter(e) for u in all updates update_gather(u) PageRank u = src.rank / src.out_degree sum += u if all gathered Sequential dst.rank accesses = b * sum (stream + (1-b) in/out) Partitioned dataset, local processing Synchronization between iterations Communication between graph partitions 6

7 Architecture Design Memory model, communication, coherence, Lightweight hardware structures and software runtime

8 Shared Memory Model Unified physical address space across stacks o Direct access from any NDP/host core to memory in any vault/stack In PageRank o One thread to access data in a remote graph partition For edges across two partitions Local Vault Memory Implementation o Memory ctrl forwards local/remote accesses o Shared router in each vault Memory request NDP Core Local NDP Core Mem Ctrl Remote NDP Core Router 8

9 Virtual Memory Support NDP threads access virtual address space o Small TLB per core (32 entries) o Large pages to minimize TLB misses (2 MB) o Sufficient to cover local memory & remote buffers In PageRank o Each core works on local data, much smaller than the entire dataset o 0.25% miss rate for PageRank TLB misses served by OS in host o Similar to IOMMU misses in conventional systems 9

10 Software-Assisted Coherence Maintaining general coherence is expensive in NDP systems o Highly distributed, multiple stacks Analytics frameworks o Little data sharing except for communication o Data partitioning is coarse-grained Vault 0 Vault 1 Vault Memory Mem Ctrl Vault Memory Mem Ctrl Only allow data to be cached in one cache o Owner cache o No need to check other caches Page-level coarse-grained o Owner cache configurable through PTE $ NDP Core $ NDP Core $ NDP Core Owner cache identified by TLB $ Memory vault identified NDP by physical Core address 10

11 Software-Assisted Coherence Scalable o Avoids directory lookup and storage Vault 0 Vault 1 Vault Memory Dataset Vault Memory Adaptive o Data may overflow to other vault o Able to cache data from any vault in local cache $ NDP Core Mem Ctrl $ NDP Core $ NDP Core Mem Ctrl $ NDP Core Flush only when owner cache changes o Rarely happen as dataset partitioning is fixed 11

12 Communication Pull-based model o Producer buffers intermediate/result data locally and separately o Post small message (address, size) to consumer o Consumer pulls data when it needs with load instructions Task Task Task Task Cores Cores Cores Cores Process Buffer Task Task Task Task Pull 12

13 Communication Pull-based model is efficient and scalable o Sequential accesses to data o Asynchronous and highly parallel o Avoids the overheads of extra copies o Eliminates host processor bottleneck In PageRank o Used to communicate the update lists across partitions 13

14 Communication HW optimization: remote load buffer (RLBs) o A small buffer per NDP core (a few cachelines) o Prefetch and cache remote (sequential) load accesses Remote data are not cache-able in the local cache Do not want owner cache change as it results in cache flush Coherence guarantee with RLBs o Remote stores bypass RLB All writes go to the owner cache Owner cache always has the most up-to-date data o Flush RLBs at synchronization point at which time new data are guaranteed to be visible to others Cheap as each iteration is long and RLB is small 14

15 Synchronization Remote atomic operations o Fetch-and-add, compare-and-swap, etc. o HW support at memory controllers [Ahn et al. HPCA 05] Higher-level synchronization primitives o Build by remote atomic operations o E.g., hierarchical, tree-style barrier implementation Core vault stack global In PageRank o Build barrier between iterations 15

16 Software Runtime Hide low-level coherence/communication features o Expose simple set of API Data partitioning and program launch o Optionally specify running core and owner cache close to dataset o No need to be perfect, correctness is guaranteed by remote access Hybrid workloads o Coarsely divide work between host and NDP by programmers Based on temporal locality and parallelism o Guarantee no concurrent accesses from host and NDP cores 16

17 Evaluation Three analytics framework: MapReduce, Graph, DNN

18 Methodology Infrastructure o zsim o McPAT + CACTI + Micron s DRAM power calculator Calibrate with public HMC literatures Applications o MapReduce: Hist, LinReg, grep o Graph: PageRank, SSSP, ALS o DNN: ConvNet, MLP, da

19 Porting Frameworks MapReduce o In map phase, input data streamed in o Shuffle phase handled by pull-based communication Graph o Edge-centric o Pull remote update lists when gathering Deep Neural Networks o Convolution/pooling layers handled similar to Graph o Fully-connected layers use local combiner before communication Once the framework is ported, no changes to the user-level apps 19

20 Normalized Performance Normalized Energy Graph: Edge- vs. Vertex-Centric Performance SSSP Vertex-Centric ALS Edge-Centric 2.9x performance and energy improvement o Edge-centric version optimize for spatial locality o Higher utilization for cachelines and DRAM rows SSSP Vertex-Centric Energy ALS Edge-Centric 20

21 Bandwidth Utilization Normalized Performance Balance: PageRank % 80% 60% 40% 20% 0% Saturate after 8 cores Number of Cores per Vault 1.0GHz 1T 1.0GHz 2T 1.0GHz 4T 0.5GHz 1T 0.5GHz 2T 0.5GHz 4T Performance scales to 4-8 cores per vault o Bandwidth saturates Final design o 4 cores per vault o 1.0 GHz o 2-threaded o Area constrained 21

22 Normalized Speedup Scalability Performance Scaling vs. # Stacks Hist PageRank ConvNet 1 stack 2 stacks 4 stacks 8 stacks 16 stacks Performance scales well up to 16 stacks (256 vaults, 1024 threads) Inter-stack links are not heavily used 22

23 Final Comparison Four systems o Conv-DDR3 Host processor + 4 DDR3 channels o Conv-3D Host processor + 8 HMC stacks o Base-NDP Host processor + 8 HMC stacks with NDP cores Communication coordinated by host o NDP Similar to Base-NDP With our coherence and communication 23

24 Final Comparison 1.5 Execution Time 1.5 Energy Conv-DDR3 Conv-3D Base-NDP NDP Conv-DDR3 Conv-3D Base-NDP NDP Conv-3D: improve 20% for Graph (bandwidth-bound), more energy Base-NDP: 3.5x faster and 3.4x less energy than Conv-DDR3 NDP: up to 16x improvement than Conv-DDR3, 2.5x over Base-NDP 24

25 Hybrid Workloads Execution Time Breakdown Use both host processor and NDP cores for processing NDP portion: similar speedup Host portion: slight slowdown o Due to coarse-grained address interleaving FisherScoring Host Time NDP Time K-Core 25

26 Conclusion Lightweight hardware structures and software runtime o Hides hardware details o Scalable and adaptive software-assisted coherence model o Efficient communication and synchronization Balanced and efficient hardware Up to 16x improvement over DDR3 baseline o 2.5x improvement over previous NDP systems Software optimization o 3x improvement from spatial locality 26

27 Thanks! Questions?

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard