Accelerating PageRank using Partition-Centric Processing
Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna
USENIX ATC 18 (arXiv v4 [cs.DC], 6 Aug 2018)


Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Outline: Introduction, Partition-centric Processing Methodology, Analytical Evaluation, Experimental Results, Generalization, Conclusion

Graph Analytics: Graphs are the ubiquitously preferred data representation for social networks, the Internet, and road networks. In the era of big data and of large graphs with billions of nodes and edges, we need high-performance processing.

PageRank: A fundamental node-ranking algorithm that iteratively computes a weighted sum of each neighbor's PR value. It is an important benchmark for the performance of graph analytics, and its core kernel, sparse matrix-vector multiplication, underlies many scientific and engineering applications.
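The iterative update the slide refers to is, in its standard form, the damped PageRank recurrence (the damping factor \(\alpha\), typically 0.85, is an assumption not stated on the slide):

```latex
PR^{(t+1)}[v] \;=\; \frac{1-\alpha}{n} \;+\; \alpha \sum_{u \in N_{in}(v)} \frac{PR^{(t)}[u]}{\deg_{out}(u)}
```

In the pull direction, each vertex v evaluates this sum itself by reading every in-neighbor's PR value.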

Challenges: Pull Direction PageRank (PDPR)
1. The PDPR algorithm reads PR[u] for each in-neighbor u through fine-grained random memory accesses: low cache-line utilization, high DRAM traffic, low sustained memory bandwidth, cache misses and CPU stalls.
2. Random accesses to the source PR array.
3. DRAM traffic caused by those random accesses.
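The fine-grained random reads the slide blames are visible in a minimal pull-direction sketch (illustrative Python, not the talk's implementation; `offsets`/`neighbors` are an assumed CSR layout of in-edges):

```python
# Minimal sketch of pull-direction PageRank (PDPR) over a CSR in-edge list.
def pdpr_iteration(offsets, neighbors, pr, out_deg, alpha=0.85):
    n = len(pr)
    new_pr = [0.0] * n
    for v in range(n):
        acc = 0.0
        # Each in-neighbor u is an arbitrary vertex id, so pr[u] is a
        # fine-grained random read: the cache-miss source named on the slide.
        for u in neighbors[offsets[v]:offsets[v + 1]]:
            acc += pr[u] / out_deg[u]
        new_pr[v] = (1.0 - alpha) / n + alpha * acc
    return new_pr
```

When every vertex has all its out-edges present, one iteration preserves total rank mass.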

Challenges: Vertex-Centric GAS (BVGAS), the state-of-the-art method [1, 2].
Scatter: for each u in V, write msg = (PR[u], v) for every v in N_out(u), semi-sorted on v.
Gather: read each msg and accumulate PR[u] into PR[v].
This improves cache-line utilization and prevents CPU stalls. Drawbacks: it traverses the entire graph twice (inherently sub-optimal), is oblivious to the locality induced by vertex ordering, and its coarse-grained random accesses achieve poor DRAM bandwidth.
1. Buono, Daniele, et al. "Optimizing sparse matrix-vector multiplication for large-scale data analytics." Proc. of the 2016 International Conference on Supercomputing. ACM, 2016.
2. Beamer, Scott, et al. "Reducing PageRank communication via propagation blocking." Proc. of the 2017 International Parallel and Distributed Processing Symposium. IEEE, 2017.
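The double traversal the slide criticizes can be sketched in a few lines (illustrative Python, not the cited implementations; the binning/semi-sorting of messages is elided to a flat list):

```python
# Sketch of vertex-centric GAS (BVGAS): scatter one (dest, value) message
# per edge, then gather them all. The graph's m edges are touched twice,
# once as edges and once as messages.
def bvgas_iteration(out_edges, pr, out_deg, n, alpha=0.85):
    msgs = []                        # semi-sorted per-destination bins in practice
    for u in range(n):               # scatter phase: one message per edge
        contrib = pr[u] / out_deg[u]
        for v in out_edges[u]:
            msgs.append((v, contrib))
    acc = [0.0] * n
    for v, val in msgs:              # gather phase: read all m messages back
        acc[v] += val
    return [(1.0 - alpha) / n + alpha * a for a in acc]
```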

Contributions: A novel partition-centric processing methodology that enables efficient processor-DRAM communication, with optimizations addressing the communication challenges: partition-centric update propagation (reduced DRAM traffic), a partition-node graph data layout (sequential DRAM accesses), and a branch avoidance mechanism (removes data-dependent branches). Achieves up to 4.7 GTEPS sustained throughput using 16 cores and up to 77% of peak DRAM bandwidth, and is applicable to weighted graphs and generic SpMV computation.


Graph Partitioning: Partitions are disjoint, cacheable sets of vertices. The partition-centric abstraction views the graph as a set of links between nodes and partitions, unlocking communication efficiency not achievable with vertex-centric or edge-centric paradigms. Index-based partitioning is simple and has low pre-processing overhead.

Partition-Centric Processing Methodology (PCPM): Partition-centric processing with the GAS model. Scatter messages to neighbouring partitions; gather incoming messages to compute new PageRank values. Messages are written into statically allocated, disjoint memory spaces (bins): no locks or atomics, good scalability. Destination IDs are written only in the first iteration, reducing communication. Each thread processes one partition at a time, so vertex data stays cacheable and random accesses are low-latency.
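A possible sketch of the scatter phase under this scheme (illustrative Python; real PCPM preallocates the bins and writes destination IDs only in the first iteration, whereas this toy appends to lists on every call):

```python
# Sketch of PCPM scatter: one update per (source, partition) pair goes into
# that partition's bin; destination ids live in a separate bin, with a flag
# (the MSB in the real layout) marking "consume the next update".
def pcpm_scatter(out_edges, pr, out_deg, part_of, k):
    update_bins = [[] for _ in range(k)]
    destid_bins = [[] for _ in range(k)]
    for u in range(len(out_edges)):
        contrib = pr[u] / out_deg[u]
        touched = set()
        for v in out_edges[u]:
            p = part_of[v]
            first = p not in touched          # first edge of u into partition p
            touched.add(p)
            if first:
                update_bins[p].append(contrib)  # single update per partition
            destid_bins[p].append((v, first))
    return update_bins, destid_bins
```

Because an update and its "first" flag are appended at the same moment, a gather that walks `destid_bins[p]` in order consumes `update_bins[p]` consistently.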

Optimization 1: Partition-Centric Update Propagation. A single update is propagated from a node to all of its neighbours in a partition, a natural outcome of the partition-centric abstraction, drastically reducing communication volume. The MSB of the destination IDs is used for demarcation: read a new update if MSB = 1. Issues to address: Scatter still traverses unused edges {(7,1), (7,2)} and switches bins for each update insertion; Gather suffers data-dependent, unpredictable branches due to the MSB check.

Optimization 2: Data Layout. The bipartite Partition-Node Graph (PNG) keeps at most one edge between a node and a partition, eliminating the traversal of unused edges. Grouping the edges by destination partition means all updates go to one bin at a time, but random access to vertices remains; creating the PNG on a per-partition basis keeps vertices cached and makes DRAM accesses sequential. (Figure: 1. Original Graph, 2. PNG Layout.)
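Construction of this layout can be sketched as follows (illustrative Python; names like `part_of` are assumptions, and the real implementation builds flat per-partition edge arrays rather than dictionaries):

```python
# Sketch of building the bipartite Partition-Node Graph (PNG): all edges
# from node u into partition p collapse to one (u, p) link, and edges are
# grouped by destination partition so scatter fills one bin at a time.
def build_png(out_edges, part_of, k):
    # png[p] maps each source u with neighbours in p to its list of
    # destinations inside p; len(png[p]) counts the (node, partition) links.
    png = [{} for _ in range(k)]
    for u, nbrs in enumerate(out_edges):
        for v in nbrs:
            png[part_of[v]].setdefault(u, []).append(v)
    return png
```

The ratio r = |E| / |E'| from the analytical model is then `m / sum(len(p) for p in png)`.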

Optimization 3: Branch Avoidance. Gather uses pointers to read the bins: destid_ptr for destid_bins and update_ptr for update_bins. When should each be incremented? destid_ptr on every iteration; update_ptr only if MSB = 1. Directly adding the MSB to update_ptr removes the branch-based conditional check on the MSB.
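The pointer trick fits in a few lines (illustrative Python; in a native implementation this is integer pointer arithmetic over packed arrays):

```python
# Sketch of the branch-avoiding gather: the MSB flag is added directly to
# the update pointer instead of being tested in a data-dependent branch.
def pcpm_gather(update_bin, destid_bin, acc):
    update_ptr = -1
    for v, msb in destid_bin:       # msb == 1 marks a fresh update
        update_ptr += msb           # pointer bump replaces `if msb: ptr += 1`
        acc[v] += update_bin[update_ptr]
    return acc
```

The first entry of any bin carries msb = 1, so the pointer lands on index 0 before the first read.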


Parameters
Original graph G(V, E): n = |V|, m = |E|
PNG layout G'(P ∪ V, E'): E' = edges between nodes and partitions; k = |P| = number of partitions
Software: d_v = sizeof(update) = 4 B / 8 B; d_i = sizeof(index) = 4 B
Cache: c_mr = PDPR cache miss ratio; l = sizeof(cache line) = 64 B; r = |E| / |E'| ≥ 1
r and c_mr are functions of graph locality: as locality increases, r increases and c_mr decreases.

DRAM Communication Model

Method | Communication volume
PDPR_comm  | m (d_i + c_mr · l)
BVGAS_comm | 2m (d_i + d_v)
PCPM_comm  | m (d_i (1 + 1/r) + 2 d_v / r)

BVGAS_comm is oblivious to locality: good if locality is low and c_mr is high. PCPM_comm ≤ BVGAS_comm, so it is also good when locality is low and c_mr is high; and since it is linear in 1/r, it is good for high-locality graphs as well.
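The three models reduce to straight arithmetic. The sketch below encodes them directly; the example values plugged in by the test (c_mr, r, m) are assumptions for illustration, not measurements from the talk:

```python
# The three DRAM-traffic models from the slide, as plain arithmetic (bytes).
def pdpr_comm(m, d_i, c_mr, l):
    # edge indices plus a cache-line fill per missed source-PR read
    return m * (d_i + c_mr * l)

def bvgas_comm(m, d_i, d_v):
    # every edge's (index, value) message is written once and read once
    return 2 * m * (d_i + d_v)

def pcpm_comm(m, d_i, d_v, r):
    # PNG shrinks messages by the factor r = |E| / |E'|
    return m * (d_i * (1 + 1 / r) + 2 * d_v / r)
```

As r grows, PCPM's traffic approaches the m · d_i floor of reading the edge indices once.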

Random Access Model

Method | # Random DRAM accesses
PDPR_ra  | m · c_mr
BVGAS_ra | m · d_v / l
PCPM_ra  | k²

PCPM_ra << BVGAS_ra < PDPR_ra. Example (kron graph: n = 33.5M, m = 1.05B, k = 512, l = 64 B): PCPM_ra ≈ 0.26M vs. BVGAS_ra ≈ 67M (assuming full cache-line utilization for BVGAS).
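The kron example is easy to reproduce numerically (m, k, d_v, l taken from the parameter and example slides):

```python
# Random DRAM accesses per iteration for the kron example.
m, k, d_v, l = 1.05e9, 512, 4, 64
pcpm_ra = k * k            # each of the k gathers switches among k bins
bvgas_ra = m * d_v / l     # one random access per cache line of updates
```

That is roughly 0.26M bin switches for PCPM against tens of millions of random accesses for BVGAS, a gap of more than two orders of magnitude.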


Experimental Setup: Large real-life and synthetic graphs.

Dataset | Description | # Nodes (M) | # Edges (M)
gplus   | Google+ social network       | 28.9  | 463
pld     | Pay-level-domain (web crawl) | 42.9  | 623
web     | Webbase-2001 (high locality) | 118.1 | 992.8
kron    | Synthetic (high density)     | 33.5  | 1048
twitter | Follower network             | 61.6  | 1468.4
sd1     | Subdomain graph (web crawl)  | 95    | 1937.5

Platform: Intel Xeon E5-2650 v2 @ 2.3 GHz, dual-socket, 8 cores per socket; 32 KB L1 cache, 256 KB L2 cache; DRAM: 59.6 GB/s read bandwidth, 32.9 GB/s write bandwidth.

Comparison with Baselines: Execution Time. Up to 4.1× speedup over PDPR and up to 3.8× speedup over BVGAS (table: execution time of one PageRank iteration). Average 5× speedup in the Scatter phase; radically faster than BVGAS on the high-locality web graph.

Comparison with Baselines: DRAM Performance. Average 1.7× reduction in communication volume over BVGAS and 2.2× over PDPR. For sd1, PCPM sustains 77% of peak DRAM bandwidth; on average, 1.6× higher bandwidth than BVGAS.

Comparison with Baselines: Effect of Locality. Orig = graph with the original node labeling; GOrder = graph relabeled with GOrder [1], which increases spatial locality among node neighbors. Both PDPR and PCPM benefit from the optimized node labeling (see table).
1. Wei, Hao, et al. "Speedup graph processing by graph ordering." Proc. of the 2016 International Conference on Management of Data. ACM, 2016.

PCPM: Effect of Optimizations. Opt 1 = partition-centric update propagation; Opt 2 = PNG data layout; Opt 3 = branch avoidance in Gather.

PCPM: Effect of Partition Size. As partition size grows, r increases and DRAM traffic decreases. Growing partitions beyond cache capacity causes cache misses and a sudden rise in DRAM traffic. Between 256 KB and 1 MB, DRAM traffic keeps falling but execution time rises, because vertex accesses are served by the slower L3 cache.

Pre-processing Time. Pre-processing consists of computing bin sizes and constructing the PNG. Optimizations: pre-process all partitions in parallel and exploit the overlap between bin-size computation and PNG construction. Result: very small overhead, easily amortized over a few PageRank iterations. (Table: pre-processing time of the different methodologies.)


Generalization. PageRank is just one example: PCPM can process weighted graphs, with each update carrying a destination ID and an edge weight, and serves as a possible programming model for graph analytics. (Figure: Bin 0 holding updates PR[6], PR[7] with destination IDs 2, 0, 1, 2 and edge weights w_62, w_70, w_71, w_72.) It is extendible to generic SpMV on non-square matrices: partition rows and columns separately, parallelize Scatter over the column partitions and Gather over the row partitions.
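A toy version of this SpMV mapping (illustrative Python; names like `row_part` are assumptions, and the column-partitioned parallelism of the scatter side is elided to a single loop):

```python
# Sketch of PCPM-style SpMV y = A @ x for a (possibly non-square) sparse
# matrix given as COO triples: each nonzero behaves like a weighted edge,
# routed to the bin of its destination row's partition.
def pcpm_spmv(nnz, x, n_rows, row_part, kr):
    bins = [[] for _ in range(kr)]
    for i, j, a in nnz:                        # scatter: route a * x[j]
        bins[row_part[i]].append((i, a * x[j]))
    y = [0.0] * n_rows
    for b in bins:                             # gather: one row partition at a time
        for i, val in b:
            y[i] += val
    return y
```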

Generalization (contd.). The PCPM optimizations are generic software techniques, not specific to the multicore platform used, and can be ported to FPGAs and GPUs as well: FPGAs can store vertex data in BRAM, GPUs in shared memory. Such user-controlled on-chip memories make these platforms even more suitable.


Conclusion. Proposed a novel partition-centric method for PageRank and developed optimizations to reduce the volume of DRAM traffic and enhance sustained DRAM bandwidth. Compared with the state of the art on a multicore: average 2.7× increase in throughput, 1.7× reduction in DRAM communication, and 1.6× higher sustained memory bandwidth. The method extends to weighted graphs and generic SpMV, and to other platforms such as GPUs and FPGAs.

https://github.com/kartiklakhotia/pcpm Comments & Questions projectsharp.usc.edu