CuSha: Vertex-Centric Graph Processing on GPUs

Size: px

Start display at page:

Download "CuSha: Vertex-Centric Graph Processing on GPUs"

Dina Barrett
6 years ago
Views:

1 CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June,

2 Motivation Graph processing Real world graphs are large & sparse Power law distribution HiggsTwitter M Edges,.M Vertices LiveJournal 9M Edges, M Vertices Pokec M Edges,.M Vertices

3 Graphics Processing Unit (GPU) Single Instruction Multiple Data (SIMD) Repetitive processing patterns on regular data GPU Global Memory Streaming Multiprocessor Streaming Multiprocessor Streaming Multiprocessor Shared Memory Warp N - Warp Warp Shared Memory Warp N - Warp Warp Shared Memory Warp N - Warp Warp

4 Graphics Processing Unit (GPU) Single Instruction Multiple Data (SIMD) Repetitive processing patterns on regular data GPU Streaming Multiprocessor Global Memory Irregular Graphs Streaming Multiprocessor Streaming Multiprocessor Warp Warp Warp N - Shared Memory Irregular Memory Accesses Warp Warp Warp N - Shared Memory GPU underutilization Warp Warp Warp N - Shared Memory

5 Prior Work Using CSR assign vertex to a thread Threads iterate through assigned vertex s neighbors Non-coalesced memory accesses Work imbalance among threads Using CSR assign vertex to a virtual warp Virtual warp lanes process vertex s neighbors in parallel Non-coalesced memory accesses Work imbalance reduced but still exist Using CSR assemble frontiers Exploration based graph algorithms Non-coalesced memory accesses [Harish and Narayanan HiPC ] [Merrill et al, PPoPP ] [Hong et al, PPoPP ]

6 Prior Work Using CSR assign vertex to a thread Threads iterate through assigned vertex s neighbors Non-coalesced memory accesses Work imbalance among threads Using CSR assign vertex to a virtual warp Virtual warp lanes process vertex s neighbors in parallel Non-coalesced memory accesses Work imbalance reduced but still exist Using CSR assemble frontiers Exploration based graph algorithms Non-coalesced memory accesses [Harish and Narayanan HiPC ] Compressed Sparse Row [Merrill et al, PPoPP ] [Hong et al, PPoPP ]

7 Compressed Sparse Row (CSR) InEdgeIdxs 9 9 VertexValues x x x x x x x x s 9

8 Compressed Sparse Row (CSR) InEdgeIdxs 9 9 VertexValues x x x x x x x x s 9

9 Virtual Warp Centric (VWC) [PPoPP ] Physical warp broken into smaller virtual warps,, 8, lanes Advantages: Neighbors processed in parallel Load imbalance is reduced Disadvantages: s Load imbalance still exists GPU underutilization Non-coalesced memory accesses InEdgeIdxs VertexValues 9 x x x x x x x x 9

10 VWC-CSR Benchmark Avg global memory access efficiency (%) Avg warp execution efficiency (%) Breadth-First Search (BFS) Single Source Shortest Path (SSSP) PageRank (PR) Connected Components (CC) Single Source Widest Path (SSWP) Neural Network (NN) Heat Simulation (HS) Circuit Simulation (CS) Caused by Non-Coalesced memory loads and stores Caused by GPU Underutilization and Load Imbalance

11 G-Shards 8 Employs Shards Data required by computation placed contiguously Improves locality [Kyrola et al, OSDI ] Shard Shard 9 x x x x x x x x x x x x x 9 x VertexValues x x x x x x x x

12 G-Shards: Construction Edges partitioned based on destination vertex s index Edges sorted based on source vertex s index 9 W W VertexValues Shard Shard x x x x x x x 9 x x x x x x x x x x x x x x x 9 W W

13 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

14 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

15 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

16 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

17 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x VertexValues x x x x x x x x Block Shared Memory

18 G-Shards: Processing Iteratively process all shards Each shard processed by a thread block Steps: Read Compute Update Write Shard Shard x x x x x x x 9 x x x x x x x Coalesced memory accesses in all steps VertexValues x x x x x x x x Block Shared Memory

19 Frequency (Thousands) Frequency (Thousands) G-Shards Large and sparse graphs have small computation windows Graph Size Graph Sparsity 8 _8 _ Window Sizes Window Sizes

20 G-Shards Large and sparse graphs have small computation windows Threads in a warp Shard j- Shard j Shard j+ GPU underutilization Many warp threads remain idle during th step Solution Wi,j- Wi,j+ Group windows to utilize maximum threads Concatenated Windows Wi,j t -

21 Concatenated Windows (CW) Shard Shard x x x x VertexValues x x x x x x x x x x x x x x x x x 9 x

22 Concatenated Windows (CW) Detach array from shards Shard Shard x x x x VertexValues x x x x x x x x x x x x x x x x x 9 x

23 Concatenated Windows (CW) Detach array from shards Concatenate the ones belonging to windows of same shards Shard Shard x x x x VertexValues x x x x x x x x x x x x x x x x x 9 x

24 Concatenated Windows (CW) Detach array from shards Concatenate the ones belonging to windows of same shards Mapper array Fast access of in shards Shard Shard CW CW Mapper Mapper x x x 8 x VertexValues x 9 x x x x x x x x x 8 x x 9 x x x x x 9 x

25 CW: Processing First steps remain same Writeback is done by block threads using mapper array Thread IDs k k+ m m+ m+ n n+ n+ p CW i Mapper for W i,j- for W i,j for W i,j+ Shard j+ Shard j Shard j- k k+ m+ m+ n+ n+ Wi,j+ Wi,j Wi,j- m n p

26 CW: Processing First steps remain same Writeback is done by block threads using mapper array k k+ m m+ m+ n n+ n+ p Thread IDs CW i Mapper Improves GPU utilization for W i,j- in th step for W i,j for W i,j+ Shard j Shard j- Shard j+ k k+ m+ m+ n+ n+ Wi,j+ Wi,j Wi,j- m n p

27 CuSha Framework Uses G-Shards and CWs Vertex centric programming model init_compute, compute, update_condition compute must be atomic // SSSP in CuSha typedef struct Edge { unsigned int Weight; } Edge; // requires a variable for edge weight typedef struct Vertex { unsigned int Dist; } Vertex; // requires a variable for distance device void init_compute( Vertex* local_v, Vertex* V ) { local_v->dist = V->Dist; // fetches distances from global memory to shared memory } device void compute( Vertex* SrcV, StaticVertex* SrcV_static, Edge* E, Vertex* local_v ) { if (SrcV->Dist!= INF) // minimum distance of a vertex from source node is calculated atomicmin ( &(local_v->dist), SrcV->Dist + E->Weight ); } device bool update_condition( Vertex* local_v, Vertex* V ) { return ( local_v->dist < V->Dist ); // return true if newly calculated value is smaller }

28 Experimental Setup GeForce GTX8: SMs, GB GDDR RAM Intel Core i-9k Sandy bridge cores (HT enabled). GHz, DDR RAM, PCI-e. x CUDA. on Ubuntu. 8 Benchmarks BFS, SSSP, PR, CC, SSWP, NN, HS, CS Input graphs from the SNAP dataset collection SNAP:

29 Speedup over Multi-Threaded CPU BM G-Shards CW BFS.x-.x.x-.8x SSSP.x-.x.99x-.x PR.x-.x.x-8.98x CC.x-.x.x-.x Average speedups G-Shards:.x-.9x CW:.88x-.x SSWP.9x-.x.x-.8x NN.8x-9.x.9x-9.9x HS.x-.x.8x-.x CS.9x-.x.9x-.x Inputs G-Shards CW LiveJournal.x-.x.x-9.x Pokec.x-.9x.x-.89x HiggsTwitter.x-.x.x-.x RoadNetCA.9x-9.9x.9x-.9x WebGoogle.9x-9.9x.9x-.9x Amazon.x-.x.88x-.x

30 Speedup over VWC-CSR BM G-Shards CW BFS.9x-.9x.9x-.x SSSP.9x-.9x.x-.9x PR.x-.88x.8x-.x CC.8x-.x.x-.x Average speedups G-Shards:.x-.x CW:.89x-.89x SSWP.9x-.x.9x-.x NN.x-.x.x-.x HS.x-.x.x-.x CS.x-.x.x-.8x Inputs G-Shards CW LiveJournal.x-.x.9x-.x Pokec.x-.x.x-.8x HiggsTwitter.x-.9x.x-.x RoadNetCA.x-8.x.9x-.99x WebGoogle.x-.x.x-.x Amazon.x-.x.x-.x

31 Efficiency Efficiency Global Memory Efficiency 8 CSR G-Shards CW CSR G-Shards CW 8 BFS SSSP PR CC SSWP NN HS CS BFS SSSP PR CC SSWP NN HS CS Avg Memory Load Efficiency Avg Memory Store Efficiency Average CSR G-Shards CW Avg global memory load efficiency Avg global memory store efficiency 8.8% 8.%.9%.9%.%.%

32 Efficiency Warp Execution Efficiency 9 CSR G-Shards CW 8 BFS SSSP PR CC SSWP NN HS CS Avg Warp Execution Efficiency Average CSR G-Shards CW Avg warp execution efficiency.8% 88.9% 9.%

33 CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW Space Overhead Min Average Max Amazon WebGoogle RoadNetCA HiggsTwitter Pokec LiveJournal G-Shards over CSR: CW over G-Shards: ( E - V )xsizeof(index) + E xsizeof(vertex) bytes E xsizeof(index) bytes

34 CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW CSR G-Shards CW Execution Time Breakdown HD Copy GPU Computation DH Copy BFS SSSP PR CC SSWP NN HS CS Space overheads directly impact HD copy times G-Shards and CW are computation friendly

35 Other Experiments Traversed Edges Per Seconds (TEPS) G-Shards and CW up to times better than CSR Sensitivity study using synthetic benchmarks G-Shards is sensitive to: Graph Size Graph Density Number of vertices assigned to shards CW less affected by these parameters

36 Conclusion G-Shards Shard based representation mapped for GPUs Fully coalesced memory accesses Concatenated Windows Improves GPU utilization CuSha Framework Vertex centric programming model.x-.9x speedup over best VWC-CSR

37 Thanks CuSha on GitHub GRASP

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani Keval Vora Rajiv Gupta Laxmi N. Bhuyan Computer Science and Engineering Department University of California Riverside, CA, USA {fkhor, kvora,