Planar: Parallel Lightweight Architecture-Aware Adaptive Graph Repartitioning
1 Planar: Parallel Lightweight Architecture-Aware Adaptive Graph Repartitioning
Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis
University of Pittsburgh
2 Applications of Graph Partitioning
- Scientific Simulations
- Distributed Graph Computation (Pregel, Hama, Giraph)
- VLSI Design
- Task Scheduling
- Linear Programming
3 A Balanced Partitioning = Even Load Distribution
[Figure: an example graph split evenly across nodes N1, N2, N3]
4 Minimal Edge-Cut = Minimal Data Comm
[Figure: a partitioning of the same graph across N1, N2, N3 that minimizes edge-cut]
5 Minimal Edge-Cut = Minimal Data Comm, But Minimal Data Comm != Minimal Comm Cost
[Figure 1: Pair-Wise Network Bandwidth, with per-pair standard deviations in Mb/s (J. Xue, BigData 15)]
- Group neighboring vertices as close as possible
- The partitioner has to be Architecture-Aware
6 Overview of the State-of-the-Art
Balanced graph (re)partitioning:
- Partitioners (static graphs): offline methods (high quality, poor scalability); online methods (moderate quality, high scalability)
- Repartitioners (dynamic graphs): offline methods (high quality, poor scalability); online methods (moderate~high quality, high scalability)
- None of these methods is Architecture-Aware
7 Roadmap: Introduction, Planar, Evaluation, Conclusions
8 Planar: Problem Statement
Given G = (V, E) and an initial partitioning P:
- Balancing Load
- Minimizing Communication (network cost)
- Minimizing Migration
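The slide's objective formulas did not survive transcription; below is a hedged LaTeX reconstruction of the three goals, using standard repartitioning notation. The symbols w, c, m, P_0, and epsilon are our stand-ins, not necessarily the slide's own.

```latex
% Hedged reconstruction, not the slide's exact formulas.
% P = {P_1, ..., P_k}: partitions; w(v): load of vertex v;
% c(i, j): architecture-aware network cost between partitions i and j;
% P_0(v): initial placement of v; m(v): migration volume of v.
\begin{align*}
\text{Balancing load:}       &\quad \max_i \textstyle\sum_{v \in P_i} w(v)
                               \;\le\; (1+\epsilon)\,\tfrac{1}{k} \textstyle\sum_{v \in V} w(v) \\
\text{Minimizing comm:}      &\quad \min \textstyle\sum_{(u,v) \in E,\; P(u) \ne P(v)}
                               c\big(P(u), P(v)\big) \\
\text{Minimizing migration:} &\quad \min \textstyle\sum_{v:\; P(v) \ne P_0(v)}
                               m(v)\, c\big(P_0(v), P(v)\big)
\end{align*}
```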
9 Planar: Overview
[Timeline: Planar runs at adaptation supersteps S_k, S_k+1, S_k+2, S_k+4, S_k+5]
- Phase-1: Logical Vertex Migration (migration planning: what vertices to move? where to move?)
  - Phase-1a: Minimizing Comm Cost
  - Phase-1b: Ensuring Balanced Partitions
- Phase-2: Physical Vertex Migration (perform the migration plan)
- Phase-3: Convergence Check (still beneficial?)
10 Phase-1a: Minimizing Comm Cost
[Figure: example graph spanning N1, N2, N3 with edge weights 1 and 6]
11 Phase-1a: Minimizing Comm Cost
- Run Planar on each partition in parallel
- Each boundary vertex of a partition makes its own migration decision
- Probabilistic vertex migration
[Figure: the example graph, highlighting boundary vertices of N1 and N2]
13 Use vertex a as an example
- Stay in N1: g(a, N1, N1) = 0
- Max Gain: 0; Optimal Dest: N1
14 Move vertex a to N2?
- old_comm(a, N1) = 13
- new_comm(a, N2) = 7
- mig(a, N1, N2) = 1 * 6 = 6
- g(a, N1, N2) = 13 - 7 - 6 = 0
- Max Gain: 0; Optimal Dest: N1
15 Move vertex a to N3?
- old_comm(a, N1) = 13
- new_comm(a, N3) = 3
- mig(a, N1, N3) = 1 * 1 = 1
- g(a, N1, N3) = 13 - 3 - 1 = 9
- Max Gain: 9; Optimal Dest: N3
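The gain arithmetic on slides 14 and 15 is self-consistent; a minimal runnable sketch, where the decomposition gain = old_comm - new_comm - mig is our reading of the worked numbers (the per-edge cost factors were garbled in transcription):

```python
# Sketch of the per-vertex gain rule implied by slides 13-15.

def gain(old_comm, new_comm, mig):
    """Net cost reduction if the vertex migrates; positive means beneficial."""
    return old_comm - new_comm - mig

# Reproduce the slide's examples for vertex a, currently in N1:
assert gain(13, 7, 6) == 0   # a -> N2: no improvement over staying in N1
assert gain(13, 3, 1) == 9   # a -> N3: max gain 9, optimal destination
```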
16 Phase-1a: Probabilistic Vertex Migration
Migration planning:

Partition | Boundary Vtx | Migration Dest | Gain | Probability
N1        | a            | N3             | 9    | 9/9
N1        | b            | N1 (stay)      | 0    | 0
N2        | d            | N3             | 2    | 2/3
N2        | e            | N3             | 3    | 3/3
N3        | g            | N3 (stay)      | 0    | 0

Migrate with a probability proportional to the gain (sketched below).
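A sketch of the probabilistic migration step. Normalizing each vertex's gain by the largest gain in its own partition is our reading of the 9/9, 2/3, 3/3 entries in the table above, an assumption rather than something the slide states explicitly:

```python
import random

def plan_migrations(candidates):
    """candidates: [(vertex, best_dest, gain)] for ONE partition's boundary."""
    # Assumption: probability = gain / max gain within this partition.
    max_gain = max((g for _, _, g in candidates), default=0)
    plan = []
    for vertex, dest, g in candidates:
        # A vertex with gain 0 (or a full-gain tie with staying) never moves.
        if max_gain > 0 and random.random() < g / max_gain:
            plan.append((vertex, dest))
    return plan

print(plan_migrations([("a", "N3", 9), ("b", "N1", 0)]))            # a moves w.p. 9/9
print(plan_migrations([("d", "N3", 2), ("e", "N3", 3)]))            # d w.p. 2/3, e w.p. 3/3
```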
17 Phase-1b: Balancing Partitions (Quota-Based Vertex Migration)
- Q1: How much work should each overloaded partition migrate to each underloaded partition?
  - Potential gain computation: similar to Phase-1a vertex gain computation
  - Iteratively allocate quota, starting from the partition pair with the largest gain (see the sketch below)
- Q2: What vertices to migrate?
  - Phase-1a vertex migration, but limited by the quota
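A hedged sketch of the quota allocation loop: walk (overloaded, underloaded) partition pairs in decreasing order of potential gain and hand out as much quota as both sides allow. The dict-based interface and the example gains are illustrative assumptions, not Planar's internals.

```python
def allocate_quotas(surplus, deficit, potential_gain):
    """surplus[p]: load an overloaded partition p must shed;
    deficit[p]: load an underloaded partition p can absorb;
    potential_gain[(src, dst)]: estimated gain of shifting load src -> dst."""
    quotas = {}
    # Iterate pairs from largest to smallest potential gain.
    for src, dst in sorted(potential_gain, key=potential_gain.get, reverse=True):
        q = min(surplus.get(src, 0), deficit.get(dst, 0))
        if q > 0:
            quotas[(src, dst)] = q   # Phase-1a-style migration, capped at q
            surplus[src] -= q
            deficit[dst] -= q
    return quotas

# Example: N1 must shed 5 units; N2 can absorb 3, N3 can absorb 4.
print(allocate_quotas({"N1": 5}, {"N2": 3, "N3": 4},
                      {("N1", "N2"): 7, ("N1", "N3"): 2}))
# -> {('N1', 'N2'): 3, ('N1', 'N3'): 2}
```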
18 Planar: Physical Vertex Migration
[Overview pipeline from slide 9, now at Phase-2]
- Phase-2: Physical Vertex Migration: perform the migration plan
19 Planar: Convergence Check
[Overview pipeline from slide 9, now at Phase-3]
- Phase-3: Convergence Check: still beneficial?
20 Phase-3: Convergence
- A repartitioning epoch starts once enough changes (structure/load) accumulate, and runs adaptation supersteps S_k, S_k+1, ... until convergence
- Converge when the improvement achieved per adaptation superstep stays < δ for τ consecutive adaptation supersteps
- δ = 1% and τ = 10 (chosen via sensitivity analysis)
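A sketch of this stopping rule in Python. Here `run_superstep` is a hypothetical callable that performs one adaptation superstep (Phases 1 and 2) and returns the relative improvement it achieved; it is a stand-in, not Planar's actual interface.

```python
def adapt_until_converged(run_superstep, delta=0.01, tau=10):
    """Stop once improvement per superstep stays below delta for tau
    consecutive adaptation supersteps (delta = 1%, tau = 10 per the
    sensitivity analysis on the backup slides)."""
    consecutive_low = 0
    while consecutive_low < tau:
        improvement = run_superstep()
        consecutive_low = consecutive_low + 1 if improvement < delta else 0
```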
21 Evaluation
- Microbenchmarks
  - Convergence study (param selection)
  - Partitioning quality
- Real-world workloads
  - Breadth-First Search (BFS)
  - Single-Source Shortest Path (SSSP)
- Scalability tests
  - Scalability vs graph size
  - Scalability vs # of partitions
  - Scalability vs graph size and # of partitions
22 Partitioning Quality: Setup
- Datasets: 12 datasets from various areas
- # of Parts: 40 (two 20-core machines)
- Initial Partitioners:
  - HP: Hash Partitioning
  - DG: Deterministic Greedy
  - LDG: Linear Deterministic Greedy (sketched below)
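For context, a sketch of Linear Deterministic Greedy (LDG), the streaming heuristic named above (due to Stanton and Kliot): each arriving vertex goes to the partition that already holds most of its neighbors, weighted by a linear penalty on partition fullness. The function and variable names are ours.

```python
def ldg_partition(stream, k, capacity):
    """stream: iterable of (vertex, neighbor_list); returns vertex -> part."""
    assignment, sizes = {}, [0] * k
    for v, neighbors in stream:
        def score(i):
            # Neighbors already placed in partition i, damped by fullness.
            placed = sum(1 for u in neighbors if assignment.get(u) == i)
            return placed * (1.0 - sizes[i] / capacity)
        candidates = [i for i in range(k) if sizes[i] < capacity]
        best = max(candidates, key=score)
        assignment[v] = best
        sizes[best] += 1
    return assignment

# Tiny example: a 4-cycle split into 2 parts of capacity 2.
print(ldg_partition([(1, [2, 4]), (2, [1, 3]), (3, [2, 4]), (4, [1, 3])], 2, 2))
```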
23 Partitioning Quality: Datasets

Dataset         | |V|         | |E|           | Description
wave            | 156,317     | 2,118,662     | FEM
auto            | 448,695     | 6,629,222     | FEM
333SP           | 3,712,815   | 22,217,266    | FEM
CA-CondMat      | 108,…       | …,756         | Collaboration Network
DBLP            | 317,080     | 1,049,866     | Collaboration Network
email-Enron     | 36,692      | 183,831       | Collaboration Network
as-skitter      | 1,696,415   | 22,190,596    | Internet Topology
Amazon          | 334,863     | 925,872       | Product Network
USA-roadNet     | 23,947,347  | 58,333,344    | Road Network
roadNet-PA      | 1,090,919   | 6,167,592     | Road Network
YouTube         | 3,223,589   | 24,447,548    | Social Network
com-LiveJournal | 4,036,537   | 69,362,378    | Social Network
Friendster      | 124,836,180 | 3,612,134,270 | Social Network
24 Partitioning Quality: Planar achieved up to 68% improvement

Initial Partitioner | Max Improv. | Avg. Improv.
HP                  | 68%         | 53%
DG                  | 46%         | 24%
LDG                 | 69%         | 48%
25 Evaluation roadmap (repeated): next, real-world workloads (BFS, SSSP)
26 Real-World Workload: Setup

Cluster Configuration | PittMPICluster                  | Gordon
Interconnect          | FDR InfiniBand                  | QDR InfiniBand
# of Nodes            | …                               | …
Network Topology      | single switch (32 nodes/switch) | 4x4x4 3D torus of switches (16 nodes/switch)
Network Bandwidth     | 56 Gbps                         | 8 Gbps

Node Configuration | PittMPICluster (Intel Haswell) | Gordon (Intel Sandy Bridge)
# of Sockets       | 2 (10 cores/socket)            | 2 (8 cores/socket)
L3 Cache           | 25 MB                          | 20 MB
Memory Bandwidth   | 65 GB/s                        | 85 GB/s
27 Planar: Avoiding Resource Contention on the Memory Subsystems of Multicore Machines
- System bottleneck (A. Zheng, EDBT 16): PittMPICluster: memory (λ=1); Gordon: network (λ=0)
- Degree of contention λ: relates the intra-node comm cost to the maximal inter-node network comm cost (see the sketch below)
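Our reading of how λ is defined on this slide, as a hedged formula; the operator glyphs were lost in transcription, so treat this reconstruction as an assumption rather than the paper's exact definition:

```latex
% Assumed reconstruction of the slide's layout:
\lambda \;=\; \frac{\text{intra-node comm cost}}
                   {\text{maximal inter-node network comm cost}},
\qquad \lambda \in [0, 1]
% lambda = 1: the memory subsystem is the bottleneck (PittMPICluster);
% lambda = 0: the network dominates, so intra-node traffic is
%             comparatively free (Gordon).
```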
28 Real-World Workload: Baselines
Balanced graph (re)partitioning landscape (as on slide 6):
- Partitioners (static graphs): offline methods (high quality, poor scalability); online methods (moderate quality, high scalability)
- Repartitioners (dynamic graphs): offline methods (high quality, poor scalability); online methods (moderate~high quality, high scalability)
- uniplanar
- Initial Partitioner: DG
29 BFS Exec. Time on PittMPICluster (λ=1): Planar achieved up to 9x speedups
- as-skitter, 60 partitions: |V| = 1.6M, |E| = 22M; three 20-core machines
[Chart: BFS execution times; Planar's speedups over the baselines are 9x, 7.5x, 5.8x, 4.1x, 1.48x, and 1.37x, with Planar itself at 1x]
30 BFS Comm Volume on PittMPICluster (λ=1): Planar had the lowest intra-node comm volume
- as-skitter, 60 partitions: |V| = 1.6M, |E| = 22M; three 20-core machines

Reduction vs | Intra-Socket | Inter-Socket
DG           | 51%          | 38%
METIS        | 51%          | 36%
PARMETIS     | 47%          | 34%
uniplanar    | 44%          | 28%
ARAGON       | 4.3%         | 0.8%
PARAGON      | 5.2%         | 2.6%
31 BFS Exec. Time on Gordon (λ=0): Planar achieved up to 3.2x speedups
- as-skitter, 48 partitions: |V| = 1.6M, |E| = 22M; three 16-core machines
[Chart: BFS execution times; Planar's speedups over the baselines are 3.2x, 1.05x, 1.16x, and 1.21x, with Planar itself at 1x]
32 BFS Comm. Volume on Gordon (λ=0): Planar had the lowest inter-node comm volume
- as-skitter, 48 partitions: |V| = 1.6M, |E| = 22M; three 16-core machines
[Chart: inter-node comm volume reductions of 51%, 11%, 0.1%, and 25% over the baselines]
33 Conclusions
PLANAR: an Architecture-Aware Adaptive Graph Repartitioner
- Handles communication heterogeneity and shared resource contention
- Up to 9x speedups on real-world workloads
- Scaled up to a graph with 3.6B edges
Acknowledgments: Peyman Givi, Patrick Pisciuneri, Mark Silvis
Funding: NSF OIA, NSF CBET
34 Thank You!
Homepage: ADMT:
35 Backup Slides
36 Phase-3: Convergence (Param Selection)
- Initial Partitioner: DG (Deterministic Greedy)
- # of Parts: 40 (two 20-core nodes)
- Selected: δ = 1% and τ = 10
37 Scalability vs Graph Size: BFS Exec. Time
- friendster: |V| = 124M, |E| = 3.6B; # of partitions: 60 (three 20-core machines)

Baseline  | Planar speedup (60 cores)
DG        | 1.55x
uniplanar | 1.32x
PARAGON   | 1.08x
38 Scalability vs Graph Size: Repart. Time
- friendster: |V| = 124M, |E| = 3.6B; # of partitions: 60 (three 20-core machines)
[Chart: repartitioning time]
39 Scalability vs # of Partitions: BFS Exec. Time
- friendster: |V| = 124M, |E| = 3.6B; 60~120 partitions (three to six 20-core machines)

Baseline  | Planar speedup (120 cores)
DG        | 2.9x
uniplanar | 1.30x
PARAGON   | 1.15x
40 Scalability vs # of Partitions: Repart. Time
- friendster: |V| = 124M, |E| = 3.6B; 60~120 partitions (three to six 20-core machines)
[Chart: repartitioning time]
41 Intra-Node Shared Resource Contention
[Figure: intra-node message transfer between a sending core and a receiving core via a shared buffer: 1. load from the send buffer; 2a/2b. load and write the shared buffer; 3. load from the shared buffer; 4a/4b. load and write the receive buffer]
42 Intra-Node Shared Resource Contention
- Cached send/shared/receive buffers: multiple copies of the same data in the LLC, contending for the LLC and the memory controller (MC)
43 Intra-Node Shared Resource Contention
- Cached send/shared buffer on one socket and cached receive/shared buffer on another: multiple copies of the same data in the LLC, contending for the LLC, MC, and QPI
44 Inter-Node Comm Cost >> Intra-Node Comm Cost?
[Figure: Node#1 and Node#2 connected by an RDMA-enabled network]
45 Inter-Node Comm Cost vs. Intra-Node Comm Cost
- InfiniBand: 1.7 GB/s ~ 37.5 GB/s; DDR3: 6.25 GB/s ~ 16.6 GB/s
- Dual-socket Xeon E5v2 server with DDR3 memory and FDR 4x NICs per socket [1]
- Revisit the impact of the memory subsystem carefully!
[1] C. Binnig et al. The End of Slow Networks: It's Time for a Redesign. CoRR, 2015.
46 Planar: Avoiding Contention
[Figure: Node#1 and Node#2; each sending core transfers directly from its send buffer through the IB HCA to the remote node's receive buffer, bypassing the intra-node shared buffer]