NUMA-aware Graph-structured Analytics
|
|
- Elizabeth Holland
- 5 years ago
- Views:
Transcription
1 NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China
2 Big Data Everywhere 00 Million Tweets/day 1.11 Billion Users 6 Billion Photos 100 Hrs of Video every minute How do we understand and use Big Data?
3 Data Analytics 00 Million Tweets/day 1.11 Billion Users 6 Billion Photos 100 Hrs of Video every minute Machine Learning and Data Mining NLP
4 It s about the graphs...
5 NUMA & Graph-analytics Application Hardware Processor Memory Single Unified Multi- Core NUCA Multi- Socket NUMA Now e.g. 80 Cores with 1TB RAM NLP 8 Sockets X (10 Cores with128gb local RAM)
6 How about? NUMA systems Graph-analytics
7 Contribution Polymer: NUMA-aware Graph-structured Analytics A comprehensive analysis that uncovers issues for running graph analytics system on NUMA platform A new system that exploits both NUMA-aware data layout and memory access strategies Three optimizations for global synchronization efficiency, load balance and data structure flexibility A detailed evaluation that demonstrates the performance and scalability benefits
8 Outline Background & Issues Design of Polymer Evaluation
9 Outline Background & Issues Design of Polymer Evaluation
10 Example: PageRank A centrality analysis algorithm to measure the relative rank for each element of a linked set Characteristics Linked set data dependence Rank of who links it predictable accesses Convergence iterative computation
11 Graph-analytics The scatter-gather model scatter : propagate the current value of a vertex to its neighbors along edges gather : accumulate values from neighbors to compute the next value of a vertex vertex TOPO In-memory data structure Graph Topology Application-specific Data Runtime State edge curr D1 D2 D3 D D5 D6 next curr next DATA STAT
12 Vertex-centric (e.g. Ligra) STAT/curr TOPO/vertex TOPO/out-edge DATA/curr DATA/next STAT/next
13 Vertex-centric (e.g. Ligra) STAT/curr TOPO/vertex TOPO/out-edge DATA/curr DATA/next STAT/next SEQ RND R W
14 Edge-centric (e.g. X-Stream) partition 5 3 TOPO/edge STAT/curr DATA/curr shuffle phase DATA/Uout DATA/Uin DATA/next STAT/next
15 Edge-centric (e.g. X-Stream) partition 5 3 TOPO/edge STAT/curr DATA/curr RND R shuffle phase DATA/Uout DATA/Uin SEQ SEQ W W R R DATA/next STAT/next RND W
16 NUMA Characteristics A commodity NUMA machine Multiple processor nodes (i.e., socket) Processor = multiple cores + a local DRAM A globally shared memory abstract (cache-coherence) Hallmark: Non-uniform memory access Latency (Cycle) Inst. 0-hop 1-hop 2-hop 80-core Intel Xeon machine Load Store Bandwidth (MB/s) Access 0-hop 1-hop 2-hop IL 80-core Intel Xeon machine SEQ RAND
17 NUMA Characteristics A commodity NUMA machine Multiple processor nodes (i.e., socket) Sequential remote access is faster than random local access Processor = multiple cores + a local DRAM A globally shared memory abstract (cache-coherence) Hallmark: Non-uniform memory access & Random remote access is awesome! Latency (Cycle) Inst. 0-hop 1-hop 2-hop 80-core Intel Xeon machine Load Store Bandwidth (MB/s) Access 0-hop 1-hop 2-hop IL 80-core Intel Xeon machine SEQ RAND
18 NUMA Characteristics The world we lived in: first-touch policy binding virtual pages to physical frames locating on a memory node where a thread first touches the pages CPU MEM Centralized Interleaved Associated The world we want to lived in
19 NUMA Characteristics The world we lived in: first-touch policy binding virtual pages to physical frames locating on a memory node where a thread first touches the pages Both centralized and interleaved data layout will hamper locality and parallelism Centralized Interleaved & Associated CPU Associated layout is the ideal one. MEM The world we want to lived in
20 NUMA Characteristics Lack of locality (access neighboring vertices) It is inevitable to access remote memory Random access is always there update How to mix?? Local Global SEQ RND
21 Access Strategy on NUMA Vertex-centric Model Completely overlooked (e.g. Ligra) L G SEQ RND 5 3 N0 N1 SEQ R L RND W G
22 Access Strategy on NUMA Edge-centric Model Inefficient way (e.g. X-Stream) N0 N1 L G SEQ RND RND R L shuffle phase SEQ W L SEQ R L SEQ W G SEQ R L RND W L
23 Runtime (sec) Normalized Speedup Normalized Speedup Normalized Speedup Scalability & Performance on NUMA Scalability: #Cores vs. #sockets Intel 80-cores (8Sx10C) Ligra X-Stream Galois #cores Ligra X-Stream Galois #sockets X-Stream 8C: 6.92X 8S:.58X Galois 10C: 6.19X 8S: 2.90X X-Stream 1S: 132s 8S: 29s Galois 1S: 33s 8S: 12s Performance (sec) Intel 80-cores (8Sx10C) Ligra X-Stream Galois #sockets Scalability: #sockets Ligra X-Stream Galois #sockets worse! 8 Socket LG: 2.9X XS: 1.X AMD 6-core (8Sx8C)
24 Runtime (sec) Normalized Speedup Normalized Speedup Normalized Speedup Scalability & Performance on NUMA Scalability: #Cores vs. #sockets Intel 80-cores (8Sx10C) X-Stream 1S: 132s 8S: 29s Galois 1S: 33s 8S: 12s Ligra X-Stream Galois #cores Performance (sec) #sockets #sockets X-Stream 8C: 6.92X 8S:.58X Galois 10C: 6.19X 8S: 2.90X Minimize remote & 1 2random accesses 8 Eliminate 160 the 8 Ligra combination Ligraof them 120 X-Stream Galois 6 X-Stream Galois 80 0 Intel 80-cores (8Sx10C) Ligra X-Stream Galois #sockets Scalability: #sockets worse! 8 Socket LG: 2.9X XS: 1.X AMD 6-core (8Sx8C)
25 Outline Background & Issues Design of Polymer Evaluation
26 Goal#1: Reduce remote accesses Co-locating data and computation within the same NUMA node
27 Graph-aware Data Layout Co-locating data and computation within the same NUMA node Graph-aware partitioning TOPO DATA STAT D1 D2 D3 D D5 D
28 Graph-aware Data Layout Co-locating data and computation within the same NUMA node 1. Graph-aware partitioning N N D1 D2 D3 Intuitive D D5 D
29 Graph-aware Data Layout Co-locating data and computation within the same NUMA node 1. Graph-aware partitioning N D1 D2 D3 N D D5 D sophisticated
30 Graph-aware Data Layout Co-locating data and computation within the same NUMA node Graph-aware partitioning D1 D2 D3 N0 agent D D5 D sophisticated N1 5 3
31 Graph-aware Data Layout Co-locating data and computation within the same NUMA node Graph-aware partitioning NUMA-aware allocation Virt-Memory Phys-Memory TOPO D1 D2 D3 N0 agent D D5 D seq 2.local 3.long 1.seq/rnd DATA 2.global 3.long STAT N seq/rnd 2.global 3.short
32 Goal#2: Eliminate random + remote Random remote access access neighboring vertices on other nodes distribute the computations on a singe vertex over multiple nodes
33 Goal#2: Eliminate random + remote Random remote access access neighboring vertices on other nodes distribute the computations on a singe vertex over multiple nodes Each node handles all edges of partial vertices RND SEQ L G RND SEQ G L Each node handles partial edges of all vertices
34 1 NUMA-aware Access Strategy distribute the computations on singe vertex over multiple NUMA-nodes N0 N1 STAT/curr TOPO/vertex TOPO/out-edge DATA/curr DATA/next STAT/next
35 NUMA-aware Access Strategy distribute the computations on singe vertex over multiple NUMA-nodes N0 N1 DATA/curr SEQ R L DATA/next RND W G DATA/curr DATA/next SEQ R G RND W L
36 Optimizations 1. Rolling update 2. Hierarchical and efficient barrier 3. Adaptive data structure
37 Outline Background & Issues Design of Polymer Evaluation
38 Implementation Polymer ~5,300 SLOCs of C++ code Based on scatter-gather iterative model Support both push and pull mode Use a synchronous scheduler Several typical graph applications Sparse MM: PageRank, SpMV and BP Traversal: BFS, CC and SSSP Open source:
39 Experiment Settings Baseline: Ligra v201.3, X-Stream v0.9, and Galois v2.2 Platform 80-core Intel machine (w/o Hyper-Threading) 8 sockets (E7-8850: 10 cores and 128 GB local RAM) 6-core AMD machine (also 8 sockets) Algorithms (6) Graphs () Intel/8X10 3 Spars MM algorithms 3 Traversal algorithms 2 real-world and 2 synthetic graphs AMD/8X8 Graph V E Twitter 1.7 M 1.7B rmat M 2.1B Powerlaw 10.0M 105M roadus 23.9M 58M
40 Overall Performance Algo. V Polymer Ligra X-Stream Galois Twitter PR rmat Power roadus Twitter SpMV rmat Power roadus Twitter BP rmat Power roadus
41 Overall Performance Algo. V Polymer Ligra X-Stream Galois Twitter X X X 11.6 PR rmat X X X 19.6 Power X X X 6.6 roadus X X X 1. Twitter X X X 11.7 SpMV rmat X X X 1.9 Power X X X 6.2 roadus X X X 3.6 Twitter X X X 57.1 BP rmat X X X 75.0 Power X X X 8.6 roadus X X X 7.1
42 Overall Performance Algo. V Polymer Ligra X-Stream Galois PR SpMV BP Twitter X X X 11.6 rmat X X X 19.6 Power X X X 6.6 roadus X X X 1. Twitter X ~ 3.8X 3.08X 7.89X 1.55X rmat X X X 1.9 Compared with best cases Power X X X 6.2 roadus X X X 3.6 Twitter X X X 57.1 rmat X X X 75.0 Power X X X 8.6 roadus X X X 7.1
43 Overall Performance Algo. V Polymer Ligra X-Stream Galois Twitter BFS rmat Power roadus Twitter CC rmat Power roadus * Twitter SSSP rmat Power roadus *
44 Overall Performance Algo. V Polymer Ligra X-Stream Galois Twitter X X X 2.5 BFS rmat X X X 2.5 Power X X X 0.36 roadus X X X 5.01 Twitter X X X 31.9 CC rmat X X X 33.9 Power X X X 3.51 roadus X X * Twitter X X X 26.3 SSSP rmat X X X 28.5 Power X X X 26.6 roadus X X *
45 Normalized Speedup Runtime (sec) Normalized Speedup Runtime (sec) Scalability on NUMA Performance & Scalability: #sockets Ligra X-Stream Galois Polymer #sockets 12.7X Performance & Scalability: #sockets Ligra X-Stream Galois Polymer #sockets 3.78X #sockets Ligra Galois Polymer Ligra X-Stream Galois Polymer #sockets Intel 80-cores (8Sx10C) PageRank 5.8X/X-Stream 2.85X/Ligra 2.19X/Galois Intel 80-cores (8Sx10C) BFS 0.80X/Galois 1.25X/Ligra 2.96X/X-stream
46 Conclusion Polymer: NUMA-aware Graph-structured Analytics A comprehensive analysis that uncovers several NUMA characteristics and issues with existing NUMA-oblivious graph analytics systems A new system that exploits both NUMA- and graphaware data layout and memory access strategies minimize remote access eliminate remote random access Three optimizations for global synchronization efficiency, load balance and data structure flexibility
47 Thanks Questions projects/polymer.html Institute of Parallel and Distributed Systems
NUMA-Aware Graph-Structured Analytics
NUMA-Aware Graph-Structured Analytics Kaiyuan Zhang Rong Chen Haibo Chen Shanghai Key aboratory of Scalable Computing and Systems Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University,
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationGridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University
GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationMosaic: Processing a Trillion-Edge Graph on a Single Machine
Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys
More informationHigh-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock
High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)
More informationGraFBoost: Using accelerated flash storage for external graph analytics
GraFBoost: Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind MIT CSAIL Funded by: 1 Large Graphs are Found Everywhere in Nature
More informationTanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia
How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems
More informationOptimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation
Optimizing Cache Performance for Graph Analytics Yunming Zhang 6.886 Presentation Goals How to optimize in-memory graph applications How to go about performance engineering just about anything In-memory
More informationA Lightweight Infrastructure for Graph Analytics
A Lightweight Infrastructure for Graph Analytics Donald Nguyen, Andrew Lenharth and Keshav Pingali The University of Texas at Austin, Texas, USA {ddn@cs, lenharth@ices, pingali@cs}.utexas.edu Abstract
More informationX-Stream: Edge-centric Graph Processing using Streaming Partitions
X-Stream: Edge-centric Graph Processing using Streaming Partitions Amitabha Roy, Ivo Mihailovic, Willy Zwaenepoel {amitabha.roy, ivo.mihailovic, willy.zwaenepoel}@epfl.ch EPFL Abstract X-Stream is a system
More informationA Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms
A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms Gurbinder Gill, Roshan Dathathri, Loc Hoang, Keshav Pingali Department of Computer Science, University of Texas
More informationX-Stream: A Case Study in Building a Graph Processing System. Amitabha Roy (LABOS)
X-Stream: A Case Study in Building a Graph Processing System Amitabha Roy (LABOS) 1 X-Stream Graph processing system Single Machine Works on graphs stored Entirely in RAM Entirely in SSD Entirely on Magnetic
More informationReport. X-Stream Edge-centric Graph processing
Report X-Stream Edge-centric Graph processing Yassin Hassan hassany@student.ethz.ch Abstract. X-Stream is an edge-centric graph processing system, which provides an API for scatter gather algorithms. The
More informationBoosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid
More informationGPU Sparse Graph Traversal
GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex
More informationAutomatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC
Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road
More informationAccelerating Irregular Computations with Hardware Transactional Memory and Active Messages
MACIEJ BESTA, TORSTEN HOEFLER spcl.inf.ethz.ch Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages LARGE-SCALE IRREGULAR GRAPH PROCESSING Becoming more important
More informationAutomatic Scaling Iterative Computations. Aug. 7 th, 2012
Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics
More informationThe GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems
Memory-Centric System Level Design of Heterogeneous Embedded DSP Systems The GraphGrind Framework: Scott Johan Fischaber, BSc (Hon) Fast Graph Analytics on Large Shared-Memory Systems Jiawen Sun, PhD candidate
More informationSub-millisecond Stateful Stream Querying over Fast-evolving Linked Data
Sub-millisecond Stateful Stream Querying over Fast-evolving Linked Data Yunhao Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems (IPADS) Shanghai Jiao Tong University Stream Query
More informationLocality-Aware Software Throttling for Sparse Matrix Operation on GPUs
Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen 1, Ari B. Hayes 1, Chi Zhang 2, Timothy Salmon 1, Eddy Z. Zhang 1 1. Rutgers University 2. University of Pittsburgh Sparse
More informationFalcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University
Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput
More informationSAP HANA. Jake Klein/ SVP SAP HANA June, 2013
SAP HANA Jake Klein/ SVP SAP HANA June, 2013 SAP 3 YEARS AGO Middleware BI / Analytics Core ERP + Suite 2013 WHERE ARE WE NOW? Cloud Mobile Applications SAP HANA Analytics D&T Changed Reality Disruptive
More informationNUMA-aware OpenMP Programming
NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC
More informationEfficient graph computation on hybrid CPU and GPU systems
J Supercomput (2015) 71:1563 1586 DOI 10.1007/s11227-015-1378-z Efficient graph computation on hybrid CPU and GPU systems Tao Zhang Jingjie Zhang Wei Shu Min-You Wu Xiaoyao Liang Published online: 21 January
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationGPU Sparse Graph Traversal. Duane Merrill
GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor
More informationFlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs
FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs Da Zheng, Disa Mhembere, Randal Burns Department of Computer Science Johns Hopkins University Carey E. Priebe Department of Applied
More informationHigh Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA
High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it
More informationA Framework for Processing Large Graphs in Shared Memory
A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationArquitecturas y Modelos de. Multicore
Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have
More informationExtreme-scale Graph Analysis on Blue Waters
Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationCOMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory
COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy
More informationBig Data Analytics Using Hardware-Accelerated Flash Storage
Big Data Analytics Using Hardware-Accelerated Flash Storage Sang-Woo Jun University of California, Irvine (Work done while at MIT) Flash Memory Summit, 2018 A Big Data Application: Personalized Genome
More informationA Distributed Multi-GPU System for Fast Graph Processing
A Distributed Multi-GPU System for Fast Graph Processing Zhihao Jia Stanford University zhihao@cs.stanford.edu Pat M c Cormick LANL pat@lanl.gov Yongkee Kwon UT Austin yongkee.kwon@utexas.edu Mattan Erez
More informationClaude TADONKI. MINES ParisTech PSL Research University Centre de Recherche Informatique
Got 2 seconds Sequential 84 seconds Expected 84/84 = 1 second!?! Got 25 seconds MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Séminaire MATHEMATIQUES
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationCallisto-RTS: Fine-Grain Parallel Loops
Callisto-RTS: Fine-Grain Parallel Loops Tim Harris, Oracle Labs Stefan Kaestle, ETH Zurich 7 July 2015 The following is intended to provide some insight into a line of research in Oracle Labs. It is intended
More informationcustinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader
custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader What we will see today The first dynamic graph data structure for the GPU. Scalable in size Supports the same functionality
More informationNon-uniform memory access (NUMA)
Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access
More informationComputation and Communication Efficient Graph Processing with Distributed Immutable View
Computation and Communication Efficient Graph Processing with Distributed Immutable View Rong Chen, Xin Ding, Peng Wang, Haibo Chen, Binyu Zang, Haibing Guan Shanghai Key Laboratory of Scalable Computing
More informationHANA Performance. Efficient Speed and Scale-out for Real-time BI
HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business
More informationTopology and affinity aware hierarchical and distributed load-balancing in Charm++
Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National
More informationBinding Nested OpenMP Programs on Hierarchical Memory Architectures
Binding Nested OpenMP Programs on Hierarchical Memory Architectures Dirk Schmidl, Christian Terboven, Dieter an Mey, and Martin Bücker {schmidl, terboven, anmey}@rz.rwth-aachen.de buecker@sc.rwth-aachen.de
More informationEverything you always wanted to know about multicore graph processing but were afraid to ask
Everything you always wanted to know about multicore graph processing but were afraid to ask Jasmina Malicevic, Baptiste Lepers, and Willy Zwaenepoel, EPFL https://www.usenix.org/conference/atc17/technical-sessions/presentation/malicevic
More informationMorsel- Drive Parallelism: A NUMA- Aware Query Evaluation Framework for the Many- Core Age. Presented by Dennis Grishin
Morsel- Drive Parallelism: A NUMA- Aware Query Evaluation Framework for the Many- Core Age Presented by Dennis Grishin What is the problem? Efficient computation requires distribution of processing between
More informationGTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs
GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs Min-Soo Kim, Kyuhyeon An, Himchan Park, Hyunseok Seo, and Jinwook Kim Department of Information and Communication Engineering
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationHarp-DAAL for High Performance Big Data Computing
Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big
More informationExtreme-scale Graph Analysis on Blue Waters
Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State
More informationGetting Performance from OpenMP Programs on NUMA Architectures
Getting Performance from OpenMP Programs on NUMA Architectures Christian Terboven, RWTH Aachen University terboven@itc.rwth-aachen.de EU H2020 Centre of Excellence (CoE) 1 October 2015 31 March 2018 Grant
More informationWedge A New Frontier for Pull-based Graph Processing. Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018
Wedge A New Frontier for Pull-based Graph Processing Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018 Graph Processing Problems modelled as objects (vertices) and connections between
More informationIn-Memory Data Management for Enterprise Applications. BigSys 2014, Stuttgart, September 2014 Johannes Wust Hasso Plattner Institute (now with SAP)
In-Memory Data Management for Enterprise Applications BigSys 2014, Stuttgart, September 2014 Johannes Wust Hasso Plattner Institute (now with SAP) What is an In-Memory Database? 2 Source: Hector Garcia-Molina
More informationPiccolo. Fast, Distributed Programs with Partitioned Tables. Presenter: Wu, Weiyi Yale University. Saturday, October 15,
Piccolo Fast, Distributed Programs with Partitioned Tables 1 Presenter: Wu, Weiyi Yale University Outline Background Intuition Design Evaluation Future Work 2 Outline Background Intuition Design Evaluation
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationA New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader
A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationArchitectural Implications on the Performance and Cost of Graph Analytics Systems
Architectural Implications on the Performance and Cost of Graph Analytics Systems Qizhen Zhang,, Hongzhi Chen, Da Yan,, James Cheng, Boon Thau Loo, and Purushotham Bangalore University of Pennsylvania
More informationLeveraging Flash in HPC Systems
Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,
More informationOS impact on performance
PhD student CEA, DAM, DIF, F-91297, Arpajon, France Advisor : William Jalby CEA supervisor : Marc Pérache 1 Plan Remind goal of OS Reproducibility Conclusion 2 OS : between applications and hardware 3
More informationSociaLite: A Datalog-based Language for
SociaLite: A Datalog-based Language for Large-Scale Graph Analysis Jiwon Seo M OBIS OCIAL RESEARCH GROUP Overview Overview! SociaLite: language for large-scale graph analysis! Extensions to Datalog! Compiler
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationNUMA replicated pagecache for Linux
NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationGraph Partitioning for Scalable Distributed Graph Computations
Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering
More informationG(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu
G(B)enchmark GraphBench: Towards a Universal Graph Benchmark Khaled Ammar M. Tamer Özsu Bioinformatics Software Engineering Social Network Gene Co-expression Protein Structure Program Flow Big Graphs o
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationParallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor
Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel
More informationJure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah
Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks
More informationStreamBox: Modern Stream Processing on a Multicore Machine
StreamBox: Modern Stream Processing on a Multicore Machine Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu
More informationEnterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015
Enterprise Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang November 9th, 5 Graph is Ubiquitous Breadth-First Search (BFS) is Important Wide Range of Applications Single Source Shortest Path
More informationBig Graph Processing. Fenggang Wu Nov. 6, 2016
Big Graph Processing Fenggang Wu Nov. 6, 2016 Agenda Project Publication Organization Pregel SIGMOD 10 Google PowerGraph OSDI 12 CMU GraphX OSDI 14 UC Berkeley AMPLab PowerLyra EuroSys 15 Shanghai Jiao
More informationEfficient Graph Computation on Hybrid CPU and GPU Systems
Noname manuscript No. (will be inserted by the editor) Efficient Graph Computation on Hybrid CPU and GPU Systems Tao Zhang Jingjie Zhang Wei Shu Min-You Wu Xiaoyao Liang* Received: date / Accepted: date
More informationLecture 17. NUMA Architecture and Programming
Lecture 17 NUMA Architecture and Programming Announcements Extended office hours today until 6pm Weds after class? Partitioning and communication in Particle method project 2012 Scott B. Baden /CSE 260/
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationSoft Updates Made Simple and Fast on Non-volatile Memory
Soft Updates Made Simple and Fast on Non-volatile Memory Mingkai Dong, Haibo Chen Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University @ NVMW 18 Non-volatile Memory (NVM) ü Non-volatile
More informationEnabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support. Kyle C. Hale and Peter Dinda
Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support Kyle C. Hale and Peter Dinda Hybrid Runtimes the runtime IS the kernel runtime not limited to abstractions exposed by syscall
More informationScalable Data-driven PageRank: Algorithms, System Issues, and Lessons Learned
Scalable Data-driven PageRank: Algorithms, System Issues, and Lessons Learned Joyce Jiyoung Whang, Andrew Lenharth, Inderjit S. Dhillon, and Keshav Pingali University of Texas at Austin, Austin TX 78712,
More informationGraph Data Management
Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of
More informationNUMA Support for Charm++
NUMA Support for Charm++ Christiane Pousa Ribeiro (INRIA) Filippo Gioachin (UIUC) Chao Mei (UIUC) Jean-François Méhaut (INRIA) Gengbin Zheng(UIUC) Laxmikant Kalé (UIUC) Outline Introduction Motivation
More informationSTREAMER: a Distributed Framework for Incremental Closeness Centrality
STREAMER: a Distributed Framework for Incremental Closeness Centrality Computa@on A. Erdem Sarıyüce 1,2, Erik Saule 4, Kamer Kaya 1, Ümit V. Çatalyürek 1,3 1 Department of Biomedical InformaBcs 2 Department
More informationGraph Analytics and Machine Learning A Great Combination Mark Hornick
Graph Analytics and Machine Learning A Great Combination Mark Hornick Oracle Advanced Analytics and Machine Learning November 3, 2017 Safe Harbor Statement The following is intended to outline our research
More informationPuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks
PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 1 Sandia National Laboratories, 2 The Pennsylvania
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationFrom Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson
From Think Like a Vertex to Think Like a Graph Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson Large Scale Graph Processing Graph data is everywhere and growing
More informationLatency-Tolerant Software Distributed Shared Memory
Latency-Tolerant Software Distributed Shared Memory Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin University of Washington USENIX ATC 2015 July 9, 2015 25
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 60 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Pregel: A System for Large-Scale Graph Processing
More informationCascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching
Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Kefei Wang and Feng Chen Louisiana State University SoCC '18 Carlsbad, CA Key-value Systems in Internet Services Key-value
More informationGRAM: Scaling Graph Computation to the Trillions
GRAM: Scaling Graph Computation to the Trillions Ming Wu, Fan Yang, Jilong Xue, Wencong Xiao, Youshan Miao, Lan Wei, Haoxiang Lin, Yafei Dai, Lidong Zhou Microsoft Research, Peking University, Beihang
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationThe dark powers on Intel processor boards
The dark powers on Intel processor boards Processing Resources (3U VPX) Boards with Multicore CPUs: Up to 16 cores using Intel Xeon D-1577 on TR C4x/msd Boards with 4-Core CPUs and Multiple Graphical Execution
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationArchitecture of a Real-Time Operational DBMS
Architecture of a Real-Time Operational DBMS Srini V. Srinivasan Founder, Chief Development Officer Aerospike CMG India Keynote Thane December 3, 2016 [ CMGI Keynote, Thane, India. 2016 Aerospike Inc.
More information