Combinatorial Mathematics and Algorithms at Exascale: Challenges and Promising Directions
Transcription
1 Combinatorial Mathematics and Algorithms at Exascale: Challenges and Promising Directions. Assefaw Gebremedhin, Purdue University (starting August 2014: Washington State University, School of Electrical Engineering and Computer Science). Joint work with Alex Pothen.
2 Scientific inquiry: EXPERIMENTAL, THEORETICAL, COMPUTATIONAL (Simulation)
3 Scientific inquiry: EXPERIMENTAL, THEORETICAL, COMPUTATIONAL (Simulation), the 4th PARADIGM (Data). Connectedness.
4 Complex connectedness is everywhere! The social interconnections we have; the information we consume; the technological systems we use; the economic systems we live in; the political systems we operate in; the organizations we work at; the institutions we belong to; the ecological systems around us; ourselves (cell, brain).
5 Combinatorial models and algorithms in computational sciences. Embedded in scientific computing: matrix factorizations; matchings; vertex orderings; parallel computing; independent sets; graph colorings. At the forefront of discovery: data analysis; network science. Exploring the interplay between combinatorial and numerical algorithms is crucial for developing scalable methods on HPC platforms.
6 Challenges on manycore computing (general): programming models; algorithm and data structure design; memory management; energy consumption.
7 Challenges specific to combinatorial (graph) algorithms: low available concurrency; poor data locality; irregular memory access pattern; access pattern determined only at runtime; high data access to computation ratio.
8 Some promising algorithmic "paradigms" (for parallelizing graph algorithms): 1. Speculation-and-iteration 2. Approximate update 3. Parallelized search tree
9 1. Speculation-and-iteration. Idea: maximize concurrency by tentatively tolerating potential inconsistencies, and then detect and resolve the inconsistencies later, iteratively.
10 Speculation-and-iteration example: parallelizing greedy coloring.
Independent-set based (prior approaches): find a maximal independent set in parallel (Luby's algorithm); limited success.
Speculation-and-iteration:
ITERATIVE(G = (V, E))
  U ← V
  while U is not empty do
    1. Speculatively color vertices in U in parallel
    2. Check consistency of colors in U in parallel, store conflicts in R
    U ← R
Dataflow: fine-grain (edge-level) synchronization; no iteration; feasible when there is HW support for FGS.
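The ITERATIVE scheme on this slide can be sketched in a few lines of Python. This is a minimal, sequential simulation under stated assumptions, not the authors' implementation: the names `grid_graph`, `smallest_free_color`, and `speculative_coloring` are hypothetical, "parallel" workers are imitated by giving each block of U a stale snapshot of the colors, and a conflict is resolved by recoloring the endpoint with the larger vertex id.

```python
def grid_graph(rows, cols):
    """5-point grid graph: each vertex is adjacent to its N/S/E/W neighbors."""
    adj = {r * cols + c: set() for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if r + 1 < rows:
                adj[v].add(v + cols)
                adj[v + cols].add(v)
            if c + 1 < cols:
                adj[v].add(v + 1)
                adj[v + 1].add(v)
    return adj


def smallest_free_color(v, adj, color):
    """Greedy rule: smallest color not used by an already colored neighbor."""
    forbidden = {color[u] for u in adj[v] if color[u] is not None}
    c = 0
    while c in forbidden:
        c += 1
    return c


def speculative_coloring(adj, num_workers=4):
    """Return (color, rounds). Vertex ids are assumed to be comparable."""
    color = {v: None for v in adj}
    U = sorted(adj)
    rounds = 0
    while U:
        rounds += 1
        # Phase 1: speculative coloring. Each simulated worker colors one
        # contiguous block of U and sees only the colors from before this
        # round plus its own choices, so cross-block conflicts can occur.
        snapshot = dict(color)
        chunk = -(-len(U) // num_workers)  # ceiling division
        for w in range(num_workers):
            local = dict(snapshot)
            for v in U[w * chunk:(w + 1) * chunk]:
                local[v] = smallest_free_color(v, adj, local)
                color[v] = local[v]
        # Phase 2: conflict detection. For each conflicting edge, the
        # endpoint with the larger id goes into R and is recolored later.
        R = [v for v in U
             if any(color[u] == color[v] and u < v for u in adj[v])]
        for v in R:
            color[v] = None
        U = R
    return color, rounds


if __name__ == "__main__":
    g = grid_graph(8, 8)
    coloring, rounds = speculative_coloring(g, num_workers=4)
    print("colors used:", len(set(coloring.values())), "rounds:", rounds)
```

The smallest vertex in U never loses a conflict, so U shrinks in every round and the loop terminates.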
11 Speculation-and-iteration based coloring on distributed-memory architectures. Exploit the initial data distribution. Proceed in rounds, each having two phases: tentative coloring and conflict detection. Organize the coloring phase in supersteps. Use randomization in resolving conflicts. [Diagram: each round consists of supersteps, each followed by communication, and ends with conflict detection.]
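A rough single-process simulation of the round and superstep structure might look as follows. Everything here is an assumption made for illustration: the function name `distributed_coloring`, the `owner` map standing in for the initial data distribution, the all-to-all "communication" modeled as a dictionary update, and the use of one random number per vertex to decide which endpoint of a conflicting edge is recolored.

```python
import random


def distributed_coloring(adj, owner, superstep_size=2, seed=0):
    """Simulate rounds of tentative coloring plus conflict detection.

    adj:   dict vertex -> set of neighbors
    owner: dict vertex -> process id (the initial data distribution)
    """
    rng = random.Random(seed)
    # Random priorities; the vertex id breaks ties.
    priority = {v: (rng.random(), v) for v in adj}
    procs = sorted(set(owner.values()))
    color = {v: None for v in adj}
    todo = {p: sorted(v for v in adj if owner[v] == p) for p in procs}
    rounds = 0
    while any(todo.values()):
        rounds += 1
        # Each process starts the round with the colors known so far.
        view = {p: dict(color) for p in procs}
        steps = max(-(-len(t) // superstep_size) for t in todo.values())
        for s in range(steps):
            outbox = {}
            lo = s * superstep_size
            for p in procs:
                # Superstep: color a few owned vertices using local knowledge.
                for v in todo[p][lo:lo + superstep_size]:
                    forbidden = {view[p][u] for u in adj[v]}
                    c = 0
                    while c in forbidden:
                        c += 1
                    view[p][v] = c
                    outbox[v] = c
            # Communicate: colors chosen in this superstep become visible.
            for p in procs:
                view[p].update(outbox)
            color.update(outbox)
        # Conflict detection: the endpoint with the worse priority loses.
        losers = {
            p: [v for v in todo[p]
                if any(color[u] == color[v] and priority[u] < priority[v]
                       for u in adj[v])]
            for p in procs
        }
        for p in procs:
            for v in losers[p]:
                color[v] = None
        todo = losers
    return color, rounds
```

A larger `superstep_size` means fewer communication steps per round but more stale information, and therefore more conflicts to resolve in later rounds.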
12 Sample experimental results: distributed memory. Distance-1 coloring of a 5-point grid graph, 32K by 32K grid, 2D distributed, on IBM Blue Gene/P. [Figure: actual versus ideal compute time in seconds (log scale). Weak scaling panel: grid dimensions up to 32,000 x 32,000 on 1,024 to 16,384 processors. Strong scaling panel: 1,024 to 16,384 processors.] Catalyurek, Dobrian, Gebremedhin, Halappanavar, Pothen, IPDPS 2011.
13 Study on multithreaded platforms. Intel Nehalem: two quad-core chips, two hyperthreads per core, private L1 and L2 cache, shared L3 cache. Sun Niagara 2: two 8-core sockets, 8 hardware threads per core, L1 cache on core, shared L2 cache (8 banks of 512 KB). Cray XMT: 128 processors, 128 hardware thread streams per processor, cache-less, globally accessible shared memory (8 GBytes x 128 = 1 TByte, with hardware shuffling at 64-byte granularity), 3D torus network, hardware support for fine-grain synchronization. [Block diagrams of the three architectures.] Catalyurek, Feo, Gebremedhin, Halappanavar, Pothen. Parallel Computing, 2012.
14 Experimental results: distance-1 coloring. [Figure: time (seconds) versus number of processors on the Cray XMT, Iterative and Dataflow panels, for small-world graphs with 134M to 1B edges; time (seconds) versus number of cores for the Iterative algorithm on Niagara 2 and Nehalem, for a small-world graph with 2^24 = 16M vertices and 134M edges.]
15 2. Approximate update. Idea: minimize synchronization cost by opting for concurrent data structure update with approximate data instead of serialized data structure update with exact data.
16 Approximate update example: Smallest Last ordering. Ordering property (Smallest Last): for $i = n$ to $1$, $v_i$ has the smallest back degree in $V \setminus \{v_n, v_{n-1}, \ldots, v_{i+1}\}$. [Diagram: ordering $v_1, v_2, \ldots, v_i, \ldots, v_{n-1}, v_n$, showing the back degree and forward degree of $v_i$.] $B_\pi(G)$: maximum back degree in the ordering $\pi$. $B^*(G) = \min_\pi B_\pi(G) = B_{SL}(G)$ (min among $n!$ possibilities). $\delta^*(G)$: maximum minimum degree in an induced subgraph of $G$ (max among $2^n$ possibilities). $B_{SL}(G) = \delta^*(G)$.
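As a concrete reference point, a sequential Smallest Last ordering with a degree-indexed bin array can be sketched as below. The function names are hypothetical, and picking the minimum vertex from a bin is an assumption made only so the output is deterministic; a production code would pop an arbitrary vertex in constant time.

```python
def smallest_last_order(adj):
    """Return (order, b_sl): order is v_1..v_n, b_sl the max back degree."""
    deg = {v: len(adj[v]) for v in adj}  # degrees in the remaining graph
    bins = [set() for _ in range(max(deg.values(), default=0) + 1)]
    for v, d in deg.items():
        bins[d].add(v)
    removed = []  # v_n is removed first, v_1 last
    b_sl = 0
    for _ in range(len(adj)):
        # Pick a vertex of smallest degree among the remaining vertices.
        d = next(i for i, b in enumerate(bins) if b)
        v = min(bins[d])
        bins[d].remove(v)
        b_sl = max(b_sl, d)  # d is exactly the back degree of v
        removed.append(v)
        del deg[v]  # mark v as removed
        for u in adj[v]:
            if u in deg:  # move each remaining neighbor one bin down
                bins[deg[u]].remove(u)
                deg[u] -= 1
                bins[deg[u]].add(u)
    return removed[::-1], b_sl


def max_back_degree(adj, order):
    """B_pi(G): largest number of neighbors that precede a vertex in order."""
    pos = {v: i for i, v in enumerate(order)}
    return max((sum(pos[u] < pos[v] for u in adj[v]) for v in order),
               default=0)
```

For example, on a 5-cycle every vertex has degree 2 when the first vertex is removed, so the sketch reports a maximum back degree of 2, which is also the largest minimum degree over the induced subgraphs of the cycle.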
18 Parallelizing SL ordering. Considered two approaches. Approach 1 (Regular): parallelizes the ordering while closely maintaining serial behavior; maintains a global bin array $B$ and local (per thread) bin arrays $B_k$, for $k = 1$ to $p$; needs to deal with three potential problems ("race conditions"): a pair of vertices in an extreme bin are adjacent to each other, removal of multiple vertices from the same bin, and addition of multiple vertices to the same bin. Approach 2 (Relaxed): settle for an approximate solution in favor of increased concurrency; works with only local bin arrays; in updating locations of vertices in the bin structure, approximate dynamic degrees are used.
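One way to picture the Relaxed approach is the simulation below. It is a loose sketch under explicit assumptions, not the published algorithm: threads are imitated by a round-robin loop, vertices are assigned to threads cyclically, and "approximate dynamic degree" is interpreted as a degree that is decremented only when a neighbor owned by the same thread is removed. The names `relaxed_sl_order` and `max_back_degree` are hypothetical.

```python
def relaxed_sl_order(adj, num_threads=2):
    """Approximate Smallest Last ordering using only per-thread bin arrays."""
    vertices = sorted(adj)
    owner = {v: i % num_threads for i, v in enumerate(vertices)}
    approx_deg = {v: len(adj[v]) for v in adj}
    max_deg = max(approx_deg.values(), default=0)
    # bins[t][d]: vertices owned by thread t whose approximate degree is d.
    bins = [[set() for _ in range(max_deg + 1)] for _ in range(num_threads)]
    for v in vertices:
        bins[owner[v]][approx_deg[v]].add(v)
    remaining = [sum(1 for v in vertices if owner[v] == t)
                 for t in range(num_threads)]
    removed = []
    while len(removed) < len(vertices):
        for t in range(num_threads):  # round-robin stands in for concurrency
            if remaining[t] == 0:
                continue
            d = next(i for i, b in enumerate(bins[t]) if b)
            v = min(bins[t][d])
            bins[t][d].remove(v)
            remaining[t] -= 1
            removed.append(v)
            # Update only neighbors in this thread's own bins; degrees of
            # vertices owned by other threads are left stale (approximate).
            for u in adj[v]:
                if owner[u] == t and u in bins[t][approx_deg[u]]:
                    bins[t][approx_deg[u]].remove(u)
                    approx_deg[u] -= 1
                    bins[t][approx_deg[u]].add(u)
    return removed[::-1]


def max_back_degree(adj, order):
    """Largest number of neighbors that precede a vertex in the ordering."""
    pos = {v: i for i, v in enumerate(order)}
    return max((sum(pos[u] < pos[v] for u in adj[v]) for v in order),
               default=0)
```

Because no thread touches another thread's bins, no locking is needed; the price is that the resulting maximum back degree can exceed the exact Smallest Last value.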
19 Experimental results: ordering, scalability, g-graphs. [Figure: time relative to time using 1 thread (%) versus number of threads for graphs g1 to g5; panels SL-Regular and SL-Relaxed.] Patwary, Gebremedhin, Pothen, EuroPar 2011.
20 3. Parallelized search tree. Idea: in a branch-and-bound algorithm, exchange bounds among processors immediately so as to realize superlinear speedup.
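The idea of sharing bounds immediately can be illustrated with a small branch-and-bound maximum clique search in which all workers prune against one shared incumbent. This is a minimal sketch, not the PMC code: the function name `max_clique_parallel` is hypothetical, the bound is the simple "current clique size plus number of candidates" test, and Python threads are used only to show the structure (the interpreter lock prevents real speedup).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def max_clique_parallel(adj, num_workers=4):
    """Return one maximum clique of the graph as a list of vertices."""
    shared = {"best": []}  # incumbent clique, visible to all workers
    lock = threading.Lock()

    def record(clique):
        with lock:
            if len(clique) > len(shared["best"]):
                shared["best"] = clique

    def expand(clique, cand):
        if not cand:
            record(clique)
            return
        for v in sorted(cand):
            # Prune with the latest shared bound. A stale read only makes
            # pruning weaker; it never discards a larger clique.
            if len(clique) + len(cand) <= len(shared["best"]):
                return
            expand(clique + [v], cand & adj[v])
            cand = cand - {v}

    def root(v):
        # Each root subtree contains the cliques whose smallest vertex is v.
        expand([v], {u for u in adj[v] if u > v})

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(root, sorted(adj)))
    return shared["best"]


if __name__ == "__main__":
    # A 4-clique {0, 1, 2, 3} with a short tail 3-4-5.
    g = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
         3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
    print(sorted(max_clique_parallel(g)))
```

A large clique found early in one subtree lets every other worker prune at once, so the parallel run can visit fewer search-tree nodes in total than the serial run, which is how superlinear speedup becomes possible.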
21 Parallelized search tree example: clique algorithms and application. Developed a fast branch-and-bound algorithm for finding a maximum clique. Algorithm applied to: analyze large-scale social and information networks; compute strongly connected components in temporal networks. WWW 2014. Collaborators: Gleich and Rossi (Purdue).
22 Parallelized search tree example: clique algorithms and application (continued). [Figure: log runtime versus log(|V| + |E|), grouped by network type: bio, collab, inter, retweet, tech, web, facebook, social.]
23 Parallelized search tree example: clique algorithms and application (continued). Superlinear speedup due to the parallelized search tree. [Figure: speedup versus number of processors for the instances brock400 4 (331), san (1), san (0.2), brock800 4 (3604), brock400 3 (619), p hat (4), san1000 (1).]
24 Related libraries. ColPack: a package consisting of implementations of algorithms for a variety of graph coloring, vertex ordering and related problems, in support of sparse Jacobian and Hessian computation via Automatic Differentiation. MTCOL: multithreaded codes for select graph coloring and vertex ordering problems. Parallel Maximum Clique Finder (PMC): a fast parallel (shared-memory) implementation for finding maximum cliques in large sparse networks. For further info visit:
25 Broader themes: local computation algorithms; concurrent data structures; resilience.
26 Funding acknowledgements: DOE Office of Science (current); CSCAPES (SciDAC-2); NSF.
27 Thank you!
More informationCost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)
Cost of Concurrency in Hybrid Transactional Memory Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) 1 Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today
More informationComputer Systems CSE 410 Autumn Memory Organiza:on and Caches
Computer Systems CSE 410 Autumn 2013 10 Memory Organiza:on and Caches 06 April 2012 Memory Organiza?on 1 Roadmap C: car *c = malloc(sizeof(car)); c->miles = 100; c->gals = 17; float mpg = get_mpg(c); free(c);
More informationPerformance Impact of Resource Contention in Multicore Systems
Performance Impact of Resource Contention in Multicore Systems R. Hood, H. Jin, P. Mehrotra, J. Chang, J. Djomehri, S. Gavali, D. Jespersen, K. Taylor, R. Biswas Commodity Multicore Chips in NASA HEC 2004:
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationNARCCAP: North American Regional Climate Change Assessment Program. Seth McGinnis, NCAR
NARCCAP: North American Regional Climate Change Assessment Program Seth McGinnis, NCAR mcginnis@ucar.edu NARCCAP: North American Regional Climate Change Assessment Program Nest highresolution regional
More informationHardware Transactional Memory on Haswell
Hardware Transactional Memory on Haswell Viktor Leis Technische Universität München 1 / 15 Introduction transactional memory is a very elegant programming model transaction { transaction { a = a 10; c
More informationA Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures
A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures Georgios Rokos 1, Gerard Gorman 2, and Paul H J Kelly 1 1 Software Peroformance Optimisation Group, Department of
More informationMaximum Clique Solver using Bitsets on GPUs
Maximum Clique Solver using Bitsets on GPUs Matthew VanCompernolle 1, Lee Barford 1,2, and Frederick Harris, Jr. 1 1 Department of Computer Science and Engineering, University of Nevada, Reno 2 Keysight
More information1240 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 5, MAY 2017
1240 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 5, MAY 2017 Algorithms for Balanced Graph Colorings with Applications in Parallel Computing Hao Lu, Mahantesh Halappanavar, Daniel
More informationGraphChi: Large-Scale Graph Computation on Just a PC
OSDI 12 GraphChi: Large-Scale Graph Computation on Just a PC Aapo Kyrölä (CMU) Guy Blelloch (CMU) Carlos Guestrin (UW) In co- opera+on with the GraphLab team. BigData with Structure: BigGraph social graph
More informationColPack: Software for Graph Coloring and Related Problems in Scientific Computing
A ColPack: Software for Graph Coloring and Related Problems in Scientific Computing ASSEFAW H. GEBREMEDHIN, Purdue University DUC NGUYEN, Purdue University MD. MOSTOFA ALI PATWARY, Northwestern University
More informationA Parallel Distance-2 Graph Coloring Algorithm for Distributed Memory Computers
A Parallel Distance-2 Graph Coloring Algorithm for Distributed Memory Computers Doruk Bozdağ 1, Umit Catalyurek 1, Assefaw H. Gebremedhin 2, Fredrik Manne 3, Erik G. Boman 4,andFüsun Özgüner 1 1 Ohio State
More informationGrappa: A latency tolerant runtime for large-scale irregular applications
Grappa: A latency tolerant runtime for large-scale irregular applications Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin Computer Science & Engineering, University
More informationInstructor: Randy H. Katz hbp://inst.eecs.berkeley.edu/~cs61c/fa13. Fall Lecture #13. Warehouse Scale Computer
CS 61C: Great Ideas in Computer Architecture Cache Performance and Parallelism Instructor: Randy H. Katz hbp://inst.eecs.berkeley.edu/~cs61c/fa13 10/8/13 Fall 2013 - - Lecture #13 1 New- School Machine
More informationUPCRC. Illiac. Gigascale System Research Center. Petascale computing. Cloud Computing Testbed (CCT) 2
Illiac UPCRC Petascale computing Gigascale System Research Center Cloud Computing Testbed (CCT) 2 www.parallel.illinois.edu Mul2 Core: All Computers Are Now Parallel We con'nue to have more transistors
More information