Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migration


1 Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migration Max Grossman, Gladys Gonzalez, Mauricio Araya-Polo GTC, San Jose March 25, 2014

2 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. Results 6. Conclusions

3 Motivation As heterogeneous clusters become the standard, full resource utilization becomes more challenging. No generally-applicable best practices for irregular workloads on heterogeneous systems. Kirchhoff is a great case study in optimizing holistic resource utilization on huge data sets (TBs).

4 The Problem at Hand (I) What is Kirchhoff? Subsurface imaging algorithm based on ray-tracing. 2 stages: traveltime (TT) computation and migration (MIG). Lots of divergent, irregular kernels.

5 The Problem at Hand (II) Builds on GTC 2013 scheduling work: greedy inter-node task system for job distribution. Two-stage pipeline of tasks: stage 1 is computation-bound, stage 2 is I/O-bound result aggregation tasks. For TT: single compute process per node; multiple threads per process negotiate access to GPUs and CPU cores by donation. For MIG: many compute processes per node, each assigned a single device (CPU or GPU). [Diagram: master node dispatches microjobs to worker nodes; each worker node's manager spawns worker processes]

6 The Problem at Hand (III) While previous work was evaluated on real-world datasets, no one ever requests support for smaller data. For example: 1. GPUs with ~3.5GB of global memory 2. Individual dense matrix size of ~2600MB 3. 5 matrices on-device = ~360% GPU memory utilization from these data structures alone (5 x 2600MB = 13,000MB, roughly 3.6x the ~3.5GB available). Clearly, something more clever is required to scale up to arbitrarily large datasets.

7 Solution Strategy Custom device memory management for TT and MIG, based on the characteristics of each algorithm. Focus on: maximizing utilization of device memory; minimizing CPU-GPU transfers; maximizing overlap (inter-device and inter-node); future-proofing against larger and larger datasets.

8 GPU Implementation (I) MIG Problem Each trace accesses a set of matrices, but a single trace has insufficient computation for the whole GPU. Hundreds to thousands of these matrices in total. Must manage transport of whole matrices to and from GPU. The matrices referenced by a trace can be pre-computed based on trace #.
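
Since the matrix set is a pure function of the trace number, the host can bucket traces into matrix-affine chunks before any device work starts. A minimal host-side sketch in CUDA-flavored C++, with a hypothetical matrices_for_trace helper standing in for the real geometry-derived mapping (not the authors' code):

    #include <map>
    #include <set>
    #include <vector>

    // Hypothetical stand-in: in the real application the table set a trace
    // touches is derived from survey geometry, but it remains a pure
    // function of the trace number.
    std::set<long> matrices_for_trace(long trace_id) {
        return { trace_id / 64, trace_id / 64 + 1 };  // e.g. two nearby tables
    }

    // Bucket traces so that traces referencing the same matrix set form one
    // chunk; each chunk can then run with all of its tables resident at once.
    std::map<std::set<long>, std::vector<long>> group_by_affinity(long n_traces) {
        std::map<std::set<long>, std::vector<long>> chunks;
        for (long t = 0; t < n_traces; t++)
            chunks[matrices_for_trace(t)].push_back(t);
        return chunks;
    }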

9 GPU Implementation (II) MIG - Memory Management Group traces into chunks based on matrix affinity. Pre-allocate on device: 1) objects for parameterizing a single trace chunk, and 2) slots for caching tables. [Diagram: static allocations in GPU global memory]
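
A minimal sketch of the slot pre-allocation idea, assuming a hypothetical fixed table size and headroom reservation (an illustration, not the production code):

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    #define TABLE_BYTES (64UL * 1024 * 1024)   // assumed fixed table slot size

    // Grab as many fixed-size table slots as the device can hold at startup,
    // leaving headroom for kernel-parameter objects and scratch space.
    std::vector<void*> preallocate_slots() {
        std::vector<void*> slots;
        size_t free_b = 0, total_b = 0;
        size_t headroom = 256UL * 1024 * 1024;  // assumed reservation
        cudaMemGetInfo(&free_b, &total_b);
        while (free_b > headroom + TABLE_BYTES) {
            void *p = nullptr;
            if (cudaMalloc(&p, TABLE_BYTES) != cudaSuccess) break;
            slots.push_back(p);
            cudaMemGetInfo(&free_b, &total_b);
        }
        printf("pre-allocated %zu table cache slots\n", slots.size());
        return slots;
    }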

10 GPU Implementation (III) MIG - Model Dedicated I/O and compute threads. The host side must manage: 1. Kernel objects 2. TT caching slots 3. CUDA streams, events. A working set object stores the work for a single kernel invocation: trace chunks, associated matrix cache slots, etc. Memory-mapped flags enable immediate resource release. Dynamic device selection based on resource availability. [Diagram legend: working set = CUDA stream + CUDA event + kernel parameters]
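
A minimal sketch of one plausible working-set layout with a zero-copy completion flag; the struct and field names are assumptions, not the authors' definitions:

    #include <cuda_runtime.h>

    struct WorkingSet {
        cudaStream_t  stream;     // all copies/kernels for this set queue here
        cudaEvent_t   done_evt;   // recorded after the kernel for coarse waits
        volatile int *done_host;  // host-mapped flag the kernel writes on exit,
        int          *done_dev;   //   letting the host recycle slots immediately
    };

    void init_working_set(WorkingSet *ws) {
        cudaStreamCreate(&ws->stream);
        cudaEventCreateWithFlags(&ws->done_evt, cudaEventDisableTiming);
        cudaHostAlloc((void**)&ws->done_host, sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&ws->done_dev, (void*)ws->done_host, 0);
        *ws->done_host = 0;  // host polls this instead of a full stream sync
    }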

11 GPU Implementation (IV) MIG - Cache Management Cache management uses reference counting to track in-use cache slots. When placing a set of tables on the GPU for a trace chunk, the host simply checks for tables already in-cache and for sufficient free slots. Exclusive device access and pre-determined table accesses imply no forced eviction. If resource allocation for a particular chunk fails (no more streams, events, cache slots, etc.): 1. The working set currently being populated is immediately launched on the GPU 2. The current trace chunk is executed multi-threaded on the CPU
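
A host-side sketch of that lookup under the stated assumptions (reference-counted slots, no forced eviction); the helper and field names are hypothetical:

    #include <unordered_map>
    #include <vector>

    struct Slot { int refs = 0; long table_uid = -1; };

    // Return the slot holding table 'uid', claiming a free slot on a miss, or
    // -1 when no slot is free -- the caller then launches the working set built
    // so far and executes the current trace chunk multi-threaded on the CPU.
    int claim_slot(std::vector<Slot> &slots,
                   std::unordered_map<long, int> &where, long uid) {
        auto it = where.find(uid);
        if (it != where.end()) {          // cache hit: bump the refcount
            slots[it->second].refs++;
            return it->second;
        }
        for (int i = 0; i < (int)slots.size(); i++) {
            if (slots[i].refs == 0) {     // free slot: reuse, never evict in-use
                where.erase(slots[i].table_uid);
                slots[i].refs = 1;
                slots[i].table_uid = uid;
                where[uid] = i;           // caller schedules the H2D table copy
                return i;
            }
        }
        return -1;                        // cache full: fall back to CPU path
    }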

12 GPU Implementation (V) TT Problem Small number of massive, 3D, global, read-only matrices shared by all tasks and all devices. Unpredictable access patterns. Multiple threads share each GPU kernel pipeline, each using different access patterns.

13 GPU Implementation (VI) TT - Cache Design Partition global matrices into sub-matrices of fixed size N x N x N. Pre-allocate as many cache slots as possible on GPU. On-device reference tracking: track all references to global matrices on-device by submatrix UIDs; track cache hits and misses. [Diagram: a global matrix tiled into N x N sub-matrices]
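
A minimal device-side sketch of UID-based reference tracking, assuming a linearized block index serves as the UID (N, the layout, and all names are assumptions):

    #define N 32   // assumed submatrix edge length

    // Map a global (x, y, z) access to the UID of its enclosing N^3 submatrix.
    __device__ long submatrix_uid(int x, int y, int z, int dimx, int dimy) {
        long nbx = (dimx + N - 1) / N;   // blocks along x
        long nby = (dimy + N - 1) / N;   // blocks along y
        return ((long)(z / N) * nby + y / N) * nbx + x / N;
    }

    // Each access bumps a per-submatrix counter; the host reads these back
    // between iterations to decide which tables to keep cached.
    __device__ void record_reference(unsigned *ref_counts, long uid) {
        atomicAdd(&ref_counts[uid], 1u);
    }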

14 GPU Implementation (VII) TT - Caching Policy Use an LRU-like caching policy on the host to manage cache slots. Multiple threads share a single device's memory. Iteratively call kernels until they converge on completion; refresh the cache on each iteration. Lock rlock during kernel launch to ensure a consistent mapping and slots.

    Given:
      - global submatrix -> cache slot mapping
      - thread-local reference history for every thread
      - all cache hits and misses for a kernel instance

    for each submatrix s:
        if s in hits or s in misses:
            mark s as MRU in history
        if s in misses and !s.is_cached:
            wlock.lock()
            if !s.is_cached:
                slot = select_victim(mapping, histories)
                update_slot(slot, s)
    if own(wlock):
        update_mapping(mapping)
        wlock.unlock()
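
One plausible select_victim for the LRU-like policy above, sketched on the host (all names hypothetical): evict the cached submatrix whose most recent use, across every thread's history, is oldest.

    #include <unordered_map>
    #include <vector>

    // histories[t][uid] = logical timestamp of thread t's last use of 'uid'.
    // 'mapping' maps cached submatrix uid -> slot index.
    int select_victim(const std::unordered_map<long, int> &mapping,
                      const std::vector<std::unordered_map<long, long>> &histories) {
        long best_time = -1;
        int victim = -1;
        for (auto &kv : mapping) {          // only cached submatrices are candidates
            long last = 0;
            for (auto &h : histories) {     // most recent use across all threads
                auto it = h.find(kv.first);
                if (it != h.end() && it->second > last) last = it->second;
            }
            if (victim < 0 || last < best_time) {
                best_time = last;           // oldest last-use seen so far
                victim = kv.second;
            }
        }
        return victim;                      // slot to repurpose (LRU victim)
    }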

15 Results (I) 2-socket systems, 6-core Intel Xeon X5675; NVIDIA K10; InfiniBand QDR; Intel SW stack; Panasas FS; CUDA 5.0; compiler flags: prec-sqrt=true, prec-div=true, fmad=false

16 Results (II) TT compares 1 core + 1 GPU against 16 CPU cores. MIG compares 16 cores + 1 GPU against 16 CPU cores. [Table: Data Set / Legacy Execution Time / GPU Execution Time / Speedup, for data sets TT A-G and MIG A; numeric values not preserved in the transcription] For comparison, GTC13 results showed ~3.5x TT speedup, ~1.15x MIG speedup.

17 Results (III) Correlation between points being processed and table cache behavior. [Chart: # tables evicted, # reported tables referenced, # reported cache misses, and # points processed, per time step]

18 Results (IV) Correlation between cache behavior and per-iteration execution time. [Chart: # tables evicted, # reported tables referenced, # reported cache misses, table cache capacity, and iteration execution time (s), per time step]

19 Conclusions This work expands on the full resource utilization work from GTC 2013. Performance improvements from GPUs for both MIG and TT, even with caching overheads. Kirchhoff on GPU is future-proofed against the next wave of ever-growing datasets.

20 Backup Slides

21 Results (II) Overall wallclock, including I/O, etc. (absolute times not preserved; relative speedups shown):

    TT                  Set A    Set B
    Legacy              1.00x    1.00x
    GPU only            0.83x    1.37x
    Static Hybrid       1.09x    1.35x
    Dynamic Hybrid      1.53x    1.78x

    MIG                 Set A    Set B
    Legacy              1.00x    1.00x
    Dynamic Hybrid      1.03x    1.17x

Good speedup for TT, though a significant amount of sequential code remains and only a relatively small part of the application was accelerated. Poor speedup for MIG.

22 Results (III) Further analysis of TT performance: slowdown for 16 of 1596 shots; Min=0.84x, Max=3.65x. [Chart: per-shot speedup, Data Set A]

23 Results (IV) Further analysis of MIG performance: recall the microjob pipeline; Stage 1 is compute-bound, Stage 2 is I/O-bound. [Chart: microjobs completed over time (m) for Legacy Stage 1, Legacy Stage 2, Hybrid Stage 1, Hybrid Stage 2] For MIG Stage 1, Dynamic Hybrid takes 1320 m on Set A (3.73x over Legacy) and 1200 m on Set B (3.30x).

24 Results (V) Further analysis of MIG performance: many microjobs were too small to fully utilize the resources of the device. For the actual kernels: 104 kernels ran on GPU, 22 kernels on CPU; maximum speedup of 35.3x. [Chart: speedup per MIG kernel]

25 Acknowledgements Repsol; Gladys Gonzalez; Mauricio Araya-Polo (now at Shell E&P); NVIDIA
