Multicore from an Application's Perspective. Erik Hagersten, Uppsala Universitet


1 Multicore from an Application's Perspective. Erik Hagersten, Uppsala Universitet

2 Communication in an SMP [Figure: memory locations A and B in shared memory, three threads each behind a private cache ($); access trace: Read A, Read A, Read A, ..., Read A, Write A, Read B, Read A]

3 Communication in a NUMA: Worse [Figure: two NUMA nodes, each with memory (Mem), a cache ($) and a thread, connected by an interconnect with remote-access latency ~300-500 ns; access trace: Read A, Read A, Read A, ..., Read A, Write A]

4 Communication in [some] Multicores [Figure: chip with four simple, fast CPU cores, each with a local cache ($), sharing an on-chip L2$, a memory interface (Mem I/F) to Mem (~ns latency), and an external I/F; two threads on the chip communicate by one reading A and the other writing A]

5 Looks and Smells Like an SMP? [Figure: side-by-side comparison. SMP: CPUs with private L1 and L2 caches under an interconnect and shared memory. Multicore: cores with small private L1 caches sharing one on-chip L2 and one path to memory.] Well, how about: cost of thread communication? Cache capacity per thread? Memory bandwidth per thread? Gotta optimize for locality!

6 Criteria for HPC Algorithms. Past: minimize communication, maximize scalability (large numbers of CPUs). Multicores today: communication is for free [on some multicores], scalability is limited to 32 threads, the caches are tiny, and memory bandwidth is scarce. Data locality is key! (Both for capacity and capability computing!)

7 Case Study: Gauss-Seidel on Multicores. Erik Hagersten, Sweden. Thanks: Dan Wallin (arch), Henrik Löf (sci comp) and Sverker Holmgren (sci comp). From Wallin et al., ICS 2006.

8 Natural Order Gauss-Seidel [Figure: grid sweep; legend: sweep path, previous value, current value, data dependence, iteration numbers 1,2,3,4, cacheline layout]

9 Natural Order Gauss-Seidel. IF (convergence_test) <done> else <iterate again> [Figure: legend as before: sweep path, previous/current values, data dependence, iteration numbers 1,2,3,4, cacheline layout]

10 Natural Order Gauss-Seidel [Figure: legend as before] Data dependence => poor parallelism
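To make the dependence concrete, here is a minimal sketch of one natural-order sweep in code (my illustration, assuming a 5-point Laplace-style stencil on an N x N grid; the slides show no source code). Updating the grid in place is exactly what serializes the sweep:

```c
/* Minimal sketch of one natural-order Gauss-Seidel sweep
 * (assumed 5-point stencil, row-major N x N array u).
 * Each point reads the west and north neighbors that were
 * already updated earlier in this same sweep -- the data
 * dependence that ruins parallelism. */
void gs_natural_sweep(double *u, int N)
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            u[i*N + j] = 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j] +
                                 u[i*N + j-1]   + u[i*N + j+1]);
}
```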

11 Red-Black Gauss-Seidel [Figure: checkerboard grid; black points carry iteration label 0,5; legend as before: sweep path, previous/current values, data dependence, iteration numbers 1,2,3,4, cacheline layout]

12 Red-Black Gauss-Seidel, step 0,5: update the blacks [Figure: as before; black points now at iteration 0,5]

13 Red-Black Gauss-Seidel, step 1: update all reds. Update all blacks <barrier> Update all reds <barrier> [Figure: legend as before] Great parallelism!!!

14 Red-Black Gauss-Seidel, parallel version: threads 0-4 each take a stripe of the grid. IN PARALLEL { Update all blacks <barrier> Update all reds <barrier> } [Figure: legend as before]
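As one concrete rendering of this pseudocode (the OpenMP choice is mine, not the slides'), the two implicit barriers after the parallel loops correspond to the <barrier> steps above:

```c
/* Sketch of the parallel red-black iteration (OpenMP assumed).
 * Points with (i+j) odd are "black", (i+j) even are "red";
 * all points of one color can be updated independently. */
#include <omp.h>

void rbgs_iteration(double *u, int N)
{
    /* update all blacks; implicit barrier at loop end */
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = (i % 2 == 0) ? 1 : 2; j < N - 1; j += 2)
            u[i*N + j] = 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j] +
                                 u[i*N + j-1]   + u[i*N + j+1]);

    /* update all reds; implicit barrier at loop end */
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = (i % 2 == 0) ? 2 : 1; j < N - 1; j += 2)
            u[i*N + j] = 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j] +
                                 u[i*N + j-1]   + u[i*N + j+1]);
}
```

Note that each full iteration now streams the whole grid twice, once per color, which is exactly the drawback the next slide raises.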

15 Any Drawbacks of the Red-Black? Poor cache locality: red-black brings each element into the cache twice per iteration, whereas natural order brings each element in once per iteration. You can do even better: natural order with temporal blocking.

16 G-S, temporal blocking: several iterations per sweep [Figure: legend as before, plus the active region]

17 G-S, temporal blocking: several iterations per sweep [Figure: legend as before, plus the active region] In this case: 4 iterations per sweep (σ = 4). σ = 1 for natural-order G-S; σ = 0,5 for red-black G-S.
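The skewing behind "several iterations per sweep" is easiest to see in 1D; the sketch below is my illustration, not code from the slides. Iteration s+1 trails iteration s by one point inside the moving active region, so every update still sees its left neighbor at the new iteration and its right neighbor at the previous one:

```c
/* Sketch of temporally blocked Gauss-Seidel in 1D: one pass of
 * the "front" applies sigma iterations to every interior point
 * of u[0..n-1]. At each front position, iteration s updates
 * point front-s, so later iterations trail earlier ones and
 * the Gauss-Seidel dependences are respected. Only the active
 * region (about sigma points wide) must stay cache-resident. */
void gs1d_temporal(double *u, int n, int sigma)
{
    for (int front = 1; front < n - 1 + sigma; front++)
        for (int s = 0; s < sigma; s++) {
            int i = front - s;      /* iteration s+1 applied at point i */
            if (i >= 1 && i < n - 1)
                u[i] = 0.5 * (u[i-1] + u[i+1]);
        }
}
```

Each element is now updated σ times per pass over memory, so the bandwidth demand drops roughly by a factor of σ, which is what the miss-ratio graph on slide 20 shows.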

18 G-S 3D, σ=2

19 G-S 3D, σ=2

20 Acumem Graph, 3D, N=129 [Figure: cache miss ratio (percent) vs. cache size (512 KB up to 32 MB), one curve per σ = 0,5, 1, 2, 4, 8, 16] Miss ratio ~ memory bandwidth

21 Parallel G-S, temporal blocked: threads 0-3 sweep in parallel, separated by synchronization flags. [Figure: legend as before, plus active region and sync-flag iteration number] Wait until lefty is done: lots of communication (producer/consumer flag plus sharing of data values).
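A hedged sketch of the "wait until lefty is done" flags (names, granularity, and the C11-atomics choice are mine, not the paper's code): each thread publishes the iteration it has completed, and its right neighbor spins on that flag before advancing.

```c
/* Sketch of producer/consumer synchronization flags between
 * neighboring threads (illustrative only). */
#include <stdatomic.h>

#define MAX_THREADS 32

/* last iteration each thread has completed on its boundary */
atomic_int done_iter[MAX_THREADS];

/* consumer side: thread t waits until its left neighbor has
 * finished iteration s -- "wait until lefty is done" */
void wait_for_lefty(int t, int s)
{
    if (t == 0) return;              /* leftmost thread has no producer */
    while (atomic_load_explicit(&done_iter[t-1],
                                memory_order_acquire) < s)
        ;                            /* spin: this is the communication */
}

/* producer side: publish completion of iteration s */
void signal_done(int t, int s)
{
    atomic_store_explicit(&done_iter[t], s, memory_order_release);
}
```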

22 Parallel G-S 3D [Figure: four threads t0-t3 sweeping the 3D grid]

23 Parallel G-S 3D, cacheline layout (size B bytes), threads t0-t3. Startup cost = (#threads-1)/(Nσ). Communication: one flag synchronization per N²/#threads ops; one communication miss per B·N/#threads bytes.
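To make these formulas concrete (example numbers assumed for illustration, not from the slides): with N = 257, 4 threads, and σ = 4, the startup cost is (4-1)/(257·4) ≈ 0.3% of a sweep, and each thread performs only one flag synchronization per 257²/4 ≈ 16,500 updates, so the synchronization overhead stays small.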

24 Parallel Execution Time [Figure: execution time per step (sec), y-axis 0,5 to 2,5, for red-black (RBGS) vs. temporally blocked natural-order (TBGS) G-S at several thread counts]

25 Performance Comparison with Red-Black (σ = 0,5). N = 257, 32 threads (Sun E15K, UltraSPARC IIIcu = SMP!!) [Figure: performance ratio TBGS/RBGS, y-axis 1,0 to 3,5, vs. number of threads, one curve per σ = 1, 2, 4, 8, 16]

26 Multicore Simulation [Figure: performance ratio TBGS/RBGS, y-axis 0 to 2,5, vs. number of threads, comparing the Sun 15k SMP with a simulated multicore]

27 Using a Gauss-Seidel Smoother in a Multigrid. G-S is an important part of many real apps! Example: G-S as the smoother in multigrid, an iterative algorithm; a more efficient smoother cuts the number of iterations.

28 One-slide summary. Today's algorithms assume expensive communication. The communication cost of [some] multicores is close to zero. Locality is becoming key to performance [again]. Redesign HPC algorithms to face this fact! (For both capacity and capability computing.) We show: 3x performance gain and ~3x less bandwidth. Is it time to revisit more algorithms?
