Module 5: Performance Issues in Shared Memory and Introduction to Coherence
Lecture 9: Performance Issues in Shared Memory


The Lecture Contains:
Data Access and Communication
Data Access
Artifactual Communication
Capacity Problem
Temporal Locality
Spatial Locality
2D to 4D Conversion
Transfer Granularity
Worse: False Sharing
Contention
Hot-spots
Overlap
Summary

Data Access and Communication

The memory hierarchy (caches and main memory) plays a significant role in determining communication cost and may easily dominate the inherent communication of the algorithm.
On a uniprocessor, the execution time of a program is given by useful work time + data access time.
Useful work time is normally called the busy time or busy cycles.
Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality.

Data Access

In multiprocessors, every processor would like to see the memory interface as just its own local cache and the main memory; in reality it is much more complicated.
If the system has a centralized memory (e.g., SMPs), there are still the caches of other processors; if the memory is distributed, then some part of it is local and some is remote.
For shared memory, data movement from local or remote memory to cache is transparent, while for message passing it is explicit.
View a multiprocessor as an extended memory hierarchy where the extension includes the caches of other processors, remote memory modules, and the network topology.

Artifactual Communication

Artifactual communication is communication caused by artifacts of the extended memory hierarchy: data accesses not satisfied in the cache or local memory cause communication.
Inherent communication is caused by data transfers determined by the program itself.
Artifactual communication is caused by poor allocation of data across distributed memories, unnecessary data in a transfer, unnecessary transfers due to system-dependent transfer granularity, redundant communication of data, and finite replication capacity (in cache or memory).
Inherent communication assumes infinite capacity and perfect knowledge of what should be transferred.

Capacity Problem

Finite capacity is the most probable reason for artifactual communication: the cache, the local memory, and the remote memory all have limited capacity.
For this purpose a multiprocessor may be viewed as a three-level memory hierarchy: local cache, local memory, remote memory.
Communication due to cold (compulsory) misses and inherent communication is independent of capacity, while capacity and conflict misses generate communication that results from finite capacity.
The generated traffic may be local or remote depending on the allocation of pages.
The general technique is to exploit spatial and temporal locality to use the cache properly.

Temporal Locality

Maximize reuse of data: schedule tasks that access the same data in close succession.
Many linear algebra kernels use blocking of matrices to improve temporal (and spatial) locality.
Example: the transpose phase in the Fast Fourier Transform (FFT); to improve locality, the algorithm carries out a blocked transpose, i.e., it transposes a block of data at a time (a sketch follows below).

Spatial Locality

Consider a square block decomposition of the grid solver and a C-like row-major layout, i.e., A[i][j] and A[i][j+1] occupy contiguous memory locations.
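The following is a minimal C sketch of the blocked transpose idea mentioned under Temporal Locality; the matrix size N, the tile size B, and the function name are illustrative assumptions, not values from the lecture. Each B x B tile is small enough to stay in cache, so the lines of both the source and the destination are reused before eviction, which also helps spatial locality for the row-major layout.

/* Blocked (tiled) out-of-place transpose sketch.
 * Assumes N is a multiple of B; both are illustrative choices. */
#define N 1024          /* matrix dimension (assumed)              */
#define B 32            /* tile size, tuned to the cache (assumed) */

void blocked_transpose(const double *A, double *T)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            /* transpose one B x B tile at a time; the tile fits in cache */
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    T[j * N + i] = A[i * N + j];
}

A naive transpose touches a fresh cache line of T on almost every iteration; the tiled version amortizes each fetched line over B uses before moving on.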

2D to 4D Conversion

Essentially you need to change the way memory is allocated: the matrix A must be allocated so that the elements falling within a partition are contiguous.
The first two dimensions of the new 4D matrix are the block row and block column indices, i.e., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors).
The next two dimensions hold the data elements within that partition.
Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P].
The element B[3][2][5][10] corresponds to the element in the 10th column and 5th row of the partition of P14.
Now all elements within a partition have contiguous addresses (see the allocation sketch after this section).

Transfer Granularity

How much data do you transfer in one communication?
For message passing it is explicit in the program.
For shared memory it is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of the cache hierarchy).
In shared memory you have to be careful: since the minimum transfer size is a cache line, you may end up transferring extra data, e.g., in the grid solver the elements of the left and right neighbors for a square block decomposition (you need only one element, but must transfer the whole cache line); there is no good solution to this.
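Below is a hedged C sketch of the 2D-to-4D conversion described above, assuming an N x N grid and P = 16 processors arranged as a 4 x 4 grid of partitions; the names Q, BLK, and elem() are introduced here only for illustration and do not come from the lecture. The point is that all elements owned by one processor become contiguous in memory.

#define N   1024         /* grid dimension (assumed)        */
#define P   16           /* number of processors (assumed)  */
#define Q   4            /* Q = sqrt(P)                     */
#define BLK (N / Q)      /* elements per partition side     */

/* B[br][bc][r][c]: br, bc select the partition (block row/column);
 * r, c select the element inside that partition. All BLK*BLK
 * elements owned by one processor are now contiguous. */
static float B[Q][Q][BLK][BLK];

/* Map a global index (i, j) of the original 2D matrix A[i][j]
 * to the corresponding element of the 4D array. */
static inline float *elem(int i, int j)
{
    return &B[i / BLK][j / BLK][i % BLK][j % BLK];
}

With this layout, the partition of P14 is B[3][2][...][...], matching the B[3][2][5][10] example above, and a processor sweeping over its own partition walks through one contiguous region of memory.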

Worse: False Sharing

If the algorithm is designed so poorly that two processors write to two different words within a cache line at the same time, the cache line keeps moving between the two processors.
The processors are not really accessing or updating the same element, but whatever they are updating happens to fall within the same cache line: this is not true sharing, but false sharing.
For shared memory programs, false sharing can easily degrade performance by a large amount.
It is easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (this wastes memory, but improves performance); a padding sketch follows below.

Contention

It is very easy to ignore contention effects when designing algorithms, yet contention can severely degrade performance by creating hot-spots.
Location hot-spot: consider accumulating into a global variable; the accumulation takes place on a single node, i.e., every node accesses the variable allocated on that particular node whenever it tries to increment it.
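As an illustration of the padding remedy for false sharing, here is a minimal C sketch in which each thread owns a counter padded out to a full cache line; the 64-byte line size, the thread count, and the GCC/Clang-style alignment attribute are assumptions, not details from the lecture.

#define CACHE_LINE 64
#define NTHREADS   8

struct padded_counter {
    long count;
    char pad[CACHE_LINE - sizeof(long)];   /* waste space, avoid false sharing */
};

/* Align the array so each 64-byte element sits on its own cache line
 * (GCC/Clang extension; C11 _Alignas would work as well). */
static struct padded_counter counters[NTHREADS]
        __attribute__((aligned(CACHE_LINE)));

/* Called by thread 'tid'; its writes stay within its own cache line,
 * so no line ping-pongs between processors. */
void increment(int tid)
{
    counters[tid].count++;
}

Without the pad field, eight adjacent long counters would share one 64-byte line and every increment by one thread would invalidate the line in every other thread's cache.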

Hot-spots

Avoid a location hot-spot either by staggering accesses to the same location or by designing the algorithm to exploit tree-structured communication (a reduction sketch appears below, after the summary).
Module hot-spot: normally happens when a particular node saturates while handling too many messages (they need not be to the same memory location) within a short amount of time.
The normal solution again is to design the algorithm so that these messages are staggered over time.
Rule of thumb: design the communication pattern so that it is not bursty; you want to distribute it uniformly over time.

Overlap

Increase the overlap between communication and computation.
There is not much to do at the algorithm level unless the programming model and/or OS provide primitives for prefetching, block data transfer, non-blocking receive, etc.
Normally, these techniques increase bandwidth demand because you end up communicating the same amount of data, but in a shorter amount of time (execution time hopefully goes down if you can exploit the overlap).

Summary

Parallel programs introduce three overhead terms: busy overhead (extra work), remote data access time, and synchronization time.
The goal of a good parallel program is to minimize these three terms.
The goal of a good parallel computer architecture is to provide sufficient support to let programmers optimize these three terms (and this is the focus of the rest of the course).
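Returning to the hot-spot discussion above, the following is a rough sketch of a tree-structured (logarithmic) combine using POSIX threads; the thread count, the padding factor, and the function names are assumptions for illustration only. Instead of all threads incrementing one global variable, each thread writes its own padded partial sum, and the partial sums are combined pairwise in log2(NTHREADS) steps, so no single location is hammered by every thread.

#include <pthread.h>

#define NTHREADS 8                      /* power of two, for simplicity (assumed) */

static double partial[NTHREADS * 8];    /* one slot per thread, padded to 64 bytes */
static pthread_barrier_t bar;

void tree_reduce_init(void)
{
    pthread_barrier_init(&bar, NULL, NTHREADS);
}

/* Each thread calls this with its private sum; after the last step,
 * thread 0's slot (partial[0]) holds the global total. Callers should
 * barrier once more before reading the result. */
void tree_reduce(int tid, double my_sum)
{
    partial[tid * 8] = my_sum;
    for (int stride = 1; stride < NTHREADS; stride *= 2) {
        pthread_barrier_wait(&bar);     /* wait for the previous level to finish */
        if (tid % (2 * stride) == 0)
            partial[tid * 8] += partial[(tid + stride) * 8];
    }
}

The accesses at each level go to different locations and only half as many threads participate at each step, which both removes the location hot-spot and spreads the traffic over time, in line with the "not bursty" rule of thumb above.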