
Special Course on Computer Architecture #9 Simulation of Multi-Processors Hiroki Matsutani and Hideharu Amano

Outline: Simulation of Multi-Processors
Background [10min]: recent multi-core and many-core processors
Network simulation [20min]: network simulation using gem5; Exercise 1: topology (Mesh, Torus, and Pt2Pt)
Parallel programming [20min]: OpenMP introduction; Exercise 2: performance evaluation using the 48-core machine
Coherence protocols [40min]: full-system simulation using gem5; Exercise 3: coherence protocols, MI vs. MESI

Multi- & many-core architectures. [Figure: number of PEs per chip (2 to 256, log scale; caches not included) vs. year, 2002-2012. Accelerators and graphics processing units integrate many simple PEs (e.g., ClearSpeed CSX600, MIT RAW, picoChip PC102, STI Cell BE, GeForce 8800/GTX280/GTX480, UT TRIPS (OPN), Intel 80-core, TILERA TILE64/TILE-Gx100, Intel SCC, Xeon Phi); chip multi-processors integrate fewer, larger cores (e.g., Sparc T1/T2/T3, Intel Xeon, AMD Opteron, IBM Power7, Fujitsu Sparc64).]

Network-on-Chip (NoC): an interconnection network that connects many cores. [Figure: 16-core tile architecture; each tile pairs a core with a router.]

On-chip router architecture. A packet passes through the router in three steps: 1) routing: selecting an output channel; 2) arbitration for the selected output channel (GRANT from the arbiter); 3) forwarding: sending the packet through the 5x5 crossbar to the output channel. There are five input ports and five output ports (X+, X-, Y+, Y-, CORE), each input buffered by a FIFO. Routing, arbitration, and forwarding are performed in a pipelined manner.
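Step 1, selecting the output channel, is the job of the routing algorithm. Below is a minimal sketch of dimension-order (XY) routing, a common choice for meshes; the slide does not name the simulator's actual algorithm, so this is illustrative only. The port names mirror the slide.

#include <stdio.h>

/* Output ports of the 5x5 router, as on the slide. */
enum port { XPLUS, XMINUS, YPLUS, YMINUS, CORE };

/* Dimension-order (XY) routing: move along X first, then Y.
 * Illustrative sketch only, not the simulator's implementation. */
enum port xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return XPLUS;
    if (dst_x < cur_x) return XMINUS;
    if (dst_y > cur_y) return YPLUS;
    if (dst_y < cur_y) return YMINUS;
    return CORE;  /* arrived: deliver to the local core */
}

int main(void)
{
    /* Route a packet from tile (0,0) toward tile (2,1): X first. */
    printf("first hop: %d\n", xy_route(0, 0, 2, 1));  /* 0 = XPLUS */
    return 0;
}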

Network topologies: 4x4 Mesh, 4x4 Torus, and Point-to-Point. In the point-to-point topology, every router has direct links to all the other routers; note that the figure illustrates links from only a single router.
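The three topologies trade hop count against wiring cost. As a back-of-the-envelope count of bidirectional links (my own arithmetic, useful when discussing Exercise 1): an n x n mesh has 2n(n-1) links, an n x n torus has 2n^2 (the mesh plus wraparound links), and a fully connected point-to-point network of N routers needs N(N-1)/2.

#include <stdio.h>

int main(void)
{
    int n = 4, N = n * n;            /* 4x4 network, 16 routers */
    int mesh  = 2 * n * (n - 1);     /* 24 bidirectional links  */
    int torus = 2 * n * n;           /* 32 (mesh + wraparound)  */
    int pt2pt = N * (N - 1) / 2;     /* 120 (fully connected)   */
    printf("mesh=%d torus=%d pt2pt=%d\n", mesh, torus, pt2pt);
    return 0;
}

Pt2Pt reaches any router in one hop, but at a far higher wiring cost.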

Outline: Simulation of Multi-Processors. Next up: Network simulation [20min], network simulation using gem5; Exercise 1: topology (Mesh, Torus, and Pt2Pt).

Network simulation (1/5) Pick up your account information: username (ca0**) and password. Log in to the machine using two terminals: > ssh <Username>@ikura.arc.ics.keio.ac.jp

Network simulation (2/5) Copy today's sample files to your directory: > cp -r ~matutani/20130614 . > cd 20130614 > ls

Network simulation (3/5) View the network.pl script in the right terminal: > cd 20130614 > vi network.pl Start the network simulation in the left terminal: > ./network.pl

Network simulation (4/5) In network.pl you can see: the injection rates to be measured, the numbers of source and destination nodes, and the topology (4x4 Mesh).

Network simulation (5/5) Draw a graph on your answer sheet. X-axis: injection rate [%]; Y-axis: latency [cycles]. Latency is low and stable at low workloads, then increases rapidly past a certain threshold (saturation).
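The flat region followed by a sharp knee is the classic signature of a queueing system approaching saturation. As a rough analogy only (the simulator does not use this model), an M/M/1 queue's latency grows like 1/(1 - rho) as the utilization rho approaches 1:

#include <stdio.h>

int main(void)
{
    /* Normalized M/M/1 latency ~ 1/(1 - rho): flat at low load,
     * diverging near saturation -- an analogy for the measured curve. */
    for (double rho = 0.1; rho < 1.0; rho += 0.2)
        printf("rho=%.1f  latency=%.2f\n", rho, 1.0 / (1.0 - rho));
    return 0;
}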

Exercise 1 Draw the following graphs on the answer sheet: 4x4 Mesh, 4x4 Torus, and Point-to-Point (Pt2Pt). Modify network.pl appropriately: replace the --topology value with Torus or Pt2Pt (note that --mesh-rows is ignored for Pt2Pt), and add more measuring points to @injection_rate_list for more accurate, smoother graphs. Discuss the results on your answer sheet: which topology is the best, and why?

Outline: Simulation of Multi-Processors. Next up: Parallel programming [20min], an OpenMP introduction; Exercise 2: performance evaluation using the 48-core machine.

ikura.arc.ics.keio.ac.jp

Ex1: Hello World

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    printf("hello world from %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

Ex1: Hello World Modify ex1.c to parallelize it, then compile: > gcc -Wall -fopenmp -o ex1 ex1.c Run ex1 using 1 thread, 4 threads, and 48 threads.
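The thread count can also be chosen at run time through the standard OMP_NUM_THREADS environment variable, with no recompilation (the exercise may instead intend calling omp_set_num_threads() in the code):
> OMP_NUM_THREADS=4 ./ex1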

Ex2: Parallel for loop

int main(int argc, char *argv[]) {
    int i, num;
    double start_time, end_time;      /* A[] and N are defined globally */
    num = atoi(argv[1]);
    start_time = omp_get_wtime();
    omp_set_num_threads(num);
    #pragma omp parallel shared(A) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            A[i] = A[i] * A[i] - 3.0;
    }
    end_time = omp_get_wtime();
    printf("elapsed time with %d CPUs: %f sec\n", num, end_time - start_time);
    return 0;
}

The #pragma omp for directive splits the loop iterations among the threads, which execute them in parallel.

Ex2: Parallel for loop Modify ex2.c to parallelize it, then compile: > gcc -Wall -fopenmp -o ex2 ex2.c Run ex2 using 1 thread and 4 threads.

Ex3: Reduction

int main(int argc, char *argv[]) {
    int i, num;
    double s = 0.0;
    double start_time, end_time;
    num = atoi(argv[1]);
    start_time = omp_get_wtime();
    omp_set_num_threads(num);
    #pragma omp parallel private(i) reduction(+:s)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            s += (4.0 / (4 * i + 1) - 4.0 / (4 * i + 3));
    }
    printf("pi = %f\n", s);
    end_time = omp_get_wtime();
    printf("elapsed time with %d CPUs: %f sec\n", num, end_time - start_time);
    return 0;
}

Each thread accumulates into a local copy of s; when the parallel region ends, the local copies are combined (reduced) into the single shared variable. Reduction is useful whenever partial results must be summed into one variable.
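To see why reduction(+:s) matters, consider what happens if s is simply shared: the threads' concurrent read-modify-write updates race, and additions are lost. Below is a minimal broken-on-purpose sketch for contrast (my illustration, not one of the lab files):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    double s = 0.0;
    /* BROKEN on purpose: s is shared and updated without reduction,
     * so concurrent "s += ..." updates can overwrite each other. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        s += 4.0 / (4 * i + 1) - 4.0 / (4 * i + 3);
    printf("pi (racy) = %f\n", s);  /* usually wrong with >1 thread */
    return 0;
}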

Exercise 2 Report the execution times of ex2 and ex3 using 1, 4, 16, 32, and 100 threads.

Num of threads         1    4    16   32   100
Execution time of Ex2
Execution time of Ex3

Does the execution time decrease linearly as the number of threads increases? Discuss the results on your answer sheet.
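Perfectly linear scaling is rare in practice. One classic reason is Amdahl's law: if only a fraction f of the run time is parallelizable, the speedup on p threads is bounded by S(p) = 1 / ((1 - f) + f / p). A small sketch, with f = 0.95 as an assumed (not measured) parallel fraction:

#include <stdio.h>

/* Amdahl's law: speedup with parallel fraction f on p threads. */
static double amdahl(double f, int p)
{
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void)
{
    double f = 0.95;                 /* assumed parallel fraction */
    int threads[] = {1, 4, 16, 32, 100};
    for (int i = 0; i < 5; i++)
        printf("p=%3d  speedup=%.2f\n", threads[i], amdahl(f, threads[i]));
    return 0;
}

Note also that 100 threads oversubscribe the 48-core machine, so thread-scheduling overhead can make the largest configuration slower rather than faster.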

Outline: Simulation of Multi-Processors. Next up: Coherence protocols [40min], full-system simulation using gem5; Exercise 3: coherence protocols, MI vs. MESI.

Today's target architecture: chip multi-processors (CMPs). Multiple processors, each with a private L1 cache; a shared L2 cache divided into multiple banks (SNUCA); processors and L2 cache banks connected via a NoC. [Figure: each tile contains an x86-64 CPU, L1 caches (I and D), an L2 cache bank, and an on-chip router.]

A cache coherence example. Write-back policy: a cache write updates main memory only when the block is evicted. Write-invalidate policy: a cache write invalidates all copies held by the other sharers. [Figure: tiles and main memories.]
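As a toy illustration of the write-invalidate policy (my sketch, not the simulator's protocol code): when one CPU writes a block, every other cached copy of that block becomes invalid.

#include <stdio.h>

#define NCPUS 4
enum { INVALID, VALID };
int copy[NCPUS] = {VALID, VALID, VALID, VALID};  /* all CPUs cache the block */

/* Write-invalidate: the writer keeps its copy; all other sharers lose theirs. */
void write_block(int writer)
{
    for (int c = 0; c < NCPUS; c++)
        if (c != writer) copy[c] = INVALID;
}

int main(void)
{
    write_block(1);  /* CPU1 writes the block */
    for (int c = 0; c < NCPUS; c++)
        printf("CPU%d: %s\n", c, copy[c] == VALID ? "valid" : "invalid");
    return 0;
}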

A cache coherence example. A CPU wants to read a block cached at another tile: the CPU sends a read request to the memory controller; the controller forwards the request to the current owner; the owner sends the block to the requestor. [Figure: tiles and main memories.]

Coherence protocols: the MOESI class. The status of each cache block is represented by one of M/O/E/S/I:
Modified (M): modified (i.e., dirty); valid in exactly one cache.
Owned (O): may or may not be clean; exists in multiple caches but is owned by one cache, and the owner is responsible for responding to any requests.
Exclusive (E): clean; exists in exactly one cache.
Shared (S): shared by multiple CPUs.
Invalid (I): not valid.
MOESI-class protocols: MSI, MOSI, MESI, MOESI, ...

Cache coherence protocols
MSI protocol: The E state is not implemented. If a block is cached exclusively, a write to it would not need to update main memory; but MSI cannot tell whether a block is cached exclusively, so every S-to-M transition updates the main memory.
MESI protocol: The O state is not implemented, so dirty sharing is not allowed; every M-to-S transition updates the main memory.
MOESI protocol: The O state is added, so dirty sharing is possible.

MSI protocol: state transition. Legend: CpuRd = CPU read, BusRd = bus read, CpuWr = CPU write, BusWr = bus write; Flush = main-memory write. [State-transition diagram over M, S, I: I goes to S on CpuRd (issuing BusRd) and to M on CpuWr (issuing BusWr); S goes to M on CpuWr (issuing BusWr); on an observed BusRd, M flushes to memory and moves to S; on an observed BusWr, S and M move to I.] Key point: S-to-M transitions flush (update) the main memory.
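The CPU-side half of this diagram can also be written as a small transition function. A sketch under the slide's protocol (illustrative only; bus-side snoop events and transient states are omitted):

#include <stdio.h>

enum state { I, S, M };
enum cpu_event { CPU_RD, CPU_WR };

/* CPU-side MSI transitions from the slide's diagram (sketch only;
 * bus-side events such as BusRd/BusWr snoops are not handled here). */
enum state msi_cpu(enum state st, enum cpu_event ev)
{
    switch (st) {
    case I: return (ev == CPU_RD) ? S : M;  /* I->S on read, I->M on write */
    case S: return (ev == CPU_WR) ? M : S;  /* S->M issues BusWr (flush)   */
    case M: return M;                       /* hits: no bus traffic        */
    }
    return I;  /* unreachable */
}

int main(void)
{
    enum state st = I;
    st = msi_cpu(st, CPU_RD);  /* I -> S */
    st = msi_cpu(st, CPU_WR);  /* S -> M */
    printf("final state: %d (2 = M)\n", st);
    return 0;
}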

MESI protocol: state transition. Legend: BusRd(C) = bus read when another cache holds the block, BusRd(!C) = bus read when no other cache holds it; Flush = main-memory write; FlushOpt = cache-to-cache transfer; BusUpgr = bus upgrade. [State-transition diagram over M, E, S, I: I goes to E on CpuRd with BusRd(!C) and to S on CpuRd with BusRd(C); S goes to M on CpuWr via BusUpgr; on an observed BusRd, M flushes to memory and moves to S, while E and S supply the block via FlushOpt.] Key point: M-to-S transitions flush (update) the main memory.

MOESI protocol: state transition (1/2). MOESI reduces memory bandwidth compared to MESI. [CPU-side state-transition diagram over M, O, E, S, I: I goes to E on CpuRd with BusRd(!C) and to S on CpuRd with BusRd(C); S and O go to M on CpuWr via BusUpgr. Legend: C = another cache holds the block, !C = no other cache holds it.]

MOESI protocol: state transition (2/2). MOESI reduces memory bandwidth compared to MESI. [Bus-side state-transition diagram over M, O, E, S, I, with Flush (main-memory write), FlushOpt (cache-to-cache transfer), and BusUpgr events; the O state lets a dirty block be supplied cache-to-cache, avoiding main-memory writes.]

Full-system: OS boot (1/6) Log in to the machine using two terminals: > ssh <Username>@ikura.arc.ics.keio.ac.jp > cd 20130614 Do not launch more than two terminals: the 48-core machine is shared by up to 42 students.

Full-system: OS boot (2/6) Boot the Linux OS on the simulator from the right terminal: > make boot Very important: remember the port number printed by the simulator (e.g., Port number: 3456); it changes each time.

Full-system: OS boot (3/6, 4/6) Connect to the simulator from the left terminal: > telnet localhost <YourPortNumber> Very important: specify the port number you have just found in the right terminal; using a wrong port number may peek at another student's session. You will see Linux boot messages.

Full-system: OS boot (5/6) The Linux OS will boot in 5-10 minutes. You can then log in to the Linux running on the simulator; try cd /, ls, and more. This is the fast simulation mode, without detailed cache behavior.

Full-system: OS boot (6/6) Dump a checkpoint from the left terminal: (none)/# m5 checkpoint Using the checkpoint, you can resume the simulation at any time. Then exit the simulation: type Ctrl-C in the right terminal.

Full-system: Simulation strategy
1. Type make boot: boot Linux using the fast (but inaccurate) simulation mode, which does not model cache behavior; dump a checkpoint and then exit.
2. Type make exec_mi or make exec_mesi: resume the simulation from the checkpoint using the accurate (but slow) mode, which models the memory, caches, and interconnection network; execute a benchmark program and count the execution cycles.
Cache coherence protocols compared: MI and MESI.

Full-system: MI (1/4) Resume the simulation from the checkpoint in the right terminal: > make exec_mi Very important: remember the port number (it changes each time).

Full-system: MI (2/4) Connect to the simulator from the left terminal: > telnet localhost <YourPortNumber> Very important: specify the port number you have just found in the right terminal; using a wrong port number may peek at another student's session. You can now resume the simulation.

Full-system: MI (3/4) Execute a sample program in the left terminal: (none)/# cd /root (none)/# ./ex2 4 ./ex2 4 runs the program with 4 threads; it takes 10-15 minutes.

Full-system: MI (4/4) The simulation stops automatically after 10-15 minutes. Record the execution cycle count shown in the right terminal.

Full-system: MESI Run the same simulation with the MESI protocol: > make exec_mesi Very important: specify the new port number shown in the right terminal; it changes on every run. Run the Ex2 program again.

Exercise 3 Compare the MI and MESI protocols in terms of the execution cycles of the Ex2 program. Then compare them in terms of the execution cycles of the Ex3 program: (none)/# cd /root (none)/# ./ex3 4 Discuss the results on your answer sheet.

Protocol  Exec cycles of Ex2  Exec cycles of Ex3
MI        39,555,084
MESI