Special Course on Computer Architecture

Special Course on Computer Architecture #9 Simulation of Multi-Processors Hiroki Matsutani and Hideharu Amano

Outline: Simulation of Multi-Processors
  Background [10 min]
    Recent multi-core and many-core processors
  Network simulation [20 min]
    Network simulation using Gem5
    Exercise 1: Topology (Mesh, Torus, and Pt2Pt)
  Parallel programming [20 min]
    OpenMP introduction
    Exercise 2: Performance evaluation using a 48-core machine
  Coherence protocols [40 min]
    Full-system simulation using Gem5
    Exercise 3: Coherence protocols (MI vs. MESI)

Multi- & many-core architectures
[Figure: number of PEs (caches are not included; axis ticks from 2 to 256) versus year (2002 to 2012). Two groups are labeled: chip multi-processors, and accelerators / graphics processing units in which many simple PEs are integrated.]
Plotted chips: ClearSpeed CSX600, MIT RAW, picoChip PC102, STI Cell BE, Geforce 8800, UT TRIPS (OPN), Intel 80-core, TILERA TILE64, Geforce GTX280, Sparc T1, Sparc T2, Intel Xeon, AMD Opteron, IBM Power7, Fujitsu Sparc64, Geforce GTX480, TILE Gx100, Xeon Phi, Intel SCC, Sparc T3

Network-on-Chip (NoC)
Interconnection network that connects many cores
[Figure: 16-core tile architecture, showing cores and routers]

On-chip router architecture
  1) Selecting an output channel
  2) Arbitration for the selected output channel
  3) Sending the packet to the output channel
Routing, arbitration, and forwarding are performed in a pipelined manner
[Figure: router with input ports and output ports X+, X-, Y+, Y-, and CORE; a FIFO on each input port; an arbiter that issues GRANT; and a 5x5 crossbar]

Network topologies
4x4 Mesh, 4x4 Torus, and Point-to-Point
Point-to-Point: every router has direct links to all the other routers. Note that links from only a single router are illustrated in the figure.

Network simulation (1/5)
Pick up your account information: Username (ca0**) and Password
Log in to the machine using two terminals
> ssh <Username>@ikura.arc.ics.keio.ac.jp

Network simulation (2/5)
Copy today's sample files to your directory
> cp -r ~matutani/20130614 .
> cd 20130614
> ls

Network simulation (3/5)
View the network.pl script on the right terminal
> cd 20130614
> vi network.pl
Start the network simulation on the left terminal
> ./network.pl

Network simulation (4/5)
View the network.pl script on the right terminal. It specifies:
  Injection rates to be measured
  Numbers of source and destination nodes
  The topology (4x4 Mesh)

Network simulation (5/5)
Draw a graph on your answer sheet
  X-axis: Injection rate [%]
  Y-axis: Latency [cycles]
Latency is quite low and stable at low workload
Latency increases rapidly after a certain threshold

Exercise 1
Draw the following graphs on the answer sheet
  4x4 Mesh
  4x4 Torus
  Point-to-Point (Pt2Pt)
Modify network.pl appropriately.
  Replace the --topology value with Torus and Pt2Pt. Note that --mesh-rows is ignored for Pt2Pt.
  Add more measuring points to @injection_rate_list for more accurate and smoother graphs.
Discuss the results using your answer sheet
  Which topology is the best? Why?

ikura.arc.ics.keio.ac.jp

Ex1: Hello World
    #include <stdio.h>
    #include <omp.h>

    int main()
    {
        /* Every thread in the parallel region executes the printf. */
    #pragma omp parallel
        printf("hello world from %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

Ex1: Hello World
Modify ex1.c to parallelize it
> gcc -Wall -fopenmp -o ex1 ex1.c
Run ex1 using 1 thread
Run ex1 using 4 threads
Run ex1 using 48 threads

Ex2: Parallel for loop
    /* Excerpt from ex2.c; N and the array A are defined elsewhere in the file. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int i, num;
        double start_time, end_time;

        num = atoi(argv[1]);              /* number of threads */
        start_time = omp_get_wtime();
        omp_set_num_threads(num);
    #pragma omp parallel shared(A) private(i)
        {
            /* Split up the loop iterations among the threads
               and execute them in parallel. */
    #pragma omp for
            for (i = 0; i < N; i++)
                A[i] = A[i] * A[i] - 3.0;
        }
        end_time = omp_get_wtime();
        printf("elapsed time with %d CPUs: %f sec\n", num, end_time - start_time);
        return 0;
    }

Ex2: Parallel for loop
Modify ex2.c to parallelize it
> gcc -Wall -fopenmp -o ex2 ex2.c
Run ex2 using 1 thread
Run ex2 using 4 threads

Ex3: Reduction
    /* Excerpt from ex3.c; N is defined elsewhere in the file. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int i, num;
        double s = 0.0;
        double start_time, end_time;

        num = atoi(argv[1]);              /* number of threads */
        start_time = omp_get_wtime();
        omp_set_num_threads(num);
    #pragma omp parallel private(i) reduction(+:s)
        {
    #pragma omp for
            for (i = 0; i < N; i++)
                s += (4.0 / (4 * i + 1) - 4.0 / (4 * i + 3));
        }
        printf("pi = %f\n", s);
        end_time = omp_get_wtime();
        printf("elapsed time with %d CPUs: %f sec\n", num, end_time - start_time);
        return 0;
    }
The result computed by each thread (in its local copy of s) is combined (reduced) into the global shared variable.
Useful when partial results are summed up into a single variable.

Exercise 2
Report the execution times of ex2 and ex3 using 1, 4, 16, 32, and 100 threads

Num of threads           1     4     16    32    100
Execution time of Ex2
Execution time of Ex3

Does the execution time decrease linearly as the number of threads increases?
Discuss the results using your answer sheet

Today's target architecture
Chip multi-processors (CMPs)
  Multiple processors (each has a private L1 cache)
  Shared L2 cache divided into multiple banks (SNUCA)
  Processors and L2 cache banks are connected via a NoC
[Figure: tiles, each containing an x86-64 CPU, L1 caches (I & D), an L2 cache bank, and an on-chip router]

A cache coherence example
Write-back policy
  A cache write updates the main memory only when the block is evicted
Write-invalidate policy
  A cache write invalidates all copies held by the other sharers
[Figure: tiles and main memories]

A cache coherence example
A CPU wants to read a block cached at another tile
  The CPU sends a read request to the memory controller
  The controller forwards the request to the current owner
  The owner sends the block to the requestor
[Figure: tiles and main memories]

Coherence protocols: MOESI class
Status of each cache block is represented by M/O/E/S/I
  Modified (M): modified (i.e., dirty); valid in one cache
  Owned (O): may or may not be clean; exists in multiple caches; owned by one cache
  Exclusive (E): clean; exists in one cache
  Shared (S): shared by multiple CPUs
  Invalid (I)
Owner: responsible for responding to any requests
MOESI protocols: MSI, MOSI, MESI, MOESI, ...

Cache coherence protocols
MSI protocol
  The E state is not implemented.
  If a block is cached exclusively, a main memory write is not needed when the cache is updated. However, MSI cannot tell whether a block is cached exclusively, so the S-to-M transition always updates the main memory.
MESI protocol
  The O state is not implemented; dirty sharing is not allowed.
  The M-to-S transition always updates the main memory.
MOESI protocol
  The O state is added; dirty sharing is possible.

MSI protocol: State transition
CpuRd = CPU Read, CpuWr = CPU Write, BusRd = Bus Read, BusWr = Bus Write
[Figure: two state transition diagrams over the states M, S, and I, one for CPU-initiated events and one for bus-initiated events; edges are labeled event / action, with --- meaning no action]
S-to-M transitions flush (update) the main memory

MESI protocol: State transition
C = a cached copy exists, !C = no cached copy exists
Flush = main memory write, FlushOpt = cache-to-cache transfer
[Figure: two state transition diagrams over the states M, E, S, and I, one for CPU-initiated events (CpuRd, CpuWr, issuing BusRd(C), BusRd(!C), or BusUpgr) and one for bus-initiated events (BusRd, BusWr, BusUpgr, answered with Flush, FlushOpt, or no action)]
M-to-S transitions flush (update) the main memory

MOESI protocol: State transition (1/2)
MOESI reduces memory bandwidth compared to MESI
C = a cached copy exists, !C = no cached copy exists
[Figure: state transition diagram over the states M, O, E, S, and I for CPU-initiated events (CpuRd, CpuWr, issuing BusRd(C), BusRd(!C), or BusUpgr)]

MOESI protocol: State transition (2/2)
MOESI reduces memory bandwidth compared to MESI
[Figure: state transition diagram over the states M, O, E, S, and I for bus-initiated events (BusRd, BusWr, BusUpgr, answered with Flush, FlushOpt, or no action)]

Full-system: OS boot (1/6)
Log in to the machine using two terminals
> ssh <Username>@ikura.arc.ics.keio.ac.jp
> cd 20130614
Do not launch more than two terminals
  A 48-core machine is shared by up to 42 students

Full-system: OS boot (2/6)
Boot Linux OS on the simulator from the right terminal
> make boot
Very Important: You must remember the port number (the port number changes each time)
Example: Port number : 3456

Full-system: OS boot (3/6)
Connect to the simulator from the left terminal
> telnet localhost <YourPortNumber>
Very Important: You must specify the port number you've just found in the right terminal
  Using a wrong port number may connect you to another student's simulation
You will see Linux boot messages
Example: Port number : 3456

Full-system: OS boot (4/6)
Connect to the simulator from the left terminal
> telnet localhost <YourPortNumber>
Very Important: You must specify the port number you've just found in the right terminal
  Using a wrong port number may connect you to another student's simulation
You will see Linux boot messages
Example: Port number : 3456

Full-system: OS boot (5/6)
Connect to the simulator from the left terminal
> telnet localhost <YourPortNumber>
Linux OS will boot in 5-10 minutes
You can log in to Linux on the simulator
  Try cd /, ls, and more
This is the fast simulation mode, without detailed cache behavior
Example: Port number : 3456

Full-system: OS boot (6/6)
Dump a checkpoint from the left terminal
(none)/# m5 checkpoint
Using the checkpoint, you can resume the simulation anytime
Then exit the simulation
  Type Ctrl-c in the right terminal to exit

Full-system: Simulation strategy
Type make boot
  Boot Linux using the fast (but inaccurate) simulation mode that does not model cache behavior
  Dump a checkpoint and then exit
Type make exec_mi or make exec_mesi
  Resume the simulation from the checkpoint using the accurate (but slow) simulation mode that models memory, caches, and the interconnection network
  Execute a benchmark program and then count the number of cycles for the execution
  Cache coherence protocols: MI and MESI

Full-system: MI (1/4)
Resume the simulation from the checkpoint on the right terminal
> make exec_mi
Very Important: You must remember the port number (the port number changes each time)
Example: Port number : 3456

Full-system: MI (2/4)
Connect to the simulator from the left terminal
> telnet localhost <YourPortNumber>
Very Important: You must specify the port number you've just found in the right terminal
  Using a wrong port number may connect you to another student's simulation
You can resume the simulation
Example: Port number : 3456

Full-system: MI (3/4)
Execute a sample program on the left terminal
(none)/# cd /root
(none)/# ./ex2 4
./ex2 4 runs the program using 4 threads
It takes 10-15 minutes

Full-system: MI (4/4)
The simulation stops automatically after 10-15 minutes
Note the execution cycles shown in the right terminal

Full-system: MESI
Do the same simulation using the MESI protocol
> make exec_mesi
Very Important: You must specify the port number you've just found in the right terminal
  The port number changes every run
Run the Ex2 program again
Example: Port number : 3456

Exercise 3
Compare the MI and MESI protocols in terms of the execution cycles of the Ex2 program
Compare the MI and MESI protocols in terms of the execution cycles of the Ex3 program
(none)/# cd /root
(none)/# ./ex3 4
Discuss the results using your answer sheet

Protocol    Exec cycles of Ex2    Exec cycles of Ex3
MI          39,555,084
MESI