Memory Consistency and Multiprocessor Performance
Adapted from UCB CS252 S01, Copyright 2001 UCB
Memory Consistency Model
- Defines memory correctness for parallel execution
- Execution appears to be that of some correct execution on a theoretical parallel computer with n sequential processors
- In particular, remote writes must appear at a local processor in some correct sequence
Sequential Consistency Model
- Sequential consistency: memory reads/writes are globally serialized; assume every cycle only one processor can proceed by one step, and a write's result appears on all other processors immediately
- Processors do not reorder local reads and writes
Memory Consistency Examples
- Initially A == 0 and B == 0
  P1: A = 1;  L1: if (B == 0) ...
  P2: B = 1;  L2: if (A == 0) ...
- Is it impossible for both if statements L1 & L2 to be true?
- Note: the number of possible total orderings is an exponential function of the number of instructions
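To make the example concrete, here is a minimal C11 sketch (the pthread scaffolding and the names p1/p2 are illustrative, not from the slides). With memory_order_seq_cst the accesses behave as under sequential consistency, so at least one thread must observe the other's write; weakening them to memory_order_relaxed would let real hardware make both if conditions true.

#include <stdio.h>
#include <stdatomic.h>
#include <pthread.h>

atomic_int A = 0, B = 0;
int saw_B_zero, saw_A_zero;   /* results, read only after the joins */

void *p1(void *arg) {
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_seq_cst);                 /* A = 1;          */
    saw_B_zero = (atomic_load_explicit(&B, memory_order_seq_cst) == 0); /* L1: if (B == 0) */
    return NULL;
}

void *p2(void *arg) {
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_seq_cst);                 /* B = 1;          */
    saw_A_zero = (atomic_load_explicit(&A, memory_order_seq_cst) == 0); /* L2: if (A == 0) */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Under SC, "L1 true: 1, L2 true: 1" can never be printed. */
    printf("L1 true: %d, L2 true: %d\n", saw_B_zero, saw_A_zero);
    return 0;
}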
Performance of Sequential Consistency
- SC must delay all memory accesses until all invalidates are done
- What if the write invalidate is delayed and the processor continues?
- How about memory-level parallelism?
Relaxed Memory Consistency
- Total store order (TSO): only writes are globally serialized; assume every cycle at most one write can proceed, and the write result appears immediately; processors may reorder local reads/writes without RAW dependence
- Processor consistency (PC): writes from one processor appear in the same order on all other processors; processors may reorder local reads/writes without RAW dependence
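On a TSO or PC machine, software can recover SC behavior where it matters by inserting a fence. A minimal sketch, assuming C11 atomics (the function name p1_body is illustrative): the seq_cst fence keeps the load of B from bypassing the buffered store to A, which is exactly the reordering TSO and PC permit.

#include <stdatomic.h>

atomic_int A, B;

/* P1's side of the earlier example; P2's side is symmetric. */
int p1_body(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);    /* A = 1                          */
    atomic_thread_fence(memory_order_seq_cst);             /* store may not pass this fence  */
    return atomic_load_explicit(&B, memory_order_relaxed); /* load B only after the fence    */
}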
Relaxed Memory Consistency
- An architect has designed an SMP that uses processor consistency
  P1: A = get_data(); updated = 1;
  P2: if (updated) flag = 1;
  P3: if (flag) process_data(A);
- Initially updated == 0 and flag == 0
- A programmer found that P3 sometimes uses stale data of A. Can you defend the architect?
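One defense of the architect: processor consistency orders the writes of a single processor but does not make writes atomic, so A's new value may have reached P2 but not yet P3 when P3 observes flag == 1. The machine meets its spec; the program needs stronger ordering. A hedged C11 sketch of the fix (the constant 42 stands in for get_data(), and the default seq_cst atomics supply the missing write atomicity):

#include <stdatomic.h>

atomic_int A, updated, flag;

void p1(void) {
    atomic_store(&A, 42);          /* A = get_data();                 */
    atomic_store(&updated, 1);     /* updated = 1;                    */
}

void p2(void) {
    if (atomic_load(&updated))     /* seeing updated == 1 ...         */
        atomic_store(&flag, 1);    /* ... happens-before flag = 1     */
}

void p3(void) {
    if (atomic_load(&flag))        /* seeing flag == 1 ...            */
        (void)atomic_load(&A);     /* ... guarantees A's fresh value  */
}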
Memory Consistency and ILP
- Speculate on loads, recover on violation
- With ILP and SC, what will happen on this?
  P1 code: A = 1; print B
  P2 code: B = 1; print A
  P1 exec: spec. load B; store A; retire load B
  P2 exec: spec. load A; store B; flush at load A
- SC can be maintained, but it is expensive, so use TSO or PC
- Speculative execution and rollback can still improve performance
- Performance: ILP + strong MC ≈ weak MC
Memory Consistency in Real Programs
- Not really an issue for many common programs; they are synchronized
- A program is synchronized if all accesses to shared data are ordered by synchronization operations:
  write(x)
  ...
  release(s) {unlock}
  ...
  acquire(s) {lock}
  ...
  read(x)
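A minimal sketch of this synchronized pattern using a pthread mutex (writer/reader are illustrative names; s and x follow the slide): the unlock is the release and the lock is the acquire, so write(x) and read(x) are ordered no matter how relaxed the underlying memory model is.

#include <pthread.h>

pthread_mutex_t s = PTHREAD_MUTEX_INITIALIZER;
int x;

void writer(void) {
    pthread_mutex_lock(&s);
    x = 1;                       /* write(x)            */
    pthread_mutex_unlock(&s);    /* release(s) {unlock} */
}

int reader(void) {
    int v;
    pthread_mutex_lock(&s);      /* acquire(s) {lock}   */
    v = x;                       /* read(x)             */
    pthread_mutex_unlock(&s);
    return v;
}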
Synchronization
- Why synchronize? Need to know when it is safe for different processes to use shared data
- Issues for synchronization:
  - Uninterruptible instruction to fetch and update memory (atomic operation); synchronization operations need this kind of primitive
  - For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce contention and latency of synchronization
Uninterruptible Instructions to Fetch and Update Memory
- Atomic exchange: interchange a value in a register for a value in memory
- Test-and-set: tests a value and sets it if the value passes the test
- Fetch-and-increment: returns the value of a memory location and atomically increments it
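A hedged sketch of how these primitives are used, written with C11 atomics (names are illustrative): atomic_exchange plays the atomic-exchange/test-and-set role in a spin lock, and atomic_fetch_add is fetch-and-increment, shown here handing out ticket numbers.

#include <stdatomic.h>

atomic_int lock_word;   /* 0 = free, 1 = held */
atomic_int counter;

void spin_acquire(atomic_int *l) {
    /* test-and-set: atomically write 1 and get the old value; 0 means we won */
    while (atomic_exchange(l, 1) != 0)
        ;  /* spin until the holder releases */
}

void spin_release(atomic_int *l) {
    atomic_store(l, 0);
}

int next_ticket(void) {
    /* fetch-and-increment: returns the old value and atomically adds 1 */
    return atomic_fetch_add(&counter, 1);
}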
Parallel App: Commercial Workload
- Online transaction processing (OLTP) workload (like TPC-B or -C)
- Decision support system (DSS) (like TPC-D)
- Web index search (Altavista)

  Benchmark     % Time User Mode   % Time Kernel   % Time I/O (CPU Idle)
  OLTP          71%                18%             11%
  DSS (range)   82-94%             3-5%            4-13%
  DSS (avg)     87%                4%              9%
  Altavista     > 98%              < 1%            < 1%
Alpha 4100 SMP
- 4 CPUs: Alpha 21164 @ 300 MHz
- L1$: 8 KB direct-mapped, write-through
- L2$: 96 KB, 3-way set-associative
- L3$: 2 MB (off-chip), direct-mapped
- Memory latency: 80 clock cycles
- Cache-to-cache: 125 clock cycles
OLTP Performance as L3$ Size Varies
[Figure: execution time breakdown (idle, PAL code, memory access, L2/L3 cache access, instruction execution) for L3 cache sizes of 1 MB, 2 MB, 4 MB, and 8 MB.]
L3 Miss Breakdown
[Figure: L3 misses for cache sizes of 1 MB, 2 MB, 4 MB, and 8 MB, split into instruction, capacity/conflict, cold, false sharing, and true sharing misses.]
Memory CPI as CPUs Increase
[Figure: memory CPI for 1, 2, 4, 6, and 8 processors (L3: 2 MB, 2-way), split into instruction, conflict/capacity, cold, false sharing, and true sharing.]
OLTP Performance as L3$ Block Size Varies
[Figure: misses for block sizes of 32, 64, 128, and 256 bytes, split into instruction, capacity/conflict, cold, false sharing, and true sharing misses.]
A Pure NUMA: SGI Origin 2000
- 2 CPUs per node
- Scales up to 2048 processors
- Designed for scientific computation vs. commercial processing
- Scalable bandwidth is crucial to Origin
Parallel App: Scientific/Technical
- FFT kernel: 1D complex-number FFT
  - 2 matrix transpose phases => all-to-all communication
  - Sequential time for n data points: O(n log n)
  - Example is a 1-million-point data set
- LU kernel: dense matrix factorization
  - Blocking helps cache miss rate, 16x16 blocks (see the tiling sketch below)
  - Sequential time for an n x n matrix: O(n^3)
  - Example is a 512 x 512 matrix
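A minimal sketch of the 16x16 blocking idea, shown as the matrix transpose the FFT phases perform (the same tiling underlies blocked LU; N and B are illustrative):

#define N 512
#define B 16

/* Walk the matrix in B x B tiles so each tile stays cache-resident. */
void blocked_transpose(double dst[N][N], double src[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)      /* work inside one tile */
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}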
FFT Kernel
[Figure: FFT memory reference breakdown at 8, 16, 32, and 64 processors, split into hit, miss to local memory, remote miss to home, and 3-hop miss to remote cache.]
LU Kernel
[Figure: LU memory reference breakdown at 8, 16, 32, and 64 processors, split into hit, miss to local memory, remote miss to home, and 3-hop miss to remote cache.]
Parallel App: Scientific/Technical
- Barnes app: Barnes-Hut n-body algorithm solving a problem in galaxy evolution
  - n-body algorithms rely on forces dropping off with distance; if far enough away, can ignore (e.g., gravity is 1/d^2)
  - Sequential time for n data points: O(n log n)
  - Example is 16,384 bodies
- Ocean app: Gauss-Seidel multigrid technique to solve a set of elliptical partial differential equations
  - Red-black Gauss-Seidel colors points in the grid to consistently update points based on previous values of adjacent neighbors (see the sweep sketch below)
  - Multigrid solves finite difference equations by iteration using a hierarchical grid
  - Communication when a boundary is accessed by an adjacent subgrid
  - Sequential time for an n x n grid: O(n^2)
  - Input: 130 x 130 grid points, 5 iterations
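A hedged sketch of one red-black Gauss-Seidel sweep over the grid interior (GRID and the 0.25 stencil weight are illustrative): points whose coordinate parity matches the current color read only opposite-color neighbors, so all same-color updates are independent and can proceed in parallel.

#define GRID 130

void red_black_sweep(double u[GRID][GRID], int color /* 0 = red, 1 = black */) {
    for (int i = 1; i < GRID - 1; i++)
        for (int j = 1; j < GRID - 1; j++)
            if (((i + j) & 1) == color)
                /* update from the four adjacent (opposite-color) neighbors */
                u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
}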
Barnes App
[Figure: average cycles per memory reference for Barnes at 8, 16, 32, and 64 processors, split into hit, miss to local memory, remote miss to home, and 3-hop miss to remote cache.]
Ocean App
[Figure: average cycles per reference for Ocean at 8, 16, 32, and 64 processors, split into cache hit, local miss, remote miss, and 3-hop miss to remote cache.]
Multiprocessor Conclusion
- Some optimism about the future
  - Parallel processing beginning to be understood in some domains
  - More performance than that achieved with a single-chip microprocessor
- MPs are highly effective for multiprogrammed workloads
- MPs proved effective for intensive commercial workloads, such as OLTP (assuming enough I/O to be CPU-limited), DSS applications (where query optimization is critical), and large-scale web searching applications
- On-chip MPs appear to be growing:
  1) the embedded market, where natural parallelism often exists, offers an obvious alternative to a faster but less silicon-efficient CPU
  2) diminishing returns in high-end microprocessors encourage designers to pursue on-chip multiprocessing