What SMT can do for You. John Hague, IBM Consultant Oct 06

1 What SMT can do for You
John Hague, IBM Consultant, Oct 06

2 European Centre for Medium-Range Weather Forecasts (ECMWF): Growth in HPC performance
[Chart: sustained teraflops for successive systems (VPP700, VPP5000, p690, P), growing about 5 times every 4 years, compared with Moore's Law]

3 The Problem (i.e. Challenge)
For parallel programs using a large number of processors:
- CPU performance increases much faster than memory access rates
- Floating-point pipeline length increases (for scalar processors)
- An increased number of CPUs produces more inter-CPU communication
Percentage of peak TFLOPS decreases
- Currently about 12% at ECMWF
Need to make better use of the CPU to overcome waits for:
- Floating-point pipes
- Memory access

4 CPI analysis for IFS (RAPS9) on Power5
CPI = Cycles Per Instruction
IFS = ECMWF's Integrated Forecast System

MAIN
T Cycles 1,049,275,053,531
  A Groups 371,422,635,466
    A1 Base 370,476,829,425
      A1A Inst 245,031,876,
      A1B Grouping 125,444,952,
    A2 Cracking 945,806,
  B GCT 16,540,437,851
    B1 IC_Miss 7,365,359,
    B2 BR_MPred 6,124,862,
    B3 SRQ 2,
    B4 Other 3,050,213,
  C Stalls 661,311,980,214
    C1 LSU 194,682,702,732
      C1A Reject 54,305,104,
      C1B Dcache 77,594,158,
      C1C Other 62,783,440,
    C2 FXU 65,549,096,175
      C2A DIV 1,251,638,
      C2B Other 64,297,457,
    C3 FPU 340,351,526,592
      C3A FDIV 69,100,870,
      C3B Other 271,250,656,
    C4 Other 60,728,654,

Thanks to Lawrence Hannon, IBM Austin, for assistance

5 CPI analysis for IFS (RAPS9) on Power5

LAITLI
T Cycles 18,544,481,257
  A Groups 4,812,590,956
    A1 Base 4,811,679,026
      A1A Inst 3,626,900,
      A1B Grouping 1,184,778,
    A2 Cracking 911,
  B GCT 77,943,390
    B1 IC_Miss 29,743,
    B2 BR_MPred 35,234,
    B3 SRQ
    B4 Other 12,965,
  C Stalls 13,653,946,911
    C1 LSU 11,515,115,570
      C1A Reject 3,959,632,
      C1B Dcache 7,113,000,
      C1C Other 442,483,
    C2 FXU 322,778,945
      C2A DIV 6,199,
      C2B Other 316,579,
    C3 FPU 1,562,957,716
      C3A FDIV 3,188,
      C3B Other 1,559,769,
    C4 Other 253,094,

CLOUDSC3
T Cycles 18,536,291,356
  A Groups 7,638,718,046
    A1 Base 7,622,129,292
      A1A Inst 4,002,027,
      A1B Grouping 3,620,101,
    A2 Cracking 16,588,
  B GCT 639,681,301
    B1 IC_Miss 255,040,
    B2 BR_MPred 289,763,
    B3 SRQ
    B4 Other 94,877,
  C Stalls 10,257,892,009
    C1 LSU 1,241,869,614
      C1A Reject 87,962,
      C1B Dcache 550,216,
      C1C Other 603,691,
    C2 FXU 3,988,830,685
      C2A DIV 53,629,
      C2B Other 3,935,201,
    C3 FPU 2,713,645,467
      C3A FDIV 698,417,
      C3B Other 2,015,227,
    C4 Other 2,313,546,
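The counters on slides 4 and 5 are easier to compare as shares of total cycles. The sketch below is an illustration only: it uses just the values that survive in full in the slides (T, A, B, C, C1, C2, C3), and the dictionary keys and the `shares` helper are names invented here, not part of any IBM tool.

```python
# Cycle counts quoted in full on slides 4 and 5. Key names are hypothetical
# labels chosen for this sketch: groups = A, gct = B, stalls = C,
# lsu/fxu/fpu = C1/C2/C3.
COUNTERS = {
    "MAIN": {"cycles": 1_049_275_053_531, "groups": 371_422_635_466,
             "gct": 16_540_437_851, "stalls": 661_311_980_214,
             "lsu": 194_682_702_732, "fxu": 65_549_096_175,
             "fpu": 340_351_526_592},
    "LAITLI": {"cycles": 18_544_481_257, "groups": 4_812_590_956,
               "gct": 77_943_390, "stalls": 13_653_946_911,
               "lsu": 11_515_115_570, "fxu": 322_778_945,
               "fpu": 1_562_957_716},
    "CLOUDSC3": {"cycles": 18_536_291_356, "groups": 7_638_718_046,
                 "gct": 639_681_301, "stalls": 10_257_892_009,
                 "lsu": 1_241_869_614, "fxu": 3_988_830_685,
                 "fpu": 2_713_645_467},
}


def shares(counters):
    """Return every category as a fraction of the routine's total cycles."""
    total = counters["cycles"]
    return {name: value / total
            for name, value in counters.items() if name != "cycles"}


for routine, counters in COUNTERS.items():
    # The three top-level categories (A, B, C) add up to the total T.
    assert (counters["groups"] + counters["gct"] + counters["stalls"]
            == counters["cycles"])
    s = shares(counters)
    print(f"{routine:9s} stalls {s['stalls']:.1%}  LSU {s['lsu']:.1%}  "
          f"FXU {s['fxu']:.1%}  FPU {s['fpu']:.1%}")
```

On these figures stalls take roughly 63% of cycles in MAIN, about 74% in LAITLI and about 55% in CLOUDSC3. LAITLI is dominated by load/store stalls, while CLOUDSC3 stalls mostly in the fixed-point and floating-point units.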

6 SMT (Simultaneous Multi-Threading)
Make better use of Execution Units (particularly Floating Point pipes)
- Two threads run on one physical CPU
- 2nd thread can be dispatched (at hardware level) while 1st thread is waiting (e.g. for FP pipe or Load instruction)
Upside
- Up to 2x performance improvement
Downside
- May get cache thrashing (unless twice as much cache available)
- May get paging (unless twice as much memory available), but shared-memory threads need much less than twice as much memory

7 SMT: Simplified Power5 Schematic
[Schematic components: Thread 1 and Thread 2 Program Counters; Alternate Selection; Instruction Cache; Thread 1 and Thread 2 Instruction Buffers; Thread Select (based on priority) and Group Dispatch; Issue Queues (shared between threads); out-of-order Execution Units: Branch, Conditional, Load/Store (x2), Fixed Point (x2), Floating Point (x2)]
Ref: IEEE Micro; March-April 2004; Kalla, Sinharoy & Tendler; IBM Power5 Chip, A Dual-Core Multi-threaded Processor

8 SMT Operation
Initial Operation
- Two separate Program Counters (for 2 threads)
- Shared Instruction Cache
- Instruction fetches alternate between threads
- Up to 8 instructions from the same thread per cycle
Thread Select
- Selects 5 instructions from the same thread to form a group for the Issue Queues
- Each group takes one entry in the shared 20-slot Global Completion Table (GCT)
Issue Queues (shared)
- Allocate shared rename registers (120 FP & 120 GP)
Execution Units
- Up to 8 instructions can be executed, out of order, every cycle
- Groups complete in order for each thread (one from each thread per cycle)
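The GCT figures on slide 8 imply a simple bound on shared capacity, sketched below as plain arithmetic. The function names are invented for this illustration, and the model ignores everything except the group size and table size quoted on the slide.

```python
GROUP_SIZE = 5     # instructions per dispatch group (slide 8)
GCT_ENTRIES = 20   # slots in the shared Global Completion Table (slide 8)


def max_in_flight(gct_entries=GCT_ENTRIES, group_size=GROUP_SIZE):
    """Upper bound on instructions tracked by the GCT at one time."""
    return gct_entries * group_size


def entries_left(held_by_other_thread, gct_entries=GCT_ENTRIES):
    """GCT entries still available to one thread when the other holds some."""
    if not 0 <= held_by_other_thread <= gct_entries:
        raise ValueError("held entries must be between 0 and the table size")
    return gct_entries - held_by_other_thread


print(max_in_flight())    # bound across both threads
print(entries_left(15))   # what remains if the other thread holds 15 entries
```

Because the table is shared, every entry held by one thread is an entry the other cannot use, so a stalled thread that keeps its groups in the GCT directly limits its partner.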

9 Enhanced SMT
Dynamic Resource Balancing
- Aims to prevent one thread hogging the Issue Queues
- Throttles a thread if it has too many L2 misses or GCT entries by:
  - Reducing the thread's priority
  - Inhibiting the thread's instruction decoding
  - Flushing the thread's instructions waiting for dispatch
Adjustable Thread Priority
- 8 software-controlled priority levels
- Implemented by controlling instruction decoding cycles
- Priority decreased if thread is in:
  - Idle loop waiting for work
  - Spin loop waiting for lock
- Priority increased for:
  - Real-time task

10 SMT on P5+ (p575 node)
- Node has 16 physical CPUs = 8 dual-core chips
- 2 logical CPUs are allocated to each physical CPU
- Parallel program can use MPI or OpenMP to double the number of threads
  - Try not to double memory requirements
  - Best with appropriate binding
Programs benefit from SMT if
- CPU pipes are not fully utilised
- Memory bandwidth is not fully utilised
Programs may not benefit from SMT if
- they have a lot of memory traffic per floating-point operation
- they are FP bound, like SGEMM
- the program doesn't scale well

11 SMT benefit (on 16 CPU P5+)
Columns: FP pipe use | Memory access | Code | GFLOP no SMT (16 threads) | GFLOP SMT (32 threads)
One | None | s1=s1+1.d0 | |
Both | None | s1=s1+1.d0 s2=s2+1.d0 | |
Both | None | s1=s1+1.d0 ... s10=s10+1.d0 | |
Both | Streaming | Stride 2: s1=s1+1.1d0*a(i) s2=s2+1.1d0*a(i+1) | |
Both | Skipping | Stride 10: s1=s1+1.1d0*a(i) s2=s2+1.1d0*a(i+1) | |

12 SMT for Parallel jobs (on P5+)
- One Copy: if one copy of the program takes T, the time is T
- Two Copies: if the SMT factor is S, 2 copies of the program take 2*T/S
- Double Threads: if the scalability factor for doubling threads is f, the program takes 2*T/f/S
For IFS T799 on 1200 CPUs, S=1.35, f=1.8
Speedup due to SMT is S*f/2 = 1.35*1.8/2 = 1.22
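The timing model on slide 12 can be written out directly. This is a minimal sketch of that arithmetic; the function names are invented here, and the example values are the ones used in the slide's worked calculation (S = 1.35, f = 1.8).

```python
def time_two_copies(t, s):
    """Time for two copies of the program sharing the CPUs, SMT factor s."""
    return 2.0 * t / s


def time_double_threads(t, s, f):
    """Time for one job run with twice the threads, scalability factor f."""
    return 2.0 * t / f / s


def smt_speedup(s, f):
    """Speedup over the single-copy time T: T / (2*T/f/S) = S*f/2."""
    return s * f / 2.0


def break_even_scalability(s):
    """Smallest f for which doubling threads under SMT is not a slowdown."""
    return 2.0 / s


t = 100.0  # arbitrary single-copy time, for illustration only
print(time_two_copies(t, 1.35))
print(time_double_threads(t, 1.35, 1.8))
print(smt_speedup(1.35, 1.8))
print(break_even_scalability(1.35))
```

The break-even form makes the condition explicit: SMT only pays off when S*f exceeds 2, so with S = 1.35 the program must gain a factor of about 1.48 or more from doubling its threads.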

13 Binding (for P5+ p575)
Use MEMORY_AFFINITY=MCM
- Allocates memory to the same resource (i.e. chip, or 2 CPUs on dual-core p575) that the thread is running on
MP_TASK_AFFINITY
- Keeps an MPI task (and all its threads) on the same resource
- Not good if more than 2 threads without SMT, or more than 4 threads with SMT
Binding keeps all threads on specified CPUs
- Use special binding code
- Uses file created at boot time relating physical CPUs to bind CPUs
- If SMT enabled (AIX sees 32 CPUs per node) but not used (i.e. 16 threads per node), specify
- If SMT enabled and used (i.e. 32 threads per node), specify
- If SMT used with front loaded threads, specify (for 16 threads per task)
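The first two binding cases on slide 13 can be illustrated with a small mapping function. This is a hypothetical sketch, not the special binding code the slide refers to: it assumes that logical CPUs 2k and 2k+1 share physical CPU k, whereas the real relationship comes from the boot-time file mentioned above. The function name and constants are invented for the example.

```python
PHYSICAL_CPUS = 16   # per p575 node (slide 10)
SMT_THREADS = 2      # logical CPUs per physical CPU (slide 10)


def bind_list(threads_per_node, physical=PHYSICAL_CPUS, smt=SMT_THREADS):
    """Return a logical CPU id for each software thread on one node.

    ASSUMPTION: logical CPUs 2k and 2k+1 sit on physical CPU k. On a real
    system the numbering must be read from the boot-time mapping file.
    """
    logical = physical * smt
    if threads_per_node == physical:
        # SMT enabled but not used: one thread per physical CPU.
        return [k * smt for k in range(physical)]
    if threads_per_node == logical:
        # SMT enabled and used: one thread per logical CPU.
        return list(range(logical))
    raise ValueError("sketch only handles one thread per physical "
                     "or per logical CPU")


print(bind_list(16))
print(bind_list(32))
```

The point of the first case is that, with SMT enabled but unused, leaving placement to chance can put two threads on one physical CPU while another sits idle; binding to one logical CPU per physical CPU avoids that.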

14 Anomaly: ECMWF's 4D-Var min1 (RAPS8): communication time increased with run number
[Chart: Time against Run number, with periods marked Monitoring on, Monitoring off, Monitoring on]

15 Effect of monitoring
[Diagram: CPU and Comms timelines for Task 0, Task 1, Task 2 and Task 3, each with a delay dt]
Initially
- Every minute: delay dt in CPU
After several hours
- Every minute (for tasks): delay dt in CPU, delay *dt in Comms
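One reading of slide 15, offered as an assumption rather than the author's stated model, is that the monitoring delays start out aligned across tasks and drift apart over several hours. In a job that synchronises at every step, aligned delays cost dt once, while staggered delays make every task wait for each of the others in turn. The toy simulation below shows that effect; the `run` function and its parameters are invented for this sketch.

```python
def run(n_tasks, n_steps, compute, dt, hit_step):
    """Simulate a job whose tasks synchronise at the end of every step.

    hit_step[i] is the step in which task i loses dt of CPU time to
    monitoring. Returns (wall_time, wait_task0), where wait_task0 is the
    time task 0 spends waiting for slower tasks (seen as communication).
    """
    wall = 0.0
    wait_task0 = 0.0
    for step in range(n_steps):
        busy = [compute + (dt if hit_step[i] == step else 0.0)
                for i in range(n_tasks)]
        step_time = max(busy)          # everyone waits for the slowest task
        wall += step_time
        wait_task0 += step_time - busy[0]
    return wall, wait_task0


# Four tasks, ten steps of 1.0 s each, one 0.1 s monitoring hit per task.
aligned = run(4, 10, 1.0, 0.1, [0, 0, 0, 0])     # all hit in the same step
staggered = run(4, 10, 1.0, 0.1, [0, 1, 2, 3])   # each hit in its own step
print("aligned  :", aligned)
print("staggered:", staggered)
```

In the aligned case the job loses dt once and no task waits. In the staggered case the job loses dt once per task, and each task spends the other tasks' delays waiting, which would show up as growing communication time.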

16 Verification of Monitoring Effect
Monitoring on (1 sec delay per node)
[Chart: Time against Run number, with periods marked Monitoring on, Monitoring off, Monitoring on]
Thanks to Oliver Treiber (ECMWF) for this

17 ECMWF's IFS T799L91 10-day forecast (RAPS9): P4+ to P5+ comparison
Columns: CPUs | MPI x OMP | WALL (secs) | %Comms | Gflops | % of peak
Power GHz p x % %
Power5+ 1.9GHz p SMT 192 x % %
Speed-up: Power4+ to Power5+ = 1.56
But almost as fast on half as many P5+ CPUs as on P4+

18 ECMWF's IFS T399 (Ensemble Prediction) on P5+: SMT and Binding (RAPS8)
[Table: columns Tasks_Thrds, Use SMT, Bind, MEM AFF, Time; rows 12_4, 12_4, 12_4, 12_4, 24_4, 24_, 24_, _]
Conclusions
- MEMORY_AFFINITY is worth a percent or two
- SMT is worth about 20%
- Binding is worth a few %, particularly without SMT
- MP_TASK_AFFINITY=MCM has no noticeable effect
Day Forecast, 48 steps, 3 16-CPU dual_core nodes, SMT Enabled

19 UKMO's Unified Model with MPI and OpenMP: P4+ v P5+ times with SMT and binding
[Table: times by Physical CPUs and Threads/task for each System (P4+, P5+) and SMT/Bind setting, with the best P4+ and best P5+ times marked]
SMT=Y means 2 threads per physical CPU. E.g. 64 MPI tasks, 2 threads/task = 64 CPUs

20 UM: Conclusions
Best times on P5+ are obtained using:
- SMT and Binding
- 1, 1, and 2 threads on 32, 64, and 128 CPUs respectively
On the same number of CPUs, P5+ runs the model 1.7 to 2.0 times as fast as P4+
P5+ runs the model at about the same speed on half as many CPUs as P4+
Binding provides 1.1 to 1.3 times speedup
SMT provides 0.9 to 1.2 times speedup (speedup less than one only when scalability is poor)
