AMD Opteron TM & PGI: Enabling the World's Fastest LS-DYNA Performance

3. LS-DYNA Anwenderforum, Bamberg 2004, CAE / IT II

AMD Opteron TM & PGI: Enabling the World's Fastest LS-DYNA Performance
Tim Wilkens, Ph.D., Member of Technical Staff, tim.wilkens@amd.com
October 4, 2004, Computation Products Group

Agenda
- Architecture: Opteron, Itanium, Xeon EM64T
- PGI compiler: LS-DYNA-driven enhancements
- Performance: Neon and 3-Car models

Architecture
Opteron = Execution + Memory Access + IO
- Customer-centric 64-bit computing
- Instruction decoding: processors with artificial intelligence
[Opteron core block diagram: 64KB L1 instruction and 64KB L1 data caches backed by L2; branch prediction, fetch, and scan/align stages feeding fastpath and microcode decode engines; a 72-entry instruction control unit; integer and FP decode & rename; a 36-entry FP scheduler issuing to FADD, FMUL, and FST pipes plus an integer MULT; a 44-entry load/store queue; and a system request queue and crossbar connecting the CPU cores to the DDR memory controller, two coherent CPU-CPU HyperTransport links, and an IO HyperTransport link]

Architecture
Opteron = Execution + Memory Access + IO
- Not all x86 processors are created equal: RISC cores with scrupulous instruction preference
(same core block diagram as above)

Architecture
Opteron = Execution + Memory Access + IO
Scalable memory bandwidth and IO:
- physical memory scales with CPU count
- memory bandwidth scales with CPU count
- increased single-threaded memory bandwidth
- memory latency does not scale with CPU count
- dramatically lower memory latency
(same core block diagram as above)

Customer-Centric 64-bit Computing
Opteron vs Itanium
[Block diagram detail: fetch, branch prediction, scan/align, fastpath and microcode engines, 72-entry instruction control unit, integer and FP decode & rename, 36-entry scheduler]

Customer-Centric 64-bit Computing
Opteron vs Itanium
- Progressive 64-bit approach: 32-bit instructions plus a prefix byte
  - leverages x86 compiler technology: reliable compilers, code ports easily
  - code size increase is minimal (~5%), so large caches are not required
- x86 CPUs = RISC cores + CISC-to-RISC instruction decoders
  - gives x86 processors high clock frequency and legacy compatibility
  - the processor, not the compiler, manages the RISC core: recompile rarely; Itanium is a slave to the compiler: recompile often
- Out-of-order execution and register renaming
  - Opteron manages its registers intelligently: less compiler-reliant
  - Itanium requires the compiler to think for it: strong compiler reliance
- Both Opteron and Itanium are RISC underneath, but Opteron doesn't require reinventing compilers, large caches, and a mint to purchase

All x86 RISC Cores Aren't Created Equal
Opteron vs Xeon EM64T
- Opteron: INT and FP execution units, 80-bit register file (FADD, FMUL, FST pipes plus integer MULT)
- Xeon EM64T: FP execution units, 128-bit register file (FADD, FMUL pipes)

All x86 RISC Cores Aren't Created Equal
Opteron vs Xeon EM64T
- The number of integer pipes and the pipeline depth impact integer throughput
  - Opteron has 3 integer pipes: +50% reg,reg move throughput
  - Opteron has 3 ALUs: +50% add, subtract, logical, and shift throughput
  - the number of pipeline stages differs: shorter instruction execution latency
- Different register file sizes (Opteron 80-bit, Xeon 128-bit)
  - size dictates the number of RISC ops in an x86 instruction and the instruction preference
  - the number of bits written from the FPU pipes limits scalar SIMD throughput
- Design of the FPU and issue bandwidth from the FP scheduler
  - Opteron: ADD/MUL/ST pipes eat and write 240 bits per clock
  - Xeon: ADD/MUL pipes eat and write 128 bits per clock
- Though Xeon64 and Opteron are instruction compatible, Xeon64 delivers half the throughput per clock on scalar SIMD code

AMD Opteron TM, Pentium 4 (FPU analysis)
Throughput of SSE, SSE2, and x87 operations
[Table: add, multiply, and combined add & multiply throughput for SSE scalar, SSE vector, SSE2 scalar, SSE2 vector, and x87 operations on Opteron and Pentium 4. Surviving entries: Opteron sustains combined add & multiply at 4 operations per cycle, while several Pentium 4 scalar and vector entries are 1 operation per 2 cycles.]

AMD Opteron TM, Pentium 4 (ALU analysis)
Throughput and latency comparison
[Table: 32-bit and 64-bit ADD/SUB, MUL (signed and unsigned), MOV mem,reg, MOV reg,reg, XOR/AND/OR, shift/rotate, DIV (signed and unsigned), and LEA on Opteron and Pentium 4. Surviving entries: MUL at 4-cycle latency on Opteron versus 18-cycle (signed) and 10-cycle latency on Pentium 4; signed/unsigned DIV at 42/39-cycle latency on Opteron versus 80 cycles on Pentium 4; LEA throughput entries of 2 and 0.5 per cycle.]

Scalable Memory Bandwidth and IO
Opteron's on-die IO controller
- dual channels to DDR memory: 6.4 GB/s
- system request queue and crossbar connecting CPU 0 and CPU 1
- coherent CPU-CPU HyperTransport links
- 1.6 GT/s non-coherent IO HyperTransport link

Scalable Memory Bandwidth and IO
Opteron's on-die IO controller
- HyperTransport: asynchronous coherent communication maintains MP cache coherency, with a high rate of communication and low impact on MP memory latency
- Memory bandwidth scales linearly with the number of processors in the system: a greater percentage of theoretical peak is delivered, with low-latency memory access
- Memory latency: memory requests are retired rapidly, which enhances memory bandwidth; latency doesn't scale linearly with the number of CPUs, giving scalable SMP performance

PGI 5.2
Compiler enhancements driven by LS-DYNA. Overview of enhancements in PGI 5.2:
- all vector code isn't created equal
- addressing of common block variables
- loop peeling & optimal vector code
- packing scalars into vector format
- shuffling data in loops with GPRs
- excessive prefetching: caveats to using SW prefetch
- tuning of the unrolling heuristic: less register pressure
- expanded class of vectorizable loops
- F90 pointer addressing
- support for objects greater than 2 GB

All Vectorized Code Isn't Created Equal
Minimizing bubbles in the FPU pipeline. Consider the following loop:

  do i=1,n
    a(i) = a(i) + b(i)*(c(i)+d(i))
  enddo

PGI 5.1 (uses 4 registers, generates 8 bubbles):

  movlps (d),%xmm0
  movhps 8(d),%xmm0
  movlps (c),%xmm1
  movhps 8(c),%xmm1
  addps  %xmm0,%xmm1
  movlps (b),%xmm2
  movhps 8(b),%xmm2
  mulps  %xmm2,%xmm1
  movlps (a),%xmm3
  movhps 8(a),%xmm3
  addps  %xmm3,%xmm1

Using register renaming (uses 2 registers, still generates 8 bubbles):

  movlps (d),%xmm0
  movhps 8(d),%xmm0
  movlps (c),%xmm1
  movhps 8(c),%xmm1
  addps  %xmm0,%xmm1
  movlps (b),%xmm0
  movhps 8(b),%xmm0
  mulps  %xmm0,%xmm1
  movlps (a),%xmm0
  movhps 8(a),%xmm0
  addps  %xmm0,%xmm1

PGI 5.2 operates from memory on 16-byte aligned addresses (uses 1 register, generates 2 bubbles):

  movaps (d),%xmm0
  addps  (c),%xmm0
  mulps  (b),%xmm0
  addps  (a),%xmm0
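The aligned-operand form in the last listing can be sketched with SSE intrinsics. This is an illustrative reconstruction, not PGI's actual output; it assumes all four arrays are 16-byte aligned and n is a multiple of 4.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_add_ps, ... */

/* a(i) = a(i) + b(i)*(c(i)+d(i)), 4 floats per iteration.
   Each operand is one aligned 128-bit access, mirroring the
   movaps/addps/mulps-from-memory sequence that needs only one
   XMM register. */
static void madd4(float *a, const float *b, const float *c,
                  const float *d, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 t = _mm_add_ps(_mm_load_ps(d + i), _mm_load_ps(c + i)); /* c+d */
        t = _mm_mul_ps(_mm_load_ps(b + i), t);                         /* b*(c+d) */
        t = _mm_add_ps(_mm_load_ps(a + i), t);                         /* a+b*(c+d) */
        _mm_store_ps(a + i, t);                                        /* write back a */
    }
}
```

Note that the from-memory forms fault on unaligned addresses, which is why the compiler must first prove alignment or establish it by peeling.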

Addressing of Common Blocks
Minimize GPR stores and loads from the stack. Consider the common blocks and loop below:

  Common/test1/x1(N),x2(N),x3(N),y1(N),y2(N),y3(N)
  Common/test2/px1(N),py1(N),px2(N),py2(N)
  Common/test3/vx2(N),vx3(N),vx4(N),vy2(N),vy3(N),vy4(N),vz2(N),vz3(N),vz4(N)
  Common/test4/g1(N),g2(N),g3(N),g4(N)
  Common/test5/ax(N),ay(N),bz(N)

  do i=1,n
    xi = area(i)*(x3(i)-x2(i)-x4(i))
    yi = area(i)*(y3(i)-y2(i)-y4(i))
    g1(i)= 1.-px1(i)*xi-py1(i)*yi
    g2(i)=-1.-px2(i)*xi-py2(i)*yi
    g3(i)= 2.-g1(i)
    g4(i)=-2.-g2(i)
    ax(i)=g2(i)*vx2(i)+g3(i)*vx3(i)+g4(i)*vx4(i)
    ay(i)=g2(i)*vy2(i)+g3(i)*vy3(i)+g4(i)*vy4(i)
    bz(i)=g2(i)*vz2(i)+g3(i)*vz3(i)+g4(i)*vz4(i)
  enddo

PGI 5.1 generates (excerpt):

  movq   -120(%rsp), %r15
  movlps (%r15,%rcx), %xmm4
  movhps 8(%r15,%rcx), %xmm4
  movq   -112(%rsp), %r15
  movlps (%r15,%rcx), %xmm5
  movhps 8(%r15,%rcx), %xmm5
  subps  %xmm4, %xmm5
  movq   -104(%rsp), %r15
  movlps (%r15,%rcx), %xmm4
  movhps 8(%r15,%rcx), %xmm4
  subps  %xmm4, %xmm5
  movq   -96(%rsp), %r15

- PGI 5.1 uses separate GPRs to address each array, even for arrays in the same common block
- This accentuates register pressure in loops: one LS-DYNA loop had 54 GPR moves to and from the stack in PGI 5.1.5, where Intel 7.1 had 0 occurrences of GPR moves

Addressing of Common Blocks
Minimize GPR stores and loads from the stack (continued). For the same common blocks and loop, PGI 5.2 generates (excerpt):

  movaps -12000(%r9,%rdx),%xmm4
  movaps -8000(%r9,%rdx),%xmm9
  movaps (%r8,%rdx),%xmm5
  movaps 4000(%r8,%rdx),%xmm6
  movaps %xmm3,%xmm7
  subl   $8,%eax
  subps  (%r9,%rdx),%xmm4
  addl   $8,%ecx
  subps  (%r9,%rdx),%xmm9
  movaps %xmm7,%xmm8
  mulps  %xmm5,%xmm4

- PGI 5.2 accesses all entities in the same common block through one GPR
- GPR register pressure is greatly reduced; no excess RISC ops are generated in comparison to Intel
- Operations are now performed from memory: fewer bubbles in the FPU pipeline

Loop Peeling & Optimal Vector Code
Common block illustration. Efficient code vectorization requires:
- uniform relative alignment of pointers in loops, which can be achieved via common blocks
- the span of the arrays covered in each loop iteration to be a multiple of 4 (single precision) or 2 (double precision)
- loop peeling to adjust common pointers to 16-byte aligned locations
- performing *, +, - from memory (requires 16-byte alignment)
PGI 5.2 implements peeling of code in common-block loops:

  Common/test1/vx2(N),vx3(N),vx4(N),vy2(N),vy3(N),vy4(N),vz2(N),vz3(N),vz4(N)
  Common/test2/g1(N),g2(N),g3(N),g4(N)
  Common/test3/ax(N),ay(N),bz(N)

  do i=1,n
    ax(i)=g2(i)*vx2(i)+g3(i)*vx3(i)+g4(i)*vx4(i)
    ay(i)=g2(i)*vy2(i)+g3(i)*vy3(i)+g4(i)*vy4(i)
    bz(i)=g2(i)*vz2(i)+g3(i)*vz3(i)+g4(i)*vz4(i)
  enddo
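The scalar-peel / vector-body / scalar-tail structure described here can be sketched in plain C. Names are illustrative, and the "vector" body below is ordinary scalar C that a compiler may turn into movaps code once the destination is 16-byte aligned.

```c
#include <stdint.h>

/* Number of leading floats to peel so that p reaches a 16-byte
   boundary (floats are 4 bytes, so 0..3 peel iterations). */
static int peel_count(const float *p, int n)
{
    int k = (int)(((16 - ((uintptr_t)p & 15)) & 15) / 4);
    return k < n ? k : n;
}

/* ax(i) = g2(i)*vx2(i) with the three-loop structure: peel to
   alignment, vector body, scalar tail for leftover iterations. */
static void peeled_mul(float *ax, const float *g2, const float *vx2, int n)
{
    int i = 0, k = peel_count(ax, n);
    for (; i < k; i++)              /* scalar peel: align ax */
        ax[i] = g2[i] * vx2[i];
    for (; i + 4 <= n; i += 4)      /* "vector" body, 4 floats/iteration */
        for (int j = 0; j < 4; j++)
            ax[i + j] = g2[i + j] * vx2[i + j];
    for (; i < n; i++)              /* scalar tail */
        ax[i] = g2[i] * vx2[i];
}
```

Peeling on ax only helps if g2 and vx2 reach 16-byte boundaries at the same iteration, i.e. if all arrays share the same relative alignment; that is exactly what packing them into common blocks guarantees.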

Loop Peeling & Optimal Vector Code (continued)
For the common-block loop above, PGI 5.2 generates:
- a check of the relative alignment of the test1, test2, and test3 common block pointers; if the pointers are aligned to 16-byte boundaries, jump straight to the vector SSE loop
- a scalar SSE loop used to align the common block pointers on a 16-byte boundary
- a vector SSE loop used to perform most of the computation
- a scalar SSE loop for the final iterations not covered by the vector SSE loop

Optimized Scalar-Vector Transforms
Packing 4 scalars in a vector loop. Some vector loops require performing vector operations of scalar data upon vector quantities:

  a(i) = a(i) + b(j,i)*c(i) + d(j,i)*e(i)

PGI 5.1 does this by reading 4 floats, storing them to the stack, and then reading them back in a 128-bit load:
- creates load/store dependencies
- requires an excessive number of RISC ops: 8 x 32-bit movss loads/stores plus 1 movaps read (14 rops)
- creates 8 bubbles down the FPU pipes
PGI 5.2 does this by interleaving the floats in registers:
- 4 x movss reads, 2 x unpcklps, 1 x movlhps (11 rops)
- creates 4 bubbles down the FPU pipes
- much shorter latency
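The 4 x movss + 2 x unpcklps + 1 x movlhps sequence can be sketched with intrinsics. This is an illustrative reconstruction of the pattern, not PGI's actual output.

```c
#include <xmmintrin.h>

/* Pack four scalar floats into one XMM register without the
   store-to-stack / 128-bit-reload round trip:
   4 x movss, 2 x unpcklps, 1 x movlhps. */
static __m128 pack4(const float *p0, const float *p1,
                    const float *p2, const float *p3)
{
    __m128 x0 = _mm_load_ss(p0);           /* movss: {a,0,0,0} */
    __m128 x1 = _mm_load_ss(p1);
    __m128 x2 = _mm_load_ss(p2);
    __m128 x3 = _mm_load_ss(p3);
    __m128 lo = _mm_unpacklo_ps(x0, x1);   /* unpcklps: {a,b,0,0} */
    __m128 hi = _mm_unpacklo_ps(x2, x3);   /* unpcklps: {c,d,0,0} */
    return _mm_movelh_ps(lo, hi);          /* movlhps:  {a,b,c,d} */
}
```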

Use GPRs to Shuffle Data
Not an absolute statement, but almost. GPRs have the following advantages in loops that only shuffle data around:
- movss decodes to 2 RISC ops; a GPR mov decodes to 1 RISC op
- the FPU pipe can perform only one 32-bit or 64-bit store per cycle, while the integer unit can perform 2 of either
- pseudo-vector copies of floats can be performed with 64-bit GPRs, moving 2 floats at a time; this utilizes the full throughput of the ALUs and the load/store unit
- double precision moves should still be more efficient because the integer unit can perform 2 x 64-bit stores per cycle, whereas the FPU can perform only 1 x 64-bit store per cycle
- caution must be taken not to generate excessive register pressure
- throughput can be affected if there are many ops in addition to the loads and stores (add, sub, lea, etc.)

Excessive Prefetching
Caveats about software prefetch. Prefetching can preemptively bring data into the cache in advance of its use, but:
- Opteron has a very robust HW prefetcher for sequential data accesses; HW prefetches move data into L2 (12-cycle vs 3-cycle latency compared to L1) and do not consume execution dispatch bandwidth, while SW prefetches do
- SW prefetches across 4KB page boundaries are dropped and suffer a 90-cycle latency penalty
- SW prefetch of non-sequential data accesses offers little benefit:
  - only part of every 64 bytes fetched is useful
  - the rate of cache evictions is very high, so useful data then has to be re-fetched from L2
  - the MAB (miss address buffer) units in the processor are consumed quickly, preventing loads from occurring
- SW prefetches consume 1 of the 3 execution dispatch slots per clock cycle, limiting FPU throughput and IPC
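The pseudo-vector GPR copy from the "Use GPRs to Shuffle Data" slide can be sketched portably; an 8-byte memcpy is the standard way to express the single 64-bit GPR load and store, and the function name is illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Copy n floats (n even) two at a time through one 64-bit value:
   one 8-byte load plus one 8-byte store per pair, instead of two
   movss loads and two movss stores through the FPU pipe. */
static void copy_float_pairs(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 2) {
        uint64_t pair;
        memcpy(&pair, src + i, sizeof pair);  /* 64-bit GPR load  */
        memcpy(dst + i, &pair, sizeof pair);  /* 64-bit GPR store */
    }
}
```

Compilers typically lower each memcpy pair to a single 64-bit register move, so no function call remains.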
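A software prefetch looks like the sketch below. The 256-byte distance is a tuning assumption, not a recommended value, and per the prefetching caveats above, a sequential sum like this is exactly the case where Opteron's hardware prefetcher already does the job.

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Sum n floats, prefetching a fixed distance ahead of the loads.
   Each prefetch still costs one of the 3 dispatch slots per clock,
   and a prefetch crossing a 4KB page boundary is simply dropped. */
static float sum_with_prefetch(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + 64 < n)  /* 64 floats = 256 bytes = 4 cache lines ahead */
            _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```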

Tuning of the Unrolling Heuristic
Less is sometimes more. Excessive unrolling of some classes of loops increases register pressure. Loops that do not benefit from compiler unrolling:
- multi-dimensional arrays in which i in (i, j, ...) isn't the fastest-moving index
- arrays whose index needs to be loaded before it can be determined, e.g. x(BIN(i))
- loops large enough to exceed the number of floating-point registers
GPR and FP registers are spilled to memory, causing:
- excess RISC operation counts: more work required, more execution time
- address generation held up by register load dependencies
- out-of-order execution limited by not being able to load the data to process
PGI 5.2 unrolls less aggressively, allowing out-of-order execution within the processor to mask latency rather than relying on compiler unrolling.

Expanded Class of Vector Loops
Vector code generation enhancements. Loops with the following constructs now vectorize in PGI 5.2:
- loops containing SIGN or MERGE intrinsics
- large loops containing more than a preset limit of instructions
- loops where loop-carried redundancy elimination (LRE) previously interfered with vectorization
- invariant IF/ELSE transformations that hoist IF/ELSE constructs not dependent upon loop variables outside of the loop, replicating the loop for each case of the IF/ELSE statement
- loops that operate upon data objects > 2 GB
- loops in programs compiled with the -i8 switch
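The invariant IF/ELSE transformation described above, often called loop unswitching, can be sketched as follows; function and flag names are illustrative.

```c
/* Before the transformation, the branch on use_scale sits inside
   the loop even though it does not depend on i, blocking
   vectorization. After it, the test is hoisted and the loop is
   replicated once per case, leaving two branch-free, vectorizable
   bodies. */
static void scale_or_offset(float *a, const float *b, int n,
                            int use_scale, float k)
{
    if (use_scale) {                /* tested once, not n times */
        for (int i = 0; i < n; i++)
            a[i] = b[i] * k;
    } else {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + k;
    }
}
```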

LS-DYNA Performance
Neon and 3-Car Models

64-bit LS-DYNA v5434 Neon Model Performance
[Chart: LS-DYNA Neon benchmark wall-clock time of execution (lower is better) at 1P, 2P, 4P, 8P, 16P, and 32P for an IBM x-series cluster with Myrinet, a 2P Opteron 1.8 GHz cluster with InfiniBand, an HP RX2600 Itanium 2 cluster with InfiniBand, and a 2P Opteron 2 GHz cluster with InfiniBand.]

64-bit LS-DYNA v5434 Neon Model Performance
[Chart: LS-DYNA Neon benchmark performance relative to Itanium 2 at 1P through 32P for the same four systems; bars above 1.0 are faster than the HP RX2600 Itanium 2 baseline.]

64-bit LS-DYNA 3-Car Model Performance
[Chart: LS-DYNA 3-Car benchmark wall-clock time of execution (lower is better) at 8P, 16P, 32P, and 64P for an IBM x-series cluster with Gigabit Ethernet, an HP RX2600 Itanium 2 cluster with InfiniBand, and a 2P Opteron 2 GHz cluster with InfiniBand.]

64-bit LS-DYNA 3-Car Model Performance
[Chart: LS-DYNA 3-Car benchmark performance relative to Itanium 2 at 8P, 16P, 32P, and 64P for the same three systems.]

Trademark Attribution
AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Other product names used in this presentation are for identification purposes only and may be trademarks of their respective companies.


More information

AMD Opteron 4200 Series Processor

AMD Opteron 4200 Series Processor What s new in the AMD Opteron 4200 Series Processor (Codenamed Valencia ) and the new Bulldozer Microarchitecture? Platform Processor Socket Chipset Opteron 4000 Opteron 4200 C32 56x0 / 5100 (codenamed

More information

Platform Choices for LS-DYNA

Platform Choices for LS-DYNA Platform Choices for LS-DYNA Manfred Willem and Lee Fisher High Performance Computing Division, HP lee.fisher@hp.com October, 2004 Public Benchmarks for LS-DYNA www.topcrunch.org administered by University

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Review of Last Lecture. CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions. Great Idea #4: Parallelism.

Review of Last Lecture. CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions. Great Idea #4: Parallelism. CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Instructor: Justin Hsia 1 Review of Last Lecture Amdahl s Law limits benefits of parallelization Request Level Parallelism

More information

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

What Transitioning from 32-bit to 64-bit x86 Computing Means Today

What Transitioning from 32-bit to 64-bit x86 Computing Means Today What Transitioning from 32-bit to 64-bit x86 Computing Means Today Chris Wanner Senior Architect, Industry Standard Servers Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar Vector Processors Kavitha Chandrasekar Sreesudhan Ramkumar Agenda Why Vector processors Basic Vector Architecture Vector Execution time Vector load - store units and Vector memory systems Vector length

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

Computer Architecture and Engineering. CS152 Quiz #5. April 23rd, Professor Krste Asanovic. Name: Answer Key

Computer Architecture and Engineering. CS152 Quiz #5. April 23rd, Professor Krste Asanovic. Name: Answer Key Computer Architecture and Engineering CS152 Quiz #5 April 23rd, 2009 Professor Krste Asanovic Name: Answer Key Notes: This is a closed book, closed notes exam. 80 Minutes 8 Pages Not all questions are

More information

CS 261 Fall Mike Lam, Professor. x86-64 Data Structures and Misc. Topics

CS 261 Fall Mike Lam, Professor. x86-64 Data Structures and Misc. Topics CS 261 Fall 2017 Mike Lam, Professor x86-64 Data Structures and Misc. Topics Topics Homogeneous data structures Arrays Nested / multidimensional arrays Heterogeneous data structures Structs / records Unions

More information

Compilation for Heterogeneous Platforms

Compilation for Heterogeneous Platforms Compilation for Heterogeneous Platforms Grid in a Box and on a Chip Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf Senior Researchers Ken Kennedy John Mellor-Crummey

More information

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli 06-1 Vector Processors, Etc. 06-1 Some material from Appendix B of Hennessy and Patterson. Outline Memory Latency Hiding v. Reduction Program Characteristics Vector Processors Data Prefetch Processor /DRAM

More information

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode

More information

How to write powerful parallel Applications

How to write powerful parallel Applications How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome

More information

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions. MIPS Pipe Line 2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information

Optimization of Lattice QCD codes for the AMD Opteron processor

Optimization of Lattice QCD codes for the AMD Opteron processor Optimization of Lattice QCD codes for the AMD Opteron processor Miho Koma (DESY Hamburg) ACAT2005, DESY Zeuthen, 26 May 2005 We report the current status of the new Opteron cluster at DESY Hamburg, including

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence

More information

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 Name: Solutions (please print) 1-3. 11 points 4. 7 points 5. 7 points 6. 20 points 7. 30 points 8. 25 points Total (105 pts):

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Programmazione Avanzata

Programmazione Avanzata Programmazione Avanzata Vittorio Ruggiero (v.ruggiero@cineca.it) Roma, Marzo 2017 Pipeline Outline CPU: internal parallelism? CPU are entirely parallel pipelining superscalar execution units SIMD MMX,

More information

Parallelism and Performance Instructor: Steven Ho

Parallelism and Performance Instructor: Steven Ho Parallelism and Performance Instructor: Steven Ho Review of Last Lecture Cache Performance AMAT = HT + MR MP 2 Multilevel Cache Diagram Main Memory Legend: Request for data Return of data CPU L1$ Memory

More information

Multiple Issue ILP Processors. Summary of discussions

Multiple Issue ILP Processors. Summary of discussions Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

Optimizing Graphics Drivers with Streaming SIMD Extensions. Copyright 1999, Intel Corporation. All rights reserved.

Optimizing Graphics Drivers with Streaming SIMD Extensions. Copyright 1999, Intel Corporation. All rights reserved. Optimizing Graphics Drivers with Streaming SIMD Extensions 1 Agenda Features of Pentium III processor Graphics Stack Overview Driver Architectures Streaming SIMD Impact on D3D Streaming SIMD Impact on

More information

Instruction Set Progression. from MMX Technology through Streaming SIMD Extensions 2

Instruction Set Progression. from MMX Technology through Streaming SIMD Extensions 2 Instruction Set Progression from MMX Technology through Streaming SIMD Extensions 2 This article summarizes the progression of change to the instruction set in the Intel IA-32 architecture, from MMX technology

More information

Static Compiler Optimization Techniques

Static Compiler Optimization Techniques Static Compiler Optimization Techniques We examined the following static ISA/compiler techniques aimed at improving pipelined CPU performance: Static pipeline scheduling. Loop unrolling. Static branch

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers X: Fujitsu s New Generation 16 Processor for the next generation UNIX servers August 29, 2012 Takumi Maruyama Processor Development Division Enterprise Server Business Unit Fujitsu Limited All Rights Reserved,Copyright

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Multi-Core Coding for Games & Phenom TM Software Optimization. Justin Boggs Sr. Developer Relations Engineer Sunnyvale, CA, USA

Multi-Core Coding for Games & Phenom TM Software Optimization. Justin Boggs Sr. Developer Relations Engineer Sunnyvale, CA, USA Multi-Core Coding for Games & Phenom TM Software Optimization Justin Boggs Sr. Developer Relations Engineer Sunnyvale, CA, USA July 2007 Agenda AMD Phenom TM Processor CPU Software Tools True Multi-core

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture 1 The 80x86 architecture processors popular since its application in IBM PC (personal computer). 2 First Four generations

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

Jim Keller. Digital Equipment Corp. Hudson MA

Jim Keller. Digital Equipment Corp. Hudson MA Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

CPSC 313, 04w Term 2 Midterm Exam 2 Solutions

CPSC 313, 04w Term 2 Midterm Exam 2 Solutions 1. (10 marks) Short answers. CPSC 313, 04w Term 2 Midterm Exam 2 Solutions Date: March 11, 2005; Instructor: Mike Feeley 1a. Give an example of one important CISC feature that is normally not part of a

More information

Spectre and Meltdown. Clifford Wolf q/talk

Spectre and Meltdown. Clifford Wolf q/talk Spectre and Meltdown Clifford Wolf q/talk 2018-01-30 Spectre and Meltdown Spectre (CVE-2017-5753 and CVE-2017-5715) Is an architectural security bug that effects most modern processors with speculative

More information

Next Generation Technology from Intel Intel Pentium 4 Processor

Next Generation Technology from Intel Intel Pentium 4 Processor Next Generation Technology from Intel Intel Pentium 4 Processor 1 The Intel Pentium 4 Processor Platform Intel s highest performance processor for desktop PCs Targeted at consumer enthusiasts and business

More information

Adapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK]

Adapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Review and Advanced d Concepts Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Pipelining Review PC IF/ID ID/EX EX/M

More information

HY425 Lecture 09: Software to exploit ILP

HY425 Lecture 09: Software to exploit ILP HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit

More information

CS 6354: Static Scheduling / Branch Prediction. 12 September 2016

CS 6354: Static Scheduling / Branch Prediction. 12 September 2016 1 CS 6354: Static Scheduling / Branch Prediction 12 September 2016 On out-of-order RDTSC 2 Because of pipelines, etc.: RDTSC can actually take its time measurement after it starts Earlier instructions

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

HY425 Lecture 09: Software to exploit ILP

HY425 Lecture 09: Software to exploit ILP HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Computer System Architecture Final Examination

Computer System Architecture Final Examination Computer System Architecture 6.823 inal Examination Name: his is an open book, open notes exam. 180 Minutes 21 Pages Notes: Not all questions are of equal difficulty, so look over the entire exam and budget

More information