AMD Opteron TM & PGI: Enabling the World's Fastest LS-DYNA Performance

3. LS-DYNA Anwenderforum, Bamberg 2004, CAE / IT II

AMD Opteron TM & PGI: Enabling the World's Fastest LS-DYNA Performance
Tim Wilkens, Ph.D., Member of Technical Staff, tim.wilkens@amd.com
October 4, 2004, Computation Products Group

Agenda
- Architecture: Opteron, Itanium, Xeon EM64T
- PGI compiler: LS-DYNA-driven enhancements
- Performance: Neon and 3-Car models

Architecture
Opteron = Execution + Memory Access + IO
- Customer-centric 64-bit computing
- Instruction decoding: processors with artificial intelligence
[Opteron core block diagram: 64KB L1 instruction and 64KB L1 data caches backed by L2; branch prediction, fetch, and scan/align stages feeding fastpath and microcode decode engines; a 72-entry instruction control unit; integer and FP decode & rename; a 36-entry FP scheduler issuing to FADD, FMUL, and FST pipes plus an integer MULT; a 44-entry load/store queue; and a system request queue and crossbar connecting the CPU cores to the DDR memory controller, two coherent CPU-CPU HyperTransport links, and an IO HyperTransport link]

Architecture
Opteron = Execution + Memory Access + IO
- Not all x86 processors are created equal: RISC cores with scrupulous instruction preference
(same core block diagram as above)

Architecture
Opteron = Execution + Memory Access + IO
Scalable memory bandwidth and IO:
- physical memory scales with CPU count
- memory bandwidth scales with CPU count
- increased single-threaded memory bandwidth
- memory latency does not scale with CPU count
- dramatically lower memory latency
(same core block diagram as above)

Customer-Centric 64-bit Computing
Opteron vs Itanium
[Block diagram detail: fetch, branch prediction, scan/align, fastpath and microcode engines, 72-entry instruction control unit, integer and FP decode & rename, 36-entry scheduler]

Customer-Centric 64-bit Computing
Opteron vs Itanium
- Progressive 64-bit approach: 32-bit instructions plus a prefix byte
  - leverages x86 compiler technology: reliable compilers, code ports easily
  - code size increase is minimal (~5%), so large caches are not required
- x86 CPUs = RISC cores + CISC-to-RISC instruction decoders
  - gives x86 processors high clock frequency and legacy compatibility
  - the processor, not the compiler, manages the RISC core: recompile rarely; Itanium is a slave to the compiler: recompile often
- Out-of-order execution and register renaming
  - Opteron manages its registers intelligently: less compiler-reliant
  - Itanium requires the compiler to think for it: strong compiler reliance
- Both Opteron and Itanium are RISC underneath, but Opteron doesn't require reinventing compilers, large caches, and a mint to purchase

All x86 RISC Cores Aren't Created Equal
Opteron vs Xeon EM64T
- Opteron: INT and FP execution units, 80-bit register file (FADD, FMUL, FST pipes plus integer MULT)
- Xeon EM64T: FP execution units, 128-bit register file (FADD, FMUL pipes)

All x86 RISC Cores Aren't Created Equal
Opteron vs Xeon EM64T
- The number of integer pipes and the pipeline depth impact integer throughput
  - Opteron has 3 integer pipes: +50% reg,reg move throughput
  - Opteron has 3 ALUs: +50% add, subtract, logical, and shift throughput
  - the number of pipeline stages differs: shorter instruction execution latency
- Different register file sizes (Opteron 80-bit, Xeon 128-bit)
  - size dictates the number of RISC ops in an x86 instruction and the instruction preference
  - the number of bits written from the FPU pipes limits scalar SIMD throughput
- Design of the FPU and issue bandwidth from the FP scheduler
  - Opteron: ADD/MUL/ST pipes eat and write 240 bits per clock
  - Xeon: ADD/MUL pipes eat and write 128 bits per clock
- Though Xeon64 and Opteron are instruction compatible, Xeon64 delivers half the throughput per clock on scalar SIMD code

AMD Opteron TM, Pentium 4 (FPU analysis)
Throughput of SSE, SSE2, and x87 operations
[Table: add, multiply, and combined add & multiply throughput for SSE scalar, SSE vector, SSE2 scalar, SSE2 vector, and x87 operations on Opteron and Pentium 4. Surviving entries: Opteron sustains combined add & multiply at 4 operations per cycle, while several Pentium 4 scalar and vector entries are 1 operation per 2 cycles.]

AMD Opteron TM, Pentium 4 (ALU analysis)
Throughput and latency comparison
[Table: 32-bit and 64-bit ADD/SUB, MUL (signed and unsigned), MOV mem,reg, MOV reg,reg, XOR/AND/OR, shift/rotate, DIV (signed and unsigned), and LEA on Opteron and Pentium 4. Surviving entries: MUL at 4-cycle latency on Opteron versus 18-cycle (signed) and 10-cycle latency on Pentium 4; signed/unsigned DIV at 42/39-cycle latency on Opteron versus 80 cycles on Pentium 4; LEA throughput entries of 2 and 0.5 per cycle.]

Scalable Memory Bandwidth and IO
Opteron's on-die IO controller
- dual channels to DDR memory: 6.4 GB/s
- system request queue and crossbar connecting CPU 0 and CPU 1
- coherent CPU-CPU HyperTransport links
- 1.6 GT/s non-coherent IO HyperTransport link

Scalable Memory Bandwidth and IO
Opteron's on-die IO controller
- HyperTransport: asynchronous coherent communication maintains MP cache coherency, with a high rate of communication and low impact on MP memory latency
- Memory bandwidth scales linearly with the number of processors in the system: a greater percentage of theoretical peak is delivered, with low-latency memory access
- Memory latency: memory requests are retired rapidly, which enhances memory bandwidth; latency doesn't scale linearly with the number of CPUs, giving scalable SMP performance

PGI 5.2
Compiler enhancements driven by LS-DYNA. Overview of enhancements in PGI 5.2:
- all vector code isn't created equal
- addressing of common block variables
- loop peeling & optimal vector code
- packing scalars into vector format
- shuffling data in loops with GPRs
- excessive prefetching: caveats to using SW prefetch
- tuning of the unrolling heuristic: less register pressure
- expanded class of vectorizable loops
- F90 pointer addressing
- support for objects greater than 2 GB

All Vectorized Code Isn't Created Equal
Minimizing bubbles in the FPU pipeline. Consider the following loop:

  do i=1,n
    a(i) = a(i) + b(i)*(c(i)+d(i))
  enddo

PGI 5.1 (uses 4 registers, generates 8 bubbles):

  movlps (d),%xmm0
  movhps 8(d),%xmm0
  movlps (c),%xmm1
  movhps 8(c),%xmm1
  addps  %xmm0,%xmm1
  movlps (b),%xmm2
  movhps 8(b),%xmm2
  mulps  %xmm2,%xmm1
  movlps (a),%xmm3
  movhps 8(a),%xmm3
  addps  %xmm3,%xmm1

Using register renaming (uses 2 registers, still generates 8 bubbles):

  movlps (d),%xmm0
  movhps 8(d),%xmm0
  movlps (c),%xmm1
  movhps 8(c),%xmm1
  addps  %xmm0,%xmm1
  movlps (b),%xmm0
  movhps 8(b),%xmm0
  mulps  %xmm0,%xmm1
  movlps (a),%xmm0
  movhps 8(a),%xmm0
  addps  %xmm0,%xmm1

PGI 5.2 operates from memory on 16-byte aligned addresses (uses 1 register, generates 2 bubbles):

  movaps (d),%xmm0
  addps  (c),%xmm0
  mulps  (b),%xmm0
  addps  (a),%xmm0
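The aligned-operand form in the last listing can be sketched with SSE intrinsics. This is an illustrative reconstruction, not PGI's actual output; it assumes all four arrays are 16-byte aligned and n is a multiple of 4.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_add_ps, ... */

/* a(i) = a(i) + b(i)*(c(i)+d(i)), 4 floats per iteration.
   Each operand is one aligned 128-bit access, mirroring the
   movaps/addps/mulps-from-memory sequence that needs only one
   XMM register. */
static void madd4(float *a, const float *b, const float *c,
                  const float *d, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 t = _mm_add_ps(_mm_load_ps(d + i), _mm_load_ps(c + i)); /* c+d */
        t = _mm_mul_ps(_mm_load_ps(b + i), t);                         /* b*(c+d) */
        t = _mm_add_ps(_mm_load_ps(a + i), t);                         /* a+b*(c+d) */
        _mm_store_ps(a + i, t);                                        /* write back a */
    }
}
```

Note that the from-memory forms fault on unaligned addresses, which is why the compiler must first prove alignment or establish it by peeling.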

Addressing of Common Blocks
Minimize GPR stores and loads from the stack. Consider the common blocks and loop below:

  Common/test1/x1(N),x2(N),x3(N),y1(N),y2(N),y3(N)
  Common/test2/px1(N),py1(N),px2(N),py2(N)
  Common/test3/vx2(N),vx3(N),vx4(N),vy2(N),vy3(N),vy4(N),vz2(N),vz3(N),vz4(N)
  Common/test4/g1(N),g2(N),g3(N),g4(N)
  Common/test5/ax(N),ay(N),bz(N)

  do i=1,n
    xi = area(i)*(x3(i)-x2(i)-x4(i))
    yi = area(i)*(y3(i)-y2(i)-y4(i))
    g1(i)= 1.-px1(i)*xi-py1(i)*yi
    g2(i)=-1.-px2(i)*xi-py2(i)*yi
    g3(i)= 2.-g1(i)
    g4(i)=-2.-g2(i)
    ax(i)=g2(i)*vx2(i)+g3(i)*vx3(i)+g4(i)*vx4(i)
    ay(i)=g2(i)*vy2(i)+g3(i)*vy3(i)+g4(i)*vy4(i)
    bz(i)=g2(i)*vz2(i)+g3(i)*vz3(i)+g4(i)*vz4(i)
  enddo

PGI 5.1 generates (excerpt):

  movq   -120(%rsp), %r15
  movlps (%r15,%rcx), %xmm4
  movhps 8(%r15,%rcx), %xmm4
  movq   -112(%rsp), %r15
  movlps (%r15,%rcx), %xmm5
  movhps 8(%r15,%rcx), %xmm5
  subps  %xmm4, %xmm5
  movq   -104(%rsp), %r15
  movlps (%r15,%rcx), %xmm4
  movhps 8(%r15,%rcx), %xmm4
  subps  %xmm4, %xmm5
  movq   -96(%rsp), %r15

- PGI 5.1 uses separate GPRs to address each array, even for arrays in the same common block
- This accentuates register pressure in loops: one LS-DYNA loop had 54 GPR moves to and from the stack in PGI 5.1.5, where Intel 7.1 had 0 occurrences of GPR moves

Addressing of Common Blocks
Minimize GPR stores and loads from the stack (continued). For the same common blocks and loop, PGI 5.2 generates (excerpt):

  movaps -12000(%r9,%rdx),%xmm4
  movaps -8000(%r9,%rdx),%xmm9
  movaps (%r8,%rdx),%xmm5
  movaps 4000(%r8,%rdx),%xmm6
  movaps %xmm3,%xmm7
  subl   $8,%eax
  subps  (%r9,%rdx),%xmm4
  addl   $8,%ecx
  subps  (%r9,%rdx),%xmm9
  movaps %xmm7,%xmm8
  mulps  %xmm5,%xmm4

- PGI 5.2 accesses all entities in the same common block through one GPR
- GPR register pressure is greatly reduced; no excess RISC ops are generated in comparison to Intel
- Operations are now performed from memory: fewer bubbles in the FPU pipeline

Loop Peeling & Optimal Vector Code
Common block illustration. Efficient code vectorization requires:
- uniform relative alignment of pointers in loops, which can be achieved via common blocks
- the span of the arrays covered in each loop iteration to be a multiple of 4 (single precision) or 2 (double precision)
- loop peeling to adjust common pointers to 16-byte aligned locations
- performing *, +, - from memory (requires 16-byte alignment)
PGI 5.2 implements peeling of code in common-block loops:

  Common/test1/vx2(N),vx3(N),vx4(N),vy2(N),vy3(N),vy4(N),vz2(N),vz3(N),vz4(N)
  Common/test2/g1(N),g2(N),g3(N),g4(N)
  Common/test3/ax(N),ay(N),bz(N)

  do i=1,n
    ax(i)=g2(i)*vx2(i)+g3(i)*vx3(i)+g4(i)*vx4(i)
    ay(i)=g2(i)*vy2(i)+g3(i)*vy3(i)+g4(i)*vy4(i)
    bz(i)=g2(i)*vz2(i)+g3(i)*vz3(i)+g4(i)*vz4(i)
  enddo
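The scalar-peel / vector-body / scalar-tail structure described here can be sketched in plain C. Names are illustrative, and the "vector" body below is ordinary scalar C that a compiler may turn into movaps code once the destination is 16-byte aligned.

```c
#include <stdint.h>

/* Number of leading floats to peel so that p reaches a 16-byte
   boundary (floats are 4 bytes, so 0..3 peel iterations). */
static int peel_count(const float *p, int n)
{
    int k = (int)(((16 - ((uintptr_t)p & 15)) & 15) / 4);
    return k < n ? k : n;
}

/* ax(i) = g2(i)*vx2(i) with the three-loop structure: peel to
   alignment, vector body, scalar tail for leftover iterations. */
static void peeled_mul(float *ax, const float *g2, const float *vx2, int n)
{
    int i = 0, k = peel_count(ax, n);
    for (; i < k; i++)              /* scalar peel: align ax */
        ax[i] = g2[i] * vx2[i];
    for (; i + 4 <= n; i += 4)      /* "vector" body, 4 floats/iteration */
        for (int j = 0; j < 4; j++)
            ax[i + j] = g2[i + j] * vx2[i + j];
    for (; i < n; i++)              /* scalar tail */
        ax[i] = g2[i] * vx2[i];
}
```

Peeling on ax only helps if g2 and vx2 reach 16-byte boundaries at the same iteration, i.e. if all arrays share the same relative alignment; that is exactly what packing them into common blocks guarantees.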

Loop Peeling & Optimal Vector Code (continued)
For the common-block loop above, PGI 5.2 generates:
- a check of the relative alignment of the test1, test2, and test3 common block pointers; if the pointers are aligned to 16-byte boundaries, jump straight to the vector SSE loop
- a scalar SSE loop used to align the common block pointers on a 16-byte boundary
- a vector SSE loop used to perform most of the computation
- a scalar SSE loop for the final iterations not covered by the vector SSE loop

Optimized Scalar-Vector Transforms
Packing 4 scalars in a vector loop. Some vector loops require performing vector operations of scalar data upon vector quantities:

  a(i) = a(i) + b(j,i)*c(i) + d(j,i)*e(i)

PGI 5.1 does this by reading 4 floats, storing them to the stack, and then reading them back in a 128-bit load:
- creates load/store dependencies
- requires an excessive number of RISC ops: 8 x 32-bit movss loads/stores plus 1 movaps read (14 rops)
- creates 8 bubbles down the FPU pipes
PGI 5.2 does this by interleaving the floats in registers:
- 4 x movss reads, 2 x unpcklps, 1 x movlhps (11 rops)
- creates 4 bubbles down the FPU pipes
- much shorter latency
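The 4 x movss + 2 x unpcklps + 1 x movlhps sequence can be sketched with intrinsics. This is an illustrative reconstruction of the pattern, not PGI's actual output.

```c
#include <xmmintrin.h>

/* Pack four scalar floats into one XMM register without the
   store-to-stack / 128-bit-reload round trip:
   4 x movss, 2 x unpcklps, 1 x movlhps. */
static __m128 pack4(const float *p0, const float *p1,
                    const float *p2, const float *p3)
{
    __m128 x0 = _mm_load_ss(p0);           /* movss: {a,0,0,0} */
    __m128 x1 = _mm_load_ss(p1);
    __m128 x2 = _mm_load_ss(p2);
    __m128 x3 = _mm_load_ss(p3);
    __m128 lo = _mm_unpacklo_ps(x0, x1);   /* unpcklps: {a,b,0,0} */
    __m128 hi = _mm_unpacklo_ps(x2, x3);   /* unpcklps: {c,d,0,0} */
    return _mm_movelh_ps(lo, hi);          /* movlhps:  {a,b,c,d} */
}
```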

Use GPRs to Shuffle Data
Not an absolute statement, but almost. GPRs have the following advantages in loops that only shuffle data around:
- movss decodes to 2 RISC ops; a GPR mov decodes to 1 RISC op
- the FPU pipe can perform only one 32-bit or 64-bit store per cycle, while the integer unit can perform 2 of either
- pseudo-vector copies of floats can be performed with 64-bit GPRs, moving 2 floats at a time; this utilizes the full throughput of the ALUs and the load/store unit
- double precision moves should still be more efficient because the integer unit can perform 2 x 64-bit stores per cycle, whereas the FPU can perform only 1 x 64-bit store per cycle
- caution must be taken not to generate excessive register pressure
- throughput can be affected if there are many ops in addition to the loads and stores (add, sub, lea, etc.)

Excessive Prefetching
Caveats about software prefetch. Prefetching can preemptively bring data into the cache in advance of its use, but:
- Opteron has a very robust HW prefetcher for sequential data accesses; HW prefetches move data into L2 (12-cycle vs 3-cycle latency compared to L1) and do not consume execution dispatch bandwidth, while SW prefetches do
- SW prefetches across 4KB page boundaries are dropped and suffer a 90-cycle latency penalty
- SW prefetch of non-sequential data accesses offers little benefit:
  - only part of every 64 bytes fetched is useful
  - the rate of cache evictions is very high, so useful data then has to be re-fetched from L2
  - the MAB (miss address buffer) units in the processor are consumed quickly, preventing loads from occurring
- SW prefetches consume 1 of the 3 execution dispatch slots per clock cycle, limiting FPU throughput and IPC
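The pseudo-vector GPR copy from the "Use GPRs to Shuffle Data" slide can be sketched portably; an 8-byte memcpy is the standard way to express the single 64-bit GPR load and store, and the function name is illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Copy n floats (n even) two at a time through one 64-bit value:
   one 8-byte load plus one 8-byte store per pair, instead of two
   movss loads and two movss stores through the FPU pipe. */
static void copy_float_pairs(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 2) {
        uint64_t pair;
        memcpy(&pair, src + i, sizeof pair);  /* 64-bit GPR load  */
        memcpy(dst + i, &pair, sizeof pair);  /* 64-bit GPR store */
    }
}
```

Compilers typically lower each memcpy pair to a single 64-bit register move, so no function call remains.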
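A software prefetch looks like the sketch below. The 256-byte distance is a tuning assumption, not a recommended value, and per the prefetching caveats above, a sequential sum like this is exactly the case where Opteron's hardware prefetcher already does the job.

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Sum n floats, prefetching a fixed distance ahead of the loads.
   Each prefetch still costs one of the 3 dispatch slots per clock,
   and a prefetch crossing a 4KB page boundary is simply dropped. */
static float sum_with_prefetch(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + 64 < n)  /* 64 floats = 256 bytes = 4 cache lines ahead */
            _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```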

Tuning of the Unrolling Heuristic
Less is sometimes more. Excessive unrolling of some classes of loops increases register pressure. Loops that do not benefit from compiler unrolling:
- multi-dimensional arrays in which i in (i, j, ...) isn't the fastest-moving index
- arrays whose index needs to be loaded before it can be determined, e.g. x(BIN(i))
- loops large enough to exceed the number of floating-point registers
GPR and FP registers are spilled to memory, causing:
- excess RISC operation counts: more work required, more execution time
- address generation held up by register load dependencies
- out-of-order execution limited by not being able to load the data to process
PGI 5.2 unrolls less aggressively, allowing out-of-order execution within the processor to mask latency rather than relying on compiler unrolling.

Expanded Class of Vector Loops
Vector code generation enhancements. Loops with the following constructs now vectorize in PGI 5.2:
- loops containing SIGN or MERGE intrinsics
- large loops containing more than a preset limit of instructions
- loops where loop-carried redundancy elimination (LRE) previously interfered with vectorization
- invariant IF/ELSE transformations that hoist IF/ELSE constructs not dependent upon loop variables outside of the loop, replicating the loop for each case of the IF/ELSE statement
- loops that operate upon data objects > 2 GB
- loops in programs compiled with the -i8 switch
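The invariant IF/ELSE transformation described above, often called loop unswitching, can be sketched as follows; function and flag names are illustrative.

```c
/* Before the transformation, the branch on use_scale sits inside
   the loop even though it does not depend on i, blocking
   vectorization. After it, the test is hoisted and the loop is
   replicated once per case, leaving two branch-free, vectorizable
   bodies. */
static void scale_or_offset(float *a, const float *b, int n,
                            int use_scale, float k)
{
    if (use_scale) {                /* tested once, not n times */
        for (int i = 0; i < n; i++)
            a[i] = b[i] * k;
    } else {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + k;
    }
}
```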

LS-DYNA Performance
Neon and 3-Car Models

64-bit LS-DYNA v5434 Neon Model Performance
[Chart: LS-DYNA Neon benchmark wall-clock time of execution (lower is better) at 1P, 2P, 4P, 8P, 16P, and 32P for an IBM x-series cluster with Myrinet, a 2P Opteron 1.8 GHz cluster with InfiniBand, an HP RX2600 Itanium 2 cluster with InfiniBand, and a 2P Opteron 2 GHz cluster with InfiniBand.]

64-bit LS-DYNA v5434 Neon Model Performance
[Chart: LS-DYNA Neon benchmark performance relative to Itanium 2 at 1P through 32P for the same four systems; bars above 1.0 are faster than the HP RX2600 Itanium 2 baseline.]

64-bit LS-DYNA 3-Car Model Performance
[Chart: LS-DYNA 3-Car benchmark wall-clock time of execution (lower is better) at 8P, 16P, 32P, and 64P for an IBM x-series cluster with Gigabit Ethernet, an HP RX2600 Itanium 2 cluster with InfiniBand, and a 2P Opteron 2 GHz cluster with InfiniBand.]

64-bit LS-DYNA 3-Car Model Performance
[Chart: LS-DYNA 3-Car benchmark performance relative to Itanium 2 at 8P, 16P, 32P, and 64P for the same three systems.]

Trademark Attribution
AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Other product names used in this presentation are for identification purposes only and may be trademarks of their respective companies.


More information

AMD Opteron 4200 Series Processor

AMD Opteron 4200 Series Processor What s new in the AMD Opteron 4200 Series Processor (Codenamed Valencia ) and the new Bulldozer Microarchitecture? Platform Processor Socket Chipset Opteron 4000 Opteron 4200 C32 56x0 / 5100 (codenamed

More information

Platform Choices for LS-DYNA

Platform Choices for LS-DYNA Platform Choices for LS-DYNA Manfred Willem and Lee Fisher High Performance Computing Division, HP lee.fisher@hp.com October, 2004 Public Benchmarks for LS-DYNA www.topcrunch.org administered by University

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Review of Last Lecture. CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions. Great Idea #4: Parallelism.

Review of Last Lecture. CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions. Great Idea #4: Parallelism. CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Instructor: Justin Hsia 1 Review of Last Lecture Amdahl s Law limits benefits of parallelization Request Level Parallelism

More information

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

What Transitioning from 32-bit to 64-bit x86 Computing Means Today

What Transitioning from 32-bit to 64-bit x86 Computing Means Today What Transitioning from 32-bit to 64-bit x86 Computing Means Today Chris Wanner Senior Architect, Industry Standard Servers Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar Vector Processors Kavitha Chandrasekar Sreesudhan Ramkumar Agenda Why Vector processors Basic Vector Architecture Vector Execution time Vector load - store units and Vector memory systems Vector length

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

Computer Architecture and Engineering. CS152 Quiz #5. April 23rd, Professor Krste Asanovic. Name: Answer Key

Computer Architecture and Engineering. CS152 Quiz #5. April 23rd, Professor Krste Asanovic. Name: Answer Key Computer Architecture and Engineering CS152 Quiz #5 April 23rd, 2009 Professor Krste Asanovic Name: Answer Key Notes: This is a closed book, closed notes exam. 80 Minutes 8 Pages Not all questions are

More information

CS 261 Fall Mike Lam, Professor. x86-64 Data Structures and Misc. Topics

CS 261 Fall Mike Lam, Professor. x86-64 Data Structures and Misc. Topics CS 261 Fall 2017 Mike Lam, Professor x86-64 Data Structures and Misc. Topics Topics Homogeneous data structures Arrays Nested / multidimensional arrays Heterogeneous data structures Structs / records Unions

More information

Compilation for Heterogeneous Platforms

Compilation for Heterogeneous Platforms Compilation for Heterogeneous Platforms Grid in a Box and on a Chip Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf Senior Researchers Ken Kennedy John Mellor-Crummey

More information

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli 06-1 Vector Processors, Etc. 06-1 Some material from Appendix B of Hennessy and Patterson. Outline Memory Latency Hiding v. Reduction Program Characteristics Vector Processors Data Prefetch Processor /DRAM

More information

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode

More information

How to write powerful parallel Applications

How to write powerful parallel Applications How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome

More information

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions. MIPS Pipe Line 2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information

Optimization of Lattice QCD codes for the AMD Opteron processor

Optimization of Lattice QCD codes for the AMD Opteron processor Optimization of Lattice QCD codes for the AMD Opteron processor Miho Koma (DESY Hamburg) ACAT2005, DESY Zeuthen, 26 May 2005 We report the current status of the new Opteron cluster at DESY Hamburg, including

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence

More information

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 Name: Solutions (please print) 1-3. 11 points 4. 7 points 5. 7 points 6. 20 points 7. 30 points 8. 25 points Total (105 pts):

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Programmazione Avanzata

Programmazione Avanzata Programmazione Avanzata Vittorio Ruggiero (v.ruggiero@cineca.it) Roma, Marzo 2017 Pipeline Outline CPU: internal parallelism? CPU are entirely parallel pipelining superscalar execution units SIMD MMX,

More information

Parallelism and Performance Instructor: Steven Ho

Parallelism and Performance Instructor: Steven Ho Parallelism and Performance Instructor: Steven Ho Review of Last Lecture Cache Performance AMAT = HT + MR MP 2 Multilevel Cache Diagram Main Memory Legend: Request for data Return of data CPU L1$ Memory

More information

Multiple Issue ILP Processors. Summary of discussions

Multiple Issue ILP Processors. Summary of discussions Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

Optimizing Graphics Drivers with Streaming SIMD Extensions. Copyright 1999, Intel Corporation. All rights reserved.

Optimizing Graphics Drivers with Streaming SIMD Extensions. Copyright 1999, Intel Corporation. All rights reserved. Optimizing Graphics Drivers with Streaming SIMD Extensions 1 Agenda Features of Pentium III processor Graphics Stack Overview Driver Architectures Streaming SIMD Impact on D3D Streaming SIMD Impact on

More information

Instruction Set Progression. from MMX Technology through Streaming SIMD Extensions 2

Instruction Set Progression. from MMX Technology through Streaming SIMD Extensions 2 Instruction Set Progression from MMX Technology through Streaming SIMD Extensions 2 This article summarizes the progression of change to the instruction set in the Intel IA-32 architecture, from MMX technology

More information

Static Compiler Optimization Techniques

Static Compiler Optimization Techniques Static Compiler Optimization Techniques We examined the following static ISA/compiler techniques aimed at improving pipelined CPU performance: Static pipeline scheduling. Loop unrolling. Static branch

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers X: Fujitsu s New Generation 16 Processor for the next generation UNIX servers August 29, 2012 Takumi Maruyama Processor Development Division Enterprise Server Business Unit Fujitsu Limited All Rights Reserved,Copyright

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Multi-Core Coding for Games & Phenom TM Software Optimization. Justin Boggs Sr. Developer Relations Engineer Sunnyvale, CA, USA

Multi-Core Coding for Games & Phenom TM Software Optimization. Justin Boggs Sr. Developer Relations Engineer Sunnyvale, CA, USA Multi-Core Coding for Games & Phenom TM Software Optimization Justin Boggs Sr. Developer Relations Engineer Sunnyvale, CA, USA July 2007 Agenda AMD Phenom TM Processor CPU Software Tools True Multi-core

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture 1 The 80x86 architecture processors popular since its application in IBM PC (personal computer). 2 First Four generations

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

Jim Keller. Digital Equipment Corp. Hudson MA

Jim Keller. Digital Equipment Corp. Hudson MA Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

CPSC 313, 04w Term 2 Midterm Exam 2 Solutions

CPSC 313, 04w Term 2 Midterm Exam 2 Solutions 1. (10 marks) Short answers. CPSC 313, 04w Term 2 Midterm Exam 2 Solutions Date: March 11, 2005; Instructor: Mike Feeley 1a. Give an example of one important CISC feature that is normally not part of a

More information

Spectre and Meltdown. Clifford Wolf q/talk

Spectre and Meltdown. Clifford Wolf q/talk Spectre and Meltdown Clifford Wolf q/talk 2018-01-30 Spectre and Meltdown Spectre (CVE-2017-5753 and CVE-2017-5715) Is an architectural security bug that effects most modern processors with speculative

More information

Next Generation Technology from Intel Intel Pentium 4 Processor

Next Generation Technology from Intel Intel Pentium 4 Processor Next Generation Technology from Intel Intel Pentium 4 Processor 1 The Intel Pentium 4 Processor Platform Intel s highest performance processor for desktop PCs Targeted at consumer enthusiasts and business

More information

Adapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK]

Adapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Review and Advanced d Concepts Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Pipelining Review PC IF/ID ID/EX EX/M

More information

HY425 Lecture 09: Software to exploit ILP

HY425 Lecture 09: Software to exploit ILP HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit

More information

CS 6354: Static Scheduling / Branch Prediction. 12 September 2016

CS 6354: Static Scheduling / Branch Prediction. 12 September 2016 1 CS 6354: Static Scheduling / Branch Prediction 12 September 2016 On out-of-order RDTSC 2 Because of pipelines, etc.: RDTSC can actually take its time measurement after it starts Earlier instructions

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

HY425 Lecture 09: Software to exploit ILP

HY425 Lecture 09: Software to exploit ILP HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Computer System Architecture Final Examination

Computer System Architecture Final Examination Computer System Architecture 6.823 inal Examination Name: his is an open book, open notes exam. 180 Minutes 21 Pages Notes: Not all questions are of equal difficulty, so look over the entire exam and budget

More information