A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures


1 A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing

2 A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures Overview/Motivation of this research: a performance-oriented classification of different software workload types (categories/environments) on multi-core multi-processor NUMA and UMA architectures. Server processors evaluated: 'dual-processor dual-core AMD Opteron 2220 SE' and 'single-processor quad-core Intel Xeon E5506'.

3 Software program workload types/categories: memory intensive applications, processing intensive, memory and processing intensive (matrix multiplication), system call intensive applications (file reading and writing, socket based), message passing/middleware based, and thread based.

4 Target Audience End users who wish to set up server-based systems using commodity multi-core multi-processor hardware. Hardware engineers, to notice and understand that their processor designs perform well on some application categories but at the same time do not perform well on others. Compiler developers, to understand that, for a given processor architecture, the performance of their compiler-generated code also depends on the category/type/domain of the application being written.

5 Related Work In their paper [2], Haoqiang Jin, Robert Hood, Johnny Chang et al. analyze the effect of contention for shared resources (in the memory hierarchy) on the performance of applications running on multi-core based server processors. The authors consider computational ability and analyze the performance of computationally intensive benchmarks. My work differs in that I analyze this with respect to different application domains such as computationally intensive, memory intensive, system call intensive, and middleware execution intensive (i.e. execution performance of enterprise middleware).

7 Related Work cont... There are several on-line reports which compare the performance of processors from both companies [15, 16, 17, 18]. However, most of these reports simply present performance metrics such as execution time and throughput. We go beyond this: in addition to execution time and throughput, we also analyze performance metrics such as level 1 cache miss count, level 2 cache miss count, translation lookaside buffer (TLB) misses, conditional branch instructions mis-predicted, cycles stalled on any resource, total wall clock cycles, and total wall clock time.

8 Related Work cont... How our work differs from the above: all of the above studies lack an in-depth, performance-based classification of application categories on a multi-core architecture alone vs. a hybrid multi-core multi-processor architecture (i.e. a hybrid heterogeneous architectural comparison) with detailed hardware performance monitoring statistics. The work done in my research addresses this analysis.

9 Why is this work different from existing work Comparing the performance of a multi-core system against an equivalent hybrid multi-core multi-processor system. A categorization of generic application domains/types according to how well they perform on a multi-core system against an equivalent hybrid multi-core multi-processor system. Coming up with a simple set of generic benchmarks which could be used to evaluate hybrid heterogeneous systems, with detailed hardware performance monitoring counter statistics generation.

10 Evaluated Architectures Performance evaluation of a multi-processor multi-core architecture vs. an equivalent-capacity single-processor multi-core architecture. The 1st server processor architecture is a dual-processor dual-core architecture (hence a multi-processor multi-core architecture). The 2nd server processor architecture is a single-processor quad-core architecture (hence a single-processor multi-core architecture).

11 Evaluated Architectures
                 Dual-Core AMD Opteron                       Quad-Core Intel Xeon
Speed (GHz)      2.8                                         2.13
L1 Cache (KB)    64 data + 64 instruction = 128, per core    32 data + 32 instruction = 64, per core
L2 Cache         1 MB per core                               256 KB per core
L3 Cache         none                                        4 MB (Intel Smart Cache, shared)

12 CACHE PARAMETERS OF DUAL-CORE AMD OPTERON(TM) 2220 SE PROCESSOR
Level                     Capacity           Associativity (ways)   Line Size (bytes)   Access Latency (clocks)
First Level Data          64 KB (per core)
First Level Instruction   64 KB (per core)   2                      64                  N/A
Second Level              1 MB (per core)
Third Level               No level 3 cache
The L1 to L2 relationship is exclusive (i.e. the L1 content is not repeated in L2).

13 CACHE PARAMETERS OF INTEL XEON E5506 PROCESSOR
Level                     Capacity                          Associativity (ways)   Line Size (bytes)   Access Latency (clocks)
First Level Data          32 KB (per core)
First Level Instruction   32 KB (per core)                  4                       N/A                 N/A
Second Level              256 KB (per core)
Third Level               4 MB (shared among all 4 cores)
The Xeon L1 to L2 to L3 relationship is inclusive (i.e. the L1/L2 content must be present in L3).

14 INTEGRATED MEMORY CONTROLLER
                         Dual-Core AMD Opteron 2220 SE   Quad Core Intel Xeon E5506
Memory access channels   Dual channel, 128-bit wide      6-channel
Memory support           333 MHz DDR memory              800 MHz DDR memory
Peak memory bandwidth    5.3 GB/s                        up to 25.6 GB/s

15 Methodology and Implementation Custom Benchmarks:
memory intensive testing program (Test 1)
computational intensive testing program (Test 2)
memory plus computational intensive testing program (Test 3)
system call intensive testing programs: socket based testing program (Test 4), file reading/writing testing program (Test 5)
thread based, threading intensive testing program (Test 6)
middleware based testing program (Test 7)

16 CODE SEGMENTS FROM TEST 1

// Simple memory access application
#define ROWS 1        // NOTE: double data type size is 8 bytes, so 4096 doubles fit into a 32 KB cache
#define COLUMNS 4096

double matrix_a[ROWS][COLUMNS];

for (x = tmpstartcolumn; x < tmpendcolumn; x++) {
    int y = matrix_a[0][x];
    y++;
    matrix_a[0][x] = y;
}

17 CODE SEGMENTS FROM TEST 2

// Simple memory access application plus some further processing overhead
// (i.e. also utilizing the microprocessor more than Test 1)
for (x = tmpstartcolumn; x < tmpendcolumn; x++) {
    double y = matrix_a[0][x];
    // trying to exercise/use the microprocessor further
    int z, z2;
    for (z = 0; z < num_of_inner_loops; z++) {
        y = ((y + y * z) / 2) * 1.1;
        for (z2 = 0; z2 < 10; z2++) {
            if (((int)y) % 2 == 0)
                y = ((y * z) / 2) * ;   // multiplier constant missing in the source
            else
                y = ((y * z) / 2) * ;   // multiplier constant missing in the source
        }
    }
    matrix_a[0][x] = y;
}
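The slides do not reproduce the code for Test 3 (the memory plus computational intensive matrix multiplication). As a rough illustration of the kind of workload described later (a 500 x 500 multiplication split across four threads), a sketch like the following could be used; all names, the row partitioning, and the initialization are assumptions, not the authors' benchmark code.

/* Hypothetical sketch of a Test 3 style workload: naive 500 x 500 matrix
 * multiplication, rows partitioned across 4 pthreads (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

#define N        500
#define NTHREADS 4

static double a[N][N], b[N][N], c[N][N];

static void *worker(void *arg)
{
    int t     = (int)(long)arg;
    int first = t * (N / NTHREADS);
    int last  = (t == NTHREADS - 1) ? N : first + (N / NTHREADS);

    /* each thread multiplies its own band of rows */
    for (int i = first; i < last; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
        }

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("c[0][0]=%f c[N-1][N-1]=%f\n", c[0][0], c[N - 1][N - 1]);
    return 0;
}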

18 Motivation/Objective Analyze and describe the impact of processor performance parameters on the performance of different application domains, and analyze how this behavior varies among server processor architectures. We analyze how those processor performance parameters vary across the different application domains.

19 processor performance parameters Resource Stalls Branch Mis-predictions Translation Lookaside Buffer (TLB) Misses Cache Misses

20 Why Intel VTune and AMD CodeAnalyst were not used Intel VTune runs only on Intel processors and AMD CodeAnalyst runs only on AMD processors. Since my experiment architectures include both Intel and AMD processors, I would have to use both of the above tools, and I would then lose the uniformity of my results, as these are two independent tools from two different processor vendors. Therefore I decided to use the Performance Application Programming Interface (PAPI) library.

21 PAPI hardware performance monitoring counters L1 data cache misses, L1 instruction cache misses, L2 cache misses, data translation lookaside buffer misses (DTLB), total translation lookaside buffer misses (TLB), conditional branch instructions mispredicted, cycles stalled on any resource

22 PAPI events and methods used PAPI_L1_DCM: L1 data cache misses PAPI_L1_ICM: L1 instruction Cache misses PAPI_L2_TCM: L2 total cache misses PAPI_TLB_DM: Data translation lookaside buffer misses PAPI_TLB_TL: Total translation lookaside buffer misses

23 PAPI events and methods used PAPI_BR_MSP: Conditional branch instructions mispredicted PAPI_RES_STL: Cycles stalled on any resource PAPI_get_real_cyc(): Total wall clock cycles PAPI_get_real_usec(): Total wall clock time
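A minimal sketch of how these PAPI events and timing calls are typically combined around a benchmark region (assuming PAPI is installed and that the chosen events are available and can be counted together on the target processor; error handling and the actual kernel are omitted):

/* Minimal PAPI instrumentation sketch, not the thesis benchmark code. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int events[3] = { PAPI_L1_DCM, PAPI_L2_TCM, PAPI_BR_MSP };
    long long counts[3];
    int eventset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&eventset);
    PAPI_add_events(eventset, events, 3);

    long long cyc0  = PAPI_get_real_cyc();   /* total wall clock cycles  */
    long long usec0 = PAPI_get_real_usec();  /* total wall clock time    */
    PAPI_start(eventset);

    /* ... benchmark kernel under test goes here ... */

    PAPI_stop(eventset, counts);
    printf("L1 DCM=%lld L2 TCM=%lld BR MSP=%lld cycles=%lld usec=%lld\n",
           counts[0], counts[1], counts[2],
           PAPI_get_real_cyc() - cyc0, PAPI_get_real_usec() - usec0);
    return 0;
}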

24 RESULTS, EVALUATION, ANALYSIS AND DISCUSSION

25 STATISTICS OBTAINED FOR TEST 1 Table of L1 cache misses, L2 cache misses, TLB misses, and conditional branch instructions mis-predicted for the Opteron and Xeon at array sizes of 32 KB, 64 KB, 1 MB, 4 MB, and 5 MB (numeric values not recovered from the slide).

26 Better Processor for Respective Application Domain
Application Domain                                         In terms of Total Wall Clock Cycles   In terms of Total Wall Clock Time (Effective Performance)
processor intensive                                        Xeon                                  Opteron
memory and processing intensive (matrix multiplication)    Xeon                                  Xeon
system call intensive: socket based                        Xeon                                  Xeon
system call intensive: file reading and writing based      Xeon                                  Xeon
thread based                                               Xeon                                  Opteron/Xeon
middleware based                                           ---                                   Xeon

27 RESULTS, EVALUATION, ANALYSIS AND DISCUSSION cont Overall Analysis In Detail

28 Computational intensive performance analysis of the two processors

29 Computational intensive performance analysis cont... Charts: cycles stalled on any resource and total wall clock time in seconds for all threads, Opteron vs. Xeon, for workloads of 0.6 MB, 2.4 MB, 4.8 MB, and 6 MB RAM at 100% processor utilization (iteration counts not recovered from the slide).

30 Computational Intensive cont... Why has the older Opteron performed better? Number of pipeline stages: Opteron: 12 for integer, 17 for floating-point; Xeon: 14 for both integer and floating-point.

31 Computational Intensive cont... Why has the older Opteron performed better? cont... The Opteron has the higher number of pipeline stages for floating-point. The higher the number of pipeline stages, the less logic delay there is per stage, hence the processor clock frequency can be increased. The Opteron processor runs at 2.8 GHz and the Xeon at 2.13 GHz. This makes the Opteron's execution engine perform faster and give better throughput than the Xeon.
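The textbook pipelining relation behind this argument (standard reasoning, not taken from the slides): splitting a total combinational logic delay over N stages, each with a fixed latch/setup overhead, gives roughly $T_{cycle} \approx T_{logic}/N + T_{latch}$, so $f_{max} = 1/T_{cycle}$ rises as N grows. This is why a deeper pipeline permits a higher clock frequency, at the cost of a larger penalty each time the pipeline is flushed on a branch mis-prediction.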

32 Computational Intensive cont... Why has the older Opteron performed better? cont... The Opteron server used is a dual-core dual-processor machine, while the Xeon server used is a quad-core single-processor machine. Since in the Xeon all four cores are manufactured on a single die, the processor clock frequency is also constrained by the compressed die area. The more the die area is constrained for a single core, the more the single-core clock frequency has to be reduced. Hence the Xeon clock frequency is 2.13 GHz while the Opteron runs at 2.8 GHz.

33 Computational Intensive cont... Why has the older Opteron performed better? cont... Also, the Opteron's integer pipeline can fetch and decode floating-point instructions as well. This also accelerates the Opteron's execution engine and boosts throughput.

34 Computational intensive performance analysis cont...
Micro-architecture Parameters                                 AMD Opteron 2220 SE                                                                            Intel Xeon E5506
Super-scalar width                                            8-way super-scalar processor                                                                   6-way super-scalar processor
Number of pipeline stages                                     12 for integer, 17 for floating-point                                                          14
Scheduler can issue/dispatch up to how many micro-ops/cycle   11 micro-ops (the schedulers and the load/store unit can dispatch); 3 micro-ops to the instruction control unit   6 micro-ops

35 So now what is next? Will this be the same for other application domains?

36 Better Processor for Respective Application Domain
Application Domain                                         In terms of Total Wall Clock Cycles   In terms of Total Wall Clock Time (Effective Performance)
processor intensive                                        Xeon                                  Opteron
memory and processing intensive (matrix multiplication)    Xeon                                  Xeon
system call intensive: socket based                        Xeon                                  Xeon
system call intensive: file reading and writing based      Xeon                                  Xeon
thread based                                               Xeon                                  Opteron/Xeon
middleware based                                           ---                                   Xeon

37 Significance of Integrated Memory Controller in overruling microprocessor performance Computational intensive benchmark. Charts: level 2 cache misses, total translation lookaside buffer misses (TLB), cycles stalled on any resource, and total wall clock time in seconds for all threads, plotted against the number of loops/iterations, Opteron vs. Xeon.

38 Significance of IMC in overruling microprocessor performance cont. Computational and largely memory intensive benchmark (500 x 500 and 1000 x 1000 matrix multiplication with extra computation performed), 25 MB - 93 MB RAM utilization. Charts: level 2 cache misses, total translation lookaside buffer misses (TLB), cycles stalled on any resource, and total wall clock time in seconds for all threads, for the 500 x 500 and 1000 x 1000 matrix sizes, Opteron vs. Xeon.

39 Significance of IMC in overruling microprocessor performance cont.
Memory Access: Load and Store Operation Enhancements
                                       AMD Opteron 2220 SE          Intel Xeon E5506
Peak issue rate (operations/cycle)     Two 64-bit loads or stores   One 128-bit load and one 128-bit store
Load/store queue                       44-entry                     Deeper buffers for load and store operations: 48 load buffers, 32 store buffers, 10 fill buffers

40 Significance of IMC in overruling microprocessor performance cont.
Integrated Memory Controller
                         AMD Opteron 2220 SE          Intel Xeon E5506
Memory access channels   Dual channel, 128-bit wide   6-channel
Memory support           333 MHz DDR memory           800 MHz DDR memory
Peak memory bandwidth    5.3 GB/s                     up to 19.2 GB/s

41 Significance of TLB Unit in overruling microprocessor performance?

42 Significance of TLB Unit in overruling microprocessor performance The Xeon TLB unit is ahead of the Opteron's, hence in general the TLB statistics of the Xeon should be better than those of the Opteron. But from our results we observed the following points: 1st Point - the Xeon Instruction TLB (ITLB) statistics are poor compared to the Opteron. 2nd Point - when the memory access workload increases, the Xeon TLB statistics become poor compared to the Opteron. 3rd Point - the TLB miss penalty is overruled by a better IMC.
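As an illustration of the second point (a sketch under stated assumptions, not the authors' benchmark): touching only one element per 4 KB page over a working set much larger than the TLB reach makes almost every access a TLB miss, so growing the working set quickly exposes the TLB behaviour of each processor. The buffer size and repetition count below are arbitrary.

/* Illustrative page-strided walk that stresses the data TLB. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES  4096
#define WORKING_SET (256UL * 1024 * 1024)    /* 256 MB, far beyond the TLB reach */

int main(void)
{
    size_t pages = WORKING_SET / PAGE_BYTES;
    unsigned char *buf = malloc(WORKING_SET);
    unsigned long sum = 0;

    if (!buf)
        return 1;
    memset(buf, 1, WORKING_SET);             /* make sure all pages are mapped */

    for (int rep = 0; rep < 10; rep++)
        for (size_t p = 0; p < pages; p++)
            sum += buf[p * PAGE_BYTES];      /* one access per 4 KB page */

    printf("%lu\n", sum);
    free(buf);
    return 0;
}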

43 Significance of TLB Unit in overruling microprocessor performance cont...
Translation Lookaside Buffers (TLB) of the Microprocessors

Dual-Core AMD Opteron 2220 SE:
  Number of TLB levels: 2
  L1 Data TLB for 4-KByte pages: 32 entries
  L1 Data TLB for large pages (2MB/4MB): 8 entries
  L1 Instruction TLB for 4-KByte pages: 32 entries
  L1 Instruction TLB for large pages (2MB/4MB): 8 entries
  L1 TLB associativity (ways): fully associative
  L2 TLB for 4-KByte pages: 512 entries
  L2 TLB associativity (ways): 4

Quad-Core Intel Xeon E5506:
  Number of TLB levels: 2
  DTLB0 for 4-KByte pages: 64 entries
  DTLB0 for large pages (2MB/4MB): 32 entries
  ITLB for 4-KByte pages: 64 entries
  ITLB for large pages: 7 entries
  DTLB0 / ITLB associativity (ways): 4
  STLB for 4-KByte pages: 512 entries (services both data and instruction look-ups)
  STLB associativity (ways): 4
  A DTLB0 miss with an STLB hit causes a penalty of 7 cycles
  The delays associated with a miss to the STLB and the page miss handler (PMH) are largely non-blocking

44 Significance of TLB Unit in overruling microprocessor performance cont... Computational intensive benchmark (100% processor, 6 MB RAM). Charts: data translation lookaside buffer misses (DTLB), instruction translation lookaside buffer misses (ITLB), total TLB misses, and total wall clock time in seconds for all threads, Opteron vs. Xeon.

45 Significance of TLB Unit in overruling microprocessor performance cont... Computational and largely memory intensive benchmark (500 x 500 matrix multiplication), 25 MB RAM utilization. Charts (per thread, Threads 1-4): 500x500 data translation lookaside buffer misses (DTLB), instruction translation lookaside buffer misses (ITLB), total TLB misses, and total wall clock time in seconds for all threads, Opteron vs. Xeon.

46 Significance of TLB Unit in overruling microprocessor performance cont... Computational and largely memory intensive benchmark (1000 x 1000 matrix multiplication), 93 MB RAM utilization. Charts (per thread, Threads 1-4): 1000x1000 data translation lookaside buffer misses (DTLB), instruction translation lookaside buffer misses (ITLB), total TLB misses, and total wall clock time in seconds for all threads, Opteron vs. Xeon.

47 Significance of Branch Prediction Unit in overruling microprocessor performance?

48 Significance of Branch Prediction Unit in overruling microprocessor performance From the hardware performance counter statistics obtained for the Branch Prediction Units of the two server processors, we observed the following two points: 1st Point - the Xeon Branch Prediction Unit is optimized and works better when the source code actually contains branching 'if' instructions, but it does not work as well when the source code contains no 'if' instructions at all (only 'for' loops). 2nd Point - branch mis-predictions have little (or no) impact on the final application execution throughput.
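To make the distinction concrete, here is a pair of toy kernels of the kinds being contrasted (illustrative only, not the thesis benchmarks): the first contains data-dependent 'if' branches, the second only 'for' loop control; the xorshift generator and constants are arbitrary choices.

/* Two toy kernels: one branchy, one with only loop-closing branches. */
#include <stdio.h>

#define N 1000000

static unsigned rng = 12345u;

static unsigned next_rand(void)              /* xorshift: hard-to-predict bits */
{
    rng ^= rng << 13;
    rng ^= rng >> 17;
    rng ^= rng << 5;
    return rng;
}

static double with_if(void)                  /* contains branching 'if' instructions */
{
    double y = 1.0;
    for (int i = 0; i < N; i++) {
        if (next_rand() & 1)
            y *= 1.0001;
        else
            y *= 0.9999;
    }
    return y;
}

static double loops_only(void)               /* only 'for' loops, no 'if' at all */
{
    double y = 1.0;
    for (int i = 0; i < N; i++)
        y *= 1.0001;
    return y;
}

int main(void)
{
    printf("%f %f\n", with_if(), loops_only());
    return 0;
}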

49 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 1st point. Charts of conditional branch instructions mis-predicted, Opteron vs. Xeon, for: the system call intensive benchmark (source code contains branching 'if' instructions, plotted against request count), the computational intensive benchmark (contains 'if' instructions), and the computational and largely memory intensive benchmark (500 x 500 matrix multiplication with extra computation performed, contains 'if' instructions; per thread, Threads 1-4).

50 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 1st point. Charts of conditional branch instructions mis-predicted, Opteron vs. Xeon, for: the simple memory access intensive benchmark (no 'if' instructions, only 'for' loops; array sizes 32KB, 64KB, 1MB, 4MB, 5MB, 6MB, 8MB, 10MB, 12MB) and the computational and largely memory intensive benchmark (500 x 500 matrix multiplication, no 'if' instructions; per thread, Threads 1-4).

51 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 2nd point. Computational intensive benchmark (source code contains branching 'if' instructions). Charts: conditional branch instructions mis-predicted and total wall clock time in seconds for all threads, Opteron vs. Xeon.

52 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 2nd point cont... Computational and largely memory intensive benchmark (500 x 500 matrix multiplication; no 'if' instructions, only 'for' loops). Charts: conditional branch instructions mis-predicted (per thread, Threads 1-4) and total wall clock time in seconds for all threads, Opteron vs. Xeon.

53 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 2nd point cont... Computational and largely memory intensive benchmark (1000 x 1000 matrix multiplication; no 'if' instructions, only 'for' loops). Charts: conditional branch instructions mis-predicted (per thread, Threads 1-4) and total wall clock time in seconds for all threads, Opteron vs. Xeon.

54 Drawbacks / Limitations / Future work We compared only L1 and L2 cache misses, but the Xeon processor has an L3 cache, which has obvious advantages in memory access latency compared with the external memory accesses the Opteron must make on an L2 cache miss. Also, cache sizes and latencies are considered in this work but associativity is not; it is another parameter to look into. For all the performance metrics presented in this work there is a need for further explanation, reasoning, and justification, requiring further architectural analysis of why one processor outperforms the other, i.e. to relate the results to the architectural features of the microprocessors.

55 Drawbacks / Limitations / Future work One could argue that industry-standard benchmarks (such as SPEC2006) should be used rather than hand-written ones, since the former are accepted by the industry and the research community, or that at least an explanation is needed of why my hand-written benchmarks are preferable. It is possible that different benchmarks would draw different conclusions about the two processors; hence it is expected that the benchmarks themselves are accepted as credible.

56 Drawbacks / Limitations / Future work Most significantly, this work claims that single-processor multi-core architectures are better than multi-processor multi-core architectures based only on the small set of experiments performed in this research. This conclusion should be stated as being limited to my hand-written benchmarks.

57 Conclusion & Contribution The findings of this work show that in most application domains the 'single-processor quad-core Intel Xeon' gives better performance statistics than the 'dual-processor dual-core AMD Opteron', in terms of both the total wall clock cycles taken for execution and the total wall clock time taken for execution (effective performance). The performance of the evaluated single-processor quad-core UMA architecture is better than that of the equivalent dual-processor dual-core NUMA architecture. Hence I conclude that the evaluated single-processor multi-core server architecture performs better than the equivalent multi-processor multi-core server architecture in handling different workloads.

58 Conclusion & Contribution Coming up with a simple set of generic benchmarks which could be used to evaluate hybrid heterogeneous systems, with detailed hardware performance monitoring counter statistics generation. This research and its published results are of great benefit to server processor architects and designers, as we have provided a detailed set of performance metrics and statistics, together with a detailed analysis of the performance of different software domains on these server processor architectures. Using the published results, the hand-written custom benchmarks, and the detailed analysis given, processor architects and designers can revisit and re-evaluate their processor architectures and re-analyze their designs to see why they perform well on some application domains and not on others.

59 Paper Publications Done
W.M.R. Weerasuriya and D.N. Ranasinghe, Older Opteron Outperforms the Newer Xeon: A Memory Intensive Application Study of Server Based Microprocessors, 21st International Conference on Systems Engineering (ICSEng2011), Las Vegas, NV, USA.
W.M.R. Weerasuriya and D.N. Ranasinghe, A Comparative Performance Evaluation of Multi-Processor Multi-Core Server Processor Architectures on Enterprise Middleware Performance, 3rd APSIPA ASC 2011, Xi'an, China.
W.M.R. Weerasuriya and D.N. Ranasinghe, Performance Analysis of System Call Intensive Software Application Execution on Server Processor Architectures: Opteron and Xeon, 2nd International Conference on Emerging Trends in Engineering and Technology (IETET-2011), Kurukshetra (Haryana), India.

60 Thank you
