A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
1 A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
W.M. Roshan Weerasuriya and D.N. Ranasinghe
University of Colombo School of Computing
2 Overview/ Motivation of this research
Performance-oriented classification of different software workload types (categories/ environments) on multi-core, multi-processor NUMA and UMA architectures.
Server processors:
- 'dual processor dual core AMD Opteron 2220 SE'
- 'single processor quad core Intel Xeon E5506'
3 Software program workload types/ categories
- memory intensive applications
- processing intensive
- memory and processing intensive (matrix multiplication)
- system call intensive applications: file reading and writing, socket based
- message passing middleware based
- thread based
4 Target Audience
- End users who wish to set up server-based systems using commodity multi-core, multi-processor systems.
- Hardware engineers, who can see that their processor designs perform well on some application categories but not on others.
- Compiler developers, who can see that for a given processor architecture the performance of their compiler-generated code also depends on the category/type/domain of the application being written.
5 Related Work In their paper [2], Haoqiang Jin, Robert Hood, Johnny Chang et al. analyze the effect of contention for shared resources in the memory hierarchy on the performance of applications running on multi-core based server processors. The authors consider computational ability and analyze the performance of computationally intensive benchmarks. My work differs in that I analyze this with respect to different application domains: computationally intensive, memory intensive, system call intensive and middleware execution intensive (i.e. the execution performance of enterprise middleware).
7 Related Work cont... There are several on-line reports which compare the performance of processors from both companies [15, 16, 17, 18]. However, most of these reports simply present performance metrics such as execution time and throughput. We go beyond this and also analyze the following performance metrics:
- Level 1 cache miss count
- Level 2 cache miss count
- Translation Lookaside Buffer misses
- Conditional branch instructions mis-predicted
- Cycles stalled on any resource
- Total wall clock cycles
- Total wall clock time
8 Related Work cont... How is our work different from the above? All the above studies lack an in-depth, performance-based classification of application categories on a multi-core architecture alone vs. a hybrid multi-core multi-processor architecture (i.e. a hybrid heterogeneous architectural comparison), with detailed hardware performance monitoring statistics. The work done in my research addresses this analysis.
9 Why is this work different from existing work
- It compares the performance of a multi-core system against an equivalent hybrid multi-core multi-processor system.
- It categorizes generic application domains/types by how well they perform on a multi-core system against an equivalent hybrid multi-core multi-processor system.
- It provides a simple set of generic benchmarks which could be used to evaluate hybrid heterogeneous systems, with detailed hardware performance monitoring counter statistics.
10 Evaluated Architectures Performance evaluation of a multi-processor multi-core architecture vs. an equivalent-capacity single-processor multi-core architecture.
The 1st server processor architecture is a dual-processor dual-core architecture (hence a multi-processor multi-core architecture).
The 2nd server processor architecture is a single-processor quad-core architecture (hence a single-processor multi-core architecture).
11 Evaluated Architectures
Processor | Speed (GHz) | L1 Cache (KB) | L2 Cache | L3 Cache
Dual-Core AMD Opteron | 2.8 | 64 data + 64 instruction = 128, per core | 1 MB per core | none
Intel Xeon Quad Core | 2.13 | 32 data + 32 instruction = 64, per core | 256 KB per core | 4 MB (Intel Smart Cache, shared)
12 CACHE PARAMETERS OF DUAL-CORE AMD OPTERON(TM) 2220 SE PROCESSOR
Level | Capacity | Associativity (ways) | Line Size (bytes) | Access Latency (clocks)
First Level Data | 64 KB (per core) | | |
First Level Instruction | 64 KB (per core) | 2 | 64 | N/A
Second Level | 1 MB (per core) | | |
Third Level | No level 3 cache | | |
The L1 to L2 relationship is exclusive (i.e. the L1 content is not repeated in L2).
13 CACHE PARAMETERS OF INTEL XEON E5506 PROCESSOR
Level | Capacity | Associativity (ways) | Line Size (bytes) | Access Latency (clocks)
First Level Data | 32 KB (per core) | | |
First Level Instruction | 32 KB (per core) | 4 | N/A | N/A
Second Level | 256 KB (per core) | | |
Third Level | 4 MB (shared among all 4 cores) | | |
The Xeon L1 to L2 to L3 relationship is inclusive (i.e. the L1/L2 content must be present in L3).
14 INTEGRATED MEMORY CONTROLLER
Parameter | Dual-Core AMD Opteron 2220 SE | Quad Core Intel Xeon E5506
Memory access channels | Dual channel, 128-bit wide | 6-channel
Memory support | 333 MHz DDR memory | 800 MHz DDR memory
Peak memory bandwidth | 5.3 Gbytes/s | up to 25.6 GB/sec
15 Methodology and Implementation
Custom benchmarks:
- memory intensive testing programs (Test1)
- computational intensive testing programs (Test2)
- memory plus computational intensive testing program (Test3)
- system call intensive testing programs: socket based testing program (Test4), file reading/ writing testing program (Test5)
- thread based, threading intensive testing program (Test6)
- middleware based testing programs (Test7)
16 CODE SEGMENTS FROM THE TEST 1
// Simple memory access application
#define ROWS 1
// NOTE: the double datatype is 8 bytes, so 4096 columns (32 KB) fit into a 32 KB cache
#define COLUMNS 4096
double matrix_a[ROWS][COLUMNS];

// Read, increment and write back every element in the column range
for (x = tmpstartcolumn; x < tmpendcolumn; x++) {
    int y = matrix_a[0][x];
    y++;
    matrix_a[0][x] = y;
}
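The sizing note in the Test 1 segment (8-byte doubles, 4096 columns for a 32 KB cache) can be checked with a small helper. This is a minimal sketch rather than the authors' code: `columns_for_bytes` is a hypothetical name, and it assumes the array has a single row, as in the segment above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the original benchmark): number of
 * double elements in a one-row array that occupies `bytes` bytes. */
size_t columns_for_bytes(size_t bytes)
{
    assert(bytes % sizeof(double) == 0); /* whole elements only */
    return bytes / sizeof(double);
}
```

On platforms where `sizeof(double)` is 8, `columns_for_bytes(32 * 1024)` gives 4096, which matches the COLUMNS value used for the 32 KB case; the larger array sizes reported for Test 1 scale the same way.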
17 CODE SEGMENTS FROM THE TEST 2
// Simple memory access application with additional processing overhead
// (i.e. also utilizing the microprocessor further more than the Test2)
for (x = tmpstartcolumn; x < tmpendcolumn; x++) {
    double y = matrix_a[0][x];
    // trying to exercise/ use the microprocessor further more
    int z, z2;
    for (z = 0; z < num_of_inner_loops; z++) {
        y = ((y + y * z) / 2) * 1.1;
        for (z2 = 0; z2 < 10; z2++) {
            if (((int)y) % 2 == 0)
                y = ((y * z) / 2) * ;   // multiplier constant missing in the source
            else
                y = ((y * z) / 2) * ;   // multiplier constant missing in the source
        }
    }
    matrix_a[0][x] = y;
}
18 Motivation/ Objective Analyze and describe the impact of processor performance parameters on the performance of different application domains, and analyze how this behavior varies among server processor architectures. We analyze how the behavior of those processor performance parameters varies among different application domains.
19 processor performance parameters Resource Stalls Branch Mis-predictions Translation Lookaside Buffer (TLB) Misses Cache Misses
20 Why Intel VTune and AMD CodeAnalyst were not used Intel VTune runs only on Intel processors and AMD CodeAnalyst runs only on AMD processors. Since my experiment architectures include both Intel and AMD processors, I would have to use both tools. I would then lose the uniformity of my results, as these are two independent tools from two different processor vendors. Therefore I decided to use the Performance Application Programming Interface (PAPI) library.
21 PAPI hardware performance monitoring counters
- L1 data cache misses
- L1 instruction cache misses
- L2 cache misses
- Data translation lookaside buffer misses (DTLB)
- Total translation lookaside buffer misses (TLB)
- Conditional branch instructions mispredicted
- Cycles stalled on any resource
22 PAPI events and methods used
- PAPI_L1_DCM: L1 data cache misses
- PAPI_L1_ICM: L1 instruction cache misses
- PAPI_L2_TCM: L2 total cache misses
- PAPI_TLB_DM: Data translation lookaside buffer misses
- PAPI_TLB_TL: Total translation lookaside buffer misses
23 PAPI events and methods used
- PAPI_BR_MSP: Conditional branch instructions mispredicted
- PAPI_RES_STL: Cycles stalled on any resource
- PAPI_get_real_cyc(): Total wall clock cycles
- PAPI_get_real_usec(): Total wall clock time
24 RESULTS, EVALUATION, ANALYSIS AND DISCUSSION
25 STATISTICS OBTAINED FOR TEST 1
[Table: L1 Cache Misses, L2 Cache Misses, TLB Misses and Conditional Branch Instructions Mis-predicted, for Opteron and Xeon at array sizes 32KB, 64KB, 1MB, 4MB and 5MB]
26 Better Processor for Respective Application Domain
Application Domain | In terms of: Total Wall Clock Cycles | In terms of: Total Wall Clock Time (Effective Performance)
processor intensive | Xeon | Opteron
memory and processing intensive (matrix multiplication) | Xeon | Xeon
system call intensive: socket based | Xeon | Xeon
system call intensive: file reading and writing based | Xeon | Xeon
thread based | Xeon | Opteron/Xeon
middleware based | --- | Xeon
27 RESULTS, EVALUATION, ANALYSIS AND DISCUSSION cont Overall Analysis In Detail
28 Computational intensive performance analysis of the two processors
29 Computational intensive performance analysis cont
[Figures, Opteron vs Xeon: 'Cycles stalled on any resource' and 'Total Wall clock time in seconds for all Threads', for runs using 0.6Mb, 2.4Mb, 4.8Mb and 6Mb RAM]
30 Computational Intensive cont... Why has the older Opteron performed better?
Number of pipeline stages:
- Opteron: 12 for integer, 17 for floating-point
- Xeon: 14 for both integer and floating-point
31 Computational Intensive cont... Why has the older Opteron performed better? Cont
- The Opteron has the higher number of pipeline stages for floating-point.
- The higher the number of pipeline stages, the shorter the clock cycle time required for each stage, hence the processor clock frequency can be increased.
- The Opteron processor runs at 2.8GHz and the Xeon at 2.13GHz.
- This lets the Opteron's execution engine run faster and give better throughput than the Xeon.
32 Computational Intensive cont... Why has the older Opteron performed better? Cont
- The Opteron server used is a dual-core dual-processor system and the Xeon server used is a quad-core single-processor system.
- Since all four (4) Xeon cores are manufactured on a single die, the processor clock frequency is also constrained by the compressed die area.
- The more the die area is constrained for a single core, the more the single-core clock frequency has to be reduced.
- Hence the Xeon clock frequency is 2.13GHz and the Opteron's is 2.8GHz.
33 Computational Intensive cont... Why has the older Opteron performed better? Cont
- Also, the Opteron's integer pipeline can fetch and decode floating-point instructions as well.
- This also accelerates the Opteron's execution engine and boosts the throughput.
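The two metrics used in the comparison, total wall clock cycles and total wall clock time, are linked by the clock frequency: time = cycles / frequency. The sketch below illustrates how a processor can need more cycles and still finish first when its clock is faster. The function names are hypothetical and any cycle counts passed in are illustrative, not measured results; only the 2.8 GHz and 2.13 GHz clock rates come from the slides.

```c
#include <assert.h>

/* Wall-clock seconds for a run that took `cycles` clock cycles
 * on a core clocked at `clock_ghz` GHz. */
double wall_seconds(double cycles, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / (clock_ghz * 1e9);
}

/* Factor by which the faster-clocked core may exceed the slower core's
 * cycle count and still tie on wall-clock time. */
double break_even_cycle_ratio(double fast_ghz, double slow_ghz)
{
    assert(slow_ghz > 0.0);
    return fast_ghz / slow_ghz;
}
```

With the clock rates above, `break_even_cycle_ratio(2.8, 2.13)` is about 1.31, so under this simple model the Opteron could spend up to roughly 31% more cycles than the Xeon on the same job and still match it on wall clock time.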
34 Computational intensive performance analysis cont...
Micro-architecture Parameters | AMD Opteron 2220 SE | Intel Xeon E5506
No of pipeline stages | 8 way super-scalar processor; 12 for integer, 17 for floating-point | 6 way super-scalar processor; 14
Scheduler can issue/ dispatch up to how many μops per cycle | 11 micro-ops (the schedulers and the load/store unit can dispatch); 3 micro-ops to the instruction control unit | 6 μops
35 So now what is next? Will this be the same for other application domains?
36 Better Processor for Respective Application Domain
Application Domain | In terms of: Total Wall Clock Cycles | In terms of: Total Wall Clock Time (Effective Performance)
processor intensive | Xeon | Opteron
memory and processing intensive (matrix multiplication) | Xeon | Xeon
system call intensive: socket based | Xeon | Xeon
system call intensive: file reading and writing based | Xeon | Xeon
thread based | Xeon | Opteron/Xeon
middleware based | --- | Xeon
37 Significance of Integrated Memory Controller in overruling microprocessor performance
Computational Intensive benchmark
[Figures, Opteron vs Xeon: Level 2 cache misses and Total translation lookaside buffer misses (TLB) against number of loops; Cycles stalled on any resource and Total Wall clock time in seconds for all Threads against number of iterations]
38 Significance of IMC in overruling microprocessor performance cont.
Computational and largely memory intensive benchmark (500 x 500 and 1000 x 1000 matrix multiplication with extra computation performed), 25Mb - 93Mb RAM utilization
[Figures, Opteron vs Xeon, by matrix size (500 x 500 and 1000 x 1000): Level 2 cache misses, Total translation lookaside buffer misses (TLB), Cycles stalled on any resource, and Total Wall clock time in seconds for all Threads]
39 Significance of IMC in overruling microprocessor performance cont.
Memory Access: Load and Store Operation Enhancements
Parameter | AMD Opteron 2220 SE | Intel Xeon E5506
Peak issue rate, operations per cycle | Two 64-bit loads or stores | One 128-bit load and one 128-bit store
Load/ store queue | 44-entry | Deeper buffers for load and store operations: 48 load buffers, 32 store buffers, 10 fill buffers
40 Significance of IMC in overruling microprocessor performance cont.
Integrated Memory Controller
Parameter | AMD Opteron 2220 SE | Intel Xeon E5506
Memory access channels | Dual channel, 128-bit wide | channel count not legible
Memory support | 333 MHz DDR memory | 800 MHz DDR memory
Peak memory bandwidth | 5.3 Gbytes/s | up to 19.2 GB/sec
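The peak bandwidth figures on this slide follow from transfer rate times bus width times channel count. The sketch below reproduces them under stated assumptions: the "333 MHz" and "800 MHz" memory ratings are read as 333 and 800 mega-transfers per second, the Opteron's dual-channel interface is treated as one 128-bit path, and the Xeon is assumed to have three 64-bit channels (the channel count is an assumption here, chosen because it matches the 19.2 GB/sec figure). `peak_bandwidth_gbs` is a hypothetical helper name.

```c
#include <assert.h>

/* Peak bandwidth in GB/s (1 GB = 1e9 bytes) for `channels` memory
 * channels, each `width_bits` wide, running at `mega_transfers` MT/s. */
double peak_bandwidth_gbs(double mega_transfers, int width_bits, int channels)
{
    assert(width_bits > 0 && width_bits % 8 == 0 && channels > 0);
    return mega_transfers * 1e6 * (width_bits / 8) * channels / 1e9;
}
```

Under these assumptions `peak_bandwidth_gbs(333.0, 128, 1)` is about 5.3 and `peak_bandwidth_gbs(800.0, 64, 3)` is 19.2, a ratio of roughly 3.6 in favour of the Xeon's memory controller.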
41 Significance of TLB Unit in overruling microprocessor performance?
42 Significance of TLB Unit in overruling microprocessor performance The Xeon TLB unit is ahead of the Opteron's, hence in general the TLB statistics of the Xeon should be better than those of the Opteron. But from our results we found the following points:
1st Point - Xeon Instruction TLB (ITLB) statistics are poor compared to the Opteron.
2nd Point - When the memory access workload increases, the Xeon TLB statistics become poor compared to the Opteron.
3rd Point - The TLB miss penalty is overruled by a better IMC.
43 Significance of TLB Unit in overruling microprocessor performance cont
Translation Lookaside Buffers (TLB) Of The Microprocessors

Dual-Core AMD Opteron 2220 SE
Number of levels of TLB | 2
L1 Data TLB for 4-KByte pages | 32 entries
L1 Data TLB for large pages (2MB/4MB) | 8 entries
L1 Instruction TLB for 4-KByte pages | 32 entries
L1 Instruction TLB for large pages (2MB/4MB) | 8 entries
L1 TLB Associativity (ways) | fully
L2 TLB for 4-KByte pages | 512 entries
L2 TLB Associativity (ways) | 4

Quad Core Intel Xeon E5506
Number of levels of TLB | 2
DTLB0 for 4-KByte pages | 64 entries
DTLB0 for large pages (2MB/4MB) | 32 entries
ITLB for 4-KByte pages | 64 entries
ITLB for large pages | 7 entries
DTLB0 / ITLB Associativity (ways) | 4
STLB for 4-KByte pages | 512 entries (services both data and instruction look-ups)
STLB Associativity (ways) | 4
A DTLB0 miss and STLB hit causes a penalty of | 7 cycles
The delays associated with a miss to the STLB and Page Miss Handler (PMH) | largely non-blocking
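One way to read the entry counts on this slide is through TLB reach, the amount of memory a TLB level can map at once (entries times page size). The slides do not compute this; it is a standard back-of-the-envelope measure, sketched here with a hypothetical helper name and assuming 4 KB pages throughout.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of address space that a TLB level with `entries` entries can
 * map at once when every page is `page_bytes` bytes. */
size_t tlb_reach_bytes(size_t entries, size_t page_bytes)
{
    assert(entries > 0 && page_bytes > 0);
    return entries * page_bytes;
}
```

With 4 KB pages this gives 128 KB for the Opteron's 32-entry L1 data TLB, 256 KB for the Xeon's 64-entry DTLB0, and 2 MB for the 512-entry second level on both processors. The benchmark working sets reported in the deck (6Mb, 25Mb and 93Mb RAM) all exceed 2 MB, so second-level TLB misses are to be expected on both machines once the workload grows.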
44 Significance of TLB Unit in overruling microprocessor performance cont...
Computational Intensive benchmark, 100% processor, 6Mb RAM
[Figures, Opteron vs Xeon: Data translation lookaside buffer misses (DTLB), Instruction translation lookaside buffer misses (ITLB), Total translation lookaside buffer misses (TLB), and Total Wall clock time in seconds for all Threads]
45 Significance of TLB Unit in overruling microprocessor cont..
Computational and largely memory intensive benchmark (500 x 500 matrix multiplication), 25Mb RAM utilization
[Figures, Opteron vs Xeon, per thread (Thread 1 to Thread 4): 500x500 Data translation lookaside buffer misses (DTLB), Instruction translation lookaside buffer misses (ITLB) and Total translation lookaside buffer misses (TLB); plus 500x500 Total Wall clock time in seconds for all Threads]
46 Significance of TLB Unit in overruling microprocessor cont..
Computational and largely memory intensive benchmark (1000 x 1000 matrix multiplication), 93Mb RAM utilization
[Figures, Opteron vs Xeon, per thread (Thread 1 to Thread 4): 1000x1000 Data translation lookaside buffer misses (DTLB), Instruction translation lookaside buffer misses (ITLB) and Total translation lookaside buffer misses (TLB); plus 1000x1000 Total Wall clock time in seconds for all Threads]
47 Significance of Branch Prediction Unit in overruling microprocessor performance?
48 Significance of Branch Prediction Unit in overruling microprocessor performance From the hardware performance counter statistics obtained for the Branch Prediction Unit of the two server processors, we found the following two points:
1st Point - The Xeon Branch Prediction Unit is optimized for, and works better when, the source code actually contains branching if statements. It does not work as well when the source code contains no if statements (only for loops).
2nd Point - Branch mis-predictions have little impact (or no impact at all) on the final application execution throughput.
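The two kinds of source code contrasted in the 1st point can be pictured with a pair of small kernels. These are hypothetical illustrations, not the authors' benchmarks: the first contains only a for loop, the second adds a data-dependent if in the loop body, similar in spirit to the parity test in the Test 2 code segment.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-only kernel: the only conditional branch is the loop test. */
double sum_loop_only(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Kernel with a data-dependent `if` in the loop body: elements whose
 * integer part is even are added, the others are subtracted. */
double sum_with_branch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (((long)a[i]) % 2 == 0)
            s += a[i];
        else
            s -= a[i];
    }
    return s;
}
```

When kernels like these are used for measurement, it is worth inspecting the generated assembly, because an optimizing compiler may replace the if with a conditional move or vectorized select, in which case no extra conditional branch is executed at all.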
49 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving Above 1st point
[Figures, Opteron vs Xeon, Conditional branch instructions mispredicted, for three benchmarks whose source code contains branching if statements: System call intensive benchmark (by number of requests); Computational intensive benchmark; Computational and largely memory intensive benchmark (500 x 500 matrix multiplication with extra computation performed), per thread (Thread1 to Thread4)]
50 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving Above 1st point
[Figures, Opteron vs Xeon, Conditional branch instructions mispredicted, for two benchmarks whose source code contains no if statements (only for loops): Simple memory access intensive benchmark, for array sizes from 32KB to 12MB; Computational and largely memory intensive benchmark (500 x 500 matrix multiplication), per thread (Thread 1 to Thread 4)]
51 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving Above 2nd point
Computational intensive benchmark (source code contains branching if statements)
[Figures, Opteron vs Xeon: Conditional branch instructions mispredicted; Total Wall clock time in seconds for all Threads]
52 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving Above 2nd point cont
Computational and largely memory intensive benchmark (500 x 500 matrix multiplication) (source code contains no if statements, only for loops)
[Figures, Opteron vs Xeon: 500x500 Conditional branch instructions mispredicted, per thread (Thread 1 to Thread 4); 500x500 Total Wall clock time in seconds for all Threads]
53 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving Above 2nd point cont
Computational and largely memory intensive benchmark (1000 x 1000 matrix multiplication) (source code contains no if statements, only for loops)
[Figures, Opteron vs Xeon: 1000x1000 Conditional branch instructions mispredicted, per thread (Thread 1 to Thread 4); 1000x1000 Total Wall clock time in seconds for all Threads]
54 Drawbacks / Limitations / Future work
- Only L1 and L2 cache misses were compared, but the Xeon processor has an L3 cache, which has obvious advantages in memory access latency compared with the external memory accesses that follow L2 cache misses on the Opteron processor.
- Cache sizes and latencies are considered in this work, but associativity is not; it is another parameter to look into.
- All the performance metrics presented in this work require further explanation and justification through further architectural analysis of why one processor outperforms the other, i.e. relating the results to the architectural features of the microprocessors.
55 Drawbacks / Limitations / Future work One could say that industry-standard benchmarks (such as SPEC2006) should be used rather than hand-written ones, since the former are accepted by the industry and research community, or that I should at least explain why my hand-written ones are better. Different benchmarks may lead to different conclusions about the two processors, hence the benchmarks themselves are expected to be accepted as credible.
56 Drawbacks / Limitations / Future work Most significantly, this work claims that single-processor multi-core architectures are better than multi-processor multi-core architectures, based only on the small set of experiments that I have performed in this research. This conclusion should be stated as limited to my hand-written benchmarks.
57 Conclusion & Contribution The findings of this work show that in most application domains the 'Single processor Quad core Intel Xeon' gives better performance statistics than the 'Dual processor Dual core AMD Opteron', in terms of both the 'Total Wall Clock Cycles taken for the execution' and the 'Total Wall Clock Time taken for the execution (Effective Performance)'. The evaluated single-processor quad-core UMA architecture performs better than the equivalent dual-processor dual-core NUMA architecture. Hence I conclude that the evaluated single-processor multi-core server architecture performs better than the equivalent multi-processor multi-core server architecture in handling different workloads.
58 Conclusion & Contribution This work provides a simple set of generic benchmarks which could be used to evaluate hybrid heterogeneous systems, with detailed hardware performance monitoring counter statistics. The research and its published results benefit server processor architects/ designers, as we have provided a detailed set of performance metrics and statistics, and a detailed analysis of the performance of different software domains on these server processor architectures. Using the published results, the hand-written custom benchmarks and the detailed analysis, processor architects and designers could revisit and re-evaluate their designs to see why their processor architectures perform well on some application domains and not on others.
59 Paper Publications Done
- W.M.R. Weerasuriya and D.N. Ranasinghe, Older Opteron Outperforms the Newer Xeon: A Memory Intensive Application Study of Server Based Microprocessors, 21st International Conference on Systems Engineering (ICSEng2011), Las Vegas, NV, USA.
- W.M.R. Weerasuriya and D.N. Ranasinghe, A Comparative Performance Evaluation of Multi-Processor Multi-Core Server Processor Architectures on Enterprise Middleware Performance, 3rd APSIPA ASC 2011, Xi'an, China.
- W.M.R. Weerasuriya and D.N. Ranasinghe, Performance Analysis of System Call Intensive Software Application Execution on Server Processor Architectures: Opteron and Xeon, 2nd International Conference on Emerging Trends in Engineering and Technology (IETET-2011), Kurukshetra (Haryana), India.
60 Thank you
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy