A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *
Hsin-Ta Chiao and Shyan-Ming Yuan
Department of Computer and Information Science
National Chiao Tung University
1001 Ta Hsueh Rd., Hsinchu 300, Taiwan

Abstract: Due to its moderate overhead and small quantization error, the time-stamping counter is currently the most precise time-measuring mechanism on the Intel 80X86-based platform. On the Pentium processors, a conventional null benchmark can accurately determine the time-stamping counter's overhead. On the Pentium Pro processors, Intel recommends the same method for measuring this overhead. However, because the influence of the measured operation is neglected, we find that this method becomes inaccurate on Pentium Pro. Therefore, in this paper, we propose a new method for determining the overhead of the Pentium Pro's time-stamping counter, and we provide empirical results to confirm the feasibility of our idea.

Keywords: time measurement, time-stamping counters, Pentium Pro processors, out-of-order execution

1 Introduction

Elapsed time is one of the most important performance metrics. Traditionally, a programmer uses the timer and programming interface offered by an operating system to measure an operation's elapsed time. However, most operating system timers are not precise enough. For instance, the resolution of the system timers offered by most UNIX operating systems is about 16.67 ms ~ 20 ms [Dow93]. Besides, retrieving the operating system timer requires executing many program instructions, which makes the measurement even less accurate. Consequently, most modern processors have begun to build clock cycle counters into themselves, such as the Alpha processors [Com98], the PowerPC processors [Mor97], and the Intel Pentium and Pentium Pro processors. A processor clock cycle counter is very precise.
Its resolution is equal to the processor's clock cycle time. In addition, reading its content takes only one or a few instructions. Hence, processor clock cycle counters have become the best choice for measuring the elapsed time of short operations.

The time-stamping counter [Int97a][Sha97] is the processor clock cycle counter built into Intel 80X86 processors since the Pentium. It automatically counts how many clock cycles have elapsed since the processor was initialized or reset. The time-stamping counter itself is a 64-bit register, and it can be read by the RDTSC instruction. This instruction copies the counter's high-order 32 bits to the general-purpose register EDX, and its low-order 32 bits to EAX. For reading the time-stamping counter in C programs, we define a C language macro that contains inline assembly code - the GetCycleCount macro in figure 1. Whether this macro can be used in a user-mode program depends on the settings of the operating system. For example, Windows NT makes the RDTSC instruction available in user mode; hence, no extra kernel-mode service routines or kernel-mode drivers are required. This greatly reduces the overhead of reading the time-stamping counter.

#define GetCycleCount(cycle_count) asm { \
    asm LEA EDI, cycle_count \
    asm RDTSC \
    asm MOV [EDI], EAX \
    asm MOV [EDI + 4], EDX \
}

Figure 1: The GetCycleCount macro for the Pentium processors

Once the elapsed time of an operation is measured, the next issue we must address is how to determine the time-stamping counter's overhead. This issue is quite important for measuring short operations. For example, while tuning an intra-process locking mechanism [Tsa97] on Windows NT, we had to measure the overheads of both its untuned version and its fine-tuned version. Moreover, we also had to measure the overheads of other thread synchronization mechanisms offered by Windows NT, such as critical section objects, mutex objects, and semaphore objects [Mic98].
On our test platform, the overheads of these synchronization operations are about 80 ~ 2000 clock cycles. These operations are quite short, especially once we realize how significant the overhead of the time-stamping counter can be. Hence, if we intend to measure these operations' overheads accurately, we also have to precisely determine the extra overhead introduced by the time-stamping counter itself. Conventionally, the null benchmark [Int97b] shown in figure 2 is employed to determine the overhead introduced by a pair of GetCycleCount macros. (In the remainder of this paper, we also refer to this overhead as T access timer.) Since the Pentium is just an in-order, dual-issue superscalar processor, the null benchmark can accurately determine the T access timer of the Pentium's time-stamping counter. This method is also recommended by Intel [Int97b] for determining the T access timer of the Pentium Pro's time-stamping counter. However, due to its out-of-order execution capability, we discover that the null benchmark becomes inaccurate on Pentium Pro. This motivates us to propose a new method for determining the T access timer of the Pentium Pro's time-stamping counter.

* This work was supported by the National Science Council grants NSC E and NSC E.

int i;
int average_overhead = 0;
unsigned __int64 begin_cycle_count;  /* 64-bit: the macro stores EDX:EAX */
unsigned __int64 end_cycle_count;
for (i = 0; i < N; i++) {
    GetCycleCount(begin_cycle_count);
    GetCycleCount(end_cycle_count);
    average_overhead += (int)(end_cycle_count - begin_cycle_count);
}
average_overhead /= N;

Figure 2: A null benchmark for measuring the average T access timer of the GetCycleCount macro

2 The new GetCycleCount macro for Pentium Pro

The Pentium Pro is an out-of-order, superscalar processor; i.e., the instructions' execution order may differ from the original program order. Hence, suppose we use the macro in figure 1 to measure the elapsed time of an operation. Let I1 denote the instructions executed between the two RDTSC instructions of the macros, and let I2 denote the instructions originally enclosed by them in the program. On a Pentium Pro processor, I1 and I2 may be different. To solve this problem, we modify the macro in figure 1 by inserting the CPUID instruction before the RDTSC instruction. The reworked macro is shown in figure 3.

#define GetCycleCount(cycle_count) asm { \
    asm LEA EDI, cycle_count \
    asm MOV EAX, 1 \
    asm CPUID \
    asm RDTSC \
    asm MOV [EDI], EAX \
    asm MOV [EDI + 4], EDX \
}

Figure 3: The GetCycleCount macro for the Pentium Pro processors

The CPUID instruction is a serializing instruction. If it is present in an instruction sequence, the instructions after it are delayed until all instructions before it have finished. Suppose we use the macro in figure 3 to measure the elapsed time of an operation.
The two CPUID instructions will confine the whole operation's instructions to being executed between them. Hence, on Pentium Pro, the macros in figure 3 prevent the instruction-ordering problem.

3 The Problem of the null benchmark

Although serializing instructions prevent the problem introduced by out-of-order instruction execution, they also have a great impact on processor efficiency. Before analyzing this new issue, we should describe how the Pentium Pro's pipeline operates. This pipeline can be partitioned into the following three segments [Int97c][Sha97]:

- In-order-issue front end: The pipeline stages in this segment are responsible for fetching macro instructions (the instructions belonging to the Intel 80X86 instruction set), decoding them, and translating each of them into one or more Pentium Pro micro-ops. Finally, these micro-ops are stored in a circular buffer called the reorder buffer.
- Out-of-order core: It executes the micro-ops stored in the reorder buffer in an out-of-order fashion. After a micro-op has finished executing, its result is also recorded in the corresponding fields of the reorder buffer.
- In-order retirement unit: It is responsible for writing back the results of the micro-ops that have finished execution. This is called the retirement of micro-ops. Since retiring micro-ops must obey the original program order, a micro-op that has completed execution cannot be retired until all micro-ops that precede it have been retired.

After a serializing instruction enters the reorder buffer, the micro-ops after it will not be issued to the out-of-order core until all the micro-ops before it have been executed and retired. Therefore, a serializing instruction flushes and then restarts the latter half of the Pentium Pro's pipeline. This severely degrades the Pentium Pro's processor efficiency.
Because a CPUID instruction's overhead is mainly incurred by flushing the latter half of the Pentium Pro's pipeline, it is affected by the operation being measured. Consequently, the T access timer of the GetCycleCount macro is also influenced by the measured operation. Since the null benchmark cannot reflect this kind of influence, it is unable to accurately determine the T access timer of the Pentium Pro's time-stamping counter. This motivates us to design the new method described below.

4 Our new approach

On Pentium Pro, the GetCycleCount macro's round-trip overhead T r can be divided into the following three portions:

- T before RDTSC: Assume the CPUID instruction's influence on processor efficiency is ignored. This portion of T r represents the time interval from when the macro has started until the RDTSC instruction has retrieved the content of the time-stamping counter.
- T after RDTSC: Under the same assumption as for T before RDTSC, this portion of T r represents the time interval from when the RDTSC instruction has read the time-stamping counter until the macro has finished.
- Pipeline-flush overhead: Because the CPUID instruction in the macro flushes the latter half of the Pentium Pro's pipeline, all instructions after it are delayed. This portion of T r denotes that effect.

When a pair of GetCycleCount macros is used to measure an operation's elapsed time, the T access timer can be expressed as follows:

T access timer = T after RDTSC of the first macro + T before RDTSC of the second macro + pipeline-flush overhead of the second macro

Because both T before RDTSC and T after RDTSC are independent of the measured operation, we can ignore which macro they belong to. On the other hand, the pipeline-flush overhead of the second macro depends only on the operation being measured. Consequently, we can rewrite the T access timer as follows:

T access timer = T after RDTSC + T before RDTSC + pipeline-flush overhead of the measured operation

Now we have enough background to discuss our new method for determining the T access timer. As mentioned previously, it depends on the operation being measured. For an operation Op, to determine the corresponding T access timer, we conduct the following two-phase measurement.

Figure 4: Enclosing the operation NOp with two GetCycleCount macros

In the first phase, we define the operation NOp as repeating the operation Op N times by copying its source. As shown in figure 4, we use two macros to measure the elapsed time of NOp, and refer to it as T non-interleaved.

T non-interleaved = the real elapsed time of executing Op for N times + T after RDTSC + T before RDTSC + pipeline-flush overhead of NOp

As depicted in figure 5, in the second phase, we first interleave N copies of the operation Op with (N + 1) macros.
Then, we measure the elapsed time T interleaved between the first and the last macros.

T interleaved = the real elapsed time of executing Op for N times + N * (T after RDTSC + T before RDTSC + pipeline-flush overhead of Op)

Figure 5: Interleaving N copies of the operation Op with (N + 1) GetCycleCount macros

As mentioned above, the pipeline-flush overhead depends only on the measured operation. However, we have to state this property more precisely. In fact, it depends only on the several instructions before the CPUID instruction, not on the whole operation. Hence, if the measured operation is not extremely short, we can assert that the pipeline-flush overhead of NOp is equal to the pipeline-flush overhead of Op. From this assumption,

T interleaved - T non-interleaved = (N - 1) * (T after RDTSC + T before RDTSC + pipeline-flush overhead of Op) = (N - 1) * T access timer

Therefore,

T access timer = (T interleaved - T non-interleaved) / (N - 1)

5 Experiments for our method

In this section, we use our new method to perform several experiments. The first purpose of these experiments is to explore the relationship between the measured operation and the T access timer of the Pentium Pro's time-stamping counter. The second purpose is to show why our method is better than the null benchmark. Finally, the third purpose is to justify the feasibility of our method.

On Pentium Pro, the CPUID instruction's overhead dominates the time-stamping counter's T access timer. This overhead may be influenced by the following properties of the operation being measured:

- The type of the operation.
- The length of the operation.
- Whether the pipeline scheduling capability is enabled at compile time.
- Which compiler option set is specified. (The debug option set is for developing programs. On the other hand, the release option set is for releasing programs.)

Figure 6: This figure shows the GetCycleCount macro's T access timer for the C operations. These operations are: (a) quick sort on an integer array, (b) quick sort on a floating-point array, (c) integer matrix multiplication, and (d) floating-point matrix multiplication.

For analyzing the relationship between the T access timer and the above four factors, we design the experiments below. We select seven types of operations, each with a representative instruction mix. They can be further divided into two categories. The operations in the first category contain no serializing instructions:

- Quick sort on an integer array.
- Quick sort on a floating-point array.
- Integer matrix multiplication.
- Floating-point matrix multiplication.
- Memory duplication.

In contrast, all operations of the second category contain some serializing instructions:

- Acquiring and releasing a simple spin lock.
- Entering and leaving a Windows NT critical section.

Both of the latter are for thread synchronization, and they employ several atomic read-modify-write instructions [Sha97] for this purpose. These are the source of the serializing instructions. Among the seven operations, the memory duplication operation and the two thread synchronization operations are written in 80X86 assembly language. The others are written in C.

In order to understand how pipeline scheduling and the compiler option set affect the T access timer, we produce three versions of binary code for each C operation. The first version is the normal binary code. When compiling it, we use the release option set and enable the pipeline scheduling capability.
The second version is compiled with the pipeline scheduling capability disabled; all other compiler options are identical to those of the normal version. Finally, the third version is the debug binary code: here we employ the debug option set and enable the pipeline scheduling capability. We conduct the experiments on a 266 MHz Pentium II machine running Windows NT 4.0. The experimental results of the C operations are shown in figures 6(a) through 6(d). The results of the assembly operations are depicted in figures 7(a) through 7(c). These figures also display how the T access timer is influenced by an operation's length. Besides, for comparison, we also use the null benchmark to measure the T access timer. Its result is very stable, fixed at 112 clock cycles. From these results, we discover that the upper bound of our method's T access timer is 42.1% larger than the null benchmark's, and the lower bound is 8.8% smaller. Therefore, we can see how significantly the null benchmark's T access timer may deviate from its real value. This confirms that the null benchmark is an inappropriate method for determining the time-stamping counter's T access timer on Pentium Pro.
Figure 7: This figure shows the GetCycleCount macro's T access timer for the assembly operations. These operations are: (a) memory duplication, (b) simple spin lock, and (c) Windows NT critical section objects.

The second problem we intend to address is how the previous four factors affect the time-stamping counter's T access timer. Unfortunately, we discover that these factors are interdependent. Consequently, it is impossible to clearly identify the causal relationship between them and the T access timer. Instead, we state the following two properties observed from the experimental results.

First, altering the compiler option set has a greater impact on the T access timer than disabling the pipeline scheduling capability, especially for operations in which floating-point instructions dominate the instruction mix. On Pentium Pro, the pipeline scheduling capability increases the throughput of delivering micro-ops to the reorder buffer. However, for floating-point operations, the performance bottleneck is the floating-point function units inside the out-of-order core, and increasing the throughput of delivering micro-ops will not alleviate this bottleneck. If we change the compiler option set from the release option set to the debug option set, on the other hand, some optimization options are disabled. This changes the instruction count and instruction mix of an operation's binary code, and alters its behavior dramatically. Hence, altering the compiler option set affects the T access timer more significantly.

Second, the T access timer is influenced if and only if the variation in an operation's length also changes the behavior of the operation's tail. We give two examples here. The first is the floating-point operations. As we have stated above, for a floating-point operation, the performance bottleneck is the floating-point function units.
If its length is varied, the floating-point function units will still be the bottleneck, and remain as busy as before. Since the behavior of the operation's tail remains similar, the T access timer is also insensitive to the variation in its length. This property can be verified in both figure 6(b) and figure 6(d). Another example is the memory duplication operation, which uses the memory-copy instruction to duplicate a memory block. By modifying the size of the memory block to be copied, we can control the operation's length. If we enlarge the memory block, the operation simply increases the repeat count of the memory-copy instruction. Therefore, the behavior of its tail is always the same, and the T access timer also remains stable. This can be confirmed in figure 7(a).

The third topic we address is the convergence of the T access timer measured by our method. For all operations containing no serializing instructions, the T access timer becomes a fixed value after just one or two measurements; consequently, the average T access timer converges very quickly. For all operations containing serializing instructions, the T access timer also enters a stable condition after a few measurements. Under this stable condition, the T access timer varies in a periodic pattern, and its value is bounded within a fixed interval. The relationship between an operation's length and the corresponding fixed interval is depicted in figure 8.
Figure 8: When the GetCycleCount macro's T access timer is measured with an operation that contains serializing instructions, it eventually converges to a bounded range. This figure shows this range for the following two operations: (a) simple spin locks, and (b) Windows NT critical section objects.

Because the variation of the T access timer is periodic and bounded, the average T access timer also converges, although more slowly than for the previous kinds of operations. In summary, since the T access timer always converges in our experiments, we can confirm that our method for determining the T access timer is a practical approach.

All of the above discussion is restricted to the Pentium Pro processors. To date, the available processors in the Intel P6 family are the Pentium Pro, Pentium II [Sha97], and Pentium III [Int99]. There are only a few minor differences between them; the cores of these processors share the same dynamic-execution micro-architecture. Hence, our method for determining the T access timer is also suitable for all processors in the Intel P6 family.

6 Conclusion

In this paper, we have presented how to use the time-stamping counter to measure an operation's elapsed time on both the Pentium and the Pentium Pro processors. The overhead of the Pentium's time-stamping counter can be determined by a null benchmark, and this simple method produces accurate results. However, on Pentium Pro, the null benchmark cannot account for how the measured operation affects the time-stamping counter's overhead, so we proposed a new method to solve this problem. Furthermore, we conducted several experiments to confirm our idea.
The experimental results show that the overhead measured by the null benchmark may significantly deviate from its real value. Moreover, all the overheads measured by our method are stable and converge. Hence, we conclude that our method is a practical approach.

References:
[Com98] Compaq, Alpha Architecture Handbook, Compaq Computer Corporation.
[Dow93] K. Dowd, High Performance Computing, O'Reilly & Associates, Inc.
[Int97a] Intel, Intel Architecture Software Developer's Manual, Vol. 3: System Programming Guide, Intel Corporation.
[Int97b] Intel, Using the RDTSC Instruction for Performance Monitoring, Pentium II Processor Application Notes, RDTSCPM1.HTM, Intel Corporation.
[Int97c] Intel, Intel Architecture Optimization Manual, Intel Corporation.
[Int99] Intel, Pentium III Processor at 450 MHz, 500 MHz, and 550 MHz Datasheet, Intel Corporation.
[Mic98] Microsoft, Platform SDK Windows Base Services, MSDN Library for Visual Studio 6.0, Microsoft Corporation.
[Mor97] Motorola, PowerPC Microprocessor Family: The Programming Environments, Motorola, Inc.
[Sha97] T. Shanley, Pentium Pro and Pentium II System Architecture, Addison-Wesley.
[Tsa97] W. Tsay, The Design and Implementation of an IIPC Locking Facility, Master Thesis, Institute of Computer Science and Information Engineering, National Chiao Tung University, Taiwan.
More informationA Lost Cycles Analysis for Performance Prediction using High-Level Synthesis
A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,
More informationA Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative
More informationComputer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue
More informationAdvanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More information15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15
More informationRECENTLY, researches on gigabit wireless personal area
146 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 55, NO. 2, FEBRUARY 2008 An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications Yuan Chen, Student Member, IEEE,
More informationLecture 26: Parallel Processing. Spring 2018 Jason Tang
Lecture 26: Parallel Processing Spring 2018 Jason Tang 1 Topics Static multiple issue pipelines Dynamic multiple issue pipelines Hardware multithreading 2 Taxonomy of Parallel Architectures Flynn categories:
More informationA Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function
A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao
More informationELC4438: Embedded System Design Embedded Processor
ELC4438: Embedded System Design Embedded Processor Liang Dong Electrical and Computer Engineering Baylor University 1. Processor Architecture General PC Von Neumann Architecture a.k.a. Princeton Architecture
More informationCS 31: Introduction to Computer Systems : Threads & Synchronization April 16-18, 2019
CS 31: Introduction to Computer Systems 22-23: Threads & Synchronization April 16-18, 2019 Making Programs Run Faster We all like how fast computers are In the old days (1980 s - 2005): Algorithm too slow?
More informationPrinciples. Performance Tuning. Examples. Amdahl s Law: Only Bottlenecks Matter. Original Enhanced = Speedup. Original Enhanced.
Principles Performance Tuning CS 27 Don t optimize your code o Your program might be fast enough already o Machines are getting faster and cheaper every year o Memory is getting denser and cheaper every
More informationIntroduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras
Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 04 Lecture 17 CPU Context Switching Hello. In this video
More informationImproving Http-Server Performance by Adapted Multithreading
Improving Http-Server Performance by Adapted Multithreading Jörg Keller LG Technische Informatik II FernUniversität Hagen 58084 Hagen, Germany email: joerg.keller@fernuni-hagen.de Olaf Monien Thilo Lardon
More informationCS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines
CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per
More informationImproved Address-Space Switching on Pentium. Processors by Transparently Multiplexing User. Address Spaces. Jochen Liedtke
Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces Jochen Liedtke GMD German National Research Center for Information Technology jochen.liedtkegmd.de
More informationDistributed Scheduling for the Sombrero Single Address Space Distributed Operating System
Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.
More informationWhat is Superscalar? CSCI 4717 Computer Architecture. Why the drive toward Superscalar? What is Superscalar? (continued) In class exercise
CSCI 4717/5717 Computer Architecture Topic: Instruction Level Parallelism Reading: Stallings, Chapter 14 What is Superscalar? A machine designed to improve the performance of the execution of scalar instructions.
More informationAdvanced Processor Architecture
Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong
More informationAn Overview of MIPS Multi-Threading. White Paper
Public Imagination Technologies An Overview of MIPS Multi-Threading White Paper Copyright Imagination Technologies Limited. All Rights Reserved. This document is Public. This publication contains proprietary
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationIA-32 Architecture COE 205. Computer Organization and Assembly Language. Computer Engineering Department
IA-32 Architecture COE 205 Computer Organization and Assembly Language Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Basic Computer Organization Intel
More informationBest Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.
IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development
More informationA MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO. Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin
A MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin Dept. of Electronics Engineering and Center for Telecommunications Research National Chiao
More informationInterprocess Communication By: Kaushik Vaghani
Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the
More informationAssembly Language for Intel-Based Computers, 4 th Edition. Kip R. Irvine. Chapter 2: IA-32 Processor Architecture
Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Chapter Overview General Concepts IA-32 Processor Architecture IA-32 Memory Management Components
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationAssembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit
Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit Slides prepared by Kip R. Irvine Revision date: 09/25/2002
More informationIntroduction to Computer Systems: Semester 1 Computer Architecture
Introduction to Computer Systems: Semester 1 Computer Architecture Fall 2003 William J. Taffe using modified lecture slides of Randal E. Bryant Topics: Theme Five great realities of computer systems How
More informationPerformance Evaluation. December 2, 1999
15-213 Performance Evaluation December 2, 1999 Topics Getting accurate measurements Amdahl s Law class29.ppt Time on a Computer System real (wall clock) time = user time (time executing instructing instructions
More informationCSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review
CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, 2003 Review 1 Overview 1.1 The definition, objectives and evolution of operating system An operating system exploits and manages
More informationHardware and Software Architecture. Chapter 2
Hardware and Software Architecture Chapter 2 1 Basic Components The x86 processor communicates with main memory and I/O devices via buses Data bus for transferring data Address bus for the address of a
More information15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15
More informationComputer Systems A Programmer s Perspective 1 (Beta Draft)
Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationAssembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture. Chapter Overview.
Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Slides prepared by Kip R. Irvine Revision date: 09/25/2002 Chapter corrections (Web) Printing
More informationDatapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale
Datapoint 2200 IA-32 Nicholas FitzRoy-Dale At the forefront of the computer revolution - Intel Difficult to explain and impossible to love - Hennessy and Patterson! Released 1970! 2K shift register main
More informationTHE orthogonal frequency-division multiplex (OFDM)
26 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 57, NO. 1, JANUARY 2010 A Generalized Mixed-Radix Algorithm for Memory-Based FFT Processors Chen-Fong Hsiao, Yuan Chen, Member, IEEE,
More informationCSC Operating Systems Spring Lecture - XII Midterm Review. Tevfik Ko!ar. Louisiana State University. March 4 th, 2008.
CSC 4103 - Operating Systems Spring 2008 Lecture - XII Midterm Review Tevfik Ko!ar Louisiana State University March 4 th, 2008 1 I/O Structure After I/O starts, control returns to user program only upon
More informationChecker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India
Advanced Department of Computer Science Indian Institute of Technology New Delhi, India Outline Introduction Advanced 1 Introduction 2 Checker Pipeline Checking Mechanism 3 Advanced Core Checker L1 Failure
More informationstructural RTL for mov ra, rb Answer:- (Page 164) Virtualians Social Network Prepared by: Irfan Khan
Solved Subjective Midterm Papers For Preparation of Midterm Exam Two approaches for control unit. Answer:- (Page 150) Additionally, there are two different approaches to the control unit design; it can
More informationChapter 3. Pipelining. EE511 In-Cheol Park, KAIST
Chapter 3. Pipelining EE511 In-Cheol Park, KAIST Terminology Pipeline stage Throughput Pipeline register Ideal speedup Assume The stages are perfectly balanced No overhead on pipeline registers Speedup
More informationChapter 4. MARIE: An Introduction to a Simple Computer 4.8 MARIE 4.8 MARIE A Discussion on Decoding
4.8 MARIE This is the MARIE architecture shown graphically. Chapter 4 MARIE: An Introduction to a Simple Computer 2 4.8 MARIE MARIE s Full Instruction Set A computer s control unit keeps things synchronized,
More informationCHAPTER 8: CPU and Memory Design, Enhancement, and Implementation
CHAPTER 8: CPU and Memory Design, Enhancement, and Implementation The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 5th Edition, Irv Englander John
More information1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722
Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict
More informationCS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07
CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as
More informationSISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:
SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs
More informationNewbie s Guide to AVR Interrupts
Newbie s Guide to AVR Interrupts Dean Camera March 15, 2015 ********** Text Dean Camera, 2013. All rights reserved. This document may be freely distributed without payment to the author, provided that
More informationTime Measurement Nov 4, 2009"
Time Measurement Nov 4, 2009" Reminder" 2! Computer Time Scales" Microscopic Time Scale (1 Ghz Machine) Macroscopic Integer Add FP Multiply FP Divide Keystroke Interrupt Handler Disk Access Screen Refresh
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationCS4961 Parallel Programming. Lecture 2: Introduction to Parallel Algorithms 8/31/10. Mary Hall August 26, Homework 1, cont.
Parallel Programming Lecture 2: Introduction to Parallel Algorithms Mary Hall August 26, 2010 1 Homework 1 Due 10:00 PM, Wed., Sept. 1 To submit your homework: - Submit a PDF file - Use the handin program
More informationSchool of Computer and Information Science
School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationEEM336 Microprocessors I. The Microprocessor and Its Architecture
EEM336 Microprocessors I The Microprocessor and Its Architecture Introduction This chapter presents the microprocessor as a programmable device by first looking at its internal programming model and then
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationUsing implicit fitness functions for genetic algorithm-based agent scheduling
Using implicit fitness functions for genetic algorithm-based agent scheduling Sankaran Prashanth, Daniel Andresen Department of Computing and Information Sciences Kansas State University Manhattan, KS
More informationMICROPROCESSOR TECHNOLOGY
MICROPROCESSOR TECHNOLOGY Assis. Prof. Hossam El-Din Moustafa Lecture 17 Ch.8 The Pentium and Pentium Pro Microprocessors 21-Apr-15 1 Chapter Objectives Contrast the Pentium and Pentium Pro with the 80386
More informationField Analysis. Last time Exploit encapsulation to improve memory system performance
Field Analysis Last time Exploit encapsulation to improve memory system performance This time Exploit encapsulation to simplify analysis Two uses of field analysis Escape analysis Object inlining April
More informationConcurrent & Distributed Systems Supervision Exercises
Concurrent & Distributed Systems Supervision Exercises Stephen Kell Stephen.Kell@cl.cam.ac.uk November 9, 2009 These exercises are intended to cover all the main points of understanding in the lecture
More informationUniversity of Toronto Faculty of Applied Science and Engineering
Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science
More informationModeling and Simulating Discrete Event Systems in Metropolis
Modeling and Simulating Discrete Event Systems in Metropolis Guang Yang EECS 290N Report December 15, 2004 University of California at Berkeley Berkeley, CA, 94720, USA guyang@eecs.berkeley.edu Abstract
More informationSimultaneous Multithreading Architecture
Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.
More information