A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

Hsin-Ta Chiao and Shyan-Ming Yuan
Department of Computer and Information Science, National Chiao Tung University
1001 Ta Hsueh Rd., Hsinchu 300, Taiwan

Abstract: Due to its moderate overhead and small quantization error, the time-stamping counter is currently the most precise time-measuring mechanism on Intel 80X86-based platforms. On the Pentium processors, we can simply use a conventional null benchmark to determine the time-stamping counter's overhead accurately. Similarly, on the Pentium Pro processors, Intel also recommends the same method for measuring this overhead. However, because the influence of the measured operation is neglected, we find that this method becomes inaccurate on the Pentium Pro. Therefore, in this paper, we propose a new method for determining the overhead of the Pentium Pro's time-stamping counter. Furthermore, we provide empirical results to confirm the feasibility of our idea.

Keywords: time measurement, time-stamping counters, Pentium Pro processors, out-of-order execution

1 Introduction

Elapsed time is one of the most important performance metrics. Traditionally, a programmer uses the timer and the programming interface offered by an operating system to measure an operation's elapsed time. However, most operating system timers are not precise enough. For instance, the resolution of the system timers offered by most UNIX operating systems is about 16.67 ms ~ 20 ms [Dow93]. Besides, retrieving the operating system timer requires executing many program instructions, which makes such timers even more inaccurate. Consequently, most modern processors now include a processor clock cycle counter, such as the Alpha processors [Com98], the PowerPC processors [Mor97], and the Intel Pentium and Pentium Pro processors. A processor clock cycle counter is very precise: its resolution is equal to the processor's clock cycle time. In addition, reading its content needs only one or a few instructions. Hence, processor clock cycle counters have become the best choice for measuring the elapsed time of short operations.

The time-stamping counter [Int97a][Sha97] is the processor clock cycle counter built into Intel 80X86 processors since the Pentium. It automatically counts how many clock cycles have elapsed since the processor was initialized or reset. The time-stamping counter itself is a 64-bit register, and can be read by the RDTSC instruction. This instruction copies the counter's high-order 32 bits to the general-purpose register EDX, and its low-order 32 bits to EAX. For reading the time-stamping counter in C programs, we define a C language macro that contains inline assembly code - the GetCycleCount macro in figure 1. Whether this macro can be used in a user-mode program depends on the settings of the operating system. For example, Windows NT makes the RDTSC instruction available in user mode; hence, no extra kernel-mode service routines or kernel-mode drivers are required. This greatly reduces the overhead of reading the time-stamping counter.

    #define GetCycleCount(cycle_count) asm { \
        asm LEA EDI, cycle_count  \
        asm RDTSC                 \
        asm MOV [EDI], EAX        \
        asm MOV [EDI + 4], EDX    \
    }

Figure 1: The GetCycleCount macro for the Pentium processors

Once an operation's elapsed time can be measured, the next issue we should address is how to determine the time-stamping counter's overhead. This issue is quite important for measuring short operations.
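To make the basic measurement pattern concrete, the following is a minimal sketch, not taken from the paper, of how a pair of GetCycleCount invocations from figure 1 might bracket an operation. The names operation_under_test and timer_overhead are hypothetical placeholders, and the sketch assumes a compiler that accepts the same inline-assembly dialect and the __int64 type.

    /* Sketch only: timing one operation with a pair of GetCycleCount macros.
       operation_under_test() and timer_overhead are hypothetical placeholders. */
    void measure_one_operation(void)
    {
        __int64 begin_cycle_count, end_cycle_count, elapsed_cycles;

        GetCycleCount(begin_cycle_count);   /* read the counter before the operation */
        operation_under_test();             /* the operation whose elapsed time we want */
        GetCycleCount(end_cycle_count);     /* read the counter after the operation */

        /* Subtract the macro pair's own overhead, whose determination
           is the topic of this paper. */
        elapsed_cycles = (end_cycle_count - begin_cycle_count) - timer_overhead;
    }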
For example, while tuning an intra-process locking mechanism [Tsa97] on Windows NT, we had to measure the overheads of both its untuned version and its fine-tuned version. Moreover, we also had to measure the overheads of other thread synchronization mechanisms offered by Windows NT, such as critical section objects, mutex objects, and semaphore objects [Mic98]. On our test platform, the overheads of these synchronization operations are about 80 ~ 2000 clock cycles. These operations are quite short, especially once we realize how significant the overhead of the time-stamping counter can be. Hence, if we intend to measure these operations' overheads accurately, we also have to precisely determine the extra overhead that may be introduced by the time-stamping counter itself.

Conventionally, the null benchmark [Int97b] shown in figure 2 is employed to determine the overhead introduced by a pair of GetCycleCount macros. (In the remaining part of this paper, we also refer to this overhead as T_access_timer.) Since the Pentium is just an in-order, dual-issue superscalar processor, the null benchmark can accurately determine the T_access_timer of the Pentium's time-stamping counter.

* This work was supported by the National Science Council grants NSC E and NSC E.

This method is also recommended by Intel [Int97b] for determining the T_access_timer of the Pentium Pro's time-stamping counter. However, due to its out-of-order execution capability, we discover that the null benchmark becomes inaccurate on the Pentium Pro. This motivates us to propose a new method for determining the T_access_timer of the Pentium Pro's time-stamping counter.

    int i;
    __int64 average_overhead = 0;
    __int64 begin_cycle_count;
    __int64 end_cycle_count;

    for (i = 0; i < N; i++) {
        GetCycleCount(begin_cycle_count);
        GetCycleCount(end_cycle_count);
        average_overhead += end_cycle_count - begin_cycle_count;
    }
    average_overhead /= N;

Figure 2: A null benchmark for measuring the average T_access_timer of the GetCycleCount macro

2 The new macro for Pentium Pro

The Pentium Pro is an out-of-order, superscalar processor; i.e., the instructions' execution order may differ from the original program order. Hence, suppose we use the macro in figure 1 to measure the elapsed time of an operation. Let I1 denote the instructions actually executed between the two RDTSC instructions of the macros, and let I2 denote the instructions originally enclosed by them in the program. On a Pentium Pro processor, I1 and I2 may be different. To solve this problem, we modify the macro in figure 1 by inserting the CPUID instruction before the RDTSC instruction. The reworked macro is shown in figure 3.

    #define GetCycleCount(cycle_count) asm { \
        asm LEA EDI, cycle_count  \
        asm MOV EAX, 1            \
        asm CPUID                 \
        asm RDTSC                 \
        asm MOV [EDI], EAX        \
        asm MOV [EDI + 4], EDX    \
    }

Figure 3: The GetCycleCount macro for the Pentium Pro processors

The CPUID instruction is a serializing instruction: if it is present in an instruction sequence, the instructions after it are delayed until all instructions before it have finished. Suppose we use the macro in figure 3 to measure the elapsed time of an operation. The two CPUID instructions confine the whole operation's instructions to being executed between them. Hence, on the Pentium Pro, the macros in figure 3 prevent this instruction-ordering problem.
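As a side note not found in the paper, the same serialized read of the time-stamping counter can be expressed with GCC-style extended inline assembly for compilers that do not accept the dialect used in figure 3; this is only an illustrative sketch of the CPUID + RDTSC sequence, under the assumption of a GCC-compatible 32-bit x86 compiler.

    /* Sketch only: serialized read of the time-stamping counter (GCC-style). */
    static inline unsigned long long get_cycle_count_serialized(void)
    {
        unsigned int lo, hi;
        /* CPUID (with EAX = 1) serializes: RDTSC does not execute until all
           previously issued instructions have completed. */
        __asm__ __volatile__(
            "movl $1, %%eax\n\t"
            "cpuid\n\t"
            "rdtsc"
            : "=a" (lo), "=d" (hi)   /* RDTSC returns the counter in EDX:EAX */
            :
            : "%ebx", "%ecx");       /* CPUID also overwrites EBX and ECX */
        return ((unsigned long long)hi << 32) | lo;
    }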
3 The problem of the null benchmark

Although serializing instructions prevent the problem introduced by out-of-order instruction execution, they also have a great impact on processor efficiency. Before analyzing this new issue, we should describe how the Pentium Pro's pipeline operates. The pipeline can be partitioned into the following three segments [Int97c][Sha97]:

In-order-issue front end: The pipeline stages in this segment are responsible for fetching macro instructions (instructions belonging to the Intel 80X86 instruction set), decoding them, and translating each of them into one or more Pentium Pro micro-ops. Finally, these micro-ops are stored in a circular buffer called the reorder buffer.

Out-of-order core: It executes the micro-ops stored in the reorder buffer in an out-of-order fashion. After a micro-op has finished its execution, the result is also recorded in the corresponding fields of the reorder buffer.

In-order retirement unit: It is responsible for writing back the results of the micro-ops that have finished their execution. This is called the retirement of micro-ops. Since retiring micro-ops must obey the original program order, a micro-op that has completed execution cannot be retired until all micro-ops that precede it have been retired.

After a serializing instruction enters the reorder buffer, the micro-ops after it will not be issued to the out-of-order core until all the micro-ops before it have been executed and retired. Therefore, a serializing instruction flushes and then restarts the latter half of the Pentium Pro's pipeline, which severely degrades the Pentium Pro's efficiency. Because a CPUID instruction's overhead is mainly incurred by flushing the latter half of the Pentium Pro's pipeline, it is affected by the operation being measured. Consequently, the T_access_timer of the GetCycleCount macro is also influenced by the measured operation. Since the null benchmark cannot reflect this kind of influence, it is unable to accurately determine the T_access_timer of the Pentium Pro's time-stamping counter. This motivates us to design the new method described below.

4 Our new approach

On the Pentium Pro, the GetCycleCount macro's round-trip overhead T_r can be further divided into the following three portions:

T_before_RDTSC: Assume the CPUID instruction's influence on processor efficiency is ignored. This portion of T_r represents the time interval from when the macro has started until the RDTSC instruction has retrieved the content of the time-stamping counter.

T_after_RDTSC: Under the same assumption as T_before_RDTSC, this portion of T_r represents the time interval from when the RDTSC instruction has read the time-stamping counter until the macro has finished.

Pipeline-flush overhead: Because the CPUID instruction in the macro flushes the latter half of the Pentium Pro's pipeline, all instructions after it are delayed. This portion of T_r denotes that effect.

When using a pair of GetCycleCount macros to measure an operation's elapsed time, the pair's T_access_timer can be expressed as follows:

T_access_timer = T_after_RDTSC of the first macro + T_before_RDTSC of the second macro + pipeline-flush overhead of the second macro

Because both T_before_RDTSC and T_after_RDTSC are independent of the measured operation, we can ignore which macro they belong to. On the other hand, the pipeline-flush overhead of the second macro depends only on the operation being measured. Consequently, we can rewrite the T_access_timer as follows:

T_access_timer = T_after_RDTSC + T_before_RDTSC + pipeline-flush overhead of the measured operation

Now we have enough background to discuss our new method for determining the T_access_timer. As mentioned previously, it depends on the operation being measured. For an operation Op, if we intend to determine the corresponding T_access_timer, we have to conduct the following two-phase measurement.

Figure 4: Enclosing the operation NOp with two GetCycleCount macros (the measured interval is T_non-interleaved)

In the first phase, we define the operation NOp as repeating the operation Op N times by copying its source code. As shown in figure 4, we use two GetCycleCount macros to measure the elapsed time of NOp, and refer to it as T_non-interleaved.

T_non-interleaved = the real elapsed time of executing Op N times + T_after_RDTSC + T_before_RDTSC + pipeline-flush overhead of NOp

As depicted in figure 5, in the second phase, we first interleave N copies of the operation Op with (N + 1) GetCycleCount macros. Then, we measure the elapsed time T_interleaved between the first and the last macros.

T_interleaved = the real elapsed time of executing Op N times + N * (T_after_RDTSC + T_before_RDTSC + pipeline-flush overhead of Op)

Figure 5: Interleaving N copies of the operation Op with (N + 1) GetCycleCount macros (the measured interval is T_interleaved)

As mentioned above, the pipeline-flush overhead depends only on the measured operation. However, we have to state this property more precisely: in fact, it depends only on the several instructions before the CPUID instruction, not on the whole operation. Hence, if the measured operation is not extremely short, we can assert that the pipeline-flush overhead of NOp is equal to the pipeline-flush overhead of Op. From this assumption,

T_interleaved - T_non-interleaved = (N - 1) * (T_after_RDTSC + T_before_RDTSC + pipeline-flush overhead of Op) = (N - 1) * T_access_timer

Therefore,

T_access_timer = (T_interleaved - T_non-interleaved) / (N - 1)
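The following is a minimal sketch, not taken from the paper, of how the two-phase measurement might be coded. It assumes the Pentium Pro GetCycleCount macro from figure 3 is in scope; Op() is a hypothetical placeholder for the operation being measured, and N is the chosen repetition count. The paper builds NOp by literally copying Op's source N times; a loop is used here for brevity, and its per-iteration overhead appears in both phases, so it cancels in the difference.

    #define N 64                /* hypothetical repetition count */
    #define Op()                /* hypothetical: the operation being measured */

    __int64 determine_t_access_timer(void)
    {
        __int64 begin, end, first, last;
        __int64 t_non_interleaved, t_interleaved;
        int i;

        /* Phase 1 (figure 4): two macros enclosing N back-to-back copies of Op. */
        GetCycleCount(begin);
        for (i = 0; i < N; i++) {
            Op();
        }
        GetCycleCount(end);
        t_non_interleaved = end - begin;

        /* Phase 2 (figure 5): N copies of Op interleaved with (N + 1) macros;
           only the first and the last readings are kept. */
        GetCycleCount(first);
        for (i = 0; i < N; i++) {
            Op();
            GetCycleCount(last);
        }
        t_interleaved = last - first;

        /* T_access_timer = (T_interleaved - T_non-interleaved) / (N - 1) */
        return (t_interleaved - t_non_interleaved) / (N - 1);
    }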
5 Experiments for our method

In this section, we use our new method to perform several experiments. The first purpose of these experiments is to explore the relationship between the measured operation and the T_access_timer of the Pentium Pro's time-stamping counter. The second purpose is to show why our method is better than the null benchmark. Finally, the third purpose is to justify the feasibility of our method.

On the Pentium Pro, the CPUID instruction's overhead dominates the time-stamping counter's T_access_timer. This overhead may be influenced by the following properties of the operation being measured:

- The type of the operation.
- The length of the operation.
- When we compile it, is the pipeline scheduling capability enabled?
- Which compiler option set is specified? (The debug option set is for developing programs; the release option set is for releasing programs.)

Figure 6: The GetCycleCount macro's T_access_timer for the C operations: (a) quick sort on an integer array, (b) quick sort on a floating-point array, (c) integer matrix multiplication, and (d) floating-point matrix multiplication. (Axes: overhead in clock cycles versus the number of array elements to be sorted in (a) and (b), and versus the size of the matrix in (c) and (d).)

For analyzing the relationship between the T_access_timer and the above four factors, we design the experiments below. We select seven types of operations, each with a representative instruction mix. They can be further divided into two categories. The operations belonging to the first category contain no serializing instructions:

- Quick sort on an integer array.
- Quick sort on a floating-point array.
- Integer matrix multiplication.
- Floating-point matrix multiplication.
- Memory duplication.

In contrast, all operations of the second category contain some serializing instructions:

- Acquiring and releasing a simple spin lock.
- Entering and leaving a Windows NT critical section.

Both of the latter are thread synchronization operations and employ several atomic read-modify-write instructions [Sha97] for this purpose; these are the source of the serializing instructions. Among the seven operations, the memory duplication operation and the two thread synchronization operations are written in 80X86 assembly language; the others are written in C.

In order to understand how pipeline scheduling and the compiler option set affect the T_access_timer, we produce three versions of binary code for each C operation. The first version is the normal binary code: when compiling it, we use the release option set and enable the pipeline scheduling capability. The second version is the binary code with pipeline scheduling disabled; all other compiler options are identical to those of the normal version. Finally, the third version is the debug binary code: here we employ the debug option set and enable the pipeline scheduling capability.

We conduct the experiments on a 266 MHz Pentium II machine, and the operating system is Windows NT 4.0. The experimental results of the C operations are shown in figures 6(a)-6(d). The results of the assembly operations are depicted in figures 7(a)-7(c). These figures also display how the T_access_timer is influenced by an operation's length. In addition, for comparison, we also use the null benchmark to measure the T_access_timer; its result is very stable, fixed at 112 clock cycles. From these results, we discover that the upper bound of our method's T_access_timer is 42.1% larger than the null benchmark's, and the lower bound is 8.8% smaller. Therefore, we can see how significantly the null benchmark's T_access_timer may deviate from its real value. This confirms that the null benchmark is an inappropriate method for determining the time-stamping counter's T_access_timer on the Pentium Pro.

Figure 7: The GetCycleCount macro's T_access_timer for the assembly operations: (a) memory duplication (x-axis: number of memory blocks to be copied), (b) simple spin lock (x-axis: times for acquiring and releasing a spin lock), and (c) Windows NT critical section objects (x-axis: times for entering and leaving a critical section).

The second problem we intend to address is how the previous four factors affect the time-stamping counter's T_access_timer. Unfortunately, we discover that these factors are interdependent. Consequently, it is impossible to clearly identify the causal relationship between the T_access_timer and each of them. Instead, we state the following two properties observed from the experimental results.

First, altering the compiler option set has a greater impact on the T_access_timer than disabling the pipeline scheduling capability, especially for operations in which floating-point instructions dominate the instruction mix. On the Pentium Pro, the pipeline scheduling capability increases the throughput of delivering micro-ops to the reorder buffer. However, for floating-point operations, the performance bottleneck is the floating-point function units inside the out-of-order core, and increasing the throughput of delivering micro-ops does not alleviate this bottleneck. On the other hand, if we change the compiler option set from the release option set to the debug option set, some optimization options are disabled. This changes the instruction count and instruction mix of an operation's binary code and can alter its behavior dramatically. Hence, altering the compiler option set affects the T_access_timer more significantly.

Second, the T_access_timer is influenced if and only if the variation in an operation's length also changes the behavior of the operation's tail. We give two examples here. The first is the floating-point operations. As stated above, for a floating-point operation, the performance bottleneck is the floating-point function units. If its length is varied, the floating-point function units are still the bottleneck, and as busy as before. Since the behavior of the operation's tail remains similar, the T_access_timer is also insensitive to the variation in its length. This property can be verified in both figure 6(b) and figure 6(d). Another example is the memory duplication operation, which uses the memory-copy instruction to duplicate a memory block. By modifying the size of the memory block to be copied, we can control its length. If we enlarge the memory block, this operation just increases the repeat count of the memory-copy instruction. Therefore, the behavior of its tail is always the same, and the T_access_timer also remains stable. This can be confirmed in figure 7(a).

Next, the third topic we address is the convergence of the T_access_timer measured by our method. For all operations containing no serializing instructions, the T_access_timer becomes a fixed value after just one or two measurements; consequently, the average T_access_timer converges very quickly. For all operations containing serializing instructions, the T_access_timer also enters a stable condition after just a few measurements. Under this stable condition, the T_access_timer varies in a periodic pattern, and its value is bounded within a fixed interval. The relationship between an operation's length and the corresponding fixed interval is depicted in figure 8.

Figure 8: When the GetCycleCount macro's T_access_timer is measured with an operation that contains serializing instructions, it eventually converges to a bounded range (measured in clock cycles). This figure shows that range for (a) simple spin locks (x-axis: times for acquiring and releasing a spin lock) and (b) Windows NT critical section objects (x-axis: times for entering and leaving a critical section).

Because the variation of the T_access_timer is periodic and bounded, the average T_access_timer also converges. However, the convergence is slower than for the previous kind of operations. In summary, since the T_access_timer always converges in our experiments, we can confirm that our method for determining the T_access_timer is a practical approach.

All of the above discussion in this section is restricted to the Pentium Pro processors. To date, the available processors belonging to the Intel P6 family are the Pentium Pro, the Pentium II [Sha97], and the Pentium III [Int99]. There are only a few minor differences between them; their cores are the same dynamic-execution micro-architecture. Hence, our method for determining the T_access_timer is also suitable for all processors belonging to the Intel P6 family.

6 Conclusion

In this paper, we have presented how to use the time-stamping counter to measure an operation's elapsed time on both the Pentium and the Pentium Pro processors. The overhead of the Pentium's time-stamping counter can be determined by a null benchmark, and this simple method produces accurate results. However, on the Pentium Pro, the null benchmark does not consider how the measured operation affects the time-stamping counter's overhead, so we proposed a new method to solve this problem. Furthermore, we conducted several experiments to confirm our idea. The experimental results show that the overhead measured by the null benchmark may deviate significantly from its real value, whereas all the overheads measured by our method are stable and converge. Hence, we conclude that our method is a practical approach.

References

[Com98] Compaq, Alpha Architecture Handbook, Compaq Computer Corporation.
[Dow93] K. Dowd, High Performance Computing, O'Reilly & Associates, Inc.
[Int97a] Intel, Intel Architecture Software Developer's Manual, Vol. 3: System Programming Guide, Intel Corporation.
[Int97b] Intel, Using the RDTSC Instruction for Performance Monitoring, Pentium II Processor Application Notes, Intel Corporation, RDTSCPM1.HTM.
[Int97c] Intel, Intel Architecture Optimization Manual, Intel Corporation.
[Int99] Intel, Pentium III Processor at 450 MHz, 500 MHz, and 550 MHz Datasheet, Intel Corporation.
[Mic98] Microsoft, Platform SDK: Windows Base Services, MSDN Library for Visual Studio 6.0, Microsoft Corporation.
[Mor97] Motorola, PowerPC Microprocessor Family: The Programming Environments, Motorola, Inc.
[Sha97] T. Shanley, Pentium Pro and Pentium II System Architecture, Addison-Wesley.
[Tsa97] W. Tsay, The Design and Implementation of an IIPC Locking Facility, Master Thesis, Institute of Computer Science and Information Engineering, National Chiao Tung University, Taiwan.
