APPLYING JAVA ON SINGLE-CHIP MULTIPROCESSORS

Antonio C. S. Beck F., Júlio C. B. Mattos, Luigi Carro
Instituto de Informática
Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
{caco, julius, carro}

ABSTRACT

This paper presents first studies on the implementation of Java on Single-Chip Multiprocessors (CMP). A CMP instantiates multiple copies of the same processor core on a single die, where the cores communicate with each other at high speed. Moreover, it can be implemented as a diverse set of cores that share the same instruction set but target different performance levels and power budgets. Depending on the requirements of the applications at a given time, the core that best meets those requirements while minimizing energy consumption is selected. Extending this idea, we apply Java on a CMP with different cores that execute Java bytecodes. We present four different Java cores and first results concerning the tradeoffs in performance, power and area, showing the potential of this design exploration.

1. INTRODUCTION

The embedded system market is expanding, and the research and production of specific processors used inside cellular phones, mp3 players, digital cameras, microwaves, videogames, printers and other appliances is following the same growth path [1]. Moreover, dictated by market needs, the complexity of these embedded systems is increasing as well, since they offer more and more functions to the user, such as Internet access, color displays, audio and video reproduction and videogames [2]. Hence, these systems must have enough processing power to handle such complex tasks.

At the same time, Java is becoming increasingly popular in embedded environments. It is estimated that the number of devices with embedded Java, such as cellular phones, PDAs and pagers, will grow from 176 million in 2001 to 721 million in 2005 [3]. Moreover, it is predicted that at least 80 percent of mobile phones will support Java by 2006 [4]. The presence of Java in embedded systems is therefore becoming significant: current design goals should include a careful look at embedded Java processors, and their performance versus power tradeoffs must be taken into account.

Therefore, while still sustaining high performance, present-day embedded systems must also have low power dissipation and rely on a large software library to cope with stringent design times. Consequently, there is a clear need for architectures that can support all the software development effort currently required. One of these alternatives is the use of Chip Multiprocessors (CMP) [5]. Basically, a CMP instantiates multiple copies of the same processor core on a single die, communicating with each other at high speed. These processors usually have private L1 caches and a shared L2 cache. In [6], first studies about an extension of this technique were presented, implementing a diverse set of cores with the same instruction set that target different performance levels and consume different amounts of power. Depending on the application requirements at a given time, the core that best meets those requirements while minimizing energy consumption is selected. When there are idle processors on the CMP, they are turned off, saving both dynamic and static power. Furthermore, thread allocation offers many degrees of freedom, depending on energy consumption or real-time constraints: threads can be allocated as soon as possible, taking the first processor that can handle them at the moment, or wait in a queue until a processor that exactly fits their requirements becomes available.
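As a rough illustration of such an allocation policy (a sketch under our own assumptions: the Core fields and the selection rule below are hypothetical, not the mechanism of [6]), a scheduler could pick, among the idle cores that are fast enough for a thread, the one drawing the least power:

    import java.util.Arrays;
    import java.util.List;

    // Illustrative sketch only: fields and policy are our assumptions.
    class Core {
        boolean idle;     // powered off / available for a new thread
        int mips;         // performance level this core can deliver
        double watts;     // power drawn when the core is active

        Core(boolean idle, int mips, double watts) {
            this.idle = idle;
            this.mips = mips;
            this.watts = watts;
        }
    }

    public class CoreSelect {
        // Among the idle cores fast enough for the thread, pick the one
        // that draws the least power; null means the thread waits in a queue.
        static Core select(List<Core> cores, int requiredMips) {
            Core best = null;
            for (Core c : cores) {
                if (c.idle && c.mips >= requiredMips
                        && (best == null || c.watts < best.watts)) {
                    best = c;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            List<Core> cmp = Arrays.asList(
                new Core(true, 50, 20.0),    // simple, low-power core
                new Core(true, 200, 90.0),   // complex, high-power core
                new Core(false, 200, 90.0)); // busy core
            Core chosen = select(cmp, 100);
            System.out.println(chosen == null ? "enqueue" : chosen.mips + " MIPS core");
        }
    }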
Extending this idea, we present first studies on applying Java to a CMP with different cores that execute Java bytecodes. The use of Java brings additional advantages, such as the native support for threads and the synchronization mechanism among them. This facilitates software development, since the designer does not need to worry about explicitly partitioning the software to run on the CMP, and it also makes the development of the CMP easier, because the Java compiler takes charge of the implicit synchronization among threads. Moreover, the designer can take advantage of the traditional Java characteristics: it is object oriented, which facilitates the modeling of the system; it is multiplatform; and it is a widely used language.
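As an illustration of this native support, the minimal, self-contained sketch below (plain Java, not code from our system) runs two threads over a shared counter; the synchronized methods provide the mutual exclusion:

    // Two threads update a shared counter using only standard Java:
    // thread creation and the synchronized keyword, no CMP-specific code.
    class Counter {
        private int value = 0;
        synchronized void increment() { value++; }
        synchronized int get() { return value; }
    }

    public class ThreadDemo {
        public static void main(String[] args) throws InterruptedException {
            Counter counter = new Counter();
            Runnable task = () -> {
                for (int i = 0; i < 100000; i++) counter.increment();
            };
            Thread t1 = new Thread(task);
            Thread t2 = new Thread(task);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(counter.get()); // always 200000
        }
    }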

This paper is organized as follows: Section 2 gives a brief review of related work on CMP. Section 3 presents the Java architectures evaluated. Section 4 shows first results regarding Java processors on a CMP. The last section draws conclusions and introduces future work.

2. RELATED WORK

The concept of the Chip Multiprocessor was first presented in [5], where multiple copies of the same processor core are implemented on a single die, communicating with each other at high speed. In [7], further studies about its feasibility were carried out, comparing the CMP technique to alternatives such as superscalar and SMT (Simultaneous Multithreading) processors. That work showed that the CMP is the best choice to exploit the large transistor counts that will be available in the future. CMP has many advantages: high clock frequencies can be achieved, since the design is made of many relatively simple processors, instead of complicated structures as in SMT, with long, high-capacitance wires and huge register banks; and the design can be kept simple, because increasing performance is just a matter of putting more processors on the die. As a consequence, validation and test become easier as well.

The idea of implementing different cores and using them depending on the requirements of the application was proposed in [6]. In that work, four different Alpha cores were implemented on a CMP. By switching applications between cores, meaningful savings in energy consumption were achieved with a relatively small area overhead, considering the complete CMP system with the most complex Alpha core.

Several studies have explored other aspects of CMP. In [9], the thermal behavior of SMT and CMP architectures is explored. In [10], simple enhancements to the basic CMP architecture are investigated in order to increase performance using speculative precomputation. In [11], a hybrid architecture called Parallel On-Chip Simultaneous Multithreaded Processor (POSM), which mixes the advantages of SMT and CMP, is proposed, together with a detailed study of the area and performance of these architectures using the MIPS R10000 as the base processor. As can be observed, CMP mainly exploits thread-level parallelism, instead of the instruction-level parallelism exploited by superscalar architectures, or both, as SMT does. The trend, however, is to execute many threads in parallel, since embedded systems are getting more complex while a certain simplicity must be maintained to reduce the design cycle, as discussed before. Few implementations of CMP are available: the Piranha project [12], the MAJC [13] and Hydra [14] are examples. As none of them has explored the potential of Java on a CMP, we present first studies on using this language.

3. THE FEMTOJAVA PROCESSORS

The Femtojava processor [8] is a stack-based microcontroller that executes Java bytecodes. Its general characteristics are a reduced instruction set, a Harvard architecture and small size. This processor was designed specifically for the embedded system market. The first architecture evaluated is a multicycle version [8] that takes from three to fourteen cycles to execute an instruction; it is shown in Figure 1.

Figure 1. Multicycle Java Processor [8]
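To make concrete what a stack-based Java processor executes, the listing below shows the standard JVM bytecodes that javac produces for a simple assignment; all four Femtojava variants described next run this push/operate/store pattern directly:

    // Java source:  int a = b + c;   (b, c and a in local variable slots 1-3)
    //
    //   iload_1    // push local variable b onto the operand stack
    //   iload_2    // push local variable c onto the operand stack
    //   iadd       // pop both operands, push b + c
    //   istore_3   // pop the result into local variable a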
The second architecture is the pipelined version [15], which has five stages: instruction fetch, instruction decoding, operand fetch, execution, and write back, as shown in Figure 2. One of the main characteristics of this architecture is the presence of registers playing the role of the operand stack and of the local variable pool (used to keep the values of the local variables of a method). We call this architecture the Low-Power Femtojava. The first stage, instruction fetch, is composed of an instruction queue of 9 registers. The first instruction in the queue is sent to the instruction decoder stage. The decoder has four functions: to generate the control word for that instruction, to handle data dependencies, to analyze the forwarding possibilities, and to inform the instruction queue of the size of the current instruction, so that the next instruction of the stream is placed at the head of the queue. This is necessary because of the use of variable-length instructions: they can have one or two immediate operands, or none at all.

Operand fetch is done in a register bank of variable size, defined a priori in the early stages of the design. The operand stack and the local variable pool of the methods reside in this register bank. Two registers, SP (Stack Pointer) and VARS (VARiables Storage), point to the top of the stack and to the beginning of the local variable storage, respectively. Depending on the instruction, one of them is used as the base for the operand fetch. Once the operands are fetched, they are sent to the fourth stage, where the operation is executed. There is no branch prediction, in order to save area: all branches are predicted not taken, and if a branch is taken, a penalty of three cycles is paid. The write back stage saves, if necessary, the result of the execution stage back to the register bank, again using SP or VARS as the base. There is a unified register bank for the stack and the local variable pool, because this facilitates method call and return, taking advantage of the JVM specification, where each method is located by a frame pointer in the stack.

The Low-Power Java processor uses the forwarding technique [16], which brings an advantage over RISC-like processors: in instructions that manipulate the stack, operands forwarded to earlier stages will not be used anymore. As a consequence, there is no need to write these operands back to the stack. The result is a reduction in power consumption, because the number of writes to the stack is reduced. In this work we show a gain factor of 8 in energy consumption with a minimal area overhead, thanks to the use of the forwarding technique.

Figure 2. Low-Power Java Processor

The third architecture is the VLIW processor [17], an extension of the pipelined one. Basically, it has its functional units and instruction decoders replicated. The additional decoders do not support the instructions for method call and return, since these are always in the main flow. The local variable storage is placed only in the first register file; when instructions of the other flows need the value of a local variable, they must fetch it from the register bank of the main flow. Each instruction flow has its own operand stack, with fewer registers than the main operand stack, since the operand stacks of the secondary flows do not grow as much as the one of the main flow. The VLIW packet has a variable size, avoiding unnecessary memory accesses: a header in the first instruction of the word informs the instruction fetch controller how many instructions the current packet has. The search for ILP in the Java program is done at the bytecode level. The algorithm works as follows: all instructions that depend on the result of the previous one are grouped into an operand block. The entire Java program is divided into such groups, which can then be parallelized, respecting the functional unit constraints. For example, if there is just one multiplier, two instructions that use this functional unit cannot be executed in parallel.
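As a hypothetical illustration of this grouping (the bytecodes are standard, but the split is our own example, not output of the actual tool), two independent expressions form two operand blocks that can travel in the same VLIW packet:

    // int x = a + b;   int y = c << 2;   (independent dependency chains;
    // a, b, c, x, y in local variable slots 1-5)
    //
    // Operand block 1 (main flow):    Operand block 2 (second flow):
    //   iload_1   // push a             iload_3    // push c
    //   iload_2   // push b             iconst_2   // push 2
    //   iadd      // x = a + b          ishl       // y = c << 2
    //   istore 4                        istore 5
    //
    // Within a block each instruction consumes the result of the previous
    // one, so blocks are serial internally but parallel to each other.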
There are three basic versions of the VLIW processor: VLIW 2, VLIW 4 and VLIW 8, which can handle up to 2, 4 or 8 instructions per VLIW packet, respectively.

The last version is the Femtojava Low-Power coupled to a coarse-grain reconfigurable array. The use of a coarse-grain array ensures fast reconfiguration and low control overhead, saving energy in the reconfiguration task. The reconfigurable array is tightly coupled to the processor: it is implemented as a functional unit in the execution stage, using the same approach as Chimaera [18] and ConCISe [19]. Thanks to the coupling of dynamic binary translation (BT) [20], the detection of instruction sequences that will be executed in the reconfigurable array is done at run time. This binary translation of sequences of instructions into combinational logic is totally transparent to the software designer, allowing easy software porting to different machines, tracking technological evolutions. By using BT, one allows optimizations not only in critical loops, as recent works have proposed, but in all parts of the algorithm. Thus, our technique is not limited to the parallelism available in a single part of the application, since the amount of parallelism usually varies during execution [20].

The array is divided into blocks, called cells, shown in Figure 3. The operand block (a sequence of bytecodes) previously detected is fitted into one or more of these cells. The initial part of the cell is composed of three functional units (ALU, shifter, ld/st); after this first part, six identical parts follow in sequence. This number was chosen based on the delays presented in [21], which compares the delay of the Booth multiplier (used in our processor) with that of adders. Each cell of the array has just one multiplier and takes exactly one processor cycle to complete execution, being limited by its critical path, so it brings no delay overhead to the processor pipeline. At the end of each cell there are two additional functional units: a branch unit and an extended ld/st unit, built for the execution of the iastore instruction (which stores a value into an array in memory, taking the address from the stack), since this instruction needs three operands instead of the usual two.

For each cell in the array, 327 reconfiguration bits are necessary. Consequently, if the array is formed by 3 cells, 971 bits are necessary in the reconfiguration cache. To these one must add 58 bits of additional information, such as how many cycles the execution takes and the initial ROM address where the sequence is located, totaling 1029 bits for each configuration of the array. The array is reconfigured using the data of a cache specially designed for it. Each cell in the array has its own reconfiguration cache, so the reconfiguration of the cells can be done in parallel. While the program executes, when the address of a reconfigurable instruction group is found, the reconfigurable unit detector sends this information to the main processor. Its control unit then configures the array as the active functional unit, stalls the rest of the processor while the array performs its functions, and updates the Program Counter with the new address, in order to resume normal operation after the execution of that sequence of instructions by the array. At the same time, the configurations stored in the reconfiguration cache for that address are sent to the reconfigurable array, since they are indexed by addresses of the instruction memory. As the detection of the address that will be used in the reconfiguration is done in the first stage of the pipeline, and the reconfigurable array is in the fourth stage, there are 3 cycles available between the detection and the use of the array. As one cycle is necessary to find the cache line that holds the array configuration, two cycles are available for the reconfiguration itself.

The following example shows the potential gains of using a reconfigurable array in a Java machine. If there are five simple arithmetic or logic operations, at least 11 cycles are needed for their execution: 6 for pushing the operands onto the stack, plus 5 for the operations themselves. This is the optimal execution, without considering pipeline stalls due to data dependencies. The array, on the other hand, would execute everything in just one cycle, after the first time it finds this sequence and saves it in the reconfiguration cache. If this sequence is repeated a certain number of times, meaningful gains can be achieved.
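A concrete instance of this count, with an expression of our own choosing:

    // Sum six locals (slots 1-6); one valid stack schedule matching the
    // count in the text, leaving the sum on the stack:
    //   iload_1  iload_2  iload_3  iload 4  iload 5  iload 6   // 6 pushes
    //   iadd  iadd  iadd  iadd  iadd                           // 5 operations
    //
    // Pipeline: at least 6 + 5 = 11 cycles, one instruction per cycle.
    // Array: the whole chain fits in one cell and, once cached, retires
    // in a single cycle.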
For the execution, the first two operand values feed the first operation. After that, in each basic part of the cell, the first operand is always the result of the previous operation. This is another characteristic of stack machines, and it makes the control and the hardware of the array simpler. In each cell it is possible to perform 7 simple operations (arithmetic, logic, shift). If a multiplication is found, the result of the current cell is bypassed to its end, and the multiplication is executed in the next cell.

Figure 3. Architecture of the array

4. RESULTS

Table 1 shows the area occupied by the four different versions of our Java processors. It is important to note that the register bank of the Low-Power version (as well as the main flow of the VLIW one), used as stack and local variable pool, has 32 registers, the maximum required among all the applications (they are counted in the table, even though the FPGA's memory could be used for this purpose).

The version with the reconfigurable array can support up to 10 different reconfigurations, and its array is composed of 4 cells. The area was evaluated with Leonardo Spectrum for Windows [22] and is given in number of logic cells.

Table 1. Area of the VHDL Versions of the Architectures
  PROCESSOR      LOW POWER    VLIW (INSTR. PER PACKET)    REC. ARRAY X4
  Area (LCs)

For performance and power estimation, the tool used was a configurable compiled-code cycle-accurate simulator [23], which provided data on energy consumption, memory usage and performance. Its power estimation technique is comparable to the component-based approach used in [24]. Power dissipation is evaluated in terms of switching activity, and as the processor has separate instruction and data memories, we also included an evaluation module for the RAM and ROM memories, besides the register bank. This way, one can verify the relative power dissipation of the CPU, the instruction memory and the data memory. It is important to measure the impact of each of these blocks, so that one can better explore the design space.

To illustrate the results of the design exploration with different cores, we selected three of the architectures explained before: the multicycle, the low-power and a VLIW version with 2 instructions per packet, executing the following algorithms: the sine and the Inverse Modified Discrete Cosine Transform (IMDCT), with different degrees of loop unrolling. Table 2 presents the results in terms of performance, power and energy, at a frequency of 50 MHz and a Vdd of 3.3 V, for two different ways of computing the sine. The pipelined and VLIW architectures obviously provide the better cycle counts; the best result in terms of performance is the combination of the sine calculated as a simple table search with the VLIW architecture, but it is also the worst in terms of power. The best combination in terms of power, but the worst in terms of energy, is the sine using CORDIC on the multicycle core.

Table 2. Sine Characterization
                         CORDIC                     TABLE
  Characteristic         Multi  Pipeline  VLIW2     Multi  Pipeline  VLIW2
  Performance (cycles)
  Power (mW)
  Energy (nJ)

Tables 3, 4 and 5 show the main results of the characterization of the four different implementations of the IMDCT function, in which the loop unrolling technique was applied at different levels. The IMDCT3 implementation has the best results in terms of performance on all architectures, but the size of the program memory increases significantly. The opposite happens with the IMDCT implementation, which has far better results in terms of program memory, but consumes more cycles. In terms of power (again at 50 MHz and a Vdd of 3.3 V), the best combinations are IMDCT1 and IMDCT2 on the multicycle core, but these combinations do not have good results in terms of energy. Tables 4 and 5 show that the best results in terms of performance are obtained on the VLIW core, since the IMDCT routine has a lot of parallelism.

Table 3. IMDCT Characterization (software dependent)
  Characteristic          IMDCT    IMDCT1    IMDCT2    IMDCT3
  Program size (bytes)    344      2,137     4,260     15,294
  Data mem (bytes)

Table 4. IMDCT Characterization (hardware dependent)
                         IMDCT                      IMDCT1
  Characteristic         Multi  Pipeline  VLIW2     Multi  Pipeline  VLIW2
  Performance (cycles)
  Power (mW)
  Energy (µJ)

Table 5. IMDCT Characterization (hardware dependent, cont.)
                         IMDCT2                     IMDCT3
  Characteristic         Multi  Pipeline  VLIW2     Multi  Pipeline  VLIW2
  Performance (cycles)
  Power (mW)
  Energy (µJ)

As one can observe, depending on the requirements in area, power, energy consumption or performance, the designer can choose different configurations to meet the needs of an application in a CMP organization. This demonstrates that there are many alternatives for design exploration using a CMP with Java.
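As a note of ours tying the three reported metrics together (the numbers below are illustrative, not taken from the tables), energy follows from power and cycle count at the 50 MHz clock:

    % Sanity-check relation between the reported metrics:
    \[
      E \;=\; P \cdot t \;=\; P \cdot \frac{N_{\mathrm{cycles}}}{f},
      \qquad f = 50\,\mathrm{MHz}.
    \]
    % Example with illustrative numbers: P = 10 mW and N = 100000 cycles give
    % t = 100000 / (50 * 10^6) s = 2 ms and E = 10 mW * 2 ms = 20 uJ.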

5. CONCLUSION AND FUTURE WORK

We presented first studies on the implementation of a CMP architecture based on Java, obtaining different results in terms of area, performance and power consumption when executing the algorithms. Our next step is to build the hardware that makes communication among the Java threads possible. Besides, we intend to use binary translation to allocate parts of the code to different processors at run time, exploring the instruction-level parallelism of the application instead of just the parallelism among threads, as before. Moreover, we will study alternatives for the allocation of threads to the processors. This can be done dynamically, with a special Java processor executing a scheduler, or statically, where the applications are previously simulated on the Java architectures, considering performance and power consumption.

6. REFERENCES

[1] Schlett, M., Trends in Embedded-Microprocessor Design. In Computer, vol. 31, n. 8, 1998.
[2] Nokia N-GAGE Home Page, available at
[3] Takahashi, D., Java Chips Make a Comeback. In Red Herring, 2001.
[4] Lawton, G., Moving Java into Mobile Phones. In Computer, vol. 35, n. 6, 2002.
[5] Olukotun, K. et al., The Case for a Single-Chip Multiprocessor. In Proc. 7th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, 1996, pp. 2-11.
[6] Kumar, R., Farkas, K., Jouppi, N., Ranganathan, P., Tullsen, D., Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In 36th International Symposium on Microarchitecture (MICRO-36), Dec.
[7] Hammond, L., Nayfeh, B. A., Olukotun, K., A Single-Chip Multiprocessor. In IEEE Computer, Sep. 1997.
[8] Ito, S. A., Carro, L., Jacobi, R. P., Making Java Work for Microcontroller Applications. In IEEE Design & Test of Computers, vol. 18, n. 5, 2001.
[9] Li, Y., Skadron, K., Hu, Z., Brooks, D., Evaluating the Thermal Efficiency of SMT and CMP Architectures. In IBM Watson Conference on Interaction between Architecture, Circuits, and Compilers, external track, Oct.
[10] Brown, J., Wang, H., Chrysos, G., Wang, P., Shen, J., Speculative Precomputation on Chip Multiprocessors. In 6th Workshop on Multithreaded Execution, Architecture, and Compilation, Nov.
[11] Burns, J., Gaudiot, J., Area and System Clock Effects on SMT/CMP Processors. In Proc. 2001 International Conference on Parallel Architectures and Compilation Techniques, 2001.
[12] Barroso, L. A., Gharachorloo, K., McNamara, R., Nowatzyk, A., Qadeer, S., Sano, B., Smith, S., Stets, R., Verghese, B., Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In 27th International Symposium on Computer Architecture, 2000.
[13] Tremblay, M., Chan, J., Chaudhry, S., Conigliaro, A., Tse, S., The MAJC Architecture: A Synthesis of Parallelism and Scalability. In IEEE Micro, vol. 20, n. 6, 2000.
[14] Hammond, L., Hubbert, B., The Stanford Hydra CMP. In IEEE Micro, vol. 20, n. 2, Mar. 2000.
[15] Beck, A. C. S., Carro, L., Low Power Java Processor for Embedded Applications. In IFIP 12th International Conference on Very Large Scale Integration, Germany, December 2003.
[16] Hennessy, J. L., Patterson, D. A., Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, 2003.
[17] Beck, A. C. S., Carro, L., A VLIW Low Power Java Processor for Embedded Applications. In 17th Brazilian Symp. on Integrated Circuit Design (SBCCI 2004), Sep. 2004.
[18] Hauck, S., Fry, T., Hosler, M., Kao, J., The Chimaera Reconfigurable Functional Unit. In Proc. IEEE Symp. on FPGAs for Custom Computing Machines, Napa Valley, CA, 1997.
[19] Kastrup, B., Bink, A., Hoogerbrugge, J., ConCISe: A Compiler-Driven CPLD-Based Instruction Set Accelerator. In Proc. 7th Annual IEEE Symp. on Field-Programmable Custom Computing Machines, Napa Valley, CA, 1999.
[20] Gschwind, M., Altman, E., Sathaye, S., Ledak, P., Appenzeller, D., Dynamic and Transparent Binary Translation. In IEEE Computer, vol. 33, n. 3, 2000.
[21] Callaway, T. K., Swartzlander Jr., E. E., Power-Delay Characteristics of Multipliers. In IEEE 13th Symposium on Computer Arithmetic (ARITH 97), Mar. 1997.
[22] Leonardo Spectrum, available at
[23] Beck, A. C. S., Mattos, J. C. B., Wagner, F. R., Carro, L., CACO-PS: A General Purpose Cycle-Accurate Configurable Power Simulator. In 16th Brazilian Symp. on Integrated Circuit Design (SBCCI 2003), Sep. 2003.
[24] Chen, R., Irwin, M. J., Bajwa, R., Architecture-Level Power Estimation and Design Experiments. In ACM Transactions on Design Automation of Electronic Systems, vol. 6, n. 1, Jan. 2001, pp. 50-66.
