APPLYING JAVA ON SINGLE-CHIP MULTIPROCESSORS

Antonio C. S. Beck F., Júlio C. B. Mattos, Luigi Carro
Instituto de Informática
Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
{caco, julius, carro}

ABSTRACT

This paper presents first studies on the implementation of Java on Single-Chip Multiprocessors (CMP). A CMP instantiates multiple copies of the same processor core on a single die, where the cores communicate with each other at high speed. Moreover, it can be implemented as a diverse set of cores that share the same instruction set but target different performance levels and power budgets. Depending on the requirements of the applications at a given time, the core that best meets those requirements while minimizing energy consumption is selected. Extending this idea, we apply Java on a CMP with different cores that execute Java bytecodes. We present four different Java cores and first results concerning the tradeoffs in performance, power and area, showing the potential of this design exploration.

1. INTRODUCTION

The embedded system market is expanding, and the research and production of specific processors used inside cellular phones, mp3 players, digital cameras, microwaves, videogames, printers and other appliances is following the same growth path [1]. Moreover, dictated by market needs, the complexity of these embedded systems is increasing as well, since they offer more and more functions to the user, such as Internet access, color displays, audio and video reproduction and videogames [2]. Hence, these systems must have enough processing power to handle such complex tasks.

At the same time, Java is becoming increasingly popular in embedded environments. It is estimated that the number of devices with embedded Java, such as cellular phones, PDAs and pagers, will grow from 176 million in 2001 to 721 million in 2005 [3]. Moreover, it is predicted that at least 80 percent of mobile phones will support Java by 2006 [4]. The presence of Java in embedded systems is therefore becoming significant: current design goals should include a careful look at embedded Java processors, and their performance versus power tradeoffs must be taken into account.

Therefore, while still sustaining high performance, present-day embedded systems must also have low power dissipation and rely on a large software library to cope with stringent design times. Consequently, there is a clear need for architectures that can support all the software development effort currently required. One of these alternatives is the use of Chip Multiprocessors (CMP) [5]. Basically, a CMP instantiates multiple copies of the same processor core on a single die, communicating with each other at high speed. These processors usually have private L1 caches and a shared L2 cache. In [6], first studies about an extension of this technique were presented, implementing a diverse set of cores with the same instruction set that target different performance levels and consume different amounts of power. Depending on the application requirements at a given time, the core that best meets those requirements while minimizing energy consumption is selected. When there are idle processors on the CMP, they are turned off, saving both dynamic and static power. Furthermore, thread allocation offers many degrees of freedom, depending on energy consumption or real-time constraints: threads can be allocated as soon as possible, taking the first processor that can handle them at the moment, or wait in a queue until a processor that exactly fits their requirements becomes available.
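As a rough illustration of such an allocation policy (a sketch under our own assumptions: the Core fields and the selection rule below are hypothetical, not the mechanism of [6]), a scheduler could pick, among the idle cores that are fast enough for a thread, the one drawing the least power:

    import java.util.Arrays;
    import java.util.List;

    // Illustrative sketch only: fields and policy are our assumptions.
    class Core {
        boolean idle;     // powered off / available for a new thread
        int mips;         // performance level this core can deliver
        double watts;     // power drawn when the core is active

        Core(boolean idle, int mips, double watts) {
            this.idle = idle;
            this.mips = mips;
            this.watts = watts;
        }
    }

    public class CoreSelect {
        // Among the idle cores fast enough for the thread, pick the one
        // that draws the least power; null means the thread waits in a queue.
        static Core select(List<Core> cores, int requiredMips) {
            Core best = null;
            for (Core c : cores) {
                if (c.idle && c.mips >= requiredMips
                        && (best == null || c.watts < best.watts)) {
                    best = c;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            List<Core> cmp = Arrays.asList(
                new Core(true, 50, 20.0),    // simple, low-power core
                new Core(true, 200, 90.0),   // complex, high-power core
                new Core(false, 200, 90.0)); // busy core
            Core chosen = select(cmp, 100);
            System.out.println(chosen == null ? "enqueue" : chosen.mips + " MIPS core");
        }
    }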
Extending this idea, we present first studies on applying Java to a CMP with different cores that execute Java bytecodes. The use of Java brings additional advantages, such as the native support for threads and the synchronization mechanism among them. This facilitates software development, since the designer does not need to worry about explicitly partitioning the software to run on the CMP, and it also makes the development of the CMP easier, because the Java compiler takes charge of the implicit synchronization among threads. Moreover, the designer can take advantage of the traditional Java characteristics: it is object oriented, which facilitates the modeling of the system; it is multiplatform; and it is a widely used language.
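As an illustration of this native support, the minimal, self-contained sketch below (plain Java, not code from our system) runs two threads over a shared counter; the synchronized methods provide the mutual exclusion:

    // Two threads update a shared counter using only standard Java:
    // thread creation and the synchronized keyword, no CMP-specific code.
    class Counter {
        private int value = 0;
        synchronized void increment() { value++; }
        synchronized int get() { return value; }
    }

    public class ThreadDemo {
        public static void main(String[] args) throws InterruptedException {
            Counter counter = new Counter();
            Runnable task = () -> {
                for (int i = 0; i < 100000; i++) counter.increment();
            };
            Thread t1 = new Thread(task);
            Thread t2 = new Thread(task);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(counter.get()); // always 200000
        }
    }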

This paper is organized as follows: Section 2 gives a brief review of related work on CMP. Section 3 presents the Java architectures evaluated. Section 4 shows first results regarding Java processors on a CMP. The last section draws conclusions and introduces future work.

2. RELATED WORK

The concept of the Chip Multiprocessor was first presented in [5], where multiple copies of the same processor core are implemented on a single die, communicating with each other at high speed. In [7], further studies about its feasibility were carried out, comparing the CMP technique to alternatives such as superscalar and SMT (Simultaneous Multithreading) processors. That work showed that the CMP is the best choice to exploit the large transistor counts that will be available in the future. CMP has many advantages: high clock frequencies can be achieved, since the design is made of many relatively simple processors, instead of complicated structures as in SMT, with long, high-capacitance wires and huge register banks; and the design can be kept simple, because increasing performance is just a matter of putting more processors on the die. As a consequence, validation and test become easier as well.

The idea of implementing different cores and using them depending on the requirements of the application was proposed in [6]. In that work, four different Alpha cores were implemented on a CMP. By switching applications between cores, meaningful savings in energy consumption were achieved with a relatively small area overhead, considering the complete CMP system with the most complex Alpha core.

Several studies have explored other aspects of CMP. In [9], the thermal behavior of SMT and CMP architectures is explored. In [10], simple enhancements to the basic CMP architecture are investigated in order to increase performance using speculative precomputation. In [11], a hybrid architecture called Parallel On-Chip Simultaneous Multithreaded Processor (POSM), which mixes the advantages of SMT and CMP, is proposed, together with a detailed study of the area and performance of these architectures using the MIPS R10000 as the base processor. As can be observed, CMP mainly exploits thread-level parallelism, instead of the instruction-level parallelism exploited by superscalar architectures, or both, as SMT does. The trend, however, is to execute many threads in parallel, since embedded systems are getting more complex while a certain simplicity must be maintained to reduce the design cycle, as discussed before. Few implementations of CMP are available: the Piranha project [12], the MAJC [13] and Hydra [14] are examples. As none of them has explored the potential of Java on a CMP, we present first studies on using this language.

3. THE FEMTOJAVA PROCESSORS

The Femtojava processor [8] is a stack-based microcontroller that executes Java bytecodes. Its general characteristics are a reduced instruction set, a Harvard architecture and small size. This processor was designed specifically for the embedded system market. The first architecture evaluated is a multicycle version [8] that takes from three to fourteen cycles to execute an instruction; it is shown in Figure 1.

Figure 1. Multicycle Java Processor [8]
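To make concrete what a stack-based Java processor executes, the listing below shows the standard JVM bytecodes that javac produces for a simple assignment; all four Femtojava variants described next run this push/operate/store pattern directly:

    // Java source:  int a = b + c;   (b, c and a in local variable slots 1-3)
    //
    //   iload_1    // push local variable b onto the operand stack
    //   iload_2    // push local variable c onto the operand stack
    //   iadd       // pop both operands, push b + c
    //   istore_3   // pop the result into local variable a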
The second architecture is the pipelined version [15], which has five stages: instruction fetch, instruction decoding, operand fetch, execution, and write back, as shown in Figure 2. One of the main characteristics of this architecture is the presence of registers playing the role of the operand stack and of the local variable pool (used to keep the values of the local variables of a method). We call this architecture the Low-Power Femtojava. The first stage, instruction fetch, is composed of an instruction queue of 9 registers. The first instruction in the queue is sent to the instruction decoder stage. The decoder has four functions: to generate the control word for that instruction, to handle data dependencies, to analyze the forwarding possibilities, and to inform the instruction queue of the size of the current instruction, so that the next instruction of the stream is placed at the head of the queue. This is necessary because of the use of variable-length instructions: they can have one or two immediate operands, or none at all.

Operand fetch is done in a register bank of variable size, defined a priori in the early stages of the design. The operand stack and the local variable pool of the methods reside in this register bank. Two registers, SP (Stack Pointer) and VARS (VARiables Storage), point to the top of the stack and to the beginning of the local variable storage, respectively. Depending on the instruction, one of them is used as the base for the operand fetch. Once the operands are fetched, they are sent to the fourth stage, where the operation is executed. There is no branch prediction, in order to save area: all branches are predicted not taken, and if a branch is taken, a penalty of three cycles is paid. The write back stage saves, if necessary, the result of the execution stage back to the register bank, again using SP or VARS as the base. There is a unified register bank for the stack and the local variable pool, because this facilitates method call and return, taking advantage of the JVM specification, where each method is located by a frame pointer in the stack.

The Low-Power Java processor uses the forwarding technique [16], which brings an advantage over RISC-like processors: in instructions that manipulate the stack, operands forwarded to earlier stages will not be used anymore. As a consequence, there is no need to write these operands back to the stack. The result is a reduction in power consumption, because the number of writes to the stack is reduced. In this work we show a gain factor of 8 in energy consumption with a minimal area overhead, thanks to the use of the forwarding technique.

Figure 2. Low-Power Java Processor

The third architecture is the VLIW processor [17], an extension of the pipelined one. Basically, it has its functional units and instruction decoders replicated. The additional decoders do not support the instructions for method call and return, since these are always in the main flow. The local variable storage is placed only in the first register file; when instructions of the other flows need the value of a local variable, they must fetch it from the register bank of the main flow. Each instruction flow has its own operand stack, with fewer registers than the main operand stack, since the operand stacks of the secondary flows do not grow as much as the one of the main flow. The VLIW packet has a variable size, avoiding unnecessary memory accesses: a header in the first instruction of the word informs the instruction fetch controller how many instructions the current packet has. The search for ILP in the Java program is done at the bytecode level. The algorithm works as follows: all instructions that depend on the result of the previous one are grouped into an operand block. The entire Java program is divided into such groups, which can then be parallelized, respecting the functional unit constraints. For example, if there is just one multiplier, two instructions that use this functional unit cannot be executed in parallel.
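As a hypothetical illustration of this grouping (the bytecodes are standard, but the split is our own example, not output of the actual tool), two independent expressions form two operand blocks that can travel in the same VLIW packet:

    // int x = a + b;   int y = c << 2;   (independent dependency chains;
    // a, b, c, x, y in local variable slots 1-5)
    //
    // Operand block 1 (main flow):    Operand block 2 (second flow):
    //   iload_1   // push a             iload_3    // push c
    //   iload_2   // push b             iconst_2   // push 2
    //   iadd      // x = a + b          ishl       // y = c << 2
    //   istore 4                        istore 5
    //
    // Within a block each instruction consumes the result of the previous
    // one, so blocks are serial internally but parallel to each other.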
There are three basic versions of the VLIW processor: VLIW 2, VLIW 4 and VLIW 8, which can handle up to 2, 4 or 8 instructions per VLIW packet, respectively.

The last version is the Femtojava Low-Power coupled to a coarse-grain reconfigurable array. The use of a coarse-grain array ensures fast reconfiguration and low control overhead, saving energy in the reconfiguration task. The reconfigurable array is tightly coupled to the processor: it is implemented as a functional unit in the execution stage, using the same approach as Chimaera [18] and ConCISe [19]. Thanks to the coupling of dynamic binary translation (BT) [20], the detection of instruction sequences that will be executed in the reconfigurable array is done at run time. This binary translation of sequences of instructions into combinational logic is totally transparent to the software designer, allowing easy software porting to different machines, tracking technological evolutions. By using BT, one allows optimizations not only in critical loops, as recent works have proposed, but in all parts of the algorithm. Thus, our technique is not limited to the parallelism available in a single part of the application, since the amount of parallelism usually varies during execution [20].

The array is divided into blocks, called cells, shown in Figure 3. The operand block (a sequence of bytecodes) previously detected is fitted into one or more of these cells. The initial part of the cell is composed of three functional units (ALU, shifter, ld/st); after this first part, six identical parts follow in sequence. This number was chosen based on the delays presented in [21], which compares the delay of the Booth multiplier (used in our processor) with that of adders. Each cell of the array has just one multiplier and takes exactly one processor cycle to complete execution, being limited by its critical path, so it brings no delay overhead to the processor pipeline. At the end of each cell there are two additional functional units: a branch unit and an extended ld/st unit, built for the execution of the iastore instruction (which stores a value into an array in memory, taking the address from the stack), since this instruction needs three operands instead of the usual two.

For each cell in the array, 327 reconfiguration bits are necessary. Consequently, if the array is formed by 3 cells, 971 bits are necessary in the reconfiguration cache. To these one must add 58 bits of additional information, such as how many cycles the execution takes and the initial ROM address where the sequence is located, totaling 1029 bits for each configuration of the array. The array is reconfigured using the data of a cache specially designed for it. Each cell in the array has its own reconfiguration cache, so the reconfiguration of the cells can be done in parallel. While the program executes, when the address of a reconfigurable instruction group is found, the reconfigurable unit detector sends this information to the main processor. Its control unit then configures the array as the active functional unit, stalls the rest of the processor while the array performs its functions, and updates the Program Counter with the new address, in order to resume normal operation after the execution of that sequence of instructions by the array. At the same time, the configurations stored in the reconfiguration cache for that address are sent to the reconfigurable array, since they are indexed by addresses of the instruction memory. As the detection of the address that will be used in the reconfiguration is done in the first stage of the pipeline, and the reconfigurable array is in the fourth stage, there are 3 cycles available between the detection and the use of the array. As one cycle is necessary to find the cache line that holds the array configuration, two cycles are available for the reconfiguration itself.

The following example shows the potential gains of using a reconfigurable array in a Java machine. If there are five simple arithmetic or logic operations, at least 11 cycles are needed for their execution: 6 for pushing the operands onto the stack, plus 5 for the operations themselves. This is the optimal execution, without considering pipeline stalls due to data dependencies. The array, on the other hand, would execute everything in just one cycle, after the first time it finds this sequence and saves it in the reconfiguration cache. If this sequence is repeated a certain number of times, meaningful gains can be achieved.
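A concrete instance of this count, with an expression of our own choosing:

    // Sum six locals (slots 1-6); one valid stack schedule matching the
    // count in the text, leaving the sum on the stack:
    //   iload_1  iload_2  iload_3  iload 4  iload 5  iload 6   // 6 pushes
    //   iadd  iadd  iadd  iadd  iadd                           // 5 operations
    //
    // Pipeline: at least 6 + 5 = 11 cycles, one instruction per cycle.
    // Array: the whole chain fits in one cell and, once cached, retires
    // in a single cycle.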
For the execution, the first two operand values feed the first operation. After that, in each basic part of the cell, the first operand is always the result of the previous operation. This is another characteristic of stack machines, and it makes the control and the hardware of the array simpler. In each cell it is possible to perform 7 simple operations (arithmetic, logic, shift). If a multiplication is found, the result of the current cell is bypassed to its end, and the multiplication is executed in the next cell.

Figure 3. Architecture of the array

4. RESULTS

Table 1 shows the area occupied by the four different versions of our Java processors. It is important to note that the register bank of the Low-Power version (as well as the main flow of the VLIW one), used as stack and local variable pool, has 32 registers, the maximum required among all the applications (they are counted in the table, even though the FPGA's memory could be used for this purpose).

The version with the reconfigurable array can support up to 10 different reconfigurations, and its array is composed of 4 cells. The area was evaluated with Leonardo Spectrum for Windows [22] and is given in number of logic cells.

Table 1. Area of the VHDL Versions of the Architectures
  PROCESSOR      LOW POWER    VLIW (INSTR. PER PACKET)    REC. ARRAY X4
  Area (LCs)

For performance and power estimation, the tool used was a configurable compiled-code cycle-accurate simulator [23], which provided data on energy consumption, memory usage and performance. Its power estimation technique is comparable to the component-based approach used in [24]. Power dissipation is evaluated in terms of switching activity, and as the processor has separate instruction and data memories, we also included an evaluation module for the RAM and ROM memories, besides the register bank. This way, one can verify the relative power dissipation of the CPU, the instruction memory and the data memory. It is important to measure the impact of each of these blocks, so that one can better explore the design space.

To illustrate the results of the design exploration with different cores, we selected three of the architectures explained before: the multicycle, the low-power and a VLIW version with 2 instructions per packet, executing the following algorithms: the sine and the Inverse Modified Discrete Cosine Transform (IMDCT), with different degrees of loop unrolling. Table 2 presents the results in terms of performance, power and energy, at a frequency of 50 MHz and a Vdd of 3.3 V, for two different ways of computing the sine. The pipelined and VLIW architectures obviously provide the better cycle counts; the best result in terms of performance is the combination of the sine calculated as a simple table search with the VLIW architecture, but it is also the worst in terms of power. The best combination in terms of power, but the worst in terms of energy, is the sine using CORDIC on the multicycle core.

Table 2. Sine Characterization
                         CORDIC                     TABLE
  Characteristic         Multi  Pipeline  VLIW2     Multi  Pipeline  VLIW2
  Performance (cycles)
  Power (mW)
  Energy (nJ)

Tables 3, 4 and 5 show the main results of the characterization of the four different implementations of the IMDCT function, in which the loop unrolling technique was applied at different levels. The IMDCT3 implementation has the best results in terms of performance on all architectures, but the size of the program memory increases significantly. The opposite happens with the IMDCT implementation, which has far better results in terms of program memory, but consumes more cycles. In terms of power (again at 50 MHz and a Vdd of 3.3 V), the best combinations are IMDCT1 and IMDCT2 on the multicycle core, but these combinations do not have good results in terms of energy. Tables 4 and 5 show that the best results in terms of performance are obtained on the VLIW core, since the IMDCT routine has a lot of parallelism.

Table 3. IMDCT Characterization (software dependent)
  Characteristic          IMDCT    IMDCT1    IMDCT2    IMDCT3
  Program size (bytes)    344      2,137     4,260     15,294
  Data mem (bytes)

Table 4. IMDCT Characterization (hardware dependent)
                         IMDCT                      IMDCT1
  Characteristic         Multi  Pipeline  VLIW2     Multi  Pipeline  VLIW2
  Performance (cycles)
  Power (mW)
  Energy (µJ)

Table 5. IMDCT Characterization (hardware dependent, cont.)
                         IMDCT2                     IMDCT3
  Characteristic         Multi  Pipeline  VLIW2     Multi  Pipeline  VLIW2
  Performance (cycles)
  Power (mW)
  Energy (µJ)

As one can observe, depending on the requirements in area, power, energy consumption or performance, the designer can choose different configurations to meet the needs of an application in a CMP organization. This demonstrates that there are many alternatives for design exploration using a CMP with Java.
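As a note of ours tying the three reported metrics together (the numbers below are illustrative, not taken from the tables), energy follows from power and cycle count at the 50 MHz clock:

    % Sanity-check relation between the reported metrics:
    \[
      E \;=\; P \cdot t \;=\; P \cdot \frac{N_{\mathrm{cycles}}}{f},
      \qquad f = 50\,\mathrm{MHz}.
    \]
    % Example with illustrative numbers: P = 10 mW and N = 100000 cycles give
    % t = 100000 / (50 * 10^6) s = 2 ms and E = 10 mW * 2 ms = 20 uJ.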

5. CONCLUSION AND FUTURE WORK

We presented first studies on the implementation of a CMP architecture based on Java, obtaining different results in terms of area, performance and power consumption when executing the algorithms. Our next step is to build the hardware that makes communication among the Java threads possible. Besides, we intend to use binary translation to allocate parts of the code to different processors at run time, exploring the instruction-level parallelism of the application instead of just the parallelism among threads, as before. Moreover, we will study alternatives for the allocation of threads to the processors. This can be done dynamically, with a special Java processor executing a scheduler, or statically, where the applications are previously simulated on the Java architectures, considering performance and power consumption.

6. REFERENCES

[1] Schlett, M., Trends in Embedded-Microprocessor Design. In Computer, vol. 31, n. 8, 1998.
[2] Nokia N-GAGE Home Page, available at
[3] Takahashi, D., Java Chips Make a Comeback. In Red Herring, 2001.
[4] Lawton, G., Moving Java into Mobile Phones. In Computer, vol. 35, n. 6, 2002.
[5] Olukotun, K. et al., The Case for a Single-Chip Multiprocessor. In Proc. 7th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, 1996, pp. 2-11.
[6] Kumar, R., Farkas, K., Jouppi, N., Ranganathan, P., Tullsen, D., Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In 36th International Symposium on Microarchitecture (MICRO-36), Dec.
[7] Hammond, L., Nayfeh, B. A., Olukotun, K., A Single-Chip Multiprocessor. In IEEE Computer, Sep. 1997.
[8] Ito, S. A., Carro, L., Jacobi, R. P., Making Java Work for Microcontroller Applications. In IEEE Design & Test of Computers, vol. 18, n. 5, 2001.
[9] Li, Y., Skadron, K., Hu, Z., Brooks, D., Evaluating the Thermal Efficiency of SMT and CMP Architectures. In IBM Watson Conference on Interaction between Architecture, Circuits, and Compilers, external track, Oct.
[10] Brown, J., Wang, H., Chrysos, G., Wang, P., Shen, J., Speculative Precomputation on Chip Multiprocessors. In 6th Workshop on Multithreaded Execution, Architecture, and Compilation, Nov.
[11] Burns, J., Gaudiot, J., Area and System Clock Effects on SMT/CMP Processors. In Proc. 2001 International Conference on Parallel Architectures and Compilation Techniques, 2001.
[12] Barroso, L. A., Gharachorloo, K., McNamara, R., Nowatzyk, A., Qadeer, S., Sano, B., Smith, S., Stets, R., Verghese, B., Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In 27th International Symposium on Computer Architecture, 2000.
[13] Tremblay, M., Chan, J., Chaudhry, S., Conigliaro, A., Tse, S., The MAJC Architecture: A Synthesis of Parallelism and Scalability. In IEEE Micro, vol. 20, n. 6, 2000.
[14] Hammond, L., Hubbert, B., The Stanford Hydra CMP. In IEEE Micro, vol. 20, n. 2, Mar. 2000.
[15] Beck, A. C. S., Carro, L., Low Power Java Processor for Embedded Applications. In IFIP 12th International Conference on Very Large Scale Integration, Germany, December 2003.
[16] Hennessy, J. L., Patterson, D. A., Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, 2003.
[17] Beck, A. C. S., Carro, L., A VLIW Low Power Java Processor for Embedded Applications. In 17th Brazilian Symp. on Integrated Circuit Design (SBCCI 2004), Sep. 2004.
[18] Hauck, S., Fry, T., Hosler, M., Kao, J., The Chimaera Reconfigurable Functional Unit. In Proc. IEEE Symp. on FPGAs for Custom Computing Machines, Napa Valley, CA, 1997.
[19] Kastrup, B., Bink, A., Hoogerbrugge, J., ConCISe: A Compiler-Driven CPLD-Based Instruction Set Accelerator. In Proc. 7th Annual IEEE Symp. on Field-Programmable Custom Computing Machines, Napa Valley, CA, 1999.
[20] Gschwind, M., Altman, E., Sathaye, S., Ledak, P., Appenzeller, D., Dynamic and Transparent Binary Translation. In IEEE Computer, vol. 33, n. 3, 2000.
[21] Callaway, T. K., Swartzlander Jr., E. E., Power-Delay Characteristics of Multipliers. In IEEE 13th Symposium on Computer Arithmetic (ARITH 97), Mar. 1997.
[22] Leonardo Spectrum, available at
[23] Beck, A. C. S., Mattos, J. C. B., Wagner, F. R., Carro, L., CACO-PS: A General Purpose Cycle-Accurate Configurable Power Simulator. In 16th Brazilian Symp. on Integrated Circuit Design (SBCCI 2003), Sep. 2003.
[24] Chen, R., Irwin, M. J., Bajwa, R., Architecture-Level Power Estimation and Design Experiments. In ACM Transactions on Design Automation of Electronic Systems, vol. 6, n. 1, Jan. 2001, pp. 50-66.
