Fast Cycle-Accurate Simulation and Instruction Set Generation for Constraint-Based Descriptions of Programmable Architectures

Scott J. Weber 1, Matthew W. Moskewicz 1, Matthias Gries 1, Christian Sauer 2, Kurt Keutzer 1
1 University of California, Electronics Research Lab, Berkeley
2 Infineon Technologies, Corporate Research, Munich, Germany
{sjweber, moskewcz, gries, sauer, keutzer}@eecs.berkeley.edu

Abstract
State-of-the-art architecture description languages have been successfully used to model application-specific programmable architectures limited to particular control schemes. In this paper, we introduce a language and methodology that provide a framework for constructing and simulating a wider range of architectures. The framework exploits the fact that designers are often only concerned with data paths, not the instruction set and control. In the framework, each processing element is described in a structural language that only requires the specification of the data path and constraints on how it can be used. From such a description, the supported operations of the processing element are automatically extracted and a controller is generated. Various architectures are then realized by composing the processing elements. Furthermore, hardware descriptions and bit-true cycle-accurate simulators are automatically generated. Results show that our simulators are up to an order of magnitude faster than other reported simulators of this type and two orders of magnitude faster than equivalent Verilog simulations.

Categories and Subject Descriptors: C.0 [Computer Systems Organization]: General -- Modeling of computer architecture; I.6.7 [Simulation and Modeling]: Simulation Support Systems; D.3.2 [Programming Languages]: Language Classifications -- Constraint and logic languages, design languages, specialized application languages.

General Terms: Algorithms, Design, Languages.

Keywords: Instruction set extraction, automatic control generation, cycle-accurate simulation.

1. Introduction
The primary focus of the designer of an application-specific programmable processor is often only the data path, not the control and instruction set. Control should be automatically generated whenever possible to ease the design process and to avoid potential errors. Likewise, the instruction set should reflect the capabilities of the underlying data path, not define them. Attempting to define an instruction set can be complicated by architectural complexities such as multiple memories and forwarding paths. The problem, however, is that existing architecture description languages (ADLs) require the complete specification of either the control or the instruction set in addition to the data path. The MIMOLA ADL [1][2], for instance, provides the ability to automatically extract the instruction set from a description of the data path, but requires the specification of control signals.
On the other hand, ADLs such as LISA [3], EXPRESSION [4], and nML [5] provide the ability to automatically generate the control, but require the specification of the instruction set. It is for this reason that we introduce a new language and supporting framework that both extracts the instruction set and generates a controller from a description of the data path.

In our framework, each processing element is a statically-scheduled horizontally-microcoded machine. Such machines require neither hazard detection logic nor dynamic control, but instead rely on static scheduling. In such a scenario, components are composed in a modular manner so that the control for one is provided by another. Moreover, composing in this manner does not limit one to the vertically controlled machines created by today's ADLs. Due to space constraints, we will focus only on machines with horizontally-microcoded control. Such machines are the core component for defining more sophisticated machines. To deal with the potential code size explosion of the microcode, we will also outline an encoder/decoder strategy.

As a first step toward analyzing the utility of our methodology, we have developed a type-polymorphic, parameterized language for describing processing elements in terms of the data path and constraints on how the data path can be used. From this description, we are able to extract the supported operations, generate assemblers, generate synthesizable RTL Verilog, and generate fast bit-true cycle-accurate interpretive and compiled-code simulators. In fact, our generated simulators are as fast as or faster than those produced by other ADL-based frameworks.

The paper is organized as follows. In Section 2, we provide a brief overview of our design flow. Our language and operation extraction procedure are discussed in Section 3. In Section 4, we discuss the generation of simulators. An overview of the generation of hardware is covered in Section 5. In Section 6, we present the simulation performance for two designs. We discuss related work in Section 7, and conclude in Section 8.

2. Overview
We have developed a correct-by-construction framework for designing statically-scheduled horizontally-microcoded programmable architectures. From a constraint-based description of the design, the primitive operations of the architecture are extracted. From these operations, bit-true cycle-accurate simulators and a synthesizable RTL description are automatically generated.

The complete flow of the design methodology is shown in Figure 1.

Figure 1: Design Methodology

3. Architecture Design
Our methodology simplifies the way an RTL designer approaches a programmable design. First, designers specify the data path. Second, to enable operation extraction and to make verification a first-class citizen during design, constraints must be specified on how particular components of the data path can be used. These constraints force the designer to design in a correct-by-construction manner. An instruction set representing the source (inputs or register reads) to sink (outputs or register writes) operations of the data path is then automatically extracted. A controller that implements the instruction set is also generated. Finally, in order to finish the implementation, the designer writes a program either at the level of the instruction set or in a higher-level language that is then compiled to the architecture.

As an illustrative example, we explore an architecture that is capable of incrementing or decrementing an input by a step value. The architecture is shown in Figure 2.

Figure 2. Inc/Dec Architecture

3.1 Describing an Architecture
The components themselves are described in a constraint-based language in terms of constraints on signals and a set of rules. Rules have an activation and an action section. The activation section indicates what signals have to be present and/or absent in order for the rule to be activated (parenthesized after the rule name). The action section describes what operations to perform if the rule is activated. An incrementer component is shown in Figure 3 (<5> indicates that step is a 5-bit unsigned integer).

inc ( input in, input <5> step, output out ) {
  rule fire(in, step) { out = in + step; }
  rule no_fire(-in, -step) { }
  inc <=> fire ∨ no_fire;
}
Figure 3. Incrementer Description

inc
The incrementer is interpreted as follows. The inc component is valid if and only if the constraints of rule fire or (inclusive) rule no_fire are satisfied. The rule fire is satisfied if in and step are present. A signal is present if it is assigned a value. When fire is satisfied, then out is set to present and is assigned the value in + step. The rule no_fire is satisfied if in and step are not present. When no_fire is satisfied, then out is set to not present (i.e. nothing is assigned). Note that due to these rules, inc cannot be valid if only one of the input signals (in or step) is present. In Figure 4 the component is shown in terms of quantifier-free first-order logic on rules and the presence of signals. This formulation is shown because internally this is how components are represented for the operation extraction procedure described later.

inc ( input in, input <5> step, output out ) {
  ( fire ⇒ ( in ∧ step ) ) ∧ ( fire ⇒ ( out = in + step ) ) ∧
  ( no_fire ⇒ ( ¬in ∧ ¬step ) ) ∧ ( no_fire ⇒ ¬out ) ∧
  ( inc ⇔ ( fire ∨ no_fire ) )
}
Figure 4. Incrementer Constraints

dec
Accordingly, a decrementer is defined in Figure 5.

dec ( input in, input <5> step, output out ) {
  ( fire ⇒ ( in ∧ step ) ) ∧ ( fire ⇒ ( out = in - step ) ) ∧
  ( no_fire ⇒ ( ¬in ∧ ¬step ) ) ∧ ( no_fire ⇒ ¬out ) ∧
  ( dec ⇔ ( fire ∨ no_fire ) )
}
Figure 5. Decrementer Constraints
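To make the presence semantics of these rules concrete, the following is a minimal C++ sketch, not part of the described framework, that models the incrementer's behaviour over optional signals: a signal is present when it carries a value, fire requires both inputs present and produces out, and no_fire requires both inputs absent. The 22-bit masking is an assumption taken from the example data path.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>

// A signal is "present" when the optional carries a value, mirroring the
// presence semantics of the rules above. Bit widths are modeled by masking.
using Signal = std::optional<std::uint32_t>;

// Behaviour of the inc component: rule fire(in, step) requires both inputs
// present and assigns out = in + step; rule no_fire(-in, -step) requires both
// absent and leaves out absent; any other combination violates the constraints.
Signal eval_inc(const Signal& in, const Signal& step) {
    if (in && step)                        // rule fire
        return (*in + *step) & 0x3FFFFF;   // assuming a 22-bit data path, as in the example
    if (!in && !step)                      // rule no_fire
        return std::nullopt;               // out is not present
    throw std::logic_error("inc: only one input present, constraints violated");
}
```

Under these rules, eval_inc(3, 2) yields a present out of 5, while eval_inc(3, std::nullopt) is rejected, which is exactly the case the constraints rule out.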
mux
The definition of the mux in Figure 6 demonstrates a number of useful features of the language. First, the mux uses a list and foreach expressions in order to make it a generic N:1 mux, where N is determined by the number of signals connected to in. Second, the N:1 mux is type-polymorphic. A type resolution routine is used to determine the types of in and out. The type rules are the same as in Verilog, but types can also be determined as expressions on constants and other types (i.e. out is defined to have the same type as in). Third, the input sel is an enumerated signal. Enumerated signals are constrained to take a unique value for each rule in which they appear. Finally, only the data path has been specified. The control will be generated for the unconnected port(s), which is only sel in the case of the mux.

mux ( input in[], input {enum} sel, output out ) {
  foreach(i) {
    ( fire ⇒ ( in[$i] ∧ sel ∧ foreach(j) { if ($i != $j) { ¬in[$j] } } ) ) ∧
    ( fire ⇒ ( out = in[$i] ) )
  }
  ( no_fire ⇒ ¬in ∧ ¬sel ) ∧ ( no_fire ⇒ ¬out ) ∧
  ( mux ⇔ ( or(fire) ∨ no_fire ) )
}
Figure 6. Type-Polymorphic N:1 Mux

If out is a 22-bit unsigned integer and there are two signals connected to in, then the mux component is automatically expanded to the constraints of the 2:1 mux shown in Figure 7. Note that the sel.e signal is constrained to take a single value, and that sel is of type <0>, indicating that it is not used in the data path. Furthermore, signals (e.g. out) are constrained to be assigned only one value.

mux ( input <22> in[0], input <22> in[1], input <0> sel, output <22> out ) {
  ( fire0 ⇒ ( in[0] ∧ ¬in[1] ∧ sel ∧ sel.e = α ) ) ∧ ( fire0 ⇒ ( out = in[0] ) ) ∧
  ( fire1 ⇒ ( in[1] ∧ ¬in[0] ∧ sel ∧ sel.e = β ) ) ∧ ( fire1 ⇒ ( out = in[1] ) ) ∧
  ( no_fire ⇒ ( ¬in[0] ∧ ¬in[1] ∧ ¬sel ∧ sel.e = χ ) ) ∧ ( no_fire ⇒ ¬out ) ∧
  ( sel.e ∈ { α, β, χ } ) ∧
  ( out ⇒ ( out = in[0] ) ∨ ( out = in[1] ) ) ∧ ¬( out = in[0] ∧ out = in[1] ) ∧
  ( mux ⇔ ( fire0 ∨ fire1 ∨ no_fire ) )
}
Figure 7. 22-bit 2:1 Mux
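The type polymorphism and N:1 parameterization can be pictured with an ordinary C++ template. The sketch below is only an illustration of the behaviour the expanded constraints describe (exactly one present input, selected by sel, propagated to out), not the framework's internal representation; the function name and the use of std::optional for presence are assumptions of the sketch.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <stdexcept>

// Illustrative N:1 mux with the presence semantics of Figures 6 and 7:
// rule fire_i needs in[i] present, every other input absent, and sel naming i;
// rule no_fire needs all inputs and sel absent. T plays the role of the
// resolved signal type (a 22-bit unsigned integer in the expanded example).
template <typename T, std::size_t N>
std::optional<T> eval_mux(const std::array<std::optional<T>, N>& in,
                          const std::optional<std::size_t>& sel) {
    if (!sel) {                                        // rule no_fire
        for (const auto& s : in)
            if (s) throw std::logic_error("mux: input present without sel");
        return std::nullopt;                           // out not present
    }
    if (*sel >= N || !in[*sel])
        throw std::logic_error("mux: selected input not present");
    for (std::size_t j = 0; j < N; ++j)                // all other inputs absent
        if (j != *sel && in[j])
            throw std::logic_error("mux: unselected input present");
    return in[*sel];                                   // rule fire_sel: out = in[sel]
}
```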

demux
The demux actor is defined in a way similar to the mux, so that it is a type-polymorphic 1:N demux with no control specified. An expanded 22-bit 1:2 demux is shown in Figure 8.

demux ( input <22> in, input <0> sel, output <22> out[0], output <22> out[1] ) {
  ( fire0 ⇒ ( in ∧ sel ∧ sel.e = α ) ) ∧ ( fire0 ⇒ ( out[0] ∧ ¬out[1] ∧ out[0] = in ) ) ∧
  ( fire1 ⇒ ( in ∧ sel ∧ sel.e = β ) ) ∧ ( fire1 ⇒ ( ¬out[0] ∧ out[1] ∧ out[1] = in ) ) ∧
  ( no_fire ⇒ ( ¬in ∧ ¬sel ∧ sel.e = χ ) ) ∧ ( no_fire ⇒ ( ¬out[0] ∧ ¬out[1] ) ) ∧
  ( sel.e ∈ { α, β, χ } ) ∧
  ( demux ⇔ ( fire0 ∨ fire1 ∨ no_fire ) )
}
Figure 8. 22-bit 1:2 Demux

composition
To complete the architecture in Figure 2, the appropriate ports on the components are connected. This allows for the propagation of present signals. Hierarchical composition is available, but is not used in the example. The two sel ports that are left unconnected will have their values set appropriately for each operation that is extracted. Since the step input ports are not connected to any signal but are used by the data path, their values will be provided by operation parameters. We assume that D.in is provided by a signal from the environment.

memories
Although not shown in any of the components in Figure 2, a component can also contain any number of state elements. A state element is either a register or a flip-flop. A register holds the value written to it until the value is overwritten. A flip-flop holds the value written to it for one cycle. With these primitives, one can build type-polymorphic, parameterized RAMs, register files, ROMs, pipeline registers, and other useful state elements. After composing either user-defined or pre-defined components from a library, the designer invokes the operation extraction routine. At this point, the designer will see what operations the data path supports. If components are utilized in an unexpected way, the designer must inspect and reformulate the constraints to get the desired behavior.

3.2 Extracting the Operations
In the previous section, we demonstrated how the constraints on components are formulated in quantifier-free first-order logic. When extracting operations, we create additional constraints to equate connected ports and to assert each component's constraint literal (e.g. inc, dec, mux, demux). We then find the set of satisfying solutions using an iterative SAT procedure, called FindMinimalOperations (FMO), shown in Figure 10. FMO finds all minimal paths through the data path; these are the valid operations of the design. The procedure must be restricted to find minimal solutions, as there are generally an exponential number of solutions, scaling with the amount of independent parallelism in the design. A solution is minimal when it has the fewest possible present ports while still satisfying the constraints; that is, no further port can be set to not present. The inner loop of FMO performs this minimization. Following the creation of an operation, we restrict the model formula so that subsequent minimal operations are not simply combinations of previously found operations. By limiting FMO in this manner, we can quickly find the set of supported minimal operations. For the example architecture in Figure 2, the operations shown in Figure 9 are found.

NOP
INC(step)    M.out = D.in + step
DEC(step)    M.out = D.in - step
Figure 9. Operations Extracted for Inc/Dec Arch

Operation extraction also creates a conflict table that indicates which operations cannot occur in the same cycle due to the constraints. For the operations in Figure 9, INC(step) and DEC(step) are in conflict. The assembler and compiler then use this information to create appropriate schedules.
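The flavour of this extraction loop can be sketched in C++ against a hypothetical incremental SAT interface; the Solver type and its addClause, solve, and modelTrue calls are assumptions of this sketch rather than any actual solver API, and the exact-set blocking used here is a simplification of the stronger constraints in the authors' pseudocode (Figure 10).

```cpp
#include <set>
#include <vector>

// Hypothetical incremental SAT interface assumed only for this sketch.
// Literals are +var (port present) or -var (port absent).
struct Solver {
    void addClause(const std::vector<int>& lits);
    bool solve(const std::vector<int>& assumptions);
    bool modelTrue(int var) const;   // value of var in the last satisfying model
};

// Sketch of the idea behind FMO: find a model, shrink its set of present
// ports until it is subset-minimal, record it as an operation, block it, and
// repeat. The real procedure (Figure 10) adds stronger blocking constraints
// so that later solutions are not simply combinations of earlier operations.
std::vector<std::set<int>> extractOperations(Solver& s, const std::vector<int>& ports) {
    std::vector<std::set<int>> ops;
    while (s.solve({})) {
        std::set<int> cur;                             // present ports of the model
        for (int p : ports) if (s.modelTrue(p)) cur.insert(p);

        bool shrunk = true;                            // minimize: try to drop ports
        while (shrunk) {
            shrunk = false;
            for (int p : std::vector<int>(cur.begin(), cur.end())) {
                std::vector<int> assume;
                for (int q : ports) if (!cur.count(q)) assume.push_back(-q);
                assume.push_back(-p);                  // force one more port absent
                if (s.solve(assume)) {                 // a strictly smaller model exists
                    cur.clear();
                    for (int q : ports) if (s.modelTrue(q)) cur.insert(q);
                    shrunk = true;
                    break;
                }
            }
        }
        ops.push_back(cur);                            // one minimal operation

        std::vector<int> block;                        // forbid this exact port set
        for (int p : ports) block.push_back(cur.count(p) ? -p : p);
        s.addClause(block);
    }
    return ops;
}
```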
After extraction, we can further restrict the architecture by removing unwanted operations. This is sometimes an easier way to restrict particular paths than it is to create the appropriate constraints on the data path; in either case, the resulting control is equivalent. After extracting the operations, we now have a description of a completely statically-scheduled horizontally-microcoded machine. Architectures in this class require neither dynamic scheduling nor hazard detection control, since all conflicts can be found at compile time. More complex architectures can be created by composing and coupling these machines. Such coupling approaches have been successfully used by Intel x86 machines to translate CISC instructions into more RISC-like instructions for the superscalar core, and more recently by Transmeta to translate x86 code into statically-scheduled VLIW code. From this description we can now generate bit-true cycle-accurate simulators as well as synthesizable RTL for any architecture in this class. For the RTL, we will also need to synthesize a controller.

BASE is the CNF formulation of the model
PORT is the set of port literals
CERT is a certificate, i.e. a set of literals (satisfying BASE)
present : (PORT x CERT) -> Boolean (true if port present in cert)
issatisfiable : BASE -> Boolean (true if CNF is satisfiable)
getcertificate : the last certificate that made issatisfiable TRUE

OPERATIONS = {}, OPPORTS = {}
while (issatisfiable(BASE)) {
  do {
    C = getcertificate()
    remove the constraints added in (1)
    BASE = BASE ∧ (∧_i { ¬port_i | port_i ∈ PORT ∧ ¬present(port_i, C) })    (1)
    BASE = BASE ∧ (∨_i { ¬port_i | port_i ∈ PORT ∧ present(port_i, C) })     (2)
  } while (issatisfiable(BASE))
  remove the constraints added in (2)
  create new operation named op based on the certificate C
  OPPORTS = OPPORTS ∪ { op -> { port | port ∈ PORT ∧ present(port, C) } }
  OPERATIONS = OPERATIONS ∪ { op }
  BASE = BASE ∧ ( (∧_i { port_i | port_i ∈ PORT ∧ present(port_i, C) }) ⇒ op )    (3)
  BASE = BASE ∧ (∨_i { port_i | port_i ∈ PORT ∧ ¬(∨_j { op_j | op_j ∈ OPERATIONS ∧ port_i ∈ OPPORTS(op_j) }) })    (4)
  remove the constraints added in (3) and (4)
}
Figure 10. FindMinimalOperations (FMO)

4. Simulator Generation
The structure of our generated simulators is straightforward. For each cycle, we run a statically-scheduled instruction. An instruction is defined as an unordered set of operations that has been checked for conflicts. Each operation may contain a set of statically determined parameters and may read inputs from and write outputs to the interface of the architecture. Decoding an operation simply requires jumping to an appropriate label based on the name of the operation. For each operation label, we execute the appropriate source-to-sink transformations. Since multiple operations can be executed in a single cycle, we commit writes to state elements at the end of a cycle.
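The end-of-cycle commit discipline, and the register/flip-flop distinction from Section 3.1, can be illustrated with a small C++ state-element template. This is only a sketch of the read-at-start/write-at-end behaviour described above, not the generated code; resetting an unwritten flip-flop to a default value stands in for "not present".

```cpp
#include <cstdint>

// Illustrative two-phase state element: reads during a cycle see the value
// from the previous cycle; writes are buffered and applied by commit(),
// which the simulator calls once at the end of each cycle.
template <typename T, bool HoldsValue /* true: register, false: flip-flop */>
class StateElement {
    T current_{};
    T next_{};
    bool written_ = false;
public:
    T read() const { return current_; }              // value visible this cycle
    void write(T v) { next_ = v; written_ = true; }  // buffered until commit()
    void commit() {                                   // end-of-cycle update
        if (written_)          current_ = next_;      // take the new value
        else if (!HoldsValue)  current_ = T{};        // flip-flop: value lasts one cycle
        written_ = false;
    }
};

// A register keeps its value until overwritten; a flip-flop clears after one cycle.
using Register32 = StateElement<std::uint32_t, true>;
using FlipFlop32 = StateElement<std::uint32_t, false>;
```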

The resulting simulator is equivalent to a discrete-event simulation of an FSM in an RTL simulator. However, our simulator is much faster because we statically determine the schedule. The generated simulators are implemented in C++ for performance reasons. In order to get the utmost performance from our simulators, we also use a number of constructs that would not be found in hand-written code. For example, we use computed gotos, template meta-programming, and inlining. These techniques, coupled with a good compiler, result in high-performance simulators.

4.1 Interpretive Simulator
Generating an interpretive simulator is useful when the instruction stream is dynamic. The interpretive simulator executes either instructions from object files generated by an assembler or dynamic instructions produced by another processing element. The assembler is parameterized by the instruction set and conflict table that were found during operation extraction. Since instructions are conflict-free, the execution order of the operations within an instruction does not matter. Also, since the program is statically scheduled on a cycle-by-cycle basis, no dynamic scheduling or hazard detection control is required. State is maintained appropriately by performing reads at the beginning and writes at the end of a cycle. Finally, a testbench is generated for each simulator that is used to advance time, terminate simulation, and provide inputs and outputs for the simulator. The control of the simulation is orchestrated through a special port called instruction. At the beginning of each cycle, the last value written to instruction is interpreted as the instruction to execute. The mechanism by which instructions are placed on this port is up to the designer. Possible implementations include having another processing element with its own instruction set produce the instructions, or having the processing element produce the instructions itself through the use of a program counter. If nothing is connected to the instruction port, then the environment must provide an instruction trace.

while (!testbench->exit()) {
  instruction = testbench->read_instruction();
  foreach (operation in instruction) {
    switch (operation) {
      case NOP:
        break;
      case INC:
        uint<22> in = testbench->read_in();
        uint<5> step = operation->get(0);
        uint<22> out = in + step;
        testbench->write_out(out);
        break;
      case DEC:
        uint<22> in = testbench->read_in();
        uint<5> step = operation->get(0);
        uint<22> out = in - step;
        testbench->write_out(out);
        break;
      default:
        report_error();
    }
  }
  // commit state writes (if they exist)
}
Figure 11. Interpretive Simulator for Inc/Dec Arch

For each operation, the actual expressions to be performed are automatically extracted from the action sections of the components that are activated for that operation. We apply copy propagation on the network of activated expressions for a given operation. If we did not, we would have a number of unnecessary temporaries for the signals between components. Dead code elimination is also applied to improve the quality of the generated code. Although compilers will attempt to perform these optimizations, we have found that applying them before compilation is beneficial. The interpretive simulator for the architecture of Figure 2 is shown in Figure 11.
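The computed-goto dispatch mentioned above relies on GCC's labels-as-values extension. The stand-alone sketch below shows the technique on a toy trace in the spirit of the Inc/Dec example; the operation names echo the example, but the code is purely illustrative and is not the generated simulator.

```cpp
#include <cstdio>

enum Op { OP_NOP = 0, OP_INC = 1, OP_DEC = 2, OP_EXIT = 3 };

int main() {
    // GCC/Clang "labels as values": &&label yields the address of a label and
    // goto *expr jumps to it, so operations dispatch without a switch.
    static void* dispatch[] = { &&do_nop, &&do_inc, &&do_dec, &&do_exit };

    // Toy trace: INC 6, NOP, DEC 10, then exit.
    const Op  program[] = { OP_INC, OP_NOP, OP_DEC, OP_EXIT };
    const int arg[]     = { 6,      0,      10,     0 };
    int acc = 0;
    int pc  = 0;

    goto *dispatch[program[pc]];

do_nop:
    ++pc; goto *dispatch[program[pc]];
do_inc:
    acc += arg[pc]; ++pc; goto *dispatch[program[pc]];
do_dec:
    acc -= arg[pc]; ++pc; goto *dispatch[program[pc]];
do_exit:
    std::printf("acc = %d\n", acc);   // prints -4 for this trace
    return 0;
}
```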
4.2 Compiled-Code Simulator
Compiled-code techniques can be utilized to further improve the performance of the simulator. If we know the program that is going to be run, we can hard-code the program into the simulator. If the program trace {NOP(), INC(6)}; {NOP(), DEC(10)} is executed for the architecture in Figure 2, then the compiled-code simulator would be as shown in Figure 12.

_0:
  uint<22> in = testbench->read_in();
  uint<22> out = in + 6;
  testbench->write_out(out);
  testbench->increment_clock();
  // computed-goto would go here
_1:
  uint<22> in = testbench->read_in();
  uint<22> out = in - 10;
  testbench->write_out(out);
  testbench->increment_clock();
  testbench->exit();
Figure 12. Compiled-Code Simulator for Inc/Dec Arch

Before runtime, we know what operations are included in each instruction. Therefore, we can combine the operations to create a single set of expressions for each cycle. Although in our example we trivially combine NOP(), any number of conflict-free operations (the assembler checks this) can be combined into a single set. Combining the operations of an instruction removes the overhead of iterating through the list of operations in an instruction. We then apply the same optimizations as we did for the interpretive simulator, and in addition we can propagate constant operation arguments. Furthermore, in cases where a program counter is used to determine the next instruction, we utilize computed gotos to jump between runtime-computed labels. A further optimization that we will make in the future is to remove these gotos when we can statically determine that the simulation simply proceeds to the next label.

4.3 Interfacing with the Simulator
Three methods are used to interface with the simulator. First, probe components, which are guaranteed not to add any new semantics to the design, can passively capture a trace of the simulation. Second, when synthesis is not required, black-box components can be used. This is useful for modeling components with verified implementations (i.e. IP integration) and for using analysis components (i.e. a cache analyzer). Finally, a testbench is generated that can be used to interface the simulator in a system simulation (e.g. SystemC-based).
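A sketch of the kind of testbench interface that the generated code in Figures 11 and 12 is written against is shown below. The method names follow those figures; the signatures and the simple trace-driven implementation are assumptions of this sketch rather than the framework's actual generated interface, and the instruction stream read in Figure 11 would be supplied through the same object.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical testbench in the spirit of Figures 11 and 12: it supplies
// inputs, collects outputs, advances time, and decides when simulation stops.
// A system-level wrapper (e.g. SystemC-based) would instead bind these calls
// to channels or signals of the surrounding system model.
class Testbench {
    std::vector<std::uint32_t> inputs_;   // one stimulus value per cycle
    std::vector<std::uint32_t> outputs_;  // captured results
    std::size_t cycle_ = 0;
public:
    explicit Testbench(std::vector<std::uint32_t> inputs)
        : inputs_(std::move(inputs)) {}

    bool exit() const { return cycle_ >= inputs_.size(); }    // end of stimulus
    std::uint32_t read_in() const { return inputs_[cycle_]; } // input for this cycle
    void write_out(std::uint32_t v) { outputs_.push_back(v); }
    void increment_clock() { ++cycle_; }                      // advance one cycle

    const std::vector<std::uint32_t>& outputs() const { return outputs_; }
};
```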

5. Hardware Generation
A key component of our design methodology is the ability to produce synthesizable RTL from our architecture descriptions; here we only discuss the simulation of the RTL, since describing the synthesis of hardware is beyond the scope of this paper. Since the components are specified in a structural manner using syntax and type rules consistent with Verilog, simple syntax transformations are used to generate the appropriate RTL for the design. However, unlike our simulators, we do not implement each operation as a hardware path. Instead, we preserve the structure of the data path and synthesize a controller that multiplexes the paths in the architecture appropriately. For each operation, we can determine what control signals and write enables are required in order to activate the appropriate paths in the architecture. We then use this information to create a horizontally-microcoded controller. In order to simulate a program, we also generate the appropriate control words to be embedded in the program memory. Currently, we are exploring an encoder/decoder scheme that compresses the instruction stream based on the analysis of static program traces. Although the details of the approach are beyond the scope of this paper, the basic idea, as shown in Figure 13, is to compress the microcode and then decompress it to get the appropriate control.

Figure 13: Encoder/Decoder Strategy (software side: program, compiler, encoder; hardware side: program store, decoder, microcode buffer, control)

The generated decoder is manually specified or automatically generated as another component in the system with its own instruction set. The decoder demonstrates how multiple architectures can be coupled to create a more complex architecture. The benefit of this general approach is that the data path and control do not change when the encoder and decoder change. We are actively exploring encoder and decoder strategies, but for this paper our encoder/decoder strategy performs the identity function.

6. Results
In order to test the quality of our generated simulators, we developed a DLX processor and a channel encoding processor. We designed the architectures in our language and then automatically extracted the operations. We then generated interpretive and compiled-code simulators and synthesizable RTL Verilog. Since our C++ simulators are equivalent to RTL simulation, we compare our simulation with Cadence's NC-Verilog simulation of our generated RTL Verilog. Our simulators were compiled using gcc v3.2 with -O3 and were run on a 2.4 GHz Pentium 4 with 1 GB of RAM. The NC-Verilog simulations ran on a dual 900 MHz 64-bit UltraSPARC III with 2 GB of memory. In order to compare the results, we liberally scaled the NC-Verilog numbers by a factor of 2.67 (2400 MHz / 900 MHz).

6.1 DLX Processor
Although our framework is mainly targeted at application-specific programmable cores, in order to compare the effectiveness of our approach with existing methods, we have modeled a general-purpose processing core. This means our model incorporates the characteristic elements of the micro-architecture of the DLX processor [6]. However, since we extract the set of supported operations automatically from the description of the data path, we do not match the binary encoding. Furthermore, the modeled 32-bit DLX is a horizontally-microcoded core supporting arithmetical and logical operations with a five-stage pipeline. Conditional jumps are supported. The model includes instruction and data memory, as well as program counter logic. We extracted 113 operations. Each operation represented one path through a single stage of the pipeline. We then generated the DLX instruction set by defining macro-operations that combine operations from different stages of the pipeline. These macro-operations make programming easier; however, in the end, the assembler expands the macro-operations back to the set of extracted operations. We have implemented three representative benchmark kernels in assembly code: Cyclic Redundancy Check (CRC), Inverse Discrete Cosine Transform (IDCT), and a signal-processing multiply-and-accumulate filter loop including masking (FIR), as used by established benchmarks (EEMBC, DSPStone, and Mediabench).
The lookup-based 32-bit CRC requires only a few arithmetic operations but relatively frequent memory accesses, whereas the complex IDCT has more arithmetic and program-flow constructs but fewer memory accesses. The FIR filter loop, in particular, allows us to stress the pipeline. As a corner case, we also simulate executing NOP operations only. The achieved simulation speed results are listed in Table 1.

design   NC-Verilog   interpretive   compiled    ops/inst
NOP      2.5 MHz      6.9 MHz        588 MHz     1
CRC      KHz          4.6 MHz        85.5 MHz    9.12
IDCT     KHz          4.6 MHz        49.0 MHz    8.65
FIR      KHz          4.0 MHz        40.8 MHz    5.5
Table 1: DLX Simulation Speed Results

The results are reported for 2 billion simulated cycles. We report the virtual running speed on the host in cycles per second, and the ratio of the average number of primitive operations to equivalent DLX pipelined instructions. For instance, a DLX add instruction needs ten operations to execute within five cycles.

6.2 Channel Encoding Processor
We also developed a channel encoding processor capable of performing CRC, UMTS Turbo/convolutional encoding, and convolutional encoding. The design is composed of approximately 60 components that include PC and zero-overhead looping logic, a register file, an accumulator, and a bit manipulation unit. A total of 46 operations were extracted from the design. Half of these were removed after extraction.

bit width   NC-Verilog   interpretive   compiled
            KHz          5.5 MHz        MHz
            KHz          5.7 MHz        MHz
            KHz          5.7 MHz        MHz
            KHz          5.5 MHz        MHz
            KHz          5.4 MHz        MHz
            KHz          5.4 MHz        MHz
            KHz          5.5 MHz        MHz
            KHz          5.2 MHz        MHz
            KHz          5.5 MHz        MHz
Table 2: CEP Simulation Speed Results

For this design, we experimented with various bit widths for the data path. We only had to write the convolutional encoding program in assembly once, since our tools automatically adjust for bit-width changes where possible. The results of the experiment are shown in Table 2. The results are reported for 2 billion simulated cycles. We report raw speeds, since the ratio of instructions to operations is not as relevant with this type of processor. The processor was running an average of five primitive operations per cycle. The running times are only slightly affected by the bit width. However, there was a noticeable drop in performance for bit widths greater than the native 32-bit data path of the Pentium 4.

6.3 Discussion
The actual time taken to create these designs was on the order of an hour. The extraction of operations, generation of simulators, and generation of Verilog were performed in a few seconds. The majority of the design effort was focused on specifying the programs. This effort required renaming the operations for debugging purposes, controlling the pipeline on a cycle-to-cycle basis, and specifying macro-operations to ease programming. We have a compiler that alleviates the need to perform these tasks.

The compiler is still in development, so we did not use it for our experiments. The typical speed of the compiled simulators is about a factor of 20 to 60 slower in cycles per second than the native host speed. The compiled C++ simulator is approximately one order of magnitude faster than the interpretive version and two orders of magnitude faster than the highly-optimized commercial Verilog simulator. Most of the speedup can be attributed to the fact that simulation can be completely statically scheduled. Compared to results in related work, the speed of our simulators meets or exceeds the speed of similar simulation techniques. In the domain of ADLs, recent instruction set simulation results have been reported for ARM7, SPARC, and VLIW cores [7][8]. When we scale the reported MIPS results to our simulation host, the performance of the interpretive simulators is comparable. However, our compiled-code simulators are at least a factor of two faster than ADL-based compiled simulators (this may be a side-effect of our small kernels). When compared to the MIMOLA-based JACOB simulator [9], which is most closely related to our approach, we find that our simulators are an order of magnitude faster. Our speedup can be attributed to a number of factors. First, since each operation is a statically-scheduled state-to-state transformation, we treat each one as a basic block. We can then apply a number of compiler transformations such as copy propagation, dead code elimination, and inlining. Second, we do not need to decode operations; instead, we simply use each one as a label to jump to the corresponding optimized basic block. Using computed gotos in the compiled-code simulator further improves the jump efficiency. Third, unlike JACOB [9], we do not depend on chaining primitive operations, but instead apply optimizations directly on the C code extracted from the action sections of the component descriptions and handle arbitrary bit-width types with the GNUmp libraries [10]. Finally, gcc is applied to the resulting simulator to efficiently map the code to the host.

7. Related Work
A number of approaches to retargetable simulation based on ADLs have been proposed. Frameworks such as FACILE [11], ISDL [12], and Sim-nML [13] are optimized for particular architectural families and cannot capture the range of architectures that we can. More flexible modeling that supports both interpretive and compiled-code simulation is presented in the LISA [7] and EXPRESSION [8] frameworks. All of these approaches require that the designer specify the instruction set, and thus are more suitable for modeling architectures where the instruction set is known. Although we have not applied the just-in-time techniques presented in [7], we have applied a number of optimizations including compiled-code techniques, static analysis, and compiler optimizations. The MIMOLA framework most closely resembles our approach to retargetable simulation. Interpretive and compiled-code simulators have been generated from structural MIMOLA descriptions [9]. Since the structure of the data path and control are specified in MIMOLA, hardware generation is straightforward. The key difference between our approach and MIMOLA lies in the fact that we do not require the control to be specified, and we extract instructions using SAT, not BDDs [2].

8. Conclusion
Simplifying the design process by freeing the designer from concerns about the instruction set and control, and by providing high-performance automatically generated tools, greatly increases the productivity of designers.
Our new design language obviates the need to specify the control and an instruction set, thus allowing designers to focus on the data path. From a description of a data path, we automatically extract the control and instruction set, generate bit-true cycle-accurate interpretive and compiled-code simulators, and generate synthesizable RTL Verilog. A simple horizontally-microcoded control scheme is also generated. Our results have shown that our simulators are one to two orders of magnitude faster than an equivalent NC-Verilog simulation of our generated Verilog. Furthermore, our simulators are up to an order of magnitude faster than existing simulators using similar generation techniques.

9. References
[1] R. Leupers, P. Marwedel. Retargetable Code Generation Based on Structural Processor Description. Design Automation for Embedded Systems, vol. 3, no. 1, Jan. 1998.
[2] R. Leupers. Instruction-Set Extraction. In Retargetable Code Generation for Digital Signal Processors, Kluwer Academic Publishers, 1997.
[3] A. Hoffmann, H. Meyr, and R. Leupers. Architecture Exploration for Embedded Processors with LISA. Kluwer.
[4] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, A. Nicolau. EXPRESSION: A Language for Architecture Exploration through Compiler/Simulator Retargetability. DATE.
[5] A. Fauth, J. Van Praet, M. Freericks. Describing Instruction Set Processors Using nML. ED&TC.
[6] D. A. Patterson, J. L. Hennessy. Computer Organization & Design: The Hardware/Software Interface. Morgan Kaufmann.
[7] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, A. Hoffmann. A Universal Technique for Fast and Flexible Instruction-Set Architecture Simulation. DAC.
[8] M. Reshadi, N. Bansal, P. Mishra, N. Dutt. An Efficient Retargetable Framework for Instruction-Set Simulation. CODES+ISSS.
[9] R. Leupers, J. Elste, and B. Landwehr. Generation of Interpretive and Compiled Instruction Set Simulators. ASP-DAC.
[10] GNUmp.
[11] E. Schnarr, M. Hill, J. R. Larus. Facile: A Language and Compiler for High-Performance Processor Simulators. PLDI.
[12] G. Hadjiyiannis, S. Hanono, S. Devadas. ISDL: An Instruction Set Description Language for Retargetability. DAC.
[13] M. Hartoog, J. A. Rowson, P. D. Reddy, S. Desai, D. D. Dunlop, E. A. Harcourt, N. Khullar. Generation of Software Tools from Processor Descriptions for Hardware/Software Codesign. DAC.


More information

A Retargetable Micro-architecture Simulator

A Retargetable Micro-architecture Simulator 451 A Retargetable Micro-architecture Simulator Wai Sum Mong, Jianwen Zhu Electrical and Computer Engineering University of Toronto, Ontario M5S 3G4, Canada {mong, jzhu@eecgtorontoedu ABSTRACT The capability

More information

Register Transfer Level in Verilog: Part I

Register Transfer Level in Verilog: Part I Source: M. Morris Mano and Michael D. Ciletti, Digital Design, 4rd Edition, 2007, Prentice Hall. Register Transfer Level in Verilog: Part I Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National

More information

MARIE: An Introduction to a Simple Computer

MARIE: An Introduction to a Simple Computer MARIE: An Introduction to a Simple Computer 4.2 CPU Basics The computer s CPU fetches, decodes, and executes program instructions. The two principal parts of the CPU are the datapath and the control unit.

More information

arxiv: v1 [cs.pl] 30 Sep 2013

arxiv: v1 [cs.pl] 30 Sep 2013 Retargeting GCC: Do We Reinvent the Wheel Every Time? Saravana Perumal P Department of CSE, IIT Kanpur saravanan1986@gmail.com Amey Karkare Department of CSE, IIT Kanpur karkare@cse.iitk.ac.in arxiv:1309.7685v1

More information

PINE TRAINING ACADEMY

PINE TRAINING ACADEMY PINE TRAINING ACADEMY Course Module A d d r e s s D - 5 5 7, G o v i n d p u r a m, G h a z i a b a d, U. P., 2 0 1 0 1 3, I n d i a Digital Logic System Design using Gates/Verilog or VHDL and Implementation

More information

Advanced Computer Architecture

Advanced Computer Architecture ECE 563 Advanced Computer Architecture Fall 2007 Lecture 14: Virtual Machines 563 L14.1 Fall 2009 Outline Types of Virtual Machine User-level (or Process VMs) System-level Techniques for implementing all

More information

Timed Compiled-Code Functional Simulation of Embedded Software for Performance Analysis of SOC Design

Timed Compiled-Code Functional Simulation of Embedded Software for Performance Analysis of SOC Design IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 22, NO. 1, JANUARY 2003 1 Timed Compiled-Code Functional Simulation of Embedded Software for Performance Analysis of

More information

CHAPTER 4 MARIE: An Introduction to a Simple Computer

CHAPTER 4 MARIE: An Introduction to a Simple Computer CHAPTER 4 MARIE: An Introduction to a Simple Computer 4.1 Introduction 177 4.2 CPU Basics and Organization 177 4.2.1 The Registers 178 4.2.2 The ALU 179 4.2.3 The Control Unit 179 4.3 The Bus 179 4.4 Clocks

More information

Superscalar Machines. Characteristics of superscalar processors

Superscalar Machines. Characteristics of superscalar processors Superscalar Machines Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any performance

More information

Design for Verification in System-level Models and RTL

Design for Verification in System-level Models and RTL 11.2 Abstract Design for Verification in System-level Models and RTL It has long been the practice to create models in C or C++ for architectural studies, software prototyping and RTL verification in the

More information

COMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital

COMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital hardware modules that accomplish a specific information-processing task. Digital systems vary in

More information

structure syntax different levels of abstraction

structure syntax different levels of abstraction This and the next lectures are about Verilog HDL, which, together with another language VHDL, are the most popular hardware languages used in industry. Verilog is only a tool; this course is about digital

More information

Here is a list of lecture objectives. They are provided for you to reflect on what you are supposed to learn, rather than an introduction to this

Here is a list of lecture objectives. They are provided for you to reflect on what you are supposed to learn, rather than an introduction to this This and the next lectures are about Verilog HDL, which, together with another language VHDL, are the most popular hardware languages used in industry. Verilog is only a tool; this course is about digital

More information

Real instruction set architectures. Part 2: a representative sample

Real instruction set architectures. Part 2: a representative sample Real instruction set architectures Part 2: a representative sample Some historical architectures VAX: Digital s line of midsize computers, dominant in academia in the 70s and 80s Characteristics: Variable-length

More information

SAE5C Computer Organization and Architecture. Unit : I - V

SAE5C Computer Organization and Architecture. Unit : I - V SAE5C Computer Organization and Architecture Unit : I - V UNIT-I Evolution of Pentium and Power PC Evolution of Computer Components functions Interconnection Bus Basics of PCI Memory:Characteristics,Hierarchy

More information

Flexible and Formal Modeling of Microprocessors with Application to Retargetable Simulation

Flexible and Formal Modeling of Microprocessors with Application to Retargetable Simulation Flexible and Formal Modeling of Microprocessors with Application to Retargetable Simulation Wei Qin Sharad Malik Princeton University Princeton, NJ 08544, USA Abstract Given the growth in application-specific

More information

Index. object lifetimes, and ownership, use after change by an alias errors, use after drop errors, BTreeMap, 309

Index. object lifetimes, and ownership, use after change by an alias errors, use after drop errors, BTreeMap, 309 A Arithmetic operation floating-point arithmetic, 11 12 integer numbers, 9 11 Arrays, 97 copying, 59 60 creation, 48 elements, 48 empty arrays and vectors, 57 58 executable program, 49 expressions, 48

More information

William Stallings Computer Organization and Architecture. Chapter 12 Reduced Instruction Set Computers

William Stallings Computer Organization and Architecture. Chapter 12 Reduced Instruction Set Computers William Stallings Computer Organization and Architecture Chapter 12 Reduced Instruction Set Computers Major Advances in Computers(1) The family concept IBM System/360 1964 DEC PDP-8 Separates architecture

More information

Reduced Instruction Set Computers

Reduced Instruction Set Computers Reduced Instruction Set Computers The acronym RISC stands for Reduced Instruction Set Computer. RISC represents a design philosophy for the ISA (Instruction Set Architecture) and the CPU microarchitecture

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Architecture Description Language (ADL)-Driven Software Toolkit Generation for Architectural Exploration of Programmable SOCs

Architecture Description Language (ADL)-Driven Software Toolkit Generation for Architectural Exploration of Programmable SOCs Architecture Description Language (ADL)-Driven Software Toolkit Generation for Architectural Exploration of Programmable SOCs PRABHAT MISHRA University of Florida and AVIRAL SHRIVASTAVA and NIKIL DUTT

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

Code Compression for DSP

Code Compression for DSP Code for DSP Charles Lefurgy and Trevor Mudge {lefurgy,tnm}@eecs.umich.edu EECS Department, University of Michigan 1301 Beal Ave., Ann Arbor, MI 48109-2122 http://www.eecs.umich.edu/~tnm/compress Abstract

More information

Draft Standard for Verilog. Randomization and Constraints. Extensions

Draft Standard for Verilog. Randomization and Constraints. Extensions Draft Standard for Verilog Randomization and Constraints Extensions Copyright 2003 by Cadence Design Systems, Inc. This document is an unapproved draft of a proposed IEEE Standard. As such, this document

More information

Evolution of ISAs. Instruction set architectures have changed over computer generations with changes in the

Evolution of ISAs. Instruction set architectures have changed over computer generations with changes in the Evolution of ISAs Instruction set architectures have changed over computer generations with changes in the cost of the hardware density of the hardware design philosophy potential performance gains One

More information

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions

More information

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Word-Level Equivalence Checking in Bit-Level Accuracy by Synthesizing Designs onto Identical Datapath

Word-Level Equivalence Checking in Bit-Level Accuracy by Synthesizing Designs onto Identical Datapath 972 PAPER Special Section on Formal Approach Word-Level Equivalence Checking in Bit-Level Accuracy by Synthesizing Designs onto Identical Datapath Tasuku NISHIHARA a), Member, Takeshi MATSUMOTO, and Masahiro

More information

ABC basics (compilation from different articles)

ABC basics (compilation from different articles) 1. AIG construction 2. AIG optimization 3. Technology mapping ABC basics (compilation from different articles) 1. BACKGROUND An And-Inverter Graph (AIG) is a directed acyclic graph (DAG), in which a node

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Computer Organization CS 206 T Lec# 2: Instruction Sets

Computer Organization CS 206 T Lec# 2: Instruction Sets Computer Organization CS 206 T Lec# 2: Instruction Sets Topics What is an instruction set Elements of instruction Instruction Format Instruction types Types of operations Types of operand Addressing mode

More information