For our next chapter, we will discuss the emulation process, which is an integral part of virtual machines.


1 For our next chapter, we will discuss the emulation process, which is an integral part of virtual machines.


3 For today's lecture, we'll start by defining what we mean by emulation. Specifically, in this section, we'll focus on how to emulate the instructions for one machine on another machine. There are two basic methods for emulating an instruction set. The first is interpretation. We'll discuss basic techniques for building an interpreter, including basic, indirect threaded, and direct threaded interpretation. The other is binary translation, which is basically compiling sections of the source binary to the target platform. There are a number of issues that arise during binary translation, including code discovery and code location, which have to do with separating instructions from data in the source binary. We'll also look at other issues, such as mapping registers in the source machine to registers on the target platform. Next, we'll look at optimizations to deal with control transfers in the translated code, and we'll end by looking at some issues with translating specific parts of the

4 instruction set.

5 Let's talk about a few of the definitions we'll use in our discussion going forward. First, the book defines emulation as the process of implementing the interface or functionality of one system on a different system. Notice that this is not very different from our definition of virtualization, and taken in this general sense, I think you could argue that virtualization is just a form of emulation. However, for our discussion, we're going to narrow the definition of emulation and say that it applies specifically to instruction sets. So, emulation is the process of taking one instruction set and implementing its interface or functionality on a machine with a different instruction set. There are different techniques for doing emulation, which we'll discuss over the next few lectures. The first is interpretation, which is basically instruction-at-a-time translation of the source instructions. Interpretation is simple to implement, but relatively slow compared to binary translation. The other strategy is binary translation, which is basically block-at-a-time translation of the source program. This technique can improve performance over interpretation,

6 but it's a bit more challenging to implement efficiently. It takes time to compile blocks of code, so you have to prioritize which parts of the program you want to compile first. Also, you have to keep this compiled code in memory and have a way to transfer control from compiled to non-compiled blocks, and vice versa. Now, I want to talk about how emulation is related to the term simulation, because the terms are often confused. They are related, but they are different concepts. Simulation is a method for modeling a system's operation. So, simulators are used when you need to understand how something works, but don't have the means to set up an experiment on a real machine. For instance, in my lab, we are trying to study heterogeneous memory architectures (systems with multiple types of memory), but these systems do not exist yet. So, we use a simulator called Ramulator to simulate the operation of multiple types of memory in one machine. Emulation is often part of the process of simulation, but simulators will often implement much more functionality to give you information about how a system works. For instance, there are processor simulators that will actually execute the application, but, additionally, they provide information about how many cycles the processor would require to execute the application or how efficiently the application will utilize processor caches.

7 Recall from our discussion on virtual machines that the guest refers to the system or interface that will be supported by the underlying platform. The host refers to the underlying platform that is used to provide an environment for the guest.

8 In the context of instruction set emulation, we use the terms source and target to refer to the instructions that participate in the emulation. The source ISA (or binary) is the original instruction set or binary file that needs to be emulated. The target ISA (or binary) is the ISA of the host processor you want to use to run your source instructions. So, we need to somehow translate the source instructions to the target ISA to emulate them on the host platform. I'll try to use the terms source and target when referring to instruction sets that are emulated, and guest and host when referring to platforms that are virtualized. The terms are very similar, and even in the literature, they are not always used consistently.

9 OK, so for this lecture, we will be primarily concerned with instruction set emulation, since that is a key aspect of most virtual machine implementations. Our definition for instruction set emulation is on this slide. We say the source is emulated by the target if binaries in the source instruction set can be executed on a machine implementing the target instruction set. As we've discussed, this capability is required for many VM implementations. An example of instruction set emulation is the IA-32 execution layer we discussed last time. Basically, this is software that Intel developed to enable the execution of older 32-bit x86 executables on 64-bit Itanium processors. The IA-32 EL translated 32-bit x86 instructions to the Itanium's 64-bit instruction set, and it was integrated into the OS, so it did this emulation seamlessly and transparently to the upper-level applications.

10 You can think of ISA emulation techniques as existing on a spectrum, where different techniques require different amounts of computing resources and offer different performance and portability characteristics. On one end of the spectrum is the straightforward method of interpretation, and on the other is binary translation. Interpretation involves a cycle of fetching a source instruction, analyzing it, performing the required operation, and then fetching the next source instruction. Interpretation is the simplest emulation technique, but it typically has poor performance. Interpreters are also often implemented in a high-level language, such as C, so they're often portable. Binary translation tries to amortize the fetch and analysis costs by translating a block of source instructions to a block of target instructions, and then saving the translated block for repeated use. Implementing binary translation is more complex, and it requires a higher initial cost to

11 translate the blocks, but it can pay dividends by providing better long-term performance. There are other techniques that attempt to eliminate the drawbacks of both approaches. Predecoding is a preprocessing step for interpretation that does some of the work of interpreting the instructions beforehand to speed up the process of interpretation. And selective compilation is sort of a hybrid approach that uses interpretation early in the run and for sections of code that are not executed very often, and binary translation for sections of code that are expected to be hot. We'll talk about each of these in more depth later in the lecture.

12 Let's first talk about how interpretation of the source ISA is implemented. The interpreter program has to maintain the complete architected state of a machine implementing the source ISA. So, this figure shows that the interpreter maintains an image of all of the guest's memory, including code, data, and stack regions for the executable. Additionally, the interpreter holds a table called the context block. The context block contains the various components of the source's architected state, including general-purpose registers, the program counter, condition codes, and miscellaneous control registers. (Point out the context block.) Ask: what is the condition codes register?

13 The simplest interpreter implementation is known as a decode-and-dispatch interpreter. Its implementation is structured around a simple loop that steps through the program, one instruction at a time, and modifies the state of the source according to the instruction. For each iteration of the loop, it decodes the current instruction and dispatches it to an interpretation routine based on the type of the instruction. The code on this slide shows a decode-and-dispatch loop for interpreting the PowerPC ISA. In this example, the source instructions are kept in an array called code, and PC is the current value of the source's program counter. So, we first get the next instruction by indexing into the code array. Next, the extract function extracts the opcode from the current instruction. (bit slicing)

14 Next, we enter a switch statement where, based on this opcode, we will jump to an interpreter routine that implements this source instruction on the target machine.

15 Here's an example interpreter routine for the LoadWordAndZero source instruction. The Load Word and Zero instruction loads a 32-bit word into a 64-bit register and zeroes the upper 32 bits of the register; it is the basic PowerPC load word instruction.

16 The ALU instruction is actually a stand-in for a number of PowerPC instructions that have the same primary opcode but are distinguished by different extended opcodes. For instructions of this type, two levels of decoding (via switch statements) are used. So, I look at this implementation and think interpretation must be really slow. Why? Interpretation of a single source instruction will often require the execution of tens of instructions on the target machine.

17 So, let's talk about the efficiency of this interpreter implementation. We all know that branch instructions are harmful to the performance of a pipelined processor. Why? Since the processor does not know ahead of time which path will be taken, it cannot fetch and decode the next instruction until it knows the result of the branch. There are at least five branch instructions for every iteration of the decode-and-dispatch loop in this implementation: a test for a halt or an interrupt at the top of the loop, a register-indirect branch for the switch statement, a branch to the interpreter routine, a second register-indirect branch to return from the interpreter routine, and, finally, a backward branch to terminate the loop. Performance of indirect branches is even worse, because it is more difficult to predict which instruction an indirect branch will jump to.

18 What is an indirect branch? Why is their prediction difficult? A normal branch is either taken or not taken, while an indirect branch may jump to any instruction in the program. Also, as we noted before, interpreting even a simple instruction requires tens of target instructions. A simple add will require 20 target instructions, and several of these instructions might require expensive loads or stores to memory. Many interpreters use hand-coded assembly to try to improve performance. Even just saving a few cycles from each iteration of the interpreter loop can yield significant performance gains when your interpreter application spends almost all of its time in the decode-and-dispatch loop. Of course, this hurts portability. HotSpot uses an interpreter written in assembly. There is also a zero-assembly project that implements the interpreter in a high-level language.

19 So, we have seen that the simple decode-and-dispatch interpreter has a high number of branch instructions for each interpreted instruction, and that this can hurt performance. We counted at least five branches for each interpreted instruction. There is actually a simple technique that can be used to improve performance over this naïve decode-and-dispatch implementation. This technique is called threaded interpretation, and the basic idea is to simply append the code that dispatches to the next instruction interpreter routine to the end of each of the instruction interpretation routines. The idea is that we're spending a lot of time branching back and forth from the main interpreter loop to these instruction interpretation routines. By appending this dispatch code to the end of each interpreter routine, we can save ourselves a lot of branching. Specifically, we save three branches per interpreted instruction. It's called threaded interpretation because it threads the instruction interpretation

20 routines together so that you don't have to jump back to the main interpreter loop. Let's look at an example implementation.

21 OK, this code shows the implementation of threaded interpretation with the two example instruction interpretation routines we had seen before. Notice that the dispatch code is appended at the end of the interpretation routine. We find the address of the next routine to jump to by looking it up in a table, and then we just directly jump to that address. In this way, we avoid the main interpreter loop altogether. So, there are only two branches left in this code (the test for exit and the jump to the next instruction routine). Since there is a branch in every interpreter routine, we may be able to do a better job of branch prediction if instruction A shows a tendency to follow instruction B (like a jump following a compare). There will be multiple entries in the branch prediction table, one for each interpreter routine, each with its own prediction history. Suppose we have five instructions in a loop:

22 1. load 2. add 3. mult 4. store 5. branch to 1
Simple predictor: predict the same as the last branch. The decode-and-dispatch interpreter mispredicts on every instruction. The threaded interpreter will correctly predict on every iteration after the first one. [Show on board.]
Decode-and-dispatch (looking only at the switch branch):
Prediction : Actual : Result
1) (no prediction) : Load : miss
2) Load : Add : miss
3) Add : Mult : miss
4) Mult : Store : miss
5) Store : Branch : miss
6) Branch : Load : miss
7) Load : Add : miss
Threaded:
Prediction : Actual : Result
1) (no prediction) : Add : miss (the other instructions execute in the loop)
1) Add : Add : hit


25 So, we call this technique indirect threaded interpretation because the dispatch now occurs indirectly through a table. Notice that, for each interpreted instruction, we look up the address of the next interpreter routine in a table and use an indirect jump to jump to that address. One of the advantages of this approach is that the instruction interpretation routines do not refer directly to the addresses of the other interpretation routines. In this way, the interpretation routines can be modified and relocated independently. The other advantage, of course, is that this approach improves efficiency over the basic decode-and-dispatch interpreter implementation by removing branch instructions. The main disadvantage is that it increases the code size of the interpreter (it replicates the dispatch code).
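As a sketch of what the appended dispatch code looks like, here is a tiny indirect threaded interpreter in C. It relies on GCC's computed-goto extension (the &&label operator), and the three-opcode accumulator ISA is invented for illustration. Note how every routine ends with the same table lookup plus indirect jump, so control never returns to a central loop.

```c
#include <stdint.h>

/* Made-up accumulator ISA: 0 = increment, 1 = double, 2 = halt. */
enum { T_INC = 0, T_DBL = 1, T_HALT = 2 };

int64_t run_threaded(const uint8_t *code) {
    /* Dispatch table: opcode -> interpreter routine address.
       (GCC extension: &&label takes the address of a label.) */
    static void *table[] = { &&do_inc, &&do_dbl, &&do_halt };
    int64_t acc = 0;
    uint32_t pc = 0;

    goto *table[code[pc]];      /* initial dispatch */

do_inc:
    acc += 1;
    pc++;
    goto *table[code[pc]];      /* dispatch code appended to the routine */
do_dbl:
    acc *= 2;
    pc++;
    goto *table[code[pc]];      /* same dispatch sequence again */
do_halt:
    return acc;
}
```

The dispatch is "indirect" in exactly the sense on the slide: the jump target is fetched from the table, so the routines never name each other and can be relocated independently.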

26 This figure shows the difference in the data and control flow between the decode-and-dispatch and threaded interpreter techniques. The dotted lines show the memory access that is needed to fetch the next source instruction. This figure really shows where the savings come from with this approach. Notice that, in decode-and-dispatch, we're constantly jumping from the dispatch loop to the interpreter routines and then back again, while with threaded interpretation, we just jump directly to the next interpreter routine after finding its address in the table. (The interpreter routines are not subroutines in the usual sense; they are simply pieces of code that are threaded together.)

27 Now, let's talk about another optimization we can perform over threaded interpretation. This optimization allows even greater efficiency, but at the cost of a considerable loss in portability. The idea here is to reduce interpretation cost by doing some preprocessing on the source instructions to put them into a form that is easier to interpret. With our original interpreter design, every instruction is decoded as we interpret the instruction. This means that, for every instruction we interpret, we use the extract function to shift out bits in the instruction and find its opcode and operands. This can be an expensive operation, especially if we're interpreting instructions in a loop: for every iteration of the loop, we have to decode the same instructions over and over again. Predecoding saves this cost by statically parsing each instruction into some predefined data structure and storing this predecoded form with the source binary. This slide shows what predecoding might do for three instructions in the PowerPC

28 ISA. Basically, both the opcode and the operands are decoded so that they are easily accessible later. Although this example is for a RISC machine, I should also note that this optimization is much more important when the source ISA is a CISC ISA, since CISC ISAs often have variable-length instructions, which are much more costly to decode than RISC instructions.

29 This slide shows how the code for the LoadWordAndZero interpreter routine might look after predecoding. The structure for the predecoded instruction is also shown. In C, we would just put the opcode and operands into an easy-to-read struct. There are a couple of interesting things to note about this code: 1) The extract function is no longer needed; we just read the predecoded opcodes and operands directly from memory. 2) We now need to maintain a separate program counter for this predecoded code, which is called TPC in the code on this slide. We maintain the original source program counter (SPC on this slide) to keep the correct architected state, while TPC is used to fetch the correct predecoded instructions. While predecoding can improve performance, there are a couple of disadvantages: 1) It requires a preprocessing pass over the source instructions. 2) It changes the input program so that it is no longer portable. Basically, with predecoding, we've already started heading down the path towards binary translation.
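The idea can be sketched in C as follows. The struct fields, opcodes, and 32-bit encoding are assumptions for illustration, not the book's exact layout: all the bit slicing happens once in a separate pass, and the interpreter then reads the fields straight out of the struct.

```c
#include <stdint.h>

/* Predecoded instruction: opcode and operand fields parsed up front. */
typedef struct {
    uint8_t op;          /* opcode, already extracted */
    uint8_t rd, ra, rb;  /* register operands, already extracted */
} PredecodedInst;

enum { PD_ADD = 0, PD_HALT = 1 };

/* One predecoding pass over the raw 32-bit source words
   (assumed encoding: opcode in bits 31-24, operands below). */
void predecode(const uint32_t *code, PredecodedInst *out, uint32_t n) {
    for (uint32_t i = 0; i < n; i++) {
        out[i].op = code[i] >> 24;
        out[i].rd = (code[i] >> 16) & 0x1f;
        out[i].ra = (code[i] >> 8) & 0x1f;
        out[i].rb = code[i] & 0x1f;
    }
}

/* Interpreting the predecoded form: no extract function, no shifting.
   TPC indexes the predecoded array (the SPC is omitted in this sketch,
   since instructions here map 1:1). */
void run_predecoded(const PredecodedInst *code, int64_t *regs) {
    uint32_t tpc = 0;
    for (;;) {
        const PredecodedInst *d = &code[tpc];
        switch (d->op) {
        case PD_ADD:  regs[d->rd] = regs[d->ra] + regs[d->rb]; tpc++; break;
        case PD_HALT: return;
        default:      tpc++; break;
        }
    }
}
```

The win is that the per-iteration field extraction is paid once per static instruction instead of once per executed instruction, which matters most inside loops.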

30 Another technique we can use to further improve interpreter performance is called direct threaded interpretation. The idea here is to remove the memory access that's required to look up the address of the interpreter routine when you have to jump to the next routine. This approach is implemented as part of predecoding. Basically, during the predecoding pass, we replace the opcode of each instruction with the address of the interpreter routine, as shown in the figure on this slide. Now, the interpreter code is only slightly different (next slide). The downside of this technique is that, like predecoding, it reduces portability. Although it's faster, the interpreter code is now dependent on the exact locations of the interpreter routines. So, if the interpreter code is ported to a different target machine, it will have to be regenerated. One way to mitigate this problem is to use the address-of-label operator in C (&&).

31 Basically, we can use this operator to find the addresses of the labels that begin each of the interpreter routines and place those addresses in the predecoded instructions.

32 Notice we no longer have a table lookup to read the address of the routine. We just read op directly from the current code structure we're using. The address data is likely already loaded into the processor caches, because we access code[TPC] just a few instructions prior to reading the address (back a slide).
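Here is a sketch of direct threading using GCC's address-of-label operator. Because &&label is only valid inside the function that owns the labels, the sketch uses a common trick: a setup call that exports the routine addresses, which the predecoder then stores directly in each predecoded instruction. The encoding and all names are invented for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Predecoded instruction whose "opcode" field is a routine address. */
typedef struct {
    void *routine;   /* address of the interpreter routine */
    int64_t imm;     /* operand */
} DirectInst;

enum { D_ADD = 0, D_HALT = 1, D_NOPS = 2 };

/* If code == NULL, just export the label addresses into addrs
   (&&label, a GCC extension, only works inside this function). */
int64_t run_direct(DirectInst *code, void **addrs) {
    if (code == NULL) {
        addrs[D_ADD] = &&do_add;
        addrs[D_HALT] = &&do_halt;
        return 0;
    }
    int64_t acc = 0;
    DirectInst *tpc = code;
    goto *tpc->routine;      /* no table: the address is in the inst */
do_add:
    acc += tpc->imm;
    tpc++;
    goto *tpc->routine;      /* dispatch reads straight from code[tpc] */
do_halt:
    return acc;
}

/* "Predecode": turn (opcode, imm) pairs into DirectInsts by replacing
   each opcode with its routine's address. */
void predecode_direct(int64_t (*src)[2], DirectInst *out, int n) {
    void *addrs[D_NOPS];
    run_direct(NULL, addrs);
    for (int i = 0; i < n; i++) {
        out[i].routine = addrs[src[i][0]];
        out[i].imm = src[i][1];
    }
}
```

Compared with the indirect threaded version, the per-dispatch table load is gone; the cost is that the predecoded program now embeds addresses that are only valid for this build of the interpreter.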

33 This figure shows how direct threaded interpretation works. Basically, the only step that is different here from indirect threaded interpretation is that we run the source through a predecoder that generates an intermediate form that is less portable but faster.

34 So, these optimizations can help speed up interpretation, but realize that interpretation is still going to be much slower than executing the source code directly on a machine that implements the source ISA. We can use another technique, called binary translation, to improve performance even further. Notice that, with predecoding, all the instructions of the same type are executed with the same interpreter routine (see slide 20 or 22). We can speed up emulation even more by converting each source binary instruction to its own customized target code. The process of converting the source binary instructions to corresponding target binary instructions is called binary translation. After a piece of code has been translated, we can just execute it directly on the target machine. There is no need to parse through the source code or jump around to interpret it at all. Binary translation also allows the emulator to apply traditional compiler optimizations to the native code, potentially speeding up the translated code even further. The result is much better performance than what can be achieved with interpretation, but, as you can imagine, the portability of the generated code is

35 completely lost.

36 This figure illustrates the process of binary translation. Compare it to direct threaded interpretation / predecoding. In both approaches, the original source is converted to another form, but with predecoding, interpretation routines are still needed, while with binary translation, we can execute the code directly. A word about portability: we say that predecoding and direct threaded interpretation are less portable because we translate the source code into an intermediate form that can only be used with a particular machine, but we can still write the interpreter in a portable high-level language. With binary translation, we translate the code into a form that can only be used on a particular machine, and we need to write a specialized code generator for every target architecture we want to use.

37 Here's an example of binary translation from an x86 source binary to a PowerPC target binary. In this example, the architectural registers for the source binary are stored in a register context block, which is held in memory on the target. The target retrieves these values from memory and puts them into target registers as it needs them. This is similar to the register context block used by the interpreter. (The RCB is a block of storage containing the architectural state of the source machine; see slide 9.) Some of the registers on the target are permanently assigned to contain or point to certain source resources. For instance, r1 on the target always contains a pointer to the memory that holds the source's register context block, r2 points to the source's memory image, and r3 is mapped to the source's program counter. (Write these on the board.) The process of mapping source registers to target registers is an example of state

38 mapping, that is, the process of maintaining the state of the source machine on the host.

39 Here is the translated code. Notice that, in the translated code, we still need to load the source's register values from memory. But some values, such as the PC, can be updated directly, without having to go to memory, because we've mapped them to registers. Also note that this code can now be executed directly on the target machine; there are no jumps between instructions, and our overhead is much lower now (4 to 7 target instructions per source instruction in this example).

40 Let's talk about the process of register mapping. With binary translation, we can map some of the registers in the source machine directly to registers on the target. This greatly reduces the number of loads and stores to memory, which is very important for performance. In some cases, the number of registers on the target machine may be smaller than the number of registers on the source machine. For instance, consider if you wanted to emulate ARM code on an x86 machine: x86 has far fewer general-purpose registers than ARM. In these cases, you might have to map the registers you don't plan to use as often to locations in memory. Another option is to map the source registers to the target on a per-block basis. That is, when you enter a block of translated code, all of the registers read in the source code are copied from the context block (in memory) into target registers. And

41 then, when the translated block is exited, any modified registers are copied back to the context block. This helps keep down spills to memory in code that might be really hot (for instance, in a loop).

42 This example shows how we can use register mapping to further improve performance for translated code. This is a translation for the same three instructions we just saw, but now we are using register mapping to map the contents of eax and edx on the x86 to r4 and r7 on the PowerPC. Now, instead of requiring 16 instructions to emulate the 3 x86 instructions, we only need 7 instructions when we use register mapping. Performance is starting to get closer to what would be expected with native execution.
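A sketch of per-block register mapping, with the translated block written out by hand in C (the context-block layout, the eax/edx slot indexes, and the block's two-instruction body are all assumptions for illustration, not the slide's actual translation): registers are copied in once on block entry, the body runs entirely in "target registers," and modified values are copied back once on exit.

```c
#include <stdint.h>

/* Register context block held in memory on the target. */
typedef struct {
    int64_t regs[8];
} ContextBlock;

enum { REG_EAX = 0, REG_EDX = 2 };  /* assumed context-block slots */

/* Hand-written "translated block" for a made-up source sequence:
   eax = eax + edx; edx = edx + 1 */
void translated_block(ContextBlock *ctx) {
    /* Block entry: copy in the source registers this block reads
       (eax -> r4, edx -> r7, mirroring the slide's mapping). */
    int64_t r4 = ctx->regs[REG_EAX];
    int64_t r7 = ctx->regs[REG_EDX];

    /* Block body: runs entirely in target registers, no memory traffic. */
    r4 = r4 + r7;
    r7 = r7 + 1;

    /* Block exit: copy back only the modified registers. */
    ctx->regs[REG_EAX] = r4;
    ctx->regs[REG_EDX] = r7;
}
```

The point of the per-block discipline is visible here: however many times the body touches r4 and r7, the context block in memory is read and written only at the block boundaries.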

43 At this point, you may be thinking that maybe it makes sense to predecode or binary translate an entire program before beginning emulation. However, this sort of static predecoding or translation is difficult, if not impossible, to implement for many source ISAs. The reason is the code discovery problem: it turns out that it's often difficult to find the beginning of each and every instruction in the source binary. To see why, consider the x86 code on this slide. If you were parsing through this code in your translator and you came to the byte 8b, it's not obvious whether 8b marks the start of a sequence beginning with a movl instruction (8b is the opcode for the movl instruction), or whether 8b is the end of the previous instruction and b5 starts the next instruction (which would then be a mov).

44 There are a number of issues with low-level machine code representations that contribute to the code discovery problem. A major issue is that CISC machines, such as x86, allow variable-length instructions, and so x86 instructions can begin at any byte in the source binary. In contrast, RISC ISAs usually enforce the property that instructions must begin on a word boundary. Another issue is indirect jumps. (What is an indirect jump?) The target of the jump is held in a register, and it is often difficult (if not impossible) during the translation process to determine what the value of that register will be at runtime. This might not be so bad if you knew that the bytes following the jump were actually instructions that you need to translate. But many compilers will actually intersperse code with program data or padding, so it's often not even possible to know whether the byte you're considering is part of an instruction, or is data or padding. The padding I'm referring to is that some compilers will pad the instruction stream with unused bytes in order to align branch or jump targets on word or cache line

45 boundaries for performance reasons.

46 In addition to the code discovery problem, we also have the related problem of code location. Recall that, during emulation, we maintain two program counters: one that we use to traverse the source instructions, and another that is used on the target machine (this is the regular PC). Having two PCs can cause a problem when we have indirect jumps in the source code. What happens is that the address held in the register (that is, the address that we are going to jump to) is a source address, even though it occurs in the translated code. Thus, we need some way to map a source PC address to a target PC address so that we know where to jump in the translated code. For instance, the code in this example will not work, since the target code cannot jump to an address in the source.

47 Now, code discovery and code location are difficult problems that require sophisticated solutions. But there are some special cases where solutions are simpler. One of these is to simply use instruction sets with instructions that are always aligned on fixed boundaries, as is typical for RISC architectures. Using RISC will not completely eliminate the problems, but it does make them easier. One special case that solves both code discovery and code location is to use an ISA that is specifically designed to be emulated, such as Java bytecodes. Java does not allow jumps or branches to arbitrary locations, and there is no data or padding interspersed with the bytecode instructions. This design effectively eliminates the code discovery and code location problems and allows all code to be discovered statically.

48 There is no completely static general solution to these problems. However, people have developed some general solutions that translate the binary dynamically (i.e., while the program is operating on actual input data) and incrementally (i.e., they translate sections of the program as they're reached during execution). So, the key aspect of this approach is that we use interpretation to solve the code discovery problem. We interpret the original source program (no predecoding) and translate blocks of the source to the target during execution. The process is managed by a unit called the emulation manager. (next slide)

49 This figure illustrates the process. So, here we have an interpreter operating on the source binary. As blocks of code are translated, the translated code is placed into a region of memory so that it can be reused later. This region of memory is called the code cache. Sometimes the code cache can become quite large, so we need some way of managing its size. For instance, we might have code cache eviction policies that aim to remove code that we don't expect to use in the future. We'll discuss code cache management later; for now, just assume the code cache is large enough to hold all the compiled code. We also have a component called a map table. This component associates the source PC for a block of source code with the target PC for the corresponding block of translated code. So, the source PC is going to come from the interpreted or translated code. Say we're

50 executing this code, and we get a branch instruction to some source PC. The EM will then check to see if the SPC is in the map table, and, if it is, it will return the corresponding target PC value (which points somewhere in the code cache), where the program can now jump to. If it's not in the map table, that indicates it hasn't been translated yet, so we continue execution by going through the interpreter (where we can jump to the instruction in the source binary directly).
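The map table lookup just described can be sketched in C as a small hash table (the open-addressing scheme, table size, and hash function are arbitrary choices for illustration): lookup either returns the translated block's entry point in the code cache, or NULL, which tells the EM to fall back to the interpreter and possibly translate the block now.

```c
#include <stdint.h>
#include <stddef.h>

/* Map table: associates a source PC with the target PC of its
   translated block in the code cache. */
#define MAP_SIZE 1024   /* power of two, chosen arbitrarily */

typedef struct {
    uint64_t spc;   /* source PC (block start address) */
    void *tpc;      /* target PC: entry point in the code cache */
} MapEntry;

static MapEntry map_table[MAP_SIZE];

static size_t map_hash(uint64_t spc) {
    return (spc >> 2) & (MAP_SIZE - 1);
}

/* Record a newly translated block (assumes the table never fills). */
void map_insert(uint64_t spc, void *tpc) {
    size_t i = map_hash(spc);
    while (map_table[i].tpc != NULL)       /* linear probing */
        i = (i + 1) & (MAP_SIZE - 1);
    map_table[i].spc = spc;
    map_table[i].tpc = tpc;
}

/* Returns the translated entry point, or NULL if not yet translated. */
void *map_lookup(uint64_t spc) {
    size_t i = map_hash(spc);
    while (map_table[i].tpc != NULL) {
        if (map_table[i].spc == spc)
            return map_table[i].tpc;
        i = (i + 1) & (MAP_SIZE - 1);
    }
    return NULL;
}
```

This lookup is exactly the EM's hit/miss decision: a hit means "jump into the code cache," a miss means "keep interpreting from this SPC."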

51 So, I said the system does its translation one block at a time. Let's talk about what that means. A natural unit for a simple incremental translation scheme is what is known as a dynamic basic block. In static compilers, what is a basic block? A static basic block contains a sequence of instructions with a single entry point and a single exit point. What starts a basic block? The first program instruction, the target of a branch or jump, or the instruction immediately after a branch or jump. A dynamic basic block, by contrast, is determined by the actual flow of the program as it executes. A dynamic basic block starts at the instruction immediately following a branch or jump, follows the sequential instruction stream, and ends with the next branch or jump.

52 So, here's an example showing the difference between static basic blocks and dynamic basic blocks. Notice that the instruction at the label loop happens to be the target of a branch. So, in a static compiler, that would start a new basic block. However, in a runtime system, if you were computing dynamic basic blocks and were executing this code sequentially starting from the top, that load instruction would just be part of the first dynamic basic block. Hence, dynamic BBs tend to be larger than static BBs. Also, note that the same static instructions could belong to more than one dynamic BB. For example, the add instruction at label skip (on the static side) belongs to dynamic basic block 2 as well as the shorter dynamic basic block 4. 37

53 So, the entire process basically works like this: After the source binary is loaded into memory, the EM begins interpreting the binary using a simple interpretation scheme. As it proceeds, the system dynamically translates the source instructions into blocks of translated target binary code. These translated blocks are placed into the code cache, and the corresponding SPC-to-TPC mapping is placed into the map table. For each translated block, translation stops when the next branch or jump instruction is encountered. 38

54 The process for the emulation manager is summarized on this flow chart. Basically, we start with a source PC from the interpreter (or already translated code), and we look it up in the map table. If there's a hit in the map table, that means the block has already been translated, so we execute the translated block and then get the next source PC to continue execution. If there's no hit in the map table, we use the source PC to read instructions from the source binary. We interpret those instructions, perhaps translate them, and place the translated code into the code cache. So, in this way, the program is discovered and translated incrementally, until eventually only translated code is being executed. What happens if the source code jumps to the middle of some block that has already been translated? It could happen, and it could be a complicated issue. It's typically not a problem, though, because we just treat the code as a new dynamic basic block and always start a 39

55 new translation. 39
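The whole flow chart can be sketched in a few lines of Python; translate, run, and done here are stand-ins for real binary translation, code-cache execution, and program exit, so treat this as an illustration of the control flow only:

```python
def emulate(start_spc, translate, run, done):
    """Incrementally discover and translate the program, as in the
    EM flow chart: map-table hit -> execute; miss -> translate first."""
    map_table, code_cache = {}, {}
    spc = start_spc
    while not done(spc):
        tpc = map_table.get(spc)
        if tpc is None:                       # miss: translate this block
            tpc = len(code_cache)             # next free code-cache slot
            code_cache[tpc] = translate(spc)  # place block in code cache
            map_table[spc] = tpc              # record the SPC -> TPC mapping
        spc = run(code_cache[tpc])            # execute; returns the next SPC
    return map_table
```

Each block is translated at most once; after the working set has been discovered, every iteration takes the map-table-hit path.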

56 It is important for this system to keep track of the source PC at all times while emulation is taking place. The interpreter uses the source PC directly when it fetches source instructions. But when the interpreter, or a block of translated code, transfers control back to the EM, we have to pass the value of the source PC back to the EM so we know what the next instruction should be. So, how should we transfer the source PC to the EM? One way is to simply map the source PC to a register. We just need to make sure the register is updated before transferring control back to the emulation manager. Another way, which can save a register (and possibly improve performance), is to use the branch and link instruction. The branch and link instruction is used for function calls. Basically, the B&L instruction places the address of the next instruction into the link register so that the subroutine you called can branch back to that instruction when it's done. 40

57 For the purposes of feeding the source PC back to the EM, we can place the source PC into a stub after a branch and link instruction that branches to the EM. In this way, the link register will hold the address of the stub, and the EM can load the source PC directly from there. 40

58 So, with the simple translation system I've described so far, every time a translated block finishes execution, the EM has to be re-entered and we have to do a source PC to target PC lookup in the map table. There are a number of optimizations we can use to reduce this overhead. The simplest optimization is called translation chaining; it is similar to the threading optimization we saw for interpreters. (next slide) The address of the next block is determined by mapping the source PC of the next instruction to the target PC of its translated code using the map table. If the successor block has not yet been translated, then we just insert stub code to jump back to the EM. The next time this block exits to the emulation manager, the EM can check whether the code is now translated. If it is, the EM can find the correct source PC to target PC mapping in the map table, and then overwrite the stub code in the predecessor 41

59 block. 41
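The chaining step itself amounts to patching the predecessor's exit stub once the successor shows up in the map table. A hedged sketch, with a made-up block representation:

```python
def chain(blocks, map_table, pred_spc, succ_spc):
    """Overwrite the predecessor's exit-to-EM stub with a direct jump
    to the successor's translated code, if it has been translated."""
    succ_tpc = map_table.get(succ_spc)
    if succ_tpc is None:
        return False                               # not translated yet: keep the stub
    blocks[pred_spc]["exit"] = ("jump", succ_tpc)  # patch: branch straight there
    return True
```

Once patched, control flows from block to block entirely inside the code cache, never re-entering the EM along that edge.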

60 This figure illustrates the benefit of translation chaining. So, with the simple dynamic translation scheme, we translate blocks one at a time, and each translated block jumps back to the EM after execution. With translation chaining, we still translate blocks one at a time, but, during translation, we attempt to link them together into chains. To chain two blocks together, we replace the branch back to the EM with a branch directly to the next translated block. 42

61 This figure illustrates the process of installing a link from a predecessor block to the successor block. 1) We reach the end of the predecessor, where there is a branch & link to the EM followed by the next source PC. 2) The EM reads the SPC (via the link register) and then looks it up in the map table to get the target PC. 3) At that point, the EM can set up a link to the successor by overwriting the branch and link instruction in the predecessor block. 43

62 This slide shows how the code might be modified to install a branch chain. So, notice in this case, we have a conditional branch at the end of the dynamic BB. For the true branch, we replace the branch to the emulation manager (bl F000) with a branch directly to the successor block (b 9c08). 44

63 Translation chaining works in situations where you know the branch target is never going to change. But what about situations with indirect jumps? In these cases, the target address might change from one execution to the next, so we can't replace the jump with a direct jump to a static address. The easiest way to handle this situation is to just always go through the EM to find the correct target PC when the source PC is known. But there are some optimizations for handling indirect jumps. The key here is that, in many cases, even though the source code uses an indirect jump, the target of the indirect jump seldom changes. So, we can use profiling to determine which addresses are most likely to be the targets of a particular jump instruction. Then, we can inline the most frequently used target addresses into the translated binary code. This is illustrated in the code on this slide. Rx symbolizes a register holding the 45

64 indirect jump's source PC value. Now, in a series of if statements, we check if Rx is equal to the most frequent source PC values we have seen for this jump. On a match, we just directly jump to the corresponding target PC value, avoiding the expensive jump back to the emulation manager. Typically, the comparisons are ordered so that we check the most frequent source PC destinations first. If all the predictions are wrong, then the map table lookup has to be performed anyway, so we lose the performance benefit of this technique. 45
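In Python rather than the inlined if statements of the translated code, the prediction chain behaves like this (the profile data and addresses are invented for illustration):

```python
def indirect_jump(rx, predictions, map_table):
    """rx holds the indirect jump's runtime source PC. predictions is a
    list of (spc, tpc) pairs ordered most frequent first."""
    for spc, tpc in predictions:   # the inlined compare-and-branch chain
        if rx == spc:
            return tpc             # predicted correctly: skip the EM
    return map_table[rx]           # all predictions wrong: full lookup
```

The loop body corresponds to one inlined compare-and-branch pair in the translated code; the final line is the fallback trip through the EM.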

65 There are a number of issues that make incremental code translation a bit challenging. The first is one we've already discussed: it is important to keep track of the source PC during the emulation process, so that if we need that bit of state at any point in the code, or if we need to fetch source instructions that aren't already translated, then we have access to it. We already discussed a technique using the branch & link instruction to pass the source PC to the EM at the end of a translated block (see slide 40). Another issue is self-modifying code. Although it is relatively uncommon, some applications may perform stores in the code area. When this happens, code that we've already translated may no longer be valid, so we need some mechanism to possibly re-translate it. We'll talk more about self-modifying code in the next section. Another issue is self-referencing code. This is when the program performs loads from the code area. In these cases, the data that is read must correspond to the original 46

66 source code, not the already translated version. We'll also talk about self-referencing code in the next set of slides. Lastly, if the translated code raises some sort of exception condition, the correct state of the original source code (including the source PC of the trapping instruction) must be provided to the handler routine. This can be difficult in the face of optimizations and code reordering. We'll revisit this topic several times in the remainder of the lecture. 46

67 There are a number of details that need to be considered when translating a complete instruction set. I will cover a few of the most common and most important over the next few slides. 47

68 Registers are at the very top of your computer's storage hierarchy, and thus, are critical for performance. The general purpose registers in your target ISA are used for a number of functions, including: 1) Holding general-purpose registers of the source ISA 2) Holding special-purpose registers (such as the PC, or condition codes register) 3) Pointing to the register context block or memory image of the source 4) Holding intermediate values that might be useful during emulation. If your target ISA has significantly more registers than your source ISA, then it might be possible to satisfy all these functions simultaneously. But, if the target ISA has the same or fewer registers than the source ISA, then it might be difficult to do all these things at the same time. In these cases, the emulator has to prioritize the use of target registers. For instance, you might map the register context block and source memory image to registers first because these pointers are going to be used very frequently during the 48

69 emulation process. Next, you might map the source PC to a register. If the source has a stack register, condition codes, or other special registers, you might map those as well. So, you might have 5 to 10 registers already reserved before you start mapping general purpose registers. Another strategy you can use is to assign source registers to target registers on a block-by-block basis. Basically, on entering a block of translated code, you identify the registers that are read in the source program and then assign those registers to registers on the target. Any modified registers are copied back to the register context block before the block exits. This is useful because if the source block does not use a particular register, you don't have to map it to a target register. 48
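A small sketch of the block-by-block strategy, with the source register context block modeled as a dictionary (an assumption for illustration; a real translator works with registers and loads/stores):

```python
def emulate_block(context, reads, writes, body):
    """Map in only the source registers this block reads, run the
    translated block body, and copy back only the registers it wrote."""
    target_regs = {r: context[r] for r in reads}  # load on block entry
    body(target_regs)                             # the translated block
    for r in writes:
        context[r] = target_regs[r]               # store on block exit
    return context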

70 Another issue is how to deal with condition codes. Condition codes are special architected bits that characterize instruction results (zero, negative, overflow, carry) and are tested by conditional branch instructions. However, condition codes are not used uniformly across different architectures. For instance, in Intel's IA-32, the condition codes are always set implicitly as a side effect of executing some instruction (an add instruction might set the carry CC). SPARC and PowerPC have instructions for explicitly setting the condition code registers. And the MIPS ISA does not use condition codes at all. So, this non-uniformity can cause issues during the emulation process. In most cases, these issues are performance-oriented rather than correctness-oriented. We can break the problem down into a few different cases. The first case is that neither the source nor the target ISA uses condition codes. This is the 49

71 easiest case because there is nothing special to do. Another easy case is where the source ISA does not use CC, but the target does. In this case, you might need to use the CC registers on the target to implement the operations of the source, but otherwise, no additional work is needed. That is, you don't need to maintain the condition code state of the source machine. The next case is where the source ISA has explicit CC, but there are no CC on the target machine (or it does not have the same CC). In this case, we need to emulate the operation of the condition codes on the target, but this emulation is straightforward. The most challenging case is the last one, where the source ISA has implicit CC, but the target machine does not have CC (or doesn't use the same CC as the source). This last case can be very difficult and time consuming to implement. The most straightforward way to implement it is to just evaluate the CC after every instruction that might set them. But this can be a very compute-expensive process on the target machine, often requiring more cycles to emulate than the rest of the instruction. 49

72 For example, in the IA-32, the CC are a set of flags in the EFLAGS register. The add instruction always sets 6 of these CC as shown on this slide. Emulating all of these CC for each add is very time consuming. 50

73 One way to mitigate this performance impact is to use a technique known as lazy condition code evaluation. The key here is that even though the CC are set frequently, they are seldom used. Lazy CC evaluation saves the operands and operation that set the condition codes, rather than the resulting CC values themselves. In this way, you don't actually have to compute the CC for every instruction. You only compute them when they are needed. For example, the IA-32 add instruction modifies all the condition code bits. If an add operates on two registers containing the values 2 and 3, then you would save: add: 2: 3: 5 into a table after the instruction completes. Then if a later instruction needs to test the SF (sign) CC, you look at the result field in the table to generate the SF CC of 0. Another lazy CC strategy is, during binary translation, to do some extra analysis to detect cases where condition codes will never be used. 51
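Here is a minimal sketch of lazy CC evaluation in Python; a real translator would keep these fields in reserved registers rather than an object, and only the ZF and SF flags are shown:

```python
class LazyCC:
    """Save the operation, operands, and result that last set the
    condition codes; derive individual flags only on demand."""
    def record(self, op, a, b, result):
        self.op, self.a, self.b, self.result = op, a, b, result

    def zf(self):
        return 1 if self.result == 0 else 0     # zero flag

    def sf(self, bits=32):
        return (self.result >> (bits - 1)) & 1  # sign flag (two's complement)
```

For the add of 2 and 3 above, record("add", 2, 3, 5) is all the work done when the instruction executes; the sign flag of 0 is derived only if a later branch actually tests it.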

74 Let's look at an example. 51

75 For example, at the top of this slide, we have some x86 code. add %ebx, 0(%eax) add %ecx, %ebx jmp label1 We can use analysis during translation to detect that the CC will not be needed after the first add instruction. However, at the time the jmp is translated, it may not be known whether the CC set by the second add will be needed. For example, the jmp and its label may be translated separately and reside in separate blocks. To handle this situation, the PowerPC translation of this code uses R25 through R27 for lazy CC evaluation. Basically, the operand and opcode information of the add are saved in these registers. 52

76 Then, here's the PowerPC translation. We store the operands and opcode of the add instruction into the r25, r26, and r27 registers, then branch to label1. At label1, it turns out we need to test the condition codes. So, we branch to the genzf routine to set the ZF condition code. From there, we branch to a bit of code that evaluates the ZF flag and sets it in cr0 so that we can branch on that condition code. (Appending a period to a PowerPC instruction tells it to set cr0.) 53

77 Another issue has to do with ensuring that arithmetic operations transform data in the same way on both the source and target machines. In most cases, emulating arithmetic operations is straightforward because data formats have been more or less standardized over the years. For instance, integers use two's complement. Floating point uses the IEEE standard. Additionally, most ISAs offer basic logical and arithmetic operations that can be used to emulate the different variations of shifts and logical instructions in a source ISA. However, there are some exceptions that lead to differences in the way arithmetic operations are done on different machines. For instance, although most machines use 64 bits for their floating point results, the IA-32 uses 80 bits for its intermediate results, meaning the precision is going to be slightly different. It's possible, but quite difficult, to obtain floating point results identical to IA-32 on a non-IA-32 machine. Unless it's crucial for the application, most emulators would just accept the loss in precision. 54

78 Another example where emulation is difficult is when some machines provide native integer divide instructions, while others rely on converting integer values and using the FP unit to divide. Another would be machines with different lengths for immediate values (for instance, a constant used in instructions). As you would expect, mapping shorter immediate values to machines with longer immediate fields is easier than mapping longer immediate values to shorter immediate fields. However, all machines have ways of building up full-length constants from their immediate fields, so all immediate values can be handled. The main takeaway here is that it is usually possible to emulate programs on machines with different data formats and arithmetic instructions, but depending on the differences, there might be significant implementation and performance challenges. 54

79 Another issue is that different machines may use different orderings for bytes within a word. One class of machines, the so-called big-endian machines, have the most significant byte as byte 0 (on the left). And little-endian machines use the last byte in the word (byte 3 on 32-bit architectures) as the most significant byte. Little endian has the advantage of being able to read the least significant bits without changing the address you're reading. Store 16 in a word in memory: on a big-endian machine, you need to compute a new address to read it back; on a little-endian machine, you can read 16 directly by using a byte read instead of a word read at the same address. Most emulators will typically maintain the guest data image in the same byte order as 55

80 assumed by the source ISA. So, to emulate a big-endian source on a little-endian machine, the emulator can modify addresses when bytes (or half-words) are accessed from the guest memory region. For instance, on a load-byte instruction, you might complement the low-order bits to obtain the correct byte to load (e.g., 00 becomes 11, or 01 becomes 10). This can be an awkward, time-consuming process that is hard to avoid. Some architectures actually support both byte orders, selectable with a mode bit. Obviously, having a target ISA with this feature would simplify the emulation process. 55
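The low-order address complement can be sketched as a single XOR; the 4-byte word size and the access sizes here are assumptions for a 32-bit guest:

```python
def host_address(guest_addr, access_size=1, word_size=4):
    """Adjust a big-endian guest address for a little-endian host that
    keeps each word in guest byte order: XOR the low-order bits
    (e.g., 0b00 -> 0b11 and 0b01 -> 0b10 for byte loads)."""
    return guest_addr ^ (word_size - access_size)
```

Byte 0 of a big-endian word lands at host offset 3, a half-word at offset 0 lands at offset 2, and full-word accesses pass through unchanged.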

Emulation. Michael Jantz. Acknowledgements: Slides adapted from Chapter 2 in Virtual Machines: Versatile Platforms for Systems and Processes by James E. Smith and Ravi Nair. Credit to Prasad A. Kulkarni.

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information

Chapter 5. A Closer Look at Instruction Set Architectures

Chapter 5. A Closer Look at Instruction Set Architectures Chapter 5 A Closer Look at Instruction Set Architectures Chapter 5 Objectives Understand the factors involved in instruction set architecture design. Gain familiarity with memory addressing modes. Understand

More information

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam Assembly Language Lecture 2 - x86 Processor Architecture Ahmed Sallam Introduction to the course Outcomes of Lecture 1 Always check the course website Don t forget the deadline rule!! Motivations for studying

More information

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as

More information

Chapter 5:: Target Machine Architecture (cont.)

Chapter 5:: Target Machine Architecture (cont.) Chapter 5:: Target Machine Architecture (cont.) Programming Language Pragmatics Michael L. Scott Review Describe the heap for dynamic memory allocation? What is scope and with most languages how what happens

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

We briefly explain an instruction cycle now, before proceeding with the details of addressing modes.

We briefly explain an instruction cycle now, before proceeding with the details of addressing modes. Addressing Modes This is an important feature of computers. We start with the known fact that many instructions have to include addresses; the instructions should be short, but addresses tend to be long.

More information

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018 CS 31: Intro to Systems Virtual Memory Kevin Webb Swarthmore College November 15, 2018 Reading Quiz Memory Abstraction goal: make every process think it has the same memory layout. MUCH simpler for compiler

More information

DC57 COMPUTER ORGANIZATION JUNE 2013

DC57 COMPUTER ORGANIZATION JUNE 2013 Q2 (a) How do various factors like Hardware design, Instruction set, Compiler related to the performance of a computer? The most important measure of a computer is how quickly it can execute programs.

More information

Chapter 2: Instructions How we talk to the computer

Chapter 2: Instructions How we talk to the computer Chapter 2: Instructions How we talk to the computer 1 The Instruction Set Architecture that part of the architecture that is visible to the programmer - instruction formats - opcodes (available instructions)

More information

Interfacing Compiler and Hardware. Computer Systems Architecture. Processor Types And Instruction Sets. What Instructions Should A Processor Offer?

Interfacing Compiler and Hardware. Computer Systems Architecture. Processor Types And Instruction Sets. What Instructions Should A Processor Offer? Interfacing Compiler and Hardware Computer Systems Architecture FORTRAN 90 program C++ program Processor Types And Sets FORTRAN 90 Compiler C++ Compiler set level Hardware 1 2 What s Should A Processor

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

A First Look at Microprocessors

A First Look at Microprocessors A First Look at Microprocessors using the The General Prototype Computer (GPC) model Part 2 Can you identify an opcode to: Decrement the contents of R1, and store the result in R5? Invert the contents

More information

Instruction Set Architecture. "Speaking with the computer"

Instruction Set Architecture. Speaking with the computer Instruction Set Architecture "Speaking with the computer" The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture Digital Design

More information

Instruction Set Principles. (Appendix B)

Instruction Set Principles. (Appendix B) Instruction Set Principles (Appendix B) Outline Introduction Classification of Instruction Set Architectures Addressing Modes Instruction Set Operations Type & Size of Operands Instruction Set Encoding

More information

RISC Principles. Introduction

RISC Principles. Introduction 3 RISC Principles In the last chapter, we presented many details on the processor design space as well as the CISC and RISC architectures. It is time we consolidated our discussion to give details of RISC

More information

CSCI 402: Computer Architectures. Instructions: Language of the Computer (1) Fengguang Song Department of Computer & Information Science IUPUI

CSCI 402: Computer Architectures. Instructions: Language of the Computer (1) Fengguang Song Department of Computer & Information Science IUPUI To study Chapter 2: CSCI 402: Computer Architectures Instructions: Language of the Computer (1) Fengguang Song Department of Computer & Information Science IUPUI Contents 2.1-2.3 Introduction to what is

More information

Computer Organization & Assembly Language Programming

Computer Organization & Assembly Language Programming Computer Organization & Assembly Language Programming CSE 2312-002 (Fall 2011) Lecture 8 ISA & Data Types & Instruction Formats Junzhou Huang, Ph.D. Department of Computer Science and Engineering Fall

More information

CS577 Modern Language Processors. Spring 2018 Lecture Interpreters

CS577 Modern Language Processors. Spring 2018 Lecture Interpreters CS577 Modern Language Processors Spring 2018 Lecture Interpreters 1 MAKING INTERPRETERS EFFICIENT VM programs have an explicitly specified binary representation, typically called bytecode. Most VM s can

More information

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging CS162 Operating Systems and Systems Programming Lecture 14 Caching (Finished), Demand Paging October 11 th, 2017 Neeraja J. Yadwadkar http://cs162.eecs.berkeley.edu Recall: Caching Concept Cache: a repository

More information

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands Stored Program Concept Instructions: Instructions are bits Programs are stored in memory to be read or written just like data Processor Memory memory for data, programs, compilers, editors, etc. Fetch

More information

T Jarkko Turkulainen, F-Secure Corporation

T Jarkko Turkulainen, F-Secure Corporation T-110.6220 2010 Emulators and disassemblers Jarkko Turkulainen, F-Secure Corporation Agenda Disassemblers What is disassembly? What makes up an instruction? How disassemblers work Use of disassembly In

More information

There are different characteristics for exceptions. They are as follows:

There are different characteristics for exceptions. They are as follows: e-pg PATHSHALA- Computer Science Computer Architecture Module 15 Exception handling and floating point pipelines The objectives of this module are to discuss about exceptions and look at how the MIPS architecture

More information

Memory Management: Virtual Memory and Paging CS 111. Operating Systems Peter Reiher

Memory Management: Virtual Memory and Paging CS 111. Operating Systems Peter Reiher Memory Management: Virtual Memory and Paging Operating Systems Peter Reiher Page 1 Outline Paging Swapping and demand paging Virtual memory Page 2 Paging What is paging? What problem does it solve? How

More information

EC 413 Computer Organization

EC 413 Computer Organization EC 413 Computer Organization Review I Prof. Michel A. Kinsy Computing: The Art of Abstraction Application Algorithm Programming Language Operating System/Virtual Machine Instruction Set Architecture (ISA)

More information

CS311 Lecture: Pipelining and Superscalar Architectures

CS311 Lecture: Pipelining and Superscalar Architectures Objectives: CS311 Lecture: Pipelining and Superscalar Architectures Last revised July 10, 2013 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as a result

More information

Computer System Architecture

Computer System Architecture CSC 203 1.5 Computer System Architecture Department of Statistics and Computer Science University of Sri Jayewardenepura Instruction Set Architecture (ISA) Level 2 Introduction 3 Instruction Set Architecture

More information

CSC258: Computer Organization. Microarchitecture

CSC258: Computer Organization. Microarchitecture CSC258: Computer Organization Microarchitecture 1 Wrap-up: Function Conventions 2 Key Elements: Caller Ensure that critical registers like $ra have been saved. Save caller-save registers. Place arguments

More information

17. Instruction Sets: Characteristics and Functions

17. Instruction Sets: Characteristics and Functions 17. Instruction Sets: Characteristics and Functions Chapter 12 Spring 2016 CS430 - Computer Architecture 1 Introduction Section 12.1, 12.2, and 12.3 pp. 406-418 Computer Designer: Machine instruction set

More information

Review Questions. 1 The DRAM problem [5 points] Suggest a solution. 2 Big versus Little Endian Addressing [5 points]

Review Questions. 1 The DRAM problem [5 points] Suggest a solution. 2 Big versus Little Endian Addressing [5 points] Review Questions 1 The DRAM problem [5 points] Suggest a solution 2 Big versus Little Endian Addressing [5 points] Consider the 32-bit hexadecimal number 0x21d3ea7d. 1. What is the binary representation

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them. Contents at a Glance About the Author...xi

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10032011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Chapter 3 Number Systems Fixed Point

More information

Binary Translation 2

Binary Translation 2 Binary Translation 2 G. Lettieri 22 Oct. 2014 1 Introduction Fig. 1 shows the main data structures used in binary translation. The guest system is represented by the usual data structures implementing

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Dynamic Instruction Scheduling with Branch Prediction

More information

RISC I from Berkeley. 44k Transistors 1Mhz 77mm^2

RISC I from Berkeley. 44k Transistors 1Mhz 77mm^2 The Case for RISC RISC I from Berkeley 44k Transistors 1Mhz 77mm^2 2 MIPS: A Classic RISC ISA Instructions 4 bytes (32 bits) 4-byte aligned Instructions operate on memory and registers Memory Data types

More information

Memory Models. Registers

Memory Models. Registers Memory Models Most machines have a single linear address space at the ISA level, extending from address 0 up to some maximum, often 2 32 1 bytes or 2 64 1 bytes. Some machines have separate address spaces

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Preventing Stalls: 1

Preventing Stalls: 1 Preventing Stalls: 1 2 PipeLine Pipeline efficiency Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls Ideal pipeline CPI: best possible (1 as n ) Structural hazards:

More information

Assembly Language: Overview!

Assembly Language: Overview! Assembly Language: Overview! 1 Goals of this Lecture! Help you learn:" The basics of computer architecture" The relationship between C and assembly language" IA-32 assembly language, through an example"

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - Von Neumann Architecture 2 Two lessons Summary of the traditional computer architecture Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

Processor Organization and Performance

Processor Organization and Performance Chapter 6 Processor Organization and Performance 6 1 The three-address format gives the addresses required by most operations: two addresses for the two input operands and one address for the result. However,

More information

Systems Architecture I

Systems Architecture I Systems Architecture I Topics Assemblers, Linkers, and Loaders * Alternative Instruction Sets ** *This lecture was derived from material in the text (sec. 3.8-3.9). **This lecture was derived from material

More information

Memory Management. Kevin Webb Swarthmore College February 27, 2018

Memory Management. Kevin Webb Swarthmore College February 27, 2018 Memory Management Kevin Webb Swarthmore College February 27, 2018 Today s Goals Shifting topics: different process resource memory Motivate virtual memory, including what it might look like without it

More information

Operating System Principles: Memory Management Swapping, Paging, and Virtual Memory CS 111. Operating Systems Peter Reiher

Operating System Principles: Memory Management Swapping, Paging, and Virtual Memory CS 111. Operating Systems Peter Reiher Operating System Principles: Memory Management Swapping, Paging, and Virtual Memory Operating Systems Peter Reiher Page 1 Outline Swapping Paging Virtual memory Page 2 Swapping What if we don t have enough

More information

Compiler Construction D7011E

Compiler Construction D7011E Compiler Construction D7011E Lecture 8: Introduction to code generation Viktor Leijon Slides largely by Johan Nordlander with material generously provided by Mark P. Jones. 1 What is a Compiler? Compilers

More information

EE 3170 Microcontroller Applications

EE 3170 Microcontroller Applications EE 317 Microcontroller Applications Lecture 5 : Instruction Subset & Machine Language: Introduction to the Motorola 68HC11 - Miller 2.1 & 2.2 Based on slides for ECE317 by Profs. Davis, Kieckhafer, Tan,

More information

COMP2121: Microprocessors and Interfacing. Instruction Set Architecture (ISA)

COMP2121: Microprocessors and Interfacing. Instruction Set Architecture (ISA) COMP2121: Microprocessors and Interfacing Instruction Set Architecture (ISA) http://www.cse.unsw.edu.au/~cs2121 Lecturer: Hui Wu Session 2, 2017 1 Contents Memory models Registers Data types Instructions

More information

CS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College September 25, 2018

CS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College September 25, 2018 CS 31: Intro to Systems ISAs and Assembly Kevin Webb Swarthmore College September 25, 2018 Overview How to directly interact with hardware Instruction set architecture (ISA) Interface between programmer

More information

Chapter 2 Instruction Set Architecture

Chapter 2 Instruction Set Architecture Chapter 2 Instruction Set Architecture Course Outcome (CO) - CO2 Describe the architecture and organization of computer systems Program Outcome (PO) PO1 Apply knowledge of mathematics, science and engineering

More information

Basic Execution Environment

Basic Execution Environment Basic Execution Environment 3 CHAPTER 3 BASIC EXECUTION ENVIRONMENT This chapter describes the basic execution environment of an Intel Architecture processor as seen by assembly-language programmers.

More information

Instruction Sets: Characteristics and Functions

Instruction Sets: Characteristics and Functions Instruction Sets: Characteristics and Functions Chapter 10 Lesson 15 Slide 1/22 Machine instruction set Computer designer: The machine instruction set provides the functional requirements for the CPU.

More information