For our next chapter, we will discuss the emulation process, which is an integral part of virtual machines.


1 For our next chapter, we will discuss the emulation process, which is an integral part of virtual machines.


3 For today's lecture, we'll start by defining what we mean by emulation. Specifically, in this section, we'll focus on how to emulate the instructions for one machine on another machine. There are two basic methods for emulating an instruction set. The first is interpretation. We'll discuss basic techniques for building an interpreter, including basic, indirect threaded, and direct threaded interpretation. The other is binary translation, which is basically compiling sections of the source binary to the target platform. There are a number of issues that arise during binary translation, including code discovery and code location, which have to do with separating instructions from data in the source binary. We'll also look at other issues, such as mapping registers in the source machine to registers on the target platform. Next, we'll look at optimizations to deal with control transfers in the translated code, and we'll end by looking at some issues with translating specific parts of the

4 instruction set.

5 Let's talk about a few of the definitions we'll use in our discussion going forward. First, the book defines emulation as the process of implementing the interface or functionality of one system on a different system. Notice that this is not very different from our definition of virtualization, and taken in this general sense, I think you could argue that virtualization is just a form of emulation. However, for our discussion, we're going to narrow the definition of emulation and say that it applies specifically to instruction sets. So, emulation is the process of taking one instruction set and implementing its interface or functionality on a machine with a different instruction set. There are different techniques for doing emulation, which we'll discuss over the next few lectures. The first is interpretation, which is basically instruction-at-a-time translation of the source instructions. Interpretation is simple to implement, but relatively slow compared to binary translation. The other strategy is binary translation, which is basically block-at-a-time translation of the source program. This technique can improve performance over interpretation,

6 but it's a bit more challenging to implement efficiently. It takes time to compile blocks of code, so you have to prioritize which parts of the program you want to compile first. Also, you have to keep this compiled code in memory and have a way to transfer control from compiled to non-compiled blocks, and vice versa. Now, I want to talk about how emulation is related to the term simulation, because the terms are often confused. They are related, but they are different concepts. Simulation is a method for modeling a system's operation. So, simulators are used when you need to understand how something works, but don't have the means to set up an experiment on a real machine. For instance, in my lab, we are trying to study heterogeneous memory architectures (systems with multiple types of memory), but these systems do not exist yet. So, we use a simulator called Ramulator to simulate the operation of multiple types of memory in one machine. Emulation is often part of the process of simulation, but simulators will often implement much more functionality to give you information about how a system works. For instance, there are processor simulators that will actually execute the application, but, additionally, they provide information about how many cycles the processor would require to execute the application or how efficiently the application will utilize processor caches.

7 Recall from our discussion on virtual machines that the guest refers to the system or interface that will be supported by the underlying platform. The host refers to the underlying platform that is used to provide an environment for the guest.

8 In the context of instruction set emulation, we use the terms source and target to refer to the instructions that participate in the emulation. The source ISA (or binary) is the original instruction set or binary file that needs to be emulated. The target ISA (or binary) is the ISA of the host processor you want to use to run your source instructions. So, we need to somehow translate the source instructions to the target ISA to emulate them on the host platform. I'll try to use the terms source and target when referring to instruction sets that are emulated, and guest and host when referring to platforms that are virtualized. The terms are very similar, and even in the literature, they are not always used consistently.

9 OK, so for this lecture, we will be primarily concerned with instruction set emulation, since that is a key aspect of most virtual machine implementations. Our definition for instruction set emulation is on this slide. We say the source is emulated by the target if binaries in the source instruction set can be executed on a machine implementing the target instruction set. As we've discussed, this capability is required for many VM implementations. An example of instruction set emulation is the IA-32 execution layer we discussed last time. Basically, this is software that Intel developed to enable the execution of older 32-bit x86 executables on 64-bit Itanium processors. The IA-32 EL translated 32-bit x86 instructions to the Itanium's 64-bit instruction set, and it was integrated into the OS, so it did this emulation seamlessly and transparently to the upper-level applications.

10 You can think of ISA emulation techniques as existing on a spectrum, where different techniques require different amounts of computing resources and offer different performance and portability characteristics. On one end of the spectrum is the straightforward method of interpretation, and on the other is binary translation. Interpretation involves a cycle of fetching a source instruction, analyzing it, performing the required operation, and then fetching the next source instruction. Interpretation is the simplest emulation technique, but it typically has poor performance. Interpreters are also often implemented in a high-level language, such as C, so they're often portable. Binary translation tries to amortize the fetch and analysis costs by translating a block of source instructions to a block of target instructions, and then saving the translated block for repeated use. Implementing binary translation is more complex, and it requires a higher initial cost to

11 translate the blocks, but it can pay dividends by providing better long-term performance. There are other techniques that attempt to eliminate the drawbacks of both approaches. Predecoding is a preprocessing step for interpretation that does some of the work of interpreting the instructions beforehand to speed up the process of interpretation. And selective compilation is sort of a hybrid approach that uses interpretation early in the run and for sections of code that are not executed very often, and binary translation for sections of code that are expected to be hot. We'll talk about each of these in more depth later in the lecture.

12 Let's first talk about how interpretation of the source ISA is implemented. The interpreter program has to maintain the complete architected state of a machine implementing the source ISA. So, this figure shows that the interpreter maintains an image of all of the guest's memory, including code, data, and stack regions for the executable. Additionally, the interpreter holds a table called the context block. The context block contains the various components of the source's architected state, including general-purpose registers, the program counter, condition codes, and miscellaneous control registers. (Point out the context block.) Ask: what is the condition codes register?

13 The simplest interpreter implementation is known as a decode-and-dispatch interpreter. Its implementation is structured around a simple loop that steps through the program, one instruction at a time, and modifies the state of the source according to the instruction. For each iteration of the loop, it decodes the current instruction and dispatches it to an interpretation routine based on the type of the instruction. The code on this slide shows a decode-and-dispatch loop for interpreting the PowerPC ISA. In this example, the source instructions are kept in an array called code, and PC is the current value of the source's program counter. So, we first get the next instruction by indexing into the code array. Next, the extract function extracts the opcode from the current instruction. (bit slicing)

14 Next, we enter a switch statement where, based on this opcode, we will jump to an interpreter routine that implements this source instruction on the target machine.

15 Here's an example interpreter routine for the LoadWordAndZero source instruction. The Load Word and Zero instruction loads a 32-bit word into a 64-bit register and zeroes the upper 32 bits of the register; it is the basic PowerPC load word instruction.

16 The ALU instruction is actually a stand-in for a number of PowerPC instructions that have the same primary opcode but are distinguished by different extended opcodes. For instructions of this type, two levels of decoding (via switch statements) are used. So, I look at this implementation and think interpretation must be really slow. Why? Interpretation of a single source instruction will often require the execution of tens of instructions on the target machine.

17 So, let's talk about the efficiency of this interpreter implementation. We all know that branch instructions are harmful to the performance of a pipelined processor. Why? Since the processor does not know ahead of time which path will be taken, it cannot fetch and decode the next instruction until it knows the result of the branch. There are at least five branch instructions for every iteration of the decode-and-dispatch loop in this implementation: a test for a halt or an interrupt at the top of the loop, a register-indirect branch for the switch statement, a branch to the interpreter routine, a second register-indirect branch to return from the interpreter routine, and, finally, a backward branch to terminate the loop. Performance of indirect branches is even worse, because it is more difficult to predict which instruction an indirect branch will jump to.

18 What is an indirect branch? Why is their prediction difficult? A normal branch is either taken or not taken, while an indirect branch may jump to any instruction in the program. Also, as we noted before, interpreting even a simple instruction requires tens of target instructions. A simple add will require 20 target instructions, and several of these instructions might require expensive loads or stores to memory. Many interpreters use hand-coded assembly to try to improve performance. Even just saving a few cycles from each iteration of the interpreter loop can yield significant performance gains when your interpreter application spends almost all of its time in the decode-and-dispatch loop. Of course, this hurts portability. HotSpot uses an interpreter written in assembly. There is also a zero-assembly project that implements the interpreter in a high-level language.

19 So, we have seen that the simple decode-and-dispatch interpreter has a high number of branch instructions for each interpreted instruction, and that this can hurt performance. We counted at least five branches for each interpreted instruction. There is actually a simple technique that can be used to improve performance over this naïve decode-and-dispatch implementation. This technique is called threaded interpretation, and the basic idea is to simply append the code that dispatches to the next instruction interpreter routine to the end of each of the instruction interpretation routines. The idea is that we're spending a lot of time branching back and forth from the main interpreter loop to these instruction interpretation routines. By appending this dispatch code to the end of each interpreter routine, we can save ourselves a lot of branching. Specifically, we save three branches per interpreted instruction. It's called threaded interpretation because it threads the instruction interpretation

20 routines together so that you don't have to jump back to the main interpreter loop. Let's look at an example implementation.

21 OK, this code shows the implementation of threaded interpretation with the two example instruction interpretation routines we had seen before. Notice that the dispatch code is appended at the end of the interpretation routine. We find the address of the next routine to jump to by looking it up in a table, and then we just directly jump to that address. In this way, we avoid the main interpreter loop altogether. So, there are only two branches left in this code (the test for exit and the jump to the next instruction routine). Since there is a branch in every interpreter routine, we may be able to do a better job of branch prediction if instruction A shows a tendency to follow instruction B (like a jump following a compare). There will be multiple entries in the branch prediction table, one for each interpreter routine, each with its own prediction history. Suppose we have five instructions in a loop:

22 1. load 2. add 3. mult 4. store 5. branch to 1
Simple predictor: predict the same as the last branch. The decode-and-dispatch interpreter mispredicts on every instruction. The threaded interpreter will correctly predict on every iteration after the first one. [Show on board.]
Decode-and-dispatch (looking only at the switch branch):
Prediction : Actual : Result
1) (no prediction) : Load : miss
2) Load : Add : miss
3) Add : Mult : miss
4) Mult : Store : miss
5) Store : Branch : miss
6) Branch : Load : miss
7) Load : Add : miss
Threaded:
Prediction : Actual : Result
1) (no prediction) : Add : miss (the other instructions execute in the loop)
1) Add : Add : hit


25 So, we call this technique indirect threaded interpretation because the dispatch now occurs indirectly through a table. Notice that, for each interpreted instruction, we look up the address of the next interpreter routine in a table and use an indirect jump to jump to that address. One of the advantages of this approach is that the instruction interpretation routines do not refer directly to the addresses of the other interpretation routines. In this way, the interpretation routines can be modified and relocated independently. The other advantage, of course, is that this approach improves efficiency over the basic decode-and-dispatch interpreter implementation by removing branch instructions. The main disadvantage is that it increases the code size of the interpreter (it replicates the dispatch code).
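As a sketch of what the appended dispatch code looks like, here is a tiny indirect threaded interpreter in C. It relies on GCC's computed-goto extension (the &&label operator), and the three-opcode accumulator ISA is invented for illustration. Note how every routine ends with the same table lookup plus indirect jump, so control never returns to a central loop.

```c
#include <stdint.h>

/* Made-up accumulator ISA: 0 = increment, 1 = double, 2 = halt. */
enum { T_INC = 0, T_DBL = 1, T_HALT = 2 };

int64_t run_threaded(const uint8_t *code) {
    /* Dispatch table: opcode -> interpreter routine address.
       (GCC extension: &&label takes the address of a label.) */
    static void *table[] = { &&do_inc, &&do_dbl, &&do_halt };
    int64_t acc = 0;
    uint32_t pc = 0;

    goto *table[code[pc]];      /* initial dispatch */

do_inc:
    acc += 1;
    pc++;
    goto *table[code[pc]];      /* dispatch code appended to the routine */
do_dbl:
    acc *= 2;
    pc++;
    goto *table[code[pc]];      /* same dispatch sequence again */
do_halt:
    return acc;
}
```

The dispatch is "indirect" in exactly the sense on the slide: the jump target is fetched from the table, so the routines never name each other and can be relocated independently.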

26 This figure shows the difference in the data and control flow between the decode-and-dispatch and threaded interpreter techniques. The dotted lines show the memory access that is needed to fetch the next source instruction. This figure really shows where the savings come from with this approach. Notice that, in decode-and-dispatch, we're constantly jumping from the dispatch loop to the interpreter routines and then back again, while with threaded interpretation, we just jump directly to the next interpreter routine after finding its address in the table. (The interpreter routines are not subroutines in the usual sense; they are simply pieces of code that are threaded together.)

27 Now, let's talk about another optimization we can perform over threaded interpretation. This optimization allows even greater efficiency, but at the cost of a considerable loss in portability. The idea here is to reduce interpretation cost by doing some preprocessing on the source instructions to put them into a form that is easier to interpret. With our original interpreter design, every instruction is decoded as we interpret the instruction. This means that, for every instruction we interpret, we use the extract function to shift out bits in the instruction and find its opcode and operands. This can be an expensive operation, especially if we're interpreting instructions in a loop: for every iteration of the loop, we have to decode the same instructions over and over again. Predecoding saves this cost by statically parsing each instruction into some predefined data structure and storing this predecoded form with the source binary. This slide shows what predecoding might do for three instructions in the PowerPC

28 ISA. Basically, both the opcode and the operands are decoded so that they are easily accessible later. Although this example is for a RISC machine, I should also note that this optimization is much more important when the source ISA is a CISC ISA, since CISC ISAs often have variable-length instructions, which are much more costly to decode than RISC instructions.

29 This slide shows how the code for the LoadWordAndZero interpreter routine might look after predecoding. The structure for the predecoded instruction is also shown. In C, we would just put the opcode and operands into an easy-to-read struct. There are a couple of interesting things to note about this code: 1) The extract function is no longer needed; we just read the predecoded opcodes and operands directly from memory. 2) We now need to maintain a separate program counter for this predecoded code, which is called TPC in the code on this slide. We maintain the original source program counter (SPC on this slide) to keep the correct architected state, while TPC is used to fetch the correct predecoded instructions. While predecoding can improve performance, there are a couple of disadvantages: 1) It requires a preprocessing pass over the source instructions. 2) It changes the input program so that it is no longer portable. Basically, with predecoding, we've already started heading down the path towards binary translation.
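The idea can be sketched in C as follows. The struct fields, opcodes, and 32-bit encoding are assumptions for illustration, not the book's exact layout: all the bit slicing happens once in a separate pass, and the interpreter then reads the fields straight out of the struct.

```c
#include <stdint.h>

/* Predecoded instruction: opcode and operand fields parsed up front. */
typedef struct {
    uint8_t op;          /* opcode, already extracted */
    uint8_t rd, ra, rb;  /* register operands, already extracted */
} PredecodedInst;

enum { PD_ADD = 0, PD_HALT = 1 };

/* One predecoding pass over the raw 32-bit source words
   (assumed encoding: opcode in bits 31-24, operands below). */
void predecode(const uint32_t *code, PredecodedInst *out, uint32_t n) {
    for (uint32_t i = 0; i < n; i++) {
        out[i].op = code[i] >> 24;
        out[i].rd = (code[i] >> 16) & 0x1f;
        out[i].ra = (code[i] >> 8) & 0x1f;
        out[i].rb = code[i] & 0x1f;
    }
}

/* Interpreting the predecoded form: no extract function, no shifting.
   TPC indexes the predecoded array (the SPC is omitted in this sketch,
   since instructions here map 1:1). */
void run_predecoded(const PredecodedInst *code, int64_t *regs) {
    uint32_t tpc = 0;
    for (;;) {
        const PredecodedInst *d = &code[tpc];
        switch (d->op) {
        case PD_ADD:  regs[d->rd] = regs[d->ra] + regs[d->rb]; tpc++; break;
        case PD_HALT: return;
        default:      tpc++; break;
        }
    }
}
```

The win is that the per-iteration field extraction is paid once per static instruction instead of once per executed instruction, which matters most inside loops.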

30 Another technique we can use to further improve interpreter performance is called direct threaded interpretation. The idea here is to remove the memory access that's required to look up the address of the interpreter routine when you have to jump to the next routine. This approach is implemented as part of predecoding. Basically, during the predecoding pass, we replace the opcode of each instruction with the address of the interpreter routine, as shown in the figure on this slide. Now, the interpreter code is only slightly different (next slide). The downside of this technique is that, like predecoding, it reduces portability. Although it's faster, the interpreter code is now dependent on the exact locations of the interpreter routines. So, if the interpreter code is ported to a different target machine, it will have to be regenerated. One way to mitigate this problem is to use the address-of-label operator in C (&&).

31 Basically, we can use this operator to find the addresses of the labels that begin each of the interpreter routines and place those addresses in the predecoded instructions.

32 Notice we no longer have a table lookup to read the address of the routine. We just read op directly from the current code structure we're using. The address data is likely already loaded into the processor caches, because we access code[TPC] just a few instructions prior to reading the address (back a slide).
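Here is a sketch of direct threading using GCC's address-of-label operator. Because &&label is only valid inside the function that owns the labels, the sketch uses a common trick: a setup call that exports the routine addresses, which the predecoder then stores directly in each predecoded instruction. The encoding and all names are invented for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Predecoded instruction whose "opcode" field is a routine address. */
typedef struct {
    void *routine;   /* address of the interpreter routine */
    int64_t imm;     /* operand */
} DirectInst;

enum { D_ADD = 0, D_HALT = 1, D_NOPS = 2 };

/* If code == NULL, just export the label addresses into addrs
   (&&label, a GCC extension, only works inside this function). */
int64_t run_direct(DirectInst *code, void **addrs) {
    if (code == NULL) {
        addrs[D_ADD] = &&do_add;
        addrs[D_HALT] = &&do_halt;
        return 0;
    }
    int64_t acc = 0;
    DirectInst *tpc = code;
    goto *tpc->routine;      /* no table: the address is in the inst */
do_add:
    acc += tpc->imm;
    tpc++;
    goto *tpc->routine;      /* dispatch reads straight from code[tpc] */
do_halt:
    return acc;
}

/* "Predecode": turn (opcode, imm) pairs into DirectInsts by replacing
   each opcode with its routine's address. */
void predecode_direct(int64_t (*src)[2], DirectInst *out, int n) {
    void *addrs[D_NOPS];
    run_direct(NULL, addrs);
    for (int i = 0; i < n; i++) {
        out[i].routine = addrs[src[i][0]];
        out[i].imm = src[i][1];
    }
}
```

Compared with the indirect threaded version, the per-dispatch table load is gone; the cost is that the predecoded program now embeds addresses that are only valid for this build of the interpreter.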

33 This figure shows how direct threaded interpretation works. Basically, the only step that is different here from indirect threaded interpretation is that we run the source through a predecoder that generates an intermediate form that is less portable but faster.

34 So, these optimizations can help speed up interpretation, but realize that interpretation is still going to be much slower than executing the source code directly on a machine that implements the source ISA. We can use another technique, called binary translation, to improve performance even further. Notice that, with predecoding, all the instructions of the same type are executed with the same interpreter routine (see slide 20 or 22). We can speed up emulation even more by converting each source binary instruction to its own customized target code. The process of converting the source binary instructions to corresponding target binary instructions is called binary translation. After a piece of code has been translated, we can just execute it directly on the target machine. There is no need to parse through the source code or jump around to interpret it at all. Binary translation also allows the emulator to apply traditional compiler optimizations to the native code, potentially speeding up the translated code even further. The result is much better performance than what can be achieved with interpretation, but, as you can imagine, the portability of the generated code is

35 completely lost.

36 This figure illustrates the process of binary translation. Compare it to direct threaded interpretation / predecoding. In both approaches, the original source is converted to another form, but with predecoding, interpretation routines are still needed, while with binary translation, we can execute the code directly. A word about portability: we say that predecoding and direct threaded interpretation are less portable because we translate the source code into an intermediate form that can only be used with a particular machine, but we can still write the interpreter in a portable high-level language. With binary translation, we translate the code into a form that can only be used on a particular machine, and we need to write a specialized code generator for every target architecture we want to use.

37 Here's an example of binary translation from an x86 source binary to a PowerPC target binary. In this example, the architectural registers for the source binary are stored in a register context block, which is held in memory on the target. The target retrieves these values from memory and puts them into target registers as it needs them. This is similar to the register context block used by the interpreter. (The RCB is a block of storage containing the architectural state of the source machine; see slide 9.) Some of the registers on the target are permanently assigned to contain or point to certain source resources. For instance, r1 on the target always contains a pointer to the memory that holds the source's register context block, r2 points to the source's memory image, and r3 is mapped to the source's program counter. (Write these on the board.) The process of mapping source registers to target registers is an example of state

38 mapping, that is, the process of maintaining the state of the source machine on the host.

39 Here is the translated code. Notice that, in the translated code, we still need to load the source's register values from memory. But some values, such as the PC, can be updated directly, without having to go to memory, because we've mapped them to registers. Also note that this code can now be executed directly on the target machine; there are no jumps between instructions, and our overhead is much lower now (4 to 7 target instructions per source instruction in this example).

40 Let's talk about the process of register mapping. With binary translation, we can map some of the registers in the source machine directly to registers on the target. This greatly reduces the number of loads and stores to memory, which is very important for performance. In some cases, the number of registers on the target machine may be smaller than the number of registers on the source machine. For instance, consider if you wanted to emulate ARM code on an x86 machine: x86 has far fewer general-purpose registers than ARM. In these cases, you might have to map the registers you don't plan to use as often to locations in memory. Another option is to map the source registers to the target on a per-block basis. That is, when you enter a block of translated code, all of the registers read in the source code are copied from the context block (in memory) into target registers. And

41 then, when the translated block is exited, any modified registers are copied back to the context block. This helps keep down spills to memory in code that might be really hot (for instance, in a loop).

42 This example shows how we can use register mapping to further improve performance for translated code. This is a translation for the same three instructions we just saw, but now we are using register mapping to map the contents of eax and edx on the x86 to r4 and r7 on the PowerPC. Now, instead of requiring 16 instructions to emulate the 3 x86 instructions, we only need 7 instructions when we use register mapping. Performance is starting to get closer to what would be expected with native execution.
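A sketch of per-block register mapping, with the translated block written out by hand in C (the context-block layout, the eax/edx slot indexes, and the block's two-instruction body are all assumptions for illustration, not the slide's actual translation): registers are copied in once on block entry, the body runs entirely in "target registers," and modified values are copied back once on exit.

```c
#include <stdint.h>

/* Register context block held in memory on the target. */
typedef struct {
    int64_t regs[8];
} ContextBlock;

enum { REG_EAX = 0, REG_EDX = 2 };  /* assumed context-block slots */

/* Hand-written "translated block" for a made-up source sequence:
   eax = eax + edx; edx = edx + 1 */
void translated_block(ContextBlock *ctx) {
    /* Block entry: copy in the source registers this block reads
       (eax -> r4, edx -> r7, mirroring the slide's mapping). */
    int64_t r4 = ctx->regs[REG_EAX];
    int64_t r7 = ctx->regs[REG_EDX];

    /* Block body: runs entirely in target registers, no memory traffic. */
    r4 = r4 + r7;
    r7 = r7 + 1;

    /* Block exit: copy back only the modified registers. */
    ctx->regs[REG_EAX] = r4;
    ctx->regs[REG_EDX] = r7;
}
```

The point of the per-block discipline is visible here: however many times the body touches r4 and r7, the context block in memory is read and written only at the block boundaries.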

43 At this point, you may be thinking that maybe it makes sense to predecode or binary translate an entire program before beginning emulation. However, this sort of static predecoding or translation is difficult, if not impossible, to implement for many source ISAs. The reason is the code discovery problem: it turns out that it's often difficult to find the beginning of each and every instruction in the source binary. To see why, consider the x86 code on this slide. If you were parsing through this code in your translator and you came to the byte 8b, it's not obvious whether 8b marks the start of a sequence beginning with a movl instruction (8b is the opcode for the movl instruction), or whether 8b is the end of the previous instruction and b5 starts the next instruction (which would then be a mov).

44 There are a number of issues with low-level machine code representations that contribute to the code discovery problem. A major issue is that CISC machines, such as x86, allow variable-length instructions, and so x86 instructions can begin at any byte in the source binary. In contrast, RISC ISAs usually enforce the property that instructions must begin on a word boundary. Another issue is indirect jumps. (What is an indirect jump?) The target of the jump is held in a register, and it is often difficult (if not impossible) during the translation process to determine what the value of that register will be at runtime. This might not be so bad if you knew that the bytes following the jump were actually instructions that you need to translate. But many compilers will actually intersperse code with program data or padding, so it's often not even possible to know whether the byte you're considering is part of an instruction, or is data or padding. The padding I'm referring to is that some compilers will pad the instruction stream with unused bytes in order to align branch or jump targets on word or cache line

45 boundaries for performance reasons.

46 In addition to the code discovery problem, we also have the related problem of code location. Recall that, during emulation, we maintain two program counters: one that we use to traverse the source instructions, and another that is used on the target machine (this is the regular PC). Having two PCs can cause a problem when we have indirect jumps in the source code. What happens is that the address held in the register (that is, the address that we are going to jump to) is a source address, even though it occurs in the translated code. Thus, we need some way to map a source PC address to a target PC address so that we know where to jump in the translated code. For instance, the code in this example will not work, since the target code cannot jump to an address in the source.

47 Now, code discovery and code location are difficult problems that require sophisticated solutions. But there are some special cases where solutions are simpler. One of these is to simply use instruction sets with instructions that are always aligned on fixed boundaries, as is typical for RISC architectures. Using RISC will not completely eliminate the problems, but it does make them easier. One special case that solves both code discovery and code location is to use an ISA that is specifically designed to be emulated, such as Java bytecodes. Java does not allow jumps or branches to arbitrary locations, and there is no data or padding interspersed with the bytecode instructions. This design effectively eliminates the code discovery and code location problems and allows all code to be discovered statically.

48 There is no completely static general solution to these problems. However, people have developed some general solutions that translate the binary dynamically (i.e., while the program is operating on actual input data) and incrementally (i.e., they translate sections of the program as they're reached during execution). So, the key aspect of this approach is that we use interpretation to solve the code discovery problem. We interpret the original source program (no predecoding) and translate blocks of the source to the target during execution. The process is managed by a unit called the emulation manager. (next slide)

49 This figure illustrates the process. So, here we have an interpreter operating on the source binary. As blocks of code are translated, the translated code is placed into a region of memory so that it can be reused later. This region of memory is called the code cache. Sometimes the code cache can become quite large, so we need some way of managing its size. For instance, we might have code cache eviction policies that aim to remove code that we don't expect to use in the future. We'll discuss code cache management later; for now, just assume the code cache is large enough to hold all the compiled code. We also have a component called a map table. This component associates the source PC for a block of source code with the target PC for the corresponding block of translated code. So, the source PC is going to come from the interpreted or translated code. Say we're

50 executing this code, and we get a branch instruction to some source PC. The EM will then check to see if the SPC is in the map table, and, if it is, it will return the corresponding target PC value (which points somewhere in the code cache), where the program can now jump to. If it's not in the map table, that indicates it hasn't been translated yet, so we continue execution by going through the interpreter (where we can jump to the instruction in the source binary directly).
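The map table lookup just described can be sketched in C as a small hash table (the open-addressing scheme, table size, and hash function are arbitrary choices for illustration): lookup either returns the translated block's entry point in the code cache, or NULL, which tells the EM to fall back to the interpreter and possibly translate the block now.

```c
#include <stdint.h>
#include <stddef.h>

/* Map table: associates a source PC with the target PC of its
   translated block in the code cache. */
#define MAP_SIZE 1024   /* power of two, chosen arbitrarily */

typedef struct {
    uint64_t spc;   /* source PC (block start address) */
    void *tpc;      /* target PC: entry point in the code cache */
} MapEntry;

static MapEntry map_table[MAP_SIZE];

static size_t map_hash(uint64_t spc) {
    return (spc >> 2) & (MAP_SIZE - 1);
}

/* Record a newly translated block (assumes the table never fills). */
void map_insert(uint64_t spc, void *tpc) {
    size_t i = map_hash(spc);
    while (map_table[i].tpc != NULL)       /* linear probing */
        i = (i + 1) & (MAP_SIZE - 1);
    map_table[i].spc = spc;
    map_table[i].tpc = tpc;
}

/* Returns the translated entry point, or NULL if not yet translated. */
void *map_lookup(uint64_t spc) {
    size_t i = map_hash(spc);
    while (map_table[i].tpc != NULL) {
        if (map_table[i].spc == spc)
            return map_table[i].tpc;
        i = (i + 1) & (MAP_SIZE - 1);
    }
    return NULL;
}
```

This lookup is exactly the EM's hit/miss decision: a hit means "jump into the code cache," a miss means "keep interpreting from this SPC."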

51 So, I said the system does its translation one block at a time. Let's talk about what that means. A natural unit for a simple incremental translation scheme is what is known as a dynamic basic block. In static compilers, what is a basic block? A static basic block contains a sequence of instructions with a single entry point and a single exit point. What starts a basic block? The first program instruction, the target of a branch or jump, or the instruction immediately after a branch or jump. A dynamic basic block, by contrast, is determined by the actual flow of the program as it executes. A dynamic basic block starts at the instruction immediately following a branch or jump, follows the sequential instruction stream, and ends with the next branch or jump.

52 So, here's an example showing the difference between static basic blocks and dynamic basic blocks. Notice that the instruction at the label loop happens to be the target of a branch. So, in a static compiler, that would start a new basic block. However, in a runtime system, if you were computing dynamic basic blocks and were executing this code sequentially starting from the top, that load instruction would just be part of the first dynamic basic block. Hence, dynamic BBs tend to be larger than static BBs. Also, note that the same static instructions could belong to more than one dynamic BB. For example, the add instruction at label skip (on the static side) belongs to dynamic basic block 2 as well as the shorter dynamic basic block 4. 37

53 So, the entire process basically works like this: After the source binary is loaded into memory, the EM begins interpreting the binary using a simple interpretation scheme. As it proceeds, the system dynamically translates the source instructions into blocks of translated target binary code. These translated blocks are placed into the code cache, and the corresponding SPC-to-TPC mapping is placed into the map table. For each translated block, translation stops when the next branch or jump instruction is encountered. 38

54 The process for the emulation manager is summarized on this flow chart. Basically, we start with a source PC from the interpreter (or already translated code), and we look it up in the map table. If there's a hit in the map table, that means the block has already been translated, so we execute the translated block and then get the next source PC to continue execution. If there's no hit in the map table, we use the source PC to read instructions from the source binary. We interpret those instructions, perhaps translate them, and place the translated code into the code cache. So, in this way, the program is discovered and translated incrementally, until eventually only translated code is being executed. What happens if the source code jumps to the middle of some block that has already been translated? It could happen, and it could be a complicated issue. It's typically not a problem, though, because we just treat the code as a new dynamic basic block and always start a 39

55 new translation. 39
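The whole flow chart can be sketched in a few lines of Python; translate, run, and done here are stand-ins for real binary translation, code-cache execution, and program exit, so treat this as an illustration of the control flow only:

```python
def emulate(start_spc, translate, run, done):
    """Incrementally discover and translate the program, as in the
    EM flow chart: map-table hit -> execute; miss -> translate first."""
    map_table, code_cache = {}, {}
    spc = start_spc
    while not done(spc):
        tpc = map_table.get(spc)
        if tpc is None:                       # miss: translate this block
            tpc = len(code_cache)             # next free code-cache slot
            code_cache[tpc] = translate(spc)  # place block in code cache
            map_table[spc] = tpc              # record the SPC -> TPC mapping
        spc = run(code_cache[tpc])            # execute; returns the next SPC
    return map_table
```

Each block is translated at most once; after the working set has been discovered, every iteration takes the map-table-hit path.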

56 It is important for this system to keep track of the source PC at all times while emulation is taking place. The interpreter uses the source PC directly when it fetches source instructions. But when the interpreter, or a block of translated code, transfers control back to the EM, we have to pass the value of the source PC back to the EM so we know what the next instruction should be. So, how should we transfer the source PC to the EM? One way is to simply map the source PC to a register. We just need to make sure the register is updated before transferring control back to the emulation manager. Another way, which can save a register (and possibly improve performance), is to use the branch and link instruction. The branch and link instruction is used for function calls. Basically, the B&L instruction places the address of the next instruction into the link register so that the subroutine you called can branch back to that instruction when it's done. 40

57 For the purposes of feeding the source PC back to the EM, we can place the source PC into a stub after a branch and link instruction that branches to the EM. In this way, the link register will hold the address of the stub, and the EM can load the source PC directly from there. 40

58 So, with the simple translation system I've described so far, every time a translated block finishes execution, the EM has to be re-entered and we have to do a source PC to target PC lookup in the map table. There are a number of optimizations we can use to reduce this overhead. The simplest optimization is called translation chaining; it is similar to the threading optimization we saw for interpreters. (next slide) The address of the next block is determined by mapping the source PC of the next instruction to the target PC of its translated code using the map table. If the successor block has not yet been translated, then we just insert stub code to jump back to the EM. The next time this block exits to the emulation manager, the EM can check whether the code is now translated. If it is, the EM can find the correct source PC to target PC mapping in the map table, and then overwrite the stub code in the predecessor 41

59 block. 41
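The chaining step itself amounts to patching the predecessor's exit stub once the successor shows up in the map table. A hedged sketch, with a made-up block representation:

```python
def chain(blocks, map_table, pred_spc, succ_spc):
    """Overwrite the predecessor's exit-to-EM stub with a direct jump
    to the successor's translated code, if it has been translated."""
    succ_tpc = map_table.get(succ_spc)
    if succ_tpc is None:
        return False                               # not translated yet: keep the stub
    blocks[pred_spc]["exit"] = ("jump", succ_tpc)  # patch: branch straight there
    return True
```

Once patched, control flows from block to block entirely inside the code cache, never re-entering the EM along that edge.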

60 This figure illustrates the benefit of translation chaining. So, with the simple dynamic translation scheme, we translate blocks one at a time, and each translated block jumps back to the EM after execution. With translation chaining, we still translate blocks one at a time, but, during translation, we attempt to link them together into chains. To chain two blocks together, we replace the branch back to the EM with a branch directly to the next translated block. 42

61 This figure illustrates the process of installing a link from a predecessor block to the successor block. 1) We reach the end of the predecessor, where there is a branch & link to the EM followed by the next source PC. 2) The EM reads the SPC (via the link register) and then looks it up in the map table to get the target PC. 3) At that point, the EM can set up a link to the successor by overwriting the branch and link instruction in the predecessor block. 43

62 This slide shows how the code might be modified to install a branch chain. So, notice in this case, we have a conditional branch at the end of the dynamic BB. For the true branch, we replace the branch to the emulation manager (bl F000) with a branch directly to the successor block (b 9c08). 44

63 Translation chaining works in situations where you know the branch target is never going to change. But what about situations with indirect jumps? In these cases, the target address might change from one execution to the next, so we can't replace the jump with a direct jump to a static address. The easiest way to handle this situation is to just always go through the EM to find the correct target PC when the source PC is known. But there are some optimizations for handling indirect jumps. The key here is that, in many cases, even though the source code uses an indirect jump, the target of the indirect jump seldom changes. So, we can use profiling to determine which addresses are most likely to be the targets of a particular jump instruction. Then, we can inline the most frequently used target addresses into the translated binary code. This is illustrated in the code on this slide. Rx symbolizes a register holding the 45

64 indirect jump's source PC value. Now, in a series of if statements, we check if Rx is equal to the most frequent source PC values we have seen for this jump. On a match, we just directly jump to the corresponding target PC value, avoiding the expensive jump back to the emulation manager. Typically, the comparisons are ordered so that we check the most frequent source PC destinations first. If all the predictions are wrong, then the map table lookup has to be performed anyway, so we lose the performance benefit of this technique. 45
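In Python rather than the inlined if statements of the translated code, the prediction chain behaves like this (the profile data and addresses are invented for illustration):

```python
def indirect_jump(rx, predictions, map_table):
    """rx holds the indirect jump's runtime source PC. predictions is a
    list of (spc, tpc) pairs ordered most frequent first."""
    for spc, tpc in predictions:   # the inlined compare-and-branch chain
        if rx == spc:
            return tpc             # predicted correctly: skip the EM
    return map_table[rx]           # all predictions wrong: full lookup
```

The loop body corresponds to one inlined compare-and-branch pair in the translated code; the final line is the fallback trip through the EM.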

65 There are a number of issues that make incremental code translation a bit challenging. The first is one we've already discussed: it is important to keep track of the source PC during the emulation process, so that if we need that bit of state at any point in the code, or if we need to fetch source instructions that aren't already translated, then we have access to it. We already discussed a technique using the branch & link instruction to pass the source PC to the EM at the end of a translated block (see slide 40). Another issue is self-modifying code. Although it is relatively uncommon, some applications may perform stores in the code area. When this happens, code that we've already translated may no longer be valid, so we need some mechanism to possibly re-translate it. We'll talk more about self-modifying code in the next section. Another issue is self-referencing code. This is when the program performs loads from the code area. In these cases, the data that is read must correspond to the original 46

66 source code, not the already translated version. We'll also talk about self-referencing code in the next set of slides. Lastly, if the translated code raises some sort of exception condition, the correct state of the original source code (including the source PC of the trapping instruction) must be provided to the handler routine. This can be difficult in the face of optimizations and code reordering. We'll revisit this topic several times in the remainder of the lecture. 46

67 There are a number of details that need to be considered when translating a complete instruction set. I will cover a few of the most common and most important over the next few slides. 47

68 Registers are at the very top of your computer's storage hierarchy, and thus, are critical for performance. The general purpose registers in your target ISA are used for a number of functions, including: 1) Holding general-purpose registers of the source ISA 2) Holding special-purpose registers (such as the PC, or condition codes register) 3) Pointing to the register context block or memory image of the source 4) Holding intermediate values that might be useful during emulation. If your target ISA has significantly more registers than your source ISA, then it might be possible to satisfy all these functions simultaneously. But, if the target ISA has the same or fewer registers than the source ISA, then it might be difficult to do all these things at the same time. In these cases, the emulator has to prioritize the use of target registers. For instance, you might map the register context block and source memory image to registers first because these pointers are going to be used very frequently during the 48

69 emulation process. Next, you might map the source PC to a register. If the source has a stack register, condition codes, or other special registers, you might map those as well. So, you might have 5 to 10 registers already reserved before you start mapping general purpose registers. Another strategy you can use is to assign source registers to target registers on a block-by-block basis. Basically, on entering a block of translated code, you identify the registers that are read in the source program and then assign those registers to registers on the target. Any modified registers are copied back to the register context block before the block exits. This is useful because if the source block does not use a particular register, you don't have to map it to a target register. 48
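A small sketch of the block-by-block strategy, with the source register context block modeled as a dictionary (an assumption for illustration; a real translator works with registers and loads/stores):

```python
def emulate_block(context, reads, writes, body):
    """Map in only the source registers this block reads, run the
    translated block body, and copy back only the registers it wrote."""
    target_regs = {r: context[r] for r in reads}  # load on block entry
    body(target_regs)                             # the translated block
    for r in writes:
        context[r] = target_regs[r]               # store on block exit
    return context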

70 Another issue is how to deal with condition codes. Condition codes are special architected bits that characterize instruction results (zero, negative, overflow, carry) and are tested by conditional branch instructions. However, condition codes are not used uniformly across different architectures. For instance, in Intel's IA-32, the condition codes are always set implicitly as a side effect of executing some instruction (an add instruction might set the carry CC). SPARC and PowerPC have instructions for explicitly setting the condition code registers. And the MIPS ISA does not use condition codes at all. So, this non-uniformity can cause issues during the emulation process. In most cases, these issues are performance-oriented rather than correctness-oriented. We can break the problem down into a few different cases. The first case is that neither the source nor the target ISA uses condition codes. This is the 49

71 easiest case because there is nothing special to do. Another easy case is where the source ISA does not use CC, but the target does. In this case, you might need to use the CC registers on the target to implement the operations of the source, but otherwise, no additional work is needed. That is, you don't need to maintain the condition code state of the source machine. The next case is where the source ISA has explicit CC, but there are no CC on the target machine (or it does not have the same CC). In this case, we need to emulate the operation of the condition codes on the target, but this emulation is straightforward. The most challenging case is the last one, where the source ISA has implicit CC, but the target machine does not have CC (or doesn't use the same CC as the source). This last case can be very difficult and time consuming to implement. The most straightforward way to implement it is to just evaluate the CC after every instruction that might set them. But this can be a very compute-expensive process on the target machine, often requiring more cycles to emulate than the rest of the instruction. 49

72 For example, in the IA-32, the CC are a set of flags in the EFLAGS register. The add instruction always sets 6 of these CC as shown on this slide. Emulating all of these CC for each add is very time consuming. 50

73 One way to mitigate this performance impact is to use a technique known as lazy condition code evaluation. The key here is that even though the CC are set frequently, they are seldom used. Lazy CC evaluation saves the operands and operation that set the condition codes, rather than the resulting CC values themselves. In this way, you don't actually have to compute the CC for every instruction. You only compute them when they are needed. For example, the IA-32 add instruction modifies all the condition code bits. If an add operates on two registers containing the values 2 and 3, then you would save: add: 2: 3: 5 into a table after the instruction completes. Then if a later instruction needs to test the SF (sign) CC, you look at the result field in the table to generate the SF CC of 0. Another lazy CC strategy is, during binary translation, to do some extra analysis to detect cases where condition codes will never be used. 51
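Here is a minimal sketch of lazy CC evaluation in Python; a real translator would keep these fields in reserved registers rather than an object, and only the ZF and SF flags are shown:

```python
class LazyCC:
    """Save the operation, operands, and result that last set the
    condition codes; derive individual flags only on demand."""
    def record(self, op, a, b, result):
        self.op, self.a, self.b, self.result = op, a, b, result

    def zf(self):
        return 1 if self.result == 0 else 0     # zero flag

    def sf(self, bits=32):
        return (self.result >> (bits - 1)) & 1  # sign flag (two's complement)
```

For the add of 2 and 3 above, record("add", 2, 3, 5) is all the work done when the instruction executes; the sign flag of 0 is derived only if a later branch actually tests it.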

74 Let's look at an example. 51

75 For example, at the top of this slide, we have some x86 code. add %ebx, 0(%eax) add %ecx, %ebx jmp label1 We can use analysis during translation to detect that the CC will not be needed after the first add instruction. However, at the time the jmp is translated, it may not be known whether the CC set by the second add will be needed. For example, the jmp and its label may be translated separately and reside in separate blocks. To handle this situation, the PowerPC translation of this code uses R25 through R27 for lazy CC evaluation. Basically, the operand and opcode information of the add are saved in these registers. 52

76 Then, here's the PowerPC translation. We store the operands and opcode of the add instruction into the r25, r26, and r27 registers, then branch to label1. At label1, it turns out we need to test the condition codes. So, we branch to the genzf routine to set the ZF condition code. From there, we branch to a bit of code that evaluates the ZF flag and sets it in cr0 so that we can branch on that condition code. (Appending a period to a PowerPC instruction tells it to set cr0.) 53

77 Another issue has to do with ensuring that arithmetic operations transform data in the same way on both the source and target machines. In most cases, emulating arithmetic operations is straightforward because data formats have been more or less standardized over the years. For instance, integers use two's complement. Floating point uses the IEEE standard. Additionally, most ISAs offer basic logical and arithmetic operations that can be used to emulate the different variations of shifts and logical instructions in a source ISA. However, there are some exceptions that lead to differences in the way arithmetic operations are done on different machines. For instance, although most machines use 64 bits for their floating point results, the IA-32 uses 80 bits for its intermediate results, meaning the precision is going to be slightly different. It's possible, but quite difficult, to obtain floating point results identical to IA-32 on a non-IA-32 machine. Unless it's crucial for the application, most emulators would just accept the loss in precision. 54

78 Another example where emulation is difficult is when some machines provide native integer divide instructions, while others rely on converting integer values and using the FP unit to divide. Another would be machines with different lengths for immediate values (for instance, a constant used in instructions). As you would expect, mapping shorter immediate values to machines with longer immediate fields is easier than mapping longer immediate values to shorter immediate fields. However, all machines have ways of building up full-length constants from their immediate fields, so all immediate values can be handled. The main takeaway here is that it is usually possible to emulate programs on machines with different data formats and arithmetic instructions, but depending on the differences, there might be significant implementation and performance challenges. 54

79 Another issue is that different machines may use different orderings for bytes within a word. One class of machines, the so-called big-endian machines, have the most significant byte as byte 0 (on the left). And little-endian machines use the last byte in the word (byte 3 on 32-bit architectures) as the most significant byte. Little endian has the advantage of being able to read the least significant bits without changing the address you're reading. Store 16 in a word in memory: on a big-endian machine, you need to compute a new address to read it back; on a little-endian machine, you can read 16 directly by using a byte read instead of a word read at the same address. Most emulators will typically maintain the guest data image in the same byte order as 55

80 assumed by the source ISA. So, to emulate a big-endian source on a little-endian machine, the emulator can modify addresses when bytes (or half-words) are accessed from the guest memory region. For instance, on a load-byte instruction, you might complement the low-order bits to obtain the correct byte to load (e.g., 00 becomes 11, or 01 becomes 10). This can be an awkward, time-consuming process that is hard to avoid. Some architectures actually support both byte orders, selectable with a mode bit. Obviously, having a target ISA with this feature would simplify the emulation process. 55
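The low-order address complement can be sketched as a single XOR; the 4-byte word size and the access sizes here are assumptions for a 32-bit guest:

```python
def host_address(guest_addr, access_size=1, word_size=4):
    """Adjust a big-endian guest address for a little-endian host that
    keeps each word in guest byte order: XOR the low-order bits
    (e.g., 0b00 -> 0b11 and 0b01 -> 0b10 for byte loads)."""
    return guest_addr ^ (word_size - access_size)
```

Byte 0 of a big-endian word lands at host offset 3, a half-word at offset 0 lands at offset 2, and full-word accesses pass through unchanged.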

Emulation. Michael Jantz. Acknowledgements: Slides adapted from Chapter 2 in Virtual Machines: Versatile Platforms for Systems and Processes by James E. Smith and Ravi Nair. Credit to Prasad A. Kulkarni.

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information

Chapter 5. A Closer Look at Instruction Set Architectures

Chapter 5. A Closer Look at Instruction Set Architectures Chapter 5 A Closer Look at Instruction Set Architectures Chapter 5 Objectives Understand the factors involved in instruction set architecture design. Gain familiarity with memory addressing modes. Understand

More information

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam Assembly Language Lecture 2 - x86 Processor Architecture Ahmed Sallam Introduction to the course Outcomes of Lecture 1 Always check the course website Don t forget the deadline rule!! Motivations for studying

More information

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as

More information

Chapter 5:: Target Machine Architecture (cont.)

Chapter 5:: Target Machine Architecture (cont.) Chapter 5:: Target Machine Architecture (cont.) Programming Language Pragmatics Michael L. Scott Review Describe the heap for dynamic memory allocation? What is scope and with most languages how what happens

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

We briefly explain an instruction cycle now, before proceeding with the details of addressing modes.

We briefly explain an instruction cycle now, before proceeding with the details of addressing modes. Addressing Modes This is an important feature of computers. We start with the known fact that many instructions have to include addresses; the instructions should be short, but addresses tend to be long.

More information

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018 CS 31: Intro to Systems Virtual Memory Kevin Webb Swarthmore College November 15, 2018 Reading Quiz Memory Abstraction goal: make every process think it has the same memory layout. MUCH simpler for compiler

More information

DC57 COMPUTER ORGANIZATION JUNE 2013

DC57 COMPUTER ORGANIZATION JUNE 2013 Q2 (a) How do various factors like Hardware design, Instruction set, Compiler related to the performance of a computer? The most important measure of a computer is how quickly it can execute programs.

More information

Chapter 2: Instructions How we talk to the computer

Chapter 2: Instructions How we talk to the computer Chapter 2: Instructions How we talk to the computer 1 The Instruction Set Architecture that part of the architecture that is visible to the programmer - instruction formats - opcodes (available instructions)

More information

Interfacing Compiler and Hardware. Computer Systems Architecture. Processor Types And Instruction Sets. What Instructions Should A Processor Offer?

Interfacing Compiler and Hardware. Computer Systems Architecture. Processor Types And Instruction Sets. What Instructions Should A Processor Offer? Interfacing Compiler and Hardware Computer Systems Architecture FORTRAN 90 program C++ program Processor Types And Sets FORTRAN 90 Compiler C++ Compiler set level Hardware 1 2 What s Should A Processor

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

A First Look at Microprocessors

A First Look at Microprocessors A First Look at Microprocessors using the The General Prototype Computer (GPC) model Part 2 Can you identify an opcode to: Decrement the contents of R1, and store the result in R5? Invert the contents

More information

Instruction Set Architecture. "Speaking with the computer"

Instruction Set Architecture. Speaking with the computer Instruction Set Architecture "Speaking with the computer" The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture Digital Design

More information

Instruction Set Principles. (Appendix B)

Instruction Set Principles. (Appendix B) Instruction Set Principles (Appendix B) Outline Introduction Classification of Instruction Set Architectures Addressing Modes Instruction Set Operations Type & Size of Operands Instruction Set Encoding

More information

RISC Principles. Introduction

RISC Principles. Introduction 3 RISC Principles In the last chapter, we presented many details on the processor design space as well as the CISC and RISC architectures. It is time we consolidated our discussion to give details of RISC

More information

CSCI 402: Computer Architectures. Instructions: Language of the Computer (1) Fengguang Song Department of Computer & Information Science IUPUI

CSCI 402: Computer Architectures. Instructions: Language of the Computer (1) Fengguang Song Department of Computer & Information Science IUPUI To study Chapter 2: CSCI 402: Computer Architectures Instructions: Language of the Computer (1) Fengguang Song Department of Computer & Information Science IUPUI Contents 2.1-2.3 Introduction to what is

More information

Computer Organization & Assembly Language Programming

Computer Organization & Assembly Language Programming Computer Organization & Assembly Language Programming CSE 2312-002 (Fall 2011) Lecture 8 ISA & Data Types & Instruction Formats Junzhou Huang, Ph.D. Department of Computer Science and Engineering Fall

More information

CS577 Modern Language Processors. Spring 2018 Lecture Interpreters

CS577 Modern Language Processors. Spring 2018 Lecture Interpreters CS577 Modern Language Processors Spring 2018 Lecture Interpreters 1 MAKING INTERPRETERS EFFICIENT VM programs have an explicitly specified binary representation, typically called bytecode. Most VM s can

More information

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging CS162 Operating Systems and Systems Programming Lecture 14 Caching (Finished), Demand Paging October 11 th, 2017 Neeraja J. Yadwadkar http://cs162.eecs.berkeley.edu Recall: Caching Concept Cache: a repository

More information

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands Stored Program Concept Instructions: Instructions are bits Programs are stored in memory to be read or written just like data Processor Memory memory for data, programs, compilers, editors, etc. Fetch

More information

T Jarkko Turkulainen, F-Secure Corporation

T Jarkko Turkulainen, F-Secure Corporation T-110.6220 2010 Emulators and disassemblers Jarkko Turkulainen, F-Secure Corporation Agenda Disassemblers What is disassembly? What makes up an instruction? How disassemblers work Use of disassembly In

More information

There are different characteristics for exceptions. They are as follows:

There are different characteristics for exceptions. They are as follows: e-pg PATHSHALA- Computer Science Computer Architecture Module 15 Exception handling and floating point pipelines The objectives of this module are to discuss about exceptions and look at how the MIPS architecture

More information

Memory Management: Virtual Memory and Paging CS 111. Operating Systems Peter Reiher

Memory Management: Virtual Memory and Paging CS 111. Operating Systems Peter Reiher Memory Management: Virtual Memory and Paging Operating Systems Peter Reiher Page 1 Outline Paging Swapping and demand paging Virtual memory Page 2 Paging What is paging? What problem does it solve? How

More information

EC 413 Computer Organization

EC 413 Computer Organization EC 413 Computer Organization Review I Prof. Michel A. Kinsy Computing: The Art of Abstraction Application Algorithm Programming Language Operating System/Virtual Machine Instruction Set Architecture (ISA)

More information

CS311 Lecture: Pipelining and Superscalar Architectures

CS311 Lecture: Pipelining and Superscalar Architectures Objectives: CS311 Lecture: Pipelining and Superscalar Architectures Last revised July 10, 2013 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as a result

More information

Computer System Architecture

Computer System Architecture CSC 203 1.5 Computer System Architecture Department of Statistics and Computer Science University of Sri Jayewardenepura Instruction Set Architecture (ISA) Level 2 Introduction 3 Instruction Set Architecture

More information

CSC258: Computer Organization. Microarchitecture

CSC258: Computer Organization. Microarchitecture CSC258: Computer Organization Microarchitecture 1 Wrap-up: Function Conventions 2 Key Elements: Caller Ensure that critical registers like $ra have been saved. Save caller-save registers. Place arguments

More information

17. Instruction Sets: Characteristics and Functions

17. Instruction Sets: Characteristics and Functions 17. Instruction Sets: Characteristics and Functions Chapter 12 Spring 2016 CS430 - Computer Architecture 1 Introduction Section 12.1, 12.2, and 12.3 pp. 406-418 Computer Designer: Machine instruction set

More information

Review Questions. 1 The DRAM problem [5 points] Suggest a solution. 2 Big versus Little Endian Addressing [5 points]

Review Questions. 1 The DRAM problem [5 points] Suggest a solution. 2 Big versus Little Endian Addressing [5 points] Review Questions 1 The DRAM problem [5 points] Suggest a solution 2 Big versus Little Endian Addressing [5 points] Consider the 32-bit hexadecimal number 0x21d3ea7d. 1. What is the binary representation

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them. Contents at a Glance About the Author...xi

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10032011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Chapter 3 Number Systems Fixed Point

More information

Binary Translation 2

Binary Translation 2 Binary Translation 2 G. Lettieri 22 Oct. 2014 1 Introduction Fig. 1 shows the main data structures used in binary translation. The guest system is represented by the usual data structures implementing

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Dynamic Instruction Scheduling with Branch Prediction

More information

RISC I from Berkeley. 44k Transistors 1Mhz 77mm^2

RISC I from Berkeley. 44k Transistors 1Mhz 77mm^2 The Case for RISC RISC I from Berkeley 44k Transistors 1Mhz 77mm^2 2 MIPS: A Classic RISC ISA Instructions 4 bytes (32 bits) 4-byte aligned Instructions operate on memory and registers Memory Data types

More information

Memory Models. Registers

Memory Models. Registers Memory Models Most machines have a single linear address space at the ISA level, extending from address 0 up to some maximum, often 2 32 1 bytes or 2 64 1 bytes. Some machines have separate address spaces

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Preventing Stalls: 1

Preventing Stalls: 1 Preventing Stalls: 1 2 PipeLine Pipeline efficiency Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls Ideal pipeline CPI: best possible (1 as n ) Structural hazards:

More information

Assembly Language: Overview!

Assembly Language: Overview! Assembly Language: Overview! 1 Goals of this Lecture! Help you learn:" The basics of computer architecture" The relationship between C and assembly language" IA-32 assembly language, through an example"

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - Von Neumann Architecture 2 Two lessons Summary of the traditional computer architecture Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

Processor Organization and Performance

Processor Organization and Performance Chapter 6 Processor Organization and Performance 6 1 The three-address format gives the addresses required by most operations: two addresses for the two input operands and one address for the result. However,

More information

Systems Architecture I

Systems Architecture I Systems Architecture I Topics Assemblers, Linkers, and Loaders * Alternative Instruction Sets ** *This lecture was derived from material in the text (sec. 3.8-3.9). **This lecture was derived from material

More information

Memory Management. Kevin Webb Swarthmore College February 27, 2018

Memory Management. Kevin Webb Swarthmore College February 27, 2018 Memory Management Kevin Webb Swarthmore College February 27, 2018 Today s Goals Shifting topics: different process resource memory Motivate virtual memory, including what it might look like without it

More information

Operating System Principles: Memory Management Swapping, Paging, and Virtual Memory CS 111. Operating Systems Peter Reiher

Operating System Principles: Memory Management Swapping, Paging, and Virtual Memory CS 111. Operating Systems Peter Reiher Operating System Principles: Memory Management Swapping, Paging, and Virtual Memory Operating Systems Peter Reiher Page 1 Outline Swapping Paging Virtual memory Page 2 Swapping What if we don t have enough

More information

Compiler Construction D7011E

Compiler Construction D7011E Compiler Construction D7011E Lecture 8: Introduction to code generation Viktor Leijon Slides largely by Johan Nordlander with material generously provided by Mark P. Jones. 1 What is a Compiler? Compilers

More information

EE 3170 Microcontroller Applications

EE 3170 Microcontroller Applications EE 317 Microcontroller Applications Lecture 5 : Instruction Subset & Machine Language: Introduction to the Motorola 68HC11 - Miller 2.1 & 2.2 Based on slides for ECE317 by Profs. Davis, Kieckhafer, Tan,

More information

COMP2121: Microprocessors and Interfacing. Instruction Set Architecture (ISA)

COMP2121: Microprocessors and Interfacing. Instruction Set Architecture (ISA) COMP2121: Microprocessors and Interfacing Instruction Set Architecture (ISA) http://www.cse.unsw.edu.au/~cs2121 Lecturer: Hui Wu Session 2, 2017 1 Contents Memory models Registers Data types Instructions

More information

CS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College September 25, 2018

CS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College September 25, 2018 CS 31: Intro to Systems ISAs and Assembly Kevin Webb Swarthmore College September 25, 2018 Overview How to directly interact with hardware Instruction set architecture (ISA) Interface between programmer

More information

Chapter 2 Instruction Set Architecture

Chapter 2 Instruction Set Architecture Chapter 2 Instruction Set Architecture Course Outcome (CO) - CO2 Describe the architecture and organization of computer systems Program Outcome (PO) PO1 Apply knowledge of mathematics, science and engineering

More information

Basic Execution Environment

Basic Execution Environment Basic Execution Environment 3 CHAPTER 3 BASIC EXECUTION ENVIRONMENT This chapter describes the basic execution environment of an Intel Architecture processor as seen by assembly-language programmers.

More information

Instruction Sets: Characteristics and Functions

Instruction Sets: Characteristics and Functions Instruction Sets: Characteristics and Functions Chapter 10 Lesson 15 Slide 1/22 Machine instruction set Computer designer: The machine instruction set provides the functional requirements for the CPU.

More information