Using Timestamps to Track Causal Dependencies


J. A. David McWha
Dept. of Computer Science, University of Waikato, Private Bag 3105, Hamilton, New Zealand

ABSTRACT

As computer architectures speculate more aggressively in an attempt to extract an increasing amount of parallelism, tracking causal dependencies is becoming an increasingly difficult task. Timestamping events is a convenient way to store this information. It is argued that timestamps with a fixed maximum length are easier to implement. A number of fixed-length timestamp schemes are proposed and evaluated by functional simulation, and the advantages and shortcomings of each are discussed.

1 Introduction

To improve processor performance computer architects are increasingly turning to parallelism, and in particular to out-of-order execution and speculation [5][6]. Current production architectures use only a modest level of speculation; for example, the Intel Pentium Pro [7] uses a reorder buffer to hold a pool of 40 instructions, and this determines how far ahead the processor can speculate (the speculation distance). At this level of speculation causal dependencies can be maintained by relatively simple hardware. As speculation becomes more aggressive this hardware will grow more complex, quickly becoming infeasible. By timestamping each instruction (or block of instructions) their virtual sequence can be tracked in a scalable way. The virtual sequence is the order in which a sequential machine would execute the instructions.

As a program may be of arbitrary length, an arbitrary number of timestamps may be required. If the timestamp representation is allowed to become arbitrarily large then storage and bandwidth requirements are also unbounded. This is of particular importance in hardware implementations, where timestamps of varying lengths are difficult to implement. Arbitrarily large timestamps will also take a potentially unbounded time to generate and compare. However, a fixed-length timestamp representation provides only a finite number of timestamps, and a sufficiently large program will exhaust these. In order to execute such a program a method of reusing timestamps is necessary; solutions to this problem are discussed later in this paper.

In the program fragment shown in Figure 1 the code D will always execute, regardless of whether the branch at A is taken or not. This is known as code convergence, and it is possible to start D in parallel with A, constrained only by data dependencies.

    if (A) then B; else C;
    D;

Figure 1: An example of convergent code.

To do this it must be possible to insert an arbitrary number of timestamps between the start of the section of unknown size (A and B or C) and the convergent code (D), in order to assign a timestamp to each instruction [1].

In this paper we present a conceptual model for timestamps (see also [3]) and consider a number of possible implementations of fixed-length timestamp representations and their efficiency. Results are presented from a number of algorithms run on a simulator for the WarpEngine [4], an aggressively optimistic architecture based on the Time Warp algorithm [8]. The WarpEngine extracts large amounts of parallelism, allowing the restrictions caused by the timestamps to be clearly seen.

2 Conceptual Timestamps

2.1 Tree-based Execution

To identify convergent code in a program the code can be mapped to an execution tree. This is done by splitting the instructions into blocks, for example (but not necessarily) basic blocks. Figure 2 shows a tree that might be generated for the sequence of code: A; B; C; D. Note that an n-way tree can in all cases be decomposed to a binary tree by using extra levels of nodes which further subdivide the tree but perform no execution. The tree can be executed serially by a depth-first, left-to-right traversal of the tree (the virtual sequence). Separate branches may have data dependencies, but are guaranteed to have no control dependencies.

Figure 2: Execution tree generated by sequential code (nodes A, B, C and D).

Using the Time Warp algorithm the WarpEngine speculatively executes each branch of the tree in parallel, rolling back any speculation errors and re-executing. Global Virtual Time (GVT) is represented by the earliest node in the virtual sequence with an instruction still pending. All nodes earlier in the virtual sequence will never be rolled back, and can be removed from the system (fossil collected).
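As a concrete illustration of this execution model, the short sketch below builds a small execution tree and recovers its virtual sequence with a depth-first, left-to-right traversal. It is only an illustrative sketch, not WarpEngine code: the Node class, the tree shape chosen for A; B; C; D, and the assumption that a node's own block precedes the blocks in its subtrees are ours.

    # Illustrative sketch only (not the WarpEngine implementation): an execution
    # tree whose nodes are blocks of instructions, with children ordered left to
    # right.  We assume a node's own block precedes the blocks in its subtrees.
    class Node:
        def __init__(self, block, children=None):
            self.block = block              # label for this block of instructions
            self.children = children or []  # child nodes, ordered left to right

    def virtual_sequence(node):
        """Depth-first, left-to-right traversal: the order in which a
        sequential machine would execute the blocks."""
        yield node.block
        for child in node.children:
            yield from virtual_sequence(child)

    # One possible tree for the sequence A; B; C; D (the shape is illustrative).
    tree = Node("A", [Node("B"), Node("C", [Node("D")])])
    print(list(virtual_sequence(tree)))     # ['A', 'B', 'C', 'D']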

2.2 Timestamps

Figure 3: Conceptual timestamps for a binary tree.

The timestamp representation should efficiently represent a large number of timestamps and efficiently perform creation of new timestamps and comparison of timestamps to obtain an ordering. The initial design for the timestamps is the simplest possible which allows a strict ordering to be determined and an arbitrary number of timestamps to be inserted between any pair of timestamps. This is done in an effort to minimize overheads. The symbolic representation described in this section was not directly simulated; Section 3 describes the practical representations implemented and the results obtained.

Each node in the execution tree is associated with a string that gives the path from the root to the node. A zero is used for a left branch, a one for a right branch, and a terminator symbol λ ends the string (as in Figure 3). A lexicographic ordering in which λ < 0 < 1 places these strings in the same order as the sequential execution order of the associated nodes. Thus the strings can be used as timestamps for the virtual sequence.

2.3 Rescaling

A problem that arises for any finite representation is the need to re-use old timestamps. As the execution tree grows, eventually there will be more nodes than can be represented. Because branches of the tree grow in depth unevenly, it is the number of levels available to a branch that causes the restriction, more than the total number of timestamps themselves. This uneven growth tends to cause inefficient use of the timestamps, and some will remain unused and be wasted.

As GVT advances, early timestamps will become available for re-use. To make use of these, timestamps are re-allocated while retaining their ordering. We term this operation rescaling. Details on implementations of rescaling can be found in [3].

3 Timestamp Schemes

3.1 Length Representation Timestamps

To make timestamp comparison easy we can map the timestamps from bit strings to integer values. This makes it possible to compare timestamps bitwise, as for integers. The bit string timestamp is converted to an integer by padding the bit string out to a fixed I bits with zeros and then appending the length of the original string (less the terminating symbol). Some integers remain unused in the representation. Table 1 shows the division of different sized length representation timestamps and the number of levels of nodes they can represent. The advantage of this representation is its simplicity; however, the maximum tree depth is quite limited.

    Total size (bits)   Length representation (I,L)   Number of levels   Exponential representation (M,I,L)   Number of levels
    32                  (27,5)                        28                 (16,12,4)
    64                  (58,6)                        59                 (32,27,5)
    96                  (89,7)                        90                 (32,58,6)

Table 1: Number of levels which can be represented by different sizes of length and exponential representation timestamps. Timestamps are divided into mantissa (M), integer (I), and length (L) parts.
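The following sketch illustrates the length representation using the (I,L) = (27,5) division from the 32-bit row of Table 1. It is a hedged software illustration under our own naming (the paper describes a hardware encoding); its purpose is to show that the resulting fixed-width integers compare in the same order as the conceptual path strings.

    # Hedged sketch of the length representation, with (I,L) = (27,5) as in the
    # 32-bit row of Table 1.  A timestamp is the path string of 0/1 branch
    # choices from the root (terminator omitted); it is padded on the right
    # with zeros to I bits and its length is appended in the low L bits.
    I_BITS = 27   # bits holding the zero-padded path string
    L_BITS = 5    # bits holding the length of the original string

    def encode_length(path: str) -> int:
        assert len(path) <= I_BITS
        padded = int(path.ljust(I_BITS, "0"), 2)   # pad out to I bits with zeros
        return (padded << L_BITS) | len(path)

    # Plain integer comparison reproduces the virtual-sequence ordering:
    # a parent precedes its children, and left branches precede right branches.
    assert encode_length("") < encode_length("0") < encode_length("1")
    assert encode_length("0") < encode_length("00") < encode_length("01")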

3.2 Exponential Representation Timestamps

This representation uses a scheme similar to floating point number representations to allow different parts of the tree to grow to different depths. It is comprised of two parts: a mantissa and an exponent. The exponent is the number of leading zeros in the timestamp, while the mantissa is the normalized tail of the timestamp. In the example in Figure 4 the timestamp 001 becomes 2,1, where the number before the comma is the exponent and the string after the comma is the mantissa. A complete exponential representation requires that the mantissa be coded using the length representation above. The number of levels which can be represented is greatest on the left side of the tree and decreases to the right, as shown for an arbitrary exponent size in Table 1. The proportions in which a timestamp is divided into exponent and mantissa will be the subject of some optimization based on the application.

Figure 4: Timestamps in exponential form.

This representation favours execution of the left side of the tree. Nodes that are early in the virtual sequence can have longer strings and so will not exhaust the maximum depth as often. Thus rescale, and possibly cancel, operations can be reduced. This has two additional advantages. First, by delaying execution of nodes to the right of the tree, which are more speculative, it helps balance the overall execution. Second, a compiler can take advantage of the representation by scheduling more computation on the left of the tree. Provided the compiler can schedule the critical parts of the execution to the left of the tree, execution can progress for much longer without needing to rescale using this representation.
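The sketch below illustrates the exponential representation in the same hedged style, reusing encode_length() from the previous sketch. The paper does not spell out how two exponential timestamps are compared in hardware, so for illustration the sketch simply reconstitutes the path string before comparing; the function names are ours.

    # Hedged sketch of the exponential representation.  The exponent counts the
    # leading zeros of the path string; the remaining (normalized) tail, the
    # mantissa, is coded with encode_length() from the length-representation
    # sketch above.
    def to_exponential(path: str):
        exponent = len(path) - len(path.lstrip("0"))   # number of leading zeros
        mantissa = path[exponent:]                     # normalized tail
        return exponent, encode_length(mantissa)

    def reconstituted_path(ts):
        """Rebuild the original path string; comparing these strings (a prefix
        ordering before its extensions) gives the conceptual ordering."""
        exponent, coded = ts
        length = coded & ((1 << L_BITS) - 1)
        bits = format(coded >> L_BITS, f"0{I_BITS}b")[:length]
        return "0" * exponent + bits

    # "001" becomes exponent 2 with mantissa "1" (the "2,1" form of Figure 4),
    # and still falls between its parent "00" and the later node "01".
    assert to_exponential("001")[0] == 2
    assert reconstituted_path(to_exponential("00")) \
           < reconstituted_path(to_exponential("001")) \
           < reconstituted_path(to_exponential("01"))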

3.3 Ideal Timestamps

In order to show the restrictions placed upon execution by the timestamp schemes we also simulate execution with ideal timestamps of unbounded length, i.e. timestamps which never require rescaling.

3.4 Rescaling Method

In the simulation results which follow, the optimistic assumption that rescaling has no cost, in time or resources, is used. This allows an optimistic first approximation of the feasibility of the timestamp schemes to be obtained and allows various rescaling strategies to be studied. Given this assumption the optimal rescaling method is to rescale one level at a time to the root of the tree until sufficient timestamps have been reclaimed. This results in a large number of rescales, impractical in a real scheme, but rescales the minimum number of levels necessary to complete execution. This allows canceled nodes to be rescheduled as early as possible, because some timestamps remain unused near the root on the right-hand side of the tree.

4 Test Algorithms

A set of small algorithms was used for testing the timestamps. These were all hand-coded in WarpEngine assembly language [2], and hence necessarily of small size. The programs that we have simulated span the types of operations that are performed in many programs. The sorting algorithm quicksort (quick) is used. Naive binary tree insertion (bin) and AVL tree insertion (avl) perform dynamic structure manipulation. Matrix and array operations are represented by matrix multiplication (mat) and Gauss-Jordan elimination (gj). Fibonacci number generation (fib) is an example of recursion. The algorithms are simple in concept but vary in the relative amounts of data and control dependence.

5 Timestamp Scheme Comparison

Four different configurations were simulated for each of the algorithms available in order to examine the restrictions caused by the different timestamp schemes. The configurations consisted of: the length scheme using 32 bits; the length scheme using 64 bits; the exponential scheme using 64 bits (32 exponent and 32 mantissa); and ideal timestamps. Using timestamps the same size as the machine's word size is likely to be convenient. Timestamps larger than 64 bits are not used, since these may consume undesirable amounts of resources. By comparing the 64-bit exponential scheme with the 64-bit length scheme we can determine whether the 32 additional bits are more valuable as an exponent, or used to extend the length timestamp. It must also be remembered that all manipulations of the timestamps will take longer for exponential timestamps than for length timestamps, due to the added complexity of the scheme. However, for simplicity's sake this has not been simulated. By comparing with the ideal timestamps the extent to which the timestamp scheme is restricting execution can be seen.

As described earlier, an n-way execution tree can be decomposed to an equivalent binary tree. In the simulations, for ease of programming, each node can have up to four children, giving a 4-way execution tree.

5.1 Results

Figure 5 shows graphs of simulated speedup over a range of problem sizes for each of the algorithms using the four timestamp configurations.

Figure 5: Comparison of speedup for exponent and length timestamp schemes. (Panels: AVL, BIN, FIB, GJ, MAT and QUICK, each using ideal, length and exponential timestamps.)

Speedup is defined as the number of cycles required by the WarpEngine if all instructions were executed in virtual order (i.e. serially) divided by the number of cycles required for execution on the WarpEngine simulator. The simulator makes a number of optimistic assumptions, including zero-cost timestamp rescaling and unlimited bandwidth. Still, the speedup quickly diverges from the results for the ideal timestamps and, in some cases, drops to levels comparable to current production architectures [7].
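As a small worked example of this definition (the cycle counts below are invented purely for illustration and are not taken from the simulations):

    # Worked example of the speedup definition: cycles for serial execution in
    # virtual order divided by cycles taken by the speculative execution.
    # The numbers are invented for illustration only.
    serial_cycles = 12000       # all instructions executed in virtual order
    warpengine_cycles = 1500    # cycles taken on the WarpEngine simulator
    print(serial_cycles / warpengine_cycles)   # speedup of 8.0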

Good speedup is achieved for some of the test programs; however, all the test programs are small and in some cases easily parallelisable. It is likely that larger benchmarks would follow the trend shown by larger problem sizes, and speedup would continue to diverge widely from the ideal timestamp case.

Adding a 32-bit exponent to use the exponential scheme provided little gain over the length representation. Despite the relatively large proportion of children (more than 60% in most cases) generated as the left-most child, there are few long chains of children on the left-most branch. Often the earliest events in the virtual sequence (and hence the furthest left) are initialization procedures, which are usually brief. Also, the top level of loop structures tends to have a high fan-out in an effort to extract large amounts of parallelism.

Thus the exponent often cannot be used to replace leading zeros in the timestamp, at least until rescaling is done to place the nodes on the left-most branch.

Using ideal timestamps the speedup generally increases with increasing problem size, as one would expect. With the other schemes, however, the speedup generally decreases as the problem size gets larger. This is caused by the increasingly long, thin branches in the execution tree of larger problems, which force delays until GVT can progress and allow fossil collection to release timestamps for rescaling to take place. This also forces more cancellation and the attendant delays.

6 Tree Balancing

Further work is currently being done to improve timestamp representations. One approach achieving good results is to alter the shape of the timestamp tree to better fit the shape of the execution tree generated by the program, by using variable range timestamps. The range of timestamps allocated to each subtree is fixed in all the representations discussed so far: the range of the parent node is subdivided evenly and allocated to each child, regardless of the number of timestamps required by the subtree, or whether the subtree even exists. By analyzing the likely size of each subtree, an upper and lower bound on the timestamp range for each subtree can be established. This is equivalent to balancing the execution tree to achieve better timestamp utilization by packing the timestamp tree more densely.

There are a number of ways of expressing the analysis of the subtree required to assign variable range timestamps. It may be possible to determine an absolute number of nodes which will be in the subtree, in which case setting the upper limit is trivial. If this is not possible, it may still be possible to estimate the relative sizes of the subtrees, in which case a proportion of the available interval can be allocated to each subtree. Some subtrees may preclude any analysis more detailed than "a very large number of nodes", in which case it is pointless executing anything further right in the tree, because it will have to be rolled back to provide more timestamp space. This could save some rollback overheads. Preliminary results suggest that this approach will place minimal restrictions on the speculative execution, compared with other resource limitations.
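To make the proportional case concrete, the sketch below splits a parent's timestamp interval among its children in proportion to estimated subtree sizes rather than evenly. It is our own illustration, not the WarpEngine allocator; the function name and the weights are invented, and obtaining the size estimates (by compiler or run-time analysis) is outside the sketch.

    # Hedged sketch of variable-range timestamp allocation: split a parent's
    # half-open interval [lo, hi) among its children in proportion to the
    # estimated number of nodes in each subtree, instead of evenly.
    def allocate_ranges(lo: int, hi: int, weights):
        total = sum(weights)
        ranges, start = [], lo
        for i, w in enumerate(weights):
            end = hi if i == len(weights) - 1 else start + (hi - lo) * w // total
            ranges.append((start, end))
            start = end
        return ranges

    # A subtree estimated to hold about 70% of the nodes receives about 70%
    # of the parent's range.
    print(allocate_ranges(0, 1 << 16, [7, 2, 1]))
    # [(0, 45875), (45875, 58982), (58982, 65536)]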

7 Conclusions

All of the timestamp representations evaluated in this paper unacceptably restrict the parallelism extracted. This is due to the long, thin branches typically present in the execution tree, causing many timestamps to be wasted. This, in turn, causes timestamp precision to be quickly exhausted and prompts frequent rescaling. Even when rescaling itself is assumed to be instantaneous, it restricts the speculation distance to the point where performance is reduced to that of current production architectures.

The addition of an exponent to extend the left-most branch has been shown to be ineffective compared with extending the basic timestamp by the same number of bits. The structure of most programs uses a high fan-out at the top levels to extract large amounts of parallelism, and frequently only initialization is performed in the left-most branch.

A more promising approach to allocating timestamps more efficiently is to allocate variable timestamp ranges to subtrees. This relies on compiler technology and run-time analysis to improve allocation efficiency.

Many of the issues described here are also applicable to the allocation of other resources for highly speculative programs. Each block of instructions requires a certain amount of resources (for example memory) to be allocated quickly and efficiently. Where the resource is in some sense linear, sparse allocation may seriously affect the ability to utilize that resource.

References

[1] Adam Back and Steve Turner. Time-stamp generation for optimistic parallel computing. In Proceedings of the 28th Annual Simulation Symposium, Phoenix, Arizona, April.
[2] John G. Cleary. WarpEngine instruction set. Internet Web Page, November.
[3] John G. Cleary, J. A. David McWha, and Murray Pearson. Timestamp representations for virtual sequences. In Proceedings of the 11th Workshop on Parallel and Distributed Simulation (PADS '97), Lockenhaus, Austria, June 1997.
[4] John G. Cleary, Murray W. Pearson, and Husam Kinawi. The architecture of an optimistic CPU: The WarpEngine. In Proceedings of HICSS, volume 1, Hawaii.
[5] Digital Equipment Corporation. DIGITAL Semiconductor Alpha Microprocessor Product Brief, August. Serial: EC-R2YTC-TE.
[6] D. Hunt. Advanced performance features of the 64-bit PA-8000. In Compcon Digest of Papers, March.
[7] Intel Corporation. Pentium Pro processor at 150 MHz, 166 MHz, 180 MHz and 200 MHz. Intel Corporation Datasheets, November.
[8] David Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7(3):404-425, July 1985.
