Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor


Kostas Papadopoulos
December 11, 2005

Abstract

Simultaneous Multi-threading (SMT) has been developed to increase instruction-level parallelism by allowing instructions from a different thread to run during a stall. Inter-thread cache interference, however, might limit the benefit of running multiple independent threads. SMT processors can also be used in a different model, where a helper thread is used to prefetch cache blocks for the main execution thread. Physical experimentation with low-level, compiler-generated prefetch threads has been tried, with mixed results. Memory-resident databases spend as much as 50% of their time in stalls, and memory prefetching has been shown to have a positive effect in some situations. In this paper we present an experiment with an abstracted database operation which uses a high-level, synchronized thread mechanism to prefetch memory. The experiment ran on an Intel Pentium 4 processor with Hyper-Threading, and its focus was L2 cache performance. The results show a substantial decrease in L2 misses as reported by the Pentium 4 performance monitoring counters. Additionally, a main/prefetch thread pair can be made to run 15% faster than a single thread, in spite of the high synchronization cost.

1 Introduction

Commercial relational database systems have been shown to suffer from memory stalls due to under-utilization of the L2 cache [2][6]. Methods of overcoming this limitation have been proposed, including different data layouts [3], algorithmic optimizations [4], and prefetching [6][7]. SMT processors offer an ideal architecture for memory prefetching, as the additional logical processor can be used to run a prefetch thread [5]. Database access methods can generally be grouped into indexed and sequential. Indexed methods use complex structures, usually B+-trees but also hash-based indexes, to look up the relevant tuple.
Sequential methods scan the full set of tuples in a database relation and perform an operation, such as selection, projection, or aggregation, on each tuple. In this paper we present an experiment on a physical processor to measure the impact of a prefetch thread on a database operation. In the experiment, we abstracted a table sequential scan operation as executed by PostgreSQL. The prefetch thread was created and synchronized using a POSIX thread library. The threads were synchronized so that the prefetch thread would always fetch data within a predefined range of memory pages ahead of the main thread. The threads were affined to the logical processors on a Linux system; the OS scheduler was otherwise allowed to schedule them as it saw fit. The results show a reduction in L2 misses by as much as a factor of 27, as reported by the Pentium 4. The main/prefetch pair execution times were 15% lower than that of a single thread, but 10% higher than that of two simultaneous main threads; the difference must be attributed to the high synchronization cost. The rest of the paper is organized as follows: Section 2 describes the experimental setup and methodology, Section 3 presents the results, and Section 4 discusses related work. The paper ends with notes on future work in Section 5 and the conclusion in Section 6.

2 Experimental Setup and Methodology

2.1 Environment

The experiment ran on a Pentium 4 processor with Hyper-Threading [12]. The Pentium 4 has two logical processors sharing the L1 and L2 caches. The processor characteristics are listed in Table 1. The processor has performance monitoring counters which can report, among other measurements, the L1 and L2 load misses.

Processor: Intel Pentium 4
SMT: Intel Hyper-Threading with two logical CPUs
Frequency: 2.8 GHz
L2 Cache Size: 512K
L2 Cache Line Size: 64B
L1 Data Cache Size: 32K

Table 1: Processor Characteristics

The operating system was Linux. Linux recognizes the logical CPUs as real CPUs; processes and threads are scheduled on each of the logical processors. The scheduler also understands sibling CPUs, giving them special consideration in the scheduling algorithm [14]. The Performance Application Programming Interface (PAPI) [13] was used to obtain the Pentium 4 performance measurements. PAPI allows the recording of performance events at suitable locations in the program execution, at the expense of a small amount of additional coding. Note that PAPI needs a specially built Linux kernel to run.

2.2 Database Operation and Synchronization Algorithm

A table sequential scan is the simplest form of selection that a database must perform. In the experiment, we abstracted a table sequential scan operation as executed by PostgreSQL. PostgreSQL, like most database systems, organizes its data so as to optimize disk I/O: a disk block becomes a memory buffer. Each buffer holds a varying number of tuples, depending on the tuple size, and holds tuples for only a single relation. Buffers occupy most of the memory used by the database system; memory is generally allocated using one or more shared memory segments. Although a relation might be scanned sequentially, its disk blocks are usually scattered among the available memory buffers, and a hash table (or another suitable construct) is used for the mapping. PostgreSQL performs a sequential scan as follows:

1. Create a lookup key.
2. Look up the buffer in a hash table.
3. Check the conditions for each tuple in the buffer and return the tuples that match.
4. Go to 1.

The above algorithm was replicated in a small stand-alone program following the design principles detailed above. The memory was allocated in a shared memory segment. A tuple was constructed to hold 5 attributes of 32 bytes in total. The buffer size was set at 4K and each buffer held 127 randomly filled tuples. Buffers were fetched in a predefined random order to simulate the scattering of disk blocks in memory. A simple aggregate operation, count(*), was performed on the tuples by comparing a double numeric value. The prefetch thread was created using the Native POSIX Thread Library [10] which is, as the name suggests, a POSIX thread implementation for Linux. Thread synchronization uses POSIX thread primitives, which are assumed to be fast under Linux [11]. The prefetch thread executed the same algorithm as the main thread, except that it did not perform the operation. The main and prefetch threads were synchronized on the number of buffers that the prefetch thread may be ahead of the main thread, called the synchronization distance. For a given distance N, the prefetch thread was kept between [-N, 2N] buffers ahead of the main thread. If either thread fell outside this range, it would sleep waiting for the other to catch up. At either end of the range [0, N] the threads would signal each other to continue. The synchronization distance was varied during the experiment. Table 2 lists the steps taken by each thread; a more complete listing is given in the Appendix.
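The tuple and buffer layout described above can be sketched as C structures. This is an illustration, not the paper's actual code: the attribute names and header fields are hypothetical, and only the sizes (5 attributes totalling 32 bytes, 4K buffers holding 127 tuples each) come from the text.

```c
#include <stddef.h>

#define BUFFER_SIZE       4096   /* one disk block = one memory buffer */
#define TUPLES_PER_BUFFER 127

/* Hypothetical 32-byte tuple with 5 attributes; net_weight stands in
   for the double value compared by the count(*) aggregate. */
typedef struct {
    int    id;            /*  4 bytes */
    int    flags;         /*  4 bytes */
    double net_weight;    /*  8 bytes */
    double gross_weight;  /*  8 bytes */
    char   tag[8];        /*  8 bytes */
} Tuple;                  /* 32 bytes total */

/* A buffer holds tuples of a single relation plus a small header,
   fitting in one 4K block: 8 + 127 * 32 = 4072 <= 4096 bytes. */
typedef struct {
    int   relation_id;    /* which relation the buffer belongs to */
    int   ntuples;        /* how many tuples are in use */
    Tuple data[TUPLES_PER_BUFFER];
} Buffer;
```

With 32-byte tuples, a 4K buffer could hold 128 tuples if it had no header; the 127-tuple figure in the text leaves room for exactly such per-buffer metadata.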

Main Thread:
  wait for Prefetch thread to start
  wait for the Prefetch to read N buffers
  for all buffers:
    if Prefetch is behind, wake Prefetch
    if Prefetch is more than N behind, wait for Prefetch
    process tuples in current buffer
  wait for Prefetch to finish
  return matches

Prefetch Thread:
  wait for Main thread to start
  for all buffers:
    if Main is more than N ahead, wake Main
    if Main is more than 2N ahead, wait for Main
  wait for Main to finish
  return

Table 2: The synchronization of Main and Prefetch threads

3 Experimental Results

We ran the tests for a single thread, two main threads, and a main/prefetch pair. For the main/prefetch pair, we varied the synchronization distance from 2 to 300 buffers. The measurements were taken after a full scan of 5000 buffers (about 20MB of memory). The scan was repeated 1000 times in each case. The measurements are for one logical processor and exclude kernel-level events.

Figure 1: L1 and L2 misses for a single thread
Figure 2: L1 and L2 misses for two main threads
Figure 3: L1 and L2 misses for a main/prefetch pair

Figures 1 and 2 show the L1 and L2 misses for a single thread and for two main threads respectively. The two-thread test shows interesting patterns in L2 misses: as the threads run, they fall in and out of
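The distance checks that Table 2 prescribes for the main thread can be written as two small predicates. The sketch below is ours; the conditions and counter names are distilled from the appendix listing, with N being the synchronization distance (prefetch_sync in the code).

```c
/* The prefetcher has fallen behind the scan position: wake it up. */
int main_should_signal(long main_read, long prefetch_read)
{
    return prefetch_read < main_read;
}

/* The prefetcher is more than N buffers behind: block until it
   catches up, since its loads no longer run ahead of the scan. */
int main_should_wait(long main_read, long prefetch_read, long n)
{
    return prefetch_read < main_read - n;
}
```

Keeping the checks this cheap matters: they run once per synchronization step, under the mutex, so any extra work here adds directly to the synchronization cost measured below.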

synch, periodically thrashing each other's L2 cache. Nonetheless, the two-thread run shows only a slight increase, of less than 10%, in the average L2 miss count. At the same time the average execution time falls from 8.45s to 6.46s, a decrease of about 25%.

Figure 4: L2 misses per synchronization distance
Figure 5: Real time per synchronization distance
Figure 6: Relation of execution time to other parameters

Figure 3 shows the L1 and L2 misses for a main/prefetch pair. The synchronization distance is 6, which gave the best results. It can be seen that the average number of L2 misses falls to about 920, a 27-fold reduction; relative to the two-thread run the reduction is about 29-fold. Figure 4 shows the average L2 misses as the synchronization distance is varied. A distance of 6 gives the best results. As the distance increases and the threads are allowed to drift further apart, the number of misses increases; at 300 buffers there is still a 2.5-fold reduction, which is fairly impressive considering how little synchronization there is between the two threads. Figure 5 shows the average real time as the synchronization distance is varied. The execution time is compared to that of a single thread, at 8.45 secs, and the average time of two threads, at 6.47 secs. The execution time drops below that of the single thread at 200 buffers and levels off at 7.2 secs between 4 and 50 buffers. The pair time is about 10% slower than the average time of the two-thread test. The algorithm for the main/prefetch pair is an extension of the single-thread algorithm with the synchronization logic.
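The timing comparisons above can be cross-checked with a little arithmetic; the three times (8.45 s single thread, 6.47 s two main threads, 7.2 s pair) are taken from the measurements just quoted.

```c
/* Percentage change between two reported execution times: positive
   means "to" is faster than "from". */
double percent_change(double from, double to)
{
    return (from - to) / from * 100.0;
}
```

percent_change(8.45, 7.2) gives about 14.8%, matching the claimed 15% gain over the single thread; percent_change(6.47, 7.2) gives about -11.3%, i.e. the pair is roughly 10-12% slower than two independent threads, consistent with the figures quoted in the text.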
So there are two main factors that influence the execution time: the improved L2 utilization and the synchronization cost. Part of the synchronization cost is the time the main thread spends waiting. We consider the number of times the main thread waits on the condition variable, together with the corresponding number for the prefetch thread, as an indication of the synchronization cost. In order to examine these factors, Figure 6 shows the execution time along with the wait counts for the main and prefetch threads and the L2 misses. We observe that all quantities rise sharply below a distance of 4: the threads spend considerable time waiting, which impacts both the L2 misses and the execution time. Above 5 the L2 miss count rises almost linearly, while the wait count for the main thread drops close to zero and the count for the prefetch thread decreases. The execution time remains constant during this period. This leads to the conclusion that synchronization cost and L2 improvements play an equal part in the overall performance. Additionally, execution gains can be achieved with relatively little synchronization, at least on a dedicated system such as the one studied here.

4 Related Work

In [1] Kihm et al. examine the problem of inter-thread cache interference for independent threads on SMT systems. They show that it can limit the benefit of running multiple independent threads, and propose methods by which the operating system scheduler can be made aware of these problems and schedule threads accordingly. In [5] Kim et al. use static analysis to identify delinquent loads, the loads in the code responsible for most cache misses, and then modify the compiler to generate helper threads for the loops containing them. They use two mechanisms to synchronize the threads, one based on Windows XP system calls and one a custom-built hardware solution. The synchronization is done in three ways: static loop-based, in which the threads synchronize once every loop; static sample trigger, in which the threads synchronize every few iterations of the loop; and dynamic trigger, in which the generated code monitors the cache behavior and synchronizes accordingly. Their results show some performance gains using these methods of synchronization, with dynamic trigger synchronization being the most promising. The authors also expect better results if the synchronization cost can be lowered.

5 Future Work

In this experiment we tested a simple database operation, a sequential scan with a simple aggregate, and examined only the synchronization distance. Other parameters that might influence the usefulness of this method must be analyzed, such as the type of operation performed on each tuple, the synchronization algorithm, the structure of the data and the database access method (e.g. index methods), and the relation of the prefetch thread to the cache line size and the hardware prefetch logic of the processor.
Tests involving multiple main/prefetch pairs must also be carried out, as well as a model with multiple main threads per prefetch thread. Additionally, the synchronization needs to be minimized or eliminated. This can be done either by designing a better synchronization algorithm or by adding explicit operating system support. In the first case, a never-block algorithm could give good results if it can maintain a relatively close distance between the threads. In the second case, the OS could be modified to schedule the main and prefetch threads in the same time slot, in a gang-scheduling fashion, so as to eliminate or minimize the need for synchronization.

6 Conclusions

Databases are known to suffer from poor cache performance, leading to excessive processor stalls. We present here an experiment where a prefetch thread is used to assist a database operation on an SMT processor. The results show a dramatic decrease in L2 cache misses, while the overall performance was about 15% better compared to a single thread and about 12% worse compared to two independent threads; the performance penalty must be attributed to the high synchronization cost. Overall the results are promising: the experiment shows that it is possible to improve L2 cache performance with high-level synchronization.

References

[1] Joshua Kihm, Alex Settle, Andrew Janiszewski, Dan Connors, Understanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors, Journal of Instruction-Level Parallelism 7 (2005) 1-28.
[2] Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, David A. Wood, DBMSs On A Modern Processor: Where Does Time Go?, Proceedings of the 25th VLDB Conference, Edinburgh, Scotland.

[3] Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, Marios Skounakis, Weaving Relations for Cache Performance.
[4] Kenneth A. Ross, Jingren Zhou, Buffering Database Operations for Enhanced Instruction Cache Performance.
[5] Dongkeun Kim et al., Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors, International Symposium on Code Generation and Optimization (CGO 04), p. 27.
[6] Pedro Trancoso, Josep-L. Larriba-Pey, Zheng Zhang, Josep Torrellas, The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors.
[7] Pedro Trancoso, Josep Torrellas, The Impact of Speeding up Critical Sections with Data Prefetching and Forwarding.
[8] Lawrence Spracklen, Yuan Chou, Santosh G. Abraham, Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications, Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005.
[9] Bruce Momjian, PostgreSQL Internals, Software Research Associates, December 2001.
[10] U. Drepper, I. Molnar, The Native POSIX Thread Library for Linux.
[11] Ulrich Drepper, Futexes Are Tricky, drepper/futex.pdf.
[12] Intel, IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide.
[13] Performance Application Programming Interface (PAPI).
[14] Josh Aas, Understanding the Linux CPU Scheduler.

A Appendix

We list below a simplified version of the scan method of the main thread. As explained above, the prefetch thread follows the same algorithm. Initialization and error checking have been removed for clarity.

int scan() {
    // wait for prefetch to start
    pthread_mutex_lock(&mut);
    main_status = started;
    while (prefetch_status != started) {
        pthread_cond_signal(&go_cond);
        pthread_cond_wait(&done_cond, &mut);
    }
    // wait for the first N buffers
    while (prefetch_buffers_read < prefetch_sync) {
        pthread_cond_signal(&go_cond);
        pthread_cond_wait(&done_cond, &mut);
    }
    main_buffers_read = 0;
    pthread_mutex_unlock(&mut);

    for (i = 0; i < buffers; i++) {
        // synchronize every N=prefetch_sync buffers
        if (i % prefetch_sync == 0) {
            pthread_mutex_lock(&mut);
            main_buffers_read = i;
            if (prefetch_buffers_read < main_buffers_read) {
                // wake helper thread
                pthread_cond_signal(&go_cond);
                if (prefetch_buffers_read < main_buffers_read - prefetch_sync) {
                    // we are too far ahead; wait...
                    pthread_cond_wait(&done_cond, &mut);
                }
            }
            pthread_mutex_unlock(&mut);
        }
        int j;
        for (j = 0; j < tuples; j++) {
            if (relation[order[i]].data[j].net_weight < target_weight) {
                matches++;
            }
        }
    }

    // make sure the helper thread is not left waiting
    pthread_mutex_lock(&mut);
    main_status = finished;
    while (prefetch_status != finished2) {
        pthread_cond_signal(&go_cond);
        pthread_cond_wait(&done_cond, &mut);
    }
    pthread_mutex_unlock(&mut);
    return matches;
}
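One refinement suggested by the future-work discussion of cache line size: the prefetch thread need not read every tuple as the main thread does. Touching one byte per 64-byte cache line is enough to pull a buffer into L2 with minimal work. This variant is our sketch, not part of the experiment above.

```c
#define CACHE_LINE 64

/* Zero-filled 4K demonstration buffer standing in for one disk block. */
static unsigned char demo_page[4096];

/* Touch one byte per cache line so every 64-byte line of the buffer is
   loaded; the accumulated sum keeps the compiler from optimizing the
   loads away, and volatile forces each load to actually happen. */
long touch_buffer(const volatile unsigned char *buf, long size)
{
    long sum = 0;
    long off;
    for (off = 0; off < size; off += CACHE_LINE)
        sum += buf[off];
    return sum;
}
```

A prefetch thread built on touch_buffer would cut the prefetcher's per-buffer work from 127 tuple reads to 64 single-byte loads, which could reduce the contention it causes on the shared execution resources of the Hyper-Threaded core.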


More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems

More information

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed. CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed. File-System Structure File structure Logical storage unit Collection of related information File

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

Memory Management. Memory

Memory Management. Memory Memory Management These slides are created by Dr. Huang of George Mason University. Students registered in Dr. Huang s courses at GMU can make a single machine readable copy and print a single copy of

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Cache-Aware Database Systems Internals. Chapter 7

Cache-Aware Database Systems Internals. Chapter 7 Cache-Aware Database Systems Internals Chapter 7 Data Placement in RDBMSs A careful analysis of query processing operators and data placement schemes in RDBMS reveals a paradox: Workloads perform sequential

More information

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto Ricardo Rocha Department of Computer Science Faculty of Sciences University of Porto Slides based on the book Operating System Concepts, 9th Edition, Abraham Silberschatz, Peter B. Galvin and Greg Gagne,

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

The Gap Between the Virtual Machine and the Real Machine. Charles Forgy Production Systems Tech

The Gap Between the Virtual Machine and the Real Machine. Charles Forgy Production Systems Tech The Gap Between the Virtual Machine and the Real Machine Charles Forgy Production Systems Tech How to Improve Performance Use better algorithms. Use parallelism. Make better use of the hardware. Argument

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Virtual Memory 11282011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review Cache Virtual Memory Projects 3 Memory

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

CS3600 SYSTEMS AND NETWORKS

CS3600 SYSTEMS AND NETWORKS CS3600 SYSTEMS AND NETWORKS NORTHEASTERN UNIVERSITY Lecture 11: File System Implementation Prof. Alan Mislove (amislove@ccs.neu.edu) File-System Structure File structure Logical storage unit Collection

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 20 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Pages Pages and frames Page

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI CHEN TIANZHOU SHI QINGSONG JIANG NING College of Computer Science Zhejiang University College of Computer Science

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition Chapter 8: Memory- Management Strategies Operating System Concepts 9 th Edition Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation

More information

Operating Systems. Introduction & Overview. Outline for today s lecture. Administrivia. ITS 225: Operating Systems. Lecture 1

Operating Systems. Introduction & Overview. Outline for today s lecture. Administrivia. ITS 225: Operating Systems. Lecture 1 ITS 225: Operating Systems Operating Systems Lecture 1 Introduction & Overview Jan 15, 2004 Dr. Matthew Dailey Information Technology Program Sirindhorn International Institute of Technology Thammasat

More information

Lecture 2: September 9

Lecture 2: September 9 CMPSCI 377 Operating Systems Fall 2010 Lecture 2: September 9 Lecturer: Prashant Shenoy TA: Antony Partensky & Tim Wood 2.1 OS & Computer Architecture The operating system is the interface between a user

More information

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Flash Drive Emulation

Flash Drive Emulation Flash Drive Emulation Eric Aderhold & Blayne Field aderhold@cs.wisc.edu & bfield@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison Abstract Flash drives are becoming increasingly

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Chapter 9 Memory Management Main Memory Operating system concepts. Sixth Edition. Silberschatz, Galvin, and Gagne 8.1

Chapter 9 Memory Management Main Memory Operating system concepts. Sixth Edition. Silberschatz, Galvin, and Gagne 8.1 Chapter 9 Memory Management Main Memory Operating system concepts. Sixth Edition. Silberschatz, Galvin, and Gagne 8.1 Chapter 9: Memory Management Background Swapping Contiguous Memory Allocation Segmentation

More information

Virtual Memory. control structures and hardware support

Virtual Memory. control structures and hardware support Virtual Memory control structures and hardware support 1 Hardware and Control Structures Memory references are dynamically translated into physical addresses at run time A process may be swapped in and

More information

Virtual Memory. CS61, Lecture 15. Prof. Stephen Chong October 20, 2011

Virtual Memory. CS61, Lecture 15. Prof. Stephen Chong October 20, 2011 Virtual Memory CS6, Lecture 5 Prof. Stephen Chong October 2, 2 Announcements Midterm review session: Monday Oct 24 5:3pm to 7pm, 6 Oxford St. room 33 Large and small group interaction 2 Wall of Flame Rob

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer

More information

A Scalable Event Dispatching Library for Linux Network Servers

A Scalable Event Dispatching Library for Linux Network Servers A Scalable Event Dispatching Library for Linux Network Servers Hao-Ran Liu and Tien-Fu Chen Dept. of CSIE National Chung Cheng University Traditional server: Multiple Process (MP) server A dedicated process

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems Anastassia Ailamaki Ph.D. Examination November 30, 2000 A DBMS on a 1980 Computer DBMS Execution PROCESSOR 10 cycles/instruction DBMS Data and Instructions 6 cycles

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

INSTRUCTION LEVEL PARALLELISM

INSTRUCTION LEVEL PARALLELISM INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,

More information

Chapter 8: Main Memory. Operating System Concepts 9 th Edition

Chapter 8: Main Memory. Operating System Concepts 9 th Edition Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

OPERATING SYSTEM. Chapter 12: File System Implementation

OPERATING SYSTEM. Chapter 12: File System Implementation OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management

More information

Show Me the $... Performance And Caches

Show Me the $... Performance And Caches Show Me the $... Performance And Caches 1 CPU-Cache Interaction (5-stage pipeline) PCen 0x4 Add bubble PC addr inst hit? Primary Instruction Cache IR D To Memory Control Decode, Register Fetch E A B MD1

More information

How to Optimize the Scalability & Performance of a Multi-Core Operating System. Architecting a Scalable Real-Time Application on an SMP Platform

How to Optimize the Scalability & Performance of a Multi-Core Operating System. Architecting a Scalable Real-Time Application on an SMP Platform How to Optimize the Scalability & Performance of a Multi-Core Operating System Architecting a Scalable Real-Time Application on an SMP Platform Overview W hen upgrading your hardware platform to a newer

More information

Operating Systems CMPSCI 377 Spring Mark Corner University of Massachusetts Amherst

Operating Systems CMPSCI 377 Spring Mark Corner University of Massachusetts Amherst Operating Systems CMPSCI 377 Spring 2017 Mark Corner University of Massachusetts Amherst Last Class: Intro to OS An operating system is the interface between the user and the architecture. User-level Applications

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Chapter 8 Main Memory

Chapter 8 Main Memory COP 4610: Introduction to Operating Systems (Spring 2014) Chapter 8 Main Memory Zhi Wang Florida State University Contents Background Swapping Contiguous memory allocation Paging Segmentation OS examples

More information

What Operating Systems Do An operating system is a program hardware that manages the computer provides a basis for application programs acts as an int

What Operating Systems Do An operating system is a program hardware that manages the computer provides a basis for application programs acts as an int Operating Systems Lecture 1 Introduction Agenda: What Operating Systems Do Computer System Components How to view the Operating System Computer-System Operation Interrupt Operation I/O Structure DMA Structure

More information

Design Patterns for Real-Time Computer Music Systems

Design Patterns for Real-Time Computer Music Systems Design Patterns for Real-Time Computer Music Systems Roger B. Dannenberg and Ross Bencina 4 September 2005 This document contains a set of design patterns for real time systems, particularly for computer

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Multiple Processor Systems. Lecture 15 Multiple Processor Systems. Multiprocessor Hardware (1) Multiprocessors. Multiprocessor Hardware (2)

Multiple Processor Systems. Lecture 15 Multiple Processor Systems. Multiprocessor Hardware (1) Multiprocessors. Multiprocessor Hardware (2) Lecture 15 Multiple Processor Systems Multiple Processor Systems Multiprocessors Multicomputers Continuous need for faster computers shared memory model message passing multiprocessor wide area distributed

More information

Background. 20: Distributed File Systems. DFS Structure. Naming and Transparency. Naming Structures. Naming Schemes Three Main Approaches

Background. 20: Distributed File Systems. DFS Structure. Naming and Transparency. Naming Structures. Naming Schemes Three Main Approaches Background 20: Distributed File Systems Last Modified: 12/4/2002 9:26:20 PM Distributed file system (DFS) a distributed implementation of the classical time-sharing model of a file system, where multiple

More information