An Examination of the Adaptation of Simple Branch Prediction Analysis Attacks to the Core 2 Duo


Anthony Gregerson, Aditya Godse
CS/ECE 752 Advanced Computer Architecture I
The University of Wisconsin Madison
agregerson@wisc.edu, godse@wisc.edu

Abstract

Within the last year, a new class of side-channel microarchitectural security attacks has emerged that targets the branch prediction system. While Differential Branch Prediction Analysis (DBPA) attacks have been shown to be feasible on commodity PCs, they suffer from the traditional weakness of timing-based attacks: they require a statistical analysis of measurements over many key-signing operations. Very recently, a new variant called Simple Branch Prediction Analysis (SBPA) has been proposed. The discoverers of SBPA claim that it can extract most of the secret bits of an RSA key in a single key-signing operation, significantly increasing the potential applications of the attack. In this paper, we explore the challenges of implementing an SBPA attack in a high-level language and examine how variations in the microarchitecture can affect the feasibility of an SBPA attack. We develop a portable software system for evaluating the potential vulnerability to SBPA and experimentally demonstrate that the branch prediction system of Intel's Core 2 Duo architecture is vulnerable to side-channel attacks. We show that it is possible to infer secret key information from the Core 2 Duo's branch target buffer (BTB), but find that the effect of a single BTB miss is not large enough to enable SBPA in our test system.

1. Introduction

We live in an increasingly data-centric world. With each year, more and more sensitive private information is being stored on computers and transmitted on computer networks [15]. With high-profile cases of attackers compromising government records, large-scale clandestine spying operations, and the rise of digital rights management, the demand for secure computing platforms has surged [13, 5, 9].
On the surface, one might get the impression that secure computing solely concerns cryptographic algorithms and operating systems. However, the emergence of side-channel attacks, which target the physical implementation of a cryptographic system, has shown that the underlying computer architecture is a critical component in ensuring the security of a system [4]. The sheer variety of potential side-channel attacks, which range from analysis of execution timing to power consumption to electromagnetic emanations to acoustics [3, 17, 12], might be enough to convince a computer architect that there is no hope of ever designing a secure system. In practice, the high precision of measurement required to execute these attacks often makes them infeasible except under special circumstances [7]. To design a system that is reasonably secure against side-channel attacks, the architect must have a good understanding of the challenges of implementing each type of attack and the effects that specific architectural decisions can have on their feasibility.

One of the most recently discovered side-channel vulnerabilities lies in the branch prediction system [2]. By carefully controlling the state of the branch predictor, an attacker can steal secret key data from a crypto process running in a parallel thread. An attack known as differential branch prediction analysis (DBPA) has been shown to be possible on a commodity PC; however, DBPA requires statistical analysis of measurements taken over many successive signing operations using the same crypto key, severely limiting its potential uses. Furthermore, techniques such as reduction and inversion can be used to counteract such differential timing attacks [6, 11, 16]. Recently, researchers have proposed a new variant of DBPA called simple branch prediction analysis (SBPA) and showed that, in a controlled experimental environment, it could compromise the majority of the bits of an RSA encryption key in a single key-signing operation [1]. While the initial SBPA paper showed that the attack had the potential to compromise a Pentium IV system, the fine details of its implementation and test methodology were not disclosed, making it difficult for the computer architect to evaluate the true feasibility of this attack on a general computing system.

In this paper, we develop a portable software system for evaluating the vulnerability of a system to SBPA. We also outline the ways in which architectural choices in the pipeline and branch prediction system can affect the viability of SBPA and show that the choice of replacement policy in the branch target buffer (BTB) can significantly affect the ease of implementing SBPA. Finally, we use our system to predict the SBPA vulnerability of a future SMT processor based on Intel's current Core 2 Duo architecture. We find that evicting a single entry in Core 2's BTB produces an additional latency of approximately 20 cycles in our test setup, but conclude that a cryptographic algorithm needs multiple dependent branches per key bit to make SBPA viable on our test architecture.

The remainder of this paper is divided into seven sections. In Section 2, we provide an overview of the branch prediction side-channel and the methods SBPA uses to improve upon DBPA. In Section 3, we describe the ways in which specific architectural design choices can affect the viability of SBPA.
In Section 4 we outline the design of our system for implementing SBPA and evaluating the vulnerability of an architecture. In Sections 5 & 6 we apply this system to the Intel Core 2 Duo architecture and analyze our results. In Section 7 we review related work in the field. In Section 8 we provide our conclusions on the viability of SBPA and give some general comments on the role of the architect in designing a secure system.

2. Side-channel Vulnerabilities in the Branch Prediction System

SBPA and DBPA both exploit features of the branch prediction system that are shared between threads, and use these shared features to infer secret key data without gaining direct access to the key. To understand how this is possible, consider the following fragment of C code, which represents the binary square-and-multiply algorithm:

    int bsqm(int x, int n)
    {
        int result = 1;
        while (n) {
            if (n & 1) {            /* key-dependent branch */
                result = result * x;
                n = n - 1;
            }
            x = x * x;              /* square once per key bit */
            n = n / 2;              /* move on to the next key bit */
        }
        return result;
    }

This code provides an efficient computation of x^n, an operation critical to crypto schemes. It has the same structure as the modular exponentiation routines, typically built on Montgomery multiplication, used in RSA, DSS, and other crypto algorithms [8, 10]. In the code, x represents a block of data and n is part of the private key. The fragment if ( n & 1 ) { result = result * x; n = n-1; } is executed only when the lowest-order bit of n is 1. The algorithm shifts through all the bits of n, performing this check on each bit. This means that the outcome of the conditional branch directly mirrors the key: every 1 bit executes the multiply and every 0 bit skips it. Which of the two outcomes appears as the taken branch in the machine code depends on how the compiler lays out the conditional.

Many modern architectures, such as the Pentium IV and the Core 2 Duo, implement a BTB to prevent the pipeline from stalling when a branch target address needs to be computed [14]. The BTB stores entries for recently taken branches, which are tagged by the address of the branch instruction and contain its target address.
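To make the leak in the square-and-multiply fragment concrete, the following sketch (an illustration, not part of the original experiments) records the outcome of the key-dependent branch on every iteration and then rebuilds the exponent from that record alone. The names `bsqm_traced`, `key_from_trace`, `trace`, and `max_trace` are hypothetical, and unsigned arithmetic is assumed to be at least 32 bits wide. A real attacker would not see this trace directly; it stands in for what the BTB timing measurements are trying to recover.

```c
#include <stddef.h>

/* Square-and-multiply that also records, for each loop iteration,
 * whether the key-dependent multiply was executed (1) or skipped (0).
 * Returns the number of iterations, i.e. the number of key bits seen. */
static size_t bsqm_traced(unsigned x, unsigned n, unsigned *result,
                          unsigned char *trace, size_t max_trace)
{
    unsigned r = 1;
    size_t steps = 0;

    while (n) {
        unsigned bit = n & 1u;
        if (bit) {
            r *= x;                       /* runs only for a 1 key bit */
        }
        if (steps < max_trace) {
            trace[steps] = (unsigned char)bit;  /* idealized observation */
        }
        steps++;
        x *= x;
        n >>= 1;
    }
    *result = r;
    return steps;
}

/* Rebuild the exponent from the observed outcomes (least significant
 * bit first). The caller must pass at most as many steps as were stored. */
static unsigned key_from_trace(const unsigned char *trace, size_t steps)
{
    unsigned n = 0;
    for (size_t i = 0; i < steps; i++) {
        n |= (unsigned)(trace[i] & 1u) << i;
    }
    return n;
}
```

Calling `bsqm_traced(3, 13, &r, trace, 32)` and then `key_from_trace(trace, steps)` returns the exponent 13 without ever reading `n` directly, which is the property the spy process tries to exploit through timing.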
By accessing the BTB during instruction decode of a conditional branch, the branch target can be determined without stalling the pipeline. If a branch is predicted as taken but suffers a miss in the BTB, it will be forced to stall until the target can be computed. If BTB hits/misses are used as the primary prediction system, a BTB miss may cause a misspeculation to occur. With precise measurements of execution time, these additional execution latencies can be used to detect BTB misses.

DBPA manipulates the BTB by continually executing a static number of branches. If these branches are equal to or greater in number than the associativity of the BTB and are properly placed in the instruction stream, at least one entry will map to every set in the BTB [2]. Whenever the crypto process takes a branch, it will write its entry into the BTB. When this happens, there is some finite probability that this entry will evict a branch belonging to the DBPA spy process. The spy process can detect this eviction when the next BTB miss occurs and infer that the eviction was caused by a 0 in the private crypto key. By repeating this process many times over a large number of key signings, statistical analysis can be used to extract all of the key bits.

SBPA uses the same basic concept, but the SBPA spy process uses far more unique branches than DBPA. By nearly filling all of the BTB entries, the SBPA spy process increases not only the probability that a crypto branch will evict one of the spy branches, but also the probability that this eviction will in turn result in an additional eviction of a spy process branch, which itself may result in an eviction. By making an avalanche effect of evictions probable, SBPA significantly increases the amount of latency caused by an eviction. This loosens the requirement on the precision of timing measurements and may make it feasible to detect branches without averaging over a large number of measurements.

3. SBPA & Architectural Choices

As described in Section 2, SBPA depends heavily on implementation details of the BTB and branch prediction system. In this section we examine several possible architectural choices and explore the effect they would have on the feasibility of SBPA.
It is important to note that the architecture must have a BTB and SMT capabilities to make an SBPA attack possible. Given these conditions, the feasibility of an SBPA attack can be broken into two factors: the size of the effect caused by a BTB miss and the precision of execution time measurement. If the precision is not good enough to detect the effect of a BTB miss, or if the size of the effect is smaller than the standard deviation in measurements, then SBPA can be judged infeasible.

3.1 Pipeline Length

The effect of a single BTB miss is based on the stall penalty of the critical loop between instruction fetch/decode and branch target address resolution. This is determined by the number of pipeline stages between the decode of a conditional branch instruction and the computation of its target address. Shorter pipelines will result in fewer stall cycles for an unresolved branch target address. If this loop can be made into a tight loop rather than a loose loop [18], there would be little advantage in including a BTB in the microarchitecture.

3.2 Branch Predictor Design

Multi-threaded architectures often opt for simpler branch prediction schemes, since branch prediction accuracy may become less important with increasing numbers of threads. One method of doing branch prediction is simply to use the BTB as a predictor. If a branch address is present in the BTB, predict the branch as taken. If it is not in the BTB, predict backward branches as taken and forward branches as not taken. When the branch resolves, write it into the BTB if taken or evict the corresponding BTB entry if not taken. This scheme only provides a single bit of prediction history, but may be sufficient for multi-threaded architectures, and is somewhat similar to the branch prediction system used in the Pentium IV [2].
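The BTB-as-predictor scheme described above can be sketched as a small software model. This is a minimal illustration under stated assumptions, not a description of any real processor: the table is direct-mapped, `BTB_ENTRIES` is an arbitrary size, and the type and function names (`Btb`, `btb_predict`, `btb_update`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BTB_ENTRIES 16u   /* arbitrary size, chosen for illustration */

typedef struct {
    bool     valid[BTB_ENTRIES];
    uint32_t tag[BTB_ENTRIES];     /* address of the branch instruction */
    uint32_t target[BTB_ENTRIES];  /* cached branch target address */
} Btb;

static void btb_init(Btb *b)
{
    memset(b, 0, sizeof *b);
}

/* Direct-mapped index derived from the word-aligned branch address. */
static unsigned btb_index(uint32_t pc)
{
    return (pc >> 2) % BTB_ENTRIES;
}

/* Predict: a BTB hit means taken; on a miss, fall back to the static
 * rule (backward branches taken, forward branches not taken). */
static bool btb_predict(const Btb *b, uint32_t pc, uint32_t target)
{
    unsigned i = btb_index(pc);
    if (b->valid[i] && b->tag[i] == pc) {
        return true;
    }
    return target < pc;
}

/* Update when the branch resolves: insert it if taken, evict the
 * matching entry if not taken. An insert may displace another branch. */
static void btb_update(Btb *b, uint32_t pc, uint32_t target, bool taken)
{
    unsigned i = btb_index(pc);
    if (taken) {
        b->valid[i]  = true;
        b->tag[i]    = pc;
        b->target[i] = target;
    } else if (b->valid[i] && b->tag[i] == pc) {
        b->valid[i] = false;
    }
}
```

In this model, a taken branch at a second address that maps to the same index displaces the first branch's entry, so the next prediction for the first branch falls back to the static rule. That change in prediction, caused by another thread's branch, is the shared state the attack relies on.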
This style of prediction scheme may greatly improve the feasibility of SBPA, as a premature eviction of a BTB entry now has the potential to create a full misspeculation penalty, rather than just stalling the pipeline briefly. This effect again penalizes longer pipelines. Furthermore, if branch prediction performance counters are available, an improved form of BPA could be devised which analyzes branch prediction accuracy (a first-order effect) rather than execution time (a second-order effect).

3.3 BTB Size

The primary difference between SBPA and DBPA is that SBPA attempts to fill as many of the BTB slots as possible, such that a single BTB eviction by the crypto process is likely to cause an avalanche of evictions in the spy process. However, in order to achieve bit-level resolution, the spy process must necessarily execute all of its branches in the time between each key-bit-dependent branch in the crypto process. Unless the crypto process is very slow, it is difficult for the spy process to execute enough branches to fill all of the entries of a very large BTB between each key-dependent crypto branch. If the spy process cannot fill all of the entries in time and the BTB uses a random replacement policy, there is a certain finite probability that when the crypto process executes two consecutive evictions (equivalent to key bits of 00), the write of the second branch to the BTB will evict the address of the first branch. This will result in the first eviction being undetected by the spy process, which will interpret this pattern as being equivalent to key bits of 10. Alternatively, the spy process can simply use a smaller number of branches than BTB entries, but this reduces the probability of avalanche evictions occurring and decreases SBPA's resolution advantage over DBPA. While larger BTBs make SBPA more difficult, there is a practical limit on how large the architect can make the BTB, as the BTB will be useless if it takes multiple cycles to access.

3.4 BTB Replacement Policy

In the previous discussion we have assumed that the BTB uses a random replacement policy. However, an architect might choose to implement an LRU replacement policy to more effectively exploit branch temporal locality and reduce the size of the BTB. When using a random replacement policy, the SBPA and DBPA attacks must rely on finite probabilities of whether the crypto branch entry will evict a spy branch entry. With an LRU policy, the choice of entry for eviction is deterministic, and since the spy process is designed to take many more branches than the crypto process, it is very likely that every eviction caused by the crypto process will evict a spy branch. Furthermore, if the spy process has enough time to completely fill the BTB between crypto branches, each eviction is guaranteed to cause an avalanche of evictions, making detection of key bits significantly easier.

4. Design

To determine the viability of SBPA attacks on various architectures, we designed a software system to find the optimal number of branch addresses to use in a spy program (the branch address stream length) and to measure the increase in latency caused by injecting an additional branch address into the BTB. To accurately determine these values, the BTB analysis software needed to meet the following criteria:

1. Must be able to generate streams of unique branch addresses with variable length.
2. Must have self-measuring capability to determine the latency experienced when executing a fixed number of branches.
3. Must have no other sources of variable latency besides BTB hit/miss behavior. This means that cache hit rate and branch prediction rate (when not limited by the BTB) must be nearly 100%.

We also desired a system that could be easily applied to many different architectures in future tests, so we set the following additional requirements:

4. Must be written in a portable high-level language and require no hand-optimization of assembly code.
5. Must be adaptable to different BTB sizes and levels of associativity.
6. Must be able to automate testing over a large range of branch addresses.

Our solution was a system called the Unique Multiple Branch Address Generator, or UMBRA. UMBRA is composed of a scripted C-code self-generation, compilation, and execution system capable of generating arbitrary numbers of unique branch addresses and executing these addresses for an arbitrary number of iterations. By default, branch instructions are organized to be placed three words apart when compiled into x86 assembly code to ensure even set mapping into fully associative, 2-way, 4-way, and 8-way set associative BTBs. All branches are data-dependent forward branches to take advantage of BTB-based prediction systems as described in Section 3.2.
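The paper does not reproduce UMBRA's source, so the following is only a rough sketch of what a self-generating branch stream could look like, under stated assumptions. The function name `emit_spy_source`, the emitted variable names, and the layout of the generated program are all hypothetical, and nothing here controls the three-word spacing of branch instructions mentioned above, which would depend on the compiler and would need to be checked in the generated assembly.

```c
#include <stdio.h>

/* Write a C program to `out` that contains `n_branches` distinct
 * data-dependent forward branches, executed `iterations` times.
 * Hypothetical generator; not the authors' UMBRA implementation.
 * Returns 0 on success and -1 on bad arguments or a write error. */
static int emit_spy_source(FILE *out, unsigned n_branches, unsigned iterations)
{
    if (out == NULL || n_branches == 0) {
        return -1;
    }

    fprintf(out, "#include <stdio.h>\n");
    fprintf(out, "volatile int flag = 1;\n");
    fprintf(out, "int main(void) {\n");
    fprintf(out, "    unsigned long taken = 0;\n");
    fprintf(out, "    for (unsigned it = 0; it < %uu; it++) {\n", iterations);

    for (unsigned i = 0; i < n_branches; i++) {
        /* Each emitted `if` is a separate static branch with its own
         * address. The volatile write between branches follows the idea
         * of touching the single data value between every branch. */
        fprintf(out, "        if (flag) taken++;\n");
        fprintf(out, "        flag = 1;\n");
    }

    fprintf(out, "    }\n");
    fprintf(out, "    printf(\"%%lu\\n\", taken);\n");
    fprintf(out, "    return 0;\n");
    fprintf(out, "}\n");

    return ferror(out) ? -1 : 0;
}
```

A driver script could call this for each stream length of interest, compile the output, and run it. An optimizing compiler may turn `if (flag) taken++;` into branch-free code, so in practice the generated file would probably need to be built with optimization disabled or its assembly inspected.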
Data dependencies are based on a single data value to mitigate the chances of variable latency due to a cache miss, and the data location is written to between every branch instruction to force in-order execution on out-of-order architectures and prevent compiler re-ordering. Latency measurement can be done either using the universal C clock() function for measuring CPU time or by interfacing with PAPI performance counters on supported CPUs. Because of the limited resolution of the clock function, it is limited to measuring over an arbitrary number of iterations of all the branch addresses, whereas the PAPI performance counter measurements can be taken over an arbitrary number of individual branches. In the case that PAPI performance counters are not available for a processor, some ISAs provide assembly language instructions to access performance counters; however, our goal is to avoid ISA-specific code alterations wherever possible to increase the portability of the system.

5. Methodology

To test UMBRA, we applied it to a Core 2 Duo-based workstation. We selected the Core 2 Duo as our test platform because it is a popular, top-of-the-line consumer processor that was readily available for testing. The Core 2 Duo also differs substantially from the Pentium IV architecture for which SBPA was originally designed. It has a much shorter, higher-efficiency pipeline than the Pentium IV, which is representative of current trends in processors and may be able to highlight the effects of varying these features on SBPA. There is also very little public information available about the Core 2 Duo architecture, which forced us to take a very general approach to testing that can be applied to many other processors.

It is important to note that while the Core 2 Duo has a BTB, current commercially available consumer versions of the processor do not have multi-threading capability. This means that actually implementing SBPA on the Core 2 architecture is currently impractical; however, we believe that it still serves as a good illustrative example of the challenges of adapting SBPA to a new architecture and provides a glimpse at the potential vulnerability of future Core 2-based multithreaded processors.

The test system was running Fedora Core 5 Linux and all tests were run remotely via SSH to simulate the way a remote attacker might access the system. No other user-level processes were active during the tests, but we made no effort to alter the system-level processes being run by the OS. This was done to provide a more accurate representation of the conditions of a real attack.
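The clock()-based measurement mode described in Section 4 can be illustrated with a short sketch. It is not the code used in this study: the function name `time_branches` and the variable `umbra_flag` are hypothetical, and only two branches are shown where a real spy stream would contain hundreds or thousands. Because clock() ticks are coarse, the sketch times many iterations together and leaves it to the caller to divide by the iteration count.

```c
#include <time.h>

static volatile int umbra_flag = 1;  /* hypothetical shared data value */

/* Time `iterations` passes over a small fixed set of data-dependent
 * branches with clock(). Returns elapsed CPU time in seconds and
 * stores the number of taken branches in *taken_out (if non-NULL). */
static double time_branches(unsigned long iterations, unsigned long *taken_out)
{
    unsigned long taken = 0;
    clock_t start = clock();

    for (unsigned long i = 0; i < iterations; i++) {
        if (umbra_flag) taken++;   /* data-dependent branch */
        umbra_flag = 1;            /* write between branches */
        if (umbra_flag) taken++;
        umbra_flag = 1;
    }

    clock_t end = clock();
    if (taken_out != NULL) {
        *taken_out = taken;
    }
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Dividing the returned time by the number of branches executed gives an average per-branch cost, which is why this mode can support averaged measurements but not the single-event timing that SBPA needs.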
To evaluate the vulnerability of the BTB side-channel on the system, we used UMBRA to generate 500 different spy processes with branch address streams ranging in length from 500 to 10,000 branches, with the majority of measurements focused in the 1000 to 2000 branch region. We performed these measurements using the clock function, PAPI CPU cycle counters, and PAPI CPU branch counters. For the clock function, measurements were taken over 4000 consecutive branches in the spy program. For the PAPI CPU counters, measurements were taken over groups of 250 consecutive branches. Additionally, we performed a set of tests using the PAPI branch counters to see if the Core 2 Duo's branch predictor would be affected by BTB misses. In each case we discarded the measurements corresponding to the first three iterations of the branch address stream to remove the effects of compulsory BTB misses generated when starting the process. We then recorded the next 3000 measurements for each process and computed the mean and standard deviation of each measurement.

6. Experimental Results

Figures 1 & 2 show plots of the results collected using the clock function and PAPI clock counter measurements, respectively. The plots show the execution time for a fixed number of branches vs the size of the unique branch address stream in the spy process. The vulnerability of the BTB can be judged by how closely this plot resembles a step function. In particular, we are interested in the steepness of the slope of the plot, which represents the size of the effect of a BTB eviction, and the point at which the maximum slope occurs, which represents the ideal size of the branch address stream in the spy process. Additionally, the point at which the plot starts sloping upward represents the smallest possible branch address stream that can be used to attack the BTB.

Figure 1.
Execution latency vs branch address stream size, using the clock function for measurements averaged over several thousand iterations. Figure 1 shows that the C clock function, even with its lower cycle resolution, still shows a strong

relationship between the number of addresses being written into the BTB and the execution latency. The plot begins sloping upward at 1300 branches and reaches its steepest point at 1500 branches. While these results might seem encouraging for SBPA, it is important to note that the resolution of the clock function is limited to measurements of 10,000 cycles on the Core 2 Duo. The high resolution of the plot is only achievable by taking averages over a large number of iterations; however, in an SBPA attack the measurement must be made over a single key-signing event, making the measurement resolution 2 to 3 orders of magnitude too coarse to detect a single additional branch in the BTB. Despite that, it does show that DBPA, which relies on heavy averaging, would be feasible on the Core 2 Duo using the clock function.

Figure 2. Execution latency vs branch address stream size, using PAPI CPU cycle counters. Latency measured over groups of 250 branches.

Figure 2 shows the plot of results using measurements from the PAPI cycle counter. The PAPI counter is able to return measurements with single-cycle resolution. The PAPI clock cycle measurements more closely resemble the ideal step-function results which represent an optimal architecture for SBPA. The plot begins sloping upward at 1275 branches and achieves a maximum slope of 20 cycles per additional branch at a branch address stream length of 1350 branches. The difference in optimal stream lengths between the C clock function and the PAPI counters can be attributed to the difference in branch overhead required to run the clock function compared to accessing the performance counters. The slope indicates that the execution of a single key-dependent branch by the crypto process will result in a 20 cycle increase in latency when the stream length is at this optimal value.

While this 20 cycle additional latency indicates that an SBPA attack is theoretically possible on this system, the influence of background system noise is a major factor. At the optimal stream length, the standard deviation in our 3000 readings was 180 cycles, nearly an order of magnitude greater than the latency increase caused by a single branch. If the standard deviation measurements are confined to groups of 100 readings, they vary from 22 to 330 cycles. This shows that even our cleanest readings had a standard deviation slightly higher than the latency increase caused by a branch from the crypto process. This indicates that even under ideal background activity, the additional latency of a single eviction from the BTB is too small to be reliably measured during a single key-signing event.

Figure 3. Branch mispredictions vs branch address stream size, using PAPI branch counters. Mispredictions measured over groups of 250 branches.

Figure 3 shows the result of repeating these tests and measuring the number of branch mispredictions recorded by the PAPI performance counters. These results do not show a step-function-like shape, indicating that the BTB is not the primary branch prediction resource in the Core 2 Duo and that measurements of branch mispredictions cannot be used to detect BTB evictions. However, the results show that the increases in latency shown in Figures 1 & 2 are not simply the result of limitations in the branch prediction system, providing further confidence that they are a result of evictions in the BTB. Interestingly, the branch misprediction rate scales somewhat with increasing branch address stream length. In particular, while the average number of branch mispredictions increases by only 25% between stream lengths of 1000 and 5000, the standard deviation in mispredictions increases by over 600%. It is unclear whether this bears a relationship to the status of the BTB, and it might be worth investigating in another study.
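The comparison between effect size and measurement noise used above can be made concrete with a short sketch. It is an illustration under stated assumptions rather than the analysis code used for these results: the names are hypothetical, the threshold factor `k` (how many standard deviations count as detectable) is an arbitrary choice, and the estimate of how many measurements averaging would need assumes independent noise, which is a standard statistical rule of thumb and not a claim made in the paper.

```c
#include <stddef.h>

typedef struct {
    double mean;
    double variance;   /* sample variance (standard deviation squared) */
} Stats;

/* Mean and sample variance of samples[skip .. n-1]; the first `skip`
 * readings are discarded as warm-up (e.g. compulsory BTB misses). */
static Stats latency_stats(const double *samples, size_t n, size_t skip)
{
    Stats s = {0.0, 0.0};
    if (n <= skip + 1) {
        return s;                      /* not enough data */
    }
    size_t m = n - skip;
    double sum = 0.0;
    for (size_t i = skip; i < n; i++) {
        sum += samples[i];
    }
    s.mean = sum / (double)m;

    double ss = 0.0;
    for (size_t i = skip; i < n; i++) {
        double d = samples[i] - s.mean;
        ss += d * d;
    }
    s.variance = ss / (double)(m - 1);
    return s;
}

/* Nonzero if the effect exceeds k standard deviations in one reading. */
static int single_shot_detectable(double effect, double variance, double k)
{
    return effect * effect > k * k * variance;
}

/* Readings to average so the effect is at least k standard errors,
 * assuming independent noise: N >= k^2 * variance / effect^2. */
static size_t samples_needed(double effect, double variance, double k)
{
    double q = k * k * variance / (effect * effect);
    size_t n = (size_t)q;
    if ((double)n < q) {
        n++;                           /* round up */
    }
    return n;
}
```

With the figures reported above (a 20 cycle effect and a 180 cycle standard deviation) and an assumed `k` of 3, `single_shot_detectable` returns 0 and `samples_needed` returns 729, which is consistent with the conclusion that the effect is visible only after averaging.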

7. Related Work

Branch-predictor-based side-channel attacks are a very recent development, so the amount of work in the field is limited. In early 2007, Acıiçmez et al. first proposed the DBPA attack and showed that the branch target buffer could be used to extract data from the RSA algorithm implemented in OpenSSL. However, this method required averaging results over 9 million key-signing operations, making it impractical for most applications [2]. Later that year, the same researchers proposed the SBPA enhancement and outlined an attack on the Pentium IV architecture. However, they used a trace-based simulation to provide most of their results and provided few details on the practical challenges of implementing an SBPA attack on a real system [1].

8. Conclusions

In this paper we developed a software system for evaluating the SBPA vulnerability of various architectures. We explored the different architectural choices that can affect the viability of SBPA and analyzed the vulnerability of an off-the-shelf Linux workstation using a Core 2 Duo processor. We showed that it was possible to extract side-channel information about the Core 2 Duo's branch target buffer using timing analysis, but discovered that analyzing the branch misprediction rate is not effective. We showed that the optimal branch address stream length to use in a BPA attack is between 1350 and 1500 branches and that an eviction from the BTB provides an increase in latency of 20 cycles in the best case. We found that using PAPI performance counters to measure CPU time provides much more accurate measurements than the C clock function and that DBPA techniques could be effective using these measurement methods, but found that executing an SBPA attack on a single key-signing operation was infeasible, as the standard deviation in measurements was higher than the effect we were trying to measure.
At the beginning of this paper, we stated that we wished to discover whether SBPA attacks can be applied effectively to other computer architectures. Our results show that while implementing SBPA proved infeasible on our test system, it is not far from being feasible, indicating that architectures which suffer larger penalties for BTB misses could be vulnerable to SBPA attacks. An important additional point is that although the binary square-and-multiply example only included a single branch dependent on the key bits, some sensitive algorithms could contain tens of branches dependent on each key bit, which would be enough to make an SBPA attack on our test system viable.

It is also very important to note that in both of our tests, the spy process needed to execute a minimum of 1300 branches before it could detect a branch from the crypto process. If the cryptographic process can execute multiple key-dependent branches in the time it takes for the spy process to execute 1300 branches, it becomes very difficult for SBPA to infer the values of the secret key bits. By designing highly efficient algorithms, the problem of the BTB side-channel can be mitigated to a certain extent.

These two points demonstrate that designing a truly secure platform is a top-down effort. An architect cannot solely consider the security of an algorithm or architecture, but must instead consider the interrelationship between all parts of the system, from the microarchitecture to the ISA to the OS to the software running on it.

Acknowledgements

We would like to thank Dr. Karu Sankaralingam for providing us with the test system and his valuable advice throughout the project.

References

[1] O. Acıiçmez, C. Koç and J.-P. Seifert, On the Power of Simple Branch Prediction Analysis, ASIACCS 07.
[2] O. Acıiçmez, J.-P. Seifert and C.K. Koç, Extracting Secret Keys via Branch Prediction, CT-RSA 2007.
[3] D. Agrawal, B. Archambeault, J.R. Rao and P. Rohatgi, The EM Side-Channel(s), Proceedings of Workshop on Cryptographic Hardware and Embedded Systems.
[4] H. Bar-El, Introduction to Side-Channel Attacks, White paper.
[5] J.F. Burns, Privacy breach rocks British government, International Herald Tribune, 24 November 2007.
[6] M. Joye and J. Quisquater, Hessian Elliptic Curves and Side-Channel Attacks, in Cryptographic Hardware and Embedded Systems, vol. 2162, Heidelberg: Springer Berlin, p. 402.
[7] J. Kelsey, B. Schneier, D. Wagner and C. Hall, Side Channel Cryptanalysis of Product Ciphers, ESORICS 98.
[8] T.W. Kwon, C.S. You, W.S. Heo, Y.K. Kang and J.R. Choi, Two implementation methods of a 1024-bit RSA cryptoprocessor based on modified Montgomery algorithm, ISCAS 2001, vol. 4.
[9] Q. Liu, R. Safavi-Naini and N.P. Sheppard, Digital Rights Management for Content Distribution, Conferences in Research and Practice in Information Technology, vol. 34.
[10] C. McIvor, M. McLoone and J.V. McCanny, Fast Montgomery Modular Multiplications and the RSA Cryptographic Processor Architectures, Signals, Systems and Computers, vol. 1.
[11] B. Möller, Securing Elliptic Curve Point Multiplication against Side-Channel Attacks, in Information Security, vol. 2200, Heidelberg: Springer Berlin, p. 324.
[12] C. Percival, Cache missing for fun and profit, BSDCan 2005.
[13] P. Regan, From Clipper to Carnivore: Balancing Privacy, Law Enforcement, and Industry Interests, American Political Science Association, 2001 Proceedings.
[14] J. Shen and M. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill.
[15] D. Solove, The Digital Person: Technology and Privacy in the Information Age, New York: New York University Press.
[16] F. Standaert, G. Piret, G. Rouvroy, J. Quisquater and J. Legat, ICEBERG: an Involutional Cipher Efficient for Block Encryption on Reconfigurable Hardware, FSE 2004, Springer-Verlag.
[17] F.-X. Standaert, E. Peeters, G. Rouvroy and J. Quisquater, An overview of power analysis attacks against field programmable gate arrays, Proceedings of the IEEE, vol. 92, 2006.
[18] E. Tune, E. Borch, S. Manne and J. Emer, Loose loops sink chips, High-Performance Computer Architecture 2002.
