An Examination of the Adaptation of Simple Branch Prediction Analysis Attacks to the Core 2 Duo

Anthony Gregerson, Aditya Godse
CS/ECE 752 Advanced Computer Architecture I
The University of Wisconsin-Madison
agregerson@wisc.edu, godse@wisc.edu

Abstract

Within the last year, a new class of microarchitectural side-channel attacks has emerged that targets the branch prediction system. While Differential Branch Prediction Analysis (DBPA) attacks have been shown to be feasible on commodity PCs, they suffer from the traditional weakness of timing-based attacks: they require a statistical analysis of measurements over many key-signing operations. Very recently, a new variant called Simple Branch Prediction Analysis (SBPA) has been proposed. The discoverers of SBPA claim that it can extract most of the secret bits of an RSA key in a single key-signing operation, significantly increasing the potential applications of the attack. In this paper, we explore the challenges of implementing an SBPA attack in a high-level language and examine how variations in the microarchitecture can affect the feasibility of an SBPA attack. We develop a portable software system for evaluating the potential vulnerability to SBPA and experimentally demonstrate that the branch prediction system of Intel's Core 2 Duo architecture is vulnerable to side-channel attacks. We show that it is possible to infer secret key information from the Core 2 Duo's branch target buffer (BTB), but find that the effect of a single BTB miss is not large enough to enable SBPA on our test system.

1. Introduction

We live in an increasingly data-centric world. With each year, more and more sensitive private information is stored on computers and transmitted over computer networks [15]. With high-profile cases of attackers compromising government records, large-scale clandestine spying operations, and the rise of digital rights management, the demand for secure computing platforms has surged [13, 5, 9].
On the surface, one might get the impression that secure computing solely concerns cryptographic algorithms and operating systems. However, the emergence of side-channel attacks, which target the physical implementation of a cryptographic system, has shown that the underlying computer architecture is a critical component in ensuring the security of a system [4]. The sheer variety of potential side-channel attacks, which range from analysis of execution timing to power consumption to electromagnetic emanations to acoustics [3, 17, 12], might be enough to convince a computer architect that there is no hope of ever designing a secure system. In practice, the high measurement precision required to execute these attacks often makes them infeasible except under special circumstances [7]. To design a system that is reasonably secure against side-channel attacks, the architect must have a good understanding of the challenges of implementing each type of attack and of the effects that specific architectural decisions can have on their feasibility.

One of the most recently discovered side-channel vulnerabilities lies in the branch prediction system [2]. By carefully controlling the state of the branch predictor, an attacker can steal secret key data from a crypto process running in a parallel thread. An attack known as differential branch prediction analysis (DBPA) has been shown to be possible on a commodity PC; however, DBPA requires statistical analysis of measurements taken over many successive signing operations using the same crypto key, severely limiting its potential uses. Furthermore, techniques such as reduction and inversion can be used to counteract such differential timing attacks [6, 11, 16].

Recently, researchers have proposed a new variant of DBPA called simple branch prediction analysis (SBPA) and showed that, in a controlled experimental environment, it could compromise the majority of the bits of an RSA encryption key in a single key-signing operation [1]. While the initial SBPA paper showed that the attack had the potential to compromise a Pentium IV system, the fine details of the implementation and test methodology were not disclosed, making it difficult for the computer architect to evaluate the true feasibility of this attack on a general computing system.

In this paper, we develop a portable software system for evaluating the vulnerability of a system to SBPA. We also outline the ways in which architectural choices in the pipeline and branch prediction system can affect the viability of SBPA, and show that the choice of replacement policy in the branch target buffer (BTB) can significantly affect the ease of implementing SBPA. Finally, we use our system to predict the SBPA vulnerability of a future SMT processor based on Intel's current Core 2 Duo architecture. We find that evicting a single entry in Core 2's BTB produces an additional latency of approximately 50 cycles in our test setup, but conclude that a cryptographic algorithm needs multiple dependent branches per key bit to make SBPA viable on our test architecture.

The remainder of this paper is divided into seven sections. In Section 2, we provide an overview of the branch prediction side channel and the methods SBPA uses to improve upon DBPA. In Section 3, we describe the ways in which specific architectural design choices can affect the viability of SBPA.
In Section 4, we outline the design of our system for implementing SBPA and evaluating the vulnerability of an architecture. In Sections 5 & 6, we apply this system to the Intel Core 2 Duo architecture and analyze our results. In Section 7, we review related work in the field. In Section 8, we provide our conclusions on the viability of SBPA and give some general comments on the role of the architect in designing a secure system.

2. Side-channel Vulnerabilities in the Branch Prediction System

SBPA and DBPA both exploit features of the branch prediction system that are shared between threads, and use these shared features to infer secret key data without gaining direct access to the key. To understand how this is possible, consider the following fragment of C code, which represents the binary square-and-multiply algorithm:

    int bsqm(int x, int n)
    {
        int result = 1;
        while (n) {
            if (n & 1) {            /* key-dependent branch */
                result = result * x;
                n = n - 1;
            }
            x = x * x;
            n = n / 2;
        }
        return result;
    }

This code provides an efficient computation of x^n, an operation critical to crypto schemes. It is very similar to the Montgomery multiplication algorithm used in RSA, DSS, and other crypto algorithms [8, 10]. In the code, x represents a block of data and n is part of the private key. The fragment if ( n & 1 ) { result = result * x; n = n-1; } is executed only when the lowest-order bit of n is 1. The algorithm shifts through all the bits of n, performing this check on each bit. This means that every 1 in the key generates a taken branch and every 0 generates an untaken branch.

Many modern architectures, such as the Pentium IV and the Core 2 Duo, implement a BTB to prevent the pipeline from stalling when a branch target address needs to be computed [14]. The BTB stores entries for recently taken branches, which are tagged by the address of the branch instruction and contain its target address.
By accessing the BTB during instruction decode of a conditional branch, the branch target can be determined without stalling the pipeline. If a branch is predicted as taken but suffers a miss in the BTB, it will be forced to stall until the target can be computed. If BTB hits and misses are used as the primary prediction system, a BTB miss may also cause a misspeculation. With precise measurements of execution time, these additional execution latencies can be used to detect BTB misses.

DBPA manipulates the BTB by continually executing a static number of branches. If these branches are equal to or greater in number than the associativity of the BTB and are properly placed in the instruction stream, at least one entry will map to every set in the BTB [2]. Whenever the crypto process takes a branch, it will write its entry into the BTB. When this happens, there is some finite probability that this entry will evict a branch belonging to the DBPA spy process. The spy process can detect this eviction when the next BTB miss occurs and infer that the eviction was caused by a 0 in the private crypto key. By repeating this process many times over a large number of key signings, statistical analysis can be used to extract all of the key bits.

SBPA uses the same basic concept, but the SBPA spy process uses far more unique branches than DBPA. By filling nearly all of the BTB entries, the SBPA spy process increases not only the probability that a crypto branch will evict one of the spy branches, but also the probability that this eviction will in turn result in an additional eviction of a spy process branch, which itself may result in a further eviction. By making an avalanche of evictions probable, SBPA significantly increases the amount of latency caused by an eviction. This loosens the requirement on the precision of timing measurements and may make it feasible to detect branches without a statistical analysis over a large number of measurements.

3. SBPA & Architectural Choices

As described in Section 2, SBPA depends heavily on implementation details of the BTB and branch prediction system. In this section we examine several possible architectural choices and explore the effect they would have on the feasibility of SBPA.
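As a concrete reference for the eviction avalanche described in Section 2, the following is a minimal sketch of a toy BTB model. It rests on several assumptions that are not taken from the paper: the table is fully associative, uses true LRU replacement, and holds only 8 entries, and the names `btb_access` and `spy_pass` and all addresses are hypothetical. Under those assumptions, a single branch injected by another thread turns an all-hit spy pass into an all-miss pass.

```c
#include <assert.h>
#include <stdio.h>

#define BTB_ENTRIES 8   /* hypothetical capacity, far smaller than a real BTB */

/* Toy fully associative BTB with true LRU replacement (an assumption).
 * tag[0] is the most recently used entry, tag[count - 1] the least. */
struct btb {
    unsigned long tag[BTB_ENTRIES];
    int count;
};

/* Look up a branch address; returns 1 on a hit and 0 on a miss.
 * Either way the address ends up in the most-recently-used slot,
 * evicting the least recently used entry if the table is full. */
static int btb_access(struct btb *b, unsigned long addr)
{
    int i, pos = -1;
    for (i = 0; i < b->count; i++) {
        if (b->tag[i] == addr) { pos = i; break; }
    }
    int hit = (pos >= 0);
    if (!hit) {
        if (b->count < BTB_ENTRIES) b->count++;
        pos = b->count - 1;          /* free slot, or the LRU victim */
    }
    for (i = pos; i > 0; i--)        /* shift younger entries down by one */
        b->tag[i] = b->tag[i - 1];
    b->tag[0] = addr;
    return hit;
}

/* One pass of the spy over its n_spy unique branches; returns the misses. */
static int spy_pass(struct btb *b, int n_spy)
{
    int misses = 0;
    for (int i = 0; i < n_spy; i++)
        misses += !btb_access(b, 0x1000ul + 16ul * (unsigned long)i);
    return misses;
}

#ifdef BTB_DEMO_MAIN
int main(void)
{
    struct btb b = {0};
    printf("warm-up misses:          %d\n", spy_pass(&b, BTB_ENTRIES));
    printf("steady-state misses:     %d\n", spy_pass(&b, BTB_ENTRIES));
    btb_access(&b, 0x9000ul);        /* one taken branch from the crypto thread */
    printf("misses after injection:  %d\n", spy_pass(&b, BTB_ENTRIES));
    return 0;
}
#endif
```

Compiling with `-DBTB_DEMO_MAIN` builds the small demonstration. The model deliberately ignores set mapping and timing; it only shows why a spy stream that exactly fills an LRU-managed table amplifies one foreign insertion into many misses.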
It is important to note that the architecture must have both a BTB and SMT capability for an SBPA attack to be possible. Given these conditions, the feasibility of an SBPA attack can be broken into two factors: the size of the effect caused by a BTB miss and the precision of execution time measurement. If the precision is not good enough to detect the effect of a BTB miss, or if the size of the effect is smaller than the standard deviation of the measurements, then SBPA can be judged infeasible.

3.1 Pipeline Length

The effect of a single BTB miss is determined by the stall penalty of the critical loop between instruction fetch/decode and branch target address resolution, that is, by the number of pipeline stages between the decode of a conditional branch instruction and the computation of its target address. Shorter pipelines result in fewer stall cycles for an unresolved branch target address. If this loop can be made into a tight loop rather than a loose loop [18], there would be little advantage in including a BTB in the microarchitecture.

3.2 Branch Predictor Design

Multi-threaded architectures often opt for simpler branch prediction schemes, since branch prediction accuracy may become less important with increasing numbers of threads. One method of branch prediction is simply to use the BTB as the predictor. If a branch address is present in the BTB, the branch is predicted as taken. If it is not in the BTB, backward branches are predicted as taken and forward branches as not taken. When the branch resolves, it is written into the BTB if taken, or the corresponding BTB entry is evicted if not taken. This scheme provides only a single bit of prediction history, but may be sufficient for multi-threaded architectures, and is somewhat similar to the branch prediction system used in the Pentium IV [2].
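The BTB-as-predictor scheme just described can be sketched in a few lines. This is an illustrative model only: the direct-mapped organization, the table size of 16, the index function, and the names `predict_taken` and `resolve` are assumptions made for the example, not details of the Pentium IV or Core 2 Duo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PRED_ENTRIES 16   /* hypothetical direct-mapped table size */

struct pred_entry { bool valid; unsigned long branch, target; };
struct predictor  { struct pred_entry e[PRED_ENTRIES]; };

/* Hypothetical index function: low bits of the word-aligned branch address. */
static size_t slot(unsigned long branch)
{
    return (size_t)((branch >> 2) % PRED_ENTRIES);
}

/* Predict a branch: a BTB hit means "taken"; on a miss fall back to the
 * static rule "backward taken, forward not taken". */
static bool predict_taken(const struct predictor *p,
                          unsigned long branch, unsigned long target)
{
    const struct pred_entry *e = &p->e[slot(branch)];
    if (e->valid && e->branch == branch)
        return true;
    return target < branch;
}

/* Update on resolution: taken branches are written into the table,
 * not-taken branches have their own entry (if any) invalidated. */
static void resolve(struct predictor *p, unsigned long branch,
                    unsigned long target, bool taken)
{
    struct pred_entry *e = &p->e[slot(branch)];
    if (taken) {
        e->valid = true;
        e->branch = branch;
        e->target = target;
    } else if (e->valid && e->branch == branch) {
        e->valid = false;
    }
}
```

In this model a forward branch at 0x400 is predicted taken only while its entry survives. A taken branch at 0x440, which maps to the same slot under the assumed index function, overwrites that entry and flips the next prediction for 0x400 back to not taken, which is the kind of cross-thread interference a spy process can look for.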
This style of prediction scheme may greatly improve the feasibility of SBPA, as a premature eviction of a BTB entry now has the potential to create a full misspeculation penalty, rather than just stalling the pipeline briefly. This effect again penalizes longer pipelines. Furthermore, if branch prediction performance counters are available, an improved form of BPA could be devised which analyzes branch prediction accuracy (a first-order effect) rather than execution time (a second-order effect).

3.3 BTB Size

The primary difference between SBPA and DBPA is that SBPA attempts to fill as many of the BTB slots as possible, such that a single BTB eviction by the crypto process is likely to cause an avalanche of evictions in the spy process. However, in order to achieve bit-level resolution, the spy process must execute all of its branches in the time between successive key-bit-dependent branches in the crypto process. Unless the crypto process is very slow, it is difficult for the spy process to execute enough branches to fill all of the entries of a very large BTB between each key-dependent crypto branch. If the spy process cannot fill all of the entries in time and the BTB uses a random replacement policy, there is a finite probability that when the crypto process causes two consecutive evictions (equivalent to key bits of 00), the write of the second branch to the BTB will evict the address of the first branch. The first eviction will then go undetected by the spy process, which will interpret this pattern as being equivalent to key bits of 10. Alternatively, the spy process can simply use fewer branches than there are BTB entries, but this reduces the probability of avalanche evictions and decreases SBPA's resolution advantage over DBPA. While larger BTBs make SBPA more difficult, there is a practical limit on how large the architect can make the BTB, as the BTB will be useless if it takes multiple cycles to access.

3.4 BTB Replacement Policy

In the previous discussion we have assumed that the BTB uses a random replacement policy. However, an architect might choose to implement an LRU replacement policy to exploit branch temporal locality more effectively and reduce the size of the BTB. With a random replacement policy, the SBPA and DBPA attacks must rely on a finite probability that the crypto branch entry will evict a spy branch entry. With an LRU policy, the choice of entry for eviction is deterministic, and since the spy process is designed to take many more branches than the crypto process, it is very likely that every eviction caused by the crypto process will evict a spy branch. Furthermore, if the spy process has enough time to completely fill the BTB between crypto branches, each eviction is guaranteed to cause an avalanche of evictions, making detection of key bits significantly easier.

4. Design

To determine the viability of SBPA attacks on various architectures, we designed a software system to find the optimal number of branch addresses to use in a spy program (the branch address stream length) and to measure the increase in latency caused by injecting an additional branch address into the BTB. To determine these values accurately, the BTB analysis software needed to meet the following criteria:

1. Must be able to generate streams of unique branch addresses with variable length.
2. Must have self-measuring capability to determine the latency experienced when executing a fixed number of branches.
3. Must have no other sources of variable latency besides BTB hits and misses. This means that cache hit rate and branch prediction rate (when not limited by the BTB) must be nearly 100%.

We also desired a system that could easily be applied to many different architectures in future tests, so we set the following additional requirements:

4. Must be written in a portable high-level language and require no hand-optimization of assembly code.
5. Must be adaptable to different BTB sizes and levels of associativity.
6. Must be able to automate testing over a large range of branch addresses.

Our solution was a system called the Unique Multiple Branch Address Generator, or UMBRA. UMBRA is composed of a scripted C-code self-generation, compilation, and execution system capable of generating arbitrary numbers of unique branch addresses and executing these addresses for an arbitrary number of iterations. By default, branch instructions are placed three words apart when compiled into x86 assembly code to ensure even set mapping into fully associative and 2-way, 4-way, and 8-way set-associative BTBs. All branches are data-dependent forward branches, to take advantage of BTB-based prediction systems as described in Section 3.2.
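To make the idea of scripted C-code self-generation more tangible, here is a minimal sketch of a generator that writes a C source file containing a chosen number of data-dependent branches. It is not UMBRA itself: the function name `emit_branch_stream`, the variable names `key` and `sink`, and the emitted statement pattern are hypothetical, and the sketch makes no attempt to control instruction spacing or to guarantee how a given compiler lays out each branch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write a C translation unit to `out` that defines branch_stream(), a
 * function made of n_branches separate `if` statements, each at its own
 * code address. Every branch tests one volatile variable, and that variable
 * is stored to between branches (an assumption of this sketch, intended to
 * discourage the compiler from merging or reordering the tests).
 * Returns the number of branches emitted. */
static int emit_branch_stream(FILE *out, int n_branches)
{
    assert(out != NULL && n_branches > 0);
    fprintf(out, "volatile int key = 1;\n");
    fprintf(out, "volatile int sink = 0;\n\n");
    fprintf(out, "void branch_stream(void)\n{\n");
    for (int i = 0; i < n_branches; i++) {
        fprintf(out, "    if (key == 0) sink = %d;\n", i);
        fprintf(out, "    key = 1;\n");
    }
    fprintf(out, "}\n");
    return n_branches;
}

#ifdef UMBRA_SKETCH_MAIN
/* Usage: ./gen 1000 > stream.c   (the default length is arbitrary) */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000;
    if (n <= 0)
        return EXIT_FAILURE;
    emit_branch_stream(stdout, n);
    return EXIT_SUCCESS;
}
#endif
```

A driver script could call such a generator once per stream length, compile the output without aggressive optimization, and link it against a timing harness; whether each `if` becomes a taken forward jump depends on the compiler and should be checked in the generated assembly.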
Data dependencies are based on a single data value to reduce the chance of variable latency due to a cache miss, and the data location is written to between every branch instruction to force in-order execution on out-of-order architectures and to prevent compiler reordering. Latency measurement can be done either using the universal C clock() function for measuring CPU time or by interfacing with PAPI performance counters on supported CPUs. Because of the limited resolution of the clock function, it is restricted to measuring over an arbitrary number of iterations of all the branch addresses, whereas the PAPI performance counter measurements can be taken over an arbitrary number of individual branches. Where PAPI performance counters are not available for a processor, some ISAs provide assembly language instructions to access performance counters; however, our goal is to avoid ISA-specific code alterations wherever possible to increase the portability of the system.

5. Methodology

To test UMBRA, we applied it to a Core 2 Duo-based workstation. We selected the Core 2 Duo as our test platform because it is a popular, top-of-the-line consumer processor that was readily available for testing. The Core 2 Duo also differs substantially from the Pentium IV architecture for which SBPA was originally designed. It has a much shorter, higher-efficiency pipeline than the Pentium IV, which is representative of current trends in processors and may highlight the effects of varying these features on SBPA. There is also very little public information available about the Core 2 Duo architecture, which forced us to take a very general approach to testing that can be applied to many other processors.

It is important to note that while the Core 2 Duo has a BTB, current commercially available consumer versions of the processor do not have multi-threading capability. This means that actually implementing SBPA on the Core 2 architecture is currently impractical. However, we believe that it still serves as a good illustrative example of the challenges of adapting SBPA to a new architecture and provides a glimpse of the potential vulnerability of future Core 2-based multithreaded processors.

The test system was running Fedora Core 5 Linux, and all tests were run remotely via SSH to simulate the way a remote attacker might access the system. No other user-level processes were active during the tests, but we made no effort to alter the system-level processes being run by the OS. This was done to provide a more accurate representation of the conditions of a real attack.
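The clock()-based measurement mode can be sketched as follows. This is an illustrative harness under stated assumptions, not the measurement code used in the study: `ticks_per_pass` and the two-branch stand-in `branch_stream` are hypothetical names, and a real run would link against a generated stream with many more branches.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

static volatile int key = 1;
static volatile int sink = 0;

/* Hypothetical stand-in for one pass over a generated branch stream. */
static void branch_stream(void)
{
    if (key == 0) sink = 1;
    key = 1;
    if (key == 0) sink = 2;
    key = 1;
}

/* Average CPU time per pass, in clock() ticks. Because clock() is coarse,
 * the stream is executed `passes` times and the total is divided, so only
 * an average over many passes is meaningful. */
static double ticks_per_pass(void (*stream)(void), long passes)
{
    assert(stream != NULL && passes > 0);
    clock_t start = clock();
    for (long i = 0; i < passes; i++)
        stream();
    clock_t end = clock();
    return (double)(end - start) / (double)passes;
}

#ifdef CLOCK_DEMO_MAIN
int main(void)
{
    double t = ticks_per_pass(branch_stream, 10000000L);
    printf("%.9f seconds per pass\n", t / (double)CLOCKS_PER_SEC);
    return 0;
}
#endif
```

Dividing by `CLOCKS_PER_SEC` converts ticks to seconds; converting further to CPU cycles would require the core frequency, which this sketch does not assume.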
To evaluate the vulnerability of the BTB side channel on the system, we used UMBRA to generate 500 different spy processes with branch address streams ranging in length from 500 to 10,000 branches, with the majority of measurements focused in the 1000 to 2000 branch region. We performed these measurements using the clock function, PAPI CPU cycle counters, and PAPI CPU branch counters. For the clock function, measurements were taken over 4000 consecutive branches in the spy program. For the PAPI CPU counters, measurements were taken over groups of 250 consecutive branches. Additionally, we performed a set of tests using the PAPI branch counters to see whether the Core 2 Duo's branch predictor would be affected by BTB misses. In each case we discarded the measurements corresponding to the first three iterations of the branch address stream to remove the effects of compulsory BTB misses generated when starting the process. We then recorded the next 3000 measurements for each process and computed their mean and standard deviation.

6. Experimental Results

Figures 1 & 2 show plots of the results collected using the clock function and the PAPI cycle counter measurements, respectively. The plots show the execution time for a fixed number of branches vs. the size of the unique branch address stream in the spy process. The vulnerability of the BTB can be judged by how closely this plot resembles a step function. In particular, we are interested in the steepness of the slope of the plot, which represents the size of the effect of a BTB eviction, and the point at which the maximum slope occurs, which represents the ideal size of the branch address stream in the spy process. Additionally, the point at which the plot starts sloping upward represents the smallest possible branch address stream that can be used to attack the BTB.

Figure 1. Execution latency vs. branch address stream size, using the clock function for measurements averaged over several thousand iterations.

Figure 1 shows that the C clock function, even with its lower cycle resolution, still reveals a strong relationship between execution latency and the number of addresses being written into the BTB. The plot begins sloping upward at 1300 branches and reaches its steepest point at 1500 branches. While these results might seem encouraging for SBPA, it is important to note that the resolution of the clock function is limited to measurements of 10,000 cycles on the Core 2 Duo. The high resolution of the plot is only achievable by taking averages over a large number of iterations. In an SBPA attack, however, the measurement must be made over a single key-signing event, making the measurement resolution two to three orders of magnitude too coarse to detect a single additional branch in the BTB. Despite that, the plot does show that DBPA, which relies on heavy averaging, would be feasible on the Core 2 Duo using the clock function.

Figure 2. Execution latency vs. branch address stream size, using PAPI CPU cycle counters. Latency measured over groups of 250 branches.

Figure 2 shows the plot of results using measurements from the PAPI cycle counter. The PAPI counter is able to return measurements with single-cycle resolution. The PAPI clock cycle measurements more closely resemble the ideal step-function result that represents an optimal architecture for SBPA. The plot begins sloping upward at 1275 branches and achieves a maximum slope of 20 cycles per additional branch at a branch address stream length of 1350 branches. The difference in optimal stream lengths between the C clock function and the PAPI counters can be attributed to the difference in branch overhead required to run the clock function compared to accessing the performance counters. The slope indicates that the execution of a single key-dependent branch by the crypto process will result in a 20 cycle increase in latency at this optimal stream length.

While this 20 cycle additional latency indicates that an SBPA attack is theoretically possible on this system, the influence of background system noise is a major factor. At the optimal stream length, the standard deviation in our 3000 readings was 180 cycles, nearly an order of magnitude greater than the latency increase caused by a single branch. If the standard deviation measurements are confined to groups of 100 readings, they vary from 22 to 330 cycles. Even our cleanest readings therefore had a standard deviation slightly higher than the latency increase caused by a branch from the crypto process. This indicates that even under ideal background activity, the additional latency of a single eviction from the BTB is too small to be reliably measured during a single key-signing event.

Figure 3. Branch mispredictions vs. branch address stream size, using PAPI branch counters. Mispredictions measured over groups of 250 branches.

Figure 3 shows the result of repeating these tests while measuring the number of branch mispredictions recorded by the PAPI performance counters. These results do not show a step-function-like shape, indicating that the BTB is not the primary branch prediction resource in the Core 2 Duo and that measurements of branch mispredictions cannot be used to detect BTB evictions. However, the results show that the increases in latency seen in Figures 1 & 2 are not simply the result of limitations in the branch prediction system, providing further confidence that they are a result of evictions in the BTB. Interestingly, the branch misprediction rate scales somewhat with increasing branch address stream length. In particular, while the average number of branch mispredictions increases by only 25% between stream lengths of 1000 and 5000, the standard deviation in mispredictions increases by over 600%. It is unclear whether this bears a relationship to the state of the BTB; it might be worth investigating in another study.
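The gap between the 20 cycle effect and the measured noise can be turned into a rough averaging requirement. The sketch below assumes, beyond what the measurements establish, that readings are independent and that their mean has standard error sigma/sqrt(n); the function name `samples_needed` and the threshold parameter `z` are hypothetical.

```c
#include <assert.h>

/* Smallest n such that sigma / sqrt(n) <= effect / z, i.e. the number of
 * independent readings that must be averaged before the standard error of
 * the mean shrinks to 1/z of the effect being detected.
 * Assumes independent readings (an assumption of this sketch). */
static long samples_needed(double sigma, double effect, double z)
{
    assert(sigma > 0.0 && effect > 0.0 && z > 0.0);
    double ratio = z * sigma / effect;
    double x = ratio * ratio;
    long n = (long)x;            /* truncate, then round up if needed */
    if ((double)n < x)
        n++;
    return n;
}
```

With the values reported above, `samples_needed(180.0, 20.0, 1.0)` gives 81 readings just to bring the standard error down to the size of the effect, and `samples_needed(22.0, 20.0, 1.0)` still gives 2 for the cleanest group of readings. An attack limited to a single key-signing event has only one reading per key bit, which is consistent with the conclusion that averaging-based DBPA is workable here while single-shot SBPA is not.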

7. Related Work

Branch-predictor-based side-channel attacks are a very recent development, so the amount of work in the field is limited. In early 2007, Acıiçmez et al. first proposed the DBPA attack and showed that the branch target buffer could be used to extract data from the RSA algorithm implemented in OpenSSL. However, this method required averaging results over 9 million key-signing operations, making it impractical for most applications [2]. Later that year, the same researchers proposed the SBPA enhancement and outlined an attack on the Pentium IV architecture. However, they used a trace-based simulation to provide most of their results and gave few details on the practical challenges of implementing an SBPA attack on a real system [1].

8. Conclusions

In this paper we developed a software system for evaluating the SBPA vulnerability of various architectures. We explored the different architectural choices that can affect the viability of SBPA and analyzed the vulnerability of an off-the-shelf Linux workstation using a Core 2 Duo processor. We showed that it was possible to extract side-channel information about the Core 2 Duo's branch target buffer using timing analysis, but discovered that analyzing the branch misprediction rate is not effective. We showed that the optimal branch address stream length to use in a BPA attack is between 1350 and 1500 branches and that an eviction from the BTB provides an increase in latency of 20 cycles in the best case. We found that using PAPI performance counters to measure CPU time provides much more accurate measurements than the C clock function and that DBPA techniques could be effective using these measurement methods, but found that executing an SBPA attack on a single key-signing operation was infeasible, as the standard deviation in measurements was higher than the effect we were trying to measure.
At the beginning of this paper, we stated that we wished to discover whether SBPA attacks can be applied effectively to other computer architectures. Our results show that while implementing SBPA proved infeasible on our test system, it is not far from being feasible, indicating that architectures which suffer larger penalties for BTB misses could be vulnerable to SBPA attacks. An important additional point is that although the binary square-and-multiply example included only a single branch dependent on the key bits, some sensitive algorithms could contain tens of branches dependent on each key bit, which would be enough to make an SBPA attack on our test system viable.

It is also very important to note that in both of our tests, the spy process needed to execute a minimum of 1300 branches before it could detect a branch from the crypto process. If the cryptographic process can execute multiple key-dependent branches in the time it takes the spy process to execute 1300 branches, it becomes very difficult for SBPA to infer the values of the secret key bits. By designing highly efficient algorithms, the problem of the BTB side channel can be mitigated to a certain extent.

These two points demonstrate that designing a truly secure platform is a top-down effort. An architect cannot solely consider the security of an algorithm or architecture, but must instead consider the interrelationship between all parts of the system, from the microarchitecture to the ISA to the OS to the software running on it, in order to design a truly secure system.

Acknowledgements

We would like to thank Dr. Karu Sankaralingam for providing us with the test system and for his valuable advice throughout the project.

References

[1] O. Acıiçmez, Ç. K. Koç and J.-P. Seifert, "On the Power of Simple Branch Prediction Analysis," ASIACCS '07.
[2] O. Acıiçmez, J.-P. Seifert and Ç. K. Koç, "Extracting Secret Keys via Branch Prediction," CT-RSA 2007.
[3] D. Agrawal, B. Archambeault, J.R. Rao and P. Rohatgi, "The EM Side-Channel(s)," Proceedings of Workshop on Cryptographic Hardware and Embedded Systems.
[4] H. Bar-El, "Introduction to Side-Channel Attacks," White paper.
[5] J.F. Burns, "Privacy breach rocks British government," International Herald Tribune, 24 November 2007.
[6] M. Joye and J. Quisquater, "Hessian Elliptic Curves and Side-Channel Attacks," in Cryptographic Hardware and Embedded Systems, vol. 2162, Heidelberg: Springer Berlin, p. 402.
[7] J. Kelsey, B. Schneier, D. Wagner and C. Hall, "Side Channel Cryptanalysis of Product Ciphers," ESORICS '98.
[8] T.W. Kwon, C.S. You, W.S. Heo, Y.K. Kang and J.R. Choi, "Two implementation methods of a 1024-bit RSA cryptoprocessor based on modified Montgomery algorithm," ISCAS 2001, vol. 4.
[9] Q. Liu, R. Safavi-Naini and N.P. Sheppard, "Digital Rights Management for Content Distribution," Conferences in Research and Practice in Information Technology, vol. 34.
[10] C. McIvor, M. McLoone and J.V. McCanny, "Fast Montgomery Modular Multiplications and the RSA Cryptographic Processor Architectures," Signals, Systems and Computers, vol. 1.
[11] B. Moller, "Securing Elliptic Curve Point Multiplication against Side-Channel Attacks," in Information Security, vol. 2200, Heidelberg: Springer Berlin, p. 324.
[12] C. Percival, "Cache missing for fun and profit," BSDCan 2005.
[13] P. Regan, "From Clipper to Carnivore: Balancing Privacy, Law Enforcement, and Industry Interests," American Political Science Association, 2001 Proceedings.
[14] J. Shen and M. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill.
[15] D. Solove, The Digital Person: Technology and Privacy in the Information Age, New York: New York University Press.
[16] F. Standaert, G. Piret, G. Rouvroy, J. Quisquater and J. Legat, "ICEBERG: an Involutional Cipher Efficient for Block Encryption on Reconfigurable Hardware," FSE 2004, Springer-Verlag.
[17] F.-X. Standaert, E. Peeters, G. Rouvroy and J. Quisquater, "An overview of power analysis attacks against field programmable gate arrays," Proceedings of the IEEE, vol. 92, 2006.
[18] E. Tune, E. Borch, S. Manne and J. Emer, "Loose loops sink chips," High-Performance Computer Architecture 2002.


More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Show Me the $... Performance And Caches

Show Me the $... Performance And Caches Show Me the $... Performance And Caches 1 CPU-Cache Interaction (5-stage pipeline) PCen 0x4 Add bubble PC addr inst hit? Primary Instruction Cache IR D To Memory Control Decode, Register Fetch E A B MD1

More information

Low-power Architecture. By: Jonathan Herbst Scott Duntley

Low-power Architecture. By: Jonathan Herbst Scott Duntley Low-power Architecture By: Jonathan Herbst Scott Duntley Why low power? Has become necessary with new-age demands: o Increasing design complexity o Demands of and for portable equipment Communication Media

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Meltdown and Spectre - understanding and mitigating the threats (Part Deux)

Meltdown and Spectre - understanding and mitigating the threats (Part Deux) Meltdown and Spectre - understanding and mitigating the threats (Part Deux) Gratuitous vulnerability logos Jake Williams @MalwareJake SANS / Rendition Infosec sans.org / rsec.us @SANSInstitute / @RenditionSec

More information

Version:1.1. Overview of speculation-based cache timing side-channels

Version:1.1. Overview of speculation-based cache timing side-channels Author: Richard Grisenthwaite Date: January 2018 Version 1.1 Introduction This whitepaper looks at the susceptibility of Arm implementations following recent research findings from security researchers

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Midterm Exam 1 Wednesday, March 12, 2008

Midterm Exam 1 Wednesday, March 12, 2008 Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Walking Four Machines by the Shore

Walking Four Machines by the Shore Walking Four Machines by the Shore Anastassia Ailamaki www.cs.cmu.edu/~natassa with Mark Hill and David DeWitt University of Wisconsin - Madison Workloads on Modern Platforms Cycles per instruction 3.0

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Instruction Level Parallelism (Branch Prediction)

Instruction Level Parallelism (Branch Prediction) Instruction Level Parallelism (Branch Prediction) Branch Types Type Direction at fetch time Number of possible next fetch addresses? When is next fetch address resolved? Conditional Unknown 2 Execution

More information

Security against Timing Analysis Attack

Security against Timing Analysis Attack International Journal of Electrical and Computer Engineering (IJECE) Vol. 5, No. 4, August 2015, pp. 759~764 ISSN: 2088-8708 759 Security against Timing Analysis Attack Deevi Radha Rani 1, S. Venkateswarlu

More information

Journal of Global Research in Computer Science A UNIFIED BLOCK AND STREAM CIPHER BASED FILE ENCRYPTION

Journal of Global Research in Computer Science A UNIFIED BLOCK AND STREAM CIPHER BASED FILE ENCRYPTION Volume 2, No. 7, July 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info A UNIFIED BLOCK AND STREAM CIPHER BASED FILE ENCRYPTION Manikandan. G *1, Krishnan.G

More information

An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm

An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm Miaoqing Huang 1, Kris Gaj 2, Soonhak Kwon 3, Tarek El-Ghazawi 1 1 The George Washington University, Washington, D.C., U.S.A.

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

Performance Tuning VTune Performance Analyzer

Performance Tuning VTune Performance Analyzer Performance Tuning VTune Performance Analyzer Paul Petersen, Intel Sept 9, 2005 Copyright 2005 Intel Corporation Performance Tuning Overview Methodology Benchmarking Timing VTune Counter Monitor Call Graph

More information

Automatic Counterflow Pipeline Synthesis

Automatic Counterflow Pipeline Synthesis Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The

More information

On the Design of Secure Block Ciphers

On the Design of Secure Block Ciphers On the Design of Secure Block Ciphers Howard M. Heys and Stafford E. Tavares Department of Electrical and Computer Engineering Queen s University Kingston, Ontario K7L 3N6 email: tavares@ee.queensu.ca

More information

HOST Differential Power Attacks ECE 525

HOST Differential Power Attacks ECE 525 Side-Channel Attacks Cryptographic algorithms assume that secret keys are utilized by implementations of the algorithm in a secure fashion, with access only allowed through the I/Os Unfortunately, cryptographic

More information

High-Performance Cryptography in Software

High-Performance Cryptography in Software High-Performance Cryptography in Software Peter Schwabe Research Center for Information Technology Innovation Academia Sinica September 3, 2012 ECRYPT Summer School: Challenges in Security Engineering

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

A Study for Branch Predictors to Alleviate the Aliasing Problem

A Study for Branch Predictors to Alleviate the Aliasing Problem A Study for Branch Predictors to Alleviate the Aliasing Problem Tieling Xie, Robert Evans, and Yul Chu Electrical and Computer Engineering Department Mississippi State University chu@ece.msstate.edu Abstract

More information

The Last Mile An Empirical Study of Timing Channels on sel4

The Last Mile An Empirical Study of Timing Channels on sel4 The Last Mile An Empirical Study of Timing on David Cock Qian Ge Toby Murray Gernot Heiser 4 November 2014 NICTA Funding and Supporting Members and Partners Outline The Last Mile Copyright NICTA 2014 David

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

A Weight Based Attack on the CIKS-1 Block Cipher

A Weight Based Attack on the CIKS-1 Block Cipher A Weight Based Attack on the CIKS-1 Block Cipher Brian J. Kidney, Howard M. Heys, Theodore S. Norvell Electrical and Computer Engineering Memorial University of Newfoundland {bkidney, howard, theo}@engr.mun.ca

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Speculation and Future-Generation Computer Architecture

Speculation and Future-Generation Computer Architecture Speculation and Future-Generation Computer Architecture University of Wisconsin Madison URL: http://www.cs.wisc.edu/~sohi Outline Computer architecture and speculation control, dependence, value speculation

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time

More information

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print

More information

Chapter 14 Performance and Processor Design

Chapter 14 Performance and Processor Design Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures

More information

A New Attack with Side Channel Leakage during Exponent Recoding Computations

A New Attack with Side Channel Leakage during Exponent Recoding Computations A New Attack with Side Channel Leakage during Exponent Recoding Computations Yasuyuki Sakai 1 and Kouichi Sakurai 2 1 Mitsubishi Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247-8501, Japan ysakai@iss.isl.melco.co.jp

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Side channel attack: Power Analysis. Chujiao Ma, Z. Jerry Shi CSE, University of Connecticut

Side channel attack: Power Analysis. Chujiao Ma, Z. Jerry Shi CSE, University of Connecticut Side channel attack: Power Analysis Chujiao Ma, Z. Jerry Shi CSE, University of Connecticut Conventional Cryptanalysis Conventional cryptanalysis considers crypto systems as mathematical objects Assumptions:

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Final Lecture. A few minutes to wrap up and add some perspective

Final Lecture. A few minutes to wrap up and add some perspective Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Written Exam / Tentamen

Written Exam / Tentamen Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of

More information

0x1A Great Papers in Computer Security

0x1A Great Papers in Computer Security CS 380S 0x1A Great Papers in Computer Security Vitaly Shmatikov http://www.cs.utexas.edu/~shmat/courses/cs380s/ Attacking Cryptographic Schemes Cryptanalysis Find mathematical weaknesses in constructions

More information

Homework 2 (r1.1) Due: Part (A) -- Apr 2, 2017, 11:55pm Part (B) -- Apr 2, 2017, 11:55pm Part (C) -- Apr 2, 2017, 11:55pm

Homework 2 (r1.1) Due: Part (A) -- Apr 2, 2017, 11:55pm Part (B) -- Apr 2, 2017, 11:55pm Part (C) -- Apr 2, 2017, 11:55pm Second Semester, 2016 17 Homework 2 (r1.1) Due: Part (A) -- Apr 2, 2017, 11:55pm Part (B) -- Apr 2, 2017, 11:55pm Part (C) -- Apr 2, 2017, 11:55pm Instruction: Submit your answers electronically through

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction

More information

Software Engineering Aspects of Elliptic Curve Cryptography. Joppe W. Bos Real World Crypto 2017

Software Engineering Aspects of Elliptic Curve Cryptography. Joppe W. Bos Real World Crypto 2017 Software Engineering Aspects of Elliptic Curve Cryptography Joppe W. Bos Real World Crypto 2017 1. NXP Semiconductors Operations in > 35 countries, more than 130 facilities 45,000 employees Research &

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

page 1 Introduction to Cryptography Benny Pinkas Lecture 3 November 18, 2008 Introduction to Cryptography, Benny Pinkas

page 1 Introduction to Cryptography Benny Pinkas Lecture 3 November 18, 2008 Introduction to Cryptography, Benny Pinkas Introduction to Cryptography Lecture 3 Benny Pinkas page 1 1 Pseudo-random generator Pseudo-random generator seed output s G G(s) (random, s =n) Deterministic function of s, publicly known G(s) = 2n Distinguisher

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Simultaneous Multithreading Processor

Simultaneous Multithreading Processor Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf

More information

A Perfect Branch Prediction Technique for Conditional Loops

A Perfect Branch Prediction Technique for Conditional Loops A Perfect Branch Prediction Technique for Conditional Loops Virgil Andronache Department of Computer Science, Midwestern State University Wichita Falls, TX, 76308, USA and Richard P. Simpson Department

More information

Intel Analysis of Speculative Execution Side Channels

Intel Analysis of Speculative Execution Side Channels Intel Analysis of Speculative Execution Side Channels White Paper Revision 1.0 January 2018 Document Number: 336983-001 Intel technologies features and benefits depend on system configuration and may require

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

CSCE 626 Experimental Evaluation.

CSCE 626 Experimental Evaluation. CSCE 626 Experimental Evaluation http://parasol.tamu.edu Introduction This lecture discusses how to properly design an experimental setup, measure and analyze the performance of parallel algorithms you

More information

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer

More information

For Problems 1 through 8, You can learn about the "go" SPEC95 benchmark by looking at the web page

For Problems 1 through 8, You can learn about the go SPEC95 benchmark by looking at the web page Problem 1: Cache simulation and associativity. For Problems 1 through 8, You can learn about the "go" SPEC95 benchmark by looking at the web page http://www.spec.org/osg/cpu95/news/099go.html. This problem

More information

THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION

THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION Radu Balaban Computer Science student, Technical University of Cluj Napoca, Romania horizon3d@yahoo.com Horea Hopârtean Computer Science student,

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 9

ECE 571 Advanced Microprocessor-Based Design Lecture 9 ECE 571 Advanced Microprocessor-Based Design Lecture 9 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 30 September 2014 Announcements Next homework coming soon 1 Bulldozer Paper

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Computer Security. 08. Cryptography Part II. Paul Krzyzanowski. Rutgers University. Spring 2018

Computer Security. 08. Cryptography Part II. Paul Krzyzanowski. Rutgers University. Spring 2018 Computer Security 08. Cryptography Part II Paul Krzyzanowski Rutgers University Spring 2018 March 23, 2018 CS 419 2018 Paul Krzyzanowski 1 Block ciphers Block ciphers encrypt a block of plaintext at a

More information