Efficient Practical Key Recovery for Side-Channel Attacks


Aalto University
School of Science
Degree Programme in Security and Mobile Computing

Kamran Manzoor

Efficient Practical Key Recovery for Side-Channel Attacks

Master's Thesis
Espoo, June 30, 2014

Supervisors: Professor Antti Ylä-Jääski, Aalto University
             Assistant Professor Andrey Bogdanov, Technical University of Denmark

Aalto University
School of Science
Degree Programme in Security and Mobile Computing

ABSTRACT OF MASTER'S THESIS

Author: Kamran Manzoor
Title: Efficient Practical Key Recovery for Side-Channel Attacks
Date: June 30, 2014    Pages: 77
Major: Data Communication Software    Code: T-110
Supervisors: Professor Antti Ylä-Jääski, Aalto University
             Assistant Professor Andrey Bogdanov, Technical University of Denmark

Side-channel attacks pose serious threats to the security of a cryptographic device. These attacks exploit the physical properties of a device in order to recover the secret key involved in the computations. Most side-channel attacks yield ranking lists of candidates for different parts of the secret key. The lists of candidates are typically ranked in decreasing order of their likelihood. Ideally, the correct key is the combination of the most likely candidate from each list. However, the success of side-channel attacks depends on various factors and therefore the correct key may not be the most likely one. This in turn means that a process is required that enumerates keys in decreasing order of their likelihood. Moreover, simultaneous validation of the enumerated keys is required in order to reveal the secret key. The former process is referred to as key enumeration while the latter is referred to as key validation. Thus, key recovery for side-channel attacks consists of two main components, i.e., key enumeration and key validation.

In this thesis, we propose an efficient practical solution for key recovery for side-channel attacks. The proposed solution employs the processing powers of both a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU). Regarding key enumeration, we have implemented and compared three key enumeration algorithms, namely the Optimal Key Enumeration Algorithm (OKEA), the Score based Key Enumeration Algorithm (SKEA) and the Trivial Key Enumeration Algorithm (TKEA). Our experimental results show that SKEA outperforms its counterparts as it reveals the correct key much faster than OKEA and TKEA. Thus, we propose to deploy SKEA as the key enumeration algorithm. Furthermore, the maximum throughput of SKEA on a CPU is around $2^{23}$ keys/sec, which in turn requires that key validation should at least keep this pace. For this purpose, we have utilized the immense parallel processing power of a GPU. Thus, concurrent execution of the key enumeration process, i.e., SKEA, on a CPU and the key validation process on a GPU yields an overall efficient key recovery solution for side-channel attacks.

Keywords: side-channel attacks, key recovery, key enumeration, key validation, optimality, OKEA, SKEA, TKEA
Language: English

Acknowledgements

I sincerely thank my main thesis supervisor, Assistant Professor Andrey Bogdanov, for his excellent supervision. His valuable comments and useful feedback helped me a lot in completing this thesis. I would also like to thank Assistant Professor Elmar Wolfgang Tischhauser for the discussions with us regarding AES-NI. I would like to express my special gratitude to Ilya Kizhvatov, Senior Security Analyst, Riscure, The Netherlands, for being my supervisor during my internship at Riscure. My weekly meetings with him were very helpful for the smooth execution of the thesis. I would also like to thank him for the data which he sent us for our experiments. Moreover, I would like to thank Marc Witteman, CTO, Riscure, for allowing me to work on his proposed algorithm. I am also very grateful to Professor Antti Ylä-Jääski for remotely supervising my thesis. Throughout my thesis, he provided valuable, detailed and timely comments on our work amidst his busy schedule. Last but not least, I would like to express my gratitude to my parents and family members for their unconditional love, affection and moral support. They have been a source of inspiration for me throughout my life.

Espoo, June 30, 2014
Kamran Manzoor

Abbreviations and Acronyms

CPU     Central Processing Unit
GPU     Graphics Processing Unit
GPGPU   General Purpose Computing on Graphics Processing Unit
SNR     Signal to Noise Ratio
OKEA    Optimal Key Enumeration Algorithm
SKEA    Score based Key Enumeration Algorithm
TKEA    Trivial Key Enumeration Algorithm
AES     Advanced Encryption Standard
DES     Data Encryption Standard
SPA     Simple Power Analysis
DPA     Differential Power Analysis
CUDA    Compute Unified Device Architecture
OpenCL  Open Computing Language
CC      Compute Capability
SM      Streaming Multiprocessor
AES-NI  Advanced Encryption Standard New Instructions
SIMD    Single Instruction Multiple Data
NIST    National Institute of Standards and Technology

Contents

Abbreviations and Acronyms

1 Introduction
  1.1 Physical Attacks on Cryptographic Devices
      Side-Channel Attacks
  1.2 Problem Statement
  1.3 Contributions
  1.4 Structure of the Thesis

2 Background
  2.1 Advanced Encryption Standard
      AES Implementation through T-tables
  2.2 Side-Channel Attacks
      Power Analysis Attacks
      Simple Power Analysis Attacks
      Differential Power Analysis Attacks
  2.3 General Purpose Computing on Graphics Processing Units
      Programming GPUs using the CUDA platform
      CUDA Models
      Hardware View of a GPU
      Design Criteria for GPGPU

3 Key Enumeration
  3.1 Optimal Key Enumeration Algorithm
      Bayesian Extension of Non-Profiled Side-Channel Attacks
      Description of Enumeration Process of OKEA
  3.2 Score based Key Enumeration Algorithm
      Obtaining Discrete Score based Ranking
      Description of Enumeration Process of SKEA
  3.3 Trivial Key Enumeration Algorithm

4 Analysis of Key Enumeration Algorithms
  4.1 Performance Metrics
      Memory Consumption
      Optimality
      Throughput
      Time to find the Correct Key
  4.2 Experimental Comparisons
      Memory Consumption
      Optimality
      Throughput
      Time to find the Correct Key
  4.3 Discussion

5 Key Validation
  5.1 AES Key Validation on GPU
  5.2 Performance Analysis
  5.3 Discussion

6 Proposed Solution and Future Enhancements
  6.1 Proposed Solution
      Flowchart
      Performance Analysis
  6.2 Future Enhancements
      GPU Computing and Key Enumeration Algorithms
      GPU constraints
      Compatibility of OKEA for a GPU
      Compatibility of SKEA for a GPU
      Proposed Solution for Future Research

7 Conclusion

Bibliography

Chapter 1

Introduction

Cryptography refers to the use of cryptographic algorithms with the aim of secure communication in the presence of an adversary. In modern cryptography, the cryptographic algorithm itself is publicly known while only the cryptographic key is kept secret. This important principle was stated by a Dutch cryptographer, Auguste Kerckhoffs, in the 19th century [1]. Therefore, breaking a cryptographic algorithm means revealing the secret key by exploiting some public information. A cryptographic algorithm is considered to be practically secure if no known attack can break it within a realistic amount of time and with a realistic amount of computational power [1].

Cryptographic devices are dedicated devices that are used to perform cryptographic operations using the keys stored on them. In practice, the security of a cryptographic primitive implemented on a cryptographic device can be viewed from two main perspectives: on one hand, it can be considered as a black box that maps an input into an output based on the key; on the other hand, since this primitive is implemented on a given processor in a given environment, one can also consider the implementation-specific details of the underlying hardware (the cryptographic device). The first perspective is that of classical cryptanalysis while the second is that of physical or hardware security. Thus, both perspectives should be considered for the overall security of a cryptographic device. In other words, the cryptographic algorithm needs to be secured not only against classical cryptanalysis attacks; the physical security of the cryptographic device is also of utmost importance, as device-specific characteristics may help an attacker reveal the stored secret key.

It is necessary to make assumptions about the attacker's knowledge in order to evaluate the security of a cryptographic device. The strongest assumption is to consider that the attacker knows all details of the cryptographic device and the corresponding cryptographic algorithm, while the cryptographic key is the only secret that is unknown to him.

Most research in cryptography focuses on the mathematics of the underlying cryptographic algorithms, while in comparison, less consideration has been given to the physical security of cryptographic devices. In this thesis, we focus on the physical security aspect of cryptographic devices. Let us first briefly discuss physical attacks and the threats which they pose to the security of cryptographic devices.

1.1 Physical Attacks on Cryptographic Devices

Physical attacks exploit the implementation-specific characteristics of cryptographic devices in order to reveal the secret key involved in the computation. They are much less general (as they are specific to a given implementation) but much stronger than classical cryptanalysis attacks and therefore, they are taken very seriously by the vendors of cryptographic devices. Although all physical attacks share the same goal of revealing the secret key, they differ significantly in terms of the equipment, cost, time and expertise required. In the literature, physical attacks are usually categorized along the following two orthogonal axes [3]:

Active vs passive attacks: Active attacks tamper with the normal functionality of the cryptographic device in order to make it behave abnormally. This abnormal behavior is then exploited to reveal the secret key of the device. In contrast, passive attacks do not disturb the normal behavior of the device and reveal the secret key by observing the physical properties of the device.

Invasive vs non-invasive attacks: Invasive attacks require depackaging of the device in order to get access to its various components. These attacks are said to be the strongest attacks that can be mounted on a cryptographic device, as there are essentially no limits to what can be done in order to reveal the secret key. However, they typically require quite expensive equipment and therefore, they are neither very common nor easy to mount. In contrast, non-invasive attacks only exploit the directly accessible interfaces of a device. These attacks can be mounted using inexpensive equipment and therefore pose serious practical threats to the security of cryptographic devices [4]. Both invasive and non-invasive attacks can be either active or passive.

Passive non-invasive attacks are commonly referred to as side-channel attacks. As we are mainly concerned with side-channel attacks in this thesis, these attacks are briefly explained below, while their detailed description can be found in the next chapter.

Side-Channel Attacks

Side-channel attacks exploit the fact that during the execution of a cryptographic algorithm, the cryptographic device itself reveals the stored secret key in the form of physical information that can be measured externally. Being passive non-invasive attacks, they can generally be performed using cheap equipment. Furthermore, these attacks neither damage the device nor disturb its normal operation and therefore, they pose serious threats to the security of cryptographic devices. It is worth mentioning here that no perfect protection exists against these attacks; however, appropriate countermeasures do exist that make the attacker's task harder [5]. Mobile devices such as smart cards are said to be the primary targets of side-channel attacks as they have externally controllable pins for power supply, clock, etc. Power analysis attacks, electromagnetic attacks and timing attacks are the most important types of side-channel attacks [1]. As the names imply, these attacks reveal the secret key by exploiting, respectively, the power consumption, the electromagnetic leaks and the execution time of a cryptographic device. To familiarize readers with the operation of side-channel attacks, we thoroughly describe power analysis attacks in the next chapter.

1.2 Problem Statement

Side-channel attacks are considered serious threats to the security of a cryptographic device. Most side-channel attacks employ a divide-and-conquer strategy. In the divide phase, an adversary recovers information about different parts of the full key, usually called sub-keys. In the conquer phase, these sub-keys are then combined to reveal the full key. Typically, the divide phase yields lists of sub-key candidates sorted from the most likely to the least likely one. For example, a DPA attack on an AES-128 implementation yields a ranking candidate list for each byte of the secret key, thereby generating 16 lists in total. Ideally, the most likely sub-key candidate from each list is combined in the conquer phase to reveal the full key. However, the success of side-channel attacks depends on various factors [22].

For instance, in the case of Differential Power Analysis (DPA) attacks, the number of acquired power traces and the corresponding Signal to Noise Ratio (SNR) of the acquired traces are among the most influential factors. Typically, the success of DPA attacks varies proportionally to these two factors. This in turn means that if the number of power traces or the SNR is not adequate, then the sub-keys may not be recovered with high confidence in the divide phase and therefore, the correct key may not be the one obtained as the combination of the most likely sub-key candidate from each list. In order to recover the key in such cases, one needs a key enumeration process that enumerates full keys from the sub-key lists (ideally in decreasing order of their likelihood) and a key validation process that concurrently validates the keys generated by the key enumeration process. Thus, key enumeration and key validation are the two constituent components of the key recovery mechanism. Figure 1.1 illustrates the idea.

In this thesis, we deal with the aforementioned key recovery problem and propose an efficient key recovery solution that utilizes the processing powers of both a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). The proposed solution employs the processing power of a CPU to smartly enumerate full keys from the sub-key candidate lists and employs the processing power of a GPU to concurrently validate the generated full keys in order to reveal the correct key.

Figure 1.1: Block level description of key recovery mechanism

The master's thesis can also be viewed from another perspective, i.e., enhancing the brute force attack by exploiting side-channel leakage information. Considering an adversary with a maximum computing power P, ECRYPT recommends selecting an n-bit key size for symmetric schemes such that the ratio $2^n/P$ is larger than the lifetime of the protected data [7]. Following this recommendation, the cryptographic key lengths for symmetric schemes in practice today typically range between 80 and 128 bits [7]. For instance, the smallest key length for AES is 128 bits. Brute forcing such a large key space is clearly infeasible. However, if we exploit some side-channel leakage information (such as applying a DPA attack on a few power traces), then using our proposed solution we can efficiently recover the secret key. The key enumeration part of our proposed solution outputs key candidates in decreasing order of their posterior likelihood.

This in turn minimizes the expected number of keys to test in the key validation part.

1.3 Contributions

Since the key recovery mechanism consists of two main components, i.e., key enumeration and key validation, let us describe our contributions sequentially for both of these parts.

Regarding the key enumeration part, we have implemented and thoroughly compared three key enumeration algorithms, namely: the Optimal Key Enumeration Algorithm (OKEA) [2], the Score based Key Enumeration Algorithm (SKEA) [6] and the Trivial Key Enumeration Algorithm (TKEA). All three algorithms are explained in detail in Chapter 3. We have compared these algorithms based on four metrics: optimality, throughput, memory consumption and time to find the correct key. Optimality refers to the property of generating key candidates in decreasing order of their likelihood, and throughput refers to the speed of enumeration, i.e., how many keys can be generated by the algorithm in one second. The other metrics, i.e., memory consumption and time to find the correct key, are straightforward. An ideal key enumeration algorithm should be optimal, should have a high throughput, should consume limited memory and, if the correct key is feasible to recover, should take a very short time to find it. From the comparison analyses presented in detail in Chapter 4, we have seen that none of the three key enumeration algorithms under consideration satisfies all these features. OKEA outputs keys in the optimal order but requires a lot of memory and has a small throughput, which in turn results in a very long time to find the correct key. On the other hand, SKEA is a sub-optimal algorithm, consumes only a limited amount of memory and has a high throughput, which in turn results in a very short time to find the correct key. In contrast to both OKEA and SKEA, TKEA is a non-optimal algorithm, does not require significant memory and has a very high throughput, but because of its non-optimality, it requires a very long time (practically infeasible in some cases) to find the correct key. As SKEA satisfies most of the properties of an ideal key enumeration algorithm, we have selected SKEA for the design of our proposed solution.

Our experiments show that the maximum speed (or throughput) of the selected key enumeration algorithm, i.e., SKEA, on the CPU is approximately $2^{23}$ keys/sec. This in turn means that the key validation process should at least keep this pace in order to achieve an efficient key recovery solution.

For this purpose, we have used the immense parallel processing power of a GPU. We have implemented key validation for the AES cipher on an Nvidia GTX 650 GPU using the CUDA (Compute Unified Device Architecture) platform and achieved a throughput higher than the required figure.

Furthermore, we have proposed the design of the overall key recovery solution. For the key enumeration part, we propose to use SKEA on a CPU while for key validation we suggest implementing AES on a GPU. This combination, i.e., concurrent execution of SKEA on a CPU and AES key validation on a GPU, results in the desired efficient key recovery solution.

Finally, to give directions for future research, we have also examined which key enumeration algorithm among OKEA and SKEA fits best on a GPU. Although GPU computing is a very economical and efficient way to enhance the performance of an algorithm, it imposes many constraints which must be satisfied in order to achieve a performance boost. From our analyses, we have concluded that, compared to OKEA, SKEA seems to be a much better fit for a GPU. Thus, we expect that the throughput of SKEA can be further enhanced by implementing it on a GPU. However, a SKEA implementation on a GPU may then require much faster key validation than the AES key validation speed achievable on a GPU. For this purpose, Intel's Advanced Encryption Standard New Instructions (AES-NI) could be used, as only around 64 cycles are required to test one key using AES-NI on Intel's new Haswell architecture [8].

1.4 Structure of the Thesis

The rest of the thesis is organized as follows. Chapter 2 presents a detailed description of the topics that are necessary to understand the rest of the thesis: the Advanced Encryption Standard (AES), side-channel attacks and general purpose computing on graphics processing units. Chapter 3 discusses the first part of the key recovery mechanism, i.e., key enumeration, and presents three key enumeration algorithms, namely the Optimal Key Enumeration Algorithm (OKEA), the Score based Key Enumeration Algorithm (SKEA) and the Trivial Key Enumeration Algorithm (TKEA). Chapter 4 presents thorough experimental comparisons of the three key enumeration algorithms and proposes the best one among them. Chapter 5 describes the second part of the key recovery mechanism, i.e., key validation, and presents the details of our implementation of AES key validation on a GPU. Chapter 6 summarizes the proposed solution and also depicts its performance. Chapter 7 concludes the thesis.

Chapter 2

Background

In this thesis, we mainly focus on the Advanced Encryption Standard (AES) and, unless otherwise stated, all examples of attacks are presented by taking AES as the underlying cipher. AES is arguably the most widely deployed block cipher in practice today. In order to give readers a brief introduction to the operation of AES, we have dedicated the first section to this purpose. Moreover, we also present the notion of T-tables and discuss the way they are used to increase the efficiency of AES. Secondly, as side-channel attacks that employ a divide-and-conquer strategy are our main concern, we present the details of those attacks by taking power analysis attacks as an example. We give a detailed description of Differential Power Analysis (DPA) attacks and the steps required to execute them. That section will help readers understand the basic concept of side-channel attacks and the threats which they pose to the security of cryptographic devices. Finally, as we employ the immense parallel processing power of a GPU in our key recovery solution, we dedicate a section to the basics of general purpose computing on graphics processing units. The section presents all the necessary concepts required to understand the GPU computing part of our thesis. Furthermore, it also summarizes the design criteria that should be followed in order to achieve maximum efficiency on a GPU. Thus, this chapter will greatly help readers in understanding the rest of the thesis.

2.1 Advanced Encryption Standard

The Advanced Encryption Standard (AES) is a block cipher standardized by the National Institute of Standards and Technology (NIST) [10]. AES was developed by the Belgian cryptographers Daemen and Rijmen.

AES encrypts a block of 128-bit plaintext to generate a corresponding 128-bit ciphertext. The three variants of AES are AES-128, AES-192 and AES-256. These variants have the same block length of 128 bits but differ mainly in their key sizes, as they use keys of size 128, 192 and 256 bits respectively. AES is based on a substitution-permutation network and operates on the byte level. It is an iterated cipher and, depending on the key length, uses 10, 12 or 14 rounds of encryption. The AES variants are shown in Table 2.1.

Table 2.1: AES variants

Variant   Key Size   No. of Rounds
AES-128   128 bits   10
AES-192   192 bits   12
AES-256   256 bits   14

Each AES round consists of four main operations, namely SubBytes, ShiftRows, MixColumns and AddRoundKey. The final round differs from the remaining rounds as it lacks the MixColumns operation. We mainly focus on AES-128 in this thesis and, from now on, AES will also refer to AES-128. The AES algorithm is presented in Algorithm 1.

Given a 128-bit data block and a 128-bit key, the encryption process begins with a zero round which consists of a single AddRoundKey operation. This zeroth round makes the data dependent on the secret key. Nine similar rounds, each consisting of the four main operations, are then executed. Each AES operation can be easily understood by considering the 16-byte input plaintext as a 4x4 matrix of bytes. Figure 2.1 illustrates each AES operation. SubBytes refers to the process of replacing each of the 16 bytes of the state with a corresponding byte from an invertible S-box; the details of the S-box can be found in [10]. In ShiftRows, the rows of the matrix are shifted by a number of byte positions to the left: the first or top row is not shifted, the second row is shifted by one position, the third by two positions and the fourth by three positions. The MixColumns operation mixes the four bytes of each column by multiplying the column with a fixed, invertible 4x4 matrix [10]. All nine rounds are similar as they all contain these four operations, but the last round differs from the rest as it omits the MixColumns operation. Since the AddRoundKey operation depends on a 16-byte key and is present in all rounds, a total of 11 keys, each 16 bytes long, are required for the encryption of a single block. A key expansion function is used to generate the other 10 keys from the initial user-provided key. The key expansion function for AES is quite simple and efficient; the details can be found in [10].

Figure 2.1: Illustration of AES operations

Algorithm 1 AES-128 encryption algorithm [11]

START
First do (zero round):
    AddRoundKey
Then nine times do:
    SubBytes
    ShiftRows
    MixColumns
    AddRoundKey
Finally do:
    SubBytes
    ShiftRows
    AddRoundKey
END

AES Implementation through T-tables

T-tables are meant to speed up AES encryption [12]. The T-table approach allows the computation of an entire AES round using only look-up tables and XOR operations. In contrast to the typical S-box implementation that employs 8x8 look-up tables, the T-table approach uses 8x32 look-up tables. The basic idea behind the T-table approach is to pre-compute the combined results of the SubBytes, ShiftRows and MixColumns operations and store them in tables that are generally referred to as T-tables. Given a plaintext $(m_0, m_1, \dots, m_{15})$, we may directly apply the SubBytes, ShiftRows and MixColumns operations using Equation 2.1 (byte indices are taken modulo 16):

$$\begin{pmatrix} p_{4j} \\ p_{4j+1} \\ p_{4j+2} \\ p_{4j+3} \end{pmatrix} = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix} \cdot \begin{pmatrix} S[m_{4j}] \\ S[m_{4j+5}] \\ S[m_{4j+10}] \\ S[m_{4j+15}] \end{pmatrix}, \qquad j = 0, \dots, 3 \qquad (2.1)$$

Equation 2.1 can be rewritten to yield Equation 2.2:

$$\begin{pmatrix} p_{4j} \\ p_{4j+1} \\ p_{4j+2} \\ p_{4j+3} \end{pmatrix} = \begin{pmatrix} 02 \cdot S[m_{4j}] \oplus 03 \cdot S[m_{4j+5}] \oplus 01 \cdot S[m_{4j+10}] \oplus 01 \cdot S[m_{4j+15}] \\ 01 \cdot S[m_{4j}] \oplus 02 \cdot S[m_{4j+5}] \oplus 03 \cdot S[m_{4j+10}] \oplus 01 \cdot S[m_{4j+15}] \\ 01 \cdot S[m_{4j}] \oplus 01 \cdot S[m_{4j+5}] \oplus 02 \cdot S[m_{4j+10}] \oplus 03 \cdot S[m_{4j+15}] \\ 03 \cdot S[m_{4j}] \oplus 01 \cdot S[m_{4j+5}] \oplus 01 \cdot S[m_{4j+10}] \oplus 02 \cdot S[m_{4j+15}] \end{pmatrix}, \qquad j = 0, \dots, 3 \qquad (2.2)$$

Equation 2.2 yields four T-tables, as shown in Equation 2.3:

$$T_0[a] = \begin{pmatrix} 02 \cdot S[a] \\ 01 \cdot S[a] \\ 01 \cdot S[a] \\ 03 \cdot S[a] \end{pmatrix}, \quad T_1[a] = \begin{pmatrix} 03 \cdot S[a] \\ 02 \cdot S[a] \\ 01 \cdot S[a] \\ 01 \cdot S[a] \end{pmatrix}, \quad T_2[a] = \begin{pmatrix} 01 \cdot S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \\ 01 \cdot S[a] \end{pmatrix}, \quad T_3[a] = \begin{pmatrix} 01 \cdot S[a] \\ 01 \cdot S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \end{pmatrix} \qquad (2.3)$$

From Equation 2.3, we may see that for each byte value a, these four T-tables can be pre-computed and stored in memory. One T-table requires $2^8 \times 32 = 8192$ bits = 1 KB and thus, the T-table approach consumes 4 times more memory than the typical S-box approach (S-box: $2^8$ entries of 8 bits each; T-table: $2^8$ entries of 32 bits each) [12]. Using T-tables, Equation 2.2 can be rewritten as:

$$\begin{pmatrix} p_{4j} \\ p_{4j+1} \\ p_{4j+2} \\ p_{4j+3} \end{pmatrix} = T_0[m_{4j}] \oplus T_1[m_{4j+5}] \oplus T_2[m_{4j+10}] \oplus T_3[m_{4j+15}], \qquad j = 0, \dots, 3 \qquad (2.4)$$

Thus, the output of an AES round can be computed by simply applying Equation 2.4 and then adding the corresponding round key. Hence, using T-tables, the overall encryption process is reduced to look-up and XOR operations.
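To make Equation 2.4 concrete, the following sketch (our own minimal example, not the thesis implementation) computes the four output columns of one inner AES round from pre-computed T-tables. The tables T0..T3 are assumed to have been filled from the S-box according to Equation 2.3 with a matching word-packing convention, and the indices are reduced modulo 16 to express the ShiftRows wrap-around.

```cuda
#include <stdint.h>

/* Pre-computed T-tables (Equation 2.3), filled elsewhere from the AES S-box. */
extern const uint32_t T0[256], T1[256], T2[256], T3[256];

/* One inner AES round: Equation 2.4 followed by AddRoundKey.
 * m:  the 16 state bytes entering the round (column-major order)
 * rk: the four 32-bit words of the current round key
 * p:  the four output columns, each packed into a 32-bit word                  */
static void aes_ttable_round(const uint8_t m[16], const uint32_t rk[4],
                             uint32_t p[4])
{
    for (int j = 0; j < 4; j++) {
        /* SubBytes + ShiftRows + MixColumns collapse into four look-ups + XORs */
        p[j] = T0[m[(4 * j)      % 16]]
             ^ T1[m[(4 * j + 5)  % 16]]
             ^ T2[m[(4 * j + 10) % 16]]
             ^ T3[m[(4 * j + 15) % 16]]
             ^ rk[j];                                      /* AddRoundKey */
    }
}
```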

2.2 Side-Channel Attacks

Side-channel attacks, as the name implies, refer to those attacks that exploit side-channel leakage information. Typically, side-channel information is the information retrieved from the physical properties of a cryptographic device that can be exploited to reveal the secret key. The power consumption of a cryptographic device is an example of side-channel information. Power analysis attacks, electromagnetic attacks and timing attacks are the most important types of side-channel attacks [1]. As the names imply, these attacks reveal the secret key by exploiting, respectively, the power consumption, the electromagnetic leaks and the execution time of a cryptographic device. In order to give a detailed overview of side-channel attacks and how they are used to reveal the secret key, we thoroughly explain power analysis attacks below.

Power Analysis Attacks

Generally, the power consumption of a cryptographic device depends on the data it processes and the operations it performs [1]. This dependency of the power consumption is commonly referred to as leakage information. Power analysis attacks exploit such leakage information in order to reveal the secret key. The two main categories of power analysis attacks are as follows.

Simple Power Analysis Attacks

Simple power analysis (SPA) is a technique that involves the direct interpretation of power consumption traces collected while the device performs cryptographic operations. SPA attacks exploit the key-dependent information in the traces. The goal of SPA attacks is to reveal the secret key when only a small number of power traces is available. In the worst case, the adversary tries to reveal the secret key directly from a single trace. This in turn makes SPA attacks quite challenging in practice, as the adversary needs to know detailed implementation characteristics of the cryptographic algorithm and complex statistical methods need to be deployed in order to extract the secret key.

Differential Power Analysis Attacks

In contrast to SPA attacks, differential power analysis (DPA) attacks typically require numerous power traces, but they do not require detailed knowledge about the cryptographic device under attack. In fact, knowledge about the underlying cryptographic algorithm alone is sufficient for these attacks. Moreover, an adversary can successfully reveal the secret key even with extremely noisy power traces. These facts rank DPA attacks among the most popular types of power analysis attacks. DPA attacks exploit the dependency of the power consumption on the processed data. Basically, DPA attacks use a large number of power traces to analyze the dependency of the power consumption on the processed data at a fixed moment in time. All DPA attacks usually employ a general attack strategy which consists of the following steps [1]:

Step 1: Selecting an Appropriate Intermediate Value of the Executed Algorithm

In the first step of a DPA attack, an adversary selects a particular part of the secret key, say k, as his attack target. Consequently, the adversary selects an intermediate result of the algorithm which depends on that particular sub-key k along with a known data value, say d. In other words, the intermediate result needs to be a function of both d and k, i.e., f(d, k). Let us take the example of a DPA attack on AES. Suppose an adversary wants to target the first byte of the AES secret key through a DPA attack. An appropriate choice for the intermediate value in this case would be the output of the first AES S-box, since this intermediate value is a function of the first bytes of both the data and the key.

Step 2: Collecting Power Traces

The second step is to collect the power traces of the cryptographic device under attack while it encrypts or decrypts D different data blocks. For each of these runs, the attacker needs to know the corresponding data value d that is used in the calculation of the intermediate value selected in step 1. Consequently, these data values can be organized in the form of a vector $\mathbf{d} = (d_1, d_2, \dots, d_D)$ where $d_i$ refers to the data value in the $i$-th run. Moreover, the power trace corresponding to data block $d_i$ is referred to as $\mathbf{t}_i = (t_{i,1}, \dots, t_{i,L})$ where L is the length of the power trace. The attacker collects a power trace for each of the D data blocks. Thus, all D power traces can be organized in the form of a matrix T of order D x L. The power traces should be perfectly aligned so that the power consumption values in each column of matrix T correspond to the same operation.

Step 3: Calculating Hypothetical Intermediate Values

The next step is to calculate the hypothetical intermediate value selected in step 1 for each possible value of the sub-key k. The possible values of the sub-key k can be written in the form of a vector $\mathbf{k} = (k_1, \dots, k_N)$, where N is the total number of possibilities for the sub-key k. Given the data vector $\mathbf{d}$ and the sub-key vector $\mathbf{k}$, an attacker can easily compute the intermediate values f(d, k) for each $d_i$ and for all possible sub-key values. This yields a matrix V of order D x N. The V matrix for the AES example considered in step 1 is shown below in Equation 2.5.

$$\mathbf{V} = \begin{pmatrix} S(d_1 \oplus k_1) & S(d_1 \oplus k_2) & \cdots & S(d_1 \oplus k_N) \\ S(d_2 \oplus k_1) & S(d_2 \oplus k_2) & \cdots & S(d_2 \oplus k_N) \\ \vdots & \vdots & \ddots & \vdots \\ S(d_D \oplus k_1) & S(d_D \oplus k_2) & \cdots & S(d_D \oplus k_N) \end{pmatrix} \qquad (2.5)$$

Column i of matrix V contains the intermediate values calculated with the sub-key value $k_i$. Since each column of matrix V contains intermediate values calculated using a specific sub-key value, one column of V contains those intermediate values that were actually generated in the device during the D encryption or decryption runs. An attacker has to find that specific column, because the corresponding sub-key value is then immediately revealed.

Step 4: Leakage Modeling

The attacker has to compare the matrix V with the matrix T in order to reveal the target sub-key k. However, the matrix T contains actual power consumption values while the matrix V contains hypothetical intermediate values (such as the S-box output for the AES example under consideration) and therefore, a direct comparison of V with T is not feasible. In order to perform the comparison, a power consumption value should be predicted for each intermediate value present in the matrix V. In other words, the intermediate values in the matrix V should be mapped to hypothetical power consumption values. Various models have been presented in the literature that map intermediate values to corresponding power consumption values. The Hamming-weight model is among the most widely deployed models for DPA attacks [1]. The Hamming-weight model assumes that the power consumption is proportional to the number of bits that are set (bit value = 1) in the processed data value. Let us call the resultant matrix of this mapping H. Matrix H has the same order as matrix V, i.e., D x N. Applying the Hamming-weight model, the resultant matrix H for the AES example under consideration is given in Equation 2.6.

$$\mathbf{H} = \begin{pmatrix} HW[S(d_1 \oplus k_1)] & HW[S(d_1 \oplus k_2)] & \cdots & HW[S(d_1 \oplus k_N)] \\ HW[S(d_2 \oplus k_1)] & HW[S(d_2 \oplus k_2)] & \cdots & HW[S(d_2 \oplus k_N)] \\ \vdots & \vdots & \ddots & \vdots \\ HW[S(d_D \oplus k_1)] & HW[S(d_D \oplus k_2)] & \cdots & HW[S(d_D \oplus k_N)] \end{pmatrix} \qquad (2.6)$$
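As a sketch of how steps 3 and 4 translate into code, the routine below fills the hypothesis matrix V of Equation 2.5 and the Hamming-weight matrix H of Equation 2.6 for a single targeted key byte of AES-128 (so N = 256). The S-box array and the targeted plaintext bytes are assumed to be supplied by the surrounding attack code, and the function and variable names are ours, not taken from the thesis.

```cuda
#include <stdint.h>

extern const uint8_t sbox[256];               /* AES S-box, defined elsewhere */

/* Hamming weight of a byte: the number of bits set to 1. */
static int hamming_weight(uint8_t x)
{
    int w = 0;
    while (x) { w += x & 1; x >>= 1; }
    return w;
}

/* Steps 3 and 4 for one targeted key byte (N = 256 sub-key candidates):
 *   V[i][k] = S(d_i XOR k)          (Equation 2.5)
 *   H[i][k] = HW(S(d_i XOR k))      (Equation 2.6)
 * d[i] is the targeted plaintext byte of the i-th of the D encryption runs.   */
static void build_hypotheses(const uint8_t *d, int D,
                             uint8_t (*V)[256], int (*H)[256])
{
    for (int i = 0; i < D; i++) {
        for (int k = 0; k < 256; k++) {
            V[i][k] = sbox[d[i] ^ (uint8_t)k];
            H[i][k] = hamming_weight(V[i][k]);
        }
    }
}
```

In step 5, described next, each column of H is then correlated with each column of the measured trace matrix T.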

Step 5: Comparing the Predicted Leakage with the Actual Power Traces

The last step of a DPA attack is to compare the predicted power leakage values with the ones acquired in the actual power traces, i.e., each column $h_i$ of matrix H is compared with each column $t_j$ of matrix T. This in turn means that the adversary compares the predicted power consumption values of each key hypothesis with the recorded traces at every position, which results in a matrix R of order N x L. Various statistical methods such as the correlation coefficient, difference of means and distance of means can be employed for the comparison. Generally, these statistical methods share the common property that a better match between the columns $h_i$ and $t_j$ corresponds to a higher resulting value $r_{i,j}$ in the matrix R. Thus, the actual sub-key value k can be revealed by simply analyzing the resultant matrix R.

For the AES example under consideration, suppose the attacker decides to apply the correlation coefficient method in order to determine the relationship between the columns $h_i$ (where $i = 1, \dots, N$) and $t_j$ (where $j = 1, \dots, L$) of the matrices H and T respectively. Based on the D elements of the columns $h_i$ and $t_j$, each value $r_{i,j}$ of the resultant matrix R is estimated using Equation 2.7, where $\bar{h}_i$ and $\bar{t}_j$ refer to the mean values of the columns $h_i$ and $t_j$ respectively. The attacker may then reveal the correct sub-key k by searching for the row index that corresponds to the maximum absolute value of the correlation coefficient.

$$r_{i,j} = \frac{\sum_{d=1}^{D} (h_{d,i} - \bar{h}_i) \cdot (t_{d,j} - \bar{t}_j)}{\sqrt{\sum_{d=1}^{D} (h_{d,i} - \bar{h}_i)^2 \cdot \sum_{d=1}^{D} (t_{d,j} - \bar{t}_j)^2}} \qquad (2.7)$$

2.3 General Purpose Computing on Graphics Processing Units

The notion of general purpose computing on graphics processing units (GPGPU) is to employ the immense parallel processing power of GPUs in order to enhance the performance of applications. For the last two decades, graphics manufacturers have been focusing on producing fast GPUs, specifically for the gaming industry. This has led to powerful GPU devices which can immensely boost the performance of an application. Hence, a research community has been established whose main aim is to employ the massive processing power of GPUs for general purpose computing.

Figure 2.2 illustrates the basic difference between the architectures of a typical Central Processing Unit (CPU) and a GPU.

In contrast to a CPU, which is typically composed of a few cores, a GPU consists of hundreds or even thousands of smaller cores. These cores allow a GPU to handle many tasks simultaneously. CPUs are designed to minimize latency (i.e., the amount of time required to complete a task) and they are referred to as low latency, low throughput processors, while GPUs are designed to maximize throughput (i.e., the amount of work done in a given amount of time) and they are referred to as high latency, high throughput processors [13]. Furthermore, CPUs use the notion of task parallelism, which means multiple tasks are mapped to multiple threads and therefore, tasks run different instructions. In contrast, GPUs employ data parallelism through the Single Instruction Multiple Data (SIMD) model. The SIMD model, as the name implies, allows executing the same instruction on different data. GPUs are good at launching and executing hundreds of thousands of threads in parallel; however, in contrast to CPU threads, GPU threads are extremely lightweight and GPUs require hundreds of thousands of threads to achieve full efficiency [14].

Figure 2.2: CPU architecture vs GPU architecture

Various platforms such as the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL) are available to develop applications on a GPU. CUDA is meant for Nvidia GPUs while OpenCL is more generic and can natively talk to a large number of devices [17]. Since we had Nvidia hardware and OpenCL turns out to be around 10 percent slower than CUDA on Nvidia hardware [15], we selected the CUDA platform. Let us briefly discuss programming GPUs using the CUDA platform.

Programming GPUs using the CUDA platform

The CUDA platform allows a programmer to utilize the processing powers of both a CPU and a GPU in his task.

In terms of CUDA, the CPU is commonly referred to as the host while the GPU is commonly referred to as the device. From a programmer's perspective, a GPU is viewed as a compute device that is a co-processor to the CPU, efficiently launches a lot of threads, executes a lot of threads in parallel and has its own memory [16].

CUDA Models

In order to efficiently utilize the resources of a GPU for general purpose computing, CUDA defines a programming model and a memory model. These models are explained below.

Programming Model: In general, the massive computing power of a GPU relies on its inherently parallel architecture. For this, the CUDA framework introduces the smallest unit of parallelism, referred to as a thread. It is worth mentioning here that, in contrast to CPU threads, GPU threads are meant to be extremely lightweight and a large number of threads is required to achieve effective GPU performance. A GPU physically executes a group of threads in parallel, which is referred to as a warp. On all Nvidia GPUs today, a warp has 32 threads [14]. All threads in a warp execute in a single instruction multiple data (SIMD) fashion, i.e., all threads in a warp execute the same instruction at the same time. In case one or more threads within a warp need to execute different instructions, e.g., in case of a data-dependent branch, the hardware automatically serializes the execution of the threads, and such threads are called divergent threads. Since there are 32 threads in a warp, we may have at most 32-way divergence in the worst case. As the next level of parallelism, threads are organized in thread blocks. Only threads within a thread block can communicate with each other and synchronize their execution. The number of threads per thread block is limited by the hardware. The thread blocks are finally organized in the form of a grid. All a programmer needs to do is to: a) write the program that he wants to deploy on the GPU (generally referred to as a kernel), b) specify the grid dimension, i.e., the number of thread blocks within the grid, c) specify the block dimension, i.e., the number of threads within a thread block, and d) invoke the kernel. The kernel is then executed by the specified grid of thread blocks. The aforementioned discussion is summarized in Figure 2.3.

Memory Model: CUDA implements a hierarchical memory model in order to yield optimized performance. The host (CPU) and the device (GPU) have their own separate physical memories. CUDA threads cannot directly access host memory and therefore, the input data should first be moved to GPU memory.
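The following minimal CUDA sketch summarizes the workflow just described: allocate device memory, copy the input over, launch a kernel over a grid of thread blocks, and copy the result back. The kernel and buffer names are ours and purely illustrative, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

/* Illustrative kernel: each thread squares one element of the input array. */
__global__ void square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)
        out[i] = in[i] * in[i];
}

void run_on_gpu(const float *host_in, float *host_out, int n)
{
    float *dev_in, *dev_out;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&dev_in,  bytes);            /* allocate device memory   */
    cudaMalloc((void **)&dev_out, bytes);
    cudaMemcpy(dev_in, host_in, bytes, cudaMemcpyHostToDevice);

    int blockDimX = 256;                             /* threads per thread block */
    int gridDimX  = (n + blockDimX - 1) / blockDimX; /* thread blocks per grid   */
    square<<<gridDimX, blockDimX>>>(dev_in, dev_out, n);

    cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
}
```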

Figure 2.3: CUDA programming model [13]

CUDA provides optimized functions to transfer data between these two memory spaces. The CUDA memory model is illustrated in Figure 2.4. Each thread has its own local registers which provide extremely fast access. Apart from registers, each thread also has its private so-called local memory, which resides in off-chip device memory. As mentioned before, only threads within a thread block can communicate with each other, and they do so via shared memory. Thus, threads within a thread block can access shared memory. Shared memory is organized in equally sized memory modules that are commonly referred to as banks, and they can be accessed in parallel. Shared memory is an on-chip memory, i.e., each SM contains a specific amount of shared memory depending on the compute capability (CC) of the device (see Figure 2.6). Global memory is both a readable and writeable memory space and it resides in off-chip device memory. It is the slowest of all the memories but it is accessible to all threads. Apart from the aforementioned memories, there are two other read-only memory spaces, i.e., constant memory and texture memory. Just like global memory, both of them also reside in device memory. It should be noted that constant and texture memory are the only memories that are cached. Constant memory works best for one-dimensional locality of accesses while texture memory is best suited for two-dimensional data (arrays).

Figure 2.4: CUDA memory model [13]

From the above hierarchical model, we may clearly infer that registers are faster than shared memory, which in turn is much faster than global memory.

Hardware View of a GPU

A CUDA-capable GPU consists of a number of Streaming Multiprocessors (SMs) and each SM consists of a number of CUDA cores, on-chip memory, i.e., shared memory (SMEM), and caches, as shown in Figure 2.5. Nvidia has categorized their GPUs into a number of compute capabilities. Hardware-specific features such as the number of SMs, the number of CUDA cores and the maximum amount of shared memory per SM vary per compute capability. The table in Figure 2.6 illustrates a few technical specifications corresponding to different compute capabilities. The complete table containing all technical specifications per compute capability is available at [28]. It should be noted that a single SM may run more than one thread block, but one thread block cannot be executed on more than one SM. Furthermore, a programmer is only responsible for defining thread blocks in software while the GPU is responsible for allocating thread blocks to SMs. This in turn means that one cannot say anything about the allocation of thread blocks to SMs. However, CUDA guarantees that all threads in a thread block run on the same SM at the same time and that all thread blocks in a kernel finish execution before any thread block from the next kernel launches.

Figure 2.5: Simple hardware view of a GPU [13]

Figure 2.6: Technical Specifications per Compute Capability [28]
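As a side note, the hardware parameters discussed above can be queried at run time through the CUDA runtime API; the short sketch below (our own, with error handling omitted) prints the compute capability, SM count and per-block shared memory of device 0.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            /* properties of device 0 */

    printf("Device:             %s\n",   prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("SM count:           %d\n",   prop.multiProcessorCount);
    printf("Shared mem/block:   %zu bytes\n", prop.sharedMemPerBlock);
    printf("Warp size:          %d threads\n", prop.warpSize);
    return 0;
}
```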

Design Criteria for GPGPU

In order to achieve a performance boost using CUDA, an algorithm should take maximum advantage of the immense parallel processing hardware of the GPU. Moreover, the algorithm should be carefully deployed by considering the memory hierarchy presented above. We briefly present a few points that should be followed in order to get the most out of a GPU's hardware. These points are taken from the CUDA programming guide [18] and a lecture delivered by Mark Harris of Nvidia [19].

A. Maximize use of available processing power

A1 Maximize independent parallelism: In order to get the most out of the underlying parallel processing hardware of a GPU, there should be more computation and less communication between threads [19].

A2 Optimize resource usage: The algorithm should utilize resources such as the number of registers per thread and shared memory in an optimized way. This in turn allows the GPU to concurrently execute as many threads as possible, which yields an immense boost in the performance of the algorithm.

A3 Maximize arithmetic intensity: GPUs are good at performing computations. Modern GPUs can perform around 3 trillion computations per second. Thus, the math/memory ratio should be increased for each thread. In other words, a thread should spend most of its time on computations rather than memory accesses. For example, values obtained from arithmetic instructions should be recomputed instead of being saved for later use.

A4 Avoid thread divergence: As mentioned before, threads within a thread block are executed in groups of 32 threads which are referred to as warps. Warps are executed in a SIMD fashion. Thus, if one or more threads within a warp need to execute a different instruction, the hardware automatically serializes the execution of the threads; this process is referred to as thread divergence. Since a warp consists of 32 threads, we may have at most 32-way thread divergence. Thread divergence should be avoided to the maximum possible extent as it may severely affect performance.

B. Maximize use of available memory bandwidth

B1 Minimize data transfers between host and device: The transfer of data between host and device is one of the most expensive operations and therefore, it should be avoided to the maximum possible extent. This can be achieved by doing more computations on the device. Moreover, one large transfer is much better than several small transfers.

B2 Careful utilization of the memory model: As CUDA implements a hierarchical memory model, the application should be deployed in such a way that it efficiently utilizes the memory hierarchy. For instance, as global memory is the slowest of all the memories, it should be avoided by utilizing the alternative fast memories like registers or shared memory. Moreover, instead of global memory, constant or texture memory should be employed for constants.

B3 Coalesce global memory accesses: The access pattern for global memory should be chosen in a way that allows successive threads to read or write adjacent locations in a contiguous stretch of memory. Such an access pattern is referred to as coalesced memory access. Coalesced memory access is considered a very crucial optimization strategy as it greatly impacts performance.

B4 Avoid bank conflicts: As mentioned before, shared memory is an on-chip memory and it is organized in equally sized memory modules that are commonly referred to as banks. A shared memory request for a warp is split into two requests, one for the first half-warp and the other for the second half-warp. A bank conflict occurs when a shared memory request accesses different values from the same bank [19]. Bank conflicts degrade performance and therefore, they should be avoided. It should be noted that bank conflicts depend on the GPU's hardware and thus, a bank conflict for a GPU with a particular compute capability may not be a bank conflict for devices with other compute capabilities.
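To illustrate points A4 and B3, the toy kernels below contrast a divergent, strided access pattern with a uniform, coalesced one. They are purely illustrative and not taken from the thesis.

```cuda
/* Anti-pattern: the even/odd branch splits every warp into two serialized
 * paths (thread divergence, A4) and the stride-2 index breaks coalescing (B3). */
__global__ void bad_pattern(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            out[i] = in[(2 * i) % n] * 2.0f;
        else
            out[i] = in[(2 * i) % n] + 1.0f;
    }
}

/* Preferred pattern: all threads of a warp follow the same path and
 * consecutive threads touch consecutive addresses, so global accesses coalesce. */
__global__ void good_pattern(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}
```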

Chapter 3

Key Enumeration

The key recovery mechanism for side-channel attacks consists of two main components, i.e., key enumeration and key validation. The divide phase of a side-channel attack yields information about different parts of the cryptographic key; for instance, in case of a DPA attack on AES-128, the divide phase yields 16 ranking lists of sub-key candidates, one list for each byte of the secret key. The lists are typically ranked from the most likely sub-key candidate to the least likely one. In the conquer phase, the attacker combines the gathered information in an efficient way in order to reveal the full key. Ideally, the most likely sub-key candidate from each list yields the correct full key; however, this may not be the case and therefore, an attacker needs to validate other full-key combinations from the ranking lists. The trivial approach is to form all possible combinations from the lists, sort them in the order of their likelihood and then validate them. However, this is practically not feasible and therefore, some smart way is required to generate full keys from the ranking lists. This issue is referred to as the key enumeration problem.

Key enumeration is a very crucial part of the entire key recovery mechanism. It directly takes the output of the divide phase of a side-channel attack, such as a DPA attack, and enumerates full keys which are then exploited in the key validation step to reveal the correct key. In this chapter, we thoroughly present three key enumeration algorithms:

- Optimal Key Enumeration Algorithm (OKEA) [2]
- Score based Key Enumeration Algorithm (SKEA) [6]
- Trivial Key Enumeration Algorithm (TKEA)

Optimality is one of the most important properties of a key enumeration algorithm. A key enumeration algorithm is said to be optimal if it enumerates keys in the optimal order (i.e., outputs keys in decreasing order of their likelihood). Based on this optimality property, the aforementioned three algorithms, i.e., OKEA, SKEA and TKEA, are said to be optimal, sub-optimal and non-optimal respectively. These algorithms are comprehensively described in the next sections.

3.1 Optimal Key Enumeration Algorithm

The Optimal Key Enumeration Algorithm (OKEA), as the name implies, strictly follows the optimal order while enumerating keys. OKEA was proposed by Veyrat-Charvillon et al. in 2012. This algorithm requires a probability-based ranking of the sub-key lists. Side-channel attacks that involve a training phase are typically referred to as profiled attacks; template attacks are the best examples of this category [20]. Profiled attacks employ a probabilistic model in their training phase in order to characterize the target cryptographic device. This in turn allows these attacks to output sub-keys according to their actual probabilities. Thus, the OKEA algorithm is directly applicable to profiled side-channel attacks. On the other hand, in case of non-profiled side-channel attacks such as DPA attacks, the sub-keys are not ranked based on their probabilities; instead, they are ranked based on a score produced by statistical distinguishers such as the correlation coefficient or the difference of means. Thus, an intermediate step is required to mold the score-based ranking of the lists generated by non-profiled side-channel attacks into a corresponding probability-based ranking. This step is referred to as the Bayesian extension and is explained below.

Bayesian Extension of Non-Profiled Side-Channel Attacks

In [2], Veyrat-Charvillon et al. proposed a Bayesian extension of non-profiled side-channel attacks. The main purpose of the Bayesian extension of non-profiled side-channel attacks is to transform the default score-based ranking of the sub-key lists into an equivalent probability-based ranking.

The general algorithm for the Bayesian extension of non-profiled side-channel attacks is presented in Algorithm 2.

The first 7 steps represent the general attack strategy of non-profiled side-channel attacks. Although these 7 steps are thoroughly explained for DPA attacks (DPA attacks are non-profiled side-channel attacks) in the previous chapter, here we briefly discuss these steps in the context of the Bayesian extension. We choose p to denote the byte of the plaintext, k to denote the byte of the secret key under attack (i.e., the sub-key) and y to represent the output of the S-box, i.e., $y = S(p \oplus k)$. The goal of the attacker is to reveal the best sub-key candidate $\hat{k}$ from the sub-key space $\mathcal{K}$, using q measured encryptions. In the first step, for each encryption run $1 \le i \le q$, the attacker acquires a data set of pairs $(p_i, l_i)$ where $p_i$ is the $i$-th plaintext byte involved in the computation of the target sub-key and $l_i$ is the corresponding leakage value. For each sub-key k from the sub-key space $\mathcal{K}$, the attacker then calculates hypothetical values and maps them into estimated leakage values by employing a leakage model. The attacker then calculates the error vector (the difference between the estimated leakage values and the original ones) and its standard deviation. The equation presented in step 9 is then used to transform the standard deviation value $\sigma_k$ of each sub-key candidate k into the corresponding probability value [2]. To summarize, in order to apply the Bayesian extension, all we need is a standard deviation value corresponding to each sub-key candidate; employing the equation presented in step 9 of the algorithm, we may then transform it into an equivalent probability value [2].

In order to examine the effectiveness of the Bayesian extension, we have applied it on a list obtained by a correlation coefficient based DPA attack on AES-128. For each sub-key candidate k, Equation 3.1 is employed to obtain the standard deviation value $\sigma_k$ from the corresponding correlation value $\rho_k$. The relation presented in Equation 3.1 is derived using section 5 of [21]. The standard deviation value $\sigma_k$ is then used in the equation presented in step 9 of the Bayesian algorithm to obtain the equivalent probability value.

$$\sigma_k = \sqrt{1 - \rho_k^2} \qquad (3.1)$$

We have ranked the sub-key candidates of the list based on both the correlation values and the equivalent probability values. Table 3.1 illustrates both rankings. From the table, we can clearly see that the Bayesian extension ranks the sub-key candidates in the same order as the default correlation values. Moreover, the probability-based ranking of sub-key candidates has an immense advantage over other criteria, i.e., the combination of sub-key candidates becomes very natural as the probability of the combined key is equal to the product of the individual probabilities of the constituting sub-keys.
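A compact sketch of this transformation is given below: it converts the correlation score of each sub-key candidate into sigma_k via Equation 3.1 and then into the normalized probability of step 9 of Algorithm 2 (shown below). The code is our own illustration; a practical implementation would work with logarithms to avoid overflow of sigma_k^(-q) for large q.

```cuda
#include <math.h>

/* Convert DPA correlation scores rho[0..N-1] into posterior probabilities
 * prob[0..N-1] following Equation 3.1 and step 9 of Algorithm 2:
 *   sigma_k = sqrt(1 - rho_k^2),  Pr[k] = sigma_k^(-q) / sum_j sigma_j^(-q)   */
static void bayesian_extension(const double *rho, double *prob, int N, int q)
{
    double sum = 0.0;
    for (int k = 0; k < N; k++) {
        double sigma = sqrt(1.0 - rho[k] * rho[k]);
        prob[k] = pow(sigma, -(double)q);     /* un-normalized score sigma^{-q} */
        sum += prob[k];
    }
    for (int k = 0; k < N; k++)
        prob[k] /= sum;                       /* normalize to a probability     */
}
```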

Algorithm 2 Bayesian extension of non-profiled side-channel attacks [2]

START
 1: Acquire $\{(p_i, l_i)\}_{1 \le i \le q}$
 2: for $k \in \mathcal{K}$ do
 3:     Compute the S-box output hypotheses $y_{i,k} = S(p_i \oplus k)$
 4:     Estimate the leakage values corresponding to the hypothetical values by employing a leakage model $\theta_k$, i.e., $\theta_k(y_{i,k})$
 5:     Compute the error vector $e_k$: $e_{i,k} = l_i - \theta_k(y_{i,k})$
 6:     Evaluate the standard deviation of the error vector: $\sigma_k = \mathrm{StdDev}(e_k)$
 7: end for
 8: for $k \in \mathcal{K}$ do
 9:     Transform the standard deviation value into an equivalent probability value using $\Pr[k] = \dfrac{\sigma_k^{-q}}{\sum_{k^* \in \mathcal{K}} \sigma_{k^*}^{-q}}$
10: end for
END

Table 3.1: Correlation based ranking vs Probability based ranking (columns: sub-key candidate, correlation based ranking, probability based ranking)

Description of Enumeration Process of OKEA

In order to understand how OKEA enumerates keys, let us first consider a simple bi-dimensional case, i.e., merging 2 sub-key lists having 4 candidates each. Both of these lists are supposed to be ranked according to the decreasing order of their probabilities; therefore, the key space can geometrically be visualized as a compartmentalized square of side length 1. The enumeration process of OKEA is depicted in Figure 3.1. The 4 rows (vertical axis) represent the 4 sub-key candidates of the first list while the 4 columns (horizontal axis) correspond to the 4 sub-key candidates of the second list. Height and width correspond to the probability of the corresponding sub-key candidate. Let $k_i^{(j)}$ denote the $j$-th likeliest sub-key candidate of the $i$-th sub-key. Thus, the intersection of a row, say $j_1$, and a column, say $j_2$, forms a full key $(k_1^{(j_1)}, k_2^{(j_2)})$ whose probability is equal to the area of the resultant rectangle.

turn means that the probability of the full key is equal to the product of the probabilities of the corresponding two sub-keys. Using this geometrical visualization, OKEA outputs rectangles in the decreasing order of their areas. The full enumeration process of OKEA, as illustrated in Figure 3.1, consists of the following two steps:

Step 1: From Figure 3.1, we may see that the most likely key is (k_1^(1), k_2^(1)), thus OKEA outputs this key first. In the figure, it is highlighted with dark gray color and marked as number 1. The next possible key can only be one of its successors, i.e., either (k_1^(2), k_2^(1)) or (k_1^(1), k_2^(2)). These successors are highlighted with light gray color in Figure 3.1. OKEA stores these potential next candidates in a set called the frontier set, denoted by F.

Figure 3.1: Geometric visualization of OKEA [2]

Step 2: OKEA outputs keys only from the frontier set and therefore, each key candidate has to belong to this set. OKEA outputs the most likely candidate from the frontier set and updates the frontier set with the successors of the outputted key candidate. This step is repeated until the correct key is generated or the memory required by the frontier set exceeds the available memory space. From Figure 3.1, we may see that (k_1^(2), k_2^(1)) is a more likely candidate than (k_1^(1), k_2^(2)); therefore, OKEA outputs this key and updates the frontier set with its successors. (k_1^(3), k_2^(1)) and (k_1^(2), k_2^(2)) are the corresponding two successors of (k_1^(2), k_2^(1)); therefore, these successors need to be added into the

frontier set. However, since the frontier set already contains the key candidate (k_1^(1), k_2^(2)), which has a higher probability than (k_1^(2), k_2^(2)), there is no need to store this new successor (k_1^(2), k_2^(2)) in the frontier set. The rule which handles these cases is very simple and greatly minimizes the memory requirements of OKEA. The rule is defined as follows:

Rule 1: The frontier set F may contain at most one element from each column and row. [2]

Algorithm 3 illustrates the enumeration process of OKEA.

Algorithm 3 Optimal key enumeration algorithm [2]
START
sizeList1 ← number of sub-key candidates in the first list (#k_1)
sizeList2 ← number of sub-key candidates in the second list (#k_2)
Initialize the frontier set with the most likely key candidate, i.e., F ← {(k_1^(1), k_2^(1))}
while (F ≠ ∅) and (size(F) ≤ availableMemory) do
  Pick the most likely key candidate from F, say (k_1^(i), k_2^(j))
  Output (k_1^(i), k_2^(j))
  Remove (k_1^(i), k_2^(j)) from F, i.e., F ← F \ {(k_1^(i), k_2^(j))}
  Following Rule 1, update F by adding the successors of the outputted key as:
  if (i + 1) ≤ sizeList1 and F does not contain an element from row i+1 then
    F ← F ∪ {(k_1^(i+1), k_2^(j))}
  end if
  if (j + 1) ≤ sizeList2 and F does not contain an element from column j+1 then
    F ← F ∪ {(k_1^(i), k_2^(j+1))}
  end if
end while
END

Finding the most likely element in the set, and insertion and deletion of elements, are the three main operations that need to be performed on the frontier set F. If the candidate keys are stored in an ordered structure, these operations can be performed efficiently; with balanced trees (such as a heap), these manipulations can be performed with logarithmic complexity. Moreover, arrays of Boolean values can be used to validate the tests of Rule 1.
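The following self-contained C sketch runs Algorithm 3 on the bi-dimensional case of Figure 3.1, using a plain array with a linear scan in place of a balanced tree and Boolean arrays for the tests of Rule 1; the probability values are made up for illustration and the output routine is a simple printf.

#include <stdio.h>

#define N1 4
#define N2 4

typedef struct { int i, j; } cell_t;              /* indexes into list 1 and list 2 */

int main(void)
{
    /* example probabilities, already sorted in decreasing order (illustrative) */
    double p1[N1] = { 0.40, 0.30, 0.20, 0.10 };
    double p2[N2] = { 0.50, 0.25, 0.15, 0.10 };

    cell_t frontier[N1 + N2];                     /* Rule 1 bounds the frontier size */
    int    fsize = 0;
    int    rowUsed[N1] = {0}, colUsed[N2] = {0};  /* Boolean arrays for Rule 1       */

    frontier[fsize++] = (cell_t){0, 0};           /* most likely key (k1^(1), k2^(1))*/
    rowUsed[0] = colUsed[0] = 1;

    while (fsize > 0) {
        /* pick the most likely element of the frontier set (linear scan) */
        int best = 0;
        for (int t = 1; t < fsize; t++)
            if (p1[frontier[t].i] * p2[frontier[t].j] >
                p1[frontier[best].i] * p2[frontier[best].j])
                best = t;
        cell_t c = frontier[best];
        frontier[best] = frontier[--fsize];       /* remove it from the frontier     */
        rowUsed[c.i] = colUsed[c.j] = 0;

        printf("output (%d,%d) with probability %.4f\n",
               c.i, c.j, p1[c.i] * p2[c.j]);

        /* insert the two successors, respecting Rule 1 */
        if (c.i + 1 < N1 && !rowUsed[c.i + 1]) {
            frontier[fsize++] = (cell_t){c.i + 1, c.j};
            rowUsed[c.i + 1] = colUsed[c.j] = 1;
        }
        if (c.j + 1 < N2 && !colUsed[c.j + 1]) {
            frontier[fsize++] = (cell_t){c.i, c.j + 1};
            rowUsed[c.i] = colUsed[c.j + 1] = 1;
        }
    }
    return 0;
}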

Merging multiple lists

In practice, one often has to merge multiple lists; for instance, in the case of a DPA attack on AES-128, one has to merge 16 lists. The authors suggest merging multiple lists by merging two lists at a time [2]. Merging two lists yields a larger sub-key list, and these larger lists are in turn merged together. This way, n lists are merged by merging two lists n − 1 times. This process requires that the lists to be merged together are of similar sizes. If we consider this simple approach, then in the case of AES-128, taking two lists at a time and hypothetically applying OKEA to the initial 16 lists of 1-byte sub-keys generates 8 lists of 2-byte sub-keys, which are then merged together to yield 4 lists of 4-byte sub-keys, which in turn are merged together to generate 2 lists of 8-byte sub-keys, which are finally merged together to yield the required list of 16-byte keys. However, this approach is practically not feasible as, for instance, the last step requires merging two lists of size 2^64. Such large lists cannot be generated or stored efficiently. Therefore, the authors suggested a recursive decomposition of the problem. The main idea of recursive decomposition is to generate the lists only as far as required by key enumeration. The concept behind recursive decomposition is that a new sub-key at a particular level is obtained by applying the enumeration algorithm to the sub-key lists at the lower level. For instance, an 8-byte sub-key is generated by merging two 4-byte lists, where each 4-byte list is in turn obtained by merging two 2-byte lists, and so on. The recursive decomposition minimizes storage and enumeration efforts. Figure 3.2 shows the recursive decomposition approach. In order to obtain a 16-byte full key, we know that the 16-byte key can be generated by applying the enumeration algorithm on two 8-byte lists; thus we start by checking the frontier set of the 8-byte level. However, if the frontier set of the 8-byte level does not contain the key, then we first generate the 8-byte sub-key by checking the frontier set of the corresponding lower level, i.e., the 4-byte level, and this process is repeated until we reach the original lists of 1-byte sub-keys. This recursive decomposition approach, together with Rule 1, keeps the memory consumption and computations to a minimum and thus allows us to enumerate a large number of keys.

Figure 3.2: Recursive decomposition approach to merge multiple lists [2]

3.2 Score based Key Enumeration Algorithm

The Score based Key Enumeration Algorithm (SKEA) was recently proposed by Marc Witteman, CTO, Riscure [6]. Unlike OKEA, this algorithm does not require a probability based ranking of the sub-key lists; instead it requires a discrete score based ranking. Therefore, the first step of SKEA is to convert the default ranking of sub-key candidates provided by a side-channel attack into the corresponding discrete score based ranking. Provided with a discrete score based ranking of the sub-key lists, SKEA then enumerates full keys in the decreasing order of their cumulative scores. Because of this conversion step, SKEA does not follow optimality as strictly as OKEA does. For this reason, we may say that SKEA is a sub-optimal algorithm.

3.2.1 Obtaining Discrete Score based Ranking

Although the problem of effectively converting the default ranking into a discrete score based ranking is still under research, in this thesis we have used a simple product based conversion methodology. According to this methodology, the default ranking values of the sub-key candidates (probabilities in the case of profiled side-channel attacks, or heuristic scores in the case of non-profiled side-channel attacks) are simply multiplied by a certain constant and the results are then rounded off to get the corresponding discrete scores.
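As an illustration, the product-based conversion for a correlation-based list can be written in a couple of lines of C; the factor of 100 matches the choice made below, and the function and variable names are illustrative.

#include <math.h>

/* rho[]: correlation values in [-1, 1]; scores[]: resulting discrete scores */
void to_discrete_scores(const double rho[256], int scores[256], int factor)
{
    for (int k = 0; k < 256; k++)
        scores[k] = (int)lround(rho[k] * factor);   /* e.g. 0.3451 * 100 -> 35 */
}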

In order to demonstrate the process of converting the default ranking of sub-key candidates into a discrete score based ranking using this simple product based conversion methodology, let us take the example of a DPA attack using the correlation coefficient as a statistical distinguisher. The resulting correlation coefficient scores of the sub-key candidates fall in the range [−1, 1]. To convert these scores into corresponding discrete scores, we multiply the scores by 100 and then round off the resulting values to the nearest integer values. The round-off step of this methodology maps distinct default ranking values onto a single discrete score, which in turn affects the optimality of the algorithm. Table 3.2 demonstrates the effect of this conversion, where we have compared the ranking of a sub-key list based on the default correlation values and the ranking based on discrete scores. Further discussion about the optimality of the algorithm can be found in the next chapter, where we present the detailed comparison results of the three enumeration algorithms under consideration.

Table 3.2: Correlation based ranking vs Discrete score based ranking. Columns: sub-key candidate, correlation based ranking, discrete score based ranking.

3.2.2 Description of Enumeration Process of SKEA

In order to understand how SKEA enumerates keys, let us take the example of merging four sub-key lists, each having 4 sub-key candidates. The lists are ranked in the decreasing order of the discrete scores of the corresponding sub-keys. Figure 3.3 illustrates the lists.

Figure 3.3: Four sub-key lists ranked in decreasing order of discrete score [6]

SKEA outputs keys in the decreasing order of their cumulative score. For instance, from Figure 3.3, we may inspect that the best cumulative score

is 5 + 5 + 4 + 5 = 19 and therefore, the best full key is the one constituted by the sub-keys that yield this cumulative score. Thus, all we need is to enumerate the keys in decreasing order of their cumulative scores, i.e., finding keys having a cumulative score of 19, then keys with a cumulative score of 18, then 17 and so on, as depicted in Figure 3.4.

Figure 3.4: Full keys with decreasing cumulative scores are highlighted in gray color [6]

In order to efficiently find the score paths, SKEA proposes three main steps. The steps are explained in detail below. It should be noted that the first two steps can be executed and their output stored beforehand, without actually starting the enumeration process of step 3.

Step 1: Compute arrays for cumulative running maximum and minimum scores

The first step is to compute and store arrays of cumulative running maximum and cumulative running minimum scores. Since the sub-key lists are supposed to be ranked in the decreasing order of their respective discrete scores, in order to compute the array of cumulative running maximum scores, we inspect the first (top) element of each sub-key list. The cumulative running maximum is computed as a running sum from right to left. For instance, in order to compute the array of cumulative running maximums for the example under consideration, we inspect the top element of each sub-key list, as highlighted in dark gray color in Figure 3.5 (a). From right to left, we may calculate the running sum as: 5, 4 + 5 = 9, 5 + 9 = 14 and 5 + 14 = 19. This cumulative running maximum array is presented in Figure 3.5 (a).

The same process is repeated by taking the bottom element of each sub-key list to compute the array of cumulative running minimum scores, as depicted in Figure 3.5 (b). The algorithm for computing the arrays of cumulative running maximum scores and cumulative running minimum scores is presented in Algorithm 4.

Figure 3.5: Computing arrays for (a) cumulative running maximum scores and (b) cumulative running minimum scores [6]

Algorithm 4 Computing arrays for cumulative running maximum and cumulative running minimum scores [6]
START
sortedScore ← sub-key lists arranged in a sorted tabular form; each column contains a sorted list (see Figure 3.3)
keyparts ← sortedScore.columns
candidates ← sortedScore.rows
last ← candidates − 1
cumulativeMaximum ← an array of size keyparts + 1, initialized with 0, used to store the cumulative running maximum scores
cumulativeMinimum ← an array of size keyparts + 1, initialized with 0, used to store the cumulative running minimum scores
for (i = keyparts − 1 downto 0) do
  cumulativeMinimum[i] ← cumulativeMinimum[i + 1] + sortedScore[i][last].score
  cumulativeMaximum[i] ← cumulativeMaximum[i + 1] + sortedScore[i][0].score
end for
END
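For concreteness, the same computation can be written in plain C as follows; sortedScore[i][j] holds the discrete score of the j-th candidate of list i (each list sorted in decreasing order) and the array names follow the pseudocode.

/* cumulativeMaximum and cumulativeMinimum must have keyparts + 1 entries */
void cumulative_arrays(int keyparts, int candidates, int **sortedScore,
                       int *cumulativeMaximum, int *cumulativeMinimum)
{
    int last = candidates - 1;
    cumulativeMaximum[keyparts] = 0;              /* the extra slot seeds the running sum */
    cumulativeMinimum[keyparts] = 0;
    for (int i = keyparts - 1; i >= 0; i--) {     /* running sums from right to left      */
        cumulativeMinimum[i] = cumulativeMinimum[i + 1] + sortedScore[i][last];
        cumulativeMaximum[i] = cumulativeMaximum[i + 1] + sortedScore[i][0];
    }
}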

Step 2: Compute tables that contain indexes for paths within the score range

Step 1 gives us the range between the cumulative maximum score and the cumulative minimum score; in other words, it gives us the range of scores of full keys. In the example under consideration, the best score is 19 while the worst score is 3. Thus, the scores of all possible full keys lie in the range [19, 3]. The next step is to compute two tables, one for minimal indexes and the other for maximal indexes, for all possible cumulative scores. The minimal index of a sub-key list represents the minimum index from that list which may be used to compose a particular score. Similarly, the maximal index of a sub-key list represents the maximum index from that list which may be used to compose a particular score. These indexes in turn give us the possible sub-key candidates from each list that can be used to generate a full key of a particular score. Let us demonstrate the computation of these tables of indexes with the help of our considered example. In our example, we have 4 sub-key lists, numbered [0, 3], each having 4 candidates. Thus, the possible indexes within each list lie in the range [0, 3]. We know the range of possible scores of full keys from step 1, i.e., [19, 3]. We want to find the maximal index and minimal index of each list corresponding to each cumulative score. Let us first find the minimal indexes of all 4 lists for the cumulative score of 19. For list 0, we know that if we choose index 0, i.e., the sub-key candidate having score 5, then in order to get a full key with a cumulative score of 19, the remaining sub-keys should yield at least a cumulative score of 19 − 5 = 14. From the arrays of cumulative running maximum scores and cumulative running minimum scores computed in step 1, we may inspect that the remaining sub-keys can be combined to yield keys within the score range [14, 3]. From this score range, we may see that the remaining sub-key candidates can generate a key with a cumulative score of 14 and therefore, we can choose index 0 as the minimal index of list 0 for the cumulative score of 19. If, for instance, we were not able to choose index 0 as a minimal index, then we would keep on validating the other indexes of the list in increasing order until we find a minimal index. The same process is repeated for the other lists in order to find their corresponding minimal indexes for the cumulative score of 19. The process of finding the maximal indexes is the same as that of the minimal indexes described above; the only difference is that instead of choosing the minimum index of a sub-key list for a particular cumulative score, we choose the maximum possible index. The table presented in Figure 3.6 shows the maximal indexes and minimal indexes of all 4 lists for all possible cumulative scores. Each column of the table corresponds to a particular list while each row corresponds to a particular cumulative score. Thus, we can see that each

row of the table contains the maximal indexes and minimal indexes of all lists corresponding to a particular cumulative score. Each entry in the table has the format Minimal Index / Maximal Index. The algorithm for computing the tables of both minimal indexes and maximal indexes is presented in Algorithm 5.

Step 3: Use backtracking on the tables of minimal and maximal indexes computed in step 2 to enumerate full keys

This step is the main enumeration process of SKEA. It enumerates full keys in the decreasing order of their cumulative scores by exploiting the tables and arrays already computed in steps 1 and 2. Let us describe this step with the help of our considered example. This last step has the responsibility to output full keys in the decreasing order of their cumulative scores; therefore, in our considered example, this step should first output all possible keys with a cumulative score of 19, then all possible keys with a cumulative score of 18, and so on. This can easily be done by exploiting the information already computed in the prior steps, i.e., the tables of maximal and minimal indexes. The table in Figure 3.6 presents the minimal and maximal indexes of the 4 sub-key lists for all possible cumulative scores; we just need to make combinations of all those indexes in order to reveal all possible keys of a particular cumulative score. For example, in order to output keys with a cumulative score of 19, upon inspecting the table presented in Figure 3.6, we may conclude that the sub-key candidate at index 0 is the only possible candidate from each list that may be combined to generate a full key with a cumulative score of 19. This is because the 0th index is both the minimal and maximal index in each list for the cumulative score of 19. The algorithm for this enumeration process is presented in Algorithm 6.

Figure 3.6: Table containing minimal and maximal indexes for all sub-key lists corresponding to each cumulative score. Each table entry is of the form Minimal Index / Maximal Index [6]

Algorithm 5 Computing tables for minimal indexes and maximal indexes [6]
START
sortedScore ← sub-key lists arranged in a sorted tabular form; each column contains a sorted list (see Figure 3.3)
keyparts ← sortedScore.columns
candidates ← sortedScore.rows
cumulativeMaximum ← array computed in step 1, containing the cumulative running maximum scores
cumulativeMinimum ← array computed in step 1, containing the cumulative running minimum scores
cumulativeMaximumIndex ← a table of size keyparts × (cumulativeMaximum[0] + 1), used to store the maximal indexes of all sub-keys corresponding to all possible cumulative scores
cumulativeMinimumIndex ← a table of size keyparts × (cumulativeMaximum[0] + 1), used to store the minimal indexes of all sub-keys corresponding to all possible cumulative scores
for (i = 0 to keyparts − 1) do
  for (j = 0 to cumulativeMinimumIndex[i].length − 1) do
    k ← 0
    while (k < candidates and sortedScore[i][k].score > j − cumulativeMinimum[i + 1]) do
      k ← k + 1
    end while
    cumulativeMinimumIndex[i][j] ← k
  end for
  for (j = 0 to cumulativeMaximumIndex[i].length − 1) do
    k ← candidates − 1
    while (k ≥ 0 and sortedScore[i][k].score < j − cumulativeMaximum[i + 1]) do
      k ← k − 1
    end while
    cumulativeMaximumIndex[i][j] ← k
  end for
end for
END
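The same computation can be rendered in plain C; the comparison tests follow directly from the step 2 description (an index is usable for cumulative score j only if the remaining lists can still contribute the missing part of j), scores are assumed to be non-negative integers and the array names follow the pseudocode.

void index_tables(int keyparts, int candidates, int **sortedScore,
                  const int *cumulativeMinimum, const int *cumulativeMaximum,
                  int **minIdx, int **maxIdx)
{
    for (int i = 0; i < keyparts; i++) {
        for (int j = 0; j <= cumulativeMaximum[0]; j++) {
            int k = 0;                 /* smallest index whose score still allows j */
            while (k < candidates &&
                   sortedScore[i][k] > j - cumulativeMinimum[i + 1])
                k++;
            minIdx[i][j] = k;

            k = candidates - 1;        /* largest index whose score still allows j  */
            while (k >= 0 &&
                   sortedScore[i][k] < j - cumulativeMaximum[i + 1])
                k--;
            maxIdx[i][j] = k;
        }
    }
}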

Algorithm 6 Enumeration process of SKEA employing the information computed in steps 1 and 2 [6]
START
sortedScore ← sub-key lists arranged in a sorted tabular form; each column contains a sorted list (see Figure 3.3)
keyparts ← sortedScore.columns
candidates ← sortedScore.rows
cumulativeMaximum ← array computed in step 1, containing the cumulative running maximum scores
cumulativeMinimum ← array computed in step 1, containing the cumulative running minimum scores
cumulativeMaximumIndex ← table computed in step 2, containing the maximal indexes of all sub-keys corresponding to all possible cumulative scores
cumulativeMinimumIndex ← table computed in step 2, containing the minimal indexes of all sub-keys corresponding to all possible cumulative scores
k ← an array to store the full key
for (cscore = cumulativeMaximum[0] downto cumulativeMinimum[0], while not found) do
  index ← 0
  k[0] ← cumulativeMinimumIndex[0][cscore]
  actualScore ← cscore
  while (index ≥ 0 and not found) do
    while (index < keyparts − 1) do
      actualScore ← actualScore − sortedScore[index][k[index]].score
      index ← index + 1
      k[index] ← cumulativeMinimumIndex[index][actualScore]
    end while
    if (index == keyparts − 1) then
      found ← testKey(k)
    end if
    while (index ≥ 0 and k[index] ≥ cumulativeMaximumIndex[index][actualScore]) do
      index ← index − 1
      if (index ≥ 0) then
        actualScore ← actualScore + sortedScore[index][k[index]].score
      end if
    end while
    if (index ≥ 0) then
      k[index] ← k[index] + 1
    end if
  end while
end for
END
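A C rendering of the backtracking loop may make the control flow easier to follow; it mirrors Algorithm 6 as presented above, with testKey() standing in as a placeholder for the key validation call of Chapter 5 and the score/index tables taken from the previous steps.

/* placeholder: assembles the full key named by the indexes in k[] and validates it */
extern int testKey(const int *k, int keyparts);

/* returns 1 as soon as testKey() reports success, 0 if the whole space is exhausted */
int skea_enumerate(int keyparts, int **sortedScore,
                   const int *cumulativeMaximum, const int *cumulativeMinimum,
                   int **minIdx, int **maxIdx)
{
    int k[16];                                   /* index into each sub-key list */
    int found = 0;

    for (int cscore = cumulativeMaximum[0];
         cscore >= cumulativeMinimum[0] && !found; cscore--) {
        int index  = 0;
        int actual = cscore;
        k[0] = minIdx[0][cscore];

        while (index >= 0 && !found) {
            while (index < keyparts - 1) {       /* fill the remaining positions */
                actual -= sortedScore[index][k[index]];
                index++;
                k[index] = minIdx[index][actual];
            }
            found = testKey(k, keyparts);

            /* backtrack to the right-most position that can still be advanced   */
            while (index >= 0 && k[index] >= maxIdx[index][actual]) {
                index--;
                if (index >= 0)
                    actual += sortedScore[index][k[index]];
            }
            if (index >= 0)
                k[index]++;                      /* advance and refill from here */
        }
    }
    return found;
}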

3.3 Trivial Key Enumeration Algorithm

As the name implies, this algorithm trivially merges the sub-key lists to generate full keys. Just like OKEA and SKEA, this algorithm requires ranked sub-key lists. However, in contrast to both OKEA and SKEA, it does not impose any requirement on the ranking criteria of the sub-key lists and can therefore be applied directly to the output of side-channel attacks.

Let us consider a simple case of merging two sorted lists with the help of TKEA. Suppose the lists are {1, 2, 3} and {4, 5, 6}. This algorithm simply makes all possible combinations of the lists' elements and outputs them. Thus, the combination of the above lists yields (1, 4), (1, 5), (1, 6), (2, 4), (2, 5) and so on. However, in practice, we get multiple lists with a large number of sub-key candidates. For instance, in the case of a DPA attack on AES-128, we get 16 sub-key lists, one for each byte, and each list contains 256 candidates. This in turn means that making all possible combinations of these lists using TKEA is practically infeasible. Thus, TKEA requires an enumeration limit. For instance, specifying 4 as the enumeration limit cuts down each list to its top 4 sub-key candidates. TKEA then operates on these scaled-down lists and yields the possible combinations. Since this algorithm trivially merges the sub-key lists, it does not yield keys in the decreasing order of their likelihood. Hence, we may say that TKEA is a non-optimal algorithm. The TKEA algorithm for merging 2 lists is presented in Algorithm 7. This algorithm can easily be extended to merge any number of lists (a sketch of the general case is given after the algorithm).

Algorithm 7 Trivial key enumeration algorithm
START
list1 ← first sorted list of sub-key candidates
list2 ← second sorted list of sub-key candidates
enumLimit ← factor to scale down the lists
for (i = 0 to enumLimit − 1) do
  for (j = 0 to enumLimit − 1) do
    Output (list1[i], list2[j])
  end for
end for
END
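Extending the trivial merge to all 16 AES-128 sub-key lists is most easily done with an "odometer" over the top enumLimit candidates of every list rather than 16 nested loops; a minimal C sketch is given below, with outputKey() as a placeholder for whatever consumes the enumerated keys.

/* placeholder for the consumer of the enumerated keys (e.g. the key validation step) */
extern void outputKey(const unsigned char *key, int numLists);

/* list[i][j]: j-th best candidate byte of list i; assumes numLists <= 16 */
void tkea(unsigned char **list, int numLists, int enumLimit)
{
    int idx[16] = {0};                         /* current index into each list */
    for (;;) {
        unsigned char key[16];
        for (int i = 0; i < numLists; i++)     /* assemble one full key        */
            key[i] = list[i][idx[i]];
        outputKey(key, numLists);

        int i = numLists - 1;                  /* advance the odometer         */
        while (i >= 0 && ++idx[i] == enumLimit) {
            idx[i] = 0;
            i--;
        }
        if (i < 0)
            break;                             /* all combinations emitted     */
    }
}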

Chapter 4
Analysis of Key Enumeration Algorithms

In this chapter, we present a detailed analysis of the performance of the three key enumeration algorithms under consideration. Let us first discuss the performance metrics which we have used to compare the key enumeration algorithms.

4.1 Performance Metrics

An ideal key enumeration algorithm should be optimal, should consume limited memory, should have a high throughput and should take as little time as possible to find the correct key, given that the correct key is feasible to recover. This gives us four important performance metrics, and thus we have compared the key enumeration algorithms based on these four metrics: memory consumption, optimality, throughput and time to find the correct key. The importance of these performance metrics and the reasons behind their choice are explained below.

4.1.1 Memory Consumption

The memory consumption of an algorithm/attack is considered a very important factor in the domain of statistical cryptanalysis as it decides the practical feasibility of the algorithm/attack. In other words, an algorithm is said to be practically infeasible if it requires an amount of memory that is not practically viable. The same holds in the case of key enumeration. Moreover, the overall efficiency of an algorithm highly depends on its memory consumption, as more memory consumption in turn means

more memory accesses by the algorithm, and a memory access instruction is generally much more expensive than a simple computational instruction. The memory consumption of a key enumeration algorithm is also important to consider from the perspective of GPU computing. The performance of an algorithm can be greatly enhanced by exploiting parallelism, and an efficient and economical way to do so is to implement the algorithm on a GPU. However, GPU computing has a number of constraints which should be satisfied in order to achieve a performance boost [13]. The memory of a GPU is one of those constraints, as a GPU typically has a limited amount of memory and therefore, the suitability of an algorithm for a GPU highly depends on its memory consumption.

4.1.2 Optimality

Optimality of a key enumeration algorithm refers to the property of generating keys in the decreasing order of their likelihood. The key enumeration algorithm is applied in the conquer phase of a side-channel attack, which takes the sub-key lists generated by the divide phase of the attack. A sub-key list generated by the divide phase of a side-channel attack is generally ranked in the decreasing order of the likelihood of the sub-key candidates. This in turn means that, ideally, the correct key is the one generated by the combination of the most likely sub-key candidate from each list. However, if the combination of the most likely sub-key candidates from each list does not yield the correct key, then the next most probable full key has the best chance of being the correct key. Thus, a key enumeration algorithm should ideally generate keys in the decreasing order of their likelihood.

4.1.3 Throughput

Throughput refers to the speed of enumeration, i.e., the number of keys an algorithm generates in one second. It is also referred to as the key generation rate of the algorithm. Since we need to test whether each key generated by the key enumeration algorithm is correct or not, key validation is the other component of the key recovery mechanism. The key validation step basically involves the implementation of the cipher under attack, such as AES or DES. The key validation step concurrently takes a key generated by the key enumeration step, generates a ciphertext by encrypting a plaintext with that key, and compares the ciphertext with the one obtained with the original key stored in the cryptographic device. Thus, the whole process of key recovery is to continually generate keys using the key enumeration algorithm and at the same time validate those keys in order to reveal the correct key.
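Conceptually, validating a single candidate key is a one-block encryption followed by a comparison; the sketch below assumes a hypothetical aes128_encrypt() routine and a reference ciphertext obtained once from the device under attack.

#include <string.h>

/* hypothetical single-block AES-128 encryption: out = AES_key(plaintext) */
extern void aes128_encrypt(const unsigned char plaintext[16],
                           const unsigned char key[16], unsigned char out[16]);

int is_correct_key(const unsigned char key[16],
                   const unsigned char plaintext[16],
                   const unsigned char reference_ct[16])
{
    unsigned char ct[16];
    aes128_encrypt(plaintext, key, ct);
    return memcmp(ct, reference_ct, 16) == 0;   /* match => candidate is the secret key */
}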

This in turn highlights the importance of the throughput metric: in order to efficiently recover the secret key, the throughputs of the key enumeration and key validation steps should be compatible.

4.1.4 Time to find the Correct Key

Among all four performance metrics, the time to find the correct key is the most crucial. This factor internally contains the impact of the other factors and is therefore of our main interest: we are interested in the key enumeration algorithm that takes the least time to find the correct key. As mentioned, this factor internally contains the effect of the other factors. For instance, an algorithm that strictly follows the optimal order while generating keys but has a very low throughput may take a large amount of time to generate the correct key. Thus, optimality alone is not of our main interest. Similarly, high throughput is not the only desired feature of a key enumeration algorithm; for instance, an algorithm that has a high throughput but does not follow the optimal order at all may also take a large amount of time (or even a practically infeasible amount of time) to find the correct key. Thus, the time to find the correct key is of our main concern, as this factor internally contains the impact of optimality, memory consumption and throughput.

4.2 Experimental Comparisons

In order to evaluate the performance of the three key enumeration algorithms, namely the Optimal Key Enumeration Algorithm (OKEA), the Score based Key Enumeration Algorithm (SKEA) and the Trivial Key Enumeration Algorithm (TKEA), we conducted several experiments. We executed a correlation coefficient based Differential Power Analysis (DPA) attack on the ATmega implementation of AES-128. The attack yielded 16 independent sub-key lists sorted according to the correlation scores of the sub-key candidates. The attack was executed by varying the number of power traces, starting from 55 traces up to 100 traces in increments of 5. The range of 55 to 100 traces was chosen because below 55 traces, the time required by the algorithms to find the correct key was too long for the thesis schedule, while above 100 traces, enumeration was not required at all because the correct key was the combination of the top sub-key candidate from each list. As the behavior of the ranking is probabilistic, in order to get a statistical sample, we collected 12 cases for each number of power traces, i.e., 12 cases for 100 traces, 12 cases for 95 traces and so on. All the results discussed in the next sections show the average over those experiments. As the correlation coefficient based DPA attack yields sub-key lists ranked

according to the correlation scores of the sub-key candidates, the first step was to convert the correlation based ranking of the lists into probability based and discrete score based rankings for OKEA and SKEA respectively. For this purpose, the Bayesian extension process and the process to obtain a discrete score based ranking, as explained in sections 3.1.1 and 3.2.1 respectively, were executed. It is worth mentioning here that all the key enumeration algorithms under consideration are generic and work for all side-channel attacks that employ the divide and conquer strategy. Therefore, to evaluate the performance of the key enumeration algorithms, we just need sorted sub-key lists. The open source implementation of OKEA was used [9], while we implemented both SKEA and TKEA in C. We executed the experiments on a common laptop (Intel Core i5-2450M with 8 GB RAM running 64-bit Windows 7). We set 1 hour as the time limit to find the correct key in each experiment. For each case, both OKEA and SKEA were able to find the correct key within this time frame; however, TKEA could only find the correct key down to 70 traces and below that, it was not able to find the correct key within 1 hour. For this reason, in each result discussed below, TKEA has a curve only down to 70 traces. Let us now discuss the results of our experimentations based on the four performance metrics, i.e., memory consumption, optimality, throughput and time to find the correct key.

4.2.1 Memory Consumption

TKEA does not have any significant memory requirements as it trivially merges the sub-key lists in order to generate full keys. TKEA just requires the sorted sub-key lists to enumerate keys; therefore, TKEA is not discussed below. In order to check the dependency of the memory consumption of OKEA and SKEA on the input data, we executed a simple experiment. We executed both OKEA and SKEA on two different input data sets and examined the memory consumption for different enumeration limits. In other words, we executed the algorithms up to the specified enumeration limits irrespective of the correct key. The enumeration limit refers to the number of keys generated by the algorithm. Input data in the form of 16 lists, each containing the probabilities of 256 sub-key candidates, is provided by the OKEA authors to test their open source implementation [9]; we used that data as one of the input data sets and refer to it as input 1 in our discussion. The other data set was obtained by executing a correlation based DPA attack on the ATmega implementation of AES-128 with 100 traces, and we refer to it as input 2. The results of the memory consumption of both algorithms are presented in

Figure 4.1. Different line styles are used to differentiate the input data (solid lines for input 1 and dashed lines for input 2), while different colors are employed to differentiate the curves of the two algorithms (blue for OKEA and red for SKEA). From the results we can clearly infer that the memory consumption of OKEA highly depends on the input data. This is because of the frontier set. Since OKEA, after generating the most probable key, stores the next likely candidates in the frontier set (after validating Rule 1, as thoroughly described in section 3.1.2), the formation and expansion of the frontier set depend entirely on the input data. Moreover, the memory consumption of OKEA grows proportionally with the enumeration limit. This is quite obvious, as enumerating more keys in turn requires more memory for the frontier set in order to store the candidates. Thus, OKEA consumes memory in a dynamic way and the memory consumption usually goes up to several gigabytes (GB) for enumerating more than 2^28 keys.

In contrast to OKEA, SKEA does not consume memory in a dynamic way, i.e., the memory consumption of SKEA does not increase with the enumeration limit; instead, SKEA consumes a limited, constant amount of memory. As explained in section 3.2.2, apart from the sorted sub-key lists, the enumeration process of SKEA only requires a few other tables that need to be stored in memory. In Chapter 3, those tables are referred to as cumulativeMinimum, cumulativeMaximum, cumulativeMinimumIndex and cumulativeMaximumIndex. The tables respectively store the cumulative running minimum scores, the cumulative running maximum scores, the minimum indexes of the sub-key lists for all possible cumulative scores and the maximum indexes of the sub-key lists for all possible cumulative scores. It is worth mentioning here that the range of the cumulative score depends on the factor we employ to convert the default ranking of the sub-key lists into the discrete score based ranking. Moreover, the factor is usually chosen as per the range of values in the input data; therefore, the size of the two tables cumulativeMinimumIndex and cumulativeMaximumIndex depends on the input data, which in turn means that the memory consumption of SKEA also has some dependency on the input data. Since in our experimentations we used a constant conversion factor of 100, SKEA consumes almost the same amount of memory in both cases. Moreover, from Figure 4.1, we may see that SKEA consumes only around 160 KB of memory in both cases, which is far less than the memory consumption of OKEA.

Figure 4.1: Memory consumption of OKEA and SKEA for two different data sets

To further explore the memory consumption of both OKEA and SKEA, we executed both these algorithms on the outputs of the correlation based DPA attack which we performed on the ATmega implementation of AES-128 as described in the previous section. We measured the memory consumed by both

OKEA and SKEA until they both found the correct key. Figure 4.2 shows the results. Since the experiment was performed by taking 12 cases for each number of power traces, the figure illustrates the average memory consumed by the algorithms. The solid blue curve depicts the memory consumption of OKEA while the dashed red line represents the memory consumption of SKEA. From Figure 4.2, we may clearly see that for a higher number of traces, OKEA requires much less memory than SKEA. However, the memory consumption of OKEA varies inversely with the number of traces and therefore, for a smaller number of traces, such as 60 and below, OKEA consumes a large amount of memory compared to SKEA in order to reach the correct key. This is because a smaller number of power traces requires more enumeration, which in turn requires more storage for OKEA's frontier set. In contrast to OKEA, SKEA consumes a much smaller and constant amount of memory in each case.

Figure 4.2: Memory consumed by OKEA and SKEA until the correct key was generated

4.2.2 Optimality

In order to compare the optimality of the three key enumeration algorithms under consideration, we measured the rank of the correct key found by them. In other words, we counted the number of keys an algorithm generated before it reached the correct key. This is referred to as the rank of the correct key. An algorithm that strictly follows the optimal order should generate the correct key with the smallest possible rank. Figure 4.3 depicts the rank of the correct key found by all three algorithms. Since the experiment was performed by

taking 12 cases for each number of power traces, the figure illustrates the average rank of the correct key found for each number of traces. The dotted green line represents the rank of the correct key in the case of TKEA, while the dashed red and solid blue lines represent respectively the ranks of the correct key measured for SKEA and OKEA. Within the time limit of 1 hour, TKEA was only able to find the correct key down to 70 traces, which is why the graph does not contain a dotted line for fewer than 70 traces. From the graph, we may clearly infer the optimality of the algorithms. Since TKEA trivially merges the sub-key lists, it generates the correct key with a much higher rank. This in turn means that TKEA generates a large number of extra keys before generating the correct key. TKEA, therefore, is referred to as a non-optimal key enumeration algorithm, as it does not consider the optimal order at all while enumerating keys. In contrast to TKEA, OKEA strictly follows optimality, and this can be clearly seen from Figure 4.3. Among the three algorithms, OKEA generated the correct key with the smallest rank in all cases. OKEA requires a probability based ranking of the sub-key lists, which in turn is generated by the Bayesian extension process (explained in section 3.1.1). The Bayesian extension process yields a ranking of the sub-key lists with exactly the same order of likelihood as the default score based ranking. Thus, there is no loss of precision, which in turn allows OKEA to output keys in a strictly optimal order. The curve of SKEA, represented by the dashed red line in Figure 4.3, lies very close to OKEA's curve. This means that SKEA does not follow strict optimality, but on the other hand performs much better than TKEA,

which is a non-optimal algorithm. The initial conversion step of SKEA, which transforms the default ranking of the sub-key lists into a discrete score based ranking, involves a rounding-off step (as described in section 3.2.1) which may map sub-key candidates with different default scores onto the same discrete score. Thus, the conversion process causes SKEA to lose precision, which in turn affects the optimal order followed by SKEA. This fact is evident from Figure 4.3, where the SKEA curve depicts higher ranks for the correct key than OKEA; therefore, we refer to SKEA as a sub-optimal key enumeration algorithm. It should be noted that, as the conversion step of SKEA has a very significant impact on its performance, this topic is still under research [6].

Figure 4.3: Comparison of optimality for OKEA, SKEA and TKEA

4.2.3 Throughput

Throughput means speed of enumeration, i.e., the number of keys an algorithm enumerates in one second. We first examined the dependency of the throughput of the algorithms on the input data, just as we did for memory consumption: we executed all three algorithms, OKEA, SKEA and TKEA, on two different data sets and examined their throughput. The same input data as used to examine the dependency of memory consumption was employed, i.e., input 1 and input 2, where input 1 refers to the input data provided by the OKEA authors [9] while input 2 refers to the data obtained by executing a correlation based DPA attack on the ATmega implementation of AES-128 exploiting 100 traces. Figure 4.4 illustrates the throughput of all three algorithms for both data sets. Different line styles are used to differentiate the input data (i.e.,

solid lines for input 1 and dashed lines for input 2), while different colors are employed to differentiate the curves of the three algorithms (blue for OKEA, red for SKEA and green for TKEA). Since TKEA neither consumes any significant memory nor requires any complex memory accesses, its throughput is much higher than that of SKEA and OKEA. Moreover, the throughput of TKEA for both data sets is almost identical, which in turn shows that the throughput of TKEA is essentially independent of the input data. SKEA, on the other hand, enumerates keys in the decreasing order of their cumulative scores. The range of the cumulative score depends on the input data, and the size of the tables cumulativeMinimumIndex and cumulativeMaximumIndex, containing the minimal and maximal indexes of the sub-key lists for all possible cumulative scores, in turn depends on the input data. In a sense, these tables determine the number of memory accesses required for generating keys with a particular cumulative score. Thus, the throughput of SKEA also depends on the input data; however, we may see from Figure 4.4 that the variation between the two throughputs is not very high and SKEA is able to maintain a throughput of around 2^23 keys/sec in both cases. In contrast to both TKEA and SKEA, OKEA yields a much lower throughput in both cases, which is evident from Figure 4.4. We may also see from Figure 4.4 that the throughput of OKEA depends on the input data. This is because OKEA enumerates keys using a frontier set, and the formation and expansion of the frontier set depend entirely on the input data. It is worth mentioning here that the throughput of OKEA decreases further as the enumeration limit is increased. This fact is based on our experimental observations and also on the results presented in [2]. It is also theoretically understandable, as enumerating a large number of keys requires more memory for the frontier set, which in turn means more memory accesses and finally leads to a decline in throughput.

Figure 4.4: Throughput of OKEA, SKEA and TKEA for two different data sets

We further examined the throughput of the three key enumeration algorithms by executing them on the outputs of the correlation based DPA attack which we performed on the ATmega implementation of AES-128 as described in the previous section. Figure 4.5 depicts the throughput of the three key enumeration algorithms. The throughput of TKEA is shown by the dotted green line while the throughputs of SKEA and OKEA are represented by the dashed red and solid blue lines respectively. The throughput results are quite in line with what we have discussed before. From Figure 4.5, we may see that TKEA maintains a much higher throughput than both SKEA and OKEA. However, within the time limit of 1 hour, TKEA could only find the correct key down to 70 traces, and that is why

Figure 4.5 does not contain the throughput of TKEA for fewer than 70 traces. The throughput of SKEA lies in between the throughputs of TKEA and OKEA: it maintains a much higher throughput than OKEA, but on the other hand a much lower one than TKEA. OKEA shows the lowest throughput of all. Moreover, the throughput of OKEA shows an interesting behavior. Initially the throughput is very small, and it increases as the number of traces decreases. Since decreasing the number of traces requires more enumeration, which in turn means more memory accesses for OKEA, the throughput should ideally decrease with the decrease in the number of traces. From Figure 4.5, it seems that the result does not follow this expectation. This, however, is not the case, because the initial increase in the throughput of OKEA is caused by caching effects; if we kept decreasing the number of traces below 55, the aforementioned effect would become visible, as the throughput would ultimately start decreasing.

Figure 4.5: Throughput of OKEA, SKEA and TKEA while finding the correct key

4.2.4 Time to find the Correct Key

Among all the performance metrics, the time to find the correct key is the most critical, as it internally contains the impact of the other factors such as throughput, optimality and memory consumption. For instance, an algorithm that maintains a high throughput, strictly follows optimality and consumes little memory will ultimately take less time to find the correct key in cases where the key is practically feasible to recover.

Figure 4.6 presents the time which each algorithm took to find the correct key in our experiments. Since the experiments were performed by taking 12 cases for each number of power traces, the figure illustrates the average time taken by the algorithms to find the correct key in each case. The dotted green line represents the time taken by TKEA, while the dashed red and solid blue lines show the time taken by SKEA and OKEA respectively. Although TKEA maintains a very high throughput, it does not follow optimality at all, which leads TKEA to generate a large number of extra keys. Thus, TKEA takes a large amount of time (possibly even a practically infeasible amount of time) to generate the correct key. This fact is evident from Figure 4.6, as within the time limit of 1 hour, TKEA could only find the correct key down to 70 traces. On the other hand, both OKEA and SKEA take less than 1 second to find the correct key down to 60 traces; however, among them, SKEA outperforms OKEA by approximately a factor of 10. The comparison between these two algorithms becomes more interesting for a lower number of traces, as a lower number of traces requires more enumeration. Although OKEA strictly follows optimality, because of its low throughput it consumes a large amount of time to generate the correct key (around 300 sec in the case of 55 traces). In contrast, SKEA is a sub-optimal algorithm but has a very high throughput, which enables it to reach the correct key in a much shorter time than OKEA (around 10 sec in the case of 55 traces). Thus, we may say that the factor with

which SKEA outperforms OKEA increases with the decrease in the number of traces, i.e., with the increase in enumeration.

Figure 4.6: Time taken by OKEA, SKEA and TKEA to find the correct key

4.3 Discussion

An ideal key enumeration algorithm should be optimal, should have a high throughput and should consume little memory. These properties finally lead the algorithm to find the correct key in a very short amount of time in cases where the key is practically feasible to recover. In the previous section, we have thoroughly presented the comparisons of the three key enumeration algorithms based on the aforementioned metrics. From the results, we may infer that none of the key enumeration algorithms under consideration satisfies all the properties of an ideal key enumeration algorithm. TKEA does not consume significant memory and has a very high throughput, but since it does not follow optimality at all, it outputs a large number of extra keys before generating the correct key. Therefore, TKEA in most cases does not generate the correct key within a feasible amount of time. OKEA, on the other hand, strictly follows optimality but consumes a large amount of memory, which leads it to have a very low throughput. Consequently, OKEA takes a large amount of time to generate the correct key. In contrast to OKEA, SKEA does not strictly follow optimality but consumes limited memory, which leads it to have a very high throughput. Consequently, SKEA takes very little time to generate the correct key.

From the comparison results (which employed 55 to 100 traces) and the above discussion, we may conclude that among OKEA, SKEA and TKEA, SKEA is the one that satisfies most of the properties of an ideal key enumeration algorithm; therefore, we select this algorithm to be used in the design of our proposed key recovery solution.

Chapter 5
Key Validation

Key validation (or key testing) is the other constituting component of the key recovery mechanism. As the name implies, this part has the responsibility to test whether the key generated by the key enumeration step is correct or not. This in turn means that the key validation step deals with the implementation of the cipher under attack. The Advanced Encryption Standard (AES) is arguably the most widely deployed cipher in practice today and therefore, we have focused mainly on AES in this thesis. The description of the AES cipher can be found in Chapter 2. In order to validate a key generated by the key enumeration step, an attacker encrypts a particular plaintext with that key and matches the corresponding ciphertext with the one already computed with the secret key stored in the cryptographic device under attack. Thus, the key validation rate should be compatible with the key generation rate of the key enumeration step in order to achieve an overall efficient key recovery solution. In the previous chapter, we proposed to deploy the Score based Key Enumeration Algorithm (SKEA), as it outperforms its counterparts. Moreover, we have also found that the maximum throughput (or key generation rate) of SKEA is roughly around 2^23 keys/sec and therefore, the key validation rate should at least keep this pace. In order to achieve the desired throughput for key validation, we have utilized the immense parallel processing power of a GPU. As GPUs are optimized for throughput, we have implemented AES on a GPU and achieved the desired throughput. Let us discuss the details of our implementation of AES key validation on a GPU. A basic understanding of GPU computing using the CUDA platform is required to understand the following details; a basic introduction to GPU computing using CUDA can be found in Chapter 2.

5.1 AES Key Validation on GPU

Over the past decade, the research community has started utilizing the immense parallel processing power of GPUs in the domain of cryptography. Various researchers have enhanced the performance of AES encryption by implementing it on a GPU [24-27]. However, AES encryption is different from our desired AES key validation. In encryption, a single secret key is employed to encrypt a huge amount of plaintext. On the other hand, key validation means encrypting a single block of plaintext with different keys. This in turn means that AES encryption requires key scheduling only once, for the secret key, while AES key validation requires key scheduling for each key that needs to be tested. In almost all the implementations of AES encryption on a GPU present in the literature, the authors propose to schedule the secret key on a CPU and store the resulting expanded keys on the GPU. To the best of our knowledge, no one has implemented AES key validation on a GPU.

We have implemented AES key validation using the CUDA platform on an Nvidia GeForce GTX 650. This Nvidia GPU has compute capability (CC) 3.0. A few CUDA related hardware specifications corresponding to CC 3.0 can be found in Figure 2.6, while detailed hardware specifications are available at [28, 29]. It is worth mentioning here that we have used one of the fastest known open-source implementations of AES encryption for GPU platforms and transformed it into our desired key validation process [30]. Our AES key validation on the GPU employs the T-table based implementation of AES. Each T-table consumes 1 KB of memory as it consists of 256 entries of 32 bits each. The four T-tables are pre-loaded into the constant memory region of the GPU. A thread block is composed of 256 CUDA threads organized as 4 × 64, i.e., 4 threads in the x-dimension and 64 threads in the y-dimension. Since per-block shared memory is much faster than constant memory, the threads in a thread block first copy all four T-tables from constant memory to local shared memory. The number of thread blocks, i.e., the grid dimension, is decided at run time based on the input data (i.e., the number of input keys that need to be tested). The input keys are divided into groups of 64 and each thread block is responsible for testing one group of input keys. In other words, each thread block tests 64 keys, which in turn means that 4 threads are employed to test a single key. Thus, in our implementation, a single CUDA thread works on 4 bytes of a key. Two arrays of size 1 KB each are reserved in shared memory to compute and store the intermediate states. As the process starts, one of the two arrays stores the results of the zeroth round, which are then used in the computation of the first round, whose results are then stored in the second array. The arrays are then swapped for the successive rounds.
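The launch geometry and table staging described above can be sketched in CUDA C as follows; the kernel and buffer names are illustrative, the on-the-fly key schedule and the ten T-table rounds are omitted, and the actual implementation follows [30] rather than this skeleton.

__constant__ unsigned int cT0[256], cT1[256], cT2[256], cT3[256];   /* T-tables, 4 x 1 KB */

__global__ void aes_validate_kernel(const unsigned char *d_keys,    /* 16 bytes per key   */
                                    const unsigned char *d_plain,
                                    const unsigned char *d_target,
                                    int *d_result, int numKeys)
{
    __shared__ unsigned int T0[256], T1[256], T2[256], T3[256];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;        /* 0..255 within the block   */
    T0[tid] = cT0[tid]; T1[tid] = cT1[tid];                  /* stage the T-tables in the */
    T2[tid] = cT2[tid]; T3[tid] = cT3[tid];                  /* faster shared memory      */
    __syncthreads();

    int keyIdx = blockIdx.x * 64 + threadIdx.y;              /* 64 keys per thread block  */
    int col    = threadIdx.x;                                /* 4 threads share one key,  */
    if (keyIdx >= numKeys) return;                           /* one 32-bit column each    */

    /* ... per-round key schedule and T-table rounds for column `col`, ping-ponging
       between two shared-memory state arrays, then the ciphertext comparison (omitted);
       on a match, one of the four threads records the winning key index:               */
    /* if (col == 0 && match) *d_result = keyIdx; */
}

/* host-side launch (numKeys candidate keys already copied to d_keys): */
void launch_validation(const unsigned char *d_keys, const unsigned char *d_plain,
                       const unsigned char *d_target, int *d_result, int numKeys)
{
    dim3 block(4, 64);                          /* 4 x 64 = 256 threads per block      */
    dim3 grid((numKeys + 63) / 64);             /* grid dimension chosen at run time   */
    aes_validate_kernel<<<grid, block>>>(d_keys, d_plain, d_target, d_result, numKeys);
}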

This strategy allows us to complete the whole process, i.e., all AES rounds, without using any loop. The ciphertext corresponding to each key is then compared to the one already computed using the original secret key. The key among the test keys which yields the correct ciphertext is then revealed as the correct key.

5.2 Performance Analysis

Figure 5.1 and Figure 5.2 present the performance of the AES key validation process on the Nvidia GeForce GTX 650 in terms of latency and throughput respectively. It is worth mentioning here that these results include the time to transfer the keys from the host (CPU) to the device (GPU). Figure 5.1 illustrates the latency of AES key validation on the GPU. Latency refers to the time taken by the device to test a given number of keys. From Figure 5.1, we may see that the latency remains constant up to a particular amount of input data and after that it starts increasing. In fact, this illustrates the typical behavior of a GPU: an initial increase in input data propels the GPU to execute more thread blocks in parallel; once the GPU utilizes its hardware resources to the maximum extent, the latency starts increasing.

Figure 5.1: Latency of AES key validation on GeForce GTX 650

Figure 5.2 presents the throughput or key validation rate of our implementation. The throughput curve is quite in line with what we have discussed above. From Figure 5.2, we may clearly see that the throughput initially

increases with the increase in input data and then attains a certain level. The initial increase in throughput is because of the parallel execution of the thread blocks. In other words, the initial increase in input data causes the GPU to run more thread blocks in parallel, which in turn increases the throughput until the GPU utilizes its resources to the maximum extent. Furthermore, the throughput curve clearly verifies the fact that a GPU requires a large number of threads to achieve maximum efficiency. From Figure 5.2, we may clearly see that the throughput of our key validation implementation is higher than our required figure, i.e., more than 2^23 keys/sec.

Figure 5.2: Throughput of AES key validation on GeForce GTX 650

5.3 Discussion

Both constituting parts of the key recovery mechanism, i.e., key enumeration and key validation, need to execute concurrently in order to achieve an overall efficient solution. This in turn requires that the key validation rate be compatible with the key generation rate of the key enumeration part. Since we have selected SKEA as the key enumeration algorithm and its maximum throughput is approximately 2^23 keys/sec, the key validation rate should be at least 2^23 keys/sec. We can clearly see from Figure 5.2 that by employing the parallel processing power of a GPU, one can achieve a key validation throughput higher than the required figure (i.e., 2^23 keys/sec). Thus, executing SKEA on a CPU while at the same time employing the parallel processing power of a GPU for key validation yields an overall efficient solution for key recovery.

It is worth mentioning here that the throughput of AES key validation on the GPU can be further enhanced by utilizing CUDA streams. Streams are a very powerful feature of the CUDA platform that allows concurrent execution of multiple CUDA operations. Since the already achieved throughput of key validation is higher than the throughput of SKEA, there is no strict requirement for using CUDA streams. However, we propose to use CUDA streams in order to ensure that key validation cannot become a bottleneck for the entire key recovery mechanism. The next chapter presents the details of CUDA streams and the way we have employed them in our proposed key recovery solution.

Chapter 6
Proposed Solution and Future Enhancements

The two constituting parts of the key recovery mechanism, i.e., key enumeration and key validation, have been explained in detail in chapters 3 and 5 respectively. In this chapter, we present a summary of the proposed key recovery solution and analyze its performance.

6.1 Proposed Solution

Among the three key enumeration algorithms, i.e., the Optimal Key Enumeration Algorithm (OKEA), the Score based Key Enumeration Algorithm (SKEA) and the Trivial Key Enumeration Algorithm (TKEA), we have chosen SKEA to be deployed because it outperforms its counterparts and satisfies most of the properties of an ideal key enumeration algorithm (it is a sub-optimal algorithm, consumes a constant and limited amount of memory, has a very high throughput and takes very little time to find the correct key, given that the correct key is practically feasible to recover). Based on our experimental comparisons presented in chapter 4, we have seen that the maximum possible throughput of SKEA is approximately 2^23 keys/sec. This in turn means that the speed of the key validation process should at least keep this pace in order to achieve an overall efficient solution. To obtain this desired throughput for key validation, we utilized the immense parallel processing power of a GPU. We implemented AES key validation on a GPU and achieved a throughput of more than 2^23 keys/sec. Thus, simultaneous execution of the key enumeration process on a CPU and the key validation process on a GPU yields an overall efficient key recovery solution, as shown in Figure 6.1.

Figure 6.1: Proposed key recovery solution

In order to achieve this concurrency and to ensure that the key validation

65 CHAPTER 6. PROPOSED SOLUTION AND FUTURE ENHANCEMENTS65 Figure 6.1: Proposed key recovery solution process on the device keep pace with the speed of key enumeration process on the host, our proposed system employs CUDA streams. A CUDA stream is a sequence of operations that execute in the issueorder on the GPU [31]. Streams are very powerful feature of CUDA platform that allows simultaneous execution of multiple CUDA operations. If no stream is specified then CUDA uses default stream aka stream 0 and all CUDA operations in the default stream are synchronous as shown in Figure 6.2 (a). Figure 6.2 (b) depicts the use of non-zero streams and we can clearly see the concurrency of CUDA operations. Each stream in Figure 6.2 (b) contains the three typical operations of a CUDA program i.e., a) transfer data from host to device, b) kernel execution and c) transfer result from device to host. Figure 6.2 (b) depicts the concurrent execution of operations in different streams. For instance, the transfer of data from device to host in stream 1 executes concurrently with the kernel execution of stream 2 and also with the transfer of data from host to device in stream 3. From Figure 6.2, we may clearly infer that concurrency significantly enhances the performance of an application. However, it should be noted that the device needs to have support for concurrency i.e., concurrent data transfer and concurrent kernel execution [31]. It is worth mentioning here that since the throughput of both the key enumeration and key validation parts are almost equivalent so the solution may be deployed without using streams i.e., by using default stream 0. The
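To make the pattern of Figure 6.2 (b) concrete, the following minimal CUDA sketch issues the three typical operations (host-to-device copy, kernel, device-to-host copy) in two non-default streams while the host stays free to do other work. It is purely illustrative and not the thesis implementation; the kernel, buffer sizes and variable names are placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// Placeholder kernel: stands in for the real AES key-validation kernel.
__global__ void dummyKernel(const unsigned char *in, unsigned char *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] ^ 0xAA;   // arbitrary work
}

int main() {
    const int nStreams = 2;
    const int bytes    = 1 << 20;            // 1 MiB per stream (illustrative)
    cudaStream_t stream[nStreams];
    unsigned char *h_in[nStreams], *h_out[nStreams], *d_in[nStreams], *d_out[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost(&h_in[s],  bytes);     // pinned memory, required for async copies
        cudaMallocHost(&h_out[s], bytes);
        cudaMalloc(&d_in[s],  bytes);
        cudaMalloc(&d_out[s], bytes);
        std::memset(h_in[s], 0, bytes);
    }

    // Issue H2D copy, kernel and D2H copy in each stream; operations in
    // different streams may overlap if the device supports it.
    const int threads = 256;
    const int blocks  = (bytes + threads - 1) / threads;
    for (int s = 0; s < nStreams; ++s) {
        cudaMemcpyAsync(d_in[s], h_in[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        dummyKernel<<<blocks, threads, 0, stream[s]>>>(d_in[s], d_out[s], bytes);
        cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    // The host is free to do other work here (e.g., enumerate the next key group).
    cudaDeviceSynchronize();                  // wait for all streams to finish

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFreeHost(h_in[s]); cudaFreeHost(h_out[s]);
        cudaFree(d_in[s]);     cudaFree(d_out[s]);
    }
    printf("done\n");
    return 0;
}
```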

It is worth noting that, since the throughputs of the key enumeration and key validation parts are almost equal, the solution could also be deployed without streams, i.e., using only the default stream 0. The host first collects a group of enumerated keys (for instance 2^23 keys) by executing the key enumeration process (i.e., SKEA). As the host has to transfer the keys to the device for validation, this grouping is meant to use the memory bandwidth between host and device efficiently. To start the key validation process, the host issues an asynchronous memory transfer and a kernel launch in the default stream. Since both operations are asynchronous with respect to the host, the host continues its execution and starts collecting the next group of enumerated keys. Thus, the testing of the first key group on the device and the collection of the next key group on the host proceed simultaneously. As both processes have the same throughput, by the time the device finishes testing the first key group, the host is ready with the next one. This process repeats until the correct key is revealed by the device. Hence, streams are not strictly required in the proposed solution, but we employ them to further improve the efficiency of the key validation process and to ensure that key validation cannot slow down the key enumeration process. The operation of the proposed solution is explained in the next section with the help of flowcharts.

Flowchart

The flowcharts in Figure 6.3 describe the entire operation of the proposed key recovery solution. Figure 6.3 (a) shows the operations on the host side.

The sorted sub-key lists obtained from the side-channel attack (such as a DPA attack) are provided as input to the key enumeration part. Since SKEA requires a discrete, score-based ranking of the sub-key lists, the first step is to transform the input lists into this discrete ranking. The host then collects a group of keys (for instance 2^23 keys) by applying SKEA. To run the key validation process on the device concurrently, the host transfers the collected key group to the device. For this purpose, the host launches a CUDA stream in which it issues an asynchronous transfer of the key group to the device followed by a kernel launch. Since both operations are asynchronous with respect to the host, the host resumes its execution and starts collecting the next group of keys. This process is repeated until the correct key is revealed by the key validation process on the device. Furthermore, the host uses a different CUDA stream for each consecutive key group. This ensures that, if the previous stream has not yet completed, the device can start testing the next key group without interrupting the operations of the previous stream.

Figure 6.3 (b) illustrates the flowchart of the key validation process on the device. The device receives a group of keys (say 2^23 keys) from the host for testing. Since each CUDA thread block tests 64 keys in our implementation (explained in Section 5.1), the input key group is divided into smaller groups of 64 keys each. The number of thread blocks (i.e., the grid dimension) depends on the number of keys in the input key group. Each thread block encrypts a particular plaintext under every key in its group and compares the resulting ciphertexts with the ciphertext obtained under the original secret key. The key that produces the matching ciphertext is revealed as the correct key and the process terminates.
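A hedged sketch of how such a validation kernel could be structured is given below: one candidate key per thread and 64 threads per block, with the ciphertext comparison writing a found index. The AES-128 device routine is only a labelled stand-in, and the actual thesis implementation (Section 5.1) may divide the work within a block differently.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Placeholder: stands in for a real AES-128 device implementation
// (e.g., a T-table based kernel); it is NOT real AES.
__device__ void aes128_encrypt_stub(const uint8_t key[16], const uint8_t pt[16], uint8_t ct[16]) {
    for (int i = 0; i < 16; ++i) ct[i] = pt[i] ^ key[i];
}

// Each block tests 64 candidate keys: one key per thread.
// keys:  candidate keys, 16 bytes each, laid out consecutively
// pt:    known plaintext; refCt: reference ciphertext from the attacked device
// found: index of the matching key (unchanged if no key in this batch matches)
__global__ void validateKeys(const uint8_t *keys, const uint8_t *pt,
                             const uint8_t *refCt, int numKeys, int *found) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numKeys) return;

    uint8_t ct[16];
    aes128_encrypt_stub(&keys[16 * idx], pt, ct);

    bool match = true;
    for (int i = 0; i < 16; ++i)
        if (ct[i] != refCt[i]) { match = false; break; }

    if (match) *found = idx;   // at most one key can match, so a plain store suffices
}

// Illustrative launch: grid dimension follows from the batch size, 64 threads per block.
// validateKeys<<<(numKeys + 63) / 64, 64, 0, stream>>>(d_keys, d_pt, d_refCt, numKeys, d_found);
```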

Figure 6.3: Flowchart of the proposed solution (a) Flowchart of the host part of the solution (b) Flowchart of the device part of the solution

Performance Analysis

To analyze the performance of the proposed solution, we compared the throughput of SKEA alone with the throughput of the entire solution (i.e., SKEA plus AES key validation on the GPU). Both were executed on lists collected by running a DPA attack on an ATmega implementation of AES-128 using 100 traces. Figure 6.4 presents the comparison results: the solid blue line represents the throughput of SKEA, while the dashed red line depicts the throughput of the entire solution. The host collects a group of 2^23 keys by applying SKEA and then transfers it to the device (GeForce GTX650) using CUDA streams. The streams effectively allow concurrent execution of both processes, i.e., key enumeration on the host and key validation on the device.

Figure 6.4: Throughput of SKEA and the proposed solution

This concurrency is also evident from Figure 6.4, as the throughput of the entire solution is almost the same as that of SKEA alone. This demonstrates the efficiency of the entire solution, since adding key validation does not decrease the throughput of the key enumeration process (i.e., the throughput of SKEA).

6.2 Future Enhancements

To give directions for future research, we present some possible enhancements to the design of the aforementioned solution. Since the key validation process has to wait for keys generated by the key enumeration process, the overall speed of the key recovery mechanism is bounded by the speed of key enumeration. This is also evident from Figure 6.4. An economical and efficient way to improve the performance of a key enumeration algorithm is to implement it on a GPU. However, GPU computing imposes a number of constraints that must be satisfied to obtain the performance boost. In order to propose a more powerful key recovery solution for future research, we have therefore analyzed the suitability of OKEA and SKEA for a GPU. We skip TKEA because its performance has been shown to be worse than that of OKEA and SKEA.

GPU Computing and Key Enumeration Algorithms

Before going into the details of the suitability of OKEA and SKEA for a GPU, let us first briefly highlight a few GPU constraints that significantly affect the performance of an application.

GPU constraints

A GPU requires a large number of threads to reach maximum efficiency, because only a large number of threads lets the GPU utilize its hardware resources fully. GPU threads are meant to be lightweight, i.e., only a limited amount of work is assigned to each thread. A GPU is best suited to applications that require much computation and little communication [14]: its enormous parallel processing power makes it good at computation, while it allows only a very limited amount of communication between threads (only threads within the same thread block can communicate with each other). Thus, independent threads generally perform much better on a GPU [14]. Furthermore, the memory access pattern of the threads is one of the most influential factors for overall performance, so the threads need to make careful use of the hierarchical memory model of a GPU.

Compatibility of OKEA for a GPU

OKEA is based on the notion of merging two lists, as depicted in Figure 3.1. The same notion is used to merge multiple lists: n lists are merged by merging two lists n - 1 times, which suggests that the merging process can be parallelized. Consider a DPA attack on AES-128, i.e., merging 16 sub-key lists. The simplest approach to parallelize the enumeration is to run 8 threads that merge the initial 16 sub-key lists, each thread merging 2 lists of 1-byte sub-key candidates. These 8 threads yield 8 lists of 2-byte sub-key candidates, which are then merged using 4 threads, and so on until we reach the final list of 16-byte key candidates. The overall process thus requires only 15 threads in total, which is far too few for a GPU that needs hundreds of thousands of threads for full efficiency. Moreover, since the threads at each level work on the output of the threads at the previous level, there is a high dependency between threads, and each thread has to merge two entire lists, which is far too much work for a single GPU thread.

Lastly, OKEA consumes a large amount of memory, which means that the GPU's global memory would have to be used. As global memory is the slowest memory on a GPU, it could severely limit the performance of OKEA. From all of these observations, OKEA does not appear to be a good fit for a GPU.

Compatibility of SKEA for a GPU

SKEA requires a discrete, score-based ranking of the sub-key lists and enumerates keys in decreasing order of their cumulative score: it first generates all possible keys with the maximum cumulative score, then all keys with the next highest cumulative score, and so on. The enumeration process of SKEA is presented in Algorithm 6. We can effectively parallelize this enumeration by launching one thread per cumulative score, where each thread generates all keys for its particular cumulative score. In this way, a large number of threads can be launched, and the threads are independent since they require no communication. Moreover, SKEA consumes a constant, limited amount of memory. As explained in Section 3.2, SKEA needs to store three main tables for enumeration, referred to as sortedScore, cumulativeMaximumIndex and cumulativeMinimumIndex. The sortedScore table contains the sorted sub-key lists, while cumulativeMaximumIndex and cumulativeMinimumIndex contain the maximal and minimal indexes of the sub-key lists for all possible cumulative scores. The tables are shown in Figure 3.6. As the figure shows, both index tables resemble sparse matrices, so with careful consideration they can be stored in the shared memory of a GPU, which is extremely fast compared to global memory. We may therefore conclude that SKEA is a good fit for a GPU.
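As a rough illustration of this mapping, the hedged CUDA outline below assigns one thread per cumulative score and stages the sortedScore table in shared memory. Everything here is an assumption made for illustration (table sizes, names, layout), and the per-score enumeration loop of Algorithm 6 is deliberately omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Assumption: 16 sub-key positions, 256 candidates each, scores fit in int8_t.
#define NUM_BYTES 16
#define NUM_CAND  256

// sortedScore: [NUM_BYTES * NUM_CAND] per-byte scores (assumed layout);
// maxScore:    highest possible cumulative score;
// keysOut:     placeholder for wherever enumerated keys would be written.
__global__ void skeaEnumerateOutline(const int8_t *sortedScore, int maxScore, uint8_t *keysOut)
{
    // Stage the score table in shared memory: it is small (4 KiB) and
    // read repeatedly by every thread in the block.
    __shared__ int8_t s_score[NUM_BYTES * NUM_CAND];
    for (int i = threadIdx.x; i < NUM_BYTES * NUM_CAND; i += blockDim.x)
        s_score[i] = sortedScore[i];
    __syncthreads();

    // One thread per cumulative score, highest scores first.
    int myScore = maxScore - (blockIdx.x * blockDim.x + threadIdx.x);

    // Here each thread would enumerate every 16-byte key whose per-byte scores
    // (looked up in s_score) sum to myScore, following Algorithm 6, and write
    // the results to keysOut. That loop is omitted in this sketch.
    (void)myScore;
    (void)keysOut;
}
```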

Proposed Solution for Future Research

In the previous section, we analyzed the suitability of OKEA and SKEA for a GPU and saw that SKEA is a much better fit than OKEA. We can therefore further improve the performance of the proposed key recovery solution by implementing SKEA on a GPU. We expect that a GPU implementation of SKEA would significantly increase its throughput, which in turn would require a faster key validation process. For this purpose, we may utilize Intel's Advanced Encryption Standard New Instructions (AES-NI), as only around 64 clock cycles are required to test a key on Intel's Haswell architecture [8]. Thus, the solution depicted in Figure 6.5, i.e., SKEA on a GPU and AES-NI based key validation on a CPU, may yield an even more powerful key recovery solution for side-channel attacks on the AES cipher. This solution would ultimately open new research directions in this area.

Figure 6.5: Proposed solution for future research
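To give a feel for the CPU side of this future design, the following hedged C++ sketch tests a single AES-128 candidate key with AES-NI intrinsics (key expansion, ten rounds, ciphertext comparison; compile with -maes). It is an illustrative stand-alone routine rather than code from the thesis, and a real validator would batch and pipeline many candidates; the main function simply checks the FIPS-197 test vector.

```cpp
#include <wmmintrin.h>   // AES-NI intrinsics
#include <cstdint>
#include <cstring>
#include <cstdio>

// One step of the AES-128 key schedule using the aeskeygenassist result.
static inline __m128i expandStep(__m128i key, __m128i gen) {
    gen = _mm_shuffle_epi32(gen, _MM_SHUFFLE(3, 3, 3, 3));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, gen);
}
#define EXPAND(k, rcon) expandStep((k), _mm_aeskeygenassist_si128((k), (rcon)))

// Returns true if candidateKey encrypts pt to the reference ciphertext refCt.
static bool testKey(const uint8_t candidateKey[16],
                    const uint8_t pt[16], const uint8_t refCt[16]) {
    __m128i rk[11];
    rk[0]  = _mm_loadu_si128((const __m128i *)candidateKey);
    rk[1]  = EXPAND(rk[0], 0x01); rk[2]  = EXPAND(rk[1], 0x02);
    rk[3]  = EXPAND(rk[2], 0x04); rk[4]  = EXPAND(rk[3], 0x08);
    rk[5]  = EXPAND(rk[4], 0x10); rk[6]  = EXPAND(rk[5], 0x20);
    rk[7]  = EXPAND(rk[6], 0x40); rk[8]  = EXPAND(rk[7], 0x80);
    rk[9]  = EXPAND(rk[8], 0x1b); rk[10] = EXPAND(rk[9], 0x36);

    __m128i m = _mm_xor_si128(_mm_loadu_si128((const __m128i *)pt), rk[0]);
    for (int r = 1; r < 10; ++r) m = _mm_aesenc_si128(m, rk[r]);
    m = _mm_aesenclast_si128(m, rk[10]);

    uint8_t ct[16];
    _mm_storeu_si128((__m128i *)ct, m);
    return std::memcmp(ct, refCt, 16) == 0;
}

int main() {
    // FIPS-197 example: key, plaintext and expected ciphertext.
    const uint8_t key[16] = {0x2b,0x7e,0x15,0x16,0x28,0xae,0xd2,0xa6,
                             0xab,0xf7,0x15,0x88,0x09,0xcf,0x4f,0x3c};
    const uint8_t pt[16]  = {0x32,0x43,0xf6,0xa8,0x88,0x5a,0x30,0x8d,
                             0x31,0x31,0x98,0xa2,0xe0,0x37,0x07,0x34};
    const uint8_t ct[16]  = {0x39,0x25,0x84,0x1d,0x02,0xdc,0x09,0xfb,
                             0xdc,0x11,0x85,0x97,0x19,0x6a,0x0b,0x32};
    printf("correct key accepted: %d\n", testKey(key, pt, ct));
    return 0;
}
```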
