On-The-Fly AES Key Expansion For All Key Sizes on ASIC

P. V. Sriniwas Shastry (1), M. S. Sutaone (2)
(1) Cummins College of Engineering for Women, Pune; (2) College of Engineering, Pune
pvs.shastry@cumminscollege.in

Abstract

This paper proposes the design and implementation of On-The-Fly (OTF) computation of the round keys of the Advanced Encryption Standard (AES) for all key sizes. The OTF architecture generates 128-bit round keys for input cipher keys of 128, 192 and 256 bits. The implementation targets 180 nm CMOS technology using standard cell libraries. The key expansion unit is designed so that it can be used for both AES encryption and decryption. The design was clocked at 179 MHz and generates 128-bit round keys at a throughput of 22.912 Gbps.

Key words: On-The-Fly Key Expansion, AES, Very Large Scale Integration (VLSI), All key sizes.

1. Introduction

The Advanced Encryption Standard (AES) is a symmetric-key block cipher [1]. The rapidly growing need for secure data communication on mobile computing platforms and portable devices has increased the demand for hardware implementations of strong encryption standards such as AES. Hardware implementations of AES are generally regarded as more reliable and more resistant to attacks. The need for higher speed and higher security has led many researchers to implement cryptographic algorithms on FPGA and ASIC platforms. AES has been implemented using rolled, pipelined and sub-pipelined architectures. Several published AES implementations target very low area, while others target high throughput. Rolled architectures have achieved minimum silicon area and low power, whereas pipelined architectures have achieved throughputs of several tens of Gbps.
Still better results were achieved in these architectures by optimizing the substitute box and MixColumns operations of AES. In OTF computation, the round keys required by the encryption or decryption block are computed in the key expansion unit without needing memory to store them [2]. Instead of dedicated key expansion units for each key length, an architecture that supports all key lengths and combines key generation for encryption and decryption can significantly reduce the hardware cost of a full-key-length AES [3]. Computing the substitute byte function on the fly with composite field arithmetic reduces the complexity of the multiplicative inverse in GF($2^8$), which has further reduced power consumption and increased speed [2][3]. Implementing the substitute byte function involves handling the nonlinearity of the multiplicative inverse of an input byte. Because the substitute byte operation works on a single byte, an AES implementation with a 128-bit data path requires sixteen such functions operating concurrently. The substitute byte operation is also needed, at the same time, for key expansion.

In this paper we present an OTF architecture for round key generation for all cipher key sizes. The substitute byte operation is performed by a combinational circuit and hence requires no memory elements. The design uses limited resources, with only one 256-bit register for all key sizes. The rest of the paper is organized as follows. Section 2 describes the key expansion unit, Section 3 presents our proposed architecture, and Section 4 gives the results and compares them with those of others. Section 5 concludes this work.

2. Key Expansion for AES

The key expansion unit of AES takes a cipher key and runs a key expansion routine to generate the round keys, whose number depends on the size of the original cipher key.
The key expansion routine generates the 128-bit round keys required by the AddRoundKey operation of encryption, or the InvAddRoundKey operation of decryption, from a 128-bit, 192-bit or 256-bit input cipher key. The number of rounds (Nr) to be performed depends on the key size and is listed in Table I. Nb is the number of 32-bit words in a round key. The key expansion unit performs the RotWord and SubWord operations and an XOR with RCON. Each of these sub-operations is explained below.

The RotWord operation is a cyclic rotation of the bytes within a word to the left. This operation is applied only to the least significant word of the cipher key. Let the
4-byte word be represented as w[i], with i in the range $0 \le i < Nb(Nr+1)$; the RotWord operation is then applied to the word w[$i_k$-1], where $i_k$ satisfies the condition $i_k \bmod Nk = 0$. The value of Nk is 4, 6 or 8 for 128-bit, 192-bit or 256-bit cipher keys respectively.

SubWord is the SubstituteByte transformation applied independently to each byte of the word w[$i_k$-1] after the RotWord operation has been performed. The 256-bit cipher key is an exception: there SubWord is applied to the word w[$i_k$-1] when either $i_k \bmod Nk = 0$ or $i_k \bmod Nk = 4$ is satisfied.

RCON is the round constant word, which is XORed with the substituted word after the SubWord operation. The RCON array holds the words [$x^{i-1}$, {00}h, {00}h, {00}h], where the index i starts at 1 and not 0. The values $x^{i-1}$ are powers of x, with x denoted as {02}h in GF($2^8$).

Every other word w[i] is equal to the XOR of the previous word, w[i-1], and the word Nk positions earlier, w[i-Nk]. Refer to Figure 1. The key expansion may process 128, 192 or 256 bits in each iteration, but the round key supplied to the AddRoundKey operation in encryption, or the InvAddRoundKey operation in decryption, is always 128 bits. This is because the encryption or decryption data path is always 128 bits wide, while the width of the key expansion path differs with the key size.

Table I. Key expansion combinations

Cipher key size   Rounds (Nr)   Words per expansion (Nk)   Words per round key (Nb)
128-bit           10            4                          4
192-bit           12            6                          4
256-bit           14            8                          4

The derivation of round keys from the expanded keys is illustrated in Figure 2. In all there are Ne key expansions, depending on the key size, where Ne is computed as shown in equation (1).

$$Ne = \frac{Nr \times Nb}{Nk} \qquad (1)$$

Substituting the values of Nr, Nb and Nk from Table I gives Ne = 10, 8 and 7 for 128-bit, 192-bit and 256-bit keys respectively. The round keys are required in reverse order in the decryption data path.
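The word recurrence described in this section can be sketched in software. The following is a minimal Python model, not the hardware design of this paper. The function names (`gf_mul`, `sbox`, `sub_word`, `rot_word`, `expand_key`) are illustrative assumptions, and the substitute byte is computed directly from the GF($2^8$) inverse and affine map defined in [1], not with a composite-field circuit.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox(a):
    """Substitute byte: multiplicative inverse followed by the affine map."""
    inv = 1
    for _ in range(254):              # a^254 equals a^-1; 0 maps to 0
        inv = gf_mul(inv, a)
    out = inv
    for s in (1, 2, 3, 4):            # XOR with four left rotations of inv
        out ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
    return out ^ 0x63


def sub_word(w):
    """Apply the substitute byte function to each byte of a 32-bit word."""
    return int.from_bytes(bytes(sbox(b) for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Rotate the four bytes of a 32-bit word one byte to the left."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Return Nr + 1 round keys, each a list of four 32-bit words."""
    if len(key) not in (16, 24, 32):
        raise ValueError("cipher key must be 128, 192 or 256 bits")
    nk = len(key) // 4                # words per expansion (Table I)
    nr = nk + 6                       # rounds: 10, 12 or 14
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 1                          # x^(i-1), starting at x^0
    for i in range(nk, 4 * (nr + 1)):
        t = w[i - 1]
        if i % nk == 0:               # RotWord, SubWord, XOR with RCON
            t = sub_word(rot_word(t)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)
        elif nk == 8 and i % nk == 4: # 256-bit keys only: SubWord alone
            t = sub_word(t)
        w.append(w[i - nk] ^ t)
    return [w[4 * r:4 * r + 4] for r in range(nr + 1)]
```

For the 128-bit example cipher key 2b7e151628aed2a6abf7158809cf4f3c of [1], `expand_key` returns eleven round keys, and the first derived word w[4] is a0fafe17. The model stores every word in a list for clarity, which is exactly the memory that an OTF unit avoids.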
Since decryption needs the round keys in reverse order, the round keys expanded during encryption are normally stored in memory so that they can be retrieved in reverse order during decryption.

For 128-bit and 192-bit keys:

$$w[i-1]^{*} = \mathrm{SubWord}(\mathrm{RotWord}(w[i-1]))$$

$$w[i] = \begin{cases} (w[i-1]^{*} \oplus \mathrm{RCON}[i/Nk]) \oplus w[i-Nk], & \text{for } i \bmod Nk = 0 \\ w[i-1] \oplus w[i-Nk], & \text{for other values of } i \end{cases}$$

For 256-bit keys:

$$w[i-1]^{*} = \mathrm{SubWord}(\mathrm{RotWord}(w[i-1]))$$

$$w[i] = \begin{cases} (w[i-1]^{*} \oplus \mathrm{RCON}[i/Nk]) \oplus w[i-Nk], & \text{for } i \bmod Nk = 0 \\ \mathrm{SubWord}(w[i-1]) \oplus w[i-Nk], & \text{for } i \bmod Nk = 4 \\ w[i-1] \oplus w[i-Nk], & \text{for other values of } i \end{cases}$$

Figure 1. Computation of key expansion. Here Nk is the number of words per expansion and Nb the number of words per round key, as listed in Table I.

Figure 2. Key expansion for different key sizes: (a) 128-bit key, (b) 192-bit key, (c) 256-bit key.

3. Proposed Architecture for OTF Key Expansion

Our proposed architecture makes use of a 256-bit register, which temporarily holds the round keys. The
size of the register is chosen to accommodate the expanded key for all three key sizes. The round keys are generated over multiple iterations, and after every iteration the 128-bit round key needed for the AddRoundKey operation is placed in the upper half of the register. As shown in Figure 3, a multiplexer swaps the key expanded in the earlier iteration so that the round key is placed in the upper half of the 256-bit register.

A common architecture is designed for all three key sizes. The most critical part of the architecture is managing the different number of expansion iterations for each key size while keeping the round key size at 128 bits. We assume that the encryption and decryption data path is implemented as a rolled architecture, so that every clock event in the data path results in one round of encryption or decryption. Hence the key expansion unit also has to generate one round key per clock cycle, and this condition applies to all three key sizes. As shown in Figure 1, specific words are processed with RotWord and SubWord and then XORed with RCON. The round key generation per clock cycle is based on the 128-bit key expansion procedure. To match the timing for the different key sizes, the original key and the subsequent round keys are shuffled after every clock cycle. The advantage of data shuffling is that only four data processing elements are required to complete the key expansion for all three key sizes [7]. Figure 3 shows this arrangement and the key expansion architecture.

A round counter is maintained to generate the select lines for the multiplexers. In 128-bit key expansion, each clock cycle generates one round key through one expansion iteration. Hence a total of 10 clock cycles is needed to generate the round keys with 128-bit expansion. In 192-bit key expansion, every three clock cycles generate three round keys through two expansion iterations, so 12 clock cycles are required.
When expanding 256-bit keys, every two consecutive clock cycles generate two round keys through one expansion iteration, resulting in a total of 14 clock cycles. These clock cycle counts exactly match the number of rounds in the encryption and decryption data paths.

Figure 3(a) shows the swapping of words for 192-bit and 256-bit key expansion. For 128-bit key expansion no swapping of words is needed, and hence the data lines connect vertically straight down to the corresponding word. In 128-bit key expansion, the words $w_4, w_8, w_{12}, w_{16}$, etc. undergo the extra computations of RotWord, SubWord and XOR with RCON. Similarly, in 192-bit expansion the words $w_6, w_{12}, w_{18}, w_{24}$, etc. undergo the same extra computations. In 256-bit expansion the words $w_8, w_{16}, w_{24}, w_{32}, \ldots, w_{56}$ undergo RotWord, SubWord and XOR with RCON, while the words $w_{12}, w_{20}, w_{28}, w_{36}, \ldots, w_{52}$ undergo only the SubWord operation.

Figure 3. (a) Data swapping strategy. (b) All key size key expansion architecture.

The word multiplexers in Figure 3(b) select either the input cipher key or the swapped data word from the previous key expansion iteration, based on the swapping strategy shown in Figure 3(a). In our architecture the control signals that select the multiplexer data lines are generated by a sequential machine; no processor is employed, unlike in [7]. The architecture also performs the reverse expansion of the round keys for the decryption data path.
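The bookkeeping described in this section can be checked with a short script. The sketch below is an illustrative Python model, not the control logic of Figure 3, and the function name `special_words` is an assumption. It lists the word indices that receive RotWord, SubWord and the RCON XOR, the indices that receive SubWord only, and the clock cycle count, assuming one 128-bit round key per clock cycle as stated above.

```python
def special_words(nk):
    """Return (full, sub_only, cycles) for a cipher key of nk 32-bit words.

    full     : indices i of words w[i] computed with RotWord, SubWord and RCON
    sub_only : indices computed with SubWord only (256-bit keys)
    cycles   : clock cycles needed at one 128-bit round key per cycle
    """
    nb = 4                      # words per round key (Table I)
    nr = nk + 6                 # rounds: 10, 12 or 14
    total = nb * (nr + 1)       # total number of expanded words
    full = [i for i in range(nk, total) if i % nk == 0]
    sub_only = [i for i in range(nk, total) if nk == 8 and i % nk == 4]
    return full, sub_only, nr


for nk in (4, 6, 8):
    full, sub_only, cycles = special_words(nk)
    # len(full) is the number of expansion iterations, Ne in equation (1)
    print(f"{32 * nk}-bit key: Ne = {len(full)}, cycles = {cycles}")
    print("  RotWord/SubWord/RCON words:", full)
    print("  SubWord-only words:", sub_only)
```

The length of the first list reproduces Ne from equation (1), that is 10, 8 and 7 expansion iterations, while the cycle counts are 10, 12 and 14. For a 256-bit key the two lists run from 8 to 56 and from 12 to 52 in steps of 8, as described in the text.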
In our proposed architecture, splitting the 256-bit data shuffling multiplexers [7] into word multiplexers has reduced the power consumption, because the unselected multiplexers remain inactive, resulting in lower dynamic power consumption. The input at port 0 of each multiplexer carries the cipher key given by the user. The input at port 1 is used for 128-bit expansion, port 2 for 192-bit expansion and port 3 for 256-bit key expansion.

The design was synthesized using Cadence RTL Compiler with 180 nm standard cell libraries. The design met timing at a clock frequency of 179 MHz with a worst-case slack of 495 ps. Irrespective of the key size, every clock cycle generates one 128-bit round key, giving a throughput of 22.912 Gbps. The throughput is calculated using equation (2).

$$\text{Throughput} = 128 \times \text{Clock Frequency} \qquad (2)$$

For a clock frequency of 179 MHz this gives 128 × 179 × $10^6$ = 22.912 Gbps. The synthesis results are presented in Table II. The physical layout in 180 nm was performed using Cadence SoC Encounter. The complete design fits into an area of 61153 µm², with a core density of 70%.

Table II. Synthesis result

Particulars                        Values
Standard Cell Instances            3231
Standard Cell Area                 16273
Power dissipation                  1.79 mW
Slack                              495 ps
Clock Frequency                    179 MHz
Physical Area (Physical Layout)    61153 µm²

4. Results and Comparison

We have implemented the OTF key expansion for all key sizes using TSMC 180 nm cell libraries. We compare our implementation results with others in Table III. The design in [7] is a similar implementation; it was clocked at 102 MHz and achieved approximately 13.056 Gbps. The design in [8] also implemented an OTF key expansion unit, but only for the 128-bit key size. A similar implementation for different key sizes was proposed in [4], but it was implemented in 250 nm technology and uses 26,639 gates, which is considerably higher than our gate count.
The design proposed in [6] was also implemented in 180 nm technology, but uses a pipelined architecture for 128-bit OTF key expansion. That design uses a 32-bit data path and achieves 10.656 Gbps.

Table III. Implementation comparison

Particulars          [4]                     [7]                     Ours
CMOS Technology      250nm                   180nm                   180nm
Frequency (MHz)      66                      102                     179
Throughput (Gbps)    8.448                   13.056                  22.912
Gates                26,639                  26,639                  16,284
Key sizes            128, 192 and 256-bit    128, 192 and 256-bit    128, 192 and 256-bit
Data path depth      128 bits                128 bits                128 bits

5. Conclusion

We have presented a new optimization method for implementing On-The-Fly key expansion for all key sizes in 180 nm technology: the multiplexers are split into word multiplexers, which are kept inactive when not in use, in particular during 128-bit and 192-bit key expansion. This has reduced not only the number of gates required but also the dynamic power consumption.

6. References

[1] "Advanced Encryption Standard (AES)", Federal Information Processing Standards Publications (FIPS PUBS) Publication 197, November 2001.
[2] Qingfu Cao, Shuguo Li, A high throughput cost-effective ASIC implementation of the AES algorithm, Proc. IEEE 8th International Conference on ASIC (ASICON), 2009, pp. 805-808.
[3] Po-Chun Lie, Chang Hsie-Chia, Chen-Yi Lee, A 1.69Gbps area-efficient AES Crypto Core with compact on the fly key expansion unit, Proc. ESSCIRC, 2009, pp. 404-407.
[4] Chih-Pin Su, Chia-Lung Horng, Chih-Tsun Huang, Cheng-Wen Wu, A configurable AES processor for enhanced security, Proc. ASP-DAC, 2005, pp. 361-366.
[5] Shen-Fu Hsiao, Ming-Chih Chen, Chia-Shin Tu, Memory-free low cost designs of advanced encryption standard using common subexpression elimination for subfunctions in transformations, IEEE Trans. Circuits and Systems-I: Regular Papers, Vol. 53, No. 3, March 2006, pp. 615-626.
[6] P. Saravanan, N. Renukadevi, G. Swathi, P. Kalpana, A high-throughput ASIC implementation of configurable advanced encryption standard (AES), Proc. IJCA Special Issue on Network Security and Cryptography (NSC), 2011.
[7] Mao-Yin Wang, Chih-Pin Su, Chia-Lung Horng, Chen-Wen Wu, Chih-Tsun Huang, Single and multi-core configurable AES architectures for flexible security, IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 18, No. 4, April 2010, pp. 541-551.
[8] A. Alma'aitah, Zine-Eddine Abid, Area efficient-high throughput sub-pipelined design of the AES in CMOS 180nm, Proc. 5th International Design and Test Workshop (IDT), 2010, pp. 31-36.