AES Cryptosystem Acceleration Using Graphics Processing Units. Ethan Willoner Supervisors: Dr. Ramon Lawrence, Scott Fazackerley

Overview
- Introduction
- Compute Unified Device Architecture (CUDA)
  - Hardware Capabilities
  - Threads
  - Memory Models
- Advanced Encryption Standard (AES)
  - Rijndael Algorithm
  - Hardware Acceleration and AES-NI
- CUDA Implementation of AES
- Applications to Databases
- Results
- Conclusion

Introduction
- Digital security is important
- Big data means a large volume of data to secure
- Graphics Processing Units (GPUs) are a worthwhile option for processing large data sets
- Prior research reports that encrypting and decrypting data on GPUs is faster than on traditional Central Processing Units (CPUs)
- But previous analysis has been lacking: it does not compare against real-world CPU implementations
- Database applications
- We need to consider parallelism vs. concurrency

Visual Example
- Gophers destroy old computer manuals
- Parallelism! Each gopher independently performs the same task
- Concurrency! Each gopher performs one of the tasks A, B, C or D
- Source: Rob Pike, https://talks.golang.org/2012/waza.slide

Compute Unified Device Architecture (CUDA)
- Created by NVIDIA and introduced in 2006
- A parallel computing runtime and programming model that uses Graphics Processing Units (GPUs) for massively parallel data processing
- Researchers were drawn to GPUs for their excellent parallel floating-point performance (lots of gophers, 1000s!)
- CUDA has become important for CAD, deep learning, medical research and imaging, and many other applications

CUDA Threads
- Similar to CPU threads, CUDA threads are the smallest unit of execution on the GPU
- Threads are grouped into thread blocks, and thread blocks are grouped into grids
- Thread blocks are scheduled across the streaming multiprocessors on the GPU
- Each multiprocessor may have 8, 16, or 32 blocks resident at a time, depending on the GPU generation
- CUDA threads are only effective when executing identical operations across a set of data
- GPUs are parallel but not concurrent
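The thread hierarchy above can be sketched as a host-side simulation in plain Python (this is not CUDA code). The function names and the block size of 256 threads are illustrative assumptions; the index formula mirrors the common `blockIdx.x * blockDim.x + threadIdx.x` idiom for one-dimensional grids.

```python
def blocks_needed(n_items, threads_per_block):
    """Ceiling division: enough blocks so that every item gets a thread."""
    return (n_items + threads_per_block - 1) // threads_per_block


def global_thread_index(block_idx, block_dim, thread_idx):
    """Flat index of a thread within a one-dimensional grid."""
    return block_idx * block_dim + thread_idx


def simulate_grid(n_items, threads_per_block=256):
    """Return the item index handled by each active thread, in launch order."""
    handled = []
    for block_idx in range(blocks_needed(n_items, threads_per_block)):
        for thread_idx in range(threads_per_block):
            i = global_thread_index(block_idx, threads_per_block, thread_idx)
            if i < n_items:  # bounds guard: the last block may be partly idle
                handled.append(i)
    return handled
```

On a real GPU the two loops do not exist: every (block, thread) pair runs the same kernel body at once, which is why identical operations per thread matter.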

CUDA Memory Models
- GPUs use their own on-device DRAM, separate from the host system's RAM, for bulk data storage; this is called global memory
- Sharing data between the GPU and host can be managed by the programmer or by CUDA using unified memory
  - Manually managed: explicitly copy memory between CPU DRAM and GPU DRAM
  - Unified memory: the CUDA runtime copies memory as it is needed. Simpler, but slower.
- Image: By NVIDIA (NVIDIA CUDA Programming Guide version 3.0) [CC BY 3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons

Putting it together
1. Input data is copied to global memory
2. Launch the GPU kernel
3. Kernel processes data in parallel
4. Output data is copied to main memory
Image: By Tosaka - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=5140417

Advanced Encryption Standard (AES)
- AES is a symmetric block cipher chosen by the National Institute of Standards and Technology (NIST) to serve as a standard encryption system for the US government
- The Rijndael algorithm was submitted by Joan Daemen and Vincent Rijmen
- Chosen based on features such as estimated security, performance, and cryptanalysis resistance
- Supports key sizes of 128, 192, or 256 bits for security/performance trade-offs
- AES is heavily used in all areas of computing, including secure web browsing (HTTPS), disk encryption (BitLocker, FileVault), and databases, and protects data stored by hospitals, governments, universities, etc.

Rijndael Algorithm
- Every block of 16 bytes is treated as a 4x4 matrix of bytes
- Encryption applies four operations:
  - AddRoundKey: combine (XOR) a round key derived from our secret key with the data block
  - SubBytes: substitute every byte with a value from a lookup table
  - ShiftRows: cyclically shift each row of the matrix
  - MixColumns: transform each column of the matrix with an invertible linear transformation
- For decryption, each operation has a corresponding inverse function
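The four operations can be sketched in Python as follows. This is a minimal illustration rather than a complete or hardened AES: key expansion, the final round, and decryption are omitted, and the helper names (`gmul`, `build_sbox`, `to_state`, and so on) are chosen for this sketch. The S-box is generated from its standard definition (inverse in GF(2^8) followed by an affine transform) instead of being typed in as a table.

```python
def xtime(a):
    """Multiply a byte by 2 in GF(2^8), modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def _rotl8(b, n):
    return ((b << n) | (b >> (8 - n))) & 0xFF


def build_sbox():
    """SubBytes lookup table: field inverse, then the AES affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gmul(x, y) == 1)
        sbox.append(inv ^ _rotl8(inv, 1) ^ _rotl8(inv, 2)
                    ^ _rotl8(inv, 3) ^ _rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def to_state(block):
    """16 bytes -> 4x4 matrix of rows, filled column by column."""
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def from_state(state):
    return bytes(state[r][c] for c in range(4) for r in range(4))


def add_round_key(state, round_key):
    return [[s ^ k for s, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]


def sub_bytes(state):
    return [[SBOX[b] for b in row] for row in state]


def shift_rows(state):
    # Row r is rotated left by r positions.
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def mix_columns(state):
    out = [[0] * 4 for _ in range(4)]
    for c in range(4):
        col = [state[r][c] for r in range(4)]
        for r in range(4):
            out[r][c] = (gmul(2, col[r]) ^ gmul(3, col[(r + 1) % 4])
                         ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return out


def middle_round(state, round_key):
    """One non-final round: SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    return add_round_key(mix_columns(shift_rows(sub_bytes(state))), round_key)
```

The data-dependent `SBOX[b]` lookup in `sub_bytes` is the kind of table access a GPU implementation has to place carefully in memory.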

Rijndael Algorithm cont.
- Rijndael encrypts one block at a time. Encrypting each block independently (Electronic Codebook mode, ECB) exposes patterns in the data, because identical plaintext blocks produce identical ciphertext blocks
- A chaining mode such as Cipher Block Chaining (CBC) removes these patterns and defeats pattern-based analysis of our data
- Image: Tux the Penguin, the Linux mascot. Created in 1996 by Larry Ewing with The GIMP.
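The difference between the two modes can be shown with a short Python sketch. Since the standard library has no AES, `toy_encrypt_block` below is a hypothetical stand-in: a small Feistel permutation built on SHA-256 that is NOT secure and is NOT AES. It only needs to behave like a keyed 16-byte block cipher for the mode logic to be visible.

```python
import hashlib

BLOCK = 16  # bytes, the same block size as AES


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt_block(key, block):
    """Toy 4-round Feistel permutation standing in for AES. Not secure."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        f = hashlib.sha256(key + bytes([rnd]) + right).digest()[:8]
        left, right = right, _xor(left, f)
    return left + right


def split_blocks(data):
    assert len(data) % BLOCK == 0, "sketch assumes input is already padded"
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(key, plaintext):
    # Every block is encrypted on its own, so equal inputs give equal outputs.
    return b"".join(toy_encrypt_block(key, b) for b in split_blocks(plaintext))


def cbc_encrypt(key, iv, plaintext):
    # Each block is XORed with the previous ciphertext block before encryption.
    out, prev = [], iv
    for b in split_blocks(plaintext):
        prev = toy_encrypt_block(key, _xor(b, prev))
        out.append(prev)
    return b"".join(out)
```

Encrypting three identical plaintext blocks with `ecb_encrypt` yields three identical ciphertext blocks, while `cbc_encrypt` yields three different ones. That is the effect the Tux image illustrates.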

Hardware Acceleration and AES-NI
- The ubiquity of AES led Intel and AMD to add hardware acceleration for it to many of their CPUs starting in 2010
- This acceleration is exposed through an x86 instruction set extension called AES-NI (Advanced Encryption Standard New Instructions)
  - Originally only available on server CPUs
  - Quickly became common in consumer CPUs as well
  - Limited uptake on mobile CPUs
  - Intel claims up to 10x faster than software implementations
- Why do we care? Because of OpenSSL
  - OpenSSL, of Heartbleed infamy, is one of the most common cryptographic libraries
  - OpenSSL is commonly used by databases, web servers, and high-performance applications
  - OpenSSL supports AES-NI hardware acceleration

CUDA Implementation of AES
- The implementation is massively parallel and high performance
- Shared memory is heavily utilized
  - Our lookup tables and encryption/decryption keys are read often
  - Lookup-table accesses are especially slow because the index depends on the data, so locality cannot be reliably exploited by the multiprocessor cache. Shared memory helps lessen these effects
- Each block of data is assigned to its own thread. This gives us the best parallelism possible, as there is no thread divergence
- Utilization of CUDA streams
  - Overlaps data transfer and kernel execution
  - Copy a large data set in smaller pieces
  - Launch shorter-running kernels more often
  - We hide some of the copy time behind kernel execution
- Reference: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#asynchronous-transfers-and-overlapping-transfers-with-computation
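The benefit of streams can be illustrated with a rough timing model in Python. This is a sketch, not a measurement: `pipeline_time` and the stage durations are hypothetical, the model assumes the host-to-device copy, the kernel, and the device-to-host copy can each work on a different chunk at the same time, and it ignores kernel launch and per-transfer overhead.

```python
def pipeline_time(stage_times, n_chunks):
    """Finish time when the work is split into n equal chunks that flow
    through stages, each of which handles one chunk at a time."""
    per_chunk = [t / n_chunks for t in stage_times]
    stage_free = [0.0] * len(stage_times)  # time at which each stage is idle
    for _ in range(n_chunks):
        ready = 0.0  # time at which this chunk has left the previous stage
        for s, t in enumerate(per_chunk):
            start = max(ready, stage_free[s])
            ready = start + t
            stage_free[s] = ready
    return stage_free[-1]


# Hypothetical totals in milliseconds: copy in, kernel, copy out.
stages = [4.0, 2.0, 4.0]
no_streams = pipeline_time(stages, 1)    # everything runs back to back
with_streams = pipeline_time(stages, 4)  # four chunks overlap across stages
```

With these made-up numbers the model gives 10.0 ms without chunking and 5.5 ms with four chunks. In practice the per-chunk overhead the model ignores limits how small the pieces can usefully be.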

Database Applications
- Recall the Cipher Block Chaining mode of AES
  - More secure, resistant to pattern analysis
  - In encryption, each block relies on the previous block's output, which makes parallelizing encryption within a single chain impossible
  - Decryption is fully parallelizable: each block needs only ciphertext blocks, which are all available up front
- Databases do not store tables in single large files
  - Chunks of data, called pages, are encrypted separately
  - We can parallelize the encryption of page sets by assigning each page of data to a CUDA thread
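The dependency structure can be sketched in Python. As an assumption for illustration, a toy Feistel permutation (`toy_encrypt_block` / `toy_decrypt_block`, not secure and not AES) stands in for the block cipher, and a `ThreadPoolExecutor` stands in for the GPU threads; it shows which units of work are independent rather than delivering real speedup. The per-page IVs are supplied by the caller, and how a database would choose them is outside this sketch.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _f(key, rnd, half):
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]


def toy_encrypt_block(key, block):
    """Toy Feistel permutation standing in for AES. Not secure."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        left, right = right, _xor(left, _f(key, rnd, right))
    return left + right


def toy_decrypt_block(key, block):
    left, right = block[:8], block[8:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _f(key, rnd, left)), left
    return left + right


def split_blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def cbc_encrypt(key, iv, plaintext):
    # Serial: block i cannot start until ciphertext block i - 1 exists.
    out, prev = [], iv
    for b in split_blocks(plaintext):
        prev = toy_encrypt_block(key, _xor(b, prev))
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(key, iv, ciphertext):
    # Independent: each output needs only two ciphertext blocks, which are
    # all known up front, so every iteration could run on its own thread.
    blocks = split_blocks(ciphertext)
    prevs = [iv] + blocks[:-1]
    return b"".join(_xor(toy_decrypt_block(key, c), p)
                    for c, p in zip(blocks, prevs))


def encrypt_pages(key, ivs, pages):
    """Each page is its own CBC chain, so pages can be encrypted in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda iv, page: cbc_encrypt(key, iv, page),
                             ivs, pages))
```

The chain inside one page stays serial; the parallelism comes from having many pages, which matches the one-page-per-thread assignment described above.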

Hardware and Testing
- Test machine:
  - AMD 8350 8-core CPU
  - 16 GB DDR3 RAM
  - Nvidia GTX 780 with 3 GB VRAM
- We aim to test the following:
  - CPU throughput processing small to large numbers of data pages
  - CPU throughput when parallelized across up to 8 cores using OpenMP
  - GPU throughput both with and without streams
  - GPU throughput with a copyless architecture

Results - Throughput (figure slides; no text survives in the transcript)

- Recall our memory layout diagram...
- The GPU is at an inherent disadvantage because we must copy data to it before processing
  - The CPU can start working immediately
- Devices on high-speed PCI links, such as storage or network devices, can copy data straight to GPU memory
  - Bypass the CPU DRAM entirely
- This is now a more accurate comparison of systems: the CPU is only needed to launch the GPU application, not to perform the data copies
  - Data starts off on the GPU, eliminating the data copy expense
  - Now the GPU is in the same position as the CPU
- Helps to offload parallelizable tasks from the CPU and leave it free to perform concurrent tasks

Results - Throughput (figure slides; no text survives in the transcript)

Conclusion
- Previous researchers have concluded that GPUs are better at AES than CPUs
- OpenSSL with hardware support provides real competition to CUDA implementations
  - Many of the papers I found ignored AES-NI completely
- This research has shown that GPUs are, in fact, not well suited to many workloads
  - Workloads with small data are not as performant as previously claimed
- Picking the correct tool for the workload is incredibly important
- There is definitely a place for GPU encryption in environments that deal with large amounts of encrypted data
- But we need to change how we think about the problem and aim to remove memory copy costs entirely

Questions?