Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware
Florian Tramèr (joint work with Dan Boneh)
Intel, Santa Clara, August 30th, 2018

Trusted execution of ML: 3 motivating scenarios
1. Outsourced ML: data privacy; integrity (against model downgrade, disparate impact, and other malicious tampering)

2. Federated learning: data privacy; integrity (against poisoned model updates)

3. Trojaned hardware (the Verifiable ASICs model of Wahby et al.): integrity

Solutions
Cryptography:
1. Outsourced ML: FHE, MPC, (ZK) proof systems
2. Federated learning: no countermeasure against poisoning
3. Trojaned hardware: some root of trust is needed
Trusted Execution Environments (TEEs):
1. Outsourced ML: isolated enclaves
2. Federated learning: trusted sensors + isolated enclaves
3. Trojaned hardware: fully trusted (but possibly slow) hardware

Trusted execution: at what cost?
- Trusted ASICs (Wahby et al.): ~10^8x worse than the state of the art
- Intel SGX: paging kicks in at ~90MB of enclave memory
- VGG16 inference throughput (images/sec): GPU ~350 vs. SGX ~1
  (GPU: Nvidia TITAN XP; SGX: Intel Core i7-6700 Skylake, single core @ 3.40GHz)
  https://medium.com/@danny_harnik/impressions-of-intel-sgx-performance-22442093595a

How do we efficiently leverage TEEs for secure machine learning computations?
Idea: outsource work to a collocated, faster but untrusted device and verify the results. The TEE sends x; the device returns F(x) together with a proof.

                                        Computations          Required gap              Privacy
Verifiable ASICs (Wahby et al., 2016)   arithmetic circuits   ~8 orders of magnitude    no
Slalom                                  DNN inference         ~1-2 orders of magnitude  yes

Goal + threat model
- The adversary controls the rest of the software / hardware stack
- The model is known to the adversary (but not necessarily to the client)
- The user has a secure communication channel with the TEE
Goal: efficiently run DNN inference F(x)
- Integrity: the user obtains F(x) or aborts
- Privacy: the adversary learns nothing about x

Bottlenecks in deep neural networks: matrix multiplication accounts for ~97% of VGG16 inference time on one CPU core; the non-linear operations are cheap.

Outsourcing matrix multiplication: Freivalds' algorithm
Input: X ∈ F^(n×n), W ∈ F^(n×n) (the DNN weights, fixed at inference time)
Direct compute: Z = X·W, costing n^3 multiplications (or O(n^2.81) with Strassen)
Outsource + verify: sample r ∈ F^n uniformly at random and check Z·r =? X·(W·r)
Complexity: 3n^2 multiplications
Soundness: error probability 1/|F| (boost by repeating)
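The check above can be sketched in a few lines of numpy (helper name is illustrative; the modulus p = 2^24 - 3 matches the quantization described later in the talk):

```python
import numpy as np

P = 2**24 - 3  # prime modulus, consistent with the p < 2^24 quantization used later

def freivalds_check(X, W, Z, trials=1):
    """Probabilistically verify that Z == X @ W (mod P) in O(n^2) time.

    An incorrect Z passes each independent trial with probability <= 1/P.
    """
    for _ in range(trials):
        r = np.random.randint(0, P, size=W.shape[1], dtype=np.int64)
        # Only matrix-vector products: compare Z r against X (W r), all mod P.
        if not np.array_equal(Z @ r % P, X @ (W @ r % P) % P):
            return False
    return True
```

Repeating the check with fresh randomness drives the soundness error down to (1/P)^trials.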

Freivalds variants for arbitrary linear operators
Linear operator: z = F(x) = x·A, where A is a matrix of size |x| × |z|, x is a vector of size |x|, and z is a vector of size |z|.
Batched verification:
  Compute: [z_1 … z_B] = F([x_1 … x_B])
  Freivalds: r^T·[z_1 … z_B] =? F(r^T·[x_1 … x_B])
  Cost drops from B·cost(F) mults to B·(|x| + |z|) + cost(F) mults.
With precomputation:
  Precompute: Â = A·r = (∇_x F)(r)
  Freivalds: ⟨z, r⟩ =? ⟨x, Â⟩, costing only |x| + |z| mults (two inner products!)
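The precomputed variant might look as follows (a sketch with illustrative names; since the weights A are fixed at inference time, Â = A·r can be computed once offline):

```python
import numpy as np

P = 2**24 - 3  # prime modulus

def precompute(A, r):
    """One-time offline folding of the fixed weights A with randomness r: Â = A·r."""
    return A @ r % P

def check_with_preproc(x, z, r, A_hat):
    """Verify z == x @ A (mod P) using two inner products (|x| + |z| mults)."""
    return int(z @ r % P) == int(x @ A_hat % P)
```

Note that the randomness r must stay hidden from the prover across queries, since Â is fixed once precomputed.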

Handling convolutions
VGG16 dimensions: kernel K = 3×3, channels C from 64 to 512, output channels D up to 512, spatial size N from 14² to 224².

Operation                                                                       Multiplications
Compute: [z_1 … z_B] = im2col([x_1 … x_B]) * W                                  B·H·W·K²·C·D
Batched verify: r_1^T·[z_1 … z_B]·r_2 =? im2col(r_1^T·[x_1 … x_B]) * (W·r_2)    B·H·W·D + B·H·W·C + K²·C·D + H·W·K²·C
Preprocessing: ⟨z, r⟩ =? ⟨x, (∇_x F)(r)⟩                                        B·H·W·D + B·H·W·C

Preserving privacy: offline precomputation + online blinding
Offline: the TEE precomputes and stores R and R·W.
Online: a one-time pad over Z_p: the TEE sends X + R to the untrusted device, receives (X + R)·W, and unblinds the result using R·W.
What about secret sharing instead? The TEE could send X + R to one device and R to another, but can these devices be collocated yet non-colluding?
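A minimal sketch of the blinding round trip, with the honest untrusted device simulated in-process (function names are hypothetical):

```python
import numpy as np

P = 2**24 - 3  # prime modulus
rng = np.random.default_rng(0)

def untrusted_matmul(X_blinded, W):
    # Runs on the fast, untrusted device: it only ever sees X + R.
    return X_blinded @ W % P

n = 4
W = rng.integers(0, P, size=(n, n))

# Offline (inside the TEE): draw a one-time pad R and precompute R·W.
R = rng.integers(0, P, size=(n, n))
RW = R @ W % P

# Online: blind the private input, outsource, then unblind.
X = rng.integers(0, P, size=(n, n))   # private input
Y = untrusted_matmul((X + R) % P, W)
Z = (Y - RW) % P                       # equals X @ W mod P
```

Because R is uniform over Z_p and used once, X + R mod p is statistically independent of X, so the device learns nothing about the input.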

Slalom summary
Offline: precompute and store (R_i, R_i·W_i).
Online: the TEE sends X_1 + R_1 and the untrusted device returns Z_1' = (X_1 + R_1)·W_1. Then:
1. Unblind: Z_1 = Z_1' - R_1·W_1
2. Freivalds check for (X_1, W_1, Z_1)
3. X_2 = σ(Z_1), where σ is an arbitrary non-linearity
The TEE then sends X_2 + R_2, receives Z_2' = (X_2 + R_2)·W_2, and so on through the network.
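Putting blinding and verification together, one layer of the protocol might be sketched as below (illustrative names; the quantized ReLU interprets residues above p/2 as negative values, an assumption about the fixed-point encoding):

```python
import numpy as np

P = 2**24 - 3  # prime modulus

def quantized_relu(Z):
    # Treat residues in (P/2, P) as negative numbers, then clamp at zero.
    centered = np.where(Z > P // 2, Z - P, Z)
    return np.maximum(centered, 0)

def slalom_layer(X, W, R, RW, untrusted):
    """Blind, outsource, unblind, Freivalds-check, then apply the non-linearity."""
    Y = untrusted((X + R) % P, W)      # device computes (X + R)·W
    Z = (Y - RW) % P                   # unblind with the precomputed R·W
    r = np.random.randint(0, P, size=W.shape[1], dtype=np.int64)
    if not np.array_equal(Z @ r % P, X @ (W @ r % P) % P):
        raise RuntimeError("Freivalds check failed: untrusted device cheated")
    return quantized_relu(Z)
```

A cheating device is caught with probability 1 - 1/P per layer, at which point the TEE aborts.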

Slalom (some details)
Quantization: DNNs are typically trained / evaluated in floating point, but Freivalds and blinding require working over a ring / field. Quantize inputs & weights and work mod p (p < 2^24).
Integrity checks: evaluate the DNN on the fast device and store the inputs / outputs of all linear ops (close to no prover overhead). Sample r from Z_p and do the Freivalds check in double precision; verifier complexity is at least |x| + |z| double-precision multiplications per linear layer.
Blinding: store the unblinding factors R·W encrypted in untrusted memory. In the online phase, decrypt (and authenticate) R·W to unblind.
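The quantization step might be sketched as follows (SCALE is a hypothetical fixed-point scale, not a value from the talk):

```python
import numpy as np

P = 2**24 - 3   # prime modulus (p < 2^24)
SCALE = 256     # hypothetical fixed-point scale: 8 fractional bits

def quantize(x):
    """Map floats to fixed-point representatives in Z_P."""
    return np.round(np.asarray(x) * SCALE).astype(np.int64) % P

def dequantize(x, levels=1):
    """Centered lift back to floats; `levels` counts accumulated SCALE factors."""
    centered = np.where(x > P // 2, x - P, x)
    return centered / float(SCALE) ** levels
```

Multiplying two quantized values doubles the scale (levels=2), which is why values must stay well below p to avoid wraparound.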

Design & evaluation
Implementation:
- TEE: Intel SGX on a desktop CPU (single thread)
- Untrusted device: Nvidia Tesla GPU
- Port of the Eigen linear algebra C++ library to SGX (used in, e.g., TensorFlow)
Workloads:
- Microbenchmarks (see paper)
- VGG16 (a beefy, canonical feedforward neural network)
- MobileNet (a resource-efficient DNN tailored for low-compute devices)
  - Variant 1: standard MobileNet (see paper)
  - Variant 2: no intermediate ReLU in separable convolutions (this talk)

Verifiable inference
Throughput (images/sec):
- VGG16: compute 1; verify 1.7; verify with preprocessing 19.6
- MobileNet: compute 15.9; verify 30; verify with preprocessing 97.1
MobileNet's weights are only ~10MB, so they fit in the SGX cache. VGG16's weights take 500MB, so SGX has to page weights in and out of memory, a ~2-3x slowdown. Preprocessed weights W·r take up less memory and enable faster checks. Getting faster batched verification is difficult due to SGX memory limits.

Verifiable and private inference
Throughput (images/sec):
- VGG16: compute 1; outsource + integrity 19.6; outsource + privacy 13; outsource + both 10.2
- MobileNet: compute 15.9; outsource + integrity 97.1; outsource + privacy 80; outsource + both 54.9
Extra costs:
- The GPU has to operate in double precision
- Decrypting all unblinding factors R·W (AES-GCM)
- Regenerating all blinding factors R (PRG using AES)

Summary
- Large savings (6x to 20x) in outsourcing DNN inference while preserving integrity; sufficient for some use cases!
- More modest savings (3.5x to 10x) with input privacy; requires preprocessing.

Open questions
What other problems are (concretely) easier to verify than to compute?
- All NP-complete problems (but are those often outsourced?)
- What about something in P? Convex optimization, other uses of matrix multiplication, many graph problems (e.g., perfect matching)
What about Slalom for verifiable / private training?
- Quantization at training time is hard
- The weights change, so we can't preprocess them for Freivalds' check
- We assume the model is known to the adversary (e.g., the cloud provider)