PageVault: Securing Off-Chip Memory Using Page-Based Authentication. Blaise-Pascal Tine, Sudhakar Yalamanchili


Outline
- Background: Memory Security
- Motivation
- Proposed Solution
- Implementation
- Evaluation
- Conclusion

Background: Cloud Computing Threat Model
- Compute as a service deals with sensitive content: trading, banking, medical, legal, search, etc.
- Offloading data + computation

Background: Cloud Computing Threat Model
- Data encryption (SSL)
- Compute sandboxing (Intel SGX)
- Hardware security: can the administrator be trusted?

Background: Memory Attacks
- Attacks: snooping, spoofing, splicing, replay
- Defenses: encryption (against snooping), authentication (against spoofing, splicing, replay)

Background: Hash Generation
- Generate a unique seed
- Create a nonce from the seed
- Hash the block using the nonce
- Store the MAC for the integrity check
- Where to store it? On-chip CPU area is limited, so store it in memory. Is it still secure?
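A minimal sketch of this per-block MAC generation in Python, using the stdlib HMAC in place of the hardware HMAC-SHA1 engine; the key, address, and counter arguments are illustrative assumptions, not the paper's interfaces:

    import hashlib
    import hmac

    def block_mac(key: bytes, block: bytes, addr: int, counter: int) -> bytes:
        # The nonce (derived from the seed) combines the block address
        # (spatial uniqueness) and a per-block write counter (temporal
        # uniqueness), so no two MACs are computed over the same input.
        nonce = addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
        # Hash the block together with its nonce; truncate to a 128-bit MAC.
        return hmac.new(key, nonce + block, hashlib.sha1).digest()[:16]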

Background: Merkle Tree Authentication
- Generate a MAC for each block
- Build a binary hash tree over the MACs
- Store the tree nodes in memory, the root on chip
- Integrity check: fetch the block, fetch the partial tree, re-compute the root
- Storage overhead is large: User data = 8 * 64B = 512B; Meta data = 14 * 16B = 224B
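A toy illustration of the tree check described above: build a binary hash tree over the per-block MACs, keep only the root on chip, and re-derive the root from a fetched leaf plus its sibling path. Function names are assumptions for illustration:

    import hashlib

    def h16(x: bytes) -> bytes:
        # 128-bit node hash, matching the 16B metadata entries above.
        return hashlib.sha1(x).digest()[:16]

    def build_tree(leaf_macs: list) -> list:
        # levels[0] holds the leaf MACs; each parent hashes its two children.
        levels = [leaf_macs]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            levels.append([h16(prev[i] + prev[i + 1])
                           for i in range(0, len(prev), 2)])
        return levels  # levels[-1][0] is the root kept on chip

    def verify(leaf_mac: bytes, index: int, siblings: list, root: bytes) -> bool:
        # Re-compute the root from the leaf and the fetched partial tree.
        node = leaf_mac
        for sib in siblings:
            node = h16(node + sib) if index % 2 == 0 else h16(sib + node)
            index //= 2
        return node == root

For the 8-block example above, the tree stores 8 + 4 + 2 = 14 nodes of 16B each, which is the 224B of metadata on the slide.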

Motivation: Authentication Cost (8GB DRAM, 128-bit MACs)
- Memory overhead: GMT 41%, BMT 23%, SGX 55%
- Runtime overhead: GMT [1] 150%, BMT [2] 13%, SGX [3] <5%
- Processor cost: 32KB counter cache

Memory breakdown:

              GMT    BMT    SGX
User blocks   59%    77%    45%
MACs           0%    20%     5%
Counters       1%     2%     0%
Hash tree     40%     1%    50%
Meta-data     41%    23%    55%

[1] C. Yan, D. Englender, M. Prvulovic, et al., ISCA '06
[2] B. Rogers, S. Chhabra, M. Prvulovic, et al., MICRO '07
[3] S. Gueron, Cryptology ePrint Archive, Report 2016/204

Proposed Solution: Key Insight
- Access memory at a larger block granularity
- Potential benefits: reduce storage overhead, memory traffic, and runtime overhead
- Challenges: maintain security; cache pollution?
- User data = 8 * 64B = 512B; Meta data shrinks from 10 * 16B = 160B to 4 * 16B = 64B or 2 * 16B = 32B

Proposed Solution: Aggregate Message Authentication [1]
- Use XOR to combine blocks: Aggregate MAC = MAC_0 ⊕ MAC_1
- The aggregate MAC is secure if both operands are unique
- MACs are unique spatially: the seed includes the block address
- MACs are unique temporally: the seed includes the block counter
- A partition is a set of consecutive blocks protected by a single aggregate MAC
[1] J. Katz and A. Y. Lindell, CT-RSA 2008
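A sketch of the XOR aggregation under these assumptions, reusing the hypothetical block_mac helper from the hash-generation sketch above:

    from functools import reduce

    def xor16(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def aggregate(macs: list) -> bytes:
        # Aggregate MAC = MAC_0 xor MAC_1 xor ... [Katz and Lindell];
        # sound here because every MAC's seed (block address + counter)
        # is unique, so no two operands can cancel each other out.
        return reduce(xor16, macs, bytes(16))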

Proposed Solution: Read Transaction
- A single MAC protects all blocks in a page partition
- Fetch all blocks in the page partition on a read access
- Compute the MAC of each fetched block
- Compute the aggregate MAC and compare it with the cached MAC (see the sketch after the write-transaction slide below)

Proposed Solution: Write Transaction
- Operates at block granularity to reduce memory traffic
- Clear the aggregate MAC after each verified read
- Compute the MAC of the dirty block, write the dirty block back, and append to the aggregate MAC: Aggregate MAC ⊕= Hash(block)
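A combined sketch of the read and write transactions just described, with mem and counters as plain dicts standing in for DRAM and the counter store, and reusing the hypothetical block_mac, aggregate, and xor16 helpers above; all names are assumptions:

    def read_partition(mem, counters, key, base, n, stored_agg):
        # Read path: fetch the whole n-block partition, re-compute each
        # block's MAC, and check the XOR of all MACs against the stored
        # aggregate MAC. The aggregate is cleared after a verified read.
        blocks = [mem[base + i] for i in range(n)]
        macs = [block_mac(key, blk, base + i, counters[base + i])
                for i, blk in enumerate(blocks)]
        if aggregate(macs) != stored_agg:
            raise RuntimeError("integrity violation")
        return blocks, bytes(16)  # data plus the cleared aggregate MAC

    def write_block(mem, counters, key, addr, block, agg):
        # Write path: operates at block granularity. Bump the counter for
        # temporal uniqueness, write back only the dirty block, and append
        # its fresh MAC to the aggregate: agg ^= MAC(block).
        counters[addr] += 1
        mem[addr] = block
        return xor16(agg, block_mac(key, block, addr, counters[addr]))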

Implementation: Handling Partial Read Requests
- The aggregate MAC only protects off-chip blocks, so we must track which blocks are off-chip
- Where to store the tracking info?
- Option 1, counter cache: adds access latency; counters can overflow
- Option 2, LLC lookup transaction: group blocks from the same partition into the same set by shifting the index region of the block address left; a single lookup returns the full partition status mask (e.g. a 4-bit register for partition size 4)
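A bit-level sketch of the LLC-lookup option, assuming 64B lines, 4 blocks per partition, and a power-of-two number of sets; helper names are hypothetical:

    def set_index(addr: int, num_sets: int, blocks_per_partition: int = 4) -> int:
        line = addr >> 6  # drop the 64B line offset
        # Taking the index above the partition-offset bits (the left shift
        # of the index region) maps all blocks of one partition to the
        # same LLC set.
        return (line // blocks_per_partition) % num_sets

    def partition_status(set_lines: set, base_line: int,
                         blocks_per_partition: int = 4) -> int:
        # One lookup returns the full partition status mask: bit i is set
        # if block i of the partition is currently on-chip.
        mask = 0
        for i in range(blocks_per_partition):
            if base_line + i in set_lines:
                mask |= 1 << i
        return mask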

Implementation: Handling Cache Eviction
- Evicting a block in the currently accessed partition would invalidate the partition's on-chip status; add lookup logic to pick a victim from the next partition in the set
- Evicting clean blocks: the aggregate MAC must be updated, so the block's MAC is recomputed, but the block itself need not be sent off-chip

Implementation: PageVault Architecture
- Vault controller
- Counter cache
- MAC cache
- HMAC engine
- AES engine
- Command queue

Evaluation: Simulation Setup
- Manifold full-system simulator: 3 GHz, 4 OoO cores, 32KB L1, 2MB L2
- DRAMSim2: 8GB DDR3, 1.25ns, 2 channels
- GCM-AES-128: 8 cycles, 16 stages; HMAC-SHA1: 8 cycles, 128-bit MACs
- 8KB counter cache, 8KB MAC cache
- Splash2, Parsec, and GraphBIG benchmarks
- Metrics: runtime and storage overhead

Evaluation: Systems Configuration
- NOEA: baseline system with no protection
- GMT: Galois Merkle Tree (vanilla)
- BMT: Bonsai Merkle Tree (state of the art)
- SGX: Intel SGX (applied)
- PMT2: PageVault with 2 blocks per partition
- PMT4: PageVault with 4 blocks per partition
- PMT8: PageVault with 8 blocks per partition

Results: Memory Overhead
- Meta-data overhead reduced from 23% to 8% (128-bit MACs protecting 8GB of user data)
- Drops to 5% for the larger partition size (8); user data occupancy is above 90%
- Why? MAC storage shrinks by a factor of N: one aggregate MAC per N-block partition

            GMT    BMT    SGX    PMT2   PMT4   PMT8
User data   59%    77%    45%    86%    91%    94%
MACs         0%    20%     5%    11%     6%     3%
Counters     1%     2%     0%     2%     2%     2%
Hash tree   40%     1%    50%     1%     1%     1%
Meta-data   41%    23%    55%    15%     8%     5%
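The MAC rows follow from simple arithmetic: a 128-bit (16B) MAC per 64B block costs 16/64 = 25% of user data, and one aggregate MAC per N-block partition divides that by N. A quick check (percentages here are relative to user data, so they sit slightly above the table's shares of total memory):

    BLOCK, MAC = 64, 16
    for n in (1, 2, 4, 8):
        print(f"N={n}: MAC overhead = {MAC / (BLOCK * n):.1%}")
    # N=1: 25.0%  N=2: 12.5%  N=4: 6.2%  N=8: 3.1%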

Results: Execution Time
- Up to 10-12% improvement; bodytrack and Lu-c even outperform NOEA
[Figure: execution time for Parsec, Splash2, and GraphBIG]
- Performance exploits prefetched-block reuse (accuracy above 85%), MAC cache efficiency (hit rate above 70%), and reduced hash processing time

Results: GraphBIG Prefetch Accuracy
- Good prefetch accuracy in the LLC (~80%)
- DFS has high cache misses due to synchronization variables

Results: GraphBIG Memory Traffic
- Off-chip read traffic degrades (increases) by 15%; write traffic shows a similar degradation
- Caused by synchronization variables creating cache pollution

Results: GraphBIG Memory Traffic
- Off-chip traffic degrades (increases) by 15%
- The extra traffic comes from hash-tree accesses for the counters

Results: Reducing the Partition Size
- Improves runtime by 8% through less cache pollution
- But at 2x the memory overhead: from 8% to 15%
- Adaptive resizing could help: compiler-driven, or hardware-driven using block counter history

Conclusion
- A cost-efficient memory protection scheme that exploits aggregate MAC (AMAC) properties
- Significant reduction in storage overhead; total execution time is improved
- Increases the compute capacity of the secure system
- Adaptive compression scheme

Thank You!

Results: Counter Cache Hit Rate

Results: Runtime Effect of 8KB vs. 16KB MAC Cache; Runtime Effect of Partition Size

Prior Work
- GMT [Chenyu 06]: counter-based encryption; hash tree covers data; hash tree covers counters; runtime overhead ~151%; storage overhead ~134%
- BMT [Brian 07]: counter-based encryption; hash tree covers counters; MACs authenticate data; runtime overhead ~13%; storage overhead ~53%

Background: Encryption
- Basic direct encryption: encrypt the block using the key; AES on the critical path is very slow!
- Counter-mode encryption: generate a unique seed, create a pad from the seed, encrypt the block with the pad, and cache the pad for decryption
- The block can be fetched while the pad cache is accessed
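A stdlib sketch of the counter-mode idea, with a SHA-1 keystream standing in for the AES pad generator a real memory controller would use; all names are assumptions:

    import hashlib

    def pad(key: bytes, addr: int, counter: int, length: int = 64) -> bytes:
        # Seed = block address + write counter; in hardware the pad would
        # be AES(key, seed), computable before the data block arrives.
        seed = addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
        out = b""
        i = 0
        while len(out) < length:
            out += hashlib.sha1(key + seed + i.to_bytes(4, "little")).digest()
            i += 1
        return out[:length]

    def crypt(key: bytes, addr: int, counter: int, block: bytes) -> bytes:
        # Encryption and decryption are the same XOR, so decryption can
        # start the moment the pad is ready, overlapping the off-chip fetch.
        return bytes(b ^ p for b, p in zip(block, pad(key, addr, counter, len(block))))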

Evaluation: Benchmarks
- GraphBIG has 100x more traffic than Splash2
[Figure: memory traffic for Parsec and Splash2 vs. GraphBIG]

Results: Splash2/Parsec Prefetch Accuracy
- Good prefetch accuracy in the LLC (~85%)
- Average miss rate reduction of 10%

Results: Splash2/Parsec Memory Traffic
- Off-chip read traffic is reduced by 10%; write traffic shows a similar reduction

Background: Bonsai Merkle Tree [B. Rogers et al., MICRO '07]
- The tree covers counters only; counters are small, so tree overhead is reduced
- MAC overhead is still large (MAC size / 64B block): 128-bit MAC -> 25%; 160-bit MAC -> 31%; 192-bit MAC -> 38%
- Compression? Bit shuffling adds hardware cost
- User data = 8 * 64B = 512B; Meta data = 10 * 16B = 160B