PageVault: Securing Off-Chip Memory Using Page-Based Authentication
Blaise-Pascal Tine, Sudhakar Yalamanchili
Outline Background: Memory Security Motivation Proposed Solution Implementation Evaluation Conclusion
Background Cloud Computing Threat Model Compute as a service deals with sensitive content: trading, banking, medical, legal, search, etc. Offloading data + computation
Background Cloud Computing Threat Model Data encryption (SSL) Compute sandboxing (Intel SGX) Hardware security: what about the admin?
Background Memory Attacks Snooping Spoofing Splicing Replay Authentication Encryption
Background Hash Generation Generate unique seed Create nonce from seed Hash block using nonce Store MAC for integrity check Where to store? - CPU area limited - Store it in memory - Still secure?
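The hashing flow on this slide can be sketched in Python; HMAC-SHA-256 truncated to 128 bits stands in for the hardware hash engine, and all names and sizes are illustrative:

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # bytes per cache block

def make_nonce(block_addr: int, counter: int) -> bytes:
    """Unique seed = (block address, per-block write counter)."""
    return block_addr.to_bytes(8, "little") + counter.to_bytes(8, "little")

def block_mac(key: bytes, block: bytes, block_addr: int, counter: int) -> bytes:
    """Hash the block together with its nonce; keep a 128-bit MAC."""
    assert len(block) == BLOCK_SIZE
    nonce = make_nonce(block_addr, counter)
    return hmac.new(key, nonce + block, hashlib.sha256).digest()[:16]
```

Because the counter advances on every write, identical block contents still yield distinct MACs, which is what the integrity check relies on when the MACs themselves are stored in untrusted memory.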
Background Merkle Tree Authentication Generate MAC for each block Build binary hash tree Store tree nodes in memory Store root on chip Integrity Check Fetch block Fetch partial tree Re-compute root Storage Overhead Large User Data = 8 * 64B = 512B Meta Data = 14 * 16B = 224B
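A minimal model of the binary hash tree described above, assuming a power-of-two leaf count and the 16-byte node size from the slide (8 leaf MACs + 4 + 2 internal nodes = 14 nodes, root kept on chip):

```python
import hashlib

def h(data: bytes) -> bytes:
    """16-byte tree node, as on the slide."""
    return hashlib.sha256(data).digest()[:16]

def merkle_root(leaf_macs: list) -> bytes:
    """Fold per-block MACs into a single root (leaf count must be a
    power of two in this sketch)."""
    level = leaf_macs
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

An integrity check recomputes the root from the fetched block's MAC and the fetched partial tree, then compares it against the on-chip root; any tampering with a block or a tree node changes the recomputed root.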
Motivation Authentication Cost (8GB DRAM, 128-bit MACs, 32KB counter cache)
Runtime overhead: GMT [1] 150%, BMT [2] 13%, SGX [3] <5%
Memory overhead:
            GMT   BMT   SGX
User blocks 59%   77%   45%
MACs         0%   20%    5%
Counters     1%    2%    0%
Hash Tree   40%    1%   50%
Meta-Data   41%   23%   55%
[1] C. Yan, D. Englender, M. Prvulovic, et al. ISCA 06
[2] B. Rogers, S. Chhabra, M. Prvulovic, et al. MICRO 07
[3] S. Gueron, Cryptology ePrint Archive, Report 2016/204
Proposed Solution Key Insight Access memory at a larger block granularity Potential Benefits Reduce storage overhead Reduce memory traffic Reduce runtime overhead Challenges Maintain security Cache pollution? User Data = 8 * 64B = 512B Meta Data = 10 * 16B = 160B → 4 * 16B = 64B → 2 * 16B = 32B
Proposed Solution Aggregate Message Authentication [1] Use XOR to combine blocks: - Aggregate MAC = MAC0 ⊕ MAC1 The aggregate MAC is secure if both operands are unique MACs are unique spatially - Seed includes the block address MACs are unique temporally - Seed includes the block counter A partition is a set of consecutive blocks protected by a single aggregate MAC [1] J. Katz and A. Y. Lindell. CT-RSA 2008
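The XOR aggregation from [1] can be modeled directly; `m0` and `m1` below are arbitrary 16-byte per-block MACs. Note that XOR aggregation is order-independent and self-cancelling, which is exactly why each operand MAC must be unique:

```python
def xor16(a: bytes, b: bytes) -> bytes:
    """XOR two 16-byte MACs."""
    return bytes(x ^ y for x, y in zip(a, b))

def aggregate_mac(macs) -> bytes:
    """Aggregate MAC of a partition = XOR of its per-block MACs."""
    agg = bytes(16)  # all-zero identity element
    for m in macs:
        agg = xor16(agg, m)
    return agg
```

If two operands could collide (same address and same counter), their XOR would cancel to zeros and an attacker could swap them undetected; unique spatial and temporal seeds rule this out.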
Proposed Solution Read Transaction A single MAC protects all blocks in a page partition Fetch all blocks in the page partition on a read access Compute the MAC of each fetched block Compute the aggregate MAC Compare with the cached MAC
Proposed Solution Write Transaction Operates at block granularity Reduces memory traffic Clear the aggregate MAC after each read Compute the MAC of the dirty block Write back the dirty block Append to the aggregate MAC - Aggregate MAC ⊕= Hash(block)
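Putting the read and write transactions together, a toy model of one partition's aggregate MAC (HMAC-SHA-256 stands in for the hardware hash engine; all names and sizes are illustrative, and the real design caches the aggregate MAC in a dedicated MAC cache):

```python
import hashlib
import hmac

def block_mac(key: bytes, block: bytes, addr: int, ctr: int) -> bytes:
    seed = addr.to_bytes(8, "little") + ctr.to_bytes(8, "little")
    return hmac.new(key, seed + block, hashlib.sha256).digest()[:16]

def xor16(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class PartitionMAC:
    """Aggregate MAC for one page partition."""
    def __init__(self, key: bytes):
        self.key = key
        self.agg = bytes(16)

    def verify_read(self, blocks) -> bool:
        """Read transaction: MAC every fetched (addr, ctr, data) block,
        XOR them together, compare against the stored aggregate."""
        computed = bytes(16)
        for addr, ctr, data in blocks:
            computed = xor16(computed, block_mac(self.key, data, addr, ctr))
        ok = hmac.compare_digest(computed, self.agg)
        self.agg = bytes(16)  # clear after read: blocks are now on chip
        return ok

    def writeback(self, addr: int, ctr: int, data: bytes):
        """Write transaction: only the dirty block goes off chip; its
        fresh MAC is XOR-appended to the aggregate."""
        self.agg = xor16(self.agg, block_mac(self.key, data, addr, ctr))
```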
Implementation Handling Partial Read Requests Aggregate MAC only protects off-chip blocks Need to track which blocks are off-chip Where to store tracking info? Use counter cache - Counter cache access latency - Counter overflow Use LLC lookup transaction - Group blocks from same partition into same set - Shift index region of block address left - Return full partition status mask (e.g. 4-bit register) Partition size = 4
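A sketch of the LLC lookup scheme above, assuming 64B blocks, 4-block partitions, and a 1024-set LLC (geometry is illustrative):

```python
BLOCK_BITS = 6            # 64B blocks
PART_BITS = 2             # log2(partition size); partition size = 4
PARTITION_SIZE = 1 << PART_BITS
NUM_SETS = 1024           # illustrative LLC geometry

def llc_set_index(byte_addr: int) -> int:
    """Shift the index region left past the partition bits so all blocks
    of a partition map to the same set."""
    return (byte_addr >> (BLOCK_BITS + PART_BITS)) % NUM_SETS

def partition_status(cached_blocks: set, base_addr: int) -> int:
    """Partition status mask: bit i is set if block i of the partition
    (base_addr must be partition-aligned) is on chip."""
    mask = 0
    for i in range(PARTITION_SIZE):
        if base_addr + (i << BLOCK_BITS) in cached_blocks:
            mask |= 1 << i
    return mask
```

Grouping a partition's blocks into one set means a single set lookup yields the full 4-bit status mask, telling the vault controller which blocks are off chip and still covered by the aggregate MAC.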
Implementation Handling Cache Eviction Evicting block in currently accessed partition - Will invalidate the partition on-chip status - Add lookup logic to pick from next partition in set Evicting Clean Blocks - Aggregate MAC should be updated - Need to recompute block MAC - No need to send the block off-chip
Implementation PageVault Architecture Vault controller Counter cache MAC cache HMAC engine AES engine Command Queue
Evaluation Simulation Manifold Full System Simulator 3 GHz 4 OoO cores 32K L1 2MB L2 DRAMSim2 8GB - DDR3-1.25ns - 2 channels GCM-AES-128 - 8 cycles - 16 stages HMAC-SHA1 - 8 cycles - 128-bit MACs 8 KB counter cache 8 KB MAC cache Splash2, Parsec, GraphBig benchmarks Metrics: runtime/storage overhead
Evaluation Systems Configuration NOEA: Baseline system with no protection GMT: Galois Merkle Tree (vanilla) BMT: Bonsai Merkle Tree (state of the art) SGX: Intel SGX (applied) PMT2: PageVault with 2 blocks per partition PMT4: PageVault with 4 blocks per partition PMT8: PageVault with 8 blocks per partition
Results Memory Overhead
Meta-data overhead reduction: from 23% to 8% - using 128-bit MACs to protect 8GB of user data
Can reach down to 5% for a larger partition size (8)
User data occupancy above 90%
Why? MAC storage shrinks by 1/N (one aggregate MAC per N-block partition)
           GMT   BMT   SGX   PMT2  PMT4  PMT8
User Data  59%   77%   45%   86%   91%   94%
MACs        0%   20%    5%   11%    6%    3%
Counters    1%    2%    0%    2%    2%    2%
Hash Tree  40%    1%   50%    1%    1%    1%
Meta-Data  41%   23%   55%   15%    8%    5%
Results Execution Time: up to 10-12% improvement - bodytrack and lu-c outperform NOEA Performance gains come from: Prefetched-block reuse - accuracy above 85% MAC cache efficiency - hit rate above 70% Reduced hash processing time
Results GraphBIG Prefetch Accuracy Good prefetch accuracy in LLC (~80%) DFS has high cache misses due to sync variables
Results GraphBIG Memory Traffic Off-chip read traffic increases by 15% Write traffic shows a similar increase - due to synchronization variables creating cache pollution
Results GraphBIG Memory Traffic Off-chip traffic increases by 15% The extra traffic comes from hash-tree accesses for counters
Results Reducing the Partition Size Improves runtime by 8% Less cache pollution But? - 2x memory overhead - from 8% to 15% Adaptive resizing - compiler-driven - hardware-driven, using block counter history
Conclusion A cost-efficient memory protection scheme Exploits AMAC properties Significant reduction in storage overhead Total execution time is improved Increases the compute capacity of the secure system Adaptive compression scheme
Thank You!
Results Counter Cache Hit Rate
Results Runtime Effect on 8KB vs 16KB MAC cache Runtime Effect on Partition Size
Prior Work GMT [Chenyu 06] Counter-based encryption Hash tree covers data Hash tree covers counters Runtime overhead ~151% Storage overhead ~134% BMT [Brian 07] Counter-based encryption Hash tree covers counters MACs authenticate data Runtime overhead ~13% Storage overhead ~53%
Background Basic Direct Encryption Encrypt block using key AES is very slow! Counter Mode Encryption Generate unique seed Create pad from seed Encrypt block using pad Cache the pad for decryption Can fetch block while accessing cache
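Counter-mode encryption can be sketched as follows; a SHA-256-based pad generator stands in for the AES pipeline, and the key point is that encryption and decryption are the same XOR, so the pad can be computed (or fetched from the pad cache) while the block is in flight:

```python
import hashlib

BLOCK_SIZE = 64  # bytes

def make_pad(key: bytes, block_addr: int, counter: int) -> bytes:
    """Derive a one-time pad from a unique seed (address + counter).
    SHA-256 stands in for the AES engine of the slide."""
    seed = block_addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    pad = b""
    i = 0
    while len(pad) < BLOCK_SIZE:
        pad += hashlib.sha256(key + seed + i.to_bytes(4, "little")).digest()
        i += 1
    return pad[:BLOCK_SIZE]

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Encrypt and decrypt are the same operation:
#   ciphertext = plaintext XOR pad;  plaintext = ciphertext XOR pad
```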
Evaluation Benchmarks GraphBIG has 100x more traffic than Splash2
Results Splash/Parsec Prefetch Accuracy Good prefetch accuracy in LLC (~85%) Average miss rate reduction (10%)
Results Splash/Parsec Memory Traffic Off-chip read traffic is reduced by 10% The write traffic shows similar reduction
Background Bonsai Merkle Tree Tree covers counters only Counters are small Tree overhead reduced MAC overhead still large - 128-bit MAC > 25% - 160-bit MAC > 31% - 192-bit MAC ≈ 38% Compression? - bit shuffling - hardware cost - B. Rogers et al. MICRO 07 User Data = 8 * 64B = 512B Meta Data = 10 * 16B = 160B
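The MAC overhead figures follow directly from MAC size over block size; a quick check:

```python
BLOCK_SIZE = 64  # bytes of user data per cache block

# One MAC per 64B block, for each MAC width on the slide.
for mac_bits in (128, 160, 192):
    overhead = (mac_bits / 8) / BLOCK_SIZE
    print(f"{mac_bits}-bit MAC: {overhead:.1%} of user data")
    # prints 25.0%, 31.2%, 37.5%
```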