CSC 5930/9010 Cloud S & P: Cloud Primitives Professor Henry Carter Spring 2017
Methodology Section This is the most important technical portion of a research paper Methodology sections differ widely depending on the field of study Cryptography: protocol description System security: attack description Measurement paper: methodology and description of collected data This is where you must communicate your idea clearly and completely
Exercise: Read Methodology Sections Protocol design: Outsourcing Secure Two-Party Computation as a Black Box Data analysis work: The Core of the Matter: Analyzing Malicious Traffic in Cellular Carriers Survey paper: The State of Public Infrastructure-as-a-Service Cloud Security
Critical Components For novel research: Clear problem statements Settings and assumptions How the contribution works (this can be many things) For surveys Clear problem statement Settings and assumptions Categorized background knowledge
Methodology Assignment Due in 2 weeks 1 page minimum (IEEEtran format) Read other technical writing for style guidance Use "we" instead of "I" Avoid colloquial terms Only use bullets or numbering sparingly (and make sure to use good grammar within the bullets)
Recap Mobile apps process a significant amount of private information Mobile SMC has followed three approaches in parallel with desktop SMC Outsourced garbled circuits Outsourced triple generation Custom partially homomorphic encryption protocols Each technique demonstrates strengths and weaknesses, so future work and practical applications will likely use hybrid techniques
The general SMC model Garbled circuits, secret sharing protocols, and fully homomorphic encryption allow for arbitrary computation Represented as a circuit Multipurpose protocols not always efficient As seen in the biometric authentication protocol For simple functions, the overhead of general SMC is crippling e.g., database lookup
Problems with general SMC Significant setup costs Significant size expansion Typically require all parties to expend significant computational power FHE excluded Do not always allow for convenient repeated accesses Circuit representation of problems is not always efficient
Other Settings Encrypted data stores Hiding access patterns to online repositories Non-circuit represented computation
Database Retrieval A common cloud application is data storage Two types of privacy concerns exist: Access patterns Data privacy Special case protocols exist for hiding some or all of this information
Private Information Retrieval Given: a publicly available database stored on a server and a client seeking to access the data Goal: prevent the database from learning which element was accessed Server only knows the client did or did not access data Often simplified to a stored bitstring where the client queries a single bit Naïve approach: download the whole database Extreme communication cost!
Foundation: Security Types Information theoretic Statistically hidden information Computational security Hidden by a "hard problem" Can be broken as computers get faster
Examples Information theoretic encryption: one-time pad Computational encryption: RSA
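To make the information theoretic example concrete, here is a minimal one-time pad sketch. The function names are illustrative assumptions, not from the slides, and the security argument only holds if the key is truly random, as long as the message, and never reused.

```python
import secrets


def otp_encrypt(message: bytes):
    """Encrypt by XORing the message with a fresh random key of equal length."""
    key = secrets.token_bytes(len(message))
    ciphertext = bytes(m ^ k for m, k in zip(message, key))
    return key, ciphertext


def otp_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """XOR with the same key undoes the encryption."""
    return bytes(c ^ k for c, k in zip(ciphertext, key))


key, ciphertext = otp_encrypt(b"attack at dawn")
recovered = otp_decrypt(key, ciphertext)
```

Because every plaintext of the same length is equally consistent with a given ciphertext, faster computers do not help an attacker here, unlike the computational case.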
Information coding Given a message written as a string of characters, replace each character with a code word from the encoding Common in computing and communication Data compression Error correction We can use a combination of randomly selected code words to recover meaningful information
LDC Basics Locally Decodable Codes allow a decoder to recover a single bit in a message x by querying only a few bits in a received code word As long as a threshold number of bits are not corrupted, the decoder will recover the bit with high probability There are many combinations of queries that can all be combined to recover a single bit in x
LDC Example Message: 010 Encoded message: 00110011 Received message: 00110111 Message query: 00110111 Bit recovery: 1 ⊕ 1 = 0
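The slide does not name the code it uses. The encoded message 00110011 matches the Hadamard code of 010, so the sketch below assumes that code. The function names and the bit ordering (first bit most significant) are assumptions made for illustration.

```python
from itertools import product


def hadamard_encode(x):
    """Codeword bit at index a is the inner product of a and x mod 2,
    taken over every a in {0,1}^n in binary counting order."""
    n = len(x)
    return [sum(ai * xi for ai, xi in zip(a, x)) % 2
            for a in product([0, 1], repeat=n)]


def index_of(a):
    """Position of vector a in the codeword (a read as a binary number)."""
    return int("".join(map(str, a)), 2)


def decode_bit(word, i, a):
    """Recover x[i] from two positions of the word: a, and a with bit i flipped."""
    b = list(a)
    b[i] ^= 1
    return word[index_of(a)] ^ word[index_of(b)]


codeword = hadamard_encode([0, 1, 0])   # [0, 0, 1, 1, 0, 0, 1, 1]
received = [0, 0, 1, 1, 0, 1, 1, 1]     # position 5 was corrupted in transit

# Positions 2 and 3 are both uncorrupted: 1 XOR 1 = 0, the last message bit.
good = decode_bit(received, 2, (0, 1, 0))
# A query pair that touches the corrupted position 5 decodes incorrectly.
bad = decode_bit(received, 2, (1, 0, 0))
```

For the last message bit there are four possible query pairs and only one of them touches the corrupted position, so a randomly chosen pair recovers the bit with probability 3/4 in this example.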
From LDC to PIR In a smooth LDC, the decoder can generate code word queries nearly randomly Since the queries look nearly random, the only way to determine which bit the decoder is querying is to see all of the queries Assuming the PIR servers do not communicate, the receiver queries a single bit from each server and combines the result to decode her choice bit in x Secure since no server sees all of the queries
From LDC to PIR [Figure: two-server exchange. Request labels: "Send bit x", "Send bit y". Response labels: m_x, m_y.]
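The two-server idea can be sketched with the Hadamard code, which is an assumption here since the slides do not fix a particular code. Each server holds a copy of the database x and answers one codeword position. The client sends a random vector to one server and the same vector with bit i flipped to the other. All names below are hypothetical.

```python
import secrets


def inner(a, x):
    """Inner product of two bit vectors mod 2."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2


def server_answer(database, query):
    """A server returns one bit: the Hadamard codeword bit at position `query`."""
    return inner(query, database)


def make_queries(i, n):
    """Two queries that differ only in bit i. Each one alone is uniformly random."""
    a = [secrets.randbelow(2) for _ in range(n)]
    b = list(a)
    b[i] ^= 1
    return a, b


def retrieve(copy_1, copy_2, i):
    """Client: query both non-colluding servers and XOR the two answers."""
    q1, q2 = make_queries(i, len(copy_1))
    return server_answer(copy_1, q1) ^ server_answer(copy_2, q2)


database = [0, 1, 0, 1, 1]
bits = [retrieve(database, database, i) for i in range(len(database))]
```

This sketch sends n-bit queries, so it is no cheaper than downloading the database. It only illustrates why a single server, seeing one random-looking query, learns nothing about i.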
Problems Inefficient Duplicate servers Encoding expands database size significantly Mostly theoretical in nature Ways to improve?
Computational PIR Many schemes perform the same actions based on computationally hard problems Can be done with a single server Typically much more efficient (although still requires larger than O(1) communication)
Example: quadratic residue A quadratic residue modulo m is any number a such that $a \equiv x^2 \pmod{m}$ for some x Computationally hard problem: given y and m, it is hard to tell if y is a quadratic residue mod m without knowing the factors of m We can use the properties of quadratic residues to retrieve a bit from a database privately
QR Properties Multiplying a QR by a QR is still a QR: $a^2 b^2 \bmod m = (ab)^2 \bmod m$ Multiplying a NQR with a QR gives a NQR Difficult to distinguish without factorization of modulus Given the factors of m, we can distinguish QRs from NQRs
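The two multiplication properties can be checked by brute force for a small modulus. The sketch below uses modulus 11 as an arbitrary illustrative choice and restricts attention to values coprime to the modulus, which is an assumption the slide does not state but which the properties need in order to hold.

```python
from math import gcd


def split_residues(m):
    """Split the units mod m (values coprime to m) into QRs and NQRs."""
    units = [a for a in range(1, m) if gcd(a, m) == 1]
    qrs = sorted({(x * x) % m for x in units})
    nqrs = [a for a in units if a not in qrs]
    return qrs, nqrs


m = 11
qrs, nqrs = split_residues(m)   # qrs = [1, 3, 4, 5, 9], nqrs = [2, 6, 7, 8, 10]

# QR times QR stays a QR.
qr_times_qr = all((a * b) % m in qrs for a in qrs for b in qrs)
# NQR times QR stays an NQR.
nqr_times_qr = all((a * b) % m in nqrs for a in nqrs for b in qrs)
```

Brute force like this only works for tiny moduli. For a large m with unknown factors, no efficient test is known, which is the hardness assumption the PIR scheme relies on.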
Step 1: Database Setup Spread the message bits across a matrix $\begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$
Step 2: Query Setup The client wishes to access entry i,j (ex. 1,0) Generate a vector of QRs mod m with a single NQR at position i = 1: $\begin{bmatrix} 4 \\ 2 \\ 3 \end{bmatrix} \bmod 6$
Step 3: Query Processing For every column, the server computes the entry-wise product, then multiplies the entries together: $\begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \circ \begin{bmatrix} 4 \\ 2 \\ 3 \end{bmatrix} \bmod 6 \rightarrow \begin{bmatrix} 4 & 4 & 0 \\ 2 & 0 & 0 \\ 0 & 3 & 0 \end{bmatrix}$ (row r of the matrix is scaled by query entry r) Multiplying the nonzero entries down each column: $\begin{bmatrix} 8 & 12 & 0 \end{bmatrix} \bmod 6 = \begin{bmatrix} 2 & 0 & 0 \end{bmatrix}$
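Step 4: Data Recovery If entry j (ex. 0) is an NQR, then entry i,j = 1 Else, i,j = 0 $\begin{bmatrix} 2 & 0 & 0 \end{bmatrix} \bmod 6$ Here entry 0 is 2, which is an NQR mod 6, so database entry (1,0) = 1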
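Steps 1 through 4 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the slides' exact arithmetic: it uses the toy modulus 77 = 7 x 11 instead of 6 so that the client can test residuosity with the factors, it treats a 0 bit as a factor of 1 in the column product, and it picks the NQR to be a non-residue modulo both primes. All function names are hypothetical.

```python
import secrets
from math import gcd

P, Q = 7, 11          # toy primes known only to the client (assumption)
M = P * Q             # public modulus sent to the server


def is_qr(y):
    """Client-side test using the factors (Euler's criterion mod each prime)."""
    return pow(y, (P - 1) // 2, P) == 1 and pow(y, (Q - 1) // 2, Q) == 1


def random_unit():
    while True:
        r = secrets.randbelow(M - 2) + 2
        if gcd(r, M) == 1:
            return r


def random_qr():
    return pow(random_unit(), 2, M)


def random_nqr():
    """A non-residue mod both primes, so it is not exposed by the Jacobi symbol."""
    while True:
        y = random_unit()
        if pow(y, (P - 1) // 2, P) == P - 1 and pow(y, (Q - 1) // 2, Q) == Q - 1:
            return y


def make_query(row, n_rows):
    """Step 2: QRs everywhere except a single NQR at the wanted row."""
    return [random_nqr() if r == row else random_qr() for r in range(n_rows)]


def process_query(db, query):
    """Step 3: per column, multiply the query entries of rows holding a 1 bit."""
    answer = []
    for c in range(len(db[0])):
        z = 1
        for r, row in enumerate(db):
            if row[c] == 1:
                z = (z * query[r]) % M
        answer.append(z)
    return answer


def recover_bit(answer, col):
    """Step 4: an NQR in the wanted column means the bit is 1."""
    return 0 if is_qr(answer[col]) else 1


db = [[1, 1, 0], [1, 0, 0], [0, 1, 0]]          # Step 1: the slide's matrix
bit = recover_bit(process_query(db, make_query(1, 3)), 0)
```

The server returns one value per column, so communication grows with the number of columns rather than the full database size. A real deployment would need primes large enough that factoring M is infeasible.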
Step 5: Practice Put together a query for entry (2,2) using modulus 7 What are the QRs? NQRs?
PIR pros and cons Hides what the user is accessing (database only knows that the user got something) Requires data to be public Still costly in terms of communication (this is improving)
Encrypted Data Data is commonly stored in an encrypted version Recall that typical encryption is randomized How do we query over this data? What sorts of guarantees are possible?
Searchable Encryption Use deterministic "labels" to identify documents E.g., deterministically encrypted file names Data store can search labels to retrieve files Reveals access patterns!
Searchable Enc Illustrated Enc( journal ) WUCHVBEDJ. Enc( recipe ) AJVCGBEFD. Enc( spy_stuff ) SDKVMSNB. Enc( grades ) JUDHSXOIHE. Enc( formula ) AJSUDENFJX
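A minimal sketch of deterministic labels, assuming HMAC-SHA256 as the labeling function (the slides do not specify one). The `DataStore` class and the placeholder ciphertext bytes are hypothetical; file encryption itself is omitted.

```python
import hashlib
import hmac


def label(key: bytes, name: str) -> str:
    """Deterministic label: the same key and name always give the same output."""
    return hmac.new(key, name.encode(), hashlib.sha256).hexdigest()


class DataStore:
    """Untrusted store: sees only labels and ciphertexts, never file names."""

    def __init__(self):
        self.files = {}
        self.access_log = []

    def put(self, lbl, ciphertext):
        self.files[lbl] = ciphertext

    def get(self, lbl):
        self.access_log.append(lbl)   # what the store can observe
        return self.files.get(lbl)


key = b"client-secret-key"            # illustrative key, held only by the client
store = DataStore()
store.put(label(key, "journal"), b"placeholder-ciphertext-1")
store.put(label(key, "recipe"), b"placeholder-ciphertext-2")

first = store.get(label(key, "journal"))
second = store.get(label(key, "journal"))
```

The store cannot read the name "journal", but its log contains the same label twice, so it learns that the same file was fetched both times. That repetition is the access pattern leak.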
Expansion Protocols Order-preserving encryption Leaks ordering information Prefix matching Allows for searching partial matched labels Keyword stores Use encrypted keywords to label documents for searching within encrypted files
Comparison PIR assumes the database will be public but hides accesses Searchable encryption hides the database but leaks access information Two very different problems in the same related area!
Can we hide both? Constructing a protocol for hiding both data AND access patterns would significantly improve security An oblivious protocol for data access could be applicable outside of network protocols Secure processor isolation Oblivious RAM
ORAM Oblivious Random Access Memory hides the contents and access patterns of the data store Allows both read and write operations Indistinguishable to the data store Typically requires some storage by client
Path ORAM Based on a complete binary tree structure where leaves act as indices Reads an entire branch with every operation Maintains an index of block addresses to leaf indices and a small stash of blocks (like a cache)
Basic Components Data store The store is initialized to hold N blocks Leaves are indexed from 0 to $2^L - 1$ Each node is a bucket holding Z blocks (with approximately N buckets, this requires $Z \cdot N$ blocks to be stored) Client The stash holds some blocks temporarily The index holds a mapping for each block such that block(x) = leaf(a)
Initializing the store
Read
Write
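The initialize, read, and write operations can be sketched as a single access routine, following the description of the tree, index, and stash above. This is a simplified sketch under stated assumptions: encryption and dummy-block padding are omitted, the tree is held in a local dictionary rather than on a remote store, and the class and method names are hypothetical.

```python
import random


class PathORAM:
    def __init__(self, levels, bucket_size=4):
        self.L = levels
        self.Z = bucket_size
        self.n_leaves = 2 ** levels
        # Heap-ordered complete binary tree: node 1 is the root,
        # node i has children 2i and 2i + 1, leaves are the last 2^L nodes.
        self.tree = {node: [] for node in range(1, 2 ** (levels + 1))}
        self.position = {}   # client index: block id -> leaf
        self.stash = {}      # client stash: block id -> data

    def _path(self, leaf):
        """Nodes from the given leaf up to the root (leaf first)."""
        node = self.n_leaves + leaf
        path = []
        while node >= 1:
            path.append(node)
            node //= 2
        return path

    def access(self, op, block_id, data=None):
        # Look up the block's current leaf, then remap it to a fresh random leaf.
        leaf = self.position.get(block_id, random.randrange(self.n_leaves))
        self.position[block_id] = random.randrange(self.n_leaves)

        # Read the entire branch into the stash.
        path = self._path(leaf)
        for node in path:
            for bid, value in self.tree[node]:
                self.stash[bid] = value
            self.tree[node] = []

        old = self.stash.get(block_id)
        if op == "write":
            self.stash[block_id] = data

        # Write the branch back, deepest bucket first. A block may go into a
        # bucket only if that bucket also lies on the path to its assigned leaf.
        for node in path:
            fits = [bid for bid in self.stash
                    if node in self._path(self.position[bid])][: self.Z]
            self.tree[node] = [(bid, self.stash.pop(bid)) for bid in fits]
        return old


oram = PathORAM(levels=3)
for i in range(8):
    oram.access("write", i, f"block-{i}")
values = [oram.access("read", i) for i in range(8)]
```

Reads and writes touch exactly one root-to-leaf branch and rewrite all of it, so in a full implementation with re-encryption the store sees the same shape of traffic for both operations.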
Security Guarantees Since each block is written to a new random path, subsequent accesses cannot be linked Since each block is re-encrypted with randomized encryption after every access, data is hidden and indistinguishable from previous data This hides whether a read or write occurred
Costs Client must perform the data processing ORAM provides only a store, not secure computation Client must maintain the index and stash The authors show this is typically very low
Path ORAM Performance
Applications allowed by ORAM Oblivious binary search tree Stateless ORAM Secure processors New approaches to SMC combining ORAM and traditional SMC operations
ORAM vs PIR ORAM hides data AND access ORAM can be applied to more general computation protocols ORAM requires more computation and storage on the client's side
Recap Many specialized cryptographic constructions have been developed for special-case cloud operations PIR hides access patterns in a public database Searchable encryption hides the contents of a remote data store but leaks some information about access patterns ORAM hides both data and access patterns in a remote data store at the cost of maintaining client state and added computation
Next Time... Differential Privacy Remember, you need to read it BEFORE you come to class! Homework: Homework #3 (1 week) Methodology (2 weeks)