Private Information Retrieval (PIR)

Levente Buttyán

Problem formulation

- Alice wants to obtain information from a database, but she does not want the database to learn which information she wanted
  - e.g., Alice is an investor querying a stock-market database
  - e.g., Alice is a company querying a patent database
- a trivial solution is for Alice to download the entire database
- can the problem be solved with less communication?
- typical model:
  - the database is an n-bit string $X = x_1 x_2 \ldots x_n$
  - Alice is interested in $x_i$
  - the database should not be able to learn $i$

Some negative results

- if Alice uses a deterministic scheme, then n bits must be transferred (even if there are multiple non-communicating copies of the database)
  - so Alice should use coin flips (a randomized algorithm)
- if the database has unlimited computational power and there is only a single copy of the database, then n bits must be transferred
- there is hope if the database can only perform efficient computations (i.e., it is computationally bounded)
- there is hope if the database has unlimited computational power but there are multiple non-communicating copies of the database

An example PIR protocol

- assume that there are 4 copies of the database
- the bits of X are arranged in an $n^{1/2} \times n^{1/2}$ matrix
- Alice wants to retrieve $x_{ij}$ ($1 \le i, j \le n^{1/2}$)
- protocol:
  - Alice generates two random bit strings $s$ and $t$ of length $n^{1/2}$
  - let $s'$ be the same as $s$ but with the $i$-th bit flipped, and let $t'$ be the same as $t$ but with the $j$-th bit flipped
  - Alice sends
    - $s$ and $t$ to DB1
    - $s$ and $t'$ to DB2
    - $s'$ and $t$ to DB3
    - $s'$ and $t'$ to DB4
  - each DB returns a single bit, computed as the XOR of all bits $x_{ab}$ for which the $a$-th bit of $s$ (or $s'$) and the $b$-th bit of $t$ (or $t'$) are both equal to 1
  - Alice XORs the four received bits, and the result gives $x_{ij}$
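The four-database protocol above can be simulated in a few lines. This is only a sketch under stated assumptions: the function names (`make_queries`, `answer`, `retrieve`) are hypothetical, indices are 0-based, and the four non-communicating copies are simulated by calling the same function four times in one process.

```python
import secrets

def make_queries(side, i, j):
    """Alice: build the four (row-mask, column-mask) pairs for cell (i, j)."""
    s = [secrets.randbelow(2) for _ in range(side)]
    t = [secrets.randbelow(2) for _ in range(side)]
    s_flip = s.copy()
    s_flip[i] ^= 1          # s' = s with the i-th bit flipped
    t_flip = t.copy()
    t_flip[j] ^= 1          # t' = t with the j-th bit flipped
    return [(s, t), (s, t_flip), (s_flip, t), (s_flip, t_flip)]

def answer(matrix, rows, cols):
    """One database copy: XOR of all x_ab with rows[a] == 1 and cols[b] == 1."""
    bit = 0
    for a, row in enumerate(matrix):
        if rows[a]:
            for b, x in enumerate(row):
                if cols[b]:
                    bit ^= x
    return bit

def retrieve(matrix, i, j):
    """Alice: query the four (simulated) copies and XOR their one-bit answers."""
    result = 0
    for rows, cols in make_queries(len(matrix), i, j):
        result ^= answer(matrix, rows, cols)
    return result
```

Each copy sees only one pair of masks and returns one bit, so the communication is about $2 n^{1/2}$ bits up and one bit down per copy, instead of the n bits of the trivial solution.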

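An example PIR protocol: why does it work?

[Figure: the cells of X selected by each of the four query pairs, with the number of answers that include each region: 4, 2, 4 / 2, 1, 2 / 4, 2, 4]

- a bit $x_{ab}$ with $a \ne i$ and $b \ne j$ is included in either all four answers or none of them
- a bit in row $i$ or in column $j$ (but not both) is included in either two answers or none, because exactly one of $s$, $s'$ (or of $t$, $t'$) selects it
- the bit $x_{ij}$ is included in exactly one answer
- every bit except $x_{ij}$ therefore appears an even number of times and cancels out in the final XOR

An example PIR protocol: why is it private?

- each database receives two random vectors that are independent of $i$ and $j$
- hence no information on $i$ and $j$ is leaked to the database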

Information theoretic vs. computational PIR

- information theoretic PIR protocols leak no information (in the information theoretic sense) about the index $i$ requested by Alice
  - they withstand attacks even from a database with unlimited computational power
- computational PIR (CPIR) protocols provide weaker guarantees:
  - they ensure only that the database cannot get any information unless it solves a computationally hard problem (reduction)
- information theoretic PIR protocols require more than one non-communicating copy of the database, while CPIR protocols with low communication overhead exist even for the single-database case

An example CPIR protocol

- preliminaries
  - let $m$ be a positive integer
  - a number $a$ is a quadratic residue (QR) mod $m$ if there is an integer $x$ such that $x^2 \bmod m = a$
  - otherwise $a$ is a quadratic non-residue (QNR) mod $m$
  - it is believed to be computationally hard to distinguish QRs mod $m$ from QNRs mod $m$ (for a suitable composite $m$, among numbers with Jacobi symbol +1), unless one knows the factorization of $m$
- setup
  - the bits of X are arranged in an $n^{1/2} \times n^{1/2}$ matrix
  - Alice wants to retrieve $x_{ij}$ ($1 \le i, j \le n^{1/2}$)

An example CPIR protocol

- protocol:
  - Alice chooses at random a large integer $m$ (together with its factorization)
  - she generates $n^{1/2} - 1$ random QRs mod $m$: $a_1, a_2, \ldots, a_{i-1}, a_{i+1}, \ldots, a_{n^{1/2}}$
  - she generates a random QNR mod $m$: $b$
  - Alice sends $a_1, a_2, \ldots, a_{i-1}, b, a_{i+1}, \ldots, a_{n^{1/2}}$ to the database
  - the server cannot distinguish QRs from QNRs mod $m$, so from the server's point of view the received vector is just an array of random numbers: $u_1, u_2, \ldots, u_{n^{1/2}}$
  - for each column $c$ of X, the database computes $v_c = u_1^{x_{1c}} u_2^{x_{2c}} \cdots u_{n^{1/2}}^{x_{n^{1/2} c}} \bmod m$
  - the database responds with $v_1, v_2, \ldots, v_{n^{1/2}}$
  - Alice checks whether $v_j$ is a QR or a QNR mod $m$
    - if QR, then $x_{ij} = 0$
    - if QNR, then $x_{ij} = 1$

An example CPIR protocol: why does it work?

- the value Alice inspects is $v_j = a_1^{x_{1j}} \cdots a_{i-1}^{x_{i-1,j}} \, b^{x_{ij}} \, a_{i+1}^{x_{i+1,j}} \cdots a_{n^{1/2}}^{x_{n^{1/2} j}} \bmod m$
- if $x_{ij} = 0$ then only QRs are multiplied, otherwise QRs are multiplied with a single QNR
- it is known that QR x QR = QR and QR x QNR = QNR
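The quadratic-residue protocol above can be sketched as follows. Everything specific here is an assumption made for illustration: the modulus is the product of two well-known Mersenne primes (far too small and too structured for real use), the function names are hypothetical, indices are 0-based, and the QNR is chosen to be a non-residue modulo both prime factors so that its Jacobi symbol is +1.

```python
import secrets
from math import gcd

# Toy parameters (assumption): Alice's secret factorization and public modulus.
P = 2**31 - 1
Q = 2**61 - 1
M = P * Q

def random_unit():
    """A random number in [2, M-1] that is coprime to M."""
    while True:
        r = 2 + secrets.randbelow(M - 2)
        if gcd(r, M) == 1:
            return r

def is_qr(v):
    """Euler's criterion modulo each prime factor; needs the factorization."""
    return pow(v, (P - 1) // 2, P) == 1 and pow(v, (Q - 1) // 2, Q) == 1

def random_qr():
    r = random_unit()
    return r * r % M

def random_qnr():
    """Non-residue modulo both P and Q, so its Jacobi symbol mod M is +1."""
    while True:
        r = random_unit()
        if pow(r, (P - 1) // 2, P) != 1 and pow(r, (Q - 1) // 2, Q) != 1:
            return r

def make_query(side, i):
    """Alice: QRs everywhere except a QNR in position i (the wanted row)."""
    u = [random_qr() for _ in range(side)]
    u[i] = random_qnr()
    return u

def server_answer(matrix, u):
    """Database: for each column c, multiply the u_r whose bit x_rc is 1."""
    side = len(matrix)
    answers = []
    for c in range(side):
        v = 1
        for r in range(side):
            if matrix[r][c]:
                v = v * u[r] % M
        answers.append(v)
    return answers

def retrieve(matrix, i, j):
    """Alice: v_j is a QR exactly when x_ij is 0."""
    v = server_answer(matrix, make_query(len(matrix), i))
    return 0 if is_qr(v[j]) else 1
```

The server uses only the public modulus `M`; the factors `P` and `Q` appear only in Alice's functions. Both directions carry $n^{1/2}$ numbers mod $m$, which is where the sublinear communication comes from.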

State of the art

- the best known information theoretic PIR protocol is based on representing the database as a polynomial, and requires the transmission of $n^{O(\log\log k / (k \log k))}$ bits (where $k$ is the number of copies of the database)
- CPIR schemes have been constructed based on the difficulty of the Quadratic Residuosity Problem ($O(n^{\varepsilon})$) and the φ-hiding problem ($O((\log n)^a)$), and based on one-way permutations ($n - o(n)$)
- connections of CPIR to oblivious transfer, collision resistant hash functions, function hiding public key crypto, and complexity theory in general have been studied

Variants of PIR and CPIR

- block PIR
  - what if Alice wants a block of bits (of size m)? can we do better than invoking a PIR protocol m times?
- robust PIR
  - what if some of the database copies break down or return false answers (Byzantine failure model)?
- t-private PIR
  - how to ensure that even t colluding databases cannot figure out which bit Alice is interested in?
- symmetric PIR
  - how to prevent Alice from learning more than just the bit she is interested in?
- PIR with preprocessing
  - the database usually has to do O(n) computations; can this be cut down?

Locally decodable codes (LDCs)

- error correcting codes add redundancy to a message
  - message -> codeword -> sent over a noisy channel -> recover the message even if some fraction of the codeword bits are corrupted
- in practice, longer messages are partitioned into smaller blocks and each block is coded separately
  - this allows efficient random access to message bits (one must decode only a fraction of the received codewords)
  - however, if even a single codeword is lost (unrecoverable), then the message cannot be recovered
- if the entire message were encoded as a single large block
  - this would improve robustness
  - but random access would require decoding the entire message (typically prohibitively expensive)

Locally decodable codes (LDCs)

- LDCs simultaneously provide random access retrieval and high noise resistance
- this is achieved by allowing the reliable reconstruction of any bit of the message from a small number of randomly chosen codeword bits
- definition: a $(k, \delta, \varepsilon)$-LDC encodes n-bit messages into N-bit codewords such that every bit $x_i$ of the message can be recovered with probability $1 - \varepsilon$ by a randomized decoding procedure that reads only $k$ codeword bits, even if at most $\delta N$ bits of the codeword are corrupted
- local decodability comes at the price of a loss in code efficiency ($N \gg n$)
- finding more efficient (optimal) LDCs is an active research area and a major challenge

An LDC example

- the $(2, \delta, 2\delta)$-Hadamard code encodes n-bit messages into $2^n$-bit codewords
- let $H$ be a binary matrix that contains in its columns all the possible n-bit vectors ($H$ is an $n \times 2^n$ matrix)
- encoding: $y = C(x) = xH$
- decoding (of the $i$-th bit of $x$):
  - pick a random n-bit vector $t$, and let $t'$ be the same as $t$ but with the $i$-th bit flipped
  - $x_i = y_t \oplus y_{t'}$
- probability of successful decoding
  - at most $\delta N$ bits of $y$ are corrupted (with $N = 2^n$) ~ each bit in $y$ is corrupted with probability $\delta$ (independently from the other bits)
  - the probability that $y_t$ or $y_{t'}$ is corrupted is at most $2\delta$
  - the probability that both $y_t$ and $y_{t'}$ are intact (and hence the decoding of $x_i$ is successful) is at least $1 - 2\delta$

LDCs and the PIR problem

- LDCs yield efficient PIR schemes and vice versa
  - all recent constructions of information theoretic PIR schemes work by first constructing LDCs and then converting them into PIR protocols
- general procedure to obtain a k-server PIR scheme from a (perfectly smooth) k-query LDC:
  - each of the k database servers encodes the database X with the LDC and stores $C(X)$
  - if Alice is interested in $x_i$, she generates k random queries $q_1, q_2, \ldots, q_k$ such that $x_i$ can be recovered from $C(X)_{q_1}, \ldots, C(X)_{q_k}$, and sends $q_j$ to DB$_j$
  - each server DB$_j$ responds with one bit $C(X)_{q_j}$
  - Alice combines the responses to obtain $x_i$
- privacy
  - perfect smoothness of the LDC means that individual queries are distributed perfectly uniformly over the codeword bits
  - thus, in the PIR scheme, every query $q_j$ is independent from $i$, and hence reveals no information on $i$
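The Hadamard encoding and its two-query local decoder can be sketched as follows. The names `encode` and `decode_bit` are hypothetical, bit positions are 0-based, and a codeword position is identified with the integer whose binary digits form the vector $t$. Because the codeword has $2^n$ bits, this is only practical for very small n.

```python
import secrets

def encode(bits):
    """Hadamard codeword: y_t = <x, t> mod 2 for every n-bit vector t."""
    n = len(bits)
    x = sum(b << k for k, b in enumerate(bits))   # pack the message into an int
    return [bin(x & t).count("1") & 1 for t in range(2 ** n)]

def decode_bit(codeword, n, i):
    """Recover x_i by reading only two codeword positions, t and t'."""
    t = secrets.randbelow(2 ** n)     # uniformly random n-bit vector
    t_flip = t ^ (1 << i)             # t' = t with the i-th bit flipped
    return codeword[t] ^ codeword[t_flip]
```

Since $y_t \oplus y_{t'} = \langle x, t \oplus t' \rangle = x_i$, decoding succeeds whenever both probed positions are intact. Read as a 2-server PIR scheme, $t$ goes to one server and $t'$ to the other; each position on its own is uniformly distributed, so neither server learns anything about $i$.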

Further readings

- S. Yekhanin, Private Information Retrieval, Communications of the ACM, Vol. 53 No. 4, April 2010.
  - good intro + connection with LDCs
  - see also his recent PhD thesis (done at MIT)
- W. Gasarch, A survey on Private Information Retrieval, online
  - contains open problems and a lot of references
- R. Ostrovsky, W. Skeith, A survey of single-database PIR: techniques and applications, online
  - some constructions in detail + relation to other problems
- + references in the above papers