Oblivious Parallel RAM and Applications

Elette Boyle (Technion, Israel)    Kai-Min Chung (Academia Sinica)    Rafael Pass (Cornell University)

August 31, 2015

Abstract

We initiate the study of cryptography for parallel RAM (PRAM) programs. The PRAM model captures modern multi-core architectures and cluster computing models, where several processors execute in parallel and make accesses to shared memory, and provides the "best of both" circuit and RAM models, supporting both cheap random access and parallelism.

We propose and attain the notion of Oblivious PRAM. We present a compiler taking any PRAM into one whose distribution of memory accesses is statistically independent of the data (with negligible error), while only incurring a polylogarithmic slowdown (in both total and parallel complexity). We discuss applications of such a compiler, building upon recent advances relying on Oblivious (sequential) RAM (Goldreich-Ostrovsky, JACM '12). In particular, we demonstrate the construction of a garbled PRAM compiler based on an OPRAM compiler and secure identity-based encryption.

The research of the first author has received funding from the European Union's Tenth Framework Programme (FP10/ ) under grant agreement no. ERC-CaC, and ISF grant 1709/14. Pass is supported in part by a Google Faculty Award, Alfred P. Sloan Fellowship, Microsoft New Faculty Fellowship, NSF Award CNS , NSF CAREER Award CCF , NSF Award CCF , AFOSR YIP Award FA , and DARPA and AFRL under contract FA . The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government. This work was done in part while the authors were visiting the Simons Institute for the Theory of Computing, supported by the Simons Foundation and by the DIMACS/Simons Collaboration in Cryptography through NSF grant #CNS .

1 Introduction

Completeness results in cryptography provide general transformations from arbitrary functionalities described in a particular computational model, to solutions for executing the functionality securely within a desired adversarial model. Classic results, stemming from [Yao82, GMW87], modeled computation as boolean circuits, and showed how to emulate the circuit securely gate by gate. As the complexity of modern computing tasks scales at tremendous rates, it has become clear that the circuit model is not appropriate: converting lightweight, optimized programs first into a circuit in order to obtain security is not a viable option. Large effort has recently been focused on enabling direct support of functionalities modeled as Turing machines or random-access machines (RAM) (e.g., [OS97, GKK+12, LO13a, GKP+13, GHRW14, GHL+14, GLOS15, CHJV15, BGL+15, KLW15]). This approach avoids several sources of expensive overhead in converting modern programs into circuit representations. However, it actually introduces a different dimension of inefficiency: RAM (and single-tape Turing) machines do not support parallelism; thus, even if an insecure program can be heavily parallelized, its secure version will be inherently sequential.

Modern computing architectures are better captured by the notion of a Parallel RAM (PRAM). In the PRAM model of computation, several (polynomially many) CPUs are simultaneously running, accessing the same shared "external" memory. Note that PRAM CPUs can model physical processors within a single multicore system, as well as distinct computing entities within a distributed computing environment. We consider an expressive model where the number of active CPUs may vary over time (as long as the pattern of activation is fixed a priori). In this sense, PRAMs capture the best of both the RAM and circuit models: a RAM program handles random access but is entirely sequential; circuits handle parallelism with a variable number of parallel resources (i.e., the circuit width), but not random access; variable-CPU PRAMs capture both random access and variable parallel resources.

We thus put forth the challenge of designing cryptographic primitives that directly support PRAM computations, while preserving computational resources (total computational complexity and parallel time) up to polylogarithmic factors, while using the same number of parallel processors.

Oblivious Parallel RAM (OPRAM). A core step toward this goal is to ensure that secret information is not leaked via the memory access patterns of the resulting program execution. A machine is said to be memory oblivious, or simply oblivious, if the sequences of memory accesses made by the machine on two inputs with the same running time are identically (or close to identically) distributed. In the late 1970s, Pippenger and Fischer [PF79] showed that any Turing Machine Pi can be compiled into an oblivious one Pi' (where memory accesses correspond to the movement of the head on the tape) with only a logarithmic slowdown in running time. Roughly ten years later, Goldreich and Ostrovsky [Gol87, GO96] proposed the notion of Oblivious RAM (ORAM), and showed a similar transformation result with polylogarithmic slowdown. In recent years, ORAM compilers have become a central tool in developing cryptography for RAM programs, and a great deal of research has gone toward improving both the asymptotic and concrete efficiency of ORAM compilers (e.g., [Ajt10, DMN11, GMOT11, KLO12, CP13, CLP14, GGH+13, SvDS+13, CLP14, WHC+14, RFK+14, WCS14]). However, for all such compilers, the resulting program is inherently sequential.

In this work, we propose the notion of Oblivious Parallel RAM (OPRAM).
We present the first OPRAM compiler, converting any PRAM into an oblivious PRAM, while only inducing a polylogarithmic slowdown to both the total and parallel complexities of the program.

Theorem 1.1 (OPRAM, informally stated). There exists an OPRAM compiler with O(log(m) log^3(n)) worst-case overhead in total and parallel computation, and f(n) memory overhead for any f in omega(1), where n is the memory size and m is an upper bound on the number of CPUs in the PRAM.

We emphasize that applying even the most highly optimized ORAM compiler to an m-processor PRAM program inherently inflicts Omega(m log(n)) overhead in the parallel runtime, in comparison to our O(log(m) polylog(n)). When restricted to single-CPU programs, our construction incurs slightly greater logarithmic overhead than the best optimized ORAM compilers (achieving O(log n) overhead for optimal block sizes); we leave as an interesting open question how to optimize parameters. (As we will elaborate on shortly, some very interesting results towards addressing this have been obtained in the follow-up work of [CLT15].)

1.1 Applications of OPRAM

ORAM lies at the base of a wide range of applications. In many cases, we can directly replace the underlying ORAM with an OPRAM to enable parallelism within the corresponding secure application. For others, simply replacing ORAM with OPRAM does not suffice; nevertheless, in this paper, we demonstrate one application (garbling of PRAM programs) where the difficulties can be overcome; follow-up works show further applications (secure computation and obfuscation).

Direct Applications of OPRAM. We briefly describe some direct applications of OPRAM.

Improved/Parallelized Outsourced Data. Standard ORAM has been shown to yield effective, practical solutions for securely outsourcing data storage to an untrusted server (e.g., the ObliviStore system of [SS13]). Efficient OPRAM compilers will enable these systems to support secure efficient parallel accesses to outsourced data. For example, OPRAM procedures securely aggregate parallel data requests and resolve conflicts client-side, minimizing expensive client-server communications (as was explored in [WST12], at a smaller scale). As network latency is a major bottleneck in ORAM implementations, such parallelization may yield significant improvements in efficiency.

Multi-Client Outsourced Data. In a similar vein, use of OPRAM further enables secure access and manipulation of outsourced shared data by multiple (mutually trusting) clients. Here, each client can simply act as an independent CPU, and will execute the OPRAM-compiled program corresponding to the parallel concatenation of their independent tasks.

Secure Multi-Processor Architecture. Much recent work has gone toward implementing secure hardware architectures by using ORAM to prevent information leakage via access patterns of the secure processor to the potentially insecure memory (e.g., the Ascend project of [FDD12]). Relying instead on OPRAM opens the door to achieving secure hardware in the multi-processor setting.

Garbled PRAM (GPRAM). Garbled circuits [Yao82] allow a user to convert a circuit C and input x into garbled versions C~ and x~, in such a way that C~ can be evaluated on x~ to reveal the output C(x), but without revealing further information on C or x. Garbling schemes have found countless applications in cryptography, ranging from delegation of computation to secure multiparty protocols (see below). It was recently shown (using ORAM) how to directly garble RAM programs [GHL+14, GLOS15], where the cost of evaluating a garbled program P~ scales with its RAM (and not circuit) complexity.

In this paper, we show how to employ any OPRAM compiler to attain a garbled PRAM (GPRAM), where the time to generate and evaluate the garbled PRAM program P~ scales with the parallel time complexity of P. Our construction is based on one of the constructions of [GHL+14] and extends it using some of the techniques developed for our OPRAM. Plugging in our (unconditional) OPRAM construction, we obtain:

Theorem 1.2 (Garbled PRAM, informally stated). Assuming identity-based encryption, there exists a secure garbled PRAM scheme with total and parallel overhead poly(kappa) * polylog(n), where kappa is the security parameter of the IBE and n is the size of the garbled data.

Secure Two-Party and Multi-Party Computation of PRAMs. Secure multi-party computation (MPC) enables mutually distrusting parties to jointly evaluate functions on their secret inputs, without revealing information on the inputs beyond the desired function output. ORAM has become a central tool in achieving efficient MPC protocols for securely evaluating RAM programs. By instead relying on OPRAM, these protocols can leverage parallelizability of the evaluated programs. Our garbled PRAM construction mentioned above yields constant-round secure protocols where the time to execute the protocol scales with the parallel time of the program being evaluated. In a companion paper [BCP15], we further demonstrate how to use OPRAM to obtain efficient protocols for securely evaluating PRAMs in the multi-party setting; see [BCP15] for further details.

Obfuscation for PRAMs. In a follow-up work, Chen et al. [CCC+15] rely on our specific OPRAM construction (and show that it satisfies an additional "puncturability" property) to achieve obfuscation for PRAMs.

1.2 Technical Overview

Begin by considering the simplest idea toward memory obliviousness: suppose data is stored in random(-looking) shuffled order, and for each data query i, the lookup is performed to its permuted location, sigma(i). One can see this provides some level of hiding, but clearly does not suffice for general programs. The problem with the simple solution is in correlated lookups over time: as soon as item i is queried again, this collision will be directly revealed. Indeed, hiding correlated lookups while maintaining efficiency is perhaps the core challenge in building oblivious RAMs. In order to bypass this problem, ORAM compilers heavily depend on the ability of the CPU to move data around, and to update its secret state after each memory access.

However, in the parallel setting, we find ourselves back at square one. Suppose in some time step, a group of processors all wish to access data item i. Having all processors attempt to perform the lookup directly within a standard ORAM construction corresponds to running the ORAM several times without moving data or updating state. This immediately breaks security in all existing ORAM compiler constructions. On the other hand, we cannot afford for the CPUs to take turns, accessing and updating the data sequentially.

In this overview, we discuss our techniques for overcoming this and further challenges. We describe our solution somewhat abstractly, building on a sequential ORAM compiler with a tree-based structure as introduced by Shi et al. [SCSL11]. In our formal construction and analysis, we rely on the specific tree-based ORAM compiler of Chung and Pass [CP13] that enjoys a particularly clean description and analysis.
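To make the correlated-lookup leakage above concrete, the following minimal Python sketch (ours, not from the paper) simulates the naive permuted-storage idea: the physical addresses touched are observable, and two queries to the same logical item i touch the same location sigma(i), revealing the collision.

import random

n = 8
sigma = list(range(n))
random.shuffle(sigma)              # secret permutation i -> sigma[i]
memory = [None] * n
for i in range(n):
    memory[sigma[i]] = f"item{i}"  # data stored in shuffled order

trace = []                         # physical access pattern seen by the adversary
def lookup(i):
    trace.append(sigma[i])
    return memory[sigma[i]]

for query in (3, 5, 3):            # item 3 is queried twice
    lookup(query)
print(trace)                       # first and third entries coincide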

Tree-Based ORAM Compilers. We begin by roughly describing the structure of tree-based ORAMs, originating in the work of [SCSL11]. At a high level, data is stored in the structure of a binary tree, where each node of the tree corresponds to a fixed-size "bucket" that may hold a collection of data items. Each memory cell addr in the original database is associated with a random path (equivalently, leaf) within a binary tree, as specified by a position map path_addr = Pos(addr). The schemes maintain three invariants: (1) the content of memory cell addr will be found in one of the buckets along the path path_addr; (2) given the view of the adversary (i.e., memory accesses) up to any point in time, the current mapping Pos appears uniformly random; and (3) with overwhelming probability, no node in the binary tree will ever "overflow," in the sense that its corresponding memory bucket is instructed to store more items than its fixed capacity.

These invariants are maintained by the following general steps:

1. Lookup: To access a memory item addr, the CPU accesses all buckets down the path path_addr, and removes it where found.

2. Data "put-back": At the conclusion of the access, the memory item addr is assigned a freshly random path path'_addr (updating Pos(addr) <- path'_addr), and is returned to the root node of the tree.

3. Data flush: To ensure the root (and any other bucket) does not overflow, data is flushed down the tree via some procedure. For example, in [SCSL11], the flush takes place by selecting and emptying two random buckets from each level into their appropriate children; in [CP13], it takes place by choosing an independent path in the tree and pushing data items down this path as far as they will go (see Figure 1 in Section 2.2).

Extending to Parallel RAMs. We must address the following problems with attempting to access a tree-based ORAM in parallel.

Parallel memory lookups: As discussed, a core challenge is in hiding correlations in parallel CPU accesses. In tree-based ORAMs, if CPUs access different data items in a time step, they will access different paths in the tree, whereas if they attempt to simultaneously access the same data item, they will each access the same path in the tree, blatantly revealing a collision. To solve this problem, before each lookup we insert a CPU-coordination phase. We observe that in tree-based ORAM schemes, this problem only manifests when CPUs access exactly the same item; otherwise items are associated with independent leaf nodes, and there are no bad correlations. We thus resolve this issue by letting the CPUs check, through an oblivious aggregation operation, whether two (or more) of them wish to access the same data item; if so, a representative is selected (the CPU with the smallest id) to actually perform the memory access, and all the others merely perform "dummy" lookups. Finally, the representative CPU needs to communicate the read value back to all the other CPUs that wanted to access the same data item; this is done using an oblivious multi-cast operation. The challenge is in doing so without introducing too much overhead (namely, allowing only per-CPU memory, computation, and parallel time polylogarithmic in both the database size and the number of CPUs), and in a way that itself retains memory obliviousness.

Parallel "put-backs": After a memory cell is accessed, the (possibly updated) data is assigned a fresh random path and is reinserted to the tree structure. To maintain the required invariants, the item must be inserted somewhere along its new path, without revealing any information about the path. In tree-based ORAMs, this is done by reinserting at the root node of the tree. However, this single node can hold only a small bounded number of elements (corresponding to the fixed bucket size), whereas the number of processors m, each with an item to reinsert, may be significantly larger. To overcome this problem, instead of returning data items to the root, we directly insert them into level log m of the tree, while ensuring that they are placed into the correct bucket along their assigned path. Note that level log m contains m buckets, and since the m items are each assigned to random leaves, each bucket will in expectation be assigned exactly 1 item. The challenge in this step is specifying how the m CPUs can insert elements into the tree while maintaining memory obliviousness. For example, if each CPU simply inserts their own item into its assigned node, we immediately leak information about its destination leaf node. To resolve this issue, we have the CPUs obliviously route items between each other, so that eventually the i-th CPU holds the items to be inserted into the i-th node, and all CPUs finally perform either a real or a dummy write to their corresponding node.

Preventing overflows: To ensure that no new overflows are introduced after inserting m items, we now flush m times instead of once, and all these m flushes are done in parallel: each CPU simply performs an independent flush. These parallel flushes may lead to conflicts in nodes accessed (e.g., each flush operation will likely access the root node). As before, we resolve this issue by having the CPUs elect some representative to perform the appropriate operations for each accessed node; note, however, that this step is required only for correctness, and not for security.

Our construction takes a modular approach. We first specify and analyze our compiler within a simplified setting, where oblivious communication between CPUs is "for free." We then show how to efficiently instantiate the required CPU communication procedures (oblivious routing, oblivious aggregation, and oblivious multi-cast), and describe the final compiler making use of these procedures. In this extended abstract, we defer the first step to Section 3.1, and focus on the remaining steps.

1.3 Related Work

Restricted cases of parallelism in Oblivious RAM have appeared in a handful of prior works. It was observed by Williams, Sion, and Tomescu [WST12] in their PrivateFS work that existing ORAM compilers can support parallelization across data accesses up to the size of the top level^1 (in particular, at most log n), when coordinated through a central trusted entity. We remark that central coordination is not available in the PRAM model. Goodrich and Mitzenmacher [GM11] showed that parallel programs in MapReduce format can be made oblivious by simply replacing the "shuffle" phase (in which data items with a given key are routed to the corresponding CPU) with a fixed-topology sorting network. The goal of improving the parallel overhead of ORAM was studied by Lorch et al. [LPM+13], but does not support compilation of PRAMs without first sequentializing.

^1 E.g., for tree-based ORAMs, the size of the root bucket.

Follow-up work. As mentioned above, our OPRAM compiler has been used in the recent works of Boyle, Chung, and Pass [BCP15] and Chen et al. [CCC+15] to obtain secure multi-party computation for PRAM, and indistinguishability obfuscation for PRAM, respectively.

A different follow-up work by Nayak et al. [NWI+15] provides targeted optimizations and an implementation for secure computation of specific parallel tasks.

Very recently, an exciting follow-up work of Chen, Lin, and Tessaro [CLT15] builds upon our techniques to obtain two new constructions: an OPRAM compiler whose overhead in expectation matches that of the best current sequential ORAM [SvDS+13]; and a general transformation taking any generic ORAM compiler to an OPRAM compiler with log n overhead in expectation. Their OPRAM constructions, however, only apply to the special case of PRAM with a fixed number of processors being activated at every step (whereas our notion of a PRAM requires handling also a variable number of processors^2); for the case of variable-CPU PRAMs, the results of [CLT15] incur an additional multiplicative overhead of m in terms of computational complexity, and thus the bounds obtained are incomparable.

^2 As previously mentioned, dealing with a variable number of processors is needed to capture standard circuit models of computation, where the circuit topology may be of varying width.

2 Preliminaries

2.1 Parallel RAM (PRAM) Programs

We consider the most general case of Concurrent Read Concurrent Write (CRCW) PRAMs. An m-processor CRCW parallel random-access machine (PRAM) with memory size n consists of numbered processors CPU_1, ..., CPU_m, each with local memory registers of size log n, which operate synchronously in parallel and can make access to shared "external" memory of size n.

A PRAM program Pi (given m, n, and some input x stored in shared memory) provides CPU-specific execution instructions, which can access the shared data via commands Access(r, v), where r in [n] is an index to a memory location, and v is a word (of size log n) or the empty symbol ⊥. Each Access(r, v) instruction is executed as:

1. Read from shared memory cell address r; denote the value by v_old.
2. Write value v to address r (if v = ⊥, then take no action).
3. Return v_old.

In the case that two or more processors simultaneously initiate Access(r, v_i) with the same address r, then all requesting processors receive the previously existing memory value v_old, and the memory is rewritten with the value v_i corresponding to the lowest-numbered CPU i for which v_i != ⊥.

We more generally support PRAM programs with a dynamic number of processors (i.e., m_i processors required for each time step i of the computation), as long as this sequence of processor numbers m_1, m_2, ... is public information. The complexity of our OPRAM solution will scale with the number of required processors in each round, instead of the maximum number of required processors.

The (parallel) time complexity of a PRAM program Pi is the maximum number of time steps taken by any processor to evaluate Pi, where each Access execution is charged as a single step. The PRAM complexity of a function f is defined as the minimal parallel time complexity of any PRAM program which evaluates f. We remark that the PRAM complexity of any function f is bounded above by its circuit depth complexity.
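As a minimal illustration of these CRCW semantics (a sketch under our own naming, not part of the paper), one timestep of concurrent Access(r, v) calls can be simulated as follows: every requester reads v_old, and among the writers to each address, the lowest-numbered CPU wins.

def crcw_step(memory, requests):
    # requests: dict mapping cpu id -> (r, v); v = None denotes a pure read.
    returns = {cpu: memory[r] for cpu, (r, _) in requests.items()}  # all read v_old
    winners = {}
    for cpu in sorted(requests):          # ascending CPU id
        r, v = requests[cpu]
        if v is not None and r not in winners:
            winners[r] = v                # lowest-numbered writer wins
    for r, v in winners.items():
        memory[r] = v
    return returns

memory = [0] * 4
print(crcw_step(memory, {1: (2, 'a'), 3: (2, 'b'), 2: (2, None)}))  # all get 0
print(memory[2])                          # 'a', since CPU 1 < CPU 3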

Remark 2.1 (CPU-to-CPU Communication). It will be sometimes convenient notationally to assume that CPUs may communicate directly amongst themselves. When the identities of sending and receiving CPUs are known a priori (which will always be the case in our constructions), such communication can be emulated in the standard PRAM model with constant overhead by communicating through memory. That is, each action "CPU1 sends message m to CPU2" is implemented in two time steps: first, CPU1 writes m into a special designated memory location addr_CPU1; in the following time step, CPU2 performs a read access to addr_CPU1 to learn the value m.

2.2 Tree-Based ORAM

Concretely, our solution relies on the ORAM due to Chung and Pass [CP13], which in turn closely follows the tree-based ORAM construction of Shi et al. [SCSL11]. We now recall the [CP13] construction in greater detail, in order to introduce notation for the remainder of the paper.

The [CP13] construction (as with [SCSL11]) proceeds by first presenting an intermediate solution achieving obliviousness, but in which the CPU must maintain a large number of registers (specifically, providing a means for securely storing n data items requiring CPU state size Theta(n/alpha), where alpha > 1 is any constant). Then, this solution is recursively applied log_alpha(n) times to store the resulting CPU state, until finally reaching a CPU state of size polylog(n), while only blowing up the computational overhead by a factor log_alpha(n). The overall compiler is fully specified by describing one level of this recursion.

Step 1: Basic ORAM with O(n) registers. The compiler ORAM, on input n in N and a program Pi with memory size n, outputs a program Pi' that is identical to Pi but where each Read(r) or Write(r, val) is replaced by corresponding commands ORead(r), OWrite(r, val), to be specified shortly. Pi' has the same registers as Pi and additionally has n/alpha registers used to store a position map Pos, plus a polylogarithmic number of additional work registers used by ORead and OWrite. In its external memory, Pi' will maintain a complete binary tree Gamma of depth l = log(n/alpha); we index nodes in the tree by a binary string of length at most l, where the root is indexed by the empty string lambda, and each node indexed by gamma has left and right children indexed gamma0 and gamma1, respectively. Each memory cell r will be associated with a random leaf pos in the tree, specified by the position map Pos; as we shall see shortly, the memory cell r will be stored at one of the nodes on the path from the root lambda to the leaf pos. To ensure that the position map is smaller than the memory size, we assign a block of alpha consecutive memory cells to the same leaf; thus memory cell r corresponding to block b = floor(r/alpha) will be associated with leaf pos = Pos(b).

Each node in the tree is associated with a bucket which stores (at most) K tuples (b, pos, v), where v is the content of block b and pos is the leaf associated with the block b, and K in omega(log n) cap polylog(n) is a parameter that will determine the security of the ORAM (thus each bucket stores K(alpha + 2) words). We assume that all registers and memory cells are initialized with a special symbol ⊥.

The following is a specification of the ORead(r) procedure:

Fetch: Let b = floor(r/alpha) be the block containing memory cell r (in the original database), and let i = r mod alpha be r's component within the block b. We first look up the position of the block b using the position map: pos = Pos(b); if Pos(b) = ⊥, set pos <- [n/alpha] to be a uniformly random leaf. Next, traverse the data tree from the root to the leaf pos, making exactly one read and one write operation for the memory bucket associated with each of the nodes along the path.

More precisely, we read the content once, and then we either write it back (unchanged), or we simply erase it (writing ⊥), so as to implement the following task: search for a tuple of the form (b, pos, v) for the desired b, pos in any of the nodes during the traversal; if such a tuple is found, remove it from its place in the tree and set v to the found value, and otherwise take v = ⊥. Finally, return the i-th component of v as the output of the ORead(r) operation.

Figure 1: Illustration of the basic [CP13] ORAM construction. (Shown: the position map Pos maps block b = floor(r/alpha) to a leaf, here pos = 011; the value of memory cell r is found somewhere on the path from the root lambda to leaf pos, and a flush proceeds along an independently random path from lambda.)

Update Position Map: Pick a uniformly random leaf pos' <- [n/alpha] and let Pos(b) = pos'.

Put Back: Add the tuple (b, pos', v) to the root lambda of the tree. If there is not enough space left in the bucket, abort outputting overflow.

Flush: Pick a uniformly random leaf pos* <- [n/alpha] and traverse the tree from the root to the leaf pos*, making exactly one read and one write operation for every memory cell associated with the nodes along the path, so as to implement the following task: push down each tuple (b', pos', v') read in the nodes traversed so far as possible along the path to pos*, while ensuring that the tuple is still on the path to its associated leaf pos' (that is, the tuple ends up in the node gamma = the longest common prefix of pos' and pos*). Note that this operation can be performed trivially as long as the CPU has sufficiently many work registers to load two whole buckets into memory; since the bucket size is polylogarithmic, this is possible. If at any point some bucket is about to overflow, abort outputting overflow.

OWrite(r, v) proceeds identically in the same steps as ORead(r), except that in the Put Back step, we add the tuple (b, pos', v''), where v'' is the string v' with the i-th component set to v (instead of adding the tuple (b, pos', v) as in ORead). (Note that, just as ORead, OWrite also outputs the original memory content of the memory cell r; this feature will be useful in the full-fledged construction.)
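The flush rule above is determined entirely by longest common prefixes: a tuple associated with leaf pos' is pushed down the flush path toward pos* exactly to the node indexed by the longest common prefix of pos' and pos*. A minimal Python sketch of this rule (our naming):

def flush_destination(pos_assigned, pos_flush):
    # Leaves are bit strings of equal length; returns the index (bit string)
    # of the deepest node lying on both root-to-leaf paths ("" is the root).
    prefix = ""
    for a, b in zip(pos_assigned, pos_flush):
        if a != b:
            break
        prefix += a
    return prefix

# A tuple destined for leaf 0110, flushed along path 0100, lands at node "01".
print(flush_destination("0110", "0100"))  # -> "01"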

The full-fledged construction: ORAM with polylog registers. The full-fledged construction of the CP ORAM proceeds as above, except that instead of storing the position map in registers in the CPU, we now recursively store it in another ORAM (which only needs to operate on n/alpha memory cells, but still using buckets that store K tuples). Recall that each invocation of ORead and OWrite requires reading one position in the position map and updating its value to a random leaf; that is, we need to perform a single recursive OWrite call (recall that OWrite updates the value in a memory cell, and returns the old value) to emulate the position map. At the base of the recursion, when the position map is of constant size, we use the trivial ORAM construction which simply stores the position map in the CPU registers.

Theorem 2.2 ([CP13]). The compiler ORAM described above is a secure Oblivious RAM compiler with polylog(n) worst-case computation overhead and omega(log n) memory overhead, where n is the database memory size.

2.3 Sorting Networks

Our protocol will employ an n-wire sorting network, which can be used to sort values on n wires via a fixed topology of comparisons. A sorting network consists of a sequence of layers, each layer in turn consisting of one or more comparator gates, which take two wires as input, and swap the values when in unsorted order. Formally, given input values x = (x_1, ..., x_n) (which we assume to be integers wlog), a comparator operation compare(i, j, x) for i < j returns x' where x' = x if x_i <= x_j, and otherwise swaps these values, i.e., x'_i = x_j and x'_j = x_i (whereas x'_k = x_k for all k != i, j). Formally, a layer in the sorting network is a set L = {(i_1, j_1), ..., (i_k, j_k)} of pairwise-disjoint pairs of distinct indices of [n]. A d-depth sorting network is a list SN = (L_1, ..., L_d) of layers, with the property that for any input vector x, the final output will be in sorted order: x_i <= x_{i+1} for all i < n.

Ajtai, Komlos, and Szemeredi demonstrated a sorting network with depth logarithmic in n.

Theorem 2.3 ([AKS83]). There exists an n-wire sorting network of depth O(log n) and size O(n log n).

While the AKS sorting network is asymptotically optimal, in practical scenarios one may wish to use the simpler alternative construction due to Batcher [Bat68], which achieves significantly smaller constants.
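To fix notation, the following minimal Python sketch (ours) implements the comparator and layer semantics just defined; note the sequence of compare operations is a fixed topology, independent of the data, which is exactly the property oblivious protocols exploit.

def compare(i, j, x):
    # Comparator gate: sort the pair of wires (i, j), i < j, in place.
    if x[i] > x[j]:
        x[i], x[j] = x[j], x[i]

def apply_network(layers, x):
    for layer in layers:        # each layer: pairwise-disjoint pairs (i, j)
        for (i, j) in layer:    # comparators within a layer can run in parallel
            compare(i, j, x)
    return x

# The standard 5-comparator sorting network on 4 wires (depth 3).
net = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
print(apply_network(net, [3, 1, 4, 2]))  # -> [1, 2, 3, 4]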

3 Oblivious PRAM

The definition of an Oblivious PRAM (OPRAM) compiler mirrors that of standard ORAM, with the exception that the compiler takes as input and produces as output a parallel RAM program. Namely, denote the sequence of shared memory cell accesses made during an execution of a PRAM program Pi on input (m, n, x) by Pi~(m, n, x), and denote by ActivationPatterns(Pi, m, n, x) the (public) CPU activation patterns (i.e., number of active CPUs per timestep) of program Pi on input (m, n, x). We present a definition of an OPRAM compiler following Chung and Pass [CP13], which in turn follows Goldreich [Gol87].

Definition 3.1 (Oblivious Parallel RAM). A polynomial-time algorithm O is an Oblivious Parallel RAM (OPRAM) compiler with computational overhead comp(.,.) and memory overhead mem(.,.) if, given m, n in N and a deterministic m-processor PRAM program Pi with memory size n, O outputs an m-processor program Pi' with memory size mem(m, n) * n such that for any input x, the parallel running time of Pi'(m, n, x) is bounded by comp(m, n) * T, where T is the parallel runtime of Pi(m, n, x), and there exists a negligible function mu such that the following properties hold:

Correctness: For any m, n in N and any string x in {0,1}*, with probability at least 1 - mu(n), it holds that Pi(m, n, x) = Pi'(m, n, x).

Obliviousness: For any two PRAM programs Pi_1, Pi_2, any m, n in N, and any two inputs x_1, x_2 in {0,1}*, if |Pi~_1(m, n, x_1)| = |Pi~_2(m, n, x_2)| and ActivationPatterns(Pi_1, m, n, x_1) = ActivationPatterns(Pi_2, m, n, x_2), then Pi~'_1(m, n, x_1) is mu-close to Pi~'_2(m, n, x_2) in statistical distance, where Pi'_i <- O(m, n, Pi_i) for i in {1, 2}.

We remark that not all m processors may be active in every time step of a PRAM program Pi, and thus its total computation cost may be significantly less than m * T. We wish to consider OPRAM compilers that also preserve the processor activation structure (and thus total computation complexity) of the original program up to polylogarithmic overhead. Of course, we cannot hope to do so if the processor activation patterns themselves reveal information about the secret data. We thus consider PRAMs Pi whose activation schedules (m_1, ..., m_T) are a priori fixed and public.

Definition 3.2 (Activation-Preserving). An OPRAM compiler O with computation overhead comp(.,.) is said to be activation preserving if, given m, n in N and a deterministic PRAM program Pi with memory size n and fixed (public) activation schedule (m_1, ..., m_T) with each m_i <= m, the program Pi' output by O has activation schedule ((m_1)^t, (m_2)^t, ..., (m_T)^t), i.e., each m_i repeated for t = comp(m, n) consecutive timesteps.

It will additionally be useful in applications (e.g., our construction of garbled PRAMs in Section 4, and the MPC for PRAMs of [BCP15]) that the resulting oblivious PRAM is collision free.

Definition 3.3 (Collision-Free). An OPRAM compiler O is said to be collision free if, given m, n in N and a deterministic PRAM program Pi with memory size n, the program Pi' output by O has the property that no two processors ever access the same data address in the same timestep.

We now present our main result, which we construct and prove in the following subsections.

Theorem 3.4 (Main Theorem: OPRAM). There exists an activation-preserving, collision-free OPRAM compiler with O(log(m) log^3(n)) worst-case computational overhead and f(n) memory overhead, for any f in omega(1), where n is the memory size and m is the number of CPUs.

3.1 Rudimentary Solution: Requiring Large Bandwidth

We first provide a solution for a simplified case, where we are not concerned with minimizing communication between CPUs or the size of required CPU local memory. In such a setting, communicating and aggregating information between all CPUs is "for free."

Our compiler Heavy-O, on input m, n in N, fixed integer constant alpha > 1, and m-processor PRAM program Pi with memory size n, outputs a program Pi' identical to Pi, but with each Access(r, v) operation replaced by the modified procedure Heavy-OPAccess as defined in Figure 2. (Here, "broadcast" means to send the specified message to all other processors.)

Note that Heavy-OPAccess operates recursively for t = 0, ..., log_alpha(n). This corresponds analogously to the recursion in the [SCSL11, CP13] ORAM, where in each step the size of the required secure database memory drops by a constant factor alpha. We additionally utilize a space optimization due to Gentry et al. [GGH+13] that applies to [CP13], where the ORAM tree used for storing data of size n' has depth log(n'/K) (and thus n'/K leaves instead of n'), where K is the bucket size. This enables the overall memory overhead to drop from omega(log n) (i.e., K) to omega(1) with minimal changes to the analysis.
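For intuition on the recursion parameters, a small Python sketch (our naming; the concrete constants follow the definitions above, with L_t = 2 n_t / K from the [GGH+13] optimization):

def recursion_parameters(n, alpha, K):
    # Level-t database size n_t = n / alpha^t and leaf count L_t = 2 n_t / K;
    # the recursion bottoms out after about log_alpha(n) levels.
    params, t, n_t = [], 0, n
    while n_t >= 1:
        params.append((t, n_t, max(1, (2 * n_t) // K)))
        t += 1
        n_t = n // (alpha ** t)
    return params

# e.g. n = 2^20, alpha = 2, K = 64: (0, 1048576, 32768), (1, 524288, 16384), ...
print(recursion_parameters(2 ** 20, 2, 64)[:2])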

Heavy-OPAccess(t, (r_i, v_i)): The Large Bandwidth Case

To be executed by CPU_1, ..., CPU_m w.r.t. (recursive) database size n_t := n/alpha^t, bucket size K.

Input: Each CPU_i holds: recursion level t, instruction pair (r_i, v_i) with r_i in [n_t], global parameter alpha.

Each CPU_i performs the following steps, in parallel:

0. Exit Case: If t >= log_alpha(n), return 0. This corresponds to requesting the (trivial) position map for a block within a single-leaf tree.

1. Conflict Resolution:
(a) Broadcast the instruction pair (r_i, v_i) to all CPUs.
(b) Let b_i = floor(r_i/alpha). Locally aggregate incoming instructions to block b_i as v_i = v_i[1] ... v_i[alpha], resolving write conflicts (i.e., for each s in [alpha], take v_i[s] <- v_j for the minimal j such that r_j = b_i * alpha + s). Denote by rep(b_i) := min{j : floor(r_j/alpha) = b_i} the smallest index j of any CPU whose r_j is in this block b_i. (CPU rep(b_i) will actually access b_i, while others perform dummy accesses.)

2. Recursive Access to Position Map: (Define L_t := 2n_t/K, the number of leaves in the t-th tree.)
If i = rep(b_i): Sample a fresh leaf id l'_i <- [L_t]. Recurse as l_i <- Heavy-OPAccess(t + 1, (b_i, l'_i)) to read the current value l_i of Pos(b_i) and rewrite it with l'_i.
Else: Recursively initiate a dummy access x <- Heavy-OPAccess(t + 1, (1, ⊥)) at an arbitrary address (say 1); ignore the read value x. Sample a fresh random leaf id l_i <- [L_t] for a dummy lookup.

3. Look Up Current Memory Values: Read the memory contents of all buckets down the path to the leaf node l_i defined in the previous step, copying all buckets into local memory. If i = rep(b_i): locate and store the target block triple (b_i, v_old, l_i). Update v_i from Step 1 with existing data: for each s in [alpha], replace any non-written cell values v_i[s] = ⊥ with v_i[s] <- v_old[s]. v_i now stores the entire data block to be rewritten for block b_i.

4. Remove Old Data from ORAM Database:
(a) If i = rep(b_i): Broadcast (b_i, l_i) to all CPUs. Otherwise: broadcast (⊥, l_i).
(b) Initiate UpdateBuckets(n_t, (remove-b_i, l_i), {(remove-b_j, l_j)}_{j in [m]\{i}}), as in Figure 3.

5. Insert New Data into Database in Parallel:
(a) If i = rep(b_i): Broadcast (b_i, v_i, l'_i), with updated value v_i and target leaf l'_i.
(b) Let lev := log(min{m, L_t}) be the ORAM tree level with number of buckets equal to the number of CPUs (the level where data will be inserted). Locally aggregate all incoming instructions whose path l'_j has the lev-bit prefix i: Insert_i := {(b_j, v_j, l'_j) : (l'_j)^(lev) = i}.
(c) Access memory bucket i (at level lev) and rewrite its contents, inserting the data items Insert_i. If the bucket exceeds its capacity, abort with overflow.

6. Flush the ORAM Database:
(a) Sample a random leaf node l^flush_i <- [L_t] along which to flush. Broadcast l^flush_i.
(b) If i <= L_t: Initiate UpdateBuckets(n_t, (flush, l^flush_i), {(flush, l^flush_j)}_{j in [m]\{i}}), as in Figure 3. Recall that flush means to push each encountered triple (b, l, v) down to the lowest point at which the chosen flush path and l agree.

7. Update CPUs: If i = rep(b_i): broadcast the old value v_old of block b_i to all CPUs.

Figure 2: Pseudocode for the oblivious parallel data access procedure Heavy-OPAccess (where we are temporarily not concerned with per-round bandwidth/memory).
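As a small illustration of the conflict-resolution step (Step 1) of Figure 2, the following Python sketch (our naming) aggregates per-block write instructions with the lowest-numbered CPU winning each cell, and elects rep(b) as the minimal CPU index touching block b:

def conflict_resolution(requests, alpha):
    # requests: list indexed by CPU id of pairs (r, v); v = None denotes a read.
    blocks = {}
    for cpu, (r, v) in enumerate(requests):
        b, s = divmod(r, alpha)           # block id and offset within block
        if b not in blocks:
            blocks[b] = {"rep": cpu, "writes": [None] * alpha}
        if v is not None and blocks[b]["writes"][s] is None:
            blocks[b]["writes"][s] = v    # an earlier (lower-id) CPU already won
    return blocks

# CPUs 0 and 2 both touch block 1 (alpha = 4); CPU 0 is rep and its write wins.
print(conflict_resolution([(5, 'x'), (9, None), (5, 'y')], alpha=4))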

UpdateBuckets(n_t, (mycommand_i, mypath_i), {(command_j, path_j)}_{j in [m]\{i}})

Let path^(0), ..., path^(log L_t) denote the bit prefixes of length 0 (i.e., the empty prefix) to log(L_t) of path. For each tree level lev = 0 to log L_t, each CPU_i does the following at bucket mypath_i^(lev):

1. Define CPUs(mypath_i^(lev)) := {i} union {j : path_j^(lev) = mypath_i^(lev)} to be the set of CPUs requesting changes to bucket mypath_i^(lev). Let bucket-rep(mypath_i^(lev)) denote the minimal index in the set.

2. If i != bucket-rep(mypath_i^(lev)), do nothing. Otherwise:

Case 1: mycommand_i = remove-b_i. Interpret each command_j = remove-b_j as a target block id b_j to be removed. Access memory bucket mypath_i^(lev) and rewrite its contents, removing any block b_j for which j in CPUs(mypath_i^(lev)).

Case 2: mycommand_i = flush. Define Flush_i, a subset of {L, R}, as {v : exists path_j s.t. path_j^(lev+1) = mypath_i^(lev) v}, associating L with 0 and R with 1. This determines whether data will be flushed left and/or right from this bucket. Access memory bucket mypath_i^(lev); denote its collection of stored data blocks b_j by ThisBucket. Partition ThisBucket = ThisBucket-L union ThisBucket-R into those blocks whose associated leaves continue to the left or right (i.e., ThisBucket-L := {b_j in ThisBucket : l_j^(lev+1) = mypath_i^(lev) 0}, and similarly for 1).
If L in Flush_i, then set ThisBucket <- ThisBucket \ ThisBucket-L, access memory bucket mypath_i^(lev) 0, and insert the data items ThisBucket-L into it.
If R in Flush_i, then set ThisBucket <- ThisBucket \ ThisBucket-R, access memory bucket mypath_i^(lev) 1, and insert the data items ThisBucket-R into it.
Rewrite the contents of bucket mypath_i^(lev) with the updated value of ThisBucket.

If any bucket exceeds its capacity, abort with overflow.

Figure 3: Procedure for combining CPUs' instructions for buckets and implementing them by a single representative CPU. (Used for correctness, not security.) See Figure 4 for a sample illustration.

Lemma 3.5. For any n, m in N, the compiler Heavy-O is a secure Oblivious PRAM compiler with parallel time overhead O(log^3 n) and memory overhead omega(1), assuming each CPU has Omega~(m) local memory.

We will address the desired claims of correctness, security, and complexity of the Heavy-O compiler by induction on the number of levels of recursion. Namely, for t in [log_alpha(n)], denote by Heavy-O_t the compiler that acts on memory size n/alpha^t by executing Heavy-O only on recursion levels t' = t, (t + 1), ..., log_alpha(n). For each such t, we define the following property.

Level-t Heavy OPRAM: We say that Heavy-O_t is a valid level-t heavy OPRAM if the partial-recursion compiler Heavy-O_t is a secure Oblivious PRAM compiler for memory size n/alpha^t with parallel time overhead O(log^2 n * log(n/alpha^t)) and memory overhead omega(1), assuming each CPU has Omega~(m) local memory.

Then Lemma 3.5 follows directly from the following two claims.

Claim 3.6. Heavy-O_{log_alpha n} is a valid level-(log_alpha n) heavy OPRAM.

Figure 4: UpdateBuckets sample illustration. Here, CPUs 1-3 each wish to modify nodes along their paths as drawn; for each overlapping node, the CPU with lowest id receives and implements the aggregated commands for the node.

Proof. Note that Heavy-O_{log_alpha n}, acting on trivial size-1 memory, corresponds directly to the exit case (Step 0) of Heavy-OPAccess in Figure 2. Namely, correctness, security, and the required efficiency trivially hold, since there is a single data item in a fixed location to access.

Claim 3.7. Suppose Heavy-O_t is a valid level-t heavy OPRAM for t > 0. Then Heavy-O_{t-1} is a valid level-(t-1) heavy OPRAM.

Proof. We first analyze the correctness, security, and complexity overhead of Heavy-O_{t-1} conditioned on never reaching the event overflow (which may occur in Step 5(c), or within the call to UpdateBuckets). Then, we prove that the probability of overflow is negligible in n.

Correctness (w/o overflow). Consider the state of the memory (of the CPUs and server) in each step of Heavy-OPAccess, assuming no overflow. In Step 1, each CPU learns the instruction pairs of all other CPUs; thus all CPUs agree on a single representative rep(b_i) for each requested block b_i, and a correct aggregation of all instructions to be performed on this block. Step 2 is a recursive execution of Heavy-OPAccess. By the inductive hypothesis, this access successfully returns the correct value l_i of Pos(b_i) for each b_i queried, and rewrites it with the freshly sampled value l'_i when specified (i.e., for each rep(b_i) access; the dummy accesses are read-only). We are thus guaranteed that each rep(b_i) will find the desired block b_i in Step 3 when accessing the memory buckets in the path down the tree to leaf l_i (as we assume no overflow was encountered), and so will learn the current stored data value v_old. In Step 4, each CPU learns the target block b_i and associated leaf l_i of every representative CPU rep(b_i). By construction, each requested block b_i appears in some bucket B in the tree along this path, and there will necessarily be some CPU assigned as bucket-rep(B) in UpdateBuckets, who will then successfully remove the block b_i from B. At this point, none of the requested blocks b_i appear in the tree. In Step 5, the CPUs insert each block b_i (with updated data value v_i) into the ORAM data tree at level min{log(n/alpha^t), ceil(log_2(m))} along the path to its (new) leaf l'_i.

Finally, the flushing procedure in Step 6 maintains the necessary property that each block b_i appears along the path to Pos(b_i), and in Step 7 all CPUs learn the collection of all queried values v_old (in particular, including the value they initially requested). Thus, assuming no overflow, correctness holds.

Obliviousness (w/o overflow). Consider the access patterns to server-side memory in each step of Heavy-OPAccess, assuming no overflow. Step 1 is performed locally without communication to the server. Step 2 is a recursive execution of Heavy-OPAccess, which thus yields access patterns independent of the vector of queried data locations (up to statistical distance negligible in n), by the induction hypothesis. In Step 3, each CPU accesses the buckets along a single path down the tree, where representative CPUs rep(b_i) access along the path given by Pos(b_i) (for distinct b_i), and non-representative CPUs each access down an independent, random path. Since the adversarial view so far has been independent of the values of Pos(b_i), conditioned on this view all CPUs' paths are independent and random. In Step 4, all data access patterns are publicly determinable based on the accesses in the previous step (that is, the complication in Step 4 is to ensure correctness without access collisions, but is not needed for security). In Step 5, each CPU accesses its corresponding bucket in the tree. In the flushing procedure of Step 6, each CPU selects an independent, random path down the tree, and the communication patterns to the server reveal no information beyond the identities of these paths. Finally, Step 7 is performed locally without communication to the server. Thus, assuming no overflow, obliviousness holds.

Protocol Complexity (w/o overflow). First note that the server-side memory storage requirement is simply that of the [CP13] ORAM construction, together with the log(2n_t/K) tree-depth memory optimization of [GGH+13]; namely, f(n) memory overhead suffices for any f in omega(1).

Consider the local memory required per CPU. Each CPU must be able to store: O(log n)-size requests from each CPU (due to the broadcasts in Steps 1(a), 4(a), 5(a), and 7); and the data contents of at most 3 memory buckets (due to the flushing procedure in UpdateBuckets). Overall, this yields a per-CPU local memory requirement of Omega~(m) (where the Omega~ notation hides log n factors).

Consider the parallel complexity of the OPRAM-compiled program Pi' <- Heavy-O(m, n, Pi). For each parallel memory access in the underlying program Pi, the processors perform:
- Conflict resolution (1 local communication round),
- Reading/writing the position map (which has parallel complexity O(log^2 n * log(n/alpha^t)) by the inductive hypothesis),
- Looking up current memory values (sequential steps = depth of the level-(t-1) ORAM tree, O(log(n/alpha^{t-1}))),
- Removing old data from the ORAM tree (1 local communication round, plus depth of the ORAM tree, O(log(n/alpha^{t-1})), sequential steps),
- Inserting the new data in parallel (1 local communication round, plus 1 communication round to the server),
- Flushing the ORAM database (1 local communication round, and 2x the depth of the ORAM tree rounds of communication with the server, since each bucket along a flush path is accessed once to receive new data items and once to flush its own data items down), and
- Updating CPUs with the read values (1 local communication round).

Altogether, this yields parallel complexity overhead O(log^2 n * log(n/alpha^{t-1})).

It remains to address the probability of encountering overflow.

Claim 3.8. There exists a negligible function mu such that for any deterministic m-processor PRAM program Pi, any database size n, and any input x, the probability that the Heavy-O-compiled program Pi'(m, n, x) outputs overflow is bounded by mu(n).

Proof. We consider separately the probability of overflow in each of the level-t recursive ORAM trees. Since there are log n of them, the claim follows by a straightforward union bound.

Taking inspiration from [CP13], we analyze the ORAM-compiled execution via an abstract dart game. The game consists of black and white darts. In each round of the game, m black darts are thrown, followed by m white darts. Each dart independently hits the bullseye with probability p = 1/m. The game continues until exactly K darts have hit the bullseye (recall K in omega(log n) is the bucket size), or after the end of the T-th round for some fixed polynomial bound T = T(n), whichever comes first. The game is won (which will correspond to overflow in a particular bucket) if K darts hit the bullseye, and all of them are black.

Let us analyze the probability of winning in the above dart game.

Subclaim 1: With overwhelming probability in n, no more than K/2 darts hit the bullseye in any round. In any single round, associate with each of the 2m darts thrown an indicator variable X_i for whether the dart strikes the target. The X_i are independent random variables, each equal to 1 with probability p = 1/m. Thus, the probability that more than K/2 of the darts hit the target is bounded (via a Chernoff tail bound^3) by

Pr[sum_{i=1}^{2m} X_i > K/2] <= e^{-2(K/4-1)^2 / (2+(K/4-1))} in e^{-Omega(K)}, a subset of e^{-omega(log n)}.

Since there are at most T = poly(n) distinct rounds of the game, the subclaim follows by a union bound.

Subclaim 2: Conditioned on no round having more than K/2 bullseyes, the probability of winning the game is negligible in n. Fix an arbitrary such winning sequence s, which terminates sometime during some round r of the game. By assumption, the final partial round r contains no more than K/2 bullseyes. For the remaining (at least) K/2 bullseyes in rounds 1 through r-1, we are in a situation mirroring that of [CP13]: for each such winning sequence s, there exist 2^{K/2} - 1 distinct other losing sequences s' that each occur with the same probability, where any non-empty subset of black darts hitting the bullseye are replaced with their corresponding white darts. Further, every two distinct winning sequences s_1, s_2 yield disjoint sets of losing sequences, and all such constructed sequences have the property that no round has more than K/2 bullseyes (since the number of total bullseyes per round is preserved). Thus, conditioned on having no round with more than K/2 bullseyes, the probability of winning the game is bounded above by 2^{-K/2} in e^{-omega(log n)}.

We now relate the dart game to the analysis of our OPRAM compiler. We analyze the memory buckets at the nodes in the t-th recursive ORAM tree, via three subcases.

Case 1: Nodes in level lev < log m. Since data items are inserted to the tree in parallel directly at level log m, these nodes do not receive data, and thus will not overflow.

Case 2: Consider any internal node (i.e., a node that is not a leaf) gamma in the tree at level log m <= lev < log(L_t). (Recall L_t := 2n_t/K is the number of leaves in the t-th tree when applying the [GGH+13] optimization.) Note that when m > L_t, this case is vacuous. For purposes of analysis, consider the contents of gamma as split into two parts: gamma_L, containing the data blocks whose leaf path continues to the left from gamma (i.e., toward leaf prefix gamma 0), and gamma_R, containing the data blocks whose leaf path continues right (i.e., gamma 1).

^3 Explicit Chernoff bound used: for X = X_1 + ... + X_{2m} (X_i independent) with mean mu, for any delta > 0 it holds that Pr[X > (1+delta)mu] <= e^{-delta^2 mu/(2+delta)}.

For the bucket of node gamma to overflow, there must be K tuples in it. In particular, either gamma_L or gamma_R must have at least K/2 tuples. For each parallel memory access in Pi(m, n, x), in the t-th recursive ORAM tree for which n_t >= m/K, (at most) m data items are inserted, and then m independent paths in the tree are flushed. By definition, an inserted data item will enter our bucket gamma_L (respectively, gamma_R) only if its associated leaf has the prefix gamma 0 (resp., gamma 1); we will assume the worst case, in which all such data items arrive directly to the bucket. On the other hand, the bucket gamma_L (resp., gamma_R) will be completely emptied after any flush whose path contains this same prefix gamma 0 (resp., gamma 1). Since all leaves for inserted data items and data flushes are chosen randomly and independently, these events correspond directly to the black and white darts in the game above. Namely, the probability that a randomly chosen path will have the specific prefix gamma 0 of length lev is 2^{-lev} <= 1/m (since we consider lev >= log m); this corresponds to the probability of a dart hitting the bullseye. The bucket can only overflow if K/2 black darts (inserts) hit the bullseye without any white dart (flush) hitting the bullseye in between. By the analysis above, we proved that for any sequence of K/2 bullseye hits, the probability that all K/2 of them are black is bounded above by 2^{-K/4}, which is negligible in n. However, since there is a fixed polynomial number T = poly(n) of parallel memory accesses in the execution of Pi(m, n, x) (corresponding to the number of rounds in the dart game), and in particular, T * (2m) <= poly(n) total darts thrown, the probability that the sequence of bullseyes contains K/2 sequential blacks anywhere in the sequence is bounded via a direct union bound by (T * 2m) * 2^{-K/4} in e^{-omega(log n)}, as desired.

Case 3: Consider any leaf node gamma. This analysis follows the same argument as in [CP13] (with slightly tweaked parameters from the [GGH+13] tree-depth optimization). For there to be an overflow in gamma at time t, there must be K + 1 out of the n_t/alpha elements in the position map that map to the leaf gamma. Since all positions are sampled uniformly and independently among the L_t := 2n_t/K different leaves, the expected number of elements mapping to gamma is mu = K/2alpha, and by a standard multiplicative Chernoff bound,^4 the probability that K + 1 elements are mapped to gamma is upper bounded by

(e^1 / (1+1)^{(1+1)})^mu <= (2^{-1/3})^{K/2alpha} in 2^{-omega(log n)}.

Thus, the total probability of overflow is negligible in n, and the theorem follows.

^4 We use the following version of the Chernoff bound: let X_1, ..., X_n be independent [0,1]-valued random variables, let X = sum_i X_i and mu = E[X]. For every delta > 0, Pr[X >= (1+delta)mu] <= (e^delta / (1+delta)^{(1+delta)})^mu.

3.2 Oblivious Distributed Insertion, Aggregation, and Multi-Cast

3.2.1 Oblivious Parallel Insertion (Oblivious Routing)

Recall that during the memory "put-back" phase, each CPU must insert its data item into the bucket at level log m of the tree lying along a freshly sampled random path, while hiding the path. We solve this problem by delivering data items to their target locations via a fixed-topology routing network. Namely, the m processors CPU_1, ..., CPU_m will first write the relevant m data items msg_i (and their corresponding destination addresses addr_i) to memory in fixed order, and then rearrange them in log m sequential rounds to the proper locations via the routing network. At the conclusion of the routing procedure, each node j will hold all messages msg_i for which addr_i = j.

Parallel Insertion Routing Protocol Route(m, (msg_i, addr_i))

Input: CPU_i holds: message msg_i with target destination addr_i, and global threshold K.
Output: CPU_i holds {msg_j : addr_j = i}.

Let lev = log m (assumed to be an integer for simplicity). Each CPU_i performs the following. Initialize M_{i,0} <- {msg_i}. For t = 1, ..., lev:

1. Perform the following symmetric message exchange with CPU_{i XOR 2^{t-1}} (the neighbor differing in bit t): M_{i,t} <- {msg_j in M_{i,t-1} union M_{i XOR 2^{t-1}, t-1} : (addr_j)_t = (i)_t}.

2. If |M_{i,t}| > K (i.e., memory overflow), then CPU_i aborts.

Figure 5: Fixed-topology routing network for delivering m messages originally held by m processors to their corresponding destination addresses within [m].

For simplicity, assume m = 2^l for some l in N. The routing network has depth l; in each level t = 1, ..., l, each node communicates with the corresponding node whose id agrees in all bit locations except for the t-th (corresponding to its t-th neighbor in the log(m)-dimensional boolean hypercube). These nodes exchange messages according to the t-th bit of their destination addresses addr. This is formally described in Figure 5. After the t-th round, each message msg_i is held by a party whose id agrees with the destination address addr_i in the first t bits. Thus, at the conclusion of l rounds, all messages are properly delivered.

We demonstrate the case m = 8 = 2^3: first, CPUs exchange information along the communication network in 3 sequential rounds; then, each CPU inserts its resulting collection of items directly into its node at level 3 of the data tree.

[Figure: 8 CPUs routing over a 3-round hypercube network (left), then inserting into level 3 of the ORAM tree (right).]
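The protocol of Figure 5 can be simulated directly; the following Python sketch (our naming, zero-indexed rounds) routes each message to the node matching its destination address, exchanging with the neighbor across bit t in round t:

import random

def route(m, inputs):
    # inputs: list over nodes of (msg, addr); returns per-node delivered lists.
    l = m.bit_length() - 1                     # m = 2^l
    M = [[pair] for pair in inputs]
    for t in range(l):
        new_M = [[] for _ in range(m)]
        for i in range(m):
            partner = i ^ (1 << t)             # neighbor across bit t of the hypercube
            for (msg, addr) in M[i] + M[partner]:
                if (addr >> t) & 1 == (i >> t) & 1:
                    new_M[i].append((msg, addr))   # keep iff bit t matches own id
        M = new_M
    return M

m = 8
out = route(m, [(f"msg{i}", random.randrange(m)) for i in range(m)])
assert all(addr == i for i in range(m) for (_, addr) in out[i])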

We show that if the destination addresses addr_i are uniformly sampled, then with overwhelming probability no node will ever need to hold too many messages (the threshold K will be set to omega(log n)) at any point during the routing network execution:

Lemma 3.9 (Routing Network). If L messages begin with target destination addresses addr_i distributed independently and uniformly over [L] in the L-to-L node routing network in Figure 5, then with probability at least 1 - (L log L) * 2^{-K}, no intermediate node will ever hold more than K messages at any point during the course of the protocol execution.

Proof. Consider an arbitrary node a in {0,1}^l, at some level t of execution of the protocol. There are precisely 2^t possible messages m_b that could be held by node a at this step, corresponding to those originating in locations b in {0,1}^l whose final l - t bits agree with those of a. Node a will hold message m_b at the conclusion of round t precisely if the first t bits of addr_b agree with those of a. For each such message m_b, the associated destination address addr_b is a random element of [L], which agrees with a on the first t bits with probability 2^{-t}. For each b in {0,1}^l agreeing with a on the final l - t bits, define X_b to be the indicator variable that is equal to 1 if addr_b agrees with a on the first t bits. Then the collection of 2^t random variables {X_b} is independent, and X = sum_b X_b has mean mu = 2^t * 2^{-t} = 1. Note that X corresponds to the number of messages held by node a at level t. By a Chernoff bound,^5 it holds that

Pr[X >= K] = Pr[X >= (1 + (K-1))mu] < e^{K-1} / K^K < 2^{-K}.

Then, taking a union bound over the total number of nodes L and levels l = log L, we have that the probability of any node experiencing an overflow at any round is bounded by (L log L) * 2^{-K}.

^5 Exact Chernoff bound used: Pr[X > (1+delta)mu] < (e^delta / (1+delta)^{(1+delta)})^mu for any delta > 0.

3.2.2 Oblivious Aggregation

To perform the CPU-coordination phase, the CPUs efficiently identify a single representative and aggregate relevant CPU instructions; then, at the conclusion, the representative CPU must be able to multi-cast the resulting information to all relevant requesting CPUs. Most importantly, these procedures must be done in an oblivious fashion. In this section, we address oblivious aggregation; we treat the dual multi-cast problem in Section 3.2.3.

Formally, we want to achieve the following aggregation goal, with communication patterns independent of the inputs, using only O(log(m) polylog(n)) local memory and communication per CPU, in only O(log(m)) sequential time steps. An illustrative example to keep in mind is where key_i = b_i, data_i = v_i, and Agg is the process that combines instructions to data items within the same data block, resolving conflicts as necessary.

Oblivious aggregation:

Input: Each CPU i in [m] holds (key_i, data_i). Let K = {key_i} denote the set of distinct keys. We assume that any (subset of) data associated with the same key can be aggregated by an aggregation function Agg to a short digest of size at most poly(l, log m), where l = |data_i|.

Goal: Each CPU outputs out_i such that the following holds. For every key in K, there exists a unique agent i with key_i = key s.t. out_i = (rep, key, agg_key), where agg_key = Agg({data_j : key_j = key}). For every remaining agent, out_i = (dummy, ⊥, ⊥).

At a high level, we achieve this via the following steps (a minimal sketch of the core Step 2 pass appears after this paragraph). (1) First, the CPUs sort their data list with respect to the corresponding key values. This can be achieved via an implementation of a log(m)-depth sorting network, and provides the useful guarantee that all data pertaining to the same key are necessarily held by a block of adjacent CPUs. (2) Second, we pass data among CPUs in a sequence of log(m) steps such that at the conclusion the left-most (i.e., lowest-indexed) CPU in each key-block will learn the aggregation of all data pertaining to this key. Explicitly, in each step t, each CPU sends all held information to the CPU 2^t positions to its left, and simultaneously accepts any received information pertaining to its key. (3) Third, each CPU will learn whether it is the left-most representative in each key-block, by simply checking whether its left-hand neighbor holds the same key. From here, the CPUs have succeeded in aggregating information for each key at a single representative CPU; (4) in the fourth step, they now reverse the original sorting procedure to return this aggregated information to one of the CPUs who originally requested it.
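A minimal Python sketch (our naming) of the pass-to-the-left phase, Step 2 above, on an already-sorted key list, with addition standing in for Agg: after rounds of shifts 1, 2, 4, ..., the leftmost CPU of each key-block holds the aggregate of the whole block.

def aggregate_left(keys, vals):
    m = len(keys)
    vals = list(vals)
    shift = 1
    while shift < m:                              # log(m) rounds
        incoming = [(keys[i + shift], vals[i + shift]) if i + shift < m else None
                    for i in range(m)]            # snapshot before updating
        for i in range(m):
            if incoming[i] is not None and incoming[i][0] == keys[i]:
                vals[i] += incoming[i][1]         # aggregate matching keys
        shift *= 2
    return vals

keys = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']   # sorted, as after Step 1
print(aggregate_left(keys, [1] * 8))              # -> [3, 2, 1, 2, 1, 3, 2, 1]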

neighbor holds the same key. From here, the CPUs have succeeded in aggregating the information for each key at a single representative CPU; (4) in the fourth step, they now reverse the original sorting procedure to return this aggregated information to one of the CPUs who originally requested it.

Lemma 3.10 (Space-Efficient Oblivious Aggregation). Suppose m processors initiate protocol OblivAgg w.r.t. aggregator Agg, on respective inputs {(key_i, data_i)}_{i∈[m]}, each of size l. Then at the conclusion of the execution, each processor i ∈ [m] outputs a triple (rep_i, key'_i, data'_i) such that the following properties hold (where asymptotics are w.r.t. m):
1. The protocol terminates in O(log m) rounds.
2. The local memory and computation required per processor is O(log m + l).
3. (Correctness). For every key ∈ {key_i}, there exists a unique processor i with output key'_i = key; for this processor, it further holds that key_i = key, rep_i = rep, and data'_i = Agg({data_j : key_j = key}). For every remaining processor, the output tuple is (dummy, ⊥, ⊥).
4. (Obliviousness). The inter-CPU communication patterns are independent of the inputs (key_i, data_i).

A full description of our Oblivious Aggregation procedure OblivAgg is given in Figure 6.

Proof of Lemma 3.10. Property (1): Steps 1 and 4 of OblivAgg each execute a sorting network, and require a number of communication rounds equal to the depth d ∈ O(log m) of the sorting network implemented. Step 2 takes place in log m sequential steps. Step 3 requires a single round. And Step 5 (output) takes place locally. Thus, the combined round complexity of OblivAgg is O(log m).

Property (2): We first address the size of the individual items stored, and then ensure the number of stored items is never too large.

Keys (e.g., key_i, keytemp_i): Each key is bounded in size by the initial input size l.
Data (e.g., data_i, datatemp_i, aggdata_i): Similarly, by the property of the aggregation function Agg, we are guaranteed that each data item is bounded in size by the original data size, which is in turn bounded by l.
CPU identifiers (e.g., sourceid_i, idtemp_i): Each processor can be identified by a bit string of length log m.
Representative flag (rep_i): The rep/dummy flag can be stored as a single bit.

Each processor begins with input of size l. In each round of executing the first sorting network (Step 1 of OblivAgg), a processor must hold two tuples (sourceid, keytemp, datatemp), corresponding to at most 2(log m + 2l) storage. Note that no more than 2 tuples are required to be held at any time within this step, as the processors exchange tuples but need not maintain both values. In each round of the Aggregation phase (Step 2), processors may need to store two pairs (keytemp, datatemp) in addition to the information held from the conclusion of the previous step (namely, a single value sourceid_i), which totals log m + 2(2l) memory. Note that by the properties of the aggregation scheme Agg, the size of the aggregated data does not grow beyond l (and recall that parties do not maintain data associated with any different key). In the Representative Identification phase (Step 3), each processor i receives one additional key value keytemp_{i-1}, which requires memory at most l and is then translated to a single-bit flag rep_i and deleted. In the Reverse Sort phase (Step 4), processors within each round must again store two tuples, this time of the form (idtemp, rep, keytemp, datatemp), which corresponds to 2(log m + 1 + 2l) memory. Thus, the total local memory requirement per processor is bounded by O(log m + l).

Oblivious Aggregation Procedure OblivAgg (w.r.t. Agg)

Input: Each CPU i ∈ [m] holds a pair (key_i, data_i).
Output: Each CPU i ∈ [m] outputs a triple (rep_i, key_i, aggdata_i), corresponding to either (dummy, ⊥, ⊥) or satisfying aggdata_i = Agg({data_j : key_j = key_i}), as further specified in Section 3.2.2.

1. Sort on key_i. Each CPU_i initializes a triple (sourceid_i, keytemp_i, datatemp_i) ← (i, key_i, data_i). For each layer L_1, ..., L_d in the sorting network: let L_l = ((i_1, j_1), ..., (i_{m/2}, j_{m/2})) be the comparators in the current layer l. In parallel, for each t ∈ [m/2], the corresponding pair of CPUs (CPU_{i_t}, CPU_{j_t}) performs the following pairwise sort w.r.t. key: if keytemp_{j_t} < keytemp_{i_t}, then swap (sourceid_{i_t}, keytemp_{i_t}, datatemp_{i_t}) ↔ (sourceid_{j_t}, keytemp_{j_t}, datatemp_{j_t}).

2. Aggregate to left. For t = 0, 1, ..., log m:
(Pass to left). Each CPU_i for i > 2^t sends its current pair (keytemp_i, datatemp_i) to CPU_{i-2^t}.
(Aggregate). Each CPU_i for i ≤ m - 2^t receiving a pair (keytemp_j, datatemp_j) aggregates it into its own pair if the keys match; that is, if keytemp_i = keytemp_j, then set datatemp_i ← Agg(datatemp_i, datatemp_j). In both cases, the received pair is then erased.
The left-most CPU_i with keytemp_i = key now holds Agg({datatemp_j : keytemp_j = key}).

3. Identify representatives. For each value key_j, the left-most CPU currently holding keytemp = key_j identifies itself as the (temporary) representative:
Each CPU_i for i < m: send keytemp_i to the right-hand neighbor, CPU_{i+1}.
Each CPU_i for i > 1: if the received value keytemp_{i-1} matches its own keytemp_i, then set rep_i ← dummy and zero out keytemp_i, datatemp_i; otherwise, set rep_i ← rep. (CPU_1 always sets rep_1 ← rep.)

4. Reverse sort (i.e., sort on sourceid_i). Return the aggregated data to a requesting CPU. Each CPU_i initializes idtemp_i ← sourceid_i. For each layer L_1, ..., L_d in the sorting network: let L_l = ((i_1, j_1), ..., (i_{m/2}, j_{m/2})) be the comparators in the current layer l. In parallel, for each t ∈ [m/2], the corresponding pair of CPUs (CPU_{i_t}, CPU_{j_t}) performs the following pairwise sort w.r.t. sourceid: if idtemp_{j_t} < idtemp_{i_t}, then swap (idtemp_{i_t}, rep_{i_t}, keytemp_{i_t}, datatemp_{i_t}) ↔ (idtemp_{j_t}, rep_{j_t}, keytemp_{j_t}, datatemp_{j_t}). At the conclusion, each CPU_i holds a tuple (idtemp_i, rep_i, keytemp_i, datatemp_i) with idtemp_i = i and keytemp_i = key_i.

5. Output. Each CPU_i outputs the triple (rep_i, keytemp_i, datatemp_i).

Figure 6: Space-efficient oblivious data aggregation procedure.
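For concreteness, the following Python sketch mirrors the data flow of Figure 6 sequentially (our illustrative code, not part of the construction): the comparator network here is odd-even transposition sort, which stands in for the O(log m)-depth sorting network assumed above, and Python's stable sort stands in for the reverse sorting network.

```python
# Sequential sketch of OblivAgg (Figure 6); illustrative only.
# pairs[i] = (key_i, data_i); agg is a binary, associative combiner
# standing in for Agg. Returns out[i] = ("rep", key, agg_key) or
# ("dummy", None, None).
def obliv_agg(pairs, agg):
    m = len(pairs)
    tup = [[i, k, d] for i, (k, d) in enumerate(pairs)]  # (sourceid, keytemp, datatemp)
    # Step 1: sort on key via a comparator network (odd-even transposition
    # here; the paper assumes an O(log m)-depth network instead).
    for layer in range(m):
        for a in range(layer % 2, m - 1, 2):
            if tup[a + 1][1] < tup[a][1]:
                tup[a], tup[a + 1] = tup[a + 1], tup[a]
    # Step 2: aggregate to left at doubling distances 1, 2, 4, ...
    dist = 1
    while dist < m:
        snapshot = [(x[1], x[2]) for x in tup]           # simultaneous sends
        for i in range(m - dist):
            k_right, d_right = snapshot[i + dist]
            if tup[i][1] == k_right:                     # keys match: absorb
                tup[i][2] = agg(tup[i][2], d_right)
        dist *= 2
    # Step 3: the left-most CPU of each key-block is the representative.
    out = []
    for i in range(m):
        if i == 0 or tup[i - 1][1] != tup[i][1]:
            out.append((tup[i][0], "rep", tup[i][1], tup[i][2]))
        else:
            out.append((tup[i][0], "dummy", None, None))
    # Steps 4-5: reverse sort on sourceid (stands in for the network) and output.
    out.sort(key=lambda x: x[0])
    return [(flag, k, d) for _, flag, k, d in out]
```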

Property (3): We now prove that the protocol results in the desired output. Consider the values stored by each processor at the conclusion of each phase of the protocol. After the completion of Step 1, by the correctness of the utilized sorting network, it holds that each CPU_i holds a tuple (sourceid_i, keytemp_i, datatemp_i) such that the list (sourceid_1, ..., sourceid_m) is some permutation of [m], and keytemp_i ≤ keytemp_j for every i < j. Note that it is always the case that the pair (keytemp_i, datatemp_i) currently held by CPU_i is precisely the original input pair of CPU_j for j = sourceid_i. For the Aggregation phase in Step 2, we make the following claim.

Claim 3.11. At the conclusion of Aggregate to Left (Step 2), for each value key, the CPU of lowest index i for which keytemp_i = key holds datatemp_i = Agg({data_j : key_j = key}).

Proof. Fix an arbitrary value key, and let S_key ⊆ [m] denote the subset of processors i for which keytemp_i = key. From the previous sorting step, we are guaranteed that S_key consists of an interval of consecutive processors i_start, ..., i_stop. Now, consider any j ∈ S_key (whose data CPU i_start wishes to learn). For any pair of indices i < j in S_key, denote by t_{i,j} := max{t ∈ {0, 1, ..., log m - 1} : (j - i)_t = 1} the index of the highest nonzero bit of j - i. We now prove that, for each such pair i, j, CPU i will learn CPU j's data by the end of round t_{i,j} ≤ log m. The claim will follow by applying this statement to each pair (i_start, j) with j ∈ S_key.

Induct on t_{i,j}. Base case t_{i,j} = 0: here j = i + 1, and the statement follows immediately from the protocol construction; namely, in the 0-th round, CPU j sends its data to CPU (j - 1), which is precisely CPU i. Now, suppose the inductive hypothesis holds for all pairs i < j with t_{i,j} ≤ t, and consider a pair i < j with t_{i,j} = t + 1. In round t + 1 of the protocol, processor i receives from processor (i + 2^{t+1}) the collection of all information it has aggregated up to round t. By the definition of t_{i,j}, we know that i < (i + 2^{t+1}) ≤ j and that t_{(i+2^{t+1}),j} ≤ t: indeed, j - i has bit t + 1 set and no higher bit, so j - i ≥ 2^{t+1} and j - (i + 2^{t+1}) < 2^{t+1}. By the inductive hypothesis, CPU (i + 2^{t+1}) has therefore learned CPU j's data in a previous round. Thus, CPU i will learn CPU j's data in round t + 1, as desired.

In Step 3, each processor learns whether its left-hand neighbor holds the same temporary key as it does; that is, it learns whether or not it is the left-most CPU holding keytemp_i (and, in turn, holds the complete aggregation of all data relating to this key). Each processor for whom this is not the case sets its tuple to (dummy, ⊥, ⊥). At this point in the protocol, the processors have successfully reached the state where a single self-identified representative for each queried key holds the desired data aggregation. The final step is to return these information tuples to some CPU who originally requested this key. This is achieved in the final reverse sort (Step 4). Namely, by the correctness of the implemented sorting network, at the conclusion of Step 4 each CPU_i holds a tuple (idtemp_i, rep_i, keytemp_i, datatemp_i) such that the ordered list (idtemp_1, ..., idtemp_m) is precisely the ordered list 1, ..., m. Since the tuples (idtemp, rep, keytemp, datatemp) are never modified in this phase (only swapped between processors), it remains to show that each non-dummy (rep, keytemp, datatemp) tuple is received by an appropriate requesting CPU. But this is precisely the information held by idtemp_i: the identity of the CPU who made the original request with respect to key keytemp_i.
Thus, the reverse sort successfully routes the aggregated tuples back to a CPU making the correct key request.
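As a quick sanity check of Property (3), here is an illustrative run of the obliv_agg sketch given after Figure 6 (all values hypothetical):

```python
# CPUs 0..3 request blocks 7, 3, 7, 3; data items are lists of per-cell
# write instructions, and aggregation is list concatenation (a stand-in
# for the conflict-resolving Agg described in Section 3.3).
merge = lambda a, b: a + b
pairs = [(7, [("w", 0, "x")]), (3, [("w", 2, "y")]),
         (7, [("w", 1, "z")]), (3, [])]
print(obliv_agg(pairs, merge))
# [('rep', 7, [('w', 0, 'x'), ('w', 1, 'z')]), ('rep', 3, [('w', 2, 'y')]),
#  ('dummy', None, None), ('dummy', None, None)]
```

As the lemma promises, each queried block ends with exactly one representative holding the full aggregation, delivered back to a CPU that requested that block, while all remaining CPUs output the dummy tuple.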

Property (4): Since we utilize a sorting network with fixed topology, and the aggregate-to-left functionality has a fixed communication topology, the inter-CPU communication patterns are constant, independent of the initial CPU inputs.

3.2.3 Oblivious Multicasting

Our goal for Oblivious Multicasting is dual to that of the previous section: namely, a subset of CPUs must deliver information to the (unknown) collections of other CPUs who request it. This is abstractly modeled as follows, where key_i denotes which data item is requested by each CPU.

Oblivious Multicasting:
Input: Each CPU i holds (key_i, data_i) with the following promise. Let K = {key_i} denote the set of distinct keys. For every key ∈ K, there exists a unique agent i with key_i = key such that data_i ≠ ⊥; let data_key denote this data_i.
Goal: Each agent i outputs out_i = (key_i, data_{key_i}).

Figure 7 contains the protocol OblivMCast for achieving oblivious multicasting in a space-efficient fashion. This procedure is roughly the dual of the OblivAgg protocol of the previous section.

Lemma 3.12 (Space-Efficient Oblivious Multicasting). Suppose m processors initiate protocol OblivMCast on respective inputs {(key_i, data_i)}_{i∈[m]} of size l satisfying the promise specified above. Then at the conclusion of the execution, each processor i ∈ [m] outputs a pair (key'_i, data'_i) such that the following properties hold (where asymptotics are w.r.t. m):
1. The protocol terminates in O(log m) rounds.
2. The local memory and computation required by each processor is Õ(log m + l).
3. (Correctness). For every i, key'_i = key_i and data'_i = data_{key_i}.
4. (Obliviousness). The inter-CPU communication patterns are independent of the inputs (key_i, data_i).

Proof. Identical to the proof of Oblivious Aggregation, Lemma 3.10.

3.3 Putting Things Together

We now combine the so-called Heavy-OPAccess structure of our OPRAM, formalized in Section 3.1 (Figure 2) within the simplified free-CPU-communication setting, together with the (oblivious) Route, OblivAgg, and OblivMCast procedures constructed in the previous subsections (specified in Figures 5, 6, 7). For simplicity, we describe the case in which the number of CPUs m is fixed; however, the construction can be modified in a straightforward fashion to handle the more general case (as long as the activation schedule of CPUs is a priori fixed and public).

Recall the steps in Heavy-OPAccess where large memory/bandwidth is required:

In Step 1, each CPU_i broadcasts (r_i, v_i) to all CPUs. Let b_i = ⌈r_i/α⌉. This is used to aggregate the instructions to each b_i and determine its representative CPU rep(b_i).

In Step 4, each CPU_i broadcasts (b_i, l_i) or (⊥, l_i). This is used to aggregate, for each bucket along path l_i, the instructions about which blocks b_i are to be removed.

Oblivious Multicasting Procedure OblivMCast

Input: Each CPU i holds (key_i, data_i) with the following promise. Let K = {key_i} denote the set of distinct keys. For every key ∈ K, there exists a unique agent i with key_i = key such that data_i ≠ ⊥; let data_key denote this data_i.
Output: Each agent i outputs out_i = (key_i, data_{key_i}).

1. Sort on (key_i, data_i). Each CPU_i initializes (sourceid_i, keytemp_i, datatemp_i) ← (i, key_i, data_i). For each layer L_1, ..., L_d in the sorting network: let L_l = ((i_1, j_1), ..., (i_{m/2}, j_{m/2})) be the comparators in the current layer l. In parallel, for each t ∈ [m/2], the corresponding pair of CPUs (CPU_{i_t}, CPU_{j_t}) performs the following pairwise sort w.r.t. key, additionally pushing the payloads data_key to the left: if (i) keytemp_{j_t} < keytemp_{i_t}, or (ii) keytemp_{j_t} = keytemp_{i_t} and datatemp_{j_t} ≠ ⊥, then swap (sourceid_{i_t}, keytemp_{i_t}, datatemp_{i_t}) ↔ (sourceid_{j_t}, keytemp_{j_t}, datatemp_{j_t}).

2. Multicast to right. For t = 0, 1, ..., log m:
(Pass to right). Each CPU_i for i ≤ m - 2^t sends its current pair (keytemp_i, datatemp_i) to CPU_{i+2^t}.
(Aggregate). Each CPU_i for i > 2^t receiving a pair (keytemp_j, datatemp_j) with j = i - 2^t updates its data as follows: if keytemp_i = keytemp_j and datatemp_j ≠ ⊥, then set datatemp_i ← datatemp_j.
Every CPU i now holds (keytemp_i, datatemp_i) = (key, data_key) for some key ∈ K.

3. Reverse sort (i.e., sort on sourceid_i). Return the received data to an original requesting CPU. Each CPU_i initializes idtemp_i ← sourceid_i. For each layer L_1, ..., L_d in the sorting network: let L_l = ((i_1, j_1), ..., (i_{m/2}, j_{m/2})) be the comparators in the current layer l. In parallel, for each t ∈ [m/2], the corresponding pair of CPUs (CPU_{i_t}, CPU_{j_t}) performs the following pairwise sort w.r.t. sourceid: if idtemp_{j_t} < idtemp_{i_t}, then swap (idtemp_{i_t}, keytemp_{i_t}, datatemp_{i_t}) ↔ (idtemp_{j_t}, keytemp_{j_t}, datatemp_{j_t}). At the conclusion, each CPU_i holds a tuple (idtemp_i, keytemp_i, datatemp_i) with idtemp_i = i, keytemp_i = key_i, and datatemp_i = data_{key_i}.

4. Output. Each CPU_i outputs out_i = (key_i, data_{key_i}).

Figure 7: Space-efficient oblivious data multicasting procedure.
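The same kind of sequential sketch applies to Figure 7 (again our illustrative code; Python's stable sorts stand in for the fixed-topology sorting networks):

```python
# Sequential sketch of OblivMCast (Figure 7); illustrative only.
# pairs[i] = (key_i, data_i), with data_i = None except at the unique
# holder of each key's payload. Returns out[i] = (key_i, data_{key_i}).
def obliv_mcast(pairs):
    m = len(pairs)
    tup = [[i, k, d] for i, (k, d) in enumerate(pairs)]
    # Step 1: sort on key, pushing each key's payload holder leftmost.
    tup.sort(key=lambda x: (x[1], x[2] is None))
    # Step 2: multicast to the right at doubling distances 1, 2, 4, ...
    dist = 1
    while dist < m:
        snapshot = [(x[1], x[2]) for x in tup]   # simultaneous sends
        for i in range(dist, m):
            k_left, d_left = snapshot[i - dist]
            if tup[i][1] == k_left and d_left is not None:
                tup[i][2] = d_left               # adopt the payload
        dist *= 2
    # Step 3: reverse sort on sourceid and output.
    tup.sort(key=lambda x: x[0])
    return [(k, d) for _, k, d in tup]
```

For instance, obliv_mcast([(7, None), (3, "B"), (7, "A"), (3, None)]) returns [(7, "A"), (3, "B"), (7, "A"), (3, "B")]: every CPU learns the payload for the key it requested.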

In Step 5, each (representative) CPU_i broadcasts (b_i, v_i, l_i). This is used to aggregate the blocks to be inserted into each bucket at the appropriate level of the tree.

In Step 6, each CPU_i broadcasts l_i^flush. This is used to aggregate the information about which buckets the flush operation should be performed on.

In Step 7, each (representative) CPU rep(b) broadcasts the old value v_old of block b to all CPUs, so that each CPU receives its desired information.

We will use the oblivious aggregation procedure to replace the broadcasts in Steps 1, 4, and 6; the parallel insertion procedure to replace the broadcasts in Step 5; and finally the oblivious multicast procedure to replace the broadcasts in Step 7.

Let us first consider the aggregation steps. For Step 1, to invoke the oblivious aggregation procedure, we set key_i = b_i and data_i = (r_i mod α, v_i), and define the output of Agg({(u_i, v_i)}) to be a vector v = v[1] ··· v[α] of read/write instructions to each memory cell in the block, where conflicts are resolved by writing the value specified by the smallest CPU: i.e., for each s ∈ [α], take v[s] ← v_j for the minimal j such that u_j = s and v_j ≠ ⊥. By the functionality of OblivAgg, at its conclusion each block b_i is assigned to a unique representative (not necessarily the smallest CPU), who holds the aggregation of all instructions for this block.

Both Step 4 and Step 6 invoke UpdateBuckets to update the buckets along m random paths. In our rudimentary solution, the paths (along with the instructions) are broadcast among the CPUs, and the buckets are updated level by level. At each level, each bucket to be updated is assigned to the representative CPU with minimal index, who performs the aggregated instructions to update the bucket. Here, to avoid broadcasts, we invoke the oblivious aggregation procedure per level, as follows.

In Step 4, each CPU_i holds a path l_i and a block b_i (or ⊥) to be removed. Also note that the buckets along the path l_i are stored locally by each CPU_i after the read operation in the previous step (Step 3). At each level lev ∈ [log n], we invoke the oblivious aggregation procedure with key_i = l_i(lev) (the lev-bit prefix of l_i) and data_i = b_i if b_i is in the bucket of node l_i(lev), and data_i = ⊥ otherwise. We simply define Agg({data_i}) = {b_i : data_i = b_i} to be the union of the blocks (to be removed from this bucket). Since data_i ≠ ⊥ only when block data_i is in the bucket, the output size of Agg is upper bounded by the bucket size K. By the functionality of OblivAgg, at its conclusion each bucket l_i(lev) is assigned to a unique representative (not necessarily the smallest CPU) holding the aggregated instruction for the bucket. The representative CPUs can then update the corresponding buckets accordingly.

In Step 6, each CPU_i samples a path l_i^flush to be flushed, and the instructions to each bucket are simply left and right flushes. At each level lev ∈ [log n], we invoke the oblivious aggregation procedure with key_i = l_i^flush(lev) and data_i = L (resp., R) if the (lev+1)-st bit of l_i^flush is 0 (resp., 1). The aggregation function Agg is again the union function. Since there are only two possible instructions, the output has O(1) length. By the functionality of OblivAgg, at its conclusion each bucket l_i^flush(lev) is assigned to a unique representative (not necessarily the smallest CPU) holding the aggregated instruction for the bucket. To update a bucket l_i^flush(lev), the representative CPU loads the bucket and its two children (if needed) into local memory from the server, performs the flush operation(s) locally, and writes the buckets back. Note that since we update m random paths, we do not need to hide the access pattern, and thus the dummy CPUs do not need to perform dummy operations during UpdateBuckets.
A formal description of the full-fledged UpdateBuckets procedure can be found in Figure 8.
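For concreteness, the two aggregation functions just described might look as follows (illustrative code; we assume here that each Step 1 instruction additionally carries the issuing CPU's id, so that conflicts can be broken toward the smallest CPU):

```python
# Step 1 aggregator: merge the per-cell write instructions for one block.
# instructions = iterable of (cpu_id, cell_offset, value), with value None
# for reads; alpha = number of cells per block. Cell s takes the value
# written by the smallest CPU id, as specified above.
def agg_block_instructions(instructions, alpha):
    out = [None] * alpha                            # None = no write to this cell
    for cpu, cell, value in sorted(instructions):   # smallest cpu id first
        if value is not None and out[cell] is None:
            out[cell] = value
    return out

# Steps 4 and 6 instead aggregate with a plain union, e.g. over sets of
# block ids to remove, or over flush directions drawn from {"L", "R"}:
agg_union = lambda s1, s2: s1 | s2
```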

For Step 5, we rely on the parallel insertion procedure of Section 3.2.1, which routes blocks to their proper destinations within the relevant level of the server-held data tree in parallel, using a simple oblivious routing network. The procedure is invoked with msg_i = b_i and addr_i = l_i.

Finally, in Step 7, each representative CPU rep(b) holds the information of block b, and each dummy CPU wants to learn the value of some block b_i. To do so, we invoke the oblivious multicast procedure with key_i = b_i and data_i = v_old for the representative CPUs, and data_i = ⊥ for the dummy CPUs. By the functionality of OblivMCast, at its conclusion each CPU receives the value of the block it originally wished to learn.

The Final Compiler. For convenience, we summarize the complete protocol. Our OPRAM compiler O, on input m, n_t ∈ N and an m-processor PRAM program Π with memory size n_t (which at recursion level t will be n_t = n/α^t), outputs a program Π' that is identical to Π, but where each Access(r, v) operation is replaced by the sequence of operations defined by the subroutine OPAccess(r, v), summarized below. The OPAccess procedure begins with m CPUs, each holding a requested data cell r_i (within some α-block b_i) and some action to be taken (either ⊥ to denote read, or v_i to denote rewriting cell r_i with value v_i).

1. Conflict Resolution: Run OblivAgg on inputs {(b_i, v_i)}_{i∈[m]} to select a unique representative rep(b_i) for each queried block b_i and to aggregate all CPU instructions for this b_i (denoted v_i).

2. Recursive Access to Position Map: Each representative CPU rep(b_i) samples a fresh random leaf id l'_i ← [n_t] in the tree and performs a (recursive) Read/Write access command on the position map database, l_i ← OPAccess(t+1, (b_i, l'_i)), to fetch the current position map value l_i for block b_i and rewrite it with the newly sampled value l'_i. Each dummy CPU performs an arbitrary dummy access (e.g., garbage ← OPAccess(t+1, (1, ⊥))).

3. Look Up Current Memory Values: Each CPU rep(b_i) fetches memory from the database nodes down the path to leaf l_i; when b_i is found, it copies its value v_i into local memory. Each dummy CPU chooses a random path and makes analogous dummy data fetches along it, ignoring all read values. (Recall that simultaneous data reads do not yield conflicts.)

4. Remove Old Data: For each level of the tree, aggregate the instructions across CPUs accessing the same buckets of memory (corresponding to nodes of the tree) on the server side. Each representative CPU rep(b) begins with the instruction "remove block b if it occurs," and the dummy CPUs hold the empty instruction. (Aggregation is as before, but at the bucket level instead of the block level.) For each bucket to be modified, the CPU with the smallest id among those who wish to modify it executes the aggregated block-removal instructions for the bucket. Note that this aggregation step is purely for correctness and not for security.

5. Insert Updated Data into Database in Parallel: Run Route(m, (msg_i, addr_i)), where for each rep(b_i), msg_i = (b_i, v_i, l'_i) (i.e., the updated block data) and addr_i = [l'_i]_{log m} (i.e., the level-log m truncation of the path l'_i), and for each dummy CPU, msg_i, addr_i = ⊥.

6. Flush the ORAM Database: In parallel, each CPU initiates an independent flush of the ORAM tree. (Recall that this corresponds to selecting a random path down the tree, and pushing all data blocks along this path as far as they will go.) To implement the simultaneous flush commands, as before, commands are aggregated across CPUs for each bucket to be modified, and the CPU with the smallest id performs the corresponding aggregated set of commands. (For example, all CPUs will wish to access the root node in their flush; the aggregation of all the corresponding commands to the root node data will be executed by the lowest-numbered CPU who wishes to access this bucket, in this case CPU 1.)

UpdateBuckets(m, (command_i, path_i))

Let path_i(1), path_i(2), ..., path_i(log n) denote the bit prefixes of length 1 to log n of path_i. For each level lev = 1, ..., log n of the tree:

1. The CPUs invoke the oblivious aggregation procedure OblivAgg as follows.
Case 1: command_i = remove-b_i. Each CPU_i sets key_i = path_i(lev), and data_i = b_i if b_i is in the bucket of node path_i(lev) and data_i = ⊥ otherwise. Use the union function Agg({data_i}) = {b_i : data_i = b_i} as the aggregation function.
Case 2: command_i = flush. Each CPU_i sets key_i = path_i(lev), and data_i = L (resp., R) if the (lev+1)-st bit of path_i is 0 (resp., 1). Use the union function as the aggregation function.
At the conclusion of the protocol, each bucket path_i(lev) is assigned to a representative CPU bucket-rep(path_i(lev)) holding the aggregated commands agg-command_i.

2. Each representative CPU performs the updates: if i ≠ bucket-rep(path_i(lev)), do nothing. Otherwise:
Case 1: command_i = remove-b_i. Remove all blocks b ∈ agg-command_i from the bucket path_i(lev) by accessing memory bucket path_i(lev) and rewriting its contents.
Case 2: command_i = flush. Access memory buckets path_i(lev), path_i(lev)‖0, path_i(lev)‖1, perform the flush operation(s) locally according to agg-command_i ⊆ {L, R}, and write the contents back. Specifically, denote the collection of data blocks b_j stored in path_i(lev) by ThisBucket. Partition ThisBucket = ThisBucket-L ∪ ThisBucket-R into those blocks whose associated leaves continue to the left or to the right at the next level (i.e., ThisBucket-L = {b_j ∈ ThisBucket : l_j(lev+1) = path_i(lev)‖0}, and similarly for ThisBucket-R). If L ∈ agg-command_i, then set ThisBucket ← ThisBucket \ ThisBucket-L and insert the data items ThisBucket-L into bucket path_i(lev)‖0. If R ∈ agg-command_i, then set ThisBucket ← ThisBucket \ ThisBucket-R and insert the data items ThisBucket-R into bucket path_i(lev)‖1.

Figure 8: A space-efficient implementation of the UpdateBuckets procedure.

7. Return Output: Run OblivMCast on inputs {(b_i, v_i)}_{i∈[m]} (where for dummy CPUs, b_i, v_i := ⊥) to communicate the original (pre-updated) value of each data block b_i to the subset of CPUs that originally requested it.

A few remarks regarding our construction follow.

Remark 3.13 (Truncating OPRAM for Fixed m). In the case that the number of CPUs m is fixed and known a priori, the OPRAM construction can be directly trimmed in two places.

Trimming tops of recursive data trees: Note that data items are always inserted into the OPRAM trees at level log m, and flushed down from this level. Thus, the top levels of the ORAM tree are never utilized. In this case, the data buckets in the corresponding tops of the trees, from the root node down to level log m, can simply be removed without affecting the OPRAM.

Truncating recursion: At the t-th level of recursion, the corresponding database size shrinks to n_t = n/α^t. At recursion level log_α(n/m) (i.e., where n_t = m), we can then achieve oblivious data accesses via local CPU communication (storing each block i ∈ [n_t] = [m] locally at CPU_i, and running OblivAgg and OblivMCast directly), without needing any tree lookups or further recursion.

[Figure: the sequence of recursive OPRAM databases, of sizes n, n/α, n/α^2, ..., with the tree tops above level log m truncated, and the recursion truncated once the database size reaches m.]

Remark 3.14 (Collision-Freeness). In the compiler above, CPUs only access the same memory address simultaneously in the (read-only) memory lookup in Step 3. However, a simple tweak to the protocol, replacing the direct memory lookups with an appropriate aggregation and multicast step (formally, the procedure UpdateBuckets as described in the appendix), yields collision-freeness.

4 Garbled PRAM

As an application of OPRAM, we demonstrate a construction of garbled parallel RAM. Specifically, we show that the IBE-based garbled RAM of Gentry et al. [GHL+14] (which in turn builds upon [LO13b]) can be directly generalized to garble PRAMs in a simple and modular way, given an OPRAM compiler with certain properties, and instead relying on 2-level hierarchical IBE. We then show (in the appendix) how to obtain these required properties generically from any OPRAM, and how to reduce the assumption from 2-HIBE back to IBE with a further modification of the scheme. We remark that constructions of garbled RAM can be obtained directly from one-way functions [GHL+14, GLOS15, GLO15], and we leave it as an interesting open problem to extend these techniques to the PRAM setting.

We start by generalizing the notion of garbled RAM [LO13b, GHL+14] to garbled PRAM, where the main difference is that a PRAM program Π consists of m CPUs. We allow each CPU l to take a short input x_l, which can be thought of as the initial CPU state. We model the garbling algorithm and the garbled program evaluator also as PRAMs, and aim to preserve the parallel runtime of Π.


More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Loop Transformations, Dependences, and Parallelization

Loop Transformations, Dependences, and Parallelization Loop Transformatons, Dependences, and Parallelzaton Announcements Mdterm s Frday from 3-4:15 n ths room Today Semester long project Data dependence recap Parallelsm and storage tradeoff Scalar expanson

More information

Non-Split Restrained Dominating Set of an Interval Graph Using an Algorithm

Non-Split Restrained Dominating Set of an Interval Graph Using an Algorithm Internatonal Journal of Advancements n Research & Technology, Volume, Issue, July- ISS - on-splt Restraned Domnatng Set of an Interval Graph Usng an Algorthm ABSTRACT Dr.A.Sudhakaraah *, E. Gnana Deepka,

More information