Efficient Distributed File System (EDFS) (Semi-Centralized)
Debessay (Debish) Fesehaye, Rahul Malik & Klara Nahrstedt
University of Illinois at Urbana-Champaign
Contents
- Problem Statement, Related Work, EDFS Design
- Resource Monitor (RM), Resource Allocator (RA)
- Serving Requests
- Implementation approach
- Some Simulation Results
- Conclusion and Future Work
- Extra Slides
Problem Statement
- To deal with growing data storage and processing needs
  - Multi-GB files are common
  - An increasing number of concurrent requests (clients) is more common
- To optimally utilize network, storage and processing resources
  - Some nodes can be overloaded while others are less loaded
  - Resulting in higher block transfer delay
Related Work
- Existing Distributed File Systems (GFS, HDFS)
  - NameNode (Master)
    - Keeps track of every block and replica operation
    - A bottleneck
    - Single point of failure
  - Lack an efficient block allocation and load balancing scheme
    - Consider just physical distance rather than rate to allocate
    - Use TCP for data transfer
    - Don't simultaneously look at all bottlenecks (disk, link, CPU)
Our Approach (EDFS Design)
- Keep an EDFS frontend
  - To forward requests to the corresponding NameNode
  - Hashing (lightweight)
  - Can handle more requests simultaneously
- New cross-layer (network & transport) protocol
  - Adaptive, efficient block allocation, load balancing and routing
  - Consider all bottlenecks simultaneously
EDFS Design... Cont'd
- A client application first connects to the Front-End Server (FES).
- The FES chooses the corresponding Name Node Server (NNS) for the client.
- The NNS asks the Resource Allocator (RA) for the best BS.
- The RA finds the best BS based on the metrics it gets from the Resource Monitor (RM) server in the BS and from other RAs.
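The FES step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides say only that the frontend uses lightweight hashing, so the choice of the file path as the hash key, SHA-256 as the hash, and the names `choose_nns` and `NNS_SERVERS` are all assumptions made for the example.

```python
import hashlib

# Hypothetical list of Name Node Servers known to the FES.
NNS_SERVERS = ["nns-0", "nns-1", "nns-2", "nns-3"]


def choose_nns(file_path: str, nns_servers: list) -> str:
    """Map a file path to one NNS by hashing (assumed scheme, see lead-in)."""
    if not nns_servers:
        raise ValueError("no NNS configured")
    # Hash the key and reduce it to an index into the server list.
    digest = hashlib.sha256(file_path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nns_servers)
    return nns_servers[index]


# The same path always maps to the same NNS, so the FES keeps no per-file state.
selected = choose_nns("/data/logs/part-0001", NNS_SERVERS)
```

Because the mapping is a pure function of the key, the FES does no block bookkeeping itself, which is consistent with the slide's claim that it can handle more requests simultaneously.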
EDFS... Cont'd [two figure-only slides; the figures are not present in this text]
The Resource Monitor (RM)
- The RM computes R_{d,u} (the rates R_d and R_u) at which its flow can send data to (uplink) or receive data from (downlink) another BS or client.
  [The defining equations for R_{d,u} and S are garbled in the source text and are not reproduced here.]
- It sends R_{d,u} and S to its resource allocator (RA).
- Where
  - R_{d,u} = rate at which a flow should send data from or to the node
  - K = resource capacity at the node
  - Q = backlog (queue size)
  - ψ = packet (instruction) size
  - R = sending rate of the node
  - L = total number of packets
  - d = control interval
The Resource Allocator (RA)
- Each RA gathers S from all its RMs and computes the rates R_{d,u} associated with its switch and links.
  [The equation, in terms of N, Q and S, is garbled in the source text and is not reproduced here.]
  - Where N = the number of RMs attached to the RA
- It keeps the maximum R_{d,u}^max of the rates of its children (for k-allocation) and the IP addresses of the children with these max values.
- It sends its min(r_{d,u}, R_{u,d}^max) and its S to its parent RA.
- This continues up to the highest level (n-highest) RA in the tree, every control interval d or at every trigger event.
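The upward pass described on this slide can be sketched as a small tree walk. This is only an illustration under stated assumptions: it handles a single direction (one rate per node instead of the d,u pair), takes each RA's own rate as a given input because the rate equation is not available here, and uses hypothetical class names (`RM`, `RA`) and addresses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RM:
    """Leaf of the tree: a resource monitor reporting one rate (assumed given)."""
    address: str
    rate: float

    def report(self) -> float:
        return self.rate


@dataclass
class RA:
    """Resource allocator: aggregates the reports of its children."""
    address: str
    own_rate: float  # rate of this RA's switch/links, taken as an input here
    children: List[object] = field(default_factory=list)
    best_child: Optional[str] = None  # address of the child with the max rate
    best_child_rate: float = 0.0

    def report(self) -> float:
        """Record the best child, then return the value sent to the parent."""
        if not self.children:
            return self.own_rate
        reports = [(child.report(), child.address) for child in self.children]
        self.best_child_rate, self.best_child = max(reports)
        # The parent receives min(own rate, max rate among children).
        return min(self.own_rate, self.best_child_rate)


rack1 = RA("10.0.1.1", 80.0, [RM("10.0.1.11", 40.0), RM("10.0.1.12", 95.0)])
rack2 = RA("10.0.2.1", 60.0, [RM("10.0.2.11", 70.0)])
root = RA("10.0.0.1", 100.0, [rack1, rack2])
top_rate = root.report()
```

In this example the first rack is limited by its own switch (80) rather than by its best server (95), which is the point of taking the minimum before reporting upward.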
The RA... Cont'd
- The n-highest level RA now sends (broadcasts) to its children
  - Its R_{d,u} (at what maximum rate a node can send to and receive from the outside client)
  - Its min(r_{d,u}, R_{u,d}^max) (at what maximum rate a BS in the cloud can send to and receive from another BS)
- Each RA passes these values to its children RAs until they finally reach the parent RA of the RMs, which keeps this information.
- The above values can also be requested directly from the RA-n.
  - Decreased overhead of broadcasting the values to its children (pro)
  - The highest level RA (RA-n) can be overloaded with too many requests (con)
- Each RA also acts as a k-highest level RA and repeats the above steps for k-level allocation purposes.
Serving Requests
- Two types of requests to write, read and interact with data
  - An external user request to use the cloud
  - An internal BS (RA) request to replicate data, for maintenance and load balancing
- An external user with a read weight 0 ≤ w_r ≤ 1 connects to the FES, which forwards its request to the selected NNS to write some data to the cloud.
- The NNS asks the RA (in its rack) for the best BS.
- The RA chooses the BS with the best (highest) level-n value V_r = w_r R_u + (1 - w_r) R_d and tells the user to use R_{d,u}.
- Here each RA maintains V_r for some common w_r values.
The Interactive (read) weight w_r
- If the RA algorithm runs at every initial write request with w_r given by every application
  - It can be expensive to run
- Optimization: each RA maintains V_r for w_r = 0 (infrequent read), 0.5 (interactive), 0.25 (less read), 0.75 (more read, for data replicated in many BS)
- The RA which has the n-level R_d and R_u can also obtain V_r = w_r R_u + (1 - w_r) R_d after an application provides w_r
- The cached frequency of user reads/writes can also be used to obtain w_r
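The optimization on this slide can be sketched as a small cache keyed by the common weights. The formula V_r = w_r R_u + (1 - w_r) R_d is from the slides; the server names, the example rates, and the helper names (`v_r`, `build_cache`, `best_bs`) are hypothetical and only illustrate the idea.

```python
COMMON_WEIGHTS = (0.0, 0.25, 0.5, 0.75)  # the common w_r values listed above


def v_r(w_r: float, r_u: float, r_d: float) -> float:
    """V_r = w_r * R_u + (1 - w_r) * R_d for a read weight in [0, 1]."""
    if not 0.0 <= w_r <= 1.0:
        raise ValueError("w_r must be in [0, 1]")
    return w_r * r_u + (1.0 - w_r) * r_d


def build_cache(servers: dict) -> dict:
    """Precompute the best BS for each common weight.

    `servers` maps a BS name to its (R_u, R_d) pair.
    """
    cache = {}
    for w in COMMON_WEIGHTS:
        cache[w] = max(servers, key=lambda name: v_r(w, *servers[name]))
    return cache


def best_bs(servers: dict, w_r: float, cache: dict) -> str:
    """Use the cached answer when possible, otherwise compute on demand."""
    if w_r in cache:
        return cache[w_r]
    return max(servers, key=lambda name: v_r(w_r, *servers[name]))


# Hypothetical (R_u, R_d) values in arbitrary rate units.
servers = {"bs-a": (100.0, 20.0), "bs-b": (40.0, 90.0), "bs-c": (60.0, 60.0)}
cache = build_cache(servers)
```

With these numbers a write-heavy request (w_r = 0) lands on `bs-b`, which has the highest R_d, while a read-heavy one (w_r = 0.75) lands on `bs-a`, which has the highest R_u.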
Serving Requests (Internal)
- When the BS has a request to another BS (to read/write data), its RA chooses the BS with the highest V_r (for better update) in its rack (local switch) (1-level allocation) or in a branch of the tree k levels above (k-level allocation).
- When the RA wants to upload (replicate) data from its BS to another BS, the value of w_r can be 0 or a weighted average of the R_d and R_u values.
- When each BS interacts with almost similar read and write rates, w_r = 0.5.
Serving Requests: Simple examples
- External request:
  - If the user wants to push data to or pull data from the cloud, the RA chooses the BS with the highest R_d or R_u, respectively.
  - If the user wants to do push+pull, the RA chooses the BS with the highest V_r value.
- Internal request:
  - If a BS wants to push data to or pull data from a BS, the RA chooses the BS with the highest min(r_u, R_d^max) or the highest min(r_d, R_u^max) value, respectively.
  - If the user wants to do push+pull, it chooses the BS with the highest V_r value.
Implementing EDFS
- Among other things, EDFS answers two main questions
  - Where in the cloud to store data
    - In the BS which offers the best V_r value.
  - At what rate to transfer data
    - Internal or external transfer?
- Internal transfer (within the cloud)
  - TCP can be modified to send at the rate obtained from the RA
  - Setting TCP beta and alpha or the cwnd parameters.
- For external transfer (no knowledge of the client path capacity)
  - If a client uploads data to the cloud (cloud download)
    - Set the client-cloud TCP receive window to R_d^max x RTT
  - If a client downloads data from the cloud (cloud uplink)
    - Set the cloud TCP maximum cwnd to R_u^max x RTT
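The window sizes for external transfers are rate-times-RTT products, which can be worked through numerically. The sketch below assumes rates in bits per second and RTT in seconds, and converts to bytes; the slides do not state units, and the function name `window_bytes` and the example numbers are hypothetical.

```python
def window_bytes(rate_bps: float, rtt_s: float) -> int:
    """Bytes that must be in flight to sustain `rate_bps` over one RTT.

    Assumes the rate is in bits per second and the RTT in seconds.
    """
    if rate_bps < 0 or rtt_s < 0:
        raise ValueError("rate and RTT must be non-negative")
    return round(rate_bps * rtt_s / 8)


# Client upload (cloud download): receive window from R_d^max.
r_d_max = 100e6   # hypothetical 100 Mb/s
rtt = 0.040       # hypothetical 40 ms round-trip time
recv_window = window_bytes(r_d_max, rtt)

# Client download (cloud uplink): maximum cwnd from R_u^max.
r_u_max = 50e6    # hypothetical 50 Mb/s
max_cwnd = window_bytes(r_u_max, rtt)
```

For these inputs the receive window is 500,000 bytes and the maximum cwnd is 250,000 bytes. How such a value is applied to a real TCP stack (for example through a socket receive buffer size) is outside what the slides specify.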
Some Simulation Results
- Evaluating the Front End Server (FES)
  - External requests
  - Number of clients: Poisson distributed
  - Size of requests: Pareto distributed
  - Update rate: Read + write
  - Mem access time = 10ns log 2GB 64 B
Some Simulation Results... cont'd
- Evaluating the Resource Allocator (RA)
  - Internal transfer (At what rate?)
Conclusion and Future Work
- Presented an efficient distributed file system
  - To overcome the single NameNode bottleneck issue of existing distributed file systems
  - An efficient block allocation and load balancing scheme
- Ongoing work
  - System implementation of EDFS
  - Comparison with GFS and HDFS
Extra Slides
Joint Congestion Control and Routing for Distributed Systems
Debessay (Debish) Fesehaye & Klara Nahrstedt
University of Illinois at Urbana-Champaign
Contents
- Motivation
- Derivation of the rate metric
- Applications of such a scheme
Motivation
- TCP takes many rounds to fully utilize the link capacity
  - Very high download time
- TCP uses packet loss and delay as a congestion signal
  - A temporary loss or delay due to channel error is treated as congestion, which reduces the congestion window
- No good routing metric
  - Existing metrics do not effectively take into account dynamic delay and congestion on links
Exact derivation of a rate metric
Notations
- C = Capacity of the link
- Q = Queue size at the link
- RTT = RTT of a flow during control interval d
- [expression in d, C and Q; operators lost in the source text] = total load which has to be shared among all connections crossing the link
- L = total number of packets that arrive at the router during control interval d
- w = current cwnd of the flow corresponding to a packet
- w = next cwnd of the flow
- a = cwnd per packet = number of packets the source sends at the arrival of ACKs of each of the w packets
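Exact derivation of a rate metric
Then we have [equation]. Hence [equation] for some constant k. But [equation], which implies that [equation]. Hence the throughput share associated with a packet is [equation].
[The equations on this slide are not present in the source text.]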
Applications
- Application to the Internet
  - Routing Metric
  - Mitigating DDoS & DoS
- Overlay Networks
  - TEEVE: Tele-immersive Environment for EVErybody
- Other Applications
  - TCIPG (Power Grid), etc.
- Large Scale Distributed File System Design (Today's topic)
The End!