
Jozsef Patvarczki
Comprehensive exam
Due August 24th, 2010
Subject: Performance of Distributed Systems

Q1) A survey of work to distribute requests in a data center (Web cluster) was done by Cardellini, et al in a 2002 ACM Computing Surveys paper [Cardellini et al., 2002]. Briefly outline the primary architectural approaches described in this paper for distributing requests within a data center.

This work assumes that the Web cluster of the content provider is not distributed over the world but is instead grouped in a local area. The survey mainly focuses on the primary architectural approaches for systems that are not geographically distributed. It describes a basic architecture consisting of multiple server nodes grouped in a local area, with one or more mechanisms to distribute client requests among them and with possible internally deployed routing devices. It assumes that each Web server can access all site information independently, depending on the degree of content replication (chapter 9 of the survey introduces data placement techniques for Web content using content replication methods). Although the main focus of the survey is the Web server tier (presentation layer), the architecture can be a multi-tier system including a middle layer for application servers and a data layer for databases. There are three main classes of architectures for the grouped server nodes that host a Web site: the Cluster-based Web system, or Web cluster (a); the Virtual Web cluster (b); and the Distributed Web system (c). For the sake of completeness I would like to mention the main differences between the a), b), and c) architectures.

a) The nodes of the Web cluster are grouped together at a single location and interconnected through a high-speed network. From the outside, the nodes present a single system image with one site name and one Virtual IP (VIP) address. The authoritative Domain Name Server (A-DNS), which always has a complete, up-to-date dataset about a particular domain, translates the site name into the VIP address, which is the address of the front-end node of the Web cluster. The A-DNS is not part of the cluster-based Web system. The front-end node of the Web cluster has a key role as a Web switch: it receives all the incoming packets from the clients and routes them to one of the server nodes. It functions as a centralized dispatcher with the capability to change the request assignments. The Web switch can be a hardware component of the Web cluster or a software module of an operating system.

b) This architecture also presents a single system image to the outside world but does not use a front-end Web switch.

This eliminates a possible single-point-of-failure problem caused by the Web switch. An additional difference is that the VIP address is shared by the server nodes in the cluster, which makes it possible for all nodes to receive and filter the inbound packets; each node can decide to accept or discard them. The A-DNS is not part of the system.

c) This architecture has no single-image view because the multiple IP addresses of the locally distributed server nodes can be visible from the outside. The A-DNS is part of the architecture and acts as the request router of the system during the look-up phase. There is no front-end Web switch.

In the Web cluster architecture (a) the Web switch has the key role of distributing the requests within the data center. The survey classifies the Web cluster architecture alternatives according to the Open Systems Interconnection (OSI) protocol stack layer at which the switch operates: Layer-4 Web switches (ISO/OSI Transport layer), which work at the TCP/IP (Transmission Control Protocol) level, and Layer-7 Web switches (ISO/OSI Application layer). Layer-4 switches have an efficient routing mechanism because the incoming packet never reaches the application layer. This switch type performs content-blind routing: it selects the target server when a client starts to establish a TCP/IP connection during the active-open phase, since the SYN packet arrives at the Web switch before the HTTP request. The routing policies therefore know nothing about the content of the client request. The routing of the answers from the server (data flow) back to the client can differ: if the selected server responds to the client without using the Web switch, it is a one-way approach; if the selected server utilizes the Web switch to answer, it is a two-way solution. As soon as a Web switch receives an incoming packet, it analyzes the packet's header information to determine whether the packet introduces a new connection, belongs to an existing one, or neither. For example, in the case of a TCP segment, which consists of a segment header and a data section, the SYN (Synchronize Sequence Numbers) flag is always set in the header of the first packet. The Web switch selects a server at the TCP session level and inserts the assignment into the binding table it manages. The binding table contains the assignment information for each client TCP session and its target server. If the packet is not for a new connection, the Web switch scans the binding table to determine whether the packet belongs to an existing connection. If it does not, the packet is dropped; if it does, the switch routes the packet according to the dispatching policy.
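As a rough illustration of the content-blind dispatch loop described above, the following hypothetical Python sketch shows how a Layer-4 switch might treat SYN packets versus packets of already bound sessions; the packet representation and the policy stub are assumptions for illustration only.

```python
binding_table = {}   # (client_ip, client_port) -> assigned server

def choose_server(servers):
    # Placeholder for any dispatching policy (Round-Robin, least connections, ...)
    return servers[0]

def layer4_dispatch(packet, servers):
    """Content-blind decision: only TCP/IP header fields are consulted."""
    key = (packet["src_ip"], packet["src_port"])
    if "SYN" in packet["flags"]:            # active open: a new TCP session starts
        binding_table[key] = choose_server(servers)
        return binding_table[key]
    if key in binding_table:                # packet of an already bound session
        return binding_table[key]
    return None                             # neither new nor known: drop the packet

# Example: layer4_dispatch({"src_ip": "10.0.0.7", "src_port": 4711, "flags": {"SYN"}}, ["ws1", "ws2"])
```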

In the case of a two-way architecture each server has its own IP address (a private address at the IP level). The Web switch is responsible for rewriting the incoming and outgoing packets at the TCP/IP level based on the IP Network Address Translation protocol (Traditional NAT) [Srisuresh & Egevang, 2001]. With this method the private address of the assigned server is bound to the VIP address of the Web switch, and the IP header of every incoming and outgoing packet must be modified (double packet rewriting). This modification includes the IP address (the source IP address for outgoing packets, the destination IP address for incoming packets) and the IP checksum. For TCP sessions the modifications must also include the update of the checksum in the TCP header, because the TCP checksum covers a pseudo header that contains the source and destination IP addresses. Since IP and TCP headers use a one's complement sum, it is sufficient to calculate the arithmetic difference between the before-translation and after-translation addresses and add it to the checksum [Srisuresh & Egevang, 2001] [Rijsinghani, 1994]. For Layer-4 switches this means that the server nodes can be in different LANs (Local Area Networks) (figure 1), but the scalability of the two-way Web cluster can be limited by the packet rewrites and checksum modifications performed by the Web switch.

In the case of one-way architectures the outgoing packets do not flow through the Web switch; they flow from the assigned server back to the client directly (single packet rewriting). A second high-speed network connection is necessary for the outgoing packets. Routing to the target server can be done using packet single rewriting, packet tunneling, or packet forwarding. Packet single rewriting rewrites the destination IP address of all incoming packets (the Web switch replaces the VIP with the target server IP and recalculates the checksums). The selected Web server replaces its IP address with the VIP address, recalculates the checksums, and sends the packet back directly, bypassing the Web switch.

Figure 1: Two-way Layer-3 forwarding

Packet tunneling encapsulates an IP datagram within another IP datagram and assumes that all servers support IP tunneling. An outer IP header is inserted before the datagram's existing IP header. The outer IP header addresses (source VIP and destination target server IP address) specify the tunnel, and the inner IP header addresses (source and destination) specify the original sender and receiver [Perkins, 1996]. As soon as the selected Web server gets the encapsulated packet, it processes the request based on the inner packet and responds directly to the client.
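A small sketch of the incremental one's-complement checksum update used by the double and single packet rewriting described above (in the style of RFC 1624); this is an illustrative Python fragment, not the switch implementations discussed in the survey.

```python
def ones_complement_add(a: int, b: int) -> int:
    """16-bit one's complement addition (wrap the carry around)."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def incremental_checksum_update(old_checksum: int, old_field: int, new_field: int) -> int:
    """Update a 16-bit IP/TCP checksum after one 16-bit field changed,
    without summing the whole packet again: HC' = ~(~HC + ~m + m')."""
    s = ones_complement_add(~old_checksum & 0xFFFF, ~old_field & 0xFFFF)
    s = ones_complement_add(s, new_field & 0xFFFF)
    return ~s & 0xFFFF
```

For a full IPv4 address change, each of the two 16-bit halves of the old and new address would be folded in this way.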

Packet forwarding, also known as Direct Server Return (DSR) or MAC address translation, links one of the network interfaces of the servers and the Web switch through a LAN segment. It manipulates the packets at the Layer-2 level using MAC Address Translation (MAT). The MAC address is a unique Layer-2 (Data Link) Ethernet hardware address given by the manufacturer and generally does not change; it is used to deliver IP packets to the target physical device. A Web server is configured with the VIP address (the same VIP is shared by the Web switch and the servers) and its secondary IP address. The trick that allows multiple servers to be on the network with the same IP address is to bind the VIP to the loopback interface (the universal address 127.0.0.1), which is used for internal communications [Bourke, 2001]. The Web switch uses MAT to translate the destination MAC address, and the target server accepts the incoming traffic because the VIP is on its loopback interface. To avoid possible collisions, the server nodes disable the Address Resolution Protocol (ARP), which is responsible for mapping an IP address to a link-layer address (it is used to locate the Ethernet address associated with an IP address). This ensures that the incoming packets are received by the Web switch and not by the other servers. The process does not modify the TCP/IP header of the packet, so it avoids the expensive checksum recalculation. The target server receives the packet and responds directly to the client (figure 2 shows Layer-2 forwarding).

Figure 2: Two-way Layer-2 forwarding

Layer-7 switches establish a complete TCP connection with the client using the three-way handshake between the client and the Web switch, inspect the HTTP request at the application layer (content-aware routing), and route the packet to the selected server using more complex dispatching policies. Figure 3 shows the difference between Layer-4 and Layer-7 routing. Architectures that use a Layer-7 switch can also be divided further into one-way and two-way architectures. A two-way architecture can apply a TCP gateway or TCP splicing. In the case of a TCP gateway a proxy handles the communication between the client and the server; the proxy is on the Web switch and maintains persistent TCP connections with the Web servers.

This proxy receives the incoming packets and forwards them to the target server over the persistent TCP connection. It also forwards the outgoing packets from the server to the client through the same persistent connection. This method is computation intensive since each packet must travel up to the application layer.

Figure 3: Layer-4 vs. Layer-7 routing [Cardellini et al., 2002]

TCP splicing forwards packets at the network layer, between the network interface card and the TCP/IP stack (figure 4). The client library is responsible for redirecting connections to the proxy (Web switch) that relays data between the client and the target server. The operating system needs to be modified at the kernel level. A TCP splice can change the header information of the incoming packets and forward them instead of passing them up to the application layer and back.

Figure 4: Basic architecture of the TCP split connection [Maltz & Bhagwat, 1998]
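To make the TCP gateway idea concrete, here is a minimal, hypothetical Python sketch of a content-aware (Layer-7) dispatcher: it completes the TCP connection with the client, reads the HTTP request, and forwards it to a back-end chosen from the URL. The back-end host names, port, and routing rule are invented for illustration, and a single recv is used on each side for brevity.

```python
import socket

BACKENDS = {"/images/": ("img-server.local", 8080),   # hypothetical back-end pools
            "default":  ("web-server.local", 8080)}

def pick_backend(request_line: str):
    # Content-aware decision: inspect the requested URL path
    parts = request_line.split(" ")
    path = parts[1] if len(parts) > 1 else "/"
    return BACKENDS["/images/"] if path.startswith("/images/") else BACKENDS["default"]

def serve_once(listen_port=8000):
    with socket.create_server(("", listen_port)) as switch:
        client, _ = switch.accept()                        # TCP handshake completed at the switch
        request = client.recv(4096)                        # the HTTP request is now visible (Layer 7)
        first_line = request.decode(errors="ignore").splitlines()[:1] or ["GET / HTTP/1.0"]
        with socket.create_connection(pick_backend(first_line[0])) as upstream:
            upstream.sendall(request)                      # forward over the proxy connection
            client.sendall(upstream.recv(65536))           # relay the reply (two-way data path)
        client.close()
```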

In the case of one-way architectures, TCP handoff and TCP connection hop are the possible routing solutions. Figure 5 shows the TCP connection handoff. First a client connects to the Web switch (1), which accepts the connection (2) and hands it off to the selected Web server. The selected server takes over the established connection (3), accepts the initiated connection (4), and answers back to the client directly (5). The Web switch still has to forward packets to the selected server node, and the forwarding module needs to be very fast to avoid performance problems. There is a provision for HTTP/1.1 persistent connections: in this case the Web switch can hand off a connection multiple times, selecting different servers each time, or the selected server node can forward the request to a second node (back-end forwarding).

Figure 5: TCP connection handoff [Pai et al., 1998]

TCP connection hop moves the TCP connection to the selected server by encapsulating the IP packet in a Resonate Exchange Protocol (RXP) packet. RXP is a kernel-level service that is inserted below the TCP/IP stack. The RXP driver on the Web switch receives the connection from a client, encapsulates it within the TCP connection hop, and forwards it to the selected server. The RXP driver at the target node recovers the original connection and replies directly to the client [Resonate, 2002].

In the case of a Virtual Web Cluster (b) the request routing is distributed among the server nodes and all incoming packets reach all Web servers. The servers have to apply some filtering mechanism to determine whether they are the target; the result of the filtering process is that only one server accepts the packet. The routing is content-blind and the target server inspects the information at the TCP/IP level. The filtering process can be hash based (e.g., a hash value computed from the source address or a port in the TCP/IP packet) and the request routing happens at the MAC layer, with the same conditions required as in the case of Direct Server Return (DSR).

In the case of a Distributed Web System (c) the routing can happen at the DNS or at the server level. The A-DNS has the key role of selecting the target server for each address resolution: at each address resolution it returns one of the IP addresses of the Web servers. The most common DNS implementation is BIND (Berkeley Internet Name Domain) [Internet System Consortium, 2001].

BIND allows duplicating the so-called A records (Address records) for a specific address with different IPs of the Web servers. An A record has a so-called Time To Live (TTL) field that specifies the maximum time the information is considered valid, because the information is cached in the name servers between the A-DNS and the possible local DNS of the client. Name resolution is also cached by different Web browsers (with their own caching policies) for performance purposes [Shaikh et al., 2001].

Server-level routing has three main mechanisms: triangulation, HTTP redirection, and URL rewriting (redirection). Triangulation is based on the packet tunneling method described above: the client sends the request to the selected server, and that server forwards it to a second server node that actually handles the request. The second server answers back to the client directly, and the process continues until the connection is closed. HTTP redirection is based on the 301 and 302 status codes in the response header of the HTTP protocol; a redirect is a response with a status code beginning with the number 3. For example, Moved Permanently (301) means that the resource has a new URL and the client should re-link references to the request URI using the returned reference. HTTP redirection allows content-aware routing because the first server can inspect the content of the request and select a target server based on it. One possible drawback is the overhead generated by the new TCP connection initiated first with the target node and then with the redirected node. URL redirection redirects one URL to another by dynamically manipulating the hyperlinks of the requested Web page in the returned object [Li & Moon, 2001]; an example is the mod_rewrite module (URL rewrite engine) of the Apache HTTP server [Apache, 2010].

Q2) What are the considerations for assigning a request to a particular server within a data center? These considerations should include the state of the data center as well as of the request itself.

There are two main considerations for Web cluster dispatching: content-blind (the Web switch works at the TCP/IP layer) and content-aware (the Web switch works at the application layer). Global scheduling policies can be divided further based on the algorithm type. There are two main types: static and dynamic. Static dispatching algorithms do not consider any state information of the data center while making an assignment decision; dynamic algorithms can. We can also classify dynamic algorithms based on the level of system state information they use: client state aware (the Web switch considers client information to route requests), server state aware (the Web switch considers server state information), and client and server state aware policies (the Web switch considers both). In all cases the goal is to share the different load classes among all servers without overloading any server [Shivaratri et al., 1992].

It is a problem that the load state information of the data center is not immediately available at the Web switch for computation. We assume that the data center has a shared-nothing architecture where each server has its own CPU, main memory, locally attached hard disk, and network interface. The Web cluster faces a mixed static and dynamic workload, where static pages are handled by the Web servers and dynamic content is processed by the back-end (application and database) servers. The Web switch and the server nodes are interconnected with a 100 Mbps link. The HTML file and the objects required for the page are retrieved from the disks or from the caches of the servers. Client arrivals to the system follow an exponential distribution. For example, the requests can be modeled using the following parameters (Table 1): the clients' inter-arrival rate (100 clients per second), the number of requests per client session (modeled through the inverse Gaussian distribution), the user think time, i.e. the time between two successfully retrieved Web pages (modeled through a Pareto distribution [Barford & Crovella, 1999]), the number of embedded objects per page (modeled through the Pareto distribution [Barford & Crovella, 1999, Mah, 1997]), and the distribution of the object size requested from the Web server (modeled according to a lognormal distribution for the body and the heavy-tailed Pareto distribution for the tail [Barford et al., 1999]).

Table 1: Request model [Andreolini et al., 2002]

The Web services provided by the Web cluster can differ, and the possible request types of the clients mainly depend on them [Casalicchio et al., 2001, Casalicchio et al., 2000]. We can separate Web publishing, Web transaction, Web commerce, and Web multimedia sites. Web publishing sites have mainly static information with light dynamic services that do not use the data center resources heavily (Neutral: N) [Casalicchio et al., 2001]; typically the static object is on the disk of the Web server and is perfectly cacheable. Web transaction sites provide dynamic services and utilize database servers; these sites are disk intensive (disk bound services - DBS). Web commerce sites combine static and dynamic information with secure requests (e.g., Secure Socket Layer, SSL); these sites are disk and CPU intensive (DBS and CPUBS). Web multimedia sites, mainly focusing on video and music streaming services, are heavily disk and network bound (DBS and NBS). According to these types a request can be static/lightly dynamic (N), disk intensive (DBS), CPU intensive (CPUBS), disk and CPU intensive (DBS+CPUBS), or disk and network intensive (DBS+NBS).
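As an illustration of the request model summarized in Table 1, the following hypothetical Python/numpy fragment draws a synthetic trace from the named distribution families; the shape and scale parameters are invented placeholders, not the values used in [Andreolini et al., 2002].

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients = 100                                   # example arrival rate: 100 clients per second

# Inter-arrival times: exponential with mean 1/100 s
inter_arrivals = rng.exponential(scale=1.0 / n_clients, size=n_clients)
# Requests per client session: inverse Gaussian (Wald) distribution
requests_per_session = np.ceil(rng.wald(mean=5.0, scale=10.0, size=n_clients))
# User think time and embedded objects per page: Pareto distributions
think_times = rng.pareto(a=1.4, size=n_clients) + 1.0          # seconds, shifted
objects_per_page = np.ceil(rng.pareto(a=2.4, size=n_clients)) + 1
# Object sizes: lognormal body with a heavy-tailed Pareto tail
sizes = rng.lognormal(mean=9.0, sigma=1.0, size=n_clients)      # bytes
tail = rng.random(n_clients) < 0.07                             # fraction drawn from the tail
sizes[tail] = (rng.pareto(a=1.2, size=tail.sum()) + 1.0) * 10_000
```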

To get a good overview of the server states and request processing, figure 6 presents a possible Web cluster model. Based on this model, further Web cluster state information can be determined periodically, e.g. the queue lengths of the CPUs, disks, and network adapters, or the total number of running processes.

Figure 6: Web Cluster model (two-way)

The static content-blind dispatching policies do not consider any system state information of the data center. The Random algorithm (RAN) distributes the arrivals uniformly across the servers; a disadvantage is that the requests may not be evenly balanced when the number of requests is small. The Round-Robin (RR) policy uses a circular list with a pointer to the last selected server to make the decision: it selects the server i = (i+1) mod n, where n is the number of Web servers. It makes decisions based on information from the past, which can be a problem if the Web servers are not identical (heterogeneous). Both policies can use a statically configured server capacity C(i) in the decision, where the relative server capacity is ξ(i) = C(i) / max(C), with 0 ≤ ξ(i) ≤ 1 and max(C) the maximum server capacity among all the nodes. In the case of the static Weighted Round-Robin policy each server has a weight w(i) = C(i) / min(C) that reflects its capacity, where min(C) is the minimum server capacity among all the nodes. Servers with higher weights receive more connections than servers with lower weights.

Client state aware algorithms have limited information available about the client (the IP address and TCP port number) because the Layer-4 Web switch is content-blind. This means that the assignment to the server nodes is static, based on the calculated hash value of the client IP address (or IP address and port).

Server state aware algorithms focus on different server load indexes (lowest load, least loaded, etc.). Mitzenmacher studied different load index update models [Mitzenmacher, 2000]. For example, the periodic update method updates a central bulletin board at the Web switch every T units of time, and the updated table reflects the current load of all servers.

The continuous update model updates the bulletin board constantly, but the board state lags the real system state by T units of time; this way the assignment is based on the status of the data center T units earlier. The least-loaded approach is one of the default methods in commercial products. The Least Connections algorithm assigns the request to a server based on the number of active connections: the Web switch selects the server with the minimum number of active connections to handle the next request. The dynamic Weighted Round-Robin algorithm (WRR) assigns a dynamically evaluated weight to each server node that is proportional to the server load, estimated as a number of connections [Hunt et al., 1998]. The Web switch computes the weights based on the available load information, and these weights are dynamically incremented for each new assignment.

Content-aware dispatching policies use a Layer-7 Web switch that is capable of inspecting the HTTP request. A Layer-7 Web switch can assign requests to different Web servers over the same TCP connection when HTTP/1.1 persistent connections are used; prior to this (HTTP/1.0), a new TCP connection was established for each URL. A client state aware algorithm can apply a hash function to any part of the URL, which leads to a static partitioning of the requested files. This hash function is applied at the Web switch and can be applied at the targeted server node as well (e.g., multiple hashing, a two-tier folder structure). This method works well for static content but does not consider load sharing. Service Partitioning [Yang & Luo, 2000] assigns special servers (e.g., a media server) to different classes of requests (see the possible request types above). Client affinity policies can assign all requests from the same client to the same server using session identifiers or stored cookie information. The Client-Aware Policy (CAP) [Casalicchio & Colajanni, 2001] is a policy that can improve load sharing among the servers. CAP is a policy for heterogeneous Web clusters with multiple types of services (see above). The Web switch can determine the class of the request from the requested URL and estimate its possible influence on the system. The switch manages a circular list of assignments for each class of service. The main goal of CAP is to share the multiple classes among all nodes and eliminate possible overload.

Client and server state aware algorithms combine the client and the server state information. CAP-ALARM [Andreolini et al., 2002] exchanges alarm messages carrying load information through an asynchronous communication protocol. The requests are distributed using the CAP policy unless an alarm message was sent (in which case the next server is considered). If the server load status exceeds a pre-set threshold value, an alarm message informs the Web switch about the event; when the load falls below the threshold, a wake-up message informs the Web switch. The Locality-Aware Request Distribution (LARD) policy considers locality as well as load balancing [Pai et al., 1998]. It directs all requests for the same Web objects to the same target server (exploiting caching) until the node utilization exceeds a pre-set threshold limit. If it does, the Web switch assigns the request to the next server with the lowest load - if one exists - or to the least loaded one. LARD improves the cache hit rates in the Web cluster.

Figures 7a-7d show the state of the data center and the requests themselves for the main algorithms mentioned above. Our Web cluster has two Web servers (A and B). We use static/lightly dynamic (N), disk intensive (DBS), CPU intensive (CPUBS), disk and CPU intensive (DBS+CPUBS), and disk and network intensive (DBS+NBS) requests. The Web switch queue has in total N(3), CPUBS(2), DBS(3), DBS+CPUBS(2), and DBS+NBS(1) requests, in the order N1, DBS1, CPUBS1, DBS2, DBS3, N2, CPUBS2, DBS+CPUBS1, DBS+NBS1, DBS+CPUBS2, N3.

Figure 7a: Initial State

Figure 7b: Random and Round-Robin algorithms
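As a rough companion to figures 7a and 7b, a hypothetical Python sketch of the two static content-blind policies applied to the request queue above; the queue and server names follow the example, everything else is illustrative.

```python
import random

QUEUE = ["N1", "DBS1", "CPUBS1", "DBS2", "DBS3", "N2",
         "CPUBS2", "DBS+CPUBS1", "DBS+NBS1", "DBS+CPUBS2", "N3"]
SERVERS = ["A", "B"]

def dispatch_random(queue):
    # RAN: every request goes to a uniformly chosen server
    return {req: random.choice(SERVERS) for req in queue}

def dispatch_round_robin(queue):
    # RR: i = (i + 1) mod n, independent of any state information
    return {req: SERVERS[i % len(SERVERS)] for i, req in enumerate(queue)}
```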

Figure 7c: Static Weighted Round-Robin and Least Connections algorithms

Figure 7d: CAP and LARD algorithms
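For figure 7d, a sketch of how the content-aware CAP and LARD decisions could be coded (hypothetical Python); the class extraction from the request labels and the LARD load threshold are assumptions made for illustration.

```python
from itertools import cycle

SERVERS = ["A", "B"]

def cap_dispatch(queue):
    # CAP: one circular assignment list per service class (N, DBS, CPUBS, ...)
    pointers, assignment = {}, {}
    for req in queue:
        cls = req.rstrip("0123456789")             # "DBS+CPUBS1" -> "DBS+CPUBS"
        if cls not in pointers:
            pointers[cls] = cycle(SERVERS)
        assignment[req] = next(pointers[cls])
    return assignment

def lard_dispatch(objects, threshold=3):
    # LARD: keep requests for the same object on one node until it gets too loaded
    target, load, assignment = {}, {s: 0 for s in SERVERS}, {}
    for obj in objects:
        server = target.get(obj)
        if server is None or load[server] > threshold:
            server = min(SERVERS, key=load.get)    # least loaded node
            target[obj] = server
        load[server] += 1
        assignment[obj] = server
    return assignment
```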

Q3) Another direction of work is global load balancing of Web content. Pan, et al examine the use of DNS to distribute work in content distribution networks [Pan et al., 2003]. Briefly outline the primary architectural approaches described in this paper for distributing requests using DNS.

The Domain Name System (DNS) is a distributed database of records spread across a semi-static hierarchy of servers [Mockapetris, 1987]. It works like a contact book that assigns numbers to different names, but on the Internet it assigns IP addresses to hostnames. The domain name space of the Internet is partitioned into different domain hierarchies, where each domain is administered by an authoritative nameserver, the A-DNS. The authoritative Domain Name Server (A-DNS) always has a complete, up-to-date dataset about a particular domain, and it can translate the site name into the VIP address, which can be the address of the front-end node of a Web cluster. The top of the hierarchy is served by the root nameservers, which resolve the top-level domain (TLD). The TLD is at the highest level in the domain name system; for a hostname ending in .net, for example, the TLD is net. There are generic top-level domains such as edu, gov, or mil; generic top-level domains (gTLDs) have three or more characters. Country code top-level domains (ccTLDs) represent countries and different territories. Each name-to-address mapping has an associated time-to-live (TTL) value that defines how long the cached entry remains valid. It is important to select an appropriate TTL value that reduces the load on the nameservers without creating a database propagation bottleneck; [Barr, 1996] recommends 1-5 days for the minimum TTL. Caching can decrease the DNS resolution delay significantly. To further expedite the address resolution process, DNS queries use the unreliable and connectionless User Datagram Protocol (UDP). A DNS database can contain the following main entries: an A record (Address) assigns an IP address to a domain/sub-domain name, a CNAME record (Canonical name) makes one domain name an alias of another, an MX record (Mail exchange) specifies the mail exchange servers for the domain, a PTR record (Pointer) maps an IP address to a canonical host name for DNS reverse lookup, and an NS record (Name server) maps a domain name to a list of A-DNS servers. [IANA, 2002] specifies the full list of DNS record types. After the client makes a request, the local nameserver iteratively tries to resolve the requested name. If the A or the PTR record of the domain is not available and the NS record is not cached locally, then the local nameserver sends a query to the root DNS to resolve the requested domain name. The root server answers back with the address of the A-DNS for the sub-domain. After this, the local nameserver sends a query to the A-DNS and receives the IP address, which is forwarded to the client, and the client can connect to the Web cluster (figure 8). There are 13 root servers of the alternative open DNS root system [Public-Root, 2003] that are strategically deployed around the globe.

The paper describes two main approaches to distributing requests using DNS: Web caching and content distribution. Web caching reduces the bandwidth and improves the response time using a proxy server. In this user-oriented solution users can set up a proxy server relatively close to their location. The requests from the users are first sent to the proxy server.
This caching proxy server retrieves content saved from previous requests generated by the users and keeps a local copy of the frequently requested pages.

Figure 8: DNS address resolution

Cache copy validation is an important factor. If the local copy is no longer valid, the caching proxy server generates a request to the origin server to fulfill the request. Caching proxies split the TCP connection into two (between the client and the proxy, and between the proxy and the server). If the local copy is still valid, the proxy returns the requested object directly without contacting the origin (figure 9).

Figure 9: Web caching [Pan et al., 2003]

We can distinguish between a flat proxy infrastructure, where cooperation happens between sibling neighbor-domain proxies, and a hierarchical proxy structure, where the cooperation is not only between sibling proxies but also between parent and child proxies. A proxy cannot ask a sibling proxy to fetch a document from the origin, but it can ask a parent proxy. HTTP/1.1 has two methods for caches to maintain consistency with the origin (expiration times and validators); for example, if the copy has expired in the cache it needs to be validated before the next use.
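A small sketch of the validator-based revalidation mentioned above (an HTTP conditional GET with ETag / Last-Modified); illustrative Python using only the standard library, with a placeholder URL and in-memory cached copy.

```python
import urllib.error
import urllib.request

def revalidate(url, cached_body, etag=None, last_modified=None):
    """Ask the origin whether a cached copy is still fresh (conditional GET)."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), dict(resp.headers)       # origin sent a fresh copy (200)
    except urllib.error.HTTPError as err:
        if err.code == 304:                              # Not Modified: cached copy is still valid
            return cached_body, {}
        raise
```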

Content distribution is a provider-oriented method, where the origin servers are partially or fully replicated across the Internet (locally or remotely). The client can select among the replicas using a round-robin based DNS response; in this case each replica should have its own IP address. If the replicas are geographically distributed, users can also select their replica explicitly; for example, this happens when users need to select one of the 30 download links for a Debian Linux distribution. A better way to do global balancing is to exploit the DNS infrastructure. The paper introduces Akamai [Akamai, 2007] and its Content Delivery Network (CDN), which addresses global balancing. Based on the client's local DNS query (see figure 8), Akamai DNS servers select a suitable server using different network and server statistics. The CDN has two main services: site delivery and content delivery. Akamai has three main types of DNS servers: one for site delivery (xz.akadns.net) and two for object delivery (akamaitech.net and nxg.akamai.net).

Site delivery fully replicates the customer's site, and the customer utilizes the CDN server selection method; it is mainly for static content retrieval. For example, if a client requests a page, the first DNS query goes to the client's local DNS server. If the NS record is not cached locally, then the local nameserver sends a query to the root DNS (a.root-servers.net) to resolve the requested domain name. The root server answers back with the address of the A-DNS (a.gtld-servers.net) for the sub-domain. After this, the local nameserver sends a query to the A-DNS and receives the address of the site's DNS server (e.g., nsx.about.com). The local DNS server sends a query to the site's DNS server and receives a CNAME record. The root DNS and the A-DNS can be contacted again if the NS record (akadns.net) is not cached locally. In the next step an Akamai DNS server (ze.akadns.net) determines which server group is closest to the local DNS server of the client and replies back with an A record (it can send back more than one A record, and the local DNS server has the ability to decide). Akamai servers are connected with two Ethernet switches (Internet connection and internal communication). Most of the servers have the SQUID [Wessels, 2004] Web caching software installed, which handles the user requests. The cluster has one support server that coordinates internal communication and collects information for the Network Operation and Control Center (NOCC). This information is used to dynamically select the appropriate server group (A record) for the user request.

Object delivery is for retrieving dynamic content (images, video, audio, etc.). If a client requests a Web object, then Akamai's servers cache the object. Object delivery starts like site delivery, but instead of receiving a CNAME record from the site's DNS the client receives the IP address of the page. The received HTML page contains an Akamaized URL (ARL) that specifies the protocol, the Akamai host and domain name (e.g., a1516.g.akamai.net), the customer id, the freshness of the object, and the requested URL. After this process, the root DNS and the A-DNS can be contacted again if the NS record of the ARL's domain name (akamai.net) is not cached locally. In the next step an Akamai DNS server (ze.akadns.net) selects the DNS server closest to the user's local DNS server (in g.akamai.net) and sends it back to the local DNS.
The g.akamai.net DNS server determines the server group and replies back with an A record (it can send back more than one A record, and the local DNS server has the ability to decide).

Q4) What are the considerations for assigning a request to a particular server in a global network? These considerations should include the state of the servers in the network as well as of the request itself.

DNS-based server selection depends on the type of delivery, which can be site or object based. DNS-based server selection methods use the upper layer of the networking protocol stack; in many operating systems the DNS lookup is a serialized, blocking call that uses the unreliable and connectionless UDP protocol. For site delivery Akamai uses xz.akadns.net, whose DNS NS record is maintained by the gTLD DNS servers. Akamai does not control the gTLD servers. At this level, the gTLD DNS servers use a Round-Robin (RR-DNS) algorithm to return the list of NS records and distribute the load. With each DNS response the returned addresses of the identical DNS servers are permuted, which means that the first returned Akamai DNS server may not be the closest one to the local DNS. The Critical Assumption (CA) of DNS-based server selection is that users are always close to their local DNS server. Therefore, Akamai deployed many DNS servers, assuming that the local DNS server will find a close one. The resolved akadns.net DNS servers are cached locally with a TTL value of 2 days. DNS servers have functionality to re-arrange this cached list based on the measured response times of the DNS servers. Akamai's dynamic mapping of an A record tries to prevent DNS caching with a TTL value of about 5 minutes; after 5 minutes the local DNS server needs to query akadns.net again for a new copy. The Akamai DNS server determines which server group is closest to the local DNS server of the client and replies back with an A record, or with a list of A records, in which case the local DNS server can decide using a Round-Robin algorithm (RR-LOCAL) or by tracking the response times between the different Web clusters and the local DNS. Akamai's DNS servers can only estimate the closeness of a cluster to the client's local DNS server using the hostname, address allocation, AS number (an Autonomous System Number uniquely identifies a network connected to more than one network with different routing policies), etc. Figure 10 shows the DNS-based server selection for site delivery together with the status of the DNS servers.

Figure 10: DNS-based server selection with DNS server status (Site delivery)
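A toy sketch (hypothetical Python) of the two selection steps described above: the authoritative side permutes the list of equivalent server addresses round-robin fashion (RR-DNS), and the local DNS either cycles through the cached answers (RR-LOCAL) or prefers the address with the best measured response time. All addresses and timings are invented.

```python
import itertools
from collections import deque

class RoundRobinAuthority:
    """RR-DNS: the authoritative side rotates the record list on every response."""
    def __init__(self, addresses):
        self.records = deque(addresses)
    def answer(self):
        self.records.rotate(-1)
        return list(self.records)

_rr = itertools.count()
def pick_rr_local(answers):
    # RR-LOCAL: the local DNS simply cycles through the cached answers
    return answers[next(_rr) % len(answers)]

def pick_by_latency(answers, measured_rtt):
    # Alternative: prefer the answer with the best observed response time
    return min(answers, key=lambda a: measured_rtt.get(a, float("inf")))

authority = RoundRobinAuthority(["198.51.100.1", "198.51.100.2", "203.0.113.7"])
print(pick_rr_local(authority.answer()))
```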

We have four static/lightly dynamic (N) requests, N1, N2, N3, and N4, against the site. We assume that the local DNS has already cached the CNAME record of the site, but has no cached information about the DNS servers of akadns.net or about the Web clusters before the first request (N1). Web cluster A applies the Random algorithm (RAN) to distribute the arriving requests uniformly across its servers, and Web cluster B uses the static Weighted Round-Robin policy. Figures 11a and 11b show the initial state and the final state of the Web clusters using the DNS-based server selection of figure 10.

Figure 11a: Initial states

Figure 11b: Final states (Web cluster A: RAN, Web cluster B: WRR)

For content delivery, akamaitech.net (the A-DNS for the domain akamai.net) and akamai.net are the two layers of DNS-based server selection that are introduced. The A-DNS servers for the domain akamai.net are geographically deployed around the globe. The difference between site and content delivery is that content delivery has two dynamic mappings within akamai.net. Akamai uses xz.akadns.net, whose DNS NS record is maintained by the gTLD DNS servers; the gTLD servers are out of the control of Akamai. At this level, the gTLD DNS servers use a Round-Robin (RR-DNS) algorithm to return the list of NS records and distribute the load. This first dynamic mapping is similar to the site delivery method. With each DNS response the returned addresses of the identical DNS servers are permuted. This means that the first returned Akamai DNS server may not be the closest one to the local DNS, which violates the Critical Assumption (CA). This is one of the reasons why the additional DNS level (g.akamai.net) is introduced for content delivery: these additional DNS servers (returned by akamai.net) are expected to be close to the local DNS of the client and thus fulfill the CA. The second reason is to avoid frequent queries to the gTLD or akamai.net DNS servers, because host names in g.akamai.net have very small TTL values (on the order of seconds). This is where the second dynamic mapping happens, using the static Weighted Round-Robin policy (WRR), where the weight of a DNS server corresponds to the replication of a particular IP address. The TTL values of the returned A records of the Web clusters are also very small (about 20 seconds). Table 2 gives an example of the client's requests, the translated ARLs, and the possible request types (for the request classes see Q2). Figure 12 presents the DNS-based server selection for content delivery together with the status of the DNS servers. Figures 13a and 13b show the initial state and the final state of the Web clusters using DNS-based server selection based on figure 12.

Table 2: Requests with translations to ARL
  ID  Request                   ARL                                             Request class
  #1  get_all_image_names.cgi   2h/about.com/services/get_all_image_names.cgi   DBS
  #2  display_all_images.cgi    2h/about.com/services/display_all_images.cgi    DBS+CPUBS
  #3  histogram.cgi             2h/about.com/services/histogram.cgi             CPUBS
  #4  history.avi               2h/about.com/video/history.avi                  DBS+NBS

Figure 12: DNS-based server selection with DNS server status (Content delivery); #1: DBS, #2: DBS+CPUBS, #3: CPUBS, #4: DBS+NBS based on Table 2

Figure 13a: Initial states

Figure 13b: Final states (Web cluster A: CAP, Web cluster B: LARD)

Q5) How do these two types of request routing approaches compare with and relate to each other? Can or should local decisions made within a data center affect global decisions? Can or should global decisions affect decisions made within a data center?

DNS-based routing was the first approach proposed to handle multiple Web servers distributed locally [Kwan et al., 1995]. Nowadays, DNS-based routing is used in geographically distributed systems [Pan et al., 2003], where the local DNS or the A-DNS can select different servers for every address resolution using simple policies [Brisco, 1995]. In the case of a Web cluster the Web switch represents a single point of failure: if the switch is not available, the Web cluster fails to accept any forwarded request. Moreover, because of the TCP/IP packet header manipulation and checksum re-computations, the Web switch can become a system bottleneck. A simple solution could be to use multiple Web clusters, each with its own Web switch and statically visible VIP. An A-DNS can distribute the clients' requests among the Web clusters at the DNS level using a static algorithm like Round-Robin; this can protect the Web switches against overload. Each Web switch can then use a different algorithm to further share the load [Dias et al., 1996]. This method combines DNS-based server assignment with Web server dispatching at the cluster level. DNS-based server selection is simple and requires no changes to the existing protocols, operating systems, kernel modules, or drivers [Shaikh, 2001]. The Critical Assumption (CA) of DNS-based server selection is that users are always close to their local DNS server; if the client and the local DNS are not close to each other, the client can be directed to an inadequate Web cluster. There are a couple of facts that distinguish DNS-based server selection from Web switch scheduling:
- Address caching (e.g., at the local DNS or A-DNS level) potentially limits the scalability of the DNS method;
- High skewing can occur during the time-to-live period;
- Browsers cache the addresses, so requests are directed to the same server.

The time-to-live (TTL) period for caching the name-to-address resolution is a major difference between the DNS and the cluster scheduling problem [Colajanni, 1998]. If caching is disabled (TTL = 0), then the DNS is used purely for server selection and the clients need to query the A-DNS server for each name resolution, which can reduce the performance of the DNS system. Looking at the Web cluster model (figure 6), the length of the queues at each server is a good indicator of the server load; if the system is geographically distributed, the queues do not reflect the expected arrivals, due to the same TTL problem. Comparing the request routing policies, Round-Robin is the standard for DNS-based solutions. [Colajanni, 1998] notes that the Round-Robin DNS policy can work well if the clients of each local network segment are uniformly distributed. Another problem with DNS server selection is that if one Web cluster goes down, a client may still try to reach it, as opposed to the Web cluster dispatching solutions, where a Web switch does not forward a request to a node that is down.

Local decisions made within a data center can and should affect the global decisions. In the case of the asynchronous redirection policies [Cardellini et al., 1999] each Web cluster can send status information to the DNS server. For example, an alarm signal can indicate that a Web cluster is heavily loaded: the Web cluster sends the signal to the DNS to affect the decision at the global level and disable further request forwarding to the overloaded Web cluster (e.g., it sends the signal based on utilization). Each Web cluster can evaluate the utilization of its servers within a specified time interval. Frequent message communication can also happen between the DNS server and the Web cluster; this method continuously keeps the DNS server informed about the history of the cluster (e.g., its utilization).

Global decisions can and should also affect decisions made within a data center. The synchronous redirection method centralizes the decision at the DNS server. The Web clusters send status information to the DNS server at every pre-set time interval. The DNS server can identify the source of each address resolution and the request load, and it can capture the assigned Web cluster for each connected domain. The DNS server builds a mapping table with this information; this table is consulted during the address lookup phase. The DNS server receives the alarm signals when a Web cluster is overloaded. In the case of a domain redirect (all clients from the same domain) the DNS server sends the mapping table to each Web cluster. The Web cluster checks the mapping table - based on the incoming request - for redirection, and it can apply the Round-Robin algorithm to select where to forward the request. For example, if the local DNS server cached a domain resolution with TTL = 2 days, then it is enough for the client to query the local DNS server to get to a Web cluster; in this case the Web cluster redirects the request, because the A-DNS server propagated the mapping table and the domain is redirected. It is important to set the mapping table update interval to a correct value. [Cardellini et al., 1999] showed that individual client redirection is enough to achieve acceptable performance with asynchronous methods, but for synchronous ones only an additional domain redirect can help. To make an effective decision on reassignment, the combination of server and domain information is the best method.
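The asynchronous alarm/wake-up interplay described above could be approximated as follows (hypothetical Python); the utilization threshold, cluster names, and utilization values are invented for illustration.

```python
import itertools

THRESHOLD = 0.85
clusters = {"cluster-A": 0.40, "cluster-B": 0.92}     # current utilization per cluster
excluded = set()                                      # clusters the DNS stops resolving to

def cluster_heartbeat(name, utilization):
    """Cluster side: raise an alarm above the threshold, a wake-up below it."""
    return ("ALARM", name) if utilization > THRESHOLD else ("WAKEUP", name)

def dns_handle(message):
    """DNS side: exclude alarmed clusters from future address resolutions."""
    kind, name = message
    (excluded.add if kind == "ALARM" else excluded.discard)(name)

_rr = itertools.count()
def resolve():
    candidates = [c for c in clusters if c not in excluded]
    return candidates[next(_rr) % len(candidates)]    # Round-Robin over healthy clusters

for name, util in clusters.items():
    dns_handle(cluster_heartbeat(name, util))
```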

Q6) Do these decisions change if back-end processing such as access to a database is needed? Does such back-end processing affect global decisions, local decisions, neither or both?

If back-end processing is required, that implies the need for Dynamic Content Generation (DCG). DCG generates high I/O, disk, and CPU demands, and the database nodes face a huge number of simultaneous dynamic requests. If we combine static and dynamic request processing, the dynamic requests can decrease the Web cluster performance dramatically, which needs to be managed by applying different solutions. To select a suitable node for Dynamic Content Processing (DCP) the Web switch can estimate the expected cost of the processing requirement (I/O, CPU) before the request is forwarded to the server for DCP [Zhu et al., 1999]. There is a problem with both the two-way and the one-way architectures: if the server stops answering the client in the middle of a serious transaction, the client has no idea what has happened. The client can retry the query (e.g., refresh the page) and the Web switch can assign the request to a different working node for processing.

Back-end processing should definitely affect local decisions. For example, data storage can be implemented with full replication (symmetric back-end), which replicates all content across the nodes, or with partial replication (asymmetric back-end), which replicates specific parts of the content. In this case the Web switch can pick the right node locally, but this requires query-specific knowledge to route the request to the specific back-end node. There are solutions that involve a transaction processing monitor responsible for retrying the failed DCP on a different server [Gray & Reuter, 1997]. The Web switch can choose to send only a subset of the dynamic requests to the first selected node and send the rest to a second one; this can decrease the load of the Web cluster. For example, the LARD dispatching algorithm (see question 2) can add this feature to the Web switch: LARD can target a specific group of queries and send them to the same node. LARD does not handle the service or server availability problems. Web switch availability is an important factor as well: if the Web switch is down, the forwarded requests will not be handled. The key is to detect the problem as soon as possible and reduce the failover reaction time. It is also possible that the Web cluster is up but does not function properly because one or two services are down (e.g., the HTTP daemon); in this case the Web switch needs to know the health of the services, not just the load index of the server nodes.

In the case of the global approach a failover requires updating the DNS server resolution database, so the failover response time of a global approach is poor. This can be improved by reducing the TTL value of the DNS server, but that generates further problems (see question 5); moreover, there can be DNS software that overwrites the TTL value with a default minimum if the distributed TTL is very small or 0. The failover reaction time of the global method depends on the TTL value, and the DNS servers do not have query-specific support compared to a local decision. Therefore, back-end processing should not affect the global decision, due to the poor failover response time and because the global decision cannot route the traffic based on a query [Brewer, 1999].

Q7) Your dissertation topic relates to organizing a database across multiple machines to provide best performance for front-end applications. If front-end applications are spread across globally distributed data centers then what impact would this distribution have on placement of data within databases? Should all data be centralized within a single data center requiring remote database access from some data centers or should data be spread across multiple data centers?

The dissertation topic [Patvarczki, 2010] is a rule-based data replication middleware for using multiple database servers for Web applications. The goal is to minimize the effective response time of the database and to distribute the data across multiple nodes effectively. We assume that we know ahead of time every query of the Web-based application that could come to the system; queries that have not been seen before are not guaranteed to be answerable at all, because the query processing logic is not capable of answering them. The system demands that each query be answerable by a single database server; this can reduce the communication cost between two nodes with pre-computed joins. We characterize the problem as an AI search over layouts. There are four operators we consider that create the search space of possible database layouts: denormalization (DN), horizontal partitioning (HP), vertical partitioning (VP), and full replication (R). The distributed database should always present itself as a single image to the client, and users should always get a consistent state without limiting the scalability of the Web cluster [Plattner et al., 2006]. If front-end applications are spread across globally distributed data centers, the read and the update, delete, and insert (UDI) queries issued by the Web servers have to find the relevant data parts. Figure 14 presents a geographically distributed data center infrastructure.

Figure 14: Globally distributed architecture

The impact depends on whether both Web clusters (A and B) have data layers or just one of them. If a Web cluster (e.g., A) has no back-end, just presentation and application layers, then it needs to access a Web cluster (e.g., B) that serves exactly the same front-end application with a back-end layer. At Web cluster B the Query Router is responsible for determining the requested data using the correct database nodes. The Query Router knows the current layout of the tables and is responsible for maintaining consistency; it handles each query as an independent action (no transaction support). If both Web clusters have data layers then there are two possibilities:

a) The layout of cluster A's database servers can be an exact replica of cluster B's database layout (the optimal layout is already determined, the operators - HP, VP, DN, R - are already applied, and the data routing table is already known by the Query Router). The database layout of Web cluster B is fully replicated. Clusters A and B are assigned to the same hostname with two A (Address) records in the DNS servers. Queries can be answered locally and the database access latency can be reduced. If the query workload contains UDI queries, then the updates must be applied to both clusters' databases with a possible propagation mechanism or group communication.

b) Cluster A's partitioned database servers can contain only a subset of all data, while cluster B's partitioned data servers hold the rest of it. The final layout is determined using all the database nodes of clusters A and B together, and each Query Router has its own routing table that indicates the VIP addresses of the clusters near the data, for retrieval purposes. The layout algorithm needs to know the total number of servers per cluster and each server-cluster assignment in order to eliminate GeoJoins and fulfill our assumption. We introduce the term GeoJoin for a join operation between geographically distributed databases.

In both cases the goals are to minimize the effective response time of the database and to distribute the data across multiple nodes effectively. If all data is centralized at one location and accessed from many geographically distributed sites, then the requests are highly dependent on the central servers and can generate high traffic on that specific Web cluster. This solution has a relatively simple data coordination method and the data consistency is guaranteed, but if the centralized system fails (e.g., the Web switch goes down), the requests cannot be completed because the back-end is not available. If data is fully replicated across geographically distributed sites (option a), then each replica must be updated whenever a UDI happens. This introduces redundancy, and each query can be processed locally. If the Web switch goes down, a failover can be initiated (e.g., at the DNS server level) and the requests are redirected to the working cluster; if the cluster is overloaded, the Web switch can forward requests to the replica cluster. There is a communication cost due to the synchronization process. If data is partitioned into fractions across different geographically distributed sites (option b), then non-local queries need to access different Web clusters for the missing data. This generates network intercommunication cost and query processing delay, and because of the UDI queries this method requires complex consistency management. There is one more important factor that we have to consider: the characteristics of the application.
An application can be read or UDI intensive. If the application is UDI intensive then the necessary synchronization can be expensive, and this can eliminate option (a). If an organization has two departments in two different states and each department accesses only a small portion of the database tables, but frequently, then option (b) can be a good solution for them: the relevant data can be stored close to the locations where the local queries are executed.
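A rough sketch (hypothetical Python) of the kind of rule-based routing described for the Query Router: queries are matched against known templates, each template is mapped to the cluster and database node that can answer it on its own, and anything unknown is rejected, mirroring the single-server-answerability assumption. All table names, templates, and cluster names are invented.

```python
# Known query templates mapped to the cluster and node that can answer them
# without a GeoJoin. All identifiers below are illustrative placeholders.
ROUTING_TABLE = {
    "SELECT * FROM users WHERE id = ?":        ("cluster-A", "db1"),
    "SELECT * FROM orders WHERE user_id = ?":  ("cluster-A", "db2"),
    "INSERT INTO orders VALUES (?, ?, ?)":     ("cluster-A", "db2"),
    "SELECT * FROM products WHERE sku = ?":    ("cluster-B", "db1"),
}

def route(query_template: str, local_cluster: str):
    """Return (cluster, node) for a known template; reject unknown queries."""
    target = ROUTING_TABLE.get(query_template)
    if target is None:
        raise ValueError("query template not known ahead of time - not answerable")
    cluster, node = target
    if cluster != local_cluster:
        pass  # non-local data: the local Query Router must contact the remote cluster (option b)
    return target

# route("SELECT * FROM products WHERE sku = ?", "cluster-A") -> ("cluster-B", "db1")
```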

Q8) How can ideas for organization of a database within a data center be extended for use by front-end applications across geographically dispersed data centers?

To give a small overview of the possible ideas, we include a couple of relevant technologies [Patvarczki, 2009]. One of the ideas for organizing a database within a data center is replication [Ramakrishnan & Gehrke, 2003]. In replication a table is placed on more than one database server. In such a case, a select query on the table can be executed by any one of the database servers that have a replica of that table; a UDI query on that table, however, needs to be executed on all the database servers that have a replica of it. A drawback of this technique is that every UDI query needs to be executed against the node(s) that hold all of the data, and these nodes become the performance bottleneck. Another idea is a master-slave architecture, supported by several database systems [MySQL, 2010][PostgreSQL, 2010], where a single master server holds all of the data and every UDI query is executed against the master node and propagated to the slave nodes as necessary. In a master-slave environment all writes and updates must take place on the master server, and reads can take place on one or more slave servers; this model can significantly increase read performance. Master-slave architectures with more than one master node are also possible. DBFarm [Plattner, 2006] answers all read-only queries using a single replica and write queries with one of the master databases. It simply separates read and write transactions: writes are performed at the master level and reads at the slave level. Read-only transactions executed at the slave databases are able to see all updates of the master database. DBFarm handles commit acknowledgements and assures read-only consistency. The drawback of this architecture is that a write has to happen on all masters, which can introduce a significant overhead and decrease the system throughput. DBProxy [Amiri et al., 2003] observed that most applications issue template-based queries, and that these queries have the same structure with different string or numeric constraints; this helps to reduce the containment checking overhead significantly. Their system is a semantic data cache designed to adapt to changes in the workload. They aggregate the similar query templates in the cache, which leads to a faster query search and a significant performance improvement. [Gao et al., 2003] introduces an edge service architecture (edge refers to a component intended to improve the performance of a web-based system by distributing web content over the Internet) to improve the availability and performance of web-based applications by replication not just in a clustered environment but also at geographically distributed sites. GlobeDB [Sivasubramanian et al., 2005] offers a different approach for edge servers to handle data distribution: it replicates the data along with its access code across machines only if the update rate is high enough at the specific location. GlobeTP [Groothuyse et al., 2007] predicts query execution costs based on the known query templates and uses the result for table placement involving table replication. Their replication-like operator replicates the entire table on a subset of the database nodes.
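To illustrate the read/write split that the master-slave idea relies on, here is a small, hypothetical Python sketch; the connection objects and their execute method are placeholders, not an actual database driver API.

```python
class MasterSlaveRouter:
    """Send UDI statements to the master, spread reads over the slaves."""
    def __init__(self, master, slaves):
        self.master = master          # connection to the single write node (placeholder object)
        self.slaves = slaves          # connections to read-only replicas
        self._next = 0

    def execute(self, sql, params=()):
        if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master.execute(sql, params)          # UDI: master only
        slave = self.slaves[self._next % len(self.slaves)]   # reads: round-robin over slaves
        self._next += 1
        return slave.execute(sql, params)
```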
In the database domain, partial replication assumes that the shared data is partitioned into n disjoint databases, and replication of an arbitrary subset of these databases is allowed as long as every database is present on at least one node. IBM DB2 Enterprise Edition addresses the problem of laying out data across multiple nodes using different operators; it operates in a shared-nothing architecture where a collection of nodes is used for parallel query execution. Replication can be extended to geographically distributed data centers if the database layout of Web cluster B is a full replica (see question 7, option (a)).

Clusters A and B are assigned to the same hostname with two A (Address) records in the DNS servers. Queries can then be answered locally and the database access latency is reduced. Each back end can be treated as a master node, and if the query workload contains UDI queries then the updates must be applied to both masters with a propagation mechanism or group communication. With group communication, the Query Router functions as a Geographical Data Consistency Manager (GDCM) and decides when and how to synchronize the clusters. This can happen with, e.g., a push-based approach that immediately synchronizes the replica in the background after each UDI query. The GDCM can automatically collect statistical information [Haas et al., 2005] on query IDs, table accesses, UDI queries per table, database I/O, etc. This information can be propagated across the geographically distributed databases to help decide on request forwarding.

Layout generation can differentiate between UDI-intensive and retrieval query templates: Cluster A can hold all of the UDI-intensive tables partitioned, and Cluster B can hold the retrieval-oriented ones. The GDCM of cluster A can then contact the GDCM of cluster B and retrieve the required data. The GDCMs can also be prepared for parallel query execution: one part of a complex query is sent to cluster B while the other part is processed on cluster A, which can speed up query processing and decrease latency. Layout generation can also reflect the query template IDs. Because we know all the incoming query templates beforehand, specific queries can be assigned to specific Web clusters based on their IDs. If the Query Router of cluster A receives a query whose template ID does not belong to the cluster, it contacts Cluster B's Query Router and retrieves the data. The query router must perform the required locking to handle conflicting requests. When multiple queries are executed on a single database server concurrently, a locking mechanism ensures the correct in-order semantics; relying only on the locking mechanism available at the database server is not sufficient. The locking mechanism provided by the query router must ensure that when there are two UDI queries against the same table, the two updates are performed in the order in which the requests arrived at the query router (and similarly for a UDI query and a select query). A minimal sketch of this template-ID routing with per-table ordering is given below.

Caching is an important option to consider as well. An extended caching method can run at each Query Router, using a caching solution such as DBProxy [Amiri et al., 2003], so the router can use the cache content to answer retrieval queries efficiently. Furthermore, the caches of the different clusters can be synchronized with each other, reducing the intercommunication cost of the geographically distributed databases. Similarly to caching, virtual tables (materialized views) can be created to achieve location transparency by keeping a snapshot of distributed tables (clusters B and C) in one location (cluster A). A materialized view stores the result of a query in a cached table format that can be queried or updated. The adapter approach can be a solution to decrease the intercommunication cost and the query dependency interval: each Query Router communicates directly with the others using a separate network adapter and its own VIP address.
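The sketch below illustrates the query-template-ID dispatching and the per-table ordering described above. It is not the actual middleware implementation; the template-to-cluster map, the cluster names, and the lock granularity are illustrative assumptions:

import threading
from collections import defaultdict

TEMPLATE_TO_CLUSTER = {"QT1": "A", "QT2": "A", "QT3": "B"}   # hypothetical assignment

class QueryRouter:
    def __init__(self, local_cluster, remote_router=None):
        self.local_cluster = local_cluster
        self.remote_router = remote_router                 # the other cluster's Query Router
        self.table_locks = defaultdict(threading.Lock)     # one lock per table name

    def route(self, template_id, sql, tables):
        if TEMPLATE_TO_CLUSTER.get(template_id) != self.local_cluster and self.remote_router:
            # The template ID belongs to the other cluster: retrieve the data from its router.
            return self.remote_router.route(template_id, sql, tables)
        # Serialize conflicting requests per table (UDI vs. UDI and UDI vs. select alike),
        # since the database server's own locking alone does not preserve arrival order.
        held = [self.table_locks[t] for t in sorted(tables)]   # fixed order avoids deadlock
        for lock in held:
            lock.acquire()
        try:
            return self._execute_locally(sql)
        finally:
            for lock in reversed(held):
                lock.release()

    def _execute_locally(self, sql):
        # Placeholder for dispatching to a database node inside this cluster.
        return "[%s] executed: %s" % (self.local_cluster, sql)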
If the data is partitioned into fractions across different geographically distributed sites (question 7, option (b)), then horizontal partitioning can be a good solution for organizations with branch offices at different locations: only the relevant data is stored close to the specific location, and data that is not relevant for branch office A is not stored at Web cluster A. Vertical partitioning can be useful for departments that share common tables, e.g., when one part of a table is used by computer science, another part by electrical engineering, and the two departments are geographically separated. Applying the organizational functions, the table can be vertically partitioned and the columns used by the computer science department can be grouped together at cluster A.
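To make the two directions concrete, the sketch below partitions a hypothetical table horizontally by branch location and vertically by department-owned columns. The table rows, column names, and branch codes are invented for illustration:

rows = [
    {"id": 1, "name": "Ann",  "branch": "MA", "cs_grade": "A", "ee_lab": "L1"},
    {"id": 2, "name": "Bela", "branch": "CA", "cs_grade": "B", "ee_lab": "L2"},
]

def horizontal_partition(rows, branch):
    # Each Web cluster keeps only the rows belonging to its own branch office.
    return [r for r in rows if r["branch"] == branch]

def vertical_partition(rows, columns):
    # Each cluster keeps only the columns its department uses; the key column "id"
    # is kept in every fragment so the fragments can be re-joined later.
    return [{k: r[k] for k in ("id",) + tuple(columns)} for r in rows]

cluster_a_rows = horizontal_partition(rows, "MA")                  # branch data for cluster A
cluster_a_cols = vertical_partition(rows, ("name", "cs_grade"))    # CS columns for cluster A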

Q9) How does the organization of the back-end database within or across data centers affect the decision on where to direct requests both within a data center and across data centers?

In the case of a geographically distributed infrastructure, each client request is assumed to be redirected to the closest Web cluster [Sivasubramanian et al., 2005], and data should be stored at the site from which it is accessed most frequently. Upon receiving the query, the Query Router decides which database server at which Web cluster should serve the request. If all data is centralized within a single data center and the other data centers access the databases remotely, then the Query Router is responsible for directing the queries to the appropriate node. The Query Router maintains a mapping table that describes the table-node relationships; based on this table, a request is assigned to a specific database node. The query router performs the required locking to handle conflicting requests: it must ensure that when there are two UDI queries against the same table, the two updates are performed in the order in which the requests arrived at the query router (and similarly for a UDI query and a select query). Since the final database layout is static (there are no unexpected queries), the query-server mapping table does not change. If a new query template arrives, the middleware must re-calculate the layout to be able to handle the unseen template.

The Query Router can dispatch queries using the Query Template ID (QTID) approach: each query template has an ID, and the Query Router knows which ID belongs to which server. If the tables are fully replicated among the database nodes, then a Round Robin algorithm is used to distribute the queries. An alternative approach is cost-based routing [Groothuyse et al., 2007]. This method estimates the execution cost of each query and combines that information with the load of each database server: the server load is estimated from the queries already scheduled on the server and their costs, and the Query Router then routes the next query to the least loaded server (a minimal sketch of this idea is given below). If the data center is geographically distributed and each location has its own back-end layer, then the routing becomes more complex (see Figures 15a and 15b).
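A minimal sketch of the cost-based dispatching just described, assuming per-template cost estimates are already available (the costs, template IDs, and server names here are made up):

TEMPLATE_COST = {"QT1": 5.0, "QT2": 1.0, "QT3": 12.0}   # estimated execution cost per template

class CostBasedDispatcher:
    def __init__(self, servers):
        self.load = {s: 0.0 for s in servers}   # cost of queries currently scheduled per server

    def dispatch(self, template_id):
        cost = TEMPLATE_COST.get(template_id, 1.0)
        server = min(self.load, key=self.load.get)   # pick the least loaded server
        self.load[server] += cost                    # account for the newly scheduled query
        return server

    def finished(self, server, template_id):
        self.load[server] -= TEMPLATE_COST.get(template_id, 1.0)

dispatcher = CostBasedDispatcher(["db1", "db2", "db3"])
print(dispatcher.dispatch("QT3"))   # the heavy query goes to the least loaded server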

Figure 15a: Geographically distributed data centers. The layout of cluster A's database servers is an exact replica of cluster B's database layout.

Figure 15a shows the case where the data center is geographically distributed and each Web cluster has exactly the same database layout and the same data. The client can get the VIP address of cluster A or cluster B from the local DNS server, and the request can go to either Web cluster since they have identical database layouts. Group communication synchronizes the data between the two locations, but within each cluster the Query Router can apply a Round Robin or cost-based dispatching algorithm. This infrastructure can also handle the case when cluster A is overloaded or not available (reliability): all requests are forwarded to Web cluster B, and the Query Router can continuously monitor the database states and route the queries to cluster B. Figure 15b shows a geographically distributed infrastructure where the data is partitioned into fractions and each Web cluster has its own layout and data.

Figure 15b: Geographically distributed databases. Cluster A's partitioned database servers contain only a subset of all the data, which is partitioned into fractions.

The local DNS can give out the VIP address of Web cluster A or B based on a Round Robin algorithm. When the client receives an address (e.g., cluster B), it initiates the connection to cluster B. Cluster B's Query Router either processes the query locally or contacts cluster A's Query Router for the missing part of the data. The routing can be problematic if Web cluster A is not reachable, because the query cannot be completed until cluster A comes back online, and the cost of shipping data across networks can be high.
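As a rough illustration of the decision a Query Router makes in the partitioned setup of Figure 15b, the sketch below checks whether all tables required by a query are hosted locally and otherwise forwards the request to the other cluster's router, failing if that cluster is unreachable. The table placement and the forwarding call are placeholders, not the actual middleware interface:

LOCAL_TABLES = {"A": {"orders", "customers"}, "B": {"inventory", "suppliers"}}   # made-up placement

class GeoQueryRouter:
    def __init__(self, cluster, remote=None):
        self.cluster = cluster
        self.remote = remote           # the other Web cluster's router, or None if unreachable

    def handle(self, sql, tables):
        if set(tables) <= LOCAL_TABLES[self.cluster]:
            return "[%s] executed: %s" % (self.cluster, sql)   # all required data is local
        if self.remote is None:
            # The other cluster is unreachable: the query cannot be completed (Figure 15b drawback).
            raise RuntimeError("required data lives on an unreachable cluster")
        return self.remote.handle(sql, tables)   # forward to the cluster that holds the data

router_b = GeoQueryRouter("B", remote=GeoQueryRouter("A"))
print(router_b.handle("SELECT * FROM orders", ["orders"]))   # served by cluster A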
