Distributed Systems: Principles and Paradigms
Maarten van Steen, VU Amsterdam, Dept. Computer Science, steen@cs.vu.nl
Chapter 12: Distributed Web-Based Systems
Version: December 10, 2012
12.1 Architecture
Distributed Web-based systems
Essence: The WWW is a huge client-server system with millions of servers, each hosting thousands of hyperlinked documents.
Documents are often represented as text (plain text, HTML, XML).
Alternative types: images, audio, video, applications (PDF, PS).
Documents may contain scripts, executed by client-side software.
[Figure: a client machine running a browser and a server machine running a Web server. 1. Get document request (HTTP). 2. Server fetches document from local file. 3. Response.]
Multi-tiered architectures
Observation: Very early on, Web sites were organized into three tiers.
[Figure: a Web server (HTTP request handler), a CGI program running as a CGI process, and a database server. Steps recoverable from the figure: 1. Get request. 3. Start process to fetch document. 4. Database interaction. 5. HTML document created. 6. Return result.]
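The three tiers can be illustrated with a minimal sketch. It is not how a CGI deployment is actually wired up: an in-memory SQLite database stands in for the database server, plain function calls stand in for starting a CGI process, and all function, table, and document names are assumptions made for the example.

```python
import sqlite3

def make_database():
    """Third tier: an in-memory SQLite database standing in for the database server."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE docs (name TEXT PRIMARY KEY, body TEXT)")
    db.execute("INSERT INTO docs VALUES (?, ?)", ("index", "Hello from the database"))
    db.commit()
    return db

def cgi_program(db, doc_name):
    """Second tier: interact with the database and create an HTML document."""
    row = db.execute("SELECT body FROM docs WHERE name = ?", (doc_name,)).fetchone()
    if row is None:
        return 404, "<html><body>Not found</body></html>"
    return 200, f"<html><body>{row[0]}</body></html>"

def handle_request(db, path):
    """First tier: the HTTP request handler passes the request to the CGI program
    (a real server would start a separate process here) and returns its result."""
    doc_name = path.strip("/") or "index"
    return cgi_program(db, doc_name)

db = make_database()
status, html = handle_request(db, "/index")
print(status, html)
```

Keeping the database behind the CGI tier means the request handler never needs to know how documents are stored.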
Web services
Observation: At a certain point, people started recognizing that the Web was about more than just user-site interaction: sites could offer services to other sites. Standardization is then badly needed.
[Figure: a client machine (client application, stub, communication subsystem) and a server machine (server application, stub, communication subsystem) communicating through SOAP. On both sides the stub is generated from the WSDL service description. The server publishes its service description (WSDL) in a directory service (UDDI), where the client looks up the service.]
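The publish, look up, and generate-stub steps can be mimicked in a few lines. This is only a toy sketch under loud assumptions: a dictionary plays the role of the directory service (UDDI), another dictionary plays the role of the service description (WSDL), and the stub calls the service in-process instead of exchanging SOAP messages. Every name below is hypothetical.

```python
# Hypothetical stand-in for the directory service (UDDI).
directory = {}

def publish(name, description, implementation):
    """Server side: publish a service description in the directory."""
    directory[name] = (description, implementation)

def look_up(name):
    """Client side: look up a service by name."""
    return directory[name]

def generate_stub(description, implementation):
    """Build a client-side proxy exposing the operations in the description."""
    class Stub:
        pass
    stub = Stub()
    for op, params in description["operations"].items():
        def call(*args, _op=op, _params=params):
            if len(args) != len(_params):
                raise TypeError(f"{_op} expects {len(_params)} argument(s)")
            # A real stub would marshal this call into a SOAP message here.
            return implementation[_op](*args)
        setattr(stub, op, call)
    return stub

publish(
    "converter",
    {"operations": {"celsius_to_fahrenheit": ["celsius"]}},  # toy "WSDL"
    {"celsius_to_fahrenheit": lambda c: c * 9 / 5 + 32},
)
description, implementation = look_up("converter")
stub = generate_stub(description, implementation)
result = stub.celsius_to_fahrenheit(100)
print(result)
```

The point of the description is that the client can build its stub without any knowledge of the server's implementation.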
12.2 Processes
Apache
Observation: More than 52% of all 185 million Web sites run Apache. The server is internally organized more or less according to the steps needed to process an HTTP request.
[Figure: the Apache core with a series of hooks between request and response. Modules supply functions, each function is linked to a hook, and the core calls the functions registered for each hook.]
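The hook idea can be sketched as follows. This is not Apache's actual module API: the hook names, the order in which they run, and the module functions are all assumptions chosen to show how a core can call per-hook function lists while processing a request.

```python
from collections import defaultdict

class Core:
    """Toy server core: each hook is a placeholder for a group of functions."""
    # Hypothetical hook names, processed in this fixed order.
    HOOK_ORDER = ["map_url", "check_access", "generate_response", "log"]

    def __init__(self):
        self.hooks = defaultdict(list)

    def register(self, hook, function):
        """Link a module's function to a hook."""
        self.hooks[hook].append(function)

    def process(self, request):
        """Walk the hooks in order and call every function linked to each."""
        for hook in self.HOOK_ORDER:
            for function in self.hooks[hook]:
                function(request)
        return request

# Functions as they might be supplied by separate modules.
def map_url(request):
    request["file"] = "/var/www" + request["url"]

def generate_response(request):
    request["response"] = f"contents of {request['file']}"

def log(request):
    request.setdefault("log", []).append(request["url"])

core = Core()
core.register("map_url", map_url)
core.register("generate_response", generate_response)
core.register("log", log)
result = core.process({"url": "/index.html"})
print(result["response"])
```

A hook with no registered functions, such as `check_access` here, is simply skipped, which is what lets modules be added or removed independently.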
Server clusters
Essence: To improve performance and availability, WWW servers are often clustered in a way that is transparent to clients.
[Figure: Web servers connected by a LAN behind a front end. The front end handles all incoming requests and outgoing responses.]
Server clusters
Problem: The front end may easily get overloaded, so special measures need to be taken.
Transport-layer switching: The front end simply passes the TCP request to one of the servers, taking some performance metric into account.
Content-aware distribution: The front end reads the content of the HTTP request and then selects the best server.
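The difference between the two policies can be sketched in a few lines. The server names, the load metric (number of open connections), and the hash-based URL mapping are assumptions made for the example; a real front end may use quite different metrics and placement rules.

```python
import zlib

SERVERS = ["server-1", "server-2", "server-3"]  # hypothetical back-end names

def transport_layer_select(active_connections):
    """Transport-layer switching: choose by a load metric only (here, the
    fewest open TCP connections); the request content is never inspected."""
    return min(active_connections, key=active_connections.get)

def content_aware_select(request_line):
    """Content-aware distribution: parse the HTTP request line and map the
    URL to a server, so the same document always lands on the same server."""
    method, url, version = request_line.split()
    return SERVERS[zlib.crc32(url.encode("utf-8")) % len(SERVERS)]

load = {"server-1": 4, "server-2": 1, "server-3": 7}
print(transport_layer_select(load))
print(content_aware_select("GET /index.html HTTP/1.1"))
```

Note that the content-aware policy can only run after the TCP connection has been set up and the request has arrived, which is why it costs the front end more work.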
Server clusters
Question: Why can content-aware distribution be so much better?
[Figure: a client, a switch, two distributors, a dispatcher, and the Web servers. 1. Pass setup request to a distributor. 2. Dispatcher selects server. 3. Hand off TCP connection. 4. Inform switch. 5. Forward other messages. 6. Server responses.]
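One possible answer is cache locality: if requests for the same document always reach the same server, that server's in-memory cache stays useful. The small simulation below illustrates this under strong assumptions (three servers, an LRU cache of two documents per server, four documents requested cyclically); the numbers say nothing about real workloads.

```python
from collections import OrderedDict

class Server:
    """A server with a small LRU cache of recently served documents."""
    def __init__(self, capacity):
        self.cache = OrderedDict()
        self.capacity = capacity
        self.hits = 0

    def serve(self, doc):
        if doc in self.cache:
            self.cache.move_to_end(doc)         # mark as most recently used
            self.hits += 1
        else:
            self.cache[doc] = True              # load the document from disk
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used

def simulate(select, requests, n_servers=3, capacity=2):
    """Run all requests through the given selection policy; return cache hits."""
    servers = [Server(capacity) for _ in range(n_servers)]
    for i, doc in enumerate(requests):
        servers[select(i, doc, n_servers)].serve(doc)
    return sum(s.hits for s in servers)

requests = [i % 4 for i in range(24)]  # documents 0..3, requested cyclically
blind_hits = simulate(lambda i, doc, n: i % n, requests)    # round robin
aware_hits = simulate(lambda i, doc, n: doc % n, requests)  # by document
print(blind_hits, aware_hits)
```

In this particular setup the round-robin policy never hits a cache, because every server sees all four documents in turn, whereas the content-aware policy misses only on the first request for each document (20 hits out of 24 requests).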
Web proxy caching
Basic idea: Sites install a separate proxy server that handles all outgoing requests. Proxies subsequently cache incoming documents.
Cache-consistency protocols:
Always verify validity by contacting the server.
Age-based consistency: $T_{expire} = \alpha (T_{cached} - T_{last\_modified}) + T_{cached}$
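The age-based rule is easy to check with a worked example. In the sketch below the function names and the default value of alpha are assumptions; times are plain numbers in seconds.

```python
def expiry_time(t_cached, t_last_modified, alpha=0.2):
    """T_expire = alpha * (T_cached - T_last_modified) + T_cached.

    The longer a document had gone unmodified when it was cached,
    the longer the proxy keeps trusting its copy.
    """
    return alpha * (t_cached - t_last_modified) + t_cached

def is_fresh(now, t_cached, t_last_modified, alpha=0.2):
    """True while the cached copy may be served without contacting the server."""
    return now < expiry_time(t_cached, t_last_modified, alpha)

# A document last modified at t = 0 and cached at t = 1000.
t_expire = expiry_time(t_cached=1000, t_last_modified=0)
print(t_expire)
print(is_fresh(1100, 1000, 0), is_fresh(1300, 1000, 0))
```

With alpha = 0.2, a document that was 1000 seconds old when cached is trusted for another 200 seconds.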
Web proxy caching
Basic idea (cont'd): Cooperative caching, in which you first check your neighbors on a cache miss.
[Figure: a client sends an HTTP Get request to a Web proxy with a cache. 1. Look in local cache. 2. Ask neighboring proxy caches. 3. Forward request to Web server.]
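The three-step lookup can be sketched as follows. The class and attribute names are assumptions, a dictionary stands in for the Web server, and the neighbors are queried by direct access to their caches rather than by a real inter-cache protocol.

```python
class Proxy:
    """A caching proxy that cooperates with neighboring proxies."""
    def __init__(self, name, origin):
        self.name = name
        self.cache = {}
        self.neighbors = []
        self.origin = origin  # dict of url -> document, standing in for the Web server

    def get(self, url):
        """Return (document, where it was found)."""
        # 1. Look in the local cache.
        if url in self.cache:
            return self.cache[url], "local cache"
        # 2. Ask neighboring proxy caches.
        for neighbor in self.neighbors:
            if url in neighbor.cache:
                self.cache[url] = neighbor.cache[url]
                return self.cache[url], f"neighbor {neighbor.name}"
        # 3. Forward the request to the Web server.
        document = self.origin[url]
        self.cache[url] = document
        return document, "web server"

origin = {"/a.html": "<html>A</html>"}
p1, p2 = Proxy("p1", origin), Proxy("p2", origin)
p1.neighbors, p2.neighbors = [p2], [p1]

print(p2.get("/a.html"))  # miss everywhere: fetched from the Web server
print(p1.get("/a.html"))  # miss locally: found at neighbor p2
print(p1.get("/a.html"))  # now a local hit
```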
Replication in Web hosting systems
Observation: By and large, Web hosting systems are adopting replication to increase performance. Much research is being done to improve their organization, which follows the lines of self-managing systems.
[Figure: a feedback control loop around the Web hosting system. The reference input is compared with the measured output; analysis (metric estimation, adjustment triggers) leads to corrections in replica placement, consistency enforcement, and request routing. The Web hosting system starts from an initial configuration, is subject to uncontrollable parameters (disturbance / noise), and produces the observed output.]
Handling flash crowds
Observation: We need dynamic adjustment to balance resource usage. Flash crowds introduce a serious problem.
[Figure: four panels, (a) to (d), covering periods of 2 days, 2 days, 6 days, and 2.5 days respectively.]
Server replication
Content Delivery Network: CDNs act as Web hosting services that replicate Web documents across the Internet, providing their customers with guarantees on high availability and performance (example: Akamai).
[Figure: a client, the origin server, a CDN server with a cache, the CDN DNS server, and the regular DNS system. 1. Get base document. 2. Document with refs to embedded documents. 3. DNS lookups. 4. Return IP address of client-best server. 5. Get embedded documents. 6. Get embedded documents (if not already cached). 7. Embedded documents.]
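The DNS-based redirection and the CDN server's cache can be sketched together. Everything concrete here is an assumption made for illustration: the round-trip-time table, the region labels, and the documentation-range IP addresses are invented, and "best" is taken to mean lowest measured round-trip time, which is only one possible criterion.

```python
# Hypothetical round-trip times (ms) from client regions to CDN servers.
RTT_MS = {
    "eu": {"192.0.2.10": 12, "198.51.100.20": 95},
    "us": {"192.0.2.10": 90, "198.51.100.20": 15},
}

def cdn_dns_resolve(client_region):
    """CDN DNS server: return the IP address of the best server for this client."""
    candidates = RTT_MS[client_region]
    return min(candidates, key=candidates.get)

class CdnServer:
    """CDN server: serve embedded documents, fetching from the origin on a miss."""
    def __init__(self, origin):
        self.origin = origin  # dict of path -> document, standing in for the origin server
        self.cache = {}
        self.origin_fetches = 0

    def get(self, path):
        if path not in self.cache:          # get from origin if not already cached
            self.cache[path] = self.origin[path]
            self.origin_fetches += 1
        return self.cache[path]

origin = {"/logo.png": b"image bytes"}
servers = {ip: CdnServer(origin) for ip in ("192.0.2.10", "198.51.100.20")}

ip = cdn_dns_resolve("eu")
servers[ip].get("/logo.png")
servers[ip].get("/logo.png")
print(ip, servers[ip].origin_fetches)
```

Only the embedded documents are redirected this way; the base document still comes from the origin server, as in the figure.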
Replication of Web applications
Observation: Replication becomes more difficult when dealing with databases and such. There is no single best solution.
Assumption: Updates are carried out at the origin server and propagated to edge servers.
Replication of Web applications: normal
[Figure: an edge-server side and an origin-server side exchanging queries and responses. Edge-server side: application logic, content-blind cache, database copy (full/partial data replication), content-aware cache, and schema (full schema replication / query templates). Origin-server side: application logic, schema, and the authoritative database.]
Replication of Web applications
Alternative solutions:
Full replication: high read/write ratio, often in combination with complex queries.
Partial replication: high read/write ratio, but in combination with simple queries.
Content-aware caching: Check for queries at the local database, and subscribe for invalidations at the server. Works well with range queries and complex queries.
Content-blind caching: Simply cache the result of previous queries. Works great with simple queries that address unique results (e.g., no range queries).
Question: What can be said about replication vs. performance?
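Content-blind caching is the simplest of these to sketch: the edge server stores query results keyed on the exact query text and parameters, without understanding what the query means. In the example below an in-memory SQLite database stands in for the authoritative database, and the class name, table, and the flush-everything invalidation are assumptions; real systems may invalidate far more selectively.

```python
import sqlite3

class ContentBlindCache:
    """Edge-side cache keyed on the exact query text and parameters."""
    def __init__(self, origin_db):
        self.origin_db = origin_db
        self.results = {}
        self.origin_queries = 0

    def query(self, sql, params=()):
        key = (sql, params)
        if key not in self.results:  # miss: forward the query to the origin
            self.origin_queries += 1
            self.results[key] = self.origin_db.execute(sql, params).fetchall()
        return self.results[key]

    def invalidate(self):
        """Coarse invalidation: drop everything when the origin reports an update."""
        self.results.clear()

origin_db = sqlite3.connect(":memory:")
origin_db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
origin_db.execute("INSERT INTO items VALUES (1, 'widget')")

cache = ContentBlindCache(origin_db)
sql = "SELECT name FROM items WHERE id = ?"
first = cache.query(sql, (1,))
second = cache.query(sql, (1,))  # served from the cache
print(first, cache.origin_queries)
```

Because the key is the literal query, two range queries that overlap but differ in their bounds share nothing in this cache, which is why the approach suits simple queries with unique results.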
Replication of Web applications: full/partial replication, content-aware caching, content-blind caching
[Figure, repeated on three slides, one per variant: the edge-server side (application logic, content-blind cache, database copy with full/partial data replication, content-aware cache, schema with full schema replication / query templates) and the origin-server side (application logic, schema, authoritative database), exchanging queries and responses.]