Designing, Scoping, and Configuring Scalable LAMP Infrastructure

1 Designing, Scoping, and Configuring Scalable LAMP Infrastructure Presented by

8 About me
- Founded Four Kitchens in 2006 while at UT Austin
- In 2008, launched Pressflow, which now powers the largest Drupal sites
- Worked with some of the largest sites in the world: Lifetime Digital, Mansueto Ventures, Wikipedia, The Internet Archive, and The Economist
- Engineered the LAMP stack, deployment tools, and management tools for Yale University, multiple NBC-Universal properties, and Drupal.org
- Engineered development workflows for Examiner.com
- Contributor to Drupal, Bazaar, Ubuntu, BCFG2, Varnish, and other open-source projects

14 Some assumptions
- You have more than one web server
- You have root access
- You deploy to Linux (though PHP on Windows is more sane than ever)
- Database and web servers occupy separate boxes
- Your application behaves more or less like Drupal, WordPress, or MediaWiki

15 Understanding Load Distribution

16 Predicting peak traffic
Traffic over the day can be highly irregular. To plan for peak loads, design as if all traffic were as heavy as the peak hour of load in a typical month and then plan for some growth.
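
The planning rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the sample hit counts, and the 25% growth allowance are assumptions, not figures from the presentation.

```python
def design_load(hourly_hits, growth=0.25):
    """Return the request rate (req/s) to design for.

    hourly_hits: hits recorded in each hour of a typical month.
    growth: fractional allowance for growth (0.25 is an arbitrary placeholder).
    """
    peak_hour_hits = max(hourly_hits)
    peak_rate = peak_hour_hits / 3600.0  # treat the peak hour as the steady state
    return peak_rate * (1.0 + growth)

# Hypothetical month: mostly quiet hours, one very busy hour.
sample = [36_000] * 700 + [360_000] + [72_000] * 19
print(round(design_load(sample), 1))  # 125.0
```

Averaging over the month would hide the busy hour entirely, which is why the sketch takes the maximum rather than the mean.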

34 Analyzing hit distribution
All hits: 100%
- Static content: 30%
- Dynamic pages: 70%
  - Authenticated: 20%
  - Anonymous: 50%
    - Human: 40%
    - Web crawler: 10%
      - No special treatment: 3%
      - Pay wall bypass: 7%
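
A short Python sketch can check that the breakdown is self-consistent. It assumes the slide is read as a tree in which static and dynamic hits split the total, authenticated and anonymous hits split the dynamic pages, and so on down to the two crawler categories; the data structure and the `share` helper are hypothetical.

```python
# Leaf categories and their share of all hits, in percent.
LEAVES = {
    ("static",): 30,
    ("dynamic", "authenticated"): 20,
    ("dynamic", "anonymous", "human"): 40,
    ("dynamic", "anonymous", "crawler", "no special treatment"): 3,
    ("dynamic", "anonymous", "crawler", "pay wall bypass"): 7,
}

def share(*prefix):
    """Total share of every leaf whose path starts with the given prefix."""
    return sum(pct for path, pct in LEAVES.items() if path[:len(prefix)] == prefix)

print(share())                                   # 100
print(share("dynamic"))                          # 70
print(share("dynamic", "anonymous"))             # 50
print(share("dynamic", "anonymous", "crawler"))  # 10
```

Keeping only the leaves and deriving every subtotal from them means the percentages cannot drift out of agreement as the analysis is updated.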

35 Throughput vs. Delivery Methods
[Figure: delivery methods compared by throughput on a scale running between 10 req/s and 5000 req/s (more dots = more throughput). Methods shown: Content Delivery Network, Reverse Proxy Cache, PHP + APC + memcached, PHP + APC, PHP (No APC). Legend: green = static, yellow = dynamic and cacheable, red = dynamic. Footnotes: (1) Delivered by Apache without PHP. (2) Some actually can do this.]

36 Objective
Deliver hits using the fastest, most scalable method available

50 Layering: Less Traffic at Each Step
[Diagram: traffic passes through successive layers, with less traffic reaching each one: CDN, DNS Round Robin, Load Balancer, Reverse Proxy Cache, Application Server, Database. Part of the chain is labeled "Your Datacenter".]

55 Offload from the master database
Your master database is the single greatest limitation on scalability.
[Diagram: the Application Server offloads work from the Master Database to a Memory Cache, a Slave Database, and Search.]

59 Tools to use
- Apache Solr or Sphinx for search
  - Solr can be fronted with Varnish or another proxy cache if queries are repetitive.
- Varnish, nginx, Squid, or Traffic Server for reverse proxy caching
- Any third-party service for CDN

67 Do the math
All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers.
[Diagram: Internal Traffic flows to the Load Balancer, then the Reverse Proxy Cache, then the Application Server.]
- What hit rate is each layer getting?
- How many servers share the load?
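
The two questions on this slide can be turned into a small calculation. The Python sketch below is an assumption-laden illustration: the layer names follow the slide, but the incoming rate, the hit rates, and the server counts are made-up inputs. Here "hit rate" means the fraction of requests a layer answers itself without passing them on.

```python
def traffic_per_layer(incoming_rps, layers):
    """layers: list of (name, hit_rate, servers), ordered from the outside in.

    Returns (name, total req/s reaching the layer, req/s per server) per layer.
    """
    rows = []
    rate = float(incoming_rps)
    for name, hit_rate, servers in layers:
        rows.append((name, rate, rate / servers))
        rate *= 1.0 - hit_rate  # only the misses travel on to the next layer
    return rows

# Hypothetical cluster: every figure below is a placeholder.
LAYERS = [
    ("load balancer", 0.0, 2),        # passes everything through
    ("reverse proxy cache", 0.8, 2),  # answers 80% of what it sees
    ("application server", 1.0, 4),   # answers whatever is left
]

for name, total, each in traffic_per_layer(1000, LAYERS):
    print(f"{name}: {total:.0f} req/s total, {each:.0f} req/s per server")
```

Note that the outer layers always carry the full non-CDN rate, even when most of it ends up at the application servers.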

74 Get a management/monitoring box
(maybe even two and have them specialize or be redundant)
[Diagram: a Management box alongside the Load Balancer, Reverse Proxy Cache, Application Server, and Database.]

75 Planning + Scoping

80 Infrastructure goals
- Redundancy: tolerate failure
- Scalability: engage more users
- Performance: ensure each user's experience is fast
- Manageability: stay sane in the process

86 Redundancy
- When one server fails, the website should be able to recover without taking too long.
- This requires at least N+1, putting a floor on system requirements even for small sites.
- How long can your site be down?
- Automatic versus manual failover
- Warning: over-automation can reduce uptime

92 Performance
- Find the sweet spot for hardware. This is the best price/performance point.
- Avoid overspending on any type of component
- Yet, avoid creating bottlenecks
- Swapping memory to disk is very dangerous
- Don't skimp on RAM

93 Relative importance
[Table: relative importance of Processors/Cores, Memory, and Disk Speed for the Reverse Proxy Cache, Web Server, Database Server, and Monitoring roles.]

98 All of your servers
- 64-bit: no excuse to use anything less in 2010
- RHEL/CentOS and Ubuntu have the broadest adoption for large-scale LAMP
  - But pick one, and stick with it for development, staging, and production
- Some disk redundancy: rebuilding a server is time-consuming unless you're very automated

105 Reverse proxy caches
- Varnish and nginx have modern architecture and broad adoption
  - Sites often front Varnish with nginx for gzip and/or SSL
- Squid and Traffic Server are clunky but reliable alternatives

  CPU: Save Your Money
+ Memory: 1 GB base system + 3 GB for caching
+ Disk: Slow + Small + Redundant
= 5000 req/s

114 Web servers
- Apache mod_php + memcached
- FastCGI is a bad idea
  - Memory improvements are redundant w/ Varnish
  - Higher latency + less efficient with the APC opcode cache
- Check the memory your app takes per process
- Tune MaxClients to around 25 × cores

  CPU: Max out cores (but prefer fast cores to density)
+ Memory: 1 GB base system + 1 GB memcached + (25 × cores × per-process app memory)
+ Disk: Slow + Small + Redundant
= 100 req/s
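
The memory rule on this slide is simple arithmetic, sketched here in Python. It reads the slide's "25 cores" as 25 Apache processes per core; the function name, the 8-core box, and the 64 MB per-process figure are assumptions for illustration only.

```python
def web_server_sizing(cores, per_process_mb, clients_per_core=25,
                      base_mb=1024, memcached_mb=1024):
    """Return (MaxClients, memory in MB) for one web server."""
    max_clients = clients_per_core * cores
    memory_mb = base_mb + memcached_mb + max_clients * per_process_mb
    return max_clients, memory_mb

# Hypothetical box: 8 cores, 64 MB per Apache/PHP process.
print(web_server_sizing(cores=8, per_process_mb=64))  # (200, 14848)
```

Measuring the real per-process footprint of the application matters more than any other input here, because it is multiplied by MaxClients.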

123 Database servers
- Insist on MySQL 5.1+ and InnoDB
- Consider Percona builds and (eventually) MariaDB
- Every Apache process generally needs at least one connection available, and leave some headroom
- Tune the InnoDB buffer pool to at least half of RAM

  CPU: No more than 8-12 cores
+ Memory: As much as you can afford (even RAM not used by MySQL caches disk content)
+ Disk: Fast + Large + Redundant
= 3000 queries/s
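
As a rough sketch, the connection and buffer pool guidance can be turned into starting values for the MySQL settings `max_connections` and `innodb_buffer_pool_size`. The helper below is hypothetical, and the 10% headroom, the four web servers, and the 32 GB of RAM are assumed inputs rather than recommendations from the presentation.

```python
def mysql_settings(web_servers, max_clients, ram_gb, headroom_percent=10):
    """Suggest starting values from web tier size and database RAM."""
    apache_processes = web_servers * max_clients
    # Integer arithmetic avoids floating-point surprises when rounding up.
    headroom = (apache_processes * headroom_percent + 99) // 100
    return {
        "max_connections": apache_processes + headroom,
        # Half of RAM, rounded up, is the floor suggested on the slide.
        "innodb_buffer_pool_size": f"{(ram_gb + 1) // 2}G",
    }

# Hypothetical cluster: 4 web servers with MaxClients 200, 32 GB database box.
for name, value in mysql_settings(web_servers=4, max_clients=200, ram_gb=32).items():
    print(f"{name} = {value}")
```

The output is formatted like lines of a MySQL configuration file, but the values are only a starting point to be checked against real load.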

133 Management server
- Nagios: service outage monitoring
- Cacti: trend monitoring
- Hudson: builds, deployment, and automation
- Yum/Apt repo: cluster package distribution
- Puppet/BCFG2/Chef: configuration management

  CPU: Save Your Money
+ Memory: Save Your Money
+ Disk: Slow + Large + Redundant
= good enough

138 Assembling the numbers
- Start with an architecture providing redundancy.
  - Two servers, each running the whole stack
- Increase the number of proxy caches based on anonymous and search engine traffic.
- Increase the number of web servers based on authenticated traffic.
- Databases are harder to predict, but large sites should run them on at least two separate boxes with replication.
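
One way to assemble the numbers is sketched below in Python. It combines the per-server throughput figures quoted earlier in the deck (5000 req/s per reverse proxy cache, 100 req/s per web server) with N+1 redundancy and a floor of two servers. The traffic figures and the helper name are assumptions for illustration.

```python
import math

def servers_needed(peak_rps, per_server_rps, minimum=2):
    """Servers to carry the peak load, plus one spare (N+1), never fewer than `minimum`."""
    carrying = math.ceil(peak_rps / per_server_rps)
    return max(minimum, carrying + 1)

# Hypothetical peak rates for the two kinds of traffic.
anonymous_and_crawler_rps = 6000
authenticated_rps = 450

print(servers_needed(anonymous_and_crawler_rps, 5000))  # 3 proxy caches
print(servers_needed(authenticated_rps, 100))           # 6 web servers
print(servers_needed(10, 5000))                         # 2, the redundancy floor
```

For small sites the floor dominates, which is why redundancy rather than load sets the minimum hardware budget.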

139 Extreme measures for performance and scalability

143 When caching and search offloading aren't enough
- Some sites have intense custom page needs
  - High proportion of authenticated users
  - Lots of targeted content for anonymous users
- Too much data to process real-time on an RDBMS
- Data is so volatile that the cost of maintaining standard caches outweighs the overhead of regeneration

150 Non-relational/NoSQL tools
- Most web applications can run well on less-than-ACID persistence engines
- In some cases, like MongoDB, easier to use than SQL in addition to being higher performance
  - Interested? You've already missed the tutorial.
- In other cases, like Cassandra, considerably harder to use than SQL but massively scalable
- Current Erlang-based systems are neat but slow
- Many require a special PHP extension, at least for ideal performance

154 Offline processing
- Gearman
  - Primarily asynchronous job manager
- Hadoop
  - MapReduce framework
- Traditional message queues
  - ActiveMQ + Stomp is easy from PHP
  - Allows you to build your own job manager

162 Edge-side includes
Blocks of HTML are integrated into the page at the edge layer.
- Page as generated by the application: <html> <body> <esi:include href= /> </body> </html>
- Block fetched by the ESI processor (Varnish, Akamai, other): <div> My block HTML. </div>
- Page delivered to the client: <html> <body> <div> My block HTML. </div> </body> </html>
- Non-primary page content often occupies >50% of PHP execution time.
- Decouples block and page cache lifetimes
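
What an ESI processor does with an include tag can be imitated in a few lines of Python. This is a toy sketch, not how Varnish or Akamai implement it: the regular expression, the `/block/1` URL, and the in-memory fragment store are all assumptions. The ESI 1.0 specification names the attribute `src`; the sketch also accepts `href` because the slide uses it.

```python
import re

# Matches a self-closing include tag and captures the fragment URL.
ESI_INCLUDE = re.compile(r'<esi:include\s+(?:src|href)="([^"]*)"\s*/>')

def process_esi(page, fetch_fragment):
    """Replace every include tag with the fragment returned by fetch_fragment(url)."""
    return ESI_INCLUDE.sub(lambda match: fetch_fragment(match.group(1)), page)

# Hypothetical fragment store standing in for a cached backend response.
FRAGMENTS = {"/block/1": "<div> My block HTML. </div>"}

page = '<html> <body> <esi:include src="/block/1" /> </body> </html>'
print(process_esi(page, FRAGMENTS.__getitem__))
# <html> <body> <div> My block HTML. </div> </body> </html>
```

Because the page shell and the block are fetched separately, each can be cached with its own lifetime, which is the decoupling the slide refers to.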

167 HipHop PHP
- Compiles PHP to a C++-based binary
- Integrated HTTP server
- Supports a subset of PHP and extensions
- Requires an organizational commitment to building, testing, and deploying on HipHop
- Scott MacVicar has a presentation on HipHop later today at 16:00.

168 Cluster Problems Credits

174 Server failure
- Load balancers can remove broken or overloaded reverse proxy caches.
- Reverse proxy caches like Varnish can automatically use only functional application servers.
- Memcached clients automatically handle failure.
- Virtual service IP management tools like heartbeat2 can manage which MySQL servers receive connections to automate failover.
- Conclusion: Each layer intelligently monitors and uses the servers beneath it.

179 Cluster coherency
- Systems that run properly on single boxes may lose coherency when run on a networked cluster.
- Some caches, like APC's object cache, have no ability to handle network-level coherency. (APC's opcode cache is safe to use on clusters, though.)
- memcached, if misconfigured, can hash values inconsistently across the cluster, resulting in different servers using different memcached instances for the same keys.
- Session coherency issues can be helped with load balancer affinity or storage in memcached
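
The memcached misconfiguration can be demonstrated with a small Python sketch. It assumes a naive client that hashes the key and takes it modulo the server list; real memcached clients offer several distribution schemes, so treat the hashing here, the host names, and the key names as illustrative only.

```python
import zlib

def pick_server(key, servers):
    """Naive distribution: hash the key and index into the configured server list."""
    return servers[zlib.crc32(key.encode()) % len(servers)]

# Two web servers configured with the same memcached instances in a different order.
web1 = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]
web2 = ["cache-b:11211", "cache-a:11211", "cache-c:11211"]

keys = [f"session:{n}" for n in range(100)]
mismatched = [k for k in keys if pick_server(k, web1) != pick_server(k, web2)]
print(f"{len(mismatched)} of {len(keys)} keys land on different instances")
```

Any key that maps to a different instance on the second web server is effectively invisible to it, so a value written by one server looks like a cache miss on the other. Distributing one identical server list to every web server avoids this.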

187 Cache regeneration races
- Downside to network cache coherency: synchronized expiration
- Avoiding the race requires a locking framework (like ZooKeeper)
[Timeline: the old cached item is served until expiration; between expiration and the arrival of the new cached item, all servers are regenerating the item.]
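
The locking idea can be sketched inside a single Python process, with threads standing in for web servers and `threading.Lock` standing in for a distributed lock service such as ZooKeeper. The class and function names are hypothetical, and expiration itself is not modeled: the sketch only shows that one caller regenerates a missing item while the others wait for the result.

```python
import threading

class LockedCache:
    """Cache in which only one caller regenerates a missing item at a time."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()

    def get(self, key, regenerate):
        if key in self._values:
            return self._values[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Another caller may have filled the cache while we waited.
            if key not in self._values:
                self._values[key] = regenerate()
            return self._values[key]

calls = []

def rebuild_front_page():
    calls.append(1)  # count how many times regeneration actually runs
    return "<html>front page</html>"

cache = LockedCache()
threads = [threading.Thread(target=cache.get, args=("front-page", rebuild_front_page))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # 1
```

Without the per-key lock, every thread that saw the missing item would rebuild it, which is the stampede shown on the slide's timeline.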

192 Broken replication
- MySQL slave servers get out of sync and fall further behind
- No (sane) method of automated recovery
- Only solvable with good monitoring and recovery procedures
- Can automate DB slave blacklisting from use, but requires cluster management tools
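
Slave blacklisting can be sketched as a filter over replication lag. MySQL reports lag as `Seconds_Behind_Master` in `SHOW SLAVE STATUS`, and the value is NULL when replication is not running. The Python below assumes those values have already been collected by monitoring; the function name, host names, and 30-second threshold are hypothetical.

```python
def usable_slaves(lag_by_slave, max_lag_seconds=30):
    """Return the slaves that are replicating and not too far behind.

    lag_by_slave maps a slave name to its Seconds_Behind_Master value,
    or None when replication is stopped or broken.
    """
    return sorted(
        name
        for name, lag in lag_by_slave.items()
        if lag is not None and lag <= max_lag_seconds
    )

# Hypothetical readings gathered by a monitoring job.
readings = {"db2": 0, "db3": 400, "db4": None}
print(usable_slaves(readings))  # ['db2']
```

This only takes lagging slaves out of rotation. Bringing them back in sync still needs the monitoring and recovery procedures the slide calls for.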

All content in this presentation, except

194 DrupalCamp Stockholm Presentation Ended Here

195 Managing the Cluster Credits

196 The problem [Diagram: software and configuration delivered to five application servers.] Objectives: fast, atomic deployment and rollback; minimize single points of failure and contention; restart services; integrate with version control systems.

197 Manual updates and deployment [Diagram: five application servers, each updated by a separate human.] Why not: slow deployment, non-atomic/difficult rollbacks.

198 Shared storage [Diagram: five application servers sharing one NFS store.] Why not: single point of contention and failure.

199 rsync [Diagram: five application servers synchronized with rsync.] Why not: non-atomic, does not manage services.

200 Capistrano [Diagram: five application servers deployed with Capistrano.] Capistrano provides near-atomic deployment, service restarts, automated rollback, test automation, and version control integration (tagged releases).
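The near-atomic switch can be sketched as a symlink swap over a directory of releases. This is an illustration of the idea only, not Capistrano's implementation; the function name `activate_release`, the release names, and the temporary deploy root are assumptions, and the example assumes a POSIX filesystem.

```python
import os
import tempfile


def activate_release(deploy_root, release_name):
    """Point deploy_root/current at releases/<release_name> in one rename."""
    target = os.path.join("releases", release_name)
    current = os.path.join(deploy_root, "current")
    staging_link = current + ".new"
    if os.path.lexists(staging_link):
        os.remove(staging_link)
    # Build the new link beside the old one, then rename it into place.
    os.symlink(target, staging_link)
    os.replace(staging_link, current)  # a rename is atomic on POSIX


# Throwaway deploy root with two hypothetical releases.
root = tempfile.mkdtemp()
for name in ("20100101", "20100102"):
    os.makedirs(os.path.join(root, "releases", name))

activate_release(root, "20100101")
activate_release(root, "20100102")                      # deploy
live = os.readlink(os.path.join(root, "current"))
activate_release(root, "20100101")                      # rollback
rolled_back = os.readlink(os.path.join(root, "current"))
```

Because each release is fully copied before the link moves, requests see either the old tree or the new one, and rollback is the same operation pointed at the previous release. Across several servers the switch is only near-atomic, since each box renames its own link.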

201 Multistage deployment Deployments can be staged: cap staging deploy, then cap production deploy. [Diagram: development integration, staging, and the five application servers, each deployed with Capistrano.]

202 But your application isn't the only thing to manage.

203 Beneath the application [Diagram: cluster-level configuration spanning the reverse proxy cache, the database, and five application servers.] Cluster management applies to package management, updates, and software configuration. cfengine and bcfg2 are popular cluster-level system configuration tools.

204 System configuration management Deploys and updates packages, cluster-wide or selectively. Manages arbitrary text configuration files. Analyzes inconsistent configurations (and converges them). Manages device classes (app. servers, database servers, etc.). Allows confident configuration testing on a staging server.

205 All on the management box [Diagram: the management box hosts development integration, staging, deployment tools, and monitoring.]

206 Monitoring

207 Types of monitoring Failure: analyzing downtime, viewing failover, troubleshooting, notification. Capacity/Load: analyzing trends, predicting load, checking results of configuration and software changes.

208 Everyone needs both. Credits

209 What to use Failure/Uptime: Nagios, Hyperic. Capacity/Load: Cacti, Munin.

210 Nagios Highly recommended. Used by Four Kitchens and Tag1 Consulting for client work, and by Drupal.org, Wikipedia, etc. Easy to install on CentOS 5 using EPEL packages. Easy to install NRPE agents to monitor diverse services. Can notify administrators on failure. We use this on Drupal.org.
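A custom check for Nagios (run locally or through an NRPE agent) is a small program that prints one status line and exits with 0 for OK, 1 for WARNING, or 2 for CRITICAL. The sketch below keeps the threshold logic in a function so it can be tested; the thresholds and the name `check_load` are assumptions, and `main` is defined but not called here.

```python
import os
import sys

# Exit codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def check_load(load, warn=4.0, crit=8.0):
    """Map a load average to a (status code, status line) pair."""
    if load >= crit:
        return CRITICAL, f"LOAD CRITICAL - load average {load:.2f}"
    if load >= warn:
        return WARNING, f"LOAD WARNING - load average {load:.2f}"
    return OK, f"LOAD OK - load average {load:.2f}"


def main():
    # One-minute load average of the local machine (Unix only).
    status, message = check_load(os.getloadavg()[0])
    print(message)
    sys.exit(status)
```

Calling `main()` from a script installed on each application server would let Nagios notify administrators when the warning or critical threshold is crossed.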

211 Cacti Highly annoying to set up. One instance generally collects all statistics. (No agents on the systems being monitored.) Provides flexible graphs that can be customized on demand.

212 Munin Fairly easy to set up. One master instance generally collects all statistics, polling a lightweight munin-node agent on each system being monitored. Provides static graphs that cannot be customized.

213 Pressflow Make Drupal sites scale by upgrading core with a compatible, powerful replacement.

214 Common large-site issues Drupal core requires patching to effectively support the advanced scalability techniques discussed here. Patches often conflict and have to be reapplied with each Drupal upgrade. The original patches are often unmaintained. Sites stagnate, running old, insecure versions of Drupal core because updating is too difficult.

215 What is Pressflow? Pressflow is a derivative of Drupal core that integrates the most popular performance and scalability enhancements. Pressflow is completely compatible with existing Drupal 5 and 6 modules, both standard and custom. Pressflow installs as a drop-in replacement for standard Drupal. Pressflow is free as long as the matching version of Drupal is also supported by the community.

216 What are the enhancements? Reverse proxy support. Database replication support. Lower database and session management load. More efficient queries. Testing and optimization by Four Kitchens with standard high-performance software and hardware configuration. Industry-leading scalability support by Four Kitchens and Tag1 Consulting.

217 Four Kitchens + Tag1 Provide the development, support, scalability, and performance services behind Pressflow. Comprise most members of the Drupal.org infrastructure team. Have the most experience scaling Drupal sites of all sizes and all types.

218 Ready to scale? Learn more about Pressflow: pick up pamphlets in the lobby, or request Pressflow releases at fourkitchens.com. Get the help you need to make it happen: talk to me (David) or Todd here at DrupalCamp.
