Designing, Scoping, and Configuring Scalable LAMP Infrastructure


Designing, Scoping, and Configuring Scalable LAMP Infrastructure Presented 2010-05-19 by

About me
- Founded Four Kitchens in 2006 while at UT Austin
- In 2008, launched Pressflow, which now powers the largest Drupal sites
- Worked with some of the largest sites in the world: Lifetime Digital, Mansueto Ventures, Wikipedia, The Internet Archive, and The Economist
- Engineered the LAMP stack, deployment tools, and management tools for Yale University, multiple NBC-Universal properties, and Drupal.org
- Engineered development workflows for Examiner.com
- Contributor to Drupal, Bazaar, Ubuntu, BCFG2, Varnish, and other open-source projects

Some assumptions
- You have more than one web server
- You have root access
- You deploy to Linux (though PHP on Windows is more sane than ever)
- Database and web servers occupy separate boxes
- Your application behaves more or less like Drupal, WordPress, or MediaWiki

Understanding Load Distribution

Predicting peak traffic
Traffic over the day can be highly irregular. To plan for peak loads, design as if all traffic were as heavy as the peak hour of load in a typical month, and then plan for some growth.
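The sizing rule above can be sketched in a few lines of Python. The hourly hit counts and the 50% growth allowance are made-up illustration values, not figures from the talk, and the function name is hypothetical.

```python
def design_load_rps(hourly_hits, growth=0.5):
    """Return the req/s to design for: the busiest hour's average rate plus growth.

    hourly_hits: hit counts for each hour of a typical month (hypothetical data).
    growth: fractional allowance for growth (assumed value, not from the talk).
    """
    peak_hour_hits = max(hourly_hits)
    peak_rps = peak_hour_hits / 3600.0  # average req/s during the peak hour
    return peak_rps * (1.0 + growth)


# Hypothetical month: mostly quiet hours with one 720,000-hit spike.
sample = [90_000] * 700 + [720_000] + [180_000] * 19
print(round(design_load_rps(sample)))  # 720000 / 3600 = 200 req/s, plus 50% growth
```

Averaging over the whole peak hour still hides bursts within that hour, so the growth allowance also serves as burst headroom in this sketch.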

Analyzing hit distribution
All hits: 100% (percentages below are shares of all hits)
- Static Content: 30%
- Dynamic Pages: 70%
  - Authenticated: 20%
  - Anonymous: 50%
    - Human: 40%
    - Web Crawler: 10%
      - No Special Treatment: 3%
      - Pay Wall Bypass: 7%

Throughput vs. Delivery Methods
[Figure: relative throughput of each delivery method on a scale from 10 req/s to 5000 req/s (more dots = more throughput), for three content classes: Green (Static), Yellow (Dynamic, Cacheable), and Red (Dynamic). Delivery methods shown: Content Delivery Network, Reverse Proxy Cache, PHP + APC + memcached, PHP + APC, PHP (No APC). Footnotes: (1) Delivered by Apache without PHP. (2) Some actually can do this.]

Objective
Deliver hits using the fastest, most scalable method available.

Layering: Less Traffic at Each Step
[Diagram: traffic flows through successive layers, each receiving less than the one before. Labels, in build order: Traffic, CDN, Your Datacenter, DNS Round Robin, Load Balancer, Reverse Proxy Cache, Application Server, Database.]

Offload from the master database
Your master database is the single greatest limitation on scalability.
[Diagram: Application Server connected to Master Database, Memory Cache, Slave Database, and Search.]

Tools to use
- Apache Solr or Sphinx for search
  - Solr can be fronted with Varnish or another proxy cache if queries are repetitive.
- Varnish, nginx, Squid, or Traffic Server for reverse proxy caching
- Any third-party service for CDN

Do the math
All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers.
[Diagram: Internal Traffic, then Load Balancer, then Reverse Proxy Cache, then Application Server.]
- What hit rate is each layer getting?
- How many servers share the load?
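The two questions above translate directly into arithmetic. The following Python sketch walks the layers in order and reports the load on each; every hit rate, server count, and the 1000 req/s input are assumptions chosen for illustration, and the function name is hypothetical.

```python
def traffic_per_layer(incoming_rps, layers):
    """Walk the layers in order and report the load each one sees.

    layers: list of (name, hit_rate, server_count); hit_rate is the fraction of
    requests that layer answers itself, so only the misses reach the next layer.
    Returns a list of (name, total_rps, rps_per_server).
    """
    report = []
    rps = incoming_rps
    for name, hit_rate, servers in layers:
        report.append((name, rps, rps / servers))
        rps = rps * (1.0 - hit_rate)  # misses pass through to the next layer
    return report


# All numbers below are assumptions chosen for illustration.
layers = [
    ("load balancer", 0.0, 2),        # answers nothing itself, passes everything on
    ("reverse proxy cache", 0.8, 2),  # assumed 80% cache hit rate
    ("application server", 1.0, 4),   # answers whatever is left
]
for name, total, per_server in traffic_per_layer(1000.0, layers):
    print(f"{name}: {total:.0f} req/s total, {per_server:.0f} req/s per server")
```

Note that the load balancer and the proxy caches both see the full incoming rate, which is the point the slide makes about the initial layers.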

Get a management/monitoring box
(maybe even two and have them specialize or be redundant)
[Diagram: Management server alongside Load Balancer, Reverse Proxy Cache, Application Server, and Database.]

Planning + Scoping

Infrastructure goals
- Redundancy: tolerate failure
- Scalability: engage more users
- Performance: ensure each user's experience is fast
- Manageability: stay sane in the process

Redundancy
- When one server fails, the website should be able to recover without taking too long.
- This requires at least N+1, putting a floor on system requirements even for small sites.
- How long can your site be down?
- Automatic versus manual failover
  - Warning: over-automation can reduce uptime

Performance
- Find the sweet spot for hardware. This is the best price/performance point.
  - Avoid overspending on any type of component
  - Yet, avoid creating bottlenecks
- Swapping memory to disk is very dangerous
  - Don't skimp on RAM

Relative importance
[Table: columns Processors/Cores, Memory, Disk Speed; rows Reverse Proxy Cache, Web Server, Database Server, Monitoring.]

All of your servers
- 64-bit: no excuse to use anything less in 2010
- RHEL/CentOS and Ubuntu have the broadest adoption for large-scale LAMP
  - But pick one, and stick with it for development, staging, and production
- Some disk redundancy: rebuilding a server is time-consuming unless you're very automated

Reverse proxy caches
- Varnish and nginx have modern architecture and broad adoption
  - Sites often front Varnish with nginx for gzip and/or SSL
- Squid and Traffic Server are clunky but reliable alternatives
CPU: Save your money
+ Memory: 1 GB base system + 3 GB for caching
+ Disk: Slow + Small + Redundant
= 5000 req/s

Web servers
- Apache 2.2 + mod_php + memcached
- FastCGI is a bad idea
  - Memory improvements are redundant w/ Varnish
  - Higher latency + less efficient with the APC opcode cache
- Check the memory your app takes per process
- Tune MaxClients to around 25 × cores
CPU: Max out cores (but prefer fast cores to density)
+ Memory: 1 GB base system + 1 GB memcached + 25 × cores × per-process app memory
+ Disk: Slow + Small + Redundant
= 100 req/s
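The memory formula on this slide is easy to turn into a calculator. The sketch below reads the slide's rule of thumb as 25 × cores for MaxClients (an interpretation of the transcript, where the multiplication sign appears to be lost); the 8-core box and 40 MB per process are hypothetical inputs, and the function name is invented for illustration.

```python
def size_web_server(cores, per_process_mb, workers_per_core=25,
                    base_mb=1024, memcached_mb=1024):
    """Return (max_clients, ram_mb) for one Apache + mod_php box.

    workers_per_core=25 follows the slide's rule of thumb as interpreted above;
    base_mb and memcached_mb are the 1 GB allowances from the slide.
    """
    max_clients = workers_per_core * cores
    ram_mb = base_mb + memcached_mb + max_clients * per_process_mb
    return max_clients, ram_mb


# Hypothetical box: 8 cores, 40 MB resident per Apache/PHP process.
max_clients, ram_mb = size_web_server(cores=8, per_process_mb=40)
print(f"MaxClients {max_clients}, about {ram_mb / 1024:.1f} GB RAM")
```

Measuring per-process memory on the real application matters more than the constants here, since that term dominates the total.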

Database servers
- Insist on MySQL 5.1+ and InnoDB
- Consider Percona builds and (eventually) MariaDB
- Every Apache process generally needs at least one connection available, and leave some headroom
- Tune the InnoDB buffer pool to at least half of RAM
CPU: No more than 8-12 cores
+ Memory: As much as you can afford (even RAM not used by MySQL caches disk content)
+ Disk: Fast + Large + Redundant
= 3000 queries/s
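As a rough sketch of the two tuning rules above, the following Python computes a buffer pool floor and a connection limit. The 32 GB of RAM, four web servers, and 10% headroom are assumptions for illustration; `innodb_buffer_pool_size` and `max_connections` are the MySQL settings the results would feed, and the function name is hypothetical.

```python
def size_database(ram_gb, web_servers, max_clients_per_web, headroom_pct=10):
    """Return (buffer_pool_gb, max_connections) from the rules of thumb above.

    headroom_pct: spare connections as a percentage (assumed value).
    """
    buffer_pool_gb = ram_gb / 2  # "at least half of RAM": treat this as a floor
    needed = web_servers * max_clients_per_web  # one connection per Apache process
    max_connections = needed + needed * headroom_pct // 100
    return buffer_pool_gb, max_connections


pool_gb, conns = size_database(ram_gb=32, web_servers=4, max_clients_per_web=200)
print(f"innodb_buffer_pool_size >= {pool_gb:.0f}G, max_connections = {conns}")
```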

Management server
- Nagios: service outage monitoring
- Cacti: trend monitoring
- Hudson: builds, deployment, and automation
- Yum/Apt repo: cluster package distribution
- Puppet/BCFG2/Chef: configuration management
CPU: Save your money
+ Memory: Save your money
+ Disk: Slow + Large + Redundant
= good enough

Assembling the numbers
- Start with an architecture providing redundancy.
  - Two servers, each running the whole stack
- Increase the number of proxy caches based on anonymous and search engine traffic.
- Increase the number of web servers based on authenticated traffic.
- Databases are harder to predict, but large sites should run them on at least two separate boxes with replication.
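One way to put these steps together is sketched below in Python. The per-server defaults (5000 req/s per proxy cache, 100 req/s per web server) are the figures quoted earlier in the talk; the 2000 req/s peak, the single spare box per tier, and the simplification that only authenticated hits reach the web servers are assumptions, and the function names are hypothetical.

```python
import math


def servers_needed(load_rps, per_server_rps, spare=1):
    """Servers required to carry load_rps, plus spare boxes for N+1 redundancy."""
    return max(1, math.ceil(load_rps / per_server_rps)) + spare


def plan_cluster(peak_rps, anonymous_share, crawler_share, authenticated_share,
                 proxy_rps=5000, web_rps=100):
    """Estimate proxy-cache and web-server counts from peak traffic shares.

    Shares are fractions of all hits; anonymous_share covers human visitors only.
    Simplifying assumption: authenticated hits are the only ones that reach the
    web servers; anonymous and crawler hits are served by the proxy caches.
    """
    proxy_load = peak_rps * (anonymous_share + crawler_share)
    web_load = peak_rps * authenticated_share
    return {
        "proxy_caches": servers_needed(proxy_load, proxy_rps),
        "web_servers": servers_needed(web_load, web_rps),
    }


# The 2000 req/s peak is a made-up figure; shares follow the hit distribution slide.
print(plan_cluster(2000, anonymous_share=0.4, crawler_share=0.1,
                   authenticated_share=0.2))
```

In practice cache misses on anonymous traffic also reach the web servers, so the web tier estimate from this sketch is a lower bound.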

Extreme measures for performance and scalability

When caching and search offloading aren't enough
- Some sites have intense custom page needs
  - High proportion of authenticated users
  - Lots of targeted content for anonymous users
- Too much data to process in real time on an RDBMS
- Data is so volatile that the cost of maintaining standard caches outweighs the overhead of regeneration

Non-relational/NoSQL tools
- Most web applications can run well on less-than-ACID persistence engines
- In some cases, like MongoDB, easier to use than SQL in addition to being higher performance
  - Interested? You've already missed the tutorial.
- In other cases, like Cassandra, considerably harder to use than SQL but massively scalable
- Current Erlang-based systems are neat but slow
- Many require a special PHP extension, at least for ideal performance

Offline processing
- Gearman: primarily an asynchronous job manager
- Hadoop: MapReduce framework
- Traditional message queues
  - ActiveMQ + Stomp is easy from PHP
  - Allows you to build your own job manager
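The idea shared by these tools is that the request handler enqueues slow work and returns immediately, while a separate worker processes it later. The Python sketch below shows that pattern in-process with the standard library only; it stands in for Gearman or a message queue and uses none of their APIs, and the job name and handler are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()
results = []


def worker():
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        kind, payload = job
        results.append(f"{kind} done for {payload}")  # the slow work would go here
        jobs.task_done()


def handle_request(user):
    """Request handler: enqueue the slow work and return right away."""
    jobs.put(("send_welcome_email", user))
    return f"accepted {user}"


thread = threading.Thread(target=worker)
thread.start()
print(handle_request("alice"))
jobs.put(None)  # tell the worker to stop
thread.join()
print(results)
```

A real deployment would run the worker in a separate process or on a separate machine, which is what a job server or message broker provides.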

Edge-side includes

Page as generated by the application:
<html> <body> <esi:include src="http://drupal.org/block/views/3" /> </body> </html>

Block fetched by the ESI processor (Varnish, Akamai, other):
<div> My block HTML. </div>

Page as delivered by the ESI processor:
<html> <body> <div> My block HTML. </div> </body> </html>

- Blocks of HTML are integrated into the page at the edge layer.
- Non-primary page content often occupies >50% of PHP execution time.
- Decouples block and page cache lifetimes
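What the edge layer does with an include tag can be sketched as a simple substitution. This is a toy processor, not Varnish's or Akamai's implementation: it assumes the include tag carries its URL in a `src` attribute, and the `FRAGMENTS` dictionary is a hypothetical stand-in for HTTP fetches back to the application.

```python
import re

# Hypothetical fragment store standing in for HTTP fetches to the origin.
FRAGMENTS = {
    "http://drupal.org/block/views/3": "<div> My block HTML. </div>",
}

ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


def process_esi(page, fetch=FRAGMENTS.get):
    """Replace each esi:include tag with the fragment its src points to."""
    def substitute(match):
        fragment = fetch(match.group(1))
        # Emit nothing when a fragment is unavailable.
        return fragment if fragment is not None else ""
    return ESI_INCLUDE.sub(substitute, page)


page = '<html> <body> <esi:include src="http://drupal.org/block/views/3" /> </body> </html>'
assembled = process_esi(page)
```

Because the page shell and the block are fetched separately, each can be cached with its own lifetime.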

HipHop PHP
- Compiles PHP to a C++-based binary
- Integrated HTTP server
- Supports a subset of PHP and extensions
- Requires an organizational commitment to building, testing, and deploying on HipHop
- Scott MacVicar has a presentation on HipHop later today at 16:00.

Cluster Problems

Server failure
- Load balancers can remove broken or overloaded reverse proxy caches from rotation.
- Reverse proxy caches like Varnish can automatically use only functional application servers.
- Memcached clients automatically handle failure.
- Virtual service IP management tools like heartbeat2 can manage which MySQL servers receive connections to automate failover.
- Conclusion: Each layer intelligently monitors and uses the servers beneath it.
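The pattern each layer follows, probing the servers beneath it and routing only to those that respond, can be sketched as below. This is a simplified illustration, not how any particular load balancer or Varnish director is implemented; the server names and the `status` table are hypothetical.

```python
def healthy(backends, probe):
    """Return the backends whose health probe succeeds."""
    return [b for b in backends if probe(b)]


def pick_backend(backends, probe, request_id):
    """Spread requests over healthy backends; fail loudly when none remain."""
    pool = healthy(backends, probe)
    if not pool:
        raise RuntimeError("no healthy backends")
    return pool[request_id % len(pool)]


# Hypothetical state: app2 is down.
status = {"app1": True, "app2": False, "app3": True}
app_servers = ["app1", "app2", "app3"]
chosen = [pick_backend(app_servers, status.get, i) for i in range(4)]
```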

Cluster coherency
- Systems that run properly on single boxes may lose coherency when run on a networked cluster.
- Some caches, like APC's object cache, have no ability to handle network-level coherency. (APC's opcode cache is safe to use on clusters, though.)
- memcached, if misconfigured, can hash values inconsistently across the cluster, resulting in different servers using different memcached instances for the same keys.
- Session coherency issues can be helped with load balancer affinity or storage in memcached.
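One way the memcached misconfiguration arises is two web servers listing the same cache hosts in a different order. The sketch below uses a naive hash-modulo distribution to show the effect; the host names are hypothetical, and real client libraries offer several distribution strategies, so treat this as an illustration of the failure mode rather than any client's exact algorithm.

```python
import zlib


def server_for(key, servers):
    """Naive modulo distribution: hash the key, index into the server list."""
    return servers[zlib.crc32(key.encode()) % len(servers)]


web1_servers = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]
web2_servers = ["cache-b:11211", "cache-a:11211", "cache-c:11211"]  # same hosts, different order

keys = [f"session:{n}" for n in range(100)]

# Keys that the two web servers would look up on different cache hosts.
mismatched = [k for k in keys if server_for(k, web1_servers) != server_for(k, web2_servers)]

# Using an identical (here: sorted) list everywhere removes the disagreement.
fixed = [k for k in keys
         if server_for(k, sorted(web1_servers)) != server_for(k, sorted(web2_servers))]
```

Deploying the cache server list from one shared configuration source keeps the order identical on every web server.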

Cache regeneration races
- Downside to network cache coherency: synchronized expiration
- Requires a locking framework (like ZooKeeper)

[Timeline: Old Cached Item, then Expiration, then a window with all servers regenerating the item, then New Cached Item]
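The locking approach can be sketched with a check, lock, re-check sequence so that only one caller rebuilds an expired item. Here `threading.Lock` is a single-process stand-in for a cluster-wide lock service such as ZooKeeper, and `expensive_build` is a hypothetical rebuild function.

```python
import threading

cache = {}
regen_lock = threading.Lock()   # stand-in for a cluster-wide lock
regen_count = 0


def expensive_build():
    """Hypothetical costly rebuild; counts how often it runs."""
    global regen_count
    regen_count += 1
    return "fresh value"


def get_item(key):
    """Return the cached item, letting only one caller rebuild it."""
    value = cache.get(key)
    if value is not None:
        return value
    with regen_lock:
        # Re-check: another caller may have rebuilt while we waited.
        value = cache.get(key)
        if value is None:
            value = expensive_build()
            cache[key] = value
    return value


threads = [threading.Thread(target=get_item, args=("front_page",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch the other callers wait for the rebuild; a variant could serve the old item to them instead.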

Broken replication
- MySQL slave servers get out of sync and fall further behind
- No (sane) method of automated recovery
- Only solvable with good monitoring and recovery procedures
- Can automate DB slave blacklisting from use, but requires cluster management tools
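Automated blacklisting could look roughly like the filter below, fed by whatever job collects replication lag. The 30-second threshold, the server names, and the `readings` dictionary are assumptions for illustration; `None` models MySQL reporting NULL for `Seconds_Behind_Master` when replication is not running.

```python
def usable_slaves(lag_by_slave, max_lag_seconds=30):
    """Keep slaves that are replicating and within the lag threshold.

    A lag of None stands for a slave whose replication has stopped.
    """
    return sorted(
        name for name, lag in lag_by_slave.items()
        if lag is not None and lag <= max_lag_seconds
    )


# Hypothetical readings gathered by a monitoring job.
readings = {"db-slave1": 2, "db-slave2": 540, "db-slave3": None}
pool = usable_slaves(readings)
```

This only removes bad slaves from use; repairing them still needs the monitoring and recovery procedures mentioned above.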

All content in this presentation, except where noted otherwise, is Creative Commons Attribution-ShareAlike 3.0 licensed and copyright 2009 Four Kitchen Studios, LLC.

DrupalCamp Stockholm Presentation Ended Here

Managing the Cluster

The problem
[Diagram: software and configuration must reach five application servers.]
Objectives:
- Fast, atomic deployment and rollback
- Minimize single points of failure and contention
- Restart services
- Integrate with version control systems

Manual updates and deployment
[Diagram: one human updating each of five application servers.]
Why not: slow deployment, non-atomic/difficult rollbacks

Shared storage
[Diagram: five application servers sharing one NFS mount.]
Why not: single point of contention and failure

rsync
[Diagram: five application servers synchronized with rsync.]
Why not: non-atomic, does not manage services

Capistrano
[Diagram: five application servers deployed with Capistrano.]
Capistrano provides near-atomic deployment, service restarts, automated rollback, test automation, and version control integration (tagged releases).
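The near-atomic switch comes from Capistrano's conventional layout: each release lives in its own directory and a `current` symlink points at the live one. The sketch below reproduces that idea in Python on a POSIX filesystem; it is not Capistrano's code, and the `deploy` and `activate` helpers and release names are hypothetical.

```python
import os
import tempfile


def deploy(root, release_name, files):
    """Write a release directory, then repoint `current` in one rename."""
    release_dir = os.path.join(root, "releases", release_name)
    os.makedirs(release_dir)
    for filename, content in files.items():
        with open(os.path.join(release_dir, filename), "w") as handle:
            handle.write(content)
    activate(root, release_name)


def activate(root, release_name):
    """Switch `current` to an existing release; also serves as rollback."""
    target = os.path.join("releases", release_name)
    temp_link = os.path.join(root, "current.tmp")
    os.symlink(target, temp_link)
    # Renaming over the existing symlink swaps it in a single step.
    os.replace(temp_link, os.path.join(root, "current"))


root = tempfile.mkdtemp()
deploy(root, "20090101", {"index.php": "v1"})
deploy(root, "20090102", {"index.php": "v2"})
with open(os.path.join(root, "current", "index.php")) as handle:
    live = handle.read()

activate(root, "20090101")   # rollback to the previous release
with open(os.path.join(root, "current", "index.php")) as handle:
    rolled_back = handle.read()
```

Requests never see a half-copied tree because the new release is complete before the symlink moves, which is the property rsync alone does not give.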

Multistage deployment
Deployments can be staged:
    cap staging deploy
    cap production deploy
[Diagram: Development, Integration, Staging, and the five production application servers, each stage deployed with Capistrano.]

But your application isn't the only thing to manage.

Beneath the application
[Diagram: cluster-level configuration spans the reverse proxy cache, the database, and five application servers.]
- Cluster management applies to package management, updates, and software configuration.
- cfengine and bcfg2 are popular cluster-level system configuration tools.

System configuration management
- Deploys and updates packages, cluster-wide or selectively.
- Manages arbitrary text configuration files.
- Analyzes inconsistent configurations (and converges them).
- Manages device classes (app. servers, database servers, etc.).
- Allows confident configuration testing on a staging server.
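Convergence means comparing each host against the desired state for its device class and changing only what differs. The sketch below shows the idea with plain dictionaries; it does not reflect cfengine or bcfg2 syntax, and the settings and values are hypothetical.

```python
def plan_convergence(desired, actual):
    """List the changes needed to bring one host to its desired state."""
    changes = []
    for key, want in sorted(desired.items()):
        have = actual.get(key)
        if have != want:
            changes.append((key, have, want))
    return changes


def converge(desired, actual):
    """Apply the plan; running it twice changes nothing the second time."""
    for key, _have, want in plan_convergence(desired, actual):
        actual[key] = want
    return actual


# Hypothetical device class and host state.
app_server_class = {"php.memory_limit": "128M", "apc.shm_size": "64M"}
host_state = {"php.memory_limit": "32M"}

first_plan = plan_convergence(app_server_class, host_state)
converge(app_server_class, host_state)
second_plan = plan_convergence(app_server_class, host_state)
```

Running the plan step against a staging server first shows what would change before anything touches production.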

All on the management box
[Diagram: the management box hosts development, integration, staging, deployment tools, and monitoring.]

Monitoring

Types of monitoring
Failure:
- Analyzing Downtime
- Viewing Failover
- Troubleshooting
- Notification
Capacity/Load:
- Analyzing Trends
- Predicting Load
- Checking Results of Configuration and Software Changes

Everyone needs both.

What to use
Failure/Uptime:
- Nagios
- Hyperic
Capacity/Load:
- Cacti
- Munin

Nagios
- Highly recommended.
- Used by Four Kitchens and Tag1 Consulting for client work, Drupal.org, Wikipedia, etc.
- Easy to install on CentOS 5 using EPEL packages.
- Easy to install NRPE agents to monitor diverse services.
- Can notify administrators on failure.
- We use this on Drupal.org
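Monitoring a new service with Nagios usually means writing a small check plugin, which reports through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and one line of output. The sketch below shows only the decision logic; the `check_load` name and the thresholds are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_load(load, warn=4.0, crit=8.0):
    """Map a load average to a plugin exit code and one-line status."""
    if load is None:
        code = UNKNOWN
        detail = "load unavailable"
    else:
        code = CRITICAL if load >= crit else WARNING if load >= warn else OK
        detail = f"load average {load:.2f}"
    return code, f"LOAD {LABELS[code]} - {detail}"


code, message = check_load(5.25)
# A real plugin would print(message) and then call sys.exit(code).
```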

Cacti
- Highly annoying to set up.
- One instance generally collects all statistics. (No agents on the systems being monitored.)
- Provides flexible graphs that can be customized on demand.

Munin
- Fairly easy to set up.
- One master instance generally collects all statistics, polling a munin-node agent on each monitored system.
- Provides static graphs that cannot be customized.

Pressflow
Make Drupal sites scale by upgrading core with a compatible, powerful replacement.

Common large-site issues
- Drupal core requires patching to effectively support the advanced scalability techniques discussed here.
- Patches often conflict and have to be reapplied with each Drupal upgrade.
- The original patches are often unmaintained.
- Sites stagnate, running old, insecure versions of Drupal core because updating is too difficult.

What is Pressflow?
- Pressflow is a derivative of Drupal core that integrates the most popular performance and scalability enhancements.
- Pressflow is completely compatible with existing Drupal 5 and 6 modules, both standard and custom.
- Pressflow installs as a drop-in replacement for standard Drupal.
- Pressflow is free as long as the matching version of Drupal is also supported by the community.

What are the enhancements?
- Reverse proxy support
- Database replication support
- Lower database and session management load
- More efficient queries
- Testing and optimization by Four Kitchens with standard high-performance software and hardware configuration
- Industry-leading scalability support by Four Kitchens and Tag1 Consulting

Four Kitchens + Tag1
- Provide the development, support, scalability, and performance services behind Pressflow
- Comprise most members of the Drupal.org infrastructure team
- Have the most experience scaling Drupal sites of all sizes and all types

Ready to scale?
Learn more about Pressflow:
- Pick up pamphlets in the lobby
- Request Pressflow releases at fourkitchens.com
Get the help you need to make it happen:
- Talk to me (David) or Todd here at DrupalCamp
- Email shout@fourkitchens.com