A Carrier-Grade Cloud Phone System Based on SUSE Linux Enterprise Server & High Availability Extension

Brett Buckingham
Managing Director, silhouette R&D, Broadview Networks
bbuckingham@broadviewnet.com
Abstract

A case study of the use of SUSE Linux Enterprise technologies in a carrier-grade cloud VoIP product. Broadview Networks' OfficeSuite is a cloud phone service used by over 100,000 business people every day. This case study follows a comprehensive re-architecture of the silhouette product inside OfficeSuite, away from proprietary and inflexible technologies and toward SUSE Linux Enterprise Server. Topics covered include the extensive use of the SUSE Linux Enterprise High Availability Extension to maintain "five 9's" carrier-grade high availability, fault management based on open technologies, and upcoming use of geo clustering.
Table of Contents

- Product overview
- Re-architecture goals
- High Availability (HA) foundation evolution
- HA databases
- Shared file system
- Custom cluster status
- Fault management
- Geo clusters
Product Overview
Product Overview

- silhouette is sold to cloud services providers (CSPs)
- CSPs use silhouette to provide a cloud phone service offering to small to medium businesses
- One silhouette system supports 1000 businesses, each with 20 users (think: 1000 PBXs)
- Cloud VoIP: only phones and an IP network on site
- Businesses manage their phone service entirely via a web interface
- Broadview Networks hosts the OfficeSuite service based on silhouette, and licenses silhouette to other CSPs
Phone System Managed via Web
silhouette is Widely Deployed

- Broadview Networks has 13 silhouette systems in production underpinning the OfficeSuite service, serving over 100,000 business users every day
- Broadview licenses silhouette to 17 other CSPs worldwide, which combined serve an additional 60,000 business users every day
silhouette is Carrier-Grade

As a product intended to be hosted by cloud and telecom service providers, silhouette is subject to carrier-grade requirements, such as:

- Availability: 99.999%
- Reliability: 99.99%
- Scalability and throughput
- Manageability and serviceability
- Security
- Real-time responsiveness
Other Product Information

- Developed over the past 12 years; in live production service for 9 years
- Software only
- Comprised of several (15+) software components
- Deployed on carrier-class x86-64 servers
- SLES is embedded in the product
- Interfaces with network peer components for some functions
- Deployed on 3 servers over 2 tiers:
  - Web tier: single-node HA cluster
  - Call tier: 2-node HA cluster
Network Diagram

[Diagram: the Internet connects to the web tier and a VoIP ALG; call-tier nodes 0 and 1 sit on a managed network alongside the voicemail server, media server, and PSTN gateway, which connects to the PSTN.]
Re-architecture Goals
Where Were We?

The product was:
- Stable
- Successful and profitable
- Growing rapidly

...so why the need for re-architecture?
There Were Risks and Impediments to Future Evolution

- The HA framework was discontinued and poorly supported
- The HA framework locked the product to Solaris
  - A suitable and capable operating system, but it lacked development velocity for several 3rd-party software packages
- The HA framework locked the product to a limited subset of Sun servers
- The HA framework greatly complicated install media creation, disk sparing, and disaster recovery strategies
Previous HA Architecture

- Built on Sun's Netra High Availability Suite (NHAS)
- Custom component serving as both a local resource manager (LRM) and cluster resource manager (CRM)
- All service provided by active software components running on the master node
- Vice-master node contained identical software receiving replication and/or running in slave mode
- Shared file system provided by NHAS
- 1+1 nodal sparing: master and vice-master swap roles
Consequences of 1+1 Nodal Sparing

- Nodal: failure of a sub-critical component can cause a nodal failover, taking critical components with it
- 1+1: this level of redundancy is expensive, requiring 2x server capacity. 3+1 sparing is more typical; it still meets availability requirements, but at lower cost.
- Only 2 cluster nodes does not facilitate proper quorum-based decision making and fencing; risk of split-brain
- All services were bound to the same master service floating IP address
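In Pacemaker-based clusters, the two-node quorum problem is conventionally handled by relaxing the quorum policy and relying on fencing (STONITH) to prevent split-brain. A hypothetical crm shell fragment along those lines; the node name and fencing device parameters are illustrative, not silhouette's actual configuration:

```
# In a 2-node cluster quorum can never break a tie, so quorum loss
# must not stop resources; fencing then guards against split-brain.
# Fencing device parameters below are illustrative.
property no-quorum-policy=ignore
property stonith-enabled=true
primitive fence-node0 stonith:external/ipmi \
    params hostname=node0 ipaddr=10.0.0.100 userid=admin passwd=secret
```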
High Availability Foundation Evolution
We Needed

- An HA framework which supports component-level availability policy and mechanism
- Easier horizontal scaling
- Virtualization technology (labs and production)
- Better support for the operating system and HA framework
- Geo clustering capability
- Choice of server hardware
- Reduced platform cost and licensing complexity
- Faster development velocity
Constraints

Existing legacy systems must be upgradable to the proposed evolved product; therefore we were subject to the following constraints:

- No additional or upgraded hardware: the product based on the evolved HA foundation must (at least initially) run on the existing legacy servers (1 in the web tier, 2 in the call tier)
- No significant networking modifications: applies to the system and to the configuration of network peers
- Although revolutionary for us, for our customers this was just another product version upgrade
- All upgrades occur while the system is in service
Component-Level HA

- SUSE Linux Enterprise High Availability Extension is based on Corosync, OpenAIS, and Pacemaker
- Pacemaker facilitates component-level (resource-level) high availability
- Re-architecture required careful design of:
  - Component groups
  - Startup and shutdown ordering
  - Dependencies, and resulting modifications to components
  - Co-location / anti-co-location constraints, stickiness
  - Component monitoring
Complex Component Dependencies
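The kinds of design decisions listed above map onto crm shell constructs roughly as follows; all resource names here are invented for illustration and are not silhouette's actual components:

```
# Grouping, ordering, colocation, anti-colocation and stickiness in
# crm shell syntax. Resource names are illustrative.
group g-maindb ip-maindb pgsql-maindb         # IP starts before the DB
order o-db-then-app inf: g-maindb app-server  # app waits for its DB
colocation c-app-with-db inf: app-server g-maindb
colocation c-spread -inf: heavy-svc-a heavy-svc-b  # anti co-location
rsc_defaults resource-stickiness=1000         # avoid needless fail-back
monitor pgsql-maindb 30s                      # recurring health check
```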
High Availability Database
High Availability Database

- silhouette contains 2 databases:
  - Main database: provisioned system and business data
  - Billing database: call records
- During re-architecture, we standardized on PostgreSQL (previously: Sybase and MySQL)
- Each database has HA requirements
- The stock PostgreSQL resource agent (RA) was inadequate for a master/slave arrangement
- We developed a custom design and RA
Design: Warm Standby Slave

[Diagram: clients A through N connect to the master via a floating master IP; the master streams replication to a warm-standby slave.]

Master Fails

[Diagram: the master instance fails; clients lose service via the master IP.]

Slave is Promoted, IP Follows, Clients Reconnect

[Diagram: the slave is promoted to master, the master IP moves to it, and clients reconnect. The failed instance's data could be out of date and is no longer valid.]

Failed Instance Restarts as Slave

[Diagram sequence: the failed instance's on-disk files are erased, a full backup is taken from the new master, and the instance starts as a slave receiving streaming replication.]
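The warm-standby design above corresponds to PostgreSQL 9.x streaming replication. A minimal sketch of the settings involved; the host name and replication user are illustrative, not silhouette's actual values:

```
# Master, postgresql.conf: allow streaming to standbys
wal_level = hot_standby
max_wal_senders = 3

# Slave, recovery.conf: follow the master until promoted
standby_mode = 'on'
primary_conninfo = 'host=db-master port=5432 user=replicator'
```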
Initial Startup Problem

In Pacemaker, the startup sequence for a master/slave resource is as follows:
- Resource started as a slave on node A
- Resource started as a slave on node B
- Pacemaker chooses one instance to promote to master

For our design, starting as a slave means:
- Erase files on disk (both instances would do this)
- Obtain a full backup from the master instance (there isn't one yet)
- Start the slave in PostgreSQL recovery mode
Initial Startup Problem Solution

Custom PostgreSQL RA. When told to start as a slave:
- If there is a running master, do normal slave startup
- If there is no running master:
  - Do nothing
  - Return a code to Pacemaker as if successfully started as a slave

When Pacemaker eventually promotes one instance:
- Start that instance as a master from the disk image
- Start the other instance as a slave
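The slave-start decision described above can be sketched as a small shell fragment; `slave_start_action`, its input, and the action names are hypothetical, not the real resource agent's interface:

```shell
# Illustrative sketch of the custom RA's start-as-slave decision.
# The function name, its "yes"/"no" input, and the echoed action
# names are hypothetical.
slave_start_action() {
  master_running="$1"   # "yes" if a master instance is running somewhere
  if [ "$master_running" = "yes" ]; then
    # Normal slave startup: erase on-disk files, take a full backup
    # from the master, then start in PostgreSQL recovery mode.
    echo "erase_clone_and_recover"
  else
    # Initial cluster startup: no master exists yet, so do nothing,
    # but report success so Pacemaker can later promote one instance.
    echo "noop_report_started"
  fi
}
```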
Additional Enhancements

- Fallback images: a rotation of regular database backups is taken on each node and stored locally. If normal HA mechanisms fail, or the database is corrupted, the RA will start the database from a fallback image.
- Enhanced monitoring: for our purposes, it is not sufficient to deem a database alive based on there being a running PID. Our RA performs representative database queries.
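The enhanced monitoring idea can be sketched as a decision function that treats "process alive" and "query answered" as separate signals; the function and its inputs are illustrative, but the numeric codes follow OCF resource agent conventions:

```shell
# Illustrative monitor logic: a database is only healthy if a
# representative query succeeds, not merely if the PID exists.
# Codes follow OCF conventions: 0 = running, 7 = not running,
# 1 = generic error.
monitor_verdict() {
  pid_alive="$1"; query_ok="$2"   # each "yes" or "no"
  if [ "$pid_alive" != "yes" ]; then
    echo 7   # OCF_NOT_RUNNING: no database process at all
  elif [ "$query_ok" = "yes" ]; then
    echo 0   # OCF_SUCCESS: process alive and answering queries
  else
    echo 1   # OCF_ERR_GENERIC: process alive but not serving queries
  fi
}
```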
Shared File System
Shared File System

- Some of silhouette's components (e.g. the DHCP server) are inherently file-based, and are made highly available via files shared between the 2 cluster nodes
- Typical shared-storage mechanisms call for a NAS/SAN tier and a cluster file system, but we were initially constrained by the requirement to add no new hardware
- We built a shared file system from an HA NFS server and a Distributed Replicated Block Device (DRBD)
Shared File System Cluster Resources

[Diagram: NFS clients on both nodes mount from an NFS server IP; the NFS server exports an ext3 file system on the DRBD master device, which replicates to the DRBD slave on the other node.]
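A stack like the one diagrammed is conventionally expressed in crm shell along these lines; the DRBD resource name, device path, mount point, and IP are illustrative, not silhouette's actual values:

```
# DRBD master/slave pair under an ext3 file system, floating NFS IP,
# and NFS server; values are illustrative.
primitive drbd-shared ocf:linbit:drbd params drbd_resource="shared"
ms ms-drbd drbd-shared meta master-max=1 clone-max=2 notify=true
primitive fs-shared ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/shared" fstype="ext3"
primitive ip-nfs ocf:heartbeat:IPaddr2 params ip="10.0.0.50"
primitive nfs-server lsb:nfsserver
group g-nfs fs-shared ip-nfs nfs-server
colocation c-nfs-on-master inf: g-nfs ms-drbd:Master
order o-promote-then-mount inf: ms-drbd:promote g-nfs:start
```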
Custom Cluster Status
Custom Cluster Status

- With more than 40 resources running in the cluster, and complex inter-dependencies, visualizing resource status was difficult
- crm status and the GUI tools are very useful, but expose the underlying resource complexity
- We implemented a status query utility specific to our product which displays a higher-level status of logical components
- Component status is summarized, color-coded, and can update in real time
Output of crm status
Custom Cluster Status
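The summarization such a utility performs can be sketched as a rule mapping low-level resource states to one logical component status; the function name and the UP/DEGRADED/DOWN labels are illustrative, not the actual tool's output:

```shell
# Illustrative summarization rule: a logical component is UP only
# when every one of its underlying cluster resources is Started,
# DEGRADED when some are, and DOWN when none are.
component_status() {
  started=0; total=0
  for state in "$@"; do
    total=$((total + 1))
    if [ "$state" = "Started" ]; then
      started=$((started + 1))
    fi
  done
  if [ "$total" -gt 0 ] && [ "$started" -eq "$total" ]; then
    echo UP
  elif [ "$started" -gt 0 ]; then
    echo DEGRADED
  else
    echo DOWN
  fi
}
```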
Fault Management
Fault Management

- CSP operations personnel need to monitor silhouette 24/7
- We redesigned silhouette's fault management system based on Nagios
- Status and notifications are provided via a built-in web interface, Nagios notifications, and SNMP to a CSP's network operations center
- Cluster and resource status are inputs to the system
Nagios Dashboard
Capacity Monitoring
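On the Nagios side, cluster and resource status typically enter the system as ordinary service checks. A hypothetical object definition; `check_crm` and the host name are invented, not necessarily the plugin or naming silhouette uses:

```
# Hypothetical Nagios service definition feeding cluster status
# into the fault management system.
define service {
    use                   generic-service
    host_name             call-tier-node0
    service_description   Cluster resource status
    check_command         check_crm
    check_interval        1
    notification_options  w,c,u,r
}
```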
Geo Clusters
Primary Site

[Diagram: the two primary-site nodes, under the HA framework, run the master call processor (callp), WebAdmin, DHCP server and config, TFTP server, file server, master main database (maindb), txpublisher, message broker, CDRtoSMDR, master billing database (smdrdb), the shared file system, and the service IP address.]
Replication Within Primary Site

[Diagram: as above, with slave instances added across the two nodes: callp (M) replicates state to callp (S), maindb (M) streams replication to maindb (S), and smdrdb (M) replicates to smdrdb (S).]
Disaster Recovery (DR) Site

[Diagram: a DR site with its own two-node HA framework is added alongside the primary site, plus an arbitrator at a NOC site. The primary site continues to run the master instances and the service IP address.]
Replication to DR Site

[Diagram: the DR site runs slave instances of callp, maindb, and smdrdb, its own message broker, and a replica of the shared file system; callp state, both databases, and the shared file system replicate from the primary site to the DR site.]
Primary Site Fails

[Diagram: the primary site is lost; the NOC-site arbitrator and the replicating DR site remain.]

- Arbitrator and DR site detect loss of the primary site
- Arbitrator and DR site vote to switch service to the DR site
DR Site Promoted

[Diagram: the DR site now runs the master instances and the service IP address.]

DR site promoted into service:
- All services start in the DR site across the cluster nodes
- The service IP address migrates to the DR site
- The DR site provides service
- DR-site high availability is restored across the DR cluster nodes
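In the SLE HA Extension, this three-party vote (two sites plus an arbitrator) is the model used by the booth ticket manager that ships with the geo clustering feature. A hypothetical booth configuration sketch; all addresses and the ticket name are invented:

```
# booth.conf sketch: one ticket grants the right to run the service;
# the NOC-site arbitrator breaks ties between the two sites.
transport = UDP
port = 9929
arbitrator = 192.168.100.10   # NOC site
site = 192.168.10.10          # primary site
site = 192.168.20.10          # DR site
ticket = "silhouette-service"
```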
Complications

- Moving the service IP address between nodes on the same LAN (HA fail-over) is one thing, but moving an IP address to another site is far more complicated
- Other service functions (e.g. voicemail) need to participate in the geo clustering scheme
- Telephone switching systems at the primary site may also have failed; telephone numbers need to be ported / migrated to switching systems at the DR site
- Should the site fail-over decision be automated or manual?
Conclusion
We Achieved

- An HA framework which supports component-level availability policy and mechanism
- Easier horizontal scaling
- Virtualization technology (labs and production)
- Better support for the operating system and HA framework
- Geo clustering capability
- Choice of server hardware
- Reduced platform cost and licensing complexity
- Faster development velocity
Additional Benefits

- Higher availability, due to component HA decoupling
  - Impact of failure of sub-critical components is minimized
  - Some component fail-overs / restarts are avoided if the other node is deemed unable to restore the component's service
- Streamlined creation of install media via kiwi
- Higher developer productivity
  - Extensive use of virtualization technology
  - System images produced by loadbuild
More About High Availability with SUSE Linux Enterprise

- CAS1417: A Xen cluster success story using the SLES HA Extension
- TT1395: How to Build an HA environment with Linux on IBM System z
- TT1422: Linux Clusters Made Easy with SLES HA Extension
- TT1449: How To Make Databases on SUSE Linux Enterprise Server Highly Available

Thank you.
Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany
+49 911 740 53 0 (Worldwide)
www.suse.com
www.opensuse.org
Unpublished Work of SUSE. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.