Scaling a Highly Available Global SUSE Manager Deployment at Rackspace to Manage Multiple Linux Platforms
Aaron Conklin, Product Manager, Rackspace
Jeff Price, Principal Architect, SUSE
The Problem: Red Hat Enterprise Linux (RHEL) 4 EOL
- ~5,000 devices were still on RHEL 4
- The current supplier had no plan to extend support
- It was Fanatical to buy our customers additional time
The Problem: How to Extend Support
What is extended support?
- Emergency hot fixes
- Reactive engineering support
- A patch management solution that integrates into our tools
Who can help?
- A three-month search yielded nobody
- Finally, after working with Microsoft, an offer appeared
The Problem: Single Pane of Glass
- Fanatical Support is delivered by empowered Rackers
- Our Segment Support team provides a single patching interface for Rackspace support to control their customers' environments
- Any solution had to integrate easily into this system
- How can we add flexibility while maintaining consistency?
The Problem: Multi-Platform Support
SUSE Manager's many benefits:
- Backed by a strong Linux engineering organization
- Based on Spacewalk, so extendable to support multiple Linux flavors
- Common provenance with RHN implied minimal new integration points
The benefits of multi-platform support:
- Lower capex (less equipment needed)
- Lower opex (less integration)
- Lower risk (fewer potential points of failure)
The Solution
Overview: Requirements Discovery
Rackspace engaged SUSE Consulting Services to design a scalable SUSE Manager infrastructure to provide management and patching services to its extensive set of customers and Linux hosts. These hosts are deployed with a variety of distributions and versions, including:
- SUSE Linux Enterprise Server (11 SP1/SP2)
- Red Hat Enterprise Linux (4.6 to 6.x)
- CentOS (5.9 to 6.x)
- Various Debian-based distros
Overview (continued)
- Rackspace leverages an existing Satellite deployment during the transition, hoping to standardize on SUSE Manager for support of more OS flavors.
- Rackspace's Satellite designs are not suitable for the large number of clients (hosts), so Rackspace engaged SUSE to design a solution that can scale to over 100,000 hosts across multiple geographically dispersed data centers.
- Rackspace depends on an active/passive cluster design that isn't suited to the stated scalability requirements; an active/active design will be provided, requiring SUSE Manager customization.
Architectural Design
- Six GEOs: Dallas, Chicago, D.C., Sydney, London, and Hong Kong
- Scale-out N+1 active/active cluster design backed by shared storage and a 3-node Oracle RAC DB
- Split-layer-2 network design with F5 load balancers
- NON-appliance SUSE Manager install (packages)
- Customized spacewalk-service scripts
- Specialized certificates to handle SSL session fault tolerance
- SLES HAE to handle service/session fault tolerance and taskomatic instance management
Architectural Design (per GEO): SUSE Manager 1.7 per-DC
Installation details:
- High availability is achieved by a scale-out (N+1) design.
- The Oracle DB is clustered; the cluster runs the taskomatic service on a single node.
- Client access to the SUSE Manager server(s) is load balanced (F5); shared storage is used for RPM storage (/var/spacewalk) and the active cache (/var/cache/rhn).
- Each additional node gets a copy of the certs, rhn.conf, and Oracle client connection info (tnsnames.ora).
- A common name is used for DNS and the CA config; clients access SUSE Manager by the common name, and the CA uses the clustered resource name, NOT a hostname.
- Client registrations are done via SSH or curl; osad runs on each client, and the (jabber) server chats to clients to tell them to check in.
- Copy from the primary (first) node to each additional node: bootstrap.sh, SSL certs, /etc/rhn/rhn.conf, /etc/tnsnames.ora.
- Each node has its own DB connection; DB commits are handled by taskomatic.
Shared storage details:
- SLES 11 SP2 HAE (High Availability Extension)
- SAN-based storage with 2 LUNs presented to each node
- 2 HBA interfaces (separate cards) per host
- Linux host-based MPIO (Multipath I/O) for load balancing and failover
- SBD STONITH device on a shared storage partition
- DLM (Distributed Lock Manager) and O2CB (Oracle cluster stack for OCFS2)
- Cloned resources (per node): OCFS2 filesystems for /var/cache/rhn and /var/spacewalk
Shared storage requirements: /var/spacewalk and /var/cache/rhn
HA configuration details:
- Each SUSE Manager node utilizes its own connection to the Oracle DB.
- The CA must be created with the cluster common name during installation of the first node (NOT the server hostname).
- Additional nodes are installed skipping the DB, schema, SSL cert creation, and bootstrap generation.
- Registration to NCC uses the same username/password and DB connection info from the primary node install.
- Post install: copy /etc/rhn/rhn.conf and /etc/tnsnames.ora to the secondary server nodes.
- Copy and install the SSL certs from the primary node to each secondary; the common name value MUST represent the cluster name, NOT a hostname.
- The taskomatic engine must ONLY run on a single node; on node failover, cluster control starts the process on another node (single-node constraint).
- The cluster should be available via its common name (DNS).
- Database: Oracle 11g RAC.
GEO Deployment
Geo Deployment: RAC Database
Typical SUSE Manager deployments leverage a single-instance Oracle or PostgreSQL DB; several changes are required for a RAC-backed install:
- Install the DB backend FIRST, BEFORE the initial SUSE Manager configuration
- Database creation is only done on the PRIMARY node via the YaST module; every additional node uses command-line scripts for installation
- Ask the DBA for a single node (VIP) to install to INSTEAD of the SCAN address (service_name target)
- tnsnames.ora defaults to using SID=; change it to SERVICE_NAME= after finishing all the nodes
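The SID-to-SERVICE_NAME switch above can be sketched as follows; the alias, host, and service values here are illustrative assumptions, not taken from the deck:

```shell
# Hypothetical tnsnames.ora as it might look right after the initial install,
# pointing at a single RAC node VIP with a SID. Written to /tmp for safety;
# the real file lives at /etc/tnsnames.ora on each SUSE Manager node.
cat > /tmp/tnsnames.ora <<'EOF'
susemanager =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = racnode1-vip.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SID = susemgr)
    )
  )
EOF

# Once all nodes are installed, switch the connect descriptor from the
# single-instance SID to the RAC service name.
sed -i 's/(SID = susemgr)/(SERVICE_NAME = susemgr)/' /tmp/tnsnames.ora

grep SERVICE_NAME /tmp/tnsnames.ora
```

With SERVICE_NAME in place, the Oracle client can reach any RAC instance offering that service instead of being pinned to one SID.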
Geo Deployment: RAC (cont.)
- Provide your DBA with an install script to create the DB correctly, including UTF-8 and the correct grants and revokes; refer to the DBMS wiki for other details: http://wiki.novell.com/index.php/suse_manager/rdbms
- Note that the install-script method is best for DBAs: it provides a repeatable deployment method
- Have the DBA check the DB for any compile errors; sometimes a stored procedure doesn't get created correctly the first time, so check for and resolve any errors
- IF you find errors, don't forget to modify the script!
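A minimal sketch of what such a hand-off script might contain; the user name, tablespace, grants, and revokes here are assumptions for illustration only (the wiki page above has the authoritative version):

```shell
# Write an illustrative DB-creation script the DBA can review and rerun.
# Everything in it is a placeholder sketch, not the official SUSE Manager script.
cat > /tmp/create_susemanager_db.sql <<'EOF'
-- SUSE Manager expects a UTF-8 database; the DBA can verify with:
--   SELECT value FROM nls_database_parameters
--    WHERE parameter = 'NLS_CHARACTERSET';  -- expect AL32UTF8
CREATE USER susemanager IDENTIFIED BY susemanager
  DEFAULT TABLESPACE susemanager_data
  TEMPORARY TABLESPACE temp;
GRANT CONNECT, RESOURCE TO susemanager;
GRANT ALTER SESSION, CREATE SYNONYM, CREATE TABLE,
      CREATE VIEW, CREATE SEQUENCE, CREATE TRIGGER TO susemanager;
-- Revoke anything the schema owner should not keep:
REVOKE UNLIMITED TABLESPACE FROM susemanager;
EOF

wc -l /tmp/create_susemanager_db.sql
```

Keeping the DB setup in a script file is what makes the deployment repeatable: any fix found during a compile-error check goes back into the script, not just into the live database.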
HA Preparation
SLES Cluster Preparation
- The SLES HAE cluster creation scripts are AWESOME! Use them! Use the sleha-init and sleha-join commands
- Setting up shared storage can be a lot of work; make sure you get complete sign-off on your design, as reworking a cluster can be VERY PAINFUL, even for James Bond!
- Make sure NTP is solid, and leave IPv6 enabled even if you don't use it (there are protocols the cluster uses that rely on IPv6)
- Configure your bonded NICs and disk multipathing prior to clusterifying
- REMEMBER: the cluster controls spacewalk-service!
Cluster Architecture and Design
- A scale-out cluster requires shared storage with parallelism (lock management); a fail-over (2-node) cluster doesn't need it
- Create the storage prior to SUSE Manager setup on the primary node
- Oracle has its own storage
Cluster Architecture and Design (cont.)
- Create separate storage for /var/spacewalk and /var/cache/rhn
- A scale-out cluster uses customized certificates with alternative cnames for all nodes plus the cluster name
- Configure SSL session persistence/stickiness on the load balancer
Cluster Controls
Cluster Controls
Cluster-controlled resources: OCFS2 mount points (cache and spacewalk), SBD, DLM, O2CB.
Make sure all mount points are set up and working prior to adding them as cluster-controlled resources:
mkfs -t ocfs2 -N 2 -L cache /dev/mapper/smrepo_part2
mkfs -t ocfs2 -N 2 -L spacewalk /dev/mapper/smrepo_part3
Resource agents: dlm uses ocf:pacemaker:controld; o2cb uses ocf:ocfs2:o2cb.
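The cloned resource stack above could be expressed in crm shell syntax roughly as follows; the primitive names, timeouts, and device paths are illustrative assumptions layered on the resource agents the slide names:

```
# Illustrative crm configure fragment (a sketch, not the deployed config):
primitive dlm ocf:pacemaker:controld op monitor interval=60 timeout=60
primitive o2cb ocf:ocfs2:o2cb op monitor interval=60 timeout=60
primitive fs-cache ocf:heartbeat:Filesystem \
    params device="/dev/mapper/smrepo_part2" directory="/var/cache/rhn" fstype="ocfs2"
primitive fs-spacewalk ocf:heartbeat:Filesystem \
    params device="/dev/mapper/smrepo_part3" directory="/var/spacewalk" fstype="ocfs2"
group base-group dlm o2cb fs-cache fs-spacewalk
clone base-clone base-group meta interleave="true"
```

Cloning the group is what runs the DLM/O2CB stack and both OCFS2 mounts on every node at once, which is what the scale-out design needs.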
Cluster Controls: Taskomatic
Taskomatic can only run on one node; make the cluster control it as a separate resource.
- For a scale-out cluster, taskomatic can run on any one node
- For a fail-over cluster, taskomatic runs on the primary node
Customize the spacewalk-service script (/usr/sbin/spacewalk-service) to remove taskomatic as a grouped service:
BEFORE:
SERVICES="jabberd $SUSE_AUDITLOG $DB_SERVICE $TOMCAT $HTTPD osa-dispatcher Monitoring MonitoringScout rhn-search cobblerd taskomatic"
AFTER:
SERVICES="jabberd $SUSE_AUDITLOG $DB_SERVICE $TOMCAT $HTTPD osa-dispatcher Monitoring MonitoringScout rhn-search cobblerd"
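The BEFORE/AFTER edit above can be scripted; this sketch works on a copy in /tmp so it is safe to try anywhere (the real file is /usr/sbin/spacewalk-service):

```shell
# Reproduce the relevant SERVICES line in a scratch copy of spacewalk-service.
cp_target=/tmp/spacewalk-service
cat > "$cp_target" <<'EOF'
SERVICES="jabberd $SUSE_AUDITLOG $DB_SERVICE $TOMCAT $HTTPD osa-dispatcher Monitoring MonitoringScout rhn-search cobblerd taskomatic"
EOF

# Drop taskomatic from the grouped service list so the cluster can manage it
# as a separate, single-node-constrained resource.
sed -i 's/ taskomatic"/"/' "$cp_target"

grep SERVICES= "$cp_target"
```

Scripting the edit keeps the change repeatable across every node that needs the modified spacewalk-service.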
SUSE Manager Cluster Setup
SUSE Manager Cluster Setup: Primary Node
Gather your setup prerequisites:
- Mirror credentials
- Database info: SID or service name (for RAC), user (default susemanager), password (default susemanager), port (default 1521), Oracle hostname (DNS, please!)
- Note: use a single node (SID) for the initial install, then change to service_name afterwards
SUSE Manager Cluster Setup: Primary Node (cont.)
- Run yast2 susemanager_setup and answer the questions using the info from the previous slide
- CREATE the CA using the CLUSTER NAME!!
- Upon completion, spacewalk-service is started automatically
- Run spacewalk-service disable so the services do not start at boot
- Do spacewalk-service stop and continue with the other nodes
- Copy /etc/tnsnames.ora to the other node(s)
SUSE Manager Cluster Setup: Secondary Node(s)
- Ensure /etc/tnsnames.ora is in place and points to a single VIP and SID
- Manual setup on the other nodes: spacewalk-setup --skip-db-population --ncc
- Run spacewalk-service list and disable the services; the cluster will control them
- Don't forget to modify spacewalk-service to remove taskomatic; it will run as a separate cluster-controlled service
- Replace the certs with cluster-ready ones
Cluster Certs (also called "pain and suffering")
Cluster Certs
- The initial build asks to set up a CA and creates the /root/ssl_build directory
- Each node will have its own self-signed certificate; you can use rhn-ssl-tool to adapt the server certificates to include alternative cnames, which is important for connection failovers/reconnects
- The osa-dispatcher service is very particular about the server name and is not alt-cname aware, so you need to create certs with the node names plus aliases for the cluster name and alternate nodes
- rhn-ssl-tool can be used to recreate the CA and/or the certs
Cluster Certs: active/passive
- The jabber service also has some sensitivity to host names and certificates
- The client-to-server config (c2s.xml) and the session manager config (sm.xml) both need to reference the cluster name (FQDN), i.e. the floating name
- The jabber server referenced in /etc/rhn/rhn.conf needs to reference the local server name, and the osa-dispatcher entry uses the cluster name
- Certs need to be put into place:
cp server.crt /etc/ssl/servercerts/spacewalk.crt
cp server.key /etc/ssl/private/spacewalk.key
cp server.crt /etc/apache2/ssl.crt/spacewalk.crt
cp server.key /etc/apache2/ssl.key/spacewalk.key
Cluster Certs: active/active
For active/active there is no aliasing, so everything should have an FQDN in it, including the /etc/jabberd/*.xml files and /etc/rhn/rhn.conf.
Troubleshooting
Troubleshooting
If osad complains on startup about a bad password, check the password stored in the DB. Create two SQL scripts, one to read the passwords in the DB and another to remove them:
pushdispatchpwread.sql:
select jabber_id, password from rhnpushdispatcher;
pushdispatchpwdelete.sql:
delete from rhnpushdispatcher;
commit;
Then run them with the following commands:
/usr/bin/spacewalk-sql --select-mode-direct /root/pushdispatchpwread.sql
If more than one password is found, run:
/usr/bin/spacewalk-sql --select-mode-direct /root/pushdispatchpwdelete.sql
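The two scripts above can be created in one step; this sketch writes them to /tmp rather than /root so it is safe to run anywhere, and the spacewalk-sql invocations are left as comments since they need a live SUSE Manager DB:

```shell
# Recreate the two troubleshooting scripts from the slide.
cat > /tmp/pushdispatchpwread.sql <<'EOF'
select jabber_id, password from rhnpushdispatcher;
EOF

cat > /tmp/pushdispatchpwdelete.sql <<'EOF'
delete from rhnpushdispatcher;
commit;
EOF

# On a real SUSE Manager node you would then run, per the slide:
#   /usr/bin/spacewalk-sql --select-mode-direct /tmp/pushdispatchpwread.sql
#   /usr/bin/spacewalk-sql --select-mode-direct /tmp/pushdispatchpwdelete.sql
ls -l /tmp/pushdispatchpwread.sql /tmp/pushdispatchpwdelete.sql
```

After the delete, osad re-registers its dispatcher password on the next restart, which clears the stale-password condition.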
Troubleshooting (cont.)
Errors that state SSLDisabled, like this?
SUSE Manager 4861 2013/06/17 16:16:10-05:00: ('Not able to reconnect',)
SUSE Manager 4861 2013/06/17 16:16:10-05:00: ('Traceback (most recent call last):\n File "/usr/share/rhn/osad/jabber_lib.py", line 252, in setup_connection\n c = self._get_jabber_client(js)\n File "/usr/share/rhn/osad/jabber_lib.py", line 309, in _get_jabber_client\n c.connect()\n File "/usr/share/rhn/osad/jabber_lib.py", line 589, in connect\n raise SSLDisabledError\nSSLDisabledError\n',)
- Make sure the cluster names are being used in the jabberd configuration files /etc/jabberd/c2s.xml and sm.xml
- Make sure the /etc/rhn/rhn.conf jabber and osa-dispatcher lines are correct:
server.jabber_server = <fqdn of cluster name>
osa-dispatcher.jabber_server = <fqdn of host>
Active/Passive vs. Active/Active
Active/Passive Design Version: SUSE Manager 1.7 per-DC
Installation details:
- High availability is achieved by a failover design. The Oracle DB is clustered.
- All SUSE Manager services are grouped and run on a single node with a floating IP.
- Client access to the SUSE Manager server(s) is load balanced; shared storage is used for RPM storage (/var/spacewalk) and the active cache (/var/cache/rhn).
- Each additional node gets a copy of the certs, rhn.conf, and Oracle client connection info (tnsnames.ora).
- A common name is used for DNS and the CA config; load-balanced access to SUSE Manager is via proxies, and the CA uses the proxy virtual name, NOT a hostname.
- Client registrations go to the load-balanced proxies via SSH or curl; osad runs on each client, and the (jabber) server chats to clients to tell them to check in with rhn_check.
- Each node has its own DB connection; DB commits are handled by taskomatic, which runs only on the active node.
Shared storage details:
- SLES 11 SP2 HAE (High Availability Extension)
- SAN-based storage with 2 LUNs presented to each node
- 2 HBA interfaces (separate cards) per host
- Linux host-based MPIO (Multipath I/O) for load balancing and failover
- SBD STONITH device on a shared storage partition
- DLM (Distributed Lock Manager) and O2CB (Oracle cluster stack for OCFS2)
- Cloned resources (per node): OCFS2 filesystems for /var/cache/rhn and /var/spacewalk
Shared storage requirements: /var/spacewalk and /var/cache/rhn
Copy from the primary node to the secondary node: bootstrap.sh, SSL certs, /etc/rhn/rhn.conf, /etc/tnsnames.ora.
HA configuration details:
- Each SUSE Manager node utilizes its own connection to the Oracle DB.
- The CA must be created with the cluster common name during installation of the first node (NOT the server hostname).
- Additional nodes are installed skipping the DB, schema, SSL cert creation, and bootstrap generation.
- Registration to NCC uses the same username/password and DB connection info from the primary node install.
- Post install: copy /etc/rhn/rhn.conf and /etc/tnsnames.ora to the secondary server nodes; copy and install the SSL certs from the primary node to each secondary. The common name value MUST represent the cluster name, NOT a hostname.
- The taskomatic engine must ONLY run on a single node; on node failover, cluster control starts the process on another node (single-node constraint).
- The cluster should be available via its common name (DNS).
- Database: Oracle 11g RAC.
Documentation and Guides...
- There are minimal SUSE-based guides and resources for HA designs today... this will change.
- HA is fully supported, yet minimally discussed in the documentation... this will change.
- There are different methods of high availability and solutions to solve your problems... document your requirements.
- WE ADAPT. YOU SUCCEED... SUSE is here to help, and this WON'T change!
The Future
The Future: Global Patch Management
Make management easy:
- Global configuration standardization
- Fast deployment to approved patching levels
Patching CDN:
- Worldwide user and config access by multiple patch management environments
- Reduced bandwidth usage, lower latency, better performance
The Future: More Control For The User
Fine-grained patch management:
- Edge patching for bleeding-edge access to updates
- Managed patching channels for Rackspace-validated patches
- Premium testing services: test patches against your configs before applying them in production
The Future: Fanatical Anywhere
Develop like you produce; produce like you develop:
- Continuity of configuration between locations, even between hosting providers
- Nimbleness in a multi-cloud world
- No lock-in... platform freedom, consistency of support
Q&A
Thank you.
Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany
+49 911 740 53 0 (Worldwide)
www.suse.com
www.opensuse.org
Unpublished Work of SUSE. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.