ACANO SOLUTION RESILIENT ARCHITECTURE
White Paper
Mark Blake, Acano CTO
September 2014

CONTENTS
Introduction
Definition of Resilience
Achieving Resiliency
Managing Your Data Secure from Incident or Accident
Distributed Calls
Resilient Integration
Network Survival
Conclusion

INTRODUCTION

The Acano Solution is a scalable software platform for voice, video and web. The solution integrates with a wide variety of third-party equipment from Cisco, Microsoft, Avaya and other vendors. With the Acano Solution, people connect regardless of location, device, or technology.

Why resilience? Acano believes that bringing people together to collaborate and exchange ideas is mission-critical for any organization. Historically this has been widely accepted for audio calls, but it is increasingly true for web conferencing, video conferencing and instant messaging as well. A superior user experience is required to increase adoption of these technologies. Because Acano is a single platform that unifies them, it is even more important that it maintain service under any circumstance.

Acano's resiliency architecture goes hand-in-hand with the solution's ability to scale. This document focuses on resilience and how we achieve it today. The explanation applies whether the software is deployed across virtual machines (VMs), Acano optimized servers (Acano Servers), or a combination of these platforms.

DEFINITION OF RESILIENCE

For Acano, resilience means that there must be no single point of failure anywhere in the system: within the Acano software itself, but importantly also at the integration points with other equipment, and in the network used to communicate between the components of the system.

Whether the challenge is a simple equipment or power failure, a network outage, or a targeted malicious attack, resilience is the ability of the Acano architecture to continue with minimal interruption. The Acano Solution is built from the ground up to address these problems head on.

ACHIEVING RESILIENCY

The Acano software can be run either in a VM or on an Acano Server. Each system running the software is referred to as an instance. Multiple instances can be run and configured to work together as a single system using a shared-nothing architecture, where each instance is independent and self-sufficient. In this architecture the failure of any single node does not prevent the system from functioning.

The Acano Solution presents a single software image, but is divided into a number of discrete, partitioned software components:

• Call bridge (media processing and signaling)
• TURN server (NAT traversal)
• XMPP (instant messaging, presence and call control for Acano clients)
• WebRTC
• Database

Each Acano instance can run any subset of these components. This allows a truly software-defined solution where scale and resilience can be tuned separately according to a customer's needs, scale, and network topology. Systems running in production at customers today show how this architecture can scale all the way from a simple resilient two-box solution for a small enterprise up to a system of dozens of servers distributed globally for a Tier 1 service provider.

Typically a simple deployment will split edge functions, which need to reside in a DMZ, from core functions, which are preferentially deployed in the core network.

Figure 1 Acano software components and their typical deployment on an edge instance and a core instance. Each instance can be on a VM or Acano Server.
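
The component model described above can be illustrated with a short sketch. This is not Acano configuration syntax: the identifiers, the instance names, and in particular the assignment of components to edge and core instances are assumptions made for illustration only. The sketch models each instance as a subset of the five components and reports any component that runs on fewer than two instances, which would be a single point of failure.

```python
from dataclasses import dataclass

# Component names follow the list above; the identifiers are illustrative.
COMPONENTS = frozenset({"call_bridge", "turn", "xmpp", "webrtc", "database"})


@dataclass(frozen=True)
class Instance:
    name: str              # hypothetical instance label
    platform: str          # "vm" or "acano_server"
    components: frozenset  # subset of COMPONENTS enabled on this instance


def single_points_of_failure(instances):
    """Return the components that run on fewer than two instances."""
    counts = {component: 0 for component in COMPONENTS}
    for inst in instances:
        unknown = inst.components - COMPONENTS
        if unknown:
            raise ValueError(f"unknown components on {inst.name}: {sorted(unknown)}")
        for component in inst.components:
            counts[component] += 1
    return sorted(c for c, n in counts.items() if n < 2)


# Assumed edge/core split; the source does not state which components go where.
EDGE = frozenset({"turn", "webrtc"})
CORE = frozenset({"call_bridge", "xmpp", "database"})

deployment = [
    Instance("edge-1", "vm", EDGE),
    Instance("edge-2", "vm", EDGE),
    Instance("core-1", "acano_server", CORE),
    Instance("core-2", "vm", CORE),
]

print(single_points_of_failure(deployment))      # every component is duplicated
print(single_points_of_failure(deployment[:3]))  # core components run only once
```

Dropping one core instance from the example leaves the call bridge, XMPP and database components each running on a single instance, which is exactly the condition the architecture is designed to avoid.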

The software is independent of the underlying platform, which allows unique flexibility: a system can be built from any mixture of virtual machine and Acano Server instances. This gives the best of both. For example, Acano's workload-optimized server can sit in the network core, while virtual machines sit at the periphery or provide additional flexible capacity as required.

In a typical distributed deployment there are one or more geographically separate edge locations. Each of these edge locations consists of one or more Acano instances, and each edge instance is connected to the core instances. The core instances have a mesh of connections between them that allows for distributed conferences and resilience.

When an Acano client connects to the system it determines the best edge to connect to, and calls are normally routed to a local core instance. The mesh of connections between the core instances allows for distributed cospaces. cospaces are Acano's always-available virtual meeting rooms.

Figure 2 Distributed Acano deployment using both VMs and Acano Servers. Also shown are example signalling and media paths for two Acano clients.

In this deployment, the total capacity is the sum of the capacities of the individual core instances. If any core instance or instances fail, only the call legs currently on those instances are affected and the rest of the system carries on. Resilience is introduced by over-provisioning and load-balancing. On failure the rest of the system takes up the slack.

Figure 3 New signalling and media path in the event of a core instance failure. Only the call leg on the failed instance is affected, and the rest of the system takes over.

Resiliency also exists at the edge. Failure of an edge instance leads to calls being routed via an alternative edge instance. This instance could be hosted either in the same location or in another location.

Failure within each individual instance is prevented where possible. For VMs, we fully support technologies such as vMotion. The Acano Server contains multiple internal media processing modules. Any of them can fail, and the only result (often completely unnoticeable) is redistribution of that module's processing over the remaining modules without call drop or interruption. The Acano Server also includes redundant power supplies, which allow it to be connected to dual power feeds for resiliency against power failures.
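
Returning to the over-provisioning described at the start of this section, the capacity arithmetic can be sketched as follows. The instance names and capacity figures are hypothetical, and the functions are an illustration of the reasoning rather than part of the Acano software. Total capacity is the sum over the running core instances, and a deployment tolerates k failures if the worst-case loss of k instances still leaves enough capacity for the peak load.

```python
def surviving_capacity(capacities, failed):
    """Sum the capacity of the core instances that are still running."""
    return sum(cap for name, cap in capacities.items() if name not in failed)


def tolerates_failures(capacities, peak_load, k=1):
    """True if losing any k instances still leaves room for peak_load."""
    if k >= len(capacities):
        return False
    # Worst case: the k largest instances fail, so keep only the smallest ones.
    remaining = sorted(capacities.values())[: len(capacities) - k]
    return sum(remaining) >= peak_load


# Hypothetical capacities, measured in concurrent call legs.
core = {"core-1": 100, "core-2": 100, "core-3": 50}

print(surviving_capacity(core, failed=set()))        # 250
print(surviving_capacity(core, failed={"core-1"}))   # 150
print(tolerates_failures(core, peak_load=140, k=1))  # True
print(tolerates_failures(core, peak_load=160, k=1))  # False
```

With these example figures, a peak load of 140 call legs survives the loss of any one core instance, while a peak load of 160 would need additional capacity to do so.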

MANAGING YOUR DATA SECURE FROM INCIDENT OR ACCIDENT

Acano places a premium on security, following a secure development lifecycle and building to the strictest standards. In terms of resiliency, the most important component to secure within our solution is undoubtedly the database. Here we have taken the design decision to put all persistent state of the system in a distributed SQL database. We use streaming replication technologies to make sure that this database is fully distributed and redundant.

The database is an optional component of every Acano instance. If you want it to just work, it can be enabled on all instances and your database will simply scale in resilience with the number of instances deployed.

However, for some customers, information security requirements for keeping the data isolated might mean they want to deploy the database separately. This is simple with the Acano Solution, since the entire distributed system will access the database on just the nodes you specify. You can either deploy additional Acano Servers or spin up one or more VMs, enabling only the database component on those while disabling it on the others. The other instances hold no data of their own and contain no configuration information when cospaces aren't running.

Every database instance is stored using strong encryption, and all control and database network connections between Acano instances use secure TLS links.

DISTRIBUTED CALLS

Making a resilient system for calls is as simple as setting up multiple Acano instances with the call bridge (media processing) component enabled. These instances then work together and form a single system.

By default all calls and conferences are fully distributed; media processing for a single call or conference will be split across multiple instances. The instances can be co-located in the same datacenter or geographically distributed. Secure, bandwidth-controlled, inter-instance media links are created automatically to join up the different instances used in a distributed call.

From an end-user perspective, the fact that the call may be running on one or multiple instances is transparent. All call features, such as layout control, in-call controls, and roster lists, are preserved.

If one instance fails, only those participants who are hosted on that instance are affected. In some cases, depending on the call type, they can be automatically reconnected with just a brief loss of connectivity. In the worst case, dialing back into the meeting will place them back into the same call, but now hosted on a different Acano instance.

RESILIENT INTEGRATION

Integration with other solutions is a key part of Acano's story. It is equally important that these integrations have no single point of failure. If the system isn't resilient at every point, then it's not resilient.

Whether the integration is with Cisco VCS, CUCM, Microsoft Lync, Avaya, or another system, the basic techniques used are the same. These techniques need to be applied in both directions (into and out of the third-party equipment) so that resiliency of both sides is realised.

Figure 4 Multiple trunk connections can be configured for automatic failover.
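
The automatic failover across multiple trunk connections pictured in Figure 4 can be illustrated with a minimal sketch. This is not Acano dial plan syntax or any vendor API: the exception type, the trunk names, and the stub trunk functions are all hypothetical. The sketch only shows the general pattern of trying each destination in priority order and moving on to the next when one has failed.

```python
class TrunkUnavailable(Exception):
    """Raised by a hypothetical trunk when its destination cannot be reached."""


def place_call(trunks, destination):
    """Try each (name, send) trunk in priority order; continue on failure."""
    errors = []
    for name, send in trunks:
        try:
            return name, send(destination)
        except TrunkUnavailable as exc:
            # Record the failure and fall through to the next trunk.
            errors.append(f"{name}: {exc}")
    raise TrunkUnavailable("all trunks failed: " + "; ".join(errors))


# Stub trunks standing in for real signalling connections.
def failed_trunk(destination):
    raise TrunkUnavailable("no response")


def healthy_trunk(destination):
    return f"connected to {destination}"


trunks = [("pool-a", failed_trunk), ("pool-b", healthy_trunk)]
used, result = place_call(trunks, "alice@example.com")
print(used, "->", result)
```

In this example the first trunk fails, so the call is placed through the second. If every trunk fails, the caller receives a single error that lists each attempt.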

Each integration is configured resiliently with multiple trunk connections that have "try and continue" rules. Depending on the integration type, dial plan rules or DNS are used to point to multiple destinations so that if one destination has failed, the call fails over to the next. The dial plan on the Acano Solution is flexible enough to support different SIP trunks configured at different Acano instances in the distributed system. For example, Acano can be trunked at each datacenter to the local Lync Front-End pool.

NETWORK SURVIVAL

Another concern is dealing with changing network conditions or network failures. Acano primarily uses TURN/STUN/ICE to perform NAT traversal and to route media optimally between locations. This applies between Acano clients and between WebRTC and Acano call bridges, but also, importantly, between Acano instances in a distributed system.

A TURN server is an integrated optional component of every Acano instance, so adding TURN servers (to steer media) is just a matter of spinning up another Acano instance and enabling only that component.

ICE is dynamic: the system constantly checks multiple potential network connections between Acano instances and uses the one it considers best. In most cases the system will choose the path with the lowest network delay. If that network path has an issue, ICE can automatically try to route a different way. Each call bridge is aware of multiple TURN servers, so that if the nearest fails another can be chosen.

CONCLUSION

A superior user experience has driven the design of Acano's platform unifying audio, video, web conferencing and instant messaging. Resiliency is a key factor in user experience, and has been addressed comprehensively:

• within the software
• in the network used to communicate between the components of the system
• at integration points with other equipment

Acano's resiliency architecture is also a source of its scalability. Together, Acano's scalability and resiliency have enabled the solution to be deployed within the largest, most communication-sensitive organizations in the world.

Key differentiators for resiliency in the Acano platform:

• The Acano Solution is a distributed, shared-nothing architecture. Each individual instance of Acano is an identical software image, running either in a VM or on an Acano Server, that can be configured to form a single system.
• The software is independent of the underlying platform, which allows unique flexibility: a system can be built from instances running on any mixture of virtual machines and Acano Servers. This gives the best of both.
• If any core instance or instances fail, only the calls currently on those instances are affected and the rest of the system carries on.
• Failure of an edge instance leads to calls being routed via an alternative edge instance.
• For secure data management, all persistent state of the system is held in a fully distributed and redundant SQL database.
• By default all calls and conferences are fully distributed; media processing for a single call or conference will be split across multiple instances, which can be co-located in the same datacenter or geographically distributed.
• Each integration with a third party is configured resiliently with multiple trunk connections that have "try and continue" rules.
• To address changing network conditions or network failures, Acano uses TURN/STUN/ICE to perform NAT traversal and to route media optimally between locations.

The whole is even greater than the sum of these parts. By providing resiliency at every opportunity, Acano's system offers a robust approach for IT administrators and a satisfying user experience for call participants.

© 2014 Acano (UK) Ltd. All rights reserved. This document is provided for information purposes only and its contents are subject to change without notice. This document may not be reproduced or transmitted in any form or by any means, for any purpose other than the recipient's personal use, without our prior written permission. Acano and cospace are trademarks of Acano. Other names may be trademarks of their respective owners.