Scalable ejabberd. Konstantin Tcepliaev. Moscow Erlang Factory Lite, June 2012

ejabberd
- XMPP (previously known as Jabber) IM server
- Erlang/OTP
- Mnesia for temporary data (sessions, routes, etc.)
- Mnesia or ODBC for persistent data
- Modular

Too naive
- Works excellently on one or two always-on machines with several thousand connections
- When it comes to a pretty large installation (tens of machines, any of which may fail; hundreds of thousands of connections), problems start to emerge

ejabberd at Yandex
- XMPP IM server
- Backend for the web chat service
- Transport for the iPhone/Android mobile mail application
- Push notifications for Yandex.Disk

Scalability issues
- Direct message passing to PIDs without any guarantees
- Memory consumption: large #state{} records in client processes, separate TCP receivers
- Mnesia-based storage

Message passing problem
- If a remote node goes down, all messages sent there are lost
- In Erlang, PIDs are not invalidated: if the process has already exited, your message is lost too
- There is no way to know whether a message was received or not
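
A minimal shell illustration of the last point (not from the talk; the pid value and shell output are illustrative): sending to a PID whose process has already exited succeeds silently, so the sender gets no indication that the message was dropped.

    1> Pid = spawn(fun() -> ok end).   %% process exits immediately
    <0.84.0>
    2> is_process_alive(Pid).
    false
    3> Pid ! {route, <<"hello">>}.     %% no error, but nobody will ever receive this
    {route,<<"hello">>}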

Trial and error
- First thought: gen_fsm:sync_send_all_state_event/3
- Involves receiving a reply, which is yet another message
- Synchronous: calling an already dead process means waiting for the timeout
- Decision: unacceptable

Trial and error
- Second try: use node/1 to distinguish local and remote PIDs (see the sketch below)
- Local: check is_process_alive/1, send directly if true, fail over if false
- Remote: send the message to {ejabberd_sm, node(Pid)} and handle it there
- Downside: ejabberd_sm becomes a bottleneck
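
A minimal sketch of this routing scheme (not the actual ejabberd_sm code); failover/1 is a hypothetical hook for re-routing or offline storage.

    -module(route_sketch).
    -export([route_to_session/2]).

    %% Local PID: check liveness and either deliver directly or fall back.
    route_to_session(Pid, Msg) when node(Pid) =:= node() ->
        case is_process_alive(Pid) of
            true  -> Pid ! Msg;
            false -> failover(Msg)
        end;
    %% Remote PID: hand the message to the session manager on that node
    %% and let it do the same check there.
    route_to_session(Pid, Msg) ->
        {ejabberd_sm, node(Pid)} ! {route_check, Pid, Msg}.

    failover(_Msg) ->
        ok.  %% placeholder: in reality, re-route to another session or store offline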

Solution
- Final version: treat different stanza types differently (sketch below)
- <message/> stanzas are critical and should be sent as described above
- <iq/> and <presence/> are not that important and may be sent as usual
- Further improvements for the SM: process_flag(priority, high), or replace it with a process pool
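
A sketch of the split delivery path, building on the routing helper above; the #xmlel{} record is illustrative, not ejabberd's exact element representation.

    -record(xmlel, {name, attrs = [], children = []}).  %% illustrative XML element record

    %% <message/> goes through the checked path; everything else is best-effort.
    route(Pid, #xmlel{name = <<"message">>} = Stanza) ->
        route_sketch:route_to_session(Pid, {route, Stanza});
    route(Pid, Stanza) ->
        Pid ! {route, Stanza}.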

Memory consumption
- The client (C2S) process state consists of ~30 fields, of which ~10 are used only during stream negotiation
- Subscription lists are always stored in the state
- The receiver is a separate process from C2S: two processes per client

Large state
- Split the C2S callback module in two, each with its own state
- gen_fsm for stream negotiation, gen_server afterwards
- gen_server:enter_loop(ejabberd_c2s, [], State, hibernate)
- Free bonus: the 'hibernate' option triggers garbage collection
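
A rough sketch of the handoff: once the stream is negotiated, the gen_fsm builds a slimmed-down state and takes over the same process as a gen_server via enter_loop, which never returns. strip_negotiation_fields/1 and the event names are hypothetical.

    %% Final gen_fsm state, reached once stream negotiation succeeds.
    session_established(negotiation_done, NegotiationState) ->
        SessionState = strip_negotiation_fields(NegotiationState),
        %% Re-enter the same process as a gen_server; 'hibernate' also
        %% forces a garbage collection right away.
        gen_server:enter_loop(ejabberd_c2s, [], SessionState, hibernate).

    strip_negotiation_fields(NegotiationState) ->
        NegotiationState.  %% placeholder: drop the ~10 negotiation-only fields here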

Subscription lists
- Subscription lists store information about the contacts with which a client should exchange presence information
- There is no need for them while the user doesn't send presences
- Solution: retrieve them only when a presence is sent (sketch below)
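
A minimal sketch of the lazy retrieval, with a hypothetical load_subscriptions/1 backend call and an illustrative, much reduced state record.

    -record(state, {jid, subscriptions = undefined}).

    %% Fetch the subscription list only the first time the user sends presence.
    handle_presence(Presence, #state{subscriptions = undefined, jid = JID} = State) ->
        Subs = load_subscriptions(JID),
        handle_presence(Presence, State#state{subscriptions = Subs});
    handle_presence(Presence, #state{subscriptions = Subs} = State) ->
        broadcast_presence(Presence, Subs),
        State.

    load_subscriptions(_JID) -> [].               %% placeholder for the real roster query
    broadcast_presence(_Presence, _Subs) -> ok.   %% placeholder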

Separate receiver process
- C2S and the receiver share the same TCP socket
- C2S sends data to the socket
- The receiver controls the socket, waits for {tcp, ...} messages, decodes XML and passes it to C2S
- Solution: eliminate the receiver process and let C2S control the socket and do all the work itself (sketch below)
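
A minimal sketch of a C2S process that owns its socket directly in {active, once} mode; parse_and_handle/2 stands in for the XML decoding and routing the separate receiver used to do, and the state record is illustrative.

    -record(state, {socket}).  %% illustrative, far smaller than the real C2S state

    %% gen_server callback: the C2S process now receives TCP data itself.
    handle_info({tcp, Socket, Data}, #state{socket = Socket} = State) ->
        NewState = parse_and_handle(Data, State),      %% decode XML, process stanzas
        ok = inet:setopts(Socket, [{active, once}]),   %% re-arm for the next packet
        {noreply, NewState};
    handle_info({tcp_closed, Socket}, #state{socket = Socket} = State) ->
        {stop, normal, State}.

    parse_and_handle(_Data, State) -> State.           %% placeholder for parsing/routing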

Memory: results
- Average memory consumption per client connection has dropped by ~40%, to ~500 kB
- Database load dropped slightly

Mnesia
- Transactional: doesn't scale past a certain limit
- Brain-splits are fatal
- Node deaths are fatal
- Any inconsistency is fatal

MongoDB
- Eventual consistency
- Replica sets for failover
- Sharding for load balancing
- Self-healing after brain-splits
- Theoretically unlimited scalability

Late 2010
- Mongo 1.2: no replica sets, unstable sharding
- Manual sharding in the SM, using phash2 on usernames (sketch below)
- Replica pairs with a master-to-master scheme
- Writes and reads both go to one replica; the other is used as failover
- ETS caching of local sessions on each node
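
A minimal sketch of username-based manual sharding with erlang:phash2/2; the shard names are illustrative.

    -define(SHARDS, {mongo_shard1, mongo_shard2, mongo_shard3, mongo_shard4}).

    %% phash2(Term, Range) returns a stable value in 0..Range-1, so the same
    %% username always maps to the same shard.
    shard_for(Username) ->
        N = erlang:phash2(Username, tuple_size(?SHARDS)),
        element(N + 1, ?SHARDS).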

Early 2012
- Mongo 2.0.1
- The requests-per-second limit has been reached
- Master-to-master has become obsolete
- Replication deadlocks: hard to investigate and impossible to predict

Trying replica sets
- All database requests seem to need consistent data, so all operations are done on the master nodes
- Secondaries are kept for failover
- Mnesia removed completely
- Problem: RPS on the master is still the same, and requests time out at load peaks

MongoDB: final solution
- Session data doesn't actually always need to be consistent
- Contacts' sessions are retrieved from the master only to send them initial presence
- All other reads are done on secondaries, using the slaveOk option (sketch below)
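
A minimal sketch of the split read path; read/3 is a hypothetical wrapper around the MongoDB driver, parameterised by read preference, and the collection and selector shapes are illustrative.

    %% Regular session lookups: slightly stale data is acceptable.
    find_session(Username) ->
        read(slave_ok, sessions, {user, Username}).

    %% Contacts' sessions for initial presence: must be fresh, read from the primary.
    find_contact_session(ContactUsername) ->
        read(primary, sessions, {user, ContactUsername}).

    read(_ReadPref, _Collection, _Selector) ->
        not_implemented.  %% placeholder: dispatch to the driver with the chosen read preference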

What it looks like now
- Reading from secondaries: ~60% of requests
- Writing to primaries, with sharding
- The cluster easily survives brain-splits, network outages, node downtime and massive user reconnects
- Huge potential for horizontal scaling

Questions? f355@yandex-team.ru