Migrating massive monitoring to Bigtable without downtime. Martin Parm, Infrastructure Engineer for Monitoring

1 Migrating massive monitoring to Bigtable without downtime Martin Parm, Infrastructure Engineer for Monitoring

2 This is a big deal. -- Nicholas Harteau/VP, Engineering & Infrastructure

3 It all comes down to the core business
Spotify is...
- a music streaming company
- an entertainment company
- an advertisement company
- a big data company
but Spotify isn't...
- a data center company
- a hardware company
- an infrastructure company

4 We want to move our engineers up in the stack to solve more Spotify specific problems

5 Let's talk about operational monitoring

7 The Spotify monitoring pipeline
[Diagram: Metrics, ffwd, Apache Kafka, Heroic, Graphs + Alerts]
Users shouldn't worry about the details of the monitoring pipeline

8 Heroic: A Time Series Database
[Diagram: Apache Kafka, three Consumer nodes, API, Graphs + Alerts]
- Consumer nodes persist and index incoming data points
- API nodes evaluate queries and perform aggregations

9 Heroic: A Time Series Database
[Diagram: API nodes, Graphs + Alerts]
- Every physical site hosts its own databases and API nodes
- API nodes communicate internally and act as one big cluster

10 Heroic on Cassandra (November 2015)
The time series data
- ~100 million time series
- Collected ~212k data points/sec
- ~500 active users
- ~1500 graph dashboards
- ~750 graph-based alert definitions
The Cassandra clusters
- 6 Cassandra clusters spread across 4 data centers and 2 GCP with a total of 283 nodes
- Replication factor 3 for availability and bandwidth
- Storage growth rate: ~10TB/month
- Each data point is just 20 bytes

11 Expanding a Cassandra cluster requires data relocation
- New hosts can't participate until data has been fully copied
- Existing hosts can't free up data until data has been fully relocated
- Existing hosts must keep serving regular requests while data is being copied

12 Expanding a Cassandra cluster requires data relocation
- New hosts can't participate until data has been fully copied
- Existing hosts must keep serving regular requests while data is being copied
- Existing hosts can't free up data until data has been fully relocated
This holds true for virtually any distributed database where data ownership is strongly tied to nodes

13 Scaling databases became an increasing operational burden

15 Bigtable: Storage and compute are separated
[Diagram: GFS holding SSTables; Bigtable cluster]
- Data ownership is dynamically assigned to Bigtable nodes

16 Bigtable: Expanding Storage Capacity
[Diagram: GFS holding SSTables; Bigtable cluster]
- Additional storage is allocated in GFS as needed

17 Bigtable: Expanding Compute/Network Capacity
[Diagram: GFS holding SSTables; Bigtable cluster]
- Data ownership can be reassigned without copying data
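
The contrast between node-bound data and separated storage can be illustrated with a toy model. This is only a sketch under loud assumptions: the round-robin `assign` function, the tablet names, and the sizes are all hypothetical, and neither Cassandra nor Bigtable assigns ownership this way. The point is only that the same ownership change costs a data copy in one model and a metadata update in the other.

```python
def assign(tablet_ids, nodes):
    """Hypothetical round-robin ownership: tablet -> node."""
    return {t: nodes[i % len(nodes)] for i, t in enumerate(sorted(tablet_ids))}


def bytes_copied(sizes, before, after, shared_storage):
    """Bytes that must move between hosts when ownership changes.

    With shared storage, the data stays where it is and only the
    ownership mapping changes, so nothing is copied.
    """
    if shared_storage:
        return 0
    moved = [t for t in sizes if before[t] != after[t]]
    return sum(sizes[t] for t in moved)


# Six hypothetical tablets of 100 bytes each; cluster grows from 2 to 3 nodes.
sizes = {f"tablet-{i}": 100 for i in range(6)}
before = assign(sizes, ["a", "b"])
after = assign(sizes, ["a", "b", "c"])

node_bound = bytes_copied(sizes, before, after, shared_storage=False)
separated = bytes_copied(sizes, before, after, shared_storage=True)
print(node_bound, separated)
```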

18 Migrating Heroic to Google Cloud Bigtable

19 Step 1: Dark Loading Bigtable
[Diagram: Apache Kafka, Graphs + Alerts]
- Cassandra stays the source for graphs and alerting
- New data points are written to both databases
- Old data points get copied from Cassandra to Bigtable

20 Step 2: Flip The Switch
[Diagram: Apache Kafka, Graphs + Alerts]
- New data points are still written to both databases
- Switch to use Bigtable as the source for graphs and alerting
- As both databases are in sync, this is an invisible and atomic operation with no downtime
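
The two steps (dual writes plus backfill, then switching the read source) can be sketched roughly as follows. This is a minimal in-memory illustration, not Heroic's code: `DictStore`, `MigratingStore`, and all method names are hypothetical stand-ins for the Cassandra and Bigtable backends.

```python
class DictStore:
    """Hypothetical in-memory stand-in for a time series backend."""

    def __init__(self):
        self.points = {}  # (series, timestamp) -> value

    def write(self, series, timestamp, value):
        self.points[(series, timestamp)] = value

    def read(self, series):
        return sorted((ts, v) for (s, ts), v in self.points.items() if s == series)


class MigratingStore:
    """Writes go to both backends; reads come from one, chosen by a switch."""

    def __init__(self, old, new):
        self.old, self.new = old, new
        self.read_from = old  # step 1: the old store stays the read source

    def write(self, series, timestamp, value):
        self.old.write(series, timestamp, value)
        self.new.write(series, timestamp, value)

    def backfill(self):
        # Data points are immutable, so copying is idempotent and can
        # safely overlap with the ongoing dual writes.
        for key, value in list(self.old.points.items()):
            self.new.points.setdefault(key, value)

    def in_sync(self):
        return self.old.points == self.new.points

    def flip(self):
        # Step 2: only switch once both stores hold the same data.
        if not self.in_sync():
            raise RuntimeError("stores are not in sync yet")
        self.read_from = self.new

    def read(self, series):
        return self.read_from.read(series)


old = DictStore()
old.write("cpu", 1, 0.5)              # data that predates the migration
store = MigratingStore(old, DictStore())
store.write("cpu", 2, 0.7)            # new point, written to both stores
synced_before_backfill = store.in_sync()
store.backfill()
store.flip()
result = store.read("cpu")
```

Because readers only ever see one fully populated store, the switch itself is a single reference change, which is what makes it invisible to users in this sketch.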

21 Migrating Heroic to Google Cloud Bigtable
- Virtually no changes to the database schema
- 10 migration hosts, each scanning 10% of the Cassandra keyspace
- Data was copied at ~200MB/s and written to Bigtable at ~300k QPS
- We needed to pace the migration a bit to avoid breaking Cassandra
- The migration took ~1 month with no downtime for our users
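
One way to give each of 10 hosts 10% of the keyspace is to split the partitioner's token ring into equal contiguous ranges. The slides do not say how the split was actually done; the sketch below assumes Cassandra's Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1, and `split_token_ring` is a hypothetical helper.

```python
MIN_TOKEN = -2**63       # lowest Murmur3Partitioner token
MAX_TOKEN = 2**63 - 1    # highest Murmur3Partitioner token


def split_token_ring(num_workers):
    """Return num_workers contiguous, inclusive (start, end) token ranges."""
    total = MAX_TOKEN - MIN_TOKEN + 1  # number of tokens in the ring
    bounds = [MIN_TOKEN + (total * i) // num_workers for i in range(num_workers + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_workers)]


ranges = split_token_ring(10)
for worker, (start, end) in enumerate(ranges):
    print(f"worker {worker}: tokens {start} .. {end}")
```

Each worker could then scan its slice with a CQL predicate of the form `token(key) >= ? AND token(key) <= ?`, where `key` stands for the table's partition key column (a placeholder name here), and throttle its own read rate to pace the migration.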

22 Heroic in Google Cloud
[Diagram: Data center and Google Cloud, with Apache Kafka, Consumer and API nodes, linked by HTTPS and a VPN tunnel]

23 Key points to a smooth migration
- Our flexible pipeline, which was already prepared to support future migrations
- Immutable data is easier to dark load
- Our microservice infrastructure allowed us to move the service in pieces and rewire the traffic

24 Heroic on Google Cloud Bigtable (Nov 2016)
- 12 Google Cloud Bigtable clusters spread across 3 GCP regions with a total of 150 nodes
- 252TB of monitoring data
- Collection rate: ~20TB per month
- Keep in mind: each data point is just 20 bytes
- ~424K data points per second
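
As a quick sanity check, the monthly growth figures follow from the ingest rate and the 20-byte point size. The calculation below assumes a 30-day month and 1 TB = 10^12 bytes, and it ignores replication, indexing overhead, and compression, so it is only a rough estimate of raw data volume.

```python
BYTES_PER_POINT = 20                  # from the slides
SECONDS_PER_MONTH = 30 * 24 * 3600    # assumption: 30-day month


def raw_tb_per_month(points_per_second):
    """Raw data volume per month in TB (10**12 bytes), before any overhead."""
    return points_per_second * BYTES_PER_POINT * SECONDS_PER_MONTH / 1e12


print(round(raw_tb_per_month(212_000), 1))  # November 2015 rate -> 11.0
print(round(raw_tb_per_month(424_000), 1))  # November 2016 rate -> 22.0
```

Both results are close to the ~10TB/month and ~20TB/month figures quoted on the slides.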

26 Thank you for your time and patience! Martin Parm

27 Open Source software mentioned: Apache Kafka, ffwd, ffwd-java, Heroic, Cassandra, ElasticSearch
