Tools for Social Networking Infrastructures
Cassandra - a decentralised structured storage system

Problem: Facebook Inbox Search
- hundreds of millions of users
- distributed infrastructure
- inboxes change constantly
- must be easily scalable
- must deal with failures
- must keep the cost low
Existing solutions
- Coda, Ficus
- Google File System
- Bayou
- Dynamo
- Bigtable
Data model
- Table: a multidimensional map indexed by a key
- Row key: a string with no size restrictions, typically 16-36 bytes
- Column family: a group of columns
- Super column family: a group within a group of columns
- Sorting: columns can be sorted by name or by timestamp
- A single cluster can contain multiple tables
System architecture
- a read/write request can go to any node in the cluster; that node determines the replicas for the key
- write: the request is routed to the replicas, and the system waits until a quorum has acknowledged it
- read:
  - weak consistency: the first response is sent back to the client
  - strong consistency: wait for a quorum of responses
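The quorum read/write logic above can be sketched as a toy model (illustrative only, not Cassandra's code; the `Replica` class and timestamp-based reconciliation are assumptions):

```python
# Toy model of quorum-based writes and reads (illustrative, not Cassandra's
# code; the Replica class and timestamp reconciliation are assumptions).

class Replica:
    def __init__(self):
        self.store = {}                      # key -> (timestamp, value)

    def write(self, key, ts, value):
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)
        return True                          # acknowledge the write

    def read(self, key):
        return self.store.get(key)

def quorum(n):
    return n // 2 + 1                        # smallest majority of n replicas

def quorum_write(replicas, key, ts, value):
    """Route the write to all replicas; succeed once a quorum acknowledges."""
    acks = sum(1 for r in replicas if r.write(key, ts, value))
    return acks >= quorum(len(replicas))

def quorum_read(replicas, key, strong=True):
    """Weak read returns the first response; strong read waits for a quorum
    and returns the newest value among the responses."""
    responses = [resp for r in replicas if (resp := r.read(key)) is not None]
    if not responses:
        return None
    if not strong:
        return responses[0][1]               # weak: first answer wins
    if len(responses) < quorum(len(replicas)):
        return None
    return max(responses)[1]                 # strong: newest timestamp wins
```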
System architecture - partitioning
- consistent hashing: each node is given a randomly assigned position on the ring
- the arrival or departure of a node has low impact, affecting only its neighbours
- load balancing: lightly loaded nodes move on the ring to alleviate heavily loaded ones
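A minimal sketch of the consistent-hashing ring: nodes sit at positions on the ring and a key belongs to the first node at or after its position (MD5 as the ring hash is an illustrative choice, not prescribed by the paper):

```python
# Sketch of consistent hashing (illustrative; MD5 as the ring hash is an
# assumption, not the paper's choice).
import hashlib

def ring_position(name):
    """Map a node name or key onto the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node gets a (pseudo-randomly) assigned position on the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key):
        """First node at or after the key's position, wrapping around."""
        pos = ring_position(key)
        for node_pos, node in self.ring:
            if node_pos >= pos:
                return node
        return self.ring[0][1]               # wrap around the ring
```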
System architecture - replication
- provides high availability and durability
- a coordinator node is in charge of the replicas
- various replication options; e.g. "Rack Unaware" uses the N-1 consecutive nodes that follow the coordinator on the ring
- replication metadata is kept in Zookeeper
- replication across multiple data centers allows surviving a crash with no downtime
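The "Rack Unaware" policy can be illustrated as picking the coordinator's successors on the ring (a sketch; the ring is modelled as a plain sorted list of node names):

```python
# Sketch of "Rack Unaware" replica placement: the coordinator plus the
# N-1 nodes that follow it on the ring (illustrative model of the ring
# as a sorted list).

def rack_unaware_replicas(ring, coordinator_index, n):
    """Return n replicas: the coordinator and its n-1 successors, wrapping."""
    return [ring[(coordinator_index + i) % len(ring)] for i in range(n)]
```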
System architecture - membership
- Scuttlebutt: an efficient Gossip-based mechanism
- the Φ Accrual Failure Detector module emits a value (Φ) which represents the suspicion level for a node
- Φ is calculated from the arrival times of gossip messages, modelled with an exponential distribution
- a threshold on Φ determines when a node is suspected to be down
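Under the exponential model, the probability of seeing no message for time t is P(T > t) = exp(-t/mean), so Φ = -log10 P = t / (mean · ln 10). A sketch (the sliding window handling and the threshold value are illustrative assumptions):

```python
# Sketch of the Phi accrual failure detector with an exponential model of
# gossip inter-arrival times (the window handling and threshold of 8 are
# illustrative assumptions).
import math

def phi(time_since_last, mean_interval):
    # P(T > t) = exp(-t / mean)  =>  phi = -log10 P = t / (mean * ln 10)
    return time_since_last / (mean_interval * math.log(10))

def suspect(time_since_last, recent_intervals, threshold=8.0):
    """Suspect a node when its silence is implausibly long given history."""
    mean = sum(recent_intervals) / len(recent_intervals)
    return phi(time_since_last, mean) > threshold
```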
System architecture - persistence
- data is stored both in memory and on the file system
- a write consists of:
  1. file system: appending to the commit log
  2. memory: updating an in-memory data structure
- when the in-memory structure crosses a size threshold, it is saved to a data file on disk
- all writes are sequential and generate an index for lookup
- a merge process runs in the background to collate the data files
- lookup: first check memory, then check the files on disk from newest to oldest
- a bloom filter is used to check whether a file can contain the key
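The write and lookup paths above can be modelled in a toy store (illustrative only; in reality the commit log, data files, and bloom filters live on disk, and a real bloom filter admits false positives):

```python
# Toy model of the persistence path: commit log + in-memory table, flushed
# to immutable "files" at a threshold, with a bloom-filter stand-in per file
# (illustrative, not Cassandra's code; the threshold is an assumption).

class ToyStore:
    FLUSH_THRESHOLD = 4

    def __init__(self):
        self.commit_log = []     # durable, append-only (simulated)
        self.memtable = {}       # in-memory data structure
        self.data_files = []     # list of (key_set, data), newest last

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. sequential log append
        self.memtable[key] = value             # 2. memory update
        if len(self.memtable) >= self.FLUSH_THRESHOLD:
            self._flush()

    def _flush(self):
        # Dump the memtable to an immutable "file" with a bloom stand-in.
        self.data_files.append((set(self.memtable), dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:                       # memory first
            return self.memtable[key]
        for bloom, data in reversed(self.data_files):  # newest to oldest
            if key in bloom:                           # filter gate (exact
                return data.get(key)                   # in this toy version)
        return None
```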
Practical experiences
- data was moved from MySQL to Cassandra using Map/Reduce
- Inbox Search: 7 TB of inbox data for over 100M users; the deployed cluster stores 50+ TB on 150 nodes
- different failure detectors produce very different detection times: the Φ accrual detector detects failures in about 15 s versus over 120 s for other detectors
- Cassandra is decentralised, but uses Zookeeper for some coordination
Kafka: a Distributed Messaging System for Log Processing

Problem: managing large amounts of log data
- log processing has become a critical component of the data pipeline for consumer internet companies
- activity data is part of the production data pipeline and is used directly in site features
- every day, China Mobile collects 5-8 TB of phone call records and Facebook gathers almost 6 TB of various user activity events
- the system should be distributed, scalable, and offer high throughput
- log consumption should be possible in real time
Existing solutions
- early systems for processing this kind of data relied on physically scraping log files off production servers
- most systems are designed for collecting and loading log data into a data warehouse for offline consumption
- systems that allow online consumption are usually overcomplicated, which results in lower performance
- nearly no existing systems allow a pull model
Architecture
- publish/subscribe system
- producers send messages to topics
- consumers consume from topics
- messages are transferred via brokers
- to balance load, a topic is divided into partitions and each broker stores one or more of those partitions
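Spreading a topic's messages over its partitions can be sketched as hashing a message key onto a partition index (illustrative; Kafka's actual partitioning function is configurable, and MD5 here is an assumption):

```python
# Sketch of mapping a message key to one of a topic's partitions
# (illustrative; the hash choice is an assumption).
import hashlib

def partition_for(key, num_partitions):
    """Deterministically spread keys across a topic's partitions."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```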
API
- messages can be batched together to reduce overhead
- multiple producers and consumers can publish and retrieve messages at the same time
- messages are evenly distributed among consumer streams

Sample producer code (from the paper):

    producer = new Producer(...);
    message = new Message("test message str".getBytes());
    set = new MessageSet(message);
    producer.send("topic1", set);

Sample consumer code (from the paper):

    streams[] = Consumer.createMessageStreams("topic1", 1);
    for (message : streams[0]) {
        bytes = message.payload();
        // do something with the bytes
    }
Architecture - storage and transfer

Simple storage
- each partition corresponds to a logical log
- messages have no id other than their offset in the log file
- messages are consumed in order
- the consumer, not the broker, keeps the consumption state
- messages are deleted after a retention period

Efficient transfer
- a single pull request retrieves multiple messages
- the Linux sendfile API avoids copies through the application layer
- no application-level cache
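A toy model of a partition as a logical log in which a message's only id is its byte offset, and the consumer tracks the next offset itself (the 4-byte length-prefix framing is an assumption for illustration):

```python
# Toy model of a partition log: a message's id is its byte offset, and
# read() hands the consumer the next offset so the broker stores no
# consumption state (framing format is an illustrative assumption).

class PartitionLog:
    def __init__(self):
        self.buf = bytearray()

    def append(self, payload):
        """Append a length-prefixed message; return its offset (its id)."""
        offset = len(self.buf)
        self.buf += len(payload).to_bytes(4, "big") + payload
        return offset

    def read(self, offset):
        """Return (payload, next_offset); the consumer keeps next_offset."""
        size = int.from_bytes(self.buf[offset:offset + 4], "big")
        start = offset + 4
        return bytes(self.buf[start:start + size]), start + size
```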
Distributed coordination
- all messages from one partition are consumed by a single consumer
- consumers coordinate using Zookeeper:
  - detecting consumer/broker changes
  - triggering the rebalance process
  - keeping track of the consumed offset
- rebalancing happens on any broker/consumer change

Delivery guarantees
- at-least-once delivery
- in-order delivery within one partition
- no broker redundancy
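The at-least-once guarantee falls out of committing the consumed offset only after processing: a crash between the two steps replays messages rather than losing them. A sketch (the dict stands in for Zookeeper's offset store; nothing here is Kafka's actual API):

```python
# Sketch of at-least-once consumption: process first, commit the offset
# after (illustrative; the dict stands in for Zookeeper's offset store).

def consume(messages, commit_store, process):
    offset = commit_store.get("offset", 0)   # resume from last commit
    while offset < len(messages):
        process(messages[offset])            # side effect happens first...
        offset += 1
        commit_store["offset"] = offset      # ...commit after: a crash in
    return offset                            # between causes redelivery
```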
Kafka usage at LinkedIn
- a Kafka cluster is co-located with each datacenter
- services publish to their local Kafka brokers
- a hardware load balancer distributes publish requests evenly
- online consumers run within the same datacenter
- a separate datacenter is used for offline analysis

Statistics:
- end-to-end latency of about 10 seconds
- hundreds of gigabytes of data and close to a billion messages per day
Experimental results
Workload Analysis of a Large-Scale Key-Value Store

Memcached: a distributed hash table
- typical usage is as a caching layer in the data-retrieval hierarchy
- Memcached exposes data held in RAM to clients over the network
- capacity is expanded by adding RAM or more servers
- consistent hashing determines which server holds each key
- stored items can have different sizes
- memory is divided into slab classes, and objects are stored in the matching class
- LRU is used for cache eviction
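Slab allocation can be sketched as choosing the smallest class whose chunk size fits the object (illustrative; the geometric growth factor and class bounds here are assumptions, not memcached's defaults):

```python
# Sketch of slab-class selection: memory is carved into classes of fixed
# chunk sizes and an object goes into the smallest class that fits it
# (growth factor and bounds are illustrative assumptions).

def slab_classes(smallest=64, factor=2, largest=1024 * 1024):
    sizes = []
    size = smallest
    while size <= largest:
        sizes.append(size)
        size *= factor
    return sizes

def class_for(item_size, classes):
    """Smallest slab class whose chunk size fits the item."""
    for chunk in classes:
        if item_size <= chunk:
            return chunk
    return None   # too large to cache
```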
Methodology
- a kernel module was used for sniffing the traffic
- the captured traces are 3-7 TB
- Apache HIVE was used for the analysis
- the traces were compared with server logs for verification
Pools used in the study
(table of the analysed Memcached pools, not reproduced here)
Key and value size distributions for all traces
- the sizes of keys, up to Memcached's limit of 250 B (not shown)
- the sizes of values
- value sizes aggregated by the total amount of cache space they use
Temporal patterns
Figure 3: request rates at different dates and times of day, Coordinated Universal Time (UTC).
Cache behaviour
- hit rates and reasons for misses
- locality: repeating keys
- locality over time: how many keys do not repeat within a time window
- reuse period: the time between consecutive accesses to a key
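Reuse periods can be computed from a trace as the gaps between consecutive accesses to each key (a sketch; the `(timestamp, key)` trace format is an assumption):

```python
# Sketch of computing reuse periods from a trace of (timestamp, key)
# accesses (the trace format is an illustrative assumption).
from collections import defaultdict

def reuse_periods(trace):
    last_seen = {}
    periods = defaultdict(list)
    for ts, key in trace:
        if key in last_seen:
            periods[key].append(ts - last_seen[key])   # gap since last access
        last_seen[key] = ts
    return dict(periods)
```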
Hit rates & miss categories
Table 3: miss categories in the last 24 hours of the ETC trace.
Statistical modelling
Discussion
- hit rates are inversely correlated with pool size
- hit rates are not correlated with locality
- possible hit-rate improvements: increasing RAM, or a different cache-eviction policy
Sidenote: making clear graphs
Questions
1. The main concern of Cassandra is write throughput; what are the tradeoffs?
2. Why is Kafka faster than the other services to which it was compared?
3. What are the possible causes of data loss in Kafka?
4. What are the advantages of using a consensus service (Zookeeper) versus a replicated master node?
5. How is churn handled in the different systems?
6. What is the problem with using Memcached as persistent storage (e.g. the USR pool)?