Big data, little time. Scale-out data serving. Highly skewed key popularity

Virginia Patterson
5 years ago
/7/6

The Case for RackOut: Scalable Data Serving Using Rack-Scale Systems
Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, Babak Falsafi, Boris Grot

Big data, little time
- Goal is to keep (hot) data in memory
- Requires scale-out approach
  - Each server responsible for one chunk
  - Fast access to local data
  - Computation/serving performed in parallel
Scale-out model offers plenty of DRAM & fast local access

Scale-out data serving
- Central to many web applications, e.g., social networks, e-commerce
- Data sharded using consistent hashing
  - Client pinpoints server based on hash
  - Each server hosts a collection of micro-shards
  - Micro-shard = set of key-value pairs
Fast data lookup based on client-side consistent hashing

Highly skewed key popularity
- Skewed access distribution, typically Zipfian
- Shard skew: skew across servers
  - Shard_skew = MAX/AVG
- [Figure: popularity over keys (billions) and load over servers (hundreds), with MAX and AVG load marked; keys map to servers via hash(key)]
Skewed popularity translates to load imbalance

Why is skew problematic?
- Hottest server saturates while most servers are barely utilized
- Service Level Objective (SLO) violations can occur below the saturation level
- [Figure: 99th percentile latency vs. arrival rate (load), with the saturation level marked; labels λi, μ]
SLO limits utilization of DC resources
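
The client-side lookup and the Shard_skew = MAX/AVG metric from the slides above can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the names `stable_hash`, `HashRing`, `points_per_server`, and `shard_skew` are hypothetical, SHA-1 is an arbitrary choice of stable hash, and treating ring points as stand-ins for micro-shards is an assumption.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a 64-bit integer that is stable across runs."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Client-side consistent-hashing ring (illustrative sketch only)."""

    def __init__(self, servers, points_per_server=64):
        # Each server owns several points on the ring. In this sketch the
        # points play the role of micro-shards (an assumption).
        self._ring = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the server that owns the first ring point after hash(key)."""
        index = bisect.bisect_right(self._points, stable_hash(key))
        return self._ring[index % len(self._ring)][1]  # wrap around the ring


def shard_skew(loads):
    """Shard_skew = MAX / AVG over the per-server loads."""
    return max(loads) / (sum(loads) / len(loads))
```

A client would build the ring once, e.g. `HashRing(["server-1", "server-2"])`, and call `lookup(key)` per request. A perfectly balanced cluster has a shard skew of 1; the larger the value, the earlier the hottest server saturates relative to the rest.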

How do we deal with skew today?
- Dynamic migration & replication techniques
- Dynamic replication is a trade-off that requires:
  - Monitor load and detect load bursts
  - Replicate/migrate hot micro-shard(s)
  - Balance load across replicas
Mitigate load imbalance using migration or multiple replicas

Dynamic migration/replication
- Higher skew translates to more replicas
  - s of replicas in social networking workloads [Huang ]
- Dynamic replication & migration require:
  - Load monitoring, copying/moving data & metadata updates
  - Additional memory & consistency model
Higher skew → higher replication overhead!

Insight: fewer nodes result in smaller skew
- Scaling in, as opposed to scaling out, reduces shard skew
  - Better load distribution
  - Fewer replicas needed
- [Figure: load across servers for two node counts; the hottest node is annotated "~% of total load (shard_skew = 3)" in each panel]
Fewer data shards result in less imbalance, less overhead

Contributions/Outline
- Analysis of load imbalance in data serving
  - Popularity skew translates to replication overhead
- RackOut: A technique for mitigating load imbalance
  - Reduces replication overhead
- Experimental methodology
  - Combination of queuing model and real implementation
- Evaluation using RDMA and Scale-Out NUMA [ASPLOS ]

RackOut rather than scale-out
- What if we could make nodes larger?
- Scale nodes to host more keys and absorb higher load
  - e.g., 6x fewer nodes → 6x more memory per node
  - scale to the size of a rack?
- [Figure: 99th percentile latency with the SLO marked, comparing scale-out and RackOut]
RackOut improves throughput w/ no replication cost
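
The insight that fewer, larger nodes see smaller shard skew can be checked numerically. The sketch below is an illustration under stated assumptions, not the deck's analysis: keys follow a Zipfian popularity with a hypothetical exponent `alpha`, keys are placed on servers by a SHA-1 hash, and merging `gf` consecutive servers into one larger node assumes the merged node can spread its load evenly. All function names are hypothetical.

```python
import hashlib


def zipf_popularity(num_keys, alpha=0.99):
    """Normalized Zipfian popularity; rank 1 is the hottest key."""
    raw = [1.0 / rank ** alpha for rank in range(1, num_keys + 1)]
    total = sum(raw)
    return [weight / total for weight in raw]


def server_loads(num_keys, num_servers, alpha=0.99):
    """Place keys on servers by hash and add up the popularity per server."""
    loads = [0.0] * num_servers
    for rank, weight in enumerate(zipf_popularity(num_keys, alpha)):
        digest = hashlib.sha1(str(rank).encode("utf-8")).digest()
        loads[int.from_bytes(digest[:8], "big") % num_servers] += weight
    return loads


def group_loads(loads, gf):
    """Merge every gf consecutive servers into one larger node."""
    if len(loads) % gf:
        raise ValueError("server count must be a multiple of gf")
    return [sum(loads[i:i + gf]) for i in range(0, len(loads), gf)]


def shard_skew(loads):
    """Shard_skew = MAX / AVG over the per-node loads."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    loads = server_loads(num_keys=10_000, num_servers=64)
    for gf in (1, 4, 16):
        print(gf, round(shard_skew(group_loads(loads, gf)), 2))
```

Merging can never increase the metric: the hottest merged node carries at most `gf` times the hottest server's load, while the average grows by exactly `gf`.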

Towards rack-scale building blocks
- Scaling up shared memory is expensive
  - Cost & complexity of HW cache coherence, fault containment
- Remote Direct Memory Access (RDMA)
  - Enables low-latency access to remote memory
  - Hardware transport, destination CPU not involved
  - e.g. InfiniBand over IP and lossless Ethernet (RoCE)
Is RDMA the new scale-up?

Extreme case: Full-scale RDMA
- DC-scale RoCE introduces emergent safety & perf. issues [Zhu, Guo 6]:
  - PFC-induced congestion
  - PFC deadlock
  - RDMA transport livelock
  - pause frame storm
  - etc.
- [Figure: TCP/IP vs. RDMA]

A hybrid approach: RackOut using RDMA
- Sweet spot between scale-out and full-scale RDMA
  - Enable sharing within rack ( super )
- RackOut follows scale-out model
  - Clients connect via network
  - Consistent hashing
  - Migration & replication still possible
    - Only across racks
- Grouping Factor (GF) defines the size of a rack
- [Figure: racks connected via TCP/IP, RDMA within each rack]
Reduce imbalance by using RDMA rack as building block

Concurrent Read Exclusive Write (CREW)
- Previously introduced for multicores [Lim ]
- [Figure: a request (R/W) for X; a randomly selected node reads shared X, writes go to the owner of X]
Shared read-only access to data, exclusive writes

Client-rack architecture of RackOut
- CREW enables load balancing of read operations
- Replication only across racks
- [Figure: client maps key → rack, then picks a server S at random, Ū [..GF]; servers within a rack are connected via RDMA]
Enable load balancing using RDMA via CREW
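
The CREW request path described above (reads served by any node in the owner's rack, writes only by the owner) can be sketched as a small router. This is a hedged illustration rather than RO-KVS code: `RackOutRouter` and its methods are hypothetical names, modulo hashing stands in for consistent hashing to keep the example short, and the labels LR/RR/LW follow the deck's local read, remote read, and local write service classes.

```python
import hashlib
import random


class RackOutRouter:
    """Route requests under CREW: concurrent reads, exclusive writes."""

    def __init__(self, num_servers, gf, seed=None):
        if num_servers % gf:
            raise ValueError("num_servers must be a multiple of gf")
        self.num_servers = num_servers
        self.gf = gf  # Grouping Factor: number of servers per rack
        self._rng = random.Random(seed)

    def owner(self, key: str) -> int:
        """Server that owns the key (modulo hashing as a simplification)."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_servers

    def route(self, key: str, op: str):
        """Return (target server, service class) for a read or write."""
        owner = self.owner(key)
        if op == "write":
            return owner, "LW"  # writes are exclusive to the owner
        if op != "read":
            raise ValueError("op must be 'read' or 'write'")
        rack = owner // self.gf
        # Reads go to a uniformly random server within the owner's rack.
        target = rack * self.gf + self._rng.randrange(self.gf)
        # The owner reads locally; any other server reads over RDMA.
        return target, ("LR" if target == owner else "RR")
```

With `gf = 1` every read is a local read on the owner, which reduces to the plain scale-out model. Larger values spread the reads for a hot key across the rack at the cost of remote reads.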

Methodology
- Queuing model for modeling DC-scale RackOut
  - Input: node count, GF, read-write ratio, distribution, etc.
- A RackOut KVS implementation (RO-KVS)
1. Instrument model using platform's parameters
2. Validate model w/ actual measurements
3. Use model to evaluate arbitrary configurations

RackOut KVS (RO-KVS)
- Uses FaRM framework as foundation [Dragojevic, ]
  - Optimistic Concurrency Control (OCC)
- Runs on Mellanox RoCE and Scale-Out NUMA [ASPLOS ]
- Hash space divided among servers (in micro-shards)
- Servers can read from all micro-shards within their rack
- Both clients & servers maintain DHT of cluster size s

Queuing model for RackOut
- Discrete event-based simulation
- Poisson process, three service times, Zipfian
  - Local Read (LR); Remote Read (RR); Local Write (LW)
- -benchmarks for measuring service times
- YCSB workloads with skewed distributions
- [Figure: client issues requests as Poisson (λ); keys drawn from Zipf (α); hash maps key → rack (N/GF racks); server S chosen at random, Ū [..GF]; each request is R/W and served as LR, RR, or LW]
Fast & accurate modeling of arbitrary RackOut confs

Model validation (hottest rack)
- YCSB-B workload (% writes) on hottest group of 6 servers
- Dashed lines show platform results
- [Figure: 99th-pct latency (ms) vs. rack throughput (% of max. capacity), Model vs. RO-KVS for each GF]
Model provides accurate RackOut evaluation (<6% error)

Full-scale RackOut DC simulation
- Modeling -node datacenter (YCSB-B)
- [Figure: 99th-pct latency (ms) vs. throughput (% of max. capacity) for each GF]
RackOut improves TPS w/o violating SLO at DC scale
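
To make the queuing-model idea concrete, the sketch below simulates a single first-come-first-served server with Poisson arrivals and the three service classes (LR, RR, LW), then reports the 99th-percentile latency. It is a heavily simplified stand-in for the deck's DC-scale model, not a reproduction of it: the function names are hypothetical, the service times passed in are caller-supplied placeholders, the default 95% read fraction is taken from YCSB's definition of workload B, and the assumption that a read is local with probability 1/GF follows from uniform random server selection within a rack.

```python
import random


def read_write_mix(gf, read_fraction=0.95):
    """Request mix for one server when reads pick a rack member at random."""
    return {
        "LR": read_fraction / gf,              # read lands on the owner
        "RR": read_fraction * (1 - 1 / gf),    # read lands on a rack peer
        "LW": 1 - read_fraction,               # writes always go to the owner
    }


def p99_latency(arrival_rate, service_times, mix, num_requests=20_000, seed=1):
    """Simulate one FCFS server and return the 99th-percentile latency.

    arrival_rate is in requests per time unit; service_times maps each
    class to its service time in the same time unit.
    """
    rng = random.Random(seed)
    kinds = list(mix)
    weights = [mix[kind] for kind in kinds]
    clock = 0.0           # arrival time of the current request
    server_free_at = 0.0  # time at which the server finishes its backlog
    latencies = []
    for _ in range(num_requests):
        clock += rng.expovariate(arrival_rate)  # Poisson arrival process
        kind = rng.choices(kinds, weights)[0]
        start = max(clock, server_free_at)      # wait if the server is busy
        server_free_at = start + service_times[kind]
        latencies.append(server_free_at - clock)
    latencies.sort()
    index = min(len(latencies) - 1, (99 * len(latencies)) // 100)
    return latencies[index]
```

Sweeping `arrival_rate` toward the server's capacity and plotting the result gives the familiar latency-versus-load curve, and the load at which the curve crosses the SLO is the usable throughput for that configuration.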

RackOut is synergistic w/ replication
- Greedy dynamic migration and replication algorithm
- Accounts for consistent updates
- [Figure: throughput (% of max. capacity) vs. number of replications for each GF]
Replication consumes fewer resources w/ RackOut

Sensitivity to remote latency
- Lower RR/LR ratio → higher impact of RackOut
- [Figure: speedup (over GF) vs. RR/LR ratio for each GF, soNUMA and RoCE]
% higher speedup with soNUMA because of lower RR

Conclusion
- RackOut: an approach to scaling in using RDMA
- RackOut reduces skew and replication overheads
  - Requires fewer replicas to absorb high skew

RackOut platforms
1. Intel Xeon E with Mellanox ConnectX-3 (RoCE)
2. VMM-based soNUMA emulator
- Obtain service times for RackOut
  - Local Read, Remote Read, Local Write
  - outstanding (latency), N outstanding (TPS)
- LR = us; RR =.us; LW = 6.9us

Mellanox RDMA RackOut platform
- Intel Xeon E with Mellanox RDMA (RoCE)
- [Figure: RO-KVS on FaRM; YCSB (coordinator) and YCSB (load generators) connected through a network switch (GbE); aggregate TPS]
- [Figure: soNUMA rack (Xen VMM) with Soft RMC, FaRM, SR-IOV]
