Bohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong

Size: px

Start display at page:

Download "Bohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong"

Everett Cook
5 years ago
Views:

1 Bohr: Similarity Aware Geo-distributed Data Analytics Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong 1

2 Big Data Analytics Analysis Generate 2

3 Data are geo-distributed Frankfurt US Oregon Tokyo US California US Virginia Singapore Sydney San Paulo 3

4 Centralizing data is infeasible Moving data through wide area networks(wans) can be extremely slow. Data sovereignty laws. Frankfurt US Oregon US Virginia Tokyo US California 1. Vulimiri et al., NSDI Pu et al., SIGCOMM Viswanathan et al., OSDI Hsieh et al., NSDI 17 Singapore Sydney San Paulo 4

5 Processing queries in-place Bottleneck site exists. Frankfurt US Oregon Tokyo US California US Virginia Singapore Sydney San Paulo 5

6 Processing queries in-place System Workload Task / Data Placement Optimize aspect JetStream streaming no WAN bandwidth usage Geode batch data WAN bandwidth usage Iridium batch both query response time 6

7 Processing queries in-place Iridium (Pu et al. SIGCOMM 15) Recurring queries, abundant computation resource. Transfer data out from the bottleneck site. Re-schedule reducer tasks. 7

8 Iridium can be improved similarity oblivious data reduction ratio is constant Tokyo Input: Map2 (with combiner) Reduce2 Intermedia: UrlB,2 Intermed Site -iate record Tokyo 3 Input: Map1 (with combiner) Intermedia: UrlA,3 Input: Map3 (with combiner) Intermedia: UrlC,2 Oregon 1 Seoul 2 Reduce1 Reduce3 Total 6 Oregon Seoul 8

9 Iridium can be improved similarity oblivious data reduction ratio is constant Tokyo () Oregon Input: Map2 (with combiner) Reduce2 Intermedia: () Tokyo to Seoul () Tokyo to Oregon () Seoul Intermed Site -iate record Tokyo 2 Input: Map1 (with combiner) Reduce1 Intermedia: UrlA,3 Input: Map3 (with combiner) Reduce3 Intermedia: UrlC,2 Oregon 2 Seoul 3 Total 7 Oregon Seoul 9

10 Iridium can be improved similarity oblivious data reduction ratio is constant Tokyo () Oregon Input: Map2 (with combiner) Reduce2 Intermedia: () Tokyo to Seoul () Tokyo to Oregon () Seoul Intermed Site -iate record Tokyo 2 Input: Map1 (with combiner) Reduce1 Intermedia: UrlA,4 Input: Map3 (with combiner) Reduce3 Intermedia: UrlC,3 Oregon 1 Seoul 2 Total 5 Oregon Seoul 10

11 Bohr Design Key ideas: Recurring queries, abundant computation resource. Similarity aware data transfer. Measuring and predicting the data reduction ratio. 11

12 Bohr Design Site1 OLAP Cube Generation OLAP Cube1 Map1 Combine enabled probe similar dataset Site2 OLAP Cube Generation OLAP Cube2 Map2 Combine enabled Reduce1 Reduce2 final result query tasks Global Manager Job Queue 12

13 Why using OLAP Cube? Sorting dataset according to the attribute. Using record level similarity score. 13

14 Why using OLAP Cube? Advantages: Format the unformatted raw data. Multiple dimension cube for one dataset. 14

15 Similarity Search & Check Search: OLAP instruction(eg. dice, slice, roll up) to retrieve the needed attributes. site 1 site 2 Check: Cross site check: simple probes. The probe components. OLAP Cube1 Probe value(1) value(2) value(k) OLAP Cube2 15

16 Reduce tasks scheduling t : data transfer time. I : input data size. R: data reduction ratio. U: uplink bandwidth. D: downlink bandwidth. r : reduce tasks percentage. Goal : Minimize the data transfer time Time to upload data from site i. Time to download data from other sites. 16

17 System Implementation Based on Spark v2.1.0 and Iridium. Utilize Apache Kylin OLAP data cube to deal with all the OLAP operation. Use Gurobi Optimizer for optimize the LP function. 17

18 Evaluation Methodology Workloads: TPC-DS. Amplab big data benchmark. Hardware platform: A real EC2 deployment with 10 sites(singapore,tokyo ). 18

19 Evaluation Methodology Baseline: Iridium(Pu et al. Sigcomm 15). Performance metrics: Query completion time. Data reduction ratio. 19

20 Data Reduction Ratio Data reduction compared to in-place (%) Singapore Tokyo Oregon Virginia Ohio FrankFurt Seoul Sydney Iridium Bohr London Ireland Bohr can increase the data reductio ratio in all sites. 20

21 Query Completion Time Query completion time (secs) Iridium Bohr 0 Big data (scan) Big data (UDF) Big data (aggr) TPC-DS Bohr saves 30% QCT compare to Iridium. 21

22 Conclusion We propose Bohr: Similarity aware data transfer. Predicting and measuring data reduction ratio for each site. Bohr is 30% faster than Iridium. 22

23 Thank you! 23

24 Future work 1. Generalize the basic idea to more workloads. 2. Dealing with the overhead bring by operate OLAP cubes. 3. Break recurring batch workload condition 4. Multiple types of queries accessing one dataset. 24

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu Machine Learning and Big