Simba: Towards Building Interactive Big Data Analytics Systems. Feifei Li

Size: px

Start display at page:

Download "Simba: Towards Building Interactive Big Data Analytics Systems. Feifei Li"

Julie Haynes
5 years ago
Views:

1 Simba: Towards Building Interactive Big Data Analytics Systems Feifei Li

2 Complex Operators over Rich Data Types Integrated into System Kernel For Example: SELECT k-means from Population WHERE k=5 and feature=age and income >50,000 Group By city What are the impacts to query evaluation and query optimization modules?

3 e.g. Big Spatial Data is Ubiquitous! Location-based Services IoT Projects & Sensor Networks Social Media

4 Problems of Existing Systems Single node shared-state MPP database -> low scalability ArchGIS, PostGIS, Oracle Spatial Disk-oriented cluster computation -> low performance Hadoop-GIS, SpatialHadoop, GeoMesa No native support for spatial operators Spark SQL, MemSQL No sophisticated query planner & optimizer SpatialSpark, GeoSpark

5 100 TB on 1000 machines ½ - 1 Hour 1-5 Minutes Hard Disks Memory In memory computation over a cluster

6 Apache Spark Fast and General engine for large-scale data processing. Speed : By exploiting in-memory computing and other optimizations, Spark can be 100x faster than Hadoop for large scale data processing. Ease of Use : Spark has easy-to-use language integrated APIs for operating on large datasets. A Unified Engine : Spark comes packed with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing.

7 Resilient Distributed Datasets (RDDs) Immutable, partitioned collections of objects Created through parallel transformations (map, filter, groupby, join ) on data in stable storage: support pipeline optimization and lazy evaluation Can be cached in memory for efficient reuse. Retain the attractive properties of MapReduce: Fault tolerance, data locality, scalability Maintain linage information that can be used to reconstruct lost partitions. Ex: messages = textfile(...).filter(_.startswith( ERROR )).map(_.split( \t )(2)) HDFS File Filtered RDD Mapped RDD filter (func = _.startwiths(error)) map (func = _.split(...))

8 Spark scheduler DAG based scheduler A: B: Pipeline functions within a stage Stage 1 C: D: groupby F: G: Cache-aware work reuse & locality map E: join Partitioning-aware to avoid shuffles Stage 2 union = cached data partition Stage 3

9 Spark components

10 Spatial and Multimedia Data SELECT * FROM points SORT BY (x - 2)*(x - 2) + (y - 3)*(y - 3) LIMIT 5 SELECT * FROM points WHERE POINT(x, y) IN KNN(POINT(2, 3), 5) Spark Spark Streaming GraphX MLlib machine SELECT * FROM SQLqueries q KNN JOIN pois graph p real-time learning ON POINT(p.x, p.y) IN KNN(POINT(q.x, q.y), 3) Simba (SIGMOD16) Spark

11 Simba: Spatial In-Memory Big data Analytics Simba is an extension of Spark SQL across the system stack! 1. Programming Interface 2. Table Indexing 3. Efficient Spatial Operators 4. New Query Optimizations

12 Comparison with Existing Systems

13 Query Workload in Simba Life of a query in Simba

14 Programming Interfaces Extends both SQL Parser and DataFrame API of Spark SQL Support rich query types natively in the kernel SELECT * FROM points SORT BY (x - 2)*(x - 2) + (y - 3)*(y - 3) LIMIT 5 SELECT * FROM points WHERE POINT(x, y) IN KNN(POINT(2, 3), 5) Achieve something that is impossible in Spark SQL. SELECT * FROM queries q KNN JOIN pois p ON POINT(p.x, p.y) IN KNN(POINT(q.x, q.y), 3)

15 Programming Interfaces (cont d) Fully compatible with standard SQL operators. SELECT poi.id, count(*) as c FROM poi DISTANCE JOIN data ON POINT(data.lat, data.long) IN CIRCLERANGE(POINT(poi.lat, poi.long), 3.0) WHERE POINT(data.lat, data.long) IN RANGE(POINT(24.39, 66.88), POINT(49.38, )) GROUP BY poi.id ORDER BY poi.id Same level of flexibility for DataFrame poi.distancejoin(data, Point(poi( lat ), poi( long )), Point(data( lat ), data( long )), 3.0).range(Point(data( lat ), data( long )), Point(24.39, 66.88), Point(49.38, )).groupBy(poi( id )).agg(count( * ).as( c )).sort(poi( id )).show()

16 Table Indexing All Spark SQL operations are based on RDD scanning. Inefficient for selective spatial queries! In Spark SQL: Record -> Row Table -> RDD[Row] Solution in Simba: native two-level indexing over RDDs Challenges: RDD is not designed for random access Achieve this without hurting Spark kernel and RDD abstraction

Table Indexing (cont d) Two-level Indexing Framework: local + global indexing Partition Local Index Global Index R 0 Row IR 0 i 0 R R 1 R 2 R 3 Packing & Indexing IR 1 IR 2 IR 3 i 1 i 2 i 3 On

17 Table Indexing (cont d) Two-level Indexing Framework: local + global indexing Partition Local Index Global Index R 0 Row IR 0 i 0 R R 1 R 2 R 3 Packing & Indexing IR 1 IR 2 IR 3 i 1 i 2 i 3 On Master Node Global Index RDD[Row] R 4 Array[Row] Local Index IPartition[Row] IR 4 IndexRDD[Row] i 4 Partition Info CREATE INDEX idx_name ON R(x 1,, x m ) USE idx_type DROP INDEX idx_name ON table_name

18 Table Indexing (cont d) Representation for Indexed Tables (RDDs) in Simba case class IPartition[Type](data: Array[Type], index: Index) type IndexRDD[Type] = RDD[IPartition[Type]] Indexed tables are still RDDs (hence, fault tolerance is taken care of)!

19 Efficient execution of rich operations Indexing support -> efficient algorithms Global Index: partition pruning Local Index: parallel pruning within selected partitions R1 R2 R3 R4 R5 R6 global Index partition pruning on the master node R1 R2 R3 R4 R5 R6 local indexes parallel pruning on selected partitions

20 Spatial operations Range Query : range(q, R) Two steps: global Filtering + local processing Query Area

21 Spatial Operations (cont d) k nearest neighbor query : knn(q, R) Key to achieve good performance: Local indexes Pruning bound that is sufficient to cover global knn results. q γ q γ

22 More Sophisticated Operations Distance Join : R τ S Our solution: the DJSpark Algorithm knn join : R knn S Our solution: the RKJSpark Algorithm Details in the paper

23 Query Optimizer Index and geometry-awareness optimizations Index scan optimization: for better index utilization Selectivity estimation + Cost-based Optimization Selectivity estimation over local indexes Choose a proper plan: scan or use index. Spatial predicates merging

24 Query Optimizer Index scan optimization: for better index utilization Result Result Filter By: A B C (D E) A D E Filter By: ( B C (D E)) Optimize Table Scan using Index Operators With Predicate: A C D A B C (D E) Full Table Scan A D E Transform to DNF ( B C (D E))

25 Query Optimizer (cont d) Partition Size Auto-Tuning Data Locality Load Balancing Memory fitness <- record-size estimator Broadcast join optimization: small table joins large table Logical partitioning optimization for RKJSpark provides tighter pruning bounds γ i A good Partitioner (e.g., STR Partitioner)

26 Comparison with Existing Systems Environment: A 10-node cluster with 54 cores and 135GB RAM Single-relation operations Throughput Single-relation operations Latency Query over 500M OpenStreetMap entries

27 Comparison with Existing Systems (cont d) Join operations performance Join between two 3M-entry tables

28 Performance against Spark SQL: Data Size knn Query Throughput knn Query Latency

29 Performance of Joins: Data Size Distance Join Performance knn Join Performance

30 Support for multi-dimension knn Throughput against Dimension knn Latency against Dimension

31 Index Building Cost: Time Index Building Time against Data size Index Building Time against Dimension

32 Index Building Cost: Space Local Index Size Global Index Size

33 100 TB on 1000 machines ½ - 1 Hour 1-5 Minutes 1 second? Hard Disks Memory Online Query Execution

34 Complex Analytical Queries (TPC-H) SELECT SUM(l_extendedprice * (1 - l_discount)) FROM customer, lineitem, orders, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R' AND c_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' This query finds the total revenue loss due to returned orders in a given region. 34

35 Online Aggregation [Haas, Hellerstein, Wang SIGMOD 97] Y + ε SELECT ONLINE SUM(l_extendedprice * (1 - l_discount)) FROM customer, lineitem, orders, nation, region Y WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R' Y ε AND c_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' WITHTIME CONFIDENCE 95 REPORTINTERVAL 1000 Pr Y ε < Y < Y + ε > 0.95 Confidence Interval Confidence Level 35

36 Accuracy vs. Speed Tradeoff Continuous Query Evaluation and Feedbacks to the user 36

37 Ongoing Works Native support to general geometric objects Polygons, Segments, etc. Geometric object filtering. Spatial join over predicates such as interesect and touch

38 Ongoing works (cont d) Online sampling and aggregation support. Provides uniform random samples / approximate aggregation results in a online fashion.

39 Ongoing works (cont d) Trajectory (Spatial-Temporal) Data Analysis Massive trajectory retrieval Trajectory Similarity Search

40 Ongoing works (cont d) Easy to use user interface integration

41 Conclusion Simba: A distributed in-memory spatial analytics engine User-friendly SQL & DataFrame API Indexing support for efficient query processing Spatial + multimedia + ML operator implementation tailored towards Spark Spatial and index-aware optimizations No changes to Spark kernel -> easier migration to higher version Spark Superior performance compared against other systems Support online query execution Now open sourced at: Under active development.

42 Questions?

43 Supported SQL & DataFrame API Point wrapper SQL: POINT(pois.x + 2, pois.y * 3) DataFrame: Point(pois( x ) + 2, pois( y ) * 3) Box range query SQL: p IN RANGE(low, high) DataFrame: range(base: Point, low: Point, high: Point) Circle range query SQL: p IN CIRCLERANGE(c, rd) DataFrame: circlerange(base: Point, c: Point, rd: Double) k nearest neighbor query SQL: p IN KNN(q, k) DataFrame: knn(base: Point, k: Int)

44 Supported SQL & DataFrame API (cont d) Distance join SQL: R DISTANCE JOIN S ON s IN CIRCLERANGE(r, τ) DataFrame: distancejoin(target: DataFrame, left_key: Point, right_key: Point, τ: Double) knn join SQL: R KNN JOIN S ON s IN KNN(r, k) DataFrame: knnjoin(target: DataFrame, left_key: Point, right_key: Point, k: Int) Index management SQL: CREATE INDEX idx_name ON R(x 1,, x m ) USE idx_type DROP INDEX idx_name ON table_name DROP INDEX ON table_name SHOW INDEX ON R DataFrame: index(idx_type: IndexType, idx_name: String, attrs:seq[attribute]) dropindex() showindex()

45 Spatial Operations: Distance Join Distance Join : R τ S General theta-join in Spark SQL -> Cartesian product!!! Our solution: the DJSpark Algorithm Global Join Local Join R R 0 S 0 R 0 B 1 B 2 B 3 B 4 B 5 B 6 B 7 Local Index of S 0 R 0 S 1 B 4 S 0 S 2 S 3 S 4 S 5 S S 1 S 6 R 0 S 6 S 0 B 3 p τ R0

46 Spatial Operations: knn join knn join : R knn S Solutions in Simba: Block Nested Loop knn join (BKJSpark-N) Block Nested Loop knn join with local R-Trees (BKJSpark-R) Voronoi knn join* (VKJSpark) z-value knn join + (ZKJSpark) -> approximate knn join R-Tree knn join (RKJSpark) * W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient Processing of k Nearest Neighbor joins using MapReduce. VLDB, C. Zhang, F. Li, and J. Jestes. Efficient parallel knn joins for large data in mapreduce. EDBT, 2012.

47 Spatial Operations: knn join (cont d) R-Tree knn join (RKJSpark) Distributed hash-join like algorithm. For each partition R i, find S i S, s. t. r R i, knn r, S = knn(r, S i ) R knn S = R i knn S i Define cr i as the centroid of partition R i Take a uniform random sample S S, and suppose knn cr i, S = {s 1,, s k } For each partition R i : u i = max r R i r, cr i γ i = 2u i + cr i, s k S i = {s s S, cr i, s γ i }

48 Support for multi-dimension joins Distance Join against Dimension knn Join against Dimension

49 Experiments OpenstreetMap Data, up to 2xOpenstreetMap. 1xOpenstreetMap: 2.2 billion records in 132GB 10 nodes with two configurations: (1) 8 machines with a 6-core Intel Xeon E v3 1.60GHz processor and 20GB RAM; (2) 2 machines with a 6-core Intel Xeon E GHz processor and 56GB RAM. Other datasets are used in high dimensions

50 Data Cube Jim Gray, Adam Bosworth, Andrew Layman, Hamid Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE

51 Future Works Native support to general geometric objects Polygons, Segments, etc. Spatial join over predicates such as interesect and touch Data in very high dimensions (> 10d) More sophisticated cost-based optimizations.

Simba: Efficient In-Memory Spatial Analytics.

Simba: Efficient In-Memory Spatial Analytics. Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou and Minyi Guo SIGMOD 16. Andres Calderon November 10, 2016 Simba November 10, 2016 1 / 52 Introduction Introduction