CS 347: Parallel and Distributed Data Processing, Spring 2016
Notes 3: Query Processing

Query Processing
- Decomposition
- Localization
- Optimization

Decomposition
Same as in a centralized system:
1. Normalization
2. Eliminating redundancy
3. Algebraic rewriting

Normalization
- Convert the query from a general language into a standard form, e.g., relational algebra.
Normalization example
  SELECT A, C
  FROM   R, S
  WHERE  R.A = S.A AND ((R.B = 1 AND S.D = 2) OR (R.C > 3 AND S.D = 2))
The WHERE condition becomes, in conjunctive normal form:
  R.A = S.A  ∧  (R.B = 1 ∨ R.C > 3)  ∧  S.D = 2

Normalization also detects invalid expressions, e.g.:
  SELECT * FROM R WHERE R.D = 3      (invalid: R does not have a D attribute)

Eliminating redundancy
- E.g., in conditions:
    (A = 1) ∧ (A > 5)  ≡  false
    (A < 10) ∧ (A < 5)  ≡  A < 5
- E.g., sharing common sub-expressions: if the same σ cond (R) appears in several places in the
  query tree (say, joined with T and also used on its own), compute it once and reuse the result.
Algebraic rewriting
- E.g., pushing conditions down: rewrite σ cond (R ⋈ S) into
  σ cond3 ( σ cond1 (R) ⋈ σ cond2 (S) ), where cond1 and cond2 are the parts of cond
  that refer only to R and only to S.

Query Processing
- Decomposition: yields one or more algebraic query trees over relations
- Localization: replaces each relation by its corresponding fragments

Localization steps
(1) Start with the query tree
(2) Replace relations by their fragments
(3) Push ∪ up and σ down (apply the CS 245 rewrite rules)
(4) Simplify: eliminate unnecessary operations

Localization notation
- [R: cond] denotes a fragment of R; cond is the condition its tuples satisfy
Example A
(1) Start with the query σ E=3 (R).
(2) Replace R by its horizontal fragments [R1: E < 10] and [R2: E ≥ 10]:
      σ E=3 ( [R1: E < 10] ∪ [R2: E ≥ 10] )
(3) Push the selection down over the union:
      σ E=3 [R1: E < 10]  ∪  σ E=3 [R2: E ≥ 10]
    The right branch cannot produce any tuples (E = 3 contradicts E ≥ 10), so it becomes Ø.
Example A
(4) Simplify: the localized query is just σ E=3 [R1: E < 10].

Localization Rule 1
- σ c1 [R: c2]  =  σ c1 [R: c1 ∧ c2]
- [R: false]    =  Ø

Applying Rule 1 to the eliminated branch of Example A:
  σ E=3 [R2: E ≥ 10]  =  σ E=3 [R2: E = 3 ∧ E ≥ 10]  =  σ E=3 [R2: false]  =  Ø

Example B
(1) The query is the join R ⋈ S, where A is the common (join) attribute.
Example B
(2) Replace R and S by their horizontal fragments:
      R = [R1: A < 5] ∪ [R2: 5 ≤ A ≤ 10] ∪ [R3: A > 10]
      S = [S1: A < 5] ∪ [S2: A ≥ 5]
(3) Distribute the join over the unions, one join per pair of fragments:
      [R1: A < 5] ⋈ [S1: A < 5]          [R1: A < 5] ⋈ [S2: A ≥ 5]
      [R2: 5 ≤ A ≤ 10] ⋈ [S1: A < 5]     [R2: 5 ≤ A ≤ 10] ⋈ [S2: A ≥ 5]
      [R3: A > 10] ⋈ [S1: A < 5]         [R3: A > 10] ⋈ [S2: A ≥ 5]
(4) Eliminate the pairings whose fragment conditions are contradictory; only these remain:
      [R1: A < 5] ⋈ [S1: A < 5]
      [R2: 5 ≤ A ≤ 10] ⋈ [S2: A ≥ 5]
      [R3: A > 10] ⋈ [S2: A ≥ 5]
Localization Rule 2
- [R: c1] ⋈ [S: c2]  =  [R ⋈ S: c1 ∧ c2 ∧ R.A = S.A]     (A is the join attribute)
- [R ⋈ S: false]     =  Ø

Applying Rule 2 to one of the eliminated pairings of Example B:
  [R1: A < 5] ⋈ [S2: A ≥ 5]
    =  [R1 ⋈ S2: R1.A < 5 ∧ S2.A ≥ 5 ∧ R1.A = S2.A]
    =  [R1 ⋈ S2: false]  =  Ø

Example C
(1) The query is R ⋈K S (equijoin on the key K). R is fragmented horizontally on A,
    and S uses derived fragmentation (each S fragment follows the matching R fragment):
      [R1: R.A < 10]                      [R2: R.A ≥ 10]
      [S1: S.K = R.K ∧ R.A < 10]          [S2: S.K = R.K ∧ R.A ≥ 10]
(2) Replacing R and S by their fragments and distributing the join gives four pairings:
      R1 ⋈K S1      R1 ⋈K S2      R2 ⋈K S1      R2 ⋈K S2
Example C
(3) Attach the fragment conditions to each pairing:
      [R1: R.A < 10] ⋈K [S1: S.K = R.K ∧ R.A < 10]     [R1: R.A < 10] ⋈K [S2: S.K = R.K ∧ R.A ≥ 10]
      [R2: R.A ≥ 10] ⋈K [S1: S.K = R.K ∧ R.A < 10]     [R2: R.A ≥ 10] ⋈K [S2: S.K = R.K ∧ R.A ≥ 10]
(4) Rule 2 shows that the mismatched pairings are empty, e.g.:
      [R1: R.A < 10] ⋈K [S2: S.K = R.K ∧ R.A ≥ 10]
        =  [R1 ⋈K S2: R1.A < 10 ∧ S2.K = R.K ∧ R.A ≥ 10 ∧ R1.K = S2.K]
        =  [R1 ⋈K S2: false]  =  Ø
    Only the matching pairings survive: (R1 ⋈K S1) ∪ (R2 ⋈K S2).

Example D
(1) R is vertically fragmented:
      R1(K, A, B)      R2(K, C, D)
Example D
(2) Suppose the query is π K,A (R). Replace R by the join of its vertical fragments:
      π K,A ( R1(K, A, B) ⋈K R2(K, C, D) )
(3) Push the projection down onto the fragments:
      π K,A ( π K,A (R1) ⋈K π K (R2) )
(4) Simplify: every requested attribute (K and A) is already in R1, so the join with R2
    is unnecessary and the localized query is just π K,A (R1).

Localization Rule 3
Consider a vertical fragmentation R_i = π A_i (R), where each attribute set A_i contains
the key K. Then for any attribute set B,
  π B (R)  =  π B ( ⋈K over the fragments R_i with B ∩ A_i ≠ Ø )
i.e., fragments that contribute no attribute of B can be dropped from the join.
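Before moving on to optimization, a minimal sketch of the fragment-elimination idea behind
Rules 1 and 2, assuming every condition is a simple interval over one integer attribute.
The Interval encoding, the helper names, and the exact fragment conditions are illustrative
(they loosely follow Examples A and B), not code from the notes.

```python
# Fragment elimination for horizontal fragments whose conditions are simple intervals.
# A selection (Rule 1) or an equijoin pairing (Rule 2) can be dropped when the
# intersection of the conditions is empty ("false").

from typing import NamedTuple, Optional

class Interval(NamedTuple):
    lo: float   # inclusive lower bound
    hi: float   # exclusive upper bound

def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Combine two conditions; None plays the role of 'false'."""
    lo, hi = max(a.lo, b.lo), min(a.hi, b.hi)
    return Interval(lo, hi) if lo < hi else None

INF = float("inf")

# Rule 1, as in Example A: sigma E=3 over [R1: E < 10] and [R2: E >= 10].
R_frags = {"R1": Interval(-INF, 10), "R2": Interval(10, INF)}
selection = Interval(3, 4)                     # E = 3, treating E as an integer
kept = [name for name, c in R_frags.items() if intersect(selection, c) is not None]
print(kept)                                    # ['R1']  (the R2 branch is eliminated)

# Rule 2: an equijoin pairing survives only if the two fragment conditions overlap.
S_frags = {"S1": Interval(-INF, 5), "S2": Interval(5, INF)}
pairs = [(r, s) for r, cr in R_frags.items()
                for s, cs in S_frags.items()
                if intersect(cr, cs) is not None]
print(pairs)                                   # [('R1', 'S1'), ('R1', 'S2'), ('R2', 'S2')]
```

A real localizer performs the same satisfiability check symbolically on the fragment predicates.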
Query Processing
- Decomposition
- Localization
- Optimization: overview, joins and other operations, inter-operation parallelism,
  optimization strategies

Overview: the optimization process
1. Generate query plans P1, P2, P3, ..., Pn
2. Estimate the sizes of intermediate results
3. Estimate the cost of each plan: C1, C2, C3, ..., Cn
4. Pick the minimum

Overview: differences from centralized optimization
- New strategies for some operations: parallel/distributed sort, parallel/distributed join,
  semi-join, privacy preserving join, duplicate elimination, aggregation, selection
- Many ways to assign and schedule processors

Parallel/Distributed Sort
Input:
a) relation on a single site/disk
b) relation fragmented/partitioned by the sort attribute
c) relation fragmented/partitioned by another attribute
Parallel/Distributed Sort
Output:
a) sorted relation on a single site/disk
b) sorted fragments/partitions

Basic sort
- R(K, A) is to be sorted on K, and R is already horizontally fragmented on K using a
  partition vector (k0, k1, ..., kn), e.g., fragments F1, F2, F3 split at k0 and k1.
- Algorithm (see the sketch below):
  1. Sort each fragment independently
  2. Ship the results if necessary
- The same idea works on different architectures: shared nothing (each processor has its own
  memory and fragment) and shared memory (processors sort the fragments out of a common memory).
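A minimal sketch of the basic sort on a relation that is already range-fragmented by the sort
key; the fragment contents and the vector (10, 20) are illustrative values, not the figure's
exact data.

```python
# Basic sort: each fragment is sorted independently (step 1); because the fragments hold
# disjoint, ordered key ranges, concatenating them in vector order (step 2) gives the
# globally sorted relation.

partition_vector = (10, 20)                  # (k0, k1)
F1 = [5, 6, 7, 3]                            # keys < 10
F2 = [10, 12, 15, 11, 17, 14]                # 10 <= keys < 20
F3 = [20, 21, 50, 27, 22]                    # keys >= 20

sorted_fragments = [sorted(F) for F in (F1, F2, F3)]      # step 1: local sorts (in parallel)
result = [k for frag in sorted_fragments for k in frag]   # step 2: concatenate in range order
print(result)   # [3, 5, 6, 7, 10, 11, 12, 14, 15, 17, 20, 21, 22, 27, 50]
```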
Range-partition sort
- R(K, A) is to be sorted on K; R is located at one or more sites/disks but is not
  fragmented on K.
- Algorithm:
  1. Range partition R on K (using a vector such as (k0, k1))
  2. Basic sort: each receiving site sorts its range locally; the result is the
     concatenation of the local results in range order

Partition vectors
- Problem: select a good partition vector, given the existing fragments
  (figure: unsorted key values spread across sites a, b, c).
- Example approach:
  - Each site sends to a coordinator its MIN sort key, its MAX sort key, and its number of tuples
  - The coordinator computes the vector and distributes it to the sites
  - The coordinator also decides how many sites should perform the local sorts
Partition vectors: sample scenario
- The coordinator receives (assuming we want to sort at 2 sites, so we need one split point k0):
    Site A: MIN = 5,  MAX = 9,  # of tuples = 10
    Site B: MIN = 7,  MAX = 16, # of tuples = 10
- Expected tuples per key value (keys assumed uniformly spread between MIN and MAX):
    Site A: 10 / (9 - 5 + 1)  = 2
    Site B: 10 / (16 - 7 + 1) = 1
- Choose k0 so that the expected number of tuples with key < k0 is half of the total:
    2(k0 - 5) + (k0 - 7) = 10   ⇒   k0 = (10 + 10 + 7) / 3 = 9
  (a small sketch of this computation follows below)
- Variations: send more information to the coordinator, e.g.
  a) each site's local partition vector
  b) a histogram of the number of tuples per key value at each site
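A minimal sketch of the coordinator's computation in the sample scenario, assuming integer
keys spread uniformly between each site's MIN and MAX; the function names are illustrative.

```python
# Pick a split point k0 from per-site (MIN, MAX, # tuples) statistics, as the
# coordinator does in the two-site sample scenario.

def expected_below(k, site_stats):
    """Estimated number of tuples with key < k, summed over all sites."""
    total = 0.0
    for lo, hi, n in site_stats:
        density = n / (hi - lo + 1)                    # tuples per key value at this site
        covered = min(max(k - lo, 0), hi - lo + 1)     # key values of this site below k
        total += density * covered
    return total

def pick_split(site_stats):
    """Pick k0 so that roughly half of all tuples fall below it (2-site sort)."""
    half = sum(n for _, _, n in site_stats) / 2
    lo = min(s[0] for s in site_stats)
    hi = max(s[1] for s in site_stats)
    # scan candidate integer keys; keep the one whose estimate is closest to half
    return min(range(lo, hi + 2), key=lambda k: abs(expected_below(k, site_stats) - half))

stats = [(5, 9, 10), (7, 16, 10)]   # sites A and B from the sample scenario
print(pick_split(stats))            # 9, matching k0 = (10 + 10 + 7) / 3 in the notes
```

With more than two sites, the same estimate can place each cut point at the appropriate
fraction of the total tuple count.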
Partition vectors: more variations
- Multiple rounds:
  1. Sites send their key range and # of tuples to the coordinator
  2. The coordinator distributes a preliminary vector V0
  3. Sites send the coordinator the # of tuples in each V0 range
  4. The coordinator computes and distributes the final vector Vf
- A distributed algorithm that does not require a coordinator?

Parallel external sort-merge
- Same as range-partition sort, except the sorting is done first: each source site sorts its
  data locally, then range-partitions it (k0, k1) to the destination sites, which merge the
  incoming sorted streams in order to produce the result.

Parallel/Distributed Join
- Input: relations R and S, which may or may not be fragmented
- Output: the join result, at one or more sites
Partitioned join (equijoin R ⋈ R.A=S.B S)
- Partition R using f(A) and S using f(B), sending matching partitions to the same site;
  each site computes a local join; the result is the union of the local results.
- The same partition function f is used for both R and S; f can be range or hash partitioning.
- The local join can be of any type (use any CS 245 optimization).
- Various scheduling options, e.g.:
  a) partition R; partition S; join
  b) partition R; build local hash tables for R; then partition S and join as its tuples arrive
- We already know why it works: with both relations partitioned on the join attribute,
  R ⋈ S = (R1 ⋈ S1) ∪ (R2 ⋈ S2) ∪ (R3 ⋈ S3).
- Selecting a good partition function f is very important: the goal is to make all
  |Ri| + |Si| roughly the same size. Sometimes we partition just to make this join possible.
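A minimal sketch of the partitioned equijoin, with the sites modeled as Python lists and a
hash partition function; the attribute names A and B and the helper names are illustrative.

```python
# Partitioned equijoin R ⋈ (R.A = S.B) S: both relations are split with the same
# partition function, and each "site" joins only its own pair of partitions.

from collections import defaultdict

def partition(tuples, attr, f, n):
    """Send each tuple to partition f(key) mod n; the same f is used for R and S."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[f(t[attr]) % n].append(t)
    return parts

def local_hash_join(r_part, s_part):
    """Any local join works; here a simple build (on R) and probe (with S)."""
    table = defaultdict(list)
    for r in r_part:
        table[r["A"]].append(r)
    return [{**r, **s} for s in s_part for r in table.get(s["B"], [])]

def partitioned_join(R, S, n=3, f=hash):
    R_parts = partition(R, "A", f, n)
    S_parts = partition(S, "B", f, n)
    out = []
    for i in range(n):                      # each "site" i joins only its own partitions
        out.extend(local_hash_join(R_parts[i], S_parts[i]))
    return out

R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}, {"A": 3, "x": "r3"}]
S = [{"B": 2, "y": "s1"}, {"B": 3, "y": "s2"}, {"B": 3, "y": "s3"}]
print(partitioned_join(R, S))   # matches only on keys 2 and 3
```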
Asymmetric fragment + replicate join
- Partition R across the join sites (any partition function f works for R, even round robin)
  and replicate all of S at every site; each site joins its R fragment with the full S, and
  the result is the union of the local results.
- Works for any join, not just equijoins (e.g., R.A < S.B).

General fragment + replicate join
- Partition R into m fragments and make n copies of each fragment; S is partitioned
  (into n fragments) and replicated in a similar fashion, so that every R fragment meets
  every S fragment at some site.
General fragment + replicate join (continued)
- This yields n × m pairings of R and S fragments, each joined at one site; the result is
  the union of the local joins.
- The asymmetric F+R join is the special case of the general F+R join with a single S fragment.
- Asymmetric F+R may be a good choice when one relation is small.
- Also works for non-equijoins.

Semi-join
- Goal: reduce communication traffic.
- Compute R ⋈ S as (R ⋉ S) ⋈ S, or R ⋈ (S ⋉ R), or (R ⋉ S) ⋈ (S ⋉ R).
- Running example (join attribute A):
    R(A, B) at site 1: (2, a), (10, b), (25, c), (30, d)
    S(A, C) at site 2: (3, x), (10, y), (15, z), (25, w), (32, x)
Semi-join example
- Site 1 ships π A (R) = [2, 10, 25, 30] to site 2.
- Site 2 computes S ⋉ R = {(10, y), (25, w)} and ships it back to site 1, where the join
  with R is completed.
- Assume R is the smaller relation. In general, the semi-join strategy beats shipping a
  relation whole when size(π A (R)) + size(S ⋉ R) < size(S).
- Transmitted data in the example (counting only transmission cost):
    with the semi-join R ⋈ (S ⋉ R):   T = 4·size(A) + 2·(size(A) + size(C)) + result
    with the plain join (shipping R): T = 4·(size(A) + size(B)) + result
  so the semi-join is better if B is large. A similar comparison can be made for the other
  join strategies.
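A minimal sketch of the semi-join strategy R ⋈ (S ⋉ R) on the running example, with the two
sites modeled as plain Python lists.

```python
# Semi-join R ⋈ (S ⋉ R): only the join attribute travels from R's site, and only the
# matching S tuples travel back.

R = [(2, "a"), (10, "b"), (25, "c"), (30, "d")]              # R(A, B) at site 1
S = [(3, "x"), (10, "y"), (15, "z"), (25, "w"), (32, "x")]   # S(A, C) at site 2

# Site 1 -> Site 2: ship the projection of R on the join attribute.
proj_A = {a for a, _ in R}                       # {2, 10, 25, 30}: 4 values transmitted

# Site 2: compute the semi-join S ⋉ R and ship only the matching tuples back.
S_semi = [(a, c) for a, c in S if a in proj_A]   # [(10, 'y'), (25, 'w')]: 2 tuples transmitted

# Site 1: finish the join locally.
result = [(a, b, c) for a, b in R for a2, c in S_semi if a == a2]
print(result)                                    # [(10, 'b', 'y'), (25, 'c', 'w')]

# Transmission (ignoring the final result): 4 A-values plus 2 (A, C) tuples,
# versus shipping all of R as 4 (A, B) tuples; the semi-join wins when B is wide.
```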
Semi-join trick
- Encode π A (R) (or π A (S)) as a bit vector, one bit per possible key, and ship the bit
  vector instead of the key list.

Semi-joins for three-way joins R ⋈ S ⋈ T
- The relations can be reduced by semi-joins before the final join, e.g.:
    Option 1: compute R' ⋈ S' ⋈ T, where R' = R ⋉ S and S' = S ⋉ T
    Option 2: compute R' ⋈ S' ⋈ T, where S' = S ⋉ T and R' = R ⋉ S'
- There are many such options; the number of semi-join reduction plans grows roughly
  exponentially in the number of relations.

Privacy Preserving Join
- Site 1 has R(A, B); site 2 has S(A, C). We want to compute R ⋈ S such that:
  - site 1 does not discover any information about S that is not in the join
  - site 2 does not discover any information about R that is not in the join
- A plain semi-join won't work: if site 1 sends π A (R) to site 2, site 2 learns all of R's
  keys. E.g., with R's keys (a1, a2, a3, a4) and S's keys (a1, a3, a5, a7), sending
  (a1, a2, a3, a4) reveals a2 and a4 even though they are not in the join.
- Fix (first attempt): send hashed keys only.
  - Site 1 hashes each A value with h before sending: (h(a1), h(a2), h(a3), h(a4)).
  - Site 2 hashes its own A values with the same h and sees which of them match:
    here h(a1) and h(a3), so it returns the joining tuples (a1, c1), (a3, c3).
Privacy Preserving Join: what is the problem with hashing?
- Dictionary attack: site 2 can enumerate all possible keys a1, a2, a3, ... and check which of
  h(a1), h(a2), h(a3), ... match what site 1 sent, thereby recovering site 1's keys.
- Our adversary model is "honest but curious":
  - a dictionary attack is possible (the cheating is internal and cannot be caught)
  - sending incorrect keys is not possible (the cheater could be caught)
- One solution: "Information Sharing Across Private Databases", Agrawal et al., SIGMOD 2003.
  Use commutative encryption functions:
    E_i(x) = x encrypted with a key private to site i, with E1(E2(x)) = E2(E1(x))
  Shorthand: write E1(x) as x', E2(x) as x'', and E1(E2(x)) = E2(E1(x)) as x'''.
  (a toy sketch of such a commutative encryption follows below)
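A toy sketch of a commutative encryption based on modular exponentiation, the kind of
primitive this style of protocol relies on. The prime, the exponents, and the message flow
here are illustrative only and omit the hashing onto the group and other safeguards of the
real protocol.

```python
# Commutative "encryption" by exponentiation: E_i(x) = x ** e_i mod p, so that
# E1(E2(x)) == E2(E1(x)). Toy parameters; not the actual protocol from the paper.

p = 2**61 - 1                 # a prime modulus known to both sites
e1, e2 = 123457, 789013       # private exponents of site 1 and site 2

def E(x, e):
    return pow(x, e, p)

R_keys = [2, 10, 25, 30]           # site 1's join-attribute values
S_keys = [3, 10, 15, 25, 32]       # site 2's join-attribute values

# Site 1 -> Site 2: R's keys encrypted once with E1.
R_once = [E(a, e1) for a in R_keys]
# Site 2 -> Site 1: the same values re-encrypted with E2 (order preserved, so site 1
# can match them back to its keys), plus S's keys encrypted once with E2.
R_twice = [E(y, e2) for y in R_once]
S_once = [E(a, e2) for a in S_keys]
# Site 1: encrypt S's list with E1; by commutativity, equal values mean common keys.
S_twice = {E(y, e1) for y in S_once}
common = [a for a, ra in zip(R_keys, R_twice) if ra in S_twice]
print(common)                      # [10, 25]: the join keys, and nothing more is revealed
```

The sketch only shows why commutativity lets the sites compare doubly encrypted values; the
published protocol also hashes the keys into the group and chooses the exponents so the
mapping is invertible.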
Privacy Preserving Join: one solution (continued)
- Site 1 sends its keys encrypted once: (a1', a2', a3', a4').
- Site 2 re-encrypts that list and returns (a1''', a2''', a3''', a4'''); it also sends its own
  keys encrypted once: (a1'', a3'', a5'', a7'').
- Site 1 encrypts site 2's list, computes (a1''', a3''', a5''', a7'''), and intersects it with
  (a1''', a2''', a3''', a4'''). Because encryption commutes, the matches are exactly the common
  keys a1 and a3, so site 1 can produce (a1, b1), (a3, b3) without either site learning the
  other's non-joining keys.

Other privacy preserving operations
- Inequality join: R.A > S.A
- Similarity join: sim(R.A, S.A) < ε

Other parallel operations
- Duplicate elimination:
  - sort first (in parallel), then eliminate duplicates in the result, or
  - partition the tuples (by range or hash) and eliminate duplicates locally
- Aggregates: partition by the grouping attributes and compute each aggregate locally

Parallel/Distributed aggregates: example
- The relation (id, dept, sal) is fragmented across two sites:
    site a: (1, toy, 10), (2, toy, 20), (3, sales, 15)
    site b: (4, sales, 5), (5, toy, 20), (6, mgmt, 15), (7, sales, 10), (8, mgmt, 30)
- Query: sum(sal) group by dept
Parallel/Distributed aggregates: basic approach
- Partition the tuples by the grouping attribute (dept) and compute the sums locally:
  one site receives all the toy and mgmt tuples (1, 2, 5, 6, 8) and produces
  {toy: 50, mgmt: 45}; another receives all the sales tuples (3, 4, 7) and produces {sales: 30}.

Parallel/Distributed aggregates: with less data transmitted
- Each site first pre-aggregates its own tuples and ships only the partial sums
  (a sketch of this two-phase approach appears below):
    site a produces (toy, 30), (sales, 15)
    site b produces (toy, 20), (mgmt, 45), (sales, 15)
  The partial sums are then partitioned by dept: the toy/mgmt site receives
  (toy, 30), (toy, 20), (mgmt, 45), and the sales site receives (sales, 15), (sales, 15).
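A minimal sketch of the two-phase aggregation just described, with the two sites modeled as
Python lists containing the example tuples.

```python
# Two-phase parallel aggregation for "sum(sal) group by dept": each site pre-aggregates
# locally, then the partial sums are shipped by dept and combined.

from collections import defaultdict

site_a = [(1, "toy", 10), (2, "toy", 20), (3, "sales", 15)]
site_b = [(4, "sales", 5), (5, "toy", 20), (6, "mgmt", 15), (7, "sales", 10), (8, "mgmt", 30)]

def local_sums(tuples):
    """Phase 1: each site computes sum(sal) per dept over its own tuples."""
    sums = defaultdict(int)
    for _, dept, sal in tuples:
        sums[dept] += sal
    return dict(sums)

def combine(partials):
    """Phase 2: partial sums are partitioned by dept and added up."""
    final = defaultdict(int)
    for part in partials:
        for dept, s in part.items():
            final[dept] += s
    return dict(final)

partials = [local_sums(site_a), local_sums(site_b)]
print(partials)            # [{'toy': 30, 'sales': 15}, {'sales': 15, 'toy': 20, 'mgmt': 45}]
print(combine(partials))   # {'toy': 50, 'sales': 30, 'mgmt': 45}

# This works because SUM is decomposable (as are COUNT, MIN, MAX); aggregates such as AVG
# need extra care, e.g. shipping (sum, count) pairs rather than averages.
```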
Parallel/Distributed aggregates: final combination
- Each site adds up the partial sums it received: toy 30 + 20 = 50, mgmt 45, sales 15 + 15 = 30.
  This gives the same answer as before, but with less data transmitted.
- Enhancement: perform the aggregate during partitioning to reduce the data transmitted.
  This does not work for all aggregate functions. Which ones?

Preview: MapReduce
- (figure) input data partitions → map → intermediate partitions → reduce → output data

Parallel/Distributed selection
- Straightforward if one can leverage range or hash partitioning (see the routing sketch below).
- How do indexes work?
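A minimal sketch of routing a selection when the relation is range-partitioned on the
selection attribute; the partition vector and the site names are illustrative.

```python
# Route point and range selections using a range-partition vector on the selection key.
import bisect

partition_vector = [10, 20]          # (k0, k1): site1 holds K < 10, site2 holds 10 <= K < 20,
sites = ["site1", "site2", "site3"]  # site3 holds K >= 20

def sites_for_point(k):
    """A point query sigma K = k touches exactly one site."""
    return [sites[bisect.bisect_right(partition_vector, k)]]

def sites_for_range(lo, hi):
    """A range query sigma lo <= K <= hi touches only the sites whose ranges overlap it."""
    first = bisect.bisect_right(partition_vector, lo)
    last = bisect.bisect_right(partition_vector, hi)
    return sites[first:last + 1]

print(sites_for_point(3))        # ['site1']
print(sites_for_range(8, 25))    # ['site1', 'site2', 'site3']
print(sites_for_range(12, 18))   # ['site2']
```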
Parallel/Distributed selection and indexes
- The partition vector can act as the root of a distributed index: (k0, k1) routes a lookup
  to site 1, 2, or 3, and each site keeps a local index on its own tuples.
- Distributed indexes on a non-partition attribute get more complicated: separate index sites
  hold ranges (k0, k1) of the indexed attribute and point to the tuple sites.
- Indexing schemes raise several questions:
  - Which scheme is better in a distributed environment?
  - How do we make updates and expansion efficient?
  - Where do we store the directory and the set of participants? Is global knowledge necessary?
  - If the index is not too big, it may be better to keep it whole and replicate it.

Query Processing
- Decomposition
- Localization
- Optimization: overview, joins and other operations, inter-operation parallelism (next),
  optimization strategies
Inter-operation parallelism
- Pipelined
- Independent

Pipelined parallelism
- Example: site 1 scans R and evaluates σ c, streaming the matching tuples to site 2, which
  probes S and produces join results as the tuples arrive (see the sketch below).
- Pipelining cannot be used in all cases: some operators must consume their entire input
  before they can produce any output.

Independent parallelism
- Example: to compute R ⋈ S ⋈ T ⋈ V,
  1. compute temp1 = R ⋈ S and temp2 = T ⋈ V independently (in parallel, at different sites)
  2. compute result = temp1 ⋈ temp2
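A minimal sketch of the pipelined case, using Python generators to stand in for the tuple
stream between site 1 and site 2; the data and the selection predicate are illustrative.

```python
# Pipelined parallelism for sigma_c(R) followed by a join probe: site 1 streams matching
# R tuples one at a time, and site 2 probes its hash index on S as they arrive, so no
# intermediate result is materialized.

R = [(1, "r1"), (2, "r2"), (3, "r3"), (4, "r4")]     # R(K, ...) at site 1
S = {2: "s2", 4: "s4", 5: "s5"}                      # S at site 2, indexed on the join key

def site1_select(relation, pred):
    """Site 1: evaluate sigma_c and stream qualifying tuples (the generator is the pipeline)."""
    for t in relation:
        if pred(t):
            yield t

def site2_probe(stream, s_index):
    """Site 2: probe S for each incoming tuple as soon as it arrives."""
    for k, payload in stream:
        if k in s_index:
            yield (k, payload, s_index[k])

pipeline = site2_probe(site1_select(R, lambda t: t[0] % 2 == 0), S)
for row in pipeline:          # rows appear before site 1 has finished scanning R
    print(row)                # (2, 'r2', 's2') then (4, 'r4', 's4')
```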
Summary
As we consider query plans during optimization, we must take into account the various
- new strategies for individual operations
- ways of scheduling operations