Guoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14.

2 Outline: Introduction & Notations; Multi-Job Optimization; Evaluation; Conclusion

3 MapReduce can scale to thousands of commodity machines, is fault tolerant, and supports parallel computing. It has been widely embraced! But it is still not simple and convenient enough!

4 High-level languages such as MRQL simplify the execution of MR programs, so we can use SQL or a script instead of writing MR Java methods. But such high-level languages lead to a new problem.

5 Native Java program vs. SQL script

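6 For jobs J1 and J2: J1 selects $a \ge 10$ and J2 selects $b \le 20$. Tuples with $a \ge 10 \land b > 20$ are needed only by J1, tuples with $b \le 20 \land a < 10$ only by J2, and tuples with $a \ge 10 \land b \le 20$ by both.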

7 [Diagram: J1: F → Mapper M1 → Reducer R1; J2: F → Mapper M2 → Reducer R2]
The overhead can be reduced by: $\mathrm{Cost}(M_1) + \mathrm{Cost}(M_2) - \mathrm{Cost}(M_1 \cup M_2) + \mathrm{Cost}(F)$
Some techniques for this issue:
- MRShare's grouping technique (MRGT), by MRShare (VLDB 10)
- Generalized Grouping Technique (GGT), by the authors
- Materialization Technique (MT), by the authors
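As a worked instance of the overhead expression on this slide, the sketch below plugs in made-up numbers. It assumes the third term is the cost of the combined map output of the two jobs; the function name `sharing_benefit` and all cost values are hypothetical and in arbitrary units.

```python
def sharing_benefit(cost_m1, cost_m2, cost_shared, cost_scan):
    """Overhead saved when J1 and J2 run together.

    cost_m1, cost_m2: cost of each job's own map output
    cost_shared:      cost of the combined map output (assumed meaning)
    cost_scan:        cost of scanning the input file F once
    """
    return cost_m1 + cost_m2 - cost_shared + cost_scan


# Hypothetical costs: the combined output is cheaper than the two separate
# outputs because overlapping tuples are produced only once.
saving = sharing_benefit(cost_m1=40, cost_m2=50, cost_shared=70, cost_scan=100)
print(saving)  # 120
```

With these numbers, 20 units come from the overlapping map output and 100 units from scanning F once instead of twice.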

8 MRGT condition: J1 and J2 have the same schema of input KVs.
[Diagram: J1: F → Mapper M1 → Reducer R1; J2: F → Mapper M2 → Reducer R2; shared map output of the form (key, (tag, value))]
Main idea: share the map input scan and share the map output.
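The (key, (tag, value)) idea can be illustrated with a small in-memory sketch. This is not MRShare's implementation: the jobs JA and JB, their predicates, and the helpers `merged_map` and `route` are hypothetical, chosen only to show one scan feeding two jobs with each shared pair emitted once.

```python
from collections import defaultdict


def merged_map(record, jobs):
    """Scan one record once and emit (key, (tag, value)) pairs.

    `jobs` maps a job name to (predicate, key_fn, value_fn). The tag is the
    set of jobs that produce the same (key, value) for this record.
    """
    by_kv = defaultdict(set)
    for name, (pred, key_fn, value_fn) in jobs.items():
        if pred(record):
            by_kv[(key_fn(record), value_fn(record))].add(name)
    return [(k, (frozenset(tag), v)) for (k, v), tag in by_kv.items()]


def route(pairs):
    """Reduce side: pass each value only to the jobs named in its tag."""
    per_job = defaultdict(lambda: defaultdict(list))
    for key, (tag, value) in pairs:
        for name in tag:
            per_job[name][key].append(value)
    return per_job


# Two hypothetical jobs with the same key (a) and value (d).
jobs = {
    "JA": (lambda r: r["a"] >= 10, lambda r: r["a"], lambda r: r["d"]),
    "JB": (lambda r: r["b"] <= 20, lambda r: r["a"], lambda r: r["d"]),
}
records = [
    {"a": 12, "b": 30, "d": 1},  # needed by JA only
    {"a": 12, "b": 5, "d": 2},   # needed by both, emitted once
    {"a": 3, "b": 5, "d": 4},    # needed by JB only
]
pairs = [p for r in records for p in merged_map(r, jobs)]
sums = {name: {k: sum(vs) for k, vs in groups.items()}
        for name, groups in route(pairs).items()}
print(sums)
```

The second record produces a single pair tagged with both job names, which is where the map output saving comes from.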

9 Example 1: [Diagram: J1 and J4 read T through one shared mapper output; each tuple is routed to the reducer of J1, the reducer of J4, or both (region J1,4), depending on its predicates on t.a (threshold 10) and t.b (threshold 20).] But the condition is restrictive!

10 GGT condition: J_i and J_j satisfy $K_i \subseteq K_j$, e.g. $((a,b),d)$ and $((a,b,c),d)$, or $M_j^{A} = M_i$, e.g. $((a,b,c),d)^{\{a,b\}} = ((a,b),d)$.

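11 Example 2: J1 and J2 share one mapper output over T.
- J1 only: $t.a \ge 10 \land t.b > 20$, sent to the reducer of J1
- J1,2: $t.a \ge 10 \land t.b \le 20$, sent to both reducers
- J2 only: $t.a < 10 \land t.b \le 20$, sent to the reducer of J2
Note: the mapper output must be partitioned on a and sorted on (a, b).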
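The requirement to partition on a but sort on the pair (a, b) can be sketched in memory as follows. The sketch assumes J1 groups by a and J2 groups by (a, b), both summing d; the function `shared_reduce`, the tag sets, and the sample tuples are hypothetical, and Python lists stand in for Hadoop partitions.

```python
from itertools import groupby


def shared_reduce(pairs, num_partitions=2):
    """Answer J1 (group by a) and J2 (group by a, b) from one map output.

    `pairs` holds ((a, b), (tag, d)) entries, where tag names the jobs that
    need the entry. Partitioning uses only a; sorting uses the key (a, b).
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in pairs:
        partitions[hash(key[0]) % num_partitions].append((key, payload))

    j1, j2 = {}, {}
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # sort on (a, b)
        # Sorted order makes every a-group and every (a, b)-group contiguous,
        # so one pass over the partition serves both jobs.
        for a, a_group in groupby(part, key=lambda kv: kv[0][0]):
            for key, ab_group in groupby(a_group, key=lambda kv: kv[0]):
                for _, (tag, d) in ab_group:
                    if "J1" in tag:
                        j1[a] = j1.get(a, 0) + d
                    if "J2" in tag:
                        j2[key] = j2.get(key, 0) + d
    return j1, j2


pairs = [
    ((12, 30), ({"J1"}, 1)),
    ((12, 5), ({"J1", "J2"}, 2)),
    ((3, 5), ({"J2"}, 4)),
    ((12, 5), ({"J1", "J2"}, 3)),
]
j1, j2 = shared_reduce(pairs)
print(j1, j2)
```

Because all tuples with the same a land in the same partition, the J1 groups are complete there, while the finer sort order keeps the J2 groups contiguous inside each a-group.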

12 MT has two major parts: Map Output Materialization (MOM) and Reduce Input Materialization (RIM).
MOM condition: J_i and J_j can be processed in a specific sequence.
[Diagram: J_i: F → mapper output → map output for J_i → reducer of J_i; map output for J_j → HDFS → reducer of J_j]

13 RIM extra condition: J_i and J_j satisfy $K_j \subseteq K_i$, e.g. $((a,b),d)$ and $((a,b,c),d)$, or $M_i^{A} = M_j$, e.g. $((a,b,c),d)^{\{a,b\}} = ((a,b),d)$.
[Diagram: J_i: F → Mapper M_i → Reducer R_i, together with the results of M_j that can be derived from M_i; J_j: F → Mapper M_j → Reducer R_j]
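The example on this slide, where a map output keyed on (a, b, c) yields one keyed on (a, b), can be sketched as a key projection. This is only an illustration of deriving one job's map output from another's materialized output; `derive_map_output`, `sum_by_key`, and the sample data are hypothetical.

```python
from collections import defaultdict


def derive_map_output(materialized, positions):
    """Project each materialized key onto the given attribute positions."""
    return [(tuple(key[p] for p in positions), value)
            for key, value in materialized]


def sum_by_key(pairs):
    """Stand-in reducer: sum the values of each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Materialized map output of J_i with key (a, b, c) and value d.
m_i = [((1, 2, 3), 10), ((1, 2, 4), 5), ((2, 2, 3), 7)]

# Derived map output of J_j with key (a, b): no rescan of the input is needed.
m_j = derive_map_output(m_i, positions=(0, 1))
print(m_j)
print(sum_by_key(m_j))
```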

14 Example 3: [Diagram: J2 (as J_i): F → Mapper M_i → Reducer R_i; for tuples with $t.a \ge 10 \land t.b \le 20$, the results of M1 that can be derived from M2 on {a}; J1 (as J_j): F → Mapper M_j → Reducer R_j]

15 Algorithms:
1. NA: naive approach
2. MRGT: MRShare's grouping technique
3. GGT: generalized grouping technique
4. MT: materialization technique
5. GGTMT: combination of GGT and MT
Data:
1. Schema: (key char(8), dim1 char(20), dim2 char(20), dim3 char(20), dim4 char(20), range int, value int)
2. Size: 1.7 billion tuples with a size of 100GB
3. Query template: select T, sum(value) from Data where a <= range <= b group by T

16 Experimental Results
Experimental Environment:
1. Env: Hadoop
2. Processor: Intel Xeon X Ghz
3. RAM: 8 GB
4. OS: CentOS
5. Default cluster size: 1 master, 40 slaves
6. Disk: 2 × 500 GB SATA
Hadoop Configuration:
1. JVM heap size: 1024 MB
2. Default HDFS split size: 512 MB
3. Data replication: 3
4. I/O buffer size: 128 KB

17 Experimental Results

18 1. Effect of number of queries [Figure (a)]: GGT outperforms NA by 105% on average and by up to 167% when the number of queries is 30, and outperforms MRGT by 85% on average and by up to 107% when the number of queries is 30.

19 2. Effect of data size [Figure (b)]: GGT outperforms NA by 103% on average and by up to 128% when the data size is 320GB, and outperforms MRGT by 82% on average and by up to 93% when the data size is 320GB.

20 3. Effect of cluster size [Figure (c)]

21 4. Effect of data size and cluster size [Figure (d)]

22 5. Effect of split size [Figure (e)]

23 6. Analysis of MT [Figure (f)]

24 Primarily with MRShare

25 Notations (2)

26 [Diagram: execution pipelines of Job1 and Job2, each running split → map → sort → copy → merge → reduce → part → HDFS replication, with the phases Load, Parse, Process, Sort, Shuffle, Merge, Reduce. Some map input can be shared; some map output can be shared.]

27 Partitioning Algorithm: $(G_i, T_i)$ denotes a group of jobs $G_i$ being processed by a technique $T_i$.
Merging benefit: $\mathrm{Cost}(G_1, T_1) + \mathrm{Cost}(G_2, T_2) - \mathrm{Cost}(G_1 \cup G_2, T_3)$, where $G_1 \cap G_2 = \emptyset$ and $T_3 \in \{\mathrm{GGT}, \mathrm{MT}\}$.
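One way the merging benefit could drive a partitioning of jobs is a greedy loop that keeps applying the most beneficial merge. The slide does not give the algorithm's steps, so the greedy strategy is an assumption, as are the names `greedy_partition` and `toy_cost`, the "NA" label for unmerged single jobs, and every number in the toy cost model.

```python
from itertools import combinations


def greedy_partition(jobs, cost, techniques=("GGT", "MT")):
    """Repeatedly merge the pair of groups with the largest positive benefit.

    Groups are (frozenset_of_jobs, technique) pairs. Each job starts alone
    with the assumed technique "NA".
    """
    groups = [(frozenset([j]), "NA") for j in jobs]
    while True:
        best = None
        for (g1, t1), (g2, t2) in combinations(groups, 2):
            for t3 in techniques:
                # Merging benefit from the slide.
                benefit = cost(g1, t1) + cost(g2, t2) - cost(g1 | g2, t3)
                if benefit > 0 and (best is None or benefit > best[0]):
                    best = (benefit, (g1, t1), (g2, t2), (g1 | g2, t3))
        if best is None:
            return groups
        _, a, b, merged = best
        groups = [g for g in groups if g not in (a, b)] + [merged]


def toy_cost(group, technique):
    """Hypothetical cost: one scan per group plus a per-job overhead."""
    scan = 100
    per_job = {"NA": 10, "GGT": 15, "MT": 25}[technique]
    return scan + per_job * len(group)


result = greedy_partition(["J1", "J2", "J3"], toy_cost)
print(result)
```

Under this toy cost model every merge saves a full scan, so all three jobs end up in one group processed with GGT; a cost model with stricter applicability conditions would leave some groups unmerged.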
