SciHadoop: Array Based Query Processing in Hadoop

1 SciHadoop: Array Based Query Processing in Hadoop. Joe Buck, Noah Watkins, Jeff LeFevre, Kleoni Ioannidou, Carlos Maltzahn, Neoklis Polyzotis, Scott Brandt

2 Damasc: Data Management in Scientific Computing

3 SciHadoop: Logical query interface; data stored in original file format; MapReduce processing model

4 Background: MapReduce

5 MapReduce [Figure: MapReduce data flow. Input files (split 0 to split 4) -> Map phase: (3) read, (4) local write -> Intermediate files (on local disks) -> Reduce phase: (5) remote read, (6) write -> Output files (output file 0, output file 1)]

10 Background: Scientific Libraries

11 Scientific Data

12 Scientific Data Access Library

13 MapReduce: One Task. Consider a single task processing data. Two data sets: text and climate data.

14 MapReduce: One Task [Figure: MapReduce data flow (input splits, map phase, intermediate files on local disks, reduce phase, output files)]

16 Processing Text. Input: "Thou wast born of woman / But swords I smile at, weapons laugh to scorn, / Brandish'd by man that's of a woman born." Output: Thou, 1; of, 2; scorn,
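The text example on this slide is a word count, which can be sketched in a few lines of Python. This is only an illustration of the map and reduce steps for a single task, not SciHadoop or Hadoop code; the function names `map_words` and `reduce_counts` and the tokenization rule are assumptions made for the sketch.

```python
import re
from collections import Counter

# The passage shown on the slide, treated as a single input split.
TEXT = (
    "Thou wast born of woman "
    "But swords I smile at, weapons laugh to scorn, "
    "Brandish'd by man that's of a woman born."
)


def map_words(split):
    """Map step: emit a (word, 1) pair for every word in the split."""
    # Assumption: a word is a run of letters and apostrophes.
    for word in re.findall(r"[A-Za-z']+", split):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


counts = reduce_counts(map_words(TEXT))
print(counts["Thou"], counts["of"], counts["scorn"])  # prints: 1 2 1
```

Text works well here because any byte range of the file can be tokenized without knowing where it sits in the whole document.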

17 Processing Temps 4, 5, 6, 7, 8,

18 Scientific Data: Latitude, Longitude, Time

20 Scientific Data Access Library

22 Scientific Data X Access Library

23 An Issue Arises 4, 5, 6, 7, 8,

24 Solution: Propagate logical coordinates throughout the system

25 Solution [Figure: MapReduce data flow (input splits, map phase, intermediate files on local disks, reduce phase, output files)]

26 Solution. Split A: Corner: 0, 1, 0; Shape: 1, 1, 3; Data: 4, 5, 6; Output: 6. Split B: Corner: 1, 0, 0; Shape: 1, 1, 3; Data: 7, 8, 9; Output:
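The corner and shape pairs on this slide can be read as a description of where each block of values sits in the logical array. The following is a minimal sketch of that idea, assuming a row-major layout and a max function per split (which matches the one output, 6, shown on the slide). The `Slab` class and the helper names are hypothetical and are not taken from SciHadoop.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, List, Tuple

Coord = Tuple[int, ...]


@dataclass(frozen=True)
class Slab:
    """A block of values tagged with its position in the logical array."""
    corner: Coord      # logical coordinate of the first element
    shape: Coord       # extent along each dimension
    data: List[int]    # values, assumed to be stored in row-major order


def coordinates(slab: Slab) -> Iterator[Coord]:
    """Yield the logical coordinate of each value in the slab."""
    ranges = [range(c, c + s) for c, s in zip(slab.corner, slab.shape)]
    return product(*ranges)


def map_max(slab: Slab) -> Tuple[Coord, int]:
    """Map step: emit the slab's corner as the key and its maximum as the value."""
    return slab.corner, max(slab.data)


slabs = [
    Slab(corner=(0, 1, 0), shape=(1, 1, 3), data=[4, 5, 6]),
    Slab(corner=(1, 0, 0), shape=(1, 1, 3), data=[7, 8, 9]),
]

for slab in slabs:
    print(list(zip(coordinates(slab), slab.data)), "->", map_max(slab))
```

Because every value carries its logical coordinate, later stages can reason about array positions without knowing the byte layout of the file.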

28 Recap: Mismatch between MapReduce and access libraries; logical coordinates are key

29 Experiments. Optimizations: Explore three methods for creating splits; Reduce unnecessary reads; Holistic Combiner; Query-Aware Partitioning

31 Experimental Data

33 Naive Partitioning: Round-robin placement over all the blocks that constitute the input

34 Naive Partitioning [Figure: input blocks and other file data laid out across NODE 0 to NODE 7; labels NODE0, NODE1, ... mark the splits]
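Round-robin placement, as named on the Naive Partitioning slide, can be sketched as follows. The eight-node count comes from the figure labels; the function names and the block counts are illustrative assumptions.

```python
def round_robin_placement(num_blocks, num_nodes):
    """Assign block i to node i mod num_nodes."""
    return {block: block % num_nodes for block in range(num_blocks)}


def nodes_touched(placement, first_block, last_block):
    """Return the set of nodes assigned to the blocks in [first_block, last_block]."""
    return {placement[b] for b in range(first_block, last_block + 1)}


placement = round_robin_placement(num_blocks=16, num_nodes=8)
print(placement[0], placement[7], placement[8])  # prints: 0 7 0
print(sorted(nodes_touched(placement, 2, 5)))    # prints: [2, 3, 4, 5]
```

Under this scheme, a run of consecutive blocks is spread over as many distinct nodes as it has blocks (up to the node count), regardless of what the blocks contain.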

37 Results
Columns: Test Name | Local Read (%) | Temp Data (GB) | CPU Util (%) | Run Time (Min) | Time (%)
First 4 use no Holistic Combiner:
Baseline | 9.3 | 2,
Baseline +NoScan | 9.2 | 2,
ChkGroup | 80 | 2,
PhysToLog | 88 | 2,
Next 4 use Holistic Combiner with Baseline: Baseline NoScan NoScan +HaPart HaPart
Next 3 use Holistic Combiner with Local-Read Optimizations: ChkGroup +HaPart NoScan ChkGroup +NoScan PhysToLog +NoScan

38 Chunking & Grouping [Figure: input blocks and other file data laid out across NODE 0 to NODE 7; split labels are grouped by node (NODE0, NODE0, NODE0, NODE1, NODE1, ...)]

41 Physical to Logical [Figure: block-to-node layout (NODE 0, NODE 1, ...) with split labels NODE0, NODE1, ...]

44 No Scan [Figure: block-to-node layout (NODE 0, NODE 1, ...) with split labels NODE0, NODE1, ...]

48 Combiner. Function: Max. [Figure: node 1, node 2, and node 3 each run Filter / Map, then Combine, then Reduce; map outputs 3, 6, 8 are combined to 3 and 8 before the reduce]
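A combiner for Max can be sketched as below. The values 3, 6, 8 and the combined values 3 and 8 come from the slide; which node produced which value is an assumption made for the example, as are the variable and function names.

```python
# Map outputs per node (assumed assignment of the slide's values 3, 6, 8).
map_outputs = {
    "node 1": [3],
    "node 2": [6, 8],
}


def combine(values):
    """Combiner: collapse one node's map outputs to a single local maximum."""
    return max(values)


def reduce_max(partials):
    """Reducer: the maximum of the local maxima is the global maximum."""
    return max(partials)


combined = {node: combine(values) for node, values in map_outputs.items()}
result = reduce_max(combined.values())
print(combined, result)  # prints: {'node 1': 3, 'node 2': 8} 8
```

Max can be combined this way because it is distributive: applying it to partial results gives the same answer as applying it to all values at once, while less data crosses the network.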

50 Holistic Combiner. Function: Median. [Figure: node 1, node 2, and node 3 each run Filter / Map, then Reduce, with no combine step]

51 Holistic Combiner. Function: Median. [Figure: node 1, node 2, and node 3 each run Filter / Map, then Combiner (sample output 1), then Reduce]
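Median differs from Max in that it is holistic: the median of partial medians is generally not the median of all values. The sketch below first shows that mismatch, then shows one way a combiner could still be applied, under the assumption that a node can tell when it holds every value for a key. The `expected_count` parameter and the function name are hypothetical and are not taken from SciHadoop.

```python
from statistics import median

# Values for one key, spread over two nodes (made-up numbers).
node_values = {"node 1": [1, 2, 9], "node 2": [3, 4, 5]}

all_values = [v for values in node_values.values() for v in values]
true_median = median(all_values)
median_of_medians = median(median(v) for v in node_values.values())
print(true_median, median_of_medians)  # prints: 3.5 3.0


def holistic_combine(values, expected_count):
    """Return the final median only if every value for the key is present.

    Assumption: expected_count is known up front, for example from the
    logical shape of the region that maps to this key.
    """
    if len(values) == expected_count:
        return ("final", median(values))
    return ("partial", list(values))


print(holistic_combine([1, 2, 9], expected_count=3))  # ('final', 2)
print(holistic_combine([1, 2, 9], expected_count=6))  # ('partial', [1, 2, 9])
```

When the combiner can emit a final value, only one number per key leaves the node; otherwise the raw values must still be sent to the reducer.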

57 Query-Aware Partitioning. Function: Median. [Figure: node 1, node 2, and node 3 each run Filter / Map, then Combiner (sample output 1), then Reduce]

58 Query-Aware Partitioning. Function: Median. [Figure: node 1 and node 2 each run Filter / Map, then Combiner (sample outputs 1 and 6), then Reduce]
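One way to read query-aware partitioning is that split boundaries are chosen so that all inputs for an output key land in the same split, which lets a combiner see complete groups. The one-dimensional sketch below illustrates that reading only; the window size, target split size, and function names are assumptions, not details from the slides.

```python
def naive_splits(total, target):
    """Cut range(total) into consecutive chunks of `target` cells."""
    return [(s, min(s + target, total)) for s in range(0, total, target)]


def aligned_splits(total, window, target):
    """Cut range(total) into chunks whose boundaries fall on multiples of `window`."""
    size = max(window, (target // window) * window)
    return [(s, min(s + size, total)) for s in range(0, total, size)]


def windows_cut(splits, window):
    """Count interior split boundaries that fall inside a window."""
    return sum(1 for _, stop in splits[:-1] if stop % window != 0)


# Each output key is assumed to cover 3 consecutive cells of a 12-cell array.
naive = naive_splits(total=12, target=5)
aligned = aligned_splits(total=12, window=3, target=5)
print(naive, windows_cut(naive, 3))      # [(0, 5), (5, 10), (10, 12)] 2
print(aligned, windows_cut(aligned, 3))  # [(0, 3), (3, 6), (6, 9), (9, 12)] 0
```

With aligned boundaries, no window straddles two splits, so each map task holds every value for the keys it produces.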

61 SciHadoop: Provides logical query interface; in-situ processing over native data; exploits convenient data parallelism

62 SC 11 Thursday, Nov 17th 1:30-2 pm Room TCC

63 Future Work: Integrate structural knowledge into Hadoop proper; produce partial, complete results early; alternative resiliency models; generalize existing niche performance enhancements for scientific data

64 Future Work: Come to my poster for details

65 Collaborators

66 Thank You
