Low memory Map-Reduce

Hrishikesh Amur, Karsten Schwan (Georgia Tech)
Wolf Richter, Athula Balachandran, Erik Zawadzki, Dave Andersen (CMU)
Michael Kaminsky (Intel)

Datasets are growing
People see value (even a little) in storing data rather than throwing it away

Map-Reduce
[Diagram: Map and Red. (reduce) stages]

Data transmitted over the network can be reduced!
[Diagram: M and R stages]

Aggregation is critical...
Useful data is small (selection problems)
Aggregate smaller than sum of parts (aggregation problems)
Networks usually oversubscribed

... as others have said
Parallel databases allow aggregation, but queries become complex
Dryad, MapReduce and Hadoop

Pre-aggregation in Hadoop
[Diagram: C and Red. stages]
Can aggregation be performed in memory-constrained environments?

Why memory-constrained?
Energy
Decreasing memory per core
Fun :)

Pre-aggregation in Hadoop
[Diagram: Sort, then Add]
In-memory sort limits aggregation

Minni: Low-memory Map-Reduce
Memory-efficient
Performance scales with available memory
External aggregation using SSDs

Partial Aggregation Object (PAO)
Key, Value
User-defined: create(key, value), destroy(), merge(pao), serialize(), deserialize()
Reference: Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations, Yu et al., SOSP '09
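The PAO interface above might look roughly like the following for word count. This is only a sketch: the class name `WordCountPAO`, the tab-separated text encoding, and the choice to express `create` and `deserialize` as class methods are assumptions made for illustration, not Minni's actual API. `destroy()` is omitted because Python reclaims objects automatically.

```python
from dataclasses import dataclass


@dataclass
class WordCountPAO:
    """Hypothetical PAO: a key plus a partially aggregated count."""
    key: str
    value: int = 0

    @classmethod
    def create(cls, key, value):
        # Build a fresh PAO from one (key, value) pair emitted by a map task.
        return cls(key, value)

    def merge(self, pao):
        # Fold another PAO for the same key into this one.
        if pao.key != self.key:
            raise ValueError("can only merge PAOs with the same key")
        self.value += pao.value

    def serialize(self):
        # Assumed on-disk format: "key<TAB>value" on one line.
        return f"{self.key}\t{self.value}\n"

    @classmethod
    def deserialize(cls, line):
        key, value = line.rstrip("\n").split("\t")
        return cls(key, int(value))


if __name__ == "__main__":
    a = WordCountPAO.create("minni", 1)
    a.merge(WordCountPAO.create("minni", 2))
    print(a.serialize(), end="")
```

Because `merge` only needs two PAOs with the same key, partial results can be combined in any order, in memory or after a round trip through `serialize` and `deserialize`.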

Grouping by Hashing
[Diagram: Sort and Add replaced by Hash]
Aggregate as you hash
But the hash table might not fit in memory
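The difference between grouping by sorting and grouping by hashing can be sketched as below. The function names are illustrative, and summing integer values stands in for whatever aggregation the job performs.

```python
from itertools import groupby


def aggregate_by_sort(pairs):
    # Sort-then-add: every (key, value) pair is held in memory for the
    # sort before any values are combined.
    result = {}
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        result[key] = sum(value for _, value in group)
    return result


def aggregate_by_hash(pairs):
    # Aggregate as you hash: the table holds one entry per distinct key,
    # and each incoming value is folded in immediately.
    table = {}
    for key, value in pairs:
        table[key] = table.get(key, 0) + value
    return table


if __name__ == "__main__":
    pairs = [("a", 1), ("b", 1), ("a", 1)]
    print(aggregate_by_sort(pairs) == aggregate_by_hash(pairs))
```

The hash version needs memory proportional to the number of distinct keys rather than the number of records, which helps only as long as that table fits in memory.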

External Aggregators
Bucketing
External Sort
External Hash

Bucketing
[Diagram: Hash Part., Files on SSD, Hash]
Example: 100 keys, Cap: 10 keys, 12 buckets
Each bucket has <10 keys
Can aggregate in memory!
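A minimal sketch of the bucketing idea for word count follows. It assumes CRC32 as the partitioning hash, one text file per bucket, and words that contain no newlines; none of these details come from the slides. Unlike the slide's example, nothing here guarantees that every bucket stays under the in-memory capacity, so a real implementation would need a fallback for oversized buckets.

```python
import os
import tempfile
import zlib
from collections import Counter


def bucket_of(key, num_buckets):
    # Hash partitioning: the same key always lands in the same bucket.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketed_count(words, num_buckets, workdir):
    paths = [os.path.join(workdir, f"bucket-{i}.txt") for i in range(num_buckets)]
    files = [open(path, "w", encoding="utf-8") for path in paths]
    try:
        # Phase 1: stream records out to many files at once.
        for word in words:
            files[bucket_of(word, num_buckets)].write(word + "\n")
    finally:
        for f in files:
            f.close()
    # Phase 2: aggregate one bucket at a time with a small in-memory table.
    for path in paths:
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts[line.rstrip("\n")] += 1
        yield from counts.items()


if __name__ == "__main__":
    words = [f"w{i % 100}" for i in range(1000)]
    with tempfile.TemporaryDirectory() as workdir:
        result = dict(bucketed_count(words, 12, workdir))
    print(len(result))
```

Since each key lives in exactly one bucket, the per-bucket results can be emitted directly with no final merge step.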

Bucketing
Technique: SSDs can support writes to many files
But, how many?

External Sort
[Diagram: Hash, Overflow Files, Ext. Sort, Add]
Technique: Trade memory consumption for extra CPU work
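One possible reading of the external-sort aggregator, sketched for word count: a bounded hash table is spilled to a sorted overflow file whenever it fills, and the overflow files are then merged so equal keys become adjacent and can be added. The spill policy, file format, and function names are assumptions for illustration, and keys are assumed to contain no tabs or newlines.

```python
import heapq
import os
import tempfile


def spill(table, workdir, run_id):
    # Write the table as a run sorted by key (this is the extra CPU work).
    path = os.path.join(workdir, f"run-{run_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for key in sorted(table):
            f.write(f"{key}\t{table[key]}\n")
    return path


def read_run(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            yield key, int(value)


def external_sort_count(words, capacity, workdir):
    table, runs = {}, []
    for word in words:
        if word not in table and len(table) >= capacity:
            # Table is full: move it to an overflow file and start over.
            runs.append(spill(table, workdir, len(runs)))
            table = {}
        table[word] = table.get(word, 0) + 1
    if table:
        runs.append(spill(table, workdir, len(runs)))
    # Merge the sorted runs; equal keys arrive together and are added.
    current_key, total = None, 0
    for key, value in heapq.merge(*(read_run(path) for path in runs)):
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += value
    if current_key is not None:
        yield current_key, total


if __name__ == "__main__":
    words = [f"w{i % 100}" for i in range(1000)]
    with tempfile.TemporaryDirectory() as workdir:
        print(len(dict(external_sort_count(words, 10, workdir))))
```

Memory use is bounded by `capacity` entries plus one buffered line per run during the merge, at the cost of sorting each spill and re-reading every run.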

External Hash
[Diagram: Hash, Ext. Hash]
Use random read capabilities of SSDs
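As a rough illustration of external hashing, the sketch below keeps a bounded in-memory table and, when it fills, folds each entry into a disk-resident key-value store with one random read and one write per key. Python's `dbm.dumb` is only a stand-in for an on-disk hash table (it keeps its own index in memory), and the flush policy and function names are assumptions rather than Minni's design.

```python
import dbm.dumb
import os
import tempfile


def flush(cache, db):
    # Read-modify-write each key: a random read followed by an update.
    for key, count in cache.items():
        k = key.encode("utf-8")
        previous = int(db[k]) if k in db else 0
        db[k] = str(previous + count).encode("ascii")
    cache.clear()


def external_hash_count(words, capacity, db_path):
    cache = {}
    db = dbm.dumb.open(db_path, "c")
    try:
        for word in words:
            if word not in cache and len(cache) >= capacity:
                flush(cache, db)
            cache[word] = cache.get(word, 0) + 1
        flush(cache, db)
        # Collected into a dict only to keep the demonstration short.
        return {k.decode("utf-8"): int(db[k]) for k in db.keys()}
    finally:
        db.close()


if __name__ == "__main__":
    words = [f"w{i % 100}" for i in range(1000)]
    with tempfile.TemporaryDirectory() as workdir:
        result = external_hash_count(words, 10, os.path.join(workdir, "counts"))
    print(len(result))
```

This access pattern performs many small lookups at unpredictable offsets, which is why it depends on storage with cheap random reads.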

Pipelining
Aggregators implemented as pipelines in Intel Threading Building Blocks (TBB)

Effects of token size (bucketing)
Wordcount: 8G dataset, 7 B/key, 1 mil keys

Comparisons
Wordcount: 8G dataset, 7 B/key, 1 mil keys

Recap of Techniques
Use SSD capabilities: parallel writes to multiple files, high random read capabilities
Trade latency for low memory consumption
Trade CPU work for low memory consumption

Questions & Suggestions
