Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture), Dr. Gerald Friedland


1 Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture), Dr. Gerald Friedland 1

2 Today Answers to Questions How to estimate resources for large data projects - Some basic issues - Factors to take into account - Different methods to estimate - Example 2

3 FAQ How is the grade computed? 15% Mid-Term Exam 15% Final Exam 70% Project 3

4 FAQ What is graded in the project? Final Presentation Project Report (details to follow) 4

5 Some Basics 5

6 Some Basics Why can't we automate the 5

7 Some Basics Why can't we automate the estimation of resources for large 5

8 Some Basics Why can't we automate the estimation of resources for large data projects? 5

9 Some Basics Why can't we automate the estimation of resources for large data projects? Halting problem 5

10 Some Basics Why can't we automate the estimation of resources for large data projects? Halting problem: undecidable!! 5

11 Some Basics Why can't we automate the estimation of resources for large data projects? Halting problem: undecidable!! 5

12 More Basics 6

13 More Basics Why can't we just parallelize everything? 6

14 More Basics Why can't we just parallelize everything? Amdahl's Law! 6

15 More Basics Why can't we just parallelize everything? Amdahl's Law! Amdahl, Gene (1967). "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities". AFIPS Conference Proceedings 30: 483-485.

16 Amdahl's Law S = 1 / (P + (1 - P) / N) where: S is the speed-up expected, P is the unparallelizable portion, N is the number of CPUs 7

17 Amdahl's Law, rearranged to estimate the parallelized portion from a measurement: (1/SU - 1) / (1/NP - 1) where: SU is the speed-up measured, NP is the number of CPUs 8

18 Amdahl's Law 9

19 Amdahl's Law 10

20 Amdahl's Law Intuition: Unparallelizable portion dominates runtime. 10

21 Amdahl's Law Intuition: Unparallelizable portion dominates runtime. => Diminishing returns when adding more CPUs. 10
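
To make the diminishing returns concrete, here is a small illustrative Java snippet (not from the slides) that evaluates the slide-16 formula S = 1 / (P + (1 - P) / N) for a hypothetical unparallelizable portion of 5% and a growing number of CPUs:

    // Illustrative only: Amdahl's Law S = 1 / (P + (1 - P) / N),
    // with P = unparallelizable fraction (assumed 5% here) and N = number of CPUs.
    public class AmdahlDemo {
        static double speedup(double p, int n) {
            return 1.0 / (p + (1.0 - p) / n);
        }

        public static void main(String[] args) {
            double p = 0.05;  // assumption: 5% of the work cannot be parallelized
            for (int n : new int[] {1, 2, 4, 8, 16, 64, 256, 4096}) {
                System.out.printf("N = %4d  ->  speed-up = %.2f%n", n, speedup(p, n));
            }
            // Even with unlimited CPUs the speed-up is bounded by 1/p = 20x.
        }
    }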

22 Factors to take into Account when estimating runtime 11

23 Factors to take into Account when estimating runtime CPU usage (number of instructions) 11

24 Factors to take into Account when estimating runtime CPU usage (number of instructions) I/O usage 11

25 Factors to take into Account when estimating runtime CPU usage (number of instructions) I/O usage Memory requirements 11

26 Factors to take into Account when estimating runtime CPU usage (number of instructions) I/O usage Memory requirements External factors 11

27 Factors to take into Account when estimating runtime CPU usage (number of instructions) I/O usage Memory requirements External factors The human factor 11

28 CPU Usage 12

29 CPU Usage How many instructions does the program need per byte of input data? 12

30 CPU Usage How many instructions does the program need per byte of input data? - Similar to O(n) notation of theoretical computer science 12

31 CPU Usage How many instructions does the program need per byte of input data? - Similar to O(n) notation of theoretical computer science - Problem: Machine learning algorithms converge! Runtime ~ Entropy of input 12
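
A rough back-of-the-envelope sketch of this kind of estimate (all numbers below are assumptions for illustration, not figures from the lecture): multiply the estimated instructions per byte by the input size and divide by the effective instruction throughput of one core.

    // Illustrative CPU-time estimate; every constant here is an assumption.
    public class CpuEstimate {
        public static void main(String[] args) {
            double bytesOfInput = 500e9;          // assumed: 500 GB of input data
            double instructionsPerByte = 200;     // assumed: measured on a small sample
            double instructionsPerSecond = 2e9;   // assumed: effective throughput of one core

            double seconds = bytesOfInput * instructionsPerByte / instructionsPerSecond;
            System.out.printf("CPU-only estimate: %.1f hours on one core%n", seconds / 3600.0);
            // Caveat from the slide: for machine learning algorithms that iterate until
            // convergence, instructions per byte is not a constant; it depends on the data
            // (its entropy), not only on its size.
        }
    }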

32 I/O Usage 13

33 I/O Usage I/O might be: - memory access - storage (HD) access - network access 13

34 I/O Usage I/O might be: - memory access - storage (HD) access - network access Orders of magnitude slower than CPU access! 13

35 I/O Usage Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Eighth Edition ", Chapter 1 14

36 I/O Usage Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Eighth Edition ", Chapter 1 15

37 I/O Usage Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Eighth Edition ", Chapter 1 16

38 I/O Usage 17

39 I/O Usage How do you store and access the central data input? 17

40 I/O Usage How do you store and access the central data input? How do you access and store intermediate files? 17

41 I/O Usage How do you store and access the central data input? How do you access and store intermediate files? How do you write out (debugging) output? 17

42 I/O Usage How do you store and access the central data input? How do you access and store intermediate files? How do you write out (debugging) output? How many I/O operations will it be? 17

43 I/O Usage How do you store and access the central data input? How do you access and store intermediate files? How do you write out (debugging) output? How many I/O operations will it be? Using which device latencies? 17
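
The same style of estimate for the I/O side (the device latencies below are rough orders of magnitude chosen for illustration, not the values from the Silberschatz table):

    // Illustrative I/O-time estimate; the latencies are order-of-magnitude assumptions.
    public class IoEstimate {
        public static void main(String[] args) {
            long ops = 10_000_000L;              // assumed number of I/O operations
            double ramLatency  = 100e-9;         // ~100 ns per memory access (assumption)
            double diskLatency = 10e-3;          // ~10 ms per disk seek (assumption)
            double netLatency  = 50e-3;          // ~50 ms per remote request (assumption)

            System.out.printf("RAM:     %.0f s%n", ops * ramLatency);
            System.out.printf("Disk:    %.0f s (about %.1f days)%n",
                    ops * diskLatency, ops * diskLatency / 86400);
            System.out.printf("Network: %.0f s (about %.1f days)%n",
                    ops * netLatency, ops * netLatency / 86400);
            // The same ten million accesses cost about a second in RAM but days on disk
            // or over the network, which is why the placement of intermediate files matters.
        }
    }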

44 Memory Requirements 18

45 Memory Requirements How can the cache be used most efficiently? 18

46 Memory Requirements How can the cache be used most efficiently? How much memory is available per CPU? 18

47 Memory Requirements How can the cache be used most efficiently? How much memory is available per CPU? Is the memory actually physical or virtual? Avoid Thrashing! 18
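
One concrete instance of the cache question (an added illustration, not from the slides): the traversal order over a large 2D array. In Java each row is stored contiguously, so row-by-row access reuses cache lines while column-by-column access misses the cache on almost every step.

    // Illustration: cache-friendly vs. cache-hostile traversal of a 2D array.
    public class CacheDemo {
        public static void main(String[] args) {
            int n = 4096;
            double[][] a = new double[n][n];     // ~128 MB, far larger than any cache

            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++)          // row-major: walks memory sequentially
                for (int j = 0; j < n; j++)
                    a[i][j] += 1.0;
            long t1 = System.nanoTime();
            for (int j = 0; j < n; j++)          // column-major: jumps between rows on
                for (int i = 0; i < n; i++)      // every step, defeating the cache
                    a[i][j] += 1.0;
            long t2 = System.nanoTime();

            System.out.printf("row-major:    %d ms%n", (t1 - t0) / 1_000_000);
            System.out.printf("column-major: %d ms%n", (t2 - t1) / 1_000_000);
        }
    }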

48 External Factors 19

49 External Factors How many CPUs are available at each time? 19

50 External Factors How many CPUs are available at each time? Are the CPUs of the same type? 19

51 External Factors How many CPUs are available at each time? Are the CPUs of the same type? Are special purpose processors available (GPUs)? 19

52 External Factors How many CPUs are available at each time? Are the CPUs of the same type? Are special purpose processors available (GPUs)? How expensive is data transfer? 19

53 External Factors How many CPUs are available at each time? Are the CPUs of the same type? Are special purpose processors available (GPUs)? How expensive is data transfer? Is the system functioning reliably? 19

54 Human Factors 20

55 Human Factors Do you still need to experiment? 20

56 Human Factors Do you still need to experiment? Are there possibly bugs? 20

57 Human Factors Do you still need to experiment? Are there possibly bugs? Can the algorithm be continued? 20

58 Human Factors Do you still need to experiment? Are there possibly bugs? Can the algorithm be continued? Do I need to supervise the execution? 20

59 Human Factors Do you still need to experiment? Are there possibly bugs? Can the algorithm be continued? Do I need to supervise the execution? When does my batch finish? Do I need to log out in the middle of the night? 20

60 Manual Estimation 21

61 Manual Estimation Check loops and nested loops: for, while 21

62 Manual Estimation Check loops and nested loops: for, while Nesting over input determines order of magnitude! 21

63 Manual Estimation Check loops and nested loops: for, while Nesting over input determines order of magnitude! Find memory allocations: new(), malloc(), etc. 21

64 Manual Estimation Check loops and nested loops: for, while Nesting over input determines order of magnitude! Find memory allocations: new(), malloc(), etc. Find I/O commands: read(), write() 21

65 Problem String readFile(java.io.FileReader fileReader) throws java.io.IOException { java.io.BufferedReader br = new java.io.BufferedReader(fileReader); String s = ""; String line; while (null != (line = br.readLine())) s += line + "\n"; return s; } 22

66 Problem Manual estimation is hard and often counterintuitive! Example: String readFile(java.io.FileReader fileReader) throws java.io.IOException { java.io.BufferedReader br = new java.io.BufferedReader(fileReader); String s = ""; String line; while (null != (line = br.readLine())) s += line + "\n"; return s; } 22

67 Problem 23

68 Problem Reading in the ASCII version of Moby Dick, at roughly 69 characters per line (1.1 MB in total, not BIG DATA)... 23

69 Problem Reading in the ASCII version of Moby Dick, at roughly 69 characters per line (1.1 MB in total, not BIG DATA)... ...shuffles around gigabytes! 23
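
The culprit is the s += line + "\n" inside the loop: Java strings are immutable, so every += copies the whole text accumulated so far, and reading n lines copies on the order of n times the file size in characters. For a roughly 1 MB file with thousands of lines that is indeed gigabytes of copying. A sketch of the standard fix, accumulating into a StringBuilder so the total work stays linear:

    // Same behavior as the slide's readFile, but with linear instead of quadratic copying.
    // (Sketch of the usual fix, not the lecture's code.)
    String readFile(java.io.FileReader fileReader) throws java.io.IOException {
        java.io.BufferedReader br = new java.io.BufferedReader(fileReader);
        StringBuilder sb = new StringBuilder();
        String line;
        while (null != (line = br.readLine())) {
            sb.append(line).append('\n');   // appends in place, no copy of the text so far
        }
        return sb.toString();
    }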

70 Formal Methods 24

71 Formal Methods Similarity templates 24

72 Formal Methods Similarity templates Rough Set-based methods 24

73 Formal Methods Similarity templates Rough Set-based methods Queueing theory 24

74 Formal Methods Similarity templates Rough Set-based methods Queueing theory Autotuning 24

75 Formal Methods Similarity templates Rough Set-based methods Queueing theory Autotuning Problem: Usually limited due to assumptions, might err as well... 24

76 Inductive Algorithm Design 25

77 Inductive Algorithm Design Show algorithm works on small set 25

78 Inductive Algorithm Design Show algorithm works on small set Show error doesn't get worse when adding more data 25

79 Inductive Algorithm Design Show algorithm works on small set Show error doesn't get worse when adding more data Deduce that it will work for arbitrary data sizes 25

80 Problem: Non-Linear Scale [Chart: PARDORA Query Performance, total time (seconds) vs. number of songs, 0 to 2500] 26

81 Optimization: General Approach 27

82 Optimization: General Approach Use spiral development scheme: 27

83 Optimization: General Approach Use spiral development scheme: Estimate computational needs on smaller scale. 27

84 Optimization: General Approach Use spiral development scheme: Estimate computational needs on smaller scale. Verify empirically. 27

85 Optimization: General Approach Use spiral development scheme: Estimate computational needs on smaller scale. Verify empirically. If verification OK, go larger scale and repeat 27

86 Optimization: General Approach Use spiral development scheme: Estimate computational needs on smaller scale. Verify empirically. If verification OK, go larger scale and repeat If not: debug estimation! 27

87 How to debug estimation 28

88 How to debug estimation Find the bottlenecks 28

89 How to debug estimation Find the bottlenecks Prioritize worst bottleneck first 28

90 How to debug estimation Find the bottlenecks Prioritize worst bottleneck first Problem: Diminishing Returns! 28

91 Fighting Bottlenecks 29

92 Fighting Bottlenecks Is there a faster algorithm? 29

93 Fighting Bottlenecks Is there a faster algorithm? Would an approximation do it? 29

94 Fighting Bottlenecks Is there a faster algorithm? Would an approximation do it? Can we do a fast match? 29

95 Fighting Bottlenecks Is there a faster algorithm? Would an approximation do it? Can we do a fast match? Can we parallelize? 29

96 Fighting Bottlenecks Is there a faster algorithm? Would an approximation do it? Can we do a fast match? Can we parallelize? Can we parallelize using specialized hardware? 29

97 Example: Iterative Optimization of ICSI Diarization System 30

98 Example: Iterative Optimization of ICSI Diarization System General Problem: 30

99 Example: Iterative Optimization of ICSI Diarization System General Problem: Rewriting from a base of highly optimized serial code (~10k lines for the core algorithm). 30

100 Example: Iterative Optimization of ICSI Diarization System General Problem: Rewriting from a base of highly optimized serial code (~10k lines for the core algorithm). Current Technical Issue: 30

101 Example: Iterative Optimization of ICSI Diarization System General Problem: Rewriting from a base of highly optimized serial code (~10k lines for the core algorithm). Current Technical Issue: CPU parallelism limited and coarse but easier to implement 30

102 Example: Iterative Optimization of ICSI Diarization System General Problem: Rewriting from a base of highly optimized serial code (~10k lines for the core algorithm). Current Technical Issue: CPU parallelism limited and coarse but easier to implement GPU parallelism finer but harder to implement 30

103 Example: Iterative Optimization of ICSI Diarization System General Problem: Rewriting from a base of highly optimized serial code (~10k lines for the core algorithm). Current Technical Issue: CPU parallelism limited and coarse but easier to implement GPU parallelism finer but harder to implement Current Solution: Hybrid Approach: 30

104 Example: Iterative Optimization of ICSI Diarization System General Problem: Rewriting from a base of highly optimized serial code (~10k lines for the core algorithm). Current Technical Issue: CPU parallelism limited and coarse but easier to implement GPU parallelism finer but harder to implement Current Solution: Hybrid Approach: CPU parallelism per cluster 30

105 Example: Iterative Optimization of ICSI Diarization System General Problem: Rewriting from a base of highly optimized serial code (~10k lines for the core algorithm). Current Technical Issue: CPU parallelism limited and coarse but easier to implement GPU parallelism finer but harder to implement Current Solution: Hybrid Approach: CPU parallelism per cluster GPU parallelism per frame 30

106 Example: Iterative Optimization of ICSI Diarization System General Problem: Rewriting from a base of highly optimized serial code (~10k lines for the core algorithm). Current Technical Issue: CPU parallelism limited and coarse but easier to implement GPU parallelism finer but harder to implement Current Solution: Hybrid Approach: CPU parallelism per cluster GPU parallelism per frame Process: Bottleneck by Bottleneck... 30

107 Speaker Diarization Overview BERKELEY PAR LAB [Block diagram] Processing Chain stages: Audio Signal; Dynamic Range Compression; Wiener Filtering; Beamforming; Delay Features; Short-Term Feature Extraction (MFCC); Long-Term Feature Extraction (Prosodics, only speech); Speech/Non-Speech Detector; Segmentation (Initial Segments); Diarization Engine (EM Clustering) -> "who spoke when" Overall Structure: Pipe & Filter 31

108 Speaker Diarization Overview BERKELEY PAR LAB [Block diagram] Processing Chain stages: Audio Signal; Dynamic Range Compression; Wiener Filtering; Beamforming; Delay Features; Short-Term Feature Extraction (MFCC); Long-Term Feature Extraction (Prosodics, only speech); Speech/Non-Speech Detector; Segmentation (Initial Segments); Diarization Engine (EM Clustering) -> "who spoke when" Overall Structure: Pipe & Filter 31

109 Algorithm outline: Speaker Diarization: Core Algorithm BERKELEY PAR LAB Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

110 Speaker Diarization: Core Algorithm BERKELEY PAR LAB Algorithm outline: Initialization Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

111 Speaker Diarization: Core Algorithm BERKELEY PAR LAB Algorithm outline: Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

112 Speaker Diarization: Core Algorithm BERKELEY PAR LAB Algorithm outline: Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

113 Speaker Diarization: Core Algorithm BERKELEY PAR LAB Algorithm outline: Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

114 Speaker Diarization: Core Algorithm BERKELEY PAR LAB Algorithm outline: Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training (Re-)Alignment Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

115 Algorithm outline: Speaker Diarization: Core Algorithm Initialization BERKELEY PAR LAB Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training (Re-)Alignment Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

116 Algorithm outline: Speaker Diarization: Core Algorithm Initialization BERKELEY PAR LAB Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training Yes Merge two Clusters? (Re-)Alignment Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

117 Algorithm outline: Speaker Diarization: Core Algorithm Initialization BERKELEY PAR LAB Cluster2 Cluster1 Cluster2 Cluster2 (Re-)Training Yes Merge two Clusters? (Re-)Alignment Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

118 Algorithm outline: Speaker Diarization: Core Algorithm Initialization BERKELEY PAR LAB Cluster2 Cluster1 Cluster2 Cluster2 (Re-)Training Yes Merge two Clusters? (Re-)Alignment Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

119 Algorithm outline: Speaker Diarization: Core Algorithm Initialization BERKELEY PAR LAB Cluster2 Cluster1 Cluster2 Cluster2 (Re-)Training Yes Merge two Clusters? (Re-)Alignment Cluster1 Cluster2 Cluster1 Cluster2 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32

120 Algorithm outline: Speaker Diarization: Core Algorithm Initialization BERKELEY PAR LAB Cluster2 Cluster1 Cluster2 Cluster2 (Re-)Training Yes Merge two Clusters? No End (Re-)Alignment Cluster1 Cluster2 Cluster1 Cluster2 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 32
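
For readers of the transcript, a compact Java-style sketch of the agglomerative loop outlined above. Cluster, trainGmm, realign and bicDelta are simplified placeholders introduced for illustration; this is not the ICSI implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the agglomerative diarization loop described on these slides.
    public class DiarizationSketch {
        static class Cluster { List<double[]> frames = new ArrayList<>(); Object model; }

        static Object trainGmm(List<double[]> frames) { return new Object(); }   // placeholder
        static void realign(List<double[]> frames, List<Cluster> clusters) { }   // placeholder
        static double bicDelta(Cluster a, Cluster b) { return -1.0; }            // placeholder

        static void diarize(List<double[]> frames, List<Cluster> clusters) {
            while (true) {                                                 // clusters start "too many"
                for (Cluster c : clusters) c.model = trainGmm(c.frames);   // (Re-)Training
                realign(frames, clusters);                                 // (Re-)Alignment

                Cluster bestA = null, bestB = null;                        // find best pair to merge
                double bestDelta = 0;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = bicDelta(clusters.get(i), clusters.get(j));
                        if (d > bestDelta) { bestDelta = d; bestA = clusters.get(i); bestB = clusters.get(j); }
                    }

                if (bestA == null) break;                                  // nothing left to merge: End
                bestA.frames.addAll(bestB.frames);                         // merge the chosen pair
                clusters.remove(bestB);
            }
        }

        public static void main(String[] args) {
            diarize(new ArrayList<>(), new ArrayList<>());                 // trivial smoke test
        }
    }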

121 CPU Parallelization I BERKELEY PAR LAB Easiest: Different Streams = Different CPUs Delay Features Prosodics (only speech) MFCC (only Speech) Segmentation Diarization Engine Clustering "who spoke when" Parallelization: Each Feature Stream Processed in a Different CPU Thread (MapReduce) 33

122 CPU Parallelization I BERKELEY PAR LAB Easiest: Different Streams = Different CPUs Delay Features Prosodics (only speech) MFCC (only Speech) [Diagram: three parallel Segmentation -> Diarization Engine -> Clustering pipelines, one per feature stream] "who spoke when" Parallelization: Each Feature Stream Processed in a Different CPU Thread (MapReduce) 33

123 CPU Parallelization II New Bottleneck: Comparison of clusters in each iteration BERKELEY PAR LAB Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5... Cluster n (n choose 2) comparisons Bayesian Information Criterion e.g. Cluster 1 Cluster 5 Merging Decision Parallelization: Each pair comparison assigned to a different CPU thread (MapReduce with irregular data access) 34

124 CPU Parallelization II New Bottleneck: Comparison of clusters in each iteration BERKELEY PAR LAB Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5... Cluster n (n choose 2) comparisons Bayesian Information Criterion, one CPU thread per pair ((n choose 2) threads) e.g. Cluster 1 Cluster 5 Merging Decision Parallelization: Each pair comparison assigned to a different CPU thread (MapReduce with irregular data access) 34
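
A minimal sketch of farming the (n choose 2) comparisons out to a thread pool (illustrative; bicDelta and the plain Object clusters are placeholders, not the ICSI code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch: score every cluster pair on its own worker thread, then pick the best score.
    public class PairwiseComparison {
        static double bicDelta(Object a, Object b) { return Math.random(); }  // placeholder

        public static void main(String[] args) throws Exception {
            List<Object> clusters = new ArrayList<>();
            for (int i = 0; i < 16; i++) clusters.add(new Object());

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<Double>> scores = new ArrayList<>();
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {       // (n choose 2) pairs
                    Object a = clusters.get(i), b = clusters.get(j);
                    Callable<Double> task = () -> bicDelta(a, b);     // one task per pair
                    scores.add(pool.submit(task));
                }
            }

            double best = Double.NEGATIVE_INFINITY;
            for (Future<Double> f : scores) best = Math.max(best, f.get());
            pool.shutdown();
            System.out.println("best pairwise score: " + best);
        }
    }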

125 CPU Parallelization II: Result BERKELEY PAR LAB [Chart: Multicore Speaker Diarization, Skype Call, runtime with and without fast-match] Serial optimization more efficient than parallelization! 35

126 CPU Parallelization II BERKELEY PAR LAB New Bottleneck: Training Models Initialization (Re-)Training (Re-)Alignment Yes Merge two Clusters? No End Parallelization: Each Cluster Computed on a Different Core (MapReduce of dense linear algebra operations) 36

127 CPU Parallelization II BERKELEY PAR LAB New Bottleneck: Training Models Initialization (Re-)Training (Re-)Alignment Yes Merge two Clusters? No End Parallelization: Each Cluster Computed on a Different Core (MapReduce of dense linear algebra operations) 36

128 Diarization CPU Parallelization BERKELEY PAR LAB [Chart: Runtime/Core in xRT vs. number of Cores] Parallelized Portion of the Runtime: ~62% 37
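
A quick consequence of the ~62% figure, using the Amdahl formula from slide 16: with 62% of the runtime parallelized, the speed-up on N cores is bounded by 1 / (0.38 + 0.62 / N), i.e. roughly 1.4x on 2 cores, 2.1x on 6 cores, and never more than 1 / 0.38 (about 2.6x) no matter how many cores are added.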

129 Diarization GPU Parallelization BERKELEY PAR LAB New Bottleneck: Computation of Log-Likelihoods (Initially 28% of runtime, now 60% of runtime) Initialization (Re-)Training (Re-)Alignment Yes Merge two Clusters? No End Parallelization: Use CUDA for Frame-Level Parallelization. 38

130 Diarization GPU Parallelization BERKELEY PAR LAB New Bottleneck: Computation of Log-Likelihoods (Initially 28% of runtime, now 60% of runtime) Initialization (Re-)Training (Re-)Alignment Yes Merge two Clusters? No End Parallelization: Use CUDA for Frame-Level Parallelization. 38

131 Diarization GPU Parallelization BERKELEY PAR LAB [Diagram: frames F1...Fn split into chunks, one frame per CUDA thread (Fi = Thread i); GMM 1...GMM k held in CUDA memory; per-frame log-likelihoods ll1...lln computed in parallel] Result: Total Speaker Diarization speed-up 4.8 -> 0.07 xRT 39
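
Simple arithmetic on the quoted figures: going from 4.8x real time to 0.07x real time is roughly a 69x end-to-end speed-up, and at 0.07 xRT an hour of audio is diarized in about four minutes.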

132 Next Week (Project Meeting) Jaeyoung Choi on MediaEval 2012

133 Next Week (Lecture) Mid-Term Exam! 41
