Optimization Space Exploration of the FastFlow Parallel Skeleton Framework


1 Optimization Space Exploration of the FastFlow Parallel Skeleton Framework
Alexander Collins, Christian Fensch, Hugh Leather
University of Edinburgh, School of Informatics

2 What We Did
- Parallel skeletons provide an easy abstraction for parallel programs
- They contain many manually tuned parameters
- Automatic tuning provides better performance
- Preliminary results that make auto-tuning faster

3-8 Motivating Example

9 Tuning the Example
[Figure: speedup against maximum recursion depth, marking the human expert's setting and the best possible setting.]

10 What Next?
- Humans failed
- Auto-tuning won
- Can we do even better?
- Multiple parameters
- Multiple programs
- Multiple platforms

11 Optimization Space Exploration

12 Parameters Investigated
- Number of workers
- Bounded or unbounded queues
- Size of queue's buffer
- Cache alignment
- Maximum recursion depth
- Batch size

13-15 Speedup over a Human Expert

16 Visualising the Optimisation Space

17 Visualisation of Optimisation Space

18 Reducing the Size of the Search Space
Two methods:
- Remove useless parameters
- Exploit linear dependencies

19 What is a Useless Parameter?

20 What is a Useful Parameter?

21 Removing Useless Parameters
[Figure: bar chart of average performance loss (0% to 50%) per parameter: buffertype, cachealign, buffersize, seqthresh, batchsize, numworkers.]
Reduces the size of the search space by a factor of 6
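One way to make "average performance loss" concrete is sketched below. The slide does not define the metric, so this is an assumption: for each parameter, pin it to a single value across all programs, take the best performance each program can still reach, and average the relative loss against each program's unconstrained best. The function name, the data layout and the toy numbers are all hypothetical.

```python
def average_loss_if_fixed(results, index):
    """Smallest average relative loss from pinning one parameter to one value.

    results maps program name -> {config tuple: performance}, higher is better.
    Assumes every program was measured on the same full grid of configs.
    """
    values = {config[index] for runs in results.values() for config in runs}
    averages = []
    for value in values:
        losses = []
        for runs in results.values():
            best = max(runs.values())
            # Best performance still reachable with the parameter pinned.
            best_pinned = max(p for c, p in runs.items() if c[index] == value)
            losses.append(1.0 - best_pinned / best)
        averages.append(sum(losses) / len(losses))
    return min(averages)

# Toy data (hypothetical): configs are (numworkers, buffertype) pairs.
results = {
    "a": {(1, "x"): 1.0, (1, "y"): 1.0, (2, "x"): 2.0, (2, "y"): 2.0},
    "b": {(1, "x"): 4.0, (1, "y"): 4.0, (2, "x"): 1.0, (2, "y"): 1.0},
}
print(average_loss_if_fixed(results, 0))  # loss from pinning numworkers
print(average_loss_if_fixed(results, 1))  # loss from pinning buffertype
```

Under this reading, a parameter whose loss stays near zero can be pinned and dropped from the search, which shrinks the space by the number of values it had.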

22-24 Exploiting Linear Dependencies

25 Conclusions
- Tuning parameters is very important
- Humans are bad at tuning
- Auto-tuning is much better
- Tuning is program and platform dependent
- Have shown preliminary results that make auto-tuning faster


27 Speedup over a Human Expert
[Figure: speedup per program (aquad, cwc, dt, fibonacci, mandelbrot, matmul, nqueens, pbzip2, quicksort, swps3, Average) on each platform (desktop, phantom, scuttle, xxxii16, xxxii, Average).]

28 Principal Components Analysis
N = 6, P = 5624
p = (batchsize, buffersize, buffertype, cachealign, numworkers, seqthresh)
λ = (0.443, 0.419, 0.204, 0.138, 0.007, 0.003)
e = …
ν = (36%, 70%, 87%, 99%, 99%, 100%)
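The percentages in ν match a cumulative share of variance: each entry is the running sum of the eigenvalues λ divided by their total. A minimal Python sketch using the λ values from the slide follows; the function name is illustrative, and the small differences from the slide's percentages are assumed to come from rounding of the printed eigenvalues.

```python
from itertools import accumulate

def cumulative_explained_variance(eigenvalues):
    """Return the running share of total variance covered by the leading components."""
    total = sum(eigenvalues)
    return [running / total for running in accumulate(eigenvalues)]

# Eigenvalues as printed on the slide, largest first.
eigenvalues = [0.443, 0.419, 0.204, 0.138, 0.007, 0.003]
for k, share in enumerate(cumulative_explained_variance(eigenvalues), start=1):
    print(f"first {k} component(s): {share:.1%}")
```

The first four components already cover about 99% of the variance, which is consistent with the idea that the six parameters can be searched in a lower-dimensional space.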

29 Is the subset representative?
[Figure: percentage of best program performance (60% to 100%) against number of iterations, one panel per platform: desktop, phantom, scuttle, xxxii16, xxxii.]

30 Programs
Program     Description
aquad       Adaptive Quadrature algorithm
cwc         Implementation of CWC, a calculus for the representation and simulation of biological systems
dt          Implementation of the C4.5 decision tree algorithm
fibonacci   Naïve recursive algorithm, without memoization, to compute Fibonacci numbers
mandelbrot  Mandelbrot fractal generator
matmul      O(n^3) nested-loops matrix multiplication
nqueens     n-queens problem solver
pbzip2      Parallel bzip2 compression
quicksort   Parallel quicksort
swps3       Smith-Waterman algorithm for gene sequence alignment

31 Platforms
Platform  Processor               Cores  Freq.   Memory  L3        L2
xxxii     4 × Intel Xeon L7555    …      … GHz   64GB    4 × 24MB  32 × 256KB
xxxii16   2 × Intel Xeon L7555    …      … GHz   64GB    2 × 24MB  …
scuttle   AMD Phenom II X6 1055T  6      3.3GHz  8GB     1 × 6MB   … × 512KB
phantom   Intel Xeon E5430        …      … GHz   8GB     None      2 × 6MB
desktop   Intel Core 2 Duo E…     …      … GHz   3GB     None      1 × 2MB

32 Parameter Values
Parameter                 Values
numworkers                1, ..., #cores × 1.5
buffertype                Bounded or unbounded
buffersize                1, 2, 4, 8, ..., 2^20
batchsize                 1, 2, 4, 8, ..., 2^20
cachealign                64, 128 or 256 bytes
seqthresh with aquad      0.02, 0.04, 0.06, ..., 1
seqthresh with fibonacci  10, 11, 12, ..., 44
seqthresh with nqueens    3, 4, 5, ..., 15
seqthresh with quicksort  1, 2, 4, 8, ..., 2^21
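To give a feel for how large this space is, the table can be written out as Python value lists and multiplied up. This is a sketch under stated assumptions: it reads the numworkers range as running up to 1.5 times the number of cores, rounded down (the rounding is an assumption), it treats every parameter as applying to the program, and the function names and the six-core example are hypothetical.

```python
import math

def parameter_space(num_cores, seqthresh_values):
    """Build the candidate values for each parameter, following the slide's table."""
    powers_of_two = [2 ** k for k in range(21)]  # 1, 2, 4, ..., 2^20
    return {
        # Assumption: upper bound is floor(1.5 * cores).
        "numworkers": list(range(1, math.floor(num_cores * 1.5) + 1)),
        "buffertype": ["bounded", "unbounded"],
        "buffersize": powers_of_two,
        "batchsize": powers_of_two,
        "cachealign": [64, 128, 256],
        "seqthresh": list(seqthresh_values),
    }

def space_size(space):
    """Number of distinct configurations in the full Cartesian product."""
    return math.prod(len(values) for values in space.values())

# fibonacci on a hypothetical six-core machine: seqthresh ranges over 10..44.
fibonacci_space = parameter_space(num_cores=6, seqthresh_values=range(10, 45))
print(space_size(fibonacci_space))
```

Even for one program on a small machine the product runs to hundreds of thousands of configurations, which is why exhaustive search is costly and reducing the space matters.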

33 Outlier Removal
- Arithmetic mean is not a robust statistic
- An outlier will cause many more repeats
- Impractical
- Remove using interquartile range removal: $[Q_1 - k(Q_3 - Q_1),\ Q_3 + k(Q_3 - Q_1)]$ with $k = 3$
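The interquartile range rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it uses `statistics.quantiles` with its default quartile convention, which is an assumption because the slide does not say how the quartiles are computed, and the timing values are made up.

```python
import statistics

def remove_outliers(samples, k=3.0):
    """Keep samples inside [Q1 - k*IQR, Q3 + k*IQR], where IQR = Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in samples if low <= x <= high]

# Hypothetical execution times in seconds, with one disturbed run.
times = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 50.0]
print(remove_outliers(times))
```

With k = 3 only extreme values are discarded, so ordinary run-to-run variation stays in the sample.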

34 Quantifying Noise
Repeats allow quantification of noise:
- Perform between 10 and 100 repeats
- Stop if coefficient of variation drops below 1% for a 99% confidence interval
- Use the arithmetic mean as an estimator of execution time
- Use confidence intervals to compare execution times
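The stopping rule can be sketched as an adaptive measurement loop. The interpretation here is an assumption: the loop stops once the half-width of the 99% confidence interval falls below 1% of the mean. It also uses a normal approximation (z = 2.576) rather than a Student's t quantile, and `measure` and `run_once` are hypothetical names.

```python
import math
import statistics

Z_99 = 2.576  # two-sided 99% quantile of the standard normal (approximation)

def measure(run_once, min_repeats=10, max_repeats=100, threshold=0.01):
    """Repeat run_once() until the 99% CI half-width is below threshold * mean.

    Returns the mean execution time and the number of repeats used.
    """
    samples = [run_once() for _ in range(min_repeats)]
    while len(samples) < max_repeats:
        mean = statistics.mean(samples)
        half_width = Z_99 * statistics.stdev(samples) / math.sqrt(len(samples))
        if half_width < threshold * mean:
            break  # measurement is stable enough
        samples.append(run_once())
    return statistics.mean(samples), len(samples)

# A perfectly stable (hypothetical) benchmark stops at the minimum repeat count.
print(measure(lambda: 1.0))
```

A noisy benchmark keeps accumulating repeats up to the cap of 100, which is why removing outliers before this step saves measurement time.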

35 Skeletons Provided by FastFlow
- farm
- farm-with-feedback
- pipe
