PASCAL. A Parallel Algorithmic SCALable Framework for N-body Problems. Laleh Aghababaie Beni, Aparna Chandramowlishwaran. Euro-Par 2017.

Size: px

Start display at page:

Download "PASCAL. A Parallel Algorithmic SCALable Framework for N-body Problems. Laleh Aghababaie Beni, Aparna Chandramowlishwaran. Euro-Par 2017."

Phoebe Lane
6 years ago
Views:

1 PASCAL A Parallel Algorithmic SCALable Framework for N-body Problems Laleh Aghababaie Beni, Aparna Chandramowlishwaran Euro-Par 2017

2 Outline Introduction PASCAL Framework Space Partitioning Trees Tree Traversal Prune/Approximate Generators Experiments & Results Conclusions & Future Work Optimizations & Parallelization

3 Outline Introduction PASCAL Framework Space Partitioning Trees Tree Traversal Prune/Approximate Generators Experiments & Results Conclusions & Future Work Optimizations & Parallelization

4 N-body calculations q : F (q) = r ( {q}) C r q r q 3 Force computation q : AllNN(q) =argmin r R d(q, r) q : KDE(q) = 1 R r R K(q, r) Nearest neighbors Kernel density estimation q : Range(q) = r R I(dist(q, r)) h) Range count

5 N-body calculations What do these have in common? q : F (q) = r ( {q}) C r q r q 3 Force computation q : AllNN(q) =argmin r R d(q, r) q : KDE(q) = 1 R r R K(q, r) Nearest neighbors Kernel density estimation q : Range(q) = I(dist(q, r)) h) Range count r R

6 N-body calculations What do these have in common? q : F (q) = r ( {q}) C r q r q 3 Force computation q : AllNN(q) =argmin r R d(q, r) q : KDE(q) = 1 R r R K(q, r) Nearest neighbors Kernel density estimation q : Range(q) = I(dist(q, r)) h) Range count r R Consider pairs of points naïvely O(N 2 )

7 Commonality: Optimal approximation algorithms q : F (q) = r ( {q}) C r q r q 3 Force computation Hierarchical tree-based approximation algorithms for force computations, e.g., Barnes-Hut or FMM Evaluate interactions Tree traversals Store aggregate data at nodes, e.g., bounding box, mass

8 N-body problems in other domains Problem All Nearest Neighbors All Range Search All Range Count Naive Bayes Classifier Mixture Model E-step K-means E-step Mixture Model Log-likelihood Kernel Density Estimation Kernel Density Bayes Classifier 2-point (cross-)correlation Nadaraya-Watson Regression Thermodynamic Average Largest-span set Closest Pair Minimum Spanning Tree Coulombic Interaction Average Density Wave function Hausdorﬀ Distance Intrinsic (fractal) Dimension Operators 8, arg min 8, [ arg 8, 8, arg max 8, 8 8, arg min X, log X 8, 8, arg max, 8,, max,..., max arg min, arg min 8, arg min 8,, 8, max, min, Kernel Function xq xr I(hmin < xq xr < hmax ) I(hmin < xq xr < hmax ) p 1 T 1 (1/ 2 k )e 2 (xi µk ) k (xi µk ) P (Ck ) p 1 T 1 (1/ 2 k )e 2 (xi µk ) k (xi µk ) xq p (1/ 2 k )e ( ( xr xq xq h xr ) h xr )P (Ck ) xr < h) I( xq yr µk )T k 1 (xi µk ) 1 2 (xi ( xq ( xq ( xq xq xq h xr ) xr ) xr ) xr xr q r xq xr I( xq xr < h) ( xq xr ) xq I( xq xr xr < h) Each problem has a set of operators and a kernel function

9 Why N-body methods? One of the original seven dwarfs or motifs FMM listed among the top 10 algorithms having the greatest influence in 20 th century EM is one of the top 10 algorithms having the highest impact in data mining Applications Machine learning Computer vision Computational geometry Scientific computing

10 Key Ideas and Findings An algorithmic framework for N-body problems Automatically generates prune & approximation conditions Results in O(N log N) and O(N) algorithms Domain-specific optimizations and parallelization Show x speedup on 6 different algorithms compared to state-of-art libraries/softwares Out-of-the-box new optimal algorithms O(N log N) EM algorithm for GMM s O(N) algorithm for Hausdorff distance

11 Outline Introduction PASCAL Framework Space Partitioning Trees Tree Traversal Prune/Approximate Generators Experiments & Results Conclusions & Future Work Optimizations & Parallelization

12 PASCAL Framework Datasets Space-partitioning Trees Kd-tree Tree Traversal Schemes Multi tree traversals BaseCase Prune/Approximate ComputeApproximate N-body spec.: Operators & Kernel function Prune/Approximate condition generator Domain-Specific Optimizations Parallelization Optimized code

13 Tree Construction Recursively divide space until each box has at most q points.

14 Tree Construction Recursively divide space until each box has at most q points.

15 Tree Construction Recursively divide space until each box has at most q points.

16 Tree Traversal R

17 Tree Traversal Prune/Approx()? R

18 Tree Traversal Prune/Approx()? NO R

19 Tree Traversal R

20 Tree Traversal Prune/Approx()? R

21 Tree Traversal Prune/Approx()? NO R

22 Tree Traversal R

23 Tree Traversal R BaseCase() Direct L RL O(q 2 )

24 Tree Traversal R

25 Tree Traversal R Prune/Approx()?

26 Tree Traversal R Prune/Approx()? YES

27 Tree Traversal R If Prune/Approx() is true, discard the entire subtree for pruning problems

28 Tree Traversal BaseCase() R

29 Tree Traversal R

30 Tree Traversal R Prune/Approx()? YES

31 Tree Traversal R If Prune/Approx() is true, replace the subtree with the centroid for approximation problems

32 Tree Traversal R ApproxCompute()

33 Tree Traversal BaseCase() R

34 Tree Traversal R

35 Prune/Approximate Condition Generator Prune e.g., Hausdorff Distance

36 Hausdorff Distance

37 Hausdorff Distance

38 Hausdorff Distance R

39 Hausdorff Distance R

40 Hausdorff Distance R

41 Hausdorff Distance R

42 Hausdorff Distance R

43 Hausdorff Distance R

44 Prune/Approximate Condition Generator Prune e.g., Hausdorff Distance Approximation e.g., Expectation Maximization (EM) E-step M-step Log-likelihood

45 Approximate Condition for EM

46 Approximate Condition for EM R

47 Approximate Condition for EM R

48 Approximate Condition for EM Kmin R

49 Approximate Condition for EM R

50 Approximate Condition for EM Kmax R

51 Approximate Condition for EM R

52 Approximate Condition for EM center R center

53 Approximate Condition for EM center K center R center

54 Approximate Condition for EM Kmax -Kmin < X K center center K center R center

55 Approximate Condition for EM Kmax -Kmin < user-controlled accuracy X K center center K center R center

56 Approximate Condition for EM Kmax -Kmin < user-controlled accuracy X K center center K center R center log liklihood: E-step: (r i,max r i,min ) < r i,mean (i =1,...,K)

57 Prune Condition for Hausdorff distance:

58 Prune Condition for Hausdorff distance: R

59 Prune Condition for Hausdorff distance: R

60 Prune Condition for Hausdorff distance: border point R

61 Prune Condition for Hausdorff distance: border point R op 1 ( 1, K(x q,x r ) op 2 ( 2, K(x q,x r ))) s.t. 8x q 2 Nq border, 8x r 2 Nr border

62 Outline Introduction PASCAL Framework Space Partitioning Trees Tree Traversal Prune/Approximate Generators Experiments & Results Conclusions & Future Work Optimizations & Parallelization

63 Optimizations Incremental bounding box calculation

64 Optimizations Incremental bounding box calculation

65 Optimizations Incremental bounding box calculation

66 Optimizations Incremental bounding box calculation Only update the dimension that is split at each node

67 Optimizations Incremental bounding box calculation Only update the dimension that is split at each node

68 Optimizations Incremental bounding box calculation Optimal Metric Calculation Reduced distance e.g., squared Euclidean distance Eliminates expensive sqrt instruction with long latencies Partial distance Big payoff for large dimensional datasets

69 Optimizations Incremental bounding box calculation Optimal Metric Calculation Reduced distance e.g., squared Euclidean distance Eliminates expensive sqrt instruction with long latencies Partial distance Big payoff for large dimensional datasets Incremental distance calculation Node-to-node distance computed incrementally from parent s distance in constant time

70 Parallelization

71 Parallelization cilk_spawn

72 Parallelization cilk_spawn

73 Parallelization cilk_spawn Stop spawning when #cilk threads = #physical threads t0 t1 t2 t3

74 Parallelization Task parallelism cilk_spawn Stop spawning when #cilk threads = #physical threads t0 t1 t2 t3 Data parallelism

75 Parallelization Task parallelism cilk_spawn Stop spawning when #cilk threads = #physical threads t0 t1 t2 t3 Data parallelism Pruning/Approximation causes load imbalance

76 Outline Introduction PASCAL Framework Space Partitioning Trees Tree Traversal Prune/Approximate Generators Experiments & Results Conclusions & Future Work Optimizations & Parallelization

Experimental Setup Architecture Dual-socket Intel Xeon E5-2630 v3 processor (Haswell-EP) Each socket has 8 cores

77 Experimental Setup Architecture Dual-socket Intel Xeon E v3 processor (Haswell-EP) Each socket has 8 cores Theoretical peak performance of GFlops Compiler Intel C++ complier (icpc v15.0.2) Python v2.7.6 (Scikit-learn) Java v1.8.0 (Weka)

78 Case Studies (Direct) Nearest Neighbors Range-Search I ( x q x r apple h) Kernel Density Estimation Hausdorff Distance

79 Case Studies (Iterative) Expectation Maximization (EM) E-step M-step Log-likelihood Euclidean Minimum Spanning Tree

80 Library Comparison Weka: 6,677,053 downloads, written in Java Scikit-learn: 121,841 downloads, written in Python MATLAB: over 1,000,000 licensed users, uses C in backend MLPACK: exploits C++ language features to provide maximum performance Speedup EM 6.2 Base Base MATLAB WEKA MLPACK Scikit PASCAL Base Base 12.3 Yahoo! HIGGS Census KDD IHEPC Base 160 Speedup knn Base Base Base Base Yahoo! HIGGS Census KDD IHEPC Base

81 Speedup Breakdown 1.6, 3.2, and 53.7 respectively for the same dataset. KNN EM KDE HD RS EMST Alg +Opt +Par Alg +Opt +Par Alg +Opt +Par Alg +Opt +Par Alg +Opt +Par Alg +Opt +Par Yahoo! HIGGS Census KDD IHEPC

82 Scalability

83 Summary and Status First generalized algorithmic framework for N-body problems Out-of-the-box new optimal algorithms O(N log N) EM algorithm O(N) Hausdorff distance algorithm x speedup from optimal tree algorithm, domain- Generalizes to more than two operators specific optimizations and parallelization optimizations and parallelization Short-term: DSL + code generator for base-case, Long-term: Extend to GPUs and distributed memory systems

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline