Data Reduction and Partitioning in an Extreme Scale GPU-Based Clustering Algorithm

Size: px

Start display at page:

Download "Data Reduction and Partitioning in an Extreme Scale GPU-Based Clustering Algorithm"

Angela Thornton
5 years ago
Views:

1 Data Reduction and Partitioning in an Extreme Scale GPU-Based Clustering Algorithm Benjamin Welton and Barton Miller Paradyn Project University of Wisconsin - Madison DRBSD-2 Workshop November 17 th 2017 Denver, CO

2 Our History with Scalable Reductions Debuggers Performance Tools MRNet Multicast Reduction Network (MRNet) [Roth 04] MRNet was built as a scalable reduction framework for debuggers, performance tools, and applications. Scalability is achieved by a software defined tree based overlay network (TBON). Applications 2

3 Our History with Scalable Reductions Debuggers STAT Stack Trace Analysis Tool [Arnold 2007] Performance Tools MRNet Live debugging of millions of processes (max run to date 6+ million procs at LLNL). MRNet provides the scaleable reduction of stack traces collected from processes. Applications 3

4 Our History with Scalable Reductions Debuggers STAT Cray CCBD Total View Cray ATP Performance Tools MRNet Since STAT s release, MRNet has been used as the reduction framework by a number of commercial debuggers. Applications 4

5 Our History with Scalable Reductions Debuggers STAT Cray CCBD Total View Cray ATP Performance Tools MRNet TAU TAU [Nataraj 2008] Performance monitoring framework. MRNet is used to scalable collect performance data from thousands of nodes. Applications 5

6 Our History with Scalable Reductions Debuggers STAT Cray CCBD Total View Cray ATP Performance Tools MRNet TAU CBTF Open Speed Shop CEPBA Toolkit Applications Commercial and open source tools have continued to use MRNet for scalable communication. 6

7 Our History with Scalable Reductions Debuggers STAT Cray CCBD Total View Cray ATP Performance Tools MRNet TAU CBTF Open Speed Shop CEPBA Toolkit Applications Mean Shift Mean Shift [Arnold 2006] Scalable method of identifying local maximums in a data set (mean shift algorithm). 7

8 Our History with Scalable Reductions Debuggers STAT Cray CCBD Total View Cray ATP Performance Tools MRNet TAU CBTF Open Speed Shop CEPBA Toolkit Mr. Scan [Welton 2014] Applications Mean Shift Highly scalable density based clustering algorithm. First known implementation to scale to over thousands of nodes and to process billions of points Mr. Scan 8

9 MRNet Multicast / Reduction Network General-purpose Tree Based Overlay Network (TBON) o Network: user-defined topology o Stream: logical data channel o to a set of back-ends o multicast, gather, and custom reduction o Packet: collection of data o Filter: stream data operator o User definable aggregation and reduction operations. o Fully asynchronous across multiple streams. F(x 1,,x n ) BE app FE CP CP CP CP CP CP BE app BE app BE app 9

10 Introducing Mr. Scan Mr. Scan is a scalable density-based clustering algorithm Designed to cluster billions of data points.

11 Clustering Example (DBSCAN [1] ) Goal: Find regions that meet minimum density and spatial distance characteristics [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

12 Clustering Example (DBSCAN [1] ) Goal: Find regions that meet minimum density and spatial distance characteristics [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

13 Clustering Example (DBSCAN [1] ) Goal: Find regions that meet minimum density and spatial distance characteristics [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

14 Clustering Example (DBSCAN [1] ) The two parameters that determine if a point is in a cluster is Epsilon (Eps), and MinPts [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

15 Clustering Example (DBSCAN [1] ) The two parameters that determine if a point is in a cluster is Epsilon (Eps), and MinPts Eps [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

16 Clustering Example (DBSCAN [1] ) The two parameters that determine if a point is in a cluster is Epsilon (Eps), and MinPts MinPts [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

17 Clustering Example (DBSCAN [1] ) The two parameters that determine if a If the number of points in Eps is > MinPts, point is in a cluster is Epsilon (Eps), and the point is a core point. MinPts [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

18 Clustering Example (DBSCAN [1] ) The two parameters that determine if a If the number of points in Eps is > MinPts, point is in a cluster is Epsilon (Eps), and the point is a core point. MinPts MinPts: 3 [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

19 Clustering Example (DBSCAN [1] ) The For two every parameters discovered that point, determine this same if a If the number of points in Eps is > MinPts, calculation point is in is a performed cluster is Epsilon until the (Eps), cluster and is the point is a core point. fully MinPts expanded MinPts: 3 [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

20 Clustering Example (DBSCAN [1] ) The For two every parameters discovered that point, determine this same if a If the number of points in Eps is > MinPts, calculation point is in is a performed cluster is Epsilon until the (Eps), cluster and is the point is a core point. fully MinPts expanded MinPts: 3 [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

21 Clustering Example (DBSCAN [1] ) The For two every parameters discovered that point, determine this same if a If the number of points in Eps is > MinPts, calculation point is in is a performed cluster is Epsilon until the (Eps), cluster and is the point is a core point. fully MinPts expanded MinPts: 3 [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

22 Clustering Example (DBSCAN [1] ) The For two every parameters discovered that point, determine this same if a If the number of points in Eps is > MinPts, calculation point is in is a performed cluster is Epsilon until the (Eps), cluster and is the point is a core point. fully MinPts expanded [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

Clustering Example (DBSCAN [1] ) The For two every parameters discovered that point, determine this same if a If the number of points in Eps is > MinPts, calculation point is in is a performed

23 Clustering Example (DBSCAN [1] ) The For two every parameters discovered that point, determine this same if a If the number of points in Eps is > MinPts, calculation point is in is a performed cluster is Epsilon until the (Eps), cluster and is the point is a core point. fully MinPts expanded [1] M. Ester et. al., A density-based algorithm for discovering clusters in large spatial databases with noise, (1996) 11

24 Intro to Mr. Scan Mr. Scan Phases Partition: Distributed DBSCAN: GPU (on BE) Merge: CPU (x #levels) Sweep: CPU (x #levels) FE BE BE BE BE 12

25 Intro to Mr. Scan Mr. Scan Phases Partition: Distributed DBSCAN: GPU (on BE) Merge: CPU (x #levels) Sweep: CPU (x #levels) 12

26 Intro to Mr. Scan Mr. Scan Phases FE Partition: Distributed CP CP DBSCAN: GPU (on BE) DBSCAN Merge: CPU (x #levels) BE BE BE BE Sweep: CPU (x #levels) 12

27 Intro to Mr. Scan Mr. Scan Phases FE Partition: Distributed Merge CP CP DBSCAN: GPU (on BE) Merge: CPU (x #levels) BE BE BE BE Sweep: CPU (x #levels) 12

28 Intro to Mr. Scan Merge Mr. Scan Phases FE Partition: Distributed CP CP DBSCAN: GPU (on BE) Merge: CPU (x #levels) BE BE BE BE Sweep: CPU (x #levels) 12

29 Intro to Mr. Scan Mr. Scan Phases FE Partition: Distributed CP Sweep CP DBSCAN: GPU (on BE) Merge: CPU (x #levels) BE BE BE BE Sweep: CPU (x #levels) 12

30 Intro to Mr. Scan Mr. Scan Phases FE Partition: Distributed CP Sweep CP DBSCAN: GPU (on BE) Sweep BE BE BE BE Merge: CPU (x #levels) Sweep: CPU (x #levels) 12

31 Intro to Mr. Scan Mr. Scan Phases FE Partition: Distributed CP Sweep CP DBSCAN: GPU (on BE) Sweep BE BE BE BE Merge: CPU (x #levels) Sweep: CPU (x #levels) FS 12

32 Properties of a Scalable TBON App Applications must have the following properties to be scalable in a TBON: F(x 1,,x n ) CP FE CP CP CP CP CP BE BE BE BE app app app app 13

33 Properties of a Scalable TBON App Applications must have the following properties to be scalable in a TBON: o Same amount of work across all nodes in the tree. F(x 1,,x n ) CP FE CP CP CP CP CP BE BE BE BE app app app app 13

34 Properties of a Scalable TBON App Applications must have the following properties to be scalable in a TBON: o Same amount of work across all nodes in the tree. o Size of reduction output must be equal or less than input size. F(x 1,,x n ) CP FE CP CP CP CP CP BE BE BE BE app app app app 13

35 Properties of a Scalable TBON App Applications must have the following properties to be scalable in a TBON: o Same amount of work across all nodes in the tree. o Size of reduction output must be equal or less than input size. Mr. Scan must have these properties to scale F(x 1,,x n ) BE CP BE FE BE CP CP CP CP CP BE app app app app 13

36 Challenges To Scaling DBSCAN Workload Balance: o DBSCAN requires that points near one another must be processed on the same node. Size of data needed for merging: o Simple merging methods require sending all points in all clusters to merge accurately o Sending billions of points in the merge phase is infeasible. 14

37 Our Solutions Solve the balance and data size issues jointly with the following techniques: o Smart Partitioning o Selectively performing redundant computation to reduce merge data size (using data duplicated at edge of each partition). o Use an estimate of partition density to identify hard to compute partitions. o Dense Box Algorithm o Reduces the complexity of computation of extremely dense partitions. o Representative Points o Leverage the redundant computation to create a merge method that requires only a small fixed subset of points. 15

38 Partition Phase 16

39 Partition Phase o Algorithm: 16

40 Partition Phase o Algorithm: o Form initial partitions 16

41 Partition Phase o Algorithm: o Form initial partitions 16

42 Partition Phase o Algorithm: o Form initial partitions 16

43 Partition Phase o Algorithm: o Form initial partitions o Add shadow regions 16

44 Partition Phase o Algorithm: o Form initial partitions o Add shadow regions 16

45 Partition Phase o Algorithm: o Form initial partitions o Add shadow regions o Rebalance 16

46 Partition Phase o Algorithm: o Form initial partitions o Add shadow regions o Rebalance 16

47 Partition Phase o Algorithm: o Form initial partitions o Add shadow regions o Rebalance 16

48 Partition Phase o Algorithm: o Form initial partitions o Add shadow regions o Rebalance 16

49 The DBSCAN Density Problem o Imbalances in point density can cause huge differences in runtimes between Thread Groups inside of a GPU (10-15x variance in time) o Issue is caused by the lookup operation for a points neighbors in the DBSCAN point expansion phase. ε ε Higher density results in higher neighbor count which increases the number of comparison operations 17

50 Dense Box o Dense Box eliminates the need to perform neighbor lookups on points in dense regions by labeling points as a member of cluster before DBSCAN is run. 1. Start with an ε region. 2. Divide the region of data into area s of size ε 2 2 for dense area detection *. ε For each ε area which 2 2 has point count >= MinPts. Mark points as members of a cluster. Do not expand these points. ε 2 2 * ε chosen because it guarantees all points inside are within ε distance of each other

51 Dense Box o Dense Box eliminates the need to perform neighbor lookups on points in dense regions by labeling points as a member of cluster before DBSCAN is run. 1. Start with an ε region. 2. Divide the region of data into area s of size ε 2 2 for dense area detection *. ε For each ε area which 2 2 has point count >= MinPts. Mark points as members of a cluster. Do not expand these points. ε 2 2 ε * ε chosen because it guarantees all points inside are within ε distance of each other

52 Merge Algorithm o Two phases in the merge operation 1. Select Representative points (Leaf Node) 2. Merge operation (Internal Node) 19

53 Representative Points (Leaf Nodes) o Large clusters are too expensive to move up the tree o In border regions we select eight representative points to represent the cluster o These points guarantee that any overlap of clusters detected on adjacent nodes Mr. Scan: Efficient Clustering with MRNet and GPUs 20

54 Representative Points (Leaf Nodes) o Large clusters are too expensive to move up the tree o In border regions we select eight representative points to represent the cluster o These points guarantee that any overlap of clusters detected on adjacent nodes Mr. Scan: Efficient Clustering with MRNet and GPUs 20

55 Representative Points (Leaf Nodes) o Large clusters are too expensive to move up the tree o In border regions we select eight representative points to represent the cluster o These points guarantee that any overlap of clusters detected on adjacent nodes Mr. Scan: Efficient Clustering with MRNet and GPUs 20

56 Representative Points (Leaf Nodes) o Large clusters are too expensive to move up the tree o In border regions we select eight representative points to represent the cluster o These points guarantee that any overlap of clusters detected on adjacent nodes Mr. Scan: Efficient Clustering with MRNet and GPUs 20

57 Merge Operation (Internal Nodes) o Merge overlapping clusters found on different DBSCAN leaf nodes o Merge needs to have low overhead and must operate without entire dataset (Representative Points + Non-Core points) MinPts = 3 Node 1 Node 2 Shadow Regions 21

58 Merge Operation (Internal Nodes) Compare representative points sent by leaf nodes o If an overlap in representative points exists between two nodes, those clusters merge. o Otherwise, the clusters are complete and no further propagation of representative points occurs. Cluster on Node 1 and 2 have overlapping representative points and are merged. Node 1 Node 2 Shadow Regions Mr. Scan: Efficient Clustering with MRNet and GPUs 22

59 Merge Operation (Internal Nodes) Compare representative points sent by leaf nodes o If no overlap exists between clusters, finalize the clusters and do not propagate representative points. No overlap in regions between Node 1 and 2. Clusters in both nodes are now final and no further propagation is needed Node 1 Shadow Regions Node 2 Mr. Scan: Efficient Clustering with MRNet and GPUs 23

60 Elapsed Time (seconds) Mr. Scan Results Partition Merge Startup Sweep DBSCAN 0 102M 409M 1.6B 3.2B 6.5B Point Count (800K points per node) 24

61 Takeaway Lessons for Scalable Data Reduction o By selectively duplicating processing, we could use a less data intensive merging algorithm o We were able to equalize the workload between nodes by use of the dense box algorithm o Using these approaches, we were able to scale Mr. Scan up to 8192 nodes. o MRNet is a general scalability framework for data reductions 25

62 Paper available at: Questions? 26

Tree-Based Density Clustering using Graphics Processors

Tree-Based Density Clustering using Graphics Processors A First Marriage of MRNet and GPUs Evan Samanas and Ben Welton Paradyn Project Paradyn / Dyninst Week College Park, Maryland March 26-28, 2012 The