Data mining with sparse grids

1 Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40

2 Overview
What is Data mining?
Regularization networks
Sparse grids
Numerical examples
Conclusions

3 What is Data mining?
»data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.« [Berry and Linoff, Mastering Data Mining]
Example: mail-order merchant (who gets a catalog?)
  Merchant aims to increase revenue per catalog mailed
  Based on available customer data a response model is built
  Available information includes, e.g.
    Number of quarters with at least one order placed
    Number of catalogs purchased from
    Number of days since last order
    Amount of money spent per quarter going back some years

4 Data mining activities
Directed or supervised data mining
  Classification: classifying the risk of credit applicants
  Estimation: estimating the value of a piece of real estate
  Prediction: predicting which customers will leave
Undirected or unsupervised data mining
  Affinity grouping / association rules: shopping cart
  Clustering: a cluster of symptoms indicates a particular disease
  Description and visualization

5 Data mining in the knowledge discovery process
Identifying the problem
Data preparation
Data mining
Post-processing of the discovered knowledge
Putting the results of knowledge discovery to use

6 The classification problem
We want to compute a function $f$, the classifier, which approximates the given training data set but also gives good results on unseen data
For that, a compromise has to be found between
  the correctness of the approximation, i.e. the size of the data error, and
  the generalization qualities of the classifier for new, i.e. previously unseen, data
The dimension $d$ of the data can be large; we will consider moderately high $d$
The training data set can consist of up to millions or billions of data points ($M$ denotes their number)

7 Approximation with data centered ansatz functions
Error is zero at the data points, but the approximation is overfitting
Assume smoothness properties of $f$

8 Regularization networks
To get a well-posed, uniquely solvable problem we have to assume knowledge of $f$
Regularization theory imposes smoothness constraints
The regularization network approach considers the variational problem
  $\min_{f \in V} R(f)$ with $R(f) = \frac{1}{M} \sum_{i=1}^{M} C(f(x_i), y_i) + \lambda \Phi(f)$
  $C(\cdot, \cdot)$: error of the classifier on the given data
  $\Phi(f)$: assumed smoothness properties
  $\lambda$: regularization parameter

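9 Exact solution with kernels
With a basis $\{\varphi_j\}_{j=1}^{\infty}$ of $V$ we have $f(x) = \sum_{j=1}^{\infty} \alpha_j \varphi_j(x)$
In the case of a regularization term of the type $\Phi(f) = \sum_{j=1}^{\infty} \alpha_j^2 / \lambda_j$, where $\{\lambda_j\}_{j=1}^{\infty}$ is a decreasing positive sequence, the solution of the variational problem always has the form
  $f(x) = \sum_{i=1}^{M} \beta_i K(x, x_i)$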

10 Reproducing Kernel Hilbert Space
$K(x, y)$ is a symmetric kernel function
$K$ can be interpreted as the kernel of a Reproducing Kernel Hilbert Space (RKHS)
In other words: if certain functions $K(x, x_i)$, centered at the locations $x_i$ of the data points, are used in an approximation scheme, then the approximate solution is a finite series and involves only $M$ terms
But in general a full $M \times M$ system has to be solved

11 Approximation schemes in regularization network context
For radially symmetric kernels we end up with radial basis function approximation schemes
Many other approximation schemes, like
  additive models
  hyper-basis functions
  ridge approximation models
  and several types of neural networks
can be derived by a specific choice of the regularization operator
The support vector machine (SVM) approach can also be expressed in the form of a regularization network
All scale in general non-linearly in $M$, the number of data points

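12 Discretization
Different approach: we explicitly restrict the problem to a finite dimensional subspace $V_N \subset V$, with $f_N(x) = \sum_{j=1}^{N} \alpha_j \varphi_j(x)$
The ansatz functions $\{\varphi_j\}_{j=1}^{N}$ should span $V_N$ and preferably should form a basis for $V_N$
Cost function: $C(x, y) = (x - y)^2$
Regularization operator: $\Phi(f) = \|Pf\|_{L_2}^2$
$R(f_N)$ is to be minimized in $V_N$, i.e.
  $R(f_N) = \frac{1}{M} \sum_{i=1}^{M} (f_N(x_i) - y_i)^2 + \lambda \|P f_N\|_{L_2}^2$

13 Derivative of the functional
Plug-in of $f_N$ and differentiation with respect to $\alpha_k$, $k = 1, \dots, N$:
  $0 = \frac{\partial R(f_N)}{\partial \alpha_k} = \frac{2}{M} \sum_{i=1}^{M} \Big( \sum_{j=1}^{N} \alpha_j \varphi_j(x_i) - y_i \Big) \varphi_k(x_i) + 2 \lambda \sum_{j=1}^{N} \alpha_j (P\varphi_j, P\varphi_k)_{L_2}$
Or equivalently ($k = 1, \dots, N$):
  $\sum_{j=1}^{N} \alpha_j \Big[ M \lambda (P\varphi_j, P\varphi_k)_{L_2} + \sum_{i=1}^{M} \varphi_j(x_i) \varphi_k(x_i) \Big] = \sum_{i=1}^{M} y_i \varphi_k(x_i)$

14 Problem to solve
With the matrices $C$ and $B$ we get the linear equation system
  $(\lambda C + B \cdot B^T) \alpha = B y$
$C$ is an $N \times N$ matrix with $C_{j,k} = M \cdot (P\varphi_j, P\varphi_k)_{L_2}$
$B$ is an $N \times M$ matrix with $B_{j,i} = \varphi_j(x_i)$
$B^T$ is an $M \times N$ matrix with $(B^T)_{i,j} = \varphi_j(x_i)$
$y$ is the vector of the data classes and has length $M$
$\alpha$ is the vector of the unknowns and has length $N$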
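A minimal sketch of such a regularized least-squares system in one dimension, assuming the form $(\lambda C + B B^T) \alpha = B y$ with $B_{j,i} = \varphi_j(x_i)$ and $C_{j,k} = M (\varphi_j', \varphi_k')_{L_2}$. The choice of hat functions on a uniform grid over $[0, 1]$, the operator $P = d/dx$, and the helper names `hat`, `fit_grid_classifier` and `evaluate` are all assumptions made for illustration, not part of the slides.

```python
import numpy as np

def hat(j, h, x):
    """Piecewise-linear hat function centred at the grid point j*h."""
    return np.maximum(0.0, 1.0 - np.abs(x / h - j))

def fit_grid_classifier(x, y, n_points, lam):
    """Solve (lam*C + B B^T) alpha = B y on a uniform grid over [0, 1]."""
    M = len(x)
    h = 1.0 / (n_points - 1)
    # B[j, i] = phi_j(x_i): basis function j evaluated at data point i
    B = np.stack([hat(j, h, x) for j in range(n_points)])
    # C[j, k] = M * (phi_j', phi_k')_{L2}: 1D stiffness matrix (assumes P = d/dx)
    C = np.zeros((n_points, n_points))
    for j in range(n_points - 1):
        C[j:j + 2, j:j + 2] += np.array([[1.0, -1.0], [-1.0, 1.0]]) / h
    C *= M
    alpha = np.linalg.solve(lam * C + B @ B.T, B @ y)
    return alpha, h

def evaluate(alpha, h, x):
    """Evaluate f_N(x) = sum_j alpha_j * phi_j(x)."""
    return sum(a * hat(j, h, x) for j, a in enumerate(alpha))

# Usage: two classes (-1 / +1) separated at x = 0.5
x = np.linspace(0.0, 1.0, 50)
y = np.where(x < 0.5, -1.0, 1.0)
alpha, h = fit_grid_classifier(x, y, n_points=9, lam=1e-4)
predicted_class = np.sign(evaluate(alpha, h, x))
```

The number of unknowns here is the number of grid points, not the number of data points; the data enter only through the products with $B$.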

15 Approximation with grid-based ansatz functions
In this picture only discrete values are used on the grid points; in general continuous values are used

16 Which function space to take?
Again, widely used are methods with global data-centered basis functions, which scale with the number of data points
We use a grid to discretize the data space and local basis functions on the grid points
A naive grid has $O(h_n^{-d})$ grid points; even with a reasonable size of $n$, where $h_n = 2^{-n}$ gives the mesh size, one encounters the curse of dimensionality
To overcome this we use sparse grids, which have $O(h_n^{-1} (\log h_n^{-1})^{d-1})$ grid points
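To make the difference in grid size concrete, the following sketch counts points of a full grid and of a regular sparse grid. It assumes mesh size $2^{-n}$ in every direction, counts interior points only (no boundary points), and uses the standard construction in which the sparse grid collects all hierarchical levels $l$ with $l_1 + \dots + l_d \le n + d - 1$. The function names are hypothetical.

```python
from itertools import product

def full_grid_points(level, dim):
    """Interior points of a full grid with mesh size 2**-level per direction."""
    return (2 ** level - 1) ** dim

def sparse_grid_points(level, dim):
    """Interior points of a regular sparse grid (sum of levels <= level + dim - 1)."""
    total = 0
    for l in product(range(1, level + 1), repeat=dim):
        if sum(l) <= level + dim - 1:
            points = 1
            for lt in l:
                points *= 2 ** (lt - 1)  # new points added by 1D level lt
            total += points
    return total

# Usage: compare both counts in two and in six dimensions
for dim in (2, 6):
    print(dim, full_grid_points(5, dim), sparse_grid_points(5, dim))
```

The brute-force enumeration over all level vectors is only meant for small `level` and `dim`.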

17 Interpolation with the hierarchical basis
Interpolation
Hierarchical basis
The 1D case is generalized by means of a tensor product approach
Hierarchical values of the $d$-dimensional basis functions are bounded through the size of their supports
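The decay of the hierarchical values can be seen in a small 1D sketch. It assumes hat functions on $[0, 1]$, a function that vanishes at both boundary points, and the usual definition of the hierarchical value at a new point as the function value minus the mean of its two neighbours at distance $h$. The function name is hypothetical.

```python
def hierarchical_surpluses(f, max_level):
    """Return {(level, index): hierarchical value} for the 1D hat basis on [0, 1]."""
    surpluses = {}
    for level in range(1, max_level + 1):
        h = 2.0 ** -level
        for i in range(1, 2 ** level, 2):  # odd indices are the new points of this level
            x = i * h
            surpluses[(level, i)] = f(x) - 0.5 * (f(x - h) + f(x + h))
    return surpluses

# Usage: for f(x) = x(1 - x) every value on level l equals h^2 = 4**-l
values = hierarchical_surpluses(lambda x: x * (1.0 - x), 4)
```

For this quadratic example the values shrink by a factor of four per level, in line with the shrinking supports of the basis functions.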

18 Supports of the hierarchical basis functions

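19 Sparse grids
Space $V_l$ of piecewise $d$-linear functions of level $l = (l_1, \dots, l_d)$: $V_l = \mathrm{span}\{\varphi_{l,j}\}$
Difference spaces $W_l$ of level $l$
Sparse grid space: $V_n^{(s)} = \bigoplus_{|l|_1 \le n + d - 1} W_l$
A function $f \in V_n^{(s)}$ can be split accordingly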

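20 Properties of sparse grids
number of points: full grid $O(h_n^{-d})$, sparse grid $O(h_n^{-1} (\log h_n^{-1})^{d-1})$
approximation properties: full grid $O(h_n^2)$, sparse grid $O(h_n^2 (\log h_n^{-1})^{d-1})$
smoothness properties: full grid needs bounded second derivatives, sparse grid needs bounded mixed second derivatives
Sparse grid in 2D and 3D with level …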

21 Sparse grids
Example in six dimensions with level …
  full grid: … points
  sparse grid: … points, i.e. …
Now use sparse grids to solve the minimization problem
  Linear equation system with $O(h_n^{-1} (\log h_n^{-1})^{d-1})$ points
  Matrix is more densely populated than the corresponding full grid matrices, which would add further terms to the complexity
  Explicit assembly of the matrix should be avoided
  It is difficult to implement only the action of the matrices
  Action of the data matrix would scale with the number of data points
Therefore use the combination technique variant of sparse grids

22 Combination technique of level 4 in 2D

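23 Sparse grids with the combination technique
Solve the problem on the sequence of full grids $\Omega_l$ with mesh sizes $h_l = (2^{-l_1}, \dots, 2^{-l_d})$, where $|l|_1 = l_1 + \dots + l_d = n + (d - 1) - q$, $q = 0, \dots, d - 1$, $l_t > 0$
Combine the solutions $f_l$ on $\Omega_l$:
  $f_n^c(x) = \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q} \sum_{|l|_1 = n + (d-1) - q} f_l(x)$
The result $f_n^c$ lives in the sparse grid space of level $n$ and dimension $d$
Example in two dimensions:
  $f_n^c(x) = \sum_{l_1 + l_2 = n + 1} f_{l_1, l_2}(x) - \sum_{l_1 + l_2 = n} f_{l_1, l_2}(x)$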
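The sequence of full grids and their combination coefficients can be enumerated with a short sketch. It assumes the standard combination formula, in which a grid with level vector $l$ ($l_t \ge 1$) and $l_1 + \dots + l_d = n + (d - 1) - q$ receives the coefficient $(-1)^q \binom{d-1}{q}$ for $q = 0, \dots, d - 1$. The function name is hypothetical and the enumeration is brute force.

```python
from itertools import product
from math import comb

def combination_grids(level, dim):
    """List (coefficient, level_vector) pairs of the combination technique."""
    grids = []
    for q in range(dim):
        coeff = (-1) ** q * comb(dim - 1, q)
        target = level + (dim - 1) - q
        for l in product(range(1, level + 1), repeat=dim):
            if sum(l) == target:
                grids.append((coeff, l))
    return grids

# Usage: level 4 in 2D gives four grids with coefficient +1 and three with -1
for coeff, l in combination_grids(4, 2):
    print(coeff, l)
```

Under this formula, level 2 in ten dimensions yields 11 grids, which would be consistent with the 11 grids mentioned for the parallel 10D example if that run used level 2 (an assumption, since the level is not stated there).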

24 Sequence of problems to solve
Discretize and solve the minimization problem on each grid $\Omega_l$, with $|l|_1 = n + (d - 1) - q$, $q = 0, \dots, d - 1$
Number of grids: $O(d \cdot n^{d-1})$
The number of unknowns on each grid is small enough for the main memory of a workstation
The resulting linear equation system is solved by a diagonally preconditioned conjugate gradient algorithm
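A diagonally (Jacobi) preconditioned conjugate gradient iteration can be sketched as follows. This is a generic textbook version for a symmetric positive definite matrix stored as a dense NumPy array, which is an assumption made for brevity; the function name and the stopping rule are hypothetical and not taken from the slides.

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradients for SPD A, preconditioned with the diagonal of A."""
    d_inv = 1.0 / np.diag(A)          # inverse of the diagonal preconditioner
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                     # initial residual
    z = d_inv * r                     # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        Ap = A @ p
        step = rz / (p @ Ap)
        x += step * p
        r -= step * Ap
        z = d_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p     # new search direction
        rz = rz_new
    return x

# Usage on a small SPD system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = jacobi_pcg(A, b)
```

Only matrix-vector products with `A` are needed, so the same loop works when the action of the matrix is available without storing it densely.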

25 Complexities of the computation
To solve on each grid in the sequence of grids, complexities are given for
  storage
  assembly
  mv-multipl. (matrix-vector multiplication)
$N$ is the number of grid points
$M$ is the number of data points
Scales linearly with $M$

26 Numerical Examples
We test our method with
  Benchmark data sets from the UCI Repository
  Synthetically generated massive data sets
Evaluation and comparison with other methods through either
  Correctness rates on test data sets, which were not used during the computation,
  10-fold cross validation, or
  Leave-one-out cross validation
The best $\lambda$ is found in an outer loop over several $\lambda$s
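The evaluation protocol (k-fold cross validation inside an outer loop over candidate regularization parameters) can be sketched as below. The classifier used here is a plain regularized least-squares fit on the raw features, a stand-in chosen only to keep the example self-contained; it is not the sparse grid classifier, and all function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Stand-in classifier: regularized least squares on the raw features."""
    A = X.T @ X + lam * len(X) * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def cv_correctness(X, y, lam, folds=10, seed=0):
    """Mean correctness on the held-out part over `folds` disjoint splits."""
    order = np.random.default_rng(seed).permutation(len(X))
    rates = []
    for test_idx in np.array_split(order, folds):
        train_idx = np.setdiff1d(order, test_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        rates.append(np.mean(np.sign(X[test_idx] @ w) == y[test_idx]))
    return float(np.mean(rates))

def best_lambda(X, y, candidates):
    """Outer loop: keep the candidate with the best cross-validation rate."""
    return max(candidates, key=lambda lam: cv_correctness(X, y, lam))

# Usage on synthetic two-class data with labels -1 / +1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
lam = best_lambda(X, y, [1e-3, 1e-1, 10.0])
```

Leave-one-out cross validation is the special case `folds=len(X)`.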

27 Checkerboard data set / Ripley data set
Checkerboard with level 10: 10-fold correctness rate 96.20 %
Ripley data set with level 5 (correctness rate of 90.9 %)
Ripley data set with level 8 (correctness rate of 89.7 %)
Ripley data set with neural networks: 91.1 %
Best possible rate for Ripley is 92.0 %, since 8 % error is introduced

28 Spiral data set
level   $\lambda$    training correctness   testing correctness
5       …            …                      87.11 %
6       0.00075      100.00 %               89.69 %
7       …            …                      …
Leave-one-out cross-validation results; level 4 to 6 are shown
77.20 % with neural networks reported [Singh, 1998]

29 BUPA Liver Disorders data set (6D)
Columns: SVM; SSVM; SVM; sparse grid combination method with level 1, level 2, level 3, level 4
Rows: 10-fold train. %; 10-fold test. %
Results for the BUPA Liver Disorders data set (345 data points) from the UCI Repository in comparison to support vector machines [Lee and Mangasarian, 2001]

30 PIMA Indians Diabetes data set (8D)
Columns: SVM; SSVM; SVM; sparse grid combination method with level 1, level 2, level 3
Rows: 10-fold train. %; 10-fold test. %
Results for the PIMA Indians Diabetes data set (768 data points) from the UCI Repository in comparison to support vector machines [Lee and Mangasarian, 2001]

31 Synthetic massive 6D data set
           # of data    training correctness   testing correctness   total time (sec)   data matrix time (sec)
level …    …            … %                    90.8 %                …                  …
           …            … %                    90.8 %                …                  …
           … million    90.7 %                 90.7 %                …                  …
level …    …            … %                    91.5 %                …                  …
           …            … %                    91.6 %                …                  …
           … million    91.4 %                 91.5 %                …                  …

32 Using simplicial basis functions
On the grids of the combination technique, linear basis functions based on a simplicial discretization are also possible
So-called Kuhn's triangulation for each rectangular block (shown for the block from (0,0,0) to (1,1,1))
Theoretical properties of this variant of the sparse grid technique still have to be investigated in more detail
Since the overlap of supports is greatly reduced due to the use of a simplicial discretization, the complexities scale significantly better
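How Kuhn's triangulation reduces the overlap can be illustrated for a single block. The sketch below assumes the block is the unit cube $[0, 1]^d$ and uses the standard description of the Kuhn simplices: sorting the coordinates of a point in descending order selects the simplex, whose vertices are reached by walking from $(0, \dots, 0)$ to $(1, \dots, 1)$ one axis at a time. The function name is hypothetical.

```python
import numpy as np

def kuhn_simplex(x):
    """Vertices and barycentric weights of the Kuhn simplex containing x in [0, 1]^d."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    order = np.argsort(-x)                     # axes sorted by descending coordinate
    vertices = np.zeros((d + 1, d))
    for k, axis in enumerate(order, start=1):
        vertices[k] = vertices[k - 1]
        vertices[k, axis] = 1.0                # step along one more axis
    sorted_x = np.concatenate(([1.0], x[order], [0.0]))
    weights = sorted_x[:-1] - sorted_x[1:]     # d + 1 non-negative weights, sum is 1
    return vertices, weights

# Usage: the weights are the values of the d + 1 linear basis functions at x
vertices, weights = kuhn_simplex([0.2, 0.7, 0.5])
```

Only $d + 1$ basis functions are non-zero at a point, compared with $2^d$ for $d$-linear basis functions on the same block.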

33 Complexities for both discretization variants
Columns: $d$-linear basis functions; linear basis functions on simplicials
Rows: storage; assembly; mv-multipl.
Reduced $d$-dependence in the complexities with linear basis functions on simplicials
$N$ is the number of grid points
Scales linearly with $M$, the number of data points

34 Ripley data set / Spiral data set with linear basis functions
Ripley data set with level 4 (correctness rate of 91.4 %)
  Compare with 90.9 % with level 5, $d$-linear, and 91.1 % with neural networks
Spiral data set with level 7: 88.… % leave-one-out correctness
Spiral data set with level 8: … % leave-one-out correctness
  Compare with … % with level 6, $d$-linear

35 BUPA Liver Disorders data set (6D)
Columns: linear %; $d$-linear %
Rows: 10-fold train and 10-fold test, each for level 1, level 2, level 3 and level 4

36 Synthetic massive 6D data set
Columns: # of data; training correctness; testing correctness; total time (sec); data matrix time (sec)
Rows: level … with … million data points (three rows); linear basis functions, level 2 with 5 million data points

37 Synthetic massive 10D data set
Columns: # of data; training correct.; testing correct.; total time (sec); data matrix time (sec)
Rows: level … with … million data points (two rows)

38 Parallelization
The combination technique is parallel on a coarse grain level
  Classifiers in the sequence of grids can be computed independently of each other
  Just short setup and gather phases are necessary
  Simple but effective static load balancing strategy
Fine grain level parallelization with threads on SMP machines
  To compute the data dependent part, the array of the training set can be separated into $p$ (# processors) parts
  Some overhead is introduced to avoid memory conflicts
  In the iterative solver a vector can be split into $p$ parts, and each processor now computes the action of the matrix on a vector of size $N/p$

39 Synthetic massive 10D data set in parallel
Coarse grain level parallelization of the combination technique
  Speed-up of 10.1 with an efficiency of 0.92 on 11 nodes
  Since only 11 grids have to be calculated, no more than 11 nodes are needed
Threads for each partial problem in the sequence of grids
  We achieve acceptable speed-ups, from 1.6 for two processors up to 3.7 for eight processors
  As one would expect, the efficiency decreases with the number of processors
Both parallelization strategies are used simultaneously
  Each node is a shared memory dual-processor system
  On 11 nodes a speed-up of 17.9 with an efficiency of 0.81
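The quoted efficiencies follow from the usual definition, efficiency = speed-up / number of processors. The short check below assumes that definition and assumes that the combined run on 11 dual-processor nodes used 22 processors in total; the function name is hypothetical.

```python
def parallel_efficiency(speed_up, processors):
    """Efficiency as speed-up divided by the number of processors used."""
    return speed_up / processors

# Coarse grain only: 11 nodes, one grid per node
coarse = parallel_efficiency(10.1, 11)
# Both strategies: 11 dual-processor nodes, assumed to be 22 processors
combined = parallel_efficiency(17.9, 2 * 11)
# Threads only: two and eight processors
threads_two = parallel_efficiency(1.6, 2)
threads_eight = parallel_efficiency(3.7, 8)
print(round(coarse, 2), round(combined, 2), threads_two, threads_eight)
```

Rounded to two digits, the first two values reproduce the 0.92 and 0.81 stated on the slide, and the thread-level values show the efficiency dropping as processors are added.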

40 Conclusions and outlook
Our method is well suited for huge data sets
Moderately high number of dimensions
  Enough for a lot of practical applications after the reduction to the essential dimensions
  Dimension reduction (e.g. SVD) has to be applied
Memory requirements still grow exponentially in $d$
  Lumping
  Reduce number of points on the boundary
Fast solvers for the partial problems in the sequence of grids
  Multi-grid with partial semi-coarsening
