Parallel Data Mining on a Beowulf Cluster

Size: px

Start display at page:

Download "Parallel Data Mining on a Beowulf Cluster"

Oliver Stewart
5 years ago
Views:

1 Parallel Data Mining on a Beowulf Cluster Peter Strazdins, Peter Christen, Ole M. Nielsen and Markus Hegland Peter.Strazdins (/seminars) Data Mining Group Australian National University, Canberra 5th Int l Conference on High-Performance Computing in the Asia-Pacific Region 25 September 2001 c 2001 ANU Data Mining Group p.1/15

2 Talk Outline The Data Mining Process Data Mining Issues What we do at the ANU Predictive Modelling Parallel Implementation Matrix Assembly and Linear System Solve The ANU Beowulf Bunyip Performance Results Outlook c 2001 ANU Data Mining Group p.2/15

3 The Data Mining Process Understand Business Understand Data Prepare Data Take Action Evaluate Model Build Model(s) Analysis of large and complex data sets Find previously unknown relationships and patterns that are useful Data Mining is iterative and interactive Challenges: data size and data complexity c 2001 ANU Data Mining Group p.3/15

4 Data Mining Issues Large data size and data complexity require more computational power higher memory and I/O bandwidth more secondary storage Iterative and interactive data mining process requires rapid prototype development Data security and privacy (personal data) Heterogeneity and distributed data collection c 2001 ANU Data Mining Group p.4/15

5 "What we do at the ANU" Development of algorithms that are scalable both wrt. data size and number of processors Apply numerical techniques for predictive modelling (including thin plate splines, finite elements, wavelets and additive models) Data mining toolbox (DMtools) (for data exploration, analysis and preprocessing) Consultancies Data mining is one of 13 APAC Expertise Programs c 2001 ANU Data Mining Group p.5/15

6 Three Predictive Modelling Methods 1. ADDFIT: Additive models Lowest computational costs, but coarsest approximations, suitable for high-dimensional problems 2. HISURF: High dimensional surface smoothing Uses a hierarchical interpolatory wavelet basis, suitable for three to seven dimensional problems 3. TPSFEM: Thin plate splines finite element method Piecewise multilinear finite elements, most accurate approximation at highest computational costs, suitable for two and three dimensional problems c 2001 ANU Data Mining Group p.6/15

7 Advantages of these Methods All three methods have two steps 1) Read data and assemble a linear system 2) Add constraints and solve the linear system The first step is easy and efficiently to parallelise (as long as matrix fits into main memory) Data has to be read from disk only once Size of the symmetric linear system is independent of the number of data records ADDFIT and HISURF result in dense matrix TPSFEM in larger sparse matrix (only ADDFIT and HISURF are presented here) c 2001 ANU Data Mining Group p.7/15

8 Matrix Structures and Complexity 0 5 3D ADDFIT matrix D HISURF matrix Storage requirements HISURF ADDFIT nz = nz = GB 1MB 1KB 1B Dimension A 3-dimensional examples Resolution is 9 grid points in each dimension HISURF matrices becomes much larger with increasing number of dimensions c 2001 ANU Data Mining Group p.8/15

9 Parallel Cluster Implementation First (sequential) prototypes in Matlab and Python C/MPI code for higher performance Python/Tkl wrapper code to facilitate user interface and present graphical output c 2001 ANU Data Mining Group p.9/15

10 Parallel Assembly Step Data set is initially copied to (local) disks of each processor (one-off cost) Each node reads a fraction of the whole data set and assembles a local linear system Each data record adds some non-zero elements into the matrix (at data dependent locations) The complete matrix data structure is needed on each node The local linear systems are collected and summed after the assembly The final linear system can be solved sequentially or in parallel c 2001 ANU Data Mining Group p.10/15

11 block-cyclic matrix dist. HISURF: uses a Parallel Solve Step processors are used with a square LLT factorization / back-solve AddFit: uses a LDLT factorization / back-solve Bounded Bunch-Kaufman algorithm (Boeing, 1998) used: high accuracy augmented with a block search algorithm (equally stable) to reduce symmetric interchange overheads other communication aspects are highly optimized also has very high serial performance requires large (see LDLT paper in HPCAsia 01) for good parallel efficiency c 2001 ANU Data Mining Group p.11/15

12 The ANU Beowulf Cluster Bunyip 24 x Dual Pentium III Nodes 48-Port Switch B A D 24 x 100MB Ethernet Links Gigabit Links Gigabit Switch C 1 Server 2 c 2001 ANU Data Mining Group p.12/15

13 ADDFIT Results on Bunyip 20 Census-1 Census-10 Speedup Number of Nodes Data set Census from UCI KDD archives (42 attributes, 300,000 records, 54 MBytes) 8 attributes used, matrix dimension For Census-1, reducing and solving limits speedup c 2001 ANU Data Mining Group p.13/15

14 ADDFIT Results on Bunyip (cont d) 30 Census-1 Census Speedup Number of Nodes Data set Census UCI KDD archives 41 attributes used, matrix dimension Reducing (4 MBytes) and solving limits speedup (especially for Census-1) c 2001 ANU Data Mining Group p.14/15

15 Current and Future Work Port to APAC National Facility (OpenMP / MPI) (first OpenMP prototype is working) Unifying ADDFIT, HISURF and TPSFEM into adaptive sparse grids Integrate into data mining toolbox DMtools Apply to other real world data Compare with other predictive modelling techniques (neural networks, decision trees, etc.) Visit our web site at: c 2001 ANU Data Mining Group p.15/15

High Performance Computing and Data Mining

High Performance Computing and Data Mining Performance Issues in Data Mining Peter Christen Peter.Christen@anu.edu.au Data Mining Group Department of Computer Science, FEIT Australian National University,