Parallel Approach for Implementing Data Mining Algorithms


TITLE OF THE THESIS
Parallel Approach for Implementing Data Mining Algorithms

A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE AND ENGINEERING

By MANISH BHARDWAJ
Registration No < >

UNDER THE GUIDANCE OF DR. D. S. ADANE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, NAGPUR, MAHARASHTRA - 440015
Year 2016

MANISH BHARDWAJ Doctorate Research Proposal RCOEM, Nagpur

About this proposal
This document describes the working title of the proposed doctoral research and gives a general overview of the area. The research plan described here may be modified based on the approval of this proposal.

Research Proposal, RCOEM Nagpur Confidential Page ii

CONTENTS
1. RESEARCH TITLE
2. ABSTRACT
3. LITERATURE SURVEY
4. PROBLEM DEFINITION
5. PROPOSED METHODOLOGY
6. REFERENCES

1. Research Proposal Title

Parallel Approach for Implementing Data Mining Algorithms.

2. Abstract

Parallel data mining concerns parallel algorithms, techniques and tools for extracting useful, implicit and novel patterns from datasets on high-performance architectures. The huge volumes of data generated by online transactions, by social networking sites, and by government organizations working in space science and bioinformatics create new problems for data mining and knowledge discovery methods. Because of this size, most currently available data mining algorithms are not applicable to many problems: they do not give good results when datasets become very large, and their execution time grows accordingly. Parallel techniques make mining more efficient by exploiting the available high-performance architecture. Approaches such as data partitioning, task partitioning, divide-and-conquer, single-dimension reduction, scalable thread scheduling and local sorting help implement data mining algorithms with higher performance and lower time requirements than straightforward implementations. A graphics processing unit (GPU) with the CUDA programming model performs the task in parallel through thread blocks that run concurrently, while the OpenMP API, with its fork-join model and its many constructs and directives, supports parallel implementation on multiple cores.

3. Literature Review and Related Work

3.1 Research Issues and Challenges
Some important research issues and open problems in designing and implementing large-scale data mining algorithms are the following.

3.1.1 High Dimensionality
Available methods can handle hundreds of attributes. New parallel algorithms are needed that can handle many more.

3.1.2 Large Size
Data warehouses continue to grow. Available techniques handle data in the gigabyte range but are not yet well suited to terabyte-sized data.

3.1.3 Data Type
Most data mining research has focused on structured data because of its simplicity, but support for other data types is also required. Examples include semi-structured, unstructured, spatial, temporal and multimedia databases.

3.1.4 Dynamic Load Balancing
Static partitioning is used in homogeneous environments. Dynamic load balancing is crucial for handling heterogeneous environments.

3.1.5 Multi-table Mining
Mining over multiple tables, or over distributed databases with different schemas, is very difficult with available methods. Better methods are required for the multi-table mining problem [1].

3.2 Scaling Up Methods for Data Mining
Scaling up is the only way to handle very large datasets. Parallel approaches such as one-dimensional reduction, scalable thread scheduling and local sorting allow data mining algorithms to handle large datasets.

3.2.1 Modifying the Algorithm
Modifying an algorithm mainly aims at making it faster, typically through optimized search techniques. It can also reduce complexity, produce a more compact representation, or find an approximate solution instead of an exact one.

3.2.1.1 Model restriction and reducing the search space
Restricting the model space has the immediate advantage that the search space is also reduced. Furthermore, simple solutions are usually faster to obtain and evaluate and, in many cases, are competitive with more complex ones. The major problem arises when the intrinsic complexity of the problem cannot be met by a simple solution. Examples of this strategy are many, including linear models, perceptrons, and decision stumps.

3.2.1.2 Using powerful search heuristics
A more efficient search heuristic avoids artificially constraining the possible models and tries to make the search process faster. The method consists of three steps: first, derive an upper bound on the relative loss between using a subset of the available data and the whole dataset in each step of the learning algorithm; second, derive an upper bound on the time complexity of the learning algorithm as a function of the number of samples used in each step; finally, minimize the time bound, via the number of samples used in each step, subject to the target limits on the loss of performance incurred by using a subset of the dataset.

3.2.2 Changing the Way the Problem Is Solved
This approach modifies the way the problem is solved and is based on the general principle of divide-and-conquer: perform some kind of data partitioning or problem decomposition.

3.3 Parallelization
Parallelization helps in the sense that the most costly parts are performed concurrently. With parallelization it is possible to scale up mining methods without simplifying either the algorithm or the task.

3.4 Graphics Processing Unit with Compute Unified Device Architecture
Graphics processing units (GPUs) have enabled inexpensive high-performance computing for general-purpose applications. The Compute Unified Device Architecture (CUDA) programming model gives programmers adequate C-like APIs to exploit the parallel power of the GPU. GPUs have evolved into highly parallel, multithreaded, many-core processors with tremendous computational horsepower and very high memory bandwidth; NVIDIA's GPUs with the CUDA programming model provide an adequate API for non-graphics applications.
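The divide-and-conquer strategy of Sections 3.2.2 and 3.3 — partition the data, process the partitions concurrently, combine the partial results — can be sketched in plain Python. This is only an illustration: the item-counting task, the partitioning and all names are invented for the example, and `ThreadPoolExecutor` merely stands in for whatever parallel backend (MPI, OpenMP, CUDA) would be used in practice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(transactions):
    """Count item occurrences in one data partition."""
    c = Counter()
    for t in transactions:
        c.update(t)
    return c

def parallel_item_counts(transactions, n_parts=4):
    """Split the dataset, count each partition concurrently, merge results."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as ex:
        partials = list(ex.map(count_partition, parts))
    total = Counter()
    for p in partials:
        total.update(p)  # combine the partial results
    return total
```

The merged counts are identical to a sequential scan of the whole dataset; only the costly counting step runs concurrently.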

Fig. 3.1 A set of SIMD stream multiprocessors with a memory hierarchy

3.4.1 CUDA Programming Model
At the software level, CUDA is a collection of thread blocks that run in parallel. The unit of work assigned to the GPU is called a kernel. A CUDA program runs in a thread-parallel way: computation is organized as a grid of thread blocks, each consisting of a set of threads, as shown in Fig. 3.2. At the instruction level, 32 consecutive threads in a thread block form the minimum unit of execution, called a thread warp. Each stream multiprocessor executes one or more thread blocks concurrently [2].
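The grid/block/thread numbering above can be made concrete with a small Python emulation of the index arithmetic a one-dimensional CUDA kernel performs (`blockIdx.x * blockDim.x + threadIdx.x`). This is only a sketch of the numbering scheme, not GPU code; the function names are invented for the example.

```python
WARP_SIZE = 32  # 32 consecutive threads form one warp

def global_thread_ids(grid_dim, block_dim):
    """Yield (block, thread, global_id) triples, mirroring the 1-D CUDA
    index calculation blockIdx.x * blockDim.x + threadIdx.x."""
    for block in range(grid_dim):          # thread blocks in the grid
        for thread in range(block_dim):    # threads in each block
            yield block, thread, block * block_dim + thread

def warp_of(thread_idx):
    """Warp number of a thread within its block (32 consecutive threads)."""
    return thread_idx // WARP_SIZE
```

A grid of 2 blocks of 64 threads thus covers 128 data elements with unique, contiguous global indices, and each block contains warps 0 and 1.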

Fig. 3.2 Serial execution on the host and parallel execution on the device

3.4.2 Parallelization Techniques on a CUDA-Enabled Platform
Three schemes for data mining parallelization on a CUDA-based platform are as follows.

3.4.2.1 Scalable thread scheduling scheme for irregular patterns
Whether a task is assigned to the CPU or the GPU, and the number of thread blocks used, is usually determined by the size of the problem before the GPU kernel starts. For irregular-pattern problems, however, the size is not known in advance, so this style of CUDA computing is unsuitable. Solution: scalable thread scheduling. An upper bound on the number of threads/thread blocks is calculated first and GPU resources are allocated accordingly; if some thread blocks are idle, the corresponding thread blocks quit immediately.

3.4.2.2 Parallel distributed top-k scheme
The top-k problem is to select the k minimum or maximum elements from a data collection. Insertion sort has been proved efficient when k is small, but CUDA-based insertion sort is not. Solution: reduce the computation and tackle the weakness of CUDA-based insertion sort by using local sorts rather than a global sort.

3.4.2.3 Parallel high-dimension reduction scheme
Text mining records may consist of hundreds of attributes, exceeding the size of the shared memory allocated to each thread block on the GPU. In such cases the record has to be broken into multiple sub-records to fit in shared memory, but breaking it into too many sub-records is not a solution either, because the cost of manipulating the records and the temporary results becomes high. Solution: observe that different attributes of a record are independent, and let each thread block take care of one distinct attribute of all the records. Rather than performing a reduction on the high-dimensional data, perform a one-dimensional reduction on each attribute.

3.5 CUDA-Based Implementations of Data Mining Algorithms

3.5.1 CU-Apriori
In CUDA-based Apriori, candidate generation and support counting take most of the computation.

3.5.1.1 Candidate generation
The candidate generation procedure joins two frequent (k-1)-itemsets and prunes the unpromising k-candidates. Since joining two itemsets is independent across threads, it is suitable for parallelization; the scalable thread scheduling scheme for irregular patterns is used here.

3.5.1.2 Support counting
The support counting procedure records the number of occurrences of a candidate itemset by scanning the transaction database. Since the counting for each candidate is independent of the others, it is suitable for parallelization. Transactions are loaded into shared memory and shared by all the threads within a thread block [3].

3.5.2 CU-KNN
In the CUDA-based k-nearest-neighbour classifier, distance calculation and selection of the k nearest neighbours account for most of the computation.

3.5.2.1 Distance calculation
This can be fully parallelized, since pair-wise distance calculations are independent. This property makes KNN perfectly suitable for a GPU implementation. The goal is to maximize the concurrency of the distance calculations invoked by different threads and to minimize global memory access.

3.5.2.2 Selection of k nearest neighbours
Selecting the k nearest neighbours of a query object is essentially finding the k shortest distances, which is a typical top-k problem.
Its implementation therefore uses the distributed top-k scheme.
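The local-sort idea behind the distributed top-k scheme can be sketched in plain Python. On the GPU, each partition's sort would run in its own thread block; here the partitions are processed in a simple loop, and the function name and partitioning are invented for the example.

```python
def local_topk(values, k, n_parts=4):
    """Find the k smallest values by sorting partitions locally and
    merging the per-partition candidates, instead of one global sort."""
    parts = [values[i::n_parts] for i in range(n_parts)]
    candidates = []
    for part in parts:                     # each local sort maps to one thread block
        candidates.extend(sorted(part)[:k])  # keep only k local candidates
    return sorted(candidates)[:k]          # small final merge of n_parts * k items
```

The final merge touches at most `n_parts * k` elements, which is the point of the scheme: the expensive work is the independent local sorts, not a global sort of the whole collection.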

3.5.3 CU-K-means
In CUDA-based K-means, the cluster label update, centroid update and centroid movement detection take most of the computation.

3.5.3.1 Cluster label update
Each thread performs the distance calculation from an object to all the centroids and selects the nearest one; each object is assigned to the cluster whose centroid is closest to it. Attribute partitions of the objects are loaded into shared memory, so the bandwidth between global memory and shared memory is utilized efficiently.

3.5.3.2 Centroid update
Each new centroid is calculated by averaging the attribute values of all the records belonging to the same cluster. The parallel high-dimension reduction scheme is used for this task.

3.5.3.3 Centroid movement detection
This checks whether the new centroids have moved away from the centroids of the last iteration. First, calculate the squared difference between every attribute of the new and old centroids, forming the centroid difference matrix. Second, perform the parallel high-dimension reduction scheme on this matrix. Third, since the number of attributes in the resulting record is small, transfer it to main memory and sum it to obtain the global squared error; the cost of this transfer between main and global memory is negligible [7].

3.5.4 FP-Growth
Although the FP-Growth association-rule mining algorithm is more efficient than Apriori, it has two disadvantages: the FP-tree can become too large to be created in memory, and the processing approach is serial. A distributed-application data framework provides a parallel variant of FP-Growth that does not require generating the overall FP-tree, which may be too large to create in shared memory. The algorithm uses parallel processing in all important steps, which improves the processing capability and efficiency of association-rule mining [4].
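The two main CU-K-means steps above — label update followed by a per-attribute centroid reduction — can be sketched sequentially in Python. This is only an illustration of the logic a GPU kernel would parallelize (one thread per point for labels, one thread block per attribute for the reduction); the function name and data are invented for the example.

```python
def kmeans_step(points, centroids):
    """One K-means iteration: assign each point to its nearest centroid
    (label update), then recompute each centroid attribute by attribute
    (mirroring the per-attribute reduction scheme)."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    # Label update: on the GPU, one thread per point.
    labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
              for p in points]

    # Centroid update: reduce one attribute at a time, as one thread
    # block per attribute would on the GPU.
    dims = len(points[0])
    new_centroids = []
    for j in range(len(centroids)):
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            new_centroids.append(centroids[j])  # empty cluster: keep old centroid
            continue
        new_centroids.append(tuple(sum(p[d] for p in members) / len(members)
                                   for d in range(dims)))
    return labels, new_centroids
```

Iterating this step until the centroids stop moving (the movement-detection check of 3.5.3.3) yields the usual K-means algorithm.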
3.5.5 Parallel Bees Swarm Optimization
The association rule mining problem on huge datasets can be solved by applying bees' behaviour, taking advantage of the GPU architecture to handle large datasets and solve real-time problems. A master-slave paradigm is used: the master executes on the CPU and the slave is offloaded to the GPU. First, the master randomly initializes the reference solution. It then determines the regions of all the bees by generating the neighbours of each bee. Each candidate solution is evaluated on the GPU in parallel. Afterwards, the master receives back the fitness of all rules; each bee sequentially determines its best rule and puts it in the dance table. The best rule in the dance table becomes the reference solution for the next iteration [5].

3.5.6 Accelerating Parallel Frequent Itemset Mining on Graphics Processors with Sorting
This approach constructs the transaction-identifier table and sorts all frequent itemsets, which helps reduce the candidate itemsets on the GPU architecture. GPU thread blocks are allocated after sorting the itemsets in descending order, so checking and support counting take less time [6].

3.5.7 Parallel Highly Informative K-Itemset
PHIKS is a highly scalable parallel miki mining algorithm, able to handle the mining of huge databases (terabytes of data). Miki is the problem of discovering maximally informative k-itemsets in massive datasets, where informativeness is expressed by means of joint entropy and k is the size of the itemset. Miki mining is a key problem in data analytics with high potential impact on tasks such as unsupervised learning, supervised learning and information retrieval. A typical application is the discovery of discriminative sets of features based on joint entropy [9].

4. Problem Definition

Large data is generated by online transactions, social networking sites, and government organizations in space science and bioinformatics, and available data mining algorithms do not perform well on such datasets. A second problem concerns performance: some algorithms that can solve the mining problem face a search space that prevents efficient execution, so the generated solutions are not at a satisfactory level.

5. Proposed Methodology

The only way to deal with very large datasets is to apply a parallel approach to scaling up the data mining algorithms: modifying the algorithm, partitioning the data, decomposing the problem, and parallelizing. For parallelization, graphics processing units enable inexpensive high-performance computing, and the Compute Unified Device Architecture programming model gives programmers an adequate C-like API to exploit the parallel power of the GPU. The GPU has evolved into a highly parallel, multithreaded, many-core processor, so work is distributed among thread blocks and the threads operate in a thread-parallel fashion. The other approach is based on OpenMP, a shared-memory API built on the fork-join model. Its large set of constructs and directives allows work to be done in parallel, so that tasks utilize the computing power of multiple cores and the parallel approach can be applied to scale up the data mining algorithms.

6. References

1. M. J. Zaki, Large-Scale Parallel Data Mining, LNAI 1759, pp. 1-23, Springer, 2000.
2. N. Garcia-Pedrajas, A. de Haro-Garcia, Scaling Up Data Mining Algorithms: Review and Taxonomy, Springer-Verlag, 2011.
3. L. Jian, C. Wang, Y. Liu, Y. Shi, Parallel Data Mining Techniques on Graphics Processing Unit with Compute Unified Device Architecture (CUDA), pp. 943-967, Springer Science+Business Media, LLC, 2011.
4. Zhi-gang Wang, Chi-she Wang, A Parallel Association-Rule Mining Algorithm, pp. 125-129, Springer-Verlag Berlin Heidelberg, 2012.
5. Y. Tan, Parallel Bees Swarm Optimization for Association Rule Mining Using GPU Architecture, pp. 50-57, Springer International Publishing Switzerland, 2014.
6. H. Hsu, Accelerating Parallel Frequent Itemset Mining on Graphics Processors with Sorting, pp. 245-256, IFIP, 2013.
7. H. Decker, Parallel and Distributed Mining of Probabilistic Frequent Itemsets Using Multiple GPUs, Springer-Verlag Berlin Heidelberg, 2013.
8. S. Tsutsui and P. Collet, Data Mining Using Parallel Multi-objective Evolutionary Algorithms on Graphics Processing Units, Springer-Verlag Berlin, 2013.
9. Saber Salah, A Highly Scalable Parallel Algorithm for Maximally Informative k-Itemset Mining, Springer-Verlag London, 2016.