Map-Reduce for Cube Computation

Similar documents
Different Cube Computation Approaches: Survey Paper

Mining for Data Cube and Computing Interesting Measures

An Efficient Multi-Dimensional Data Analysis over Parallel Computing Framework

Data Cube Materialization Using Map Reduce

A REVIEW DATA CUBE ANALYSIS METHOD IN BIG DATA ENVIRONMENT

Efficient Computation of Data Cubes. Network Database Lab

International Journal of Computer Sciences and Engineering. Research Paper Volume-6, Issue-1 E-ISSN:

Quotient Cube: How to Summarize the Semantics of a Data Cube

Coarse Grained Parallel On-Line Analytical Processing (OLAP) for Data Mining

PnP: Parallel And External Memory Iceberg Cube Computation

Data Cube Technology

Using Tiling to Scale Parallel Data Cube Construction

Multi-Cube Computation

Improved Data Partitioning For Building Large ROLAP Data Cubes in Parallel

Distributed Cube Materialization on Holistic Measures

Distributed Cube Materialization on Holistic Measures

Data Cube Technology. Chapter 5: Data Cube Technology. Data Cube: A Lattice of Cuboids. Data Cube: A Lattice of Cuboids

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 5

Cube-Lifecycle Management and Applications

Novel Materialized View Selection in a Multidimensional Database

CS490D: Introduction to Data Mining Chris Clifton

Computing Data Cubes Using Massively Parallel Processors

Chapter 5, Data Cube Computation

A Simple and Efficient Method for Computing Data Cubes

Computing Complex Iceberg Cubes by Multiway Aggregation and Bounding

Communication and Memory Optimal Parallel Data Cube Construction

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Impact of Data Distribution, Level of Parallelism, and Communication Frequency on Parallel Data Cube Construction

Building Large ROLAP Data Cubes in Parallel

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

This proposed research is inspired by the work of Mr Jagdish Sadhave 2009, who used

Lecture 2 Data Cube Basics

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

Keywords Data alignment, Data annotation, Web database, Search Result Record

Item Set Extraction of Mining Association Rule

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

On The Fly Mapreduce Aggregation for Big Data Processing In Hadoop Environment

M. P. Ravikanth et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 3 (3), 2012,

Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud

Data Warehousing and Data Mining

ETL and OLAP Systems

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Applying Grid Technologies to XML Based OLAP Cube Construction

Databases 2 (VU) ( / )

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Preparation of Data Set for Data Mining Analysis using Horizontal Aggregation in SQL

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Mitigating Data Skew Using Map Reduce Application

Ecient Computation of Iceberg Cubes with Complex Measures

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

A Better Approach for Horizontal Aggregations in SQL Using Data Sets for Data Mining Analysis

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

On the design space of MapReduce ROLLUP aggregates

DW Performance Optimization (II)

Parallel Evaluation of Composite Aggregate Queries

Trajectory Data Warehouses: Proposal of Design and Application to Exploit Data

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

TI2736-B Big Data Processing. Claudia Hauff

Carnegie Mellon Univ. Dept. of Computer Science /615 DB Applications. Data mining - detailed outline. Problem

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

A Review Paper on Big data & Hadoop

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

Mining Unusual Patterns by Multi-Dimensional Analysis of Data Streams

Inverted Index for Fast Nearest Neighbour

Data mining - detailed outline. Carnegie Mellon Univ. Dept. of Computer Science /615 DB Applications. Problem.

R-Store: A Scalable Distributed System for Supporting Real-time Analytics

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

MapReduce Design Patterns

The Polynomial Complexity of Fully Materialized Coalesced Cubes

Big Data Analytics. Rasoul Karimi

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Data Analysis Using MapReduce in Hadoop Environment

The Dynamic Data Cube

C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking

The Polynomial Complexity of Fully Materialized Coalesced Cubes

Data Warehousing & On-Line Analytical Processing

Discovering Interesting Patterns in Large Graph Cubes

2 CONTENTS

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Research Article Apriori Association Rule Algorithms using VMware Environment

Improving the MapReduce Big Data Processing Framework

Generating Data Sets for Data Mining Analysis using Horizontal Aggregations in SQL

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce

Mining of Web Server Logs using Extended Apriori Algorithm

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Appropriate Item Partition for Improving the Mining Performance

Data mining: Hmm, what is it?

Constructing Object Oriented Class for extracting and using data from data cube

Hadoop Map Reduce 10/17/2018 1

Transcription:

299 Map-Reduce for Cube Computation Prof. Pramod Patil 1, Prini Kotian 2, Aishwarya Gaonkar 3, Sachin Wani 4, Pramod Gaikwad 5 Department of Computer Science, Dr.D.Y.Patil Institute of Engineering and Technology Pimpri, Pune-411018 Abstract Analyzing of large data sets is a major concern. Big data contains large amount of unstructured data having heterogeneous patterns. It is quite difficult for existing techniques to give better performance while processing such large set of data. On the other hand tremendously changing size of the data and design parameters for the same becomes an inimitable and interesting challenge. This paper deals with the real world challenges of cube computation and materialization over interesting measures. Cube designing is efficient for handling of unstructured data. There are various techniques for cube computation such as annotation, aggregation, materialization and mining. So Map Reduce approach is provided for efficient computation of cube. Hadoop is a open source software framework for storing and processing big data in distributed form. Hive is a infrastructure on the top of Hadoop for storing query and analysis of large data sets. MR-Cube is a framework of Map Reduce for computation of online analytical processing. Thus MR-cube successfully handles cube computation with dynamics measures over large data sets. Keywords: Data cube, Cube Materialization, Cube Mining, Map-reduced, MR-Cube, Dynamic Measures. I. INTRODUCTION In the past few decades, Organization have tried different approaches to solve the problem of handling Big Data that requires lot of storage and large computation that demands a great deal of processing power. Thus Hadoop was adopted as a platform to Provide distributed storage and computational capabilities. Merits of using Hadoop are scalability and availability along with the distributed environment which are provided by Hadoop Distributed File System (HDFS) and computational capabilities provided by Map Reduce. HDFS helps in replication of files when software and hardware failure occurs, and automatically re-replicates data blocks by providing security to Big Data. Map-Reduce was introduced by Google in 2004.It is based on Divide and Conquer principles. Map-Reduce is the main processing engine of Hadoop. The Map Reduce model simplifies parallel processing by abstracting away the complexities involved in working with distributed systems, such as parallelization, distribution of work while dealing with software and hardware failure. With this abstraction, Map Reduce allows the programmer to focus on addressing business needs. Data cube is a way of organizing data in N-dimensions so as to perform analysis over some measure of interest. Data-Cube is an easy way to look at complex data in simple format. Challenges of Data-cube computation are Size, Complexity, Design, and Quality. The paper is organized as follows: Firstly explanation of different approaches of Cube Designing. Next is the Limitations of these techniques over handling of Big Data sets. Then the implementation steps of Map Reduce based Approach used for data cube materialization and mining. And the last part of paper is Conclusion. II. DIFFERENT TECHNIQUES FOR CUBE COMPUTATION Here we generate various related methods that are used for the computation of the cube and its performance scope and its merits and demerits. In paper [1], introduces analyzing and optimization technique for cube computations such as effectively distribute the data. They also focus that no single machine is overwhelmed with small number of nodes. The CUBE needs computing group bys on all possible combinations of list of attributes and is equivalent to the union of standard group by operations. The proposed algorithm is only works for algebraic measure i.e. such as SUM, COUNT, and AVERAGE etc. In paper [2], they introduce top-down approach for cube computation called multi way array aggregation. The computation begins with the grouping of queries as larger group-bys and proceeds towards the smallest group-bys. Here the planes should be computed and interesting groups are sorted. Limitation of this method is computing takes well only for a small number of dimension. In paper [3], Bottom-up Cubing Algorithm (BUC) method is used in which First, BUC if a group has value partitioned then algorithm executed on a single reducer is self-contained. Let us consider partitions dataset on dimension A, producing partitions a1, a2, a3, a4.then, it recourses on partition a1, the partition a1 is aggregated and BUC produces <a1,*,*,*>. Computing cube form cuboid to base cuboid.

300 In paper [11][12][4], Parallel Algorithms are introduced for cube computation over clusters. In these Algorithm data, dimensions and measures are given as input. Parallelized aggregation of data subsets whose results are then post-processed to derive the final result. BPP (Breadth-first Partitioned Parallel Cube), a parallel algorithm designed for cube materialization over flat dimension hierarchies. Another Parallel algorithm PT (Partitioned Tree) works with tasks that are created by a recursive binary division in each lattice on a single machine into two sub trees having an equal number of nodes. In PT, there is a parameter parallelism (number of reducers) that controls when binary division stops. In paper [14] two more algorithms is described. RP (Replication Parallel BUC) and ASL (Affinity SkipList). Algorithm Rp is dominated by PT. In Algorithm ASDL each cube region in parallel is used to maintain intermediate results during the process. In paper[16], For fast online multi-dimensional analysis of stream data, three important methods are proposed for efficient and effective computation of stream cubes. Based on this design methodology, real life cubing can be constructed. Introduce MR-Cube [7], a Map-Reduce based framework for efficient cube computation and identify interesting cube groups on holistic measure. Cube region is grouping of attribute while a cube group is values of those attribute. Cube lattice is formed by representing all possible groupings of the attributes. Challenging issue is that effectively distribute the computation in terms of efficiency and scalability. challenge when dealing with large amount of unstructured and real time data where measures an dimensions change all the time. 3. Design: Designing methods of data cubes have been becoming interesting and challenging. The parameters to be considered are construction time, cube updation techniques, maintenance plan and the design techniques to be adopted. 4. Quality: Quality becomes a complex factor when data is huge and as data cube is formed the quality of cube tends to be affected during aggregation phase. Thus it is important to control the quality of final cube. IV. MAP - REDUCED BASED APPROACH FOR CUBE COMPUTATION OVER BIG DATA Map-Reduce is a programming model and an associated implementation for popular parallel execution frameworks. A proposed methodology is used also to handle two major issues such as data distribution and computation distribution by illustrating a framework to partition high multidimensional lattice into region areas and distribution of data analysis and mining under parallel computing infrastructure. The research contribution is as follows: Partitioning high multidimensional lattice into region areas, Three phase high multidimensional data computation algorithm to handle billions of data streams, Fusion of stream mining model with multidimensional data streams. III. LIMITATION OF EXISTING TECHNIQUES Limitations in the existing techniques: 1. The existing techniques are designed to handle clusters of small number of nodes or for single machine processing. Thus it is difficult to manage processing of large amount of data. 2. The previous techniques deal with algebraic measure and the data is growing large day by day which needs holistic measures but distribution using holistic m challenge.. There are several more challenges arising when dealing with data cubes over large amount of data. 1. Size: The size of data over large data sets and also the size of intermediate data generated after mapping phase become a great challenge as it can lead to disk running out of space when naïve algorithm is used. 2. Complexity: Complexity of cube building becomes a Fig 1- Flow Chart for cube computation by using mapreduce approach

301 The given Map-reduced based system is designed with flow diagram as shown in figure 1. It consists of following steps: (i) Data Sample i.e. data set, which pre-process the data, dimension hierarchies and measures and convert into search query logs. According to that annotated cube lattice is constructed using sample data. A. Lattice Construction: For example, Fig. 4 illustrates a cube lattice where the dimension attributes include the six attributes derived from ip and query. (ii)the Annotated cube lattice is constructed using Value partitions which are of reducer unfriendly regions and batch areas techniques are used. (iii) In cube materialization using Map-Reducer technique tuples are mapped to each batch areas. Reducer evaluates the measure for each batch area. Then cube is loaded into DB for future exploration. After that according to user queries selecting and executing appropriate cubes in database is take place. As shown in following table, data sets are maintained as a set of tuples. Each tuples has a set of attributes, such as ip and query. For many analyses, it is more desirable to map some raw attributes into a fixed number of derived attributes through a mapping function. For example, ip is mapped to country, city, state. Similar query is then mapped to topic, category subcategory. Fig 2- Cube Lattice using a flat set of dimensions. Fig.3- Divide lattice into two parts: reducer friendly and reducer unfriendly approach

302 For effectively parallelism we use Partitioning technique called Batch Area. As shown in fig 4.Each batch area represents a collection of regions that share a common parent. The combined process of identification and value partitioning Unfriendly regions and partitioning of regions into batches is referred to as annotate so lattice formed is annotated lattice. V. CONLCUSION In this paper, we study annotation, aggregation, materialization and mining techniques for efficient cube computation. Proposed approach deals with cube groups instead of cube region to overcome workload of cube group computation.. Thus MR-Cube successfully handles cube computation with dynamic measure over large datasets REFERENCES Fig 4- Annotated cube lattice. Each color in the lattice indicates a batch area b1 to b5. The cube region term is used to denote a node in the lattice and the term cube group is used to denote an actual value belonging to the cube region. Then two techniques required for efficiently distribute the data and computation task. As shown in figure 3 Value Partitioning is used partitioning groups that are reducer unfriendly and dynamically adjust the partition factor. The reducer unfriendliness of each cube region is estimated by sampling approach. B. Cube materialization using map-reduced In map reduced based approach, mappers are allocated to each batch area and it emits key: value pairs for each batch area. In required, keys based on value partitioning are used, then in shuffle phase sorted by using key. The BUC Algorithm is run on each reducer, and the cube aggregates are generated. All value partitioned groups need to be aggregated to compute the final measures C. Data Aggregation Map-Reduce: Data aggregation is most important challenge which causes it to be from separate Map-Reduce that can be integrated with aggregation phase post materialization. It is feasible to perform both large-scale cube materialization and mining in same distributed framework of similar interesting cube groups. [1]. S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R.Ramakrishnan,and S. Sarawagi, "On the Computation of multidimensional Aggregates," Proc.22nd Int l Conf. Very Large Data Bases (VLDB), 1996. [2]. Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD'97. [3]. K. Ross and D. Srivastava, "Fast Computation of Sparse Datacubes," Proc. 23rd Int'l Conf. Very Large Data Bases (VLDB), 1997. [4]. R.T. Ng, A.S.Wagner, and Y. Yin, "Iceberg-Cube Computation with PC Clusters," Proc. ACM SIGMOD Int l Conf. Management of Data, 2001. [5]. D. Xin, J. Han, X. Li, and B. W. Wah. Starcubing: Computing iceberg cubes by top-down and bottomup integration. In VLDB'03 [6]. J. Hah, J. Pei, G. Dong and K.wang, Efficient Computation of Iceberg cubes with complex measure, Proc ACM SIGMOD Int l conf. Management of data,2001 [7]. Dehne, F.K.H.A., Eavis T., And Rau-Chaplin A., The cgmcube : ptimizing Parallel Data Cube Generation for ROLAP, Distributed and parallel databases19(1),2006 [8]. Fangbo Tao, Kin Hou Lei, EventCube: Multi- Dimentional search and mining of structured and Text data, ACM 978-I-4503-2174-7, 2013 [9]. Nikolay Laptev, Kai Zeng, Very Fast Estimation for result ad accuracy of big data analytics:earl system, Proc. IEEE 27th Int l Conf. Data Eng. (ICDE), 2013 [10]. Yixin Chen, Jiawei Han, Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams Springer Science + Business Media, Inc. Manufactured in The Netherlands, Distributed and Parallel Databases, 2005. [11]. Cuzzocrea A., Song.I and Davis, Analytics over large scale Multidimension data : The big data revolution!, Proc of ACM DOLAP,2011 [12]. A. Nandi, C. Yu, P. Bohannon, and R. Ramakrishnan, Distributed Cube Materialization on Holistic Measures, Proc. IEEE 27th Int l Conf. Data Eng. (ICDE), 2011.

303 [13]. Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan Data Cube Materialization and Mining over MapReduce IEEE transaction on Knowledge and Data Engineering, vol. 24, no. 10, Oct 2012. [14]. G. Cormode and S. Muthukrishnan, The CM Sketch and Its Applications, J. Algorithms, vol. 55, pp. 58-75, 2005. [15]. D. Talbot, Succinct Approximate Counting of Skewed Data, Proc.21st Int l Joint Conf. Artificial Intelligence (IJCAI), 2009. [16]. J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F.Pellow, and H. Pirahesh, "Data Cube: A Relational Operator Generalizing Group-By, Cross-Tab and Sub-Totals," Proc. 12th Int l Conf. Data Eng. (ICDE), 1996