LITERATURE SURVEY (BIG DATA ANALYTICS)

Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer fit on a single cost-effective computer. A simple but expensive solution has been to buy specialty machines that have a lot of memory and many CPUs. This solution scales as far as the fastest available machines allow, and usually the only limiting factor is the budget. An alternative solution is to build a high-availability cluster. Such a cluster typically attempts to look like a single machine and usually requires very specialized installation and administration services. Many high-availability clusters are proprietary and expensive.

A more economical solution for acquiring the necessary computational resources is cloud computing. A common pattern is to have bulk data that needs to be transformed, where the processing of each data item is essentially independent of other data items; that is, a single-instruction multiple-data (SIMD) algorithm. Hadoop provides an open source framework for cloud computing, as well as a distributed file system.

Hadoop supports the MapReduce model, which was introduced by Google as a method of solving a class of petascale problems with large clusters of inexpensive machines. The model is based on two distinct steps for an application:

Map: An initial ingestion and transformation step, in which individual input records can be processed in parallel.

Reduce: An aggregation or summarization step, in which all associated records must be processed together by a single entity.

The core concept of MapReduce in Hadoop is that input may be split into logical chunks, and each chunk may be initially processed independently by a map task. The results of these individual processing chunks can be physically partitioned into distinct sets, which are then sorted. Each sorted chunk is passed to a reduce task. Figure 1-1 illustrates how the MapReduce model works.

[Figure 1-1: An input dataset is divided into splits, each processed in parallel by a map task; the resulting key/value pairs are shuffled and sorted by key, then aggregated by reduce tasks into the output dataset.]
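As a concrete illustration of these two steps, the following is a minimal word-count sketch against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names are our own, and in a real project each class would live in its own source file.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: each input record (a line of text) is processed independently,
// emitting one (word, 1) pair per token.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // multiple output pairs per input record
            }
        }
    }
}

// Reduce: after the shuffle and sort, all values sharing a key arrive together;
// the reducer aggregates them into a single output record.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

Note that the shuffle and sort between the two classes is performed entirely by the framework; the application code never implements it.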

A map task may run on any compute node in the cluster, and multiple map tasks may be running in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output of all of the maps will be partitioned, and each partition will be sorted. There will be one partition for each reduce task. Each partition's sorted keys and the values associated with those keys are then processed by the reduce task. There may be multiple reduce tasks running in parallel on the cluster.

The application developer needs to provide only four items to the Hadoop framework: the class that will read the input records and transform them into one key/value pair per record, a map method, a reduce method, and a class that will transform the key/value pairs that the reduce method outputs into output records.

The Hadoop MapReduce framework requires a shared file system. This shared file system does not need to be a system-level file system, as long as there is a distributed file system plug-in available to the framework. When HDFS is used as the shared file system, Hadoop is able to take advantage of knowledge about which node hosts a physical copy of input data, and will attempt to schedule the task that is to read that data to run on that machine.

The Hadoop MapReduce environment provides the user with a sophisticated framework to manage the execution of map and reduce tasks across a cluster of machines. The user is required to tell the framework the following:

the location(s) in the distributed file system of the job input
the location(s) in the distributed file system for the job output
the input format
the output format
the class containing the map function

Optionally:

the class containing the reduce function
the JAR file(s) containing the map and reduce functions and any support classes

If a job does not need a reduce function, the user does not need to specify a reducer class, and the reduce phase of the job will not be run. The framework will partition the input, and schedule and execute map tasks across the cluster. If requested, it will sort the results of the map tasks and execute the reduce task(s) with the map output. The final output will be moved to the output directory, and the job status will be reported to the user. A minimal driver illustrating these items is sketched below.

MapReduce is oriented around key/value pairs. The framework converts each record of input into a key/value pair, and each pair is input to the map function once. The map output is a set of key/value pairs: nominally one pair that is the transformed input pair, but it is perfectly acceptable to output multiple pairs. The map output pairs are grouped and sorted by key. The reduce function is called once for each key, in sort sequence, with the key and the set of values that share that key. The reduce method may output an arbitrary number of key/value pairs, which are written to the output files in the job output directory. If the reduce output keys are unchanged from the reduce input keys, the final output will be sorted.
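The driver below is a minimal sketch of how these items might be supplied to the framework, reusing the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch; the input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // JAR containing the map and reduce functions and any support classes.
        job.setJarByClass(WordCountDriver.class);

        // Input/output locations in the distributed file system.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Input and output formats.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Class containing the map function and, optionally, the reduce function.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Skipping the reduce phase entirely corresponds to job.setNumReduceTasks(0), in which case the map output is written directly to the output directory.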

The framework provides two processes that handle the management of MapReduce jobs:

TaskTracker: manages the execution of individual map and reduce tasks on a compute node in the cluster.

JobTracker: accepts job submissions, provides job monitoring and control, and manages the distribution of tasks to the TaskTracker nodes.

Generally, there is one JobTracker process per cluster and one or more TaskTracker processes per compute node. The JobTracker is a single point of failure; by contrast, the JobTracker will work around the failure of individual TaskTracker processes.

The Hadoop Distributed File System

HDFS is a file system designed for MapReduce jobs that read input in large chunks, process it, and write potentially large chunks of output. HDFS does not handle random access particularly well. For reliability, file data is simply mirrored to multiple storage nodes; this is referred to as replication in the Hadoop community. As long as at least one replica of a data chunk is available, the consumer of that data will not know of storage server failures.

HDFS services are provided by two processes:

NameNode: handles management of the file system metadata, and provides management and control services.

DataNode: provides block storage and retrieval services.

There will be one NameNode process in an HDFS file system, and this is a single point of failure. Hadoop Core provides recovery and automatic backup of the NameNode, but no hot failover services. There will be multiple DataNode processes within the cluster, typically one DataNode process per storage node.
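From a client's point of view, this division of labor is hidden behind Hadoop's FileSystem API: the client asks the NameNode for metadata, and block reads and writes then go directly to the DataNodes holding the replicas. The following is a minimal sketch, assuming a reachable HDFS deployment configured as the default file system; the path used is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/notes.txt"); // hypothetical path

        // Write: the file is split into blocks, each replicated to multiple
        // DataNodes (controlled by dfs.replication, typically 3).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Sequential read back; HDFS favors large streaming reads over random access.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}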

[1] provides a technique for implementing self-tuning in big data analytics systems. Hadoop's out-of-the-box performance leaves much to be desired, leading to suboptimal use of resources, time, and money. The paper introduces Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without requiring users to understand and manipulate Hadoop's many tuning knobs. The paper also explores the MADDER properties (Magnetism, Agility, Depth, Data-lifecycle-awareness, Elasticity, Robustness). The behavior of a MapReduce job is controlled by the settings of more than 190 configuration parameters; if the user does not specify settings, default values are used, yet good settings depend on job, data, and cluster characteristics (a sketch of such parameters appears after these summaries). Starfish's Just-in-Time Optimizer addresses unique optimization problems to automatically select efficient execution techniques for MapReduce jobs.

[2] Scalable DBMSs, both for update-intensive workloads and for decision-support analysis, are a critical part of the cloud and play an important role in ensuring the smooth transition of applications from traditional enterprise infrastructures to next-generation cloud infrastructures. This tutorial presents an organized picture of the challenges faced by application developers and DBMS designers in developing and deploying internet-scale applications, along with a survey of state-of-the-art systems that support update-intensive web applications.

[3] Complexity, diversity, frequently changing workloads, and the rapid evolution of big data systems raise great challenges for big data benchmarking. Most big data benchmarking efforts target specific types of applications or system software stacks, and hence have limited general applicability. BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets. Compared with other benchmark suites, BigDataBench has very low operational intensity, and the volume of the data input has a non-negligible impact on micro-architecture characteristics.

[6] The buzzword "big data" refers to large-scale distributed applications that work on unprecedentedly large data sets. Google's MapReduce framework and Apache Hadoop, its open-source implementation, are the de facto software systems for big data applications. One observation regarding these applications is that they generate large amounts of intermediate data that are thrown away once processing finishes. Motivated by this observation, the paper proposes Dache, a data-aware cache framework for big data applications. In Dache, tasks submit their intermediate results to a cache manager, and a task, before starting execution, queries the cache manager for potentially matching results, which can accelerate its execution or even make it unnecessary. A novel cache description scheme and a cache request-and-reply protocol are designed, and Dache is implemented by extending the relevant components of Hadoop. Testbed experiments demonstrate that Dache significantly improves the completion time of MapReduce jobs and saves a significant amount of CPU time.
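To make the tuning problem addressed by [1] concrete, the sketch below sets a few job-level knobs by hand, using Hadoop 2.x property names (the era of the paper used older mapred.* names for the same settings). The values are purely illustrative, not recommendations; choosing them well for a given job, data set, and cluster is exactly what Starfish automates.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Three of the 190+ parameters a system like Starfish tunes automatically.
        conf.setInt("mapreduce.job.reduces", 8);                 // number of reduce tasks
        conf.setInt("mapreduce.task.io.sort.mb", 256);           // map-side sort buffer size
        conf.setBoolean("mapreduce.map.output.compress", true);  // compress map output

        Job job = Job.getInstance(conf, "tuned job");
        // ... mapper/reducer/input/output setup as in the earlier driver sketch ...
    }
}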

BIBLIOGRAPHY

[1] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, pages 261-272, 2011.
[2] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. Big Data and Cloud Computing: Current State and Future Opportunities. In EDBT, pages 530-533, 2011.
[3] W. Gao, Y. Zhu, Z. Jia, C. Luo, L. Wang, Z. Li, J. Zhan, Y. Qi, Y. He, S. Gong, X. Li, S. Zhang, and B. Qiu. BigDataBench: A Big Data Benchmark Suite from Web Search Engines. In the Third Workshop on Architectures and Systems for Big Data (ASBD 2013), held in conjunction with the 40th International Symposium on Computer Architecture, May 2013.
[4] Shweta Pandey and Vrinda Tokekar. Prominence of MapReduce in Big Data Processing. In Fourth International Conference on Communication Systems and Network Technologies, IEEE, pages 555-560, 2014.
[5] Daniel E. O'Leary. Artificial Intelligence and Big Data. IEEE Computer Society, pages 96-99, March/April 2013.
[6] Yaxiong Zhao and Jie Wu. Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework. In Proc. 32nd IEEE Conference on Computer Communications (INFOCOM 2013), IEEE Press, April 2013, pages 35-39.
[7] W. Shang, Z. M. Jiang, H. Hemmati, et al. Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds. In Proc. 35th International Conference on Software Engineering (ICSE 2013), IEEE Press, May 2013, pages 402-411.
[8] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, January 2008.
[9] Hadoop. http://hadoop.apache.org/
[10] Cache algorithms. http://en.wikipedia.org/wiki/Cache_algorithms
[11] D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The Performance of MapReduce: An In-depth Study. PVLDB, 3(1), 2010.
[12] Hadoop MapReduce Tutorial. http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
[13] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. MRShare: Sharing Across Multiple Queries in MapReduce. PVLDB, 3(1), 2010.
[14] Jimmy Lin, Dmitriy Ryaboy, and Kevin Weil. Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics. In MapReduce '11, ACM, pages 59-66, 2011.

[15] S. Chandrasekar, R. Dakshinamurthy, P. G. Seshakumar, B. Prabavathy, and Chitra Babu. A Novel Indexing Scheme for Efficient Handling of Small Files in Hadoop Distributed File System. In ICCCI 2013.
[16] Denis Shestakov, Diana Moise, Gylfi Gudmundsson, and Laurent Amsaleg. Scalable High-dimensional Indexing with Hadoop. In 11th International Workshop on Content-Based Multimedia Indexing, pages 207-212, 2013.
[17] Chuitian Rong, Wei Lu, Xiaoli Wang, Xiaoyong Du, Yueguo Chen, and Anthony K. H. Tung. Efficient and Scalable Processing of String Similarity Join. IEEE Transactions on Knowledge and Data Engineering, pages 2217-2230, 2013.
[18] http://www.semantikoz.com/blog/hadoop-index-optimise-map-reduce-data-access-resultin-32x-speedup/
[19] http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241
[20] http://www.pgs-soft.com/harnessing-big-data/
[21] http://bigdatahandler.com/2013/11/02/installing-single-node-hadoop-2-2-0-on-ubuntu/