Comprehensive Analysis of Hadoop Ecosystem Components: MapReduce, Pig and Hive
N. Suneetha 1, Ch. Sekhar 2, A. Viswanath Sharma 3 and P. Sandhya 3
1,2,3,4 Vignan's Institute of Information Technology, Duvvada, Visakhapatnam
1 suneekir9@gmail.com, 2 sekhar1203@gmail.com

Abstract. Big Data is a term for large-volume, complex, growing data sets with multiple, autonomous sources, generated in fields ranging from economic and business activities to public administration, and from national security to scientific research. It is an emerging technology that draws considerable attention from researchers seeking to extract value from voluminous datasets. As the velocity of data growth increases along with the technological challenges, the organization and storage of data are of primary concern. When approaching a big data problem, the parameters to consider are first the background of the data, then the value chain phases, namely data generation, data acquisition, data storage, and data analysis, and finally the representative applications of big data, including enterprise management, Internet of Things, online social networks, medical applications, collective intelligence, financial service sectors and smart grid. This paper focuses on the performance and working of data analysis through Hadoop ecosystem components such as MapReduce, Pig and Hive.

Keywords: Big Data, MapReduce, Pig, Hive.

1 Introduction

Over the past few decades, data has grown on a large scale in various fields. According to a report from International Data Corporation (IDC), in 2011 the overall created and copied data volume in the world was 1.8 ZB (approximately 10^21 bytes), an increase of nearly nine times within five years [1]. This figure is expected to at least double every two years in the near future. Given this explosive growth of information, the term Big Data refers to large or complex data sets that cannot be handled by traditional database technologies.
Big Data also brings many challenges, such as difficulties in data capture, data storage, data analysis and data visualization. Big data is characterized by the five Vs, namely volume, variety, velocity, value and veracity (Fig. 1). This 5V definition highlights the meaning and necessity of big data.

ISSN: ASTL Copyright 2017 SERSC
Volume: Volume refers to the generation and collection of masses of data. Nowadays the volume, or size, of data in different areas and enterprises exceeds terabytes and petabytes. The task of Big Data is to extract valuable information from high volumes of low-density, unstructured Hadoop data, that is, data of unknown value, such as Twitter data feeds, click streams on a web page or mobile app, network traffic, sensor-enabled equipment capturing data at the speed of light, and many more. Big Data volume includes such features as size, scale, amount and dimension for tera- and exascale data collected from many transactions and stored in individual files or databases so that it is accessible, searchable, processable and manageable. As an industry example, global service providers such as Google, Facebook and Twitter produce, analyze and store huge amounts of data as part of their regular activity and production services. Big Data development is needed to deal with these large volumes of data.

Variety: Variety deals with the complexity of big data and with the information and semantic models behind the data. Big data comes from a great variety of sources and generally falls into four types: structured, semi-structured, unstructured and mixed data. Structured data enters a data warehouse already tagged and is easily sorted, whereas unstructured data is random and difficult to analyze. Unstructured and semi-structured data types, such as text, audio, and video, do not conform to fixed fields but may contain tags to separate data elements, and they require additional processing both to derive meaning and to produce the supporting metadata.

Velocity: Velocity is the fast rate at which Big Data streams are generated into memory or onto disks by arrays of sensors or multiple events, and at which they need to be processed in real time, near real time, in batch, or as streams (as in the case of visualisation). Velocity is required not only for big data, but for all processes.
For time-limited processes, big data should be used as it streams into the organization in order to maximize its value [4,16]. Applications such as the Internet of Things (IoT), consumer e-commerce and mobile communications deal with large amounts of data in real time or near real time. Variety, as noted, covers semi-structured and unstructured data such as audio, video, web pages, and text, as well as traditional structured data.

Value: Value is an important feature of Big Data, defined by the added value that the collected data can bring to the intended process, activity or predictive analysis/hypothesis. Data value depends on the events or processes the data represent, such as stochastic, probabilistic, regular or random. For example, in consumer applications the intrinsic value of data is derived using quantitative and investigative techniques, from discovering a consumer preference or sentiment, to making a relevant offer by location, to identifying a piece of equipment that is about to fail. Depending on this, requirements may be imposed to collect all data, store it for a longer period (for some possible event of interest), etc. However, finding the value of Big Data requires new discovery processes to make more accurate and precise decisions.

Veracity: Big Data veracity ensures that the data used are trusted, authentic and protected from unauthorized access and modification. The data must be secured during their whole lifecycle, from collection from trusted sources to processing on trusted compute facilities and storage on protected and trusted storage facilities. Data veracity relies entirely on the security infrastructure deployed and available from the Big Data infrastructure.

With this definition, the characteristics of big data may be summarized as five Vs, i.e., Volume (great volume), Variety (various modalities), Velocity (rapid generation), Value (huge value but very low density) and Veracity (trusted and authentic), as shown in Fig. 1.

Fig. 1. Five Vs of Big Data

This paper is organized as follows. Section II presents the Hadoop ecosystem components. Section III illustrates the distributed working environment. Section IV presents the conclusion of the work.

2 Hadoop EcoSystem Components

The size of datasets is increasing rapidly, currently on the order of petabytes (PB), which makes data analysis difficult. The challenges for such analysis are capture, curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. To meet these challenges, "parallel data processing software" such as the Hadoop framework is required. Hadoop is an open-source framework that allows Big Data to be stored and processed in a distributed environment across clusters of computers using simple programming models. The Hadoop platform consists of two main services: a reliable, distributed file system called Hadoop Distributed File System (HDFS), and a high-performance parallel data processing engine called Hadoop MapReduce.
Vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM, and Amazon. Information is procured from diverse sources, such as social networking, traditional enterprise data or sensor data. The two main components of Hadoop are the Hadoop Distributed File System (HDFS), which provides distributed storage, and the MapReduce framework, which provides distributed processing. The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch operations against massive datasets that can scale from tens of terabytes to petabytes in size. Figure 2 shows the Hadoop ecosystem. In the ecosystem of Hadoop, there have been several recent research projects exploiting sharing opportunities and eliminating unnecessary data movements, e.g. [2].

Fig. 2. Hadoop ecosystem

Because collected data are often inconsistent, incomplete, and noisy, a number of data preprocessing techniques, including data cleaning, data integration, data transformation and data reduction, can be applied to remove noise and correct inconsistencies [5].

Data capture and storage

Data sets are captured from different sources, such as traditional organization data from different enterprises (including transactional ERP data, web store transactions and general ledger data), machine-generated or sensor data (including information from information-sensing mobile devices, aerial sensory technologies, remote sensing and radio-frequency identification readers), social data (including information from blogging sites and social media platforms) and so on. The world's technological capacity to store information is increasing exponentially and is measured in quintillions of bytes.

Data organization

Data is organized into the following file systems, namely Google File System (GFS) and Hadoop Distributed File System (HDFS).

Google File System

Google Inc. developed a distributed file system for its own use, designed for efficient and reliable access to data using large clusters of commodity hardware. It uses the approach of "BigFiles", which were developed by Larry Page and Sergey Brin. Here files are divided into fixed-size chunks of 64 MB with a replication factor of 3. HDFS is distributed storage that places data into fixed-size blocks, each replicated three times. There is one master node and multiple slave nodes. MapReduce jobs are processed using Version 1 and Version 2.

Data analysis

Pig and Hive are frameworks with inherent MapReduce functionality and good processing speed. The analysis is performed on huge databases ranging from thousands to lakhs (hundreds of thousands) of records using Hive and MapReduce. Both frameworks run on different platforms such as Red Hat Linux, Cloudera, Hortonworks, etc. Figure 3 shows the Uber transport data set, collected from GitHub and placed in HDFS using the hadoop put command. Ubuntu Server 14.04 LTS is used to carry out the operations on the specified data set.

Fig. 3. Uber data set from GitHub site
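As a rough illustration of the fixed-size blocks and threefold replication described above, the following Python sketch estimates how many blocks a file occupies once it is copied into HDFS and how much raw storage its replicas consume. The function name, the 64 MB block size (taken from the chunk size quoted for GFS) and the example file size are assumptions made for illustration only; real clusters configure their own block size and replication factor.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=64 * 1024 * 1024, replication=3):
    """Estimate (blocks, block replicas, raw bytes stored) for one file.

    Hypothetical helper: block size and replication are assumed defaults,
    not values read from any cluster configuration.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes and replication must be positive")
    # Ceiling division: a partially filled last block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    replicas = blocks * replication
    # Replicas of a partial block only hold the bytes actually written.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks, replicas, raw = hdfs_footprint(200 * mb)  # a made-up 200 MB file
    print(blocks, replicas, raw // mb)
```

Under these assumptions a 200 MB file is split into four blocks, the last of which is only partly filled, and the cluster stores twelve block replicas in total.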
The data shown in Figure 3 is taken from the GitHub site and has the fields DATE/TIME, LATITUDE, LONGITUDE and BASE. For the given data, the number of cars, with their base numbers, waiting at a particular latitude position is analyzed. Pig and Hive execution speeds are compared.

4 Working of Distributed Environment

Map Reduce

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to schedule tasks effectively on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master. Minimally, applications specify the input/output locations and supply map and reduce functions through implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration.
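The map, sort and reduce flow described above can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's API: the function names are hypothetical, and the sample rows are made-up records in the DATE/TIME, LATITUDE, LONGITUDE, BASE layout of the Uber data set.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_and_sort(pairs):
    """Group values by key and sort by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped, reducer):
    """Apply the reducer once per key."""
    return [reducer(key, values) for key, values in grouped]


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle_and_sort(map_phase(records, mapper)), reducer)


def base_mapper(line):
    # Emit (base, 1) for every trip record; the base is the fourth field.
    fields = line.split(",")
    yield fields[3].strip(), 1


def count_reducer(key, values):
    return key, sum(values)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    rows = [
        "5/1/2014 0:02:00,40.7521,-73.9914,B02512",
        "5/1/2014 0:06:00,40.6965,-73.9715,B02598",
        "5/1/2014 0:15:00,40.7464,-73.9838,B02512",
    ]
    print(run_job(rows, base_mapper, count_reducer))
```

In a real cluster the map calls run on many nodes at once and the grouping step moves data across the network; the sketch keeps only the data flow that an application author has to reason about.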
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

PIG:

Pig was initially developed at Yahoo Research around 2006 and subsequently moved into the Apache Software Foundation. Pig is a scripting platform consisting of a language and an execution environment built on the Hadoop MapReduce framework. Its scripting language is usually called Pig Latin [6]. Pig entered the big data world because many companies found that handling even a small data set required many lines of code, which wasted time and money. Pig Latin expresses such tasks in very few lines of code. It is a high-level language and does not require writing Java. Pig supports a dataflow language and can handle complex data structures, even those with several levels of nesting. It has two types of execution environment, local and distributed. The local environment is used for testing when the distributed environment cannot be deployed. A Pig Latin program is a collection of statements. A statement can be an operation or a command. The distributed environment is chosen with the command pig. Local mode is chosen with the command pig -x local.

Fig. 4. Execution of PIG

Figure 4 shows the execution process of the Pig framework. The Grunt shell is the console for writing and executing Pig scripts. The execution engine converts the scripts into MapReduce jobs and the result is stored in HDFS.

Handling of semi-structured data using Pig

Data of a semi-structured nature, with extensions such as .xml, .jpeg, etc., can also be handled using Pig. To handle .xml files, some additional libraries need to be included with Pig. The two additional libraries for Pig are as follows.

Piggy Bank
1) A collection of useful LOAD, STORE and UDF functions.
2) Has many user-defined functions.
3) An open-source Apache project; it must be downloaded separately and linked with Pig.

Apache DataFu
1) A collection of libraries for working with large-scale data in Hadoop.
2) A project inspired by the need for stable, well-tested libraries for data mining and statistics.
3) Also an Apache open-source project, included within Apache Pig from version 0.14.

The additional library piggybank is used for loading the data into Pig from HDFS. For handling XML files, the XMLLoader() function is provided by the piggybank library.
The following is the sample XML file:

    <Data>
      <employee> <id>5</id> <name>aravind</name> <gender>m</gender> </employee>
      <employee> <id>6</id> <name>krishna</name> <gender>m</gender> </employee>
      <employee> <id>7</id> <name>thiru</name> <gender>m</gender> </employee>
      <employee> <id>8</id> <name>mani</name> <gender>m</gender> </employee>
    </Data>

The above data is loaded into Pig using the LOAD command, whose syntax is as follows (the piggybank JAR must first be made available to Pig, typically with a REGISTER statement):

    employee = LOAD 'input/employee.xml' USING org.apache.pig.piggybank.storage.XMLLoader('employee') AS (x:chararray);

The output of the script described above is as follows: [screenshot not reproduced]
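To make the loading step above concrete, the following Python sketch mimics, on a single machine, what loading by tag name amounts to: every element with the requested tag becomes one text record, which can then be parsed into fields. It uses only the standard library and is not the piggybank implementation; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<Data>
<employee><id>5</id><name>aravind</name><gender>m</gender></employee>
<employee><id>6</id><name>krishna</name><gender>m</gender></employee>
<employee><id>7</id><name>thiru</name><gender>m</gender></employee>
<employee><id>8</id><name>mani</name><gender>m</gender></employee>
</Data>"""


def load_elements(xml_text, tag):
    """Return each <tag> element as one string record, one record per element."""
    root = ET.fromstring(xml_text)
    return [ET.tostring(elem, encoding="unicode").strip() for elem in root.iter(tag)]


def parse_employee(fragment):
    """Turn one <employee> fragment into a dict of its child fields."""
    elem = ET.fromstring(fragment)
    return {child.tag: child.text for child in elem}


if __name__ == "__main__":
    for record in load_elements(SAMPLE_XML, "employee"):
        print(parse_employee(record))
```

Keeping each element as a single string mirrors the (x:chararray) schema in the LOAD statement; extracting the id, name and gender fields is a separate, later step.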
In Pig, the loaded relation can be projected with:

    s = foreach employee generate x;

Data is displayed using DUMP, e.g.:

    dump employee;

Hive

Apache Hive is a data warehouse system for Apache Hadoop [7]. Hive was developed by Facebook and turns Hadoop into a complete data warehouse with an extension of SQL for querying. HiveQL is the declarative language used by Hive; its commands are executed against a schema. The configuration can be set in three ways: by editing the hive-site.xml file, by passing the --hiveconf option when starting Hive, or by using the set command at the Hive prompt. The data set specified above is used to describe processing with Hive.

DATA ANALYSIS USING HIVE

The Uber data set is taken from GitHub. The data can be downloaded from GitHub by using the wget command with the link address in the console. The data is then moved from the local system to the Hadoop Distributed File System using the hadoop put command, e.g.:

    hadoop fs -put localfileaddress destinationaddress
    hadoop fs -put uber-raw-data-may14.csv /user/allamrajuviswanath_gmail/viswanath
Open the Hive console using the command hive. The Hive shell starts in the console with the prompt:

    hive>

First create a database with a name. Next create a table for the data set. For the Uber data set the table is created as follows:

    create table ubersep14(pickdatetime string, lat float, log float, base string) row format delimited fields terminated by ',' stored as textfile;

Load the data into the table using the command:

    LOAD DATA INPATH '/user/allamrajuviswanath_gmail/viswanath/uber-raw-data-sep14.csv' OVERWRITE INTO TABLE ubersep14;

The query to retrieve the total number of base numbers of Uber cars from the table is as follows:

    select count(base) from ubersep14;

The resultant output screen is shown below: [screenshot not reproduced]

To count the number of Uber cars, with their base numbers, at a particular latitude position, the command used is the following, where <latitude> stands for the latitude of interest:

    select base, count(*) as count from ubersep14 where lat = <latitude> group by base;
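The grouping query above can be checked on a small sample before it is run against the full table. The Python sketch below computes the same result, the number of rows per base with an optional latitude filter, from CSV text using only the standard library. The function name and the sample rows are hypothetical and serve only to illustrate what the query returns.

```python
import csv
import io
from collections import Counter


def count_by_base(csv_text, latitude=None):
    """Count rows per base; if latitude is given, keep only rows at that latitude."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        pickdatetime, lat, lon, base = row
        if latitude is None or float(lat) == latitude:
            counts[base] += 1
    return dict(counts)


# Made-up rows in the pickdatetime, lat, log, base layout of the table.
SAMPLE = "\n".join([
    "5/1/2014 0:02:00,40.7521,-73.9914,B02512",
    "5/1/2014 0:06:00,40.6965,-73.9715,B02512",
    "5/1/2014 0:15:00,40.7521,-73.9914,B02598",
    "5/1/2014 0:17:00,40.7521,-73.9800,B02512",
])

if __name__ == "__main__":
    print(count_by_base(SAMPLE))           # counts over all rows
    print(count_by_base(SAMPLE, 40.7521))  # counts at one latitude only
```

Comparing a small hand-checked sample like this with the Hive output is a simple way to confirm that the table was loaded with the expected column order.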
5 Conclusion

The results above show that less time and effort are needed to work on huge databases. The Pig, Hive and MapReduce frameworks can perform analyses in a short time. Hive can analyze a database of more than 8 lakh records in only 34 seconds, and Pig can do the same in 40 seconds. Hence all these components make it possible to handle and use large databases in a simple and efficient way.

References

1. Apache Hadoop. Available at
2. R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: Yet Another SQL-to-MapReduce Translator. In ICDCS.
3. H. Lim, H. Herodotou, and S. Babu. Stubby: A Transformation-based Optimizer for MapReduce Workflows. In VLDB.
4. T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. MRShare: Sharing Across Multiple Queries in MapReduce. In VLDB.
5. X. Wang, C. Olston, A. D. Sarma, and R. Burns. CoScan: Cooperative Scan Sharing in the Cloud. In SoCC.
6. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy. Hive: A Petabyte Scale Data Warehouse Using Hadoop. Facebook Data Infrastructure Team.
7. Zhifeng Yang, Qichen Tu, Kai Fan, Lei Zhu, Rishan Chen, Bo Peng. Performance Gain with Variable Chunk Size in GFS-like File Systems. Journal of Computational Information Systems 4:3.
8. Sam Madden. From Databases to Big Data. IEEE Computer Society.
9. Sanjeev Dhawan and Sanjay Rathee. Big Data Analytics using Hadoop Components like Pig and Hive. American International Journal of Research in Science, Technology, Engineering & Mathematics, pp. 1-5.
Processing Unstructured Data Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd. http://dinesql.com / Dinesh Priyankara @dinesh_priya Founder/Principal Architect dinesql Pvt Ltd. Microsoft Most
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationApache Spark and Hadoop Based Big Data Processing System for Clinical Research
Apache Spark and Hadoop Based Big Data Processing System for Clinical Research Sreekanth Rallapalli 1,*, Gondkar R R 2 1 Research Scholar, R&D Centre, Bharathiyar University, Coimbatore, Tamilnadu, India.
More informationHuge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2
2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017) Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 1 Information Engineering
More informationIntroduction to Hadoop. Owen O Malley Yahoo!, Grid Team
Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since