Introduction to Big Data. Hadoop. Instituto Politécnico de Tomar. Ricardo Campos
1 Instituto Politécnico de Tomar. Introduction to Big Data: Hadoop. Ricardo Campos. Mestrado EI-IC, Análise e Processamento de Grandes Volumes de Dados (Analysis and Processing of Large Volumes of Data). Tomar, Portugal, 2016.
2 Some of the slides used in this presentation were adapted from presentations found on the Internet and from the reference bibliography.
5 AGENDA. What is this talk about? Overview; HDFS; MapReduce; Ecosystem; Hadoop Distributions; Q&A.
6 Hadoop, currently the most widely adopted Big Data platform, is an Apache-managed software framework derived from MapReduce and Bigtable (a distributed storage system developed by Google and intended to manage highly scalable structured data). Please consider reading the following articles: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03), 2003; Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, 2004; Luiz Andre Barroso, Jeffrey Dean, and Urs Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro, vol. 23 (2003).
7 It was originally built by a Yahoo! engineer named Doug Cutting in 2006 and is now an open source project managed by the Apache Software Foundation. It is designed to enable the distributed processing of large, complex data sets (huge amounts of structured and unstructured data) across a set of clustered computers.
9 Scalable: New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top. Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data. Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide. Fault tolerant: When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
11 Hadoop consists of three primary resources. Hadoop Distributed File System (HDFS): a massively scalable distributed file system that can support petabytes of data and the management of related files across machines. MapReduce engine: a high-performance parallel/distributed data-processing implementation of the MapReduce algorithm. Hadoop ecosystem: a collection of tools that use or sit beside MapReduce and HDFS to store and organize data and to manage the machines that run Hadoop.
13 Operating System layer: The first layer is the operating system on the host machine. Hadoop is installed on top of the operating system and runs the same regardless of the host operating system. Hadoop layer: This is the base installation of Hadoop, which includes the file system and MapReduce components. DBMS layer: On top of Hadoop, the various Hadoop DBMS and related applications are installed. Typically, Hadoop installations include a data warehousing or database package, such as Hive or HBase. Application layer: The application layer is the top layer, which includes the tools that provide data management, analysis, and other capabilities.
14 To store data, Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes. A distributed file system is a file system that can store large files spread across the nodes of a cluster. A typical Hadoop usage pattern involves three stages: loading data into HDFS, running MapReduce operations, and retrieving results from HDFS.
15 Because the data is written once and then read many times thereafter, rather than subjected to the constant read-writes of other file systems, HDFS is an excellent choice for supporting big data analysis. It includes a NameNode and multiple DataNodes running on a commodity hardware cluster. In essence, the NameNode keeps track of where data is physically stored. HDFS works by breaking large files into smaller pieces called blocks. When data is stored in Hadoop, the file is automatically split into blocks (64 MB or 128 MB by default, depending on the Hadoop version), and the blocks are replicated across the various DataNodes, with the NameNode recording where each one lives. This is done to ensure fault tolerance and high availability.
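The block splitting and replication described above can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the replication factor of 3, the round-robin placement rule, and the node names `dn1` to `dn4` are assumptions made for the example, and real HDFS placement also takes rack topology into account.

```python
BLOCK_SIZE_MB = 128   # one of the two default sizes mentioned on the slide
REPLICATION = 3       # assumed replication factor, not stated on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of this size is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: block index -> list of distinct DataNodes."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 300 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

With three copies of every block on different nodes, losing any single node still leaves two readable replicas, which is the fault-tolerance property the slide refers to.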
16 Hadoop implementations rely on a master-slave style of distribution, where the master node stores all the metadata, access rights, mapping and location of files and blocks, and so on. The slaves are nodes where the actual data is stored. All the requests go to the master and then are handled by the appropriate slave node.
17 Data nodes are not very smart, but the NameNode is. The data nodes constantly ask the NameNode whether there is anything for them to do. This continuous behavior also tells the NameNode what data nodes are out there and how busy they are. HDFS metadata is stored in the NameNode and includes: when the file was created, accessed, modified, deleted, and so on; where the blocks of the file are stored in the cluster; and who has the rights to view or modify the file.
18 The NameNode metadata also records how many files are stored on the cluster, how many data nodes exist in the cluster, and the location of the transaction log for the cluster. Data nodes are not smart, but they are resilient. They are servers that contain the blocks for a given set of files. It is reasonable to think of data nodes as block servers.
19 What exactly does a block server do? It stores (and retrieves) the data blocks in the local file system of the server; HDFS is available on many different operating systems and behaves the same whether on Windows, Mac OS, or Linux. It stores the metadata of a block in the local file system, based on the metadata template in the NameNode. It sends regular reports to the NameNode about what blocks are available for file operations. Tutorial Links: Yahoo! has published an excellent guide for configuring and exploring a basic system.
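The division of labor between the NameNode (metadata only) and the data nodes (block servers that check in and report their blocks) can be sketched as follows. The class name `ToyNameNode`, its methods, and the 30-second timeout are hypothetical names and values chosen for illustration; they are not the Hadoop API.

```python
import time


class ToyNameNode:
    """Keeps metadata only: which DataNode holds which block, and who is alive."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout  # hypothetical value, in seconds
        self.last_seen = {}        # DataNode name -> time of last heartbeat
        self.block_locations = {}  # block id -> set of DataNode names

    def heartbeat(self, datanode, now=None):
        """Record that a DataNode has just checked in."""
        self.last_seen[datanode] = time.time() if now is None else now

    def block_report(self, datanode, block_ids, now=None):
        """A DataNode reports every block it currently stores."""
        self.heartbeat(datanode, now)
        for holders in self.block_locations.values():
            holders.discard(datanode)          # forget this node's previous report
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def live_datanodes(self, now=None):
        now = time.time() if now is None else now
        return {dn for dn, seen in self.last_seen.items()
                if now - seen <= self.heartbeat_timeout}

    def locate(self, block_id, now=None):
        """Return the live DataNodes that can serve a block."""
        return self.block_locations.get(block_id, set()) & self.live_datanodes(now)


namenode = ToyNameNode(heartbeat_timeout=30.0)
namenode.block_report("dn1", ["blk_1", "blk_2"], now=0.0)
namenode.block_report("dn2", ["blk_1"], now=0.0)
namenode.heartbeat("dn2", now=40.0)   # dn1 has gone silent since time 0
```

Timestamps are passed in explicitly so the behavior is deterministic: at time 50, `dn1` has missed its heartbeat window, so only `dn2` is offered for `blk_1`, and `blk_2` has no live holder.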
20 Hadoop MapReduce is the heart of the Hadoop system, an implementation of the MapReduce algorithm. It is helpful to think about this implementation as a MapReduce engine, because that is exactly how it works. You provide input (fuel), the engine converts the input into output quickly and efficiently, and you get the answers you need. The process starts with a user request to run a MapReduce program and continues until the results are written back to HDFS.
21 In the early 2000s, some engineers at Google looked into the future and determined that while their current solutions for applications such as web crawling, query frequency, and so on were adequate for most existing requirements, they were inadequate for the complexity they anticipated as the web scaled to more and more users. These engineers determined that if work could be distributed across inexpensive computers connected over a network in the form of a cluster, they could solve the problem. MapReduce, which was designed by Google and popularized by Yahoo! through Hadoop, is a software framework that enables developers to write programs that can process massive amounts of unstructured data in parallel across a distributed group of processors.
22 Parallel data processing involves the simultaneous execution of multiple sub-tasks that collectively comprise a larger task. The goal is to reduce the execution time by dividing a single larger task into multiple smaller tasks that run concurrently. Although parallel data processing can be achieved through multiple networked machines, it is more typically achieved within the confines of a single machine with multiple processors or cores
23 Distributed data processing is closely related to parallel data processing in that the same principle of divide-and-conquer is applied. However, distributed data processing is always achieved through physically separate machines that are networked together as a cluster. Take a very large problem and break it into smaller, more manageable chunks, operate on each chunk independently, and then pull it all together at the end.
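The divide-and-conquer idea (split the problem into chunks, work on each chunk independently, combine at the end) can be sketched with Python's standard `concurrent.futures` module. Threads on one machine stand in for the separate machines of a cluster here, and the task (a sum of squares), the chunk size, and the worker count are arbitrary choices for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Break a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Work on one chunk independently of all the others."""
    return sum(x * x for x in chunk)


def sum_of_squares(numbers, chunk_size=1000, workers=4):
    chunks = chunked(numbers, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))   # one result per chunk
    return sum(partials)                                   # pull it all together


numbers = list(range(10_000))
total = sum_of_squares(numbers)
```

The point of the sketch is the structure rather than speed: no chunk needs data from any other chunk, which is what lets the same work be spread over many machines.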
24 Hadoop's MapReduce involves distributing a dataset among multiple servers and operating on the data: the map stage. The partial results are then recombined: the reduce stage. Map: The map component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. The map function is commutative; in other words, the order in which the function is executed doesn't matter. MapReduce can perform its work on different machines in a network and get the same result as if all the work was done on a single machine.
25 Reduce: After the distributed computation is completed, another function called reduce aggregates all the elements back together to provide a result. Tutorial Links: A good place to start is the official Apache documentation, but Yahoo! has also put together a tutorial module.
26 Check this example:
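The example from the original slide is not reproduced in this transcript. As a stand-in, here is a minimal word-count sketch of the map, shuffle, and reduce stages in plain Python. It runs in a single process and only mimics the data flow; the function names are illustrative and are not part of the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values emitted for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


lines = ["Hadoop stores data", "Hadoop processes data"]
counts = word_count(lines)
```

Because each line is mapped independently and the reduce step is a sum, feeding the lines in a different order gives the same counts, which is the order-independence property mentioned for the map stage.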
27 While MapReduce is great for certain categories of tasks, it falls short with others. This led to fracturing in the ecosystem and a variety of tools that live outside of your Hadoop cluster but attempt to communicate with HDFS. Engine: Spark
28 Spark: MapReduce is the primary workhorse at the core of most Hadoop clusters. While highly effective for very large batch-analytic jobs, MapReduce has proven to be suboptimal for applications like graph analysis. Unlike Pig and Hive, Spark is not a tool for making MapReduce easier to use. It is a complete replacement for MapReduce that includes its own work execution engine, which enables in-memory data processing on Hadoop. Tutorial Links: A quick start for Spark can be found on the project home page.
29 Database and Data Management: Cassandra; HBase; MongoDB; Hive
30 Cassandra: Oftentimes you may need to simply organize some of your big data for easy retrieval. One common way to do this is to use a key-value datastore. Cassandra is a distributed key-value database designed with simplicity and scalability in mind. Your data is organized by a unique key, and values are associated with that key. Cassandra is an all-inclusive system, which means it does not require a Hadoop environment or any other big data tools. Tutorial Links: DataStax, a company that provides commercial support for Cassandra, offers a set of freely available videos.
31 HBase: HBase is a NoSQL database system included in the standard Hadoop distributions that works with HDFS for data storage and access. It is based on Google's Bigtable. Logically, it is a key-value store. This means that rows are defined by a key and have associated with them a number of bins (or columns) where the associated values are stored. Physically, groups of similar columns are stored together in column families. Most often, HBase is accessed via Java code, but APIs exist for using HBase with Pig. HBase is often used for applications that may require sparse rows. That is, each row may use only a few of the defined columns.
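The logical model described above (row key, column families, columns, sparse rows) can be pictured as nested dictionaries. The sketch below is a toy in-memory structure, not the HBase client API; the class `ToyTable`, its `put` and `get` methods, and the sample data are all hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Rows keyed by a row key; cells grouped as family -> column -> value."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed up front
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get(self, row_key, family=None):
        """Return the stored cells of a row; absent columns take no space."""
        row = self.rows.get(row_key, {})
        if family is None:
            return {fam: dict(cols) for fam, cols in row.items()}
        return dict(row.get(family, {}))


users = ToyTable(families=["info", "contact"])
users.put("user1", "info", "name", "Ana")
users.put("user1", "contact", "email", "ana@example.org")
users.put("user2", "info", "name", "Rui")   # sparse row: no contact columns at all
```

Here `user2` stores a single cell, and nothing is allocated for the columns it does not use, which is what makes this layout a natural fit for sparse rows.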
32 While Cassandra and MongoDB might still be the predominant NoSQL databases today, HBase is gaining in popularity and may well be the leader in the near future. Tutorial Links: The folks at Coreservlets.com have put together a handful of Hadoop tutorials, including an excellent series on HBase. There are also video tutorials available on the Internet, one of which we found particularly helpful.
33 MongoDB: MongoDB is a document-oriented database, the document being a JSON object. In relational databases, you have tables and rows. In MongoDB, the equivalent of a row is a JSON document, and the analog to a table is a collection, a set of JSON documents. As of the start of 2015, it was one of the most popular NoSQL databases. Unlike some other database systems, MongoDB supports secondary indexes, meaning it is possible to search quickly on fields other than the primary key that uniquely identifies each document in the Mongo database. Tutorial Links: The tutorials section on the official project page is a great place to get started. There are also plenty of videos available on the Internet, including an informative series.
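To make the document, collection, and secondary-index vocabulary concrete, here is a toy in-memory collection in plain Python. It is a sketch of the idea only and does not use or imitate the MongoDB driver; `ToyCollection` and its methods are hypothetical, and the sample documents are invented.

```python
from collections import defaultdict


class ToyCollection:
    """A set of JSON-like documents with optional secondary indexes."""

    def __init__(self):
        self.documents = {}   # primary key (_id) -> document
        self.indexes = {}     # field name -> {field value -> set of _id}

    def create_index(self, field):
        index = defaultdict(set)
        for doc_id, doc in self.documents.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        self.indexes[field] = index

    def insert(self, doc):
        self.documents[doc["_id"]] = doc
        for field, index in self.indexes.items():
            if field in doc:                      # documents need not share fields
                index[doc[field]].add(doc["_id"])

    def find(self, field, value):
        if field in self.indexes:                 # indexed: direct lookup
            ids = self.indexes[field].get(value, set())
            return [self.documents[i] for i in sorted(ids)]
        # not indexed: scan every document
        return [d for d in self.documents.values() if d.get(field) == value]


people = ToyCollection()
people.create_index("city")
people.insert({"_id": 1, "name": "Ana", "city": "Tomar"})
people.insert({"_id": 2, "name": "Rui", "city": "Lisboa"})
people.insert({"_id": 3, "name": "Eva", "city": "Tomar", "age": 31})
in_tomar = people.find("city", "Tomar")
```

The lookup on `city` goes straight to the matching ids through the index, while a query on the unindexed `name` field has to scan the whole collection.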
34 Hive: At first, all access to data in your Hadoop cluster came through MapReduce jobs written in Java. This worked fine during Hadoop's infancy, when all Hadoop users had a stable of Java-savvy coders. However, as Hadoop emerged into the broader world, many wanted to adopt Hadoop but had stables of SQL coders for whom writing MapReduce would be a steep learning curve.
35 The goal of Hive is to allow SQL access to data in HDFS. Hive defines a simple SQL-like query language, called HQL, that enables users familiar with SQL to query the data. Queries written in HQL are converted into MapReduce code by Hive and executed by Hadoop. Tutorial Links: A couple of great resources are the official Hive tutorial and a video published by the folks at Hortonworks.
36 Analytics: Once you've ingested your data into the system, you may be satisfied to simply push it into a more traditional data store, such as a relational database, and consider your big data work to be done. On the other hand, you may want to continue to work with your data, running specialized machine-learning algorithms to categorize your data. Tools: Pig; Mahout
37 Pig: Pig is a framework for executing MapReduce on HDFS data using its own scripting language. It creates an abstraction layer on top of MapReduce to enable simpler and faster analysis. The scripting language is designed to facilitate query-like data operations that can be expressed in just a few lines of code.
38 Why would you want to use Pig rather than MapReduce? Native MapReduce applications written in Java are effective and powerful tools, but developing and testing them is time-consuming and complex. Pig solves this problem by offering a simpler development and testing process that takes advantage of the power of MapReduce, without the need to build large Java applications. Whereas Java programs may require far more lines, Pig scripts often have ten lines of code or fewer. In many ways, Pig is an admirable extract, transform, and load (ETL) tool. Pig scripts are translated, or compiled, into MapReduce code.
39 Tutorial Links: There's a fairly complete guide to get you through the process of installing Pig and writing your first couple of scripts. "Working with Pig" is a great overview of the Pig technology.
40 Data Transfer: Data transfer tools provide three basic capabilities. File transfer: help move files and flat text, such as log entries, into your Hadoop cluster. Database transfer: provide a simple mechanism for moving data between traditional relational databases, such as Oracle or SQL Server, and your Hadoop cluster. Data triage: can be used to quickly evaluate and categorize new data as it arrives on your Hadoop system.
41 Data Transfer: Sqoop; Flume
42 Sqoop: It's likely that some of your data originates in a relational database management system (RDBMS) that is normally accessed through SQL. Sqoop (meaning SQL to Hadoop) is designed to transfer data between relational databases and Hadoop clusters. You'll start your import to Hadoop with a database table that is read into Hadoop as a text file. You can also export an HDFS file into an RDBMS. Tutorial Links: There's an excellent series of lectures on this topic available on YouTube.
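The core of the import step, a database table read out as a text file, can be imitated with Python's standard `sqlite3` and `csv` modules. This is a stand-in for the idea only and is not how Sqoop itself is invoked; the in-memory SQLite database, the `employees` table, and the helper `export_table_as_text` are assumptions made for the example.

```python
import csv
import io
import sqlite3


def export_table_as_text(connection, table):
    """Dump every row of a table as comma-separated text, one line per row."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = connection.execute(f"SELECT * FROM {table} ORDER BY 1")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(cursor)          # each row tuple becomes one text line
    return buffer.getvalue()


# A hypothetical source table in an in-memory relational database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
connection.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ana", "sales"), (2, "Rui", "it")],
)
text = export_table_as_text(connection, "employees")
```

The resulting text is the kind of flat, line-oriented file that could then be loaded into HDFS and processed by a MapReduce job.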
43 Flume: Flume is a reliable distributed system for collecting, aggregating, and moving large amounts of log data from multiple sources into HDFS. Tutorial Links: Dr. Dobb's Journal published an informative article on Flume. Readers who enjoy a lecture should check out an interesting presentation from 2011.
44 Mahout: You have a bunch of data in your Hadoop cluster. What are you going to do with it? You might want to do some analytics, or data science, or machine learning. Much of this can be done with some of the tools that come with the standard Apache distribution, such as Pig, MapReduce, or Hive. But more sophisticated uses will involve algorithms that you will not want to code yourself. So you turn to Mahout. Mahout is a collection of scalable machine-learning algorithms that run on Hadoop. Tutorial Links: The Mahout folks have an entire page of curated links to books, tutorials, and talks.
45 Hadoop distributions: Cloudera; Hortonworks; MapR; IBM; Intel (hadoop.intel.com); EMC; Amazon (aws.amazon.com/ec2); Apache Bigtop
46 Cloudera vs. Hortonworks vs. MapR Please consider reading the following articles:
47 Why install it yourself? By installing it yourself, you learn more about how it all fits together and gain a better understanding of the whole Hadoop ecosystem. If you prefer, you can download a VM distribution and play with it: Cloudera QuickStart VM; Hortonworks Sandbox; MapR.
48 The Hadoop modes deploy these components as follows: local standalone mode; pseudo-distributed mode; fully distributed mode.
49 Local standalone mode: This is the default mode if you don't configure anything else. In this mode, all the components of Hadoop, such as NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process. Local, or standalone, mode is the easiest to set up, but you interact with it in a different manner than you would with the fully distributed mode.
50 Pseudo-distributed mode: In this mode, a separate JVM is spawned for each of the Hadoop components, and they communicate across network sockets, effectively giving a fully functioning minicluster on a single host. We shall generally prefer the pseudo-distributed mode even when using examples on a single host, as everything done in the pseudo-distributed mode is almost identical to how it works on a much larger cluster.
51 Fully distributed mode: In this mode, Hadoop is spread across multiple machines, some of which will be general-purpose workers and others will be dedicated hosts for components such as NameNode and JobTracker. Fully distributed mode is obviously the only one that can scale Hadoop across a cluster of machines, but it requires more configuration work, not to mention the cluster of machines.
More informationIntroduction to Big-Data
Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,
More informationPerformance Comparison of Hive, Pig & Map Reduce over Variety of Big Data
Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More information"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute
"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute erickj4@rpi.edu @olyerickson Director of Operations, The Rensselaer IDEA Deputy Director, Rensselaer
More informationWhat is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?
Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation
More informationHadoop. copyright 2011 Trainologic LTD