Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Similar documents
CS370 Operating Systems

Chase Wu New Jersey Institute of Technology

Distributed Systems 16. Distributed File Systems II

The Google File System. Alexandru Costan

Hadoop An Overview. - Socrates CCDH

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Microsoft Big Data and Hadoop

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

CS 345A Data Mining. MapReduce

Hadoop. Introduction / Overview

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

CISC 7610 Lecture 2b The beginnings of NoSQL

Big Data with Hadoop Ecosystem

Big Data Hadoop Stack

HDInsight > Hadoop. October 12, 2017

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

A New HadoopBased Network Management System with Policy Approach

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

MapR Enterprise Hadoop

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Yuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Chapter 5. The MapReduce Programming Model and Implementation

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Stages of Data Processing

The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012):

Big Data landscape Lecture #2

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

DATA SCIENCE USING SPARK: AN INTRODUCTION

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Distributed File Systems II

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data Architect.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

CS 345A Data Mining. MapReduce

Windows Azure Overview

Innovatus Technologies

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Challenges for Data Driven Systems

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

HDFS Architecture Guide

Lecture 12 DATA ANALYTICS ON WEB SCALE

Introduction to MapReduce

HADOOP FRAMEWORK FOR BIG DATA

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński

A BigData Tour HDFS, Ceph and MapReduce

CSE-E5430 Scalable Cloud Computing Lecture 9

Google File System (GFS) and Hadoop Distributed File System (HDFS)

The Google File System

The Datacenter Needs an Operating System

Big Data Fundamentals

Next-Generation Cloud Platform

EsgynDB Enterprise 2.0 Platform Reference Architecture

Introduction to Big Data. Hadoop. Instituto Politécnico de Tomar. Ricardo Campos

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

CS370 Operating Systems

CSE 444: Database Internals. Lecture 23 Spark

Hadoop & Big Data Analytics Complete Practical & Real-time Training

A Review Paper on Big data & Hadoop

Column Stores and HBase. Rui LIU, Maksim Hrytsenia

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

Staggeringly Large File Systems. Presented by Haoyan Geng

The Evolving Apache Hadoop Ecosystem What it means for Storage Industry

The Google File System

Data Storage Infrastructure at Facebook

Big Data and Cloud Computing

Data Centers and Cloud Computing

MapReduce and Friends

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Data Storage in the Cloud

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

CSE6331: Cloud Computing

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

On the Varieties of Clouds for Data Intensive Computing

Big Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Kinetic drive. Bingzhe Li

The Google File System

SURVEY ON BIG DATA TECHNOLOGIES

Databases 2 (VU) ( / )

Timeline Dec 2004: Dean/Ghemawat (Google) MapReduce paper 2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (

50 Must Read Hadoop Interview Questions & Answers

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

CLOUD-SCALE FILE SYSTEMS

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Transcription:

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul et. Al. Demetris Zeinalipour http://www.cs.ucy.ac.cy/~dzeina/courses/epl646 14-1

Lecture Outline Introduction to Cloud Computing Typical Datacenters, Cloud Stack and Buzzwords, Public / Private Clouds, Utility Computing, Killer Apps, Economic Model Distributed System Basics I/O Performance Replication Strategies The Hadoop Project (Core, HDFS, Map-Reduce, HBase, HIVE) The Hadoop Distributed File System (HDFS) HDFS vs. NFS (Network File System) HDFS Example Deployments (Yahoo, Facebook) 14-2

Cloud Computing αγαθά Google's Datacenter in Oregon Microsoft Azure in Chicago 14-3

Cloud Computing (Datacenters) Datacenter 14-4

Cloud Computing (Cloud Stack) 14-5

Cloud Computing (Public vs. Private) Based on: Above the Clouds: A Berkeley View of Cloud Computing, RAD Lab, UC Berkeley 14-7

Cloud Computing (Utility Computing Yπ. Ωφελείας) NewSQL-as-a-Service To Amazon RDS* (Relational Database Service) 963$ / year 27,165 $ / year (*essentially MySQL running on Amazon EC2 Elastic Computing Cloud) 14-8

Cloud Computing (Utility Computing Yπ. Ωφελείας) 14-9

Cloud Computing (Virtualization - Tεχνολογία εικονικών συστημάτων) 14-10

Our IaaS Private Cloud Demo 14-11

Cloud Computing (Economic Model) 14-12

Cloud Computing (Killer Apps: OLTP/OLAP) 14-13

Cloud Computing (Killer Apps: escience) 14-14

Distributed Systems Basics (I/O Performance) Each Disk: 2TB 7,200 rpm 6Gb SAS (Serial Attach. SCSI) 3.5" HDD In RAID-5 configuration [10Gbps FCoE also available] 14-15

Distributed Systems Basics (I/O Performance) (throughput) 14-16

Distributed Systems Basics (I/O Performance) (Similar to Parallel but more scalable) 14-17

Distributed Systems Basics (I/O Performance) 14-18

Distributed Systems Basics (Replication Strategies) NOSQL DBMSs SQL RDBMSs 14-19

Distributed Systems Basics (Replication Strategies) (Recall conflict resolution in CouchDB) 14-20

Terminology (MR => HADOOP => HBASE) Map-Reduce: a programming model for processing large data sets. Invented by Google! "MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation,San Francisco, CA, December, 2004." Can be implemented in any language (recall javascript Map- Reduce we used in the context of CouchDB). Hadoop: Apache's open-source software framework that supports data-intensive distributed applications Derived from Google's MapReduce + Google File System (GFS) papers. Enables applications to work with thousands of computationindependent computers and petabytes of data. Download: http://hadoop.apache.org/ 14-22

Terminology (MR => HADOOP => HBASE) Hadoop Project Modules: Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS ): A distributed file system that provides highthroughput access to application data. Hadoop YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource management. Hadoop MapReduce (MapReduce v2.0): A YARN-based system for parallel processing of large data sets. (Next Lectures) Other Hadoop-related projects at Apache include: Ambari: Dashboard management system for Hadoop. Avro : A data serialization system. Cassandra : A scalable multi-master database with no single points of failure. Chukwa : A data collection system for managing large distributed systems. HBase (Hadoop Database): A scalable, distributed database that supports structured data storage for large tables. (Next Lectures) Hive : A data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout : A Scalable machine learning and data mining library. Pig : A high-level data-flow language and execution framework for parallel computation. (Next Lectures) ZooKeeper : A high-performance coordination service for distributed applications. 14-23

Large-Scale File Systems (GFS => HDFS) Problem: if nodes can fail, how can we store data persistently? Answer: Distributed File System (global file namespace) The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003. 14-24

Network File Systems (NFS=>GFS=>HDFS) UNIX NFS (Network File System): nfsd (deamon) mounts remote folders to a UNIX host (/etc/fstab). 14-25

NFS uses a Client/Server Architecture that is a single point of failure by default. Network File Systems (NFS=>GFS=>HDFS) 14-26

inode Δομές στο UNIX (Επανάληψη) Τι γίνεται εάν ένα αρχείο έχει πολλά blocks; Υπάρχει αρκετός χώρος για να αποθηκευτούν όλα τα i-nodes των blocks που συσχετίζονται με το αρχείο; Το Υποσύστημα Αρχείων χρησιμοποιεί ένα ιεραρχικό σχήμα το οποίο αποτελείται από δένδρα δεικτών βάθους 0 (direct, αυτό το οποίο είδαμε ήδη), 1 (single), 2 (double), και 3 (triple) 2KB block <24ΚΒ <1ΜΒ <512ΜB < max supported file size 14-27

Large-Scale File Systems (Hadoop File System - HDFS) Chuck (Block) Size: 4KB-32KB NFS File Size Limit = 2GB Default Chunk Size: 64 MB Unlimited File Size (21PB by Facebook) 3x replicas per chunk 14-28

Large-Scale File Systems (Hadoop File System - HDFS) 2010 2010 HDFS scalability: the limits to growth http://static.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf 14-29

Large-Scale File Systems (Hadoop File System - HDFS) Namespace lookup are fast (1 Master enough!) [ 1GB Metadata = 1PB Data ] In NFS Metadata + Transfers going through same server => Not Scalable HDFS designed for unreliable hardware (2-3 failures / 1000 nodes / day) New Hardware: 3x more unreliable!!! 14-30

Large-Scale File Systems (Hadoop File System - HDFS) 14-31

User Interface API Java API C language wrapper (libhdfs) for the Java API is also avaiable POSIX like command hadoop dfs -mkdir /foodir hadoop dfs -cat /foodir/myfile.txt hadoop dfs -rm /foodir myfile.txt HDFS Admin bin/hadoop dfsadmin safemode bin/hadoop dfsadmin report bin/hadoop dfsadmin -refreshnodes Web Interface Ex: http://localhost:50070 14-32

Web Interface (http://172.16.203.136:50070) 14-33

Web Interface (http://localhost:50070) Browse the file system 14-34

POSIX Like command 14-35

Latest API Java API http://hadoop.apache.org/core/docs/current/api/ 14-36