Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing



Big Data
"The world is crazy. But at least it's getting regular analysis."

Big Data Explosion
- 1.8 ZB of information worldwide (2011)

Big data becomes Real data
Big Data became real in 2013:
- Obama, "the Big Data President"
- Oscar prediction
- Finding and telling data-driven stories in billions of Tweets

The 5 V's
- Volume: large amounts of data generated every second (emails, Twitter messages, videos, sensor data...)
- Velocity: the speed of data moving in and out of data management systems (videos going viral...), on-the-fly
- Variety: different data formats in terms of structured or unstructured (80%) data
- Value: insights we can reveal within the data
- Veracity: trustworthiness of the data

Big Data
Extremely large datasets that are hard to deal with using traditional (relational) databases:
- Storage/Cost
- Search/Performance
- Analytics and Visualization

Big Data
Different types of data:
- Structured data (relational, tables)
- Semi-structured data (XML, JSON, log files)
- Unstructured data (free text, web pages)
- Graph data (social network, Semantic Web)
- Streaming data
Typical operations:
- Aggregation & statistics: data warehouse, OLAP
- Index, searching, querying: keyword-based search, pattern matching
- Knowledge discovery: data mining, statistical modeling

Big Data
- Old model: "Query the world": data acquisition coupled to a specific hypothesis
- New model: "Download the world": data acquisition supports many hypotheses
Examples:
- E-commerce: transactions, customer tracking, social graph
- Security: log files, anomaly detection
- Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
- Biology: lab automation, high-throughput sequencing
- Oceanography: high-resolution models, cheap sensors, satellites


MapReduce


Parallel and distributed programming paradigms
- Partitioning
  1. Computation
  2. Data
- Mapping: assign computation and data parts to resources
- Synchronisation
- Communication: send intermediate data between workers
- Scheduling
A paradigm is an abstraction that hides the implementation of these issues from the users.

MapReduce
An abstraction for performing computations on data:
- automatic parallelization of computations
- large-scale data distribution
- simple, yet powerful interface
- user-transparent fault tolerance
- commodity hardware
Introduced by Google in 2004: paradigm and implementation


Motivation: Common operations on data
- Iterate over a large number of records
- Extract something of interest from each (map)
- Shuffle and sort intermediate results
- Aggregate intermediate results (reduce)
- Generate final output
Provide a functional abstraction for these two operations: map and reduce.


MapReduce Programming Model
- Input & Output: each a set of key/value pairs
- Programmer specifies two functions:
  map(in_key, in_value) -> list(out_key, intermediate_value)
  - Processes input key/value pairs
  - Produces a set of intermediate pairs
  reduce(out_key, list(intermediate_value)) -> list(out_value)
  - Combines all intermediate values for a particular key
  - Produces a set of merged output values (usually just one)
- Inspired by primitives of functional programming languages such as Lisp, Scheme and Haskell
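
The two signatures above can be exercised with a small in-memory driver. This is only a sketch under stated assumptions: `run_mapreduce`, `max_map`, `max_reduce` and the sensor readings are hypothetical names and data invented for illustration, and everything runs sequentially in one process rather than on a cluster.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle/sort and reduce sequentially in one process."""
    # Map phase: each input pair yields zero or more intermediate pairs.
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            # Shuffle: collect all intermediate values under their key.
            groups[out_key].append(intermediate_value)
    # Reduce phase: visit keys in sorted order, as after a sort step.
    output = []
    for out_key in sorted(groups):
        for out_value in reduce_fn(out_key, groups[out_key]):
            output.append((out_key, out_value))
    return output


# Hypothetical job: maximum reading per sensor.
def max_map(in_key, in_value):
    sensor, reading = in_value.split(",")
    yield sensor, float(reading)


def max_reduce(out_key, values):
    # Usually just one merged output value per key.
    yield max(values)


readings = [(0, "a,1.5"), (1, "b,0.2"), (2, "a,3.0")]
result = run_mapreduce(readings, max_map, max_reduce)
print(result)
```

The user writes only `max_map` and `max_reduce`; grouping by key is the driver's job, which is the part a real framework distributes across machines.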

What is MapReduce used for?
- At Google: index construction for Google Search, article clustering for Google News, statistical machine translation
- At Yahoo!: web map powering Yahoo! Search, spam detection for Yahoo! Mail
- At Facebook: data mining, ad optimization, spam detection

Example: Word Count
Reduce phase is optional: jobs can be map-only.

Example: Word Count
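
Since the slide's figure did not survive, here is a rough single-process sketch of word count. The function names (`wc_map`, `wc_reduce`, `word_count`, `map_only`) and the two sample documents are assumptions made for illustration; splitting on whitespace is a simplification. The `map_only` variant shows what a job without a reduce phase returns: the raw, unaggregated map output.

```python
from collections import defaultdict


def wc_map(doc_id, text):
    """Emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def wc_reduce(word, counts):
    """Sum all the 1s emitted for one word."""
    yield sum(counts)


def word_count(documents):
    """Full job: map, group by word, then reduce."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in wc_map(doc_id, text):
            groups[word].append(one)
    return {word: total
            for word in sorted(groups)
            for total in wc_reduce(word, groups[word])}


def map_only(documents):
    """Map-only job: return the map output directly, with no reduce step."""
    return [pair
            for doc_id, text in documents.items()
            for pair in wc_map(doc_id, text)]


docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
counts = word_count(docs)
pairs = map_only(docs)
print(counts)
print(pairs)
```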

How does it work?

Key Characteristics
- Parallelism
  - map() and reduce() functions run in parallel, each working on different data
  - the reduce phase cannot start until the map phase is completely finished
- Locality
  - the master program assigns tasks based on the location of data: it tries to run map() tasks on the same machine as the physical file data, or at least in the same rack
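
The locality preference can be sketched as a three-step fallback. This is not Hadoop's scheduler; `pick_node`, the node names and the `rack_of` table are hypothetical, and the sketch only encodes the order stated above: same machine first, then same rack, then anywhere.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose a free node for a map task, preferring data locality."""
    # 1. Node-local: a free node that already stores the chunk.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # 2. Rack-local: a free node in the same rack as some replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Otherwise any free node; the chunk is read across racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "no free node"


# Hypothetical cluster: two racks with two nodes each.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}

print(pick_node({"n1", "n3"}, ["n2", "n1"], rack_of))
print(pick_node({"n1"}, ["n4", "n2"], rack_of))
print(pick_node({"n1"}, ["n3"], rack_of))
```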


The Hadoop Project

Hadoop Tool Suite

HDFS Distributed File System
- An open-source implementation of Google File System
- Data split into chunks of fixed (configurable) size, 64 MB by default
- Two server types:
  1. Namenode: keeps the metadata
  2. Datanode: stores the data
- Failures handled through chunk-level replication
  - 3 replicas: local, same rack, different rack
- Write-once-read-many pattern
- Files are append-only
- Optimized for large files, sequential reads
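
As a back-of-the-envelope illustration of the chunking and replication figures above, the sketch below computes how many chunks and replicas a file produces. `chunk_layout` is a hypothetical helper; the 64 MB chunk size and the replication factor of 3 are the defaults quoted on the slide, and the calculation assumes a partially filled last chunk only occupies its actual size.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide
REPLICATION = 3                # default number of replicas per chunk


def chunk_layout(file_size_bytes, chunk_size=CHUNK_SIZE, replication=REPLICATION):
    """Return chunk count, replica count and raw bytes stored for one file."""
    # Ceiling division: a partially filled last chunk still counts as a chunk.
    chunks = (file_size_bytes + chunk_size - 1) // chunk_size
    return {
        "chunks": chunks,
        "replicas": chunks * replication,
        # Assumption: the last chunk only stores the bytes it actually holds.
        "raw_bytes": file_size_bytes * replication,
    }


one_gib = chunk_layout(1024 ** 3)
small = chunk_layout(200 * 1024 * 1024)
print(one_gib)
print(small)
```

A 1 GiB file therefore becomes 16 chunks and 48 chunk replicas spread over the datanodes, which is why the design favours a small number of large files.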

HDFS: Hadoop Distributed File System

HDFS Design

HDFS Design
- The Namenode manages the filesystem namespace
  - the filesystem tree and the metadata for all the files and directories in the tree: stored persistently on local disk
  - chunk placement on datanodes: reconstructed from datanodes when the system starts
- Single point of failure
  - if the Namenode fails, the filesystem is not usable anymore
  - the metadata can be stored on a remote disk so that the namespace can be reconstructed if the Namenode fails
- Datanodes periodically report the list of chunks they store
- The Namenode front page is at http://namenode-name:50070/
  - basic statistics of the cluster
  - browse the file system
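
The split between persistent namespace and reconstructed chunk placement might be modelled as follows. This is a toy sketch, not HDFS code: the `Namenode` class, its method names, the file path and the chunk ids are all hypothetical. The namespace is passed in as if loaded from disk, while the chunk-to-datanode map starts empty and is filled only from datanode reports.

```python
from collections import defaultdict


class Namenode:
    """Toy model: namespace is persistent, chunk placement is rebuilt."""

    def __init__(self, namespace):
        # Persistent part: file path -> ordered list of chunk ids.
        self.namespace = namespace
        # Volatile part: chunk id -> set of datanodes holding a replica.
        self.chunk_locations = defaultdict(set)

    def receive_report(self, datanode, chunk_ids):
        """Process the periodic chunk list sent by one datanode."""
        # Drop stale entries for this datanode, then record its current list.
        for holders in self.chunk_locations.values():
            holders.discard(datanode)
        for chunk_id in chunk_ids:
            self.chunk_locations[chunk_id].add(datanode)

    def locate(self, path):
        """Return each chunk of a file with the datanodes that store it."""
        return [(c, sorted(self.chunk_locations.get(c, ())))
                for c in self.namespace[path]]

    def under_replicated(self, target=3):
        """List chunks with fewer known replicas than the target."""
        known = {c for chunks in self.namespace.values() for c in chunks}
        return sorted(c for c in known
                      if len(self.chunk_locations.get(c, ())) < target)


# After a restart only the namespace is known; reports fill in the rest.
nn = Namenode({"/logs/a.log": ["c1", "c2"]})
nn.receive_report("dn1", ["c1", "c2"])
nn.receive_report("dn2", ["c1"])
nn.receive_report("dn3", ["c1", "c2"])
print(nn.locate("/logs/a.log"))
print(nn.under_replicated())
```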

HDFS File-based data structures
- SequenceFiles: data structure for binary key-value pairs

HDFS configuration
- fs.default.name, set to hdfs://localhost/
  - the default HDFS port is 8020
  - HDFS clients use this property to work out where the namenode is running so they can connect to it
- dfs.replication
  - chunk replication, by default set to 3
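
To make the role of `fs.default.name` concrete, the following sketch shows how a client could derive the namenode address from it. It does not use any Hadoop API: `namenode_address` and the `config` dictionary are hypothetical stand-ins for the real configuration files, and the port falls back to the default of 8020 quoted above when the URI does not name one.

```python
from urllib.parse import urlparse

DEFAULT_HDFS_PORT = 8020  # default port quoted on the slide


def namenode_address(config):
    """Work out (host, port) of the namenode from fs.default.name."""
    uri = urlparse(config["fs.default.name"])
    if uri.scheme != "hdfs":
        raise ValueError("fs.default.name is not an hdfs:// URI")
    # No explicit port in the URI: fall back to the default HDFS port.
    return uri.hostname, uri.port or DEFAULT_HDFS_PORT


# Hypothetical in-memory stand-in for the configuration properties.
config = {"fs.default.name": "hdfs://localhost/", "dfs.replication": "3"}
address = namenode_address(config)
replication = int(config["dfs.replication"])
print(address, replication)
```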

HDFS command line interface

The Hadoop MapReduce framework

Some Hadoop Terminology
- Job: a program that executes map and reduce processing across a data set
- Task: an execution of a Mapper or a Reducer on a slice of data, also called Task-In-Progress (TIP)
- Task Attempt: a particular instance of an attempt to execute a task on a machine

Hadoop Internals
- Hadoop uses its own RPC protocol
- Communications are initiated by TaskTracker nodes
  - heartbeat mechanism
- A single JobTracker per cluster accepts Job requests from clients
  - a Job is a Java jar file plus an XML file containing program configuration options
  - the Job client places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
- A single TaskTracker instance runs on each slave node
  - the TaskTracker forks a separate Java process for task instances
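
The heartbeat-driven interaction can be sketched in a few lines. This is a simplified model, not Hadoop's RPC protocol: the classes reuse the names JobTracker and TaskTracker from the slide, but their methods (`submit_job`, `heartbeat`, `tick`) and the task naming are assumptions, tasks are plain strings instead of forked Java processes, and every task is assumed to finish within one heartbeat round. The point it illustrates is that the JobTracker never contacts a TaskTracker; work is handed out only in the reply to a heartbeat.

```python
from collections import deque


class JobTracker:
    """Single coordinator: queues tasks and answers heartbeats."""

    def __init__(self):
        self.pending = deque()  # tasks waiting for a free slot
        self.running = {}       # task name -> tracker that runs it

    def submit_job(self, job_name, num_tasks):
        for i in range(num_tasks):
            self.pending.append(f"{job_name}-task{i}")

    def heartbeat(self, tracker, free_slots, finished):
        """Record finished tasks and reply with newly assigned ones."""
        for task in finished:
            self.running.pop(task, None)
        assigned = []
        while self.pending and len(assigned) < free_slots:
            task = self.pending.popleft()
            self.running[task] = tracker
            assigned.append(task)
        return assigned


class TaskTracker:
    """One per worker node: initiates all communication."""

    def __init__(self, name, slots, job_tracker):
        self.name = name
        self.slots = slots
        self.job_tracker = job_tracker
        self.active = []

    def tick(self):
        """One heartbeat round: report finished tasks, ask for new ones."""
        # Assumption: every active task finished since the last heartbeat.
        finished, self.active = self.active, []
        self.active = self.job_tracker.heartbeat(self.name, self.slots, finished)
        return self.active


jt = JobTracker()
jt.submit_job("wc", 3)
tt1 = TaskTracker("tt1", 2, jt)
tt2 = TaskTracker("tt2", 2, jt)
first = tt1.tick()
second = tt2.tick()
third = tt1.tick()
print(first, second, third, jt.running)
```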