Large Scale Processing with Hadoop


Large Scale Processing with Hadoop
William Palmer
Some slides courtesy of Per Møldrup-Dalum (State and University Library, Denmark) and Sven Schlarb (Austrian National Library)
SCAPE Information Day, British Library, UK, 14th July 2014

Large Scale Processing Methodologies
Traditional:
- One large, central processing capability
- One or more central storage instances
- Data stored away from the processor
- Paradigm: move the data to the processor
Hadoop:
- Many smaller commodity computers/CPUs
- Storage capacity in all computers, federated together
- Easily expandable
- Paradigm: move the processor to the data

The New York Times + Hadoop on Amazon Web Services
Example:
- 11 million articles (1851-1980) needed to be converted to PDF
- 4TB of TIFF data
- 24 hours of wall time to complete the migration
- Cost: $240 (not including bandwidth)
http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-supercomputing-fun/
http://cse.unl.edu/~byrav/infocom2011/workshops/papers/p1099-xiao.pdf

Hadoop Ecosystem: The Zoo
[Diagram: the Hadoop ecosystem, built around HDFS (providing data locality) and MapReduce]

MapReduce
[Diagram: high-level view of the MAP and REDUCE phases]

MapReduce in detail
[Diagram: the input is divided into input splits, each handled by a map task; map outputs are shuffled, sorted and merged, then passed to reducers, which write the final output]
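To make that pipeline concrete, here is a minimal sketch of a map and reduce pair in Java, using the classic word-count example and Hadoop's org.apache.hadoop.mapreduce API. This is illustrative only, not the SCAPE code:

// Minimal word-count sketch; illustrative, not the SCAPE workflow code.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: each input split is fed to a mapper, one record at a time.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emitted pairs are shuffled and sorted by key
      }
    }
  }

  // Reduce phase: all values for one key arrive together after the merge.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}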

Hadoop In Action
- Designed for processing text
- Capacity can be reduced/expanded
- Comes with the HDFS filesystem, with federation and redundancy (three copies of data by default; see the sketch below)
- Using commodity hardware, node failures are expected; a node being down should not affect the cluster
- Data locality is considered when distributing computation: data is processed where it is stored, reducing the need to transfer it
- Very large community and ecosystem
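As a hedged illustration of the redundancy point: replication is a per-file property with a cluster-wide default (the dfs.replication setting, 3 unless overridden), and it can be inspected and changed through the standard org.apache.hadoop.fs API. A minimal sketch, assuming a reachable cluster; the path is a placeholder:

// Sketch: inspecting and overriding HDFS replication for one file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml/hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]); // an existing HDFS file
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication: " + current); // typically 3

    fs.setReplication(file, (short) 2); // ask HDFS to keep only two copies
  }
}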

(Obligatory) Hadoop Screenshots

14/02/13 11:22:33 INFO gzchecker.GZChecker: Loading paths...
14/02/13 11:22:36 INFO gzchecker.GZChecker: Setting paths...
14/02/13 11:22:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
14/02/13 11:22:39 INFO mapred.FileInputFormat: Total input paths to process : 1
14/02/13 11:22:40 INFO mapred.JobClient: Running job: job_201401131502_0058
14/02/13 11:22:41 INFO mapred.JobClient: map 0% reduce 0%

Hadoop In Action
- We are using Hadoop/MapReduce for parallelisation: a non-standard use case
- As a parallelisation method there are associated costs, but you get a lot of well-supported features for free: HDFS, administration, support
- Once a MapReduce program is developed, scalability just happens: you can theoretically prototype on a Raspberry Pi and run on a 3000-node cluster
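A sketch of why "scalability just happens": a hypothetical driver for the mapper/reducer sketched earlier. Nothing in the job definition mentions cluster size, so the same jar runs unchanged on one node or three thousand:

// Hypothetical driver wiring up the WordCount sketch above.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner handles the generic options the JobClient warning above refers to
    System.exit(ToolRunner.run(new WordCountDriver(), args));
  }
}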

Hadoop In Action
Do I have to copy data to HDFS for processing? (A copy sketch follows this list.)
- 1TB of data took 8 hours to copy from NAS to HDFS
- Image format migration (TIFF to JP2) took ~57 hours, and the data still had to be copied back to the NAS
What if I don't?
- The same image format migration code, accessing/posting data directly from/to the repository, took ~58 hours
- No copying of data before/after, and more efficient when the processing time per file is large relative to the transfer time
- This won't necessarily hold for different preservation actions (see: the "small files problem")
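For reference, the staging steps discussed above can be done from the hadoop fs command line or from Java. A minimal sketch using the standard org.apache.hadoop.fs API; the paths are hypothetical examples, and timings will of course depend on the network and storage involved:

// Sketch: staging data into HDFS before a job and retrieving results after.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStaging {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // NAS -> HDFS before the job runs (the 8-hour step for 1TB above)
    fs.copyFromLocalFile(new Path("/mnt/nas/tiffs"), new Path("/user/scape/tiffs"));

    // ... run the migration job ...

    // HDFS -> NAS after the job completes
    fs.copyToLocalFile(new Path("/user/scape/jp2s"), new Path("/mnt/nas/jp2s"));
  }
}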

Hadoop at The British Library
Two Hadoop clusters:
Digital Preservation Team Cluster
- Virtualised hardware
- 1 management node, 1 master node, 28 worker nodes (1 core/1 CPU, 6GB RAM each)
- 14TB raw storage, 5TB usable @ replication factor of 3 (usable ≈ raw ÷ replication factor)
- Cloudera Hadoop (CDH4)
- For testing/R&D
Web Archiving Team Cluster
- Physical hardware
- 80 nodes (8 cores/2 CPUs, 16GB RAM)
- 700TB raw storage, 233TB usable @ replication factor of 3
- Cloudera Hadoop (CDH3)
- In production use

SCAPE Workflow Results
TIFF->JP2 migration with QA:
- Single node @ 26 files/hour (with OpenJPEG)
- 28 nodes @ 735 files/hour (with OpenJPEG); 2409 files/hour with Kakadu
Detecting DRM in PDF files:
- 28 nodes @ 51,869 files/hour
Identifying web content:
- 5.3 million files/hour

Other Large Scale Execution Platforms
- SCAPE tools are treated as individual components and should be reusable on other large-scale execution platforms (all tools described today are, at least)
- The British Library Digital Library System (DLS) has a bespoke workflow execution system into which SCAPE tools have been integrated
- Other platforms: GNU Parallel
- Tools can be integrated with your own systems