Analytics in the cloud


Analytics in the cloud: Do we really need to reinvent the storage stack?
R. Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, Renu Tewari
Image courtesy NASA / ESA. 2002 IBM Corporation

Data-Intensive Internet-Scale Applications
Typical applications:
- Web-scale search, indexing, mining
- Genomic sequencing
- Brain-scale network simulations

Data-Intensive Internet-Scale Applications
Key requirements:
- Scale to very large data sets; the platform needs to scale to 1000s of nodes
- Built of commodity hardware for cost efficiency
- Tolerate failures during every job execution
- Support data shipping to reduce network requirements

MapReduce for analytics
- MapReduce is emerging as a model for large-scale analytics applications
- Important design goals are extreme scalability and fault tolerance
- The storage layer is separated and has well-defined requirements
Image source: http://developer.yahoo.com/hadoop/tutorial/module1.html
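As a reminder of the model the slides refer to, here is a minimal in-memory sketch of MapReduce (word count). The function names are illustrative only; a real Hadoop job would implement Mapper/Reducer classes and run over a distributed store rather than a Python list.

```python
# Minimal in-memory sketch of the MapReduce model (word count).
# Illustrative only: a real job runs map tasks near the data,
# shuffles over the network, and runs reduce tasks in parallel.
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs from each input record."""
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine each key's list of values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

records = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts["the"])  # 2
```

The storage layer sits below the map phase: it must deliver input chunks to map tasks efficiently, which motivates the requirements listed next.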

MapReduce data-store requirements
- Provide a hierarchical namespace with directories and files
- Allow applications to read/write data to files
- Protect data availability and reliability in the face of node and disk failures
- Provide high-bandwidth access to reasonably sized chunks of data for all compute nodes (not necessarily all-to-all)
- Provide chunk access-affinity information to allow proper scheduling of tasks
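The last requirement is worth unpacking: the scheduler can only place a task on a node that holds the data if the store exposes which nodes those are. The hypothetical sketch below (all names invented for illustration; Hadoop's actual scheduler is more elaborate) shows the core of data-local placement:

```python
# Hypothetical sketch of affinity-aware scheduling: given a map of
# chunk -> replica hosts (exposed by the data store), prefer a free
# node that already holds the chunk; otherwise fall back to a remote
# read. Names are illustrative, not Hadoop's real API.
def schedule_task(chunk, replica_hosts, free_nodes):
    """Return a node for the task, preferring data-local placement."""
    local = [n for n in free_nodes if n in replica_hosts.get(chunk, ())]
    if local:
        return local[0]   # data-local: no network transfer needed
    return free_nodes[0]  # remote read as a last resort

replicas = {"chunk-0": {"nodeA", "nodeB"}}
print(schedule_task("chunk-0", replicas, ["nodeC", "nodeB"]))  # nodeB
```

A store that hides block locations forces every read through the network, which is exactly what the "data shipping" requirement tries to avoid.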

Data store options: Cluster FS vs. Specialized FS

                                   Specialized FS   Cluster FS
  Scaling                          Yes              Yes
  Commodity hardware compliant     Yes              Yes
  Traditional application support  No               Yes
  Mature management tools          No               Yes
  Tuned for Hadoop                 Yes              No

Modifying a Cluster Filesystem for MapReduce

GPFS:
- Mature filesystem with many large production installations
- High performance, highly scalable
- Reliability features focused on SAN environments
- Supports rack-aware 2-way replication
- POSIX interface
- Supports shared-disk (SAN) and shared-nothing setups

Not optimized for MapReduce workloads:
- Does not expose data location information
- Largest block size = 16 MB

Changes for Hadoop:
- Make blocks bigger
- Let the platform know where the big blocks are
- Optimize replication and placement to reduce network usage

Key change: Metablocks
- Works for many workloads: small FS blocks (e.g. 512K) coexist with large application blocks (e.g. 64M)
- New allocation scheme: metablock-size granularity for wide striping
- New allocation policy: the block map operates on the large metablock size, while all FS operations operate on the small regular block size
- Additional changes to provide block location information and write affinity
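The arithmetic behind the metablock idea can be sketched directly, using the example sizes from the slide (512 KB FS blocks, 64 MB metablocks); the helper names are invented for illustration, not GPFS internals:

```python
# Back-of-the-envelope sketch of the metablock scheme: the block map
# (used for allocation and location reporting) works at metablock
# granularity, while regular FS operations keep the small block size.
# Sizes are the slide's examples; helper names are illustrative.
FS_BLOCK = 512 * 1024          # small filesystem block (512 KB)
METABLOCK = 64 * 1024 * 1024   # large application-visible block (64 MB)

def fs_blocks_per_metablock():
    """How many small FS blocks one metablock spans."""
    return METABLOCK // FS_BLOCK

def metablock_of(offset):
    """Metablock index covering a given file offset."""
    return offset // METABLOCK

print(fs_blocks_per_metablock())        # 128
print(metablock_of(100 * 1024 * 1024))  # offset 100 MB -> metablock 1
```

Reporting one location per metablock (128 contiguous FS blocks here) gives Hadoop the coarse-grained affinity it expects, without forcing small traditional I/O to use 64 MB blocks.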

MapReduce performance

[Bar chart: run times (sec, 0-2500) for Grep, TeraGen, and TeraSort, comparing GPFS rack=1, HDFS rack=1, GPFS rack=2, HDFS rack=2]

Test bed:
- iDataPlex: 42 nodes, 8 cores and 8 GB RAM each, 4+1 disks per node
- Hadoop version 0.18.1; GPFS version pre-3.3
- 16 nodes, 160 GB data (replication factor = 2)

Impact on traditional workloads

[Bar chart: normalized random and sequential-read performance for 512K blocks, 512K with metablocks, and 16M blocks; Bonnie filesystem benchmark]

Test bed: iDataPlex, 42 nodes, 8 cores and 8 GB RAM each, 4+1 disks per node; GPFS version pre-3.3

Things that didn't work
- Large filesystem block size
- Turning off prefetching
- Creating alignment of records to block boundaries

[Bar charts: run time (sec, 0-800) for HDFS, GPFS16M, GPFS16ML, GPFS16MLA, GPFS16MLANP; normalized random and sequential-read performance for 512K, 16M, and 16M with no prefetch]

Advantages of traditional filesystems
- Traditional filesystems have solved many hard problems, such as access control, quotas, and snapshots
- Allow traditional and MapReduce applications to share the same input data
- Exploit filesystem tools and scripts built for regular filesystems
- Re-use backup/archive solutions built around particular filesystems
- Simplify mixed analytics pipelines:
  - Using a MapReduce-specific filesystem (e.g. HDFS): Crawl -> Load -> Analyze -> Output -> Serve (the crawler writes to a traditional filesystem, data is loaded into the MapReduce filesystem, and results are copied back to a traditional filesystem)
  - Using a general-purpose filesystem (e.g. GPFS): Crawl -> Analyze -> Serve

Conclusion
- MapReduce platforms can use traditional filesystems without loss of performance.
- There are important reasons why traditional filesystems are attractive to users of MapReduce.