Welcome to the New Era of Cloud Computing


Welcome to the New Era of Cloud Computing
Aaron Kimball

The web is replacing the desktop.

SDKs and toolkits are there, but what about the backend? (Image: Wikipedia user Calyponte)

Two key concepts: processing 1000x more data doesn't have to be 1000x harder, and cycles and bytes, not hardware, are the new commodity.

Cloud computing is (the engineering definition): providing services on virtual machines allocated on top of a large physical machine pool.

Cloud computing is (the business definition): a method to address scalability and availability concerns for large-scale applications.

Cloud computing is (the big picture): democratized distributed computing.

Outline: Introduction; Large-Scale Data Processing; Cluster Management; Conclusions.

Outline (expanded): Introduction; Large-Scale Data Processing (MapReduce, the Google File System, BigTable, Hadoop); Cluster Management; Conclusions.

What could you do with 1000x as much data and CPU power? Scale your business to 1000x more users, gather statistics about every user click, improve recommendation systems, model better price-plan choices, or simulate 1,000,000 users of a system.

Why can't you do it? Large-scale cluster reliability is hard: machines fail, hard drives crash, the network goes down, software has bugs. Gremlins?

Two tasks for two types of people: designing reliable, scalable system architecture, and writing data processing algorithms.

An old problem (image: Bill Bertram). At a larger scale (image: Google, Inc.).

MapReduce, developed at Google in 2003, separates the problem of performing computation reliably from the problem of computing over large data. It provides a reliable, scalable platform on which to develop distributed applications and simplifies the programming model for distributed application writers: data-flow programming.

Example use: build an index. Algorithm: (1) for each page, get a list of all words on the page; (2) for each word from step (1), get the list of pages on which it appears.

Map: map(in_key, in_value) -> (out_key, intermediate_value)
Reduce: reduce(out_key, intermediate_value list) -> out_value

Building the index (diagram).

Index: MapReduce

map(pageName, pageText):
    foreach word w in pageText:
        emitIntermediate(w, pageName);
    done

reduce(word, values):
    foreach pageName in values:
        AddToOutputList(pageName);
    done
    emitFinal(formattedPageListForWord);

Index: Data Flow (diagram).
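
For concreteness, here is a minimal Hadoop (Java) version of the same index job. This is a sketch rather than code from the talk: it assumes each input line holds a page name and the page text separated by a tab (the format KeyValueTextInputFormat expects), and the class names are invented for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

  // map(pageName, pageText): emit (word, pageName) for every word on the page
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text pageName, Text pageText, Context ctx)
        throws IOException, InterruptedException {
      for (String word : pageText.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          ctx.write(new Text(word), pageName);          // emitIntermediate(w, pageName)
        }
      }
    }
  }

  // reduce(word, pageNames): collect the page list and emit it once per word
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> pageNames, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder pages = new StringBuilder();
      for (Text page : pageNames) {
        pages.append(page.toString()).append(' ');      // AddToOutputList(pageName)
      }
      ctx.write(word, new Text(pages.toString().trim())); // emitFinal(...)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inverted index");
    job.setJarByClass(InvertedIndex.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class); // key = page name, value = page text
    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A typical invocation would be something like: hadoop jar index.jar InvertedIndex pages/ index-out/ (paths here are placeholders).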

Other applications: classifying instances and machine learning; log analysis (click streams, user trends); image processing; a scalable implementation of your backend business logic.

The Google File System. The distributed file system is what makes MapReduce work: data is spread evenly throughout the cluster and replicated 3x for redundant storage, and a master machine detects failures and rebalances data on the fly.

Putting it together: active storage. Data is automatically distributed to nodes at load time, and processing is automatically parallel: data elements are processed locally, in parallel.

Distributed data, single volume: output data is written to local disks and forms a single user-accessible volume.

A self-healing system: loss of nodes causes an automatic data rebalance.
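
Hadoop's HDFS exposes the same active-storage ideas through an ordinary file system API. Below is a minimal sketch (not from the talk; the path and record are invented) that writes a file with the 3x block replication just described and reads it back, leaving block placement and re-replication to the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ActiveStorageDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);                   // the cluster-wide single volume
    Path logFile = new Path("/data/clicks/day-001.log");    // hypothetical path

    // Write with a block replication factor of 3; the namenode chooses the datanodes.
    try (FSDataOutputStream out =
             fs.create(logFile, true, 4096, (short) 3, fs.getDefaultBlockSize(logFile))) {
      out.writeBytes("user=42 url=/products/123\n");
    }

    // Read it back; blocks are fetched from whichever replicas are alive.
    try (FSDataInputStream in = fs.open(logFile)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}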

Adding structure. GFS provides storage for very large files, but structure within files is still slow to use, and organized database storage is still valuable. How do you get scalable storage and performance?

What does a database do? It stores large amounts of data: individual records are not too large, and many records are stored per table. It searches over this data (SQL queries): individual records via an index (fast), plus complicated joins of related information. It supports consistent modifications (transactions).

Web-scale database problems: the number of records gets very large, so the data must be split across several machines and stored reliably and redundantly. How do you keep the copies consistent? You must have fast access to individual elements and still be able to search the data.

BigTable: a Google database layer. Key idea: divorce organized storage from the query system.

Fast single-element access: high-speed lookup of an individual (row, column) selects the data needed by online applications.

BigTable as a MapReduce input: each row is an input record to MapReduce, so MapReduce jobs can sort, search, index, and query data in bulk.
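
On the open-source side, the BigTable role is played by HBase, which appears later in this talk. A rough sketch of the (row, column) access pattern, with the table, column family, and row key invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PageStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table pages = conn.getTable(TableName.valueOf("pages"))) {   // hypothetical table

      // Write one cell: row key = URL, column family "content", qualifier "html".
      Put put = new Put(Bytes.toBytes("com.example/index.html"));
      put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
      pages.put(put);

      // Fast single-element access: look up exactly the (row, column) an online app needs.
      Result row = pages.get(new Get(Bytes.toBytes("com.example/index.html")));
      byte[] html = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
      System.out.println(Bytes.toString(html));
    }
  }
}

For bulk work, HBase can also feed each row to a MapReduce job as an input record (for example via TableMapReduceUtil), matching the "BigTable as a MapReduce input" pattern above.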

BigTable conclusions: BigTable provides large-scale storage, and MapReduce provides the query backend. It stores an immense amount of structured data and uses the Google File System for reliability. Tradeoff: a non-standard data access model.

Hadoop: open source MapReduce + GFS + BigTable. A Java implementation, free for commercial use, and an official Apache Software Foundation project.

Broad industry adoption and flexible interoperability: C++/Python via Pipes and Streaming; distributed database: HBase; machine learning: Mahout; performance tracking: Ganglia; development environment: Eclipse; hosting: Amazon Web Services.

Benefits of Hadoop: open source MapReduce, GFS, and BigTable; clean programming abstractions that allow rapid prototyping of large-scale computation; a scalable active-storage infrastructure where programs written for 10 GB of data work for 10 TB; and reliable storage of dynamic data through 3x replication and continuous monitoring.

Outline: Introduction; Large-Scale Data Processing; Cluster Management (Clouds & Virtualization, Amazon Web Services, Rightscale, AWS + Hadoop); Conclusions.

Constrained resources: large-data applications are I/O bound. Hard drive speed is the limiting factor, availability of RAM is a second-level concern, and raw CPU speed is not usually a factor.

Typical hardware profile: commodity server-class machines, chosen for performance per watt or per dollar: 4 processor cores (not the fastest available), 4-8 GB of RAM, and 2x 500 GB SATA hard drives.

Traditional service architecture: expensive upfront and maintenance costs; requires many types of resources (machines, power, cooling, bandwidth); does not scale on demand; not easily reconfigurable. Adding virtualization helps with this.

What is a cloud? A virtualized server pool that reconfigures to provide different service profiles on demand; the individual node providing a service is unimportant.

! " Clouds: high-level &' ( # $ %%% # # )# " Server architecture 23

Amazon Web Services: EC2 (Elastic Compute Cloud), S3 (Simple Storage Service), SQS (Simple Queue Service), and SDB (Simple Database). aws.amazon.com

Elastic Compute Cloud provides on-demand processing power, from $0.10 per instance-hour plus bandwidth, as virtual machine images with dynamic or static IP addresses.
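
As a sketch of what "on-demand processing power" looks like programmatically, the fragment below launches one instance from a stored machine image using the AWS SDK for Java (a newer interface than the SOAP and command-line tools described later in this talk). The AMI ID and instance type are placeholders.

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;

public class LaunchNode {
  public static void main(String[] args) {
    AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient(); // credentials from the environment

    // Ask for one on-demand instance built from a stored machine image (AMI).
    RunInstancesRequest request = new RunInstancesRequest()
        .withImageId("ami-0123456789abcdef0")   // placeholder AMI ID
        .withInstanceType("m1.small")
        .withMinCount(1)
        .withMaxCount(1);

    RunInstancesResult result = ec2.runInstances(request);
    String instanceId = result.getReservation().getInstances().get(0).getInstanceId();
    System.out.println("Launched " + instanceId);
  }
}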

EC2 templates: an EC2 node configuration is stored as an Amazon Machine Image (AMI). Hundreds of stock AMIs are available: generic Linux distributions, Hadoop configurations, and web server or database installations. Existing AMIs can be customized and saved for later reuse.

Simple Storage Service: virtually infinite storage capacity at $0.15 per GB-month plus bandwidth, with free high-speed access to and from EC2 nodes. It provides a permanence layer when EC2 nodes aren't running.
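
A minimal sketch of using S3 as that permanence layer from Java (AWS SDK for Java; the bucket name, keys, and local paths are placeholders):

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class PersistResults {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Park a result file in S3 before the EC2 node that produced it is shut down.
    s3.putObject("my-results-bucket", "index/part-00000", new File("/tmp/part-00000"));

    // Later (possibly from a different node), pull it back down.
    s3.getObject(new GetObjectRequest("my-results-bucket", "index/part-00000"),
                 new File("/tmp/part-00000.restored"));
  }
}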

Data security: secure data transfer via SSL, and encrypted storage is possible in S3.

Simple Queue Service: an efficient, reliable load-distribution layer, paid for by the message.
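
A sketch of the pay-per-message pattern with SQS (AWS SDK for Java; the queue name and message body are invented):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class WorkQueue {
  public static void main(String[] args) {
    AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    String queueUrl = sqs.createQueue("resize-jobs").getQueueUrl();  // placeholder queue name

    // Producer: a frontend node enqueues a unit of work.
    sqs.sendMessage(queueUrl, "photo-123.jpg");

    // Consumer: any worker node can pick the message up and delete it when done.
    for (Message m : sqs.receiveMessage(queueUrl).getMessages()) {
      System.out.println("processing " + m.getBody());
      sqs.deleteMessage(queueUrl, m.getReceiptHandle());
    }
  }
}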

Simple Database Service: a high-availability, large-store database that provides a simple SQL-like language and is designed for interactive/online use.

System reliability: geographic server diversity through availability zones.
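
A sketch of SimpleDB's SQL-like select language (AWS SDK for Java; the domain, item, and attributes are invented for illustration):

import java.util.Arrays;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.CreateDomainRequest;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.SelectRequest;

public class UserDirectory {
  public static void main(String[] args) {
    AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();
    sdb.createDomain(new CreateDomainRequest("users"));    // hypothetical domain

    // Store one item with a couple of attributes.
    sdb.putAttributes(new PutAttributesRequest("users", "user-42",
        Arrays.asList(new ReplaceableAttribute("city", "Seattle", true),
                      new ReplaceableAttribute("plan", "pro", true))));

    // Query with the simple SQL-like language, suitable for interactive/online use.
    for (Item item : sdb.select(new SelectRequest(
             "select * from `users` where city = 'Seattle'")).getItems()) {
      System.out.println(item.getName());
    }
  }
}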

Broad existing user base. Highlight: SmugMug, a Flickr-style online photo hosting service, is stored entirely on S3; they do not own any hard drives!

A web service? Resource provisioning is accessed via SOAP and scriptable client-side tools, but it is not actually exposed on the web by AWS.

Web services for the rest of us: a web-based interface to AWS with provisioning commands, monitoring and status, and parameterizable startup scripts. www.rightscale.com

(Rightscale demo.)

Automatic scalability: the real power of Rightscale is in monitoring. It scales instances based on demand and provides backend reliability in a virtual environment.

MySQL setup in EC2/S3 (diagram).

Google App Engine: an integrated cloud computing platform offering compute resources with GFS- and BigTable-backed storage. Limited preview now; check back soon!

Using AWS with Hadoop: Hadoop runs on EC2 nodes, with templates defined for several versions. Permanent results are storable in S3, and an EC2-served frontend retrieves the results, published via a static IP address.

Putting it together: FlickrProducts (architecture diagram).
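
As a sketch of how these pieces connect, the indexing driver from earlier can point its input and output at S3 rather than HDFS, so the data survives after the EC2 cluster is shut down. The bucket names are placeholders, AWS credentials are assumed to be configured in the Hadoop configuration, and the s3n:// scheme of that era has since been superseded by s3a:// in recent Hadoop releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexOnEc2 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "inverted index on EC2");
    job.setJarByClass(IndexOnEc2.class);
    // ... mapper/reducer/input-format setup as in the earlier InvertedIndex sketch ...

    // Input pages and permanent results live in S3, so they outlast the EC2 cluster.
    FileInputFormat.addInputPath(job, new Path("s3n://my-input-bucket/pages/"));
    FileOutputFormat.setOutputPath(job, new Path("s3n://my-results-bucket/index/"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}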

FlickrProducts tour: images in context on the map.

Top tags listed per city; recommendations from Amazon.

Demo technical points: public Amazon and Flickr APIs for input data; the Hadoop pipeline is entirely in Java; the frontend is MySQL + PHP + JavaScript; the full system is hosted on AWS, with the EC2 VM image and database backed by S3 and managed through the Rightscale dashboard. Conception-to-demo time: two days.

Conclusions: powerful new abstractions for large-scale data processing systems that are scalable, reliable, and available, and that support rapid development. Large managed server pools are available with low overhead; they eliminate management headaches and grow and shrink according to need.

The landscape is changing. Questions?

Aaron Kimball, aaron@spinnakerlabs.com