Distributed Data Management. Christoph Lofi Institut für Informationssysteme Technische Universität Braunschweig


Distributed Data Management Christoph Lofi Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

12.0 The Cloud 12.1 Map & Reduce 12.2 Cloud beyond Storage 12.3 Computing as a Service SaaS PaaS IaaS Distributed Data Management Christoph Lofi IfIS TU Braunschweig 2

12.1 Map & Reduce
- Just storing massive amounts of data is often not enough; often, we also need to process and transform that data
- Large-scale data processing: use thousands of worker nodes within a computation cluster to process large data batches, without the hassle of managing things
- Map & Reduce provides:
  - Automatic parallelization and distribution
  - Fault tolerance
  - I/O scheduling
  - Monitoring and status updates

12.1 Map & Reduce
- Initially implemented by Google for building the Google search index, i.e. crawling the web, building the inverted word index, computing PageRank, etc.
- General framework for parallel high-volume data processing
  - J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, Symp. Operating System Design and Implementation, San Francisco, USA, 2004
- Also available as an open source implementation as part of Apache Hadoop: http://hadoop.apache.org/mapreduce/

12.1 Map & Reduce
Base idea:
- There is a large amount of input data, identified by keys, i.e. the input is given as key-value pairs
  - e.g. all web pages of the internet, identified by their URL
- A map operation is a simple function which accepts one input key-value pair
  - A map operation runs as an autonomous thread on one single node of a cluster
  - Many map jobs can run in parallel on different input keys
  - Returns, for a single input key-value pair, a set of intermediate key-value pairs: map(key, value) → set of intermediate (key, value) pairs
  - After a map job is finished, the node is free to perform another map job for the next input key-value pair
- A central controller distributes map jobs to free nodes

12.1 Map & Reduce
- After the input data is mapped, reduce jobs can start
- reduce(key, values) is run for each unique key emitted by map()
  - Each reduce job is also run autonomously on one single node
  - Many reduce jobs can run in parallel on different intermediate key groups
- Reduce emits the final output of the map-reduce operation
  - Each reduce job takes all map tuples with a given key as input
  - Usually generates one, but possibly more, output tuples

12.1 Map & Reduce
- Each reduce is executed on a set of intermediate map results which have the same key
- To efficiently select that set, the intermediate key-value pairs are usually shuffled, i.e. sorted and grouped by their respective key
- After shuffling, the reduce input data can be selected by a simple range scan
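The shuffle step can be illustrated in a few lines of Python. This is only a single-process sketch: the function name `shuffle` and the sample pairs are assumptions for illustration, and a real framework sorts and groups across many machines.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(intermediate_pairs):
    """Sort intermediate (key, value) pairs by key and group the values per key."""
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    # After sorting, all pairs with the same key are adjacent, so one
    # sequential pass (a "range scan") is enough to collect each group.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical map output for illustration.
pairs = [("db", 1), ("and", 1), ("db", 1), ("p2p", 1)]
grouped = shuffle(pairs)
```

Each entry of `grouped` is exactly the input of one reduce call: a key together with all values emitted for it.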

12.1 Map & Reduce
Example: Counting words in documents

  map(key, value):
    // key: doc name
    // value: text of doc
    for each word w in value:
      emit(w, 1);

  reduce(key, values):
    // key: a word
    // values: list of counts
    result = 0;
    for each v in values:
      result += v;
    emit(key, result);

12.1 Map and Reduce
Example: Counting words in documents
- doc1: distributed db and p2p
- doc2: map and reduce is a distributed processing technique for db
- map(key, value) emits: distributed 1, db 1, and 1, p2p 1, map 1, and 1, reduce 1, is 1, a 1, distributed 1, ...
- reduce(key, values) emits: distributed 2, db 2, and 2, p2p 1, map 1, reduce 1, is 1, ...
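The word count example can be run end to end with a small in-memory driver. This is a sketch under simplifying assumptions: `run_mapreduce` is a hypothetical single-process stand-in for the framework, and words are split on whitespace only.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Emit (word, 1) for every word in the document text."""
    for word in text.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    yield (word, sum(counts))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs.items():
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    results = {}
    for inter_key in sorted(groups):
        for out_key, out_value in reduce_fn(inter_key, groups[inter_key]):
            results[out_key] = out_value
    return results


docs = {
    "doc1": "distributed db and p2p",
    "doc2": "map and reduce is a distributed processing technique for db",
}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

For the two documents above, `word_counts` contains 2 for "distributed", "db", and "and", and 1 for every other word.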

12.1 Map and Reduce
Improvement: Combiners
- Combiners are mini-reducers that run in memory after the map phase
- Used to group rare map keys into larger groups
  - e.g. word counts: group multiple extremely rare words under one key (and mark that they are grouped)
- Used to reduce network and worker scheduling overhead
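A common use of combiners, as described in the MapReduce paper, is local pre-aggregation on the map worker. Note that this is a slightly different technique from the grouping of rare keys mentioned above; the sketch below only shows how local summing shrinks the data sent over the network. The function name `map_with_combiner` is an assumption for illustration.

```python
from collections import Counter


def map_with_combiner(doc_name, text):
    """Count words locally, so each distinct word is emitted only once per document."""
    local_counts = Counter(text.split())
    return sorted(local_counts.items())


text = "a b a a b"
pairs_plain = [(word, 1) for word in text.split()]  # plain map output
pairs_combined = map_with_combiner("doc", text)     # map output after combining
```

Here the plain map output has five pairs, while the combined output has only two, and the reduce function (a sum) still produces the same final counts.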

12.1 Map & Reduce
Responsibilities of the map and reduce master (often also called scheduler):
- Assign map and reduce tasks to workers on nodes
  - Usually, map tasks are assigned to worker nodes as a batch and not one by one
  - Such a batch is often called a split, i.e. a subset of the whole input data
  - The split is often implemented by a simple hash function with as many buckets as worker nodes
  - The full split data is assigned to a worker node, which starts a map task for each input key-value pair
- Check for node failure
- Check for task completion
- Route map results to reduce tasks
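Splitting the input by a hash function with one bucket per worker might look as follows. This is a minimal sketch: `assign_splits` and the sample keys are hypothetical, and CRC32 is chosen here only because it is deterministic across runs (Python's built-in `hash()` is randomized for strings).

```python
import zlib


def assign_splits(input_keys, num_workers):
    """Assign every input key to one of num_workers splits via a hash function."""
    splits = [[] for _ in range(num_workers)]
    for key in input_keys:
        bucket = zlib.crc32(key.encode("utf-8")) % num_workers
        splits[bucket].append(key)
    return splits


# Hypothetical input keys (e.g. URLs of web pages).
keys = [f"http://example.org/page{i}" for i in range(20)]
splits = assign_splits(keys, num_workers=4)
```

Every key lands in exactly one split, and the same key always lands in the same split, so the master can hand each split to one worker as a batch.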

12.1 Map & Reduce
Map and Reduce overview (figure)

12.1 Map and Reduce
- The master is responsible for worker node fault tolerance
  - Handled via re-execution
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion is committed through the master
- Robust: once lost 1600 of 1800 machines, and the job still finished fine
- Master failures are not handled
  - Unlikely due to redundant hardware
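The re-execution rules can be sketched as a small piece of master-side bookkeeping. All names here (`Task`, `failed_workers`, `reschedule`, the timeout value) are assumptions for illustration, not the actual implementation. Completed map tasks are reset as well because, in the design described in the MapReduce paper, their output lives on the failed worker's local disk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker currently or last assigned


def failed_workers(last_heartbeat, now, timeout):
    """Workers whose last heartbeat is older than the timeout count as failed."""
    return {w for w, seen in last_heartbeat.items() if now - seen > timeout}


def reschedule(tasks, dead):
    """Reset all tasks that must be re-executed after the given workers failed."""
    for task in tasks:
        if task.worker not in dead:
            continue
        lost_map = task.kind == "map" and task.state in ("in_progress", "completed")
        lost_reduce = task.kind == "reduce" and task.state == "in_progress"
        if lost_map or lost_reduce:
            task.state, task.worker = "idle", None
    return tasks


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w2"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
]
dead = failed_workers({"w1": 0.0, "w2": 95.0}, now=100.0, timeout=10.0)
reschedule(tasks, dead)
```

After the call, both the completed map task and the in-progress reduce task of worker `w1` are idle again and can be handed to another worker, while the completed reduce task is left untouched.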

12.1 Map and Reduce
Showcase: machine usage during web indexing
- Fine-granularity tasks: number of map tasks >> number of machines
  - Minimizes time for fault recovery
  - Can pipeline shuffling with map execution
  - Better dynamic load balancing
- The showcase uses 200,000 map and 5,000 reduce tasks, running on 2,000 machines

12.1 MR - Performance
(Figure slides)

12.1 MR - PageRank
- PageRank is one of the major algorithms behind Google Search
  - See our wonderful IR lecture (No. 12)!
- Key question: how important is a given website?
  - Importance is independent of the query
- Idea: other pages vote for a site by linking to it (also called giving credit to it)
  - Pages with many votes are probably important
  - If an important site votes for another site, that vote has a higher weight than when an unimportant site votes
- (Figure: pages t1, t2, t3 linking to page x)

12.1 MR - PageRank
Given a page x with in-bound links t_1, ..., t_n, where
- C(t) is the out-degree of t
- α is the probability of a random jump
- N is the total number of nodes in the graph

$$PR(x) = \alpha \frac{1}{N} + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$

12.1 MR - PageRank
Properties of PageRank:
- Can be computed iteratively
- Effects at each iteration are local
Sketch of algorithm:
- Start with seed PR_i values
- Each page distributes its PR_i credit to all pages it links to
- Each target page adds up the credit from multiple in-bound links to compute PR_{i+1}
- Iterate until the values converge

12.1 MR - PageRank
- Map step: distribute PageRank credit to link targets
- Reduce step: gather up PageRank credit from multiple sources to compute the new PageRank value
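One PageRank iteration expressed as a map and a reduce step could look like this. It is a single-process sketch under stated assumptions: the tiny graph, the value α = 0.15, the fixed number of 50 iterations, and all function names are chosen for illustration, and every page is assumed to have at least one out-link.

```python
from collections import defaultdict


def pagerank_map(page, rank, out_links):
    """Map step: split the page's rank evenly among its link targets."""
    share = rank / len(out_links)
    for target in out_links:
        yield (target, share)


def pagerank_reduce(page, credits, alpha, num_pages):
    """Reduce step: combine the random-jump term with the gathered credit."""
    return alpha * (1.0 / num_pages) + (1.0 - alpha) * sum(credits)


def pagerank_iteration(ranks, graph, alpha=0.15):
    credits = defaultdict(list)
    for page, out_links in graph.items():
        for target, share in pagerank_map(page, ranks[page], out_links):
            credits[target].append(share)
    n = len(graph)
    return {page: pagerank_reduce(page, credits[page], alpha, n) for page in graph}


# Hypothetical graph: t1, t2, t3 link to x; the remaining links are made up.
graph = {"t1": ["x"], "t2": ["x", "t3"], "t3": ["x"], "x": ["t1"]}
ranks = {page: 1.0 / len(graph) for page in graph}
for _ in range(50):
    ranks = pagerank_iteration(ranks, graph)
```

Because every page in this graph has out-links, the ranks keep summing to 1, and page `x`, which collects credit from three pages, ends up with the highest rank.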

12.1 MR - Performance
Turbo-charging Map and Reduce
- Naïve approach for implementing Map and Reduce: move data to workers
  - Have a cluster of computation nodes: a master and multiple workers
  - Master has access to all data
  - Master splits the data and assigns map tasks
  - Master transfers input data to workers
  - Map results are somehow transferred to reduce workers (directly? pipelined? via the master?)
- In short: a lot of data shipping is necessary

12.1 MR - Performance
Location-aware file system approach
- Rely on a distributed file system like GFS or HDFS
  - Or even on higher layers like Bigtable or HBase
  - All those systems are especially designed for increased Map and Reduce performance
- Idea: each processing node runs a GFS chunk server and a Map & Reduce worker
  - Input data is stored in large chunks in GFS
  - Start a worker task which uses a local chunk as batch map input
  - Read sequentially through the local chunk
  - GFS as well as Bigtable are optimized for sequential scans

12.1 MR - Performance
- Map workers sequentially append intermediate key-value pairs to another chunk (local or remote)
  - GFS as well as Bigtable are optimized for append operations
- Reduce workers also scan through local chunks as input and append results to a local or remote chunk
- The file system is responsible for distributing data
- Very easy scheduling for the master: just assign local data to workers
- Fault tolerant (data loss improbable)

12.2 The Cloud

12.2 The Cloud
- The term cloud computing is often seen as a successor of client-server architectures
- Often used as a synonym for centralized, on-demand, pay-what-you-use provisioning of general computation resources
  - e.g. compared to utility providers like electric power grids or water supply
  - Computing as a commodity
- Cloud is used as a metaphor for the Internet
  - Users or applications just use computation resources provided in the internet instead of using local hardware or software

12.2 The Cloud
Computation resources can mean a lot of things:
- Dynamic access to raw metal
  - Raw storage space or CPU time
  - Fully operational servers are provided by the cloud
- Low-level services and platforms
  - e.g. runtime platforms like the Java JRE
    - User can run applications directly on the cloud platform
    - No own servers or platform software needed
  - e.g. abstracted storage space, like space within a database or a file system
    - This is what we did in the last weeks!

12.2 The Cloud
- Software services, i.e. some functionality required by user software is provided by the cloud
  - Used via web service remote procedure calls
  - e.g. delegate the rendering of a map in a user application to Google Maps
- Full software functionality
  - e.g. rented web applications replacing traditional server or desktop applications
  - e.g. rent CRM software online from Salesforce, use Google Apps instead of MS Office, etc.

12.2 The Cloud
Underlying base problem: successfully running IT departments and IT infrastructure can be very difficult and expensive for companies
- High fixed costs
- Acquiring and paying competent IT staff
  - Competent staff is often very hard to get
- Buying and maintaining servers
- Correctly hosting hardware
  - Proper power and cooling facilities, network connections, server racks, etc.
- Buying and maintaining software

12.2 The Cloud
Load and utilization issues
- How many hardware resources are required by each application and/or service?
- How to handle scaling issues?
  - What happens if demand increases or declines?
  - How to handle spike loads (the Digg effect)?
- Traditional data centers are notoriously underutilized, often idle 85% of the time
  - Over-provisioning for future growth or spikes
  - Insufficient capacity planning and sizing
  - Improper understanding of scalability requirements
  - etc.

12.2 The Cloud
Cloud computing centrally unifies computation resources and provides them on demand
- The degree of centralization and provision may differ
  - Centralize hardware within a department? A company? A number of companies? Globally?
  - Provide resources only to oneself? To some partners? To anybody?
- How to compensate for resource usage?
  - Provide resources by a rental model (e.g. monthly fee)?
  - Provide resources metered on a what-is-used basis (e.g. similar to electricity or water)?
  - Provide resources for free?

12.2 The Cloud
Usually, three types of clouds are distinguished:
- Public Cloud
- Private Cloud
- Hybrid Cloud

12.2 The Cloud
Public Cloud
- Traditional cloud computing
  - Services and resources are offered via the internet to anybody willing to pay for them
  - User just pays for services; usually no acquisition, administration, or maintenance of hardware or software is necessary
- Services are usually provided by off-site third-party providers
  - Open for use by the general public
  - Exist beyond the firewall, fully hosted and managed by the vendor
  - Customers are individuals, corporations, and others
  - e.g. Amazon Web Services and Google AppEngine
- Offers startups and SMBs quick setup, scalability, flexibility, and automated management
  - The pay-as-you-go model helps startups to start small and go big
- Security and compliance?
  - Reliability and privacy concerns hinder the adoption of the cloud
  - Amazon S3 services were down for 6 hours in 2010
  - What will Amazon do with all the data?

12.2 The Cloud
Private Cloud
- Cloud computing hardware is within the premises of a company, behind the corporate firewall
  - Resources are only provided internally for various departments
- Private clouds are still fully bought, built, and maintained by the company using them
  - But usually not exclusive to single departments!
  - Still, costs could be prohibitive and may by far exceed those of public clouds
- Fine-grained control over resources
- More secure, as they are internal to the organization
- Schedule and reshuffle resources based on business demands
- Ideal for apps with tight security and regulatory requirements
- Development requires hardware investments and in-house expertise

12.2 The Cloud
Hybrid Cloud
- Both private and public cloud services, or even non-cloud services, are used or offered simultaneously
- State of the art for most companies relying on cloud technology

12.2 The Cloud
Properties promised by cloud computing
- Agility
  - Resources are quickly available when needed, i.e. servers need not be ordered and built, software doesn't need to be configured and installed, etc.
- Costs
  - Capital expenditure is converted to operational expenditure
- Independence
  - Services are available everywhere and for any device

12.2 The Cloud
- Multi-tenancy
  - Resources are shared by a larger pool of users
  - Resources can be centralized, which reduces costs
  - Load distribution of users differs, so peak loads can usually be distributed
  - Overall utilization and efficiency of resources is better
- Reliability
  - Most cloud services promise durable and reliable resources due to distribution and replication
- Scalability
  - If a user needs more resources or performance, they can easily be provisioned

12.2 The Cloud
- Low maintenance
  - Cloud services or applications are not installed on users' machines, but maintained centrally by specialized staff
- Transparency and metering
  - Costs for computation resources are directly visible and transparent
  - Pay-what-you-use models
- Cloud computing generally promises to be beneficial for fast-growing startups, SMBs, and enterprises alike
  - Cost-effective solutions to key business demands
  - Improved overall efficiency

12.2 The Cloud
- The cloud heavily encourages a self-service model
  - Users can simply request the resources they need
- (Figure from cloudscaling.com)

12.3 XaaS
Anything-as-a-Service (XaaS = X as a service)
- In general, cloud providers offer any computation resource as a service
- In the long run, all computation needs of a company should be modeled, provided, and used as a service
  - e.g. in Amazon's private and public cloud infrastructures: everything is a service!

12.3 XaaS
- Services provide a strictly defined functionality with certain guarantees
- Service description and service-level agreement (SLA)
  - The service description explains what is offered by the service
  - The SLA further clarifies the provisioning guarantees, often: performance, latency, reliability, availability, etc.

12.3 XaaS
Usually, three main resources may be offered as a service:
- Software as a Service (SaaS)
- Platform as a Service (PaaS)
- Infrastructure as a Service (IaaS)
(Figure: layer stack of Client, Application, Platform, Infrastructure, Server)

12.3 XaaS
- Application services (services on demand)
  - Gmail, Google Calendar
  - Payroll, HR, CRM, etc.
  - SugarCRM, IBM Lotus Live
- Platform services (resources on demand)
  - Middleware, integration, messaging, information, connectivity, etc.
  - Amazon AWS, Boomi, CastIron, Google AppEngine
- Infrastructure as services (physical assets as services)
  - IBM Blue House, VMware Cloud Edition, Amazon EC2, Microsoft Azure Platform, ...

12.3 XaaS
(Figure: individuals, corporations, and non-commercial users access the cloud. Cloud middleware handles storage, OS, network, and service (apps) provisioning as well as SLA monitoring, security, billing, and payment, on top of the resources: services, storage, network, OS.)

12.3 IaaS
Infrastructure as a Service (IaaS)
- Provides raw computation infrastructure, i.e. usually a virtual server
  - e.g. see hardware virtualization (VMware & co.)
  - Successor to dedicated server rental
- For the user, a virtual server is similar to a real server
  - Has CPU cores, main memory, hard disk space, etc.
  - Usually provided as a self-service raw machine
  - User is responsible for installing and maintaining software, e.g. operating system, databases, or server software
  - User does not need to buy, host, or maintain the actual hardware

12.3 IaaS
- The IaaS provider can host multiple virtual servers on a single real machine
  - Often, 10-30 virtual servers per real server
- Virtualization is used to abstract server hardware for virtual servers
  - Virtual systems are also often called virtual machines (neutral term) or appliances (usually suggesting preinstalled OS and software)
  - Virtualization of hardware is usually handled by a so-called hypervisor, e.g. Xen, KVM, VMware, Hyper-V, ...

12.3 IaaS
In short, IaaS is virtualization on multiple hardware machines
- Normal server: 1 machine with one OS
- Traditional virtualization: 1 machine hosting multiple virtual servers
- Distributed application: 1 appliance running on multiple machines
- IaaS: multiple machines running multiple virtual servers, with dynamic load balancing between machines

                    1 machine                     many machines
many appliances     Traditional virtualization    IaaS
1 appliance         Normal server                 Distributed appliance

12.3 IaaS
The hypervisor is responsible for allocating available resources to VMs
- Dispatch VMs to machines
- Relocate VMs to balance load
- Distribute resources: network adaptors, logical disks, RAM, CPU cores, etc.
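Dispatching VMs to machines is essentially a placement problem. The sketch below uses a simple first-fit heuristic, which is an assumption made for illustration and not how any particular hypervisor decides; the function name `place_vms` and the machine and VM sizes are hypothetical.

```python
def place_vms(vms, machines):
    """First-fit placement.

    vms:      list of (vm_name, cores, ram_gb)
    machines: dict of machine_name -> (total_cores, total_ram_gb)
    Returns a dict vm_name -> machine_name (or None if no machine has room).
    """
    free = {name: list(capacity) for name, capacity in machines.items()}
    placement = {}
    for vm_name, cores, ram_gb in vms:
        for machine, (free_cores, free_ram) in free.items():
            if cores <= free_cores and ram_gb <= free_ram:
                # Reserve the resources on the first machine that fits.
                free[machine] = [free_cores - cores, free_ram - ram_gb]
                placement[vm_name] = machine
                break
        else:
            placement[vm_name] = None  # no capacity left anywhere
    return placement


machines = {"m1": (8, 32), "m2": (8, 32)}
vms = [("web", 4, 16), ("db", 6, 24), ("cache", 4, 16), ("batch", 8, 32)]
placement = place_vms(vms, machines)
```

In this example `web` and `cache` share `m1`, `db` goes to `m2`, and `batch` cannot be placed, which also illustrates why a VM cannot grow beyond the size of the underlying hardware.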

12.3 IaaS
- Usually, virtual machines offered by IaaS infrastructures cannot grow arbitrarily big
  - Usually capped by the actual server size or a smaller server group
- Really big applications are usually deployed in so-called Pods
  - Similar to database shards
  - A group of machines running one or multiple appliances
  - Machines within a Pod are very tightly networked

12.3 IaaS
- Each Pod is a full copy of the given virtual machines, with full OS and application installed
- Usually, there are multiple copies of a given Pod (and its VMs)
  - Each Pod is responsible for a disjoint part of the whole workload
- Pods are usually scattered across availability zones (e.g. data centers or a certain rack)
  - Physically separated, usually with own power, network, etc.

12.3 IaaS
IaaS Pods (figure from CloudScaling.com)

12.3 IaaS
Simplified Pod example: Google Mail
- Multiple Pods, each Pod running on multiple machines with a full and independent installation of the Gmail software
- A load balancer decides during user log-in which Pod will handle the user session
  - Users are distributed across Pods
- Pods are flexible by using the shared GFS file system

12.3 IaaS
- Mission-critical applications should be designed such that they run in multiple availability zones on multiple Pods
- The cloud control system (CCS) is responsible for distribution and replication

12.3 IaaS
Pod architectures
- Each Pod consists of multiple machines with mainboards, CPUs, and main memory
- Question: where to put secondary storage?
- Usually, three options:
  - Storage area network (SAN)
  - Direct attached storage (DAS)
  - Network attached storage (NAS)
- Or: a storage service (e.g. GFS & co.)

12.3 IaaS
SAN Pods
- Individual servers don't have their own secondary storage
- A storage area network provides shared hard disk storage for all machines of a Pod
- Pro
  - All machines have access to the same data
  - Allows for dynamic load balancing or migration of appliances, e.g. VMware vMotion
- Con
  - Very expensive
  - Higher latency than direct attached storage

12.3 IaaS
SAN Pods (figure)

12.3 IaaS
DAS Pods
- Each server has its own set of hard drives
  - Accessing data from other servers may be difficult
- Pro
  - Cheap
  - Low latency for accessing local data
- Con
  - Usually, no shared data access
  - Usually, difficult to live-migrate appliances (due to no shared data)
- But: by using clever storage abstractions, common problems can be circumvented
  - Use a distributed file system or a distributed data store!
  - e.g. Amazon S3 & SimpleDB, Google GFS & Bigtable, Apache HBase & HDFS, etc.

12.3 IaaS
DAS Pods (figure)

12.3 Amazon EC2
IaaS example: Amazon EC2
- Elastic Compute Cloud is one of the core services of the Amazon cloud infrastructure
- Public IaaS cloud
  - Customers may rent virtual servers hosted at Amazon's data centers
  - Can freely install OS and applications as needed
- Virtual servers are offered in different sizes and are paid by CPU usage
- Basic storage is offered within the VM, but usually applications use additional storage services, which cost extra
  - e.g. S3, SimpleDB, or DynamoDB

12.3 Amazon EC2
Example: t2.micro
- 1.0 GB memory
- 1 vCPU (1 virtual core)
  - 1 vCPU is roughly one 2.5 GHz Xeon core
- No dedicated storage
  - Has to use AWS network storage
- Burstable performance: 6 CPU credits per hour
  - 1 CPU credit = 1 minute of full CPU performance
- Costs $0.013 per hour, about $9.30 per month
- Usually, many users start with the small instance; it is also heavily used for testing
(From July 2010)
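The credit figures above imply a sustainable baseline CPU share, which a few lines of arithmetic make explicit. This is a sketch that uses only the numbers from the slide; the 30-day month and the neglect of any cap on accumulated credits are simplifying assumptions.

```python
CREDITS_PER_HOUR = 6      # from the slide
MINUTES_PER_CREDIT = 1    # 1 credit = 1 minute of full CPU performance
PRICE_PER_HOUR = 0.013    # USD, from the slide

# Share of each hour the instance can run at full CPU speed in the long run.
baseline_share = CREDITS_PER_HOUR * MINUTES_PER_CREDIT / 60

# Cost of running the instance around the clock for an assumed 30-day month.
monthly_cost_30_days = PRICE_PER_HOUR * 24 * 30


def burst_minutes_after(hours_idle):
    """Minutes of full CPU available after idling (ignores any credit cap)."""
    return CREDITS_PER_HOUR * MINUTES_PER_CREDIT * hours_idle
```

The baseline works out to 10% of a core, and the 30-day cost to roughly $9.36, close to the monthly figure quoted above.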

12.3 Amazon EC2
Example: m3.xlarge
- 15 GB memory
- 4 vCPUs, for a total of 13 ECU (Elastic Compute Units)
  - 1 ECU is roughly equal to a 1.5 GHz Xeon core
- 80 GB instance storage on SSD
  - More storage via AWS
- Costs $0.28 per hour, about $201 per month

12.3 Amazon EC2
Example: i2.8xlarge
- 244 GB memory
- 32 vCPUs, for a total of 104 ECU
- 6400 GB of instance storage on SSD
- Costs $6.82 per hour, about $4910 per month

12.3 Amazon EC2
Rough estimations (Oct 2009)
- Roughly 40,000 servers
- Uses standard server racks with 16 machines per rack
  - Mostly packed with 2U dual-socket quad-core Intel Xeons (roughly matches the High-Mem Quad XL instance)
  - Uses around eight 500 GB RAID-0 disks
  - Target cost around $2500 per machine on average
- 75% of the machines are in the US, the remainder in Europe and Asia
- Amazon aims at a utilization rate of 75%
- Very rough guesses state that Amazon may earn $25,264 per hour with EC2!
  - http://cloudscaling.com/blog/cloud-computing/amazons-ec2-generating-220m-annually
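The hourly guess can be cross-checked against the annual figure in the linked article's title. The sketch only multiplies out the numbers quoted above; it assumes revenue is constant over a 365-day year.

```python
HOURLY_REVENUE_USD = 25_264   # rough guess quoted above
NUM_SERVERS = 40_000          # rough estimate quoted above
HOURS_PER_YEAR = 24 * 365

annual_revenue_usd = HOURLY_REVENUE_USD * HOURS_PER_YEAR
revenue_per_server_hour = HOURLY_REVENUE_USD / NUM_SERVERS

print(f"~${annual_revenue_usd / 1e6:.0f}M per year")  # prints "~$221M per year"
```

The result of about $221M per year is consistent with the "220m annually" in the URL, and corresponds to roughly $0.63 per server and hour.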

12.3 PaaS
Platform as a Service (PaaS)
- Provides software platforms on demand
  - e.g. runtime engines (Java VM, .NET runtime, etc.), storage systems (distributed file systems or databases), web services, communication services, etc.
- PaaS systems are usually used to develop and host web applications or web services
  - User applications run on the provided platform
- In contrast to IaaS, no installation and maintenance of operating system and server applications is necessary
  - Centrally managed and maintained
  - Services or runtimes are directly usable

12.3 Google AppEngine
Google AppEngine provides users with a managed Python or Java runtime
- Web applications can be hosted directly in AppEngine
  - Just upload your WAR file and you are done
- Users are billed by resource usage
  - Some free resources are provided every day
  - 1 GB in- and out-traffic, 6.5 hours CPU, 500 MB storage overall

Resource            Unit        Unit cost
Outgoing bandwidth  GB          $0.12
Incoming bandwidth  GB          $0.10
CPU time            CPU hours   $0.10
Stored data         GB / month  $0.15
Recipients emailed  recipients  $0.0001
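
Usage-based billing with a free daily quota can be sketched as follows. The billing rule used here (subtract the free quota, then multiply the remainder by the unit cost) and the reading that the 1 GB of free traffic applies to each direction separately are assumptions; the slide gives only the prices and the quotas. Storage and email are left out because they are billed per month and per recipient rather than per day.

```python
# Unit costs and free daily quotas taken from the slide (USD).
UNIT_COST_USD = {"outgoing_gb": 0.12, "incoming_gb": 0.10, "cpu_hours": 0.10}
FREE_PER_DAY = {"outgoing_gb": 1.0, "incoming_gb": 1.0, "cpu_hours": 6.5}


def daily_bill(usage):
    """Charge only for usage above the free daily quota (assumed rule)."""
    total = 0.0
    for resource, used in usage.items():
        billable = max(0.0, used - FREE_PER_DAY[resource])
        total += billable * UNIT_COST_USD[resource]
    return total


usage = {"outgoing_gb": 5.0, "incoming_gb": 2.0, "cpu_hours": 10.0}
print(f"${daily_bill(usage):.2f}")
```

For this example day the bill is 4 GB x $0.12 + 1 GB x $0.10 + 3.5 h x $0.10 = $0.93, and an application that stays within the quotas pays nothing.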

12.3 Google AppEngine
Each application can access system resources up to a fixed maximum
- AppEngine is not fully scalable!
AppEngine max values (2010)
- CPU: 1730 CPU hours per day; 72 CPU minutes per minute
- Data in or out: 1 TB per day; 10 GB per minute
- Requests: 43M web service calls per day; 30K calls per minute
- Data storage: no limit (uses BigTable, which can scale in size!)
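
The daily and per-minute limits can be compared by spreading each daily limit evenly over the day. This is plain arithmetic on the slide's numbers; treating 1 TB as 1000 GB is an assumption.

```python
MINUTES_PER_DAY = 24 * 60


def per_minute_average(daily_limit):
    """Average per-minute rate if a daily limit is used up evenly."""
    return daily_limit / MINUTES_PER_DAY


cpu_minutes = per_minute_average(1730 * 60)   # CPU hours per day -> CPU minutes
requests = per_minute_average(43_000_000)     # web service calls per day
data_gb = per_minute_average(1000)            # 1 TB per day, assumed 1000 GB

print(f"CPU: {cpu_minutes:.2f} CPU minutes per minute")
print(f"Requests: {requests:.0f} calls per minute")
print(f"Data: {data_gb:.2f} GB per minute")
```

For CPU and requests the per-minute limit is essentially the daily limit divided by 1440, so those workloads cannot burst much. For data, the 10 GB per minute limit is far above the daily average of about 0.7 GB per minute, which leaves room for short traffic peaks.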

12.3 Amazon SimpleDB
Amazon SimpleDB is a data storage system roughly similar to Google BigTable
- http://aws.amazon.com/simpledb
- Simple table-centric database engine
SimpleDB is directly ready to use
- No user configuration or administration
- Accessible via web service
SimpleDB is highly available, uses flexible schemas, and provides eventual consistency
- Similar to HBase or BigTable

12.3 Amazon SimpleDB
Any application may use SimpleDB for data storage
- A simple web service is provided to interact with SimpleDB
  - Create or delete a table (called a domain)
  - Put and delete rows
  - Query for rows
Users pay for storage, data transfer, and computation time
- 25 hours of computation time (for querying) are free per month
  - Beyond that: $0.154 per machine hour in 2009
  - Beyond that: $0.140 per machine hour in 2014
- 1 GB of data transfer is free per month
  - Beyond that: $0.15 per GB in 2009
  - Beyond that: $0.12 per GB in 2014
- 1 GB of data storage is free per month
  - Beyond that: $0.28 per GB in 2009
  - Beyond that: $0.25 per GB in 2014
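
The operations listed above (create or delete a domain, put and delete rows, query for rows) can be illustrated with a toy in-memory store. This is a hypothetical sketch and not the SimpleDB web service API: the class `ToyDomainStore` and all of its method names are invented for illustration, and it models neither distribution nor eventual consistency.

```python
class ToyDomainStore:
    """In-memory stand-in for a domain/row store; NOT the real SimpleDB API."""

    def __init__(self):
        self.domains = {}  # domain name -> {row key -> attribute dict}

    def create_domain(self, name):
        self.domains.setdefault(name, {})

    def delete_domain(self, name):
        self.domains.pop(name, None)

    def put(self, domain, key, **attributes):
        # Flexible schema: every row may carry its own set of attributes.
        self.domains[domain].setdefault(key, {}).update(attributes)

    def delete(self, domain, key):
        self.domains[domain].pop(key, None)

    def query(self, domain, **conditions):
        # Return the keys of all rows that match every given attribute value.
        return sorted(
            key
            for key, attrs in self.domains[domain].items()
            if all(attrs.get(a) == v for a, v in conditions.items())
        )


store = ToyDomainStore()
store.create_domain("customers")
store.put("customers", "c1", name="Ada", city="Braunschweig")
store.put("customers", "c2", name="Bob")  # this row has no city attribute
print(store.query("customers", city="Braunschweig"))
```

Rows in the same domain do not need the same attributes, which is what "flexible schemas" refers to: the query simply skips rows that lack the requested attribute.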

12.3 SaaS
Software as a Service (SaaS)
- Full applications are offered on demand
- Users just need to consume the software; no installation or maintenance is necessary
- All administrative and maintenance tasks are performed by the cloud provider
  - e.g. hosting physical hardware, maintaining platforms, maintaining software, dealing with security, scalability, etc.

12.3 SalesForce
Salesforce.com
- On-demand CRM software
  - Customer Relationship Management
- Cooperation with Google Apps in early summer
- Provides simple online services for
  - Customer database
  - Lead management
  - Call center
  - Customer portal
  - Knowledge bases
  - Email
  - Collaboration environments
  - Etc.

12.3 SalesForce
Bills per month and user, based on edition

12.3 Google Apps
Google Apps
- Provides standard office applications on demand
  - i.e. targets the lower end of the Microsoft Office customer base
  - Microsoft counters with Office 365
- Google Apps provides
  - Email & groupware
  - Spreadsheets
  - Documents
  - Presentations
  - Online forms
  - Drawings
  - etc.

Next Semester
- Multimedia Databases
- Information Retrieval
- Relational Databases 1

Thank you for your attention!