App Engine MapReduce. Mike Aizatsky 11 May Hashtags: #io2011 #AppEngine Feedback:

Similar documents
Optimizing Your App Engine App

App Engine: Datastore Introduction

Building Scalable Web Apps with Google App Engine. Brett Slatkin June 14, 2008

Developing Solutions for Google Cloud Platform (CPD200) Course Agenda

61A Lecture 36. Wednesday, November 30

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures

Dept. Of Computer Science, Colorado State University

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

Getting the most out of Spring and App Engine!!

Batch Processing Basic architecture

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Clustering Lecture 8: MapReduce

ML from Large Datasets

Map Reduce. Yerevan.

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Designing for the Cloud

Database Applications (15-415)

Programming Models MapReduce

MapReduce. Stony Brook University CSE545, Fall 2016

The Google File System. Alexandru Costan

Developing MapReduce Programs

Scaling App Engine Applications. Justin Haugh, Guido van Rossum May 10, 2011

Map-Reduce. Marco Mura 2010 March, 31th

PaaS Cloud mit Java. Eberhard Wolff, Principal Technologist, SpringSource A division of VMware VMware Inc. All rights reserved

Building Scalable Web Apps with Python and Google Cloud Platform. Dan Sanderson, April 2015

Lecture 12. Lecture 12: The IO Model & External Sorting

Large-Scale Web Applications

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Distributed File Systems II

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo

Developing with Google App Engine

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

A brief history on Hadoop

Extract API: Build sophisticated data models with the Extract API

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Distributed Data Store

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

GFS-python: A Simplified GFS Implementation in Python

Distributed Systems CS6421

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Introduction to MapReduce

Database Systems CSE 414

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016

Batch Inherence of Map Reduce Framework

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

HADOOP FRAMEWORK FOR BIG DATA

CS MapReduce. Vitaly Shmatikov

Introduction to MapReduce (cont.)

GFS: The Google File System

Introduction to Data Management CSE 344

Extreme Computing. NoSQL.

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Distributed Computation Models

Introduction to MapReduce

CSE 444: Database Internals. Lecture 23 Spark

CSE 544: Principles of Database Systems

CS 61C: Great Ideas in Computer Architecture. MapReduce

Google GCP-Solution Architects Exam

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

Scalability. ! Load Balancing. ! Provisioning/auto scaling. ! State Caching Commonly used items of the database are kept in memory. !

Data-Intensive Distributed Computing

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database.

Parallel Computing: MapReduce Jin, Hai

The Evolution of a Data Project

Algorithms for MapReduce. Combiners Partition and Sort Pairs vs Stripes

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

GFS: The Google File System. Dr. Yingwu Zhu

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Google File System (GFS) and Hadoop Distributed File System (HDFS)

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

Data Processing with Apache Beam (incubating) and Google Cloud Dataflow

Distributed Filesystem

Big data systems 12/8/17

Making Immutable Infrastructure simpler with LinuxKit. Justin Cormack

Balanced Trees Part One

2/26/2017. For instance, consider running Word Count across 20 splits

Hadoop and Map-reduce computing

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

Hadoop MapReduce Framework

CS /29/18. Paul Krzyzanowski 1. Question 1 (Bigtable) Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams

Lecture 18: Reliable Storage

EVCache: Lowering Costs for a Low Latency Cache with RocksDB. Scott Mansfield Vu Nguyen EVCache

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Transcription:

App Engine MapReduce Mike Aizatsky 11 May 2011 Hashtags: #io2011 #AppEngine Feedback: http://goo.gl/snv2i

Agenda MapReduce Computational Model Mapper library Announcement Technical bits: Files API User-space shuffling MapReduce & Pipeline API Examples and Demos

MapReduce Computational Model

MapReduce A model to do efficient distributed computing over large data sets. Used at Google for years Every project uses MapReduce!

MapReduce Computational Model

Map Input: user data Output: (key, value) pairs User code

Shuffle Collates value with the same key Input: (key, value) pairs Output: (key, [value]) pairs No user code

Reduce Input: (key, [value]) pairs Output: user data User code

MapReduce Computational Model

Common App Engine Approach Take what works for us at Google Give it to people

App Engine & Google's MapReduce Additional scaling dimension: Lots and lots of applications Many of them will run MapReduce at the same time Isolation: application shouldn't influence performance of the other

App Engine & Google's MapReduce Rate limiting: you don't want to burn all day's resources in 15min and kill your online traffic Very slow execution: free apps want to go really slow, staying under their resource limint Protection: from malicious App Engine users

Mapper

Mapper Library Released at Google I/O 2010 Heavily used by developers outside and inside Google (admin console, new indexer pipeline, etc.) Has seen lots of improvements since

Mapper Library Improvements Control API - start your jobs programmatically (and transactionally) Custom mutation pools - batch work between map function calls Namespaces support - iterate over data in different namespaces or over namespaces themselves Better sharding with scatter indices And more!

Mapper => MapReduce? Storage system for intermediate data: Files API, released in 1.4.3 (March 2011) Shuffler Lots of glue code

Launching Shuffler Functionality In-memory, user-space, task-driven shuffle for small (100Mb) datasets. Trusted testers access to big shuffler. All the integration pieces needed to run your own mapreduce jobs are part of Mapper library. Mapper library => Mapreduce library! Python today, Java soon. http://mapreduce.appspot.com

Examples

Example 1: Word Count # Map def map(line): for w in clean(line).split(): yield (w, '') Zed's dead, baby, Zed's dead! ('zed's', ''), ('dead', ''), ('baby', ''), ('zed's', ''), ('dead', '') # Reduce def reduce(key, values): yield (key, len(values)) ('zed's', ['', '']), ('dead', ['', '']), ('baby', ['']) ('zed's', 2), ('dead', 2'), ('baby', 1)

Demo

Example 2: Inverse Index # Map def map(line, filename): for w in clean(line).split(): yield (w, filename) # Reduce def reduce(key, values): yield (key, list(set(values)))

Demo

MapReduce: Technical Bits

Technical Bits Files API: solution to MapReduce storage problem User-space shuffler

Files API

Mapreduce Storage Mapreduce jobs generate lots of intermediate data. Datastore: expensive, 1MB entity limit Blobstore: read-only Memcache: small, volatile

Files API Familiar, files-like interface to various virtual file systems. Released in 1.4.3, integrated with Mapper library. Considered to be a low-level API.

Files API Files have two states: writable and readable. Start in writable. Moved to readable by "finalization". Can't read writable, can't write to readable. Write is append-only, atomic and fully serializable between concurrent clients. Concrete filesystems might have their own reliability constraints and/or additional APIs.

Blobstore Filesystem Write directly to blobstore. Files can be >2G. Finalized files are durable. Writable files are not (just restart your MapReduce) Can fetch a blob key for finalized files and use blobstore api.

Blobstore Filesystem Python Example from google.appengine.api import files from future import with_statement # Create the file. file_name = files.blobstore.create() # Open the file for append. with files.open(file_name, 'a') as f: f.write('data') # All data is in. Finalize the file. files.finalize(file_name)

Blobstore Filesystem Python Example # Open the file for read. with files.open(file_name, 'r') as f: data = f.read(4) # Fetch blobkey for blobstore api. blob_key = files.blobstore.get_blob_key(file_name)

Mapper Integration # mapreduce.yaml... mapper: output_writer: mapreduce.output_writers. BlobstoreOutputWriters # Handler function def map(entity): yield entity.to_csv_line() + '\n'

Low-level Features Exclusive locks: files can be opened exclusively by a single client only. Sequence keys: each write can have a "sequence key" attached. Our backends make sure that they only increase.

Future Plans "Tempfile" file system: much faster, much cheaper, but not durable, several days of storage only (geared specifically towards MapReduce) Integrations with other Google storage technologies and other reliability guarantees

User-Space Shuffler

User-Space Shuffler Consolidates values for the same key together. [(key, value)] => [(key, [value])] Should be reasonably fast, scalable and efficient. User-space: full source code, no new AppEngine components.

Take 1 Load all data into memory Sort Read sorted array

Take 1

Take 1 Properties Memory-bound No parallelism

Take 2 Sort chunks of data and store them back to Files API Merge-sort all chunks (or merge-read)

Take 2

Take 2 Properties No longer memory-bound Sorting is parallel Merge phase is not parallel Difficult (and slow) to read from too many files

Take 3 Shard mapper output by key hash code Sort each shard into chunks Merge-read each shard

Take 3

Take 3 Properties No longer memory-bound Sorting is parallel Merge phase is now parallel This is the shuffler we release today.

MapReduce & Pipeline

Pipeline API New API to chain complex work together. A glue which holds Mapper + Shuffler + Reducer together. MapReduce library is fully integrated with Pipeline. For in-depth look visit "Large-scale Data Analysis Using the App Engine Pipeline API" talk later today.

More Complex Example

Example 3: Distinguishing Phrases # Map def map(text, filename): for words in ngrams(text): yield (words, filename) # Reduce def reduce(key, values): if len(values) < 10: return for filename, count in count_occurences(values): if count > len(values) / 2: yield (key, filename)

Demo

Summary Small & Medium MapReduce jobs can be run by anyone today! Contact us for getting access to Large MapReduce jobs. http://mapreduce.appspot.com

Questions? Hashtags: #io2011 #AppEngine Feedback: http://goo.gl/snv2i