Lecture 12 DATA ANALYTICS ON WEB SCALE

Source: The Economist, February 25, 2010

The Data Deluge EIGHTEEN months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information flow through its network each day. Now the amount has increased tenfold. During 2009, American drone aircraft flying over Iraq and Afghanistan sent back around 24 years' worth of video footage. New models being deployed this year will produce ten times as many data streams as their predecessors, and those in 2011 will produce 30 times as many. Source: The Economist, February 25, 2010

The Data Deluge Everywhere you look, the quantity of information in the world is soaring. According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Merely keeping up with this flood, and storing the bits that might be useful, is difficult enough. Analysing it, to spot patterns and extract useful information, is harder still. 1 gigabyte = 10^9 bytes; 1 terabyte = 10^12 bytes; 1 petabyte = 10^15 bytes; 1 exabyte = 10^18 bytes. Source: The Economist, February 25, 2010

The Data Deluge on the Web Social Networks (Facebook, Twitter) Web Pages & Links Web Data Deluge Internet e-commerce Web Surfer History & Contextualization (Smartphones)

Why analyze this data? Social Networks (Facebook, Twitter) Web Pages & Links Web Data Deluge Internet e-commerce Web Surfer History & Contextualization (Smartphones)

Why analyze this data? Social Networks (Facebook, Twitter) Web Pages & Links Web Data Deluge Internet e-commerce Web Search Index Construction Web Surfer History & Contextualization (Smartphones)

Why analyze this data? Social Networks (Facebook, Twitter) Web Pages & Links Web Data Deluge Internet e-commerce Web Search Index Construction Web Surfer History & Contextualization (Smartphones) Recommender systems & advertising

Why analyze this data? Sentiment analysis E-marketing Click Stream Analysis Social Networks (Facebook, Twitter) Web Pages & Links Web Data Deluge Internet e-commerce Web Search Index Construction Web Surfer History & Contextualization (Smartphones) Recommender systems & advertising

Is this really problematic? Although it is hard to accurately determine the size of the Web at any point in time, it is safe to say that it consists of hundreds of billions of individual documents and that it continues to grow. If we assume the Web to contain 100 billion documents, with an average document size of 4 KB (after compression), the Web is about 400 TB. The size of the resulting inverted (web search) index depends on the specific implementation, but it tends to be on the same order of magnitude as the original repository. Source: The Datacenter as a Computer, Luiz André Barroso (Google), Jimmy Clidaras (Google), Urs Hölzle (Google)
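A quick back-of-the-envelope check of this estimate (a minimal Python sketch; the 100 billion documents and the 4 KB average size are the assumptions quoted above):

    docs = 100e9                # assumed number of documents on the Web
    avg_doc_size = 4e3          # assumed average document size in bytes (compressed)
    print(docs * avg_doc_size)  # 4e14 bytes = 400 TB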

Limitations of Traditional Data Analytics Architecture [Diagram: Data Collection Processes → Data Storage Infrastructure (storage only, no computation) → Extract-Transform-Load (ETL) Compute Grid → Relational Database System (containing aggregated data only) → Interactive applications + Reporting (on aggregated data only)]

Limitations of Traditional Data Analytics Architecture Moving the data to another infrastructure to perform computation does not scale. [Diagram as above]

Limitations of Traditional Data Analytics Architecture Moving the data to another infrastructure to perform computation does not scale. Impossible to explore RAW data. [Diagram as above]

Google's solutions New programming models and frameworks for distributed and scalable data analysis

Google's solutions New programming models and frameworks for distributed and scalable data analysis:
Google File System: a distributed file system for scalable storage and high-throughput retrieval
MapReduce: a programming model + execution environment for general-purpose distributed batch processing
Pregel: a programming model + execution environment for analysis of graphs
Dremel: a query language for interactive SQL-like analysis of structured datasets

Google's solutions New programming models and frameworks for distributed and scalable data analysis:
Google File System: a distributed file system for scalable storage and high-throughput retrieval
MapReduce: a programming model + execution environment for general-purpose distributed batch processing
Pregel: a programming model + execution environment for analysis of graphs
Dremel: a query language for interactive SQL-like analysis of structured datasets
High-level design described in a series of papers; no implementation available

Google's solutions New programming models and frameworks for distributed and scalable data analysis:
Google File System: a distributed file system for scalable storage and high-throughput retrieval (open source implementation: Apache Hadoop HDFS)
MapReduce: a programming model + execution environment for general-purpose distributed batch processing (open source implementation: Hadoop M/R)
Pregel: a programming model + execution environment for analysis of graphs (open source implementation: Apache Giraph)
Dremel: a query language for interactive SQL-like analysis of structured datasets (open source implementation: Apache Drill)
High-level design described in a series of papers; no implementation available

Rest of this lecture
1. Execution hardware
2. HDFS: Hadoop Distributed File System
3. Map/Reduce

WAREHOUSE SCALE MACHINES

Execution environment - 1997 Just a bunch of computers. The web index fit on a single computer. Redundancy to ensure availability.

Execution environment - 1999 Compute servers consist of multiple CPUs (possibly with multiple cores per CPU) and attached hard disks. Servers are collected in racks. High-speed Ethernet connections (at least 1 Gbps) connect the servers in a rack together. The Google Corkboard rack

Execution environment - currently An image of the Google datacenter in Mons (Belgium). Racks are connected together to central network switches using multi-Gbps redundant links. Access to data from other racks is slower than data on the same computer/same rack. Many racks together form a data center.

Discussion Each compute node is kept simple by design: mid-range computers are much cheaper than powerful high-range computers and consume less energy, but Google has lots of them!

Characteristics Figure Source: The Datacenter as a Computer Capacity: The amount of data we can store per server/rack/datacenter

Characteristics Figure Source: The Datacenter as a Computer Latency: The time it takes to fetch a data item, when asked on local machine/another server on the same rack/another server on a different rack

Characteristics Figure Source: The Datacenter as a Computer Bandwidth: the speed at which data can be transferred to the same machine/another server on the same rack/another server on a different rack

Characteristics Conclusion: Huge storage capacity. Latency between racks = 1/10 latency on rack level = 1/10 latency on server level. Bandwidth between racks = 1/10 bandwidth on rack level = 1/2 to 1/10 bandwidth on server level. Figure Source: The Datacenter as a Computer

What do we gain? Parallelism! Let us consider the maximal aggregate bandwidth: the speed at which we can analyze data in parallel, assuming ideal data distribution over servers & disks.

What do we gain? Parallelism! Let us consider the maximal aggregate bandwidth: the speed at which we can analyze data in parallel, assuming ideal data distribution over servers & disks. Embarrassingly parallel example: count the number of times the word Belgium appears in documents on the Web.

What do we gain? Parallelism! Let us consider the maximal aggregate bandwidth: the speed at which we can analyze data in parallel, assuming ideal data distribution over servers & disks. Embarrassingly parallel example: count the number of times the word Belgium appears in documents on the Web. Each server has multiple CPUs and can read from multiple disks in parallel. As such, each server can analyze many documents in parallel.

What do we gain? Parallelism! Let us consider the maximal aggregate bandwidth: the speed at which we can analyze data in parallel, assuming ideal data distribution over servers & disks. Embarrassingly parallel example: count the number of times the word Belgium appears in documents on the Web. Each server has multiple CPUs and can read from multiple disks in parallel. As such, each server can analyze many documents in parallel. At the end, sum the per-server counters (which can be done very fast).

What do we gain? Parallelism! Let us consider the maximal aggregate bandwidth: the speed at which we can analyze data in parallel, assuming ideal data distribution over servers & disks.
Component / Max Aggregate Bandwidth:
1 hard disk: 100 MB/sec (~1 Gbps)
Server = 12 hard disks: 1.2 GB/sec (~12 Gbps)
Rack = 80 servers: 96 GB/sec (~768 Gbps)
Cluster/datacenter = 30 racks: 2.88 TB/sec (~23 Tbps)
Scanning 400 TB in parallel hence takes 138 secs (about 2.3 minutes). Scanning 400 TB sequentially at 100 MB/sec takes 46.29 days.
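These figures can be reproduced with a short calculation (a minimal sketch; the per-component numbers are taken from the table above):

    disk_bw = 100e6                    # 100 MB/sec per hard disk
    server_bw = 12 * disk_bw           # 12 disks per server  -> 1.2 GB/sec
    rack_bw = 80 * server_bw           # 80 servers per rack  -> 96 GB/sec
    cluster_bw = 30 * rack_bw          # 30 racks             -> 2.88 TB/sec

    web_size = 400e12                  # the 400 TB Web estimate from earlier
    print(web_size / cluster_bw)       # ~139 seconds when scanned in parallel
    print(web_size / disk_bw / 86400)  # ~46 days on a single disk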

The challenge Scalable software development: allow growth without requiring re-architecting of algorithms/applications. Auto-scale.

Google's solutions New programming models and frameworks for distributed and scalable data analysis

Google's solutions New programming models and frameworks for distributed and scalable data analysis:
Google File System: a distributed file system for scalable storage and high-throughput retrieval
MapReduce: a programming model + execution environment for general-purpose distributed batch processing
Pregel: a programming model + execution environment for analysis of graphs
Dremel: a query language for interactive SQL-like analysis of structured datasets

Google's solutions New programming models and frameworks for distributed and scalable data analysis:
Google File System: a distributed file system for scalable storage and high-throughput retrieval
MapReduce: a programming model + execution environment for general-purpose distributed batch processing
Pregel: a programming model + execution environment for analysis of graphs
Dremel: a query language for interactive SQL-like analysis of structured datasets
High-level design described in a series of papers; no implementation available

Google's solutions New programming models and frameworks for distributed and scalable data analysis:
Google File System: a distributed file system for scalable storage and high-throughput retrieval (open source implementation: Apache Hadoop HDFS)
MapReduce: a programming model + execution environment for general-purpose distributed batch processing (open source implementation: Hadoop M/R)
Pregel: a programming model + execution environment for analysis of graphs (open source implementation: Apache Giraph)
Dremel: a query language for interactive SQL-like analysis of structured datasets (open source implementation: Apache Drill)
High-level design described in a series of papers; no implementation available

HDFS: THE HADOOP DISTRIBUTED FILE SYSTEM

HDFS Architecture HDFS has a master/slave architecture Master = NameNode (NN) manages the file system and regulates access to files by clients. Slaves = DataNodes (DN), usually one per server in the cluster, manage storage attached to the server that they run on.

HDFS Architecture HDFS has a master/slave architecture Master = NameNode (NN) manages the file system and regulates access to files by clients. Slaves = DataNodes (DN), usually one per server in the cluster, manage storage attached to the server that they run on. NameNode DataNodes

HDFS Architecture HDFS has a master/slave architecture Master = NameNode (NN) manages the file system and regulates access to files by clients. Slaves = DataNodes (DN), usually one per server in the cluster, manage storage attached to the server that they run on. Files are transparently broken down into blocks (default 64 MB). Blocks are replicated across datanodes. Replication is rack-aware. NameNode DataNodes

HDFS Architecture HDFS has a master/slave architecture Master = NameNode (NN) manages the file system and regulates access to files by clients. Slaves = DataNodes (DN), usually one per server in the cluster, manage storage attached to the server that they run on. Files are transparently broken down into blocks (default 64 MB). Blocks are replicated across datanodes. Replication is rack-aware. [Diagram: NameNode and DataNodes; dat0.txt → blocks 1, 3; dat1.txt → blocks 2, 4, 5]

HDFS Architecture HDFS has a master/slave architecture Master = NameNode (NN) manages the file system and regulates access to files by clients. Slaves = DataNodes (DN), usually one per server in the cluster, manage storage attached to the server that they run on. Files are transparently broken down into blocks (default 64 MB). Blocks are replicated across datanodes. Replication is rack-aware. [Diagram: NameNode and DataNodes; dat0.txt → blocks 1, 3; dat1.txt → blocks 2, 4, 5; the blocks are replicated across the DataNodes]

HDFS Architecture HDFS has a master/slave architecture Master = NameNode (NN) manages the file system and regulates access to files by clients. Slaves = DataNodes (DN), usually one per server in the cluster, manage storage attached to the server that they run on. Files are transparently broken down into blocks (default 64 MB). Blocks are replicated across datanodes. Replication is rack-aware. [Diagram: the NameNode stores MetaData(FileName, Replication Factor, Block Ids), e.g. /users/sv/dat0.txt r:2 {1,3} and /users/sv/dat1.txt r:3 {2, 4, 5}; the blocks of dat0.txt and dat1.txt are replicated across the DataNodes]

HDFS Architecture HDFS has a master/slave architecture Master = NameNode (NN) manages the file system and regulates access to files by clients. Slaves = DataNodes (DN), usually one per server in the cluster, manage storage attached to the server that they run on. Optimized for: Large files, Read throughput, Appending writes. Replication ensures: Durability, Availability, Throughput. [Diagram: NameNode MetaData(FileName, Replication Factor, Block Ids): /users/sv/dat0.txt r:2 {1,3}, /users/sv/dat1.txt r:3 {2, 4, 5}; blocks replicated across the DataNodes]
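To make the block and replication mechanism concrete, here is a minimal sketch (not HDFS code; the rack topology, node names and the simplified placement policy are illustrative assumptions) of splitting a file into 64 MB blocks and placing replicas rack-aware, so that copies end up on more than one rack:

    import math

    BLOCK_SIZE = 64 * 1024 * 1024        # default HDFS block size mentioned above (64 MB)

    def split_into_blocks(file_size_bytes):
        # a file is transparently broken down into fixed-size blocks
        return math.ceil(file_size_bytes / BLOCK_SIZE)

    def place_replicas(block_id, datanodes_by_rack, replication=3):
        # rack-aware placement: first replica on a "local" rack, the remaining
        # replicas on one other rack (topology and node names are assumptions)
        racks = list(datanodes_by_rack)
        local_rack = racks[block_id % len(racks)]
        remote_rack = racks[(block_id + 1) % len(racks)]
        replicas = [datanodes_by_rack[local_rack][0]]
        replicas += datanodes_by_rack[remote_rack][:replication - 1]
        return replicas[:replication]

    topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
    for block in range(1, split_into_blocks(200 * 1024 * 1024) + 1):   # a 200 MB file -> 4 blocks
        print(block, place_replicas(block, topology))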

HDFS Implementation The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. Source: Hadoop documentation. This implies that clients only need to install a JAR file to access HDFS.

Typical HDFS commands
bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm

MAP/REDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS

M/R Computational Model A M/R program (or job) is specified by two functions: map and reduce

M/R Computational Model A M/R program (or job) is specified by two functions: map and reduce. The input key/value pair represents a logical record in the input data source. In the case of a file this could be a line; if the input source is a database table, this could be a record. map: (key1, value1) -> list(key2, value2). A single input key/value pair may result in zero or more output key/value pairs.

M/R Computational Model A M/R program (or job) is specified by two functions: map and reduce The reduce function is called once per unique map output key, and receives a list of all values emitted for that key reduce: (key2, list(value2)) -> list(key3, value3) Like the map function, reduce can output zero to many key/value pairs.

M/R Computational Model For each word occurring in a document on the Web, count the number of occurrences across all documents.
def map(docid, line):
    for word in line.split():
        yield (word, docid)

def reduce(word, list_of_docids):
    yield (word, len(list_of_docids))

M/R: opportunity for parallelism We can spawn multiple copies of the map function (called map tasks) in parallel (at most one for each key/value pair). Likewise, we can spawn multiple copies of the reduce function (called reduce tasks) in parallel (at most one for each unique key output by the map). [Diagram: input → map output → shuffle+sort → reduce input: doc1 → map → (cat, d1), (dog, d1), (turtle, d1); doc2 → map → (cat, d2), (belgium, d2); doc3 → map → (turtle, d3), (dog, d3); after shuffle+sort: cat → [d1, d2], dog → [d1, d3], turtle → [d1, d3], belgium → [d2]; each group is handled by a reduce task]
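The dataflow in this diagram can be simulated on a single machine in a few lines (a minimal sketch; run_job, map_fn and reduce_fn are illustrative names, and a real framework would run the map and reduce tasks on different nodes and perform the shuffle over the network):

    from collections import defaultdict

    def map_fn(docid, line):
        # one (word, docid) pair per word occurrence
        for word in line.split():
            yield (word, docid)

    def reduce_fn(word, docids):
        # the number of collected docids equals the number of occurrences
        yield (word, len(docids))

    def run_job(documents):
        intermediate = defaultdict(list)
        for docid, line in documents.items():          # map phase
            for key, value in map_fn(docid, line):
                intermediate[key].append(value)        # shuffle: group values by key
        results = []
        for key in sorted(intermediate):               # sort keys, then reduce phase
            results.extend(reduce_fn(key, intermediate[key]))
        return results

    print(run_job({"d1": "cat dog turtle", "d2": "cat belgium", "d3": "turtle dog"}))
    # [('belgium', 1), ('cat', 2), ('dog', 2), ('turtle', 2)]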

M/R Execution: Architecture Master = JobTracker accepts jobs, decomposes them into map and reduce tasks, and schedules them for remote execution on child nodes. Slave = TaskTracker accepts tasks from the JobTracker and spawns child processes to do the actual work. Idea: ship computation to data. The JobTracker accepts M/R jobs. If the input is an HDFS file, a map task is created for each block and sent to the node holding that block for execution. Map output is written to local disk. [Diagram: NameNode/JobTracker with jobs Job1 and Job2; DataNodes = TaskTrackers, each holding HDFS blocks and running map and reduce tasks]
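A minimal sketch of the "ship computation to data" idea (illustrative only, not Hadoop's actual scheduler; block_locations stands in for the replica information the JobTracker obtains from the NameNode, and the node names are made up):

    from collections import Counter

    def schedule_map_tasks(block_locations):
        # block_locations: block id -> DataNodes holding a replica of that block
        load = Counter()
        assignment = {}
        for block, nodes in block_locations.items():
            node = min(nodes, key=lambda n: load[n])   # least-loaded node that has the data
            assignment[block] = node
            load[node] += 1
        return assignment

    # replica placement roughly as in the HDFS example above (assumed layout)
    blocks = {1: ["dn1", "dn3"], 2: ["dn1", "dn2", "dn4"],
              3: ["dn3", "dn5"], 4: ["dn4", "dn5"], 5: ["dn2", "dn5"]}
    print(schedule_map_tasks(blocks))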

M/R Execution: Architecture Master = JobTracker accepts jobs, decomposes them into map and reduce tasks, and schedules them for remote execution on child nodes. Slave = TaskTracker accepts tasks from the JobTracker and spawns child processes to do the actual work. Idea: ship computation to data. The JobTracker creates reduce tasks intelligently. Reduce tasks read the map outputs over the network and write their output back to HDFS. [Diagram as above]

M/R Execution: Architecture Master = JobTracker accepts jobs, decomposes them into map and reduce tasks, and schedules them for remote execution on child nodes. Slave = TaskTracker accepts tasks from the JobTracker and spawns child processes to do the actual work. Idea: ship computation to data. Load balancing: the JobTracker monitors for stragglers and may spawn additional copies of a map task on DataNodes that hold a replica of its block. Whichever node completes first is allowed to proceed; the other(s) is/are killed. [Diagram as above]
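The straggler mechanism can be illustrated with a small simulation (a sketch; the delays and node names are made-up assumptions, and in a real cluster the redundant task copy is killed rather than merely ignored):

    import time
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def map_task(copy_name, delay):
        time.sleep(delay)                # simulate a slow node vs. a healthy node
        return copy_name

    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(map_task, "primary copy on straggling node", 5.0)
        backup = pool.submit(map_task, "backup copy on healthy node", 0.5)
        done, not_done = wait([primary, backup], return_when=FIRST_COMPLETED)
        print("using result from:", done.pop().result())
        for f in not_done:
            f.cancel()                   # best effort here; Hadoop actually kills the loser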

M/R Execution: Architecture Master = JobTracker accepts jobs, decomposes them into map and reduce tasks, and schedules them for remote execution on child nodes. Slave = TaskTracker accepts tasks from the JobTracker and spawns child processes to do the actual work. Idea: ship computation to data. Failures: in clusters with 1000s of nodes, hardware failures occur frequently. The same mechanism as for load balancing makes it possible to cope with such failures. [Diagram as above]

References
S. Ghemawat, H. Gobioff, S.-T. Leung. The Google File System. http://research.google.com/archive/gfs-sosp2003.pdf
HDFS architecture: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008.
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. http://kowshik.github.io/jpregel/pregel_paper.pdf
L. A. Barroso, J. Clidaras, U. Hölzle. The Datacenter as a Computer. http://www.morganclaypool.com/doi/abs/10.2200/s00516ed2v01y201306CAC024