CS 6604: Data Mining Large Networks and Time-series. B. Aditya Prakash Lecture #6: Hadoop and Graph Analysis
|
|
- Willa Booth
- 5 years ago
- Views:
Transcription
1 CS 6604: Data Mining Large Networks and Time-series B. Aditya Prakash Lecture #6: Hadoop and Graph Analysis
2 (some slides from Xiao Yu) NO SQL Prakash 20 CS 6604: DM Large Networks & Time-Series 2
3 Why No SQL? Prakash 20 CS 6604: DM Large Networks & Time-Series 3
4 RDBMS The predominant choice in storing data Not so true for data miners since we put much in txt files. First formulated in 969 by Codd We are using RDBMS everywhere Prakash 20 CS 6604: DM Large Networks & Time-Series 4
5 Slide from neo technology, A NoSQL Overview and the Benefits of Graph Databases" Prakash 20 CS 6604: DM Large Networks & Time-Series
6 When RDBMS met Web 2.0 Slide from Lorenzo Alberton, "NoSQL Databases: Why, what and when" Prakash 20 CS 6604: DM Large Networks & Time-Series 6
7 What to do if data is really large? Peta-bytes (exabytes, zeaabytes.. ) Google processed 24 PB of data per day (2009) FB adds 0. PB per day Prakash 20 CS 6604: DM Large Networks & Time-Series
8 BIG data Prakash 20 CS 6604: DM Large Networks & Time-Series 8
9 What s Wrong with RelaKonal DB? Nothing is wrong. You just need to use the right tool. Relaeonal is hard to scale. Easy to scale reads Hard to scale writes Prakash 20 CS 6604: DM Large Networks & Time-Series 9
10 What s NoSQL? The misleading term NoSQL is short for Not Only SQL. non-relaeonal, schema-free, non-(quite)-acid More on ACID transaceons later in class horizontally scalable, distributed, easy replicaeon support simple API Prakash 20 CS 6604: DM Large Networks & Time-Series 0
11 Four (emerging) NoSQL Categories Key-value (K-V) stores Based on Distributed Hash Tables/ Amazon s Dynamo paper * Data model: (global) colleceon of K-V pairs Example: Voldemort Column Families BigTable clones ** Data model: big table, column families Example: HBase, Cassandra, Hypertable *G DeCandia et al, Dynamo: Amazon's Highly Available Key-value Store, SOSP 0 ** F Chang et al, Bigtable: A Distributed Storage System for Structured Data, OSDI 06 Prakash 20 CS 6604: DM Large Networks & Time-Series
12 Four (emerging) NoSQL Categories Document databases Inspired by Lotus Notes Data model: colleceons of K-V Colleceons Example: CouchDB, MongoDB Graph databases Inspired by Euler & graph theory Data model: nodes, relaeons, K-V on both Example: AllegroGraph, VertexDB, Neo4j Prakash 20 CS 6604: DM Large Networks & Time-Series 2
13 Focus of Different Data Models Slide from neo technology, A NoSQL Overview and the Benefits of Graph Databases" Prakash 20 CS 6604: DM Large Networks & Time-Series 3
14 C-A-P theorem" RDBMS NoSQL (most) Pareeon Tolerance Consistency Availability Prakash 20 CS 6604: DM Large Networks & Time-Series 4
15 When to use NoSQL? Bigness Massive write performance Twiaer generates TB / per day (200) Fast key-value access Flexible schema or data types Schema migraeon Write availability Writes need to succeed no maaer what (CAP, pareeoning) Easier maintainability, administraeon and operaeons No single point of failure Generally available parallel compueng Programmer ease of use Use the right data model for the right problem Avoid hiqng the wall Distributed systems support Tunable CAP tradeoffs from hap://highscalability.com/ Prakash 20 CS 6604: DM Large Networks & Time-Series
16 Key-Value Stores id hair_color age height 923 Red Blue 34 NA Table in relaeonal db Store/Domain in Key-Value db Find users whose age is above 8? Find all aaributes of user 923? Find users whose hair color is Red and age is 9? (Join operaeon) Calculate average age of all grad students? Prakash 20 CS 6604: DM Large Networks & Time-Series 6
17 Voldemort in LinkedIn Sid Anand, LinkedIn Data Infrastructure (QCon London 202) Prakash 20 CS 6604: DM Large Networks & Time-Series
18 Voldemort vs MySQL Sid Anand, LinkedIn Data Infrastructure (QCon London 202) Prakash 20 CS 6604: DM Large Networks & Time-Series 8
19 Column Families BigTable like Prakash 20 CS 6604: DM Large Networks & Time-Series 9 F Chang, et al, Bigtable: A Distributed Storage System for Structured Data, osdi 06
20 BigTable Data Model The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. Prakash 20 CS 6604: DM Large Networks & Time-Series 20
21 BigTable Performance Prakash 20 CS 6604: DM Large Networks & Time-Series 2
22 Document Database - mongodb Table in relaeonal db Documents in a colleceon Inieal release 2009 Open source, document db Json-like document with dynamic schema Prakash 20 CS 6604: DM Large Networks & Time-Series 22
23 mongodb Product Deployment Prakash 20 CS 6604: And DM Large much Networks more & Time-Series 23
24 Graph Database Data Model Abstraceon: Nodes Relaeons Properees Prakash 20 CS 6604: DM Large Networks & Time-Series 24
25 Neo4j - Build a Graph Slide from neo technology, A NoSQL Overview and the Benefits of Graph Databases" Prakash 20 CS 6604: DM Large Networks & Time-Series 2
26 A Debatable Performance EvaluaKon Prakash 20 CS 6604: DM Large Networks & Time-Series 26
27 Conclusion Use the right data model for the right problem Prakash 20 CS 6604: DM Large Networks & Time-Series 2
28 MAP-REDUCE AND HADOOP Prakash 20 CS 6604: DM Large Networks & Time-Series 28
29 MapReduce Challenges: How to distribute computaeon? Distributed/parallel programming is hard Map-reduce addresses all of the above Google s computaeonal/data manipulaeon model Elegant way to work with big data Prakash 20 CS 6604: DM Large Networks & Time-Series 29
30 Single Node Architecture CPU Memory Machine Learning, StaKsKcs Classical Data Mining Disk Prakash 20 CS 6604: DM Large Networks & Time-Series 30
31 MoKvaKon: Google Example 20+ billion web pages x 20KB = 400+ TB computer reads 30-3 MB/sec from disk ~4 months to read the web ~,000 hard drives to store the web Takes even more to do something useful with the data! Today, a standard architecture for such problems is emerging: Cluster of commodity Linux nodes Commodity network (ethernet) to connect them Prakash 20 CS 6604: DM Large Networks & Time-Series 3
32 Cluster Architecture Gbps between any pair of nodes in a rack Switch 2-0 Gbps backbone between racks Switch Switch CPU CPU CPU CPU Mem Mem Mem Mem Disk Disk Disk Disk Each rack contains 6-64 nodes Prakash In it was guestimated CS 6604: that DM Large Google Networks had & Time-Series M machines, 32
33 Prakash 20 CS 6604: DM Large Networks & Time-Series 33
34 Large-scale CompuKng Large-scale compukng for data mining problems on commodity hardware Challenges: Prakash 20 How do you distribute computakon? How can we make it easy to write distributed programs? Machines fail: One server may stay up 3 years (,000 days) If you have,000 servers, expect to loose /day People esemated Google had ~M machines in 20,000 machines fail every day! CS 6604: DM Large Networks & Time-Series 34
35 Idea and SoluKon Issue: Copying data over a network takes Kme Idea: Bring computaeon close to the data Store files muleple emes for reliability Map-reduce addresses these problems Google s computaeonal/data manipulaeon model Elegant way to work with big data Storage Infrastructure File system Google: GFS. Hadoop: HDFS Programming model Prakash 20 Map-Reduce CS 6604: DM Large Networks & Time-Series 3
36 Hadoop VS NoSQL Hadoop: compueng framework Supports data-intensive applicaeons Includes MapReduce, HDFS etc. (we will study MR mainly next) NoSQL: Not only SQL databases Can be built ON hadoop. E.g. HBase. Prakash 20 CS 6604: DM Large Networks & Time-Series 36
37 Storage Infrastructure Problem: If nodes fail, how to store data persistently? Answer: Distributed File System: Provides global file namespace Google GFS; Hadoop HDFS; Typical usage padern Huge files (00s of GB to TB) Data is rarely updated in place Reads and appends are common Prakash 20 CS 6604: DM Large Networks & Time-Series 3
38 Distributed File System Chunk servers File is split into coneguous chunks Typically each chunk is 6-64MB Each chunk replicated (usually 2x or 3x) Try to keep replicas in different racks Master node a.k.a. Name Node in Hadoop s HDFS Stores metadata about where files are stored Might be replicated Client library for file access Talks to master to find chunk servers Connects directly to chunk servers to access data Prakash 20 CS 6604: DM Large Networks & Time-Series 38
39 Distributed File System Reliable distributed file system Data kept in chunks spread across machines Each chunk replicated on different machines Seamless recovery from disk or machine failure C 0 C D 0 C C 2 C C 0 C C C 2 C C 3 D 0 D D 0 C 2 Chunk server Chunk server 2 Chunk server 3 Chunk server N Bring computaeon directly to the data! Chunk servers also serve as compute servers Prakash 20 CS 6604: DM Large Networks & Time-Series 39
40 Programming Model: MapReduce Warm-up task: We have a huge text document Count the number of emes each disenct word appears in the file Sample applicakon: Analyze web server logs to find popular URLs Prakash 20 CS 6604: DM Large Networks & Time-Series 40
41 Task: Word Count Case : File too large for memory, but all <word, count> pairs fit in memory Case 2: Count occurrences of words: words(doc.txt) sort uniq -c where words takes a file and outputs the words in it, one per a line Case 2 captures the essence of MapReduce Great thing is that it is naturally parallelizable Prakash 20 CS 6604: DM Large Networks & Time-Series 4
42 MapReduce: Overview Sequeneally read a lot of data Map: Extract something you care about Group by key: Sort and Shuffle Reduce: Aggregate, summarize, filter or transform Write the result Outline stays the same, Map and Reduce change to fit the problem Prakash 20 CS 6604: DM Large Networks & Time-Series 42
43 MapReduce: The Map Step Input key-value pairs Intermediate key-value pairs k v map k k v v k v map k v k v k v Prakash 20 CS 6604: DM Large Networks & Time-Series 43
44 MapReduce: The Reduce Step Intermediate key-value pairs k v Key-value groups k v v v reduce Output key-value pairs k v k v Group by key k v v reduce k v k v k v k v k v Prakash 20 CS 6604: DM Large Networks & Time-Series 44
45 More Specifically Input: a set of key-value pairs Programmer specifies two methods: Map(k, v) <k, v >* Takes a key-value pair and outputs a set of key-value pairs E.g., key is the filename, value is a single line in the file There is one Map call for every (k,v) pair Reduce(k, <v >*) <k, v >* All values v with same key k are reduced together and processed in v order There is one Reduce funceon call per unique key k Prakash 20 CS 6604: DM Large Networks & Time-Series 4
46 MapReduce: Word CounKng Provided by the programmer Provided by the programmer MAP: Read input and produces a set of key-value pairs Group by key: Collect all pairs with same key Reduce: Collect all values belonging to the key and output The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/mache partnership. '"The work we're doing now -- the robotics we're doing -- is what we're going to need.. (The, ) (crew, ) (of, ) (the, ) (space, ) (shuale, ) (Endeavor, ) (recently, ). (crew, ) (crew, ) (space, ) (the, ) (the, ) (the, ) (shuale, ) (recently, ) (crew, 2) (space, ) (the, 3) (shuale, ) (recently, ) Sequeneally read the data Only sequeneal reads Big document Prakash 20 (key, value) (key, value) (key, value) CS 6604: DM Large Networks & Time-Series 46
47 Word Count Using MapReduce map(key, value): // key: document name; value: text of the document for each word w in value: emit(w, ) reduce(key, values): // key: a word; value: an iterator over counts result = 0 for each count v in values: result += v emit(key, result) Prakash 20 CS 6604: DM Large Networks & Time-Series 4
48 Map-Reduce (MR) as SQL select count(*) from DOCUMENT group by word Reducer Mapper Prakash 20 CS 6604: DM Large Networks & Time-Series 48
49 Map-Reduce: Environment Map-Reduce environment takes care of: Pareeoning the input data Scheduling the program s execueon across a set of machines Performing the group by key step Handling machine failures Managing required inter-machine communicaeon Prakash 20 CS 6604: DM Large Networks & Time-Series 49
50 Map-Reduce: A diagram Big document MAP: Read input and produces a set of key-value pairs Group by key: Collect all pairs with same key (Hash merge, Shuffle, Sort, ParKKon) Reduce: Collect all values belonging to the key and output Prakash 20 CS 6604: DM Large Networks & Time-Series 0
51 Map-Reduce: In Parallel Prakash All 20 phases are distributed CS 6604: DM Large with Networks many & Time-Series tasks doing the work
52 Map-Reduce Programmer specifies: Map and Reduce and input files Workflow: Read inputs as a set of key-value-pairs Map transforms input kv-pairs into a new set of k'v'-pairs Sorts & Shuffles the k'v'-pairs to output nodes All k v -pairs with a given k are sent to the same reduce Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs Write the resuleng pairs to files All phases are distributed with many tasks doing the work Input 0 Map 0 Input Map Input 2 Map 2 Shuffle Reduce 0 Reduce Out 0 Out Prakash 20 CS 6604: DM Large Networks & Time-Series 2
53 Data Flow Input and final output are stored on a distributed file system (FS): Scheduler tries to schedule map tasks close to physical storage locaeon of input data Intermediate results are stored on local FS of Map and Reduce workers Output is oien input to another MapReduce task Prakash 20 CS 6604: DM Large Networks & Time-Series 3
54 CoordinaKon: Master Master node takes care of coordinakon: Task status: (idle, in-progress, completed) Idle tasks get scheduled as workers become available When a map task completes, it sends the master the locaeon and sizes of its R intermediate files, one for each reducer Master pushes this info to reducers Master pings workers periodically to detect failures Prakash 20 CS 6604: DM Large Networks & Time-Series 4
55 Dealing with Failures Map worker failure Map tasks completed or in-progress at worker are reset to idle Reduce workers are noefied when task is rescheduled on another worker Reduce worker failure Only in-progress tasks are reset to idle Reduce task is restarted Master failure MapReduce task is aborted and client is noefied Prakash 20 CS 6604: DM Large Networks & Time-Series
56 GRAPH ANALYSIS ON HADOOP Based on slides from Prof. U. Kang Prakash 20 CS 6604: DM Large Networks & Time-Series 6
57 Problem DefiniKon Given: a BIG graph G(V, E) Compute: Connected Components Diameter PageRank. SoluKon: Use Hadoop Prakash 20 CS 6604: DM Large Networks & Time-Series
58 GIM-V: One unified framework [Kang+, 2008] YahooWeb: V =.4B E = 6.6B Count 300-size cmpt X 00. Why? 00-size cmpt X 6. Why? Size Prakash 20 CS 6604: DM Large Networks & Time-Series 8
59 Main Idea GIM-V Generalized Iteraeve Matrix-Vector Muleplicaeon Extension of plain matrix-vector muleplicaeon Includes as special cases Connected Components PageRank RWR Diameters. Prakash 20 CS 6604: DM Large Networks & Time-Series 9
60 Main Idea: IntuiKon 0. Weighted Combination of Colors ~ Message Passing M v v' X = v4' 4 i m i v i 4 Prakash 20 CS 6604: DM Large Networks & Time-Series 60
61 M v v' 0. X = 0. Three implicit operaeons here: multiply m ij and sum n multiplication results update v i ' v j combine2 combineall assign Message sending Message combination Prakash 20 CS 6604: DM Large Networks & Time-Series 6
62 Main Idea GIM-V Matrix represents edge(src, dest) Vector represents node values/labels Customizing the three operaeons operations Standard MV Con. Cmpt. PageRank RWR Diameter combine2 Multiply combineall Sum assign Assign Prakash 20 CS 6604: DM Large Networks & Time-Series 62
63 GIM-V operations Standard MV Con. Cmpt. PageRank RWR Diameter combine2 Multiply Bool. X combineall Sum MIN assign Assign MIN Prakash 20 CS 6604: DM Large Networks & Time-Series 63
64 GIM-V for CC How many connected components? 2 == which node belongs to which component? G G G Input Graph 8 node id Output component id Prakash 20 CS 6604: DM Large Networks & Time-Series 64
65 GIM-V for CC final vector init vector G G G? Prakash 20 CS 6604: DM Large Networks & Time-Series 6
66 GIM-V for CC Prakash 20 CS 6604: DM Large Networks & Time-Series 66 åa min(, min(2) ) min(2, min(,3) ) min(3, min(2,4) ) min(4, min(3) ) min(, min(6) ) min(6, min() ) min(, min(8) ) min(8, min() ) GIM-V for Connected Components Sending Invitations Accept the Smallest
67 GIM-V for CC Prakash 20 CS 6604: DM Large Networks & Time-Series 6 åa min(, min(2) ) min(2, min(,3) ) min(3, min(2,4) ) min(4, min(3) ) min(, min(6) ) min(6, min() ) min(, min(8) ) min(8, min() ) GIM-V for Connected Components Sending Invitations Accept the Smallest GIM-V with MIN = find min. node ids within hop
68 GIM-V for CC Prakash 20 CS 6604: DM Large Networks & Time-Series 68 åa min(, min(2) ) min(2, min(,3) ) min(3, min(2,4) ) min(4, min(3) ) min(, min(6) ) min(6, min() ) min(, min(8) ) min(8, min() ) GIM-V for Connected Components Sending Invitations Accept the Smallest k GIM-V with MIN = find min. node ids within k hops
69 GIM-V for CC Prakash 20 CS 6604: DM Large Networks & Time-Series 69 åa min(, min(2) ) min(2, min(,3) ) min(3, min(2,4) ) min(4, min(3) ) min(, min(6) ) min(6, min() ) min(, min(8) ) min(8, min() ) GIM-V for Connected Components Sending Invitations Accept the Smallest Max. iterakons = diameter
70 GIM-V Operations Standard MV Con. Cmpt. PageRank RWR Diameter combine2 Multiply Multiply Multiply with c Multiply with c Multiply b it-vector combineall Sum MIN Sum with rj prob. Sum with res tart prob BIT-OR() assign Assign MIN Assign Assign BIT-OR() Prakash 20 CS 6604: DM Large Networks & Time-Series 0
71 HDFS restrickons [R] HDFS is locaeon transperant Users don t know which machine has which file [R2] A line is never split A large file is split into pieces of a size (e.g. 26MB) Users don t know the point of split Prakash 20 CS 6604: DM Large Networks & Time-Series
72 Fast GIM-V Given R and R2, how to design faster algs. for GIM-V?. Block Muleplicaeon 2. Clustering 3. Compression Prakash 20 CS 6604: DM Large Networks & Time-Series 2
73 Block MulKplicaKon Group elements together into line 2. Storage for an element: 2log n bits -> 2log b bits 3. Adjust the MapReduce code(block multiplication) log b bits log n bits b: block width n: # of nodes Prakash 20 CS 6604: DM Large Networks & Time-Series 3
74 Clustering and Compress Prakash 20 CS 6604: DM Large Networks & Time-Series 4 Preprocess Compress ZIP ZIP
75 Performance 43x smaller 9.2x faster Block Encoding? Compression? Clustering? RAW No No No NNB Yes No No NCB Yes Yes No CCB Yes Yes Yes Prakash 20 CS 6604: DM Large Networks & Time-Series
76 Many extensions and opkmizakons Local queries Diagonal block iteraeons. Prakash 20 CS 6604: DM Large Networks & Time-Series 6
77 Conclusions Hadoop is a distributed data-intesive compueng framework MapReduce Simple programming paradigm Surprisingly powerful (may not be suitable for all tasks though) Hadoop has specialized FileSystem, Master- Slave Architecture to scale-up Prakash 20 CS 6604: DM Large Networks & Time-Series
78 NoSQL and Hadoop Hot area with several new problems Good for academic research Good for industry = Fun AND Profit J Prakash 20 CS 6604: DM Large Networks & Time-Series 8
79 POINTERS AND FURTHER READING Prakash 20 CS 6604: DM Large Networks & Time-Series 9
80 ImplementaKons Google Not available outside Google Hadoop An open-source implementaeon in Java Uses HDFS for stable storage Download: Aster Data Cluster-opemized SQL Database that also implements MapReduce Prakash 20 CS 6604: DM Large Networks & Time-Series 80
81 Cloud CompuKng Ability to rent compueng by the hour Addieonal services e.g., persistent storage Amazon s Elasec Compute Cloud (EC2) Aster Data and Hadoop can both be run on EC2 Prakash 20 CS 6604: DM Large Networks & Time-Series 8
82 Reading Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters hap://labs.google.com/papers/mapreduce.html Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung: The Google File System hap://labs.google.com/papers/gfs.html Prakash 20 CS 6604: DM Large Networks & Time-Series 82
83 Resources Hadoop Wiki Introduceon hap://wiki.apache.org/lucene-hadoop/ Geqng Started hap://wiki.apache.org/lucene-hadoop/ GeqngStartedWithHadoop Map/Reduce Overview hap://wiki.apache.org/lucene-hadoop/hadoopmapreduce hap://wiki.apache.org/lucene-hadoop/hadoopmapredclasses Eclipse Environment hap://wiki.apache.org/lucene-hadoop/eclipseenvironment Javadoc hap://lucene.apache.org/hadoop/docs/api/ Prakash 20 CS 6604: DM Large Networks & Time-Series 83
84 Resources Releases from Apache download mirrors hap:// hadoop/ Nightly builds of source hap://people.apache.org/dist/lucene/hadoop/ nightly/ Source code from subversion hap://lucene.apache.org/hadoop/ version_control.html Prakash 20 CS 6604: DM Large Networks & Time-Series 84
85 Further Reading Programming model inspired by funceonal language primieves Pareeoning/shuffling similar to many large-scale soreng systems NOW-Sort ['9] Re-execueon for fault tolerance BAD-FS ['04] and TACC ['9] Locality opemizaeon has parallels with Aceve Disks/Diamond work Aceve Disks ['0], Diamond ['04] Backup tasks similar to Eager Scheduling in Charloae system Charloae ['96] Dynamic load balancing solves similar problem as River's distributed queues River ['99] Prakash 20 CS 6604: DM Large Networks & Time-Series 8
CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #11: No- SQL, Map- Reduce and New So8ware Stack
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #11: No- SQL, Map- Reduce and New So8ware Stack (some slides from Xiao Yu) NO SQL VT CS 5614 2 Why No SQL? VT CS 5614 3 RDBMS The predominant
More informationCS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #12: NoSQL and MapReduce
CS 4604: Introduc0on to Database Management Systems B. Aditya Prakash Lecture #12: NoSQL and MapReduce (some slides from Xiao Yu) NO SQL Prakash 2016 VT CS 4604 2 Why No SQL? Prakash 2016 VT CS 4604 3
More informationCS 345A Data Mining. MapReduce
CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes
More informationOutline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins
MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce
More informationCS 345A Data Mining. MapReduce
CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Dis Commodity Clusters Web data sets can be ery large Tens to hundreds of terabytes
More informationMapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec
MapReduce: Recap Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Sequentially read a lot of data Why? Map: extract something we care about map (k, v)
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 1/6/2015 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 1/6/2015 Jure Leskovec,
More informationDatabases 2 (VU) ( )
Databases 2 (VU) (707.030) Map-Reduce Denis Helic KMI, TU Graz Nov 4, 2013 Denis Helic (KMI, TU Graz) Map-Reduce Nov 4, 2013 1 / 90 Outline 1 Motivation 2 Large Scale Computation 3 Map-Reduce 4 Environment
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationThe MapReduce Abstraction
The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationA BigData Tour HDFS, Ceph and MapReduce
A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationPrinciples of Data Management. Lecture #16 (MapReduce & DFS for Big Data)
Principles of Data Management Lecture #16 (MapReduce & DFS for Big Data) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s News Bulletin
More informationBig Data and IoT. Baris Aksanli 02/10/2016
Big Data and IoT Baris Aksanli 02/10/2016 Why is there big data? Number of devices increasing exponeneally They conenuously generate data For example, on average, 72 hours of videos are uploaded to YouTube
More informationMapReduce. Stony Brook University CSE545, Fall 2016
MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More information2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
More informationIn the news. Request- Level Parallelism (RLP) Agenda 10/7/11
In the news "The cloud isn't this big fluffy area up in the sky," said Rich Miller, who edits Data Center Knowledge, a publicaeon that follows the data center industry. "It's buildings filled with servers
More informationMap Reduce. Yerevan.
Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate
More informationProgramming Systems for Big Data
Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationMap Reduce Group Meeting
Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in
More informationAnnouncements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm
Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)
More informationMotivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters
Motivation MapReduce: Simplified Data Processing on Large Clusters These are slides from Dan Weld s class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google,
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationDatabase Systems CSE 414
Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create
More informationMapReduce & BigTable
CPSC 426/526 MapReduce & BigTable Ennan Zhai Computer Science Department Yale University Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Process & Analysis:
More informationThe Google File System. Alexandru Costan
1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems
More informationCS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece
More informationTDDD43 HT2014: Advanced databases and data models Theme 4: NoSQL, Distributed File System, Map-Reduce
TDDD43 HT2014: Advanced databases and data models Theme 4: NoSQL, Distributed File System, Map-Reduce Valentina Ivanova ADIT, IDA, Linköping University Slides based on slides by Fang Wei-Kleiner DFS, Map-Reduce
More informationMapReduce and Friends
MapReduce and Friends Craig C. Douglas University of Wyoming with thanks to Mookwon Seo Why was it invented? MapReduce is a mergesort for large distributed memory computers. It was the basis for a web
More informationMapReduce. U of Toronto, 2014
MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File
More informationCSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA
CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers
More informationParallel Computing: MapReduce Jin, Hai
Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google
More informationΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing
ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent
More informationCS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab
CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material
More informationCS370 Operating Systems
CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 26 File Systems Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Cylinders: all the platters?
More informationCS November 2017
Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More informationCompSci 516: Database Systems
CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationHadoop An Overview. - Socrates CCDH
Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected
More informationRule 14 Use Databases Appropriately
Rule 14 Use Databases Appropriately Rule 14: What, When, How, and Why What: Use relational databases when you need ACID properties to maintain relationships between your data. For other data storage needs
More informationAnnouncements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems
Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016
More informationCS 655 Advanced Topics in Distributed Systems
Presented by : Walid Budgaga CS 655 Advanced Topics in Distributed Systems Computer Science Department Colorado State University 1 Outline Problem Solution Approaches Comparison Conclusion 2 Problem 3
More informationCmpE 138 Spring 2011 Special Topics L2
CmpE 138 Spring 2011 Special Topics L2 Shivanshu Singh shivanshu.sjsu@gmail.com Map Reduce ElecBon process Map Reduce Typical single node architecture Applica'on CPU Memory Storage Map Reduce Applica'on
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationCS 61C: Great Ideas in Computer Architecture. MapReduce
CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later
More informationIntroduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)
Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction
More informationComparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2014 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationMapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1
MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationCassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent
Tanton Jeppson CS 401R Lab 3 Cassandra, MongoDB, and HBase Introduction For my report I have chosen to take a deeper look at 3 NoSQL database systems: Cassandra, MongoDB, and HBase. I have chosen these
More informationCISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL
CISC 7610 Lecture 5 Distributed multimedia databases Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL Motivation YouTube receives 400 hours of video per minute That is 200M hours
More informationLessons Learned While Building Infrastructure Software at Google
Lessons Learned While Building Infrastructure Software at Google Jeff Dean jeff@google.com Google Circa 1997 (google.stanford.edu) Corkboards (1999) Google Data Center (2000) Google Data Center (2000)
More informationData Partitioning and MapReduce
Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,
More informationMapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia
MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationCPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University
CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network
More informationMapReduce. Cloud Computing COMP / ECPE 293A
Cloud Computing COMP / ECPE 293A MapReduce Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th conference on Symposium on Opera7ng Systems
More informationYuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013
Yuval Carmel Tel-Aviv University "Advanced Topics in About & Keywords Motivation & Purpose Assumptions Architecture overview & Comparison Measurements How does it fit in? The Future 2 About & Keywords
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce
More informationNoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems
CompSci 516 Data Intensive Computing Systems Lecture 21 (optional) NoSQL systems Instructor: Sudeepa Roy Duke CS, Spring 2016 CompSci 516: Data Intensive Computing Systems 1 Key- Value Stores Duke CS,
More informationIntroduction to MapReduce (cont.)
Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations
More informationCS November 2018
Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationNoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu
NoSQL Databases MongoDB vs Cassandra Kenny Huynh, Andre Chik, Kevin Vu Introduction - Relational database model - Concept developed in 1970 - Inefficient - NoSQL - Concept introduced in 1980 - Related
More informationMapReduce: Simplified Data Processing on Large Clusters 유연일민철기
MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More informationDatabases and Big Data Today. CS634 Class 22
Databases and Big Data Today CS634 Class 22 Current types of Databases SQL using relational tables: still very important! NoSQL, i.e., not using relational tables: term NoSQL popular since about 2007.
More informationCIB Session 12th NoSQL Databases Structures
CIB Session 12th NoSQL Databases Structures By: Shahab Safaee & Morteza Zahedi Software Engineering PhD Email: safaee.shx@gmail.com, morteza.zahedi.a@gmail.com cibtrc.ir cibtrc cibtrc 2 Agenda What is
More informationMapReduce: A Programming Model for Large-Scale Distributed Computation
CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview
More informationData contains value and knowledge. 4/1/19 Tim Althoff, UW CS547: Machine Learning for Big Data,
Data contains value and knowledge 4/1/19 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 3 But to extract the knowledge data needs to be Stored (systems) Managed
More informationDistributed Computations MapReduce. adapted from Jeff Dean s slides
Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)
More informationCS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014
CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions
More informationGoogle File System (GFS) and Hadoop Distributed File System (HDFS)
Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear
More informationJeffrey D. Ullman Stanford University
Jeffrey D. Ullman Stanford University for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationDistributed Data Store
Distributed Data Store Large-Scale Distributed le system Q: What if we have too much data to store in a single machine? Q: How can we create one big filesystem over a cluster of machines, whose data is
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google* 정학수, 최주영 1 Outline Introduction Design Overview System Interactions Master Operation Fault Tolerance and Diagnosis Conclusions
More informationCSE-E5430 Scalable Cloud Computing Lecture 9
CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay
More informationAnnouncements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414
Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s
More informationHADOOP FRAMEWORK FOR BIG DATA
HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further
More information