CS 5614: (Big) Data Management Systems. B. Aditya Prakash. Lecture #11: NoSQL, Map-Reduce and the New Software Stack
(some slides from Xiao Yu)
NOSQL
VT CS 5614
Why NoSQL?
RDBMS
The predominant choice for storing data (though not so true for data miners, who often keep data in text files). First formulated in 1969 by Codd. RDBMS are in use everywhere.
Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"
When RDBMS met Web 2.0
Slide from Lorenzo Alberton, "NoSQL Databases: Why, what and when"
What to do if data is really large? Petabytes (exabytes, zettabytes, ...). Google processed 24 PB of data per day (2009). Facebook adds 0.5 PB per day.
BIG data
What's Wrong with Relational DBs? Nothing is wrong; you just need to use the right tool. Relational is hard to scale: easy to scale reads, hard to scale writes.
What's NoSQL? The misleading term "NoSQL" is short for "Not Only SQL": non-relational, schema-free, non-(quite)-ACID (more on ACID transactions later in class), horizontally scalable, distributed, easy replication support, simple API.
Four (emerging) NoSQL Categories
Key-value (K-V) stores: based on Distributed Hash Tables / Amazon's Dynamo paper*. Data model: (global) collection of K-V pairs. Example: Voldemort.
Column families: BigTable clones**. Data model: big table, column families. Example: HBase, Cassandra, Hypertable.
* G. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP '07
** F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI '06
Four (emerging) NoSQL Categories
Document databases: inspired by Lotus Notes. Data model: collections of K-V collections. Example: CouchDB, MongoDB.
Graph databases: inspired by Euler and graph theory. Data model: nodes, relations, K-V pairs on both. Example: AllegroGraph, VertexDB, Neo4j.
Focus of Different Data Models
Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"
The CAP theorem: of Consistency, Availability, and Partition tolerance, a distributed system can guarantee only two. RDBMS favor Consistency and Availability; most NoSQL systems favor Availability and Partition tolerance.
When to use NoSQL?
Bigness. Massive write performance: Twitter generated 7 TB per day (2010). Fast key-value access. Flexible schema or data types; schema migration. Write availability: writes need to succeed no matter what (CAP, partitioning). Easier maintainability, administration and operations. No single point of failure. Generally available parallel computing. Programmer ease of use. Use the right data model for the right problem. Avoid hitting the wall. Distributed systems support. Tunable CAP tradeoffs. (from http://highscalability.com/)
Key-Value Stores
Table in a relational DB vs. store/domain in a key-value DB:
id | hair_color | age | height
1923 | Red | 18 | 6'0"
3371 | Blue | 34 | NA
Find users whose age is above 18? Find all attributes of user 1923? Find users whose hair color is Red and age is 19 (a join operation)? Calculate the average age of all grad students?
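A minimal in-memory sketch (not any specific product) of why the queries above differ in cost on a key-value store: fetching all attributes of one user is a single key lookup, but a predicate on a non-key attribute forces a scan of every value. All names here are illustrative.

```python
store = {
    "1923": {"hair_color": "Red", "age": 18, "height": "6'0\""},
    "3371": {"hair_color": "Blue", "age": 34, "height": None},
}

def get_user(user_id):
    """Find all attributes of one user: a single key lookup, the cheap case."""
    return store[user_id]

def users_older_than(min_age):
    """Find users whose age is above a threshold: the store keeps no index
    on 'age', so we must scan every value."""
    return [uid for uid, attrs in store.items() if attrs["age"] > min_age]
```

Joins and aggregates (the last two queries on the slide) would similarly require the application to scan and combine values itself.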
Voldemort at LinkedIn
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)
Voldemort vs. MySQL
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)
Column Families (BigTable-like)
F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI '06
BigTable Data Model
The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page.
BigTable Performance
Document Database: MongoDB
Table in a relational DB ↔ documents in a collection. Initial release 2009. Open-source document DB. JSON-like documents with dynamic schema.
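A sketch of what "dynamic schema" means: documents in the same collection need not share fields. Plain Python dicts stand in for JSON-like documents, and the tiny find() below is only a stand-in for a query-by-example lookup; the field names are made up for illustration.

```python
collection = [
    {"_id": 1, "name": "widget", "price": 9.99},
    {"_id": 2, "name": "gadget", "price": 12.50, "tags": ["new", "sale"]},
    {"_id": 3, "name": "gizmo"},  # no price field at all; still a valid document
]

def find(coll, **criteria):
    """Return documents whose fields equal all the given criteria."""
    return [d for d in coll if all(d.get(k) == v for k, v in criteria.items())]
```

Contrast with a relational table, where every row must have the same columns (with NULLs for missing values) and adding a field means a schema migration.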
MongoDB Product Deployment, and much more
Graph Database Data Model
Abstraction: nodes, relations, properties
Neo4j: Build a Graph
Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"
A Debatable Performance Evaluation
Conclusion
Use the right data model for the right problem
MAP-REDUCE AND HADOOP
MapReduce
Much of the course will be devoted to large-scale computing for data mining. Challenges: how to distribute computation? Distributed/parallel programming is hard. Map-reduce addresses all of the above: it is Google's computational/data manipulation model, and an elegant way to work with big data.
Single Node Architecture
CPU, memory, disk: machine learning, statistics, and classical data mining run on one machine.
Motivation: Google Example
20+ billion web pages × 20 KB = 400+ TB. One computer reads 30-35 MB/sec from disk: ~4 months just to read the web, and ~1,000 hard drives just to store it. It takes even more to do something useful with the data! Today, a standard architecture for such problems is emerging: a cluster of commodity Linux nodes, with a commodity network (Ethernet) to connect them.
Cluster Architecture
1 Gbps between any pair of nodes in a rack; a 2-10 Gbps backbone between racks (via switches); each rack contains 16-64 nodes (CPU, memory, disk). In 2011 it was estimated that Google had 1M machines, http://bit.ly/shh0ro
Large-scale Computing
Large-scale computing for data mining problems on commodity hardware. Challenges: How do you distribute computation? How can we make it easy to write distributed programs? Machines fail: one server may stay up 3 years (1,000 days), so if you have 1,000 servers, expect to lose one per day. People estimated Google had ~1M machines in 2011, so 1,000 machines fail every day!
Idea and Solution
Issue: copying data over a network takes time. Idea: bring computation close to the data, and store files multiple times for reliability. Map-reduce addresses these problems: Google's computational/data manipulation model, an elegant way to work with big data. Storage infrastructure: a file system (Google: GFS; Hadoop: HDFS). Programming model: Map-Reduce.
Hadoop vs. NoSQL
Hadoop: a computing framework that supports data-intensive applications; includes MapReduce, HDFS, etc. (we will mainly study MR next). NoSQL: Not Only SQL databases; these can be built ON Hadoop, e.g. HBase.
Storage Infrastructure
Problem: if nodes fail, how do we store data persistently? Answer: a distributed file system, which provides a global file namespace (Google GFS; Hadoop HDFS). Typical usage pattern: huge files (100s of GB to TB), data rarely updated in place, reads and appends common.
Distributed File System
Chunk servers: each file is split into contiguous chunks, typically 16-64 MB each; each chunk is replicated (usually 2x or 3x), with replicas kept in different racks where possible. Master node (a.k.a. Name Node in Hadoop's HDFS): stores metadata about where files are stored; might be replicated. Client library for file access: talks to the master to find chunk servers, then connects directly to chunk servers to access data.
Distributed File System
A reliable distributed file system: data is kept in chunks spread across machines, and each chunk is replicated on different machines, giving seamless recovery from disk or machine failure. (Diagram: chunks C0-C5 and D0-D1 replicated across chunk servers 1..N.) Bring computation directly to the data! Chunk servers also serve as compute servers.
Programming Model: MapReduce
Warm-up task: we have a huge text document; count the number of times each distinct word appears in the file. Sample application: analyze web server logs to find popular URLs.
Task: Word Count
Case 1: the file is too large for memory, but all <word, count> pairs fit in memory.
Case 2: count occurrences of words with a Unix pipeline: words(doc.txt) | sort | uniq -c, where words takes a file and outputs the words in it, one per line.
Case 2 captures the essence of MapReduce, and the great thing is that it is naturally parallelizable.
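The shell pipeline in Case 2 can be re-expressed in Python so each stage is visible: words() splits the text, sorted() plays the role of sort, and itertools.groupby plays the role of uniq -c. This is a sketch, not the MapReduce implementation itself.

```python
from itertools import groupby

def words(text):
    """Output the words of a text, like the hypothetical words command."""
    return text.lower().split()

def word_count(text):
    ordered = sorted(words(text))                          # sort
    return {w: len(list(g)) for w, g in groupby(ordered)}  # uniq -c
```

The sort stage is exactly what becomes the group-by-key (shuffle) step when this computation is parallelized.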
MapReduce: Overview
Sequentially read a lot of data. Map: extract something you care about. Group by key: sort and shuffle. Reduce: aggregate, summarize, filter or transform. Write the result. The outline stays the same; Map and Reduce change to fit the problem.
MapReduce: The Map Step
Each input key-value pair (k, v) is fed to map, which emits zero or more intermediate key-value pairs.
MapReduce: The Reduce Step
Intermediate key-value pairs are grouped by key into key-value groups (k, <v, v, ...>); reduce turns each group into output key-value pairs.
More Specifically
Input: a set of key-value pairs. The programmer specifies two methods:
Map(k, v) → <k', v'>*: takes a key-value pair and outputs a set of key-value pairs (e.g., key is the filename, value is a single line in the file). There is one Map call for every (k, v) pair.
Reduce(k', <v'>*) → <k', v''>*: all values v' with the same key k' are reduced together and processed in v' order. There is one Reduce function call per unique key k'.
MapReduce: Word Counting
MAP (provided by the programmer): read input and produce a set of key-value pairs, e.g. (The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ...
Group by key: collect all pairs with the same key, e.g. (crew, 1) (crew, 1); (space, 1); (the, 1) (the, 1) (the, 1); ...
Reduce (provided by the programmer): collect all values belonging to the key and output, e.g. (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), ...
Only sequential reads of the big document are needed.
(Sample input: "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need...'")
Word Count Using MapReduce
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)
reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
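The pseudocode above can be checked with a single-process simulation: map emits (word, 1) pairs, a group-by-key stage collects them, and reduce sums each group. The function names are ours; real frameworks distribute these phases across machines.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of the document
    for w in value.split():
        yield (w, 1)

def reduce_fn(key, values):
    # key: a word; values: an iterator over counts
    yield (key, sum(values))

def mapreduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for name, text in docs.items():
        for k, v in map_fn(name, text):   # map phase
            groups[k].append(v)           # group by key (the shuffle)
    out = {}
    for k, vs in groups.items():          # reduce phase, one call per key
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out
```

Swapping in different map_fn/reduce_fn pairs reproduces the other examples in this lecture; the surrounding machinery never changes.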
Map-Reduce (MR) as SQL
select word, count(*) from DOCUMENT group by word
(the per-row extraction of words is the Mapper; the group-by/aggregate is the Reducer)
Map-Reduce: Environment
The Map-Reduce environment takes care of: partitioning the input data; scheduling the program's execution across a set of machines; performing the group-by-key step; handling machine failures; managing required inter-machine communication.
Map-Reduce: A Diagram
Big document → MAP: read input and produce a set of key-value pairs → Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition) → Reduce: collect all values belonging to the key and output.
Map-Reduce: In Parallel
All phases are distributed, with many tasks doing the work.
Map-Reduce
The programmer specifies Map, Reduce and the input files. Workflow: read inputs as a set of key-value pairs; Map transforms input kv-pairs into a new set of k'v'-pairs; sort and shuffle the k'v'-pairs to output nodes (all k'v'-pairs with a given k' are sent to the same reducer); Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs; write the resulting pairs to files. All phases are distributed, with many tasks doing the work. (Diagram: Input 0/1/2 → Map 0/1/2 → Shuffle → Reduce 0/1 → Out 0/1.)
Data Flow
Input and final output are stored on a distributed file system (FS); the scheduler tries to schedule map tasks close to the physical storage location of the input data. Intermediate results are stored on the local FS of Map and Reduce workers. Output is often the input to another MapReduce task.
Coordination: Master
The master node takes care of coordination. Task status: idle, in-progress, or completed; idle tasks get scheduled as workers become available. When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer; the master pushes this info to the reducers. The master pings workers periodically to detect failures.
Dealing with Failures
Map worker failure: map tasks completed or in-progress at the worker are reset to idle; reduce workers are notified when a task is rescheduled on another worker. Reduce worker failure: only in-progress tasks are reset to idle, and the reduce task is restarted. Master failure: the MapReduce task is aborted and the client is notified.
How Many Map and Reduce Jobs?
M map tasks, R reduce tasks. Rule of thumb: make M much larger than the number of nodes in the cluster; one DFS chunk per map is common. This improves dynamic load balancing and speeds up recovery from worker failures. Usually R is smaller than M, because the output is spread across R files.
Task Granularity & Pipelining
Fine-granularity tasks: many more map tasks than machines. This minimizes the time for fault recovery, allows pipelining the shuffle with map execution, and gives better dynamic load balancing.
Refinements: Backup Tasks
Problem: slow workers significantly lengthen the job completion time (other jobs on the machine, bad disks, weird things). Solution: near the end of a phase, spawn backup copies of tasks; whichever copy finishes first wins. Effect: dramatically shortens job completion time.
Refinement: Combiners
Often a Map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k, e.g. popular words in the word count example. We can save network time by pre-aggregating values in the mapper: combine(k, list(v1)) → v2. The combiner is usually the same as the reduce function. This works only if the reduce function is commutative and associative.
Refinement: Combiners
Back to our word counting example: the combiner combines the values of all keys within a single mapper (a single machine), so much less data needs to be copied and shuffled!
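A sketch of the combiner idea: run the summing logic of the reducer on each mapper's local output before the shuffle. This is valid here because addition is commutative and associative, exactly the condition stated above; the variable names are illustrative.

```python
from collections import defaultdict

def combine(pairs):
    """Pre-aggregate the (key, value) pairs produced by one mapper."""
    local = defaultdict(int)
    for k, v in pairs:
        local[k] += v
    return list(local.items())

# One mapper's raw output for a document mentioning "the" three times:
mapper_output = [("the", 1), ("crew", 1), ("the", 1), ("the", 1)]
combined = combine(mapper_output)  # only one pair per key crosses the network
```

For a non-commutative or non-associative reduce function (e.g. computing a median), applying this combiner would change the result, which is why the slide restricts when combiners may be used.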
Refinement: Partition Function
We may want to control how keys get partitioned: inputs to map tasks are created by contiguous splits of the input file, but reduce needs to ensure that records with the same intermediate key end up at the same worker. The system uses a default partition function, hash(key) mod R. It is sometimes useful to override the hash function: e.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file.
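A sketch of overriding the default partitioner as described above: route by hostname so all URLs from one host reach the same reduce task. A stable hash (crc32 rather than Python's salted built-in hash()) keeps the assignment reproducible across runs; R = 4 and the function names are assumptions for illustration.

```python
from urllib.parse import urlparse
from zlib import crc32

R = 4  # number of reduce tasks

def default_partition(key):
    """The default scheme: hash(key) mod R."""
    return crc32(key.encode()) % R

def host_partition(url):
    """Custom scheme: hash(hostname(URL)) mod R."""
    return crc32(urlparse(url).hostname.encode()) % R
```

Under default_partition, "http://a.com/x" and "http://a.com/y" would usually land on different reducers; under host_partition they always land on the same one.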
PROBLEMS SUITED FOR MAP-REDUCE
Example: Host Size
Suppose we have a large web corpus. Look at the metadata file, with lines of the form (URL, size, date, ...). For each host, find the total number of bytes, i.e., the sum of the page sizes for all URLs from that particular host. Other examples: link analysis and graph processing, machine learning algorithms.
Example: Language Model
Statistical machine translation: we need to count the number of times every 5-word sequence occurs in a large corpus of documents. Very easy with MapReduce: Map extracts (5-word sequence, count) from each document; Reduce combines the counts.
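A sketch of the Map step for this example: slide a 5-word window over the document and emit (sequence, 1) pairs; the Reduce side is the same summing reducer as in word count. The function name is ours.

```python
def five_grams(text):
    """Emit (5-word sequence, 1) for every window in the document."""
    toks = text.split()
    for i in range(len(toks) - 4):
        yield (" ".join(toks[i:i + 5]), 1)
```

A document of n words yields n - 4 pairs, so the shuffle volume is roughly five times the corpus size, which is why counting n-grams at scale was an early showcase MapReduce workload.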
Example: Join by Map-Reduce
Compute the natural join R(A,B) ⋈ S(B,C). R and S are each stored in files; tuples are pairs (a,b) or (b,c).
R(A,B): (a1,b1), (a2,b1), (a3,b2), (a4,b3)
S(B,C): (b2,c1), (b2,c2), (b3,c3)
R ⋈ S: (a3,b2,c1), (a3,b2,c2), (a4,b3,c3)
Map-Reduce Join
Use a hash function h from B-values to 1...k. A Map process turns each input tuple R(a,b) into the key-value pair (b,(a,R)) and each input tuple S(b,c) into (b,(c,S)). Map processes send each key-value pair with key b to Reduce process h(b); Hadoop does this automatically, you just tell it what k is. Each Reduce process matches all the pairs (b,(a,R)) with all the pairs (b,(c,S)) and outputs (a,b,c).
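A single-process sketch of this join: the map side tags each tuple with its relation of origin and keys it by b; the reduce side pairs up the R- and S-values that share a b and outputs (a, b, c). The grouping dict stands in for the shuffle to reducers h(b).

```python
from collections import defaultdict

def mr_join(R, S):
    """Natural join of R(A,B) and S(B,C), simulating the map and reduce sides."""
    groups = defaultdict(lambda: {"R": [], "S": []})
    for a, b in R:
        groups[b]["R"].append(a)   # map emits (b, (a, 'R'))
    for b, c in S:
        groups[b]["S"].append(c)   # map emits (b, (c, 'S'))
    # reduce: for each b, cross R-values with S-values
    return [(a, b, c)
            for b, g in groups.items()
            for a in g["R"] for c in g["S"]]
```

Note that b1 produces no output because it never appears in S, matching the worked example on the previous slide.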
Cost Measures for Algorithms
In MapReduce we quantify the cost of an algorithm using:
1. Communication cost = total I/O of all processes.
2. Elapsed communication cost = max of I/O along any path.
3. (Elapsed) computation cost: analogous, but counting only the running time of processes.
Note that here big-O notation is not the most useful (adding more machines is always an option).
Example: Cost Measures
For a map-reduce algorithm: communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes. Elapsed communication cost is the sum of the largest input + output for any map process, plus the same for any reduce process.
What Cost Measures Mean
Either the I/O (communication) cost or the processing (computation) cost dominates; ignore one or the other. Total cost tells you what you pay in rent to your friendly neighborhood cloud (e.g., AWS in HW3). Elapsed cost is wall-clock time using parallelism.
Cost of Map-Reduce Join
Total communication cost = O(|R| + |S| + |R ⋈ S|). Elapsed communication cost = O(s): we put a limit s on the amount of input or output that any one process can have, and pick k and the number of Map processes so that the I/O limit s is respected. s could be what fits in main memory, or what fits on local disk. With proper indexes, computation cost is linear in the input + output size, so computation cost is like communication cost.
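A worked check of the total-cost formula using a simplified tuple-count model (each tuple costs one unit of I/O) on the small relations from the earlier join example, where |R| = 4, |S| = 3, and the join has 3 output tuples. The breakdown below is an assumption for illustration, matching the "input + 2 × shuffle-side + output" accounting on the Example: Cost Measures slide with one map-to-reduce pass.

```python
def join_comm_cost(n_R, n_S, n_out):
    """Total communication cost of the one-round join, in tuples."""
    read_input = n_R + n_S    # map tasks read every tuple of R and S
    shuffle = n_R + n_S       # every tuple is then sent to one reducer
    write_output = n_out      # reducers write the join result
    return read_input + shuffle + write_output
```

For the example relations this gives (4 + 3) + (4 + 3) + 3 = 17 tuple transfers, and in general the cost grows as |R| + |S| + |R ⋈ S|, as stated above.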
Conclusions
Hadoop is a distributed, data-intensive computing framework. MapReduce is a simple programming paradigm that is surprisingly powerful (though it may not be suitable for all tasks). Hadoop has a specialized file system and a master-slave architecture to scale up.
NoSQL and Hadoop
A hot area with several new problems: good for academic research, good for industry = fun AND profit :-)
POINTERS AND FURTHER READING
Implementations
Google: not available outside Google. Hadoop: an open-source implementation in Java; uses HDFS for stable storage; download: http://lucene.apache.org/hadoop/. Aster Data: a cluster-optimized SQL database that also implements MapReduce.
Cloud Computing
The ability to rent computing by the hour, plus additional services, e.g., persistent storage. Amazon's Elastic Compute Cloud (EC2); Aster Data and Hadoop can both be run on EC2.
Reading
Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, http://labs.google.com/papers/mapreduce.html
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System, http://labs.google.com/papers/gfs.html
Resources
Hadoop Wiki:
Introduction: http://wiki.apache.org/lucene-hadoop/
Getting Started: http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
Map/Reduce Overview: http://wiki.apache.org/lucene-hadoop/HadoopMapReduce and http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
Eclipse Environment: http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
Javadoc: http://lucene.apache.org/hadoop/docs/api/
Resources
Releases from Apache download mirrors: http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
Nightly builds of source: http://people.apache.org/dist/lucene/hadoop/nightly/
Source code from Subversion: http://lucene.apache.org/hadoop/version_control.html
Further Reading
The programming model is inspired by functional language primitives. Partitioning/shuffling is similar to many large-scale sorting systems: NOW-Sort ['97]. Re-execution for fault tolerance: BAD-FS ['04] and TACC ['97]. Locality optimization has parallels with the Active Disks/Diamond work: Active Disks ['01], Diamond ['04]. Backup tasks are similar to eager scheduling in the Charlotte system: Charlotte ['96]. Dynamic load balancing solves a similar problem as River's distributed queues: River ['99].