CS 6604: Data Mining Large Networks and Time-series. B. Aditya Prakash Lecture #6: Hadoop and Graph Analysis

1 CS 6604: Data Mining Large Networks and Time-series B. Aditya Prakash Lecture #6: Hadoop and Graph Analysis

2 (some slides from Xiao Yu) NO SQL Prakash 20 CS 6604: DM Large Networks & Time-Series 2

3 Why No SQL?

4 RDBMS
- The predominant choice for storing data
  - Not so true for data miners, since we put much in txt files.
- First formulated in 1969 by Codd
- We are using RDBMS everywhere

5 Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"

6 When RDBMS met Web 2.0
Slide from Lorenzo Alberton, "NoSQL Databases: Why, what and when"

7 What to do if data is really large?
- Petabytes (exabytes, zettabytes, ...)
- Google processed 24 PB of data per day (2009)
- FB adds 0.5 PB per day

8 BIG data

9 What's Wrong with Relational DB?
- Nothing is wrong. You just need to use the right tool.
- Relational is hard to scale.
  - Easy to scale reads
  - Hard to scale writes

10 What's NoSQL?
- The misleading term "NoSQL" is short for "Not Only SQL".
- non-relational, schema-free, non-(quite)-ACID
  - More on ACID transactions later in class
- horizontally scalable, distributed, easy replication support
- simple API

11 Four (emerging) NoSQL Categories
- Key-value (K-V) stores
  - Based on Distributed Hash Tables / Amazon's Dynamo paper *
  - Data model: (global) collection of K-V pairs
  - Example: Voldemort
- Column Families
  - BigTable clones **
  - Data model: big table, column families
  - Example: HBase, Cassandra, Hypertable
* G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store", SOSP 07
** F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data", OSDI 06

12 Four (emerging) NoSQL Categories
- Document databases
  - Inspired by Lotus Notes
  - Data model: collections of K-V collections
  - Example: CouchDB, MongoDB
- Graph databases
  - Inspired by Euler & graph theory
  - Data model: nodes, relations, K-V on both
  - Example: AllegroGraph, VertexDB, Neo4j

13 Focus of Different Data Models
Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"

14 C-A-P "theorem"
[Figure: triangle of Consistency, Availability, and Partition Tolerance, with the labels "RDBMS" and "NoSQL (most)".]

15 When to use NoSQL?
- Bigness
- Massive write performance
  - Twitter generates 7 TB per day (2010)
- Fast key-value access
- Flexible schema or data types
- Schema migration
- Write availability
  - Writes need to succeed no matter what (CAP, partitioning)
- Easier maintainability, administration and operations
- No single point of failure
- Generally available parallel computing
- Programmer ease of use
- Use the right data model for the right problem
- Avoid hitting the wall
- Distributed systems support
- Tunable CAP tradeoffs
from http://highscalability.com/

16 Key-Value Stores
[Figure: a table in a relational db (columns id, hair_color, age, height) next to the equivalent store/domain in a key-value db.]
- Find users whose age is above 8?
- Find all attributes of user 923?
- Find users whose hair color is Red and age is 9? (Join operation)
- Calculate average age of all grad students?
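The questions on this slide can be tried against a toy store. The following is only a sketch: the store is a plain Python dict, and every user record and threshold in it is hypothetical (the values in the original figure did not survive). It shows that a lookup by key is direct, while any question about attribute values forces a scan of all records.

```python
# Hypothetical key-value store: key = user id, value = schema-free attributes.
store = {
    923: {"hair_color": "Red", "age": 18},
    924: {"hair_color": "Blue", "age": 34, "height": 170},
    925: {"hair_color": "Red", "age": 19},
}


def get_user(kv, user_id):
    """Key lookup: the one query a key-value store answers directly."""
    return kv.get(user_id)


def scan(kv, predicate):
    """Any query on attribute values has to visit every record."""
    return sorted(key for key, attrs in kv.items() if predicate(attrs))


all_attributes = get_user(store, 923)
older = scan(store, lambda a: a.get("age", 0) > 18)
red_and_19 = scan(store, lambda a: a.get("hair_color") == "Red" and a.get("age") == 19)
```

The `scan` helper stands in for what a relational engine would do with an index or a join; a real key-value system would usually push such queries to the application or to a secondary index.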

17 Voldemort in LinkedIn
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

18 Voldemort vs MySQL
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

19 Column Families: BigTable-like
F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data", OSDI 06

20 BigTable Data Model
The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page.
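A rough way to picture this data model is a nested map from row to column to versioned values. The sketch below is an illustration only: `TinyTable`, `row_key`, and the sample cell values are hypothetical names and data, and the per-cell timestamps follow the description in the cited Bigtable paper rather than anything stated on the slide.

```python
from collections import defaultdict


def row_key(hostname):
    """Reverse the host name so pages of one domain sort next to each other."""
    return ".".join(reversed(hostname.split(".")))


class TinyTable:
    """Toy in-memory table: row -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self.rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent version of a cell, or None if it is empty."""
        versions = self.rows[row][column]
        return versions[max(versions)] if versions else None


table = TinyTable()
row = row_key("www.cnn.com")
table.put(row, "contents:", 3, "<html>v1</html>")
table.put(row, "contents:", 6, "<html>v2</html>")
table.put(row, "anchor:cnnsi.com", 9, "CNN")
```

Keeping the column family as a prefix of the column name (`contents:`, `anchor:`) is just one simple encoding; it makes it easy to see which cells belong to the same family.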

21 BigTable Performance

22 Document Database - MongoDB
[Figure: a table in a relational db next to documents in a collection.]
- Initial release 2009
- Open source, document db
- JSON-like document with dynamic schema

23 MongoDB Product Deployment
And much more

24 Graph Database
Data Model Abstraction:
- Nodes
- Relations
- Properties

25 Neo4j - Build a Graph
Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"

26 A Debatable Performance Evaluation

27 Conclusion
Use the right data model for the right problem

28 MAP-REDUCE AND HADOOP

29 MapReduce
- Challenges:
  - How to distribute computation?
  - Distributed/parallel programming is hard
- Map-reduce addresses all of the above
  - Google's computational/data manipulation model
  - Elegant way to work with big data

30 Single Node Architecture
[Figure: a single machine with CPU, Memory, and Disk; "Machine Learning, Statistics" is marked at the memory level and "Classical Data Mining" at the disk level.]

31 Motivation: Google Example
- 20+ billion web pages x 20KB = 400+ TB
- 1 computer reads 30-35 MB/sec from disk
  - ~4 months to read the web
- ~1,000 hard drives to store the web
- Takes even more to do something useful with the data!
- Today, a standard architecture for such problems is emerging:
  - Cluster of commodity Linux nodes
  - Commodity network (ethernet) to connect them
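The estimate on this slide can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes decimal units (1 KB = 1,000 bytes) and takes the upper end of the quoted read rate; the variable names are illustrative.

```python
# Back-of-the-envelope check of the numbers on the slide (decimal units assumed).
pages = 20e9                  # 20+ billion web pages
bytes_per_page = 20e3         # 20 KB per page
total_bytes = pages * bytes_per_page

read_rate = 35e6              # upper end of the quoted 30-35 MB/sec
seconds = total_bytes / read_rate
days = seconds / 86_400
months = days / 30

total_terabytes = total_bytes / 1e12
gb_per_drive = total_bytes / 1_000 / 1e9   # spread over ~1,000 drives
```

With these assumptions the web is about 400 TB, a single disk needs a little over four months to read it once, and each of the roughly 1,000 drives holds about 400 GB.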

32 Cluster Architecture
[Figure: racks of nodes, each node with CPU, Mem, and Disk, connected by a switch per rack and a backbone switch between racks.]
- 1 Gbps between any pair of nodes in a rack
- 2-10 Gbps backbone between racks
- Each rack contains 16-64 nodes
In 2011 it was guestimated that Google had 1M machines

34 Large-scale Computing
- Large-scale computing for data mining problems on commodity hardware
- Challenges:
  - How do you distribute computation?
  - How can we make it easy to write distributed programs?
  - Machines fail:
    - One server may stay up 3 years (1,000 days)
    - If you have 1,000 servers, expect to lose 1/day
    - People estimated Google had ~1M machines in 2011
      - 1,000 machines fail every day!
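The failure estimate follows from dividing the number of machines by the mean lifetime of one machine. The helper below is a minimal sketch of that arithmetic; `expected_failures_per_day` is a hypothetical name, and the model assumes failures are spread evenly over time.

```python
def expected_failures_per_day(num_servers, mean_lifetime_days=1_000):
    """Expected daily failures if each server lasts mean_lifetime_days on average."""
    return num_servers / mean_lifetime_days


small_cluster = expected_failures_per_day(1_000)       # about one failure per day
google_scale = expected_failures_per_day(1_000_000)    # about a thousand per day
```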

35 Idea and Solution
- Issue: Copying data over a network takes time
- Idea:
  - Bring computation close to the data
  - Store files multiple times for reliability
- Map-reduce addresses these problems
  - Google's computational/data manipulation model
  - Elegant way to work with big data
  - Storage Infrastructure: File system
    - Google: GFS. Hadoop: HDFS
  - Programming model
    - Map-Reduce

36 Hadoop VS NoSQL
- Hadoop: computing framework
  - Supports data-intensive applications
  - Includes MapReduce, HDFS etc. (we will study MR mainly next)
- NoSQL: Not only SQL databases
  - Can be built ON Hadoop. E.g. HBase.

37 Storage Infrastructure
- Problem: If nodes fail, how to store data persistently?
- Answer: Distributed File System:
  - Provides global file namespace
  - Google GFS; Hadoop HDFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common

38 Distributed File System
- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64MB
  - Each chunk replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node
  - a.k.a. Name Node in Hadoop's HDFS
  - Stores metadata about where files are stored
  - Might be replicated
- Client library for file access
  - Talks to master to find chunk servers
  - Connects directly to chunk servers to access data
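The chunking and replica placement described above can be sketched as follows. This is not how GFS or HDFS actually place replicas; `split_into_chunks`, `place_replicas`, the rack and server names, and the round-robin policy are all assumptions chosen to illustrate "contiguous chunks, each replicated on different racks".

```python
def split_into_chunks(file_size, chunk_size):
    """Return (offset, length) pairs that cover the file contiguously."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, servers_by_rack, replicas=3):
    """Hypothetical round-robin policy: each replica of a chunk goes to a different rack."""
    racks = sorted(servers_by_rack)
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk in range(num_chunks):
        chosen = []
        for r in range(replicas):
            rack = racks[(chunk + r) % len(racks)]
            servers = servers_by_rack[rack]
            chosen.append(servers[chunk % len(servers)])
        placement[chunk] = chosen
    return placement


# Sizes in MB: a 150 MB file with 64 MB chunks.
chunks = split_into_chunks(150, 64)
cluster = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"], "rack-c": ["c1", "c2"]}
placement = place_replicas(len(chunks), cluster, replicas=3)
```

The `placement` dictionary plays the role of the master's metadata: a client would ask for it once and then read the chunk directly from one of the listed servers.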

39 Distributed File System
- Reliable distributed file system
- Data kept in "chunks" spread across machines
- Each chunk replicated on different machines
  - Seamless recovery from disk or machine failure
[Figure: chunks of files C and D replicated across Chunk server 1, Chunk server 2, Chunk server 3, ..., Chunk server N.]
- Bring computation directly to the data!
- Chunk servers also serve as compute servers

40 Programming Model: MapReduce
- Warm-up task:
  - We have a huge text document
  - Count the number of times each distinct word appears in the file
- Sample application:
  - Analyze web server logs to find popular URLs

41 Task: Word Count
- Case 1: File too large for memory, but all <word, count> pairs fit in memory
- Case 2: Count occurrences of words:
  - words(doc.txt) | sort | uniq -c
  - where words takes a file and outputs the words in it, one per line
- Case 2 captures the essence of MapReduce
  - Great thing is that it is naturally parallelizable

42 MapReduce: Overview
- Sequentially read a lot of data
- Map: Extract something you care about
- Group by key: Sort and Shuffle
- Reduce: Aggregate, summarize, filter or transform
- Write the result
Outline stays the same, Map and Reduce change to fit the problem

43 MapReduce: The Map Step
[Figure: each input key-value pair (k, v) passes through map, which emits zero or more intermediate key-value pairs.]

44 MapReduce: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key-value groups (k, [v, v, ...]); reduce turns each group into output key-value pairs.]

45 More Specifically
- Input: a set of key-value pairs
- Programmer specifies two methods:
  - Map(k, v) -> <k', v'>*
    - Takes a key-value pair and outputs a set of key-value pairs
      - E.g., key is the filename, value is a single line in the file
    - There is one Map call for every (k, v) pair
  - Reduce(k', <v'>*) -> <k', v''>*
    - All values v' with same key k' are reduced together and processed in v' order
    - There is one Reduce function call per unique key k'

46 MapReduce: Word Counting
Big document (sequentially read the data; only sequential reads):
  "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need ..."
MAP (provided by the programmer): Read input and produce a set of (key, value) pairs
  (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...
Group by key: Collect all pairs with same key
  (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) ...
Reduce (provided by the programmer): Collect all values belonging to the key and output
  (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...

47 Word Count Using MapReduce
  map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
      emit(w, 1)

  reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
      result += v
    emit(key, result)
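The pseudocode above can be run on a single machine to see the three phases at work. The sketch below is a local simulation, not Hadoop code: `run_mapreduce` is a hypothetical driver that performs the group-by-key step in memory, and the sample document is made up.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name; value: text of the document."""
    for word in value.split():
        yield (word, 1)


def reduce_fn(key, values):
    """key: a word; values: the counts emitted for that word."""
    yield (key, sum(values))


def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)          # group by key
    output = []
    for k2 in sorted(groups):              # sorted, as after a shuffle and sort
        output.extend(reducer(k2, groups[k2]))
    return output


counts = dict(run_mapreduce(
    [("doc.txt", "the crew of the space shuttle the crew")], map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` are specific to word counting; swapping them out while keeping the driver unchanged mirrors the point that the outline stays the same and Map and Reduce change to fit the problem.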

48 Map-Reduce (MR) as SQL
  select count(*) from DOCUMENT group by word
The slide annotates the parts of this query with the labels "Reducer" and "Mapper".

49 Map-Reduce: Environment
Map-Reduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the group by key step
- Handling machine failures
- Managing required inter-machine communication

50 Map-Reduce: A diagram
[Figure: a big document flows through three stages.]
- MAP: Read input and produce a set of key-value pairs
- Group by key: Collect all pairs with same key (Hash merge, Shuffle, Sort, Partition)
- Reduce: Collect all values belonging to the key and output

51 Map-Reduce: In Parallel
All phases are distributed with many tasks doing the work

52 Map-Reduce
- Programmer specifies:
  - Map and Reduce and input files
- Workflow:
  - Read inputs as a set of key-value pairs
  - Map transforms input kv-pairs into a new set of k'v'-pairs
  - Sorts & Shuffles the k'v'-pairs to output nodes
  - All k'v'-pairs with a given k' are sent to the same reduce
  - Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
  - Write the resulting pairs to files
- All phases are distributed with many tasks doing the work
[Figure: Input 0, Input 1, Input 2 feed Map 0, Map 1, Map 2; a Shuffle stage feeds Reduce 0 and Reduce 1, which write Out 0 and Out 1.]
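The rule that all k'v'-pairs with a given k' reach the same reduce is usually enforced by a deterministic partition function. The sketch below is one way to do it, under stated assumptions: `partition` and `shuffle` are hypothetical helpers, and CRC32 is used only because Python's built-in `hash` for strings changes between runs.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Deterministic partitioner: the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_pairs, num_reducers):
    """Route every (k', v') pair to the reducer that owns k'."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


pairs = [("the", 1), ("crew", 1), ("the", 1), ("space", 1), ("the", 1)]
buckets = shuffle(pairs, num_reducers=2)
owners_of_the = [i for i, bucket in enumerate(buckets) if "the" in bucket]
```

Each entry of `buckets` is what one reduce task would receive: a set of keys, each with the complete list of its values.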

53 Data Flow
- Input and final output are stored on a distributed file system (FS):
  - Scheduler tries to schedule map tasks "close" to physical storage location of input data
- Intermediate results are stored on local FS of Map and Reduce workers
- Output is often input to another MapReduce task

54 Coordination: Master
Master node takes care of coordination:
- Task status: (idle, in-progress, completed)
- Idle tasks get scheduled as workers become available
- When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
- Master pushes this info to reducers
- Master pings workers periodically to detect failures

55 Dealing with Failures
- Map worker failure
  - Map tasks completed or in-progress at worker are reset to idle
  - Reduce workers are notified when task is rescheduled on another worker
- Reduce worker failure
  - Only in-progress tasks are reset to idle
  - Reduce task is restarted
- Master failure
  - MapReduce task is aborted and client is notified
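The worker-failure rules can be expressed as a small update over the master's task table. This is a sketch under assumptions: the task records, field names, and `handle_worker_failure` are hypothetical, and notification of reduce workers is left out.

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"


def handle_worker_failure(tasks, failed_worker):
    """Reset the failed worker's tasks to idle; return the ids to reschedule."""
    rescheduled = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        is_map = task["kind"] == "map"
        # Completed map output sits on the failed worker's local disk and is lost;
        # completed reduce output is already in the distributed file system.
        if task["state"] == IN_PROGRESS or (is_map and task["state"] == COMPLETED):
            task["state"] = IDLE
            task["worker"] = None
            rescheduled.append(task["id"])
    return rescheduled


tasks = [
    {"id": "m0", "kind": "map", "state": COMPLETED, "worker": "w1"},
    {"id": "m1", "kind": "map", "state": IN_PROGRESS, "worker": "w2"},
    {"id": "r0", "kind": "reduce", "state": COMPLETED, "worker": "w1"},
    {"id": "r1", "kind": "reduce", "state": IN_PROGRESS, "worker": "w1"},
]
reset_ids = handle_worker_failure(tasks, "w1")
```

The asymmetry between map and reduce tasks comes straight from the data flow: intermediate results live on local disks, final output lives in the distributed file system.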

56 GRAPH ANALYSIS ON HADOOP
Based on slides from Prof. U. Kang

57 Problem Definition
- Given: a BIG graph G(V, E)
- Compute:
  - Connected Components
  - Diameter
  - PageRank
  - ...
- Solution: Use Hadoop

58 GIM-V: One unified framework [Kang+, 2008]
YahooWeb: |V| = 1.4B, |E| = 6.6B
[Figure: count of connected components plotted against component size, with two annotated spikes.]
- 300-size cmpt X 500. Why?
- 1100-size cmpt X 65. Why?

59 Main Idea
- GIM-V: Generalized Iterative Matrix-Vector Multiplication
  - Extension of plain matrix-vector multiplication
  - Includes as special cases
    - Connected Components
    - PageRank
    - RWR
    - Diameters
    - ...

60 Main Idea: Intuition
Weighted Combination of Colors ~ Message Passing
$M \times v = v'$, for example $v'_4 = \sum_{i=1}^{4} m_{4,i}\, v_i$

61 $M \times v = v'$
Three implicit operations here:
- multiply $m_{i,j}$ and $v_j$: combine2 (message sending)
- sum $n$ multiplication results: combineAll (message combination)
- update $v'_i$: assign

62 Main Idea GIM-V
- Matrix represents edge(src, dest)
- Vector represents node values/labels
- Customizing the three operations

operations | Standard MV | Con. Cmpt. | PageRank | RWR | Diameter
combine2   | Multiply    |            |          |     |
combineAll | Sum         |            |          |     |
assign     | Assign      |            |          |     |

63 GIM-V

operations | Standard MV | Con. Cmpt. | PageRank | RWR | Diameter
combine2   | Multiply    | Bool. X    |          |     |
combineAll | Sum         | MIN        |          |     |
assign     | Assign      | MIN        |          |     |

64 GIM-V for CC
How many connected components? == Which node belongs to which component?
[Figure: an input graph with 8 nodes made of three groups; the output maps each node id to a component id.]

65 GIM-V for CC
[Figure: the init vector and the final vector of component ids for the same graph.]

66 GIM-V for CC
GIM-V for Connected Components: Sending Invitations, Accept the Smallest
  min(1, min(2))
  min(2, min(1,3))
  min(3, min(2,4))
  min(4, min(3))
  min(5, min(6))
  min(6, min(5))
  min(7, min(8))
  min(8, min(7))

67 GIM-V for CC
(same figure as slide 66)
1 GIM-V with MIN = find min. node ids within 1 hop

68 GIM-V for CC
(same figure as slide 66)
k GIM-V with MIN = find min. node ids within k hops

69 GIM-V for CC
(same figure as slide 66)
Max. iterations = diameter
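The "send invitations, accept the smallest" procedure can be written as repeated GIM-V steps with MIN. The code below is a single-machine sketch, not the Hadoop implementation: `gimv` and `connected_components` are hypothetical names, and the edge list (a path over nodes 1 to 4 plus the pairs 5-6 and 7-8) is an assumption read off the min expressions on the slide.

```python
def gimv(edges, v, combine2, combine_all, assign):
    """One GIM-V step: v'_i = assign(v_i, combine_all({combine2(m_ij, v_j)}))."""
    incoming = {}
    for i, j, m_ij in edges:
        incoming.setdefault(i, []).append(combine2(m_ij, v[j]))
    return {i: (assign(v[i], combine_all(incoming[i])) if i in incoming else v[i])
            for i in v}


def connected_components(nodes, undirected_edges):
    """Iterate GIM-V with MIN until no label changes; return (labels, rounds)."""
    edges = [(a, b, 1) for a, b in undirected_edges]
    edges += [(b, a, 1) for a, b in undirected_edges]
    v = {n: n for n in nodes}               # init vector: every node is its own id
    rounds = 0
    while True:
        new_v = gimv(edges, v,
                     combine2=lambda m_ij, v_j: m_ij * v_j,   # send own label
                     combine_all=min,                         # smallest invitation
                     assign=min)                              # keep the smaller id
        if new_v == v:
            return v, rounds
        v, rounds = new_v, rounds + 1


labels, rounds = connected_components(
    range(1, 9), [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8)])
```

On this graph the labels stop changing after three rounds, which matches the diameter of the largest component (the four-node path).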

70 GIM-V

operations | Standard MV | Con. Cmpt. | PageRank          | RWR                    | Diameter
combine2   | Multiply    | Multiply   | Multiply with c   | Multiply with c        | Multiply bit-vector
combineAll | Sum         | MIN        | Sum with rj prob. | Sum with restart prob. | BIT-OR()
assign     | Assign      | MIN        | Assign            | Assign                 | BIT-OR()
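The PageRank column of this table can be tried out on a tiny graph. The following is a sketch under assumptions: `gimv` and `pagerank` are hypothetical helpers, c = 0.85 is a conventional damping value rather than one given on the slide, "rj prob." is read as the random-jump term (1 - c)/n, and the example graph has no dangling nodes.

```python
def gimv(edges, v, combine2, combine_all, assign):
    """One GIM-V step over edges given as (i, j, m_ij) triples."""
    partial = {i: [] for i in v}
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, v[j]))
    return {i: assign(v[i], combine_all(partial[i])) for i in v}


def pagerank(links, n, c=0.85, iterations=50):
    """PageRank as GIM-V; links are (src, dst) pairs, every node has an out-link."""
    out_degree = [0] * n
    for src, _ in links:
        out_degree[src] += 1
    # Column-normalized matrix: m_ij = 1/outdeg(j) when j links to i.
    edges = [(dst, src, 1.0 / out_degree[src]) for src, dst in links]
    v = {i: 1.0 / n for i in range(n)}
    for _ in range(iterations):
        v = gimv(edges, v,
                 combine2=lambda m_ij, v_j: c * m_ij * v_j,       # multiply with c
                 combine_all=lambda xs: (1 - c) / n + sum(xs),    # sum + random jump
                 assign=lambda old, new: new)                     # plain assign
    return v


ranks = pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], n=3)
cycle = pagerank([(0, 1), (1, 2), (2, 0)], n=3)
```

Only the three lambdas differ from a connected-components run, which is the point of the unified framework: the iteration skeleton is shared and the operations are swapped.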

71 HDFS restrictions
- [R1] HDFS is location transparent
  - Users don't know which machine has which file
- [R2] A line is never split
  - A large file is split into pieces of a size (e.g. 256MB)
  - Users don't know the point of split

72 Fast GIM-V
Given R1 and R2, how to design faster algs. for GIM-V?
1. Block Multiplication
2. Clustering
3. Compression

73 Block Multiplication
1. Group elements together into 1 line
2. Storage for an element: 2 log n bits -> 2 log b bits
3. Adjust the MapReduce code (block multiplication)
b: block width; n: # of nodes
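The storage claim in step 2 can be made concrete with a short calculation. This is a rough sketch: `bits_for_index` and `element_index_bits` are hypothetical helpers, the block width of 32 is an arbitrary example, and the cost of storing each block's own position once per block is ignored.

```python
def bits_for_index(count):
    """Bits needed to address `count` distinct positions, i.e. ceil(log2(count))."""
    return max(count - 1, 0).bit_length()


def element_index_bits(n, block_width=None):
    """Bits for the (row, column) index of one matrix element.

    Without blocks the indices range over n nodes (2 log n bits); inside a
    b x b block they range over b positions only (2 log b bits).
    """
    side = n if block_width is None else block_width
    return 2 * bits_for_index(side)


n = 1_400_000_000                      # about the size of the YahooWeb graph
raw_bits = element_index_bits(n)
blocked_bits = element_index_bits(n, block_width=32)
```

For a graph with 1.4 billion nodes, each element index shrinks from 62 bits to 10 bits under these assumptions.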

74 Clustering and Compress
[Figure: the matrix is preprocessed (clustered) and its blocks are then compressed with ZIP.]

75 Performance
43x smaller, 9.2x faster

Method | Block Encoding? | Compression? | Clustering?
RAW    | No              | No           | No
NNB    | Yes             | No           | No
NCB    | Yes             | Yes          | No
CCB    | Yes             | Yes          | Yes

76 Many extensions and optimizations
- Local queries
- Diagonal block iterations
- ...

77 Conclusions
- Hadoop is a distributed data-intensive computing framework
- MapReduce
  - Simple programming paradigm
  - Surprisingly powerful (may not be suitable for all tasks though)
- Hadoop has specialized FileSystem, Master-Slave Architecture to scale up

78 NoSQL and Hadoop
- Hot area with several new problems
  - Good for academic research
  - Good for industry
= Fun AND Profit :)

79 POINTERS AND FURTHER READING

80 Implementations
- Google
  - Not available outside Google
- Hadoop
  - An open-source implementation in Java
  - Uses HDFS for stable storage
  - Download:
- Aster Data
  - Cluster-optimized SQL Database that also implements MapReduce

81 Cloud Computing
- Ability to rent computing by the hour
  - Additional services, e.g., persistent storage
- Amazon's "Elastic Compute Cloud" (EC2)
- Aster Data and Hadoop can both be run on EC2

82 Reading
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
  http://labs.google.com/papers/mapreduce.html
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
  http://labs.google.com/papers/gfs.html

83 Resources
- Hadoop Wiki
  - Introduction: http://wiki.apache.org/lucene-hadoop/
  - Getting Started: http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
  - Map/Reduce Overview: http://wiki.apache.org/lucene-hadoop/hadoopmapreduce
    http://wiki.apache.org/lucene-hadoop/hadoopmapredclasses
  - Eclipse Environment: http://wiki.apache.org/lucene-hadoop/eclipseenvironment
- Javadoc: http://lucene.apache.org/hadoop/docs/api/

84 Resources
- Releases from Apache download mirrors: http:// hadoop/
- Nightly builds of source: http://people.apache.org/dist/lucene/hadoop/nightly/
- Source code from subversion: http://lucene.apache.org/hadoop/version_control.html

85 Further Reading
- Programming model inspired by functional language primitives
- Partitioning/shuffling similar to many large-scale sorting systems
  - NOW-Sort ['97]
- Re-execution for fault tolerance
  - BAD-FS ['04] and TACC ['97]
- Locality optimization has parallels with Active Disks/Diamond work
  - Active Disks ['01], Diamond ['04]
- Backup tasks similar to Eager Scheduling in Charlotte system
  - Charlotte ['96]
- Dynamic load balancing solves similar problem as River's distributed queues
  - River ['99]
