4. Managing Big Data. Cloud Computing & Big Data MASTER ENGINYERIA INFORMÀTICA FIB/UPC. Fall Jordi Torres, UPC - BSC

1 4. Managing Big Data Cloud Computing & Big Data MASTER ENGINYERIA INFORMÀTICA FIB/UPC Fall Jordi Torres, UPC - BSC

2 Slides are only a presentation guide. We will discuss and debate additional concepts and ideas that come up during your participation (and we could skip part of the content). FEEL FREE TO PARTICIPATE!

3 Content Motivation HDFS and Hbase Alternatives to HDFS/Hbase? CAP Theorem Case Study: Cassandra Big Data AWS in our Desktop Guest Lecture: AENEAS HOMEWORK: AENEAS Hands-on 3

4 Relational DBs can't support everything
- The relational DB has ruled for 2-3 decades (capabilities, implementations, ...)
- Main problem: scalability
- Large data sets are growing too big for conventional storage and tools
[Figure: execution time of conventional systems versus data volume, from GBs to PBs]

5 Proposals
- Proposal: Hadoop + which DB?
[Figure: execution time versus data volume, from GBs to PBs]

6 Example: last.fm Last.fm is an Internet radio and music community website that offers many services to its users, such as free music streams and downloads, music and event recommendations, personalized charts, and much more. It was founded in 2002, and about 25 million people use it.

7 Example: last.fm The millions of people who use Last.fm generate huge amounts of data that need to be processed.

8 Example: last.fm Source: Marc de Palol - 8

9 Example: last.fm Source: Marc de Palol - 9

10 Example: last.fm One example of this is users transmitting information indicating which songs they are listening to (this is known as "scrobbling"). This data is processed and stored by Last.fm, so the user can access it directly (in the form of charts), and it is also used to make decisions about users' musical tastes and compatibility, and artist and track similarity. Source: Marc de Palol

11 Example: last.fm Source: Marc de Palol

12 Example: last.fm data from users userid, trackid, albumid, artistid web/api web nodes? Idea: Marc de Palol

14 Example: last.fm
Data from users (input example, 4 TB):
    user1, track1,...
    user1, track2,...
    user2, track1,...
Output:
    track1, #users
    track2, #users
How do we get from one to the other? Idea: Marc de Palol

15 Example: last.fm
Data from users (input example, 4 TB):
    user1, track1,...
    user1, track2,...
    user2, track1,...
Processing: HADOOP
Output:
    track1, #users
    track2, #users
Idea: Marc de Palol

16 Example: last.fm Problem: a user can listen to this song several times!!!! 16

17 Example: last.fm Problem: a user can listen to this song several times!!!! we want to get rid of duplicates 17

18 Example: last.fm Problem: a user can listen to this song several times!!!! Do you want the code? 18

19 Example: last.fm Solution? mapper( ) { } reduce( ) { } Idea: Marc de Palol

20 Example: last.fm Mapper? mapper(facebookid, songid) { output(songid, facebookid) } Idea: Marc de Palol

21 Example: last.fm Reduce? mapper(facebookid, songid) { output(songid, facebookid) } reduce(songid, List<facebookid> facebookids) { } Idea: Marc de Palol

22 Example: last.fm Reduce: so let's add the user IDs into a set
    mapper(facebookid, songid) {
        output(songid, facebookid)
    }
    reduce(songid, List<facebookid> facebookids) {
        Set uniqueusers = new Set()
        for (facebookid in facebookids) {
            uniqueusers.add(facebookid)
        }
        output(songid, uniqueusers.size())
    }
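The mapper/reducer pair above can be tried locally with a small in-memory simulation. This is a minimal sketch, not Hadoop code: `run_mapreduce` and the sample records are hypothetical, and the shuffle phase is imitated with a dictionary.

```python
from collections import defaultdict


def mapper(user_id, track_id):
    # Emit the track as the key so that all listeners of a track meet in one reducer.
    yield track_id, user_id


def reducer(track_id, user_ids):
    # A set drops repeated plays by the same user.
    yield track_id, len(set(user_ids))


def run_mapreduce(records):
    # Hypothetical local driver: group mapper output by key (the "shuffle"), then reduce.
    groups = defaultdict(list)
    for user_id, track_id in records:
        for key, value in mapper(user_id, track_id):
            groups[key].append(value)
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result


# user1 plays track1 twice, but must be counted only once.
plays = [("user1", "track1"), ("user1", "track2"), ("user2", "track1"), ("user1", "track1")]
unique_listeners = run_mapreduce(plays)
print(unique_listeners)
```

Keying the mapper output by track is what makes the deduplication cheap: every user ID for a given track ends up in the same reducer call.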

23 Example: last.fm Another important example for last.fm: log processing
- Input: log files. Assume each server could produce 10 GB of logs per day, and assume 500 servers.
- Output: we need to know how many distinct IP addresses accessed the cluster of web servers.
- Problem: size! 500 x 10 GB of logs.
- Technology and algorithm required?
Information from: Marc de Palol

24 Example: last.fm Another important example for last.fm: log processing
- Input: log files. Assume each server could produce 10 GB of logs per day, and assume 500 servers.
- Output: we need to know how many distinct IP addresses accessed the cluster of web servers.
- Problem: size! 500 x 10 GB of logs.
- Technology and algorithm required? You already have the answer.
Information from: Marc de Palol
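The volume in this example works out to 500 x 10 GB = 5,000 GB, roughly 5 TB of logs per day, and the distinct-IP count follows the same map-then-deduplicate pattern as the unique-listener problem. The sketch below is an illustration under assumptions: the log format (client IP as the first field), the sample lines, and the function names are all hypothetical.

```python
def map_log_lines(lines):
    # Assumption: the client IP is the first whitespace-separated field of each line.
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0], 1


def combine(pairs):
    # Combiner step: deduplicate IPs per server before anything crosses the network.
    return {ip for ip, _ in pairs}


def count_distinct_ips(logs_per_server):
    # Reduce step: union the per-server sets and count what is left.
    distinct = set()
    for lines in logs_per_server:
        distinct |= combine(map_log_lines(lines))
    return len(distinct)


servers = 500
gb_per_server_per_day = 10
daily_volume_tb = servers * gb_per_server_per_day / 1000  # decimal units assumed

sample_logs = [
    ["10.0.0.1 GET /radio", "10.0.0.2 GET /charts", "10.0.0.1 GET /charts"],
    ["10.0.0.2 GET /radio", "10.0.0.3 GET /events"],
]
distinct_ips = count_distinct_ips(sample_logs)
print(daily_volume_tb, distinct_ips)
```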

25 Reminder!!!! Source: CLOUDERA

26 Reminder!!!! Source: CLOUDERA

27 Reminder!!!! Source: CLOUDERA

28 Last.fm example: final words
There were several reasons for adopting Hadoop:
- The distributed filesystem provided redundant backups for the data stored on it (e.g., web logs, user listening data) at no extra cost.
- Scalability was simplified through the ability to add cheap, commodity hardware when required.
- The cost was right (free) at a time when Last.fm had limited financial resources.
- The open source code and active community meant that Last.fm could freely modify Hadoop to add custom features and patches.
- Hadoop provided a flexible framework for running distributed computing algorithms with a relatively easy learning curve.

29 MapReduce: excellent, but what are its data requirements?

32 MapReduce: data requirements! In general:
- The data expected is not relational data.
- This data does not require a schema and may be unstructured.
- Instead, data is consumed in chunks, which are divided among the nodes and fed to the map phase as key-value pairs.
- The data must be available in a distributed fashion, to serve each processing node.
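To make "consumed in chunks ... fed to the map phase as key-value pairs" concrete, here is a minimal sketch. It assumes line-oriented text input, keys that are byte offsets, and round-robin placement of chunks on nodes; the function and node names are hypothetical and this is not how Hadoop's input formats are implemented.

```python
def to_records(text):
    # Turn unstructured text into (byte offset, line) key-value pairs.
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def chunk(records, lines_per_chunk):
    # Group the records into fixed-size chunks; no schema is needed to do this.
    records = list(records)
    return [records[i:i + lines_per_chunk] for i in range(0, len(records), lines_per_chunk)]


def distribute(chunks, nodes):
    # Round-robin placement is an assumption made for this sketch.
    placement = {node: [] for node in nodes}
    for index, piece in enumerate(chunks):
        placement[nodes[index % len(nodes)]].append(piece)
    return placement


text = "user1,track1\nuser1,track2\nuser2,track1\n"
records = list(to_records(text))
placement = distribute(chunk(records, 2), ["node-a", "node-b"])
for node, pieces in placement.items():
    print(node, pieces)
```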

33 Parallel databases: shared nothing, shared disk, shared memory
[Figure: three architectures, each joining processors, memory, and disks through an interconnect]

34 MapReduce: data layer requirements The design and features of the data layer are important because they affect the ease with which data can be loaded and the results of computation extracted and searched.

35 Content Motivation HDFS and Hbase Alternatives to HDFS/Hbase? CAP Theorem Case Study: Cassandra Big Data AWS in our Desktop Guest Lecture: AENEAS HOMEWORK: AENEAS Hands-on 35

36 HDFS The standard storage mechanism for Hadoop: the Hadoop Distributed File System (HDFS)

37 HDFS: Hadoop Distributed File System
- Fault tolerance: assuming that failure will happen allows HDFS to run on commodity hardware.
- Streaming data access: HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.
- Extreme scalability: HDFS will scale to petabytes (current versions).
- Portability: HDFS is portable across platforms.

38 Hadoop: standard storage mechanism, the Hadoop Distributed File System (HDFS)
- Most HDFS applications need a write-once-read-many access model for files. By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.
- "Moving computation is cheaper than moving data" (locality of computation): due to the data volume, it is often much faster to move the program near the data, and HDFS has features to facilitate this.

39 HDFS: an example A given file is broken down into blocks (default = 64 MB),

40 HDFS: an example then the blocks are replicated across the cluster (default = 3 replicas)
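The two steps of this example, splitting into blocks and then replicating them, can be sketched as follows. The defaults (64 MB, 3 replicas) come from the slides; the 200 MB file, the DataNode names, and the rotation-based placement are assumptions for illustration only, since real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE_MB = 64       # default block size quoted on the slide
REPLICATION_FACTOR = 3   # default replication quoted on the slide


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    # The last block may be only partially filled, hence the ceiling.
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    # Assumption: simple rotation over the node list, so that the replicas
    # of one block always land on different nodes.
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical DataNode names
blocks = block_count(200)              # a hypothetical 200 MB file
placement = place_replicas(blocks, nodes)
for block, holders in placement.items():
    print(block, holders)
```

With three copies of every block on distinct nodes, losing any single node leaves at least two readable copies of each block.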

41 MapReduce: resource management
- Scheduling: a given job is broken down into tasks, then tasks are scheduled to be as close to the data as possible
- Optimized for batch processing
- Failure recovery

42 Hadoop: standard storage mechanism Starting point / 42

43 Hadoop: standard storage mechanism
HDFS interface:
- Similar to that of regular filesystems.
- Can only store and retrieve data, not index it.
- Simple random access to data is not possible.
Solution: higher-level layers such as HBase have been created to provide finer-grained functionality to Hadoop deployments.
[Diagram: MapReduce on top of HBase on top of HDFS]

44 HBase, the Hadoop database
- Creates indexes and offers fast, random access to its content
- Modeled after Google's BigTable
- Uses HDFS as its storage system
- Belongs to the NoSQL universe, similar to Cassandra, Hypertable, ...

45 HBase versus HDFS (a brief comparison)
HDFS is optimized for:
- Large files
- Sequential access (high throughput)
- Append-only writes
Use it for fact tables that are mostly append-only and require sequential full-table scans.
HBase is optimized for:
- Small records (but many of them)
- Random access
- Atomic record updates
Use it for dimension lookup tables that are updated frequently and require random, low-latency lookups.

46 Content Motivation HDFS and Hbase Alternatives to HDFS/Hbase? CAP Theorem Case Study: Cassandra Big Data AWS in our Desktop Guest Lecture: AENEAS HOMEWORK: AENEAS Hands-on 46

47 Alternatives to HBase/HDFS?
- Cassandra, an Apache project, originated at Facebook and is now in production at many large-scale websites (and also at BSC).
- Hypertable was created at Zvents and spun out as an open source project.
- Both are scalable column-store databases that follow the BigTable pattern, similar to HBase.

48 And dozens more: List of NoSQL Databases [currently 150+]

49 NoSQL
- The concept has gained momentum in recent years
- Today it is a mature and efficient alternative that can help us solve scalability and performance problems (e.g. online applications with thousands of concurrent users and millions of hits a day)

50 Different types of NoSQL systems
- Distributed key-value systems: Amazon's S3 key-value store (Dynamo), Voldemort (LinkedIn)
- Column-based systems: BigTable (Google), HBase, Cassandra
- Document-based systems: CouchDB, MongoDB
- Graph DBs

51 DB data model
- Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
- Key-value systems basically support get, put, and delete operations based on a primary key.
- Column-oriented systems still use tables but have no joins (joins must be handled within your application). They store data by column, as opposed to traditional row-oriented databases. This makes aggregations much easier.
- Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It's very easy to map data from object-oriented software to these systems.
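Two of the models above are easy to illustrate in a few lines. The sketch below shows a toy key-value store limited to get, put, and delete, and then the same three records laid out by row and by column to show why aggregating one attribute is simpler in a column layout. The class name and the sample records are hypothetical.

```python
class KeyValueStore:
    """Toy key-value store: only get, put and delete by primary key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is a no-op in this sketch.
        self._data.pop(key, None)


# The same three records, laid out by row and by column.
rows = [
    {"user": "john", "plays": 12},
    {"user": "maria", "plays": 7},
    {"user": "linda", "plays": 30},
]
columns = {
    "user": [r["user"] for r in rows],
    "plays": [r["plays"] for r in rows],
}

# Aggregating one attribute touches a single column...
total_from_columns = sum(columns["plays"])
# ...whereas the row layout has to visit every full record.
total_from_rows = sum(r["plays"] for r in rows)

store = KeyValueStore()
store.put("john", rows[0])
store.delete("maria")
print(total_from_columns, store.get("john"))
```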

52 DB data model
Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
- Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. This guarantees that a transaction cannot be left in an incomplete state.
- Consistency ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including but not limited to constraints, cascades, triggers, and any combination thereof.
- Isolation refers to the requirement that no transaction should be able to interfere with another transaction. One way of achieving this is to ensure that no transactions that affect the same rows can run concurrently, since their sequence, and hence the outcome, might be unpredictable. This property of ACID is often partly relaxed due to the huge speed decrease this type of concurrency management entails.
- Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently. If the database crashes immediately thereafter, it should be possible to restore the database to the state after the last transaction committed.
Source: Wikipedia
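Atomicity and consistency can be observed with Python's built-in `sqlite3` module. This is a minimal sketch: the `accounts` table, the names, and the amounts are hypothetical. The second transfer violates a CHECK constraint halfway through, so the whole transaction is rolled back and the first half of it leaves no trace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    # `with conn` commits on success and rolls back if any statement raises.
    try:
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, target)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```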

53 The problems with relational DBs
- RDBMSs scale up well on a single node (vertical scalability), but at a price!
- Apparent solution? Replication and caches
- Vertical partitioning: different tables on different servers
- Horizontal partitioning: rows of the same table on different servers
- Good for fault tolerance, for sure
- OK for many concurrent reads
- Not much help with writes, if we want to keep ACID

54 There's a reason: the CAP theorem Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery

56 Content Motivation HDFS and Hbase Alternatives to HDFS/Hbase? CAP Theorem Case Study: Cassandra Big Data AWS in our Desktop Guest Lecture: AENEAS HOMEWORK: AENEAS Hands-on 56

57 NoSQL and the CAP theorem
What about NoSQL scalability?
- Vertical: more CPU, memory, ... (price!)
- Horizontal: more servers, and better fault tolerance of the global system

58 The CAP theorem Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery 58

59 There's a reason: the CAP theorem Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery

60 The CAP theorem: proof Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery

61 The problem with RDBMS proof Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery 61

62 Scale-out requires partitions
IMPORTANT! A distributed system can offer only two of these three characteristics at the same time.
Most large web-based systems choose availability over consistency.

63 CAP: choose two, per operation
- C: Consistency, A: Availability, P: Partition tolerance
- CA: available and consistent, unless there is a partition.
- CP: always consistent, even in a partition, but a reachable replica may deny service without quorum.
- AP: a reachable replica provides service even in a partition, but may be inconsistent.
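The difference between the CP and AP choices during a partition can be sketched with a toy replica. Everything here is hypothetical (the `Replica` class, the majority-quorum rule, the stored value); it only illustrates that a CP replica refuses to answer without quorum, while an AP replica answers with whatever it has.

```python
class Replica:
    def __init__(self, mode, cluster_size, value):
        self.mode = mode                  # "CP" or "AP"
        self.cluster_size = cluster_size
        self.value = value                # last value this replica has seen

    def read(self, reachable_replicas):
        # `reachable_replicas` counts this replica plus the peers it can contact.
        has_quorum = reachable_replicas > self.cluster_size // 2
        if self.mode == "CP" and not has_quorum:
            # Consistency first: deny service rather than risk a stale answer.
            raise RuntimeError("no quorum: refusing to answer")
        # AP (or CP with quorum): answer; without quorum the value may be stale.
        return self.value


cp = Replica("CP", cluster_size=5, value="v1")
ap = Replica("AP", cluster_size=5, value="v1")

# A partition leaves each replica able to reach only 2 of the 5 nodes.
ap_answer = ap.read(reachable_replicas=2)
try:
    cp_answer = cp.read(reachable_replicas=2)
except RuntimeError as error:
    cp_answer = str(error)
print(ap_answer, "|", cp_answer)
```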

64 Visual Guide to NoSQL System Source: 64

66 Consistent, Available (CA) Systems have trouble with partitions and typically deal with it with replication. Examples of CA systems include: Traditional RDBMSs like Postgres, MySQL, etc (relational) Vertica (column-oriented) Aster Data (relational) Greenplum (relational) Source: 66

67 Consistent, Partition-Tolerant (CP) Systems have trouble with availability while keeping data consistent across partitioned nodes. Examples of CP systems include: BigTable (column-oriented/tabular) Hypertable (column-oriented/tabular) HBase (column-oriented/tabular) MongoDB (document-oriented) Terrastore (document-oriented) Redis (key-value) Scalaris (key-value) MemcacheDB (key-value) Berkeley DB (key-value) Source: 67

68 Available, Partition-Tolerant (AP) Systems achieve "eventual consistency" through replication and verification. Examples of AP systems include: Dynamo (key-value) Voldemort (key-value) Tokyo Cabinet (key-value) KAI (key-value) Cassandra (column-oriented/tabular) CouchDB (document-oriented) SimpleDB (document-oriented) Riak (document-oriented) Source: 68

69 Eventual consistency
- If no updates occur for a while, all updates eventually propagate through the system and all nodes become consistent
- Eventually, a node is either updated or removed from service
- Can be implemented with a gossip protocol
- Amazon's Dynamo popularized this approach
- Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
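The propagation idea can be simulated in a few lines. This is a simplified sketch, not Dynamo's or Cassandra's protocol: it assumes last-write-wins merging on a timestamp and, to stay deterministic, lets each node exchange state with its ring neighbour, whereas real gossip protocols usually pick peers at random.

```python
def merge(a, b):
    # Last-write-wins: keep the (timestamp, value) pair with the newer timestamp.
    return max(a, b)


def gossip_round(states):
    # Assumption: node i exchanges state with its ring neighbour i + 1.
    n = len(states)
    for i in range(n):
        j = (i + 1) % n
        newest = merge(states[i], states[j])
        states[i] = states[j] = newest


def rounds_until_consistent(states, max_rounds=10):
    rounds = 0
    while len(set(states)) > 1 and rounds < max_rounds:
        gossip_round(states)
        rounds += 1
    return rounds


# Five replicas; only node 2 has seen the latest update (timestamp 2).
states = [(1, "old"), (1, "old"), (2, "new"), (1, "old"), (1, "old")]
rounds = rounds_until_consistent(states)
print(rounds, states)
```

Once updates stop arriving, repeated exchanges leave every replica holding the newest version, which is the "eventual" part of eventual consistency.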

70 NoSQL alternatives But the differences between NoSQL databases are much bigger than they ever were between one SQL database and another! This puts a bigger responsibility on software architects to choose the appropriate one for a project right at the beginning.

71 Content Motivation HDFS and Hbase Alternatives to HDFS/Hbase? CAP Theorem Case Study: Cassandra Big Data AWS in our Desktop Guest Lecture: AENEAS HOMEWORK: AENEAS Hands-on 71

72 Cassandra: main features Cassandra does not support relationships between column families ("tables"), so there are no foreign keys or join operations. Knowing this, the best practice when designing a data model is to keep related data in the same column family. In this section we will review only the main features of Cassandra, as an example.

73 Architecture The architecture of Cassandra is completely decentralized and peer-to-peer, meaning all nodes in a Cassandra cluster are equivalent and provide the same functionality: they receive read and write requests, or forward them to other nodes.
- Peer-to-peer, distributed system
- All nodes are the same
- Data is partitioned
- Custom data replication

74 Partitioning Cassandra implements automatic partitioning and replication mechanisms to decide which nodes are in charge of each replica. How? PARTITIONER Divide the data across the nodes in the cluster Each node is responsible for a range of the overall data Source: Juan Luis Pérez researcher at BSC (EEDC 2012 master course) 74

75 Partitioning Node A Node B Node C Node D Source: Juan Luis Pérez researcher at BSC (EEDC 2012 master course) 75

76 Partitioning: the row key determines node placement
    Row key   name    pass   url
    raiser    john    ****   icann.org
    trucker   james   ****   w3.org
    dumper    maria   ****
    biker     linda   ****

Partitioning Range of MD5 hash [000..1 400.

77 Partitioning Range of MD5 hash [ ] [ ] [ c00..0] [c ] 77

78 Partitioning Row Key MD5 Hash raiser trucker dumper biker 65236c... a113f4... d4ab [ ] [ c00..0] [ ] [c ]
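The partitioning slides can be approximated with a short sketch: hash the row key with MD5, then find which node owns the range containing the hash. The four node names come from the slides, but the equal split of the token space and the mapping of ranges to nodes are assumptions, because the range boundaries are not legible in the source.

```python
import hashlib
from bisect import bisect_right

TOKEN_SPACE = 2 ** 128  # MD5 produces 128-bit digests

# Assumption: four nodes, each owning one quarter of the token space.
NODES = ["Node A", "Node B", "Node C", "Node D"]
RANGE_STARTS = [i * TOKEN_SPACE // len(NODES) for i in range(len(NODES))]


def token_for(row_key):
    # Interpret the MD5 digest of the row key as a 128-bit integer token.
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


def node_for_token(token):
    # Find the last range start that is <= token.
    return NODES[bisect_right(RANGE_STARTS, token) - 1]


def node_for(row_key):
    return node_for_token(token_for(row_key))


for key in ["raiser", "trucker", "dumper", "biker"]:
    print(key, "->", node_for(key))
```

Because placement depends only on the hash of the row key, any node can compute where a row lives without consulting a central directory.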

83 Replication Remember: Cassandra implements automatic partitioning and replication mechanisms to decide which nodes are in charge of each replica The user only needs to configure the number of replicas and the system assigns each replica to a node in the cluster. 83

84 Replication
- Cassandra stores multiple copies of rows on multiple nodes
- Replication factor = number of replicas
- Replica placement strategy: SimpleStrategy (default) or NetworkTopologyStrategy
- Both the replication factor and the placement strategy are configurable

85 Replication: SimpleStrategy
- The first replica is determined by the partitioner
- Additional replicas of the row are placed on the next nodes clockwise in the ring
[Figure: the original row "raiser" and its copies on the ring]
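The clockwise rule is simple enough to write down directly. This is an illustrative sketch of the rule as described on the slide, not Cassandra's implementation; the ring order, the function name, and the choice of Node C as the partitioner's pick are assumptions.

```python
def simple_strategy(ring, primary_index, replication_factor):
    # The first replica is the node chosen by the partitioner; the remaining
    # replicas go to the next nodes clockwise, wrapping around the ring.
    count = min(replication_factor, len(ring))
    return [ring[(primary_index + step) % len(ring)] for step in range(count)]


ring = ["Node A", "Node B", "Node C", "Node D"]

# Suppose the partitioner placed the row on Node C, with replication factor 3.
replicas = simple_strategy(ring, ring.index("Node C"), 3)
print(replicas)
```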

86 Replication: NetworkTopologyStrategy
- Allows replication between different racks, in one data center or across multiple data centers
- Reliability and performance, among others

87 Consistency The goal of current distributed key-value stores such as Cassandra is to serve read and write operations, just like any database system. However, while traditional databases provide strong consistency guarantees for replicated data by controlling the concurrent execution of transactions, Cassandra provides tunable consistency in order to favour scalability and availability.

88 Consistency Data consistency is tunable by the user when queries are performed: depending on the desired level of consistency, operations can either return as soon as possible or wait until a majority of the nodes, or all of them, respond.
- Tunable data consistency
- Choose between strong and eventual consistency
- Consistency is set per operation (reads and writes)

89 Strategy for Read 89 89

90 Strategy for Writes 90 90

91 Strong/weak consistency? As can be derived from their descriptions, strong consistency can only be achieved when using the (Quorum and) All consistency levels. Operations that use weaker consistency levels, such as Zero, Any and One, aren't guaranteed to read the most recent data. However, this weaker consistency provides some flexibility for applications that can benefit from better performance and don't have strong consistency needs. Imagine your Facebook wall!
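One common way to reason about these levels is the quorum-overlap rule: if the number of replicas read (R) plus the number written (W) exceeds the replication factor (N), every read overlaps the latest write in at least one replica. The rule is standard background rather than something stated on the slides, and the sketch below covers only the One, Quorum, and All levels; the function names are hypothetical.

```python
def replicas_required(level, replication_factor):
    # Number of replicas that must respond for each consistency level.
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def is_strongly_consistent(read_level, write_level, replication_factor):
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    # If R + W > N, the read set and the write set share at least one replica.
    return r + w > replication_factor


n = 3
quorum_both = is_strongly_consistent("QUORUM", "QUORUM", n)
one_both = is_strongly_consistent("ONE", "ONE", n)
print(quorum_both, one_both)
```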

92 Caching
- Data is first written to a commit log for durability; the log is local to the node (for disaster recovery purposes)
- It is then written to an in-memory structure (memtable) on the node that stores the row
- And then to disk (SSTable) once the memtable is full
- Data durability is assured
[Diagram: commit log, memtable, SSTable]
Source: Juan Luis Pérez, researcher at BSC (EEDC 2012 master course)
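The write path described above (commit log, then memtable, then SSTable) can be sketched with plain Python containers. This is a toy model under assumptions: lists and dictionaries stand in for files on disk, the memtable limit of two entries is arbitrary, and clearing the commit log after a flush is a simplification.

```python
class Node:
    def __init__(self, memtable_limit):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory structure
        self.sstables = []     # immutable, sorted tables flushed to disk

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durability first
        self.memtable[key] = value             # 2. then the in-memory structure
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. to disk once the memtable is full

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []   # assumption: flushed entries no longer need replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None


node = Node(memtable_limit=2)
node.write("raiser", "john")
node.write("trucker", "james")   # the memtable reaches its limit and is flushed
node.write("biker", "linda")
print(node.sstables, node.memtable, node.commit_log)
```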

93 Content Motivation HDFS and Hbase Alternatives to HDFS/Hbase? CAP Theorem Case Study: Cassandra Big Data AWS in our Desktop Guest Lecture: AENEAS HOMEWORK: AENEAS Hands-on 93

94 What do we need?
- An Amazon AWS account
- Eclipse for any platform
- 5 minutes
Felipe Caicedo

95 Instructions (I) 1. Open Eclipse and click Help in the toolbar 2. Click Install New Software

Instructions (II) 1. Enter https://aws.amazon.

96 Instructions (II) 1. Enter in the first input text, then click Add 2. Enter the name of the repository and click Ok 96 Felipe Caicedo

97 Instructions (III) 1. Select all elements of the list (if needed) 2. Continue with the wizard 97 Felipe Caicedo

98 Instructions (IV) 1. To finish the wizard, accept the terms and conditions and Click Finish 2. After the installation, restart Eclipse 98 Felipe Caicedo

99 Instructions (V) 1. You should now see an icon like the one in this image; click on that icon 2. Click on Preferences to configure the Amazon AWS account

100 Instructions (VI) 1. Click on "Find your existing AWS security credentials" to configure the account 2. You should see a web page like the image below

101 Instructions (VII) 1. Copy the Access Key ID and the Secret Access Key (you can create a new access key if you don't have one)

102 Instructions (VIII) 1. Click on Add account, enter the name of the new account and, finally, enter the Access Key ID and the Secret Access Key copied previously 2. Click OK

103 Instructions (IX) 1. Click on Show AWS Explorer View 2. You should see a view like the image below (the position of this view depends on your Eclipse configuration)

104 Instructions (X) Clicking on any item, you can see the corresponding view 104 Felipe Caicedo

105 Instructions (XI) 1. Clicking on Amazon DynamoDB for instance 2. You should see a view like this (with your tables, if created) 105 Felipe Caicedo

106 Instructions (creating an AWS project) 1. File 2. New 3. Other 4. Select AWS Java Project

107 References Download Eclipse AWS toolkit Creating new AWS project Thank you to Felipe Caicedo (FIB student) for producing these slides

108 Content Motivation HDFS and Hbase Alternatives to HDFS/Hbase? CAP Theorem Case Study: Cassandra Big Data AWS in our Desktop Guest Lecture: AENEAS NEXT CLASS HOMEWORK: AENEAS Hands-on 108
