4. Managing Big Data. Cloud Computing & Big Data MASTER ENGINYERIA INFORMÀTICA FIB/UPC. Fall Jordi Torres, UPC - BSC


1 4. Managing Big Data Cloud Computing & Big Data MASTER ENGINYERIA INFORMÀTICA FIB/UPC Fall Jordi Torres, UPC - BSC

2 These slides are only a guide for the presentation. We will discuss and debate additional concepts and ideas that come up during your participation (and we may skip part of the content). FEEL FREE TO PARTICIPATE!

3 Content
- Motivation
- HDFS and HBase
- Alternatives to HDFS/HBase?
- CAP Theorem
- Case Study: Cassandra
- Big Data AWS in our Desktop
- Guest Lecture: AENEAS
- HOMEWORK: AENEAS Hands-on

4 Relational DBs can't support everything. The relational DB has ruled for 2-3 decades (capabilities, implementations, ...). Main problem: scalability. Large data sets are growing too big for conventional storage and tools. [Figure: execution time of conventional systems against data volume, from GBs to PBs]

5 Proposal: Hadoop + which DB? [Figure: execution time against data volume, from GBs to PBs]

6 Example: last.fm. An Internet radio and music community website that offers many services to its users, such as free music streams and downloads, music and event recommendations, personalized charts, and much more. Founded in 2002. About 25 million people use Last.fm.

7 Example: last.fm. The millions of people who use Last.fm generate huge amounts of data that need to be processed.

8 Example: last.fm. [Figure] Source: Marc de Palol

9 Example: last.fm. [Figure] Source: Marc de Palol

10 Example: last.fm. One example of this is users transmitting information indicating which songs they are listening to (this is known as "scrobbling"). This data is processed and stored by Last.fm so that the user can access it directly (in the form of charts), and it is also used to make decisions about users' musical tastes and compatibility, and about artist and track similarity. Source: Marc de Palol

11 Example: last.fm Source: Marc de Palol

12 Example: last.fm. Data from users (userid, trackid, albumid, artistid) -> web/API -> web nodes -> ? Idea: Marc de Palol

14 Example: last.fm. Data from users (4 TB). Input example: (user1, track1, ...), (user1, track2, ...), (user2, track1, ...). Desired output: (track1, #users), (track2, #users). How do we get from one to the other? Idea: Marc de Palol

15 Example: last.fm. Answer: HADOOP takes the 4 TB of input and produces the per-track user counts. Idea: Marc de Palol

16 Example: last.fm. Problem: a user can listen to the same song several times, so we want to get rid of duplicates.

18 Example: last.fm. Do you want the code?

19 Example: last.fm. Solution? We need a mapper and a reducer. Idea: Marc de Palol

20 Example: last.fm. Mapper: emit the song as the key and the user as the value.

    mapper(facebookid, songid) {
        output(songid, facebookid)
    }

22 Example: last.fm. Reduce: add the user ids to a set, so that duplicates collapse, and output the size of the set.

    reduce(songid, List<facebookid> facebookids) {
        Set uniqueusers = new Set()
        for (facebookid in facebookids) {
            uniqueusers.add(facebookid)
        }
        output(songid, uniqueusers.size())
    }
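The mapper and reducer above can be tried out without a cluster. The following is a minimal Python sketch, not the code Last.fm ran: `run_job` is a hypothetical stand-in for Hadoop's shuffle phase, and the sample plays are invented for illustration.

```python
from collections import defaultdict


def mapper(user_id, song_id):
    # Emit the song as the key so all listeners of a song reach one reducer.
    yield song_id, user_id


def reducer(song_id, user_ids):
    # A set collapses repeated plays of the song by the same user.
    yield song_id, len(set(user_ids))


def run_job(records):
    # Hypothetical in-memory replacement for the shuffle: group values by key.
    groups = defaultdict(list)
    for user_id, song_id in records:
        for key, value in mapper(user_id, song_id):
            groups[key].append(value)
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result


# Invented sample input: user1 plays track1 twice.
plays = [
    ("user1", "track1"),
    ("user1", "track1"),
    ("user1", "track2"),
    ("user2", "track1"),
]
unique_listeners = run_job(plays)
print(unique_listeners)
```

The repeated play of track1 by user1 is counted once, which is exactly the duplicate problem raised on the earlier slides.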

23 Example: last.fm. Another important example for last.fm: log processing.
- Input: log files. Assume each server produces 10 GB of logs per day, and assume 500 servers.
- Output: we need to know how many distinct IP addresses accessed the cluster of web servers.
- Problem: size! 500 x 10 GB = 5 TB of logs per day.
- Technology and algorithm required? You already have the answer.
Information from: Marc de Palol

25 Reminder! [Figures on slides 25-27] Source: Cloudera

28 Last.fm example: final words. There were several reasons for adopting Hadoop:
- The distributed filesystem provided redundant backups for the data stored on it (e.g., web logs, user listening data) at no extra cost.
- Scalability was simplified through the ability to add cheap, commodity hardware when required.
- The cost was right (free) at a time when Last.fm had limited financial resources.
- The open source code and active community meant that Last.fm could freely modify Hadoop to add custom features and patches.
- Hadoop provided a flexible framework for running distributed computing algorithms with a relatively easy learning curve.

29 MapReduce: excellent, but what are its data requirements?

30-32 MapReduce: data requirements. In general:
- The data expected is not relational data.
- This data does not require a schema and may be unstructured.
- Instead, data is consumed in chunks, which are divided among the nodes and fed to the map phase as key-value pairs.
- The data must be available in a distributed fashion, to serve each processing node.
[Stack: MapReduce over Data]
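To make the "chunks fed to the map phase as key-value pairs" idea concrete, here is a small Python sketch. It is an illustration under assumptions: the line number is used as the key, chunks are assigned round-robin, and the node names and sample lines are invented. A real scheduler would prefer the nodes that already hold the data.

```python
def make_records(lines):
    # Unstructured text becomes (key, value) pairs: key = line number.
    return list(enumerate(lines))


def split_into_chunks(records, chunk_size):
    # Consecutive records form one chunk (one unit of map input).
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def assign_chunks(chunks, nodes):
    # Assumed policy: round-robin over the processing nodes.
    assignment = {node: [] for node in nodes}
    for index, chunk in enumerate(chunks):
        assignment[nodes[index % len(nodes)]].append(chunk)
    return assignment


log_lines = [
    "user1 track1",
    "user1 track2",
    "user2 track1",
    "user3 track9",
    "user2 track2",
]
chunks = split_into_chunks(make_records(log_lines), chunk_size=2)
assignment = assign_chunks(chunks, ["node-a", "node-b"])
for node, node_chunks in assignment.items():
    print(node, node_chunks)
```

No schema is involved at any point: each map task only sees a key and an opaque line of text.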

33 Parallel databases: shared-nothing, shared-disk, and shared-memory architectures. [Figure: for each architecture, processors, memory, and disks connected by an interconnect or a storage network]

34 MapReduce: data layer requirements. The design and features of the data layer are important because they affect the ease with which data can be loaded and the results of computation extracted and searched. [Stack: MapReduce over Data]


36 HDFS. The Hadoop Distributed File System (HDFS) is the standard storage mechanism for Hadoop.

37 HDFS: Hadoop Distributed File System
- Fault tolerance: assuming that failure will happen allows HDFS to run on commodity hardware.
- Streaming data access: HDFS is written with batch processing in mind and emphasizes high throughput rather than random access to data.
- Extreme scalability: HDFS will scale to petabytes (current versions).
- Portability: HDFS is portable across platforms.

38 Hadoop: standard storage mechanism, the Hadoop Distributed File System (HDFS)
- Write once, read many: most HDFS applications need a write-once-read-many access model for files. By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.
- "Moving computation is cheaper than moving data" (locality of computation): because of the data volume, it is often much faster to move the program near the data, and HDFS has features to facilitate this.

39 HDFS: an example. A given file is broken down into blocks (default = 64 MB),

40 HDFS: an example. then the blocks are replicated across the cluster (default = 3 replicas).
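The two defaults above lend themselves to a quick worked example. The Python sketch below computes how many 64 MB blocks a file needs and spreads three replicas of each block over a set of data nodes. The placement rule (consecutive nodes, shifted per block), the node names, and the 200 MB file are assumptions made for illustration; actual HDFS placement also takes racks into account.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide
REPLICATION = 3                # default number of replicas quoted on the slide


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    # The last block may be only partially filled, hence the ceiling.
    return math.ceil(file_size_bytes / block_size)


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    # Assumed toy policy: block b goes to nodes b, b+1, b+2 (wrapping around).
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


file_size = 200 * 1024 * 1024  # a hypothetical 200 MB file
blocks = block_count(file_size)
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4", "dn5"])
for block, nodes in placement.items():
    print(block, nodes)
```

With these numbers the file occupies four blocks, so the cluster stores twelve block replicas in total, and losing any single node still leaves two copies of every block.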

41 MapReduce: resource management and scheduling. A given job is broken down into tasks, and the tasks are then scheduled to run as close to the data as possible. Optimized for batch processing. Failure recovery.

42 Hadoop: standard storage mechanism. Starting point.

43 Hadoop: standard storage mechanism. The HDFS interface is similar to that of regular filesystems: it can only store and retrieve data, not index it, so simple random access to data is not possible. Solution: higher-level layers such as HBase have been created to provide finer-grained functionality to Hadoop deployments. [Stack: MapReduce over HBase over HDFS]

44 HBase, the Hadoop database. HBase creates indexes and offers fast, random access to its content. It is modeled after Google's BigTable and uses HDFS as its storage system. It belongs to the NoSQL universe, similar to Cassandra, Hypertable, ... [Stack: MapReduce over HBase over HDFS]

45 HBase versus HDFS (a brief comparison)
HDFS is optimized for:
- Large files
- Sequential access (high throughput)
- Append-only writes
Use it for fact tables that are mostly append-only and require sequential full-table scans.
HBase is optimized for:
- Small records (but many of them)
- Random access
- Atomic record updates
Use it for dimension lookup tables that are updated frequently and require random, low-latency lookups.


47 Alternatives to HBase/HDFS? Cassandra, an Apache project, originated at Facebook and is now in production at many large-scale websites (and also at BSC). Hypertable was created at Zvents and spun out as an open source project. Both are scalable column-store databases that follow the BigTable pattern, similar to HBase. [Stacks: MapReduce over Cassandra; MapReduce over Hypertable] And ...

48 And dozens more: List of NoSQL Databases [currently 150+]

49 NoSQL. The concept has gained momentum in recent years. Today it is a mature and efficient alternative that can help us solve scalability and performance problems (e.g. online applications with thousands of concurrent users and millions of hits a day).

50 Different types of NoSQL systems
- Distributed key-value systems: Amazon's S3 Key-Value Store (Dynamo), Voldemort (LinkedIn)
- Column-based systems: BigTable (Google), HBase, Cassandra
- Document-based systems: CouchDB, MongoDB
- Graph DBs

51 DB data model
- Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
- Key-value systems basically support get, put, and delete operations based on a primary key.
- Column-oriented systems still use tables but have no joins (joins must be handled within your application). They store data by column, as opposed to traditional row-oriented databases, which makes aggregations much easier.
- Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It's very easy to map data from object-oriented software to these systems.
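As a rough illustration of the key-value model and of what "joins must be handled within your application" means in practice, consider the Python sketch below. `KeyValueStore` is a hypothetical in-memory class, not the API of any product named in these slides, and the sample users and tracks are invented.

```python
class KeyValueStore:
    """Hypothetical in-memory store exposing only get, put, and delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


users = KeyValueStore()
plays = KeyValueStore()
users.put("user1", {"name": "john"})
users.put("user2", {"name": "maria"})
plays.put("user1", ["track1", "track2"])
plays.put("user2", ["track1"])


def plays_with_names(user_ids):
    # The "join" is done by the application: one lookup in each store per user.
    joined = []
    for user_id in user_ids:
        profile = users.get(user_id)
        for track in plays.get(user_id, []):
            joined.append((profile["name"], track))
    return joined


joined = plays_with_names(["user1", "user2"])
print(joined)
```

The store itself never sees the relationship between the two data sets; keeping related data under one key avoids the second lookup altogether.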

52 DB data model. Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
- Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. This guarantees that a transaction cannot be left in an incomplete state.
- Consistency ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including but not limited to constraints, cascades, triggers, and any combination thereof.
- Isolation refers to the requirement that no transaction should be able to interfere with another transaction. One way of achieving this is to ensure that no transactions that affect the same rows can run concurrently, since their sequence, and hence the outcome, might be unpredictable. This property of ACID is often partly relaxed due to the huge speed decrease this type of concurrency management entails.
- Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements executes, the results need to be stored permanently. If the database crashes immediately thereafter, it should be possible to restore the database to the state after the last committed transaction.
Source: Wikipedia
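Atomicity is easy to observe with a small relational example. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `accounts` table, the names, and the amounts are invented for illustration. The second transfer credits one account and then fails on a CHECK constraint, and the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(connection, source, target, amount):
    # "with connection" commits if the block succeeds and rolls back if it raises.
    with connection:
        connection.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, target),
        )
        connection.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, source),
        )


def balances(connection):
    return dict(connection.execute("SELECT name, balance FROM accounts ORDER BY name"))


transfer(conn, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 500)  # the debit violates the CHECK constraint
except sqlite3.IntegrityError:
    pass                                 # the credit to bob has been rolled back too

final_balances = balances(conn)
print(final_balances)
```

Even though the credit to bob was executed before the failure, it does not survive: the database never shows a state in which money was created.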

53 The problems with relational DBs. RDBMSs scale up well on a single node (vertical scalability), but at a price! Apparent solutions?
- Replication and caches
- Vertical partitioning: different tables on different servers
- Horizontal partitioning: rows of the same table on different servers
- Good for fault tolerance, for sure
- OK for many concurrent reads
- Not much help with writes, if we want to keep ACID

54 There's a reason: the CAP theorem. [Figures on slides 54-55] Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery


57 NoSQL and the CAP theorem. What about NoSQL scalability? Vertical: CPU, memory, ... (price!). Horizontal: more servers, and better fault tolerance of the global system.

58 The CAP theorem. [Figure] Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery

59 There's a reason: the CAP theorem. [Figure] Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery

60 The CAP theorem: proof. [Figure] Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery

61 The problem with RDBMS: proof. [Figure] Source: Ricard Gavaldà. "Information Retrieval", Erasmus Mundus Master program on Data Mining and Knowledge Discovery

62 Scale-out requires partitions. IMPORTANT: a distributed system can offer only two of these three characteristics simultaneously. Most large web-based systems choose availability over consistency.

63 CAP: choose two, per operation (C = Consistency, A = Availability, P = Partition tolerance)
- CA: available and consistent, unless there is a partition.
- CP: always consistent, even in a partition, but a reachable replica may deny service without quorum.
- AP: a reachable replica provides service even in a partition, but may be inconsistent.
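The difference between the CP and AP choices can be sketched with three toy replicas and a network partition that cuts one of them off. This Python model is only an illustration: the `Replica` class, the majority rule used for quorum, and the replica names are assumptions, not the behaviour of any specific database.

```python
class Replica:
    def __init__(self, name, mode, cluster_size):
        self.name = name
        self.mode = mode                  # "CP" or "AP" (assumed labels)
        self.cluster_size = cluster_size
        self.value = None
        self.reachable_peers = []

    def has_quorum(self):
        # Quorum here means: this replica plus its reachable peers form a majority.
        return len(self.reachable_peers) + 1 > self.cluster_size // 2

    def write(self, value):
        if self.mode == "CP" and not self.has_quorum():
            raise RuntimeError(self.name + ": no quorum, write refused")
        self.value = value
        for peer in self.reachable_peers:
            peer.value = value
        return "ok"


def build_partitioned_cluster(mode):
    # Three replicas; the partition isolates "c" from "a" and "b".
    a, b, c = (Replica(name, mode, 3) for name in "abc")
    a.reachable_peers = [b]
    b.reachable_peers = [a]
    c.reachable_peers = []
    return a, b, c


a, b, c = build_partitioned_cluster("CP")
a.write("x")
try:
    c.write("y")
    cp_minority_result = "accepted"
except RuntimeError:
    cp_minority_result = "refused"
cp_values = (a.value, b.value, c.value)

a2, b2, c2 = build_partitioned_cluster("AP")
a2.write("x")
c2.write("y")
ap_values = (a2.value, b2.value, c2.value)

print(cp_minority_result, cp_values, ap_values)
```

In CP mode the isolated replica denies service and no conflicting value exists; in AP mode it keeps serving, and the two sides of the partition hold different values until they are reconciled.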

64 Visual Guide to NoSQL Systems. [Figures on slides 64-65]

66 Consistent, Available (CA) systems have trouble with partitions and typically deal with them through replication. Examples of CA systems include:
- Traditional RDBMSs like Postgres, MySQL, etc. (relational)
- Vertica (column-oriented)
- Aster Data (relational)
- Greenplum (relational)

67 Consistent, Partition-Tolerant (CP) systems have trouble with availability while keeping data consistent across partitioned nodes. Examples of CP systems include:
- BigTable (column-oriented/tabular)
- Hypertable (column-oriented/tabular)
- HBase (column-oriented/tabular)
- MongoDB (document-oriented)
- Terrastore (document-oriented)
- Redis (key-value)
- Scalaris (key-value)
- MemcacheDB (key-value)
- Berkeley DB (key-value)

68 Available, Partition-Tolerant (AP) systems achieve "eventual consistency" through replication and verification. Examples of AP systems include:
- Dynamo (key-value)
- Voldemort (key-value)
- Tokyo Cabinet (key-value)
- KAI (key-value)
- Cassandra (column-oriented/tabular)
- CouchDB (document-oriented)
- SimpleDB (document-oriented)
- Riak (document-oriented)

69 Eventual consistency
- If no updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent.
- Eventually, a node is either updated or removed from service.
- Can be implemented with a gossip protocol.
- Amazon's Dynamo popularized this approach.
- Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
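A gossip-style exchange can be simulated in a few lines. In the Python sketch below every node holds a (version, value) pair; in each round every node contacts one randomly chosen peer and both keep the newer pair. The versioning scheme, the fixed random seed, and the five-node cluster are assumptions chosen for illustration, not a description of Dynamo or Cassandra internals.

```python
import random


def gossip_round(states, rng):
    # Each node contacts one random peer; both end up with the newer state.
    for i in range(len(states)):
        j = rng.choice([k for k in range(len(states)) if k != i])
        newest = max(states[i], states[j])  # tuples compare by version first
        states[i] = newest
        states[j] = newest


def converge(states, seed=0, max_rounds=1000):
    # Run rounds until all nodes agree; return how many rounds were needed.
    rng = random.Random(seed)
    rounds = 0
    while len(set(states)) > 1 and rounds < max_rounds:
        gossip_round(states, rng)
        rounds += 1
    return rounds


# One node has received an update (version 2); the others still hold version 1.
states = [(2, "new"), (1, "old"), (1, "old"), (1, "old"), (1, "old")]
rounds = converge(states)
print(rounds, states)
```

Between the update and the last round, readers may see either value depending on which node they reach; once updates stop, all nodes end up with the same state.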

70 NoSQL alternatives. But the differences between NoSQL databases are much bigger than they ever were between one SQL database and another! This places a bigger responsibility on software architects to choose the appropriate one for a project right at the beginning.


72 Cassandra: main features. Cassandra does not support relationships between column families ("tables"), so there are no foreign keys or join operations. Knowing this, the best practice when designing a data model is to keep related data in the same column family. In this section we review only the main features of Cassandra, as an example.

73 Architecture. The architecture of Cassandra is completely decentralized and peer-to-peer: all nodes in a Cassandra cluster are equivalent and provide the same functionality, receiving read and write requests or forwarding them to other nodes.
- Peer-to-peer, distributed system
- All nodes the same
- Data partitioned
- Custom data replication

74 Partitioning. Cassandra implements automatic partitioning and replication mechanisms to decide which nodes are in charge of each replica. How? The PARTITIONER divides the data across the nodes in the cluster, and each node is responsible for a range of the overall data. Source: Juan Luis Pérez, researcher at BSC (EEDC 2012 master course)

75 Partitioning. [Figure: Node A, Node B, Node C, Node D] Source: Juan Luis Pérez, researcher at BSC (EEDC 2012 master course)

76 Partitioning. The row key determines node placement. Example rows:
- raiser: name: john, pass: ****, url: icann.org
- trucker: name: james, pass: ****, url: w3.org
- dumper: name: maria, pass: ****
- biker: name: linda, pass: ****

77 Partitioning. Each node is responsible for one range of the MD5 hash space. [Figure: four hash ranges; of the boundary values only c00..0 is legible]

78 Partitioning. The MD5 hash of each row key (raiser, trucker, dumper, biker) selects the range, and therefore the node, that stores the row. Hash values legible on the slides: 65236c..., a113f4..., d4ab... [Slides 78-82 repeat this table]
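The placement rule on these slides can be sketched as follows. The Python code hashes each row key with MD5 (via the standard `hashlib` module) and maps the 128-bit result onto four equal ranges. Splitting the hash space into equal quarters and assigning them to Node A through Node D in order are assumptions for illustration; the slides do not preserve the actual range boundaries.

```python
import hashlib

NODES = ["Node A", "Node B", "Node C", "Node D"]
HASH_SPACE = 2 ** 128  # MD5 produces 128-bit values


def md5_token(row_key):
    # Interpret the MD5 digest of the row key as one big integer.
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


def node_for(row_key, nodes=NODES):
    # Assumed layout: equal-sized consecutive ranges, one per node.
    range_size = HASH_SPACE // len(nodes)
    return nodes[md5_token(row_key) // range_size]


placement = {key: node_for(key) for key in ["raiser", "trucker", "dumper", "biker"]}
for key, node in placement.items():
    print(key, hex(md5_token(key))[:8], node)
```

Because the node is derived from the key alone, any node can work out where a row lives without consulting a central directory.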

83 Replication. Remember: Cassandra implements automatic partitioning and replication mechanisms to decide which nodes are in charge of each replica. The user only needs to configure the number of replicas, and the system assigns each replica to a node in the cluster.

84 Replication
- Cassandra stores multiple copies of rows on multiple nodes.
- Replication factor = number of replicas.
- Replica placement strategy: SimpleStrategy (default) or NetworkTopologyStrategy.
- Both are configurable: the replication factor and the placement strategy.

85 Replication: SimpleStrategy. The first replica is determined by the partitioner. Additional replicas of the row are placed on the next nodes clockwise in the ring. [Figure: the original row "raiser" and its copies on the ring]
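The clockwise rule can be written down directly. In this Python sketch the ring is a list of node names in clockwise order; the names, the ring order, and the choice of Node C as the partitioner's pick are invented for illustration.

```python
def simple_strategy(ring, primary_index, replication_factor):
    # First replica on the node chosen by the partitioner, the rest on the
    # following nodes clockwise, wrapping around the end of the ring.
    return [
        ring[(primary_index + step) % len(ring)]
        for step in range(replication_factor)
    ]


ring = ["Node A", "Node B", "Node C", "Node D"]  # assumed clockwise order
replicas = simple_strategy(ring, primary_index=2, replication_factor=3)
print(replicas)
```

With a replication factor of 3 on a four-node ring, a row whose first replica lands on Node C is also stored on Node D and, after wrapping around, on Node A.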

86 Replication: NetworkTopologyStrategy. Allows replication between different racks, in one data center or in multiple data centers. Reliability and performance. Others.

87 Consistency. The goal of current distributed key-value stores such as Cassandra is to serve read and write operations, exactly as any database system does. However, while traditional databases provide strong consistency guarantees for replicated data by controlling the concurrent execution of transactions, Cassandra provides tunable consistency in order to favour scalability and availability.

88 Consistency. Data consistency is tunable by the user when queries are performed, so depending on the desired level of consistency, operations can either return as soon as possible or wait until a majority or all of the nodes respond.
- Tunable data consistency
- Choose between strong and eventual consistency
- Consistency per operation (reads and writes)

89 Strategy for reads. [Figure]

90 Strategy for writes. [Figure]

91 Strong/weak consistency? As can be derived from their descriptions, strong consistency can only be achieved when using the (Quorum and) All consistency levels. Operations that use weaker consistency levels, such as Zero, Any and One, aren't guaranteed to read the most recent data. However, this weaker consistency provides flexibility for applications that can benefit from better performance and don't have strong consistency needs. Imagine your Facebook wall!
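Why Quorum and All give strong consistency can be checked with the usual overlap rule: if a read waits for R replicas and a write for W replicas out of N, every read sees the latest write whenever R + W > N. The Python sketch below applies that rule. It models only the One, Quorum, and All levels, with quorum taken as a simple majority; the function names are hypothetical.

```python
def replicas_required(level, replication_factor):
    # Number of replicas that must respond for each modeled consistency level.
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("level not modeled in this sketch: " + level)


def is_strongly_consistent(read_level, write_level, replication_factor):
    # Read and write sets must overlap in at least one replica: R + W > N.
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


N = 3  # replication factor used for the examples
for read_level, write_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ONE", "ALL")]:
    print(read_level, write_level, is_strongly_consistent(read_level, write_level, N))
```

With three replicas, Quorum reads and writes each touch two nodes, so at least one node in every read has seen the latest write; One/One touches a single node on each side and gives no such guarantee.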

92 Caching
- Data is first written to a commit log for durability. The log is local to the node (for disaster recovery purposes).
- Then it is written to an in-memory structure (memtable) on the node that stores the row.
- And then to disk (SSTable) once the memtable is full.
Data durability is assured. [Figure: memtable, commit log, SSTable] Source: Juan Luis Pérez, researcher at BSC (EEDC 2012 master course)
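The three steps of the write path can be mimicked with plain Python data structures. This is a toy model under assumptions: the `Node` class, the memtable limit of two rows, and the `read` method (which the slide does not describe) are all invented for illustration, and "disk" is just a list of sorted snapshots.

```python
class Node:
    def __init__(self, memtable_limit):
        self.commit_log = []   # append-only record of every write, for recovery
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # sorted snapshots standing in for files on disk
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: commit log
        self.memtable[key] = value            # step 2: memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # step 3: SSTable when full

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Assumed read path: newest data first (memtable, then recent SSTables).
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, stored_value in sstable:
                if stored_key == key:
                    return stored_value
        return None


node = Node(memtable_limit=2)
node.write("raiser", "john")
node.write("trucker", "james")  # memtable is now full and gets flushed
node.write("biker", "linda")
print(node.sstables, node.memtable, len(node.commit_log))
```

Every write is in the commit log before it is acknowledged, so a crash that wipes the memtable loses nothing that cannot be replayed.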


94 What do we need?
- An Amazon AWS account
- Eclipse (any platform)
- 5 minutes
Felipe Caicedo

95 Instructions (I)
1. Open Eclipse and click Help in the toolbar.
2. Click Install New Software.

96 Instructions (II)
1. Enter the toolkit's update-site URL in the first input field, then click Add.
2. Enter the name of the repository and click OK.

97 Instructions (III)
1. Select all elements of the list (if needed).
2. Continue with the wizard.

98 Instructions (IV)
1. To finish the wizard, accept the terms and conditions and click Finish.
2. After the installation, restart Eclipse.

99 Instructions (V)
1. An AWS icon should now appear in the toolbar. Click that icon.
2. Click Preferences to configure the Amazon AWS account.

100 Instructions (VI)
1. Click "Find your existing AWS security credentials" to configure the account.
2. A web page with your security credentials should open.

101 Instructions (VII)
1. Copy the Access Key ID and the Secret Access Key (you can create a new access key if you don't have one).

102 Instructions (VIII)
1. Click Add account, enter the name of the new account, and then enter the Access Key ID and the Secret Access Key copied previously.
2. Click OK.

103 Instructions (IX)
1. Click Show AWS Explorer View.
2. The AWS Explorer view should appear (its position depends on your Eclipse configuration).

104 Instructions (X)
Clicking on any item opens the corresponding view.

105 Instructions (XI)
1. Click on Amazon DynamoDB, for instance.
2. You should see a view listing your tables, if you have created any.

106 Instructions (creating an AWS project)
1. File
2. New
3. Other
4. Select AWS Java Project

107 References: Download Eclipse, AWS toolkit, Creating new AWS project. Thank you to Felipe Caicedo (FIB student) for producing these slides.

108 Content
- Motivation
- HDFS and HBase
- Alternatives to HDFS/HBase?
- CAP Theorem
- Case Study: Cassandra
- Big Data AWS in our Desktop
- Guest Lecture: AENEAS (NEXT CLASS)
- HOMEWORK: AENEAS Hands-on


More information

Architekturen für die Cloud

Architekturen für die Cloud Architekturen für die Cloud Eberhard Wolff Architecture & Technology Manager adesso AG 08.06.11 What is Cloud? National Institute for Standards and Technology (NIST) Definition On-demand self-service >

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan COSC 416 NoSQL Databases NoSQL Databases Overview Dr. Ramon Lawrence University of British Columbia Okanagan ramon.lawrence@ubc.ca Databases Brought Back to Life!!! Image copyright: www.dragoart.com Image

More information

Migrating Oracle Databases To Cassandra

Migrating Oracle Databases To Cassandra BY UMAIR MANSOOB Why Cassandra Lower Cost of ownership makes it #1 choice for Big Data OLTP Applications. Unlike Oracle, Cassandra can store structured, semi-structured, and unstructured data. Cassandra

More information

Exploring Cassandra and HBase with BigTable Model

Exploring Cassandra and HBase with BigTable Model Exploring Cassandra and HBase with BigTable Model Hemanth Gokavarapu hemagoka@indiana.edu (Guidance of Prof. Judy Qiu) Department of Computer Science Indiana University Bloomington Abstract Cassandra is

More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

Cassandra Design Patterns

Cassandra Design Patterns Cassandra Design Patterns Sanjay Sharma Chapter No. 1 "An Overview of Architecture and Data Modeling in Cassandra" In this package, you will find: A Biography of the author of the book A preview chapter

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Advanced Database Technologies NoSQL: Not only SQL

Advanced Database Technologies NoSQL: Not only SQL Advanced Database Technologies NoSQL: Not only SQL Christian Grün Database & Information Systems Group NoSQL Introduction 30, 40 years history of well-established database technology all in vain? Not at

More information

10. Replication. Motivation

10. Replication. Motivation 10. Replication Page 1 10. Replication Motivation Reliable and high-performance computation on a single instance of a data object is prone to failure. Replicate data to overcome single points of failure

More information

Database Availability and Integrity in NoSQL. Fahri Firdausillah [M ]

Database Availability and Integrity in NoSQL. Fahri Firdausillah [M ] Database Availability and Integrity in NoSQL Fahri Firdausillah [M031010012] What is NoSQL Stands for Not Only SQL Mostly addressing some of the points: nonrelational, distributed, horizontal scalable,

More information

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

SCALABLE CONSISTENCY AND TRANSACTION MODELS

SCALABLE CONSISTENCY AND TRANSACTION MODELS Data Management in the Cloud SCALABLE CONSISTENCY AND TRANSACTION MODELS 69 Brewer s Conjecture Three properties that are desirable and expected from realworld shared-data systems C: data consistency A:

More information

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB s C. Faloutsos A. Pavlo Lecture#23: Distributed Database Systems (R&G ch. 22) Administrivia Final Exam Who: You What: R&G Chapters 15-22

More information

The NoSQL Ecosystem. Adam Marcus MIT CSAIL

The NoSQL Ecosystem. Adam Marcus MIT CSAIL The NoSQL Ecosystem Adam Marcus MIT CSAIL marcua@csail.mit.edu / @marcua About Me Social Computing + Database Systems Easily Distracted: Wrote The NoSQL Ecosystem in The Architecture of Open Source Applications

More information

Transactions and ACID

Transactions and ACID Transactions and ACID Kevin Swingler Contents Recap of ACID transactions in RDBMSs Transactions and ACID in MongoDB 1 Concurrency Databases are almost always accessed by multiple users concurrently A user

More information

Introduction to NoSQL

Introduction to NoSQL Introduction to NoSQL Agenda History What is NoSQL Types of NoSQL The CAP theorem History - RDBMS Relational DataBase Management Systems were invented in the 1970s. E. F. Codd, "Relational Model of Data

More information

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Haridimos Kondylakis Computer Science Department, University of Crete

Haridimos Kondylakis Computer Science Department, University of Crete CS-562 Advanced Topics in Databases Haridimos Kondylakis Computer Science Department, University of Crete QSX (LN2) 2 NoSQL NoSQL: Not Only SQL. User case of NoSQL? Massive write performance. Fast key

More information

Presented by Sunnie S Chung CIS 612

Presented by Sunnie S Chung CIS 612 By Yasin N. Silva, Arizona State University Presented by Sunnie S Chung CIS 612 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. See http://creativecommons.org/licenses/by-nc-sa/4.0/

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Data Management for Big Data Part 1

Data Management for Big Data Part 1 2018-04-09 2 Outline Today Part 1 Data Management for Big Data Part 1 Valentina Ivanova IDA, Linköping University RDBMS NoSQL NewSQL DBMS OLAP vs OLTP (ACID) NoSQL Concepts and Techniques Horizontal scalability

More information

NoSQL : A Panorama for Scalable Databases in Web

NoSQL : A Panorama for Scalable Databases in Web NoSQL : A Panorama for Scalable Databases in Web Jagjit Bhatia P.G. Dept of Computer Science,Hans Raj Mahila Maha Vidyalaya, Jalandhar Abstract- Various business applications deal with large amount of

More information

NOSQL Databases: The Need of Enterprises

NOSQL Databases: The Need of Enterprises International Journal of Allied Practice, Research and Review Website: www.ijaprr.com (ISSN 2350-1294) NOSQL Databases: The Need of Enterprises Basit Maqbool Mattu M-Tech CSE Student. (4 th semester).

More information

MapReduce and Friends

MapReduce and Friends MapReduce and Friends Craig C. Douglas University of Wyoming with thanks to Mookwon Seo Why was it invented? MapReduce is a mergesort for large distributed memory computers. It was the basis for a web

More information

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414 Announcements Database Systems CSE 414 Lecture 16: NoSQL and JSon Current assignments: Homework 4 due tonight Web Quiz 6 due next Wednesday [There is no Web Quiz 5 Today s lecture: JSon The book covers

More information

Distributed Databases: SQL vs NoSQL

Distributed Databases: SQL vs NoSQL Distributed Databases: SQL vs NoSQL Seda Unal, Yuchen Zheng April 23, 2017 1 Introduction Distributed databases have become increasingly popular in the era of big data because of their advantages over

More information

A Review Of Non Relational Databases, Their Types, Advantages And Disadvantages

A Review Of Non Relational Databases, Their Types, Advantages And Disadvantages A Review Of Non Relational Databases, Their Types, Advantages And Disadvantages Harpreet kaur, Jaspreet kaur, Kamaljit kaur Student of M.Tech(CSE) Student of M.Tech(CSE) Assit.Prof.in CSE deptt. Sri Guru

More information

Axway API Management 7.5.x Cassandra Best practices. #axway

Axway API Management 7.5.x Cassandra Best practices. #axway Axway API Management 7.5.x Cassandra Best practices #axway Axway API Management 7.5.x Cassandra Best practices Agenda Apache Cassandra - Overview Apache Cassandra - Focus on consistency level Apache Cassandra

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 16: NoSQL and JSon CSE 414 - Spring 2016 1 Announcements Current assignments: Homework 4 due tonight Web Quiz 6 due next Wednesday [There is no Web Quiz 5] Today s lecture:

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

NoSQL Concepts, Techniques & Systems Part 1. Valentina Ivanova IDA, Linköping University

NoSQL Concepts, Techniques & Systems Part 1. Valentina Ivanova IDA, Linköping University NoSQL Concepts, Techniques & Systems Part 1 Valentina Ivanova IDA, Linköping University 2017-03-20 2 Outline Today Part 1 RDBMS NoSQL NewSQL DBMS OLAP vs OLTP NoSQL Concepts and Techniques Horizontal scalability

More information

Next-Generation Cloud Platform

Next-Generation Cloud Platform Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology

More information

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at

More information

Cassandra- A Distributed Database

Cassandra- A Distributed Database Cassandra- A Distributed Database Tulika Gupta Department of Information Technology Poornima Institute of Engineering and Technology Jaipur, Rajasthan, India Abstract- A relational database is a traditional

More information

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414 Announcements Database Systems CSE 414 Lecture 11: NoSQL & JSON (mostly not in textbook only Ch 11.1) HW5 will be posted on Friday and due on Nov. 14, 11pm [No Web Quiz 5] Today s lecture: NoSQL & JSON

More information

Module - 17 Lecture - 23 SQL and NoSQL systems. (Refer Slide Time: 00:04)

Module - 17 Lecture - 23 SQL and NoSQL systems. (Refer Slide Time: 00:04) Introduction to Morden Application Development Dr. Gaurav Raina Prof. Tanmai Gopal Department of Computer Science and Engineering Indian Institute of Technology, Madras Module - 17 Lecture - 23 SQL and

More information

Introduction to Database Services

Introduction to Database Services Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational

More information

Databases and Big Data Today. CS634 Class 22

Databases and Big Data Today. CS634 Class 22 Databases and Big Data Today CS634 Class 22 Current types of Databases SQL using relational tables: still very important! NoSQL, i.e., not using relational tables: term NoSQL popular since about 2007.

More information

DATABASE DESIGN II - 1DL400

DATABASE DESIGN II - 1DL400 DATABASE DESIGN II - 1DL400 Fall 2016 A second course in database systems http://www.it.uu.se/research/group/udbl/kurser/dbii_ht16 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Introduction to Distributed Data Systems

Introduction to Distributed Data Systems Introduction to Distributed Data Systems Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook January

More information

International Journal of Informative & Futuristic Research ISSN:

International Journal of Informative & Futuristic Research ISSN: www.ijifr.com Volume 5 Issue 8 April 2018 International Journal of Informative & Futuristic Research ISSN: 2347-1697 TRANSITION FROM TRADITIONAL DATABASES TO NOSQL DATABASES Paper ID IJIFR/V5/ E8/ 010

More information

MongoDB and Mysql: Which one is a better fit for me? Room 204-2:20PM-3:10PM

MongoDB and Mysql: Which one is a better fit for me? Room 204-2:20PM-3:10PM MongoDB and Mysql: Which one is a better fit for me? Room 204-2:20PM-3:10PM About us Adamo Tonete MongoDB Support Engineer Agustín Gallego MySQL Support Engineer Agenda What are MongoDB and MySQL; NoSQL

More information

Using space-filling curves for multidimensional

Using space-filling curves for multidimensional Using space-filling curves for multidimensional indexing Dr. Bisztray Dénes Senior Research Engineer 1 Nokia Solutions and Networks 2014 In medias res Performance problems with RDBMS Switch to NoSQL store

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Getting to know. by Michelle Darling August 2013

Getting to know. by Michelle Darling August 2013 Getting to know by Michelle Darling mdarlingcmt@gmail.com August 2013 Agenda: What is Cassandra? Installation, CQL3 Data Modelling Summary Only 15 min to cover these, so please hold questions til the end,

More information

CSE 344 JULY 9 TH NOSQL

CSE 344 JULY 9 TH NOSQL CSE 344 JULY 9 TH NOSQL ADMINISTRATIVE MINUTIAE HW3 due Wednesday tests released actual_time should have 0s not NULLs upload new data file or use UPDATE to change 0 ~> NULL Extra OOs on Mondays 5-7pm in

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

CSE-E5430 Scalable Cloud Computing Lecture 9

CSE-E5430 Scalable Cloud Computing Lecture 9 CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES 1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

Outline. Introduction Background Use Cases Data Model & Query Language Architecture Conclusion

Outline. Introduction Background Use Cases Data Model & Query Language Architecture Conclusion Outline Introduction Background Use Cases Data Model & Query Language Architecture Conclusion Cassandra Background What is Cassandra? Open-source database management system (DBMS) Several key features

More information

Modern Database Concepts

Modern Database Concepts Modern Database Concepts Basic Principles Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz NoSQL Overview Main objective: to implement a distributed state Different objects stored on different

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

BIG DATA basics. November Cloud Computing & Big Data. FIB-UPC Master MEI

BIG DATA basics. November Cloud Computing & Big Data. FIB-UPC Master MEI BIG DATA basics Cloud Computing & Big Data November- 2012 FIB-UPC Master MEI Content (Big Data part) A. Motivation B. Big Data Challenges C. Processing Big Data D. Big Data Storage E. Managing Big Data

More information

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014 Cassandra @ Spotify Scaling storage to million of users world wide! Jimmy Mårdell October 14, 2014 2 About me Jimmy Mårdell Tech Product Owner in the Cassandra team 4 years at Spotify

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores Nikhil Dasharath Karande 1 Department of CSE, Sanjay Ghodawat Institutes, Atigre nikhilkarande18@gmail.com Abstract- This paper

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, 2016 1 OBJECTIVES ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, 2016 2 WHAT

More information

5/1/17. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

5/1/17. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414 Announcements Database Systems CSE 414 Lecture 15: NoSQL & JSON (mostly not in textbook only Ch 11.1) 1 Homework 4 due tomorrow night [No Web Quiz 5] Midterm grading hopefully finished tonight post online

More information

Key Value Store. Yiding Wang, Zhaoxiong Yang

Key Value Store. Yiding Wang, Zhaoxiong Yang Key Value Store Yiding Wang, Zhaoxiong Yang Outline Part 1 Definitions/Operations Compare with RDBMS Scale Up Part 2 Distributed Key Value Store Network Acceleration Definitions A key-value database, or

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Tools for Social Networking Infrastructures

Tools for Social Networking Infrastructures Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

ZHT A Fast, Reliable and Scalable Zero- hop Distributed Hash Table

ZHT A Fast, Reliable and Scalable Zero- hop Distributed Hash Table ZHT A Fast, Reliable and Scalable Zero- hop Distributed Hash Table 1 What is KVS? Why to use? Why not to use? Who s using it? Design issues A storage system A distributed hash table Spread simple structured

More information

Column-Family Databases Cassandra and HBase

Column-Family Databases Cassandra and HBase Column-Family Databases Cassandra and HBase Kevin Swingler Google Big Table Google invented BigTableto store the massive amounts of semi-structured data it was generating Basic model stores items indexed

More information

Non-Relational Databases. Pelle Jakovits

Non-Relational Databases. Pelle Jakovits Non-Relational Databases Pelle Jakovits 25 October 2017 Outline Background Relational model Database scaling The NoSQL Movement CAP Theorem Non-relational data models Key-value Document-oriented Column

More information