Column-Family Databases Cassandra and HBase

Sheila Lee
6 years ago

Column-Family Databases: Cassandra and HBase
Kevin Swingler

Google BigTable
Google invented BigTable to store the massive amounts of semi-structured data it was generating.
The basic model stores items indexed by a (rowkey, columnkey) pair plus a timestamp.
Items with the same rowkey refer to the same thing, like a row in a normal table.
The columnkey can vary from row to row.
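The (rowkey, columnkey, timestamp) indexing described above can be pictured with a small in-memory sketch. This is only an illustration of the data model, not how BigTable is implemented; the class name TinyBigTable and its methods are hypothetical.

```python
import time
from collections import defaultdict


class TinyBigTable:
    """Toy store keyed by (rowkey, columnkey), keeping timestamped versions.

    Hypothetical illustration only; not a real BigTable API.
    """

    def __init__(self):
        # (rowkey, columnkey) -> list of (timestamp, value)
        self._cells = defaultdict(list)

    def put(self, rowkey, columnkey, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._cells[(rowkey, columnkey)].append((ts, value))

    def get(self, rowkey, columnkey):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self._cells.get((rowkey, columnkey))
        if not versions:
            return None
        return max(versions, key=lambda tv: tv[0])[1]

    def row(self, rowkey):
        """Latest value of every column stored under one rowkey."""
        return {c: self.get(r, c) for (r, c) in self._cells if r == rowkey}


table = TinyBigTable()
table.put("row1", "name", "Kevin", timestamp=1)
table.put("row1", "name", "Kev", timestamp=2)   # newer version of the same cell
table.put("row2", "tel", "467676", timestamp=1)  # a column row1 does not have
print(table.row("row1"), table.row("row2"))
```

Rows here share nothing but the store itself: each row carries whatever column keys were written to it, which is the point of the model.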

Facebook
Facebook took Google BigTable and Amazon Dynamo (another key-value store) and created a database that was then released to Apache as Cassandra.
Similarly, HBase, the database used in Hadoop, is also based on Google BigTable.

Column-Family Databases
In a column-family database, rows are indexed by an ID.
Columns are flexible and can vary from row to row.
Columns are organised into families of related columns, a bit like a relational database is organised into tables.

Cassandra
Apache column-family database.
Big users: GitHub, Netflix, eBay, Twitter.
Open source and free.
Peer-to-peer distributed data store.

Keyspace: Abstract Example
Family 1
  Row 1  Key 1  colnamea:value  colnameb:value
  Row 2  Key 2  colnamea:value  colnamec:value
  Row 3  Key 3  colnamea:value
Family 2
  Row 1  Key 1  colnamex:value  colnamey:value
  Row 2  Key 2  colnamez:value

Concrete Example
Keyspace: Customers
Customer Details
  Row 1  Key 1  Name:Kevin  kms@cs..
  Row 2  Key 2  Name:John   Tel:
  Row 3  Key 3  Name:Paul
Orders
  Row 1  Key 1  Date:  Value:45.99
  Row 2  Key 2  Status:Cancelled

Super Columns
Family
  Row 1  Key 1  colnamea:  subcol1:val  subcol2:val  subcol3:val
                colnameb:  subcol4:val  subcol5:val  subcol6:val
  Row 2  Key 2  colnamea:  subcol1:val  subcol2:val  subcol3:val
                colnamec:  subcol4:val  subcol5:val  subcol6:val
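The keyspace, family, row and column nesting in the example above maps naturally onto nested dictionaries. The sketch below is a rough picture of that shape, using only the values that are legible in the example; the helper names get_column and column_names are hypothetical.

```python
# Keyspace -> column family -> row key -> {column name: value}
customers = {
    "Customer Details": {
        1: {"Name": "Kevin"},
        2: {"Name": "John"},
        3: {"Name": "Paul"},
    },
    "Orders": {
        1: {"Value": 45.99},
        2: {"Status": "Cancelled"},
    },
}


def get_column(keyspace, family, key, column):
    """Look up one column; a column absent from a row yields None."""
    return keyspace.get(family, {}).get(key, {}).get(column)


def column_names(keyspace, family, key):
    """Column names present in one row, in sorted order."""
    return sorted(keyspace.get(family, {}).get(key, {}))


print(get_column(customers, "Orders", 1, "Value"))
print(column_names(customers, "Orders", 2))
```

A super column would simply add one more level of nesting, with a column name mapping to a dictionary of sub-columns.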

Empty Columns
The basic idea is that if a row does not contain a value for a certain column, then rather than storing a null entry, that column is simply missing from the given row.
However, you can use the column name as data and leave its value empty.
This gives variable-length lists of unique values.
Keys are sorted.

Peer-to-Peer Replication
[Diagram: nodes grouped into racks within Data Centre 1 and Data Centre 2]
No master, no slaves.
No single point of failure.

Partitioning
[Diagram: the Customer Details rows (Row 1: Name:Kevin, kms@cs..; Row 2: Name:John, Tel:; Row 3: Name:Paul) spread across four nodes arranged in a ring]
The ring structure is due to consistent hashing, which allows nodes to be added or removed with only a small portion of the data needing to be moved.

Random partitioning
Partition based on the MD5 hash of the key value.
MD5 hash values are designed to be uniformly distributed, so load should be balanced across nodes.
Adjacent keys are not kept together (hence "random").

Order partitioning
Based on raw key order.
Good if you want to access rows in adjacent groups.
Bad otherwise, as it causes "hotspots" on some machines.
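To make the ring idea concrete, here is a minimal sketch of consistent hashing with MD5 tokens. It assumes one token per node and derives each node's token from its name, which is a simplification chosen for illustration; the class HashRing and the node names are hypothetical and this is not Cassandra's own code.

```python
import bisect
import hashlib


def md5_token(key: str) -> int:
    """Map a key to a position on the ring using its MD5 hash."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy ring: each node owns the keys between the previous token and its own."""

    def __init__(self, nodes):
        # Assumption: one token per node, taken from the hash of the node name.
        self._ring = sorted((md5_token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's token to the first node token."""
        i = bisect.bisect_left(self._tokens, md5_token(key))
        return self._ring[i % len(self._ring)][1]  # wrap around the ring


keys = [f"row{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

Every key that moves ends up on the newly added node; keys owned by the other nodes stay where they were, which is why adding or removing a node disturbs only part of the data.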

Denormalize
Joins in a relational model are flexible, storage-efficient and elegant.
But they can also be very slow at run time.
Worse, they perform very poorly in a distributed data model.
Denormalization can be the answer.

Denormalize Design
Take the many-to-many relationship "Bought" between customers and products.
ER diagram in the relational model:

UserID  Name
123     John
234     Kevin

UserID  ProductID

ProductID  Name
111        Guitar
222        Flute
333        Piano

Normalized Structure
Cassandra has no joins.
Normal form leads to poor performance, as you have to look up both UserID and ProductID for each row.

UserID  Name
123     John
234     Kevin

ProductID  Name
111        Guitar
222        Flute
333        Piano

UserID  ProductID

Denormalized
Two denormalized tables, each designed for a specific query.

I know the UserID and want all the products they bought:

UserID  Product
123     Guitar
123     Piano
234     Flute

I know the ProductID and want to know who bought it:

ProductID  User
111        John
222        Kevin
333        John

Rows with the same key should be on the same node.
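As a rough illustration of the denormalized design, the sketch below builds one lookup table per query from the normalized data, so that each read is a single key lookup with no join. The (UserID, ProductID) pairs are inferred from the UserID/Product table above, and the function name denormalize is hypothetical.

```python
from collections import defaultdict

users = {123: "John", 234: "Kevin"}
products = {111: "Guitar", 222: "Flute", 333: "Piano"}
# (UserID, ProductID) pairs, inferred from the UserID/Product table.
bought = [(123, 111), (123, 333), (234, 222)]


def denormalize(users, products, bought):
    """Build one table per query so no join is needed at read time."""
    products_by_user = defaultdict(list)  # UserID -> product names
    users_by_product = defaultdict(list)  # ProductID -> user names
    for user_id, product_id in bought:
        products_by_user[user_id].append(products[product_id])
        users_by_product[product_id].append(users[user_id])
    return dict(products_by_user), dict(users_by_product)


by_user, by_product = denormalize(users, products, bought)
print(by_user[123])     # products bought by user 123
print(by_product[222])  # users who bought product 222
```

The cost is duplicated data: every purchase is written twice, once into each table, in exchange for fast reads.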

Using Empty Columns
In the last slide, each entry was on a different row.
They can all be moved to the same row (rows contain arbitrary numbers of columns, remember) using column names as data:

UserID  Guitar  Piano
123     null    null

UserID  Flute
234     null
...

Or Include Data
Rather than leave a column empty, store something:

UserID  Guitar  Piano

UserID  Flute
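The "column name as data" trick can be sketched as a dictionary per row whose keys are the purchased products. This is an illustrative model only; the helpers add_purchase and purchases are hypothetical, and the quantity stored in the last call is an invented example of "storing something" in the value.

```python
def add_purchase(rows, user_id, product, value=None):
    """Record a purchase as a column on the user's row.

    The column name carries the data; the value may stay empty (None).
    """
    rows.setdefault(user_id, {})[product] = value


def purchases(rows, user_id):
    """Column names come back sorted, and each name occurs once per row."""
    return sorted(rows.get(user_id, {}))


rows = {}
add_purchase(rows, 123, "Piano")
add_purchase(rows, 123, "Guitar")
add_purchase(rows, 123, "Guitar")            # repeated name: still one column
add_purchase(rows, 234, "Flute", value=2)    # hypothetical quantity as the value
print(purchases(rows, 123), rows[234])
```

Because column names are unique within a row and kept in order, the row behaves like a sorted set of values.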

Query Language: CQL
Very SQL-like syntax.
Differences mostly take account of the need for variable-length rows.

SELECT
SELECT cols FROM col_family WHERE condition
cols can be a comma-separated list of names, a range (name..), *, or FIRST n (gets the first n columns).

INSERT
INSERT INTO col_family (col, col, ...) VALUES (val, val, ...) USING option ...
Each value can be a literal, a list, a set, or a JSON-style map of literals: {k:v, k:v, ...}
option can be CONSISTENCY, TIMESTAMP or TTL (time to live).

INSERT INTO users (user_id, first_name, last_name, s) VALUES ('123', 'Kev', 'Swingler', {'kms@cs.stir.ac.uk', 'kms7@stir.ac.uk'});
INSERT INTO users (user_id, phone) VALUES ('123', '467676');

More on CQL
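The two INSERT statements above target the same user_id, so the second adds a phone column to the existing row rather than creating a new one. The toy model below sketches that merge behaviour together with the TIMESTAMP and TTL options. It is a simplified picture under stated assumptions (integer timestamps, newest timestamp wins, TTL counted from the write timestamp), and the class ToyColumnFamily is hypothetical, not a Cassandra client API.

```python
class ToyColumnFamily:
    """Toy model of INSERT: columns merge into the row, newest timestamp wins,
    and a column written with a TTL disappears once it has expired."""

    def __init__(self):
        # row key -> column name -> (value, write timestamp, expiry or None)
        self._rows = {}

    def insert(self, key, columns, timestamp, ttl=None):
        row = self._rows.setdefault(key, {})
        expires = None if ttl is None else timestamp + ttl
        for name, value in columns.items():
            current = row.get(name)
            # Keep the existing cell if it was written with a newer timestamp.
            if current is None or timestamp >= current[1]:
                row[name] = (value, timestamp, expires)

    def select(self, key, now):
        """Return the columns of a row that are still live at time `now`."""
        return {
            name: value
            for name, (value, _, expires) in self._rows.get(key, {}).items()
            if expires is None or now < expires
        }


users = ToyColumnFamily()
users.insert("123", {"first_name": "Kev", "last_name": "Swingler"}, timestamp=10)
users.insert("123", {"phone": "467676"}, timestamp=20, ttl=5)
print(users.select("123", now=21))  # phone still live
print(users.select("123", now=30))  # phone has expired
```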

Cassandra at eBay
Cassandra at Twitter

Google BigTable Paper
search.google.com/en//archive/bigtableosdi06.pdf

HBase
Very large-scale distributed non-relational database.
Millions of rows (fine) and millions of columns (wow).
Based on Google BigTable, but built for Hadoop and HDFS.
Support for MapReduce.

Querying
No dedicated query language.
Java API, plus:
Project Phoenix: SQL-to-HBase interface via JDBC
REST API
MapReduce integration

Summary
Column-family stores map row keys to variable-length lists of column keys.
Designed to be highly scalable and to run in a fail-safe way over a cluster.
No joins, limited transaction support.
