Getting to know. by Michelle Darling August 2013

Similar documents
CQL, and the Road to Redemption

A Non-Relational Storage Analysis

CIB Session 12th NoSQL Databases Structures

Column-Family Databases Cassandra and HBase

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

Migrating Oracle Databases To Cassandra

Non-Relational Databases. Pelle Jakovits

ADVANCED DATABASES CIS 6930 Dr. Markus Schneider

Cassandra- A Distributed Database

Distributed Non-Relational Databases. Pelle Jakovits

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Sources. P. J. Sadalage, M Fowler, NoSQL Distilled, Addison Wesley

Challenges for Data Driven Systems

Outline. Introduction Background Use Cases Data Model & Query Language Architecture Conclusion

CS 655 Advanced Topics in Distributed Systems

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Analytics using Apache Hadoop and Spark with Scala

DATABASE DESIGN II - 1DL400

CloudExpo November 2017 Tomer Levi

Introduction to NoSQL Databases

Introduction to Graph Databases

The NoSQL Ecosystem. Adam Marcus MIT CSAIL

Cassandra 1.0 and Beyond

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Shen PingCAP 2017

Big Data Development CASSANDRA NoSQL Training - Workshop. November 20 to (5 days) 9 am to 5 pm HOTEL DUBAI GRAND DUBAI

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Massively scalable NoSQL with Apache Cassandra! Jonathan Ellis Project Chair, Apache Cassandra CTO,

Overview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

A NoSQL Introduction for Relational Database Developers. Andrew Karcher Las Vegas SQL Saturday September 12th, 2015

Distributed Databases: SQL vs NoSQL

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

The age of Big Data Big Data for Oracle Database Professionals

Big Data Hadoop Course Content

CS-580K/480K Advanced Topics in Cloud Computing. NoSQL Database

Hadoop An Overview. - Socrates CCDH

Cassandra - A Decentralized Structured Storage System. Avinash Lakshman and Prashant Malik Facebook

CISC 7610 Lecture 2b The beginnings of NoSQL

Migrating to Cassandra in the Cloud, the Netflix Way

Apache Cassandra-The Data Storage Framework for Hadoop

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

CSE 344 JULY 9 TH NOSQL

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Introduction to Oracle NoSQL Database

Getting Started with Cassandra

Stages of Data Processing

Hands-on immersion on Big Data tools

Chapter 24 NOSQL Databases and Big Data Storage Systems

Database Availability and Integrity in NoSQL. Fahri Firdausillah [M ]

Study of NoSQL Database Along With Security Comparison

/ Cloud Computing. Recitation 7 October 10, 2017

CSE6331: Cloud Computing

Data Informatics. Seon Ho Kim, Ph.D.

COMP9321 Web Application Engineering

Intro Cassandra. Adelaide Big Data Meetup.

NoSQL Database Comparison: Bigtable, Cassandra and MongoDB CJ Campbell Brigham Young University October 16, 2015

CSE 530A. Non-Relational Databases. Washington University Fall 2013

Massively scalable NoSQL with Apache Cassandra! Jonathan Ellis Project Chair, Apache Cassandra CTO,

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /

A Comparative Analysis of MongoDB and Cassandra

/ Cloud Computing. Recitation 10 March 22nd, 2016

10 Million Smart Meter Data with Apache HBase

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

/ Cloud Computing. Recitation 8 October 18, 2016

microsoft

Final Exam Logistics. CS 133: Databases. Goals for Today. Some References Used. Final exam take-home. Same resources as midterm

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014

IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES

Review - Relational Model Concepts

. International Journal of Advance Research in Engineering, Science & Technology. Identifying Vulnerabilities in Apache Cassandra

BIG DATA COURSE CONTENT

Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases

Big Data Analytics. Rasoul Karimi

Column-Family Stores: Cassandra

High Performance NoSQL with MongoDB

Hadoop Development Introduction

Database Solution in Cloud Computing

Why NoSQL? Why Riak?

SQL in the Hybrid World

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

Analysis of HBase Read/Write

Comparing SQL and NOSQL databases

Cassandra Database Security

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Performance Analysis of NoSQL Databases with Hadoop Integration

NoSQL Databases. Concept, Types & Use-cases.

Relational databases

NoSQL data stores and SOS: Uniform Access to Non-Relational Database Systems Paolo Atzeni Francesca Bugiotti Luca Rossi

Appropches used in efficient migrption from Relptionpl Dptpbpse to NoSQL Dptpbpse

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL

Lecture 25 Overview. Last Lecture Query optimisation/query execution strategies

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

Oracle NoSQL Database Overview Marie-Anne Neimat, VP Development

NoSQL Databases. an overview

Introduction to NoSQL by William McKnight

MongoDB An Overview. 21-Oct Socrates

Transcription:

Getting to know by Michelle Darling mdarlingcmt@gmail.com August 2013

Agenda: What is Cassandra? Installation, CQL3 Data Modelling Summary Only 15 min to cover these, so please hold questions til the end, or email me :-) and I ll summarize Q&A for everyone. Unfortunately, no time for: DB Admin Detailed Architecture Partitioning / Consistent Hashing Consistency Tuning Data Distribution & Replication System Tables App Development Using Python, Ruby etc to access Cassandra Using Hadoop to stream data into Cassandra

What is Cassandra? NoSQL Distributed DB Consistency - A ID Availability - High Point of Failure - none Good for Event Tracking & Analysis Fortuneteller of Doom from Greek Mythology. Tried to warn others about future disasters, but no one listened. Unfortunately, she was 100% accurate. Time series data Sensor device data Social media analytics Risk Analysis Failure Prediction Rackspace: Which servers are under heavy load and are about to crash?

The Evolution of Cassandra 2005 Data Model Wide rows, sparse arrays High performance through very fast write throughput. 2006 Infrastructure Peer-Peer Gossip Key-Value Pairs Tunable Consistency Originally for Inbox Search But now used for Instagram 2008: Open-Source Release / 2013: Enterprise & Community Editions

Other NoSQL vs. NoSQL Taxonomy: Key-Value Pairs Dynamo, Riak, Redis Column-Based BigTable, HBase, Cassandra Document-Based MongoDB, Couchbase Graph Neo4J Cassandra C* Differentiators: Production-proven at Netflix, ebay, Twitter, 20 of Fortune 100 Clear Winner in Scalability, Performance, Availability Big Data Capable -- DataStax

Architecture Cluster (ring) Nodes (circles) Peer-to-Peer Model Gossip Protocol Partitioner: Consistent Hashing

Netflix Streaming Video Personalized Recommendations per family member Built on Amazon Web Services (AWS) + Cassandra

Cloud installation using Amazon Web Services (AWS) Elastic Compute Cloud (EC2) Free for the 1st year! Then pay only for what you use. Sign up for AWS EC2 account: Big Data University Video 4:34 minutes, Amazon Machine Image (AMI) Preconfigured installation template Choose: DataStax AMI for Cassandra Community Edition Follow these *very good* step-by-step instructions from DataStax. AMIs also available for CouchBase, MongoDB (make sure you pick the free tier community versions to avoid monthly charge$$!!!).

AWS EC2 Dashboard

DataStax AMI Setup

DataStax AMI Setup --clustername Michelle --totalnodes 1 --version community

Roll your Own Installation DataStax Community Edition Install instructions For Linux, Windows, MacOS: http://www.datastax.com/2012/01/gettingstarted-with-cassandra Video: Set up a 4node Cassandra cluster in under 2 minutes http://www.screenr.com/5g6

Invoke CQLSH, CREATE KEYSPACE./bin/cqlsh cqlsh> CREATE KEYSPACE big_data with strategy_class = org.apache.cassandra. locator.simplestrategy with strategy_options:replication_factor= 1 ; cqlsh> use big_data; cqlsh:big_data>

Tip: Skip Thrift -- use CQL3 Thrift RPC CQL3 // Your Column Column col = new Column(ByteBuffer.wrap("name". getbytes())); col.setvalue(bytebuffer.wrap("value".getbytes())); col.settimestamp(system.currenttimemillis()); // Don't ask ColumnOrSuperColumn cosc = new ColumnOrSuperColumn(); cosc.setcolumn(col); - Uses cqlsh - SQL-like language - Runs on top of Thrift RPC - Much more user-friendly. // Prepare to be amazed Mutation mutation = new Mutation(); mutation.setcolumnorsupercolumn(cosc); List<Mutation> mutations = new ArrayList<Mutation>(); mutations.add(mutation); Map mutations_map = new HashMap<ByteBuffer, Map<String, List<Mutation>>>(); Map cf_map = new HashMap<String, List<Mutation>>(); cf_map.set("standard1", mutations); mutations_map.put(bytebuffer.wrap("key".getbytes()), cf_map); cassandra.batch_mutate(mutations_map, consistency_level); Thrift code on left equals this in CQL3: INSERT INTO (id, name) VALUES ('key', 'value');

CREATE TABLE cqlsh:big_data> create table user_tags ( user_id varchar, tag varchar, value counter, primary key (user_id, tag) ): TABLE user_tags: How many times has a user mentioned a hashtag? COUNTER datatype - Computes & stores counter value at the time data is written. This optimizes query performance.

UPDATE TABLE SELECT FROM TABLE cqlsh:big_data> UPDATE user_tags SET value=value+1 WHERE user_id = paul AND tag = cassandra cqlsh:big_data> SELECT * FROM user_tags user_id tag value --------+-----------+---------paul cassandra 1

DATA MODELING A Major Paradigm Shift! RDBMS Cassandra Structured Data, Fixed Schema Unstructured Data, Flexible Schema Array of Arrays 2D: ROW x COLUMN Nested Key-Value Pairs 3D: ROW Key x COLUMN key x COLUMN values DATABASE KEYSPACE TABLE TABLE a.k.a COLUMN FAMILY ROW ROW a.k.a PARTITION. Unit of replication. COLUMN COLUMN [Name, Value, Timestamp]. a.k.a CLUSTER. Unit of storage. Up to 2 billion columns per row. FOREIGN KEYS, JOINS, ACID Consistency Referential Integrity not enforced, so A_CID. BUT relationships represented using COLLECTIONS.

Cassandra 3D+: Nested Objects RDBMS 2D: Rows x columns

Example: Twissandra Web App Twitter-Inspired sample application written in Python + Cassandra. Play with the app: twissandra.com Examine & learn from the code on GitHub. Features/Queries: Sign In, Sign Up Post Tweet Userline (User s tweets) Timeline (All tweets) Following (Users being followed by user) Followers (Users following this user)

Twissandra.com vs Twitter.com

Twissandra - RDBMS Version Entities USER, TWEET FOLLOWER, FOLLOWING FRIENDS Relationships: USER has many TWEETs. USER is a FOLLOWER of many USERs. Many USERs are FOLLOWING USER.

Twissandra - Cassandra Version Tip: Model tables to mirror queries. TABLES or CFs TWEET USER, USERNAME FOLLOWERS, FOLLOWING USERLINE, TIMELINE Notes: Extra tables mirror queries. Denormalized tables are pre-formed for faster performance.

Tip: Remember, Skip Thrift -- use CQL3 TABLE

What does C* data look like? TABLE Userline List all of user s Tweets ************* Row Key: user_id Columns Column Key: tweet_id at Timestamp TTL (Time to Live) seconds til expiration date. *************

Cassandra Data Model = LEGOs? Flex Sch ible ema

Summary: Go straight from SQL to CQL3; skip Thrift, Column Families, SuperColumns, etc Denormalize tables to mirror important queries. Roughly 1 table per impt query. Choose wisely: Partition Keys Cluster Keys Indexes TTL Counters Collections See DataStax Music Service Example Consider hybrid approach: 20% - RDBMS for highly structured, OLTP, ACID requirements. 80% - Scale Cassandra to handle the rest of data. Remember: Cheap: storage, servers, OpenSource software. Precious: User AND Developer Happiness.

Resources C* Summit 2013: Slides Cassandra at ebay Scale (slides) Data Modelers Still Have Jobs Adjusting For the NoSQL Environment (Slides) Real-time Analytics using Cassandra, Spark and Shark slides Cassandra By Example: Data Modelling with CQL3 Slides DATASTAX C*OLLEGE CREDIT: DATA MODELLING FOR APACHE CASSANDRA slides I wish I found these 1st: How do I Cassandra? slides Mobile version of DataStax web docs (link)