Hustle Documentation
Release 0.1

Tim Spurway

February 26, 2014

Contents

1 Features
2 Getting started
    2.1 Installing Hustle
    2.2 Hustle Tutorial
3 Hustle In Depth
    3.1 Hustle Integration Test Suite
    3.2 Configuring Hustle
    3.3 Hustle Command Line Interface (CLI)
    3.4 Hustle Schema Design Guide
    3.5 Hustle Query Guide
    3.6 Inserting Data To Hustle
    3.7 Hustle Indexes
4 Reference

Hustle is a distributed, column-oriented, relational OLAP database. Hustle supports parallel insertions and queries over large data sets stored on an unreliable cluster of computers. It is meant to load and query the enormous data sets typical of ad-tech, high-volume web services, and other large-scale analytics applications.

Hustle is a distributed database. When data is inserted into Hustle, it is replicated across a cluster to enhance availability and horizontal scalability, and to enable parallel query execution. Because the data is replicated on multiple nodes, the database is resistant to node failure: there are always multiple copies of it on the cluster. This allows you to simply add more machines, both to increase overall storage and to decrease query time by performing more operations in parallel.

Hustle is a relational database, so unlike other NoSQL databases, it stores its data in rows and columns in a fixed schema. This means that you must create Tables with a fixed number of Columns of specific data types before inserting data into the database. The advantage is that both storage and query execution can be fine-tuned to minimize the data footprint and the query execution time.

Hustle uses a column-oriented format for storing data. This scheme is often used for very large databases, as it is more efficient for aggregation operations such as sum() and average() over a particular column, as well as for relational joins across tables.

Although Hustle has a relational data model, it is not a SQL database. Hustle extends the Python language to provide its relational query facility. Let's take a look at a typical Hustle query in Python:

    select(impressions.ad_id, h_sum(pixels.amount), h_count(),
           where=(impressions.date < '2014-01-13', pixels.date < '2014-01-13'),
           join=(impressions.site_id, pixels.site_id),
           order_by='ad_id', desc=True)

which would be equivalent to the SQL query:

    SELECT i.ad_id, i.site_id, sum(p.amount), count(*)
    FROM impressions i
    JOIN pixels p ON i.site_id = p.site_id
    WHERE i.date < '2014-01-13' AND p.date < '2014-01-13'
    GROUP BY i.ad_id, i.site_id
    ORDER BY i.ad_id DESC

The two approaches seem equivalent; however, Python is extensible, whereas SQL is not. You can do much more with Hustle than just query data. Hustle was designed to express distributed computation over indexed data, which includes, but is not limited to, the classic relational select statement. SQL is good at queries, but it is not an ecosystem for general-purpose, data-centric distributed computation.

Hustle is meant for large, distributed inserts and has append-only semantics. It is suited to very large, log-file-style inputs: once data is inserted, it cannot be changed. This scheme is typically suitable for distributed applications that generate large log files with many (possibly hundreds of) thousands of events per second. Hustle has been streamlined to accept structured JSON log files as its primary input format and to perform distributed inserts. A distributed insert delegates most of the database creation work to the client, thereby freeing up the cluster's resources and avoiding the central computational pinch point found in other write-bound relational OLAP databases. Hustle can handle an almost unlimited write load using this scheme.

Hustle utilizes modern compression and indexing data structures and algorithms to minimize overall memory footprint and to maximize query performance. It utilizes bitmap indexes, prefix trie (dictionary) and lz4 compression, and has a very rich set of string and numeric data types of various sizes. Typically, Hustle data sets are 25% to 50% the size of their equivalent GZIPed JSON sources.

Hustle has several auxiliary tools:

* a command line interface (CLI) Python shell with auto-completion of Hustle tables and functions
* a client side insert script
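To make the append-only flow concrete, here is a minimal sketch of a programmatic insert. hustle.insert() is referenced later in the Schema Design Guide, but the exact argument names below are assumptions, not a confirmed API:

    from hustle import Table, insert

    # load an existing table definition from the cluster
    imps = Table.from_tag('impressions')

    # distributed insert: the client builds the database fragment locally
    # and pushes it to the cluster; the keyword name here is an assumption
    insert(imps, file='impressions-2014-01-20.json')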


CHAPTER 1

Features

* column oriented - super fast queries
* distributed insert - Hustle is designed for petabyte-scale datasets in a distributed environment with massive write loads
* compressed - bitmap indexes, lz4, and prefix trie compression
* relational - join gigantic data sets
* partitioned - smart shards
* embarrassingly distributed (based on Disco)
* embarrassingly fast (uses LMDB)
* NoSQL - Python DSL
* bulk, append-only semantics
* highly available, horizontally scalable
* REPL/CLI query interface


CHAPTER 2

Getting started

2.1 Installing Hustle

Hustle is hosted on GitHub and should be cloned from that repo:

    git clone git@github.com:changoinc/hustle.git

2.1.1 Dependencies

Hustle has the following dependencies:

* Python 2.7 <http://www.python.org/downloads/>
* Disco 0.5 <http://disco.readthedocs.org/en/latest/start/install.html>

2.1.2 Installing the Hustle Client

In order to run Hustle, you will need to install it onto an existing Disco v0.5 cluster. In order to query a Hustle/Disco cluster, you will need to install the Hustle software on the client machine:

    cd hustle
    sudo ./bootstrap.sh

This will build and install Hustle on your client machine.

2.1.3 Installing on the Cluster

Disco is a distributed system and may have many nodes. Each of the nodes in your Disco cluster will need the Hustle dependencies installed. These can be found in the hustle/deps directory. The easiest way to install them is to run:

    cd hustle/deps
    make
    sudo make install

on ALL of your Disco slave nodes. You may now want to run the Integration Tests to validate your installation.

2.2 Hustle Tutorial

Coming soon...

CHAPTER 3

Hustle In Depth

3.1 Hustle Integration Test Suite

The Hustle integration test suite is a good place to see non-trivial Hustle Tables created, data inserted into them, and some subsequent queries. The tests are located in:

    hustle/integration_test

To run the test suite, ensure you have installed Nose and Hustle. Before you run the integration tests, you will need to make sure Disco is running and that you have run the setup.py script once:

    python hustle/integration_test/setup.py

You can then execute the nosetests in the integration suite:

    cd hustle/integration_test
    nosetests

3.2 Configuring Hustle

3.3 Hustle Command Line Interface (CLI)

After installing Hustle, you can invoke the Hustle CLI like this:

    hustle

Assuming you've installed everything and have a running and correctly configured Disco instance, you will get a Python prompt looking something like this:

    bin git:(develop) ./hustle
    Loading Hustle Tables from disco://hustlemaster
    impressions
    pixels
    Welcome to Hustle! Type commands() or tables() for some help, exit() to leave.
    >>>

We see here that the CLI has loaded the Hustle tables impressions and pixels from the disco://hustlemaster cluster. The CLI loads these into Python's global variable space, so the Tables are instantiated under their table names in the Python namespace:

    >>> schema(impressions)
    ad_id (int32,ix)
    cpm_millis (uint32)
    date (string,ix,pt)
    site_id (dict(32),ix)
    time (uint32,ix)
    token (string,ix)
    url (dict(32))

gives the schema of the impressions table. Doing a query is just as simple:

    >>> select(impressions.ad_id, h_sum(impressions.cpm_millis),
    ...        where=impressions.date == '2014-01-20')
    --------------------------------
    ad_id      sum(cpm_millis)
    --------------------------------
    30,016     1,690
    30,003     925
    30,019     2,023
    30,024     1,511
    30,009     863
    30,025     3,124
    30,010     2,555
    30,011     2,150
    30,014     4,491

3.4 Hustle Schema Design Guide

3.4.1 Fields

The fields of a Table are its columns. Each field has a type, an optional width, and an optional index indicator, as detailed in the following table:

    Prefix   Type          Notes
    +        index         create a normal index on this column
    =        index         create a wide index on this column
    @N       unsigned int  N = 1, 2, 4 (default), 8
    #N       signed int    N = 1, 2, 4 (default), 8
    $        string        uncompressed string data
    %N       string        trie compressed, N = 2, 4 (default)
    *        string        lz4 compressed
    &        binary        uncompressed blob data

Fields are specified using the following convention: [+|=][type[width]]name, for example:

    fields=["+$name", "+%2department", "@2salary", "*bio"]

3.4.2 Accessing Fields

Consider the following code:

    imps = Table.from_tag('impressions')
    select(imps.date, imps.site_id, where=imps)

This is a simple Hustle query written in Python. Note that the column names date and site_id are accessed using the Python dot notation. All columns are accessed as though they were members of the Table class.
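Putting this notation together, the impressions table from the CLI example above could plausibly be declared as follows. This is a sketch: Table.create() and its keyword arguments are assumptions (only Table.from_tag() appears in this guide), but the field prefixes mirror the schema(impressions) output shown earlier:

    from hustle import Table

    # hypothetical declaration; Table.create() and its keywords are assumptions
    impressions = Table.create('impressions',
        fields=['+#4ad_id',      # indexed signed 32-bit int
                '@4cpm_millis',  # unindexed unsigned 32-bit int
                '+$date',        # indexed uncompressed string, the partition
                '+%4site_id',    # indexed 4-byte trie-compressed string
                '+@4time',       # indexed unsigned 32-bit int
                '+$token',       # indexed uncompressed string
                '%4url'],        # 4-byte trie-compressed string
        partition='date')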

3.4.3 Indexes

By default, columns in Hustle are unindexed. Indexing a column makes it available for use as a key in the where and join clauses of the hustle.select() statement. Unindexed columns can still appear in the list of selected columns or in aggregation functions. Whether to index a column is a trade-off against the overall memory/disk space used by that column in your database: an indexed column will take up to twice the amount of memory as an unindexed column.

Wide indexes (the = indicator) are simply a hint to Hustle that the number of unique values in the specified column is expected to be very high relative to the overall number of rows. The Hustle query optimizer and the hustle.insert() function use this information to better manage memory when dealing with these columns.

3.4.4 Integer Data

Integers can be 1, 2, 4 or 8 bytes and are either signed or unsigned.

3.4.5 String Data and Compression

One of the fundamental design goals of Hustle was to allow for the highest level of compression possible, and string data is one area where we can maximize compression. Hustle has a total of five string representations: uncompressed, lz4 compressed, two flavours of prefix trie compression, and a binary/blob format.

The first choice for string compression should be trie compression. It offers the best performance and can deliver dramatic compression ratios for string data that has many duplicates or many shared prefixes (consider all the strings beginning with http://www., for example). Hustle's trie compression comes in 2 and 4 byte flavours. The two byte flavour can encode up to 65,536 unique strings; the 4 byte version can encode over 4 billion. Pick the two byte flavour for columns with a high degree of full-word repetition and known overall bounds, like department, sex, state, or country. For strings that have a larger range but still share common prefixes, and whose length is generally less than 256 bytes, like url, last_name, city, or user_agent, pick the four byte flavour.

We investigated many algorithms and implementations for compressing intermediate-sized string data, i.e. strings of more than 256 bytes. We found our implementation of lz4 to be both faster and to have much higher compression ratios than Snappy. Use lz4 for fields like page_content, bio, excerpt, or abstract.

Some data doesn't like to be compressed. UIDs and many other hash-based data fields are designed to be evenly distributed, and therefore defeat most compression schemes (including all of ours). In this case, it is more efficient to simply store the uncompressed string.
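As a rule of thumb, the guidance above maps onto field specifications like the following sketch (the column names are illustrative):

    # illustrative field choices following the compression guidance above
    fields = ['+%2department',   # 2-byte trie: few unique, whole-word repeats
              '+%2country',      # 2-byte trie: bounded set of values
              '%4url',           # 4-byte trie: shared prefixes, under 256 bytes
              '%4user_agent',    # 4-byte trie: larger range, common prefixes
              '*page_content',   # lz4: long text over 256 bytes
              '$uid']            # uncompressed: hash-like data defeats compression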
3.4.6 Binary Data

In Hustle, the binary attribute doesn't affect how a string is compressed; rather, it affects how the value is treated in the query pipeline. Normally, result sets are sorted and grouped to execute the group by and distinct elements of hustle.select(). If a column contains binary data, such as a .png image or a sound file, it doesn't make any sense to sort or group it.

3.4.7 Partitions

Hustle employs a technique for splitting data into distinct partitions based on a column of the target table. This can significantly increase query performance, because only the data matching the partition specified in the query needs to be considered. Typically, a partition column has the following attributes:

* the same column is present in most Tables
* the number of unique values for the column is low
* the column appears in many where clauses, often as ranges

The DATE column usually fits the bill for the partition in most log-style applications.
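For instance, date is the partition column of the impressions table shown earlier (the pt flag in its schema). A query that restricts on it, like this sketch, only has to touch the matching partition rather than the whole table:

    imps = Table.from_tag('impressions')

    # only the 2014-01-20 partition is read, not the entire table
    select(imps.ad_id, imps.site_id, where=imps.date == '2014-01-20')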

Hustle currently supports a single column partition per table. All partitions must also be indexed, and partitions must currently be uncompressed string types (the $ indicator).

Partitions are implemented both as regular columns in the database and with a DDFS tagging convention. All Hustle tables have DDFS tags that look like:

    hustle:employees

where the name of the Table is employees. Tables that have partitions will never actually store data under this root tag name; rather, they will store it under tags that look like:

    hustle:employees:2014-02-21

assuming that the employees table has the date field as a partition. All of the data marbles for the date 2014-02-21 for the employees table are guaranteed to be stored under this DDFS tag. When Hustle sees a query with a where clause identifying this exact date (or a range including it), it can directly and quickly access the correct data, thereby increasing the speed of the query.

3.5 Hustle Query Guide

3.6 Inserting Data To Hustle

3.7 Hustle Indexes

CHAPTER 4

Reference