Hustle Documentation Release 0.1 Tim Spurway February 26, 2014
Contents

1 Features
2 Getting started
    2.1 Installing Hustle
    2.2 Hustle Tutorial
3 Hustle In Depth
    3.1 Hustle Integration Test Suite
    3.2 Configuring Hustle
    3.3 Hustle Command Line Interface (CLI)
    3.4 Hustle Schema Design Guide
    3.5 Hustle Query Guide
    3.6 Inserting Data To Hustle
    3.7 Hustle Indexes
4 Reference
Hustle is a distributed, column oriented, relational OLAP Database. Hustle supports parallel insertions and queries over large data sets stored on an unreliable cluster of computers. It is meant to load and query the enormous data sets typical of ad-tech, high volume web services, and other large-scale analytics applications.

Hustle is a distributed database. When data is inserted into Hustle, it is replicated across a cluster to enhance availability and horizontal scalability and to enable parallel query execution. Because the data is replicated on multiple nodes, your database becomes resistant to node failure: there are always multiple copies of it on the cluster. This allows you to simply add more machines, both to increase overall storage and to decrease query time by performing more operations in parallel.

Hustle is a relational database, so, unlike other NoSQL databases, it stores its data in rows and columns in a fixed schema. This means that you must create Tables with a fixed number of Columns of specific data types before inserting data into the database. The advantage of this is that both storage and query execution can be fine tuned to minimize both the data footprint and the query execution time.

Hustle uses a column oriented format for storing data. This scheme is often used for very large databases, as it is more efficient for aggregation operations such as sum() and average() over a particular column, as well as for relational joins across tables.

Although Hustle has a relational data model, it is not a SQL database. Hustle extends the Python language to provide its relational query facility.
Let's take a look at a typical Hustle query in Python:

    select(impressions.ad_id, h_sum(pixels.amount), h_count(),
           where=(impressions.date < '2014-01-13', pixels.date < '2014-01-13'),
           join=(impressions.site_id, pixels.site_id),
           order_by='ad_id', desc=True)

which would be equivalent to the SQL query:

    SELECT i.ad_id, i.site_id, sum(p.amount), count(*)
    FROM impressions i
    JOIN pixels p ON i.site_id = p.site_id
    WHERE i.date < '2014-01-13' AND p.date < '2014-01-13'
    GROUP BY i.ad_id, i.site_id
    ORDER BY i.ad_id DESC

The two approaches seem equivalent; however, Python is extensible, whereas SQL is not. You can do much more with Hustle than just query data. Hustle was designed to express distributed computation over indexed data, which includes, but is not limited to, the classic relational select statement. SQL is good at queries, but not as an ecosystem for general purpose data-centric distributed computation.

Hustle is meant for large, distributed inserts, and has append only semantics. It is suited to very large log file style inputs, and once data is inserted, it cannot be changed. This scheme is typically suitable for distributed applications that generate large log files, with many (possibly hundreds of) thousands of events per second. Hustle has been streamlined to accept structured JSON log files as its primary input format, and to perform distributed inserts. A distributed insert delegates most of the database creation work to the client, thereby freeing up the cluster's resources and avoiding the central computational pinch point found in other write bound relational OLAP databases. Hustle can easily handle almost unlimited write load using this scheme.

Hustle utilizes modern compression and indexing data structures and algorithms to minimize overall memory footprint and to maximize query performance. It utilizes bitmap indexes, prefix trie (dictionary) and lz4 compression, and has a very rich set of string and numeric data types of various sizes.
Typically, Hustle data sets are 25% to 50% smaller than their equivalent GZIPed JSON sources.

Hustle has several auxiliary tools:

* a command line interface (CLI) Python shell with auto-completion of Hustle tables and functions
* a client side insert script
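
The introduction's claim that a column oriented format suits aggregations such as sum() can be illustrated with a minimal sketch in plain Python. It does not use Hustle or reflect its actual storage engine; the sample rows and the helper `to_columns` are hypothetical and only mirror the column names of the impressions table used in this document.

```python
# Hypothetical sample data; not real Hustle output.
rows = [
    {"ad_id": 30016, "site_id": "a.com", "cpm_millis": 700},
    {"ad_id": 30003, "site_id": "b.com", "cpm_millis": 925},
    {"ad_id": 30016, "site_id": "a.com", "cpm_millis": 990},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole row is visited just to read one of its fields.
row_total = sum(row["cpm_millis"] for row in rows)

# Column layout: only the values of the one column being aggregated are scanned.
column_total = sum(columns["cpm_millis"])
```

Both totals are the same; the difference is how much unrelated data has to be touched to compute them, which is what matters once the table no longer fits in memory.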
CHAPTER 1 Features

* column oriented - super fast queries
* distributed insert - Hustle is designed for petabyte scale datasets in a distributed environment with massive write loads
* compressed - bitmap indexes, lz4, and prefix trie compression
* relational - join gigantic data sets
* partitioned - smart shards
* embarrassingly distributed (based on Disco)
* embarrassingly fast (uses LMDB)
* NoSQL - Python DSL
* bulk append only semantics
* highly available, horizontally scalable
* REPL/CLI query interface
CHAPTER 2 Getting started

2.1 Installing Hustle

Hustle is hosted on GitHub and should be cloned from that repo:

    git clone git@github.com:changoinc/hustle.git

2.1.1 Dependencies

Hustle has the following dependencies:

* you will need Python 2.7 <http://www.python.org/downloads/>
* you will need Disco 0.5 <http://disco.readthedocs.org/en/latest/start/install.html>

2.1.2 Installing the Hustle Client

In order to run Hustle, you will need to install it onto an existing Disco v0.5 cluster. In order to query a Hustle/Disco cluster, you will need to install the Hustle software on the client machine:

    cd hustle
    sudo ./bootstrap.sh

This will build and install Hustle on your client machine.

2.1.3 Installing on the Cluster

Disco is a distributed system and may have many nodes. Each of the nodes in your Disco cluster will need the Hustle dependencies installed. These can be found in the hustle/deps directory. The easiest way to install them is to run:

    cd hustle/deps
    make
    sudo make install

on ALL your Disco slave nodes.

You may now want to go and run the Integration Tests to validate your installation.
2.2 Hustle Tutorial

Coming soon...
CHAPTER 3 Hustle In Depth

3.1 Hustle Integration Test Suite

The Hustle Integration Test suite is a good place to see non-trivial Hustle Tables created, data inserted into them, and some subsequent queries. The tests are located in:

    hustle/integration_test

To run the test suite, ensure you have installed Nose and Hustle. Before you run the integration tests, you will need to make sure Disco is running and that you have run the setup.py script once:

    python hustle/integration_test/setup.py

You can then execute the nosetests in the integration suite:

    cd hustle/integration_test
    nosetests

3.2 Configuring Hustle

3.3 Hustle Command Line Interface (CLI)

After installing Hustle, you can invoke the Hustle CLI like this:

    hustle

Assuming you've installed everything and have a running and correctly configured Disco instance, you will get a Python prompt looking something like this:

    bin git:(develop) ./hustle
    Loading Hustle Tables from disco://hustlemaster
    impressions
    pixels
    Welcome to Hustle! Type commands() or tables() for some help, exit() to leave.
    >>>

We see here that the CLI has loaded the Hustle tables called impressions and pixels from the disco://hustlemaster cluster. The CLI loads these into Python's global variable space, so the Tables are instantiated under their table names in the Python namespace:
    >>> schema(impressions)
    ad_id (int32,ix)
    cpm_millis (uint32)
    date (string,ix,pt)
    site_id (dict(32),ix)
    time (uint32,ix)
    token (string,ix)
    url (dict(32))

gives the schema of the impressions table. Doing a query is just as simple:

    >>> select(impressions.ad_id, h_sum(impressions.cpm_millis), where=impressions.date == '2014-01-20')
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    ad_id     sum(cpm_millis)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    30,016    1,690
    30,003    925
    30,019    2,023
    30,024    1,511
    30,009    863
    30,025    3,124
    30,010    2,555
    30,011    2,150
    30,014    4,491

3.4 Hustle Schema Design Guide

3.4.1 Fields

The fields of a Table are its columns. Each field has a type, an optional width and an optional index indicator, as detailed in the following table:

    Prefix  Type          Notes
    +       index         create a normal index on this column
    =       index         create a wide index on this column
    @N      unsigned int  N = 1 | 2 | *4 | 8
    #N      signed int    N = 1 | 2 | *4 | 8
    $       string        uncompressed string data
    %N      string        trie compressed, N = 2 | 4
    *       string        lz4 compressed
    &       binary        uncompressed blob data

Fields are specified using the following convention: [+|=][type[width]]name, for example:

    fields=["+$name", "+%2department", "@2salary", "*bio"]

3.4.2 Accessing Fields

Consider the following code:

    imps = Table.from_tag('impressions')
    select(imps.date, imps.site_id, where=imps)

This is a simple Hustle query written in Python. Note that the column names date and site_id are accessed using the Python dot notation. All columns are accessed as though they were members of the Table class.
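
The field convention from section 3.4.1 can be made concrete with a small parser. This is only a sketch of how such a specification string might be decoded, not Hustle's own implementation: `FieldSpec`, `parse_field` and the descriptive labels are hypothetical names introduced for illustration, and a missing type or width is reported as None because this guide does not state what Hustle assumes in that case.

```python
import re
from collections import namedtuple

# Hypothetical container for a decoded field specification.
FieldSpec = namedtuple("FieldSpec", "name index type width")

# Prefix characters taken from the table in section 3.4.1.
INDEX_KINDS = {"+": "normal", "=": "wide"}
TYPE_KINDS = {
    "@": "unsigned int",
    "#": "signed int",
    "$": "uncompressed string",
    "%": "trie compressed string",
    "*": "lz4 compressed string",
    "&": "binary",
}

# [+|=][type[width]]name
FIELD_RE = re.compile(r"^([+=])?(?:([@#$%*&])(\d)?)?([A-Za-z_]\w*)$")


def parse_field(spec):
    """Split a field specification string into its four parts."""
    match = FIELD_RE.match(spec)
    if match is None:
        raise ValueError("not a valid field specification: %r" % spec)
    index, type_char, width, name = match.groups()
    return FieldSpec(
        name=name,
        index=INDEX_KINDS.get(index),      # None when the column is unindexed
        type=TYPE_KINDS.get(type_char),    # None when no type prefix is given
        width=int(width) if width else None,
    )


specs = [parse_field(s) for s in ["+$name", "+%2department", "@2salary", "*bio"]]
```

Applied to the example list from section 3.4.1, this decodes "+%2department" as a normally indexed, two byte trie compressed string column named department.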
3.4.3 Indexes

By default, columns in Hustle are unindexed. By indexing a column you make it available for use as a key in the where and join clauses of the hustle.select() statement. Unindexed columns can still appear in the list of selected columns or in aggregation functions. Whether to index a column is a trade-off against the overall memory/disk space used by that column in your database: an indexed column will take up to twice the amount of memory of an unindexed column.

Wide indexes (the = indicator) are simply a hint to Hustle to expect the number of unique values for the specified column to be very high with respect to the overall number of rows. The Hustle query optimizer and the hustle.insert() function use this information to better manage memory usage when dealing with these columns.

3.4.4 Integer Data

Integers can be 1, 2, 4 or 8 bytes and are either signed or unsigned.

3.4.5 String Data and Compression

One of the fundamental design goals of Hustle was to allow for the highest level of compression possible. String data is one area where we can maximize compression. Hustle has a total of five types of string representations: uncompressed, lz4 compressed, two flavours of Prefix Trie compression, and a binary/blob format.

The first choice for string compression should be the trie compression. This offers the best performance and can offer dramatic compression ratios for string data that has many duplicates or many shared prefixes (consider the strings beginning with http://www., for example). The Hustle trie compression comes in either 2 or 4 byte flavours. The two byte flavour can encode up to 65,536 unique strings, and the four byte flavour can encode over 4 billion strings. Pick the two byte flavour for those columns that have a high degree of full-word repetition and whose overall bounds are known, like department, sex, state, country.
For strings that have a larger range, but still have common prefixes and whose overall length is generally less than 256 bytes, like url, last_name, city, user_agent, use the four byte flavour.

We investigated many algorithms and implementations for compressing intermediate sized string data, that is, strings that are more than 256 bytes. We found our implementation of lz4 to be both faster than Snappy and to have much higher compression ratios. Use lz4 for fields like page_content, bio, excerpt, abstract.

Some data doesn't like to be compressed. UIDs and many other hash based data fields are designed to be evenly distributed, and therefore defeat most compression schemes, including all of ours. In this case, it is more efficient to simply store the uncompressed string.

3.4.6 Binary Data

In Hustle, binary data is an attribute that doesn't affect how a string is compressed; rather, it affects how the value is treated in our query pipeline. Normally, result sets are sorted and grouped to execute the group by and distinct elements of hustle.select(). If you have a column that contains binary data, such as a .png image or sound file, it doesn't make any sense to sort or group it.

3.4.7 Partitions

Hustle employs a technique for splitting up data into distinct partitions based on a column in the target table. This allows us to significantly increase query performance by only considering the data that matches the partition specified in the query. Typically a partition column has the following attributes:

* the same column is in most Tables
* the number of unique values for the column is low
* the column is often in where clauses, often as ranges

The DATE column usually fits the bill for the partition in most LOG type applications.
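
The benefit of partitioning on a date column can be sketched in plain Python. This is an illustration of partition pruning in general, not of Hustle's internals: the `partitions` dictionary, the sample rows and the helpers `select_partitions` and `scan` are hypothetical. It relies on the fact that ISO formatted date strings (YYYY-MM-DD) sort in chronological order.

```python
# Hypothetical stand-in for a partitioned table: partition value -> rows.
partitions = {
    "2014-01-18": [{"ad_id": 30016, "cpm_millis": 700}],
    "2014-01-19": [{"ad_id": 30003, "cpm_millis": 925}],
    "2014-01-20": [{"ad_id": 30016, "cpm_millis": 990},
                   {"ad_id": 30019, "cpm_millis": 2023}],
}


def select_partitions(partitions, low=None, high=None):
    """Return the sorted partition keys k with low <= k <= high.

    A bound of None means that side of the range is open.
    """
    return sorted(
        key for key in partitions
        if (low is None or key >= low) and (high is None or key <= high)
    )


def scan(partitions, low=None, high=None):
    """Yield rows from matching partitions only; the rest are never read."""
    for key in select_partitions(partitions, low, high):
        for row in partitions[key]:
            yield row


# Equivalent in spirit to a where clause of date >= '2014-01-20'.
rows = list(scan(partitions, low="2014-01-20"))
total = sum(row["cpm_millis"] for row in rows)
```

A query restricted to a date range only touches the partitions inside that range, which is why a low-cardinality column that appears in most where clauses makes a good partition.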
Hustle currently supports a single column partition per table. All partitions must also be indexed. Partitions must currently be uncompressed string types ($ indicator).

Partitions are implemented both as regular columns in the database and with a DDFS tagging convention. All Hustle tables have DDFS tags that look like:

    hustle:employees

where the name of the Table is employees. Tables that have partitions will never actually store data under this root tag name; rather, they will store it under tags that look like:

    hustle:employees:2014-02-21

This assumes that the employees table has the date field as its partition. All of the data marbles for the date 2014-02-21 for the employees table are guaranteed to be stored under this DDFS tag. When Hustle sees a query with a where clause identifying this exact date (or a range including this date), it can directly and quickly access the correct data, thereby increasing the speed of the query.

3.5 Hustle Query Guide

3.6 Inserting Data To Hustle

3.7 Hustle Indexes
CHAPTER 4 Reference