Hustle Documentation. Release 0.1. Tim Spurway
|
|
- Lorin Lawrence
- 6 years ago
- Views:
Transcription
1 Hustle Documentation Release 0.1 Tim Spurway February 26, 2014
2
3 Contents 1 Features 3 2 Getting started Installing Hustle Hustle Tutorial Hustle In Depth Hustle Integration Test Suite Configuring Hustle Hustle Command Line Interface (CLI) Hustle Schema Design Guide Hustle Query Guide Inserting Data To Hustle Hustle Indexes Reference 11 i
4 ii
5 Hustle Documentation, Release 0.1 Hustle is a distributed, column oriented, relational OLAP Database. Hustle supports parallel insertions and queries over large data sets, stored on an unreliable cluster of computers. It is meant to load and query the enormous data sets typical of ad-tech, high volume web services, and other large-scale analytics applications. Hustle is a distributed database. When data is inserted into Hustle, it is replicated across a cluster to enhance availability, horizontal scalability and enable parallel query execution. When data is replicated on multiple nodes, your database becomes resistant to node failure because there is always multiple copies of it on the cluster. This allows you to simply add more machines to increase both overall storage and to decrease query time by performing more operations in parallel. Hustle is a relational database, so, unlike other NoSQL databases, it stores it s data in rows and columns in a fixed schema. This means that you must create Tables with a fixed number of Columns of specific data types, before inserting data into the database. The advantage of this is that both storage and query execution can be fine tuned to minimize both the data footprint and the query execution time. Hustle uses a column oriented format for storing data. This scheme is often used for very large databases, as it is more efficient for aggregation operations such as sum() and average() functions over a particular column as well as relational joins across tables. Although Hustle has a relational data model, it is not a SQL database. Hustle extends the Python language to facilitate it s relational query facility. Let s take a look at a typical Hustle query in Python: select(impressions.ad_id, h_sum(pixels.amount), h_count(), where=(impressions.date < , pixels.date < ), join=(impressions.site_id, pixels.site_id), order_by= ad_id, desc=true) which would be equivalent to the SQL query: SELECT i.ad_id, i.site_id, sum(p.amount), count(*) FROM impressions i JOIN pixels p on p.site_id = p.site_id WHERE i.date < and p.date < ORDER BY i.ad_id DESC GROUP BY i.ad_id, i.site_id The two approaches seem equivalent, however, Python is extensible, whereas SQL is not. You can do much more with Hustle than just query data. Hustle was designed to express distributed computation over indexed data which includes, but is not limited to the classic relational select statement. SQL is good at queries, not as an ecosystem for general purpose data-centric distributed computation. Hustle is meant for large, distributed inserts, and has append only semantics. It is suited to very large log file style inputs, and once data is inserted, it cannot be changed. This scheme is typically suitable for distributed applications that generate large log files, with many (possibly hundreds of) thousands of events per second. Hustle has been streamlined to accept structured JSON log files as it s primary input format, and to perform distributed inserts. A distributed insert delegates most of the database creation work to the client, thereby freeing up the cluster s resources and avoiding a central computational pinch point like in other write bound relational OLAP databases. Hustle can easily handle almost unlimited write load using this scheme. Hustle utilizes modern compression and indexing data structures and algorithms to minimize overall memory footprint and to maximize query performance. It utilizes bitmap indexes, prefix trie (dictionary) and lz4 compression, and has a very rich set of string and numeric data types of various sizes. Typically, Hustle data sets are 25% to 50% than their equivalent GZIPed JSON sources. Hustle has several auxiliary tools: a command line interface (CLI) Python shell with auto-completion of Hustle tables and functions a client side insert script Contents 1
6 Hustle Documentation, Release Contents
7 CHAPTER 1 Features column oriented - super fast queries distributed insert - Hustle is designed for petabyte scale datasets in a distributed environment with massive write loads compressed - bitmap indexes, lz4, and prefix trie compression relational - join gigantic data sets partitioned - smart shards embarrassingly distributed (based on Disco) embarrassingly fast (uses LMDB) NoSQL - Python DSL bulk append only semantics highly available, horizontally scalable REPL/CLI query interface 3
8 Hustle Documentation, Release Chapter 1. Features
9 CHAPTER 2 Getting started 2.1 Installing Hustle Hustle is hosted on GitHub and should be cloned from that repo: git clone git@github.com:changoinc/hustle.git Dependencies Hustle has the following dependencies: * you will need Python 2.7 < * you will need Disco 0.5 < Installing the Hustle Client In order to run Hustle, you will need to install it onto an existing Disco v0.5 cluster. In order to query a Hustle/Disco cluster, you will need to install the Hustle software on that client machine: cd hustle sudo./bootstrap.sh This will build and install Hustle on your client machine Installing on the Cluster Disco is a distributed system and may have many nodes. Each of the nodes in your Disco cluster will need to install the Hustle dependencies. These can be found in the hustle/deps directory. The easiest way to install Hustle on your disco slave nodes is to: cd hustle/deps make sudo make install on ALL you disco slave nodes. You may now want to go and run the Integration Tests to validate your installation. 5
10 Hustle Documentation, Release Hustle Tutorial coming soon... 6 Chapter 2. Getting started
11 CHAPTER 3 Hustle In Depth 3.1 Hustle Integration Test Suite The Hustle Integration Test suite is a good place to see non-trivial Hustle Tables created, data inserted into them, and some subsequent queries. They are located in: hustle/integration_test To run the test suite, ensure you have installed Nose and Hustle. Before you run the integration tests, you will need to make sure Disco is running and that you have run the setup.py script once: python hustle/integration_test/setup.py You can then execute the nosetests in the integration suite: cd hustle/integration_test nosetests 3.2 Configuring Hustle 3.3 Hustle Command Line Interface (CLI) After installing Hustle, you can invoke the Hustle CLI like this: hustle Assuming you ve installed everything and have a running and correctly configured Disco instance, you will get a Python prompt looking something like this: bin git:(develop)./hustle Loading Hustle Tables from disco://hustlemaster impressions pixels Welcome to Hustle! Type commands() or tables() for some help, exit() to leave. >>> We see here that the CLI has loaded the Hustle tables from the disco://hustlemaster cluster called impressions and pixels. The CLI actually loads these into Python s global variable space, so that these Tables are actually instantiated with their table names in the Python namespace: 7
12 Hustle Documentation, Release 0.1 >>> schema(impressions) ad_id (int32,ix) cpm_millis (uint32) date (string,ix,pt) site_id (dict(32),ix) time (uint32,ix) token (string,ix) url (dict(32)) gives the schema of the impressions table. Doing a query is just as simple: >>> select(impressions.ad_id, h_sum(impressions.cpm_millis), where=impressions.date == ) ad_id sum(cpm_millis) ,016 1,690 30, ,019 2,023 30,024 1,511 30, ,025 3,124 30,010 2,555 30,011 2,150 30,014 4, Hustle Schema Design Guide Fields The fields of a Table are it s columns. Each field has a type, an optional width and an optional index indicator as detailed in the following table: Prefix Type Notes index create a normal index on this column = index create a wide index on this unsigned int N = 1 2 *4 8 #N signed int N = 1 2 *4 8 $ string uncompressed string data %N string trie compressed N = 2 4 * string lz4 compressed & binary uncompressed blob data fields are specified using the following convention: [+ =][type[width]]name, for example: fields=["+$name", "+%2department", "@2salary", "*bio"] Accessing Fields Consider the following code: imps = Table.from_tag( impressions ) select(imps.date, imps.site_id, where=imps) This is a simple Hustle query written in Python. Note that the column names date and site_id are accessed using the Python dot notation. All columns are accessed as though they were members of the Table class. 8 Chapter 3. Hustle In Depth
13 Hustle Documentation, Release Indexes By default, columns in Hustle are unindexed. By indexing a column you make it available for use as a key in where clause and join clauses in the hustle.select() statement. Unindexed columns can still be in the list of selected columns or in aggregation function. The question whether to index a column or not is a consideration of overall memory/disk space used by that column in your database. An indexed column will take up to twice the amount of memory as an unindexed column. Wide indexes (the = indicator) are used simply as a hint to Hustle to expect the number of unique values for the specified column to be very high with respect to the overall number of rows. The Hustle query optimizer and hustle.insert() function use this information to better manage memory usage when dealing with these columns Integer Data Integers can be 1, 2, 4 or 8 bytes and are either signed or unsigned String Data and Compression One of the fundamental design goals of Hustle was to allow for the highest level of compression possible. String data is one area that we can maximize compression. Hustle has a total of five types of string representations: uncompressed, lz4 compressed, two flavours of Prefix Trie compression, and a binary/blob format. The first choice for string compression should be the trie compression. This offers the best performance and can offer dramatic compression ratios for string data that has many duplicates or many shared prefixes (consider the strings beginning with for example). The Hustle trie compression comes in either 2 or 4 byte flavours. The two byte flavour can encode up to 65,536 unique strings, and the 4 byte version can encode over 4 billion strings. Pick the two byte flavour for those columns that have a high degree of full-word repetition, like department, sex, state, country - whose overall bounds are known. For strings that have a larger range, but still have common prefixes and whose overall length is generally less than 256 bytes, like url, last_name, city, user_agent, We investigated many algorithms and implementations of compression algorithms for compressing intermediate sized string data, strings that are more than 256 bytes. We found our implementation of lz4 to be both faster and have much higher compression ratios than Snappy. Use LZ4 for fields like page_content, bio, except, abstract. Some data doesn t like to be compressed. UIDs and many other hash based data fields are designed to be evenly distributed, and therefore defeat most (all of our) compression schemes. In this case, it is more efficient to simply store the uncompressed string Binary Data In Hustle, binary data is an attribute that doesn t affect how a string is compressed, but rather, it affects how the value is treated in our query pipeline. Normally, result sets are sorted and grouped to execute group by clause and distinct clause elements of hustle.select(). If you have a column that contains binary data, such as a.png image or sound file, it doesn t make any sense to sort or group it Partitions Hustle employs a technique for splitting up data into distinct partitions based on a column in the target table. This allows us to significantly increase query performance by only considering the data that matches the partition specified in the query. Typically a partition column has the following attributes: * the same column is in most Tables * the number of unique values for the column is low * the column is often in where clauses, often as ranges The DATE column usually fits the bill for the partition in most LOG type applications Hustle Schema Design Guide 9
14 Hustle Documentation, Release 0.1 Hustle currently supports a single column partition per table. All partitions must also be indexed. Partitions must currently be uncompressed string types ( $ indicator). Partitions are implemented both as regular columns in the database and with a DDFS tagging convention. All Hustle tables have DDFS tags that look like: hustle:employees where the name of the Table is employees. Tables that have partitions will never actually store data under this root tag name, rather they will store it under tags that look like: hustle:employees: this is assuming that the employee table has the date field as a partition. All of the data marbles for the date for the employees table is guaranteed to be stored under this DDFS tag. When Hustle sees a query with a where clause identifying this exact date (or a range including this date), we will be able to directly and quickly access the correct data, thereby increasing the speed of the query. 3.5 Hustle Query Guide 3.6 Inserting Data To Hustle 3.7 Hustle Indexes 10 Chapter 3. Hustle In Depth
15 CHAPTER 4 Reference 11
High-Performance Distributed DBMS for Analytics
1 High-Performance Distributed DBMS for Analytics 2 About me Developer, hardware engineering background Head of Analytic Products Department in Yandex jkee@yandex-team.ru 3 About Yandex One of the largest
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationGreenplum Architecture Class Outline
Greenplum Architecture Class Outline Introduction to the Greenplum Architecture What is Parallel Processing? The Basics of a Single Computer Data in Memory is Fast as Lightning Parallel Processing Of Data
More informationNoSQL Databases Analysis
NoSQL Databases Analysis Jeffrey Young Intro I chose to investigate Redis, MongoDB, and Neo4j. I chose Redis because I always read about Redis use and its extreme popularity yet I know little about it.
More informationOral Questions and Answers (DBMS LAB) Questions & Answers- DBMS
Questions & Answers- DBMS https://career.guru99.com/top-50-database-interview-questions/ 1) Define Database. A prearranged collection of figures known as data is called database. 2) What is DBMS? Database
More informationThe course modules of MongoDB developer and administrator online certification training:
The course modules of MongoDB developer and administrator online certification training: 1 An Overview of the Course Introduction to the course Table of Contents Course Objectives Course Overview Value
More informationCourse Content MongoDB
Course Content MongoDB 1. Course introduction and mongodb Essentials (basics) 2. Introduction to NoSQL databases What is NoSQL? Why NoSQL? Difference Between RDBMS and NoSQL Databases Benefits of NoSQL
More informationTime Series Storage with Apache Kudu (incubating)
Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationShark: Hive (SQL) on Spark
Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce
More informationPart 1: Indexes for Big Data
JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,
More informationSAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less
SAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less Dipl.- Inform. Volker Stöffler Volker.Stoeffler@DB-TecKnowledgy.info Public Agenda Introduction: What is SAP IQ - in a
More informationMongoDB An Overview. 21-Oct Socrates
MongoDB An Overview 21-Oct-2016 Socrates Agenda What is NoSQL DB? Types of NoSQL DBs DBMS and MongoDB Comparison Why MongoDB? MongoDB Architecture Storage Engines Data Model Query Language Security Data
More informationUnderstanding NoSQL Database Implementations
Understanding NoSQL Database Implementations Sadalage and Fowler, Chapters 7 11 Class 07: Understanding NoSQL Database Implementations 1 Foreword NoSQL is a broad and diverse collection of technologies.
More informationData about data is database Select correct option: True False Partially True None of the Above
Within a table, each primary key value. is a minimal super key is always the first field in each table must be numeric must be unique Foreign Key is A field in a table that matches a key field in another
More informationChapter 17 Indexing Structures for Files and Physical Database Design
Chapter 17 Indexing Structures for Files and Physical Database Design We assume that a file already exists with some primary organization unordered, ordered or hash. The index provides alternate ways to
More informationColumn-Oriented Database Systems. Liliya Rudko University of Helsinki
Column-Oriented Database Systems Liliya Rudko University of Helsinki 2 Contents 1. Introduction 2. Storage engines 2.1 Evolutionary Column-Oriented Storage (ECOS) 2.2 HYRISE 3. Database management systems
More informationDocument Object Storage with MongoDB
Document Object Storage with MongoDB Lecture BigData Analytics Julian M. Kunkel julian.kunkel@googlemail.com University of Hamburg / German Climate Computing Center (DKRZ) 2017-12-15 Disclaimer: Big Data
More informationMongoDB Tutorial for Beginners
MongoDB Tutorial for Beginners Mongodb is a document-oriented NoSQL database used for high volume data storage. In this tutorial you will learn how Mongodb can be accessed and some of its important features
More informationAdvanced Data Management Technologies Written Exam
Advanced Data Management Technologies Written Exam 02.02.2016 First name Student number Last name Signature Instructions for Students Write your name, student number, and signature on the exam sheet. This
More informationProcessing of Very Large Data
Processing of Very Large Data Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first
More informationTrack Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross
Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk
More informationExaminingCassandra Constraints: Pragmatic. Eyes
International Journal of Management, IT & Engineering Vol. 9 Issue 3, March 2019, ISSN: 2249-0558 Impact Factor: 7.119 Journal Homepage: Double-Blind Peer Reviewed Refereed Open Access International Journal
More informationMassive Scalability With InterSystems IRIS Data Platform
Massive Scalability With InterSystems IRIS Data Platform Introduction Faced with the enormous and ever-growing amounts of data being generated in the world today, software architects need to pay special
More informationTutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access
Map/Reduce vs. DBMS Sharma Chakravarthy Information Technology Laboratory Computer Science and Engineering Department The University of Texas at Arlington, Arlington, TX 76009 Email: sharma@cse.uta.edu
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationCA485 Ray Walshe NoSQL
NoSQL BASE vs ACID Summary Traditional relational database management systems (RDBMS) do not scale because they adhere to ACID. A strong movement within cloud computing is to utilize non-traditional data
More informationEfficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)
Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-
More informationCitusDB Documentation
CitusDB Documentation Release 4.0.1 Citus Data June 07, 2016 Contents 1 Installation Guide 3 1.1 Supported Operating Systems...................................... 3 1.2 Single Node Cluster...........................................
More informationTeradata. This was compiled in order to describe Teradata and provide a brief overview of common capabilities and queries.
Teradata This was compiled in order to describe Teradata and provide a brief overview of common capabilities and queries. What is it? Teradata is a powerful Big Data tool that can be used in order to quickly
More informationGreenplum-Spark Connector Examples Documentation. kong-yew,chan
Greenplum-Spark Connector Examples Documentation kong-yew,chan Dec 10, 2018 Contents 1 Overview 1 1.1 Pivotal Greenplum............................................ 1 1.2 Pivotal Greenplum-Spark Connector...................................
More informationDruid Power Interactive Applications at Scale. Jonathan Wei Software Engineer
Druid Power Interactive Applications at Scale Jonathan Wei Software Engineer History & Motivation Demo Overview Storage Internals Druid Architecture Motivation Motivation Visibility and analysis for complex
More informationData Analysis and Data Science
Data Analysis and Data Science CPS352: Database Systems Simon Miner Gordon College Last Revised: 4/29/15 Agenda Check-in Online Analytical Processing Data Science Homework 8 Check-in Online Analytical
More informationHive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)
Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to
More informationA Fast and High Throughput SQL Query System for Big Data
A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More information15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2
15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2 Name Section Directions: Answer each question neatly in the space provided. Please read each question carefully. You have 50 minutes for this exam. No electronic
More informationGoal of this document: A simple yet effective
INTRODUCTION TO ELK STACK Goal of this document: A simple yet effective document for folks who want to learn basics of ELK (Elasticsearch, Logstash and Kibana) without any prior knowledge. Introduction:
More informationIntroduction to Column Stores with MemSQL. Seminar Database Systems Final presentation, 11. January 2016 by Christian Bisig
Final presentation, 11. January 2016 by Christian Bisig Topics Scope and goals Approaching Column-Stores Introducing MemSQL Benchmark setup & execution Benchmark result & interpretation Conclusion Questions
More informationEvolution of Database Systems
Evolution of Database Systems Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies, second
More informationPHP Composer 9 Benefits of Using a Binary Repository Manager
PHP Composer 9 Benefits of Using a Binary Repository Manager White Paper Copyright 2017 JFrog Ltd. March 2017 www.jfrog.com Executive Summary PHP development has become one of the most popular platforms
More informationBanzaiDB Documentation
BanzaiDB Documentation Release 0.3.0 Mitchell Stanton-Cook Jul 19, 2017 Contents 1 BanzaiDB documentation contents 3 2 Indices and tables 11 i ii BanzaiDB is a tool for pairing Microbial Genomics Next
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationTransformer Looping Functions for Pivoting the data :
Transformer Looping Functions for Pivoting the data : Convert a single row into multiple rows using Transformer Looping Function? (Pivoting of data using parallel transformer in Datastage 8.5,8.7 and 9.1)
More information1. Attempt any two of the following: 10 a. State and justify the characteristics of a Data Warehouse with suitable examples.
Instructions to the Examiners: 1. May the Examiners not look for exact words from the text book in the Answers. 2. May any valid example be accepted - example may or may not be from the text book 1. Attempt
More informationCopyright 2016 Ramez Elmasri and Shamkant B. Navathe
CHAPTER 19 Query Optimization Introduction Query optimization Conducted by a query optimizer in a DBMS Goal: select best available strategy for executing query Based on information available Most RDBMSs
More informationAn Introduction to Big Data Formats
Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION
More informationUsing the MySQL Document Store
Using the MySQL Document Store Alfredo Kojima, Sr. Software Dev. Manager, MySQL Mike Zinner, Sr. Software Dev. Director, MySQL Safe Harbor Statement The following is intended to outline our general product
More informationEvaluation Guide for ASP.NET Web CMS and Experience Platforms
Evaluation Guide for ASP.NET Web CMS and Experience Platforms CONTENTS Introduction....................... 1 4 Key Differences...2 Architecture:...2 Development Model...3 Content:...4 Database:...4 Bonus:
More informationDatabase performance becomes an important issue in the presence of
Database tuning is the process of improving database performance by minimizing response time (the time it takes a statement to complete) and maximizing throughput the number of statements a database can
More informationBinary representation and data
Binary representation and data Loriano Storchi loriano@storchi.org http:://www.storchi.org/ Binary representation of numbers In a positional numbering system given the base this directly defines the number
More informationI am: Rana Faisal Munir
Self-tuning BI Systems Home University (UPC): Alberto Abelló and Oscar Romero Host University (TUD): Maik Thiele and Wolfgang Lehner I am: Rana Faisal Munir Research Progress Report (RPR) [1 / 44] Introduction
More informationBig Data Analytics. Rasoul Karimi
Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1 Outline
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate
More information7. Query Processing and Optimization
7. Query Processing and Optimization Processing a Query 103 Indexing for Performance Simple (individual) index B + -tree index Matching index scan vs nonmatching index scan Unique index one entry and one
More information9 Reasons To Use a Binary Repository for Front-End Development with Bower
9 Reasons To Use a Binary Repository for Front-End Development with Bower White Paper Introduction The availability of packages for front-end web development has somewhat lagged behind back-end systems.
More informationOverview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL
* Some History * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL * NoSQL Taxonomy * Towards NewSQL Overview * Some History * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL * NoSQL Taxonomy *TowardsNewSQL NoSQL
More informationProgramming for Data Science Syllabus
Programming for Data Science Syllabus Learn to use Python and SQL to solve problems with data Before You Start Prerequisites: There are no prerequisites for this program, aside from basic computer skills.
More informationORC Files. Owen O June Page 1. Hortonworks Inc. 2012
ORC Files Owen O Malley owen@hortonworks.com @owen_omalley owen@hortonworks.com June 2013 Page 1 Who Am I? First committer added to Hadoop in 2006 First VP of Hadoop at Apache Was architect of MapReduce
More informationSepand Gojgini. ColumnStore Index Primer
Sepand Gojgini ColumnStore Index Primer SQLSaturday Sponsors! Titanium & Global Partner Gold Silver Bronze Without the generosity of these sponsors, this event would not be possible! Please, stop by the
More informationCHAPTER. Oracle Database 11g Architecture Options
CHAPTER 1 Oracle Database 11g Architecture Options 3 4 Part I: Critical Database Concepts Oracle Database 11g is a significant upgrade from prior releases of Oracle. New features give developers, database
More informationTime Series Live 2017
1 Time Series Schemas @Percona Live 2017 Who Am I? Chris Larsen Maintainer and author for OpenTSDB since 2013 Software Engineer @ Yahoo Central Monitoring Team Who I m not: A marketer A sales person 2
More informationProject Genesis. Cafepress.com Product Catalog Hundreds of Millions of Products Millions of new products every week Accelerating growth
Scaling with HiveDB Project Genesis Cafepress.com Product Catalog Hundreds of Millions of Products Millions of new products every week Accelerating growth Enter Jeremy and HiveDB Our Requirements OLTP
More informationT-SQL Training: T-SQL for SQL Server for Developers
Duration: 3 days T-SQL Training Overview T-SQL for SQL Server for Developers training teaches developers all the Transact-SQL skills they need to develop queries and views, and manipulate data in a SQL
More informationInputs. Decisions. Leads to
Chapter 6: Physical Database Design and Performance Modern Database Management 9 th Edition Jeffrey A. Hoffer, Mary B. Prescott, Heikki Topi 2009 Pearson Education, Inc. Publishing as Prentice Hall 1 Objectives
More informationUse Cases for Partitioning. Bill Karwin Percona, Inc
Use Cases for Partitioning Bill Karwin Percona, Inc. 2011-02-16 1 Why Use Cases?! Anyone can read the reference manual: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html! This talk is about when
More informationSQL Server on Linux and Containers
http://aka.ms/bobwardms https://github.com/microsoft/sqllinuxlabs SQL Server on Linux and Containers A Brave New World Speaker Name Principal Architect Microsoft bobward@microsoft.com @bobwardms linkedin.com/in/bobwardms
More informationProblem Set 2 Solutions
6.893 Problem Set 2 Solutons 1 Problem Set 2 Solutions The problem set was worth a total of 25 points. Points are shown in parentheses before the problem. Part 1 - Warmup (5 points total) 1. We studied
More informationHash-Based Indexing 165
Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19
More informationHadoop. copyright 2011 Trainologic LTD
Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides
More informationMaking the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor
Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor Agenda Importance of Hadoop + data compression Data compression techniques Compression,
More informationDistributed Computations MapReduce. adapted from Jeff Dean s slides
Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)
More informationTable Compression in Oracle9i Release2. An Oracle White Paper May 2002
Table Compression in Oracle9i Release2 An Oracle White Paper May 2002 Table Compression in Oracle9i Release2 Executive Overview...3 Introduction...3 How It works...3 What can be compressed...4 Cost and
More informationInvitation to a New Kind of Database. Sheer El Showk Cofounder, Lore Ai We re Hiring!
Invitation to a New Kind of Database Sheer El Showk Cofounder, Lore Ai www.lore.ai We re Hiring! Overview 1. Problem statement (~2 minute) 2. (Proprietary) Solution: Datomics (~10 minutes) 3. Proposed
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationYou should have a basic understanding of Relational concepts and basic SQL. It will be good if you have worked with any other RDBMS product.
About the Tutorial is a popular Relational Database Management System (RDBMS) suitable for large data warehousing applications. It is capable of handling large volumes of data and is highly scalable. This
More informationContents in Detail. Foreword by Xavier Noria
Contents in Detail Foreword by Xavier Noria Acknowledgments xv xvii Introduction xix Who This Book Is For................................................ xx Overview...xx Installation.... xxi Ruby, Rails,
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More information10 Million Smart Meter Data with Apache HBase
10 Million Smart Meter Data with Apache HBase 5/31/2017 OSS Solution Center Hitachi, Ltd. Masahiro Ito OSS Summit Japan 2017 Who am I? Masahiro Ito ( 伊藤雅博 ) Software Engineer at Hitachi, Ltd. Focus on
More informationAzure-persistence MARTIN MUDRA
Azure-persistence MARTIN MUDRA Storage service access Blobs Queues Tables Storage service Horizontally scalable Zone Redundancy Accounts Based on Uri Pricing Calculator Azure table storage Storage Account
More informationExternal Sorting Sorting Tables Larger Than Main Memory
External External Tables Larger Than Main Memory B + -trees for 7.1 External Challenges lurking behind a SQL query aggregation SELECT C.CUST_ID, C.NAME, SUM (O.TOTAL) AS REVENUE FROM CUSTOMERS AS C, ORDERS
More informationCSE 344 MAY 7 TH EXAM REVIEW
CSE 344 MAY 7 TH EXAM REVIEW EXAMINATION STATIONS Exam Wednesday 9:30-10:20 One sheet of notes, front and back Practice solutions out after class Good luck! EXAM LENGTH Production v. Verification Practice
More informationDesigning Database Solutions for Microsoft SQL Server 2012
Designing Database Solutions for Microsoft SQL Server 2012 Course 20465A 5 Days Instructor-led, Hands-on Introduction This course describes how to design and monitor high performance, highly available
More informationFive Common Myths About Scaling MySQL
WHITE PAPER Five Common Myths About Scaling MySQL Five Common Myths About Scaling MySQL In this age of data driven applications, the ability to rapidly store, retrieve and process data is incredibly important.
More informationData Warehousing and Decision Support. Introduction. Three Complementary Trends. [R&G] Chapter 23, Part A
Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 432 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business
More informationSymbol Tables Symbol Table: In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is
More informationNOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY
NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY WHAT IS NOSQL? Stands for No-SQL or Not Only SQL. Class of non-relational data storage systems E.g.
More informationCourse Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:
Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course: 20762C Developing SQL 2016 Databases Module 1: An Introduction to Database Development Introduction to the
More informationAdministração e Optimização de Bases de Dados 2012/2013 Index Tuning
Administração e Optimização de Bases de Dados 2012/2013 Index Tuning Bruno Martins DEI@Técnico e DMIR@INESC-ID Index An index is a data structure that supports efficient access to data Condition on Index
More informationMongoDB. Nicolas Travers Conservatoire National des Arts et Métiers. MongoDB
Nicolas Travers Conservatoire National des Arts et Métiers 1 Introduction Humongous (monstrous / enormous) NoSQL: Documents Oriented JSon Serialized format: BSon objects Implemented in C++ Keys indexing
More informationFinal Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23
Final Exam Review 2 Kathleen Durant CS 3200 Northeastern University Lecture 23 QUERY EVALUATION PLAN Representation of a SQL Command SELECT {DISTINCT} FROM {WHERE
More informationHadoop Online Training
Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the
More informationRavenDB & document stores
université libre de bruxelles INFO-H415 - Advanced Databases RavenDB & document stores Authors: Yasin Arslan Jacky Trinh Professor: Esteban Zimányi Contents 1 Introduction 3 1.1 Présentation...................................
More information8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara
Week 1-B-0 Week 1-B-1 CS535 BIG DATA FAQs Slides are available on the course web Wait list Term project topics PART 0. INTRODUCTION 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer
More informationStorageTapper. Real-time MySQL Change Data Uber. Ovais Tariq, Shriniket Kale & Yevgeniy Firsov. October 03, 2017
StorageTapper Real-time MySQL Change Data Streaming @ Uber Ovais Tariq, Shriniket Kale & Yevgeniy Firsov October 03, 2017 Overview What we will cover today Background & Motivation High Level Features System
More informationOVERVIEW OF RELATIONAL DATABASES: KEYS
OVERVIEW OF RELATIONAL DATABASES: KEYS Keys (typically called ID s in the Sierra Database) come in two varieties, and they define the relationship between tables. Primary Key Foreign Key OVERVIEW OF DATABASE
More informationBeyond Relational Databases: MongoDB, Redis & ClickHouse. Marcos Albe - Principal Support Percona
Beyond Relational Databases: MongoDB, Redis & ClickHouse Marcos Albe - Principal Support Engineer @ Percona Introduction MySQL everyone? Introduction Redis? OLAP -vs- OLTP Image credits: 451 Research (https://451research.com/state-of-the-database-landscape)
More informationAdvanced Databases: Parallel Databases A.Poulovassilis
1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger
More informationCassandra- A Distributed Database
Cassandra- A Distributed Database Tulika Gupta Department of Information Technology Poornima Institute of Engineering and Technology Jaipur, Rajasthan, India Abstract- A relational database is a traditional
More informationNoSQL + SQL = MySQL. Nicolas De Rico Principal Solutions Architect
NoSQL + SQL = MySQL Nicolas De Rico Principal Solutions Architect nicolas.de.rico@oracle.com Safe Harbor Statement The following is intended to outline our general product direction. It is intended for
More information