Splout SQL: When Big Data Output is also Big Data
1 Iván de Prado Alonso, CEO. Splout SQL: When Big Data Output is also Big Data
2 Big Data consulting & training
4 Full SQL* for Big Data, at web latency & throughput
Unlike NoSQL. Unlike RDBMS. Unlike Impala, Apache Drill, etc.
(* Within each partition)
10 How does it work? Isolation between generation and serving
13 Generation
Generate tablespace CLIENTS_INFO with 2 partitions, for table CLIENTS partitioned by CID and table SALES partitioned by CID.

Table CLIENTS        Table SALES
CID  Name            SID   CID  Amount
U20  Doug            S100  U..  ..
U21  Ted             S101  U20  60
U40  John            S223  U40  99

Tablespace CLIENTS_INFO
Partition [U10..U35]             Partition [U36..U60]
  CLIENTS: U20 Doug, U21 Ted       CLIENTS: U40 John
  SALES: S100 U.. .., S101 U20 60  SALES: S223 U40 99
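The generation step above can be sketched in a few lines of Python. This is a minimal toy model under stated assumptions, not Splout's Hadoop job: both tables share the partition key (CID), keys are routed by the range boundaries from the slide, and each partition is an SQLite database (as in Splout). Only the two fully transcribed SALES rows are loaded.

```python
import sqlite3

# Key ranges from the slide: partition 0 owns [U10..U35], partition 1 owns [U36..U60].
PARTITIONS = {0: ("U10", "U35"), 1: ("U36", "U60")}

clients = [("U20", "Doug"), ("U21", "Ted"), ("U40", "John")]
sales = [("S101", "U20", 60), ("S223", "U40", 99)]

def partition_for(cid):
    """Route a CID to the partition whose [lo, hi] range contains it."""
    for pid, (lo, hi) in PARTITIONS.items():
        if lo <= cid <= hi:
            return pid
    raise KeyError(cid)

# One SQLite database per partition (in-memory here; files in practice).
dbs = {pid: sqlite3.connect(":memory:") for pid in PARTITIONS}
for db in dbs.values():
    db.execute("CREATE TABLE CLIENTS (CID TEXT PRIMARY KEY, Name TEXT)")
    db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")

# Both tables are partitioned by the same key, so related rows land together.
for cid, name in clients:
    dbs[partition_for(cid)].execute("INSERT INTO CLIENTS VALUES (?, ?)", (cid, name))
for sid, cid, amount in sales:
    dbs[partition_for(cid)].execute("INSERT INTO SALES VALUES (?, ?, ?)", (sid, cid, amount))

rows0 = dbs[0].execute("SELECT CID FROM CLIENTS ORDER BY CID").fetchall()
rows1 = dbs[1].execute("SELECT CID FROM CLIENTS ORDER BY CID").fetchall()
print(rows0)  # [('U20',), ('U21',)]
print(rows1)  # [('U40',)]
```

Co-partitioning is what later makes single-partition joins possible: every SALES row sits in the same database file as its CLIENTS row.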
20 Serving
For key = 'U20', tablespace = CLIENTS_INFO: the query is routed to partition [U10..U35].

SELECT Name, sum(amount)
FROM CLIENTS c, SALES s
WHERE c.cid = s.cid AND c.cid = 'U20';
25 Serving
For key = 'U40', tablespace = CLIENTS_INFO: the query is routed to partition [U36..U60].

SELECT Name, sum(amount)
FROM CLIENTS c, SALES s
WHERE c.cid = s.cid AND c.cid = 'U40';
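The serving path above can be sketched in a few lines. This is a minimal toy model, not Splout's implementation: each partition is an SQLite database (as in Splout), the client supplies a routing key, and the query runs inside the single partition that owns the key. The garbled S100 row from the slides is omitted.

```python
import sqlite3

# Key ranges mirror the two partitions on the slides.
RANGES = [("U10", "U35"), ("U36", "U60")]

def build_partition(clients, sales):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE CLIENTS (CID TEXT PRIMARY KEY, Name TEXT)")
    db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")
    db.executemany("INSERT INTO CLIENTS VALUES (?, ?)", clients)
    db.executemany("INSERT INTO SALES VALUES (?, ?, ?)", sales)
    return db

partitions = [
    build_partition([("U20", "Doug"), ("U21", "Ted")], [("S101", "U20", 60)]),
    build_partition([("U40", "John")], [("S223", "U40", 99)]),
]

def query(key, sql, params=()):
    """Route by key range, then run the SQL inside that partition only."""
    for (lo, hi), db in zip(RANGES, partitions):
        if lo <= key <= hi:
            return db.execute(sql, params).fetchall()
    raise KeyError(key)

SQL = ("SELECT Name, SUM(Amount) FROM CLIENTS c JOIN SALES s "
       "ON c.CID = s.CID WHERE c.CID = ? GROUP BY Name")
print(query("U20", SQL, ("U20",)))  # [('Doug', 60)], from partition [U10..U35]
print(query("U40", SQL, ("U40",)))  # [('John', 99)], from partition [U36..U60]
```

Because each query touches exactly one partition, no coordination between nodes is needed at query time.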
31 Why does it scale?
- Data is partitioned
- Partitions are distributed across nodes
- Adding more nodes increases capacity
- Queries are restricted to a single partition
- Generation does not impact serving
38 Ok, so what is Splout SQL useful for?
40 Big Data Analytics: manageable output
44 Big Data Analytics
Sometimes Big Data output is also Big Data.
Splout SQL makes it possible to serve Big Data results.
48 Let's see an example.
49 Building a Google Analytics
Imagine that one crazy day you decide to build some kind of Google Analytics:
- Zillions of events
- Millions of domains
- An individual panel per domain
54 Requirements
- Time-based charts (day/hour aggregations)
- Flexible dimension breakdown: per page, per browser, per country, per language
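These requirements map directly onto per-partition SQL. Below is a minimal sketch of the kind of queries a per-domain panel would run; the `events` table and its columns are illustrative assumptions, not Splout's schema. The domain would be the partition key, so every query stays inside one partition.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (domain TEXT, ts TEXT, country TEXT, hits INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("example.com", "2013-05-01 10:15:00", "ES", 3),
    ("example.com", "2013-05-01 10:45:00", "US", 2),
    ("example.com", "2013-05-01 11:05:00", "ES", 4),
    ("example.com", "2013-05-02 09:00:00", "US", 1),
])

# Time-based chart: hourly rollup for one domain (the panel's key).
hourly = db.execute(
    "SELECT strftime('%Y-%m-%d %H', ts) AS hour, SUM(hits) "
    "FROM events WHERE domain = ? GROUP BY hour ORDER BY hour",
    ("example.com",)).fetchall()

# Flexible breakdown: same data, sliced by the country dimension.
by_country = db.execute(
    "SELECT country, SUM(hits) FROM events WHERE domain = ? "
    "GROUP BY country ORDER BY country", ("example.com",)).fetchall()
print(hourly)      # [('2013-05-01 10', 5), ('2013-05-01 11', 4), ('2013-05-02 09', 1)]
print(by_country)  # [('ES', 7), ('US', 3)]
```

The same pattern generalizes to per-page, per-browser, or per-language breakdowns by grouping on a different column.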
57 With Splout SQL
58 Splout SQL provides SQL consolidated views for Hadoop data
59 Let's see more details about Splout SQL.
60 Splout SQL Architecture
62 Each partition is:
- Backed by SQLite or MySQL
- Generated on Hadoop, including any indexes needed; data can be sorted before insertion to minimize disk seeks at query time; pre-sampling balances partition sizes
- Distributed on the Splout SQL cluster, with replication for failover
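The "sort before insertion" point deserves a concrete illustration. In the sketch below (a toy model, not Splout's generation code) rows are sorted by the query key before the bulk load, so rows for the same key end up physically adjacent, and the index is created after the load, as a batch-generation step would do. The real payoff is fewer disk seeks; here we only show the mechanics.

```python
import sqlite3
import random

# 1000 rows over 50 keys, arriving in arbitrary order.
rows = [(f"U{i % 50:02d}", i) for i in range(1000)]
random.shuffle(rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE SALES (CID TEXT, Amount INTEGER)")
# Sort by the partition/query key before the bulk insert...
db.executemany("INSERT INTO SALES VALUES (?, ?)", sorted(rows))
# ...and create the index afterwards, as the generation step would.
db.execute("CREATE INDEX idx_sales_cid ON SALES (CID)")

plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(Amount) FROM SALES WHERE CID = 'U07'").fetchall()
total = db.execute(
    "SELECT SUM(Amount) FROM SALES WHERE CID = 'U07'").fetchone()[0]
print(plan)   # the plan shows a search using idx_sales_cid
print(total)  # 9640
```

On a disk-backed file, the sorted load means the index points at consecutive pages for each key, which is exactly the collocation the slide is describing.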
70 Atomicity
- A tablespace is a set of tables that share the same partitioning schema
- Tablespaces are versioned; only one version is served at a time
- Several tablespaces can be deployed at once, with all-or-nothing semantics (atomicity)
- Rollback support
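The versioning scheme can be sketched as a single version pointer that a deployment swaps atomically. This is an assumed simplification for illustration, not Splout's internals: readers always see either the old set of tablespace versions or the new set, never a mix, and keeping the old mapping around is what makes rollback cheap.

```python
import threading

class Catalog:
    """Toy model of versioned tablespaces with atomic deploy/rollback."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}  # tablespace name -> served version id

    def deploy(self, new_versions):
        """All-or-nothing: every tablespace flips in one step."""
        with self._lock:
            old = dict(self._versions)
            self._versions = dict(new_versions)
            return old  # kept around to support rollback

    def rollback(self, old_versions):
        with self._lock:
            self._versions = dict(old_versions)

    def version_of(self, tablespace):
        with self._lock:
            return self._versions.get(tablespace)

catalog = Catalog()
catalog.deploy({"CLIENTS_INFO": 1, "MERCHANTS_INFO": 1})
previous = catalog.deploy({"CLIENTS_INFO": 2, "MERCHANTS_INFO": 2})
catalog.rollback(previous)  # both tablespaces back to version 1, together
print(catalog.version_of("CLIENTS_INFO"))  # 1
```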
77 Characteristics
Guaranteed millisecond latencies, even when queries hit disk.
Controlled by the developer by selecting the proper:
- Cluster topology
- Partitioning
- Indexes
- Data collocation (insertion order)
82 Characteristics (II)
- 100% SQL, but restricted to a single partition
- Real-time aggregations
- Joins
- Scalability: in data capacity and in performance
90 Characteristics (III)
- Atomicity: new data replaces old data all at once
- High availability, through the use of replication
- Open source
96 Characteristics (IV)
- Easy to manage: changing the size of the cluster can be done without any downtime
- Read-only: data is updated in batches; updates come from new tablespace deployments
102 Characteristics (V)
Native connectors: Hive, Pig, Cascading
107 API - Generation
- Command line, loading CSV files:
  $ hadoop jar splout-*-hadoop.jar generate
- Java API
- HCatalog: Hive, Pig
116 API - Service
REST API with JSON responses
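A client call would then be a plain HTTP GET carrying the routing key and the SQL. The endpoint path, port, and parameter names below are assumptions for illustration only, not the documented Splout API; check the project docs for the real URL shape before using this.

```python
from urllib.parse import urlencode

def query_url(host, tablespace, key, sql):
    """Build a hypothetical query URL: path and parameter names are assumed."""
    params = urlencode({"key": key, "sql": sql})
    return f"http://{host}/api/query/{tablespace}?{params}"

url = query_url("localhost:4412", "CLIENTS_INFO", "U20",
                "SELECT Name FROM CLIENTS WHERE CID = 'U20'")
print(url)  # the server would answer with a JSON response body
```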
120 API - Console
121 Joins
- Between co-partitioned tables (e.g. CLIENTS and SALES by CID)
- With omnipresent tables: full data present in every partition; useful for dimension tables in star schemas (e.g. a countries table)
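The omnipresent-table idea can be shown concretely: a small dimension table is copied whole into every partition, so any partition can join against it locally with no cross-partition traffic. The sketch below is a toy model with an illustrative COUNTRIES table, mirroring the slide's example.

```python
import sqlite3

# The small dimension table replicated into every partition.
COUNTRIES = [("ES", "Spain"), ("US", "United States")]

def build_partition(clients):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE CLIENTS (CID TEXT, Name TEXT, Country TEXT)")
    db.execute("CREATE TABLE COUNTRIES (Code TEXT, Name TEXT)")
    db.executemany("INSERT INTO CLIENTS VALUES (?, ?, ?)", clients)
    # Omnipresent: the full COUNTRIES data goes into this partition too.
    db.executemany("INSERT INTO COUNTRIES VALUES (?, ?)", COUNTRIES)
    return db

p0 = build_partition([("U20", "Doug", "US")])
p1 = build_partition([("U40", "John", "ES")])

# The same star-schema join works in either partition, locally.
sql = ("SELECT c.Name, co.Name FROM CLIENTS c "
       "JOIN COUNTRIES co ON c.Country = co.Code")
r0 = p0.execute(sql).fetchall()
r1 = p1.execute(sql).fetchall()
print(r0)  # [('Doug', 'United States')]
print(r1)  # [('John', 'Spain')]
```

This is the standard trade-off for dimension tables: they are small enough that full replication is cheaper than any distributed join.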
126 What if I need different partitioning?
Example: queries by Merchant cannot be answered by a tablespace partitioned by Client.
Just create more tablespaces: the first partitioned by Client, the second by Merchant, and deploy both atomically.
131 Benchmark
- 350 GB of Wikipedia logs
- Aggregation queries impacting 15 rows on average
- 2-machine cluster: 900 queries/second, 80 ms/query, 80 threads
137 Benchmark (II)
- 4-machine cluster: 3,150 queries/second, 40 ms/query, 160 threads
- More info: http://sploutsql.com/performance.html
142 Web-latency SQL consolidated views for Hadoop
A good candidate for the serving layer of a lambda architecture
147 Future work
- Growing the community: do you want to collaborate?
- More engines: SQLite, MySQL and Redis already done; columnar formats
- Rack awareness
- Multi-tenancy
- Testing at scale: test Splout on bigger clusters
156 Iván de Prado Alonso, CEO. http://sploutsql.com. Questions?
Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects Oozie 25 How to Analyze Large Data Sets in Hadoop Although the Hadoop framework is implemented in Java, MapReduce
More informationOracle NoSQL Database Enterprise Edition, Version 18.1
Oracle NoSQL Database Enterprise Edition, Version 18.1 Oracle NoSQL Database is a scalable, distributed NoSQL database, designed to provide highly reliable, flexible and available data management across
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory
More informationStages of Data Processing
Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,
More informationTalend Big Data Sandbox. Big Data Insights Cookbook
Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is
More informationCloud Analytics and Business Intelligence on AWS
Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse
More informationEvolving To The Big Data Warehouse
Evolving To The Big Data Warehouse Kevin Lancaster 1 Copyright Director, 2012, Oracle and/or its Engineered affiliates. All rights Insert Systems, Information Protection Policy Oracle Classification from
More informationTop 7 Data API Headaches (and How to Handle Them) Jeff Reser Data Connectivity & Integration Progress Software
Top 7 Data API Headaches (and How to Handle Them) Jeff Reser Data Connectivity & Integration Progress Software jreser@progress.com Agenda Data Variety (Cloud and Enterprise) ABL ODBC Bridge Using Progress
More informationNoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu
NoSQL Databases MongoDB vs Cassandra Kenny Huynh, Andre Chik, Kevin Vu Introduction - Relational database model - Concept developed in 1970 - Inefficient - NoSQL - Concept introduced in 1980 - Related
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More information1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions
1Z0-449 Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions Table of Contents Introduction to 1Z0-449 Exam on Oracle Big Data 2017 Implementation Essentials... 2 Oracle 1Z0-449
More informationIntroduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent
Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent Agenda Why people use Ka1a Technical overview of Ka1a What s coming What s Apache Ka1a Distributed, high throughput pub/sub system Ka1a Usage
More informationGhislain Fourny. Big Data 5. Wide column stores
Ghislain Fourny Big Data 5. Wide column stores Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2 Where we are User interfaces
More informationData warehousing on Hadoop. Marek Grzenkowicz Roche Polska
Data warehousing on Hadoop Marek Grzenkowicz Roche Polska Agenda Introduction Case study: StraDa project Source data Data model Data flow and processing Reporting Lessons learnt Ideas for the future Q&A
More informationAsanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks
Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data
More informationScaling DreamFactory
Scaling DreamFactory This white paper is designed to provide information to enterprise customers about how to scale a DreamFactory Instance. The sections below talk about horizontal, vertical, and cloud
More informationApproaching the Petabyte Analytic Database: What I learned
Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationStart Working with Parquet!!!!
My Goal Tonight. Start Working with Parquet!!!! Parquet Query Performance Origin of Parquet Parquet Storage Query Request Usage with Hadoop Tools Customer Examples Topics Parquet Defined Storage & Encoding
More informationDISTRIBUTED DATABASE OPTIMIZATIONS WITH NoSQL MEMBERS
U.P.B. Sci. Bull., Series C, Vol. 77, Iss. 2, 2015 ISSN 2286-3540 DISTRIBUTED DATABASE OPTIMIZATIONS WITH NoSQL MEMBERS George Dan POPA 1 Distributed database complexity, as well as wide usability area,
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate
More informationOLAP Introduction and Overview
1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata
More informationHBase... And Lewis Carroll! Twi:er,
HBase... And Lewis Carroll! jw4ean@cloudera.com Twi:er, LinkedIn: @jw4ean 1 Introduc@on 2010: Cloudera Solu@ons Architect 2011: Cloudera TAM/DSE 2012-2013: Cloudera Training focusing on Partners and Newbies
More informationQLIK INTEGRATION WITH AMAZON REDSHIFT
QLIK INTEGRATION WITH AMAZON REDSHIFT Qlik Partner Engineering Created August 2016, last updated March 2017 Contents Introduction... 2 About Amazon Web Services (AWS)... 2 About Amazon Redshift... 2 Qlik
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationMaximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale.
Maximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale. January 2016 Credit Card Fraud prevention is among the most time-sensitive and high-value of IT tasks. The databases
More informationProcessing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer
Processing big data with modern applications: Hadoop as DWH backend at Pro7 Dr. Kathrin Spreyer Big data engineer GridKa School Karlsruhe, 02.09.2014 Outline 1. Relational DWH 2. Data integration with
More informationIntro Cassandra. Adelaide Big Data Meetup.
Intro Cassandra Adelaide Big Data Meetup instaclustr.com @Instaclustr Who am I and what do I do? Alex Lourie Worked at Red Hat, Datastax and now Instaclustr We currently manage x10s nodes for various customers,
More informationStudy of NoSQL Database Along With Security Comparison
Study of NoSQL Database Along With Security Comparison Ankita A. Mall [1], Jwalant B. Baria [2] [1] Student, Computer Engineering Department, Government Engineering College, Modasa, Gujarat, India ank.fetr@gmail.com
More informationMariaDB MaxScale 2.0 and ColumnStore 1.0 for the Boston MySQL Meetup Group Jon Day, Solution Architect - MariaDB
MariaDB MaxScale 2.0 and ColumnStore 1.0 for the Boston MySQL Meetup Group Jon Day, Solution Architect - MariaDB 2016 MariaDB Corporation Ab 1 Tonight s Topics: MariaDB MaxScale 2.0 Currently in Beta MariaDB
More informationAerospike Scales with Google Cloud Platform
Aerospike Scales with Google Cloud Platform PERFORMANCE TEST SHOW AEROSPIKE SCALES ON GOOGLE CLOUD Aerospike is an In-Memory NoSQL database and a fast Key Value Store commonly used for caching and by real-time
More informationApache Kudu. Zbigniew Baranowski
Apache Kudu Zbigniew Baranowski Intro What is KUDU? New storage engine for structured data (tables) does not use HDFS! Columnar store Mutable (insert, update, delete) Written in C++ Apache-licensed open
More informationScalable Web Programming. CS193S - Jan Jannink - 2/25/10
Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*
More informationCSE 190D Spring 2017 Final Exam Answers
CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join
More informationOracle Big Data. A NA LYT ICS A ND MA NAG E MENT.
Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem
More information