Splout SQL: When Big Data Output is also Big Data
1 Iván de Prado Alonso, CEO. Splout SQL: When Big Data Output is also Big Data
2 Big Data consulting & training
4 Full SQL* for Big Data, at web latency & throughput
Unlike NoSQL. Unlike RDBMS. Unlike Impala, Apache Drill, etc.
(* Within each partition)
10 How does it work? Isolation between generation and serving
13 Generation
Generate tablespace CLIENTS_INFO with 2 partitions, for table CLIENTS partitioned by CID and table SALES partitioned by CID.

Table CLIENTS        Table SALES
CID  Name            SID   CID  Amount
U20  Doug            S100  U..  ..
U21  Ted             S101  U20  60
U40  John            S223  U40  99

Tablespace CLIENTS_INFO
Partition [U10..U35]             Partition [U36..U60]
  CLIENTS: U20 Doug, U21 Ted       CLIENTS: U40 John
  SALES: S100 U.. .., S101 U20 60  SALES: S223 U40 99
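The generation step above can be sketched in a few lines of Python. This is a minimal toy model under stated assumptions, not Splout's Hadoop job: both tables share the partition key (CID), keys are routed by the range boundaries from the slide, and each partition is an SQLite database (as in Splout). Only the two fully transcribed SALES rows are loaded.

```python
import sqlite3

# Key ranges from the slide: partition 0 owns [U10..U35], partition 1 owns [U36..U60].
PARTITIONS = {0: ("U10", "U35"), 1: ("U36", "U60")}

clients = [("U20", "Doug"), ("U21", "Ted"), ("U40", "John")]
sales = [("S101", "U20", 60), ("S223", "U40", 99)]

def partition_for(cid):
    """Route a CID to the partition whose [lo, hi] range contains it."""
    for pid, (lo, hi) in PARTITIONS.items():
        if lo <= cid <= hi:
            return pid
    raise KeyError(cid)

# One SQLite database per partition (in-memory here; files in practice).
dbs = {pid: sqlite3.connect(":memory:") for pid in PARTITIONS}
for db in dbs.values():
    db.execute("CREATE TABLE CLIENTS (CID TEXT PRIMARY KEY, Name TEXT)")
    db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")

# Both tables are partitioned by the same key, so related rows land together.
for cid, name in clients:
    dbs[partition_for(cid)].execute("INSERT INTO CLIENTS VALUES (?, ?)", (cid, name))
for sid, cid, amount in sales:
    dbs[partition_for(cid)].execute("INSERT INTO SALES VALUES (?, ?, ?)", (sid, cid, amount))

rows0 = dbs[0].execute("SELECT CID FROM CLIENTS ORDER BY CID").fetchall()
rows1 = dbs[1].execute("SELECT CID FROM CLIENTS ORDER BY CID").fetchall()
print(rows0)  # [('U20',), ('U21',)]
print(rows1)  # [('U40',)]
```

Co-partitioning is what later makes single-partition joins possible: every SALES row sits in the same database file as its CLIENTS row.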
20 Serving
For key = 'U20', tablespace = CLIENTS_INFO: the query is routed to partition [U10..U35].

SELECT Name, sum(amount)
FROM CLIENTS c, SALES s
WHERE c.cid = s.cid AND c.cid = 'U20';
25 Serving
For key = 'U40', tablespace = CLIENTS_INFO: the query is routed to partition [U36..U60].

SELECT Name, sum(amount)
FROM CLIENTS c, SALES s
WHERE c.cid = s.cid AND c.cid = 'U40';
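The serving path above can be sketched in a few lines. This is a minimal toy model, not Splout's implementation: each partition is an SQLite database (as in Splout), the client supplies a routing key, and the query runs inside the single partition that owns the key. The garbled S100 row from the slides is omitted.

```python
import sqlite3

# Key ranges mirror the two partitions on the slides.
RANGES = [("U10", "U35"), ("U36", "U60")]

def build_partition(clients, sales):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE CLIENTS (CID TEXT PRIMARY KEY, Name TEXT)")
    db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")
    db.executemany("INSERT INTO CLIENTS VALUES (?, ?)", clients)
    db.executemany("INSERT INTO SALES VALUES (?, ?, ?)", sales)
    return db

partitions = [
    build_partition([("U20", "Doug"), ("U21", "Ted")], [("S101", "U20", 60)]),
    build_partition([("U40", "John")], [("S223", "U40", 99)]),
]

def query(key, sql, params=()):
    """Route by key range, then run the SQL inside that partition only."""
    for (lo, hi), db in zip(RANGES, partitions):
        if lo <= key <= hi:
            return db.execute(sql, params).fetchall()
    raise KeyError(key)

SQL = ("SELECT Name, SUM(Amount) FROM CLIENTS c JOIN SALES s "
       "ON c.CID = s.CID WHERE c.CID = ? GROUP BY Name")
print(query("U20", SQL, ("U20",)))  # [('Doug', 60)], from partition [U10..U35]
print(query("U40", SQL, ("U40",)))  # [('John', 99)], from partition [U36..U60]
```

Because each query touches exactly one partition, no coordination between nodes is needed at query time.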
31 Why does it scale?
- Data is partitioned
- Partitions are distributed across nodes
- Adding more nodes increases capacity
- Queries are restricted to a single partition
- Generation does not impact serving
38 Ok, so what is Splout SQL useful for?
40 Big Data Analytics: manageable output
44 Big Data Analytics
Sometimes Big Data output is also Big Data.
Splout SQL makes it possible to serve Big Data results.
48 Let's see an example.
49 Building a Google Analytics
Imagine that one crazy day you decide to build some kind of Google Analytics:
- Zillions of events
- Millions of domains
- An individual panel per domain
54 Requirements
- Time-based charts (day/hour aggregations)
- Flexible dimension breakdown: per page, per browser, per country, per language
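These requirements map directly onto per-partition SQL. Below is a minimal sketch of the kind of queries a per-domain panel would run; the `events` table and its columns are illustrative assumptions, not Splout's schema. The domain would be the partition key, so every query stays inside one partition.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (domain TEXT, ts TEXT, country TEXT, hits INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("example.com", "2013-05-01 10:15:00", "ES", 3),
    ("example.com", "2013-05-01 10:45:00", "US", 2),
    ("example.com", "2013-05-01 11:05:00", "ES", 4),
    ("example.com", "2013-05-02 09:00:00", "US", 1),
])

# Time-based chart: hourly rollup for one domain (the panel's key).
hourly = db.execute(
    "SELECT strftime('%Y-%m-%d %H', ts) AS hour, SUM(hits) "
    "FROM events WHERE domain = ? GROUP BY hour ORDER BY hour",
    ("example.com",)).fetchall()

# Flexible breakdown: same data, sliced by the country dimension.
by_country = db.execute(
    "SELECT country, SUM(hits) FROM events WHERE domain = ? "
    "GROUP BY country ORDER BY country", ("example.com",)).fetchall()
print(hourly)      # [('2013-05-01 10', 5), ('2013-05-01 11', 4), ('2013-05-02 09', 1)]
print(by_country)  # [('ES', 7), ('US', 3)]
```

The same pattern generalizes to per-page, per-browser, or per-language breakdowns by grouping on a different column.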
57 With Splout SQL
58 Splout SQL provides SQL consolidated views for Hadoop data
59 Let's see more details about Splout SQL.
60 Splout SQL Architecture
62 Each partition is:
- Backed by SQLite or MySQL
- Generated on Hadoop, including any indexes needed; data can be sorted before insertion to minimize disk seeks at query time; pre-sampling balances partition sizes
- Distributed on the Splout SQL cluster, with replication for failover
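The "sort before insertion" point deserves a concrete illustration. In the sketch below (a toy model, not Splout's generation code) rows are sorted by the query key before the bulk load, so rows for the same key end up physically adjacent, and the index is created after the load, as a batch-generation step would do. The real payoff is fewer disk seeks; here we only show the mechanics.

```python
import sqlite3
import random

# 1000 rows over 50 keys, arriving in arbitrary order.
rows = [(f"U{i % 50:02d}", i) for i in range(1000)]
random.shuffle(rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE SALES (CID TEXT, Amount INTEGER)")
# Sort by the partition/query key before the bulk insert...
db.executemany("INSERT INTO SALES VALUES (?, ?)", sorted(rows))
# ...and create the index afterwards, as the generation step would.
db.execute("CREATE INDEX idx_sales_cid ON SALES (CID)")

plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(Amount) FROM SALES WHERE CID = 'U07'").fetchall()
total = db.execute(
    "SELECT SUM(Amount) FROM SALES WHERE CID = 'U07'").fetchone()[0]
print(plan)   # the plan shows a search using idx_sales_cid
print(total)  # 9640
```

On a disk-backed file, the sorted load means the index points at consecutive pages for each key, which is exactly the collocation the slide is describing.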
70 Atomicity
- A tablespace is a set of tables that share the same partitioning schema
- Tablespaces are versioned; only one version is served at a time
- Several tablespaces can be deployed at once, with all-or-nothing semantics (atomicity)
- Rollback support
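The versioning scheme can be sketched as a single version pointer that a deployment swaps atomically. This is an assumed simplification for illustration, not Splout's internals: readers always see either the old set of tablespace versions or the new set, never a mix, and keeping the old mapping around is what makes rollback cheap.

```python
import threading

class Catalog:
    """Toy model of versioned tablespaces with atomic deploy/rollback."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}  # tablespace name -> served version id

    def deploy(self, new_versions):
        """All-or-nothing: every tablespace flips in one step."""
        with self._lock:
            old = dict(self._versions)
            self._versions = dict(new_versions)
            return old  # kept around to support rollback

    def rollback(self, old_versions):
        with self._lock:
            self._versions = dict(old_versions)

    def version_of(self, tablespace):
        with self._lock:
            return self._versions.get(tablespace)

catalog = Catalog()
catalog.deploy({"CLIENTS_INFO": 1, "MERCHANTS_INFO": 1})
previous = catalog.deploy({"CLIENTS_INFO": 2, "MERCHANTS_INFO": 2})
catalog.rollback(previous)  # both tablespaces back to version 1, together
print(catalog.version_of("CLIENTS_INFO"))  # 1
```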
77 Characteristics
Guaranteed millisecond latencies, even when queries hit disk.
Controlled by the developer by selecting the proper:
- Cluster topology
- Partitioning
- Indexes
- Data collocation (insertion order)
82 Characteristics (II)
- 100% SQL, but restricted to a single partition
- Real-time aggregations
- Joins
- Scalability: in data capacity and in performance
90 Characteristics (III)
- Atomicity: new data replaces old data all at once
- High availability, through the use of replication
- Open source
96 Characteristics (IV)
- Easy to manage: changing the size of the cluster can be done without any downtime
- Read-only: data is updated in batches; updates come from new tablespace deployments
102 Characteristics (V)
Native connectors: Hive, Pig, Cascading
107 API - Generation
- Command line, loading CSV files:
  $ hadoop jar splout-*-hadoop.jar generate
- Java API
- HCatalog: Hive, Pig
116 API - Service
REST API with JSON responses
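A client call would then be a plain HTTP GET carrying the routing key and the SQL. The endpoint path, port, and parameter names below are assumptions for illustration only, not the documented Splout API; check the project docs for the real URL shape before using this.

```python
from urllib.parse import urlencode

def query_url(host, tablespace, key, sql):
    """Build a hypothetical query URL: path and parameter names are assumed."""
    params = urlencode({"key": key, "sql": sql})
    return f"http://{host}/api/query/{tablespace}?{params}"

url = query_url("localhost:4412", "CLIENTS_INFO", "U20",
                "SELECT Name FROM CLIENTS WHERE CID = 'U20'")
print(url)  # the server would answer with a JSON response body
```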
120 API - Console
121 Joins
- Between co-partitioned tables (e.g. CLIENTS and SALES by CID)
- With omnipresent tables: full data present in every partition; useful for dimension tables in star schemas (e.g. a countries table)
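The omnipresent-table idea can be shown concretely: a small dimension table is copied whole into every partition, so any partition can join against it locally with no cross-partition traffic. The sketch below is a toy model with an illustrative COUNTRIES table, mirroring the slide's example.

```python
import sqlite3

# The small dimension table replicated into every partition.
COUNTRIES = [("ES", "Spain"), ("US", "United States")]

def build_partition(clients):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE CLIENTS (CID TEXT, Name TEXT, Country TEXT)")
    db.execute("CREATE TABLE COUNTRIES (Code TEXT, Name TEXT)")
    db.executemany("INSERT INTO CLIENTS VALUES (?, ?, ?)", clients)
    # Omnipresent: the full COUNTRIES data goes into this partition too.
    db.executemany("INSERT INTO COUNTRIES VALUES (?, ?)", COUNTRIES)
    return db

p0 = build_partition([("U20", "Doug", "US")])
p1 = build_partition([("U40", "John", "ES")])

# The same star-schema join works in either partition, locally.
sql = ("SELECT c.Name, co.Name FROM CLIENTS c "
       "JOIN COUNTRIES co ON c.Country = co.Code")
r0 = p0.execute(sql).fetchall()
r1 = p1.execute(sql).fetchall()
print(r0)  # [('Doug', 'United States')]
print(r1)  # [('John', 'Spain')]
```

This is the standard trade-off for dimension tables: they are small enough that full replication is cheaper than any distributed join.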
126 What if I need different partitioning?
Example: queries by Merchant cannot be answered by a tablespace partitioned by Client.
Just create more tablespaces: the first partitioned by Client, the second by Merchant, and deploy both atomically.
131 Benchmark
- 350 GB of Wikipedia logs
- Aggregation queries impacting 15 rows on average
- 2-machine cluster: 900 queries/second, 80 ms/query, 80 threads
137 Benchmark (II)
- 4-machine cluster: 3,150 queries/second, 40 ms/query, 160 threads
- More info: http://sploutsql.com/performance.html
142 Web-latency SQL consolidated views for Hadoop
A good candidate for the serving layer of a lambda architecture
147 Future work
- Growing the community: do you want to collaborate?
- More engines: SQLite, MySQL and Redis already done; columnar formats
- Rack awareness
- Multi-tenancy
- Testing at scale: test Splout on bigger clusters
156 Iván de Prado Alonso, CEO. http://sploutsql.com. Questions?
Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects Oozie 25 How to Analyze Large Data Sets in Hadoop Although the Hadoop framework is implemented in Java, MapReduce
More informationOracle NoSQL Database Enterprise Edition, Version 18.1
Oracle NoSQL Database Enterprise Edition, Version 18.1 Oracle NoSQL Database is a scalable, distributed NoSQL database, designed to provide highly reliable, flexible and available data management across
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory
More informationStages of Data Processing
Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,
More informationTalend Big Data Sandbox. Big Data Insights Cookbook
Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is
More informationCloud Analytics and Business Intelligence on AWS
Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse
More informationEvolving To The Big Data Warehouse
Evolving To The Big Data Warehouse Kevin Lancaster 1 Copyright Director, 2012, Oracle and/or its Engineered affiliates. All rights Insert Systems, Information Protection Policy Oracle Classification from
More informationTop 7 Data API Headaches (and How to Handle Them) Jeff Reser Data Connectivity & Integration Progress Software
Top 7 Data API Headaches (and How to Handle Them) Jeff Reser Data Connectivity & Integration Progress Software jreser@progress.com Agenda Data Variety (Cloud and Enterprise) ABL ODBC Bridge Using Progress
More informationNoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu
NoSQL Databases MongoDB vs Cassandra Kenny Huynh, Andre Chik, Kevin Vu Introduction - Relational database model - Concept developed in 1970 - Inefficient - NoSQL - Concept introduced in 1980 - Related
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More information1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions
1Z0-449 Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions Table of Contents Introduction to 1Z0-449 Exam on Oracle Big Data 2017 Implementation Essentials... 2 Oracle 1Z0-449
More informationIntroduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent
Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent Agenda Why people use Ka1a Technical overview of Ka1a What s coming What s Apache Ka1a Distributed, high throughput pub/sub system Ka1a Usage
More informationGhislain Fourny. Big Data 5. Wide column stores
Ghislain Fourny Big Data 5. Wide column stores Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2 Where we are User interfaces
More informationData warehousing on Hadoop. Marek Grzenkowicz Roche Polska
Data warehousing on Hadoop Marek Grzenkowicz Roche Polska Agenda Introduction Case study: StraDa project Source data Data model Data flow and processing Reporting Lessons learnt Ideas for the future Q&A
More informationAsanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks
Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data
More informationScaling DreamFactory
Scaling DreamFactory This white paper is designed to provide information to enterprise customers about how to scale a DreamFactory Instance. The sections below talk about horizontal, vertical, and cloud
More informationApproaching the Petabyte Analytic Database: What I learned
Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationStart Working with Parquet!!!!
My Goal Tonight. Start Working with Parquet!!!! Parquet Query Performance Origin of Parquet Parquet Storage Query Request Usage with Hadoop Tools Customer Examples Topics Parquet Defined Storage & Encoding
More informationDISTRIBUTED DATABASE OPTIMIZATIONS WITH NoSQL MEMBERS
U.P.B. Sci. Bull., Series C, Vol. 77, Iss. 2, 2015 ISSN 2286-3540 DISTRIBUTED DATABASE OPTIMIZATIONS WITH NoSQL MEMBERS George Dan POPA 1 Distributed database complexity, as well as wide usability area,
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate
More informationOLAP Introduction and Overview
1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata
More informationHBase... And Lewis Carroll! Twi:er,
HBase... And Lewis Carroll! jw4ean@cloudera.com Twi:er, LinkedIn: @jw4ean 1 Introduc@on 2010: Cloudera Solu@ons Architect 2011: Cloudera TAM/DSE 2012-2013: Cloudera Training focusing on Partners and Newbies
More informationQLIK INTEGRATION WITH AMAZON REDSHIFT
QLIK INTEGRATION WITH AMAZON REDSHIFT Qlik Partner Engineering Created August 2016, last updated March 2017 Contents Introduction... 2 About Amazon Web Services (AWS)... 2 About Amazon Redshift... 2 Qlik
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationMaximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale.
Maximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale. January 2016 Credit Card Fraud prevention is among the most time-sensitive and high-value of IT tasks. The databases
More informationProcessing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer
Processing big data with modern applications: Hadoop as DWH backend at Pro7 Dr. Kathrin Spreyer Big data engineer GridKa School Karlsruhe, 02.09.2014 Outline 1. Relational DWH 2. Data integration with
More informationIntro Cassandra. Adelaide Big Data Meetup.
Intro Cassandra Adelaide Big Data Meetup instaclustr.com @Instaclustr Who am I and what do I do? Alex Lourie Worked at Red Hat, Datastax and now Instaclustr We currently manage x10s nodes for various customers,
More informationStudy of NoSQL Database Along With Security Comparison
Study of NoSQL Database Along With Security Comparison Ankita A. Mall [1], Jwalant B. Baria [2] [1] Student, Computer Engineering Department, Government Engineering College, Modasa, Gujarat, India ank.fetr@gmail.com
More informationMariaDB MaxScale 2.0 and ColumnStore 1.0 for the Boston MySQL Meetup Group Jon Day, Solution Architect - MariaDB
MariaDB MaxScale 2.0 and ColumnStore 1.0 for the Boston MySQL Meetup Group Jon Day, Solution Architect - MariaDB 2016 MariaDB Corporation Ab 1 Tonight s Topics: MariaDB MaxScale 2.0 Currently in Beta MariaDB
More informationAerospike Scales with Google Cloud Platform
Aerospike Scales with Google Cloud Platform PERFORMANCE TEST SHOW AEROSPIKE SCALES ON GOOGLE CLOUD Aerospike is an In-Memory NoSQL database and a fast Key Value Store commonly used for caching and by real-time
More informationApache Kudu. Zbigniew Baranowski
Apache Kudu Zbigniew Baranowski Intro What is KUDU? New storage engine for structured data (tables) does not use HDFS! Columnar store Mutable (insert, update, delete) Written in C++ Apache-licensed open
More informationScalable Web Programming. CS193S - Jan Jannink - 2/25/10
Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*
More informationCSE 190D Spring 2017 Final Exam Answers
CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join
More informationOracle Big Data. A NA LYT ICS A ND MA NAG E MENT.
Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem
More information