Druid Power Interactive Applications at Scale. Jonathan Wei Software Engineer

1 Druid Power Interactive Applications at Scale Jonathan Wei Software Engineer

2 History & Motivation Demo Overview Storage Internals Druid Architecture

3 Motivation

4 Motivation
Visibility and analysis for complex streaming systems:
- Real-time bidding
- User behavior
- Network flows, firewall events
- Application performance metrics

5 Motivation
- Complex: too much detail to precompute
- Multi-tenant: 1000s of concurrent users
- Recency: explore events as they happen
- Efficiency: individual events are low-value

6 Datastore Requirements
- Arbitrary filtering, splitting, and aggregation
- Respond quickly to queries (ideally < 1 second response time)
- Handle huge amounts of data (up to petabytes in total)
- Handle streaming data

7 Druid
- Column-oriented data store
- Realtime streaming ingestion
- Automatic summarization
- Ad-hoc queries
- Approximate algorithms (TopN, HyperLogLog, theta sketches)
- Can keep around a lot of history (years are OK)
- Open source

8 Characteristic Use Cases
Druid is ideal for business intelligence/OLAP use cases that:
- Require interactivity
- Involve filtering, grouping, and aggregating data
- Have result set < input set

9 History
- Initial use case: power ad-tech analytics product
- First lines of Druid written in 2011
- Druid went open source in late 2012
  - GPL licensed initially
  - Part-time development until early [...]
- Apache v2 licensed in early 2015

10 History
- General purpose relational databases (MySQL, Postgres)
- Key/value stores (HBase, Cassandra, BigTable)
- Column stores

11 Column stores
- Load/scan exactly what you need for a query
- Different compression algorithms for different columns
  - Encoding for string columns
  - Compression for measure columns
- Different indexes for different columns

12 Community
- Growing community: contributors from many different companies
- In production at many different companies, and we're hoping for more!
  - Ad-tech, network traffic, cloud security, operations, activity streams, etc.
- We love contributions!

13 Companies using Druid in Production + many more!

14 Demo
In case the internet didn't work, pretend you saw something cool

15 Volume
- Largest known cluster: >500 TB of segments (>50 trillion raw events, >50 PB raw data)
- Extremely cost effective at scale

16 Realtime Ingestion Performance
- Over 500,000 events / second, average
- Over 2M events / second, peak
- ~10-100k events / second / core

17 Queries
- 500 ms average query latency
- 90% < 1 s
- 95% < 2 s
- 99% < 10 s

18 Storage Internals

19 Summarization (Rollup)
Input columns: time, make, model, year, sale_price, sale_fee
:12:00 Honda Civic
:14:00 Honda CRV
:12:40 Honda Civic
:11:40 Toyota Prius
:35:40 Toyota Corolla
:42:40 Toyota Corolla
:12:40 Honda Civic
Rolled-up columns: time, make, model, year, count, sum_sale_price, sum_sale_fee, min_sale_fee
:00:00 Honda Civic
:00:00 Honda CRV
:00:00 Toyota Prius
:00:00 Toyota Corolla

20 Summarization (Rollup)
Columns: time, make, model, year, sale_price, sale_fee
:12:00 Honda Civic
:14:00 Honda CRV
:12:40 Honda Civic
:11:40 Toyota Prius
:35:40 Toyota Corolla
:42:40 Toyota Corolla
:12:40 Honda Civic

21 Summarization (Rollup)
Columns: time, make, model, year, count, sum sale_price, sum sale_fee, min sale_fee
:00:00 Honda Civic
:00:00 Honda CRV
:00:00 Toyota Prius
:00:00 Toyota Corolla
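The rollup step on these slides can be sketched in a few lines of Python. This is only an illustration of the idea, not Druid's ingestion code: timestamps are truncated to the hour, rows that share the same (hour, make, model, year) are merged, and the measures are pre-aggregated. The dates, years, prices, and fees below are made-up sample values, since the slide's numbers are not available here.

```python
from datetime import datetime


def rollup(events):
    """Collapse events sharing (hour, make, model, year) into one pre-aggregated row."""
    rows = {}
    for e in events:
        # Truncate the timestamp to the hour: this is the rollup granularity.
        hour = e["time"].replace(minute=0, second=0, microsecond=0)
        key = (hour, e["make"], e["model"], e["year"])
        row = rows.setdefault(key, {
            "count": 0, "sum_sale_price": 0, "sum_sale_fee": 0, "min_sale_fee": None,
        })
        fee = e["sale_fee"]
        row["count"] += 1
        row["sum_sale_price"] += e["sale_price"]
        row["sum_sale_fee"] += fee
        row["min_sale_fee"] = fee if row["min_sale_fee"] is None else min(row["min_sale_fee"], fee)
    return rows


# Hypothetical sample events (values invented for illustration).
EVENTS = [
    {"time": datetime(2016, 1, 1, 1, 12, 0), "make": "Honda", "model": "Civic",
     "year": 2010, "sale_price": 12000, "sale_fee": 300},
    {"time": datetime(2016, 1, 1, 1, 40, 0), "make": "Honda", "model": "Civic",
     "year": 2010, "sale_price": 13000, "sale_fee": 250},
    {"time": datetime(2016, 1, 1, 1, 14, 0), "make": "Honda", "model": "CRV",
     "year": 2012, "sale_price": 20000, "sale_fee": 400},
    {"time": datetime(2016, 1, 1, 2, 5, 0), "make": "Honda", "model": "Civic",
     "year": 2010, "sale_price": 11000, "sale_fee": 200},
]

ROLLED = rollup(EVENTS)
```

The two Civic events in the same hour collapse into a single row, which is where the storage savings come from; the price is that individual events can no longer be recovered.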

22 Dictionary Encoding
Columns: time, make, model, year, count, sum sale_price, sum sale_fee, min sale_fee
:00:00 Honda Civic
:00:00 Honda CRV
:00:00 Toyota Prius
:00:00 Toyota Corolla
model column
- Values: Civic, Corolla, CRV, Prius
- Encoding: CRV: 0, Civic: 1, Corolla: 2, Prius: 3
- Segment data: 1, 0, 3, 2
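As a rough sketch of the encoding shown on this slide (not Druid's actual segment format), the snippet below sorts the distinct values of a string column, assigns each an integer ID, and stores the column as a list of IDs. The function name `dictionary_encode` is hypothetical.

```python
def dictionary_encode(column):
    """Map each distinct string to a small integer and encode the column with it."""
    values = sorted(set(column))                      # sorted distinct values
    encoding = {value: i for i, value in enumerate(values)}
    return encoding, [encoding[value] for value in column]


# The model column from the four rolled-up rows on the slide.
MODEL_COLUMN = ["Civic", "CRV", "Prius", "Corolla"]
ENCODING, SEGMENT_DATA = dictionary_encode(MODEL_COLUMN)
```

Python's default string ordering compares code points, so "CRV" sorts before "Civic", which reproduces the IDs on the slide.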

23 Bitmap indexes
Columns: time, make, model, year, count, sum sale_price, sum sale_fee, min sale_fee
:00:00 Honda Civic
:00:00 Honda CRV
:00:00 Toyota Prius
:00:00 Toyota Corolla
- make = Toyota bitmap:
- model = CRV bitmap:
- year = 2011 bitmap:
- make = Toyota OR model = CRV bitmap:
- make = Toyota AND year = 2011 bitmap:
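A minimal sketch of how such bitmap indexes can be built and combined follows. Each distinct value in a column gets one bit per row, and boolean filters become bitwise operations. The make and model columns come from the four rows on the slide; the year column is a hypothetical stand-in, and plain Python lists are used instead of the compressed bitmaps a real system would use.

```python
def build_bitmaps(column):
    """Return {value: [0/1 per row]} marking the rows where the value occurs."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps


def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]


def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]


MAKE = ["Honda", "Honda", "Toyota", "Toyota"]
MODEL = ["Civic", "CRV", "Prius", "Corolla"]
YEAR = [2010, 2011, 2011, 2012]  # hypothetical years, not taken from the slide

make_bm = build_bitmaps(MAKE)
model_bm = build_bitmaps(MODEL)
year_bm = build_bitmaps(YEAR)

toyota_or_crv = bitmap_or(make_bm["Toyota"], model_bm["CRV"])
toyota_and_2011 = bitmap_and(make_bm["Toyota"], year_bm[2011])
```

Only the rows whose bit is set in the combined bitmap need to be scanned, so a filter is resolved before any measure column is touched.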

24 Data Partitioning
- Shards are called segments in Druid
- First-level partitioning is done on time
  - Done so for query optimization
- Segments are immutable

25 Data Partitioning
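Why time partitioning helps queries can be shown with a small sketch: if every segment covers a known time interval, a query only needs the segments whose interval overlaps the query interval. The `Segment` type, the hourly layout, and the names below are assumptions made for illustration, not Druid's metadata model.

```python
from datetime import datetime
from typing import NamedTuple


class Segment(NamedTuple):
    start: datetime  # inclusive
    end: datetime    # exclusive
    name: str


def segments_for_interval(segments, query_start, query_end):
    """Keep only segments whose [start, end) overlaps [query_start, query_end)."""
    return [s for s in segments if s.start < query_end and query_start < s.end]


# Four hypothetical hourly segments.
SEGMENTS = [
    Segment(datetime(2016, 1, 1, h), datetime(2016, 1, 1, h + 1), f"hour-{h:02d}")
    for h in range(4)
]

HITS = segments_for_interval(
    SEGMENTS, datetime(2016, 1, 1, 1, 30), datetime(2016, 1, 1, 3, 0)
)
```

Segments outside the query interval are never opened, which is the query optimization the slide refers to.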

26 Immutable Segments
- Fundamental storage unit in Druid
- No contention between reads and writes
- One thread scans one segment
- Multiple threads can access the same underlying data
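The "one thread scans one segment" model can be sketched with a thread pool: because segments never change after they are built, each worker can scan its segment without locks and the partial results are merged afterwards. The tuples standing in for segments and the pool size are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Each immutable tuple stands in for one segment's measure column.
SEGMENTS = (
    (3, 5, 8),
    (1, 1, 2),
    (13, 21),
)


def scan_segment(segment):
    """Scan a single segment and return its partial aggregate."""
    return sum(segment)


def parallel_sum(segments, max_workers=4):
    # One task per segment; no locking is needed because segments are read-only.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(scan_segment, segments))
    return sum(partials)


TOTAL = parallel_sum(SEGMENTS)
```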

27 Approximate algorithms
Reduce computation and storage requirements when exact results are not needed. Druid supports:
- HyperLogLog
- Theta sketches
- Approximate histograms
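To give a feel for how approximate distinct counting trades accuracy for memory, here is a sketch of a K-minimum-values (KMV) estimator, a simple relative of theta sketches. It is not the algorithm Druid ships; the function names and the choice of SHA-256 as the hash are assumptions for illustration. Only the k smallest hash values are kept, and the k-th smallest one is used to estimate how many distinct values were seen.

```python
import hashlib
import heapq


def hash_to_unit(value):
    """Hash a value to a float in [0, 1) using the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(values, k=1024):
    """Estimate the number of distinct values while storing at most k hashes."""
    heap = []        # max-heap (negated) of the k smallest distinct hashes
    members = set()  # the same hashes, for duplicate detection
    for value in values:
        h = hash_to_unit(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count is exact
    return (k - 1) / -heap[0]    # k-th smallest hash gives the density estimate


# 10,000 distinct user IDs, each seen twice.
STREAM = [f"user-{i % 10000}" for i in range(20000)]
ESTIMATE = kmv_estimate(STREAM)
```

Memory stays bounded by k regardless of stream size, and the relative error shrinks roughly with the square root of k.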

28 Architecture

29 Node Types Historical Realtime Broker

30 Architecture (Batch Ingestion)

31 Architecture (Batch Ingestion)

32 Real-time Nodes
- Store data in a write-optimized structure: an on-heap hash map
- Convert the write-optimized structure -> read-optimized structure
- Read-optimized data structure: Druid segments
- Can query data immediately
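The write-optimized to read-optimized handoff can be illustrated with a toy class. Incoming events are aggregated in a hash map that is queryable right away, and `persist` converts it into sorted, immutable columns. The class name `RealtimeIndex` and its methods are hypothetical, and real Druid segments contain far more (dictionaries, bitmaps, compression).

```python
class RealtimeIndex:
    """Toy write-optimized buffer that can be frozen into columnar form."""

    def __init__(self):
        self._rows = {}  # (make, model) -> running count

    def add(self, make, model):
        key = (make, model)
        self._rows[key] = self._rows.get(key, 0) + 1

    def count_where(self, make):
        """Query the in-memory buffer directly, before anything is persisted."""
        return sum(c for (m, _), c in self._rows.items() if m == make)

    def persist(self):
        """Freeze the buffer into sorted, immutable columns (a stand-in for a segment)."""
        ordered = sorted(self._rows.items())
        return {
            "make": tuple(key[0] for key, _ in ordered),
            "model": tuple(key[1] for key, _ in ordered),
            "count": tuple(count for _, count in ordered),
        }


index = RealtimeIndex()
index.add("Toyota", "Prius")
index.add("Honda", "Civic")
index.add("Honda", "Civic")

LIVE_HONDA_COUNT = index.count_where("Honda")  # answered from the hash map
SEGMENT = index.persist()                      # read-optimized snapshot
```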

33 Architecture (Streaming Ingestion)

34 Architecture (Lambda)

35 Why reprocess data?
- Changes in processing code (bugs, updated output, etc.)
- Imprecise streaming operations, like using short join windows
- Software limitations
- What if the stream processor generates duplicate messages?
- Druid streaming ingestion is currently best-effort
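The duplicate-message case can be made concrete with a small sketch. A best-effort streaming path counts whatever arrives, while a later batch pass over the full raw data can deduplicate before aggregating. The `event_id` field is an assumption: it presumes each message carries a unique identifier, which the slides do not state.

```python
def streaming_count(messages):
    """Best-effort path: every delivered message is counted, duplicates included."""
    return len(messages)


def reprocessed_count(messages):
    """Batch path: deduplicate on the (assumed) unique event_id before counting."""
    return len({m["event_id"] for m in messages})


# The stream processor delivered event 2 twice.
MESSAGES = [
    {"event_id": 1, "make": "Honda"},
    {"event_id": 2, "make": "Toyota"},
    {"event_id": 2, "make": "Toyota"},
    {"event_id": 3, "make": "Honda"},
]

REALTIME_RESULT = streaming_count(MESSAGES)
BATCH_RESULT = reprocessed_count(MESSAGES)
```

In a lambda-style setup the batch result replaces the real-time result for the same time range once reprocessing finishes.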

36 Querying
Query libraries:
- JSON over HTTP
- SQL
- R
- JavaScript
- Python
- Ruby
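For the JSON-over-HTTP option, a native timeseries query might look roughly like the sketch below. The query fields follow Druid's native query format, but the datasource name `car_sales`, the column names, and the broker address are assumptions for illustration; the request is only built here, not sent.

```python
import json
import urllib.request

# Assumed broker address; adjust host and port for a real cluster.
BROKER_URL = "http://localhost:8082/druid/v2/"


def build_timeseries_query(datasource, interval, make):
    """Build an hourly timeseries query filtered on one make."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "hour",
        "intervals": [interval],
        "filter": {"type": "selector", "dimension": "make", "value": make},
        "aggregations": [
            {"type": "longSum", "name": "count", "fieldName": "count"},
            {"type": "doubleSum", "name": "sum_sale_price", "fieldName": "sum_sale_price"},
        ],
    }


def build_request(query):
    """Wrap the query in an HTTP POST request with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        BROKER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


QUERY = build_timeseries_query("car_sales", "2016-01-01/2016-01-02", "Toyota")
REQUEST = build_request(QUERY)
# Against a live broker, urllib.request.urlopen(REQUEST) would return JSON results.
```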

37 User Interfaces - Pivot - Grafana - Caravel (formerly Panoramix)

38 Takeaway
- Druid is made for analytic applications
- Druid is good at fast OLAP queries
- Druid is good at streaming ingestion

39 Thanks! imply.io druid.io
