PostgreSQL/Jsonb: A First Look

About Me
- Started programming in 1981
- Owner of Enoki Solutions Inc.
  - Consulting and Software Development
- Running VanDev since Oct 2010

Why PostgreSQL?
- Open Source
- Feature Rich
- Mature
- So much better than MySql
- MongoDB has issues
  - Only atomic at the document level
  - https://jira.mongodb.org/browse/server-14766

Why Json?
- Blame Javascript
- But, in a DB context
  - Data locality
  - Data atomicity without transaction overhead?
  - Fancy blob?

Jsonb?
- A binary format
- PostgreSQL specific
- In theory faster to modify
- Generally smaller to store
- Indexable!

1st Observation
- Always use Jsonb if you're going to have the db do anything with it
- Json can't be indexed directly: the json type has no btree or GIN operator class

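A small sketch of the difference, assuming PostgreSQL 9.5 or later as used in the rest of the talk. The literals are made up for illustration. json keeps the input text as given, while jsonb parses it into a normalized form and supports the existence operator used on later slides.

```sql
-- json stores the text verbatim; jsonb normalizes it (the last duplicate key wins).
SELECT '{"b": 1,   "a": 2, "a": 3}'::json  AS as_json,
       '{"b": 1,   "a": 2, "a": 3}'::jsonb AS as_jsonb;
-- as_jsonb comes back as {"a": 3, "b": 1}

-- The existence operator ? is defined for jsonb, not for json.
SELECT ('{"tags": ["a", "b"]}'::jsonb -> 'tags') ? 'b' AS has_b;
```
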
Make everything jsonb?
    CREATE TABLE tst (
      id   UUID NOT NULL,
      data JSONB DEFAULT '{}'::jsonb NOT NULL
    );

A quick aside on UUIDs
- Structure your UUIDs (128 bits) as follows
  - Time (ms since epoch, 44 bits, >557 years)
    - If generating more than 2^12/ms, allow this to drift forward
    - If it ends up being a problem, it won't be your problem
  - Sequence (12 bits = 4096/ms)
  - Node (12 bits = 4096 nodes)
    - Expect up to ~1s of time drift when using ntpd
  - Random (60 bits; by the birthday bound, a 1% collision chance takes about 152 million values)
    - Set per ms
- Why?
  - Ids generated at the same time share locality
  - Faster inserts

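The layout above could be sketched in SQL roughly as follows. This is an illustration only: the function name sketch_time_uuid is hypothetical, the caller is assumed to supply the sequence and node numbers, and md5(random()) is an assumed stand-in for a random source (it is not cryptographically strong, and it is redrawn on every call rather than once per millisecond as the slide suggests).

```sql
-- Hypothetical helper: 44 bits time | 12 bits sequence | 12 bits node | 60 bits random.
-- 44 + 12 + 12 + 60 bits = 11 + 3 + 3 + 15 hex digits = 32 hex digits = one UUID.
CREATE OR REPLACE FUNCTION sketch_time_uuid(seq INT, node INT) RETURNS UUID AS $$
  SELECT (
       -- milliseconds since the epoch, zero-padded to 11 hex digits
       lpad(to_hex(floor(extract(epoch FROM clock_timestamp()) * 1000)::BIGINT), 11, '0')
       -- sequence and node, each masked to 12 bits
    || lpad(to_hex(seq  & 4095), 3, '0')
    || lpad(to_hex(node & 4095), 3, '0')
       -- 15 hex digits of pseudo-random filler (assumption, see above)
    || substr(md5(random()::TEXT), 1, 15)
  )::UUID;
$$ LANGUAGE sql VOLATILE;

-- Usage: ids created close together in time share a common prefix.
SELECT sketch_time_uuid(0, 1), sketch_time_uuid(1, 1);
```

Because the time field leads, ids generated together sort next to each other in a btree index, which is the locality argument made on the slide.
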
Make everything jsonb?
    CREATE TABLE tst (
      id   UUID NOT NULL,
      data JSONB DEFAULT '{}'::jsonb NOT NULL
    );
- What happens when you modify data?
  - The whole field is updated
  - If data is large that can be very slow

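Since an UPDATE writes a new row version, even changing a single key rewrites the whole jsonb value. One way to keep an eye on how large those values are is pg_column_size; the query below is a sketch against the tst table from the slide.

```sql
-- Largest documents first; pg_column_size reports the stored size in bytes
-- (after any compression PostgreSQL applied).
SELECT id, pg_column_size(data) AS data_bytes
FROM tst
ORDER BY data_bytes DESC
LIMIT 10;
```
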
2nd Observation
- Consider partitioning into sections
    CREATE TABLE tst (
      id           UUID NOT NULL,
      section_name VARCHAR(128),
      data         JSONB DEFAULT '{}'::jsonb NOT NULL
    );
- Updates to data are smaller now
- Updates by id are no longer atomic across sections unless you use transactions!

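A minimal sketch of what "use transactions" means here, assuming one id has two section rows. The 'body' section, its title key, and the literal values are hypothetical; only 'meta' and count appear in the talk.

```sql
-- Both section rows for the same id change together or not at all.
BEGIN;

UPDATE tst
   SET data = jsonb_set(data, '{count}', '11'::jsonb)
 WHERE id = '00000000-0000-0000-0000-000000000011'
   AND section_name = 'meta';

-- 'body' is a made-up second section for this example.
UPDATE tst
   SET data = data || '{"title": "hello"}'::jsonb
 WHERE id = '00000000-0000-0000-0000-000000000011'
   AND section_name = 'body';

COMMIT;
```
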
Indexing
    CREATE UNIQUE INDEX idx_tst_id ON tst USING btree (id);
    CREATE UNIQUE INDEX idx_tst_id_section_name ON tst USING btree (id, section_name);
    CREATE INDEX idx_tst_id_section_name_data ON tst USING btree (id, section_name, data);
    CREATE INDEX idx_tst_section_name_data_tags ON tst USING btree (section_name, ((data->>'tags') :: TEXT));
    CREATE INDEX idx_tst_section_name_data_count ON tst USING btree (section_name, ((data->>'count') :: INT8));
- Looks funny, doesn't it?

Some test data
    WITH A AS (
      INSERT INTO "tst" VALUES
        ('00000000000000000000000000000011', 'meta', '{"tags":["a","b","c"], "count":10}'),
        ('00000000000000000000000000000012', 'meta', '{"tags":["a","d","c"], "count":1}')
      ON CONFLICT DO NOTHING
      RETURNING *
    )
    SELECT * FROM A;
- BTW, WITH is awesome

Did it work?
- Ordering by (data->'count') does not match the indexed expression, so the planner adds a Sort step:
    EXPLAIN SELECT * FROM tst WHERE section_name='meta' ORDER BY (data->'count');
    Sort  (cost=8.17..8.18 rows=1 width=354)
      Sort Key: ((data -> 'count'::text))
      ->  Index Scan using idx_tst_section_name_data_count on tst  (cost=0.14..8.16 rows=1 width=354)
            Index Cond: ((section_name)::text = 'meta'::text)
- Ordering by the exact indexed expression lets the index supply the order:
    EXPLAIN SELECT * FROM tst WHERE section_name='meta' ORDER BY ((data->>'count')::int8);
    Index Scan using idx_tst_section_name_data_count on tst  (cost=0.13..8.15 rows=1 width=330)
      Index Cond: ((section_name)::text = 'meta'::text)
- And the result:
    SELECT * FROM tst WHERE section_name='meta' ORDER BY ((data->>'count')::int8)
    00000000-0000-0000-0000-000000000012  meta  {"tags": ["a", "d", "c"], "count": 1}
    00000000-0000-0000-0000-000000000011  meta  {"tags": ["a", "b", "c"], "count": 10}

Updating Count
    WITH X AS (
      UPDATE tst
         SET data = jsonb_set(data, '{count}', to_jsonb(((data ->> 'count') :: INT8) + 1 :: INT8), FALSE)
       WHERE section_name='meta' AND data ? 'count' AND data -> 'tags' ? 'd'
      RETURNING *
    )
    SELECT * FROM X;
    00000000-0000-0000-0000-000000000012  meta  {"tags": ["a", "d", "c"], "count": 2}

What about tags?
    EXPLAIN SELECT * FROM "tst" WHERE section_name='meta' and "data" -> 'tags' ? 'b';
    Index Scan using idx_tst_section_name_data_count on tst  (cost=0.14..8.17 rows=1 width=32)
      Index Cond: ((section_name)::text = 'meta'::text)
      Filter: ((data -> 'tags'::text) ? 'b'::text)

    SELECT * FROM "tst" WHERE section_name='meta' and "data" -> 'tags' ? 'b';
    00000000-0000-0000-0000-000000000011  meta  {"tags": ["a", "b", "c"], "count": 10}
- Search within an array is linear?

Gin anyone?
    DROP TABLE tst;
    CREATE TABLE tst (
      data JSONB DEFAULT '{}'::jsonb NOT NULL
    );
    CREATE INDEX idx_tst_data ON tst USING GIN ((data->'tags'));

    EXPLAIN SELECT * FROM "tst" WHERE "data" -> 'tags' ? 'b';
    Bitmap Heap Scan on tst  (cost=8.01..12.03 rows=1 width=32)
      Recheck Cond: ((data -> 'tags'::text) ? 'b'::text)
      ->  Bitmap Index Scan on idx_tst_data  (cost=0.00..8.01 rows=1 width=0)
            Index Cond: ((data -> 'tags'::text) ? 'b'::text)

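A related option, not measured in this talk, is to index the whole document with the jsonb_path_ops operator class and query with the containment operator @>. The index name below is made up for the example.

```sql
-- GIN index over the whole document; jsonb_path_ops supports @> (containment) only.
CREATE INDEX idx_tst_data_path ON tst USING GIN (data jsonb_path_ops);

-- "Rows whose tags array contains b", written as a containment test.
EXPLAIN SELECT * FROM tst WHERE data @> '{"tags": ["b"]}';
```

The trade-off is that this index does not serve the ? operator, so queries would have to be phrased as containment tests.
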
Add back section_name
    CREATE TABLE tst (
      section_name VARCHAR(128),
      data         JSONB DEFAULT '{}'::jsonb NOT NULL
    );
    CREATE INDEX idx_tst_data ON tst USING GIN (section_name, (data->'tags'));

    sql> CREATE INDEX idx_tst_data ON tst USING GIN (section_name, (data->'tags'))
    [2016-07-24 11:19:21] [42704] ERROR: data type character varying has no default operator class for access method "gin"
    Hint: You must specify an operator class for the index or define a default operator class for the data type.
- D'oh

btree_gin?
    CREATE EXTENSION btree_gin;
    CREATE TABLE tst (
      section_name VARCHAR(128),
      data         JSONB DEFAULT '{}'::jsonb NOT NULL
    );
    CREATE INDEX idx_tst_data_1 ON tst USING gin (section_name, (data->'tags'));

    EXPLAIN SELECT * FROM "tst" WHERE section_name = 'meta' and "data" -> 'tags' ? 'b';
    Seq Scan on tst  (cost=0.00..14.20 rows=1 width=306)
      Filter: (((section_name)::text = 'meta'::text) AND ((data -> 'tags'::text) ? 'b'::text))
- Worse?!

No, we need more data
    -- Inserts 4 rows per iteration, 400,000 rows in total.
    CREATE OR REPLACE FUNCTION tmpf() RETURNS void AS $$
    BEGIN
      FOR i IN 1..100000 LOOP
        INSERT INTO tst VALUES ('meta', '{"tags":["a"]}');
        INSERT INTO tst VALUES ('meta', '{"tags":["b"]}');
        INSERT INTO tst VALUES ('meta', '{"tags":["b","c"]}');
        INSERT INTO tst VALUES ('meta', '{"tags":["c"]}');
      END LOOP;
    END
    $$ LANGUAGE plpgsql;

    -- Run it to load the data.
    SELECT tmpf();

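The same 400,000 rows can also be loaded with a single set-based statement instead of a PL/pgSQL loop. This is a sketch of an alternative rather than what the talk ran. The ANALYZE at the end is an added step to refresh planner statistics after the bulk load.

```sql
-- 100,000 iterations x 4 documents = 400,000 rows, in one statement.
INSERT INTO tst (section_name, data)
SELECT 'meta', t.doc::jsonb
FROM generate_series(1, 100000) AS g(i)
CROSS JOIN (VALUES
  ('{"tags":["a"]}'),
  ('{"tags":["b"]}'),
  ('{"tags":["b","c"]}'),
  ('{"tags":["c"]}')
) AS t(doc);

-- Refresh statistics so the planner sees the new row counts.
ANALYZE tst;
```
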
Before and After index
    EXPLAIN ANALYSE SELECT * FROM "tst" WHERE section_name = 'meta' and "data" -> 'tags' ? 'a';
    Seq Scan on tst  (cost=0.00..4336.68 rows=1 width=306) (actual time=0.030..184.442 rows=100000 loops=1)
      Filter: (((section_name)::text = 'meta'::text) AND ((data -> 'tags'::text) ? 'a'::text))
      Rows Removed by Filter: 300000
    Planning time: 0.100 ms
    Execution time: 186.668 ms

    CREATE INDEX idx_tst_section_name_data ON tst USING gin (section_name, (data->'tags'));

    EXPLAIN ANALYSE SELECT * FROM "tst" WHERE section_name = 'meta' and "data" -> 'tags' ? 'a';
    Bitmap Heap Scan on tst  (cost=264.10..1379.31 rows=400 width=32) (actual time=31.616..77.926 rows=100000 loops=1)
      Recheck Cond: (((section_name)::text = 'meta'::text) AND ((data -> 'tags'::text) ? 'a'::text))
      Heap Blocks: exact=3054
      ->  Bitmap Index Scan on idx_tst_section_name_data  (cost=0.00..264.00 rows=400 width=0) (actual time=31.139..31.139 rows=100000 loops=1)
            Index Cond: (((section_name)::text = 'meta'::text) AND ((data -> 'tags'::text) ? 'a'::text))
    Planning time: 0.503 ms
    Execution time: 80.146 ms

Any better without section?
    CREATE INDEX idx_tst_data_tags ON tst USING GIN ((data -> 'tags'));

    EXPLAIN ANALYSE SELECT * FROM "tst" WHERE "data" -> 'tags' ? 'a';
    Bitmap Heap Scan on tst  (cost=103.10..1217.31 rows=400 width=32) (actual time=16.161..59.970 rows=100000 loops=1)
      Recheck Cond: ((data -> 'tags'::text) ? 'a'::text)
      Heap Blocks: exact=3054
      ->  Bitmap Index Scan on idx_tst_data_tags  (cost=0.00..103.00 rows=400 width=0) (actual time=15.455..15.455 rows=100000 loops=1)
            Index Cond: ((data -> 'tags'::text) ? 'a'::text)
    Planning time: 3.744 ms
    Execution time: 62.397 ms

Go big
- Add 4 million rows
- 1 row with tag e
    EXPLAIN ANALYSE SELECT * FROM "tst" WHERE section_name = 'meta' and "data" -> 'tags' ? 'e';
    Planning time: 3.351 ms
    Execution time: 759.093 ms

    CREATE INDEX idx_tst_section_name_data ON tst USING gin (section_name, (data->'tags'));

    EXPLAIN ANALYSE SELECT * FROM "tst" WHERE section_name = 'meta' and "data" -> 'tags' ? 'e';
    Planning time: 4.428 ms
    Execution time: 0.199 ms
- Looks like it works.

3rd Observation
- Writes (updates) get slow
  - GIN index updates are not that fast
  - 4 million inserts took ~3 minutes on my machine

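One common way to soften this for bulk loads, offered here as a sketch that was not timed in the talk, is to drop the GIN index, load the data, and build the index once at the end. It reuses the tmpf() loader and index definition from the earlier slides.

```sql
-- Build the GIN index once after the load instead of maintaining it per insert.
DROP INDEX IF EXISTS idx_tst_section_name_data;

SELECT tmpf();  -- bulk load, using the loader function defined earlier

CREATE INDEX idx_tst_section_name_data
    ON tst USING gin (section_name, (data->'tags'));
```

This only helps batch loading; steady-state updates still pay the index maintenance cost on every write.
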
Summary
- It's weird, but it works
- Need to specify type a lot
- Need to learn about indexes
- Need to watch out for document size
- Need to watch out for index update time
- WITH is awesome

Q&A