Important: This microbenchmark is not intended to represent any real workload. Compression ratios, and therefore performance, will depend heavily on the specific workload. This is only for the purpose of illustrating a "columnar friendly" contrived workload that showcases the benefits of columnar.
CREATE TABLE perf_row(
id INT8,
ts TIMESTAMPTZ,
customer_id INT8,
vendor_id INT8,
name TEXT,
description TEXT,
value NUMERIC,
quantity INT4
);
CREATE TABLE perf_columnar(LIKE perf_row) USING COLUMNAR;
This next block must be run as superuser.
CREATE EXTENSION IF NOT EXISTS plpython3u;
CREATE OR REPLACE FUNCTION random_words(n INT4) RETURNS TEXT LANGUAGE plpython3u AS $$
import random
t = ''
words = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']
for i in range(0,n):
if (i != 0):
t += ' '
r = random.randint(0,len(words)-1)
t += words[r]
return t
$$;
The data creation itself can be run without superuser:
INSERT INTO perf_row
SELECT
g, -- id
'2020-01-01'::timestamptz + ('1 minute'::interval * g), -- ts
(random() * 1000000)::INT4, -- customer_id
(random() * 100)::INT4, -- vendor_id
random_words(7), -- name
random_words(100), -- description
(random() * 100000)::INT4/100.0, -- value
(random() * 100)::INT4 -- quantity
FROM generate_series(1,75000000) g;
INSERT INTO perf_columnar SELECT * FROM perf_row;
=> SELECT pg_total_relation_size('perf_row')::numeric/pg_total_relation_size('perf_columnar') AS compression_ratio;
compression_ratio
--------------------
5.3958044063457513
(1 row)
The overall compression ratio of columnar table, versus the same data stored with row storage, is 5.4X.
=> VACUUM VERBOSE perf_columnar;
INFO: statistics for "perf_columnar":
storage id: 10000000000
total file size: 8761368576, total data size: 8734266196
compression rate: 5.01x
total row count: 75000000, stripe count: 500, average rows per stripe: 150000
chunk count: 60000, containing data for dropped columns: 0, zstd compressed: 60000
VACUUM VERBOSE
reports a smaller compression ratio, because it
only averages the compression ratio of the individual chunks, and does
not account for the metadata savings of the columnar format.
- Azure VM: Standard D2s v3 (2 vcpus, 8 GiB memory)
- Linux (ubuntu 18.04)
- Data Drive: Standard HDD (512GB, 500 IOPS Max, 60 MB/s Max)
- PostgreSQL 13 (
--with-llvm
,--with-python
) shared_buffers = 128MB
max_parallel_workers_per_gather = 0
jit = on
Note: because this was run on a system with enough physical memory to hold a substantial fraction of the table, the IO benefits of columnar won't be entirely realized by the query runtime unless the data size is substantially increased.
-- OFFSET 1000 so that no rows are returned, and we collect only timings
SELECT vendor_id, SUM(quantity) FROM perf_row GROUP BY vendor_id OFFSET 1000;
SELECT vendor_id, SUM(quantity) FROM perf_row GROUP BY vendor_id OFFSET 1000;
SELECT vendor_id, SUM(quantity) FROM perf_row GROUP BY vendor_id OFFSET 1000;
SELECT vendor_id, SUM(quantity) FROM perf_columnar GROUP BY vendor_id OFFSET 1000;
SELECT vendor_id, SUM(quantity) FROM perf_columnar GROUP BY vendor_id OFFSET 1000;
SELECT vendor_id, SUM(quantity) FROM perf_columnar GROUP BY vendor_id OFFSET 1000;
Timing (median of three runs):
- row: 436s
- columnar: 16s
- speedup: 27X