I've seen some odd behavior trying to select specific items from Timescale hypertables, which I reproduce here with randomized data.
We'll need a vanilla Timescale instance; I like Docker.
# start a PG12 instance w/ Timescale 2.0
docker run --rm -d -p 5432:5432 -e POSTGRES_PASSWORD=password --name timescale timescale/timescaledb:latest-pg12
# connect to it
docker exec -it timescale psql -U postgres
First we need data, with both a timestamp and an ID column, as well as some value columns.
We'll create six months' worth of per-minute data for 10 devices, adding some random perturbation to the timestamps. We'll also add a primary key on (sensor_id, time)
and make it a hypertable.
CREATE TABLE sensor_data AS
SELECT
    time + (interval '1 minute' * random()) AS time,
    sensor_id,
    random() AS cpu,
    random()*100 AS temperature
FROM
    generate_series(now() - interval '6 months', now(), interval '1 minute') AS g1(time),
    generate_series(1,10,1) AS g2(sensor_id)
ORDER BY time;
ALTER TABLE sensor_data ADD CONSTRAINT sensor_data_pk PRIMARY KEY (sensor_id, time);
SELECT FROM create_hypertable('sensor_data', 'time', migrate_data=>true);
SELECT * FROM timescaledb_information.hypertables;
hypertable_schema | hypertable_name | owner | num_dimensions | num_chunks | compression_enabled | is_distributed | replication_factor | data_nodes | tablespaces
-------------------+-----------------+----------+----------------+------------+---------------------+----------------+--------------------+------------+-------------
public | sensor_data | postgres | 1 | 27 | f | f | | |
(1 row)
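If you're curious about the chunks themselves, show_chunks will list them (output elided here):
SELECT show_chunks('sensor_data');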
Let's see some of the oldest values, from chunks we expect are on disk.
SELECT * FROM sensor_data
ORDER BY time
LIMIT 10;
time | sensor_id | cpu | temperature
-------------------------------+-----------+---------------------+--------------------
2020-07-27 22:44:29.963116+00 | 5 | 0.9907297858145725 | 2.443502657772356
2020-07-27 22:44:30.884572+00 | 9 | 0.36664094875715136 | 94.92922018026597
2020-07-27 22:44:33.943573+00 | 6 | 0.6114390001389793 | 61.091366210785125
2020-07-27 22:44:36.798245+00 | 8 | 0.6320730080727373 | 8.323516658384378
2020-07-27 22:44:54.981908+00 | 7 | 0.9271099918152501 | 11.218798468656033
2020-07-27 22:45:02.815279+00 | 4 | 0.5134860056074046 | 11.483146614569861
2020-07-27 22:45:05.306625+00 | 1 | 0.5120288493981846 | 30.728807731221153
2020-07-27 22:45:23.069828+00 | 3 | 0.2670812694924365 | 28.216946641256513
2020-07-27 22:45:25.775646+00 | 2 | 0.9039883996903981 | 27.59812841150371
2020-07-27 22:45:27.690306+00 | 10 | 0.8509384567249754 | 60.05163794338095
(10 rows)
If we want to select the values for a known ID & time, we can stick them in the WHERE
clause, and we get a nice query plan:
EXPLAIN
SELECT * FROM sensor_data
WHERE sensor_id = 5
AND time = '2020-07-27 22:44:29.963116+00';
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk (cost=0.29..2.51 rows=1 width=28)
Index Cond: ((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone))
(2 rows)
There are other ways to do this that will generalize better to multiple points, like joining on a table of our known key-pairs:
EXPLAIN
WITH keys AS (
    SELECT * FROM (VALUES
        (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ)
    ) AS _(sensor_id, time)
)
SELECT * FROM sensor_data
INNER JOIN keys
USING (sensor_id, time)
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk (cost=0.29..2.51 rows=1 width=28)
Index Cond: ((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone))
(2 rows)
Still good! What if we expand to another point? We'll also show real timings now with ANALYZE.
EXPLAIN ANALYZE
WITH keys AS (
    SELECT * FROM (VALUES
        (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ),
        (9, '2020-07-27 22:44:30.884572+00')
    ) AS _(sensor_id, time)
)
SELECT * FROM sensor_data
INNER JOIN keys
USING (sensor_id, time)
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.29..143.00 rows=132 width=28) (actual time=0.014..0.021 rows=2 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=12) (actual time=0.001..0.002 rows=2 loops=1)
-> Append (cost=0.29..71.22 rows=27 width=28) (actual time=0.007..0.007 rows=1 loops=2)
-> Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk (cost=0.29..2.51 rows=1 width=28) (actual time=0.006..0.006 rows=1 loops=2)
Index Cond: ((sensor_id = "*VALUES*".column1) AND ("time" = "*VALUES*".column2))
-> Index Scan using "2_2_sensor_data_pk" on _hyper_1_2_chunk (cost=0.42..2.64 rows=1 width=28) (never executed)
Index Cond: ((sensor_id = "*VALUES*".column1) AND ("time" = "*VALUES*".column2))
<...abbreviated...>
-> Index Scan using "26_26_sensor_data_pk" on _hyper_1_26_chunk (cost=0.42..2.64 rows=1 width=28) (never executed)
Index Cond: ((sensor_id = "*VALUES*".column1) AND ("time" = "*VALUES*".column2))
-> Index Scan using "27_27_sensor_data_pk" on _hyper_1_27_chunk (cost=0.42..2.64 rows=1 width=28) (never executed)
Index Cond: ((sensor_id = "*VALUES*".column1) AND ("time" = "*VALUES*".column2))
Planning Time: 5.322 ms
Execution Time: 0.492 ms
(59 rows)
We scanned every chunk index! Or, more precisely, the planner couldn't exclude any chunk: the key values only arrive at execution time, so all 27 chunks end up in the plan (note the 5 ms planning time), even though most are never executed.
The following approaches produce the same query plan:
-- lateral join
EXPLAIN ANALYZE
WITH keys AS (
    SELECT * FROM (VALUES
        (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ),
        (9, '2020-07-27 22:44:30.884572+00')
    ) AS _(sensor_id, time)
)
SELECT sensor_data.* FROM keys
INNER JOIN LATERAL (
    SELECT * FROM sensor_data
    WHERE sensor_id = keys.sensor_id
    AND time = keys.time
) AS sensor_data ON TRUE
;
-- UNION ALL + INNER JOIN
EXPLAIN ANALYZE
WITH keys AS (
    SELECT * FROM (VALUES
        (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ),
        (9, '2020-07-27 22:44:30.884572+00')
    ) AS _(sensor_id, time)
)
SELECT * FROM sensor_data
INNER JOIN (SELECT * FROM keys OFFSET 0 LIMIT 1) AS _
USING (sensor_id, time)
UNION ALL
SELECT * FROM sensor_data
INNER JOIN (SELECT * FROM keys OFFSET 1 LIMIT 1) AS _
USING (sensor_id, time)
;
Also note that forcing the CTE to materialize (hello PG12!) doesn't help any of these.
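For reference, that variant is just the first join query with the (new in PG12) MATERIALIZED keyword added, and it yields the same every-chunk plan:
EXPLAIN ANALYZE
WITH keys AS MATERIALIZED (
    SELECT * FROM (VALUES
        (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ),
        (9, '2020-07-27 22:44:30.884572+00')
    ) AS _(sensor_id, time)
)
SELECT * FROM sensor_data
INNER JOIN keys
USING (sensor_id, time)
;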
Taking a UNION ALL approach with INNER JOIN still scans every chunk, but UNION ALL-ing specific selections forces the query planner to behave better:
EXPLAIN ANALYZE
SELECT * FROM sensor_data
WHERE sensor_id = 5
AND time = '2020-07-27 22:44:29.963116+00'
UNION ALL
SELECT * FROM sensor_data
WHERE sensor_id = 9
AND time = '2020-07-27 22:44:30.884572+00'
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Append (cost=0.29..5.04 rows=2 width=28) (actual time=0.011..0.018 rows=2 loops=1)
-> Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk (cost=0.29..2.51 rows=1 width=28) (actual time=0.010..0.011 rows=1 loops=1)
Index Cond: ((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone))
-> Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk _hyper_1_1_chunk_1 (cost=0.29..2.51 rows=1 width=28) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: ((sensor_id = 9) AND ("time" = '2020-07-27 22:44:30.884572+00'::timestamp with time zone))
Planning Time: 0.629 ms
Execution Time: 0.039 ms
(7 rows)
By forcing the query planner to only look in the appropriate chunks, we've got a 10x speedup. As you can imagine, the join approach gets worse as the number of chunks grows!
We can get a slightly different but similarly fast plan by OR-ing the key conditions together:
EXPLAIN ANALYZE
SELECT * FROM sensor_data
WHERE (sensor_id = 5 AND time = '2020-07-27 22:44:29.963116+00')
OR (sensor_id = 9 AND time = '2020-07-27 22:44:30.884572+00')
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on _hyper_1_1_chunk (cost=2.80..3.92 rows=1 width=28) (actual time=0.012..0.013 rows=2 loops=1)
Recheck Cond: (((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone)) OR ((sensor_id = 9) AND ("time" = '2020-07-27 22:44:30.884572+00'::timestamp with time zone)))
Heap Blocks: exact=1
-> BitmapOr (cost=2.80..2.80 rows=1 width=0) (actual time=0.009..0.010 rows=0 loops=1)
-> Bitmap Index Scan on "1_1_sensor_data_pk" (cost=0.00..1.40 rows=1 width=0) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: ((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone))
-> Bitmap Index Scan on "1_1_sensor_data_pk" (cost=0.00..1.40 rows=1 width=0) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ((sensor_id = 9) AND ("time" = '2020-07-27 22:44:30.884572+00'::timestamp with time zone))
Planning Time: 0.646 ms
Execution Time: 0.040 ms
(10 rows)
Unfortunately, if you have a table of values of interest (suppose you cache the last time per sensor into a separate table with an upserting trigger), it's very unnatural to take either of the fast approaches above. Dynamic query building is certainly an option, but not a great one.
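For completeness, here's a minimal sketch of what that dynamic building might look like, assuming a hypothetical last_seen cache table (the names and function are mine, not a Timescale API). It glues together the fast UNION ALL form with format and string_agg, so each branch carries the plan-time constants the planner needs for chunk exclusion:
-- hypothetical cache table: the most recent time per sensor,
-- maintained elsewhere (e.g. by an upserting trigger)
CREATE TABLE last_seen (
    sensor_id INT PRIMARY KEY,
    time TIMESTAMPTZ NOT NULL
);

CREATE OR REPLACE FUNCTION latest_sensor_data()
RETURNS SETOF sensor_data AS $$
DECLARE
    q TEXT;
BEGIN
    -- one SELECT per cached key, combined with UNION ALL;
    -- %L quotes each timestamp as a literal, so every branch is
    -- planned with constants rather than join parameters
    SELECT string_agg(
        format('SELECT * FROM sensor_data WHERE sensor_id = %s AND time = %L',
               sensor_id, time),
        ' UNION ALL ')
    INTO q
    FROM last_seen;
    RETURN QUERY EXECUTE q; -- assumes last_seen is non-empty
END;
$$ LANGUAGE plpgsql;

-- usage:
-- SELECT * FROM latest_sensor_data();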