How can we look around the data?
\l
\dt
\d
What does the users table look like?
SELECT * FROM users LIMIT 1;
id | city | name | address | credit_card
---------------------------------------+-----------+----------------+--------------------------------+--------------
ae147ae1-47ae-4800-8000-000000000022 | amsterdam | James Martinez | 53159 Samantha Streets Apt. 69 | 9215735868
And rides?
SELECT * FROM rides LIMIT 1;
id | city | vehicle_city | rider_id | vehicle_id | start_address | end_address | start_time | end_time | revenue
---------------------------------------+-----------+--------------+--------------------------------------+--------------------------------------+--------------------------------+---------------------------+---------------------+---------------------+----------
ab020c49-ba5e-4800-8000-00000000014e | amsterdam | amsterdam | bd70a3d7-0a3d-4000-8000-000000000025 | aaaaaaaa-aaaa-4800-8000-00000000000a | 77033 Melissa Streets Suite 78 | 54177 Larson Fall Apt. 12 | 2018-12-06 03:04:05 | 2018-12-08 02:04:05 | 33.00
How many rides in week to date?
SELECT count(*)
FROM rides
WHERE start_time >= now() - INTERVAL '1 week';
What cities have the most users?
SELECT city, count(*)
FROM users
GROUP BY city
ORDER BY count(*) DESC;
What cities have a threshold of users?
SELECT city, count(*)
FROM users
GROUP BY city
HAVING count(*) > 6
ORDER BY count(*) DESC;
What users have used the 0_i_information_perform
promo code?
SELECT name
FROM users
WHERE id IN (
SELECT user_id
FROM user_promo_codes
WHERE code = '0_i_information_perform'
);
Can I get the same information but with joins?
SELECT u.name
FROM users u
LEFT JOIN user_promo_codes c ON u.id = c.user_id
WHERE c.code = '0_i_information_perform';
How efficient is this?
https://www.cockroachlabs.com/docs/stable/explain-analyze
EXPLAIN SELECT u.name
FROM users u
LEFT JOIN user_promo_codes c ON u.id = c.user_id
WHERE c.code = '0_i_information_perform';
EXPLAIN ANALYZE SELECT u.name
FROM users u
LEFT JOIN user_promo_codes c ON u.id = c.user_id
WHERE c.code = '0_i_information_perform';
EXPLAIN ANALYZE (VERBOSE, TYPES, DISTSQL) SELECT u.name
FROM users u
LEFT JOIN user_promo_codes c ON u.id = c.user_id
WHERE c.code = '0_i_information_perform';
Compared with sub-query?
Not too different! Let's not be too afraid of sub-queries unless we see real performance hit or hard to reason about.
Get a list of names of people who:
- Take rides in a city with both bikes and scooters.
- Have only ever ridden on one mode of transit.
Could be used to offer them a discount to try the other type of vehicle.
What are the cities?
SELECT DISTINCT city
FROM vehicles
WHERE type IN ('bike', 'scooter')
GROUP BY city, type
-- Needs to have both types of vehicle.
HAVING count(type) = 2;
What people in those cities have only ridden via one vehicle type?
SELECT r.rider_id, min(v.type)
FROM rides r
LEFT JOIN vehicles v ON r.vehicle_id = v.id
WHERE r.city
IN (
SELECT DISTINCT city
FROM vehicles
WHERE type IN ('bike', 'scooter')
GROUP BY city, type
-- Needs to have both types of vehicle.
HAVING count(type) = 2
)
AND v.type IN ('bike', 'scooter')
GROUP BY r.rider_id
-- If there is only on distinct type, we know
-- the rider has only tried one mode of transit.
HAVING count(DISTINCT v.type) = 1
Add in names...
SELECT u.name, u.city, r2.type
FROM users u
LEFT JOIN (
SELECT r.rider_id, min(v.type) as type
FROM rides r
LEFT JOIN vehicles v ON r.vehicle_id = v.id
WHERE r.city
IN (
SELECT DISTINCT city
FROM vehicles
WHERE type IN ('bike', 'scooter')
GROUP BY city, type
-- Needs to have both types of vehicle.
HAVING count(type) = 2
)
AND v.type IN ('bike', 'scooter')
GROUP BY r.rider_id
-- If there is only on distinct type, we know
-- the rider has only tried one mode of transit.
HAVING count(DISTINCT v.type) = 1
) r2 ON r2.rider_id = u.id
WHERE r2.type IS NOT NULL
CREATE VIEW bike_scooter_cities AS
SELECT DISTINCT city
FROM vehicles
WHERE type IN ('bike', 'scooter')
GROUP BY city, type
HAVING count(type) = 2
SELECT * FROM bike_scooter_cities;
\d
CREATE VIEW solo_type_riders AS
SELECT r.rider_id, min(v.type) as type
FROM rides r
LEFT JOIN vehicles v ON r.vehicle_id = v.id
WHERE r.city
IN (
SELECT *
FROM bike_scooter_cities
)
AND v.type IN ('bike', 'scooter')
GROUP BY r.rider_id
HAVING count(DISTINCT v.type) = 1
CREATE VIEW offer_opportunities AS
SELECT u.name, u.city, r.type
FROM users u
LEFT JOIN solo_type_riders r ON r.rider_id = u.id
WHERE r.type IS NOT NULL
EXPLAIN SELECT * FROM offer_opportunities;
The plan is the same! How do we make this more efficient?
CREATE MATERIALIZED VIEW offer_opportunities_mat AS
SELECT u.name, u.city, r2.type
FROM users u
LEFT JOIN (
SELECT r.rider_id, min(v.type) as type
FROM rides r
LEFT JOIN vehicles v ON r.vehicle_id = v.id
WHERE r.city
IN (
SELECT DISTINCT city
FROM vehicles
WHERE type IN ('bike', 'scooter')
GROUP BY city, type
-- Needs to have both types of vehicle.
HAVING count(type) = 2
)
AND v.type IN ('bike', 'scooter')
GROUP BY r.rider_id
-- If there is only on distinct type, we know
-- the rider has only tried one mode of transit.
HAVING count(DISTINCT v.type) = 1
) r2 ON r2.rider_id = u.id
WHERE r2.type IS NOT NULL
EXPLAIN SELECT * FROM offer_opportunities_mat;
So much better! Review the execution times for proof.
WITH r2 AS (
SELECT r.rider_id, min(v.type) as type
FROM rides r
LEFT JOIN vehicles v ON r.vehicle_id = v.id
WHERE r.city
IN (
SELECT DISTINCT city
FROM vehicles
WHERE type IN ('bike', 'scooter')
GROUP BY city, type
-- Needs to have both types of vehicle.
HAVING count(type) = 2
)
AND v.type IN ('bike', 'scooter')
GROUP BY r.rider_id
HAVING count(DISTINCT v.type) = 1
)
SELECT u.name, u.city, r2.type
FROM users u
LEFT JOIN r2 ON r2.rider_id = u.id
WHERE r2.type IS NOT NULL
CTEs are great! And let you execute a query once and then use the results all over your main query. However:
- PostgreSQL (and potentially other DBs) will have trouble moving easy filter params into the CTE.
- Also, CTEs are materialized in memory. A whole extra table is created on the fly in memory. For big tables, this can be very memory consuming.
- The same indices that you have for your main tables, can not be used on the CTE tables. Thus the queries might be very slow.
See https://medium.com/@hakibenita/be-careful-with-cte-in-postgresql-fca5e24d2119 for more details.
paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr> BEGIN;
BEGIN
Time: 23ms total (execution 24ms / network 23ms)
paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr OPEN> DELETE FROM vehicle_location_histories;
DELETE 1009
Time: 40ms total (execution 18ms / network 23ms)
paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr OPEN> SELECT * FROM vehicle_location_histories;
city | ride_id | timestamp | lat | long
-------+---------+-----------+-----+-------
(0 rows)
Time: 29ms total (execution 4ms / network 24ms)
paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr OPEN> ROLLBACK;
ROLLBACK
Time: 109ms total (execution 6ms / network 102ms)
paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr> SELECT count(*) FROM vehicle_location_histories;
count
---------
1009
(1 row)
Time: 54ms total (execution 4ms / network 50ms)
paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr>
If you wanted to save that deletion, you would want to run COMMIT
instead.