Skip to content

Instantly share code, notes, and snippets.

@alecco
Created May 8, 2021 15:55
Show Gist options
  • Save alecco/9976dab8fda8256ed403054ed0a65d7b to your computer and use it in GitHub Desktop.
Save alecco/9976dab8fda8256ed403054ed0a65d7b to your computer and use it in GitHub Desktop.
-- Note: example given is for SQLite but can be adapted to other SQL
-- This is a fast query to sample random rows
-- It gets a list of existing row identifiers (rowids),
-- materializes, sorts, and picks 10.
SELECT * FROM test
WHERE rowid IN
(SELECT rowid FROM test
ORDER BY random() LIMIT 10);
-- If the table has many, many rows, say hundreds of millions,
-- materializing the list and sorting it becomes expensive.
-- The size of the list of ids can be reduced. But don't
-- overdo it or you'll affect the sampling properties!
SELECT * FROM test JOIN
(SELECT rowid as rid FROM test
WHERE random() % 10 = 0 -- Reduce rowids by 10x
ORDER BY random() LIMIT 10) AS srid
ON test.rowid = srid.rid;
-- The magic reduction value can be calculated with count() of the table
-- and finding a reasonable number to reduce to a million (for example).
-- But you have to also be careful for tables with less than a million rows.
-- This is left as exercise.
-- To populate a table to play with this you can do
-- https://stackoverflow.com/a/28139886
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment