Command to generate a sample NDJSON file:
copy (select * from (select a.range as a from range(1000) as a), (select b.range as b from range(1000) as b)) to 'range.ndjson';
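Each line of the NDJSON output is one JSON object per row. Assuming the same cross-join row order as the query results shown below (row order is not guaranteed), the first lines should look something like:
head -2 range.ndjson
{"a":0,"b":0}
{"a":1,"b":0}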
This creates a table with 2 columns and 1,000,000 rows (the cross join of two 1,000-row ranges), which we can verify:
select count(1) from 'range.ndjson';
┌──────────┐
│ count(1) │
│  int64   │
├──────────┤
│  1000000 │
└──────────┘
select * from 'range.ndjson' limit 10;
┌───────┬───────┐
│   a   │   b   │
│ int64 │ int64 │
├───────┼───────┤
│     0 │     0 │
│     1 │     0 │
│     2 │     0 │
│     3 │     0 │
│     4 │     0 │
│     5 │     0 │
│     6 │     0 │
│     7 │     0 │
│     8 │     0 │
│     9 │     0 │
├───────┴───────┤
│    10 rows    │
└───────────────┘
Converting the NDJSON file to Parquet is a single COPY:
copy (select * from 'range.ndjson') to 'range.parquet';
select count(1) from 'range.parquet';
┌──────────┐
│ count(1) │
│  int64   │
├──────────┤
│  1000000 │
└──────────┘
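The Parquet compression codec can also be set explicitly in the COPY options. A minimal sketch, assuming a DuckDB version that supports the COMPRESSION option (the output name range_zstd.parquet is just for illustration):
copy (select * from 'range.ndjson') to 'range_zstd.parquet' (format parquet, compression zstd);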
Or, to convert multiple NDJSON files at once:
copy (select * from 'prefix*.ndjson') to 'range.parquet';
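The glob goes through DuckDB's JSON reader, which can also be called explicitly if you want to pass reader options. A sketch, assuming the JSON extension's read_ndjson_auto function:
copy (select * from read_ndjson_auto('prefix*.ndjson')) to 'range.parquet';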
We can also write directly to S3 instead of a local file, and DuckDB will automatically use S3 multipart upload!
copy (select * from 'prefix*.ndjson') to 's3://some-bucket/range.parquet';
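Writing to S3 requires the httpfs extension and credentials to be configured first. A minimal sketch using the SET-based configuration from the httpfs docs (region and keys are placeholders):
install httpfs;
load httpfs;
set s3_region='us-east-1';
set s3_access_key_id='<your-access-key-id>';
set s3_secret_access_key='<your-secret-access-key>';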
And comparing the sizes of the NDJSON and Parquet files:
ll range.*
-rw-r--r-- 1 lambros lambros 17M Jan 18 22:11 range.ndjson
-rw-r--r-- 1 lambros lambros 1.2M Jan 18 22:11 range.parquet
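To see where the savings come from, the Parquet file's metadata can be inspected directly. A sketch, assuming the parquet_metadata table function (the exact output columns may vary between DuckDB versions):
select path_in_schema, compression, total_compressed_size from parquet_metadata('range.parquet');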
References:
- https://duckdb.org/docs/extensions/json
- https://duckdb.org/docs/sql/statements/copy
- https://duckdb.org/docs/extensions/httpfs#writing
Example files:
- https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet
- https://shell.duckdb.org/data/tpch/0_01/parquet/nation.parquet
- https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet
- https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet
- https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.parquet
- https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv
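With httpfs installed and loaded, these example files can be queried directly over HTTPS without downloading them first, e.g.:
select count(1) from 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet';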