Command to generate a sample NDJSON file:
copy (select * from (select a.range as a from range(1000) as a), (select b.range as b from range(1000) as b)) to 'range.ndjson';
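Each line of the NDJSON output is one JSON object per row. Assuming the same cross-join row order as the query results shown below (row order is not guaranteed), the first lines should look something like:
head -2 range.ndjson
{"a":0,"b":0}
{"a":1,"b":0}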
This creates a table with 2 columns and 1,000,000 rows (the cross join of two 1,000-row ranges), which we can verify:
select count(1) from 'range.ndjson';
┌──────────┐
│ count(1) │
│  int64   │
├──────────┤
│  1000000 │
└──────────┘
select * from 'range.ndjson' limit 10;
┌───────┬───────┐
│   a   │   b   │
│ int64 │ int64 │
├───────┼───────┤
│     0 │     0 │
│     1 │     0 │
│     2 │     0 │
│     3 │     0 │
│     4 │     0 │
│     5 │     0 │
│     6 │     0 │
│     7 │     0 │
│     8 │     0 │
│     9 │     0 │
├───────┴───────┤
│    10 rows    │
└───────────────┘
Converting the NDJSON file to Parquet is a single COPY:
copy (select * from 'range.ndjson') to 'range.parquet';
select count(1) from 'range.parquet';
┌──────────┐
│ count(1) │
│  int64   │
├──────────┤
│  1000000 │
└──────────┘
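The Parquet compression codec can also be set explicitly in the COPY options. A minimal sketch, assuming a DuckDB version that supports the COMPRESSION option (the output name range_zstd.parquet is just for illustration):
copy (select * from 'range.ndjson') to 'range_zstd.parquet' (format parquet, compression zstd);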
Or, to convert multiple NDJSON files at once:
copy (select * from 'prefix*.ndjson') to 'range.parquet';
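The glob goes through DuckDB's JSON reader, which can also be called explicitly if you want to pass reader options. A sketch, assuming the JSON extension's read_ndjson_auto function:
copy (select * from read_ndjson_auto('prefix*.ndjson')) to 'range.parquet';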
We can also write directly to S3 instead of a local file, and DuckDB will automatically use S3 multipart upload!
copy (select * from 'prefix*.ndjson') to 's3://some-bucket/range.parquet';
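Writing to S3 requires the httpfs extension and credentials to be configured first. A minimal sketch using the SET-based configuration from the httpfs docs (region and keys are placeholders):
install httpfs;
load httpfs;
set s3_region='us-east-1';
set s3_access_key_id='<your-access-key-id>';
set s3_secret_access_key='<your-secret-access-key>';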
And comparing the sizes of the NDJSON and Parquet files:
ll range.*
-rw-r--r-- 1 lambros lambros 17M Jan 18 22:11 range.ndjson
-rw-r--r-- 1 lambros lambros 1.2M Jan 18 22:11 range.parquet
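To see where the savings come from, the Parquet file's metadata can be inspected directly. A sketch, assuming the parquet_metadata table function (the exact output columns may vary between DuckDB versions):
select path_in_schema, compression, total_compressed_size from parquet_metadata('range.parquet');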
References:
- https://duckdb.org/docs/extensions/json
- https://duckdb.org/docs/sql/statements/copy
- https://duckdb.org/docs/extensions/httpfs#writing
Example files:
- https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet
- https://shell.duckdb.org/data/tpch/0_01/parquet/nation.parquet
- https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet
- https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet
- https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.parquet
- https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv
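With httpfs installed and loaded, these example files can be queried directly over HTTPS without downloading them first, e.g.:
select count(1) from 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet';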