
@marklit
Created July 21, 2024 17:46
Overture July 2024 Canadian Addresses

DuckDB v1.0.0 1f98600c2c

$ cd '/mnt/j/gis/Global/overture/2024_07/theme=addresses/type=address'
$ COUNTRY=CA
$ echo "COPY (
              SELECT h3_cell_to_boundary_wkt(
                          h3_latlng_to_cell(bbox.ymax, bbox.xmax, 5))::geometry geom,
                      COUNT(*)
              FROM read_parquet('*.parquet')
              WHERE country = '$COUNTRY'
              group by 1
          ) TO '../../addresses.$COUNTRY.gpkg'
              WITH (FORMAT GDAL,
                    DRIVER 'GPKG',
                    LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES')" | ~/duckdb
Floating point exception
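
A sketch of a way to narrow down where the crash happens: keep the H3 boundary as WKT text and write plain CSV, which skips both the `::geometry` cast and the GDAL writer. If this succeeds, the query itself is fine and the problem is in the spatial export path. This is an assumption to test, not a confirmed diagnosis; it uses the same local files and h3 extension as above.

```sql
-- Workaround sketch: export WKT text to CSV instead of casting to
-- geometry and writing GPKG via GDAL. Assumes the h3 community
-- extension is loaded, as in the failing query.
COPY (
    SELECT h3_cell_to_boundary_wkt(
               h3_latlng_to_cell(bbox.ymax, bbox.xmax, 5)) AS wkt,
           COUNT(*) AS num_addrs
    FROM read_parquet('*.parquet')
    WHERE country = 'CA'
    GROUP BY 1
) TO 'addresses.CA.csv' (FORMAT CSV, HEADER);
```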
@marklit (author) commented Jul 21, 2024

SELECT   country,
         COUNT(*) AS num_addrs,
         sources->0->'dataset' AS source
FROM     READ_PARQUET('*')
GROUP BY 1, 3
ORDER BY 1, 2 DESC;
Invalid Input Error: Malformed JSON at byte 71 of input: unexpected character.  Input: {"property":"","dataset":"NAD","record_id":null...

The above error is from the US dataset.
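
One hypothetical way to isolate the rogue record: cast the `sources` column to `VARCHAR` and let `json_valid()` flag anything that won't parse. This is a sketch; it assumes the `json` extension is available and that the dataset's `id` column is usable for identifying the bad row.

```sql
-- Isolate malformed JSON rows: json_valid() returns false for any
-- value that fails to parse, so this surfaces the rogue record(s).
SELECT id, sources::VARCHAR AS raw_sources
FROM read_parquet('*')
WHERE country = 'US'
  AND NOT json_valid(sources::VARCHAR)
LIMIT 5;
```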

@jwass commented Jul 21, 2024

Just took a quick look at the US JSON issue. This worked for me. Weird... it could be a different version of DuckDB, or maybe it's a specific record? That would be surprising but definitely possible.

v1.0.0 1f98600c2c
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D set s3_region='us-west-2';
D SELECT   country,
           sources->0->'dataset' AS source
  FROM     READ_PARQUET('s3://overturemaps-us-west-2/release/2024-07-22.0/theme=addresses/type=address/*.parquet') where country='US' limit 10;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬────────┐
│ country │ source │
│ varchar │  json  │
├─────────┼────────┤
│ US      │ "NAD"  │
│ US      │ "NAD"  │
│ US      │ "NAD"  │
│ US      │ "NAD"  │
│ US      │ "NAD"  │
│ US      │ "NAD"  │
│ US      │ "NAD"  │
│ US      │ "NAD"  │
│ US      │ "NAD"  │
│ US      │ "NAD"  │
├─────────┴────────┤
│     10 rows      │
└──────────────────┘
D

@jwass commented Jul 22, 2024

I also just ran this without any issue:

D install h3 from community;
D load h3;
D COPY (
                SELECT h3_cell_to_boundary_wkt(
                            h3_latlng_to_cell(bbox.ymax, bbox.xmax, 5))::geometry geom,
                        COUNT(*)
                FROM read_parquet('s3://overturemaps-us-west-2/release/2024-07-22.0/theme=addresses/type=address/*.parquet')
                WHERE country = 'CA'
                group by 1
            ) TO 'addresses.ca.gpkg'
                WITH (FORMAT GDAL,
                      DRIVER 'GPKG',
                      LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
100% ▕████████████████████████████████████████████████████████████▏

@marklit (author) commented Jul 22, 2024

Using S3 as a source works fine; the issue only comes up with local files. I've raised a ticket with DuckDB: duckdb/duckdb_spatial#368
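
Before filing the ticket, a quick way to compare the two environments is to print the DuckDB version and extension state in both the local and S3 sessions. A sketch; the extension names listed are assumptions about which ones are relevant here.

```sql
-- Run in both sessions and diff the output: version plus the
-- install/load state of the extensions involved in these queries.
PRAGMA version;
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name IN ('spatial', 'h3', 'json', 'httpfs');
```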

For the JSON issue, try scanning the whole dataset; it'll be one rogue JSON record, and it's unlikely to be in the first 10 rows.

@jwass commented Jul 22, 2024

Looks like reading directly from S3 resolves the JSON issue as well. That's weird.

D SELECT   country,
           COUNT(*) AS num_addrs,
           sources->0->'dataset' AS source
  FROM     READ_PARQUET('s3://overturemaps-us-west-2/release/2024-07-22.0/theme=addresses/type=address/*.parquet')
  WHERE country = 'US' GROUP BY 1, 3
  ORDER BY 1, 2 DESC;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬───────────┬────────┐
│ country │ num_addrs │ source │
│ varchar │   int64   │  json  │
├─────────┼───────────┼────────┤
│ US      │  78078341 │ "NAD"  │
└─────────┴───────────┴────────┘
