Open geo data is available from many sources. City open data portals are a great source in many jurisdictions, and searching for "open data <cityname>" usually turns up results. For example, https://datasf.org/opendata/ is San Francisco's open data portal. Some jurisdictions also have dedicated GIS portals.
You'll often find geo data in a few formats:
- A CSV of geo points
- Shapefiles for geo points
- Shapefiles for geo shapes
- WKT (Well-known text)
- GeoJSON
Elasticsearch natively supports WKT and GeoJSON, and I'll leave importing CSVs as an exercise for the reader for now. This post focuses on how to import shapefiles. Sometimes a GeoJSON file holds a full FeatureCollection, which needs to be converted to a list of Features; I cover that here too, in "Breaking a GeoJSON FeatureCollection up".
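For the CSV case left as an exercise above, one possible starting point is sketched below. It is not part of the shapefile workflow, and it rests on assumptions: a hypothetical file layout with a header row followed by name,lat,lon columns, no quoted fields or embedded commas, and a hypothetical field called location that you would map as geo_point. The function name csv_to_bulk is made up for illustration.

```shell
# csv_to_bulk INDEX: read "name,lat,lon" CSV on stdin (header row skipped)
# and write Elasticsearch bulk NDJSON to stdout.
# Assumes no quoted fields or embedded commas; "location" is a hypothetical
# field name that would need a geo_point mapping.
csv_to_bulk() {
  awk -F',' -v idx="$1" 'NR > 1 && NF >= 3 {
    printf "{ \"index\" : { \"_index\" : \"%s\", \"_type\" : \"_doc\" } }\n", idx
    printf "{\"name\":\"%s\",\"location\":{\"lat\":%s,\"lon\":%s}}\n", $1, $2, $3
  }'
}

# Usage (points.csv is a placeholder file name):
#   csv_to_bulk geodata < points.csv > points_bulk.json
```

Real-world CSVs with quoted commas or quotes inside values would need a proper CSV parser rather than awk's field splitting.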
In this example, we'll use the counties in the Atlanta region, which can be found at http://gisdata.fultoncountyga.gov/datasets/53ca7db14b8f4a9193c1883247886459_67. Go to Download -> Shapefile to get the shapefile zip file. Once unzipped, the counties download looks like this:
$ ls
Counties_Atlanta_Region.cpg Counties_Atlanta_Region.dbf Counties_Atlanta_Region.prj Counties_Atlanta_Region.shp Counties_Atlanta_Region.shx
After you have a shapefile, the next step is to get the data into GeoJSON format.
Looking at the Atlanta counties again, the .shp file is the one we point the tooling at (keep the other files next to it, since the attributes live in the .dbf), and the ogr2ogr tool can convert .shp files to GeoJSON. ogr2ogr is part of GDAL and, if you have Homebrew, can be installed on a Mac with:
brew install gdal
Alternatively, you can install it manually. ogr2ogr is a wonderful tool to have on your laptop for using/testing geo data. Once you have it, continuing with our example, you should be able to run:
ogr2ogr -f GeoJSON -t_srs crs:84 output_counties.json Counties_Atlanta_Region.shp
This means:
-f GeoJSON
: Output to GeoJSON format.
-t_srs crs:84
: There are a lot of coordinate reference systems. If you know you need data in a different coordinate system, you can override this with something else, though that's going to generally be a highly specialized case. We're telling ogr2ogr to use WGS84 on the output, which is the same system that GPS uses.
output_counties.json
: The output file.
Counties_Atlanta_Region.shp
: The input file.
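If a download contains several shapefiles, the same command can be run in a loop. The sketch below is one way to do that, not the only one: geojson_name and convert_all are hypothetical helper names, and it assumes ogr2ogr is installed and on your PATH.

```shell
# geojson_name FILE.shp: print the matching output name, e.g. foo.shp -> foo.json
geojson_name() {
  printf '%s\n' "${1%.shp}.json"
}

# convert_all DIR: convert every shapefile in DIR to GeoJSON in WGS84 (crs:84).
# Assumes ogr2ogr (GDAL) is installed and on the PATH.
convert_all() {
  for shp in "$1"/*.shp; do
    [ -e "$shp" ] || continue   # skip the unexpanded glob when DIR has no .shp files
    ogr2ogr -f GeoJSON -t_srs crs:84 "$(geojson_name "$shp")" "$shp"
  done
}

# Usage: convert_all .
```

Each output file lands next to its input, so Counties_Atlanta_Region.shp would become Counties_Atlanta_Region.json.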
After you run this, you now have a GeoJSON file.
If we look at the resulting GeoJSON file, we see at the top of it:
"type": "FeatureCollection",
"name": "Counties_Atlanta_Region",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
...
Elasticsearch handles most GeoJSON, but a FeatureCollection is an array of Feature objects (each Feature can be a geo point or a shape). A FeatureCollection is something like a "bulk" dataset, and we need to pull out the individual points/shapes (Features) so that Elasticsearch can index them. In this example, the individual Features are the individual counties in the Atlanta region. This is where jq comes in handy.
jq can also be installed via Homebrew:
brew install jq
Afterwards, you can "select the array of features[] from output_counties.json and output 1 feature per line" with:
jq -c '.features[]' output_counties.json
The -c flag means "compact" -- it outputs 1 feature per line, which can be useful for what we're about to do next...
We can do 1 step better than just extracting the features array by simultaneously converting the output to Elasticsearch's bulk format with sed:
jq -c '.features[]' output_counties.json | sed -e 's/^/{ "index" : { "_index" : "geodata", "_type" : "_doc" } }\
/' > output_counties_bulk.json && echo "" >> output_counties_bulk.json
The sed bit just adds a bulk header line (and a newline) per record, and the echo "" >> output_counties_bulk.json makes sure the file ends in a newline, as this is required by Elasticsearch.
Change geodata to an index name of your choosing.
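The same header-per-record step can also be sketched with awk, which turns the index name into a parameter instead of a string to edit inside the sed expression. This is an alternative to the sed command, not a replacement for it; add_bulk_headers and looks_like_bulk are hypothetical helper names, and the header line simply mirrors the one used above.

```shell
# add_bulk_headers INDEX: read one JSON document per line on stdin and
# write each one preceded by a bulk "index" action line.
add_bulk_headers() {
  awk -v idx="$1" '{
    printf "{ \"index\" : { \"_index\" : \"%s\", \"_type\" : \"_doc\" } }\n", idx
    print
  }'
}

# looks_like_bulk FILE: succeed if FILE has a non-zero, even number of
# non-blank lines (header + document pairs) and ends in a newline.
looks_like_bulk() {
  awk 'NF { n++ } END { exit !(n > 0 && n % 2 == 0) }' "$1" &&
    [ -z "$(tail -c 1 "$1")" ]
}

# Usage:
#   jq -c '.features[]' output_counties.json | add_bulk_headers geodata > output_counties_bulk.json
#   looks_like_bulk output_counties_bulk.json && echo "ready to load"
```

The check is only a shape test. It does not validate the JSON itself, so a malformed document would still be reported by Elasticsearch at load time.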
At this point, I'd set up the Elasticsearch mappings for this "geodata" index (or whatever name you want to give it). Metadata related to the shape is often in .properties and geo shape data is often in .geometry. The county data here looks typical:
jq -c '.features[].properties' output_counties.json
This shows us a list of properties like:
{"OBJECTID":28,"STATEFP10":"13","COUNTYFP10":"013","GEOID10":"13013","NAME10":"Barrow","NAMELSAD10":"Barrow County","totpop10":69367,"WFD":"N","RDC_AAA":"N","MNGWPD":"N","MPO":"Partial","MSA":"Y","F1HR_NA":"N","F8HR_NA":"N","Reg_Comm":"Northeast Georgia","Acres":104266,"Sq_Miles":162.914993,"Label":"BARROW","GlobalID":"{36E2EA48-1481-44D7-91C9-7C51AC8AB9E9}","last_edite":"2015-10-14T17:19:34.000Z"}
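At this point, you can add any mappings around these fields and/or use an ingest node pipeline to manipulate the data prior to indexing. For now, I'm just going to set up the geo_shape field, but you can add extras. (The "_doc" level in the mapping and the "_type" key in the bulk header are the typed format of Elasticsearch 6.x. Elasticsearch 7 deprecated mapping types and 8 removed them, so leave both out on those versions.)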
PUT /geodata
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "geometry": {
          "type": "geo_shape"
        }
      }
    }
  }
}
And at this point, you can bulk-load the data:
curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@output_counties_bulk.json"
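The bulk API can return a successful HTTP status even when individual documents fail, and it flags that with a top-level "errors" field in the response. A minimal sketch of checking that field follows; bulk_ok and load_bulk are hypothetical helper names, and the grep is a rough text match rather than a real JSON parse.

```shell
# bulk_ok: read a _bulk response on stdin; succeed only if it reports "errors":false.
bulk_ok() {
  grep -q '"errors"[[:space:]]*:[[:space:]]*false'
}

# load_bulk FILE: post FILE to the local _bulk endpoint and check the response.
# Assumes Elasticsearch is listening on localhost:9200, as in the command above.
load_bulk() {
  curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk \
    --data-binary "@$1" | bulk_ok
}

# Usage:
#   load_bulk output_counties_bulk.json || echo "some documents were rejected"
```

When the check fails, rerunning the plain curl command and reading the per-item "error" entries shows which shapes were rejected and why.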
Then you can set up or reload Kibana index patterns for your index to make sure it shows up. Make sure any time filters are appropriate for the visualizations you use. I often turn off the "time" field for quick demos, as dates can often be inconsistent or missing (as I found with this data).
To recap: get a shapefile, then convert it to GeoJSON and on to Elasticsearch's bulk format:
ogr2ogr -f GeoJSON -t_srs crs:84 your_geojson.json your_shapefile.shp
jq -c '.features[]' your_geojson.json | sed -e 's/^/{ "index" : { "_index" : "your_index", "_type" : "_doc" } }\
/' > your_geojson_bulk.json && echo "" >> your_geojson_bulk.json
Set up your mappings. Often the following works, but you may need to check field names:
PUT /your_index
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "geometry": {
          "type": "geo_shape"
        }
      }
    }
  }
}
Bulk-load the data:
curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@your_geojson_bulk.json"
Set up (or refresh) Kibana index patterns to include your_index.