Today, a lot of data is geolocated (meaning it has a position in space); this is known as GIS data.
We often need to run operations on it, such as aggregations, and many optimisations exist for doing so.
The easiest way is to use either geopandas, or a spatial database such as PostGIS, both of which support spatial joins, for example.
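To make the baseline concrete, here is a minimal sketch of a geopandas spatial join. The layer names and data are made up for illustration; it assumes geopandas >= 0.10 (which uses the `predicate` keyword in `sjoin`).

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two toy layers: districts (polygons) and sensor locations (points).
districts = gpd.GeoDataFrame(
    {"district": ["north", "south"]},
    geometry=[
        Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
        Polygon([(0, -4), (4, -4), (4, 0), (0, 0)]),
    ],
)
sensors = gpd.GeoDataFrame(
    {"sensor_id": [1, 2, 3]},
    geometry=[Point(1, 1), Point(2, -2), Point(10, 10)],
)

# Spatial join: attach to each sensor the district it falls in.
# Sensor 3 lies outside every district, so an inner join drops it.
joined = gpd.sjoin(sensors, districts, how="inner", predicate="within")
print(joined[["sensor_id", "district"]])
```

This is trivial on one machine; the whole question below is how to get the same operation once the data no longer fits there.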
The problem? We want to do it FAST. So we need a scalable way to do it, and here comes... SPAAAARK!
Spark [1] is a monster. It lets you store data and run computations on it in a distributed way.
But... how do we make it handle GIS data? Ideally with Spark 2 and Python?
Magellan [2]
- Compatible Spark2
- Compatible pySpark
- Efficient spatial joins
- Correctly maintained
SpatialSpark [3]
- Compatible Spark2
- Compatible pySpark
- Efficient spatial joins
- Correctly maintained
pySpark + shapely (hacky way) [4]
- Compatible Spark2
- Compatible pySpark
- Efficient spatial joins
- Correctly maintained
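The core of the "hacky" pySpark + shapely approach [4] is usually: broadcast the polygon layer to every executor, then run a plain shapely point-in-polygon test inside a UDF. Stripped of the Spark plumbing (all names here are illustrative, not from [4]), the per-row logic can be sketched as:

```python
from shapely import wkt
from shapely.geometry import Point

# Polygons serialised as WKT -- this is what you would put in a Spark
# broadcast variable: it pickles cheaply and is explicit about the payload.
ZONES_WKT = {
    "north": "POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))",
    "south": "POLYGON ((0 -4, 4 -4, 4 0, 0 0, 0 -4))",
}
ZONES = {name: wkt.loads(w) for name, w in ZONES_WKT.items()}

def zone_of(lon, lat):
    """Return the name of the zone containing (lon, lat), or None.

    In Spark this would be wrapped in a UDF, something like
    udf(zone_of, StringType()), and applied column-wise."""
    p = Point(lon, lat)
    for name, poly in ZONES.items():
        if poly.contains(p):
            return name
    return None

print(zone_of(1, 1))   # inside the northern square
print(zone_of(9, 9))   # outside every zone
```

Note the linear scan over all polygons for every point: without a spatial index (shapely's `STRtree` would help), this is why the approach does not qualify as an efficient spatial join.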
pySpark + geopandas (hacky way) [5]
- Compatible Spark2
- Compatible pySpark
- Efficient spatial joins
- Correctly maintained
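The pySpark + geopandas variant [5] typically ships a geopandas `sjoin` into each partition via `mapPartitions`, so every executor joins its own chunk of points against a broadcast polygon layer. A sketch of the per-partition function, called directly on an in-memory list here instead of through Spark (names and data are illustrative, not from [5]):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Polygon layer that would be broadcast to every executor.
DISTRICTS = gpd.GeoDataFrame(
    {"district": ["north", "south"]},
    geometry=[
        Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
        Polygon([(0, -4), (4, -4), (4, 0), (0, 0)]),
    ],
)

def join_partition(rows):
    """Spatially join one partition of (sensor_id, lon, lat) tuples
    against the polygon layer, yielding (sensor_id, district) pairs.

    In Spark this would be used as: rdd.mapPartitions(join_partition)."""
    rows = list(rows)
    if not rows:
        return
    points = gpd.GeoDataFrame(
        {"sensor_id": [r[0] for r in rows]},
        geometry=[Point(r[1], r[2]) for r in rows],
    )
    joined = gpd.sjoin(points, DISTRICTS, how="inner", predicate="within")
    for _, row in joined.iterrows():
        yield (row["sensor_id"], row["district"])

# Called directly on a plain list standing in for one partition:
result = list(join_partition([(1, 1, 1), (2, 2, -2), (3, 10, 10)]))
print(result)
```

This gets geopandas' indexed join inside each partition, but the join is still local: points can only match polygons that were broadcast, and nothing partitions the data spatially, which is the "hacky" part.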
GeoSpark [6]
- Compatible Spark2
- Compatible pySpark
- Efficient spatial joins
- Correctly maintained
LocationSpark [7]
- Compatible Spark2
- Compatible pySpark
- Efficient spatial joins
- Correctly maintained
Several modules have now switched to Spark 2. However, no great alternative has been found. If you have a solution, please contact me so I can add it here.
Hi @4rzael,
GeoPySpark allows processing large amounts of raster data using PySpark. Unfortunately, operations like spatial joins on geometries are currently not supported. Please see this issue.