The update process in hive that joins the product catalog to f_product_performance has been causing out-of-memory failures, along with a variety of other errors that are difficult to diagnose.
The first proposal is to push aggregated performance data to a key/value store that hydra consumes during its selection process.
- Create a pyspark aggregation script at the end of the stats pipeline covering the last 60 days of product performance
- Send aggregated data to (in memory) key/value store for O(1) lookup in hydra's selection process
- Update hydra's filter to retrieve performance data from key/value store
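The aggregation step could be sketched as follows. This is a minimal plain-Python illustration of the windowing logic the pyspark job would implement; the field names (product_id, day, impressions, clicks) are assumptions, not the actual schema.

```python
from datetime import date, timedelta
from collections import defaultdict

# Hypothetical input: one row per (product_id, day) from the stats pipeline.
rows = [
    {"product_id": "p1", "day": date(2024, 1, 10), "impressions": 100, "clicks": 4},
    {"product_id": "p1", "day": date(2023, 1, 10), "impressions": 500, "clicks": 9},  # outside window
    {"product_id": "p2", "day": date(2024, 1, 11), "impressions": 50, "clicks": 1},
]

def aggregate_last_60_days(rows, today):
    """Sum per-product performance over the trailing 60-day window."""
    cutoff = today - timedelta(days=60)
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for row in rows:
        if row["day"] >= cutoff:
            t = totals[row["product_id"]]
            t["impressions"] += row["impressions"]
            t["clicks"] += row["clicks"]
    return dict(totals)

# payload maps product_id -> aggregate dict, ready to write to the key/value store
payload = aggregate_last_60_days(rows, today=date(2024, 1, 12))
```

In the real pipeline the same shape falls out of a pyspark groupBy/agg over the date-filtered DataFrame, with the result written to the store keyed by product_id.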
Pros: speeds up the hive update process and improves its reliability, keeps the design stateless, and preserves hydra's selection speed.
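The lookup side of hydra's filter could look like the sketch below. A plain dict stands in for the in-memory key/value store, and the field names and threshold are illustrative assumptions; the point is the O(1) read plus a neutral default for products missing from the store.

```python
# A dict stands in for the in-memory key/value store (e.g. Redis or similar);
# both the store choice and the field names here are assumptions.
store = {
    "p1": {"impressions": 100, "clicks": 4},
    "p2": {"impressions": 50, "clicks": 1},
}

DEFAULT_PERF = {"impressions": 0, "clicks": 0}

def passes_performance_filter(product_id, min_clicks=2):
    """O(1) lookup; products absent from the store fall back to neutral
    defaults instead of failing the selection pass."""
    perf = store.get(product_id, DEFAULT_PERF)
    return perf["clicks"] >= min_clicks

selected = [pid for pid in ["p1", "p2", "p3"] if passes_performance_filter(pid)]
```

Because the lookup is a single constant-time read, the filter adds negligible latency to the selection path, which is what keeps hydra's speed intact.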