The update process in hive that joins the product catalog data to f_product_performance has been causing out-of-memory failures, among many other kinds of errors that are nearly impossible to diagnose.
The first proposal involves sending aggregated data to a key/value store for hydra to consume during the selection process.
- Create a pyspark aggregation script at the end of the stats pipeline covering the last 60 days of product performance
- Send the aggregated data to an in-memory key/value store for O(1) lookup in hydra's selection process
- Update hydra's filter to retrieve performance data from the key/value store
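A minimal sketch of the aggregation-and-lookup path described above, using plain Python in place of the pyspark job and a dict standing in for the in-memory key/value store (Redis or similar); the field names, key format, and event tuple shape are all hypothetical:

```python
from datetime import date, timedelta

WINDOW_DAYS = 60  # aggregation window from the proposal


def aggregate_performance(events, today):
    """Roll up per-product performance over the last 60 days.

    `events` is an iterable of (product_id, event_date, clicks, conversions)
    tuples standing in for rows of f_product_performance.
    """
    cutoff = today - timedelta(days=WINDOW_DAYS)
    totals = {}
    for product_id, event_date, clicks, conversions in events:
        if event_date < cutoff:
            continue  # outside the 60-day window
        clicks_sum, conv_sum = totals.get(product_id, (0, 0))
        totals[product_id] = (clicks_sum + clicks, conv_sum + conversions)
    return totals


# A dict stands in for the key/value store; in production this would be
# a batch write to something like Redis, keyed by product_id.
kv_store = {}


def publish(totals):
    for product_id, (clicks, conversions) in totals.items():
        kv_store[f"perf:{product_id}"] = {
            "clicks": clicks,
            "conversions": conversions,
        }


def lookup(product_id):
    """O(1) lookup used by hydra's filter step; None means no recent data."""
    return kv_store.get(f"perf:{product_id}")
```

Because hydra only ever reads by product_id at selection time, the filter stays stateless: a missing key simply means the product had no performance events in the window.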
Pros: speeds up the update process and improves reliability in hive; stateless; preserves hydra's speed.
Cons: the product_performance aggregation will take a long time, and we still have to run it for every site regardless of whether that site needs it.
The second proposal involves extending hydra to send products that require stats filtering to an intermediary service for additional filtering/processing.
- Remove the aggregation from the hive update script to improve its speed and reliability
- Create an additional hydra service bound to a new queue (hydra-stats) that processes stats filters only and forwards its results to generation
- Update medusa (if stats filters are present, send to hydra-stats; otherwise, send to generation)
- Implement a handler in go-stats that returns aggregated performance data for a set of products
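The routing and filtering steps above can be sketched as follows; this is plain Python rather than the actual medusa/go-stats code, and the request shape, filter `type` field, queue names, and threshold field are all assumptions for illustration:

```python
def route(request):
    """Sketch of the medusa change: requests whose filter set includes a
    stats filter go to the hydra-stats queue, everything else goes
    straight to generation.
    """
    has_stats = any(f.get("type") == "stats" for f in request["filters"])
    return "hydra-stats" if has_stats else "generation"


def filter_by_stats(products, performance, min_clicks):
    """Sketch of the hydra-stats step: drop products whose aggregated
    performance falls below a threshold.

    `performance` stands in for the go-stats handler's response: a
    mapping of product_id -> aggregated metrics for the requested
    product set.
    """
    kept = []
    for product_id in products:
        metrics = performance.get(product_id)
        if metrics is not None and metrics["clicks"] >= min_clicks:
            kept.append(product_id)
    return kept
```

This is where the second proposal's main advantage shows up: `route` sends only requests that actually carry stats filters through the extra hop, so sites without stats pay no additional cost.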
Pros: speeds up the update process and improves reliability in hive; only clients that use stats filters incur the additional processing.
Cons: increases the architectural complexity of hydra and adds complexity to go-stats.