TL;DR:
- I was exploring whether Hyperspace can be used as an alternative to our record/bloom indexes.
- For needle-in-a-haystack search, i.e. looking up a single id out of all the records, Hyperspace also does not seem very effective at the moment (perhaps not surprising, given that the recommendations so far are about covering indexes).
- Our old workhorse BLOOM_INDEX still significantly outperforms it. But we should really step on the gas for RFC-15-like efforts/RFC-08 to make this much faster.
https://microsoft.github.io/hyperspace/docs/ug-quick-start-guide/
~/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-shell --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --driver-memory 8g --packages com.microsoft.hyperspace:hyperspace-core_2.12:0.1.0
val part100Path = "file:///Volumes/HUDIDATA/input-data/amazon-reviews-100-parts"
val df100 = spark.read.parquet(part100Path)
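// registerTempTable has been deprecated since Spark 2.0; createOrReplaceTempView is the replacement
df100.createOrReplaceTempView("amazon_reviews_100_parts")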
import com.microsoft.hyperspace._
val hs = new Hyperspace(spark)
import com.microsoft.hyperspace.index._
+--------------+
|     review_id|
+--------------+
|R38YR2K3RQVUT6|
|R1UE9PRDNPVWJN|
|R2T5TIOI92JDOA|
| RY7UKOQOZ1NA9|
| R1LJ65G8LY6L6|
| ROQTM343YUPY5|
|R160R9P9BRK8J6|
| R30ZKF6EPTV76|
|R2Q93ZF9K7BERL|
|R2UG8JB73C003W|
|R1NX7L8FAZFL6T|
| R3RJQHNPYINS1|
| R5Z19IT94F27U|
|R1C1X93D1TPIVY|
|R2AZ4P431BHSXD|
|R1G30L7BW96HH9|
|R2Q05M51VX6P14|
| RL9AZUSVJC16M|
|R119E7G9JQDDO5|
|R36I5SKSR7V0WK|
+--------------+
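To compare the runs with and without the index on equal footing, a small wall-clock helper can be pasted into the same spark-shell session. This is only a sketch: `timed` is a hypothetical name and was not part of the original session, and a single run measured this way includes JVM warm-up and file-system caching effects.

```scala
// Hypothetical helper (not from the original session): evaluates a block once
// and returns its result together with the elapsed wall-clock time in ms.
def timed[T](label: String)(block: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = block // the by-name argument is evaluated here, exactly once
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  println(s"$label took $elapsedMs ms")
  (result, elapsedMs)
}
```

Assuming a point lookup on one of the ids listed above, it could be used as `timed("no index") { spark.sql("select * from amazon_reviews_100_parts where review_id = 'R38YR2K3RQVUT6'").count() }`; the exact query shape here is an assumption, not the one from the original run.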
Query without index