Blue Brain Graph POC

A service which will do the following:

Ingestion

Listen to Nexus server-sent events (SSE) for resources of type {Paper} in project {literature}.
Split each paper into sentences.
Make an API call to Blue Brain Search to get the dense vector for each sentence.
For each sentence, insert a document into the Elasticsearch (ES) index paper_sentences:

{
	"paperId": "{id}",
	"text": "{original text}",
	"vector": [...]
}

For each paper, insert the raw paper as a document into the ES index papers.
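
The ingestion steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's implementation: the regex splitter is a naive stand-in for a real sentence tokenizer, and `embed` is a hypothetical callable standing in for the Blue Brain Search API call. The function names are also made up for this example.

```python
import re
from typing import Callable, Dict, List

# Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
# A real tokenizer would also handle abbreviations, citations, decimals, etc.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split raw paper text into trimmed, non-empty sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def build_sentence_docs(
    paper_id: str,
    text: str,
    embed: Callable[[str], List[float]],  # hypothetical embedding client
) -> List[Dict[str, object]]:
    """Build one paper_sentences document per sentence of the paper."""
    return [
        {"paperId": paper_id, "text": sentence, "vector": embed(sentence)}
        for sentence in split_sentences(text)
    ]
```

Each returned dictionary has the shape of the paper_sentences document shown above and could be sent to ES as-is, for example through a bulk request.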

Search

Provide an API endpoint /v1/papers:

POST /v1/papers

{
	"paperId": ["{id1}", ..., "{idN}"], //optional
	"text": "{text}"
}

Make an API call to Blue Brain Search to get the dense vector for the {text}.
Make an Elasticsearch query against paper_sentences, scoring with cosineSimilarity (for now). When paperId is provided, restrict the query to those papers:

{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "bool": {
              "should": [
                { "term": { "paperId": "{id1}" } },
                { "term": { "paperId": "{id2}" } }
              ]
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
        "params": {
          "query_vector": {TEXT_VECTOR}
        }
      }
    }
  },
  "_source": {"excludes": ["vector"]}
}
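
As a sketch of how the service might assemble this request body, the helper below builds the same script_score query as a Python dictionary. The function name and its parameters are assumptions made for illustration. When no paper ids are given it falls back to match_all, which is one possible reading of the optional paperId field.

```python
from typing import Any, Dict, List, Optional


def build_sentence_query(
    query_vector: List[float],
    paper_ids: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build the ES request body for scoring sentences against a query vector."""
    if paper_ids:
        # Restrict scoring to sentences of the requested papers.
        inner: Dict[str, Any] = {
            "bool": {
                "filter": {
                    "bool": {
                        "should": [{"term": {"paperId": pid}} for pid in paper_ids]
                    }
                }
            }
        }
    else:
        # Assumption: with no paperId filter, score every indexed sentence.
        inner = {"match_all": {}}

    return {
        "query": {
            "script_score": {
                "query": inner,
                "script": {
                    # "+ 1.0" keeps the score non-negative, as ES requires.
                    "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
        # The stored vectors are large and not needed in the response.
        "_source": {"excludes": ["vector"]},
    }
```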

Collect the score and the matching sentences returned for each paper.
Retrieve the metadata of the matched papers (authors, title, ...) in one of two ways:
1. Make an API call to Nexus.
2. Alternatively, make a second Elasticsearch query against the papers index (this can probably be optimized).
Compose and serve the API response:

{
  "total": {total},
  "results": [
    {
      "score": {score},
      "paperId": "{id}",
      "title": "{title}",
      "authors": [ "{name1}", "{nameN}" ],
      "text": [
         { "highlight": true, "value": "{text1}" },
         ...
         { "highlight": false, "value": "{textN}" }
      ]
    }
  ]
}
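
One way to compose this response from the ES hits is sketched below. Several details are assumptions, since the outline above does not fix them: a paper's score is taken as the maximum score of its matched sentences, total counts the matched papers, and `metadata` is a hypothetical mapping from paper id to title, authors and the paper's ordered sentences (obtained from Nexus or from the papers index).

```python
from typing import Any, Dict, List, Set


def compose_response(
    hits: List[Dict[str, Any]],
    metadata: Dict[str, Dict[str, Any]],  # hypothetical: paperId -> title/authors/sentences
) -> Dict[str, Any]:
    """Group sentence hits by paper and build the /v1/papers response body."""
    matched: Dict[str, Set[str]] = {}  # paperId -> matched sentence texts
    best: Dict[str, float] = {}        # paperId -> best sentence score

    for hit in hits:
        source = hit["_source"]
        paper_id = source["paperId"]
        matched.setdefault(paper_id, set()).add(source["text"])
        best[paper_id] = max(best.get(paper_id, float("-inf")), hit["_score"])

    results = []
    # Papers ordered by their best sentence score, highest first.
    for paper_id in sorted(best, key=lambda pid: best[pid], reverse=True):
        meta = metadata[paper_id]
        results.append(
            {
                "score": best[paper_id],
                "paperId": paper_id,
                "title": meta["title"],
                "authors": meta["authors"],
                # Every sentence of the paper, flagged when it matched the query.
                "text": [
                    {"highlight": sentence in matched[paper_id], "value": sentence}
                    for sentence in meta["sentences"]
                ],
            }
        )

    return {"total": len(results), "results": results}
```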
