@thenamankumar
Last active September 6, 2023 19:07

FIRST

  • the user uploads a file to the Rails (RoR) app -> the app uploads it to S3 -> the app creates an entity in the DB
  • push a job to Sidekiq to chunk the entity

SECOND

  • process the chunk generation async job in Lambda
    • call lambda/chunk/generate?url=
      • parse the URL's content (HTML, MD, PDF, docs, text, XML sitemap), picking the parser by the MIME type in the Content-Type header
      • extract the text
      • chunk with NLTK, limit of 256 tokens per chunk
      • return the chunks
    • save all chunks to the DB
    • push a job to Sidekiq to embed the chunks
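
A minimal sketch of the parser selection and chunking steps above, under stated assumptions. The MIME-to-parser table, the function names, the whitespace-based token count, and the regex sentence splitter are all hypothetical stand-ins; the notes only say that chunking is NLTK based with a 256 token limit. NLTK's `nltk.tokenize.sent_tokenize` could be passed as `split_sentences` instead of the regex fallback.

```python
import re
from typing import Callable, List

MAX_TOKENS = 256  # limit taken from the notes above

# Hypothetical MIME type -> parser label table; the real parser set is not shown.
PARSERS = {
    "text/html": "html",
    "text/markdown": "md",
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docs",
    "text/plain": "text",
    "application/xml": "sitemap",
    "text/xml": "sitemap",
}


def pick_parser(content_type: str) -> str:
    """Choose a parser label from a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime not in PARSERS:
        raise ValueError(f"unsupported MIME type: {mime!r}")
    return PARSERS[mime]


def regex_sentences(text: str) -> List[str]:
    """Crude stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(
    text: str,
    max_tokens: int = MAX_TOKENS,
    split_sentences: Callable[[str], List[str]] = regex_sentences,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    Assumption: a token is a whitespace-separated word.
    """
    chunks: List[str] = []
    current: List[str] = []
    for sentence in split_sentences(text):
        words = sentence.split()
        # A sentence longer than the limit is hard-split into full-size pieces.
        while len(words) > max_tokens:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        # Start a new chunk when this sentence would overflow the current one.
        if len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Packing whole sentences keeps each chunk readable on its own; only a sentence that exceeds the limit by itself gets cut mid-sentence.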

THIRD

  • process the embeddings generation async job in Lambda
    • call lambda/embeddings/generate?content=
      • generate embeddings using the Sentence Transformers MiniLM-L6-v2 model
      • return the embeddings
    • save the embeddings to the chunk entity in the DB and update its status
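
A minimal sketch of the embedding step above, under stated assumptions. The checkpoint name `sentence-transformers/all-MiniLM-L6-v2` is a guess at which MiniLM-L6-v2 variant the notes mean, and the record fields (`content`, `embedding`, `status`) and the `"embedded"` status value are hypothetical. The encoder is passed in as a plain function so the bookkeeping can be exercised without loading the model.

```python
from typing import Callable, Dict, List, Sequence

# Assumed checkpoint; the notes only say "Mini-LM-6 V2".
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

Encoder = Callable[[Sequence[str]], List[List[float]]]


def load_encoder(model_name: str = MODEL_NAME) -> Encoder:
    """Return an encode function backed by the sentence-transformers library."""
    # Imported lazily so the rest of this module works without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() returns a NumPy array by default; convert it to plain lists.
    return lambda texts: model.encode(list(texts)).tolist()


def embed_chunks(chunks: List[Dict], encode: Encoder) -> List[Dict]:
    """Attach an embedding to each chunk record and update its status."""
    vectors = encode([chunk["content"] for chunk in chunks])
    if len(vectors) != len(chunks):
        raise ValueError("encoder returned a different number of vectors than chunks")
    for chunk, vector in zip(chunks, vectors):
        chunk["embedding"] = list(vector)
        chunk["status"] = "embedded"  # hypothetical status value
    return chunks
```

With the real model this would be called as `embed_chunks(rows, load_encoder())`; the all-MiniLM-L6-v2 checkpoint produces 384-dimensional vectors, so the DB column would need to hold that many floats per chunk.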