FIRST
- user uploads file to the ROR app -> file is stored in S3 -> entity is created in DB
- push job to Sidekiq to chunk the entity
SECOND
- process the chunk generation job asynchronously in Lambda
- call lambda/chunk/generate?url=
- parse the URL's content (HTML, MD, PDF, docs, text, XML sitemap), choosing the parser by MIME type from the Content-Type header
- text extraction
- NLTK-based chunking, limited to 256 tokens per chunk
- return chunks
- save all chunks to DB
- push job to Sidekiq to embed the chunks
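
A minimal sketch of the parser selection and chunking steps above, in Python. It is not the actual Lambda code: the `PARSERS` table, `pick_parser`, `split_sentences`, and `chunk_text` are hypothetical names, the MIME table is partial, a regex stands in for NLTK's `sent_tokenize`, and whitespace-separated words stand in for tokens so the sketch runs with the standard library only.

```python
import re

# Hypothetical, partial mapping from MIME type to parser name.
PARSERS = {
    "text/html": "html",
    "text/markdown": "md",
    "application/pdf": "pdf",
    "text/plain": "text",
    "application/xml": "sitemap",
    "text/xml": "sitemap",
}


def pick_parser(content_type: str) -> str:
    """Pick a parser from a Content-Type value, ignoring parameters such as charset."""
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime not in PARSERS:
        raise ValueError(f"unsupported MIME type: {mime!r}")
    return PARSERS[mime]


# Stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list:
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def chunk_text(text: str, max_tokens: int = 256) -> list:
    """Greedily pack whole sentences into chunks of at most max_tokens words."""
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        words = sentence.split()
        # A sentence longer than the limit is cut into fixed-size windows.
        while len(words) > max_tokens:
            if current:
                chunks.append(" ".join(current))
                current, count = [], 0
            chunks.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        # Start a new chunk when this sentence would overflow the current one.
        if current and count + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Packing whole sentences keeps each chunk readable on its own. If the 256-token limit is meant in model tokens rather than words, the word count would need to be replaced by the tokenizer's count.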
THIRD
- process the embeddings generation job asynchronously in Lambda
- call lambda/embeddings/generate?content=
- generate embeddings using the sentence-transformers MiniLM-L6-v2 model
- return embeddings
- save embeddings to chunk entity in DB and update status
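
A minimal sketch of the embedding step above, in Python. The chunk record shape (`id`, `content`), the `"embedded"` status value, and the names `embed_chunks` and `fake_encode` are assumptions for illustration. The encoder is passed in as a function so the sketch runs without downloading a model; the comment at the end shows how a sentence-transformers model could be plugged in, assuming the notes refer to `all-MiniLM-L6-v2`.

```python
from typing import Callable, Dict, List, Sequence

# Any function that maps a list of texts to one vector per text.
Encoder = Callable[[Sequence[str]], Sequence[Sequence[float]]]


def embed_chunks(chunks: List[Dict], encode: Encoder) -> List[Dict]:
    """Return copies of the chunk records with an embedding and a status attached."""
    vectors = encode([chunk["content"] for chunk in chunks])
    if len(vectors) != len(chunks):
        raise ValueError("encoder returned a different number of vectors than chunks")
    return [
        # Convert to plain floats so the vector can be serialized for the DB.
        {**chunk, "embedding": [float(x) for x in vector], "status": "embedded"}
        for chunk, vector in zip(chunks, vectors)
    ]


def fake_encode(texts: Sequence[str]) -> List[List[float]]:
    """Toy 2-dimensional encoder, used only to make the sketch runnable."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


# With the real library (assumption: sentence-transformers is installed):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   rows = embed_chunks(chunks, model.encode)  # 384-dimensional vectors
```

Encoding all chunks of an entity in one call lets the model batch them, which is usually cheaper than one Lambda call per chunk.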