Last active
September 24, 2024 12:55
Embedding Data from a Pandas DataFrame into a Chroma Vector Database using LangChain and Ollama
import pandas as pd
from langchain.schema import Document
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from tqdm import tqdm
# Initialize the embedding model
embedding_model = OllamaEmbeddings(model="nomic-embed-text", show_progress=False)

# Initialize the Chroma vector store (this assumes you do not need to call from_documents here directly)
# The vector store only needs to be set up once
vector_db = Chroma(collection_name="GP_Surgery_Reviews")
def embed_with_chroma(df, embedding_model):
    """Return a list of (Document, embedding) pairs, one per successfully embedded row."""
    embeddings = []
    # Process each row in the DataFrame with a progress bar
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        # Create a Document with the necessary fields
        document = Document(
            page_content=row['review'],  # Text content for embedding
            metadata={'pcn': row['pcn'], 'surgery': row['surgery']},  # Additional metadata (the keyword is `metadata`)
            id=str(row['index'])  # Unique identifier as a string
        )
        # Generate the embedding
        try:
            # 'embed_documents' expects a list of texts, so we pass a one-item list
            # and take the first (and only) embedding from the returned list
            embedding = embedding_model.embed_documents([document.page_content])[0]
            embeddings.append((document, embedding))
        except Exception as e:
            # Report the failure and continue with the next row
            print(f"Failed to embed document: {e}")
    return embeddings
# Example DataFrame
data = {
    'index': [1, 2, 3],
    'review': ['Great service!', 'Needs improvement.', 'Very satisfied.'],
    'pcn': ['PCN123', 'PCN456', 'PCN789'],
    'surgery': ['SurgeryA', 'SurgeryB', 'SurgeryC']
}
df = pd.DataFrame(data)
# Compute an embedding for each review in the example DataFrame
# (this returns (Document, embedding) pairs; it does not write them to Chroma)
document_embeddings = embed_with_chroma(df, embedding_model)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Location of the persisted Chroma database. DATA_PATH was not defined in the
# original snippet; this value is a placeholder to adjust.
DATA_PATH = "./chroma_db"
collection_name = "GP_Surgery_Reviews"

# Initialize your embedding model
embedding_model = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)

# Initialize Chroma. The collection is chosen here, not at query time.
chroma = Chroma(
    collection_name=collection_name,
    embedding_function=embedding_model,
    persist_directory=DATA_PATH,
)

def ensure_collection_exists(chroma, collection_name):
    """Return True if the collection holds at least one document."""
    # The Chroma wrapper creates a missing collection when it is constructed,
    # so we check for stored documents rather than for the collection itself
    if chroma.get(limit=1)["ids"]:
        return True
    print(f"Collection {collection_name} is empty or does not exist.")
    return False

def search_similar_documents(chroma, query_text, k=5):
    """Return up to k (Document, score) pairs similar to the query text."""
    try:
        # similarity_search_with_score returns (Document, score) tuples;
        # the score is a distance, so lower means more similar
        return chroma.similarity_search_with_score(query_text, k=k)
    except Exception as e:
        print(f"An error occurred during the search: {e}")
        return []

# Example usage
if ensure_collection_exists(chroma, collection_name):
    query_text = "Appointment Availability"
    similar_documents = search_similar_documents(chroma, query_text)
    # Display the results
    for doc, score in similar_documents:
        print(f"Document: {doc.page_content}, Distance: {score}")
This GitHub Gist contains Python code that demonstrates how to embed data from a Pandas DataFrame for use with a Chroma vector database using LangChain and Ollama. Its main objective is to simplify the process of transforming text data from a DataFrame into vector representations that can be stored in a vector database.
To achieve this, the code first creates a sample DataFrame df with four columns: index, review, pcn, and surgery. The review column contains the text reviews, which are the input to the embedding process. The other three columns hold an identifier and metadata associated with each review.
The code then defines a function, embed_with_chroma, that turns these reviews into vector representations. It takes two inputs: the DataFrame and the embedding model. For each row it builds a Document holding the review text as page_content, the pcn and surgery values as metadata, and the index value as a string id.
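Real review data often contains missing or blank text, which would otherwise reach the embedding call. The following is a minimal sketch of the row-to-document step with that check added. It assumes the rows are available as plain dicts (for example via df.to_dict("records")); the function name rows_to_records and the dict layout standing in for a Document are hypothetical, chosen so the sketch runs without LangChain installed.

```python
import math
from typing import Any, Dict, List


def rows_to_records(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn DataFrame-style rows into document records, skipping unusable reviews.

    Hypothetical helper: each record mirrors the fields the gist passes to
    Document (page_content, metadata, id).
    """
    records = []
    for row in rows:
        text = row.get("review")
        # Skip missing values: None or NaN (pandas' marker for missing data)
        if text is None or (isinstance(text, float) and math.isnan(text)):
            continue
        # Skip blank or whitespace-only reviews
        if not str(text).strip():
            continue
        records.append(
            {
                "id": str(row["index"]),
                "page_content": str(text),
                "metadata": {"pcn": row["pcn"], "surgery": row["surgery"]},
            }
        )
    return records


rows = [
    {"index": 1, "review": "Great service!", "pcn": "PCN123", "surgery": "SurgeryA"},
    {"index": 2, "review": float("nan"), "pcn": "PCN456", "surgery": "SurgeryB"},
    {"index": 3, "review": "   ", "pcn": "PCN789", "surgery": "SurgeryC"},
]
records = rows_to_records(rows)
print(records)
```

Only the first row survives here; the NaN and whitespace-only reviews are dropped before any embedding request would be made.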
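The function then calls the model's embed_documents method to generate an embedding for each document. This method expects a list of texts and returns a list of embeddings, one per text. Because the code passes a one-item list, it takes the first (and only) embedding from the result and appends it, paired with its Document, to the embeddings list. Rows that fail to embed are reported and skipped. The function returns this list; it does not itself write to the Chroma collection.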
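Because embed_documents accepts a list, several reviews can be sent in one call instead of one call per row. The sketch below shows that batching pattern; it is not the gist's method. The helper embed_in_batches and the StubEmbeddings class are hypothetical: the stub only exists so the example runs without an Ollama server, and it relies on nothing beyond the embed_documents method that the gist already uses.

```python
from typing import List, Sequence, Tuple


class StubEmbeddings:
    """Hypothetical stand-in for an embedding model (no Ollama server needed)."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Toy 2-dimensional "embedding": character count and word count
        return [[float(len(t)), float(len(t.split()))] for t in texts]


def embed_in_batches(
    texts: Sequence[str], model, batch_size: int = 32
) -> List[Tuple[str, List[float]]]:
    """Embed texts in batches and return (text, embedding) pairs in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pairs: List[Tuple[str, List[float]]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # One call per batch; the model returns one vector per input text
        vectors = model.embed_documents(batch)
        pairs.extend(zip(batch, vectors))
    return pairs


reviews = ["Great service!", "Needs improvement.", "Very satisfied."]
pairs = embed_in_batches(reviews, StubEmbeddings(), batch_size=2)
for text, vector in pairs:
    print(text, vector)
```

In practice the stub would be replaced by the OllamaEmbeddings instance from the gist. The trade-off is that a failure affects a whole batch rather than a single row, so per-row error reporting needs extra handling.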
By wrapping these steps in one function, the code makes it straightforward to turn text in a DataFrame into vectors for a vector database, which can then support natural language processing tasks such as information retrieval, clustering, or classification. LangChain supplies the Document and vector store interfaces, and Ollama serves the nomic-embed-text embedding model, by default from a local server.
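Information retrieval over stored vectors comes down to ranking them by a similarity or distance measure against the query vector. The following plain-Python sketch illustrates that idea with cosine similarity. It is only an illustration: Chroma uses its own index and its configured distance metric, and the function names and toy vectors here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vector: Sequence[float],
    stored: Sequence[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k stored texts most similar to the query, best match first."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors, made up for illustration only
stored = [
    ("Great service!", [1.0, 0.0]),
    ("Needs improvement.", [0.0, 1.0]),
    ("Very satisfied.", [0.8, 0.6]),
]
results = top_k([1.0, 0.0], stored, k=2)
for text, score in results:
    print(f"{text}: {score:.3f}")
```

Note the direction of the score: cosine similarity is higher for closer matches, whereas a distance-based score is lower for closer matches.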