Metadata-Version: 2.1
Name: vector-database-loader
Version: 0.2.0
Summary: A generalized tool for currating and loading content into vector databases
Home-page: https://github.com/danjamk/vector-database-loader
License: MIT
Keywords: vector,database,loader,milvus,langchain,pinecone
Author: danjamk
Requires-Python: >=3.11,<3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: colorama (>=0.4.6,<0.5.0)
Requires-Dist: docx2txt (>=0.8,<0.9)
Requires-Dist: google-api-python-client (>=2.160.0,<3.0.0)
Requires-Dist: google-auth-httplib2 (>=0.2.0,<0.3.0)
Requires-Dist: google-auth-oauthlib (>=1.2.1,<2.0.0)
Requires-Dist: langchain (>=0.3.17,<0.4.0)
Requires-Dist: langchain-community (>=0.3.16,<0.4.0)
Requires-Dist: langchain-milvus (>=0.1.8,<0.2.0)
Requires-Dist: langchain-openai (>=0.3.2,<0.4.0)
Requires-Dist: langchain-pinecone (>=0.2.2,<0.3.0)
Requires-Dist: pymilvus (>=2.5.4,<3.0.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: selenium (>=4.28.1,<5.0.0)
Requires-Dist: unstructured (>=0.16.17,<0.17.0)
Project-URL: Repository, https://github.com/danjamk/vector-database-loader
Description-Content-Type: text/markdown

# vector-database-loader
Loading content info a vector database is relatively easy to do, especially with frameworks like [LangChain](https://www.langchain.com/).
However, the process of curating the content and loading it into the database can be a bit more complex.  If you are building 
a RAG application or similar, the quality and relevance of the content is critical.  This project is meant to help with that process.

A use case for this type of project is discussed in more depth in the blog [A Cost-Effective AI Chatbot Architecture with AWS Bedrock, Lambda, and Pinecone](https://medium.com/@dan.jam.kuhn/a-cost-effective-ai-chatbot-architecture-with-aws-bedrock-lambda-and-pinecone-40935b9ec361)

## Features
- **Vector Database Support** - The framework is built to support multiple vector databases but is currently implementing support for [Pinecone](https://www.pinecone.io/).
More vector databases will be added, but if needed you can fork the project and handle your own needs by extending the base class.  
- **Embedding Support** - You can use any embedding provided by [Langchain](https://python.langchain.com/docs/integrations/text_embedding/), which includes OpenAI, AWS Bedrock, HuggingFace, Cohere and much, much more.
- **Content Curation** - The framework is built configure some common content types and sources, but again is meant to be extended a needed.
  - Sources include websites, local folders and google drive
  - Types include PDF, Word, and Web content and google docs
- **Text Splitter** - This framework uses the [RecursiveCharacterTextSplitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/) from LangChain.  
This is a powerful tool that can split text into chunks of a specified size, while maintaining the context of the text.  This is especially useful for long documents like web pages or PDFs.

## Example
**Install the package with pip:**

Note: The dependencies are kind of beefy!

```pip install vector-database-loader```

**Configuration details:**

Add your Pinecone API and OpenAI keys to your .env file, see sample.env for an examples.


**Code**

```python
import time

from dotenv import load_dotenv, find_dotenv
from langchain_openai import OpenAIEmbeddings

from vector_database_loader.pinecone_vector_db import PineconeVectorLoader, PineconeVectorQuery

# Define your content sources and add them to the array
web_page_content_source = {"name": "SpaceX", "type": "Website", "items": [
    "https://en.wikipedia.org/wiki/SpaceX"

], "chunk_size": 512}
content_sources = [web_page_content_source]

# Load into your vector database.  Be sure to add your Pinecone and OpenAI API keys to your .env file
load_dotenv(find_dotenv())
embedding_client = OpenAIEmbeddings()
index_name = "my-vectordb-index"
vector_db_loader = PineconeVectorLoader(index_name=index_name,
                                        embedding_client=embedding_client)
vector_db_loader.load_sources(content_sources, delete_index=True)

# Query your vector database
print("Waiting 30 seconds before running the query, to make sure the data is available")
time.sleep(30)  # This is needed because there is a latency in the data being available
vector_db_query = PineconeVectorQuery(index_name=index_name,
                                      embedding_client=embedding_client)
query = "What is SpaceX's most recent rocket model being tested?"
documents = vector_db_query.query(query)
print(f"Query: {query} returned {len(documents)} results")
for doc in documents:
    print(f"   {doc.metadata['title']}")
    print(f"   {doc.page_content}")
```


