An interactive Python tool for uploading legal documents to a Pinecone vector database using OpenAI embeddings. It is designed for legal document processing, with metadata structured for n8n workflows and LangChain compatibility.
- 🔧 Interactive Setup: Guided configuration with smart defaults and recommendations
- 📚 Multi-Format Support: Process PDF, DOCX, and DOC files
- 🔑 Secure API Management: Safe handling of OpenAI and Pinecone API keys
- 🧪 Connection Testing: Verify API connections before processing
- 📊 Smart Chunking: Configurable text chunking with overlap for optimal retrieval
- 🎯 LangChain Compatible: Proper metadata structure for downstream processing
- 📈 Progress Tracking: Real-time upload progress with detailed logging
- Clone the repository:
    git clone https://github.com/[username]/PE-RAG.git
    cd PE-RAG
- Set up Python environment:
    python -m venv venv
    # Windows
    venv\Scripts\activate
    # Linux/Mac
    source venv/bin/activate
- Install dependencies:
    pip install openai pinecone-client langchain langchain-openai python-docx pypdf
- Run the interactive loader:
    python vector_loader.py
The interactive interface guides you through all necessary configurations:
- OpenAI API Key: For generating embeddings
- Pinecone API Key: For vector database operations
- Model Selection:
  - text-embedding-3-large (3072 dimensions): Higher accuracy
  - text-embedding-3-small (1536 dimensions): Faster, cost-effective
- Custom Dimensions: Override default dimensions if needed
- Chunk Size: Text chunk size (default: 1000 characters)
- Chunk Overlap: Overlap between chunks (default: 200 characters)
- Batch Size: Documents processed per batch (default: 10)
- Index Name: Target index for vector storage
- Namespace: Optional namespace for organization
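The settings above could be gathered into a single configuration object. The following is a minimal sketch, not the tool's actual code: the `LoaderConfig` class, its field names, and the choice of `text-embedding-3-large` as the default model are assumptions made for illustration. Only the defaults for chunk size, chunk overlap, and batch size and the two model dimensions come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Default output dimensions of the two embedding models listed above.
MODEL_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}


@dataclass
class LoaderConfig:
    """Hypothetical container for the values collected by the prompts."""

    index_name: str
    model: str = "text-embedding-3-large"  # assumed default, not confirmed
    dimensions: Optional[int] = None  # None means "use the model default"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    batch_size: int = 10
    namespace: Optional[str] = None

    def __post_init__(self) -> None:
        if self.model not in MODEL_DIMENSIONS:
            raise ValueError(f"Unknown embedding model: {self.model}")
        if self.dimensions is None:
            # Fall back to the model's default dimension.
            self.dimensions = MODEL_DIMENSIONS[self.model]
        if self.dimensions < 1:
            raise ValueError("dimensions must be a positive integer")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
```

Whatever dimension is chosen, the target Pinecone index has to be created with the same dimension, otherwise upserts are rejected.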
- PDF: Text extracted using pypdf (the successor to PyPDF2)
- DOCX: Microsoft Word documents
- DOC: Legacy Word documents (python-docx itself reads only .docx, so these may need converting to .docx first)
Each document is processed with the following metadata:
{
  "source": "document_filename.pdf",
  "page": 1,
  "text": "document_content_chunk",
  "chunk_id": "unique_chunk_identifier"
}
Important: Content is stored in the text field (not page_content) for proper LangChain compatibility and n8n workflow integration.
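To make the metadata layout concrete, here is a minimal sketch of how one page of text might be split into overlapping chunks and turned into records of that shape. It uses plain character windows. The tool lists langchain for text splitting, so its real chunk boundaries may differ. The function names `split_text` and `build_records` and the hash-based `chunk_id` scheme are assumptions for illustration only.

```python
import hashlib
from typing import Dict, List, Union


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def build_records(
    source: str,
    page: int,
    page_text: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> List[Dict[str, Union[str, int]]]:
    """Build one metadata record per chunk, with content under the "text" key."""
    records: List[Dict[str, Union[str, int]]] = []
    for position, chunk in enumerate(split_text(page_text, chunk_size, chunk_overlap)):
        # Hypothetical ID scheme: hash of file name, page number, and chunk position.
        chunk_id = hashlib.sha1(f"{source}:{page}:{position}".encode("utf-8")).hexdigest()
        records.append(
            {"source": source, "page": page, "text": chunk, "chunk_id": chunk_id}
        )
    return records
```

With the default settings each chunk starts 800 characters after the previous one, so neighbouring chunks share 200 characters of context.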
python vector_loader.py
Follow the interactive prompts to configure and upload your documents.
your_documents/
├── contract1.pdf
├── agreement2.docx
├── policy3.doc
└── subdirectory/
└── more_docs.pdf
The tool will recursively process all supported documents in your specified directory.
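The recursive scan can be pictured with a short standard-library sketch. This is an illustration under assumptions, not the tool's own code: the helper name `find_documents` and the case-insensitive extension match are hypothetical.

```python
from pathlib import Path
from typing import List

# File extensions the tool is documented to support.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc"}


def find_documents(root: str) -> List[Path]:
    """Return every supported file under root, searching subdirectories too."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

For the directory layout shown above, `find_documents("your_documents")` would return the three top-level files plus `subdirectory/more_docs.pdf`.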
You can set these environment variables to skip interactive setup:
export OPENAI_API_KEY="your_openai_key"
export PINECONE_API_KEY="your_pinecone_key"
export PINECONE_INDEX_NAME="your_index_name"
All operations are logged to vector_loader.log with detailed information about:
- Document processing status
- Upload progress
- Error messages and troubleshooting info
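As a rough sketch of how the environment variables and the log file could be wired up with the standard library: the helper names (`settings_from_env`, `missing_settings`, `configure_logging`), the logger name, and the log line format are assumptions for illustration. Only the three variable names and the `vector_loader.log` file name come from the text above.

```python
import logging
import os
from typing import Dict, List, Optional

ENV_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME")


def settings_from_env() -> Dict[str, Optional[str]]:
    """Read each setting from the environment; unset or empty values become None."""
    return {name: os.environ.get(name) or None for name in ENV_VARS}


def missing_settings(settings: Dict[str, Optional[str]]) -> List[str]:
    """Names that would still have to be asked for interactively."""
    return [name for name, value in settings.items() if value is None]


def configure_logging(log_path: str = "vector_loader.log") -> logging.Logger:
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("vector_loader")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A caller could prompt only for the names returned by `missing_settings(settings_from_env())` and write status lines such as `logger.info("Processed contract1.pdf")` to the log.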
- API Connection Failures
  - Verify API keys are correct
  - Check internet connectivity
  - Ensure Pinecone index exists and is active
- Document Processing Errors
  - Verify file formats are supported
  - Check file permissions and accessibility
  - Review logs for specific error details
- Memory Issues with Large Documents
  - Reduce batch size
  - Decrease chunk size
  - Process documents in smaller batches
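Why a smaller batch size helps with memory can be seen in a minimal batching sketch. It assumes that the tool embeds and uploads one batch at a time, so only `batch_size` items are held at once. The generator name `batched` is hypothetical and not taken from the tool.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int = 10) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list so the previous batch can be freed
    if batch:
        yield batch  # final, possibly shorter batch
```

Because the function is a generator, it also accepts a lazy stream of chunks, so the full document set never has to be loaded into memory before uploading starts.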
The tool provides detailed error messages with suggested solutions. Check vector_loader.log for complete error traces.
- openai: OpenAI API client for embeddings
- pinecone-client: Pinecone vector database client
- langchain: Document processing and text splitting
- langchain-openai: LangChain OpenAI integration
- python-docx: Word document processing
- pypdf: PDF document processing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues:
- Check the troubleshooting section above
- Review the log file for detailed error information
- Open an issue on GitHub with:
  - Error message or unexpected behavior
  - Steps to reproduce
  - Relevant log entries
- Initial release with interactive interface
- Support for PDF, DOCX, and DOC files
- OpenAI embedding integration
- Pinecone vector storage
- LangChain compatibility
- Comprehensive error handling and logging