PE-RAG: Legal Document Vector Loader for Pinecone

An interactive Python tool for uploading legal documents to a Pinecone vector database with OpenAI embeddings. It is designed for legal document processing, with metadata structured for n8n workflows and LangChain compatibility.

Features

🔧 Interactive Setup: Guided configuration with smart defaults and recommendations
📚 Multi-Format Support: Process PDF, DOCX, and DOC files
🔑 Secure API Management: Safe handling of OpenAI and Pinecone API keys
🧪 Connection Testing: Verify API connections before processing
📊 Smart Chunking: Configurable text chunking with overlap for optimal retrieval
🎯 LangChain Compatible: Proper metadata structure for downstream processing
📈 Progress Tracking: Real-time upload progress with detailed logging

Quick Start

Clone the repository:

git clone https://github.com/[username]/PE-RAG.git
cd PE-RAG

Set up Python environment:

python -m venv venv
# Windows
venv\Scripts\activate
# Linux/Mac
source venv/bin/activate

Install dependencies:

pip install openai pinecone-client langchain langchain-openai python-docx pypdf

Run the interactive loader:
```
python vector_loader.py
```

Configuration Options

The interactive interface guides you through all necessary configurations:

API Keys

OpenAI API Key: For generating embeddings
Pinecone API Key: For vector database operations

Embedding Configuration

Model Selection:
- text-embedding-3-large (3072 dimensions) - Higher accuracy
- text-embedding-3-small (1536 dimensions) - Faster, cost-effective
Custom Dimensions: Override the default dimensions if needed. The value must match the dimension of the target Pinecone index.
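
As a rough illustration of how the model choice and a custom dimension override might be reconciled, here is a minimal sketch. The helper name `resolve_dimensions` is hypothetical and is not taken from vector_loader.py. It also assumes a custom value may only shorten a model's default vector size, never exceed it.

```python
from typing import Optional

# Default output sizes for the two models listed above.
DEFAULT_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}


def resolve_dimensions(model: str, custom: Optional[int] = None) -> int:
    """Return the vector size to use for `model` (hypothetical helper)."""
    if model not in DEFAULT_DIMENSIONS:
        raise ValueError(f"Unsupported embedding model: {model}")
    default = DEFAULT_DIMENSIONS[model]
    if custom is None:
        return default
    # Assumption: an override can shorten the vector but not lengthen it.
    if not 1 <= custom <= default:
        raise ValueError(
            f"Custom dimensions must be between 1 and {default} for {model}"
        )
    return custom
```

Whatever value results should equal the dimension the Pinecone index was created with, otherwise uploads to that index will be rejected.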

Processing Parameters

Chunk Size: Text chunk size (default: 1000 characters)
Chunk Overlap: Overlap between chunks (default: 200 characters)
Batch Size: Documents processed per batch (default: 10)
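
To show how chunk size and overlap interact, the sketch below splits text with a plain sliding character window. It is an illustration only: the tool lists LangChain for text splitting, and `chunk_text` is a hypothetical stand-in, not the splitter the tool uses.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window starts 800 characters after the previous one, so a 2500-character document yields three chunks. Raising the overlap increases the number of chunks, and therefore the number of embeddings generated.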

Pinecone Settings

Index Name: Target index for vector storage
Namespace: Optional namespace for organization

Document Processing

Supported Formats

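PDF: Extracted using pypdf
DOCX: Microsoft Word documents, read with python-docx
DOC: Legacy Word documents. python-docx itself reads only the .docx format, so if a .doc file fails to load, convert it to .docx first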

Metadata Structure

Each document is processed with the following metadata:

{
  "source": "document_filename.pdf",
  "page": 1,
  "text": "document_content_chunk",
  "chunk_id": "unique_chunk_identifier"
}

Important: Chunk content is stored in the text metadata field (not page_content). LangChain's Pinecone vector store reads content from the text key by default, which also keeps the records usable from n8n workflows.
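
A minimal sketch of how one chunk could be turned into a record with this metadata layout follows. The function name `build_record` and the hash-based chunk_id scheme are assumptions made for illustration; the tool may build its identifiers differently. The id/values/metadata shape is the dictionary form that Pinecone upserts accept.

```python
import hashlib
from typing import Any, Dict, List


def build_record(source: str, page: int, chunk_index: int,
                 text: str, embedding: List[float]) -> Dict[str, Any]:
    """Build one vector record with the metadata fields shown above."""
    # Hypothetical ID scheme: a stable hash of file name, page, and chunk position.
    key = f"{source}:{page}:{chunk_index}"
    chunk_id = hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
    return {
        "id": chunk_id,
        "values": embedding,
        "metadata": {
            "source": source,
            "page": page,
            "text": text,  # content lives under "text", not "page_content"
            "chunk_id": chunk_id,
        },
    }
```

A deterministic ID means re-uploading the same document overwrites its earlier vectors instead of duplicating them.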

Usage Examples

Basic Usage

python vector_loader.py

Follow the interactive prompts to configure and upload your documents.

Directory Structure

your_documents/
├── contract1.pdf
├── agreement2.docx
├── policy3.doc
└── subdirectory/
    └── more_docs.pdf

The tool will recursively process all supported documents in your specified directory.
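
Recursive discovery of this kind can be sketched with pathlib. The function name `find_documents` and the constant `SUPPORTED_EXTENSIONS` are hypothetical and only illustrate the idea.

```python
from pathlib import Path
from typing import List

# Extensions matching the supported formats listed in this README.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc"}


def find_documents(root: str) -> List[Path]:
    """Return every supported file under `root`, including subdirectories."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")
    return sorted(
        path
        for path in root_path.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Comparing the lowercased suffix means files such as CONTRACT.PDF are picked up as well.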

Advanced Configuration

Environment Variables

You can set these environment variables to skip interactive setup:

export OPENAI_API_KEY="your_openai_key"
export PINECONE_API_KEY="your_pinecone_key"
export PINECONE_INDEX_NAME="your_index_name"
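
One way a loader could prefer these variables and fall back to a prompt is sketched below. `get_setting` is a hypothetical helper, and its `env` and `ask` parameters exist only so the logic can be exercised without a terminal.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def get_setting(name: str, prompt: str, secret: bool = False,
                env: Optional[Mapping[str, str]] = None,
                ask: Optional[Callable[[str], str]] = None) -> str:
    """Return the value of `name` from the environment, or prompt for it."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if value:
        return value  # variable is set, so no prompt is needed
    if ask is None:
        # getpass hides typed input, which suits API keys.
        ask = getpass if secret else input
    value = ask(prompt).strip()
    if not value:
        raise ValueError(f"{name} is required")
    return value
```

For example, `get_setting("OPENAI_API_KEY", "OpenAI API key: ", secret=True)` returns the exported key when one is present and only prompts otherwise.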

Logging

All operations are logged to vector_loader.log with detailed information about:

Document processing status
Upload progress
Error messages and troubleshooting info
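
For readers who want a comparable setup in their own scripts, here is a minimal sketch using the standard logging module. The logger name, message format, and `configure_logging` function are assumptions, not the tool's actual configuration.

```python
import logging


def configure_logging(log_path: str = "vector_loader.log",
                      level: int = logging.INFO) -> logging.Logger:
    """Log to a file and to the console (hypothetical setup)."""
    logger = logging.getLogger("vector_loader")
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        file_handler = logging.FileHandler(log_path, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger
```

Calling `logger.exception(...)` inside an except block writes the full traceback to the file, which is the kind of detail useful when troubleshooting.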

Troubleshooting

Common Issues

API Connection Failures:
- Verify API keys are correct
- Check internet connectivity
- Ensure Pinecone index exists and is active

Document Processing Errors:
- Verify file formats are supported
- Check file permissions and accessibility
- Review logs for specific error details

Memory Issues with Large Documents:
- Reduce batch size
- Decrease chunk size
- Process documents in smaller batches
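
Why a smaller batch size eases memory pressure can be seen in a simple batching sketch: only one batch of items is held for embedding and upload at a time. The generator `batched` below is a hypothetical illustration, not code from the tool.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 10) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Iterating with `for batch in batched(chunks, batch_size=5):` processes five items per round, trading a few more API calls for a lower peak memory footprint.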

Error Messages

The tool provides detailed error messages with suggested solutions. Check vector_loader.log for complete error traces.

Dependencies

openai - OpenAI API client for embeddings
pinecone-client - Pinecone vector database client
langchain - Document processing and text splitting
langchain-openai - LangChain OpenAI integration
python-docx - Word document processing
pypdf - PDF document processing

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions or issues:

Check the troubleshooting section above
Review the log file for detailed error information
Open an issue on GitHub with:
- Error message or unexpected behavior
- Steps to reproduce
- Relevant log entries

Changelog

v1.0.0 (2025-10-20)

Initial release with interactive interface
Support for PDF, DOCX, and DOC files
OpenAI embedding integration
Pinecone vector storage
LangChain compatibility
Comprehensive error handling and logging
