
NeuroQuest: Hardware-Optimized RAG Pipeline for Academic Papers

This repository contains a fully local Retrieval-Augmented Generation (RAG) pipeline designed to extract, synthesize, and analyze information from dense academic PDFs.

I built this project to bridge the gap between large language models and static documents, with a particular focus on making it run smoothly on constrained hardware (specifically, the free tier of Google Colab with a Tesla T4 GPU).

Architecture & Engineering Choices

Building a reliable RAG system requires balancing retrieval accuracy with generation capabilities, all while keeping an eye on VRAM limits. Here is the breakdown of the pipeline:

1. Document Processing & Vectorization

  • Chunking Strategy: The PDF is parsed into 800-character chunks with a 100-character overlap. The overlap reduces context fragmentation, so sentences that straddle chunk (or page) boundaries aren't cut off abruptly.
  • Embeddings: sentence-transformers/all-MiniLM-L6-v2, chosen because it is extremely lightweight and fast while still providing strong semantic retrieval quality for its size.
  • Vector Search: FAISS (Facebook AI Similarity Search) with an exact L2-distance index. It runs entirely on the CPU, saving GPU memory for the generation model. (A minimal sketch of this indexing stage follows the list.)
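
For illustration, here is a minimal sketch of the indexing stage under these choices. It is not the notebook's exact code: the pypdf parser, the load_chunks helper, and the page-tracking logic are assumptions made for the example, while the chunk size, overlap, embedding model, and L2 FAISS index follow the description above.

    # Indexing sketch (illustrative, not the notebook's exact code)
    import faiss
    from pypdf import PdfReader
    from sentence_transformers import SentenceTransformer

    CHUNK_SIZE, OVERLAP = 800, 100

    def load_chunks(pdf_path):
        """Split the PDF into overlapping character chunks, remembering source pages."""
        reader = PdfReader(pdf_path)
        chunks, pages = [], []
        for page_no, page in enumerate(reader.pages, start=1):
            text = page.extract_text() or ""
            step = CHUNK_SIZE - OVERLAP
            for start in range(0, len(text), step):
                piece = text[start:start + CHUNK_SIZE]
                if piece.strip():
                    chunks.append(piece)
                    pages.append(page_no)
        return chunks, pages

    # Embed the chunks on the CPU and build an exact L2 FAISS index.
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    chunks, pages = load_chunks("paper.pdf")
    embeddings = embedder.encode(chunks, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)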

2. The Generation Model: Why Qwen 2.5 (3B)?

Initially, I developed this pipeline using Flan-T5-Base (a Seq2Seq model). While it was fine for simple extraction, it struggled with "context confusion" when presented with long, concatenated chunks of text during multi-hop reasoning tasks.

To achieve deeper, abstractive reasoning without relying on paid APIs, I upgraded the system to Qwen2.5-3B-Instruct.

  • Hardware Optimization: A 3-billion-parameter model normally consumes a lot of VRAM. Loading the model in half precision (torch_dtype=torch.float16) cuts the footprint to roughly 6 GB, so it runs comfortably on a standard 15 GB T4 GPU without Out-Of-Memory (OOM) errors.
  • Prompt Engineering: The implementation uses standard chat templates (apply_chat_template) combined with a strict system prompt and a low temperature (temperature=0.1) to keep the LLM grounded and reduce hallucinations. (A sketch of this generation stage follows the list.)
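
Below is a minimal sketch of how this generation stage can be wired up. The Hugging Face model id Qwen/Qwen2.5-3B-Instruct is the standard id for this model, but the answer_from_context helper and the exact system-prompt wording are assumptions for illustration; the half-precision loading, apply_chat_template, and temperature=0.1 follow the description above.

    # Generation sketch (illustrative; the prompt wording is an assumption, not the notebook's)
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # half precision: roughly 6 GB of VRAM for 3B parameters
        device_map="auto",
    )

    def answer_from_context(question, context):
        """Build a grounded chat prompt and generate a low-temperature answer."""
        messages = [
            {"role": "system", "content": "Answer only from the provided context. "
                                          "If the context is insufficient, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.1)
        return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)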

Getting Started

You can run this entire pipeline directly in a Google Colab notebook.

  1. Clone the repository and install the dependencies:
    pip install -r requirements.txt
  2. Upload your target PDF as paper.pdf in the working directory.
  3. Run the RAG_Engine.ipynb notebook. (Make sure your runtime is set to Hardware Accelerator: GPU / T4).

Evaluation & Sample Output

To benchmark the system, I tested it against the foundational "Attention Is All You Need" paper. The model successfully demonstrates multi-hop reasoning (pulling context from different pages) and provides analytical, non-extractive answers.

Sample Query:

Explain the main advantages of the Transformer architecture over RNNs according to the paper.

System Output:

ANSWER: According to the provided context, the main advantage of the Transformer architecture over RNNs is that it computes representations of its input and output without using sequence-aligned RNNs or convolution. This implies that the Transformer avoids the sequential dependencies and memory limitations inherent in RNNs, which can be a significant drawback in handling long sequences.
SOURCES: Pages 2, 3
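
For completeness, here is a sketch of how a single query could flow end to end and produce output in this ANSWER/SOURCES format. It reuses the illustrative embedder, index, chunks, pages, and answer_from_context names from the sketches above; the ask helper and its formatting are assumptions, not the notebook's exact code.

    # End-to-end query sketch, reusing the illustrative names defined above
    def ask(question, top_k=4):
        # Embed the query and retrieve the nearest chunks by L2 distance.
        query_vec = embedder.encode([question], convert_to_numpy=True).astype("float32")
        _, hit_ids = index.search(query_vec, top_k)
        hits = hit_ids[0]
        # Concatenate the retrieved chunks, tagging each with its source page.
        context = "\n\n".join(f"[Page {pages[i]}] {chunks[i]}" for i in hits)
        answer = answer_from_context(question, context)
        sources = ", ".join(str(p) for p in sorted({pages[i] for i in hits}))
        return f"ANSWER: {answer}\nSOURCES: Pages {sources}"

    print(ask("Explain the main advantages of the Transformer architecture over RNNs "
              "according to the paper."))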

Author

Pasha Ahmadi
M.Sc. Student in Computer Engineering
