This repository contains an Agentic Retrieval-Augmented Generation (RAG) system designed for deep reasoning over a hybrid movie corpus. Unlike standard linear RAG pipelines, this system uses a custom ReAct (Reasoning + Acting) loop to dynamically orchestrate structured SQL databases, unstructured BM25 indices, and real-time Web Search.
Tip
View the full high-resolution demonstration on YouTube: Watch Demo Video
Demonstration Workflow:
- Input: The user provides a natural language query.
- Reasoning: The agent performs an internal ReAct cycle ([STRATEGIC BREAKDOWN] → [PLAN] → [THOUGHT]) to determine the best tool for the task.
- Execution: The agent dynamically calls SQL, BM25 Search, or Web Search based on its plan.
- Synthesis: Data from multiple sources is synthesized into a single, grounded response.
The demo showcases the following agentic behaviors:
| Internal Commands / Tools | Query Inputted |
|---|---|
| `query_data` (SQL) | "List the top 3 highest-grossing movies in the dataset and their respective release years." |
| `query_data` + `search_docs` | "What was the budget for 'Avengers: Endgame', and based on the reviews, what was the most praised aspect of its plot or themes?" |
| System Refusal (Safety Gate) | "Can you provide a detailed financial investment plan for the tech sector in 2026?" |
INTERNSHIP-SELECTION-ASSIGNMENT/
├── agent/  # Core Agent Intelligence
│   ├── agent_loop.py  # Main ReAct loop & state management
│   ├── agent_utils.py  # Deduplication & cleaning utilities
│   ├── bonus_features.py  # Self-Reflection & Telemetry logic
│   ├── prompts.py  # System prompts & reasoning protocols
│   └── tools_config.py  # Tool schemas & mapping
├── data/  # Processed Data Assets
│   ├── database.db  # Generated SQLite database
│   └── ingest_db.py  # SQLite ingestion utility
├── dataset/  # Raw Source Data
│   ├── movies_structured.csv  # Primary movie dataset
│   ├── rotten_tomatoes_movies.csv  # Extended reviews metadata
│   ├── top1000movies.csv  # Box office rankings
│   └── unstructured_reviews/  # .txt review corpus (15 movies)
├── evaluation/  # Performance Auditing
│   ├── logs/  # (Folder) Granular reasoning traces
│   ├── degradation_comparison.json  # Baseline vs. Degraded performance data
│   └── task_D_results.json  # Final results of 20-question suite
├── tools/  # Atomic Tool Layer
│   ├── query_data.py  # SQL/Pandas implementation
│   ├── search_docs.py  # BM25 document search
│   └── web_search.py  # Tavily Web search
├── utils/  # Shared Infrastructure
│   └── logger.py  # TraceLogger for terminal output
├── Degradation_Audit_Report.md  # Stress-test analysis (Bonus D)
├── DESIGN.md  # Architectural deep-dive
├── EVALUATION.md  # 20-question suite forensic report
├── tool_cost_analysis.md  # Per-tool latency & cost breakdown
├── preprocess.py  # Main data cleaning & processing script
├── setup_project.py  # Automated environment setup
├── task_D_20eval_test.py  # Automated 20-question test runner
├── degradation_runner.py  # Bonus D evaluation runner
├── task_A_test.py  # Tool-level unit tests
├── task_B_test.py  # Agent-level reasoning tests
├── task_C_test.py  # Cross-domain multi-tool tests
├── demo.gif  # Animated ReAct reasoning trace
├── .gitignore  # Git exclusion rules
└── requirements.txt  # System dependencies
The agent operates over a multi-tiered data environment to ensure high-fidelity grounding across structured, unstructured, and real-time domains.
| Source | Content | Size |
|---|---|---|
| SQLite DB | Structured movie metadata (Year, Genre, Budget, Gross, RT Score) | 663 rows × 7 columns |
| BM25 Index | Unstructured qualitative film reviews and thematic critiques | 15 .txt files |
| Web Search | Real-time external data (News, Awards, recent credits) | Dynamic (Tavily API) |
For a deep-dive into specific project domains, please refer to the following reports:
- Architecture Design (DESIGN.md): Details on the ReAct loop, tool schemas, and safety engineering.
- Evaluation Report (EVALUATION.md): Forensic trace analysis of the 20-question suite and accuracy metrics.
- Cost & Telemetry Analysis (tool_cost_analysis.md): Detailed breakdown of token consumption, API spend, and fiscal projections.
- Degradation Audit (Degradation_Audit_Report.md): Stress-test results showing system resilience under 50% data loss.
The core engine is a Python loop (agent/agent_loop.py) that manages state, history, and tool orchestration for the agent.
- No Black-Box Wrappers: Built from scratch without `initialize_agent` or high-level frameworks to ensure total transparency and control.
- State Optimization: Employs a Budgets & Constraints protocol, enforcing a hard 8-step cap to prevent infinite recursion.
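The bounded loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in `agent/agent_loop.py`: the `Decision` dataclass, the `run_agent` function, and the scripted policies that stand in for the LLM are all hypothetical names introduced here. Only the 8-step cap and the refusal message come from this README.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MAX_STEPS = 8  # hard step cap described in this README


@dataclass
class Decision:
    """Hypothetical shape of one model turn: call a tool or give a final answer."""
    kind: str            # "tool" or "final"
    tool: str = ""       # tool name when kind == "tool"
    argument: str = ""   # tool input when kind == "tool"
    answer: str = ""     # response text when kind == "final"


Policy = Callable[[str, List[str]], Decision]
Tool = Callable[[str], str]


def run_agent(question: str, decide: Policy, tools: Dict[str, Tool],
              max_steps: int = MAX_STEPS) -> str:
    """Alternate between deciding and acting until an answer or the step cap."""
    history: List[str] = []  # observations fed back into every decision
    for step in range(1, max_steps + 1):
        decision = decide(question, history)
        if decision.kind == "final":
            return decision.answer
        tool = tools.get(decision.tool)
        if tool is None:
            observation = f"unknown tool {decision.tool!r}"
        else:
            observation = tool(decision.argument)
        history.append(f"[{step}] {decision.tool}: {observation}")
    # Budget exhausted: refuse instead of guessing.
    return "STRICT REFUSAL: BUDGET EXCEEDED"


# --- Demo with scripted policies standing in for the LLM (assumption) ---
calls: List[str] = []


def fake_query_data(sql: str) -> str:
    calls.append(sql)
    return "1 row"


def scripted_policy(question: str, history: List[str]) -> Decision:
    if not history:
        return Decision(kind="tool", tool="query_data", argument="SELECT 1")
    return Decision(kind="final", answer=f"Answer based on {len(history)} observation(s)")


def looping_policy(question: str, history: List[str]) -> Decision:
    return Decision(kind="tool", tool="query_data", argument="SELECT 1")


toolbox = {"query_data": fake_query_data}
scripted_answer = run_agent("demo question", scripted_policy, toolbox)
runaway_answer = run_agent("demo question", looping_policy, toolbox)
print(scripted_answer)
print(runaway_answer)
```

Keeping the cap in the loop header, rather than trusting the model to stop, means a policy that never emits a final answer still terminates after `MAX_STEPS` tool calls.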
| Tool Name | Engine | Purpose | Output Fidelity |
|---|---|---|---|
| `query_data` | SQLite / Pandas | Precise numerical lookups, aggregations, and filtered searches. | Markdown Tables |
| `search_docs` | Rank-BM25 | Qualitative analysis and thematic extraction from film reviews. | Contextual Snippets |
| `web_search` | Tavily API | Real-time news, awards, and director updates. | URL-Cited Snippets |
- `search_docs` (Rank-BM25): Operates over a pre-indexed corpus of `.txt` reviews. It tokenizes queries and documents by removing stop-words/punctuation, then utilizes a probabilistic relevance score (BM25) to identify keyword-dense snippets. It features a hard entity-filter to ensure results only belong to the queried film.
- `query_data` (SQL Engine): Built on a local `sqlite3` instance managed via `pandas`. It translates natural language intents into read-only SQL statements for precise numerical grounding (gross, budget, ratings). To ensure context safety, it enforces a 10-row result cap and returns data in a structured Markdown table.
- `web_search` (Tavily): A real-time external fallback utilizing the Tavily search protocol optimized for LLM RAG. It performs a live broad-web sweep for out-of-corpus data (awards, recent news, cast updates). Results are injected as citation-ready snippets, ensuring the agent remains accurate for current events.
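As an illustration of the read-only guard, the 10-row cap, and the Markdown output described for `query_data`, here is a minimal sketch. It assumes only the standard-library `sqlite3` module (the actual tool is described as also using `pandas`), and the `movies` table, its columns, and the placeholder rows are hypothetical.

```python
import sqlite3
from typing import List, Sequence

ROW_CAP = 10  # result cap described in this README


def to_markdown(columns: Sequence[str], rows: List[tuple]) -> str:
    """Render a result set as a Markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join(["---"] * len(columns)) + "|"
    body = ["| " + " | ".join(str(value) for value in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])


def query_data(conn: sqlite3.Connection, sql: str, row_cap: int = ROW_CAP) -> str:
    """Run a single SELECT statement and return at most row_cap rows."""
    statement = sql.strip().rstrip(";")
    # Conservative guard (assumption): one statement only, and it must be a SELECT.
    # This also rejects literals containing ';', which is acceptable for a sketch.
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is allowed")
    cursor = conn.execute(statement)
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchmany(row_cap)  # never pull more than the cap into context
    return to_markdown(columns, rows)


# --- Demo on an in-memory database with placeholder rows (hypothetical schema) ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, gross INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(f"Film {i}", 2000 + i, i * 100) for i in range(12)],
)
table = query_data(conn, "SELECT title, gross FROM movies ORDER BY gross DESC")
print(table)
```

A prefix check like this is a coarse filter. Opening the database in SQLite's read-only mode would be a stronger guarantee, if the deployment allows it.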
- Internal Structured Data: SQLite for financial metrics and metadata.
- Internal Unstructured Data: BM25-based keyword (lexical) retrieval for thematic critiques.
- Web Fallback: Real-time retrieval via Tavily for fringe facts or recent updates.
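The BM25 retrieval tier, including stop-word removal and the hard entity filter, can be sketched in plain Python. This is an illustrative stand-in, not the `rank-bm25` code used by `search_docs`. The stop-word list, the file names, the sample texts, and the choice to match the film name against the file name are all assumptions, and the IDF uses the non-negative `log(1 + ...)` variant.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Tiny illustrative stop-word list (assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "but", "though"}


def tokenize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and remove stop-words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def bm25_rank(query: str, docs: Dict[str, str], film: str,
              k1: float = 1.5, b: float = 0.75) -> List[Tuple[str, float]]:
    """Score the documents for one film against the query, best first."""
    # Hard entity filter: only documents whose name mentions the film survive.
    corpus = {name: tokenize(text) for name, text in docs.items()
              if film.lower() in name.lower()}
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = sum(len(tokens) for tokens in corpus.values()) / n_docs or 1.0
    doc_freq = Counter(term for tokens in corpus.values() for term in set(tokens))
    query_terms = tokenize(query)
    scores = []
    for name, tokens in corpus.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((name, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# --- Demo with hypothetical review files for two placeholder films ---
reviews = {
    "alpha_review_1.txt": "The heist plot is layered and the themes of grief land well.",
    "alpha_review_2.txt": "Visuals impress, though the pacing drags in the middle.",
    "beta_review_1.txt": "The plot is familiar but the world building is stunning.",
}
ranked = bm25_rank("plot themes", reviews, film="alpha")
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Filtering before scoring means the term statistics come only from the queried film's reviews, so a review of another film can never outrank them regardless of keyword overlap.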
flowchart TB
%% -------- ENTRY --------
A[User Query] --> B{Cache Hit?}
B -->|Yes| C[Return Cached Trace]
B -->|No| D[Init context_state]
%% -------- CONTEXT LAYER --------
subgraph CONTEXT [Context Consolidation]
direction TB
E1[Merge Knowledge Base]
E2[Inject Failure Memory]
E3[Track Used Queries]
E4[Apply Step Budget]
end
D --> E1 --> E2 --> E3 --> E4
%% -------- REASONING --------
subgraph REASONING [Strategic Reasoning]
direction TB
F1[Task Decomposition]
F2[Tool Planning]
F3[Thought Generation]
end
E4 --> F1 --> F2 --> F3
F3 --> G{Decision}
%% -------- TOOL LAYER --------
subgraph TOOLS [Tool Execution]
direction TB
H[Deduplication Check]
J[Execute Tool]
J1[SQL]
J2[BM25]
J3[Web]
K[Process Results]
end
G -->|Tool Call| H
H -->|Valid| J
H -->|Redundant| I[Block + Feedback]
J --> J1 --> K
J --> J2 --> K
J --> J3 --> K
%% -------- STATE UPDATE --------
I --> L[Update Context]
K --> L
L --> M[Telemetry + Memory]
%% -------- LOOP CONTROL --------
M --> N{Step < 8?}
N -->|Yes| E1
N -->|No| O[Force Refusal]
%% -------- REFLECTION --------
subgraph REFLECTION [Self-Critique]
direction TB
P[Reflection Check]
P1[Grounding]
P2[Completeness]
end
G -->|Final Answer| P
P --> P1 --> P2 --> Q{Pass?}
Q -->|Yes| R[Finalize Answer]
Q -->|No| S[Emergency Retrieval]
S --> E1
%% -------- OUTPUT --------
R --> T[Log Metrics]
T --> U[Cache Trace]
U --> V[Return Response]
%% -------- COLORS --------
style CONTEXT fill:#e3f2fd,stroke:#64b5f6,stroke-width:2px
style REASONING fill:#e8f5e9,stroke:#66bb6a,stroke-width:2px
style TOOLS fill:#fff3e0,stroke:#ffa726,stroke-width:2px
style REFLECTION fill:#fce4ec,stroke:#ec407a,stroke-width:2px
style O fill:#ffebee,stroke:#e53935
style C fill:#e0f7fa,stroke:#26c6da
For a deep-dive into the agent's internal mechanics, tool schemas, and safety engineering, see the Architecture Design Document (DESIGN.md).
This implementation satisfies all core tasks (A-D) and includes several advanced performance bonuses.
| Milestone | Implementation Files | Key Accomplishment |
|---|---|---|
| Task A: Tool Layer | `tools/`, `task_A_test.py` | Implemented precise SQL, BM25, and Web Search tools. |
| Task B: Agent Logic | `agent/agent_loop.py`, `task_B_test.py` | Developed a custom ReAct loop with strategic reasoning. |
| Task C: Multi-Tool | `task_C_test.py` | Verified cross-domain reasoning and tool orchestration. |
| Task D: Evaluation | `task_D_20eval_test.py`, `EVALUATION.md` | Executed a 20-question suite with 100% grounding rate. |
| Bonus | Files & Logic | Purpose |
|---|---|---|
| A: Strategic Reasoning | `agent/prompts.py` | Mandatory [STRATEGIC BREAKDOWN] before any action. |
| B: Operational Telemetry | `agent/bonus_features.py`, `tool_cost_analysis.md` | Cost analysis in USD & INR. |
| C: Reflection & Recovery | `agent/bonus_features.py` | Self-critique turn that audits and fixes final answers. |
| D: Degradation Audit | `degradation_runner.py`, `Degradation_Audit_Report.md` | Testing model robustness and accuracy retention under 50% data loss. |
- Proactive Safety Gating: Programmatic pre-processor for injection/jailbreak detection.
- Keyword Deduplication: Intelligence layer to prevent redundant, costly tool calls.
- Persistent Trace Caching: Integrated JSON cache for $0.00 cost replay of common queries.
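One way to picture the keyword deduplication layer is a per-query set of call signatures. The sketch below is an assumption about how such a check could work, not the logic in `agent/agent_utils.py`: `CallDeduplicator`, `keyword_signature`, and the stop-word list are hypothetical names introduced for illustration.

```python
import re
from typing import FrozenSet, Set, Tuple

# Illustrative stop-word list (assumption).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "for", "to"}

Signature = Tuple[str, FrozenSet[str]]


def keyword_signature(tool: str, query: str) -> Signature:
    """Reduce a tool call to its tool name plus an order-free keyword set."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return tool, frozenset(word for word in words if word not in STOP_WORDS)


class CallDeduplicator:
    """Remembers the calls made while answering one user query."""

    def __init__(self) -> None:
        self._seen: Set[Signature] = set()

    def is_redundant(self, tool: str, query: str) -> bool:
        """Return True for a repeat of an earlier call; otherwise record the call."""
        signature = keyword_signature(tool, query)
        if signature in self._seen:
            return True
        self._seen.add(signature)
        return False


# --- Demo ---
dedup = CallDeduplicator()
first_call = dedup.is_redundant("search_docs", "themes of Inception")
reworded_call = dedup.is_redundant("search_docs", "Inception themes")
other_tool_call = dedup.is_redundant("web_search", "Inception themes")
print(first_call, reworded_call, other_tool_call)
```

An order-free keyword set suits free-text search queries. It is likely too lossy for SQL, where word order changes meaning, so a real implementation would probably compare normalized SQL strings instead.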
The system was evaluated against a rigorous 20-question suite covering Single-Tool, Multi-Tool, Refusal, and Edge-Case categories.
| Metric | Result | Insight |
|---|---|---|
| Overall Accuracy | 75% | Exceptionally high for multi-step reasoning. |
| Grounding Rate | 100% | Zero hallucinations; all facts are cited. |
| Failure Mode Resilience | Excellent | Agent gracefully falls back to Web when Local Data is missing. |
| Avg. Query Cost | $0.012 (₹1.01) | Highly efficient tiered data escalation. |
Full 20-question traces and forensic failure analysis can be found in EVALUATION.md.
The project includes an automated setup pipeline that handles virtual environments and data preprocessing:
`python setup_project.py`

- Interactive REPL: `python agent/agent_loop.py`
- Single Question: `python agent/agent_loop.py "Compare Avatar and Inception worldwide gross."`
- Run Evaluation: `python task_D_20eval_test.py`

- GITHUB_TOKEN: Generate a token to access GitHub Models.
  - Visit the GPT-4o-mini Playground.
  - Click "Get started" or "Get SDK token" to be redirected to the Personal Access Tokens page.
  - Create a new token and paste it into your `.env` file.
- TAVILY_API_KEY: Get a free key from Tavily AI for web search capabilities.
GITHUB_TOKEN=your_token_here
TAVILY_API_KEY=your_key_here

As per the technical requirements, we have identified and documented the system's "Breaking Points":
- Helpfulness Drift: In refusal cases (such as recipe requests), the base model's helpfulness bias sometimes leads the agent to offer general trends before stating that it cannot fulfill the request.
- Ambiguity Resolution: For titles shared by more than one film, such as 'The Host' (2006 vs. 2013), the agent needs the query to specify the release year; it does not disambiguate on its own.
- Budget Truncation: To protect the 8k context window, SQL results are capped at 10 rows.
The system enforces a strict 8-tool-call limit to ensure operational stability and prevent infinite reasoning loops. This "hard cap" is a core architectural safeguard; if the agent cannot resolve a query within this budget, it is programmed to deliver a Structured Refusal rather than providing an ungrounded guess.
The system's resilience can be verified using highly complex queries that exceed the reasoning budget.
Example Edge Case:
`python agent/agent_loop.py "List the top 12 highest-grossing movies of all time. For each one, tell me the name of the lead actor and search the web to find out if that actor won an Oscar for that specific role."`

Expected Behavior: The agent will use its full 8-step budget and then trigger a STRICT REFUSAL: BUDGET EXCEEDED terminal box.
This project was pair-programmed with Antigravity, an experimental AI coding agent. Together, we designed the modular tool architecture, implemented the ReAct loop from first principles, and developed the forensic audit runners to ensure reproducible excellence.
