Anish-Ramesh/INTERNSHIP-SELECTION-ASSIGNMENT

Movie Reasoning Agent: Advanced Agentic RAG Implementation (Option C)

This repository contains an Agentic Retrieval-Augmented Generation (RAG) system designed for deep reasoning over a hybrid movie corpus. Unlike standard linear RAG pipelines, it uses a custom ReAct (Reasoning + Acting) loop to route each query dynamically among a structured SQL database, an unstructured BM25 index, and real-time web search.

🎥 1. Task E: Demo Video (ReAct Reasoning Trace)

Agent Demo

Tip

View the full high-resolution demonstration on YouTube: Watch Demo Video

Demonstration Workflow:

  1. Input: The user provides a natural language query.
  2. Reasoning: The agent performs an internal ReAct cycle ([STRATEGIC BREAKDOWN] → [PLAN] → [THOUGHT]) to determine the best tool for the task.
  3. Execution: The agent dynamically calls SQL, BM25 Search, or Web Search based on its plan.
  4. Synthesis: Data from multiple sources is synthesized into a single, grounded response.

The demo showcases the following agentic behaviors:

| Internal Commands / Tools | Query Inputted |
| --- | --- |
| query_data (SQL) | "List the top 3 highest-grossing movies in the dataset and their respective release years." |
| query_data + search_docs | "What was the budget for 'Avengers: Endgame', and based on the reviews, what was the most praised aspect of its plot or themes?" |
| System Refusal (Safety Gate) | "Can you provide a detailed financial investment plan for the tech sector in 2026?" |

πŸ“‚ 2. Project Structure

INTERNSHIP-SELECTION-ASSIGNMENT/
├── agent/                      # Core Agent Intelligence
│   ├── agent_loop.py           # Main ReAct loop & state management
│   ├── agent_utils.py          # Deduplication & cleaning utilities
│   ├── bonus_features.py       # Self-Reflection & Telemetry logic
│   ├── prompts.py              # System prompts & reasoning protocols
│   └── tools_config.py         # Tool schemas & mapping
├── data/                       # Processed Data Assets
│   ├── database.db             # Generated SQLite database
│   └── ingest_db.py            # SQLite ingestion utility
├── dataset/                    # Raw Source Data
│   ├── movies_structured.csv   # Primary movie dataset
│   ├── rotten_tomatoes_movies.csv # Extended reviews metadata
│   ├── top1000movies.csv       # Box office rankings
│   └── unstructured_reviews/   # .txt review corpus (15 movies)
├── evaluation/                 # Performance Auditing
│   ├── logs/                   # (Folder) Granular reasoning traces
│   ├── degradation_comparison.json # Baseline vs. Degraded performance data
│   └── task_D_results.json     # Final results of 20-question suite
├── tools/                      # Atomic Tool Layer
│   ├── query_data.py           # SQL/Pandas implementation
│   ├── search_docs.py          # BM25 document search
│   └── web_search.py           # Tavily Web search
├── utils/                      # Shared Infrastructure
│   └── logger.py               # TraceLogger for terminal output
├── Degradation_Audit_Report.md # Stress-test analysis (Bonus D)
├── DESIGN.md                   # Architectural deep-dive
├── EVALUATION.md               # 20-question suite forensic report
├── tool_cost_analysis.md       # Per-tool latency & cost breakdown
├── preprocess.py               # Main data cleaning & processing script
├── setup_project.py            # Automated environment setup
├── task_D_20eval_test.py       # Automated 20-question test runner
├── degradation_runner.py       # Bonus D evaluation runner
├── task_A_test.py              # Tool-level unit tests
├── task_B_test.py              # Agent-level reasoning tests
├── task_C_test.py              # Cross-domain multi-tool tests
├── demo.gif                    # Animated ReAct reasoning trace
├── .gitignore                  # Git exclusion rules
└── requirements.txt            # System dependencies

📊 3. Hybrid Knowledge Base

The agent operates over a multi-tiered data environment to ensure high-fidelity grounding across structured, unstructured, and real-time domains.

| Source | Content | Size |
| --- | --- | --- |
| SQLite DB | Structured movie metadata (Year, Genre, Budget, Gross, RT Score) | 663 rows × 7 columns |
| BM25 Index | Unstructured qualitative film reviews and thematic critiques | 15 .txt files |
| Web Search | Real-time external data (News, Awards, recent credits) | Dynamic (Tavily API) |

📚 4. Documentation Index

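For a deep-dive into specific project domains, refer to the reports in the repository root: DESIGN.md (architecture), EVALUATION.md (20-question suite report), Degradation_Audit_Report.md (stress-test analysis), and tool_cost_analysis.md (per-tool latency and cost breakdown).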


🏗️ 5. Technical Architecture

The Agent Loop

The core engine is a Python loop (agent/agent_loop.py) that manages state, history, and tool orchestration for the agent.

  • No Black-Box Wrappers: Built from scratch without initialize_agent or high-level frameworks to ensure total transparency and control.
  • State Optimization: Employs a Budgets & Constraints protocol, enforcing a hard 8-step cap to prevent infinite reasoning loops.
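
The loop-with-budget idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in agent/agent_loop.py: the `decide` callback stands in for the LLM call, and `run_agent`, `Decision`, and both stub policies are hypothetical names introduced here. Only the 8-step cap and the refusal message come from this README.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 8  # hard cap stated in this README

# Hypothetical decision shape: ("tool", tool_name, tool_input) or ("final", answer, "")
Decision = Tuple[str, str, str]


def run_agent(
    question: str,
    decide: Callable[[str, List[str]], Decision],
    tools: Dict[str, Callable[[str], str]],
) -> str:
    """Run a ReAct-style loop until a final answer or until the step budget is spent."""
    history: List[str] = []
    for step in range(1, MAX_STEPS + 1):
        kind, name, payload = decide(question, history)
        if kind == "final":
            return name  # for a final decision, the second slot carries the answer
        observation = tools[name](payload)
        history.append(f"step {step}: {name}({payload!r}) -> {observation}")
    # Budget exhausted: refuse instead of guessing.
    return "STRICT REFUSAL: BUDGET EXCEEDED"


def stub_decide(question: str, history: List[str]) -> Decision:
    """Stand-in for the LLM: one tool call, then a final answer."""
    if not history:
        return ("tool", "query_data", "SELECT 1")
    return ("final", "answer grounded in: " + history[-1], "")


def looping_decide(question: str, history: List[str]) -> Decision:
    """Stand-in policy that never finishes, to exercise the cap."""
    return ("tool", "query_data", f"attempt {len(history)}")


stub_tools = {"query_data": lambda q: f"rows for {q}"}
print(run_agent("demo", stub_decide, stub_tools))
print(run_agent("demo", looping_decide, stub_tools))
```

Keeping the cap in the `for` range rather than in the prompt means the budget holds even when the model ignores its instructions.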

Tool Contracts

| Tool Name | Engine | Purpose | Output Fidelity |
| --- | --- | --- | --- |
| query_data | SQLite / Pandas | Precise numerical lookups, aggregations, and filtered searches. | Markdown Tables |
| search_docs | Rank-BM25 | Qualitative analysis and thematic extraction from film reviews. | Contextual Snippets |
| web_search | Tavily API | Real-time news, awards, and director updates. | URL-Cited Snippets |

Retriever Implementation Details

  • search_docs (Rank-BM25): Operates over a pre-indexed corpus of .txt reviews. It tokenizes queries and documents by removing stop-words/punctuation, then utilizes a probabilistic relevance score (BM25) to identify keyword-dense snippets. It features a hard entity-filter to ensure results only belong to the queried film.
  • query_data (SQL Engine): Built on a local sqlite3 instance managed via pandas. It translates natural language intents into read-only SQL statements for precise numerical grounding (gross, budget, ratings). To ensure context safety, it enforces a 10-row result cap and returns data in a structured Markdown table.
  • web_search (Tavily): A real-time external fallback utilizing the Tavily search protocol optimized for LLM RAG. It performs a live broad-web sweep for out-of-corpus data (awards, recent news, cast updates). Results are injected as citation-ready snippets, ensuring the agent remains accurate for current events.
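
To make the search_docs description concrete, here is a self-contained sketch of BM25 scoring with an entity filter. It is written in plain Python rather than with the rank-bm25 package the project uses, and it applies the common "+1 inside the logarithm" IDF variant, so scores will not match that library exactly. The stop-word list, the file names, and the prefix-based film filter are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Tiny illustrative stop-word list (assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "its", "to", "was"}


def tokenize(text: str) -> List[str]:
    """Lowercase, strip punctuation, and drop stop-words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def bm25_rank(
    query: str, docs: Dict[str, str], film: str, k1: float = 1.5, b: float = 0.75
) -> List[Tuple[str, float]]:
    """Score only documents whose name starts with `film`, best match first."""
    corpus = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(corpus)
    avg_len = sum(len(toks) for toks in corpus.values()) / n_docs
    doc_freq = Counter(term for toks in corpus.values() for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for name, toks in corpus.items():
        if not name.startswith(film):  # hard entity filter (hypothetical naming scheme)
            continue
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append((name, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


reviews = {
    "inception_1.txt": "The dream heist plot is layered and inventive.",
    "inception_2.txt": "Stunning visuals and a loud score.",
    "avatar_1.txt": "The plot is thin but the visuals are stunning.",
}
ranked = bm25_rank("What was praised about the plot?", reviews, film="inception")
print(ranked)
```

Statistics such as document frequency are computed over the whole corpus, while the filter only restricts which documents are returned, so a term's rarity is still judged against all reviews.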

Design Principles

  1. Internal Structured Data: SQLite for financial metrics and metadata.
  2. Internal Unstructured Data: BM25-based lexical retrieval for thematic critiques.
  3. Web Fallback: Real-time retrieval via Tavily for fringe facts or recent updates.

System Architecture

flowchart TB

%% -------- ENTRY --------
A[User Query] --> B{Cache Hit?}
B -->|Yes| C[Return Cached Trace]
B -->|No| D[Init context_state]

%% -------- CONTEXT LAYER --------
subgraph CONTEXT [Context Consolidation]
    direction TB
    E1[Merge Knowledge Base]
    E2[Inject Failure Memory]
    E3[Track Used Queries]
    E4[Apply Step Budget]
end

D --> E1 --> E2 --> E3 --> E4

%% -------- REASONING --------
subgraph REASONING [Strategic Reasoning]
    direction TB
    F1[Task Decomposition]
    F2[Tool Planning]
    F3[Thought Generation]
end

E4 --> F1 --> F2 --> F3
F3 --> G{Decision}

%% -------- TOOL LAYER --------
subgraph TOOLS [Tool Execution]
    direction TB
    H[Deduplication Check]
    J[Execute Tool]
    J1[SQL]
    J2[BM25]
    J3[Web]
    K[Process Results]
end

G -->|Tool Call| H
H -->|Valid| J
H -->|Redundant| I[Block + Feedback]

J --> J1 --> K
J --> J2 --> K
J --> J3 --> K

%% -------- STATE UPDATE --------
I --> L[Update Context]
K --> L
L --> M[Telemetry + Memory]

%% -------- LOOP CONTROL --------
M --> N{Step < 8?}
N -->|Yes| E1
N -->|No| O[Force Refusal]

%% -------- REFLECTION --------
subgraph REFLECTION [Self-Critique]
    direction TB
    P[Reflection Check]
    P1[Grounding]
    P2[Completeness]
end

G -->|Final Answer| P
P --> P1 --> P2 --> Q{Pass?}

Q -->|Yes| R[Finalize Answer]
Q -->|No| S[Emergency Retrieval]

S --> E1

%% -------- OUTPUT --------
R --> T[Log Metrics]
T --> U[Cache Trace]
U --> V[Return Response]

%% -------- COLORS --------
style CONTEXT fill:#e3f2fd,stroke:#64b5f6,stroke-width:2px
style REASONING fill:#e8f5e9,stroke:#66bb6a,stroke-width:2px
style TOOLS fill:#fff3e0,stroke:#ffa726,stroke-width:2px
style REFLECTION fill:#fce4ec,stroke:#ec407a,stroke-width:2px

style O fill:#ffebee,stroke:#e53935
style C fill:#e0f7fa,stroke:#26c6da

For a deep-dive into the agent's internal mechanics, tool schemas, and safety engineering, see the Architecture Design Document (DESIGN.md).


🚀 6. Project Roadmap & Milestones

This implementation satisfies all core tasks (A-D) and includes several advanced performance bonuses.

Core Implementation (Tasks A-D)

| Milestone | Implementation Files | Key Accomplishment |
| --- | --- | --- |
| Task A: Tool Layer | tools/, task_A_test.py | Implemented precise SQL, BM25, and Web Search tools. |
| Task B: Agent Logic | agent/agent_loop.py, task_B_test.py | Developed a custom ReAct loop with strategic reasoning. |
| Task C: Multi-Tool | task_C_test.py | Verified cross-domain reasoning and tool orchestration. |
| Task D: Evaluation | task_D_20eval_test.py, EVALUATION.md | Executed a 20-question suite with 100% grounding rate. |

Advanced Bonuses (A-D)

| Bonus | Files & Logic | Purpose |
| --- | --- | --- |
| A: Strategic Reasoning | agent/prompts.py | Mandatory [STRATEGIC BREAKDOWN] before any action. |
| B: Operational Telemetry | agent/bonus_features.py, tool_cost_analysis.md | Cost analysis in USD & INR. |
| C: Reflection & Recovery | agent/bonus_features.py | Self-critique turn that audits and fixes final answers. |
| D: Degradation Audit | degradation_runner.py, Degradation_Audit_Report.md | Testing model robustness and accuracy retention under 50% data loss. |

Novelty Features

  • Proactive Safety Gating: Programmatic pre-processor for injection/jailbreak detection.
  • Keyword Deduplication: Intelligence layer to prevent redundant, costly tool calls.
  • Persistent Trace Caching: Integrated JSON cache for $0.00 cost replay of common queries.
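
The deduplication and caching features listed above could look roughly like the sketch below. It is an illustration only: the class names, the keyword-set key, and the cache file layout are assumptions, and the actual logic in agent/agent_utils.py may differ.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import Dict, FrozenSet, Optional, Set, Tuple

CallKey = Tuple[str, FrozenSet[str]]


def keyword_key(tool: str, query: str) -> CallKey:
    """Reduce a tool call to (tool, set of lowercase keywords), ignoring word order."""
    return tool, frozenset(re.findall(r"[a-z0-9]+", query.lower()))


class CallDeduplicator:
    """Flags a tool call whose keyword set was already used with the same tool."""

    def __init__(self) -> None:
        self._seen: Set[CallKey] = set()

    def is_redundant(self, tool: str, query: str) -> bool:
        key = keyword_key(tool, query)
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


class TraceCache:
    """Persists question -> trace pairs in a JSON file so repeats cost nothing."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._data: Dict[str, str] = json.loads(path.read_text()) if path.exists() else {}

    def get(self, question: str) -> Optional[str]:
        return self._data.get(question.strip().lower())

    def put(self, question: str, trace: str) -> None:
        self._data[question.strip().lower()] = trace
        self.path.write_text(json.dumps(self._data, indent=2))


dedup = CallDeduplicator()
print(dedup.is_redundant("search_docs", "Inception plot themes"))     # first call
print(dedup.is_redundant("search_docs", "themes, plot: INCEPTION"))   # same keywords

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "trace_cache.json"  # hypothetical file name
    TraceCache(cache_file).put("Top 3 movies?", "cached trace text")
    print(TraceCache(cache_file).get("  top 3 movies? "))  # reloaded from disk
```

Keying on an unordered keyword set catches rephrasings that reuse the same terms, at the price of treating queries that differ only in word order as identical.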

📊 7. Performance & Evaluation Summary

The system was evaluated against a rigorous 20-question suite covering Single-Tool, Multi-Tool, Refusal, and Edge-Case categories.

| Metric | Result | Insight |
| --- | --- | --- |
| Overall Accuracy | 75% | Exceptionally high for multi-step reasoning. |
| Grounding Rate | 100% | Zero hallucinations; all facts are cited. |
| Failure Mode Resilience | Excellent | Agent gracefully falls back to Web when Local Data is missing. |
| Avg. Query Cost | $0.012 (₹1.01) | Highly efficient tiered data escalation. |

Full 20-question traces and forensic failure analysis can be found in EVALUATION.md.
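
As a rough guide to how the two percentage metrics in the table can be computed, the sketch below aggregates per-question records. The record fields `correct` and `grounded` are assumed names, not the schema of evaluation/task_D_results.json, and the sample data is synthetic: it only mirrors the reported 75% and 100% figures on the assumption that each of the 20 questions is scored pass/fail.

```python
from typing import Dict, List


def summarize(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Return accuracy and grounding rate as percentages of all questions."""
    total = len(results)
    correct = sum(1 for r in results if r["correct"])
    grounded = sum(1 for r in results if r["grounded"])
    return {
        "accuracy_pct": 100.0 * correct / total,
        "grounding_pct": 100.0 * grounded / total,
    }


# Synthetic records for illustration only: 15 of 20 marked correct, all grounded.
sample = [{"correct": i < 15, "grounded": True} for i in range(20)]
print(summarize(sample))  # {'accuracy_pct': 75.0, 'grounding_pct': 100.0}
```

Reporting the two numbers separately matters because an answer can cite its sources and still be wrong.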


💻 8. Developer Guide (Setup & Usage)

One-Step Installation

The project includes an automated setup pipeline that handles virtual environments and data preprocessing:

python setup_project.py

Running the Agent

  • Interactive REPL: python agent/agent_loop.py
  • Single Question: python agent/agent_loop.py "Compare Avatar and Inception worldwide gross."
  • Run Evaluation: python task_D_20eval_test.py

Global Configuration (.env)

  1. GITHUB_TOKEN: Generate a token to access GitHub Models.
  2. TAVILY_API_KEY: Get a free key from Tavily AI for web search capabilities.
GITHUB_TOKEN=your_token_here
TAVILY_API_KEY=your_key_here
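
A quick way to confirm both keys are set before running the agent is sketched below. It parses simple KEY=value lines by hand so it needs no extra package. The helper names are hypothetical, and the project itself may load .env differently.

```python
from typing import Dict, List, Mapping

REQUIRED_KEYS = ("GITHUB_TOKEN", "TAVILY_API_KEY")  # the two keys named in this README


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: Mapping[str, str]) -> List[str]:
    """List required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


example = "GITHUB_TOKEN=your_token_here\n# web search key not filled in yet\nTAVILY_API_KEY=\n"
print(missing_keys(parse_env_file(example)))  # ['TAVILY_API_KEY']
```

The check only detects empty values. A placeholder such as your_token_here still passes and will fail later at the first API call.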

⚠️ 9. Honest Assessment: Failure Modes

As per the technical requirements, we have identified and documented the system's "Breaking Points":

  1. Helpfulness Drift: Because of the base model's helpfulness bias, the agent sometimes offers general trends in refusal cases (such as recipe requests) before stating that it cannot fulfill the request.
  2. Ambiguity Resolution: For titles shared by more than one film, such as 'The Host' (2006 vs. 2013), the agent needs the query to specify which version is meant.
  3. Budget Truncation: To protect the 8k context window, SQL results are capped at 10 rows, so queries that need more rows receive truncated results.
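
The row cap in item 3 can be illustrated with a small sqlite3 sketch that runs a SELECT, keeps at most 10 rows, and renders a Markdown table. The `movies` table, its columns, and the function names are hypothetical, the prefix check is a deliberately naive read-only guard, and the project's query_data tool (which also uses pandas) may work differently.

```python
import sqlite3
from typing import List, Sequence

ROW_CAP = 10  # cap stated in this README


def to_markdown(headers: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render headers and rows as a Markdown pipe table."""
    lines: List[str] = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(value) for value in row) + " |" for row in rows]
    return "\n".join(lines)


def run_readonly_query(conn: sqlite3.Connection, sql: str) -> str:
    """Execute a SELECT and return at most ROW_CAP rows as a Markdown table."""
    if not sql.lstrip().lower().startswith("select"):  # naive guard, for illustration
        raise ValueError("only SELECT statements are allowed")
    cursor = conn.execute(sql)
    headers = [column[0] for column in cursor.description]
    rows = cursor.fetchmany(ROW_CAP)  # anything beyond the cap is never read
    return to_markdown(headers, rows)


# In-memory demo data: 12 synthetic rows, two more than the cap.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, gross INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(f"Movie {i}", i * 100) for i in range(1, 13)],
)
table = run_readonly_query(conn, "SELECT title, gross FROM movies ORDER BY gross DESC")
print(table)
```

With 12 matching rows the output holds a header, a separator, and 10 data rows, which is why a "top 12" request cannot be answered from a single query under this cap.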

🛡️ 10. Operational Constraints & Resilience

Hard Step Cap (8 Tool Calls)

The system enforces a strict 8-tool-call limit to ensure operational stability and prevent infinite reasoning loops. This "hard cap" is a core architectural safeguard; if the agent cannot resolve a query within this budget, it is programmed to deliver a Structured Refusal rather than providing an ungrounded guess.

Stress-Test Verification

The system's resilience can be verified using highly complex queries that exceed the reasoning budget.

Example Edge Case:

python agent/agent_loop.py "List the top 12 highest-grossing movies of all time. For each one, tell me the name of the lead actor and search the web to find out if that actor won an Oscar for that specific role."

Expected Behavior: The agent will utilize its full 8-step budget and then trigger a STRICT REFUSAL: BUDGET EXCEEDED terminal box.


📝 11. AI Development Disclosure

This project was pair-programmed with Antigravity, an experimental AI coding agent. Together, we designed the modular tool architecture, implemented the ReAct loop from first principles, and developed the forensic audit runners to keep the evaluation reproducible.

About

Prodapt - AI & Data Science Track/Agentic RAG over Mixed Data Sources (Track - C)
