Drop in a video. Ask anything about it.
VideoAnalyzer runs a full multi-modal analysis pipeline on any video file — object detection, transcription, scene segmentation, audio classification, OCR — then spins up an AI assistant that knows exactly what happened, when, and why.
Upload a video → everything below runs automatically in the background:
| Step | What happens |
|---|---|
| Metadata probe | Duration, resolution, FPS via ffprobe |
| Whisper transcription | Full VTT transcript with timestamps (faster-whisper) |
| YOLO object detection | Frame-by-frame detection at 1 FPS (YOLOv8) |
| Scene segmentation | Cut detection + per-scene brightness, motion, color palette |
| Audio classification | Speech / silence / music+noise segmentation |
| OCR | On-screen text extracted from scene keyframes (EasyOCR) |
| Context assembly | Everything merged into a structured document for the AI |
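The final "context assembly" step can be pictured as a plain merge of each analyzer's output into one document. The sketch below is illustrative only — the field names and shapes are assumptions, not the project's actual schema:

```python
# Hypothetical sketch of the "context assembly" step: merge each analyzer's
# output into one structured document the assistant can ground answers in.
# All field names here are illustrative, not the project's real schema.

def assemble_context(metadata, transcript, detections, scenes, audio_segments, ocr):
    """Merge per-modality results into a single analysis document."""
    return {
        "metadata": metadata,          # duration, resolution, fps (ffprobe)
        "transcript": transcript,      # VTT cues with timestamps (Whisper)
        "objects": detections,         # per-second YOLO detections
        "scenes": [
            {**scene, "ocr_text": ocr.get(scene["id"], [])}
            for scene in scenes        # attach keyframe OCR to each scene
        ],
        "audio": audio_segments,       # speech / silence / music spans
    }

doc = assemble_context(
    metadata={"duration": 120.0, "resolution": "1920x1080", "fps": 30},
    transcript=[{"start": 0.0, "end": 2.5, "text": "Welcome back."}],
    detections={0: ["person"], 1: ["person", "car"]},
    scenes=[{"id": "s1", "start": 0.0, "end": 12.4, "brightness": 0.6}],
    audio_segments=[{"start": 0.0, "end": 12.4, "label": "speech"}],
    ocr={"s1": ["PRODUCT LAUNCH"]},
)
```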
Then you chat with a Backboard AI assistant that can answer questions like:
- "What objects appear between 1:30 and 2:00?"
- "Find every moment someone says 'product launch'"
- "Describe what's happening at 0:45 — visually, audibly, and any text on screen"
- "When does the car first appear and when does it leave?"
- Python 3.10+ · Flask · uv
- YOLOv8 (Ultralytics) — object detection
- faster-whisper — speech transcription
- EasyOCR — on-screen text recognition
- ffmpeg — frame extraction + audio processing
- Backboard — AI assistant with tool-call loop, thread memory, document storage
```shell
brew install ffmpeg          # macOS
# or: sudo apt install ffmpeg
```

You'll also need a Backboard API key.
```shell
git clone https://github.com/your-username/video-analyzer
cd video-analyzer
cp .env.example .env
# → add your BACKBOARD_API_KEY to .env
./start.sh
```

Open http://localhost:5050 and drop in a video.

`start.sh` syncs dependencies via `uv`, clears temp files, and starts the server. `Ctrl+C` to stop.
```shell
# .env
BACKBOARD_API_KEY=your_api_key_here
WHISPER_MODEL=base   # tiny | base | small | medium | large
FLASK_PORT=5050
```

Whisper model size trades speed for accuracy: `base` is a good starting point; use `small` or `medium` for better results on noisy audio.
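A minimal sketch of how the app might read this configuration at startup — the variable names match the `.env` above, but the defaults shown are illustrative:

```python
# Sketch of config loading; variable names match .env, defaults are assumed.
import os

BACKBOARD_API_KEY = os.getenv("BACKBOARD_API_KEY", "")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "base")   # tiny|base|small|medium|large
FLASK_PORT = int(os.getenv("FLASK_PORT", "5050"))
```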
All logic lives in the API — the UI is thin.
| Endpoint | Description |
|---|---|
| `POST /api/videos` | Upload a video (multipart/form-data, field: `file`) |
| `GET /api/videos` | List all videos + status |
| `GET /api/videos/{id}` | Full analysis JSON |
| `GET /api/videos/{id}/video` | Stream source file |
| `GET /api/videos/{id}/transcript.vtt` | VTT transcript |
Processing is async. Poll `GET /api/videos/{id}` and watch `status`:

`uploading` → `processing` → `ready` (or `error`)
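The polling loop can be written with just the standard library. This is a sketch under assumptions: the endpoint shape follows the API table above, while the base URL and poll cadence are placeholders:

```python
# Client-side status polling sketch (stdlib only). Endpoint paths follow the
# API table; BASE URL, interval, and timeout are illustrative values.
import json
import time
import urllib.request

BASE = "http://localhost:5050"
TERMINAL = {"ready", "error"}

def is_terminal(status: str) -> bool:
    """A video stops changing once it reaches ready or error."""
    return status in TERMINAL

def wait_until_ready(video_id: str, interval: float = 2.0, timeout: float = 600.0) -> dict:
    """Poll GET /api/videos/{id} until processing finishes or fails."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{BASE}/api/videos/{video_id}") as resp:
            video = json.load(resp)
        if is_terminal(video["status"]):
            return video
        time.sleep(interval)
    raise TimeoutError(f"video {video_id} still processing after {timeout}s")
```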
| Endpoint | Description |
|---|---|
| `POST /api/chat` | Send a message (returns `task_id`) |
| `GET /api/chat/task/{task_id}` | Poll for response |
Chat uses a task-polling pattern — post a message, get a `task_id`, poll until `status: done`.
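The same pattern from the client side, sketched with the standard library. The URL paths and the `status: done` field come from this README; the rest of the response shape is an assumption:

```python
# Chat task-polling sketch (stdlib only). Paths and the status/done field
# follow the README; the exact response shape is assumed.
import json
import time
import urllib.request

BASE = "http://localhost:5050"

def chat_payload(thread_id: str, video_id: str, content: str) -> dict:
    """Build the chat request body described in the README."""
    return {"thread_id": thread_id, "content": content, "video_id": video_id}

def post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def ask(thread_id: str, video_id: str, content: str, interval: float = 1.0) -> dict:
    """POST /api/chat, then poll /api/chat/task/{task_id} until done."""
    task = post_json(f"{BASE}/api/chat", chat_payload(thread_id, video_id, content))
    while True:
        with urllib.request.urlopen(f"{BASE}/api/chat/task/{task['task_id']}") as resp:
            result = json.load(resp)
        if result["status"] == "done":
            return result
        time.sleep(interval)
```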
Chat request body:
```json
{
  "thread_id": "...",
  "content": "What objects appear in the first minute?",
  "video_id": "..."
}
```

The AI has six tools it can call mid-conversation:
| Tool | What it returns |
|---|---|
| `get_transcript` | Full or time-filtered VTT transcript |
| `search_transcript` | Timestamps matching a word or phrase |
| `get_objects_at_time` | Objects detected at a specific timestamp |
| `get_object_timeline` | Full appearance timeline for a named object |
| `get_scene_info` | Scene detail: colors, motion, audio, OCR text |
| `get_audio_segments` | Speech / silence / music timeline |
```
src/
├── app.py                  Flask app factory
├── models.py               Pydantic models (Video, Scene, ObjectSpan, ...)
├── backboard_client.py     Backboard SDK client
├── api/
│   ├── videos.py           Upload, list, serve endpoints
│   └── chat.py             Chat + task-polling + tool-call loop
├── assistant/
│   ├── setup.py            Assistant + system prompt
│   └── tools.py            Tool definitions (JSON schema)
└── services/
    ├── pipeline.py         Orchestrates all analysis steps
    ├── detector.py         YOLO frame detection
    ├── transcriber.py      Whisper transcription
    ├── audio.py            Audio segmentation
    ├── visual.py           Scene analysis + color palette
    ├── ocr.py              EasyOCR on keyframes
    ├── video_service.py    Backboard storage + local cache
    └── tool_handler.py     Dispatches assistant tool calls
templates/
├── index.html              Upload page
└── workspace.html          Video + chat workspace
models/
└── yolo26n.pt              YOLOv8 weights
```
`.mp4` · `.mov` · `.webm` · `.avi` · `.mkv` — up to 500 MB
MIT