YouTube Video Scraping

The video processor supports harvesting videos from the Virginia Senate's YouTube channel (@SenateofVirginia) in addition to the Granicus platform. This provides redundancy and captures videos that may only be published on YouTube.

Overview

Channel: @SenateofVirginia
Channel ID: UC9r1OpPhTY1VmL05bemQD0w
Source Identifier: senate-youtube (vs senate for Granicus)

The system uses the YouTube Data API v3 to fetch video metadata and yt-dlp to download videos.

Architecture

Components

YouTubeApiClient (src/Scraper/YouTube/YouTubeApiClient.php)

Interfaces with YouTube Data API v3
Fetches video listings and detailed metadata
Parses ISO 8601 durations (PT1H23M45S → seconds)
Tracks API quota usage
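
To illustrate the duration parsing listed above, here is a minimal shell sketch. It is not the PHP client's code: the function name `iso8601_to_seconds` is hypothetical, and it assumes durations in the `PT#H#M#S` form only (no day component, no fractional seconds, no leading zeros).

```shell
# Hypothetical helper: convert an ISO 8601 duration such as PT1H23M45S to seconds.
# Assumes the PT#H#M#S form only; anything else is rejected.
iso8601_to_seconds() {
  d=$1
  case $d in
    PT*) d=${d#PT} ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  h=0; m=0; s=0
  # Peel off each component in order; a missing component keeps its default of 0.
  case $d in *H*) h=${d%%H*}; d=${d#*H} ;; esac
  case $d in *M*) m=${d%%M*}; d=${d#*M} ;; esac
  case $d in *S*) s=${d%%S*} ;; esac
  echo $(( $h * 3600 + $m * 60 + $s ))
}

iso8601_to_seconds PT1H23M45S   # prints 5025
```

The example from the list works out as 1 × 3600 + 23 × 60 + 45 = 5025 seconds.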

SenateYouTubeScraper (src/Scraper/Senate/SenateYouTubeScraper.php)

Implements VideoSourceScraperInterface
Extracts committee names from titles
Parses dates from video metadata
Detects event types (committee/subcommittee/floor)
Returns standardized video records
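
Event type detection can be pictured as keyword matching on the title. The sketch below is an assumption for illustration only: the function name `detect_event_type`, the keywords, and the `unknown` fallback are hypothetical and may not match the scraper's actual rules.

```shell
# Hypothetical classifier: map a video title to committee / subcommittee / floor.
# Keywords are illustrative assumptions, not the scraper's actual rules.
detect_event_type() {
  title=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case $title in
    *subcommittee*)    echo subcommittee ;;  # checked first: "subcommittee" contains "committee"
    *committee*)       echo committee ;;
    *session*|*floor*) echo floor ;;
    *)                 echo unknown ;;
  esac
}

# Usage with made-up titles:
#   detect_event_type 'Finance Subcommittee on Education'   -> subcommittee
#   detect_event_type 'Regular Session'                     -> floor
```

The ordering matters: testing for "committee" before "subcommittee" would misclassify every subcommittee meeting.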

VideoDownloadProcessor (src/Fetcher/VideoDownloadProcessor.php)

Enhanced with yt-dlp support
Detects YouTube URLs by domain
Downloads best quality MP4 with audio
Automatically downloads English captions

Data Flow

YouTube API → SenateYouTubeScraper → VideoScraper → JSON snapshots → Pipeline → Database

Setup

1. Install yt-dlp

YouTube video downloads require yt-dlp:

# macOS (Homebrew)
brew install yt-dlp

# Ubuntu/Debian
pip install yt-dlp

# Or download binary from:
# https://github.com/yt-dlp/yt-dlp/releases

Verify installation:

which yt-dlp
yt-dlp --version

1b. Configure YouTube Cookies (Required)

YouTube's bot detection can block unauthenticated downloads from the server, so yt-dlp needs cookies from a logged-in browser session. Export the cookies from your local browser and upload them to the server.

Export Cookies from Your Browser

Install the "Get cookies.txt LOCALLY" extension:
- Chrome: https://chrome.google.com/webstore/detail/get-cookiestxt-locally/cclelndahbckbenkjhflpdbgdldlbecc
- Firefox: https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/
Export cookies:
- Visit https://youtube.com and ensure you're logged in
- Click the extension icon
- Click "Export" to download cookies.txt

Upload to server:

scp cookies.txt ubuntu@your-server:/home/ubuntu/youtube-cookies.txt

Verify the file exists:

ssh ubuntu@your-server
ls -lh /home/ubuntu/youtube-cookies.txt

Important: Cookies typically last for several months before expiring. The system will automatically detect when cookies expire and log a critical error.
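
Before a download run, a quick format check on the uploaded file can catch a bad export early. The sketch below is a hypothetical preflight (the function name `check_cookie_file` is an assumption); it checks the file's shape only and cannot tell whether the cookies are still valid.

```shell
# Hypothetical preflight: check that a cookies file looks like a usable
# Netscape-format export. It does not detect expired cookies.
check_cookie_file() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  # yt-dlp expects the first line to be one of these two headers.
  first_line=$(head -n 1 "$file" | tr -d '\r')
  case $first_line in
    '# Netscape HTTP Cookie File'|'# HTTP Cookie File') ;;
    *) echo "unexpected header in $file" >&2; return 1 ;;
  esac
  # A logged-in export should contain at least one youtube.com cookie line.
  grep -q 'youtube\.com' "$file" || { echo "no youtube.com cookies in $file" >&2; return 1; }
}

# Usage:
#   check_cookie_file /home/ubuntu/youtube-cookies.txt && echo "cookie file looks OK"
```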

2. Obtain YouTube API Key

Create Google Cloud Project:

Go to https://console.cloud.google.com/
Click "Select a project" → "New Project"
Name: "Virginia Legislature Video Scraper"
Click "Create"

Enable YouTube Data API v3:

In the project, go to "APIs & Services" → "Library"
Search for "YouTube Data API v3"
Click on it and click "Enable"

Create API Key:

Go to "APIs & Services" → "Credentials"
Click "Create Credentials" → "API Key"
Copy the generated key
Click "Restrict Key" (recommended)
Under "API restrictions", select "Restrict key"
Choose "YouTube Data API v3"
Save

Configure in Application:

Open includes/settings.inc.php
Add: define('YOUTUBE_API_KEY', 'YOUR_KEY_HERE');
Never commit the key to version control
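
Once the key exists, a single request to the Data API can confirm it works before running a scrape. This is a sketch under assumptions: the helper name `youtube_channel_url` is hypothetical, and it reads the key from a `YOUTUBE_API_KEY` environment variable rather than from the PHP settings file.

```shell
# Hypothetical helper: build a channels.list request for the Senate channel.
# A working key should return JSON describing the channel; a bad key returns an error object.
youtube_channel_url() {
  printf 'https://www.googleapis.com/youtube/v3/channels?part=id&id=%s&key=%s\n' \
    'UC9r1OpPhTY1VmL05bemQD0w' "$1"
}

# Usage (makes a real API call and spends a small amount of quota):
#   curl -s "$(youtube_channel_url "$YOUTUBE_API_KEY")"
```

If the key is restricted to the YouTube Data API v3 as recommended, this request should still succeed; a key restricted by IP address or referrer may reject calls from other machines.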

Cookie Maintenance

When Cookies Expire

YouTube cookies typically last several months before expiring. When they expire:

Automatic Detection:
- The system detects "Sign in to confirm you're not a bot" errors
- Logs a CRITICAL error (severity 7) with clear instructions
- Throws YouTubeCookiesExpiredException
- Stops all YouTube download attempts

You'll See in Logs:

CRITICAL: YouTube cookies have expired or are invalid.
Export fresh cookies from your browser using "Get cookies.txt LOCALLY" extension
and upload to /home/ubuntu/youtube-cookies.txt

Impact:
- YouTube videos won't download until cookies are refreshed
- House and Senate Granicus videos continue processing normally
- No data loss or corruption

Refreshing Cookies

Quick Process:

# 1. On your local machine: Export cookies using browser extension
# 2. Upload to server
scp cookies.txt ubuntu@your-server:/home/ubuntu/youtube-cookies.txt

# 3. Optional: Restart video processor if currently running
ssh ubuntu@your-server
sudo systemctl restart video-pipeline.service

Verification:

# Check file exists and is recent
ssh ubuntu@your-server
ls -lh /home/ubuntu/youtube-cookies.txt

# Should show file size around 10-50 KB
# Date should be recent (today)

Monitoring

Check logs for cookie expiration:

# View recent critical errors
grep "CRITICAL.*YouTube cookies" /var/log/video-processor.log

# Monitor for YouTubeCookiesExpiredException
grep "YouTubeCookiesExpiredException" /var/log/video-processor.log

Set up alerts (optional):

Monitor logs for severity 7 errors
Alert when YouTubeCookiesExpiredException appears
Set a reminder to refresh cookies every 2-3 months
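
One simple way to build such an alert is a scheduled check that scans the log for the two patterns used above. The function name `cookies_expired_in_log` is hypothetical, and how the alert is delivered (email, chat, pager) is left open.

```shell
# Hypothetical check for cron: succeeds (exit 0) when the log mentions expired cookies.
# Defaults to the log path used in this document.
cookies_expired_in_log() {
  log=${1:-/var/log/video-processor.log}
  grep -q -e 'YouTubeCookiesExpiredException' -e 'CRITICAL.*YouTube cookies' "$log"
}

# Usage:
#   if cookies_expired_in_log; then
#     echo "YouTube cookies need refreshing" >&2
#   fi
```

The check matches any occurrence in the file, so it will keep firing on old entries unless the log is rotated or the check is limited to recent lines.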

Usage

The YouTube scraper is automatically included when running the pipeline:

# Scrape all sources (including YouTube)
php bin/scrape.php

# Run full pipeline
php bin/pipeline.php

Output will include videos from three sources:

House (Granicus)
Senate (Granicus)
Senate YouTube

API Quota

YouTube API has daily quota limits:

Default quota: 10,000 units/day
Typical usage: ~150 units per scrape
Daily capacity: ~66 scrapes

Cost breakdown per scrape:

Search: 100 units
Video details: ~50 units (for 50 videos @ 1 unit each)
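
The capacity figure follows directly from these costs. The short calculation below uses only the numbers stated in this section; the variable names are illustrative.

```shell
# Quota figures taken from this section.
daily_quota=10000
search_cost=100
details_cost=50   # 50 videos at 1 unit each

per_scrape=$(( $search_cost + $details_cost ))
scrapes_per_day=$(( $daily_quota / $per_scrape ))   # integer division rounds down

echo "units per scrape: $per_scrape"
echo "scrapes per day:  $scrapes_per_day"
```

This gives 150 units per scrape and 66 scrapes per day. An hourly schedule would use 24 × 150 = 3,600 units, well within the default quota.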

Monitor quota usage at: https://console.cloud.google.com/

Testing

Run unit tests:

includes/vendor/bin/phpunit tests/Scraper/YouTubeApiClientTest.php
includes/vendor/bin/phpunit tests/Scraper/SenateYouTubeScraperTest.php

Test scraping:

php bin/scrape.php

Verify output:

# Check for YouTube videos
cat storage/scraper/videos-*.json | jq '.records[] | select(.source=="senate-youtube") | {title, video_url, duration_seconds}'

Test video download:

php bin/fetch_videos.php --limit=1

Files Created

Source Code (3 files)

src/Scraper/YouTube/YouTubeApiClient.php - YouTube Data API v3 client
src/Scraper/Senate/SenateYouTubeScraper.php - YouTube scraper
src/Fetcher/VideoDownloadProcessor.php - Enhanced with yt-dlp support

Tests (4 files)

tests/Scraper/YouTubeApiClientTest.php - API client tests
tests/Scraper/SenateYouTubeScraperTest.php - Scraper tests
tests/fixtures/youtube-live-videos.json - API response fixture
tests/fixtures/youtube-video-details.json - Video details fixture

Configuration

includes/settings-default.inc.php - Added YOUTUBE_API_KEY constant
bin/scrape.php - Registered SenateYouTubeScraper
bin/pipeline.php - Added YouTube scraper to pipeline

Features

✅ YouTube Data API v3 integration
✅ Video fetching from channel
✅ Video details retrieval (title, description, duration, thumbnails)
✅ ISO 8601 duration parsing
✅ Committee name extraction from titles
✅ Date extraction from video metadata
✅ Event type detection (committee/subcommittee/floor)
✅ yt-dlp video download with MP4 format selection
✅ Automatic caption download (WebVTT format)
✅ API quota tracking and logging
✅ Error handling for quota limits and network failures
✅ Dual-source operation (Granicus + YouTube)
✅ Comprehensive test coverage (10 tests, 57 assertions)

Troubleshooting

Bot detection error ("Sign in to confirm you're not a bot")

Symptom:

ERROR: Sign in to confirm you're not a bot. Use --cookies-from-browser or --cookies

Cause: YouTube cookies have expired or are missing.

What Happens:

The system detects this error automatically
Logs a CRITICAL error at severity level 7
Throws YouTubeCookiesExpiredException
Halts further YouTube download attempts
Granicus videos continue processing normally

Solution - Refresh Cookies:

Export fresh cookies from your local browser:
- Install "Get cookies.txt LOCALLY" extension (see Setup section above)
- Visit https://youtube.com while logged in
- Click extension icon → Export
- Save as cookies.txt

Upload to server:

scp cookies.txt ubuntu@your-server:/home/ubuntu/youtube-cookies.txt

Restart the video processor:

ssh ubuntu@your-server
sudo systemctl restart video-pipeline.service

Prevention:

Cookies typically last several months
Check logs regularly for severity 7 errors
Consider setting up monitoring alerts for YouTubeCookiesExpiredException

No YouTube videos found

Check API key configuration:

php -r "require 'includes/settings.inc.php'; echo defined('YOUTUBE_API_KEY') ? YOUTUBE_API_KEY : 'NOT SET', PHP_EOL;"

yt-dlp not found error

Verify yt-dlp is installed:

which yt-dlp

If it is not installed, follow the installation steps in the Setup section above.

API quota exceeded

Monitor quota at: https://console.cloud.google.com/

If consistently hitting limits, consider:

Reducing scrape frequency
Requesting quota increase from Google

Cookies file not found

Symptom:

YouTube cookies file not found at: /home/ubuntu/youtube-cookies.txt

Solution: Export and upload cookies file (see Setup section above)

Disabling YouTube Scraper

If issues arise, temporarily disable the YouTube scraper:

In bin/scrape.php:

// Comment out YouTube scraper
// $senateYouTube = new SenateYouTubeScraper($http, YOUTUBE_API_KEY ?? '');

// Update VideoScraper to exclude YouTube
$scraper = new VideoScraper([$house, $senateGranicus], $writer, $logger);

In bin/pipeline.php:

// Comment out YouTube scraper
// $senateYouTubeScraper = new SenateYouTubeScraper(...);

// Remove from array_merge
$newRecords = array_merge(
    $houseScraper->scrape(),
    $senateScraper->scrape()
    // $senateYouTubeScraper->scrape()
);

The Granicus scraper continues working normally with no data loss.

Implementation Status

Completed: January 16, 2026
Tests: 10 tests, 57 assertions - all passing
Status: Production ready
