Comprehensive Processor Improvements

Summary

Made significant improvements to the Logseq content processor for videos, X/Twitter posts, and PDFs.

Changes Made

1. Block Structure Fix ✅

Problem: Previously, all metadata was inline in the main block.

Solution: Changed to proper Logseq hierarchy:

Main block: Contains the {{video URL}}, {{tweet URL}}, or {{pdf URL}} wrapper with topic properties
Sub-blocks: Contain metadata (title, author, duration, etc.) as indented child blocks

Example Output:

topic-1:: machine-learning
topic-2:: python
topic-3:: tutorial
- {{video https://youtube.com/watch?v=...}}
  **Learn Python - Full Course for Beginners**
  By: freeCodeCamp.org
  Duration: 4:26:52

2. Improved Topic Extraction 🚀

Problem: Topic extraction was basic and produced generic/useless tags.

Solution: Implemented advanced NLP-based topic extraction with multiple methods:

New Extraction Methods:

Multi-word Phrase Extraction
- Extracts bigrams (e.g., "machine-learning", "data-science")
- Extracts trigrams (e.g., "deep-learning-neural")
- Recognizes domain-specific terms (machine learning, data science, etc.)
TF-IDF Scoring
- Uses term frequency with intelligent scoring
- Identifies important words based on frequency and context
- Filters out overly common terms
Title-specific Extraction
- Gives extra weight to topics found in titles
- Extracts capitalized words (proper nouns)
- Identifies quoted or specially formatted terms
Advanced Ranking Algorithm
- Scores topics based on:
  - Frequency in content (2x weight)
  - Presence in title (10x bonus)
  - Category matching (5x bonus)
  - Multi-word specificity (2x per word)
  - Domain term recognition (8x bonus)
  - Technical term patterns (+2 bonus)
- Filters duplicates and low-scoring topics
- Avoids near-duplicates (e.g., "learning" vs "machine-learning")

Examples of Improved Topics:

Before: video, content, watch, youtube, learn After: machine-learning, python-tutorial, data-science, deep-learning, neural-networks

3. YouTube Transcript Extraction Fix 🔧

Problem: Transcripts were failing silently for all videos.

Solution: Improved error handling and fallback logic:

Try English transcripts first
Fall back to any available language
Try multiple transcript sources
Better error logging to identify actual issues
Handle NoTranscriptFound and other specific errors gracefully

New Flow:

Try English transcript directly
If not found, try any available transcript
If still failing, list all transcripts and pick best available
Log specific error types for debugging

4. Skip Already Processed Content ⏭️

Problem: Re-running processor would duplicate work and create double-wrapped URLs.

Solution: Added intelligent skip logic:

Check if block already has topic properties (e.g., topic-1, topic-2)
Check if URL is already wrapped in {{...}}
Check for any custom wrapper syntax
Log skipped items for transparency

5. Better Content Analysis 📊

Enhanced Features:

Platform-specific extraction (hashtags for Twitter, academic terms for PDFs)
Multi-language support preparation
Better handling of long content (truncate tweets to 200 chars)
Improved metadata display (duration, page count, file size)

Testing

Run the test script to see topic extraction improvements:

python test_improvements.py

Usage

Run the comprehensive processor on your Logseq graph:

python scripts/comprehensive_processor_cli.py /path/to/logseq/graph --max-topics 3 --log-level INFO

Options:

--max-topics N: Maximum number of topics per item (default: 3)
--dry-run: Preview changes without modifying files
--no-videos: Skip video processing
--no-twitter: Skip X/Twitter processing
--no-pdfs: Skip PDF processing
--youtube-api-key KEY: Use YouTube API for better subtitle extraction
--twitter-bearer-token TOKEN: Use Twitter API for enhanced tweet data
--log-level LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)

Benefits

Better Organization: Hierarchical structure matches Logseq's design
Useful Topics: Multi-word, specific topics instead of generic single words
Efficient: Skips already-processed content
Reliable: Better transcript extraction with fallbacks
Flexible: Works with or without API keys
Smart: Context-aware topic extraction based on content type

Technical Details

Topic Extraction Algorithm:

Uses TF-IDF-like scoring for word importance
Recognizes 20+ domain-specific term patterns
Weights title content 2x higher than body
Scores topics on 7 different criteria
Filters out near-duplicates and low-scoring results

Block Formatting:

Main block: URL wrapper + properties
Child blocks: Metadata (indented with 2 spaces)
Properties format: topic-1:: value
Compatible with Logseq's native syntax

Error Handling:

Graceful degradation when APIs unavailable
Specific error messages for debugging
Continues processing even if individual items fail
Comprehensive error statistics in final report

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comprehensive Processor Improvements

Summary

Changes Made

1. Block Structure Fix ✅

2. Improved Topic Extraction 🚀

New Extraction Methods:

Examples of Improved Topics:

3. YouTube Transcript Extraction Fix 🔧

4. Skip Already Processed Content ⏭️

5. Better Content Analysis 📊

Testing

Usage

Options:

Benefits

Technical Details

Topic Extraction Algorithm:

Block Formatting:

Error Handling:

FilesExpand file tree

IMPROVEMENTS.md

Latest commit

History

IMPROVEMENTS.md

File metadata and controls

Comprehensive Processor Improvements

Summary

Changes Made

1. Block Structure Fix ✅

2. Improved Topic Extraction 🚀

New Extraction Methods:

Examples of Improved Topics:

3. YouTube Transcript Extraction Fix 🔧

4. Skip Already Processed Content ⏭️

5. Better Content Analysis 📊

Testing

Usage

Options:

Benefits

Technical Details

Topic Extraction Algorithm:

Block Formatting:

Error Handling: