Made significant improvements to the Logseq content processor for videos, X/Twitter posts, and PDFs.
Problem: Previously, all metadata was inline in the main block.
Solution: Changed to proper Logseq hierarchy:
- Main block: Contains the `{{video URL}}`, `{{tweet URL}}`, or `{{pdf URL}}` wrapper with topic properties
- Sub-blocks: Contain metadata (title, author, duration, etc.) as indented child blocks
Example Output:
topic-1:: machine-learning
topic-2:: python
topic-3:: tutorial
- {{video https://youtube.com/watch?v=...}}
**Learn Python - Full Course for Beginners**
By: freeCodeCamp.org
Duration: 4:26:52
Problem: Topic extraction was basic and produced generic/useless tags.
Solution: Implemented advanced NLP-based topic extraction with multiple methods:
- Multi-word Phrase Extraction
  - Extracts bigrams (e.g., "machine-learning", "data-science")
  - Extracts trigrams (e.g., "deep-learning-neural")
  - Recognizes domain-specific terms (machine learning, data science, etc.)
- TF-IDF Scoring
  - Uses term frequency with intelligent scoring
  - Identifies important words based on frequency and context
  - Filters out overly common terms
- Title-specific Extraction
  - Gives extra weight to topics found in titles
  - Extracts capitalized words (proper nouns)
  - Identifies quoted or specially formatted terms
- Advanced Ranking Algorithm
  - Scores topics based on:
    - Frequency in content (2x weight)
    - Presence in title (10x bonus)
    - Category matching (5x bonus)
    - Multi-word specificity (2x per word)
    - Domain term recognition (8x bonus)
    - Technical term patterns (+2 bonus)
  - Filters duplicates and low-scoring topics
  - Avoids near-duplicates (e.g., "learning" vs "machine-learning")
Before: video, content, watch, youtube, learn
After: machine-learning, python-tutorial, data-science, deep-learning, neural-networks
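The extraction and ranking steps listed above can be sketched roughly as follows. This is a minimal illustration, not the processor's actual code: the function names, the stop-word list, and the domain-term set are hypothetical, the "10x", "8x", and "2x per word" figures are treated as additive bonuses (an assumption), and category matching and technical-term patterns are left out.

```python
import re
from collections import Counter

# Hypothetical lists for illustration; the real processor's lists are not shown.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for",
              "with", "is", "on", "this", "video", "watch"}
DOMAIN_TERMS = {"machine-learning", "data-science", "deep-learning"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return hyphen-joined n-grams, e.g. ['deep', 'learning'] -> ['deep-learning']."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score_topics(title, body, max_topics=3):
    """Rank unigram, bigram, and trigram candidates from the title and body."""
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)

    counts = Counter()        # how often each candidate occurs (title + body)
    title_phrases = set()     # candidates that appear in the title
    for n in (1, 2, 3):
        title_grams = ngrams(title_tokens, n)
        title_phrases.update(title_grams)
        counts.update(title_grams)
        counts.update(ngrams(body_tokens, n))

    scores = {}
    for topic, freq in counts.items():
        word_count = topic.count("-") + 1
        score = 2 * freq              # frequency weight
        score += 2 * word_count       # multi-word specificity (assumed additive)
        if topic in title_phrases:
            score += 10               # title bonus (assumed additive)
        if topic in DOMAIN_TERMS:
            score += 8                # domain-term bonus (assumed additive)
        scores[topic] = score

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_topics]
```

Because stop words are removed before the n-grams are built, a phrase can span a dropped word; a fuller implementation might build n-grams per sentence instead.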
Problem: Transcripts were failing silently for all videos.
Solution: Improved error handling and fallback logic:
- Try English transcripts first
- Fall back to any available language
- Try multiple transcript sources
- Better error logging to identify actual issues
- Handle `NoTranscriptFound` and other specific errors gracefully
New Flow:
- Try English transcript directly
- If not found, try any available transcript
- If still failing, list all transcripts and pick best available
- Log specific error types for debugging
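The flow above amounts to trying a sequence of strategies in order and logging why each one failed. The sketch below shows that pattern only; it does not call any real transcript library. The fetcher callables and the local `NoTranscriptFound` class are hypothetical stand-ins for whatever the processor actually uses.

```python
import logging

logger = logging.getLogger("transcripts")


class NoTranscriptFound(Exception):
    """Local stand-in for the library error of the same name."""


def fetch_with_fallback(video_id, fetchers):
    """Try each (label, fetcher) pair in order and return the first transcript.

    Each fetcher is a callable that takes a video ID and either returns the
    transcript text or raises. Returns None if every strategy fails.
    """
    for label, fetch in fetchers:
        try:
            transcript = fetch(video_id)
        except NoTranscriptFound:
            # Expected case: this strategy simply has nothing for the video.
            logger.info("%s: no transcript for %s", label, video_id)
            continue
        except Exception as exc:
            # Log the concrete error type instead of failing silently.
            logger.warning("%s failed for %s: %s: %s",
                           label, video_id, type(exc).__name__, exc)
            continue
        if transcript:
            return transcript
    return None
```

A caller would pass the strategies in priority order, for example English first, then any language, then the best entry from the full transcript list.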
Problem: Re-running the processor would duplicate work and create double-wrapped URLs.
Solution: Added intelligent skip logic:
- Check if block already has topic properties (e.g., `topic-1`, `topic-2`)
- Check if URL is already wrapped in `{{...}}`
- Check for any custom wrapper syntax
- Log skipped items for transparency
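The checks above could look something like the following. The function name, the regular expressions, and the reason strings are assumptions made for illustration; the processor's real checks may be stricter or cover more wrapper syntaxes.

```python
import re

# A property line such as "topic-1:: python", optionally indented.
TOPIC_PROPERTY = re.compile(r"^\s*topic-\d+::", re.MULTILINE)
# Any {{...}} wrapper, e.g. {{video https://...}}.
WRAPPER = re.compile(r"\{\{[^{}]*\}\}")


def skip_reason(block_text, url):
    """Return a short reason if the block looks already processed, else None."""
    if TOPIC_PROPERTY.search(block_text):
        return "has topic properties"
    for match in WRAPPER.finditer(block_text):
        if url in match.group(0):
            return "already wrapped"
    return None
```

Returning the reason rather than a bare boolean makes it easy to log why an item was skipped.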
Enhanced Features:
- Platform-specific extraction (hashtags for Twitter, academic terms for PDFs)
- Multi-language support preparation
- Better handling of long content (truncate tweets to 200 chars)
- Improved metadata display (duration, page count, file size)
Run the test script to see topic extraction improvements:
python test_improvements.py
Run the comprehensive processor on your Logseq graph:
python scripts/comprehensive_processor_cli.py /path/to/logseq/graph --max-topics 3 --log-level INFO
- `--max-topics N`: Maximum number of topics per item (default: 3)
- `--dry-run`: Preview changes without modifying files
- `--no-videos`: Skip video processing
- `--no-twitter`: Skip X/Twitter processing
- `--no-pdfs`: Skip PDF processing
- `--youtube-api-key KEY`: Use YouTube API for better subtitle extraction
- `--twitter-bearer-token TOKEN`: Use Twitter API for enhanced tweet data
- `--log-level LEVEL`: Set logging level (DEBUG, INFO, WARNING, ERROR)
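For readers who want to see how options like these are typically declared, here is a sketch using `argparse`. It mirrors the flags listed above but is not the script's actual parser: the positional name `graph_path`, the help strings, and the `INFO` default for `--log-level` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(prog="comprehensive_processor_cli.py")
    # Positional argument name is hypothetical.
    parser.add_argument("graph_path", help="path to the Logseq graph")
    parser.add_argument("--max-topics", type=int, default=3,
                        help="maximum number of topics per item")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview changes without modifying files")
    parser.add_argument("--no-videos", action="store_true")
    parser.add_argument("--no-twitter", action="store_true")
    parser.add_argument("--no-pdfs", action="store_true")
    parser.add_argument("--youtube-api-key", metavar="KEY")
    parser.add_argument("--twitter-bearer-token", metavar="TOKEN")
    # The default level here is an assumption.
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser
```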
- Better Organization: Hierarchical structure matches Logseq's design
- Useful Topics: Multi-word, specific topics instead of generic single words
- Efficient: Skips already-processed content
- Reliable: Better transcript extraction with fallbacks
- Flexible: Works with or without API keys
- Smart: Context-aware topic extraction based on content type
- Uses TF-IDF-like scoring for word importance
- Recognizes 20+ domain-specific term patterns
- Weights title content 2x higher than body
- Scores topics on 7 different criteria
- Filters out near-duplicates and low-scoring results
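One simple way to filter near-duplicates such as "learning" vs "machine-learning" is to drop any topic whose words are a subset (or superset) of a topic already kept. The sketch below assumes the input is already ranked best-first; the function name and the subset rule are illustrative choices, not necessarily what the processor does.

```python
def filter_near_duplicates(ranked_topics, max_topics=3):
    """Keep topics in rank order, skipping ones that overlap a kept topic.

    Two hyphenated topics are treated as near-duplicates when the word set
    of one is contained in the word set of the other.
    """
    kept = []
    for topic in ranked_topics:
        words = set(topic.split("-"))
        overlaps = any(
            words <= set(other.split("-")) or set(other.split("-")) <= words
            for other in kept
        )
        if overlaps:
            continue
        kept.append(topic)
        if len(kept) == max_topics:
            break
    return kept
```

Because higher-ranked topics are visited first, the more specific phrase wins whenever it outscores its single-word component.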
- Main block: URL wrapper + properties
- Child blocks: Metadata (indented with 2 spaces)
- Properties format: `topic-1:: value`
- Compatible with Logseq's native syntax
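The block layout described above can be produced with a small formatter like the one below. The function name and parameters are hypothetical, and placing the property lines directly under the main block's first line is an assumption based on Logseq's usual Markdown layout.

```python
def format_block(wrapper, url, topics, metadata):
    """Return Logseq Markdown for one item.

    wrapper  -- wrapper keyword such as "video", "tweet", or "pdf"
    url      -- the URL to wrap
    topics   -- topic strings, written as topic-1, topic-2, ...
    metadata -- strings that each become an indented child block
    """
    lines = ["- {{" + wrapper + " " + url + "}}"]
    for index, topic in enumerate(topics, start=1):
        # Properties belong to the main block, so they align with its content.
        lines.append(f"  topic-{index}:: {topic}")
    for item in metadata:
        # Child blocks are indented with 2 spaces.
        lines.append(f"  - {item}")
    return "\n".join(lines)
```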
- Graceful degradation when APIs are unavailable
- Specific error messages for debugging
- Continues processing even if individual items fail
- Comprehensive error statistics in final report
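The "continue on failure, then report" behavior can be sketched with a per-item `try`/`except` and a counter. The names `process_all` and `process_item` are hypothetical, and the real final report likely contains more detail than these counts.

```python
from collections import Counter


def process_all(items, process_item):
    """Process every item, recording failures instead of stopping.

    Returns a Counter with "processed", "failed", and one
    "error:<ExceptionName>" entry per error type encountered.
    """
    stats = Counter()
    for item in items:
        try:
            process_item(item)
            stats["processed"] += 1
        except Exception as exc:
            # One bad item must not abort the whole run.
            stats["failed"] += 1
            stats[f"error:{type(exc).__name__}"] += 1
    return stats
```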