Skip to content

Latest commit

 

History

History
146 lines (117 loc) · 5.18 KB

File metadata and controls

146 lines (117 loc) · 5.18 KB

Comprehensive Processor Improvements

Summary

Made significant improvements to the Logseq content processor for videos, X/Twitter posts, and PDFs.

Changes Made

1. Block Structure Fix ✅

Problem: Previously, all metadata was inline in the main block.

Solution: Changed to proper Logseq hierarchy:

  • Main block: Contains the {{video URL}}, {{tweet URL}}, or {{pdf URL}} wrapper with topic properties
  • Sub-blocks: Contain metadata (title, author, duration, etc.) as indented child blocks

Example Output:

topic-1:: machine-learning
topic-2:: python
topic-3:: tutorial
- {{video https://youtube.com/watch?v=...}}
  **Learn Python - Full Course for Beginners**
  By: freeCodeCamp.org
  Duration: 4:26:52

2. Improved Topic Extraction 🚀

Problem: Topic extraction was basic and produced generic/useless tags.

Solution: Implemented advanced NLP-based topic extraction with multiple methods:

New Extraction Methods:

  1. Multi-word Phrase Extraction

    • Extracts bigrams (e.g., "machine-learning", "data-science")
    • Extracts trigrams (e.g., "deep-learning-neural")
    • Recognizes domain-specific terms (machine learning, data science, etc.)
  2. TF-IDF Scoring

    • Uses term frequency with intelligent scoring
    • Identifies important words based on frequency and context
    • Filters out overly common terms
  3. Title-specific Extraction

    • Gives extra weight to topics found in titles
    • Extracts capitalized words (proper nouns)
    • Identifies quoted or specially formatted terms
  4. Advanced Ranking Algorithm

    • Scores topics based on:
      • Frequency in content (2x weight)
      • Presence in title (10x bonus)
      • Category matching (5x bonus)
      • Multi-word specificity (2x per word)
      • Domain term recognition (8x bonus)
      • Technical term patterns (+2 bonus)
    • Filters duplicates and low-scoring topics
    • Avoids near-duplicates (e.g., "learning" vs "machine-learning")

Examples of Improved Topics:

Before: video, content, watch, youtube, learn After: machine-learning, python-tutorial, data-science, deep-learning, neural-networks

3. YouTube Transcript Extraction Fix 🔧

Problem: Transcripts were failing silently for all videos.

Solution: Improved error handling and fallback logic:

  • Try English transcripts first
  • Fall back to any available language
  • Try multiple transcript sources
  • Better error logging to identify actual issues
  • Handle NoTranscriptFound and other specific errors gracefully

New Flow:

  1. Try English transcript directly
  2. If not found, try any available transcript
  3. If still failing, list all transcripts and pick best available
  4. Log specific error types for debugging

4. Skip Already Processed Content ⏭️

Problem: Re-running processor would duplicate work and create double-wrapped URLs.

Solution: Added intelligent skip logic:

  • Check if block already has topic properties (e.g., topic-1, topic-2)
  • Check if URL is already wrapped in {{...}}
  • Check for any custom wrapper syntax
  • Log skipped items for transparency

5. Better Content Analysis 📊

Enhanced Features:

  • Platform-specific extraction (hashtags for Twitter, academic terms for PDFs)
  • Multi-language support preparation
  • Better handling of long content (truncate tweets to 200 chars)
  • Improved metadata display (duration, page count, file size)

Testing

Run the test script to see topic extraction improvements:

python test_improvements.py

Usage

Run the comprehensive processor on your Logseq graph:

python scripts/comprehensive_processor_cli.py /path/to/logseq/graph --max-topics 3 --log-level INFO

Options:

  • --max-topics N: Maximum number of topics per item (default: 3)
  • --dry-run: Preview changes without modifying files
  • --no-videos: Skip video processing
  • --no-twitter: Skip X/Twitter processing
  • --no-pdfs: Skip PDF processing
  • --youtube-api-key KEY: Use YouTube API for better subtitle extraction
  • --twitter-bearer-token TOKEN: Use Twitter API for enhanced tweet data
  • --log-level LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)

Benefits

  1. Better Organization: Hierarchical structure matches Logseq's design
  2. Useful Topics: Multi-word, specific topics instead of generic single words
  3. Efficient: Skips already-processed content
  4. Reliable: Better transcript extraction with fallbacks
  5. Flexible: Works with or without API keys
  6. Smart: Context-aware topic extraction based on content type

Technical Details

Topic Extraction Algorithm:

  • Uses TF-IDF-like scoring for word importance
  • Recognizes 20+ domain-specific term patterns
  • Weights title content 2x higher than body
  • Scores topics on 7 different criteria
  • Filters out near-duplicates and low-scoring results

Block Formatting:

  • Main block: URL wrapper + properties
  • Child blocks: Metadata (indented with 2 spaces)
  • Properties format: topic-1:: value
  • Compatible with Logseq's native syntax

Error Handling:

  • Graceful degradation when APIs unavailable
  • Specific error messages for debugging
  • Continues processing even if individual items fail
  • Comprehensive error statistics in final report