AaronWard/scrapify

scrapify

Currently, this tool pulls mp3 files from Spotify and YouTube playlists. I plan on adding support for other platforms in the future.

  • Spotify playlists
  • YouTube playlists/mixes
  • Twitter / X posts
  • Reddit posts
  • Wayback Machine
  • Discord channels
  • GitHub

Usage:

pip install -e .

Create a .env file with the following:

YOUTUBE_API_KEY='***'
OPENAI_API_KEY='***'
GITHUB_TOKEN="***"
TWITTER_USERNAME='***'
TWITTER_PASSWORD='***'
REDDIT_APP='***'
REDDIT_SECRET='***'
REDDIT_USERNAME='***'

SCRAPIFY_BASE="~/Documents/data/"
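
Before running the tool, it can help to confirm that the .env file contains the keys listed above. The following is a minimal sketch using only the Python standard library. The helper names `parse_env` and `missing_keys` are hypothetical and not part of scrapify, which may load its configuration differently (for example through a dotenv library), and not every key is necessarily required for every platform.

```python
from pathlib import Path

# Keys copied from the .env listing above. Which ones are actually
# required depends on the platform being scraped (an assumption).
EXPECTED_KEYS = [
    "YOUTUBE_API_KEY", "OPENAI_API_KEY", "GITHUB_TOKEN",
    "TWITTER_USERNAME", "TWITTER_PASSWORD",
    "REDDIT_APP", "REDDIT_SECRET", "REDDIT_USERNAME",
    "SCRAPIFY_BASE",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, tolerating spaces around '=' and matching quotes."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching single or double quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    print("Missing keys:", missing_keys(parse_env(text)))
```

Running the script from the directory that holds the .env file prints the keys that still need a value.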

YouTube

# For YouTube playlists/mixes
scrapify "https://youtube.com/playlist?list=XXXXXXXXXXXX"

# For updating a local playlist (if the playlist is not already downloaded)
scrapify "https://youtube.com/playlist?list=XXXXXXXXXXXX" --dir "/path/<Local Directory>"

# Download video instead
scrapify "https://youtube.com/watch?v=XXXXXXXXXXXX" --video

# Download transcripts from a YouTube channel
scrapify "https://youtube.com/@channelname" --transcripts

# Download transcripts and summarize them
scrapify "https://youtube.com/@channelname" --transcripts --summarize
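
The three YouTube URL shapes used above (playlist, single video, channel handle) can be told apart with the standard library alone. The sketch below is only an illustration of that idea. The function `classify_youtube_url` is hypothetical, and scrapify's own URL handling may work differently, for example for mixes, which this sketch does not treat specially.

```python
from urllib.parse import parse_qs, urlparse


def classify_youtube_url(url: str) -> tuple[str, str]:
    """Return (kind, identifier) for the URL shapes shown in the examples above."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if parsed.path == "/playlist" and "list" in query:
        return ("playlist", query["list"][0])
    if parsed.path == "/watch" and "v" in query:
        return ("video", query["v"][0])
    if parsed.path.startswith("/@"):
        # Channel handle, e.g. /@channelname
        return ("channel", parsed.path[2:].strip("/"))
    raise ValueError(f"Unrecognised YouTube URL: {url}")


if __name__ == "__main__":
    for example in (
        "https://youtube.com/playlist?list=XXXXXXXXXXXX",
        "https://youtube.com/watch?v=XXXXXXXXXXXX",
        "https://youtube.com/@channelname",
    ):
        print(classify_youtube_url(example))
```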

Twitter

# Download Twitter threads
scrapify "https://x.com/thread/XXXXXXXXXXXX"

# Download only the media from Twitter threads
scrapify "https://x.com/thread/XXXXXXXXXXXX" --media-only

GitHub

# Basic user activity fetch
scrapify "https://github.com/username" --activity

# Get repository issues
scrapify "https://github.com/username/repo" --issues

# Get repository discussions
scrapify "https://github.com/username/repo" --discussions

Reddit

PRAW for Reddit

  1. Go to https://www.reddit.com/prefs/apps
  2. Click "create app" or "create another app"
  3. Fill out the form:
    • name: Give your app a name.
    • application type: Select "script".
    • redirect uri: Use http://localhost:8080 or a similar placeholder.
  4. After creation, note down the client_id (just under the app name) and client_secret.

# Scrape hot posts
scrapify "https://www.reddit.com/r/LocalLLaMA/"

# Scrape new posts
scrapify "https://www.reddit.com/r/LocalLLaMA/" --sort new

# Scrape top posts with custom limit
scrapify "https://www.reddit.com/r/LocalLLaMA/" --sort top --limit 50

Tips:

Spotify:

  • Make sure the playlist you want to download is not set to private.
  • Songs are downloaded at 320 kbps, falling back to 128 kbps.
  • Some songs may fail to download due to rate limiting. The tool applies backoff to try to mitigate this, but it may not always work, so you are given the option to retry the failed songs at the end.
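
The retry behaviour described in the last tip can be sketched roughly as follows. This is an illustration of exponential backoff with jitter, not scrapify's actual implementation. The names `download_with_backoff` and `download_all`, the attempt count, and the delays are all assumptions, and `download` stands in for whatever function fetches a single song.

```python
import random
import time
from typing import Callable, Iterable


def download_with_backoff(
    download: Callable[[str], None],
    track: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try one download, roughly doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            download(track)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break  # out of attempts, report failure
            # Exponential backoff plus random jitter to spread out retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    return False


def download_all(
    download: Callable[[str], None],
    tracks: Iterable[str],
    **kwargs,
) -> list[str]:
    """Download every track and return the ones that still failed."""
    return [t for t in tracks if not download_with_backoff(download, t, **kwargs)]
```

The list returned by `download_all` corresponds to the failed songs that can be offered for another attempt at the end of a run.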
