RSS ingestor #1276

Open
eilmiv wants to merge 23 commits into ElixirTeSS:master from pan-training:rss_ingestor

Collaborator

@eilmiv eilmiv commented Apr 7, 2026

Summary of changes

  • Added RSS and Atom feed support using the rss gem
    • Separate ingestors for events and materials (though RSS is less useful for events)
  • RSS feeds are optionally discovered from HTML pages via a link element with type application/rss+xml or atom
  • Support for metadata extensions (not every extension is supported for every RSS/Atom version)
    • RDF metadata (Bioschemas)
    • Dublin Core
    • iTunes
    • Yahoo Media (used on YouTube, for example)
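
The feed discovery step described in the list above could look roughly like the sketch below. It is not the code from this PR: `discover_feed_urls` and `FEED_TYPES` are hypothetical names, reading "atom" as the MIME type application/atom+xml is an assumption, and a regex scan is used only to keep the example stdlib-only (an HTML parser would be the more robust choice).

```ruby
require 'uri'

# MIME types assumed to mark a <link> element as an alternate feed.
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Hypothetical helper: returns the absolute feed URLs advertised by an HTML page.
def discover_feed_urls(html, page_url)
  html.scan(/<link\b[^>]*>/i).filter_map do |tag|
    # Collect quoted attributes of the tag into a hash with lowercase keys.
    attrs = tag.scan(/([\w-]+)\s*=\s*["']([^"']*)["']/).to_h { |k, v| [k.downcase, v] }
    next unless FEED_TYPES.include?(attrs['type'].to_s.downcase)
    next if attrs['href'].to_s.empty?

    # Resolve relative hrefs against the URL of the page they were found on.
    URI.join(page_url, attrs['href']).to_s
  end
end

html = <<~HTML
  <html><head>
    <link rel="stylesheet" href="/style.css">
    <link rel="alternate" type="application/rss+xml" href="/feed.xml">
    <link rel="alternate" type="application/atom+xml" href="https://example.org/atom">
  </head></html>
HTML

puts discover_feed_urls(html, 'https://example.org/training/')
```

The stylesheet link is skipped because it carries no feed type, and the relative href is resolved against the page URL before it is returned.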

Motivation and context

Closes #722

Screenshots
image

Checklist

  • I have read and followed the CONTRIBUTING guide.
  • I confirm that I have the authority necessary to make this contribution on behalf of its copyright owner and agree to license it to the TeSS codebase under the BSD license.

Contributor

Copilot AI left a comment

Pull request overview

Adds RSS/Atom feed ingestion to TeSS by introducing dedicated ingestors for events and materials, including optional HTML feed discovery and support for several common metadata extensions (Dublin Core, RDF/Bioschemas, iTunes, Yahoo Media).

Changes:

  • Introduce shared RSS/Atom ingestion helpers (RSSIngestion) plus reusable Dublin Core parsing/building (DublinCoreIngestion).
  • Add new ingestors for event and material RSS/Atom feeds, including RDF/Bioschemas merge behavior and HTML alternate-feed discovery.
  • Add RSS Media namespace support for Atom parsing and comprehensive unit tests for RSS/Atom ingestion and extensions.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/unit/rss_media_atom_test.rb | Tests Media namespace installation idempotency for Atom. |
| test/unit/ingestors/material_rss_ingestor_test.rb | Material RSS/Atom ingestion tests (DC, RSS versions, RDF/Bioschemas, HTML discovery, media/iTunes extensions). |
| test/unit/ingestors/event_rss_ingestor_test.rb | Event RSS/Atom ingestion tests (DC, relative links, RDF/Bioschemas, HTML discovery). |
| lib/rss/media.rb | Defines Yahoo Media RSS extension wiring + loads Atom-specific patch. |
| lib/rss/media/atom.rb | Patches Atom classes to support media:group parsing and makes namespace installation idempotent. |
| lib/ingestors/rss_ingestion.rb | Shared feed fetching/parsing + HTML discovery + extraction/merge helpers. |
| lib/ingestors/dublin_core_ingestion.rb | Centralized DC-to-OpenStruct builders and normalization helpers. |
| lib/ingestors/material_rss_ingestor.rb | New material RSS/Atom ingestor (RSS/RDF/Atom + Bioschemas LearningResource extraction). |
| lib/ingestors/event_rss_ingestor.rb | New event RSS/Atom ingestor (RSS/RDF/Atom + Bioschemas Event/Course extraction). |
| lib/ingestors/oai_pmh_ingestor.rb | Refactors OAI-PMH DC parsing to reuse DublinCoreIngestion. |
| lib/ingestors/ingestor_factory.rb | Registers the new RSS ingestors. |
| config/initializers/inflections.rb | Adds RSS acronym for correct Zeitwerk/inflector naming. |
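
The file summary above mentions that the Atom patch makes namespace installation idempotent. One common way to get that property is a guard that checks whether the registration already happened, as in the sketch below. This does not reproduce lib/rss/media/atom.rb: `EntryStub`, `MediaNamespace`, and `install` are hypothetical stand-ins, and only the namespace URI of the Media RSS extension is taken as a known fact.

```ruby
# Hypothetical stand-in for a feed element class that keeps a namespace registry.
class EntryStub
  def self.namespaces
    @namespaces ||= {}
  end
end

module MediaNamespace
  PREFIX = 'media'
  NAMESPACE_URI = 'http://search.yahoo.com/mrss/'

  # Registers the namespace once; later calls leave the registry untouched.
  # Returns true only when this call performed the registration.
  def self.install(klass)
    return false if klass.namespaces.key?(PREFIX)

    klass.namespaces[PREFIX] = NAMESPACE_URI
    true
  end
end

MediaNamespace.install(EntryStub) # registers the prefix
MediaNamespace.install(EntryStub) # no-op on the second call
```

A guard like this keeps repeated loading, for example code reloading in development or multiple test files requiring the patch, from registering the same namespace twice.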

Comment thread lib/ingestors/material_rss_ingestor.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Collaborator Author

eilmiv commented Apr 9, 2026

Two additional notes:

  • Depending on the RSS feed, only the most recent entries are imported/updated (e.g. the most recent 15 for YouTube)
  • I'm not sure whether any RSS feeds for events exist. I have no problem removing the RSS event ingestor if it is not seen as useful.

@eilmiv eilmiv requested a review from fbacall April 9, 2026 10:29
@eilmiv eilmiv marked this pull request as ready for review April 9, 2026 10:30
Member

@fbacall fbacall left a comment

Looks good - very flexible and nice tests. I think some parts can be simplified, and it might be good to split the YouTube functionality into a simple subclass for the sake of clarity.

Comment thread lib/ingestors/dublin_core_ingestion.rb Outdated
Comment thread lib/ingestors/material_rss_ingestor.rb Outdated
Comment thread lib/ingestors/event_rss_ingestor.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/rss/media/atom.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/ingestors/dublin_core_ingestion.rb Outdated
@eilmiv eilmiv marked this pull request as draft May 22, 2026 09:41
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

lib/ingestors/material_rss_ingestor.rb:227

  • build_material_from_atom_item always does Addressable::URI.join(feed_url, ...) even though extract_atom_link can return nil when an entry has no usable link. This can raise and abort ingestion. Guard for blank links (and preserve any URL already set from Dublin Core identifiers) before calling Addressable::URI.join.
      material = build_material_from_dublin_core_data(extract_dublin_core(item))

      media_title = text_value(item.media_group&.media_title)
      material.title ||= text_value(item.title) || media_title
      material.url = Addressable::URI.join(feed_url, text_value(extract_atom_link(item))).to_s
      media_group_description = text_value(item.media_group&.media_description)
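
The guard suggested in this comment could be sketched as follows. This is an illustration rather than the fix applied in the PR: `resolve_entry_url` is a hypothetical helper, the stdlib `URI.join` stands in for `Addressable::URI.join` so the example runs without extra gems, and rescuing `URI::InvalidURIError` is an added assumption about how malformed links should be treated.

```ruby
require 'uri'

# Hypothetical helper: resolve an entry link against the feed URL, keeping any
# URL that was already set (e.g. from a Dublin Core identifier) when the entry
# has no usable link.
def resolve_entry_url(feed_url, link, existing_url = nil)
  link = link.to_s.strip
  return existing_url if link.empty?

  URI.join(feed_url, link).to_s
rescue URI::InvalidURIError
  # Assumption: a malformed link is treated the same as a missing one.
  existing_url
end

feed_url = 'https://example.org/feeds/atom.xml'
puts resolve_entry_url(feed_url, 'videos/1')
puts resolve_entry_url(feed_url, nil, 'https://doi.org/10.1/x')
```

With a helper like this, an entry without a link no longer aborts ingestion; the material keeps whatever URL it already had, or none, and validation can decide what to do with it.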

Comment thread lib/ingestors/material_rss_ingestor.rb
Comment thread lib/ingestors/material_rss_ingestor.rb
Comment thread lib/ingestors/youtube_ingestor.rb
Collaborator Author

@eilmiv eilmiv left a comment

I addressed all the comments.

Additionally, I removed the RSS ingestor for events in 6fe1eff. I don't know whether any RSS feed even contains events, the event metadata available from RSS is not very useful, and the ingestor made the implementation unnecessarily complex. If it is needed in the future, it can be brought back relatively easily.

Comment thread lib/ingestors/material_rss_ingestor.rb
Comment thread lib/ingestors/material_rss_ingestor.rb
Comment thread lib/ingestors/youtube_ingestor.rb
Comment thread lib/ingestors/dublin_core_ingestion.rb Outdated
Comment thread lib/ingestors/dublin_core_ingestion.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/ingestors/rss_ingestion.rb Outdated
Comment thread lib/rss/media/atom.rb Outdated
@eilmiv eilmiv marked this pull request as ready for review May 22, 2026 11:25
@eilmiv eilmiv requested a review from fbacall May 22, 2026 11:25

3 participants