
feat: add ossf data fetcher (CM-952) #3839

Merged
ulemons merged 13 commits into feat/add-dal-automatic-project-discovery from feat/add-assf-data-fetcher on Mar 26, 2026

ulemons commented Feb 10, 2026

Note

Medium Risk
Introduces a new scheduled Temporal ingestion pipeline that streams and upserts large external datasets into Postgres, and changes projectCatalog upsert semantics to preserve existing scores when the incoming value is null. Risk is mainly around data correctness, load/timeout behavior, and reliance on external HTTP sources.

Overview
Adds an automatic projects discovery Temporal worker that discovers OSS repos from external sources and bulk upserts them into projectCatalog in 5k-row batches (CSV via csv-parse, JSON via object-mode streams), with retries/timeouts tuned for long-running dataset processing.
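
The 5k-row batching described above can be sketched roughly as follows. This is a minimal illustration and not the worker's actual code: `inBatches` and `processDataset` are hypothetical names, and `upsert` stands in for a call such as `bulkUpsertProjectCatalog`.

```typescript
// Hypothetical helper: group rows from any async source (for example a parsed
// CSV stream or an object-mode JSON stream) into fixed-size batches.
async function* inBatches<T>(source: AsyncIterable<T>, size: number): AsyncGenerator<T[]> {
  if (size <= 0) throw new Error('batch size must be positive')
  let batch: T[] = []
  for await (const row of source) {
    batch.push(row)
    if (batch.length === size) {
      yield batch
      batch = []
    }
  }
  if (batch.length > 0) yield batch // flush the final partial batch
}

// Hypothetical driver: `upsert` is a stand-in for the real bulk upsert.
async function processDataset<T>(
  rows: AsyncIterable<T>,
  upsert: (batch: T[]) => Promise<void>,
  batchSize = 5000,
): Promise<number> {
  let total = 0
  for await (const batch of inBatches(rows, batchSize)) {
    await upsert(batch) // only one batch is held in memory at a time
    total += batch.length
  }
  return total
}
```

Because Node.js readable streams are async iterables, a parsed row stream could be passed as `rows` directly, which keeps memory use bounded by the batch size rather than the dataset size.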

Introduces a pluggable IDiscoverySource abstraction plus two sources: OSSF Criticality Score (public GCS bucket CSV snapshots) and LF Criticality Score (paged HTTP JSON API), and updates the Temporal schedule to run daily at midnight with discoverProjects({ mode: 'incremental' }) and a 2-hour workflow execution timeout.

Updates bulkUpsertProjectCatalog/upsertProjectCatalog conflict handling to COALESCE score fields so missing incoming scores no longer overwrite existing ossfCriticalityScore/lfCriticalityScore values.
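
As an illustration of the COALESCE rule, here is a hedged sketch. The SQL is not the PR's query: the conflict target `"repoUrl"` and the positional parameter style are assumptions, and only the two score column names come from the description. `mergeScores` is a hypothetical in-memory model of the same rule.

```typescript
// Illustrative SQL only; the conflict target ("repoUrl") and the parameter
// style are assumptions, not the query used in projectCatalog.ts.
const upsertSql = `
  INSERT INTO "projectCatalog" ("repoUrl", "ossfCriticalityScore", "lfCriticalityScore")
  VALUES ($1, $2, $3)
  ON CONFLICT ("repoUrl") DO UPDATE SET
    "ossfCriticalityScore" = COALESCE(EXCLUDED."ossfCriticalityScore", "projectCatalog"."ossfCriticalityScore"),
    "lfCriticalityScore"   = COALESCE(EXCLUDED."lfCriticalityScore", "projectCatalog"."lfCriticalityScore")
`

interface Scores {
  ossfCriticalityScore: number | null
  lfCriticalityScore: number | null
}

// In-memory model of the same rule: an incoming null keeps the stored value,
// while any non-null incoming value (including 0) replaces it.
function mergeScores(existing: Scores, incoming: Scores): Scores {
  return {
    ossfCriticalityScore: incoming.ossfCriticalityScore ?? existing.ossfCriticalityScore,
    lfCriticalityScore: incoming.lfCriticalityScore ?? existing.lfCriticalityScore,
  }
}
```

This matters here because each source supplies only one of the two scores, so a row arriving from the LF source carries a null OSSF score and would otherwise erase it.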

Written by Cursor Bugbot for commit a8062f5. This will update automatically on new commits.

ulemons self-assigned this Feb 10, 2026

github-actions bot left a comment

Conventional Commits FTW!

ulemons changed the title from "Feat/add assf data fetcher" to "feat: add ossf data fetcher" Feb 10, 2026
ulemons added the POC label Feb 11, 2026
ulemons changed the title from "feat: add ossf data fetcher" to "feat: add ossf data fetcher (CM-952)" Feb 11, 2026
ulemons changed the base branch from feat/add-project-discovery-worker to feat/add-dal-automatic-project-discovery February 11, 2026 10:46
ulemons force-pushed the feat/add-dal-automatic-project-discovery branch 2 times, most recently from f4cc9b8 to 543aa5f March 24, 2026 10:31
ulemons force-pushed the feat/add-assf-data-fetcher branch from 264892a to e4e5fc2 March 24, 2026 10:37
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

ulemons force-pushed the feat/add-assf-data-fetcher branch from e4e5fc2 to 0b93da4 March 24, 2026 10:38
ulemons force-pushed the feat/add-dal-automatic-project-discovery branch from 3cbe8ec to 0af35ee March 24, 2026 11:56
ulemons force-pushed the feat/add-assf-data-fetcher branch from 76c308c to 02ab17d March 24, 2026 11:58
ulemons marked this pull request as ready for review March 24, 2026 14:01
Copilot AI review requested due to automatic review settings March 24, 2026 14:01
ulemons force-pushed the feat/add-dal-automatic-project-discovery branch from 0af35ee to 82f29d9 March 24, 2026 14:02
ulemons force-pushed the feat/add-assf-data-fetcher branch from 02ab17d to ed37511 March 24, 2026 14:02
Copilot AI left a comment

Pull request overview

Adds an Automatic Projects Discovery Temporal worker that fetches external criticality datasets (OSSF + LF) and upserts them into the projectCatalog table, while aligning the data-access layer with the underlying schema.

Changes:

  • Split criticalityScore into ossfCriticalityScore and lfCriticalityScore across DAL types and SQL upsert/insert logic.
  • Implement a Temporal workflow + activities pipeline to list datasets per source and process datasets in batches via bulkUpsertProjectCatalog.
  • Add source abstractions/registry plus OSSF (GCS CSV) and LF (API JSON) fetchers, along with scheduling/docs/deps.
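
A source registry of the kind listed above could look roughly like the sketch below. The shapes are assumptions for illustration: the PR's actual `IDiscoverySource` contract is not shown in this conversation, and `SourceRegistry`, `register`, `get`, and `names` are hypothetical names. The two source names mirror the directory names in the changed files.

```typescript
// Hypothetical shape: the PR's actual IDiscoverySource contract may differ.
interface IDiscoverySource {
  readonly name: string
  readonly format: 'csv' | 'json' // which streaming path the source uses
}

class SourceRegistry {
  private readonly sources = new Map<string, IDiscoverySource>()

  register(source: IDiscoverySource): void {
    if (this.sources.has(source.name)) {
      throw new Error(`duplicate discovery source: ${source.name}`)
    }
    this.sources.set(source.name, source)
  }

  // Lookup helper: fails loudly instead of returning undefined.
  get(name: string): IDiscoverySource {
    const source = this.sources.get(name)
    if (!source) throw new Error(`unknown discovery source: ${name}`)
    return source
  }

  // List helper: names in registration order.
  names(): string[] {
    return Array.from(this.sources.keys())
  }
}

const registry = new SourceRegistry()
registry.register({ name: 'ossf-criticality-score', format: 'csv' })
registry.register({ name: 'lf-criticality-score', format: 'json' })
```

Keeping lookup behind a registry means a workflow can pass only a source name between activities and resolve the implementation inside the activity.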

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| services/libs/data-access-layer/src/project-catalog/types.ts | Updates DB types to use ossfCriticalityScore + lfCriticalityScore. |
| services/libs/data-access-layer/src/project-catalog/projectCatalog.ts | Updates select/insert/bulk insert/upsert/update SQL to use the two score columns. |
| services/apps/automatic_projects_discovery_worker/src/workflows/discoverProjects.ts | New workflow orchestration (incremental vs full) + activity timeouts/retries. |
| services/apps/automatic_projects_discovery_worker/src/sources/types.ts | Introduces source/dataset interfaces and CSV vs JSON streaming contract. |
| services/apps/automatic_projects_discovery_worker/src/sources/registry.ts | Registers discovery sources and exposes lookup/list helpers. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/source.ts | Implements dataset listing + CSV row parsing for OSSF snapshots. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/bucketClient.ts | Minimal HTTP/XML client for listing GCS prefixes + streaming all.csv. |
| services/apps/automatic_projects_discovery_worker/src/sources/lf-criticality-score/source.ts | Implements LF dataset descriptor generation + paginated API streaming + parsing. |
| services/apps/automatic_projects_discovery_worker/src/schedules/scheduleProjectsDiscovery.ts | Registers Temporal schedule (daily cron) and workflow args/timeouts. |
| services/apps/automatic_projects_discovery_worker/src/main.ts | Enables Postgres for this worker and schedules discovery on startup. |
| services/apps/automatic_projects_discovery_worker/src/activities/activities.ts | Adds activities to list sources/datasets and process datasets with batching + upsert. |
| services/apps/automatic_projects_discovery_worker/src/activities.ts | Switches barrel export to explicit named exports for activity typing. |
| services/apps/automatic_projects_discovery_worker/package.json | Adds csv-parse dependency and adjusts local debug script. |
| services/apps/automatic_projects_discovery_worker/README.md | Adds architecture/workflow documentation for the worker. |
| pnpm-lock.yaml | Locks csv-parse dependency version. |
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported



const log = getServiceLogger()

const DEFAULT_API_URL = 'https://hypervascular-nonduplicative-vern.ngrok-free.dev'

Ngrok development URL hardcoded as production default

High Severity

DEFAULT_API_URL is set to a temporary ngrok tunnel URL (https://hypervascular-nonduplicative-vern.ngrok-free.dev). If LF_CRITICALITY_SCORE_API_URL is not set in the environment, every LfCriticalityScoreSource operation will attempt to reach this ephemeral dev tunnel, which will almost certainly be dead in production. This causes all LF criticality score data fetching to fail silently or with errors.
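
One possible remedy, not necessarily the one this PR adopted, is to drop the default and fail fast when the variable is missing. The sketch below is an assumption-laden illustration: `resolveLfApiUrl` is a hypothetical helper, and only the variable name `LF_CRITICALITY_SCORE_API_URL` comes from the review comment.

```typescript
// Hypothetical helper: resolve the LF API base URL from an environment map
// instead of falling back to a hardcoded development tunnel.
function resolveLfApiUrl(env: Record<string, string | undefined>): string {
  const raw = (env['LF_CRITICALITY_SCORE_API_URL'] ?? '').trim()
  if (raw === '') {
    throw new Error('LF_CRITICALITY_SCORE_API_URL must be set; no default is provided')
  }
  if (!/^https?:\/\//.test(raw)) {
    throw new Error(`LF_CRITICALITY_SCORE_API_URL is not an http(s) URL: ${raw}`)
  }
  return raw.replace(/\/+$/, '') // drop trailing slashes so path joins stay predictable
}
```

In a Node.js worker this would be called once at startup as `resolveLfApiUrl(process.env)`, so a misconfigured deployment stops immediately instead of failing on every scheduled run.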


ulemons added 8 commits March 26, 2026 10:13
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
ulemons and others added 5 commits March 26, 2026 10:13
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
ulemons force-pushed the feat/add-assf-data-fetcher branch from e28a6c4 to a8062f5 March 26, 2026 09:14
ulemons merged commit fdf1177 into feat/add-dal-automatic-project-discovery Mar 26, 2026
2 checks passed
ulemons deleted the feat/add-assf-data-fetcher branch March 26, 2026 09:15
ulemons added a commit that referenced this pull request Mar 26, 2026
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


res.headers.location
) {
httpsGet(res.headers.location).then(resolve, reject)
return

Missing res.resume() causes socket leaks on redirects/errors

Medium Severity

In both httpsGet and getHttpsStream, when a redirect (3xx) or error status is received, the response body is never consumed — no res.resume() call. This keeps the underlying TCP socket open and prevents the HTTP agent from reusing it. The author correctly calls res.resume() on error in fetchPage in the LF source, but missed it here. GCS (commondatastorage.googleapis.com) is known to issue redirects, so the redirect path is likely hit on every request, leaking a socket each time. With listTimePrefixes called once per date prefix (potentially dozens of times), sockets accumulate quickly.
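
The fix the comment points toward can be sketched as below. This is a minimal illustration under stated assumptions: `ResponseLike` is a structural stand-in for the parts of Node's `http.IncomingMessage` used here, and `classifyResponse` is a hypothetical helper, not a function from the PR.

```typescript
// Structural stand-in for the members of http.IncomingMessage used below.
interface ResponseLike {
  statusCode?: number
  headers: { location?: string }
  resume(): void
}

type Outcome =
  | { kind: 'ok' }
  | { kind: 'redirect'; location: string }
  | { kind: 'error'; status: number }

// Hypothetical helper: classify the response and drain the body on every
// path where the caller will not read it, so the socket can be released.
function classifyResponse(res: ResponseLike): Outcome {
  const status = res.statusCode ?? 0
  if (status >= 300 && status < 400 && res.headers.location) {
    res.resume() // discard the redirect body before following Location
    return { kind: 'redirect', location: res.headers.location }
  }
  if (status < 200 || status >= 300) {
    res.resume() // discard the error body as well
    return { kind: 'error', status }
  }
  return { kind: 'ok' } // the caller consumes the body itself
}
```

Inside an `https.get` callback, the `redirect` outcome would lead to the recursive `httpsGet(location)` call and the `error` outcome to a rejection, with the body already drained in both cases.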

