feat: add ossf data fetcher (CM-952) #3839
ulemons merged 13 commits into feat/add-dal-automatic-project-discovery
Pull request overview
Adds an Automatic Projects Discovery Temporal worker that fetches external criticality datasets (OSSF + LF) and upserts them into the projectCatalog table, while aligning the data-access layer with the underlying schema.
Changes:
- Split criticalityScore into ossfCriticalityScore and lfCriticalityScore across DAL types and SQL upsert/insert logic.
- Implement a Temporal workflow + activities pipeline to list datasets per source and process datasets in batches via bulkUpsertProjectCatalog.
- Add source abstractions/registry plus OSSF (GCS CSV) and LF (API JSON) fetchers, along with scheduling/docs/deps.
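The batch processing mentioned in the changes above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `BatchBuffer` and `toBatches` are hypothetical names, and the real worker would hand each full batch to `bulkUpsertProjectCatalog` rather than collect batches in memory.

```typescript
// Hypothetical helper (not from the PR): accumulates rows and hands back
// a full batch once `size` rows have been collected.
class BatchBuffer<T> {
  private rows: T[] = []
  private readonly size: number

  constructor(size: number) {
    if (size < 1 || Math.floor(size) !== size) {
      throw new Error('batch size must be a positive integer')
    }
    this.size = size
  }

  // Returns a full batch when the buffer reaches `size`, otherwise undefined.
  push(row: T): T[] | undefined {
    this.rows.push(row)
    return this.rows.length >= this.size ? this.flush() : undefined
  }

  // Returns whatever is buffered (possibly empty) and resets the buffer.
  flush(): T[] {
    const out = this.rows
    this.rows = []
    return out
  }
}

// Demo: split rows into batches. In a worker, each batch would be passed
// to an upsert call instead of being pushed onto an array.
function toBatches<T>(rows: T[], size: number): T[][] {
  const buffer = new BatchBuffer<T>(size)
  const batches: T[][] = []
  for (const row of rows) {
    const full = buffer.push(row)
    if (full) batches.push(full)
  }
  // Flush the final partial batch so trailing rows are not lost.
  const rest = buffer.flush()
  if (rest.length > 0) batches.push(rest)
  return batches
}
```

Keeping the buffer separate from the stream source means the same logic can serve both the CSV and JSON paths; the final `flush()` is the step that is easy to forget.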
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/project-catalog/types.ts | Updates DB types to use ossfCriticalityScore + lfCriticalityScore. |
| services/libs/data-access-layer/src/project-catalog/projectCatalog.ts | Updates select/insert/bulk insert/upsert/update SQL to use the two score columns. |
| services/apps/automatic_projects_discovery_worker/src/workflows/discoverProjects.ts | New workflow orchestration (incremental vs full) + activity timeouts/retries. |
| services/apps/automatic_projects_discovery_worker/src/sources/types.ts | Introduces source/dataset interfaces and CSV vs JSON streaming contract. |
| services/apps/automatic_projects_discovery_worker/src/sources/registry.ts | Registers discovery sources and exposes lookup/list helpers. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/source.ts | Implements dataset listing + CSV row parsing for OSSF snapshots. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/bucketClient.ts | Minimal HTTP/XML client for listing GCS prefixes + streaming all.csv. |
| services/apps/automatic_projects_discovery_worker/src/sources/lf-criticality-score/source.ts | Implements LF dataset descriptor generation + paginated API streaming + parsing. |
| services/apps/automatic_projects_discovery_worker/src/schedules/scheduleProjectsDiscovery.ts | Registers Temporal schedule (daily cron) and workflow args/timeouts. |
| services/apps/automatic_projects_discovery_worker/src/main.ts | Enables Postgres for this worker and schedules discovery on startup. |
| services/apps/automatic_projects_discovery_worker/src/activities/activities.ts | Adds activities to list sources/datasets and process datasets with batching + upsert. |
| services/apps/automatic_projects_discovery_worker/src/activities.ts | Switches barrel export to explicit named exports for activity typing. |
| services/apps/automatic_projects_discovery_worker/package.json | Adds csv-parse dependency and adjusts local debug script. |
| services/apps/automatic_projects_discovery_worker/README.md | Adds architecture/workflow documentation for the worker. |
| pnpm-lock.yaml | Locks csv-parse dependency version. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
const log = getServiceLogger()

const DEFAULT_API_URL = 'https://hypervascular-nonduplicative-vern.ngrok-free.dev'
Ngrok development URL hardcoded as production default
High Severity
DEFAULT_API_URL is set to a temporary ngrok tunnel URL (https://hypervascular-nonduplicative-vern.ngrok-free.dev). If LF_CRITICALITY_SCORE_API_URL is not set in the environment, every LfCriticalityScoreSource operation will attempt to reach this ephemeral dev tunnel, which will almost certainly be dead in production. This causes all LF criticality score data fetching to fail silently or with errors.
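One way to address this finding is to drop the default entirely and fail fast when the variable is missing. The sketch below assumes that approach; `resolveLfApiUrl` and the `Env` type are hypothetical names introduced for illustration, and the PR may have resolved the issue differently.

```typescript
// Hypothetical helper (not from the PR): resolve the LF API base URL from an
// env-like map and refuse to fall back to any hardcoded default.
type Env = { [key: string]: string | undefined }

function resolveLfApiUrl(env: Env): string {
  const raw = (env['LF_CRITICALITY_SCORE_API_URL'] || '').trim()
  if (raw === '') {
    throw new Error(
      'LF_CRITICALITY_SCORE_API_URL is not set; refusing to fall back to a default URL',
    )
  }
  // Strip trailing slashes so callers can append paths consistently.
  return raw.replace(/\/+$/, '')
}
```

In a Node process this would typically be called once at startup as `resolveLfApiUrl(process.env)`, so a missing variable surfaces as a clear configuration error rather than as failed requests to a dead tunnel.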
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Merged commit fdf1177 into feat/add-dal-automatic-project-discovery
  res.headers.location
) {
  httpsGet(res.headers.location).then(resolve, reject)
  return
Missing res.resume() causes socket leaks on redirects/errors
Medium Severity
In both httpsGet and getHttpsStream, when a redirect (3xx) or error status is received, the response body is never consumed — no res.resume() call. This keeps the underlying TCP socket open and prevents the HTTP agent from reusing it. The author correctly calls res.resume() on error in fetchPage in the LF source, but missed it here. GCS (commondatastorage.googleapis.com) is known to issue redirects, so the redirect path is likely hit on every request, leaking a socket each time. With listTimePrefixes called once per date prefix (potentially dozens of times), sockets accumulate quickly.
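The fix the comment asks for can be sketched as a small decision helper that drains the body on every path where it will not be read. This is an illustration under assumptions, not the PR's code: `classifyResponse`, `ResponseLike`, and `Outcome` are hypothetical names, and `ResponseLike` only mirrors the fields of Node's `http.IncomingMessage` that are used here.

```typescript
// Minimal shape of the response fields used below. In Node this role is
// played by http.IncomingMessage; the interface itself is hypothetical.
interface ResponseLike {
  statusCode?: number
  headers: { location?: string }
  resume(): void
}

type Outcome =
  | { kind: 'ok' }
  | { kind: 'redirect'; location: string }
  | { kind: 'error'; status: number }

// Decide what to do with a response, draining the body whenever the caller
// is not going to consume it.
function classifyResponse(res: ResponseLike): Outcome {
  const status = res.statusCode || 0
  if (status >= 200 && status < 300) {
    // Success: the caller reads the body, so do not drain it here.
    return { kind: 'ok' }
  }
  // Redirect or error: nobody will read this body, so discard it to let
  // the socket be released back to the agent.
  res.resume()
  if (status >= 300 && status < 400 && res.headers.location) {
    return { kind: 'redirect', location: res.headers.location }
  }
  return { kind: 'error', status }
}
```

Inside an `https.get` callback, the `redirect` outcome would lead to a recursive call with the new location and the `error` outcome to a rejected promise; in both cases the original response has already been drained.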


Note
Medium Risk
Introduces a new scheduled Temporal ingestion pipeline that streams and upserts large external datasets into Postgres, and changes projectCatalog upsert semantics to preserve existing scores when the incoming value is null. Risk is mainly around data correctness, load/timeout behavior, and reliance on external HTTP sources.

Overview

Adds an automatic projects discovery Temporal worker that discovers OSS repos from external sources and bulk upserts them into projectCatalog in 5k-row batches (CSV via csv-parse, JSON via object-mode streams), with retries/timeouts tuned for long-running dataset processing.

Introduces a pluggable IDiscoverySource abstraction plus two sources: OSSF Criticality Score (public GCS bucket CSV snapshots) and LF Criticality Score (paged HTTP JSON API), and updates the Temporal schedule to run daily at midnight with discoverProjects({ mode: 'incremental' }) and a 2-hour workflow execution timeout.

Updates bulkUpsertProjectCatalog/upsertProjectCatalog conflict handling to COALESCE score fields so missing incoming scores no longer overwrite existing ossfCriticalityScore/lfCriticalityScore values.

Written by Cursor Bugbot for commit a8062f5.
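The COALESCE behavior described above can be illustrated in memory. This sketch assumes the SQL prefers the incoming value and falls back to the stored one, which is what "missing incoming scores no longer overwrite existing values" suggests; `Scores` and `mergeScores` are hypothetical names and the actual SQL is not shown in this page.

```typescript
// Hypothetical in-memory model of the two score columns.
interface Scores {
  ossfCriticalityScore: number | null
  lfCriticalityScore: number | null
}

// Mirrors COALESCE(incoming, existing) per column: an incoming null keeps
// the stored value, while any non-null incoming value (including 0) wins.
function mergeScores(existing: Scores, incoming: Scores): Scores {
  return {
    ossfCriticalityScore: incoming.ossfCriticalityScore ?? existing.ossfCriticalityScore,
    lfCriticalityScore: incoming.lfCriticalityScore ?? existing.lfCriticalityScore,
  }
}
```

This matters because each source fills only its own column: an OSSF row arrives with a null LF score, and without the fallback it would erase an LF score written by an earlier run. Using `??` rather than `||` keeps a legitimate score of 0 from being treated as missing.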