Skip to content

[CI] Add new script + cronjob to amend pull request data#809

Open
jriv01 wants to merge 3 commits intollvm:mainfrom
jriv01:amend-data
Open

[CI] Add new script + cronjob to amend pull request data#809
jriv01 wants to merge 3 commits intollvm:mainfrom
jriv01:amend-data

Conversation

@jriv01
Copy link
Contributor

@jriv01 jriv01 commented Mar 16, 2026

Implement a new script for amending pull request data stored in BigQuery. This script also fetches open pull requests, as that data is not already recorded by process_llvm_commits.py. Data regarding new pull requests and reviews and requeried and amended every two hours, separate from the process_llvm_commits cronjob.

Scheduling of this script reuses the operational metrics container, as both process_llvm_commits.py and amend_pull_request_data.py depend on operational_metrics_lib.py. The image commands for each cronjob have been explicitly added to each cronjob manifest.

@jriv01 jriv01 requested review from boomanaiden154 and cmtice March 16, 2026 20:35
LLVM_REVIEWS_TABLE = "llvm_reviews"


def fetch_open_pull_requests_from_github(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're only querying open PRs. That means we are going to miss PRs that are short lived (<2 hours open to merge)(actually somewhat common), assuming that they do not straddle a two hour boundary.

That also means we are going to miss any postcommit review information. Is that considered out of scope for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cases like those, we'd need to rely on process_llvm_commits.py to record merged commits first before we can re-query them for post-commit reviews. They'd be subject to the 2 day delay that script runs with, but post-commit reviews should be captured at some point.

It might be fine for the time being, but not ideal. Ideally we'd either increase the frequency we poll for open pull requests, or reduce/remove our delay on process_llvm_commits.py. I could follow up on this in another PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So postcommit review information is just out of scope for this PR?

We could also query for closed PRs that were opened and closed in the last two hours, which I think should capture that without duplicating data.

There are other race conditions probably though (like PRs that were opened after the two hour mark but before the script starts running.

Copy link
Contributor Author

@jriv01 jriv01 Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not entirely out of scope, once this script is running we'll start collecting post-commit data for commits submitted without review within the last two weeks (using whatever data we already have stored, since these tables are shared).

I agree that we could query for PRs that had a quick turnaround time between open & close, duplicate data shouldn't be an issue since the SQL used will only insert new records and just update old ones. We could also avoid missing PRs that miss the 2 hour window due to race conditions by widening the window we query for. Querying the past 3 hours while running the script every 2 hours shouldn't be an issue since we don't have to worry about duplicates being recorded.

I think for this pull request, missing some post commit information is fine. I'll play around with some different solutions for addressing the edge cases (whether that be changing the queries, increasing the frequency, or both) and open a follow up with whatever yields the best results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants