[CI] Add new script + cronjob to amend pull request data #809
Conversation
LLVM_REVIEWS_TABLE = "llvm_reviews"

...

def fetch_open_pull_requests_from_github(
We're only querying open PRs. That means we are going to miss PRs that are short-lived (under two hours from open to merge, which is actually somewhat common), assuming they do not straddle a two-hour boundary.
That also means we are going to miss any post-commit review information. Is that considered out of scope for now?
In cases like those, we'd need to rely on process_llvm_commits.py to record merged commits first before we can re-query them for post-commit reviews. They'd be subject to the two-day delay that script runs with, but post-commit reviews should be captured at some point.
It might be fine for the time being, but it's not ideal. Ideally we'd either increase the frequency at which we poll for open pull requests, or reduce/remove the delay on process_llvm_commits.py. I could follow up on this in another PR.
So post-commit review information is just out of scope for this PR?
We could also query for closed PRs that were opened and closed in the last two hours, which I think should capture that without duplicating data.
There are probably other race conditions, though (like PRs that were opened after the two-hour mark but before the script starts running).
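The closed-PR query suggested above could be sketched as a search qualifier with a configurable lookback window. This is a hypothetical helper, assuming the script queries GitHub via the search API; the function name and repo qualifier are illustrative, not taken from the PR:

```python
from datetime import datetime, timedelta, timezone

def closed_pr_search_qualifier(lookback_hours: int = 2) -> str:
    """Build a GitHub search qualifier matching PRs closed within the
    lookback window, so short-lived PRs that opened and merged between
    polls are still captured.

    Hypothetical helper; the real script may structure its queries
    differently.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=lookback_hours)
    # GitHub search range qualifiers accept ISO 8601 timestamps.
    return (
        "repo:llvm/llvm-project is:pr "
        f"closed:>={cutoff.strftime('%Y-%m-%dT%H:%M:%SZ')}"
    )
```

Widening the lookback (e.g. three hours for a two-hour cron cadence) would also cover the boundary race conditions, since the upsert keeps re-queried rows from duplicating.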
It's not entirely out of scope: once this script is running, we'll start collecting post-commit data for commits submitted without review within the last two weeks (using whatever data we already have stored, since these tables are shared).
I agree that we could query for PRs that had a quick turnaround between open and close. Duplicate data shouldn't be an issue, since the SQL only inserts new records and updates existing ones. We could also avoid missing PRs that fall outside the two-hour window due to race conditions by widening the window we query. Querying the past three hours while running the script every two hours shouldn't be a problem, since duplicates can't be recorded.
I think for this pull request, missing some post-commit information is fine. I'll experiment with different solutions for the edge cases (whether that means changing the queries, increasing the frequency, or both) and open a follow-up with whatever yields the best results.
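The insert-or-update behavior described in this thread could look like a BigQuery MERGE, built here as a string for illustration. The table default, staging source name, and column names (pr_number, state, updated_at) are assumptions, not the script's actual schema:

```python
def build_pr_upsert_query(table: str = "llvm_reviews") -> str:
    """Return an idempotent BigQuery MERGE statement (sketch).

    Re-querying a widened time window then becomes safe: rows for
    already-seen PRs are updated in place, new PRs are inserted, and
    duplicates never occur. Column names are illustrative.
    """
    return f"""
    MERGE `{table}` AS target
    USING pending_pull_requests AS source
    ON target.pr_number = source.pr_number
    WHEN MATCHED THEN
      UPDATE SET state = source.state, updated_at = source.updated_at
    WHEN NOT MATCHED THEN
      INSERT (pr_number, state, updated_at)
      VALUES (source.pr_number, source.state, source.updated_at)
    """
```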
Implement a new script for amending pull request data stored in BigQuery. This script also fetches open pull requests, as that data is not already recorded by process_llvm_commits.py. Data on new pull requests and reviews is requeried and amended every two hours, separate from the process_llvm_commits cronjob. Scheduling of this script reuses the operational metrics container, as both process_llvm_commits.py and amend_pull_request_data.py depend on operational_metrics_lib.py. The image commands for each cronjob have been explicitly added to each cronjob manifest.