Multi video backend#2153
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Mephistic
left a comment
There was a problem hiding this comment.
This is a fantastic start - thanks for taking this on!
What is the rollout plan for this?
- Will we need to pause the hearings scraper/s before deploying and/or while the backfills run?
- Can this be safely rolled back if need be?
| - **Status**: "Occurred" if `hearing.content.startsAt` is in the past, "Scheduled" if in the future | ||
| - **Date**: formatted from `hearing.content.startsAt` | ||
| - **Watch link**: "Watch the committee hearing here." linked to `hearing.videoURL` — hidden if no video | ||
| - **Watch link**: "Watch the committee hearing here." linked to `hearing.videoURLs` — hidden if no videos |
There was a problem hiding this comment.
If we're linking to a hearing, we should probably just link to the /hearing/<HEARING_ID> page on the MAPLE site. Even if we do link to the legislature, we should link to the hearing page, not directly to the videoURL.
(Not directly relevant here - more a statement about how we shouldn't invest too many calories into making the Ballot Questions' Hearing type support multiple video URLs - at most, I suspect we'll want to link to the "first" video)
There was a problem hiding this comment.
I noticed this discrepancy at my end check for any uses of videoURL - I don't know what the ballot page looks like & the videoURL field does not currently appear to be used. I will fully defer to your judgement on how we want to address this. To be honest, I more meant this as a signal to the contractors working on the ballot questions that there are now multiple videos per hearing, and let them handle that how they will.
| videoURL: Optional(String), | ||
| videoTranscriptionId: Optional(String), | ||
| videoFetchedAt: Optional(InstanceOf(Timestamp)), | ||
| videos: Array(Video), |
There was a problem hiding this comment.
OOC why do we need a videos array where each item will have a transcriptionId along with a separate transcriptionIds array?
There was a problem hiding this comment.
Oh, I see - it's for the lookup from the webhook 👍
There was a problem hiding this comment.
Yes, this was an annoying limitation - AFAIK Firebase does not have elemMatch like queries, so I created a duplicate for querying location of transcription ids. This is partially because I made the frontend first. An alternative (because the only point where we want to find hearing based on transcription is the Assembly AI webhook) is having Assembly AI provide the hearing id in a header or something, but I thought transcription->hearing may be a query functionality we would want regardless.
There was a problem hiding this comment.
Alternately I could remove transcriptionId from each video, where the only reason it was a hard requirement from me is that that is what the frontend I made earlier expects
There was a problem hiding this comment.
Meh, doesn't seem worth the effort to alter this given that it's not much extra data being stored - this is a fine work-around (and can be tossed on our incentive pile to push us to prioritizing getting a normal SQL database into our stack).
| eventId: Number.optional() | ||
| }) | ||
|
|
||
| function migrateVideo( |
There was a problem hiding this comment.
nit: 👍 to splitting this out into a helper, but I'd expect a function called migrateVideo to actually migrate the video - maybe a name more like getVideoUpdates or getHearingVideoUpdates would be more precise
There was a problem hiding this comment.
Changed name to 'getVideoFormatUpdate'?
| } | ||
| } | ||
|
|
||
| function removeCommonWords(strings: string[]) { |
There was a problem hiding this comment.
nit: This is great, but probably worth extracting into its own file (since it has no dependencies and is separately testable).
There was a problem hiding this comment.
I can put these in helpers.ts and add tests to helpers.test.ts?
| ) | ||
| : {} | ||
|
|
||
| const transcriptionIds = await Promise.all( |
There was a problem hiding this comment.
If this is potentially processing multiple videos at the same time, do we need to adjust the function's memory/timeouts?
There was a problem hiding this comment.
Also, if we have e.g. 3 videos for a hearing and hit a problem processing the third, does that mean we could get stuck in a loop where:
- We try to process a Hearing with 3 videos
- We successfully hit AssemblyAI via
submitTranscriptionfor two of them - We timeout/error/whatever on the third hearing
- None of the progress we made was saved
- We still rack up AssemblyAI charges for the transcripts we submitted
- We continue racking up AssemblyAI charges every hour, as we keep trying to re-process the hearing
I'd love some more explicit protections against that case (especially now that our starter deal with AssemblyAI has ended). The old scraper just saved synchronously after each hearing (partly because it assumed one hearing = one video). Could we save progress after each transcription request to Firestore?
(FWIW This isn't an unbounded potential for charges - we have a spending cap with AssemblyAI)
There was a problem hiding this comment.
True, I will rewrite this to be more robust. I am unsure how much memory/time that ffmpeg function needs.
There was a problem hiding this comment.
I'm not sure if it will actually be an issue - we've come pretty close to the time limit for some longer videos (and have had to adjust both timeout and memory up accordingly), but given that the hearings with multiple videos seem to just be one hearing broken up into multiple files, I'm not sure we'd actually be covering more data for a hearing with multiple videos.
I'd say don't worry about this too much for now - but we should keep an eye on these functions post-launch to see if we need to tweak the configs.
(More robust video saving would also alleviate my main concern here - a partial success infinite loop)
| }> { | ||
| const videos = await this.getHearingVideos(EventId) | ||
|
|
||
| const prevURLs = existingVideos |
There was a problem hiding this comment.
nit: I would love a unit test covering this since it's the check that prevents us from repeatedly re-processing videos.
To my understanding, the development firebase can be |
Summary
Added backend support for multiple video handling.
Changes:
Created an emulator for returning Assembly AI style transcripts when testing locally without setting ASSEMBLY_API_KEY.
Created a backfill function called backfillHearingVideoFormat.
Changed backfillHearingTranscriptions to support multiple videos.
Split video/Assembly AI work from HearingScraper/scrapeHearings into a different format called EventPostProcessor meant to update events after they have occurred, which is operated as a separate HearingPostProcessor/scrapeVideos.
Notes:
Checklist
firestore.indexes.json(Please do not only create indexes through the Firebase Web UI, even though the error messages may reccommend it - indexes created this way may be obliterated by subsequent deploys) - I do not believe this is relevant? I have not changed firestore.indexes.json.Known issues
Not tested full pipeline for bucket creation->Assembly AI, it's worth testing in full.
Not tested the ballot ids page; needs testing on dev.
Steps to test/reproduce
yarn firebase-admin run-script backfillHearingVideoFormat --env local)yarn firebase-admin run-script backfillHearingTranscription --env local) and for specific hearings (yarn firebase-admin run-script backfillHearingTranscription --env local --eventId 4258) that exist in the database. Test that rerunning this function without --recreateTranscripts does not create new transcriptIds and vice versa.curl 'http://localhost:5001/demo-dtp/us-central1/triggerPubsubFunction?scheduled=scrapeHearings') (curl 'http://localhost:5001/demo-dtp/us-central1/triggerPubsubFunction?scheduled=scrapeVideos')ASSEMBLY_API_KEYinfunctions/.secret.localConversion process
Run
yarn firebase-admin run-script backfillHearingVideoFormat(This might be too much at once?)Run
yarn firebase-admin run-script backfillHearingTranscription(This runs in batches, I think) - env var ASSEMBLY_API_KEY must be set