
Multi video backend #2153

Open
delexagon wants to merge 12 commits into codeforboston:main from delexagon:multi-video-backend

Conversation

@delexagon

@delexagon delexagon commented Jun 2, 2026

Collaborator

Summary

Added backend support for hearings with multiple videos.

Changes:

  • Created an emulator that returns Assembly AI-style transcripts for local testing when ASSEMBLY_API_KEY is not set.
  • Created a backfill function called backfillHearingVideoFormat.
  • Changed backfillHearingTranscriptions to support multiple videos.
  • Split the video/Assembly AI work out of HearingScraper/scrapeHearings into a new EventPostProcessor abstraction, meant to update events after they have occurred. It runs separately as HearingPostProcessor/scrapeVideos.

Notes:

  • backfillHearingVideoFormat will convert the hearings into the new format
  • backfillHearingTranscriptions will fetch all videos for hearings
  • Interesting hearings to test:
    • 2709 has a video that has duplicate uploads, one labeled MASTER and the other labeled archive.
    • 2731 is like 2709, but one of the listed urls has a video of 2 hours of a "Missing File" screen.
    • 2858 has two seemingly identical videos which are also identically named with completely different URLs.
  • A list of all hearings known to have multiple videos up to hearing 5471 is [13, 14, 71, 91, 104, 138, 167, 187, 203, 214, 217, 292, 501, 680, 861, 2118, 2137, 2271, 2289, 2290, 2300, 2476, 2662, 2680, 2709, 2731, 2735, 2858, 2904, 2967, 3073, 3080, 3125, 3167, 3171, 3243, 3317, 3362, 3377, 3381, 3402, 3470, 3480, 3486, 3521, 3579, 3580, 3586, 3642, 3646, 3659, 3660, 3668, 3677, 3685, 3689, 3695, 3713, 3716, 3733, 3774, 3792, 3819, 3829, 3846, 3887, 3891, 3892, 3921, 3930, 3933, 3951, 3976, 3988, 4000, 4016, 4049, 4052, 4065, 4071, 4082, 4111, 4112, 4126, 4127, 4149, 4158, 4201, 4258, 4278, 4458, 4469, 4470, 4558, 4600, 4612, 4641, 4699, 4709, 4711, 4734, 4777, 4847, 4880, 5099, 5173, 5207, 5362, 5382, 5441, 5465, 5471].
  • Assembly AI is connected externally only if the environment variable ASSEMBLY_API_KEY has been set.
  • I don't think ${process.env.FUNCTIONS_API_BASE}/transcription points to localhost:5001 in the emulator, so I set it manually. Maybe it should be more generalized.
  • The new ballotquestions pages seem to reference the videoURLs, but not use them.
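
A minimal sketch of the emulator switch described in the notes above, assuming a hypothetical `TranscriptClient` interface. The names `TranscriptStub`, `FakeTranscriptClient`, and `pickTranscriptClient`, and the response fields, are illustrative and are not taken from this PR; only the `ASSEMBLY_API_KEY` variable comes from the description.

```typescript
// Hypothetical shape of an Assembly AI-style response (assumption, not the PR's type).
type TranscriptStub = { id: string; status: string; audio_url: string }

interface TranscriptClient {
  submit(audioUrl: string): Promise<TranscriptStub>
}

// Local stand-in: returns a queued transcript without any network call.
class FakeTranscriptClient implements TranscriptClient {
  private counter = 0

  async submit(audioUrl: string): Promise<TranscriptStub> {
    this.counter += 1
    return { id: `local-${this.counter}`, status: "queued", audio_url: audioUrl }
  }
}

// Use the real client only when a key is configured; otherwise emulate.
function pickTranscriptClient(
  apiKey: string | undefined,
  makeRealClient: (key: string) => TranscriptClient
): TranscriptClient {
  return apiKey ? makeRealClient(apiKey) : new FakeTranscriptClient()
}
```

In the functions code this would presumably be called with `process.env.ASSEMBLY_API_KEY`; taking the key as a parameter keeps the switch testable without touching the environment.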

Checklist

  • If I've added new Firestore queries, I've added any new required indexes to firestore.indexes.json (Please do not only create indexes through the Firebase Web UI, even though the error messages may recommend it - indexes created this way may be obliterated by subsequent deploys) - I do not believe this is relevant? I have not changed firestore.indexes.json.

Known issues

I have not tested the full pipeline from bucket creation through Assembly AI; it is worth testing end to end.
I have not tested the ballot ids page; it needs testing on dev.

Steps to test/reproduce

  1. Test backfillHearingVideoFormat (yarn firebase-admin run-script backfillHearingVideoFormat --env local)
  2. Test backfillHearingTranscription for all hearings (yarn firebase-admin run-script backfillHearingTranscription --env local) and for specific hearings (yarn firebase-admin run-script backfillHearingTranscription --env local --eventId 4258) that exist in the database. Test that rerunning this function without --recreateTranscripts does not create new transcriptIds and vice versa.
  3. Test the functions scrapeSingleHearing and scrapeSingleHearingv2
curl -X POST 'http://localhost:5001/demo-dtp/us-central1/scrapeSingleHearingv2' \
  -H "Content-Type: application/json" \
  -d '{"data": { "eventId": 3713 }}'
  4. Test pubsub functions (curl 'http://localhost:5001/demo-dtp/us-central1/triggerPubsubFunction?scheduled=scrapeHearings') (curl 'http://localhost:5001/demo-dtp/us-central1/triggerPubsubFunction?scheduled=scrapeVideos')
  5. Test that hearing indexing is functional
  6. Test that Assembly AI is interpreted properly after changing ASSEMBLY_API_KEY in functions/.secret.local
  7. Test migrateHearingTranscription (After conversion of dev)
  8. Check that whatever the heck the ballot ids page is doing hasn't been broken

Conversion process

Run yarn firebase-admin run-script backfillHearingVideoFormat (This might be too much at once?)
Run yarn firebase-admin run-script backfillHearingTranscription (This runs in batches, I think) - env var ASSEMBLY_API_KEY must be set

@vercel

vercel Bot commented Jun 2, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
maple-dev Ready Ready Preview, Comment Jun 17, 2026 1:08am


@delexagon delexagon marked this pull request as ready for review June 10, 2026 00:15
@delexagon delexagon marked this pull request as draft June 10, 2026 00:40
@delexagon delexagon marked this pull request as ready for review June 10, 2026 00:44

@Mephistic Mephistic left a comment

Collaborator

This is a fantastic start - thanks for taking this on!

What is the rollout plan for this?

  • Will we need to pause the hearings scraper/s before deploying and/or while the backfills run?
  • Can this be safely rolled back if need be?

- **Status**: "Occurred" if `hearing.content.startsAt` is in the past, "Scheduled" if in the future
- **Date**: formatted from `hearing.content.startsAt`
- **Watch link**: "Watch the committee hearing here." linked to `hearing.videoURL` — hidden if no video
- **Watch link**: "Watch the committee hearing here." linked to `hearing.videoURLs` — hidden if no videos

Collaborator

If we're linking to a hearing, we should probably just link to the /hearing/<HEARING_ID> page on the MAPLE site. Even if we do link to the legislature, we should link to the hearing page, not directly to the videoURL.

(Not directly relevant here - more a statement about how we shouldn't invest too many calories into making the Ballot Questions' Hearing type support multiple video URLs - at most, I suspect we'll want to link to the "first" video)

@delexagon delexagon Jun 17, 2026

Collaborator Author

I noticed this discrepancy during my final check for remaining uses of videoURL. I don't know what the ballot page looks like, and the videoURL field does not currently appear to be used. I will fully defer to your judgement on how we want to address this. To be honest, I meant this more as a signal to the contractors working on the ballot questions that there are now multiple videos per hearing, and to let them handle that how they will.

Comment thread scripts/firebase-admin/migrateHearingTranscription.ts Outdated
Comment thread scripts/firebase-admin/backfillHearingVideoFormat.ts Outdated
videoURL: Optional(String),
videoTranscriptionId: Optional(String),
videoFetchedAt: Optional(InstanceOf(Timestamp)),
videos: Array(Video),

Collaborator

OOC why do we need a videos array where each item will have a transcriptionId along with a separate transcriptionIds array?

Collaborator

Oh, I see - it's for the lookup from the webhook 👍

Collaborator Author

Yes, this was an annoying limitation - AFAIK Firebase does not have elemMatch-like queries, so I created a duplicate array for looking up hearings by transcription id. This is partially because I made the frontend first. An alternative (because the only point where we want to find a hearing based on its transcription is the Assembly AI webhook) is having Assembly AI provide the hearing id in a header or something, but I thought transcription->hearing may be query functionality we would want regardless.

Collaborator Author

Alternatively, I could remove transcriptionId from each video; the only reason I treated it as a hard requirement is that the frontend I made earlier expects it.

Collaborator

Meh, doesn't seem worth the effort to alter this given that it's not much extra data being stored - this is a fine work-around (and can be tossed on our incentive pile to push us to prioritize getting a normal SQL database into our stack).
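
A minimal sketch of the work-around discussed in this thread, assuming a hypothetical `Video` shape: the flat `transcriptionIds` array is derived from `videos` so that the webhook can find a hearing by transcription id. The helper names are illustrative, and the lookup below is an in-memory stand-in; in Firestore the equivalent would be an `array-contains` filter on the `transcriptionIds` field.

```typescript
// Hypothetical shapes (assumptions); only the field names videos,
// transcriptionId, and transcriptionIds come from the review thread.
type Video = { url: string; transcriptionId?: string }
type Hearing = { id: string; videos: Video[]; transcriptionIds: string[] }

// Build the denormalized lookup array from the per-video ids, skipping
// videos without a transcript and avoiding duplicate entries.
function deriveTranscriptionIds(videos: Video[]): string[] {
  const ids: string[] = []
  for (const video of videos) {
    const id = video.transcriptionId
    if (id && ids.indexOf(id) === -1) ids.push(id)
  }
  return ids
}

// In-memory stand-in for a Firestore query such as
// where("transcriptionIds", "array-contains", transcriptionId).
function findHearingByTranscriptionId(
  hearings: Hearing[],
  transcriptionId: string
): Hearing | undefined {
  return hearings.filter(
    h => h.transcriptionIds.indexOf(transcriptionId) !== -1
  )[0]
}
```

Deriving the array in one place, rather than updating both fields by hand, is one way to keep the two copies from drifting apart.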

eventId: Number.optional()
})

function migrateVideo(

Collaborator

nit: 👍 to splitting this out into a helper, but I'd expect a function called migrateVideo to actually migrate the video - maybe a name more like getVideoUpdates or getHearingVideoUpdates would be more precise

Collaborator Author

Changed name to 'getVideoFormatUpdate'?
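
To illustrate the naming point, here is a hedged sketch of a helper that only computes the update and leaves the write to the caller. The body is an assumption, not the PR's implementation: the legacy field names come from the diff above, while the `Video` shape, the return type, and the use of `Date` in place of Firestore's `Timestamp` are placeholders.

```typescript
// Legacy single-video fields named in this PR's diff; Date stands in for
// Firestore's Timestamp so the sketch has no dependencies.
type LegacyHearing = {
  videoURL?: string
  videoTranscriptionId?: string
  videoFetchedAt?: Date
  videos?: Video[]
}
// Hypothetical per-video shape (assumption).
type Video = { url: string; transcriptionId?: string; fetchedAt?: Date }
type VideoFormatUpdate = { videos: Video[]; transcriptionIds: string[] }

// Returns the fields to write, or null when the hearing is already in the
// new format. It performs no writes itself.
function getVideoFormatUpdate(hearing: LegacyHearing): VideoFormatUpdate | null {
  if (hearing.videos !== undefined) return null
  if (!hearing.videoURL) return { videos: [], transcriptionIds: [] }

  const video: Video = { url: hearing.videoURL }
  if (hearing.videoTranscriptionId) {
    video.transcriptionId = hearing.videoTranscriptionId
  }
  if (hearing.videoFetchedAt) video.fetchedAt = hearing.videoFetchedAt

  return {
    videos: [video],
    transcriptionIds: video.transcriptionId ? [video.transcriptionId] : []
  }
}
```

Because the function is pure, the backfill script can log or dry-run the computed update before committing it.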

}
}

function removeCommonWords(strings: string[]) {

Collaborator

nit: This is great, but probably worth extracting into its own file (since it has no dependencies and is separately testable).

@delexagon delexagon Jun 17, 2026

Collaborator Author

I can put these in helpers.ts and add tests to helpers.test.ts?
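
As a rough illustration of why a dependency-free helper is easy to test in isolation, here is a sketch under a loud assumption: the body of `removeCommonWords` is not shown in this conversation, so the behavior below (drop every word that appears in all of the input strings) is a guess, and the function is named `removeCommonWordsSketch` to make that clear.

```typescript
// Assumed behavior (not confirmed by the PR): remove every whitespace-separated
// word that occurs in all of the input strings, keeping the remaining words
// in their original order.
function removeCommonWordsSketch(strings: string[]): string[] {
  // With fewer than two strings there is nothing to compare against.
  if (strings.length < 2) return strings.slice()

  const tokenized = strings.map(s => s.split(/\s+/).filter(w => w.length > 0))
  const first = tokenized[0] ?? []
  const common = first.filter(word =>
    tokenized.every(words => words.indexOf(word) !== -1)
  )

  return tokenized.map(words =>
    words.filter(word => common.indexOf(word) === -1).join(" ")
  )
}

// Titles modeled on the MASTER/archive duplicates mentioned for hearing 2709.
const labels = removeCommonWordsSketch([
  "Joint Committee Hearing MASTER",
  "Joint Committee Hearing archive"
])
console.log(labels) // [ 'MASTER', 'archive' ]
```

A test file next to the helper could then cover edge cases such as a single title or fully identical titles without pulling in any scraper code.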

)
: {}

const transcriptionIds = await Promise.all(

Collaborator

If this is potentially processing multiple videos at the same time, do we need to adjust the function's memory/timeouts?

Collaborator

Also, if we have e.g. 3 videos for a hearing and hit a problem processing the third, does that mean we could get stuck in a loop where:

  • We try to process a Hearing with 3 videos
  • We successfully hit AssemblyAI via submitTranscription for two of them
  • We timeout/error/whatever on the third video
  • None of the progress we made was saved
  • We still rack up AssemblyAI charges for the transcripts we submitted
  • We continue racking up AssemblyAI charges every hour, as we keep trying to re-process the hearing

I'd love some more explicit protections against that case (especially now that our starter deal with AssemblyAI has ended). The old scraper just saved synchronously after each hearing (partly because it assumed one hearing = one video). Could we save progress after each transcription request to Firestore?

(FWIW This isn't an unbounded potential for charges - we have a spending cap with AssemblyAI)

Collaborator Author

True, I will rewrite this to be more robust. I am unsure how much memory/time that ffmpeg function needs.

Collaborator

I'm not sure if it will actually be an issue - we've come pretty close to the time limit for some longer videos (and have had to adjust both timeout and memory up accordingly), but given that the hearings with multiple videos seem to just be one hearing broken up into multiple files, I'm not sure we'd actually be covering more data for a hearing with multiple videos.

I'd say don't worry about this too much for now - but we should keep an eye on these functions post-launch to see if we need to tweak the configs.

(More robust video saving would also alleviate my main concern here - a partial success infinite loop)
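
One way to guard against the partial-success loop raised in this thread is to submit videos one at a time and persist after each submission. The sketch below assumes hypothetical signatures: `submitTranscription` is named in the review, but its parameters here, the `saveProgress` callback, and the `Video` shape are placeholders rather than the PR's code.

```typescript
// Hypothetical per-video shape (assumption).
type Video = { url: string; transcriptionId?: string }

// Submit each video in turn and checkpoint after every success, so a failure
// on a later video does not discard (or re-bill) the earlier submissions.
async function transcribeWithCheckpoints(
  videos: Video[],
  submitTranscription: (url: string) => Promise<string>,
  saveProgress: (videos: Video[]) => Promise<void>
): Promise<Video[]> {
  // Work on copies so the caller's objects are not mutated.
  const result = videos.map(v => ({ ...v }))

  for (const video of result) {
    // Already submitted on an earlier run: skip, do not pay again.
    if (video.transcriptionId) continue

    video.transcriptionId = await submitTranscription(video.url)
    // Persist before moving on to the next video.
    await saveProgress(result)
  }
  return result
}
```

This trades the parallelism of `Promise.all` for bounded retries: after a failure on the third video, a rerun that starts from the saved state only submits the video that is still missing an id.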

Comment thread functions/src/webhooks/transcription.ts Outdated
Comment thread functions/src/events/scrapeEvents.ts
}> {
const videos = await this.getHearingVideos(EventId)

const prevURLs = existingVideos

Collaborator

nit: I would love a unit test covering this since it's the check that prevents us from repeatedly re-processing videos.
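
A sketch of what such a test could exercise, assuming the check compares scraped videos against previously stored URLs. The function name `selectNewVideos` and the `Video` shape are hypothetical; only the `prevURLs` and `existingVideos` names come from the diff above.

```typescript
// Hypothetical per-video shape (assumption).
type Video = { url: string; title?: string }

// Keep only scraped videos whose URL has not been stored before, and drop
// repeated URLs within the same scrape.
function selectNewVideos(existingVideos: Video[], scraped: Video[]): Video[] {
  const prevURLs = existingVideos.map(v => v.url)
  const seen: string[] = []
  return scraped.filter(video => {
    if (prevURLs.indexOf(video.url) !== -1) return false
    if (seen.indexOf(video.url) !== -1) return false
    seen.push(video.url)
    return true
  })
}

// Two identically named videos with different URLs (as reported for hearing
// 2858) are treated as distinct, because the comparison is by URL only.
const fresh = selectNewVideos(
  [{ url: "https://example.test/1", title: "Hearing" }],
  [
    { url: "https://example.test/1", title: "Hearing" },
    { url: "https://example.test/2", title: "Hearing" }
  ]
)
console.log(fresh.map(v => v.url)) // [ 'https://example.test/2' ]
```

Comparing by URL is itself an assumption; if the real check also considers titles or fetch times, the test cases would need to change accordingly.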

@delexagon

Collaborator Author

> This is a fantastic start - thanks for taking this on!
>
> What is the rollout plan for this?
>
> * Will we need to pause the hearings scraper/s before deploying and/or while the backfills run?
> * Can this be safely rolled back if need be?

To my understanding, the development Firebase can be:
a. exported to the production Firebase (with the migrateHearingVideos script? It occurs to me that this does not erase videoTranscriptionId if it exists, so this would be incomplete)
b. reverted to a previous state if necessary. In this sense, I don't know what 'rolled back' means. The frontend will understand the previous version of videos, but the backend (particularly the search function) will fail and crash, so it is unlikely the hearing search page would categorize them correctly if the pull request was reverted but not the dev database. Similarly, reverting the database but not the pull request itself means that the search data for old hearings would not update. Thus, both of them should be rolled back, if that is feasible.

My expectation was to run backfillHearingVideoFormat in dev (which should not be destructive to information in the db, though I didn't write a reversion script). From the point at which this pull request is deployed until the format is backfilled, scrapers probably shouldn't be run. I don't think this backfill should take too long: it isn't requesting anything from the MA legislature, but it is updating every hearing in the db, which is however many thousand entries. However, the frontend can understand both formats, so there should be no interruption of service while the backfill is being completed.

backfillHearingTranscription should either be run on all hearings or on the hearings I specifically listed (maybe I should write an alternative to run on all hearings after a certain point or all hearings in a JSON (I did create this for testing)). This should not conflict with the scrapers, but it is clear I will have to check more for robustness.

The resulting hearings will be incompatible with the production version of the backend (that is, search interpretation again) until the pull request is deployed there too - should I make a script to convert a multiple-video hearing on dev to a single-video hearing on prod? I am not sure how the dev <-> prod connection works - is it a direct copy in some way, or is it using scripts like migrateHearingTranscription (as mentioned before, this would be incomplete as it is right now)? Is it running the backfill scripts on prod as well? I do not know enough to comment on how this part of the migration would work, or what I need to add to make it successful.

