Multi video backend by delexagon · Pull Request #2153 · codeforboston/maple

delexagon · 2026-06-02T23:21:48Z

Summary

Added backend support for multiple video handling.

Changes:

Created an emulator for returning Assembly AI style transcripts when testing locally without setting ASSEMBLY_API_KEY.
Created a backfill function called backfillHearingVideoFormat.
Changed backfillHearingTranscriptions to support multiple videos.
Split the video/Assembly AI work out of HearingScraper/scrapeHearings into a new EventPostProcessor abstraction, meant to update events after they have occurred. It runs as a separate HearingPostProcessor/scrapeVideos.

Notes:

backfillHearingVideoFormat will convert existing hearings into the new format.
backfillHearingTranscriptions will fetch all videos for hearings.
Interesting hearings to test:
- 2709 has a video that has duplicate uploads, one labeled MASTER and the other labeled archive.
- 2731 is like 2709, but one of the listed urls has a video of 2 hours of a "Missing File" screen.
- 2858 has two seemingly identical videos which are also identically named with completely different URLs.
A list of all hearings known to have multiple videos up to hearing 5471 is [13, 14, 71, 91, 104, 138, 167, 187, 203, 214, 217, 292, 501, 680, 861, 2118, 2137, 2271, 2289, 2290, 2300, 2476, 2662, 2680, 2709, 2731, 2735, 2858, 2904, 2967, 3073, 3080, 3125, 3167, 3171, 3243, 3317, 3362, 3377, 3381, 3402, 3470, 3480, 3486, 3521, 3579, 3580, 3586, 3642, 3646, 3659, 3660, 3668, 3677, 3685, 3689, 3695, 3713, 3716, 3733, 3774, 3792, 3819, 3829, 3846, 3887, 3891, 3892, 3921, 3930, 3933, 3951, 3976, 3988, 4000, 4016, 4049, 4052, 4065, 4071, 4082, 4111, 4112, 4126, 4127, 4149, 4158, 4201, 4258, 4278, 4458, 4469, 4470, 4558, 4600, 4612, 4641, 4699, 4709, 4711, 4734, 4777, 4847, 4880, 5099, 5173, 5207, 5362, 5382, 5441, 5465, 5471].
Assembly AI is connected externally only if the environment variable ASSEMBLY_API_KEY has been set.
I don't think ${process.env.FUNCTIONS_API_BASE}/transcription points to localhost:5001 in the emulator, so I set it manually. Maybe it should be more generalized.
The new ballotquestions pages seem to reference the videoURLs, but not use them.
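
To illustrate the note above about Assembly AI only being contacted when ASSEMBLY_API_KEY is set, here is a minimal sketch of that kind of switch. Every name in it (TranscriptClient, EmulatedTranscriptClient, pickTranscriptClient, the SubmittedTranscript shape) is hypothetical and not taken from this PR; the real emulator returns Assembly AI style transcripts and may be structured quite differently.

```typescript
// Hypothetical sketch: none of these names come from the PR.
interface SubmittedTranscript {
  id: string
  status: "queued"
  audioUrl: string
}

interface TranscriptClient {
  submit(audioUrl: string): Promise<SubmittedTranscript>
}

// Local stand-in that never leaves the machine and hands out predictable ids.
class EmulatedTranscriptClient implements TranscriptClient {
  private nextId = 1

  async submit(audioUrl: string): Promise<SubmittedTranscript> {
    const id = `emulated-${this.nextId++}`
    return { id, status: "queued", audioUrl }
  }
}

type Env = Record<string, string | undefined>

// Only build the external client when a key is present; otherwise emulate.
// makeRemote is injected so this sketch makes no assumption about the real SDK.
function pickTranscriptClient(
  env: Env,
  makeRemote: (apiKey: string) => TranscriptClient
): TranscriptClient {
  const apiKey = env["ASSEMBLY_API_KEY"]
  return apiKey ? makeRemote(apiKey) : new EmulatedTranscriptClient()
}
```

Passing the environment in as an argument, rather than reading process.env inside the function, keeps the switch easy to exercise in a unit test.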

Checklist

If I've added new Firestore queries, I've added any new required indexes to firestore.indexes.json (Please do not only create indexes through the Firebase Web UI, even though the error messages may recommend it - indexes created this way may be obliterated by subsequent deploys) - I do not believe this is relevant? I have not changed firestore.indexes.json.

Known issues

The full pipeline from bucket creation to Assembly AI has not been tested; it's worth testing end to end.
The ballot ids page has not been tested; it needs testing on dev.

Steps to test/reproduce

Test backfillHearingVideoFormat (yarn firebase-admin run-script backfillHearingVideoFormat --env local)
Test backfillHearingTranscription for all hearings (yarn firebase-admin run-script backfillHearingTranscription --env local) and for specific hearings (yarn firebase-admin run-script backfillHearingTranscription --env local --eventId 4258) that exist in the database. Test that rerunning this function without --recreateTranscripts does not create new transcriptIds and vice versa.
Test the functions scrapeSingleHearing and scrapeSingleHearingv2

curl -X POST 'http://localhost:5001/demo-dtp/us-central1/scrapeSingleHearingv2' \
  -H "Content-Type: application/json" \
  -d '{"data": { "eventId": 3713 }}'

Test pubsub functions (curl 'http://localhost:5001/demo-dtp/us-central1/triggerPubsubFunction?scheduled=scrapeHearings') (curl 'http://localhost:5001/demo-dtp/us-central1/triggerPubsubFunction?scheduled=scrapeVideos')
Test that hearing indexing is functional
Test that Assembly AI is interpreted properly after changing ASSEMBLY_API_KEY in functions/.secret.local
Test migrateHearingTranscription (After conversion of dev)
Check that whatever the heck the ballot ids page is doing hasn't been broken

Conversion process

Run yarn firebase-admin run-script backfillHearingVideoFormat (This might be too much at once?)
Run yarn firebase-admin run-script backfillHearingTranscription (This runs in batches, I think) - env var ASSEMBLY_API_KEY must be set

vercel · 2026-06-02T23:21:53Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
maple-dev	Ready	Preview, Comment	Jun 17, 2026 1:08am

Mephistic

This is a fantastic start - thanks for taking this on!

What is the rollout plan for this?

* Will we need to pause the hearings scraper/s before deploying and/or while the backfills run?
* Can this be safely rolled back if need be?

Mephistic · 2026-06-16T20:51:18Z

 - **Status**: "Occurred" if `hearing.content.startsAt` is in the past, "Scheduled" if in the future
 - **Date**: formatted from `hearing.content.startsAt`
- **Watch link**: "Watch the committee hearing here." linked to `hearing.videoURL` — hidden if no video
+- **Watch link**: "Watch the committee hearing here." linked to `hearing.videoURLs` — hidden if no videos


If we're linking to a hearing, we should probably just link to the /hearing/<HEARING_ID> page on the MAPLE site. Even if we do link to the legislature, we should link to the hearing page, not directly to the videoURL.

(Not directly relevant here - more a statement about how we shouldn't invest too many calories into making the Ballot Questions' Hearing type support multiple video URLs - at most, I suspect we'll want to link to the "first" video)

I noticed this discrepancy during my final check for any remaining uses of videoURL. I don't know what the ballot page looks like, and the videoURL field does not currently appear to be used. I will fully defer to your judgement on how we want to address this. To be honest, I meant this more as a signal to the contractors working on the ballot questions that there are now multiple videos per hearing, so they can handle that how they will.

Mephistic · 2026-06-16T21:25:20Z

-  videoURL: Optional(String),
-  videoTranscriptionId: Optional(String),
-  videoFetchedAt: Optional(InstanceOf(Timestamp)),
+  videos: Array(Video),


OOC why do we need a videos array where each item will have a transcriptionId along with a separate transcriptionIds array?

Oh, I see - it's for the lookup from the webhook 👍

Yes, this was an annoying limitation. AFAIK Firebase does not have elemMatch-like queries, so I created a duplicate array of transcription ids purely for querying. This is partially because I made the frontend first. An alternative (because the only point where we want to find a hearing based on its transcription is the Assembly AI webhook) is having Assembly AI provide the hearing id in a header or something, but I thought transcription->hearing may be a query functionality we would want regardless.

Alternatively, I could remove transcriptionId from each video; the only reason it was a hard requirement for me is that it is what the frontend I made earlier expects.

Meh, doesn't seem worth the effort to alter this given that it's not much extra data being stored - this is a fine work-around (and can be tossed on our incentive pile to push us to prioritizing getting a normal SQL database into our stack).
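
For readers following the thread, a minimal sketch of the denormalization being discussed: the flat transcriptionIds array is derived from the per-video transcriptionId values, so a webhook handler can look a hearing up by transcription id. The Video and HearingDoc shapes and all function names here are assumptions for illustration, not the PR's types. The lookup is done in memory; against Firestore, the equivalent query on the flat array would use the "array-contains" operator.

```typescript
// Hypothetical shapes; the PR's real Video type is not shown in this thread.
interface Video {
  url: string
  transcriptionId?: string
}

interface HearingDoc {
  id: string
  videos: Video[]
  transcriptionIds: string[] // duplicate of videos[].transcriptionId, kept only for lookups
}

// Collect the ids that exist, dropping missing values and duplicates.
function deriveTranscriptionIds(videos: Video[]): string[] {
  const ids = videos
    .map(v => v.transcriptionId)
    .filter((id): id is string => typeof id === "string" && id.length > 0)
  return Array.from(new Set(ids))
}

// Always write both fields together so they cannot drift apart.
function buildHearingDoc(id: string, videos: Video[]): HearingDoc {
  return { id, videos, transcriptionIds: deriveTranscriptionIds(videos) }
}

// In-memory stand-in for the webhook's transcription -> hearing lookup.
function findHearingByTranscriptionId(
  hearings: HearingDoc[],
  transcriptionId: string
): HearingDoc | undefined {
  return hearings.find(h => h.transcriptionIds.indexOf(transcriptionId) !== -1)
}
```

Deriving the flat array in one helper, instead of updating it by hand at each call site, is one way to keep the duplicate from going stale.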

Mephistic · 2026-06-16T21:27:45Z

+  eventId: Number.optional()
+})
+
+function migrateVideo(


nit: 👍 to splitting this out into a helper, but I'd expect a function called migrateVideo to actually migrate the video - maybe a name more like getVideoUpdates or getHearingVideoUpdates would be more precise

Changed name to 'getVideoFormatUpdate'?
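
As a rough sketch of what a pure "compute the update" helper in this spirit could look like: it takes the legacy single-video fields and returns the new-format fields without writing anything. The field names inside Video, the inclusion of transcriptionIds in the update, and the use of Date instead of a Firestore Timestamp are all assumptions made so the example is self-contained; only videoURL, videoTranscriptionId, videoFetchedAt and videos appear in the diff above.

```typescript
// Legacy fields as they appear in the removed lines of the diff.
interface LegacyVideoFields {
  videoURL?: string
  videoTranscriptionId?: string
  videoFetchedAt?: Date // assumption: Date stands in for a Firestore Timestamp
}

// Hypothetical new-format shapes.
interface Video {
  url: string
  transcriptionId?: string
  fetchedAt?: Date
}

interface VideoFormatUpdate {
  videos: Video[]
  transcriptionIds: string[]
}

// Pure function: describes the update, leaving the actual write to the caller.
function getVideoFormatUpdate(legacy: LegacyVideoFields): VideoFormatUpdate {
  if (!legacy.videoURL) {
    // Assumption: hearings without a video still get an empty array.
    return { videos: [], transcriptionIds: [] }
  }
  const video: Video = { url: legacy.videoURL }
  if (legacy.videoTranscriptionId) video.transcriptionId = legacy.videoTranscriptionId
  if (legacy.videoFetchedAt) video.fetchedAt = legacy.videoFetchedAt
  return {
    videos: [video],
    transcriptionIds: video.transcriptionId ? [video.transcriptionId] : []
  }
}
```

Keeping the helper free of database calls is what makes a name like getVideoFormatUpdate accurate, and it also makes the conversion straightforward to unit test.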

Mephistic · 2026-06-16T21:56:52Z

+  }
+}
+
+function removeCommonWords(strings: string[]) {


nit: This is great, but probably worth extracting into its own file (since it has no dependencies and is separately testable).

I can put these in helpers.ts and add tests to helpers.test.ts?
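
The body of removeCommonWords is not visible in this thread, so the following is only one plausible reading of what such a helper might do, under the assumption that it strips the words shared by every video title so that the distinguishing part remains. It is deliberately given a different name (stripSharedWords) to avoid implying it matches the PR's implementation.

```typescript
// Hypothetical helper, not the PR's removeCommonWords.
// Removes words that occur (case-insensitively) in every title.
function stripSharedWords(titles: string[]): string[] {
  if (titles.length < 2) return titles.slice() // nothing to compare against

  const tokenized = titles.map(t => t.split(/\s+/).filter(w => w.length > 0))

  // Start with the first title's words, then intersect with each other title.
  let shared = new Set((tokenized[0] ?? []).map(w => w.toLowerCase()))
  for (const words of tokenized.slice(1)) {
    const lower = words.map(w => w.toLowerCase())
    shared = new Set(lower.filter(w => shared.has(w)))
  }

  return tokenized.map(words =>
    words.filter(w => !shared.has(w.toLowerCase())).join(" ")
  )
}
```

With titles like the MASTER/archive pair described for hearing 2709, this keeps only the differing labels. For identically named videos such as those on hearing 2858, every word is shared and the results are empty strings, so a caller would need a fallback label.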

Mephistic · 2026-06-16T22:05:26Z

+        )
+      : {}
+
+    const transcriptionIds = await Promise.all(


If this is potentially processing multiple videos at the same time, do we need to adjust the function's memory/timeouts?

Also, if we have e.g. 3 videos for a hearing and hit a problem processing the third, does that mean we could get stuck in a loop where:

1. We try to process a Hearing with 3 videos
2. We successfully hit AssemblyAI via submitTranscription for two of them
3. We timeout/error/whatever on the third video
4. None of the progress we made was saved
5. We still rack up AssemblyAI charges for the transcripts we submitted
6. We continue racking up AssemblyAI charges every hour, as we keep trying to re-process the hearing

I'd love some more explicit protections against that case (especially now that our starter deal with AssemblyAI has ended). The old scraper just saved synchronously after each hearing (partly because it assumed one hearing = one video). Could we save progress after each transcription request to Firestore?

(FWIW This isn't an unbounded potential for charges - we have a spending cap with AssemblyAI)

True, I will rewrite this to be more robust. I am unsure how much memory/time that ffmpeg function needs.

I'm not sure if it will actually be an issue - we've come pretty close to the time limit for some longer videos (and have had to adjust both timeout and memory up accordingly), but given that the hearings with multiple videos seem to just be one hearing broken up into multiple files, I'm not sure we'd actually be covering more data for a hearing with multiple videos.

I'd say don't worry about this too much for now - but we should keep an eye on these functions post-launch to see if we need to tweak the configs.

(More robust video saving would also alleviate my main concern here - a partial success infinite loop)
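
A minimal sketch of the "save progress after each transcription request" idea from this thread, as an alternative to submitting everything through one Promise.all. The submit and save callbacks are hypothetical stand-ins for submitTranscription and a Firestore write, injected so the example stays self-contained; the Video shape is also an assumption.

```typescript
// Hypothetical shape and callback types.
interface Video {
  url: string
  transcriptionId?: string
}

type Submit = (url: string) => Promise<string> // resolves to a transcription id
type Save = (videos: Video[]) => Promise<void> // persists the current state

// Process videos one at a time and checkpoint after every successful submit,
// so a failure on a later video cannot discard ids that were already paid for.
async function transcribeWithCheckpoints(
  videos: Video[],
  submit: Submit,
  save: Save
): Promise<Video[]> {
  const progress = videos.map(v => ({ ...v })) // never mutate the caller's array

  for (const video of progress) {
    if (video.transcriptionId) continue // submitted on an earlier run; skip

    video.transcriptionId = await submit(video.url)
    await save(progress.map(v => ({ ...v }))) // checkpoint before the next video
  }

  return progress
}
```

The trade-off is that submissions become sequential rather than parallel, which costs some wall-clock time but means a retry only resubmits the videos that never received a transcriptionId.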

Mephistic · 2026-06-16T22:36:25Z

+  }> {
+    const videos = await this.getHearingVideos(EventId)
+
+    const prevURLs = existingVideos


nit: I would love a unit test covering this since it's the check that prevents us from repeatedly re-processing videos.
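
To make the requested test concrete, here is a sketch of how the "have we already seen this URL" check could be isolated into a pure function and exercised. The function name, the Video shape, and the example URLs are all hypothetical; the actual code around prevURLs is only partly visible in the diff. The checks are written as plain throws so the sketch carries no test-runner dependency.

```typescript
// Hypothetical shape and helper.
interface Video {
  url: string
  title?: string
  transcriptionId?: string
}

// Keep previously stored videos (and their transcriptionIds) when the URL is
// unchanged, and report only genuinely new URLs as needing processing.
function mergeScrapedVideos(
  existing: Video[],
  scraped: Video[]
): { merged: Video[]; toProcess: Video[] } {
  const previous = new Map<string, Video>()
  for (const v of existing) previous.set(v.url, v)

  const merged: Video[] = []
  const toProcess: Video[] = []
  const seen = new Set<string>()

  for (const v of scraped) {
    if (seen.has(v.url)) continue // same URL listed twice in one scrape
    seen.add(v.url)

    const known = previous.get(v.url)
    if (known) {
      merged.push(known) // reuse the stored entry, including its transcriptionId
    } else {
      merged.push(v)
      toProcess.push(v)
    }
  }
  return { merged, toProcess }
}

function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(message)
}

const stored: Video[] = [{ url: "https://example.test/a.mp4", transcriptionId: "t-a" }]
const rescrape: Video[] = [
  { url: "https://example.test/a.mp4" },
  { url: "https://example.test/b.mp4" }
]
const result = mergeScrapedVideos(stored, rescrape)
check(result.toProcess.length === 1, "only the new URL should be processed")
check(result.merged[0]?.transcriptionId === "t-a", "existing transcriptionId must survive")
```

A URL-only comparison like this would still treat the two identical-looking videos on hearing 2858 as distinct, since their URLs differ.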

delexagon · 2026-06-17T01:59:15Z

> This is a fantastic start - thanks for taking this on!
>
> What is the rollout plan for this?
>
> * Will we need to pause the hearings scraper/s before deploying and/or while the backfills run?
> * Can this be safely rolled back if need be?

To my understanding, the development firebase can be
a. exported to the production firebase (with migrateHearingVideos script? - it occurs to me that this does not erase videoTranscriptionId if it exists, so this would be incomplete)
b. reverted to a previous state if necessary - in this sense, I don't know what 'rolled back' means. The frontend will understand the previous version of videos, but the backend (particularly the search function) will fail and crash, so it is unlikely the hearing search page would categorize them correctly if the pull request was reverted but not the dev database. Similarly, reverting the database but not the pull request itself means that the search data for old hearings would not update. Thus, both of them should be rolled back, if that is feasible.
My expectation was to run backfillHearingVideoFormat in dev (which should not be destructive to information in the db, though I didn't write a reversion script). From the point at which this pull request is deployed until the format is backfilled, scrapers probably shouldn't be run. I don't think it should take too long to do this backfill? - it isn't requesting anything from the MA legislature, but it is updating every hearing in the db, which is however many thousand entries. However, the frontend can understand both formats, so there should be no interruption of service while this backfill is being completed.
BackfillHearingTranscription should either be run on all hearings or the hearings I specifically listed (maybe I should write an alternative to run on all hearings after a certain point or all hearings in a JSON (I did create this for testing)). This should not conflict with the scrapers, but it is clear I will have to check more for robustness.
The resulting hearings will be incompatible with the production version of the backend (that is, search interpretation again) until the pull request is deployed there too. Should I make a script to convert a multiple-video hearing on dev to a single-video hearing on prod? I am not sure how the dev <-> prod connection works: is it a direct copy in some way, or is it using scripts like migrateHearingTranscription (as mentioned before, this would be incomplete as it is right now)? Is it running the backfill scripts on prod as well? I do not know enough to comment on how this part of the migration would work, or what I need to add to make it successful.

delexagon added 7 commits June 1, 2026 23:56

New attempt at hearing backend

4b793e3

Fastforward to main

3dd28e2

Prettier

9389560

Small fixes

d227c5f

Video title parsing

80fa337

Allowed to reuse existing transcripts

fa46b20

Prettier

5569394

vercel Bot deployed to Preview – maple-dev June 2, 2026 23:28 View deployment

Rewrote additional files

e135a15

vercel Bot deployed to Preview – maple-dev June 3, 2026 23:07 View deployment

Small bugfixing

f449912

delexagon marked this pull request as ready for review June 10, 2026 00:15

delexagon requested review from Mephistic, alexjball, kiminkim724, mertbagt, mvictor55, nesanders, sashamaryl and timblais as code owners June 10, 2026 00:15

vercel Bot deployed to Preview – maple-dev June 10, 2026 00:18 View deployment

delexagon marked this pull request as draft June 10, 2026 00:40

delexagon marked this pull request as ready for review June 10, 2026 00:44

Align to Assembly AI return

76f7afb

vercel Bot deployed to Preview – maple-dev June 10, 2026 01:34 View deployment

Prettier

88df8e8

vercel Bot deployed to Preview – maple-dev June 10, 2026 01:40 View deployment

Mephistic reviewed Jun 16, 2026

Fixing simple bugs

09ee0bc

vercel Bot deployed to Preview – maple-dev June 17, 2026 01:08 View deployment
