Remove Line Numbers from Bill PDF Scraping #2157

@Mephistic

Summary

We just added a new codepath to scrape the Document text of a "Bill" from the legislature-provided PDF as a fallback for when the legislature doesn't make that text available via the API's content.DocumentText field.

It looks like this is working reasonably well, but we've just noticed an issue in one particular case: in addition to the text of the bill, we are also scraping the line numbers (which is technically legible, but looks noticeably off).

To remedy this, we should filter out line numbers when scraping the text from a bill's PDF.
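
As a rough illustration of the kind of filter this could be, here is a minimal sketch in Python. It is not taken from the existing scraper: the function name `strip_line_numbers`, the `min_run` parameter, and the assumption that each PDF line number shows up at the start of a scraped line (followed by whitespace or alone on its line) and counts up from 1 are all hypothetical. The sketch only strips numbers that continue a 1, 2, 3, ... sequence, so text without line numbers should pass through unchanged.

```python
import re

# Assumption: a line number is 1-4 digits at the start of a line,
# followed by whitespace or the end of the line.
_LEADING_NUMBER = re.compile(r"^\s*(\d{1,4})(?:\s+|$)")


def strip_line_numbers(text: str, min_run: int = 3) -> str:
    """Remove leading line numbers that form a 1, 2, 3, ... sequence.

    If fewer than `min_run` lines fit the sequence, the text is returned
    unchanged, on the guess that it has no line numbers at all.
    """
    lines = text.splitlines()
    expected = 1
    cut_at = {}  # line index -> offset where the real text starts

    for i, line in enumerate(lines):
        m = _LEADING_NUMBER.match(line)
        # Only treat the number as a line number if it is the next one
        # in the sequence; other leading numbers are left alone.
        if m and int(m.group(1)) == expected:
            cut_at[i] = m.end()
            expected += 1

    if len(cut_at) < min_run:
        return text

    return "\n".join(
        line[cut_at[i]:] if i in cut_at else line
        for i, line in enumerate(lines)
    )
```

This sketch does not handle numbering that restarts on each page; if the real PDFs do that, the sequence check would need to be adjusted.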

Success Criteria

  • PDF-scraped text should not include line numbers
  • Existing successful text scraping should be unaffected

Additional Links

  • Example bill that hit this issue: https://maple-dev.vercel.app/bills/194/H5469
    • This has a DocumentText of null in the API, so it will use the PDF fallback, but we have also scraped line numbers (visible if you go to the page and click "View Text" to open the bill text modal).

@Smoss Related to your recent work on PDF scraping

Labels

bug (Something isn't working)
