
[Hackathon] feat: add LLM file source #5105

Open
tanishqgandhi1908 wants to merge 2 commits into apache:main from tanishqgandhi1908:feat/llm-file-source


tanishqgandhi1908 commented May 16, 2026

Video Submission

https://youtu.be/vbJ5N4HtoUE

What changes were proposed in this PR?

This PR adds an LLM File Source operator that turns irregular files into usable no-code workflow inputs.

Story / motivation

Smart Source works well when a file already has a known machine-readable structure. But many real datasets arrive as PDFs, vendor reports, and semi-structured exports where the user can see the tables, yet Texera has no ready-made parser to begin with.

The goal here is to keep that user inside the workflow canvas instead of sending them out to write preprocessing code first.

| User task | Before | After |
| --- | --- | --- |
| Start from an irregular PDF/report | Manually extract tables or hand-write a parser outside the workflow | Select the file/folder and let Texera generate a parser tailored to that input |
| Reuse the extracted data | Build custom branching logic by hand | See detected logical tables and create Filter + Projection branches directly from the property panel |
| Process repeated reports | Handle each report separately | Point the same source at a folder of similarly structured reports and parse them together |

Upload a folder of PDFs


The LLM identifies the data, splits it into logical tables, and suggests operators that are added to the workflow on click


Run the workflow to extract data from the PDFs and analyze it


Main changes

  1. Add LLM File Source, a new source operator for irregular files such as PDFs, reports, and similarly structured folders.
  2. Add a generation flow that samples the input, asks the LLM for logical table definitions plus parser code, validates the generated code with syntax checks and a dry run, and retries repairs when needed.
  3. Represent multi-table outputs through a __table__ discriminator and a dense union schema so downstream operators can branch safely by detected table.
  4. Add frontend support to show detected tables and create Filter + Projection branches from the operator property panel.
  5. Extend folder handling so folder-backed datasets are materialized locally for parser execution and can combine repeated reports into one workflow source.
  6. Capture Python worker stdout/stderr so generated parser failures are easier to diagnose from workflow execution.
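The generation flow in item 2 (ask the LLM for parser code, validate it with a syntax check and a dry run, then retry with a repair request) can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `validate_parser`, `generate_parser`, the `ask_llm` callable, the `parse(text)` entry point, and the retry limit are all hypothetical names and choices made for the example.

```python
import ast
from typing import Callable, Optional


def validate_parser(code: str, sample: str) -> Optional[str]:
    """Return None if the code passes both checks, otherwise an error message."""
    # Step 1: syntax check without executing anything.
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"syntax error: {exc}"

    # Step 2: dry run against the sampled input.
    # Assumption: generated code defines parse(text) returning a list of rows.
    # A real system would run this in an isolated worker, not in-process.
    namespace: dict = {}
    try:
        exec(code, namespace)
        rows = namespace["parse"](sample)
        if not isinstance(rows, list):
            return "parse() must return a list of rows"
    except Exception as exc:
        return f"dry run failed: {exc!r}"
    return None


def generate_parser(
    ask_llm: Callable[[str, Optional[str]], str],
    sample: str,
    max_attempts: int = 3,
) -> str:
    """Ask for parser code, feeding the last validation error back on each retry."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        code = ask_llm(sample, error)
        error = validate_parser(code, sample)
        if error is None:
            return code
    raise RuntimeError(f"no valid parser after {max_attempts} attempts: {error}")


# Stand-in for the LLM: the first answer has a syntax error, the repair is valid.
def fake_llm(sample: str, previous_error: Optional[str]) -> str:
    if previous_error is None:
        return "def parse(text) return []"
    return (
        "def parse(text):\n"
        "    return [line.split(',') for line in text.splitlines()]"
    )


code = generate_parser(fake_llm, "a,b\n1,2")
```

Passing the previous error back into the next request is one plausible way to implement the "retries repairs" step; the PR may structure the repair prompt differently.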

Any related issues, documentation, discussions?

How was this PR tested?

JAVA_HOME=$(/usr/libexec/java_home -v 17) sbt "testOnly org.apache.texera.web.resource.LLMSourceResourceSpec org.apache.texera.amber.operator.source.scan.FolderInputResolverSpec org.apache.texera.amber.operator.source.llm.LLMFileSourceOpDescSpec"

Manual verification:

  1. Ran a single PDF through LLM File Source; it produced 17 source rows and split them into 12 revenue_by_region rows and 5 headcount_by_department rows.
  2. Ran a folder with two similarly structured PDF reports; it produced 34 source rows and split them into 24 revenue_by_region rows and 10 headcount_by_department rows.
  3. Verified the property panel shows detected tables and that the Filter + Projection action creates the expected downstream branches.
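The per-table split verified above relies on the `__table__` discriminator and the union schema from the main changes. The sketch below shows one way such a layout could work, assuming "dense" means every row carries every column of the union, with columns from other tables set to `None`. The column names (`region`, `revenue`, `department`, `headcount`), the sample values, and the helper functions are hypothetical; only the table names and `__table__` come from this PR.

```python
from typing import Any, Dict, List

# Hypothetical column lists for the two logical tables named in this PR.
TABLES: Dict[str, List[str]] = {
    "revenue_by_region": ["region", "revenue"],
    "headcount_by_department": ["department", "headcount"],
}


def union_schema(tables: Dict[str, List[str]]) -> List[str]:
    """Discriminator column first, then every table column once, in order."""
    columns = ["__table__"]
    for cols in tables.values():
        for col in cols:
            if col not in columns:
                columns.append(col)
    return columns


def to_union_row(table: str, record: Dict[str, Any], schema: List[str]) -> Dict[str, Any]:
    """Place a record in the union schema; columns of other tables stay None."""
    row: Dict[str, Any] = {col: None for col in schema}
    row["__table__"] = table
    row.update(record)
    return row


def branch(rows: List[Dict[str, Any]], table: str, tables: Dict[str, List[str]]) -> List[Dict[str, Any]]:
    """Filter on __table__, then project onto that table's own columns."""
    return [
        {col: row[col] for col in tables[table]}
        for row in rows
        if row["__table__"] == table
    ]


schema = union_schema(TABLES)
rows = [
    to_union_row("revenue_by_region", {"region": "EMEA", "revenue": 120}, schema),
    to_union_row("headcount_by_department", {"department": "Sales", "headcount": 8}, schema),
]
revenue = branch(rows, "revenue_by_region", TABLES)
```

Here `branch` plays the role of one Filter + Projection pair: the filter keeps rows whose `__table__` matches, and the projection drops the columns that belong to other tables.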

github-actions bot added the feature, engine, dependencies, python, frontend, common, and agent-service labels May 16, 2026