[Hackathon] feat: add LLM file source#5105
Open
tanishqgandhi1908 wants to merge 2 commits into
Video Submission
https://youtu.be/vbJ5N4HtoUE
What changes were proposed in this PR?
This PR adds an `LLM File Source` operator that turns irregular files into usable no-code workflow inputs.
Story / motivation
`Smart Source` works well when a file already has a known machine-readable structure. But many real datasets arrive as PDFs, vendor reports, and semi-structured exports where the user can see the tables, yet Texera has no ready-made parser to begin with.
The goal here is to keep that user inside the workflow canvas instead of sending them out to write preprocessing code first.
`Filter + Projection` branches directly from the property panel
1. Upload a folder of PDFs.
2. The LLM identifies the data, splits it into tables, and suggests operators that are added to the workflow on click.
3. Run the workflow to fetch the data from the PDFs and analyze it.
Main changes
- `LLM File Source`, a new source operator for irregular files such as PDFs, reports, and similarly structured folders.
- A `__table__` discriminator and a dense union schema so downstream operators can branch safely by detected table.
- `Filter + Projection` branches from the operator property panel.
Any related issues, documentation, discussions?
How was this PR tested?
Manual verification:
- … `LLM File Source`; it produced 17 source rows and split them into 12 `revenue_by_region` rows and 5 `headcount_by_department` rows.
- … `revenue_by_region` rows and 10 `headcount_by_department` rows.
- The `Filter + Projection` action creates the expected downstream branches.
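For reviewers, here is a rough sketch of how the `__table__` discriminator and the dense union schema listed under Main changes might fit together. It is not the operator's actual code. It assumes that "dense union schema" means a single schema holding every column of every detected table, with nulls where a column does not apply to a row. The function names `dense_union` and `filter_and_project`, the column names, and the sample values are all invented for illustration; only `__table__` and the two table names come from this PR.

```python
DISCRIMINATOR = "__table__"  # column name taken from the PR description


def dense_union(tables):
    """Merge per-table rows into one row set that shares a single schema.

    `tables` maps a detected table name to a list of row dicts.
    Every output row carries every column; columns that do not belong
    to a row's table are filled with None. (Assumed meaning of
    "dense union schema"; hypothetical helper, not Texera code.)
    """
    columns = [DISCRIMINATOR]
    for rows in tables.values():
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)

    merged = []
    for name, rows in tables.items():
        for row in rows:
            record = {col: row.get(col) for col in columns}
            record[DISCRIMINATOR] = name  # tag each row with its source table
            merged.append(record)
    return columns, merged


def filter_and_project(rows, table, keep):
    """One downstream branch: filter on the discriminator, then project."""
    return [
        {col: row[col] for col in keep}
        for row in rows
        if row[DISCRIMINATOR] == table
    ]


# Invented sample data; the real operator extracts rows from the uploaded files.
detected = {
    "revenue_by_region": [
        {"region": "EMEA", "revenue": 120.0},
        {"region": "APAC", "revenue": 95.5},
    ],
    "headcount_by_department": [
        {"department": "Sales", "headcount": 14},
    ],
}

columns, rows = dense_union(detected)
revenue = filter_and_project(rows, "revenue_by_region", ["region", "revenue"])
headcount = filter_and_project(
    rows, "headcount_by_department", ["department", "headcount"]
)

print(columns)
print(revenue)
print(headcount)
```

Because every row has the same columns, each branch only needs an equality filter on `__table__` followed by a projection onto that table's own columns, which is presumably what the generated `Filter + Projection` pairs do.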