Pipeline reliability, LLM-assisted validation, and docs overhaul#359
Open
kyleaoconnell22 wants to merge 5 commits into
Open
Pipeline reliability, LLM-assisted validation, and docs overhaul#359kyleaoconnell22 wants to merge 5 commits into
kyleaoconnell22 wants to merge 5 commits into
Conversation
- Add _retry decorator with exponential backoff to FTPClient and SFTPClient connect/upload methods (3 attempts, 5s/10s/20s delays) - Add Nextflow-level errorStrategy retry to PREP_SUBMISSION and SUBMIT_SUBMISSION - Fix all logging.info/debug calls incorrectly passed file=sys.stderr kwarg - Remove debug print() statements from GenbankSubmission.__init__ - Fix FTP binary upload extension detection (use set lookup; add missing .ready, .zip) - Add env var fallback in SubmissionConfigParser for NCBI credentials (supports Nextflow Secrets: nextflow secrets set NCBI_USERNAME/NCBI_PASSWORD) - Declare NCBI_USERNAME and NCBI_PASSWORD secrets in submission process blocks
- Replace df.applymap() with getattr(df, 'map', df.applymap)() for pandas 2.1+ compatibility (applymap removed in 2.1) - Add --use_llm and --llm_model CLI flags to validate_metadata.py - When --use_llm is set, write error_llm_suggestions.txt alongside error.txt with plain-English explanations and fix suggestions - Pass LLM flags through METADATA_VALIDATION Nextflow process - Declare OPENAI_API_KEY as a Nextflow secret in the process block
- Add --use_llm and --llm_model flags to fetch_submission.py - Pass flags to parse_and_save_reports; writes <batch_id>_llm_interpretation.txt alongside the report CSV when errors or non-success statuses are present - Emit llm_interpretation.txt as optional output from FETCH_REPORTS process - Declare OPENAI_API_KEY secret in FETCH_REPORTS process block
- Add bin/llm_helper.py: thin OpenAI adapter with graceful fallback (explain_validation_errors, interpret_ncbi_report, suggest_metadata_fixes) All functions return None silently when OPENAI_API_KEY is absent or the openai package is not installed -- no behavioral change without LLM - Add .env.example documenting both .env (local) and Nextflow Secrets (Seqera Platform) credential patterns for OPENAI_API_KEY and NCBI creds - Add use_llm and llm_model params to nextflow.config (default: disabled) with inline documentation of Nextflow Secrets setup - Add openai>=1.0 and python-dotenv>=1.0 to environment.yml
Complete overhaul of the TOSTADAS documentation. Every page in docs/user-guide/ has been rewritten for better structure, accuracy, and completeness: - installation.md: streamlined setup steps, added .env and Nextflow Secrets credential options, LLM setup section - parameters.md: reorganized into logical sections, fixed incorrect types for organism_type/virus_subtype, added LLM parameters - submission_guide.md: added LLM features section, full walkthrough examples for mpox and bacteria, credential management docs - outputs.md: added directory tree, per-file descriptions including new LLM output files - troubleshooting.md: expanded from 2 issues to a comprehensive guide covering installation, validation, FTP, accession, and Nextflow errors - wastewater_guide.md: generalized beyond SARS-CoV-2 NWSS to cover any pathogen wastewater submission - general_NCBI_submission_guide.md: clarified SPUID usage, BioProject requirements, submission mechanics table - profile.md: added input file tables, cloud profile setup, custom profile creation guide - custom_metadata_guide.md: added type casting table, cleaner JSON format reference - user_provided_annotation_guide.md: added annotation file format details, standalone usage instructions - vadr_install.md: added RSV model setup, cleaner step structure, verification command - get-in-touch.md: simplified Co-authored-by: Kyle O'Connell <kyoconnell@deloitte.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--use_llmflag (defaultfalse). Credentials via.envfile, Nextflow Secrets, or environment variable. All LLM code paths are fully skipped when the flag is off — pipeline behavior is identical to before.@_retry) on all FTP/SFTP connect and upload methods;errorStrategy: retry+maxRetries 2on submit/prep processes; improved binary-file detection in FTP upload using a class-level set.logging.info(..., file=sys.stderr)calls that passed an invalid kwarg; fixed pandas 2.xapplymap→mapdeprecation; added env var fallback for NCBI credentials inSubmissionConfigParser.secretdirectives added to submission and fetch-reports modules for both NCBI and OpenAI credentials — compatible with Seqera Platform.docs/user-guide/rewritten for clarity, accuracy, and completeness. Added LLM parameter docs, new output file descriptions, expanded troubleshooting (2 → ~15 issues), generalized wastewater guide beyond SARS-CoV-2.New files
bin/llm_helper.py— LLM adapter; returnsNonesilently when OpenAI unavailable.env.example— credential template for local and Seqera Platform runsTest plan
nextflow run main.nf -profile test,mpox,singularity --workflow biosample_and_srapasses without--use_llm--use_llm trueand a validOPENAI_API_KEY--use_llm falseproduces noerror_llm_suggestions.txtor*_llm_interpretation.txtfilessubmission_helper.pyloads credentials from.envwhensubmission_config.yamlfields are blankCo-authored-by: Kyle O'Connell kyoconnell@deloitte.com