Pipeline reliability, LLM-assisted validation, and docs overhaul by kyleaoconnell22 · Pull Request #359 · CDCgov/tostadas

kyleaoconnell22 · 2026-05-12T17:42:00Z

Summary

LLM-assisted features (optional, non-breaking): OpenAI-powered plain-English explanations for metadata validation errors and NCBI report interpretation. Controlled by --use_llm flag (default false). Credentials via .env file, Nextflow Secrets, or environment variable. All LLM code paths are fully skipped when the flag is off — pipeline behavior is identical to before.
Submission reliability: Exponential-backoff retry decorator (@_retry) on all FTP/SFTP connect and upload methods; errorStrategy: retry + maxRetries 2 on submit/prep processes; improved binary-file detection in FTP upload using a class-level set.
Bug fixes: Fixed logging.info(..., file=sys.stderr) calls that passed an invalid kwarg; fixed pandas 2.x applymap → map deprecation; added env var fallback for NCBI credentials in SubmissionConfigParser.
Nextflow Secrets support: secret directives added to submission and fetch-reports modules for both NCBI and OpenAI credentials — compatible with Seqera Platform.
Docs overhaul: All 13 pages in docs/user-guide/ rewritten for clarity, accuracy, and completeness. Added LLM parameter docs, new output file descriptions, expanded troubleshooting (2 → ~15 issues), generalized wastewater guide beyond SARS-CoV-2.

New files

bin/llm_helper.py — LLM adapter; returns None silently when OpenAI unavailable
.env.example — credential template for local and Seqera Platform runs

Test plan

nextflow run main.nf -profile test,mpox,singularity --workflow biosample_and_sra passes without --use_llm
Same run passes with --use_llm true and a valid OPENAI_API_KEY
--use_llm false produces no error_llm_suggestions.txt or *_llm_interpretation.txt files
FTP retry logic triggers on transient connection failure (manual test or unit test)
submission_helper.py loads credentials from .env when submission_config.yaml fields are blank

Co-authored-by: Kyle O'Connell kyoconnell@deloitte.com

- Add _retry decorator with exponential backoff to FTPClient and SFTPClient connect/upload methods (3 attempts, 5s/10s/20s delays) - Add Nextflow-level errorStrategy retry to PREP_SUBMISSION and SUBMIT_SUBMISSION - Fix all logging.info/debug calls incorrectly passed file=sys.stderr kwarg - Remove debug print() statements from GenbankSubmission.__init__ - Fix FTP binary upload extension detection (use set lookup; add missing .ready, .zip) - Add env var fallback in SubmissionConfigParser for NCBI credentials (supports Nextflow Secrets: nextflow secrets set NCBI_USERNAME/NCBI_PASSWORD) - Declare NCBI_USERNAME and NCBI_PASSWORD secrets in submission process blocks

- Replace df.applymap() with getattr(df, 'map', df.applymap)() for pandas 2.1+ compatibility (applymap removed in 2.1) - Add --use_llm and --llm_model CLI flags to validate_metadata.py - When --use_llm is set, write error_llm_suggestions.txt alongside error.txt with plain-English explanations and fix suggestions - Pass LLM flags through METADATA_VALIDATION Nextflow process - Declare OPENAI_API_KEY as a Nextflow secret in the process block

- Add --use_llm and --llm_model flags to fetch_submission.py - Pass flags to parse_and_save_reports; writes <batch_id>_llm_interpretation.txt alongside the report CSV when errors or non-success statuses are present - Emit llm_interpretation.txt as optional output from FETCH_REPORTS process - Declare OPENAI_API_KEY secret in FETCH_REPORTS process block

- Add bin/llm_helper.py: thin OpenAI adapter with graceful fallback (explain_validation_errors, interpret_ncbi_report, suggest_metadata_fixes) All functions return None silently when OPENAI_API_KEY is absent or the openai package is not installed -- no behavioral change without LLM - Add .env.example documenting both .env (local) and Nextflow Secrets (Seqera Platform) credential patterns for OPENAI_API_KEY and NCBI creds - Add use_llm and llm_model params to nextflow.config (default: disabled) with inline documentation of Nextflow Secrets setup - Add openai>=1.0 and python-dotenv>=1.0 to environment.yml

Complete overhaul of the TOSTADAS documentation. Every page in docs/user-guide/ has been rewritten for better structure, accuracy, and completeness: - installation.md: streamlined setup steps, added .env and Nextflow Secrets credential options, LLM setup section - parameters.md: reorganized into logical sections, fixed incorrect types for organism_type/virus_subtype, added LLM parameters - submission_guide.md: added LLM features section, full walkthrough examples for mpox and bacteria, credential management docs - outputs.md: added directory tree, per-file descriptions including new LLM output files - troubleshooting.md: expanded from 2 issues to a comprehensive guide covering installation, validation, FTP, accession, and Nextflow errors - wastewater_guide.md: generalized beyond SARS-CoV-2 NWSS to cover any pathogen wastewater submission - general_NCBI_submission_guide.md: clarified SPUID usage, BioProject requirements, submission mechanics table - profile.md: added input file tables, cloud profile setup, custom profile creation guide - custom_metadata_guide.md: added type casting table, cleaner JSON format reference - user_provided_annotation_guide.md: added annotation file format details, standalone usage instructions - vadr_install.md: added RSV model setup, cleaner step structure, verification command - get-in-touch.md: simplified Co-authored-by: Kyle O'Connell <kyoconnell@deloitte.com>

kyleoconnell added 5 commits May 12, 2026 13:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pipeline reliability, LLM-assisted validation, and docs overhaul#359

Pipeline reliability, LLM-assisted validation, and docs overhaul#359
kyleaoconnell22 wants to merge 5 commits into
CDCgov:masterfrom
kyleaoconnell22:kyle_patches

kyleaoconnell22 commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kyleaoconnell22 commented May 12, 2026

Summary

New files

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants