Skip to content

Pipeline reliability, LLM-assisted validation, and docs overhaul#359

Open
kyleaoconnell22 wants to merge 5 commits into
CDCgov:masterfrom
kyleaoconnell22:kyle_patches
Open

Pipeline reliability, LLM-assisted validation, and docs overhaul#359
kyleaoconnell22 wants to merge 5 commits into
CDCgov:masterfrom
kyleaoconnell22:kyle_patches

Conversation

@kyleaoconnell22
Copy link
Copy Markdown

Summary

  • LLM-assisted features (optional, non-breaking): OpenAI-powered plain-English explanations for metadata validation errors and NCBI report interpretation. Controlled by --use_llm flag (default false). Credentials via .env file, Nextflow Secrets, or environment variable. All LLM code paths are fully skipped when the flag is off — pipeline behavior is identical to before.
  • Submission reliability: Exponential-backoff retry decorator (@_retry) on all FTP/SFTP connect and upload methods; errorStrategy: retry + maxRetries 2 on submit/prep processes; improved binary-file detection in FTP upload using a class-level set.
  • Bug fixes: Fixed logging.info(..., file=sys.stderr) calls that passed an invalid kwarg; fixed pandas 2.x applymapmap deprecation; added env var fallback for NCBI credentials in SubmissionConfigParser.
  • Nextflow Secrets support: secret directives added to submission and fetch-reports modules for both NCBI and OpenAI credentials — compatible with Seqera Platform.
  • Docs overhaul: All 13 pages in docs/user-guide/ rewritten for clarity, accuracy, and completeness. Added LLM parameter docs, new output file descriptions, expanded troubleshooting (2 → ~15 issues), generalized wastewater guide beyond SARS-CoV-2.

New files

  • bin/llm_helper.py — LLM adapter; returns None silently when OpenAI unavailable
  • .env.example — credential template for local and Seqera Platform runs

Test plan

  • nextflow run main.nf -profile test,mpox,singularity --workflow biosample_and_sra passes without --use_llm
  • Same run passes with --use_llm true and a valid OPENAI_API_KEY
  • --use_llm false produces no error_llm_suggestions.txt or *_llm_interpretation.txt files
  • FTP retry logic triggers on transient connection failure (manual test or unit test)
  • submission_helper.py loads credentials from .env when submission_config.yaml fields are blank

Co-authored-by: Kyle O'Connell kyoconnell@deloitte.com

- Add _retry decorator with exponential backoff to FTPClient and SFTPClient
  connect/upload methods (3 attempts, 5s/10s/20s delays)
- Add Nextflow-level errorStrategy retry to PREP_SUBMISSION and SUBMIT_SUBMISSION
- Fix all logging.info/debug calls incorrectly passed file=sys.stderr kwarg
- Remove debug print() statements from GenbankSubmission.__init__
- Fix FTP binary upload extension detection (use set lookup; add missing .ready, .zip)
- Add env var fallback in SubmissionConfigParser for NCBI credentials
  (supports Nextflow Secrets: nextflow secrets set NCBI_USERNAME/NCBI_PASSWORD)
- Declare NCBI_USERNAME and NCBI_PASSWORD secrets in submission process blocks
- Replace df.applymap() with getattr(df, 'map', df.applymap)() for
  pandas 2.1+ compatibility (applymap removed in 2.1)
- Add --use_llm and --llm_model CLI flags to validate_metadata.py
- When --use_llm is set, write error_llm_suggestions.txt alongside
  error.txt with plain-English explanations and fix suggestions
- Pass LLM flags through METADATA_VALIDATION Nextflow process
- Declare OPENAI_API_KEY as a Nextflow secret in the process block
- Add --use_llm and --llm_model flags to fetch_submission.py
- Pass flags to parse_and_save_reports; writes <batch_id>_llm_interpretation.txt
  alongside the report CSV when errors or non-success statuses are present
- Emit llm_interpretation.txt as optional output from FETCH_REPORTS process
- Declare OPENAI_API_KEY secret in FETCH_REPORTS process block
- Add bin/llm_helper.py: thin OpenAI adapter with graceful fallback
  (explain_validation_errors, interpret_ncbi_report, suggest_metadata_fixes)
  All functions return None silently when OPENAI_API_KEY is absent or
  the openai package is not installed -- no behavioral change without LLM
- Add .env.example documenting both .env (local) and Nextflow Secrets
  (Seqera Platform) credential patterns for OPENAI_API_KEY and NCBI creds
- Add use_llm and llm_model params to nextflow.config (default: disabled)
  with inline documentation of Nextflow Secrets setup
- Add openai>=1.0 and python-dotenv>=1.0 to environment.yml
Complete overhaul of the TOSTADAS documentation. Every page in docs/user-guide/ has been rewritten for better structure, accuracy, and completeness:

- installation.md: streamlined setup steps, added .env and Nextflow Secrets credential options, LLM setup section
- parameters.md: reorganized into logical sections, fixed incorrect types for organism_type/virus_subtype, added LLM parameters
- submission_guide.md: added LLM features section, full walkthrough examples for mpox and bacteria, credential management docs
- outputs.md: added directory tree, per-file descriptions including new LLM output files
- troubleshooting.md: expanded from 2 issues to a comprehensive guide covering installation, validation, FTP, accession, and Nextflow errors
- wastewater_guide.md: generalized beyond SARS-CoV-2 NWSS to cover any pathogen wastewater submission
- general_NCBI_submission_guide.md: clarified SPUID usage, BioProject requirements, submission mechanics table
- profile.md: added input file tables, cloud profile setup, custom profile creation guide
- custom_metadata_guide.md: added type casting table, cleaner JSON format reference
- user_provided_annotation_guide.md: added annotation file format details, standalone usage instructions
- vadr_install.md: added RSV model setup, cleaner step structure, verification command
- get-in-touch.md: simplified

Co-authored-by: Kyle O'Connell <kyoconnell@deloitte.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants