Skip to content

feat: per-rule queues, per-job tags, EC2/Fargate-aware validation#40

Open
nh13 wants to merge 8 commits into
snakemake:mainfrom
nh13:feat/submit-job-tags
Open

feat: per-rule queues, per-job tags, EC2/Fargate-aware validation#40
nh13 wants to merge 8 commits into
snakemake:mainfrom
nh13:feat/submit-job-tags

Conversation

@nh13
Copy link
Copy Markdown

@nh13 nh13 commented Apr 12, 2026

Summary

Hardens the AWS Batch executor for multi-arch, multi-tenant production use. Three user-facing capabilities plus underlying validation/compat fixes.

What's new

  • Per-rule queue routing via resources.batch_queue — route each rule to a Batch queue wired to the matching compute environment. Enables a single workflow to run ARM rules on Graviton queues and x86 rules on Intel/AMD queues. Falls back to the profile default queue.
  • Per-job tagssettings.tags are now sent to submit_job (not only register_job_definition), so the AWS Batch console and Cost Explorer can group jobs by tag. A SNAKEMAKE_AWS_BATCH_JOB_TAGS env var carries dynamic KEY=VALUE pairs for coordinator → worker tag inheritance (e.g. per-run commit hashes for cost attribution). Env tags override settings on key conflict.
  • resources.shared_memory_size_mb — rules opt into a larger /dev/shm (EC2/ECS default 64 MB is too small for tools that stage large in-memory indexes). EC2 only; Fargate does not honor linuxParameters.sharedMemorySize.

Validation and correctness

  • EC2 vs FARGATE detection (closes VALID_RESOURCES_MAPPING is restrictive, and only applicable to Fargate compute environments. #34): resolves the queue's compute environment and runs platform-specific vCPU/memory checks. Previously all jobs were validated against Fargate's strict resource matrix, which blocked legitimate EC2 requests. Real AWS lookup errors (botocore.ClientError) now propagate as WorkflowError instead of silently falling back to EC2.
  • vCPU from threads — derived from self.job.threads (the documented Snakemake directive) instead of the undocumented _cores resource, with _cores retained as fallback.
  • Fargate memory floor — when the requested memory is not a valid Fargate value, pick the smallest valid memory >= mem instead of silently shrinking to min(valid_mems) and OOMing the job. Raise if no valid value exists.
  • Tag validation — reject empty tag keys and >50 tags before calling submit_job, matching AWS Batch limits.

Known limitation

Fargate job-definition registration is not yet supported. Fargate requires executionRoleArn, prohibits privileged: True, and does not allow GPU resources — none of which this plugin currently emits conditionally. build_job_definition raises a clear WorkflowError when a Fargate queue is detected, replacing an opaque AWS-side rejection with a local message. EC2 queues are unaffected.

Compatibility

  • auto_deploy_default_storage_provider=False so workers don't pip install an unpinned snakemake-storage-plugin-s3 at startup — s3-plugin 0.3.x metadata requires snakemake 9.x and breaks 8.x workers. Image maintainers pre-install a compatible plugin version.
  • snakemake-storage-plugin-s3 pin widened to >=0.2,<0.4 so 8.x consumers can fall back to 0.2.x.

Dependencies

Stacked on #38 (fix-vcpu-threads-and-validation) — the first three commits here are the same as #38's, so #38 should land first (or be closed in favor of this PR).

Test plan

  • 31 unit tests in tests/test_batch_job_builder.py:
    • Tag assembly + propagation to submit_job (settings, env var, merge, override, empty cases)
    • Tag validation (empty keys from env or settings, 50-tag boundary)
    • Platform detection (ClientError re-raises, empty-response fallback, Fargate detection)
    • Fargate resource validation (smallest valid ≥ requested, exact match, request exceeds maximum, invalid vCPU)
    • Fargate rejection in build_job_definition
  • All existing tests still pass

Summary by CodeRabbit

  • New Features

    • Added queue-aware job submission—jobs can now specify target AWS Batch queue via resources.
    • Enabled Fargate platform support alongside existing EC2 execution.
    • Added job tagging capability with environment variable override support.
  • Behavior Changes

    • Disabled default storage provider auto-deployment; pre-installed container plugins now required.
    • Updated vCPU and memory resource defaults for improved performance.
    • Enhanced resource validation with platform-specific compatibility checks.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds two passthrough AWS Batch client methods; extends BatchJobBuilder to derive job queue, detect EC2 vs FARGATE via AWS queries, perform platform-specific resource validation, set platformCapabilities on job definitions, and merge job tags from settings plus an env var; tests added for tag and platform behaviors.

Changes

Cohort / File(s) Summary
BatchClient Enhancements
snakemake_executor_plugin_aws_batch/batch_client.py
Added passthrough methods describe_job_queues(**kwargs) and describe_compute_environments(**kwargs) delegating to the boto3 Batch client.
BatchJobBuilder Platform & Resource Logic
snakemake_executor_plugin_aws_batch/batch_job_builder.py
Derives job_queue per-rule, discovers platform via describe_job_queues/describe_compute_environments, dispatches platform-specific validation (FARGATE vs EC2), prefers job.threads for vCPU, lowers default memory to 1024 MB, maps resources.shared_memory_size_mb to linuxParameters.sharedMemorySize (EC2-only), registers job definitions with platformCapabilities=[self.platform], and submits to the resolved self.job_queue.
Tag Assembly & Submission
snakemake_executor_plugin_aws_batch/batch_job_builder.py, tests/test_batch_job_builder.py
Introduces _build_job_tags() to merge settings.tags with SNAKEMAKE_AWS_BATCH_JOB_TAGS env var (env overrides, handles = in values, rejects empty keys, limits to 50 tags). Submission now attaches tags when present; tests cover tag parsing/validation, submit behavior, platform detection, and FARGATE validation.
Tests
tests/test_batch_job_builder.py
New comprehensive unit tests for BatchJobBuilder covering tag behavior, env-var parsing, platform detection, resource validation, and job-definition registration/submission differences for EC2 vs FARGATE.
Default Setting Change
snakemake_executor_plugin_aws_batch/__init__.py
Set common_settings.auto_deploy_default_storage_provider = False and updated inline comments to require the storage plugin be pre-installed in container images.

Sequence Diagram(s)

sequenceDiagram
    participant Env as Env/Settings
    participant Builder as BatchJobBuilder
    participant Client as BatchClient
    participant AWS as AWS Batch

    Env->>Builder: provide settings, job, env-var tags
    Builder->>Client: describe_job_queues(...)
    Client->>AWS: DescribeJobQueues API
    AWS-->>Client: queue description
    Client-->>Builder: queue info
    alt computeEnvironment referenced
        Builder->>Client: describe_compute_environments(...)
        Client->>AWS: DescribeComputeEnvironments API
        AWS-->>Client: compute env description
        Client-->>Builder: compute env info
    end
    Builder->>Builder: determine platform (EC2 / FARGATE) and validate resources
    Builder->>AWS: register_job_definition(platformCapabilities=[Builder.platform], ...)
    AWS-->>Builder: job definition ARN
    Builder->>Builder: _build_job_tags() merge settings + env tags
    Builder->>Client: submit_job(..., jobQueue=Builder.job_queue, tags=merged?)
    Client->>AWS: SubmitJob API (with tags if present)
    AWS-->>Client: submit response
    Client-->>Builder: submit response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the three main user-facing capabilities: per-rule queue routing, per-job tags, and platform-aware (EC2/Fargate) validation—matching the core changes across BatchClient, BatchJobBuilder, and tests.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (1)

179-205: ⚠️ Potential issue | 🔴 Critical

Fargate job definitions cannot use EC2-only container settings.

Line 215 sets platformCapabilities=[self.platform], which can be FARGATE when detected from the job queue. However, the container_properties dict (lines 179–205) violates AWS Fargate requirements:

  • executionRoleArn is required but never supplied
  • privileged=True is not supported; Fargate requires it omitted or false
  • GPU resourceRequirements (lines 198–204) are unsupported on Fargate

The AWS Batch API will reject Fargate job registrations with these violations. Platform-specific container-property assembly is needed before Fargate can work.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 179 -
205, The container_properties assembly (container_properties dict in
batch_job_builder.py used when creating job definitions) is not Fargate-safe:
when platformCapabilities is set to self.platform (FARGATE) you must include
executionRoleArn, remove or set privileged to False, and omit GPU
resourceRequirements; update the logic that builds container_properties and the
call site that sets platformCapabilities to make platform-specific adjustments
(detect self.platform == "FARGATE" and then add executionRoleArn from settings,
delete or set privileged to False, and skip adding
BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU entries) so Fargate job registrations
meet AWS requirements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 123-133: The current logic picks min(valid_mems) which can
under-provision memory; instead, in the block that computes valid_mems from
VALID_RESOURCES_MAPPING for the given vcpu, filter for candidates >= mem and
pick the smallest of those (e.g., candidate_mem = min(m for m in valid_mems if m
>= mem)); if no candidate exists raise a WorkflowError explaining the requested
mem is too large for that vCPU; update the logger.warning to report the chosen
candidate_mem (and original mem/vcpu) rather than min_mem. Ensure this change
touches the same symbols: VALID_RESOURCES_MAPPING, valid_mems, vcpu, mem,
WorkflowError, and the logger.warning call.
- Around line 105-109: The broad except block that logs and returns
BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value must not silently downgrade on all
errors; instead, remove the unconditional return in the except (or change it to
re-raise) so errors propagate (fail fast) while retaining existing explicit
empty-response fallbacks elsewhere; update the except handling around the
queue/compute-environment lookup in the method (the block that uses
self.logger.warning and BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value) to either
re-raise the caught exception after logging or catch and rethrow only
non-empty-response-safe errors (e.g., inspect/handle specific AWS client
errors), ensuring only the explicit empty-response branches keep the EC2
fallback.

---

Outside diff comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 179-205: The container_properties assembly (container_properties
dict in batch_job_builder.py used when creating job definitions) is not
Fargate-safe: when platformCapabilities is set to self.platform (FARGATE) you
must include executionRoleArn, remove or set privileged to False, and omit GPU
resourceRequirements; update the logic that builds container_properties and the
call site that sets platformCapabilities to make platform-specific adjustments
(detect self.platform == "FARGATE" and then add executionRoleArn from settings,
delete or set privileged to False, and skip adding
BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU entries) so Fargate job registrations
meet AWS requirements.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: adee3872-6ca3-4367-b682-4abbb205004a

📥 Commits

Reviewing files that changed from the base of the PR and between 2525ca0 and 71cbeea.

📒 Files selected for processing (3)
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py
  • tests/test_batch_job_builder.py

Comment thread snakemake_executor_plugin_aws_batch/batch_job_builder.py Outdated
Comment thread snakemake_executor_plugin_aws_batch/batch_job_builder.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (1)

185-221: ⚠️ Potential issue | 🔴 Critical

Add platform-aware container properties for Fargate compatibility.

The job definition construction at line 185 unconditionally sets privileged: True, omits executionRoleArn, and allows GPU resources, but line 221 can send platformCapabilities=["FARGATE"]. This causes Fargate job definition registration to fail: AWS Batch Fargate requires an execution role, prohibits privileged mode, and does not support GPU. The code already validates Fargate vs. EC2 resource requirements but does not handle these container property differences.

Conditionally set executionRoleArn for Fargate, omit or set privileged to false for Fargate (true only for EC2), and reject GPU requests on Fargate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 185 -
221, The container_properties currently set privileged=True, omit
executionRoleArn, and may include GPU requirements even when self.platform ==
"FARGATE"; update the job-definition assembly in batch_job_builder.py so that
inside the block building container_properties you (1) only set privileged=True
for non-FARGATE platforms and set privileged=False (or omit) when self.platform
== self.platform value for FARGATE, (2) add
executionRoleArn=self.settings.execution_role (or settings.executionRoleArn) to
container_properties when self.platform == FARGATE, and (3) guard appending a
GPU resource (BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU) — raise/return an error
if gpu > 0 and self.platform == FARGATE to reject unsupported GPU on Fargate;
ensure these changes are honored before calling
self.batch_client.register_job_definition so register_job_definition receives
the correct platform-aware containerProperties and tags.
♻️ Duplicate comments (2)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (2)

129-140: ⚠️ Potential issue | 🟠 Major

Don’t reduce Fargate memory below the requested amount.

Line 135 can turn vcpu=1, mem=5000 into 2048, under-provisioning the job and causing avoidable OOM failures. Pick the smallest valid memory >= mem, or fail when the request exceeds the maximum for that vCPU.

Proposed fix
             valid_mems = [m for m, v in VALID_RESOURCES_MAPPING.items() if vcpu in v]
             if not valid_mems:
                 raise WorkflowError(
                     f"Invalid vCPU value {vcpu} for Fargate. "
                     f"Check valid Fargate resource configurations."
                 )
-            min_mem = min(valid_mems)
+            candidate_mems = [m for m in valid_mems if m >= mem]
+            if not candidate_mems:
+                raise WorkflowError(
+                    f"Memory value {mem} MB exceeds the maximum allowed for "
+                    f"vCPU {vcpu} on Fargate."
+                )
+            min_mem = min(candidate_mems)
             self.logger.warning(
                 f"Memory value {mem} MB is invalid for vCPU {vcpu} on Fargate. "
-                f"Setting memory to minimum allowed value {min_mem} MB."
+                f"Setting memory to next allowed value {min_mem} MB."
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 129 -
140, The current logic in batch_job_builder.py (inside the function that maps
Fargate vcpu/mem) picks min(valid_mems) which can under-provision requested mem;
change it to select the smallest entry in valid_mems that is >= mem (use vcpu,
mem, VALID_RESOURCES_MAPPING, valid_mems) and return that; if no valid_mems >=
mem exist, raise WorkflowError indicating the requested memory exceeds the
maximum allowed for that vCPU rather than silently lowering it; keep the
existing logging (self.logger.warning) but update the message to reflect that
memory was bumped up to the selected valid value, and ensure returned values
remain str(vcpu), str(selected_mem).

111-115: ⚠️ Potential issue | 🟠 Major

Don’t silently downgrade platform lookup failures to EC2.

Line 111 still catches every exception from the AWS queue/environment lookup and turns it into an EC2 fallback. That can hide IAM/configuration/transient AWS failures and register the wrong job definition; keep the explicit empty-response fallbacks, but re-raise real lookup errors.

Proposed fix
-        except Exception as e:
-            self.logger.warning(
-                f"Failed to determine platform from queue: {e}. Defaulting to EC2."
-            )
-            return BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value
+        except Exception as e:
+            raise WorkflowError(
+                f"Failed to determine platform from queue {self.job_queue}: {e}"
+            ) from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 111 -
115, The current broad except in the platform lookup swallows all errors and
returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value; instead, change the error
handling in the platform-lookup routine so you only fall back to EC2 when the
lookup explicitly returns an empty/no-result value, and re-raise real AWS/client
errors. Concretely: remove the generic "except Exception" around the AWS
queue/environment lookup, or narrow it to only catch the expected "no result"
condition (e.g., check for None/empty response) and call
self.logger.warning(...) then return EC2; for any other exceptions from the AWS
client (botocore.exceptions.BotoCoreError/ClientError or unexpected errors),
re-raise them so they surface to callers rather than being silenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 237-247: The merged tags builder currently accepts empty keys
(e.g., "=value") and doesn't enforce AWS Batch limits; update the tag assembly
in the method that builds/returns tags (the block using
SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR and variable tags) to validate after
merging: trim key and value, skip or reject any entries where the key is empty
(key.strip() == ""), and enforce a maximum of 50 tags (len(tags) <= 50); if
validation fails, raise a clear ValueError describing the problem so the caller
(before submit_job) fails fast with a helpful message.

---

Outside diff comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 185-221: The container_properties currently set privileged=True,
omit executionRoleArn, and may include GPU requirements even when self.platform
== "FARGATE"; update the job-definition assembly in batch_job_builder.py so that
inside the block building container_properties you (1) only set privileged=True
for non-FARGATE platforms and set privileged=False (or omit) when self.platform
== self.platform value for FARGATE, (2) add
executionRoleArn=self.settings.execution_role (or settings.executionRoleArn) to
container_properties when self.platform == FARGATE, and (3) guard appending a
GPU resource (BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU) — raise/return an error
if gpu > 0 and self.platform == FARGATE to reject unsupported GPU on Fargate;
ensure these changes are honored before calling
self.batch_client.register_job_definition so register_job_definition receives
the correct platform-aware containerProperties and tags.

---

Duplicate comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 129-140: The current logic in batch_job_builder.py (inside the
function that maps Fargate vcpu/mem) picks min(valid_mems) which can
under-provision requested mem; change it to select the smallest entry in
valid_mems that is >= mem (use vcpu, mem, VALID_RESOURCES_MAPPING, valid_mems)
and return that; if no valid_mems >= mem exist, raise WorkflowError indicating
the requested memory exceeds the maximum allowed for that vCPU rather than
silently lowering it; keep the existing logging (self.logger.warning) but update
the message to reflect that memory was bumped up to the selected valid value,
and ensure returned values remain str(vcpu), str(selected_mem).
- Around line 111-115: The current broad except in the platform lookup swallows
all errors and returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value; instead,
change the error handling in the platform-lookup routine so you only fall back
to EC2 when the lookup explicitly returns an empty/no-result value, and re-raise
real AWS/client errors. Concretely: remove the generic "except Exception" around
the AWS queue/environment lookup, or narrow it to only catch the expected "no
result" condition (e.g., check for None/empty response) and call
self.logger.warning(...) then return EC2; for any other exceptions from the AWS
client (botocore.exceptions.BotoCoreError/ClientError or unexpected errors),
re-raise them so they surface to callers rather than being silenced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2631aa04-9e9f-490a-8474-163c63b38e71

📥 Commits

Reviewing files that changed from the base of the PR and between 71cbeea and f1943b7.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py

Comment thread snakemake_executor_plugin_aws_batch/batch_job_builder.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (3)

248-258: ⚠️ Potential issue | 🟡 Minor

Validate tag keys and count before returning.

The current implementation accepts empty keys from malformed input like =value or ,=value,. Additionally, AWS Batch enforces a 50-tag limit per job. Consider validating that keys are non-empty and the total count doesn't exceed 50 to fail fast with a clear error.

Proposed fix
         env_tags_str = os.environ.get(SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR, "")
         if env_tags_str:
             for pair in env_tags_str.split(","):
                 pair = pair.strip()
                 if "=" in pair:
                     key, _, value = pair.partition("=")
-                    tags[key.strip()] = value.strip()
+                    key = key.strip()
+                    if not key:
+                        raise WorkflowError(
+                            f"Invalid {SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR}: "
+                            "tag key cannot be empty."
+                        )
+                    tags[key] = value.strip()
+
+        if len(tags) > 50:
+            raise WorkflowError(
+                f"AWS Batch jobs support at most 50 tags, got {len(tags)}."
+            )

         return tags
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 248 -
258, Validate and reject empty tag keys and enforce AWS Batch's 50-tag limit
before returning tags: after copying self.settings.tags into tags and while
parsing env_tags_str (using SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR), strip the key
and if key is empty raise a clear ValueError (or similar) indicating malformed
tag like "=value"; also validate existing keys in self.settings.tags (strip and
error on any empty key). After merging all tags, if len(tags) > 50 raise a clear
ValueError indicating the tag count exceeds AWS Batch's 50-tag limit; return
tags only if all validations pass.

111-115: ⚠️ Potential issue | 🟠 Major

Broad exception handling masks real configuration errors.

The catch-all Exception handler defaults to EC2 for any failure, including permission errors (AccessDeniedException), validation errors, and transient AWS failures. This can turn real configuration issues into silent fallbacks, pushing the actual error to a later register/submit call where it's harder to diagnose.

Consider catching specific boto3/botocore exceptions (e.g., ClientError) and re-raising unexpected errors to fail fast, while keeping the EC2 fallback only for the explicit empty-response branches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 111 -
115, Replace the broad except Exception handler around the logic that returns
BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value with targeted exception handling:
import and catch botocore.exceptions.ClientError (and any specific AWS
exceptions you expect, e.g., AccessDeniedException) and handle only the cases
where the AWS response is empty or indicates "no platform" by returning the EC2
fallback; for other ClientError codes (and for unexpected exceptions) re-raise
the exception so the caller fails fast instead of silently defaulting, and keep
the EC2 fallback limited to the explicit empty-response branch in the method
that determines platform from the queue.

129-140: ⚠️ Potential issue | 🟠 Major

Memory fallback may under-provision the container.

When the requested memory isn't in VALID_RESOURCES_MAPPING, line 135 picks min(valid_mems), which could be significantly lower than requested. For example, vcpu=1, mem=5000 gets rewritten to 2048, potentially causing OOM failures.

Consider picking the smallest valid memory that is >= mem, or raising an error when no valid memory can satisfy the request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 129 -
140, The current fallback chooses min(valid_mems) which may under-provision
(e.g., vcpu=1, mem=5000 -> 2048); update the logic in the block that computes
valid_mems (using VALID_RESOURCES_MAPPING) to select the smallest valid memory
>= requested mem (e.g., candidate = min(m for m in valid_mems if m >= mem)) and
return that, and if no such candidate exists raise a WorkflowError explaining
that no valid Fargate memory satisfies the requested mem for the given vcpu;
keep the return format str(vcpu), str(selected_mem).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 248-258: Validate and reject empty tag keys and enforce AWS
Batch's 50-tag limit before returning tags: after copying self.settings.tags
into tags and while parsing env_tags_str (using
SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR), strip the key and if key is empty raise a
clear ValueError (or similar) indicating malformed tag like "=value"; also
validate existing keys in self.settings.tags (strip and error on any empty key).
After merging all tags, if len(tags) > 50 raise a clear ValueError indicating
the tag count exceeds AWS Batch's 50-tag limit; return tags only if all
validations pass.
- Around line 111-115: Replace the broad except Exception handler around the
logic that returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value with targeted
exception handling: import and catch botocore.exceptions.ClientError (and any
specific AWS exceptions you expect, e.g., AccessDeniedException) and handle only
the cases where the AWS response is empty or indicates "no platform" by
returning the EC2 fallback; for other ClientError codes (and for unexpected
exceptions) re-raise the exception so the caller fails fast instead of silently
defaulting, and keep the EC2 fallback limited to the explicit empty-response
branch in the method that determines platform from the queue.
- Around line 129-140: The current fallback chooses min(valid_mems) which may
under-provision (e.g., vcpu=1, mem=5000 -> 2048); update the logic in the block
that computes valid_mems (using VALID_RESOURCES_MAPPING) to select the smallest
valid memory >= requested mem (e.g., candidate = min(m for m in valid_mems if m
>= mem)) and return that, and if no such candidate exists raise a WorkflowError
explaining that no valid Fargate memory satisfies the requested mem for the
given vcpu; keep the return format str(vcpu), str(selected_mem).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 528b17b0-82e7-4cf6-816a-3a455ee9968b

📥 Commits

Reviewing files that changed from the base of the PR and between 0627c08 and c736516.

⛔ Files ignored due to path filters (1)
  • pyproject.toml is excluded by !pyproject.toml
📒 Files selected for processing (1)
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py

radusuciu and others added 8 commits April 28, 2026 23:37
As described in issue snakemake#34, validation assumed a Fargate compute environment. We can detect the compute environment by looking up the queue, and only run the Fargate specific validation when necessary.
Use self.job.threads for vCPU instead of the undocumented _cores resource.
Falls back to _cores for backward compatibility if threads is not set.
Avoid ValueError from min() on empty list when vCPU is not valid
for any Fargate memory configuration.
Add tags to the submit_job AWS API call (previously only applied to
register_job_definition). Also support reading dynamic tags from the
SNAKEMAKE_AWS_BATCH_JOB_TAGS environment variable as comma-separated
KEY=VALUE pairs, merged with settings.tags (env var overrides on
conflict). This enables per-run cost tracking when a coordinator job
sets the env var for child jobs to inherit.
Multi-arch workflows need to route each job to a queue wired to the
matching compute environment (e.g. ARM vs x86). Prior to this change
all jobs went to the single queue specified by --aws-batch-job-queue
regardless of the rule's resources.

Read resources.batch_queue on the job first, falling back to the
profile-wide default. Platform detection and submission both use the
resolved per-job queue.
snakemake core's auto-deploy injects 'pip install
snakemake-storage-plugin-s3' into every worker's startup command with no
version constraint. As of snakemake-storage-plugin-s3==0.3.3 the PyPI
metadata requires snakemake-interface-storage-plugins >= 4.2.3, which
needs snakemake >= 9. That breaks workers running snakemake 8.x.

Set auto_deploy_default_storage_provider=False so workers rely on the
storage plugin being pre-installed in the container image at a pinned,
compatible version. Image maintainers are now responsible for pinning
a compatible snakemake-storage-plugin-s3.
s3-plugin 0.3.x declares snakemake-interface-storage-plugins>=4.2.3,
but the plugin keeps the pre-4.2.3 super() signature and so breaks
with snakemake 8.x (which cannot pass the new logger argument). The
executor plugin never calls into s3-plugin APIs directly — it only
needs some storage provider to be installed. Widening the pin to
">=0.2,<0.4" lets consumers fall back to 0.2.x (which keeps the
<4.0 interface and works with snakemake 8.x) until upstream ships a
0.3.x s3 build that matches the current base __init__ signature.
…ySize

POSIX shm_open writes to /dev/shm, which on EC2/ECS containers defaults to
64 MB. Tools that stage large in-memory indexes (bwa-mem2 shm, etc.) need a
larger /dev/shm. Rules opt in via resources.shared_memory_size_mb.

Fargate is excluded — it does not honor linuxParameters.sharedMemorySize.
@nh13 nh13 changed the title feat: pass tags to submit_job and support dynamic env var tags feat: per-rule queues, per-job tags, EC2/Fargate-aware validation Apr 29, 2026
@nh13 nh13 force-pushed the feat/submit-job-tags branch from c736516 to 31b0253 Compare April 29, 2026 06:46
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 118-120: The error messages in batch_job_builder.py reference
self.settings.job_queue but the actual queue used for a job can be overridden
and stored in self.job_queue; update the two places (around the
platform-determination and the second error block) to use self.job_queue instead
of self.settings.job_queue so diagnostics show the resolved per-job queue and
not the default settings value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 822c07d0-b3f7-4862-864a-9c0d2861a654

📥 Commits

Reviewing files that changed from the base of the PR and between c736516 and 31b0253.

⛔ Files ignored due to path filters (1)
  • pyproject.toml is excluded by !pyproject.toml
📒 Files selected for processing (4)
  • snakemake_executor_plugin_aws_batch/__init__.py
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py
  • tests/test_batch_job_builder.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • tests/test_batch_job_builder.py

Comment on lines +118 to +120
f"Failed to determine platform from queue "
f"{self.settings.job_queue}: {e}"
) from e
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use the resolved per-job queue in error messages.

Line 119 and Line 190 use self.settings.job_queue, which can misreport the queue when resources.batch_queue overrides it. Use self.job_queue for accurate diagnostics.

💡 Suggested patch
-                f"{self.settings.job_queue}: {e}"
+                f"{self.job_queue}: {e}"
@@
-                f"(queue {self.settings.job_queue} resolves to FARGATE). "
+                f"(queue {self.job_queue} resolves to FARGATE). "

Also applies to: 188-191

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 118 -
120, The error messages in batch_job_builder.py reference
self.settings.job_queue but the actual queue used for a job can be overridden
and stored in self.job_queue; update the two places (around the
platform-determination and the second error block) to use self.job_queue instead
of self.settings.job_queue so diagnostics show the resolved per-job queue and
not the default settings value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VALID_RESOURCES_MAPPING is restrictive, and only applicable to Fargate compute environments.

2 participants