Skip to content

Commit bc605ee

Browse files
authored
fix(webapp): verify deployment image exists before finalizing (#4049)
A deployment could be marked deployed and promoted to current without its image ever landing in the registry. Finalize trusted the CLI: the v1 path never pushed or checked, and the v2/v3 path skips its own push when the CLI sends `skipPushToRegistry` - which the local-build path always does. In the happy path the CLI pushes the image itself, so this stayed latent. But any deviation - `--no-push`/`--load`, a push that lands in a different registry, or an old CLI - promoted a version whose image can't be pulled, so every run failed at pull time while the deploy itself reported success. This adds a registry existence check after push and before finalize. If the image isn't there, the deploy fails loudly instead of promoting a version that can't start. The check is ECR-only (a no-op for other registries, so self-hosted setups are unaffected) and uses `BatchGetImage`, which the deploy role already allows. It fails open on an ambiguous registry error so the check can't itself turn into a deploy outage. The image reference is the platform-generated value and the lookup is bound to the configured registry host; the CLI-supplied digest is validated before use. Can be turned off with `DEPLOY_IMAGE_VERIFICATION_ENABLED=0` for setups that push images out of band (e.g. an air-gapped registry the platform can't reach). refs TRI-11243
1 parent 9f01e31 commit bc605ee

5 files changed

Lines changed: 422 additions & 0 deletions

File tree

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
---
2+
area: webapp
3+
type: fix
4+
---
5+
6+
Verify a deployment's image exists in the registry before marking it deployed, so a deploy whose image wasn't pushed fails instead of silently breaking runs (can be turned off via `DEPLOY_IMAGE_VERIFICATION_ENABLED=0` for setups that push images out of band)

apps/webapp/app/env.server.ts

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -553,6 +553,9 @@ const EnvironmentSchema = z
553553
// log-only mode before enforcement.
554554
DEPRECATE_V3_CLI_DEPLOYS_ENABLED: z.string().default("0"),
555555

556+
// Verify the deploy image exists before promoting. Disable for out-of-band/air-gapped push. ECR only.
557+
DEPLOY_IMAGE_VERIFICATION_ENABLED: BoolEnv.default(true),
558+
556559
OBJECT_STORE_BASE_URL: z.string().optional(),
557560
OBJECT_STORE_BUCKET: z.string().optional(),
558561
OBJECT_STORE_ACCESS_KEY_ID: z.string().optional(),

apps/webapp/app/v3/services/finalizeDeploymentV2.server.ts

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ import { getEcrAuthToken, isEcrRegistry } from "../getDeploymentImageRef.server"
1616
import { tryCatch } from "@trigger.dev/core";
1717
import { getRegistryConfig, type RegistryConfig } from "../registryConfig.server";
1818
import { ComputeTemplateCreationService } from "./computeTemplateCreation.server";
19+
import { ecrImageExists } from "./verifyDeploymentImage.server";
1920

2021
export class FinalizeDeploymentV2Service extends BaseService {
2122
public async call(
@@ -75,6 +76,11 @@ export class FinalizeDeploymentV2Service extends BaseService {
7576
});
7677
}
7778

79+
// The CLI claims the image is already in the registry (local build, or a
80+
// self-hosted setup). Verify before promoting so we never mark a
81+
// deployment DEPLOYED when nothing was actually pushed.
82+
await this.#assertImagePullable(deployment, body);
83+
7884
await this.#createTemplateIfNeeded(deployment, id, authenticatedEnv, writer);
7985
return finalizeService.call(authenticatedEnv, id, body);
8086
}
@@ -142,10 +148,53 @@ export class FinalizeDeploymentV2Service extends BaseService {
142148
pushedImage: pushResult.image,
143149
});
144150

151+
// Belt and suspenders: confirm the push actually landed before promoting.
152+
await this.#assertImagePullable(deployment, body);
153+
145154
await this.#createTemplateIfNeeded(deployment, id, authenticatedEnv, writer);
146155
return finalizeService.call(authenticatedEnv, id, body);
147156
}
148157

158+
async #assertImagePullable(
159+
deployment: { imageReference: string | null; type: string | null },
160+
body: FinalizeDeploymentRequestBody
161+
): Promise<void> {
162+
if (!env.DEPLOY_IMAGE_VERIFICATION_ENABLED) {
163+
return;
164+
}
165+
166+
if (!deployment.imageReference) {
167+
return;
168+
}
169+
170+
const registryConfig = getRegistryConfig(deployment.type === "MANAGED");
171+
172+
// ECR-only: non-ECR (self-hosted) registries can't be checked this way, so skip.
173+
if (!isEcrRegistry(registryConfig.host)) {
174+
return;
175+
}
176+
177+
const result = await ecrImageExists({
178+
imageReference: deployment.imageReference,
179+
imageDigest: body.imageDigest,
180+
registryConfig,
181+
});
182+
183+
if (result === "missing") {
184+
throw new ServiceValidationError(
185+
"Deployment image was not found in the registry. It may not have been pushed (for example a local build without a push, or a push to a different registry). Aborting the deploy to avoid promoting a version that cannot start."
186+
);
187+
}
188+
189+
// Fail closed: if we can't confirm the image is present, don't promote a version
190+
// that might not start. Set DEPLOY_IMAGE_VERIFICATION_ENABLED=0 for out-of-band pushes.
191+
if (result === "unknown") {
192+
throw new ServiceValidationError(
193+
"Could not verify the deployment image exists in the registry. Aborting the deploy."
194+
);
195+
}
196+
}
197+
149198
async #createTemplateIfNeeded(
150199
deployment: { imageReference: string | null; worker: { project: { id: string } } | null },
151200
deploymentFriendlyId: string,
Lines changed: 187 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,187 @@
1+
import {
2+
BatchGetImageCommand,
3+
type BatchGetImageCommandOutput,
4+
RepositoryNotFoundException,
5+
} from "@aws-sdk/client-ecr";
6+
import { tryCatch } from "@trigger.dev/core";
7+
import pRetry, { AbortError } from "p-retry";
8+
import { logger } from "~/services/logger.server";
9+
import {
10+
type AssumeRoleConfig,
11+
createEcrClient,
12+
isEcrRegistry,
13+
parseEcrRegistryDomain,
14+
} from "../getDeploymentImageRef.server";
15+
import { type RegistryConfig } from "../registryConfig.server";
16+
17+
const SHA256_DIGEST = /^sha256:[a-f0-9]{64}$/;
18+
19+
export type ImageLookupResult = "found" | "missing" | "unknown";
20+
21+
/**
22+
* Split a stored ECR image reference into repository + tag.
23+
*
24+
* Trust boundary: the ref is platform-generated, but we still bind the lookup to
25+
* our configured host (region/account come from the env host) and only parse refs
26+
* that sit under it. Returns null otherwise.
27+
*/
28+
export function parseEcrImageReference(
29+
imageReference: string,
30+
registryHost: string
31+
): { repositoryName: string; tag: string } | null {
32+
const prefix = `${registryHost}/`;
33+
if (!imageReference.startsWith(prefix)) {
34+
return null;
35+
}
36+
37+
// namespace/projectRef:tag, optionally @sha256:... which we drop here
38+
const remainder = imageReference.slice(prefix.length).split("@")[0];
39+
const lastColon = remainder.lastIndexOf(":");
40+
41+
if (lastColon <= 0) {
42+
return null;
43+
}
44+
45+
const repositoryName = remainder.slice(0, lastColon);
46+
const tag = remainder.slice(lastColon + 1);
47+
48+
if (!repositoryName || !tag || tag.includes("/")) {
49+
return null;
50+
}
51+
52+
return { repositoryName, tag };
53+
}
54+
55+
export function interpretBatchGetImageResponse(
56+
response: BatchGetImageCommandOutput
57+
): ImageLookupResult {
58+
if (response.images && response.images.length > 0) {
59+
return "found";
60+
}
61+
62+
if (response.failures?.some((failure) => failure.failureCode === "ImageNotFound")) {
63+
return "missing";
64+
}
65+
66+
// No image and no explicit not-found failure (some other failure code) -
67+
// we can't say it's missing, so don't block the deploy on it.
68+
return "unknown";
69+
}
70+
71+
type BatchGetImageInput = {
72+
region: string;
73+
assumeRole?: AssumeRoleConfig;
74+
registryId?: string;
75+
repositoryName: string;
76+
imageIds: { imageTag?: string; imageDigest?: string }[];
77+
};
78+
79+
type BatchGetImageSender = (input: BatchGetImageInput) => Promise<BatchGetImageCommandOutput>;
80+
81+
const sendBatchGetImage: BatchGetImageSender = async ({
82+
region,
83+
assumeRole,
84+
registryId,
85+
repositoryName,
86+
imageIds,
87+
}) => {
88+
const ecr = await createEcrClient({ region, assumeRole });
89+
// No acceptedMediaTypes: only the single-manifest types are valid enum values, and
90+
// we only care whether the image exists, not its manifest format.
91+
return ecr.send(new BatchGetImageCommand({ repositoryName, registryId, imageIds }));
92+
};
93+
94+
/**
95+
* Pre-promotion backstop: check the deployment image actually exists in ECR.
96+
*
97+
* "found"/"missing" are definitive (a nonexistent repo counts as missing).
98+
* "unknown" means we couldn't determine it - non-ECR registry, unparseable ref, or
99+
* an API error; the caller decides what to do with each. `_send` is a test seam.
100+
*/
101+
export async function ecrImageExists(
102+
{
103+
imageReference,
104+
imageDigest,
105+
registryConfig,
106+
}: {
107+
imageReference: string;
108+
imageDigest?: string;
109+
registryConfig: RegistryConfig;
110+
},
111+
_send: BatchGetImageSender = sendBatchGetImage
112+
): Promise<ImageLookupResult> {
113+
if (!isEcrRegistry(registryConfig.host)) {
114+
return "unknown";
115+
}
116+
117+
const parsed = parseEcrImageReference(imageReference, registryConfig.host);
118+
119+
if (!parsed) {
120+
logger.warn("Could not parse deployment image reference for verification", { imageReference });
121+
return "unknown";
122+
}
123+
124+
const { accountId, region } = parseEcrRegistryDomain(registryConfig.host);
125+
126+
// imageDigest is supplied by the CLI request body - validate before trusting it.
127+
// Prefer it when valid (catches a tag that resolves to a different image), else
128+
// fall back to the platform-generated tag.
129+
const validDigest =
130+
imageDigest && SHA256_DIGEST.test(imageDigest.trim()) ? imageDigest.trim() : undefined;
131+
const imageId = validDigest ? { imageDigest: validDigest } : { imageTag: parsed.tag };
132+
133+
const assumeRole = registryConfig.ecrAssumeRoleArn
134+
? {
135+
roleArn: registryConfig.ecrAssumeRoleArn,
136+
externalId: registryConfig.ecrAssumeRoleExternalId,
137+
}
138+
: undefined;
139+
140+
// Retry transient ECR failures (throttling/network) before giving up, so a blip
141+
// doesn't fail an otherwise-fine deploy. A missing repo is definitive - don't retry.
142+
const [error, response] = await tryCatch(
143+
pRetry(
144+
() =>
145+
_send({
146+
region,
147+
assumeRole,
148+
registryId: accountId,
149+
repositoryName: parsed.repositoryName,
150+
imageIds: [imageId],
151+
}).catch((err) => {
152+
if (err instanceof RepositoryNotFoundException) {
153+
throw new AbortError(err);
154+
}
155+
throw err;
156+
}),
157+
{
158+
retries: 2,
159+
minTimeout: 200,
160+
maxTimeout: 1000,
161+
onFailedAttempt: (e) => {
162+
logger.warn("Retrying ECR image verification", {
163+
imageReference,
164+
attempt: e.attemptNumber,
165+
error: e.message,
166+
});
167+
},
168+
}
169+
)
170+
);
171+
172+
if (error) {
173+
// A missing repo is a definitive miss, not an ambiguous error.
174+
if (error instanceof RepositoryNotFoundException) {
175+
return "missing";
176+
}
177+
178+
logger.error("Failed to verify deployment image in ECR", {
179+
imageReference,
180+
repositoryName: parsed.repositoryName,
181+
error: error.message,
182+
});
183+
return "unknown";
184+
}
185+
186+
return interpretBatchGetImageResponse(response);
187+
}

0 commit comments

Comments
 (0)