Skip to content

Commit c1464ed

Browse files
rekmarksclaude
andauthored
fix(ocap-kernel): enforce one delivery per crank, fix rollback cache staleness (#879)
As it turns out, we have been violating the invariant that a crank consists of the delivery of a single message or notification. Since at least the introduction of `KernelQueue.ts` in #484, one iteration of the kernel's run queue—which should be equivalent to a crank—has actually been able to deliver an unbounded number of messages. This means that, if a delivery aborts mid-crank, `rollbackCrank('start')` reverts all deliveries in the crank (including earlier successful ones), creating inconsistency with vat in-memory state and leaving promise subscriptions permanently dangling. This PR ensures that we correctly implement cranks via the kernel's run queue loop as described below. ## Summary - Enforce one run-queue item per crank (change `while` to `if` in KernelQueue generator) and fix stale `StoredQueue` caches after `rollbackCrank` by refreshing the run queue and invalidating `runQueueLengthCache` - Reject JS promise subscriptions when a crank aborts with vat termination; fix `terminateVat` callback in Kernel to avoid deadlock by bypassing `VatManager.terminateVat()` (which calls `waitForCrank()`) - Simplify the run queue implementation; in lieu of an async generator + loop, use a single loop with helper functions - Improve error messages for splat cases (revoked, no owner, no object, endpoint gone) and handle vanished endpoints in KernelRouter delivery - Fix SubclusterManager to catch rejected bootstrap promises - Add orphaned ephemeral exo tests (unit + e2e) - Glossary formatting and crank definition correction ## Test plan - [x] Existing unit tests updated and passing (`KernelQueue.test.ts`, `KernelRouter.test.ts`, `crank.test.ts`, `syscall-validation.test.ts`, `vat-lifecycle.test.ts`) - [x] New unit test for orphaned ephemeral exos (`orphaned-ephemeral-exo.test.ts`) - [x] New e2e test for orphaned ephemeral exos (`orphaned-ephemeral-exo.test.ts` in kernel-node-runtime) 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **High Risk** > High risk because it changes core `KernelQueue`/`KernelRouter` crank semantics, rollback behavior, and how message failures propagate (resolve vs reject), which can affect delivery ordering, retries, and many callers/tests. > > **Overview** > **Kernel crank semantics are tightened and error propagation is made consistent.** `KernelQueue.run` is rewritten to process *exactly one* run-queue item per crank, and JS-side subscriptions created by `enqueueMessage` now support both `resolve` and `reject` so rejected kernel promises reject the returned promise. > > **Rollback and termination handling are hardened.** `rollbackCrank` now refreshes the stored run-queue and invalidates length caches to avoid stale in-memory state after DB rollback, and abort+terminate paths immediately reject the aborted send’s subscription. Kernel vat termination during a crank bypasses `terminateVat()` to avoid deadlock. > > **Message “splat” cases are clearer and better handled.** `KernelRouter` improves errors for revoked/no-owner/no-object/endpoint-gone cases, resolves splat rejections using the current promise decider, and treats vanished endpoints as a splat with promise rejection. > > **Tests/docs updated and expanded.** Many tests are updated to expect promise rejections (including remote comms, revocation, lifecycle), new unit+e2e coverage is added for orphaned ephemeral exos across vat restart, `kernel-utils` exports a new `isCapData` guard used to rethrow bootstrap errors as real `Error`s, and the glossary is expanded/clarified (kernel promises/decider/crank definition). > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 233587c. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6600be0 commit c1464ed

31 files changed

Lines changed: 988 additions & 316 deletions

docs/glossary.md

Lines changed: 140 additions & 28 deletions
Large diffs are not rendered by default.

packages/evm-wallet-experiment/test/integration/peer-wallet.test.ts

Lines changed: 15 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -262,15 +262,14 @@ describe.sequential('Peer wallet integration', () => {
262262
};
263263

264264
// Transaction signing has no peer fallback — kernel2 has no local
265-
// keys so this should return an error, not forward to kernel1.
266-
const result = await kernel2.queueMessage(
267-
coordinatorKref2,
268-
'signTransaction',
269-
[tx],
270-
);
271-
await waitUntilQuiescent();
272-
expect(result.body).toContain('#error');
273-
expect(result.body).toContain('No authority to sign this transaction');
265+
// keys so this should reject, not forward to kernel1.
266+
await expect(
267+
kernel2.queueMessage(coordinatorKref2, 'signTransaction', [tx]),
268+
).rejects.toMatchObject({
269+
body: expect.stringContaining(
270+
'No authority to sign this transaction',
271+
),
272+
});
274273
},
275274
NETWORK_TIMEOUT,
276275
);
@@ -281,16 +280,13 @@ describe.sequential('Peer wallet integration', () => {
281280
'returns error when no local keys and no peer wallet',
282281
async () => {
283282
// Kernel2 has no keys and no peer wallet connected
284-
// queueMessage resolves with error CapData (not rejects)
285-
const result = await kernel2.queueMessage(
286-
coordinatorKref2,
287-
'signMessage',
288-
['should fail'],
289-
);
290-
await waitUntilQuiescent();
291-
// Error CapData body contains #error marker
292-
expect(result.body).toContain('#error');
293-
expect(result.body).toContain('No authority to sign message');
283+
await expect(
284+
kernel2.queueMessage(coordinatorKref2, 'signMessage', [
285+
'should fail',
286+
]),
287+
).rejects.toMatchObject({
288+
body: expect.stringContaining('No authority to sign message'),
289+
});
294290
},
295291
NETWORK_TIMEOUT,
296292
);

packages/extension/test/e2e/object-registry.test.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -124,6 +124,6 @@ test.describe('Object Registry', () => {
124124

125125
// After revoking, the previously successful message should fail
126126
response = await sendMessage(popupPage, target, method, params);
127-
await expect(response).toContainText(/[Rr]evoked object/u);
127+
await expect(response).toContainText('has been revoked');
128128
});
129129
});
Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
import { makeSQLKernelDatabase } from '@metamask/kernel-store/sqlite/nodejs';
2+
import { kunser } from '@metamask/ocap-kernel';
3+
import type { ClusterConfig } from '@metamask/ocap-kernel';
4+
import { delay } from '@ocap/repo-tools/test-utils';
5+
import { mkdtemp, rm } from 'node:fs/promises';
6+
import { tmpdir } from 'node:os';
7+
import { join } from 'node:path';
8+
import { describe, it, expect } from 'vitest';
9+
10+
import { makeTestKernel } from '../helpers/kernel.ts';
11+
12+
const PROVIDER_BUNDLE =
13+
'http://localhost:3000/orphaned-ephemeral-provider-vat.bundle';
14+
const CONSUMER_BUNDLE =
15+
'http://localhost:3000/orphaned-ephemeral-consumer-vat.bundle';
16+
17+
const clusterConfig: ClusterConfig = {
18+
bootstrap: 'consumer',
19+
vats: {
20+
provider: {
21+
bundleSpec: PROVIDER_BUNDLE,
22+
parameters: {},
23+
},
24+
consumer: {
25+
bundleSpec: CONSUMER_BUNDLE,
26+
parameters: {},
27+
},
28+
},
29+
};
30+
31+
describe('Orphaned ephemeral exo', { timeout: 30_000 }, () => {
32+
it('rejects when provider vat restarts', async () => {
33+
const tempDir = await mkdtemp(join(tmpdir(), 'ocap-ephemeral-'));
34+
const dbFilename = join(tempDir, 'kernel.db');
35+
try {
36+
const kernel = await makeTestKernel(
37+
await makeSQLKernelDatabase({ dbFilename }),
38+
);
39+
try {
40+
const { rootKref, subclusterId } =
41+
await kernel.launchSubcluster(clusterConfig);
42+
await delay();
43+
44+
// Works before restart
45+
const r1 = await kernel.queueMessage(rootKref, 'useEphemeral', []);
46+
expect(kunser(r1)).toBe(999);
47+
48+
// Restart only the provider — the consumer still holds the
49+
// ephemeral ref, but the exo behind it no longer exists.
50+
const subcluster = kernel.getSubcluster(subclusterId);
51+
expect(subcluster).toBeDefined();
52+
await kernel.restartVat(subcluster!.vats.provider);
53+
await delay();
54+
55+
// The consumer's E(ephemeral).increment() targets an orphaned vref.
56+
// Liveslots in the provider throws "I don't remember allocating",
57+
// which terminates the provider and rejects the caller's promise.
58+
// This is surfaced to the caller as "target object has no owner".
59+
await expect(
60+
kernel.queueMessage(rootKref, 'useEphemeral', []),
61+
).rejects.toMatchObject({
62+
body: expect.stringContaining('target object has no owner'),
63+
});
64+
} finally {
65+
await kernel.stop();
66+
}
67+
} finally {
68+
await rm(tempDir, { recursive: true, force: true });
69+
}
70+
});
71+
});

packages/kernel-node-runtime/test/e2e/remote-comms.test.ts

Lines changed: 43 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -582,21 +582,17 @@ describe.sequential('Remote Communications E2E', () => {
582582
const results = await Promise.allSettled(messagePromises);
583583
expect(results).toHaveLength(201);
584584

585-
// Verify that messages within queue capacity were delivered
586-
const successfulResults = results.filter(
587-
(result) => result.status === 'fulfilled',
588-
);
589-
// At least 200 messages should succeed (the queue limit)
590-
expect(successfulResults.length).toBeGreaterThanOrEqual(200);
591-
592-
// Messages beyond queue capacity should be rejected with queue full error
593-
const rejectedResults = results.filter(
585+
// Messages beyond queue capacity should be rejected with queue full error.
586+
// Messages within capacity may fulfill or reject (e.g., if the remote vat
587+
// was restarted and references are stale), but they should NOT contain
588+
// "queue at capacity".
589+
const queueFullResults = results.filter(
594590
(result): result is PromiseRejectedResult =>
595-
result.status === 'rejected',
591+
result.status === 'rejected' &&
592+
String(result.reason).includes('queue at capacity'),
596593
);
597-
for (const result of rejectedResults) {
598-
expect(String(result.reason)).toContain('queue at capacity');
599-
}
594+
// At most 1 message (the 201st) should be rejected due to queue capacity
595+
expect(queueFullResults.length).toBeLessThanOrEqual(1);
600596

601597
const newMessageResult = await kernel1.queueMessage(
602598
aliceRef,
@@ -761,12 +757,11 @@ describe.sequential('Remote Communications E2E', () => {
761757
kernel2 = restartResult.kernel;
762758

763759
// The message should not have been delivered because we didn't reconnect
764-
const result = await messageAfterClose;
765-
const response = kunser(result);
766-
expect(response).toBeInstanceOf(Error);
767-
expect((response as Error).message).toContain(
768-
'Message delivery failed after intentional close',
769-
);
760+
await expect(messageAfterClose).rejects.toMatchObject({
761+
body: expect.stringContaining(
762+
'Message delivery failed after intentional close',
763+
),
764+
});
770765
},
771766
NETWORK_TIMEOUT * 2,
772767
);
@@ -844,18 +839,17 @@ describe.sequential('Remote Communications E2E', () => {
844839
await delay(100);
845840

846841
// Try to send a message after closing - should fail
847-
const messageAfterClose = kernel1.queueMessage(
848-
aliceRef,
849-
'sendRemoteMessage',
850-
[bobURL, 'hello', ['Alice']],
851-
);
852-
853-
const result = await messageAfterClose;
854-
const response = kunser(result);
855-
expect(response).toBeInstanceOf(Error);
856-
expect((response as Error).message).toContain(
857-
'Message delivery failed after intentional close',
858-
);
842+
await expect(
843+
kernel1.queueMessage(aliceRef, 'sendRemoteMessage', [
844+
bobURL,
845+
'hello',
846+
['Alice'],
847+
]),
848+
).rejects.toMatchObject({
849+
body: expect.stringContaining(
850+
'Message delivery failed after intentional close',
851+
),
852+
});
859853

860854
// Manually reconnect
861855
await kernel1.reconnectPeer(peerId2);
@@ -920,18 +914,18 @@ describe.sequential('Remote Communications E2E', () => {
920914
// and trigger promise rejection for pending work.
921915
// The await will naturally wait for the promise to settle - either
922916
// succeeding (unexpected) or failing due to incarnation change detection.
923-
const result = await kernel1.queueMessage(
924-
aliceRef,
925-
'sendRemoteMessage',
926-
[bobURL, 'hello', ['Alice']],
927-
);
928-
const response = kunser(result);
929-
930917
// The message should fail because incarnation changed.
931918
// The handshake detects the new incarnation and triggers onIncarnationChange,
932919
// which resets RemoteHandle state and rejects pending work.
933-
expect(response).toBeInstanceOf(Error);
934-
expect((response as Error).message).toMatch(/Remote connection lost/u);
920+
await expect(
921+
kernel1.queueMessage(aliceRef, 'sendRemoteMessage', [
922+
bobURL,
923+
'hello',
924+
['Alice'],
925+
]),
926+
).rejects.toMatchObject({
927+
body: expect.stringMatching(/Remote connection lost/u),
928+
});
935929
},
936930
NETWORK_TIMEOUT * 3,
937931
);
@@ -970,16 +964,15 @@ describe.sequential('Remote Communications E2E', () => {
970964
// The message will create a promise with the remote as decider (from URL redemption)
971965
// When we give up on the remote, that promise should be rejected
972966
// The vat should then propagate that rejection to the promise returned here
973-
const messagePromise = kernel1.queueMessage(
974-
aliceRef,
975-
'sendRemoteMessage',
976-
[bobURL, 'hello', ['Alice']],
977-
);
978-
979-
const result = await messagePromise;
980-
const response = kunser(result);
981-
expect(response).toBeInstanceOf(Error);
982-
expect((response as Error).message).toContain('Remote connection lost');
967+
await expect(
968+
kernel1.queueMessage(aliceRef, 'sendRemoteMessage', [
969+
bobURL,
970+
'hello',
971+
['Alice'],
972+
]),
973+
).rejects.toMatchObject({
974+
body: expect.stringContaining('Remote connection lost'),
975+
});
983976
},
984977
NETWORK_TIMEOUT * 2,
985978
);
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
import { E } from '@endo/eventual-send';
2+
import { makeDefaultExo } from '@metamask/kernel-utils/exo';
3+
4+
/**
5+
* A consumer vat that obtains an ephemeral exo reference from the provider
6+
* during bootstrap and calls it on demand.
7+
*
8+
* @returns The root object.
9+
*/
10+
export function buildRootObject() {
11+
let ephemeralRef: unknown;
12+
13+
return makeDefaultExo('root', {
14+
async bootstrap(vats: { provider: unknown }) {
15+
ephemeralRef = await E(vats.provider).getEphemeral();
16+
},
17+
18+
async useEphemeral() {
19+
return E(ephemeralRef).increment();
20+
},
21+
});
22+
}
Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
import { makeDefaultExo } from '@metamask/kernel-utils/exo';
2+
3+
/**
4+
* A provider vat that exposes a single ephemeral (non-durable) exo.
5+
* The exo will not survive a vat restart.
6+
*
7+
* @returns The root object.
8+
*/
9+
export function buildRootObject() {
10+
const ephemeral = makeDefaultExo('EphemeralCounter', {
11+
increment() {
12+
return 999;
13+
},
14+
});
15+
16+
return makeDefaultExo('root', {
17+
getEphemeral() {
18+
return ephemeral;
19+
},
20+
});
21+
}

packages/kernel-test/src/endowments.test.ts

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -49,7 +49,11 @@ describe('endowments', () => {
4949

5050
await waitUntilQuiescent();
5151

52-
await kernel.queueMessage(v1Root, 'hello', [`https://${badHost}`]);
52+
await expect(
53+
kernel.queueMessage(v1Root, 'hello', [`https://${badHost}`]),
54+
).rejects.toMatchObject({
55+
body: expect.stringContaining(`Invalid host: ${badHost}`),
56+
});
5357

5458
await waitUntilQuiescent();
5559

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
import { makeSQLKernelDatabase } from '@metamask/kernel-store/sqlite/nodejs';
2+
import { waitUntilQuiescent } from '@metamask/kernel-utils';
3+
import { kunser } from '@metamask/ocap-kernel';
4+
import { describe, expect, it } from 'vitest';
5+
6+
import { getBundleSpec, makeKernel, makeTestLogger } from './utils.ts';
7+
8+
describe('orphaned ephemeral exo', () => {
9+
it('rejects when provider vat restarts', async () => {
10+
const { logger } = makeTestLogger();
11+
const database = await makeSQLKernelDatabase({});
12+
const kernel = await makeKernel(database, true, logger);
13+
14+
const { rootKref, subclusterId } = await kernel.launchSubcluster({
15+
bootstrap: 'consumer',
16+
vats: {
17+
provider: {
18+
bundleSpec: getBundleSpec('orphaned-ephemeral-provider'),
19+
parameters: {},
20+
},
21+
consumer: {
22+
bundleSpec: getBundleSpec('orphaned-ephemeral-consumer'),
23+
parameters: {},
24+
},
25+
},
26+
});
27+
await waitUntilQuiescent();
28+
29+
// Works before restart
30+
const r1 = await kernel.queueMessage(rootKref, 'useEphemeral', []);
31+
expect(kunser(r1)).toBe(999);
32+
33+
// Restart only the provider — the consumer still holds the
34+
// ephemeral ref, but the exo behind it no longer exists.
35+
const subcluster = kernel.getSubcluster(subclusterId);
36+
expect(subcluster).toBeDefined();
37+
await kernel.restartVat(subcluster!.vats.provider);
38+
await waitUntilQuiescent();
39+
40+
// The consumer's E(ephemeral).increment() targets an orphaned vref.
41+
// Liveslots in the provider throws "I don't remember allocating",
42+
// which terminates the provider vat. The message is retried in a new
43+
// crank, but the endpoint is gone — so it splats and rejects.
44+
await expect(
45+
kernel.queueMessage(rootKref, 'useEphemeral', []),
46+
).rejects.toMatchObject({
47+
body: expect.stringContaining('has no owner'),
48+
});
49+
});
50+
});

0 commit comments

Comments
 (0)