From 2f83449be3845f652631fe47c42eb3c7919900db Mon Sep 17 00:00:00 2001 From: Makram Kamaleddine Date: Thu, 21 May 2026 20:51:31 +0300 Subject: [PATCH 1/2] docs: add e2e debugging doc --- docs/end-to-end-debugging.md | 414 +++++++++++++++++++++++++++++++++++ 1 file changed, 414 insertions(+) create mode 100644 docs/end-to-end-debugging.md diff --git a/docs/end-to-end-debugging.md b/docs/end-to-end-debugging.md new file mode 100644 index 000000000..a46fe895f --- /dev/null +++ b/docs/end-to-end-debugging.md @@ -0,0 +1,414 @@ +# End-to-End Message Debugging + +This guide walks a single `messageID` through the full offchain CCIP pipeline. Use it when you need to answer: *where did this message stop?* + +For deep dives on one service, see the [per-service debugging guides](../README.md#debugging-guides). This document focuses on **order**, **happy-path anchors**, and **what to search for when an anchor is missing**. + +## Pipeline at a glance + +```mermaid +flowchart TB + subgraph V["Verifier (per source chain)"] + V1["SourceReader: discover + finality"] + V2["TaskVerifier: sign / attest"] + V3["StorageWriter: persist"] + end + + subgraph A["Aggregator (committee only)"] + A1["WriteCommitteeVerifierNodeResult"] + A2["Quorum + aggregated report"] + end + + subgraph I["Indexer"] + I1["Discovery: GetMessagesSince"] + I2["Worker: token REST fetches"] + end + + subgraph E["Executor"] + E1["Stream: GET /messages"] + E2["Heap + elector stagger"] + E3["GET /verifierresults + transmit"] + end + + V1 --> V2 --> V3 + V3 -->|committee gRPC write| A1 --> A2 + A2 --> I1 + V3 -->|token: verifier HTTP storage| I2 + I1 --> I2 + I1 --> E1 --> E2 --> E3 +``` + +**Two verification shapes:** + +| Path | Verifier output goes to | Indexer still discovers message via | +|------|-------------------------|-------------------------------------| +| **Committee** | Aggregator → `GetMessagesSince` | Aggregator (discovery marker + quorum result) | +| **Token** (USDC / Lombard) | Verifier HTTP API (`/verifications`) | Aggregator (`GetMessagesSince` for message shell) | + +--- + +## Before you start + +1. Normalize the ID as `0x` + 64 hex chars (most logs use `messageID=0x…`). +2. Filter logs by that ID across services (Datadog, Loki, local files). +3. Identify which verifiers apply (committee, USDC, Lombard) from the message’s CCV addresses / lane config. +4. Work **top to bottom** — do not skip to executor if verifier never persisted. + +**Global grep:** + +```bash +MSG=0xYOUR_MESSAGE_ID +grep -E "messageID=.*${MSG#0x}|messageID=${MSG}|${MSG}" +``` + +--- + +## Stage 0 — On-chain source (context only) + +Offchain debugging assumes the message was emitted on the source chain OnRamp. If nothing appears in verifier logs, confirm the on-chain event exists for this `messageID` on the source chain first (outside this repo’s services). + +--- + +## Stage 1 — Verifier: discovery and enqueue + +**Service:** Verifier `SourceReaderService` +**Detail:** [verifier/docs/debugging.md](../verifier/docs/debugging.md) + +### What happens + +Events are read from the source chain, filtered, held until finality (and curse/rules checks pass), then published to `ccv_task_verifier_jobs`. + +### Happy path + +| Order | Log message | Level | +|-------|-------------|-------| +| 1 | `Added message to pending queue` | Info | +| 2 | `Finality check` with `meetsRequirement=true` | Info | +| 3 | `Successfully published and tracked tasks` | Info | + +**Anchor:** If you see **#3**, the message entered the verifier task queue. + +### If anchor not found — search for + +| Symptom | Logs / areas to check | +|---------|------------------------| +| Never discovered | `Message filtered out by filter`; `No events found in range`; `Processed block range` (chain stalled?) | +| ID bad | `Message ID mismatch`; `Failed to compute message ID` | +| Stuck pending | `Finality check` with `meetsRequirement=false`; `Reorg-affected message finality check` | +| Dropped | `Dropping task - lane is cursed`; `Dropping task - message matched a disablement rule` | +| Not published | `Failed to publish tasks to job queue`; `Curse or message rules state unknown` | +| Reorg | `Removing task from pending queue due to reorg` | +| Chain down | `FINALITY VIOLATION`; `Chain is disabled, skipping` | + +**Config context:** `component=Service`, `chain=` on source-reader logs. + +--- + +## Stage 2 — Verifier: verification + +**Service:** Verifier `TaskVerifier` + pluggable verifier +**Detail:** [verifier/docs/debugging.md#verifier-implementations](../verifier/docs/debugging.md#verifier-implementations) + +### What happens + +Tasks are consumed from `ccv_task_verifier_jobs`, verified (committee sign, CCTP attestation, Lombard attestation, etc.), and successful results are published to `ccv_storage_writer_jobs`. + +### Happy path + +| Order | Log message | Verifier type | +|-------|-------------|---------------| +| 1 | *(batch)* `Processing verification tasks batch` | TaskVerifier (Debug) | +| 2a | `Message verification completed successfully` | **Commit** | +| 2b | `VerifierResults: Successfully verified message` | **CCTP / Lombard** | +| 3 | No `Message verification failed` for this `messageID` | TaskVerifier | + +**Anchor:** **#2a or #2b** without a matching **#3** failure. + +### If anchor not found — search for + +| Symptom | Logs / areas to check | +|---------|------------------------| +| Not consumed | Queue empty / job stuck in `processing` — see verifier DB `ccv_task_verifier_jobs`; batch errors without per-ID | +| Committee failed | `Message verification failed` with `retryable`; commit path lacks per-error Info logs — read `error` field | +| CCTP not ready | `Attestation not ready for message` (Debug); `Failed to fetch attestation` | +| Lombard not ready | `Attestation not found for message`; `Attestation not ready for message` | +| Permanent fail | `Message verification failed` with `retryable=false` | +| Mapping bug | `Job ID not found for message` | + +**Logger context (token):** `messageID` + `txHash` on all CCTP/Lombard task lines. + +--- + +## Stage 3 — Verifier: offchain persist + +**Service:** Verifier `StorageWriter` +**Detail:** [verifier/docs/debugging.md#stage-3--storagewriter-processor](../verifier/docs/debugging.md) + +### What happens + +Verifier results are written to configured offchain storage. For **committee**, this typically pushes to the **aggregator** via gRPC. For **token**, data is exposed on the verifier HTTP API for the indexer to pull later. + +### Happy path + +| Order | Log message | Level | +|-------|-------------|-------| +| 1 | `Write succeeded for message` | Debug | +| 2 | `CCV data batch write completed` with `successful >= 1` | Info | + +**Anchor:** **`Write succeeded for message`** with your `messageID`. + +### If anchor not found — search for + +| Symptom | Logs / areas to check | +|---------|------------------------| +| Prior stage failed | Return to Stage 2 | +| Write failed | `Write failed for message (retryable)` / `(non-retryable)` | +| Batch catastrophe | `Failed to write CCV data batch to storage with no results` | +| Queue stuck | `Failed to publish verification results to queue` (TaskVerifier — result never reached storage writer) | +| Committee → agg | Aggregator: `Signature validated successfully` → `Triggered aggregation check` (Stage 4) | + +--- + +## Stage 4 — Aggregator: committee quorum (committee path only) + +**Service:** Aggregator +**Detail:** [aggregator/docs/debugging.md](../aggregator/docs/debugging.md) + +Skip this stage for pure token lanes that never write committee signatures to the aggregator. + +### What happens + +Each committee node writes `WriteCommitteeVerifierNodeResult`. The aggregator stores per-node verifications, checks quorum asynchronously, and persists `commit_aggregated_reports`. + +### Happy path + +| Order | Log message | Level | +|-------|-------------|-------| +| 1 | `Signature validated successfully` | Info | +| 2 | `Successfully saved commit verification record` | Info | +| 3 | `Triggered aggregation check` | Info | +| 4 | `Checking aggregation for message` | Info | +| 5 | `Report submitted successfully` | Info | + +**Anchor:** **`Report submitted successfully`** with `messageID` in context. + +### If anchor not found — search for + +| Symptom | Logs / areas to check | +|---------|------------------------| +| Write never arrived | Verifier Stage 3 + aggregator ingress `Request failed`; auth/HMAC errors (no `messageID`) | +| Write rejected | `validation error`; `signature validation failed`; `Rejected write: message matched a disablement rule` | +| Backpressure | `Aggregation channel is full` | +| Quorum not yet met | `Quorum not met, not submitting report` — need more node writes | +| Already done | `Skipping aggregation: existing report already meets quorum` | +| Stuck async | `Failed to list verifications`; `Failed to submit report` | +| Orphan | `Failed to process orphaned record` (orphan recoverer) | + +**Note:** `WriteStatus_SUCCESS` only means enqueue to aggregation — not quorum complete. + +--- + +## Stage 5 — Indexer: aggregator discovery + +**Service:** Indexer `AggregatorMessageDiscovery` +**Detail:** [indexer/docs/debugging.md#stage-1--sourcereaderservice](../indexer/docs/debugging.md) + +### What happens + +Indexer polls aggregator `GetMessagesSince`, persists messages (and non–discovery-only verifications), emits work to the indexer worker channel. + +### Happy path + +| Order | Log message | Level | +|-------|-------------|-------| +| 1 | `Found Message` | Info | +| 2 | `Discovery batch persisted` | Debug | +| 3 | `Enqueueing new Message` | Info | + +**Anchor:** **`Enqueueing new Message`** with your `messageID` (worker pool picked it up). + +### If anchor not found — search for + +| Symptom | Logs / areas to check | +|---------|------------------------| +| Aggregator has no report | Stage 4 `Report submitted successfully` | +| Poll failing | `Error reading VerificationResult from aggregator`; `Circuit breaker is open` | +| Not persisted | `Unable to persist discovery batch, will retry` | +| Filtered at encode | `Skipping message, cannot encode for insert` | +| Deduped (multi-source) | `messageID already discovered from different source, skipping` | +| Stream never ran | `MessageDiscovery Started` missing / `Error calling Aggregator` | + +--- + +## Stage 6 — Indexer: token / missing verifier fetches + +**Service:** Indexer `Task` + `VerifierReader` (REST to token verifiers) +**Detail:** [indexer/docs/debugging.md#stage-2--taskverifier-processor](../indexer/docs/debugging.md) + +### What happens + +For each CCV address on the message not yet in DB, the indexer calls verifier HTTP APIs (or aggregator `GetVerifierResultsForMessage` for aggregator-type verifiers). + +### Happy path + +| Order | Log message | Level | +|-------|-------------|-------| +| 1 | `Attempting to retrieve N verifications for the message` | Info | +| 2 | `Received result from … for MessageID` | Debug | +| 3 | `Collected N new verifications for the message` (N > 0) | Info | +| 4 | No `Message … entered DLQ` | WorkerPool | + +**Anchor:** **`Collected N new verifications`** with N > 0, then message eventually `successful` in DB. + +### If anchor not found — search for + +| Symptom | Logs / areas to check | +|---------|------------------------| +| Stage 5 missing | Return to indexer discovery | +| Unknown verifier | `Detected N unknown verifiers`; `Unknown CCVs` — registry / `issuer_addresses` | +| Token HTTP 404 | `REST reader 404 (no results)` — verifier Stage 2/3 not done for token | +| Token HTTP error | `REST reader HTTP request failed`; `REST reader unexpected status` | +| Already have data | `Attempting to retrieve 0 verifications` — check DB `verifier_results` | +| Gave up | `Message … entered DLQ` | + +**Cross-check verifier:** `Write succeeded for message` (committee) or token `Successfully verified message`. + +--- + +## Stage 7 — Executor: message stream + +**Service:** Executor `IndexerStorageStreamer` +**Detail:** [executor/docs/debugging.md#stage-1--indexer-message-stream](../executor/docs/debugging.md) + +### What happens + +Executor polls indexer `GET /v1/messages` and forwards new messages to the coordinator. + +### Happy path + +| Order | Log message | Level | +|-------|-------------|-------| +| 1 | `Found net new message from Indexer` | Info | + +**Anchor:** **`Found net new message from Indexer`** with your `messageID`. + +### If anchor not found — search for + +| Symptom | Logs / areas to check | +|---------|------------------------| +| Indexer not ready | Indexer Stage 5–6; message `status` in `indexer.messages` | +| Poll error | `IndexerStorageStreamer read error` | +| Wrong dest chain | `enabled_dest_chains` vs message `dest_chain_selector` | +| Stream down | `failed to start ccv result streamer`; `streamerResults closed` | +| Invalid payload | `dropping message with invalid ID` | + +--- + +## Stage 8 — Executor: schedule and execute + +**Service:** Executor `Coordinator` + `ChainlinkExecutor` +**Detail:** [executor/docs/debugging.md](../executor/docs/debugging.md) + +### What happens + +Message waits on delayed heap until leader-elector `readyTimestamp`, then workers call `HandleMessage`: curse check, on-chain state, indexer verifier results + off-ramp quorum, honest-attempt check, transmit. + +### Happy path + +| Order | Log message | Level | +|-------|-------------|-------| +| 1 | `pushing message to delayed heap` | Info | +| 2 | `processing message with ID` | Info | +| 3 | `got ccv info and verifier results` with `verifierResultsLen > 0` | Info | +| 4 | `transmitting aggregated report to chain` | Info | +| 5 | No `message should be retried` / `will retry execution` for same ID | Info/Warn | + +**Anchor:** **`transmitting aggregated report to chain`** — offchain pipeline complete for this attempt. + +### If anchor not found — search for + +| Symptom | Logs / areas to check | +|---------|------------------------| +| Stage 7 missing | Return to executor stream | +| Long wait before #2 | `pushing message to delayed heap` → check `readyTimestamp`; Debug `calculated ready timestamp` | +| Not this executor | `skipping message, executor not in pool for destination chain` | +| Duplicate / skip | `message already in delayed heap`; `message already in flight` | +| Expired | `message has expired` | +| Indexer empty | `delaying execution due to no verifier results`; indexer `CCV data not found` (404) — Stages 5–6 | +| Quorum not met | `message did not meet verifier quorum`; `delaying execution due to failed request for verifier results` | +| Already on-chain | `skipping execution due to already being successfully executed` | +| Someone else executed | `skipping execution due to existing honest attempt` | +| Curse | `delaying execution due to curse` | +| Tx failed | `will retry execution due to failed ConvertAndWriteMessageToChain` | +| Permanent | `skipping retry due to message encoding error`; `impossible receiver verifier quorum` | +| Indexer failover | `No healthy alternates found`; `Switching active indexer` | + +--- + +## Full happy-path chain (committee + token) + +Use as a single checklist — each line should appear in order for a fully healthy lane: + +```text +[Verifier] + Added message to pending queue + Finality check (meetsRequirement=true) + Successfully published and tracked tasks + Message verification completed successfully OR VerifierResults: Successfully verified message + Write succeeded for message + +[Aggregator — committee] + Signature validated successfully + Triggered aggregation check + Report submitted successfully + +[Indexer] + Found Message + Enqueueing new Message + Collected N new verifications for the message (N > 0 if token verifiers required) + +[Executor] + Found net new message from Indexer + pushing message to delayed heap + processing message with ID + got ccv info and verifier results + transmitting aggregated report to chain +``` + +**Token-only note:** Committee verifier lines in the middle may be replaced by discovery markers from aggregator `GetMessagesSince`; token attestations still require indexer Stage 6 and verifier token logs. + +--- + +## Where am I stuck? (quick routing) + +| Last happy-path log you see | Investigate next | +|---------------------------|------------------| +| *(none in verifier)* | Stage 1 — source reader / chain disabled | +| `Successfully published and tracked tasks` | Stage 2 — task verifier | +| `Message verification failed` | Stage 2 — error + `retryable` | +| `Write succeeded for message` (committee) | Stage 4 — aggregator | +| `Write succeeded` but no aggregator write | Aggregator connectivity / HMAC / `Aggregation channel is full` | +| `Report submitted successfully` | Stage 5 — indexer discovery | +| `Enqueueing new Message` | Stage 6 — token REST / registry | +| `Collected N new verifications` | Stage 7 — executor stream config | +| `Found net new message from Indexer` | Stage 8 — heap timing / elector / execution | +| `got ccv info and verifier results` | Stage 8 — transmit / RPC / quorum ordering | +| `transmitting aggregated report to chain` | On-chain receipt / dest chain tooling (outside this doc) | + +--- + +## Per-service reference + +| Service | Deep-dive guide | +|---------|-----------------| +| Verifier | [verifier/docs/debugging.md](../verifier/docs/debugging.md) | +| Aggregator | [aggregator/docs/debugging.md](../aggregator/docs/debugging.md) | +| Indexer | [indexer/docs/debugging.md](../indexer/docs/debugging.md) | +| Executor | [executor/docs/debugging.md](../executor/docs/debugging.md) | + +--- + +## Related + +- [Repository README — Debugging guides](../README.md#debugging-guides) +- [Verifier design](../verifier/docs/verifier.md) From 1ca5fa88b6e7348823fcbc90541c3d065f3296a6 Mon Sep 17 00:00:00 2001 From: Makram Kamaleddine Date: Thu, 21 May 2026 20:55:22 +0300 Subject: [PATCH 2/2] add per-service guides --- README.md | 11 + aggregator/README.md | 3 + aggregator/docs/debugging.md | 393 ++++++++++++++++++++++++++++++++++ executor/README.md | 3 + executor/docs/debugging.md | 348 ++++++++++++++++++++++++++++++ indexer/README.md | 3 + indexer/docs/debugging.md | 374 ++++++++++++++++++++++++++++++++ verifier/docs/debugging.md | 398 +++++++++++++++++++++++++++++++++++ 8 files changed, 1533 insertions(+) create mode 100644 aggregator/docs/debugging.md create mode 100644 executor/docs/debugging.md create mode 100644 indexer/docs/debugging.md create mode 100644 verifier/docs/debugging.md diff --git a/README.md b/README.md index 295e71072..e55439f43 100644 --- a/README.md +++ b/README.md @@ -22,7 +22,18 @@ just tests ## Local Dev Environment Follow the [README](./build/devenv/README.md) +## Debugging guides +**Start here for a full message trace:** [End-to-end message debugging](./docs/end-to-end-debugging.md) — happy-path anchors per pipeline stage and what to search when each anchor is missing. + +Per-service references (grep patterns, log tables, DB queries): + +| Stage | Guide | +|-------|--------| +| Verifier | [verifier/docs/debugging.md](./verifier/docs/debugging.md) — Source discovery, verification, and storage-writer | +| Aggregator | [aggregator/docs/debugging.md](./aggregator/docs/debugging.md) — Committee signatures, quorum, aggregated reports | +| Indexer | [indexer/docs/debugging.md](./indexer/docs/debugging.md) — Aggregator discovery and token verifier fetches | +| Executor | [executor/docs/debugging.md](./executor/docs/debugging.md) — Indexer CCV data and on-chain execution | ### Glossary - CCV: Cross chain verifier comprising two onchain Verifier components and an offchain Verifier component which exposes VerifierResults (usually containing some form of authentication, like signatures/zkps etc.) from its offchain component to be passed to its onchain component. diff --git a/aggregator/README.md b/aggregator/README.md index f112e2d9f..c1ec02361 100644 --- a/aggregator/README.md +++ b/aggregator/README.md @@ -1,4 +1,7 @@ # Aggregator Service + +For tracing a message through write → quorum → aggregated report, see [docs/debugging.md](docs/debugging.md). + Build dev image (hot-reload) ``` just build-dev diff --git a/aggregator/docs/debugging.md b/aggregator/docs/debugging.md new file mode 100644 index 000000000..bf7798e08 --- /dev/null +++ b/aggregator/docs/debugging.md @@ -0,0 +1,393 @@ +# Aggregator Debugging Guide + +This guide helps you trace a single CCIP message through the aggregator’s commit-verification pipeline using structured logs. Committee verifiers push signed results via gRPC (`WriteCommitteeVerifierNodeResult` / `BatchWriteCommitteeVerifierNodeResult`); the aggregator stores per-node verifications, checks quorum asynchronously, and persists an aggregated report when threshold is met. + +For how verifiers produce signatures upstream, see the [Verifier debugging guide](../../verifier/docs/debugging.md). + +## Pipeline overview + +```mermaid +flowchart LR + CV["Committee verifier(s)\ngRPC write"] + WH["WriteCommitVerifierNodeResultHandler"] + DB["PostgreSQL\ncommit_verification_records"] + CH["Aggregation channel\n(per caller_id)"] + AW["CommitReportAggregator\nbackground workers"] + QV["EVMQuorumValidator"] + AR["commit_aggregated_reports"] + + CV -->|signature + message| WH + WH -->|SaveCommitVerification| DB + WH -->|CheckAggregation| CH + CH --> AW + AW -->|List verifications| DB + AW --> QV + QV -->|quorum met| AR +``` + +**Stages:** + +1. **Write handler** — Validates request, checks message-disablement rules, validates ECDSA signature, derives `aggregationKey`, saves `CommitVerificationRecord`, enqueues aggregation work. +2. **Channel manager** — Fair-schedules aggregation requests from per-client buffers (one buffer per `caller_id` / API client, plus an orphan-recovery channel). +3. **Aggregation worker** — Loads all verifications for `(messageID, aggregationKey)`, checks quorum, maps to exportable proto, calls `SubmitAggregatedReport` (PostgreSQL sink). +4. **Orphan recoverer** (optional) — Periodically re-triggers aggregation for `(message_id, aggregation_key)` pairs that have verifications but no aggregated report. + +A message has **reached quorum on the aggregator** when you see **`Report submitted successfully`** with `messageID` in context (and a row exists in `commit_aggregated_reports` for that message and aggregation key). + +--- + +## End-to-end flow (committee verifier → aggregator) + +Typical path when using the commit verifier + aggregator writer: + +1. Verifier signs message ([verifier commit logs](../../verifier/docs/debugging.md#verifier-implementations)). +2. Verifier storage writer succeeds (`Write succeeded for message`). +3. Aggregator writer issues `BatchWriteCommitteeVerifierNodeResult` (or single write). +4. Aggregator: signature validated → verification saved → aggregation triggered. +5. Background worker: quorum met → aggregated report submitted. + +If step 4 succeeds but step 5 shows `Quorum not met`, more committee nodes must write signatures for the same `(messageID, aggregationKey)` before quorum is reached. + +--- + +## Filtering logs for one message + +### Structured context fields + +The aggregator augments loggers from gRPC context (`aggregator/pkg/scope/scope.go`): + +| Field | When present | Meaning | +|-------|----------------|---------| +| `messageID` | After write/read handlers attach scope | Hex string (`0x…`) via `protocol.ByteSlice` | +| `aggregationKey` | After aggregation key derived / worker started | Hex-encoded signable hash grouping verifications | +| `address` | After signer identified on write | Recovered committee signer address | +| `caller_id` | Authenticated write requests | API client / verifier identity | +| `requestID` | All gRPC requests | Per-request UUID | +| `apiName` | All gRPC requests | Full gRPC method (e.g. `/…/WriteCommitteeVerifierNodeResult`) | +| `component` | Subsystems | `aggregator_worker`, `OrphanRecoverer` | + +### Grep examples + +```bash +# messageID on context (write path, aggregation worker, read paths) +grep 'messageID=0xYOUR_MESSAGE_ID' + +# Orphan recoverer uses raw hex without 0x prefix +grep 'messageID=YOUR_MESSAGE_ID_WITHOUT_0x' + +# Aggregation key (if you know it) +grep 'aggregationKey=YOUR_AGGREGATION_KEY_HEX' + +# Ingress API +grep 'apiName=.*/WriteCommitteeVerifierNodeResult' +grep 'apiName=.*/BatchWriteCommitteeVerifierNodeResult' + +# Successful quorum + persist +grep 'Report submitted successfully' | grep 'messageID=0xYOUR_MESSAGE_ID' +``` + +--- + +## Happy-path checklist + +| Step | What to look for | Location | +|------|------------------|----------| +| 1. gRPC ingress | `Request completed` + write API `apiName` | `LoggingMiddleware` | +| 2. Signature OK | `Signature validated successfully` | `WriteCommitVerifierNodeResultHandler` | +| 3. Saved | `Successfully saved commit verification record` | same (with `address` on logger) | +| 4. Aggregation queued | `Triggered aggregation check` | same | +| 5. Worker started | `Checking aggregation for message` | `CommitReportAggregator` | +| 6. Quorum met | `Quorum met with N unique signer addresses` (Debug) | `EVMQuorumValidator` | +| 7. Report stored | `Report submitted successfully` | `CommitReportAggregator` | + +Downstream consumers (indexer / message discovery) poll via `GetMessagesSince`; verifiers or tools may read via `GetVerifierResultsForMessage`. + +--- + +## Stage 1: gRPC ingress (all APIs) + +**Source:** `aggregator/pkg/middlewares/logging_middleware.go`, `scoping_middleware.go` + +Every unary RPC gets `requestID` and `apiName` on the logger. + +| Level | Message | Key fields | +|-------|---------|------------| +| Info | `Request received` | `requestID`, `apiName`, `caller_id` (if auth) | +| Info | `Request completed` | `duration_ms`, `status` | +| Error | `Request failed` | `duration_ms`, `status` | +| Debug | `Request payload received` / `Response sent` | `payload` (verbose) | + +**Write path APIs** (committee verifier signatures): + +- `…/WriteCommitteeVerifierNodeResult` +- `…/BatchWriteCommitteeVerifierNodeResult` + +**Read path APIs** (usually no `messageID` until handler runs): + +- `…/ReadCommitteeVerifierNodeResult` — sets `messageID` from request +- `…/GetVerifierResultsForMessage` — per-ID in loop (see below) +- `…/GetMessagesSince` — sequence-based, logs `messageID` at Trace for each report + +**Auth failures** (no `messageID`): `HMACAuthMiddleware`, `AnonymousAuthMiddleware`, `RequireAuthMiddleware` — `Authentication failed`, etc. + +--- + +## Stage 2: WriteCommitVerifierNodeResultHandler + +**Source:** `aggregator/pkg/handlers/write_commit_verifier_node_result.go` + +After proto conversion, `scope.WithMessageID(ctx, record.MessageID)` is applied — subsequent logs for that request include **`messageID`**. + +| Level | Message | `messageID` | Notes | +|-------|---------|-------------|-------| +| Warn | `validation error` | no | Before message scoped | +| Error | `Failed to convert proto to domain model` | no | Invalid payload | +| Info | `Rejected write: message matched a disablement rule` | yes | `FailedPrecondition` | +| Error | `signature validation failed` | yes | Invalid / unknown signer | +| Info | `Signature validated successfully` | yes | | +| Error | `failed to derive aggregation key` | yes | | +| Error | `failed to save commit verification record` | yes | Uses `address` in signer context | +| Info | `Successfully saved commit verification record` | yes | `address` field | +| Error | `Aggregation channel is full` | yes | `ResourceExhausted` — backpressure | +| Error | `failed to trigger aggregation` | yes | | +| Info | `Triggered aggregation check` | yes | Request succeeds (`WriteStatus_SUCCESS`) | + +**Important:** A successful write response only means the verification row was saved and aggregation was **enqueued**. Quorum and aggregated report creation happen asynchronously. + +### Batch write + +**Source:** `aggregator/pkg/handlers/batch_write_commit_verifier_node_result.go` + +Each sub-request runs `WriteCommitVerifierNodeResultHandler.Handle` in its own goroutine (with the batch request’s context). Per-message logs still get `messageID` from the child handler. + +Batch-level errors (no per-message ID on parent context): + +| Level | Message | +|-------|---------| +| Error | `unexpected error type` | +| Error | `failed to write commit verification node result` | + +--- + +## Stage 3: Signature validation (write path) + +**Source:** `aggregator/pkg/quorum/evm_quorum_validator.go` (`ValidateSignature`, `DeriveAggregationKey`) + +Runs during the write handler with `messageID` already on context. + +| Level | Message | Typical meaning | +|-------|---------|-----------------| +| Debug | `Validating signature for report` | Start validation | +| Error | `Missing signature in report` | Empty signature | +| Error | `Failed to compute message hash` | Bad message payload | +| Error | `Failed to produce signed hash` | CCV version / committee hash error | +| Error | `Failed to decode single signature` | Malformed signature bytes | +| Error | `Failed to recover address from signature` | ECDSA recovery failed | +| Info | `Recovered address from signature` | Signer matched committee config | +| Debug | `No valid signers found for the provided signature` | Recovered address not in committee | +| Error | `committee config not found` | Missing committee for chain | + +--- + +## Stage 4: Aggregation worker + +**Source:** `aggregator/pkg/aggregation/aggregator.go` + +Worker context includes **`messageID`** and **`aggregationKey`** (`scope.WithMessageID` / `WithAggregationKey` in `StartBackground`). + +| Level | Message | `messageID` | Key fields | +|-------|---------|-------------|------------| +| Info | `Checking aggregation for message` | yes | Start of work item | +| Warn | `No aggregated store available, cannot check existing aggregations` | yes | | +| Warn | `Failed to check for existing aggregated report` | yes | | +| Debug | `No existing aggregated report found, proceeding with aggregation` | yes | | +| Info | `Skipping aggregation: existing report already meets quorum` | yes | `verificationCount` — idempotent skip | +| Info | `Existing report no longer meets quorum, proceeding with new aggregation` | yes | Config change scenario | +| Error | `Failed to list verifications` | yes | DB read failure | +| Debug | `Verifications retrieved` | yes | `count` | +| Debug | `Aggregated report created` | yes | `verificationCount` | +| Error | `Failed to check quorum` | yes | | +| Error | `Cannot map aggregated report for export: committee configuration is missing` | yes | Blocked export | +| Error | `Aggregated report is not exportable` | yes | Mapping / quorum encoding error | +| Error | `Failed to submit report` | yes | `SubmitAggregatedReport` failed | +| **Info** | **`Report submitted successfully`** | yes | **`verifications`, `timeToAggregation`** — **success** | +| Info | `Quorum not met, not submitting report` | yes | `verifications` — need more signatures | +| Error | `Panic during aggregation` | yes | Worker panic | + +### Quorum check (during aggregation) + +**Source:** `aggregator/pkg/quorum/evm_quorum_validator.go` (`CheckQuorum`) + +| Level | Message | Meaning | +|-------|---------|---------| +| Error | `No verifications found` | Empty set | +| Warn | `Source verifier address not in message CCV addresses, skipping aggregation` | Read-path filter parity | +| Debug | `Not enough verifications to meet quorum: have X, need Y` | Below threshold | +| Error | `Failed to validate signature` | Per-verification failure (others may still count) | +| Warn | `No valid signer found. Might be due to a config change` | Signer not in current committee | +| Error | `Hash mismatch detected - possible data tampering` | Inconsistent signable hashes | +| Debug | `Quorum not met: have X unique signer addresses, need Y` | Distinct signers below threshold | +| Debug | `Quorum met with N unique signer addresses` | Quorum satisfied | + +### Channel / shutdown (usually no `messageID`) + +| Level | Message | +|-------|---------| +| Warn | `aggregation queue full` / `aggregation queue over 80% full` | +| Error | `Drain timed out, dropping in-flight work` | +| Infow | `Channel drained, waiting for in-flight workers` | +| Infow | `Aggregation worker shutdown completed cleanly` | + +**Enqueue timeout:** `CheckAggregation` returns `ErrAggregationChannelFull` if the per-`caller_id` buffer is full for `CheckAggregationTimeout` — surfaced on the write handler as `Aggregation channel is full`. + +--- + +## Stage 5: Orphan recovery + +**Source:** `aggregator/pkg/orphan_recoverer.go` + +Recovers verifications that were saved but never aggregated (e.g. transient aggregation errors). Re-triggers `CheckAggregation` on channel `OrphanRecoveryChannelKey`. + +| Level | Message | `messageID` format | +|-------|---------|-------------------| +| Info | `Starting orphan recovery process` | no | +| Info | `Initiating orphan recovery scan` | no | +| Info | `Orphan stats` | no — `nonExpired`, `expired`, `total` | +| Error | `Failed to process orphaned record` | **`%x` hex (no `0x` prefix)** | +| Debug | `Successfully processed orphaned record` | **`%x` hex** | +| Debug | `Successfully triggered re-aggregation check` | **`%x` hex** | +| Info | `Orphan recovery completed` | no — `processed`, `errors` | +| Error | `Reached max orphans per scan` | no | + +--- + +## Read paths + +### GetVerifierResultsForMessage + +**Source:** `aggregator/pkg/handlers/get_verifier_results_for_message.go` + +Batch read of latest aggregated report per message ID (used by verifier HTTP API and tools). + +| Level | Message | `messageID` | +|-------|---------|-------------| +| Error | `Failed to retrieve batch CCV data` | no | +| Error | `Quorum config not found for source selector … message ID %s` | in message string | +| Debug | `Source verifier address not in ccvAddresses for message ID %s` | in message string | +| Error | `Failed to map aggregated report to proto` | yes — `messageID` field | + +gRPC response uses per-index errors (`NotFound` when no aggregated report). + +### GetMessagesSince + +**Source:** `aggregator/pkg/handlers/get_messages_since.go` + +Message-discovery polling by sequence number. + +| Level | Message | `messageID` | +|-------|---------|-------------| +| Trace | `Received GetMessagesSinceRequest, sinceSequence: N` | no | +| Trace | `Report MessageID: %x, Sequence: …` | hex in message | +| Error | `failed to map aggregated report to proto` | yes | +| Error | `missing quorum config for source chain selector` | yes | +| Trace | `Returning N records for GetMessagesSinceRequest` | no | + +### ReadCommitVerifierNodeResult + +**Source:** `aggregator/pkg/handlers/read_commit_verifier_node_result.go` + +| Level | Message | `messageID` | +|-------|---------|-------------| +| Warn | `validation error` | no | +| Error | `failed to get commit verification record` | yes + `address` | +| Error | `failed to convert record to proto` | yes | + +--- + +## Storage layer logs + +**Source:** `aggregator/pkg/storage/postgres/database_storage.go` + +Generally **no per-message success logs** on write; errors may include `message_id`: + +| Level | Message | `message_id` / `messageID` | +|-------|---------|---------------------------| +| Error | `failed to parse message_id hex in aggregated report, skipping` | `message_id` | +| Error | `corrupted verification row, excluding entire report` | `report_key`, `verification_id` | +| Error | `failed to parse message ID` (orphan scan) | no | +| Error | `corrupted verification row in batch report, excluding entire report` | `message_id` | + +`SubmitAggregatedReport` does not log on success — confirm via aggregation worker `Report submitted successfully` or DB query. + +--- + +## Database checks (when logs are inconclusive) + +```sql +-- Per-node verification (one row per signer per aggregation key) +SELECT message_id, signer_identifier, aggregation_key, created_at +FROM commit_verification_records +WHERE message_id = '0xYOUR_MESSAGE_ID' +ORDER BY created_at; + +-- Aggregated report (quorum result) +SELECT message_id, aggregation_key, created_at, seq_num +FROM commit_aggregated_reports +WHERE message_id = '0xYOUR_MESSAGE_ID' +ORDER BY created_at DESC; +``` + +Orphan backlog: orphan recoverer `Orphan stats` metrics/logs, or storage `ListOrphanedKeys` logic (verifications without matching aggregated report for same `aggregation_key`). + +--- + +## Suggested debug workflows + +### Write returns SUCCESS but no aggregated report + +1. `Triggered aggregation check` — was aggregation enqueued? +2. `Checking aggregation for message` — did worker pick it up? +3. `Quorum not met, not submitting report` — count `verifications` vs committee threshold; need more node writes. +4. `Skipping aggregation: existing report already meets quorum` — report may already exist (check DB). +5. `Aggregation channel is full` — backpressure; retry write or scale buffers/workers. +6. `Failed to submit report` — DB error on persist. +7. Orphan path: `Successfully triggered re-aggregation check` with your message hex. + +### Write fails immediately + +1. `validation error` — malformed proto. +2. `signature validation failed` / `No valid signers found` — wrong key or committee config. +3. `Rejected write: message matched a disablement rule` — disablement registry. +4. `Authentication failed` — HMAC / auth middleware (no `messageID` yet). + +### Hash / tampering + +- `Hash mismatch detected - possible data tampering` during quorum — verifications for same `(messageID, aggregationKey)` disagree on signable hash (CCV version mismatch across nodes). + +### CCV version / aggregation key change + +- Multiple `aggregation_key` values per `message_id` are expected after CCV version changes. +- Orphan recovery and `ListOrphanedKeys` join on **both** `message_id` and `aggregation_key`. + +--- + +## Quick reference: logs with `messageID` (or equivalent) + +| Package / component | Logs | +|---------------------|------| +| `handlers` (write) | Disablement reject, signature/save/aggregation errors, success path Infos | +| `handlers` (read/get) | Get/read failures with `messageID` | +| `aggregation` | Full worker trail including **`Report submitted successfully`** | +| `quorum` | Signature recovery (Info), quorum Debug/Warn/Error | +| `orphan_recoverer` | Process/trigger logs (`%x` message ID) | +| `scope` | All augmented fields on child loggers | + +--- + +## Related documentation + +- [End-to-end message debugging](../../docs/end-to-end-debugging.md) — Full pipeline happy path and fallbacks +- [Debugging guides (all pipeline stages)](../../README.md#debugging-guides) — Per-service deep dives +- [Verifier design](../../verifier/docs/verifier.md) +- [Committee verifier](../../verifier/docs/committee_verifier.md) +- [Aggregator README](../README.md) diff --git a/executor/README.md b/executor/README.md index bd69a4231..ff158e9ef 100644 --- a/executor/README.md +++ b/executor/README.md @@ -1,4 +1,7 @@ # Executor Service + +For tracing a message from indexer discovery through on-chain execution, see [docs/debugging.md](docs/debugging.md). + Build dev image (hot-reload) ``` just build-dev diff --git a/executor/docs/debugging.md b/executor/docs/debugging.md new file mode 100644 index 000000000..0777eda9c --- /dev/null +++ b/executor/docs/debugging.md @@ -0,0 +1,348 @@ +# Executor Debugging Guide + +This guide helps you trace a single CCIP message through the executor using structured logs. The executor is the **final stage**: it learns about messages from the **indexer**, fetches all required verifier results from the indexer HTTP API, validates quorum against the **destination off-ramp**, and transmits an aggregated report on-chain. + +Upstream context: + +- [Indexer debugging guide](../../indexer/docs/debugging.md) — Populates messages and verifier results the executor reads +- [Aggregator debugging guide](../../aggregator/docs/debugging.md) — Committee quorum path into indexer +- [Verifier debugging guide](../../verifier/docs/debugging.md) — Token / committee verification before indexer + +## Pipeline overview + +```mermaid +flowchart TB + IDX["Indexer HTTP API\n/messages + /verifierresults/:id"] + STR["IndexerStorageStreamer\n(poll messages)"] + COORD["Coordinator\nheap + workers"] + ELEC["HashBasedLeaderElector\nstaggered execution turn"] + CLE["ChainlinkExecutor\nHandleMessage"] + ORAMP["Destination off-ramp\nquorum + execution state"] + TX["ContractTransmitter\nConvertAndWriteMessageToChain"] + + IDX -->|ReadMessages| STR + STR -->|MessageWithMetadata| COORD + COORD --> ELEC + COORD -->|ready time| CLE + CLE -->|GetVerifierResults| IDX + CLE --> ORAMP + CLE --> TX +``` + +**Stages:** + +1. **Message stream** — `IndexerStorageStreamer` polls indexer `GET /v1/messages` for newly ingested messages (by destination chain filter). +2. **Coordinator intake** — Validates message, computes executor-pool ready time, pushes to **delayed heap**. +3. **Processing loop** — Every second, pops messages whose `ReadyTime` has passed and dispatches to worker pool. +4. **ChainlinkExecutor** — Curse check, on-chain execution state, **indexer** verifier results + off-ramp CCV quorum, honest-attempt check, transmit. +5. **Retry** — Transient failures re-push to heap with `RetryInterval`; expiry drops message after `max_retry_duration`. + +Execution success from the executor’s perspective is **`transmitting aggregated report to chain`** with no subsequent retry log for that attempt. + +--- + +## Data sources + +| Source | Interface | API / call | Purpose | +|--------|-----------|------------|---------| +| **Indexer** | `MessageReader` | `GET /v1/messages` | Discover messages ready to execute | +| **Indexer** | `VerifierResultReader` | `GET /verifierresults/:messageID` | All CCV attestations for one message | +| **Destination chain** | `DestinationReader` | `GetMessageSuccess`, `HasHonestAttempt`, `GetCCVSForMessage`, `IsReady` | On-chain execution state and receiver quorum | +| **RMN** | `CurseChecker` | `IsRemoteChainCursed` | Delay if lane cursed | + +The **`IndexerReaderAdapter`** wraps one or more indexer base URLs with health-check failover (logs when switching). + +--- + +## Filtering logs for one message + +### Logger scope + +| Component | Logger naming / fields | +|-----------|-------------------------| +| Service bootstrap | `executor` (named in `cmd/executor/service.go`) | +| Coordinator | Root `executor` logger passed to `NewCoordinator` | +| ChainlinkExecutor | Root logger (often same tree) | +| Indexer adapter | `Created indexer adapter with multiple clients` at startup | +| Indexer streamer | `IndexerStorageStreamer` logs | +| Leader elector | `messageID` on **Debug** `calculated ready timestamp` | +| Task-style | N/A — executor uses coordinator logs with `messageID` field | + +### Grep examples + +```bash +# Stream discovery +grep 'Found net new message from Indexer' | grep 'messageID=0xYOUR_MESSAGE_ID' + +# Coordinator lifecycle +grep 'messageID=0xYOUR_MESSAGE_ID' | grep -E 'pushing message|processing message|transmitting' + +# Indexer fetch (may not include messageID in adapter debug lines — use coordinator lines) +grep 'got ccv info and verifier results' | grep '0xYOUR_MESSAGE_ID' + +# Successful on-chain submit +grep 'transmitting aggregated report to chain' | grep '0xYOUR_MESSAGE_ID' +``` + +--- + +## Happy-path checklist + +| Step | What to look for | Component | +|------|------------------|-----------| +| 1. Seen in stream | `Found net new message from Indexer` | `IndexerStorageStreamer` | +| 2. Scheduled | `pushing message to delayed heap` | `Coordinator` | +| 3. Due | `found messages ready for processing` (Debug, may list payload) | `Coordinator` | +| 4. Executing | `processing message with ID` | `Coordinator` | +| 5. CCV data | `got ccv info and verifier results` | `ChainlinkExecutor` | +| 6. Transmit | `transmitting aggregated report to chain` | `ChainlinkExecutor` | +| 7. Done | No `message should be retried` / `will retry execution` for same ID | `Coordinator` | + +Confirm indexer has rows: see [Indexer debugging guide](../../indexer/docs/debugging.md) database section. + +--- + +## Stage 1: Indexer message stream + +**Source:** `integration/pkg/ccvstreamer/indexer_storage_streamer.go` + +Polls indexer for messages with `IngestionTimestamp` after `lastQueryTime`, dedupes via `ExpirableMessageSet`. + +| Level | Message | `messageID` | Notes | +|-------|---------|-------------|-------| +| Debug | `IndexerStorageStreamer query results` | no | `start`, `count`, `error` | +| Error | `dropping message with invalid ID` | no | Bad payload | +| **Info** | **`Found net new message from Indexer`** | **yes** | Emitted to coordinator stream | +| Error | `IndexerStorageStreamer read error` | no | Backoff retry | +| Info | `IndexerStorageStreamer hit query limit, there may be more results` | no | Pagination | + +Coordinator stream errors: + +| Level | Message | +|-------|---------| +| Error | `failed to start ccv result streamer, shutting down coordinator` | +| Error | `error in coordinator component` | +| Warn | `streamerResults closed` | + +--- + +## Stage 2: Coordinator (heap, elector, workers) + +**Source:** `executor/executor_coordinator.go`, `executor/pkg/message_heap/message_heap.go`, `executor/pkg/leaderelector/hash_based_elector.go` + +### Intake from stream + +| Level | Message | `messageID` | Meaning | +|-------|---------|-------------|---------| +| Error | `invalid message, skipping` | no | `CheckValidMessage` failed | +| Info | `message already in delayed heap, skipping` | yes | Dedup | +| Info | `message already in flight, skipping` | yes | Worker processing | +| Info | `skipping message, executor not in pool for destination chain` | yes | This node not in `executor_pool` for dest chain | +| Error | `leader elector failed for message, skipping` | yes | Config / elector error | +| **Info** | **`pushing message to delayed heap`** | **yes** | `ingestionTimestamp`, `readyTimestamp` | +| Info | `duplicate message rejected by heap` | yes | Second push same ID | + +### Leader elector (timing) + +**Debug** only (per message): + +| Message | Fields | +|---------|--------| +| `calculated ready timestamp` | `messageID`, `queueSize`, `executionInterval`, `delay`, `readyTime` | + +Ready time = ingestion time + (position in per-message executor rotation × `execution_interval` for that dest chain). Multi-executor deployments stagger who transmits first. + +### Processing loop and workers + +| Level | Message | `messageID` | +|-------|---------|-------------| +| Debug | `found messages ready for processing` | in `readyMessages` dump | +| Info | `processing message with ID` | yes | +| Info | `message has expired` | yes | Past `max_retry_duration` window | +| Info | `message should be retried, putting back in heap` | yes | +| Warn | `retry push rejected, message already in heap` | yes | +| Error | `failed to handle message` | yes | `error`, `shouldRetry` | + +### Heap integrity (rare) + +| Level | Message | +|-------|---------| +| Error | `heap corrupted: popped entry is not MessageHeapEntry` | +| Error | `orphaned heap entry: messageID present in heap but missing from dataMap` | + +--- + +## Stage 3: Indexer verifier results (HTTP) + +**Source:** `executor/pkg/adapter/adapter.go` +**Endpoint:** `GET /verifierresults/{messageID}` (indexer v1 API) + +Called from `ChainlinkExecutor.getVerifierResultsAndQuorum` for each execution attempt. + +| Level | Message | `messageID` in fields | +|-------|---------|----------------------| +| Infow | `Created indexer adapter with multiple clients` | no (startup) | +| Debug | `Active indexer returned result` | no (`activeIdx`, `status`) | +| Warn | `Active indexer returned non-success status, checking health and querying alternates` | no | +| Infow | `Active indexer health check passed, retrying query` | no | +| Warn | `Retry on active indexer also failed after health check passed` | no | +| Warn | `Active indexer unavailable, selecting alternate` | no | +| Infow | `Selected healthy alternate indexer` | no | +| Infow | `Switching active indexer` | no | +| Error | `No healthy alternates found, returning active client result` | no | + +**Semantics:** + +- HTTP **404** from indexer → empty verifier list → `delaying execution due to no verifier results` (retry until indexer has data). +- HTTP **200** with results → filtered by `default_executor_address` for source chain (warnings if mismatch). + +Per-result warnings in executor (no `messageID` on line; correlate via surrounding `processing message`): + +| Level | Message | +|-------|---------| +| Warn | `Verifier Result did not specify our executor` | +| Warn | `Verifier Result fields are inconsistent` | + +--- + +## Stage 4: ChainlinkExecutor execution + +**Source:** `executor/pkg/executor/cl_executor.go` + +### Preconditions (`CheckValidMessage`) + +Failures log as `invalid message, skipping` on coordinator (no dest reader / transmitter / reader unhealthy). + +| Level | Message | `messageID` | +|-------|---------|-------------| +| Warn | `delaying execution - curse state unknown` | yes | +| Info | `delaying execution due to curse` | yes | +| Warn | `delaying execution due to failed check GetMessageExecutionState` | yes | RPC / read errors | +| Info | `skipping execution due to already being successfully executed` | yes | On-chain SUCCESS | +| Warn | `delaying execution due to failed request for verifier results and quorum` | yes | Indexer or `GetCCVSForMessage` failed | + +### Quorum and ordering + +| Level | Message | `messageID` | Notes | +|-------|---------|-------------|-------| +| **Info** | **`got ccv info and verifier results`** | **yes** | `verifierResultsLen`, quorum fields, verifier address lists | +| Info | `skipping execution and not retrying due to impossible receiver verifier quorum` | yes | Permanent — required CCVs > message CCV addresses | +| Warn | `message did not meet verifier quorum, will retry` | yes | Missing required/optional CCV data from indexer | +| Info | `execution attempt poller not ready, will retry` | yes | Destination reader not ready | +| Error | `unable to call execution checker, will retry message` | yes | `HasHonestAttempt` error | +| Info | `skipping execution due to existing honest attempt` | yes | Another executor already tried same data | + +### On-chain transmit + +| Level | Message | `messageID` | +|-------|---------|-------------| +| **Info** | **`transmitting aggregated report to chain`** | **yes** | `destinationChain`, `latestCCVTimestamp`, `aggregatedReport` | +| Warn | `skipping retry due to message encoding error` | yes | Non-retryable (`ErrMessageEncoding`) | +| Warn | `will retry execution due to failed ConvertAndWriteMessageToChain` | yes | Transient tx / RPC failure | + +Service lifecycle: + +| Level | Message | +|-------|---------| +| Infow | `new chainlink executor` | +| Info | `Starting Chainlink Executor` / `Stopping Chainlink Executor` | + +--- + +## Retry vs terminal outcomes + +| Outcome | `shouldRetry` | Typical logs | +|---------|---------------|--------------| +| Transient (curse, RPC, indexer 404, quorum not met, tx fail) | `true` | `message should be retried, putting back in heap` | +| Already executed / honest attempt / impossible quorum / encoding error | `false` | `skipping execution…` | +| Expired | n/a (never calls executor) | `message has expired` | + +--- + +## Suggested debug workflows + +### Message never reaches executor + +1. Indexer: message exists and status not stuck? See [Indexer debugging guide](../../indexer/docs/debugging.md). +2. Stream: `Found net new message from Indexer`? +3. Dest chain filter: `enabled_dest_chains` in streamer config vs message `dest_chain_selector`. +4. Executor pool: `skipping message, executor not in pool for destination chain` — this node’s `executor_id` not in pool for that chain. + +### Stuck in heap (long delay before `processing message`) + +1. `pushing message to delayed heap` — note `readyTimestamp`. +2. Debug: `calculated ready timestamp` — `queueSize` and `execution_interval` explain stagger. +3. Compare `readyTimestamp` to current time — multi-executor rotation is intentional. + +### Processing but never transmits + +1. `got ccv info and verifier results` — is `verifierResultsLen` > 0? +2. If 0: indexer 404 or all results filtered (`Verifier Result did not specify our executor`). +3. `message did not meet verifier quorum` — indexer missing USDC/Lombard/committee attestations; trace [Indexer](../../indexer/docs/debugging.md) then [Verifier](../../verifier/docs/debugging.md). +4. `delaying execution due to no verifier results` — indexer empty for ID. +5. `skipping execution due to existing honest attempt` — another node already executed. +6. `skipping execution due to already being successfully executed` — on-chain done. + +### Transmit fails repeatedly + +1. `will retry execution due to failed ConvertAndWriteMessageToChain` — RPC, gas, nonce, contract revert (check chain logs). +2. `skipping retry due to message encoding error` — permanent payload issue. + +### Indexer connectivity + +1. `IndexerStorageStreamer read error` — stream backoff. +2. `Active indexer returned non-success status` / `No healthy alternates found` — all indexer replicas unhealthy. +3. `Switching active indexer` — failover (check which URI is active). + +--- + +## End-to-end trace (full CCIP path) + +```text +Verifier: Added message to pending → Successfully published → Write succeeded for message + ↓ +Aggregator: Triggered aggregation check → Report submitted successfully + ↓ +Indexer: Found Message → Enqueueing new Message → Collected N new verifications + ↓ +Executor stream: Found net new message from Indexer + ↓ +Executor: pushing message to delayed heap → processing message with ID + ↓ +Executor: got ccv info and verifier results → transmitting aggregated report to chain +``` + +Token-only lanes still use aggregator discovery for the message shell; token attestations appear in `got ccv info` via indexer REST fetches. + +--- + +## Configuration knobs (logging context) + +| Config | Effect on logs | +|--------|----------------| +| `this_executor_id` / `executor_pool` | `not in pool`, elector `queueSize` | +| `execution_interval` | `calculated ready timestamp` delay | +| `max_retry_duration` | `message has expired` | +| `indexer_addresses` | Indexer adapter failover | +| `worker_count` | Parallel `processing message` lines | +| `enabled_dest_chains` | Which messages appear in stream | + +--- + +## Quick reference: logs with explicit `messageID` + +| Package | Logs | +|---------|------| +| `ccvstreamer` | `Found net new message from Indexer` | +| `executor_coordinator` | Heap push, in-flight skip, processing, retry, expire, handle errors | +| `leaderelector` | `calculated ready timestamp` (Debug) | +| `cl_executor` | Curse, execution state, CCV info, transmit, skip paths | +| `message_heap` | Orphaned entry errors | + +--- + +## Related documentation + +- [End-to-end message debugging](../../docs/end-to-end-debugging.md) — Full pipeline happy path and fallbacks +- [Debugging guides (all pipeline stages)](../../README.md#debugging-guides) — Per-service deep dives +- [Executor README](../README.md) diff --git a/indexer/README.md b/indexer/README.md index 6032fcb51..8f54d7105 100644 --- a/indexer/README.md +++ b/indexer/README.md @@ -1,4 +1,7 @@ # Indexer Service + +For tracing a message through aggregator discovery and token verifier fetches, see [docs/debugging.md](docs/debugging.md). + Build dev image (hot-reload) ``` just build-dev diff --git a/indexer/docs/debugging.md b/indexer/docs/debugging.md new file mode 100644 index 000000000..d3c99c1e7 --- /dev/null +++ b/indexer/docs/debugging.md @@ -0,0 +1,374 @@ +# Indexer Debugging Guide + +This guide helps you trace a single CCIP message through the indexer using structured logs. The indexer has two main upstream dependencies: + +1. **Aggregator (message discovery)** — Polls `GetMessagesSince` to discover new messages and committee-side verification markers. +2. **Token verifiers (USDC / Lombard, etc.)** — REST `GET /verifications?messageID=…` on configured verifier HTTP APIs to fetch attestations for CCV addresses listed on the message. + +For upstream context, see the [Verifier debugging guide](../../verifier/docs/debugging.md) and [Aggregator debugging guide](../../aggregator/docs/debugging.md). + +## Pipeline overview + +```mermaid +flowchart TB + AGG["Aggregator\nGetMessagesSince"] + DISC["AggregatorMessageDiscovery\n(+ optional MultiSource)"] + DB["PostgreSQL\nmessages + verifier_results"] + CH["discovery channel"] + WP["WorkerPool"] + TASK["Task\nper messageID"] + REG["VerifierRegistry"] + REST["REST reader\n(token verifier HTTP)"] + AGG2["Aggregator reader\nGetVerifierResultsForMessage"] + + AGG -->|poll| DISC + DISC -->|PersistDiscoveryBatch| DB + DISC -->|emit| CH + CH --> WP + WP --> TASK + TASK -->|missing CCV addresses| REG + REG --> REST + REG --> AGG2 + REST -->|InsertVerifierResults| DB + AGG2 -->|InsertVerifierResults| DB +``` + +**Stages:** + +1. **Discovery** — Poll aggregator for new aggregated reports since a persisted sequence number; persist messages and non–discovery-only verifications; emit every verification to the worker channel. +2. **Enqueue** — Worker pool creates a `Task` per discovered message (logger scoped with `messageID`). +3. **Verification gather** — Task loads existing verifications from DB, determines missing CCV addresses, calls each registered `VerifierReader` (batched REST or aggregator gRPC). +4. **Completion** — Inserts new verifier results; marks message `successful` when all configured verifiers responded; otherwise retries via scheduler or DLQ on timeout. + +A message is **fully indexed** when status is successful and `verifier_results` contains rows for all expected CCV addresses on the message. + +--- + +## How discovery differs from token fetches + +| Path | API | What it provides | +|------|-----|------------------| +| **Discovery** | Aggregator `GetMessagesSince` | New messages + committee discovery markers / aggregated committee data | +| **Token verifier** | Verifier HTTP `GET /verifications?messageID=…` | USDC (CCTP), Lombard, etc. attestations | +| **Aggregator as verifier** | Aggregator `GetVerifierResultsForMessage` | Batch read of stored aggregated results (when verifier `type = aggregator` in config) | + +Discovery-only CCV data (4-byte `MessageDiscoveryVersion` prefix) is **not** persisted as verifier results but still flows on the discovery channel so the worker can fetch token verifications. + +--- + +## Filtering logs for one message + +### Logger names and fields + +| Scope | Fields | +|-------|--------| +| Root | `indexer` (from `cmd/main.go`) | +| Discovery | Shared discovery logger (no per-message context until log line) | +| **Task** | `messageID` on logger + component name `Task` | +| Aggregator gRPC client | `aggregatorAddress` | +| REST reader | `url` on HTTP calls | + +### Grep examples + +```bash +# Discovery found the message +grep 'Found Message' | grep 'messageID=0xYOUR_MESSAGE_ID' + +# Worker pipeline (Task logger) +grep 'messageID=0xYOUR_MESSAGE_ID' + +# Enqueue from discovery +grep 'Enqueueing new Message' | grep '0xYOUR_MESSAGE_ID' + +# Token verifier HTTP (batch may include multiple IDs in URL) +grep 'REST reader' | grep '0xYOUR_MESSAGE_ID' + +# Storage +grep 'messageID=0xYOUR_MESSAGE_ID' +``` + +--- + +## Happy-path checklist + +| Step | What to look for | Component | +|------|------------------|-----------| +| 1. Discovered | `Found Message` with `messageID`, `verifierSourceAddress` | `AggregatorMessageDiscovery` | +| 2. Persisted | `Discovery batch persisted` (Debug) | `PostgresStorage` | +| 3. Queued | `Enqueueing new Message` | `WorkerPool` | +| 4. Worker run | `Starting Worker for 0x…` | `WorkerPool` | +| 5. Fetch plan | `Attempting to retrieve N verifications` | `Task` | +| 6. Token hit | `Received result from … for MessageID` (Debug) | `Task` | +| 7. Stored | `Collected N new verifications` | `Task` | +| 8. Done | Message status `successful` (no DLQ warn) | `WorkerPool` / DB | + +--- + +## Stage 1: Aggregator message discovery + +**Source:** `indexer/pkg/discovery/message_discovery.go` +**Upstream:** `integration/storageaccess/aggregator_reader.go` → aggregator `GetMessagesSince` + +### Lifecycle + +| Level | Message | `messageID` | Notes | +|-------|---------|-------------|-------| +| Info | `MessageDiscovery Started` | no | Service up | +| Info | `MessageDiscovery Stopped` | no | Shutdown | +| Info | `MessageDiscovery stopped due to context cancellation` | no | | +| Info | `Aggregator timed out, cancelling consumeReader` | no | Per-tick timeout | +| Error | `Error calling Aggregator` | no | Non-deadline error | +| Error | `Circuit breaker is open, skipping MessageDiscovery this tick` | no | Resilient reader open | +| Error | `Error reading VerificationResult from aggregator` | no | gRPC / resilience failure | +| Debug | `Called Aggregator` | no | Tick completed read | +| **Info** | **`Found Message`** | **yes** | **`verifierSourceAddress`** — primary discovery marker | +| Warn | `Unable to persist discovery batch, will retry` | no | Rolls back `since` sequence | +| Warn | `Skipping message, cannot encode for insert` | no | `index`, `reason` | + +### Aggregator reader (gRPC client) + +**Source:** `integration/storageaccess/aggregator_reader.go` (logger: `aggregatorAddress`) + +| Level | Message | Notes | +|-------|---------|-------| +| Debug | `Got messages since` | `count`, `since` | +| Warn | `Dropping invalid verifier result from aggregator response` | `index`, `sequence`, `reason` | +| Error | `Aggregator returned result with sequence less than requested since` | Sequence regression | + +### Multi-source discovery + +**Source:** `indexer/pkg/discovery/multi_source.go` +Used when multiple `[[discovery]]` entries exist in config. + +| Level | Message | `messageID` | +|-------|---------|-------------| +| Info | `MultiSourceMessageDiscovery started` | no | +| Debug | `messageID already discovered from different source, skipping` | yes | +| Warn | `one source message discovery channel closed` | no | +| Info | `MultiSourceMessageDiscovery stopped` | no | + +First source to emit a `messageID` wins; duplicates from other aggregators are dropped. + +### Resilient reader (discovery + verifier calls) + +**Source:** `indexer/pkg/readers/resilient_reader.go` + +| Level | Message | Applies to | +|-------|---------|------------| +| Warn | `ReadCCVData circuit breaker opened` | Discovery (`GetMessagesSince` path) | +| Info | `ReadCCVData circuit breaker closed` | Discovery recovered | +| Warn | `GetVerifications circuit breaker opened` | Token / aggregator verifier reads | +| Warn | `Max consecutive read errors reached` | `consecutive_errors` | +| Warn | `ReadCCVData request timeout exceeded` | Discovery timeout | +| Warn | `GetVerifications request timeout exceeded` | Verifier batch timeout | + +(Log names use `ReadCCVData` / `GetVerifications` as the policy name.) + +--- + +## Stage 2: Worker pool and Task + +**Sources:** `indexer/pkg/worker/worker_pool.go`, `task.go`, `worker.go` + +### Worker pool + +| Level | Message | `messageID` | +|-------|---------|-------------| +| **Info** | **`Enqueueing new Message`** | **yes** (in log fields) | +| Info | `Starting Worker for %s` | yes (in message string) | +| Info | `Exiting WorkerPool` / `Exiting EnqueueMessages` | no | +| Error | `Unable to create Task` | no | +| Error | `Unable to update Message Status for MessageID %s` | yes (string) | +| **Warn** | **`Message %s entered DLQ`** | **yes** — visibility window expired | +| Error | `Discovery channel closed` | no | + +### Task logger (`messageID` + `Task`) + +Created in `NewTask`: + +```go +logger.Named(logger.With(lggr, "messageID", message.MessageID), "Task") +``` + +| Level | Message | Meaning | +|-------|---------|---------| +| Info | `Attempting to retrieve N verifications for the message. Total Verifiers: M` | Missing vs total CCV addresses | +| Info | `Detected N unknown verifiers within the message, ignoring them` | Address not in registry | +| Info | `Source Specified CCVs …` | Expected addresses on message | +| Info | `Exisiting Verifications …` | Already in DB | +| Info | `Attempting to Retrieve …` | Addresses being queried this run | +| Info | `Unknown CCVs …` | Not registered | +| **Info** | **`Collected N new verifications for the message`** | **Success** | +| **Debug** | **`Received result from %s for MessageID %s`** | Per verifier success | + +Task does **not** log on storage insert failure beyond returning error to pool (pool retries). + +### Completion rules + +- **Success:** `UnavailableCCVs == 0` and no error → `UpdateMessageStatus(..., successful)`. +- **Retry:** storage error or any verifier reader did not return a result → re-enqueued on scheduler. +- **DLQ:** max attempts / visibility window → `MessageTimeout` + `Message … entered DLQ`. + +--- + +## Stage 3: Token verifier (REST) and aggregator verifier reads + +**Source:** `indexer/pkg/readers/verifier_reader.go` (batching, no per-message logs) + +### REST reader (USDC / Lombard) + +**Source:** `indexer/pkg/readers/rest_reader.go` +**URL pattern:** `{baseURL}/verifications?messageID=0x…&messageID=0x…` + +| Level | Message | `messageID` in fields | +|-------|---------|----------------------| +| Debug | `REST reader calling token verifier` | no (`url` contains IDs) | +| Debug | `REST reader 404 (no results), returning empty map` | no | +| Error | `REST reader HTTP request failed` | no | +| Error | `REST reader unexpected status` | no | +| Warn | `REST reader response exceeded size limit` | no | +| Error | `REST reader parse failed` | no (`body` may be large) | + +A 404 yields an empty map → `UnavailableCCVs` increments → task retries until DLQ. + +### Aggregator as verifier reader + +When a verifier is configured with `type = "aggregator"`, `GetVerifications` calls aggregator `GetVerifierResultsForMessage` (no extra logs in indexer beyond resilient reader policies). See [Aggregator debugging guide](../../aggregator/docs/debugging.md) for aggregator-side `GetVerifierResultsForMessage` handler logs. + +### Verifier registry + +**Source:** `indexer/pkg/registry/verifier_registry.go` +No logging — misconfigured addresses show up as `Unknown CCVs` on the Task. + +--- + +## Stage 4: Storage (PostgreSQL) + +**Source:** `indexer/pkg/storage/postgres.go` + +### Per-message + +| Level | Message | `messageID` | +|-------|---------|-------------| +| Error | `Failed to query CCV data` | yes | +| Error | `Failed to query message` | yes | +| Error | `Failed to update message status` | yes | +| Error | `Failed to persist discovery batch` | no | +| Debug | `Discovery batch persisted` | no (`messages`, `verifications`, `sequenceNumber`) | +| Debug | `Batch insert completed` | no | + +### HTTP API (debugging indexed data) + +**Source:** `indexer/pkg/api/handlers/v1/` + +| Handler | Log | `messageID` | +|---------|-----|-------------| +| `VerifierResultsByMessageID` | `CCV data not found for MessageID` | yes | +| `VerifierResultsByMessageID` | `failed storage call GetCCVData for MessageID` | yes | +| `messages` | `skipping message with invalid ID` | no | + +--- + +## Replay tooling (CLI) + +**Source:** `indexer/pkg/replay/`, `indexer/cmd/replay/main.go` + +Separate from the main indexer process; useful for backfill. + +| Level | Message | `messageID` | +|-------|---------|-------------| +| Info | `Discovery replay starting` | no (`jobID`, `sinceSequenceNumber`) | +| Info | `Discovery replay batch` | no | +| Info | `Replaying message` | **yes** | +| Warn | `Failed to enqueue message for verifier` | yes | +| Warn | `Failed to persist verifier result` | yes | +| Warn | `Message not found in local storage` | yes | +| Info | `Message replay finished` | no | + +--- + +## Suggested debug workflows + +### Message never appears in indexer + +1. Aggregator discovery: any `Found Message` for this ID? If not, check aggregator [GetMessagesSince](../../aggregator/docs/debugging.md) and indexer discovery errors (`Error reading VerificationResult from aggregator`, circuit breaker). +2. Multi-source: `messageID already discovered from different source, skipping` — may have been deduped (first source wins). +3. `Skipping message, cannot encode for insert` — message failed encoding. +4. `Unable to persist discovery batch` — DB issue; discovery will retry same sequence. + +### Discovered but no token (USDC / Lombard) verification + +1. `Enqueueing new Message` — reached worker? +2. Task: `Unknown CCVs` — verifier address not in `[[verifiers]]` or wrong `issuer_addresses`. +3. `Attempting to Retrieve` empty — all verifications already in DB? +4. REST: `REST reader 404` — verifier has no result yet (verifier still processing; see [Verifier debugging guide](../../verifier/docs/debugging.md)). +5. REST: `HTTP request failed` / circuit breaker — verifier down or network. +6. `Collected 0 new verifications` with retries — check `UnavailableCCVs` path until DLQ. + +### Discovered but missing committee verification + +1. Discovery persists non–discovery-only results from aggregator in `PersistDiscoveryBatch`. +2. If only discovery marker arrived, committee aggregated data may not be on aggregator yet — see [Aggregator debugging guide](../../aggregator/docs/debugging.md) (`Report submitted successfully`). +3. Verifier-type `aggregator` in indexer config can backfill via `GetVerifierResultsForMessage`. + +### Stuck in retry / DLQ + +1. `Starting Worker for 0x…` repeating — scheduler retries. +2. `UnavailableCCVs > 0` — some verifiers not returning. +3. **`Message … entered DLQ`** — gave up after visibility window; check `messages.status` and `last_error` in DB. + +### End-to-end (committee + token) + +```text +Verifier: Message verification completed → Write succeeded for message + ↓ +Aggregator: Triggered aggregation check → Report submitted successfully + ↓ +Indexer: Found Message → Enqueueing new Message → Collected N new verifications +``` + +Token path skips aggregator quorum; discovery still comes from aggregator `GetMessagesSince`, then Task hits token verifier REST. + +--- + +## Database checks + +```sql +-- Message row +SELECT message_id, status, last_error, ingestion_timestamp +FROM indexer.messages +WHERE message_id = '0xYOUR_MESSAGE_ID'; + +-- All verifier results for message +SELECT message_id, source_verifier_address, verifier_name, ingestion_timestamp +FROM indexer.verifier_results +WHERE message_id = '0xYOUR_MESSAGE_ID'; + +-- Discovery cursor for an aggregator endpoint +SELECT discovery_location, last_sequence_number +FROM indexer.discovery_state; +``` + +--- + +## Quick reference: logs with `messageID` + +| Package | Logs | +|---------|------| +| `discovery/message_discovery` | `Found Message` | +| `discovery/multi_source` | Duplicate skip | +| `worker/worker_pool` | Enqueue, DLQ | +| `worker/task` | All Task Info/Debug lines (context field) | +| `storage/postgres` | Query/update errors | +| `api/handlers/v1` | Get by message ID | +| `replay` | Message replay | + +--- + +## Related documentation + +- [End-to-end message debugging](../../docs/end-to-end-debugging.md) — Full pipeline happy path and fallbacks +- [Debugging guides (all pipeline stages)](../../README.md#debugging-guides) — Per-service deep dives +- [Discovery module README](../pkg/discovery/README.md) +- [Storage README](../pkg/storage/README.md) +- [Indexer README](../README.md) diff --git a/verifier/docs/debugging.md b/verifier/docs/debugging.md new file mode 100644 index 000000000..e623d00f9 --- /dev/null +++ b/verifier/docs/debugging.md @@ -0,0 +1,398 @@ +# Verifier Debugging Guide + +This guide helps you trace a single CCIP message through the verifier’s durable processing pipeline using structured logs. For architecture and component responsibilities, see [verifier.md](./verifier.md). + +## Pipeline overview + +The coordinator runs a three-stage pipeline connected by PostgreSQL job queues: + +```mermaid +flowchart LR + SRS["SourceReaderService\n(per source chain)"] + TQ["ccv_task_verifier_jobs"] + TVP["TaskVerifier Processor"] + RQ["ccv_storage_writer_jobs"] + SWP["StorageWriter Processor"] + OS["Offchain storage"] + + SRS -->|Publish VerificationTask| TQ + TQ -->|Consume batch| TVP + TVP -->|Publish VerifierNodeResult| RQ + RQ -->|Consume batch| SWP + SWP -->|WriteCCVNodeData| OS +``` + +**Stages:** + +1. **SourceReaderService** — Discovers on-chain events, applies filters/finality/curse/rules, holds tasks in memory until ready, publishes to `ccv_task_verifier_jobs`. +2. **TaskVerifier Processor** — Consumes tasks, calls the pluggable `Verifier` (committee, CCTP, Lombard, etc.), publishes results to `ccv_storage_writer_jobs`. +3. **StorageWriter Processor** — Consumes results, writes `VerifierNodeResult` to offchain storage. + +A message is fully processed from the verifier’s perspective when storage reports **write success** and the storage-writer job is completed. + +--- + +## Filtering logs for one message + +Most message-scoped logs use the structured field **`messageID`** (hex string, typically `0x…`). + +### Grep examples + +```bash +# Direct field (most stages) +grep 'messageID=0xYOUR_MESSAGE_ID' + +# CCTP / Lombard: messageID is on logger context for the whole task +grep 'messageID.*0xYOUR_MESSAGE_ID' + +# Commit verifier also logs the ID in several Infow calls +grep '0xYOUR_MESSAGE_ID' +``` + +In log aggregators (Datadog, etc.), filter on the JSON field `messageID` (or nested context fields when using `logger.With`). + +### Logger context tags + +These fields appear on many lines and help narrow by component: + +| Component | Structured fields | +|-----------|-------------------| +| Coordinator | `verifierID` | +| Source reader (runtime) | `component=Service`, `chain=` | +| Source reader (creation) | `component=SourceReaderDB`, `chainID=` | +| Finality checker | `component=FinalityChecker`, `chainID=` | +| Reorg tracker | `component=ReorgTracker` | +| Task queue | `component=task_queue` | +| Result queue | `component=result_queue` | +| Task queue observer | `component=task_queue_observer` | +| Result queue observer | `component=result_queue_observer` | +| TaskVerifier / StorageWriter processors | `verifierID` (no `component` tag) | +| Heartbeat | `component=HeartbeatReporter` | + +**CCTP and Lombard** scope each task with: + +```go +lggr := logger.With(v.lggr, "messageID", task.MessageID, "txHash", task.TxHash) +``` + +Every log for that task should carry **`messageID`** and **`txHash`** without repeating them in the log message string. + +--- + +## Happy-path checklist + +Use this sequence when confirming a message completed successfully: + +| Step | What to look for | Package | +|------|------------------|---------| +| 1. Discovered | `Added message to pending queue` | `sourcereader` | +| 2. Finality OK | `Finality check` with `meetsRequirement=true` | `sourcereader` | +| 3. Enqueued | `Successfully published and tracked tasks` | `sourcereader` | +| 4. Verified | See [Verifier implementations](#verifier-implementations) below | `commit` / `token/*` | +| 5. No TVP failure | No `Message verification failed` for this ID | `taskverifier` | +| 6. Persisted | `Write succeeded for message` | `storagewriter` | + +Optional confirmation: query `ccv_task_verifier_jobs` / `ccv_storage_writer_jobs` (and archive tables) by `message_id` if logs are inconclusive. + +--- + +## Stage 1: SourceReaderService + +**Logger:** `component=Service`, `chain=` + +**Source:** `verifier/pkg/sourcereader/service.go` + +### Discovery and pending queue + +| Level | Message | `messageID` | Other useful fields | +|-------|---------|-------------|---------------------| +| Info | `Message filtered out by filter` | yes | `destChain` | +| Error | `Failed to compute message ID` | no | `error` | +| Error | `Message ID mismatch` | no | `computed`, `onchain` | +| Info | `Added message to pending queue` | yes | `blockNumber`, `seqNum`, `pendingCount` | +| Debug | `Skipping already-sent message` | yes | `blockNumber` | + +### Reorg handling + +| Level | Message | `messageID` | Other useful fields | +|-------|---------|-------------|---------------------| +| Warn | `Removing task from pending queue due to reorg` | yes | `blockNumber`, `seqNum`, `destChain`, `fromBlock` | +| Warn | `Removing task from sentTasks due to reorg` | yes | `seqNum`, `destChain` | + +### Finality and readiness + +| Level | Message | `messageID` | Other useful fields | +|-------|---------|-------------|---------------------| +| Info | `Checking for ready messages to send` | no | `latestBlock`, `safeBlock`, `finalizedBlock` | +| Info | `Reorg-affected message finality check` | yes | `meetsRequirement`, `messageBlock`, `finalizedBlock` | +| Info | `Finality check` | yes | `finality`, `meetsRequirement`, `messageBlock`, `safeBlock`, `finalizedBlock` | +| Error | `Finality check failed due to nil block argument` | yes | `error` | + +### Curse, message rules, publish + +| Level | Message | `messageID` | Notes | +|-------|---------|-------------|-------| +| Warn | `Blocking lane - curse state unknown` | yes | Transient; retried next poll | +| Warn | `Dropping task - lane is cursed` | yes | Permanent drop | +| Warn | `Blocking message - message rules state unknown` | yes | Transient | +| Warn | `Dropping task - message matched a disablement rule` | yes | Permanent drop | +| Warn | `Curse or message rules state unknown, keeping checkpoint unchanged` | no | Batch may stall checkpoint | +| Info | `Publishing ready tasks to job queue` | no | `ready`, `pending`, `sentTasks` | +| Error | `Failed to publish tasks to job queue` | no | Tasks stay in `pendingTasks` for retry | +| Info | `Successfully published and tracked tasks` | no | `published`, `remainingPending`, `totalSent` | +| Info | `Checkpoint advanced` | no | Chain-level `finalizedBlock` | + +### Chain-wide failures (no per-message ID) + +| Level | Message | Impact | +|-------|---------|--------| +| Error | `FINALITY VIOLATION - disabling chain` | Chain disabled; pending/sent maps flushed | +| Error | `Flushed all tasks due to finality violation` | `pendingFlushed`, `sentFlushed` | +| Error | `Finality violation detected` | Triggers disable path | +| Warn | `Error when querying logs` | Event cycle may not advance | +| Error | `Failed to get latest block` | Poll skipped for this cycle | + +### Cycle-level logs (no `messageID`) + +Useful for “why is nothing moving on this chain?”: + +- `processEventCycle starting` +- `Querying from block` +- `Processed block range` +- `No events found in range` + +--- + +## Stage 2: TaskVerifier Processor + +**Logger:** top-level `verifierID` (service name `verifier.Processor[]`) + +**Source:** `verifier/pkg/taskverifier/processor.go` + +### Batch processing (usually no per-message ID) + +| Level | Message | Key fields | +|-------|---------|------------| +| Debug | `Processing verification tasks batch` | `batchSize` | +| Debug | `Published verification results to queue` | `count` | +| Debug | `Verification batch completed` | `successCount`, `errorCount`, `retryCount`, `failedCount` | + +### Per-message failures + +| Level | Message | `messageID` | Key fields | +|-------|---------|-------------|------------| +| Error | `Message verification failed` | yes | `error`, `retryable`, `nonce`, `sourceChain`, `destChain` | +| Error | `Job ID not found for message` | yes | Internal mapping bug / mismatch | + +**`retryable`:** + +- `true` — Job scheduled for retry in `ccv_task_verifier_jobs` (delay set by verifier implementation). +- `false` — Permanent failure; job archived as failed. + +### Queue operation failures (batch-level) + +Jobs may remain in `processing` until stale-lock reclaim: + +- `Failed to publish verification results to queue` +- `Failed to complete jobs` +- `Failed to retry jobs` +- `Failed to mark jobs as failed` + +--- + +## Verifier implementations + +The TaskVerifier delegates to a pluggable `Verifier`. Log patterns differ by type. + +### Committee / commit verifier + +**Source:** `verifier/pkg/commit/verifier.go` + +Per-message **Info** logs (all include `messageID`): + +| Message | Typical fields | +|---------|----------------| +| `Starting message verification` | `nonce`, `sourceChain`, `destChain` | +| `Message validation passed` | `verifierAddress`, `defaultExecutorAddress` | +| `Using message discovery version for message` | `version` | +| `Message signed successfully` | `signer`, `signatureLength` | +| `Message verification completed successfully` | `nonce`, `sourceChain`, `destChain` | + +Batch-level: + +- `Starting batch verification` / `Batch verification completed` — `batchSize`, `successCount`, `errorCount` (no per-message ID). + +Errors are returned to TaskVerifier as `Message verification failed` (commit verifier does not log every error path at Info/Error level before return). + +### CCTP verifier + +**Source:** `verifier/pkg/token/cctp/verifier.go` + +Logger context: `messageID`, `txHash` on every task. + +| Level | Message | +|-------|---------| +| Info | `Verifying CCTP task` | +| Warn | `Failed to fetch attestation` | +| Debug | `Attestation not ready for message` | +| Error | `Failed to decode attestation data` | +| Info | `Attestation fetched and decoded successfully` | +| Error | `CreateVerifierNodeResult: Failed to create VerifierNodeResult` | +| Info | `VerifierResults: Successfully verified message` | + +Attestation fetch helper (`token/cctp/attestation.go`) may log `skipping CCTP message as it doesn't match CCIP message` without `messageID` on the context (Circle API response parsing). + +### Lombard verifier + +**Source:** `verifier/pkg/token/lombard/verifier.go`, `token/lombard/attestation.go` + +Logger context: `messageID`, `txHash` on each task in the verifier loop. + +**Verifier loop** — same pattern as CCTP (`Verifying Lombard task`, attestation ready/not found, success/failure). + +**Attestation service** — explicit `messageID` on mapping failures: + +| Level | Message | +|-------|---------| +| Warn | `No verifier resolver configured for source chain, skipping task` | +| Warn | `No matching blob found for task in ReceiptBlobs` | +| Error | `No verifier resolver configured for source chain; marking attestation as missing` | +| Error | `No matching blob found for task; marking attestation as missing` | +| Error | `Failed to find attestation for task in the response` | + +--- + +## Stage 3: StorageWriter Processor + +**Logger:** top-level `verifierID` (service name `verifier.Processor[]`) + +**Source:** `verifier/pkg/storagewriter/processor.go` + +### Per-message + +| Level | Message | `messageID` | Key fields | +|-------|---------|-------------|------------| +| Debug | `Write succeeded for message` | yes | `jobID` | +| Error | `Write failed for message (retryable)` | yes | `jobID`, `error` | +| Error | `Write failed for message (non-retryable)` | yes | `jobID`, `error` | + +### Batch summary + +| Level | Message | Key fields | +|-------|---------|------------| +| Debug | `Processing verification results batch` | `batchSize` | +| Info | `CCV data batch write completed` | `totalRequests`, `successful`, `retriableFailed`, `nonRetriableFailed` | +| Info | `Scheduling retry for failed writes` | `retriableFailedCount`, `retryDelay` | +| Warn | `Marking non-retryable failed jobs as failed` | `nonRetriableFailedCount` | +| Debug | `No successful writes in this batch, skipping completion` | — | + +### Batch / queue failures + +- `Failed to write CCV data batch to storage with no results, scheduling retry` +- `Failed to schedule retry for CCV data batch` / `Failed to schedule retry for failed writes` +- `Failed to complete jobs in queue` +- `Failed to mark jobs as failed` + +--- + +## E2E latency tracking + +**Source:** `verifier/pkg/monitoring/message_latency.go` + +| Level | Message | `messageID` | When | +|-------|---------|-------------|------| +| Warn | `Negative E2E latency detected due to clock drift` | yes | After successful write; block time vs node clock | +| Error | `Invalid timestamp type in cache for message` | no | Cache corruption / bug | + +Timestamps are recorded when TaskVerifier calls `MarkMessageAsSeen` (at consume time), using `ReadyForVerificationAt` when set. + +--- + +## Job queue logs (no per-message ID) + +**Source:** `verifier/pkg/jobqueue/postgres_queue.go` + +**Components:** `task_queue` (`ccv_task_verifier_jobs`), `result_queue` (`ccv_storage_writer_jobs`) + +These logs are **batch counts only**. To inspect a specific message in the queue, query the database: + +```sql +-- Example: task queue (adjust table/schema as deployed) +SELECT job_id, status, message_id, attempt_count, last_error, available_at, started_at +FROM ccv_task_verifier_jobs +WHERE message_id = decode(replace('0xYOUR_MESSAGE_ID', '0x', ''), 'hex'); +``` + +| Level | Message | Key fields | +|-------|---------|------------| +| Debug | `Published jobs to queue` | `queue`, `count`, `delay` | +| Debug | `Consumed jobs from queue` | `queue`, `count` | +| Debug | `Completed jobs` | `queue`, `count` | +| Info | `Retried jobs` | `queue`, `retried`, `failed`, `delay` | +| Info | `Failed and archived jobs` | `queue`, `count` | +| Info | `Archived jobs that exceeded retry deadline` | `queue`, `count` | + +**Observers** (`task_queue_observer`, `result_queue_observer`): + +- `JobQueue size` — periodic backlog depth + +--- + +## Suggested debug workflow + +### Message never appears in verifier logs + +1. Confirm the source chain SourceReader is running (`Service started`, no `Chain is disabled, skipping`). +2. Check filter: `Message filtered out by filter`. +3. Check event cycle: `Processed block range`, `No events found in range`. +4. Check ID integrity: `Message ID mismatch`. + +### Stuck after discovery + +1. `Finality check` with `meetsRequirement=false` — wait for finality or check `finality` mode on message. +2. `Reorg-affected message finality check` — stricter path after reorg. +3. Curse/rules: `Blocking … state unknown` vs `Dropping …`. +4. Publish: `Failed to publish tasks to job queue`. + +### Stuck in verification + +1. `Message verification failed` — read `error`, `retryable`. +2. Token verifiers: `Attestation not ready` / `not found`, fetch errors. +3. Commit: errors often only in TVP line; enable debug or trace batch with `successCount=0`. +4. Queue: `Retried jobs`, `Failed and archived jobs` on `task_queue`. + +### Stuck after verification + +1. `Failed to publish verification results to queue` — TVP could not enqueue to storage writer. +2. `Write failed for message` — storage backend issue. +3. `Failed to write CCV data batch` — whole batch retry. + +### Entire chain stopped + +1. `FINALITY VIOLATION` (checker or SRS disable path). +2. `Finality violation detected` in sendReadyMessages. +3. Coordinator: `Chain is disabled, skipping`. + +--- + +## Quick reference: logs with explicit `messageID` + +| Package | Log messages | +|---------|----------------| +| `sourcereader` | Filter, reorg removal, skip sent, add pending, curse/rules, finality checks | +| `taskverifier` | `Message verification failed`, `Job ID not found for message` | +| `storagewriter` | Write succeeded / failed (retryable and non-retryable) | +| `commit` | Full per-message verification trail (Infow) | +| `token/cctp`, `token/lombard` | All task logs via `logger.With(..., "messageID", ...)` | +| `token/lombard/attestation` | Resolver, blob, attestation mapping errors | +| `monitoring` | Clock drift warning on E2E latency | + +--- + +## Related documentation + +- [End-to-end message debugging](../../docs/end-to-end-debugging.md) — Full pipeline happy path and fallbacks +- [Debugging guides (all pipeline stages)](../../README.md#debugging-guides) — Per-service deep dives +- [Verifier design](./verifier.md) — Architecture and configuration +- [Committee verifier](./committee_verifier.md) +- [Token verifier](./token_verifier.md)