52 changes: 51 additions & 1 deletion SDD-Keylime-Monitoring-Tool.md
@@ -941,6 +941,53 @@ Re-render with new data

**Trace:** SRS SR-001, SR-002, SR-010, SR-011

#### 3.5.4 Data Flow: Background Attestation Recording

The backend spawns a long-lived Tokio task at startup that records attestation observations independently of frontend API requests, ensuring the time-series attestation history (FR-024) is complete regardless of dashboard usage.

```text
tokio::spawn(background_observation_loop)
|
+-- tokio::time::interval(observation_interval) // Default: 30s (FR-087)
|
+-- Every tick:
|     +-- Acquire KeylimeClient via AppState::keylime()
|     +-- Fetch agent list from Verifier API
|     +-- For each agent:
|     |     +-- Check dedup tracker (HashMap<Uuid, (Instant, AttestationResult)>)
|     |     |     |
|     |     |     +-- Same result within 30s: SKIP (dedup)
|     |     |     +-- State changed OR interval elapsed: RECORD
|     |     |
|     |     +-- Store observation via AttestationRepository::store_result()
|     |     +-- Update dedup tracker entry
|     |
|     +-- Every 10th tick (5 min / 30s = 10):
|           +-- Full fleet reconciliation sweep (NFR-020)
|           +-- Compare cached state against Verifier, log corrections
|
+-- tokio::select! { _ = interval.tick() => ..., _ = shutdown_rx => break }
```

**Relationship to existing requirements:**

| Requirement | Relationship |
|-------------|-------------|
| NFR-007 (Polling Fallback) | Background task reuses the same 30s polling cadence; both operate in polling mode against the Verifier API |
| NFR-020 (Reconciliation) | The background task performs the 5-minute reconciliation sweep as a superset of the per-tick observation: at the default 30s interval, every 10th tick runs a full fleet comparison |
| FR-024 (Attestation Analytics) | The background task is the primary data producer for `AttestationRepository`, ensuring timeline charts have continuous data |
| NFR-006 (Event-Driven Ingestion) | When ZeroMQ event-driven ingestion is available, the background task serves as the fallback/reconciliation mechanism rather than the primary data path |

**Reuse of `record_agent_observations()`:** The background task calls the same `pub(crate) async fn record_agent_observations(state: &AppState)` used by the attestation API handlers. This function iterates agents, consults the dedup tracker on `AppState`, and stores results via `AttestationRepository::store_result()`. Sharing this function guarantees identical observation logic whether triggered by an API request or the background task.

**Dedup Tracker:** `AppState` holds an `attestation_tracker: Arc<Mutex<HashMap<Uuid, (Instant, String)>>>` that maps each agent UUID to its last recorded observation timestamp and result. The tracker prevents redundant writes when an agent's state has not changed and fewer than 30 seconds have elapsed since the last recording. State changes (e.g., `pass` → `fail`) bypass the interval check and record immediately.
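
The dedup rule above can be sketched as a small, std-only function. This is an illustration under assumptions, not the code in `attestations.rs`: `DedupTracker`, `should_record`, and the `u128` stand-in for `Uuid` are hypothetical names, and the `Arc<Mutex<...>>` wrapper is omitted so the decision logic stays visible.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the agent `Uuid` (keeps the sketch std-only).
pub type AgentId = u128;

/// Last recorded (timestamp, result) per agent, mirroring the tracker's value type.
#[derive(Default)]
pub struct DedupTracker {
    last: HashMap<AgentId, (Instant, String)>,
}

impl DedupTracker {
    /// Returns true when the observation should be stored.
    /// The tracker entry is updated only in that case, so a skipped
    /// observation leaves the existing timestamp in place.
    pub fn should_record(
        &mut self,
        agent: AgentId,
        result: &str,
        now: Instant,
        window: Duration,
    ) -> bool {
        let record = match self.last.get(&agent) {
            // First observation for this agent: always record.
            None => true,
            // A state change bypasses the window; otherwise wait until it has elapsed.
            Some((at, prev)) => {
                prev.as_str() != result || now.duration_since(*at) >= window
            }
        };
        if record {
            self.last.insert(agent, (now, result.to_string()));
        }
        record
    }
}
```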

**Graceful Shutdown:** The task uses `tokio::select!` to race the interval tick against a shutdown signal (a `tokio::sync::watch` or `tokio::sync::broadcast` channel). On receiving the signal, the task breaks out of the loop, logs the number of observations recorded during its lifetime, and exits cleanly. In-flight Verifier API calls are not aborted: the current tick completes before the task shuts down.
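
The shutdown race can be illustrated without a Tokio runtime. The sketch below is a std-only analogue that swaps `tokio::select!` for `std::sync::mpsc::Receiver::recv_timeout`: a timeout plays the role of `interval.tick()`, and a message (or a dropped sender) plays the role of the shutdown signal. `observation_loop` and its parameters are hypothetical names. Unlike `tokio::time::interval`, whose first tick completes immediately, this loop waits one full interval before the first tick, and the wait restarts after each tick instead of following a fixed schedule.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Calls `on_tick` once per `interval` until a shutdown message arrives
/// or every sender has been dropped. Returns the number of completed ticks.
pub fn observation_loop<F>(
    shutdown_rx: mpsc::Receiver<()>,
    interval: Duration,
    mut on_tick: F,
) -> u64
where
    F: FnMut(u64),
{
    let mut ticks = 0u64;
    loop {
        match shutdown_rx.recv_timeout(interval) {
            // No shutdown signal within the interval: treat this as a tick.
            Err(RecvTimeoutError::Timeout) => {
                ticks += 1;
                // The tick body runs to completion before shutdown is checked again,
                // so in-flight work is never cut short.
                on_tick(ticks);
            }
            // Explicit signal, or the sending side is gone: leave the loop.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    ticks
}
```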

**Configurable Interval:** The observation interval is read from the `observation_interval_secs` field of the `keylime` section of `AppConfig` (default: 30). Changing this value requires a restart; the background task does not support hot-reloading the interval at runtime.
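
A minimal sketch of how the default and the derived reconciliation cadence could fit together, assuming a hypothetical `KeylimeConfig` struct that holds only the `observation_interval_secs` field. `ticks_per_sweep` is also an assumption: the design states the cadence only for the default (5 min / 30s = 10 ticks) and does not say how a non-default interval maps to the sweep, so rounding up here is one possible choice.

```rust
use std::time::Duration;

/// Hypothetical subset of the `keylime` config section.
pub struct KeylimeConfig {
    pub observation_interval_secs: u64,
}

impl Default for KeylimeConfig {
    fn default() -> Self {
        // Default of 30 seconds, as stated for FR-087.
        Self { observation_interval_secs: 30 }
    }
}

impl KeylimeConfig {
    /// Tick period. A value of 0 is clamped to 1 second because
    /// `tokio::time::interval` panics on a zero period.
    pub fn observation_interval(&self) -> Duration {
        Duration::from_secs(self.observation_interval_secs.max(1))
    }

    /// Number of ticks between reconciliation sweeps for a target sweep
    /// period, rounded up and never less than 1 (assumed policy).
    pub fn ticks_per_sweep(&self, sweep_secs: u64) -> u64 {
        let interval = self.observation_interval_secs.max(1);
        ((sweep_secs + interval - 1) / interval).max(1)
    }
}
```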

**Trace:** SRS FR-087, FR-024, NFR-006, NFR-007, NFR-020; Implementation -- `keylime-webtool-backend/src/api/handlers/attestations.rs`

### 3.6 State Dynamics View

#### 3.6.1 Agent State Machine
@@ -1087,6 +1134,7 @@ AppConfig
|     |     +-- key: String  // HSM/Vault URI (SR-005, SR-006) or file path
|     |     +-- ca_cert: PathBuf
|     +-- timeout_secs: u64  // Default: 30
|     +-- observation_interval_secs: u64  // Default: 30 (FR-087, aligned with NFR-007)
|     +-- circuit_breaker
|           +-- failure_threshold: u32  // Default: 5
|           +-- reset_timeout_secs: u64  // Default: 60
@@ -1204,6 +1252,7 @@ Maximum 5 parallel concurrent log fetches to the Verifier API, enforced via Toki
| `AuditRepository` is insert-only (no update/delete) | Enforces audit immutability at the trait API level; implementations cannot accidentally expose mutation; hash-chain integrity (SR-015) depends on append-only semantics | FR-061, SR-015, SR-026 |
| Repository injection via `AppState` (compile-time DI) | No runtime DI framework needed; `main.rs` constructs concrete implementations based on config and injects `Arc<dyn Trait>` into `AppState`; consistent with existing `KeylimeClient` and `SettingsStore` injection pattern | -- |
| `FallbackAttestationRepository` preserves current behavior | Timeline distribution algorithm (3.7.1) runs inside the fallback repository implementation, not in the handler; isolates the pre-DB algorithm behind the trait so the same handler code works with real history data once `SqlAttestationRepository` is implemented | FR-024 |
| Background `tokio::spawn` task for observations | Decouples attestation recording from frontend requests; ensures timeline data is continuous even when no user is viewing the dashboard; reuses `record_agent_observations()` to guarantee identical logic in both paths; 30s interval aligns with NFR-007 polling cadence and dedup tracker window | FR-087, NFR-007, NFR-020 |
| No `AgentRepository` — agents excluded from repository pattern | Agents are Keylime-owned data observed via pass-through proxy; all operations (list, detail, actions, bulk) forward to Verifier/Registrar APIs and cache responses (10s TTL). Keeping agents out of the repository layer preserves graceful degradation (NFR-016): agent listings work even when the webtool DB is down. `agent_id` in attestation/alert records is a bare UUID reference, not a foreign key requiring local agent persistence | FR-012, NFR-016 |

<!-- CHANGED: Added 7 repository abstraction design rationale entries -->
@@ -1246,7 +1295,7 @@ Maximum 5 parallel concurrent log fetches to the Verifier API, enforced via Toki
| Storage fallback | In-memory repository implementations when DB unavailable (3.3.11) | NFR-016 |
| Fault tolerance | Circuit breaker on Verifier API (threshold: 5, reset: 60s) | NFR-017 |
| Log fetch limit | Max 5 concurrent Verifier log fetches | NFR-023 |
| Reconciliation | Periodic sweep every 5 minutes | NFR-020 |
| Reconciliation | Periodic sweep every 5 minutes via background observation task (3.5.4) | NFR-020, FR-087 |
| Frontend query cache | TanStack Query: 30s stale time, 1 retry | NFR-001 |

---
@@ -1314,6 +1363,7 @@ Maximum 5 parallel concurrent log fetches to the Verifier API, enforced via Toki
| FR-083 | 3.4.3 | Raw Data tab: compact copy icon button to the right of source selector group; Clipboard API with 2s checkmark feedback |
| FR-084 | 3.2.2 | `KpiCard.tsx`: optional `linkTo` prop wraps card in React Router `<Link>`; Dashboard page maps each KPI to its target route (e.g., Failed Agents → `/agents?state=failed,invalid_quote,tenant_failed`) |
| FR-085 | 3.2.2 | `Alerts.tsx`: three Recharts donut `PieChart` components below alert table — By Severity, By Type, By State; clickable segments navigate to `/alerts?{dimension}={value}` with filter pre-applied; color maps match Dashboard alert chart (FR-047) |
| FR-087 | 3.5.4, 3.8.1 | Background `tokio::spawn` observation task, `record_agent_observations()` reuse, dedup tracker, configurable interval |

### 6.2 Non-Functional Requirements

40 changes: 39 additions & 1 deletion SRS-Keylime-Monitoring-Tool.md
@@ -120,6 +120,7 @@ The System transforms Keylime from a CLI-driven security tool into a visual oper
| FR-084 | Fleet Overview KPI card drill-down navigation | SHOULD | Dashboard - Key Performance Indicators |
| FR-085 | Alert Center distribution pie charts (by severity, type, state) | MUST | Revocation - Alert Workflow |
| FR-086 | Integrations topology view with SSH connect | SHOULD | Integration Status - Backend Connectivity |
| FR-087 | Background attestation observation recording independent of frontend | MUST | Attestation Analytics - Overview Dashboard |

### 2.2 Non-Functional Requirements

@@ -2760,6 +2761,42 @@ Feature: Integrations Topology View
And the node statuses MUST match those displayed in List View
```

### FR-087: Background Attestation Observation Recording

**Description:** The System MUST continuously record attestation observations (pass/fail per agent) at the polling interval (default 30 seconds, aligned with NFR-007) via a background task, independent of frontend API requests. The background task MUST iterate over all known agents at each interval, query the Keylime Verifier for current attestation state, and store the result through the `AttestationRepository` (FR-024). Duplicate observations within the dedup interval (same agent, same result, within 30 seconds of last recorded observation) MUST be suppressed. Every 5 minutes (aligned with NFR-020), the background task MUST perform a full fleet reconciliation sweep to detect state drift. The observation interval MUST be configurable via `AppConfig`. If the Keylime Verifier API is unreachable, the background task MUST log a warning and retry on the next interval without crashing. The background task MUST shut down gracefully when the application receives a termination signal.

**Trace:** Attestation Analytics - Overview Dashboard; FR-024, NFR-006, NFR-007, NFR-020

```gherkin
Feature: Background Attestation Observation Recording

  Scenario: Observations recorded when no user is viewing the dashboard
    Given the backend is running with the background observation task active
    And no frontend clients are connected
    And the Keylime Verifier reports 50 agents in GET_QUOTE state and 2 in FAILED state
    When 5 minutes elapse
    Then the AttestationRepository MUST contain at least 10 observation records
    And the hourly attestation timeline (FR-024) MUST show non-zero bars for the elapsed period

  Scenario: Dedup interval prevents duplicate observations
    Given agent "agent-042" was last recorded as "pass" 15 seconds ago
    And agent "agent-042" is still in GET_QUOTE state (pass)
    When the background observation task runs its next cycle
    Then a new observation for agent "agent-042" MUST NOT be stored
    And the dedup tracker MUST retain the existing timestamp

  Scenario: State change bypasses the dedup interval
    Given agent "agent-042" was last recorded as "pass" 15 seconds ago
    And agent "agent-042" has transitioned to FAILED state
    When the background observation task runs its next cycle
    Then a new observation for agent "agent-042" with result "fail" MUST be stored immediately

  Scenario: Graceful degradation when Keylime API is unreachable
    Given the background observation task is running
    And the Keylime Verifier API becomes unreachable
    When the next observation cycle executes
    Then the task MUST log a warning indicating the Verifier is unreachable
    And the task MUST NOT crash or panic
    And the task MUST retry on the subsequent interval
    And previously recorded observations MUST remain intact in the repository
```
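
The unreachable-Verifier scenario amounts to treating a failed fetch as a logged, non-fatal outcome of a single tick. A minimal std-only sketch, assuming hypothetical names (`VerifierError`, `Observation`, `run_tick`) and plain `Vec`s standing in for the repository and the log:

```rust
/// Hypothetical error type for a failed Verifier call.
#[derive(Debug)]
pub enum VerifierError {
    Unreachable(String),
}

/// One observation as (agent id, result); a stand-in for the real record type.
pub type Observation = (String, String);

/// Runs one tick. On success the fetched observations are appended to `repo`.
/// On failure a warning is pushed to `log` and `repo` is left untouched,
/// so the caller can simply try again on the next interval.
/// Returns true if the tick recorded data.
pub fn run_tick<F>(fetch: F, repo: &mut Vec<Observation>, log: &mut Vec<String>) -> bool
where
    F: FnOnce() -> Result<Vec<Observation>, VerifierError>,
{
    match fetch() {
        Ok(observations) => {
            repo.extend(observations);
            true
        }
        Err(VerifierError::Unreachable(reason)) => {
            // No panic and no early exit from the surrounding loop: log and move on.
            log.push(format!(
                "WARN Verifier unreachable, retrying next interval: {reason}"
            ));
            false
        }
    }
}
```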

---

## 4. Non-Functional Requirements Detail
@@ -4050,7 +4087,8 @@ The design details that realize these requirements -- including component decomp
| IR-017: Sidebar Visibility Toggle | 3.2.2 | Composition View |
| IR-018: Backend Health Probes | 3.7.3 | Algorithm View |
| IR-019: Repository Abstraction Layer | 3.3.11 | Logical View |
| IR-020: Background Attestation Recording | 3.5.4 | Interaction View |

<!-- CHANGED: Added IR-019 for repository abstraction layer -->
<!-- CHANGED: Added IR-020 for background attestation recording -->

The SDD also includes a full SRS traceability matrix (Section 6) mapping every implemented requirement to its corresponding design element.