feat(mongodb): alert when compaction is needed #2433
807646c to 146b0b5
@francoisferrand Heuristics are up for discussion -- I chose a mix of:
francoisferrand left a comment
- Not sure about the thresholds/computation: are these relevant defaults?
- Not sure we should merge in 2.15; we could target 2.16 instead, to get time to "preview" this and avoid shipping alerts that would lead to support calls...
Add --collector.dbstatsfreestorage to mongodb_exporter's extraArgs so
the dbstats response's freeStorageSize / indexFreeStorageSize /
totalFreeStorageSize fields are surfaced as top-level Prometheus series
(mongodb_dbstats_freeStorageSize{database, rs_nm, ...} etc.).
These sub-collectors are not bundled into the catch-all options the
exporter exposes (--collect-all and similar shortcuts); they have to be
opted into explicitly. Since the chart no longer uses --collect-all
anyway (dropped in 8414833 for ZENKO-5281), each individual collector
we want has to be named in extraArgs — which is already how dbstats,
diagnosticdata, replicasetstatus, and topmetrics are wired up. This
just adds dbstatsfreestorage to that list.
Verified on a live Artesca cluster (exporter 0.40.0): without this flag
the freeStorageSize fields only appear as part of the per-host
mongodb_dbstats_raw_<host>_freeStorageSize series — clunky for alerting
queries. With the flag they appear cleanly as top-level series with
{database, rs_nm, ...} labels.
This unblocks the MongoDbCompactionNeeded alert added in the following
commit, which needs totalFreeStorageSize at top level to express the
compaction-pressure heuristic.
Issue: ZENKO-5293
146b0b5 to 2790943
Keeping on 2.15 for now as 2.16 doesn't exist.
2790943 to b8e0eda
Intentionally set to 30% by default (conservative); up for discussion.
Add MongoDbCompactionNeeded Prometheus rule that fires when a MongoDB
database has accumulated reclaimable storage exceeding 30% of the
underlying filesystem capacity:
    totalFreeStorageSize > 0.3 * fsTotalSize
    for: 1h
Per-(pod, database) granularity, severity warning. The threshold is
exposed as the compactionFreeStorageRatioThreshold x-input.
Expressing it as a fraction of fsTotalSize lets it scale across cluster
sizes: a 100 GB filesystem fires around 30 GB reclaimable; a 10 TB
filesystem fires around 3 TB.
Companion fixture covers a needs-compaction DB and a healthy DB sharing
the same filesystem (the first crosses the threshold, the second
doesn't).
Issue: ZENKO-5293
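For reviewers who want to sanity-check the threshold arithmetic, here is a minimal Python sketch of the comparison described in the commit message. It is not the rule itself: it only models the instantaneous condition and ignores the `for: 1h` hold period. The function names, the decimal GB/TB constants, and the two example databases (40 GB and 5 GB reclaimable on a 100 GB filesystem) are assumptions made for illustration, not values taken from the fixture.

```python
GB = 10**9   # decimal units, assumed for illustration
TB = 10**12


def firing_point(fs_total_size: float, ratio: float = 0.3) -> float:
    """Reclaimable bytes above which the condition holds for a given filesystem size."""
    return ratio * fs_total_size


def compaction_needed(total_free_storage: float, fs_total_size: float,
                      ratio: float = 0.3) -> bool:
    """Instantaneous form of: totalFreeStorageSize > ratio * fsTotalSize."""
    return total_free_storage > firing_point(fs_total_size, ratio)


# Same ratio, different absolute thresholds as the filesystem grows.
for fs_total in (100 * GB, 10 * TB):
    print(f"{fs_total / GB:.0f} GB filesystem: "
          f"fires above {firing_point(fs_total) / GB:.0f} GB reclaimable")

# Hypothetical scenario in the spirit of the companion fixture:
# two databases sharing one 100 GB filesystem.
fs_total = 100 * GB
free_by_database = {"needs_compaction_db": 40 * GB, "healthy_db": 5 * GB}
firing = {db: compaction_needed(free, fs_total)
          for db, free in free_by_database.items()}
print(firing)
```

Because the comparison is strict, a database with zero reclaimable storage never satisfies it for any positive filesystem size.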
b8e0eda to fe5f995
Follow-up to ZENKO-5285 (PR #2431), which bundled two alerts in its description and only shipped the first (createIndexes-failed). This adds the second — a fragmentation / compaction-needed signal — that @DarkIsDude explicitly asked be filed as a follow-up.
Two commits, deliberately split
- mongodb: enable dbstatsfreestorage collector in exporter: one-line values.yaml change adding --collector.dbstatsfreestorage to metrics.extraArgs. The chart's exporter no longer uses --collect-all (dropped in 8414833 for ZENKO-5281), so each sub-collector is opted into individually in extraArgs (which already lists dbstats, diagnosticdata, replicasetstatus, topmetrics). Without dbstatsfreestorage, freeStorageSize / totalFreeStorageSize only appear as part of the clunky per-host mongodb_dbstats_raw_<host>_* expansion instead of as top-level mongodb_dbstats_* series.
- mongodb: alert when compaction is needed: the alert proper.
The alert
Per-(pod, database) granularity. Severity warning. Threshold exposed as the compactionFreeStorageRatioThreshold x-input (default 0.3).
Expressing the threshold as a fraction of filesystem capacity (not of the DB's own storage) lets it scale across cluster sizes: a 100 GB filesystem fires at around 30 GB reclaimable; a 10 TB filesystem fires at around 3 TB.
Heuristic notes
A previous draft combined three legs (FS-pressure + fragmentation ratio + absolute floor). @francoisferrand's review pointed out that the fragmentation-ratio leg added noise without much signal and the FS-pressure leg could mask real waste on under-used disks. The single fs-scaled threshold above folds both concerns into one condition.
Per-DB granularity means the alert tells you pod={{ "{{ $labels.pod }}" }} and database={{ "{{ $labels.database }}" }} but does not identify the specific collection. To find that, an operator runs collStats on the alerting DB. The exporter has a --collector.collstats flag that could expose per-collection visibility, but at Artesca-scale cardinality (thousands of buckets per DB) that's expensive, so it is deferred.
Safety against missing / zero values
- (pod, database) tuple → PromQL produces empty → no alert.
- totalFreeStorageSize = 0 (no fragmentation) → 0 > 0.3 * fsTotalSize is always false.
- fsTotalSize = 0 would degenerate (0.3 * 0 = 0, then any positive freeStorage would fire), but it's physically impossible for an attached PVC.
Follow-up (not in this PR)
Matching compactionFreeStorageRatioThreshold config option in ZKOP, to expose the knob to ops. Not filed as a ticket yet; will track when this lands.
Related
Issue: ZENKO-5293