Instrument gh-ost to emit metrics for migration health, performance, and diagnostics.
Today, we have little visibility into gh-ost runtime metrics, internal performance, or migration lifecycle signals. Key signals not currently emitted include: queue time, throttle times, backfill speed, binlog processing rate, query latency (source vs target), sleep/wait durations, and binlog backlog size. This data will identify where performance needs to improve and inform future dashboards for migration health.
Key metrics we'd like emitted from gh-ost:
- Go runtime memory/GC: diagnose memory leaks, GC pressure
- Binlog backlog: size vs max - are we falling behind?
- Throughput: row backfill and binlog processing rates
- Query latency: source vs target
- Sleep/wait time: duration at each stage
- Throttle time: duration and extent per database
- Cutover duration + lock wait: currently a black box
statsd package choice
We're looking to use the `github.com/DataDog/datadog-go/v5` package to emit the metrics. We also evaluated `github.com/smira/go-statsd` as a leaner alternative, but `datadog-go` offers features we would otherwise have to re-implement, such as:
- client-side aggregation: many metrics will likely emit at high volume
- more tunable sampling and buffering
- histograms and distributions for values such as lag (replication seconds, heartbeat seconds), throttle duration, etc.
Areas where we want to instrument
Wiring / process
| Metric name | Type | Instrumentation |
| --- | --- | --- |
| `gh_ost.startup` | counter | `go/cmd/gh-ost/main.go` — after the metrics client is ready, e.g. just before `Migrate` / `Revert`. |
| `gh_ost.go_runtime.alloc_bytes` | gauge | `go/cmd/gh-ost/main.go` — background sampler goroutine (CLI flag for interval, e.g. default 10s). |
| `gh_ost.go_runtime.sys_bytes` | gauge | same |
| `gh_ost.go_runtime.heap_inuse_bytes` | gauge | same |
| `gh_ost.go_runtime.num_gc` | gauge | same |
| `gh_ost.go_runtime.gc_pause_total_ns` | gauge | same (cumulative; use `rate()` downstream) |
| `gh_ost.go_runtime.goroutines` | gauge | same |

Sampling from `runtime.ReadMemStats` and `runtime.NumGoroutine`.
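The sampler could be a small goroutine built on `runtime.ReadMemStats` and `runtime.NumGoroutine`; a sketch where the `gauge` callback stands in for the real statsd client:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sampleRuntime reads Go runtime stats once and reports each as a gauge.
// gauge is a stand-in for the statsd client's Gauge call.
func sampleRuntime(gauge func(name string, value float64)) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	gauge("go_runtime.alloc_bytes", float64(m.Alloc))
	gauge("go_runtime.sys_bytes", float64(m.Sys))
	gauge("go_runtime.heap_inuse_bytes", float64(m.HeapInuse))
	gauge("go_runtime.num_gc", float64(m.NumGC))
	gauge("go_runtime.gc_pause_total_ns", float64(m.PauseTotalNs)) // cumulative; rate() downstream
	gauge("go_runtime.goroutines", float64(runtime.NumGoroutine()))
}

// startRuntimeSampler launches the background sampler; interval would come
// from the proposed CLI flag (e.g. default 10s).
func startRuntimeSampler(interval time.Duration, gauge func(string, float64), done <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				sampleRuntime(gauge)
			case <-done:
				return
			}
		}
	}()
}

func main() {
	sampleRuntime(func(name string, v float64) { fmt.Printf("%s=%v\n", name, v) })
}
```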
Status tick (row copy, DML, backlog, lag)
Primary file: `go/logic/migrator.go` — `(*Migrator).printStatus`. Values match the status `fmt.Sprintf`.
Since `printStatus` returns early when `shouldPrintStatus` is false (L1379–L1381), we should emit these metrics on every tick even when printing is suppressed.
| Metric name | Type | Notes |
| --- | --- | --- |
| `gh_ost.row_copy.rows_copied` | gauge | `totalRowsCopied` in `printStatus`. |
| `gh_ost.row_copy.rows_estimate` | gauge | `rowsEstimate` in `printStatus`. |
| `gh_ost.dml.events_applied` | gauge | `TotalDMLEventsApplied` (cumulative; `rate()` downstream). |
| `gh_ost.binlog.backlog_size` | gauge | `len(mgtr.applyEventsQueue)`. |
| `gh_ost.binlog.backlog_capacity` | gauge | `cap(mgtr.applyEventsQueue)`. |
| `gh_ost.binlog.backlog_utilization` | gauge | size / cap, in [0, 1]. |
| `gh_ost.lag.replication_seconds` | histogram | `GetCurrentLagDuration().Seconds()`; tag `throttled:true|false`. |
| `gh_ost.lag.heartbeat_seconds` | histogram | `TimeSinceLastHeartbeatOnChangelog().Seconds()`; tag `throttled:true|false`. |
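Since the apply-events queue is a buffered channel, the three backlog gauges fall out of `len` and `cap` directly; a minimal sketch with an illustrative element type:

```go
package main

import "fmt"

// backlogStats reports size, capacity, and utilization in [0, 1] for the
// apply-events queue (a buffered channel in gh-ost; element type here is
// illustrative).
func backlogStats(queue chan int) (size, capacity int, utilization float64) {
	size, capacity = len(queue), cap(queue)
	if capacity > 0 {
		utilization = float64(size) / float64(capacity)
	}
	return size, capacity, utilization
}

func main() {
	q := make(chan int, 100)
	for i := 0; i < 25; i++ {
		q <- i
	}
	s, c, u := backlogStats(q)
	fmt.Println(s, c, u) // 25 100 0.25
}
```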
Query latency
Metric: `gh_ost.query.duration_seconds` (histogram)
Tags: `side:source|target`, `kind:...`, `outcome:ok|error`
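A possible shape for a shared timing wrapper producing these samples; `timedQuery` and its `observe` callback are illustrative names, not existing gh-ost APIs:

```go
package main

import (
	"fmt"
	"time"
)

// timedQuery runs fn and reports its duration as one histogram sample tagged
// by side (source|target), kind, and outcome (ok|error). observe is a
// stand-in for the statsd client's Histogram call.
func timedQuery(side, kind string, observe func(name string, seconds float64, tags []string), fn func() error) error {
	start := time.Now()
	err := fn()
	outcome := "ok"
	if err != nil {
		outcome = "error"
	}
	observe("query.duration_seconds", time.Since(start).Seconds(),
		[]string{"side:" + side, "kind:" + kind, "outcome:" + outcome})
	return err
}

func main() {
	_ = timedQuery("target", "chunk_copy", func(name string, s float64, tags []string) {
		fmt.Println(name, tags)
	}, func() error { return nil })
}
```

Wrapping each call site with one helper keeps tag cardinality fixed and makes the outcome tag impossible to forget on error paths.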
Throttle
File: `go/logic/throttler.go`

| Metric name | Type | Instrumentation |
| --- | --- | --- |
| `gh_ost.throttle.duration_seconds` | histogram | `(*Throttler).initiateThrottlerChecks`: throttle enter/exit around `SetThrottled`; tag `reason:<text>`. |
| `gh_ost.throttle.events_total` | count | Same transitions. |
| `gh_ost.throttle.active` | gauge | Each tick: 1 if throttled, else 0. |
Cutover
File: `go/logic/migrator.go`

| Metric name | Type | Instrumentation |
| --- | --- | --- |
| `gh_ost.cut_over.phase_duration_seconds` | histogram | `(*Migrator).cutOver`, `cutOverTwoStep`, `atomicCutOver`: phase timings via `time.Since`; tags `phase`, `outcome`. |
| `gh_ost.cut_over.attempts_total` | count | Tag `outcome:success|retry|abort`. |
| `gh_ost.cut_over.total_duration_seconds` | histogram | One sample on the terminal cutover outcome. |
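Phase timings via `time.Since` fit a small start/stop closure; `startPhase` is an illustrative helper, not gh-ost code:

```go
package main

import (
	"fmt"
	"time"
)

// startPhase starts a cutover-phase timer; the returned func emits one
// histogram sample tagged with phase and outcome. observe stands in for the
// statsd client's Histogram call.
func startPhase(phase string, observe func(name string, seconds float64, tags []string)) func(outcome string) {
	start := time.Now()
	return func(outcome string) {
		observe("cut_over.phase_duration_seconds", time.Since(start).Seconds(),
			[]string{"phase:" + phase, "outcome:" + outcome})
	}
}

func main() {
	done := startPhase("lock_tables", func(name string, s float64, tags []string) {
		fmt.Println(name, tags)
	})
	// ... acquire locks here ...
	done("success")
}
```

The closure form keeps the start time off the struct, so retries simply call `startPhase` again and each attempt gets its own sample.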
Sleep / wait
Helper (e.g. `metrics.RecordSleep(stage string, d time.Duration)`):
- `gh_ost.sleep.duration_seconds` (histogram, tag `stage`)
- `gh_ost.sleep.total_seconds` (count, tag `stage`)
`kind` tag values for `gh_ost.query.duration_seconds`:

| kind | Instrumentation |
| --- | --- |
| `chunk_copy` | `(*Applier).ApplyIterationInsertQuery` (applier.go) or the `(*Migrator).iterateChunks` call site (migrator.go). |
| `range_select` | `(*Applier).CalculateNextIterationRangeEndValues` (applier.go). |
| `binlog_apply` | `(*Applier).ApplyDMLEventQueries` or `(*Migrator).onApplyEventStruct`. |
| `row_count` | `(*Inspector).CountTableRows` (inspect.go). |
| `heartbeat_read` | `(*Inspector).readChangelogState`; called from `(*Throttler).collectReplicationLag` (L151–L157). Optionally time `GetReplicationLagFromSlaveStatus` on the replica path. |
`stage` tag values for the sleep metrics:

| stage | Instrumentation |
| --- | --- |
| `cut_over_postpone` | `(*Migrator).sleepWhileTrue`, inner `time.Sleep`; used from the `cutOver` postpone loop (heartbeat / postpone region ~L820+). |
| `retry_backoff` | `retryOperation` / `retryOperationWithExponentialBackoff` via `RetrySleepFn`. |
| `chunk_throttle` | `(*Throttler).throttle` wait loop. |

Additional sleep sites: `time.Sleep` / `RetrySleepFn` in migrator.go; `RetrySleepFn` in applier.go.