Instrument gh-ost to emit metrics for migration health, performance, and diagnostics.
Today, we have little visibility into gh-ost runtime metrics, internal performance, or migration lifecycle signals. Key signals not currently emitted include: queue time, throttle times, backfill speed, binlog processing rate, query latency (source vs target), sleep/wait durations, and binlog backlog size. This data will identify where performance needs to improve and inform future dashboards for migration health.
Key metrics we'd like emitted from gh-ost:
- Go runtime memory/GC: diagnose memory leaks, GC pressure
- Binlog backlog: size vs max - are we falling behind?
- Throughput: row backfill and binlog processing rates
- Query latency: source vs target
- Sleep/wait time: duration at each stage
- Throttle time: duration and extent per database
- Cutover duration + lock wait: currently a black box
statsd package choice
We're looking to use the `github.com/DataDog/datadog-go/v5` package to emit the metrics. We also evaluated `github.com/smira/go-statsd` as a leaner alternative, but `datadog-go` offers features we would otherwise have to re-implement, such as:
- client-side aggregation: many metrics will likely emit at high volume
- more tunable sampling and buffering
- histograms and distributions for values such as lag (replication seconds, heartbeat seconds), throttle duration, etc.
Areas where we want to instrument
Wiring / process
| Metric name | Type | Instrumentation |
| --- | --- | --- |
| `gh_ost.startup` | counter | `go/cmd/gh-ost/main.go` — after the metrics client is ready, e.g. just before `Migrate` / `Revert`. |
| `gh_ost.go_runtime.alloc_bytes` | gauge | `go/cmd/gh-ost/main.go` — background sampler goroutine (CLI flag for interval, e.g. default 10s). |
| `gh_ost.go_runtime.sys_bytes` | gauge | same |
| `gh_ost.go_runtime.heap_inuse_bytes` | gauge | same |
| `gh_ost.go_runtime.num_gc` | gauge | same |
| `gh_ost.go_runtime.gc_pause_total_ns` | gauge | same (cumulative; use `rate()` downstream) |
| `gh_ost.go_runtime.goroutines` | gauge | same |

Sampling from `runtime.ReadMemStats` and `runtime.NumGoroutine`.
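The sampler could be a small goroutine built on `runtime.ReadMemStats` and `runtime.NumGoroutine`; a sketch where the `gauge` callback stands in for the real statsd client:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sampleRuntime reads Go runtime stats once and reports each as a gauge.
// gauge is a stand-in for the statsd client's Gauge call.
func sampleRuntime(gauge func(name string, value float64)) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	gauge("go_runtime.alloc_bytes", float64(m.Alloc))
	gauge("go_runtime.sys_bytes", float64(m.Sys))
	gauge("go_runtime.heap_inuse_bytes", float64(m.HeapInuse))
	gauge("go_runtime.num_gc", float64(m.NumGC))
	gauge("go_runtime.gc_pause_total_ns", float64(m.PauseTotalNs)) // cumulative; rate() downstream
	gauge("go_runtime.goroutines", float64(runtime.NumGoroutine()))
}

// startRuntimeSampler launches the background sampler; interval would come
// from the proposed CLI flag (e.g. default 10s).
func startRuntimeSampler(interval time.Duration, gauge func(string, float64), done <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				sampleRuntime(gauge)
			case <-done:
				return
			}
		}
	}()
}

func main() {
	sampleRuntime(func(name string, v float64) { fmt.Printf("%s=%v\n", name, v) })
}
```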
Status tick (row copy, DML, backlog, lag)
Primary file: `go/logic/migrator.go` — `(*Migrator).printStatus`. Values match the status `fmt.Sprintf`.
Since `printStatus` returns early when `shouldPrintStatus` is false (L1379–L1381), we should emit these metrics on every tick even when printing is suppressed.
| Metric name | Type | Notes |
| --- | --- | --- |
| `gh_ost.row_copy.rows_copied` | gauge | `totalRowsCopied` in `printStatus`. |
| `gh_ost.row_copy.rows_estimate` | gauge | `rowsEstimate` in `printStatus`. |
| `gh_ost.dml.events_applied` | gauge | `TotalDMLEventsApplied` (cumulative; `rate()` downstream). |
| `gh_ost.binlog.backlog_size` | gauge | `len(mgtr.applyEventsQueue)`. |
| `gh_ost.binlog.backlog_capacity` | gauge | `cap(mgtr.applyEventsQueue)`. |
| `gh_ost.binlog.backlog_utilization` | gauge | size / cap, in [0, 1]. |
| `gh_ost.lag.replication_seconds` | histogram | `GetCurrentLagDuration().Seconds()`; tag `throttled:true|false`. |
| `gh_ost.lag.heartbeat_seconds` | histogram | `TimeSinceLastHeartbeatOnChangelog().Seconds()`; tag `throttled:true|false`. |
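Since the apply-events queue is a buffered channel, the three backlog gauges fall out of `len` and `cap` directly; a minimal sketch with an illustrative element type:

```go
package main

import "fmt"

// backlogStats reports size, capacity, and utilization in [0, 1] for the
// apply-events queue (a buffered channel in gh-ost; element type here is
// illustrative).
func backlogStats(queue chan int) (size, capacity int, utilization float64) {
	size, capacity = len(queue), cap(queue)
	if capacity > 0 {
		utilization = float64(size) / float64(capacity)
	}
	return size, capacity, utilization
}

func main() {
	q := make(chan int, 100)
	for i := 0; i < 25; i++ {
		q <- i
	}
	s, c, u := backlogStats(q)
	fmt.Println(s, c, u) // 25 100 0.25
}
```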
Query latency
Metric: `gh_ost.query.duration_seconds` (histogram)
Tags: `side:source|target`, `kind:...`, `outcome:ok|error`
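A possible shape for a shared timing wrapper producing these samples; `timedQuery` and its `observe` callback are illustrative names, not existing gh-ost APIs:

```go
package main

import (
	"fmt"
	"time"
)

// timedQuery runs fn and reports its duration as one histogram sample tagged
// by side (source|target), kind, and outcome (ok|error). observe is a
// stand-in for the statsd client's Histogram call.
func timedQuery(side, kind string, observe func(name string, seconds float64, tags []string), fn func() error) error {
	start := time.Now()
	err := fn()
	outcome := "ok"
	if err != nil {
		outcome = "error"
	}
	observe("query.duration_seconds", time.Since(start).Seconds(),
		[]string{"side:" + side, "kind:" + kind, "outcome:" + outcome})
	return err
}

func main() {
	_ = timedQuery("target", "chunk_copy", func(name string, s float64, tags []string) {
		fmt.Println(name, tags)
	}, func() error { return nil })
}
```

Wrapping each call site with one helper keeps tag cardinality fixed and makes the outcome tag impossible to forget on error paths.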
Throttle
File: `go/logic/throttler.go`

| Metric name | Type | Instrumentation |
| --- | --- | --- |
| `gh_ost.throttle.duration_seconds` | histogram | `(*Throttler).initiateThrottlerChecks`: throttle enter/exit around `SetThrottled`; tag `reason:<text>`. |
| `gh_ost.throttle.events_total` | count | Same transitions. |
| `gh_ost.throttle.active` | gauge | Each tick: 1 if throttled, else 0. |
Cutover
File: `go/logic/migrator.go`

| Metric name | Type | Instrumentation |
| --- | --- | --- |
| `gh_ost.cut_over.phase_duration_seconds` | histogram | `(*Migrator).cutOver`, `cutOverTwoStep`, `atomicCutOver`: phase timings via `time.Since`; tags `phase`, `outcome`. |
| `gh_ost.cut_over.attempts_total` | count | Tag `outcome:success|retry|abort`. |
| `gh_ost.cut_over.total_duration_seconds` | histogram | One sample on the terminal cutover outcome. |
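Phase timings via `time.Since` fit a small start/stop closure; `startPhase` is an illustrative helper, not gh-ost code:

```go
package main

import (
	"fmt"
	"time"
)

// startPhase starts a cutover-phase timer; the returned func emits one
// histogram sample tagged with phase and outcome. observe stands in for the
// statsd client's Histogram call.
func startPhase(phase string, observe func(name string, seconds float64, tags []string)) func(outcome string) {
	start := time.Now()
	return func(outcome string) {
		observe("cut_over.phase_duration_seconds", time.Since(start).Seconds(),
			[]string{"phase:" + phase, "outcome:" + outcome})
	}
}

func main() {
	done := startPhase("lock_tables", func(name string, s float64, tags []string) {
		fmt.Println(name, tags)
	})
	// ... acquire locks here ...
	done("success")
}
```

The closure form keeps the start time off the struct, so retries simply call `startPhase` again and each attempt gets its own sample.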
Sleep / wait
Helper (e.g. `metrics.RecordSleep(stage string, d time.Duration)`):
- `gh_ost.sleep.duration_seconds` (histogram, tag `stage`)
- `gh_ost.sleep.total_seconds` (count, tag `stage`)
`kind` tag values for `gh_ost.query.duration_seconds`:

| kind | Instrumentation |
| --- | --- |
| `chunk_copy` | `(*Applier).ApplyIterationInsertQuery` (applier.go) or the `(*Migrator).iterateChunks` call site (migrator.go). |
| `range_select` | `(*Applier).CalculateNextIterationRangeEndValues` (applier.go). |
| `binlog_apply` | `(*Applier).ApplyDMLEventQueries` or `(*Migrator).onApplyEventStruct`. |
| `row_count` | `(*Inspector).CountTableRows` (inspect.go). |
| `heartbeat_read` | `(*Inspector).readChangelogState`; called from `(*Throttler).collectReplicationLag` (L151–L157). Optionally time `GetReplicationLagFromSlaveStatus` on the replica path. |
`stage` tag values for the sleep metrics:

| stage | Instrumentation |
| --- | --- |
| `cut_over_postpone` | `(*Migrator).sleepWhileTrue`, inner `time.Sleep`; used from the `cutOver` postpone loop (heartbeat / postpone region ~L820+). |
| `retry_backoff` | `retryOperation` / `retryOperationWithExponentialBackoff` via `RetrySleepFn`. |
| `chunk_throttle` | `(*Throttler).throttle` wait loop. |

Additional sleep sites: `time.Sleep` / `RetrySleepFn` in migrator.go; `RetrySleepFn` in applier.go.