StatsD instrumentation #1672

@forge33

Instrument gh-ost to emit metrics for migration health, performance, and diagnostics.

Today, we have little visibility into gh-ost runtime metrics, internal performance, or migration lifecycle signals. Key signals not currently emitted include: queue time, throttle times, backfill speed, binlog processing rate, query latency (source vs target), sleep/wait durations, and binlog backlog size. This data will identify where performance needs to improve and inform future dashboards for migration health.

Key metrics we'd like emitted from gh-ost:

  • Go runtime memory/GC: diagnose memory leaks, GC pressure
  • Binlog backlog: size vs max - are we falling behind?
  • Throughput: row backfill and binlog processing rates
  • Query latency: source vs target
  • Sleep/wait time: duration at each stage
  • Throttle time: duration and extent per database
  • Cutover duration + lock wait: currently a black box

StatsD package choice

We plan to use the github.com/DataDog/datadog-go/v5 package to emit the metrics. We also looked at github.com/smira/go-statsd as a leaner alternative, but datadog-go offers features that we would otherwise probably have to re-implement, such as:

  • client-side aggregation: many metrics will probably be emitted at high volume
  • more tunable sampling and buffering
  • histograms and distributions for values such as lag (replication seconds and heartbeat seconds), throttle duration, etc.
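To keep the choice of package from leaking into every call site, the instrumentation could depend on a small interface instead of the concrete client. The sketch below is only an illustration: `Emitter`, `NoopEmitter` and `MemoryEmitter` are hypothetical names, not existing gh-ost types. The three method signatures are chosen to mirror the `Gauge`, `Count` and `Histogram` methods on datadog-go v5's `*statsd.Client`, so a client returned by `statsd.New` should satisfy the interface without an adapter.

```go
package main

import "fmt"

// Emitter is a hypothetical narrow interface for gh-ost's metric calls.
// The signatures mirror Gauge/Count/Histogram on datadog-go v5's *statsd.Client.
type Emitter interface {
	Gauge(name string, value float64, tags []string, rate float64) error
	Count(name string, value int64, tags []string, rate float64) error
	Histogram(name string, value float64, tags []string, rate float64) error
}

// NoopEmitter discards every metric; a stand-in for runs with metrics disabled.
type NoopEmitter struct{}

func (NoopEmitter) Gauge(string, float64, []string, float64) error     { return nil }
func (NoopEmitter) Count(string, int64, []string, float64) error       { return nil }
func (NoopEmitter) Histogram(string, float64, []string, float64) error { return nil }

// MemoryEmitter keeps formatted metrics in memory so call sites can be unit tested.
type MemoryEmitter struct {
	Lines []string
}

func (m *MemoryEmitter) record(kind, name string, value float64, tags []string) error {
	m.Lines = append(m.Lines, fmt.Sprintf("%s %s=%g %v", kind, name, value, tags))
	return nil
}

func (m *MemoryEmitter) Gauge(n string, v float64, t []string, _ float64) error {
	return m.record("gauge", n, v, t)
}

func (m *MemoryEmitter) Count(n string, v int64, t []string, _ float64) error {
	return m.record("count", n, float64(v), t)
}

func (m *MemoryEmitter) Histogram(n string, v float64, t []string, _ float64) error {
	return m.record("histogram", n, v, t)
}

func main() {
	mem := &MemoryEmitter{}
	var e Emitter = mem
	e.Count("gh_ost.startup", 1, nil, 1)
	e.Histogram("gh_ost.lag.replication_seconds", 0.25, []string{"throttled:false"}, 1)
	for _, line := range mem.Lines {
		fmt.Println(line)
	}
}
```

A no-op implementation also means call sites never need to check whether metrics are enabled.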

Areas where we want to instrument

Wiring / process

| Metric name | Type | Instrumentation |
| --- | --- | --- |
| gh_ost.startup | counter | go/cmd/gh-ost/main.go: after the metrics client is ready, e.g. just before Migrate / Revert. |
| gh_ost.go_runtime.alloc_bytes | gauge | go/cmd/gh-ost/main.go: background sampler goroutine (CLI flag for interval, e.g. default 10s). |
| gh_ost.go_runtime.sys_bytes | gauge | same |
| gh_ost.go_runtime.heap_inuse_bytes | gauge | same |
| gh_ost.go_runtime.num_gc | gauge | same |
| gh_ost.go_runtime.gc_pause_total_ns | gauge | same (cumulative; use rate() downstream) |
| gh_ost.go_runtime.goroutines | gauge | same |

Sampling from: runtime.ReadMemStats and runtime.NumGoroutine.
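A rough sketch of what the background sampler could look like, using only `runtime.ReadMemStats` and `runtime.NumGoroutine`. The names `GaugeFunc`, `sampleRuntime` and `runSampler` are placeholders for illustration, and the gauge callback stands in for whichever client we end up wiring in. Note that `ReadMemStats` briefly stops the world, which is one reason to keep the interval configurable.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// GaugeFunc is a hypothetical callback standing in for the metrics client.
type GaugeFunc func(name string, value float64)

// sampleRuntime reads the Go runtime counters once and reports them as gauges.
func sampleRuntime(gauge GaugeFunc) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	gauge("gh_ost.go_runtime.alloc_bytes", float64(m.Alloc))
	gauge("gh_ost.go_runtime.sys_bytes", float64(m.Sys))
	gauge("gh_ost.go_runtime.heap_inuse_bytes", float64(m.HeapInuse))
	gauge("gh_ost.go_runtime.num_gc", float64(m.NumGC))
	gauge("gh_ost.go_runtime.gc_pause_total_ns", float64(m.PauseTotalNs))
	gauge("gh_ost.go_runtime.goroutines", float64(runtime.NumGoroutine()))
}

// runSampler emits one sample per interval until ctx is cancelled.
func runSampler(ctx context.Context, interval time.Duration, gauge GaugeFunc) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			sampleRuntime(gauge)
		}
	}
}

func main() {
	// A short interval and timeout so the demo exits quickly; the proposal suggests 10s.
	ctx, cancel := context.WithTimeout(context.Background(), 35*time.Millisecond)
	defer cancel()
	runSampler(ctx, 10*time.Millisecond, func(name string, v float64) {
		fmt.Printf("%s=%.0f\n", name, v)
	})
}
```

In main.go this would presumably run as `go runSampler(ctx, interval, gauge)` with the context tied to the migration's lifetime.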


Status tick (row copy, DML, backlog, lag)

Primary file: go/logic/migrator.go, (*Migrator).printStatus. Values match those passed to the status fmt.Sprintf.

printStatus returns early when shouldPrintStatus is false (L1379–L1381), so the metrics should be emitted before that return. That way they are still recorded on every tick, even when printing is suppressed.

| Metric name | Type | Notes |
| --- | --- | --- |
| gh_ost.row_copy.rows_copied | gauge | totalRowsCopied in printStatus. |
| gh_ost.row_copy.rows_estimate | gauge | rowsEstimate in printStatus. |
| gh_ost.dml.events_applied | gauge | TotalDMLEventsApplied (cumulative; rate() downstream). |
| gh_ost.binlog.backlog_size | gauge | len(mgtr.applyEventsQueue). |
| gh_ost.binlog.backlog_capacity | gauge | cap(mgtr.applyEventsQueue). |
| gh_ost.binlog.backlog_utilization | gauge | size / cap in [0,1]. |
| gh_ost.lag.replication_seconds | histogram | GetCurrentLagDuration().Seconds(); tag throttled:true or false. |
| gh_ost.lag.heartbeat_seconds | histogram | TimeSinceLastHeartbeatOnChangelog().Seconds(); tag throttled:true or false. |
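To illustrate the ordering concern in printStatus, here is a minimal sketch in which the gauges are emitted first and the print-suppression check comes afterwards. Everything here is a stand-in: `status`, `emitStatus` and `statusTick` are hypothetical names, the struct fields only approximate the values printStatus already computes, and the two lag histograms are left out for brevity. The utilization helper guards against a zero capacity.

```go
package main

import "fmt"

// GaugeFunc is a hypothetical callback standing in for the metrics client.
type GaugeFunc func(name string, value float64)

// status is an illustrative snapshot of the values printStatus works with.
type status struct {
	RowsCopied, RowsEstimate, DMLEventsApplied int64
	BacklogSize, BacklogCapacity               int
}

// backlogUtilization returns size/capacity in [0,1], or 0 for an unsized queue.
func backlogUtilization(size, capacity int) float64 {
	if capacity <= 0 {
		return 0
	}
	return float64(size) / float64(capacity)
}

// emitStatus reports the status-tick gauges.
func emitStatus(gauge GaugeFunc, s status) {
	gauge("gh_ost.row_copy.rows_copied", float64(s.RowsCopied))
	gauge("gh_ost.row_copy.rows_estimate", float64(s.RowsEstimate))
	gauge("gh_ost.dml.events_applied", float64(s.DMLEventsApplied))
	gauge("gh_ost.binlog.backlog_size", float64(s.BacklogSize))
	gauge("gh_ost.binlog.backlog_capacity", float64(s.BacklogCapacity))
	gauge("gh_ost.binlog.backlog_utilization", backlogUtilization(s.BacklogSize, s.BacklogCapacity))
}

// statusTick emits metrics first, then applies the print-suppression check.
func statusTick(gauge GaugeFunc, s status, shouldPrint bool) {
	emitStatus(gauge, s)
	if !shouldPrint {
		return
	}
	fmt.Printf("Copy: %d/%d; Applied: %d; Backlog: %d/%d\n",
		s.RowsCopied, s.RowsEstimate, s.DMLEventsApplied, s.BacklogSize, s.BacklogCapacity)
}

func main() {
	// A buffered channel plays the role of the apply-events queue.
	queue := make(chan struct{}, 100)
	for i := 0; i < 25; i++ {
		queue <- struct{}{}
	}
	s := status{RowsCopied: 5000, RowsEstimate: 20000, DMLEventsApplied: 42,
		BacklogSize: len(queue), BacklogCapacity: cap(queue)}
	statusTick(func(n string, v float64) { fmt.Printf("%s=%g\n", n, v) }, s, false)
}
```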

Query latency

Metric: gh_ost.query.duration_seconds (histogram)
Tags: side:source|target, kind:..., outcome:ok|error

| kind | side | Instrument |
| --- | --- | --- |
| chunk_copy | target | (*Applier).ApplyIterationInsertQuery (applier.go) or (*Migrator).iterateChunks call site (migrator.go). |
| range_select | target | (*Applier).CalculateNextIterationRangeEndValues (applier.go). |
| binlog_apply | target | (*Applier).ApplyDMLEventQueries or (*Migrator).onApplyEventStruct. |
| row_count | source | (*Inspector).CountTableRows (inspect.go). |
| heartbeat_read | source | (*Inspector).readChangelogState; called from (*Throttler).collectReplicationLag (L151–L157). Optionally time GetReplicationLagFromSlaveStatus on the replica path. |
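All of the call sites above follow the same pattern, so a single wrapper could cover them. The sketch below is a possible shape, not a settled design: `timeQuery` and `HistogramFunc` are hypothetical names, and the query is represented as a plain `func() error` rather than any real Applier or Inspector method. It derives the outcome tag from the returned error and passes the error through unchanged.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// HistogramFunc is a hypothetical callback standing in for the metrics client.
type HistogramFunc func(name string, value float64, tags []string)

// errDemo is a placeholder failure used only by this example.
var errDemo = errors.New("demo query failure")

// timeQuery runs query, records its duration with side/kind/outcome tags,
// and returns the query's error untouched.
func timeQuery(hist HistogramFunc, side, kind string, query func() error) error {
	start := time.Now()
	err := query()
	outcome := "ok"
	if err != nil {
		outcome = "error"
	}
	hist("gh_ost.query.duration_seconds", time.Since(start).Seconds(),
		[]string{"side:" + side, "kind:" + kind, "outcome:" + outcome})
	return err
}

func main() {
	show := func(name string, v float64, tags []string) {
		fmt.Printf("%s %.4fs %v\n", name, v, tags)
	}
	// Simulated chunk copy that succeeds after a short delay.
	_ = timeQuery(show, "target", "chunk_copy", func() error {
		time.Sleep(5 * time.Millisecond)
		return nil
	})
	// Simulated row count that fails.
	_ = timeQuery(show, "source", "row_count", func() error { return errDemo })
}
```

Wrapping at the method level (rather than at the caller) would keep the timing close to the actual database round trip, at the cost of touching more functions.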

Throttle

File: go/logic/throttler.go

| Metric name | Type | Instrumentation |
| --- | --- | --- |
| gh_ost.throttle.duration_seconds | histogram | (*Throttler).initiateThrottlerChecks: throttle enter/exit around SetThrottled; tag reason:<text>. |
| gh_ost.throttle.events_total | count | Same transitions. |
| gh_ost.throttle.active | gauge | Each tick: 1 if throttled else 0. |
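The enter/exit bookkeeping could be isolated in a small tracker that is fed the throttle state on each tick. This is a sketch under a few assumptions: `throttleTracker` and `emitFunc` are hypothetical, the event counter is incremented on entering a throttle only (the proposal leaves open whether exits count too), the duration is tagged with the reason recorded at entry, and "max-lag" is just a placeholder reason string. Timestamps are passed in explicitly so the logic can be tested without sleeping.

```go
package main

import (
	"fmt"
	"time"
)

// emitFunc is a hypothetical callback standing in for the metrics client.
type emitFunc func(kind, name string, value float64, tags []string)

// throttleTracker remembers when and why the current throttle began.
type throttleTracker struct {
	emit      emitFunc
	throttled bool
	since     time.Time
	reason    string
}

// observe is called once per throttler tick with the current state.
func (t *throttleTracker) observe(now time.Time, throttled bool, reason string) {
	switch {
	case throttled && !t.throttled: // entering a throttle
		t.since, t.reason = now, reason
		t.emit("count", "gh_ost.throttle.events_total", 1, []string{"reason:" + reason})
	case !throttled && t.throttled: // leaving a throttle
		t.emit("histogram", "gh_ost.throttle.duration_seconds",
			now.Sub(t.since).Seconds(), []string{"reason:" + t.reason})
	}
	t.throttled = throttled
	active := 0.0
	if throttled {
		active = 1
	}
	t.emit("gauge", "gh_ost.throttle.active", active, nil)
}

// demoTrace replays three ticks: throttled at 0s, still throttled at 3s, released at 5s.
func demoTrace() []string {
	var out []string
	t := &throttleTracker{emit: func(kind, name string, v float64, tags []string) {
		out = append(out, fmt.Sprintf("%s %s=%g %v", kind, name, v, tags))
	}}
	start := time.Unix(0, 0)
	t.observe(start, true, "max-lag")
	t.observe(start.Add(3*time.Second), true, "max-lag")
	t.observe(start.Add(5*time.Second), false, "")
	return out
}

func main() {
	for _, line := range demoTrace() {
		fmt.Println(line)
	}
}
```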

Cutover

File: go/logic/migrator.go

| Metric name | Type | Instrumentation |
| --- | --- | --- |
| gh_ost.cut_over.phase_duration_seconds | histogram | (*Migrator).cutOver, cutOverTwoStep, atomicCutOver: phase timings via time.Since; tags phase, outcome. |
| gh_ost.cut_over.attempts_total | count | Tag outcome:success or retry or abort. |
| gh_ost.cut_over.total_duration_seconds | histogram | One sample on terminal cutover outcome. |
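One way the three cutover metrics could fit together is sketched below. It does not reflect how cutOver is actually structured: `cutOverWithMetrics`, `timePhase`, the retry loop and the phase names "lock" and "rename" are all made up for illustration. The point is the tagging: a failed attempt is a retry while attempts remain and an abort otherwise, and the total duration is recorded exactly once at the terminal outcome.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// emitFunc is a hypothetical callback standing in for the metrics client.
type emitFunc func(kind, name string, value float64, tags []string)

// errDemo is a placeholder failure used only by this example.
var errDemo = errors.New("demo lock wait timeout")

// timePhase records how long a single cutover phase took and how it ended.
func timePhase(emit emitFunc, phase string, fn func() error) error {
	start := time.Now()
	err := fn()
	outcome := "success"
	if err != nil {
		outcome = "error"
	}
	emit("histogram", "gh_ost.cut_over.phase_duration_seconds",
		time.Since(start).Seconds(), []string{"phase:" + phase, "outcome:" + outcome})
	return err
}

// cutOverWithMetrics retries attempt up to maxAttempts times, tagging each try.
func cutOverWithMetrics(emit emitFunc, maxAttempts int, attempt func(n int) error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	start := time.Now()
	var err error
	for n := 1; n <= maxAttempts; n++ {
		err = attempt(n)
		outcome := "success"
		switch {
		case err != nil && n < maxAttempts:
			outcome = "retry"
		case err != nil:
			outcome = "abort"
		}
		emit("count", "gh_ost.cut_over.attempts_total", 1, []string{"outcome:" + outcome})
		if err == nil {
			break
		}
	}
	// Exactly one total-duration sample, on the terminal outcome.
	emit("histogram", "gh_ost.cut_over.total_duration_seconds", time.Since(start).Seconds(), nil)
	return err
}

func main() {
	show := func(kind, name string, v float64, tags []string) {
		fmt.Printf("%s %s=%g %v\n", kind, name, v, tags)
	}
	err := cutOverWithMetrics(show, 3, func(n int) error {
		// Hypothetical phases: the lock fails on the first attempt only.
		if err := timePhase(show, "lock", func() error {
			if n == 1 {
				return errDemo
			}
			return nil
		}); err != nil {
			return err
		}
		return timePhase(show, "rename", func() error { return nil })
	})
	fmt.Println("result:", err)
}
```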

Sleep / wait

Helper (e.g. metrics.RecordSleep(stage string, d time.Duration)):

  • gh_ost.sleep.duration_seconds (histogram, tag stage)
  • gh_ost.sleep.total_seconds (count, tag stage)

| Suggested stage | Location |
| --- | --- |
| cut_over_postpone | (*Migrator).sleepWhileTrue, inner time.Sleep; used from cutOver postpone loop (heartbeat / postpone region ~L820+). |
| retry_backoff | retryOperation / retryOperationWithExponentialBackoff via RetrySleepFn. |
| chunk_throttle | (*Throttler).throttle wait loop. |
| (other?) | Remaining time.Sleep / RetrySleepFn in migrator.go; RetrySleepFn in applier.go. |
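A possible shape for the helper is sketched below. `SleepRecorder`, `SleepFor` and `emitFunc` are hypothetical names; only the `RecordSleep(stage, d)` signature comes from the proposal above. The sleep function is injected so tests can stub it. One caveat worth deciding on: datadog-go's `Count` takes an `int64`, so the sketch passes fractional seconds through a float callback, and the unit for gh_ost.sleep.total_seconds may need revisiting with the real client.

```go
package main

import (
	"fmt"
	"time"
)

// emitFunc is a hypothetical callback standing in for the metrics client.
type emitFunc func(kind, name string, value float64, tags []string)

// SleepRecorder wraps sleeping so every wait is attributed to a stage.
type SleepRecorder struct {
	Emit  emitFunc
	Sleep func(time.Duration) // time.Sleep in production; stubbed in tests
}

// RecordSleep reports one completed sleep of duration d for the given stage.
func (r *SleepRecorder) RecordSleep(stage string, d time.Duration) {
	tags := []string{"stage:" + stage}
	r.Emit("histogram", "gh_ost.sleep.duration_seconds", d.Seconds(), tags)
	r.Emit("count", "gh_ost.sleep.total_seconds", d.Seconds(), tags)
}

// SleepFor sleeps for d and records the time actually spent asleep.
func (r *SleepRecorder) SleepFor(stage string, d time.Duration) {
	start := time.Now()
	r.Sleep(d)
	r.RecordSleep(stage, time.Since(start))
}

func main() {
	r := &SleepRecorder{
		Emit: func(kind, name string, v float64, tags []string) {
			fmt.Printf("%s %s=%.3f %v\n", kind, name, v, tags)
		},
		Sleep: time.Sleep,
	}
	r.SleepFor("retry_backoff", 20*time.Millisecond)
}
```

Call sites that already go through RetrySleepFn could swap in a function built on `SleepFor`, while bare time.Sleep calls would need to be replaced individually.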
