gprecoverseg runs pg_basebackup twice when replication slot does not exist #1654

@my-ship-it

Summary

gprecoverseg -F (and gpaddmirrors) runs pg_basebackup twice per segment when internal_wal_replication_slot does not exist on the primary. The first attempt completes the full data copy but fails at the WAL streaming phase, causing pg_basebackup to remove the entire data directory. A second attempt with --create-slot then starts the full copy from scratch.

For large segments (e.g. ~1TB), this effectively doubles the recovery time and I/O.

Reported in #1648.

Root Cause

In gpMgmt/sbin/gpsegrecovery.py, FullRecovery.run() uses a two-attempt strategy:

  1. First attempt: pg_basebackup --slot internal_wal_replication_slot (without --create-slot), assuming the slot exists.
  2. If it fails, second attempt: pg_basebackup --create-slot --slot internal_wal_replication_slot.

The assumption was that the first attempt would "fail quickly" if the slot doesn't exist. However, the slot check only happens during START_REPLICATION (WAL streaming phase) — after the full data copy is already complete. The same issue exists in gpMgmt/bin/lib/gpconfigurenewsegment.
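The cost of the retry can be illustrated with a simplified model. This is a sketch, not the actual code in gpsegrecovery.py: `full_recovery`, `make_fake_runner`, and the `run_basebackup` callable are hypothetical names, and the fake runner only models the "slot missing" failure described above.

```python
SLOT = "internal_wal_replication_slot"


def full_recovery(run_basebackup, target_dir):
    """Model of the two-attempt strategy.

    run_basebackup is a hypothetical callable that takes an argv list and
    returns True on success. Returns the number of attempts made.
    """
    base = ["pg_basebackup", "-D", target_dir, "-X", "stream", "--slot", SLOT]
    # Attempt 1: assume the slot already exists.
    if run_basebackup(base):
        return 1
    # Attempt 2: let pg_basebackup create the slot and copy everything again.
    if not run_basebackup(base + ["--create-slot"]):
        raise RuntimeError("pg_basebackup failed on both attempts")
    return 2


def make_fake_runner(slot_exists):
    """Fake runner: every call counts as one full data copy, because the
    slot is only checked after the copy has finished."""
    state = {"slot_exists": slot_exists, "copies": 0}

    def run(argv):
        state["copies"] += 1
        if "--create-slot" in argv:
            state["slot_exists"] = True
            return True
        return state["slot_exists"]

    return run, state
```

With `make_fake_runner(False)` the model performs two full copies; with `make_fake_runner(True)` it performs one.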

There is an existing GPDB_12_MERGE_FIXME comment in the code acknowledging this:

#  GPDB_12_MERGE_FIXME could we check it before? or let
#  pg_basebackup create slot if not exists.

Proposed Fix

Before running pg_basebackup, check whether internal_wal_replication_slot exists on the primary (via a replication or utility-mode connection to pg_replication_slots), and create it if needed. This ensures the first pg_basebackup attempt always succeeds, avoiding the costly retry.
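A minimal sketch of the pre-check, assuming a hypothetical `run_sql(sql, params)` helper that executes a statement on the primary (for example over a utility-mode connection) and returns the result rows. The helper and the function name are illustrative, not existing gpMgmt APIs; only `pg_replication_slots` and `pg_create_physical_replication_slot` come from PostgreSQL itself.

```python
SLOT_NAME = "internal_wal_replication_slot"


def ensure_replication_slot(run_sql, slot_name=SLOT_NAME):
    """Create slot_name on the primary if it is missing.

    run_sql is a hypothetical callable: run_sql(sql, params) -> list of rows.
    Returns True if the slot was created, False if it already existed.
    """
    rows = run_sql(
        "SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
        (slot_name,),
    )
    if rows:
        return False
    run_sql("SELECT pg_create_physical_replication_slot(%s)", (slot_name,))
    return True
```

Calling this once per primary before pg_basebackup would let a single `--slot` attempt be used. The check and the create are separate statements, so a caller that expects concurrent recoveries against the same primary would still need to handle a "slot already exists" error.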

Affected Files

  • gpMgmt/sbin/gpsegrecovery.py: FullRecovery.run()
  • gpMgmt/bin/lib/gpconfigurenewsegment: ConfExpSegCmd.run()

Workaround

Manually create the replication slot on each primary before running gprecoverseg:

PGOPTIONS='-c gp_role=utility' psql -h <primary_host> -p <primary_port> -d postgres -c \
  "SELECT pg_create_physical_replication_slot('internal_wal_replication_slot');"
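On clusters with many primaries, the same command can be scripted. The following is a sketch only: `slot_command`, `create_slots`, and the host/port pairs in the usage comment are hypothetical, and the list of primaries has to be supplied by the operator.

```python
import os
import subprocess

SLOT_SQL = "SELECT pg_create_physical_replication_slot('internal_wal_replication_slot');"


def slot_command(host, port):
    """Build the psql argv for one primary (same command as the workaround)."""
    return ["psql", "-h", host, "-p", str(port), "-d", "postgres", "-c", SLOT_SQL]


def create_slots(primaries, run=subprocess.run):
    """Run the workaround on each (host, port) pair in utility mode.

    Raises subprocess.CalledProcessError if psql exits non-zero, which
    includes the case where the slot already exists on that primary.
    """
    env = dict(os.environ, PGOPTIONS="-c gp_role=utility")
    for host, port in primaries:
        run(slot_command(host, port), env=env, check=True)


# Example (hypothetical hosts and ports):
# create_slots([("sdw1", 6000), ("sdw2", 6000)])
```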

Labels: type: Bug, type: Enhancement