gprecoverseg runs pg_basebackup twice when replication slot does not exist #1654
Summary
gprecoverseg -F (and gpaddmirrors) runs pg_basebackup twice per segment when internal_wal_replication_slot does not exist on the primary. The first attempt completes the full data copy but fails at the WAL streaming phase, causing pg_basebackup to remove the entire data directory. A second attempt with --create-slot then starts the full copy from scratch.
For large segments (e.g. ~1TB), this effectively doubles the recovery time and I/O.
Reported in #1648.
Root Cause
In gpMgmt/sbin/gpsegrecovery.py, FullRecovery.run() uses a two-attempt strategy:
- First attempt: pg_basebackup --slot internal_wal_replication_slot (without --create-slot), assuming the slot exists.
- If it fails, second attempt: pg_basebackup --create-slot --slot internal_wal_replication_slot.
The assumption was that the first attempt would "fail quickly" if the slot doesn't exist. However, the slot check only happens during START_REPLICATION (WAL streaming phase) — after the full data copy is already complete. The same issue exists in gpMgmt/bin/lib/gpconfigurenewsegment.
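The control flow described above can be modelled roughly as follows. This is a simplified sketch, not the actual code in gpsegrecovery.py: `run_basebackup` is a hypothetical callable standing in for whatever launches pg_basebackup, and the target directory, host, and port arguments are omitted.

```python
SLOT = "internal_wal_replication_slot"


def basebackup_args(create_slot):
    """Build a trimmed pg_basebackup argv (directory/host/port omitted)."""
    args = ["pg_basebackup"]
    if create_slot:
        args.append("--create-slot")
    args += ["--slot", SLOT]
    return args


def full_recovery(run_basebackup):
    """Model of the two-attempt strategy; returns how many full copies ran.

    run_basebackup(args) -> bool is a hypothetical helper that runs
    pg_basebackup and reports whether it succeeded.
    """
    # Attempt 1: assume the slot already exists on the primary.
    if run_basebackup(basebackup_args(create_slot=False)):
        return 1
    # Attempt 2: the data directory was removed, so this is a second
    # full copy, this time asking pg_basebackup to create the slot.
    if run_basebackup(basebackup_args(create_slot=True)):
        return 2
    raise RuntimeError("pg_basebackup failed on both attempts")
```

With a stub that only succeeds when `--create-slot` is passed (the missing-slot case), `full_recovery` returns 2, which is the doubled copy this issue describes.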
There is an existing GPDB_12_MERGE_FIXME comment in the code acknowledging this:
# GPDB_12_MERGE_FIXME could we check it before? or let
# pg_basebackup create slot if not exists.
Proposed Fix
Before running pg_basebackup, check whether internal_wal_replication_slot exists on the primary by querying pg_replication_slots over a replication or utility-mode connection, and create the slot if it is missing. The first pg_basebackup attempt then no longer fails for lack of the slot, which avoids the costly retry.
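A minimal sketch of that pre-check, under stated assumptions: `run_sql` is a hypothetical callable that executes one statement on the primary (for example over a utility-mode connection) and returns the result rows. It is not an existing helper in gpMgmt, and the real fix may be structured differently.

```python
SLOT_NAME = "internal_wal_replication_slot"


def ensure_replication_slot(run_sql, slot_name=SLOT_NAME):
    """Create slot_name on the primary if it is missing.

    run_sql(sql, params) is a hypothetical callable returning a list of
    row tuples. Returns True if the slot was created, False if it
    already existed.
    """
    rows = run_sql(
        "SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
        (slot_name,),
    )
    if rows:
        return False  # slot already present, nothing to do
    run_sql("SELECT pg_create_physical_replication_slot(%s)", (slot_name,))
    return True
```

The check and the create are two separate statements, so a concurrent process could create the slot in between; a real implementation would probably need to tolerate an "already exists" error from the second call.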
Affected Files
- gpMgmt/sbin/gpsegrecovery.py: FullRecovery.run()
- gpMgmt/bin/lib/gpconfigurenewsegment: ConfExpSegCmd.run()
Workaround
Manually create the replication slot on each primary before running gprecoverseg:
PGOPTIONS='-c gp_role=utility' psql -h <primary_host> -p <primary_port> -d postgres -c \
"SELECT pg_create_physical_replication_slot('internal_wal_replication_slot');"
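For clusters with many primaries, the same command can be issued in a loop. The sketch below is one possible way to do that: the `primaries` list of (host, port) pairs is an assumption that you would fill in from your own cluster configuration, and the function names are illustrative only.

```python
import os
import subprocess

SLOT = "internal_wal_replication_slot"


def slot_command(host, port):
    """Return the psql argv from the workaround for one primary."""
    sql = "SELECT pg_create_physical_replication_slot('%s');" % SLOT
    return ["psql", "-h", host, "-p", str(port), "-d", "postgres", "-c", sql]


def create_slots(primaries, run=subprocess.run):
    """Run the workaround command against each (host, port) pair."""
    # Same PGOPTIONS as the manual command: connect in utility mode.
    env = dict(os.environ, PGOPTIONS="-c gp_role=utility")
    for host, port in primaries:
        run(slot_command(host, port), env=env, check=True)
```

Calling `create_slots([("sdw1", 6000), ("sdw2", 6000)])` (hypothetical hosts and ports) would run psql once per primary. pg_create_physical_replication_slot raises an error when the slot already exists, so this loop is only suited to primaries known to lack it.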