fix(supervise): cleanStaleLocks removes stale postmaster.pid (container PID-recycling CrashLoop) #276
Merged
KeiaiLab-PHIL merged 1 commit on Jun 22, 2026
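… CrashLoop RCA)

postgres-prod 7-day CrashLoopBackOff RCA: because container PIDs are recycled, postgres gets the same PID (15) on every start, so PostgreSQL's kill(pid,0) liveness check misreads the PID in the postmaster.pid left by the previous crash as alive → fails forever with FATAL lock file already exists. The one-time cleanup in the init container cannot prevent this on main-container restarts (init is not re-run during a CrashLoop).

Extend cleanStaleSocket → cleanStaleLocks: by the same safety argument as for the socket lock (immediately before the fork in Start = no live postmaster in the pod + a shard-dedicated RWO PVC), the DataDir postmaster.pid is also removed on every Start.

Verification: go test supervise ok + 2 regression guards PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>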
Root cause (postgres-prod 7-day CrashLoopBackOff)
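cleanStaleSocket() in supervise.go cleaned up only the unix socket lock. For the postmaster.pid in DataDir it assumed that "PostgreSQL itself checks whether the PID is alive and then treats the file as stale". That assumption is false in an environment where container PIDs are recycled: PostgreSQL's kill(pid,0) liveness check misreads the PID (15) recorded in postmaster.pid as "alive" → FATAL: lock file "postmaster.pid" already exists

Live trigger: during a leader role switch, the supervisor's restart of postgres caused the first crash → an infinite loop from then on.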
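To illustrate why a signal-0 probe cannot tell a stale PID from a recycled one, here is a minimal Go sketch (Unix only). It is not PostgreSQL's code, and PostgreSQL's real lock-file check applies further conditions; pidLooksAlive and probeDemo are hypothetical names, and this process's own PID merely stands in for "a number that some live process currently holds".

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pidLooksAlive sends signal 0, which delivers nothing and only reports
// whether a process with this PID currently exists and may be signalled.
// It cannot tell WHICH process holds the number.
func pidLooksAlive(pid int) bool {
	return syscall.Kill(pid, 0) == nil
}

// probeDemo returns the probe result for a PID that no process holds and
// for a PID that some live process holds.
func probeDemo() (unusedAlive, heldAlive bool) {
	// Assumption: 1<<30 is above any Linux pid_max, so no process has it.
	unusedAlive = pidLooksAlive(1 << 30)
	// Stand-in for a recycled PID: any number currently held by a live
	// process passes the probe. Our own PID is used purely for convenience.
	heldAlive = pidLooksAlive(os.Getpid())
	return unusedAlive, heldAlive
}

func main() {
	unused, held := probeDemo()
	fmt.Println("unused PID looks alive:", unused)
	fmt.Println("held PID looks alive:  ", held)
}
```

The probe answers "does this number belong to a live process right now?", not "is the process that wrote the file still running?". In a small container PID namespace the two questions diverge easily, which is the gap this PR closes from the supervisor side.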
Fix
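Extend cleanStaleSocket → cleanStaleLocks: by the same safety argument as for the socket lock (immediately before the fork in Start() = no live postmaster in the pod + a dedicated RWO PVC per shard = no cross-pod sharing of the data dir), the DataDir postmaster.pid is also removed on every Start(). This covers every restart rather than a one-time init cleanup → the loop is broken.
Tests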
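For orientation, a minimal Go sketch of the two behaviours the regression guards in this section cover: an existing postmaster.pid is removed, and a missing one is not an error. This is an assumption-laden stand-in, not the code merged in this PR: the signature cleanStaleLocks(dataDir string) error and the helper cleanDemo are hypothetical, and the socket-lock cleanup is left out.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// cleanStaleLocks (hypothetical signature) removes dataDir/postmaster.pid.
// An already-absent file counts as success. Unconditional removal is only
// safe under the PR's preconditions: it runs right before the postmaster is
// forked, and the data dir is not shared with any other pod.
func cleanStaleLocks(dataDir string) error {
	err := os.Remove(filepath.Join(dataDir, "postmaster.pid"))
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("remove stale postmaster.pid: %w", err)
	}
	return nil
}

// cleanDemo exercises both cases in a throwaway directory and reports
// whether the pid file was removed.
func cleanDemo() (removed bool, err error) {
	dir, err := os.MkdirTemp("", "pgdata-*")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	pidFile := filepath.Join(dir, "postmaster.pid")
	if err := os.WriteFile(pidFile, []byte("15\n"), 0o600); err != nil {
		return false, err
	}
	// Case 1: a leftover pid file exists and must be removed.
	if err := cleanStaleLocks(dir); err != nil {
		return false, err
	}
	_, statErr := os.Stat(pidFile)
	removed = errors.Is(statErr, fs.ErrNotExist)

	// Case 2: the file is already gone; this must not be an error.
	if err := cleanStaleLocks(dir); err != nil {
		return removed, err
	}
	return removed, nil
}

func main() {
	removed, err := cleanDemo()
	fmt.Println("removed:", removed, "err:", err)
}
```

Treating "file not found" as success keeps the call idempotent, so it can run on every Start() regardless of how the previous postmaster exited.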
go test ./internal/instance/supervise/... ok (0 regressions)
TestReal_cleanStaleLocks_RemovesPostmasterPid + TestReal_cleanStaleLocks_NoPidIsNoError PASS
Live verification (postgres-prod, 2026-06-22)
Built ghcr.io/keiailab/pg:18-pidfix (linux/amd64) → updated ClusterImageCatalog keiailab-pg → recreated all shards:
postgrescluster postgres-prod: Provisioning/False (7d) → Ready/True
1/1 Running (0 restarts), SELECT 1 → 1, "database system is ready to accept connections"
postmaster.pid FATAL recurrences = 0
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>