HBASE-29905 BackupLogCleaner: skip tables no longer in the backup set #7761
janvanbesien wants to merge 1 commit into apache:master
Conversation
Force-pushed 5a9a2ef to 9f69699
Hi Jan, 2 remarks regarding the format of the PR:
}

@Test
public void testRemovedTableDoesNotPinWals() throws Exception {
The name of this test is misleading. I'd also add a textual description as to the specific case you're checking here.
The current name implies it has to do with a table that was removed. That is also the problem originally encountered (in NGDATA): logs were not being removed for a table that had already been removed. But you're not testing that here; this test doesn't remove a table.
There's a relevant difference in the logic. If you were to extend your test case to the following: full backup F1, incremental I1, full backup F2 (table1 only), incremental I2 (table1 only), then delete F1 and I1, I'm pretty sure the test would fail because I2 would still refer to table2.
However, if you deleted table 2 after I1, I suspect it will work as expected.
I suggest to also modify your commit message a bit to explain the main use case being tables that have since been deleted.
Thanks for your review, Dieter.
As I understand it, the problem can be triggered either by explicitly removing a table from the backup set (i.e. explicitly deciding to no longer back up that table) or by deleting the table completely (i.e. it is implicitly no longer part of any future backup either). You're right that I mixed the two cases and that the primary trigger was the latter, not the former.
I now have a test for both scenarios. Does that make sense?
Also updated commit message and rebased on master.
Force-pushed 9f69699 to 81e9e0d
Force-pushed 81e9e0d to 1f676bb
I've force-pushed a rebase on the current master.
…es in system:backup table

BackupLogCleaner.serverToPreservationBoundaryTs() computes the WAL deletion boundary by iterating over tableSetTimestampMap from persisted BackupInfo sessions. This map can contain entries for tables that were once part of the backup set but have since been deleted (or had all their backups deleted). Their stale, old timestamps drag the minimum WAL boundary back, preventing old WALs from being cleaned up.

Fix: when computing boundaries, load the incrbackupset per backup root from BackupSystemTable and skip tables not in the active set. The incrbackupset is populated on every full backup and pruned when backups are deleted, so it accurately reflects which tables still need WAL retention.
Force-pushed 1f676bb to 5e8a415
DieterDePaepe left a comment
A few minor changes to do, looks good overall.
 * removed, and hence no longer in any backup either.
 */
@Test
public void testRemovedTableDoesNotPinWals() throws Exception {
You probably added this test due to my previous comment, but seeing it in action now, I realize it serves no purpose. The fact that the table is removed has no effect (i.e. all assertions pass even if the table isn't removed).
The mechanism added in this PR is that the logcleaner only cares about the tables still in scope for backups, and that's covered in the first test.
Given that this test takes roughly 30 seconds to run, I think it's best to remove it.
backupTables(BackupType.FULL, Arrays.asList(table1, table2), backupRoot.toString());
assertTrue(checkSucceeded(backupIdB1));

Set<FileStatus> walFilesAfterB1 =
Add a sanity check that this isn't empty (and for walFilesAfterB2).
I rebased your changes onto the latest master, and there was some refactoring involved. I'd appreciate a second pair of eyes on the new code.
// Create FULL backup B2 with only table1.
// B2's tableSetTimestampMap carries forward the old timestamp from B1 for table2,
// while table1 gets a fresh timestamp: { table1: ts(B2), table2: ts(B1) }
Would be useful to verify this in the test as well.
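To make the carry-forward concrete, here's a tiny self-contained toy model of what the comment describes (the map contents and class name are illustrative, not the HBase API): a full backup refreshes timestamps only for the tables it covers, while entries for other tables ride along unchanged.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of the carry-forward: B2 refreshes only the tables it backs up;
// table2's entry is carried over unchanged from B1's session.
public class CarryForward {
  public static void main(String[] args) {
    // After B1 (full backup of table1 and table2), both carry ts(B1).
    Map<String, Long> afterB1 = new TreeMap<>();
    afterB1.put("table1", 100L); // ts(B1)
    afterB1.put("table2", 100L); // ts(B1)

    // B2 covers only table1: start from B1's map and refresh table1's entry.
    Map<String, Long> afterB2 = new TreeMap<>(afterB1);
    afterB2.put("table1", 200L); // ts(B2)

    // table2 still carries the stale ts(B1).
    System.out.println(afterB2); // {table1=200, table2=100}
  }
}
```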
Two observations I had while reviewing; neither needs fixing in this PR. Mainly thinking out loud, and I might log issues for these.
- If a table is deleted, it will result in WAL files sticking around until all backups containing that deleted table are gone (because that is when the incremental backup set gets updated). If long backup retention times are used, this can still be problematic. There's room for further improvement here.
- If a table is deleted and recreated with the same name, the backup info will still have the timestamps of the deleted table, and WALs that are included in incremental backups will also contain data of the deleted table. This would lead to a corrupted restore.
@hgromer Could you also review this PR?
BackupLogCleaner.serverToPreservationBoundaryTs() computes the WAL deletion boundary by iterating over tableSetTimestampMap from persisted BackupInfo sessions. This map can contain entries for tables that were once part of the backup set but have since had all their backups deleted. Their stale, old timestamps drag the minimum WAL boundary back, preventing old WALs from being cleaned up.
Fix: when computing boundaries, load the incrbackupset per backup root from BackupSystemTable and skip tables not in the active set. The incrbackupset is populated on every full backup and pruned when backups are deleted, so it accurately reflects which tables still need WAL retention.
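For illustration, a minimal self-contained sketch of the boundary logic the fix describes (class and method names are made up for this example; the real logic lives in BackupLogCleaner and BackupSystemTable): when computing the per-server minimum, entries for tables that are no longer in the active incremental backup set are skipped so their stale timestamps cannot pin WALs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of the fixed boundary computation (illustrative names only).
// Per server, the WAL preservation boundary is the minimum timestamp across
// tables, but only tables still in the active incremental backup set count.
public class BoundaryModel {
  public static Map<String, Long> preservationBoundary(
      Map<String, Map<String, Long>> serverToTableTs, Set<String> activeTables) {
    Map<String, Long> boundary = new HashMap<>();
    for (Map.Entry<String, Map<String, Long>> server : serverToTableTs.entrySet()) {
      long min = Long.MAX_VALUE;
      for (Map.Entry<String, Long> table : server.getValue().entrySet()) {
        if (!activeTables.contains(table.getKey())) {
          continue; // stale entry for a removed table: ignore it
        }
        min = Math.min(min, table.getValue());
      }
      if (min != Long.MAX_VALUE) {
        boundary.put(server.getKey(), min);
      }
    }
    return boundary;
  }

  public static void main(String[] args) {
    Map<String, Long> tableTs = new HashMap<>();
    tableTs.put("table1", 200L); // fresh timestamp from the latest backup
    tableTs.put("table2", 100L); // stale timestamp; table2 was deleted
    Map<String, Map<String, Long>> serverToTableTs = new HashMap<>();
    serverToTableTs.put("server1", tableTs);

    Set<String> active = new HashSet<>();
    active.add("table1"); // table2 is no longer in the incremental backup set

    // Without the skip, the boundary would be 100; with it, it advances to 200.
    System.out.println(preservationBoundary(serverToTableTs, active)); // {server1=200}
  }
}
```

With the skip in place, WALs older than the deleted table's last timestamp but newer than the active tables' boundary become eligible for cleanup, which is exactly the behavior the commit message claims.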