Skip to content

HBASE-29905 BackupLogCleaner: skip tables no longer in the backup set#7761

Open
janvanbesien wants to merge 1 commit intoapache:masterfrom
NGDATA:HBASE-29905_backup_cleaner_stale_entries
Open

HBASE-29905 BackupLogCleaner: skip tables no longer in the backup set#7761
janvanbesien wants to merge 1 commit intoapache:masterfrom
NGDATA:HBASE-29905_backup_cleaner_stale_entries

Conversation

@janvanbesien
Copy link
Copy Markdown

BackupLogCleaner.serverToPreservationBoundaryTs() computes the WAL deletion boundary by iterating over tableSetTimestampMap from persisted BackupInfo sessions. This map can contain entries for tables that were once part of the backup set but have since had all their backups deleted. Their stale, old timestamps drag the minimum WAL boundary back, preventing old WALs from being cleaned up.

Fix: when computing boundaries, load the incrbackupset per backup root from BackupSystemTable and skip tables not in the active set. The incrbackupset is populated on every full backup and pruned when backups are deleted, so it accurately reflects which tables still need WAL retention.

@janvanbesien janvanbesien force-pushed the HBASE-29905_backup_cleaner_stale_entries branch from 5a9a2ef to 9f69699 Compare February 17, 2026 07:50
@DieterDP-ng
Copy link
Copy Markdown
Contributor

Hi Jan,

2 remarks regarding the format of the PR:

  • HBASE accepts PRs for the master branch. If accepted for master, it's backported to the other branches by the person doing the merge.
  • HBASE asks that the commit header message matches the title of the ticket it solves.

}

@Test
public void testRemovedTableDoesNotPinWals() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this test is misleading. I'd also add a textual description as to the specific case you're checking here.

The current name implies it has to do with a table that was removed. That is also the problem originally encountered (in NGDATA): logs were not being removed for a table that has already been removed. But you're not testing for that here, this test doesn't remove a table.

There's a difference in the logic that's relevant. If you were to extend your testcase to the following: full backup F1, incremental I1, full backup F2 (table 1 only), incremental I2 (table 1 only), delete F1, I1, I'm pretty sure the test would fail because I2 would still refer to table2.

However, if you deleted table 2 after I1, I suspect it will work as expected.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to also modify your commit message a bit to explain the main use case being tables that have since been deleted.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review dieter.

As I understand it, the problem can be triggered by explicitly removing a table from the backup (i.e. explicitly deciding to no longer backup said table) or by removing the table completely (i.e. implicitly it is no longer in any future backup either). You're right that I mixed the two cases and that the primary trigger was the latter, not the former.

I now have a test for both scenarios. Does that make sense?

Also updated commit message and rebased on master.

@janvanbesien janvanbesien force-pushed the HBASE-29905_backup_cleaner_stale_entries branch from 9f69699 to 81e9e0d Compare February 17, 2026 13:53
@janvanbesien janvanbesien changed the base branch from branch-2.6 to master February 17, 2026 13:54
@DieterDP-ng DieterDP-ng force-pushed the HBASE-29905_backup_cleaner_stale_entries branch from 81e9e0d to 1f676bb Compare April 3, 2026 18:09
@DieterDP-ng
Copy link
Copy Markdown
Contributor

I've force-pushed a rebase on the current master.

…es in system:backup table

BackupLogCleaner.serverToPreservationBoundaryTs() computes the WAL
deletion boundary by iterating over tableSetTimestampMap from persisted
BackupInfo sessions. This map can contain entries for tables that were
once part of the backup set but have since been deleted (or had all
their backups deleted). Their stale, old timestamps drag the minimum WAL
boundary back, preventing old WALs from being cleaned up.

Fix: when computing boundaries, load the incrbackupset per backup root
from BackupSystemTable and skip tables not in the active set. The
incrbackupset is populated on every full backup and pruned when backups
are deleted, so it accurately reflects which tables still need WAL
retention.
@DieterDP-ng DieterDP-ng force-pushed the HBASE-29905_backup_cleaner_stale_entries branch from 1f676bb to 5e8a415 Compare April 3, 2026 18:31
Copy link
Copy Markdown

@DieterDePaepe DieterDePaepe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor changes to do, looks good overall.

* removed, and hence no longer in any backup either.
*/
@Test
public void testRemovedTableDoesNotPinWals() throws Exception {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably added this test due to my previous comment, but seeing it in action now, I now realize it serves no purpose. The fact that the table is removed has no effect (i.e. all assertions are fine even if the table isn't removed).

The mechanism added in this PR is that the logcleaner only cares about the tables still in scope for backups, and that's covered in the first test.

Given that this test takes +-30secs to run, I think its best to remove it.

backupTables(BackupType.FULL, Arrays.asList(table1, table2), backupRoot.toString());
assertTrue(checkSucceeded(backupIdB1));

Set<FileStatus> walFilesAfterB1 =
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a sanity check that this isn't empty (and for walFilesAfterB2).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebased your changes to the latest master, and there was some refactoring involved. Would appreciate a second pair of eyes to take a look at the new code.


// Create FULL backup B2 with only table1.
// B2's tableSetTimestampMap carries forward the old timestamp from B1 for table2,
// while table1 gets a fresh timestamp: { table1: ts(B2), table2: ts(B1) }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be useful to verify this in the test as well.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two observations that I had while reviewing, but not something that needs fixing in this PR. Mainly thinking out loud, and might log issues for these.

  • If a table is deleted, it will result in WAL files sticking around until all backups containing that deleted table are gone (because that is when the incremental backup set gets updated). If long backup retention times are used, this can still be problematic. There's room for further improvement here.
  • If a table is deleted and recreated with the same name, the backup info will still have the timestamps of the deleted table, and WALs that are included in incremental backups will also contain data of the deleted table. This would lead to a corrupted restore.

@DieterDePaepe
Copy link
Copy Markdown

@hgromer Could you also review this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants