feat(import): add support for multiple hbase snapshot imports by tianlei2 · Pull Request #4600 · googleapis/java-bigtable-hbase

tianlei2 · 2026-04-28T18:41:41Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

b/429250716

This is the first PR that incorporates changes from https://github.com/jhambleton/java-bigtable-hbase/commits/dataflow-v2-v2.15.6 and some fixes to make it pass the tests.

Fixed Test Isolation Issues
SnapshotUtilsTest.testGetHbaseConfiguration was failing because the static configuration field SnapshotUtils.hbaseConfiguration cached state between test cases, leaking stale data into subsequent tests.
- Solution: Added a @Before setup method to reset the static field to null via reflection before every test run.
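For readers unfamiliar with the pattern, the reflection-based reset might look roughly like the sketch below. `CachingUtils` and its `cachedConfiguration` field are hypothetical stand-ins, not the real `SnapshotUtils` code, and the JUnit wiring is left out so the snippet runs on its own; in a test class, `resetStaticField` would be called from the `@Before` method.

```java
import java.lang.reflect.Field;

public class StaticFieldResetSketch {

  /** Hypothetical stand-in for a utility class that caches state in a static field. */
  static class CachingUtils {
    private static String cachedConfiguration;

    static String getConfiguration(String source) {
      if (cachedConfiguration == null) {
        cachedConfiguration = "config-from-" + source;
      }
      return cachedConfiguration;
    }
  }

  /** Sets a static reference field back to null so the next caller starts clean. */
  static void resetStaticField(Class<?> owner, String fieldName) {
    try {
      Field field = owner.getDeclaredField(fieldName);
      field.setAccessible(true);
      field.set(null, null); // the instance argument is ignored for static fields
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not reset " + fieldName, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(CachingUtils.getConfiguration("first"));
    // Without a reset, the second call still sees the value cached by the first.
    System.out.println(CachingUtils.getConfiguration("second"));
    resetStaticField(CachingUtils.class, "cachedConfiguration");
    System.out.println(CachingUtils.getConfiguration("second"));
  }
}
```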
Fixed Timestamp Formatting Tests
SnapshotUtilsTest.testAppendCurrentTimestamp was throwing a NumberFormatException because the return value contained a UUID suffix (timestamp-UUID), but the test attempted to parse the entire string directly as a Long.
- Solution: Updated the test to split the string using the "-" character to extract and correctly parse just the timestamp prefix.
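A minimal sketch of the parsing fix, assuming the value has the shape `<epochMillis>-<uuid>` as described above. The helper names are hypothetical and do not come from `SnapshotUtils`; the point is only that the UUID contains dashes too, so just the first segment is parsed.

```java
import java.util.UUID;

public class TimestampPrefixSketch {

  /** Hypothetical producer of a "<epochMillis>-<uuid>" value, mirroring the described format. */
  static String timestampWithUuid(long epochMillis) {
    return epochMillis + "-" + UUID.randomUUID();
  }

  /** Parses only the segment before the first "-"; the UUID part is ignored. */
  static long parseTimestampPrefix(String value) {
    return Long.parseLong(value.split("-")[0]);
  }

  public static void main(String[] args) {
    String value = timestampWithUuid(System.currentTimeMillis());
    // Long.parseLong(value) would throw NumberFormatException here.
    System.out.println(parseTimestampPrefix(value));
  }
}
```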
Resolved Classpath and SPI Conflicts (dnsjava)
Integration tests failed on Java 8 and 11 in Kokoro because of unshaded transitive dependency conflicts (com.google.protobuf.LiteralByteString NoClassDefFoundError).
- Solution: Reverted back to the shaded hbase-shaded-mapreduce dependency, ensuring proper compatibility across all Java versions.
Uncommented and Fixed Tests in ImportJobFromHbaseSnapshotTest
Several useful unit tests were commented out in ImportJobFromHbaseSnapshotTest because mockito-core lacked the ability to mock static methods.
- Solution:
  Switched from mockito-core to mockito-inline in the pom.xml to allow static mocking.
  Uncommented the code and restored the original formatting to prevent any lint errors, enabling JUnit to verify correct configuration parsing.
ComputeAndValidateHashFromBigtableDoFnTest.java was accidentally deleted; added it back.
Cleaned up unused comments.

tianlei2 · 2026-04-29T17:39:52Z

Integration test is run and passing:
https://fusion2.corp.google.com/invocations/33f68247-794b-45c2-8294-1b97ff42c5d0/artifacts/github%2Fjava-bigtable-hbase%2Fbigtable-dataflow-parent%2Fbigtable-beam-import%2Ftarget%2Ffailsafe-reports%2Fintegration-beam%2Fcom.google.cloud.bigtable.beam.hbasesnapshots.EndToEndIT;config=default

mutianf · 2026-05-19T15:58:27Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a new tool, HBaseSnapshotRestoreTool, and updates the existing ImportJobFromHbaseSnapshot to support loading multiple HBase snapshots into Bigtable. It includes several infrastructure improvements, such as adding necessary dependencies, introducing a RegionConfigCoder for efficient serialization, and enhancing the ReadRegions transform with dynamic splitting and sharding capabilities. My review identified several areas for improvement, including fixing an incorrect tracker claim in ReadSnapshotRegion, improving configuration handling in ImportJobFromHbaseSnapshot, ensuring consistent brace usage, and addressing potential null pointer exceptions and minor code style issues.

mutianf · 2026-05-19T20:09:57Z

+      <!-- Version alignment -->
+      <!-- Mark all annotations as provided. They don't affect the runtime of the pipeline so
+      there is no need to try to version align them -->
+      <dependency>


Why are they added in dependency management instead of dependencies?

These are annotations only needed for compilation. Placing them here with provided scope ensures they don't get included in the final shaded JAR if other libraries pull them in.

mutianf · 2026-05-20T02:11:04Z

+    this.snapshots = new ArrayList<>();
+    snapshots.forEach(
+        (snapshotName, bigtableName) ->
+            this.snapshots.add(new SnapshotInfo(snapshotName, bigtableName)));


this.snapshots = snapshots.entrySet().stream().map(entry -> new SnapshotInfo(entry.getKey(), entry.getValue())).collect(Collectors.toList());
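Spelled out as a compilable unit, the suggested stream version could look like the sketch below. The `SnapshotInfo` class here is a hypothetical two-field stand-in for the one in the PR, and `fromMap` is an illustrative name rather than the actual setter.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SnapshotInfoStreamSketch {

  /** Hypothetical stand-in for the PR's SnapshotInfo: a snapshot name and its target table. */
  static final class SnapshotInfo {
    final String snapshotName;
    final String bigtableTableName;

    SnapshotInfo(String snapshotName, String bigtableTableName) {
      this.snapshotName = snapshotName;
      this.bigtableTableName = bigtableTableName;
    }
  }

  /** Converts each map entry into a SnapshotInfo, preserving the map's iteration order. */
  static List<SnapshotInfo> fromMap(Map<String, String> snapshots) {
    return snapshots.entrySet().stream()
        .map(entry -> new SnapshotInfo(entry.getKey(), entry.getValue()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, String> snapshots = new LinkedHashMap<>();
    snapshots.put("snapshot-a", "table-a");
    snapshots.put("snapshot-b", "table-b");
    for (SnapshotInfo info : fromMap(snapshots)) {
      System.out.println(info.snapshotName + " -> " + info.bigtableTableName);
    }
  }
}
```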

mutianf · 2026-05-20T02:14:11Z

+  private Map<String, String> bigtableConfiguration;
+
+  public void setSnapshotsFromMap(Map<String, String> snapshots) {
+    this.snapshots = new ArrayList<>();


nit, maybe rename snapshots to snapshotInfos?

mutianf · 2026-05-20T14:28:09Z

+  // and ensuring thread safety since Beam isolates DoFn instances.
+  @Setup
+  public void setup() {
+    configCache = new java.util.HashMap<>();


nit: import java.util.HashMap and call new HashMap<>() here :)

mutianf · 2026-05-20T14:30:04Z

+          return;
+        }
+      }
+      // Signal completion of the range.


Maybe also update the comment here: ByteKeyRangeTracker uses EMPTY to mark the end of the range https://github.com/apache/beam/blob/2c4d2c6de4dca6b5954c529c2d40c031a8b74f60/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java#L37

mutianf · 2026-05-20T14:34:20Z

+
+  @GetInitialRestriction
+  public ByteKeyRange getInitialRange(@Element RegionConfig regionConfig) {
+    byte[] endKey = regionConfig.getRegionInfo().getEndKey();


Is this line used?

mutianf · 2026-05-20T14:43:49Z

+import org.slf4j.LoggerFactory;
+
+/** A Splittable {@link DoFn} for reading the records from each region. */
+public class ReadSnapshotRegion extends DoFn<RegionConfig, KV<SnapshotConfig, Result>> {


Is this an internal API?

Added an annotation here.

mutianf · 2026-05-22T20:14:28Z

+              snapshotConfig.getRestorePath(),
+              ex);
+          failedCleanups.inc();
+          return; // Give up but don't fail the job


This is just a cleanup job, so if we skip removing a path it's probably OK. However, we should probably call this out somewhere in the public API.

mutianf · 2026-05-22T20:17:07Z

+
+    List<Mutation> mutations = new ArrayList<>();
+
+    boolean logAndSkipIncompatibleRowMutations =


I don't think the logic is correct. I think checking the flag should be inside of convertAndValidateThresholds? And also, why pass in an empty list? I think we can just do List<Mutation> mutations = convertAndValidateThresholds(rowKey, element.getValue()..., snapshotName)
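One possible shape for that refactor is sketched below, purely as an illustration. The cell-count threshold, the `String` stand-ins for mutations, and the parameter list are all assumptions; the real `convertAndValidateThresholds` is not shown in this conversation. The idea is that the method owns the flag check and returns the list instead of filling one passed in by the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConvertAndValidateSketch {

  // Hypothetical threshold; the real limits checked by the PR are not shown here.
  static final int MAX_CELLS_PER_ROW = 3;

  /**
   * Returns the converted mutations for a row. When the row exceeds the threshold, the flag
   * decides inside this method whether the row is skipped (empty list) or the call fails.
   */
  static List<String> convertAndValidateThresholds(
      String rowKey, List<String> cells, boolean logAndSkipIncompatibleRowMutations) {
    if (cells.size() > MAX_CELLS_PER_ROW) {
      String message =
          "Row " + rowKey + " has " + cells.size() + " cells; limit is " + MAX_CELLS_PER_ROW;
      if (logAndSkipIncompatibleRowMutations) {
        System.err.println("Skipping incompatible row: " + message);
        return Collections.emptyList();
      }
      throw new IllegalArgumentException(message);
    }
    List<String> mutations = new ArrayList<>();
    for (String cell : cells) {
      mutations.add("SetCell(" + rowKey + ", " + cell + ")"); // stand-in for a real Mutation
    }
    return mutations;
  }

  public static void main(String[] args) {
    List<String> cells = new ArrayList<>();
    cells.add("cf:a");
    cells.add("cf:b");
    System.out.println(convertAndValidateThresholds("row-1", cells, true));
  }
}
```

With this shape the caller no longer creates an empty list or inspects the flag itself.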

mutianf · 2026-05-22T20:20:01Z

+ * dynamic splitting.
+ */
+@InternalApi("For internal usage only")
+public class HbaseRegionSplitTracker extends RestrictionTracker<ByteKeyRange, ByteKey>


ByteKeyRangeTracker is not a final class, and looks like most of the calls here are just delegating to ByteKeyRangeTracker, can we extend ByteKeyRangeTracker instead? This way the only method we need to override is trySplit()?

mutianf · 2026-05-22T20:21:33Z

+    // and ensuring thread safety since Beam isolates DoFn instances.
+    @Setup
+    public void setup() {
+      configCache = new java.util.HashMap<>();


same here, import java.util.HashMap and call new HashMap<>().

mutianf · 2026-05-22T20:47:04Z

+
+/** A Splittable {@link DoFn} for reading the records from each region. */
+@InternalApi("For internal usage only")
+public class ReadSnapshotRegionFn extends DoFn<RegionConfig, KV<SnapshotConfig, Result>> {


From gemini: The key is the snapshot config, this KV is produced for every single HBase row (potentially billions), using a complex object like SnapshotConfig as a key is extremely inefficient. SnapshotConfig contains a Configuration object, which includes a map of configuration properties. Serializing and transmitting this entire configuration for every row will significantly increase network traffic and memory usage. It would be more efficient to use the table name (a String) as the key.

Is it possible to do this refactor? If possible, the subsequent CreateBigtableMutationsFn signature also needs to be changed.
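To get a rough feel for the per-row overhead being described, the sketch below compares the serialized size of a config-like object against a plain table-name string. It uses Java serialization as a proxy and a hypothetical `FakeSnapshotConfig` class; the actual bytes per element in Beam depend on the coder in use, so the numbers are only indicative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

public class KeySizeSketch {

  /** Hypothetical stand-in for a key object that drags a configuration map along with it. */
  static final class FakeSnapshotConfig implements Serializable {
    private static final long serialVersionUID = 1L;
    final String tableName;
    final HashMap<String, String> properties = new HashMap<>();

    FakeSnapshotConfig(String tableName, int propertyCount) {
      this.tableName = tableName;
      for (int i = 0; i < propertyCount; i++) {
        properties.put("hypothetical.property." + i, "value-" + i);
      }
    }
  }

  /** Returns the number of bytes Java serialization produces for the given value. */
  static int serializedSize(Serializable value) {
    try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
      out.flush();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    int configBytes = serializedSize(new FakeSnapshotConfig("my-table", 200));
    int nameBytes = serializedSize("my-table");
    System.out.println("config key bytes: " + configBytes);
    System.out.println("string key bytes: " + nameBytes);
  }
}
```

Multiplied across billions of rows, even a few kilobytes of difference per key adds up, which is the motivation for keying on the table name.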

mutianf · 2026-05-22T21:02:51Z

+import org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper;
+
+/**
+ *


This needs to be a bit more descriptive, see

java-bigtable-hbase/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/ExportJob.java

Lines 47 to 96 in 60b309b

/**
 * Beam job to export a Bigtable table to a set of SequenceFiles. Afterwards, the files can be
 * either imported into another Bigtable or HBase table. You can limit the rows and columns exported
 * using the options in {@link ExportOptions}. Please note that the rows in SequenceFiles will not
 * be sorted.
 *
 * <p>Furthermore, you can export a subset of the data using a combination of --bigtableStartRow,
 * --bigtableStopRow and --bigtableFilter.
 *
 * <p>Execute the following command to run the job directly:
 *
 * <pre>
 * {@code mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.bigtable.beam.sequencefiles.ExportJob \
 * -Dexec.args="--runner=dataflow \
 * --project=[PROJECT_ID] \
 * --tempLocation=gs://[BUCKET]/[TEMP_PATH] \
 * --bigtableInstanceId=[INSTANCE] \
 * --bigtableTableId=[TABLE] \
 * --destination=gs://[BUCKET]/[EXPORT_PATH] \
 * --maxNumWorkers=[nodes * 10]"
 * }
 * </pre>
 *
 * <p>Execute the following command to create the Dataflow template:
 *
 * <pre>
 * mvn compile exec:java \
 * -DmainClass=com.google.cloud.bigtable.beam.sequencefiles.ExportJob \
 * -Dexec.args="--runner=DataflowRunner \
 * --project=[PROJECT_ID] \
 * --stagingLocation=gs://[STAGING_PATH] \
 * --templateLocation=gs://[TEMPLATE_PATH] \
 * --wait=false"
 * </pre>
 *
 * <p>There are a few ways to run the pipeline using the template. See Dataflow doc for details:
 * https://cloud.google.com/dataflow/docs/templates/executing-templates. Optionally, you can upload
 * a metadata file that contains information about the runtime parameters that can be used for
 * parameter validation purpose and more. A sample metadata file can be found at
 * "src/main/resources/ExportJob_metadata".
 *
 * <p>An example using gcloud command line:
 *
 * <pre>
 * gcloud beta dataflow jobs run [JOB_NAME] \
 * --gcs-location gs://[TEMPLATE_PATH] \
 * --parameters bigtableProject=[PROJECT_ID],bigtableInstanceId=[INSTANCE],bigtableTableId=[TABLE],destinationPath=gs://[DESTINATION_PATH],filenamePrefix=[FILENAME_PREFIX]
 * </pre>
 */

for an example.

mutianf · 2026-05-22T21:03:27Z

+ *    -Dsnapshots=$SNAPSHOT \
+ *    -Dregion=$REGION \
+ *    -DrestorePath=gs://HBASE_EXPORT_ROOT_PATH/restore \
+ *     -jar bigtable-dataflow-parent/bigtable-beam-import/target/bigtable-beam-import-2.12.1-shaded.jar  \


We don't want to hardcode the jar version. Can we use the mvn command instead? See the above link for the example.

mutianf · 2026-05-22T21:08:27Z

+    options.setProject(System.getProperty("project"));
+
+    ImportConfig importConfig =
+        System.getProperty("importConfigFilePath") != null


Consider extracting these system property names as variables.
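A small sketch of what that extraction might look like. The property names `project` and `importConfigFilePath` come from the quoted diff; the constant and method names are hypothetical.

```java
public class SystemPropertyNamesSketch {

  // Property names taken from the quoted diff; the constant names are illustrative.
  static final String PROJECT_PROPERTY = "project";
  static final String IMPORT_CONFIG_FILE_PATH_PROPERTY = "importConfigFilePath";

  /** Returns the configured project, or null when the property is not set. */
  static String project() {
    return System.getProperty(PROJECT_PROPERTY);
  }

  /** Reports whether an import config file path was supplied on the command line. */
  static boolean hasImportConfigFile() {
    return System.getProperty(IMPORT_CONFIG_FILE_PATH_PROPERTY) != null;
  }

  public static void main(String[] args) {
    System.out.println("project = " + project());
    System.out.println("has import config file = " + hasImportConfigFile());
  }
}
```

Keeping the names in one place also lets tests and error messages refer to the same constants.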

mutianf · 2026-05-22T21:08:50Z

+  private static final Log LOG = LogFactory.getLog(HBaseSnapshotRestoreTool.class);
+
+  @VisibleForTesting
+  static final String MISSING_SNAPSHOT_SOURCEPATH =


maybe MISSING_SNAPSHOT_SOURCEPATH_ERROR? same for MISSING_SNAPSHOT_NAMES

tianlei2 requested a review from a team as a code owner April 28, 2026 18:41

product-auto-label Bot added size: xl Pull request size is extra large. api: bigtable Issues related to the googleapis/java-bigtable-hbase API. labels Apr 28, 2026

tianlei2 force-pushed the dataflow-import branch 3 times, most recently from f5324bd to f8b5932 Compare April 28, 2026 19:22

tianlei2 changed the title ~~Dataflow import~~ feat(import): add support for multiple hbase snapshot imports Apr 28, 2026

tianlei2 force-pushed the dataflow-import branch 3 times, most recently from 13f65dc to a3a6c7f Compare April 28, 2026 20:17

tianlei2 marked this pull request as draft April 28, 2026 20:42

tianlei2 marked this pull request as ready for review April 28, 2026 22:41

tianlei2 marked this pull request as draft April 28, 2026 22:42

tianlei2 added the kokoro:run Add this label to force Kokoro to re-run the tests. label Apr 29, 2026

tianlei2 self-assigned this Apr 29, 2026

yoshi-kokoro removed the kokoro:run Add this label to force Kokoro to re-run the tests. label Apr 29, 2026

tianlei2 added 4 commits April 29, 2026 18:18

feat(import): add support for multiple hbase snapshot imports

26e20e4

fix(import): add fork to compiler plugin and optimize GCS search in E…

58cb66e

…ndToEndIT

fix(import): format non-complying files for Google Java Style compliance

66659a8

fix(import): exclude hbase-shaded-client from mapreduce to prevent dn…

a1f9b04

…sjava SPI conflict

tianlei2 force-pushed the dataflow-import branch 4 times, most recently from 12f1c11 to d511a61 Compare April 29, 2026 19:46

tianlei2 added 2 commits April 29, 2026 19:55

test(import): switch to mockito-inline and fix unit tests in ImportJo…

49c3e28

…bFromHbaseSnapshotTest

test(import): fix test isolation and timestamp parsing in SnapshotUti…

5ec8dc1

…lsTest

tianlei2 force-pushed the dataflow-import branch from d511a61 to 5ec8dc1 Compare April 29, 2026 19:57

tianlei2 requested a review from vermas2012 April 29, 2026 20:27

googleapis deleted a comment from google-cla Bot Apr 29, 2026

tianlei2 force-pushed the dataflow-import branch 2 times, most recently from a06223d to 18a4509 Compare May 16, 2026 01:45

Stabilize HBase snapshot import and refactor tests

4da7066

tianlei2 force-pushed the dataflow-import branch from 18a4509 to fa0e9ec Compare May 16, 2026 01:48

tianlei2 added kokoro:run Add this label to force Kokoro to re-run the tests. kokoro:force-run Add this label to force Kokoro to re-run the tests. and removed kokoro:run Add this label to force Kokoro to re-run the tests. labels May 16, 2026

tianlei2 added 12 commits May 19, 2026 14:14

Harden ReadSnapshotRegion splitting and tracking logic

270c8ee

Harden CreateBigtableMutations against OOM and NPE

0a6f2a3

Harden sharding math in ReadRegions

828a6aa

Isolate restore path in SnapshotUtils

0a48635

Optimize configuration caching in ListRegions and SnapshotConfig

f443819

Harden HBaseRegionScanner by disabling background threads

9b6d325

Add unit tests for Transforms

a0f9a82

Add utility and scanner tests

f133b04

Harden cleanup orchestration in CleanupRestoredSnapshots

3fc6786

Restore snapshot idempotency in RestoreSnapshot and ImportJob

fa3b802

Merge remote-tracking branch 'origin/main' into dataflow-import

37d6d65

Improve documentation and comments in bigtable-beam-import

c1b3011

gemini-code-assist Bot reviewed May 19, 2026

tianlei2 and others added 6 commits May 19, 2026 12:03

Update bigtable-dataflow-parent/bigtable-beam-import/src/main/java/co…

727e9f3

…m/google/cloud/bigtable/beam/hbasesnapshots/dofn/HBaseRegionScanner.java Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update bigtable-dataflow-parent/bigtable-beam-import/src/main/java/co…

82465e8

…m/google/cloud/bigtable/beam/hbasesnapshots/dofn/CreateBigtableMutations.java Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update bigtable-dataflow-parent/bigtable-beam-import/src/main/java/co…

4206cdd

…m/google/cloud/bigtable/beam/hbasesnapshots/ImportJobFromHbaseSnapshot.java Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Apply Gemini Code Assist suggestions to ImportConfig and ImportJob

afc4f3f

Use parameterized logging in HBaseSnapshotRestoreTool

0ed8356

Fix logging in HBaseSnapshotRestoreTool for Commons Logging compatibi…

c4ae5d8

…lity

mutianf reviewed May 20, 2026

Rename DoFn classes to follow Beam convention with Fn suffix

5be9e32

mutianf reviewed May 22, 2026

4 participants
