feat(import): add support for multiple hbase snapshot imports#4600
feat(import): add support for multiple hbase snapshot imports#4600tianlei2 wants to merge 32 commits into
Conversation
f5324bd to
f8b5932
Compare
13f65dc to
a3a6c7f
Compare
…sjava SPI conflict
12f1c11 to
d511a61
Compare
d511a61 to
5ec8dc1
Compare
a06223d to
18a4509
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new tool, HBaseSnapshotRestoreTool, and updates the existing ImportJobFromHbaseSnapshot to support loading multiple HBase snapshots into Bigtable. It includes several infrastructure improvements, such as adding necessary dependencies, introducing a RegionConfigCoder for efficient serialization, and enhancing the ReadRegions transform with dynamic splitting and sharding capabilities. My review identified several areas for improvement, including fixing an incorrect tracker claim in ReadSnapshotRegion, improving configuration handling in ImportJobFromHbaseSnapshot, ensuring consistent brace usage, and addressing potential null pointer exceptions and minor code style issues.
…m/google/cloud/bigtable/beam/hbasesnapshots/dofn/HBaseRegionScanner.java Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…m/google/cloud/bigtable/beam/hbasesnapshots/dofn/CreateBigtableMutations.java Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…m/google/cloud/bigtable/beam/hbasesnapshots/ImportJobFromHbaseSnapshot.java Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
| <!-- Version alignment --> | ||
| <!-- Mark all annotations as provided. They don't affect the runtime of the pipeline so | ||
| there is no need to try to version align them --> | ||
| <dependency> |
There was a problem hiding this comment.
Why are they added in dependency management instead of dependencies?
There was a problem hiding this comment.
These are annotations only needed for compilation. Placing them here with provided scope ensures they don't get included in the final shaded JAR if other libraries pull them in.
| this.snapshots = new ArrayList<>(); | ||
| snapshots.forEach( | ||
| (snapshotName, bigtableName) -> | ||
| this.snapshots.add(new SnapshotInfo(snapshotName, bigtableName))); |
There was a problem hiding this comment.
this.snapshots = snapshots.entrySet().stream().map(entry -> new SnapshotInfo(entry.getKey(), entry.getValue()).collect(Collectors.toList());
| private Map<String, String> bigtableConfiguration; | ||
|
|
||
| public void setSnapshotsFromMap(Map<String, String> snapshots) { | ||
| this.snapshots = new ArrayList<>(); |
There was a problem hiding this comment.
nit, maybe rename snapshots to snapshotInfos?
| // and ensuring thread safety since Beam isolates DoFn instances. | ||
| @Setup | ||
| public void setup() { | ||
| configCache = new java.util.HashMap<>(); |
There was a problem hiding this comment.
nit: import java.util.HashMap and call new HashMap<>() here :)
| return; | ||
| } | ||
| } | ||
| // Signal completion of the range. |
There was a problem hiding this comment.
Maybe also update the comment here: ByteKeyRangeTracker uses EMPTY to mark the end of the range https://github.com/apache/beam/blob/2c4d2c6de4dca6b5954c529c2d40c031a8b74f60/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java#L37
|
|
||
| @GetInitialRestriction | ||
| public ByteKeyRange getInitialRange(@Element RegionConfig regionConfig) { | ||
| byte[] endKey = regionConfig.getRegionInfo().getEndKey(); |
| import org.slf4j.LoggerFactory; | ||
|
|
||
| /** A Splittable {@link DoFn} for reading the records from each region. */ | ||
| public class ReadSnapshotRegion extends DoFn<RegionConfig, KV<SnapshotConfig, Result>> { |
| snapshotConfig.getRestorePath(), | ||
| ex); | ||
| failedCleanups.inc(); | ||
| return; // Give up but don't fail the job |
There was a problem hiding this comment.
This is just a clean up job so if we skipped removing a path it's probably ok. However, we should probably call this out somewhere on the public API.
|
|
||
| List<Mutation> mutations = new ArrayList<>(); | ||
|
|
||
| boolean logAndSkipIncompatibleRowMutations = |
There was a problem hiding this comment.
I don't think the logic is correc,t I think checking the flag should be inside of convertAndValidateThresholds? And also, why pass in an empty list? I think we can just do List mutations = convertAndValidateThresholds(rowKey, element.getValue()..., snapshotName)
| * dynamic splitting. | ||
| */ | ||
| @InternalApi("For internal usage only") | ||
| public class HbaseRegionSplitTracker extends RestrictionTracker<ByteKeyRange, ByteKey> |
There was a problem hiding this comment.
ByteKeyRangeTracker is not a final class, and looks like most of the calls here are just delegating to ByteKeyRangeTracker, can we extend ByteKeyRangeTracker instead? This way the only method we need to override is trySplit()?
| // and ensuring thread safety since Beam isolates DoFn instances. | ||
| @Setup | ||
| public void setup() { | ||
| configCache = new java.util.HashMap<>(); |
There was a problem hiding this comment.
same here, import java.util.HashMap and call new HashMap<>().
|
|
||
| /** A Splittable {@link DoFn} for reading the records from each region. */ | ||
| @InternalApi("For internal usage only") | ||
| public class ReadSnapshotRegionFn extends DoFn<RegionConfig, KV<SnapshotConfig, Result>> { |
There was a problem hiding this comment.
From gemini: The key is the snapshot config, this KV is produced for every single HBase row (potentially billions), using a complex object like SnapshotConfig as a key is extremely inefficient. SnapshotConfig contains a Configuration objectbwhich includes a map of configuration properties. Serializing and transmitting this entire configuration for every row will significantly increase network traffic and memory usage. It would be more efficient to use the table name (a String) as the key.
is it possible to do this refactor? If possible, the subsequent CreateBigtableMutationsFn signature also need to be chagned.
| import org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper; | ||
|
|
||
| /** | ||
| * |
There was a problem hiding this comment.
This needs to be a bit more descriptive, see
for an example.| * -Dsnapshots=$SNAPSHOT \ | ||
| * -Dregion=$REGION \ | ||
| * -DrestorePath=gs://HBASE_EXPORT_ROOT_PATH/restore \ | ||
| * -jar bigtable-dataflow-parent/bigtable-beam-import/target/bigtable-beam-import-2.12.1-shaded.jar \ |
There was a problem hiding this comment.
We don't want to hardcode the jar version. can we use the mvn command instead? See the above link for the example.
| options.setProject(System.getProperty("project")); | ||
|
|
||
| ImportConfig importConfig = | ||
| System.getProperty("importConfigFilePath") != null |
There was a problem hiding this comment.
Consider extract these system property names as variables
| private static final Log LOG = LogFactory.getLog(HBaseSnapshotRestoreTool.class); | ||
|
|
||
| @VisibleForTesting | ||
| static final String MISSING_SNAPSHOT_SOURCEPATH = |
There was a problem hiding this comment.
maybe MISSING_SNAPSHOT_SOURCEPATH_ERROR? same for MISSING_SNAPSHOT_NAMES
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
b/429250716
This is the first PR that incorporates changes from https://github.com/jhambleton/java-bigtable-hbase/commits/dataflow-v2-v2.15.6 and some fixes to make it pass the tests.
SnapshotUtilsTest.testGetHbaseConfiguration was failing because the static configuration field SnapshotUtils.hbaseConfiguration cached state between test cases, leaking stale data into subsequent tests.
SnapshotUtilsTest.testAppendCurrentTimestamp was throwing a NumberFormatException because the return value contained a UUID suffix (timestamp-UUID), but the test attempted to parse the entire string directly as a Long.
Integration tests failed on Java 8 and 11 in Kokoro because of unshaded transitive dependency conflicts (com.google.protobuf.LiteralByteString NoClassDefFoundError).
Several useful unit tests were commented out in ImportJobFromHbaseSnapshotTest because mockito-core lacked the ability to mock static methods.
Switched from mockito-core to mockito-inline in the pom.xml to allow static mocking.
Uncommented the code and restored the original formatting to prevent any lint errors, enabling JUnit to verify correct configuration parsing.