HDDS-14897. Add multiple S3 gateways to the rolling-upgrade suite by dombizita · Pull Request #10028 · apache/ozone

dombizita · 2026-04-02T05:48:24Z

What changes were proposed in this pull request?

Use HAProxy to load balance multiple S3 gateways. I made the necessary changes in docker-compose.yaml and adjusted the shell scripts accordingly. I didn't use the existing s3-haproxy.yaml, as the one in common did not work out of the box with the Ozone HA setup (I also found HDDS-14956). As this suite always needs multiple S3 gateways, I think it's okay to have it in docker-compose.yaml.

One notable change is in hadoop-ozone/dist/src/main/compose/testlib.sh. Without that change I hit this error:

OCI runtime exec failed: exec failed: unable to start container process: exec: "bash": executable file not found in $PATH

Cursor help: "This is from reorder_om_nodes in testlib.sh. It iterates over ALL containers and runs docker exec ... bash -c "...". The HAProxy container (ha-s3g-1) uses haproxy:lts-alpine — Alpine Linux — which only has sh, not bash."

This is new, because the Ozone HA suite never used the S3 HAProxy setup before, and reorder_om_nodes is only called for Ozone HA. The fix simply ignores the failure; as the HAProxy container doesn't need ozone-site.xml, this is safe. The downside is that it would also silently swallow genuine bash failures. Another solution is to use sh instead of bash.
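For illustration, a narrower alternative to a blanket `|| true` could ignore only the "shell cannot be started" case and still surface other failures. This is a minimal sketch, not the code in testlib.sh: `run_unless_shell_missing` is a hypothetical helper, and it assumes that `docker exec` reports exit status 126 or 127 when the requested executable cannot be invoked or found, which should be verified against the Docker version in use.

```shell
# Hypothetical helper (not part of testlib.sh): run a command and treat only
# "cannot invoke" (126) and "command not found" (127) as a skip.
# Assumption: docker exec uses these exit codes when bash is absent in the image.
run_unless_shell_missing() {
  # In an && list a failing first command does not trigger `set -e`.
  "$@" && return 0
  rc=$?
  case "$rc" in
    126|127)
      echo "Skipped (exit ${rc}): $*" >&2
      return 0
      ;;
    *)
      # Any other failure is passed through to the caller.
      return "$rc"
      ;;
  esac
}
```

It would wrap the existing call, for example `run_unless_shell_missing docker exec "${c}" bash -c "..."`, so a failing sed inside a container that does have bash is still reported.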

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14897

How was this patch tested?

CI with the rolling upgrade test suite: https://github.com/dombizita/ozone/actions/runs/23846523428
With the rolling upgrade run commented out (current state on HDDS-14496-zdu): https://github.com/dombizita/ozone/actions/runs/23846562903

--- RESTARTING s3g1 WITH IMAGE 2.2.0 ---
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-s3g1-generate-generate-s3g1 :: Generate data                    
==============================================================================
Create a volume and bucket                                            | PASS |
------------------------------------------------------------------------------
Create key                                                            | PASS |
------------------------------------------------------------------------------
Create a bucket in s3v volume                                         | PASS |
------------------------------------------------------------------------------
Create key in the bucket in s3v volume                                | PASS |
------------------------------------------------------------------------------
Try to create a bucket using S3 API                                   | PASS |
------------------------------------------------------------------------------
Create key using S3 API                                               | PASS |
------------------------------------------------------------------------------
2.2.0-2.2.0-2-s3g1-generate-generate-s3g1 :: Generate data            | PASS |
6 tests, 6 passed, 0 failed
==============================================================================
Output:  /tmp/smoketest/upgrade/result/robot-2.2.0-2.2.0-2-s3g1-001.xml
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-s3g1-validate-generate-s3g1 :: Smoketest ozone cluster startup  
==============================================================================
Read data from previously created key                                 | PASS |
------------------------------------------------------------------------------
Read key created with Ozone Shell using S3 API                        | PASS |
------------------------------------------------------------------------------
Read key created with S3 API using S3 API                             | PASS |
------------------------------------------------------------------------------
2.2.0-2.2.0-2-s3g1-validate-generate-s3g1 :: Smoketest ozone clust... | PASS |
3 tests, 3 passed, 0 failed
==============================================================================
Output:  /tmp/smoketest/upgrade/result/robot-2.2.0-2.2.0-2-s3g1-002.xml
--- RESTARTING s3g2 WITH IMAGE 2.2.0 ---
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-s3g2-generate-generate-s3g2 :: Generate data                    
==============================================================================
Create a volume and bucket                                            | PASS |
------------------------------------------------------------------------------
Create key                                                            | PASS |
------------------------------------------------------------------------------
Create a bucket in s3v volume                                         | PASS |
------------------------------------------------------------------------------
Create key in the bucket in s3v volume                                | PASS |
------------------------------------------------------------------------------
Try to create a bucket using S3 API                                   | PASS |
------------------------------------------------------------------------------
Create key using S3 API                                               | PASS |
------------------------------------------------------------------------------
2.2.0-2.2.0-2-s3g2-generate-generate-s3g2 :: Generate data            | PASS |
6 tests, 6 passed, 0 failed
==============================================================================
Output:  /tmp/smoketest/upgrade/result/robot-2.2.0-2.2.0-2-s3g2-001.xml
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-s3g2-validate-generate-s3g2 :: Smoketest ozone cluster startup  
==============================================================================
Read data from previously created key                                 | PASS |
------------------------------------------------------------------------------
Read key created with Ozone Shell using S3 API                        | PASS |
------------------------------------------------------------------------------
Read key created with S3 API using S3 API                             | PASS |
------------------------------------------------------------------------------
2.2.0-2.2.0-2-s3g2-validate-generate-s3g2 :: Smoketest ozone clust... | PASS |
3 tests, 3 passed, 0 failed
==============================================================================
Output:  /tmp/smoketest/upgrade/result/robot-2.2.0-2.2.0-2-s3g2-002.xml
--- RESTARTING s3g3 WITH IMAGE 2.2.0 ---
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-s3g3-generate-generate-s3g3 :: Generate data                    
==============================================================================
Create a volume and bucket                                            | PASS |
------------------------------------------------------------------------------
Create key                                                            | PASS |
------------------------------------------------------------------------------
Create a bucket in s3v volume                                         | PASS |
------------------------------------------------------------------------------
Create key in the bucket in s3v volume                                | PASS |
------------------------------------------------------------------------------
Try to create a bucket using S3 API                                   | PASS |
------------------------------------------------------------------------------
Create key using S3 API                                               | PASS |
------------------------------------------------------------------------------
2.2.0-2.2.0-2-s3g3-generate-generate-s3g3 :: Generate data            | PASS |
6 tests, 6 passed, 0 failed
==============================================================================
Output:  /tmp/smoketest/upgrade/result/robot-2.2.0-2.2.0-2-s3g3-001.xml
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-s3g3-validate-generate-s3g3 :: Smoketest ozone cluster startup  
==============================================================================
Read data from previously created key                                 | PASS |
------------------------------------------------------------------------------
Read key created with Ozone Shell using S3 API                        | PASS |
------------------------------------------------------------------------------
Read key created with S3 API using S3 API                             | PASS |
------------------------------------------------------------------------------
2.2.0-2.2.0-2-s3g3-validate-generate-s3g3 :: Smoketest ozone clust... | PASS |
3 tests, 3 passed, 0 failed

adoroszlai

Thanks @dombizita, LGTM.

hadoop-ozone/dist/src/main/compose/upgrade/compose/ha/load.sh

adoroszlai · 2026-04-02T12:05:17Z

hadoop-ozone/dist/src/main/compose/testlib.sh

          sed -i -e 's/om1,om2,om3/${new_order}/' /etc/hadoop/ozone-site.xml; \
          echo 'Replaced OM order with ${new_order} in ${c}'; \
-        fi"
+        fi" || true


silently swallow genuine bash failures

This is fine. In the worst case, the OM client will contact a follower first because the original order is kept.

errose28

Thanks for adding this @dombizita. I don't have any experience configuring HAProxy, but Cursor found this potential issue:

s3-haproxy.cfg uses plain balance roundrobin with no check / option httpchk and no option redispatch / retries. While a backend is stopped during rolling_restart_service, HAProxy still has a 1-in-3 chance of selecting it on each new connection, so S3 calls can fail even though two gateways are up. That works against “constant uptime” and can make the upgraded callbacks flaky.
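To make the suggestion concrete, here is a sketch of a config with health checks and redispatch, emitted from a small shell function so it can sit next to the suite's scripts. It is not the s3-haproxy.cfg from this PR: the function name, section names, timeouts, and check intervals are assumptions, and port 9878 is assumed to be the S3 gateway HTTP port. The `check` keyword without `option httpchk` performs a plain TCP connect check; an HTTP check would need an endpoint known to return 2xx/3xx.

```shell
# Hypothetical generator for an HAProxy config with health checks.
# Assumptions: backends are reachable as s3g1..s3g3 on the given port.
render_s3_haproxy_cfg() {
  port="${1:-9878}"
  cat <<EOF
defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    # Retry failed connections, and allow the retry to go to another server.
    retries 3
    option redispatch

frontend s3_frontend
    bind *:${port}
    default_backend s3_gateways

backend s3_gateways
    balance roundrobin
EOF
  # "check" enables periodic health checks so a stopped gateway is taken
  # out of rotation after two failed probes and re-added after two successes.
  for name in s3g1 s3g2 s3g3; do
    echo "    server ${name} ${name}:${port} check inter 2s fall 2 rise 2"
  done
}
```

Running `render_s3_haproxy_cfg > s3-haproxy.cfg` would write the file; with checks enabled, a gateway that is down during rolling_restart_service should stop receiving new connections shortly after it stops.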

errose28 · 2026-04-08T20:02:50Z

hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh

I think we need a wait_for_port check for s3g{1..3}. Hypothetically, all the S3 gateways could be restarted faster than their individual ports become active after restart, and they would then all be offline for a period of time.
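A rough sketch of such a wait, to show the intended behavior. This is not the existing wait_for_port from testlib.sh (its signature is not shown here); `wait_for_tcp_port`, `probe_tcp_port`, and the `PORT_PROBE` override are hypothetical names, and the default probe assumes an `nc` that supports `-z` is available where the script runs.

```shell
# Hypothetical default probe: succeeds if a TCP connection can be opened.
# Assumption: nc with -z support is installed.
probe_tcp_port() {
  nc -z "$1" "$2" >/dev/null 2>&1
}

# Poll host:port once per second until it accepts connections or the
# timeout (seconds, default 60) expires. The body runs in a subshell,
# so its variables do not leak into the caller.
wait_for_tcp_port() (
  host="$1"
  port="$2"
  timeout="${3:-60}"
  deadline=$(( $(date +%s) + $timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if "${PORT_PROBE:-probe_tcp_port}" "$host" "$port"; then
      exit 0
    fi
    sleep 1
  done
  echo "Timed out after ${timeout}s waiting for ${host}:${port}" >&2
  exit 1
)
```

In the driver this could be called after each gateway restart, for example `wait_for_tcp_port s3g1 9878 120`, before moving on to s3g2, so that at most one gateway is unavailable at a time.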

errose28 · 2026-04-08T20:17:33Z

hadoop-ozone/dist/src/main/compose/testlib.sh

          sed -i -e 's/om1,om2,om3/${new_order}/' /etc/hadoop/ozone-site.xml; \
          echo 'Replaced OM order with ${new_order} in ${c}'; \
-        fi"
+        fi" || true


Can we do this differently? For example:

Exclude haproxy containers from the loop

Check if /etc/hadoop/ozone-site.xml exists in the container using sh before invoking bash

Reformat this to use sh

Switch to the Debian-based haproxy image (30 MB larger)

If we do want to use the current solution let's add a comment explaining it.
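As a sketch of the second and third options above (probe with sh, then do the edit with sh as well), something like the following could replace the per-container body of the loop. It is an illustration only: `reorder_om_nodes_in` and its optional third parameter are hypothetical, the sed expression is copied from the diff, and it assumes every image ships `sh`, `test`, and a `sed` that supports `-i`.

```shell
# Hypothetical per-container step: skip containers without ozone-site.xml
# (such as the Alpine-based HAProxy one) instead of masking errors.
reorder_om_nodes_in() {
  container="$1"
  new_order="$2"
  conf="${3:-/etc/hadoop/ozone-site.xml}"

  # sh exists in Alpine images too, so this probe does not need bash.
  if ! docker exec "$container" sh -c "test -f '$conf'"; then
    echo "Skipping ${container}: ${conf} not found"
    return 0
  fi

  # A failure here is a genuine error and is returned to the caller.
  docker exec "$container" sh -c \
    "sed -i -e 's/om1,om2,om3/${new_order}/' '$conf'" \
    && echo "Replaced OM order with ${new_order} in ${container}"
}
```

With this shape no `|| true` is needed: the HAProxy container is skipped explicitly, and a failing sed in an Ozone container still fails the step.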

dombizita added 2 commits April 1, 2026 13:33

HDDS-14897. Add multiple S3 gateways to the rolling-upgrade suite

023286a

Comment out rolling upgrade run

7e2c25e

dombizita requested review from adoroszlai and errose28 April 2, 2026 05:48

github-actions bot added the zdu label (Pull requests for Zero Downtime Upgrade (ZDU), https://issues.apache.org/jira/browse/HDDS-14496) Apr 2, 2026

adoroszlai reviewed Apr 2, 2026

Remove unnecessary s3g data directory

215088b

errose28 reviewed Apr 8, 2026

dombizita wants to merge 3 commits into apache:HDDS-14496-zdu from dombizita:HDDS-14897

3 participants
