Fix additional issues with atomic replication by dralley · Pull Request #7711 · pulp/pulpcore

dralley · 2026-05-13T04:55:43Z

In situations where an installation was replicating >1 external Pulp servers, the result was cascading failures as the replications would race and conflict with each other.

📜 Checklist

Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
A changelog entry or entries has been added for any significant changes
Follows the Pulp policy on AI Usage
(For new features) - User documentation and test coverage has been added

See: Pull Request Walkthrough

dralley · 2026-05-13T05:28:58Z

I recommend looking at the commits individually, and probably in 1, 3, 2 order

mdellweg · 2026-05-13T09:47:25Z



+def distros_lock_uri(domain_id):
+    return f"pdrn:{domain_id}:distributions"


So this was never an uri to begin with? This function is solely used for locking, the domains distribution url space?
Is it the same lock used for updating distributions directly?

I just noticed that similar lock strings were being created in multiple places, so I factored it into a helper, but the value should actually be the same - I believe self.domain.pulp_id == server.pulp_domain_id in this context.

mdellweg · 2026-05-13T09:53:50Z

+                raise serializers.ValidationError(
+                    "Multiple upstream-pulp configurations exist in this domain. "
+                    "Set the 'policy' to 'labeled' on all of them to avoid conflicts, "
+                    "or remove the extras."
+                )


This is rather opinionated. Are we sure we can take that "feature" away?

Are there possible usecases for having two conflicting upstream configurations in one domain?

Something like Blue-Green upstream where you sync them alternatingly.

Having a hot spare upstream in case the main one is out of reach.

I'm not 100% sure, I will ask them in our call today.

I feel like having multiple potentially overlapping Upstreams replicating into the same domain is probably a very bad idea. I don't see an obvious use case for it. With LABELED it would be allowed but at least you would get a better error in cases where they overlap instead of just wreaking havoc

edit: actually there is no error, they are skipped silently by the 2nd replication. That's probably not ideal.

@gerrod3 should we be raising an error if a 2nd UpstreamPulp tries to touch objects created by a different UpstreamPulp with LABELED policy?

Maybe, my understanding was that the second upstream would become the new owner of that object as they replace it. This behavior might not be expected for users of the labeled policy. If we are changing the upstream existence policy then we probably should also change this overlap behavior for labeled.

The description of "labeled" is

Replication only manages objects that it created in a previous run. Objects created by replication
are tagged with an UpstreamPulp label linking them to the specific Upstream Pulp object. Manually
created local objects with the same content types are left untouched, even if they share names with
upstream distributions. If a replicated object has its UpstreamPulp label removed, replication
will no longer manage or delete it.

Which implies no replacing

mdellweg · 2026-05-13T09:57:34Z

+    failed_tasks = task_group.tasks.exclude(pk=task.pk).exclude(state=TASK_STATES.COMPLETED)
+    has_failures = failed_tasks.exists()
+
+    # Best-effort: update all distributions we can, even if some subtasks failed.


Isn't that the antithesis to atomicity?
I thought the idea is the domain (maybe scoped by labels...) looks exactly like upstream today, or nothing changes.

The main idea was just that successful updates should be synchronized rather than randomly applied over the course of an hour+. That was the primary original complaint

dralley · 2026-05-13T18:03:23Z

After discussion, commits 2 and 3 are likely not suitable for merge in their current state.

…reporting Replication could fail when a stale distribution in the replica domain had a base_path that conflicted with an upstream distribution. This happened because distribution create subtasks were dispatched before remove_missing, so the stale distribution's base_path hadn't been freed yet. Fix by splitting the per-replicator loop into two passes: the first pass creates remotes, repositories, and dispatches syncs, then remove_missing runs to clean up stale distributions, and only then does the second pass dispatch distribution create/update subtasks. Also improves finalize_replication error reporting to include the name and error description of each failed subtask. Known limitation: when a distribution is renamed upstream (same base_path, different name), this approach deletes the old distribution and creates a new one, causing a brief window where content at that base_path is not served. A follow-up commit will address this by reusing existing distributions matched by base_path instead of deleting and recreating them. Closes: pulp#7614 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When a distribution is renamed upstream (same base_path, different name), the previous commit would delete the old distribution and create a new one, causing a window where content at that base_path is not served. Instead, create_or_update_distribution now falls back to a base_path lookup when the name lookup misses. If a managed distribution with the same base_path exists, it is updated in-place (including the name change) rather than deleted and recreated. This preserves content continuity at that base_path. The old name is returned to the caller and added to distro_names so that remove_missing — which has already been dispatched but not yet executed — does not delete the distribution that is being renamed. This coordination is implicit: remove_missing doesn't know why the old name is in the skip list. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the implicit distro_names coordination from the previous commit with a synchronous rename. When a distribution is found by base_path instead of by name, the name is updated immediately via distro.save(update_fields=["name"]) in the main task, before remove_missing queries the DB. This eliminates the need for the caller to propagate old names into distro_names. The synchronous rename is safe because no Django lifecycle hook fires on name changes (hooks watch base_path, publication, repository, repository_version — not name), and the DB enforces unique_together(name, pulp_domain) directly. This follows the same pattern as create_or_update_remote and create_or_update_repository, which already use synchronous .save() for creates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions Bot added multi-commit no-issue labels May 13, 2026

dralley requested review from gerrod3 and mdellweg May 13, 2026 04:56

dralley force-pushed the atomic-replication branch from e157ff7 to 37f4ba4 Compare May 13, 2026 04:57

mdellweg reviewed May 13, 2026

View reviewed changes

mdellweg previously approved these changes May 13, 2026

View reviewed changes

gerrod3 previously approved these changes May 13, 2026

View reviewed changes

dralley marked this pull request as draft May 13, 2026 18:02

dralley requested review from mdellweg and removed request for mdellweg May 13, 2026 18:02

dralley and others added 3 commits May 27, 2026 01:36

dralley dismissed stale reviews from gerrod3 and mdellweg via 3ae1a05 May 27, 2026 05:50

dralley force-pushed the atomic-replication branch from 37f4ba4 to 3ae1a05 Compare May 27, 2026 05:50

github-actions Bot added the no-changelog label May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix additional issues with atomic replication#7711

Fix additional issues with atomic replication#7711
dralley wants to merge 3 commits into
pulp:mainfrom
dralley:atomic-replication

dralley commented May 13, 2026

Uh oh!

dralley commented May 13, 2026

Uh oh!

mdellweg May 13, 2026

Uh oh!

dralley May 13, 2026

Uh oh!

mdellweg May 13, 2026

Uh oh!

dralley May 13, 2026

Uh oh!

dralley May 13, 2026 •

edited

Loading

Uh oh!

gerrod3 May 13, 2026

Uh oh!

dralley May 13, 2026

Uh oh!

mdellweg May 13, 2026

Uh oh!

dralley May 13, 2026

Uh oh!

dralley commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants



		def distros_lock_uri(domain_id):
		return f"pdrn:{domain_id}:distributions"

Conversation

dralley commented May 13, 2026

📜 Checklist

Uh oh!

dralley commented May 13, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dralley May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dralley commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dralley May 13, 2026 •

edited

Loading