backfill: accept revision arguments#2070
backfill: accept revision arguments#2070derrickstolee wants to merge 6 commits intogitgitgadget:masterfrom
Conversation
The REV_INFO_INIT macro includes a use of the DEFAULT_ABBREV macro, which is defined in object-name.h. Include it in revision.h so consumers of REV_INFO_INIT do not need to include this hidden dependency. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Prepare the test infrastructure for upcoming changes that teach 'git
backfill' to accept revision arguments and pathspecs.
Add test_tick before each commit in the setup loop so that commit dates
are deterministic. This enables reliable testing with '--since'.
Rename the 'd/e/' directory to 'd/f/' so that the prefix 'd/f' is
ambiguous with the files 'd/file.*.txt'. This exercises the subtlety
in prefix pathspec matching that will be added in a later commit.
Create a branched version of the test repository (src-revs) with:
- A 'side' branch merged into main, adding s/file.{1,2}.txt with
two versions (4 new blobs, 52 total from main HEAD).
- An unmerged 'other' branch adding o/file.{1,2}.txt (2 more blobs,
54 total reachable from --all).
This structure makes --all, --first-parent, and --since produce
meaningfully different results when used with 'git backfill'.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
6bb9bb6 to
beb1c92
Compare
|
/submit |
|
Submitted as pull.2070.git.1773707361.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
/allow |
|
I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345 |
|
Error: User jayesh0104 is not yet permitted to use GitGitGadget |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> The git backfill command assists in downloading missing blobs for blobless
> partial clones. However, its current version lacks some valuable
> functionality. It currently:
>
> 1. Only walks commits reachable from HEAD.
> 2. It walks all reachable commits to the full history.
> 3. It can focus on the current sparse-checkout definition, but otherwise it
> doesn't focus on a given pathspec.
>
> All of these are being updated by this patch series, which allows rev-list
> options to impact the path-walk. These include:
>
> 1. Specifying a given refspec, including --all.
Makes sense. You can only be on a single branch at a time, but may
want to work on multiple topics in reasonably quick succession in a
single repository. Being able to prepare enough material to go back
to when working on whichever topic in a single backfill invocation
would be a welcome addition.
> 2. Modifying the commit walk, including --first-parent, commit ranges, or
> recency using --since.
> 3. Modifying the set of paths to download using pathspecs.
Both are good mechanisms to express which subset of history you will
be working on.
> One particularly valuable situation here is that now a user can run git
> backfill -- <path> to download all versions of a specific file or a specific
> directory, accelerating history queries within that path without downloading
> more than necessary. This can accelerate git blame or git log -L for these
> paths, where normally those commands download missing blobs one-by-one
> during its diff algorithms.
Yup. Even if your project is a huge monorepo that contains all, you
do not necessarily have to look at everything the organization has
all the time. "git blame -C -C -C" would of course not work in such
an environment (would it end up on-demand lazy fetch these blobs, or
are there ways to say "I know the object store of my repository is
only sparsely populated, and I do not want you to on-demand download
the missing blobs---do your best to work with only what is already
available?), but that's a tradeoff a monorepo makes.
|
| @@ -4,6 +4,7 @@ | |||
| #include "commit.h" | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> The REV_INFO_INIT macro includes a use of the DEFAULT_ABBREV macro, which is
> defined in object-name.h. Include it in revision.h so consumers of
> REV_INFO_INIT do not need to include this hidden dependency.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> revision.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/revision.h b/revision.h
> index b36acfc2d9..18c9bbd822 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -4,6 +4,7 @@
> #include "commit.h"
> #include "grep.h"
> #include "notes.h"
> +#include "object-name.h"
> #include "oidset.h"
> #include "pretty.h"
> #include "diff.h"
OK. Other symbols REV_INFO_INIT needs are REV_SORT_IN_GRAPH_ORDER
(in <commit.h>), CMIT_FMT_DEFAULT (in <pretty.h>), and STRVEC_INIT
(in <strvec.h>), and all three are already included there.
Makes sense.| @@ -63,9 +63,12 @@ OPTIONS | |||
| current sparse-checkout. If the sparse-checkout feature is enabled, | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> - repo_init_revisions(ctx->repo, &revs, "");
> - handle_revision_arg("HEAD", &revs, 0, 0);
So we used to "cheat" and did an initialization without even knowing
in which directory we were started ...
> + /* Walk from HEAD if otherwise unspecified. */
> + if (!ctx->revs.pending.nr)
> + handle_revision_arg("HEAD", &ctx->revs, 0, 0);
... but by initializing the revs correctly in the caller, we would
be correcting it. Looking good.
> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
> builtin_backfill_usage, options);
>
> argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
> - 0);
> + PARSE_OPT_KEEP_UNKNOWN_OPT |
> + PARSE_OPT_KEEP_ARGV0 |
> + PARSE_OPT_KEEP_DASHDASH);
> +
> + repo_init_revisions(repo, &ctx.revs, prefix);
> + argc = setup_revisions(argc, argv, &ctx.revs, NULL);
OK.| @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx, | |||
| match != MATCHED) | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> The previous change allowed specifying revision arguments over the 'git
> backfill' command-line. This created the opportunity for pathspecs that
> specify a smaller set of starting commits, but otherwise did not restrict
> the blob paths that were downloaded.
"pathspecs that specify a smaller set of starting commits" is
puzzling, as starting commits would be coming from the revision
arguments. "opportunity for pathspec to further filter commits
to those that touch only the matching paths...", or something?
> Update the path-walk API to accept certain kinds of pathspecs and to
> silently ignore anything too complex.
Hmph, "silently ignore", instead of "no, you cannot use that! and
die", or at least "sorry, I cannot do that, so the result may not be
what you wanted, you've been warned"?
> The current behavior focuses on
> pathspecs that match paths exactly. This includes exact filenames,
> including directory names as prefixes. Pathspecs containing wildcards
> or magic are cleared so the path walk downloads all blobs, as before.
Ah, "we punt and lift the limitation to grab everything, so at least
everything you wanted to have will become available to you, even
though we may download more than what you asked"? OK, users would
survive that, and as we improve the pathspec support, the user
experience would only improve. OK.
There was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/17/2026 6:10 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The previous change allowed specifying revision arguments over the 'git
>> backfill' command-line. This created the opportunity for pathspecs that
>> specify a smaller set of starting commits, but otherwise did not restrict
>> the blob paths that were downloaded.
>
> "pathspecs that specify a smaller set of starting commits" is
> puzzling, as starting commits would be coming from the revision
> arguments. "opportunity for pathspec to further filter commits
> to those that touch only the matching paths...", or something?
You're right. I'm using "starting commits" incorrectly. My view was too
focused on how the path-walk API starts from the commits output by the
revision walk to get a list of root trees and then walks by path from
that point.
I'll reword to make this more clear.
>> Update the path-walk API to accept certain kinds of pathspecs and to
>> silently ignore anything too complex.
>
> Hmph, "silently ignore", instead of "no, you cannot use that! and
> die", or at least "sorry, I cannot do that, so the result may not be
> what you wanted, you've been warned"?
The behavior when silently ignoring is to over-download. The revision
walk still filters commits, but the path-walk then walks paths beyond
that pathspec. This will be fixed in the next commit, so adding an
error case didn't seem worth it. I'll do a better job foreshadowing.
>> The current behavior focuses on
>> pathspecs that match paths exactly. This includes exact filenames,
>> including directory names as prefixes. Pathspecs containing wildcards
>> or magic are cleared so the path walk downloads all blobs, as before.
>
> Ah, "we punt and lift the limitation to grab everything, so at least
> everything you wanted to have will become available to you, even
> though we may download more than what you asked"? OK, users would
> survive that, and as we improve the pathspec support, the user
> experience would only improve. OK.
Exactly. I can word things better.
Thanks,
-Stolee
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Wed, Mar 18, 2026 at 09:15:00AM -0400, Derrick Stolee wrote:
> On 3/17/2026 6:10 PM, Junio C Hamano wrote:
> >> Update the path-walk API to accept certain kinds of pathspecs and to
> >> silently ignore anything too complex.
> >
> > Hmph, "silently ignore", instead of "no, you cannot use that! and
> > die", or at least "sorry, I cannot do that, so the result may not be
> > what you wanted, you've been warned"?
>
> The behavior when silently ignoring is to over-download. The revision
> walk still filters commits, but the path-walk then walks paths beyond
> that pathspec. This will be fixed in the next commit, so adding an
> error case didn't seem worth it. I'll do a better job foreshadowing.
I guess this is a fine tradeoff when documented properly. But I think in
that case we should make very clear that this behaviour may change in
the future if find a way to efficiently limit the pathwalk, too.
PatrickThere was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Wed, Mar 18, 2026 at 09:15:00AM -0400, Derrick Stolee wrote:
> On 3/17/2026 6:10 PM, Junio C Hamano wrote:
> >> Update the path-walk API to accept certain kinds of pathspecs and to
> >> silently ignore anything too complex.
> >
> > Hmph, "silently ignore", instead of "no, you cannot use that! and
> > die", or at least "sorry, I cannot do that, so the result may not be
> > what you wanted, you've been warned"?
>
> The behavior when silently ignoring is to over-download. The revision
> walk still filters commits, but the path-walk then walks paths beyond
> that pathspec. This will be fixed in the next commit, so adding an
> error case didn't seem worth it. I'll do a better job foreshadowing.
I guess this is a fine tradeoff when documented properly. But I think in
that case we should make very clear that this behaviour may change in
the future if find a way to efficiently limit the pathwalk, too.
Patrick| @@ -62,6 +62,8 @@ struct path_walk_context { | |||
| */ | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> Previously, walk_objects_by_path() silently ignored pathspecs containing
> wildcards or magic by clearing them. This caused all blobs to be
> downloaded regardless of the given pathspec. Wildcard pathspecs like
> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
> during 'git backfill').
>
> Support wildcard pathspecs by making three changes:
>
> 1. Add an 'exact_pathspecs' flag to path_walk_context. When the
> pathspec has no wildcards or magic, set this flag and use the
> existing fast-path prefix matching in add_tree_entries(). When
> wildcards are present, skip that block since prefix matching
> cannot handle glob patterns.
>
> 2. Disable revision-level commit pruning (revs->prune = 0) for
> wildcard pathspecs. The revision walk uses the pathspec to filter
> commits via TREESAME detection. For exact prefix pathspecs this
> works well, but wildcard pathspecs may fail to match through
> TREESAME because fnmatch with WM_PATHNAME does not cross directory
> boundaries. Disabling pruning ensures all commits are visited and
> their trees are available for the path-walk to filter.
Hmph, I wonder how significant an impact does it have on the
performance that we have to disable pruning here. With the bog
standard tree traversal, wouldn't tree_entry_interesting() already
be capable of doing this, even with fnmatch / WM_PATHNAME ?
> 3. Add a match_pathspec() check in walk_path() to filter out blobs
> whose full path does not match the pathspec. This provides the
> actual blob-level filtering for wildcard pathspecs.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The latter person cannot sign DCO or vouch for the origin of what
they have written in this patch, can they?
> ---
> path-walk.c | 22 ++++++++++++++--------
> t/t5620-backfill.sh | 7 +++----
> 2 files changed, 17 insertions(+), 12 deletions(-)There was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/17/2026 6:19 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> Previously, walk_objects_by_path() silently ignored pathspecs containing
>> wildcards or magic by clearing them. This caused all blobs to be
>> downloaded regardless of the given pathspec. Wildcard pathspecs like
>> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
>> during 'git backfill').
>>
>> Support wildcard pathspecs by making three changes:
>>
>> 1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>> pathspec has no wildcards or magic, set this flag and use the
>> existing fast-path prefix matching in add_tree_entries(). When
>> wildcards are present, skip that block since prefix matching
>> cannot handle glob patterns.
>>
>> 2. Disable revision-level commit pruning (revs->prune = 0) for
>> wildcard pathspecs. The revision walk uses the pathspec to filter
>> commits via TREESAME detection. For exact prefix pathspecs this
>> works well, but wildcard pathspecs may fail to match through
>> TREESAME because fnmatch with WM_PATHNAME does not cross directory
>> boundaries. Disabling pruning ensures all commits are visited and
>> their trees are available for the path-walk to filter.
>
> Hmph, I wonder how significant an impact does it have on the
> performance that we have to disable pruning here. With the bog
> standard tree traversal, wouldn't tree_entry_interesting() already
> be capable of doing this, even with fnmatch / WM_PATHNAME ?
I will explore what's possible here and see what I can do.
>> 3. Add a match_pathspec() check in walk_path() to filter out blobs
>> whose full path does not match the pathspec. This provides the
>> actual blob-level filtering for wildcard pathspecs.
>>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
>
> The latter person cannot sign DCO or vouch for the origin of what
> they have written in this patch, can they?
No they cannot. Sorry for this error.
Thanks,
-Stolee
There was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/18/26 9:16 AM, Derrick Stolee wrote:
> On 3/17/2026 6:19 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Derrick Stolee <stolee@gmail.com>
>>>
>>> Previously, walk_objects_by_path() silently ignored pathspecs containing
>>> wildcards or magic by clearing them. This caused all blobs to be
>>> downloaded regardless of the given pathspec. Wildcard pathspecs like
>>> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
>>> during 'git backfill').
>>>
>>> Support wildcard pathspecs by making three changes:
>>>
>>> 1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>>> pathspec has no wildcards or magic, set this flag and use the
>>> existing fast-path prefix matching in add_tree_entries(). When
>>> wildcards are present, skip that block since prefix matching
>>> cannot handle glob patterns.
>>>
>>> 2. Disable revision-level commit pruning (revs->prune = 0) for
>>> wildcard pathspecs. The revision walk uses the pathspec to filter
>>> commits via TREESAME detection. For exact prefix pathspecs this
>>> works well, but wildcard pathspecs may fail to match through
>>> TREESAME because fnmatch with WM_PATHNAME does not cross directory
>>> boundaries. Disabling pruning ensures all commits are visited and
>>> their trees are available for the path-walk to filter.
>>
>> Hmph, I wonder how significant an impact does it have on the
>> performance that we have to disable pruning here. With the bog
>> standard tree traversal, wouldn't tree_entry_interesting() already
>> be capable of doing this, even with fnmatch / WM_PATHNAME ?
> > I will explore what's possible here and see what I can do.
I must have needed the 'revs->prune = 0' at some point during development
and left it even though it isn't actually necessary. Leaving it
implicitly at '1' should indeed be faster due to traversing fewer commits
and parsing fewer trees while still reaching all necessary blobs.
Only changes 1 and 3 are necessary.
Thanks,
-StoleeThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/18/26 9:16 AM, Derrick Stolee wrote:
> On 3/17/2026 6:19 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Derrick Stolee <stolee@gmail.com>
>>>
>>> Previously, walk_objects_by_path() silently ignored pathspecs containing
>>> wildcards or magic by clearing them. This caused all blobs to be
>>> downloaded regardless of the given pathspec. Wildcard pathspecs like
>>> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
>>> during 'git backfill').
>>>
>>> Support wildcard pathspecs by making three changes:
>>>
>>> 1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>>> pathspec has no wildcards or magic, set this flag and use the
>>> existing fast-path prefix matching in add_tree_entries(). When
>>> wildcards are present, skip that block since prefix matching
>>> cannot handle glob patterns.
>>>
>>> 2. Disable revision-level commit pruning (revs->prune = 0) for
>>> wildcard pathspecs. The revision walk uses the pathspec to filter
>>> commits via TREESAME detection. For exact prefix pathspecs this
>>> works well, but wildcard pathspecs may fail to match through
>>> TREESAME because fnmatch with WM_PATHNAME does not cross directory
>>> boundaries. Disabling pruning ensures all commits are visited and
>>> their trees are available for the path-walk to filter.
>>
>> Hmph, I wonder how significant an impact does it have on the
>> performance that we have to disable pruning here. With the bog
>> standard tree traversal, wouldn't tree_entry_interesting() already
>> be capable of doing this, even with fnmatch / WM_PATHNAME ?
> > I will explore what's possible here and see what I can do.
I must have needed the 'revs->prune = 0' at some point during development
and left it even though it isn't actually necessary. Leaving it
implicitly at '1' should indeed be faster due to traversing fewer commits
and parsing fewer trees while still reaching all necessary blobs.
Only changes 1 and 3 are necessary.
Thanks,
-Stolee| @@ -63,9 +63,12 @@ OPTIONS | |||
| current sparse-checkout. If the sparse-checkout feature is enabled, | |||
There was a problem hiding this comment.
"Kristoffer Haugsbakk" wrote on the Git mailing list (how to reply to this email):
On Tue, Mar 17, 2026, at 01:29, Derrick Stolee via GitGitGadget wrote:
>[snip]
> diff --git a/Documentation/git-backfill.adoc b/Documentation/git-backfill.adoc
> index b8394dcf22..fdfe22d623 100644
> --- a/Documentation/git-backfill.adoc
> +++ b/Documentation/git-backfill.adoc
> @@ -63,9 +63,12 @@ OPTIONS
> current sparse-checkout. If the sparse-checkout feature is enabled,
> then `--sparse` is assumed and can be disabled with `--no-sparse`.
>
> +You may also specify the commit limiting options from linkgit:git-rev-list[1].
> +
> SEE ALSO
> --------
> linkgit:git-clone[1].
> +linkgit:git-rev-list[1].
Should there be a comma between these two?
>[snip]There was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/18/26 11:37 AM, Kristoffer Haugsbakk wrote:
> On Tue, Mar 17, 2026, at 01:29, Derrick Stolee via GitGitGadget wrote:
>> [snip]
>> diff --git a/Documentation/git-backfill.adoc b/Documentation/git-backfill.adoc
>> index b8394dcf22..fdfe22d623 100644
>> --- a/Documentation/git-backfill.adoc
>> +++ b/Documentation/git-backfill.adoc
>> @@ -63,9 +63,12 @@ OPTIONS
>> current sparse-checkout. If the sparse-checkout feature is enabled,
>> then `--sparse` is assumed and can be disabled with `--no-sparse`.
>>
>> +You may also specify the commit limiting options from linkgit:git-rev-list[1].
>> +
>> SEE ALSO
>> --------
>> linkgit:git-clone[1].
>> +linkgit:git-rev-list[1].
> > Should there be a comma between these two?
Good catch. Also there shouldn't be a hard stop, either.
Thanks,
-StoleeThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/18/26 11:37 AM, Kristoffer Haugsbakk wrote:
> On Tue, Mar 17, 2026, at 01:29, Derrick Stolee via GitGitGadget wrote:
>> [snip]
>> diff --git a/Documentation/git-backfill.adoc b/Documentation/git-backfill.adoc
>> index b8394dcf22..fdfe22d623 100644
>> --- a/Documentation/git-backfill.adoc
>> +++ b/Documentation/git-backfill.adoc
>> @@ -63,9 +63,12 @@ OPTIONS
>> current sparse-checkout. If the sparse-checkout feature is enabled,
>> then `--sparse` is assumed and can be disabled with `--no-sparse`.
>>
>> +You may also specify the commit limiting options from linkgit:git-rev-list[1].
>> +
>> SEE ALSO
>> --------
>> linkgit:git-clone[1].
>> +linkgit:git-rev-list[1].
> > Should there be a comma between these two?
Good catch. Also there shouldn't be a hard stop, either.
Thanks,
-Stolee|
User |
|
This branch is now known as |
|
This patch series was integrated into seen via git@b0faae0. |
|
This patch series was integrated into seen via git@2e7d77b. |
|
This patch series was integrated into seen via git@03fbc16. |
The existing implementation of 'git backfill' only includes downloading missing blobs reachable from HEAD. Advanced uses may desire more general commit limiting options, such as '--all' for all references, specifying a commit range via negative references, or specifying a recency of use such as with '--since=<date>'. All of these options are available if we use setup_revisions() to parse the unknown arguments with the revision machinery. This opens up a large number of possibilities, only a small set of which are tested here. For documentation, we avoid duplicating the option documentation and instead link to the documentation of 'git rev-list'. Note that these arguments currently allow specifying a pathspec, which modifies the commit history checks but does not limit the paths used in the backfill logic. This will be updated in a future change. Signed-off-by: Derrick Stolee <stolee@gmail.com>
beb1c92 to
1168edf
Compare
|
/submit |
|
Submitted as pull.2070.v2.git.1774266019.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
This patch series was integrated into seen via git@90589f0. |
|
@derrickstolee |
The previous change allowed specifying revision arguments over the 'git backfill' command-line. This created the opportunity for restricting the initial commit set by filtering the revision walk through a pathspec. Other than filtering the commit set (and thereby the root trees), this did not restrict the path-walk implementation of 'git backfill' and did not restrict the blobs that were downloaded to only those matching the pathspec. Update the path-walk API to accept certain kinds of pathspecs and to silently ignore anything too complex, for now. We will update this in the next change to properly restrict to even complex pathspecs. The current behavior focuses on pathspecs that match paths exactly. This includes exact filenames, including directory names as prefixes. Pathspecs containing wildcards or magic are cleared so the path walk downloads all blobs, as before. The reason for this restriction is to allow for a faster execution by pruning the path walk to only trees that could contribute towards one of those paths as a parent directory. The test directory 'd/f/' (next to 'd/file*.txt') was prepared in a previous commit to exercise the subtlety in prefix matching. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Previously, walk_objects_by_path() silently ignored pathspecs containing
wildcards or magic by clearing them. This caused all blobs to be
downloaded regardless of the given pathspec. Wildcard pathspecs like
"d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
during 'git backfill').
Support wildcard pathspecs by making two changes:
1. Add an 'exact_pathspecs' flag to path_walk_context. When the
pathspec has no wildcards or magic, set this flag and use the
existing fast-path prefix matching in add_tree_entries(). When
wildcards are present, skip that block since prefix matching
cannot handle glob patterns.
2. Add a match_pathspec() check in walk_path() to filter out blobs
whose full path does not match the pathspec. This provides the
actual blob-level filtering for wildcard pathspecs.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Before the recent changes to parse rev-list arguments inside of 'git backfill', the builtin would take arbitrary arguments without complaint (and ignore them). This was noticed and a patch was sent [1] which motivates this change. [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/ Note that the revision machinery can output an "ambiguous argument" warning if a value not starting with '--' is found and doesn't make sense as a reference or a pathspec. For unrecognized arguments starting with '--' we need to add logic into builtin/backfill.c to catch leftover arguments. Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com> Signed-off-by: Derrick Stolee <stolee@gmail.com>
9699650 to
b6423f9
Compare
|
This patch series was integrated into seen via git@a88badf. |
|
This patch series was integrated into seen via git@04eaf02. |
|
This patch series was integrated into seen via git@c743d8f. |
|
This patch series was integrated into seen via git@4e58217. |
|
This patch series was integrated into master via git@4e58217. |
|
This patch series was integrated into next via git@4e58217. |
|
Congratulations! 🎉 Your patch series was merged into upstream via 4e58217. Note: this pull request will show as "Closed" rather than "Merged" because the merge happened in the upstream repository, not on GitHub. This is expected — your contribution has been accepted! |
|
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Tue, Mar 17, 2026 at 12:29:16AM +0000, Derrick Stolee via GitGitGadget wrote:
> The git backfill command assists in downloading missing blobs for blobless
> partial clones. However, its current version lacks some valuable
> functionality. It currently:
>
> 1. Only walks commits reachable from HEAD.
> 2. It walks all reachable commits to the full history.
> 3. It can focus on the current sparse-checkout definition, but otherwise it
> doesn't focus on a given pathspec.
>
> All of these are being updated by this patch series, which allows rev-list
> options to impact the path-walk. These include:
>
> 1. Specifying a given refspec, including --all.
> 2. Modifying the commit walk, including --first-parent, commit ranges, or
> recency using --since.
> 3. Modifying the set of paths to download using pathspecs.
>
> One particularly valuable situation here is that now a user can run git
> backfill -- <path> to download all versions of a specific file or a specific
> directory, accelerating history queries within that path without downloading
> more than necessary. This can accelerate git blame or git log -L for these
> paths, where normally those commands download missing blobs one-by-one
> during its diff algorithms.
Nice.
I think especially blaming is a bit of a sore spot -- downloading blobs
one by one simply doesn't cut it there. I wonder whether we can easily
use the backfill mechanism to fetch blobs automatically in git-blame(1)
so that the user doesn't need to know about git-backfill(1) at all?
Patrick |
| @@ -63,9 +63,12 @@ OPTIONS | |||
| current sparse-checkout. If the sparse-checkout feature is enabled, | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Tue, Mar 17, 2026 at 12:29:19AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> The existing implementation of 'git backfill' only includes downloading
> missing blobs reachable from HEAD. Advanced uses may desire more general
> commit limiting options, such as '--all' for all references, specifying a
> commit range via negative references, or specifying a recency of use such as
> with '--since=<date>'.
>
> All of these options are available if we use setup_revisions() to parse the
> unknown arguments with the revision machinery. This opens up a large number
> of possibilities, only a small set of which are tested here.
>
> For documentation, we avoid duplicating the option documentation and instead
> link to the documentation of 'git rev-list'.
>
> Note that these arguments currently allow specifying a pathspec, which
> modifies the commit history checks but does not limit the paths used in the
> backfill logic. This will be updated in a future change.
Makes me wonder whether reversing the order would have avoided this
slight awkwardness. But let's just stick with the current order, the end
result would be the same anyway.
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> Documentation/git-backfill.adoc | 3 +
> builtin/backfill.c | 19 ++--
> t/t5620-backfill.sh | 156 ++++++++++++++++++++++++++++++++
> 3 files changed, 172 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e80fc1b694..1b5595b27c 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -92,13 +92,14 @@ static int do_backfill(struct backfill_context *ctx)
> }
> }
>
> - repo_init_revisions(ctx->repo, &revs, "");
> - handle_revision_arg("HEAD", &revs, 0, 0);
> + /* Walk from HEAD if otherwise unspecified. */
> + if (!ctx->revs.pending.nr)
> + handle_revision_arg("HEAD", &ctx->revs, 0, 0);
Can we use `add_head_to_pending(&ctx->revs)` instead?
PatrickThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/19/26 5:54 AM, Patrick Steinhardt wrote:
> On Tue, Mar 17, 2026 at 12:29:19AM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The existing implementation of 'git backfill' only includes downloading
>> missing blobs reachable from HEAD. Advanced uses may desire more general
>> commit limiting options, such as '--all' for all references, specifying a
>> commit range via negative references, or specifying a recency of use such as
>> with '--since=<date>'.
>>
>> All of these options are available if we use setup_revisions() to parse the
>> unknown arguments with the revision machinery. This opens up a large number
>> of possibilities, only a small set of which are tested here.
>>
>> For documentation, we avoid duplicating the option documentation and instead
>> link to the documentation of 'git rev-list'.
>>
>> Note that these arguments currently allow specifying a pathspec, which
>> modifies the commit history checks but does not limit the paths used in the
>> backfill logic. This will be updated in a future change.
> > Makes me wonder whether reversing the order would have avoided this
> slight awkwardness. But let's just stick with the current order, the end
> result would be the same anyway.
True, we could have added the pathspec logic first, but we wouldn't be able
to test it right away because the parsing comes through the rev-list.
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>> Documentation/git-backfill.adoc | 3 +
>> builtin/backfill.c | 19 ++--
>> t/t5620-backfill.sh | 156 ++++++++++++++++++++++++++++++++
>> 3 files changed, 172 insertions(+), 6 deletions(-)
>>
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e80fc1b694..1b5595b27c 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -92,13 +92,14 @@ static int do_backfill(struct backfill_context *ctx)
>> }
>> }
>> >> - repo_init_revisions(ctx->repo, &revs, "");
>> - handle_revision_arg("HEAD", &revs, 0, 0);
>> + /* Walk from HEAD if otherwise unspecified. */
>> + if (!ctx->revs.pending.nr)
>> + handle_revision_arg("HEAD", &ctx->revs, 0, 0);
> > Can we use `add_head_to_pending(&ctx->revs)` instead?
Nice. We absolutely can and should.
Thanks,
-Stolee
| @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx, | |||
| match != MATCHED) | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Tue, Mar 17, 2026 at 12:29:20AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..e1ad4b0208 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -481,6 +524,18 @@ int walk_objects_by_path(struct path_walk_info *info)
> if (info->tags)
> info->revs->tag_objects = 1;
>
> + if (ctx.revs->prune_data.nr) {
> + /*
> + * Only exact prefix pathspecs are currently supported.
> + * Clear any wildcard or magic pathspecs to avoid
> + * incorrect prefix matching.
> + */
> + struct pathspec *pd = &ctx.revs->prune_data;
> +
> + if (pd->has_wildcard || pd->magic)
> + pd->nr = 0;
> + }
Huh, curious. Won't this cause a leak? I guess we should rather use
`clear_pathspec()` here.
Also shows that this path is missing test coverage.
Patrick| @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx, | |||
| match != MATCHED) | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Tue, Mar 17, 2026 at 12:29:20AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..e1ad4b0208 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
> match != MATCHED)
> continue;
> }
> + if (ctx->revs->prune_data.nr) {
> + struct pathspec *pd = &ctx->revs->prune_data;
> + bool found = false;
> +
> + for (int i = 0; i < pd->nr; i++) {
> + struct pathspec_item *item = &pd->items[i];
> +
> + /*
> + * Is this path a parent directory of
> + * the pathspec item?
> + */
> + if (path.len < (size_t)item->len &&
> + !strncmp(path.buf, item->match, path.len) &&
> + item->match[path.len - 1] == '/') {
> + found = true;
> + break;
> + }
> +
> + /*
> + * Or, is the pathspec an exact match?
> + */
> + if (path.len == (size_t)item->len &&
> + !strcmp(path.buf, item->match)) {
> + found = true;
> + break;
> + }
> +
> + /*
> + * Or, is the pathspec a directory prefix
> + * match?
> + */
> + if (path.len > (size_t)item->len &&
> + !strncmp(path.buf, item->match, item->len) &&
> + path.buf[item->len] == '/') {
> + found = true;
> + break;
> + }
Ah, one more thing: we could expose `dir_prefix()` from "path.c" and
reuse it here.
PatrickThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/19/26 6:15 AM, Patrick Steinhardt wrote:
> Ah, one more thing: we could expose `dir_prefix()` from "path.c" and
> reuse it here.
Good idea. This becomes
/*
* Continue if either is a directory prefix
* of the other.
*/
if (dir_prefix(path.buf, item->match) ||
dir_prefix(item->match, path.buf)) {
found = true;
break;
}
With the idea that we need to walk the parents of each prefix in
addition to walking all of their children.
Thanks,
-Stolee
|
Derrick Stolee wrote on the Git mailing list (how to reply to this email): On 3/19/2026 5:54 AM, Patrick Steinhardt wrote:
> I think especially blaming is a bit of a sore spot -- downloading blobs
> one by one simply doesn't cut it there. I wonder whether we can easily
> use the backfill mechanism to fetch blobs automatically in git-blame(1)
> so that the user doesn't need to know about git-backfill(1) at all?
I've thought about this a bit, and I'm not sure that we want to run
'git backfill' directly. Instead, it would be nice if we did a "staged"
algorithm for 'git blame':
1. Walk commits according to the pathspec to collect the commits that
changed the path.
2. Collect the list of blob OIDs that will be needed for computing diffs
for the line-tracking algorithm.
3. In batches, download groups of missing blobs and then process them
for line-tracking diffs. (Stop if all lines are blamed; continue to
next batch if more lines are needed.)
This would be a significant rewrite of the blame algorithm, though. I
briefly considered this approach about a year ago and decided it would
be easier to start with 'git backfill' and see whether that satisfies
most needs.
The biggest reason to maybe avoid 'git backfill HEAD -- <path>' before
_every_ blame operation is that this will add overhead on repeated
calls that may be obnoxious in its own way. Maybe doing an opt-in
'git blame --backfill <path>' would make it easier for users to opt-in
when they want to.
Thanks,
-Stolee
|
|
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Thu, Mar 19, 2026 at 08:59:01AM -0400, Derrick Stolee wrote:
> On 3/19/2026 5:54 AM, Patrick Steinhardt wrote:
> > I think especially blaming is a bit of a sore spot -- downloading blobs
> > one by one simply doesn't cut it there. I wonder whether we can easily
> > use the backfill mechanism to fetch blobs automatically in git-blame(1)
> > so that the user doesn't need to know about git-backfill(1) at all?
>
> I've thought about this a bit, and I'm not sure that we want to run
> 'git backfill' directly. Instead, it would be nice if we did a "staged"
> algorithm for 'git blame':
>
> 1. Walk commits according to the pathspec to collect the commits that
> changed the path.
>
> 2. Collect the list of blob OIDs that will be needed for computing diffs
> for the line-tracking algorithm.
>
> 3. In batches, download groups of missing blobs and then process them
> for line-tracking diffs. (Stop if all lines are blamed; continue to
> next batch if more lines are needed.)
>
> This would be a significant rewrite of the blame algorithm, though. I
> briefly considered this approach about a year ago and decided it would
> be easier to start with 'git backfill' and see whether that satisfies
> most needs.
>
> The biggest reason to maybe avoid 'git backfill HEAD -- <path>' before
> _every_ blame operation is that this will add overhead on repeated
> calls that may be obnoxious in its own way. Maybe doing an opt-in
> 'git blame --backfill <path>' would make it easier for users to opt-in
> when they want to.
That's fair. I fully agree that just blindly doing this would be likely
be inefficient. Ideally, the batching logic would only kick in whenever
we see a missing object.
Anyway, this definitely doesn't have to be part of this series, I was
mostly wondering how hard it is to do.
Thanks!
Patrick |
| @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME | |||
|
|
|||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> Before the recent changes to parse rev-list arguments inside of 'git
> backfill', the builtin would take arbitrary arguments without complaint (and
> ignore them). This was noticed and a patch was sent [1] which motivates this
> change to encode this behavior in test.
>
> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>
> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> t/t5620-backfill.sh | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
> index c6f54ee91c..85740f1f13 100755
> --- a/t/t5620-backfill.sh
> +++ b/t/t5620-backfill.sh
> @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>
> . ./test-lib.sh
>
> +test_expect_success 'backfill rejects unexpected arguments' '
> + test_must_fail git backfill unexpected-arg 2>err &&
> + test_grep "ambiguous argument .*unexpected-arg" err &&
> +
> + test_must_fail git backfill --all --firt-parent unexpected-arg 2>err &&
> + test_grep "ambiguous argument .*unexpected-arg" err
> +'
Hmph, I would have expected that an earlier --firt-parent on the
command line would trigger "unknown option" instead.
Having said that, if the code lets the setup_revisions() parse the
command line, the usual "unless disambiguated with a double-dash
'--', stop at the first non-revision and take everything as paths
but for safety all of them must refer to an existing path in the
working tree" behaviour should trigger, and it is not specific to
"backfill", and may already be tested centrally (if not, I do not
object to such a new set of tests).
For any cmd that take revisions and pathspec (e.g., log, rev-list,
grep) these should hold true:
$ git $cmd [<options>]... Makefile HEAD
Without disambiguation the command should say "Ah, Makefile
is not a revision, so we will see no more revisions, and
everything, including the current one we are looking at, must be
an existing path on the working tree", and barfs on HEAD that
does not exist as a file/directory.
$ git $cmd [<options>]... Makefile -- HEAD
With disambiguation, the command should verify everything before
the double-dash to be a rev, and barf that Makefile is not a
rev.
$ git $cmd [<options>]... -- Makefile HEAD
With disambiguation, the command should take everything after
the double-dash to be a pathspec element without barfing. After
all, it may be referring to a path that used to exist in some
revision the command will look at.
Thanks.
There was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/23/2026 11:29 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> Before the recent changes to parse rev-list arguments inside of 'git
>> backfill', the builtin would take arbitrary arguments without complaint (and
>> ignore them). This was noticed and a patch was sent [1] which motivates this
>> change to encode this behavior in test.
>>
>> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>>
>> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>> t/t5620-backfill.sh | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
>> index c6f54ee91c..85740f1f13 100755
>> --- a/t/t5620-backfill.sh
>> +++ b/t/t5620-backfill.sh
>> @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>>
>> . ./test-lib.sh
>>
>> +test_expect_success 'backfill rejects unexpected arguments' '
>> + test_must_fail git backfill unexpected-arg 2>err &&
>> + test_grep "ambiguous argument .*unexpected-arg" err &&
>> +
>> + test_must_fail git backfill --all --firt-parent unexpected-arg 2>err &&
>> + test_grep "ambiguous argument .*unexpected-arg" err
>> +'
>
> Hmph, I would have expected that an earlier --firt-parent on the
> command line would trigger "unknown option" instead.
Interesting that my mistype has demonstrated an interesting
behavior here. It turns out that random options starting with
'--' are accepted here, including --unexpected-arg.
This means that we actually have room here for some improvement!
I'll see what can be done to make even these arguments be seen
as failures.
> Having said that, if the code lets the setup_revisions() parse the
> command line, the usual "unless disambiguated with a double-dash
> '--', stop at the first non-revision and take everything as paths
> but for safety all of them must refer to an existing path in the
> working tree" behaviour should trigger, and it is not specific to
> "backfill", and may already be tested centrally (if not, I do not
> object to such a new set of tests).
Thanks,
-Stolee
| @@ -63,9 +63,12 @@ OPTIONS | |||
| current sparse-checkout. If the sparse-checkout feature is enabled, | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Mon, Mar 23, 2026 at 11:40:16AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e80fc1b694..90c9d84793 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
> builtin_backfill_usage, options);
>
> argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
> - 0);
> + PARSE_OPT_KEEP_UNKNOWN_OPT |
> + PARSE_OPT_KEEP_ARGV0 |
> + PARSE_OPT_KEEP_DASHDASH);
> +
> + repo_init_revisions(repo, &ctx.revs, prefix);
> + argc = setup_revisions(argc, argv, &ctx.revs, NULL);
We should probably die here in case we still have unknown arguments.
PatrickThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/24/2026 3:59 AM, Patrick Steinhardt wrote:
> On Mon, Mar 23, 2026 at 11:40:16AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e80fc1b694..90c9d84793 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>> builtin_backfill_usage, options);
>>
>> argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
>> - 0);
>> + PARSE_OPT_KEEP_UNKNOWN_OPT |
>> + PARSE_OPT_KEEP_ARGV0 |
>> + PARSE_OPT_KEEP_DASHDASH);
>> +
>> + repo_init_revisions(repo, &ctx.revs, prefix);
>> + argc = setup_revisions(argc, argv, &ctx.revs, NULL);
>
> We should probably die here in case we still have unknown arguments.
That is indeed the fix for the bad test in patch 6. I'll make the
necessary update in v3's patch 6 along with the test for it.
Thanks,
-Stolee
| @@ -11,6 +11,7 @@ | |||
| #include "list-objects.h" | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Mon, Mar 23, 2026 at 11:40:17AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..0d640e2f24 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -206,6 +207,34 @@ static int add_tree_entries(struct path_walk_context *ctx,
> match != MATCHED)
> continue;
> }
> + if (ctx->revs->prune_data.nr) {
> + struct pathspec *pd = &ctx->revs->prune_data;
> + bool found = false;
> +
> + /* remove '/' for these checks. */
> + path.buf[path.len - 1] = 0;
Hm. Is this _always_ safe to do? We add the directory separator a few
lines further up, but only in the case where `type == OBJ_TREE`. So in
reverse this may mean that there are cases where we don't have a
trailing '/'.
Maybe we should instead:
did_strip_suffix = strbuf_strip_suffix(path, "/");
...
if (did_strip_suffix)
strbuf_addch(path, "/");
PatrickThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/24/2026 3:59 AM, Patrick Steinhardt wrote:
> On Mon, Mar 23, 2026 at 11:40:17AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/path-walk.c b/path-walk.c
>> index 364e4cfa19..0d640e2f24 100644
>> --- a/path-walk.c
>> +++ b/path-walk.c
>> @@ -206,6 +207,34 @@ static int add_tree_entries(struct path_walk_context *ctx,
>> match != MATCHED)
>> continue;
>> }
>> + if (ctx->revs->prune_data.nr) {
>> + struct pathspec *pd = &ctx->revs->prune_data;
>> + bool found = false;
>> +
>> + /* remove '/' for these checks. */
>> + path.buf[path.len - 1] = 0;
>
> Hm. Is this _always_ safe to do? We add the directory separator a few
> lines further up, but only in the case where `type == OBJ_TREE`. So in
> reverse this may mean that there are cases where we don't have a
> trailing '/'.
>
> Maybe we should instead:
>
> did_strip_suffix = strbuf_strip_suffix(path, "/");
>
> ...
>
> if (did_strip_suffix)
> strbuf_addch(path, "/");
This is much cleaner, too! Thanks.
-Stolee|
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Thu, Mar 26, 2026 at 03:14:48PM +0000, Derrick Stolee via GitGitGadget wrote:
> 6: 9699650aa7 ! 6: b6423f9595 t5620: test backfill's unknown argument handling
> @@ Commit message
>
> Before the recent changes to parse rev-list arguments inside of 'git
> backfill', the builtin would take arbitrary arguments without complaint (and
> - ignore them). This was noticed and a patch was sent [1] which motivates this
> - change to encode this behavior in test.
> + ignore them). This was noticed and a patch was sent [1] which motivates
> + this change.
>
> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>
> + Note that the revision machinery can output an "ambiguous argument"
> + warning if a value not starting with '--' is found and doesn't make
> + sense as a reference or a pathspec. For unrecognized arguments starting
> + with '--' we need to add logic into builtin/backfill.c to catch leftover
> + arguments.
> +
> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>
> + ## builtin/backfill.c ##
> +@@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
> + repo_init_revisions(repo, &ctx.revs, prefix);
> + argc = setup_revisions(argc, argv, &ctx.revs, NULL);
> +
> ++ if (argc > 1)
> ++ die(_("unrecognized argument: %s"), argv[1]);
> ++
> + repo_config(repo, git_default_config, NULL);
> +
> + if (ctx.sparse < 0)
> +
> ## t/t5620-backfill.sh ##
I would've expected this chunk to already come in patch 3, but that
alone isn't really worth a reroll. All the other changes look good to
me, thanks!
Patrick |
|
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Tue, Mar 17, 2026 at 12:29:16AM +0000, Derrick Stolee via GitGitGadget wrote:
> The git backfill command assists in downloading missing blobs for blobless
> partial clones. However, its current version lacks some valuable
> functionality. It currently:
>
> 1. Only walks commits reachable from HEAD.
> 2. It walks all reachable commits to the full history.
> 3. It can focus on the current sparse-checkout definition, but otherwise it
> doesn't focus on a given pathspec.
>
> All of these are being updated by this patch series, which allows rev-list
> options to impact the path-walk. These include:
>
> 1. Specifying a given refspec, including --all.
> 2. Modifying the commit walk, including --first-parent, commit ranges, or
> recency using --since.
> 3. Modifying the set of paths to download using pathspecs.
>
> One particularly valuable situation here is that now a user can run git
> backfill -- <path> to download all versions of a specific file or a specific
> directory, accelerating history queries within that path without downloading
> more than necessary. This can accelerate git blame or git log -L for these
> paths, where normally those commands download missing blobs one-by-one
> during its diff algorithms.
Nice.
I think especially blaming is a bit of a sore spot -- downloading blobs
one by one simply doesn't cut it there. I wonder whether we can easily
use the backfill mechanism to fetch blobs automatically in git-blame(1)
so that the user doesn't need to know about git-backfill(1) at all?
Patrick |
| @@ -63,9 +63,12 @@ OPTIONS | |||
| current sparse-checkout. If the sparse-checkout feature is enabled, | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Tue, Mar 17, 2026 at 12:29:19AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> The existing implementation of 'git backfill' only includes downloading
> missing blobs reachable from HEAD. Advanced uses may desire more general
> commit limiting options, such as '--all' for all references, specifying a
> commit range via negative references, or specifying a recency of use such as
> with '--since=<date>'.
>
> All of these options are available if we use setup_revisions() to parse the
> unknown arguments with the revision machinery. This opens up a large number
> of possibilities, only a small set of which are tested here.
>
> For documentation, we avoid duplicating the option documentation and instead
> link to the documentation of 'git rev-list'.
>
> Note that these arguments currently allow specifying a pathspec, which
> modifies the commit history checks but does not limit the paths used in the
> backfill logic. This will be updated in a future change.
Makes me wonder whether reversing the order would have avoided this
slight awkwardness. But let's just stick with the current order, the end
result would be the same anyway.
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> Documentation/git-backfill.adoc | 3 +
> builtin/backfill.c | 19 ++--
> t/t5620-backfill.sh | 156 ++++++++++++++++++++++++++++++++
> 3 files changed, 172 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e80fc1b694..1b5595b27c 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -92,13 +92,14 @@ static int do_backfill(struct backfill_context *ctx)
> }
> }
>
> - repo_init_revisions(ctx->repo, &revs, "");
> - handle_revision_arg("HEAD", &revs, 0, 0);
> + /* Walk from HEAD if otherwise unspecified. */
> + if (!ctx->revs.pending.nr)
> + handle_revision_arg("HEAD", &ctx->revs, 0, 0);
Can we use `add_head_to_pending(&ctx->revs)` instead?
PatrickThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/19/26 5:54 AM, Patrick Steinhardt wrote:
> On Tue, Mar 17, 2026 at 12:29:19AM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The existing implementation of 'git backfill' only includes downloading
>> missing blobs reachable from HEAD. Advanced uses may desire more general
>> commit limiting options, such as '--all' for all references, specifying a
>> commit range via negative references, or specifying a recency of use such as
>> with '--since=<date>'.
>>
>> All of these options are available if we use setup_revisions() to parse the
>> unknown arguments with the revision machinery. This opens up a large number
>> of possibilities, only a small set of which are tested here.
>>
>> For documentation, we avoid duplicating the option documentation and instead
>> link to the documentation of 'git rev-list'.
>>
>> Note that these arguments currently allow specifying a pathspec, which
>> modifies the commit history checks but does not limit the paths used in the
>> backfill logic. This will be updated in a future change.
> > Makes me wonder whether reversing the order would have avoided this
> slight awkwardness. But let's just stick with the current order, the end
> result would be the same anyway.
True, we could have added the pathspec logic first, but we wouldn't be able
to test it right away because the parsing comes through the rev-list.
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>> Documentation/git-backfill.adoc | 3 +
>> builtin/backfill.c | 19 ++--
>> t/t5620-backfill.sh | 156 ++++++++++++++++++++++++++++++++
>> 3 files changed, 172 insertions(+), 6 deletions(-)
>>
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e80fc1b694..1b5595b27c 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -92,13 +92,14 @@ static int do_backfill(struct backfill_context *ctx)
>> }
>> }
>> >> - repo_init_revisions(ctx->repo, &revs, "");
>> - handle_revision_arg("HEAD", &revs, 0, 0);
>> + /* Walk from HEAD if otherwise unspecified. */
>> + if (!ctx->revs.pending.nr)
>> + handle_revision_arg("HEAD", &ctx->revs, 0, 0);
> > Can we use `add_head_to_pending(&ctx->revs)` instead?
Nice. We absolutely can and should.
Thanks,
-Stolee
| @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx, | |||
| match != MATCHED) | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Tue, Mar 17, 2026 at 12:29:20AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..e1ad4b0208 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -481,6 +524,18 @@ int walk_objects_by_path(struct path_walk_info *info)
> if (info->tags)
> info->revs->tag_objects = 1;
>
> + if (ctx.revs->prune_data.nr) {
> + /*
> + * Only exact prefix pathspecs are currently supported.
> + * Clear any wildcard or magic pathspecs to avoid
> + * incorrect prefix matching.
> + */
> + struct pathspec *pd = &ctx.revs->prune_data;
> +
> + if (pd->has_wildcard || pd->magic)
> + pd->nr = 0;
> + }
Huh, curious. Won't this cause a leak? I guess we should rather use
`clear_pathspec()` here.
Also shows that this path is missing test coverage.
Patrick| @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx, | |||
| match != MATCHED) | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Tue, Mar 17, 2026 at 12:29:20AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..e1ad4b0208 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
> match != MATCHED)
> continue;
> }
> + if (ctx->revs->prune_data.nr) {
> + struct pathspec *pd = &ctx->revs->prune_data;
> + bool found = false;
> +
> + for (int i = 0; i < pd->nr; i++) {
> + struct pathspec_item *item = &pd->items[i];
> +
> + /*
> + * Is this path a parent directory of
> + * the pathspec item?
> + */
> + if (path.len < (size_t)item->len &&
> + !strncmp(path.buf, item->match, path.len) &&
> + item->match[path.len - 1] == '/') {
> + found = true;
> + break;
> + }
> +
> + /*
> + * Or, is the pathspec an exact match?
> + */
> + if (path.len == (size_t)item->len &&
> + !strcmp(path.buf, item->match)) {
> + found = true;
> + break;
> + }
> +
> + /*
> + * Or, is the pathspec a directory prefix
> + * match?
> + */
> + if (path.len > (size_t)item->len &&
> + !strncmp(path.buf, item->match, item->len) &&
> + path.buf[item->len] == '/') {
> + found = true;
> + break;
> + }
Ah, one more thing: we could expose `dir_prefix()` from "path.c" and
reuse it here.
PatrickThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/19/26 6:15 AM, Patrick Steinhardt wrote:
> Ah, one more thing: we could expose `dir_prefix()` from "path.c" and
> reuse it here.
Good idea. This becomes
/*
* Continue if either is a directory prefix
* of the other.
*/
if (dir_prefix(path.buf, item->match) ||
dir_prefix(item->match, path.buf)) {
found = true;
break;
}
With the idea that we need to walk the parents of each prefix in
addition to walking all of their children.
Thanks,
-Stolee
|
Derrick Stolee wrote on the Git mailing list (how to reply to this email): On 3/19/2026 5:54 AM, Patrick Steinhardt wrote:
> I think especially blaming is a bit of a sore spot -- downloading blobs
> one by one simply doesn't cut it there. I wonder whether we can easily
> use the backfill mechanism to fetch blobs automatically in git-blame(1)
> so that the user doesn't need to know about git-backfill(1) at all?
I've thought about this a bit, and I'm not sure that we want to run
'git backfill' directly. Instead, it would be nice if we did a "staged"
algorithm for 'git blame':
1. Walk commits according to the pathspec to collect the commits that
changed the path.
2. Collect the list of blob OIDs that will be needed for computing diffs
for the line-tracking algorithm.
3. In batches, download groups of missing blobs and then process them
for line-tracking diffs. (Stop if all lines are blamed; continue to
next batch if more lines are needed.)
This would be a significant rewrite of the blame algorithm, though. I
briefly considered this approach about a year ago and decided it would
be easier to start with 'git backfill' and see whether that satisfies
most needs.
The biggest reason to maybe avoid 'git backfill HEAD -- <path>' before
_every_ blame operation is that this will add overhead on repeated
calls that may be obnoxious in its own way. Maybe doing an opt-in
'git blame --backfill <path>' would make it easier for users to opt-in
when they want to.
Thanks,
-Stolee
|
|
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Thu, Mar 19, 2026 at 08:59:01AM -0400, Derrick Stolee wrote:
> On 3/19/2026 5:54 AM, Patrick Steinhardt wrote:
> > I think especially blaming is a bit of a sore spot -- downloading blobs
> > one by one simply doesn't cut it there. I wonder whether we can easily
> > use the backfill mechanism to fetch blobs automatically in git-blame(1)
> > so that the user doesn't need to know about git-backfill(1) at all?
>
> I've thought about this a bit, and I'm not sure that we want to run
> 'git backfill' directly. Instead, it would be nice if we did a "staged"
> algorithm for 'git blame':
>
> 1. Walk commits according to the pathspec to collect the commits that
> changed the path.
>
> 2. Collect the list of blob OIDs that will be needed for computing diffs
> for the line-tracking algorithm.
>
> 3. In batches, download groups of missing blobs and then process them
> for line-tracking diffs. (Stop if all lines are blamed; continue to
> next batch if more lines are needed.)
>
> This would be a significant rewrite of the blame algorithm, though. I
> briefly considered this approach about a year ago and decided it would
> be easier to start with 'git backfill' and see whether that satisfies
> most needs.
>
> The biggest reason to maybe avoid 'git backfill HEAD -- <path>' before
> _every_ blame operation is that this will add overhead on repeated
> calls that may be obnoxious in its own way. Maybe doing an opt-in
> 'git blame --backfill <path>' would make it easier for users to opt-in
> when they want to.
That's fair. I fully agree that just blindly doing this would be likely
be inefficient. Ideally, the batching logic would only kick in whenever
we see a missing object.
Anyway, this definitely doesn't have to be part of this series, I was
mostly wondering how hard it is to do.
Thanks!
Patrick |
| @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME | |||
|
|
|||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> Before the recent changes to parse rev-list arguments inside of 'git
> backfill', the builtin would take arbitrary arguments without complaint (and
> ignore them). This was noticed and a patch was sent [1] which motivates this
> change to encode this behavior in test.
>
> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>
> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> t/t5620-backfill.sh | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
> index c6f54ee91c..85740f1f13 100755
> --- a/t/t5620-backfill.sh
> +++ b/t/t5620-backfill.sh
> @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>
> . ./test-lib.sh
>
> +test_expect_success 'backfill rejects unexpected arguments' '
> + test_must_fail git backfill unexpected-arg 2>err &&
> + test_grep "ambiguous argument .*unexpected-arg" err &&
> +
> + test_must_fail git backfill --all --firt-parent unexpected-arg 2>err &&
> + test_grep "ambiguous argument .*unexpected-arg" err
> +'
Hmph, I would have expected that an earlier --firt-parent on the
command line would trigger "unknown option" instead.
Having said that, if the code lets the setup_revisions() parse the
command line, the usual "unless disambiguated with a double-dash
'--', stop at the first non-revision and take everything as paths
but for safety all of them must refer to an existing path in the
working tree" behaviour should trigger, and it is not specific to
"backfill", and may already be tested centrally (if not, I do not
object to such a new set of tests).
For any cmd that take revisions and pathspec (e.g., log, rev-list,
grep) these should hold true:
$ git $cmd [<options>]... Makefile HEAD
Without disambiguation the command should say "Ah, Makefile
is not a revision, so we will see no more revisions, and
everything, including the current one we are looking at, must be
an existing path on the working tree", and barfs on HEAD that
does not exist as a file/directory.
$ git $cmd [<options>]... Makefile -- HEAD
With disambiguation, the command should verify everything before
the double-dash to be a rev, and barf that Makefile is not a
rev.
$ git $cmd [<options>]... -- Makefile HEAD
With disambiguation, the command should take everything after
the double-dash to be a pathspec element without barfing. After
all, it may be referring to a path that used to exist in some
revision the command will look at.
Thanks.
There was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/23/2026 11:29 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> Before the recent changes to parse rev-list arguments inside of 'git
>> backfill', the builtin would take arbitrary arguments without complaint (and
>> ignore them). This was noticed and a patch was sent [1] which motivates this
>> change to encode this behavior in test.
>>
>> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>>
>> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>> t/t5620-backfill.sh | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
>> index c6f54ee91c..85740f1f13 100755
>> --- a/t/t5620-backfill.sh
>> +++ b/t/t5620-backfill.sh
>> @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>>
>> . ./test-lib.sh
>>
>> +test_expect_success 'backfill rejects unexpected arguments' '
>> + test_must_fail git backfill unexpected-arg 2>err &&
>> + test_grep "ambiguous argument .*unexpected-arg" err &&
>> +
>> + test_must_fail git backfill --all --firt-parent unexpected-arg 2>err &&
>> + test_grep "ambiguous argument .*unexpected-arg" err
>> +'
>
> Hmph, I would have expected that an earlier --firt-parent on the
> command line would trigger "unknown option" instead.
Interesting that my mistype has demonstrated an interesting
behavior here. It turns out that random options starting with
'--' are accepted here, including --unexpected-arg.
This means that we actually have room here for some improvement!
I'll see what can be done to make even these arguments be seen
as failures.
> Having said that, if the code lets the setup_revisions() parse the
> command line, the usual "unless disambiguated with a double-dash
> '--', stop at the first non-revision and take everything as paths
> but for safety all of them must refer to an existing path in the
> working tree" behaviour should trigger, and it is not specific to
> "backfill", and may already be tested centrally (if not, I do not
> object to such a new set of tests).
Thanks,
-Stolee
| @@ -63,9 +63,12 @@ OPTIONS | |||
| current sparse-checkout. If the sparse-checkout feature is enabled, | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Mon, Mar 23, 2026 at 11:40:16AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e80fc1b694..90c9d84793 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
> builtin_backfill_usage, options);
>
> argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
> - 0);
> + PARSE_OPT_KEEP_UNKNOWN_OPT |
> + PARSE_OPT_KEEP_ARGV0 |
> + PARSE_OPT_KEEP_DASHDASH);
> +
> + repo_init_revisions(repo, &ctx.revs, prefix);
> + argc = setup_revisions(argc, argv, &ctx.revs, NULL);
We should probably die here in case we still have unknown arguments.
PatrickThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/24/2026 3:59 AM, Patrick Steinhardt wrote:
> On Mon, Mar 23, 2026 at 11:40:16AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e80fc1b694..90c9d84793 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>> builtin_backfill_usage, options);
>>
>> argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
>> - 0);
>> + PARSE_OPT_KEEP_UNKNOWN_OPT |
>> + PARSE_OPT_KEEP_ARGV0 |
>> + PARSE_OPT_KEEP_DASHDASH);
>> +
>> + repo_init_revisions(repo, &ctx.revs, prefix);
>> + argc = setup_revisions(argc, argv, &ctx.revs, NULL);
>
> We should probably die here in case we still have unknown arguments.
That is indeed the fix for the bad test in patch 6. I'll make the
necessary update in v3's patch 6 along with the test for it.
Thanks,
-Stolee
| @@ -11,6 +11,7 @@ | |||
| #include "list-objects.h" | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Mon, Mar 23, 2026 at 11:40:17AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..0d640e2f24 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -206,6 +207,34 @@ static int add_tree_entries(struct path_walk_context *ctx,
> match != MATCHED)
> continue;
> }
> + if (ctx->revs->prune_data.nr) {
> + struct pathspec *pd = &ctx->revs->prune_data;
> + bool found = false;
> +
> + /* remove '/' for these checks. */
> + path.buf[path.len - 1] = 0;
Hm. Is this _always_ safe to do? We add the directory separator a few
lines further up, but only in the case where `type == OBJ_TREE`. So in
reverse this may mean that there are cases where we don't have a
trailing '/'.
Maybe we should instead:
did_strip_suffix = strbuf_strip_suffix(path, "/");
...
if (did_strip_suffix)
strbuf_addch(path, "/");
PatrickThere was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/24/2026 3:59 AM, Patrick Steinhardt wrote:
> On Mon, Mar 23, 2026 at 11:40:17AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/path-walk.c b/path-walk.c
>> index 364e4cfa19..0d640e2f24 100644
>> --- a/path-walk.c
>> +++ b/path-walk.c
>> @@ -206,6 +207,34 @@ static int add_tree_entries(struct path_walk_context *ctx,
>> match != MATCHED)
>> continue;
>> }
>> + if (ctx->revs->prune_data.nr) {
>> + struct pathspec *pd = &ctx->revs->prune_data;
>> + bool found = false;
>> +
>> + /* remove '/' for these checks. */
>> + path.buf[path.len - 1] = 0;
>
> Hm. Is this _always_ safe to do? We add the directory separator a few
> lines further up, but only in the case where `type == OBJ_TREE`. So in
> reverse this may mean that there are cases where we don't have a
> trailing '/'.
>
> Maybe we should instead:
>
> did_strip_suffix = strbuf_strip_suffix(path, "/");
>
> ...
>
> if (did_strip_suffix)
> strbuf_addch(path, "/");
This is much cleaner, too! Thanks.
-Stolee|
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Thu, Mar 26, 2026 at 03:14:48PM +0000, Derrick Stolee via GitGitGadget wrote:
> 6: 9699650aa7 ! 6: b6423f9595 t5620: test backfill's unknown argument handling
> @@ Commit message
>
> Before the recent changes to parse rev-list arguments inside of 'git
> backfill', the builtin would take arbitrary arguments without complaint (and
> - ignore them). This was noticed and a patch was sent [1] which motivates this
> - change to encode this behavior in test.
> + ignore them). This was noticed and a patch was sent [1] which motivates
> + this change.
>
> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>
> + Note that the revision machinery can output an "ambiguous argument"
> + warning if a value not starting with '--' is found and doesn't make
> + sense as a reference or a pathspec. For unrecognized arguments starting
> + with '--' we need to add logic into builtin/backfill.c to catch leftover
> + arguments.
> +
> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>
> + ## builtin/backfill.c ##
> +@@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
> + repo_init_revisions(repo, &ctx.revs, prefix);
> + argc = setup_revisions(argc, argv, &ctx.revs, NULL);
> +
> ++ if (argc > 1)
> ++ die(_("unrecognized argument: %s"), argv[1]);
> ++
> + repo_config(repo, git_default_config, NULL);
> +
> + if (ctx.sparse < 0)
> +
> ## t/t5620-backfill.sh ##
I would've expected this chunk to already come in patch 3, but that
alone isn't really worth a reroll. All the other changes look good to
me, thanks!
Patrick |
The
git backfillcommand assists in downloading missing blobs for blobless partial clones. However, its current version lacks some valuable functionality. It currently:HEAD.All of these are being updated by this patch series, which allows rev-list options to impact the path-walk. These include:
--all.--first-parent, commit ranges, or recency using--since.One particularly valuable situation here is that now a user can run
git backfill -- <path>to download all versions of a specific file or a specific directory, accelerating history queries within that path without downloading more than necessary. This can accelerategit blameorgit log -Lfor these paths, where normally those commands download missing blobs one-by-one during its diff algorithms.This patch series is organized in the following way:
#includeis added to prevent future compilation issues.backfillbuiltin parses the rev-list arguments. We test the top arguments that work as expected, though the pathspec arguments need extra work.builtin/backfill.cinstead of restricting the walk in the path-walk API.The main goal of this series is to make such customizations possible, and to improve performance where common use cases are expected. I'm open to feedback as to whether we should consider more detailed performance analysis or whether we should wait for how users interact with these new options before overoptimizing unlikely use cases.
Updates in v2
[1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
Updates in v3
--.Thanks,
-Stolee
cc: gitster@pobox.com
cc: "Kristoffer Haugsbakk" kristofferhaugsbakk@fastmail.com
cc: r.siddharth.shrimali@gmail.com
cc: ps@pks.im