Skip to content

backfill: accept revision arguments#2070

Closed
derrickstolee wants to merge 6 commits intogitgitgadget:masterfrom
derrickstolee:backfill-revs
Closed

backfill: accept revision arguments#2070
derrickstolee wants to merge 6 commits intogitgitgadget:masterfrom
derrickstolee:backfill-revs

Conversation

@derrickstolee
Copy link
Copy Markdown

@derrickstolee derrickstolee commented Mar 16, 2026

The git backfill command assists in downloading missing blobs for blobless partial clones. However, its current version lacks some valuable functionality. It currently:

  1. Only walks commits reachable from HEAD.
  2. It walks all reachable commits to the full history.
  3. It can focus on the current sparse-checkout definition, but otherwise it doesn't focus on a given pathspec.

All of these are being updated by this patch series, which allows rev-list options to impact the path-walk. These include:

  1. Specifying a given refspec, including --all.
  2. Modifying the commit walk, including --first-parent, commit ranges, or recency using --since.
  3. Modifying the set of paths to download using pathspecs.

One particularly valuable situation here is that now a user can run git backfill -- <path> to download all versions of a specific file or a specific directory, accelerating history queries within that path without downloading more than necessary. This can accelerate git blame or git log -L for these paths, where normally those commands download missing blobs one-by-one during its diff algorithms.

This patch series is organized in the following way:

  1. A missing #include is added to prevent future compilation issues.
  2. The test repo in t5620 is expanded to make later tests more interesting.
  3. The backfill builtin parses the rev-list arguments. We test the top arguments that work as expected, though the pathspec arguments need extra work.
  4. Update the path-walk logic to work efficiently with some pathspecs, such as fixed prefix pathspecs, accelerating the computation.
  5. For more complicated pathspecs, do a post-filter in builtin/backfill.c instead of restricting the walk in the path-walk API.

The main goal of this series is to make such customizations possible, and to improve performance where common use cases are expected. I'm open to feedback as to whether we should consider more detailed performance analysis or whether we should wait for how users interact with these new options before overoptimizing unlikely use cases.

Updates in v2

  • Hard stops are replaced with a comma (and no punctuation) in the docs.
  • add_head_to_pending() simplifies some code.
  • My poor explanation of "starting commits" is updated.
  • Language around temporary prefix restriction is clarified.
  • Prefix match logic is simplified with dir_prefix().
  • Temporary memory leak (introduced in v1's patch 4 and removed in v1's patch 5) is removed in v2's patch 4.
  • Commit pruning is reenabled in v2's patch 5. There was no need for that with the way the logic works in the patch.
  • Add a new patch with a test demonstrating the new behavior that was being discussed in [1].

[1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/

Updates in v3

  • Fixed the argument checks to actually catch unknown arguments, because the revision machinery will skip unknown options starting with --.

Thanks,
-Stolee

cc: gitster@pobox.com
cc: "Kristoffer Haugsbakk" kristofferhaugsbakk@fastmail.com
cc: r.siddharth.shrimali@gmail.com
cc: ps@pks.im

The REV_INFO_INIT macro includes a use of the DEFAULT_ABBREV macro, which is
defined in object-name.h. Include it in revision.h so consumers of
REV_INFO_INIT do not need to include this hidden dependency.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Prepare the test infrastructure for upcoming changes that teach 'git
backfill' to accept revision arguments and pathspecs.

Add test_tick before each commit in the setup loop so that commit dates
are deterministic. This enables reliable testing with '--since'.

Rename the 'd/e/' directory to 'd/f/' so that the prefix 'd/f' is
ambiguous with the files 'd/file.*.txt'. This exercises the subtlety
in prefix pathspec matching that will be added in a later commit.

Create a branched version of the test repository (src-revs) with:
 - A 'side' branch merged into main, adding s/file.{1,2}.txt with
   two versions (4 new blobs, 52 total from main HEAD).
 - An unmerged 'other' branch adding o/file.{1,2}.txt (2 more blobs,
   54 total reachable from --all).

This structure makes --all, --first-parent, and --since produce
meaningfully different results when used with 'git backfill'.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
@derrickstolee
Copy link
Copy Markdown
Author

/submit

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 17, 2026

Submitted as pull.2070.git.1773707361.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2070/derrickstolee/backfill-revs-v1

To fetch this version to local tag pr-2070/derrickstolee/backfill-revs-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2070/derrickstolee/backfill-revs-v1

@jayesh0104
Copy link
Copy Markdown

/allow

@jayesh0104
Copy link
Copy Markdown

I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 17, 2026

Error: User jayesh0104 is not yet permitted to use GitGitGadget

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 17, 2026

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The git backfill command assists in downloading missing blobs for blobless
> partial clones. However, its current version lacks some valuable
> functionality. It currently:
>
>  1. Only walks commits reachable from HEAD.
>  2. It walks all reachable commits to the full history.
>  3. It can focus on the current sparse-checkout definition, but otherwise it
>     doesn't focus on a given pathspec.
>
> All of these are being updated by this patch series, which allows rev-list
> options to impact the path-walk. These include:
>
>  1. Specifying a given refspec, including --all.

Makes sense.  You can only be on a single branch at a time, but may
want to work on multiple topics in reasonably quick succession in a
single repository.  Being able to prepare enough material to go back
to when working on whichever topic in a single backfill invocation
would be a welcome addition.

>  2. Modifying the commit walk, including --first-parent, commit ranges, or
>     recency using --since.
>  3. Modifying the set of paths to download using pathspecs.

Both are good mechanisms to express which subset of history you will
be working on.

> One particularly valuable situation here is that now a user can run git
> backfill -- <path> to download all versions of a specific file or a specific
> directory, accelerating history queries within that path without downloading
> more than necessary. This can accelerate git blame or git log -L for these
> paths, where normally those commands download missing blobs one-by-one
> during its diff algorithms.

Yup.  Even if your project is a huge monorepo that contains all, you
do not necessarily have to look at everything the organization has
all the time.  "git blame -C -C -C" would of course not work in such
an environment (would it end up on-demand lazy fetch these blobs, or
are there ways to say "I know the object store of my repository is
only sparsely populated, and I do not want you to on-demand download
the missing blobs---do your best to work with only what is already
available?), but that's a tradeoff a monorepo makes.

Comment thread revision.h
@@ -4,6 +4,7 @@
#include "commit.h"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> The REV_INFO_INIT macro includes a use of the DEFAULT_ABBREV macro, which is
> defined in object-name.h. Include it in revision.h so consumers of
> REV_INFO_INIT do not need to include this hidden dependency.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  revision.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/revision.h b/revision.h
> index b36acfc2d9..18c9bbd822 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -4,6 +4,7 @@
>  #include "commit.h"
>  #include "grep.h"
>  #include "notes.h"
> +#include "object-name.h"
>  #include "oidset.h"
>  #include "pretty.h"
>  #include "diff.h"

OK.  Other symbols REV_INFO_INIT needs are REV_SORT_IN_GRAPH_ORDER
(in <commit.h>), CMIT_FMT_DEFAULT (in <pretty.h>), and STRVEC_INIT
(in <strvec.h>), and all three are already included there.

Makes sense.

@@ -63,9 +63,12 @@ OPTIONS
current sparse-checkout. If the sparse-checkout feature is enabled,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> -	repo_init_revisions(ctx->repo, &revs, "");
> -	handle_revision_arg("HEAD", &revs, 0, 0);

So we used to "cheat" and did an initialization without even knowing
in which directory we were started ...

> +	/* Walk from HEAD if otherwise unspecified. */
> +	if (!ctx->revs.pending.nr)
> +		handle_revision_arg("HEAD", &ctx->revs, 0, 0);

... but by initializing the revs correctly in the caller, we would
be correcting it.  Looking good.

> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  					 builtin_backfill_usage, options);
>  
>  	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
> -			     0);
> +			     PARSE_OPT_KEEP_UNKNOWN_OPT |
> +			     PARSE_OPT_KEEP_ARGV0 |
> +			     PARSE_OPT_KEEP_DASHDASH);
> +
> +	repo_init_revisions(repo, &ctx.revs, prefix);
> +	argc = setup_revisions(argc, argv, &ctx.revs, NULL);

OK.

Comment thread path-walk.c
@@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
match != MATCHED)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> The previous change allowed specifying revision arguments over the 'git
> backfill' command-line. This created the opportunity for pathspecs that
> specify a smaller set of starting commits, but otherwise did not restrict
> the blob paths that were downloaded.

"pathspecs that specify a smaller set of starting commits" is
puzzling, as starting commits would be coming from the revision
arguments.  "opportunity for pathspec to further filter commits
to those that touch only the matching paths...", or something?

> Update the path-walk API to accept certain kinds of pathspecs and to
> silently ignore anything too complex.

Hmph, "silently ignore", instead of "no, you cannot use that! and
die", or at least "sorry, I cannot do that, so the result may not be
what you wanted, you've been warned"?

> The current behavior focuses on
> pathspecs that match paths exactly. This includes exact filenames,
> including directory names as prefixes. Pathspecs containing wildcards
> or magic are cleared so the path walk downloads all blobs, as before.

Ah, "we punt and lift the limitation to grab everything, so at least
everything you wanted to have will become available to you, even
though we may download more than what you asked"?  OK, users would
survive that, and as we improve the pathspec support, the user
experience would only improve.  OK.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/17/2026 6:10 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The previous change allowed specifying revision arguments over the 'git
>> backfill' command-line. This created the opportunity for pathspecs that
>> specify a smaller set of starting commits, but otherwise did not restrict
>> the blob paths that were downloaded.
> 
> "pathspecs that specify a smaller set of starting commits" is
> puzzling, as starting commits would be coming from the revision
> arguments.  "opportunity for pathspec to further filter commits
> to those that touch only the matching paths...", or something?

You're right. I'm using "starting commits" incorrectly. My view was too
focused on how the path-walk API starts from the commits output by the
revision walk to get a list of root trees and then walks by path from
that point.

I'll reword to make this more clear.

>> Update the path-walk API to accept certain kinds of pathspecs and to
>> silently ignore anything too complex.
> 
> Hmph, "silently ignore", instead of "no, you cannot use that! and
> die", or at least "sorry, I cannot do that, so the result may not be
> what you wanted, you've been warned"?

The behavior when silently ignoring is to over-download. The revision
walk still filters commits, but the path-walk then walks paths beyond
that pathspec. This will be fixed in the next commit, so adding an
error case didn't seem worth it. I'll do a better job foreshadowing.

>> The current behavior focuses on
>> pathspecs that match paths exactly. This includes exact filenames,
>> including directory names as prefixes. Pathspecs containing wildcards
>> or magic are cleared so the path walk downloads all blobs, as before.
> 
> Ah, "we punt and lift the limitation to grab everything, so at least
> everything you wanted to have will become available to you, even
> though we may download more than what you asked"?  OK, users would
> survive that, and as we improve the pathspec support, the user
> experience would only improve.  OK.

Exactly. I can word things better.

Thanks,
-Stolee

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Wed, Mar 18, 2026 at 09:15:00AM -0400, Derrick Stolee wrote:
> On 3/17/2026 6:10 PM, Junio C Hamano wrote:
> >> Update the path-walk API to accept certain kinds of pathspecs and to
> >> silently ignore anything too complex.
> > 
> > Hmph, "silently ignore", instead of "no, you cannot use that! and
> > die", or at least "sorry, I cannot do that, so the result may not be
> > what you wanted, you've been warned"?
> 
> The behavior when silently ignoring is to over-download. The revision
> walk still filters commits, but the path-walk then walks paths beyond
> that pathspec. This will be fixed in the next commit, so adding an
> error case didn't seem worth it. I'll do a better job foreshadowing.

I guess this is a fine tradeoff when documented properly. But I think in
that case we should make very clear that this behaviour may change in
the future if find a way to efficiently limit the pathwalk, too.

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Wed, Mar 18, 2026 at 09:15:00AM -0400, Derrick Stolee wrote:
> On 3/17/2026 6:10 PM, Junio C Hamano wrote:
> >> Update the path-walk API to accept certain kinds of pathspecs and to
> >> silently ignore anything too complex.
> > 
> > Hmph, "silently ignore", instead of "no, you cannot use that! and
> > die", or at least "sorry, I cannot do that, so the result may not be
> > what you wanted, you've been warned"?
> 
> The behavior when silently ignoring is to over-download. The revision
> walk still filters commits, but the path-walk then walks paths beyond
> that pathspec. This will be fixed in the next commit, so adding an
> error case didn't seem worth it. I'll do a better job foreshadowing.

I guess this is a fine tradeoff when documented properly. But I think in
that case we should make very clear that this behaviour may change in
the future if find a way to efficiently limit the pathwalk, too.

Patrick

Comment thread path-walk.c
@@ -62,6 +62,8 @@ struct path_walk_context {
*/
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> Previously, walk_objects_by_path() silently ignored pathspecs containing
> wildcards or magic by clearing them. This caused all blobs to be
> downloaded regardless of the given pathspec. Wildcard pathspecs like
> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
> during 'git backfill').
>
> Support wildcard pathspecs by making three changes:
>
>  1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>     pathspec has no wildcards or magic, set this flag and use the
>     existing fast-path prefix matching in add_tree_entries(). When
>     wildcards are present, skip that block since prefix matching
>     cannot handle glob patterns.
>
>  2. Disable revision-level commit pruning (revs->prune = 0) for
>     wildcard pathspecs. The revision walk uses the pathspec to filter
>     commits via TREESAME detection. For exact prefix pathspecs this
>     works well, but wildcard pathspecs may fail to match through
>     TREESAME because fnmatch with WM_PATHNAME does not cross directory
>     boundaries. Disabling pruning ensures all commits are visited and
>     their trees are available for the path-walk to filter.

Hmph, I wonder how significant an impact does it have on the
performance that we have to disable pruning here.  With the bog
standard tree traversal, wouldn't tree_entry_interesting() already
be capable of doing this, even with fnmatch / WM_PATHNAME ?

>  3. Add a match_pathspec() check in walk_path() to filter out blobs
>     whose full path does not match the pathspec. This provides the
>     actual blob-level filtering for wildcard pathspecs.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The latter person cannot sign DCO or vouch for the origin of what
they have written in this patch, can they?

> ---
>  path-walk.c         | 22 ++++++++++++++--------
>  t/t5620-backfill.sh |  7 +++----
>  2 files changed, 17 insertions(+), 12 deletions(-)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/17/2026 6:19 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> Previously, walk_objects_by_path() silently ignored pathspecs containing
>> wildcards or magic by clearing them. This caused all blobs to be
>> downloaded regardless of the given pathspec. Wildcard pathspecs like
>> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
>> during 'git backfill').
>>
>> Support wildcard pathspecs by making three changes:
>>
>>  1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>>     pathspec has no wildcards or magic, set this flag and use the
>>     existing fast-path prefix matching in add_tree_entries(). When
>>     wildcards are present, skip that block since prefix matching
>>     cannot handle glob patterns.
>>
>>  2. Disable revision-level commit pruning (revs->prune = 0) for
>>     wildcard pathspecs. The revision walk uses the pathspec to filter
>>     commits via TREESAME detection. For exact prefix pathspecs this
>>     works well, but wildcard pathspecs may fail to match through
>>     TREESAME because fnmatch with WM_PATHNAME does not cross directory
>>     boundaries. Disabling pruning ensures all commits are visited and
>>     their trees are available for the path-walk to filter.
> 
> Hmph, I wonder how significant an impact does it have on the
> performance that we have to disable pruning here.  With the bog
> standard tree traversal, wouldn't tree_entry_interesting() already
> be capable of doing this, even with fnmatch / WM_PATHNAME ?

I will explore what's possible here and see what I can do.

>>  3. Add a match_pathspec() check in walk_path() to filter out blobs
>>     whose full path does not match the pathspec. This provides the
>>     actual blob-level filtering for wildcard pathspecs.
>>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> 
> The latter person cannot sign DCO or vouch for the origin of what
> they have written in this patch, can they?
No they cannot. Sorry for this error.

Thanks,
-Stolee

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/18/26 9:16 AM, Derrick Stolee wrote:
> On 3/17/2026 6:19 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Derrick Stolee <stolee@gmail.com>
>>>
>>> Previously, walk_objects_by_path() silently ignored pathspecs containing
>>> wildcards or magic by clearing them. This caused all blobs to be
>>> downloaded regardless of the given pathspec. Wildcard pathspecs like
>>> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
>>> during 'git backfill').
>>>
>>> Support wildcard pathspecs by making three changes:
>>>
>>>   1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>>>      pathspec has no wildcards or magic, set this flag and use the
>>>      existing fast-path prefix matching in add_tree_entries(). When
>>>      wildcards are present, skip that block since prefix matching
>>>      cannot handle glob patterns.
>>>
>>>   2. Disable revision-level commit pruning (revs->prune = 0) for
>>>      wildcard pathspecs. The revision walk uses the pathspec to filter
>>>      commits via TREESAME detection. For exact prefix pathspecs this
>>>      works well, but wildcard pathspecs may fail to match through
>>>      TREESAME because fnmatch with WM_PATHNAME does not cross directory
>>>      boundaries. Disabling pruning ensures all commits are visited and
>>>      their trees are available for the path-walk to filter.
>>
>> Hmph, I wonder how significant an impact does it have on the
>> performance that we have to disable pruning here.  With the bog
>> standard tree traversal, wouldn't tree_entry_interesting() already
>> be capable of doing this, even with fnmatch / WM_PATHNAME ?
> > I will explore what's possible here and see what I can do.

I must have needed the 'revs->prune = 0' at some point during development
and left it even though it isn't actually necessary. Leaving it
implicitly at '1' should indeed be faster due to traversing fewer commits
and parsing fewer trees while still reaching all necessary blobs.

Only changes 1 and 3 are necessary.

Thanks,
-Stolee

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/18/26 9:16 AM, Derrick Stolee wrote:
> On 3/17/2026 6:19 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Derrick Stolee <stolee@gmail.com>
>>>
>>> Previously, walk_objects_by_path() silently ignored pathspecs containing
>>> wildcards or magic by clearing them. This caused all blobs to be
>>> downloaded regardless of the given pathspec. Wildcard pathspecs like
>>> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
>>> during 'git backfill').
>>>
>>> Support wildcard pathspecs by making three changes:
>>>
>>>   1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>>>      pathspec has no wildcards or magic, set this flag and use the
>>>      existing fast-path prefix matching in add_tree_entries(). When
>>>      wildcards are present, skip that block since prefix matching
>>>      cannot handle glob patterns.
>>>
>>>   2. Disable revision-level commit pruning (revs->prune = 0) for
>>>      wildcard pathspecs. The revision walk uses the pathspec to filter
>>>      commits via TREESAME detection. For exact prefix pathspecs this
>>>      works well, but wildcard pathspecs may fail to match through
>>>      TREESAME because fnmatch with WM_PATHNAME does not cross directory
>>>      boundaries. Disabling pruning ensures all commits are visited and
>>>      their trees are available for the path-walk to filter.
>>
>> Hmph, I wonder how significant an impact does it have on the
>> performance that we have to disable pruning here.  With the bog
>> standard tree traversal, wouldn't tree_entry_interesting() already
>> be capable of doing this, even with fnmatch / WM_PATHNAME ?
> > I will explore what's possible here and see what I can do.

I must have needed the 'revs->prune = 0' at some point during development
and left it even though it isn't actually necessary. Leaving it
implicitly at '1' should indeed be faster due to traversing fewer commits
and parsing fewer trees while still reaching all necessary blobs.

Only changes 1 and 3 are necessary.

Thanks,
-Stolee

@@ -63,9 +63,12 @@ OPTIONS
current sparse-checkout. If the sparse-checkout feature is enabled,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Kristoffer Haugsbakk" wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026, at 01:29, Derrick Stolee via GitGitGadget wrote:
>[snip]
> diff --git a/Documentation/git-backfill.adoc b/Documentation/git-backfill.adoc
> index b8394dcf22..fdfe22d623 100644
> --- a/Documentation/git-backfill.adoc
> +++ b/Documentation/git-backfill.adoc
> @@ -63,9 +63,12 @@ OPTIONS
>  	current sparse-checkout. If the sparse-checkout feature is enabled,
>  	then `--sparse` is assumed and can be disabled with `--no-sparse`.
>
> +You may also specify the commit limiting options from linkgit:git-rev-list[1].
> +
>  SEE ALSO
>  --------
>  linkgit:git-clone[1].
> +linkgit:git-rev-list[1].

Should there be a comma between these two?

>[snip]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/18/26 11:37 AM, Kristoffer Haugsbakk wrote:
> On Tue, Mar 17, 2026, at 01:29, Derrick Stolee via GitGitGadget wrote:
>> [snip]
>> diff --git a/Documentation/git-backfill.adoc b/Documentation/git-backfill.adoc
>> index b8394dcf22..fdfe22d623 100644
>> --- a/Documentation/git-backfill.adoc
>> +++ b/Documentation/git-backfill.adoc
>> @@ -63,9 +63,12 @@ OPTIONS
>>   	current sparse-checkout. If the sparse-checkout feature is enabled,
>>   	then `--sparse` is assumed and can be disabled with `--no-sparse`.
>>
>> +You may also specify the commit limiting options from linkgit:git-rev-list[1].
>> +
>>   SEE ALSO
>>   --------
>>   linkgit:git-clone[1].
>> +linkgit:git-rev-list[1].
> > Should there be a comma between these two?

Good catch. Also there shouldn't be a hard stop, either.

Thanks,
-Stolee

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/18/26 11:37 AM, Kristoffer Haugsbakk wrote:
> On Tue, Mar 17, 2026, at 01:29, Derrick Stolee via GitGitGadget wrote:
>> [snip]
>> diff --git a/Documentation/git-backfill.adoc b/Documentation/git-backfill.adoc
>> index b8394dcf22..fdfe22d623 100644
>> --- a/Documentation/git-backfill.adoc
>> +++ b/Documentation/git-backfill.adoc
>> @@ -63,9 +63,12 @@ OPTIONS
>>   	current sparse-checkout. If the sparse-checkout feature is enabled,
>>   	then `--sparse` is assumed and can be disabled with `--no-sparse`.
>>
>> +You may also specify the commit limiting options from linkgit:git-rev-list[1].
>> +
>>   SEE ALSO
>>   --------
>>   linkgit:git-clone[1].
>> +linkgit:git-rev-list[1].
> > Should there be a comma between these two?

Good catch. Also there shouldn't be a hard stop, either.

Thanks,
-Stolee

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 18, 2026

User "Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> has been added to the cc: list.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 18, 2026

This branch is now known as ds/backfill-revs.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 18, 2026

This patch series was integrated into seen via git@b0faae0.

@gitgitgadget gitgitgadget bot added the seen label Mar 18, 2026
@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 19, 2026

This patch series was integrated into seen via git@2e7d77b.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 22, 2026

This patch series was integrated into seen via git@03fbc16.

The existing implementation of 'git backfill' only includes downloading
missing blobs reachable from HEAD. Advanced uses may desire more general
commit limiting options, such as '--all' for all references, specifying a
commit range via negative references, or specifying a recency of use such as
with '--since=<date>'.

All of these options are available if we use setup_revisions() to parse the
unknown arguments with the revision machinery. This opens up a large number
of possibilities, only a small set of which are tested here.

For documentation, we avoid duplicating the option documentation and instead
link to the documentation of 'git rev-list'.

Note that these arguments currently allow specifying a pathspec, which
modifies the commit history checks but does not limit the paths used in the
backfill logic. This will be updated in a future change.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
@derrickstolee
Copy link
Copy Markdown
Author

/submit

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 23, 2026

Submitted as pull.2070.v2.git.1774266019.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2070/derrickstolee/backfill-revs-v2

To fetch this version to local tag pr-2070/derrickstolee/backfill-revs-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2070/derrickstolee/backfill-revs-v2

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Mar 24, 2026

This patch series was integrated into seen via git@90589f0.

@Sporarum
Copy link
Copy Markdown

@derrickstolee
Sorry to bother you, I'm trying to fix a small mistake in the documentation of git stash, and I need someone to /allow me, could you do that ?
See git#2255

The previous change allowed specifying revision arguments over the 'git
backfill' command-line. This created the opportunity for restricting the
initial commit set by filtering the revision walk through a pathspec. Other
than filtering the commit set (and thereby the root trees), this did not
restrict the path-walk implementation of 'git backfill' and did not restrict
the blobs that were downloaded to only those matching the pathspec.

Update the path-walk API to accept certain kinds of pathspecs and to
silently ignore anything too complex, for now. We will update this in the
next change to properly restrict to even complex pathspecs.

The current behavior focuses on pathspecs that match paths exactly. This
includes exact filenames, including directory names as prefixes. Pathspecs
containing wildcards or magic are cleared so the path walk downloads all
blobs, as before.

The reason for this restriction is to allow for a faster execution by
pruning the path walk to only trees that could contribute towards one of
those paths as a parent directory.

The test directory 'd/f/' (next to 'd/file*.txt') was prepared in a
previous commit to exercise the subtlety in prefix matching.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Previously, walk_objects_by_path() silently ignored pathspecs containing
wildcards or magic by clearing them. This caused all blobs to be
downloaded regardless of the given pathspec. Wildcard pathspecs like
"d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
during 'git backfill').

Support wildcard pathspecs by making two changes:

 1. Add an 'exact_pathspecs' flag to path_walk_context. When the
    pathspec has no wildcards or magic, set this flag and use the
    existing fast-path prefix matching in add_tree_entries(). When
    wildcards are present, skip that block since prefix matching
    cannot handle glob patterns.

 2. Add a match_pathspec() check in walk_path() to filter out blobs
    whose full path does not match the pathspec. This provides the
    actual blob-level filtering for wildcard pathspecs.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Before the recent changes to parse rev-list arguments inside of 'git
backfill', the builtin would take arbitrary arguments without complaint (and
ignore them). This was noticed and a patch was sent [1] which motivates
this change.

[1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/

Note that the revision machinery can output an "ambiguous argument"
warning if a value not starting with '--' is found and doesn't make
sense as a reference or a pathspec. For unrecognized arguments starting
with '--' we need to add logic into builtin/backfill.c to catch leftover
arguments.

Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 1, 2026

This patch series was integrated into seen via git@a88badf.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 2, 2026

This patch series was integrated into seen via git@04eaf02.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 2, 2026

This patch series was integrated into seen via git@c743d8f.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 3, 2026

This patch series was integrated into seen via git@4e58217.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 3, 2026

This patch series was integrated into master via git@4e58217.

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 3, 2026

This patch series was integrated into next via git@4e58217.

@gitgitgadget gitgitgadget bot added the master label Apr 3, 2026
@gitgitgadget gitgitgadget bot closed this Apr 3, 2026
@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 3, 2026

Congratulations! 🎉 Your patch series was merged into upstream via 4e58217.

Note: this pull request will show as "Closed" rather than "Merged" because the merge happened in the upstream repository, not on GitHub. This is expected — your contribution has been accepted!

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 7, 2026

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026 at 12:29:16AM +0000, Derrick Stolee via GitGitGadget wrote:
> The git backfill command assists in downloading missing blobs for blobless
> partial clones. However, its current version lacks some valuable
> functionality. It currently:
> 
>  1. Only walks commits reachable from HEAD.
>  2. It walks all reachable commits to the full history.
>  3. It can focus on the current sparse-checkout definition, but otherwise it
>     doesn't focus on a given pathspec.
> 
> All of these are being updated by this patch series, which allows rev-list
> options to impact the path-walk. These include:
> 
>  1. Specifying a given refspec, including --all.
>  2. Modifying the commit walk, including --first-parent, commit ranges, or
>     recency using --since.
>  3. Modifying the set of paths to download using pathspecs.
> 
> One particularly valuable situation here is that now a user can run git
> backfill -- <path> to download all versions of a specific file or a specific
> directory, accelerating history queries within that path without downloading
> more than necessary. This can accelerate git blame or git log -L for these
> paths, where normally those commands download missing blobs one-by-one
> during its diff algorithms.

Nice.

I think especially blaming is a bit of a sore spot -- downloading blobs
one by one simply doesn't cut it there. I wonder whether we can easily
use the backfill mechanism to fetch blobs automatically in git-blame(1)
so that the user doesn't need to know about git-backfill(1) at all?

Patrick

@@ -63,9 +63,12 @@ OPTIONS
current sparse-checkout. If the sparse-checkout feature is enabled,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026 at 12:29:19AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
> 
> The existing implementation of 'git backfill' only includes downloading
> missing blobs reachable from HEAD. Advanced uses may desire more general
> commit limiting options, such as '--all' for all references, specifying a
> commit range via negative references, or specifying a recency of use such as
> with '--since=<date>'.
> 
> All of these options are available if we use setup_revisions() to parse the
> unknown arguments with the revision machinery. This opens up a large number
> of possibilities, only a small set of which are tested here.
> 
> For documentation, we avoid duplicating the option documentation and instead
> link to the documentation of 'git rev-list'.
> 
> Note that these arguments currently allow specifying a pathspec, which
> modifies the commit history checks but does not limit the paths used in the
> backfill logic. This will be updated in a future change.

Makes me wonder whether reversing the order would have avoided this
slight awkwardness. But let's just stick with the current order, the end
result would be the same anyway.

> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  Documentation/git-backfill.adoc |   3 +
>  builtin/backfill.c              |  19 ++--
>  t/t5620-backfill.sh             | 156 ++++++++++++++++++++++++++++++++
>  3 files changed, 172 insertions(+), 6 deletions(-)
> 
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e80fc1b694..1b5595b27c 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -92,13 +92,14 @@ static int do_backfill(struct backfill_context *ctx)
>  		}
>  	}
>  
> -	repo_init_revisions(ctx->repo, &revs, "");
> -	handle_revision_arg("HEAD", &revs, 0, 0);
> +	/* Walk from HEAD if otherwise unspecified. */
> +	if (!ctx->revs.pending.nr)
> +		handle_revision_arg("HEAD", &ctx->revs, 0, 0);

Can we use `add_head_to_pending(&ctx->revs)` instead?

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/19/26 5:54 AM, Patrick Steinhardt wrote:
> On Tue, Mar 17, 2026 at 12:29:19AM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The existing implementation of 'git backfill' only includes downloading
>> missing blobs reachable from HEAD. Advanced uses may desire more general
>> commit limiting options, such as '--all' for all references, specifying a
>> commit range via negative references, or specifying a recency of use such as
>> with '--since=<date>'.
>>
>> All of these options are available if we use setup_revisions() to parse the
>> unknown arguments with the revision machinery. This opens up a large number
>> of possibilities, only a small set of which are tested here.
>>
>> For documentation, we avoid duplicating the option documentation and instead
>> link to the documentation of 'git rev-list'.
>>
>> Note that these arguments currently allow specifying a pathspec, which
>> modifies the commit history checks but does not limit the paths used in the
>> backfill logic. This will be updated in a future change.
> > Makes me wonder whether reversing the order would have avoided this
> slight awkwardness. But let's just stick with the current order, the end
> result would be the same anyway.
True, we could have added the pathspec logic first, but we wouldn't be able
to test it right away because the parsing comes through the rev-list.

>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>>   Documentation/git-backfill.adoc |   3 +
>>   builtin/backfill.c              |  19 ++--
>>   t/t5620-backfill.sh             | 156 ++++++++++++++++++++++++++++++++
>>   3 files changed, 172 insertions(+), 6 deletions(-)
>>
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e80fc1b694..1b5595b27c 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -92,13 +92,14 @@ static int do_backfill(struct backfill_context *ctx)
>>   		}
>>   	}
>>   >> -	repo_init_revisions(ctx->repo, &revs, "");
>> -	handle_revision_arg("HEAD", &revs, 0, 0);
>> +	/* Walk from HEAD if otherwise unspecified. */
>> +	if (!ctx->revs.pending.nr)
>> +		handle_revision_arg("HEAD", &ctx->revs, 0, 0);
> > Can we use `add_head_to_pending(&ctx->revs)` instead?

Nice. We absolutely can and should.

Thanks,
-Stolee

Comment thread path-walk.c
@@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
match != MATCHED)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026 at 12:29:20AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..e1ad4b0208 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -481,6 +524,18 @@ int walk_objects_by_path(struct path_walk_info *info)
>  	if (info->tags)
>  		info->revs->tag_objects = 1;
>  
> +	if (ctx.revs->prune_data.nr) {
> +		/*
> +		 * Only exact prefix pathspecs are currently supported.
> +		 * Clear any wildcard or magic pathspecs to avoid
> +		 * incorrect prefix matching.
> +		 */
> +		struct pathspec *pd = &ctx.revs->prune_data;
> +
> +		if (pd->has_wildcard || pd->magic)
> +			pd->nr = 0;
> +	}

Huh, curious. Won't this cause a leak? I guess we should rather use
`clear_pathspec()` here.

Also shows that this path is missing test coverage.

Patrick

Comment thread path-walk.c
@@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
match != MATCHED)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026 at 12:29:20AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..e1ad4b0208 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
>  				 match != MATCHED)
>  				continue;
>  		}
> +		if (ctx->revs->prune_data.nr) {
> +			struct pathspec *pd = &ctx->revs->prune_data;
> +			bool found = false;
> +
> +			for (int i = 0; i < pd->nr; i++) {
> +				struct pathspec_item *item = &pd->items[i];
> +
> +				/*
> +				 * Is this path a parent directory of
> +				 * the pathspec item?
> +				 */
> +				if (path.len < (size_t)item->len &&
> +				    !strncmp(path.buf, item->match, path.len) &&
> +				    item->match[path.len - 1] == '/') {
> +					found = true;
> +					break;
> +				}
> +
> +				/*
> +				 * Or, is the pathspec an exact match?
> +				 */
> +				if (path.len == (size_t)item->len &&
> +				    !strcmp(path.buf, item->match)) {
> +					found = true;
> +					break;
> +				}
> +
> +				/*
> +				 * Or, is the pathspec a directory prefix
> +				 * match?
> +				 */
> +				if (path.len > (size_t)item->len &&
> +				    !strncmp(path.buf, item->match, item->len) &&
> +				    path.buf[item->len] == '/') {
> +					found = true;
> +					break;
> +				}

Ah, one more thing: we could expose `dir_prefix()` from "path.c" and
reuse it here.

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/19/26 6:15 AM, Patrick Steinhardt wrote:
> Ah, one more thing: we could expose `dir_prefix()` from "path.c" and
> reuse it here.

Good idea. This becomes

			/*
			 * Continue if either is a directory prefix
			 * of the other.
			 */
			if (dir_prefix(path.buf, item->match) ||
			    dir_prefix(item->match, path.buf)) {
				found = true;
				break;
			}

With the idea that we need to walk the parents of each prefix in
addition to walking all of their children.

Thanks,
-Stolee

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 7, 2026

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/19/2026 5:54 AM, Patrick Steinhardt wrote:
> I think especially blaming is a bit of a sore spot -- downloading blobs
> one by one simply doesn't cut it there. I wonder whether we can easily
> use the backfill mechanism to fetch blobs automatically in git-blame(1)
> so that the user doesn't need to know about git-backfill(1) at all?

I've thought about this a bit, and I'm not sure that we want to run
'git backfill' directly. Instead, it would be nice if we did a "staged"
algorithm for 'git blame':

1. Walk commits according to the pathspec to collect the commits that
   changed the path.

2. Collect the list of blob OIDs that will be needed for computing diffs
   for the line-tracking algorithm.

3. In batches, download groups of missing blobs and then process them
   for line-tracking diffs. (Stop if all lines are blamed; continue to
   next batch if more lines are needed.)

This would be a significant rewrite of the blame algorithm, though. I
briefly considered this approach about a year ago and decided it would
be easier to start with 'git backfill' and see whether that satisfies
most needs.

The biggest reason to maybe avoid 'git backfill HEAD -- <path>' before
_every_ blame operation is that this will add overhead on repeated
calls that may be obnoxious in its own way. Maybe doing an opt-in
'git blame --backfill <path>' would make it easier for users to opt-in
when they want to.

Thanks,
-Stolee

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 7, 2026

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Thu, Mar 19, 2026 at 08:59:01AM -0400, Derrick Stolee wrote:
> On 3/19/2026 5:54 AM, Patrick Steinhardt wrote:
> > I think especially blaming is a bit of a sore spot -- downloading blobs
> > one by one simply doesn't cut it there. I wonder whether we can easily
> > use the backfill mechanism to fetch blobs automatically in git-blame(1)
> > so that the user doesn't need to know about git-backfill(1) at all?
> 
> I've thought about this a bit, and I'm not sure that we want to run
> 'git backfill' directly. Instead, it would be nice if we did a "staged"
> algorithm for 'git blame':
> 
> 1. Walk commits according to the pathspec to collect the commits that
>    changed the path.
> 
> 2. Collect the list of blob OIDs that will be needed for computing diffs
>    for the line-tracking algorithm.
> 
> 3. In batches, download groups of missing blobs and then process them
>    for line-tracking diffs. (Stop if all lines are blamed; continue to
>    next batch if more lines are needed.)
> 
> This would be a significant rewrite of the blame algorithm, though. I
> briefly considered this approach about a year ago and decided it would
> be easier to start with 'git backfill' and see whether that satisfies
> most needs.
> 
> The biggest reason to maybe avoid 'git backfill HEAD -- <path>' before
> _every_ blame operation is that this will add overhead on repeated
> calls that may be obnoxious in its own way. Maybe doing an opt-in
> 'git blame --backfill <path>' would make it easier for users to opt-in
> when they want to.

That's fair. I fully agree that just blindly doing this would be likely
be inefficient. Ideally, the batching logic would only kick in whenever
we see a missing object.

Anyway, this definitely doesn't have to be part of this series, I was
mostly wondering how hard it is to do.

Thanks!

Patrick

Comment thread t/t5620-backfill.sh
@@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> Before the recent changes to parse rev-list arguments inside of 'git
> backfill', the builtin would take arbitrary arguments without complaint (and
> ignore them). This was noticed and a patch was sent [1] which motivates this
> change to encode this behavior in test.
>
> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>
> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  t/t5620-backfill.sh | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
> index c6f54ee91c..85740f1f13 100755
> --- a/t/t5620-backfill.sh
> +++ b/t/t5620-backfill.sh
> @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>  
>  . ./test-lib.sh
>  
> +test_expect_success 'backfill rejects unexpected arguments' '
> +	test_must_fail git backfill unexpected-arg 2>err &&
> +	test_grep "ambiguous argument .*unexpected-arg" err &&
> +
> +	test_must_fail git backfill --all --firt-parent unexpected-arg 2>err &&
> +	test_grep "ambiguous argument .*unexpected-arg" err
> +'

Hmph, I would have expected that an earlier --firt-parent on the
command line would trigger "unknown option" instead.

Having said that, if the code lets the setup_revisions() parse the
command line, the usual "unless disambiguated with a double-dash
'--', stop at the first non-revision and take everything as paths
but for safety all of them must refer to an existing path in the
working tree" behaviour should trigger, and it is not specific to
"backfill", and may already be tested centrally (if not, I do not
object to such a new set of tests).

For any cmd that take revisions and pathspec (e.g., log, rev-list,
grep) these should hold true:

  $ git $cmd [<options>]... Makefile HEAD

    Without disambiguation the command should say "Ah, Makefile
    is not a revision, so we will see no more revisions, and
    everything, including the current one we are looking at, must be
    an existing path on the working tree", and barfs on HEAD that
    does not exist as a file/directory.

  $ git $cmd [<options>]... Makefile -- HEAD

    With disambiguation, the command should verify everything before
    the double-dash to be a rev, and barf that Makefile is not a
    rev.

  $ git $cmd [<options>]... -- Makefile HEAD

    With disambiguation, the command should take everything after
    the double-dash to be a pathspec element without barfing.  After
    all, it may be referring to a path that used to exist in some
    revision the command will look at.

Thanks.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/23/2026 11:29 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> Before the recent changes to parse rev-list arguments inside of 'git
>> backfill', the builtin would take arbitrary arguments without complaint (and
>> ignore them). This was noticed and a patch was sent [1] which motivates this
>> change to encode this behavior in test.
>>
>> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>>
>> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>>  t/t5620-backfill.sh | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
>> index c6f54ee91c..85740f1f13 100755
>> --- a/t/t5620-backfill.sh
>> +++ b/t/t5620-backfill.sh
>> @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>>  
>>  . ./test-lib.sh
>>  
>> +test_expect_success 'backfill rejects unexpected arguments' '
>> +	test_must_fail git backfill unexpected-arg 2>err &&
>> +	test_grep "ambiguous argument .*unexpected-arg" err &&
>> +
>> +	test_must_fail git backfill --all --firt-parent unexpected-arg 2>err &&
>> +	test_grep "ambiguous argument .*unexpected-arg" err
>> +'
> 
> Hmph, I would have expected that an earlier --firt-parent on the
> command line would trigger "unknown option" instead.

Interesting that my mistype has demonstrated an interesting
behavior here. It turns out that random options starting with
'--' are accepted here, including --unexpected-arg.

This means that we actually have room here for some improvement!

I'll see what can be done to make even these arguments be seen
as failures.

> Having said that, if the code lets the setup_revisions() parse the
> command line, the usual "unless disambiguated with a double-dash
> '--', stop at the first non-revision and take everything as paths
> but for safety all of them must refer to an existing path in the
> working tree" behaviour should trigger, and it is not specific to
> "backfill", and may already be tested centrally (if not, I do not
> object to such a new set of tests).

Thanks,
-Stolee

@@ -63,9 +63,12 @@ OPTIONS
current sparse-checkout. If the sparse-checkout feature is enabled,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Mon, Mar 23, 2026 at 11:40:16AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e80fc1b694..90c9d84793 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  					 builtin_backfill_usage, options);
>  
>  	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
> -			     0);
> +			     PARSE_OPT_KEEP_UNKNOWN_OPT |
> +			     PARSE_OPT_KEEP_ARGV0 |
> +			     PARSE_OPT_KEEP_DASHDASH);
> +
> +	repo_init_revisions(repo, &ctx.revs, prefix);
> +	argc = setup_revisions(argc, argv, &ctx.revs, NULL);

We should probably die here in case we still have unknown arguments.

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/24/2026 3:59 AM, Patrick Steinhardt wrote:
> On Mon, Mar 23, 2026 at 11:40:16AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e80fc1b694..90c9d84793 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>>  					 builtin_backfill_usage, options);
>>  
>>  	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
>> -			     0);
>> +			     PARSE_OPT_KEEP_UNKNOWN_OPT |
>> +			     PARSE_OPT_KEEP_ARGV0 |
>> +			     PARSE_OPT_KEEP_DASHDASH);
>> +
>> +	repo_init_revisions(repo, &ctx.revs, prefix);
>> +	argc = setup_revisions(argc, argv, &ctx.revs, NULL);
> 
> We should probably die here in case we still have unknown arguments.

That is indeed the fix for the bad test in patch 6. I'll make the
necessary update in v3's patch 6 along with the test for it.

Thanks,
-Stolee

Comment thread path-walk.c
@@ -11,6 +11,7 @@
#include "list-objects.h"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Mon, Mar 23, 2026 at 11:40:17AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..0d640e2f24 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -206,6 +207,34 @@ static int add_tree_entries(struct path_walk_context *ctx,
>  				 match != MATCHED)
>  				continue;
>  		}
> +		if (ctx->revs->prune_data.nr) {
> +			struct pathspec *pd = &ctx->revs->prune_data;
> +			bool found = false;
> +
> +			/* remove '/' for these checks. */
> +			path.buf[path.len - 1] = 0;

Hm. Is this _always_ safe to do? We add the directory separator a few
lines further up, but only in the case where `type == OBJ_TREE`. So in
reverse this may mean that there are cases where we don't have a
trailing '/'.

Maybe we should instead:

    did_strip_suffix = strbuf_strip_suffix(path, "/");

    ...

    if (did_strip_suffix)
        strbuf_addch(path, "/");

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/24/2026 3:59 AM, Patrick Steinhardt wrote:
> On Mon, Mar 23, 2026 at 11:40:17AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/path-walk.c b/path-walk.c
>> index 364e4cfa19..0d640e2f24 100644
>> --- a/path-walk.c
>> +++ b/path-walk.c
>> @@ -206,6 +207,34 @@ static int add_tree_entries(struct path_walk_context *ctx,
>>  				 match != MATCHED)
>>  				continue;
>>  		}
>> +		if (ctx->revs->prune_data.nr) {
>> +			struct pathspec *pd = &ctx->revs->prune_data;
>> +			bool found = false;
>> +
>> +			/* remove '/' for these checks. */
>> +			path.buf[path.len - 1] = 0;
> 
> Hm. Is this _always_ safe to do? We add the directory separator a few
> lines further up, but only in the case where `type == OBJ_TREE`. So in
> reverse this may mean that there are cases where we don't have a
> trailing '/'.
> 
> Maybe we should instead:
> 
>     did_strip_suffix = strbuf_strip_suffix(path, "/");
> 
>     ...
> 
>     if (did_strip_suffix)
>         strbuf_addch(path, "/");

This is much cleaner, too! Thanks.

-Stolee

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 7, 2026

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Thu, Mar 26, 2026 at 03:14:48PM +0000, Derrick Stolee via GitGitGadget wrote:
>  6:  9699650aa7 ! 6:  b6423f9595 t5620: test backfill's unknown argument handling
>      @@ Commit message
>       
>           Before the recent changes to parse rev-list arguments inside of 'git
>           backfill', the builtin would take arbitrary arguments without complaint (and
>      -    ignore them). This was noticed and a patch was sent [1] which motivates this
>      -    change to encode this behavior in test.
>      +    ignore them). This was noticed and a patch was sent [1] which motivates
>      +    this change.
>       
>           [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>       
>      +    Note that the revision machinery can output an "ambiguous argument"
>      +    warning if a value not starting with '--' is found and doesn't make
>      +    sense as a reference or a pathspec. For unrecognized arguments starting
>      +    with '--' we need to add logic into builtin/backfill.c to catch leftover
>      +    arguments.
>      +
>           Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
>           Signed-off-by: Derrick Stolee <stolee@gmail.com>
>       
>      + ## builtin/backfill.c ##
>      +@@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>      + 	repo_init_revisions(repo, &ctx.revs, prefix);
>      + 	argc = setup_revisions(argc, argv, &ctx.revs, NULL);
>      + 
>      ++	if (argc > 1)
>      ++		die(_("unrecognized argument: %s"), argv[1]);
>      ++
>      + 	repo_config(repo, git_default_config, NULL);
>      + 
>      + 	if (ctx.sparse < 0)
>      +
>        ## t/t5620-backfill.sh ##

I would've expected this chunk to already come in patch 3, but that
alone isn't really worth a reroll. All the other changes look good to
me, thanks!

Patrick

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 7, 2026

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026 at 12:29:16AM +0000, Derrick Stolee via GitGitGadget wrote:
> The git backfill command assists in downloading missing blobs for blobless
> partial clones. However, its current version lacks some valuable
> functionality. It currently:
> 
>  1. Only walks commits reachable from HEAD.
>  2. It walks all reachable commits to the full history.
>  3. It can focus on the current sparse-checkout definition, but otherwise it
>     doesn't focus on a given pathspec.
> 
> All of these are being updated by this patch series, which allows rev-list
> options to impact the path-walk. These include:
> 
>  1. Specifying a given refspec, including --all.
>  2. Modifying the commit walk, including --first-parent, commit ranges, or
>     recency using --since.
>  3. Modifying the set of paths to download using pathspecs.
> 
> One particularly valuable situation here is that now a user can run git
> backfill -- <path> to download all versions of a specific file or a specific
> directory, accelerating history queries within that path without downloading
> more than necessary. This can accelerate git blame or git log -L for these
> paths, where normally those commands download missing blobs one-by-one
> during its diff algorithms.

Nice.

I think especially blaming is a bit of a sore spot -- downloading blobs
one by one simply doesn't cut it there. I wonder whether we can easily
use the backfill mechanism to fetch blobs automatically in git-blame(1)
so that the user doesn't need to know about git-backfill(1) at all?

Patrick

@@ -63,9 +63,12 @@ OPTIONS
current sparse-checkout. If the sparse-checkout feature is enabled,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026 at 12:29:19AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
> 
> The existing implementation of 'git backfill' only includes downloading
> missing blobs reachable from HEAD. Advanced uses may desire more general
> commit limiting options, such as '--all' for all references, specifying a
> commit range via negative references, or specifying a recency of use such as
> with '--since=<date>'.
> 
> All of these options are available if we use setup_revisions() to parse the
> unknown arguments with the revision machinery. This opens up a large number
> of possibilities, only a small set of which are tested here.
> 
> For documentation, we avoid duplicating the option documentation and instead
> link to the documentation of 'git rev-list'.
> 
> Note that these arguments currently allow specifying a pathspec, which
> modifies the commit history checks but does not limit the paths used in the
> backfill logic. This will be updated in a future change.

Makes me wonder whether reversing the order would have avoided this
slight awkwardness. But let's just stick with the current order, the end
result would be the same anyway.

> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  Documentation/git-backfill.adoc |   3 +
>  builtin/backfill.c              |  19 ++--
>  t/t5620-backfill.sh             | 156 ++++++++++++++++++++++++++++++++
>  3 files changed, 172 insertions(+), 6 deletions(-)
> 
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e80fc1b694..1b5595b27c 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -92,13 +92,14 @@ static int do_backfill(struct backfill_context *ctx)
>  		}
>  	}
>  
> -	repo_init_revisions(ctx->repo, &revs, "");
> -	handle_revision_arg("HEAD", &revs, 0, 0);
> +	/* Walk from HEAD if otherwise unspecified. */
> +	if (!ctx->revs.pending.nr)
> +		handle_revision_arg("HEAD", &ctx->revs, 0, 0);

Can we use `add_head_to_pending(&ctx->revs)` instead?

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/19/26 5:54 AM, Patrick Steinhardt wrote:
> On Tue, Mar 17, 2026 at 12:29:19AM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The existing implementation of 'git backfill' only includes downloading
>> missing blobs reachable from HEAD. Advanced uses may desire more general
>> commit limiting options, such as '--all' for all references, specifying a
>> commit range via negative references, or specifying a recency of use such as
>> with '--since=<date>'.
>>
>> All of these options are available if we use setup_revisions() to parse the
>> unknown arguments with the revision machinery. This opens up a large number
>> of possibilities, only a small set of which are tested here.
>>
>> For documentation, we avoid duplicating the option documentation and instead
>> link to the documentation of 'git rev-list'.
>>
>> Note that these arguments currently allow specifying a pathspec, which
>> modifies the commit history checks but does not limit the paths used in the
>> backfill logic. This will be updated in a future change.
> > Makes me wonder whether reversing the order would have avoided this
> slight awkwardness. But let's just stick with the current order, the end
> result would be the same anyway.
True, we could have added the pathspec logic first, but we wouldn't be able
to test it right away because the parsing comes through the rev-list.

>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>>   Documentation/git-backfill.adoc |   3 +
>>   builtin/backfill.c              |  19 ++--
>>   t/t5620-backfill.sh             | 156 ++++++++++++++++++++++++++++++++
>>   3 files changed, 172 insertions(+), 6 deletions(-)
>>
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e80fc1b694..1b5595b27c 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -92,13 +92,14 @@ static int do_backfill(struct backfill_context *ctx)
>>   		}
>>   	}
>>   >> -	repo_init_revisions(ctx->repo, &revs, "");
>> -	handle_revision_arg("HEAD", &revs, 0, 0);
>> +	/* Walk from HEAD if otherwise unspecified. */
>> +	if (!ctx->revs.pending.nr)
>> +		handle_revision_arg("HEAD", &ctx->revs, 0, 0);
> > Can we use `add_head_to_pending(&ctx->revs)` instead?

Nice. We absolutely can and should.

Thanks,
-Stolee

Comment thread path-walk.c
@@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
match != MATCHED)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026 at 12:29:20AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..e1ad4b0208 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -481,6 +524,18 @@ int walk_objects_by_path(struct path_walk_info *info)
>  	if (info->tags)
>  		info->revs->tag_objects = 1;
>  
> +	if (ctx.revs->prune_data.nr) {
> +		/*
> +		 * Only exact prefix pathspecs are currently supported.
> +		 * Clear any wildcard or magic pathspecs to avoid
> +		 * incorrect prefix matching.
> +		 */
> +		struct pathspec *pd = &ctx.revs->prune_data;
> +
> +		if (pd->has_wildcard || pd->magic)
> +			pd->nr = 0;
> +	}

Huh, curious. Won't this cause a leak? I guess we should rather use
`clear_pathspec()` here.

Also shows that this path is missing test coverage.

Patrick

Comment thread path-walk.c
@@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
match != MATCHED)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026 at 12:29:20AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..e1ad4b0208 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
>  				 match != MATCHED)
>  				continue;
>  		}
> +		if (ctx->revs->prune_data.nr) {
> +			struct pathspec *pd = &ctx->revs->prune_data;
> +			bool found = false;
> +
> +			for (int i = 0; i < pd->nr; i++) {
> +				struct pathspec_item *item = &pd->items[i];
> +
> +				/*
> +				 * Is this path a parent directory of
> +				 * the pathspec item?
> +				 */
> +				if (path.len < (size_t)item->len &&
> +				    !strncmp(path.buf, item->match, path.len) &&
> +				    item->match[path.len - 1] == '/') {
> +					found = true;
> +					break;
> +				}
> +
> +				/*
> +				 * Or, is the pathspec an exact match?
> +				 */
> +				if (path.len == (size_t)item->len &&
> +				    !strcmp(path.buf, item->match)) {
> +					found = true;
> +					break;
> +				}
> +
> +				/*
> +				 * Or, is the pathspec a directory prefix
> +				 * match?
> +				 */
> +				if (path.len > (size_t)item->len &&
> +				    !strncmp(path.buf, item->match, item->len) &&
> +				    path.buf[item->len] == '/') {
> +					found = true;
> +					break;
> +				}

Ah, one more thing: we could expose `dir_prefix()` from "path.c" and
reuse it here.

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/19/26 6:15 AM, Patrick Steinhardt wrote:
> Ah, one more thing: we could expose `dir_prefix()` from "path.c" and
> reuse it here.

Good idea. This becomes

			/*
			 * Continue if either is a directory prefix
			 * of the other.
			 */
			if (dir_prefix(path.buf, item->match) ||
			    dir_prefix(item->match, path.buf)) {
				found = true;
				break;
			}

With the idea that we need to walk the parents of each prefix in
addition to walking all of their children.

Thanks,
-Stolee

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 7, 2026

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/19/2026 5:54 AM, Patrick Steinhardt wrote:
> I think especially blaming is a bit of a sore spot -- downloading blobs
> one by one simply doesn't cut it there. I wonder whether we can easily
> use the backfill mechanism to fetch blobs automatically in git-blame(1)
> so that the user doesn't need to know about git-backfill(1) at all?

I've thought about this a bit, and I'm not sure that we want to run
'git backfill' directly. Instead, it would be nice if we did a "staged"
algorithm for 'git blame':

1. Walk commits according to the pathspec to collect the commits that
   changed the path.

2. Collect the list of blob OIDs that will be needed for computing diffs
   for the line-tracking algorithm.

3. In batches, download groups of missing blobs and then process them
   for line-tracking diffs. (Stop if all lines are blamed; continue to
   next batch if more lines are needed.)

This would be a significant rewrite of the blame algorithm, though. I
briefly considered this approach about a year ago and decided it would
be easier to start with 'git backfill' and see whether that satisfies
most needs.

The biggest reason to maybe avoid 'git backfill HEAD -- <path>' before
_every_ blame operation is that this will add overhead on repeated
calls that may be obnoxious in its own way. Maybe doing an opt-in
'git blame --backfill <path>' would make it easier for users to opt-in
when they want to.

Thanks,
-Stolee

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 7, 2026

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Thu, Mar 19, 2026 at 08:59:01AM -0400, Derrick Stolee wrote:
> On 3/19/2026 5:54 AM, Patrick Steinhardt wrote:
> > I think especially blaming is a bit of a sore spot -- downloading blobs
> > one by one simply doesn't cut it there. I wonder whether we can easily
> > use the backfill mechanism to fetch blobs automatically in git-blame(1)
> > so that the user doesn't need to know about git-backfill(1) at all?
> 
> I've thought about this a bit, and I'm not sure that we want to run
> 'git backfill' directly. Instead, it would be nice if we did a "staged"
> algorithm for 'git blame':
> 
> 1. Walk commits according to the pathspec to collect the commits that
>    changed the path.
> 
> 2. Collect the list of blob OIDs that will be needed for computing diffs
>    for the line-tracking algorithm.
> 
> 3. In batches, download groups of missing blobs and then process them
>    for line-tracking diffs. (Stop if all lines are blamed; continue to
>    next batch if more lines are needed.)
> 
> This would be a significant rewrite of the blame algorithm, though. I
> briefly considered this approach about a year ago and decided it would
> be easier to start with 'git backfill' and see whether that satisfies
> most needs.
> 
> The biggest reason to maybe avoid 'git backfill HEAD -- <path>' before
> _every_ blame operation is that this will add overhead on repeated
> calls that may be obnoxious in its own way. Maybe doing an opt-in
> 'git blame --backfill <path>' would make it easier for users to opt-in
> when they want to.

That's fair. I fully agree that just blindly doing this would be likely
be inefficient. Ideally, the batching logic would only kick in whenever
we see a missing object.

Anyway, this definitely doesn't have to be part of this series, I was
mostly wondering how hard it is to do.

Thanks!

Patrick

Comment thread t/t5620-backfill.sh
@@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> Before the recent changes to parse rev-list arguments inside of 'git
> backfill', the builtin would take arbitrary arguments without complaint (and
> ignore them). This was noticed and a patch was sent [1] which motivates this
> change to encode this behavior in test.
>
> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>
> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  t/t5620-backfill.sh | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
> index c6f54ee91c..85740f1f13 100755
> --- a/t/t5620-backfill.sh
> +++ b/t/t5620-backfill.sh
> @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>  
>  . ./test-lib.sh
>  
> +test_expect_success 'backfill rejects unexpected arguments' '
> +	test_must_fail git backfill unexpected-arg 2>err &&
> +	test_grep "ambiguous argument .*unexpected-arg" err &&
> +
> +	test_must_fail git backfill --all --firt-parent unexpected-arg 2>err &&
> +	test_grep "ambiguous argument .*unexpected-arg" err
> +'

Hmph, I would have expected that an earlier --firt-parent on the
command line would trigger "unknown option" instead.

Having said that, if the code lets the setup_revisions() parse the
command line, the usual "unless disambiguated with a double-dash
'--', stop at the first non-revision and take everything as paths
but for safety all of them must refer to an existing path in the
working tree" behaviour should trigger, and it is not specific to
"backfill", and may already be tested centrally (if not, I do not
object to such a new set of tests).

For any cmd that take revisions and pathspec (e.g., log, rev-list,
grep) these should hold true:

  $ git $cmd [<options>]... Makefile HEAD

    Without disambiguation the command should say "Ah, Makefile
    is not a revision, so we will see no more revisions, and
    everything, including the current one we are looking at, must be
    an existing path on the working tree", and barfs on HEAD that
    does not exist as a file/directory.

  $ git $cmd [<options>]... Makefile -- HEAD

    With disambiguation, the command should verify everything before
    the double-dash to be a rev, and barf that Makefile is not a
    rev.

  $ git $cmd [<options>]... -- Makefile HEAD

    With disambiguation, the command should take everything after
    the double-dash to be a pathspec element without barfing.  After
    all, it may be referring to a path that used to exist in some
    revision the command will look at.

Thanks.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/23/2026 11:29 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> Before the recent changes to parse rev-list arguments inside of 'git
>> backfill', the builtin would take arbitrary arguments without complaint (and
>> ignore them). This was noticed and a patch was sent [1] which motivates this
>> change to encode this behavior in test.
>>
>> [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>>
>> Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>>  t/t5620-backfill.sh | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
>> index c6f54ee91c..85740f1f13 100755
>> --- a/t/t5620-backfill.sh
>> +++ b/t/t5620-backfill.sh
>> @@ -7,6 +7,14 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>>  
>>  . ./test-lib.sh
>>  
>> +test_expect_success 'backfill rejects unexpected arguments' '
>> +	test_must_fail git backfill unexpected-arg 2>err &&
>> +	test_grep "ambiguous argument .*unexpected-arg" err &&
>> +
>> +	test_must_fail git backfill --all --firt-parent unexpected-arg 2>err &&
>> +	test_grep "ambiguous argument .*unexpected-arg" err
>> +'
> 
> Hmph, I would have expected that an earlier --firt-parent on the
> command line would trigger "unknown option" instead.

Interesting that my mistype has demonstrated an interesting
behavior here. It turns out that random options starting with
'--' are accepted here, including --unexpected-arg.

This means that we actually have room here for some improvement!

I'll see what can be done to make even these arguments be seen
as failures.

> Having said that, if the code lets the setup_revisions() parse the
> command line, the usual "unless disambiguated with a double-dash
> '--', stop at the first non-revision and take everything as paths
> but for safety all of them must refer to an existing path in the
> working tree" behaviour should trigger, and it is not specific to
> "backfill", and may already be tested centrally (if not, I do not
> object to such a new set of tests).

Thanks,
-Stolee

@@ -63,9 +63,12 @@ OPTIONS
current sparse-checkout. If the sparse-checkout feature is enabled,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Mon, Mar 23, 2026 at 11:40:16AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e80fc1b694..90c9d84793 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  					 builtin_backfill_usage, options);
>  
>  	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
> -			     0);
> +			     PARSE_OPT_KEEP_UNKNOWN_OPT |
> +			     PARSE_OPT_KEEP_ARGV0 |
> +			     PARSE_OPT_KEEP_DASHDASH);
> +
> +	repo_init_revisions(repo, &ctx.revs, prefix);
> +	argc = setup_revisions(argc, argv, &ctx.revs, NULL);

We should probably die here in case we still have unknown arguments.

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/24/2026 3:59 AM, Patrick Steinhardt wrote:
> On Mon, Mar 23, 2026 at 11:40:16AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e80fc1b694..90c9d84793 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>>  					 builtin_backfill_usage, options);
>>  
>>  	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
>> -			     0);
>> +			     PARSE_OPT_KEEP_UNKNOWN_OPT |
>> +			     PARSE_OPT_KEEP_ARGV0 |
>> +			     PARSE_OPT_KEEP_DASHDASH);
>> +
>> +	repo_init_revisions(repo, &ctx.revs, prefix);
>> +	argc = setup_revisions(argc, argv, &ctx.revs, NULL);
> 
> We should probably die here in case we still have unknown arguments.

That is indeed the fix for the bad test in patch 6. I'll make the
necessary update in v3's patch 6 along with the test for it.

Thanks,
-Stolee

Comment thread path-walk.c
@@ -11,6 +11,7 @@
#include "list-objects.h"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Mon, Mar 23, 2026 at 11:40:17AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 364e4cfa19..0d640e2f24 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -206,6 +207,34 @@ static int add_tree_entries(struct path_walk_context *ctx,
>  				 match != MATCHED)
>  				continue;
>  		}
> +		if (ctx->revs->prune_data.nr) {
> +			struct pathspec *pd = &ctx->revs->prune_data;
> +			bool found = false;
> +
> +			/* remove '/' for these checks. */
> +			path.buf[path.len - 1] = 0;

Hm. Is this _always_ safe to do? We add the directory separator a few
lines further up, but only in the case where `type == OBJ_TREE`. So in
reverse this may mean that there are cases where we don't have a
trailing '/'.

Maybe we should instead:

    did_strip_suffix = strbuf_strip_suffix(path, "/");

    ...

    if (did_strip_suffix)
        strbuf_addch(path, "/");

Patrick

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/24/2026 3:59 AM, Patrick Steinhardt wrote:
> On Mon, Mar 23, 2026 at 11:40:17AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/path-walk.c b/path-walk.c
>> index 364e4cfa19..0d640e2f24 100644
>> --- a/path-walk.c
>> +++ b/path-walk.c
>> @@ -206,6 +207,34 @@ static int add_tree_entries(struct path_walk_context *ctx,
>>  				 match != MATCHED)
>>  				continue;
>>  		}
>> +		if (ctx->revs->prune_data.nr) {
>> +			struct pathspec *pd = &ctx->revs->prune_data;
>> +			bool found = false;
>> +
>> +			/* remove '/' for these checks. */
>> +			path.buf[path.len - 1] = 0;
> 
> Hm. Is this _always_ safe to do? We add the directory separator a few
> lines further up, but only in the case where `type == OBJ_TREE`. So in
> reverse this may mean that there are cases where we don't have a
> trailing '/'.
> 
> Maybe we should instead:
> 
>     did_strip_suffix = strbuf_strip_suffix(path, "/");
> 
>     ...
> 
>     if (did_strip_suffix)
>         strbuf_addch(path, "/");

This is much cleaner, too! Thanks.

-Stolee

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget bot commented Apr 7, 2026

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Thu, Mar 26, 2026 at 03:14:48PM +0000, Derrick Stolee via GitGitGadget wrote:
>  6:  9699650aa7 ! 6:  b6423f9595 t5620: test backfill's unknown argument handling
>      @@ Commit message
>       
>           Before the recent changes to parse rev-list arguments inside of 'git
>           backfill', the builtin would take arbitrary arguments without complaint (and
>      -    ignore them). This was noticed and a patch was sent [1] which motivates this
>      -    change to encode this behavior in test.
>      +    ignore them). This was noticed and a patch was sent [1] which motivates
>      +    this change.
>       
>           [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
>       
>      +    Note that the revision machinery can output an "ambiguous argument"
>      +    warning if a value not starting with '--' is found and doesn't make
>      +    sense as a reference or a pathspec. For unrecognized arguments starting
>      +    with '--' we need to add logic into builtin/backfill.c to catch leftover
>      +    arguments.
>      +
>           Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
>           Signed-off-by: Derrick Stolee <stolee@gmail.com>
>       
>      + ## builtin/backfill.c ##
>      +@@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>      + 	repo_init_revisions(repo, &ctx.revs, prefix);
>      + 	argc = setup_revisions(argc, argv, &ctx.revs, NULL);
>      + 
>      ++	if (argc > 1)
>      ++		die(_("unrecognized argument: %s"), argv[1]);
>      ++
>      + 	repo_config(repo, git_default_config, NULL);
>      + 
>      + 	if (ctx.sparse < 0)
>      +
>        ## t/t5620-backfill.sh ##

I would've expected this chunk to already come in patch 3, but that
alone isn't really worth a reroll. All the other changes look good to
me, thanks!

Patrick

@derrickstolee derrickstolee deleted the backfill-revs branch April 9, 2026 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants