Commit d9f5a98
pack-objects: support sparse:oid filter with path-walk
The --filter=sparse:<oid> option to 'git pack-objects' allows focusing an object set to a sparse-checkout definition. This reduces the set of matching blobs while retaining all reachable trees. No server currently supports fetching with this filter because it is expensive to compute and reachability bitmaps do not help without a significant effort to extend the bitmap feature to store bitmaps for each supported sparse-checkout definition.

Without focusing on serving fetches and clones with these filters, there are still benefits that could be realized by making this faster. With the sparse index, it's more realistic now than ever to be able to operate a local clone that was bootstrapped by a packfile created with a sparse filter, because the missing trees are not needed to move a sparse-checkout from one commit to another or to view the history of any path in scope. Such clones could perhaps be bootstrapped by partial bundles.

Previously, constructing these sparse packs has been incredibly computationally inefficient. The revision walk that explores which objects are in scope spends a lot of time checking each object to see if it matches the sparse-checkout patterns, causing quadratic behavior (number of objects times number of sparse-checkout patterns). This improves somewhat when using cone-mode sparse-checkout patterns that can use hashtables and prefix matches to determine containment. However, the check per object is still too expensive for most cases.

This is where the path-walk feature comes in. We can proceed as normal by placing objects in bins by path and _then_ check a group of objects all at once. Since sparse:<oid> only restricts blobs, the path-walk must include all reachable trees while using the cone-mode patterns to skip blobs at paths outside the sparse scope. This establishes a baseline for a potential future "treesparse:<oid>" filter that would also restrict trees, but introducing such a new filter is deferred to a later change.

The implementation here is focused around loading the sparse-checkout patterns from the provided object ID and checking that the patterns are indeed cone-mode patterns. We can then load the correct pattern list into the path walk context and use the logic that already exists from bff4555 (backfill: add --sparse option, 2025-02-03), though that feature loads sparse-checkout patterns from the worktree's local settings and also restricts tree objects.

We use a combination of errors and warnings to signal problems during this load. The difference is that errors are likely fatal for the non-path-walk version while the warnings are probably just implementation details for the path-walk version and the 'git pack-objects' command can fall back to the revision walk version.

Now that the SEEN flag is deferred until after pattern checks (from the previous commit), handle the case where a tree with a shared OID appears at both an out-of-cone and in-cone path. When trees are not being pruned (pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone path so that in-cone blobs within it are discovered.

The new tests in t5317 and t6601 demonstrate this behavior and would fail without these changes.

The performance test p5315 shows the impact of this change when using sparse filters:

Test                                           HEAD~1    HEAD
----------------------------------------------------------------------
5315.10: repack (sparse:oid)                   77.98     77.47  -0.7%
5315.11: repack size (sparse:oid)              187.5M    187.4M -0.0%
5315.12: repack (sparse:oid, --path-walk)      77.91     31.41  -59.7%
5315.13: repack size (sparse:oid, --path-walk) 187.5M    161.1M -14.1%

These performance tests were run on the Git repository. The --path-walk feature shows meaningful space savings (14% smaller for sparse packs) and dramatic time savings (60% faster) by leveraging the path-walk's ability to skip blobs outside the sparse scope.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
1 parent 2360a5b commit d9f5a98
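
The cost argument in the commit message can be illustrated with a small C sketch. This is not Git's implementation: `struct cone`, `struct path_bin`, `path_in_cone()` and `count_in_scope()` are hypothetical names, and the containment test is a simplified stand-in for cone-mode matching (root-level files plus everything under the listed directory prefixes). It only shows that grouping objects by path allows one pattern check per path rather than one per object.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified cone: root-level files plus listed directories. */
struct cone {
	const char **dirs;	/* each entry ends with '/', e.g. "inc/" */
	size_t nr;
};

/* Hypothetical bin: every object version discovered at one path. */
struct path_bin {
	const char *path;
	size_t nr_objects;
};

/* Return 1 if 'path' is a root-level file or lies under a cone directory. */
static int path_in_cone(const struct cone *c, const char *path)
{
	size_t i;

	if (!strchr(path, '/'))
		return 1;
	for (i = 0; i < c->nr; i++)
		if (!strncmp(path, c->dirs[i], strlen(c->dirs[i])))
			return 1;
	return 0;
}

/*
 * Count the objects kept by the filter, testing each path only once.
 * 'checks' reports how many pattern checks were performed.
 */
static size_t count_in_scope(const struct cone *c,
			     const struct path_bin *bins, size_t nr_bins,
			     size_t *checks)
{
	size_t i, kept = 0;

	*checks = 0;
	for (i = 0; i < nr_bins; i++) {
		(*checks)++;
		if (path_in_cone(c, bins[i].path))
			kept += bins[i].nr_objects;
	}
	return kept;
}
```

In this toy model the number of checks grows with the number of distinct paths, so a path with many historical versions still costs a single check.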

6 files changed

Lines changed: 350 additions & 10 deletions

Documentation/git-backfill.adoc

Lines changed: 4 additions & 0 deletions
@@ -80,6 +80,10 @@ OPTIONS
 +
 You may also use commit-limiting options understood by
 linkgit:git-rev-list[1] such as `--first-parent`, `--since`, or pathspecs.
++
+Most `--filter=<spec>` options don't work with the purpose of
+`git backfill`, but the `sparse:<oid>` filter is integrated to provide a
+focused set of paths to download, distinct from the `--sparse` option.
 
 SEE ALSO
 --------

Documentation/git-pack-objects.adoc

Lines changed: 2 additions & 1 deletion
@@ -404,7 +404,8 @@ will be automatically changed to version `1`.
 +
 Incompatible with `--delta-islands`. The `--use-bitmap-index` option is
 ignored in the presence of `--path-walk`. The `--path-walk` option
-supports the `--filter=<spec>` forms `blob:none` and `blob:limit=<n>`.
+supports the `--filter=<spec>` forms `blob:none`, `blob:limit=<n>`, and
+`sparse:<oid>`.
 
 
 DELTA ISLANDS

builtin/pack-objects.c

Lines changed: 11 additions & 5 deletions
@@ -4754,7 +4754,7 @@ static int add_objects_by_path(const char *path,
 	return 0;
 }
 
-static void get_object_list_path_walk(struct rev_info *revs)
+static int get_object_list_path_walk(struct rev_info *revs)
 {
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	unsigned int processed = 0;
@@ -4777,8 +4777,9 @@ static void get_object_list_path_walk(struct rev_info *revs)
 	result = walk_objects_by_path(&info);
 	trace2_region_leave("pack-objects", "path-walk", revs->repo);
 
-	if (result)
-		die(_("failed to pack objects via path-walk"));
+	path_walk_info_clear(&info);
+
+	return result;
 }
 
 static void get_object_list(struct rev_info *revs, struct strvec *argv)
@@ -4841,8 +4842,13 @@ static void get_object_list(struct rev_info *revs, struct strvec *argv)
 		fn_show_object = show_object;
 
 	if (path_walk) {
-		get_object_list_path_walk(revs);
-	} else {
+		if (get_object_list_path_walk(revs)) {
+			warning(_("failed to pack objects via path-walk"));
+			path_walk = 0;
+		}
+	}
+
+	if (!path_walk) {
 		if (prepare_revision_walk(revs))
 			die(_("revision walk setup failed"));
 		mark_edges_uninteresting(revs, show_edge, sparse);
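
The control-flow change in this file (a failed path-walk now downgrades to a warning and falls through to the revision walk instead of calling die()) can be sketched in isolation. The names below (`walk_fn`, `collect_objects()`, `failing_walk()`, `counting_walk()`) are hypothetical and exist only for illustration; the real code operates on `struct rev_info` and calls die() if the revision walk setup itself fails.

```c
#include <stdio.h>

/* Hypothetical walker signature: returns 0 on success, nonzero on failure. */
typedef int (*walk_fn)(void *data);

enum walk_used { WALK_PATH, WALK_REVISION };

/*
 * Try the path-walk first when requested. If it fails, warn and fall
 * back to the revision walk rather than aborting.
 */
static int collect_objects(int want_path_walk,
			   walk_fn path_walk, walk_fn revision_walk,
			   void *data, enum walk_used *used)
{
	if (want_path_walk) {
		if (!path_walk(data)) {
			*used = WALK_PATH;
			return 0;
		}
		fprintf(stderr,
			"warning: failed to pack objects via path-walk\n");
	}

	*used = WALK_REVISION;
	return revision_walk(data);
}

/* Demo walker that always fails, e.g. on a non-cone-mode filter. */
static int failing_walk(void *data)
{
	(void)data;
	return -1;
}

/* Demo walker that succeeds and counts how often it was invoked. */
static int counting_walk(void *data)
{
	++*(int *)data;
	return 0;
}
```

Splitting the original `if/else` into two independent `if` blocks is what makes the fallback possible: the second block runs both when path-walk was never requested and when it was requested but failed.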

path-walk.c

Lines changed: 77 additions & 4 deletions
@@ -10,6 +10,7 @@
 #include "hex.h"
 #include "list-objects.h"
 #include "list-objects-filter-options.h"
+#include "object-name.h"
 #include "odb.h"
 #include "object.h"
 #include "oid-array.h"
@@ -180,10 +181,6 @@ static int add_tree_entries(struct path_walk_context *ctx,
 			return -1;
 		}
 
-		/* Skip this object if already seen. */
-		if (o->flags & SEEN)
-			continue;
-
 		strbuf_setlen(&path, base_len);
 		strbuf_add(&path, entry.path, entry.pathlen);
 
@@ -194,6 +191,40 @@ static int add_tree_entries(struct path_walk_context *ctx,
 		if (type == OBJ_TREE)
 			strbuf_addch(&path, '/');
 
+		if (o->flags & SEEN) {
+			/*
+			 * A tree with a shared OID may appear at multiple
+			 * paths. Even though we already added this tree to
+			 * the output at some other path, we still need to
+			 * walk into it at this in-cone path to discover
+			 * blobs that were not found at the earlier
+			 * out-of-cone path.
+			 *
+			 * Only do this for paths not yet in our map, to
+			 * avoid duplicate entries when the same tree OID
+			 * appears at the same path across multiple commits.
+			 */
+			if (type == OBJ_TREE && ctx->info->pl &&
+			    ctx->info->pl->use_cone_patterns &&
+			    !ctx->info->pl_sparse_trees &&
+			    !strmap_contains(&ctx->paths_to_lists, path.buf)) {
+				int dtype;
+				enum pattern_match_result m;
+				m = path_matches_pattern_list(path.buf, path.len,
+							      path.buf + base_len,
+							      &dtype,
+							      ctx->info->pl,
+							      ctx->repo->index);
+				if (m != NOT_MATCHED) {
+					add_path_to_list(ctx, path.buf, type,
+							 &entry.oid,
+							 !(o->flags & UNINTERESTING));
+					push_to_stack(ctx, path.buf);
+				}
+			}
+			continue;
+		}
+
 		if (ctx->info->pl) {
 			int dtype;
 			enum pattern_match_result match;
@@ -543,6 +574,48 @@ static int prepare_filters(struct path_walk_info *info,
 		}
 		return 1;
 
+	case LOFC_SPARSE_OID:
+		if (info) {
+			struct object_id sparse_oid;
+			struct repository *repo = info->revs->repo;
+
+			if (info->pl) {
+				warning(_("sparse filter cannot be combined with existing sparse patterns"));
+				return 0;
+			}
+
+			if (repo_get_oid_with_flags(repo,
+						    options->sparse_oid_name,
+						    &sparse_oid,
+						    GET_OID_BLOB)) {
+				error(_("unable to access sparse blob in '%s'"),
+				      options->sparse_oid_name);
+				return 0;
+			}
+
+			CALLOC_ARRAY(info->pl, 1);
+			info->pl->use_cone_patterns = 1;
+
+			if (add_patterns_from_blob_to_list(&sparse_oid, "", 0,
+							   info->pl) < 0) {
+				clear_pattern_list(info->pl);
+				FREE_AND_NULL(info->pl);
+				error(_("unable to parse sparse filter data in '%s'"),
+				      oid_to_hex(&sparse_oid));
+				return 0;
+			}
+
+			if (!info->pl->use_cone_patterns) {
+				clear_pattern_list(info->pl);
+				FREE_AND_NULL(info->pl);
+				warning(_("sparse filter is not cone-mode compatible"));
+				return 0;
+			}
+
+			list_objects_filter_release(options);
+		}
+		return 1;
+
 	default:
 		error(_("object filter '%s' not supported by the path-walk API"),
 		      list_objects_filter_spec(options));
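
The condition guarding the re-walk of an already-SEEN tree in add_tree_entries() combines five independent facts. The sketch below restates it as a standalone predicate so each input can be examined separately. `struct seen_entry`, its field names and `should_rewalk_seen()` are hypothetical; each field stands in for the expression named in its comment, and this is a reading of the diff rather than code from Git.

```c
#include <stdbool.h>

enum entry_type { ENTRY_BLOB, ENTRY_TREE };

/* Hypothetical summary of the state consulted for a SEEN tree entry. */
struct seen_entry {
	enum entry_type type;		/* type == OBJ_TREE in the diff */
	bool have_cone_patterns;	/* info->pl && pl->use_cone_patterns */
	bool prune_trees;		/* info->pl_sparse_trees */
	bool path_already_binned;	/* strmap_contains(&paths_to_lists, path) */
	bool path_matches;		/* match result != NOT_MATCHED */
};

/*
 * An already-SEEN entry is walked again only if it is a tree reached at
 * a new in-cone path while trees are not being pruned. Everything else
 * is skipped, as before the change.
 */
static bool should_rewalk_seen(const struct seen_entry *e)
{
	return e->type == ENTRY_TREE &&
	       e->have_cone_patterns &&
	       !e->prune_trees &&
	       !e->path_already_binned &&
	       e->path_matches;
}
```

In the pw_shared test scenario, the tree shared by `aaa/` and `zzz/` is first reached at the out-of-cone path `aaa/`. When it is reached again at `zzz/`, all five conditions hold, so the walk descends and finds `zzz/file`.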

t/t5317-pack-objects-filter-objects.sh

Lines changed: 125 additions & 0 deletions
@@ -478,4 +478,129 @@ test_expect_success 'verify pack-objects w/ --missing=allow-any' '
 	EOF
 '
 
+# Test that --path-walk produces the same object set as standard traversal
+# when using sparse:oid filters with cone-mode patterns.
+#
+# The sparse:oid filter restricts only blobs, not trees. Both standard
+# and path-walk should produce identical sets of blobs, commits, and trees.
+
+test_expect_success 'setup pw_sparse for path-walk comparison' '
+	git init pw_sparse &&
+	mkdir -p pw_sparse/inc/sub pw_sparse/exc/sub &&
+
+	for n in 1 2
+	do
+		echo "inc $n" >pw_sparse/inc/file$n &&
+		echo "inc sub $n" >pw_sparse/inc/sub/file$n &&
+		echo "exc $n" >pw_sparse/exc/file$n &&
+		echo "exc sub $n" >pw_sparse/exc/sub/file$n &&
+		echo "root $n" >pw_sparse/root$n || return 1
+	done &&
+
+	git -C pw_sparse add . &&
+	git -C pw_sparse commit -m "first" &&
+
+	echo "inc 1 modified" >pw_sparse/inc/file1 &&
+	echo "exc 1 modified" >pw_sparse/exc/file1 &&
+	echo "root 1 modified" >pw_sparse/root1 &&
+	git -C pw_sparse add . &&
+	git -C pw_sparse commit -m "second" &&
+
+	# Cone-mode sparse pattern: include root + inc/
+	printf "/*\n!/*/\n/inc/\n" |
+	git -C pw_sparse hash-object -w --stdin >sparse_oid
+'
+
+test_expect_success 'sparse:oid with --path-walk produces same blobs' '
+	oid=$(cat sparse_oid) &&
+
+	git -C pw_sparse pack-objects --revs --stdout \
+		--filter=sparse:oid=$oid >standard.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C pw_sparse index-pack ../standard.pack &&
+	git -C pw_sparse verify-pack -v ../standard.pack >standard_verify &&
+
+	git -C pw_sparse pack-objects --revs --stdout \
+		--path-walk --filter=sparse:oid=$oid >pathwalk.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C pw_sparse index-pack ../pathwalk.pack &&
+	git -C pw_sparse verify-pack -v ../pathwalk.pack >pathwalk_verify &&
+
+	# Blobs must match exactly
+	grep -E "^[0-9a-f]{40} blob" standard_verify |
+	awk "{print \$1}" | sort >standard_blobs &&
+	grep -E "^[0-9a-f]{40} blob" pathwalk_verify |
+	awk "{print \$1}" | sort >pathwalk_blobs &&
+	test_cmp standard_blobs pathwalk_blobs &&
+
+	# Commits must match exactly
+	grep -E "^[0-9a-f]{40} commit" standard_verify |
+	awk "{print \$1}" | sort >standard_commits &&
+	grep -E "^[0-9a-f]{40} commit" pathwalk_verify |
+	awk "{print \$1}" | sort >pathwalk_commits &&
+	test_cmp standard_commits pathwalk_commits
+'
+
+test_expect_success 'sparse:oid with --path-walk includes all trees' '
+	# The sparse:oid filter restricts only blobs, not trees.
+	# Both standard and path-walk should include the same trees.
+	grep -E "^[0-9a-f]{40} tree" standard_verify |
+	awk "{print \$1}" | sort >standard_trees &&
+	grep -E "^[0-9a-f]{40} tree" pathwalk_verify |
+	awk "{print \$1}" | sort >pathwalk_trees &&
+
+	test_cmp standard_trees pathwalk_trees
+'
+
+# Test the edge case where the same tree/blob OID appears at both an
+# in-cone and out-of-cone path. When sibling directories have identical
+# contents, they share a tree OID. The path-walk defers marking objects
+# SEEN until after checking sparse patterns, so an object at an out-of-cone
+# path can still be discovered at an in-cone path.
+
+test_expect_success 'setup pw_shared for shared OID across cone boundary' '
+	git init pw_shared &&
+	mkdir pw_shared/aaa pw_shared/zzz &&
+	echo "shared content" >pw_shared/aaa/file &&
+	echo "shared content" >pw_shared/zzz/file &&
+	echo "root file" >pw_shared/rootfile &&
+	git -C pw_shared add . &&
+	git -C pw_shared commit -m "aaa and zzz share tree OID" &&
+
+	# Verify they share a tree OID
+	aaa_tree=$(git -C pw_shared rev-parse HEAD:aaa) &&
+	zzz_tree=$(git -C pw_shared rev-parse HEAD:zzz) &&
+	test "$aaa_tree" = "$zzz_tree" &&
+
+	# Cone pattern: include root + zzz/ (not aaa/)
+	printf "/*\n!/*/\n/zzz/\n" |
+	git -C pw_shared hash-object -w --stdin >shared_sparse_oid
+'
+
+test_expect_success 'shared tree OID: --path-walk blobs match standard' '
+	oid=$(cat shared_sparse_oid) &&
+
+	git -C pw_shared pack-objects --revs --stdout \
+		--filter=sparse:oid=$oid >shared_std.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C pw_shared index-pack ../shared_std.pack &&
+	git -C pw_shared verify-pack -v ../shared_std.pack >shared_std_verify &&
+
+	git -C pw_shared pack-objects --revs --stdout \
+		--path-walk --filter=sparse:oid=$oid >shared_pw.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C pw_shared index-pack ../shared_pw.pack &&
+	git -C pw_shared verify-pack -v ../shared_pw.pack >shared_pw_verify &&
+
+	grep -E "^[0-9a-f]{40} blob" shared_std_verify |
+	awk "{print \$1}" | sort >shared_std_blobs &&
+	grep -E "^[0-9a-f]{40} blob" shared_pw_verify |
+	awk "{print \$1}" | sort >shared_pw_blobs &&
+	test_cmp shared_std_blobs shared_pw_blobs
+'
+
 test_done
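
The tests above select objects from `git verify-pack -v` output with `grep -E "^[0-9a-f]{40} <type>"`. As a rough illustration of what that filter accepts, here is the same check written in C. `line_has_type()` is a hypothetical helper, not part of Git or its test suite, and the `hexsz` parameter is an addition for illustration: the tests hard-code 40 hex digits, which corresponds to SHA-1 object IDs, whereas SHA-256 object IDs are 64 hex digits long.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if 'line' starts with 'hexsz' lowercase hex digits, a single
 * space, and then the given type name (prefix match, like the grep).
 */
static int line_has_type(const char *line, const char *type, size_t hexsz)
{
	size_t i;

	for (i = 0; i < hexsz; i++) {
		char ch = line[i];

		/* A NUL byte fails this test, so short lines stop here. */
		if (!((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f')))
			return 0;
	}
	if (line[hexsz] != ' ')
		return 0;
	return !strncmp(line + hexsz + 1, type, strlen(type));
}
```

Summary lines such as `non delta: 3 objects` are rejected at the first character, which is why the tests can pipe the full `verify-pack -v` output through the filter without stripping the trailer first.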
