diff mbox series

bitmaps: don't recurse into trees already in the bitmap

Message ID YMcExxF7DI6k+sZ+@coredump.intra.peff.net (mailing list archive)
State New
Headers show
Series bitmaps: don't recurse into trees already in the bitmap | expand

Commit Message

Jeff King June 14, 2021, 7:27 a.m. UTC
If an object is already mentioned in a reachability bitmap we are
building, then by definition so are all of the objects it can reach. We
have an optimization to stop traversing commits when we see they are
already in the bitmap, but we don't do the same for trees.

It's generally unavoidable to recurse into trees for commits not yet
covered by bitmaps (since most commits generally do have unique
top-level trees). But they usually have subtrees that are shared with
other commits (i.e., all of the subtrees the commit _didn't_ touch). And
some of those commits (and their trees) may be covered by the bitmap.

Usually this isn't _too_ big a deal, because we'll visit those subtrees
only once in total for the whole walk. But if you have a large number of
unbitmapped commits, and if your tree is big, then you may end up
opening a lot of sub-trees for no good reason.

We can use the same optimization we do for commits here: when we are
about to open a tree, see if it's in the bitmap (either the one we are
building, or the "seen" bitmap which covers the UNINTERESTING side of
the bitmap when doing a set-difference).

This works especially well because we'll visit all commits before
hitting any trees. So even in a history like:

  A -- B

if "A" has a bitmap on disk but "B" doesn't, we'll already have OR-ed in
the results from A before looking at B's tree (so we really will only
look at trees touched by B).

For most repositories, the timings produced by p5310 are unspectacular.
Here's linux.git:

  Test                         HEAD^             HEAD
  --------------------------------------------------------------------
  5310.4: simulated clone      6.00(5.90+0.10)   5.98(5.90+0.08) -0.3%
  5310.5: simulated fetch      2.98(5.45+0.18)   2.85(5.31+0.18) -4.4%
  5310.7: rev-list (commits)   0.32(0.29+0.03)   0.33(0.30+0.03) +3.1%
  5310.8: rev-list (objects)   1.48(1.44+0.03)   1.49(1.44+0.05) +0.7%

Any improvement there is within the noise (the +3.1% on test 7 has to be
noise, since we are not recursing into trees, and thus the new code
isn't even run). The results for git.git are likewise uninteresting.

But here are numbers from some other real-world repositories (that are
not public). This one's tree is comparable in size to linux.git, but has
~16k refs (and so less complete bitmap coverage):

  Test                         HEAD^               HEAD
  -------------------------------------------------------------------------
  5310.4: simulated clone      38.34(39.86+0.74)   33.95(35.53+0.76) -11.5%
  5310.5: simulated fetch      2.29(6.31+0.35)     2.20(5.97+0.41) -3.9%
  5310.7: rev-list (commits)   0.99(0.86+0.13)     0.96(0.85+0.11) -3.0%
  5310.8: rev-list (objects)   11.32(11.04+0.27)   6.59(6.37+0.21) -41.8%

And here's another with a very large tree (~340k entries), and a fairly
large number of refs (~10k):

  Test                         HEAD^               HEAD
  -------------------------------------------------------------------------
  5310.3: simulated clone      53.83(54.71+1.54)   39.77(40.76+1.50) -26.1%
  5310.4: simulated fetch      19.91(20.11+0.56)   19.79(19.98+0.67) -0.6%
  5310.6: rev-list (commits)   0.54(0.44+0.11)     0.51(0.43+0.07) -5.6%
  5310.7: rev-list (objects)   24.32(23.59+0.73)   9.85(9.49+0.36) -59.5%

This patch provides substantial improvements in these larger cases, and
have any drawbacks for smaller ones (the cost of the bitmap check is
quite small compared to an actual tree traversal).

Note that we have to add a version of revision.c's include_check
callback which handles non-commits. We could possibly consolidate this
into a single callback for all objects types, as there's only one user
of the feature which would need converted (pack-bitmap.c:should_include).
That would in theory let us avoid duplicating any logic. But when I
tried it, the code ended up much worse to read, with lots of repeated
"if it's a commit do this, otherwise do that". Having two separate
callbacks splits that naturally, and matches the existing split of
show_commit/show_object callbacks.

Signed-off-by: Jeff King <peff@peff.net>
---
Patrick, I cc'd you since you mentioned you'd tried using bitmaps to
optimize a few queries, but they sometimes made things slower. This
might help some (but I doubt it will make all problems go away).

Of course, any review is appreciated, as well. :)

 list-objects.c |  3 +++
 pack-bitmap.c  | 17 +++++++++++++++++
 revision.h     |  1 +
 3 files changed, 21 insertions(+)

Comments

Jeff King June 14, 2021, 12:05 p.m. UTC | #1
On Mon, Jun 14, 2021 at 03:27:04AM -0400, Jeff King wrote:

> @@ -620,6 +636,7 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
>  		incdata.seen = seen;
>  
>  		revs->include_check = should_include;
> +		revs->include_check_obj = should_include_obj;
>  		revs->include_check_data = &incdata;
>  
>  		if (prepare_revision_walk(revs))

Whoops. This is missing one crucial thing. I wrote this patch a month or
two ago, but cherry-picked it onto the current master. These days we
have 1e951c6473 (pack-bitmap: clean up include_check after use,
2021-04-28), so we need to do the same cleanup.

t5304 notices, and I even caught it before sending, but I failed to
stage the change before running "git commit --amend". One of these days
I will figure out this whole "git" thing.

The squashable fix is just:

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 9dfee4a5b7..bfc10148f5 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -650,6 +650,7 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 					      &show_data, NULL);
 
 		revs->include_check = NULL;
+		revs->include_check_obj = NULL;
 		revs->include_check_data = NULL;
 	}
 

but the complete corrected patch is below.

-- >8 --
Subject: [PATCH] bitmaps: don't recurse into trees already in the bitmap

If an object is already mentioned in a reachability bitmap we are
building, then by definition so are all of the objects it can reach. We
have an optimization to stop traversing commits when we see they are
already in the bitmap, but we don't do the same for trees.

It's generally unavoidable to recurse into trees for commits not yet
covered by bitmaps (since most commits generally do have unique
top-level trees). But they usually have subtrees that are shared with
other commits (i.e., all of the subtrees the commit _didn't_ touch). And
some of those commits (and their trees) may be covered by the bitmap.

Usually this isn't _too_ big a deal, because we'll visit those subtrees
only once in total for the whole walk. But if you have a large number of
unbitmapped commits, and if your tree is big, then you may end up
opening a lot of sub-trees for no good reason.

We can use the same optimization we do for commits here: when we are
about to open a tree, see if it's in the bitmap (either the one we are
building, or the "seen" bitmap which covers the UNINTERESTING side of
the bitmap when doing a set-difference).

This works especially well because we'll visit all commits before
hitting any trees. So even in a history like:

  A -- B

if "A" has a bitmap on disk but "B" doesn't, we'll already have OR-ed in
the results from A before looking at B's tree (so we really will only
look at trees touched by B).

For most repositories, the timings produced by p5310 are unspectacular.
Here's linux.git:

  Test                         HEAD^             HEAD
  --------------------------------------------------------------------
  5310.4: simulated clone      6.00(5.90+0.10)   5.98(5.90+0.08) -0.3%
  5310.5: simulated fetch      2.98(5.45+0.18)   2.85(5.31+0.18) -4.4%
  5310.7: rev-list (commits)   0.32(0.29+0.03)   0.33(0.30+0.03) +3.1%
  5310.8: rev-list (objects)   1.48(1.44+0.03)   1.49(1.44+0.05) +0.7%

Any improvement there is within the noise (the +3.1% on test 7 has to be
noise, since we are not recursing into trees, and thus the new code
isn't even run). The results for git.git are likewise uninteresting.

But here are numbers from some other real-world repositories (that are
not public). This one's tree is comparable in size to linux.git, but has
~16k refs (and so less complete bitmap coverage):

  Test                         HEAD^               HEAD
  -------------------------------------------------------------------------
  5310.4: simulated clone      38.34(39.86+0.74)   33.95(35.53+0.76) -11.5%
  5310.5: simulated fetch      2.29(6.31+0.35)     2.20(5.97+0.41) -3.9%
  5310.7: rev-list (commits)   0.99(0.86+0.13)     0.96(0.85+0.11) -3.0%
  5310.8: rev-list (objects)   11.32(11.04+0.27)   6.59(6.37+0.21) -41.8%

And here's another with a very large tree (~340k entries), and a fairly
large number of refs (~10k):

  Test                         HEAD^               HEAD
  -------------------------------------------------------------------------
  5310.3: simulated clone      53.83(54.71+1.54)   39.77(40.76+1.50) -26.1%
  5310.4: simulated fetch      19.91(20.11+0.56)   19.79(19.98+0.67) -0.6%
  5310.6: rev-list (commits)   0.54(0.44+0.11)     0.51(0.43+0.07) -5.6%
  5310.7: rev-list (objects)   24.32(23.59+0.73)   9.85(9.49+0.36) -59.5%

This patch provides substantial improvements in these larger cases, and
have any drawbacks for smaller ones (the cost of the bitmap check is
quite small compared to an actual tree traversal).

Note that we have to add a version of revision.c's include_check
callback which handles non-commits. We could possibly consolidate this
into a single callback for all objects types, as there's only one user
of the feature which would need converted (pack-bitmap.c:should_include).
That would in theory let us avoid duplicating any logic. But when I
tried it, the code ended up much worse to read, with lots of repeated
"if it's a commit do this, otherwise do that". Having two separate
callbacks splits that naturally, and matches the existing split of
show_commit/show_object callbacks.

Signed-off-by: Jeff King <peff@peff.net>
---
 list-objects.c |  3 +++
 pack-bitmap.c  | 18 ++++++++++++++++++
 revision.h     |  1 +
 3 files changed, 22 insertions(+)

diff --git a/list-objects.c b/list-objects.c
index 7f404677d5..473a332416 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -164,6 +164,9 @@ static void process_tree(struct traversal_context *ctx,
 		die("bad tree object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (revs->include_check_obj &&
+	    !revs->include_check_obj(&tree->object, revs->include_check_data))
+		return;
 
 	failed_parse = parse_tree_gently(tree, 1);
 	if (failed_parse) {
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d90e1d9d8c..bfc10148f5 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -525,6 +525,22 @@ static int should_include(struct commit *commit, void *_data)
 	return 1;
 }
 
+static int should_include_obj(struct object *obj, void *_data)
+{
+	struct include_data *data = _data;
+	int bitmap_pos;
+
+	bitmap_pos = bitmap_position(data->bitmap_git, &obj->oid);
+	if (bitmap_pos < 0)
+		return 1;
+	if ((data->seen && bitmap_get(data->seen, bitmap_pos)) ||
+	     bitmap_get(data->base, bitmap_pos)) {
+		obj->flags |= SEEN;
+		return 0;
+	}
+	return 1;
+}
+
 static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
 				struct bitmap **base,
 				struct commit *commit)
@@ -620,6 +636,7 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		incdata.seen = seen;
 
 		revs->include_check = should_include;
+		revs->include_check_obj = should_include_obj;
 		revs->include_check_data = &incdata;
 
 		if (prepare_revision_walk(revs))
@@ -633,6 +650,7 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 					      &show_data, NULL);
 
 		revs->include_check = NULL;
+		revs->include_check_obj = NULL;
 		revs->include_check_data = NULL;
 	}
 
diff --git a/revision.h b/revision.h
index 93aa012f51..7bb5fa4e7c 100644
--- a/revision.h
+++ b/revision.h
@@ -262,6 +262,7 @@ struct rev_info {
 	int min_parents;
 	int max_parents;
 	int (*include_check)(struct commit *, void *);
+	int (*include_check_obj)(struct object *obj, void *);
 	void *include_check_data;
 
 	/* diff info for patches and for paths limiting */
Derrick Stolee June 15, 2021, 2:17 p.m. UTC | #2
On 6/14/2021 8:05 AM, Jeff King wrote:
> On Mon, Jun 14, 2021 at 03:27:04AM -0400, Jeff King wrote:
...
> We can use the same optimization we do for commits here: when we are
> about to open a tree, see if it's in the bitmap (either the one we are
> building, or the "seen" bitmap which covers the UNINTERESTING side of
> the bitmap when doing a set-difference).

I was surprised that we were not already doing this. As Git can handle
larger repos now than it could a few years ago, I expect this
optimization to be immediately valuable, and critical in years to come.

> But here are numbers from some other real-world repositories (that are
> not public). This one's tree is comparable in size to linux.git, but has
> ~16k refs (and so less complete bitmap coverage):
> 
>   Test                         HEAD^               HEAD
>   -------------------------------------------------------------------------
>   5310.4: simulated clone      38.34(39.86+0.74)   33.95(35.53+0.76) -11.5%
>   5310.5: simulated fetch      2.29(6.31+0.35)     2.20(5.97+0.41) -3.9%
>   5310.7: rev-list (commits)   0.99(0.86+0.13)     0.96(0.85+0.11) -3.0%
>   5310.8: rev-list (objects)   11.32(11.04+0.27)   6.59(6.37+0.21) -41.8%
> 
> And here's another with a very large tree (~340k entries), and a fairly
> large number of refs (~10k):
> 
>   Test                         HEAD^               HEAD
>   -------------------------------------------------------------------------
>   5310.3: simulated clone      53.83(54.71+1.54)   39.77(40.76+1.50) -26.1%
>   5310.4: simulated fetch      19.91(20.11+0.56)   19.79(19.98+0.67) -0.6%
>   5310.6: rev-list (commits)   0.54(0.44+0.11)     0.51(0.43+0.07) -5.6%
>   5310.7: rev-list (objects)   24.32(23.59+0.73)   9.85(9.49+0.36) -59.5%
> 
> This patch provides substantial improvements in these larger cases, and
> have any drawbacks for smaller ones (the cost of the bitmap check is
> quite small compared to an actual tree traversal).

These many-refs scenarios make sense as something that is difficult to
verify using a single fork of an open-source project, but is common in
many closed-source projects that do not use forking to reduce the ref
count per repo.

I was impressed by how little code was required for this change. LGTM.

Thanks,
-Stolee
Patrick Steinhardt June 16, 2021, 12:31 p.m. UTC | #3
On Tue, Jun 15, 2021 at 10:17:04AM -0400, Derrick Stolee wrote:
> On 6/14/2021 8:05 AM, Jeff King wrote:
[snip]
> > But here are numbers from some other real-world repositories (that are
> > not public). This one's tree is comparable in size to linux.git, but has
> > ~16k refs (and so less complete bitmap coverage):
> > 
> >   Test                         HEAD^               HEAD
> >   -------------------------------------------------------------------------
> >   5310.4: simulated clone      38.34(39.86+0.74)   33.95(35.53+0.76) -11.5%
> >   5310.5: simulated fetch      2.29(6.31+0.35)     2.20(5.97+0.41) -3.9%
> >   5310.7: rev-list (commits)   0.99(0.86+0.13)     0.96(0.85+0.11) -3.0%
> >   5310.8: rev-list (objects)   11.32(11.04+0.27)   6.59(6.37+0.21) -41.8%
> > 
> > And here's another with a very large tree (~340k entries), and a fairly
> > large number of refs (~10k):
> > 
> >   Test                         HEAD^               HEAD
> >   -------------------------------------------------------------------------
> >   5310.3: simulated clone      53.83(54.71+1.54)   39.77(40.76+1.50) -26.1%
> >   5310.4: simulated fetch      19.91(20.11+0.56)   19.79(19.98+0.67) -0.6%
> >   5310.6: rev-list (commits)   0.54(0.44+0.11)     0.51(0.43+0.07) -5.6%
> >   5310.7: rev-list (objects)   24.32(23.59+0.73)   9.85(9.49+0.36) -59.5%
> > 
> > This patch provides substantial improvements in these larger cases, and
> > have any drawbacks for smaller ones (the cost of the bitmap check is
> > quite small compared to an actual tree traversal).
> 
> These many-refs scenarios make sense as something that is difficult to
> verify using a single fork of an open-source project, but is common in
> many closed-source projects that do not use forking to reduce the ref
> count per repo.

Agreed. What I typically do to emulate this is to use some version of
following command to create refs for "$n" commits.

    git log --all --format="tformat:create refs/commit/%h %H" |
        shuf | head -n "$n" | git update-ref --stdin

It's obviously not ideal given that resulting refs are distributed at
random. But combined with a sufficiently large repo, it's still helped
me at times to reproduce adverse performance at times.

Anyway, the patch does look good to me and sounds like it may help with
some of the cases where I have observed adverse performance with bitmaps
enabled in the past. Thanks!

Patrick
Jeff King June 18, 2021, 12:59 p.m. UTC | #4
On Wed, Jun 16, 2021 at 02:31:04PM +0200, Patrick Steinhardt wrote:

> > These many-refs scenarios make sense as something that is difficult to
> > verify using a single fork of an open-source project, but is common in
> > many closed-source projects that do not use forking to reduce the ref
> > count per repo.
> 
> Agreed. What I typically do to emulate this is to use some version of
> following command to create refs for "$n" commits.
> 
>     git log --all --format="tformat:create refs/commit/%h %H" |
>         shuf | head -n "$n" | git update-ref --stdin
> 
> It's obviously not ideal given that resulting refs are distributed at
> random. But combined with a sufficiently large repo, it's still helped
> me at times to reproduce adverse performance at times.

Yeah, I've done something similar. But I'd caution people into reading
too much into performance tests from something like that. What I've
found over the years is that traversal and bitmap performance can be
somewhat sensitive to tree shape and bitmap placement (which in turn is
often influenced by ref placement, if you are using delta islands or the
recently added pack.preferBitmapTips).

More than once I've spent many head-scratching hours trying to determine
why some real-world repo performs better or worse than expected. :)

> Anyway, the patch does look good to me and sounds like it may help with
> some of the cases where I have observed adverse performance with bitmaps
> enabled in the past. Thanks!

Thanks for taking a look!

-Peff
Patrick Steinhardt June 18, 2021, 1:35 p.m. UTC | #5
On Fri, Jun 18, 2021 at 08:59:59AM -0400, Jeff King wrote:
> On Wed, Jun 16, 2021 at 02:31:04PM +0200, Patrick Steinhardt wrote:
> 
> > > These many-refs scenarios make sense as something that is difficult to
> > > verify using a single fork of an open-source project, but is common in
> > > many closed-source projects that do not use forking to reduce the ref
> > > count per repo.
> > 
> > Agreed. What I typically do to emulate this is to use some version of
> > following command to create refs for "$n" commits.
> > 
> >     git log --all --format="tformat:create refs/commit/%h %H" |
> >         shuf | head -n "$n" | git update-ref --stdin
> > 
> > It's obviously not ideal given that resulting refs are distributed at
> > random. But combined with a sufficiently large repo, it's still helped
> > me at times to reproduce adverse performance at times.
> 
> Yeah, I've done something similar. But I'd caution people into reading
> too much into performance tests from something like that. What I've
> found over the years is that traversal and bitmap performance can be
> somewhat sensitive to tree shape and bitmap placement (which in turn is
> often influenced by ref placement, if you are using delta islands or the
> recently added pack.preferBitmapTips).
> 
> More than once I've spent many head-scratching hours trying to determine
> why some real-world repo performs better or worse than expected. :)

I couldn't agree more. I've also had my fair share of weird performance
characteristics in completely unexpected ways. Unfortunately, it has
made me become quite cautious about bitmaps given that they've already
caused their fair share of perfomance regressions.

But your work here actually encourages me to give it another try soonish
and see what kind of repo shapes make them explode this time.

Patrick
Jeff King June 18, 2021, 2:10 p.m. UTC | #6
On Fri, Jun 18, 2021 at 03:35:05PM +0200, Patrick Steinhardt wrote:

> > More than once I've spent many head-scratching hours trying to determine
> > why some real-world repo performs better or worse than expected. :)
> 
> I couldn't agree more. I've also had my fair share of weird performance
> characteristics in completely unexpected ways. Unfortunately, it has
> made me become quite cautious about bitmaps given that they've already
> caused their fair share of perfomance regressions.
> 
> But your work here actually encourages me to give it another try soonish
> and see what kind of repo shapes make them explode this time.

Not really for immediate work, but I just wanted to note while I'm
thinking about it: the two big things that can actually make bitmaps
_worse_ than a regular traversal are:

  - loading the bitmaps is inherently linear in the file size. We could
    do better with an mmap-able index of bitmapped commits in the file.
    This would be easy to add as an extension to the file. This matters
    most for really small queries (the absolute time to parse the bitmap
    file isn't that high, but if the non-bitmap query can be answered in
    a few milliseconds, it becomes relatively more important).

  - we produce a definite answer of reachability for the bitmaps by
    filling in a bitmap for the negative (UNINTERESTING) tips, then one
    for the regular tips, and then taking a set difference. Whereas a
    normal traversal walks both sides together until it hits a merge
    base, and then actually looks into the trees in the boundary
    commits. This normal-traversal technique may look at fewer objects
    overall for some cases (especially if you have a lot of negative
    tips, like with "--not --all" and not-so-good bitmap coverage). But
    it does mean we may report an object as in the traversal when it's
    actually not (if it was added/removed in part of the uninteresting
    history, and so not present at the boundary).

    We could possibly teach the bitmap traversal to behave more like
    a normal one. I suspect this would clear up most of the regressions
    you'd see using bitmaps with rev-list.

There is one other related case I've run into. If you put a bitmap on
every commit, you'd _think_ that would make everything faster, since
you'd never traverse at all. But there's an inflection point where you
spend more time going through each bitmap twiddling bits (which remember
is O(nr_of_objects * nr_bitmapped_commits) bits across all of them). And
most of those bits will already be "1", so there's a point where the
saved traversal is less than the cost of OR-ing bits.

Anyway, this is all big-picture stuff that I don't expect you to work on
or even respond to. I'm mostly just writing up some thoughts that I
don't think have made it to the list in a coherent form. I'm not sure if
I'll eventually work on some of them or not (but anybody who is
interested can be my guest :) ).

-Peff
Patrick Steinhardt June 22, 2021, 10:47 a.m. UTC | #7
On Fri, Jun 18, 2021 at 03:35:05PM +0200, Patrick Steinhardt wrote:
> On Fri, Jun 18, 2021 at 08:59:59AM -0400, Jeff King wrote:
> > On Wed, Jun 16, 2021 at 02:31:04PM +0200, Patrick Steinhardt wrote:
> > 
> > > > These many-refs scenarios make sense as something that is difficult to
> > > > verify using a single fork of an open-source project, but is common in
> > > > many closed-source projects that do not use forking to reduce the ref
> > > > count per repo.
> > > 
> > > Agreed. What I typically do to emulate this is to use some version of
> > > following command to create refs for "$n" commits.
> > > 
> > >     git log --all --format="tformat:create refs/commit/%h %H" |
> > >         shuf | head -n "$n" | git update-ref --stdin
> > > 
> > > It's obviously not ideal given that resulting refs are distributed at
> > > random. But combined with a sufficiently large repo, it's still helped
> > > me at times to reproduce adverse performance at times.
> > 
> > Yeah, I've done something similar. But I'd caution people into reading
> > too much into performance tests from something like that. What I've
> > found over the years is that traversal and bitmap performance can be
> > somewhat sensitive to tree shape and bitmap placement (which in turn is
> > often influenced by ref placement, if you are using delta islands or the
> > recently added pack.preferBitmapTips).
> > 
> > More than once I've spent many head-scratching hours trying to determine
> > why some real-world repo performs better or worse than expected. :)
> 
> I couldn't agree more. I've also had my fair share of weird performance
> characteristics in completely unexpected ways. Unfortunately, it has
> made me become quite cautious about bitmaps given that they've already
> caused their fair share of perfomance regressions.
> 
> But your work here actually encourages me to give it another try soonish
> and see what kind of repo shapes make them explode this time.
> 
> Patrick

Today I've been experimenting with the connectivity check of
git-receive-pack(1) once again to see whether I'm able to get a
performance improvement if the git-rev-list(1) command spawned as part
of the connectivity check uses `--use-bitmap-index`.

Turns out the answer is "no": it has exactly the same performance
characteristics when pushing into a bitmapped repository (linux.git)
compared to the version not using a bitmap index, except for once case
where it performs 70x worse: force-pushing "master~10:master" into a
bitmapped repo takes 11s instead of 0.15s with bitmaps enabled.

Just leaving this here as a note for anybody (or myself) to pick up at a
later point.

Patrick
Jeff King June 22, 2021, 7:39 p.m. UTC | #8
On Tue, Jun 22, 2021 at 12:47:57PM +0200, Patrick Steinhardt wrote:

> Today I've been experimenting with the connectivity check of
> git-receive-pack(1) once again to see whether I'm able to get a
> performance improvement if the git-rev-list(1) command spawned as part
> of the connectivity check uses `--use-bitmap-index`.
> 
> Turns out the answer is "no": it has exactly the same performance
> characteristics when pushing into a bitmapped repository (linux.git)
> compared to the version not using a bitmap index, except for once case
> where it performs 70x worse: force-pushing "master~10:master" into a
> bitmapped repo takes 11s instead of 0.15s with bitmaps enabled.
> 
> Just leaving this here as a note for anybody (or myself) to pick up at a
> later point.

Thanks. I'd wager that's probably the "we find the exact bitmap for the
'haves'" problem. There are probably a lot of old refs with so-so bitmap
coverage, and we traverse all of them down to the nearest bitmap. If we
filled in the bitmap by traversing commits in timestamp or generation
order and ending at the merge-base, we could probably avoid looking at
them at all.

-Peff
diff mbox series

Patch

diff --git a/list-objects.c b/list-objects.c
index 7f404677d5..473a332416 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -164,6 +164,9 @@  static void process_tree(struct traversal_context *ctx,
 		die("bad tree object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (revs->include_check_obj &&
+	    !revs->include_check_obj(&tree->object, revs->include_check_data))
+		return;
 
 	failed_parse = parse_tree_gently(tree, 1);
 	if (failed_parse) {
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d90e1d9d8c..9dfee4a5b7 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -525,6 +525,22 @@  static int should_include(struct commit *commit, void *_data)
 	return 1;
 }
 
+static int should_include_obj(struct object *obj, void *_data)
+{
+	struct include_data *data = _data;
+	int bitmap_pos;
+
+	bitmap_pos = bitmap_position(data->bitmap_git, &obj->oid);
+	if (bitmap_pos < 0)
+		return 1;
+	if ((data->seen && bitmap_get(data->seen, bitmap_pos)) ||
+	     bitmap_get(data->base, bitmap_pos)) {
+		obj->flags |= SEEN;
+		return 0;
+	}
+	return 1;
+}
+
 static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
 				struct bitmap **base,
 				struct commit *commit)
@@ -620,6 +636,7 @@  static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		incdata.seen = seen;
 
 		revs->include_check = should_include;
+		revs->include_check_obj = should_include_obj;
 		revs->include_check_data = &incdata;
 
 		if (prepare_revision_walk(revs))
diff --git a/revision.h b/revision.h
index 93aa012f51..7bb5fa4e7c 100644
--- a/revision.h
+++ b/revision.h
@@ -262,6 +262,7 @@  struct rev_info {
 	int min_parents;
 	int max_parents;
 	int (*include_check)(struct commit *, void *);
+	int (*include_check_obj)(struct object *obj, void *);
 	void *include_check_data;
 
 	/* diff info for patches and for paths limiting */