diff mbox series

[v3,1/2] pack-bitmap.c: remove unnecessary "open_pack_index()" calls

Message ID aaeb021538cdfeb83dc6004fe7b3ac35a23aef49.1668063122.git.dyroneteng@gmail.com (mailing list archive)
State Accepted
Commit 2aa84d5f3ea1966a81477ad31bee0136e39d3917
Headers show
Series [v3,1/2] pack-bitmap.c: remove unnecessary "open_pack_index()" calls | expand

Commit Message

Teng Long Nov. 10, 2022, 7:10 a.m. UTC
From: Teng Long <dyroneteng@gmail.com>

When trying to open a pack bitmap, we call open_pack_bitmap_1() in a
loop, during which it tries to open up the pack index corresponding
with each available pack.

It's likely that we'll end up relying on objects in that pack later
in the process (in which case we're doing the work of opening the
pack index optimistically), but not guaranteed.

For instance, consider a repository with a large number of small
packs, and one large pack with a bitmap. If we see that bitmap pack
last in our loop which calls open_pack_bitmap_1(), the current code
will have opened *all* pack index files in the repository. If the
request can be served out of the bitmapped pack alone, then the time
spent opening these idx files was wasted.S

Since open_pack_bitmap_1() calls is_pack_valid() later on (which in
turns calls open_pack_index() itself), we can just drop the earlier
call altogether.

Signed-off-by: Teng Long <dyroneteng@gmail.com>
---
 pack-bitmap.c | 3 ---
 1 file changed, 3 deletions(-)

Comments

Jeff King Nov. 14, 2022, 10:03 p.m. UTC | #1
On Thu, Nov 10, 2022 at 03:10:11PM +0800, Teng Long wrote:

> It's likely that we'll end up relying on objects in that pack later
> in the process (in which case we're doing the work of opening the
> pack index optimistically), but not guaranteed.
> 
> For instance, consider a repository with a large number of small
> packs, and one large pack with a bitmap. If we see that bitmap pack
> last in our loop which calls open_pack_bitmap_1(), the current code
> will have opened *all* pack index files in the repository. If the
> request can be served out of the bitmapped pack alone, then the time
> spent opening these idx files was wasted.S

By the way, I wondered if it was possible to measure a slowdown in this
case. It is, but you have to try pretty hard. Something like this:

  # one bitmapped pack
  git repack -adb

  # and then a bunch of other packs
  git rev-list HEAD |
  head -10000 |
  while read commit; do
    echo $commit | git pack-objects .git/objects/pack/pack
  done

  # make the bitmapped one newest, since otherwise our non-bitmap lookup
  # of the initial traversal commit causes us to open all the other
  # packs first!
  bitmap=$(echo .git/objects/pack/pack-*.bitmap)
  touch ${bitmap%.bitmap}.*

  hyperfine -L v old,new './git.{v} rev-list --count --use-bitmap-index HEAD'

where "new" and "old" are builds with and without this patch. I get:

  Benchmark 1: ./git.old rev-list --count --use-bitmap-index HEAD
    Time (mean ± σ):     117.9 ms ±   1.8 ms    [User: 26.9 ms, System: 90.0 ms]
    Range (min … max):   113.4 ms … 120.3 ms    25 runs
  
  Benchmark 2: ./git.new rev-list --count --use-bitmap-index HEAD
    Time (mean ± σ):      71.8 ms ±   2.6 ms    [User: 21.2 ms, System: 50.5 ms]
    Range (min … max):    67.0 ms …  75.1 ms    41 runs
  
  Summary
    './git.new rev-list --count --use-bitmap-index HEAD' ran
      1.64 ± 0.06 times faster than './git.old rev-list --count --use-bitmap-index HEAD'

which implies to me two things:

  - this probably isn't helping anybody much in the real world, as
    evidenced by the contortions I had to go through to set up the
    situation (and which would be made much better by repacking, which
    would also speed up non-bitmap operations).

  - it's worth doing anyway. Even if it only shaves off microseconds,
    the existing call is just pointless.

-Peff
Taylor Blau Nov. 14, 2022, 10:14 p.m. UTC | #2
On Mon, Nov 14, 2022 at 05:03:56PM -0500, Jeff King wrote:
> On Thu, Nov 10, 2022 at 03:10:11PM +0800, Teng Long wrote:
>
> > It's likely that we'll end up relying on objects in that pack later
> > in the process (in which case we're doing the work of opening the
> > pack index optimistically), but not guaranteed.
> >
> > For instance, consider a repository with a large number of small
> > packs, and one large pack with a bitmap. If we see that bitmap pack
> > last in our loop which calls open_pack_bitmap_1(), the current code
> > will have opened *all* pack index files in the repository. If the
> > request can be served out of the bitmapped pack alone, then the time
> > spent opening these idx files was wasted.S
>
> By the way, I wondered if it was possible to measure a slowdown in this
> case. It is, but you have to try pretty hard. Something like this:
>
>   # one bitmapped pack
>   git repack -adb
>
>   # and then a bunch of other packs
>   git rev-list HEAD |
>   head -10000 |
>   while read commit; do
>     echo $commit | git pack-objects .git/objects/pack/pack
>   done

OK, so with 10K packs, we see about a 1.6-fold improvement, which is
definitely substantial.

On a fresh clone of git.git, repeating your experiment with only 1K
packs (which is definitely a number of packs that GitHub sees in
under-maintained repositories), the runtime goes from 25.3ms -> 20.9ms
on my machine, or about a 1.2-fold improvement.

So definitely smaller, but even at 1/10th the number of packs from your
experiment, still noticeable.

>   - this probably isn't helping anybody much in the real world, as
>     evidenced by the contortions I had to go through to set up the
>     situation (and which would be made much better by repacking, which
>     would also speed up non-bitmap operations).

Per above, I'm not sure I totally agree ;-). 1K packs is definitely an
extreme amount of packs, but not out-of-this-world. It probably would
show up in carefully-picked graphs, but not in "overall git rev-list
time" or something as broad/noisy as that.

>   - it's worth doing anyway. Even if it only shaves off microseconds,
>     the existing call is just pointless.

Yes, definitely.

Thanks,
Taylor
Jeff King Nov. 14, 2022, 10:31 p.m. UTC | #3
On Mon, Nov 14, 2022 at 05:14:37PM -0500, Taylor Blau wrote:

> OK, so with 10K packs, we see about a 1.6-fold improvement, which is
> definitely substantial.
> 
> On a fresh clone of git.git, repeating your experiment with only 1K
> packs (which is definitely a number of packs that GitHub sees in
> under-maintained repositories), the runtime goes from 25.3ms -> 20.9ms
> on my machine, or about a 1.2-fold improvement.
> 
> So definitely smaller, but even at 1/10th the number of packs from your
> experiment, still noticeable.

Interesting. I had tried it initially with 1000 and the improvements were
much smaller. I just did it again, though, and got the same 20% speedup.
I'm not sure what I screwed up earlier (I may have been confused by the
timestamp/sorting issue; I only realized it was important midway through
looking into this).

> >   - this probably isn't helping anybody much in the real world, as
> >     evidenced by the contortions I had to go through to set up the
> >     situation (and which would be made much better by repacking, which
> >     would also speed up non-bitmap operations).
> 
> Per above, I'm not sure I totally agree ;-). 1K packs is definitely an
> extreme amount of packs, but not out-of-this-world. It probably would
> show up in carefully-picked graphs, but not in "overall git rev-list
> time" or something as broad/noisy as that.

Yeah, I agree that 1k is a lot more compelling. The big impractical
thing I think is that if the bitmapped pack is older (and it usually
is), then we'd often open all the other packs anyway:

  - if the start of the traversal is in the bitmapped pack, then we
    fruitlessly open each of the others looking for the object (since
    the bitmapped one will come last in the reverse-chronological
    sorting)

  - if it isn't in the bitmapped pack, then we'll end up opening all
    those other packs anyway to fill out the bitmap (since by definition
    it can't be included in the on-disk bitmaps)

So I'd be surprised if it ever mattered in the real world. Though again,
I think the new code is less surprising in general, and could matter if
we changed other things (e.g., if we prioritized lookups in a pack with
a .bitmap).

-Peff
Taylor Blau Nov. 14, 2022, 10:50 p.m. UTC | #4
On Mon, Nov 14, 2022 at 05:31:18PM -0500, Jeff King wrote:
> Yeah, I agree that 1k is a lot more compelling. The big impractical
> thing I think is that if the bitmapped pack is older (and it usually
> is), then we'd often open all the other packs anyway:
>
>   - if the start of the traversal is in the bitmapped pack, then we
>     fruitlessly open each of the others looking for the object (since
>     the bitmapped one will come last in the reverse-chronological
>     sorting)
>
>   - if it isn't in the bitmapped pack, then we'll end up opening all
>     those other packs anyway to fill out the bitmap (since by definition
>     it can't be included in the on-disk bitmaps)
>
> So I'd be surprised if it ever mattered in the real world. Though again,
> I think the new code is less surprising in general, and could matter if
> we changed other things (e.g., if we prioritized lookups in a pack with
> a .bitmap).

I completely agree. It's definitely worth doing purely based on the
principle of least-surprise. But the potential performance improvements
are just gravy on top ;-).

Thanks,
Taylor
diff mbox series

Patch

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 440407f1be..982e286bac 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -411,9 +411,6 @@  static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 	struct stat st;
 	char *bitmap_name;
 
-	if (open_pack_index(packfile))
-		return -1;
-
 	bitmap_name = pack_bitmap_filename(packfile);
 	fd = git_open(bitmap_name);