[v2,5/8] p5303: measure time to repack with keep

Message ID	b5081c01b53beb568ef2e59956d25b36be9f24d0.1612411124.git.me@ttaylorr.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> Date: Wed, 3 Feb 2021 22:59:13 -0500 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: dstolee@microsoft.com, gitster@pobox.com, peff@peff.net Subject: [PATCH v2 5/8] p5303: measure time to repack with keep Message-ID: <b5081c01b53beb568ef2e59956d25b36be9f24d0.1612411124.git.me@ttaylorr.com> References: <cover.1611098616.git.me@ttaylorr.com> <cover.1612411123.git.me@ttaylorr.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <cover.1612411123.git.me@ttaylorr.com> Precedence: bulk
Series	repack: support repacking into a geometric sequence \| expand [v2,0/8] repack: support repacking into a geometric sequence [v2,1/8] packfile: introduce 'find_kept_pack_entry()' [v2,2/8] revision: learn '--no-kept-objects' [v2,3/8] builtin/pack-objects.c: add '--stdin-packs' option [v2,4/8] p5303: add missing &&-chains [v2,5/8] p5303: measure time to repack with keep [v2,6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic [v2,7/8] packfile: add kept-pack cache for find_kept_pack_entry() [v2,8/8] builtin/repack.c: add '--geometric' option

Message ID

b5081c01b53beb568ef2e59956d25b36be9f24d0.1612411124.git.me@ttaylorr.com (mailing list archive)

State

Superseded

Headers

Date: Wed, 3 Feb 2021 22:59:13 -0500
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: dstolee@microsoft.com, gitster@pobox.com, peff@peff.net
Subject: [PATCH v2 5/8] p5303: measure time to repack with keep
Message-ID: 
 <b5081c01b53beb568ef2e59956d25b36be9f24d0.1612411124.git.me@ttaylorr.com>
References: <cover.1611098616.git.me@ttaylorr.com>
 <cover.1612411123.git.me@ttaylorr.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <cover.1612411123.git.me@ttaylorr.com>
Precedence: bulk

Series

repack: support repacking into a geometric sequence | expand

Commit Message

Taylor Blau Feb. 4, 2021, 3:59 a.m. UTC

From: Jeff King <peff@peff.net>

This is the same as the regular repack test, except that we mark the
single base pack as "kept" and use --assume-kept-packs-closed. The
theory is that this should be faster than the normal repack, because
we'll have fewer objects to traverse and process.

Here are some timings on a recent clone of the kernel. In the
single-pack case, there is nothing do since there are no non-excluded
packs:

  5303.5: repack (1)                          57.42(54.88+10.64)
  5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)

and in the 50-pack case, it is much faster to use `--stdin-packs`, since
we avoid having to consider any objects in the excluded pack:

  5303.10: repack (50)                        71.26(88.24+4.96)
  5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)

but our improvements vanish as we approach 1000 packs.

  5303.15: repack (1000)                      215.64(491.33+14.80)
  5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)

That's because the code paths around handling .keep files are known to
scale badly; they look in every single pack file to find each object.
Our solution to that was to notice that most repos don't have keep
files, and to make that case a fast path. But as soon as you add a
single .keep, that part of pack-objects slows down again (even if we
have fewer objects total to look at).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Comments

Jeff King Feb. 16, 2021, 11:58 p.m. UTC | #1

On Wed, Feb 03, 2021 at 10:59:13PM -0500, Taylor Blau wrote:

> From: Jeff King <peff@peff.net>
> 
> This is the same as the regular repack test, except that we mark the
> single base pack as "kept" and use --assume-kept-packs-closed. The

I don't think that option exists anymore. I guess we are just using
--stdin-packs, which causes us to mark a pack as kept.

I think we could just mark it in the filesystem and use
--honor-pack-keep, which would make it independent of your new feature.
At first I was going to say "but it doesn't matter either way", but...

> theory is that this should be faster than the normal repack, because
> we'll have fewer objects to traverse and process.
> 
> Here are some timings on a recent clone of the kernel. In the
> single-pack case, there is nothing do since there are no non-excluded
> packs:
> 
>   5303.5: repack (1)                          57.42(54.88+10.64)
>   5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)
> 
> and in the 50-pack case, it is much faster to use `--stdin-packs`, since
> we avoid having to consider any objects in the excluded pack:
> 
>   5303.10: repack (50)                        71.26(88.24+4.96)
>   5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)
> 
> but our improvements vanish as we approach 1000 packs.
> 
>   5303.15: repack (1000)                      215.64(491.33+14.80)
>   5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)
> 
> That's because the code paths around handling .keep files are known to
> scale badly; they look in every single pack file to find each object.

Well, part of it is just that with 1000 packs we have 20 times as many
objects that are actually getting packed with --stdin-packs, compared to
the 50-pack case. IIRC, each pack is a fixed-size slice and then the
residual is put into the .keep pack. So the fact that the time gets
closer to a full repack as we add more packs is expected: we are asking
pack-objects to do more work!

For showing the impact of the optimizations in patches 7 and 8, I think
doing a full repack with --honor-pack-keep is a better test. Because
then we're always doing a full traversal, and most of the work continues
to scale with the repo size (though obviously not the actual shuffling
of packed bytes around). That would get rid of the weird "no work to do"
case in the single-pack tests, too.

-Peff

Jeff King Feb. 17, 2021, 12:02 a.m. UTC | #2

On Tue, Feb 16, 2021 at 06:58:16PM -0500, Jeff King wrote:

> For showing the impact of the optimizations in patches 7 and 8, I think
> doing a full repack with --honor-pack-keep is a better test. Because
> then we're always doing a full traversal, and most of the work continues
> to scale with the repo size (though obviously not the actual shuffling
> of packed bytes around). That would get rid of the weird "no work to do"
> case in the single-pack tests, too.

I meant to add: but I do like that we are timing --stdin-packs, too. We
may actually want to time both.

Another thing we _could_ do, if we have --honor-pack-keep perf tests, is
to shuffle patches 5, 6, and 7 towards the front of the series. They
should be able to show off the improvement even without the
--stdin-packs feature.

-Peff

Taylor Blau Feb. 17, 2021, 7:13 p.m. UTC | #3

On Tue, Feb 16, 2021 at 06:58:16PM -0500, Jeff King wrote:
> On Wed, Feb 03, 2021 at 10:59:13PM -0500, Taylor Blau wrote:
>
> > From: Jeff King <peff@peff.net>
> >
> > This is the same as the regular repack test, except that we mark the
> > single base pack as "kept" and use --assume-kept-packs-closed. The
>
> I don't think that option exists anymore. I guess we are just using
> --stdin-packs, which causes us to mark a pack as kept.
>
> I think we could just mark it in the filesystem and use
> --honor-pack-keep, which would make it independent of your new feature.
> At first I was going to say "but it doesn't matter either way", but...
>
> > theory is that this should be faster than the normal repack, because
> > we'll have fewer objects to traverse and process.
> >
> > Here are some timings on a recent clone of the kernel. In the
> > single-pack case, there is nothing do since there are no non-excluded
> > packs:
> >
> >   5303.5: repack (1)                          57.42(54.88+10.64)
> >   5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)
> >
> > and in the 50-pack case, it is much faster to use `--stdin-packs`, since
> > we avoid having to consider any objects in the excluded pack:
> >
> >   5303.10: repack (50)                        71.26(88.24+4.96)
> >   5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)
> >
> > but our improvements vanish as we approach 1000 packs.
> >
> >   5303.15: repack (1000)                      215.64(491.33+14.80)
> >   5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)
> >
> > That's because the code paths around handling .keep files are known to
> > scale badly; they look in every single pack file to find each object.
>
> Well, part of it is just that with 1000 packs we have 20 times as many
> objects that are actually getting packed with --stdin-packs, compared to
> the 50-pack case. IIRC, each pack is a fixed-size slice and then the
> residual is put into the .keep pack. So the fact that the time gets
> closer to a full repack as we add more packs is expected: we are asking
> pack-objects to do more work!

No, the residual base pack isn't marked as kept on-disk. But the
--stdin-packs test treats it as such, by passing '^pack-$base_pack.pack'
as input to '--stdin-packs' (thus marking it as kept in-core).

> For showing the impact of the optimizations in patches 7 and 8, I think
> doing a full repack with --honor-pack-keep is a better test. Because
> then we're always doing a full traversal, and most of the work continues
> to scale with the repo size (though obviously not the actual shuffling
> of packed bytes around). That would get rid of the weird "no work to do"
> case in the single-pack tests, too.

I think you're suggesting that we change the "repack ($nr_packs)" test
to have the residual pack marked as kept (so we're measuring time it
takes to repack everything that _isn't_ in the base pack)?

That would allow a more direct comparison, but I think it's loosing out
on an important aspect which is how long it takes to pack the entire
repository. Maybe we want three.

What do you think?

> -Peff

Thanks,
Taylor

Jeff King Feb. 17, 2021, 7:25 p.m. UTC | #4

On Wed, Feb 17, 2021 at 02:13:47PM -0500, Taylor Blau wrote:

> > > That's because the code paths around handling .keep files are known to
> > > scale badly; they look in every single pack file to find each object.
> >
> > Well, part of it is just that with 1000 packs we have 20 times as many
> > objects that are actually getting packed with --stdin-packs, compared to
> > the 50-pack case. IIRC, each pack is a fixed-size slice and then the
> > residual is put into the .keep pack. So the fact that the time gets
> > closer to a full repack as we add more packs is expected: we are asking
> > pack-objects to do more work!
> 
> No, the residual base pack isn't marked as kept on-disk. But the
> --stdin-packs test treats it as such, by passing '^pack-$base_pack.pack'
> as input to '--stdin-packs' (thus marking it as kept in-core).

Sorry, I perhaps shouldn't have said ".keep" here. But it's the same
thing, isn't it? The 50 pack case is packing 50*pack_size objects
(because it's excluding everything else that is in the base pack we mark
as keep-in-core), and the 1000-pack case is packing 1000*pack_size
objects (for the same reason).

So any patterns we see between them have more to do with that, than how
the keep-handling code scales with the number of non-kept packs.

> > For showing the impact of the optimizations in patches 7 and 8, I think
> > doing a full repack with --honor-pack-keep is a better test. Because
> > then we're always doing a full traversal, and most of the work continues
> > to scale with the repo size (though obviously not the actual shuffling
> > of packed bytes around). That would get rid of the weird "no work to do"
> > case in the single-pack tests, too.
> 
> I think you're suggesting that we change the "repack ($nr_packs)" test
> to have the residual pack marked as kept (so we're measuring time it
> takes to repack everything that _isn't_ in the base pack)?
> 
> That would allow a more direct comparison, but I think it's loosing out
> on an important aspect which is how long it takes to pack the entire
> repository. Maybe we want three.

That was what I was suggesting, but I think it's equivalent to what your
--stdin-packs is testing. I guess the most interesting thing would
actually be an _additional_ pack mark as .keep (and that pack does not
even have to contain anything interesting -- the point is how much
effort it costs to find that out. Of course the bigger it is the more
pronounced the effect of avoiding lookups in it).

-Peff

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index d90d714923..b76a6efe00 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -31,8 +31,11 @@  repack_into_n () {
 	' "$1" >pushes &&
 
 	# create base packfile
-	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack &&
+	base_pack=$(
+		head -n 1 pushes |
+		git pack-objects --delta-base-offset --revs staging/pack
+	) &&
+	test_export base_pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
@@ -49,6 +52,12 @@  repack_into_n () {
 		last=$rev
 	done <pushes &&
 
+	(
+		find staging -type f -name 'pack-*.pack' |
+			xargs -n 1 basename | grep -v "$base_pack" &&
+		printf "^pack-%s.pack\n" $base_pack
+	) >stdin.packs
+
 	# and install the whole thing
 	rm -f .git/objects/pack/* &&
 	mv staging/* .git/objects/pack/
@@ -91,6 +100,15 @@  do
 		  --reflog --indexed-objects --delta-base-offset \
 		  --stdout </dev/null >/dev/null
 	'
+
+	test_perf "repack with --stdin-packs ($nr_packs)" '
+		git pack-objects \
+		  --keep-true-parents \
+		  --stdin-packs \
+		  --non-empty \
+		  --delta-base-offset \
+		  --stdout <stdin.packs >/dev/null
+	'
 done
 
 # Measure pack loading with 10,000 packs.

[v2,5/8] p5303: measure time to repack with keep

Commit Message

Comments

Patch