Message ID | b5081c01b53beb568ef2e59956d25b36be9f24d0.1612411124.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | repack: support repacking into a geometric sequence | expand |
On Wed, Feb 03, 2021 at 10:59:13PM -0500, Taylor Blau wrote: > From: Jeff King <peff@peff.net> > > This is the same as the regular repack test, except that we mark the > single base pack as "kept" and use --assume-kept-packs-closed. The I don't think that option exists anymore. I guess we are just using --stdin-packs, which causes us to mark a pack as kept. I think we could just mark it in the filesystem and use --honor-pack-keep, which would make it independent of your new feature. At first I was going to say "but it doesn't matter either way", but... > theory is that this should be faster than the normal repack, because > we'll have fewer objects to traverse and process. > > Here are some timings on a recent clone of the kernel. In the > single-pack case, there is nothing do since there are no non-excluded > packs: > > 5303.5: repack (1) 57.42(54.88+10.64) > 5303.6: repack with --stdin-packs (1) 0.01(0.01+0.00) > > and in the 50-pack case, it is much faster to use `--stdin-packs`, since > we avoid having to consider any objects in the excluded pack: > > 5303.10: repack (50) 71.26(88.24+4.96) > 5303.11: repack with --stdin-packs (50) 3.49(11.82+0.28) > > but our improvements vanish as we approach 1000 packs. > > 5303.15: repack (1000) 215.64(491.33+14.80) > 5303.16: repack with --stdin-packs (1000) 198.79(380.51+7.97) > > That's because the code paths around handling .keep files are known to > scale badly; they look in every single pack file to find each object. Well, part of it is just that with 1000 packs we have 20 times as many objects that are actually getting packed with --stdin-packs, compared to the 50-pack case. IIRC, each pack is a fixed-size slice and then the residual is put into the .keep pack. So the fact that the time gets closer to a full repack as we add more packs is expected: we are asking pack-objects to do more work! For showing the impact of the optimizations in patches 7 and 8, I think doing a full repack with --honor-pack-keep is a better test. Because then we're always doing a full traversal, and most of the work continues to scale with the repo size (though obviously not the actual shuffling of packed bytes around). That would get rid of the weird "no work to do" case in the single-pack tests, too. -Peff
On Tue, Feb 16, 2021 at 06:58:16PM -0500, Jeff King wrote: > For showing the impact of the optimizations in patches 7 and 8, I think > doing a full repack with --honor-pack-keep is a better test. Because > then we're always doing a full traversal, and most of the work continues > to scale with the repo size (though obviously not the actual shuffling > of packed bytes around). That would get rid of the weird "no work to do" > case in the single-pack tests, too. I meant to add: but I do like that we are timing --stdin-packs, too. We may actually want to time both. Another thing we _could_ do, if we have --honor-pack-keep perf tests, is to shuffle patches 5, 6, and 7 towards the front of the series. They should be able to show off the improvement even without the --stdin-packs feature. -Peff
On Tue, Feb 16, 2021 at 06:58:16PM -0500, Jeff King wrote: > On Wed, Feb 03, 2021 at 10:59:13PM -0500, Taylor Blau wrote: > > > From: Jeff King <peff@peff.net> > > > > This is the same as the regular repack test, except that we mark the > > single base pack as "kept" and use --assume-kept-packs-closed. The > > I don't think that option exists anymore. I guess we are just using > --stdin-packs, which causes us to mark a pack as kept. > > I think we could just mark it in the filesystem and use > --honor-pack-keep, which would make it independent of your new feature. > At first I was going to say "but it doesn't matter either way", but... > > > theory is that this should be faster than the normal repack, because > > we'll have fewer objects to traverse and process. > > > > Here are some timings on a recent clone of the kernel. In the > > single-pack case, there is nothing do since there are no non-excluded > > packs: > > > > 5303.5: repack (1) 57.42(54.88+10.64) > > 5303.6: repack with --stdin-packs (1) 0.01(0.01+0.00) > > > > and in the 50-pack case, it is much faster to use `--stdin-packs`, since > > we avoid having to consider any objects in the excluded pack: > > > > 5303.10: repack (50) 71.26(88.24+4.96) > > 5303.11: repack with --stdin-packs (50) 3.49(11.82+0.28) > > > > but our improvements vanish as we approach 1000 packs. > > > > 5303.15: repack (1000) 215.64(491.33+14.80) > > 5303.16: repack with --stdin-packs (1000) 198.79(380.51+7.97) > > > > That's because the code paths around handling .keep files are known to > > scale badly; they look in every single pack file to find each object. > > Well, part of it is just that with 1000 packs we have 20 times as many > objects that are actually getting packed with --stdin-packs, compared to > the 50-pack case. IIRC, each pack is a fixed-size slice and then the > residual is put into the .keep pack. So the fact that the time gets > closer to a full repack as we add more packs is expected: we are asking > pack-objects to do more work! No, the residual base pack isn't marked as kept on-disk. But the --stdin-packs test treats it as such, by passing '^pack-$base_pack.pack' as input to '--stdin-packs' (thus marking it as kept in-core). > For showing the impact of the optimizations in patches 7 and 8, I think > doing a full repack with --honor-pack-keep is a better test. Because > then we're always doing a full traversal, and most of the work continues > to scale with the repo size (though obviously not the actual shuffling > of packed bytes around). That would get rid of the weird "no work to do" > case in the single-pack tests, too. I think you're suggesting that we change the "repack ($nr_packs)" test to have the residual pack marked as kept (so we're measuring time it takes to repack everything that _isn't_ in the base pack)? That would allow a more direct comparison, but I think it's loosing out on an important aspect which is how long it takes to pack the entire repository. Maybe we want three. What do you think? > -Peff Thanks, Taylor
On Wed, Feb 17, 2021 at 02:13:47PM -0500, Taylor Blau wrote: > > > That's because the code paths around handling .keep files are known to > > > scale badly; they look in every single pack file to find each object. > > > > Well, part of it is just that with 1000 packs we have 20 times as many > > objects that are actually getting packed with --stdin-packs, compared to > > the 50-pack case. IIRC, each pack is a fixed-size slice and then the > > residual is put into the .keep pack. So the fact that the time gets > > closer to a full repack as we add more packs is expected: we are asking > > pack-objects to do more work! > > No, the residual base pack isn't marked as kept on-disk. But the > --stdin-packs test treats it as such, by passing '^pack-$base_pack.pack' > as input to '--stdin-packs' (thus marking it as kept in-core). Sorry, I perhaps shouldn't have said ".keep" here. But it's the same thing, isn't it? The 50 pack case is packing 50*pack_size objects (because it's excluding everything else that is in the base pack we mark as keep-in-core), and the 1000-pack case is packing 1000*pack_size objects (for the same reason). So any patterns we see between them have more to do with that, than how the keep-handling code scales with the number of non-kept packs. > > For showing the impact of the optimizations in patches 7 and 8, I think > > doing a full repack with --honor-pack-keep is a better test. Because > > then we're always doing a full traversal, and most of the work continues > > to scale with the repo size (though obviously not the actual shuffling > > of packed bytes around). That would get rid of the weird "no work to do" > > case in the single-pack tests, too. > > I think you're suggesting that we change the "repack ($nr_packs)" test > to have the residual pack marked as kept (so we're measuring time it > takes to repack everything that _isn't_ in the base pack)? > > That would allow a more direct comparison, but I think it's loosing out > on an important aspect which is how long it takes to pack the entire > repository. Maybe we want three. That was what I was suggesting, but I think it's equivalent to what your --stdin-packs is testing. I guess the most interesting thing would actually be an _additional_ pack mark as .keep (and that pack does not even have to contain anything interesting -- the point is how much effort it costs to find that out. Of course the bigger it is the more pronounced the effect of avoiding lookups in it). -Peff
diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh index d90d714923..b76a6efe00 100755 --- a/t/perf/p5303-many-packs.sh +++ b/t/perf/p5303-many-packs.sh @@ -31,8 +31,11 @@ repack_into_n () { ' "$1" >pushes && # create base packfile - head -n 1 pushes | - git pack-objects --delta-base-offset --revs staging/pack && + base_pack=$( + head -n 1 pushes | + git pack-objects --delta-base-offset --revs staging/pack + ) && + test_export base_pack && # and then incrementals between each pair of commits last= && @@ -49,6 +52,12 @@ repack_into_n () { last=$rev done <pushes && + ( + find staging -type f -name 'pack-*.pack' | + xargs -n 1 basename | grep -v "$base_pack" && + printf "^pack-%s.pack\n" $base_pack + ) >stdin.packs + # and install the whole thing rm -f .git/objects/pack/* && mv staging/* .git/objects/pack/ @@ -91,6 +100,15 @@ do --reflog --indexed-objects --delta-base-offset \ --stdout </dev/null >/dev/null ' + + test_perf "repack with --stdin-packs ($nr_packs)" ' + git pack-objects \ + --keep-true-parents \ + --stdin-packs \ + --non-empty \ + --delta-base-offset \ + --stdout <stdin.packs >/dev/null + ' done # Measure pack loading with 10,000 packs.