[0/3] index-pack threading defaults

Message ID 20200821175153.GA3263018@coredump.intra.peff.net

Message

Jeff King Aug. 21, 2020, 5:51 p.m. UTC
I decided to revisit index-pack's hard-coded default of 3 threads. And
it turns out that higher numbers perform much better. :)

That value was determined experimentally in 2012. I'm not sure of the
exact reason it's different now (modern processors are better at
parallelism, or modern git is better at parallelism, or the original
experiment was just a fluke). But regardless, I can get on the order of
a two-to-one speedup by bumping the thread count. See the final patch
for timings and more specific discussion.
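
If you want to experiment on your own repositories in the meantime, the
cap is only a default and can already be overridden. A quick sketch
("big.pack" is just a placeholder name):

  # resolve deltas with 8 threads instead of the built-in default
  git index-pack --threads=8 --stdin <big.pack

  # the same knob via config; "git -c" also reaches the index-pack
  # that fetch and clone spawn under the hood
  git -c pack.threads=8 fetch origin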

The first two patches are just perf-test improvements (actually, the
slowness-to-value ratio of p5302 from the first patch is why I was
looking at this at all).

  [1/3]: p5302: disable thread-count parameter tests by default
  [2/3]: p5302: count up to online-cpus for thread tests
  [3/3]: index-pack: adjust default threading cap

 builtin/index-pack.c       | 19 ++++++++++++---
 t/perf/README              |  9 ++++++++
 t/perf/p5302-pack-index.sh | 47 +++++++++++++++++++-------------------
 t/perf/perf-lib.sh         |  2 ++
 4 files changed, 51 insertions(+), 26 deletions(-)
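
For reference, here's roughly how to reproduce the timings discussed in
the series (this assumes the GIT_PERF_EXTRA switch that patch 1 adds to
opt back into the per-thread-count tests; the repo path is just an
example). The perf harness reports times as wall-clock(user+sys)
seconds:

  # from a built git tree; a large repository makes the numbers
  # more meaningful
  cd t/perf
  GIT_PERF_LARGE_REPO=/path/to/linux.git \
  GIT_PERF_EXTRA=1 \
  ./p5302-pack-index.sh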

-Peff

Comments

Jeff King Aug. 21, 2020, 6:44 p.m. UTC | #1
On Fri, Aug 21, 2020 at 01:51:53PM -0400, Jeff King wrote:

> That value was determined experimentally in 2012. I'm not sure of the
> exact reason it's different now (modern processors are better at
> parallelism, or modern git is better at parallelism, or the original
> experiment was just a fluke). But regardless, I can get on the order of
> a two-to-one speedup by bumping the thread count. See the final patch
> for timings and more specific discussion.

After writing a response elsewhere in the thread, it occurred to me that
a good candidate for explaining this may be that our modern sha1dc
implementation is way slower than what we were using in 2012 (which
would have been either block-sha1, or the even-faster openssl
implementation). And since a good chunk of index-pack's time is going to
computing sha1 hashes on the resulting objects, that means that since
2012, we're spending relatively more time in the hash computation (which
parallelizes per-object) and less time in the other parts that happen
under a lock.
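
One way to poke at that hypothesis is to rebuild with each hash
implementation and rerun p5302. A sketch using the usual Makefile knobs
(run from the top of the git.git tree):

  # sha1dc, the collision-detecting implementation, is the default
  make

  # the older (and faster) block-sha1 implementation
  make BLK_SHA1=1

  # or link against OpenSSL's SHA-1, typically the fastest of the three
  make OPENSSL_SHA1=1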

-Peff
Junio C Hamano Aug. 21, 2020, 6:59 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

> On Fri, Aug 21, 2020 at 01:51:53PM -0400, Jeff King wrote:
>
>> That value was determined experimentally in 2012. I'm not sure of the
>> exact reason it's different now (modern processors are better at
>> parallelism, or modern git is better at parallelism, or the original
>> experiment was just a fluke). But regardless, I can get on the order of
>> a two-to-one speedup by bumping the thread count. See the final patch
>> for timings and more specific discussion.
>
> After writing a response elsewhere in the thread, it occurred to me that
> a good candidate for explaining this may be that our modern sha1dc
> implementation is way slower than what we were using in 2012 (which
> would have been either block-sha1, or the even-faster openssl
> implementation). And since a good chunk of index-pack's time is going to
> computing sha1 hashes on the resulting objects, that means that since
> 2012, we're spending relatively more time in the hash computation (which
> parallelizes per-object) and less time in the other parts that happen
> under a lock.

Believable conjecture, that is.  You could benchmark again with
block-sha1 on today's hardware, but the performance profile with
sha1dc is what matters in the real world anyway...

Thanks.
Jeff King Aug. 21, 2020, 7:14 p.m. UTC | #3
On Fri, Aug 21, 2020 at 11:59:58AM -0700, Junio C Hamano wrote:

> > After writing a response elsewhere in the thread, it occurred to me that
> > a good candidate for explaining this may be that our modern sha1dc
> > implementation is way slower than what we were using in 2012 (which
> > would have been either block-sha1, or the even-faster openssl
> > implementation). And since a good chunk of index-pack's time is going to
> > computing sha1 hashes on the resulting objects, that means that since
> > 2012, we're spending relatively more time in the hash computation (which
> > parallelizes per-object) and less time in the other parts that happen
> > under a lock.
> 
> Believable conjecture, that is.  You could benchmark again with
> block-sha1 on today's hardware, but the performance profile with
> sha1dc is what matters in the real world anyway...

Yeah, I agree on the "real world" part, but I'm the curious sort, so
here are numbers compiled against openssl (which is generally even
faster than block-sha1, and should thus emphasize the effect our
hypothesis predicts):

  5302.3: index-pack 0 threads                   108.78(106.39+2.31)
  5302.4: index-pack 1 threads                   110.65(108.08+2.49)
  5302.5: index-pack 2 threads                   67.57(110.83+2.75) 
  5302.6: index-pack 4 threads                   48.18(123.82+3.02) 
  5302.7: index-pack 8 threads                   39.07(153.45+4.13) 
  5302.8: index-pack 16 threads                  38.38(265.78+7.71) 
  5302.9: index-pack default number of threads   54.64(117.35+2.73)

So it's actually pretty similar: things continue getting faster as we
go past 3 threads. But the total improvement is smaller (39.07s with 8
threads vs. 54.64s for the 3-thread "default" row, a 29% speedup,
versus 42% when using sha1dc). So I think the slower hash is _part_ of
the reason, but not all of it.

-Peff