Message ID | 20200821175800.GC3263141@coredump.intra.peff.net (mailing list archive)
---|---
State | Accepted
Commit | fbff95b67f7e7bfc8799c2d6d16263cfeb61bc6f
Series | index-pack threading defaults
On Fri, Aug 21, 2020 at 1:58 PM Jeff King <peff@peff.net> wrote:
> So what's a good default value? It's clear that the current cap of 3
> is too low; our default values are 42% and 57% slower than the best
> times on each machine. The results on the 40-core machine imply that
> 20 threads is an actual barrier regardless of the number of cores, so
> we'll take that as a maximum. We get the best results on these
> machines at half of the online-cpus value. That's presumably a result
> of the hyperthreading. That's common on multi-core Intel processors,
> but not necessarily elsewhere. But if we take it as an assumption, we
> can perform optimally on hyperthreaded machines and still do much
> better than the status quo on other machines, as long as we never
> halve below the current value of 3.

I'm not familiar with the index-pack machinery, so this response may be
silly, but the first question which came to my mind was whether SSD
vs. spinning-platter disk impacts these results, and which of the two
you were using for the tests (which I don't think was mentioned in any
of the commit messages). So, basically, I'm wondering about the
implication of this change for those of us still stuck with old
spinning-platter disks.
On Fri, Aug 21, 2020 at 02:08:55PM -0400, Eric Sunshine wrote:
> On Fri, Aug 21, 2020 at 1:58 PM Jeff King <peff@peff.net> wrote:
> > So what's a good default value? It's clear that the current cap of
> > 3 is too low; our default values are 42% and 57% slower than the
> > best times on each machine. The results on the 40-core machine
> > imply that 20 threads is an actual barrier regardless of the number
> > of cores, so we'll take that as a maximum. We get the best results
> > on these machines at half of the online-cpus value. That's
> > presumably a result of the hyperthreading. That's common on
> > multi-core Intel processors, but not necessarily elsewhere. But if
> > we take it as an assumption, we can perform optimally on
> > hyperthreaded machines and still do much better than the status quo
> > on other machines, as long as we never halve below the current
> > value of 3.
>
> I'm not familiar with the index-pack machinery, so this response may
> be silly, but the first question which came to my mind was whether
> SSD vs. spinning-platter disk impacts these results, and which of the
> two you were using for the tests (which I don't think was mentioned
> in any of the commit messages). So, basically, I'm wondering about
> the implication of this change for those of us still stuck with old
> spinning-platter disks.

They were both SSD machines, but it wouldn't matter for these tests
because they easily fit the whole pack into memory anyway.

But in the general case, I don't think disk performance would be
relevant. Delta resolution is very CPU-bound, because it's
de-compressing data and then computing its SHA-1. So linux.git, for
instance, is looking at ~1.3GB on disk that expands to 87.5GB of bytes
to run through SHA-1. And it would be pretty unlikely to hit the disk
anyway, as the thing we primarily index is incoming packs which we've
literally just written. So I'd expect them to be in cache.

Of course, if you can get different numbers from p5302, I'd be curious
to hear them. :)

A more plausible downside might be that memory usage would increase as
we operate on multiple deltas at once. But pack-objects is already much
more hungry here, as it runs online_cpus() delta-compression threads
simultaneously, each of which may have up to window_size entries in
memory at once.

-Peff
On 2020-08-21 at 17:58:00, Jeff King wrote:
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index f865666db9..9721bf1ffe 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1798,9 +1798,22 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>
>  	if (HAVE_THREADS && !nr_threads) {
>  		nr_threads = online_cpus();
> -		/* An experiment showed that more threads does not mean faster */
> -		if (nr_threads > 3)
> -			nr_threads = 3;
> +		/*
> +		 * Experiments show that going above 20 threads doesn't help,
> +		 * no matter how many cores you have. Below that, we tend to
> +		 * max at half the number of online_cpus(), presumably because
> +		 * half of those are hyperthreads rather than full cores. We'll
> +		 * never reduce the level below "3", though, to match a
> +		 * historical value that nobody complained about.
> +		 */
> +		if (nr_threads < 4)
> +			; /* too few cores to consider capping */
> +		else if (nr_threads < 6)
> +			nr_threads = 3; /* historic cap */
> +		else if (nr_threads < 40)
> +			nr_threads /= 2;

I was going to ask if we could make the halving conditional on x86_64,
but it turns out POWER and UltraSPARC also have SMT, so that doesn't
make sense. I expect that most users who have more than 6 "cores" are
going to be on one of those systems or possibly ARM64, and since the
performance penalty of using half as many cores isn't that significant,
I'm not sure it's worth worrying about further. This will be an
improvement regardless.

Which is just a long way of saying, this patch seems fine to me.
On Sat, Aug 22, 2020 at 01:16:07AM +0000, brian m. carlson wrote:
> > +		if (nr_threads < 4)
> > +			; /* too few cores to consider capping */
> > +		else if (nr_threads < 6)
> > +			nr_threads = 3; /* historic cap */
> > +		else if (nr_threads < 40)
> > +			nr_threads /= 2;
>
> I was going to ask if we could make the halving conditional on x86_64,
> but it turns out POWER and UltraSPARC also have SMT, so that doesn't
> make sense. I expect that most users who have more than 6 "cores" are
> going to be on one of those systems or possibly ARM64, and since the
> performance penalty of using half as many cores isn't that
> significant, I'm not sure it's worth worrying about further. This
> will be an improvement regardless.
>
> Which is just a long way of saying, this patch seems fine to me.

OK, good. :)

I agree there may be room for more improvement on those systems. But
lacking access to any, my goal was to make things better on systems I
_could_ test on, and not make things worse on other systems.

So I'd be very happy if people on other platforms (especially non-intel
ones) wanted to run:

  cd t/perf
  GIT_PERF_EXTRA=1 \
  GIT_PERF_LARGE_REPO=/path/to/clone/of/linux.git \
  ./p5302-pack-index.sh

and report the results.

I do have a slightly-old AMD machine with 4 cores (an A8-7600). Here's
what it says:

  5302.3: index-pack 0 threads                   447.67(436.62+6.57)
  5302.4: index-pack 1 threads                   450.80(441.26+7.20)
  5302.5: index-pack 2 threads                   265.62(459.56+7.30)
  5302.6: index-pack 4 threads                   177.06(477.56+8.22)
  5302.7: index-pack default number of threads   202.60(473.15+7.61)

So it does get better with 4 threads (but we continue to cap it at 3).
I wonder whether we should just do:

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 9721bf1ffe..d7453d0c09 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1809,7 +1809,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (nr_threads < 4)
 		; /* too few cores to consider capping */
 	else if (nr_threads < 6)
-		nr_threads = 3; /* historic cap */
+		nr_threads = nr_threads;
 	else if (nr_threads < 40)
 		nr_threads /= 2;
 	else

That does probably make things slightly worse for a 6-core
hyperthreaded Intel machine. And it doesn't help an actual 8-core AMD
machine at all.

-Peff
On Mon, Aug 24, 2020 at 01:37:35PM -0400, Jeff King wrote:
> So I'd be very happy if people on other platforms (especially
> non-intel ones) wanted to run:
>
>   cd t/perf
>   GIT_PERF_EXTRA=1 \
>   GIT_PERF_LARGE_REPO=/path/to/clone/of/linux.git \
>   ./p5302-pack-index.sh
>
> and report the results.
>
> I do have a slightly-old AMD machine with 4 cores (an A8-7600).
> Here's what it says:
>
>   5302.3: index-pack 0 threads                   447.67(436.62+6.57)
>   5302.4: index-pack 1 threads                   450.80(441.26+7.20)
>   5302.5: index-pack 2 threads                   265.62(459.56+7.30)
>   5302.6: index-pack 4 threads                   177.06(477.56+8.22)
>   5302.7: index-pack default number of threads   202.60(473.15+7.61)

I tested on my 9.5 year old 4-core iMac with a 2.5 GHz Intel Core i5,
spinning-platter hard disk, and 12 GB memory (recently upgraded from
4 GB). I used git.git rather than linux.git because I didn't want to
wait several days for the results (plus this is my primary machine
which I'm actively using). Results were similar to what you saw:

  5302.3: index-pack 0 threads                   14.85(14.10+0.66)
  5302.4: index-pack 1 threads                   14.65(13.83+0.63)
  5302.5: index-pack 2 threads                    9.11(14.40+0.80)
  5302.6: index-pack 4 threads                    6.89(17.03+1.32)
  5302.7: index-pack default number of threads    7.72(16.15+1.15)

Thanks for providing exact instructions for running the test; I had
never done it before and thus didn't know how.
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f865666db9..9721bf1ffe 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1798,9 +1798,22 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 
 	if (HAVE_THREADS && !nr_threads) {
 		nr_threads = online_cpus();
-		/* An experiment showed that more threads does not mean faster */
-		if (nr_threads > 3)
-			nr_threads = 3;
+		/*
+		 * Experiments show that going above 20 threads doesn't help,
+		 * no matter how many cores you have. Below that, we tend to
+		 * max at half the number of online_cpus(), presumably because
+		 * half of those are hyperthreads rather than full cores. We'll
+		 * never reduce the level below "3", though, to match a
+		 * historical value that nobody complained about.
+		 */
+		if (nr_threads < 4)
+			; /* too few cores to consider capping */
+		else if (nr_threads < 6)
+			nr_threads = 3; /* historic cap */
+		else if (nr_threads < 40)
+			nr_threads /= 2;
+		else
+			nr_threads = 20; /* hard cap */
 	}
 
 	curr_pack = open_pack_file(pack_name);
Commit b8a2486f15 (index-pack: support multithreaded delta resolving,
2012-05-06) describes an experiment that shows that setting the number
of threads for index-pack higher than 3 does not help.

I repeated that experiment using a more modern version of Git and a
more modern CPU and got different results.

Here are timings for p5302 against linux.git run on my laptop, a Core
i9-9880H with 8 cores plus hyperthreading (so online-cpus returns 16):

  5302.3: index-pack 0 threads                   256.28(253.41+2.79)
  5302.4: index-pack 1 threads                   257.03(254.03+2.91)
  5302.5: index-pack 2 threads                   149.39(268.34+3.06)
  5302.6: index-pack 4 threads                    94.96(294.10+3.23)
  5302.7: index-pack 8 threads                    68.12(339.26+3.89)
  5302.8: index-pack 16 threads                   70.90(655.03+7.21)
  5302.9: index-pack default number of threads   116.91(290.05+3.21)

You can see that wall-clock times continue to improve dramatically up
to the number of cores, but bumping beyond that (into hyperthreading
territory) does not help (and in fact hurts a little).

Here's the same experiment on a machine with dual Xeon 6230's, totaling
40 cores (80 with hyperthreading):

  5302.3: index-pack 0 threads                   310.04(302.73+6.90)
  5302.4: index-pack 1 threads                   310.55(302.68+7.40)
  5302.5: index-pack 2 threads                   178.17(304.89+8.20)
  5302.6: index-pack 5 threads                    99.53(315.54+9.56)
  5302.7: index-pack 10 threads                   72.80(327.37+12.79)
  5302.8: index-pack 20 threads                   60.68(357.74+21.66)
  5302.9: index-pack 40 threads                   58.07(454.44+67.96)
  5302.10: index-pack 80 threads                  59.81(720.45+334.52)
  5302.11: index-pack default number of threads  134.18(309.32+7.98)

The results are similar; things stop improving at 40 threads.
Curiously, going from 20 to 40 really doesn't help much, either (and
increases CPU time considerably). So that may represent an actual
barrier to parallelism, where we lose out due to context-switching and
loss of cache locality, but don't reap the wall-clock benefits due to
contention of our coarse-grained locks.

So what's a good default value? It's clear that the current cap of 3
is too low; our default values are 42% and 57% slower than the best
times on each machine. The results on the 40-core machine imply that
20 threads is an actual barrier regardless of the number of cores, so
we'll take that as a maximum. We get the best results on these
machines at half of the online-cpus value. That's presumably a result
of the hyperthreading. That's common on multi-core Intel processors,
but not necessarily elsewhere. But if we take it as an assumption, we
can perform optimally on hyperthreaded machines and still do much
better than the status quo on other machines, as long as we never
halve below the current value of 3. So that's what this patch does.

Signed-off-by: Jeff King <peff@peff.net>
---
Somebody with an AMD machine with a bunch of cores may want to revisit
this. I don't think it makes anything worse, but they may want to not
do the halving. I don't know if there's an easy way for us to determine
the difference via online_cpus(), though.

 builtin/index-pack.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)