[v2,0/4] submodule: parallelize diff

Message ID 20221011232604.839941-1-calvinwan@google.com

Message

Calvin Wan Oct. 11, 2022, 11:26 p.m. UTC
Changes since v1

This series is now rebased on top of Avar's run-command refactor
series[1], which allows new functionality to be easily added to
run_processes_parallel without having to change other callers.

The config option has been renamed to submodule.diffJobs. This name
more accurately reflects what functionality is affected. While other
APIs parse their config options in the initial command, this option is
parsed inside the diff-lib.c library, because run_diff_files has so
many entry points that passing the option in from each of them would
require significant code changes.
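
To illustrate where that parsing could live, here is a minimal sketch of
reading the option lazily inside diff-lib.c (the helper name and exact
placement are my own illustration, not necessarily what the patches do):

  /* diff-lib.c sketch; needs config.h for git_config_get_int() */
  static int submodule_diff_jobs(void)
  {
          static int jobs = -1;

          if (jobs < 0) {
                  /* submodule.diffJobs; config keys are looked up lowercased */
                  if (git_config_get_int("submodule.diffjobs", &jobs) ||
                      jobs < 0)
                          jobs = 1; /* unset or invalid: stay serial */
          }
          return jobs;
  }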

I also wanted to pose another question to the list regarding defaults
for parallel processes. For jobs that clearly scale with the number of
processes (i.e. jobs that are mostly processor bound), it is obvious
that setting the default number of processes to the number of available
cores is optimal. However, this changes when the job is mostly I/O
bound or is a mix of I/O and processing. Looking at my use case for
`status` on a cold cache (see below), we notice that increasing the
number of parallel processes speeds up status, but past a certain
number it actually starts slowing down. This is because `status` spends
most of its time reading from the index. While SSDs can serve a certain
number of reads in parallel, that limit can be significantly smaller
than the number of cores, and going past it oversubscribes the device
and slows `status` down. With an HDD, parallel reads aren't possible at
all, and while I haven't tested my patch on an HDD, I suspect that
adding multiple processes would make `status` slower than the baseline.
How should the default be set then? In my case, the safe option would
be to default it to 1, but then many users wouldn't discover this
optimization. There are also multiple places in the Git documentation
where we say that setting a config option for a parallel process to 0
results in a "reasonable amount" of processes, which generally means
the number of available cores. Can this be problematic for other
parallel processes that spend a significant amount of time in I/O?
Should my option even allow a value of 0, given these pitfalls?
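
For reference, the usual pattern elsewhere in the codebase is to map a
configured value of 0 to the number of online CPUs, roughly like the
sketch below (resolve_jobs is a hypothetical name; online_cpus() is the
existing helper from thread-utils.h):

  #include "thread-utils.h" /* online_cpus() */

  static int resolve_jobs(int configured)
  {
          if (configured == 0)
                  return online_cpus(); /* "reasonable amount": one per core */
          if (configured < 0)
                  return 1;             /* invalid: stay serial */
          return configured;
  }

The question above is whether that convention still makes sense for a
job that is largely I/O bound.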

Original cover letter:

When running 'git status' in a superproject, git serially spawns a
subprocess to run status in every submodule. For projects with many
large submodules, parallelizing the status subprocesses can
significantly speed up the runtime for both cold and warm caches. While
my initial intention was to speed up status, it turns out that much of
the runtime spent on status is in diff-lib.c, resulting in speedups for
many other commands that use this library.

Here are some timing tests from running status on the Android Open
Source Project (AOSP). My machine has an SSD and 48 cores. 
  Warm Cache:
    git 2.37
Time (mean ± σ):     17.685 s ±  2.040 s    [User: 5.041 s, System: 22.799 s]
Range (min … max):   16.168 s … 22.804 s    10 runs

    this patch (status.parallelSubmodules=1)
Time (mean ± σ):     13.102 s ±  0.500 s    [User: 4.894 s, System: 19.533 s]
Range (min … max):   12.841 s … 14.447 s    10 runs

    this patch (status.parallelSubmodules=5)
Time (mean ± σ):      3.994 s ±  0.152 s    [User: 4.998 s, System: 20.805 s]
Range (min … max):    3.744 s …  4.163 s    10 runs

    this patch (status.parallelSubmodules=10)
Time (mean ± σ):      3.445 s ±  0.085 s    [User: 5.151 s, System: 20.208 s]
Range (min … max):    3.319 s …  3.586 s    10 runs

    this patch (status.parallelSubmodules=20)
Time (mean ± σ):      3.626 s ±  0.109 s    [User: 5.087 s, System: 20.366 s]
Range (min … max):    3.438 s …  3.763 s    10 runs

We can see that there are diminishing returns and even slightly worse
performance past a certain number of max processes, but at the optimum
(around 10 processes) there is a speedup factor of roughly 5
(17.7 s / 3.4 s).

  Cold Cache:
    git 2.37
      mean of 3 runs: 6m32s
    this patch (status.parallelSubmodules=1)
      mean of 3 runs: 5m34s
    this patch (status.parallelSubmodules=5)
      mean of 3 runs: 2m23s
    this patch (status.parallelSubmodules=10)
      mean of 3 runs: 2m45s
    this patch (status.parallelSubmodules=20)
      mean of 3 runs: 3m23s

We can see the same phenomenon as above; at the optimum (around 5
processes) there is a speedup factor of roughly 2.7 (6m32s / 2m23s).

Patch 1 adds output piping to run_processes_parallel_opts so that the
output from each submodule can be parsed. Patches 2 and 3 move
preexisting functionality into separate functions and refactor code to
prepare for patch 4, which implements the parallelization.
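
For illustration, the new callback could have roughly the following
shape (the name pipe_output_fn comes from the patch title; the exact
signature and the consumer below are assumptions of mine):

  /*
   * Invoked whenever a child process produced output, with the bytes
   * collected since the last poll in "out", instead of relaying them
   * to stderr.
   */
  typedef void (*pipe_output_fn)(struct strbuf *out,
                                 void *pp_cb, void *pp_task_cb);

  /* hypothetical consumer for a submodule's "status --porcelain=2" */
  static void parse_submodule_status(struct strbuf *out,
                                     void *pp_cb, void *pp_task_cb)
  {
          /* feed "out" to the status parser instead of printing it */
  }

A caller would set such a callback in run_processes_parallel_opts
alongside get_next_task, which is how diff-lib.c can consume each
submodule's status output as it arrives.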

Future work: status is much slower on a cold cache than a warm cache
mainly because of refreshing the index. It is worth investigating
whether that refresh is entirely necessary.

[1] https://lore.kernel.org/git/cover-00.15-00000000000-20220930T111343Z-avarab@gmail.com/

Calvin Wan (4):
  run-command: add pipe_output_fn to run_processes_parallel_opts
  submodule: move status parsing into function
  diff-lib: refactor match_stat_with_submodule
  diff-lib: parallelize run_diff_files for submodules

 Documentation/config/submodule.txt |  12 ++
 diff-lib.c                         | 108 ++++++++++++--
 run-command.c                      |  12 +-
 run-command.h                      |  22 +++
 submodule.c                        | 230 +++++++++++++++++++++++++----
 submodule.h                        |   9 ++
 t/helper/test-run-command.c        |  18 +++
 t/t0061-run-command.sh             |  65 ++++----
 t/t4027-diff-submodule.sh          |  19 +++
 t/t7506-status-submodule.sh        |  19 +++
 10 files changed, 441 insertions(+), 73 deletions(-)

Comments

Junio C Hamano Oct. 12, 2022, 5:52 a.m. UTC | #1
Calvin Wan <calvinwan@google.com> writes:

> I also wanted to pose another question to list regarding defaults for
> parallel processes. For jobs that clearly scale with the number of
> processes (aka jobs that are mostly processor bound), it is obvious that
> setting the default number of processes to the number of available cores
> is the most optimal option. However, this changes when the job is mostly
> I/O bound or has a combination of I/O and processing. Looking at my use
> case for `status` on a cold cache (see below), we notice that increasing
> the number of parallel processes speeds up status, but after a certain
> number, it actually starts slowing down.

I do not offhand recall how the default parallelism is computed
there, but if I am correct to suspect that "git grep" has a similar
scaling pattern, i.e. the threads all need to compete for I/O to
read from the filesystem to find needles from the haystack, perhaps
it would give us a precedent to model the behaviour of this part of
the code, too, hopefully?
Calvin Wan Oct. 14, 2022, 12:39 a.m. UTC | #2
> Calvin Wan <calvinwan@google.com> writes:
>
> > I also wanted to pose another question to list regarding defaults for
> > parallel processes. For jobs that clearly scale with the number of
> > processes (aka jobs that are mostly processor bound), it is obvious that
> > setting the default number of processes to the number of available cores
> > is the most optimal option. However, this changes when the job is mostly
> > I/O bound or has a combination of I/O and processing. Looking at my use
> > case for `status` on a cold cache (see below), we notice that increasing
> > the number of parallel processes speeds up status, but after a certain
> > number, it actually starts slowing down.
>
> I do not offhand recall how the default parallelism is computed
> there, but if I am correct to suspect that "git grep" has a similar
> scaling pattern, i.e. the threads all need to compete for I/O to
> read from the filesystem to find needles from the haystack, perhaps
> it would give us a precedent to model the behaviour of this part of
> the code, too, hopefully?

Setting grep.threads=0 does default it to the number of available cores
(at least the documentation is clear about this). I tested "git grep"
on my machine and found that it started slowing down after 4 threads --
most likely because my NVMe SSD uses 4 PCIe lanes, i.e. it can do at
most 4 reads in parallel. AFAIK there is no way to tell how many reads
a disk can do in parallel. That, coupled with the fact that other
commands have varying levels of I/O requirements, makes it impossible
to set a "reasonable" number of threads.