[v3,0/2] send_ref buffering

Message ID 20210831093444.28199-1-jacob@gitlab.com (mailing list archive)

Message

Jacob Vosmaer Aug. 31, 2021, 9:34 a.m. UTC
Changes compared to v2:
- remove setvbuf call
- add packet_fwrite_fmt
- add packet_fflush

Non-changes:
- no ferror calls because those are for reads, not for writes
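
As background for the two new helpers, here is a minimal sketch of pkt-line framing on top of stdio. It is not the code from the patch: the sketch_ names and the 1000-byte payload limit are hypothetical, and treating packet_fflush as "flush packet, then fflush" is an assumption. The only firm facts are the wire format: four hex digits giving the length (counting the four digits themselves), then the payload, with "0000" as the flush packet.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a stdio-based formatted packet writer.
 * Returns 0 on success, -1 on formatting or write failure.
 */
static int sketch_packet_fwrite_fmt(FILE *fh, const char *fmt, ...)
{
	char payload[1000]; /* arbitrary limit chosen for this sketch */
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(payload, sizeof(payload), fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= sizeof(payload))
		return -1;

	/* Length prefix: four hex digits that count the prefix itself. */
	if (fprintf(fh, "%04x", (unsigned)n + 4) < 0)
		return -1;
	/* The payload lands in the stdio buffer; no write(2) is forced here. */
	if (fwrite(payload, 1, (size_t)n, fh) != (size_t)n)
		return -1;
	return 0;
}

/* Assumed behaviour: emit a flush packet, then flush the stdio buffer. */
static int sketch_packet_fflush(FILE *fh)
{
	if (fwrite("0000", 1, 4, fh) != 4)
		return -1;
	return fflush(fh) ? -1 : 0;
}
```

With this shape, a callback can emit many small packets and only the final flush (or a full buffer) reaches the kernel.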

Thanks for the reactions everyone. I agree that packet_fwrite_fmt
simplifies the patch nicely. Jeff, I hope I have given you credit
in an appropriate way, let me know if you want me to change something
there.

Regarding setvbuf: I have found out that GNU coreutils has a utility
called stdbuf that lets you modify the stdout buffer size at runtime
using some LD_PRELOAD hack so we can use that in Gitaly. I don't
think this is the best outcome for users, we ought to give them a
good default instead of expecting them to invoke git-upload-pack
as 'stdbuf -o 64K git-upload-pack'. But I can't judge the impact
of globally changing the stdout buffer size for Git so I'll settle
for having to use stdbuf.
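
For reference, an in-process equivalent of the stdbuf approach could look roughly like the sketch below. This is not the call that was removed from v2; the helper name and the static 64 KiB buffer are hypothetical. The buffer is passed explicitly because glibc documents that a NULL buffer only changes the mode, so relying on the size argument alone would be an assumption about the C library.

```c
#include <stdio.h>

/* Hypothetical 64 KiB buffer; it must outlive the stream that uses it. */
static char sketch_stream_buf[64 * 1024];

/*
 * Switch a stream to full buffering with a caller-supplied buffer.
 * setvbuf() has to run before any other operation on the stream.
 * Returns 0 on success, nonzero on failure (same as setvbuf).
 */
static int sketch_set_full_buffering(FILE *fh, char *buf, size_t size)
{
	return setvbuf(fh, buf, _IOFBF, size);
}
```

A caller would invoke sketch_set_full_buffering(stdout, sketch_stream_buf, sizeof(sketch_stream_buf)) once at startup, before the first write.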

Jacob Vosmaer (2):
  pkt-line: add packet_fwrite and packet_fwrite_fmt
  upload-pack: use stdio in send_ref callbacks

 cache.h        |  2 ++
 ls-refs.c      |  4 +++-
 pkt-line.c     | 30 ++++++++++++++++++++++++++++++
 pkt-line.h     |  8 ++++++++
 upload-pack.c  |  8 +++++---
 write-or-die.c | 12 ++++++++++++
 6 files changed, 60 insertions(+), 4 deletions(-)

Comments

Jeff King Aug. 31, 2021, 10:25 a.m. UTC | #1
On Tue, Aug 31, 2021 at 11:34:42AM +0200, Jacob Vosmaer wrote:

> Thanks for the reactions everyone. I agree that packet_fwrite_fmt
> simplifies the patch nicely. Jeff, I hope I have given you credit
> in an appropriate way, let me know if you want me to change something
> there.

What you did looks fine.

Overall the series looks much nicer, and I don't have any real
complaints. I do think it would be nice to take the packet_writer
interface further (letting it replace the static buf, and use stdio
handles, and using it throughout upload-pack). But this is a strict
improvement, so we can do that other refactoring later.

> Regarding setvbuf: I have found out that GNU coreutils has a utility
> called stdbuf that lets you modify the stdout buffer size at runtime
> using some LD_PRELOAD hack so we can use that in Gitaly. I don't
> think this is the best outcome for users, we ought to give them a
> good default instead of expecting them to invoke git-upload-pack
> as 'stdbuf -o 64K git-upload-pack'. But I can't judge the impact
> of globally changing the stdout buffer size for Git so I'll settle
> for having to use stdbuf.

Does the 64k buffer actually improve things? Here are the timings I get
on a repo with ~1M refs (it's linux.git with one ref per commit). "git"
is the current unbuffered version, and "git.compile" is master with your
patches on top:

  $ hyperfine -i 'git upload-pack .' 'git.compile upload-pack .' 'stdbuf -o 64K git.compile upload-pack .'
  Benchmark #1: git upload-pack .
    Time (mean ± σ):     948.6 ms ±   7.3 ms    [User: 840.8 ms, System: 107.8 ms]
    Range (min … max):   937.7 ms … 961.1 ms    10 runs
   
    Warning: Ignoring non-zero exit code.
   
  Benchmark #2: git.compile upload-pack .
    Time (mean ± σ):     867.3 ms ±   6.8 ms    [User: 821.5 ms, System: 45.7 ms]
    Range (min … max):   859.7 ms … 883.0 ms    10 runs
   
    Warning: Ignoring non-zero exit code.
   
  Benchmark #3: stdbuf -o 64K git.compile upload-pack .
    Time (mean ± σ):     861.1 ms ±   8.2 ms    [User: 815.5 ms, System: 45.6 ms]
    Range (min … max):   846.1 ms … 872.0 ms    10 runs
   
    Warning: Ignoring non-zero exit code.
   
  Summary
    'stdbuf -o 64K git.compile upload-pack .' ran
      1.01 ± 0.01 times faster than 'git.compile upload-pack .'
      1.10 ± 0.01 times faster than 'git upload-pack .'

This is on a glibc system, so the default buffers should be 4k. It
doesn't appear to make any difference (there's a slight improvement, but
well within the noise, and I had other runs where it did worse).

By the way, if you really want to speed things up, try this:

  $ hyperfine -i 'git.compile upload-pack .' 'GIT_REF_PARANOIA=1 git.compile upload-pack .'
  Benchmark #1: git.compile upload-pack .
    Time (mean ± σ):     855.4 ms ±   5.8 ms    [User: 803.4 ms, System: 52.0 ms]
    Range (min … max):   848.7 ms … 869.5 ms    10 runs
   
    Warning: Ignoring non-zero exit code.
   
  Benchmark #2: GIT_REF_PARANOIA=1 git.compile upload-pack .
    Time (mean ± σ):     394.4 ms ±   3.0 ms    [User: 357.9 ms, System: 36.4 ms]
    Range (min … max):   390.6 ms … 400.3 ms    10 runs
   
    Warning: Ignoring non-zero exit code.
   
  Summary
    'GIT_REF_PARANOIA=1 git.compile upload-pack .' ran
      2.17 ± 0.02 times faster than 'git.compile upload-pack .'

It's not exactly the intended use of that environment variable, but its
side effect is that we do not call has_object_file() on each ref tip.

-Peff

Jacob Vosmaer Aug. 31, 2021, 1:08 p.m. UTC | #2
On Tue, Aug 31, 2021 at 12:25 PM Jeff King <peff@peff.net> wrote:
> I do think it would be nice to take the packet_writer
> interface further (letting it replace the static buf, and use stdio
> handles, and using it throughout upload-pack).

I would like that too, for the sake of neatness and general
performance, but I don't have the time to take on a larger project
like that at the moment.

> Does the 64k buffer actually improve things? Here are the timings I get
> on a repo with ~1M refs (it's linux.git with one ref per commit).

Thanks for challenging that. I have a repeatable benchmark where it
matters, because each write syscall wakes up a chain of proxies
between the user and git-upload-pack. Larger buffers mean fewer
wake-ups. But then I tried to simplify my example by having sshd as
the only intermediary, and in that experiment 64K buffers were not
better than 4K buffers. I think that goes to show that picking a good
buffer size is hard, and we'd be better off picking one specifically
for Gitaly (and GitLab) that works with our stack.
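
The "fewer wake-ups" argument can be put as simple arithmetic. The sketch below assumes a simplified model in which stdio issues one write(2) each time the buffer fills, plus one for a final partial buffer; real C libraries may behave differently, and the function name is hypothetical.

```c
#include <stddef.h>

/*
 * Simplified model: number of write(2) calls needed to push
 * total_bytes through a fully buffered stream of buf_size bytes.
 * This is a ceiling division; buf_size must be nonzero.
 */
static size_t sketch_write_calls(size_t total_bytes, size_t buf_size)
{
	return total_bytes / buf_size + (total_bytes % buf_size != 0);
}
```

Under this model, 1 MiB of ref advertisement takes 256 writes with a 4 KiB buffer and 16 with a 64 KiB buffer. Whether that difference is measurable depends on what sits on the other end of the pipe, which matches the mixed results described above.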

>   Summary
>     'GIT_REF_PARANOIA=1 git.compile upload-pack .' ran
>       2.17 ± 0.02 times faster than 'git.compile upload-pack .'
>
> It's not exactly the intended use of that environment variable, but its
> side effect is that we do not call has_object_file() on each ref tip.

That is nice to know, but as a user of Git I don't know when it is or
is not safe to skip those has_object_file() calls. If it's safe to
skip them then Git should skip them always. If not, then I will err on
the side of caution and keep the checks.

Jacob

Jacob Vosmaer Aug. 31, 2021, 5:44 p.m. UTC | #3
On Tue, Aug 31, 2021 at 3:08 PM Jacob Vosmaer <jacob@gitlab.com> wrote:
>
> On Tue, Aug 31, 2021 at 12:25 PM Jeff King <peff@peff.net> wrote:
> > I do think it would be nice to take the packet_writer
> > interface further (letting it replace the static buf, and use stdio
> > handles, and using it throughout upload-pack).
> I would like that too, for the sake of neatness and general
> performance, but I don't have the time to take on a larger project
> like that at the moment.

I gave solving the problem with packet_writer a couple of hours today.
The diff gets too big, and I have too little confidence I'm not
introducing deadlocks. This really is more work than I can take on
right now. Sorry!

Jacob