mbox series

[0/1] Make upload-pack pack write size configurable

Message ID 20211213132345.26310-1-jacob@gitlab.com (mailing list archive)
Headers show
Series Make upload-pack pack write size configurable | expand

Message

Jacob Vosmaer Dec. 13, 2021, 1:23 p.m. UTC
When transfering packfile data, upload-pack.c uses an 8KB buffer.
This is a reasonable size but when you transfer a lot of packfile
data, like we do on GitLab.com, we find it is beneficial to use a
larger buffer size.

Below you will find a commit where we make the size of this 8KB
buffer configurable at compile time. It appears pack-objects always
does 8KB writes so I don't think we should change the default. But
for GitLab, where we have a cache for the output of pack-objects,
it is beneficial to use a larger IO size because the cache does
64KB writes.

I have also considered converting the packfile copying code to use
stdio when writing to stdout, but that would be a bigger change
because we have to be careful not to interleave stdio and stdlib
writes. And we would have to make the stdout output buffer size
configurable, because the default stdio buffer size is 4KB which
is no better than the status quo. A final argument against the stdio
approach is that it only reduces the number of writes from upload-pack,
while a larger buffer size reduces both the number of reads and
writes.

Having said all that, if the Git maintainers prefer the stdio
approach over this compile time constant, I am happy to submit a
patch series for that instead.

Thanks,

Jacob Vosmaer

Jacob Vosmaer (1):
  upload-pack.c: make output buffer size configurable

 upload-pack.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Jeff King Dec. 14, 2021, 3:12 p.m. UTC | #1
On Mon, Dec 13, 2021 at 02:23:44PM +0100, Jacob Vosmaer wrote:

> When transfering packfile data, upload-pack.c uses an 8KB buffer.
> This is a reasonable size but when you transfer a lot of packfile
> data, like we do on GitLab.com, we find it is beneficial to use a
> larger buffer size.
> 
> Below you will find a commit where we make the size of this 8KB
> buffer configurable at compile time. It appears pack-objects always
> does 8KB writes so I don't think we should change the default. But
> for GitLab, where we have a cache for the output of pack-objects,
> it is beneficial to use a larger IO size because the cache does
> 64KB writes.

I suspect the big reason that it matters for the cache is that for
pack-objects, we're typically CPU bound computing the sha1 of the
outgoing pack (for its trailer).

I do suspect we could just always move to something closer to
LARGE_PACKET_DATA_MAX (probably minus 1 for the sideband marker). Even
if pack-objects only writes in 8k chunks, we may be able to pull several
in one read(), especially if the network is a bottleneck (because we'd
block writing to the client, and pack-objects would fill up the pipe
buffer).

I.e., it doesn't seem like there's any real downside to doing so.

> I have also considered converting the packfile copying code to use
> stdio when writing to stdout, but that would be a bigger change
> because we have to be careful not to interleave stdio and stdlib
> writes. And we would have to make the stdout output buffer size
> configurable, because the default stdio buffer size is 4KB which
> is no better than the status quo. A final argument against the stdio
> approach is that it only reduces the number of writes from upload-pack,
> while a larger buffer size reduces both the number of reads and
> writes.

Yeah, I don't think that would help all that much. We really want to
size this based on pkt-line limits. That reduces syscalls, but also
shrinks the overall size a little (since larger packets minimizes the
framing overhead of the pkt-line headers).

-Peff