Message ID | 7a9525a78a7b7b237150b9264cf675a4a0b37267.1555110278.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Avoid spawning gzip in git archive | expand |
On Fri, Apr 12, 2019 at 04:04:39PM -0700, Rohit Ashiwal via GitGitGadget wrote: > From: Rohit Ashiwal <rohit.ashiwal265@gmail.com> > > MinGit for Windows comes without `gzip` bundled inside, git-archive uses > `gzip -cn` to compress tar files but for this to work, gzip needs to be > present on the host system. > > In the next commit, we will change the gzip compression so that we no > longer spawn `gzip` but let zlib perform the compression in the same > process instead. > > In preparation for this, we consolidate all the block writes into a > single function. Sounds like a good preparatory step. This part confused me, though: > @@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar, > #define USTAR_MAX_MTIME 077777777777ULL > #endif > > +/* writes out the whole block, or dies if fails */ > +static void write_block_or_die(const char *block) { > + if (gzip) { > + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) > + die(_("gzwrite failed")); > + } else { > + write_or_die(1, block, BLOCKSIZE); > + } > +} What is gzwrite()? At first I thought this was an out-of-sequence bit of the series, but it turns out that this is a zlib.h interface. So the idea (I think) is that here we introduce a "gzip" variable that is always false, and this first conditional arm is effectively dead code. And then in a later patch we'd set up "gzip" and it would become not-dead. I think it would be less confusing if this just factored out write_block_or_die(), which starts as a thin wrapper and then grows the gzip parts in the next patch. A few nits on the code itself: > +static gzFile gzip; > [...] > + if (gzip) { Is it OK for us to ask about the truthiness of this opaque type? That works if it's really a pointer behind the scenes, but it seems like it would be equally OK for zlib to declare it as a struct. It looks OK in my version of zlib, and that library tends to be fairly conservative so I wouldn't be surprised if it was that way back to the beginning and remains that way for eternity. But it feels like a bad pattern. > + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) This cast is interesting. All of the matching write_or_die() calls are promoting it to a size_t, which is also unsigned. BLOCKSIZE is a constant. Should we be defining it with a "U" in the first place? I doubt it matters much either way from a correctness perspective. I just wonder when I see a cast like that if we're going to get unexpected truncation or similar. -Peff
Jeff King <peff@peff.net> writes: >> +/* writes out the whole block, or dies if fails */ >> +static void write_block_or_die(const char *block) { >> + if (gzip) { >> + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) >> + die(_("gzwrite failed")); >> + } else { >> + write_or_die(1, block, BLOCKSIZE); >> + } >> +} I agree everything you said you your two review messages. One thing you did not mention but I found disturbing was that this does not take size argument but hardcodes BLOCKSIZE. If the patch were removing use of BLOCKSIZE in its callers (because everybody uses the same constant), this would not have bothered me, but as the caller passes BLOCKSIZE to all its callees except this one, I found that the interface optimizes for a wrong thing (i.e. reducing one-time pain of writing this single patch of having to repeat BLOCKSIZE in all calls to this function). This function should be updated to take the size_t and have its caller(s) pass BLOCKSIZE. Thanks for a review, and thanks Rohit for starting to get rid of external dependency on gzip binary.
Hey Peff! On 2019-04-13 1:34 UTC Jeff King <peff@peff.net> wrote: > What is gzwrite()? > [...] > I think it would be less confusing if this just factored out > write_block_or_die(), which starts as a thin wrapper and then grows the > gzip parts in the next patch. You are right, it might appear to someone as a bit confusing, but I feel like, this is the right commit to put it. > Is it OK for us to ask about the truthiness of this opaque type? That > works if it's really a pointer behind the scenes, but it seems like it > would be equally OK for zlib to declare it as a struct. It would be perfectly sane on zlib's part to make gzFile a struct, and if so happens, I'll be there to refactor the code. Regards Rohit
Hey jch! I'll change the signature of the function in next revision. Thanks Rohit
Rohit Ashiwal <rohit.ashiwal265@gmail.com> writes: > On 2019-04-13 1:34 UTC Jeff King <peff@peff.net> wrote: > >> What is gzwrite()? >> [...] >> I think it would be less confusing if this just factored out >> write_block_or_die(), which starts as a thin wrapper and then grows the >> gzip parts in the next patch. > > You are right, it might appear to someone as a bit confusing, but I feel > like, this is the right commit to put it. Often, the original author is the worst judge about the patch series organization, because s/he has been staring at his or her own patches too long and knows too much about them. Unless the author is very experienced and is good at pretending to be the first-time reader when proofreading his or her own patch, that is. FWIW, I tend to agree with Peff that the organization would become much easier to follow with "first refactor without new feature, and in gzip related step add gzip thing". >> Is it OK for us to ask about the truthiness of this opaque type? That >> works if it's really a pointer behind the scenes, but it seems like it >> would be equally OK for zlib to declare it as a struct. Or a small integer indexing into an internal array the library keeps track of ;-) At that point, truthiness would be completely gone, and the compiler would not help catching "if (opaque)" as a syntax error (if the library implements the opaque thing as a structure, then we will be saved). > It would be perfectly sane on zlib's part to make gzFile a struct, and if > so happens, I'll be there to refactor the code. We do not trust any single developer enough with "I'll do so when needed"---in practice, it will often be done by somebody else, and more importantly, we would want anybody to be able to take things over, instead of relying on any one "indispensable contributor". If it is reasonable to expect that things can easily be broken by an external factor, I'd prefer to see us being defensive from day one.
Hi Peff, On Fri, 12 Apr 2019, Jeff King wrote: > On Fri, Apr 12, 2019 at 04:04:39PM -0700, Rohit Ashiwal via GitGitGadget wrote: > > > From: Rohit Ashiwal <rohit.ashiwal265@gmail.com> > > > > MinGit for Windows comes without `gzip` bundled inside, git-archive uses > > `gzip -cn` to compress tar files but for this to work, gzip needs to be > > present on the host system. > > > > In the next commit, we will change the gzip compression so that we no > > longer spawn `gzip` but let zlib perform the compression in the same > > process instead. > > > > In preparation for this, we consolidate all the block writes into a > > single function. > > Sounds like a good preparatory step. This part confused me, though: > > > @@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar, > > #define USTAR_MAX_MTIME 077777777777ULL > > #endif > > > > +/* writes out the whole block, or dies if fails */ > > +static void write_block_or_die(const char *block) { > > + if (gzip) { > > + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) > > + die(_("gzwrite failed")); > > + } else { > > + write_or_die(1, block, BLOCKSIZE); > > + } > > +} > > What is gzwrite()? At first I thought this was an out-of-sequence bit of > the series, but it turns out that this is a zlib.h interface. So the > idea (I think) is that here we introduce a "gzip" variable that is > always false, and this first conditional arm is effectively dead code. > And then in a later patch we'd set up "gzip" and it would become > not-dead. > > I think it would be less confusing if this just factored out > write_block_or_die(), which starts as a thin wrapper and then grows the > gzip parts in the next patch. Yes, I missed this in my pre-submission review. Sorry about that! > A few nits on the code itself: > > > +static gzFile gzip; > > [...] > > + if (gzip) { > > Is it OK for us to ask about the truthiness of this opaque type? That > works if it's really a pointer behind the scenes, but it seems like it > would be equally OK for zlib to declare it as a struct. > > It looks OK in my version of zlib, and that library tends to be fairly > conservative so I wouldn't be surprised if it was that way back to the > beginning and remains that way for eternity. But it feels like a bad > pattern. It is even part of the public API that `gzFile` is `typedef`'d to a pointer. So I think in the interest of simplicity, I'll leave it at that (but I'll mention this in the commit message). > > + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) > > This cast is interesting. All of the matching write_or_die() calls are > promoting it to a size_t, which is also unsigned. > > BLOCKSIZE is a constant. Should we be defining it with a "U" in the > first place? Yep, good idea. Ciao, Dscho
Hi Junio, On Sat, 13 Apr 2019, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > >> +/* writes out the whole block, or dies if fails */ > >> +static void write_block_or_die(const char *block) { > >> + if (gzip) { > >> + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) > >> + die(_("gzwrite failed")); > >> + } else { > >> + write_or_die(1, block, BLOCKSIZE); > >> + } > >> +} > > I agree everything you said you your two review messages. > > One thing you did not mention but I found disturbing was that this > does not take size argument but hardcodes BLOCKSIZE. That is very much on purpose, as this code really is specific to the `tar` file format, which has a fixed, well-defined block size. It would make it easier to introduce a bug if that was a parameter. Ciao, Dscho
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: >> >> +/* writes out the whole block, or dies if fails */ >> >> +static void write_block_or_die(const char *block) { >> >> + if (gzip) { >> >> + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) >> >> + die(_("gzwrite failed")); >> >> + } else { >> >> + write_or_die(1, block, BLOCKSIZE); >> >> + } >> >> +} >> >> I agree everything you said you your two review messages. >> >> One thing you did not mention but I found disturbing was that this >> does not take size argument but hardcodes BLOCKSIZE. > > That is very much on purpose, as this code really is specific to the `tar` > file format, which has a fixed, well-defined block size. It would make it > easier to introduce a bug if that was a parameter. I am not so sure for two reasons. One is that its caller is full of BLOCKSIZE constants passed as parameters (instead of calling a specialized function that hardcodes the BLOCKSIZE without taking it as a parameter), and this being a file-scope static, it does not really matter with respect to an accidental bug of mistakenly changing BLOCKSIZE either in the caller or callee. Another is that I am not sure how your "fixed format" argument meshes with the "-b blocksize" parameter to affect the tar/pax output. The format may be fixed, but it is parameterized. If we ever need to grow the ability to take "-b", having the knowledge that our current code is limited to the fixed BLOCKSIZE in a single function (i.e. the caller of this function , not the callee) would be less error prone. These two are in addition to the uniformity of the abstraction concerns I raised in my original review comment. So, sorry, I do not think your response makes much sense. Thanks.
Hi Junio, On Sat, 27 Apr 2019, Junio C Hamano wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > >> >> +/* writes out the whole block, or dies if fails */ > >> >> +static void write_block_or_die(const char *block) { > >> >> + if (gzip) { > >> >> + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) > >> >> + die(_("gzwrite failed")); > >> >> + } else { > >> >> + write_or_die(1, block, BLOCKSIZE); > >> >> + } > >> >> +} > >> > >> I agree everything you said you your two review messages. > >> > >> One thing you did not mention but I found disturbing was that this > >> does not take size argument but hardcodes BLOCKSIZE. > > > > That is very much on purpose, as this code really is specific to the `tar` > > file format, which has a fixed, well-defined block size. It would make it > > easier to introduce a bug if that was a parameter. > > I am not so sure for two reasons. > > One is that its caller is full of BLOCKSIZE constants passed as > parameters (instead of calling a specialized function that hardcodes > the BLOCKSIZE without taking it as a parameter), and this being a > file-scope static, it does not really matter with respect to an > accidental bug of mistakenly changing BLOCKSIZE either in the caller > or callee. I guess I can try to find some time next week to clean up those callers. But honestly, I do not really think that this cleanup falls squarely into the goal of this here patch series. > Another is that I am not sure how your "fixed format" argument > meshes with the "-b blocksize" parameter to affect the tar/pax > output. The format may be fixed, but it is parameterized. If > we ever need to grow the ability to take "-b", having the knowledge > that our current code is limited to the fixed BLOCKSIZE in a single > function (i.e. the caller of this function , not the callee) would > be less error prone. This argument would hold a lot more water if the following lines were not part of archive-tar.c: #define RECORDSIZE (512) #define BLOCKSIZE (RECORDSIZE * 20) static char block[BLOCKSIZE]; If you can tell me how the `-b` (run-time) parameter can affect the (compile-time) `BLOCKSIZE` constant, maybe I can start to understand your concern. :-) Ciao, Dscho P.S.: I just looked, and I do not even see a `-b` option of `git archive`, so I suspect that you talked about the generic tar file format? I was not talking about each and every implementation of the tar file format here, I was talking about the tar file format that Git generates.
On Fri, Apr 26, 2019 at 10:28:12AM -0400, Johannes Schindelin wrote: > > > +static gzFile gzip; > > > [...] > > > + if (gzip) { > > > > Is it OK for us to ask about the truthiness of this opaque type? That > > works if it's really a pointer behind the scenes, but it seems like it > > would be equally OK for zlib to declare it as a struct. > > > > It looks OK in my version of zlib, and that library tends to be fairly > > conservative so I wouldn't be surprised if it was that way back to the > > beginning and remains that way for eternity. But it feels like a bad > > pattern. > > It is even part of the public API that `gzFile` is `typedef`'d to a > pointer. So I think in the interest of simplicity, I'll leave it at that > (but I'll mention this in the commit message). I think that's probably OK. My biggest concern is that we'd notice if our assumption changes, but I think modern compilers would generally complain about checking a tautological truth value. -Peff
On Mon, Apr 29, 2019 at 05:32:50PM -0400, Johannes Schindelin wrote: > > Another is that I am not sure how your "fixed format" argument > > meshes with the "-b blocksize" parameter to affect the tar/pax > > output. The format may be fixed, but it is parameterized. If > > we ever need to grow the ability to take "-b", having the knowledge > > that our current code is limited to the fixed BLOCKSIZE in a single > > function (i.e. the caller of this function , not the callee) would > > be less error prone. > > This argument would hold a lot more water if the following lines were not > part of archive-tar.c: > > #define RECORDSIZE (512) > #define BLOCKSIZE (RECORDSIZE * 20) > > static char block[BLOCKSIZE]; > > If you can tell me how the `-b` (run-time) parameter can affect the > (compile-time) `BLOCKSIZE` constant, maybe I can start to understand your > concern. FWIW, I agree with you here. These patches are not making anything worse (and may even make them better, since we'd probably need to swap out the BLOCKSIZE constant for a run-time "blocksize" variable in fewer places). -Peff
Am 01.05.19 um 20:09 schrieb Jeff King: > On Mon, Apr 29, 2019 at 05:32:50PM -0400, Johannes Schindelin wrote: > >>> Another is that I am not sure how your "fixed format" argument >>> meshes with the "-b blocksize" parameter to affect the tar/pax >>> output. The format may be fixed, but it is parameterized. If >>> we ever need to grow the ability to take "-b", having the knowledge >>> that our current code is limited to the fixed BLOCKSIZE in a single >>> function (i.e. the caller of this function , not the callee) would >>> be less error prone. >> >> This argument would hold a lot more water if the following lines were not >> part of archive-tar.c: >> >> #define RECORDSIZE (512) >> #define BLOCKSIZE (RECORDSIZE * 20) >> >> static char block[BLOCKSIZE]; >> >> If you can tell me how the `-b` (run-time) parameter can affect the >> (compile-time) `BLOCKSIZE` constant, maybe I can start to understand your >> concern. > > FWIW, I agree with you here. These patches are not making anything worse > (and may even make them better, since we'd probably need to swap out the > BLOCKSIZE constant for a run-time "blocksize" variable in fewer places). The block size is mostly relevant for writing tar archives to magnetic tapes. You can do that with git archive and a tape drive that supports the blocking factor 20, which is the default for GNU tar and thus should be quite common. You may get higher performance with a higher blocking factor, if supported. But so far this didn't come up on the mailing list, and I'd be surprised if people really wrote snapshots of git archives directly to tape. So I'm not too worried about this define ever becoming a user-settable option. Sealing the constant into a function a bit feels dirty, though. Mixing code and data makes the code more brittle. Another example of that is the hard-coded file descriptor in the same function, by the way. It's a lot of busywork to undo in order to gain the ability to write to some other fd, for the questionable convenience of not having to pass that parameter along the call chain. My bad. But anyway, I worry more about the fact that blocking is not needed when gzip'ing; gzwrite can be fed pieces of any size, not just 20 KB chunks. The tar writer just needs to round up the archive size to a multiple of 20 KB and pad with NUL bytes at the end, in order to produce the same uncompressed output as non-compressing tar. If we'd wanted to be tape-friendly, then we'd have to block the gzip'ed output instead of the uncompressed tar file, but I'm not suggesting doing that. Note to self: I wonder if moving the blocking part out into an asynchronous function could simplify the code. René
Jeff King <peff@peff.net> writes: > FWIW, I agree with you here. These patches are not making anything worse > (and may even make them better, since we'd probably need to swap out the > BLOCKSIZE constant for a run-time "blocksize" variable in fewer places). It's just that leaving the interface uneven is an easy way to introduce an unnecessary bug, e.g. -type function(args) { +type function(args, size_t blocksize) { decls; - helper_one(BLOCKSIZE, other, args); + helper_one(blocksize, other, args); helper_two(its, args); - helper_three(BLOCKSIZE, even, more, args); + helper_three(blocksize, even, more, args); } when this caller is away from the implementation of helper_two() that hardcodes the assumption that this callchain only uses BLOCKSIZE and in an implicit way. And that can easily be avoided by defensively making helper_two() to take BLOCKSIZE as an argument as everybody else in the caller does. I do not actually care too deeply, though. Hopefully whoever adds "-b" would be careful enough to follow all callchain, and at least look at all the callees that are file-scope static, and the one I have trouble with _is_ a file-scope static. Or maybe nobody does "-b", in which case this ticking time bomb will not trigger, so we'd be OK.
On Sun, May 05, 2019 at 02:25:59PM +0900, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > FWIW, I agree with you here. These patches are not making anything worse > > (and may even make them better, since we'd probably need to swap out the > > BLOCKSIZE constant for a run-time "blocksize" variable in fewer places). > > It's just that leaving the interface uneven is an easy way to > introduce an unnecessary bug, e.g. > > -type function(args) { > +type function(args, size_t blocksize) { > decls; > - helper_one(BLOCKSIZE, other, args); > + helper_one(blocksize, other, args); > helper_two(its, args); > - helper_three(BLOCKSIZE, even, more, args); > + helper_three(blocksize, even, more, args); > } > > when this caller is away from the implementation of helper_two() > that hardcodes the assumption that this callchain only uses > BLOCKSIZE and in an implicit way. > > And that can easily be avoided by defensively making helper_two() to > take BLOCKSIZE as an argument as everybody else in the caller does. > > I do not actually care too deeply, though. Hopefully whoever adds > "-b" would be careful enough to follow all callchain, and at least > look at all the callees that are file-scope static, and the one I > have trouble with _is_ a file-scope static. Right, my assumption was that the first step in the conversion would be somebody doing s/BLOCKSIZE/global_blocksize_variable/. But that is just a guess. > Or maybe nobody does "-b", in which case this ticking time bomb will > not trigger, so we'd be OK. Yes. I suspect we're probably going down an unproductive tangent. :) -Peff
diff --git a/archive-tar.c b/archive-tar.c index 4aabd566fb..ba37dad27c 100644 --- a/archive-tar.c +++ b/archive-tar.c @@ -17,6 +17,8 @@ static unsigned long offset; static int tar_umask = 002; +static gzFile gzip; + static int write_tar_filter_archive(const struct archiver *ar, struct archiver_args *args); @@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar, #define USTAR_MAX_MTIME 077777777777ULL #endif +/* writes out the whole block, or dies if fails */ +static void write_block_or_die(const char *block) { + if (gzip) { + if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE) + die(_("gzwrite failed")); + } else { + write_or_die(1, block, BLOCKSIZE); + } +} + /* writes out the whole block, but only if it is full */ static void write_if_needed(void) { if (offset == BLOCKSIZE) { - write_or_die(1, block, BLOCKSIZE); + write_block_or_die(block); offset = 0; } } @@ -66,7 +78,7 @@ static void do_write_blocked(const void *data, unsigned long size) write_if_needed(); } while (size >= BLOCKSIZE) { - write_or_die(1, buf, BLOCKSIZE); + write_block_or_die(buf); size -= BLOCKSIZE; buf += BLOCKSIZE; } @@ -101,10 +113,10 @@ static void write_trailer(void) { int tail = BLOCKSIZE - offset; memset(block + offset, 0, tail); - write_or_die(1, block, BLOCKSIZE); + write_block_or_die(block); if (tail < 2 * RECORDSIZE) { memset(block, 0, offset); - write_or_die(1, block, BLOCKSIZE); + write_block_or_die(block); } }