mbox series

[0/2] Avoid spawning gzip in git archive

Message ID pull.145.git.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Avoid spawning gzip in git archive | expand

Message

John Cai via GitGitGadget April 12, 2019, 11:04 p.m. UTC
When creating .tar.gz archives with git archive, we let gzip handle the
compression part. But that is not even necessary, as we already require zlib
(to compress our loose objects). It is also unfortunate, as it requires gzip 
to be in the PATH (which is not the case e.g. with MinGit for Windows, which
tries to bundle just the bare minimum of files to make Git work
non-interactively, for use with 3rd-party applications requiring Git).

This patch series resolves this conundrum by teaching git archive the trick
to gzip-compress in-process.

Rohit Ashiwal (2):
  archive: replace write_or_die() calls with write_block_or_die()
  archive: avoid spawning `gzip`

 archive-tar.c | 54 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 13 deletions(-)


base-commit: 8104ec994ea3849a968b4667d072fedd1e688642
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-145%2Fdscho%2Fdont-spawn-gzip-in-archive-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-145/dscho/dont-spawn-gzip-in-archive-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/145

Comments

René Scharfe May 2, 2019, 8:29 p.m. UTC | #1
Am 27.04.19 um 01:27 schrieb Rohit Ashiwal via GitGitGadget:
> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
>
> As we already link to the zlib library, we can perform the compression
> without even requiring gzip on the host machine.
>
> Note: the `-n` flag that `git archive` passed to `gzip` wants to ensure
> that a reproducible file is written, i.e. no filename or mtime will be
> recorded in the compressed output. This is already the default for
> zlib's `gzopen()` function (if the file name or mtime should be
> recorded, the `deflateSetHeader()` function would have to be called
> instead).
>
> Note also that the `gzFile` datatype is defined as a pointer in
> `zlib.h`, i.e. we can rely on the fact that it can be `NULL`.
>
> At this point, this new mode is hidden behind the pseudo command
> `:zlib`: assign this magic string to the `archive.tgz.command` config
> setting to enable it.

Technically the patch emits the gzip format using the gz* functions.
Raw zlib output with deflate* would be slightly different.  So I'd
rather use "gzip" instead of "zlib" in the magic string.

And I'm not sure about the colon as the only magic marker.  Perhaps
throw in a "git " or "git-" instead or in addition?

> @@ -459,18 +464,40 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  	filter.use_shell = 1;
>  	filter.in = -1;
>
> -	if (start_command(&filter) < 0)
> -		die_errno(_("unable to start '%s' filter"), argv[0]);
> -	close(1);
> -	if (dup2(filter.in, 1) < 0)
> -		die_errno(_("unable to redirect descriptor"));
> -	close(filter.in);
> +	if (!strcmp(":zlib", ar->data)) {
> +		struct strbuf mode = STRBUF_INIT;
> +
> +		strbuf_addstr(&mode, "wb");
> +
> +		if (args->compression_level >= 0 && args->compression_level <= 9)
> +			strbuf_addf(&mode, "%d", args->compression_level);

Using gzsetparams() to set the compression level numerically after gzdopen()
instead of baking it into the mode string feels cleaner.

> +
> +		gzip = gzdopen(fileno(stdout), mode.buf);
> +		if (!gzip)
> +			die(_("Could not gzdopen stdout"));
> +		strbuf_release(&mode);
> +	} else {
> +		if (start_command(&filter) < 0)
> +			die_errno(_("unable to start '%s' filter"), argv[0]);
> +		close(1);
> +		if (dup2(filter.in, 1) < 0)
> +			die_errno(_("unable to redirect descriptor"));
> +		close(filter.in);
> +	}
>
>  	r = write_tar_archive(ar, args);
>
> -	close(1);
> -	if (finish_command(&filter) != 0)
> -		die(_("'%s' filter reported error"), argv[0]);
> +	if (gzip) {
> +		int ret = gzclose(gzip);
> +		if (ret == Z_ERRNO)
> +			die_errno(_("gzclose failed"));
> +		else if (ret != Z_OK)
> +			die(_("gzclose failed (%d)"), ret);
> +	} else {
> +		close(1);
> +		if (finish_command(&filter) != 0)
> +			die(_("'%s' filter reported error"), argv[0]);
> +	}
>
>  	strbuf_release(&cmd);
>  	return r;
>
René Scharfe May 2, 2019, 8:30 p.m. UTC | #2
Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> parameter to pass to `write_or_die()`.

True, but the compiler converts that value correctly to size_t without
complaint already, doesn't it?  What am I missing?

>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  archive-tar.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/archive-tar.c b/archive-tar.c
> index af9ea70733..be06c8b205 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -9,7 +9,7 @@
>  #include "streaming.h"
>  #include "run-command.h"
>
> -#define RECORDSIZE	(512)
> +#define RECORDSIZE	(512u)
>  #define BLOCKSIZE	(RECORDSIZE * 20)
>
>  static char block[BLOCKSIZE];
>
Johannes Schindelin May 8, 2019, 11:45 a.m. UTC | #3
Hi René,

On Thu, 2 May 2019, René Scharfe wrote:

> Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > parameter to pass to `write_or_die()`.
>
> True, but the compiler converts that value correctly to size_t without
> complaint already, doesn't it?  What am I missing?

Are you talking about a specific compiler? It sure sounds as if you did.

I really do not want to fall into the "you can build Git with *any*
compiler, as long as that compiler happens to be GCC, oh, and as long it
is version X" trap.

We *already* rely on GCC's optimization in way too many places for my
liking, e.g. when we adapted the `hasheq()` code *specifically* to make
GCC's particular optimization strategies to kick in.

Or the way we defined the `SWAP()` macro: it depends on GCC's ability to
see through the veil and out-guess the code, deducing its intent rather
than what it *says* ("Do As I Want, Not As I Say", anyone?). We *do* want
to swap registers when possible (instead of forcing register variables to
be written to memory just for the sake of being swapped, as our code says
rather explicitly).

Essentially, we build a cruise ship of a dependency on GCC here. Which
should not make anybody happy (except maybe the GCC folks).

Let's not make things worse.

Ciao,
Dscho
Jeff King May 8, 2019, 11:04 p.m. UTC | #4
On Wed, May 08, 2019 at 01:45:25PM +0200, Johannes Schindelin wrote:

> Hi René,
> 
> On Thu, 2 May 2019, René Scharfe wrote:
> 
> > Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >
> > > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > > parameter to pass to `write_or_die()`.
> >
> > True, but the compiler converts that value correctly to size_t without
> > complaint already, doesn't it?  What am I missing?
> 
> Are you talking about a specific compiler? It sure sounds as if you did.
> 
> I really do not want to fall into the "you can build Git with *any*
> compiler, as long as that compiler happens to be GCC, oh, and as long it
> is version X" trap.

I don't this this has anything to do with gcc. The point is that we
already have this line:

  write_or_die(fd, buf, BLOCKSIZE);

which does not cast and nobody has complained, even though the signed
constant is implicitly converted to a size_t. So adding another line
like:

  gzwrite(gzip, block, BLOCKSIZE);

would in theory be treated the same (gzwrite takes an "unsigned").

The conversion from signed to unsigned is well defined in ANSI C, and
I'd expect a compiler to either complain about neither or both (and the
latter probably with warnings like -Wconversion cranked up).

But of course if you have data otherwise, we can revise that. Was the
cast added out of caution, or to squelch a compiler warning?

-Peff
Johannes Schindelin May 9, 2019, 2:06 p.m. UTC | #5
Hi Peff,

On Wed, 8 May 2019, Jeff King wrote:

> On Wed, May 08, 2019 at 01:45:25PM +0200, Johannes Schindelin wrote:
>
> > Hi René,
> >
> > On Thu, 2 May 2019, René Scharfe wrote:
> >
> > > Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > > > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > >
> > > > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > > > parameter to pass to `write_or_die()`.
> > >
> > > True, but the compiler converts that value correctly to size_t without
> > > complaint already, doesn't it?  What am I missing?
> >
> > Are you talking about a specific compiler? It sure sounds as if you did.
> >
> > I really do not want to fall into the "you can build Git with *any*
> > compiler, as long as that compiler happens to be GCC, oh, and as long it
> > is version X" trap.
>
> I don't this this has anything to do with gcc. The point is that we
> already have this line:
>
>   write_or_die(fd, buf, BLOCKSIZE);
>
> which does not cast and nobody has complained,

I mistook this part of your reply in
https://public-inbox.org/git/20190413013451.GB2040@sigill.intra.peff.net/
as precisely such a complaint:

	BLOCKSIZE is a constant. Should we be defining it with a "U" in
	the first place?

Thanks,
Dscho

> even though the signed
> constant is implicitly converted to a size_t. So adding another line
> like:
>
>   gzwrite(gzip, block, BLOCKSIZE);
>
> would in theory be treated the same (gzwrite takes an "unsigned").
>
> The conversion from signed to unsigned is well defined in ANSI C, and
> I'd expect a compiler to either complain about neither or both (and the
> latter probably with warnings like -Wconversion cranked up).
>
> But of course if you have data otherwise, we can revise that. Was the
> cast added out of caution, or to squelch a compiler warning?
>
> -Peff
>
Jeff King May 9, 2019, 6:38 p.m. UTC | #6
On Thu, May 09, 2019 at 04:06:22PM +0200, Johannes Schindelin wrote:

> > I don't this this has anything to do with gcc. The point is that we
> > already have this line:
> >
> >   write_or_die(fd, buf, BLOCKSIZE);
> >
> > which does not cast and nobody has complained,
> 
> I mistook this part of your reply in
> https://public-inbox.org/git/20190413013451.GB2040@sigill.intra.peff.net/
> as precisely such a complaint:
> 
> 	BLOCKSIZE is a constant. Should we be defining it with a "U" in
> 	the first place?

Ah, sorry to introduce confusion. I mostly meant "if we need to cast,
why not just define as unsigned in the first place?". But I think René
was pointing out that we do not even need to cast, and I am fine with
that approach.

I do dream of a world where we do not have a bunch of implicit
conversions (both signedness but also truncation) in our code base, and
can compile cleanly with -Wconversion We know that this case is
perfectly fine, but I am sure there are many that are not. However, I'm
not sure if we'll ever get there, and in the meantime I don't think it's
worth worrying too much about individual cases like this.

-Peff
René Scharfe May 10, 2019, 5:18 p.m. UTC | #7
Am 09.05.19 um 20:38 schrieb Jeff King:
> I do dream of a world where we do not have a bunch of implicit
> conversions (both signedness but also truncation) in our code base, and
> can compile cleanly with -Wconversion We know that this case is
> perfectly fine, but I am sure there are many that are not. However, I'm
> not sure if we'll ever get there, and in the meantime I don't think it's
> worth worrying too much about individual cases like this.

Here's a rough take on how to silence that warning for archive-tar.c using
GCC 8.3.  Some of the changes are worth polishing and submitting.  Some
are silly.  The one for regexec_buf() is scary; I don't see a clean way of
dealing with that size_t to int conversion.

---
 archive-tar.c     | 54 +++++++++++++++++++++++++++++++----------------
 cache.h           | 10 ++++-----
 git-compat-util.h | 21 +++++++++++++++---
 hash.h            |  2 +-
 strbuf.h          |  2 +-
 5 files changed, 61 insertions(+), 28 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..bfd91782ab 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -15,7 +15,7 @@
 static char block[BLOCKSIZE];
 static unsigned long offset;

-static int tar_umask = 002;
+static mode_t tar_umask = 002;

 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args);
@@ -99,7 +99,7 @@ static void write_blocked(const void *data, unsigned long size)
  */
 static void write_trailer(void)
 {
-	int tail = BLOCKSIZE - offset;
+	size_t tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
 	write_or_die(1, block, BLOCKSIZE);
 	if (tail < 2 * RECORDSIZE) {
@@ -127,12 +127,13 @@ static int stream_blocked(const struct object_id *oid)
 		readlen = read_istream(st, buf, sizeof(buf));
 		if (readlen <= 0)
 			break;
-		do_write_blocked(buf, readlen);
+		do_write_blocked(buf, (size_t)readlen);
 	}
 	close_istream(st);
-	if (!readlen)
-		finish_record();
-	return readlen;
+	if (readlen < 0)
+		return -1;
+	finish_record();
+	return 0;
 }

 /*
@@ -142,9 +143,9 @@ static int stream_blocked(const struct object_id *oid)
  * string and appends it to a struct strbuf.
  */
 static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
-				     const char *value, unsigned int valuelen)
+				     const char *value, size_t valuelen)
 {
-	int len, tmp;
+	size_t len, tmp;

 	/* "%u %s=%s\n" */
 	len = 1 + 1 + strlen(keyword) + 1 + valuelen + 1;
@@ -152,7 +153,7 @@ static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
 		len++;

 	strbuf_grow(sb, len);
-	strbuf_addf(sb, "%u %s=", len, keyword);
+	strbuf_addf(sb, "%"PRIuMAX" %s=", (uintmax_t)len, keyword);
 	strbuf_add(sb, value, valuelen);
 	strbuf_addch(sb, '\n');
 }
@@ -168,7 +169,9 @@ static void strbuf_append_ext_header_uint(struct strbuf *sb,
 	int len;

 	len = xsnprintf(buf, sizeof(buf), "%"PRIuMAX, value);
-	strbuf_append_ext_header(sb, keyword, buf, len);
+	if (len < 0)
+		BUG("unable to convert %"PRIuMAX" to decimal", value);
+	strbuf_append_ext_header(sb, keyword, buf, (size_t)len);
 }

 static unsigned int ustar_header_chksum(const struct ustar_header *header)
@@ -177,7 +180,7 @@ static unsigned int ustar_header_chksum(const struct ustar_header *header)
 	unsigned int chksum = 0;
 	while (p < (const unsigned char *)header->chksum)
 		chksum += *p++;
-	chksum += sizeof(header->chksum) * ' ';
+	chksum += (unsigned int)sizeof(header->chksum) * ' ';
 	p += sizeof(header->chksum);
 	while (p < (const unsigned char *)header + sizeof(struct ustar_header))
 		chksum += *p++;
@@ -355,12 +358,14 @@ static void write_global_extended_header(struct archiver_args *args)
 }

 static struct archiver **tar_filters;
-static int nr_tar_filters;
-static int alloc_tar_filters;
+static size_t nr_tar_filters;
+static size_t alloc_tar_filters;

-static struct archiver *find_tar_filter(const char *name, int len)
+static struct archiver *find_tar_filter(const char *name, size_t len)
 {
 	int i;
+	if (len < 1)
+		return NULL;
 	for (i = 0; i < nr_tar_filters; i++) {
 		struct archiver *ar = tar_filters[i];
 		if (!strncmp(ar->name, name, len) && !ar->name[len])
@@ -369,14 +374,27 @@ static struct archiver *find_tar_filter(const char *name, int len)
 	return NULL;
 }

+static int parse_config_key2(const char *var, const char *section,
+			     const char **subsection, size_t *subsection_len,
+			     const char **key)
+{
+	int rc, len;
+
+	rc = parse_config_key(var, section, subsection, &len, key);
+	if (!rc && len < 0)
+		return -1;
+	*subsection_len = (size_t)len;
+	return rc;
+}
+
 static int tar_filter_config(const char *var, const char *value, void *data)
 {
 	struct archiver *ar;
 	const char *name;
 	const char *type;
-	int namelen;
+	size_t namelen;

-	if (parse_config_key(var, "tar", &name, &namelen, &type) < 0 || !name)
+	if (parse_config_key2(var, "tar", &name, &namelen, &type) < 0 || !name)
 		return 0;

 	ar = find_tar_filter(name, namelen);
@@ -400,7 +418,7 @@ static int tar_filter_config(const char *var, const char *value, void *data)
 		if (git_config_bool(var, value))
 			ar->flags |= ARCHIVER_REMOTE;
 		else
-			ar->flags &= ~ARCHIVER_REMOTE;
+			ar->flags &= ~(unsigned int)ARCHIVER_REMOTE;
 		return 0;
 	}

@@ -414,7 +432,7 @@ static int git_tar_config(const char *var, const char *value, void *cb)
 			tar_umask = umask(0);
 			umask(tar_umask);
 		} else {
-			tar_umask = git_config_int(var, value);
+			tar_umask = (mode_t)git_config_ulong(var, value);
 		}
 		return 0;
 	}
diff --git a/cache.h b/cache.h
index 67cc2e1806..a791034260 100644
--- a/cache.h
+++ b/cache.h
@@ -241,7 +241,7 @@ static inline void copy_cache_entry(struct cache_entry *dst,
 				    const struct cache_entry *src)
 {
 	unsigned int state = dst->ce_flags & CE_HASHED;
-	int mem_pool_allocated = dst->mem_pool_allocated;
+	unsigned int mem_pool_allocated = dst->mem_pool_allocated;

 	/* Don't copy hash chain and name */
 	memcpy(&dst->ce_stat_data, &src->ce_stat_data,
@@ -249,7 +249,7 @@ static inline void copy_cache_entry(struct cache_entry *dst,
 			offsetof(struct cache_entry, ce_stat_data));

 	/* Restore the hash state */
-	dst->ce_flags = (dst->ce_flags & ~CE_HASHED) | state;
+	dst->ce_flags = (dst->ce_flags & ~(unsigned int)CE_HASHED) | state;

 	/* Restore the mem_pool_allocated flag */
 	dst->mem_pool_allocated = mem_pool_allocated;
@@ -1314,7 +1314,7 @@ extern int check_and_freshen_file(const char *fn, int freshen);
 extern const signed char hexval_table[256];
 static inline unsigned int hexval(unsigned char c)
 {
-	return hexval_table[c];
+	return (unsigned int)hexval_table[c];
 }

 /*
@@ -1323,8 +1323,8 @@ static inline unsigned int hexval(unsigned char c)
  */
 static inline int hex2chr(const char *s)
 {
-	unsigned int val = hexval(s[0]);
-	return (val & ~0xf) ? val : (val << 4) | hexval(s[1]);
+	int val = hexval_table[(unsigned char)s[0]];
+	return (val < 0) ? val : (val << 4) | hexval_table[(unsigned char)s[1]];
 }

 /* Convert to/from hex/sha1 representation */
diff --git a/git-compat-util.h b/git-compat-util.h
index 4386b3e1c8..cf33e84c96 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1068,7 +1068,7 @@ static inline int strtoul_ui(char const *s, int base, unsigned int *result)
 	ul = strtoul(s, &p, base);
 	if (errno || *p || p == s || (unsigned int) ul != ul)
 		return -1;
-	*result = ul;
+	*result = (unsigned int)ul;
 	return 0;
 }

@@ -1081,7 +1081,7 @@ static inline int strtol_i(char const *s, int base, int *result)
 	ul = strtol(s, &p, base);
 	if (errno || *p || p == s || (int) ul != ul)
 		return -1;
-	*result = ul;
+	*result = (int)ul;
 	return 0;
 }

@@ -1119,7 +1119,22 @@ static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
 {
 	assert(nmatch > 0 && pmatch);
 	pmatch[0].rm_so = 0;
-	pmatch[0].rm_eo = size;
+	pmatch[0].rm_eo = (regoff_t)size;
+	if (pmatch[0].rm_eo != size) {
+		if (((regoff_t)-1) < 0) {
+			if (sizeof(regoff_t) == sizeof(int))
+				pmatch[0].rm_eo = (regoff_t)INT_MAX;
+			else if (sizeof(regoff_t) == sizeof(long))
+				pmatch[0].rm_eo = (regoff_t)LONG_MAX;
+			else
+				die("unable to determine maximum value of regoff_t");
+		} else {
+			pmatch[0].rm_eo = (regoff_t)-1;
+		}
+		warning("buffer too big (%"PRIuMAX"), "
+			"will search only the first %"PRIuMAX" bytes",
+			(uintmax_t)size, (uintmax_t)pmatch[0].rm_eo);
+	}
 	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
 }

diff --git a/hash.h b/hash.h
index 661c9f2281..7056f89eb4 100644
--- a/hash.h
+++ b/hash.h
@@ -134,7 +134,7 @@ int hash_algo_by_id(uint32_t format_id);
 /* Identical, except based on the length. */
 int hash_algo_by_length(int len);
 /* Identical, except for a pointer to struct git_hash_algo. */
-static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
+static inline ptrdiff_t hash_algo_by_ptr(const struct git_hash_algo *p)
 {
 	return p - hash_algos;
 }
diff --git a/strbuf.h b/strbuf.h
index c8d98dfb95..30659f2d5d 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -225,7 +225,7 @@ int strbuf_cmp(const struct strbuf *first, const struct strbuf *second);
 /**
  * Add a single character to the buffer.
  */
-static inline void strbuf_addch(struct strbuf *sb, int c)
+static inline void strbuf_addch(struct strbuf *sb, char c)
 {
 	if (!strbuf_avail(sb))
 		strbuf_grow(sb, 1);
--
2.21.0
Jeff King May 10, 2019, 9:20 p.m. UTC | #8
On Fri, May 10, 2019 at 07:18:44PM +0200, René Scharfe wrote:

> Am 09.05.19 um 20:38 schrieb Jeff King:
> > I do dream of a world where we do not have a bunch of implicit
> > conversions (both signedness but also truncation) in our code base, and
> > can compile cleanly with -Wconversion We know that this case is
> > perfectly fine, but I am sure there are many that are not. However, I'm
> > not sure if we'll ever get there, and in the meantime I don't think it's
> > worth worrying too much about individual cases like this.
> 
> Here's a rough take on how to silence that warning for archive-tar.c using
> GCC 8.3.  Some of the changes are worth polishing and submitting.  Some
> are silly.  The one for regexec_buf() is scary; I don't see a clean way of
> dealing with that size_t to int conversion.

This is actually slightly less tedious than I had imagined it to be, but
still pretty bad. I dunno. If somebody wants to tackle it, I do think it
would make the world a better place. But I'm not sure if it is worth the
effort involved.

>  static void write_trailer(void)
>  {
> -	int tail = BLOCKSIZE - offset;
> +	size_t tail = BLOCKSIZE - offset;

These kinds of int/size_t conversions are the ones I think are the most
valuable (because the size_t's are often used to allocate or access
arrays, and truncated or negative values there can cause other security
problems). _Most_ of them are harmless, of course, but it's hard to
separate the important ones from the mundane.

> @@ -414,7 +432,7 @@ static int git_tar_config(const char *var, const char *value, void *cb)
>  			tar_umask = umask(0);
>  			umask(tar_umask);
>  		} else {
> -			tar_umask = git_config_int(var, value);
> +			tar_umask = (mode_t)git_config_ulong(var, value);
>  		}

It's nice that the cast here shuts up the compiler, and I agree it is
not likely to be a problem in this instance. But we'd probably want some
kind of "safe cast" helper. To some degree, if you put 2^64-1 in your
"umask" value you get what you deserve, but it would be nice if we could
detect such nonsense (less for this case, but more for others where we
do cast).

> @@ -1119,7 +1119,22 @@ static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
>  {
>  	assert(nmatch > 0 && pmatch);
>  	pmatch[0].rm_so = 0;
> -	pmatch[0].rm_eo = size;
> +	pmatch[0].rm_eo = (regoff_t)size;
> +	if (pmatch[0].rm_eo != size) {
> +		if (((regoff_t)-1) < 0) {
> +			if (sizeof(regoff_t) == sizeof(int))
> +				pmatch[0].rm_eo = (regoff_t)INT_MAX;
> +			else if (sizeof(regoff_t) == sizeof(long))
> +				pmatch[0].rm_eo = (regoff_t)LONG_MAX;
> +			else
> +				die("unable to determine maximum value of regoff_t");
> +		} else {
> +			pmatch[0].rm_eo = (regoff_t)-1;
> +		}
> +		warning("buffer too big (%"PRIuMAX"), "
> +			"will search only the first %"PRIuMAX" bytes",
> +			(uintmax_t)size, (uintmax_t)pmatch[0].rm_eo);
> +	}
>  	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
>  }

I think a helper could make things less awful here, too. Our xsize_t()
is sort of like this, but of course it dies. But I think it would be
possible to write a macro to let you do:

  if (ASSIGN_CAST(pmatch[0].rm_eo, size))
	warning(...);

This is definitely a rabbit-hole that I've been afraid to go down. :)

-Peff