diff mbox series

[4/5] builtin/repack.c: be more conservative with unsigned overflows

Message ID d55324f7a256fce491a29a1debf142f817eb01d3.1614957681.git.me@ttaylorr.com (mailing list archive)
State Accepted
Commit 2a15964128edd08278efeacc75e75c77cfdde77e
Headers show
Series clean-ups to geometric repacking | expand

Commit Message

Taylor Blau March 5, 2021, 3:21 p.m. UTC
There are a number of places in the geometric repack code where we
multiply the number of objects in a pack by another unsigned value. We
trust that the number of objects in a pack is always representable by a
uint32_t, but we don't necessarily trust that that number can be
multiplied without overflow.

Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in
split_pack_geometry() to check that we never overflow any unsigned types
when adding or multiplying them.

Arguably these checks are a little too conservative, and certainly they
do not help the readability of this function. But they are serving a
useful purpose, so I think they are worthwhile overall.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

Comments

Junio C Hamano March 5, 2021, 7:32 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> There are a number of places in the geometric repack code where we
> multiply the number of objects in a pack by another unsigned value. We
> trust that the number of objects in a pack is always representable by a
> uint32_t, but we don't necessarily trust that that number can be
> multiplied without overflow.
>
> Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in
> split_pack_geometry() to check that we never overflow any unsigned types
> when adding or multiplying them.
>
> Arguably these checks are a little too conservative, and certainly they
> do not help the readability of this function. But they are serving a
> useful purpose, so I think they are worthwhile overall.
>
> Suggested-by: Junio C Hamano <gitster@pobox.com>

I do not deserve this for my idle thinking-out-aloud speculation.
You did all the thinking and work.

I do not know if just dying is a useful thing to do, though.  These
all happen in the "discover where the split point is, to figure out
which packs to coalesce" phase where we haven't touched anything on
disk, so in that sense it is safe to die, but what recourse do users
have after seeing these errors?

For example, in the first overflow check in hunk -388,11, presumably
when total_size for a given set of packfiles does not fit in
int32_t, it won't be fruitful to attempt to coalesce such a set of
packs into one, so perhaps a better fix may be to shift the split
point earlier to find a point that lets us to pack as many as we
could?

I do not offhand know what a sensible and gentler fallback position
would be for the factor multiplication, but presumably, if the right
hand side of this ...

>  		if (geometry_pack_weight(ours) < factor * total_size) {

... overflows, we can say it would definitely be larger than our
weight and continue, instead of "no, we cannot multiply total with
factor, as that would give us too big a number", I guess.

I am OK to either (1) leave the code as-is without this patch, because
the overflow would only affect the largest of packs and would be rare,
and people who would notice when they hit it would probably are the ones
with enough resource to diagnose, report and even give us a fix ;-)
or (2) apply this patch as-is, because the only people who will see
the die() would be the same ones affected by (1) above anyway.

And applying this patch as-is would give better chance than (1) for
the overflow to be noticed.

So, let's queue the patch as-is.

Thanks.
Taylor Blau March 5, 2021, 7:41 p.m. UTC | #2
On Fri, Mar 05, 2021 at 11:32:57AM -0800, Junio C Hamano wrote:
> > Suggested-by: Junio C Hamano <gitster@pobox.com>
>
> I do not deserve this for my idle thinking-out-aloud speculation.
> You did all the thinking and work.

You're selling yourself short ;-).

> I do not offhand know what a sensible and gentler fallback position
> would be for the factor multiplication, but presumably, if the right
> hand side of this ...
>
> >  		if (geometry_pack_weight(ours) < factor * total_size) {
>
> ... overflows, we can say it would definitely be larger than our
> weight and continue, instead of "no, we cannot multiply total with
> factor, as that would give us too big a number", I guess.

To answer your question about whether or not dying is sensible, I think
that there are a couple of options. You could say, "I can no longer
multiply total_size by factor without overflowing, so I'm not even going
to bother considering additional packs". I guess that makes some
progress in condensing packs, but that's as good as it would get, since
the subsequent repack would also overflow instead of making further
progress, so it would just get stuck.

Which makes me think that the other option, dying, is more sensible. But
I think that this is all a moot point as you seem to indicate anyway,
because having a large enough pack that we are verging on overflowing is
likely not a real-world problem that we are going to deal with (at least
right away).

> I am OK to either (1) leave the code as-is without this patch, because
> the overflow would only affect the largest of packs and would be rare,
> and people who would notice when they hit it would probably are the ones
> with enough resource to diagnose, report and even give us a fix ;-)
> or (2) apply this patch as-is, because the only people who will see
> the die() would be the same ones affected by (1) above anyway.
>
> And applying this patch as-is would give better chance than (1) for
> the overflow to be noticed.
>
> So, let's queue the patch as-is.

Great, I'm fine with that. Thanks.

> Thanks.

Taylor
Jeff King March 10, 2021, 9 p.m. UTC | #3
On Fri, Mar 05, 2021 at 10:21:56AM -0500, Taylor Blau wrote:

> There are a number of places in the geometric repack code where we
> multiply the number of objects in a pack by another unsigned value. We
> trust that the number of objects in a pack is always representable by a
> uint32_t, but we don't necessarily trust that that number can be
> multiplied without overflow.
> 
> Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in
> split_pack_geometry() to check that we never overflow any unsigned types
> when adding or multiplying them.
> 
> Arguably these checks are a little too conservative, and certainly they
> do not help the readability of this function. But they are serving a
> useful purpose, so I think they are worthwhile overall.

Hmm. My initial reaction was: how close might we reasonably come to the
limit? Packfiles are limited to uint32_t, and:

  - you'd have a pretty hard time approaching that; pack-objects needs
    at least 100 bytes of heap per object for its internal book-keeping.
    So you're looking at 200GB-400GB of RAM to generate such a packfile.

  - I'm pretty sure the rest of the repack code would die horribly and
    unpredictably if it were allowed to have that much RAM.

Which isn't an argument against such protections, but I wonder if it
might give people a false sense that we are in any way prepared for
repositories of this scale.

However, at least one of these looks to be multiplying the user-provided
scale factor. I can't imagine a scale factor beyond "2" is all that
useful, but conceptually somebody could provide a big number there.

Looking at the code, though...

> diff --git a/builtin/repack.c b/builtin/repack.c
> index 21a5778e73..677c6b75c1 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -363,6 +363,12 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
>  	for (i = geometry->pack_nr - 1; i > 0; i--) {
>  		struct packed_git *ours = geometry->pack[i];
>  		struct packed_git *prev = geometry->pack[i - 1];
> +
> +		if (unsigned_mult_overflows(factor, geometry_pack_weight(prev)))
> +			die(_("pack %s too large to consider in geometric "
> +			      "progression"),
> +			    prev->pack_name);

This says "unsigned_mult_overflows", but "factor" is a signed int. This
will generally be cast to unsigned in the actual multiplication, but it
depends on the size of the operands.

If int is larger than uint32_t, we'd do the multiplication as a signed
int. But the overflow check would be against an unsigned int (since our
macro only looks at the type of the first argument). So it would be
overly permissive with the final bit. I suspect it would also be wrong
if "factor" is negative.

If int is smaller than 32 bits, then we'd be too conservative (it would
get promoted to uint32_t for the actual multiplication). Also, you
should get a better computer in that case.

It's probably OK in practice, as int tends to just be 32 bits. But if
the point is to be careful, we should probably just take "factor" as a
uint32_t in the first place.

> @@ -388,11 +394,25 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
>  	 * packs in the heavy half need to be joined into it (if any) to restore
>  	 * the geometric progression.
>  	 */
> -	for (i = 0; i < split; i++)
> -		total_size += geometry_pack_weight(geometry->pack[i]);
> +	for (i = 0; i < split; i++) {
> +		struct packed_git *p = geometry->pack[i];
> +
> +		if (unsigned_add_overflows(total_size, geometry_pack_weight(p)))
> +			die(_("pack %s too large to roll up"), p->pack_name);
> +		total_size += geometry_pack_weight(p);
> +	}

This one seems even less likely to overflow. total_size is an off_t, so
unless you're on a really lame system, we should be able to fit a lot of
uint32_t's in there.

(It actually feels a little weird for it to be an off_t in the first
place; we're still dealing in units of "number of objects", which the
rest of Git generally considers to be in the realm of a uint32_t).

>  	for (i = split; i < geometry->pack_nr; i++) {
>  		struct packed_git *ours = geometry->pack[i];
> +
> +		if (unsigned_mult_overflows(factor, total_size))
> +			die(_("pack %s too large to roll up"), ours->pack_name);

This one is wrong in the same way as the earlier multiplication, except
this time we're pretty sure that total_size actually is much bigger. So
we'll complain about overflowing an int, but the multiplication will
actually be done as an off_t. And of course it has the same problems
with negative values.

Should total_size just be a uint32_t? Or perhaps any of these factor
multiplication results should just be uint64_t, which would be the
obviously-large-enough type. Then you wouldn't have these ugly overflow
checks. :)

-Peff
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index 21a5778e73..677c6b75c1 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -363,6 +363,12 @@  static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	for (i = geometry->pack_nr - 1; i > 0; i--) {
 		struct packed_git *ours = geometry->pack[i];
 		struct packed_git *prev = geometry->pack[i - 1];
+
+		if (unsigned_mult_overflows(factor, geometry_pack_weight(prev)))
+			die(_("pack %s too large to consider in geometric "
+			      "progression"),
+			    prev->pack_name);
+
 		if (geometry_pack_weight(ours) < factor * geometry_pack_weight(prev))
 			break;
 	}
@@ -388,11 +394,25 @@  static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	 * packs in the heavy half need to be joined into it (if any) to restore
 	 * the geometric progression.
 	 */
-	for (i = 0; i < split; i++)
-		total_size += geometry_pack_weight(geometry->pack[i]);
+	for (i = 0; i < split; i++) {
+		struct packed_git *p = geometry->pack[i];
+
+		if (unsigned_add_overflows(total_size, geometry_pack_weight(p)))
+			die(_("pack %s too large to roll up"), p->pack_name);
+		total_size += geometry_pack_weight(p);
+	}
 	for (i = split; i < geometry->pack_nr; i++) {
 		struct packed_git *ours = geometry->pack[i];
+
+		if (unsigned_mult_overflows(factor, total_size))
+			die(_("pack %s too large to roll up"), ours->pack_name);
+
 		if (geometry_pack_weight(ours) < factor * total_size) {
+			if (unsigned_add_overflows(total_size,
+						   geometry_pack_weight(ours)))
+				die(_("pack %s too large to roll up"),
+				    ours->pack_name);
+
 			split++;
 			total_size += geometry_pack_weight(ours);
 		} else