Message ID | d55324f7a256fce491a29a1debf142f817eb01d3.1614957681.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 2a15964128edd08278efeacc75e75c77cfdde77e |
Headers | show |
Series | clean-ups to geometric repacking | expand |
Taylor Blau <me@ttaylorr.com> writes: > There are a number of places in the geometric repack code where we > multiply the number of objects in a pack by another unsigned value. We > trust that the number of objects in a pack is always representable by a > uint32_t, but we don't necessarily trust that that number can be > multiplied without overflow. > > Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in > split_pack_geometry() to check that we never overflow any unsigned types > when adding or multiplying them. > > Arguably these checks are a little too conservative, and certainly they > do not help the readability of this function. But they are serving a > useful purpose, so I think they are worthwhile overall. > > Suggested-by: Junio C Hamano <gitster@pobox.com> I do not deserve this for my idle thinking-out-aloud speculation. You did all the thinking and work. I do not know if just dying is a useful thing to do, though. These all happen in the "discover where the split point is, to figure out which packs to coalesce" phase where we haven't touched anything on disk, so in that sense it is safe to die, but what recourse do users have after seeing these errors? For example, in the first overflow check in hunk -388,11, presumably when total_size for a given set of packfiles does not fit in int32_t, it won't be fruitful to attempt to coalesce such a set of packs into one, so perhaps a better fix may be to shift the split point earlier to find a point that lets us to pack as many as we could? I do not offhand know what a sensible and gentler fallback position would be for the factor multiplication, but presumably, if the right hand side of this ... > if (geometry_pack_weight(ours) < factor * total_size) { ... overflows, we can say it would definitely be larger than our weight and continue, instead of "no, we cannot multiply total with factor, as that would give us too big a number", I guess. I am OK to either (1) leave the code as-is without this patch, because the overflow would only affect the largest of packs and would be rare, and people who would notice when they hit it would probably are the ones with enough resource to diagnose, report and even give us a fix ;-) or (2) apply this patch as-is, because the only people who will see the die() would be the same ones affected by (1) above anyway. And applying this patch as-is would give better chance than (1) for the overflow to be noticed. So, let's queue the patch as-is. Thanks.
On Fri, Mar 05, 2021 at 11:32:57AM -0800, Junio C Hamano wrote: > > Suggested-by: Junio C Hamano <gitster@pobox.com> > > I do not deserve this for my idle thinking-out-aloud speculation. > You did all the thinking and work. You're selling yourself short ;-). > I do not offhand know what a sensible and gentler fallback position > would be for the factor multiplication, but presumably, if the right > hand side of this ... > > > if (geometry_pack_weight(ours) < factor * total_size) { > > ... overflows, we can say it would definitely be larger than our > weight and continue, instead of "no, we cannot multiply total with > factor, as that would give us too big a number", I guess. To answer your question about whether or not dying is sensible, I think that there are a couple of options. You could say, "I can no longer multiply total_size by factor without overflowing, so I'm not even going to bother considering additional packs". I guess that makes some progress in condensing packs, but that's as good as it would get, since the subsequent repack would also overflow instead of making further progress, so it would just get stuck. Which makes me think that the other option, dying, is more sensible. But I think that this is all a moot point as you seem to indicate anyway, because having a large enough pack that we are verging on overflowing is likely not a real-world problem that we are going to deal with (at least right away). > I am OK to either (1) leave the code as-is without this patch, because > the overflow would only affect the largest of packs and would be rare, > and people who would notice when they hit it would probably are the ones > with enough resource to diagnose, report and even give us a fix ;-) > or (2) apply this patch as-is, because the only people who will see > the die() would be the same ones affected by (1) above anyway. > > And applying this patch as-is would give better chance than (1) for > the overflow to be noticed. > > So, let's queue the patch as-is. Great, I'm fine with that. Thanks. > Thanks. Taylor
On Fri, Mar 05, 2021 at 10:21:56AM -0500, Taylor Blau wrote: > There are a number of places in the geometric repack code where we > multiply the number of objects in a pack by another unsigned value. We > trust that the number of objects in a pack is always representable by a > uint32_t, but we don't necessarily trust that that number can be > multiplied without overflow. > > Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in > split_pack_geometry() to check that we never overflow any unsigned types > when adding or multiplying them. > > Arguably these checks are a little too conservative, and certainly they > do not help the readability of this function. But they are serving a > useful purpose, so I think they are worthwhile overall. Hmm. My initial reaction was: how close might we reasonably come to the limit? Packfiles are limited to uint32_t, and: - you'd have a pretty hard time approaching that; pack-objects needs at least 100 bytes of heap per object for its internal book-keeping. So you're looking at 200GB-400GB of RAM to generate such a packfile. - I'm pretty sure the rest of the repack code would die horribly and unpredictably if it were allowed to have that much RAM. Which isn't an argument against such protections, but I wonder if it might give people a false sense that we are in any way prepared for repositories of this scale. However, at least one of these looks to be multiplying the user-provided scale factor. I can't imagine a scale factor beyond "2" is all that useful, but conceptually somebody could provide a big number there. Looking at the code, though... > diff --git a/builtin/repack.c b/builtin/repack.c > index 21a5778e73..677c6b75c1 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -363,6 +363,12 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor) > for (i = geometry->pack_nr - 1; i > 0; i--) { > struct packed_git *ours = geometry->pack[i]; > struct packed_git *prev = geometry->pack[i - 1]; > + > + if (unsigned_mult_overflows(factor, geometry_pack_weight(prev))) > + die(_("pack %s too large to consider in geometric " > + "progression"), > + prev->pack_name); This says "unsigned_mult_overflows", but "factor" is a signed int. This will generally be cast to unsigned in the actual multiplication, but it depends on the size of the operands. If int is larger than uint32_t, we'd do the multiplication as a signed int. But the overflow check would be against an unsigned int (since our macro only looks at the type of the first argument). So it would be overly permissive with the final bit. I suspect it would also be wrong if "factor" is negative. If int is smaller than 32 bits, then we'd be too conservative (it would get promoted to uint32_t for the actual multiplication). Also, you should get a better computer in that case. It's probably OK in practice, as int tends to just be 32 bits. But if the point is to be careful, we should probably just take "factor" as a uint32_t in the first place. > @@ -388,11 +394,25 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor) > * packs in the heavy half need to be joined into it (if any) to restore > * the geometric progression. > */ > - for (i = 0; i < split; i++) > - total_size += geometry_pack_weight(geometry->pack[i]); > + for (i = 0; i < split; i++) { > + struct packed_git *p = geometry->pack[i]; > + > + if (unsigned_add_overflows(total_size, geometry_pack_weight(p))) > + die(_("pack %s too large to roll up"), p->pack_name); > + total_size += geometry_pack_weight(p); > + } This one seems even less likely to overflow. total_size is an off_t, so unless you're on a really lame system, we should be able to fit a lot of uint32_t's in there. (It actually feels a little weird for it to be an off_t in the first place; we're still dealing in units of "number of objects", which the rest of Git generally considers to be in the realm of a uint32_t). > for (i = split; i < geometry->pack_nr; i++) { > struct packed_git *ours = geometry->pack[i]; > + > + if (unsigned_mult_overflows(factor, total_size)) > + die(_("pack %s too large to roll up"), ours->pack_name); This one is wrong in the same way as the earlier multiplication, except this time we're pretty sure that total_size actually is much bigger. So we'll complain about overflowing an int, but the multiplication will actually be done as an off_t. And of course it has the same problems with negative values. Should total_size just be a uint32_t? Or perhaps any of these factor multiplication results should just be uint64_t, which would be the obviously-large-enough type. Then you wouldn't have these ugly overflow checks. :) -Peff
diff --git a/builtin/repack.c b/builtin/repack.c index 21a5778e73..677c6b75c1 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -363,6 +363,12 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor) for (i = geometry->pack_nr - 1; i > 0; i--) { struct packed_git *ours = geometry->pack[i]; struct packed_git *prev = geometry->pack[i - 1]; + + if (unsigned_mult_overflows(factor, geometry_pack_weight(prev))) + die(_("pack %s too large to consider in geometric " + "progression"), + prev->pack_name); + if (geometry_pack_weight(ours) < factor * geometry_pack_weight(prev)) break; } @@ -388,11 +394,25 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor) * packs in the heavy half need to be joined into it (if any) to restore * the geometric progression. */ - for (i = 0; i < split; i++) - total_size += geometry_pack_weight(geometry->pack[i]); + for (i = 0; i < split; i++) { + struct packed_git *p = geometry->pack[i]; + + if (unsigned_add_overflows(total_size, geometry_pack_weight(p))) + die(_("pack %s too large to roll up"), p->pack_name); + total_size += geometry_pack_weight(p); + } for (i = split; i < geometry->pack_nr; i++) { struct packed_git *ours = geometry->pack[i]; + + if (unsigned_mult_overflows(factor, total_size)) + die(_("pack %s too large to roll up"), ours->pack_name); + if (geometry_pack_weight(ours) < factor * total_size) { + if (unsigned_add_overflows(total_size, + geometry_pack_weight(ours))) + die(_("pack %s too large to roll up"), + ours->pack_name); + split++; total_size += geometry_pack_weight(ours); } else
There are a number of places in the geometric repack code where we multiply the number of objects in a pack by another unsigned value. We trust that the number of objects in a pack is always representable by a uint32_t, but we don't necessarily trust that that number can be multiplied without overflow. Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in split_pack_geometry() to check that we never overflow any unsigned types when adding or multiplying them. Arguably these checks are a little too conservative, and certainly they do not help the readability of this function. But they are serving a useful purpose, so I think they are worthwhile overall. Suggested-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> --- builtin/repack.c | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-)