Message ID | YI1fubjvQQlrPz9D@coredump.intra.peff.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | pack-objects: better handling of negative options | expand |
On 5/1/2021 10:03 AM, Jeff King wrote: > A negative window size makes no sense, and the code in find_deltas() is > not prepared to handle it. If you pass "-1", for example, we end up > generate a 0-length array of "struct unpacked", but our loop assumes it > has at least one entry in it (and we end up reading garbage memory). > > We could complain to the user about this, but it's more forgiving to > just clamp it to 0, which means "do not find any deltas at all". The > 0-case is already tested earlier in the script, so we'll make sure this > does the same thing. This seems like a reasonable approach. It takes existing "undefined" behavior and turns it into well-understood, "defined" behavior. Thanks, -Stolee
On Mon, May 03, 2021 at 08:10:24AM -0400, Derrick Stolee wrote: > On 5/1/2021 10:03 AM, Jeff King wrote: > > A negative window size makes no sense, and the code in find_deltas() is > > not prepared to handle it. If you pass "-1", for example, we end up > > generate a 0-length array of "struct unpacked", but our loop assumes it > > has at least one entry in it (and we end up reading garbage memory). > > > > We could complain to the user about this, but it's more forgiving to > > just clamp it to 0, which means "do not find any deltas at all". The > > 0-case is already tested earlier in the script, so we'll make sure this > > does the same thing. > > This seems like a reasonable approach. It takes existing "undefined" > behavior and turns it into well-understood, "defined" behavior. I was on the fence on doing that, or just: if (window < 0) die("sorry dude, negative windows are nonsense"); So if anybody has a strong preference, I could be easily persuaded. Part of what led me to being forgiving was that we similarly clamp too-large depth values (with a warning; I didn't think it was really necessary here, though). -Peff
Am 03.05.21 um 16:55 schrieb Jeff King: > On Mon, May 03, 2021 at 08:10:24AM -0400, Derrick Stolee wrote: > >> On 5/1/2021 10:03 AM, Jeff King wrote: >>> A negative window size makes no sense, and the code in find_deltas() is >>> not prepared to handle it. If you pass "-1", for example, we end up >>> generate a 0-length array of "struct unpacked", but our loop assumes it >>> has at least one entry in it (and we end up reading garbage memory). >>> >>> We could complain to the user about this, but it's more forgiving to >>> just clamp it to 0, which means "do not find any deltas at all". The >>> 0-case is already tested earlier in the script, so we'll make sure this >>> does the same thing. >> >> This seems like a reasonable approach. It takes existing "undefined" >> behavior and turns it into well-understood, "defined" behavior. > > I was on the fence on doing that, or just: > > if (window < 0) > die("sorry dude, negative windows are nonsense"); > > So if anybody has a strong preference, I could be easily persuaded. Part > of what led me to being forgiving was that we similarly clamp too-large > depth values (with a warning; I didn't think it was really necessary > here, though). There's another option: Mapping -1 or all negative values to the maximum: if (window < 0) window = INT_MAX; if (depth < 0) depth = (1 << OE_DEPTH_BITS) - 1; That's allows saying "gimme all you got" without knowing or being willing to type out the exact maximum value, which may change between versions. Not all that useful for --window, I guess. That convention has been used elsewhere I'm sure, but can't point out a concrete example. $arr[-1] get the last item of the array $arr in PowerShell, though, which is kind of similar. Sure, you get the same effect in both cases by typing 2147483647, but -1 is more convenient. Not a strong preference, but I thought it was at least worth mentioning that particular bike shed color. :) René
On Tue, May 04, 2021 at 08:47:56PM +0200, René Scharfe wrote: > >> This seems like a reasonable approach. It takes existing "undefined" > >> behavior and turns it into well-understood, "defined" behavior. > > > > I was on the fence on doing that, or just: > > > > if (window < 0) > > die("sorry dude, negative windows are nonsense"); > > > > So if anybody has a strong preference, I could be easily persuaded. Part > > of what led me to being forgiving was that we similarly clamp too-large > > depth values (with a warning; I didn't think it was really necessary > > here, though). > > There's another option: Mapping -1 or all negative values to the > maximum: > > if (window < 0) > window = INT_MAX; > if (depth < 0) > depth = (1 << OE_DEPTH_BITS) - 1; > > That's allows saying "gimme all you got" without knowing or being > willing to type out the exact maximum value, which may change between > versions. Not all that useful for --window, I guess. That convention > has been used elsewhere I'm sure, but can't point out a concrete > example. $arr[-1] get the last item of the array $arr in PowerShell, > though, which is kind of similar. > > Sure, you get the same effect in both cases by typing 2147483647, but > -1 is more convenient. > > Not a strong preference, but I thought it was at least worth > mentioning that particular bike shed color. :) Thanks for thinking about this. In general, yeah, allowing "-1" as "unlimited" is a sensible thing for many options. But for these two variables, I don't think "unlimited" is ever a good idea: - for --window, it makes the amount of work quadratic in the number of objects in the repository (and likewise, memory usage equivalent to the uncompressed contents of the repo). Going beyond about 250 or so gives diminishing returns. - for --depth, going beyond about 50 provides diminishing space returns, and makes the run-time performance of accessing the deltas horrible. So certainly INT_MAX or the max allowable by OE_DEPTH_BITS is a very bad idea in both cases, and the use would probably be happier if we just hit a die() instead. :) We _could_ make "-1" mean "the maximum sensible valuable", but I think there's a lot of wiggle room for "sensible" there. I much prefer using and advertising the actual values there (as we do for "gc --aggressive"). -Peff
On Tue, May 04 2021, René Scharfe wrote: > Am 03.05.21 um 16:55 schrieb Jeff King: >> On Mon, May 03, 2021 at 08:10:24AM -0400, Derrick Stolee wrote: >> >>> On 5/1/2021 10:03 AM, Jeff King wrote: >>>> A negative window size makes no sense, and the code in find_deltas() is >>>> not prepared to handle it. If you pass "-1", for example, we end up >>>> generate a 0-length array of "struct unpacked", but our loop assumes it >>>> has at least one entry in it (and we end up reading garbage memory). >>>> >>>> We could complain to the user about this, but it's more forgiving to >>>> just clamp it to 0, which means "do not find any deltas at all". The >>>> 0-case is already tested earlier in the script, so we'll make sure this >>>> does the same thing. >>> >>> This seems like a reasonable approach. It takes existing "undefined" >>> behavior and turns it into well-understood, "defined" behavior. >> >> I was on the fence on doing that, or just: >> >> if (window < 0) >> die("sorry dude, negative windows are nonsense"); >> >> So if anybody has a strong preference, I could be easily persuaded. Part >> of what led me to being forgiving was that we similarly clamp too-large >> depth values (with a warning; I didn't think it was really necessary >> here, though). > > There's another option: Mapping -1 or all negative values to the > maximum: > > if (window < 0) > window = INT_MAX; > if (depth < 0) > depth = (1 << OE_DEPTH_BITS) - 1; > > That's allows saying "gimme all you got" without knowing or being > willing to type out the exact maximum value, which may change between > versions. Not all that useful for --window, I guess. That convention > has been used elsewhere I'm sure, but can't point out a concrete > example. $arr[-1] get the last item of the array $arr in PowerShell, > though, which is kind of similar. > > Sure, you get the same effect in both cases by typing 2147483647, but > -1 is more convenient. > > Not a strong preference, but I thought it was at least worth > mentioning that particular bike shed color. :) That seems sensible to expose, but I think should really be --window=max, not --window=-1. The latter feels way too much like assuming that a user might know about C's "set all bits" semantics. The one example of such a variable I could think of is core.abbrev=no, which could arguably benefit from a core.abbrev=max synonym. Another one is *.threads, e.g. grep.threads, index.threads. We currently say that "auto" is like "max, but I can see how we'd eventually benefit from splitting those up. I sometimes run git on machines where that "auto" is 48 or whatever (I haven't benchmarked, but that's surely counter-productive). Having max != auto in that case (but still having a "max") would be nice.
Am 05.05.21 um 13:53 schrieb Ævar Arnfjörð Bjarmason: > > On Tue, May 04 2021, René Scharfe wrote: > >> There's another option: Mapping -1 or all negative values to the >> maximum: > That seems sensible to expose, but I think should really be > --window=max, not --window=-1. The latter feels way too much like > assuming that a user might know about C's "set all bits" semantics. Nitpicking: Setting all bits for -1 is done by two's complement, which is just one of the signed number representations supported by C. But yeah: A non-numeric value would probably be better in general. As Peff explained it's not a particularly good idea to specify the maximum values of --depth and --window, though, so no need to make it easier. > The one example of such a variable I could think of is core.abbrev=no, > which could arguably benefit from a core.abbrev=max synonym. core.abbrev=no turns off abbreviation, i.e. you get the full hash size (false and off work as well). Following that logic, core.abbrev=max would ask for a maximum of abbreviation, i.e. for the shortest unambiguous hash. That would have a length of at least 4. This value is stored in a constant called minimum_abbrev -- which seems backwards. The implied negation of abbrev (the more you abbreviate, the shorter the length) is a bit confusing. > Another one is *.threads, e.g. grep.threads, index.threads. We currently > say that "auto" is like "max, but I can see how we'd eventually benefit > from splitting those up. I sometimes run git on machines where that > "auto" is 48 or whatever (I haven't benchmarked, but that's surely > counter-productive). Having max != auto in that case (but still having a > "max") would be nice. Good thinking. What is the maximum number of threads? Certainly higher than the number of CPUs. Would that be useful? Maybe -- on a single-core VM with an SSD queue length of 32 we can probably benefit from running more than one thread. Are our threads CPU-bound or I/O-bound? I guess the answer is "yes". :) How do we even find out the disk queue length in a portable way? And how would we calculate the optimal number of threads? Are these even the right questions to ask? An "auto" option might help with that. I imagine it starting out with some default value and then experimentally decreasing and and increasing the number of threads to find out which one works best. Downside: It would need to have comparable workloads for that. And these benchmarks need an otherwise quiet system. Similar battery level if running on a laptop. Same system temperature. Impractical during normal use. Perhaps a "git benchmark" command that runs some synthetic speed tests to tune grep.threads etc. would be possible? René
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 6d13cd3e1a..ea7a5b3ba5 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3871,6 +3871,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) (1U << OE_Z_DELTA_BITS) - 1); cache_max_small_delta_size = (1U << OE_Z_DELTA_BITS) - 1; } + if (window < 0) + window = 0; strvec_push(&rp, "pack-objects"); if (thin) { diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh index 887e2d8d88..5c5e53f0be 100755 --- a/t/t5300-pack-object.sh +++ b/t/t5300-pack-object.sh @@ -613,4 +613,9 @@ test_expect_success '--stdin-packs with broken links' ' ) ' +test_expect_success 'negative window clamps to 0' ' + git pack-objects --progress --window=-1 neg-window <obj-list 2>stderr && + check_deltas stderr = 0 +' + test_done
A negative window size makes no sense, and the code in find_deltas() is not prepared to handle it. If you pass "-1", for example, we end up generate a 0-length array of "struct unpacked", but our loop assumes it has at least one entry in it (and we end up reading garbage memory). We could complain to the user about this, but it's more forgiving to just clamp it to 0, which means "do not find any deltas at all". The 0-case is already tested earlier in the script, so we'll make sure this does the same thing. Reported-by: Yiyuan guo <yguoaz@gmail.com> Signed-off-by: Jeff King <peff@peff.net> --- builtin/pack-objects.c | 2 ++ t/t5300-pack-object.sh | 5 +++++ 2 files changed, 7 insertions(+)