Message ID | 20190204161217.20047-1-avarab@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC/PATCH] core.abbrev doc: document and test the abbreviation length | expand |
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > +The algorithm to pick the the current abbreviation length is > +considered an implementation detail, and might be changed in the > +future. Since Git version 2.11, the length has been configured to > +auto-scale based on the estimated number of objects in the > +repository. We pick a length such that if all objects in the > +repository were abbreviated, we'd have a 50% chance of a *single* > +collision. Correct and reads well. > +For example, with 2^14-1 is the last object count at which we'll pick > +a short length of "7", and will roll over to "8" once we have one more > +object at 2^14. Since each hexdigit we add (4 bits) allows us to have > +four times (2 bits) as many objects in the repository Something is missing at this point in the sentence. "without raising the chance of a single collision higher" or something like that. > , we'll roll over > +to a length of "9" at 2^16 objects, "10" at 2^18 etc. Correct and reads well. > We'll never > +automatically pick a length less than "7", which effectively hardcodes > +2^12 as the minimum number of objects in a repository we'll consider > +when choosing the abbreviation length. This may be technicaly correct, but to me, it seems to place stress on the wrong side of the equation. Since nobody would find "Ah, so I can create up to 2^12 objects without fearing that my abbreviated object name would become longer than 7", I do not see much point in saying "hardcoded floor for the number of objects". On the other hand, saying that 7 is the hardcoded floor for the abbreviation length does make sense, as those adept at math after reading the paragraph up to this point would wonder why their tiny repository still uses 7 hexdigits, which is way too many to ensure the low collision rate for the size of their toy repository. We do not use abbreviation shorter than 7 hexdigits by default, so a small repository with less than 2^12 objects may have even smaller chance than 50% to have a single collision. may be an improvement.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > @@ -773,6 +773,14 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix) > return 1; > continue; > } > + if (opt_with_value(arg, "--abbrev-len", &arg)) { > + unsigned long v; > + if (!git_parse_ulong(arg, &v)) > + return 1; > + int len = abbrev_length_for_object_count(v); > + printf("%d\n", len); > + continue; > + } Instead of exposing this pretty-much "test-only" feature as a new option to t/helper/test-tool, I think it is OK, if not even better, to have it in rev-parse proper like this patch does. I however have a mildly strong suspition that people would expect "rev-parse --abbrev-len=<num>" to be a synonym of "--short=<num>" As this is pretty-much a test-only option, perhaps going longer but more descriptive would make sense? git rev-parse --compute-abbrev-length-for <object-count> may be an overkill, but something along those lines. Oh by the way, the code has decl-after-stmt, and perhaps len needs to be of type "const int" ;-)
On Mon, Feb 04 2019, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >> @@ -773,6 +773,14 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix) >> return 1; >> continue; >> } >> + if (opt_with_value(arg, "--abbrev-len", &arg)) { >> + unsigned long v; >> + if (!git_parse_ulong(arg, &v)) >> + return 1; >> + int len = abbrev_length_for_object_count(v); >> + printf("%d\n", len); >> + continue; >> + } > > Instead of exposing this pretty-much "test-only" feature as a new > option to t/helper/test-tool, I think it is OK, if not even better, > to have it in rev-parse proper like this patch does. While I mainly added this code so I could prove the docs correct with a test for both myself & others, I think having this exposed is probably useful. I've seen more than once some feature of a web frontend for git where there's both access to aggregate statistics (number of commits or objects), and SHA-1 shortening going on, but the latter is just done via substr(). Right now we have nothing directly exposed to answer "what length would git pick", you can of course e.g. "log --abbrev" a single commit, but if that commit happens to be more ambiguous than most you'll get the right answer. > I however have a mildly strong suspition that people would expect > "rev-parse --abbrev-len=<num>" to be a synonym of "--short=<num>" > > As this is pretty-much a test-only option, perhaps going longer but > more descriptive would make sense? > > git rev-parse --compute-abbrev-length-for <object-count> > > may be an overkill, but something along those lines. Yeah I think that's better. This is so rare that it's better to be verbose. > Oh by the way, the code has decl-after-stmt, and perhaps len needs > to be of type "const int" ;-) Thanks. Will fix.
On Mon, Feb 04, 2019 at 12:04:21PM -0800, Junio C Hamano wrote: > Instead of exposing this pretty-much "test-only" feature as a new > option to t/helper/test-tool, I think it is OK, if not even better, > to have it in rev-parse proper like this patch does. > > I however have a mildly strong suspition that people would expect > "rev-parse --abbrev-len=<num>" to be a synonym of "--short=<num>" > > As this is pretty-much a test-only option, perhaps going longer but > more descriptive would make sense? > > git rev-parse --compute-abbrev-length-for <object-count> > > may be an overkill, but something along those lines. You could even default <object-count> to the number of objects in the repository. Which implies that perhaps the best spot is the command where we already count the number of objects, git-count-objects. -Peff
On Tue, Feb 05 2019, Jeff King wrote: > On Mon, Feb 04, 2019 at 12:04:21PM -0800, Junio C Hamano wrote: > >> Instead of exposing this pretty-much "test-only" feature as a new >> option to t/helper/test-tool, I think it is OK, if not even better, >> to have it in rev-parse proper like this patch does. >> >> I however have a mildly strong suspition that people would expect >> "rev-parse --abbrev-len=<num>" to be a synonym of "--short=<num>" >> >> As this is pretty-much a test-only option, perhaps going longer but >> more descriptive would make sense? >> >> git rev-parse --compute-abbrev-length-for <object-count> >> >> may be an overkill, but something along those lines. > > You could even default <object-count> to the number of objects in the > repository. Which implies that perhaps the best spot is the command > where we already count the number of objects, git-count-objects. That's documented as reporting loose objects by default, although it has a full report with -v. Maybe rev-parse isn't the right place, I just picked it because it seems to be the general utility belt for stuff that doesn't fit elsewhere. But putting it in git-count-objects seems like a bit more of a stretch given the above.
On Tue, Feb 05, 2019 at 12:50:23AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> As this is pretty-much a test-only option, perhaps going longer but > >> more descriptive would make sense? > >> > >> git rev-parse --compute-abbrev-length-for <object-count> > >> > >> may be an overkill, but something along those lines. > > > > You could even default <object-count> to the number of objects in the > > repository. Which implies that perhaps the best spot is the command > > where we already count the number of objects, git-count-objects. > > That's documented as reporting loose objects by default, although it has > a full report with -v. True, though I think that's mostly for historical reasons. It _could_ be part of the full report, like: $ git count-objects -v ... abbrev-len: 12 but from your test-script usage, I'd expect you'd want to be able to feed a fake count to it, like: git count-objects --compute-abbrev-len=1234 or something (of course you _could_ also make a repository with N objects, but that's a lot more expensive). > Maybe rev-parse isn't the right place, I just picked it because it seems > to be the general utility belt for stuff that doesn't fit elsewhere. > > But putting it in git-count-objects seems like a bit more of a stretch > given the above. I dunno. It seems like less of a stretch to me, but it is true that rev-parse is already a kitchen sink repository. I can live with it either way. -Peff
On Wed, Feb 06 2019, Jeff King wrote: > On Tue, Feb 05, 2019 at 12:50:23AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> >> As this is pretty-much a test-only option, perhaps going longer but >> >> more descriptive would make sense? >> >> >> >> git rev-parse --compute-abbrev-length-for <object-count> >> >> >> >> may be an overkill, but something along those lines. >> > >> > You could even default <object-count> to the number of objects in the >> > repository. Which implies that perhaps the best spot is the command >> > where we already count the number of objects, git-count-objects. >> >> That's documented as reporting loose objects by default, although it has >> a full report with -v. > > True, though I think that's mostly for historical reasons. It _could_ be > part of the full report, like: > > $ git count-objects -v > ... > abbrev-len: 12 > > but from your test-script usage, I'd expect you'd want to be able to > feed a fake count to it, like: > > git count-objects --compute-abbrev-len=1234 Yeah for just reporting it count-objects makes more sense. I think I'll add it there... > or something (of course you _could_ also make a repository with N > objects, but that's a lot more expensive). ...but yes, for the test script & to export the info I'd like to have the "what's the abbrev length for a repo with N objects" option, which would be for rev-parse. >> Maybe rev-parse isn't the right place, I just picked it because it seems >> to be the general utility belt for stuff that doesn't fit elsewhere. >> >> But putting it in git-count-objects seems like a bit more of a stretch >> given the above. > > I dunno. It seems like less of a stretch to me, but it is true that > rev-parse is already a kitchen sink repository. I can live with it > either way. > > -Peff
diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index 185857a13f..2175761833 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -599,6 +599,23 @@ core.abbrev:: abbreviated object names to stay unique for some time. The minimum length is 4. + +The algorithm to pick the the current abbreviation length is +considered an implementation detail, and might be changed in the +future. Since Git version 2.11, the length has been configured to +auto-scale based on the estimated number of objects in the +repository. We pick a length such that if all objects in the +repository were abbreviated, we'd have a 50% chance of a *single* +collision. ++ +For example, with 2^14-1 is the last object count at which we'll pick +a short length of "7", and will roll over to "8" once we have one more +object at 2^14. Since each hexdigit we add (4 bits) allows us to have +four times (2 bits) as many objects in the repository, we'll roll over +to a length of "9" at 2^16 objects, "10" at 2^18 etc. We'll never +automatically pick a length less than "7", which effectively hardcodes +2^12 as the minimum number of objects in a repository we'll consider +when choosing the abbreviation length. ++ This can also be set to relative values such as `+2` or `-2`, which means to add or subtract N characters from the SHA-1 that Git would otherwise print, this allows for producing more future-proof SHA-1s diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c index d0d751a009..e7bf4375a2 100644 --- a/builtin/rev-parse.c +++ b/builtin/rev-parse.c @@ -773,6 +773,14 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix) return 1; continue; } + if (opt_with_value(arg, "--abbrev-len", &arg)) { + unsigned long v; + if (!git_parse_ulong(arg, &v)) + return 1; + int len = abbrev_length_for_object_count(v); + printf("%d\n", len); + continue; + } if (!strcmp(arg, "--bisect")) { for_each_fullref_in("refs/bisect/bad", show_reference, NULL, 0); for_each_fullref_in("refs/bisect/good", anti_reference, NULL, 0); diff --git a/t/t1512-rev-parse-disambiguation.sh b/t/t1512-rev-parse-disambiguation.sh index 265a6972fc..0e97888a44 100755 --- a/t/t1512-rev-parse-disambiguation.sh +++ b/t/t1512-rev-parse-disambiguation.sh @@ -450,4 +450,78 @@ test_expect_success C_LOCALE_OUTPUT 'ambiguous commits are printed by type first done ' +test_expect_success 'abbreviation length at 2^N-1 and 2^N' ' + pow_2_min=$(git rev-parse --abbrev-len=3) && + pow_2_eql=$(git rev-parse --abbrev-len=4) && + pow_4_min=$(git rev-parse --abbrev-len=15) && + pow_4_eql=$(git rev-parse --abbrev-len=16) && + pow_6_min=$(git rev-parse --abbrev-len=63) && + pow_6_eql=$(git rev-parse --abbrev-len=64) && + pow_8_min=$(git rev-parse --abbrev-len=255) && + pow_8_eql=$(git rev-parse --abbrev-len=256) && + pow_10_min=$(git rev-parse --abbrev-len=1023) && + pow_10_eql=$(git rev-parse --abbrev-len=1024) && + pow_12_min=$(git rev-parse --abbrev-len=4095) && + pow_12_eql=$(git rev-parse --abbrev-len=4096) && + pow_14_min=$(git rev-parse --abbrev-len=16383) && + pow_14_eql=$(git rev-parse --abbrev-len=16384) && + pow_16_min=$(git rev-parse --abbrev-len=65535) && + pow_16_eql=$(git rev-parse --abbrev-len=65536) && + pow_18_min=$(git rev-parse --abbrev-len=262143) && + pow_18_eql=$(git rev-parse --abbrev-len=262144) && + pow_20_min=$(git rev-parse --abbrev-len=1048575) && + pow_20_eql=$(git rev-parse --abbrev-len=1048576) && + pow_22_min=$(git rev-parse --abbrev-len=4194303) && + pow_22_eql=$(git rev-parse --abbrev-len=4194304) && + pow_24_min=$(git rev-parse --abbrev-len=16777215) && + pow_24_eql=$(git rev-parse --abbrev-len=16777216) && + pow_26_min=$(git rev-parse --abbrev-len=67108863) && + pow_26_eql=$(git rev-parse --abbrev-len=67108864) && + pow_28_min=$(git rev-parse --abbrev-len=268435455) && + pow_28_eql=$(git rev-parse --abbrev-len=268435456) && + pow_30_min=$(git rev-parse --abbrev-len=1073741823) && + pow_30_eql=$(git rev-parse --abbrev-len=1073741824) && + pow_32_min=$(git rev-parse --abbrev-len=4294967295) && + + cat >actual <<-EOF && + 2 = $pow_2_min $pow_2_eql + 4 = $pow_4_min $pow_4_eql + 6 = $pow_6_min $pow_6_eql + 8 = $pow_8_min $pow_8_eql + 10 = $pow_10_min $pow_10_eql + 12 = $pow_12_min $pow_12_eql + 14 = $pow_14_min $pow_14_eql + 16 = $pow_16_min $pow_16_eql + 18 = $pow_18_min $pow_18_eql + 20 = $pow_20_min $pow_20_eql + 22 = $pow_22_min $pow_22_eql + 24 = $pow_24_min $pow_24_eql + 26 = $pow_26_min $pow_26_eql + 28 = $pow_28_min $pow_28_eql + 30 = $pow_30_min $pow_30_eql + 32 = 16 + EOF + + cat >expected <<-\EOF && + 2 = 7 7 + 4 = 7 7 + 6 = 7 7 + 8 = 7 7 + 10 = 7 7 + 12 = 7 7 + 14 = 7 8 + 16 = 8 9 + 18 = 9 10 + 20 = 10 11 + 22 = 11 12 + 24 = 12 13 + 26 = 13 14 + 28 = 14 15 + 30 = 15 16 + 32 = 16 + EOF + + test_cmp expected actual +' + test_done
The algorithm we use to pick the default abbreviation length as a function of the approximate number of objects is described in the commit message for e6c587c733 ("abbrev: auto size the default abbreviation", 2016-09-30), as well as in and downthread of [1], but it hasn't been documented. Let's do that, and while we're at it explicitly test for when the current implementation will "roll over" up to values of 2^32-1 (the maximum portable "unsigned long" value). 1. https://public-inbox.org/git/20160926043442.3pz7ccawdcsn2kzb@sigill.intra.peff.net/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- This is a patch from the middle of a series I'm currently working on re-rolling. See https://public-inbox.org/git/20180608224136.20220-1-avarab@gmail.com/ What I'd like to get here is commentary on the phrasing and accuracy of the doc patch I'm adding here. This patch assumes that we have a abbrev_length_for_object_count() function, which I've added in an eariler unpublished patch. It just exposes the length picking algorithm found in find_unique_abbrev_r(). Documentation/config/core.txt | 17 +++++++ builtin/rev-parse.c | 8 ++++ t/t1512-rev-parse-disambiguation.sh | 74 +++++++++++++++++++++++++++++ 3 files changed, 99 insertions(+)