Message ID | ab341dd0e58f77b3c7c6f5765d9e34cb02bef56f.1730775908.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | pack-objects: Create an alternative name hash algorithm (recreated) | expand |
On Tue, Nov 05, 2024 at 03:05:07AM +0000, Derrick Stolee via GitGitGadget wrote: > Test this tree > ----------------------------------------------------------------- > 5314.1: paths at head 4.5K > 5314.2: number of distinct name-hashes 4.1K > 5314.3: number of distinct full-name-hashes 4.5K > 5314.4: maximum multiplicity of name-hashes 13 > 5314.5: maximum multiplicity of fullname-hashes 1 > > Here, the maximum collision multiplicity is 13, but around 10% of paths > have a collision with another path. Neat. > diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c > new file mode 100644 > index 00000000000..e4ecd159b76 > --- /dev/null > +++ b/t/helper/test-name-hash.c > @@ -0,0 +1,24 @@ > +/* > + * test-name-hash.c: Read a list of paths over stdin and report on their > + * name-hash and full name-hash. > + */ > + > +#include "test-tool.h" > +#include "git-compat-util.h" > +#include "pack-objects.h" > +#include "strbuf.h" > + > +int cmd__name_hash(int argc UNUSED, const char **argv UNUSED) > +{ > + struct strbuf line = STRBUF_INIT; > + > + while (!strbuf_getline(&line, stdin)) { > + uint32_t name_hash = pack_name_hash(line.buf); > + uint32_t full_hash = pack_full_name_hash(line.buf); > + > + printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf); I'm definitely nitpicking, but having a tab to separate these two 32-bit values feels odd when we know already that they will be at most 10-characters wide. I probably would have written: printf("%10"PRIu32" %10"PRIu32"\t%s\n", name_hash, full_hash, line.buf); instead, but this is obviously not a big deal either way ;-). > diff --git a/t/perf/p5314-name-hash.sh b/t/perf/p5314-name-hash.sh > new file mode 100755 > index 00000000000..9fe26612fac > --- /dev/null > +++ b/t/perf/p5314-name-hash.sh > @@ -0,0 +1,41 @@ > +#!/bin/sh > + > +test_description='Tests pack performance using bitmaps' > +. ./perf-lib.sh > + > +GIT_TEST_PASSING_SANITIZE_LEAK=0 > +export GIT_TEST_PASSING_SANITIZE_LEAK Does this conflict with Patrick's series to remove these leak checking annotations? I think it might, which is not unexpected given this series was written before that one (and it's my fault for not reviewing it earlier). > +test_perf_large_repo > + > +test_size 'paths at head' ' > + git ls-tree -r --name-only HEAD >path-list && > + wc -l <path-list > +' > + > +test_size 'number of distinct name-hashes' ' > + cat path-list | test-tool name-hash >name-hashes && > + cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count && In these two (and a handful of others lower down in this same script) the "cat ... |" is unnecessary. I think this one should be written as: test-tool name-hash <path-list >name-hashes && awk "{ print \$1; }" <name-hashes | sort | uniq -c >name-hash-count && (sort -n is unnecessary, since we just care about getting the list in sorted order so that "uniq -c" can count the number of unique values). > + wc -l <name-hash-count > +' > + > +test_size 'number of distinct full-name-hashes' ' > + cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count && > + wc -l <full-name-hash-count > +' > + > +test_size 'maximum multiplicity of name-hashes' ' > + cat name-hash-count | \ > + sort -nr | \ > + head -n 1 | \ > + awk "{ print \$1; }" > +' > + > +test_size 'maximum multiplicity of fullname-hashes' ' > + cat full-name-hash-count | \ > + sort -nr | \ > + head -n 1 | \ > + awk "{ print \$1; }" Nitpicking again, but you could extract the "sort | head | awk" pipeline into a function. Thanks, Taylor
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <stolee@gmail.com> > > Add a new test-tool helper, name-hash, to output the value of the > name-hash algorithms for the input list of strings, one per line. I've looked at all 7 patches. I didn't really understand the concern with shallow in patch 6 (in particular, the documentation of "git pack-objects --shallow" seems to imply that it's for use by a server to a shallow client, but at the point that the server would need such a feature, it probably would already have bitmaps packed with the new hash algorithm). I didn't look at it further, though, since I had an algorithm that seemed to also do OK in the shallow test. So we might be able to drop patch 6. Other than that, and other than all my comments and Taylor's comments, this series looks good.
diff --git a/Makefile b/Makefile index 6f5986b66ea..65403f6dd09 100644 --- a/Makefile +++ b/Makefile @@ -816,6 +816,7 @@ TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o TEST_BUILTINS_OBJS += test-match-trees.o TEST_BUILTINS_OBJS += test-mergesort.o TEST_BUILTINS_OBJS += test-mktemp.o +TEST_BUILTINS_OBJS += test-name-hash.o TEST_BUILTINS_OBJS += test-online-cpus.o TEST_BUILTINS_OBJS += test-pack-mtimes.o TEST_BUILTINS_OBJS += test-parse-options.o diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c new file mode 100644 index 00000000000..e4ecd159b76 --- /dev/null +++ b/t/helper/test-name-hash.c @@ -0,0 +1,24 @@ +/* + * test-name-hash.c: Read a list of paths over stdin and report on their + * name-hash and full name-hash. + */ + +#include "test-tool.h" +#include "git-compat-util.h" +#include "pack-objects.h" +#include "strbuf.h" + +int cmd__name_hash(int argc UNUSED, const char **argv UNUSED) +{ + struct strbuf line = STRBUF_INIT; + + while (!strbuf_getline(&line, stdin)) { + uint32_t name_hash = pack_name_hash(line.buf); + uint32_t full_hash = pack_full_name_hash(line.buf); + + printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf); + } + + strbuf_release(&line); + return 0; +} diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c index 1ebb69a5dc4..e794058ab6d 100644 --- a/t/helper/test-tool.c +++ b/t/helper/test-tool.c @@ -44,6 +44,7 @@ static struct test_cmd cmds[] = { { "match-trees", cmd__match_trees }, { "mergesort", cmd__mergesort }, { "mktemp", cmd__mktemp }, + { "name-hash", cmd__name_hash }, { "online-cpus", cmd__online_cpus }, { "pack-mtimes", cmd__pack_mtimes }, { "parse-options", cmd__parse_options }, diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h index 21802ac27da..26ff30a5a9a 100644 --- a/t/helper/test-tool.h +++ b/t/helper/test-tool.h @@ -37,6 +37,7 @@ int cmd__lazy_init_name_hash(int argc, const char **argv); int cmd__match_trees(int argc, const char **argv); int cmd__mergesort(int argc, const char **argv); int cmd__mktemp(int argc, const char **argv); +int cmd__name_hash(int argc, const char **argv); int cmd__online_cpus(int argc, const char **argv); int cmd__pack_mtimes(int argc, const char **argv); int cmd__parse_options(int argc, const char **argv); diff --git a/t/perf/p5314-name-hash.sh b/t/perf/p5314-name-hash.sh new file mode 100755 index 00000000000..9fe26612fac --- /dev/null +++ b/t/perf/p5314-name-hash.sh @@ -0,0 +1,41 @@ +#!/bin/sh + +test_description='Tests pack performance using bitmaps' +. ./perf-lib.sh + +GIT_TEST_PASSING_SANITIZE_LEAK=0 +export GIT_TEST_PASSING_SANITIZE_LEAK + +test_perf_large_repo + +test_size 'paths at head' ' + git ls-tree -r --name-only HEAD >path-list && + wc -l <path-list +' + +test_size 'number of distinct name-hashes' ' + cat path-list | test-tool name-hash >name-hashes && + cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count && + wc -l <name-hash-count +' + +test_size 'number of distinct full-name-hashes' ' + cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count && + wc -l <full-name-hash-count +' + +test_size 'maximum multiplicity of name-hashes' ' + cat name-hash-count | \ + sort -nr | \ + head -n 1 | \ + awk "{ print \$1; }" +' + +test_size 'maximum multiplicity of fullname-hashes' ' + cat full-name-hash-count | \ + sort -nr | \ + head -n 1 | \ + awk "{ print \$1; }" +' + +test_done diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index caa3c125548..965c3abca5f 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -27,6 +27,32 @@ has_any () { grep -Ff "$1" "$2" } +# Since name-hash values are stored in the .bitmap files, add a test +# that checks that the name-hash calculations are stable across versions. +# Not exhaustive, but these hashing algorithms would be hard to change +# without causing deviations here. +test_expect_success 'name-hash value stability' ' + cat >names <<-\EOF && + first + second + third + one-long-enough-for-collisions + two-long-enough-for-collisions + EOF + + test-tool name-hash <names >out && + + cat >expect <<-\EOF && + 2582249472 3109209818 first + 2289942528 3781118409 second + 2300837888 3028707182 third + 2544516325 3241327563 one-long-enough-for-collisions + 2544516325 4207880830 two-long-enough-for-collisions + EOF + + test_cmp expect out +' + test_bitmap_cases () { writeLookupTable=false for i in "$@"