diff mbox series

[7/7] test-tool: add helper for name-hash values

Message ID ab341dd0e58f77b3c7c6f5765d9e34cb02bef56f.1730775908.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series pack-objects: Create an alternative name hash algorithm (recreated) | expand

Commit Message

Derrick Stolee Nov. 5, 2024, 3:05 a.m. UTC
From: Derrick Stolee <stolee@gmail.com>

Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.

Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values.

Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.

My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                        4.5K
5314.2: number of distinct name-hashes                       4.1K
5314.3: number of distinct full-name-hashes                  4.5K
5314.4: maximum multiplicity of name-hashes                    13
5314.5: maximum multiplicity of fullname-hashes                 1

Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.

In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                       19.6K
5314.2: number of distinct name-hashes                       8.2K
5314.3: number of distinct full-name-hashes                 19.6K
5314.4: maximum multiplicity of name-hashes                   279
5314.5: maximum multiplicity of fullname-hashes                 1

[1] https://github.com/microsoft/fluentui

That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

In this repository, no collisions occur for the full-name-hash
algorithm.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                      221.6K
5314.2: number of distinct name-hashes                      72.0K
5314.3: number of distinct full-name-hashes                221.6K
5314.4: maximum multiplicity of name-hashes                 14.4K
5314.5: maximum multiplicity of fullname-hashes                 2

Even in this repository with many more paths at HEAD, the collision rate
was low and the maximum number of paths being grouped into a single
bucket by the full-path-name algorithm was two.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Makefile                  |  1 +
 t/helper/test-name-hash.c | 24 +++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/perf/p5314-name-hash.sh | 41 +++++++++++++++++++++++++++++++++++++++
 t/t5310-pack-bitmaps.sh   | 26 +++++++++++++++++++++++++
 6 files changed, 94 insertions(+)
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5314-name-hash.sh

Comments

Taylor Blau Nov. 21, 2024, 8:42 p.m. UTC | #1
On Tue, Nov 05, 2024 at 03:05:07AM +0000, Derrick Stolee via GitGitGadget wrote:
> Test                                              this tree
> -----------------------------------------------------------------
> 5314.1: paths at head                                        4.5K
> 5314.2: number of distinct name-hashes                       4.1K
> 5314.3: number of distinct full-name-hashes                  4.5K
> 5314.4: maximum multiplicity of name-hashes                    13
> 5314.5: maximum multiplicity of fullname-hashes                 1
>
> Here, the maximum collision multiplicity is 13, but around 10% of paths
> have a collision with another path.

Neat.

> diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
> new file mode 100644
> index 00000000000..e4ecd159b76
> --- /dev/null
> +++ b/t/helper/test-name-hash.c
> @@ -0,0 +1,24 @@
> +/*
> + * test-name-hash.c: Read a list of paths over stdin and report on their
> + * name-hash and full name-hash.
> + */
> +
> +#include "test-tool.h"
> +#include "git-compat-util.h"
> +#include "pack-objects.h"
> +#include "strbuf.h"
> +
> +int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
> +{
> +	struct strbuf line = STRBUF_INIT;
> +
> +	while (!strbuf_getline(&line, stdin)) {
> +		uint32_t name_hash = pack_name_hash(line.buf);
> +		uint32_t full_hash = pack_full_name_hash(line.buf);
> +
> +		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);

I'm definitely nitpicking, but having a tab to separate these two 32-bit
values feels odd when we know already that they will be at most
10-characters wide.

I probably would have written:

    printf("%10"PRIu32" %10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);

instead, but this is obviously not a big deal either way ;-).

> diff --git a/t/perf/p5314-name-hash.sh b/t/perf/p5314-name-hash.sh
> new file mode 100755
> index 00000000000..9fe26612fac
> --- /dev/null
> +++ b/t/perf/p5314-name-hash.sh
> @@ -0,0 +1,41 @@
> +#!/bin/sh
> +
> +test_description='Tests pack performance using bitmaps'
> +. ./perf-lib.sh
> +
> +GIT_TEST_PASSING_SANITIZE_LEAK=0
> +export GIT_TEST_PASSING_SANITIZE_LEAK

Does this conflict with Patrick's series to remove these leak checking
annotations? I think it might, which is not unexpected given this series
was written before that one (and it's my fault for not reviewing it
earlier).

> +test_perf_large_repo
> +
> +test_size 'paths at head' '
> +	git ls-tree -r --name-only HEAD >path-list &&
> +	wc -l <path-list
> +'
> +
> +test_size 'number of distinct name-hashes' '
> +	cat path-list | test-tool name-hash >name-hashes &&
> +	cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count &&

In these two (and a handful of others lower down in this same script)
the "cat ... |" is unnecessary. I think this one should be written as:

    test-tool name-hash <path-list >name-hashes &&
    awk "{ print \$1; }" <name-hashes | sort | uniq -c >name-hash-count &&

(sort -n is unnecessary, since we just care about getting the list in
sorted order so that "uniq -c" can count the number of unique values).

> +	wc -l <name-hash-count
> +'
> +
> +test_size 'number of distinct full-name-hashes' '
> +	cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count &&
> +	wc -l <full-name-hash-count
> +'
> +
> +test_size 'maximum multiplicity of name-hashes' '
> +	cat name-hash-count | \
> +		sort -nr | \
> +		head -n 1 | \
> +		awk "{ print \$1; }"
> +'
> +
> +test_size 'maximum multiplicity of fullname-hashes' '
> +	cat full-name-hash-count | \
> +		sort -nr | \
> +		head -n 1 | \
> +		awk "{ print \$1; }"

Nitpicking again, but you could extract the "sort | head | awk" pipeline
into a function.

Thanks,
Taylor
Jonathan Tan Nov. 22, 2024, 1:23 a.m. UTC | #2
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
> 
> Add a new test-tool helper, name-hash, to output the value of the
> name-hash algorithms for the input list of strings, one per line.

I've looked at all 7 patches.

I didn't really understand the concern with shallow in patch 6 (in
particular, the documentation of "git pack-objects --shallow" seems
to imply that it's for use by a server to a shallow client, but at
the point that the server would need such a feature, it probably would
already have bitmaps packed with the new hash algorithm). I didn't look
at it further, though, since I had an algorithm that seemed to also do
OK in the shallow test. So we might be able to drop patch 6.

Other than that, and other than all my comments and Taylor's comments,
this series looks good.
diff mbox series

Patch

diff --git a/Makefile b/Makefile
index 6f5986b66ea..65403f6dd09 100644
--- a/Makefile
+++ b/Makefile
@@ -816,6 +816,7 @@  TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
 TEST_BUILTINS_OBJS += test-match-trees.o
 TEST_BUILTINS_OBJS += test-mergesort.o
 TEST_BUILTINS_OBJS += test-mktemp.o
+TEST_BUILTINS_OBJS += test-name-hash.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
new file mode 100644
index 00000000000..e4ecd159b76
--- /dev/null
+++ b/t/helper/test-name-hash.c
@@ -0,0 +1,24 @@ 
+/*
+ * test-name-hash.c: Read a list of paths over stdin and report on their
+ * name-hash and full name-hash.
+ */
+
+#include "test-tool.h"
+#include "git-compat-util.h"
+#include "pack-objects.h"
+#include "strbuf.h"
+
+int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
+{
+	struct strbuf line = STRBUF_INIT;
+
+	while (!strbuf_getline(&line, stdin)) {
+		uint32_t name_hash = pack_name_hash(line.buf);
+		uint32_t full_hash = pack_full_name_hash(line.buf);
+
+		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
+	}
+
+	strbuf_release(&line);
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4..e794058ab6d 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -44,6 +44,7 @@  static struct test_cmd cmds[] = {
 	{ "match-trees", cmd__match_trees },
 	{ "mergesort", cmd__mergesort },
 	{ "mktemp", cmd__mktemp },
+	{ "name-hash", cmd__name_hash },
 	{ "online-cpus", cmd__online_cpus },
 	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..26ff30a5a9a 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -37,6 +37,7 @@  int cmd__lazy_init_name_hash(int argc, const char **argv);
 int cmd__match_trees(int argc, const char **argv);
 int cmd__mergesort(int argc, const char **argv);
 int cmd__mktemp(int argc, const char **argv);
+int cmd__name_hash(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
diff --git a/t/perf/p5314-name-hash.sh b/t/perf/p5314-name-hash.sh
new file mode 100755
index 00000000000..9fe26612fac
--- /dev/null
+++ b/t/perf/p5314-name-hash.sh
@@ -0,0 +1,41 @@ 
+#!/bin/sh
+
+test_description='Tests pack performance using bitmaps'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_size 'paths at head' '
+	git ls-tree -r --name-only HEAD >path-list &&
+	wc -l <path-list
+'
+
+test_size 'number of distinct name-hashes' '
+	cat path-list | test-tool name-hash >name-hashes &&
+	cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count &&
+	wc -l <name-hash-count
+'
+
+test_size 'number of distinct full-name-hashes' '
+	cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count &&
+	wc -l <full-name-hash-count
+'
+
+test_size 'maximum multiplicity of name-hashes' '
+	cat name-hash-count | \
+		sort -nr | \
+		head -n 1 | \
+		awk "{ print \$1; }"
+'
+
+test_size 'maximum multiplicity of fullname-hashes' '
+	cat full-name-hash-count | \
+		sort -nr | \
+		head -n 1 | \
+		awk "{ print \$1; }"
+'
+
+test_done
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index caa3c125548..965c3abca5f 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -27,6 +27,32 @@  has_any () {
 	grep -Ff "$1" "$2"
 }
 
+# Since name-hash values are stored in the .bitmap files, add a test
+# that checks that the name-hash calculations are stable across versions.
+# Not exhaustive, but these hashing algorithms would be hard to change
+# without causing deviations here.
+test_expect_success 'name-hash value stability' '
+	cat >names <<-\EOF &&
+	first
+	second
+	third
+	one-long-enough-for-collisions
+	two-long-enough-for-collisions
+	EOF
+
+	test-tool name-hash <names >out &&
+
+	cat >expect <<-\EOF &&
+	2582249472	3109209818	first
+	2289942528	3781118409	second
+	2300837888	3028707182	third
+	2544516325	3241327563	one-long-enough-for-collisions
+	2544516325	4207880830	two-long-enough-for-collisions
+	EOF
+
+	test_cmp expect out
+'
+
 test_bitmap_cases () {
 	writeLookupTable=false
 	for i in "$@"