[v2,0/6] pack-objects: create new name-hash algorithm

Message ID	pull.1785.v2.git.1726692381.gitgitgadget@gmail.com (mailing list archive)
Headers	show Received: from mail-ed1-f51.google.com (mail-ed1-f51.google.com [209.85.208.51]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 67DC845027 for <git@vger.kernel.org>; Wed, 18 Sep 2024 20:46:26 +0000 (UTC) Message-Id: <pull.1785.v2.git.1726692381.gitgitgadget@gmail.com> In-Reply-To: <pull.1785.git.1725890210.gitgitgadget@gmail.com> References: <pull.1785.git.1725890210.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Wed, 18 Sep 2024 20:46:15 +0000 Subject: [PATCH v2 0/6] pack-objects: create new name-hash algorithm Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im, me@ttaylorr.com, johncai86@gmail.com, newren@gmail.com, Derrick Stolee <stolee@gmail.com>
Series	pack-objects: create new name-hash algorithm \| expand [v2,0/6] pack-objects: create new name-hash algorithm [v2,1/6] pack-objects: add --full-name-hash option [v2,2/6] repack: test --full-name-hash option [v2,3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH [v2,4/6] git-repack: update usage to match docs [v2,5/6] p5313: add size comparison test [v2,6/6] test-tool: add helper for name-hash values

Jean-Noël Avila via GitGitGadget Sept. 18, 2024, 8:46 p.m. UTC

I've been focused recently on understanding and mitigating the growth of a
few internal repositories. Some of these are growing much larger than
expected for the number of contributors, and there are multiple aspects to
why this growth is so large.

This is part of the RFC I submitted [1] [2] involving the path-walk API,
though this doesn't use the path-walk API directly. In full repack cases, it
seems that the --full-name-hash option gets nearly as good compression as
the --path-walk option introduced in that series. I continue to work on that
feature as well, so we can review it after this series is complete.

[1] https://github.com/gitgitgadget/git/pull/1786

[2]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

The main issue plaguing these repositories is that deltas are not being
computed against objects that appear at the same path. While the size of
these files at tip is one aspect of growth that would prevent this issue,
the changes to these files are reasonable and should result in good delta
compression. However, Git is not discovering the connections across
different versions of the same file.

One way to find some improvement in these repositories is to increase the
window size, which was an initial indicator that the delta compression could
be improved, but was not a clear indicator. After some digging (and
prototyping some analysis tools) the main discovery was that the current
name-hash algorithm only considers the last 16 characters in the path name
and has some naturally-occurring collisions within that scope.

This series introduces a new name-hash algorithm, but does not replace the
existing one. There are cases, such as packing a single snapshot of a
repository, where the existing algorithm outperforms the new one.

However, my findings show that when a repository has many versions of files
at the same path (and especially when there are many name-hash collisions)
then there are significant gains to be made using the new algorithm.

(This table is updated in v2 with even more private examples that were found
while communicating findings internally.)

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui |         438 MB  |               168 MB  |
| Repo B   |       6,255 MB  |               829 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |
| Repo E   |     100,957 MB  |            22,979 MB  |
| Repo F   |       8,308 MB  |               746 MB  |
| Repo G   |       4,329 MB  |             3,643 MB  |


I include Repo G here as an example where the improvement is less drastic,
since this repo does not demonstrate a very high rate of name-hash
collisions; the collisions that exist seem to be in paths that are not
changed very often. Thus, the standard name-hash algorithm is nearly as
effective in these full repacks.

The main change in this series is in patch 1, which adds the algorithm and
the option to 'git pack-objects' and 'git repack'. The remaining patches are
focused on creating more evidence around the value of the new name-hash
algorithm and its effects on the packfiles created with it.

I will also try to make clear that I've been focused on client-side
performance and size concerns. Based on discussions in v1, it appears that
the following is true:

 * This feature is completely orthogonal to delta islands.
 * Changing the name-hash function can lead to compatibility issues with
   .bitmap files, as they store a name-hash value. Without augmenting the
   data structure to indicate which name-hash value was used at write time,
   the full-name-hash values should not be stored in the .bitmap files or
   used when reading .bitmap files and other objects. Thus, the
   full-name-hash is marked as incompatible with bitmaps for now.
 * The --path-walk option from the RFC ([1] and [2]) is likely incompatible
   with the delta-islands feature without significant work, so this suggests
   that the --full-name-hash is a better long-term solution for servers.
   This would still require the .bitmap modifications to make it work, but
   it's a smaller leap to get there.

Thanks, -Stolee


UPDATES IN V2
=============

Thank you for all of the discussion on v1. Here are the things I've learned
and how they have changed this patch series:

 * The test-tool name-hash helper change collides with the removal of
   test-tool oid-array, so I have rebased this series onto the latest master
   branch.
 * I learned (thanks, Taylor and Peff) that the .bitmap files store
   name-hash values. This means that the --full-name-hash option risks
   having different name-hash functions across bitmap reads and dynamic
   computations from the object walk. For this reason, the option is now
   made explicit to not work with bitmap walks. This could be corrected in
   the future with a modification to the .bitmap data structure to store a
   "name-hash version" value. This behavior is confirmed with a test.
 * A new test explicitly tests that git repack --full-name-hash passes the
   option to git pack-objects. This uses a new helper method,
   test_subcommand_flex that is more flexible than the existing
   test_subcommand.
 * To get more testing across different cases, add a new
   GIT_TEST_FULL_NAME_HASH environment variable. I point out which tests
   need modification when this is enabled.
 * The size-comparison tests were not portable with their use of du.
   Instead, use wc -c for the single pack-file that remains after a git
   repack -adf ... command.
 * The final patch still updates the test-tool name-hash helper, which was
   previously only used by a performance test. Now, use that test in a
   regular test to help guarantee that the functions do not change over
   time. This is directly related to the fact that these values can be
   stored in the .bitmap files and we need stable hash functions to keep
   compatibility with files written by previous versions of Git.

Other things that have happened include investigations into ways to adapt
the full-name hash to improve upon the name-hash. I did some experimenting
with increasing the size of 'struct object_entry' by using a 64-bit hash
value (name-hash, then full-name-hash) for a single-pass compression or two
32-bit hash values for a two-pass compression process. I include my WIP
branch at [3] to show what I tried, though the single-pass option did not
present any improvements and the two-pass option seems to be broken to the
point that the compression is substantially worse. (I'll try to elaborate on
this in a reply to this cover letter.)

[3]
https://github.com/derrickstolee/git/compare/full-name...derrickstolee:git:full-name-wip

Derrick Stolee (6):
  pack-objects: add --full-name-hash option
  repack: test --full-name-hash option
  pack-objects: add GIT_TEST_FULL_NAME_HASH
  git-repack: update usage to match docs
  p5313: add size comparison test
  test-tool: add helper for name-hash values

 Documentation/git-pack-objects.txt |  3 +-
 Documentation/git-repack.txt       |  4 +-
 Makefile                           |  1 +
 builtin/pack-objects.c             | 25 ++++++++---
 builtin/repack.c                   |  9 +++-
 ci/run-build-and-tests.sh          |  1 +
 pack-objects.h                     | 21 +++++++++
 t/README                           |  4 ++
 t/helper/test-name-hash.c          | 23 ++++++++++
 t/helper/test-tool.c               |  1 +
 t/helper/test-tool.h               |  1 +
 t/perf/p5313-pack-objects.sh       | 71 ++++++++++++++++++++++++++++++
 t/perf/p5314-name-hash.sh          | 41 +++++++++++++++++
 t/t0450/txt-help-mismatches        |  1 -
 t/t5300-pack-object.sh             | 15 +++++++
 t/t5310-pack-bitmaps.sh            | 26 +++++++++++
 t/t5510-fetch.sh                   |  7 ++-
 t/t5616-partial-clone.sh           | 26 ++++++++++-
 t/t6020-bundle-misc.sh             |  6 ++-
 t/t7700-repack.sh                  |  7 +++
 t/test-lib-functions.sh            | 27 ++++++++++++
 21 files changed, 307 insertions(+), 13 deletions(-)
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5313-pack-objects.sh
 create mode 100755 t/perf/p5314-name-hash.sh


base-commit: 3fb745257b30a643ee78c9a7c52ab107c82e4745
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1785%2Fderrickstolee%2Ffull-name-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1785/derrickstolee/full-name-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1785

Range-diff vs v1:

 1:  ff30f774ca8 ! 1:  5f52e261765 pack-objects: add --full-name-hash option
     @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
       		OPT_END(),
       	};
       
     +@@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const char *prefix)
     + 	if (pack_to_stdout || !rev_list_all)
     + 		write_bitmap_index = 0;
     + 
     ++	if (write_bitmap_index && use_full_name_hash)
     ++		die(_("currently, the --full-name-hash option is incompatible with --write-bitmap-index"));
     ++
     + 	if (use_delta_islands)
     + 		strvec_push(&rp, "--topo-order");
     + 
      
       ## builtin/repack.c ##
      @@ builtin/repack.c: struct pack_objects_args {
     @@ pack-objects.h: static inline uint32_t pack_name_hash(const char *name)
      +		return 0;
      +
      +	/*
     -+	 * Just do the dumbest thing possible: add random multiples of a
     -+	 * large prime number with a binary shift. Goal is not cryptographic,
     -+	 * but generally uniformly distributed.
     ++	 * Do the simplest thing that will resemble pseduo-randomness: add
     ++	 * random multiples of a large prime number with a binary shift.
     ++	 * The goal is not to be cryptographic, but to be generally
     ++	 * uniformly distributed.
      +	 */
      +	while ((c = *name++) != 0) {
      +		hash += c * bigp;
     @@ pack-objects.h: static inline uint32_t pack_name_hash(const char *name)
       static inline enum object_type oe_type(const struct object_entry *e)
       {
       	return e->type_valid ? e->type_ : OBJ_BAD;
     +
     + ## t/t5300-pack-object.sh ##
     +@@ t/t5300-pack-object.sh: do
     + 	'
     + done
     + 
     ++# The following test is not necessarily a permanent choice, but since we do not
     ++# have a "name hash version" bit in the .bitmap file format, we cannot write the
     ++# full-name hash values into the .bitmap file without risking breakage later.
     ++#
     ++# TODO: Make these compatible in the future and replace this test with the
     ++# expected behavior when both are specified.
     ++test_expect_success '--full-name-hash and --write-bitmap-index are incompatible' '
     ++	test_must_fail git pack-objects base --all \
     ++		--full-name-hash --write-bitmap-index 2>err &&
     ++	grep incompatible err &&
     ++
     ++	# --stdout option silently removes --write-bitmap-index
     ++	git pack-objects --stdout --all --full-name-hash --write-bitmap-index >out
     ++'
     ++
     + test_done
 -:  ----------- > 2:  48b87cccedb repack: test --full-name-hash option
 -:  ----------- > 3:  48b3876a102 pack-objects: add GIT_TEST_FULL_NAME_HASH
 2:  eb0aa5f4e94 = 4:  c5827126120 git-repack: update usage to match docs
 3:  ed3ea4281a9 ! 5:  999b1d09424 p5313: add size comparison test
     @@ t/perf/p5313-pack-objects.sh (new)
      +'
      +
      +test_size 'repack size' '
     -+	du -a .git/objects/pack | sort -nr | awk "{ print \$1; }" | head -n 1
     ++	wc -c <.git/objects/pack/pack-*.pack
      +'
      +
      +test_perf 'repack with --full-name-hash' '
     @@ t/perf/p5313-pack-objects.sh (new)
      +'
      +
      +test_size 'repack size with --full-name-hash' '
     -+	du -a .git/objects/pack | sort -nr | awk "{ print \$1; }" | head -n 1
     ++	wc -c <.git/objects/pack/pack-*.pack
      +'
      +
      +test_done
 4:  5bcbcce6665 ! 6:  7e47fc8cb53 p5314: add a size test for name-hash collisions
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    p5314: add a size test for name-hash collisions
     +    test-tool: add helper for name-hash values
      
          Add a new test-tool helper, name-hash, to output the value of the
          name-hash algorithms for the input list of strings, one per line.
      
     +    Since the name-hash values can be stored in the .bitmap files, it is
     +    important that these hash functions do not change across Git versions.
     +    Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
     +    the current values. Due to how these functions are implemented, it would
     +    be difficult to change them without disturbing these values.
     +
          Create a performance test that uses test_size to demonstrate how
          collisions occur for these hash algorithms. This test helps inform
          someone as to the behavior of the name-hash algorithms for their repo
     @@ Makefile: TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
       TEST_BUILTINS_OBJS += test-mergesort.o
       TEST_BUILTINS_OBJS += test-mktemp.o
      +TEST_BUILTINS_OBJS += test-name-hash.o
     - TEST_BUILTINS_OBJS += test-oid-array.o
       TEST_BUILTINS_OBJS += test-online-cpus.o
       TEST_BUILTINS_OBJS += test-pack-mtimes.o
     + TEST_BUILTINS_OBJS += test-parse-options.o
      
       ## t/helper/test-name-hash.c (new) ##
      @@
     @@ t/helper/test-tool.c: static struct test_cmd cmds[] = {
       	{ "mergesort", cmd__mergesort },
       	{ "mktemp", cmd__mktemp },
      +	{ "name-hash", cmd__name_hash },
     - 	{ "oid-array", cmd__oid_array },
       	{ "online-cpus", cmd__online_cpus },
       	{ "pack-mtimes", cmd__pack_mtimes },
     + 	{ "parse-options", cmd__parse_options },
      
       ## t/helper/test-tool.h ##
      @@ t/helper/test-tool.h: int cmd__lazy_init_name_hash(int argc, const char **argv);
     @@ t/perf/p5314-name-hash.sh (new)
      +'
      +
      +test_done
     +
     + ## t/t5310-pack-bitmaps.sh ##
     +@@ t/t5310-pack-bitmaps.sh: has_any () {
     + 	grep -Ff "$1" "$2"
     + }
     + 
     ++# Since name-hash values are stored in the .bitmap files, add a test
     ++# that checks that the name-hash calculations are stable across versions.
     ++# Not exhaustive, but these hashing algorithms would be hard to change
     ++# without causing deviations here.
     ++test_expect_success 'name-hash value stability' '
     ++	cat >names <<-\EOF &&
     ++	first
     ++	second
     ++	third
     ++	one-long-enough-for-collisions
     ++	two-long-enough-for-collisions
     ++	EOF
     ++
     ++	test-tool name-hash <names >out &&
     ++
     ++	cat >expect <<-\EOF &&
     ++	2582249472	3109209818	first
     ++	2289942528	3781118409	second
     ++	2300837888	3028707182	third
     ++	2544516325	3241327563	one-long-enough-for-collisions
     ++	2544516325	4207880830	two-long-enough-for-collisions
     ++	EOF
     ++
     ++	test_cmp expect out
     ++'
     ++
     + test_bitmap_cases () {
     + 	writeLookupTable=false
     + 	for i in "$@"

Derrick Stolee Sept. 18, 2024, 11:30 p.m. UTC | #1

On 9/18/24 4:46 PM, Derrick Stolee via GitGitGadget wrote:
...
 > Other things that have happened include investigations into ways to adapt
 > the full-name hash to improve upon the name-hash. I did some experimenting
 > with increasing the size of 'struct object_entry' by using a 64-bit hash
 > value (name-hash, then full-name-hash) for a single-pass compression or two
 > 32-bit hash values for a two-pass compression process. I include my WIP
 > branch at [3] to show what I tried, though the single-pass option did not
 > present any improvements and the two-pass option seems to be broken to the
 > point that the compression is substantially worse. (I'll try to elaborate on
 > this in a reply to this cover letter.)
 >
 > [3] 
https://github.com/derrickstolee/git/compare/full-name...derrickstolee:git:full-name-wip

To break down what I attempted in [3], let me break down a few things.

First, I tried using a 64-bit hash value [1]. This used the standard name-hash
as the most-significant digits and the full-name-hash as the least-significant
digits. The goal here was to still have locality from the name-hash but get a
good partition based on full-name-hash within those collisions.

However, when sorting this way, the boundaries of the full-name-hash partitions
are ineffective at getting good delta bases because the largest object from one
full-name-hash set is next to the smallest object from the next full-name-hash
set. Even when a full-name-hash set has size one, it is sorted roughly randomly
among the other colliding path names instead of grouped nicely with objects of
a similar size. This makes the results nearly identical to the 32-bit
full-name-hash implementation.

[1] 
https://github.com/derrickstolee/git/commit/aaa6befa3016667ea5eb10fdd6aa2b7fcec3a52e

Second, I tried storing two 32-bit hashes and doing a two-pass delta search [2].
In theory, this should be very similar to the --path-walk feature from the RFC.
However, I failed to make it work. Something about this version of a two-pass
walk was hitting some strange behavior. For example, I had to put in this extra
condition [4] if a best delta base was not found, or else we could get a
segfault.

[2] 
https://github.com/derrickstolee/git/commit/bf71271040ab93a624a8cdf5bc8aaff68e9b1b17

[4] 
https://github.com/derrickstolee/git/commit/fedc4fc543e50563f4748a5ffc45b51b530023e0

In fact, the results were not just _bad_ but they were _significantly worse_.

It took me a long time to report these details because they just didn't make
sense and I couldn't figure out what was going wrong. I'd be very grateful to
anyone who could explore these WIP commits and point out what I'm doing wrong
so I can learn and maybe we can get a boost to the feature.

Even if we had strong data from these examples, I'm not sure that we'd want
to add four bytes per object to the packing data, especially in a way that
impacts users that aren't even using the new feature. We would want to
explore options that use some kind of hashtable to map objects to their
64-bit hash values, perhaps. It also affects the .bitmap file format, which
would need modification even for a new 32-bit hash function (though one of
the same size could be used by adding an extension saying "I'm using hash
function v2" and leave the rest of the structure the same).

I would also like to test the performance against the threaded version of the
--path-walk feature, which I recently got working in my prototype branch [5].

[5] 
https://github.com/derrickstolee/git/pull/28/commits/a9fc233390ae00e3d4b156be64d6b3974e30d8a1

Thanks,
-Stolee

Junio C Hamano Oct. 4, 2024, 9:17 p.m. UTC | #2

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> I've been focused recently on understanding and mitigating the growth of a
> few internal repositories. Some of these are growing much larger than
> expected for the number of contributors, and there are multiple aspects to
> why this growth is so large.
> ...
> Derrick Stolee (6):
>   pack-objects: add --full-name-hash option
>   repack: test --full-name-hash option
>   pack-objects: add GIT_TEST_FULL_NAME_HASH
>   git-repack: update usage to match docs
>   p5313: add size comparison test
>   test-tool: add helper for name-hash values

Recent CI jobs on 'seen' has been failing the leak-check jobs, e.g.

  https://github.com/git/git/actions/runs/11184876759/job/31096601111#step:4:1886

shows that t5310 and t5334 are having problems.

I randomly picked this topic and ejected it out of 'seen', and the
result is fairing a bit better. t5310 that failed 228/228 subtests
in the above run is now passing.  I didn't run this topic alone
under the leak checker, so it is entirely possible that there are
some unfortunate interactions with other topics in flight.

t5334 still fails 8/16 subtests just like the above run, exonerating
this topic for its breakage.

  https://github.com/git/git/actions/runs/11186737635/job/31102378478#step:4:1880

Taylor Blau Oct. 4, 2024, 10:30 p.m. UTC | #3

On Fri, Oct 04, 2024 at 02:17:30PM -0700, Junio C Hamano wrote:
> t5334 still fails 8/16 subtests just like the above run, exonerating
> this topic for its breakage.
>
>   https://github.com/git/git/actions/runs/11186737635/job/31102378478#step:4:1880

This is not all too surprising, since part two of the incremental MIDX
topic introduces some new leaks which were not seen by Patrick's recent
leakfixes. So I expect that this broke with 9d4855eef3 (midx-write: fix
leaking buffer, 2024-09-30), which marked t5534 as leak-free.

And that's true, without the patches from tb/incremental-midx-part-2.
The leak stems from the fact that the 'bitmap_index' struct now has a
new optional "base" pointer, which points to another 'bitmap_index'
corresponding to the previous MIDX layer.

The fix is fairly straightforward, and should be limited to:

--- 8< ---
Subject: [PATCH] fixup! pack-bitmap.c: open and store incremental bitmap
 layers

The incremental MIDX bitmap work was done prior to 9d4855eef3
(midx-write: fix leaking buffer, 2024-09-30), and causes test failures
in t5334 in a post-9d4855eef3 world.

The leak looks like:

    Direct leak of 264 byte(s) in 1 object(s) allocated from:
        #0 0x7f6bcd87eaca in calloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:90
        #1 0x55ad1428e8a4 in xcalloc wrapper.c:151
        #2 0x55ad14199e16 in prepare_midx_bitmap_git pack-bitmap.c:742
        #3 0x55ad14199447 in open_midx_bitmap_1 pack-bitmap.c:507
        #4 0x55ad14199cca in open_midx_bitmap pack-bitmap.c:704
        #5 0x55ad14199d44 in open_bitmap pack-bitmap.c:717
        #6 0x55ad14199dc2 in prepare_bitmap_git pack-bitmap.c:733
        #7 0x55ad1419e496 in test_bitmap_walk pack-bitmap.c:2698
        #8 0x55ad14047b0b in cmd_rev_list builtin/rev-list.c:629
        #9 0x55ad13f71cd6 in run_builtin git.c:487
        #10 0x55ad13f72132 in handle_builtin git.c:756
        #11 0x55ad13f72380 in run_argv git.c:826
        #12 0x55ad13f728f4 in cmd_main git.c:961
        #13 0x55ad1407d3ae in main common-main.c:64
        #14 0x7f6bcd5f0c89 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
        #15 0x7f6bcd5f0d44 in __libc_start_main_impl ../csu/libc-start.c:360
        #16 0x55ad13f6ff90 in _start (git+0x1ef90) (BuildId: 3e63cdd415f1d185b21da3035cb48332510dddce)

, and is a result of us not freeing the resources corresponding to the
bitmap's base layer, if one was present.

Rectify that leak by calling the newly-introduced free_bitmap_index()
function on the base layer to ensure that its resources are also freed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 2b4c3972e5..fe0df2b6c3 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -3045,6 +3045,7 @@ void free_bitmap_index(struct bitmap_index *b)
 		close_midx_revindex(b->midx);
 	}
 	free_pseudo_merge_map(&b->pseudo_merges);
+	free_bitmap_index(b->base);
 	free(b);
 }

--
2.47.0.rc1.154.ge6e38f19f35.dirty
--- >8 ---

That should apply on top of 'seen' easily, and will at least fix the
leak failures you pointed out here.

When the integration branches are rewound after 2.47.0 is tagged, I'll
send a new version of this topic based on Patrick's work here.

Thanks,
Taylor

Jeff King Oct. 4, 2024, 10:35 p.m. UTC | #4

On Fri, Oct 04, 2024 at 02:17:30PM -0700, Junio C Hamano wrote:

> > Derrick Stolee (6):
> >   pack-objects: add --full-name-hash option
> >   repack: test --full-name-hash option
> >   pack-objects: add GIT_TEST_FULL_NAME_HASH
> >   git-repack: update usage to match docs
> >   p5313: add size comparison test
> >   test-tool: add helper for name-hash values
> 
> Recent CI jobs on 'seen' has been failing the leak-check jobs, e.g.
> 
>   https://github.com/git/git/actions/runs/11184876759/job/31096601111#step:4:1886
> 
> shows that t5310 and t5334 are having problems.
> 
> I randomly picked this topic and ejected it out of 'seen', and the
> result is fairing a bit better. t5310 that failed 228/228 subtests
> in the above run is now passing.  I didn't run this topic alone
> under the leak checker, so it is entirely possible that there are
> some unfortunate interactions with other topics in flight.

Maybe squash this into the final patch of that series?

diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
index 15fb8f853c..e4ecd159b7 100644
--- a/t/helper/test-name-hash.c
+++ b/t/helper/test-name-hash.c
@@ -19,5 +19,6 @@ int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
 		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
 	}
 
+	strbuf_release(&line);
 	return 0;
 }

That seems to be enough to clear t5310 on "seen". It was not noticed in
the original topic because t5310 was not yet marked as leak-free in its
base. That happened in fa016423c7 (revision: fix leaking parents when
simplifying commits, 2024-09-26)

-Peff

Derrick Stolee Oct. 8, 2024, 6:32 p.m. UTC | #5

On 9/18/24 4:46 PM, Derrick Stolee via GitGitGadget wrote:
> I've been focused recently on understanding and mitigating the growth of a
> few internal repositories. Some of these are growing much larger than
> expected for the number of contributors, and there are multiple aspects to
> why this growth is so large.
> 
> This is part of the RFC I submitted [1] [2] involving the path-walk API,
> though this doesn't use the path-walk API directly. In full repack cases, it
> seems that the --full-name-hash option gets nearly as good compression as
> the --path-walk option introduced in that series. I continue to work on that
> feature as well, so we can review it after this series is complete.
Based on the discussion in this thread, I recommend pulling this branch out
of 'seen' and instead focus on the --path-walk option which is now available
at [3]. If we choose to revisit the --full-name-hash option, then that can
be done on top of that feature which is probably a higher priority.

[3] https://lore.kernel.org/git/pull.1813.git.1728396723.gitgitgadget@gmail.com

Thanks,
-Stolee

[v2,0/6] pack-objects: create new name-hash algorithm

Message

Comments