Message ID | pull.1785.v2.git.1726692381.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | pack-objects: create new name-hash algorithm | expand |
On 9/18/24 4:46 PM, Derrick Stolee via GitGitGadget wrote: ... > Other things that have happened include investigations into ways to adapt > the full-name hash to improve upon the name-hash. I did some experimenting > with increasing the size of 'struct object_entry' by using a 64-bit hash > value (name-hash, then full-name-hash) for a single-pass compression or two > 32-bit hash values for a two-pass compression process. I include my WIP > branch at [3] to show what I tried, though the single-pass option did not > present any improvements and the two-pass option seems to be broken to the > point that the compression is substantially worse. (I'll try to elaborate on > this in a reply to this cover letter.) > > [3] https://github.com/derrickstolee/git/compare/full-name...derrickstolee:git:full-name-wip To break down what I attempted in [3], let me break down a few things. First, I tried using a 64-bit hash value [1]. This used the standard name-hash as the most-significant digits and the full-name-hash as the least-significant digits. The goal here was to still have locality from the name-hash but get a good partition based on full-name-hash within those collisions. However, when sorting this way, the boundaries of the full-name-hash partitions are ineffective at getting good delta bases because the largest object from one full-name-hash set is next to the smallest object from the next full-name-hash set. Even when a full-name-hash set has size one, it is sorted roughly randomly among the other colliding path names instead of grouped nicely with objects of a similar size. This makes the results nearly identical to the 32-bit full-name-hash implementation. [1] https://github.com/derrickstolee/git/commit/aaa6befa3016667ea5eb10fdd6aa2b7fcec3a52e Second, I tried storing two 32-bit hashes and doing a two-pass delta search [2]. In theory, this should be very similar to the --path-walk feature from the RFC. However, I failed to make it work. Something about this version of a two-pass walk was hitting some strange behavior. For example, I had to put in this extra condition [4] if a best delta base was not found, or else we could get a segfault. [2] https://github.com/derrickstolee/git/commit/bf71271040ab93a624a8cdf5bc8aaff68e9b1b17 [4] https://github.com/derrickstolee/git/commit/fedc4fc543e50563f4748a5ffc45b51b530023e0 In fact, the results were not just _bad_ but they were _significantly worse_. It took me a long time to report these details because they just didn't make sense and I couldn't figure out what was going wrong. I'd be very grateful to anyone who could explore these WIP commits and point out what I'm doing wrong so I can learn and maybe we can get a boost to the feature. Even if we had strong data from these examples, I'm not sure that we'd want to add four bytes per object to the packing data, especially in a way that impacts users that aren't even using the new feature. We would want to explore options that use some kind of hashtable to map objects to their 64-bit hash values, perhaps. It also affects the .bitmap file format, which would need modification even for a new 32-bit hash function (though one of the same size could be used by adding an extension saying "I'm using hash function v2" and leave the rest of the structure the same). I would also like to test the performance against the threaded version of the --path-walk feature, which I recently got working in my prototype branch [5]. [5] https://github.com/derrickstolee/git/pull/28/commits/a9fc233390ae00e3d4b156be64d6b3974e30d8a1 Thanks, -Stolee
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > I've been focused recently on understanding and mitigating the growth of a > few internal repositories. Some of these are growing much larger than > expected for the number of contributors, and there are multiple aspects to > why this growth is so large. > ... > Derrick Stolee (6): > pack-objects: add --full-name-hash option > repack: test --full-name-hash option > pack-objects: add GIT_TEST_FULL_NAME_HASH > git-repack: update usage to match docs > p5313: add size comparison test > test-tool: add helper for name-hash values Recent CI jobs on 'seen' has been failing the leak-check jobs, e.g. https://github.com/git/git/actions/runs/11184876759/job/31096601111#step:4:1886 shows that t5310 and t5334 are having problems. I randomly picked this topic and ejected it out of 'seen', and the result is fairing a bit better. t5310 that failed 228/228 subtests in the above run is now passing. I didn't run this topic alone under the leak checker, so it is entirely possible that there are some unfortunate interactions with other topics in flight. t5334 still fails 8/16 subtests just like the above run, exonerating this topic for its breakage. https://github.com/git/git/actions/runs/11186737635/job/31102378478#step:4:1880
On Fri, Oct 04, 2024 at 02:17:30PM -0700, Junio C Hamano wrote: > t5334 still fails 8/16 subtests just like the above run, exonerating > this topic for its breakage. > > https://github.com/git/git/actions/runs/11186737635/job/31102378478#step:4:1880 This is not all too surprising, since part two of the incremental MIDX topic introduces some new leaks which were not seen by Patrick's recent leakfixes. So I expect that this broke with 9d4855eef3 (midx-write: fix leaking buffer, 2024-09-30), which marked t5534 as leak-free. And that's true, without the patches from tb/incremental-midx-part-2. The leak stems from the fact that the 'bitmap_index' struct now has a new optional "base" pointer, which points to another 'bitmap_index' corresponding to the previous MIDX layer. The fix is fairly straightforward, and should be limited to: --- 8< --- Subject: [PATCH] fixup! pack-bitmap.c: open and store incremental bitmap layers The incremental MIDX bitmap work was done prior to 9d4855eef3 (midx-write: fix leaking buffer, 2024-09-30), and causes test failures in t5334 in a post-9d4855eef3 world. The leak looks like: Direct leak of 264 byte(s) in 1 object(s) allocated from: #0 0x7f6bcd87eaca in calloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:90 #1 0x55ad1428e8a4 in xcalloc wrapper.c:151 #2 0x55ad14199e16 in prepare_midx_bitmap_git pack-bitmap.c:742 #3 0x55ad14199447 in open_midx_bitmap_1 pack-bitmap.c:507 #4 0x55ad14199cca in open_midx_bitmap pack-bitmap.c:704 #5 0x55ad14199d44 in open_bitmap pack-bitmap.c:717 #6 0x55ad14199dc2 in prepare_bitmap_git pack-bitmap.c:733 #7 0x55ad1419e496 in test_bitmap_walk pack-bitmap.c:2698 #8 0x55ad14047b0b in cmd_rev_list builtin/rev-list.c:629 #9 0x55ad13f71cd6 in run_builtin git.c:487 #10 0x55ad13f72132 in handle_builtin git.c:756 #11 0x55ad13f72380 in run_argv git.c:826 #12 0x55ad13f728f4 in cmd_main git.c:961 #13 0x55ad1407d3ae in main common-main.c:64 #14 0x7f6bcd5f0c89 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #15 0x7f6bcd5f0d44 in __libc_start_main_impl ../csu/libc-start.c:360 #16 0x55ad13f6ff90 in _start (git+0x1ef90) (BuildId: 3e63cdd415f1d185b21da3035cb48332510dddce) , and is a result of us not freeing the resources corresponding to the bitmap's base layer, if one was present. Rectify that leak by calling the newly-introduced free_bitmap_index() function on the base layer to ensure that its resources are also freed. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- pack-bitmap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/pack-bitmap.c b/pack-bitmap.c index 2b4c3972e5..fe0df2b6c3 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -3045,6 +3045,7 @@ void free_bitmap_index(struct bitmap_index *b) close_midx_revindex(b->midx); } free_pseudo_merge_map(&b->pseudo_merges); + free_bitmap_index(b->base); free(b); } -- 2.47.0.rc1.154.ge6e38f19f35.dirty --- >8 --- That should apply on top of 'seen' easily, and will at least fix the leak failures you pointed out here. When the integration branches are rewound after 2.47.0 is tagged, I'll send a new version of this topic based on Patrick's work here. Thanks, Taylor
On Fri, Oct 04, 2024 at 02:17:30PM -0700, Junio C Hamano wrote: > > Derrick Stolee (6): > > pack-objects: add --full-name-hash option > > repack: test --full-name-hash option > > pack-objects: add GIT_TEST_FULL_NAME_HASH > > git-repack: update usage to match docs > > p5313: add size comparison test > > test-tool: add helper for name-hash values > > Recent CI jobs on 'seen' has been failing the leak-check jobs, e.g. > > https://github.com/git/git/actions/runs/11184876759/job/31096601111#step:4:1886 > > shows that t5310 and t5334 are having problems. > > I randomly picked this topic and ejected it out of 'seen', and the > result is fairing a bit better. t5310 that failed 228/228 subtests > in the above run is now passing. I didn't run this topic alone > under the leak checker, so it is entirely possible that there are > some unfortunate interactions with other topics in flight. Maybe squash this into the final patch of that series? diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c index 15fb8f853c..e4ecd159b7 100644 --- a/t/helper/test-name-hash.c +++ b/t/helper/test-name-hash.c @@ -19,5 +19,6 @@ int cmd__name_hash(int argc UNUSED, const char **argv UNUSED) printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf); } + strbuf_release(&line); return 0; } That seems to be enough to clear t5310 on "seen". It was not noticed in the original topic because t5310 was not yet marked as leak-free in its base. That happened in fa016423c7 (revision: fix leaking parents when simplifying commits, 2024-09-26) -Peff
On 9/18/24 4:46 PM, Derrick Stolee via GitGitGadget wrote: > I've been focused recently on understanding and mitigating the growth of a > few internal repositories. Some of these are growing much larger than > expected for the number of contributors, and there are multiple aspects to > why this growth is so large. > > This is part of the RFC I submitted [1] [2] involving the path-walk API, > though this doesn't use the path-walk API directly. In full repack cases, it > seems that the --full-name-hash option gets nearly as good compression as > the --path-walk option introduced in that series. I continue to work on that > feature as well, so we can review it after this series is complete. Based on the discussion in this thread, I recommend pulling this branch out of 'seen' and instead focus on the --path-walk option which is now available at [3]. If we choose to revisit the --full-name-hash option, then that can be done on top of that feature which is probably a higher priority. [3] https://lore.kernel.org/git/pull.1813.git.1728396723.gitgitgadget@gmail.com Thanks, -Stolee