Message ID | cover.1702592603.git.me@ttaylorr.com (mailing list archive)
---|---
Series | pack-objects: multi-pack verbatim reuse
I haven't looked into the details yet, but it seems that
t5309-pack-delta-cycles.sh fails under

$ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test
On Thu, Dec 14, 2023 at 04:06:49PM -0800, Junio C Hamano wrote:
> I haven't looked into the details yet, but it seems that
> t5309-pack-delta-cycles.sh fails under
>
> $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test

Hrm. I tried to reproduce this, but I'm not seeing it. I have:

    $ make SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true test
    [ ... ]
    All tests successful.
    Files=1001, Tests=14558, 48 wallclock secs ( 8.34 usr 2.10 sys + 391.60 cusr 319.67 csys = 721.71 CPU)
    Result: PASS

With this series applied on top of 1a87c842ec (Start the 2.44 cycle,
2023-12-09). The tree I get at the end is
d148e16f5cfba405a9823cb68540a8c83004f98f.

Did we apply onto different bases?

Thanks,
Taylor
Junio C Hamano <gitster@pobox.com> writes:

> I haven't looked into the details yet, but it seems that
> t5309-pack-delta-cycles.sh fails under
>
> $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test

Hmph, this seems to be elusive. I tried it again and then it did not
fail this time.
On Thu, Dec 14, 2023 at 04:40:40PM -0800, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > I haven't looked into the details yet, but it seems that
> > t5309-pack-delta-cycles.sh fails under
> >
> > $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test
>
> Hmph, this seems to be elusive. I tried it again and then it did
> not fail this time.

Indeed, but I was able to reproduce the failure both on my branch and on
'master' under --stress, yielding the following failure in t5309.6:

+ git index-pack --fix-thin --stdin
fatal: REF_DELTA at offset 46 already resolved (duplicate base 01d7713666f4de822776c7622c10f1b07de280dc?)

=================================================================
==3904583==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7fa790d01986 in __interceptor_realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
    #1 0x7fa790add769 in __pthread_getattr_np nptl/pthread_getattr_np.c:180
    #2 0x7fa790d117c5 in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150
    #3 0x7fa790d11957 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:598
    #4 0x7fa790d03fe8 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:51
    #5 0x7fa790d013fd in __lsan_thread_start_func ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:440
    #6 0x7fa790adc3eb in start_thread nptl/pthread_create.c:444
    #7 0x7fa790b5ca5b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

SUMMARY: LeakSanitizer: 32 byte(s) leaked in 1 allocation(s).
Aborted

The duplicate base thing is a red herring, and is an expected result of
the test. But the leak is definitely not, and I'm not sure what's going
on here, since the frames listed above are in the LSan runtime, not Git.

I'll try to dig into this a bit more, but I'm not quite sure what's
going on yet.

Thanks,
Taylor
On Thu, Dec 14, 2023 at 08:25:35PM -0500, Taylor Blau wrote:
> On Thu, Dec 14, 2023 at 04:40:40PM -0800, Junio C Hamano wrote:
> > Junio C Hamano <gitster@pobox.com> writes:
> >
> > > I haven't looked into the details yet, but it seems that
> > > t5309-pack-delta-cycles.sh fails under
> > >
> > > $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test
> >
> > Hmph, this seems to be elusive. I tried it again and then it did
> > not fail this time.
>
> Indeed, but I was able to reproduce the failure both on my branch and on
> 'master' under --stress, yielding the following failure in t5309.6:

OK, so it's nothing new, and we can ignore it for your series (I haven't
seen it in the wild yet, but it is something we may need to deal with in
the long run if it keeps popping up).

> + git index-pack --fix-thin --stdin
> fatal: REF_DELTA at offset 46 already resolved (duplicate base 01d7713666f4de822776c7622c10f1b07de280dc?)
>
> =================================================================
> ==3904583==ERROR: LeakSanitizer: detected memory leaks
>
> Direct leak of 32 byte(s) in 1 object(s) allocated from:
>     #0 0x7fa790d01986 in __interceptor_realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
>     #1 0x7fa790add769 in __pthread_getattr_np nptl/pthread_getattr_np.c:180
>     #2 0x7fa790d117c5 in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150
>     #3 0x7fa790d11957 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:598
>     #4 0x7fa790d03fe8 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:51
>     #5 0x7fa790d013fd in __lsan_thread_start_func ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:440
>     #6 0x7fa790adc3eb in start_thread nptl/pthread_create.c:444
>     #7 0x7fa790b5ca5b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>
> SUMMARY: LeakSanitizer: 32 byte(s) leaked in 1 allocation(s).
> Aborted
>
> The duplicate base thing is a red herring, and is an expected result of
> the test. But the leak is definitely not, and I'm not sure what's going
> on here, since the frames listed above are in the LSan runtime, not Git.

I suspect this is a race in LSan caused by a thread calling exit() while
other threads are spawning. Here's my theory.

When a thread is spawned, LSan needs to know where its stack is (so it
can look for pointers to reachable memory). It calls
pthread_getattr_np(), which gets an attributes object that must be
cleaned up with pthread_attr_destroy(). Presumably it does this shortly
after. But there's a race window where that attr object is allocated and
we haven't yet set up the new thread's info. If another thread calls
exit() then, LSan will run but its book-keeping will be in an
inconsistent state.

One way to work around that would be to make the creation of the threads
atomic. That is, create each in a suspended state, and only let them run
once they are all created. There's no option in pthreads to do this, but
we can simulate it by having them block on a mutex before starting. And
indeed, we already have such a lock: the work_lock() that they all use
to get data to process.
After applying this patch:

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index dda94a9f46..0e94819216 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1257,13 +1257,15 @@ static void resolve_deltas(void)
 		base_cache_limit = delta_base_cache_limit * nr_threads;
 	if (nr_threads > 1 || getenv("GIT_FORCE_THREADS")) {
 		init_thread();
+		work_lock();
 		for (i = 0; i < nr_threads; i++) {
 			int ret = pthread_create(&thread_data[i].thread, NULL,
 						 threaded_second_pass,
 						 thread_data + i);
 			if (ret)
 				die(_("unable to create thread: %s"), strerror(ret));
 		}
+		work_unlock();
 		for (i = 0; i < nr_threads; i++)
 			pthread_join(thread_data[i].thread, NULL);
 		cleanup_thread();

I ran t5309 with "--stress --run=6" for about 5 minutes with no
failures (whereas without the patch, I usually see a failure in 10
seconds or so).

So it's a pretty easy fix, though I don't love it in general. Every
place that spawns multiple threads that can die() would need the same
treatment. And this isn't a "real" leak in any reasonable sense; it only
happens because we're exiting the program directly, at which point all
of the memory is returned to the OS anyway. So I hate changing
production code to satisfy a leak-checking false positive.

OTOH, dealing with false positives is annoying for humans, and the
run-time cost should be negligible. We can work around this one, and
avoid making the same change in other spots unless somebody sees a racy
failure in practice.

-Peff
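The patch above reuses index-pack's existing work mutex, so the workers
need no extra code on their side: threaded_second_pass() already takes
that lock before fetching work. As a rough, self-contained illustration
of the same gating pattern — the names (gate, worker, NR_THREADS) are
hypothetical and this is not the actual index-pack code — the following
sketch holds a mutex across the pthread_create() loop, so no worker can
get past the gate (and therefore cannot reach anything that exits the
process) until every sibling has been created:

#include <pthread.h>
#include <stdio.h>

#define NR_THREADS 4

/* Stands in for index-pack's work mutex (hypothetical, simplified). */
static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
	int id = *(int *)arg;

	/*
	 * Block until the spawning thread releases the gate; by then,
	 * the pthread_create() loop below has finished for every
	 * sibling thread.
	 */
	pthread_mutex_lock(&gate);
	pthread_mutex_unlock(&gate);

	printf("worker %d running\n", id);
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	int ids[NR_THREADS];
	int i;

	/* Hold the gate so no worker can outrun thread creation. */
	pthread_mutex_lock(&gate);
	for (i = 0; i < NR_THREADS; i++) {
		ids[i] = i;
		if (pthread_create(&threads[i], NULL, worker, &ids[i]))
			return 1;
	}
	/* All threads now exist; let them run (and possibly exit). */
	pthread_mutex_unlock(&gate);

	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

Built with cc -pthread, every "worker N running" line prints only after
the spawning loop has completed, which is the property the patch relies
on.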
On Thu, Dec 21, 2023 at 05:51:24AM -0500, Jeff King wrote:
> I suspect this is a race in LSan caused by a thread calling exit() while
> other threads are spawning. Here's my theory.
>
> When a thread is spawned, LSan needs to know where its stack is (so it
> can look for pointers to reachable memory). It calls
> pthread_getattr_np(), which gets an attributes object that must be
> cleaned up with pthread_attr_destroy(). Presumably it does this shortly
> after. But there's a race window where that attr object is allocated and
> we haven't yet set up the new thread's info. If another thread calls
> exit() then, LSan will run but its book-keeping will be in an
> inconsistent state.

Thanks for digging. I agree with your theory, and am annoyed with how
clever it is ;-).

> So it's a pretty easy fix, though I don't love it in general. Every
> place that spawns multiple threads that can die() would need the same
> treatment. And this isn't a "real" leak in any reasonable sense; it only
> happens because we're exiting the program directly, at which point all
> of the memory is returned to the OS anyway. So I hate changing
> production code to satisfy a leak-checking false positive.
>
> OTOH, dealing with false positives is annoying for humans, and the
> run-time cost should be negligible. We can work around this one, and
> avoid making the same change in other spots unless somebody sees a racy
> failure in practice.

Yeah... I share your thoughts here as well. It's kind of gross that we
have to touch production code purely to appease the leak checker, but I
think that the trade-off is worth it since:

  - the false positives are annoying to diagnose (as you said, and as
    evidenced by the time that you, Junio, and myself have sunk into
    discussing this ;-)).

  - the run-time cost is negligible.

So I think that this is a good change to make, and I'm happy to see it
go through. I don't think we should necessarily try too hard to find all
spots that might benefit from a similar change, and instead just apply
the same treatment if/when we notice false positives in CI.

Thanks,
Taylor
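To make the allocate-and-release pairing in the quoted theory concrete:
pthread_getattr_np() — the glibc extension visible in frame #1 of the
report above — allocates storage inside the attr object, and only the
matching pthread_attr_destroy() frees it. The following is a minimal
standalone sketch of that API (an illustration assuming glibc, not
LSan's actual runtime code) showing how a thread's stack bounds are
discovered, and why an exit() landing between the two calls leaves a
live allocation for the checker to report against itself:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static void *thread_main(void *arg)
{
	pthread_attr_t attr;
	void *stack_addr;
	size_t stack_size;

	/* Allocates memory inside 'attr' (cf. __pthread_getattr_np). */
	if (pthread_getattr_np(pthread_self(), &attr))
		return NULL;

	/* Read back the stack bounds, as LSan does for leak scanning. */
	pthread_attr_getstack(&attr, &stack_addr, &stack_size);
	printf("stack at %p, %zu bytes\n", stack_addr, stack_size);

	/*
	 * Releases the allocation made by pthread_getattr_np(); if the
	 * process exits before this point, that memory is still live
	 * and shows up as a "leak".
	 */
	pthread_attr_destroy(&attr);
	return NULL;
}

int main(void)
{
	pthread_t thread;

	if (pthread_create(&thread, NULL, thread_main, NULL))
		return 1;
	pthread_join(thread, NULL);
	return 0;
}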