Message ID | 20241230042830.GC113400@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Accepted |
Commit | 7d0037b59ae0d22a2718c28d8e70e3ef3f3f991e |
Headers | show |
Series | fixing thread races in linux-leaks CI | expand |
On Sun, Dec 29, 2024 at 11:28:30PM -0500, Jeff King wrote: > One thread primitive we don't yet support is a barrier: it waits for all > threads to reach a synchronization point before letting any of them > continue. This would be useful for avoiding the LSan race we see in > index-pack (and other places) by having all threads complete their > initialization before any of them start to do real work. > > POSIX introduced a pthread_barrier_t in 2004, which does what we want. > But if we want to rely on it: > > 1. Our Windows pthread emulation would need a new set of wrapper > functions. There's a Synchronization Barrier primitive there, which > was introduced in Windows 8 (which is old enough for us to depend > on). > > 2. macOS (and possibly other systems) has pthreads but not > pthread_barrier_t. So there we'd have to implement our own barrier > based on the mutex and cond primitives. > > Those are do-able, but since we only care about avoiding races in our > LSan builds, there's an easier way: make it a noop on systems without a > native pthread barrier. I think this is fine for a first iteration. If we ever feel the need for having barriers anywhere else for actual correctness we can iterate on the solution and provide wrappers for those platforms. > This patch introduces a "maybe_thread_barrier" API. The clunky name > (rather than just using pthread_barrier directly) should hopefully clue > people in that on some systems it will do nothing. It's wired to a > Makefile knob which has to be triggered manually, and we enable it for > the linux-leaks CI jobs (since we know we'll have it there). > > There are some other possible options: > > - we could turn it on all the time for Linux systems based on uname. Tiniest of nits: these are all full sentences, which should start with an upper-case letter. > But we really only care about it for LSan builds, and there is no > need to add extra code to regular builds. > > - we could turn it on only for LSan builds. But that would break > builds on non-Linux platforms (like macOS) that otherwise should > support sanitizers. Mh. I'm not a huge fan of having extra code for just a subset of our builds and think that having the code generally enabled on platforms that support it is preferable to reduce the number of build variants. But... > - we could trigger only on the combination of Linux and LSan together. > This isn't too hard to do, but the uname check isn't completely > accurate. It is really about what your libc supports, and non-glibc > systems might not have it (though at least musl seems to). > > So we'd risk breaking builds on those systems, which would need to > add a new knob. Though the upside would be that running local "make > SANITIZE=leak test" would be protected automatically. ... this is a fair remark. So I dunno. > And of course none of this protects LSan runs from races on systems > without pthread barriers. It's probably OK in practice to protect only > our CI jobs, though. The race is rare-ish and most leak-checking happens > through CI. > > Signed-off-by: Jeff King <peff@peff.net> > --- > Makefile | 7 +++++++ > ci/lib.sh | 1 + > thread-utils.h | 17 +++++++++++++++++ > 3 files changed, 25 insertions(+) > > diff --git a/Makefile b/Makefile > index 97e8385b66..2c6dad8a75 100644 > --- a/Makefile > +++ b/Makefile > @@ -141,6 +141,10 @@ include shared.mak > # > # Define NO_PTHREADS if you do not have or do not want to use Pthreads. > # > +# Define THREAD_BARRIER_PTHREAD if your system has pthread_barrier_t. Barrier > +# support is optional and is only helpful when building with SANITIZE=leak, as > +# it is used to eliminate some races in the leak-checker. > +# > # Define NO_PREAD if you have a problem with pread() system call (e.g. > # cygwin1.dll before v1.5.22). > # > @@ -2079,6 +2083,9 @@ ifdef NO_PTHREADS > else > BASIC_CFLAGS += $(PTHREAD_CFLAGS) > EXTLIBS += $(PTHREAD_LIBS) > + ifdef THREAD_BARRIER_PTHREAD > + BASIC_CFLAGS += -DTHREAD_BARRIER_PTHREAD > + endif > endif > > ifdef HAVE_PATHS_H Okay. The Meson equivalent would be: diff --git a/meson.build b/meson.build index a0654a3f24..db4c1e6929 100644 --- a/meson.build +++ b/meson.build @@ -788,6 +788,10 @@ threads = dependency('threads', required: false) if threads.found() libgit_dependencies += threads build_options_config.set('NO_PTHREADS', '') + + if get_option('b_sanitize').contains('leak') and compiler.has_function('pthread_barrier_init', dependencies: threads) + libgit_c_args += '-DTHREAD_BARRIER_PTHREAD' + endif else libgit_c_args += '-DNO_PTHREADS' build_options_config.set('NO_PTHREADS', '1') > diff --git a/thread-utils.h b/thread-utils.h > index 4961487ed9..3df5be9916 100644 > --- a/thread-utils.h > +++ b/thread-utils.h > @@ -53,5 +53,22 @@ int dummy_pthread_init(void *); > int online_cpus(void); > int init_recursive_mutex(pthread_mutex_t*); > > +#ifdef THREAD_BARRIER_PTHREAD > +#define maybe_thread_barrier_t pthread_barrier_t > +#define maybe_thread_barrier_init pthread_barrier_init > +#define maybe_thread_barrier_wait pthread_barrier_wait > +#define maybe_thread_barrier_destroy pthread_barrier_destroy > +#else > +#define maybe_thread_barrier_t int Out of curiosity: why did you pick a define here and not a typedef? > +static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED, > + void *attr UNUSED, > + unsigned nr UNUSED) > +{ > + errno = ENOSYS; > + return -1; > +} > +#define maybe_thread_barrier_wait(barrier) > +#define maybe_thread_barrier_destroy(barrier) So the way these wrappers are implemented it is not possible to check for errors of `pthread_barrier_init()` et al. When the implementation exists we do have return codes, but if it's stubbed out we don't. I think we should align these two implementations so that it does become possible to check for errors, or otherwise we wouldn't be using the pthread APIs correctly. It does raise the question though whether we should really return `-1` in the stubbed-out variant or whether we should instead pretend as if things were alright. An alternative would be to die in case the pthread-functions return an error. Patrick
On Mon, Dec 30, 2024 at 08:03:38AM +0100, Patrick Steinhardt wrote: > > - we could turn it on only for LSan builds. But that would break > > builds on non-Linux platforms (like macOS) that otherwise should > > support sanitizers. > > Mh. I'm not a huge fan of having extra code for just a subset of our > builds and think that having the code generally enabled on platforms > that support it is preferable to reduce the number of build variants. > But... > > > - we could trigger only on the combination of Linux and LSan together. > > This isn't too hard to do, but the uname check isn't completely > > accurate. It is really about what your libc supports, and non-glibc > > systems might not have it (though at least musl seems to). > > > > So we'd risk breaking builds on those systems, which would need to > > add a new knob. Though the upside would be that running local "make > > SANITIZE=leak test" would be protected automatically. > > ... this is a fair remark. So I dunno. Yeah. I do not like having the behavior differ only for LSan, or only on Linux. But I also do not like having tricky threading code that we do not otherwise need in all of the other builds. So it is really about picking the least-bad option, and they all seem similarly bad to me. So I picked the one that involved writing the least amount of code. ;) > Okay. The Meson equivalent would be: > [...] Yeah, I figured it would need something similar (but for our CI it does not yet matter). Do you want to prepare that as a patch on top? (Though also see the message I'm about to send that we might be able to avoid this series entirely!). > > +#ifdef THREAD_BARRIER_PTHREAD > > +#define maybe_thread_barrier_t pthread_barrier_t > > +#define maybe_thread_barrier_init pthread_barrier_init > > +#define maybe_thread_barrier_wait pthread_barrier_wait > > +#define maybe_thread_barrier_destroy pthread_barrier_destroy > > +#else > > +#define maybe_thread_barrier_t int > > Out of curiosity: why did you pick a define here and not a typedef? That's what we do for all of the NO_PTHREADS fallbacks, so I was just following that style. I suspect it matters more there, because you would not want to typedef pthread_t on a system that is building with NO_PTHREADS but actually does define that type (whereas the "maybe" variants are our own invention, so we can be confident those names won't conflict). I don't think it matters too much either way. > > +static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED, > > + void *attr UNUSED, > > + unsigned nr UNUSED) > > +{ > > + errno = ENOSYS; > > + return -1; > > +} > > +#define maybe_thread_barrier_wait(barrier) > > +#define maybe_thread_barrier_destroy(barrier) > > So the way these wrappers are implemented it is not possible to check > for errors of `pthread_barrier_init()` et al. When the implementation > exists we do have return codes, but if it's stubbed out we don't. Yeah, again I was following the NO_PTHREADS fallbacks defined above. Perhaps those are different in that we wouldn't generally expect to ever call them if we don't have threads at all (whereas these "maybe" ones are meant to be quiet noops). > I think we should align these two implementations so that it does become > possible to check for errors, or otherwise we wouldn't be using the > pthread APIs correctly. It does raise the question though whether we > should really return `-1` in the stubbed-out variant or whether we > should instead pretend as if things were alright. You'd have to return a fake success if you expect to check errors, I'd think. Unless you want to introduce a bunch of conditional code to say "well, we tried to initialize a barrier but it didn't work, so let's not actually wait". For all of the existing pthread calls, we simply assume init stuff like pthread_mutex_init() doesn't fail. Possibly we should change that globally, but this is just following the same pattern. > An alternative would be to die in case the pthread-functions return an > error. That's certainly akin to xmalloc(). But probably we don't want to go that direction, simply because it works against libification in the long run. -Peff
On Wed, Jan 01, 2025 at 01:28:26PM -0500, Jeff King wrote: > On Mon, Dec 30, 2024 at 08:03:38AM +0100, Patrick Steinhardt wrote: > > Okay. The Meson equivalent would be: > > [...] > > Yeah, I figured it would need something similar (but for our CI it does > not yet matter). Do you want to prepare that as a patch on top? (Though > also see the message I'm about to send that we might be able to avoid > this series entirely!). Looks like your version was reverted in favor of the new version you have sent. So I'll refrain from doing that :) Patrick
diff --git a/Makefile b/Makefile index 97e8385b66..2c6dad8a75 100644 --- a/Makefile +++ b/Makefile @@ -141,6 +141,10 @@ include shared.mak # # Define NO_PTHREADS if you do not have or do not want to use Pthreads. # +# Define THREAD_BARRIER_PTHREAD if your system has pthread_barrier_t. Barrier +# support is optional and is only helpful when building with SANITIZE=leak, as +# it is used to eliminate some races in the leak-checker. +# # Define NO_PREAD if you have a problem with pread() system call (e.g. # cygwin1.dll before v1.5.22). # @@ -2079,6 +2083,9 @@ ifdef NO_PTHREADS else BASIC_CFLAGS += $(PTHREAD_CFLAGS) EXTLIBS += $(PTHREAD_LIBS) + ifdef THREAD_BARRIER_PTHREAD + BASIC_CFLAGS += -DTHREAD_BARRIER_PTHREAD + endif endif ifdef HAVE_PATHS_H diff --git a/ci/lib.sh b/ci/lib.sh index 8885ee3c3f..6a1267fbcb 100755 --- a/ci/lib.sh +++ b/ci/lib.sh @@ -385,6 +385,7 @@ linux-musl) ;; linux-leaks|linux-reftable-leaks) export SANITIZE=leak + export THREAD_BARRIER_PTHREAD=1 ;; linux-asan-ubsan) export SANITIZE=address,undefined diff --git a/thread-utils.h b/thread-utils.h index 4961487ed9..3df5be9916 100644 --- a/thread-utils.h +++ b/thread-utils.h @@ -53,5 +53,22 @@ int dummy_pthread_init(void *); int online_cpus(void); int init_recursive_mutex(pthread_mutex_t*); +#ifdef THREAD_BARRIER_PTHREAD +#define maybe_thread_barrier_t pthread_barrier_t +#define maybe_thread_barrier_init pthread_barrier_init +#define maybe_thread_barrier_wait pthread_barrier_wait +#define maybe_thread_barrier_destroy pthread_barrier_destroy +#else +#define maybe_thread_barrier_t int +static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED, + void *attr UNUSED, + unsigned nr UNUSED) +{ + errno = ENOSYS; + return -1; +} +#define maybe_thread_barrier_wait(barrier) +#define maybe_thread_barrier_destroy(barrier) +#endif #endif /* THREAD_COMPAT_H */
One thread primitive we don't yet support is a barrier: it waits for all threads to reach a synchronization point before letting any of them continue. This would be useful for avoiding the LSan race we see in index-pack (and other places) by having all threads complete their initialization before any of them start to do real work. POSIX introduced a pthread_barrier_t in 2004, which does what we want. But if we want to rely on it: 1. Our Windows pthread emulation would need a new set of wrapper functions. There's a Synchronization Barrier primitive there, which was introduced in Windows 8 (which is old enough for us to depend on). 2. macOS (and possibly other systems) has pthreads but not pthread_barrier_t. So there we'd have to implement our own barrier based on the mutex and cond primitives. Those are do-able, but since we only care about avoiding races in our LSan builds, there's an easier way: make it a noop on systems without a native pthread barrier. This patch introduces a "maybe_thread_barrier" API. The clunky name (rather than just using pthread_barrier directly) should hopefully clue people in that on some systems it will do nothing. It's wired to a Makefile knob which has to be triggered manually, and we enable it for the linux-leaks CI jobs (since we know we'll have it there). There are some other possible options: - we could turn it on all the time for Linux systems based on uname. But we really only care about it for LSan builds, and there is no need to add extra code to regular builds. - we could turn it on only for LSan builds. But that would break builds on non-Linux platforms (like macOS) that otherwise should support sanitizers. - we could trigger only on the combination of Linux and LSan together. This isn't too hard to do, but the uname check isn't completely accurate. It is really about what your libc supports, and non-glibc systems might not have it (though at least musl seems to). So we'd risk breaking builds on those systems, which would need to add a new knob. Though the upside would be that running local "make SANITIZE=leak test" would be protected automatically. And of course none of this protects LSan runs from races on systems without pthread barriers. It's probably OK in practice to protect only our CI jobs, though. The race is rare-ish and most leak-checking happens through CI. Signed-off-by: Jeff King <peff@peff.net> --- Makefile | 7 +++++++ ci/lib.sh | 1 + thread-utils.h | 17 +++++++++++++++++ 3 files changed, 25 insertions(+)