diff mbox series

[3/5] thread-utils: introduce optional barrier type

Message ID 20241230042830.GC113400@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit 7d0037b59ae0d22a2718c28d8e70e3ef3f3f991e
Headers show
Series fixing thread races in linux-leaks CI | expand

Commit Message

Jeff King Dec. 30, 2024, 4:28 a.m. UTC
One thread primitive we don't yet support is a barrier: it waits for all
threads to reach a synchronization point before letting any of them
continue. This would be useful for avoiding the LSan race we see in
index-pack (and other places) by having all threads complete their
initialization before any of them start to do real work.

POSIX introduced a pthread_barrier_t in 2004, which does what we want.
But if we want to rely on it:

  1. Our Windows pthread emulation would need a new set of wrapper
     functions. There's a Synchronization Barrier primitive there, which
     was introduced in Windows 8 (which is old enough for us to depend
     on).

  2. macOS (and possibly other systems) has pthreads but not
     pthread_barrier_t. So there we'd have to implement our own barrier
     based on the mutex and cond primitives.

Those are do-able, but since we only care about avoiding races in our
LSan builds, there's an easier way: make it a noop on systems without a
native pthread barrier.

This patch introduces a "maybe_thread_barrier" API. The clunky name
(rather than just using pthread_barrier directly) should hopefully clue
people in that on some systems it will do nothing. It's wired to a
Makefile knob which has to be triggered manually, and we enable it for
the linux-leaks CI jobs (since we know we'll have it there).

There are some other possible options:

  - we could turn it on all the time for Linux systems based on uname.
    But we really only care about it for LSan builds, and there is no
    need to add extra code to regular builds.

  - we could turn it on only for LSan builds. But that would break
    builds on non-Linux platforms (like macOS) that otherwise should
    support sanitizers.

  - we could trigger only on the combination of Linux and LSan together.
    This isn't too hard to do, but the uname check isn't completely
    accurate. It is really about what your libc supports, and non-glibc
    systems might not have it (though at least musl seems to).

    So we'd risk breaking builds on those systems, which would need to
    add a new knob. Though the upside would be that running local "make
    SANITIZE=leak test" would be protected automatically.

And of course none of this protects LSan runs from races on systems
without pthread barriers. It's probably OK in practice to protect only
our CI jobs, though. The race is rare-ish and most leak-checking happens
through CI.

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile       |  7 +++++++
 ci/lib.sh      |  1 +
 thread-utils.h | 17 +++++++++++++++++
 3 files changed, 25 insertions(+)

Comments

Patrick Steinhardt Dec. 30, 2024, 7:03 a.m. UTC | #1
On Sun, Dec 29, 2024 at 11:28:30PM -0500, Jeff King wrote:
> One thread primitive we don't yet support is a barrier: it waits for all
> threads to reach a synchronization point before letting any of them
> continue. This would be useful for avoiding the LSan race we see in
> index-pack (and other places) by having all threads complete their
> initialization before any of them start to do real work.
> 
> POSIX introduced a pthread_barrier_t in 2004, which does what we want.
> But if we want to rely on it:
> 
>   1. Our Windows pthread emulation would need a new set of wrapper
>      functions. There's a Synchronization Barrier primitive there, which
>      was introduced in Windows 8 (which is old enough for us to depend
>      on).
> 
>   2. macOS (and possibly other systems) has pthreads but not
>      pthread_barrier_t. So there we'd have to implement our own barrier
>      based on the mutex and cond primitives.
> 
> Those are do-able, but since we only care about avoiding races in our
> LSan builds, there's an easier way: make it a noop on systems without a
> native pthread barrier.

I think this is fine for a first iteration. If we ever feel the need for
having barriers anywhere else for actual correctness we can iterate on
the solution and provide wrappers for those platforms.

> This patch introduces a "maybe_thread_barrier" API. The clunky name
> (rather than just using pthread_barrier directly) should hopefully clue
> people in that on some systems it will do nothing. It's wired to a
> Makefile knob which has to be triggered manually, and we enable it for
> the linux-leaks CI jobs (since we know we'll have it there).
> 
> There are some other possible options:
> 
>   - we could turn it on all the time for Linux systems based on uname.

Tiniest of nits: these are all full sentences, which should start with
an upper-case letter.

>     But we really only care about it for LSan builds, and there is no
>     need to add extra code to regular builds.
> 
>   - we could turn it on only for LSan builds. But that would break
>     builds on non-Linux platforms (like macOS) that otherwise should
>     support sanitizers.

Mh. I'm not a huge fan of having extra code for just a subset of our
builds and think that having the code generally enabled on platforms
that support it is preferable to reduce the number of build variants.
But...

>   - we could trigger only on the combination of Linux and LSan together.
>     This isn't too hard to do, but the uname check isn't completely
>     accurate. It is really about what your libc supports, and non-glibc
>     systems might not have it (though at least musl seems to).
> 
>     So we'd risk breaking builds on those systems, which would need to
>     add a new knob. Though the upside would be that running local "make
>     SANITIZE=leak test" would be protected automatically.

... this is a fair remark. So I dunno.

> And of course none of this protects LSan runs from races on systems
> without pthread barriers. It's probably OK in practice to protect only
> our CI jobs, though. The race is rare-ish and most leak-checking happens
> through CI.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  Makefile       |  7 +++++++
>  ci/lib.sh      |  1 +
>  thread-utils.h | 17 +++++++++++++++++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/Makefile b/Makefile
> index 97e8385b66..2c6dad8a75 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -141,6 +141,10 @@ include shared.mak
>  #
>  # Define NO_PTHREADS if you do not have or do not want to use Pthreads.
>  #
> +# Define THREAD_BARRIER_PTHREAD if your system has pthread_barrier_t. Barrier
> +# support is optional and is only helpful when building with SANITIZE=leak, as
> +# it is used to eliminate some races in the leak-checker.
> +#
>  # Define NO_PREAD if you have a problem with pread() system call (e.g.
>  # cygwin1.dll before v1.5.22).
>  #
> @@ -2079,6 +2083,9 @@ ifdef NO_PTHREADS
>  else
>  	BASIC_CFLAGS += $(PTHREAD_CFLAGS)
>  	EXTLIBS += $(PTHREAD_LIBS)
> +	ifdef THREAD_BARRIER_PTHREAD
> +		BASIC_CFLAGS += -DTHREAD_BARRIER_PTHREAD
> +	endif
>  endif
>  
>  ifdef HAVE_PATHS_H

Okay. The Meson equivalent would be:

diff --git a/meson.build b/meson.build
index a0654a3f24..db4c1e6929 100644
--- a/meson.build
+++ b/meson.build
@@ -788,6 +788,10 @@ threads = dependency('threads', required: false)
 if threads.found()
   libgit_dependencies += threads
   build_options_config.set('NO_PTHREADS', '')
+
+  if get_option('b_sanitize').contains('leak') and compiler.has_function('pthread_barrier_init', dependencies: threads)
+    libgit_c_args += '-DTHREAD_BARRIER_PTHREAD'
+  endif
 else
   libgit_c_args += '-DNO_PTHREADS'
   build_options_config.set('NO_PTHREADS', '1')

> diff --git a/thread-utils.h b/thread-utils.h
> index 4961487ed9..3df5be9916 100644
> --- a/thread-utils.h
> +++ b/thread-utils.h
> @@ -53,5 +53,22 @@ int dummy_pthread_init(void *);
>  int online_cpus(void);
>  int init_recursive_mutex(pthread_mutex_t*);
>  
> +#ifdef THREAD_BARRIER_PTHREAD
> +#define maybe_thread_barrier_t pthread_barrier_t
> +#define maybe_thread_barrier_init pthread_barrier_init
> +#define maybe_thread_barrier_wait pthread_barrier_wait
> +#define maybe_thread_barrier_destroy pthread_barrier_destroy
> +#else
> +#define maybe_thread_barrier_t int

Out of curiosity: why did you pick a define here and not a typedef?

> +static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED,
> +					    void *attr UNUSED,
> +					    unsigned nr UNUSED)
> +{
> +	errno = ENOSYS;
> +	return -1;
> +}
> +#define maybe_thread_barrier_wait(barrier)
> +#define maybe_thread_barrier_destroy(barrier)

So the way these wrappers are implemented it is not possible to check
for errors of `pthread_barrier_init()` et al. When the implementation
exists we do have return codes, but if it's stubbed out we don't.

I think we should align these two implementations so that it does become
possible to check for errors, or otherwise we wouldn't be using the
pthread APIs correctly. It does raise the question though whether we
should really return `-1` in the stubbed-out variant or whether we
should instead pretend as if things were alright.

An alternative would be to die in case the pthread-functions return an
error.

Patrick
Jeff King Jan. 1, 2025, 6:28 p.m. UTC | #2
On Mon, Dec 30, 2024 at 08:03:38AM +0100, Patrick Steinhardt wrote:

> >   - we could turn it on only for LSan builds. But that would break
> >     builds on non-Linux platforms (like macOS) that otherwise should
> >     support sanitizers.
> 
> Mh. I'm not a huge fan of having extra code for just a subset of our
> builds and think that having the code generally enabled on platforms
> that support it is preferable to reduce the number of build variants.
> But...
> 
> >   - we could trigger only on the combination of Linux and LSan together.
> >     This isn't too hard to do, but the uname check isn't completely
> >     accurate. It is really about what your libc supports, and non-glibc
> >     systems might not have it (though at least musl seems to).
> > 
> >     So we'd risk breaking builds on those systems, which would need to
> >     add a new knob. Though the upside would be that running local "make
> >     SANITIZE=leak test" would be protected automatically.
> 
> ... this is a fair remark. So I dunno.

Yeah. I do not like having the behavior differ only for LSan, or only on
Linux. But I also do not like having tricky threading code that we do
not otherwise need in all of the other builds. So it is really about
picking the least-bad option, and they all seem similarly bad to me. So
I picked the one that involved writing the least amount of code. ;)

> Okay. The Meson equivalent would be:
> [...]

Yeah, I figured it would need something similar (but for our CI it does
not yet matter). Do you want to prepare that as a patch on top? (Though
also see the message I'm about to send that we might be able to avoid
this series entirely!).

> > +#ifdef THREAD_BARRIER_PTHREAD
> > +#define maybe_thread_barrier_t pthread_barrier_t
> > +#define maybe_thread_barrier_init pthread_barrier_init
> > +#define maybe_thread_barrier_wait pthread_barrier_wait
> > +#define maybe_thread_barrier_destroy pthread_barrier_destroy
> > +#else
> > +#define maybe_thread_barrier_t int
> 
> Out of curiosity: why did you pick a define here and not a typedef?

That's what we do for all of the NO_PTHREADS fallbacks, so I was just
following that style. I suspect it matters more there, because you would
not want to typedef pthread_t on a system that is building with
NO_PTHREADS but actually does define that type (whereas the "maybe"
variants are our own invention, so we can be confident those names won't
conflict).

I don't think it matters too much either way.

> > +static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED,
> > +					    void *attr UNUSED,
> > +					    unsigned nr UNUSED)
> > +{
> > +	errno = ENOSYS;
> > +	return -1;
> > +}
> > +#define maybe_thread_barrier_wait(barrier)
> > +#define maybe_thread_barrier_destroy(barrier)
> 
> So the way these wrappers are implemented it is not possible to check
> for errors of `pthread_barrier_init()` et al. When the implementation
> exists we do have return codes, but if it's stubbed out we don't.

Yeah, again I was following the NO_PTHREADS fallbacks defined above.
Perhaps those are different in that we wouldn't generally expect to ever
call them if we don't have threads at all (whereas these "maybe" ones
are meant to be quiet noops).

> I think we should align these two implementations so that it does become
> possible to check for errors, or otherwise we wouldn't be using the
> pthread APIs correctly. It does raise the question though whether we
> should really return `-1` in the stubbed-out variant or whether we
> should instead pretend as if things were alright.

You'd have to return a fake success if you expect to check errors, I'd
think. Unless you want to introduce a bunch of conditional code to say
"well, we tried to initialize a barrier but it didn't work, so let's not
actually wait".

For all of the existing pthread calls, we simply assume init stuff like
pthread_mutex_init() doesn't fail. Possibly we should change that
globally, but this is just following the same pattern.

> An alternative would be to die in case the pthread-functions return an
> error.

That's certainly akin to xmalloc(). But probably we don't want to go
that direction, simply because it works against libification in the long
run.

-Peff
Patrick Steinhardt Jan. 3, 2025, 6:45 a.m. UTC | #3
On Wed, Jan 01, 2025 at 01:28:26PM -0500, Jeff King wrote:
> On Mon, Dec 30, 2024 at 08:03:38AM +0100, Patrick Steinhardt wrote:
> > Okay. The Meson equivalent would be:
> > [...]
> 
> Yeah, I figured it would need something similar (but for our CI it does
> not yet matter). Do you want to prepare that as a patch on top? (Though
> also see the message I'm about to send that we might be able to avoid
> this series entirely!).

Looks like your version was reverted in favor of the new version you
have sent. So I'll refrain from doing that :)

Patrick
diff mbox series

Patch

diff --git a/Makefile b/Makefile
index 97e8385b66..2c6dad8a75 100644
--- a/Makefile
+++ b/Makefile
@@ -141,6 +141,10 @@  include shared.mak
 #
 # Define NO_PTHREADS if you do not have or do not want to use Pthreads.
 #
+# Define THREAD_BARRIER_PTHREAD if your system has pthread_barrier_t. Barrier
+# support is optional and is only helpful when building with SANITIZE=leak, as
+# it is used to eliminate some races in the leak-checker.
+#
 # Define NO_PREAD if you have a problem with pread() system call (e.g.
 # cygwin1.dll before v1.5.22).
 #
@@ -2079,6 +2083,9 @@  ifdef NO_PTHREADS
 else
 	BASIC_CFLAGS += $(PTHREAD_CFLAGS)
 	EXTLIBS += $(PTHREAD_LIBS)
+	ifdef THREAD_BARRIER_PTHREAD
+		BASIC_CFLAGS += -DTHREAD_BARRIER_PTHREAD
+	endif
 endif
 
 ifdef HAVE_PATHS_H
diff --git a/ci/lib.sh b/ci/lib.sh
index 8885ee3c3f..6a1267fbcb 100755
--- a/ci/lib.sh
+++ b/ci/lib.sh
@@ -385,6 +385,7 @@  linux-musl)
 	;;
 linux-leaks|linux-reftable-leaks)
 	export SANITIZE=leak
+	export THREAD_BARRIER_PTHREAD=1
 	;;
 linux-asan-ubsan)
 	export SANITIZE=address,undefined
diff --git a/thread-utils.h b/thread-utils.h
index 4961487ed9..3df5be9916 100644
--- a/thread-utils.h
+++ b/thread-utils.h
@@ -53,5 +53,22 @@  int dummy_pthread_init(void *);
 int online_cpus(void);
 int init_recursive_mutex(pthread_mutex_t*);
 
+#ifdef THREAD_BARRIER_PTHREAD
+#define maybe_thread_barrier_t pthread_barrier_t
+#define maybe_thread_barrier_init pthread_barrier_init
+#define maybe_thread_barrier_wait pthread_barrier_wait
+#define maybe_thread_barrier_destroy pthread_barrier_destroy
+#else
+#define maybe_thread_barrier_t int
+static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED,
+					    void *attr UNUSED,
+					    unsigned nr UNUSED)
+{
+	errno = ENOSYS;
+	return -1;
+}
+#define maybe_thread_barrier_wait(barrier)
+#define maybe_thread_barrier_destroy(barrier)
+#endif
 
 #endif /* THREAD_COMPAT_H */