
[1/2] lib/find: Make functions safe on changing bitmaps

Message ID 20231011150252.32737-1-jack@suse.cz (mailing list archive)
State New, archived
Series lib/find: Fix KCSAN warnings in find_*_bit() functions

Commit Message

Jan Kara Oct. 11, 2023, 3:02 p.m. UTC
From: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>

Some users of the lib/find functions can operate on bitmaps that change
while we are looking at the bitmap. For example, the xarray code looks at
tag bitmaps only with RCU protection. The xarray users are prepared to
handle a situation where tag modification races with reading of a tag
bitmap (and thus we get back stale information), but the find_*_bit()
functions should return a result based on the bitmap contents at *some*
point in time.
As KCSAN properly warns, code like:

	val = *addr;
	if (val)
		__ffs(val)

is prone to the compiler refetching 'val' from addr and thus passing 0 to
__ffs(), which has undefined results.

Fix the problem by using READ_ONCE() in all the appropriate places so
that the compiler cannot accidentally refetch the contents of addr
between the test and __ffs(). This makes the undefined behavior
impossible. The generated code is somewhat larger because gcc tracks
both the index and the target fetch address in separate registers
(according to the GCC folks, the volatile cast confuses their optimizer
a bit; they are looking into a fix). The difference in performance is
minimal, though. A targeted microbenchmark (in userspace) of the
bit-searching loop shows about a 2% regression on some x86
microarchitectures, so for normal kernel workloads this should be in
the noise and not worth introducing a special set of bitmap-searching
functions.
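
For illustration, a minimal userspace sketch of the difference (not kernel
code: READ_ONCE() is approximated here by a volatile cast and
__builtin_ctzl() stands in for __ffs()):

	#include <stdio.h>

	#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))
	#define NBITS (8 * sizeof(unsigned long))

	/* Racy: the compiler may reload *addr between the test and the ctz. */
	static unsigned long first_bit_racy(const unsigned long *addr)
	{
		unsigned long val = *addr;

		return val ? __builtin_ctzl(val) : NBITS;
	}

	/* Fixed: the volatile access forces a single load, so 'val' cannot change. */
	static unsigned long first_bit_once(const unsigned long *addr)
	{
		unsigned long val = READ_ONCE(*addr);

		return val ? __builtin_ctzl(val) : NBITS;
	}

	int main(void)
	{
		unsigned long word = 0x10;

		/* Both print 4 here; only the second is safe if 'word' can be cleared concurrently. */
		printf("%lu %lu\n", first_bit_racy(&word), first_bit_once(&word));
		return 0;
	}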

[JK: Wrote commit message]

CC: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/all/20230918044739.29782-1-mirsad.todorovac@alu.unizg.hr/
Signed-off-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/find.h | 40 ++++++++++++++++++++++++----------------
 lib/find_bit.c       | 39 +++++++++++++++++++++++----------------
 2 files changed, 47 insertions(+), 32 deletions(-)

Comments

Yury Norov Oct. 11, 2023, 6:26 p.m. UTC | #1
Long story short: KCSAN found some potential issues related to how
people use the bitmap API. And instead of working through those issues,
the following code shuts down KCSAN by applying READ_ONCE() here
and there.

On Wed, Oct 11, 2023 at 05:02:31PM +0200, Jan Kara wrote:
> From: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
> 
> Some users of lib/find functions can operate on bitmaps that change
> while we are looking at the bitmap. For example xarray code looks at tag
> bitmaps only with RCU protection. The xarray users are prepared to
> handle a situation were tag modification races with reading of a tag
> bitmap (and thus we get back a stale information) but the find_*bit()
> functions should return based on bitmap contents at *some* point in time.
> As UBSAN properly warns, the code like:
> 
> 	val = *addr;
> 	if (val)
> 		__ffs(val)
> 
> prone to refetching 'val' contents from addr by the compiler and thus
> passing 0 to __ffs() which has undefined results.
> 
> Fix the problem by using READ_ONCE() in all the appropriate places so
> that the compiler cannot accidentally refetch the contents of addr
> between the test and __ffs(). This makes the undefined behavior
> impossible. The generated code is somewhat larger due to gcc tracking
> both the index and target fetch address in separate registers (according
> to GCC folks the volatile cast confuses their optimizer a bit, they are
> looking into a fix). The difference in performance is minimal though.
> Targetted microbenchmark (in userspace) of the bit searching loop shows
> about 2% regression on some x86 microarchitectures so for normal kernel
> workloads this should be in the noise and not worth introducing special
> set of bitmap searching functions.
> 
> [JK: Wrote commit message]

READ_ONCE() fixes nothing because nothing is broken in the find_bit() API.
As I suspected, and as Matthew confirmed in his email, the true reason
for READ_ONCE() here is to disable the KCSAN check:

        READ_ONCE() serves two functions here;
        one is that it tells the compiler not to try anything fancy, and
        the other is that it tells KCSAN to not bother instrumenting this
        load; no load-delay-reload.

https://lkml.kernel.org/linux-mm/ZQkhgVb8nWGxpSPk@casper.infradead.org/

And as a side effect, it of course hurts performance. In the same
email Matthew said he doesn't believe that READ_ONCE() would do that,
so thank you for testing and confirming that it does.

Matthew, in my experience compilers do really well at these low-level
things, and gambling with compilers usually makes things worse. x86_64
is one of the most strongly ordered architectures that I've been working
with, and even here READ_ONCE() has a visible effect on performance. I
expect that it will get even worse on more weakly ordered platforms.

I'm OK with you not believing me; perhaps you can believe Paul
McKenney and his paper on the kernel memory model, which says in the very
first paragraph:

        Applying ACCESS_ONCE() to a large array or structure is unlikely
        to do anything useful, and use of READ_ONCE() and WRITE_ONCE()
        in this situation can result in load-tearing and store-tearing.

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4444.html

Jan, I think that in your last email you confirmed that the xarray
problem that you're trying to solve is about a lack of proper locking:

        Well, for xarray the write side is synchronized with a spinlock but the read
        side is not (only RCU protected).

https://lkml.kernel.org/linux-mm/20230918155403.ylhfdbscgw6yek6p@quack3/

If there's not enough synchronization, why not just add it?

Regardless of performance considerations, my main concern is that this patch
treats the bitmap as an atomic structure, which it intentionally is not.
There are just a few single-bit atomic operations like set_bit() and
clear_bit(). All other functions are non-atomic, including the
find_bit() operations.

There are quite a lot of examples of incorrect use of bitmaps with respect
to atomicity; the most typical looks like:
        for (idx = 0; idx < num; idx++) {
                ...
                set_bit(idx, bitmap);
        }

This is wrong because a series of atomic ops is not atomic itself, and
if you see something like this in your code, it should be converted to
use the non-atomic __set_bit(), protected externally if needed.
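
For illustration, the conversion I mean looks roughly like this (just a
sketch; the lock, bitmap and function names are made up for the example):

	#include <linux/bitmap.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(example_lock);
	static DECLARE_BITMAP(example_bitmap, 128);

	/* Questionable: each set_bit() is atomic, but the loop as a whole is not. */
	static void mark_range_atomic_ops(unsigned int num)
	{
		unsigned int idx;

		for (idx = 0; idx < num; idx++)
			set_bit(idx, example_bitmap);
	}

	/* Preferred: non-atomic __set_bit() under whatever lock serializes the writers. */
	static void mark_range_locked(unsigned int num)
	{
		unsigned int idx;

		spin_lock(&example_lock);
		for (idx = 0; idx < num; idx++)
			__set_bit(idx, example_bitmap);
		spin_unlock(&example_lock);
	}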

Similarly, READ_ONCE() in a for-loop doesn't guarantee any ordering or
atomicity, and only hurts performance. And this is exactly what
this patch does.

Thanks,
Yury
 
> CC: Yury Norov <yury.norov@gmail.com>
> Link: https://lore.kernel.org/all/20230918044739.29782-1-mirsad.todorovac@alu.unizg.hr/
> Signed-off-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/find.h | 40 ++++++++++++++++++++++++----------------
>  lib/find_bit.c       | 39 +++++++++++++++++++++++----------------
>  2 files changed, 47 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/find.h b/include/linux/find.h
> index 5e4f39ef2e72..5ef6466dc7cc 100644
> --- a/include/linux/find.h
> +++ b/include/linux/find.h
> @@ -60,7 +60,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>  		if (unlikely(offset >= size))
>  			return size;
>  
> -		val = *addr & GENMASK(size - 1, offset);
> +		val = READ_ONCE(*addr) & GENMASK(size - 1, offset);
>  		return val ? __ffs(val) : size;
>  	}
>  
> @@ -90,7 +90,8 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
>  		if (unlikely(offset >= size))
>  			return size;
>  
> -		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
> +		val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> +						GENMASK(size - 1, offset);
>  		return val ? __ffs(val) : size;
>  	}
>  
> @@ -121,7 +122,8 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1,
>  		if (unlikely(offset >= size))
>  			return size;
>  
> -		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
> +		val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
> +						GENMASK(size - 1, offset);
>  		return val ? __ffs(val) : size;
>  	}
>  
> @@ -151,7 +153,8 @@ unsigned long find_next_or_bit(const unsigned long *addr1,
>  		if (unlikely(offset >= size))
>  			return size;
>  
> -		val = (*addr1 | *addr2) & GENMASK(size - 1, offset);
> +		val = (READ_ONCE(*addr1) | READ_ONCE(*addr2)) &
> +						GENMASK(size - 1, offset);
>  		return val ? __ffs(val) : size;
>  	}
>  
> @@ -179,7 +182,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
>  		if (unlikely(offset >= size))
>  			return size;
>  
> -		val = *addr | ~GENMASK(size - 1, offset);
> +		val = READ_ONCE(*addr) | ~GENMASK(size - 1, offset);
>  		return val == ~0UL ? size : ffz(val);
>  	}
>  
> @@ -200,7 +203,7 @@ static inline
>  unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
>  {
>  	if (small_const_nbits(size)) {
> -		unsigned long val = *addr & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
>  
>  		return val ? __ffs(val) : size;
>  	}
> @@ -229,7 +232,7 @@ unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsign
>  		return size;
>  
>  	if (small_const_nbits(size)) {
> -		unsigned long val =  *addr & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
>  
>  		return val ? fns(val, n) : size;
>  	}
> @@ -255,7 +258,8 @@ unsigned long find_nth_and_bit(const unsigned long *addr1, const unsigned long *
>  		return size;
>  
>  	if (small_const_nbits(size)) {
> -		unsigned long val =  *addr1 & *addr2 & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2)
> +							& GENMASK(size - 1, 0);
>  
>  		return val ? fns(val, n) : size;
>  	}
> @@ -282,7 +286,8 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
>  		return size;
>  
>  	if (small_const_nbits(size)) {
> -		unsigned long val =  *addr1 & (~*addr2) & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
> +							GENMASK(size - 1, 0);
>  
>  		return val ? fns(val, n) : size;
>  	}
> @@ -312,7 +317,8 @@ unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
>  		return size;
>  
>  	if (small_const_nbits(size)) {
> -		unsigned long val =  *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> +				~READ_ONCE(*addr3) & GENMASK(size - 1, 0);
>  
>  		return val ? fns(val, n) : size;
>  	}
> @@ -336,7 +342,8 @@ unsigned long find_first_and_bit(const unsigned long *addr1,
>  				 unsigned long size)
>  {
>  	if (small_const_nbits(size)) {
> -		unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> +							GENMASK(size - 1, 0);
>  
>  		return val ? __ffs(val) : size;
>  	}
> @@ -358,7 +365,7 @@ static inline
>  unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
>  {
>  	if (small_const_nbits(size)) {
> -		unsigned long val = *addr | ~GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr) | ~GENMASK(size - 1, 0);
>  
>  		return val == ~0UL ? size : ffz(val);
>  	}
> @@ -379,7 +386,7 @@ static inline
>  unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
>  {
>  	if (small_const_nbits(size)) {
> -		unsigned long val = *addr & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
>  
>  		return val ? __fls(val) : size;
>  	}
> @@ -505,7 +512,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
>  		long size, unsigned long offset)
>  {
>  	if (small_const_nbits(size)) {
> -		unsigned long val = *(const unsigned long *)addr;
> +		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
>  
>  		if (unlikely(offset >= size))
>  			return size;
> @@ -523,7 +530,8 @@ static inline
>  unsigned long find_first_zero_bit_le(const void *addr, unsigned long size)
>  {
>  	if (small_const_nbits(size)) {
> -		unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0);
> +		unsigned long val = swab(READ_ONCE(*(const unsigned long *)addr))
> +						| ~GENMASK(size - 1, 0);
>  
>  		return val == ~0UL ? size : ffz(val);
>  	}
> @@ -538,7 +546,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
>  		long size, unsigned long offset)
>  {
>  	if (small_const_nbits(size)) {
> -		unsigned long val = *(const unsigned long *)addr;
> +		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
>  
>  		if (unlikely(offset >= size))
>  			return size;
> diff --git a/lib/find_bit.c b/lib/find_bit.c
> index 32f99e9a670e..6867ef8a85e0 100644
> --- a/lib/find_bit.c
> +++ b/lib/find_bit.c
> @@ -98,7 +98,7 @@ out:										\
>   */
>  unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
>  {
> -	return FIND_FIRST_BIT(addr[idx], /* nop */, size);
> +	return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size);
>  }
>  EXPORT_SYMBOL(_find_first_bit);
>  #endif
> @@ -111,7 +111,8 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
>  				  const unsigned long *addr2,
>  				  unsigned long size)
>  {
> -	return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size);
> +	return FIND_FIRST_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> +				/* nop */, size);
>  }
>  EXPORT_SYMBOL(_find_first_and_bit);
>  #endif
> @@ -122,7 +123,7 @@ EXPORT_SYMBOL(_find_first_and_bit);
>   */
>  unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
>  {
> -	return FIND_FIRST_BIT(~addr[idx], /* nop */, size);
> +	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), /* nop */, size);
>  }
>  EXPORT_SYMBOL(_find_first_zero_bit);
>  #endif
> @@ -130,28 +131,30 @@ EXPORT_SYMBOL(_find_first_zero_bit);
>  #ifndef find_next_bit
>  unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
>  {
> -	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), /* nop */, nbits, start);
>  }
>  EXPORT_SYMBOL(_find_next_bit);
>  #endif
>  
>  unsigned long __find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n)
>  {
> -	return FIND_NTH_BIT(addr[idx], size, n);
> +	return FIND_NTH_BIT(READ_ONCE(addr[idx]), size, n);
>  }
>  EXPORT_SYMBOL(__find_nth_bit);
>  
>  unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long *addr2,
>  				 unsigned long size, unsigned long n)
>  {
> -	return FIND_NTH_BIT(addr1[idx] & addr2[idx], size, n);
> +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> +			    size, n);
>  }
>  EXPORT_SYMBOL(__find_nth_and_bit);
>  
>  unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
>  				 unsigned long size, unsigned long n)
>  {
> -	return FIND_NTH_BIT(addr1[idx] & ~addr2[idx], size, n);
> +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
> +			    size, n);
>  }
>  EXPORT_SYMBOL(__find_nth_andnot_bit);
>  
> @@ -160,7 +163,8 @@ unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
>  					const unsigned long *addr3,
>  					unsigned long size, unsigned long n)
>  {
> -	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
> +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]) &
> +			    ~READ_ONCE(addr3[idx]), size, n);
>  }
>  EXPORT_SYMBOL(__find_nth_and_andnot_bit);
>  
> @@ -168,7 +172,8 @@ EXPORT_SYMBOL(__find_nth_and_andnot_bit);
>  unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
>  					unsigned long nbits, unsigned long start)
>  {
> -	return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> +			     /* nop */, nbits, start);
>  }
>  EXPORT_SYMBOL(_find_next_and_bit);
>  #endif
> @@ -177,7 +182,8 @@ EXPORT_SYMBOL(_find_next_and_bit);
>  unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
>  					unsigned long nbits, unsigned long start)
>  {
> -	return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
> +			     /* nop */, nbits, start);
>  }
>  EXPORT_SYMBOL(_find_next_andnot_bit);
>  #endif
> @@ -186,7 +192,8 @@ EXPORT_SYMBOL(_find_next_andnot_bit);
>  unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
>  					unsigned long nbits, unsigned long start)
>  {
> -	return FIND_NEXT_BIT(addr1[idx] | addr2[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) | READ_ONCE(addr2[idx]),
> +			     /* nop */, nbits, start);
>  }
>  EXPORT_SYMBOL(_find_next_or_bit);
>  #endif
> @@ -195,7 +202,7 @@ EXPORT_SYMBOL(_find_next_or_bit);
>  unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
>  					 unsigned long start)
>  {
> -	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), /* nop */, nbits, start);
>  }
>  EXPORT_SYMBOL(_find_next_zero_bit);
>  #endif
> @@ -208,7 +215,7 @@ unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
>  		unsigned long idx = (size-1) / BITS_PER_LONG;
>  
>  		do {
> -			val &= addr[idx];
> +			val &= READ_ONCE(addr[idx]);
>  			if (val)
>  				return idx * BITS_PER_LONG + __fls(val);
>  
> @@ -242,7 +249,7 @@ EXPORT_SYMBOL(find_next_clump8);
>   */
>  unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
>  {
> -	return FIND_FIRST_BIT(~addr[idx], swab, size);
> +	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), swab, size);
>  }
>  EXPORT_SYMBOL(_find_first_zero_bit_le);
>  
> @@ -252,7 +259,7 @@ EXPORT_SYMBOL(_find_first_zero_bit_le);
>  unsigned long _find_next_zero_bit_le(const unsigned long *addr,
>  					unsigned long size, unsigned long offset)
>  {
> -	return FIND_NEXT_BIT(~addr[idx], swab, size, offset);
> +	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), swab, size, offset);
>  }
>  EXPORT_SYMBOL(_find_next_zero_bit_le);
>  #endif
> @@ -261,7 +268,7 @@ EXPORT_SYMBOL(_find_next_zero_bit_le);
>  unsigned long _find_next_bit_le(const unsigned long *addr,
>  				unsigned long size, unsigned long offset)
>  {
> -	return FIND_NEXT_BIT(addr[idx], swab, size, offset);
> +	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), swab, size, offset);
>  }
>  EXPORT_SYMBOL(_find_next_bit_le);
>  
> -- 
> 2.35.3
Matthew Wilcox Oct. 11, 2023, 6:49 p.m. UTC | #2
On Wed, Oct 11, 2023 at 11:26:29AM -0700, Yury Norov wrote:
> Long story short: KCSAN found some potential issues related to how
> people use bitmap API. And instead of working through that issues,
> the following code shuts down KCSAN by applying READ_ONCE() here
> and there.

Pfft.

> READ_ONCE() fixes nothing because nothing is broken in find_bit() API.
> As I suspected, and as Matthew confirmed in his email, the true reason
> for READ_ONCE() here is to disable KCSAN check:
> 
>         READ_ONCE() serves two functions here;
>         one is that it tells the compiler not to try anything fancy, and
>         the other is that it tells KCSAN to not bother instrumenting this
>         load; no load-delay-reload.
> 
> https://lkml.kernel.org/linux-mm/ZQkhgVb8nWGxpSPk@casper.infradead.org/
> 
> And as side-effect, it of course hurts the performance. In the same
> email Matthew said he doesn't believe me that READ_ONCE would do that,
> so thank you for testing and confirming that it does.

You really misinterpreted what Jan wrote to accomplish this motivated
reasoning.

> Jan, I think that in your last email you confirmed that the xarray
> problem that you're trying to solve is about a lack of proper locking:
> 
>         Well, for xarray the write side is synchronized with a spinlock but the read
>         side is not (only RCU protected).
> 
> https://lkml.kernel.org/linux-mm/20230918155403.ylhfdbscgw6yek6p@quack3/
> 
> If there's no enough synchronization, why not just adding it?

You don't understand.  We _intend_ for there to be no locking.
We _understand_ there is a race here.  We're _fine_ with there being
a race here.

> Regardless performance consideration, my main concern is that this patch
> considers bitmap as an atomic structure, which is intentionally not.
> There are just a few single-bit atomic operations like set_bit() and
> clear_bit(). All other functions are non-atomic, including those
> find_bit() operations.

... and for KCSAN to understand that, we have to use READ_ONCE.

> There is quite a lot of examples of wrong use of bitmaps wrt
> atomicity, the most typical is like:
>         for(idx = 0; idx < num; idx++) {
>                 ...
>                 set_bit(idx, bitmap);
>         }
> 
> This is wrong because a series of atomic ops is not atomic itself, and
> if you see something like this in you code, it should be converted to
> using non-atomic __set_bit(), and protected externally if needed.

That is a bad use of set_bit()!  I agree!  See, for example, commit
b21866f514cb where I remove precisely this kind of code.

> Similarly, READ_ONCE() in a for-loop doesn't guarantee any ordering or
> atomicity, and only hurts the performance. And this is exactly what
> this patch does.

Go back and read Jan's patch again, instead of cherry-picking some
little bits that confirm your prejudices.
Mirsad Todorovac Oct. 11, 2023, 7:25 p.m. UTC | #3
On 10/11/23 20:49, Matthew Wilcox wrote:
> On Wed, Oct 11, 2023 at 11:26:29AM -0700, Yury Norov wrote:
>> Long story short: KCSAN found some potential issues related to how
>> people use bitmap API. And instead of working through that issues,
>> the following code shuts down KCSAN by applying READ_ONCE() here
>> and there.
> 
> Pfft.
> 
>> READ_ONCE() fixes nothing because nothing is broken in find_bit() API.
>> As I suspected, and as Matthew confirmed in his email, the true reason
>> for READ_ONCE() here is to disable KCSAN check:
>>
>>          READ_ONCE() serves two functions here;
>>          one is that it tells the compiler not to try anything fancy, and
>>          the other is that it tells KCSAN to not bother instrumenting this
>>          load; no load-delay-reload.
>>
>> https://lkml.kernel.org/linux-mm/ZQkhgVb8nWGxpSPk@casper.infradead.org/
>>
>> And as side-effect, it of course hurts the performance. In the same
>> email Matthew said he doesn't believe me that READ_ONCE would do that,
>> so thank you for testing and confirming that it does.
> 
> You really misinterpreted what Jan wrote to accomplish this motivated
> reasoning.
> 
>> Jan, I think that in your last email you confirmed that the xarray
>> problem that you're trying to solve is about a lack of proper locking:
>>
>>          Well, for xarray the write side is synchronized with a spinlock but the read
>>          side is not (only RCU protected).
>>
>> https://lkml.kernel.org/linux-mm/20230918155403.ylhfdbscgw6yek6p@quack3/
>>
>> If there's no enough synchronization, why not just adding it?
> 
> You don't understand.  We _intend_ for there to be no locking.
> We_understand_ there is a race here.  We're _fine_ with there being
> a race here.
> 
>> Regardless performance consideration, my main concern is that this patch
>> considers bitmap as an atomic structure, which is intentionally not.
>> There are just a few single-bit atomic operations like set_bit() and
>> clear_bit(). All other functions are non-atomic, including those
>> find_bit() operations.
> 
> ... and for KCSAN to understand that, we have to use READ_ONCE.
> 
>> There is quite a lot of examples of wrong use of bitmaps wrt
>> atomicity, the most typical is like:
>>          for(idx = 0; idx < num; idx++) {
>>                  ...
>>                  set_bit(idx, bitmap);
>>          }
>>
>> This is wrong because a series of atomic ops is not atomic itself, and
>> if you see something like this in you code, it should be converted to
>> using non-atomic __set_bit(), and protected externally if needed.
> 
> That is a bad use of set_bit()!  I agree!  See, for example, commit
> b21866f514cb where I remove precisely this kind of code.
> 
>> Similarly, READ_ONCE() in a for-loop doesn't guarantee any ordering or
>> atomicity, and only hurts the performance. And this is exactly what
>> this patch does.
> 
> Go back and read Jan's patch again, instead of cherry-picking some
> little bits that confirm your prejudices.

Hey Yury,

No hard feelings, but I tend to agree with Mr. Wilcox and Jan.

set_bit(), just like any atomic increment or memory barrier - by the same
author you quoted - results in a LOCK prefix on the assembler instruction.
To my modest knowledge of machine language, this will cause the CPU
core to LOCK the memory bus for the time the byte, word, longword or quadword
(or even bit) is being read, changed, and written back.

If I am not making a stupid logical mistake, this LOCK on the memory
bus by a core is going to prevent the other cores from accessing memory
or filling or flushing their caches.

I agree with Mr. Wilcox that locking would have a much worse performance
penalty than a simple READ_ONCE(), which is designed to prevent the compiler
from applying "funny" optimisations, such as using two faster instructions
instead of a single atomic load - which might in the worst case be interrupted
just between the two half-loads.

It comes nowhere near the performance cost of a read or write memory barrier,
or of a memory bus lock that stalls all cores.

So, it silences KCSAN and I am happy with it, but I will not proceed with
a formal patch proposal until we have a consensus about it.

The data race actually means that another core can tear your load in half and
you get unexpected results. Why it happens more often on my configuration
than on others I cannot tell. But it might hurt the integrity of any
filesystem relying on the find_first_bit() and find_next_bit() primitives.

I mean, in the worst case scenario.

Meaning, I might not opt to go to Mars with the ship's computer running
with data-races ;-)

Best regards,
Mirsad Todorovac
Mirsad Todorovac Oct. 11, 2023, 8:40 p.m. UTC | #4
On 10/11/23 17:02, Jan Kara wrote:
> From: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
> 
> Some users of lib/find functions can operate on bitmaps that change
> while we are looking at the bitmap. For example xarray code looks at tag
> bitmaps only with RCU protection. The xarray users are prepared to
> handle a situation were tag modification races with reading of a tag
> bitmap (and thus we get back a stale information) but the find_*bit()
> functions should return based on bitmap contents at *some* point in time.
> As UBSAN properly warns, the code like:
> 
> 	val = *addr;
> 	if (val)
> 		__ffs(val)
> 
> prone to refetching 'val' contents from addr by the compiler and thus
> passing 0 to __ffs() which has undefined results.
> 
> Fix the problem by using READ_ONCE() in all the appropriate places so
> that the compiler cannot accidentally refetch the contents of addr
> between the test and __ffs(). This makes the undefined behavior
> impossible. The generated code is somewhat larger due to gcc tracking
> both the index and target fetch address in separate registers (according
> to GCC folks the volatile cast confuses their optimizer a bit, they are
> looking into a fix). The difference in performance is minimal though.
> Targetted microbenchmark (in userspace) of the bit searching loop shows
> about 2% regression on some x86 microarchitectures so for normal kernel
> workloads this should be in the noise and not worth introducing special
> set of bitmap searching functions.
> 
> [JK: Wrote commit message]
> 
> CC: Yury Norov <yury.norov@gmail.com>
> Link: https://lore.kernel.org/all/20230918044739.29782-1-mirsad.todorovac@alu.unizg.hr/
> Signed-off-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>   include/linux/find.h | 40 ++++++++++++++++++++++++----------------
>   lib/find_bit.c       | 39 +++++++++++++++++++++++----------------
>   2 files changed, 47 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/find.h b/include/linux/find.h
> index 5e4f39ef2e72..5ef6466dc7cc 100644
> --- a/include/linux/find.h
> +++ b/include/linux/find.h
> @@ -60,7 +60,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>   		if (unlikely(offset >= size))
>   			return size;
>   
> -		val = *addr & GENMASK(size - 1, offset);
> +		val = READ_ONCE(*addr) & GENMASK(size - 1, offset);
>   		return val ? __ffs(val) : size;
>   	}
>   
> @@ -90,7 +90,8 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
>   		if (unlikely(offset >= size))
>   			return size;
>   
> -		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
> +		val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> +						GENMASK(size - 1, offset);
>   		return val ? __ffs(val) : size;
>   	}
>   
> @@ -121,7 +122,8 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1,
>   		if (unlikely(offset >= size))
>   			return size;
>   
> -		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
> +		val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
> +						GENMASK(size - 1, offset);
>   		return val ? __ffs(val) : size;
>   	}
>   
> @@ -151,7 +153,8 @@ unsigned long find_next_or_bit(const unsigned long *addr1,
>   		if (unlikely(offset >= size))
>   			return size;
>   
> -		val = (*addr1 | *addr2) & GENMASK(size - 1, offset);
> +		val = (READ_ONCE(*addr1) | READ_ONCE(*addr2)) &
> +						GENMASK(size - 1, offset);
>   		return val ? __ffs(val) : size;
>   	}
>   
> @@ -179,7 +182,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
>   		if (unlikely(offset >= size))
>   			return size;
>   
> -		val = *addr | ~GENMASK(size - 1, offset);
> +		val = READ_ONCE(*addr) | ~GENMASK(size - 1, offset);
>   		return val == ~0UL ? size : ffz(val);
>   	}
>   
> @@ -200,7 +203,7 @@ static inline
>   unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
>   {
>   	if (small_const_nbits(size)) {
> -		unsigned long val = *addr & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
>   
>   		return val ? __ffs(val) : size;
>   	}
> @@ -229,7 +232,7 @@ unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsign
>   		return size;
>   
>   	if (small_const_nbits(size)) {
> -		unsigned long val =  *addr & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
>   
>   		return val ? fns(val, n) : size;
>   	}
> @@ -255,7 +258,8 @@ unsigned long find_nth_and_bit(const unsigned long *addr1, const unsigned long *
>   		return size;
>   
>   	if (small_const_nbits(size)) {
> -		unsigned long val =  *addr1 & *addr2 & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2)
> +							& GENMASK(size - 1, 0);
>   
>   		return val ? fns(val, n) : size;
>   	}
> @@ -282,7 +286,8 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
>   		return size;
>   
>   	if (small_const_nbits(size)) {
> -		unsigned long val =  *addr1 & (~*addr2) & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
> +							GENMASK(size - 1, 0);
>   
>   		return val ? fns(val, n) : size;
>   	}
> @@ -312,7 +317,8 @@ unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
>   		return size;
>   
>   	if (small_const_nbits(size)) {
> -		unsigned long val =  *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> +				~READ_ONCE(*addr3) & GENMASK(size - 1, 0);
>   
>   		return val ? fns(val, n) : size;
>   	}
> @@ -336,7 +342,8 @@ unsigned long find_first_and_bit(const unsigned long *addr1,
>   				 unsigned long size)
>   {
>   	if (small_const_nbits(size)) {
> -		unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> +							GENMASK(size - 1, 0);
>   
>   		return val ? __ffs(val) : size;
>   	}
> @@ -358,7 +365,7 @@ static inline
>   unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
>   {
>   	if (small_const_nbits(size)) {
> -		unsigned long val = *addr | ~GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr) | ~GENMASK(size - 1, 0);
>   
>   		return val == ~0UL ? size : ffz(val);
>   	}
> @@ -379,7 +386,7 @@ static inline
>   unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
>   {
>   	if (small_const_nbits(size)) {
> -		unsigned long val = *addr & GENMASK(size - 1, 0);
> +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
>   
>   		return val ? __fls(val) : size;
>   	}
> @@ -505,7 +512,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
>   		long size, unsigned long offset)
>   {
>   	if (small_const_nbits(size)) {
> -		unsigned long val = *(const unsigned long *)addr;
> +		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
>   
>   		if (unlikely(offset >= size))
>   			return size;
> @@ -523,7 +530,8 @@ static inline
>   unsigned long find_first_zero_bit_le(const void *addr, unsigned long size)
>   {
>   	if (small_const_nbits(size)) {
> -		unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0);
> +		unsigned long val = swab(READ_ONCE(*(const unsigned long *)addr))
> +						| ~GENMASK(size - 1, 0);
>   
>   		return val == ~0UL ? size : ffz(val);
>   	}
> @@ -538,7 +546,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
>   		long size, unsigned long offset)
>   {
>   	if (small_const_nbits(size)) {
> -		unsigned long val = *(const unsigned long *)addr;
> +		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
>   
>   		if (unlikely(offset >= size))
>   			return size;
> diff --git a/lib/find_bit.c b/lib/find_bit.c
> index 32f99e9a670e..6867ef8a85e0 100644
> --- a/lib/find_bit.c
> +++ b/lib/find_bit.c
> @@ -98,7 +98,7 @@ out:										\
>    */
>   unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
>   {
> -	return FIND_FIRST_BIT(addr[idx], /* nop */, size);
> +	return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size);
>   }
>   EXPORT_SYMBOL(_find_first_bit);
>   #endif
> @@ -111,7 +111,8 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
>   				  const unsigned long *addr2,
>   				  unsigned long size)
>   {
> -	return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size);
> +	return FIND_FIRST_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> +				/* nop */, size);
>   }
>   EXPORT_SYMBOL(_find_first_and_bit);
>   #endif
> @@ -122,7 +123,7 @@ EXPORT_SYMBOL(_find_first_and_bit);
>    */
>   unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
>   {
> -	return FIND_FIRST_BIT(~addr[idx], /* nop */, size);
> +	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), /* nop */, size);
>   }
>   EXPORT_SYMBOL(_find_first_zero_bit);
>   #endif
> @@ -130,28 +131,30 @@ EXPORT_SYMBOL(_find_first_zero_bit);
>   #ifndef find_next_bit
>   unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
>   {
> -	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), /* nop */, nbits, start);
>   }
>   EXPORT_SYMBOL(_find_next_bit);
>   #endif
>   
>   unsigned long __find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n)
>   {
> -	return FIND_NTH_BIT(addr[idx], size, n);
> +	return FIND_NTH_BIT(READ_ONCE(addr[idx]), size, n);
>   }
>   EXPORT_SYMBOL(__find_nth_bit);
>   
>   unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long *addr2,
>   				 unsigned long size, unsigned long n)
>   {
> -	return FIND_NTH_BIT(addr1[idx] & addr2[idx], size, n);
> +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> +			    size, n);
>   }
>   EXPORT_SYMBOL(__find_nth_and_bit);
>   
>   unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
>   				 unsigned long size, unsigned long n)
>   {
> -	return FIND_NTH_BIT(addr1[idx] & ~addr2[idx], size, n);
> +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
> +			    size, n);
>   }
>   EXPORT_SYMBOL(__find_nth_andnot_bit);
>   
> @@ -160,7 +163,8 @@ unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
>   					const unsigned long *addr3,
>   					unsigned long size, unsigned long n)
>   {
> -	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
> +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]) &
> +			    ~READ_ONCE(addr3[idx]), size, n);
>   }
>   EXPORT_SYMBOL(__find_nth_and_andnot_bit);
>   
> @@ -168,7 +172,8 @@ EXPORT_SYMBOL(__find_nth_and_andnot_bit);
>   unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
>   					unsigned long nbits, unsigned long start)
>   {
> -	return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> +			     /* nop */, nbits, start);
>   }
>   EXPORT_SYMBOL(_find_next_and_bit);
>   #endif
> @@ -177,7 +182,8 @@ EXPORT_SYMBOL(_find_next_and_bit);
>   unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
>   					unsigned long nbits, unsigned long start)
>   {
> -	return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
> +			     /* nop */, nbits, start);
>   }
>   EXPORT_SYMBOL(_find_next_andnot_bit);
>   #endif
> @@ -186,7 +192,8 @@ EXPORT_SYMBOL(_find_next_andnot_bit);
>   unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
>   					unsigned long nbits, unsigned long start)
>   {
> -	return FIND_NEXT_BIT(addr1[idx] | addr2[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) | READ_ONCE(addr2[idx]),
> +			     /* nop */, nbits, start);
>   }
>   EXPORT_SYMBOL(_find_next_or_bit);
>   #endif
> @@ -195,7 +202,7 @@ EXPORT_SYMBOL(_find_next_or_bit);
>   unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
>   					 unsigned long start)
>   {
> -	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
> +	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), /* nop */, nbits, start);
>   }
>   EXPORT_SYMBOL(_find_next_zero_bit);
>   #endif
> @@ -208,7 +215,7 @@ unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
>   		unsigned long idx = (size-1) / BITS_PER_LONG;
>   
>   		do {
> -			val &= addr[idx];
> +			val &= READ_ONCE(addr[idx]);
>   			if (val)
>   				return idx * BITS_PER_LONG + __fls(val);
>   
> @@ -242,7 +249,7 @@ EXPORT_SYMBOL(find_next_clump8);
>    */
>   unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
>   {
> -	return FIND_FIRST_BIT(~addr[idx], swab, size);
> +	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), swab, size);
>   }
>   EXPORT_SYMBOL(_find_first_zero_bit_le);
>   
> @@ -252,7 +259,7 @@ EXPORT_SYMBOL(_find_first_zero_bit_le);
>   unsigned long _find_next_zero_bit_le(const unsigned long *addr,
>   					unsigned long size, unsigned long offset)
>   {
> -	return FIND_NEXT_BIT(~addr[idx], swab, size, offset);
> +	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), swab, size, offset);
>   }
>   EXPORT_SYMBOL(_find_next_zero_bit_le);
>   #endif
> @@ -261,7 +268,7 @@ EXPORT_SYMBOL(_find_next_zero_bit_le);
>   unsigned long _find_next_bit_le(const unsigned long *addr,
>   				unsigned long size, unsigned long offset)
>   {
> -	return FIND_NEXT_BIT(addr[idx], swab, size, offset);
> +	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), swab, size, offset);
>   }
>   EXPORT_SYMBOL(_find_next_bit_le);
>   

Works like a charm. Nothing in KCSAN.

Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>

Best regards,
Mirsad
Jan Kara Oct. 12, 2023, 12:21 p.m. UTC | #5
On Wed 11-10-23 11:26:29, Yury Norov wrote:
> Long story short: KCSAN found some potential issues related to how
> people use bitmap API. And instead of working through that issues,
> the following code shuts down KCSAN by applying READ_ONCE() here
> and there.

I'm sorry but this is not what the patch does. I'm not sure how to get the
message across so maybe let me start from a different angle:

Bitmaps are perfectly fine to use without any external locking if
only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
used. This is a significant performance gain compared to using a spinlock
or other locking, and people have been doing this for a long time. I hope we
agree on that.

Now it is also common that you need to find a set / clear bit in a bitmap.
To maintain the lockless protocol and deal with races, people employ schemes
like (in the dumbest form):

	do {
		bit = find_first_bit(bitmap, n);
		if (bit >= n)
			abort...
	} while (!test_and_clear_bit(bit, bitmap));

So the code loops until it finds a set bit that it successfully clears.
This is perfectly fine and safe lockless code, and such use should be
supported. Agreed?
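
Wrapped into a self-contained helper, the same pattern looks roughly like
this (a sketch only; the helper name is made up):

	#include <linux/bitmap.h>

	/* Claim any set bit; returns the bit index, or >= nbits if none was left. */
	static unsigned long claim_any_bit(unsigned long *bitmap, unsigned long nbits)
	{
		unsigned long bit;

		do {
			bit = find_first_bit(bitmap, nbits);
			if (bit >= nbits)
				return nbits;	/* nothing to claim */
		} while (!test_and_clear_bit(bit, bitmap));

		return bit;	/* this bit now belongs to the caller */
	}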

*Except* that the above actually is not safe due to the find_first_bit()
implementation, and KCSAN warns about that. The problem is this:

Assume *addr == 1
CPU1			CPU2
find_first_bit(addr, 64)
  val = *addr;
  if (val) -> true
			clear_bit(0, addr)
    val = *addr -> compiler decided to refetch addr contents for whatever
		   reason in the generated assembly
    __ffs(val) -> now executed for value 0 which has undefined results.

And the READ_ONCE() this patch adds prevents the compiler from adding the
refetching of addr into the assembly.

Are we on the same page now?

								Honza

> On Wed, Oct 11, 2023 at 05:02:31PM +0200, Jan Kara wrote:
> > From: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
> > 
> > Some users of lib/find functions can operate on bitmaps that change
> > while we are looking at the bitmap. For example xarray code looks at tag
> > bitmaps only with RCU protection. The xarray users are prepared to
> > handle a situation were tag modification races with reading of a tag
> > bitmap (and thus we get back a stale information) but the find_*bit()
> > functions should return based on bitmap contents at *some* point in time.
> > As UBSAN properly warns, the code like:
> > 
> > 	val = *addr;
> > 	if (val)
> > 		__ffs(val)
> > 
> > prone to refetching 'val' contents from addr by the compiler and thus
> > passing 0 to __ffs() which has undefined results.
> > 
> > Fix the problem by using READ_ONCE() in all the appropriate places so
> > that the compiler cannot accidentally refetch the contents of addr
> > between the test and __ffs(). This makes the undefined behavior
> > impossible. The generated code is somewhat larger due to gcc tracking
> > both the index and target fetch address in separate registers (according
> > to GCC folks the volatile cast confuses their optimizer a bit, they are
> > looking into a fix). The difference in performance is minimal though.
> > Targetted microbenchmark (in userspace) of the bit searching loop shows
> > about 2% regression on some x86 microarchitectures so for normal kernel
> > workloads this should be in the noise and not worth introducing special
> > set of bitmap searching functions.
> > 
> > [JK: Wrote commit message]
> 
> READ_ONCE() fixes nothing because nothing is broken in find_bit() API.
> As I suspected, and as Matthew confirmed in his email, the true reason
> for READ_ONCE() here is to disable KCSAN check:
> 
>         READ_ONCE() serves two functions here;
>         one is that it tells the compiler not to try anything fancy, and
>         the other is that it tells KCSAN to not bother instrumenting this
>         load; no load-delay-reload.
> 
> https://lkml.kernel.org/linux-mm/ZQkhgVb8nWGxpSPk@casper.infradead.org/
> 
> And as side-effect, it of course hurts the performance. In the same
> email Matthew said he doesn't believe me that READ_ONCE would do that,
> so thank you for testing and confirming that it does.
> 
> Matthew, in my experience compilers do really well in that low-level
> things, and gambling with compilers usually makes thing worse. x86_64
> is one of the most strong-ordered architectures that I've been working
> with, and even here READ_ONCE has visible effect on performance. I
> expect that it will get even worse on more weakly ordered platforms.
> 
> I'm OK that you don't believe me; probably you can believe Paul
> McKenney and his paper on kernel memory model, which says in the very
> first paragraph:
> 
>         Applying ACCESS_ONCE() to a large array or structure is unlikely
>         to do anything useful, and use of READ_ONCE() and WRITE_ONCE()
>         in this situation can result in load-tearing and store-tearing.
> 
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4444.html
> 
> Jan, I think that in your last email you confirmed that the xarray
> problem that you're trying to solve is about a lack of proper locking:
> 
>         Well, for xarray the write side is synchronized with a spinlock but the read
>         side is not (only RCU protected).
> 
> https://lkml.kernel.org/linux-mm/20230918155403.ylhfdbscgw6yek6p@quack3/
> 
> If there's no enough synchronization, why not just adding it?
> 
> Regardless performance consideration, my main concern is that this patch
> considers bitmap as an atomic structure, which is intentionally not.
> There are just a few single-bit atomic operations like set_bit() and
> clear_bit(). All other functions are non-atomic, including those
> find_bit() operations.
> 
> There is quite a lot of examples of wrong use of bitmaps wrt
> atomicity, the most typical is like:
>         for(idx = 0; idx < num; idx++) {
>                 ...
>                 set_bit(idx, bitmap);
>         }
> 
> This is wrong because a series of atomic ops is not atomic itself, and
> if you see something like this in you code, it should be converted to
> using non-atomic __set_bit(), and protected externally if needed.
> 
> Similarly, READ_ONCE() in a for-loop doesn't guarantee any ordering or
> atomicity, and only hurts the performance. And this is exactly what
> this patch does.
> 
> Thanks,
> Yury
>  
> > CC: Yury Norov <yury.norov@gmail.com>
> > Link: https://lore.kernel.org/all/20230918044739.29782-1-mirsad.todorovac@alu.unizg.hr/
> > Signed-off-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  include/linux/find.h | 40 ++++++++++++++++++++++++----------------
> >  lib/find_bit.c       | 39 +++++++++++++++++++++++----------------
> >  2 files changed, 47 insertions(+), 32 deletions(-)
> > 
> > diff --git a/include/linux/find.h b/include/linux/find.h
> > index 5e4f39ef2e72..5ef6466dc7cc 100644
> > --- a/include/linux/find.h
> > +++ b/include/linux/find.h
> > @@ -60,7 +60,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
> >  		if (unlikely(offset >= size))
> >  			return size;
> >  
> > -		val = *addr & GENMASK(size - 1, offset);
> > +		val = READ_ONCE(*addr) & GENMASK(size - 1, offset);
> >  		return val ? __ffs(val) : size;
> >  	}
> >  
> > @@ -90,7 +90,8 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
> >  		if (unlikely(offset >= size))
> >  			return size;
> >  
> > -		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
> > +		val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> > +						GENMASK(size - 1, offset);
> >  		return val ? __ffs(val) : size;
> >  	}
> >  
> > @@ -121,7 +122,8 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1,
> >  		if (unlikely(offset >= size))
> >  			return size;
> >  
> > -		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
> > +		val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
> > +						GENMASK(size - 1, offset);
> >  		return val ? __ffs(val) : size;
> >  	}
> >  
> > @@ -151,7 +153,8 @@ unsigned long find_next_or_bit(const unsigned long *addr1,
> >  		if (unlikely(offset >= size))
> >  			return size;
> >  
> > -		val = (*addr1 | *addr2) & GENMASK(size - 1, offset);
> > +		val = (READ_ONCE(*addr1) | READ_ONCE(*addr2)) &
> > +						GENMASK(size - 1, offset);
> >  		return val ? __ffs(val) : size;
> >  	}
> >  
> > @@ -179,7 +182,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
> >  		if (unlikely(offset >= size))
> >  			return size;
> >  
> > -		val = *addr | ~GENMASK(size - 1, offset);
> > +		val = READ_ONCE(*addr) | ~GENMASK(size - 1, offset);
> >  		return val == ~0UL ? size : ffz(val);
> >  	}
> >  
> > @@ -200,7 +203,7 @@ static inline
> >  unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
> >  {
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val = *addr & GENMASK(size - 1, 0);
> > +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
> >  
> >  		return val ? __ffs(val) : size;
> >  	}
> > @@ -229,7 +232,7 @@ unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsign
> >  		return size;
> >  
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val =  *addr & GENMASK(size - 1, 0);
> > +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
> >  
> >  		return val ? fns(val, n) : size;
> >  	}
> > @@ -255,7 +258,8 @@ unsigned long find_nth_and_bit(const unsigned long *addr1, const unsigned long *
> >  		return size;
> >  
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val =  *addr1 & *addr2 & GENMASK(size - 1, 0);
> > +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2)
> > +							& GENMASK(size - 1, 0);
> >  
> >  		return val ? fns(val, n) : size;
> >  	}
> > @@ -282,7 +286,8 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
> >  		return size;
> >  
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val =  *addr1 & (~*addr2) & GENMASK(size - 1, 0);
> > +		unsigned long val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
> > +							GENMASK(size - 1, 0);
> >  
> >  		return val ? fns(val, n) : size;
> >  	}
> > @@ -312,7 +317,8 @@ unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
> >  		return size;
> >  
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val =  *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
> > +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> > +				~READ_ONCE(*addr3) & GENMASK(size - 1, 0);
> >  
> >  		return val ? fns(val, n) : size;
> >  	}
> > @@ -336,7 +342,8 @@ unsigned long find_first_and_bit(const unsigned long *addr1,
> >  				 unsigned long size)
> >  {
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);
> > +		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
> > +							GENMASK(size - 1, 0);
> >  
> >  		return val ? __ffs(val) : size;
> >  	}
> > @@ -358,7 +365,7 @@ static inline
> >  unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
> >  {
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val = *addr | ~GENMASK(size - 1, 0);
> > +		unsigned long val = READ_ONCE(*addr) | ~GENMASK(size - 1, 0);
> >  
> >  		return val == ~0UL ? size : ffz(val);
> >  	}
> > @@ -379,7 +386,7 @@ static inline
> >  unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
> >  {
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val = *addr & GENMASK(size - 1, 0);
> > +		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
> >  
> >  		return val ? __fls(val) : size;
> >  	}
> > @@ -505,7 +512,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
> >  		long size, unsigned long offset)
> >  {
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val = *(const unsigned long *)addr;
> > +		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
> >  
> >  		if (unlikely(offset >= size))
> >  			return size;
> > @@ -523,7 +530,8 @@ static inline
> >  unsigned long find_first_zero_bit_le(const void *addr, unsigned long size)
> >  {
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0);
> > +		unsigned long val = swab(READ_ONCE(*(const unsigned long *)addr))
> > +						| ~GENMASK(size - 1, 0);
> >  
> >  		return val == ~0UL ? size : ffz(val);
> >  	}
> > @@ -538,7 +546,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
> >  		long size, unsigned long offset)
> >  {
> >  	if (small_const_nbits(size)) {
> > -		unsigned long val = *(const unsigned long *)addr;
> > +		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
> >  
> >  		if (unlikely(offset >= size))
> >  			return size;
> > diff --git a/lib/find_bit.c b/lib/find_bit.c
> > index 32f99e9a670e..6867ef8a85e0 100644
> > --- a/lib/find_bit.c
> > +++ b/lib/find_bit.c
> > @@ -98,7 +98,7 @@ out:										\
> >   */
> >  unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
> >  {
> > -	return FIND_FIRST_BIT(addr[idx], /* nop */, size);
> > +	return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size);
> >  }
> >  EXPORT_SYMBOL(_find_first_bit);
> >  #endif
> > @@ -111,7 +111,8 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
> >  				  const unsigned long *addr2,
> >  				  unsigned long size)
> >  {
> > -	return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size);
> > +	return FIND_FIRST_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> > +				/* nop */, size);
> >  }
> >  EXPORT_SYMBOL(_find_first_and_bit);
> >  #endif
> > @@ -122,7 +123,7 @@ EXPORT_SYMBOL(_find_first_and_bit);
> >   */
> >  unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
> >  {
> > -	return FIND_FIRST_BIT(~addr[idx], /* nop */, size);
> > +	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), /* nop */, size);
> >  }
> >  EXPORT_SYMBOL(_find_first_zero_bit);
> >  #endif
> > @@ -130,28 +131,30 @@ EXPORT_SYMBOL(_find_first_zero_bit);
> >  #ifndef find_next_bit
> >  unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
> >  {
> > -	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
> > +	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), /* nop */, nbits, start);
> >  }
> >  EXPORT_SYMBOL(_find_next_bit);
> >  #endif
> >  
> >  unsigned long __find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n)
> >  {
> > -	return FIND_NTH_BIT(addr[idx], size, n);
> > +	return FIND_NTH_BIT(READ_ONCE(addr[idx]), size, n);
> >  }
> >  EXPORT_SYMBOL(__find_nth_bit);
> >  
> >  unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long *addr2,
> >  				 unsigned long size, unsigned long n)
> >  {
> > -	return FIND_NTH_BIT(addr1[idx] & addr2[idx], size, n);
> > +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> > +			    size, n);
> >  }
> >  EXPORT_SYMBOL(__find_nth_and_bit);
> >  
> >  unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
> >  				 unsigned long size, unsigned long n)
> >  {
> > -	return FIND_NTH_BIT(addr1[idx] & ~addr2[idx], size, n);
> > +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
> > +			    size, n);
> >  }
> >  EXPORT_SYMBOL(__find_nth_andnot_bit);
> >  
> > @@ -160,7 +163,8 @@ unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
> >  					const unsigned long *addr3,
> >  					unsigned long size, unsigned long n)
> >  {
> > -	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
> > +	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]) &
> > +			    ~READ_ONCE(addr3[idx]), size, n);
> >  }
> >  EXPORT_SYMBOL(__find_nth_and_andnot_bit);
> >  
> > @@ -168,7 +172,8 @@ EXPORT_SYMBOL(__find_nth_and_andnot_bit);
> >  unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
> >  					unsigned long nbits, unsigned long start)
> >  {
> > -	return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start);
> > +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
> > +			     /* nop */, nbits, start);
> >  }
> >  EXPORT_SYMBOL(_find_next_and_bit);
> >  #endif
> > @@ -177,7 +182,8 @@ EXPORT_SYMBOL(_find_next_and_bit);
> >  unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
> >  					unsigned long nbits, unsigned long start)
> >  {
> > -	return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start);
> > +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
> > +			     /* nop */, nbits, start);
> >  }
> >  EXPORT_SYMBOL(_find_next_andnot_bit);
> >  #endif
> > @@ -186,7 +192,8 @@ EXPORT_SYMBOL(_find_next_andnot_bit);
> >  unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
> >  					unsigned long nbits, unsigned long start)
> >  {
> > -	return FIND_NEXT_BIT(addr1[idx] | addr2[idx], /* nop */, nbits, start);
> > +	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) | READ_ONCE(addr2[idx]),
> > +			     /* nop */, nbits, start);
> >  }
> >  EXPORT_SYMBOL(_find_next_or_bit);
> >  #endif
> > @@ -195,7 +202,7 @@ EXPORT_SYMBOL(_find_next_or_bit);
> >  unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
> >  					 unsigned long start)
> >  {
> > -	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
> > +	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), /* nop */, nbits, start);
> >  }
> >  EXPORT_SYMBOL(_find_next_zero_bit);
> >  #endif
> > @@ -208,7 +215,7 @@ unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
> >  		unsigned long idx = (size-1) / BITS_PER_LONG;
> >  
> >  		do {
> > -			val &= addr[idx];
> > +			val &= READ_ONCE(addr[idx]);
> >  			if (val)
> >  				return idx * BITS_PER_LONG + __fls(val);
> >  
> > @@ -242,7 +249,7 @@ EXPORT_SYMBOL(find_next_clump8);
> >   */
> >  unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
> >  {
> > -	return FIND_FIRST_BIT(~addr[idx], swab, size);
> > +	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), swab, size);
> >  }
> >  EXPORT_SYMBOL(_find_first_zero_bit_le);
> >  
> > @@ -252,7 +259,7 @@ EXPORT_SYMBOL(_find_first_zero_bit_le);
> >  unsigned long _find_next_zero_bit_le(const unsigned long *addr,
> >  					unsigned long size, unsigned long offset)
> >  {
> > -	return FIND_NEXT_BIT(~addr[idx], swab, size, offset);
> > +	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), swab, size, offset);
> >  }
> >  EXPORT_SYMBOL(_find_next_zero_bit_le);
> >  #endif
> > @@ -261,7 +268,7 @@ EXPORT_SYMBOL(_find_next_zero_bit_le);
> >  unsigned long _find_next_bit_le(const unsigned long *addr,
> >  				unsigned long size, unsigned long offset)
> >  {
> > -	return FIND_NEXT_BIT(addr[idx], swab, size, offset);
> > +	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), swab, size, offset);
> >  }
> >  EXPORT_SYMBOL(_find_next_bit_le);
> >  
> > -- 
> > 2.35.3
Yury Norov Oct. 14, 2023, 12:15 a.m. UTC | #6
Restore LKML

On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
> On Wed 11-10-23 11:26:29, Yury Norov wrote:
> > Long story short: KCSAN found some potential issues related to how
> > people use bitmap API. And instead of working through that issues,
> > the following code shuts down KCSAN by applying READ_ONCE() here
> > and there.
> 
> I'm sorry but this is not what the patch does. I'm not sure how to get the
> message across so maybe let me start from a different angle:
> 
> Bitmaps are perfectly fine to be used without any external locking if
> only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
> used. This is a significant performance gain compared to using a spinlock
> or other locking and people do this for a long time. I hope we agree on
> that.
> 
> Now it is also common that you need to find a set / clear bit in a bitmap.
> To maintain lockless protocol and deal with races people employ schemes
> like (the dumbest form):
> 
> 	do {
> 		bit = find_first_bit(bitmap, n);
> 		if (bit >= n)
> 			abort...
> 	} while (!test_and_clear_bit(bit, bitmap));
> 
> So the code loops until it finds a set bit that is successfully cleared by
> it. This is perfectly fine and safe lockless code and such use should be
> supported. Agreed?

Great example. When you're running non-atomic functions concurrently,
the result may easily become incorrect, and this is what you're
demonstrating here.

Regarding find_first_bit() it means that:
 - it may erroneously return an unset bit;
 - it may erroneously return a set bit that is not the first one;
 - it may erroneously return no bit at all for a non-empty bitmap.

Effectively it means that find_first_bit() may just return a random number.

Let's take another example:

	do {
		bit = get_random_number();
		if (bit >= n)
			abort...
	} while (!test_and_clear_bit(bit, bitmap));

When running concurrently, the only difference between this and your
code is the probability of getting a set bit somewhere near the
beginning of the bitmap.

The key point is that find_bit() may return an undefined result even if
READ_ONCE() is used. If the bitmap gets changed at any point, the result
becomes invalid. That may happen even after find_first_bit() returns.

And if my understanding is correct, your code is designed on the
assumption that find_first_bit() may return garbage, and so handles it
correctly.

> *Except* that the above actually is not safe due to find_first_bit()
> implementation and KCSAN warns about that. The problem is that:
> 
> Assume *addr == 1
> CPU1			CPU2
> find_first_bit(addr, 64)
>   val = *addr;
>   if (val) -> true
> 			clear_bit(0, addr)
>     val = *addr -> compiler decided to refetch addr contents for whatever
> 		   reason in the generated assembly
>     __ffs(val) -> now executed for value 0 which has undefined results.

Yes, __ffs(0) is undefined. But the whole function is undefined when
accessing the bitmap concurrently.

> And the READ_ONCE() this patch adds prevents the compiler from adding the
> refetching of addr into the assembly.

That's true. But it doesn't improve on the situation. It was an undef
before, and it's undef after, but a 2% slower undef.

Now, on that KCSAN warning: if I understand things correctly, for the
example above the KCSAN warning is a false positive, because you're
intentionally running lockless.

But for some other people it may be a true error, and now they'll have
no chance to catch it if KCSAN is forced to ignore find_bit() entirely.

We've got a whole class of lockless algorithms that allow safe concurrent
access to memory. And now that there's a tool that searches for
concurrent accesses, we need an option to somehow teach it to suppress
irrelevant warnings. Maybe something like this?

        lockless_algorithm_begin(bitmap, bitmap_size(nbits));
	do {
		bit = find_first_bit(bitmap, nbits);
		if (bit >= nbits)
			break;
	} while (!test_and_clear_bit(bit, bitmap));
        lockless_algorithm_end(bitmap, bitmap_size(nbits));
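
For comparison, a minimal sketch of how the existing data_race()
annotation from include/linux/compiler.h could be applied to the same
loop; whether it also covers the plain accesses inside find_first_bit()
itself is an assumption that would need checking:

	/*
	 * Hedged sketch only: data_race() documents an intentional racy
	 * access and suppresses KCSAN reports for the wrapped expression;
	 * the loop is otherwise the same as above.
	 */
	do {
		bit = data_race(find_first_bit(bitmap, nbits));
		if (bit >= nbits)
			break;
	} while (!test_and_clear_bit(bit, bitmap));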

And, of course, as I suggested a couple iterations ago, you can invent
a thread-safe version of find_bit(), that would be perfectly correct
for lockless use:

 unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
 {
        unsigned long bit = 0;
 
        while (!test_and_clear_bit(bit, addr)) {
                bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
                if (bit >= size)
                        return size;
        }

        return bit;
 }

I didn't test that, but I hope the 'volatile' specifier is enough for
the compiler to realize that it shouldn't optimize the memory accesses,
and for KCSAN to see that everything is OK here.
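
As a hedged illustration (find_and_clear_bit() is only a proposal at
this point, so the helper name and its return convention are
assumptions), the retry loop from the earlier example would then
collapse to:

	bit = find_and_clear_bit(bitmap, nbits);
	if (bit >= nbits)
		abort...	/* no set bit could be claimed */
	/* otherwise 'bit' was set and we are the ones who cleared it */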

By the way, thank you for respectful and professional communication.

Thanks,
Yury
Mirsad Todorovac Oct. 14, 2023, 2:21 a.m. UTC | #7
On 10/14/2023 2:15 AM, Yury Norov wrote:
> Restore LKML
> 
> On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
>> On Wed 11-10-23 11:26:29, Yury Norov wrote:
>>> Long story short: KCSAN found some potential issues related to how
>>> people use bitmap API. And instead of working through that issues,
>>> the following code shuts down KCSAN by applying READ_ONCE() here
>>> and there.
>>
>> I'm sorry but this is not what the patch does. I'm not sure how to get the
>> message across so maybe let me start from a different angle:
>>
>> Bitmaps are perfectly fine to be used without any external locking if
>> only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
>> used. This is a significant performance gain compared to using a spinlock
>> or other locking and people do this for a long time. I hope we agree on
>> that.
>>
>> Now it is also common that you need to find a set / clear bit in a bitmap.
>> To maintain lockless protocol and deal with races people employ schemes
>> like (the dumbest form):
>>
>> 	do {
>> 		bit = find_first_bit(bitmap, n);
>> 		if (bit >= n)
>> 			abort...
>> 	} while (!test_and_clear_bit(bit, bitmap));
>>
>> So the code loops until it finds a set bit that is successfully cleared by
>> it. This is perfectly fine and safe lockless code and such use should be
>> supported. Agreed?
> 
> Great example. When you're running non-atomic functions concurrently,
> the result may easily become incorrect, and this is what you're
> demonstrating here.
> 
> Regarding find_first_bit() it means that:
>   - it may erroneously return unset bit;
>   - it may erroneously return non-first set bit;
>   - it may erroneously return no bits for non-empty bitmap.
> 
> Effectively it means that find_first bit may just return a random number.
> 
> Let's take another example:
> 
> 	do {
> 		bit = get_random_number();
> 		if (bit >= n)
> 			abort...
> 	} while (!test_and_clear_bit(bit, bitmap));
> 
> When running concurrently, the difference between this and your code
> is only in probability of getting set bit somewhere from around the
> beginning of bitmap.
> 
> The key point is that find_bit() may return undef even if READ_ONCE() is
> used. If bitmap gets changed anytime in the process, the result becomes
> invalid. It may happen even after returning from find_first_bit().
> 
> And if my understanding correct, your code is designed in the
> assumption that find_first_bit() may return garbage, so handles it
> correctly.
> 
>> *Except* that the above actually is not safe due to find_first_bit()
>> implementation and KCSAN warns about that. The problem is that:
>>
>> Assume *addr == 1
>> CPU1			CPU2
>> find_first_bit(addr, 64)
>>    val = *addr;
>>    if (val) -> true
>> 			clear_bit(0, addr)
>>      val = *addr -> compiler decided to refetch addr contents for whatever
>> 		   reason in the generated assembly
>>      __ffs(val) -> now executed for value 0 which has undefined results.
> 
> Yes, __ffs(0) is undef. But the whole function is undef when accessing
> bitmap concurrently.
> 
>> And the READ_ONCE() this patch adds prevents the compiler from adding the
>> refetching of addr into the assembly.
> 
> That's true. But it doesn't improve on the situation. It was an undef
> before, and it's undef after, but a 2% slower undef.
> 
> Now on that KCSAN warning. If I understand things correctly, for the
> example above, KCSAN warning is false-positive, because you're
> intentionally running lockless.
> 
> But for some other people it may be a true error, and now they'll have
> no chance to catch it if KCSAN is forced to ignore find_bit() entirely.
> 
> We've got the whole class of lockless algorithms that allow safe concurrent
> access to the memory. And now that there's a tool that searches for them
> (concurrent accesses), we need to have an option to somehow teach it
> to suppress irrelevant warnings. Maybe something like this?
> 
>          lockless_algorithm_begin(bitmap, bitmap_size(nbits));
> 	do {
> 		bit = find_first_bit(bitmap, nbits);
> 		if (bit >= nbits)
> 			break;
> 	} while (!test_and_clear_bit(bit, bitmap));
>          lockless_algorithm_end(bitmap, bitmap_size(nbits));
> 
> And, of course, as I suggested a couple iterations ago, you can invent
> a thread-safe version of find_bit(), that would be perfectly correct
> for lockless use:
> 
>   unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
>   {
>          unsigned long bit = 0;
>   
>          while (!test_and_clear_bit(bit, bitmap) {
>                  bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
>                  if (bit >= size)
>                          return size;
>          }
> 
>          return bit;
>   }

Hi, Yury,

But the code above effectively does the same as the READ_ONCE() macro
as defined in rwonce.h:

#ifndef __READ_ONCE
#define __READ_ONCE(x)	(*(const volatile __unqual_scalar_typeof(x) *)&(x))
#endif

#define READ_ONCE(x)							\
({									\
	compiletime_assert_rwonce_type(x);				\
	__READ_ONCE(x);							\
})

Both uses only prevent the funny stuff the compiler might otherwise do
to the read of addr[idx]; there's no black magic in READ_ONCE().

Both examples would probably result in the same assembly and produce the
same 2% slowdown ...

The only difference is that you declare volatile in one place while
READ_ONCE() is applied at each read, but that will only compile a bit
slower and generate the same machine code.
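
To make that concrete, here is a minimal userspace sketch (my own
construction, not kernel code) where a local MY_READ_ONCE() plays the
role of READ_ONCE() and __builtin_ctzl() stands in for __ffs(); the only
difference between the two helpers is whether the compiler is allowed to
re-read *addr after the test:

#include <stdio.h>

#define MY_READ_ONCE(x)	(*(const volatile typeof(x) *)&(x))

/* Plain fetch: the compiler may legally re-read *addr after the test. */
static unsigned long first_bit_plain(const unsigned long *addr)
{
	unsigned long val = *addr;

	return val ? __builtin_ctzl(val) : 8 * sizeof(unsigned long);
}

/* Volatile fetch: exactly one load is emitted, as with READ_ONCE(). */
static unsigned long first_bit_once(const unsigned long *addr)
{
	unsigned long val = MY_READ_ONCE(*addr);

	return val ? __builtin_ctzl(val) : 8 * sizeof(unsigned long);
}

int main(void)
{
	unsigned long word = 0x10;

	printf("%lu %lu\n", first_bit_plain(&word), first_bit_once(&word));
	return 0;
}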

Best regards,
Mirsad Todorovac


> Didn't test that, but I hope 'volatile' specifier should be enough
> for compiler to realize that it shouldn't optimize memory access, and
> for KCSAN that everything's OK here.
> 
> By the way, thank you for respectful and professional communication.
> 
> Thanks,
> Yury
Yury Norov Oct. 14, 2023, 2:53 a.m. UTC | #8
On Sat, Oct 14, 2023 at 04:21:50AM +0200, Mirsad Goran Todorovac wrote:
> On 10/14/2023 2:15 AM, Yury Norov wrote:
> > Restore LKML
> > 
> > On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
> > > On Wed 11-10-23 11:26:29, Yury Norov wrote:
> > > > Long story short: KCSAN found some potential issues related to how
> > > > people use bitmap API. And instead of working through that issues,
> > > > the following code shuts down KCSAN by applying READ_ONCE() here
> > > > and there.
> > > 
> > > I'm sorry but this is not what the patch does. I'm not sure how to get the
> > > message across so maybe let me start from a different angle:
> > > 
> > > Bitmaps are perfectly fine to be used without any external locking if
> > > only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
> > > used. This is a significant performance gain compared to using a spinlock
> > > or other locking and people do this for a long time. I hope we agree on
> > > that.
> > > 
> > > Now it is also common that you need to find a set / clear bit in a bitmap.
> > > To maintain lockless protocol and deal with races people employ schemes
> > > like (the dumbest form):
> > > 
> > > 	do {
> > > 		bit = find_first_bit(bitmap, n);
> > > 		if (bit >= n)
> > > 			abort...
> > > 	} while (!test_and_clear_bit(bit, bitmap));
> > > 
> > > So the code loops until it finds a set bit that is successfully cleared by
> > > it. This is perfectly fine and safe lockless code and such use should be
> > > supported. Agreed?
> > 
> > Great example. When you're running non-atomic functions concurrently,
> > the result may easily become incorrect, and this is what you're
> > demonstrating here.
> > 
> > Regarding find_first_bit() it means that:
> >   - it may erroneously return unset bit;
> >   - it may erroneously return non-first set bit;
> >   - it may erroneously return no bits for non-empty bitmap.
> > 
> > Effectively it means that find_first bit may just return a random number.
> > 
> > Let's take another example:
> > 
> > 	do {
> > 		bit = get_random_number();
> > 		if (bit >= n)
> > 			abort...
> > 	} while (!test_and_clear_bit(bit, bitmap));
> > 
> > When running concurrently, the difference between this and your code
> > is only in probability of getting set bit somewhere from around the
> > beginning of bitmap.
> > 
> > The key point is that find_bit() may return undef even if READ_ONCE() is
> > used. If bitmap gets changed anytime in the process, the result becomes
> > invalid. It may happen even after returning from find_first_bit().
> > 
> > And if my understanding correct, your code is designed in the
> > assumption that find_first_bit() may return garbage, so handles it
> > correctly.
> > 
> > > *Except* that the above actually is not safe due to find_first_bit()
> > > implementation and KCSAN warns about that. The problem is that:
> > > 
> > > Assume *addr == 1
> > > CPU1			CPU2
> > > find_first_bit(addr, 64)
> > >    val = *addr;
> > >    if (val) -> true
> > > 			clear_bit(0, addr)
> > >      val = *addr -> compiler decided to refetch addr contents for whatever
> > > 		   reason in the generated assembly
> > >      __ffs(val) -> now executed for value 0 which has undefined results.
> > 
> > Yes, __ffs(0) is undef. But the whole function is undef when accessing
> > bitmap concurrently.
> > 
> > > And the READ_ONCE() this patch adds prevents the compiler from adding the
> > > refetching of addr into the assembly.
> > 
> > That's true. But it doesn't improve on the situation. It was an undef
> > before, and it's undef after, but a 2% slower undef.
> > 
> > Now on that KCSAN warning. If I understand things correctly, for the
> > example above, KCSAN warning is false-positive, because you're
> > intentionally running lockless.
> > 
> > But for some other people it may be a true error, and now they'll have
> > no chance to catch it if KCSAN is forced to ignore find_bit() entirely.
> > 
> > We've got the whole class of lockless algorithms that allow safe concurrent
> > access to the memory. And now that there's a tool that searches for them
> > (concurrent accesses), we need to have an option to somehow teach it
> > to suppress irrelevant warnings. Maybe something like this?
> > 
> >          lockless_algorithm_begin(bitmap, bitmap_size(nbits));
> > 	do {
> > 		bit = find_first_bit(bitmap, nbits);
> > 		if (bit >= nbits)
> > 			break;
> > 	} while (!test_and_clear_bit(bit, bitmap));
> >          lockless_algorithm_end(bitmap, bitmap_size(nbits));
> > 
> > And, of course, as I suggested a couple iterations ago, you can invent
> > a thread-safe version of find_bit(), that would be perfectly correct
> > for lockless use:
> > 
> >   unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
> >   {
> >          unsigned long bit = 0;
> >          while (!test_and_clear_bit(bit, bitmap) {
> >                  bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
> >                  if (bit >= size)
> >                          return size;
> >          }
> > 
> >          return bit;
> >   }
> 
> Hi, Yuri,
> 
> But the code above effectively does the same as the READ_ONCE() macro
> as defined in rwonce.h:
> 
> #ifndef __READ_ONCE
> #define __READ_ONCE(x)	(*(const volatile __unqual_scalar_typeof(x) *)&(x))
> #endif
> 
> #define READ_ONCE(x)							\
> ({									\
> 	compiletime_assert_rwonce_type(x);				\
> 	__READ_ONCE(x);							\
> })
> 
> Both uses only prevent the funny stuff the compiler might have done to the
> read of the addr[idx], there's no black magic in READ_ONCE().
> 
> Both examples would probably result in the same assembly and produce the
> same 2% slowdown ...
> 
> Only you declare volatile in one place, and READ_ONCE() in each read, but
> this will only compile a bit slower and generate the same machine code.

The difference is that find_and_clear_bit() has the semantics of an
atomic operation. Those who decide to use it will also anticipate the
associated downsides. And the hundreds (or thousands) of other users of
the non-atomic find_bit() functions will not have to pay an extra cost
for unneeded atomicity.

Check how 'volatile' is used in test_and_clear_bit(), and consider
find_and_clear_bit() as a wrapper around test_and_clear_bit().

In other words, this patch suggests making find_bit() thread-safe by
using READ_ONCE(), and that doesn't work. find_and_clear_bit(), on the
other hand, is simply a wrapper around test_and_clear_bit(), and doesn't
imply any new restriction that test_and_clear_bit() doesn't.

Think of it as an optimized version of:
         while (bit < nbits && !test_and_clear_bit(bit, bitmap))
                bit++;

If you think it's worth trying in your code, I can send a patch for
you.

Thanks,
Yury
Mirsad Todorovac Oct. 14, 2023, 10:04 a.m. UTC | #9
On 10/14/23 04:53, Yury Norov wrote:
> On Sat, Oct 14, 2023 at 04:21:50AM +0200, Mirsad Goran Todorovac wrote:
>> On 10/14/2023 2:15 AM, Yury Norov wrote:
>>> Restore LKML
>>>
>>> On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
>>>> On Wed 11-10-23 11:26:29, Yury Norov wrote:
>>>>> Long story short: KCSAN found some potential issues related to how
>>>>> people use bitmap API. And instead of working through that issues,
>>>>> the following code shuts down KCSAN by applying READ_ONCE() here
>>>>> and there.
>>>>
>>>> I'm sorry but this is not what the patch does. I'm not sure how to get the
>>>> message across so maybe let me start from a different angle:
>>>>
>>>> Bitmaps are perfectly fine to be used without any external locking if
>>>> only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
>>>> used. This is a significant performance gain compared to using a spinlock
>>>> or other locking and people do this for a long time. I hope we agree on
>>>> that.
>>>>
>>>> Now it is also common that you need to find a set / clear bit in a bitmap.
>>>> To maintain lockless protocol and deal with races people employ schemes
>>>> like (the dumbest form):
>>>>
>>>> 	do {
>>>> 		bit = find_first_bit(bitmap, n);
>>>> 		if (bit >= n)
>>>> 			abort...
>>>> 	} while (!test_and_clear_bit(bit, bitmap));
>>>>
>>>> So the code loops until it finds a set bit that is successfully cleared by
>>>> it. This is perfectly fine and safe lockless code and such use should be
>>>> supported. Agreed?
>>>
>>> Great example. When you're running non-atomic functions concurrently,
>>> the result may easily become incorrect, and this is what you're
>>> demonstrating here.
>>>
>>> Regarding find_first_bit() it means that:
>>>    - it may erroneously return unset bit;
>>>    - it may erroneously return non-first set bit;
>>>    - it may erroneously return no bits for non-empty bitmap.
>>>
>>> Effectively it means that find_first bit may just return a random number.
>>>
>>> Let's take another example:
>>>
>>> 	do {
>>> 		bit = get_random_number();
>>> 		if (bit >= n)
>>> 			abort...
>>> 	} while (!test_and_clear_bit(bit, bitmap));
>>>
>>> When running concurrently, the difference between this and your code
>>> is only in probability of getting set bit somewhere from around the
>>> beginning of bitmap.
>>>
>>> The key point is that find_bit() may return undef even if READ_ONCE() is
>>> used. If bitmap gets changed anytime in the process, the result becomes
>>> invalid. It may happen even after returning from find_first_bit().
>>>
>>> And if my understanding correct, your code is designed in the
>>> assumption that find_first_bit() may return garbage, so handles it
>>> correctly.
>>>
>>>> *Except* that the above actually is not safe due to find_first_bit()
>>>> implementation and KCSAN warns about that. The problem is that:
>>>>
>>>> Assume *addr == 1
>>>> CPU1			CPU2
>>>> find_first_bit(addr, 64)
>>>>     val = *addr;
>>>>     if (val) -> true
>>>> 			clear_bit(0, addr)
>>>>       val = *addr -> compiler decided to refetch addr contents for whatever
>>>> 		   reason in the generated assembly
>>>>       __ffs(val) -> now executed for value 0 which has undefined results.
>>>
>>> Yes, __ffs(0) is undef. But the whole function is undef when accessing
>>> bitmap concurrently.
>>>
>>>> And the READ_ONCE() this patch adds prevents the compiler from adding the
>>>> refetching of addr into the assembly.
>>>
>>> That's true. But it doesn't improve on the situation. It was an undef
>>> before, and it's undef after, but a 2% slower undef.
>>>
>>> Now on that KCSAN warning. If I understand things correctly, for the
>>> example above, KCSAN warning is false-positive, because you're
>>> intentionally running lockless.
>>>
>>> But for some other people it may be a true error, and now they'll have
>>> no chance to catch it if KCSAN is forced to ignore find_bit() entirely.
>>>
>>> We've got the whole class of lockless algorithms that allow safe concurrent
>>> access to the memory. And now that there's a tool that searches for them
>>> (concurrent accesses), we need to have an option to somehow teach it
>>> to suppress irrelevant warnings. Maybe something like this?
>>>
>>>           lockless_algorithm_begin(bitmap, bitmap_size(nbits));
>>> 	do {
>>> 		bit = find_first_bit(bitmap, nbits);
>>> 		if (bit >= nbits)
>>> 			break;
>>> 	} while (!test_and_clear_bit(bit, bitmap));
>>>           lockless_algorithm_end(bitmap, bitmap_size(nbits));
>>>
>>> And, of course, as I suggested a couple iterations ago, you can invent
>>> a thread-safe version of find_bit(), that would be perfectly correct
>>> for lockless use:
>>>
>>>    unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
>>>    {
>>>           unsigned long bit = 0;
>>>           while (!test_and_clear_bit(bit, bitmap) {
>>>                   bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
>>>                   if (bit >= size)
>>>                           return size;
>>>           }
>>>
>>>           return bit;
>>>    }
>>
>> Hi, Yuri,
>>
>> But the code above effectively does the same as the READ_ONCE() macro
>> as defined in rwonce.h:
>>
>> #ifndef __READ_ONCE
>> #define __READ_ONCE(x)	(*(const volatile __unqual_scalar_typeof(x) *)&(x))
>> #endif
>>
>> #define READ_ONCE(x)							\
>> ({									\
>> 	compiletime_assert_rwonce_type(x);				\
>> 	__READ_ONCE(x);							\
>> })
>>
>> Both uses only prevent the funny stuff the compiler might have done to the
>> read of the addr[idx], there's no black magic in READ_ONCE().
>>
>> Both examples would probably result in the same assembly and produce the
>> same 2% slowdown ...
>>
>> Only you declare volatile in one place, and READ_ONCE() in each read, but
>> this will only compile a bit slower and generate the same machine code.
> 
> The difference is that find_and_clear_bit() has a semantics of
> atomic operation. Those who will decide to use it will also anticipate
> associate downsides. And other hundreds (or thousands) users of
> non-atomic find_bit() functions will not have to pay extra buck
> for unneeded atomicity.
> 
> Check how 'volatile' is used in test_and_clear_bit(), and consider
> find_and_clear_bit() as a wrapper around test_and_clear_bit().
> 
> In other words, this patch suggests to make find_bit() thread-safe by
> using READ_ONCE(), and it doesn't work. find_and_clear_bit(), on the
> other hand, is simply a wrapper around test_and_clear_bit(), and
> doesn't imply any new restriction that test_and_clear_bit() doesn't.
> 
> Think of it as an optimized version of:
>           while (bit < nbits && !test_and_clear_bit(bit, bitmap)
>                  bit++;
> 
> If you think it's worth to try in your code, I can send a patch for
> you.
> 
> Thanks,
> Yury

After some thinking, your declaration:

>>>    unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)

OK, this makes "addr" a pointer to a volatile array of unsigned longs.

But to this I have an objection:

>           while (bit < nbits && !test_and_clear_bit(bit, bitmap)
>                  bit++;

Note that there is nothing magical in an atomic test_and_clear_bit():
it has to read the entire (quad)word, remember the state of the bit,
clear it, and write the word back.

The point is that the LOCK prefix comes before the assembled instruction

LOCK BTR r/m32, imm8

so that it is executed atomically.

Otherwise there is no guarantee that another core wouldn't write back
its own idea of the value.

But atomic test_and_clear_bit() is not a free lunch: you would LOCK the
bus against all cores except yours once per bit, i.e. 32/64 times per
(quad)word tested.

That is 32/64 times more than optimal, and it looks like a real hog.

Do you see the difference?

Needless to say, this atomicity covers one bit only, and nothing
prevents, say, core 5 from atomically modifying/setting bits you have
already tested and found clear ...
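
A userspace sketch of the two cost models, using GCC's __atomic builtins
in place of the kernel's atomic bitops (so the helper names and memory
ordering are my assumptions, not kernel API): the first helper pays one
LOCKed read-modify-write per bit scanned, while the second pays one
plain read plus at most one LOCKed operation per claim attempt:

#include <limits.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Bit-by-bit scan: up to BITS_PER_LONG LOCKed read-modify-writes just
 * to learn that the word is empty (__atomic_fetch_and() stands in for
 * test_and_clear_bit()).
 */
static unsigned long claim_bit_per_bit(unsigned long *word)
{
	unsigned long bit;

	for (bit = 0; bit < BITS_PER_LONG; bit++) {
		unsigned long mask = 1UL << bit;

		if (__atomic_fetch_and(word, ~mask, __ATOMIC_RELAXED) & mask)
			return bit;	/* the bit was set and we cleared it */
	}
	return BITS_PER_LONG;
}

/*
 * Find-then-claim: one plain (racy, possibly stale) read to pick a
 * candidate and a single LOCKed operation to try to claim it.  The
 * caller retries when the claim loses a race.
 */
static unsigned long claim_bit_find_first(unsigned long *word)
{
	unsigned long snap = *(volatile unsigned long *)word;
	unsigned long bit, mask;

	if (!snap)
		return BITS_PER_LONG;
	bit = __builtin_ctzl(snap);
	mask = 1UL << bit;
	if (__atomic_fetch_and(word, ~mask, __ATOMIC_RELAXED) & mask)
		return bit;		/* claimed */
	return BITS_PER_LONG;		/* lost the race, retry */
}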

Ideally, you would use an atomic ffs(), which compiles to a single (and
thus atomic) BSF/BSR instruction on x86:

BSR r32, r/m32

(Alternatively, you might want BSF in your algorithm, to scan from the
least significant bit.)

Even better would be if we could also atomically clear that bit in memory and
have the instruction return its index.

The test is atomic because it is a single instruction, but it can also
be prefixed with a LOCK.

LOCK BSR reg, mem
LOCK BTR mem, reg

would unfortunately not work, because something could still change the
memory location in between the two instructions.

But your proposed algorithm is no more atomic than

         bit = ffs(bitmap);
	test_and_clear_bit(bit, bitmap);

And it is only less efficient, since you issue on average 16/32 bus
LOCKs instead of a single BSF/BSR instruction. Those LOCK prefixes mean
that all cores except yours are prevented from reading or modifying
memory while the bus is LOCKed (or only from writing, on smarter
architectures with a write-only lock, since reads won't hurt the
progress of test_and_clear_bit() and might merely observe an
out-of-sync value).

No fancy C function or macro can outsmart the CPU instruction set if the
instruction set doesn't provide such an operation.

For x86/x86_64, I couldn't find an instruction that atomically finds the
first set bit and clears it, returning its index ... but that doesn't
mean nobody else knows of one.

Regards,
Mirsad
Jan Kara Oct. 16, 2023, 9:22 a.m. UTC | #10
On Fri 13-10-23 17:15:28, Yury Norov wrote:
> On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
> > On Wed 11-10-23 11:26:29, Yury Norov wrote:
> > > Long story short: KCSAN found some potential issues related to how
> > > people use bitmap API. And instead of working through that issues,
> > > the following code shuts down KCSAN by applying READ_ONCE() here
> > > and there.
> > 
> > I'm sorry but this is not what the patch does. I'm not sure how to get the
> > message across so maybe let me start from a different angle:
> > 
> > Bitmaps are perfectly fine to be used without any external locking if
> > only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
> > used. This is a significant performance gain compared to using a spinlock
> > or other locking and people do this for a long time. I hope we agree on
> > that.
> > 
> > Now it is also common that you need to find a set / clear bit in a bitmap.
> > To maintain lockless protocol and deal with races people employ schemes
> > like (the dumbest form):
> > 
> > 	do {
> > 		bit = find_first_bit(bitmap, n);
> > 		if (bit >= n)
> > 			abort...
> > 	} while (!test_and_clear_bit(bit, bitmap));
> > 
> > So the code loops until it finds a set bit that is successfully cleared by
> > it. This is perfectly fine and safe lockless code and such use should be
> > supported. Agreed?
> 
> Great example. When you're running non-atomic functions concurrently,
> the result may easily become incorrect, and this is what you're
> demonstrating here.
> 
> Regarding find_first_bit() it means that:
>  - it may erroneously return unset bit;
>  - it may erroneously return non-first set bit;
>  - it may erroneously return no bits for non-empty bitmap.

Correct.

> Effectively it means that find_first bit may just return a random number.

I prefer to think that it can return a result that is no longer valid by
the time we further use it :)

> Let's take another example:
> 
> 	do {
> 		bit = get_random_number();
> 		if (bit >= n)
> 			abort...
> 	} while (!test_and_clear_bit(bit, bitmap));
> 
> When running concurrently, the difference between this and your code
> is only in probability of getting set bit somewhere from around the
> beginning of bitmap.

Well, as you say, the difference is in the probability - i.e., the
average number of loops taken is higher with a truly random number, and
that is the whole point. We bother with the complexity of lockless
access exactly because of performance :). As long as find_first_bit()
returns a set bit when there's no collision with another bitmap
modification, we are fine with its results (usually we don't expect the
collision to happen; often the bitmap users also employ schemes that
spread the processes modifying the bitmap across different parts of it
to further reduce the likelihood of a collision).

> The key point is that find_bit() may return undef even if READ_ONCE() is
> used. If bitmap gets changed anytime in the process, the result becomes
> invalid. It may happen even after returning from find_first_bit().
> 
> And if my understanding correct, your code is designed in the
> assumption that find_first_bit() may return garbage, so handles it
> correctly.

Yes, that is true.

> > *Except* that the above actually is not safe due to find_first_bit()
> > implementation and KCSAN warns about that. The problem is that:
> > 
> > Assume *addr == 1
> > CPU1			CPU2
> > find_first_bit(addr, 64)
> >   val = *addr;
> >   if (val) -> true
> > 			clear_bit(0, addr)
> >     val = *addr -> compiler decided to refetch addr contents for whatever
> > 		   reason in the generated assembly
> >     __ffs(val) -> now executed for value 0 which has undefined results.
> 
> Yes, __ffs(0) is undef. But the whole function is undef when accessing
> bitmap concurrently.

So here I think we get to the core of our misunderstanding :): Yes,
find_first_bit() may return a bit number that is not set any longer. But
it is guaranteed to return some number between 0 and n where n is the
bitmap size. What __ffs() does when passed a 0 value is unclear and
likely architecture dependent. If we are guaranteed it returns some
number between 0 and 8*sizeof(unsigned long), then we are fine. But I'm
concerned it may raise an exception (similarly to division by 0) or
return a number greater than 8*sizeof(unsigned long) on some
architecture, and that would be a problem. E.g. reading the x86 bsf
instruction documentation, the destination register is untouched if
there is no set bit, so the result can indeed be >
8*sizeof(unsigned long). So __ffs(0) can end up returning a number
beyond the end of the bitmap (e.g. 0xffffffff). And that is IMO
unacceptable output for find_first_bit().
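
A sketch of why that matters, close in spirit to the x86 implementation
but not a verbatim copy of arch/x86/include/asm/bitops.h (treat the
exact asm as an assumption): with a zero input, BSF sets ZF and leaves
the destination undefined per the SDM (unmodified in practice on many
CPUs), so the function returns whatever happened to be in the register:

/* x86-only illustration; not the kernel's exact __ffs() implementation. */
static inline unsigned long bsf_ffs(unsigned long word)
{
	asm("bsf %1, %0" : "=r" (word) : "rm" (word));
	return word;	/* garbage if 'word' was 0 on entry */
}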

> > And the READ_ONCE() this patch adds prevents the compiler from adding the
> > refetching of addr into the assembly.
> 
> That's true. But it doesn't improve on the situation. It was an undef
> before, and it's undef after, but a 2% slower undef.
> 
> Now on that KCSAN warning. If I understand things correctly, for the
> example above, KCSAN warning is false-positive, because you're
> intentionally running lockless.

As I wrote above, there are different levels of "undefinedness" and that
matters in this case. KCSAN is complaining that the value passed to the
__ffs() function may be different from the one tested in the condition
before it. Depending on the exact __ffs() behavior this may or may not
be fine.

> But for some other people it may be a true error, and now they'll have
> no chance to catch it if KCSAN is forced to ignore find_bit() entirely.

I agree some people may accidentally use the bitmap functions unlocked
without properly handling the races. However, in this case KCSAN does
not warn about unsafe use of the result of find_bit() (which is what
should happen for those unsafe uses). It complains about the unsafe
internal implementation of find_bit() when it is used without external
synchronization. These two are different things, so I don't think this
is a good argument for leaving the race in find_bit().

Furthermore I'd note that READ_ONCE() does not make KCSAN ignore
find_bit() completely. READ_ONCE() forces the compiler to use the same
value for the test and for the __ffs() argument (by using the "volatile"
keyword to tell it that it cannot assume the standard C memory model for
this fetch). That's all. That makes it impossible for KCSAN to inject a
modification of the bitmap and a refetch from memory in between the two
uses of the local variable, and thus it doesn't generate the warning
anymore.

> We've got the whole class of lockless algorithms that allow safe concurrent
> access to the memory. And now that there's a tool that searches for them
> (concurrent accesses), we need to have an option to somehow teach it
> to suppress irrelevant warnings. Maybe something like this?
> 
>         lockless_algorithm_begin(bitmap, bitmap_size(nbits));
> 	do {
> 		bit = find_first_bit(bitmap, nbits);
> 		if (bit >= nbits)
> 			break;
> 	} while (!test_and_clear_bit(bit, bitmap));
>         lockless_algorithm_end(bitmap, bitmap_size(nbits));
> 
> And, of course, as I suggested a couple iterations ago, you can invent
> a thread-safe version of find_bit(), that would be perfectly correct
> for lockless use:
> 
>  unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
>  {
>         unsigned long bit = 0;
>  
>         while (!test_and_clear_bit(bit, bitmap) {
>                 bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
>                 if (bit >= size)
>                         return size;
>         }
> 
>         return bit;
>  }
> 
> Didn't test that, but I hope 'volatile' specifier should be enough
> for compiler to realize that it shouldn't optimize memory access, and
> for KCSAN that everything's OK here. 

Based on my research regarding __ffs() we indeed do need find_*_bit()
functions that are guaranteed to return a number in the 0-n range even
in the presence of concurrent bitmap modifications. Do I get it right
that you'd rather prefer cloning all the find_*_bit() implementations to
create such variants? IMO that's worse both in terms of maintainability
(more code) and usability (users have to be aware that special functions
are needed for lockless code), so avoiding the 2% performance overhead
until gcc is fixed isn't IMO worth it, but you are the maintainer...

								Honza
kernel test robot Oct. 18, 2023, 4:23 p.m. UTC | #11
Hi Jan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.6-rc6 next-20231018]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/lib-find-Make-functions-safe-on-changing-bitmaps/20231011-230553
base:   linus/master
patch link:    https://lore.kernel.org/r/20231011150252.32737-1-jack%40suse.cz
patch subject: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps
config: i386-randconfig-002-20231018 (https://download.01.org/0day-ci/archive/20231019/202310190005.NqJcRXtK-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231019/202310190005.NqJcRXtK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310190005.NqJcRXtK-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/percpu.c: In function 'pcpu_build_alloc_info':
>> mm/percpu.c:2885:26: warning: array subscript 32 is above array bounds of 'int[32]' [-Warray-bounds]
    2885 |                 group_map[cpu] = group;
         |                 ~~~~~~~~~^~~~~
   mm/percpu.c:2842:20: note: while referencing 'group_map'
    2842 |         static int group_map[NR_CPUS] __initdata;
         |                    ^~~~~~~~~


vim +2885 mm/percpu.c

3c9a024fde58b0 Tejun Heo              2010-09-09  2813  
3c9a024fde58b0 Tejun Heo              2010-09-09  2814  /* pcpu_build_alloc_info() is used by both embed and page first chunk */
3c9a024fde58b0 Tejun Heo              2010-09-09  2815  #if defined(BUILD_EMBED_FIRST_CHUNK) || defined(BUILD_PAGE_FIRST_CHUNK)
3c9a024fde58b0 Tejun Heo              2010-09-09  2816  /**
3c9a024fde58b0 Tejun Heo              2010-09-09  2817   * pcpu_build_alloc_info - build alloc_info considering distances between CPUs
3c9a024fde58b0 Tejun Heo              2010-09-09  2818   * @reserved_size: the size of reserved percpu area in bytes
3c9a024fde58b0 Tejun Heo              2010-09-09  2819   * @dyn_size: minimum free size for dynamic allocation in bytes
3c9a024fde58b0 Tejun Heo              2010-09-09  2820   * @atom_size: allocation atom size
3c9a024fde58b0 Tejun Heo              2010-09-09  2821   * @cpu_distance_fn: callback to determine distance between cpus, optional
3c9a024fde58b0 Tejun Heo              2010-09-09  2822   *
3c9a024fde58b0 Tejun Heo              2010-09-09  2823   * This function determines grouping of units, their mappings to cpus
3c9a024fde58b0 Tejun Heo              2010-09-09  2824   * and other parameters considering needed percpu size, allocation
3c9a024fde58b0 Tejun Heo              2010-09-09  2825   * atom size and distances between CPUs.
3c9a024fde58b0 Tejun Heo              2010-09-09  2826   *
bffc4375897ea0 Yannick Guerrini       2015-03-06  2827   * Groups are always multiples of atom size and CPUs which are of
3c9a024fde58b0 Tejun Heo              2010-09-09  2828   * LOCAL_DISTANCE both ways are grouped together and share space for
3c9a024fde58b0 Tejun Heo              2010-09-09  2829   * units in the same group.  The returned configuration is guaranteed
3c9a024fde58b0 Tejun Heo              2010-09-09  2830   * to have CPUs on different nodes on different groups and >=75% usage
3c9a024fde58b0 Tejun Heo              2010-09-09  2831   * of allocated virtual address space.
3c9a024fde58b0 Tejun Heo              2010-09-09  2832   *
3c9a024fde58b0 Tejun Heo              2010-09-09  2833   * RETURNS:
3c9a024fde58b0 Tejun Heo              2010-09-09  2834   * On success, pointer to the new allocation_info is returned.  On
3c9a024fde58b0 Tejun Heo              2010-09-09  2835   * failure, ERR_PTR value is returned.
3c9a024fde58b0 Tejun Heo              2010-09-09  2836   */
258e0815e2b170 Dennis Zhou            2021-02-14  2837  static struct pcpu_alloc_info * __init __flatten pcpu_build_alloc_info(
3c9a024fde58b0 Tejun Heo              2010-09-09  2838  				size_t reserved_size, size_t dyn_size,
3c9a024fde58b0 Tejun Heo              2010-09-09  2839  				size_t atom_size,
3c9a024fde58b0 Tejun Heo              2010-09-09  2840  				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
3c9a024fde58b0 Tejun Heo              2010-09-09  2841  {
3c9a024fde58b0 Tejun Heo              2010-09-09  2842  	static int group_map[NR_CPUS] __initdata;
3c9a024fde58b0 Tejun Heo              2010-09-09  2843  	static int group_cnt[NR_CPUS] __initdata;
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2844  	static struct cpumask mask __initdata;
3c9a024fde58b0 Tejun Heo              2010-09-09  2845  	const size_t static_size = __per_cpu_end - __per_cpu_start;
3c9a024fde58b0 Tejun Heo              2010-09-09  2846  	int nr_groups = 1, nr_units = 0;
3c9a024fde58b0 Tejun Heo              2010-09-09  2847  	size_t size_sum, min_unit_size, alloc_size;
3f649ab728cda8 Kees Cook              2020-06-03  2848  	int upa, max_upa, best_upa;	/* units_per_alloc */
3c9a024fde58b0 Tejun Heo              2010-09-09  2849  	int last_allocs, group, unit;
3c9a024fde58b0 Tejun Heo              2010-09-09  2850  	unsigned int cpu, tcpu;
3c9a024fde58b0 Tejun Heo              2010-09-09  2851  	struct pcpu_alloc_info *ai;
3c9a024fde58b0 Tejun Heo              2010-09-09  2852  	unsigned int *cpu_map;
3c9a024fde58b0 Tejun Heo              2010-09-09  2853  
3c9a024fde58b0 Tejun Heo              2010-09-09  2854  	/* this function may be called multiple times */
3c9a024fde58b0 Tejun Heo              2010-09-09  2855  	memset(group_map, 0, sizeof(group_map));
3c9a024fde58b0 Tejun Heo              2010-09-09  2856  	memset(group_cnt, 0, sizeof(group_cnt));
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2857  	cpumask_clear(&mask);
3c9a024fde58b0 Tejun Heo              2010-09-09  2858  
3c9a024fde58b0 Tejun Heo              2010-09-09  2859  	/* calculate size_sum and ensure dyn_size is enough for early alloc */
3c9a024fde58b0 Tejun Heo              2010-09-09  2860  	size_sum = PFN_ALIGN(static_size + reserved_size +
3c9a024fde58b0 Tejun Heo              2010-09-09  2861  			    max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
3c9a024fde58b0 Tejun Heo              2010-09-09  2862  	dyn_size = size_sum - static_size - reserved_size;
3c9a024fde58b0 Tejun Heo              2010-09-09  2863  
3c9a024fde58b0 Tejun Heo              2010-09-09  2864  	/*
3c9a024fde58b0 Tejun Heo              2010-09-09  2865  	 * Determine min_unit_size, alloc_size and max_upa such that
3c9a024fde58b0 Tejun Heo              2010-09-09  2866  	 * alloc_size is multiple of atom_size and is the smallest
25985edcedea63 Lucas De Marchi        2011-03-30  2867  	 * which can accommodate 4k aligned segments which are equal to
3c9a024fde58b0 Tejun Heo              2010-09-09  2868  	 * or larger than min_unit_size.
3c9a024fde58b0 Tejun Heo              2010-09-09  2869  	 */
3c9a024fde58b0 Tejun Heo              2010-09-09  2870  	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
3c9a024fde58b0 Tejun Heo              2010-09-09  2871  
9c01516278ef87 Dennis Zhou (Facebook  2017-07-15  2872) 	/* determine the maximum # of units that can fit in an allocation */
3c9a024fde58b0 Tejun Heo              2010-09-09  2873  	alloc_size = roundup(min_unit_size, atom_size);
3c9a024fde58b0 Tejun Heo              2010-09-09  2874  	upa = alloc_size / min_unit_size;
f09f1243ca2d5d Alexander Kuleshov     2015-11-05  2875  	while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
3c9a024fde58b0 Tejun Heo              2010-09-09  2876  		upa--;
3c9a024fde58b0 Tejun Heo              2010-09-09  2877  	max_upa = upa;
3c9a024fde58b0 Tejun Heo              2010-09-09  2878  
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2879  	cpumask_copy(&mask, cpu_possible_mask);
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2880  
3c9a024fde58b0 Tejun Heo              2010-09-09  2881  	/* group cpus according to their proximity */
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2882  	for (group = 0; !cpumask_empty(&mask); group++) {
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2883  		/* pop the group's first cpu */
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2884  		cpu = cpumask_first(&mask);
3c9a024fde58b0 Tejun Heo              2010-09-09 @2885  		group_map[cpu] = group;
3c9a024fde58b0 Tejun Heo              2010-09-09  2886  		group_cnt[group]++;
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2887  		cpumask_clear_cpu(cpu, &mask);
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2888  
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2889  		for_each_cpu(tcpu, &mask) {
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2890  			if (!cpu_distance_fn ||
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2891  			    (cpu_distance_fn(cpu, tcpu) == LOCAL_DISTANCE &&
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2892  			     cpu_distance_fn(tcpu, cpu) == LOCAL_DISTANCE)) {
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2893  				group_map[tcpu] = group;
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2894  				group_cnt[group]++;
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2895  				cpumask_clear_cpu(tcpu, &mask);
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2896  			}
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2897  		}
3c9a024fde58b0 Tejun Heo              2010-09-09  2898  	}
d7d29ac76f7efb Wonhyuk Yang           2020-10-30  2899  	nr_groups = group;
3c9a024fde58b0 Tejun Heo              2010-09-09  2900  
3c9a024fde58b0 Tejun Heo              2010-09-09  2901  	/*
9c01516278ef87 Dennis Zhou (Facebook  2017-07-15  2902) 	 * Wasted space is caused by a ratio imbalance of upa to group_cnt.
9c01516278ef87 Dennis Zhou (Facebook  2017-07-15  2903) 	 * Expand the unit_size until we use >= 75% of the units allocated.
9c01516278ef87 Dennis Zhou (Facebook  2017-07-15  2904) 	 * Related to atom_size, which could be much larger than the unit_size.
3c9a024fde58b0 Tejun Heo              2010-09-09  2905  	 */
3c9a024fde58b0 Tejun Heo              2010-09-09  2906  	last_allocs = INT_MAX;
4829c791b22f98 Dennis Zhou            2021-06-14  2907  	best_upa = 0;
3c9a024fde58b0 Tejun Heo              2010-09-09  2908  	for (upa = max_upa; upa; upa--) {
3c9a024fde58b0 Tejun Heo              2010-09-09  2909  		int allocs = 0, wasted = 0;
3c9a024fde58b0 Tejun Heo              2010-09-09  2910  
f09f1243ca2d5d Alexander Kuleshov     2015-11-05  2911  		if (alloc_size % upa || (offset_in_page(alloc_size / upa)))
3c9a024fde58b0 Tejun Heo              2010-09-09  2912  			continue;
3c9a024fde58b0 Tejun Heo              2010-09-09  2913  
3c9a024fde58b0 Tejun Heo              2010-09-09  2914  		for (group = 0; group < nr_groups; group++) {
3c9a024fde58b0 Tejun Heo              2010-09-09  2915  			int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
3c9a024fde58b0 Tejun Heo              2010-09-09  2916  			allocs += this_allocs;
3c9a024fde58b0 Tejun Heo              2010-09-09  2917  			wasted += this_allocs * upa - group_cnt[group];
3c9a024fde58b0 Tejun Heo              2010-09-09  2918  		}
3c9a024fde58b0 Tejun Heo              2010-09-09  2919  
3c9a024fde58b0 Tejun Heo              2010-09-09  2920  		/*
3c9a024fde58b0 Tejun Heo              2010-09-09  2921  		 * Don't accept if wastage is over 1/3.  The
3c9a024fde58b0 Tejun Heo              2010-09-09  2922  		 * greater-than comparison ensures upa==1 always
3c9a024fde58b0 Tejun Heo              2010-09-09  2923  		 * passes the following check.
3c9a024fde58b0 Tejun Heo              2010-09-09  2924  		 */
3c9a024fde58b0 Tejun Heo              2010-09-09  2925  		if (wasted > num_possible_cpus() / 3)
3c9a024fde58b0 Tejun Heo              2010-09-09  2926  			continue;
3c9a024fde58b0 Tejun Heo              2010-09-09  2927  
3c9a024fde58b0 Tejun Heo              2010-09-09  2928  		/* and then don't consume more memory */
3c9a024fde58b0 Tejun Heo              2010-09-09  2929  		if (allocs > last_allocs)
3c9a024fde58b0 Tejun Heo              2010-09-09  2930  			break;
3c9a024fde58b0 Tejun Heo              2010-09-09  2931  		last_allocs = allocs;
3c9a024fde58b0 Tejun Heo              2010-09-09  2932  		best_upa = upa;
3c9a024fde58b0 Tejun Heo              2010-09-09  2933  	}
4829c791b22f98 Dennis Zhou            2021-06-14  2934  	BUG_ON(!best_upa);
3c9a024fde58b0 Tejun Heo              2010-09-09  2935  	upa = best_upa;
3c9a024fde58b0 Tejun Heo              2010-09-09  2936  
3c9a024fde58b0 Tejun Heo              2010-09-09  2937  	/* allocate and fill alloc_info */
3c9a024fde58b0 Tejun Heo              2010-09-09  2938  	for (group = 0; group < nr_groups; group++)
3c9a024fde58b0 Tejun Heo              2010-09-09  2939  		nr_units += roundup(group_cnt[group], upa);
3c9a024fde58b0 Tejun Heo              2010-09-09  2940  
3c9a024fde58b0 Tejun Heo              2010-09-09  2941  	ai = pcpu_alloc_alloc_info(nr_groups, nr_units);
3c9a024fde58b0 Tejun Heo              2010-09-09  2942  	if (!ai)
3c9a024fde58b0 Tejun Heo              2010-09-09  2943  		return ERR_PTR(-ENOMEM);
3c9a024fde58b0 Tejun Heo              2010-09-09  2944  	cpu_map = ai->groups[0].cpu_map;
3c9a024fde58b0 Tejun Heo              2010-09-09  2945  
3c9a024fde58b0 Tejun Heo              2010-09-09  2946  	for (group = 0; group < nr_groups; group++) {
3c9a024fde58b0 Tejun Heo              2010-09-09  2947  		ai->groups[group].cpu_map = cpu_map;
3c9a024fde58b0 Tejun Heo              2010-09-09  2948  		cpu_map += roundup(group_cnt[group], upa);
3c9a024fde58b0 Tejun Heo              2010-09-09  2949  	}
3c9a024fde58b0 Tejun Heo              2010-09-09  2950  
3c9a024fde58b0 Tejun Heo              2010-09-09  2951  	ai->static_size = static_size;
3c9a024fde58b0 Tejun Heo              2010-09-09  2952  	ai->reserved_size = reserved_size;
3c9a024fde58b0 Tejun Heo              2010-09-09  2953  	ai->dyn_size = dyn_size;
3c9a024fde58b0 Tejun Heo              2010-09-09  2954  	ai->unit_size = alloc_size / upa;
3c9a024fde58b0 Tejun Heo              2010-09-09  2955  	ai->atom_size = atom_size;
3c9a024fde58b0 Tejun Heo              2010-09-09  2956  	ai->alloc_size = alloc_size;
3c9a024fde58b0 Tejun Heo              2010-09-09  2957  
2de7852fe9096e Peng Fan               2019-02-20  2958  	for (group = 0, unit = 0; group < nr_groups; group++) {
3c9a024fde58b0 Tejun Heo              2010-09-09  2959  		struct pcpu_group_info *gi = &ai->groups[group];
3c9a024fde58b0 Tejun Heo              2010-09-09  2960  
3c9a024fde58b0 Tejun Heo              2010-09-09  2961  		/*
3c9a024fde58b0 Tejun Heo              2010-09-09  2962  		 * Initialize base_offset as if all groups are located
3c9a024fde58b0 Tejun Heo              2010-09-09  2963  		 * back-to-back.  The caller should update this to
3c9a024fde58b0 Tejun Heo              2010-09-09  2964  		 * reflect actual allocation.
3c9a024fde58b0 Tejun Heo              2010-09-09  2965  		 */
3c9a024fde58b0 Tejun Heo              2010-09-09  2966  		gi->base_offset = unit * ai->unit_size;
3c9a024fde58b0 Tejun Heo              2010-09-09  2967  
3c9a024fde58b0 Tejun Heo              2010-09-09  2968  		for_each_possible_cpu(cpu)
3c9a024fde58b0 Tejun Heo              2010-09-09  2969  			if (group_map[cpu] == group)
3c9a024fde58b0 Tejun Heo              2010-09-09  2970  				gi->cpu_map[gi->nr_units++] = cpu;
3c9a024fde58b0 Tejun Heo              2010-09-09  2971  		gi->nr_units = roundup(gi->nr_units, upa);
3c9a024fde58b0 Tejun Heo              2010-09-09  2972  		unit += gi->nr_units;
3c9a024fde58b0 Tejun Heo              2010-09-09  2973  	}
3c9a024fde58b0 Tejun Heo              2010-09-09  2974  	BUG_ON(unit != nr_units);
3c9a024fde58b0 Tejun Heo              2010-09-09  2975  
3c9a024fde58b0 Tejun Heo              2010-09-09  2976  	return ai;
3c9a024fde58b0 Tejun Heo              2010-09-09  2977  }
23f917169ef157 Kefeng Wang            2022-01-19  2978
kernel test robot Oct. 25, 2023, 7:18 a.m. UTC | #12
Hello,

kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:


commit: df671b17195cd6526e029c70d04dfb72561082d7 ("[PATCH 1/2] lib/find: Make functions safe on changing bitmaps")
url: https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/lib-find-Make-functions-safe-on-changing-bitmaps/20231011-230553
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 1c8b86a3799f7e5be903c3f49fcdaee29fd385b5
patch link: https://lore.kernel.org/all/20231011150252.32737-1-jack@suse.cz/
patch subject: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps

testcase: will-it-scale
test machine: 104 threads 2 sockets (Skylake) with 192G memory
parameters:

	nr_task: 50%
	mode: thread
	test: tlb_flush3
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231025/202310251458.48b4452d-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/tlb_flush3/will-it-scale

commit: 
  1c8b86a379 ("Merge tag 'xsa441-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip")
  df671b1719 ("lib/find: Make functions safe on changing bitmaps")

1c8b86a3799f7e5b df671b17195cd6526e029c70d04 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      0.14 ± 19%     +36.9%       0.19 ± 17%  perf-sched.wait_time.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
  2.26e+08            +3.6%  2.343e+08        proc-vmstat.pgfault
      0.04           +25.0%       0.05        turbostat.IPC
     32666           -15.5%      27605 ±  2%  turbostat.POLL
      7856            +2.2%       8025        vmstat.system.cs
   6331931            +2.3%    6478704        vmstat.system.in
    700119            +3.7%     725931        will-it-scale.52.threads
     13463            +3.7%      13959        will-it-scale.per_thread_ops
    700119            +3.7%     725931        will-it-scale.workload
      8.36            -7.3%       7.74        perf-stat.i.MPKI
 4.591e+09            +3.4%  4.747e+09        perf-stat.i.branch-instructions
 1.832e+08            +2.8%  1.883e+08        perf-stat.i.branch-misses
     26.70            -0.3       26.40        perf-stat.i.cache-miss-rate%
      7852            +2.2%       8021        perf-stat.i.context-switches
      6.43            -7.2%       5.97        perf-stat.i.cpi
    769.61            +1.8%     783.29        perf-stat.i.cpu-migrations
  6.39e+09            +3.4%  6.606e+09        perf-stat.i.dTLB-loads
  2.94e+09            +3.2%  3.035e+09        perf-stat.i.dTLB-stores
     78.29            -0.9       77.44        perf-stat.i.iTLB-load-miss-rate%
  18959450            +3.5%   19621273        perf-stat.i.iTLB-load-misses
   5254435            +8.7%    5713444        perf-stat.i.iTLB-loads
 2.236e+10            +7.7%  2.408e+10        perf-stat.i.instructions
      1181            +4.0%       1228        perf-stat.i.instructions-per-iTLB-miss
      0.16            +7.7%       0.17        perf-stat.i.ipc
      0.02 ± 36%     -49.6%       0.01 ± 53%  perf-stat.i.major-faults
    485.08            +3.0%     499.67        perf-stat.i.metric.K/sec
    141.71            +3.2%     146.25        perf-stat.i.metric.M/sec
    747997            +3.7%     775416        perf-stat.i.minor-faults
   3127957           -13.9%    2693728        perf-stat.i.node-loads
  26089697            +3.4%   26965335        perf-stat.i.node-store-misses
    767569            +3.7%     796095        perf-stat.i.node-stores
    747997            +3.7%     775416        perf-stat.i.page-faults
      8.35            -7.3%       7.74        perf-stat.overall.MPKI
     26.70            -0.3       26.40        perf-stat.overall.cache-miss-rate%
      6.43            -7.1%       5.97        perf-stat.overall.cpi
     78.30            -0.9       77.45        perf-stat.overall.iTLB-load-miss-rate%
      1179            +4.0%       1226        perf-stat.overall.instructions-per-iTLB-miss
      0.16            +7.7%       0.17        perf-stat.overall.ipc
   9644584            +3.8%   10011125        perf-stat.overall.path-length
 4.575e+09            +3.4%  4.731e+09        perf-stat.ps.branch-instructions
 1.825e+08            +2.8%  1.876e+08        perf-stat.ps.branch-misses
      7825            +2.2%       7995        perf-stat.ps.context-switches
    767.16            +1.8%     780.76        perf-stat.ps.cpu-migrations
 6.368e+09            +3.4%  6.583e+09        perf-stat.ps.dTLB-loads
  2.93e+09            +3.2%  3.025e+09        perf-stat.ps.dTLB-stores
  18896725            +3.5%   19555325        perf-stat.ps.iTLB-load-misses
   5236456            +8.7%    5693636        perf-stat.ps.iTLB-loads
 2.229e+10            +7.6%  2.399e+10        perf-stat.ps.instructions
    745423            +3.7%     772705        perf-stat.ps.minor-faults
   3117663           -13.9%    2684861        perf-stat.ps.node-loads
  26002765            +3.4%   26875267        perf-stat.ps.node-store-misses
    764789            +3.7%     793098        perf-stat.ps.node-stores
    745423            +3.7%     772705        perf-stat.ps.page-faults
 6.752e+12            +7.6%  7.267e+12        perf-stat.total.instructions
     19.21            -1.0       18.18        perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
     17.00            -0.9       16.09        perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
     65.30            -0.6       64.69        perf-profile.calltrace.cycles-pp.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64
     65.34            -0.6       64.75        perf-profile.calltrace.cycles-pp.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe
     65.98            -0.5       65.45        perf-profile.calltrace.cycles-pp.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
     65.96            -0.5       65.42        perf-profile.calltrace.cycles-pp.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
      9.72 ±  2%      -0.5        9.20        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
     66.33            -0.5       65.81        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
     66.46            -0.5       65.95        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__madvise
     31.88            -0.4       31.43        perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single
     67.72            -0.4       67.28        perf-profile.calltrace.cycles-pp.__madvise
     32.15            -0.4       31.73        perf-profile.calltrace.cycles-pp.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior
     32.60            -0.4       32.21        perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior.do_madvise
     32.93            -0.3       32.58        perf-profile.calltrace.cycles-pp.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise
     31.07            -0.3       30.74        perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.zap_pte_range.zap_pmd_range.unmap_page_range.zap_page_range_single
     31.58            -0.3       31.28        perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.zap_page_range_single.madvise_vma_behavior
     31.61            -0.3       31.30        perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.zap_page_range_single.madvise_vma_behavior.do_madvise
     31.80            -0.3       31.51        perf-profile.calltrace.cycles-pp.unmap_page_range.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise
      8.34            -0.1        8.22        perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask
      8.06            -0.1        7.95        perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond
      7.98            -0.1        7.87        perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch
      0.59 ±  3%      +0.1        0.65 ±  2%  perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.testcase
      1.46            +0.1        1.53        perf-profile.calltrace.cycles-pp.filemap_map_pages.do_read_fault.do_fault.__handle_mm_fault.handle_mm_fault
      1.48            +0.1        1.55        perf-profile.calltrace.cycles-pp.do_read_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
      1.53            +0.1        1.62        perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
      2.92            +0.1        3.02        perf-profile.calltrace.cycles-pp.flush_tlb_func.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
      1.26 ±  2%      +0.1        1.36        perf-profile.calltrace.cycles-pp.default_send_IPI_mask_sequence_phys.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
      1.84            +0.1        1.96        perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      7.87            +0.1        8.00        perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
      2.03 ±  2%      +0.1        2.17        perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      2.90            +0.2        3.06        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
      2.62 ±  3%      +0.2        2.80        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
      2.58 ±  3%      +0.2        2.76        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      2.95 ±  3%      +0.2        3.14        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
      2.75            +0.2        2.94        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
      4.96            +0.3        5.29        perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask
      4.92            +0.3        5.25        perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond
      5.13            +0.3        5.46        perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
      5.08            +0.4        5.44        perf-profile.calltrace.cycles-pp.testcase
     37.25            -2.0       35.24        perf-profile.children.cycles-pp.llist_add_batch
     62.82            -0.8       62.04        perf-profile.children.cycles-pp.on_each_cpu_cond_mask
     62.82            -0.8       62.04        perf-profile.children.cycles-pp.smp_call_function_many_cond
     63.70            -0.7       62.98        perf-profile.children.cycles-pp.flush_tlb_mm_range
     65.30            -0.6       64.70        perf-profile.children.cycles-pp.zap_page_range_single
     65.34            -0.6       64.75        perf-profile.children.cycles-pp.madvise_vma_behavior
     65.98            -0.5       65.45        perf-profile.children.cycles-pp.__x64_sys_madvise
     65.96            -0.5       65.43        perf-profile.children.cycles-pp.do_madvise
     66.52            -0.5       66.01        perf-profile.children.cycles-pp.do_syscall_64
     66.65            -0.5       66.16        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     67.79            -0.4       67.36        perf-profile.children.cycles-pp.__madvise
     32.94            -0.3       32.60        perf-profile.children.cycles-pp.tlb_finish_mmu
     31.74            -0.3       31.43        perf-profile.children.cycles-pp.zap_pte_range
     31.76            -0.3       31.46        perf-profile.children.cycles-pp.zap_pmd_range
     31.95            -0.3       31.66        perf-profile.children.cycles-pp.unmap_page_range
      0.42 ±  2%      +0.0        0.46        perf-profile.children.cycles-pp.error_entry
      0.20 ±  3%      +0.0        0.24 ±  5%  perf-profile.children.cycles-pp.up_read
      0.69            +0.0        0.74        perf-profile.children.cycles-pp.native_flush_tlb_local
      1.47            +0.1        1.55        perf-profile.children.cycles-pp.filemap_map_pages
      1.48            +0.1        1.56        perf-profile.children.cycles-pp.do_read_fault
      1.54            +0.1        1.62        perf-profile.children.cycles-pp.do_fault
      2.75            +0.1        2.86        perf-profile.children.cycles-pp.default_send_IPI_mask_sequence_phys
      1.85            +0.1        1.98        perf-profile.children.cycles-pp.__handle_mm_fault
      2.04 ±  2%      +0.1        2.18        perf-profile.children.cycles-pp.handle_mm_fault
      2.63 ±  3%      +0.2        2.81        perf-profile.children.cycles-pp.exc_page_fault
      2.62 ±  3%      +0.2        2.80        perf-profile.children.cycles-pp.do_user_addr_fault
      3.24 ±  3%      +0.2        3.44        perf-profile.children.cycles-pp.asm_exc_page_fault
      3.83            +0.2        4.04        perf-profile.children.cycles-pp.flush_tlb_func
      0.69 ±  2%      +0.2        0.92        perf-profile.children.cycles-pp._find_next_bit
      9.92            +0.3       10.23        perf-profile.children.cycles-pp.llist_reverse_order
      5.45            +0.4        5.81        perf-profile.children.cycles-pp.testcase
     18.42            +0.5       18.96        perf-profile.children.cycles-pp.asm_sysvec_call_function
     16.24            +0.5       16.78        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
     15.78            +0.5       16.32        perf-profile.children.cycles-pp.__sysvec_call_function
     16.36            +0.5       16.90        perf-profile.children.cycles-pp.sysvec_call_function
     27.92            -1.9       26.04        perf-profile.self.cycles-pp.llist_add_batch
      0.16 ±  2%      +0.0        0.18 ±  4%  perf-profile.self.cycles-pp.up_read
      0.42 ±  2%      +0.0        0.45        perf-profile.self.cycles-pp.error_entry
      0.21 ±  4%      +0.0        0.24 ±  5%  perf-profile.self.cycles-pp.down_read
      0.26 ±  2%      +0.0        0.29 ±  3%  perf-profile.self.cycles-pp.tlb_finish_mmu
      2.01            +0.0        2.05        perf-profile.self.cycles-pp.default_send_IPI_mask_sequence_phys
      0.68            +0.0        0.73        perf-profile.self.cycles-pp.native_flush_tlb_local
      3.10            +0.2        3.26        perf-profile.self.cycles-pp.flush_tlb_func
      0.50 ±  2%      +0.2        0.68        perf-profile.self.cycles-pp._find_next_bit
      9.92            +0.3       10.22        perf-profile.self.cycles-pp.llist_reverse_order
     16.10            +0.5       16.64        perf-profile.self.cycles-pp.smp_call_function_many_cond




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Rasmus Villemoes Oct. 25, 2023, 8:18 a.m. UTC | #13
On 25/10/2023 09.18, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:

So with that, can we please just finally say "yeah, let's make the
generic bitmap library functions correct and usable in more cases"
instead of worrying about random micro-benchmarks that just show
you-win-some-you-lose-some.

Yes, users will have to treat results from the find routines carefully
if their bitmap may be concurrently modified. They do. Nobody wins if
those users are forced to implement their own bitmap routines for their
lockless algorithms.
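
For illustration, a minimal sketch of that careful treatment, loosely
modelled on how xarray consumers use tag bitmaps (lookup_tagged() and
lookup_slot() are hypothetical names, not code from this series):

	/* The bit index found on a concurrently modified bitmap is only a
	 * hint; re-validate it against the authoritative structure. */
	static void *lookup_tagged(const unsigned long *tags, unsigned long nbits)
	{
		unsigned long bit;

		for (bit = find_first_bit(tags, nbits);
		     bit < nbits;
		     bit = find_next_bit(tags, nbits, bit + 1)) {
			void *entry = lookup_slot(bit);	/* authoritative check */

			if (entry)
				return entry;	/* the stale hint was still good */
		}

		return NULL;	/* bitmap changed under us; caller may retry */
	}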

Rasmus
Yury Norov Oct. 27, 2023, 3:51 a.m. UTC | #14
On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
> On 25/10/2023 09.18, kernel test robot wrote:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
> 
> So with that, can we please just finally say "yeah, let's make the
> generic bitmap library functions correct

They are all correct already.

> and usable in more cases"

See below.

> instead of worrying about random micro-benchmarks that just show
> you-win-some-you-lose-some.

That I agree with. I don't worry about either a +2% or a -3% benchmark,
and I don't think they alone can justify (or rule out) such a radical
change as making all find_bit functions volatile and shutting down
a newborn KCSAN.

Keeping that in mind, my best guess is that Jan's and Mirsad's test
that shows +2% was against stable bitmaps, while what the robot measured
is most likely against heavily concurrent access to some bitmap in
the kernel.

I didn't look at either test's sources, but that at least makes some
sense, because if GCC correctly optimizes code against properly
described memory, this is exactly what we would expect.

> Yes, users will have to treat results from the find routines carefully
> if their bitmap may be concurrently modified. They do. Nobody wins if
> those users are forced to implement their own bitmap routines for their
> lockless algorithms.

Again, I agree with this point, and I'm trying to address exactly this.

I'm working on a series that introduces lockless find_bit functions
based on existing FIND_BIT() engine. It's not ready yet, but I hope
I'll submit it in the next merge window.

https://github.com/norov/linux/commits/find_and_bit

Now that we've got a test that presumably works faster if find_bit()
functions are all switched to be volatile, it would be great if we get
into details and understand:
 - what find_bit function or functions give that gain in performance;
 - on what bitmap(s);
 - is the reason in concurrent memory access (guess yes), and if so,
 - can we refactor the code to use lockless find_and_bit() functions
   mentioned above;
 - if not, how else can we address this.

If you or someone else have an extra time slot to get deeper into
that, I'll be really thankful. 

Thanks,
Yury
Jan Kara Oct. 27, 2023, 9:55 a.m. UTC | #15
On Thu 26-10-23 20:51:22, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
> > Yes, users will have to treat results from the find routines carefully
> > if their bitmap may be concurrently modified. They do. Nobody wins if
> > those users are forced to implement their own bitmap routines for their
> > lockless algorithms.
> 
> Again, I agree with this point, and I'm trying to address exactly this.
> 
> I'm working on a series that introduces lockless find_bit functions
> based on existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
> 
> https://github.com/norov/linux/commits/find_and_bit

I agree that the find_and_{set|clear}() bit functions are useful and we'll
be able to remove some boilerplate code with them. But also note that you
will need to duplicate practically all of the bitmap API to provide similar
"atomic" functionality - e.g. the sbitmap conversion you have in your
branch has a bug in that it drops the 'lock' memory ordering from the bitmap
manipulation. So you need something like find_and_set_bit_lock() (which you
already have) and find_and_set_bit_wrap_lock() (which you don't have yet).
If you are to convert bitmap code in filesystems (some of which is lockless
as well), you will need to add little and big endian variants of volatile
bitmap functions. Finally there are users like lib/xarray.c which don't
want to set/clear the found bit (we just want to quickly find a good guess for
a set bit in the bitmap and we then verify in another structure whether the
guess was right or not). So we'll need the volatile variant of plain
find_first_bit(), find_next_bit() as well. Also when we have variants of
bitmap functions that are safe wrt parallel changes and those that are not,
the function names should probably indicate which is which.
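
To make the 'lock' ordering point concrete, here is a rough sketch of
what such a helper has to do (an illustration only, not the
implementation from your branch):

	/* Find a clear bit and claim it with acquire ("lock") ordering,
	 * retrying when another CPU wins the race for the same bit. */
	static inline unsigned long find_and_set_bit_lock(unsigned long *addr,
							  unsigned long nbits)
	{
		unsigned long bit;

		do {
			bit = find_next_zero_bit(addr, nbits, 0);
			if (bit >= nbits)
				return nbits;	/* bitmap is full */
		} while (test_and_set_bit_lock(bit, addr));

		return bit;
	}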

So as much as I agree your solution is theoretically the cleanest, I
personally don't think the cost in terms of code duplication and code churn
is really worth it.

> Now that we've got a test that presumably works faster if find_bit()
> functions are all switched to be volatile, it would be great if we get
> into details and understand:
>  - what find_bit function or functions gives that gain in performance;
>  - on what bitmap(s);
>  - is the reason in concurrent memory access (guess yes), and if so,
>  - can we refactor the code to use lockless find_and_bit() functions
>    mentioned above;
>  - if not, how else can we address this.

Frankly, I don't think there's any substantial reason why the volatile or
non-volatile code should be faster. The guys from the compiler team looked at
the x86 disassembly and said both variants should be the same speed based on
instruction costs. What could be causing these small performance differences
is that the resulting code is laid out slightly differently (and the
volatile bitmap functions end up being somewhat larger) and that somehow
interferes with instruction caching or CPU-internal out-of-order execution.

								Honza
Mirsad Todorovac Oct. 27, 2023, 3:51 p.m. UTC | #16
On 10/27/2023 5:51 AM, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
>> On 25/10/2023 09.18, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
>>
>> So with that, can we please just finally say "yeah, let's make the
>> generic bitmap library functions correct
> 
> They are all correct already.
> 
>> and usable in more cases"
> 
> See below.
> 
>> instead of worrying about random micro-benchmarks that just show
>> you-win-some-you-lose-some.
> 
> That I agree with. I don't worry about either a +2% or a -3% benchmark,
> and I don't think they alone can justify (or rule out) such a radical
> change as making all find_bit functions volatile and shutting down
> a newborn KCSAN.
> 
> Keeping that in mind, my best guess is that Jan's and Mirsad's test
> that shows +2% was against stable bitmaps, while what the robot measured
> is most likely against heavily concurrent access to some bitmap in
> the kernel.
> 
> I didn't look at either test's sources, but that at least makes some
> sense, because if GCC correctly optimizes code against properly
> described memory, this is exactly what we would expect.
> 
>> Yes, users will have to treat results from the find routines carefully
>> if their bitmap may be concurrently modified. They do. Nobody wins if
>> those users are forced to implement their own bitmap routines for their
>> lockless algorithms.
> 
> Again, I agree with this point, and I'm trying to address exactly this.
> 
> I'm working on a series that introduces lockless find_bit functions
> based on existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
> 
> https://github.com/norov/linux/commits/find_and_bit
> 
> Now that we've got a test that presumably works faster if find_bit()
> functions are all switched to be volatile, it would be great if we get
> into details and understand:
>   - what find_bit function or functions give that gain in performance;
>   - on what bitmap(s);

I am positive that your per-bit test_and_clear_bit() loop wouldn't
improve performance. No insult intended, but it has to:

- LOCK the bus
- READ the (byte/word/longword/quadword) from memory
- TEST the bit and remember the result in FLAGS
- CLEAR the bit
- WRITE back the entire byte/word/longword/quadword

So, instead of speeding things up, you'd end up doing that locked sequence
64 times in the worst case, where only the last bit in the quadword is set.
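
Spelled out, the per-bit pattern being criticized looks roughly like
this (bitmap, nbits and handle_bit() are placeholders):

	unsigned long bit;

	for (bit = 0; bit < nbits; bit++)
		if (test_and_clear_bit(bit, bitmap))	/* one locked RMW per bit position */
			handle_bit(bit);		/* placeholder consumer */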

What we need is an ffs() that expands to the assembly instruction BSF
(bit scan forward)

  [1] https://www.felixcloutier.com/x86/bsf

followed by a BTR (bit test and reset)

  [2] https://www.felixcloutier.com/x86/btr

Ideally, we'd have an instruction that does both atomically, or a way
to LOCK the bus across two instructions. But bad guys could use that to
stall all cores indefinitely:

DON'T DO THIS!
-----------------------------
      LOCK
loop:
      BSF r16, m16/m32/m64
      BTR m16/m32/m64, r16
      JMP loop
      UNLOCK
-----------------------------

This would work better with a hardware-assisted CAM locking device
than with stopping all cores from reading and writing on each memory barrier.

But this is a long story.
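
In portable kernel C, the usual way to get the BSF+BTR effect without
holding the bus across two instructions is to retry when the two race;
a minimal sketch (illustration only, not code from this series):

	/* Scan for a set bit (BSF-style), then clear just that bit with a
	 * single locked BTR; if another CPU cleared it first, rescan. */
	static inline unsigned long find_and_clear_bit_sketch(unsigned long *addr,
							      unsigned long nbits)
	{
		unsigned long bit;

		do {
			bit = find_first_bit(addr, nbits);
			if (bit >= nbits)
				return nbits;		/* nothing set */
		} while (!test_and_clear_bit(bit, addr));	/* lost the race: retry */

		return bit;
	}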

>   - is the reason in concurrent memory access (guess yes), and if so,
>   - can we refactor the code to use lockless find_and_bit() functions
>     mentioned above;
>   - if not, how else can we address this.
> 
> If you or someone else has an extra time slot to dig deeper into
> this, I'll be really thankful.

I don't know whether the Linux kernel takes advantage of a transactional
memory device if it finds one.

The idea of each mutex() or even user-space futex() stalling all cores
on an "asm LOCK CMPXCHG m8/m16/m32/m64, r8/r16/r32/r64" simply doesn't
scale well with, e.g., 128 cores that have to wait for one long
read-modify-write cycle that bypasses the cache and talks to very slow memory.

I see some progress with per-CPU variables.

[3] 
https://stackoverflow.com/questions/58664496/locking-on-per-cpu-variable/67997961#67997961

On a multiprocessor system, to protect percpu variables we can just
disable preemption and do local_irq_save(). This way we avoid taking a
spinlock. A spinlock requires atomicity across multiple CPUs; with
per-CPU variables that is not required.

	local_irq_save(flags);
	preempt_disable();
	/* modify the percpu variable */
	preempt_enable();
	local_irq_restore(flags);

That compiles roughly to:

	unsigned long flags;

	asm volatile("cli": : :"memory");
	asm volatile(
		"mrs	%0, " IRQMASK_REG_NAME_R " @local_save_flags"
		: "=r" (flags) : : "memory", "cc");
	preempt_count_inc();
	__asm__ __volatile__("": : :"memory")  // barrier()
	--- your stuff here ---
	if (!(!(flags & X86_EFLAGS_IF))
		asm volatile("sti": : :"memory");
	__asm__ __volatile__("": : :"memory")  // barrier()
	preempt_count_dec();

With this code, other CPUs can work in parallel. None of the CPUs spin.

If we take a spinlock, we modify an atomic variable, and when another CPU
comes along it has to wait/spin while the spinlock is held.

Best regards,
Mirsad


> Thanks,
> Yury
diff mbox series

Patch

diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..5ef6466dc7cc 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -60,7 +60,7 @@  unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 		if (unlikely(offset >= size))
 			return size;
 
-		val = *addr & GENMASK(size - 1, offset);
+		val = READ_ONCE(*addr) & GENMASK(size - 1, offset);
 		return val ? __ffs(val) : size;
 	}
 
@@ -90,7 +90,8 @@  unsigned long find_next_and_bit(const unsigned long *addr1,
 		if (unlikely(offset >= size))
 			return size;
 
-		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
+		val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
+						GENMASK(size - 1, offset);
 		return val ? __ffs(val) : size;
 	}
 
@@ -121,7 +122,8 @@  unsigned long find_next_andnot_bit(const unsigned long *addr1,
 		if (unlikely(offset >= size))
 			return size;
 
-		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
+		val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
+						GENMASK(size - 1, offset);
 		return val ? __ffs(val) : size;
 	}
 
@@ -151,7 +153,8 @@  unsigned long find_next_or_bit(const unsigned long *addr1,
 		if (unlikely(offset >= size))
 			return size;
 
-		val = (*addr1 | *addr2) & GENMASK(size - 1, offset);
+		val = (READ_ONCE(*addr1) | READ_ONCE(*addr2)) &
+						GENMASK(size - 1, offset);
 		return val ? __ffs(val) : size;
 	}
 
@@ -179,7 +182,7 @@  unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 		if (unlikely(offset >= size))
 			return size;
 
-		val = *addr | ~GENMASK(size - 1, offset);
+		val = READ_ONCE(*addr) | ~GENMASK(size - 1, offset);
 		return val == ~0UL ? size : ffz(val);
 	}
 
@@ -200,7 +203,7 @@  static inline
 unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
 
 		return val ? __ffs(val) : size;
 	}
@@ -229,7 +232,7 @@  unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsign
 		return size;
 
 	if (small_const_nbits(size)) {
-		unsigned long val =  *addr & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
 
 		return val ? fns(val, n) : size;
 	}
@@ -255,7 +258,8 @@  unsigned long find_nth_and_bit(const unsigned long *addr1, const unsigned long *
 		return size;
 
 	if (small_const_nbits(size)) {
-		unsigned long val =  *addr1 & *addr2 & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2)
+							& GENMASK(size - 1, 0);
 
 		return val ? fns(val, n) : size;
 	}
@@ -282,7 +286,8 @@  unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
 		return size;
 
 	if (small_const_nbits(size)) {
-		unsigned long val =  *addr1 & (~*addr2) & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
+							GENMASK(size - 1, 0);
 
 		return val ? fns(val, n) : size;
 	}
@@ -312,7 +317,8 @@  unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
 		return size;
 
 	if (small_const_nbits(size)) {
-		unsigned long val =  *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
+				~READ_ONCE(*addr3) & GENMASK(size - 1, 0);
 
 		return val ? fns(val, n) : size;
 	}
@@ -336,7 +342,8 @@  unsigned long find_first_and_bit(const unsigned long *addr1,
 				 unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
+							GENMASK(size - 1, 0);
 
 		return val ? __ffs(val) : size;
 	}
@@ -358,7 +365,7 @@  static inline
 unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr | ~GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr) | ~GENMASK(size - 1, 0);
 
 		return val == ~0UL ? size : ffz(val);
 	}
@@ -379,7 +386,7 @@  static inline
 unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
 
 		return val ? __fls(val) : size;
 	}
@@ -505,7 +512,7 @@  unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *(const unsigned long *)addr;
+		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
 
 		if (unlikely(offset >= size))
 			return size;
@@ -523,7 +530,8 @@  static inline
 unsigned long find_first_zero_bit_le(const void *addr, unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0);
+		unsigned long val = swab(READ_ONCE(*(const unsigned long *)addr))
+						| ~GENMASK(size - 1, 0);
 
 		return val == ~0UL ? size : ffz(val);
 	}
@@ -538,7 +546,7 @@  unsigned long find_next_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *(const unsigned long *)addr;
+		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
 
 		if (unlikely(offset >= size))
 			return size;
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 32f99e9a670e..6867ef8a85e0 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -98,7 +98,7 @@  out:										\
  */
 unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
 {
-	return FIND_FIRST_BIT(addr[idx], /* nop */, size);
+	return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_bit);
 #endif
@@ -111,7 +111,8 @@  unsigned long _find_first_and_bit(const unsigned long *addr1,
 				  const unsigned long *addr2,
 				  unsigned long size)
 {
-	return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size);
+	return FIND_FIRST_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
+				/* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_and_bit);
 #endif
@@ -122,7 +123,7 @@  EXPORT_SYMBOL(_find_first_and_bit);
  */
 unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
 {
-	return FIND_FIRST_BIT(~addr[idx], /* nop */, size);
+	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), /* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_zero_bit);
 #endif
@@ -130,28 +131,30 @@  EXPORT_SYMBOL(_find_first_zero_bit);
 #ifndef find_next_bit
 unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
 {
-	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_bit);
 #endif
 
 unsigned long __find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n)
 {
-	return FIND_NTH_BIT(addr[idx], size, n);
+	return FIND_NTH_BIT(READ_ONCE(addr[idx]), size, n);
 }
 EXPORT_SYMBOL(__find_nth_bit);
 
 unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 				 unsigned long size, unsigned long n)
 {
-	return FIND_NTH_BIT(addr1[idx] & addr2[idx], size, n);
+	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
+			    size, n);
 }
 EXPORT_SYMBOL(__find_nth_and_bit);
 
 unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 				 unsigned long size, unsigned long n)
 {
-	return FIND_NTH_BIT(addr1[idx] & ~addr2[idx], size, n);
+	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
+			    size, n);
 }
 EXPORT_SYMBOL(__find_nth_andnot_bit);
 
@@ -160,7 +163,8 @@  unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
 					const unsigned long *addr3,
 					unsigned long size, unsigned long n)
 {
-	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]) &
+			    ~READ_ONCE(addr3[idx]), size, n);
 }
 EXPORT_SYMBOL(__find_nth_and_andnot_bit);
 
@@ -168,7 +172,8 @@  EXPORT_SYMBOL(__find_nth_and_andnot_bit);
 unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
 {
-	return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
+			     /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_and_bit);
 #endif
@@ -177,7 +182,8 @@  EXPORT_SYMBOL(_find_next_and_bit);
 unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
 {
-	return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
+			     /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_andnot_bit);
 #endif
@@ -186,7 +192,8 @@  EXPORT_SYMBOL(_find_next_andnot_bit);
 unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
 {
-	return FIND_NEXT_BIT(addr1[idx] | addr2[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) | READ_ONCE(addr2[idx]),
+			     /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_or_bit);
 #endif
@@ -195,7 +202,7 @@  EXPORT_SYMBOL(_find_next_or_bit);
 unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
 					 unsigned long start)
 {
-	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_zero_bit);
 #endif
@@ -208,7 +215,7 @@  unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
 		unsigned long idx = (size-1) / BITS_PER_LONG;
 
 		do {
-			val &= addr[idx];
+			val &= READ_ONCE(addr[idx]);
 			if (val)
 				return idx * BITS_PER_LONG + __fls(val);
 
@@ -242,7 +249,7 @@  EXPORT_SYMBOL(find_next_clump8);
  */
 unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
 {
-	return FIND_FIRST_BIT(~addr[idx], swab, size);
+	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), swab, size);
 }
 EXPORT_SYMBOL(_find_first_zero_bit_le);
 
@@ -252,7 +259,7 @@  EXPORT_SYMBOL(_find_first_zero_bit_le);
 unsigned long _find_next_zero_bit_le(const unsigned long *addr,
 					unsigned long size, unsigned long offset)
 {
-	return FIND_NEXT_BIT(~addr[idx], swab, size, offset);
+	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), swab, size, offset);
 }
 EXPORT_SYMBOL(_find_next_zero_bit_le);
 #endif
@@ -261,7 +268,7 @@  EXPORT_SYMBOL(_find_next_zero_bit_le);
 unsigned long _find_next_bit_le(const unsigned long *addr,
 				unsigned long size, unsigned long offset)
 {
-	return FIND_NEXT_BIT(addr[idx], swab, size, offset);
+	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), swab, size, offset);
 }
 EXPORT_SYMBOL(_find_next_bit_le);