diff mbox series

[v2,net-next,1/6] bitmap: try to optimize arr32 <-> bitmap on 64-bit LEs

Message ID 20221018140027.48086-2-alexandr.lobakin@intel.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series netlink: add universal 'bigint' attribute type | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 17950 this patch: 17950
netdev/cc_maintainers success CCed 3 of 3 maintainers
netdev/build_clang success Errors and warnings before: 4361 this patch: 4361
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 18746 this patch: 18746
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Alexander Lobakin Oct. 18, 2022, 2 p.m. UTC
Unlike bitmap_{from,to}_arr64(), when there can be no out-of-bounds
accesses (due to u64 always being no shorter than unsigned long),
it can't be guaranteed with arr32s due to that on 64-bit platforms:

bits     BITS_TO_U32 * sizeof(u32)    BITS_TO_LONGS * sizeof(long)
1-32     4                            8
33-64    8                            8
95-96    12                           16
97-128   16                           16

and so on.
That is why bitmap_{from,to}_arr32() are always defined there as
externs. But quite often @nbits is a compile-time constant, which
means we could suggest whether it can be inlined or not at
compile-time basing on the number of bits (above).

So, try to determine that at compile time and, in case of both
containers having the same size in bytes, resolve it to
bitmap_copy_clear_tail() on Little Endian. No changes here for
Big Endian or when the number of bits *really* is variable.

Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
 include/linux/bitmap.h | 51 ++++++++++++++++++++++++++++++------------
 lib/bitmap.c           | 12 +++++-----
 2 files changed, 43 insertions(+), 20 deletions(-)

Comments

Yury Norov Oct. 18, 2022, 10:41 p.m. UTC | #1
On Tue, Oct 18, 2022 at 04:00:22PM +0200, Alexander Lobakin wrote:
> Unlike bitmap_{from,to}_arr64(), when there can be no out-of-bounds
> accesses (due to u64 always being no shorter than unsigned long),
> it can't be guaranteed with arr32s due to that on 64-bit platforms:
> 
> bits     BITS_TO_U32 * sizeof(u32)    BITS_TO_LONGS * sizeof(long)
> 1-32     4                            8
> 33-64    8                            8
> 95-96    12                           16
> 97-128   16                           16
> 
> and so on.
> That is why bitmap_{from,to}_arr32() are always defined there as
> externs. But quite often @nbits is a compile-time constant, which
> means we could suggest whether it can be inlined or not at
> compile-time basing on the number of bits (above).
> 
> So, try to determine that at compile time and, in case of both
> containers having the same size in bytes, resolve it to
> bitmap_copy_clear_tail() on Little Endian. No changes here for
> Big Endian or when the number of bits *really* is variable.

You're saying 'try to optimize', but don't show any numbers. What's
the target for your optimization? Can you demonstrate how it performs
in test or in real work?
 
> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> ---
>  include/linux/bitmap.h | 51 ++++++++++++++++++++++++++++++------------
>  lib/bitmap.c           | 12 +++++-----
>  2 files changed, 43 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 7d6d73b78147..79d12e0f748b 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -283,24 +283,47 @@ static inline void bitmap_copy_clear_tail(unsigned long *dst,
>   * On 32-bit systems bitmaps are represented as u32 arrays internally. On LE64
>   * machines the order of hi and lo parts of numbers match the bitmap structure.
>   * In both cases conversion is not needed when copying data from/to arrays of
> - * u32. But in LE64 case, typecast in bitmap_copy_clear_tail() may lead
> - * to out-of-bound access. To avoid that, both LE and BE variants of 64-bit
> - * architectures are not using bitmap_copy_clear_tail().
> + * u32. But in LE64 case, typecast in bitmap_copy_clear_tail() may lead to
> + * out-of-bound access. To avoid that, LE variant of 64-bit architectures uses
> + * bitmap_copy_clear_tail() only when @bitmap and @buf containers have the same
> + * size in memory (known at compile time), and 64-bit BEs never use it.
>   */
> -#if BITS_PER_LONG == 64
> -void bitmap_from_arr32(unsigned long *bitmap, const u32 *buf,
> -							unsigned int nbits);
> -void bitmap_to_arr32(u32 *buf, const unsigned long *bitmap,
> -							unsigned int nbits);
> +#if BITS_PER_LONG == 32
> +#define bitmap_arr32_compat(nbits)		true
> +#elif defined(__LITTLE_ENDIAN)
> +#define bitmap_arr32_compat(nbits)		\

'Compat' is reserved for a compatibility layer between kernel and
user spaces running different ABIs. Can you pick some other word?

> +	(__builtin_constant_p(nbits) &&		\
> +	 BITS_TO_U32(nbits) * sizeof(u32) ==	\
> +	 BITS_TO_LONGS(nbits) * sizeof(long))

I think it should be:
        round_up(nbits, 32) == round_up(nbits, 64)

>  #else
> -#define bitmap_from_arr32(bitmap, buf, nbits)			\
> -	bitmap_copy_clear_tail((unsigned long *) (bitmap),	\
> -			(const unsigned long *) (buf), (nbits))
> -#define bitmap_to_arr32(buf, bitmap, nbits)			\
> -	bitmap_copy_clear_tail((unsigned long *) (buf),		\
> -			(const unsigned long *) (bitmap), (nbits))

Can you keep this part untouched? I'd like to have a clear meaning -
on 32-bit arch, bitmap_to_arr32 is a simple copy.

> +#define bitmap_arr32_compat(nbits)		false
>  #endif
>  
> +void __bitmap_from_arr32(unsigned long *bitmap, const u32 *buf, unsigned int nbits);
> +void __bitmap_to_arr32(u32 *buf, const unsigned long *bitmap, unsigned int nbits);
> +
> +static inline void bitmap_from_arr32(unsigned long *bitmap, const u32 *buf,
> +				     unsigned int nbits)
> +{
> +	const unsigned long *src = (const unsigned long *)buf;
> +
> +	if (bitmap_arr32_compat(nbits))
> +		bitmap_copy_clear_tail(bitmap, src, nbits);
> +	else
> +		__bitmap_from_arr32(bitmap, buf, nbits);

If you would really want to optimize it, I'd suggest something like
this:
    #ifdef __LITTLE_ENDIAN
        /*copy as many full 64-bit words as we can */
        bitmap_copy(bitmap, src, round_down(nbits, BITS_PER_LONG)); 

        /* now copy part of last word per-byte */
        ...
    #else
	__bitmap_from_arr32(bitmap, buf, nbits);
    #endif

This should be better because it uses fast bitmap_copy() regardless
the number of bits. Assuming bitmap_copy() is significantly faster
than bitmap_from_arr(), people will be surprised by the difference of
speed of copying, say, 2048 and 2049-bit bitmaps. Right?

But unless we'll see real numbers, it's questionable to me if that's
worth the effort.

Thanks,
Yury
Alexander Lobakin Oct. 19, 2022, 3:21 p.m. UTC | #2
From: Yury Norov <yury.norov@gmail.com>
Date: Tue, 18 Oct 2022 15:41:46 -0700

> On Tue, Oct 18, 2022 at 04:00:22PM +0200, Alexander Lobakin wrote:
> > Unlike bitmap_{from,to}_arr64(), when there can be no out-of-bounds
> > accesses (due to u64 always being no shorter than unsigned long),
> > it can't be guaranteed with arr32s due to that on 64-bit platforms:
> > 
> > bits     BITS_TO_U32 * sizeof(u32)    BITS_TO_LONGS * sizeof(long)
> > 1-32     4                            8
> > 33-64    8                            8
> > 95-96    12                           16
> > 97-128   16                           16
> > 
> > and so on.
> > That is why bitmap_{from,to}_arr32() are always defined there as
> > externs. But quite often @nbits is a compile-time constant, which
> > means we could suggest whether it can be inlined or not at
> > compile-time basing on the number of bits (above).
> > 
> > So, try to determine that at compile time and, in case of both
> > containers having the same size in bytes, resolve it to
> > bitmap_copy_clear_tail() on Little Endian. No changes here for
> > Big Endian or when the number of bits *really* is variable.
> 
> You're saying 'try to optimize', but don't show any numbers. What's
> the target for your optimization? Can you demonstrate how it performs
> in test or in real work?

I had them somewhere, but given that you provided a different
approach to try, I'd better retest each of them and collect some
fresh numbers. If it's not worth it, I'll simply drop the patch
from the series / include stats in the commitmsg otherwise.

>  
> > Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> > ---
> >  include/linux/bitmap.h | 51 ++++++++++++++++++++++++++++++------------
> >  lib/bitmap.c           | 12 +++++-----
> >  2 files changed, 43 insertions(+), 20 deletions(-)

[...]

> > +#if BITS_PER_LONG == 32
> > +#define bitmap_arr32_compat(nbits)		true
> > +#elif defined(__LITTLE_ENDIAN)
> > +#define bitmap_arr32_compat(nbits)		\
> 
> 'Compat' is reserved for a compatibility layer between kernel and
> user spaces running different ABIs. Can you pick some other word?

Yeah, sure. I was also thinking of this as it didn't sound good to
me as well, but didn't find anything better at the top of my head
and then forgot about it at all.

> 
> > +	(__builtin_constant_p(nbits) &&		\
> > +	 BITS_TO_U32(nbits) * sizeof(u32) ==	\
> > +	 BITS_TO_LONGS(nbits) * sizeof(long))
> 
> I think it should be:
>         round_up(nbits, 32) == round_up(nbits, 64)

Ah, correct.

> 
> >  #else
> > -#define bitmap_from_arr32(bitmap, buf, nbits)			\
> > -	bitmap_copy_clear_tail((unsigned long *) (bitmap),	\
> > -			(const unsigned long *) (buf), (nbits))
> > -#define bitmap_to_arr32(buf, bitmap, nbits)			\
> > -	bitmap_copy_clear_tail((unsigned long *) (buf),		\
> > -			(const unsigned long *) (bitmap), (nbits))
> 
> Can you keep this part untouched? I'd like to have a clear meaning -
> on 32-bit arch, bitmap_to_arr32 is a simple copy.

Sure, why not, I don't have a strong preference here.

> 
> > +#define bitmap_arr32_compat(nbits)		false
> >  #endif
> >  
> > +void __bitmap_from_arr32(unsigned long *bitmap, const u32 *buf, unsigned int nbits);
> > +void __bitmap_to_arr32(u32 *buf, const unsigned long *bitmap, unsigned int nbits);
> > +
> > +static inline void bitmap_from_arr32(unsigned long *bitmap, const u32 *buf,
> > +				     unsigned int nbits)
> > +{
> > +	const unsigned long *src = (const unsigned long *)buf;
> > +
> > +	if (bitmap_arr32_compat(nbits))
> > +		bitmap_copy_clear_tail(bitmap, src, nbits);
> > +	else
> > +		__bitmap_from_arr32(bitmap, buf, nbits);
> 
> If you would really want to optimize it, I'd suggest something like
> this:
>     #ifdef __LITTLE_ENDIAN
>         /*copy as many full 64-bit words as we can */
>         bitmap_copy(bitmap, src, round_down(nbits, BITS_PER_LONG)); 
> 
>         /* now copy part of last word per-byte */
>         ...
>     #else
> 	__bitmap_from_arr32(bitmap, buf, nbits);
>     #endif
> 
> This should be better because it uses fast bitmap_copy() regardless
> the number of bits. Assuming bitmap_copy() is significantly faster
> than bitmap_from_arr(), people will be surprised by the difference of
> speed of copying, say, 2048 and 2049-bit bitmaps. Right?
> 
> But unless we'll see real numbers, it's questionable to me if that's
> worth the effort.

Nice suggestion, thanks! I'll retest all I have and then we'll see.

> 
> Thanks,
> Yury

Thanks,
Olek
diff mbox series

Patch

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 7d6d73b78147..79d12e0f748b 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -283,24 +283,47 @@  static inline void bitmap_copy_clear_tail(unsigned long *dst,
  * On 32-bit systems bitmaps are represented as u32 arrays internally. On LE64
  * machines the order of hi and lo parts of numbers match the bitmap structure.
  * In both cases conversion is not needed when copying data from/to arrays of
- * u32. But in LE64 case, typecast in bitmap_copy_clear_tail() may lead
- * to out-of-bound access. To avoid that, both LE and BE variants of 64-bit
- * architectures are not using bitmap_copy_clear_tail().
+ * u32. But in LE64 case, typecast in bitmap_copy_clear_tail() may lead to
+ * out-of-bound access. To avoid that, LE variant of 64-bit architectures uses
+ * bitmap_copy_clear_tail() only when @bitmap and @buf containers have the same
+ * size in memory (known at compile time), and 64-bit BEs never use it.
  */
-#if BITS_PER_LONG == 64
-void bitmap_from_arr32(unsigned long *bitmap, const u32 *buf,
-							unsigned int nbits);
-void bitmap_to_arr32(u32 *buf, const unsigned long *bitmap,
-							unsigned int nbits);
+#if BITS_PER_LONG == 32
+#define bitmap_arr32_compat(nbits)		true
+#elif defined(__LITTLE_ENDIAN)
+#define bitmap_arr32_compat(nbits)		\
+	(__builtin_constant_p(nbits) &&		\
+	 BITS_TO_U32(nbits) * sizeof(u32) ==	\
+	 BITS_TO_LONGS(nbits) * sizeof(long))
 #else
-#define bitmap_from_arr32(bitmap, buf, nbits)			\
-	bitmap_copy_clear_tail((unsigned long *) (bitmap),	\
-			(const unsigned long *) (buf), (nbits))
-#define bitmap_to_arr32(buf, bitmap, nbits)			\
-	bitmap_copy_clear_tail((unsigned long *) (buf),		\
-			(const unsigned long *) (bitmap), (nbits))
+#define bitmap_arr32_compat(nbits)		false
 #endif
 
+void __bitmap_from_arr32(unsigned long *bitmap, const u32 *buf, unsigned int nbits);
+void __bitmap_to_arr32(u32 *buf, const unsigned long *bitmap, unsigned int nbits);
+
+static inline void bitmap_from_arr32(unsigned long *bitmap, const u32 *buf,
+				     unsigned int nbits)
+{
+	const unsigned long *src = (const unsigned long *)buf;
+
+	if (bitmap_arr32_compat(nbits))
+		bitmap_copy_clear_tail(bitmap, src, nbits);
+	else
+		__bitmap_from_arr32(bitmap, buf, nbits);
+}
+
+static inline void bitmap_to_arr32(u32 *buf, const unsigned long *bitmap,
+				   unsigned int nbits)
+{
+	unsigned long *dst = (unsigned long *)buf;
+
+	if (bitmap_arr32_compat(nbits))
+		bitmap_copy_clear_tail(dst, bitmap, nbits);
+	else
+		__bitmap_to_arr32(buf, bitmap, nbits);
+}
+
 /*
  * On 64-bit systems bitmaps are represented as u64 arrays internally. On LE32
  * machines the order of hi and lo parts of numbers match the bitmap structure.
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 1c81413c51f8..e3eb12ff1637 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -1449,12 +1449,12 @@  EXPORT_SYMBOL_GPL(devm_bitmap_zalloc);
 
 #if BITS_PER_LONG == 64
 /**
- * bitmap_from_arr32 - copy the contents of u32 array of bits to bitmap
+ * __bitmap_from_arr32 - copy the contents of u32 array of bits to bitmap
  *	@bitmap: array of unsigned longs, the destination bitmap
  *	@buf: array of u32 (in host byte order), the source bitmap
  *	@nbits: number of bits in @bitmap
  */
-void bitmap_from_arr32(unsigned long *bitmap, const u32 *buf, unsigned int nbits)
+void __bitmap_from_arr32(unsigned long *bitmap, const u32 *buf, unsigned int nbits)
 {
 	unsigned int i, halfwords;
 
@@ -1469,15 +1469,15 @@  void bitmap_from_arr32(unsigned long *bitmap, const u32 *buf, unsigned int nbits
 	if (nbits % BITS_PER_LONG)
 		bitmap[(halfwords - 1) / 2] &= BITMAP_LAST_WORD_MASK(nbits);
 }
-EXPORT_SYMBOL(bitmap_from_arr32);
+EXPORT_SYMBOL(__bitmap_from_arr32);
 
 /**
- * bitmap_to_arr32 - copy the contents of bitmap to a u32 array of bits
+ * __bitmap_to_arr32 - copy the contents of bitmap to a u32 array of bits
  *	@buf: array of u32 (in host byte order), the dest bitmap
  *	@bitmap: array of unsigned longs, the source bitmap
  *	@nbits: number of bits in @bitmap
  */
-void bitmap_to_arr32(u32 *buf, const unsigned long *bitmap, unsigned int nbits)
+void __bitmap_to_arr32(u32 *buf, const unsigned long *bitmap, unsigned int nbits)
 {
 	unsigned int i, halfwords;
 
@@ -1492,7 +1492,7 @@  void bitmap_to_arr32(u32 *buf, const unsigned long *bitmap, unsigned int nbits)
 	if (nbits % BITS_PER_LONG)
 		buf[halfwords - 1] &= (u32) (UINT_MAX >> ((-nbits) & 31));
 }
-EXPORT_SYMBOL(bitmap_to_arr32);
+EXPORT_SYMBOL(__bitmap_to_arr32);
 #endif
 
 #if (BITS_PER_LONG == 32) && defined(__BIG_ENDIAN)