[v2] lib/memweight.c: open codes bitmap_weight()

Message ID	20190824100102.1167-1-efremov@ispras.ru (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=GmRJ=WU=vger.kernel.org=linux-fsdevel-owner@kernel.org> From: Denis Efremov <efremov@ispras.ru> To: akpm@linux-foundation.org Cc: Denis Efremov <efremov@ispras.ru>, Akinobu Mita <akinobu.mita@gmail.com>, Jan Kara <jack@suse.cz>, linux-kernel@vger.kernel.org, Matthew Wilcox <matthew@wil.cx>, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, linux-media@vger.kernel.org, Erdem Tumurov <erdemus@gmail.com>, Vladimir Shelekhov <vshel@iis.nsk.su> Subject: [PATCH v2] lib/memweight.c: open codes bitmap_weight() Date: Sat, 24 Aug 2019 13:01:02 +0300 Message-Id: <20190824100102.1167-1-efremov@ispras.ru> In-Reply-To: <20190821074200.2203-1-efremov@ispras.ru> References: <20190821074200.2203-1-efremov@ispras.ru> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	[v2] lib/memweight.c: open codes bitmap_weight() \| expand [v2] lib/memweight.c: open codes bitmap_weight()

Denis Efremov Aug. 24, 2019, 10:01 a.m. UTC

This patch open codes the bitmap_weight() call. The direct
invocation of hweight_long() allows to remove the BUG_ON and
excessive "longs to bits, bits to longs" conversion.

BUG_ON was required to check that bitmap_weight() will return
a correct value, i.e. the computed weight will fit the int type
of the return value. With this patch memweight() controls the
computation directly with size_t type everywhere. Thus, the BUG_ON
becomes unnecessary.

Total size reduced:
./scripts/bloat-o-meter lib/memweight.o.old lib/memweight.o.new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
Function                                     old     new   delta
memweight                                    162     152     -10

Co-developed-by: Erdem Tumurov <erdemus@gmail.com>
Signed-off-by: Erdem Tumurov <erdemus@gmail.com>
Co-developed-by: Vladimir Shelekhov <vshel@iis.nsk.su>
Signed-off-by: Vladimir Shelekhov <vshel@iis.nsk.su>
Signed-off-by: Denis Efremov <efremov@ispras.ru>
---
 lib/memweight.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Matthew Wilcox Aug. 25, 2019, 6:11 a.m. UTC | #1

On Sat, Aug 24, 2019 at 01:01:02PM +0300, Denis Efremov wrote:
> This patch open codes the bitmap_weight() call. The direct
> invocation of hweight_long() allows to remove the BUG_ON and
> excessive "longs to bits, bits to longs" conversion.

Honestly, that's not the problem with this function.  Take a look
at https://danluu.com/assembly-intrinsics/ for a _benchmarked_
set of problems with popcnt.

> BUG_ON was required to check that bitmap_weight() will return
> a correct value, i.e. the computed weight will fit the int type
> of the return value.

What?  No.  Look at the _arguments_ of bitmap_weight():

static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits)

> With this patch memweight() controls the
> computation directly with size_t type everywhere. Thus, the BUG_ON
> becomes unnecessary.

Why are you bothering?  How are you allocating half a gigabyte of memory?
Why are you calling memweight() on half a gigabyte of memory?

>  	if (longs) {
> -		BUG_ON(longs >= INT_MAX / BITS_PER_LONG);
> -		ret += bitmap_weight((unsigned long *)bitmap,
> -				longs * BITS_PER_LONG);
> +		const unsigned long *bitmap_long =
> +			(const unsigned long *)bitmap;
> +
>  		bytes -= longs * sizeof(long);
> -		bitmap += longs * sizeof(long);
> +		for (; longs > 0; longs--, bitmap_long++)
> +			ret += hweight_long(*bitmap_long);
> +		bitmap = (const unsigned char *)bitmap_long;
>  	}

If you really must change anything, I'd rather see this turned into a
loop:

	while (longs) {
		unsigned int nbits;

		if (longs >= INT_MAX / BITS_PER_LONG)
			nbits = INT_MAX + 1;
		else
			nbits = longs * BITS_PER_LONG;

		ret += bitmap_weight((unsigned long *)bitmap, sz);
		bytes -= nbits / 8;
		bitmap += nbits / 8;
		longs -= nbits / BITS_PER_LONG;
	}

then we only have to use Dan Luu's optimisation in bitmap_weight()
and not in memweight() as well.

Also, why does the trailer do this:

        for (; bytes > 0; bytes--, bitmap++)
                ret += hweight8(*bitmap);

instead of calling hweight_long on *bitmap & mask?

Denis Efremov Aug. 25, 2019, 11:39 a.m. UTC | #2

On 25.08.2019 09:11, Matthew Wilcox wrote:
> On Sat, Aug 24, 2019 at 01:01:02PM +0300, Denis Efremov wrote:
>> This patch open codes the bitmap_weight() call. The direct
>> invocation of hweight_long() allows to remove the BUG_ON and
>> excessive "longs to bits, bits to longs" conversion.
> 
> Honestly, that's not the problem with this function.  Take a look
> at https://danluu.com/assembly-intrinsics/ for a _benchmarked_
> set of problems with popcnt.
> 
>> BUG_ON was required to check that bitmap_weight() will return
>> a correct value, i.e. the computed weight will fit the int type
>> of the return value.
> 
> What?  No.  Look at the _arguments_ of bitmap_weight():
> 
> static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits)

I'm not sure why it is INT_MAX then? I would expect in case we care only about arguments
something like:

BUG_ON(longs >= UINT_MAX / BITS_PER_LONG);

> 
>> With this patch memweight() controls the
>> computation directly with size_t type everywhere. Thus, the BUG_ON
>> becomes unnecessary.
> 
> Why are you bothering?  How are you allocating half a gigabyte of memory?
> Why are you calling memweight() on half a gigabyte of memory?
> 

No, we don't use such big arrays. However, it's possible to remove BUG_ON and make
the code more "straight". Why do we need to "artificially" limit this function
to arrays of a particular size if we can relatively simple omit this restriction?

> 
> If you really must change anything, I'd rather see this turned into a
> loop:
> 
> 	while (longs) {
> 		unsigned int nbits;
> 
> 		if (longs >= INT_MAX / BITS_PER_LONG)
> 			nbits = INT_MAX + 1;
> 		else
> 			nbits = longs * BITS_PER_LONG;
> 
> 		ret += bitmap_weight((unsigned long *)bitmap, sz);
> 		bytes -= nbits / 8;
> 		bitmap += nbits / 8;
> 		longs -= nbits / BITS_PER_LONG;
> 	}
> 
> then we only have to use Dan Luu's optimisation in bitmap_weight()
> and not in memweight() as well.

I don't know how the implementation of this optimization will look like in it's
final shape, because of different hardware/compiler issues. It looks there are
a number of different ways to do it https://arxiv.org/pdf/1611.07612.pdf, 
http://0x80.pl/articles/sse-popcount.html.

However, if it will be based on popcnt instruction I would expect that
hweight_long will also contain this intrinsics. Since version 4.9.2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011#c13 GCC knows of the
false-dependency in popcnt and generates code to handle it
(e.g. xor https://godbolt.org/z/Q7AW_d) Thus, I would expect that it's
possible to use popcnt intrinsics in hweight_long that would be natively
optimized in all loops like "for (...) { res += hweight_long() }" without
requiring manual unrolling like in builtin_popcnt_unrolled_errata_manual
example of Dan Luu's optimization.

> 
> Also, why does the trailer do this:
> 
>         for (; bytes > 0; bytes--, bitmap++)
>                 ret += hweight8(*bitmap);
> 
> instead of calling hweight_long on *bitmap & mask?
> 

Do you mean something like this?

        longs = bytes;
        bytes = do_div(longs, sizeof(long));
        bitmap_long = (const unsigned long *)bitmap;
        if (longs) {
                for (; longs > 0; longs--, bitmap_long++)
                        ret += hweight_long(*bitmap_long);
        }
        if (bytes) {
                ret += hweight_long(*bitmap_long &
                                   ((0x1 << bytes * BITS_PER_BYTE) - 1));
        }

The *bitmap_long will lead to buffer overflow here.

Thanks,
Denis

Matthew Wilcox Aug. 26, 2019, 6:39 p.m. UTC | #3

On Sun, Aug 25, 2019 at 02:39:47PM +0300, Denis Efremov wrote:
> On 25.08.2019 09:11, Matthew Wilcox wrote:
> > On Sat, Aug 24, 2019 at 01:01:02PM +0300, Denis Efremov wrote:
> >> This patch open codes the bitmap_weight() call. The direct
> >> invocation of hweight_long() allows to remove the BUG_ON and
> >> excessive "longs to bits, bits to longs" conversion.
> > 
> > Honestly, that's not the problem with this function.  Take a look
> > at https://danluu.com/assembly-intrinsics/ for a _benchmarked_
> > set of problems with popcnt.
> > 
> >> BUG_ON was required to check that bitmap_weight() will return
> >> a correct value, i.e. the computed weight will fit the int type
> >> of the return value.
> > 
> > What?  No.  Look at the _arguments_ of bitmap_weight():
> > 
> > static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits)
> 
> I'm not sure why it is INT_MAX then? I would expect in case we care only about arguments
> something like:
>  
> BUG_ON(longs >= UINT_MAX / BITS_PER_LONG);

People aren't always terribly consistent with INT_MAX vs UINT_MAX.
Also, bitmap_weight() should arguably return an unisnged int (it can't
legitimately return a negative value).

> >> With this patch memweight() controls the
> >> computation directly with size_t type everywhere. Thus, the BUG_ON
> >> becomes unnecessary.
> > 
> > Why are you bothering?  How are you allocating half a gigabyte of memory?
> > Why are you calling memweight() on half a gigabyte of memory?
> > 
> 
> No, we don't use such big arrays. However, it's possible to remove BUG_ON and make
> the code more "straight". Why do we need to "artificially" limit this function
> to arrays of a particular size if we can relatively simple omit this restriction?

You're not making a great case for changing the implementation of
memweight() here ...

> I don't know how the implementation of this optimization will look like in it's
> final shape, because of different hardware/compiler issues. It looks there are
> a number of different ways to do it https://arxiv.org/pdf/1611.07612.pdf, 
> http://0x80.pl/articles/sse-popcount.html.

The problem with using XMM registers is that they have to be saved/restored.
Not to mention the thermal issues caused by heavy usage of AVX instructions.

> However, if it will be based on popcnt instruction I would expect that
> hweight_long will also contain this intrinsics. Since version 4.9.2
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011#c13 GCC knows of the
> false-dependency in popcnt and generates code to handle it

Ah!  Glad to see GCC knows about this problem and has worked around it.

> (e.g. xor https://godbolt.org/z/Q7AW_d) Thus, I would expect that it's
> possible to use popcnt intrinsics in hweight_long that would be natively
> optimized in all loops like "for (...) { res += hweight_long() }" without
> requiring manual unrolling like in builtin_popcnt_unrolled_errata_manual
> example of Dan Luu's optimization.

That might be expecting rather more from our compiler than is reasonable ...

> > 
> > Also, why does the trailer do this:
> > 
> >         for (; bytes > 0; bytes--, bitmap++)
> >                 ret += hweight8(*bitmap);
> > 
> > instead of calling hweight_long on *bitmap & mask?
> > 
> 
> Do you mean something like this?
> 
>         longs = bytes;
>         bytes = do_div(longs, sizeof(long));
>         bitmap_long = (const unsigned long *)bitmap;
>         if (longs) {
>                 for (; longs > 0; longs--, bitmap_long++)
>                         ret += hweight_long(*bitmap_long);
>         }
>         if (bytes) {
>                 ret += hweight_long(*bitmap_long &
>                                    ((0x1 << bytes * BITS_PER_BYTE) - 1));
>         }
> 
> The *bitmap_long will lead to buffer overflow here.

No it won't.  The CPU will access more bytes than the `bytes' argument
would seem to imply -- but it's going to have fetched that entire
cacheline anyway.  It might confuse a very strict bounds checking library,
but usually those just check you're not accessing outside your object,
which is going to be a multiple of 'sizeof(long)' anyway.

If we do something like this, we'll need to use an 'inverse' of that mask
on big-endian machines.  ie something more like:

	if (bytes) {
		unsigned long mask;
		if (_BIG_ENDIAN)
			mask = ~0UL >> (bytes * 8);
		else
			mask = ~0UL << (bytes * 8);
		ret += hweight_long(*bitmap_long & ~mask);
	}

Also we need a memweight() test to be sure we didn't get that wrong.

Denis Efremov Sept. 13, 2019, 11:48 a.m. UTC | #4

Hi,

Sorry for reviving this conversation, but it looks to me like
this function could be reduced to a single bitmap_weight call:

static inline size_t memweight(const void *ptr, size_t bytes)
{
        BUG_ON(bytes >= UINT_MAX / BITS_PER_BYTE);
        return bitmap_weight(ptr, bytes * BITS_PER_BYTE);
}

Comparing to the current implementation
https://elixir.bootlin.com/linux/latest/source/lib/memweight.c#L11 
this results in a signification simplification. 

__bitmap_weight already count last bits with hweight_long as we
discussed earlier.

int __bitmap_weight(const unsigned long *bitmap, unsigned int bits)
{
	...
	if (bits % BITS_PER_LONG)
		w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));
	...
}

and __arch_hweight* functions use popcnt instruction.

I've briefly tested the equivalence of 2 implementations on x86_64 with
fuzzing here: https://gist.github.com/evdenis/95a8b9b8041e09368b31c3a9510491a5

What do you think making this function static inline and moving it
to include/linux/string.h? I could prepare a patch for it and add some tests for
memweight and bitmap_weight. Or maybe I miss something again?

Best regards,
Denis

Denis Efremov Sept. 13, 2019, 1:41 p.m. UTC | #5

Sorry, no question, pointer alignment of course.

Denis Efremov писал 2019-09-13 14:48:
> Hi,
> 
> Sorry for reviving this conversation, but it looks to me like
> this function could be reduced to a single bitmap_weight call:
> 
> static inline size_t memweight(const void *ptr, size_t bytes)
> {
>         BUG_ON(bytes >= UINT_MAX / BITS_PER_BYTE);
>         return bitmap_weight(ptr, bytes * BITS_PER_BYTE);
> }
> 
> Comparing to the current implementation
> https://elixir.bootlin.com/linux/latest/source/lib/memweight.c#L11
> this results in a signification simplification.
> 
> __bitmap_weight already count last bits with hweight_long as we
> discussed earlier.
> 
> int __bitmap_weight(const unsigned long *bitmap, unsigned int bits)
> {
> 	...
> 	if (bits % BITS_PER_LONG)
> 		w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));
> 	...
> }
> 
> and __arch_hweight* functions use popcnt instruction.
> 
> I've briefly tested the equivalence of 2 implementations on x86_64 with
> fuzzing here: 
> https://gist.github.com/evdenis/95a8b9b8041e09368b31c3a9510491a5
> 
> What do you think making this function static inline and moving it
> to include/linux/string.h? I could prepare a patch for it and add some 
> tests for
> memweight and bitmap_weight. Or maybe I miss something again?
> 
> Best regards,
> Denis

[v2] lib/memweight.c: open codes bitmap_weight()

Commit Message

Comments

Patch