
[RFC,3/4] Arm64: further speed-up to hweight{32, 64}()

Message ID 5CF0F9A30200007800233E07@prv1-mh.provo.novell.com (mailing list archive)
State New, archived
Series bitops: hweight<N>() improvements

Commit Message

Jan Beulich May 31, 2019, 9:53 a.m. UTC
According to Linux commit e75bef2a4f ("arm64: Select
ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
variant using only bitwise operations on at least some hardware, and no
worse on other.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: To be honest I'm not fully convinced this is a win in particular in
     the hweight32() case, as there's no actual shift insn which gets
     replaced by the multiplication. Even for hweight64() the compiler
     could emit better code and avoid the explicit shift by 32 (which it
     emits at least for me).
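For reference, the two generic implementations being compared can be sketched in C as follows. This is a sketch under assumptions: the function names are illustrative, and Xen's actual generic_hweight64() differs in macro detail; the bit-parallel reduction constants are the standard ones.

```c
#include <stdint.h>

/* Shared bit-parallel reduction: after these three steps each byte of
 * x holds the population count of the corresponding input byte (0..8). */
static uint64_t hweight64_reduce_bytes(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555ULL;
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    return (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
}

/* Variant used without a fast multiplier: fold the per-byte counts
 * together with a cascade of shifted adds. */
static unsigned int hweight64_shifts(uint64_t x)
{
    x = hweight64_reduce_bytes(x);
    x += x >> 8;
    x += x >> 16;
    x += x >> 32;
    return x & 0xff;
}

/* Variant used with a fast multiplier: one multiplication by
 * 0x0101...01 sums all eight byte counts into the top byte. */
static unsigned int hweight64_mul(uint64_t x)
{
    return (hweight64_reduce_bytes(x) * 0x0101010101010101ULL) >> 56;
}
```

The multiplication variant trades the three folding adds (and the final mask) for a constant load, a mul, and a shift, which is the trade-off the Linux commit message argues is a win on cores with a fast multiplier.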

Comments

Julien Grall June 4, 2019, 4:11 p.m. UTC | #1
Hi Jan,

On 5/31/19 10:53 AM, Jan Beulich wrote:
> According to Linux commit e75bef2a4f ("arm64: Select
> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
> variant using only bitwise operations on at least some hardware, and no
> worse on other.
> 
> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC: To be honest I'm not fully convinced this is a win in particular in
>       the hweight32() case, as there's no actual shift insn which gets
>       replaced by the multiplication. Even for hweight64() the compiler
>       could emit better code and avoid the explicit shift by 32 (which it
>       emits at least for me).

I can see a multiplication instruction used in both hweight32() and 
hweight64() with the compiler I am using.

I would expect the compiler could easily replace a multiply with a series 
of shifts, but it would be more difficult to do the inverse.

Also, this has been in Linux for a year now, so I am assuming the Linux 
folks are happy with the changes (CCing Robin just in case I missed 
anything). Therefore I am happy to give it a go on Xen as well.

Cheers,

> 
> --- a/xen/arch/arm/Kconfig
> +++ b/xen/arch/arm/Kconfig
> @@ -12,6 +12,7 @@ config ARM_32
>   config ARM_64
>   	def_bool y
>   	depends on 64BIT
> +	select HAS_FAST_MULTIPLY
>   
>   config ARM
>   	def_bool y
> 
>
Robin Murphy June 4, 2019, 5:30 p.m. UTC | #2
On 04/06/2019 17:11, Julien Grall wrote:
> Hi Jan,
> 
> On 5/31/19 10:53 AM, Jan Beulich wrote:
>> According to Linux commit e75bef2a4f ("arm64: Select
>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>> variant using only bitwise operations on at least some hardware, and no
>> worse on other.
>>
>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> RFC: To be honest I'm not fully convinced this is a win in particular in
>>       the hweight32() case, as there's no actual shift insn which gets
>>       replaced by the multiplication. Even for hweight64() the compiler
>>       could emit better code and avoid the explicit shift by 32 (which it
>>       emits at least for me).
> 
> I can see multiplication instruction used in both hweight32() and 
> hweight64() with the compiler I am using.
> 
> I would expect the compiler could easily replace a multiply by a series 
> of shift but it would be more difficult to do the invert.
> 
> Also, this has been in Linux for a year now, so I am assuming Linux 
> folks are happy with changes (CCing Robin just in case I missed 
> anything). Therefore I am happy to give it a go on Xen as well.

IIRC it did look like Linux's hweight() routines could possibly be 
further tuned for the A64 ISA to shave off one or two more instructions, 
but it's yet to be proven that performance is anywhere near critical 
enough to justify maintaining arch-specific versions. It costs us 
basically nothing to switch between the two generic implementations 
though, so since the smaller-and-no-slower code can be considered a net 
win (however minor), there seemed no reason *not* to apply the existing 
option appropriately.

Robin.

> 
> Cheers,
> 
>>
>> --- a/xen/arch/arm/Kconfig
>> +++ b/xen/arch/arm/Kconfig
>> @@ -12,6 +12,7 @@ config ARM_32
>>   config ARM_64
>>       def_bool y
>>       depends on 64BIT
>> +    select HAS_FAST_MULTIPLY
>>   config ARM
>>       def_bool y
>>
>>
>
Jan Beulich June 5, 2019, 7:42 a.m. UTC | #3
>>> On 04.06.19 at 18:11, <julien.grall@arm.com> wrote:
> On 5/31/19 10:53 AM, Jan Beulich wrote:
>> According to Linux commit e75bef2a4f ("arm64: Select
>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>> variant using only bitwise operations on at least some hardware, and no
>> worse on other.
>> 
>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> RFC: To be honest I'm not fully convinced this is a win in particular in
>>       the hweight32() case, as there's no actual shift insn which gets
>>       replaced by the multiplication. Even for hweight64() the compiler
>>       could emit better code and avoid the explicit shift by 32 (which it
>>       emits at least for me).
> 
> I can see multiplication instruction used in both hweight32() and 
> hweight64() with the compiler I am using.

That is for which exact implementation? What I was referring to as
"could emit better code" was the multiplication-free variant, where
the compiler fails to recognize (afaict) another opportunity to fold
a shift into an arithmetic instruction:

	add	x0, x0, x0, lsr #4
	and	x0, x0, #0xf0f0f0f0f0f0f0f
	add	x0, x0, x0, lsr #8
	add	x0, x0, x0, lsr #16
>>>	lsr	x1, x0, #32
>>>	add	w0, w1, w0
	and	w0, w0, #0xff
	ret

Afaict the two marked insns could be replaced by

	add	x0, x0, x0, lsr #32

With only a sequence of adds remaining, I'm having
difficulty seeing how the use of mul+lsr would actually help:

	add	x0, x0, x0, lsr #4
	and	x0, x0, #0xf0f0f0f0f0f0f0f
	mov	x1, #0x101010101010101
	mul	x0, x0, x1
	lsr	x0, x0, #56
	ret

But of course I know nothing about the throughput and latency
of such adds with one of their operands shifted first. And
yes, the variant using mul is, comparing with the better
optimized case, still one insn smaller.
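The hweight32() case the RFC note is doubtful about can be sketched the same way (hypothetical function names, standard constants): in the 32-bit variant the multiplication only replaces two shifted adds and the final mask, so the instruction-count argument is weaker than for hweight64().

```c
#include <stdint.h>

/* 32-bit fold without a fast multiplier: two shifted adds plus a mask. */
static unsigned int hweight32_shifts(uint32_t w)
{
    w -= (w >> 1) & 0x55555555U;
    w = (w & 0x33333333U) + ((w >> 2) & 0x33333333U);
    w = (w + (w >> 4)) & 0x0f0f0f0fU;
    w += w >> 8;
    w += w >> 16;
    return w & 0xff;
}

/* 32-bit fold with a fast multiplier: constant load, mul, shift. */
static unsigned int hweight32_mul(uint32_t w)
{
    w -= (w >> 1) & 0x55555555U;
    w = (w & 0x33333333U) + ((w >> 2) & 0x33333333U);
    w = (w + (w >> 4)) & 0x0f0f0f0fU;
    return (w * 0x01010101U) >> 24;
}
```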

> I would expect the compiler could easily replace a multiply by a series 
> of shift but it would be more difficult to do the invert.
> 
> Also, this has been in Linux for a year now, so I am assuming Linux 
> folks are happy with changes (CCing Robin just in case I missed 
> anything). Therefore I am happy to give it a go on Xen as well.

In which case - can I take this as an ack, or do you want to first
pursue the discussion?

Jan
Julien Grall June 5, 2019, 9:24 a.m. UTC | #4
Hi Jan,

On 05/06/2019 08:42, Jan Beulich wrote:
>>>> On 04.06.19 at 18:11, <julien.grall@arm.com> wrote:
>> On 5/31/19 10:53 AM, Jan Beulich wrote:
>>> According to Linux commit e75bef2a4f ("arm64: Select
>>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>>> variant using only bitwise operations on at least some hardware, and no
>>> worse on other.
>>>
>>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> RFC: To be honest I'm not fully convinced this is a win in particular in
>>>        the hweight32() case, as there's no actual shift insn which gets
>>>        replaced by the multiplication. Even for hweight64() the compiler
>>>        could emit better code and avoid the explicit shift by 32 (which it
>>>        emits at least for me).
>>
>> I can see multiplication instruction used in both hweight32() and
>> hweight64() with the compiler I am using.
> 
> That is for which exact implementation?

A simple call to hweight64().

> What I was referring to as
> "could emit better code" was the multiplication-free variant, where
> the compiler fails to recognize (afaict) another opportunity to fold
> a shift into an arithmetic instruction:
> 
> 	add	x0, x0, x0, lsr #4
> 	and	x0, x0, #0xf0f0f0f0f0f0f0f
> 	add	x0, x0, x0, lsr #8
> 	add	x0, x0, x0, lsr #16
>>>> 	lsr	x1, x0, #32
>>>> 	add	w0, w1, w0
> 	and	w0, w0, #0xff
> 	ret
> 
> Afaict the two marked insns could be replaced by
> 
> 	add	x0, x0, x0, lsr #32

I am not a compiler expert. Anyway this likely depends on the version of the 
compiler you are using. They are becoming smarter and smarter.

> 
> With there only a sequence of add-s remaining, I'm having
> difficulty seeing how the use of mul+lsr would actually help:
> 
> 	add	x0, x0, x0, lsr #4
> 	and	x0, x0, #0xf0f0f0f0f0f0f0f
> 	mov	x1, #0x101010101010101
> 	mul	x0, x0, x1
> 	lsr	x0, x0, #56
> 	ret
> 
> But of course I know nothing about throughput and latency
> of such add-s with one of their operands shifted first. And
> yes, the variant using mul is, comparing with the better
> optimized case, still one insn smaller.
The commit message in Linux (and Robin's answer) is pretty clear: it may improve 
things on some cores but does not make them worse on others.

> 
>> I would expect the compiler could easily replace a multiply by a series
>> of shift but it would be more difficult to do the invert.
>>
>> Also, this has been in Linux for a year now, so I am assuming Linux
>> folks are happy with changes (CCing Robin just in case I missed
>> anything). Therefore I am happy to give it a go on Xen as well.
> 
> In which case - can I take this as an ack, or do you want to first
> pursue the discussion?

I will commit it later on with another bunch of patches.

Cheers,
Jan Beulich June 5, 2019, 9:59 a.m. UTC | #5
>>> On 05.06.19 at 11:24, <julien.grall@arm.com> wrote:
> On 05/06/2019 08:42, Jan Beulich wrote:
>>>>> On 04.06.19 at 18:11, <julien.grall@arm.com> wrote:
>>> On 5/31/19 10:53 AM, Jan Beulich wrote:
>>>> According to Linux commit e75bef2a4f ("arm64: Select
>>>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>>>> variant using only bitwise operations on at least some hardware, and no
>>>> worse on other.
>>>>
>>>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> RFC: To be honest I'm not fully convinced this is a win in particular in
>>>>        the hweight32() case, as there's no actual shift insn which gets
>>>>        replaced by the multiplication. Even for hweight64() the compiler
>>>>        could emit better code and avoid the explicit shift by 32 (which it
>>>>        emits at least for me).
>>>
>>> I can see multiplication instruction used in both hweight32() and
>>> hweight64() with the compiler I am using.
>> 
>> That is for which exact implementation?
> 
> A simple call hweight64().

That's as ambiguous as it was before: are you talking about
un-patched code (before this series), or code with patches up to
at least this one applied? What I'm trying to understand is whether
your compiler is smart enough to use MUL without us telling it to.

Jan
Julien Grall June 5, 2019, 11:05 a.m. UTC | #6
On 05/06/2019 10:59, Jan Beulich wrote:
>>>> On 05.06.19 at 11:24, <julien.grall@arm.com> wrote:
>> On 05/06/2019 08:42, Jan Beulich wrote:
>>>>>> On 04.06.19 at 18:11, <julien.grall@arm.com> wrote:
>>>> On 5/31/19 10:53 AM, Jan Beulich wrote:
>>>>> According to Linux commit e75bef2a4f ("arm64: Select
>>>>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>>>>> variant using only bitwise operations on at least some hardware, and no
>>>>> worse on other.
>>>>>
>>>>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>> ---
>>>>> RFC: To be honest I'm not fully convinced this is a win in particular in
>>>>>         the hweight32() case, as there's no actual shift insn which gets
>>>>>         replaced by the multiplication. Even for hweight64() the compiler
>>>>>         could emit better code and avoid the explicit shift by 32 (which it
>>>>>         emits at least for me).
>>>>
>>>> I can see multiplication instruction used in both hweight32() and
>>>> hweight64() with the compiler I am using.
>>>
>>> That is for which exact implementation?
>>
>> A simple call hweight64().
> 
> That's as ambiguous as it was before: Are you talking about
> un-patched code (before this series), or patches up to at
> least this one applied. What I'm trying to understand is if your
> compiler is smart enough to use MUL without us telling it to.

unsigned int test_hweight64(uint64_t t)
{
     return hweight64(t);
}

I looked at the difference between before and after this patch. Before, the 
multiplication is not used. After, it is used, and fewer instructions are emitted.

Cheers,
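The kind of before/after inspection described above can also be paired with a quick equivalence check (illustrative, not from the thread; __builtin_popcountll is a GCC/Clang builtin, and the function name here is hypothetical):

```c
#include <stdint.h>

/* Multiplication-based hweight64() fold, as sketched earlier in the
 * thread: reduce to per-byte counts, then sum them with one mul. */
static unsigned int hweight64_fast(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555ULL;
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

/* Spot-check the fold against the compiler's own popcount builtin. */
static int hweight64_matches_builtin(uint64_t x)
{
    return hweight64_fast(x) == (unsigned int)__builtin_popcountll(x);
}
```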
Julien Grall June 6, 2019, 3:29 p.m. UTC | #7
On 6/5/19 10:24 AM, Julien Grall wrote:
> I will commit it later on with another bunch of patches.

Pushed now.

Thank you,

Patch

--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -12,6 +12,7 @@  config ARM_32
 config ARM_64
 	def_bool y
 	depends on 64BIT
+	select HAS_FAST_MULTIPLY
 
 config ARM
 	def_bool y