Message ID | 5CF0F9A30200007800233E07@prv1-mh.provo.novell.com
---|---
State | New, archived
Series | bitops: hweight<N>() improvements
Hi Jan,

On 5/31/19 10:53 AM, Jan Beulich wrote:
> According to Linux commit e75bef2a4f ("arm64: Select
> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
> variant using only bitwise operations on at least some hardware, and no
> worse on others.
>
> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC: To be honest I'm not fully convinced this is a win, in particular in
> the hweight32() case, as there's no actual shift insn which gets
> replaced by the multiplication. Even for hweight64() the compiler
> could emit better code and avoid the explicit shift by 32 (which it
> emits at least for me).

I can see a multiplication instruction used in both hweight32() and
hweight64() with the compiler I am using.

I would expect the compiler could easily replace a multiply by a series
of shifts, but it would be more difficult to do the inverse.

Also, this has been in Linux for a year now, so I am assuming the Linux
folks are happy with the changes (CCing Robin just in case I missed
anything). Therefore I am happy to give it a go on Xen as well.

Cheers,

> --- a/xen/arch/arm/Kconfig
> +++ b/xen/arch/arm/Kconfig
> @@ -12,6 +12,7 @@ config ARM_32
>  config ARM_64
>  	def_bool y
>  	depends on 64BIT
> +	select HAS_FAST_MULTIPLY
>
>  config ARM
>  	def_bool y
On 04/06/2019 17:11, Julien Grall wrote:
> Hi Jan,
>
> On 5/31/19 10:53 AM, Jan Beulich wrote:
>> [...]
>
> I can see a multiplication instruction used in both hweight32() and
> hweight64() with the compiler I am using.
>
> I would expect the compiler could easily replace a multiply by a series
> of shifts, but it would be more difficult to do the inverse.
>
> Also, this has been in Linux for a year now, so I am assuming the Linux
> folks are happy with the changes (CCing Robin just in case I missed
> anything). Therefore I am happy to give it a go on Xen as well.

IIRC it did look like Linux's hweight() routines could possibly be
further tuned for the A64 ISA to shave off one or two more instructions,
but it's yet to be proven that performance is anywhere near critical
enough to justify maintaining arch-specific versions. It costs us
basically nothing to switch between the two generic implementations,
though, so since the smaller-and-no-slower code can be considered a net
win (however minor), there seemed no reason *not* to apply the existing
option appropriately.

Robin.
>>> On 04.06.19 at 18:11, <julien.grall@arm.com> wrote:
> On 5/31/19 10:53 AM, Jan Beulich wrote:
>> According to Linux commit e75bef2a4f ("arm64: Select
>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>> variant using only bitwise operations on at least some hardware, and no
>> worse on others.
>>
>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> RFC: To be honest I'm not fully convinced this is a win, in particular in
>> the hweight32() case, as there's no actual shift insn which gets
>> replaced by the multiplication. Even for hweight64() the compiler
>> could emit better code and avoid the explicit shift by 32 (which it
>> emits at least for me).
>
> I can see a multiplication instruction used in both hweight32() and
> hweight64() with the compiler I am using.

That is for which exact implementation? What I was referring to as
"could emit better code" was the multiplication-free variant, where
the compiler fails to recognize (afaict) another opportunity to fold
a shift into an arithmetic instruction:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        add     x0, x0, x0, lsr #8
        add     x0, x0, x0, lsr #16
>>>     lsr     x1, x0, #32
>>>     add     w0, w1, w0
        and     w0, w0, #0xff
        ret

Afaict the two marked insns could be replaced by

        add     x0, x0, x0, lsr #32

With there only a sequence of add-s remaining, I'm having difficulty
seeing how the use of mul+lsr would actually help:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        mov     x1, #0x101010101010101
        mul     x0, x0, x1
        lsr     x0, x0, #56
        ret

But of course I know nothing about the throughput and latency of such
add-s with one of their operands shifted first. And yes, the variant
using mul is, compared with the better optimized case, still one insn
smaller.

> I would expect the compiler could easily replace a multiply by a series
> of shifts, but it would be more difficult to do the inverse.
>
> Also, this has been in Linux for a year now, so I am assuming the Linux
> folks are happy with the changes (CCing Robin just in case I missed
> anything). Therefore I am happy to give it a go on Xen as well.

In which case - can I take this as an ack, or do you want to first
pursue the discussion?

Jan
Hi Jan,

On 05/06/2019 08:42, Jan Beulich wrote:
>>>> On 04.06.19 at 18:11, <julien.grall@arm.com> wrote:
>> On 5/31/19 10:53 AM, Jan Beulich wrote:
>>> [...]
>>
>> I can see a multiplication instruction used in both hweight32() and
>> hweight64() with the compiler I am using.
>
> That is for which exact implementation?

A simple call to hweight64().

> What I was referring to as
> "could emit better code" was the multiplication-free variant, where
> the compiler fails to recognize (afaict) another opportunity to fold
> a shift into an arithmetic instruction:
>
>         add     x0, x0, x0, lsr #4
>         and     x0, x0, #0xf0f0f0f0f0f0f0f
>         add     x0, x0, x0, lsr #8
>         add     x0, x0, x0, lsr #16
> >>>     lsr     x1, x0, #32
> >>>     add     w0, w1, w0
>         and     w0, w0, #0xff
>         ret
>
> Afaict the two marked insns could be replaced by
>
>         add     x0, x0, x0, lsr #32

I am not a compiler expert. Anyway, this likely depends on the version
of the compiler you are using. They are becoming smarter and smarter.

> With there only a sequence of add-s remaining, I'm having difficulty
> seeing how the use of mul+lsr would actually help:
>
>         add     x0, x0, x0, lsr #4
>         and     x0, x0, #0xf0f0f0f0f0f0f0f
>         mov     x1, #0x101010101010101
>         mul     x0, x0, x1
>         lsr     x0, x0, #56
>         ret
>
> But of course I know nothing about the throughput and latency of such
> add-s with one of their operands shifted first. And yes, the variant
> using mul is, compared with the better optimized case, still one insn
> smaller.

The commit message in Linux (and Robin's answer) is pretty clear: it may
improve things on some cores, but does not make them worse on others.

>> I would expect the compiler could easily replace a multiply by a series
>> of shifts, but it would be more difficult to do the inverse.
>>
>> Also, this has been in Linux for a year now, so I am assuming the Linux
>> folks are happy with the changes (CCing Robin just in case I missed
>> anything). Therefore I am happy to give it a go on Xen as well.
>
> In which case - can I take this as an ack, or do you want to first
> pursue the discussion?

I will commit it later on with another bunch of patches.

Cheers,
>>> On 05.06.19 at 11:24, <julien.grall@arm.com> wrote:
> On 05/06/2019 08:42, Jan Beulich wrote:
>>>>> On 04.06.19 at 18:11, <julien.grall@arm.com> wrote:
>>> On 5/31/19 10:53 AM, Jan Beulich wrote:
>>>> [...]
>>>
>>> I can see a multiplication instruction used in both hweight32() and
>>> hweight64() with the compiler I am using.
>>
>> That is for which exact implementation?
>
> A simple call to hweight64().

That's as ambiguous as it was before: are you talking about the
un-patched code (before this series), or with the patches up to at
least this one applied? What I'm trying to understand is whether your
compiler is smart enough to use MUL without us telling it to.

Jan
On 05/06/2019 10:59, Jan Beulich wrote:
>>>> On 05.06.19 at 11:24, <julien.grall@arm.com> wrote:
>> On 05/06/2019 08:42, Jan Beulich wrote:
>>>>>> On 04.06.19 at 18:11, <julien.grall@arm.com> wrote:
>>>> On 5/31/19 10:53 AM, Jan Beulich wrote:
>>>>> [...]
>>>>
>>>> I can see a multiplication instruction used in both hweight32() and
>>>> hweight64() with the compiler I am using.
>>>
>>> That is for which exact implementation?
>>
>> A simple call to hweight64().
>
> That's as ambiguous as it was before: are you talking about the
> un-patched code (before this series), or with the patches up to at
> least this one applied? What I'm trying to understand is whether your
> compiler is smart enough to use MUL without us telling it to.

unsigned int test_hweight64(uint64_t t)
{
    return hweight64(t);
}

I looked at the difference between before and after this patch. Before,
the multiplication is not used. After, it is used, with fewer
instructions emitted.

Cheers,
On 6/5/19 10:24 AM, Julien Grall wrote:
> I will commit it later on with another bunch of patches.
Pushed now.
Thank you,
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -12,6 +12,7 @@ config ARM_32
 config ARM_64
 	def_bool y
 	depends on 64BIT
+	select HAS_FAST_MULTIPLY
 
 config ARM
 	def_bool y
According to Linux commit e75bef2a4f ("arm64: Select
ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
variant using only bitwise operations on at least some hardware, and no
worse on others.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: To be honest I'm not fully convinced this is a win, in particular in
the hweight32() case, as there's no actual shift insn which gets
replaced by the multiplication. Even for hweight64() the compiler
could emit better code and avoid the explicit shift by 32 (which it
emits at least for me).