[9/9] x86/bitops: Use the POPCNT instruction when available

Message ID	20240822230635.954557-10-andrew.cooper3@citrix.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> From: Andrew Cooper <andrew.cooper3@citrix.com> To: Xen-devel <xen-devel@lists.xenproject.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Jan Beulich <JBeulich@suse.com>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com> Subject: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available Date: Fri, 23 Aug 2024 00:06:35 +0100 Message-Id: <20240822230635.954557-10-andrew.cooper3@citrix.com> In-Reply-To: <20240822230635.954557-1-andrew.cooper3@citrix.com> References: <20240822230635.954557-1-andrew.cooper3@citrix.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	xen/bitops: hweight() cleanup/improvements \| expand [0/9] xen/bitops: hweight() cleanup/improvements [1/9] xen/bitops: Reinstate the please tidy message [2/9] xen/bitops: Introduce a multiple_bits_set() helper [3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set() [4/9] xen/bitops: Drop the remnants of hweight{8,16}() [5/9] xen/bitops: Introduce generic_hweightl() and hweightl() [6/9] xen/bitops: Drop hweight_long() and use hweightl() [7/9] xen/bitops: Implement hweight64() in terms of hweightl() [8/9] xen/bitops: Implement hweight32() in terms of hweightl() [9/9] x86/bitops: Use the POPCNT instruction when available

Andrew Cooper Aug. 22, 2024, 11:06 p.m. UTC

It has existed in x86 CPUs since 2008, so we're only 16 years late adding
support.  With all the other scafolding in place, implement arch_hweightl()
for x86.

The only complication is that the call to arch_generic_hweightl() is behind
the compilers back.  Address this by writing it in ASM and ensure that it
preserves all registers.

Copy the code generation from generic_hweightl().  It's not a complicated
algorithm, and is easy to regenerate if needs be, but cover it with the same
unit tests as test_generic_hweightl() just for piece of mind.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

A few RFC points.

 * I throught we had an x86 general lib-y but I can't find one, hence why it's
   still in xen/lib/ for now.

 * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
   __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.

   GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
   habit of complaining about incompatible instructions even when it's not
   emitting them.
---
 xen/arch/x86/include/asm/bitops.h | 21 ++++++++++++++
 xen/lib/Makefile                  |  1 +
 xen/lib/arch-generic-hweightl.S   | 46 +++++++++++++++++++++++++++++++
 xen/lib/generic-hweightl.c        | 15 ++++++++++
 4 files changed, 83 insertions(+)
 create mode 100644 xen/lib/arch-generic-hweightl.S

Nicola Vetrini Aug. 23, 2024, 3:35 p.m. UTC | #1

On 2024-08-23 01:06, Andrew Cooper wrote:
> It has existed in x86 CPUs since 2008, so we're only 16 years late 
> adding
> support.  With all the other scafolding in place, implement 
> arch_hweightl()
> for x86.
> 
> The only complication is that the call to arch_generic_hweightl() is 
> behind
> the compilers back.  Address this by writing it in ASM and ensure that 
> it
> preserves all registers.
> 
> Copy the code generation from generic_hweightl().  It's not a 
> complicated
> algorithm, and is easy to regenerate if needs be, but cover it with the 
> same
> unit tests as test_generic_hweightl() just for piece of mind.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> A few RFC points.
> 
>  * I throught we had an x86 general lib-y but I can't find one, hence 
> why it's
>    still in xen/lib/ for now.
> 
>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>    __attribute__((no_caller_saved_registers)) and can forgo writing 
> this in asm.
> 
>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It 
> has a
>    habit of complaining about incompatible instructions even when it's 
> not
>    emitting them.
> ---

> diff --git a/xen/lib/arch-generic-hweightl.S 
> b/xen/lib/arch-generic-hweightl.S
> new file mode 100644
> index 000000000000..15c6e3394845
> --- /dev/null
> +++ b/xen/lib/arch-generic-hweightl.S
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +        .file __FILE__
> +
> +#include <xen/linkage.h>
> +
> +/*
> + * An implementation of generic_hweightl() used on hardware without 
> the POPCNT
> + * instruction.
> + *
> + * This function is called from within an ALTERNATIVE in 
> arch_hweightl().
> + * i.e. behind the back of the compiler.  Therefore all registers are 
> callee
> + * preserved.
> + *
> + * The ASM is what GCC-12 emits for generic_hweightl() in a release 
> build of
> + * Xen, with spilling of %rdi/%rdx to preserve the callers registers.
> + */
> +FUNC(arch_generic_hweightl)
> +        push   %rdi
> +        push   %rdx
> +
> +        movabs $0x5555555555555555, %rdx
> +        mov    %rdi, %rax
> +        shr    $1, %rax
> +        and    %rdx, %rax
> +        sub    %rax, %rdi
> +        movabs $0x3333333333333333, %rax
> +        mov    %rdi, %rdx
> +        shr    $0x2, %rdi
> +        and    %rax, %rdx
> +        and    %rax, %rdi
> +        add    %rdi, %rdx
> +        mov    %rdx, %rax
> +        shr    $0x4, %rax
> +        add    %rdx, %rax
> +        movabs $0xf0f0f0f0f0f0f0f, %rdx
> +        and    %rdx, %rax
> +        movabs $0x101010101010101, %rdx
> +        imul   %rdx, %rax
> +        shr    $0x38, %rax
> +
> +        pop    %rdx
> +        pop    %rdi
> +
> +        ret
> +END(arch_generic_hweightl)
> diff --git a/xen/lib/generic-hweightl.c b/xen/lib/generic-hweightl.c
> index fa4bbec273ab..4b39dd84de5e 100644
> --- a/xen/lib/generic-hweightl.c
> +++ b/xen/lib/generic-hweightl.c
> @@ -43,4 +43,19 @@ static void __init __constructor 
> test_generic_hweightl(void)
>      RUNTIME_CHECK(generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 
> 2);
>      RUNTIME_CHECK(generic_hweightl, -1UL, BITS_PER_LONG);
>  }
> +
> +#ifdef CONFIG_X86
> +unsigned int arch_generic_hweightl(unsigned long);

Hi Andrew,

do you mind putting a parameter name here, as the current form 
introduces a violation of MISRA Rule 8.2 [1] (even if unnecessary, given 
its implementation)?

Thanks,
  Nicola

[1] 
https://saas.eclairit.com:3787/fs/var/local/eclair/xen-project.ecdf/xen-project/patchew/xen/ECLAIR_normal/patchew/20240822230635.954557-1-andrew.cooper3@citrix.com/X86_64/7647484702/PROJECT.ecd;/by_service/MC3R1.R8.2.html

> +static void __init __constructor test_arch_generic_hweightl(void)
> +{
> +    RUNTIME_CHECK(arch_generic_hweightl, 0, 0);
> +    RUNTIME_CHECK(arch_generic_hweightl, 1, 1);
> +    RUNTIME_CHECK(arch_generic_hweightl, 3, 2);
> +    RUNTIME_CHECK(arch_generic_hweightl, 7, 3);
> +    RUNTIME_CHECK(arch_generic_hweightl, 0xff, 8);
> +
> +    RUNTIME_CHECK(arch_generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 
> 1)), 2);
> +    RUNTIME_CHECK(arch_generic_hweightl, -1UL, BITS_PER_LONG);
> +}
> +#endif /* CONFIG_X86 */
>  #endif /* CONFIG_SELF_TESTS */

Andrew Cooper Aug. 23, 2024, 3:38 p.m. UTC | #2

On 23/08/2024 4:35 pm, Nicola Vetrini wrote:
> On 2024-08-23 01:06, Andrew Cooper wrote:
>> diff --git a/xen/lib/generic-hweightl.c b/xen/lib/generic-hweightl.c
>> index fa4bbec273ab..4b39dd84de5e 100644
>> --- a/xen/lib/generic-hweightl.c
>> +++ b/xen/lib/generic-hweightl.c
>> @@ -43,4 +43,19 @@ static void __init __constructor
>> test_generic_hweightl(void)
>>      RUNTIME_CHECK(generic_hweightl, 1 | (1UL << (BITS_PER_LONG -
>> 1)), 2);
>>      RUNTIME_CHECK(generic_hweightl, -1UL, BITS_PER_LONG);
>>  }
>> +
>> +#ifdef CONFIG_X86
>> +unsigned int arch_generic_hweightl(unsigned long);
>
> Hi Andrew,
>
> do you mind putting a parameter name here, as the current form
> introduces a violation of MISRA Rule 8.2 [1] (even if unnecessary,
> given its implementation)?

Sorry.

I did run Eclair on this series during development, but this was a late
change and I forgot to re-check.

Fixed locally.

~Andrew

Jan Beulich Aug. 26, 2024, 1:07 p.m. UTC | #3

On 23.08.2024 01:06, Andrew Cooper wrote:
> A few RFC points.
> 
>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>    still in xen/lib/ for now.

We indeed have nothing like that yet. The file name should then imo not be
arch-* though, but x86-*. Or you could put it in xen/lib/x86/. It could even
be obj-y rather than lib-y, unless you expect we'll be able to get rid of
all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
my ABI-level series the call site in arch_hweightl() could then be made go
away for v2 and above, at which point it living in lib-y will be preferable.

>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>    __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.
> 
>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>    habit of complaining about incompatible instructions even when it's not
>    emitting them.

What is this part about? What incompatible instructions, in particular?

> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>  }
>  #define arch_flsl arch_flsl
>  
> +static always_inline unsigned int arch_hweightl(unsigned long x)
> +{
> +    unsigned int r;
> +
> +    /*
> +     * arch_generic_hweightl() is written in ASM in order to preserve all
> +     * registers, as the compiler can't see the call.
> +     *
> +     * This limits the POPCNT instruction to using the same ABI as a function
> +     * call (input in %rdi, output in %eax) but that's fine.
> +     */
> +    alternative_io("call arch_generic_hweightl",
> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
> +                   [val] "D" (x));

If you made [val] an output ("+D") you could avoid preserving the register
in the function. And I'd expect the register wouldn't be re-used often
afterwards, so its clobbering likely won't harm code quality very much.

> --- /dev/null
> +++ b/xen/lib/arch-generic-hweightl.S
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +        .file __FILE__
> +
> +#include <xen/linkage.h>
> +
> +/*
> + * An implementation of generic_hweightl() used on hardware without the POPCNT
> + * instruction.
> + *
> + * This function is called from within an ALTERNATIVE in arch_hweightl().
> + * i.e. behind the back of the compiler.  Therefore all registers are callee
> + * preserved.
> + *
> + * The ASM is what GCC-12 emits for generic_hweightl() in a release build of
> + * Xen, with spilling of %rdi/%rdx to preserve the callers registers.
> + */
> +FUNC(arch_generic_hweightl)
> +        push   %rdi
> +        push   %rdx
> +
> +        movabs $0x5555555555555555, %rdx
> +        mov    %rdi, %rax
> +        shr    $1, %rax
> +        and    %rdx, %rax
> +        sub    %rax, %rdi
> +        movabs $0x3333333333333333, %rax
> +        mov    %rdi, %rdx
> +        shr    $0x2, %rdi

Could I talk you into omitting the 0x here and ...

> +        and    %rax, %rdx
> +        and    %rax, %rdi
> +        add    %rdi, %rdx
> +        mov    %rdx, %rax
> +        shr    $0x4, %rax

... here, and maybe use ...

> +        add    %rdx, %rax
> +        movabs $0xf0f0f0f0f0f0f0f, %rdx
> +        and    %rdx, %rax
> +        movabs $0x101010101010101, %rdx
> +        imul   %rdx, %rax
> +        shr    $0x38, %rax

... 56 here (or even BITS_PER_LONG-8)?

Jan

Andrew Cooper Aug. 27, 2024, 11:17 a.m. UTC | #4

On 26/08/2024 2:07 pm, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
>> A few RFC points.
>>
>>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>>    still in xen/lib/ for now.
> We indeed have nothing like that yet. The file name should then imo not be
> arch-* though, but x86-*. Or you could put it in xen/lib/x86/.

I was worried about xen/lib/x86/ and interfering with userspace.

> It could even
> be obj-y rather than lib-y, unless you expect we'll be able to get rid of
> all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
> my ABI-level series the call site in arch_hweightl() could then be made go
> away for v2 and above, at which point it living in lib-y will be preferable.

Yes, I was specifically trying to account for this.

I'm expecting the mandatory-popcnt work to end up looking something like:

diff --git a/xen/arch/x86/include/asm/bitops.h
b/xen/arch/x86/include/asm/bitops.h
index 0db698ed3f4c..027eda60c094 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -480,6 +480,9 @@ static always_inline unsigned int
arch_hweightl(unsigned long x)
 {
     unsigned int r;
 
+    if ( IS_ENABLED(CONFIG_REQUIRE_POPCNT /* or whatever */) )
+        return __builtin_popcountl(x);
+
     /*
      * arch_generic_hweightl() is written in ASM in order to preserve all
      * registers, as the compiler can't see the call.


which in turn DCE's the alternative_io() and drops the reference to
arch_generic_hweightl().

>
>>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>>    __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.
>>
>>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>>    habit of complaining about incompatible instructions even when it's not
>>    emitting them.
> What is this part about? What incompatible instructions, in particular?

This was weird.  https://godbolt.org/z/4z1qzWbfE is an example.

>
>> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>>  }
>>  #define arch_flsl arch_flsl
>>  
>> +static always_inline unsigned int arch_hweightl(unsigned long x)
>> +{
>> +    unsigned int r;
>> +
>> +    /*
>> +     * arch_generic_hweightl() is written in ASM in order to preserve all
>> +     * registers, as the compiler can't see the call.
>> +     *
>> +     * This limits the POPCNT instruction to using the same ABI as a function
>> +     * call (input in %rdi, output in %eax) but that's fine.
>> +     */
>> +    alternative_io("call arch_generic_hweightl",
>> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
>> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
>> +                   [val] "D" (x));
> If you made [val] an output ("+D") you could avoid preserving the register
> in the function. And I'd expect the register wouldn't be re-used often
> afterwards, so its clobbering likely won't harm code quality very much.

"+D" means it's modified by the asm, which forces preservation of the
register, if it's still needed afterwards.

Plain "D" means not modified by the asm, which means it can be reused if
necessary.

>
>> --- /dev/null
>> +++ b/xen/lib/arch-generic-hweightl.S
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +        .file __FILE__
>> +
>> +#include <xen/linkage.h>
>> +
>> +/*
>> + * An implementation of generic_hweightl() used on hardware without the POPCNT
>> + * instruction.
>> + *
>> + * This function is called from within an ALTERNATIVE in arch_hweightl().
>> + * i.e. behind the back of the compiler.  Therefore all registers are callee
>> + * preserved.
>> + *
>> + * The ASM is what GCC-12 emits for generic_hweightl() in a release build of
>> + * Xen, with spilling of %rdi/%rdx to preserve the callers registers.
>> + */
>> +FUNC(arch_generic_hweightl)
>> +        push   %rdi
>> +        push   %rdx
>> +
>> +        movabs $0x5555555555555555, %rdx
>> +        mov    %rdi, %rax
>> +        shr    $1, %rax
>> +        and    %rdx, %rax
>> +        sub    %rax, %rdi
>> +        movabs $0x3333333333333333, %rax
>> +        mov    %rdi, %rdx
>> +        shr    $0x2, %rdi
> Could I talk you into omitting the 0x here and ...
>
>> +        and    %rax, %rdx
>> +        and    %rax, %rdi
>> +        add    %rdi, %rdx
>> +        mov    %rdx, %rax
>> +        shr    $0x4, %rax
> ... here, and maybe use ...
>
>> +        add    %rdx, %rax
>> +        movabs $0xf0f0f0f0f0f0f0f, %rdx
>> +        and    %rdx, %rax
>> +        movabs $0x101010101010101, %rdx
>> +        imul   %rdx, %rax
>> +        shr    $0x38, %rax
> ... 56 here (or even BITS_PER_LONG-8)?

Ok.

~Andrew

Jan Beulich Aug. 27, 2024, 12:47 p.m. UTC | #5

On 27.08.2024 13:17, Andrew Cooper wrote:
> On 26/08/2024 2:07 pm, Jan Beulich wrote:
>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>> A few RFC points.
>>>
>>>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>>>    still in xen/lib/ for now.
>> We indeed have nothing like that yet. The file name should then imo not be
>> arch-* though, but x86-*. Or you could put it in xen/lib/x86/.
> 
> I was worried about xen/lib/x86/ and interfering with userspace.
> 
>> It could even
>> be obj-y rather than lib-y, unless you expect we'll be able to get rid of
>> all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
>> my ABI-level series the call site in arch_hweightl() could then be made go
>> away for v2 and above, at which point it living in lib-y will be preferable.
> 
> Yes, I was specifically trying to account for this.
> 
> I'm expecting the mandatory-popcnt work to end up looking something like:
> 
> diff --git a/xen/arch/x86/include/asm/bitops.h
> b/xen/arch/x86/include/asm/bitops.h
> index 0db698ed3f4c..027eda60c094 100644
> --- a/xen/arch/x86/include/asm/bitops.h
> +++ b/xen/arch/x86/include/asm/bitops.h
> @@ -480,6 +480,9 @@ static always_inline unsigned int
> arch_hweightl(unsigned long x)
>  {
>      unsigned int r;
>  
> +    if ( IS_ENABLED(CONFIG_REQUIRE_POPCNT /* or whatever */) )
> +        return __builtin_popcountl(x);
> +
>      /*
>       * arch_generic_hweightl() is written in ASM in order to preserve all
>       * registers, as the compiler can't see the call.
> 
> 
> which in turn DCE's the alternative_io() and drops the reference to
> arch_generic_hweightl().

Right, that's along the lines of what I was thinking to re-base to once
your work has gone in. (I have close to zero hope that my work would be
going in first. [1]) Just that I don't think we'll have separate
CONFIG_REQUIRE_<feature> settings. Yet how exactly that wants structuring
is something we ought to discuss there, not here.

>>>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>>>    __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.
>>>
>>>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>>>    habit of complaining about incompatible instructions even when it's not
>>>    emitting them.
>> What is this part about? What incompatible instructions, in particular?
> 
> This was weird.  https://godbolt.org/z/4z1qzWbfE is an example.

That's because apparently in your experiments you didn't add -mno-sse. If you
incrementally add that, then -mno-mmx, then -msoft-float, you'll first see the
diagnostic change and then observe it to finally compile. And yes, from
looking at the gcc code emitting this error, this is solely tied to the ISAs
enabled at the time the function is being compiled. It's independent of the
choice of insns. Pretty clearly a shortcoming, imo.

>>> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>>>  }
>>>  #define arch_flsl arch_flsl
>>>  
>>> +static always_inline unsigned int arch_hweightl(unsigned long x)
>>> +{
>>> +    unsigned int r;
>>> +
>>> +    /*
>>> +     * arch_generic_hweightl() is written in ASM in order to preserve all
>>> +     * registers, as the compiler can't see the call.
>>> +     *
>>> +     * This limits the POPCNT instruction to using the same ABI as a function
>>> +     * call (input in %rdi, output in %eax) but that's fine.
>>> +     */
>>> +    alternative_io("call arch_generic_hweightl",
>>> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
>>> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
>>> +                   [val] "D" (x));
>> If you made [val] an output ("+D") you could avoid preserving the register
>> in the function. And I'd expect the register wouldn't be re-used often
>> afterwards, so its clobbering likely won't harm code quality very much.
> 
> "+D" means it's modified by the asm, which forces preservation of the
> register, if it's still needed afterwards.
> 
> Plain "D" means not modified by the asm, which means it can be reused if
> necessary.

And we likely would prefer the former: If the register's value isn't
use afterwards (and that's the case as far as the function on its own
goes), the compiler will know it doesn't need to preserve anything.
That way, rather than always preserving (in the called function)
preservation will be limited to just the (likely few) cases where the
value actually is still needed afterwards.

Jan

[1] "x86: allow Kconfig control over psABI level" actually has a suitable
    R-b, but its prereq "build: permit Kconfig control over how to deal
    with unsatisfiable choices" doesn't.

Andrew Cooper Aug. 27, 2024, 2:59 p.m. UTC | #6

On 27/08/2024 1:47 pm, Jan Beulich wrote:
> On 27.08.2024 13:17, Andrew Cooper wrote:
>> On 26/08/2024 2:07 pm, Jan Beulich wrote:
>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>>> A few RFC points.
>>>>
>>>>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>>>>    still in xen/lib/ for now.
>>> We indeed have nothing like that yet. The file name should then imo not be
>>> arch-* though, but x86-*. Or you could put it in xen/lib/x86/.
>> I was worried about xen/lib/x86/ and interfering with userspace.
>>
>>> It could even
>>> be obj-y rather than lib-y, unless you expect we'll be able to get rid of
>>> all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
>>> my ABI-level series the call site in arch_hweightl() could then be made go
>>> away for v2 and above, at which point it living in lib-y will be preferable.
>> Yes, I was specifically trying to account for this.
>>
>> I'm expecting the mandatory-popcnt work to end up looking something like:
>>
>> diff --git a/xen/arch/x86/include/asm/bitops.h
>> b/xen/arch/x86/include/asm/bitops.h
>> index 0db698ed3f4c..027eda60c094 100644
>> --- a/xen/arch/x86/include/asm/bitops.h
>> +++ b/xen/arch/x86/include/asm/bitops.h
>> @@ -480,6 +480,9 @@ static always_inline unsigned int
>> arch_hweightl(unsigned long x)
>>  {
>>      unsigned int r;
>>  
>> +    if ( IS_ENABLED(CONFIG_REQUIRE_POPCNT /* or whatever */) )
>> +        return __builtin_popcountl(x);
>> +
>>      /*
>>       * arch_generic_hweightl() is written in ASM in order to preserve all
>>       * registers, as the compiler can't see the call.
>>
>>
>> which in turn DCE's the alternative_io() and drops the reference to
>> arch_generic_hweightl().
> Right, that's along the lines of what I was thinking to re-base to once
> your work has gone in. (I have close to zero hope that my work would be
> going in first. [1]) Just that I don't think we'll have separate
> CONFIG_REQUIRE_<feature> settings. Yet how exactly that wants structuring
> is something we ought to discuss there, not here.

I just couldn't remember the name you'd given the option.  I wasn't
trying to driveby review it here.

>
>>>>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>>>>    __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.
>>>>
>>>>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>>>>    habit of complaining about incompatible instructions even when it's not
>>>>    emitting them.
>>> What is this part about? What incompatible instructions, in particular?
>> This was weird.  https://godbolt.org/z/4z1qzWbfE is an example.
> That's because apparently in your experiments you didn't add -mno-sse. If you
> incrementally add that, then -mno-mmx, then -msoft-float, you'll first see the
> diagnostic change and then observe it to finally compile. And yes, from
> looking at the gcc code emitting this error, this is solely tied to the ISAs
> enabled at the time the function is being compiled. It's independent of the
> choice of insns. Pretty clearly a shortcoming, imo.

Ah - it was -msoft-float which I was looking for.

I'd played around with -mno-{sse,mmx} but not gotten a working result.


>
>>>> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>>>>  }
>>>>  #define arch_flsl arch_flsl
>>>>  
>>>> +static always_inline unsigned int arch_hweightl(unsigned long x)
>>>> +{
>>>> +    unsigned int r;
>>>> +
>>>> +    /*
>>>> +     * arch_generic_hweightl() is written in ASM in order to preserve all
>>>> +     * registers, as the compiler can't see the call.
>>>> +     *
>>>> +     * This limits the POPCNT instruction to using the same ABI as a function
>>>> +     * call (input in %rdi, output in %eax) but that's fine.
>>>> +     */
>>>> +    alternative_io("call arch_generic_hweightl",
>>>> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
>>>> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
>>>> +                   [val] "D" (x));
>>> If you made [val] an output ("+D") you could avoid preserving the register
>>> in the function. And I'd expect the register wouldn't be re-used often
>>> afterwards, so its clobbering likely won't harm code quality very much.
>> "+D" means it's modified by the asm, which forces preservation of the
>> register, if it's still needed afterwards.
>>
>> Plain "D" means not modified by the asm, which means it can be reused if
>> necessary.
> And we likely would prefer the former: If the register's value isn't
> use afterwards (and that's the case as far as the function on its own
> goes), the compiler will know it doesn't need to preserve anything.
> That way, rather than always preserving (in the called function)
> preservation will be limited to just the (likely few) cases where the
> value actually is still needed afterwards.

Constraints are there to describe how the asm() behaves to the compiler.

You appear to be asking me to put in explicitly-incorrect constraints
because you think it will game the optimiser.

Except the reasoning is backwards.  The only thing forcing "+D" will do
is cause the compiler to preserve the value elsewhere if it's actually
needed later, despite the contents of %rdi still being good for the purpose.

In other words, using "+D" for asm which is really only "D" (as this one
is) will result in strictly worse code generation in the corner case you
seem to be worried about.

~Andrew

Jan Beulich Aug. 27, 2024, 3:04 p.m. UTC | #7

On 27.08.2024 16:59, Andrew Cooper wrote:
> On 27/08/2024 1:47 pm, Jan Beulich wrote:
>> On 27.08.2024 13:17, Andrew Cooper wrote:
>>> On 26/08/2024 2:07 pm, Jan Beulich wrote:
>>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>>>> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>>>>>  }
>>>>>  #define arch_flsl arch_flsl
>>>>>  
>>>>> +static always_inline unsigned int arch_hweightl(unsigned long x)
>>>>> +{
>>>>> +    unsigned int r;
>>>>> +
>>>>> +    /*
>>>>> +     * arch_generic_hweightl() is written in ASM in order to preserve all
>>>>> +     * registers, as the compiler can't see the call.
>>>>> +     *
>>>>> +     * This limits the POPCNT instruction to using the same ABI as a function
>>>>> +     * call (input in %rdi, output in %eax) but that's fine.
>>>>> +     */
>>>>> +    alternative_io("call arch_generic_hweightl",
>>>>> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
>>>>> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
>>>>> +                   [val] "D" (x));
>>>> If you made [val] an output ("+D") you could avoid preserving the register
>>>> in the function. And I'd expect the register wouldn't be re-used often
>>>> afterwards, so its clobbering likely won't harm code quality very much.
>>> "+D" means it's modified by the asm, which forces preservation of the
>>> register, if it's still needed afterwards.
>>>
>>> Plain "D" means not modified by the asm, which means it can be reused if
>>> necessary.
>> And we likely would prefer the former: If the register's value isn't
>> use afterwards (and that's the case as far as the function on its own
>> goes), the compiler will know it doesn't need to preserve anything.
>> That way, rather than always preserving (in the called function)
>> preservation will be limited to just the (likely few) cases where the
>> value actually is still needed afterwards.
> 
> Constraints are there to describe how the asm() behaves to the compiler.
> 
> You appear to be asking me to put in explicitly-incorrect constraints
> because you think it will game the optimiser.
> 
> Except the reasoning is backwards.  The only thing forcing "+D" will do
> is cause the compiler to preserve the value elsewhere if it's actually
> needed later, despite the contents of %rdi still being good for the purpose.
> 
> In other words, using "+D" for asm which is really only "D" (as this one
> is) will result in strictly worse code generation in the corner case you
> seem to be worried about.

Well, then leave it as is. (An extra benefit would have been that
arch_generic_hweightl() then would ended up with a odd-number-of-slots
stack frame. Not that this matters much, but having ABI-compliant
functions where possible seems always preferable, when possible.)

Jan

[9/9] x86/bitops: Use the POPCNT instruction when available

Commit Message

Comments

Patch