diff mbox series

x86/hvm: Improve hvm_set_guest_pat() code generation again

Message ID 20220810133655.18040-1-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Headers show
Series x86/hvm: Improve hvm_set_guest_pat() code generation again | expand

Commit Message

Andrew Cooper Aug. 10, 2022, 1:36 p.m. UTC
From: Edwin Török <edvin.torok@citrix.com>

Following on from cset 9ce0a5e207f3 ("x86/hvm: Improve hvm_set_guest_pat()
code generation"), and the discovery that Clang/LLVM makes some especially
disastrous code generation for the loop at -O2

  https://github.com/llvm/llvm-project/issues/54644

Edvin decided to remove the loop entirely by fully vectorising it.  This is
substantially more efficient than the loop, and rather harder for a typical
compiler to mess up.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Edwin Török <edvin.torok@citrix.com>
---
 xen/arch/x86/hvm/hvm.c | 51 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 35 insertions(+), 16 deletions(-)

Comments

Jan Beulich Aug. 10, 2022, 2:06 p.m. UTC | #1
On 10.08.2022 15:36, Andrew Cooper wrote:
> From: Edwin Török <edvin.torok@citrix.com>
> 
> Following on from cset 9ce0a5e207f3 ("x86/hvm: Improve hvm_set_guest_pat()
> code generation"), and the discovery that Clang/LLVM makes some especially
> disastrous code generation for the loop at -O2
> 
>   https://github.com/llvm/llvm-project/issues/54644
> 
> Edvin decided to remove the loop entirely by fully vectorising it.  This is
> substantially more efficient than the loop, and rather harder for a typical
> compiler to mess up.
> 
> Signed-off-by: Edwin Török <edvin.torok@citrix.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

The main downside being that changing the code to fit in a new PAT
type will now be harder. I wonder in particular whether with that
in mind it wouldn't be better to express the check not in terms of
relations, but in terms of set / clear bits ("bits 3-7 clear AND
(bit 2 set OR bit 1 clear)"). The code kind of does so already, but
the variable names don't reflect that (and would hence need to
change in such an event).

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -302,24 +302,43 @@ void hvm_get_guest_pat(struct vcpu *v, u64 *guest_pat)
>          *guest_pat = v->arch.hvm.pat_cr;
>  }
>  
> -int hvm_set_guest_pat(struct vcpu *v, uint64_t guest_pat)
> +/*
> + * MSR_PAT takes 8 uniform fields, each of which must be a valid architectural
> + * memory type (0, 1, 4-7).  This is a fully vectorised form of the
> + * 8-iteration loop over bytes looking for PAT_TYPE_* constants.

While grep-ing for PAT_TYPE_ will hit this line, I think we want
every individual type to also be found here when grep-ing for one.
The actual values aren't going to change, but perhaps the beast
way to do so would still be by way of BUILD_BUG_ON()s.

> + */
> +static bool pat_valid(uint64_t val)
>  {
> -    unsigned int i;
> -    uint64_t tmp;
> +    /* Yields a non-zero value in any lane which had value greater than 7. */
> +    uint64_t any_gt_7   =  val & 0xf8f8f8f8f8f8f8f8;

This and the other constant want to gain UL suffixes. (While I'm
open to be convinced otherwise on the earlier two points, this one
I'm going to insist on. Yet in case it would end up being the only
change in need of making, it could of course be done while
committing.)

Jan
Andrew Cooper Dec. 16, 2022, 8:53 p.m. UTC | #2
On 10/08/2022 3:06 pm, Jan Beulich wrote:
> On 10.08.2022 15:36, Andrew Cooper wrote:
>> From: Edwin Török <edvin.torok@citrix.com>
>>
>> Following on from cset 9ce0a5e207f3 ("x86/hvm: Improve hvm_set_guest_pat()
>> code generation"), and the discovery that Clang/LLVM makes some especially
>> disastrous code generation for the loop at -O2
>>
>>   https://github.com/llvm/llvm-project/issues/54644
>>
>> Edvin decided to remove the loop entirely by fully vectorising it.  This is
>> substantially more efficient than the loop, and rather harder for a typical
>> compiler to mess up.
>>
>> Signed-off-by: Edwin Török <edvin.torok@citrix.com>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> The main downside being that changing the code to fit in a new PAT
> type will now be harder.

When was the last PAT type change?

Trick question.  Never, because PAT hasn't changed since it was
introduced 24 years ago in the Pentium III.

I really don't think we're in danger of needing to change this logic.

> I wonder in particular whether with that
> in mind it wouldn't be better to express the check not in terms of
> relations, but in terms of set / clear bits ("bits 3-7 clear AND
> (bit 2 set OR bit 1 clear)"). The code kind of does so already, but
> the variable names don't reflect that (and would hence need to
> change in such an event).

That would reduced clarity.

The bits being set or cleared are trivial for any developer, given the
particularly basic RHS expressions.

The constant names are what relate the bit patterns to the description
of the problem.

>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -302,24 +302,43 @@ void hvm_get_guest_pat(struct vcpu *v, u64 *guest_pat)
>>          *guest_pat = v->arch.hvm.pat_cr;
>>  }
>>  
>> -int hvm_set_guest_pat(struct vcpu *v, uint64_t guest_pat)
>> +/*
>> + * MSR_PAT takes 8 uniform fields, each of which must be a valid architectural
>> + * memory type (0, 1, 4-7).  This is a fully vectorised form of the
>> + * 8-iteration loop over bytes looking for PAT_TYPE_* constants.
> While grep-ing for PAT_TYPE_ will hit this line, I think we want
> every individual type to also be found here when grep-ing for one.
> The actual values aren't going to change, but perhaps the beast
> way to do so would still be by way of BUILD_BUG_ON()s.

Why?  What does that solve or improve?

"pat" is the thing people are going to be looking for if they're
actually trying to find this logic.

(And I bring this patch up specifically after reviewing Demi's series,
where PAT_TYPE_* changes to X86_MT_* but "pat" is still the useful
search term IMO.)

>
>> + */
>> +static bool pat_valid(uint64_t val)
>>  {
>> -    unsigned int i;
>> -    uint64_t tmp;
>> +    /* Yields a non-zero value in any lane which had value greater than 7. */
>> +    uint64_t any_gt_7   =  val & 0xf8f8f8f8f8f8f8f8;
> This and the other constant want to gain UL suffixes.

Fixed.

> (While I'm
> open to be convinced otherwise on the earlier two points, this one
> I'm going to insist on. Yet in case it would end up being the only
> change in need of making, it could of course be done while
> committing.)

I've tweaked the grammar a bit, but I don't think the other changes
would be an improvement.

~Andrew
Jan Beulich Dec. 19, 2022, 7:28 a.m. UTC | #3
On 16.12.2022 21:53, Andrew Cooper wrote:
> On 10/08/2022 3:06 pm, Jan Beulich wrote:
>> On 10.08.2022 15:36, Andrew Cooper wrote:
>>> From: Edwin Török <edvin.torok@citrix.com>
>>>
>>> Following on from cset 9ce0a5e207f3 ("x86/hvm: Improve hvm_set_guest_pat()
>>> code generation"), and the discovery that Clang/LLVM makes some especially
>>> disastrous code generation for the loop at -O2
>>>
>>>   https://github.com/llvm/llvm-project/issues/54644
>>>
>>> Edvin decided to remove the loop entirely by fully vectorising it.  This is
>>> substantially more efficient than the loop, and rather harder for a typical
>>> compiler to mess up.
>>>
>>> Signed-off-by: Edwin Török <edvin.torok@citrix.com>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> The main downside being that changing the code to fit in a new PAT
>> type will now be harder.
> 
> When was the last PAT type change?
> 
> Trick question.  Never, because PAT hasn't changed since it was
> introduced 24 years ago in the Pentium III.
> 
> I really don't think we're in danger of needing to change this logic.

One way to look at things, sure.

>> I wonder in particular whether with that
>> in mind it wouldn't be better to express the check not in terms of
>> relations, but in terms of set / clear bits ("bits 3-7 clear AND
>> (bit 2 set OR bit 1 clear)"). The code kind of does so already, but
>> the variable names don't reflect that (and would hence need to
>> change in such an event).
> 
> That would reduced clarity.
> 
> The bits being set or cleared are trivial for any developer, given the
> particularly basic RHS expressions.
> 
> The constant names are what relate the bit patterns to the description
> of the problem.

Again - one way to look at things. Plus, with Demi's series now also in
mind, what's done here is moving us in exactly the opposite direction.
Is this hot enough a function to warrant that?

>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -302,24 +302,43 @@ void hvm_get_guest_pat(struct vcpu *v, u64 *guest_pat)
>>>          *guest_pat = v->arch.hvm.pat_cr;
>>>  }
>>>  
>>> -int hvm_set_guest_pat(struct vcpu *v, uint64_t guest_pat)
>>> +/*
>>> + * MSR_PAT takes 8 uniform fields, each of which must be a valid architectural
>>> + * memory type (0, 1, 4-7).  This is a fully vectorised form of the
>>> + * 8-iteration loop over bytes looking for PAT_TYPE_* constants.
>> While grep-ing for PAT_TYPE_ will hit this line, I think we want
>> every individual type to also be found here when grep-ing for one.
>> The actual values aren't going to change, but perhaps the beast
>> way to do so would still be by way of BUILD_BUG_ON()s.
> 
> Why?  What does that solve or improve?
> 
> "pat" is the thing people are going to be looking for if they're
> actually trying to find this logic.
> 
> (And I bring this patch up specifically after reviewing Demi's series,
> where PAT_TYPE_* changes to X86_MT_* but "pat" is still the useful
> search term IMO.)

I don't think "PAT" is a good thing to grep for when trying to find uses
of particular memory types. Go grep for "_WP", "_WC", or "_WT" - you'll
find not overly many hits, and with the false positives filtered out
you'll have a good overview of which of these types is used in how many
places.

Jan
Andrew Cooper March 24, 2023, 12:59 a.m. UTC | #4
On 19/12/2022 7:28 am, Jan Beulich wrote:
> On 16.12.2022 21:53, Andrew Cooper wrote:
> Again - one way to look at things. Plus, with Demi's series now also in
> mind, what's done here is moving us in exactly the opposite direction.
> Is this hot enough a function to warrant that?

Yes - from the first cset, 9ce0a5e207f3 - it's used on virtual
vmentry/exit so is (or will be) reasonably hot in due course.

What is more important in the short term is avoiding the catastrophic
code generation that Clang still does with default Xen build settings,
also linked from the commit message.

>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>> @@ -302,24 +302,43 @@ void hvm_get_guest_pat(struct vcpu *v, u64 *guest_pat)
>>>>          *guest_pat = v->arch.hvm.pat_cr;
>>>>  }
>>>>  
>>>> -int hvm_set_guest_pat(struct vcpu *v, uint64_t guest_pat)
>>>> +/*
>>>> + * MSR_PAT takes 8 uniform fields, each of which must be a valid architectural
>>>> + * memory type (0, 1, 4-7).  This is a fully vectorised form of the
>>>> + * 8-iteration loop over bytes looking for PAT_TYPE_* constants.
>>> While grep-ing for PAT_TYPE_ will hit this line, I think we want
>>> every individual type to also be found here when grep-ing for one.
>>> The actual values aren't going to change, but perhaps the beast
>>> way to do so would still be by way of BUILD_BUG_ON()s.
>> Why?  What does that solve or improve?
>>
>> "pat" is the thing people are going to be looking for if they're
>> actually trying to find this logic.
>>
>> (And I bring this patch up specifically after reviewing Demi's series,
>> where PAT_TYPE_* changes to X86_MT_* but "pat" is still the useful
>> search term IMO.)
> I don't think "PAT" is a good thing to grep for when trying to find uses
> of particular memory types.

This is not a logical use of a particular memory type.  Being an
architectural auditing function, the only legitimate use of these
constants here is all of them at once.  This is the one place you firmly
don't care about finding when asking the question "How does Xen go about
handling WP mappings".

I have swapped PAT_TYPE_* for X86_MT_* now that Demi's series has been
committed, but that is the extent to which I think there are relevant
changes to be made.

~Andrew
Jan Beulich March 24, 2023, 9:32 a.m. UTC | #5
On 24.03.2023 01:59, Andrew Cooper wrote:
> On 19/12/2022 7:28 am, Jan Beulich wrote:
>> On 16.12.2022 21:53, Andrew Cooper wrote:
>> Again - one way to look at things. Plus, with Demi's series now also in
>> mind, what's done here is moving us in exactly the opposite direction.
>> Is this hot enough a function to warrant that?
> 
> Yes - from the first cset, 9ce0a5e207f3 - it's used on virtual
> vmentry/exit so is (or will be) reasonably hot in due course.
> 
> What is more important in the short term is avoiding the catastrophic
> code generation that Clang still does with default Xen build settings,
> also linked from the commit message.
> 
>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>> @@ -302,24 +302,43 @@ void hvm_get_guest_pat(struct vcpu *v, u64 *guest_pat)
>>>>>          *guest_pat = v->arch.hvm.pat_cr;
>>>>>  }
>>>>>  
>>>>> -int hvm_set_guest_pat(struct vcpu *v, uint64_t guest_pat)
>>>>> +/*
>>>>> + * MSR_PAT takes 8 uniform fields, each of which must be a valid architectural
>>>>> + * memory type (0, 1, 4-7).  This is a fully vectorised form of the
>>>>> + * 8-iteration loop over bytes looking for PAT_TYPE_* constants.
>>>> While grep-ing for PAT_TYPE_ will hit this line, I think we want
>>>> every individual type to also be found here when grep-ing for one.
>>>> The actual values aren't going to change, but perhaps the beast
>>>> way to do so would still be by way of BUILD_BUG_ON()s.
>>> Why?  What does that solve or improve?
>>>
>>> "pat" is the thing people are going to be looking for if they're
>>> actually trying to find this logic.
>>>
>>> (And I bring this patch up specifically after reviewing Demi's series,
>>> where PAT_TYPE_* changes to X86_MT_* but "pat" is still the useful
>>> search term IMO.)
>> I don't think "PAT" is a good thing to grep for when trying to find uses
>> of particular memory types.
> 
> This is not a logical use of a particular memory type.  Being an
> architectural auditing function, the only legitimate use of these
> constants here is all of them at once.  This is the one place you firmly
> don't care about finding when asking the question "How does Xen go about
> handling WP mappings".
> 
> I have swapped PAT_TYPE_* for X86_MT_* now that Demi's series has been
> committed, but that is the extent to which I think there are relevant
> changes to be made.

In the interest of getting the code gen issue addressed, but without
being fully convinced this is a good move:
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan
Andrew Cooper March 24, 2023, 12:15 p.m. UTC | #6
On 24/03/2023 9:32 am, Jan Beulich wrote:
> On 24.03.2023 01:59, Andrew Cooper wrote:
>> On 19/12/2022 7:28 am, Jan Beulich wrote:
>>> On 16.12.2022 21:53, Andrew Cooper wrote:
>>> Again - one way to look at things. Plus, with Demi's series now also in
>>> mind, what's done here is moving us in exactly the opposite direction.
>>> Is this hot enough a function to warrant that?
>> Yes - from the first cset, 9ce0a5e207f3 - it's used on virtual
>> vmentry/exit so is (or will be) reasonably hot in due course.
>>
>> What is more important in the short term is avoiding the catastrophic
>> code generation that Clang still does with default Xen build settings,
>> also linked from the commit message.
>>
>>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>>> @@ -302,24 +302,43 @@ void hvm_get_guest_pat(struct vcpu *v, u64 *guest_pat)
>>>>>>          *guest_pat = v->arch.hvm.pat_cr;
>>>>>>  }
>>>>>>  
>>>>>> -int hvm_set_guest_pat(struct vcpu *v, uint64_t guest_pat)
>>>>>> +/*
>>>>>> + * MSR_PAT takes 8 uniform fields, each of which must be a valid architectural
>>>>>> + * memory type (0, 1, 4-7).  This is a fully vectorised form of the
>>>>>> + * 8-iteration loop over bytes looking for PAT_TYPE_* constants.
>>>>> While grep-ing for PAT_TYPE_ will hit this line, I think we want
>>>>> every individual type to also be found here when grep-ing for one.
>>>>> The actual values aren't going to change, but perhaps the beast
>>>>> way to do so would still be by way of BUILD_BUG_ON()s.
>>>> Why?  What does that solve or improve?
>>>>
>>>> "pat" is the thing people are going to be looking for if they're
>>>> actually trying to find this logic.
>>>>
>>>> (And I bring this patch up specifically after reviewing Demi's series,
>>>> where PAT_TYPE_* changes to X86_MT_* but "pat" is still the useful
>>>> search term IMO.)
>>> I don't think "PAT" is a good thing to grep for when trying to find uses
>>> of particular memory types.
>> This is not a logical use of a particular memory type.  Being an
>> architectural auditing function, the only legitimate use of these
>> constants here is all of them at once.  This is the one place you firmly
>> don't care about finding when asking the question "How does Xen go about
>> handling WP mappings".
>>
>> I have swapped PAT_TYPE_* for X86_MT_* now that Demi's series has been
>> committed, but that is the extent to which I think there are relevant
>> changes to be made.
> In the interest of getting the code gen issue addressed, but without
> being fully convinced this is a good move:
> Acked-by: Jan Beulich <jbeulich@suse.com>

Thankyou.

~Andrew
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 0dd320a6a9fc..b63e6073dfd0 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -302,24 +302,43 @@  void hvm_get_guest_pat(struct vcpu *v, u64 *guest_pat)
         *guest_pat = v->arch.hvm.pat_cr;
 }
 
-int hvm_set_guest_pat(struct vcpu *v, uint64_t guest_pat)
+/*
+ * MSR_PAT takes 8 uniform fields, each of which must be a valid architectural
+ * memory type (0, 1, 4-7).  This is a fully vectorised form of the
+ * 8-iteration loop over bytes looking for PAT_TYPE_* constants.
+ */
+static bool pat_valid(uint64_t val)
 {
-    unsigned int i;
-    uint64_t tmp;
+    /* Yields a non-zero value in any lane which had value greater than 7. */
+    uint64_t any_gt_7   =  val & 0xf8f8f8f8f8f8f8f8;
 
-    for ( i = 0, tmp = guest_pat; i < 8; i++, tmp >>= 8 )
-        switch ( tmp & 0xff )
-        {
-        case PAT_TYPE_UC_MINUS:
-        case PAT_TYPE_UNCACHABLE:
-        case PAT_TYPE_WRBACK:
-        case PAT_TYPE_WRCOMB:
-        case PAT_TYPE_WRPROT:
-        case PAT_TYPE_WRTHROUGH:
-            break;
-        default:
-            return 0;
-        }
+    /*
+     * With the > 7 case covered, identify lanes with the value 0-3 by finding
+     * lanes with bit 2 clear.
+     *
+     * Yields bit 2 set in each lane which has a value <= 3.
+     */
+    uint64_t any_le_3   = ~val & 0x0404040404040404;
+
+    /*
+     * Logically, any_2_or_3 is any_le_3 && bit 1 set.
+     *
+     * We could calculate any_gt_1 as val & 0x02 and resolve the two vectors
+     * of booleans (shift one of them until the mask lines up, then bitwise
+     * and), but that is unnecessary calculation.
+     *
+     * Shift any_le_3 so it becomes bit 1 in each lane which has a value <= 3,
+     * and look for bit 1 in a subset of lanes.
+     */
+    uint64_t any_2_or_3 =  val & (any_le_3 >> 1);
+
+    return !(any_gt_7 | any_2_or_3);
+}
+
+int hvm_set_guest_pat(struct vcpu *v, uint64_t guest_pat)
+{
+    if ( !pat_valid(guest_pat) )
+        return 0;
 
     if ( !alternative_call(hvm_funcs.set_guest_pat, v, guest_pat) )
         v->arch.hvm.pat_cr = guest_pat;