
[v2,1/6] KVM: x86: Clear all supported MPX xfeatures if they are not all set

Message ID 20221230162442.3781098-2-aaronlewis@google.com (mailing list archive)
State New, archived
Series Clean up the supported xfeatures

Commit Message

Aaron Lewis Dec. 30, 2022, 4:24 p.m. UTC
Be a good citizen and don't allow any of the supported MPX xfeatures[1]
to be set if they can't all be set.  That way userspace or a guest
doesn't fail if it attempts to set them in XCR0.

[1] CPUID.(EAX=0DH,ECX=0):EAX.BNDREGS[bit-3]
    CPUID.(EAX=0DH,ECX=0):EAX.BNDCSR[bit-4]

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
---
 arch/x86/kvm/cpuid.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Xiaoyao Li Jan. 2, 2023, 3:03 p.m. UTC | #1
On 12/31/2022 12:24 AM, Aaron Lewis wrote:
> Be a good citizen and don't allow any of the supported MPX xfeatures[1]
> to be set if they can't all be set.  That way userspace or a guest
> doesn't fail if it attempts to set them in XCR0.
> 
> [1] CPUID.(EAX=0DH,ECX=0):EAX.BNDREGS[bit-3]
>      CPUID.(EAX=0DH,ECX=0):EAX.BNDCSR[bit-4]
> 
> Suggested-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Aaron Lewis <aaronlewis@google.com>
> ---
>   arch/x86/kvm/cpuid.c | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c4e8257629165..2431c46d456b4 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -855,6 +855,16 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func)
>   	return 0;
>   }
>   
> +static u64 sanitize_xcr0(u64 xcr0)
> +{
> +	u64 mask;
> +
> +	mask = XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR;
> +	if ((xcr0 & mask) != mask)
> +		xcr0 &= ~mask;

Maybe it can WARN_ON_ONCE() here.

It implies either a kernel bug that permitted_xcr0 is invalid or a 
broken HW.

> +	return xcr0;
> +}
> +
>   static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   {
>   	struct kvm_cpuid_entry2 *entry;
> @@ -982,6 +992,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   		u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
>   		u64 permitted_xss = kvm_caps.supported_xss;
>   
> +		permitted_xcr0 = sanitize_xcr0(permitted_xcr0);
> +
>   		entry->eax &= permitted_xcr0;
>   		entry->ebx = xstate_required_size(permitted_xcr0, false);
>   		entry->ecx = entry->ebx;
Sean Christopherson Jan. 3, 2023, 6:46 p.m. UTC | #2
On Fri, Dec 30, 2022, Aaron Lewis wrote:
> Be a good citizen and don't allow any of the supported MPX xfeatures[1]
> to be set if they can't all be set.  That way userspace or a guest
> doesn't fail if it attempts to set them in XCR0.
> 
> [1] CPUID.(EAX=0DH,ECX=0):EAX.BNDREGS[bit-3]
>     CPUID.(EAX=0DH,ECX=0):EAX.BNDCSR[bit-4]
> 
> Suggested-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Aaron Lewis <aaronlewis@google.com>
> ---
>  arch/x86/kvm/cpuid.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c4e8257629165..2431c46d456b4 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -855,6 +855,16 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func)
>  	return 0;
>  }
>  
> +static u64 sanitize_xcr0(u64 xcr0)
> +{
> +	u64 mask;
> +
> +	mask = XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR;
> +	if ((xcr0 & mask) != mask)
> +		xcr0 &= ~mask;
> +
> +	return xcr0;
> +}
> +
>  static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  {
>  	struct kvm_cpuid_entry2 *entry;
> @@ -982,6 +992,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
>  		u64 permitted_xss = kvm_caps.supported_xss;
>  
> +		permitted_xcr0 = sanitize_xcr0(permitted_xcr0);


This isn't 100% correct; all usage needs to be sanitized so that KVM provides a
consistent view.  E.g. KVM_CAP_XSAVE2 would report the wrong size.

	case KVM_CAP_XSAVE2: {
		u64 guest_perm = xstate_get_guest_group_perm();

		r = xstate_required_size(kvm_caps.supported_xcr0 & guest_perm, false);
		if (r < sizeof(struct kvm_xsave))
			r = sizeof(struct kvm_xsave);
		break;
	}

Barring a kernel bug, xstate_get_guest_group_perm() will never report partial
support, so I think the easy solution is to sanitize kvm_caps.supported_xcr0.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2480b8027a45..7ea06c58eaf6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9344,6 +9344,10 @@ int kvm_arch_init(void *opaque)
        if (boot_cpu_has(X86_FEATURE_XSAVE)) {
                host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
                kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
+               if (!(kvm_caps.supported_xcr0 & XFEATURE_MASK_BNDREGS) ||
+                   !(kvm_caps.supported_xcr0 & XFEATURE_MASK_BNDCSR))
+                       kvm_caps.supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS |
+                                                    XFEATURE_MASK_BNDCSR);
        }
 
        if (pi_inject_timer == -1)


> +
>  		entry->eax &= permitted_xcr0;
>  		entry->ebx = xstate_required_size(permitted_xcr0, false);
>  		entry->ecx = entry->ebx;
> -- 
> 2.39.0.314.g84b9a713c41-goog
>
Sean Christopherson Jan. 3, 2023, 6:47 p.m. UTC | #3
On Mon, Jan 02, 2023, Xiaoyao Li wrote:
> On 12/31/2022 12:24 AM, Aaron Lewis wrote:
> > Be a good citizen and don't allow any of the supported MPX xfeatures[1]
> > to be set if they can't all be set.  That way userspace or a guest
> > doesn't fail if it attempts to set them in XCR0.
> > 
> > [1] CPUID.(EAX=0DH,ECX=0):EAX.BNDREGS[bit-3]
> >      CPUID.(EAX=0DH,ECX=0):EAX.BNDCSR[bit-4]
> > 
> > Suggested-by: Jim Mattson <jmattson@google.com>
> > Signed-off-by: Aaron Lewis <aaronlewis@google.com>
> > ---
> >   arch/x86/kvm/cpuid.c | 12 ++++++++++++
> >   1 file changed, 12 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index c4e8257629165..2431c46d456b4 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -855,6 +855,16 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func)
> >   	return 0;
> >   }
> > +static u64 sanitize_xcr0(u64 xcr0)
> > +{
> > +	u64 mask;
> > +
> > +	mask = XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR;
> > +	if ((xcr0 & mask) != mask)
> > +		xcr0 &= ~mask;
> 
> Maybe it can WARN_ON_ONCE() here.
> 
> It implies either a kernel bug that permitted_xcr0 is invalid or a broken
> HW.

I'm pretty sure KVM can't WARN, as this falls into the category of "it's technically
architecturally legal to report only one of the features, but real hardware will
always report both".
Aaron Lewis Jan. 10, 2023, 2:49 p.m. UTC | #4
> >  static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >  {
> >       struct kvm_cpuid_entry2 *entry;
> > @@ -982,6 +992,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >               u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
> >               u64 permitted_xss = kvm_caps.supported_xss;
> >
> > +             permitted_xcr0 = sanitize_xcr0(permitted_xcr0);
>
>
> This isn't 100% correct, all usage needs to be sanitized so that KVM provides a
> consistent view.  E.g. KVM_CAP_XSAVE2 would report the wrong size.
>
>         case KVM_CAP_XSAVE2: {
>                 u64 guest_perm = xstate_get_guest_group_perm();
>
>                 r = xstate_required_size(kvm_caps.supported_xcr0 & guest_perm, false);
>                 if (r < sizeof(struct kvm_xsave))
>                         r = sizeof(struct kvm_xsave);
>                 break;
>         }
>
> Barring a kernel bug, xstate_get_guest_group_perm() will never report partial
> support, so I think the easy solution is to sanitize kvm_caps.supported_xcr0.
>

When I run xcr0_cpuid_test it fails because
xstate_get_guest_group_perm() reports partial support on SPR.  It's
reporting 0x206e7 rather than the 0x6e7 I was hoping for.  That's why
I went down the road of sanitizing xcr0.  Though, if it's expected for
that to report something valid then sanitizing seems like the wrong
approach.  If xcr0 is invalid it should stay invalid, and it should
cause a test to fail.

Looking at how xstate_get_guest_group_perm() comes through with
invalid bits I came across this commit:

2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")

-       /* [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE, */
+       [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA,

Seems like it should really be:

+       [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE,

With that change xstate_get_guest_group_perm() should no longer report
partial support.

That means this entire series can be simplified to a 'fixes patch' for
commit 2308ee57d93d and xcr0_cpuid_test to demonstrate the fix.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2480b8027a45..7ea06c58eaf6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9344,6 +9344,10 @@ int kvm_arch_init(void *opaque)
>         if (boot_cpu_has(X86_FEATURE_XSAVE)) {
>                 host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
>                 kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
> +               if (!(kvm_caps.supported_xcr0 & XFEATURE_MASK_BNDREGS) ||
> +                   !(kvm_caps.supported_xcr0 & XFEATURE_MASK_BNDCSR))
> +                       kvm_caps.supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS |
> +                                                    XFEATURE_MASK_BNDCSR);
>         }
>
>         if (pi_inject_timer == -1)
>
>
> > +
> >               entry->eax &= permitted_xcr0;
> >               entry->ebx = xstate_required_size(permitted_xcr0, false);
> >               entry->ecx = entry->ebx;
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
Chang S. Bae Jan. 10, 2023, 6:32 p.m. UTC | #5
On 1/10/2023 6:49 AM, Aaron Lewis wrote:
> 
> When I run xcr0_cpuid_test it fails because
> xstate_get_guest_group_perm() reports partial support on SPR.  It's
> reporting 0x206e7 rather than the 0x6e7 I was hoping for.  That's why
> I went down the road of sanitizing xcr0.  Though, if it's expected for
> that to report something valid then sanitizing seems like the wrong
> approach.  If xcr0 is invalid it should stay invalid, and it should
> cause a test to fail.

FWIW, we have this [1]:

/* Features which are dynamically enabled for a process on request */
#define XFEATURE_MASK_USER_DYNAMIC	XFEATURE_MASK_XTILE_DATA

IOW, TILE_CFG is not part of the dynamic state. Because this state is 
not XFD-supported, we can't enforce the state's use. The SDM has relevant 
text here [2]:

"LDTILECFG and TILERELEASE initialize the TILEDATA state component. An 
execution of either of these instructions does not generate #NM when 
XCR0[18] = IA32_XFD[18] = 1; instead, it initializes TILEDATA normally.
(Note that STTILECFG does not use the TILEDATA state component. Thus, an 
execution of this instruction does
not generate #NM when XCR0[18] = IA32_XFD[18] = 1.)"

> Looking at how xstate_get_guest_group_perm() comes through with
> invalid bits I came across this commit:
> 
> 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
> 
> -       /* [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE, */
> +       [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA,
> 
> Seems like it should really be:
> 
> +       [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE,

Thus, the change was intentional as far as I can remember.

Thanks,
Chang

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/fpu/xstate.h#n50
[2] SDM Vol 1. 13.14 Extended Feature Disable (XFD), 
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Mingwei Zhang Jan. 12, 2023, 6:21 p.m. UTC | #6
On Tue, Jan 10, 2023, Chang S. Bae wrote:
> On 1/10/2023 6:49 AM, Aaron Lewis wrote:
> > 
> > When I run xcr0_cpuid_test it fails because
> > xstate_get_guest_group_perm() reports partial support on SPR.  It's
> > reporting 0x206e7 rather than the 0x6e7 I was hoping for.  That's why
> > I went down the road of sanitizing xcr0.  Though, if it's expected for
> > that to report something valid then sanitizing seems like the wrong
> > approach.  If xcr0 is invalid it should stay invalid, and it should
> > cause a test to fail.
> 
> FWIW, we have this [1]:
> 
> /* Features which are dynamically enabled for a process on request */
> #define XFEATURE_MASK_USER_DYNAMIC	XFEATURE_MASK_XTILE_DATA
> 
> IOW, TILE_CFG is not part of the dynamic state. Because this state is not
> XFD-supported, we can't enforce the state use. SDM has relevant text here
> [2]:
> 
> "LDTILECFG and TILERELEASE initialize the TILEDATA state component. An
> execution of either of these instructions does not generate #NM when
> XCR0[18] = IA32_XFD[18] = 1; instead, it initializes TILEDATA normally.
> (Note that STTILECFG does not use the TILEDATA state component. Thus, an
> execution of this instruction does
> not generate #NM when XCR0[18] = IA32_XFD[18] = 1.)"
> 
> > Looking at how xstate_get_guest_group_perm() comes through with
> > invalid bits I came across this commit:
> > 
> > 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
> > 
> > -       /* [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE, */
> > +       [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA,
> > 
> > Seems like it should really be:
> > 
> > +       [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE,
> 
> Thus, the change was intentional as far as I can remember.
> 
> Thanks,
> Chang
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/fpu/xstate.h#n50
> [2] SDM Vol 1. 13.14 Extended Feature Disable (XFD), https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
> 

Hmm. Given that XFEATURE_XTILE_DATA is a dynamic feature, that means
XFEATURE_XTILE_CFG could be set without XFEATURE_XTILE_DATA. Shall we
also consider loosening the constraints at __kvm_set_xcr()?

https://elixir.bootlin.com/linux/v6.1.4/source/arch/x86/kvm/x86.c#L1079
Chang S. Bae Jan. 12, 2023, 6:44 p.m. UTC | #7
On 1/12/2023 10:21 AM, Mingwei Zhang wrote:
> 
> Hmm. Given that XFEATURE_XTILE_DATA is a dynamic feature, that means
> XFEATURE_XTILE_CFG could be set without XFEATURE_XTILE_DATA.

No, XCR0 is different from the IA32_XFD MSR. Check this in the SDM -- 
Vol.1 13.3 "Enabling the XSAVE Feature Set and XSAVE-Enabled Features":

"In addition, executing the XSETBV instruction causes a general-
  protection fault (#GP) if ECX = 0 and EAX[17] ≠ EAX[18] (TILECFG and
  TILEDATA must be enabled together). This implies that the value of
  XCR0[18:17] is always either 00b or 11b."

Thanks,
Chang
Mingwei Zhang Jan. 12, 2023, 7:17 p.m. UTC | #8
On Thu, Jan 12, 2023, Chang S. Bae wrote:
> On 1/12/2023 10:21 AM, Mingwei Zhang wrote:
> > 
> > Hmm. Given that XFEATURE_XTILE_DATA is a dynamic feature, that means
> > XFEATURE_XTILE_CFG could be set without XFEATURE_XTILE_DATA.
> 
> No, XCR0 is different from the IA32_XFD MSR. Check this in the SDM -- Vol.1
> 13.3 "Enabling the XSAVE Feature Set and XSAVE-Enabled Features":
> 
> "In addition, executing the XSETBV instruction causes a general-
>  protection fault (#GP) if ECX = 0 and EAX[17] ≠ EAX[18] (TILECFG and
>  TILEDATA must be enabled together). This implies that the value of
>  XCR0[18:17] is always either 00b or 11b."
> 
> Thanks,
> Chang

Hmm. Makes sense. The proper execution of XSETBV was the key point of the
patch. But permitted_xcr0 and supported_xcr0 never seem to be used
directly as the parameter of XSETBV, only for size calculation.

One more question: I am very confused by the implementation of
xstate_get_guest_group_perm(). It seems fetching the xfeatures from the
host process (&current->group_leader->thread.fpu). Is this intentional?
Does that mean in order to enable AMX in the guest we have to enable it
on the host process first?
Chang S. Bae Jan. 12, 2023, 8:31 p.m. UTC | #9
On 1/12/2023 11:17 AM, Mingwei Zhang wrote:
> 
> But the permitted_xcr0 and supported_xcr0 seems never used
> directly as the parameter of XSETBV, but only for size calculation.

Yeah, I saw that too, and tried to improve it [1]. Maybe this is not a 
big deal in KVM.

> One more question: I am very confused by the implementation of
> xstate_get_guest_group_perm(). It seems fetching the xfeatures from the
> host process (&current->group_leader->thread.fpu). Is this intentional?
> Does that mean in order to enable AMX in the guest we have to enable it
> on the host process first?

Yes, it was designed that QEMU requests permissions via arch_prctl() 
before creating vCPU threads [2][3]. Once granted, this feature capability 
will be advertised to the guest. Then, the guest will *enable* the feature, right?

Thanks,
Chang

[1] 
https://lore.kernel.org/kvm/20220823231402.7839-2-chang.seok.bae@intel.com/
[2] https://lore.kernel.org/lkml/87wnmf66m5.ffs@tglx/
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=980fe2fddcff
Mingwei Zhang Jan. 12, 2023, 9:21 p.m. UTC | #10
On Thu, Jan 12, 2023, Chang S. Bae wrote:
> On 1/12/2023 11:17 AM, Mingwei Zhang wrote:
> > 
> > But the permitted_xcr0 and supported_xcr0 seems never used
> > directly as the parameter of XSETBV, but only for size calculation.
> 
> Yeah, I saw that too, and tried to improve it [1]. Maybe this is not a big
> deal in KVM.
> 
> > One more question: I am very confused by the implementation of
> > xstate_get_guest_group_perm(). It seems fetching the xfeatures from the
> > host process (&current->group_leader->thread.fpu). Is this intentional?
> > Does that mean in order to enable AMX in the guest we have to enable it
> > on the host process first?
> 
> Yes, it was designed that QEMU requests permissions via arch_prctl() before
> creating vCPU threads [2][3]. Granted, this feature capability will be
> advertised to the guest. Then, it will *enable* the feature, right?
> 
> Thanks,
> Chang
> 
> [1]
> https://lore.kernel.org/kvm/20220823231402.7839-2-chang.seok.bae@intel.com/
> [2] https://lore.kernel.org/lkml/87wnmf66m5.ffs@tglx/
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=980fe2fddcff
> 

Thanks for the clarification. Yeah, that was unexpected to me, since
I assumed AMX enabling in the guest should be orthogonal to the enabling
in the host. But since AMX requires a dynamic size of fp_state, host
awareness of the larger fp_state is clearly intended.

The only comment I would have is that it seems not to follow the
least-privilege principle, as the host process (QEMU) may have no need
to do any matrix multiplication. But this is a minor one.

Since this enabling is once per process, I am wondering whether, after an
invocation of arch_prctl(2), all of the host threads will have a larger
fp_state. If so, that might be a sizeable overhead, since host userspace
may have lots of threads doing various other things, i.e., they may
not be vCPU threads.
Chang S. Bae Jan. 12, 2023, 9:33 p.m. UTC | #11
On 1/12/2023 1:21 PM, Mingwei Zhang wrote:
> 
> The only comment I would have is that it seems not following the least
> privilege principle as host process (QEMU) may not have the motivation
> to do any matrix multiplication. But this is a minor one.
> 
> Since this enabling once per-process, I am wondering when after
> invocation of arch_prctl(2), all of the host threads will have a larger
> fp_state? If so, that might be a sizeable overhead since host userspace
> may have lots of threads doing various of other things, i.e., they may
> not be vCPU threads.

No, the permission request does not immediately result in the kernel's 
XSAVE buffer expansion, but only when the state is about to be used. As 
the state is XFD-armed, its first use will raise #NM. Then, the kernel 
will reallocate the task's fpstate via this call chain:

#NM --> handle_xfd_event() --> xfd_enable_feature() --> fpstate_realloc()

Thanks,
Chang
Mingwei Zhang Jan. 13, 2023, 12:25 a.m. UTC | #12
On Thu, Jan 12, 2023 at 1:33 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
>
> On 1/12/2023 1:21 PM, Mingwei Zhang wrote:
> >
> > The only comment I would have is that it seems not following the least
> > privilege principle as host process (QEMU) may not have the motivation
> > to do any matrix multiplication. But this is a minor one.
> >
> > Since this enabling once per-process, I am wondering when after
> > invocation of arch_prctl(2), all of the host threads will have a larger
> > fp_state? If so, that might be a sizeable overhead since host userspace
> > may have lots of threads doing various of other things, i.e., they may
> > not be vCPU threads.
>
> No, the permission request does not immediately result in the kernel's
> XSAVE buffer expansion, but only when the state is about used. As
> XFD-armed, the state use will raise #NM. Then, it will reallocate the
> task's fpstate via this call chain:
>
> #NM --> handle_xfd_event() --> xfd_enable_feature() --> fpstate_realloc()
>
> Thanks,
> Chang

Thanks for the info. But I think you are talking about host level AMX
enabling. This is known to me. I am asking about how AMX was enabled
by QEMU and used by vCPU threads in the guest. After digging a little
bit, I think I understand it now.

So, it should be the following (in fact, the guest fp_state is not
allocated lazily but at the very beginning, at KVM_SET_CPUID2 time):

  kvm_set_cpuid() / kvm_set_cpuid2() ->
    kvm_check_cpuid() ->
      fpu_enable_guest_xfd_features() ->
        __xfd_enable_feature() ->
          fpstate_realloc()

Note that KVM does intercept #NM for the guest, but only for the
handling of XFD_ERR.

Prior to the kvm_set_cpuid() or kvm_set_cpuid2() call, the QEMU thread
should ask for permission via arch_prctl(REQ_XCOMP_GUEST_PERM) in
order to become a vCPU thread. Otherwise, the above call sequence will
fail. Fortunately, asking-for-guest-permission is only needed once per
process (per-VM).

Because of the above, the non-vCPU threads do not need to create a
larger fp_state unless/until they invoke kvm_set_cpuid() or
kvm_set_cpuid2().

Now, I think that closes the loop for me.

Thanks.

-Mingwei
Aaron Lewis Jan. 13, 2023, 2:43 p.m. UTC | #13
On Fri, Jan 13, 2023 at 12:26 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Thu, Jan 12, 2023 at 1:33 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
> >
> > On 1/12/2023 1:21 PM, Mingwei Zhang wrote:
> > >
> > > The only comment I would have is that it seems not following the least
> > > privilege principle as host process (QEMU) may not have the motivation
> > > to do any matrix multiplication. But this is a minor one.
> > >
> > > Since this enabling once per-process, I am wondering when after
> > > invocation of arch_prctl(2), all of the host threads will have a larger
> > > fp_state? If so, that might be a sizeable overhead since host userspace
> > > may have lots of threads doing various of other things, i.e., they may
> > > not be vCPU threads.
> >
> > No, the permission request does not immediately result in the kernel's
> > XSAVE buffer expansion, but only when the state is about used. As
> > XFD-armed, the state use will raise #NM. Then, it will reallocate the
> > task's fpstate via this call chain:
> >
> > #NM --> handle_xfd_event() --> xfd_enable_feature() --> fpstate_realloc()
> >
> > Thanks,
> > Chang
>
> Thanks for the info. But I think you are talking about host level AMX
> enabling. This is known to me. I am asking about how AMX was enabled
> by QEMU and used by vCPU threads in the guest. After digging a little
> bit, I think I understand it now.
>
> So, it should be the following: (in fact, the guest fp_state is not
> allocated lazily but at the very beginning at KVM_SET_CPUID2 time).
>
>   kvm_set_cpuid() / kvm_set_cpuid2() ->
>     kvm_check_cpuid() ->
>       fpu_enable_guest_xfd_features() ->
>         __xfd_enable_feature() ->
>           fpstate_realloc()
>
> Note that KVM does intercept #NM for the guest, but only for the
> handling of XFD_ERR.
>
> Prior to the kvm_set_cpuid() or kvm_set_cpuid2() call, the QEMU thread
> should ask for permission via arch_prctl(REQ_XCOMP_GUEST_PERM) in
> order to become a vCPU thread. Otherwise, the above call sequence will
> fail. Fortunately, asking-for-guest-permission is only needed once per
> process (per-VM).
>
> Because of the above, the non-vCPU threads do not need to create a
> larger fp_state unless/until they invoke kvm_set_cpuid() or
> kvm_set_cpuid2().
>
> Now, I think that closes the loop for me.
>
> Thanks.
>
> -Mingwei

I'd still like to clean up CPUID.(EAX=0DH,ECX=0):EAX.XTILECFG[17] by
keeping it consistent with CPUID.(EAX=0DH,ECX=0):EAX.XTILEDATA[18] in
the guest, but it's not clear to me what the best way to do that is.
The crux of the issue is that xstate_get_guest_group_perm() returns
partial support for AMX when userspace doesn't call
prctl(ARCH_REQ_XCOMP_GUEST_PERM), i.e. the guest CPUID will report
XTILECFG=1 and XTILEDATA=0 in that case.  In that situation, XTILECFG
should be cleared for it to be consistent.  I can see two ways of
potentially doing that:

1. We can ensure that perm->__state_perm never stores partial support.

2. We can sanitize the bits in xstate_get_guest_group_perm() before
they are returned, to ensure KVM never sees partial support.

I like the idea of #1, but if that has negative effects on the host or
XFD I'm open to #2.  Though, XFD has its own field, so I thought that
wouldn't be an issue.  Would it work to set __state_perm and/or
default_features (what originally sets __state_perm) to a consistent
view, so partial support is never returned from
xstate_get_guest_group_perm()?
Chang S. Bae Jan. 17, 2023, 8:32 p.m. UTC | #14
On 1/13/2023 6:43 AM, Aaron Lewis wrote:
> 
> I'd still like to clean up CPUID.(EAX=0DH,ECX=0):EAX.XTILECFG[17] by
> keeping it consistent with CPUID.(EAX=0DH,ECX=0):EAX.XTILEDATA[18] in
> the guest, but it's not clear to me what the best way to do that is.
> The crux of the issue is that xstate_get_guest_group_perm() returns
> partial support for AMX when userspace doesn't call
> prctl(ARCH_REQ_XCOMP_GUEST_PERM), I.e. the guest CPUID will report
> XTILECFG=1 and XTILEDATA=0 in that case.  In that situation, XTILECFG
> should be cleared for it to be consistent.  I can see two ways of
> potentially doing that:
> 
> 1. We can ensure that perm->__state_perm never stores partial support.
> 
> 2. We can sanitize the bits in xstate_get_guest_group_perm() before
> they are returned, to ensure KVM never sees partial support.
> 
> I like the idea of #1, but if that has negative effects on the host or
> XFD I'm open to #2.  Though, XFD has its own field, so I thought that
> wouldn't be an issue.  Would it work to set __state_perm and/or
> default_features (what originally sets __state_perm) to a consistent
> view, so partial support is never returned from
> xstate_get_guest_group_perm()?

FWIW, I was trying to clarify that ARCH_GET_XCOMP_GUEST_PERM is a 
variant of ARCH_GET_XCOMP_PERM in the documentation [1]. So, I guess #1 
will have some side effects (at least confusion) for these semantics.

There may be some ways to transform the permission bits to the 
XCR0-capable bits. For instance, when considering this permission 
support in host, the highest feature number was considered to denote the 
rest feature bits [2].

Thanks,
Chang

[1] 
https://lore.kernel.org/lkml/20220922195810.23248-5-chang.seok.bae@intel.com/
[2] https://lore.kernel.org/lkml/878rz7fyhe.ffs@tglx/
Mingwei Zhang Jan. 17, 2023, 10:39 p.m. UTC | #15
On Tue, Jan 17, 2023 at 12:34 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
>
> On 1/13/2023 6:43 AM, Aaron Lewis wrote:
> >
> > I'd still like to clean up CPUID.(EAX=0DH,ECX=0):EAX.XTILECFG[17] by
> > keeping it consistent with CPUID.(EAX=0DH,ECX=0):EAX.XTILEDATA[18] in
> > the guest, but it's not clear to me what the best way to do that is.
> > The crux of the issue is that xstate_get_guest_group_perm() returns
> > partial support for AMX when userspace doesn't call
> > prctl(ARCH_REQ_XCOMP_GUEST_PERM), I.e. the guest CPUID will report
> > XTILECFG=1 and XTILEDATA=0 in that case.  In that situation, XTILECFG
> > should be cleared for it to be consistent.  I can see two ways of
> > potentially doing that:
> >
> > 1. We can ensure that perm->__state_perm never stores partial support.
> >
> > 2. We can sanitize the bits in xstate_get_guest_group_perm() before
> > they are returned, to ensure KVM never sees partial support.
> >
> > I like the idea of #1, but if that has negative effects on the host or
> > XFD I'm open to #2.  Though, XFD has its own field, so I thought that
> > wouldn't be an issue.  Would it work to set __state_perm and/or
> > default_features (what originally sets __state_perm) to a consistent
> > view, so partial support is never returned from
> > xstate_get_guest_group_perm()?
>
> FWIW, I was trying to clarify that ARCH_GET_XCOMP_GUEST_PERM is a
> variant of ARCH_GET_XCOMP_PERM in the documentation [1]. So, I guess #1
> will have some side-effect (at least confusion) for this semantics.

Right, before talking in this thread, I was not aware of the other
arch_prctl(2) for AMX. But when I look into it, it looks reasonable to
me from an engineering point of view since it seems reusing almost all
of the code in the host.

ARCH_GET_XCOMP_GUEST_PERM is to ask for "guest permission", but it is
not just about the "permission" (the fp_state size increase for AMX).
It is also about the CPUID feature bits exposure. So for the host side
of AMX usage, this is fine, since there is no partial CPUID exposure
problem. But the guest side does have it.

So, can we just grant two bits instead of one? For that, I find option #1
more reasonable than #2. Of course, if there are many side
effects, option #2 could be considered as well. But before that, it
will be good to understand where the side effects are.

>
> There may be some ways to transform the permission bits to the
> XCR0-capable bits. For instance, when considering this permission
> support in host, the highest feature number was considered to denote the
> rest feature bits [2].

Hmm, this is out of my concern since it is about the host-level
enabling. I have no problem with the existing host side permission
control for AMX.

However, for me, [2] does seem a little hacky, but I get the point.
The concern is: how do we know that, in the future, the highest bit will
always indicate the lower bits? Will Intel CPU features always follow
this style?  Even so, there is no guidance/guarantee that other
CPU vendors (e.g., AMD) will do the same. Also, what if there is more
than one dynamic feature for a feature? Please kindly correct me.

Thanks.
-Mingwei


>
> Thanks,
> Chang
>
> [1]
> https://lore.kernel.org/lkml/20220922195810.23248-5-chang.seok.bae@intel.com/
> [2] https://lore.kernel.org/lkml/878rz7fyhe.ffs@tglx/
Chang S. Bae Jan. 18, 2023, 12:34 a.m. UTC | #16
On 1/17/2023 2:39 PM, Mingwei Zhang wrote:
> On Tue, Jan 17, 2023 at 12:34 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
>>
>> On 1/13/2023 6:43 AM, Aaron Lewis wrote:
>>>
>>> I'd still like to clean up CPUID.(EAX=0DH,ECX=0):EAX.XTILECFG[17] by
>>> keeping it consistent with CPUID.(EAX=0DH,ECX=0):EAX.XTILEDATA[18] in
>>> the guest, but it's not clear to me what the best way to do that is.
>>> The crux of the issue is that xstate_get_guest_group_perm() returns
>>> partial support for AMX when userspace doesn't call
>>> prctl(ARCH_REQ_XCOMP_GUEST_PERM), I.e. the guest CPUID will report
>>> XTILECFG=1 and XTILEDATA=0 in that case.  In that situation, XTILECFG
>>> should be cleared for it to be consistent.  I can see two ways of
>>> potentially doing that:
>>>
>>> 1. We can ensure that perm->__state_perm never stores partial support.
>>>
>>> 2. We can sanitize the bits in xstate_get_guest_group_perm() before
>>> they are returned, to ensure KVM never sees partial support.
>>>
>>> I like the idea of #1, but if that has negative effects on the host or
>>> XFD I'm open to #2.  Though, XFD has its own field, so I thought that
>>> wouldn't be an issue.  Would it work to set __state_perm and/or
>>> default_features (what originally sets __state_perm) to a consistent
>>> view, so partial support is never returned from
>>> xstate_get_guest_group_perm()?
>>
>> FWIW, I was trying to clarify that ARCH_GET_XCOMP_GUEST_PERM is a
>> variant of ARCH_GET_XCOMP_PERM in the documentation [1]. So, I guess #1
>> will have some side-effect (at least confusion) for this semantics.
> 
> Right, before talking in this thread, I was not aware of the other
> arch_prctl(2) for AMX. But when I look into it, it looks reasonable to
> me from an engineering point of view since it seems reusing almost all
> of the code in the host.
> 
> ARCH_GET_XCOMP_GUEST_PERM is to ask for "guest permission", but it is
> not just about the "permission" (the fp_state size increase for AMX).
> It is also about the CPUID feature bits exposure. So for the host side
> of AMX usage, this is fine, since there is no partial CPUID exposure
> problem. But the guest side does have it.

I think this is still about permission. While lazy allocation is 
possible, this pre-allocation makes it simple. So this is how that 
allocation part is implemented, right?

> So, can we just grant two bits instead of 1 bit? For that, I find 1)
> seems more reasonable than 2). Of course, if there are many side
> effects, option #2 could be considered as well. But before that, it
> will be good to understand where the side effects are.

Sorry, I don't get it. But I guess maintainers will dictate it.

Also, I would feel the code can be more easily adjusted for the KVM/Qemu 
use if it is part of the KVM API [3]. Then, the generic arch_prctl options 
can remain as they are.

>> There may be some ways to transform the permission bits to the
>> XCR0-capable bits. For instance, when considering this permission
>> support in host, the highest feature number was considered to denote the
>> rest feature bits [2].
> 
> Hmm, this is out of my concern since it is about the host-level
> enabling. I have no problem with the existing host side permission
> control for AMX.
> 
> However, for me, [2] does not seem a little hacky but I get the point.
> The concern is that how do we know in the future, the highest bit
> always indicates lower bits? Will Intel CPU features always follow
> this style?  Even so, there is no such guidance/guarantee that other
> CPU vendors (e.g., AMD) will do the same. Also what if there are more
> than 1 dynamic features for a feature? Please kindly correct me.

At least, that works for AMX, I thought. I was not saying anything about 
future features. The code can be feature-specific here, no?

Thanks,
Chang

[3] 
https://lore.kernel.org/kvm/20220823231402.7839-1-chang.seok.bae@intel.com/

Patch

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c4e8257629165..2431c46d456b4 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -855,6 +855,16 @@  static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func)
 	return 0;
 }
 
+static u64 sanitize_xcr0(u64 xcr0)
+{
+	u64 mask;
+
+	mask = XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR;
+	if ((xcr0 & mask) != mask)
+		xcr0 &= ~mask;
+
+	return xcr0;
+}
+
 static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 {
 	struct kvm_cpuid_entry2 *entry;
@@ -982,6 +992,8 @@  static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
 		u64 permitted_xss = kvm_caps.supported_xss;
 
+		permitted_xcr0 = sanitize_xcr0(permitted_xcr0);
+
 		entry->eax &= permitted_xcr0;
 		entry->ebx = xstate_required_size(permitted_xcr0, false);
 		entry->ecx = entry->ebx;