[RFC,6/7] KVM: X86: Expose PKS to guest and userspace

Message ID 20200807084841.7112-7-chenyi.qiang@intel.com (mailing list archive)
State New, archived
Series KVM: PKS Virtualization support

Commit Message

Chenyi Qiang Aug. 7, 2020, 8:48 a.m. UTC
Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
enabled by setting CR4.PKS when long mode is active. PKS is only
implemented when EPT is enabled and requires the support of VM_{ENTRY,
EXIT}_LOAD_IA32_PKRS currently.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/cpuid.c            |  3 ++-
 arch/x86/kvm/vmx/vmx.c          | 15 ++++++++++++---
 arch/x86/kvm/x86.c              |  7 +++++--
 4 files changed, 21 insertions(+), 7 deletions(-)
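
As an aside, the CPUID enumeration described in the commit message can be probed like this (illustration only, written with kernel-style helpers; the pks_supported() name is made up and this is not part of the patch):

/*
 * Illustration only: PKS is enumerated by CPUID.(EAX=7H,ECX=0):ECX[31];
 * a guest would check that bit before attempting to set CR4.PKS.
 */
static bool pks_supported(void)
{
	unsigned int eax, ebx, ecx, edx;

	cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
	return ecx & (1U << 31);
}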

Comments

Jim Mattson Aug. 13, 2020, 7:04 p.m. UTC | #1
On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
> Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
> enabled by setting CR4.PKS when long mode is active. PKS is only
> implemented when EPT is enabled and requires the support of VM_{ENTRY,
> EXIT}_LOAD_IA32_PKRS currently.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

> @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  {
>         unsigned long old_cr4 = kvm_read_cr4(vcpu);
>         unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
> -                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
> +                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
> +                                  X86_CR4_PKS;

This list already seems overly long, but I don't think CR4.PKS belongs
here. In volume 3 of the SDM, section 4.4.1, it says:

- If PAE paging would be in use following an execution of MOV to CR0
or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
then the PDPTEs are loaded from the address in CR3.

CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
Since it has no effect on PAE paging, I would be surprised if it did
result in a PDPTE load.
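
For reference, restricting the mask to the CR4 bits in that SDM list would look like this (sketch only; whether the bits already in KVM's list beyond the SDM's are needed for other reasons is a separate question):

	/* CR4 bits that force a PDPTE reload per SDM vol. 3, section 4.4.1. */
	unsigned long pdptr_bits = X86_CR4_PAE | X86_CR4_PGE | X86_CR4_PSE |
				   X86_CR4_SMEP;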

>         if (kvm_valid_cr4(vcpu, cr4))
>                 return 1;
> @@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
>         MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
>         MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
>         MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
> -       MSR_IA32_UMWAIT_CONTROL,
> +       MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,

Should MSR_IA32_PKRS be added to the switch statement in
kvm_init_msr_list()? Something like...

case MSR_IA32_PKRS:
        if (!kvm_cpu_cap_has(X86_FEATURE_PKRS))
                continue;
        break;
Chenyi Qiang Aug. 14, 2020, 2:33 a.m. UTC | #2
On 8/14/2020 3:04 AM, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>
>> Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
>> enabled by setting CR4.PKS when long mode is active. PKS is only
>> implemented when EPT is enabled and requires the support of VM_{ENTRY,
>> EXIT}_LOAD_IA32_PKRS currently.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> 
>> @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>   {
>>          unsigned long old_cr4 = kvm_read_cr4(vcpu);
>>          unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
>> -                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
>> +                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
>> +                                  X86_CR4_PKS;
> 
> This list already seems overly long, but I don't think CR4.PKS belongs
> here. In volume 3 of the SDM, section 4.4.1, it says:
> 
> - If PAE paging would be in use following an execution of MOV to CR0
> or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
> of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
> then the PDPTEs are loaded from the address in CR3.
> 
> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> Since it has no effect on PAE paging, I would be surprised if it did
> result in a PDPTE load.
> 

Oh, my mistake.

>>          if (kvm_valid_cr4(vcpu, cr4))
>>                  return 1;
>> @@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
>>          MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
>>          MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
>>          MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
>> -       MSR_IA32_UMWAIT_CONTROL,
>> +       MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
> 
> Should MSR_IA32_PKRS be added to the switch statement in
> kvm_init_msr_list()? Something like...
> 
> case MSR_IA32_PKRS:
>          if (!kvm_cpu_cap_has(X86_FEATURE_PKRS))
>                  continue;
>          break;
> 

Yes, this should be added.
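
With the feature flag named as elsewhere in this series (X86_FEATURE_PKS, as used in vmx_set_cpu_caps()), the check would presumably read:

case MSR_IA32_PKRS:
	if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
		continue;
	break;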
Sean Christopherson Sept. 30, 2020, 4:36 a.m. UTC | #3
On Thu, Aug 13, 2020 at 12:04:54PM -0700, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >
> > Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
> > enabled by setting CR4.PKS when long mode is active. PKS is only
> > implemented when EPT is enabled and requires the support of VM_{ENTRY,
> > EXIT}_LOAD_IA32_PKRS currently.
> >
> > Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> 
> > @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> >  {
> >         unsigned long old_cr4 = kvm_read_cr4(vcpu);
> >         unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
> > -                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
> > +                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
> > +                                  X86_CR4_PKS;
> 
> This list already seems overly long, but I don't think CR4.PKS belongs
> here. In volume 3 of the SDM, section 4.4.1, it says:
> 
> - If PAE paging would be in use following an execution of MOV to CR0
> or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
> of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
> then the PDPTEs are loaded from the address in CR3.
> 
> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> Since it has no effect on PAE paging, I would be surprised if it did
> result in a PDPTE load.

It does belong in the mmu_role_bits though ;-)
Paolo Bonzini Jan. 26, 2021, 6:24 p.m. UTC | #4
On 30/09/20 06:36, Sean Christopherson wrote:
>> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
>> Since it has no effect on PAE paging, I would be surprised if it did
>> result in a PDPTE load.
> It does belong in the mmu_role_bits though;-)
> 

Does it?  We don't support PKU/PKS for shadow paging, and it's always 
zero for EPT.  We only support enough PKU/PKS for emulation.

Paolo
Sean Christopherson Jan. 26, 2021, 7:56 p.m. UTC | #5
On Tue, Jan 26, 2021, Paolo Bonzini wrote:
> On 30/09/20 06:36, Sean Christopherson wrote:
> > > CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> > > Since it has no effect on PAE paging, I would be surprised if it did
> > > result in a PDPTE load.
> > It does belong in the mmu_role_bits though;-)
> > 
> 
> Does it?  We don't support PKU/PKS for shadow paging, and it's always zero
> for EPT.  We only support enough PKU/PKS for emulation.

As proposed, yes.  The PKU/PKS mask is tracked on a per-mmu basis, e.g. computed
in update_pkr_bitmask() and consumed in permission_fault() during emulation.
Omitting CR4.PKS from the extended role could let KVM reuse an MMU with the
wrong pkr_mask.

That being said, I don't think it needs a dedicated bit.  IIUC, the logic is
PKU|PKS, with pkr_mask generation being PKU vs. PKS agnostic.  The role could do
the same and smush the bits together, e.g.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d6616f6f6ef..3bfca34f6ea2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -293,7 +293,7 @@ union kvm_mmu_extended_role {
                unsigned int cr0_pg:1;
                unsigned int cr4_pae:1;
                unsigned int cr4_pse:1;
-               unsigned int cr4_pke:1;
+               unsigned int cr4_pkr:1;
                unsigned int cr4_smap:1;
                unsigned int cr4_smep:1;
                unsigned int maxphyaddr:6;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 79166288ed03..2774b85a36d5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4448,7 +4448,8 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu)
        ext.cr4_smep = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
        ext.cr4_smap = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
        ext.cr4_pse = !!is_pse(vcpu);
-       ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
+       ext.cr4_pkr = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
+                     !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);
        ext.maxphyaddr = cpuid_maxphyaddr(vcpu);

        ext.valid = 1;


Another option would be to move the tracking out of the MMU, e.g. make pkr_mask
per-vCPU and recalculate when CR4 changes.  I think that would "just work", even
when nested VMs are in play?
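
Something like the following, purely as a sketch (it assumes a per-vCPU pkr_mask and a per-vCPU flavor of update_pkr_bitmask(); the helper name is made up):

/* Sketch: called from kvm_set_cr4() once the new CR4 value is committed. */
static void kvm_update_pkr_mask(struct kvm_vcpu *vcpu, unsigned long old_cr4,
				unsigned long cr4)
{
	if ((cr4 ^ old_cr4) & (X86_CR4_PKE | X86_CR4_PKS))
		update_pkr_bitmask(vcpu);
}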
Paolo Bonzini Jan. 26, 2021, 8:05 p.m. UTC | #6
On 26/01/21 20:56, Sean Christopherson wrote:
>>> It does belong in the mmu_role_bits though;-)
>>
>> Does it?  We don't support PKU/PKS for shadow paging, and it's always zero
>> for EPT.  We only support enough PKU/PKS for emulation.
>
> As proposed, yes. The PKU/PKS mask is tracked on a per-mmu basis, e.g. 
> computed in update_pkr_bitmask() and consumed in permission_fault() 
> during emulation. Omitting CR4.PKS from the extended role could let KVM 
> reuse an MMU with the wrong pkr_mask.

Right, not for the hash table key but for reuse.

> IIUC, the logic is PKU|PKS, with pkr_mask generation being PKU vs. PKS agnostic.

Not in the patches as submitted, but it's what I suggested indeed (using 
one bit of the PFEC to pick one of CR4.PKE and CR4.PKS).

> Another option would be to move the tracking out of the MMU, e.g. make pkr_mask
> per-vCPU and recalculate when CR4 changes.  I think that would "just work", even
> when nested VMs are in play?

Yeah, pkr_mask is basically one of four constants (depending on CR4.PKE 
and CR4.PKS) so recalculating when CR4 changes would work too.  But I'm 
okay with doing that later, too.
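
Roughly like this, as an illustration (the pkr_mask_for table and the per-vCPU field are made-up names, and the CR0.WP/long-mode handling in update_pkr_bitmask() is glossed over):

/* Illustration: pkr_mask as a lookup keyed by (CR4.PKE, CR4.PKS). */
static void kvm_recalc_pkr_mask(struct kvm_vcpu *vcpu)
{
	bool pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
	bool pks = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);

	vcpu->arch.pkr_mask = pkr_mask_for[pke][pks];
}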

Paolo

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 736e56e023d5..36c7356693c8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -99,7 +99,8 @@ 
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
-			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+			  | X86_CR4_PKS))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8a294f9747aa..897749250afd 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -325,7 +325,8 @@  void kvm_set_cpu_caps(void)
 		F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ | F(RDPID) |
 		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
 		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
-		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/
+		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
+		0 /*PKS*/
 	);
 	/* Set LA57 based on hardware capability. */
 	if (cpuid_ecx(7) & F(LA57))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d91d59fb46fa..5cdf4d3848fb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3216,7 +3216,7 @@  int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		}
 
 		/*
-		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
 		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs
 		 * to be manually disabled when guest switches to non-paging
 		 * mode.
@@ -3224,10 +3224,11 @@  int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		 * If !enable_unrestricted_guest, the CPU is always running
 		 * with CR0.PG=1 and CR4 needs to be modified.
 		 * If enable_unrestricted_guest, the CPU automatically
-		 * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
+		 * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
 		 */
 		if (!is_paging(vcpu))
-			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+				    X86_CR4_PKS);
 	}
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
@@ -7348,6 +7349,14 @@  static __init void vmx_set_cpu_caps(void)
 	if (vmx_pt_mode_is_host_guest())
 		kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
 
+	/*
+	 * PKS is not yet implemented for shadow paging.
+	 * If VM_{ENTRY, EXIT}_LOAD_IA32_PKRS is not supported,
+	 * don't expose PKS either.
+	 */
+	if (enable_ept && cpu_has_load_ia32_pkrs())
+		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
+
 	if (vmx_umip_emulated())
 		kvm_cpu_cap_set(X86_FEATURE_UMIP);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..8ad9622ee2b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -949,6 +949,8 @@  EXPORT_SYMBOL_GPL(kvm_set_xcr);
 		__reserved_bits |= X86_CR4_LA57;	\
 	if (!__cpu_has(__c, X86_FEATURE_UMIP))		\
 		__reserved_bits |= X86_CR4_UMIP;	\
+	if (!__cpu_has(__c, X86_FEATURE_PKS))		\
+		__reserved_bits |= X86_CR4_PKS;		\
 	__reserved_bits;				\
 })
 
@@ -967,7 +969,8 @@  int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long old_cr4 = kvm_read_cr4(vcpu);
 	unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
-				   X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
+				   X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+				   X86_CR4_PKS;
 
 	if (kvm_valid_cr4(vcpu, cr4))
 		return 1;
@@ -1202,7 +1205,7 @@  static const u32 msrs_to_save_all[] = {
 	MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
 	MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
 	MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
-	MSR_IA32_UMWAIT_CONTROL,
+	MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
 
 	MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
 	MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,