
KVM: x86: only do L1TF workaround on affected processors

Message ID 20200519095008.1212-1-pbonzini@redhat.com (mailing list archive)
State New, archived
Series KVM: x86: only do L1TF workaround on affected processors

Commit Message

Paolo Bonzini May 19, 2020, 9:50 a.m. UTC
KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
in an L1TF attack.  This works as long as there are 5 free bits between
MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
access triggers a reserved-bit-set page fault.

The bit positions however were computed wrongly for AMD processors that have
encryption support.  In this case, x86_phys_bits is reduced (for example
from 48 to 43, to account for the C bit at position 47 and four bits used
internally to store the SEV ASID and other stuff) while x86_cache_bits
would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
and bit 51 are set.  Then low_phys_bits would also cover some of the
bits that are set in the shadow_mmio_value, terribly confusing the gfn
caching mechanism.

To fix this, avoid splitting gfns as long as the processor does not have
the L1TF bug (which includes all AMD processors).  When there is no
splitting, low_phys_bits can be set to the reduced MAXPHYADDR, removing
the overlap.  This fixes "npt=0" operation on EPYC processors.

Thanks to Maxim Levitsky for bisecting this bug.

Cc: stable@vger.kernel.org
Fixes: 52918ed5fcf0 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Comments

Maxim Levitsky May 19, 2020, 10:59 a.m. UTC | #1
On Tue, 2020-05-19 at 05:50 -0400, Paolo Bonzini wrote:
> KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
> in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
> in an L1TF attack.  This works as long as there are 5 free bits between
> MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
> access triggers a reserved-bit-set page fault.

Most of the machines I have used have MAXPHYADDR=39; however, on larger server machines,
isn't MAXPHYADDR already something like 48, thus not leaving enough space for these bits?
This is the case for my machine as well.

In this case, if I understand correctly, the MAXPHYADDR value reported to the guest can
be reduced to accommodate these bits, is that true?


> 
> The bit positions however were computed wrongly for AMD processors that have
> encryption support.  In this case, x86_phys_bits is reduced (for example
> from 48 to 43, to account for the C bit at position 47 and four bits used
> internally to store the SEV ASID and other stuff) while x86_cache_bits in
> would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
> and bit 51 are set.  

If I understand correctly this is done by the host kernel. I didn't have memory encryption
enabled when I ran these tests.


FYI, later on I did some digging into SME and SEV on my machine (3970X), and found out that memory encryption (SME) does actually work,
except that it makes AMD's own amdgpu driver panic on boot; according to Google this is a very well-known issue.
This is why I always thought that it wasn't supported.

I tested this issue with SME enabled together with efifb, and it seems that its state (enabled/disabled) doesn't affect this bug,
which suggests to me that a buggy BIOS always reports that memory encryption is enabled in that MSR, or something
like that. I haven't yet studied this area well enough to be sure.

SEV, on the other hand, is not active because the system doesn't seem to have the PSP firmware loaded
and only has the CCP active (I added some printks to the ccp/psp driver, and it shows that the PSP reports 0 capability, which indicates that it is not there).
It is reported as supported in CPUID (even SEV-ES).

I tested this patch and it works.

However, note (not related to this patch) that running a nested guest
makes the L1 guest panic right at the very startup of the guest when npt=1.
I tested this with many guest/host combinations, and even with the Fedora 5.3 kernel running
on both host and guest this is the case.

Tested-by: Maxim Levitsky <mlevitsk@redhat.com>

Overall the patch makes sense to me; however, I don't yet know this area well enough
to review it, but I think I'll dig into it today, and once it all makes sense to me,
I'll review this patch as well.

Best regards,
	Maxim Levitsky

> Then low_phys_bits would also cover some of the
> bits that are set in the shadow_mmio_value, terribly confusing the gfn
> caching mechanism.
> 
> To fix this, avoid splitting gfns as long as the processor does not have
> the L1TF bug (which includes all AMD processors).  When there is no
> splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing
> the overlap.  This fixes "npt=0" operation on EPYC processors.
> 
> Thanks to Maxim Levitsky for bisecting this bug.
> 
> Cc: stable@vger.kernel.org
> Fixes: 52918ed5fcf0 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8071952e9cf2..86619631ff6a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -335,6 +335,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
>  {
>  	BUG_ON((u64)(unsigned)access_mask != access_mask);
>  	BUG_ON((mmio_mask & mmio_value) != mmio_value);
> +	WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << shadow_nonpresent_or_rsvd_mask_len));
> +	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
>  	shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
>  	shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
>  	shadow_mmio_access_mask = access_mask;
> @@ -583,16 +585,15 @@ static void kvm_mmu_reset_all_pte_masks(void)
>  	 * the most significant bits of legal physical address space.
>  	 */
>  	shadow_nonpresent_or_rsvd_mask = 0;
> -	low_phys_bits = boot_cpu_data.x86_cache_bits;
> -	if (boot_cpu_data.x86_cache_bits <
> -	    52 - shadow_nonpresent_or_rsvd_mask_len) {
> +	low_phys_bits = boot_cpu_data.x86_phys_bits;
> +	if (boot_cpu_has_bug(X86_BUG_L1TF) &&
> +	    !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
> +			  52 - shadow_nonpresent_or_rsvd_mask_len)) {
> +		low_phys_bits = boot_cpu_data.x86_cache_bits
> +			- shadow_nonpresent_or_rsvd_mask_len;
>  		shadow_nonpresent_or_rsvd_mask =
> -			rsvd_bits(boot_cpu_data.x86_cache_bits -
> -				  shadow_nonpresent_or_rsvd_mask_len,
> -				  boot_cpu_data.x86_cache_bits - 1);
> -		low_phys_bits -= shadow_nonpresent_or_rsvd_mask_len;
> -	} else
> -		WARN_ON_ONCE(boot_cpu_has_bug(X86_BUG_L1TF));
> +			rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
> +	}
>  
>  	shadow_nonpresent_or_rsvd_lower_gfn_mask =
>  		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
Maxim Levitsky May 19, 2020, 11:36 a.m. UTC | #2
On Tue, 2020-05-19 at 13:59 +0300, Maxim Levitsky wrote:
> On Tue, 2020-05-19 at 05:50 -0400, Paolo Bonzini wrote:
> > KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
> > in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
> > in an L1TF attack.  This works as long as there are 5 free bits between
> > MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
> > access triggers a reserved-bit-set page fault.
> 
> Most of machines I used have MAXPHYADDR=39, however on larger server machines,
> isn't MAXPHYADDR already something like 48, thus not allowing enought space for these bits?
> This is the case for my machine as well.
> 
> In this case, if I understand correctly, the MAXPHYADDR value reported to the guest can
> be reduced to accomodate for these bits, is that true?
> 
> 
> > The bit positions however were computed wrongly for AMD processors that have
> > encryption support.  In this case, x86_phys_bits is reduced (for example
> > from 48 to 43, to account for the C bit at position 47 and four bits used
> > internally to store the SEV ASID and other stuff) while x86_cache_bits in
> > would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
> > and bit 51 are set.  
> 
> If I understand correctly this is done by the host kernel. I haven't had memory encryption
> enabled when I did these tests.
> 
> 
> FYI, later on, I did some digging about SME and SEV on my machine (3970X), and found out that memory encryption (SME) does actually work,
> except that it makes AMD's own amdgpu driver panic on boot and according to google this is a very well known issue.
> This is why I always thought that it wasn't supported.
> 
> I tested this issue while SME is enabled with efifb and it seems that its state (enabled/disabled) doesn't affect this bug,
> which suggest me that a buggy bios always reports that memory encrypiton is enabled in that msr, or something
> like that. I haven't yet studied this area well enought to be sure.
> 
> SEV on the other hand is not active because the system doesn't seem to have PSP firmware loaded,
> and only have CCP active (I added some printks to the ccp/psp driver and it shows that PSP reports 0 capability which indicates that it is not there)
> It is reported as supported in CPUID (even SEV-ES).
> 
> I tested this patch and it works.
> 
> However note (not related to this patch) that running nested guest,
> makes the L1 guest panic right in the very startup of the guest when npt=1.
npt=0 of course - I need more coffee today.

Best regards,
	Maxim Levitsky

> I tested this with many guest/host combinations and even with fedora kernel 5.3 running
> on both host and guest, this is the case.
> 
> Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Overall the patch makes sense to me, however I don't yet know this area well enought
> for a review, but I think I'll dig into it today and once it all makes sense to me,
> I'll review this patch as well.
> 
> Best regards,
> 	Maxim Levitsky
> 
> > Then low_phys_bits would also cover some of the
> > bits that are set in the shadow_mmio_value, terribly confusing the gfn
> > caching mechanism.
> > 
> > To fix this, avoid splitting gfns as long as the processor does not have
> > the L1TF bug (which includes all AMD processors).  When there is no
> > splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing
> > the overlap.  This fixes "npt=0" operation on EPYC processors.
> > 
> > Thanks to Maxim Levitsky for bisecting this bug.
> > 
> > Cc: stable@vger.kernel.org
> > Fixes: 52918ed5fcf0 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 19 ++++++++++---------
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8071952e9cf2..86619631ff6a 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -335,6 +335,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
> >  {
> >  	BUG_ON((u64)(unsigned)access_mask != access_mask);
> >  	BUG_ON((mmio_mask & mmio_value) != mmio_value);
> > +	WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << shadow_nonpresent_or_rsvd_mask_len));
> > +	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
> >  	shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
> >  	shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
> >  	shadow_mmio_access_mask = access_mask;
> > @@ -583,16 +585,15 @@ static void kvm_mmu_reset_all_pte_masks(void)
> >  	 * the most significant bits of legal physical address space.
> >  	 */
> >  	shadow_nonpresent_or_rsvd_mask = 0;
> > -	low_phys_bits = boot_cpu_data.x86_cache_bits;
> > -	if (boot_cpu_data.x86_cache_bits <
> > -	    52 - shadow_nonpresent_or_rsvd_mask_len) {
> > +	low_phys_bits = boot_cpu_data.x86_phys_bits;
> > +	if (boot_cpu_has_bug(X86_BUG_L1TF) &&
> > +	    !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
> > +			  52 - shadow_nonpresent_or_rsvd_mask_len)) {
> > +		low_phys_bits = boot_cpu_data.x86_cache_bits
> > +			- shadow_nonpresent_or_rsvd_mask_len;
> >  		shadow_nonpresent_or_rsvd_mask =
> > -			rsvd_bits(boot_cpu_data.x86_cache_bits -
> > -				  shadow_nonpresent_or_rsvd_mask_len,
> > -				  boot_cpu_data.x86_cache_bits - 1);
> > -		low_phys_bits -= shadow_nonpresent_or_rsvd_mask_len;
> > -	} else
> > -		WARN_ON_ONCE(boot_cpu_has_bug(X86_BUG_L1TF));
> > +			rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
> > +	}
> >  
> >  	shadow_nonpresent_or_rsvd_lower_gfn_mask =
> >  		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
Tom Lendacky May 19, 2020, 1:56 p.m. UTC | #3
On 5/19/20 5:59 AM, Maxim Levitsky wrote:
> On Tue, 2020-05-19 at 05:50 -0400, Paolo Bonzini wrote:
>> KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
>> in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
>> in an L1TF attack.  This works as long as there are 5 free bits between
>> MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
>> access triggers a reserved-bit-set page fault.
> 
> Most of machines I used have MAXPHYADDR=39, however on larger server machines,
> isn't MAXPHYADDR already something like 48, thus not allowing enought space for these bits?
> This is the case for my machine as well.
> 
> In this case, if I understand correctly, the MAXPHYADDR value reported to the guest can
> be reduced to accomodate for these bits, is that true?
> 
> 
>>
>> The bit positions however were computed wrongly for AMD processors that have
>> encryption support.  In this case, x86_phys_bits is reduced (for example
>> from 48 to 43, to account for the C bit at position 47 and four bits used
>> internally to store the SEV ASID and other stuff) while x86_cache_bits in
>> would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
>> and bit 51 are set.
> 
> If I understand correctly this is done by the host kernel. I haven't had memory encryption
> enabled when I did these tests.
> 
> 
> FYI, later on, I did some digging about SME and SEV on my machine (3970X), and found out that memory encryption (SME) does actually work,
> except that it makes AMD's own amdgpu driver panic on boot and according to google this is a very well known issue.
> This is why I always thought that it wasn't supported.
> 
> I tested this issue while SME is enabled with efifb and it seems that its state (enabled/disabled) doesn't affect this bug,
> which suggest me that a buggy bios always reports that memory encrypiton is enabled in that msr, or something
> like that. I haven't yet studied this area well enought to be sure.

If the SMEE MSR bit (bit 23 of 0xc0010010) is enabled, then the overall 
hardware encryption feature is active, which means the encryption bit is 
available and active regardless of whether the OS supports it or not, and 
the physical address space is reduced.

Some BIOSes provide an option to disable/turn off the SMEE bit, but not all.

> 
> SEV on the other hand is not active because the system doesn't seem to have PSP firmware loaded,
> and only have CCP active (I added some printks to the ccp/psp driver and it shows that PSP reports 0 capability which indicates that it is not there)
> It is reported as supported in CPUID (even SEV-ES).

Correct, the hardware supports the feature, but you need the SEV firmware, 
too. The SEV firmware is only available on EPYC processors.

Thanks,
Tom

> 
> I tested this patch and it works.
> 
> However note (not related to this patch) that running nested guest,
> makes the L1 guest panic right in the very startup of the guest when npt=1.
> I tested this with many guest/host combinations and even with fedora kernel 5.3 running
> on both host and guest, this is the case.
> 
> Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Overall the patch makes sense to me, however I don't yet know this area well enought
> for a review, but I think I'll dig into it today and once it all makes sense to me,
> I'll review this patch as well.
> 
> Best regards,
> 	Maxim Levitsky
> 
>> Then low_phys_bits would also cover some of the
>> bits that are set in the shadow_mmio_value, terribly confusing the gfn
>> caching mechanism.
>>
>> To fix this, avoid splitting gfns as long as the processor does not have
>> the L1TF bug (which includes all AMD processors).  When there is no
>> splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing
>> the overlap.  This fixes "npt=0" operation on EPYC processors.
>>
>> Thanks to Maxim Levitsky for bisecting this bug.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 52918ed5fcf0 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>   arch/x86/kvm/mmu/mmu.c | 19 ++++++++++---------
>>   1 file changed, 10 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 8071952e9cf2..86619631ff6a 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -335,6 +335,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
>>   {
>>   	BUG_ON((u64)(unsigned)access_mask != access_mask);
>>   	BUG_ON((mmio_mask & mmio_value) != mmio_value);
>> +	WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << shadow_nonpresent_or_rsvd_mask_len));
>> +	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
>>   	shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
>>   	shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
>>   	shadow_mmio_access_mask = access_mask;
>> @@ -583,16 +585,15 @@ static void kvm_mmu_reset_all_pte_masks(void)
>>   	 * the most significant bits of legal physical address space.
>>   	 */
>>   	shadow_nonpresent_or_rsvd_mask = 0;
>> -	low_phys_bits = boot_cpu_data.x86_cache_bits;
>> -	if (boot_cpu_data.x86_cache_bits <
>> -	    52 - shadow_nonpresent_or_rsvd_mask_len) {
>> +	low_phys_bits = boot_cpu_data.x86_phys_bits;
>> +	if (boot_cpu_has_bug(X86_BUG_L1TF) &&
>> +	    !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
>> +			  52 - shadow_nonpresent_or_rsvd_mask_len)) {
>> +		low_phys_bits = boot_cpu_data.x86_cache_bits
>> +			- shadow_nonpresent_or_rsvd_mask_len;
>>   		shadow_nonpresent_or_rsvd_mask =
>> -			rsvd_bits(boot_cpu_data.x86_cache_bits -
>> -				  shadow_nonpresent_or_rsvd_mask_len,
>> -				  boot_cpu_data.x86_cache_bits - 1);
>> -		low_phys_bits -= shadow_nonpresent_or_rsvd_mask_len;
>> -	} else
>> -		WARN_ON_ONCE(boot_cpu_has_bug(X86_BUG_L1TF));
>> +			rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
>> +	}
>>   
>>   	shadow_nonpresent_or_rsvd_lower_gfn_mask =
>>   		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
> 
>
Maxim Levitsky May 19, 2020, 2:06 p.m. UTC | #4
On Tue, 2020-05-19 at 08:56 -0500, Tom Lendacky wrote:
> On 5/19/20 5:59 AM, Maxim Levitsky wrote:
> > On Tue, 2020-05-19 at 05:50 -0400, Paolo Bonzini wrote:
> > > KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
> > > in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
> > > in an L1TF attack.  This works as long as there are 5 free bits between
> > > MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
> > > access triggers a reserved-bit-set page fault.
> > 
> > Most of machines I used have MAXPHYADDR=39, however on larger server machines,
> > isn't MAXPHYADDR already something like 48, thus not allowing enought space for these bits?
> > This is the case for my machine as well.
> > 
> > In this case, if I understand correctly, the MAXPHYADDR value reported to the guest can
> > be reduced to accomodate for these bits, is that true?
> > 
> > 
> > > The bit positions however were computed wrongly for AMD processors that have
> > > encryption support.  In this case, x86_phys_bits is reduced (for example
> > > from 48 to 43, to account for the C bit at position 47 and four bits used
> > > internally to store the SEV ASID and other stuff) while x86_cache_bits in
> > > would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
> > > and bit 51 are set.
> > 
> > If I understand correctly this is done by the host kernel. I haven't had memory encryption
> > enabled when I did these tests.
> > 
> > 
> > FYI, later on, I did some digging about SME and SEV on my machine (3970X), and found out that memory encryption (SME) does actually work,
> > except that it makes AMD's own amdgpu driver panic on boot and according to google this is a very well known issue.
> > This is why I always thought that it wasn't supported.
> > 
> > I tested this issue while SME is enabled with efifb and it seems that its state (enabled/disabled) doesn't affect this bug,
> > which suggest me that a buggy bios always reports that memory encrypiton is enabled in that msr, or something
> > like that. I haven't yet studied this area well enought to be sure.
> 
> If the SMEE MSR bit (bit 23 of 0xc0010010) is enabled then the overall 
> hardware encryption feature is active which means the encryption bit is 
> available and active, regardless of whether the OS supports it or not, and 
> the physical address space is reduced.

This means, if I understand correctly, that when I don't enable encryption in
the kernel, the kernel just doesn't set the 'C' bit in the physical address,
but it could if it wanted to.
This makes sense; thanks for the explanation.


> 
> Some BIOSes provide an option to disable/turn off the SMEE bit, but not all.
> 
My BIOS indeed doesn't have any option in this regard.


> > SEV on the other hand is not active because the system doesn't seem to have PSP firmware loaded,
> > and only have CCP active (I added some printks to the ccp/psp driver and it shows that PSP reports 0 capability which indicates that it is not there)
> > It is reported as supported in CPUID (even SEV-ES).
> 
> Correct, the hardware supports the feature, but you need the SEV firmware, 
> too. The SEV firmware is only available on EPYC processors.
That's what I figured out. Thanks!

BTW, any ideas on why the AMDGPU driver crashes with SME enabled?
Is it still not supported, or is this a corner case that I can file a bug report about?

I have Radeon Pro WX 4100, and I am using mainline branch of the kernel.

I don't yet have a means to capture the kernel panic it is getting,
since I don't yet have a serial port on this machine, and I am looking
into getting one, trying my luck with a USB3 debug cable.

I used to get kernel panic reports via the 'ramoops' mechanism, which relies on
storing the kernel log in a fixed area of RAM and on the fact that the BIOS doesn't really
clear memory on reboot, but since memory is encrypted this doesn't work.

Best regards,
	Maxim Levitsky
Tom Lendacky May 19, 2020, 2:32 p.m. UTC | #5
On 5/19/20 9:06 AM, Maxim Levitsky wrote:
> On Tue, 2020-05-19 at 08:56 -0500, Tom Lendacky wrote:
>> On 5/19/20 5:59 AM, Maxim Levitsky wrote:
>>> On Tue, 2020-05-19 at 05:50 -0400, Paolo Bonzini wrote:
>>>> KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
>>>> in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
>>>> in an L1TF attack.  This works as long as there are 5 free bits between
>>>> MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
>>>> access triggers a reserved-bit-set page fault.
>>>
>>> Most of machines I used have MAXPHYADDR=39, however on larger server machines,
>>> isn't MAXPHYADDR already something like 48, thus not allowing enought space for these bits?
>>> This is the case for my machine as well.
>>>
>>> In this case, if I understand correctly, the MAXPHYADDR value reported to the guest can
>>> be reduced to accomodate for these bits, is that true?
>>>
>>>
>>>> The bit positions however were computed wrongly for AMD processors that have
>>>> encryption support.  In this case, x86_phys_bits is reduced (for example
>>>> from 48 to 43, to account for the C bit at position 47 and four bits used
>>>> internally to store the SEV ASID and other stuff) while x86_cache_bits in
>>>> would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
>>>> and bit 51 are set.
>>>
>>> If I understand correctly this is done by the host kernel. I haven't had memory encryption
>>> enabled when I did these tests.
>>>
>>>
>>> FYI, later on, I did some digging about SME and SEV on my machine (3970X), and found out that memory encryption (SME) does actually work,
>>> except that it makes AMD's own amdgpu driver panic on boot and according to google this is a very well known issue.
>>> This is why I always thought that it wasn't supported.
>>>
>>> I tested this issue while SME is enabled with efifb and it seems that its state (enabled/disabled) doesn't affect this bug,
>>> which suggest me that a buggy bios always reports that memory encrypiton is enabled in that msr, or something
>>> like that. I haven't yet studied this area well enought to be sure.
>>
>> If the SMEE MSR bit (bit 23 of 0xc0010010) is enabled then the overall
>> hardware encryption feature is active which means the encryption bit is
>> available and active, regardless of whether the OS supports it or not, and
>> the physical address space is reduced.
> 
> This means if I understand correctly that when I don't enable the encryption in
> the kernel, then basically kernel just doesn't set the 'C' bit in the physical address,
> but it can if it wanted to.
> This makes sense, thanks for the explanation.
> 
> 
>>
>> Some BIOSes provide an option to disable/turn off the SMEE bit, but not all.
>>
> My BIOS indeed doesn't have any option in regard to this.
> 
> 
>>> SEV on the other hand is not active because the system doesn't seem to have PSP firmware loaded,
>>> and only have CCP active (I added some printks to the ccp/psp driver and it shows that PSP reports 0 capability which indicates that it is not there)
>>> It is reported as supported in CPUID (even SEV-ES).
>>
>> Correct, the hardware supports the feature, but you need the SEV firmware,
>> too. The SEV firmware is only available on EPYC processors.
> That what I figured out. Thanks!
> 
> BTW, any ideas on why AMDGPU driver crashes with SME enabled?
> Is it still not supported or this is is a corner case which I can file a bug report about?

I think it is because the GPU doesn't support as large a physical address 
size as the CPU, so when the encryption bit is set the result appears to be 
an invalid address.

Many BIOSes have an option to enable something called TSME (transparent 
SME). This encrypts everything to/from DRAM without having to use the 
pagetable bits. This allows memory encryption for non-enlightened OSes and 
allows the GPU to function with DRAM encryption.

Thanks,
Tom

> 
> I have Radeon Pro WX 4100, and I am using mainline branch of the kernel.
> 
> I don't yet have means to capture the kernel panic it is getting,
> since I don't yet have a serial port on this machine, and I am looking
> into getting one trying my luck with usb3 debug cable.
> 
> I used to get the kernel panic reports via 'ramoops' mechanism which relies on
> storing the kernel log in a fixed area in the ram, and on the fact that BIOS doesn't really
> clear the memory on reboot, but since memory is encrypted it doesn't work.
> 
> Best regards,
> 	Maxim Levitsky
>
Vitaly Kuznetsov May 19, 2020, 2:35 p.m. UTC | #6
Paolo Bonzini <pbonzini@redhat.com> writes:

> KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
> in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
> in an L1TF attack.  This works as long as there are 5 free bits between
> MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
> access triggers a reserved-bit-set page fault.
>
> The bit positions however were computed wrongly for AMD processors that have
> encryption support.  In this case, x86_phys_bits is reduced (for example
> from 48 to 43, to account for the C bit at position 47 and four bits used
> internally to store the SEV ASID and other stuff) while x86_cache_bits in
> would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
> and bit 51 are set.  Then low_phys_bits would also cover some of the
> bits that are set in the shadow_mmio_value, terribly confusing the gfn
> caching mechanism.
>
> To fix this, avoid splitting gfns as long as the processor does not have
> the L1TF bug (which includes all AMD processors).  When there is no
> splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing
> the overlap.  This fixes "npt=0" operation on EPYC processors.
>
> Thanks to Maxim Levitsky for bisecting this bug.
>
> Cc: stable@vger.kernel.org
> Fixes: 52918ed5fcf0 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8071952e9cf2..86619631ff6a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -335,6 +335,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
>  {
>  	BUG_ON((u64)(unsigned)access_mask != access_mask);
>  	BUG_ON((mmio_mask & mmio_value) != mmio_value);
> +	WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << shadow_nonpresent_or_rsvd_mask_len));
> +	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
>  	shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
>  	shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
>  	shadow_mmio_access_mask = access_mask;
> @@ -583,16 +585,15 @@ static void kvm_mmu_reset_all_pte_masks(void)
>  	 * the most significant bits of legal physical address space.
>  	 */
>  	shadow_nonpresent_or_rsvd_mask = 0;
> -	low_phys_bits = boot_cpu_data.x86_cache_bits;
> -	if (boot_cpu_data.x86_cache_bits <
> -	    52 - shadow_nonpresent_or_rsvd_mask_len) {
> +	low_phys_bits = boot_cpu_data.x86_phys_bits;
> +	if (boot_cpu_has_bug(X86_BUG_L1TF) &&
> +	    !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
> +			  52 - shadow_nonpresent_or_rsvd_mask_len)) {
> +		low_phys_bits = boot_cpu_data.x86_cache_bits
> +			- shadow_nonpresent_or_rsvd_mask_len;
>  		shadow_nonpresent_or_rsvd_mask =
> -			rsvd_bits(boot_cpu_data.x86_cache_bits -
> -				  shadow_nonpresent_or_rsvd_mask_len,
> -				  boot_cpu_data.x86_cache_bits - 1);
> -		low_phys_bits -= shadow_nonpresent_or_rsvd_mask_len;
> -	} else
> -		WARN_ON_ONCE(boot_cpu_has_bug(X86_BUG_L1TF));
> +			rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
> +	}
>  
>  	shadow_nonpresent_or_rsvd_lower_gfn_mask =
>  		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);

This indeed seems to fix the previously completely broken 'npt=0' case,
checked on an AMD EPYC 7401P.

Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8071952e9cf2..86619631ff6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -335,6 +335,8 @@  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
 {
 	BUG_ON((u64)(unsigned)access_mask != access_mask);
 	BUG_ON((mmio_mask & mmio_value) != mmio_value);
+	WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << shadow_nonpresent_or_rsvd_mask_len));
+	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
 	shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
 	shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
 	shadow_mmio_access_mask = access_mask;
@@ -583,16 +585,15 @@  static void kvm_mmu_reset_all_pte_masks(void)
 	 * the most significant bits of legal physical address space.
 	 */
 	shadow_nonpresent_or_rsvd_mask = 0;
-	low_phys_bits = boot_cpu_data.x86_cache_bits;
-	if (boot_cpu_data.x86_cache_bits <
-	    52 - shadow_nonpresent_or_rsvd_mask_len) {
+	low_phys_bits = boot_cpu_data.x86_phys_bits;
+	if (boot_cpu_has_bug(X86_BUG_L1TF) &&
+	    !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
+			  52 - shadow_nonpresent_or_rsvd_mask_len)) {
+		low_phys_bits = boot_cpu_data.x86_cache_bits
+			- shadow_nonpresent_or_rsvd_mask_len;
 		shadow_nonpresent_or_rsvd_mask =
-			rsvd_bits(boot_cpu_data.x86_cache_bits -
-				  shadow_nonpresent_or_rsvd_mask_len,
-				  boot_cpu_data.x86_cache_bits - 1);
-		low_phys_bits -= shadow_nonpresent_or_rsvd_mask_len;
-	} else
-		WARN_ON_ONCE(boot_cpu_has_bug(X86_BUG_L1TF));
+			rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
+	}
 
 	shadow_nonpresent_or_rsvd_lower_gfn_mask =
 		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);