[2/4] KVM: SVM: use NPT page attributes

Message ID	1436276739-50326-3-git-send-email-pbonzini@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Paolo Bonzini <pbonzini@redhat.com> To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: guangrong.xiao@linux.intel.com, jroedel@suse.de, alex.williamson@redhat.com, ogerlitz@mellanox.com, amirv@mellanox.com Subject: [PATCH 2/4] KVM: SVM: use NPT page attributes Date: Tue, 7 Jul 2015 15:45:37 +0200 Message-Id: <1436276739-50326-3-git-send-email-pbonzini@redhat.com> In-Reply-To: <1436276739-50326-1-git-send-email-pbonzini@redhat.com> References: <1436276739-50326-1-git-send-email-pbonzini@redhat.com> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Paolo Bonzini July 7, 2015, 1:45 p.m. UTC

Right now, NPT page attributes are not used, and the final page
attribute depends solely on gPAT (which however is not synced
correctly), the guest MTRRs and the guest page attributes.

However, we can do better by mimicking what is done for VMX.
In the absence of PCI passthrough, the guest PAT can be ignored
and the page attributes can be just WB.  If passthrough is being
used, instead, keep respecting the guest PAT, and emulate the guest
MTRRs through the PAT field of the nested page tables.

The only snag is that WP memory cannot be emulated correctly,
because Linux's default PAT setting only includes the other types.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/svm.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 5 deletions(-)

Xiao Guangrong July 8, 2015, 5:59 a.m. UTC | #1

On 07/07/2015 09:45 PM, Paolo Bonzini wrote:
> Right now, NPT page attributes are not used, and the final page
> attribute depends solely on gPAT (which however is not synced
> correctly), the guest MTRRs and the guest page attributes.
>
> However, we can do better by mimicking what is done for VMX.
> In the absence of PCI passthrough, the guest PAT can be ignored
> and the page attributes can be just WB.  If passthrough is being
> used, instead, keep respecting the guest PAT, and emulate the guest
> MTRRs through the PAT field of the nested page tables.
>
> The only snag is that WP memory cannot be emulated correctly,
> because Linux's default PAT setting only includes the other types.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   arch/x86/kvm/svm.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 602b974a60a6..0f125c1860ec 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1085,6 +1085,47 @@ static u64 svm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
>   	return target_tsc - tsc;
>   }
>
> +static void svm_set_guest_pat(struct vcpu_svm *svm, u64 *g_pat)
> +{
> +	struct kvm_vcpu *vcpu = &svm->vcpu;
> +
> +	/* Unlike Intel, AMD takes the guest's CR0.CD into account.

I noticed this code in svm_set_cr0():

	if (!(vcpu->kvm->arch.disabled_quirks & KVM_QUIRK_CD_NW_CLEARED))
		cr0 &= ~(X86_CR0_CD | X86_CR0_NW);

gCR0.CD is hidden to CPU if KVM_QUIRK_CD_NW_CLEARED is not set and looks like
it is the normal case after grepping Qemu code.

> +	 *
> +	 * AMD does not have IPAT.  To emulate it for the case of guests
> +	 * with no assigned devices, just set everything to WB.  If guests
> +	 * have assigned devices, however, we cannot force WB for RAM
> +	 * pages only, so use the guest IPAT as passed.
> +	 */
> +	if (!kvm_arch_has_assigned_device(vcpu->kvm))
> +		*g_pat = 0x0606060606060606;
> +	else
> +		*g_pat = vcpu->arch.pat;
> +}
> +
> +static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +{
> +	u8 cache;
> +
> +	/*
> +	 * 1. MMIO: always map as UC
> +	 * 2. No passthrough: always map as WB, and force guest PAT to WB as well
> +	 * 3. Passthrough: can't guarantee the result, try to trust guest.
> +	 */
> +	if (is_mmio)
> +		return _PAGE_NOCACHE;
> +
> +	if (!kvm_arch_has_assigned_device(vcpu->kvm))
> +		return 0;
> +
> +	cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> +

@cache return from kvm_mtrr_get_guest_memory_type is MTRR_TYPE_*
which is different with _PAGE_CACHE_MODE_*. The latter is pure SW
usage, e.g:
_PAGE_CACHE_MODE_WB = 0 and  #define MTRR_TYPE_WRBACK     6


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paolo Bonzini July 8, 2015, 11:19 a.m. UTC | #2

On 08/07/2015 07:59, Xiao Guangrong wrote:
> 
> 
> On 07/07/2015 09:45 PM, Paolo Bonzini wrote:
>> Right now, NPT page attributes are not used, and the final page
>> attribute depends solely on gPAT (which however is not synced
>> correctly), the guest MTRRs and the guest page attributes.
>>
>> However, we can do better by mimicking what is done for VMX.
>> In the absence of PCI passthrough, the guest PAT can be ignored
>> and the page attributes can be just WB.  If passthrough is being
>> used, instead, keep respecting the guest PAT, and emulate the guest
>> MTRRs through the PAT field of the nested page tables.
>>
>> The only snag is that WP memory cannot be emulated correctly,
>> because Linux's default PAT setting only includes the other types.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>   arch/x86/kvm/svm.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 42 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 602b974a60a6..0f125c1860ec 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -1085,6 +1085,47 @@ static u64 svm_compute_tsc_offset(struct
>> kvm_vcpu *vcpu, u64 target_tsc)
>>       return target_tsc - tsc;
>>   }
>>
>> +static void svm_set_guest_pat(struct vcpu_svm *svm, u64 *g_pat)
>> +{
>> +    struct kvm_vcpu *vcpu = &svm->vcpu;
>> +
>> +    /* Unlike Intel, AMD takes the guest's CR0.CD into account.
> 
> I noticed this code in svm_set_cr0():
> 
>     if (!(vcpu->kvm->arch.disabled_quirks & KVM_QUIRK_CD_NW_CLEARED))
>         cr0 &= ~(X86_CR0_CD | X86_CR0_NW);
> 
> gCR0.CD is hidden to CPU if KVM_QUIRK_CD_NW_CLEARED is not set and looks
> like
> it is the normal case after grepping Qemu code.
> 
>> +     *
>> +     * AMD does not have IPAT.  To emulate it for the case of guests
>> +     * with no assigned devices, just set everything to WB.  If guests
>> +     * have assigned devices, however, we cannot force WB for RAM
>> +     * pages only, so use the guest IPAT as passed.
>> +     */
>> +    if (!kvm_arch_has_assigned_device(vcpu->kvm))
>> +        *g_pat = 0x0606060606060606;
>> +    else
>> +        *g_pat = vcpu->arch.pat;
>> +}
>> +
>> +static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool
>> is_mmio)
>> +{
>> +    u8 cache;
>> +
>> +    /*
>> +     * 1. MMIO: always map as UC
>> +     * 2. No passthrough: always map as WB, and force guest PAT to WB
>> as well
>> +     * 3. Passthrough: can't guarantee the result, try to trust guest.
>> +     */
>> +    if (is_mmio)
>> +        return _PAGE_NOCACHE;
>> +
>> +    if (!kvm_arch_has_assigned_device(vcpu->kvm))
>> +        return 0;
>> +
>> +    cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
>> +
> 
> @cache return from kvm_mtrr_get_guest_memory_type is MTRR_TYPE_*
> which is different with _PAGE_CACHE_MODE_*. The latter is pure SW
> usage, e.g:
> _PAGE_CACHE_MODE_WB = 0 and  #define MTRR_TYPE_WRBACK     6

Oops, you're right.  In fact my first version was correct, then I
changed it to use cachemode2protval and screwed up.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Xiao Guangrong July 9, 2015, 2:30 a.m. UTC | #3

On 07/08/2015 07:19 PM, Paolo Bonzini wrote:
>
>
> On 08/07/2015 07:59, Xiao Guangrong wrote:
>>
>>
>> On 07/07/2015 09:45 PM, Paolo Bonzini wrote:
>>> Right now, NPT page attributes are not used, and the final page
>>> attribute depends solely on gPAT (which however is not synced
>>> correctly), the guest MTRRs and the guest page attributes.
>>>
>>> However, we can do better by mimicking what is done for VMX.
>>> In the absence of PCI passthrough, the guest PAT can be ignored
>>> and the page attributes can be just WB.  If passthrough is being
>>> used, instead, keep respecting the guest PAT, and emulate the guest
>>> MTRRs through the PAT field of the nested page tables.
>>>
>>> The only snag is that WP memory cannot be emulated correctly,
>>> because Linux's default PAT setting only includes the other types.
>>>
>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>> ---
>>>    arch/x86/kvm/svm.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
>>>    1 file changed, 42 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index 602b974a60a6..0f125c1860ec 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -1085,6 +1085,47 @@ static u64 svm_compute_tsc_offset(struct
>>> kvm_vcpu *vcpu, u64 target_tsc)
>>>        return target_tsc - tsc;
>>>    }
>>>
>>> +static void svm_set_guest_pat(struct vcpu_svm *svm, u64 *g_pat)
>>> +{
>>> +    struct kvm_vcpu *vcpu = &svm->vcpu;
>>> +
>>> +    /* Unlike Intel, AMD takes the guest's CR0.CD into account.
>>
>> I noticed this code in svm_set_cr0():
>>
>>      if (!(vcpu->kvm->arch.disabled_quirks & KVM_QUIRK_CD_NW_CLEARED))
>>          cr0 &= ~(X86_CR0_CD | X86_CR0_NW);
>>
>> gCR0.CD is hidden to CPU if KVM_QUIRK_CD_NW_CLEARED is not set and looks
>> like
>> it is the normal case after grepping Qemu code.
>>

How about this one? I still do not know how SVM properly emulates CR0.CD? :(
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paolo Bonzini July 9, 2015, 3:18 p.m. UTC | #4

On 09/07/2015 04:30, Xiao Guangrong wrote:
>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>> index 602b974a60a6..0f125c1860ec 100644
>>>> --- a/arch/x86/kvm/svm.c
>>>> +++ b/arch/x86/kvm/svm.c
>>>> @@ -1085,6 +1085,47 @@ static u64 svm_compute_tsc_offset(struct
>>>> kvm_vcpu *vcpu, u64 target_tsc)
>>>>        return target_tsc - tsc;
>>>>    }
>>>>
>>>> +static void svm_set_guest_pat(struct vcpu_svm *svm, u64 *g_pat)
>>>> +{
>>>> +    struct kvm_vcpu *vcpu = &svm->vcpu;
>>>> +
>>>> +    /* Unlike Intel, AMD takes the guest's CR0.CD into account.
>>>
>>> I noticed this code in svm_set_cr0():
>>>
>>>      if (!(vcpu->kvm->arch.disabled_quirks & KVM_QUIRK_CD_NW_CLEARED))
>>>          cr0 &= ~(X86_CR0_CD | X86_CR0_NW);
>>>
>>> gCR0.CD is hidden to CPU if KVM_QUIRK_CD_NW_CLEARED is not set and looks
>>> like it is the normal case after grepping Qemu code.

Hi Xiao,

yes, this is correct.  QEMU still does not have support for disabling
"quirks", so gCR0.CD is currently hidden on SVM.  I would like to
include this series in 4.2, while for 4.3 I will disable the quirk above
altogether (it is superseded by the way PAT is forced to all-WB).

In any case, this is just a KVM limitation.  gCR0.CD is taken into
account by the processor when computing the effective memory type.  It
uses all of these:

- the guest page tables PAT/PCD/PWT

- the guest page tables CR3.PCD/CR3.PWD

- the nested page tables PAT/PCD/PWT

- the guest page tables CR3.PCD/CR3.PWD

- the host MTRRs

- gCR0.CD

- hCR0.CD

Paolo

> How about this one? I still do not know how SVM properly emulates
> CR0.CD? :(
> -- 
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Xiao Guangrong July 10, 2015, 1:19 a.m. UTC | #5

On 07/09/2015 11:18 PM, Paolo Bonzini wrote:
>
>
> On 09/07/2015 04:30, Xiao Guangrong wrote:
>>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>>> index 602b974a60a6..0f125c1860ec 100644
>>>>> --- a/arch/x86/kvm/svm.c
>>>>> +++ b/arch/x86/kvm/svm.c
>>>>> @@ -1085,6 +1085,47 @@ static u64 svm_compute_tsc_offset(struct
>>>>> kvm_vcpu *vcpu, u64 target_tsc)
>>>>>         return target_tsc - tsc;
>>>>>     }
>>>>>
>>>>> +static void svm_set_guest_pat(struct vcpu_svm *svm, u64 *g_pat)
>>>>> +{
>>>>> +    struct kvm_vcpu *vcpu = &svm->vcpu;
>>>>> +
>>>>> +    /* Unlike Intel, AMD takes the guest's CR0.CD into account.
>>>>
>>>> I noticed this code in svm_set_cr0():
>>>>
>>>>       if (!(vcpu->kvm->arch.disabled_quirks & KVM_QUIRK_CD_NW_CLEARED))
>>>>           cr0 &= ~(X86_CR0_CD | X86_CR0_NW);
>>>>
>>>> gCR0.CD is hidden to CPU if KVM_QUIRK_CD_NW_CLEARED is not set and looks
>>>> like it is the normal case after grepping Qemu code.
>
> Hi Xiao,
>
> yes, this is correct.  QEMU still does not have support for disabling
> "quirks", so gCR0.CD is currently hidden on SVM.  I would like to
> include this series in 4.2, while for 4.3 I will disable the quirk above
> altogether (it is superseded by the way PAT is forced to all-WB).
>

That plan sounds good to me.

You will drop disabled_quirks completely or just enable it in Qemu? :)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paolo Bonzini July 10, 2015, 10:47 a.m. UTC | #6

On 10/07/2015 03:19, Xiao Guangrong wrote:
>> yes, this is correct.  QEMU still does not have support for disabling
>> "quirks", so gCR0.CD is currently hidden on SVM.  I would like to
>> include this series in 4.2, while for 4.3 I will disable the quirk above
>> altogether (it is superseded by the way PAT is forced to all-WB).
> 
> That plan sounds good to me.
> 
> You will drop disabled_quirks completely or just enable it in Qemu? :)

I will drop this quirk completely.  Other quirks (well, there's just
one) will remain.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Xiao Guangrong July 10, 2015, 4:02 p.m. UTC | #7

On 07/10/2015 06:47 PM, Paolo Bonzini wrote:
>
>
> On 10/07/2015 03:19, Xiao Guangrong wrote:
>>> yes, this is correct.  QEMU still does not have support for disabling
>>> "quirks", so gCR0.CD is currently hidden on SVM.  I would like to
>>> include this series in 4.2, while for 4.3 I will disable the quirk above
>>> altogether (it is superseded by the way PAT is forced to all-WB).
>>
>> That plan sounds good to me.
>>
>> You will drop disabled_quirks completely or just enable it in Qemu? :)
>
> I will drop this quirk completely.  Other quirks (well, there's just
> one) will remain.

Make senses.

The whole patchset looks good to me!

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andy Lutomirski July 17, 2015, 12:35 a.m. UTC | #8

On 07/07/2015 06:45 AM, Paolo Bonzini wrote:
> Right now, NPT page attributes are not used, and the final page
> attribute depends solely on gPAT (which however is not synced
> correctly), the guest MTRRs and the guest page attributes.
>
> However, we can do better by mimicking what is done for VMX.
> In the absence of PCI passthrough, the guest PAT can be ignored
> and the page attributes can be just WB.  If passthrough is being
> used, instead, keep respecting the guest PAT, and emulate the guest
> MTRRs through the PAT field of the nested page tables.
>
> The only snag is that WP memory cannot be emulated correctly,
> because Linux's default PAT setting only includes the other types.
>

As of quite recently, Linux understands that the PAT has eight slots, 
and we can probably give you WP.  Do you want it?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paolo Bonzini July 17, 2015, 2:31 a.m. UTC | #9

On 17/07/2015 02:35, Andy Lutomirski wrote:
> 
>> Right now, NPT page attributes are not used, and the final page
>> attribute depends solely on gPAT (which however is not synced
>> correctly), the guest MTRRs and the guest page attributes.
>>
>> However, we can do better by mimicking what is done for VMX.
>> In the absence of PCI passthrough, the guest PAT can be ignored
>> and the page attributes can be just WB.  If passthrough is being
>> used, instead, keep respecting the guest PAT, and emulate the guest
>> MTRRs through the PAT field of the nested page tables.
>>
>> The only snag is that WP memory cannot be emulated correctly,
>> because Linux's default PAT setting only includes the other types.
>>
> 
> As of quite recently, Linux understands that the PAT has eight slots,
> and we can probably give you WP.  Do you want it?

It doesn't really matter, it's mostly used in the firmware.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[2/4] KVM: SVM: use NPT page attributes

Commit Message

Comments

Patch