[14/15] KVM: MTRR: do not map huage page for non-consistent range
diff mbox

Message ID 1432983566-15773-15-git-send-email-guangrong.xiao@linux.intel.com
State New
Headers show

Commit Message

Xiao Guangrong May 30, 2015, 10:59 a.m. UTC
Based on Intel's SDM, mapping huge page which do not have consistent
memory cache for each 4k page will cause undefined behavior

In order to avoiding this kind of undefined behavior, we force to use
4k pages under this case

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
---
 arch/x86/kvm/mmu.c  | 20 +++++++++++++++++++-
 arch/x86/kvm/mtrr.c | 25 +++++++++++++++++++++++++
 arch/x86/kvm/x86.h  |  2 ++
 3 files changed, 46 insertions(+), 1 deletion(-)

Comments

Paolo Bonzini June 1, 2015, 9:36 a.m. UTC | #1
On 30/05/2015 12:59, Xiao Guangrong wrote:
> Currently guest MTRR is completely prohibited if cache snoop is supported on
> IOMMU (!noncoherent_dma) and host does the emulation based on the knowledge
> from host side, however, host side is not the good point to know
> what the purpose of guest is. A good example is that pass-throughed VGA
> frame buffer is not always UC as host expected

Can you explain how?  The original idea was that such a framebuffer
would be kvm_is_reserved_pfn and thus be unconditionally UC.

> +bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
> +					  int page_num)
> +{
> +	struct mtrr_looker looker;
> +	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
> +	u64 start = gfn_to_gpa(gfn), end = gfn_to_gpa(gfn + page_num);
> +	int type = -1;
> +
> +	mtrr_for_each_mem_type(&looker, mtrr_state, start, end) {
> +		if (type == -1) {
> +			type = looker.mem_type;
> +			continue;
> +		}
> +
> +		if (type != looker.mem_type)
> +			return false;
> +	}
> +
> +	if ((type != -1) && looker.partial_map &&
> +	      (mtrr_state->def_type != type))
> +		return false;
> +

No Pascal-like parentheses.

Does this have a performance impact on shadow?  Perhaps we could cache
in struct kvm_arch_memory_slot whether the memslot is covered by MTRRs?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 1, 2015, 9:38 a.m. UTC | #2
On 01/06/2015 11:36, Paolo Bonzini wrote:
> Does this have a performance impact on shadow?  Perhaps we could cache
> in struct kvm_arch_memory_slot whether the memslot is covered by MTRRs?

Nevermind, patch 15 answers my question.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong June 3, 2015, 2:56 a.m. UTC | #3
On 06/01/2015 05:36 PM, Paolo Bonzini wrote:
>
>
> On 30/05/2015 12:59, Xiao Guangrong wrote:
>> Currently guest MTRR is completely prohibited if cache snoop is supported on
>> IOMMU (!noncoherent_dma) and host does the emulation based on the knowledge
>> from host side, however, host side is not the good point to know
>> what the purpose of guest is. A good example is that pass-throughed VGA
>> frame buffer is not always UC as host expected
>
> Can you explain how?  The original idea was that such a framebuffer
> would be kvm_is_reserved_pfn and thus be unconditionally UC.

Yes, frame-buffer is always UC in current code, however, UC for frame-buffer
causes bad performance. It's quoted from Documentation/x86/mtrr.txt:

| This is most useful when you have a video (VGA) card on a PCI or AGP bus.
| Enabling write-combining allows bus write transfers to be combined into a
| larger transfer before bursting over the PCI/AGP bus. This can increase
| performance of image write operations 2.5 times or more.

So that guest will configure the range to MTRR, this patchset follows
guest MTRR and cooperates with guest PAT (ept.VMX_EPT_IPAT_BIT = 0) to emulate
guest cache type as guest expects.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 3, 2015, 7:55 a.m. UTC | #4
On 03/06/2015 04:56, Xiao Guangrong wrote:
> 
> 
> On 06/01/2015 05:36 PM, Paolo Bonzini wrote:
>>
>>
>> On 30/05/2015 12:59, Xiao Guangrong wrote:
>>> Currently guest MTRR is completely prohibited if cache snoop is
>>> supported on
>>> IOMMU (!noncoherent_dma) and host does the emulation based on the
>>> knowledge
>>> from host side, however, host side is not the good point to know
>>> what the purpose of guest is. A good example is that pass-throughed VGA
>>> frame buffer is not always UC as host expected
>>
>> Can you explain how?  The original idea was that such a framebuffer
>> would be kvm_is_reserved_pfn and thus be unconditionally UC.
> 
> Yes, frame-buffer is always UC in current code, however, UC for
> frame-buffer causes bad performance.

Understood now, thanks.

> So that guest will configure the range to MTRR, this patchset follows
> guest MTRR and cooperates with guest PAT (ept.VMX_EPT_IPAT_BIT = 0) to
> emulate guest cache type as guest expects.

Unlike e.g. CR0.CD=1, UC memory does not snoop the cache to preserve
coherency.  AMD, has special logic to do this, for example:

- if guest PAT says "UC" and host MTRR says "WB", the processor will not
cache the memory but will snoop the cache as if CR0.CD=1

- if guest PAT says "WC" and host (nested page table) PAT says "WB" and
host MTRR says "WB", the processor will still do write combining but
also snoop the cache as if CR0.CD=1

I am worried that the lack of this feature could cause problems if
guests map QEMU's VGA framebuffer as uncached.  We have this problem on
ARM, so it's not 100% theoretical.

So, why do you need to always use IPAT=0?  Can patch 15 keep the current
logic for RAM, like this:

	if (is_mmio || kvm_arch_has_noncoherent_dma(vcpu->kvm))
		ret = kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
		      VMX_EPT_MT_EPTE_SHIFT;
	else
		ret = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT)
			| VMX_EPT_IPAT_BIT;

?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong June 4, 2015, 8:23 a.m. UTC | #5
On 06/03/2015 03:55 PM, Paolo Bonzini wrote:
>
>
> On 03/06/2015 04:56, Xiao Guangrong wrote:
>>
>>
>> On 06/01/2015 05:36 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 30/05/2015 12:59, Xiao Guangrong wrote:
>>>> Currently guest MTRR is completely prohibited if cache snoop is
>>>> supported on
>>>> IOMMU (!noncoherent_dma) and host does the emulation based on the
>>>> knowledge
>>>> from host side, however, host side is not the good point to know
>>>> what the purpose of guest is. A good example is that pass-throughed VGA
>>>> frame buffer is not always UC as host expected
>>>
>>> Can you explain how?  The original idea was that such a framebuffer
>>> would be kvm_is_reserved_pfn and thus be unconditionally UC.
>>
>> Yes, frame-buffer is always UC in current code, however, UC for
>> frame-buffer causes bad performance.
>
> Understood now, thanks.
>
>> So that guest will configure the range to MTRR, this patchset follows
>> guest MTRR and cooperates with guest PAT (ept.VMX_EPT_IPAT_BIT = 0) to
>> emulate guest cache type as guest expects.
>
> Unlike e.g. CR0.CD=1, UC memory does not snoop the cache to preserve
> coherency.  AMD, has special logic to do this, for example:
>
> - if guest PAT says "UC" and host MTRR says "WB", the processor will not
> cache the memory but will snoop the cache as if CR0.CD=1
>
> - if guest PAT says "WC" and host (nested page table) PAT says "WB" and
> host MTRR says "WB", the processor will still do write combining but
> also snoop the cache as if CR0.CD=1
>
> I am worried that the lack of this feature could cause problems if
> guests map QEMU's VGA framebuffer as uncached.  We have this problem on
> ARM, so it's not 100% theoretical.

CR0.CD is always 0 in both host and guest, i guess it's why we cleared
CR0.CD and CR0.NW in vmx_set_cr0().

>
> So, why do you need to always use IPAT=0?  Can patch 15 keep the current
> logic for RAM, like this:
>
> 	if (is_mmio || kvm_arch_has_noncoherent_dma(vcpu->kvm))
> 		ret = kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
> 		      VMX_EPT_MT_EPTE_SHIFT;
> 	else
> 		ret = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT)
> 			| VMX_EPT_IPAT_BIT;

Yeah, it's okay, actually we considered this way, however
- it's light enough, it did not hurt guest performance based on our
   benchmark.
- the logic has always used for noncherent_dma case, extend it to
   normal case should have low risk and also help us to check the logic.
- completely follow MTRRS spec would be better than host hides it.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong June 4, 2015, 8:26 a.m. UTC | #6
On 06/04/2015 04:23 PM, Xiao Guangrong wrote:
>
>
> On 06/03/2015 03:55 PM, Paolo Bonzini wrote:
>>
>>
>> On 03/06/2015 04:56, Xiao Guangrong wrote:
>>>
>>>
>>> On 06/01/2015 05:36 PM, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 30/05/2015 12:59, Xiao Guangrong wrote:
>>>>> Currently guest MTRR is completely prohibited if cache snoop is
>>>>> supported on
>>>>> IOMMU (!noncoherent_dma) and host does the emulation based on the
>>>>> knowledge
>>>>> from host side, however, host side is not the good point to know
>>>>> what the purpose of guest is. A good example is that pass-throughed VGA
>>>>> frame buffer is not always UC as host expected
>>>>
>>>> Can you explain how?  The original idea was that such a framebuffer
>>>> would be kvm_is_reserved_pfn and thus be unconditionally UC.
>>>
>>> Yes, frame-buffer is always UC in current code, however, UC for
>>> frame-buffer causes bad performance.
>>
>> Understood now, thanks.
>>
>>> So that guest will configure the range to MTRR, this patchset follows
>>> guest MTRR and cooperates with guest PAT (ept.VMX_EPT_IPAT_BIT = 0) to
>>> emulate guest cache type as guest expects.
>>
>> Unlike e.g. CR0.CD=1, UC memory does not snoop the cache to preserve
>> coherency.  AMD, has special logic to do this, for example:
>>
>> - if guest PAT says "UC" and host MTRR says "WB", the processor will not
>> cache the memory but will snoop the cache as if CR0.CD=1
>>
>> - if guest PAT says "WC" and host (nested page table) PAT says "WB" and
>> host MTRR says "WB", the processor will still do write combining but
>> also snoop the cache as if CR0.CD=1
>>
>> I am worried that the lack of this feature could cause problems if
>> guests map QEMU's VGA framebuffer as uncached.  We have this problem on
>> ARM, so it's not 100% theoretical.
>
> CR0.CD is always 0 in both host and guest, i guess it's why we cleared
> CR0.CD and CR0.NW in vmx_set_cr0().

It reminds me that we should check guest CR0.CD before check guest MTRR
and disable guest PAT if guest CR0.CD = 1.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 4, 2015, 8:34 a.m. UTC | #7
On 04/06/2015 10:26, Xiao Guangrong wrote:
>>
>> CR0.CD is always 0 in both host and guest, i guess it's why we cleared
>> CR0.CD and CR0.NW in vmx_set_cr0().
> 
> It reminds me that we should check guest CR0.CD before check guest MTRR
> and disable guest PAT if guest CR0.CD = 1.

I think it can be done separately, on top.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 4, 2015, 8:36 a.m. UTC | #8
On 04/06/2015 10:23, Xiao Guangrong wrote:
>>
>> So, why do you need to always use IPAT=0?  Can patch 15 keep the current
>> logic for RAM, like this:
>>
>>     if (is_mmio || kvm_arch_has_noncoherent_dma(vcpu->kvm))
>>         ret = kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
>>               VMX_EPT_MT_EPTE_SHIFT;
>>     else
>>         ret = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT)
>>             | VMX_EPT_IPAT_BIT;
> 
> Yeah, it's okay, actually we considered this way, however
> - it's light enough, it did not hurt guest performance based on our
>   benchmark.
> - the logic has always used for noncherent_dma case, extend it to
>   normal case should have low risk and also help us to check the logic.

But noncoherent_dma is not the common case, so it's not necessarily true
that the risk is low.

> - completely follow MTRRS spec would be better than host hides it.

We are a virtualization platform, we know well when MTRRs are necessary.

Tis a risk from blindly obeying the guest MTRRs: userspace can see stale
data if the guest's accesses bypass the cache.  AMD bypasses this by
enabling snooping even in cases that ordinarily wouldn't snoop; for
Intel the solution is that RAM-backed areas should always use IPAT.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong June 5, 2015, 6:33 a.m. UTC | #9
[ CCed Zhang Yang ]

On 06/04/2015 04:36 PM, Paolo Bonzini wrote:
>
>
> On 04/06/2015 10:23, Xiao Guangrong wrote:
>>>
>>> So, why do you need to always use IPAT=0?  Can patch 15 keep the current
>>> logic for RAM, like this:
>>>
>>>      if (is_mmio || kvm_arch_has_noncoherent_dma(vcpu->kvm))
>>>          ret = kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
>>>                VMX_EPT_MT_EPTE_SHIFT;
>>>      else
>>>          ret = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT)
>>>              | VMX_EPT_IPAT_BIT;
>>
>> Yeah, it's okay, actually we considered this way, however
>> - it's light enough, it did not hurt guest performance based on our
>>    benchmark.
>> - the logic has always used for noncherent_dma case, extend it to
>>    normal case should have low risk and also help us to check the logic.
>
> But noncoherent_dma is not the common case, so it's not necessarily true
> that the risk is low.

I thought noncoherent_dma exists on 1st generation(s) IOMMU, it should
be fully tested at that time.

>
>> - completely follow MTRRS spec would be better than host hides it.
>
> We are a virtualization platform, we know well when MTRRs are necessary.
>
> Tis a risk from blindly obeying the guest MTRRs: userspace can see stale
> data if the guest's accesses bypass the cache.  AMD bypasses this by
> enabling snooping even in cases that ordinarily wouldn't snoop; for
> Intel the solution is that RAM-backed areas should always use IPAT.

Not sure if UC and other cacheable type combinations on guest and host
will cause problem. The SMD mentioned that snoop is not required only when
"The UC attribute comes from the MTRRs and the processors are not required
  to snoop their caches since the data could never have been cached."
(Vol 3. 11.5.2.2)
VMX do not touch hardware MTRR MSRs and i guess snoop works under this case.

I also noticed if SS (self-snooping) is supported we need not to invalidate
cache when programming memory type (Vol 3. 11.11.8), so that means CPU works
well on the page which has different cache types i guess.

After think it carefully, we (Zhang Yang) doubt if always set WB for DMA
memory is really a good idea because we can not assume WB DMA works well for
all devices. One example is that audio DMA (not a MMIO region) is required WC
to improve its performance.

However, we think the SDM is not clear enough so let's do full vMTRR on MMIO
and noncoherent_dma first.?:)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7462c57..c8c2a90 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3437,6 +3437,16 @@  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	return false;
 }
 
+static bool
+check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
+{
+	int page_num = KVM_PAGES_PER_HPAGE(level);
+
+	gfn &= ~(page_num - 1);
+
+	return kvm_mtrr_check_gfn_range_consistency(vcpu, gfn, page_num);
+}
+
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 			  bool prefault)
 {
@@ -3462,9 +3472,17 @@  static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	if (r)
 		return r;
 
-	force_pt_level = mapping_level_dirty_bitmap(vcpu, gfn);
+	if (mapping_level_dirty_bitmap(vcpu, gfn) ||
+	    !check_hugepage_cache_consistency(vcpu, gfn, PT_DIRECTORY_LEVEL))
+		force_pt_level = 1;
+	else
+		force_pt_level = 0;
+
 	if (likely(!force_pt_level)) {
 		level = mapping_level(vcpu, gfn);
+		if (level > PT_DIRECTORY_LEVEL &&
+		    !check_hugepage_cache_consistency(vcpu, gfn, level))
+			level = PT_DIRECTORY_LEVEL;
 		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
 	} else
 		level = PT_PAGE_TABLE_LEVEL;
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index bc90834..703a66b 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -645,3 +645,28 @@  u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
 	return type;
 }
 EXPORT_SYMBOL_GPL(kvm_mtrr_get_guest_memory_type);
+
+bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
+					  int page_num)
+{
+	struct mtrr_looker looker;
+	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+	u64 start = gfn_to_gpa(gfn), end = gfn_to_gpa(gfn + page_num);
+	int type = -1;
+
+	mtrr_for_each_mem_type(&looker, mtrr_state, start, end) {
+		if (type == -1) {
+			type = looker.mem_type;
+			continue;
+		}
+
+		if (type != looker.mem_type)
+			return false;
+	}
+
+	if ((type != -1) && looker.partial_map &&
+	      (mtrr_state->def_type != type))
+		return false;
+
+	return true;
+}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a3dae49..7c30ec8 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -166,6 +166,8 @@  void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
 bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
+bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
+					  int page_num);
 
 #define KVM_SUPPORTED_XCR0     (XSTATE_FP | XSTATE_SSE | XSTATE_YMM \
 				| XSTATE_BNDREGS | XSTATE_BNDCSR \