[v4,09/13] nEPT: Add nEPT violation/misconfiguration support

Message ID 1374750001-28527-10-git-send-email-gleb@redhat.com (mailing list archive)
State New, archived

Commit Message

Gleb Natapov July 25, 2013, 10:59 a.m. UTC
From: Yang Zhang <yang.z.zhang@Intel.com>

Inject nEPT faults into the L1 guest. This patch originally comes from Xinhao Xu.

Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    4 ++++
 arch/x86/kvm/mmu.c              |   34 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/paging_tmpl.h      |   30 +++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
 4 files changed, 84 insertions(+), 1 deletion(-)

Comments

Paolo Bonzini July 29, 2013, 8:59 a.m. UTC | #1
On 25/07/2013 12:59, Gleb Natapov wrote:
> +#if PTTYPE == PTTYPE_EPT
> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
> +#else
> +#define CHECK_BAD_MT_XWR(G) 0;
> +#endif
> +
>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>  {
>  	int bit7;
>  
>  	bit7 = (gpte >> 7) & 1;
> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
> +		CHECK_BAD_MT_XWR(gpte);
>  }
>  
> +#undef CHECK_BAD_MT_XWR

Instead of a macro, you can do

	if (...)
		return true;
#if PTTYPE == PTTYPE_EPT
	if (...)
		return true;
#endif
	return false;

The compiler should be smart enough to generate the same code for
non-EPT PTTYPE.
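
For illustration, filling that skeleton into this function would look
roughly like the following sketch (not code from the posted series):

	static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
	{
		int bit7 = (gpte >> 7) & 1;

		if (gpte & mmu->rsvd_bits_mask[bit7][level-1])
			return true;
#if PTTYPE == PTTYPE_EPT
		/* bad_mt_xwr is a 64-bit bitmap indexed by the PTE's low six bits */
		if (mmu->bad_mt_xwr & (1ull << (gpte & 0x3f)))
			return true;
#endif
		return false;
	}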

> 
> +	/*
> +	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
> +	 * misconfiguration requires to be injected. The detection is
> +	 * done by is_rsvd_bits_set() above.

erorr_code -> error_code

This patch has warnings for unused static functions.  You can squash
them, or split them differently according to file boundaries (i.e. mmu.c
first, vmx.c second).

Paolo
Gleb Natapov July 29, 2013, 10:52 a.m. UTC | #2
On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
> On 25/07/2013 12:59, Gleb Natapov wrote:
> > +#if PTTYPE == PTTYPE_EPT
> > +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
> > +#else
> > +#define CHECK_BAD_MT_XWR(G) 0;
> > +#endif
> > +
> >  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >  {
> >  	int bit7;
> >  
> >  	bit7 = (gpte >> 7) & 1;
> > -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
> > +		CHECK_BAD_MT_XWR(gpte);
> >  }
> >  
> > +#undef CHECK_BAD_MT_XWR
> 
> Instead of a macro, you can do
> 
> 	if (...)
> 		return true;
> #if PTTYPE == PTTYPE_EPT
> 	if (...)
> 		return true;
> #endif
> 	return false;
> 
> The compiler should be smart enough to generate the same code for
> non-EPT PTTYPE.
> 
The idea behind this rsvd_bits_mask trickery is to  produce code that
does not have conditional branches. I don't want to rely on the compiler to do
the right thing. On the other hand, I am not sure that just dropping these
ifdefs here and checking mmu->bad_mt_xwr for the non-EPT case too is not a
good idea. The overhead should not be measurable.

> > 
> > +	/*
> > +	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
> > +	 * misconfiguration requires to be injected. The detection is
> > +	 * done by is_rsvd_bits_set() above.
> 
> erorr_code -> error_code
> 
> This patch has warnings for unused static functions.  You can squash
> them, or split them differently according to file boundaries (i.e. mmu.c
> first, vmx.c second).
> 
I prefer to have an in-between patch with a warning rather than dividing
code that logically belongs together between patches.

--
			Gleb.
Paolo Bonzini July 29, 2013, 10:59 a.m. UTC | #3
On 29/07/2013 12:52, Gleb Natapov wrote:
> On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
>> On 25/07/2013 12:59, Gleb Natapov wrote:
>>> +#if PTTYPE == PTTYPE_EPT
>>> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
>>> +#else
>>> +#define CHECK_BAD_MT_XWR(G) 0;

No semicolons here, BTW.

>>> +#endif
>>> +
>>>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>>>  {
>>>  	int bit7;
>>>  
>>>  	bit7 = (gpte >> 7) & 1;
>>> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
>>> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
>>> +		CHECK_BAD_MT_XWR(gpte);
>>>  }
>>>  
>>> +#undef CHECK_BAD_MT_XWR
>>
>> Instead of a macro, you can do
>>
>> 	if (...)
>> 		return true;
>> #if PTTYPE == PTTYPE_EPT
>> 	if (...)
>> 		return true;
>> #endif
>> 	return false;
>>
>> The compiler should be smart enough to generate the same code for
>> non-EPT PTTYPE.
>>
> The idea behind this rsvd_bits_mask trickery is to  produce code that
> does not have conditional branches.

If you want to have no conditional branches, you need to use "|" not
"||" (and you also need an "!= 0" in CHECK_BAD_MT_XWR).  As you wrote
it, the compiler is most likely to generate exactly the same code that I
suggested.

But I think what you _really_ want is not avoiding conditional branches.
What you want is to always inline is_rsvd_bits_set, so that the compiler
can merge these "if"s with the one where the caller calls
is_rsvd_bits_set.  So just mark is_rsvd_bits_set as inline.
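
That is, roughly (a sketch of the pre-patch, non-EPT body; only the
inline keyword is the suggested change):

	static inline bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
	{
		int bit7 = (gpte >> 7) & 1;

		return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
	}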

> I don't want to rely on the compiler to do
> the right thing. On the other hand, I am not sure that just dropping these
> ifdefs here and checking mmu->bad_mt_xwr for the non-EPT case too is not a
> good idea. The overhead should not be measurable.

That's also a possibility, but I think if you mark is_rsvd_bits_set as
inline it is better to leave the ifs separate.

>>>
>>> +	/*
>>> +	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
>>> +	 * misconfiguration requires to be injected. The detection is
>>> +	 * done by is_rsvd_bits_set() above.
>>
>> erorr_code -> error_code
>>
>> This patch has warnings for unused static functions.  You can squash
>> them, or split them differently according to file boundaries (i.e. mmu.c
>> first, vmx.c second).
>>
> I prefer to have an in-between patch with a warning rather than dividing
> code that logically belongs together between patches.

Fine.

Paolo
Gleb Natapov July 29, 2013, 11:43 a.m. UTC | #4
On Mon, Jul 29, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> On 29/07/2013 12:52, Gleb Natapov wrote:
> > On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
> >> On 25/07/2013 12:59, Gleb Natapov wrote:
> >>> +#if PTTYPE == PTTYPE_EPT
> >>> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
> >>> +#else
> >>> +#define CHECK_BAD_MT_XWR(G) 0;
> 
> No semicolons here, BTW.
> 
> >>> +#endif
> >>> +
> >>>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >>>  {
> >>>  	int bit7;
> >>>  
> >>>  	bit7 = (gpte >> 7) & 1;
> >>> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> >>> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
> >>> +		CHECK_BAD_MT_XWR(gpte);
> >>>  }
> >>>  
> >>> +#undef CHECK_BAD_MT_XWR
> >>
> >> Instead of a macro, you can do
> >>
> >> 	if (...)
> >> 		return true;
> >> #if PTTYPE == PTTYPE_EPT
> >> 	if (...)
> >> 		return true;
> >> #endif
> >> 	return false;
> >>
> >> The compiler should be smart enough to generate the same code for
> >> non-EPT PTTYPE.
> >>
> > The idea behind this rsvd_bits_mask trickery is to  produce code that
> > does not have conditional branches.
> 
> If you want to have no conditional branches, you need to use "|" not
> "||" (and you also need an "!= 0" in CHECK_BAD_MT_XWR).  As you wrote
> it, the compiler is most likely to generate exactly the same code that I
> suggested.
OK. I can add that :) I still prefer to have ifdefs outside the
function, not inside.

> 
> But I think what you _really_ want is not avoiding conditional branches.
The idea is that it is hard for branch prediction to predict the correct
result when that result depends on the guest's page tables, which can
contain anything, so in some places the shadow paging code uses boolean
logic to avoid branches. In this case it is hard to avoid the if() anyway,
since the function invocation is in the if().
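
As a sketch, the boolean-logic style means computing every predicate
unconditionally and combining them without a branch, e.g.:

	/* both sides are always evaluated; "|" (not "||") avoids a branch */
	bool bad = ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
		   ((mmu->bad_mt_xwr & (1ull << (gpte & 0x3f))) != 0);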

> What you want is to always inline is_rsvd_bits_set, so that the compiler
> can merge these "if"s with the one where the caller calls
> is_rsvd_bits_set.  So just mark is_rsvd_bits_set as inline.
Will do, but it is inlined regardless. It's very small.


> 
> > I don't want to rely on the compiler to do
> > the right thing. On the other hand, I am not sure that just dropping these
> > ifdefs here and checking mmu->bad_mt_xwr for the non-EPT case too is not a
> > good idea. The overhead should not be measurable.
> 
> That's also a possibility, but I think if you mark is_rsvd_bits_set as
> inline it is better to leave the ifs separate.
> 
> >>>
> >>> +	/*
> >>> +	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
> >>> +	 * misconfiguration requires to be injected. The detection is
> >>> +	 * done by is_rsvd_bits_set() above.
> >>
> >> erorr_code -> error_code
> >>
> >> This patch has warnings for unused static functions.  You can squash
> >> them, or split them differently according to file boundaries (i.e. mmu.c
> >> first, vmx.c second).
> >>
> > I prefer to have an in-between patch with a warning rather than dividing
> > code that logically belongs together between patches.
> 
> Fine.
> 
> Paolo

--
			Gleb.
Paolo Bonzini July 29, 2013, 12:05 p.m. UTC | #5
On 29/07/2013 13:43, Gleb Natapov wrote:
> On Mon, Jul 29, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
>> On 29/07/2013 12:52, Gleb Natapov wrote:
>>> On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
>>>> On 25/07/2013 12:59, Gleb Natapov wrote:
>>>>> +#if PTTYPE == PTTYPE_EPT
>>>>> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
>>>>> +#else
>>>>> +#define CHECK_BAD_MT_XWR(G) 0;
>>
>> No semicolons here, BTW.
>>
>>>>> +#endif
>>>>> +
>>>>>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>>>>>  {
>>>>>  	int bit7;
>>>>>  
>>>>>  	bit7 = (gpte >> 7) & 1;
>>>>> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
>>>>> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
>>>>> +		CHECK_BAD_MT_XWR(gpte);
>>>>>  }
>>>>>  
>>>>> +#undef CHECK_BAD_MT_XWR
>>>>
>>>> Instead of a macro, you can do
>>>>
>>>> 	if (...)
>>>> 		return true;
>>>> #if PTTYPE == PTTYPE_EPT
>>>> 	if (...)
>>>> 		return true;
>>>> #endif
>>>> 	return false;
>>>>
>>>> The compiler should be smart enough to generate the same code for
>>>> non-EPT PTTYPE.
>>>>
>>> The idea behind this rsvd_bits_mask trickery is to  produce code that
>>> does not have conditional branches.
>>
>> If you want to have no conditional branches, you need to use "|" not
>> "||" (and you also need an "!= 0" in CHECK_BAD_MT_XWR).  As you wrote
>> it, the compiler is most likely to generate exactly the same code that I
>> suggested.
> OK. I can add that :) I still prefer to have ifdefs outside the
> function, not inside.
> 
>>
>> But I think what you _really_ want is not avoiding conditional branches.
> The idea is that it is hard for branch prediction to predict the correct
> result when that result depends on the guest's page tables, which can
> contain anything, so in some places the shadow paging code uses boolean
> logic to avoid branches. In this case it is hard to avoid the if() anyway,
> since the function invocation is in the if().

Yes, I get the idea, but is_rsvd_bits_set should be predicted unlikely,
no?  If the guest has to run, it must use mostly valid ptes. :)

Especially if you change prefetch_invalid_gpte to do the reserved bits
test after the present test (so that is_rsvd_bits_set is only called on
present pagetables), is_rsvd_bits_set's result should be really
well-predicted.  At this point (and especially since function invocation
is always in "if"s), using boolean logic to avoid branches does not make
much sense anymore for this function.
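
A sketch of that reordering in prefetch_invalid_gpte (assuming the
current code does the reserved-bits test first):

	if (!FNAME(is_present_gpte)(gpte))
		goto no_present;

	/* now the reserved-bits test only runs on present PTEs */
	if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
		goto no_present;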

The compiler may still use setCC and boolean logic, even if you write it
as "if"s; and in simple cases like this, after inlining and seeing an
"if" the compiler may undo all your careful choice of "|" vs. "||".  So
it's better anyway to ensure is_rsvd_bits_set is called only when it's
unlikely.

Paolo
Gleb Natapov July 29, 2013, 12:34 p.m. UTC | #6
On Mon, Jul 29, 2013 at 02:05:44PM +0200, Paolo Bonzini wrote:
> On 29/07/2013 13:43, Gleb Natapov wrote:
> > On Mon, Jul 29, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> >> On 29/07/2013 12:52, Gleb Natapov wrote:
> >>> On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
> >>>> On 25/07/2013 12:59, Gleb Natapov wrote:
> >>>>> +#if PTTYPE == PTTYPE_EPT
> >>>>> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
> >>>>> +#else
> >>>>> +#define CHECK_BAD_MT_XWR(G) 0;
> >>
> >> No semicolons here, BTW.
> >>
> >>>>> +#endif
> >>>>> +
> >>>>>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >>>>>  {
> >>>>>  	int bit7;
> >>>>>  
> >>>>>  	bit7 = (gpte >> 7) & 1;
> >>>>> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> >>>>> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
> >>>>> +		CHECK_BAD_MT_XWR(gpte);
> >>>>>  }
> >>>>>  
> >>>>> +#undef CHECK_BAD_MT_XWR
> >>>>
> >>>> Instead of a macro, you can do
> >>>>
> >>>> 	if (...)
> >>>> 		return true;
> >>>> #if PTTYPE == PTTYPE_EPT
> >>>> 	if (...)
> >>>> 		return true;
> >>>> #endif
> >>>> 	return false;
> >>>>
> >>>> The compiler should be smart enough to generate the same code for
> >>>> non-EPT PTTYPE.
> >>>>
> >>> The idea behind this rsvd_bits_mask trickery is to  produce code that
> >>> does not have conditional branches.
> >>
> >> If you want to have no conditional branches, you need to use "|" not
> >> "||" (and you also need an "!= 0" in CHECK_BAD_MT_XWR).  As you wrote
> >> it, the compiler is most likely to generate exactly the same code that I
> >> suggested.
> > OK. I can add that :) I still prefer to have ifdefs outside the
> > function, not inside.
> > 
> >>
> >> But I think what you _really_ want is not avoiding conditional branches.
> > The idea is that it is hard for branch prediction to predict the correct
> > result when that result depends on the guest's page tables, which can
> > contain anything, so in some places the shadow paging code uses boolean
> > logic to avoid branches. In this case it is hard to avoid the if() anyway,
> > since the function invocation is in the if().
> 
> Yes, I get the idea, but is_rsvd_bits_set should be predicted unlikely,
> no?  If the guest has to run, it must use mostly valid ptes. :)
> 
You see, you are confused and you want branch prediction not to be? :)
If your guest is KVM, is_rsvd_bits_set() will be likely much more than
unlikely, because KVM misconfigures EPT entries to cache MMIO addresses,
so all the "unlikely" cases will be fixed by shadow pages and will not
reappear (until shadow pages are zapped), but misconfigured entries will
continue to produce violations.

> Especially if you change prefetch_invalid_gpte to do the reserved bits
> test after the present test (so that is_rsvd_bits_set is only called on
> present pagetables), is_rsvd_bits_set's result should be really
> well-predicted. 
Nope, for ept page tables present is not a single bit, it is three bits
which by themselves can have invalid values. For regular paging you are
probably right.

>                   At this point (and especially since function invocation
> is always in "if"s), using boolean logic to avoid branches does not make
> much sense anymore for this function.
That's true.

> 
> The compiler may still use setCC and boolean logic, even if you write it
> as "if"s; and in simple cases like this, after inlining and seeing an
> "if" the compiler may undo all your careful choice of "|" vs. "||".  So
> it's better anyway to ensure is_rsvd_bits_set is called only when it's
> unlikely.
> 
> Paolo

--
			Gleb.
Paolo Bonzini July 29, 2013, 1:11 p.m. UTC | #7
On 29/07/2013 14:34, Gleb Natapov wrote:
>>>> But I think what you _really_ want is not avoiding conditional branches.
>>> The idea is that it is hard for branch prediction to predict the correct
>>> result when that result depends on the guest's page tables, which can
>>> contain anything, so in some places the shadow paging code uses boolean
>>> logic to avoid branches. In this case it is hard to avoid the if() anyway,
>>> since the function invocation is in the if().
>>
>> Yes, I get the idea, but is_rsvd_bits_set should be predicted unlikely,
>> no?  If the guest has to run, it must use mostly valid ptes. :)
>>
> You see, you are confused and you want branch prediction not to be? :)
> If your guest is KVM, is_rsvd_bits_set() will be likely much more than
> unlikely, because KVM misconfigures EPT entries to cache MMIO addresses,
> so all the "unlikely" cases will be fixed by shadow pages and will not
> reappear (until shadow pages are zapped), but misconfigured entries will
> continue to produce violations.

But then:

1) MMIO is a slow path anyway, losing 10 cycles on a mispredicted branch
is not going to help much.  Fast page faults are all I would optimize for.

2) in cases like this you just do not use likely/unlikely; the branch
will be very unlikely in the beginning, and very likely once shadow
pages are filled or in the no-EPT case.  Just let the branch predictor
adjust, it will probably do better than boolean tricks.

>> Especially if you change prefetch_invalid_gpte to do the reserved bits
>> test after the present test (so that is_rsvd_bits_set is only called on
>> present pagetables), is_rsvd_bits_set's result should be really
>> well-predicted. 
> Nope, for ept page tables present is not a single bit, it is three bits
> which by themselves can have invalid values.

We're not checking the validity of the bits in the is_present_gpte test,
we're checking it in the is_rsvd_bits_set test (is_present_gpte is doing
just "(pte & 7) != 0").  It doesn't change anything in the outcome of
prefetch_invalid_gpte, and it makes the ordering consistent with
walk_addr_generic which already tests presence before reserved bits.

So doing this swap should be a win anyway.
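
For reference, the patch's EPT-aware present test boils down to the
three XWR bits; a sketch of its shape (see the paging_tmpl.h hunk below):

	static inline int FNAME(is_present_gpte)(unsigned long pte)
	{
#if PTTYPE != PTTYPE_EPT
		return is_present_gpte(pte);
#else
		/* any of X/W/R set counts as present */
		return pte & 7;
#endif
	}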

>>                   At this point (and especially since function invocation
>> is always in "if"s), using boolean logic to avoid branches does not make
>> much sense anymore for this function.
> 
> That's true.

So are you going to change to "if"s?

Paolo

Gleb Natapov July 29, 2013, 1:20 p.m. UTC | #8
On Mon, Jul 29, 2013 at 03:11:39PM +0200, Paolo Bonzini wrote:
> On 29/07/2013 14:34, Gleb Natapov wrote:
> >>>> But I think what you _really_ want is not avoiding conditional branches.
> >>> The idea is that it is hard for branch prediction to predict the correct
> >>> result when that result depends on the guest's page tables, which can
> >>> contain anything, so in some places the shadow paging code uses boolean
> >>> logic to avoid branches. In this case it is hard to avoid the if() anyway,
> >>> since the function invocation is in the if().
> >>
> >> Yes, I get the idea, but is_rsvd_bits_set should be predicted unlikely,
> >> no?  If the guest has to run, it must use mostly valid ptes. :)
> >>
> > You see, you are confused and you want branch prediction not to be? :)
> > If your guest is KVM, is_rsvd_bits_set() will be likely much more than
> > unlikely, because KVM misconfigures EPT entries to cache MMIO addresses,
> > so all the "unlikely" cases will be fixed by shadow pages and will not
> > reappear (until shadow pages are zapped), but misconfigured entries will
> > continue to produce violations.
> 
> But then:
> 
> 1) MMIO is a slow path anyway, losing 10 cycles on a mispredicted branch
> is not going to help much.  Fast page faults are all I would optimize for.
> 
Of course, for that the check should be fast.

> 2) in cases like this you just do not use likely/unlikely; the branch
> will be very unlikely in the beginning, and very likely once shadow
> pages are filled or in the no-EPT case.  Just let the branch predictor
> adjust, it will probably do better than boolean tricks.
> 
likely/unlikely are usually useless anyway. If you can avoid if()
altogether this is a win since there is no branch to predict.

> >> Especially if you change prefetch_invalid_gpte to do the reserved bits
> >> test after the present test (so that is_rsvd_bits_set is only called on
> >> present pagetables), is_rsvd_bits_set's result should be really
> >> well-predicted. 
> > Nope, for ept page tables present is not a single bit, it is three bits
> > which by themselves can have invalid values.
> 
> We're not checking the validity of the bits in the is_present_gpte test,
> we're checking it in the is_rsvd_bits_set test (is_present_gpte is doing
> just "(pte & 7) != 0").  It doesn't change anything in the outcome of
> prefetch_invalid_gpte, and it makes the ordering consistent with
> walk_addr_generic which already tests presence before reserved bits.
> 
> So doing this swap should be a win anyway.
> 
> >>                   At this point (and especially since function invocation
> >> is always in "if"s), using boolean logic to avoid branches does not make
> >> much sense anymore for this function.
> > 
> > That's true.
> 
> So are you going to change to "if"s?
> 
I think it will be better just to check mmu->bad_mt_xwr always. (I
dislike ifdefs if you haven't noticed :)).
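
That would drop the ifdefs entirely, along the lines of this sketch
(bad_mt_xwr is initialized to zero for the non-EPT MMUs in
reset_rsvds_bits_mask, so the extra test is a no-op there):

	static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
	{
		int bit7 = (gpte >> 7) & 1, low6 = gpte & 0x3f;

		return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
			((mmu->bad_mt_xwr & (1ull << low6)) != 0);
	}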

--
			Gleb.
Paolo Bonzini July 29, 2013, 2:12 p.m. UTC | #9
On 29/07/2013 15:20, Gleb Natapov wrote:
>> 2) in cases like this you just do not use likely/unlikely; the branch
>> will be very unlikely in the beginning, and very likely once shadow
>> pages are filled or in the no-EPT case.  Just let the branch predictor
>> adjust, it will probably do better than boolean tricks.
>>
> likely/unlikely are usually useless anyway. If you can avoid if()
> altogether this is a win since there is no branch to predict.

However, if the branches are dynamically well-predicted,

   if (simple)
       ...
   if (complex)
       ...

is likely faster than

   if (simple | complex)

because the branches then are very very cheap, and it pays off to not
always evaluate the complex branch.

In this case, the reserved bit test is the relatively complex one, it
has a couple memory accesses and a longish chain of dependencies.

>>>> Especially if you change prefetch_invalid_gpte to do the reserved bits
>>>> test after the present test (so that is_rsvd_bits_set is only called on
>>>> present pagetables), is_rsvd_bits_set's result should be really
>>>> well-predicted. 
>>> Nope, for ept page tables present is not a single bit, it is three bits
>>> which by themselves can have invalid values.
>>
>> We're not checking the validity of the bits in the is_present_gpte test,
>> we're checking it in the is_rsvd_bits_set test (is_present_gpte is doing
>> just "(pte & 7) != 0").  It doesn't change anything in the outcome of
>> prefetch_invalid_gpte, and it makes the ordering consistent with
>> walk_addr_generic which already tests presence before reserved bits.
>>
>> So doing this swap should be a win anyway.
>>
>>>>                   At this point (and especially since function invocation
>>>> is always in "if"s), using boolean logic to avoid branches does not make
>>>> much sense anymore for this function.
>>>
>>> That's true.
>>
>> So are you going to change to "if"s?
>>
> I think it will be better just to check mmu->bad_mt_xwr always. (I
> dislike ifdefs if you haven't noticed :)).

Yeah, I also thought of always checking bad_mt_xwr and even using it to
subsume the present check too, i.e. turning it into
is_rsvd_bits_set_or_nonpresent.  It checks the same bits that are used
in the present check (well, a superset).  You can then check for
presence separately if you care, which you don't in
prefetch_invalid_gpte.  It requires small changes in the callers but
nothing major.

But it still seems to me that we're in the above "if (simple ||
complex)" case and having a separate "if (!present)" check will be faster.

Paolo
Gleb Natapov July 29, 2013, 4:24 p.m. UTC | #10
On Mon, Jul 29, 2013 at 04:12:33PM +0200, Paolo Bonzini wrote:
> On 29/07/2013 15:20, Gleb Natapov wrote:
> >> 2) in cases like this you just do not use likely/unlikely; the branch
> >> will be very unlikely in the beginning, and very likely once shadow
> >> pages are filled or in the no-EPT case.  Just let the branch predictor
> >> adjust, it will probably do better than boolean tricks.
> >>
> > likely/unlikely are usually useless anyway. If you can avoid if()
> > altogether this is a win since there is no branch to predict.
> 
> However, if the branches are dynamically well-predicted,
> 
>    if (simple)
>        ...
>    if (complex)
>        ...
> 
> is likely faster than
> 
>    if (simple | complex)
> 
> because the branches then are very very cheap, and it pays off to not
> always evaluate the complex branch.
> 
Good point about "|" always evaluating both. Is this the case
with if (simple != 0 | complex != 0) too, where theoretically the compiler
may see that if simple != 0 is true there is no need to evaluate the second one?
 

> In this case, the reserved bit test is the relatively complex one, it
> has a couple memory accesses and a longish chain of dependencies.
> 
> >>>> Especially if you change prefetch_invalid_gpte to do the reserved bits
> >>>> test after the present test (so that is_rsvd_bits_set is only called on
> >>>> present pagetables), is_rsvd_bits_set's result should be really
> >>>> well-predicted. 
> >>> Nope, for ept page tables present is not a single bit, it is three bits
> >>> which by themselves can have invalid values.
> >>
> >> We're not checking the validity of the bits in the is_present_gpte test,
> >> we're checking it in the is_rsvd_bits_set test (is_present_gpte is doing
> >> just "(pte & 7) != 0").  It doesn't change anything in the outcome of
> >> prefetch_invalid_gpte, and it makes the ordering consistent with
> >> walk_addr_generic which already tests presence before reserved bits.
> >>
> >> So doing this swap should be a win anyway.
> >>
> >>>>                   At this point (and especially since function invocation
> >>>> is always in "if"s), using boolean logic to avoid branches does not make
> >>>> much sense anymore for this function.
> >>>
> >>> That's true.
> >>
> >> So are you going to change to "if"s?
> >>
> > I think it will be better just to check mmu->bad_mt_xwr always. (I
> > dislike ifdefs if you haven't noticed :)).
> 
> Yeah, I also thought of always checking bad_mt_xwr and even using it to
> subsume the present check too, i.e. turning it into
> is_rsvd_bits_set_or_nonpresent.  It checks the same bits that are used
> in the present check (well, a superset).  You can then check for
> presence separately if you care, which you don't in
> prefetch_invalid_gpte.  It requires small changes in the callers but
> nothing major.
I do not get what is_rsvd_bits_set_or_nonpresent() will check exactly,
or why we would need it: there are two places where we check
present/reserved, and in one of them we need to know which one it is.

Anyway, the order of checks in prefetch_invalid_gpte() is not relevant to
this patchset, so let's leave it to a separate discussion.

> 
> But it still seems to me that we're in the above "if (simple ||
> complex)" case and having a separate "if (!present)" check will be faster.
> 
> Paolo

--
			Gleb.
Paolo Bonzini July 29, 2013, 4:36 p.m. UTC | #11
On 29/07/2013 18:24, Gleb Natapov wrote:
> On Mon, Jul 29, 2013 at 04:12:33PM +0200, Paolo Bonzini wrote:
>> On 29/07/2013 15:20, Gleb Natapov wrote:
>>>> 2) in cases like this you just do not use likely/unlikely; the branch
>>>> will be very unlikely in the beginning, and very likely once shadow
>>>> pages are filled or in the no-EPT case.  Just let the branch predictor
>>>> adjust, it will probably do better than boolean tricks.
>>>>
>>> likely/unlikely are usually useless anyway. If you can avoid if()
>>> altogether this is a win since there is no branch to predict.
>>
>> However, if the branches are dynamically well-predicted,
>>
>>    if (simple)
>>        ...
>>    if (complex)
>>        ...
>>
>> is likely faster than
>>
>>    if (simple | complex)
>>
>> because the branches then are very very cheap, and it pays off to not
>> always evaluate the complex branch.
>
> Good point about "|" always evaluating both. Is this the case
> with if (simple != 0 | complex != 0) too, where theoretically the compiler
> may see that if simple != 0 is true there is no need to evaluate the second one?

Yes (only if complex doesn't have any side effects, which is the case here).

>> Yeah, I also thought of always checking bad_mt_xwr and even using it to
>> subsume the present check too, i.e. turning it into
>> is_rsvd_bits_set_or_nonpresent.  It checks the same bits that are used
>> in the present check (well, a superset).  You can then check for
>> presence separately if you care, which you don't in
>> prefetch_invalid_gpte.  It requires small changes in the callers but
>> nothing major.
> 
> I do not get what is_rsvd_bits_set_or_nonpresent() will check exactly,
> or why we would need it: there are two places where we check
> present/reserved, and in one of them we need to know which one it is.

You can OR bad_mt_xwr with 0x5555555555555555ULL (I think).  Then your
implementation of is_rsvd_bits_set() using bad_mt_xwr will return true
in all cases where the pte is non-present.  You can then call
is_present_pte to discriminate the two cases.

    if (is_rsvd_bits_set_or_nonpresent) {
        if (!present)
            ...
        else
            ...
    }

In more abstract terms this is:

    if (simple)
        ...
    if (complex)
        ...

to

    if (simple_or_complex) {
        if (simple)
            ...
        else
            ...
    }

This can actually make sense if simple is almost always false, because
then you save something from not evaluating it on the fast path.

But in this case, adding bad_mt_xwr to the non-EPT case is a small loss.
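
Concretely, the subsumed check might look like the sketch below. The
exact constant to OR in was debated above; assuming the bitmap is
indexed by the PTE's low six bits, non-present means xwr == 0, i.e.
every eighth index:

	static bool FNAME(is_rsvd_bits_set_or_nonpresent)(struct kvm_mmu *mmu,
							  u64 gpte, int level)
	{
		int bit7 = (gpte >> 7) & 1;
		/* hypothetical: also mark every xwr == 0 index as bad */
		u64 bad = mmu->bad_mt_xwr | 0x0101010101010101ull;

		return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
			((bad & (1ull << (gpte & 0x3f))) != 0);
	}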

> Anyway, the order of checks in prefetch_invalid_gpte() is not relevant to
> this patchset, so let's leave it to a separate discussion.

Yes.

Paolo

>>
>> But it still seems to me that we're in the above "if (simple ||
>> complex)" case and having a separate "if (!present)" check will be faster.
>>
>> Paolo
> 
> --
> 			Gleb.
> 

Gleb Natapov July 29, 2013, 4:54 p.m. UTC | #12
On Mon, Jul 29, 2013 at 06:36:12PM +0200, Paolo Bonzini wrote:
> On 29/07/2013 18:24, Gleb Natapov wrote:
> > On Mon, Jul 29, 2013 at 04:12:33PM +0200, Paolo Bonzini wrote:
> >> On 29/07/2013 15:20, Gleb Natapov wrote:
> >>>> 2) in cases like this you just do not use likely/unlikely; the branch
> >>>> will be very unlikely in the beginning, and very likely once shadow
> >>>> pages are filled or in the no-EPT case.  Just let the branch predictor
> >>>> adjust, it will probably do better than boolean tricks.
> >>>>
> >>> likely/unlikely are usually useless anyway. If you can avoid if()
> >>> altogether this is a win since there is no branch to predict.
> >>
> >> However, if the branches are dynamically well-predicted,
> >>
> >>    if (simple)
> >>        ...
> >>    if (complex)
> >>        ...
> >>
> >> is likely faster than
> >>
> >>    if (simple | complex)
> >>
> >> because the branches then are very very cheap, and it pays off to not
> >> always evaluate the complex branch.
> >
> > Good point about "|" always evaluating both. Is this the case
> > with if (simple != 0 | complex != 0) too, where theoretically the compiler
> > may see that if simple != 0 is true there is no need to evaluate the second one?
> 
> Yes (only if complex doesn't have any side effects, which is the case here).
> 
> >> Yeah, I also thought of always checking bad_mt_xwr and even using it to
> >> subsume the present check too, i.e. turning it into
> >> is_rsvd_bits_set_or_nonpresent.  It checks the same bits that are used
> >> in the present check (well, a superset).  You can then check for
> >> presence separately if you care, which you don't in
> >> prefetch_invalid_gpte.  It requires small changes in the callers but
> >> nothing major.
> > 
> > I do not get what is_rsvd_bits_set_or_nonpresent() will check exactly,
> > or why we would need it: there are two places where we check
> > present/reserved, and in one of them we need to know which one it is.
> 
> You can OR bad_mt_xwr with 0x5555555555555555ULL (I think).  Then your
With 0x1010101.

> implementation of is_rsvd_bits_set() using bad_mt_xwr will return true
> in all cases where the pte is non-present.  You can then call
> is_present_pte to discriminate the two cases.
> 
>     if (is_rsvd_bits_set_or_nonpresent) {
>         if (!present)
>             ...
>         else
>             ...
>     }
> 
> In more abstract terms this is:
> 
>     if (simple)
>         ...
>     if (complex)
>         ...
> 
> to
> 
>     if (simple_or_complex) {
>         if (simple)
>             ...
>         else
>             ...
>     }
> 
> This can actually make sense if simple is almost always false, because
> then you save something from not evaluating it on the fast path.
> 
> But in this case, adding bad_mt_xwr to the non-EPT case is a small loss.
> 
> > Anyway, the order of checks in prefetch_invalid_gpte() is not relevant to
> > this patchset, so let's leave it to a separate discussion.
> 
> Yes.
> 
> Paolo
> 
> >>
> >> But it still seems to me that we're in the above "if (simple ||
> >> complex)" case and having a separate "if (!present)" check will be faster.
> >>
> >> Paolo
> > 
> > --
> > 			Gleb.
> > 

--
			Gleb.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 531f47c..58a17c0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -286,6 +286,7 @@  struct kvm_mmu {
 	u64 *pae_root;
 	u64 *lm_root;
 	u64 rsvd_bits_mask[2][4];
+	u64 bad_mt_xwr;
 
 	/*
 	 * Bitmap: bit set = last pte in walk
@@ -512,6 +513,9 @@  struct kvm_vcpu_arch {
 	 * instruction.
 	 */
 	bool write_fault_to_shadow_pgtable;
+
+	/* set at EPT violation at this point */
+	unsigned long exit_qualification;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3df3ac3..58ae9db 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3521,6 +3521,8 @@  static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 	int maxphyaddr = cpuid_maxphyaddr(vcpu);
 	u64 exb_bit_rsvd = 0;
 
+	context->bad_mt_xwr = 0;
+
 	if (!context->nx)
 		exb_bit_rsvd = rsvd_bits(63, 63);
 	switch (context->root_level) {
@@ -3576,6 +3578,38 @@  static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 	}
 }
 
+static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
+		struct kvm_mmu *context, bool execonly)
+{
+	int maxphyaddr = cpuid_maxphyaddr(vcpu);
+	int pte;
+
+	context->rsvd_bits_mask[0][3] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
+	context->rsvd_bits_mask[0][2] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+	context->rsvd_bits_mask[0][1] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+	context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+
+	/* large page */
+	context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
+	context->rsvd_bits_mask[1][2] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
+	context->rsvd_bits_mask[1][1] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
+	context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+	
+	for (pte = 0; pte < 64; pte++) {
+		int rwx_bits = pte & 7;
+		int mt = pte >> 3;
+		if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
+				rwx_bits == 0x2 || rwx_bits == 0x6 ||
+				(rwx_bits == 0x4 && !execonly))
+			context->bad_mt_xwr |= (1ull << pte);
+	}
+}
+
 static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 {
 	unsigned bit, byte, pfec;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 23a19a5..58d2f87 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -121,14 +121,23 @@  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 #endif
 }
 
+#if PTTYPE == PTTYPE_EPT
+#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
+#else
+#define CHECK_BAD_MT_XWR(G) 0;
+#endif
+
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 {
 	int bit7;
 
 	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
+		CHECK_BAD_MT_XWR(gpte);
 }
 
+#undef CHECK_BAD_MT_XWR
+
 static inline int FNAME(is_present_gpte)(unsigned long pte)
 {
 #if PTTYPE != PTTYPE_EPT
@@ -376,6 +385,25 @@  error:
 	walker->fault.vector = PF_VECTOR;
 	walker->fault.error_code_valid = true;
 	walker->fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+	/*
+	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
+	 * misconfiguration requires to be injected. The detection is
+	 * done by is_rsvd_bits_set() above.
+	 *
+	 * We set up the value of exit_qualification to inject:
+	 * [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
+	 * [5:3] - Calculated by the page walk of the guest EPT page tables
+	 * [7:8] - Clear to 0.
+	 *
+	 * The other bits are set to 0.
+	 */
+	if (!(errcode & PFERR_RSVD_MASK)) {
+		vcpu->arch.exit_qualification &= 0x7;
+		vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3;
+	}
+#endif
 	walker->fault.address = addr;
 	walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fc24370..bbfff8d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5321,6 +5321,8 @@  static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	/* ept page table is present? */
 	error_code |= (exit_qualification >> 3) & 0x1;
 
+	vcpu->arch.exit_qualification = exit_qualification;
+
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -7416,6 +7418,21 @@  static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+		struct x86_exception *fault)
+{
+	struct vmcs12 *vmcs12;
+	nested_vmx_vmexit(vcpu);
+	vmcs12 = get_vmcs12(vcpu);
+
+	if (fault->error_code & PFERR_RSVD_MASK)
+		vmcs12->vm_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+	else
+		vmcs12->vm_exit_reason = EXIT_REASON_EPT_VIOLATION;
+	vmcs12->exit_qualification = vcpu->arch.exit_qualification;
+	vmcs12->guest_physical_address = fault->address;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it