
x86/svm: retry after unhandled NPT fault if gfn was marked for recalculation

Message ID 1590097438-28829-1-git-send-email-igor.druzhinin@citrix.com (mailing list archive)
State Superseded
Series x86/svm: retry after unhandled NPT fault if gfn was marked for recalculation

Commit Message

Igor Druzhinin May 21, 2020, 9:43 p.m. UTC
If a recalculation NPT fault hasn't been handled explicitly in
hvm_hap_nested_page_fault() then it's potentially safe to retry -
the US bit has been re-instated in the PTE and any real fault would be
correctly re-raised next time.

This covers a specific case of migration with a vGPU assigned on AMD:
global log-dirty is enabled and causes an immediate recalculation NPT
fault in the MMIO area upon access. This type of fault isn't handled
explicitly in hvm_hap_nested_page_fault() (which isn't called on
EPT misconfig exits on Intel), which results in a domain crash.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---
 xen/arch/x86/hvm/svm/svm.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Igor Druzhinin May 22, 2020, 12:26 a.m. UTC | #1
On 21/05/2020 22:43, Igor Druzhinin wrote:
> If a recalculation NPT fault hasn't been handled explicitly in
> hvm_hap_nested_page_fault() then it's potentially safe to retry -
> US bit has been re-instated in PTE and any real fault would be correctly
> re-raised next time.
> 
> This covers a specific case of migration with vGPU assigned on AMD:
> global log-dirty is enabled and causes immediate recalculation NPT
> fault in MMIO area upon access. This type of fault isn't described
> explicitly in hvm_hap_nested_page_fault (this isn't called on
> EPT misconfig exit on Intel) which results in domain crash.
> 
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---

Alternatively, I can re-raise the fault immediately after recalculation is
done, which is less efficient (it will take one more VMEXIT) but safer IMO -
hvm_hap_nested_page_fault() might potentially leave the VM in an inconsistent
state in case of a real failure and cause a second page fault to conceal it.

Another alternative is to pass a fall_through bool into hvm_hap_nested_page_fault()
to tell it the expected behaviour in that case and avoid guessing in the SVM
code. I think that's an improvement over the suggestion in v1 and a candidate for v2.
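
For illustration, a rough sketch of the first alternative against the
VMEXIT_NPF handling in svm_vmexit_handler() - the variable names and the
assumption that p2m_pt_handle_deferred_changes() returns a positive value
when it actually adjusted an entry are mine, this is not the posted patch:

    case VMEXIT_NPF:
        /* EXITINFO1: fault error code; EXITINFO2: faulting guest physical address. */
        rc = vmcb->exitinfo1 & PFEC_page_present
             ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
        if ( rc > 0 )
            /* A recalc entry was fixed up - go back to the guest and retry. */
            break;
        if ( rc == 0 )
            svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
        else
        {
            printk(XENLOG_G_ERR
                   "%pv: error %d handling deferred recalc for gpa %#"PRIx64"\n",
                   v, rc, vmcb->exitinfo2);
            domain_crash(v->domain);
        }
        break;

With this shape svm_do_nested_pgfault() never sees the transient recalc
fault at all, which is what makes it safer at the cost of one extra round
trip through the guest.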

Igor
Andrew Cooper May 22, 2020, 9:45 a.m. UTC | #2
On 21/05/2020 22:43, Igor Druzhinin wrote:
> If a recalculation NPT fault hasn't been handled explicitly in
> hvm_hap_nested_page_fault() then it's potentially safe to retry -
> US bit has been re-instated in PTE and any real fault would be correctly
> re-raised next time.
>
> This covers a specific case of migration with vGPU assigned on AMD:
> global log-dirty is enabled and causes immediate recalculation NPT
> fault in MMIO area upon access. This type of fault isn't described
> explicitly in hvm_hap_nested_page_fault (this isn't called on
> EPT misconfig exit on Intel) which results in domain crash.
>
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 46a1aac..f0d0bd3 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>          /* inject #VMEXIT(NPF) into guest. */
>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>          return;
> +    case 0:
> +        /* If a recalculation page fault hasn't been handled - just retry. */
> +        if ( pfec & PFEC_user_mode )
> +            return;

This smells like it is a recipe for livelocks.

Everything should have been handled properly by the call to
p2m_pt_handle_deferred_changes() which precedes svm_do_nested_pgfault().

It is legitimate for the MMIO mapping to end up being transiently
recalculated, but the fact that p2m_pt_handle_deferred_changes() doesn't
fix it up suggests that the bug is there.

Do you have the complete NPT walk to the bad mapping? Do we have
_PAGE_USER in the leaf mapping, or is this perhaps a spurious fault?

~Andrew
Igor Druzhinin May 22, 2020, 10:05 a.m. UTC | #3
On 22/05/2020 10:45, Andrew Cooper wrote:
> On 21/05/2020 22:43, Igor Druzhinin wrote:
>> If a recalculation NPT fault hasn't been handled explicitly in
>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>> US bit has been re-instated in PTE and any real fault would be correctly
>> re-raised next time.
>>
>> This covers a specific case of migration with vGPU assigned on AMD:
>> global log-dirty is enabled and causes immediate recalculation NPT
>> fault in MMIO area upon access. This type of fault isn't described
>> explicitly in hvm_hap_nested_page_fault (this isn't called on
>> EPT misconfig exit on Intel) which results in domain crash.
>>
>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> ---
>>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>> index 46a1aac..f0d0bd3 100644
>> --- a/xen/arch/x86/hvm/svm/svm.c
>> +++ b/xen/arch/x86/hvm/svm/svm.c
>> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>>          /* inject #VMEXIT(NPF) into guest. */
>>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>>          return;
>> +    case 0:
>> +        /* If a recalculation page fault hasn't been handled - just retry. */
>> +        if ( pfec & PFEC_user_mode )
>> +            return;
> 
> This smells like it is a recipe for livelocks.
> 
> Everything should have been handled properly by the call to
> p2m_pt_handle_deferred_changes() which precedes svm_do_nested_pgfault().
> 
> It is legitimate for the MMIO mapping to end up being transiently
> recalculated, but the fact that p2m_pt_handle_deferred_changes() doesn't
> fix it up suggests that the bug is there.
> 
> Do you have the complete NPT walk to the bad mapping? Do we have
> _PAGE_USER in the leaf mapping, or is this perhaps a spurious fault?

It does fix it up. The problem is that currently in SVM we enter
svm_do_nested_pgfault immediately after p2m_pt_handle_deferred_changes
has finished.

Yes, we don't have _PAGE_USER initially and, yes, it's fixed up
correctly in p2m_pt_handle_deferred_changes, but svm_do_nested_pgfault
doesn't know about it.

Please see my second email about the alternatives that would resolve
the issue you're worried about.

Igor
Roger Pau Monné May 22, 2020, 10:08 a.m. UTC | #4
On Thu, May 21, 2020 at 10:43:58PM +0100, Igor Druzhinin wrote:
> If a recalculation NPT fault hasn't been handled explicitly in
> hvm_hap_nested_page_fault() then it's potentially safe to retry -
> US bit has been re-instated in PTE and any real fault would be correctly
> re-raised next time.
> 
> This covers a specific case of migration with vGPU assigned on AMD:
> global log-dirty is enabled and causes immediate recalculation NPT
> fault in MMIO area upon access. This type of fault isn't described
> explicitly in hvm_hap_nested_page_fault (this isn't called on
> EPT misconfig exit on Intel) which results in domain crash.

Couldn't direct MMIO regions be handled like other types of memory for
the purposes of logdirty mode?

I assume there's already a path here used for other memory types when
logdirty is turned on, and hence it would seem better to just make direct
MMIO regions also use that path?

> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 46a1aac..f0d0bd3 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>          /* inject #VMEXIT(NPF) into guest. */
>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>          return;
> +    case 0:
> +        /* If a recalculation page fault hasn't been handled - just retry. */
> +        if ( pfec & PFEC_user_mode )
> +            return;

I'm slightly worried that this diverges from the EPT implementation
now, in the sense that returning 0 from hvm_hap_nested_page_fault will
no longer trigger a guest crash.

Thanks, Roger.
Igor Druzhinin May 22, 2020, 10:14 a.m. UTC | #5
On 22/05/2020 11:08, Roger Pau Monné wrote:
> On Thu, May 21, 2020 at 10:43:58PM +0100, Igor Druzhinin wrote:
>> If a recalculation NPT fault hasn't been handled explicitly in
>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>> US bit has been re-instated in PTE and any real fault would be correctly
>> re-raised next time.
>>
>> This covers a specific case of migration with vGPU assigned on AMD:
>> global log-dirty is enabled and causes immediate recalculation NPT
>> fault in MMIO area upon access. This type of fault isn't described
>> explicitly in hvm_hap_nested_page_fault (this isn't called on
>> EPT misconfig exit on Intel) which results in domain crash.
> 
> Couldn't direct MMIO regions be handled like other types of memory for
> the purposes of logdiry mode?
> 
> I assume there's already a path here used for other memory types when
> logdirty is turned on, and hence would seem better to just make direct
> MMIO regions also use that path?

The problem with handling only the MMIO case is that the issue still stays.
It will be hit with some other memory type since it's not MMIO specific.
The issue is that if a global recalculation is triggered, the next hit to
this type will cause a transient fault which will not be handled
correctly by either of our handlers, even after the due fixup.

>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> ---
>>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>> index 46a1aac..f0d0bd3 100644
>> --- a/xen/arch/x86/hvm/svm/svm.c
>> +++ b/xen/arch/x86/hvm/svm/svm.c
>> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>>          /* inject #VMEXIT(NPF) into guest. */
>>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>>          return;
>> +    case 0:
>> +        /* If a recalculation page fault hasn't been handled - just retry. */
>> +        if ( pfec & PFEC_user_mode )
>> +            return;
> 
> I'm slightly worried that this diverges from the EPT implementation
> now, in the sense that returning 0 from hvm_hap_nested_page_fault will
> no longer trigger a guest crash.

My second alternative from my follow-up email addresses this. I also
didn't like this aspect.

Igor
Andrew Cooper May 22, 2020, 10:19 a.m. UTC | #6
On 22/05/2020 11:05, Igor Druzhinin wrote:
> On 22/05/2020 10:45, Andrew Cooper wrote:
>> On 21/05/2020 22:43, Igor Druzhinin wrote:
>>> If a recalculation NPT fault hasn't been handled explicitly in
>>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>>> US bit has been re-instated in PTE and any real fault would be correctly
>>> re-raised next time.
>>>
>>> This covers a specific case of migration with vGPU assigned on AMD:
>>> global log-dirty is enabled and causes immediate recalculation NPT
>>> fault in MMIO area upon access. This type of fault isn't described
>>> explicitly in hvm_hap_nested_page_fault (this isn't called on
>>> EPT misconfig exit on Intel) which results in domain crash.
>>>
>>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>> ---
>>>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>>> index 46a1aac..f0d0bd3 100644
>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>>>          /* inject #VMEXIT(NPF) into guest. */
>>>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>>>          return;
>>> +    case 0:
>>> +        /* If a recalculation page fault hasn't been handled - just retry. */
>>> +        if ( pfec & PFEC_user_mode )
>>> +            return;
>> This smells like it is a recipe for livelocks.
>>
>> Everything should have been handled properly by the call to
>> p2m_pt_handle_deferred_changes() which precedes svm_do_nested_pgfault().
>>
>> It is legitimate for the MMIO mapping to end up being transiently
>> recalculated, but the fact that p2m_pt_handle_deferred_changes() doesn't
>> fix it up suggests that the bug is there.
>>
>> Do you have the complete NPT walk to the bad mapping? Do we have
>> _PAGE_USER in the leaf mapping, or is this perhaps a spurious fault?
> It does fix it up. The problem is that currently in SVM we enter
> svm_do_nested_pgfault immediately after p2m_pt_handle_deferred_changes
> is finished finished.

Oh - so we do.  I'd read the entry condition for svm_do_nested_pgfault()
incorrectly.

Jan - why did you choose to do it this way?  If
p2m_pt_handle_deferred_changes() has made a modification, there is
surely nothing relevant to do in svm_do_nested_pgfault().

~Andrew
Roger Pau Monné May 22, 2020, 10:23 a.m. UTC | #7
On Fri, May 22, 2020 at 11:14:24AM +0100, Igor Druzhinin wrote:
> On 22/05/2020 11:08, Roger Pau Monné wrote:
> > On Thu, May 21, 2020 at 10:43:58PM +0100, Igor Druzhinin wrote:
> >> If a recalculation NPT fault hasn't been handled explicitly in
> >> hvm_hap_nested_page_fault() then it's potentially safe to retry -
> >> US bit has been re-instated in PTE and any real fault would be correctly
> >> re-raised next time.
> >>
> >> This covers a specific case of migration with vGPU assigned on AMD:
> >> global log-dirty is enabled and causes immediate recalculation NPT
> >> fault in MMIO area upon access. This type of fault isn't described
> >> explicitly in hvm_hap_nested_page_fault (this isn't called on
> >> EPT misconfig exit on Intel) which results in domain crash.
> > 
> > Couldn't direct MMIO regions be handled like other types of memory for
> > the purposes of logdiry mode?
> > 
> > I assume there's already a path here used for other memory types when
> > logdirty is turned on, and hence would seem better to just make direct
> > MMIO regions also use that path?
> 
> The proble of handling only MMIO case is that the issue still stays.
> It will be hit with some other memory type since it's not MMIO specific.
> The issue is that if global recalculation is called, the next hit to
> this type will cause a transient fault which will not be handled
> correctly after a due fixup by neither of our handlers.

I admit I should go look at the code, but for example RAM p2m types
don't require this fix, so I assume there's some different path taken
in that case that avoids all this?

Ie: when global logdirty is enabled you will start to get nested page
faults for every access, yet only direct MMIO types require this fix?

Thanks, Roger
Igor Druzhinin May 22, 2020, 10:25 a.m. UTC | #8
On 22/05/2020 11:19, Andrew Cooper wrote:
> On 22/05/2020 11:05, Igor Druzhinin wrote:
>> On 22/05/2020 10:45, Andrew Cooper wrote:
>>> On 21/05/2020 22:43, Igor Druzhinin wrote:
>>>> If a recalculation NPT fault hasn't been handled explicitly in
>>>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>>>> US bit has been re-instated in PTE and any real fault would be correctly
>>>> re-raised next time.
>>>>
>>>> This covers a specific case of migration with vGPU assigned on AMD:
>>>> global log-dirty is enabled and causes immediate recalculation NPT
>>>> fault in MMIO area upon access. This type of fault isn't described
>>>> explicitly in hvm_hap_nested_page_fault (this isn't called on
>>>> EPT misconfig exit on Intel) which results in domain crash.
>>>>
>>>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>> ---
>>>>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>>>> index 46a1aac..f0d0bd3 100644
>>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>>> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>>>>          /* inject #VMEXIT(NPF) into guest. */
>>>>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>>>>          return;
>>>> +    case 0:
>>>> +        /* If a recalculation page fault hasn't been handled - just retry. */
>>>> +        if ( pfec & PFEC_user_mode )
>>>> +            return;
>>> This smells like it is a recipe for livelocks.
>>>
>>> Everything should have been handled properly by the call to
>>> p2m_pt_handle_deferred_changes() which precedes svm_do_nested_pgfault().
>>>
>>> It is legitimate for the MMIO mapping to end up being transiently
>>> recalculated, but the fact that p2m_pt_handle_deferred_changes() doesn't
>>> fix it up suggests that the bug is there.
>>>
>>> Do you have the complete NPT walk to the bad mapping? Do we have
>>> _PAGE_USER in the leaf mapping, or is this perhaps a spurious fault?
>> It does fix it up. The problem is that currently in SVM we enter
>> svm_do_nested_pgfault immediately after p2m_pt_handle_deferred_changes
>> is finished finished.
> 
> Oh - so we do.  I'd read the entry condition for svm_do_nested_pgfault()
> incorrectly.
> 
> Jan - why did you chose to do it this way?  If
> p2m_pt_handle_deferred_changes() has made a modification, there is
> surely nothing relevant to do in svm_do_nested_pgfault().

In Jan's defense, that saves one additional VMEXIT in rare cases where
the fault had other implications (e.g. a write to an RO page in log-dirty
mode) in addition to the recalculation.

Igor
Igor Druzhinin May 22, 2020, 10:27 a.m. UTC | #9
On 22/05/2020 11:23, Roger Pau Monné wrote:
> On Fri, May 22, 2020 at 11:14:24AM +0100, Igor Druzhinin wrote:
>> On 22/05/2020 11:08, Roger Pau Monné wrote:
>>> On Thu, May 21, 2020 at 10:43:58PM +0100, Igor Druzhinin wrote:
>>>> If a recalculation NPT fault hasn't been handled explicitly in
>>>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>>>> US bit has been re-instated in PTE and any real fault would be correctly
>>>> re-raised next time.
>>>>
>>>> This covers a specific case of migration with vGPU assigned on AMD:
>>>> global log-dirty is enabled and causes immediate recalculation NPT
>>>> fault in MMIO area upon access. This type of fault isn't described
>>>> explicitly in hvm_hap_nested_page_fault (this isn't called on
>>>> EPT misconfig exit on Intel) which results in domain crash.
>>>
>>> Couldn't direct MMIO regions be handled like other types of memory for
>>> the purposes of logdiry mode?
>>>
>>> I assume there's already a path here used for other memory types when
>>> logdirty is turned on, and hence would seem better to just make direct
>>> MMIO regions also use that path?
>>
>> The proble of handling only MMIO case is that the issue still stays.
>> It will be hit with some other memory type since it's not MMIO specific.
>> The issue is that if global recalculation is called, the next hit to
>> this type will cause a transient fault which will not be handled
>> correctly after a due fixup by neither of our handlers.
> 
> I admit I should go look at the code, but for example RAM p2m types
> don't require this fix, so I assume there's some different path taken
> in that case that avoids all this?
> 
> Ie: when global logdirty is enabled you will start to get nested page
> faults for every access, yet only direct MMIO types require this fix?

It's not "only MMIO" - it's just that the MMIO area is what gets hit in my
particular case. I'd prefer this fix to address the general issue; otherwise,
for SVM we would have to write handlers in hvm_hap_nested_page_fault() for
every case as soon as we hit it.

Igor
Roger Pau Monné May 22, 2020, 11:11 a.m. UTC | #10
On Fri, May 22, 2020 at 11:27:38AM +0100, Igor Druzhinin wrote:
> On 22/05/2020 11:23, Roger Pau Monné wrote:
> > On Fri, May 22, 2020 at 11:14:24AM +0100, Igor Druzhinin wrote:
> >> On 22/05/2020 11:08, Roger Pau Monné wrote:
> >>> On Thu, May 21, 2020 at 10:43:58PM +0100, Igor Druzhinin wrote:
> >>>> If a recalculation NPT fault hasn't been handled explicitly in
> >>>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
> >>>> US bit has been re-instated in PTE and any real fault would be correctly
> >>>> re-raised next time.
> >>>>
> >>>> This covers a specific case of migration with vGPU assigned on AMD:
> >>>> global log-dirty is enabled and causes immediate recalculation NPT
> >>>> fault in MMIO area upon access. This type of fault isn't described
> >>>> explicitly in hvm_hap_nested_page_fault (this isn't called on
> >>>> EPT misconfig exit on Intel) which results in domain crash.
> >>>
> >>> Couldn't direct MMIO regions be handled like other types of memory for
> >>> the purposes of logdiry mode?
> >>>
> >>> I assume there's already a path here used for other memory types when
> >>> logdirty is turned on, and hence would seem better to just make direct
> >>> MMIO regions also use that path?
> >>
> >> The proble of handling only MMIO case is that the issue still stays.
> >> It will be hit with some other memory type since it's not MMIO specific.
> >> The issue is that if global recalculation is called, the next hit to
> >> this type will cause a transient fault which will not be handled
> >> correctly after a due fixup by neither of our handlers.
> > 
> > I admit I should go look at the code, but for example RAM p2m types
> > don't require this fix, so I assume there's some different path taken
> > in that case that avoids all this?
> > 
> > Ie: when global logdirty is enabled you will start to get nested page
> > faults for every access, yet only direct MMIO types require this fix?
> 
> It's not "only MMIO" - it's just MMIO area is hit in my particular case.
> I'd prefer this fix to address the general issue otherwise for SVM
> we would have to write handlers in hvm_hap_nested_page_fault() for
> every case as soon as we hit it.

Hm, I'm not sure I agree. p2m memory types are limited, and IMO we
want to have strict control over how they are handled.
hvm_hap_nested_page_fault is already full of special casing for each
memory type for that reason.

That being said, I also don't like the fact that logdirty is handled
differently between EPT and NPT, as on EPT it's handled as a
misconfig while on NPT it's handled as a violation.

Thanks, Roger.
Jan Beulich May 22, 2020, 1:04 p.m. UTC | #11
On 22.05.2020 13:11, Roger Pau Monné wrote:
> That being said, I also don't like the fact that logdity is handled
> differently between EPT and NPT, as on EPT it's handled as a
> misconfig while on NPT it's handled as a violation.

Because, well, there is no concept of misconfig in NPT.

Jan
Andrew Cooper May 22, 2020, 1:11 p.m. UTC | #12
On 22/05/2020 14:04, Jan Beulich wrote:
> On 22.05.2020 13:11, Roger Pau Monné wrote:
>> That being said, I also don't like the fact that logdity is handled
>> differently between EPT and NPT, as on EPT it's handled as a
>> misconfig while on NPT it's handled as a violation.
> Because, well, there is no concept of misconfig in NPT.

Indeed.  Intel chose to split EPT errors into two - MISCONFIG for
structural errors (not present, or reserved bits set) and VIOLATION for
permission errors.

AMD reused the same silicon pagewalker design, so they have a single
NPT_FAULT vmexit which behaves much more like a regular pagefault,
encoding structural vs permission errors in the error code.

~Andrew
Roger Pau Monné May 22, 2020, 1:32 p.m. UTC | #13
On Fri, May 22, 2020 at 02:11:15PM +0100, Andrew Cooper wrote:
> On 22/05/2020 14:04, Jan Beulich wrote:
> > On 22.05.2020 13:11, Roger Pau Monné wrote:
> >> That being said, I also don't like the fact that logdity is handled
> >> differently between EPT and NPT, as on EPT it's handled as a
> >> misconfig while on NPT it's handled as a violation.
> > Because, well, there is no concept of misconfig in NPT.
> 
> Indeed.  Intel chose to split EPT errors into two - MISCONFIG for
> structural errors (not present, or reserved bits set) and VIOLATION for
> permissions errors.
> 
> AMD reused the same silicon pagewalker design, so have a single
> NPT_FAULT vmexit which behaves much more like a regular pagefault,
> encoding structural vs permission errors in the error code.

Maybe I should clarify: I understand that NPT doesn't have such a
differentiation for nested page table faults vs EPT, but I feel it
would be clearer if part of the code could be shared, i.e. unify EPT's
resolve_misconfig and NPT's do_recalc into a single function, for
example, that uses the necessary p2m-> helpers for the differing
implementations. I think we should be able to tell whether an NPT
page fault is a recalc one by looking at the bits in the EXITINFO1
error field?

Anyway, this was just a rant, and it's tangential to the issue at
hand - sorry for the distraction.

Roger.
Jan Beulich May 22, 2020, 1:34 p.m. UTC | #14
On 22.05.2020 12:19, Andrew Cooper wrote:
> On 22/05/2020 11:05, Igor Druzhinin wrote:
>> On 22/05/2020 10:45, Andrew Cooper wrote:
>>> On 21/05/2020 22:43, Igor Druzhinin wrote:
>>>> If a recalculation NPT fault hasn't been handled explicitly in
>>>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>>>> US bit has been re-instated in PTE and any real fault would be correctly
>>>> re-raised next time.
>>>>
>>>> This covers a specific case of migration with vGPU assigned on AMD:
>>>> global log-dirty is enabled and causes immediate recalculation NPT
>>>> fault in MMIO area upon access. This type of fault isn't described
>>>> explicitly in hvm_hap_nested_page_fault (this isn't called on
>>>> EPT misconfig exit on Intel) which results in domain crash.
>>>>
>>>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>> ---
>>>>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>>>> index 46a1aac..f0d0bd3 100644
>>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>>> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>>>>          /* inject #VMEXIT(NPF) into guest. */
>>>>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>>>>          return;
>>>> +    case 0:
>>>> +        /* If a recalculation page fault hasn't been handled - just retry. */
>>>> +        if ( pfec & PFEC_user_mode )
>>>> +            return;
>>> This smells like it is a recipe for livelocks.
>>>
>>> Everything should have been handled properly by the call to
>>> p2m_pt_handle_deferred_changes() which precedes svm_do_nested_pgfault().
>>>
>>> It is legitimate for the MMIO mapping to end up being transiently
>>> recalculated, but the fact that p2m_pt_handle_deferred_changes() doesn't
>>> fix it up suggests that the bug is there.
>>>
>>> Do you have the complete NPT walk to the bad mapping? Do we have
>>> _PAGE_USER in the leaf mapping, or is this perhaps a spurious fault?
>> It does fix it up. The problem is that currently in SVM we enter
>> svm_do_nested_pgfault immediately after p2m_pt_handle_deferred_changes
>> is finished finished.
> 
> Oh - so we do.  I'd read the entry condition for svm_do_nested_pgfault()
> incorrectly.
> 
> Jan - why did you chose to do it this way?  If
> p2m_pt_handle_deferred_changes() has made a modification, there is
> surely nothing relevant to do in svm_do_nested_pgfault().

Why? There can very well be multiple reasons for a single exit
to occur, and hence it did seem to make sense to allow processing
of such possible further reasons right away, instead of re-
entering the guest just to deal with another VM exit. What I can
accept is that the error code may not correctly represent the
"remaining" fault reason anymore at this point.

Jan
Andrew Cooper May 22, 2020, 3:53 p.m. UTC | #15
On 22/05/2020 14:32, Roger Pau Monné wrote:
> On Fri, May 22, 2020 at 02:11:15PM +0100, Andrew Cooper wrote:
>> On 22/05/2020 14:04, Jan Beulich wrote:
>>> On 22.05.2020 13:11, Roger Pau Monné wrote:
>>>> That being said, I also don't like the fact that logdity is handled
>>>> differently between EPT and NPT, as on EPT it's handled as a
>>>> misconfig while on NPT it's handled as a violation.
>>> Because, well, there is no concept of misconfig in NPT.
>> Indeed.  Intel chose to split EPT errors into two - MISCONFIG for
>> structural errors (not present, or reserved bits set) and VIOLATION for
>> permissions errors.
>>
>> AMD reused the same silicon pagewalker design, so have a single
>> NPT_FAULT vmexit which behaves much more like a regular pagefault,
>> encoding structural vs permission errors in the error code.
> Maybe I should clarify, I understand that NPT doesn't have such
> differentiation regarding nested page table faults vs EPT, but I feel
> like it would be clearer if part of the code could be shared, ie:
> unify EPT resolve_misconfig and NPT do_recalc into a single function
> for example that uses the necessary p2m-> helpers for the differing
> implementations. I think we should be able to tell apart when a NPT
> page fault is a recalc one by looking at the bits in the EXITINFO1
> error field?

But they use fundamentally different mechanisms.

EPT uses an invalid caching type, while NPT sets the User bit (and even
this is going to have to change when we want to support GMET for Windows
VBS in the long term).

You could abstract a few things out into common logic, but none of the
bit positions match (not even the recalc software bit), and the result
would be more complicated than its current form.
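
A purely hypothetical sketch of what a shared "does this entry need a
recalc?" predicate would have to hide - these helpers don't exist in the
tree and the exact encodings are quoted from memory, not checked against
the code:

    /*
     * Illustration only: EPT marks a to-be-recalculated entry inside the
     * EPT entry itself (together with an otherwise-invalid memory type),
     * while NPT marks it by withdrawing the User/Supervisor permission so
     * that the next guest access faults.  A common wrapper would still
     * dispatch to per-implementation checks like these.
     */
    static inline bool ept_entry_needs_recalc(const ept_entry_t *e)
    {
        return e->recalc;
    }

    static inline bool npt_entry_needs_recalc(intpte_t pte)
    {
        return !(pte & _PAGE_USER);
    }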

~Andrew

Patch

diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 46a1aac..f0d0bd3 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1726,6 +1726,10 @@  static void svm_do_nested_pgfault(struct vcpu *v,
         /* inject #VMEXIT(NPF) into guest. */
         nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
         return;
+    case 0:
+        /* If a recalculation page fault hasn't been handled - just retry. */
+        if ( pfec & PFEC_user_mode )
+            return;
     }
 
     /* Everything else is an error. */