x86/nested-hap: Fix handling of L0_ERROR
diff mbox series

Message ID 20191118181509.10981-1-andrew.cooper3@citrix.com
State New
Headers show
Series
  • x86/nested-hap: Fix handling of L0_ERROR
Related show

Commit Message

Andrew Cooper Nov. 18, 2019, 6:15 p.m. UTC
When nestedhvm_hap_nested_page_fault() returns L0_ERROR,
hvm_hap_nested_page_fault() operates on the adjusted gpa.  However, it
operates with the original npfec, which is no longer be correct.

In particular, it is possible to get a nested fault where the translation is
not present in L12 (and therefore L02), while it is present in L01.

When handling an L0_ERROR, adjust npfec as well as gpa.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Juergen Gross <jgross@suse.com>

This is one part of fixing nested virt in light of XSA-304.  Combined with:

Comments

Jürgen Groß Nov. 19, 2019, 5:20 a.m. UTC | #1
On 18.11.19 19:15, Andrew Cooper wrote:
> When nestedhvm_hap_nested_page_fault() returns L0_ERROR,
> hvm_hap_nested_page_fault() operates on the adjusted gpa.  However, it
> operates with the original npfec, which is no longer be correct.
> 
> In particular, it is possible to get a nested fault where the translation is
> not present in L12 (and therefore L02), while it is present in L01.
> 
> When handling an L0_ERROR, adjust npfec as well as gpa.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Release-acked-by: Juergen Gross <jgross@suse.com>


Juergen
Jan Beulich Nov. 19, 2019, 11:13 a.m. UTC | #2
On 18.11.2019 19:15, Andrew Cooper wrote:
> When nestedhvm_hap_nested_page_fault() returns L0_ERROR,
> hvm_hap_nested_page_fault() operates on the adjusted gpa.  However, it
> operates with the original npfec, which is no longer be correct.

Nit: Perhaps "may" instead of "is"?

> In particular, it is possible to get a nested fault where the translation is
> not present in L12 (and therefore L02), while it is present in L01.

I'm afraid I don't see the connection to the issue at hand, where
we have a page present in both L01 and L12, just not in L02. And
there's also no L0_ERROR here - both the initial (propagation) and
the subsequent (live-locking) exits report DONE according to what
I thought was the outcome of yesterday's discussion on irc.

I take it you imply that L0_ERROR would need raising (as per the
auxiliary code fragment adding the "(access_x && *page_order)"
check), but I wonder whether that would really be correct. This
depends on what L0_ERROR really is supposed to mean: An error
because of actual L0 settings (x=0 in our case), or an error
because of intended L0 settings (x=1 in our case). After all a
violation of just the p2m_access (which also affects r/w/x)
doesn't get reported by nestedhap_walk_L0_p2m() as L0_ERROR
either (and hence would, as it seems to me, lead to a similar
live lock).

Therefore I wonder whether your initial idea of doing the
shattering right here wouldn't be the better course of action.
nestedhap_fix_p2m() could either install the large page and then
shatter it right away, or it could install just the individual
small page. Together with the different npfec adjustment model
suggested below (leading to npfec.present to also get updated in
the DONE case) a similar "insn-fetch && present" conditional (to
that introduced for XSA-304) could then be used there.

Even better - by making the violation checking around the
original XSA-304 addition a function (together with the 304
addition), such a function might then be reusable here. This
might then address the p2m_access related live lock as well.

> When handling an L0_ERROR, adjust npfec as well as gpa.

The gpa adjustment referred to here is not in nestedhap_walk_L0_p2m()
but in nestedhvm_hap_nested_page_fault(), if I'm not mistaken?

> @@ -181,6 +180,18 @@ nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
>      *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK);
>  out:
>      __put_gfn(p2m, L1_gpa >> PAGE_SHIFT);
> +
> +    /*
> +     * When reporting L0_ERROR, rewrite nfpec to match what would have occured
> +     * if hardware had walked the L0, rather than the combined L02.
> +     */
> +    if ( rc == NESTEDHVM_PAGEFAULT_L0_ERROR )
> +    {
> +        npfec->present = !mfn_eq(mfn, INVALID_MFN);

To be in line with the conditional a few lines up from here,
wouldn't this better be !mfn_valid(mfn)?

Should there ever be a case to clear the flag when it was set? If
a mapping has gone away between the time the exit condition was
detected and the time we re-evaluate things here, I think it
should still report "present" back to the caller. Taking both
remarks together I'm thinking of

        if ( mfn_valid(mfn) )
            npfec->present = 1;

> +        npfec->gla_valid = 0;

For this, one the question is whose linear address is meant here.
If it's L2's, then it shouldn't be cleared. If it's L1's, then
it would seem to me that it should have been avoided to set the
field, or at least it should have been cleared the moment we're
past L12 handling.

And then there is the question of overall flow here. On the basis
of npfec not being of any interest anymore to the caller's caller
if reporting back DONE (but as per far above it might help our
immediate caller) I wonder whether

static int
nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
                      p2m_type_t *p2mt, p2m_access_t *p2ma,
                      unsigned int *page_order, struct npfec *npfec)
{
    mfn_t mfn;
    int rc;

    /* walk L0 P2M table */
    mfn = get_gfn_type_access(p2m, L1_gpa >> PAGE_SHIFT, p2mt, p2ma,
                              0, page_order);

    rc = NESTEDHVM_PAGEFAULT_DIRECT_MMIO;
    if ( *p2mt == p2m_mmio_direct )
        goto direct_mmio_out;
    rc = NESTEDHVM_PAGEFAULT_MMIO;
    if ( *p2mt == p2m_mmio_dm )
        goto out;

    rc = NESTEDHVM_PAGEFAULT_L0_ERROR;
    /*
     * When reporting L0_ERROR, rewrite nfpec to match what would have occurred
     * if hardware had walked the L0, rather than the combined L02.
     */
    npfec->gla_valid = 0;
    npfec->kind = npfec_kind_unknown;

    if ( !mfn_valid(mfn) )
        goto out;

    npfec->present = 1;

    if ( npfec->write_access && p2m_is_readonly(*p2mt) )
        goto out;

    if ( p2m_is_paging(*p2mt) || p2m_is_shared(*p2mt) || !p2m_is_ram(*p2mt) )
        goto out;

    rc = NESTEDHVM_PAGEFAULT_DONE;
 direct_mmio_out:
    *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK);
 out:
    __put_gfn(p2m, L1_gpa >> PAGE_SHIFT);
    return rc;
}

wouldn't be preferable.

Plus I notice that neither your nor my variant take care of the
NESTEDHVM_PAGEFAULT_DIRECT_MMIO case (where "present" would also
want to become set afaict, and I guess the other two npfec
adjustments would also be applicable).

Jan
Andrew Cooper Nov. 19, 2019, 2:58 p.m. UTC | #3
On 19/11/2019 11:13, Jan Beulich wrote:
> On 18.11.2019 19:15, Andrew Cooper wrote:
>> When nestedhvm_hap_nested_page_fault() returns L0_ERROR,
>> hvm_hap_nested_page_fault() operates on the adjusted gpa.  However, it
>> operates with the original npfec, which is no longer be correct.
> Nit: Perhaps "may" instead of "is"?

I wrote it that way first, but then changed my mind.

npfec is definitely wrong, by virtue of being from the wrong walk.  Its
value may change when being adjusted to the correct walk.

>
>> In particular, it is possible to get a nested fault where the translation is
>> not present in L12 (and therefore L02), while it is present in L01.
> I'm afraid I don't see the connection to the issue at hand, where
> we have a page present in both L01 and L12, just not in L02. And
> there's also no L0_ERROR here - both the initial (propagation) and
> the subsequent (live-locking) exits report DONE according to what
> I thought was the outcome of yesterday's discussion on irc.

Forget the XSA-304 livelock.  That aspect of things is still not fixed.

This is a real bug in L0_ERROR handling (which happens to be the second
part of the problem when the livelock is addressed in one possible way.)

> I take it you imply that L0_ERROR would need raising (as per the
> auxiliary code fragment adding the "(access_x && *page_order)"
> check), but I wonder whether that would really be correct. This
> depends on what L0_ERROR really is supposed to mean: An error
> because of actual L0 settings (x=0 in our case), or an error
> because of intended L0 settings (x=1 in our case). After all a
> violation of just the p2m_access (which also affects r/w/x)
> doesn't get reported by nestedhap_walk_L0_p2m() as L0_ERROR
> either (and hence would, as it seems to me, lead to a similar
> live lock).
>
> Therefore I wonder whether your initial idea of doing the
> shattering right here wouldn't be the better course of action.
> nestedhap_fix_p2m() could either install the large page and then
> shatter it right away, or it could install just the individual
> small page. Together with the different npfec adjustment model
> suggested below (leading to npfec.present to also get updated in
> the DONE case) a similar "insn-fetch && present" conditional (to
> that introduced for XSA-304) could then be used there.
>
> Even better - by making the violation checking around the
> original XSA-304 addition a function (together with the 304
> addition), such a function might then be reusable here. This
> might then address the p2m_access related live lock as well.

This is all unrelated to the patch.

>
>> When handling an L0_ERROR, adjust npfec as well as gpa.
> The gpa adjustment referred to here is not in nestedhap_walk_L0_p2m()
> but in nestedhvm_hap_nested_page_fault(), if I'm not mistaken?

The layers where adjustments are made are spread out.  I was referring
to the whole process of handling L0_ERROR.

>
>> @@ -181,6 +180,18 @@ nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
>>      *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK);
>>  out:
>>      __put_gfn(p2m, L1_gpa >> PAGE_SHIFT);
>> +
>> +    /*
>> +     * When reporting L0_ERROR, rewrite nfpec to match what would have occured
>> +     * if hardware had walked the L0, rather than the combined L02.
>> +     */
>> +    if ( rc == NESTEDHVM_PAGEFAULT_L0_ERROR )
>> +    {
>> +        npfec->present = !mfn_eq(mfn, INVALID_MFN);
> To be in line with the conditional a few lines up from here,
> wouldn't this better be !mfn_valid(mfn)?

That's not how the return value from get_gfn_*() works, and would break
the MMIO case.

> Should there ever be a case to clear the flag when it was set? If
> a mapping has gone away between the time the exit condition was
> detected and the time we re-evaluate things here, I think it
> should still report "present" back to the caller.

No - absolutely not.  We must report the property of the L0 walk, as we
found it.

Pretending it was present when it wasn't is a sure-fire way of leaving
further bugs lurking.

>  Taking both
> remarks together I'm thinking of
>
>         if ( mfn_valid(mfn) )
>             npfec->present = 1;
>
>> +        npfec->gla_valid = 0;
> For this, one the question is whose linear address is meant here.

The linear address (which was L2's) is nonsensical when we've taken an
L0 fault.  This is why it is clobbered unconditionally.

> If it's L2's, then it shouldn't be cleared. If it's L1's, then
> it would seem to me that it should have been avoided to set the
> field, or at least it should have been cleared the moment we're
> past L12 handling.
>
> And then there is the question of overall flow here. On the basis
> of npfec not being of any interest anymore to the caller's caller
> if reporting back DONE (but as per far above it might help our
> immediate caller) I wonder whether

That is far too subtle and complicated.  I'm not included to make the
code any harder to follow than it already is.

~Andrew
Jan Beulich Nov. 19, 2019, 3:23 p.m. UTC | #4
On 19.11.2019 15:58, Andrew Cooper wrote:
> On 19/11/2019 11:13, Jan Beulich wrote:
>> On 18.11.2019 19:15, Andrew Cooper wrote:
>> I take it you imply that L0_ERROR would need raising (as per the
>> auxiliary code fragment adding the "(access_x && *page_order)"
>> check), but I wonder whether that would really be correct. This
>> depends on what L0_ERROR really is supposed to mean: An error
>> because of actual L0 settings (x=0 in our case), or an error
>> because of intended L0 settings (x=1 in our case). After all a
>> violation of just the p2m_access (which also affects r/w/x)
>> doesn't get reported by nestedhap_walk_L0_p2m() as L0_ERROR
>> either (and hence would, as it seems to me, lead to a similar
>> live lock).
>>
>> Therefore I wonder whether your initial idea of doing the
>> shattering right here wouldn't be the better course of action.
>> nestedhap_fix_p2m() could either install the large page and then
>> shatter it right away, or it could install just the individual
>> small page. Together with the different npfec adjustment model
>> suggested below (leading to npfec.present to also get updated in
>> the DONE case) a similar "insn-fetch && present" conditional (to
>> that introduced for XSA-304) could then be used there.
>>
>> Even better - by making the violation checking around the
>> original XSA-304 addition a function (together with the 304
>> addition), such a function might then be reusable here. This
>> might then address the p2m_access related live lock as well.
> 
> This is all unrelated to the patch.

I don't think so. At the very least defining what exactly L0_ERROR
is intended to mean is pretty relevant here. The other two
paragraphs may or may not be considered related, depending how
things overall are supposed to work.

>>> @@ -181,6 +180,18 @@ nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
>>>      *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK);
>>>  out:
>>>      __put_gfn(p2m, L1_gpa >> PAGE_SHIFT);
>>> +
>>> +    /*
>>> +     * When reporting L0_ERROR, rewrite nfpec to match what would have occured
>>> +     * if hardware had walked the L0, rather than the combined L02.
>>> +     */
>>> +    if ( rc == NESTEDHVM_PAGEFAULT_L0_ERROR )
>>> +    {
>>> +        npfec->present = !mfn_eq(mfn, INVALID_MFN);
>> To be in line with the conditional a few lines up from here,
>> wouldn't this better be !mfn_valid(mfn)?
> 
> That's not how the return value from get_gfn_*() works, and would break
> the MMIO case.

How that (for the latter part of your reply)? The MMIO case produces
NESTEDHVM_PAGEFAULT_DIRECT_MMIO, i.e. doesn't even enter this if().
Hence my remark elsewhere that the MMIO cases isn't taken care of in
the first place.

>> Should there ever be a case to clear the flag when it was set? If
>> a mapping has gone away between the time the exit condition was
>> detected and the time we re-evaluate things here, I think it
>> should still report "present" back to the caller.
> 
> No - absolutely not.  We must report the property of the L0 walk, as we
> found it.
> 
> Pretending it was present when it wasn't is a sure-fire way of leaving
> further bugs lurking.

But if npfec.present is set, it surely was set at the time of the
hardware walk. And _that's_ what npfec is supposed to represent.

>>  Taking both
>> remarks together I'm thinking of
>>
>>         if ( mfn_valid(mfn) )
>>             npfec->present = 1;
>>
>>> +        npfec->gla_valid = 0;
>> For this, one the question is whose linear address is meant here.
> 
> The linear address (which was L2's) is nonsensical when we've taken an
> L0 fault.  This is why it is clobbered unconditionally.

And this is also why I was saying ...

>> If it's L2's, then it shouldn't be cleared. If it's L1's, then
>> it would seem to me that it should have been avoided to set the
>> field, or at least it should have been cleared the moment we're
>> past L12 handling.

... this. If it's nonsensical, it shouldn't have been set to begin
with, or be squashed earlier than here.

>> And then there is the question of overall flow here. On the basis
>> of npfec not being of any interest anymore to the caller's caller
>> if reporting back DONE (but as per far above it might help our
>> immediate caller) I wonder whether
> 
> That is far too subtle and complicated.  I'm not included to make the
> code any harder to follow than it already is.

I have to admit I don't see how this is any more complicated than
your variant. I'm merely suggesting adjustments to npfec to get done
at slightly different points in time, with the possible benefit of
an easier route to addressing the live lock(s). It possibly being
more subtle can be compensated by adequate commenting.

Jan
Andrew Cooper Nov. 19, 2019, 8:45 p.m. UTC | #5
On 19/11/2019 15:23, Jan Beulich wrote:
> On 19.11.2019 15:58, Andrew Cooper wrote:
>> On 19/11/2019 11:13, Jan Beulich wrote:
>>> On 18.11.2019 19:15, Andrew Cooper wrote:
>>> I take it you imply that L0_ERROR would need raising (as per the
>>> auxiliary code fragment adding the "(access_x && *page_order)"
>>> check), but I wonder whether that would really be correct. This
>>> depends on what L0_ERROR really is supposed to mean: An error
>>> because of actual L0 settings (x=0 in our case), or an error
>>> because of intended L0 settings (x=1 in our case). After all a
>>> violation of just the p2m_access (which also affects r/w/x)
>>> doesn't get reported by nestedhap_walk_L0_p2m() as L0_ERROR
>>> either (and hence would, as it seems to me, lead to a similar
>>> live lock).
>>>
>>> Therefore I wonder whether your initial idea of doing the
>>> shattering right here wouldn't be the better course of action.
>>> nestedhap_fix_p2m() could either install the large page and then
>>> shatter it right away, or it could install just the individual
>>> small page. Together with the different npfec adjustment model
>>> suggested below (leading to npfec.present to also get updated in
>>> the DONE case) a similar "insn-fetch && present" conditional (to
>>> that introduced for XSA-304) could then be used there.
>>>
>>> Even better - by making the violation checking around the
>>> original XSA-304 addition a function (together with the 304
>>> addition), such a function might then be reusable here. This
>>> might then address the p2m_access related live lock as well.
>> This is all unrelated to the patch.
> I don't think so.

This patch is not a fix for the XSA-304 livelock.

It is a independent bug discovered while investigating the livelock.

It may, or may not, form part of the XSA-304 livelock bugfix, depending
on how the rest of the investigation goes.

>  At the very least defining what exactly L0_ERROR
> is intended to mean is pretty relevant here.

The intent of the code is clear (at least, to me).

It means #NPF/EPT_VIOLATION/EPT_MISCONFIG in the L01 part of the nested
walk.

>>>> @@ -181,6 +180,18 @@ nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
>>>>      *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK);
>>>>  out:
>>>>      __put_gfn(p2m, L1_gpa >> PAGE_SHIFT);
>>>> +
>>>> +    /*
>>>> +     * When reporting L0_ERROR, rewrite nfpec to match what would have occured
>>>> +     * if hardware had walked the L0, rather than the combined L02.
>>>> +     */
>>>> +    if ( rc == NESTEDHVM_PAGEFAULT_L0_ERROR )
>>>> +    {
>>>> +        npfec->present = !mfn_eq(mfn, INVALID_MFN);
>>> To be in line with the conditional a few lines up from here,
>>> wouldn't this better be !mfn_valid(mfn)?
>> That's not how the return value from get_gfn_*() works, and would break
>> the MMIO case.
> How that (for the latter part of your reply)? The MMIO case produces
> NESTEDHVM_PAGEFAULT_DIRECT_MMIO, i.e. doesn't even enter this if().
> Hence my remark elsewhere that the MMIO cases isn't taken care of in
> the first place.
>
>>> Should there ever be a case to clear the flag when it was set? If
>>> a mapping has gone away between the time the exit condition was
>>> detected and the time we re-evaluate things here, I think it
>>> should still report "present" back to the caller.
>> No - absolutely not.  We must report the property of the L0 walk, as we
>> found it.
>>
>> Pretending it was present when it wasn't is a sure-fire way of leaving
>> further bugs lurking.
> But if npfec.present is set, it surely was set at the time of the
> hardware walk. And _that's_ what npfec is supposed to represent.
>
>>>  Taking both
>>> remarks together I'm thinking of
>>>
>>>         if ( mfn_valid(mfn) )
>>>             npfec->present = 1;
>>>
>>>> +        npfec->gla_valid = 0;
>>> For this, one the question is whose linear address is meant here.
>> The linear address (which was L2's) is nonsensical when we've taken an
>> L0 fault.  This is why it is clobbered unconditionally.
> And this is also why I was saying ...
>
>>> If it's L2's, then it shouldn't be cleared. If it's L1's, then
>>> it would seem to me that it should have been avoided to set the
>>> field, or at least it should have been cleared the moment we're
>>> past L12 handling.
> ... this. If it's nonsensical, it shouldn't have been set to begin
> with, or be squashed earlier than here.

There seems to be a lot of confusion here.

This is the correct place to discard it.

Hardware did a real walk of L02 and got a real gpa and npfec (optionally
with a real gla), that overall identified "something went wrong".

Upon interpreting "what went wrong", Xen may decide that it is a problem
in the L01 walk, rather than the L12 or combined L02.

A problem in the L01 walk is handled by returning L0_ERROR back to the
common code, discarding the current NPF/EPT_VIOLATION/MISCONFIG context,
and synthesizing the state that would have occurred if hardware were to
have performed the L01 walk instead of L02, so it can be correctly
interpreted by the common code on the hostp2m.

gpa gets adjusted.  npfec doesn't (and the subject of this patch).  gla
doesn't even get passed in for potential adjustments.

The gla isn't actually an interesting value, and Xen's use of it for
various cache maintenance purposes looks buggy.  Gla is specific to the
L2 guests' register state and virtual memory layout, and in particular,
has no bearing on anything where we've decided that we need a correction
to the L01 mapping.

~Andrew
Jan Beulich Nov. 20, 2019, 8:29 a.m. UTC | #6
On 19.11.2019 21:45, Andrew Cooper wrote:
> On 19/11/2019 15:23, Jan Beulich wrote:
>> On 19.11.2019 15:58, Andrew Cooper wrote:
>>> On 19/11/2019 11:13, Jan Beulich wrote:
>>>> On 18.11.2019 19:15, Andrew Cooper wrote:
>>>> I take it you imply that L0_ERROR would need raising (as per the
>>>> auxiliary code fragment adding the "(access_x && *page_order)"
>>>> check), but I wonder whether that would really be correct. This
>>>> depends on what L0_ERROR really is supposed to mean: An error
>>>> because of actual L0 settings (x=0 in our case), or an error
>>>> because of intended L0 settings (x=1 in our case). After all a
>>>> violation of just the p2m_access (which also affects r/w/x)
>>>> doesn't get reported by nestedhap_walk_L0_p2m() as L0_ERROR
>>>> either (and hence would, as it seems to me, lead to a similar
>>>> live lock).
>>>>
>>>> Therefore I wonder whether your initial idea of doing the
>>>> shattering right here wouldn't be the better course of action.
>>>> nestedhap_fix_p2m() could either install the large page and then
>>>> shatter it right away, or it could install just the individual
>>>> small page. Together with the different npfec adjustment model
>>>> suggested below (leading to npfec.present to also get updated in
>>>> the DONE case) a similar "insn-fetch && present" conditional (to
>>>> that introduced for XSA-304) could then be used there.
>>>>
>>>> Even better - by making the violation checking around the
>>>> original XSA-304 addition a function (together with the 304
>>>> addition), such a function might then be reusable here. This
>>>> might then address the p2m_access related live lock as well.
>>> This is all unrelated to the patch.
>> I don't think so.
> 
> This patch is not a fix for the XSA-304 livelock.
> 
> It is a independent bug discovered while investigating the livelock.
> 
> It may, or may not, form part of the XSA-304 livelock bugfix, depending
> on how the rest of the investigation goes.
> 
>>  At the very least defining what exactly L0_ERROR
>> is intended to mean is pretty relevant here.
> 
> The intent of the code is clear (at least, to me).
> 
> It means #NPF/EPT_VIOLATION/EPT_MISCONFIG in the L01 part of the nested
> walk.

But this is contrary to the code deriving L0_ERROR exclusively from
the p2m type, ignoring p2m access and anything else that may have lead
to e.g. r/w/x settings different from the ones implied by the type.
Using just the p2m type to me makes it more like a "would be error"
than an actual walk result.

What you imply would require an approach much more similar to that of
nestedhap_walk_L1_p2m(), looking at the actual EPT/NPT entries rather
than the abstract (MFN,order,type,access) tuple derived from them.

>>>>> @@ -181,6 +180,18 @@ nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
>>>>>      *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK);
>>>>>  out:
>>>>>      __put_gfn(p2m, L1_gpa >> PAGE_SHIFT);
>>>>> +
>>>>> +    /*
>>>>> +     * When reporting L0_ERROR, rewrite nfpec to match what would have occured
>>>>> +     * if hardware had walked the L0, rather than the combined L02.
>>>>> +     */
>>>>> +    if ( rc == NESTEDHVM_PAGEFAULT_L0_ERROR )
>>>>> +    {
>>>>> +        npfec->present = !mfn_eq(mfn, INVALID_MFN);
>>>> To be in line with the conditional a few lines up from here,
>>>> wouldn't this better be !mfn_valid(mfn)?
>>> That's not how the return value from get_gfn_*() works, and would break
>>> the MMIO case.
>> How that (for the latter part of your reply)? The MMIO case produces
>> NESTEDHVM_PAGEFAULT_DIRECT_MMIO, i.e. doesn't even enter this if().
>> Hence my remark elsewhere that the MMIO cases isn't taken care of in
>> the first place.
>>
>>>> Should there ever be a case to clear the flag when it was set? If
>>>> a mapping has gone away between the time the exit condition was
>>>> detected and the time we re-evaluate things here, I think it
>>>> should still report "present" back to the caller.
>>> No - absolutely not.  We must report the property of the L0 walk, as we
>>> found it.
>>>
>>> Pretending it was present when it wasn't is a sure-fire way of leaving
>>> further bugs lurking.
>> But if npfec.present is set, it surely was set at the time of the
>> hardware walk. And _that's_ what npfec is supposed to represent.
>>
>>>>  Taking both
>>>> remarks together I'm thinking of
>>>>
>>>>         if ( mfn_valid(mfn) )
>>>>             npfec->present = 1;
>>>>
>>>>> +        npfec->gla_valid = 0;
>>>> For this, one the question is whose linear address is meant here.
>>> The linear address (which was L2's) is nonsensical when we've taken an
>>> L0 fault.  This is why it is clobbered unconditionally.
>> And this is also why I was saying ...
>>
>>>> If it's L2's, then it shouldn't be cleared. If it's L1's, then
>>>> it would seem to me that it should have been avoided to set the
>>>> field, or at least it should have been cleared the moment we're
>>>> past L12 handling.
>> ... this. If it's nonsensical, it shouldn't have been set to begin
>> with, or be squashed earlier than here.
> 
> There seems to be a lot of confusion here.

Indeed. The question though is who it is that is confused. It's
likely me, but I dislike you putting it as if there's not even
the remote chance of it not being (just) me.

> This is the correct place to discard it.
> 
> Hardware did a real walk of L02 and got a real gpa and npfec (optionally
> with a real gla), that overall identified "something went wrong".
> 
> Upon interpreting "what went wrong", Xen may decide that it is a problem
> in the L01 walk, rather than the L12 or combined L02.
> 
> A problem in the L01 walk is handled by returning L0_ERROR back to the
> common code, discarding the current NPF/EPT_VIOLATION/MISCONFIG context,
> and synthesizing the state that would have occurred if hardware were to
> have performed the L01 walk instead of L02, so it can be correctly
> interpreted by the common code on the hostp2m.

Well, we appear to agree on what is to happen, but disagree on what
"if hardware were to have performed the L01 walk instead of L02" means:
You suggest this ought to be the state of the tables as we see them
when we do the software walk. This allows for a 1 -> 0 transition of
npfec.present when updating it. Otoh I think that a 1 -> 0 transition
shouldn't be possible, as the hardware having found the L02 entry to
be present implies that both the L12 and the L01 entries also had
their respective entries present at the time the hardware walk was
performed, and npfec (including any updates we do to it) should
reflect (as much as possible) the state of things at the time of the
hardware walk, because that's what it is describing.

Note that what I describe above as my interpretation of your replies
seems (to me) contrary to your patch, in a comment, having "if
hardware had walked the L0". But yes, the comment isn't entirely
unambiguous either, as it also leaves open at which point in time
you mean this hypothetical walk to have occurred.

> gpa gets adjusted.  npfec doesn't (and the subject of this patch).  gla
> doesn't even get passed in for potential adjustments.
> 
> The gla isn't actually an interesting value, and Xen's use of it for
> various cache maintenance purposes looks buggy.  Gla is specific to the
> L2 guests' register state and virtual memory layout, and in particular,
> has no bearing on anything where we've decided that we need a correction
> to the L01 mapping.

And hence my conclusion that gla_valid should be cleared the latest
by nestedhvm_hap_nested_page_fault() before even calling
nestedhap_walk_L0_p2m(), i.e. in particular independent of anything
the latter function does.

In the end it's not my ack that would matter for this patch, so if
you and George can come to an agreement, I'll go silent. For me to
give an R-b the minimum requirements are
- more extensive description (or comments) eliminating some of the
  ambiguities (and thus having firm common grounds as to what the
  behavior ultimately ought to be, allowing to judge that this
  change is, in all its parts, at least not a step away from the
  targeted behavior),
- the MMIO case taken care of (or making it explicit that the case
  is left unaddressed),
- at least a comment regarding the XSA-304 / mem-access related
  shortcomings of nestedhap_walk_L0_p2m().

Jan
Jan Beulich Nov. 20, 2019, 8:39 a.m. UTC | #7
On 19.11.2019 15:58, Andrew Cooper wrote:
> On 19/11/2019 11:13, Jan Beulich wrote:
>> On 18.11.2019 19:15, Andrew Cooper wrote:
>>> @@ -181,6 +180,18 @@ nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
>>>      *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK);
>>>  out:
>>>      __put_gfn(p2m, L1_gpa >> PAGE_SHIFT);
>>> +
>>> +    /*
>>> +     * When reporting L0_ERROR, rewrite nfpec to match what would have occured
>>> +     * if hardware had walked the L0, rather than the combined L02.
>>> +     */
>>> +    if ( rc == NESTEDHVM_PAGEFAULT_L0_ERROR )
>>> +    {
>>> +        npfec->present = !mfn_eq(mfn, INVALID_MFN);
>> To be in line with the conditional a few lines up from here,
>> wouldn't this better be !mfn_valid(mfn)?
> 
> That's not how the return value from get_gfn_*() works, and would break
> the MMIO case.

To deal with the MMIO case the if() condition needs extending
(or a separate code block needs adding) and then I would still
see this assignment become

       npfec->present = mfn_valid(mfn) || rc == NESTEDHVM_PAGEFAULT_DIRECT_MMIO;

After all we never write non-present direct MMIO entries. (The
above is ignoring the question of whether 1 -> 0 transitions of
->present are to be permitted, as per the other sub-thread.)
This is to cope with (current or potential future) p2m entries
that get a special MFN stored, e.g. along the lines of
SHARED_P2M_ENTRY. I don't think we have any such right now, but
I also don't see why code shouldn't be prepared for ones to
appear.

Jan

Patch
diff mbox series

diff --git a/xen/arch/x86/mm/hap/nested_hap.c
b/xen/arch/x86/mm/hap/nested_hap.c
index a509a40c93..bd58a86b46 100644
--- a/xen/arch/x86/mm/hap/nested_hap.c
+++ b/xen/arch/x86/mm/hap/nested_hap.c
@@ -167,7 +167,8 @@  nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t
L1_gpa, paddr_t *L0_gpa,
         goto out;

     rc = NESTEDHVM_PAGEFAULT_L0_ERROR;
-    if ( access_w && p2m_is_readonly(*p2mt) )
+    if ( (access_w && p2m_is_readonly(*p2mt)) ||
+         (access_x && *page_order) )
         goto out;

     if ( p2m_is_paging(*p2mt) || p2m_is_shared(*p2mt) || !p2m_is_ram(*p2mt) )

it does resolve the issue.  However, the above isn't correct in the general
case, and is still under development.
---
 xen/arch/x86/hvm/hvm.c              |  6 ++----
 xen/arch/x86/mm/hap/nested_hap.c    | 27 +++++++++++++++++++--------
 xen/include/asm-x86/hvm/nestedhvm.h |  2 +-
 3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 818e705fd1..87eed13ee1 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1729,10 +1729,8 @@  int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
          * the same as for shadow paging.
          */
 
-         rv = nestedhvm_hap_nested_page_fault(curr, &gpa,
-                                              npfec.read_access,
-                                              npfec.write_access,
-                                              npfec.insn_fetch);
+        rv = nestedhvm_hap_nested_page_fault(curr, &gpa, &npfec);
+
         switch (rv) {
         case NESTEDHVM_PAGEFAULT_DONE:
         case NESTEDHVM_PAGEFAULT_RETRY:
diff --git a/xen/arch/x86/mm/hap/nested_hap.c b/xen/arch/x86/mm/hap/nested_hap.c
index abe5958a52..9eba35f7e8 100644
--- a/xen/arch/x86/mm/hap/nested_hap.c
+++ b/xen/arch/x86/mm/hap/nested_hap.c
@@ -149,8 +149,7 @@  nestedhap_walk_L1_p2m(struct vcpu *v, paddr_t L2_gpa, paddr_t *L1_gpa,
 static int
 nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
                       p2m_type_t *p2mt, p2m_access_t *p2ma,
-                      unsigned int *page_order,
-                      bool_t access_r, bool_t access_w, bool_t access_x)
+                      unsigned int *page_order, struct npfec *npfec)
 {
     mfn_t mfn;
     int rc;
@@ -167,7 +166,7 @@  nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
         goto out;
 
     rc = NESTEDHVM_PAGEFAULT_L0_ERROR;
-    if ( access_w && p2m_is_readonly(*p2mt) )
+    if ( npfec->write_access && p2m_is_readonly(*p2mt) )
         goto out;
 
     if ( p2m_is_paging(*p2mt) || p2m_is_shared(*p2mt) || !p2m_is_ram(*p2mt) )
@@ -181,6 +180,18 @@  nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
     *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK);
 out:
     __put_gfn(p2m, L1_gpa >> PAGE_SHIFT);
+
+    /*
+     * When reporting L0_ERROR, rewrite nfpec to match what would have occured
+     * if hardware had walked the L0, rather than the combined L02.
+     */
+    if ( rc == NESTEDHVM_PAGEFAULT_L0_ERROR )
+    {
+        npfec->present = !mfn_eq(mfn, INVALID_MFN);
+        npfec->gla_valid = 0;
+        npfec->kind = npfec_kind_unknown;
+    }
+
     return rc;
 }
 
@@ -191,7 +202,7 @@  nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
  */
 int
 nestedhvm_hap_nested_page_fault(struct vcpu *v, paddr_t *L2_gpa,
-    bool_t access_r, bool_t access_w, bool_t access_x)
+                                struct npfec *npfec)
 {
     int rv;
     paddr_t L1_gpa, L0_gpa;
@@ -206,7 +217,8 @@  nestedhvm_hap_nested_page_fault(struct vcpu *v, paddr_t *L2_gpa,
 
     /* walk the L1 P2M table */
     rv = nestedhap_walk_L1_p2m(v, *L2_gpa, &L1_gpa, &page_order_21, &p2ma_21,
-        access_r, access_w, access_x);
+                               npfec->read_access, npfec->write_access,
+                               npfec->insn_fetch);
 
     /* let caller to handle these two cases */
     switch (rv) {
@@ -222,9 +234,8 @@  nestedhvm_hap_nested_page_fault(struct vcpu *v, paddr_t *L2_gpa,
     }
 
     /* ==> we have to walk L0 P2M */
-    rv = nestedhap_walk_L0_p2m(p2m, L1_gpa, &L0_gpa,
-        &p2mt_10, &p2ma_10, &page_order_10,
-        access_r, access_w, access_x);
+    rv = nestedhap_walk_L0_p2m(p2m, L1_gpa, &L0_gpa, &p2mt_10, &p2ma_10,
+                               &page_order_10, npfec);
 
     /* let upper level caller to handle these two cases */
     switch (rv) {
diff --git a/xen/include/asm-x86/hvm/nestedhvm.h b/xen/include/asm-x86/hvm/nestedhvm.h
index 256fed733a..7b53f23e97 100644
--- a/xen/include/asm-x86/hvm/nestedhvm.h
+++ b/xen/include/asm-x86/hvm/nestedhvm.h
@@ -58,7 +58,7 @@  bool_t nestedhvm_vcpu_in_guestmode(struct vcpu *v);
 #define NESTEDHVM_PAGEFAULT_RETRY      5
 #define NESTEDHVM_PAGEFAULT_DIRECT_MMIO 6
 int nestedhvm_hap_nested_page_fault(struct vcpu *v, paddr_t *L2_gpa,
-    bool_t access_r, bool_t access_w, bool_t access_x);
+                                    struct npfec *npfec);
 
 int nestedhap_walk_L1_p2m(struct vcpu *v, paddr_t L2_gpa, paddr_t *L1_gpa,
                           unsigned int *page_order, uint8_t *p2m_acc,