diff mbox series

[07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault()

Message ID 69304929-de84-04db-04f2-8faffc12ef0f@suse.com (mailing list archive)
State New, archived
Headers show
Series x86: assorted (mostly) shadow mode adjustments | expand

Commit Message

Jan Beulich March 22, 2023, 9:33 a.m. UTC
There's no need for an indirect call here, as the mode is invariant
throughout the entire paging-locked region. All it takes to avoid it is
to have a forward declaration of sh_update_cr3() in place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I find this and the respective Win7 related comment suspicious: If we
really need to "fix up" L3 entries "on demand", wouldn't we better retry
the shadow_get_and_create_l1e() rather than exit? The spurious page
fault that the guest observes can, after all, not be known to be non-
fatal inside the guest. That's purely an OS policy.

Furthermore the sh_update_cr3() will also invalidate L3 entries which
were loaded successfully before, but invalidated by the guest
afterwards. I strongly suspect that the described hardware behavior is
_only_ to load previously not-present entries from the PDPT, but not
purge ones already marked present. IOW I think sh_update_cr3() would
need calling in an "incremental" mode here. (The alternative of doing
this in shadow_get_and_create_l3e() instead would likely be more
cumbersome.)

In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
looks wrong.

Beyond the "on demand" L3 entry creation I also can't see what guest
actions could lead to the ASSERT() being inapplicable in the PAE case.
The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
PDPTEs, and all other logic is similar to that for other modes.

(See 89329d832aed ["x86 shadow: Update cr3 in PAE mode when guest walk
succeed but shadow walk fails"].)

Comments

Andrew Cooper March 23, 2023, 1:34 p.m. UTC | #1
On 22/03/2023 9:33 am, Jan Beulich wrote:
> There's no need for an indirect call here, as the mode is invariant
> throughout the entire paging-locked region. All it takes to avoid it is
> to have a forward declaration of sh_update_cr3() in place.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I find this and the respective Win7 related comment suspicious: If we
> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> the shadow_get_and_create_l1e() rather than exit? The spurious page
> fault that the guest observes can, after all, not be known to be non-
> fatal inside the guest. That's purely an OS policy.
>
> Furthermore the sh_update_cr3() will also invalidate L3 entries which
> were loaded successfully before, but invalidated by the guest
> afterwards. I strongly suspect that the described hardware behavior is
> _only_ to load previously not-present entries from the PDPT, but not
> purge ones already marked present. IOW I think sh_update_cr3() would
> need calling in an "incremental" mode here. (The alternative of doing
> this in shadow_get_and_create_l3e() instead would likely be more
> cumbersome.)
>
> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
> looks wrong.
>
> Beyond the "on demand" L3 entry creation I also can't see what guest
> actions could lead to the ASSERT() being inapplicable in the PAE case.
> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
> PDPTEs, and all other logic is similar to that for other modes.
>
> (See 89329d832aed ["x86 shadow: Update cr3 in PAE mode when guest walk
> succeed but shadow walk fails"].)

I recall that there was a complicated bug, ultimately discovered because
Win7 changed behaviour vs older versions, and the shadow logic had been
written to AMD's PAE behaviour, not Intel's.

Remember that Intel and AMD differer in how PAE paging works between
root and non-root mode, and it is to do with whether all PDPTRs get
cached at once, or on demand.

Off the top of my head:
* 32bit PV guests get on-demand semantics (as they're really 4-level)
* VT-x strictly use architectural "PDPTRs get cached on mov to CR3"
semantics
* SVM with NPT have on-demand semantics
* SVM with shadow is model-specific as to which semantics is uses, IIRC
Fam15h and later are on-demand

These differences still manifest bugs in the common HVM code, the PTE
caching code, and the pagewalk code.

Looking at the comment specifically, I'm pretty sure it's wrong.  I
think that suggests we've got even more PDPTR bugs than I'd previously
identified.  In some copious free time, I do need to extend the
pagetable-emul XTF test to include PDPTR updates because it's the one
area of pagewalking that doesn't have any suitable testing right now.

~Andrew
Tim Deegan March 27, 2023, 3:39 p.m. UTC | #2
Hi,

At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
> There's no need for an indirect call here, as the mode is invariant
> throughout the entire paging-locked region. All it takes to avoid it is
> to have a forward declaration of sh_update_cr3() in place.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I find this and the respective Win7 related comment suspicious: If we
> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> the shadow_get_and_create_l1e() rather than exit? The spurious page
> fault that the guest observes can, after all, not be known to be non-
> fatal inside the guest. That's purely an OS policy.

I think it has to be non-fatal because it can happen on real hardware,
even if the hardware *does* fill the TLB here (which it is not
required to).

Filling just one sl3e sounds plausible, though we don't want to go
back to the idea of having L3 shadows on PAE!

> Furthermore the sh_update_cr3() will also invalidate L3 entries which
> were loaded successfully before, but invalidated by the guest
> afterwards. I strongly suspect that the described hardware behavior is
> _only_ to load previously not-present entries from the PDPT, but not
> purge ones already marked present.

Very likely, but we *are* allowed to forget old entries whenever we
want to, so this is at worst a performance problem.

> IOW I think sh_update_cr3() would
> need calling in an "incremental" mode here.

This would be the best way of updating just the one entry - but as far
as I can tell the existing code is correct so I wouldn't add any more
complexity unless we know that this path is causing perf problems.

> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
> looks wrong.

Yep.

> Beyond the "on demand" L3 entry creation I also can't see what guest
> actions could lead to the ASSERT() being inapplicable in the PAE case.
> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
> PDPTEs, and all other logic is similar to that for other modes.

The assert's not true here because the guest can push us down this
path by doing exactly what Win 7 does here - loading CR3 with a
missing L3E, then filling the L3E and causing a page fault that uses
the now-filled L3E.  (Or maybe that's not true any more; my mental
model of the pagetable walker code might be out of date)

Cheers,

Tim.
Jan Beulich March 28, 2023, 10:37 a.m. UTC | #3
On 27.03.2023 17:39, Tim Deegan wrote:
> At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
>> There's no need for an indirect call here, as the mode is invariant
>> throughout the entire paging-locked region. All it takes to avoid it is
>> to have a forward declaration of sh_update_cr3() in place.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I find this and the respective Win7 related comment suspicious: If we
>> really need to "fix up" L3 entries "on demand", wouldn't we better retry
>> the shadow_get_and_create_l1e() rather than exit? The spurious page
>> fault that the guest observes can, after all, not be known to be non-
>> fatal inside the guest. That's purely an OS policy.
> 
> I think it has to be non-fatal because it can happen on real hardware,
> even if the hardware *does* fill the TLB here (which it is not
> required to).

But if hardware filled the TLB, it won't raise #PF. If it didn't fill
the TLB (e.g. because of not even doing a walk when a PDPTE is non-
present), a #PF would be legitimate, and the handler would then need
to reload CR3. But such a #PF is what, according to the comment, Win7
chokes on. So it can only be the former case, yet what we do here is
fill the (virtual) TLB _and_ raise #PF. Win7 apparently ignores this
as spurious, but that's not required behavior for an OS afaik.

> Filling just one sl3e sounds plausible, though we don't want to go
> back to the idea of having L3 shadows on PAE!

Of course.

>> Furthermore the sh_update_cr3() will also invalidate L3 entries which
>> were loaded successfully before, but invalidated by the guest
>> afterwards. I strongly suspect that the described hardware behavior is
>> _only_ to load previously not-present entries from the PDPT, but not
>> purge ones already marked present.
> 
> Very likely, but we *are* allowed to forget old entries whenever we
> want to, so this is at worst a performance problem.

That depends on the model, I think: In the original Intel model PDPTEs
cannot be "forgotten". In some AMD variants, where L3 is walked normally,
they of course can be.

>> IOW I think sh_update_cr3() would
>> need calling in an "incremental" mode here.
> 
> This would be the best way of updating just the one entry - but as far
> as I can tell the existing code is correct so I wouldn't add any more
> complexity unless we know that this path is causing perf problems.

If it was/is just performance - sure.

>> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
>> looks wrong.
> 
> Yep.

Will add another patch to the series then.

>> Beyond the "on demand" L3 entry creation I also can't see what guest
>> actions could lead to the ASSERT() being inapplicable in the PAE case.
>> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
>> PDPTEs, and all other logic is similar to that for other modes.
> 
> The assert's not true here because the guest can push us down this
> path by doing exactly what Win 7 does here - loading CR3 with a
> missing L3E, then filling the L3E and causing a page fault that uses
> the now-filled L3E.  (Or maybe that's not true any more; my mental
> model of the pagetable walker code might be out of date)

Well, I specifically started the paragraph with 'Beyond the "on demand"
L3 entry creation'. IOW the assertion would look applicable to me if
we, as suggested higher up, retried shadow_get_and_create_l1e() and it
failed again. As the comment there says, "we know the guest entries are
OK", so the missing L3 entry must have appeared.

Jan
Tim Deegan March 29, 2023, 6:49 a.m. UTC | #4
At 12:37 +0200 on 28 Mar (1680007032), Jan Beulich wrote:
> On 27.03.2023 17:39, Tim Deegan wrote:
> > At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
> >> There's no need for an indirect call here, as the mode is invariant
> >> throughout the entire paging-locked region. All it takes to avoid it is
> >> to have a forward declaration of sh_update_cr3() in place.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> I find this and the respective Win7 related comment suspicious: If we
> >> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> >> the shadow_get_and_create_l1e() rather than exit? The spurious page
> >> fault that the guest observes can, after all, not be known to be non-
> >> fatal inside the guest. That's purely an OS policy.
> > 
> > I think it has to be non-fatal because it can happen on real hardware,
> > even if the hardware *does* fill the TLB here (which it is not
> > required to).
> 
> But if hardware filled the TLB, it won't raise #PF. If it didn't fill
> the TLB (e.g. because of not even doing a walk when a PDPTE is non-
> present), a #PF would be legitimate, and the handler would then need
> to reload CR3. But such a #PF is what, according to the comment, Win7
> chokes on.

IIRC the Win7 behaviour is to accept and discard the #PF as spurious
(because it sees the PDPTE is present) *without* reloading CR3, so it
gets stuck in a loop taking pagefaults.  Here, we reload CR3 for it,
to unstick it.

I can't think of a mental model of the MMU that would have a problem
here -- either the L3Es are special in which case the pagefault is
correct (and we shouldn't even read the new contents) or they're just
like other PTEs in which case the spurious fault is fine.

> > The assert's not true here because the guest can push us down this
> > path by doing exactly what Win 7 does here - loading CR3 with a
> > missing L3E, then filling the L3E and causing a page fault that uses
> > the now-filled L3E.  (Or maybe that's not true any more; my mental
> > model of the pagetable walker code might be out of date)
> 
> Well, I specifically started the paragraph with 'Beyond the "on demand"
> L3 entry creation'. IOW the assertion would look applicable to me if
> we, as suggested higher up, retried shadow_get_and_create_l1e() and it
> failed again. As the comment there says, "we know the guest entries are
> OK", so the missing L3 entry must have appeared.

Ah, I didn't quite understand you.  Yes, if we changed the handler to
rebuild the SL3E then I think the assertion would be valid again.

Cheers,

Tim.
diff mbox series

Patch

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -91,6 +91,8 @@  const char *const fetch_type_names[] = {
 # define for_each_shadow_table(v, i) for ( (i) = 0; (i) < 1; ++(i) )
 #endif
 
+static void cf_check sh_update_cr3(struct vcpu *v, int do_locking, bool noflush);
+
 /* Helper to perform a local TLB flush. */
 static void sh_flush_local(const struct domain *d)
 {
@@ -2487,7 +2489,7 @@  static int cf_check sh_page_fault(
          * In any case, in the PAE case, the ASSERT is not true; it can
          * happen because of actions the guest is taking. */
 #if GUEST_PAGING_LEVELS == 3
-        v->arch.paging.mode->update_cr3(v, 0, false);
+        sh_update_cr3(v, 0, false);
 #else
         ASSERT(d->is_shutting_down);
 #endif