Message ID | 69304929-de84-04db-04f2-8faffc12ef0f@suse.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86: assorted (mostly) shadow mode adjustments |
On 22/03/2023 9:33 am, Jan Beulich wrote:
> There's no need for an indirect call here, as the mode is invariant
> throughout the entire paging-locked region. All it takes to avoid it is
> to have a forward declaration of sh_update_cr3() in place.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I find this and the respective Win7 related comment suspicious: If we
> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> the shadow_get_and_create_l1e() rather than exit? The spurious page
> fault that the guest observes can, after all, not be known to be non-
> fatal inside the guest. That's purely an OS policy.
>
> Furthermore the sh_update_cr3() will also invalidate L3 entries which
> were loaded successfully before, but invalidated by the guest
> afterwards. I strongly suspect that the described hardware behavior is
> _only_ to load previously not-present entries from the PDPT, but not
> purge ones already marked present. IOW I think sh_update_cr3() would
> need calling in an "incremental" mode here. (The alternative of doing
> this in shadow_get_and_create_l3e() instead would likely be more
> cumbersome.)
>
> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
> looks wrong.
>
> Beyond the "on demand" L3 entry creation I also can't see what guest
> actions could lead to the ASSERT() being inapplicable in the PAE case.
> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
> PDPTEs, and all other logic is similar to that for other modes.
>
> (See 89329d832aed ["x86 shadow: Update cr3 in PAE mode when guest walk
> succeed but shadow walk fails"].)

I recall that there was a complicated bug, ultimately discovered because
Win7 changed behaviour vs older versions, and the shadow logic had been
written to AMD's PAE behaviour, not Intel's.

Remember that Intel and AMD differ in how PAE paging works between root
and non-root mode, and it is to do with whether all PDPTRs get cached at
once, or on demand.

Off the top of my head:

* 32bit PV guests get on-demand semantics (as they're really 4-level)
* VT-x strictly uses architectural "PDPTRs get cached on mov to CR3"
  semantics
* SVM with NPT has on-demand semantics
* SVM with shadow is model-specific as to which semantics it uses, IIRC
  Fam15h and later are on-demand

These differences still manifest bugs in the common HVM code, the PTE
caching code, and the pagewalk code.

Looking at the comment specifically, I'm pretty sure it's wrong. I think
that suggests we've got even more PDPTR bugs than I'd previously
identified.

In some copious free time, I do need to extend the pagetable-emul XTF
test to include PDPTR updates because it's the one area of pagewalking
that doesn't have any suitable testing right now.

~Andrew
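To make the two PDPTR semantics listed above concrete, here is a minimal toy model in C. It is only a sketch and not Xen code: `struct toy_mmu`, `toy_load_cr3()` and `toy_pdpte_present()` are hypothetical names invented for illustration, and the model ignores everything except the present bit of the four PAE PDPTEs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PDPTES   4
#define PTE_PRESENT 0x1ULL

/* Hypothetical toy model of the two behaviours; not Xen code. */
enum pdpte_model {
    PDPTE_CACHED_ON_CR3_LOAD, /* all four PDPTEs snapshotted on mov to CR3 */
    PDPTE_ON_DEMAND,          /* PDPT read from memory on every walk */
};

struct toy_mmu {
    enum pdpte_model model;
    const uint64_t *pdpt;        /* guest PDPT as it currently is in memory */
    uint64_t cached[NR_PDPTES];  /* snapshot taken by toy_load_cr3() */
};

/* Emulate a CR3 load: remember the PDPT and snapshot its four entries. */
static void toy_load_cr3(struct toy_mmu *mmu, const uint64_t *pdpt)
{
    mmu->pdpt = pdpt;
    for (unsigned int i = 0; i < NR_PDPTES; i++)
        mmu->cached[i] = pdpt[i];
}

/* True if a walk through 'slot' sees a present PDPTE; false means #PF. */
static bool toy_pdpte_present(const struct toy_mmu *mmu, unsigned int slot)
{
    assert(slot < NR_PDPTES);

    uint64_t e = (mmu->model == PDPTE_ON_DEMAND) ? mmu->pdpt[slot]
                                                 : mmu->cached[slot];
    return (e & PTE_PRESENT) != 0;
}
```

In this model, the sequence discussed in the thread (load CR3 with a not-present PDPTE, then fill that PDPTE in memory, then touch an address it covers) faults under `PDPTE_CACHED_ON_CR3_LOAD` until `toy_load_cr3()` is called again, but succeeds immediately under `PDPTE_ON_DEMAND`.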
Hi,

At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
> There's no need for an indirect call here, as the mode is invariant
> throughout the entire paging-locked region. All it takes to avoid it is
> to have a forward declaration of sh_update_cr3() in place.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I find this and the respective Win7 related comment suspicious: If we
> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> the shadow_get_and_create_l1e() rather than exit? The spurious page
> fault that the guest observes can, after all, not be known to be non-
> fatal inside the guest. That's purely an OS policy.

I think it has to be non-fatal because it can happen on real hardware,
even if the hardware *does* fill the TLB here (which it is not
required to).

Filling just one sl3e sounds plausible, though we don't want to go
back to the idea of having L3 shadows on PAE!

> Furthermore the sh_update_cr3() will also invalidate L3 entries which
> were loaded successfully before, but invalidated by the guest
> afterwards. I strongly suspect that the described hardware behavior is
> _only_ to load previously not-present entries from the PDPT, but not
> purge ones already marked present.

Very likely, but we *are* allowed to forget old entries whenever we
want to, so this is at worst a performance problem.

> IOW I think sh_update_cr3() would
> need calling in an "incremental" mode here.

This would be the best way of updating just the one entry - but as far
as I can tell the existing code is correct so I wouldn't add any more
complexity unless we know that this path is causing perf problems.

> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
> looks wrong.

Yep.

> Beyond the "on demand" L3 entry creation I also can't see what guest
> actions could lead to the ASSERT() being inapplicable in the PAE case.
> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
> PDPTEs, and all other logic is similar to that for other modes.

The assert's not true here because the guest can push us down this
path by doing exactly what Win 7 does here - loading CR3 with a
missing L3E, then filling the L3E and causing a page fault that uses
the now-filled L3E. (Or maybe that's not true any more; my mental
model of the pagetable walker code might be out of date)

Cheers,

Tim.
On 27.03.2023 17:39, Tim Deegan wrote:
> At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
>> There's no need for an indirect call here, as the mode is invariant
>> throughout the entire paging-locked region. All it takes to avoid it is
>> to have a forward declaration of sh_update_cr3() in place.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I find this and the respective Win7 related comment suspicious: If we
>> really need to "fix up" L3 entries "on demand", wouldn't we better retry
>> the shadow_get_and_create_l1e() rather than exit? The spurious page
>> fault that the guest observes can, after all, not be known to be non-
>> fatal inside the guest. That's purely an OS policy.
>
> I think it has to be non-fatal because it can happen on real hardware,
> even if the hardware *does* fill the TLB here (which it is not
> required to).

But if hardware filled the TLB, it won't raise #PF. If it didn't fill
the TLB (e.g. because of not even doing a walk when a PDPTE is non-
present), a #PF would be legitimate, and the handler would then need
to reload CR3. But such a #PF is what, according to the comment, Win7
chokes on. So it can only be the former case, yet what we do here is
fill the (virtual) TLB _and_ raise #PF. Win7 apparently ignores this as
spurious, but that's not required behavior for an OS afaik.

> Filling just one sl3e sounds plausible, though we don't want to go
> back to the idea of having L3 shadows on PAE!

Of course.

>> Furthermore the sh_update_cr3() will also invalidate L3 entries which
>> were loaded successfully before, but invalidated by the guest
>> afterwards. I strongly suspect that the described hardware behavior is
>> _only_ to load previously not-present entries from the PDPT, but not
>> purge ones already marked present.
>
> Very likely, but we *are* allowed to forget old entries whenever we
> want to, so this is at worst a performance problem.

That depends on the model, I think: In the original Intel model PDPTEs
cannot be "forgotten". In some AMD variants, where L3 is walked
normally, they of course can be.

>> IOW I think sh_update_cr3() would
>> need calling in an "incremental" mode here.
>
> This would be the best way of updating just the one entry - but as far
> as I can tell the existing code is correct so I wouldn't add any more
> complexity unless we know that this path is causing perf problems.

If it was/is just performance - sure.

>> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
>> looks wrong.
>
> Yep.

Will add another patch to the series then.

>> Beyond the "on demand" L3 entry creation I also can't see what guest
>> actions could lead to the ASSERT() being inapplicable in the PAE case.
>> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
>> PDPTEs, and all other logic is similar to that for other modes.
>
> The assert's not true here because the guest can push us down this
> path by doing exactly what Win 7 does here - loading CR3 with a
> missing L3E, then filling the L3E and causing a page fault that uses
> the now-filled L3E. (Or maybe that's not true any more; my mental
> model of the pagetable walker code might be out of date)

Well, I specifically started the paragraph with 'Beyond the "on demand"
L3 entry creation'. IOW the assertion would look applicable to me if
we, as suggested higher up, retried shadow_get_and_create_l1e() and it
failed again. As the comment there says, "we know the guest entries are
OK", so the missing L3 entry must have appeared.

Jan
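The difference between a full refresh and the "incremental" mode discussed above can be sketched in a few lines of C. This is an illustration only: `pdpt_refresh_full()` and `pdpt_refresh_incremental()` are hypothetical helpers operating on a plain array of four entries, and they are not a proposal for how sh_update_cr3() would actually be changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PDPTES   4
#define PTE_PRESENT 0x1ULL

/*
 * Full refresh: every cached slot is overwritten from the guest PDPT, so
 * entries the guest has since cleared are dropped as well.
 * Returns the number of slots that changed.
 */
static unsigned int pdpt_refresh_full(uint64_t cached[NR_PDPTES],
                                      const uint64_t guest[NR_PDPTES])
{
    unsigned int changed = 0;

    assert(cached != NULL && guest != NULL);
    for (unsigned int i = 0; i < NR_PDPTES; i++) {
        if (cached[i] != guest[i]) {
            cached[i] = guest[i];
            changed++;
        }
    }
    return changed;
}

/*
 * Incremental refresh: only slots that were not present in the cache are
 * loaded from the guest PDPT; slots already present are left untouched.
 * Returns the number of slots that were filled.
 */
static unsigned int pdpt_refresh_incremental(uint64_t cached[NR_PDPTES],
                                             const uint64_t guest[NR_PDPTES])
{
    unsigned int filled = 0;

    assert(cached != NULL && guest != NULL);
    for (unsigned int i = 0; i < NR_PDPTES; i++) {
        if (!(cached[i] & PTE_PRESENT) && (guest[i] & PTE_PRESENT)) {
            cached[i] = guest[i];
            filled++;
        }
    }
    return filled;
}
```

With a cached entry that the guest has cleared in memory, the full variant drops it while the incremental variant keeps it, which is the behavioural difference the two replies above disagree on the importance of.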
At 12:37 +0200 on 28 Mar (1680007032), Jan Beulich wrote:
> On 27.03.2023 17:39, Tim Deegan wrote:
> > At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
> >> There's no need for an indirect call here, as the mode is invariant
> >> throughout the entire paging-locked region. All it takes to avoid it is
> >> to have a forward declaration of sh_update_cr3() in place.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> I find this and the respective Win7 related comment suspicious: If we
> >> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> >> the shadow_get_and_create_l1e() rather than exit? The spurious page
> >> fault that the guest observes can, after all, not be known to be non-
> >> fatal inside the guest. That's purely an OS policy.
> >
> > I think it has to be non-fatal because it can happen on real hardware,
> > even if the hardware *does* fill the TLB here (which it is not
> > required to).
>
> But if hardware filled the TLB, it won't raise #PF. If it didn't fill
> the TLB (e.g. because of not even doing a walk when a PDPTE is non-
> present), a #PF would be legitimate, and the handler would then need
> to reload CR3. But such a #PF is what, according to the comment, Win7
> chokes on.

IIRC the Win7 behaviour is to accept and discard the #PF as spurious
(because it sees the PDPTE is present) *without* reloading CR3, so it
gets stuck in a loop taking pagefaults. Here, we reload CR3 for it,
to unstick it.

I can't think of a mental model of the MMU that would have a problem
here -- either the L3Es are special in which case the pagefault is
correct (and we shouldn't even read the new contents) or they're just
like other PTEs in which case the spurious fault is fine.

> > The assert's not true here because the guest can push us down this
> > path by doing exactly what Win 7 does here - loading CR3 with a
> > missing L3E, then filling the L3E and causing a page fault that uses
> > the now-filled L3E. (Or maybe that's not true any more; my mental
> > model of the pagetable walker code might be out of date)
>
> Well, I specifically started the paragraph with 'Beyond the "on demand"
> L3 entry creation'. IOW the assertion would look applicable to me if
> we, as suggested higher up, retried shadow_get_and_create_l1e() and it
> failed again. As the comment there says, "we know the guest entries are
> OK", so the missing L3 entry must have appeared.

Ah, I didn't quite understand you. Yes, if we changed the handler to
rebuild the SL3E then I think the assertion would be valid again.

Cheers,

Tim.
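The "rebuild, retry, and only then assert" control flow agreed on above might look roughly like the following. This is a sketch under stated assumptions rather than a patch: `struct retry_vcpu` and the `retry_*()` functions are hypothetical stand-ins (for the vCPU state, shadow_get_and_create_l1e() and sh_update_cr3() respectively), and the real fault handler has far more state and locking to consider.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for per-vCPU shadow state. */
struct retry_vcpu {
    bool guest_l3e_present;   /* what the guest PDPT says right now */
    bool shadow_l3e_present;  /* what the shadow top level currently holds */
    bool shutting_down;       /* domain is being torn down */
    unsigned int cr3_updates; /* how often the top level was rebuilt */
};

enum retry_result { RETRY_FIXED, RETRY_GIVE_UP };

/* Stand-in for the shadow walk: succeeds only with a present shadow L3E. */
static bool retry_get_l1e(const struct retry_vcpu *v)
{
    return v->shadow_l3e_present;
}

/* Stand-in for the top-level rebuild: resync the shadow L3E from the guest. */
static void retry_update_cr3(struct retry_vcpu *v)
{
    v->shadow_l3e_present = v->guest_l3e_present;
    v->cr3_updates++;
}

/* Try the walk, rebuild the top level once on failure, then try again. */
static enum retry_result retry_handle_fault(struct retry_vcpu *v)
{
    if (retry_get_l1e(v))
        return RETRY_FIXED;

    retry_update_cr3(v);
    if (retry_get_l1e(v))
        return RETRY_FIXED;

    /*
     * The guest walk is assumed to have succeeded before we got here, so
     * after a rebuild the only remaining reason for failure in this toy
     * model is a dying domain.
     */
    assert(v->shutting_down);
    return RETRY_GIVE_UP;
}
```

The point of the structure is that the guest never observes a spurious #PF in the Win7 case, and the assertion is checked only after the retry has failed.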
    --- a/xen/arch/x86/mm/shadow/multi.c
    +++ b/xen/arch/x86/mm/shadow/multi.c
    @@ -91,6 +91,8 @@ const char *const fetch_type_names[] = {
     # define for_each_shadow_table(v, i) for ( (i) = 0; (i) < 1; ++(i) )
     #endif
     
    +static void cf_check sh_update_cr3(struct vcpu *v, int do_locking, bool noflush);
    +
     /* Helper to perform a local TLB flush. */
     static void sh_flush_local(const struct domain *d)
     {
    @@ -2487,7 +2489,7 @@ static int cf_check sh_page_fault(
              * In any case, in the PAE case, the ASSERT is not true; it can
              * happen because of actions the guest is taking. */
     #if GUEST_PAGING_LEVELS == 3
    -        v->arch.paging.mode->update_cr3(v, 0, false);
    +        sh_update_cr3(v, 0, false);
     #else
             ASSERT(d->is_shutting_down);
     #endif
There's no need for an indirect call here, as the mode is invariant
throughout the entire paging-locked region. All it takes to avoid it is
to have a forward declaration of sh_update_cr3() in place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I find this and the respective Win7 related comment suspicious: If we
really need to "fix up" L3 entries "on demand", wouldn't we better retry
the shadow_get_and_create_l1e() rather than exit? The spurious page
fault that the guest observes can, after all, not be known to be
non-fatal inside the guest. That's purely an OS policy.

Furthermore the sh_update_cr3() will also invalidate L3 entries which
were loaded successfully before, but invalidated by the guest
afterwards. I strongly suspect that the described hardware behavior is
_only_ to load previously not-present entries from the PDPT, but not
purge ones already marked present. IOW I think sh_update_cr3() would
need calling in an "incremental" mode here. (The alternative of doing
this in shadow_get_and_create_l3e() instead would likely be more
cumbersome.)

In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
looks wrong.

Beyond the "on demand" L3 entry creation I also can't see what guest
actions could lead to the ASSERT() being inapplicable in the PAE case.
The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
PDPTEs, and all other logic is similar to that for other modes.

(See 89329d832aed ["x86 shadow: Update cr3 in PAE mode when guest walk
succeed but shadow walk fails"].)
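For readers less familiar with the pattern, the following self-contained C sketch shows why a forward declaration is all that is needed to turn the indirect call into a direct one. The `direct_*` names are hypothetical and only mirror the shape of the real code (a per-mode table of function pointers, plus a static handler defined later in the same file); none of this is taken from Xen itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct direct_vcpu;

/* Hypothetical per-mode hook table, mirroring the shape of a paging mode. */
struct direct_paging_mode {
    void (*update_cr3)(struct direct_vcpu *v, int do_locking, bool noflush);
};

struct direct_vcpu {
    const struct direct_paging_mode *mode;
    unsigned int cr3_updates;
};

/*
 * Forward declaration: the function is defined further down, but callers
 * placed earlier in the file can already reference it by name.
 */
static void direct_update_cr3(struct direct_vcpu *v, int do_locking,
                              bool noflush);

static void direct_page_fault(struct direct_vcpu *v)
{
    /*
     * The indirect form would be: v->mode->update_cr3(v, 0, false);
     * Since the mode cannot change while this handler runs, the target is
     * known at compile time and can be called directly.
     */
    direct_update_cr3(v, 0, false);
}

static void direct_update_cr3(struct direct_vcpu *v, int do_locking,
                              bool noflush)
{
    (void)do_locking;
    (void)noflush;
    assert(v != NULL);
    v->cr3_updates++;
}

/* The hook table still points at the same function for other callers. */
static const struct direct_paging_mode direct_mode = {
    .update_cr3 = direct_update_cr3,
};
```

Both paths end up in the same function; the direct call simply avoids loading and branching through the function pointer.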