Message ID | c6f2dcd7-81e6-3fb6-0e13-3ffb95e12bc1@suse.com (mailing list archive)
---|---
State | New, archived
Series | [v2] x86/PoD: defer nested P2M flushes
On Tue, Oct 19, 2021 at 02:52:27PM +0200, Jan Beulich wrote:
> With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
> write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
> order violation when the PoD lock is held around it. Hence such flushing
> needs to be deferred. Steal the approach from p2m_change_type_range().
> (Note that strictly speaking the change at the out_of_memory label is
> not needed, as the domain gets crashed there anyway. The change is being
> made nevertheless to avoid setting up a trap for someone meaning to
> deal with that case better than by domain_crash().)
>
> Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
> ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
> p2m_flush_nestedp2m() invocation conditional. Note that this then also
> alters behavior of p2m_change_type_range() on EPT, deferring the nested
> flushes there as well. I think this should have been that way from the
> introduction of the flag.
>
> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> v2: Also adjust ept_sync_domain_prepare(). Also convert the flush at the
>     out_of_memory label. Extend description to cover these.
>
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
>  {
>      if ( p2m_is_nestedp2m(p2m) )
>          ept = &p2m_get_hostp2m(d)->ept;
> -    else
> +    else if ( !p2m->defer_nested_flush )
>          p2m_flush_nestedp2m(d);

I find this model slightly concerning, as we don't actually notify the
caller that a nested flush has been deferred, so we must make sure that
whoever sets defer_nested_flush also performs a flush unconditionally
when clearing the flag.

Thanks, Roger.
On 19.10.2021 15:53, Roger Pau Monné wrote:
> On Tue, Oct 19, 2021 at 02:52:27PM +0200, Jan Beulich wrote:
>> With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
>> write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
>> order violation when the PoD lock is held around it. Hence such flushing
>> needs to be deferred. Steal the approach from p2m_change_type_range().
>> (Note that strictly speaking the change at the out_of_memory label is
>> not needed, as the domain gets crashed there anyway. The change is being
>> made nevertheless to avoid setting up a trap for someone meaning to
>> deal with that case better than by domain_crash().)
>>
>> Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
>> ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
>> p2m_flush_nestedp2m() invocation conditional. Note that this then also
>> alters behavior of p2m_change_type_range() on EPT, deferring the nested
>> flushes there as well. I think this should have been that way from the
>> introduction of the flag.
>>
>> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
>>  {
>>      if ( p2m_is_nestedp2m(p2m) )
>>          ept = &p2m_get_hostp2m(d)->ept;
>> -    else
>> +    else if ( !p2m->defer_nested_flush )
>>          p2m_flush_nestedp2m(d);
>
> I find this model slightly concerning, as we don't actually notify the
> caller that a nested flush has been deferred, so we must make sure that
> whoever sets defer_nested_flush also performs a flush unconditionally
> when clearing the flag.

Well, this _is_ the model used for now. Until this change there was
just a single party setting the flag. And like here, any new party
setting the flag will also need to invoke a flush upon clearing it.
It's not clear to me what alternative model you may have in mind.

Jan
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, October 19, 2021 8:52 PM
>
> With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
> write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
> order violation when the PoD lock is held around it. Hence such flushing
> needs to be deferred. Steal the approach from p2m_change_type_range().
> (Note that strictly speaking the change at the out_of_memory label is
> not needed, as the domain gets crashed there anyway. The change is being
> made nevertheless to avoid setting up a trap for someone meaning to
> deal with that case better than by domain_crash().)
>
> Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
> ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
> p2m_flush_nestedp2m() invocation conditional. Note that this then also
> alters behavior of p2m_change_type_range() on EPT, deferring the nested
> flushes there as well. I think this should have been that way from the
> introduction of the flag.
>
> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v2: Also adjust ept_sync_domain_prepare(). Also convert the flush at the
>     out_of_memory label. Extend description to cover these.
>
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
>  {
>      if ( p2m_is_nestedp2m(p2m) )
>          ept = &p2m_get_hostp2m(d)->ept;
> -    else
> +    else if ( !p2m->defer_nested_flush )
>          p2m_flush_nestedp2m(d);
>  }
>
> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -24,6 +24,7 @@
>  #include <xen/mm.h>
>  #include <xen/sched.h>
>  #include <xen/trace.h>
> +#include <asm/hvm/nestedhvm.h>
>  #include <asm/page.h>
>  #include <asm/paging.h>
>  #include <asm/p2m.h>
> @@ -494,6 +495,13 @@ p2m_pod_offline_or_broken_replace(struct
>  static int
>  p2m_pod_zero_check_superpage(struct p2m_domain *p2m, gfn_t gfn);
>
> +static void pod_unlock_and_flush(struct p2m_domain *p2m)
> +{
> +    pod_unlock(p2m);
> +    p2m->defer_nested_flush = false;
> +    if ( nestedhvm_enabled(p2m->domain) )
> +        p2m_flush_nestedp2m(p2m->domain);
> +}
>
>  /*
>   * This function is needed for two reasons:
> @@ -514,6 +522,7 @@ p2m_pod_decrease_reservation(struct doma
>
>      gfn_lock(p2m, gfn, order);
>      pod_lock(p2m);
> +    p2m->defer_nested_flush = true;
>
>      /*
>       * If we don't have any outstanding PoD entries, let things take their
> @@ -665,7 +674,7 @@ out_entry_check:
>      }
>
>  out_unlock:
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
>      gfn_unlock(p2m, gfn, order);
>      return ret;
>  }
> @@ -1144,8 +1153,10 @@ p2m_pod_demand_populate(struct p2m_domai
>       * won't start until we're done.
>       */
>      if ( unlikely(d->is_dying) )
> -        goto out_fail;
> -
> +    {
> +        pod_unlock(p2m);
> +        return false;
> +    }
>
>      /*
>       * Because PoD does not have cache list for 1GB pages, it has to remap
> @@ -1167,6 +1178,8 @@ p2m_pod_demand_populate(struct p2m_domai
>                        p2m_populate_on_demand, p2m->default_access);
>      }
>
> +    p2m->defer_nested_flush = true;
> +
>      /* Only reclaim if we're in actual need of more cache. */
>      if ( p2m->pod.entry_count > p2m->pod.count )
>          pod_eager_reclaim(p2m);
> @@ -1229,22 +1242,25 @@ p2m_pod_demand_populate(struct p2m_domai
>          __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
>      }
>
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
>      return true;
> +
>  out_of_memory:
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
>
>      printk("%s: Dom%d out of PoD memory! (tot=%"PRIu32" ents=%ld dom%d)\n",
>             __func__, d->domain_id, domain_tot_pages(d),
>             p2m->pod.entry_count, current->domain->domain_id);
>      domain_crash(d);
>      return false;
> +
>  out_fail:
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
>      return false;
> +
>  remap_and_retry:
>      BUG_ON(order != PAGE_ORDER_2M);
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
>
>      /*
>       * Remap this 2-meg region in singleton chunks. See the comment on the
On Tue, Oct 19, 2021 at 05:06:27PM +0200, Jan Beulich wrote:
> On 19.10.2021 15:53, Roger Pau Monné wrote:
> > On Tue, Oct 19, 2021 at 02:52:27PM +0200, Jan Beulich wrote:
> >> With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
> >> write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
> >> order violation when the PoD lock is held around it. Hence such flushing
> >> needs to be deferred. Steal the approach from p2m_change_type_range().
> >> (Note that strictly speaking the change at the out_of_memory label is
> >> not needed, as the domain gets crashed there anyway. The change is being
> >> made nevertheless to avoid setting up a trap for someone meaning to
> >> deal with that case better than by domain_crash().)
> >>
> >> Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
> >> ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
> >> p2m_flush_nestedp2m() invocation conditional. Note that this then also
> >> alters behavior of p2m_change_type_range() on EPT, deferring the nested
> >> flushes there as well. I think this should have been that way from the
> >> introduction of the flag.
> >>
> >> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>
> Thanks.
>
> >> --- a/xen/arch/x86/mm/p2m-ept.c
> >> +++ b/xen/arch/x86/mm/p2m-ept.c
> >> @@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
> >>  {
> >>      if ( p2m_is_nestedp2m(p2m) )
> >>          ept = &p2m_get_hostp2m(d)->ept;
> >> -    else
> >> +    else if ( !p2m->defer_nested_flush )
> >>          p2m_flush_nestedp2m(d);
> >
> > I find this model slightly concerning, as we don't actually notify the
> > caller that a nested flush has been deferred, so we must make sure that
> > whoever sets defer_nested_flush also performs a flush unconditionally
> > when clearing the flag.
>
> Well, this _is_ the model used for now. Until this change there was
> just a single party setting the flag. And like here, any new party
> setting the flag will also need to invoke a flush upon clearing it.
> It's not clear to me what alternative model you may have in mind.

I was mostly thinking of something similar to the defer_flush/need_flush
pair, where the need for a flush is signaled in need_flush, so setting
defer_flush doesn't automatically imply a flush when clearing it.

Anyway, this is simple enough that it doesn't warrant the use of the
more complicated logic.

Thanks, Roger.
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
 {
     if ( p2m_is_nestedp2m(p2m) )
         ept = &p2m_get_hostp2m(d)->ept;
-    else
+    else if ( !p2m->defer_nested_flush )
         p2m_flush_nestedp2m(d);
 }

--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -24,6 +24,7 @@
 #include <xen/mm.h>
 #include <xen/sched.h>
 #include <xen/trace.h>
+#include <asm/hvm/nestedhvm.h>
 #include <asm/page.h>
 #include <asm/paging.h>
 #include <asm/p2m.h>
@@ -494,6 +495,13 @@ p2m_pod_offline_or_broken_replace(struct
 static int
 p2m_pod_zero_check_superpage(struct p2m_domain *p2m, gfn_t gfn);

+static void pod_unlock_and_flush(struct p2m_domain *p2m)
+{
+    pod_unlock(p2m);
+    p2m->defer_nested_flush = false;
+    if ( nestedhvm_enabled(p2m->domain) )
+        p2m_flush_nestedp2m(p2m->domain);
+}

 /*
  * This function is needed for two reasons:
@@ -514,6 +522,7 @@ p2m_pod_decrease_reservation(struct doma

     gfn_lock(p2m, gfn, order);
     pod_lock(p2m);
+    p2m->defer_nested_flush = true;

     /*
      * If we don't have any outstanding PoD entries, let things take their
@@ -665,7 +674,7 @@ out_entry_check:
     }

 out_unlock:
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);
     gfn_unlock(p2m, gfn, order);
     return ret;
 }
@@ -1144,8 +1153,10 @@ p2m_pod_demand_populate(struct p2m_domai
      * won't start until we're done.
      */
     if ( unlikely(d->is_dying) )
-        goto out_fail;
-
+    {
+        pod_unlock(p2m);
+        return false;
+    }

     /*
      * Because PoD does not have cache list for 1GB pages, it has to remap
@@ -1167,6 +1178,8 @@ p2m_pod_demand_populate(struct p2m_domai
                       p2m_populate_on_demand, p2m->default_access);
     }

+    p2m->defer_nested_flush = true;
+
     /* Only reclaim if we're in actual need of more cache. */
     if ( p2m->pod.entry_count > p2m->pod.count )
         pod_eager_reclaim(p2m);
@@ -1229,22 +1242,25 @@ p2m_pod_demand_populate(struct p2m_domai
         __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
     }

-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);
     return true;
+
 out_of_memory:
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);

     printk("%s: Dom%d out of PoD memory! (tot=%"PRIu32" ents=%ld dom%d)\n",
            __func__, d->domain_id, domain_tot_pages(d),
            p2m->pod.entry_count, current->domain->domain_id);
     domain_crash(d);
     return false;
+
 out_fail:
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);
     return false;
+
 remap_and_retry:
     BUG_ON(order != PAGE_ORDER_2M);
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);

     /*
      * Remap this 2-meg region in singleton chunks. See the comment on the
With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
order violation when the PoD lock is held around it. Hence such flushing
needs to be deferred. Steal the approach from p2m_change_type_range().
(Note that strictly speaking the change at the out_of_memory label is
not needed, as the domain gets crashed there anyway. The change is being
made nevertheless to avoid setting up a trap for someone meaning to
deal with that case better than by domain_crash().)

Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
p2m_flush_nestedp2m() invocation conditional. Note that this then also
alters behavior of p2m_change_type_range() on EPT, deferring the nested
flushes there as well. I think this should have been that way from the
introduction of the flag.

Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Also adjust ept_sync_domain_prepare(). Also convert the flush at the
    out_of_memory label. Extend description to cover these.