Message ID | 77acf62cab293ae99bd1fc079e1b0853faaf1242.1576154413.git.hongyxia@amazon.com (mailing list archive)
---|---
State | New, archived
Series | Refactor super page shattering
On 12/12/2019 12:46, Hongyan Xia wrote:
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 7d4dd80a85..8def4fb8d8 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -5151,6 +5151,52 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
>          flush_area_local((const void *)v, f) : \
>          flush_area_all((const void *)v, f))
>  
> +/* Shatter an l3 entry and populate l2. If virt is passed in, also do flush. */
> +static bool shatter_l3e(l3_pgentry_t *pl3e, unsigned long virt, bool locking)
> +{
> +    unsigned int i;
> +    l3_pgentry_t ol3e = *pl3e;
> +    l2_pgentry_t l2e = l2e_from_intpte(l3e_get_intpte(ol3e));
> +    l2_pgentry_t *l2t = alloc_xen_pagetable();
> +
> +    if ( !l2t )
> +        return false;
> +
> +    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
> +    {
> +        l2e_write(l2t + i, l2e);
> +        l2e = l2e_from_intpte(
> +            l2e_get_intpte(l2e) + (PAGE_SIZE << PAGETABLE_ORDER));
> +    }
> +
> +    if ( locking )
> +        spin_lock(&map_pgdir_lock);
> +    if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
> +         (l3e_get_flags(*pl3e) & _PAGE_PSE) )
> +    {
> +        l3e_write_atomic(pl3e,
> +            l3e_from_paddr(virt_to_maddr(l2t), __PAGE_HYPERVISOR));
> +        l2t = NULL;
> +    }
> +    if ( locking )
> +        spin_unlock(&map_pgdir_lock);
> +
> +    if ( virt )
> +    {
> +        unsigned int flush_flags =
> +            FLUSH_TLB | FLUSH_ORDER(2 * PAGETABLE_ORDER);
> +
> +        if ( l3e_get_flags(ol3e) & _PAGE_GLOBAL )
> +            flush_flags |= FLUSH_TLB_GLOBAL;

Another problematic use of ol3e which is racy on conflict.  You need to
strictly use the content of *pl3e from within the locked region.

However, why have you moved the flushing in here?  Only one of the two
callers actually wanted it, and even then I'm not totally sure it is
necessary.

Both callers operate on an arbitrary range of addresses, and for
anything other than a singleton update, will want to issue a single
flush at the end, rather than a spate of flushes for sub-areas.

(Although someone really please check my reasoning here for the
map_pages_to_xen() case which currently does have sub-area flushing.)

Either the flush wants dropping (and best via a prereq patch altering
map_pages_to_xen()), or you need to cache ol3e in the locked region with
ACCESS_ONCE() or equivalent.

> +        flush_area(virt, flush_flags);
> +    }
> +
> +    if ( l2t )
> +        free_xen_pagetable(l2t);

Mind annotating this as:

    if ( l2t ) /* Raced on trying to shatter?  Throw away our work. */

to highlight that this is an error path, and there is no connection
between the TLB flushing and pagetable freeing.

~Andrew
On 13.12.2019 15:19, Andrew Cooper wrote:
> On 12/12/2019 12:46, Hongyan Xia wrote:
>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>> index 7d4dd80a85..8def4fb8d8 100644
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -5151,6 +5151,52 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
>>          flush_area_local((const void *)v, f) : \
>>          flush_area_all((const void *)v, f))
>>  
>> +/* Shatter an l3 entry and populate l2. If virt is passed in, also do flush. */
>> +static bool shatter_l3e(l3_pgentry_t *pl3e, unsigned long virt, bool locking)
>> +{
>> +    unsigned int i;
>> +    l3_pgentry_t ol3e = *pl3e;
>> +    l2_pgentry_t l2e = l2e_from_intpte(l3e_get_intpte(ol3e));
>> +    l2_pgentry_t *l2t = alloc_xen_pagetable();
>> +
>> +    if ( !l2t )
>> +        return false;
>> +
>> +    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
>> +    {
>> +        l2e_write(l2t + i, l2e);
>> +        l2e = l2e_from_intpte(
>> +            l2e_get_intpte(l2e) + (PAGE_SIZE << PAGETABLE_ORDER));
>> +    }
>> +
>> +    if ( locking )
>> +        spin_lock(&map_pgdir_lock);
>> +    if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
>> +         (l3e_get_flags(*pl3e) & _PAGE_PSE) )
>> +    {
>> +        l3e_write_atomic(pl3e,
>> +            l3e_from_paddr(virt_to_maddr(l2t), __PAGE_HYPERVISOR));
>> +        l2t = NULL;
>> +    }
>> +    if ( locking )
>> +        spin_unlock(&map_pgdir_lock);
>> +
>> +    if ( virt )
>> +    {
>> +        unsigned int flush_flags =
>> +            FLUSH_TLB | FLUSH_ORDER(2 * PAGETABLE_ORDER);
>> +
>> +        if ( l3e_get_flags(ol3e) & _PAGE_GLOBAL )
>> +            flush_flags |= FLUSH_TLB_GLOBAL;
>
> Another problematic use of ol3e which is racy on conflict.  You need to
> strictly use the content of *pl3e from within the locked region.

But this isn't a problem introduced here, i.e. fixing of it doesn't
strictly fall under "re-factor". (I'm certainly not opposed to
getting this right at the same time.)

> However, why have you moved the flushing in here?  Only one of the two
> callers actually wanted it, and even then I'm not totally sure it is
> necessary.
>
> Both callers operate on an arbitrary range of addresses, and for
> anything other than a singleton update, will want to issue a single
> flush at the end, rather than a spate of flushes for sub-areas.
>
> (Although someone really please check my reasoning here for the
> map_pages_to_xen() case which currently does have sub-area flushing.)
>
> Either the flush wants dropping (and best via a prereq patch altering
> map_pages_to_xen()), or you need to cache ol3e in the locked region with
> ACCESS_ONCE() or equivalent.

Well, at best replacing by a single one at the end, but I guess
the current piecemeal behavior is to cope with error paths (see
Julien's report against modify_xen_mappings(), where it's
exactly the other way around). Considering especially speculative
accesses I think it isn't the worst idea to keep the window small
between shatter and flush (short of us doing a proper break-then-
make sequence).

Jan
On 13/12/2019 14:36, Jan Beulich wrote:
> On 13.12.2019 15:19, Andrew Cooper wrote:
>> On 12/12/2019 12:46, Hongyan Xia wrote:
>>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>>> index 7d4dd80a85..8def4fb8d8 100644
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -5151,6 +5151,52 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
>>>          flush_area_local((const void *)v, f) : \
>>>          flush_area_all((const void *)v, f))
>>>  
>>> +/* Shatter an l3 entry and populate l2. If virt is passed in, also do flush. */
>>> +static bool shatter_l3e(l3_pgentry_t *pl3e, unsigned long virt, bool locking)
>>> +{
>>> +    unsigned int i;
>>> +    l3_pgentry_t ol3e = *pl3e;
>>> +    l2_pgentry_t l2e = l2e_from_intpte(l3e_get_intpte(ol3e));
>>> +    l2_pgentry_t *l2t = alloc_xen_pagetable();
>>> +
>>> +    if ( !l2t )
>>> +        return false;
>>> +
>>> +    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
>>> +    {
>>> +        l2e_write(l2t + i, l2e);
>>> +        l2e = l2e_from_intpte(
>>> +            l2e_get_intpte(l2e) + (PAGE_SIZE << PAGETABLE_ORDER));
>>> +    }
>>> +
>>> +    if ( locking )
>>> +        spin_lock(&map_pgdir_lock);
>>> +    if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
>>> +         (l3e_get_flags(*pl3e) & _PAGE_PSE) )
>>> +    {
>>> +        l3e_write_atomic(pl3e,
>>> +            l3e_from_paddr(virt_to_maddr(l2t), __PAGE_HYPERVISOR));
>>> +        l2t = NULL;
>>> +    }
>>> +    if ( locking )
>>> +        spin_unlock(&map_pgdir_lock);
>>> +
>>> +    if ( virt )
>>> +    {
>>> +        unsigned int flush_flags =
>>> +            FLUSH_TLB | FLUSH_ORDER(2 * PAGETABLE_ORDER);
>>> +
>>> +        if ( l3e_get_flags(ol3e) & _PAGE_GLOBAL )
>>> +            flush_flags |= FLUSH_TLB_GLOBAL;
>>
>> Another problematic use of ol3e which is racy on conflict.  You need to
>> strictly use the content of *pl3e from within the locked region.
>
> But this isn't a problem introduced here, i.e. fixing of it doesn't
> strictly fall under "re-factor". (I'm certainly not opposed to
> getting this right at the same time.)

It is brand new code which is racy.  It's either not necessary, or an
XSA-in-waiting.  (And not necessary, AFAICT).

>> However, why have you moved the flushing in here?  Only one of the two
>> callers actually wanted it, and even then I'm not totally sure it is
>> necessary.
>>
>> Both callers operate on an arbitrary range of addresses, and for
>> anything other than a singleton update, will want to issue a single
>> flush at the end, rather than a spate of flushes for sub-areas.
>>
>> (Although someone really please check my reasoning here for the
>> map_pages_to_xen() case which currently does have sub-area flushing.)
>>
>> Either the flush wants dropping (and best via a prereq patch altering
>> map_pages_to_xen()), or you need to cache ol3e in the locked region with
>> ACCESS_ONCE() or equivalent.
>
> Well, at best replacing by a single one at the end, but I guess
> the current piecemeal behavior is to cope with error paths (see
> Julien's report against modify_xen_mappings(), where it's
> exactly the other way around). Considering especially speculative
> accesses I think it isn't the worst idea to keep the window small
> between shatter and flush (short of us doing a proper break-then-
> make sequence).

Every sub-flush is a broadcast IPI, which is a scalability concern, and
at least needs considering.

x86 is designed not to need BBM, although the BBM sequence can be
helpful at times to simplify other reasoning.  It might actually be
necessary in some SMP cases.

Speculation can bite you at any point, including the very next
instruction.  The logic is either correct, or not correct, and the
distance between the PTE update and the flush is only relevant when it
comes to the scarcity of the incorrect case manifesting in a noticeable
way.

Fundamentally, we either need BBM, or the flush is safe to defer to the
end.  Everything in-between is racy, and dropping the sub-flushes would
make any incorrect cases more likely to manifest.

~Andrew
Hi,

On Fri, 2019-12-13 at 14:19 +0000, Andrew Cooper wrote:
> On 12/12/2019 12:46, Hongyan Xia wrote:
> > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > index 7d4dd80a85..8def4fb8d8 100644
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -5151,6 +5151,52 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
> >          flush_area_local((const void *)v, f) : \
> >          flush_area_all((const void *)v, f))
> >  
> > +/* Shatter an l3 entry and populate l2. If virt is passed in, also do flush. */
> > +static bool shatter_l3e(l3_pgentry_t *pl3e, unsigned long virt, bool locking)
> > +{
> > +    unsigned int i;
> > +    l3_pgentry_t ol3e = *pl3e;
> > +    l2_pgentry_t l2e = l2e_from_intpte(l3e_get_intpte(ol3e));
> > +    l2_pgentry_t *l2t = alloc_xen_pagetable();
> > +
> > +    if ( !l2t )
> > +        return false;
> > +
> > +    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
> > +    {
> > +        l2e_write(l2t + i, l2e);
> > +        l2e = l2e_from_intpte(
> > +            l2e_get_intpte(l2e) + (PAGE_SIZE << PAGETABLE_ORDER));
> > +    }
> > +
> > +    if ( locking )
> > +        spin_lock(&map_pgdir_lock);
> > +    if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
> > +         (l3e_get_flags(*pl3e) & _PAGE_PSE) )
> > +    {
> > +        l3e_write_atomic(pl3e,
> > +            l3e_from_paddr(virt_to_maddr(l2t), __PAGE_HYPERVISOR));
> > +        l2t = NULL;
> > +    }
> > +    if ( locking )
> > +        spin_unlock(&map_pgdir_lock);
> > +
> > +    if ( virt )
> > +    {
> > +        unsigned int flush_flags =
> > +            FLUSH_TLB | FLUSH_ORDER(2 * PAGETABLE_ORDER);
> > +
> > +        if ( l3e_get_flags(ol3e) & _PAGE_GLOBAL )
> > +            flush_flags |= FLUSH_TLB_GLOBAL;
>
> Another problematic use of ol3e which is racy on conflict.  You need to
> strictly use the content of *pl3e from within the locked region.
>

This is actually just refactoring, although if the original code is
wrong, it is also worth fixing.

In fact, in the last couple of days, the more I read the code in
map_pages_to_xen, the more I am worried about its race conditions and
correctness. The lock is mostly used for writes, so there are so many
reads outside the locked region which could potentially read stale
values. One example I found is (after refactoring):

    else if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
    {
        ...
        /* Pass virt to indicate we need to flush. */
        if ( !shatter_l2e(pl2e, virt, locking) )
            return -ENOMEM;
    }

    pl1e = l2e_to_l1e(*pl2e) + l1_table_offset(virt);

It tries to shatter an l2 page before accessing a pl1e, but is there
any guard between the shattering and the read of pl1e? If another call
comes in between the two and merges this page back to a superpage, the
pl1e then accesses the superpage memory instead of a PTE page! (Please
check my logic.) Also in other places, we see the races between PTE
modifications and flushes. There could be more examples like this.

Of course, removing the code for merging can avoid a lot of the
problems, although Julien explained to me that it could be useful
during boot. If removing is not an option, is it a big problem to
extend the lock, e.g., to the whole function? It is mostly just used by
vmap after boot, and a larger lock can simplify this function and its
logic significantly. vmap is already taking other global locks before
map_pages_to_xen anyway though.

Hongyan
On Fri, 2019-12-13 at 14:58 +0000, Andrew Cooper wrote:
> On 13/12/2019 14:36, Jan Beulich wrote:
> > On 13.12.2019 15:19, Andrew Cooper wrote:
> > > On 12/12/2019 12:46, Hongyan Xia wrote:
> > > > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > > > index 7d4dd80a85..8def4fb8d8 100644
> > > > --- a/xen/arch/x86/mm.c
> > > > +++ b/xen/arch/x86/mm.c
> > > > @@ -5151,6 +5151,52 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
> > > >          flush_area_local((const void *)v, f) : \
> > > >          flush_area_all((const void *)v, f))
> > > >  
> > > > +/* Shatter an l3 entry and populate l2. If virt is passed in, also do flush. */
> > > > +static bool shatter_l3e(l3_pgentry_t *pl3e, unsigned long virt, bool locking)
> > > > +{
> > > > +    unsigned int i;
> > > > +    l3_pgentry_t ol3e = *pl3e;
> > > > +    l2_pgentry_t l2e = l2e_from_intpte(l3e_get_intpte(ol3e));
> > > > +    l2_pgentry_t *l2t = alloc_xen_pagetable();
> > > > +
> > > > +    if ( !l2t )
> > > > +        return false;
> > > > +
> > > > +    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
> > > > +    {
> > > > +        l2e_write(l2t + i, l2e);
> > > > +        l2e = l2e_from_intpte(
> > > > +            l2e_get_intpte(l2e) + (PAGE_SIZE << PAGETABLE_ORDER));
> > > > +    }
> > > > +
> > > > +    if ( locking )
> > > > +        spin_lock(&map_pgdir_lock);
> > > > +    if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
> > > > +         (l3e_get_flags(*pl3e) & _PAGE_PSE) )
> > > > +    {
> > > > +        l3e_write_atomic(pl3e,
> > > > +            l3e_from_paddr(virt_to_maddr(l2t), __PAGE_HYPERVISOR));
> > > > +        l2t = NULL;
> > > > +    }
> > > > +    if ( locking )
> > > > +        spin_unlock(&map_pgdir_lock);
> > > > +
> > > > +    if ( virt )
> > > > +    {
> > > > +        unsigned int flush_flags =
> > > > +            FLUSH_TLB | FLUSH_ORDER(2 * PAGETABLE_ORDER);
> > > > +
> > > > +        if ( l3e_get_flags(ol3e) & _PAGE_GLOBAL )
> > > > +            flush_flags |= FLUSH_TLB_GLOBAL;
> > >
> > > Another problematic use of ol3e which is racy on conflict.  You
> > > need to strictly use the content of *pl3e from within the locked
> > > region.
> >
> > But this isn't a problem introduced here, i.e. fixing of it doesn't
> > strictly fall under "re-factor". (I'm certainly not opposed to
> > getting this right at the same time.)
>
> It is brand new code which is racy.  Its either not necessary, or an
> XSA-in-waiting.  (And not necessary, AFAICT).
>

I am really confused. The original code already does sub-region
flushes, and it uses a flush flag from ol3e that is even more outdated
than the refactored version, so I am not quite getting your point. I
hope I am not missing something obvious.

Hongyan
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 7d4dd80a85..8def4fb8d8 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5151,6 +5151,52 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
         flush_area_local((const void *)v, f) : \
         flush_area_all((const void *)v, f))
 
+/* Shatter an l3 entry and populate l2. If virt is passed in, also do flush. */
+static bool shatter_l3e(l3_pgentry_t *pl3e, unsigned long virt, bool locking)
+{
+    unsigned int i;
+    l3_pgentry_t ol3e = *pl3e;
+    l2_pgentry_t l2e = l2e_from_intpte(l3e_get_intpte(ol3e));
+    l2_pgentry_t *l2t = alloc_xen_pagetable();
+
+    if ( !l2t )
+        return false;
+
+    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
+    {
+        l2e_write(l2t + i, l2e);
+        l2e = l2e_from_intpte(
+            l2e_get_intpte(l2e) + (PAGE_SIZE << PAGETABLE_ORDER));
+    }
+
+    if ( locking )
+        spin_lock(&map_pgdir_lock);
+    if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
+         (l3e_get_flags(*pl3e) & _PAGE_PSE) )
+    {
+        l3e_write_atomic(pl3e,
+            l3e_from_paddr(virt_to_maddr(l2t), __PAGE_HYPERVISOR));
+        l2t = NULL;
+    }
+    if ( locking )
+        spin_unlock(&map_pgdir_lock);
+
+    if ( virt )
+    {
+        unsigned int flush_flags =
+            FLUSH_TLB | FLUSH_ORDER(2 * PAGETABLE_ORDER);
+
+        if ( l3e_get_flags(ol3e) & _PAGE_GLOBAL )
+            flush_flags |= FLUSH_TLB_GLOBAL;
+        flush_area(virt, flush_flags);
+    }
+
+    if ( l2t )
+        free_xen_pagetable(l2t);
+
+    return true;
+}
+
 int map_pages_to_xen(
     unsigned long virt,
     mfn_t mfn,
@@ -5244,9 +5290,6 @@ int map_pages_to_xen(
             if ( (l3e_get_flags(ol3e) & _PAGE_PRESENT) &&
                  (l3e_get_flags(ol3e) & _PAGE_PSE) )
             {
-                unsigned int flush_flags =
-                    FLUSH_TLB | FLUSH_ORDER(2 * PAGETABLE_ORDER);
-
                 /* Skip this PTE if there is no change. */
                 if ( ((l3e_get_pfn(ol3e) & ~(L2_PAGETABLE_ENTRIES *
                                              L1_PAGETABLE_ENTRIES - 1)) +
@@ -5267,33 +5310,9 @@ int map_pages_to_xen(
                     continue;
                 }
 
-                pl2e = alloc_xen_pagetable();
-                if ( pl2e == NULL )
+                /* Pass virt to indicate we need to flush. */
+                if ( !shatter_l3e(pl3e, virt, locking) )
                     return -ENOMEM;
-
-                for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
-                    l2e_write(pl2e + i,
-                              l2e_from_pfn(l3e_get_pfn(ol3e) +
-                                           (i << PAGETABLE_ORDER),
-                                           l3e_get_flags(ol3e)));
-
-                if ( l3e_get_flags(ol3e) & _PAGE_GLOBAL )
-                    flush_flags |= FLUSH_TLB_GLOBAL;
-
-                if ( locking )
-                    spin_lock(&map_pgdir_lock);
-                if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
-                     (l3e_get_flags(*pl3e) & _PAGE_PSE) )
-                {
-                    l3e_write_atomic(pl3e, l3e_from_mfn(virt_to_mfn(pl2e),
-                                                        __PAGE_HYPERVISOR));
-                    pl2e = NULL;
-                }
-                if ( locking )
-                    spin_unlock(&map_pgdir_lock);
-                flush_area(virt, flush_flags);
-                if ( pl2e )
-                    free_xen_pagetable(pl2e);
             }
 
             pl2e = virt_to_xen_l2e(virt);
@@ -5578,27 +5597,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             }
 
             /* PAGE1GB: shatter the superpage and fall through. */
-            pl2e = alloc_xen_pagetable();
-            if ( !pl2e )
+            if ( !shatter_l3e(pl3e, 0, locking) )
                 return -ENOMEM;
-            for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
-                l2e_write(pl2e + i,
-                          l2e_from_pfn(l3e_get_pfn(*pl3e) +
-                                       (i << PAGETABLE_ORDER),
-                                       l3e_get_flags(*pl3e)));
-            if ( locking )
-                spin_lock(&map_pgdir_lock);
-            if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
-                 (l3e_get_flags(*pl3e) & _PAGE_PSE) )
-            {
-                l3e_write_atomic(pl3e, l3e_from_mfn(virt_to_mfn(pl2e),
-                                                    __PAGE_HYPERVISOR));
-                pl2e = NULL;
-            }
-            if ( locking )
-                spin_unlock(&map_pgdir_lock);
-            if ( pl2e )
-                free_xen_pagetable(pl2e);
         }
 
         /*
map_pages_to_xen and modify_xen_mappings are performing almost exactly
the same operations when shattering an l3 PTE, the only difference
being whether we want to flush.

Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
---
Changes in v4:
- use false/true instead of -1/0 to indicate failure/success.
- remove unnecessary cast.

Changes in v3:
- style and indentation changes.
- return -ENOMEM instead of -1.

Changes in v2:
- improve asm.
- re-read pl3e from memory when taking the lock.
- move the allocation of l2t inside the shatter function.
---
 xen/arch/x86/mm.c | 98 +++++++++++++++++++++++------------------------
 1 file changed, 49 insertions(+), 49 deletions(-)