Message ID | 20200207155701.2781820-2-dwmw2@infradead.org (mailing list archive)
---|---
State | New, archived
Series | [1/2] xen/mm: fold PGC_broken into PGC_state bits
On Fri, 2020-02-07 at 15:57 +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> ...
>
> Finally, make free_xenheap_pages() and free_domheap_pages() call
> through to init_heap_pages() instead of directly to free_heap_pages()
> in the case where pages are being freed which have never passed
> through init_heap_pages().
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

If both end up calling free_heap_pages, can we just put the check there?

Hongyan
On 7 February 2020 16:30:04 GMT, "Xia, Hongyan" <hongyxia@amazon.com> wrote:
> On Fri, 2020-02-07 at 15:57 +0000, David Woodhouse wrote:
>> From: David Woodhouse <dwmw@amazon.co.uk>
>>
>> ...
>>
>> Finally, make free_xenheap_pages() and free_domheap_pages() call
>> through to init_heap_pages() instead of directly to free_heap_pages()
>> in the case where pages are being freed which have never passed
>> through init_heap_pages().
>>
>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>
> If both end up calling free_heap_pages, can we just put the check
> there?

Ideally not, because init_heap_pages() then calls free_heap_pages() and
the "recursion" is best avoided.
On Fri, 2020-02-07 at 16:32 +0000, David Woodhouse wrote:
>
> ...
>
> Ideally not because init_heap_pages() then calls free_heap_pages()
> and the "recursion" is best avoided.

Kind of depends on where we change its pg to initialised. If we do that
in free_heap_pages this does recurse, but if it is done in
init_heap_pages before calling free it does not.

Hongyan
On 7 February 2020 16:40:01 GMT, "Xia, Hongyan" <hongyxia@amazon.com> wrote:
> On Fri, 2020-02-07 at 16:32 +0000, David Woodhouse wrote:
>>
>> ...
>>
>> Ideally not because init_heap_pages() then calls free_heap_pages()
>> and the "recursion" is best avoided.
>
> Kind of depends on where we change its pg to initialised. If we do that
> in free_heap_pages this does recurse, but if it is done in
> init_heap_pages before calling free it does not.

True. It would *look* like it might recurse, but never should in practice.
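The termination argument in this exchange fits in a few lines of C. Below is a
simplified, self-contained sketch of the interplay being discussed; struct
page_info and the PGC_state_* values here are stand-ins rather than the real
Xen definitions, and the real functions take more parameters and do far more
work:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the Xen definitions involved; purely illustrative. */
#define PGC_state_uninitialised 0u
#define PGC_state_inuse         1u

struct page_info {
    unsigned int count_info;            /* holds just the state in this sketch */
};

static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
                            bool scrub);

static void free_heap_pages(struct page_info *pg, unsigned int order,
                            bool need_scrub)
{
    if ( pg->count_info == PGC_state_uninitialised )
    {
        /*
         * Bounce the whole range through init_heap_pages().  This cannot
         * recurse a second time: init_heap_pages() marks each page as
         * PGC_state_inuse before calling back in, so this branch is not
         * taken on the second pass.
         */
        init_heap_pages(pg, 1UL << order, need_scrub);
        return;
    }

    printf("normal buddy free path (scrub=%d)\n", (int)need_scrub);
}

static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
                            bool scrub)
{
    for ( unsigned long i = 0; i < nr_pages; i++ )
    {
        pg[i].count_info = PGC_state_inuse;   /* initialised first... */
        free_heap_pages(pg + i, 0, scrub);    /* ...then freed, one page at a time */
    }
}

int main(void)
{
    struct page_info pages[4] = {{ 0 }};      /* all start out "uninitialised" */

    free_heap_pages(pages, 2, false);         /* one bounce, then four normal frees */
    return 0;
}

The ordering inside init_heap_pages() is what prevents unbounded recursion:
each page is marked PGC_state_inuse before free_heap_pages() is re-entered, so
the uninitialised branch can be taken at most once per range.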
On Fri, 2020-02-07 at 17:06 +0000, David Woodhouse wrote:
> On 7 February 2020 16:40:01 GMT, "Xia, Hongyan" <hongyxia@amazon.com> wrote:
> > On Fri, 2020-02-07 at 16:32 +0000, David Woodhouse wrote:
> > >
> > > ...
> > >
> > > Ideally not because init_heap_pages() then calls free_heap_pages()
> > > and the "recursion" is best avoided.
> >
> > Kind of depends on where we change its pg to initialised. If we do that
> > in free_heap_pages this does recurse, but if it is done in
> > init_heap_pages before calling free it does not.
>
> True. It would *look* like it might recurse, but never should in practice.

Looks like this. Less duplication of the 'if (uninitialised)
init_heap_pages() else free_heap_pages()' construct, more scope for
people to naïvely complain that it "recurses". I think I prefer it this
way.

From: David Woodhouse <dwmw@amazon.co.uk>
Date: Fri, 7 Feb 2020 13:01:36 +0000
Subject: [PATCH] xen/mm: Introduce PG_state_uninitialised

It is possible for pages to enter general circulation without ever
being processed by init_heap_pages(). For example, pages of the
multiboot module containing the initramfs may be assigned via
assign_pages() to dom0 as it is created. And some code including
map_pages_to_xen() has checks on 'system_state' to determine whether to
use the boot or the heap allocator, but it seems impossible to prove
that pages allocated by the boot allocator are not subsequently freed
with free_heap_pages().

This actually works fine in the majority of cases; there are only a few
esoteric corner cases which init_heap_pages() handles before handing
the page range off to free_heap_pages():

 • Excluding MFN #0 to avoid inappropriate cross-zone merging.
 • Ensuring that the node information structures exist, when the first
   page(s) of a given node are handled.
 • High order allocations crossing from one node to another.

To handle this case, shift PG_state_inuse from its current value of
zero to another value. Use zero, which is the initial state of the
entire frame table, as PG_state_uninitialised.

Fix a couple of assertions which were assuming that PG_state_inuse is
zero, and make them cope with the PG_state_uninitialised case too where
appropriate.

Finally, make free_heap_pages() call through to init_heap_pages() when
given a page range which has not been initialised. This cannot keep
recursing because init_heap_pages() will set each page state to
PGC_state_inuse before passing it back to free_heap_pages() for the
second time.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 xen/arch/x86/mm.c        |  3 ++-
 xen/common/page_alloc.c  | 40 ++++++++++++++++++++++++++++------------
 xen/include/asm-arm/mm.h |  3 ++-
 xen/include/asm-x86/mm.h |  3 ++-
 4 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9b33829084..bf660ee8eb 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
 
     page_set_owner(page, d);
     smp_wmb(); /* install valid domain ptr before updating refcnt. */
-    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
+    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
+           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
 
     /* Only add to the allocation list if the domain isn't dying. */
     if ( !d->is_dying )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 4084503554..95984d6de0 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -252,6 +252,8 @@ struct bootmem_region {
 static struct bootmem_region __initdata
     bootmem_region_list[PAGE_SIZE / sizeof(struct bootmem_region)];
 static unsigned int __initdata nr_bootmem_regions;
+static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
+                            bool scrub);
 
 struct scrub_region {
     unsigned long offset;
@@ -1389,6 +1391,16 @@ static void free_heap_pages(
     ASSERT(order <= MAX_ORDER);
     ASSERT(node >= 0);
 
+    if ( page_state_is(pg, uninitialised) )
+    {
+        init_heap_pages(pg, 1 << order, need_scrub);
+        /*
+         * init_heap_pages() will call back into free_heap_pages() for
+         * each page but cannot keep recursing because each page will
+         * be set to PGC_state_inuse first.
+         */
+        return;
+    }
     spin_lock(&heap_lock);
 
     for ( i = 0; i < (1 << order); i++ )
@@ -1780,11 +1792,10 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
  * latter is not on a MAX_ORDER boundary, then we reserve the page by
  * not freeing it to the buddy allocator.
  */
-static void init_heap_pages(
-    struct page_info *pg, unsigned long nr_pages)
+static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
+                            bool scrub)
 {
     unsigned long i;
-    bool idle_scrub = false;
 
     /*
      * Keep MFN 0 away from the buddy allocator to avoid crossing zone
@@ -1809,7 +1820,7 @@ static void init_heap_pages(
     spin_unlock(&heap_lock);
 
     if ( system_state < SYS_STATE_active && opt_bootscrub == BOOTSCRUB_IDLE )
-        idle_scrub = true;
+        scrub = true;
 
     for ( i = 0; i < nr_pages; i++ )
     {
@@ -1837,7 +1848,8 @@ static void init_heap_pages(
             nr_pages -= n;
         }
 
-        free_heap_pages(pg + i, 0, scrub_debug || idle_scrub);
+        pg[i].count_info = PGC_state_inuse;
+        free_heap_pages(pg + i, 0, scrub_debug || scrub);
     }
 }
 
@@ -1873,7 +1885,7 @@ void __init end_boot_allocator(void)
         if ( (r->s < r->e) &&
              (phys_to_nid(pfn_to_paddr(r->s)) == cpu_to_node(0)) )
         {
-            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
+            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s, false);
             r->e = r->s;
             break;
         }
@@ -1882,7 +1894,7 @@ void __init end_boot_allocator(void)
     {
         struct bootmem_region *r = &bootmem_region_list[i];
         if ( r->s < r->e )
-            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
+            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s, false);
     }
 
     nr_bootmem_regions = 0;
@@ -2151,7 +2163,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe)
 
     memguard_guard_range(maddr_to_virt(ps), pe - ps);
 
-    init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT);
+    init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT, false);
 }
 
 
@@ -2260,7 +2275,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe)
     if ( mfn_x(emfn) <= mfn_x(smfn) )
         return;
 
-    init_heap_pages(mfn_to_page(smfn), mfn_x(emfn) - mfn_x(smfn));
+    init_heap_pages(mfn_to_page(smfn), mfn_x(emfn) - mfn_x(smfn), false);
 }
 
 
@@ -2301,10 +2316,11 @@ int assign_pages(
     for ( i = 0; i < (1 << order); i++ )
     {
         ASSERT(page_get_owner(&pg[i]) == NULL);
-        ASSERT(!pg[i].count_info);
+        ASSERT(pg[i].count_info == PGC_state_inuse ||
+               pg[i].count_info == PGC_state_uninitialised);
         page_set_owner(&pg[i], d);
         smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
-        pg[i].count_info = PGC_allocated | 1;
+        pg[i].count_info |= PGC_allocated | 1;
         page_list_add_tail(&pg[i], &d->page_list);
     }
 
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index c9466c8ca0..c696941600 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -117,12 +117,13 @@ struct page_info
  * { inuse, offlining, offlined, free, broken_offlining, broken }
  */
 #define PGC_state                  PG_mask(7, 9)
-#define PGC_state_inuse            PG_mask(0, 9)
+#define PGC_state_uninitialised    PG_mask(0, 9)
 #define PGC_state_offlining        PG_mask(1, 9)
 #define PGC_state_offlined         PG_mask(2, 9)
 #define PGC_state_free             PG_mask(3, 9)
 #define PGC_state_broken_offlining PG_mask(4, 9)
 #define PGC_state_broken           PG_mask(5, 9)
+#define PGC_state_inuse            PG_mask(6, 9)
 
 #define page_state_is(pg, st)      (((pg)->count_info&PGC_state) == PGC_state_##st)
 #define page_is_broken(pg)         (page_state_is((pg), broken_offlining) || \
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 3edadf7a7c..645368e6a9 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -72,12 +72,13 @@
  * { inuse, offlining, offlined, free, broken_offlining, broken }
  */
 #define PGC_state                  PG_mask(7, 9)
-#define PGC_state_inuse            PG_mask(0, 9)
+#define PGC_state_uninitialised    PG_mask(0, 9)
 #define PGC_state_offlining        PG_mask(1, 9)
 #define PGC_state_offlined         PG_mask(2, 9)
 #define PGC_state_free             PG_mask(3, 9)
 #define PGC_state_broken_offlining PG_mask(4, 9)
 #define PGC_state_broken           PG_mask(5, 9)
+#define PGC_state_inuse            PG_mask(6, 9)
 
 #define page_state_is(pg, st)      (((pg)->count_info&PGC_state) == PGC_state_##st)
 #define page_is_broken(pg)         (page_state_is((pg), broken_offlining) || \
On 07.02.2020 19:04, David Woodhouse wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
>  
>      page_set_owner(page, d);
>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);

Can uninitialized pages really make it here?

> @@ -1389,6 +1391,16 @@ static void free_heap_pages(
>      ASSERT(order <= MAX_ORDER);
>      ASSERT(node >= 0);
>  
> +    if ( page_state_is(pg, uninitialised) )
> +    {
> +        init_heap_pages(pg, 1 << order, need_scrub);
> +        /*
> +         * init_heap_pages() will call back into free_heap_pages() for
> +         * each page but cannot keep recursing because each page will
> +         * be set to PGC_state_inuse first.
> +         */
> +        return;
> +    }
>      spin_lock(&heap_lock);

Can you also add a blank line above here please?

> @@ -1780,11 +1792,10 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
>   * latter is not on a MAX_ORDER boundary, then we reserve the page by
>   * not freeing it to the buddy allocator.
>   */
> -static void init_heap_pages(
> -    struct page_info *pg, unsigned long nr_pages)
> +static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
> +                            bool scrub)

Is this new parameter strictly needed, i.e. can free_heap_pages()
be called with uninitialized pages which need scrubbing? (The
code change is simple enough, and hence may warrant keeping, but
then the commit message could indicate so in case this isn't a
strict requirement.)

> @@ -2301,10 +2316,11 @@ int assign_pages(
>      for ( i = 0; i < (1 << order); i++ )
>      {
>          ASSERT(page_get_owner(&pg[i]) == NULL);
> -        ASSERT(!pg[i].count_info);
> +        ASSERT(pg[i].count_info == PGC_state_inuse ||
> +               pg[i].count_info == PGC_state_uninitialised);

Same question here: Can uninitialized pages make it here? If
so, wouldn't it be better to correct this, rather than having
the more permissive assertion?

>          page_set_owner(&pg[i], d);
>          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
> -        pg[i].count_info = PGC_allocated | 1;
> +        pg[i].count_info |= PGC_allocated | 1;

This is too relaxed for my taste: I understand you want to
retain page state, but I suppose other bits would want clearing
nevertheless.

> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -72,12 +72,13 @@
>   * { inuse, offlining, offlined, free, broken_offlining, broken }
>   */
>  #define PGC_state                  PG_mask(7, 9)
> -#define PGC_state_inuse            PG_mask(0, 9)
> +#define PGC_state_uninitialised    PG_mask(0, 9)
>  #define PGC_state_offlining        PG_mask(1, 9)
>  #define PGC_state_offlined         PG_mask(2, 9)
>  #define PGC_state_free             PG_mask(3, 9)
>  #define PGC_state_broken_offlining PG_mask(4, 9)
>  #define PGC_state_broken           PG_mask(5, 9)
> +#define PGC_state_inuse            PG_mask(6, 9)

Would imo be nice if this most common state was actually
either 1 or 7, for easy recognition. But the most suitable
value to pick may also depend on the outcome of one of the
comments on patch 1.

Jan
Hi,

On 20/02/2020 11:59, Jan Beulich wrote:
> On 07.02.2020 19:04, David Woodhouse wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
>>
>>      page_set_owner(page, d);
>>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
>> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
>> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
>> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
>
> Can uninitialized pages really make it here?

IIRC, arch_init_memory() will share the first 1MB of the RAM but they
were never given to the page allocator.

>> @@ -2301,10 +2316,11 @@ int assign_pages(
>>      for ( i = 0; i < (1 << order); i++ )
>>      {
>>          ASSERT(page_get_owner(&pg[i]) == NULL);
>> -        ASSERT(!pg[i].count_info);
>> +        ASSERT(pg[i].count_info == PGC_state_inuse ||
>> +               pg[i].count_info == PGC_state_uninitialised);
>
> Same question here: Can uninitialized pages make it here?

Yes, in dom0_construct_pv() when the initrd is assigned to the guest.

> If so, wouldn't it be better to correct this, rather than having
> the more permissive assertion?

We would need to rework init_heap_pages() (or create a new function) so
the allocator initialises the state but the page is marked as
"reserved/used" rather than "freed". Most likely we will need a similar
function for live update. This is because the pages belonging to guests
should be untouched.

Cheers,
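Julien's suggestion amounts to a variant of init_heap_pages() that performs
the allocator's bookkeeping for a range but leaves every page marked as
allocated instead of freeing it. A minimal standalone sketch of that idea
follows; the function name, types and state values are invented purely for
illustration, and nothing of this shape exists in Xen at this point in the
thread:

#include <stdio.h>

/* Stand-ins for the Xen types and states involved; purely illustrative. */
#define PGC_state_uninitialised 0u
#define PGC_state_inuse         1u

struct page_info {
    unsigned int count_info;
};

/*
 * Hypothetical helper: make a page range known to the allocator but mark
 * every page as "inuse" instead of handing it to the free lists, so the
 * contents (e.g. a guest's initrd, or pages preserved across live update)
 * are never scrubbed or reallocated.
 */
static void init_heap_pages_as_used(struct page_info *pg, unsigned long nr_pages)
{
    for ( unsigned long i = 0; i < nr_pages; i++ )
        pg[i].count_info = PGC_state_inuse;    /* initialised, never freed */
}

int main(void)
{
    struct page_info initrd[8] = {{ 0 }};      /* range starts out "uninitialised" */

    init_heap_pages_as_used(initrd, 8);
    printf("first page state: %u\n", initrd[0].count_info);
    return 0;
}

Such a helper would let pages like dom0's initrd become known to the allocator
without their contents ever being scrubbed or merged into the free lists.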
On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
> On 07.02.2020 19:04, David Woodhouse wrote:
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
> >  
> >      page_set_owner(page, d);
> >      smp_wmb(); /* install valid domain ptr before updating refcnt. */
> > -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
> > +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
> > +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
>
> Can uninitialized pages really make it here?

Yep, we share the low 1MiB with dom_io.

> > @@ -1389,6 +1391,16 @@ static void free_heap_pages(
> >      ASSERT(order <= MAX_ORDER);
> >      ASSERT(node >= 0);
> >  
> > +    if ( page_state_is(pg, uninitialised) )
> > +    {
> > +        init_heap_pages(pg, 1 << order, need_scrub);
> > +        /*
> > +         * init_heap_pages() will call back into free_heap_pages() for
> > +         * each page but cannot keep recursing because each page will
> > +         * be set to PGC_state_inuse first.
> > +         */
> > +        return;
> > +    }
> >      spin_lock(&heap_lock);
>
> Can you also add a blank line above here please?

Done.

> > @@ -1780,11 +1792,10 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
> >   * latter is not on a MAX_ORDER boundary, then we reserve the page by
> >   * not freeing it to the buddy allocator.
> >   */
> > -static void init_heap_pages(
> > -    struct page_info *pg, unsigned long nr_pages)
> > +static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
> > +                            bool scrub)
>
> Is this new parameter strictly needed, i.e. can free_heap_pages()
> be called with uninitialized pages which need scrubbing? (The
> code change is simple enough, and hence may warrant keeping, but
> then the commit message could indicate so in case this isn't a
> strict requirement.)

Yes, I think it's feasible for the initramfs pages, which are assigned
to dom0 from uninitialised pages, to later get freed and then they'll
want scrubbing?

There *is* a path into free_heap_pages() with the need_scrub argument
set, and I'd have to *prove* that it can never happen in order to... I
don't know... put a BUG() in that case instead of supporting it? Didn't
seem like that was the thing I wanted to do.

> > @@ -2301,10 +2316,11 @@ int assign_pages(
> >      for ( i = 0; i < (1 << order); i++ )
> >      {
> >          ASSERT(page_get_owner(&pg[i]) == NULL);
> > -        ASSERT(!pg[i].count_info);
> > +        ASSERT(pg[i].count_info == PGC_state_inuse ||
> > +               pg[i].count_info == PGC_state_uninitialised);
>
> Same question here: Can uninitialized pages make it here? If
> so, wouldn't it be better to correct this, rather than having
> the more permissive assertion?

Yep, Dom0 initrd on x86.

    ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
           (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);

> >          page_set_owner(&pg[i], d);
> >          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
> > -        pg[i].count_info = PGC_allocated | 1;
> > +        pg[i].count_info |= PGC_allocated | 1;
>
> This is too relaxed for my taste: I understand you want to
> retain page state, but I suppose other bits would want clearing
> nevertheless.

You seem to have dropped the ASSERT immediately before the code snippet
you cited, in which arbitrary other contents of count_info are not
permitted. I put it back, in its current form, after rebasing on top of
Paul's commit c793d13944b45d adding PGC_extra.

> > --- a/xen/include/asm-x86/mm.h
> > +++ b/xen/include/asm-x86/mm.h
> > @@ -72,12 +72,13 @@
> >   * { inuse, offlining, offlined, free, broken_offlining, broken }
> >   */
> >  #define PGC_state                  PG_mask(7, 9)
> > -#define PGC_state_inuse            PG_mask(0, 9)
> > +#define PGC_state_uninitialised    PG_mask(0, 9)
> >  #define PGC_state_offlining        PG_mask(1, 9)
> >  #define PGC_state_offlined         PG_mask(2, 9)
> >  #define PGC_state_free             PG_mask(3, 9)
> >  #define PGC_state_broken_offlining PG_mask(4, 9)
> >  #define PGC_state_broken           PG_mask(5, 9)
> > +#define PGC_state_inuse            PG_mask(6, 9)
>
> Would imo be nice if this most common state was actually
> either 1 or 7, for easy recognition. But the most suitable
> value to pick may also depend on the outcome of one of the
> comments on patch 1.

Not quite sure why 1 and 7 are easier to recognise than other values.
The important one is that uninitialised has to be zero, since that's
the default (because that's what the frame table is memset to. Which is
changeable, but non-trivially so).
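The numeric debate is easier to follow with the field arithmetic spelled out.
In the sketch below, the PG_shift()/PG_mask() definitions are stand-ins
written for illustration rather than Xen's real ones; they show why a frame
table that is memset to zero necessarily starts out in whichever state is
encoded as PG_mask(0, 9), and how the three-bit state field is rewritten when
a page changes state:

#include <stdio.h>

/*
 * Stand-in definitions for the sketch; Xen's real PG_shift()/PG_mask()
 * macros live in the per-architecture mm.h headers and may differ in detail.
 */
#define BITS_PER_LONG           (sizeof(unsigned long) * 8)
#define PG_shift(idx)           (BITS_PER_LONG - (idx))
#define PG_mask(x, idx)         ((unsigned long)(x) << PG_shift(idx))

#define PGC_state               PG_mask(7, 9)  /* three-bit state field */
#define PGC_state_uninitialised PG_mask(0, 9)  /* must be 0: frame table is memset to 0 */
#define PGC_state_inuse         PG_mask(6, 9)  /* the value under discussion */

int main(void)
{
    unsigned long count_info = 0;              /* a freshly zeroed frame-table entry */

    /* A zeroed entry is "uninitialised" purely by virtue of the encoding. */
    printf("uninitialised? %d\n",
           (count_info & PGC_state) == PGC_state_uninitialised);

    /* Changing state rewrites only the three-bit field. */
    count_info = (count_info & ~PGC_state) | PGC_state_inuse;
    printf("state field now holds %lu\n",
           (count_info & PGC_state) >> PG_shift(9));

    return 0;
}

Under such an encoding, Jan's point is that a value like 7 sets all three bits
of the field and stands out in a hex dump of count_info, while David's
constraint is that whichever name maps to 0 is the state every page starts in
before init_heap_pages() ever sees it.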
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
> Sent: 17 March 2020 22:15
> To: Jan Beulich <jbeulich@suse.com>
> Cc: sstabellini@kernel.org; julien@xen.org; wl@xen.org; konrad.wilk@oracle.com;
> george.dunlap@eu.citrix.com; andrew.cooper3@citrix.com; ian.jackson@eu.citrix.com;
> george.dunlap@citrix.com; jeff.kubascik@dornerworks.com; Xia, Hongyan <hongyxia@amazon.com>;
> stewart.hildebrand@dornerworks.com; xen-devel@lists.xenproject.org
> Subject: Re: [Xen-devel] [PATCH 2/2] xen/mm: Introduce PG_state_uninitialised
>
> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
> > On 07.02.2020 19:04, David Woodhouse wrote:
> > > --- a/xen/arch/x86/mm.c
> > > +++ b/xen/arch/x86/mm.c
> > > @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
> > >
> > >      page_set_owner(page, d);
> > >      smp_wmb(); /* install valid domain ptr before updating refcnt. */
> > > -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
> > > +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
> > > +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
> >
> > Can uninitialized pages really make it here?
>
> Yep, we share the low 1MiB with dom_io.
>

OOI anyone know why we do this? Is it actually necessary?

  Paul
On 17.03.2020 23:15, David Woodhouse wrote:
> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
>> On 07.02.2020 19:04, David Woodhouse wrote:
>
>     ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
>            (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
>
>>>          page_set_owner(&pg[i], d);
>>>          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
>>> -        pg[i].count_info = PGC_allocated | 1;
>>> +        pg[i].count_info |= PGC_allocated | 1;
>>
>> This is too relaxed for my taste: I understand you want to
>> retain page state, but I suppose other bits would want clearing
>> nevertheless.
>
> You seem to have dropped the ASSERT immediately before the code snippet
> you cited, in which arbitrary other contents of count_info are not
> permitted. I put it back, in its current form, after rebasing on top of
> Paul's commit c793d13944b45d adding PGC_extra.

But that's only an ASSERT(), i.e. no protection at all in release builds.

>>> --- a/xen/include/asm-x86/mm.h
>>> +++ b/xen/include/asm-x86/mm.h
>>> @@ -72,12 +72,13 @@
>>>   * { inuse, offlining, offlined, free, broken_offlining, broken }
>>>   */
>>>  #define PGC_state                  PG_mask(7, 9)
>>> -#define PGC_state_inuse            PG_mask(0, 9)
>>> +#define PGC_state_uninitialised    PG_mask(0, 9)
>>>  #define PGC_state_offlining        PG_mask(1, 9)
>>>  #define PGC_state_offlined         PG_mask(2, 9)
>>>  #define PGC_state_free             PG_mask(3, 9)
>>>  #define PGC_state_broken_offlining PG_mask(4, 9)
>>>  #define PGC_state_broken           PG_mask(5, 9)
>>> +#define PGC_state_inuse            PG_mask(6, 9)
>>
>> Would imo be nice if this most common state was actually
>> either 1 or 7, for easy recognition. But the most suitable
>> value to pick may also depend on the outcome of one of the
>> comments on patch 1.
>
> Not quite sure why 1 and 7 are easier to recognise than other values.
> The important one is that uninitialised has to be zero, since that's
> the default (because that's what the frame table is memset to. Which is
> changeable, but non-trivially so).

In particular 7 may imo be easier to recognize, as having all of the
three bits set. It sometimes helps seeing such at (almost) the first
glance in e.g. logged data.

Jan
On 18.03.2020 09:53, Paul Durrant wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
>> Sent: 17 March 2020 22:15
>>
>> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
>>> On 07.02.2020 19:04, David Woodhouse wrote:
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
>>>>
>>>>      page_set_owner(page, d);
>>>>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
>>>> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
>>>> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
>>>> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
>>>
>>> Can uninitialized pages really make it here?
>>
>> Yep, we share the low 1MiB with dom_io.
>>
>
> OOI anyone know why we do this? Is it actually necessary?

Yes, for Dom0 to be able to access things like EBDA, IBFT, or data
found in BIOS space.

Jan
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 18 March 2020 10:10
> To: paul@xen.org
> Cc: 'David Woodhouse' <dwmw2@infradead.org>; sstabellini@kernel.org; julien@xen.org; wl@xen.org;
> konrad.wilk@oracle.com; george.dunlap@eu.citrix.com; andrew.cooper3@citrix.com;
> ian.jackson@eu.citrix.com; george.dunlap@citrix.com; jeff.kubascik@dornerworks.com; 'Xia, Hongyan'
> <hongyxia@amazon.com>; stewart.hildebrand@dornerworks.com; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH 2/2] xen/mm: Introduce PG_state_uninitialised
>
> On 18.03.2020 09:53, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
> >> Sent: 17 March 2020 22:15
> >>
> >> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
> >>> On 07.02.2020 19:04, David Woodhouse wrote:
> >>>> --- a/xen/arch/x86/mm.c
> >>>> +++ b/xen/arch/x86/mm.c
> >>>> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
> >>>>
> >>>>      page_set_owner(page, d);
> >>>>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
> >>>> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
> >>>> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
> >>>> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
> >>>
> >>> Can uninitialized pages really make it here?
> >>
> >> Yep, we share the low 1MiB with dom_io.
> >>
> >
> > OOI anyone know why we do this? Is it actually necessary?
>
> Yes, for Dom0 to be able to access things like EBDA, IBFT, or data
> found in BIOS space.
>

Ok. I am still wondering why dom0's low 1MiB of pfn space is not simply
mapped 1:1 though. Just historical?

  Paul

> Jan
On 18.03.2020 11:41, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 18 March 2020 10:10
>> To: paul@xen.org
>> Cc: 'David Woodhouse' <dwmw2@infradead.org>; sstabellini@kernel.org; julien@xen.org; wl@xen.org;
>> konrad.wilk@oracle.com; george.dunlap@eu.citrix.com; andrew.cooper3@citrix.com;
>> ian.jackson@eu.citrix.com; george.dunlap@citrix.com; jeff.kubascik@dornerworks.com; 'Xia, Hongyan'
>> <hongyxia@amazon.com>; stewart.hildebrand@dornerworks.com; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH 2/2] xen/mm: Introduce PG_state_uninitialised
>>
>> On 18.03.2020 09:53, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
>>>> Sent: 17 March 2020 22:15
>>>>
>>>> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
>>>>> On 07.02.2020 19:04, David Woodhouse wrote:
>>>>>> --- a/xen/arch/x86/mm.c
>>>>>> +++ b/xen/arch/x86/mm.c
>>>>>> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
>>>>>>
>>>>>>      page_set_owner(page, d);
>>>>>>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
>>>>>> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
>>>>>> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
>>>>>> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
>>>>>
>>>>> Can uninitialized pages really make it here?
>>>>
>>>> Yep, we share the low 1MiB with dom_io.
>>>>
>>>
>>> OOI anyone know why we do this? Is it actually necessary?
>>
>> Yes, for Dom0 to be able to access things like EBDA, IBFT, or data
>> found in BIOS space.
>>
>
> Ok. I am still wondering why dom0's low 1MiB of pfn space is not
> simply mapped 1:1 though. Just historical?

Well, in a way perhaps. Using the DomIO approach is less of a special
case than mapping some arbitrary range 1:1. Furthermore Dom0 being PV
wouldn't necessarily expect any BIOS in its PFN range there, but rather
views it as normal RAM.

Jan
On Wed, 2020-03-18 at 11:03 +0100, Jan Beulich wrote:
> On 17.03.2020 23:15, David Woodhouse wrote:
> > On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
> > > On 07.02.2020 19:04, David Woodhouse wrote:
> >
> >     ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
> >            (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
> > > >          page_set_owner(&pg[i], d);
> > > >          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
> > > > -        pg[i].count_info = PGC_allocated | 1;
> > > > +        pg[i].count_info |= PGC_allocated | 1;
> > >
> > > This is too relaxed for my taste: I understand you want to
> > > retain page state, but I suppose other bits would want clearing
> > > nevertheless.
> >
> > You seem to have dropped the ASSERT immediately before the code snippet
> > you cited, in which arbitrary other contents of count_info are not
> > permitted. I put it back, in its current form, after rebasing on top of
> > Paul's commit c793d13944b45d adding PGC_extra.
>
> But that's only an ASSERT(), i.e. no protection at all in release builds.

An ASSERT does protect release builds. If the rule is that you must
never call assign_pages() with pages that have the other bits in
count_info set, then the ASSERT helps to catch the cases where people
introduce a bug and start doing precisely that, and the bug never
*makes* it to release builds.

What we're debating here is the behaviour of assign_pages() when
someone introduces such a bug and calls it with inappropriate pages.

Currently, the behaviour is that the other flags are silently cleared.
I've seen no analysis that such clearing is correct or desirable. In
fact, for the PGC_state bits I determined that it now is NOT correct,
which is why I changed it.

While I was at it, I let it preserve the other bits — which, again,
should never be set, and which would trigger the ASSERT in debug builds
if it were to happen.

But I'm not tied to that behaviour. It's still a "can never happen"
case as far as I'm concerned. So let's make it look like this:

    for ( i = 0; i < (1 << order); i++ )
    {
        ASSERT(page_get_owner(&pg[i]) == NULL);
        /*
         * Note: Not using page_state_is() here. The ASSERT requires that
         * all other bits in count_info are zero, in addition to PGC_state
         * being appropriate.
         */
        ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
               (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
        page_set_owner(&pg[i], d);
        smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
        pg[i].count_info = (pg[i].count_info & PGC_state) | PGC_allocated | 1;
        page_list_add_tail(&pg[i], &d->page_list);
    }

OK?
On 18.03.2020 13:11, David Woodhouse wrote:
> On Wed, 2020-03-18 at 11:03 +0100, Jan Beulich wrote:
>> On 17.03.2020 23:15, David Woodhouse wrote:
>>> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
>>>> On 07.02.2020 19:04, David Woodhouse wrote:
>>>
>>>     ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
>>>            (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
>>>>>          page_set_owner(&pg[i], d);
>>>>>          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
>>>>> -        pg[i].count_info = PGC_allocated | 1;
>>>>> +        pg[i].count_info |= PGC_allocated | 1;
>>>>
>>>> This is too relaxed for my taste: I understand you want to
>>>> retain page state, but I suppose other bits would want clearing
>>>> nevertheless.
>>>
>>> You seem to have dropped the ASSERT immediately before the code snippet
>>> you cited, in which arbitrary other contents of count_info are not
>>> permitted. I put it back, in its current form, after rebasing on top of
>>> Paul's commit c793d13944b45d adding PGC_extra.
>>
>> But that's only an ASSERT(), i.e. no protection at all in release builds.
>
> An ASSERT does protect release builds. If the rule is that you must
> never call assign_pages() with pages that have the other bits in
> count_info set, then the ASSERT helps to catch the cases where people
> introduce a bug and start doing precisely that, and the bug never
> *makes* it to release builds.
>
> What we're debating here is the behaviour of assign_pages() when
> someone introduces such a bug and calls it with inappropriate pages.
>
> Currently, the behaviour is that the other flags are silently cleared.
> I've seen no analysis that such clearing is correct or desirable. In
> fact, for the PGC_state bits I determined that it now is NOT correct,
> which is why I changed it.
>
> While I was at it, I let it preserve the other bits — which, again,
> should never be set, and which would trigger the ASSERT in debug builds
> if it were to happen.
>
> But I'm not tied to that behaviour. It's still a "can never happen"
> case as far as I'm concerned. So let's make it look like this:
>
>     for ( i = 0; i < (1 << order); i++ )
>     {
>         ASSERT(page_get_owner(&pg[i]) == NULL);
>         /*
>          * Note: Not using page_state_is() here. The ASSERT requires that
>          * all other bits in count_info are zero, in addition to PGC_state
>          * being appropriate.
>          */
>         ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
>                (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
>         page_set_owner(&pg[i], d);
>         smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
>         pg[i].count_info = (pg[i].count_info & PGC_state) | PGC_allocated | 1;
>         page_list_add_tail(&pg[i], &d->page_list);
>     }
>
> OK?

Yes, thanks.

Jan
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9b33829084..bf660ee8eb 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
 
     page_set_owner(page, d);
     smp_wmb(); /* install valid domain ptr before updating refcnt. */
-    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
+    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
+           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
 
     /* Only add to the allocation list if the domain isn't dying. */
     if ( !d->is_dying )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 4084503554..9703a2c664 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1407,6 +1407,7 @@ static void free_heap_pages(
         switch ( pg[i].count_info & PGC_state )
         {
         case PGC_state_inuse:
+        case PGC_state_uninitialised:
             pg[i].count_info = PGC_state_free;
             break;
 
@@ -1780,11 +1781,10 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
  * latter is not on a MAX_ORDER boundary, then we reserve the page by
  * not freeing it to the buddy allocator.
  */
-static void init_heap_pages(
-    struct page_info *pg, unsigned long nr_pages)
+static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
+                            bool scrub)
 {
     unsigned long i;
-    bool idle_scrub = false;
 
     /*
      * Keep MFN 0 away from the buddy allocator to avoid crossing zone
@@ -1809,7 +1809,7 @@ static void init_heap_pages(
     spin_unlock(&heap_lock);
 
     if ( system_state < SYS_STATE_active && opt_bootscrub == BOOTSCRUB_IDLE )
-        idle_scrub = true;
+        scrub = true;
 
     for ( i = 0; i < nr_pages; i++ )
     {
@@ -1837,7 +1837,7 @@ static void init_heap_pages(
             nr_pages -= n;
         }
 
-        free_heap_pages(pg + i, 0, scrub_debug || idle_scrub);
+        free_heap_pages(pg + i, 0, scrub_debug || scrub);
     }
 }
 
@@ -1873,7 +1873,7 @@ void __init end_boot_allocator(void)
         if ( (r->s < r->e) &&
              (phys_to_nid(pfn_to_paddr(r->s)) == cpu_to_node(0)) )
         {
-            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
+            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s, false);
             r->e = r->s;
             break;
         }
@@ -1882,7 +1882,7 @@ void __init end_boot_allocator(void)
     {
         struct bootmem_region *r = &bootmem_region_list[i];
         if ( r->s < r->e )
-            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
+            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s, false);
     }
 
     nr_bootmem_regions = 0;
@@ -2151,7 +2151,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe)
 
     memguard_guard_range(maddr_to_virt(ps), pe - ps);
 
-    init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT);
+    init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT, false);
 }
 
 
@@ -2174,14 +2174,20 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
+    struct page_info *pg;
     ASSERT(!in_irq());
 
     if ( v == NULL )
         return;
 
+    pg = virt_to_page(v);
+
     memguard_guard_range(v, 1 << (order + PAGE_SHIFT));
 
-    free_heap_pages(virt_to_page(v), order, false);
+    if ( unlikely(page_state_is(pg, uninitialised)) )
+        init_heap_pages(pg, 1 << order, true);
+    else
+        free_heap_pages(pg, order, false);
 }
 
 #else /* !CONFIG_SEPARATE_XENHEAP */
@@ -2237,7 +2243,10 @@ void free_xenheap_pages(void *v, unsigned int order)
     for ( i = 0; i < (1u << order); i++ )
         pg[i].count_info &= ~PGC_xen_heap;
 
-    free_heap_pages(pg, order, true);
+    if ( unlikely(page_state_is(pg, uninitialised)) )
+        init_heap_pages(pg, 1 << order, true);
+    else
+        free_heap_pages(pg, order, true);
 }
 
 #endif /* CONFIG_SEPARATE_XENHEAP */
@@ -2260,7 +2269,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe)
     if ( mfn_x(emfn) <= mfn_x(smfn) )
         return;
 
-    init_heap_pages(mfn_to_page(smfn), mfn_x(emfn) - mfn_x(smfn));
+    init_heap_pages(mfn_to_page(smfn), mfn_x(emfn) - mfn_x(smfn), false);
 }
 
 
@@ -2301,10 +2310,11 @@ int assign_pages(
     for ( i = 0; i < (1 << order); i++ )
     {
         ASSERT(page_get_owner(&pg[i]) == NULL);
-        ASSERT(!pg[i].count_info);
+        ASSERT(pg[i].count_info == PGC_state_inuse ||
+               pg[i].count_info == PGC_state_uninitialised);
        page_set_owner(&pg[i], d);
         smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
-        pg[i].count_info = PGC_allocated | 1;
+        pg[i].count_info |= PGC_allocated | 1;
         page_list_add_tail(&pg[i], &d->page_list);
     }
 
@@ -2427,7 +2437,10 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
             scrub = 1;
         }
 
-        free_heap_pages(pg, order, scrub);
+        if ( unlikely(page_state_is(pg, uninitialised)) )
+            init_heap_pages(pg, 1 << order, scrub);
+        else
+            free_heap_pages(pg, order, scrub);
     }
 
     if ( drop_dom_ref )
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index c9466c8ca0..c696941600 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -117,12 +117,13 @@ struct page_info
  * { inuse, offlining, offlined, free, broken_offlining, broken }
  */
 #define PGC_state                  PG_mask(7, 9)
-#define PGC_state_inuse            PG_mask(0, 9)
+#define PGC_state_uninitialised    PG_mask(0, 9)
 #define PGC_state_offlining        PG_mask(1, 9)
 #define PGC_state_offlined         PG_mask(2, 9)
 #define PGC_state_free             PG_mask(3, 9)
 #define PGC_state_broken_offlining PG_mask(4, 9)
 #define PGC_state_broken           PG_mask(5, 9)
+#define PGC_state_inuse            PG_mask(6, 9)
 
 #define page_state_is(pg, st)      (((pg)->count_info&PGC_state) == PGC_state_##st)
 #define page_is_broken(pg)         (page_state_is((pg), broken_offlining) || \
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 3edadf7a7c..645368e6a9 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -72,12 +72,13 @@
  * { inuse, offlining, offlined, free, broken_offlining, broken }
  */
 #define PGC_state                  PG_mask(7, 9)
-#define PGC_state_inuse            PG_mask(0, 9)
+#define PGC_state_uninitialised    PG_mask(0, 9)
 #define PGC_state_offlining        PG_mask(1, 9)
 #define PGC_state_offlined         PG_mask(2, 9)
 #define PGC_state_free             PG_mask(3, 9)
 #define PGC_state_broken_offlining PG_mask(4, 9)
 #define PGC_state_broken           PG_mask(5, 9)
+#define PGC_state_inuse            PG_mask(6, 9)
 
 #define page_state_is(pg, st)      (((pg)->count_info&PGC_state) == PGC_state_##st)
 #define page_is_broken(pg)         (page_state_is((pg), broken_offlining) || \