Message ID | 20200207155701.2781820-2-dwmw2@infradead.org (mailing list archive)
---|---
State | New, archived
Series | [1/2] xen/mm: fold PGC_broken into PGC_state bits
On Fri, 2020-02-07 at 15:57 +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> ...
>
> Finally, make free_xenheap_pages() and free_domheap_pages() call
> through to init_heap_pages() instead of directly to free_heap_pages()
> in the case where pages are being freed which have never passed
> through init_heap_pages().
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

If both end up calling free_heap_pages, can we just put the check there?

Hongyan
On 7 February 2020 16:30:04 GMT, "Xia, Hongyan" <hongyxia@amazon.com> wrote:
> On Fri, 2020-02-07 at 15:57 +0000, David Woodhouse wrote:
>> From: David Woodhouse <dwmw@amazon.co.uk>
>>
>> ...
>>
>> Finally, make free_xenheap_pages() and free_domheap_pages() call
>> through to init_heap_pages() instead of directly to free_heap_pages()
>> in the case where pages are being freed which have never passed
>> through init_heap_pages().
>>
>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>
> If both end up calling free_heap_pages, can we just put the check
> there?

Ideally not, because init_heap_pages() then calls free_heap_pages() and
the "recursion" is best avoided.
On Fri, 2020-02-07 at 16:32 +0000, David Woodhouse wrote:
>
> ...
>
> Ideally not because init_heap_pages() then calls free_heap_pages()
> and the "recursion" is best avoided.

Kind of depends on where we change its pg to initialised. If we do that
in free_heap_pages this does recurse, but if it is done in
init_heap_pages before calling free it does not.

Hongyan
On 7 February 2020 16:40:01 GMT, "Xia, Hongyan" <hongyxia@amazon.com> wrote:
> On Fri, 2020-02-07 at 16:32 +0000, David Woodhouse wrote:
>>
>> ...
>>
>> Ideally not because init_heap_pages() then calls free_heap_pages()
>> and the "recursion" is best avoided.
>
> Kind of depends on where we change its pg to initialised. If we do that
> in free_heap_pages this does recurse, but if it is done in
> init_heap_pages before calling free it does not.

True. It would *look* like it might recurse, but never should in practice.
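The termination argument in this exchange fits in a few lines of C. Below is a
simplified, self-contained sketch of the interplay being discussed; struct
page_info and the PGC_state_* values here are stand-ins rather than the real
Xen definitions, and the real functions take more parameters and do far more
work:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the Xen definitions involved; purely illustrative. */
#define PGC_state_uninitialised 0u
#define PGC_state_inuse         1u

struct page_info {
    unsigned int count_info;            /* holds just the state in this sketch */
};

static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
                            bool scrub);

static void free_heap_pages(struct page_info *pg, unsigned int order,
                            bool need_scrub)
{
    if ( pg->count_info == PGC_state_uninitialised )
    {
        /*
         * Bounce the whole range through init_heap_pages().  This cannot
         * recurse a second time: init_heap_pages() marks each page as
         * PGC_state_inuse before calling back in, so this branch is not
         * taken on the second pass.
         */
        init_heap_pages(pg, 1UL << order, need_scrub);
        return;
    }

    printf("normal buddy free path (scrub=%d)\n", (int)need_scrub);
}

static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
                            bool scrub)
{
    for ( unsigned long i = 0; i < nr_pages; i++ )
    {
        pg[i].count_info = PGC_state_inuse;   /* initialised first... */
        free_heap_pages(pg + i, 0, scrub);    /* ...then freed, one page at a time */
    }
}

int main(void)
{
    struct page_info pages[4] = {{ 0 }};      /* all start out "uninitialised" */

    free_heap_pages(pages, 2, false);         /* one bounce, then four normal frees */
    return 0;
}

The ordering inside init_heap_pages() is what prevents unbounded recursion:
each page is marked PGC_state_inuse before free_heap_pages() is re-entered, so
the uninitialised branch can be taken at most once per range.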
On Fri, 2020-02-07 at 17:06 +0000, David Woodhouse wrote:
> On 7 February 2020 16:40:01 GMT, "Xia, Hongyan" <hongyxia@amazon.com> wrote:
> > On Fri, 2020-02-07 at 16:32 +0000, David Woodhouse wrote:
> > >
> > > ...
> > >
> > > Ideally not because init_heap_pages() then calls free_heap_pages()
> > > and the "recursion" is best avoided.
> >
> > Kind of depends on where we change its pg to initialised. If we do that
> > in free_heap_pages this does recurse, but if it is done in
> > init_heap_pages before calling free it does not.
>
> True. It would *look* like it might recurse, but never should in practice.

Looks like this. Less duplication of the 'if (uninitialised)
init_heap_pages() else free_heap_pages()' construct, more scope for
people to naïvely complain that it "recurses". I think I prefer it this
way.

From: David Woodhouse <dwmw@amazon.co.uk>
Date: Fri, 7 Feb 2020 13:01:36 +0000
Subject: [PATCH] xen/mm: Introduce PG_state_uninitialised

It is possible for pages to enter general circulation without ever
being processed by init_heap_pages(). For example, pages of the
multiboot module containing the initramfs may be assigned via
assign_pages() to dom0 as it is created. And some code including
map_pages_to_xen() has checks on 'system_state' to determine whether to
use the boot or the heap allocator, but it seems impossible to prove
that pages allocated by the boot allocator are not subsequently freed
with free_heap_pages().

This actually works fine in the majority of cases; there are only a few
esoteric corner cases which init_heap_pages() handles before handing
the page range off to free_heap_pages():

 • Excluding MFN #0 to avoid inappropriate cross-zone merging.
 • Ensuring that the node information structures exist, when the first
   page(s) of a given node are handled.
 • High order allocations crossing from one node to another.

To handle this case, shift PG_state_inuse from its current value of
zero to another value. Use zero, which is the initial state of the
entire frame table, as PG_state_uninitialised.

Fix a couple of assertions which were assuming that PG_state_inuse is
zero, and make them cope with the PG_state_uninitialised case too where
appropriate.

Finally, make free_heap_pages() call through to init_heap_pages() when
given a page range which has not been initialised. This cannot keep
recursing because init_heap_pages() will set each page state to
PGC_state_inuse before passing it back to free_heap_pages() for the
second time.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 xen/arch/x86/mm.c        |  3 ++-
 xen/common/page_alloc.c  | 40 ++++++++++++++++++++++++++++------------
 xen/include/asm-arm/mm.h |  3 ++-
 xen/include/asm-x86/mm.h |  3 ++-
 4 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9b33829084..bf660ee8eb 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
 
     page_set_owner(page, d);
     smp_wmb(); /* install valid domain ptr before updating refcnt. */
-    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
+    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
+           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
 
     /* Only add to the allocation list if the domain isn't dying. */
     if ( !d->is_dying )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 4084503554..95984d6de0 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -252,6 +252,8 @@ struct bootmem_region {
 static struct bootmem_region __initdata
     bootmem_region_list[PAGE_SIZE / sizeof(struct bootmem_region)];
 static unsigned int __initdata nr_bootmem_regions;
+static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
+                            bool scrub);
 
 struct scrub_region {
     unsigned long offset;
@@ -1389,6 +1391,16 @@ static void free_heap_pages(
     ASSERT(order <= MAX_ORDER);
     ASSERT(node >= 0);
 
+    if ( page_state_is(pg, uninitialised) )
+    {
+        init_heap_pages(pg, 1 << order, need_scrub);
+        /*
+         * init_heap_pages() will call back into free_heap_pages() for
+         * each page but cannot keep recursing because each page will
+         * be set to PGC_state_inuse first.
+         */
+        return;
+    }
     spin_lock(&heap_lock);
 
     for ( i = 0; i < (1 << order); i++ )
@@ -1780,11 +1792,10 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
  * latter is not on a MAX_ORDER boundary, then we reserve the page by
  * not freeing it to the buddy allocator.
  */
-static void init_heap_pages(
-    struct page_info *pg, unsigned long nr_pages)
+static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
+                            bool scrub)
 {
     unsigned long i;
-    bool idle_scrub = false;
 
     /*
      * Keep MFN 0 away from the buddy allocator to avoid crossing zone
@@ -1809,7 +1820,7 @@ static void init_heap_pages(
     spin_unlock(&heap_lock);
 
     if ( system_state < SYS_STATE_active && opt_bootscrub == BOOTSCRUB_IDLE )
-        idle_scrub = true;
+        scrub = true;
 
     for ( i = 0; i < nr_pages; i++ )
     {
@@ -1837,7 +1848,8 @@ static void init_heap_pages(
             nr_pages -= n;
         }
 
-        free_heap_pages(pg + i, 0, scrub_debug || idle_scrub);
+        pg[i].count_info = PGC_state_inuse;
+        free_heap_pages(pg + i, 0, scrub_debug || scrub);
     }
 }
 
@@ -1873,7 +1885,7 @@ void __init end_boot_allocator(void)
         if ( (r->s < r->e) &&
              (phys_to_nid(pfn_to_paddr(r->s)) == cpu_to_node(0)) )
         {
-            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
+            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s, false);
             r->e = r->s;
             break;
         }
@@ -1882,7 +1894,7 @@ void __init end_boot_allocator(void)
     {
         struct bootmem_region *r = &bootmem_region_list[i];
         if ( r->s < r->e )
-            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
+            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s, false);
     }
 
     nr_bootmem_regions = 0;
@@ -2151,7 +2163,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe)
 
     memguard_guard_range(maddr_to_virt(ps), pe - ps);
 
-    init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT);
+    init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT, false);
 }
 
 
@@ -2260,7 +2275,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe)
     if ( mfn_x(emfn) <= mfn_x(smfn) )
         return;
 
-    init_heap_pages(mfn_to_page(smfn), mfn_x(emfn) - mfn_x(smfn));
+    init_heap_pages(mfn_to_page(smfn), mfn_x(emfn) - mfn_x(smfn), false);
 }
 
 
@@ -2301,10 +2316,11 @@ int assign_pages(
     for ( i = 0; i < (1 << order); i++ )
     {
         ASSERT(page_get_owner(&pg[i]) == NULL);
-        ASSERT(!pg[i].count_info);
+        ASSERT(pg[i].count_info == PGC_state_inuse ||
+               pg[i].count_info == PGC_state_uninitialised);
         page_set_owner(&pg[i], d);
         smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
-        pg[i].count_info = PGC_allocated | 1;
+        pg[i].count_info |= PGC_allocated | 1;
         page_list_add_tail(&pg[i], &d->page_list);
     }
 
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index c9466c8ca0..c696941600 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -117,12 +117,13 @@ struct page_info
  * { inuse, offlining, offlined, free, broken_offlining, broken }
  */
 #define PGC_state                  PG_mask(7, 9)
-#define PGC_state_inuse            PG_mask(0, 9)
+#define PGC_state_uninitialised    PG_mask(0, 9)
 #define PGC_state_offlining        PG_mask(1, 9)
 #define PGC_state_offlined         PG_mask(2, 9)
 #define PGC_state_free             PG_mask(3, 9)
 #define PGC_state_broken_offlining PG_mask(4, 9)
 #define PGC_state_broken           PG_mask(5, 9)
+#define PGC_state_inuse            PG_mask(6, 9)
 
 #define page_state_is(pg, st)      (((pg)->count_info&PGC_state) == PGC_state_##st)
 #define page_is_broken(pg)         (page_state_is((pg), broken_offlining) || \
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 3edadf7a7c..645368e6a9 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -72,12 +72,13 @@
  * { inuse, offlining, offlined, free, broken_offlining, broken }
  */
 #define PGC_state                  PG_mask(7, 9)
-#define PGC_state_inuse            PG_mask(0, 9)
+#define PGC_state_uninitialised    PG_mask(0, 9)
 #define PGC_state_offlining        PG_mask(1, 9)
 #define PGC_state_offlined         PG_mask(2, 9)
 #define PGC_state_free             PG_mask(3, 9)
 #define PGC_state_broken_offlining PG_mask(4, 9)
 #define PGC_state_broken           PG_mask(5, 9)
+#define PGC_state_inuse            PG_mask(6, 9)
 
 #define page_state_is(pg, st)      (((pg)->count_info&PGC_state) == PGC_state_##st)
 #define page_is_broken(pg)         (page_state_is((pg), broken_offlining) || \
On 07.02.2020 19:04, David Woodhouse wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
>  
>      page_set_owner(page, d);
>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);

Can uninitialized pages really make it here?

> @@ -1389,6 +1391,16 @@ static void free_heap_pages(
>      ASSERT(order <= MAX_ORDER);
>      ASSERT(node >= 0);
>  
> +    if ( page_state_is(pg, uninitialised) )
> +    {
> +        init_heap_pages(pg, 1 << order, need_scrub);
> +        /*
> +         * init_heap_pages() will call back into free_heap_pages() for
> +         * each page but cannot keep recursing because each page will
> +         * be set to PGC_state_inuse first.
> +         */
> +        return;
> +    }
>      spin_lock(&heap_lock);

Can you also add a blank line above here please?

> @@ -1780,11 +1792,10 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
>   * latter is not on a MAX_ORDER boundary, then we reserve the page by
>   * not freeing it to the buddy allocator.
>   */
> -static void init_heap_pages(
> -    struct page_info *pg, unsigned long nr_pages)
> +static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
> +                            bool scrub)

Is this new parameter strictly needed, i.e. can free_heap_pages()
be called with uninitialized pages which need scrubbing? (The
code change is simple enough, and hence may warrant keeping, but
then the commit message could indicate so in case this isn't a
strict requirement.)

> @@ -2301,10 +2316,11 @@ int assign_pages(
>      for ( i = 0; i < (1 << order); i++ )
>      {
>          ASSERT(page_get_owner(&pg[i]) == NULL);
> -        ASSERT(!pg[i].count_info);
> +        ASSERT(pg[i].count_info == PGC_state_inuse ||
> +               pg[i].count_info == PGC_state_uninitialised);

Same question here: Can uninitialized pages make it here? If
so, wouldn't it be better to correct this, rather than having
the more permissive assertion?

>          page_set_owner(&pg[i], d);
>          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
> -        pg[i].count_info = PGC_allocated | 1;
> +        pg[i].count_info |= PGC_allocated | 1;

This is too relaxed for my taste: I understand you want to
retain page state, but I suppose other bits would want clearing
nevertheless.

> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -72,12 +72,13 @@
>   * { inuse, offlining, offlined, free, broken_offlining, broken }
>   */
>  #define PGC_state                  PG_mask(7, 9)
> -#define PGC_state_inuse            PG_mask(0, 9)
> +#define PGC_state_uninitialised    PG_mask(0, 9)
>  #define PGC_state_offlining        PG_mask(1, 9)
>  #define PGC_state_offlined         PG_mask(2, 9)
>  #define PGC_state_free             PG_mask(3, 9)
>  #define PGC_state_broken_offlining PG_mask(4, 9)
>  #define PGC_state_broken           PG_mask(5, 9)
> +#define PGC_state_inuse            PG_mask(6, 9)

Would imo be nice if this most common state was actually
either 1 or 7, for easy recognition. But the most suitable
value to pick may also depend on the outcome of one of the
comments on patch 1.

Jan
Hi,

On 20/02/2020 11:59, Jan Beulich wrote:
> On 07.02.2020 19:04, David Woodhouse wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
>>
>>      page_set_owner(page, d);
>>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
>> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
>> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
>> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
>
> Can uninitialized pages really make it here?

IIRC, arch_init_memory() will share the first 1MB of the RAM but they
were never given to the page allocator.

>> @@ -2301,10 +2316,11 @@ int assign_pages(
>>      for ( i = 0; i < (1 << order); i++ )
>>      {
>>          ASSERT(page_get_owner(&pg[i]) == NULL);
>> -        ASSERT(!pg[i].count_info);
>> +        ASSERT(pg[i].count_info == PGC_state_inuse ||
>> +               pg[i].count_info == PGC_state_uninitialised);
>
> Same question here: Can uninitialized pages make it here?

Yes, in dom0_construct_pv() when the initrd is assigned to the guest.

> If so, wouldn't it be better to correct this, rather than having
> the more permissive assertion?

We would need to rework init_heap_pages() (or create a new function) so
the allocator initialises the state but the page is marked as
"reserved/used" rather than "freed". Most likely we will need a similar
function for live update. This is because the pages belonging to guests
should be untouched.

Cheers,
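Julien's suggestion amounts to a variant of init_heap_pages() that performs
the allocator's bookkeeping for a range but leaves every page marked as
allocated instead of freeing it. A minimal standalone sketch of that idea
follows; the function name, types and state values are invented purely for
illustration, and nothing of this shape exists in Xen at this point in the
thread:

#include <stdio.h>

/* Stand-ins for the Xen types and states involved; purely illustrative. */
#define PGC_state_uninitialised 0u
#define PGC_state_inuse         1u

struct page_info {
    unsigned int count_info;
};

/*
 * Hypothetical helper: make a page range known to the allocator but mark
 * every page as "inuse" instead of handing it to the free lists, so the
 * contents (e.g. a guest's initrd, or pages preserved across live update)
 * are never scrubbed or reallocated.
 */
static void init_heap_pages_as_used(struct page_info *pg, unsigned long nr_pages)
{
    for ( unsigned long i = 0; i < nr_pages; i++ )
        pg[i].count_info = PGC_state_inuse;    /* initialised, never freed */
}

int main(void)
{
    struct page_info initrd[8] = {{ 0 }};      /* range starts out "uninitialised" */

    init_heap_pages_as_used(initrd, 8);
    printf("first page state: %u\n", initrd[0].count_info);
    return 0;
}

Such a helper would let pages like dom0's initrd become known to the allocator
without their contents ever being scrubbed or merged into the free lists.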
On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
> On 07.02.2020 19:04, David Woodhouse wrote:
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
> >  
> >      page_set_owner(page, d);
> >      smp_wmb(); /* install valid domain ptr before updating refcnt. */
> > -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
> > +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
> > +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
>
> Can uninitialized pages really make it here?

Yep, we share the low 1MiB with dom_io.

> > @@ -1389,6 +1391,16 @@ static void free_heap_pages(
> >      ASSERT(order <= MAX_ORDER);
> >      ASSERT(node >= 0);
> >  
> > +    if ( page_state_is(pg, uninitialised) )
> > +    {
> > +        init_heap_pages(pg, 1 << order, need_scrub);
> > +        /*
> > +         * init_heap_pages() will call back into free_heap_pages() for
> > +         * each page but cannot keep recursing because each page will
> > +         * be set to PGC_state_inuse first.
> > +         */
> > +        return;
> > +    }
> >      spin_lock(&heap_lock);
>
> Can you also add a blank line above here please?

Done.

> > @@ -1780,11 +1792,10 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
> >   * latter is not on a MAX_ORDER boundary, then we reserve the page by
> >   * not freeing it to the buddy allocator.
> >   */
> > -static void init_heap_pages(
> > -    struct page_info *pg, unsigned long nr_pages)
> > +static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
> > +                            bool scrub)
>
> Is this new parameter strictly needed, i.e. can free_heap_pages()
> be called with uninitialized pages which need scrubbing? (The
> code change is simple enough, and hence may warrant keeping, but
> then the commit message could indicate so in case this isn't a
> strict requirement.)

Yes, I think it's feasible for the initramfs pages, which are assigned
to dom0 from uninitialised pages, to later get freed and then they'll
want scrubbing?

There *is* a path into free_heap_pages() with the need_scrub argument
set, and I'd have to *prove* that it can never happen in order to... I
don't know... put a BUG() in that case instead of supporting it? Didn't
seem like that was the thing I wanted to do.

> > @@ -2301,10 +2316,11 @@ int assign_pages(
> >      for ( i = 0; i < (1 << order); i++ )
> >      {
> >          ASSERT(page_get_owner(&pg[i]) == NULL);
> > -        ASSERT(!pg[i].count_info);
> > +        ASSERT(pg[i].count_info == PGC_state_inuse ||
> > +               pg[i].count_info == PGC_state_uninitialised);
>
> Same question here: Can uninitialized pages make it here? If
> so, wouldn't it be better to correct this, rather than having
> the more permissive assertion?

Yep, Dom0 initrd on x86.

    ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
           (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);

> >          page_set_owner(&pg[i], d);
> >          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
> > -        pg[i].count_info = PGC_allocated | 1;
> > +        pg[i].count_info |= PGC_allocated | 1;
>
> This is too relaxed for my taste: I understand you want to
> retain page state, but I suppose other bits would want clearing
> nevertheless.

You seem to have dropped the ASSERT immediately before the code snippet
you cited, in which arbitrary other contents of count_info are not
permitted. I put it back, in its current form, after rebasing on top of
Paul's commit c793d13944b45d adding PGC_extra.

> > --- a/xen/include/asm-x86/mm.h
> > +++ b/xen/include/asm-x86/mm.h
> > @@ -72,12 +72,13 @@
> >   * { inuse, offlining, offlined, free, broken_offlining, broken }
> >   */
> >  #define PGC_state                  PG_mask(7, 9)
> > -#define PGC_state_inuse            PG_mask(0, 9)
> > +#define PGC_state_uninitialised    PG_mask(0, 9)
> >  #define PGC_state_offlining        PG_mask(1, 9)
> >  #define PGC_state_offlined         PG_mask(2, 9)
> >  #define PGC_state_free             PG_mask(3, 9)
> >  #define PGC_state_broken_offlining PG_mask(4, 9)
> >  #define PGC_state_broken           PG_mask(5, 9)
> > +#define PGC_state_inuse            PG_mask(6, 9)
>
> Would imo be nice if this most common state was actually
> either 1 or 7, for easy recognition. But the most suitable
> value to pick may also depend on the outcome of one of the
> comments on patch 1.

Not quite sure why 1 and 7 are easier to recognise than other values.
The important one is that uninitialised has to be zero, since that's
the default (because that's what the frame table is memset to. Which is
changeable, but non-trivially so).
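The numeric debate is easier to follow with the field arithmetic spelled out.
In the sketch below, the PG_shift()/PG_mask() definitions are stand-ins
written for illustration rather than Xen's real ones; they show why a frame
table that is memset to zero necessarily starts out in whichever state is
encoded as PG_mask(0, 9), and how the three-bit state field is rewritten when
a page changes state:

#include <stdio.h>

/*
 * Stand-in definitions for the sketch; Xen's real PG_shift()/PG_mask()
 * macros live in the per-architecture mm.h headers and may differ in detail.
 */
#define BITS_PER_LONG           (sizeof(unsigned long) * 8)
#define PG_shift(idx)           (BITS_PER_LONG - (idx))
#define PG_mask(x, idx)         ((unsigned long)(x) << PG_shift(idx))

#define PGC_state               PG_mask(7, 9)  /* three-bit state field */
#define PGC_state_uninitialised PG_mask(0, 9)  /* must be 0: frame table is memset to 0 */
#define PGC_state_inuse         PG_mask(6, 9)  /* the value under discussion */

int main(void)
{
    unsigned long count_info = 0;              /* a freshly zeroed frame-table entry */

    /* A zeroed entry is "uninitialised" purely by virtue of the encoding. */
    printf("uninitialised? %d\n",
           (count_info & PGC_state) == PGC_state_uninitialised);

    /* Changing state rewrites only the three-bit field. */
    count_info = (count_info & ~PGC_state) | PGC_state_inuse;
    printf("state field now holds %lu\n",
           (count_info & PGC_state) >> PG_shift(9));

    return 0;
}

Under such an encoding, Jan's point is that a value like 7 sets all three bits
of the field and stands out in a hex dump of count_info, while David's
constraint is that whichever name maps to 0 is the state every page starts in
before init_heap_pages() ever sees it.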
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
> Sent: 17 March 2020 22:15
> To: Jan Beulich <jbeulich@suse.com>
> Cc: sstabellini@kernel.org; julien@xen.org; wl@xen.org; konrad.wilk@oracle.com;
> george.dunlap@eu.citrix.com; andrew.cooper3@citrix.com; ian.jackson@eu.citrix.com;
> george.dunlap@citrix.com; jeff.kubascik@dornerworks.com; Xia, Hongyan <hongyxia@amazon.com>;
> stewart.hildebrand@dornerworks.com; xen-devel@lists.xenproject.org
> Subject: Re: [Xen-devel] [PATCH 2/2] xen/mm: Introduce PG_state_uninitialised
>
> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
> > On 07.02.2020 19:04, David Woodhouse wrote:
> > > --- a/xen/arch/x86/mm.c
> > > +++ b/xen/arch/x86/mm.c
> > > @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
> > >
> > >      page_set_owner(page, d);
> > >      smp_wmb(); /* install valid domain ptr before updating refcnt. */
> > > -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
> > > +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
> > > +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
> >
> > Can uninitialized pages really make it here?
>
> Yep, we share the low 1MiB with dom_io.
>

OOI anyone know why we do this? Is it actually necessary?

  Paul
On 17.03.2020 23:15, David Woodhouse wrote:
> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
>> On 07.02.2020 19:04, David Woodhouse wrote:
>
>     ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
>            (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
>
>>>          page_set_owner(&pg[i], d);
>>>          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
>>> -        pg[i].count_info = PGC_allocated | 1;
>>> +        pg[i].count_info |= PGC_allocated | 1;
>>
>> This is too relaxed for my taste: I understand you want to
>> retain page state, but I suppose other bits would want clearing
>> nevertheless.
>
> You seem to have dropped the ASSERT immediately before the code snippet
> you cited, in which arbitrary other contents of count_info are not
> permitted. I put it back, in its current form, after rebasing on top of
> Paul's commit c793d13944b45d adding PGC_extra.

But that's only an ASSERT(), i.e. no protection at all in release builds.

>>> --- a/xen/include/asm-x86/mm.h
>>> +++ b/xen/include/asm-x86/mm.h
>>> @@ -72,12 +72,13 @@
>>>   * { inuse, offlining, offlined, free, broken_offlining, broken }
>>>   */
>>>  #define PGC_state                  PG_mask(7, 9)
>>> -#define PGC_state_inuse            PG_mask(0, 9)
>>> +#define PGC_state_uninitialised    PG_mask(0, 9)
>>>  #define PGC_state_offlining        PG_mask(1, 9)
>>>  #define PGC_state_offlined         PG_mask(2, 9)
>>>  #define PGC_state_free             PG_mask(3, 9)
>>>  #define PGC_state_broken_offlining PG_mask(4, 9)
>>>  #define PGC_state_broken           PG_mask(5, 9)
>>> +#define PGC_state_inuse            PG_mask(6, 9)
>>
>> Would imo be nice if this most common state was actually
>> either 1 or 7, for easy recognition. But the most suitable
>> value to pick may also depend on the outcome of one of the
>> comments on patch 1.
>
> Not quite sure why 1 and 7 are easier to recognise than other values.
> The important one is that uninitialised has to be zero, since that's
> the default (because that's what the frame table is memset to. Which is
> changeable, but non-trivially so).

In particular 7 may imo be easier to recognize, as having all of the
three bits set. It sometimes helps seeing such at (almost) the first
glance in e.g. logged data.

Jan
On 18.03.2020 09:53, Paul Durrant wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
>> Sent: 17 March 2020 22:15
>>
>> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
>>> On 07.02.2020 19:04, David Woodhouse wrote:
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
>>>>
>>>>      page_set_owner(page, d);
>>>>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
>>>> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
>>>> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
>>>> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
>>>
>>> Can uninitialized pages really make it here?
>>
>> Yep, we share the low 1MiB with dom_io.
>>
>
> OOI anyone know why we do this? Is it actually necessary?

Yes, for Dom0 to be able to access things like EBDA, IBFT, or data
found in BIOS space.

Jan
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 18 March 2020 10:10
> To: paul@xen.org
> Cc: 'David Woodhouse' <dwmw2@infradead.org>; sstabellini@kernel.org; julien@xen.org; wl@xen.org;
> konrad.wilk@oracle.com; george.dunlap@eu.citrix.com; andrew.cooper3@citrix.com;
> ian.jackson@eu.citrix.com; george.dunlap@citrix.com; jeff.kubascik@dornerworks.com; 'Xia, Hongyan'
> <hongyxia@amazon.com>; stewart.hildebrand@dornerworks.com; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH 2/2] xen/mm: Introduce PG_state_uninitialised
>
> On 18.03.2020 09:53, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
> >> Sent: 17 March 2020 22:15
> >>
> >> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
> >>> On 07.02.2020 19:04, David Woodhouse wrote:
> >>>> --- a/xen/arch/x86/mm.c
> >>>> +++ b/xen/arch/x86/mm.c
> >>>> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
> >>>>
> >>>>      page_set_owner(page, d);
> >>>>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
> >>>> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
> >>>> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
> >>>> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
> >>>
> >>> Can uninitialized pages really make it here?
> >>
> >> Yep, we share the low 1MiB with dom_io.
> >>
> >
> > OOI anyone know why we do this? Is it actually necessary?
>
> Yes, for Dom0 to be able to access things like EBDA, IBFT, or data
> found in BIOS space.
>

Ok. I am still wondering why dom0's low 1MiB of pfn space is not simply
mapped 1:1 though. Just historical?

  Paul

> Jan
On 18.03.2020 11:41, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 18 March 2020 10:10
>> To: paul@xen.org
>> Cc: 'David Woodhouse' <dwmw2@infradead.org>; sstabellini@kernel.org; julien@xen.org; wl@xen.org;
>> konrad.wilk@oracle.com; george.dunlap@eu.citrix.com; andrew.cooper3@citrix.com;
>> ian.jackson@eu.citrix.com; george.dunlap@citrix.com; jeff.kubascik@dornerworks.com; 'Xia, Hongyan'
>> <hongyxia@amazon.com>; stewart.hildebrand@dornerworks.com; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH 2/2] xen/mm: Introduce PG_state_uninitialised
>>
>> On 18.03.2020 09:53, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
>>>> Sent: 17 March 2020 22:15
>>>>
>>>> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
>>>>> On 07.02.2020 19:04, David Woodhouse wrote:
>>>>>> --- a/xen/arch/x86/mm.c
>>>>>> +++ b/xen/arch/x86/mm.c
>>>>>> @@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
>>>>>>
>>>>>>      page_set_owner(page, d);
>>>>>>      smp_wmb(); /* install valid domain ptr before updating refcnt. */
>>>>>> -    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
>>>>>> +    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
>>>>>> +           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
>>>>>
>>>>> Can uninitialized pages really make it here?
>>>>
>>>> Yep, we share the low 1MiB with dom_io.
>>>>
>>>
>>> OOI anyone know why we do this? Is it actually necessary?
>>
>> Yes, for Dom0 to be able to access things like EBDA, IBFT, or data
>> found in BIOS space.
>>
>
> Ok. I am still wondering why dom0's low 1MiB of pfn space is not
> simply mapped 1:1 though. Just historical?

Well, in a way perhaps. Using the DomIO approach is less of a special
case than mapping some arbitrary range 1:1. Furthermore Dom0 being PV
wouldn't necessarily expect any BIOS in its PFN range there, but rather
views it as normal RAM.

Jan
On Wed, 2020-03-18 at 11:03 +0100, Jan Beulich wrote:
> On 17.03.2020 23:15, David Woodhouse wrote:
> > On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
> > > On 07.02.2020 19:04, David Woodhouse wrote:
> >
> >     ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
> >            (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
> > > >          page_set_owner(&pg[i], d);
> > > >          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
> > > > -        pg[i].count_info = PGC_allocated | 1;
> > > > +        pg[i].count_info |= PGC_allocated | 1;
> > >
> > > This is too relaxed for my taste: I understand you want to
> > > retain page state, but I suppose other bits would want clearing
> > > nevertheless.
> >
> > You seem to have dropped the ASSERT immediately before the code snippet
> > you cited, in which arbitrary other contents of count_info are not
> > permitted. I put it back, in its current form, after rebasing on top of
> > Paul's commit c793d13944b45d adding PGC_extra.
>
> But that's only an ASSERT(), i.e. no protection at all in release builds.

An ASSERT does protect release builds. If the rule is that you must
never call assign_pages() with pages that have the other bits in
count_info set, then the ASSERT helps to catch the cases where people
introduce a bug and start doing precisely that, and the bug never
*makes* it to release builds.

What we're debating here is the behaviour of assign_pages() when
someone introduces such a bug and calls it with inappropriate pages.

Currently, the behaviour is that the other flags are silently cleared.
I've seen no analysis that such clearing is correct or desirable. In
fact, for the PGC_state bits I determined that it now is NOT correct,
which is why I changed it.

While I was at it, I let it preserve the other bits — which, again,
should never be set, and which would trigger the ASSERT in debug builds
if it were to happen.

But I'm not tied to that behaviour. It's still a "can never happen"
case as far as I'm concerned. So let's make it look like this:

    for ( i = 0; i < (1 << order); i++ )
    {
        ASSERT(page_get_owner(&pg[i]) == NULL);
        /*
         * Note: Not using page_state_is() here. The ASSERT requires that
         * all other bits in count_info are zero, in addition to PGC_state
         * being appropriate.
         */
        ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
               (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
        page_set_owner(&pg[i], d);
        smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
        pg[i].count_info = (pg[i].count_info & PGC_state) | PGC_allocated | 1;
        page_list_add_tail(&pg[i], &d->page_list);
    }

OK?
On 18.03.2020 13:11, David Woodhouse wrote:
> On Wed, 2020-03-18 at 11:03 +0100, Jan Beulich wrote:
>> On 17.03.2020 23:15, David Woodhouse wrote:
>>> On Thu, 2020-02-20 at 12:59 +0100, Jan Beulich wrote:
>>>> On 07.02.2020 19:04, David Woodhouse wrote:
>>>
>>>     ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
>>>            (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
>>>>>          page_set_owner(&pg[i], d);
>>>>>          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
>>>>> -        pg[i].count_info = PGC_allocated | 1;
>>>>> +        pg[i].count_info |= PGC_allocated | 1;
>>>>
>>>> This is too relaxed for my taste: I understand you want to
>>>> retain page state, but I suppose other bits would want clearing
>>>> nevertheless.
>>>
>>> You seem to have dropped the ASSERT immediately before the code snippet
>>> you cited, in which arbitrary other contents of count_info are not
>>> permitted. I put it back, in its current form, after rebasing on top of
>>> Paul's commit c793d13944b45d adding PGC_extra.
>>
>> But that's only an ASSERT(), i.e. no protection at all in release builds.
>
> An ASSERT does protect release builds. If the rule is that you must
> never call assign_pages() with pages that have the other bits in
> count_info set, then the ASSERT helps to catch the cases where people
> introduce a bug and start doing precisely that, and the bug never
> *makes* it to release builds.
>
> What we're debating here is the behaviour of assign_pages() when
> someone introduces such a bug and calls it with inappropriate pages.
>
> Currently, the behaviour is that the other flags are silently cleared.
> I've seen no analysis that such clearing is correct or desirable. In
> fact, for the PGC_state bits I determined that it now is NOT correct,
> which is why I changed it.
>
> While I was at it, I let it preserve the other bits — which, again,
> should never be set, and which would trigger the ASSERT in debug builds
> if it were to happen.
>
> But I'm not tied to that behaviour. It's still a "can never happen"
> case as far as I'm concerned. So let's make it look like this:
>
>     for ( i = 0; i < (1 << order); i++ )
>     {
>         ASSERT(page_get_owner(&pg[i]) == NULL);
>         /*
>          * Note: Not using page_state_is() here. The ASSERT requires that
>          * all other bits in count_info are zero, in addition to PGC_state
>          * being appropriate.
>          */
>         ASSERT((pg[i].count_info & ~PGC_extra) == PGC_state_inuse ||
>                (pg[i].count_info & ~PGC_extra) == PGC_state_uninitialised);
>         page_set_owner(&pg[i], d);
>         smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
>         pg[i].count_info = (pg[i].count_info & PGC_state) | PGC_allocated | 1;
>         page_list_add_tail(&pg[i], &d->page_list);
>     }
>
> OK?

Yes, thanks.

Jan
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9b33829084..bf660ee8eb 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -488,7 +488,8 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d,
 
     page_set_owner(page, d);
     smp_wmb(); /* install valid domain ptr before updating refcnt. */
-    ASSERT((page->count_info & ~PGC_xen_heap) == 0);
+    ASSERT((page->count_info & ~PGC_xen_heap) == PGC_state_inuse ||
+           (page->count_info & ~PGC_xen_heap) == PGC_state_uninitialised);
 
     /* Only add to the allocation list if the domain isn't dying. */
     if ( !d->is_dying )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 4084503554..9703a2c664 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1407,6 +1407,7 @@ static void free_heap_pages(
         switch ( pg[i].count_info & PGC_state )
         {
         case PGC_state_inuse:
+        case PGC_state_uninitialised:
             pg[i].count_info = PGC_state_free;
             break;
 
@@ -1780,11 +1781,10 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
  * latter is not on a MAX_ORDER boundary, then we reserve the page by
  * not freeing it to the buddy allocator.
  */
-static void init_heap_pages(
-    struct page_info *pg, unsigned long nr_pages)
+static void init_heap_pages(struct page_info *pg, unsigned long nr_pages,
+                            bool scrub)
 {
     unsigned long i;
-    bool idle_scrub = false;
 
     /*
      * Keep MFN 0 away from the buddy allocator to avoid crossing zone
@@ -1809,7 +1809,7 @@ static void init_heap_pages(
     spin_unlock(&heap_lock);
 
     if ( system_state < SYS_STATE_active && opt_bootscrub == BOOTSCRUB_IDLE )
-        idle_scrub = true;
+        scrub = true;
 
     for ( i = 0; i < nr_pages; i++ )
     {
@@ -1837,7 +1837,7 @@ static void init_heap_pages(
             nr_pages -= n;
         }
 
-        free_heap_pages(pg + i, 0, scrub_debug || idle_scrub);
+        free_heap_pages(pg + i, 0, scrub_debug || scrub);
     }
 }
 
@@ -1873,7 +1873,7 @@ void __init end_boot_allocator(void)
         if ( (r->s < r->e) &&
              (phys_to_nid(pfn_to_paddr(r->s)) == cpu_to_node(0)) )
         {
-            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
+            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s, false);
             r->e = r->s;
             break;
         }
@@ -1882,7 +1882,7 @@ void __init end_boot_allocator(void)
     {
         struct bootmem_region *r = &bootmem_region_list[i];
         if ( r->s < r->e )
-            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
+            init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s, false);
     }
 
     nr_bootmem_regions = 0;
@@ -2151,7 +2151,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe)
 
     memguard_guard_range(maddr_to_virt(ps), pe - ps);
 
-    init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT);
+    init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT, false);
 }
 
 
@@ -2174,14 +2174,20 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
+    struct page_info *pg;
     ASSERT(!in_irq());
 
     if ( v == NULL )
         return;
 
+    pg = virt_to_page(v);
+
     memguard_guard_range(v, 1 << (order + PAGE_SHIFT));
 
-    free_heap_pages(virt_to_page(v), order, false);
+    if ( unlikely(page_state_is(pg, uninitialised)) )
+        init_heap_pages(pg, 1 << order, true);
+    else
+        free_heap_pages(pg, order, false);
 }
 
 #else /* !CONFIG_SEPARATE_XENHEAP */
@@ -2237,7 +2243,10 @@ void free_xenheap_pages(void *v, unsigned int order)
     for ( i = 0; i < (1u << order); i++ )
         pg[i].count_info &= ~PGC_xen_heap;
 
-    free_heap_pages(pg, order, true);
+    if ( unlikely(page_state_is(pg, uninitialised)) )
+        init_heap_pages(pg, 1 << order, true);
+    else
+        free_heap_pages(pg, order, true);
 }
 
 #endif /* CONFIG_SEPARATE_XENHEAP */
@@ -2260,7 +2269,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe)
     if ( mfn_x(emfn) <= mfn_x(smfn) )
         return;
 
-    init_heap_pages(mfn_to_page(smfn), mfn_x(emfn) - mfn_x(smfn));
+    init_heap_pages(mfn_to_page(smfn), mfn_x(emfn) - mfn_x(smfn), false);
 }
 
 
@@ -2301,10 +2310,11 @@ int assign_pages(
     for ( i = 0; i < (1 << order); i++ )
     {
         ASSERT(page_get_owner(&pg[i]) == NULL);
-        ASSERT(!pg[i].count_info);
+        ASSERT(pg[i].count_info == PGC_state_inuse ||
+               pg[i].count_info == PGC_state_uninitialised);
        page_set_owner(&pg[i], d);
         smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
-        pg[i].count_info = PGC_allocated | 1;
+        pg[i].count_info |= PGC_allocated | 1;
         page_list_add_tail(&pg[i], &d->page_list);
     }
 
@@ -2427,7 +2437,10 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
             scrub = 1;
         }
 
-        free_heap_pages(pg, order, scrub);
+        if ( unlikely(page_state_is(pg, uninitialised)) )
+            init_heap_pages(pg, 1 << order, scrub);
+        else
+            free_heap_pages(pg, order, scrub);
     }
 
     if ( drop_dom_ref )
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index c9466c8ca0..c696941600 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -117,12 +117,13 @@ struct page_info
  * { inuse, offlining, offlined, free, broken_offlining, broken }
  */
 #define PGC_state                  PG_mask(7, 9)
-#define PGC_state_inuse            PG_mask(0, 9)
+#define PGC_state_uninitialised    PG_mask(0, 9)
 #define PGC_state_offlining        PG_mask(1, 9)
 #define PGC_state_offlined         PG_mask(2, 9)
 #define PGC_state_free             PG_mask(3, 9)
 #define PGC_state_broken_offlining PG_mask(4, 9)
 #define PGC_state_broken           PG_mask(5, 9)
+#define PGC_state_inuse            PG_mask(6, 9)
 
 #define page_state_is(pg, st)      (((pg)->count_info&PGC_state) == PGC_state_##st)
 #define page_is_broken(pg)         (page_state_is((pg), broken_offlining) || \
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 3edadf7a7c..645368e6a9 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -72,12 +72,13 @@
  * { inuse, offlining, offlined, free, broken_offlining, broken }
  */
 #define PGC_state                  PG_mask(7, 9)
-#define PGC_state_inuse            PG_mask(0, 9)
+#define PGC_state_uninitialised    PG_mask(0, 9)
 #define PGC_state_offlining        PG_mask(1, 9)
 #define PGC_state_offlined         PG_mask(2, 9)
 #define PGC_state_free             PG_mask(3, 9)
 #define PGC_state_broken_offlining PG_mask(4, 9)
 #define PGC_state_broken           PG_mask(5, 9)
+#define PGC_state_inuse            PG_mask(6, 9)
 
 #define page_state_is(pg, st)      (((pg)->count_info&PGC_state) == PGC_state_##st)
 #define page_is_broken(pg)         (page_state_is((pg), broken_offlining) || \