diff mbox series

[2/2] xen/arm: p2m: Populate pages for GICv2 mapping in arch_domain_create()

Message ID 20221017191237.11079-3-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Headers show
Series arm/p2m: XSA-409 fix | expand

Commit Message

Andrew Cooper Oct. 17, 2022, 7:12 p.m. UTC
From: Henry Wang <Henry.Wang@arm.com>

The XSA-409 fixes discovered that the GICv2 path tries to create P2M mappings
in the domain_create() path.  This fails, as the P2M pool is empty before a
XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION hypercall.

As a stopgap, automatically give domains 16 pages of P2M memory.  This is
large enough to allow the GICv2 case to work, but small enough to not
introduce a continuation worry.

A consequence is that, for later error paths domain_create(), we end up in
p2m_final_teardown() with a nonzero P2M pool.  Such a domain has no vCPUs, and
has never been scheduled, so free the memory directly.

Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the P2M pool")
Suggested-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Henry Wang <Henry.Wang@arm.com>
---
 xen/arch/arm/p2m.c | 43 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

Comments

Julien Grall Oct. 17, 2022, 8:36 p.m. UTC | #1
Hi Andrew,

On 17/10/2022 20:12, Andrew Cooper wrote:
> From: Henry Wang <Henry.Wang@arm.com>
> 
> The XSA-409 fixes discovered that the GICv2 path tries to create P2M mappings
> in the domain_create() path.  This fails, as the P2M pool is empty before a
> XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION hypercall.
> 
> As a stopgap, automatically give domains 16 pages of P2M memory.  This is
> large enough to allow the GICv2 case to work, but small enough to not
> introduce a continuation worry.
> 
> A consequence is that, for later error paths domain_create(), we end up in
> p2m_final_teardown() with a nonzero P2M pool.  Such a domain has no vCPUs, and
> has never been scheduled, so free the memory directly.
> 
> Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the P2M pool")
> Suggested-by: Julien Grall <jgrall@amazon.com>

This is not really in the spirit of my original suggestion anymore... In 
fact, you drop all the explanations regarding how the code is fragile 
(e.g. we are relying on early mapping to not take any extra reference). 
Maybe you don't care, but I do as Henry and I spent ages to figure out 
all the corner cases. In addition to that...

> Signed-off-by: Henry Wang <Henry.Wang@arm.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Julien Grall <julien@xen.org>
> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
> CC: Bertrand Marquis <bertrand.marquis@arm.com>
> CC: Henry Wang <Henry.Wang@arm.com>
> ---
>   xen/arch/arm/p2m.c | 43 +++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 6826f6315080..76a0e31c6c8c 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -1736,8 +1736,36 @@ void p2m_final_teardown(struct domain *d)
>       if ( !p2m->domain )
>           return;
>   
> -    ASSERT(page_list_empty(&p2m->pages));
> -    ASSERT(page_list_empty(&d->arch.paging.p2m_freelist));
> +    /*
> +     * On the domain_create() error path only, we can end up here with a
> +     * non-zero P2M pool.
> +     *
> +     * At present, this is a maximum of 16 pages, spread between p2m->pages
> +     * and the free list.  The domain has never been scheduled (it has no
> +     * vcpus), so there is TLB maintenance to perform; just free everything.
> +     */
> +    if ( !page_list_empty(&p2m->pages) ||
> +         !page_list_empty(&d->arch.paging.p2m_freelist) )
> +    {
> +        struct page_info *pg;
> +
> +        /*
> +         * There's no sensible "in the domain_create() error path" predicate,
> +         * so simply sanity check that we don't have unexpected work to do.
> +         */
> +        ASSERT(d->arch.paging.p2m_total_pages <= 16);
> +
> +        spin_lock(&d->arch.paging.lock);
> +
> +        while ( (pg = page_list_remove_head(&p2m->pages)) )
> +            free_domheap_page(pg);
> +        while ( (pg = page_list_remove_head(&d->arch.paging.p2m_freelist)) )
> +            free_domheap_page(pg);
> +
> +        d->arch.paging.p2m_total_pages = 0;
> +
> +        spin_unlock(&d->arch.paging.lock);
> +    }

... you are hardcoding both p2m_teardown() and p2m_set_allocation().
IMO this is not an improvement at all. It is just making the code more 
complex than necessary and lack all the explanation on the assumptions.

So while I am fine with your patch #1 (already reviewed it), there is a 
better patch from Henry on the ML. So we should take his (rebased) 
instead of yours.

Cheers,
Andrew Cooper Oct. 17, 2022, 9:50 p.m. UTC | #2
On 17/10/2022 21:36, Julien Grall wrote:
> Hi Andrew,
>
> On 17/10/2022 20:12, Andrew Cooper wrote:
>> From: Henry Wang <Henry.Wang@arm.com>
>>
>> The XSA-409 fixes discovered that the GICv2 path tries to create P2M
>> mappings
>> in the domain_create() path.  This fails, as the P2M pool is empty
>> before a
>> XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION hypercall.
>>
>> As a stopgap, automatically give domains 16 pages of P2M memory. 
>> This is
>> large enough to allow the GICv2 case to work, but small enough to not
>> introduce a continuation worry.
>>
>> A consequence is that, for later error paths domain_create(), we end
>> up in
>> p2m_final_teardown() with a nonzero P2M pool.  Such a domain has no
>> vCPUs, and
>> has never been scheduled, so free the memory directly.
>>
>> Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the
>> P2M pool")
>> Suggested-by: Julien Grall <jgrall@amazon.com>
>
> This is not really in the spirit of my original suggestion anymore

Ok, I have dropped it.

> ... In fact, you drop all the explanations regarding how the code is
> fragile (e.g. we are relying on early mapping to not take any extra
> reference). Maybe you don't care, but I do as Henry and I spent ages
> to figure out all the corner cases.

I presume you're referring to the todo?  If so, that's an statement, not
an explanation of what is suddenly different about it.

What has XSA-409 changed in this regard?  Because it looks like the
answer is nothing and the GICv2 path was similarly fragile beforehand. 
In which case, why it is appropriate content for a security patch?

>
>> Signed-off-by: Henry Wang <Henry.Wang@arm.com>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Stefano Stabellini <sstabellini@kernel.org>
>> CC: Julien Grall <julien@xen.org>
>> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
>> CC: Bertrand Marquis <bertrand.marquis@arm.com>
>> CC: Henry Wang <Henry.Wang@arm.com>
>> ---
>>   xen/arch/arm/p2m.c | 43 +++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 41 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>> index 6826f6315080..76a0e31c6c8c 100644
>> --- a/xen/arch/arm/p2m.c
>> +++ b/xen/arch/arm/p2m.c
>> @@ -1736,8 +1736,36 @@ void p2m_final_teardown(struct domain *d)
>>       if ( !p2m->domain )
>>           return;
>>   -    ASSERT(page_list_empty(&p2m->pages));
>> -    ASSERT(page_list_empty(&d->arch.paging.p2m_freelist));
>> +    /*
>> +     * On the domain_create() error path only, we can end up here
>> with a
>> +     * non-zero P2M pool.
>> +     *
>> +     * At present, this is a maximum of 16 pages, spread between
>> p2m->pages
>> +     * and the free list.  The domain has never been scheduled (it
>> has no
>> +     * vcpus), so there is TLB maintenance to perform; just free
>> everything.
>> +     */
>> +    if ( !page_list_empty(&p2m->pages) ||
>> +         !page_list_empty(&d->arch.paging.p2m_freelist) )
>> +    {
>> +        struct page_info *pg;
>> +
>> +        /*
>> +         * There's no sensible "in the domain_create() error path"
>> predicate,
>> +         * so simply sanity check that we don't have unexpected work
>> to do.
>> +         */
>> +        ASSERT(d->arch.paging.p2m_total_pages <= 16);
>> +
>> +        spin_lock(&d->arch.paging.lock);
>> +
>> +        while ( (pg = page_list_remove_head(&p2m->pages)) )
>> +            free_domheap_page(pg);
>> +        while ( (pg =
>> page_list_remove_head(&d->arch.paging.p2m_freelist)) )
>> +            free_domheap_page(pg);
>> +
>> +        d->arch.paging.p2m_total_pages = 0;
>> +
>> +        spin_unlock(&d->arch.paging.lock);
>> +    }
>
> ... you are hardcoding both p2m_teardown() and p2m_set_allocation().
> IMO this is not an improvement at all. It is just making the code more
> complex than necessary and lack all the explanation on the assumptions.
>
> So while I am fine with your patch #1 (already reviewed it), there is
> a better patch from Henry on the ML. So we should take his (rebased)
> instead of yours.

If by better, you mean something that still has errors, then sure.

There's a really good reason why you cannot safely repurpose
p2m_teardown().  It's written expecting a fully constructed domain -
which is fine because that's how it is used.  It doesn't cope safely
with an partially constructed domain.


Yes, this code is a bit nasty, but it's less buggy than all attempts
presented thus far, specifically because it avoids inappropriate
repurposing of infrastructure.

At this point, we're a week after the publishing of XSA-409 and CI is
still red across the board.  There were multiple failings which have
lead to this situation, that the security team need to deal with, but
right now, we need a correct fix and we need it yesterday.

~Andrew
Julien Grall Oct. 17, 2022, 11:01 p.m. UTC | #3
Hi Andrew,

On 17/10/2022 22:50, Andrew Cooper wrote:
> On 17/10/2022 21:36, Julien Grall wrote:
>> Hi Andrew,
>>
>> On 17/10/2022 20:12, Andrew Cooper wrote:
>>> From: Henry Wang <Henry.Wang@arm.com>
>>>
>>> The XSA-409 fixes discovered that the GICv2 path tries to create P2M
>>> mappings
>>> in the domain_create() path.  This fails, as the P2M pool is empty
>>> before a
>>> XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION hypercall.
>>>
>>> As a stopgap, automatically give domains 16 pages of P2M memory.
>>> This is
>>> large enough to allow the GICv2 case to work, but small enough to not
>>> introduce a continuation worry.
>>>
>>> A consequence is that, for later error paths domain_create(), we end
>>> up in
>>> p2m_final_teardown() with a nonzero P2M pool.  Such a domain has no
>>> vCPUs, and
>>> has never been scheduled, so free the memory directly.
>>>
>>> Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the
>>> P2M pool")
>>> Suggested-by: Julien Grall <jgrall@amazon.com>
>>
>> This is not really in the spirit of my original suggestion anymore
> 
> Ok, I have dropped it.
> 
>> ... In fact, you drop all the explanations regarding how the code is
>> fragile (e.g. we are relying on early mapping to not take any extra
>> reference). Maybe you don't care, but I do as Henry and I spent ages
>> to figure out all the corner cases.
> 
> I presume you're referring to the todo?  If so, that's an statement, not
> an explanation of what is suddenly different about it.
> 
> What has XSA-409 changed in this regard?  Because it looks like the
> answer is nothing and the GICv2 path was similarly fragile beforehand.
> In which case, why it is appropriate content for a security patch?

This is explaining why the current logic (and the one you add) is still 
OK. It is not entirely related to XSA-409, but relevant to the fix 
itself (and why the issue is now "properly" closed).

> 
>>
>>> Signed-off-by: Henry Wang <Henry.Wang@arm.com>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> CC: Stefano Stabellini <sstabellini@kernel.org>
>>> CC: Julien Grall <julien@xen.org>
>>> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
>>> CC: Bertrand Marquis <bertrand.marquis@arm.com>
>>> CC: Henry Wang <Henry.Wang@arm.com>
>>> ---
>>>    xen/arch/arm/p2m.c | 43 +++++++++++++++++++++++++++++++++++++++++--
>>>    1 file changed, 41 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>>> index 6826f6315080..76a0e31c6c8c 100644
>>> --- a/xen/arch/arm/p2m.c
>>> +++ b/xen/arch/arm/p2m.c
>>> @@ -1736,8 +1736,36 @@ void p2m_final_teardown(struct domain *d)
>>>        if ( !p2m->domain )
>>>            return;
>>>    -    ASSERT(page_list_empty(&p2m->pages));
>>> -    ASSERT(page_list_empty(&d->arch.paging.p2m_freelist));
>>> +    /*
>>> +     * On the domain_create() error path only, we can end up here
>>> with a
>>> +     * non-zero P2M pool.
>>> +     *
>>> +     * At present, this is a maximum of 16 pages, spread between
>>> p2m->pages
>>> +     * and the free list.  The domain has never been scheduled (it
>>> has no
>>> +     * vcpus), so there is TLB maintenance to perform; just free
>>> everything.
>>> +     */
>>> +    if ( !page_list_empty(&p2m->pages) ||
>>> +         !page_list_empty(&d->arch.paging.p2m_freelist) )
>>> +    {
>>> +        struct page_info *pg;
>>> +
>>> +        /*
>>> +         * There's no sensible "in the domain_create() error path"
>>> predicate,
>>> +         * so simply sanity check that we don't have unexpected work
>>> to do.
>>> +         */
>>> +        ASSERT(d->arch.paging.p2m_total_pages <= 16);
>>> +
>>> +        spin_lock(&d->arch.paging.lock);
>>> +
>>> +        while ( (pg = page_list_remove_head(&p2m->pages)) )
>>> +            free_domheap_page(pg);
>>> +        while ( (pg =
>>> page_list_remove_head(&d->arch.paging.p2m_freelist)) )
>>> +            free_domheap_page(pg);
>>> +
>>> +        d->arch.paging.p2m_total_pages = 0;
>>> +
>>> +        spin_unlock(&d->arch.paging.lock);
>>> +    }
>>
>> ... you are hardcoding both p2m_teardown() and p2m_set_allocation().
>> IMO this is not an improvement at all. It is just making the code more
>> complex than necessary and lack all the explanation on the assumptions.
>>
>> So while I am fine with your patch #1 (already reviewed it), there is
>> a better patch from Henry on the ML. So we should take his (rebased)
>> instead of yours.
> 
> If by better, you mean something that still has errors, then sure.
> 
> There's a really good reason why you cannot safely repurpose
> p2m_teardown().  It's written expecting a fully constructed domain -
> which is fine because that's how it is used.  It doesn't cope safely
> with an partially constructed domain.

It is not 100% clear what is the issue you are referring to as the VMID 
is valid at this point. So what part would be wrong?

But if there are part of p2m_teardown() that are not safe for partially 
constructed domain, then we should split the code. This would be much 
better that the duplication you are proposing.

Cheers,
Jan Beulich Oct. 18, 2022, 7:08 a.m. UTC | #4
On 17.10.2022 21:12, Andrew Cooper wrote:
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -1736,8 +1736,36 @@ void p2m_final_teardown(struct domain *d)
>      if ( !p2m->domain )
>          return;
>  
> -    ASSERT(page_list_empty(&p2m->pages));
> -    ASSERT(page_list_empty(&d->arch.paging.p2m_freelist));
> +    /*
> +     * On the domain_create() error path only, we can end up here with a
> +     * non-zero P2M pool.
> +     *
> +     * At present, this is a maximum of 16 pages, spread between p2m->pages
> +     * and the free list.  The domain has never been scheduled (it has no
> +     * vcpus), so there is TLB maintenance to perform; just free everything.
> +     */
> +    if ( !page_list_empty(&p2m->pages) ||
> +         !page_list_empty(&d->arch.paging.p2m_freelist) )
> +    {
> +        struct page_info *pg;
> +
> +        /*
> +         * There's no sensible "in the domain_create() error path" predicate,
> +         * so simply sanity check that we don't have unexpected work to do.
> +         */
> +        ASSERT(d->arch.paging.p2m_total_pages <= 16);

I guess this isn't sufficient as a sanity check, as the count (contrary
to the name of the field) is only representing all pages on p2m_freelist.

Jan

> +        spin_lock(&d->arch.paging.lock);
> +
> +        while ( (pg = page_list_remove_head(&p2m->pages)) )
> +            free_domheap_page(pg);
> +        while ( (pg = page_list_remove_head(&d->arch.paging.p2m_freelist)) )
> +            free_domheap_page(pg);
> +
> +        d->arch.paging.p2m_total_pages = 0;
> +
> +        spin_unlock(&d->arch.paging.lock);
> +    }
>  
>      if ( p2m->root )
>          free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
Andrew Cooper Oct. 19, 2022, 9:30 p.m. UTC | #5
On 18/10/2022 00:01, Julien Grall wrote:
>>>> Signed-off-by: Henry Wang <Henry.Wang@arm.com>
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> ---
>>>> CC: Stefano Stabellini <sstabellini@kernel.org>
>>>> CC: Julien Grall <julien@xen.org>
>>>> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
>>>> CC: Bertrand Marquis <bertrand.marquis@arm.com>
>>>> CC: Henry Wang <Henry.Wang@arm.com>
>>>> ---
>>>>    xen/arch/arm/p2m.c | 43 +++++++++++++++++++++++++++++++++++++++++--
>>>>    1 file changed, 41 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>>>> index 6826f6315080..76a0e31c6c8c 100644
>>>> --- a/xen/arch/arm/p2m.c
>>>> +++ b/xen/arch/arm/p2m.c
>>>> @@ -1736,8 +1736,36 @@ void p2m_final_teardown(struct domain *d)
>>>>        if ( !p2m->domain )
>>>>            return;
>>>>    -    ASSERT(page_list_empty(&p2m->pages));
>>>> -    ASSERT(page_list_empty(&d->arch.paging.p2m_freelist));
>>>> +    /*
>>>> +     * On the domain_create() error path only, we can end up here
>>>> with a
>>>> +     * non-zero P2M pool.
>>>> +     *
>>>> +     * At present, this is a maximum of 16 pages, spread between
>>>> p2m->pages
>>>> +     * and the free list.  The domain has never been scheduled (it
>>>> has no
>>>> +     * vcpus), so there is TLB maintenance to perform; just free
>>>> everything.
>>>> +     */
>>>> +    if ( !page_list_empty(&p2m->pages) ||
>>>> +         !page_list_empty(&d->arch.paging.p2m_freelist) )
>>>> +    {
>>>> +        struct page_info *pg;
>>>> +
>>>> +        /*
>>>> +         * There's no sensible "in the domain_create() error path"
>>>> predicate,
>>>> +         * so simply sanity check that we don't have unexpected work
>>>> to do.
>>>> +         */
>>>> +        ASSERT(d->arch.paging.p2m_total_pages <= 16);
>>>> +
>>>> +        spin_lock(&d->arch.paging.lock);
>>>> +
>>>> +        while ( (pg = page_list_remove_head(&p2m->pages)) )
>>>> +            free_domheap_page(pg);
>>>> +        while ( (pg =
>>>> page_list_remove_head(&d->arch.paging.p2m_freelist)) )
>>>> +            free_domheap_page(pg);
>>>> +
>>>> +        d->arch.paging.p2m_total_pages = 0;
>>>> +
>>>> +        spin_unlock(&d->arch.paging.lock);
>>>> +    }
>>>
>>> ... you are hardcoding both p2m_teardown() and p2m_set_allocation().
>>> IMO this is not an improvement at all. It is just making the code more
>>> complex than necessary and lack all the explanation on the assumptions.
>>>
>>> So while I am fine with your patch #1 (already reviewed it), there is
>>> a better patch from Henry on the ML. So we should take his (rebased)
>>> instead of yours.
>>
>> If by better, you mean something that still has errors, then sure.
>>
>> There's a really good reason why you cannot safely repurpose
>> p2m_teardown().  It's written expecting a fully constructed domain -
>> which is fine because that's how it is used.  It doesn't cope safely
>> with an partially constructed domain.
>
> It is not 100% clear what is the issue you are referring to as the
> VMID is valid at this point. So what part would be wrong?

Falling over a bad root pointer from an early construction exit.

> But if there are part of p2m_teardown() that are not safe for
> partially constructed domain, then we should split the code. This
> would be much better that the duplication you are proposing.

You have two totally different contexts with different safety
requirements.  c/s 55914f7fc9 is a reasonably good and clean separation
between preemptible and non-preemptible cleanup[1].

You've agreed that the introduction of the non-preemptible path to the
preemptible path is a hack and layering violation, and will need undoing
later.  Others have raised this concern too.

Now consider that we're taking the error path without ancillary
collateral damage.  It:
1) Zeros all the root frames
2) Switches to a remote VMID in order to flush the TLBs, not that they
need flushing in the first place
3) For allocated P2M pages, moves them one at a time onto the free list,
taking the paging lock for each frame
4) (wrapping the preemptible helper with an ignore loop) finally free
the complete pool.

... in a case where 16 is the chosen value because you're already
concerned about the hypercall taking too long.

Is that safe? Possibly.  Is it wise?  no.

You can't test the error path in question here (because my fault_ttl
patches are still pending).  "Correctness" is almost exclusively by code
inspection.

Also realise that you've now split the helper between regular hypercall
context, and RCU context, and recall what happened when we finally
started asserting that memory couldn't be allocated in stop-machine context.

How certain are you that the safety is the same on earlier versions of
Xen?  What is the likelihood that all of these actions will remain safe
given future development?


Despite what is being claimed, the attempt to share cleanup logic is
introducing fragility and risk, not removing it.  This is a bugfix for
to a security fix issue which is totally dead on arrival; net safety,
especially in older versions of the Xen, is *the highest priority*.

These two different contexts don't share any common properties of how to
clean up the pool, save freeing the frames back to the memory
allocator.  In a proper design, this is the hint that they shouldn't
share logic either.


Given that you do expect someone to spend yet-more time&effort to undo
the short term hack currently being proposed, how do you envisage the
end result looking?

~Andrew

[1] Although the order of actions in p2m_teardown() for the common case
is poor.  The root pagetables should be cleaned and freed first so steps
1 and 2 of the list above are not repeated for every continuation.
Julien Grall Oct. 19, 2022, 10:33 p.m. UTC | #6
Hi Andrew,

On 19/10/2022 22:30, Andrew Cooper wrote:
> On 18/10/2022 00:01, Julien Grall wrote:
>>>> ... you are hardcoding both p2m_teardown() and p2m_set_allocation().
>>>> IMO this is not an improvement at all. It is just making the code more
>>>> complex than necessary and lack all the explanation on the assumptions.
>>>>
>>>> So while I am fine with your patch #1 (already reviewed it), there is
>>>> a better patch from Henry on the ML. So we should take his (rebased)
>>>> instead of yours.
>>>
>>> If by better, you mean something that still has errors, then sure.
>>>
>>> There's a really good reason why you cannot safely repurpose
>>> p2m_teardown().  It's written expecting a fully constructed domain -
>>> which is fine because that's how it is used.  It doesn't cope safely
>>> with an partially constructed domain.
>>
>> It is not 100% clear what is the issue you are referring to as the
>> VMID is valid at this point. So what part would be wrong?
> 
> Falling over a bad root pointer from an early construction exit.

You have been mentioning that several time now but I can't see how this
can happen. If you look at Henry's second patch, p2m_teardown() starts
with the following check:
if ( page_list_empty(&p2m->pages) )
    return;

Per the logic in p2m_init(), the root pages have to be allocated (note 
they are *not* allocated from the P2M pool) and the VMID as well before 
any pages could be added in the list.

> 
>> But if there are part of p2m_teardown() that are not safe for
>> partially constructed domain, then we should split the code. This
>> would be much better that the duplication you are proposing.
> 
> You have two totally different contexts with different safety
> requirements.  c/s 55914f7fc9 is a reasonably good and clean separation
> between preemptible and non-preemptible cleanup[1].

The part you mention in [1] was decided to be delayed post 4.17 for 
development cycle reasons.

> 
> You've agreed that the introduction of the non-preemptible path to the
> preemptible path is a hack and layering violation, and will need undoing
> later.  Others have raised this concern too.

[...]

> 
> Also realise that you've now split the helper between regular hypercall
> context, and RCU context, and recall what happened when we finally
> started asserting that memory couldn't be allocated in stop-machine context.
> 
> How certain are you that the safety is the same on earlier versions of
> Xen?
I am pretty confident because the P2M code has not changed a lot.

> What is the likelihood that all of these actions will remain safe
> given future development?
Code always evolve and neither you (nor I) can claim that any work will 
stay safe forever. In your patch proposal, then the risk is a bug could 
be duplicated.

> 
> 
> Despite what is being claimed, the attempt to share cleanup logic is
> introducing fragility and risk, not removing it.

I find interesting you are saying that... If we were going to move 
p2m_teardown() in domain_teardown() then we would end up to share the code.

To me, this is not very different here because in one context it would 
be preemptible while the other it won't. At which point...

>  This is a bugfix for
> to a security fix issue which is totally dead on arrival; net safety,
> especially in older versions of the Xen, is *the highest priority*.
> 
> These two different contexts don't share any common properties of how to
> clean up the pool, save freeing the frames back to the memory
> allocator.  In a proper design, this is the hint that they shouldn't
> share logic either
... why is your design better than what Henry's proposed?

> 
> Given that you do expect someone to spend yet-more time&effort to undo
> the short term hack currently being proposed, how do you envisage the
> end result looking?

The end result will be p2m_teardown() & co to be called from 
domain_teardown().

Anyway, this discussion doesn't make us closer to come to an agreement 
on the correct approach. We have both very diverging opinion and so far 
I haven't seen any strong reasons that is showing yours is better.

So unless Bertrand or Stefano agree with you, then I will go ahead and 
merge Henry's patch tomorrow after a final review.

Cheers,
Bertrand Marquis Oct. 20, 2022, 7:45 a.m. UTC | #7
Hi Julien,

> On 19 Oct 2022, at 23:33, Julien Grall <julien@xen.org> wrote:
> 
> Hi Andrew,
> 
> On 19/10/2022 22:30, Andrew Cooper wrote:
>> On 18/10/2022 00:01, Julien Grall wrote:
>>>>> ... you are hardcoding both p2m_teardown() and p2m_set_allocation().
>>>>> IMO this is not an improvement at all. It is just making the code more
>>>>> complex than necessary and lack all the explanation on the assumptions.
>>>>> 
>>>>> So while I am fine with your patch #1 (already reviewed it), there is
>>>>> a better patch from Henry on the ML. So we should take his (rebased)
>>>>> instead of yours.
>>>> 
>>>> If by better, you mean something that still has errors, then sure.
>>>> 
>>>> There's a really good reason why you cannot safely repurpose
>>>> p2m_teardown().  It's written expecting a fully constructed domain -
>>>> which is fine because that's how it is used.  It doesn't cope safely
>>>> with an partially constructed domain.
>>> 
>>> It is not 100% clear what is the issue you are referring to as the
>>> VMID is valid at this point. So what part would be wrong?
>> Falling over a bad root pointer from an early construction exit.
> 
> You have been mentioning that several time now but I can't see how this
> can happen. If you look at Henry's second patch, p2m_teardown() starts
> with the following check:
> if ( page_list_empty(&p2m->pages) )
>   return;
> 
> Per the logic in p2m_init(), the root pages have to be allocated (note they are *not* allocated from the P2M pool) and the VMID as well before any pages could be added in the list.
> 
>>> But if there are part of p2m_teardown() that are not safe for
>>> partially constructed domain, then we should split the code. This
>>> would be much better that the duplication you are proposing.
>> You have two totally different contexts with different safety
>> requirements.  c/s 55914f7fc9 is a reasonably good and clean separation
>> between preemptible and non-preemptible cleanup[1].
> 
> The part you mention in [1] was decided to be delayed post 4.17 for development cycle reasons.
> 
>> You've agreed that the introduction of the non-preemptible path to the
>> preemptible path is a hack and layering violation, and will need undoing
>> later.  Others have raised this concern too.
> 
> [...]
> 
>> Also realise that you've now split the helper between regular hypercall
>> context, and RCU context, and recall what happened when we finally
>> started asserting that memory couldn't be allocated in stop-machine context.
>> How certain are you that the safety is the same on earlier versions of
>> Xen?
> I am pretty confident because the P2M code has not changed a lot.
> 
>> What is the likelihood that all of these actions will remain safe
>> given future development?
> Code always evolve and neither you (nor I) can claim that any work will stay safe forever. In your patch proposal, then the risk is a bug could be duplicated.
> 
>> Despite what is being claimed, the attempt to share cleanup logic is
>> introducing fragility and risk, not removing it.
> 
> I find interesting you are saying that... If we were going to move p2m_teardown() in domain_teardown() then we would end up to share the code.
> 
> To me, this is not very different here because in one context it would be preemptible while the other it won't. At which point...
> 
>>   This is a bugfix for
>> to a security fix issue which is totally dead on arrival; net safety,
>> especially in older versions of the Xen, is *the highest priority*.
>> These two different contexts don't share any common properties of how to
>> clean up the pool, save freeing the frames back to the memory
>> allocator.  In a proper design, this is the hint that they shouldn't
>> share logic either
> ... why is your design better than what Henry's proposed?
> 
>> Given that you do expect someone to spend yet-more time&effort to undo
>> the short term hack currently being proposed, how do you envisage the
>> end result looking?
> 
> The end result will be p2m_teardown() & co to be called from domain_teardown().
> 
> Anyway, this discussion doesn't make us closer to come to an agreement on the correct approach. We have both very diverging opinion and so far I haven't seen any strong reasons that is showing yours is better.
> 
> So unless Bertrand or Stefano agree with you, then I will go ahead and merge Henry's patch tomorrow after a final review.

At this stage, I still do not get the NULL pointer case as from my point of view the list_empty done at the beginning is making that case impossible.
We have a currently blocked status where any GICv2 based board is not booting and we all know that Henry’s solution will need to be reworked post 4.17.

So I will give my reviewed-by on Henry’s patch so that we have a short term solution giving us more time to discuss and find a proper solution.

Cheers
Bertrand


> 
> Cheers,
> 
> -- 
> Julien Grall
Jan Beulich Oct. 20, 2022, 7:55 a.m. UTC | #8
On 20.10.2022 00:33, Julien Grall wrote:
> On 19/10/2022 22:30, Andrew Cooper wrote:
>> On 18/10/2022 00:01, Julien Grall wrote:
>>>>> ... you are hardcoding both p2m_teardown() and p2m_set_allocation().
>>>>> IMO this is not an improvement at all. It is just making the code more
>>>>> complex than necessary and lack all the explanation on the assumptions.
>>>>>
>>>>> So while I am fine with your patch #1 (already reviewed it), there is
>>>>> a better patch from Henry on the ML. So we should take his (rebased)
>>>>> instead of yours.
>>>>
>>>> If by better, you mean something that still has errors, then sure.
>>>>
>>>> There's a really good reason why you cannot safely repurpose
>>>> p2m_teardown().  It's written expecting a fully constructed domain -
>>>> which is fine because that's how it is used.  It doesn't cope safely
>>>> with an partially constructed domain.
>>>
>>> It is not 100% clear what is the issue you are referring to as the
>>> VMID is valid at this point. So what part would be wrong?
>>
>> Falling over a bad root pointer from an early construction exit.
> 
> You have been mentioning that several time now but I can't see how this
> can happen. If you look at Henry's second patch, p2m_teardown() starts
> with the following check:
> if ( page_list_empty(&p2m->pages) )
>     return;
> 
> Per the logic in p2m_init(), the root pages have to be allocated (note 
> they are *not* allocated from the P2M pool) and the VMID as well before 
> any pages could be added in the list.
> 
>>
>>> But if there are part of p2m_teardown() that are not safe for
>>> partially constructed domain, then we should split the code. This
>>> would be much better that the duplication you are proposing.
>>
>> You have two totally different contexts with different safety
>> requirements.  c/s 55914f7fc9 is a reasonably good and clean separation
>> between preemptible and non-preemptible cleanup[1].
> 
> The part you mention in [1] was decided to be delayed post 4.17 for 
> development cycle reasons.
> 
>>
>> You've agreed that the introduction of the non-preemptible path to the
>> preemptible path is a hack and layering violation, and will need undoing
>> later.  Others have raised this concern too.
> 
> [...]
> 
>>
>> Also realise that you've now split the helper between regular hypercall
>> context, and RCU context, and recall what happened when we finally
>> started asserting that memory couldn't be allocated in stop-machine context.
>>
>> How certain are you that the safety is the same on earlier versions of
>> Xen?
> I am pretty confident because the P2M code has not changed a lot.
> 
>> What is the likelihood that all of these actions will remain safe
>> given future development?
> Code always evolve and neither you (nor I) can claim that any work will 
> stay safe forever. In your patch proposal, then the risk is a bug could 
> be duplicated.
> 
>>
>>
>> Despite what is being claimed, the attempt to share cleanup logic is
>> introducing fragility and risk, not removing it.
> 
> I find interesting you are saying that... If we were going to move 
> p2m_teardown() in domain_teardown() then we would end up to share the code.
> 
> To me, this is not very different here because in one context it would 
> be preemptible while the other it won't. At which point...
> 
>>   This is a bugfix for
>> to a security fix issue which is totally dead on arrival; net safety,
>> especially in older versions of the Xen, is *the highest priority*.
>>
>> These two different contexts don't share any common properties of how to
>> clean up the pool, save freeing the frames back to the memory
>> allocator.  In a proper design, this is the hint that they shouldn't
>> share logic either
> ... why is your design better than what Henry's proposed?
> 
>>
>> Given that you do expect someone to spend yet-more time&effort to undo
>> the short term hack currently being proposed, how do you envisage the
>> end result looking?
> 
> The end result will be p2m_teardown() & co to be called from 
> domain_teardown().
> 
> Anyway, this discussion doesn't make us closer to come to an agreement 
> on the correct approach. We have both very diverging opinion and so far 
> I haven't seen any strong reasons that is showing yours is better.
> 
> So unless Bertrand or Stefano agree with you, then I will go ahead and 
> merge Henry's patch tomorrow after a final review.

While Andrew makes several points worth considering, I agree here that
staging needs unbreaking rather sooner than later. I'm inclined to say
the patches want committing now, unless an actual bug was still known to
be present there (which I can't deduce from Andrew's reply). In the
absence of actual bugs it really should be the maintainers to have the
final say when there are multiple ways of carrying out certain
functionality.

Jan
diff mbox series

Patch

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 6826f6315080..76a0e31c6c8c 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1736,8 +1736,36 @@  void p2m_final_teardown(struct domain *d)
     if ( !p2m->domain )
         return;
 
-    ASSERT(page_list_empty(&p2m->pages));
-    ASSERT(page_list_empty(&d->arch.paging.p2m_freelist));
+    /*
+     * On the domain_create() error path only, we can end up here with a
+     * non-zero P2M pool.
+     *
+     * At present, this is a maximum of 16 pages, spread between p2m->pages
+     * and the free list.  The domain has never been scheduled (it has no
+     * vcpus), so there is TLB maintenance to perform; just free everything.
+     */
+    if ( !page_list_empty(&p2m->pages) ||
+         !page_list_empty(&d->arch.paging.p2m_freelist) )
+    {
+        struct page_info *pg;
+
+        /*
+         * There's no sensible "in the domain_create() error path" predicate,
+         * so simply sanity check that we don't have unexpected work to do.
+         */
+        ASSERT(d->arch.paging.p2m_total_pages <= 16);
+
+        spin_lock(&d->arch.paging.lock);
+
+        while ( (pg = page_list_remove_head(&p2m->pages)) )
+            free_domheap_page(pg);
+        while ( (pg = page_list_remove_head(&d->arch.paging.p2m_freelist)) )
+            free_domheap_page(pg);
+
+        d->arch.paging.p2m_total_pages = 0;
+
+        spin_unlock(&d->arch.paging.lock);
+    }
 
     if ( p2m->root )
         free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
@@ -1803,6 +1831,17 @@  int p2m_init(struct domain *d)
     if ( rc )
         return rc;
 
+    /*
+     * Hardware using GICv2 wants to create an 8KB MMIO mapping during
+     * domain_create(), which requires some P2M pagetables.  Allocate 16 page
+     * which is good enough for now.
+     */
+    spin_lock(&d->arch.paging.lock);
+    rc = p2m_set_allocation(d, 16, NULL);
+    spin_unlock(&d->arch.paging.lock);
+    if ( rc )
+        return rc;
+
     return 0;
 }