
[v3,4/8] x86/mem-sharing: copy GADDR based shared guest areas

Message ID a79d2a8b-6d6e-bd31-b079-a30b555e5fd0@suse.com (mailing list archive)
State New, archived
Series runstate/time area registration by (guest) physical address

Commit Message

Jan Beulich May 3, 2023, 3:56 p.m. UTC
In preparation of the introduction of new vCPU operations allowing to
register the respective areas (one of the two is x86-specific) by
guest-physical address, add the necessary fork handling (with the
backing function yet to be filled in).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Extend comment.

Comments

Tamas K Lengyel May 3, 2023, 5:14 p.m. UTC | #1
> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
>
>   state:
>      if ( reset_state )
> +    {
>          rc = copy_settings(d, pd);
> +        /* TBD: What to do here with -ERESTART? */

Ideally we could avoid hitting code-paths that are restartable during fork
reset since it gets called from vm_event replies that have no concept of
handling errors. If we start having errors like this we would just have to
drop the vm_event reply optimization and issue a standalone fork reset
hypercall every time which isn't a big deal, it's just slower. My
preference would actually be that after the initial forking is performed a
local copy of the parent's state is maintained for the fork to reset to so
there would be no case of hitting an error like this. It would also allow
us in the future to unpause the parent for example..

Tamas
Jan Beulich May 4, 2023, 7:44 a.m. UTC | #2
On 03.05.2023 19:14, Tamas K Lengyel wrote:
>> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
>>
>>   state:
>>      if ( reset_state )
>> +    {
>>          rc = copy_settings(d, pd);
>> +        /* TBD: What to do here with -ERESTART? */
> 
> Ideally we could avoid hitting code-paths that are restartable during fork
> reset since it gets called from vm_event replies that have no concept of
> handling errors. If we start having errors like this we would just have to
> drop the vm_event reply optimization and issue a standalone fork reset
> hypercall every time which isn't a big deal, it's just slower.

I'm afraid I don't follow: We are in the process of fork-reset here. How
would issuing "a standalone fork reset hypercall every time" make this
any different? The possible need for a continuation here comes from a
failed spin_trylock() in map_guest_area(). That won't change the next
time round.

But perhaps I should say that till now I didn't even pay much attention
to the 2nd use of the function by vm_event_resume(); I was mainly
focused on the one from XENMEM_sharing_op_fork_reset, where no
continuation handling exists. Yet perhaps your comment is mainly
related to that use?

I actually notice that the comment ahead of the function already has a
continuation related TODO, just that there thought is only of larger
memory footprint.

> My
> preference would actually be that after the initial forking is performed a
> local copy of the parent's state is maintained for the fork to reset to so
> there would be no case of hitting an error like this. It would also allow
> us in the future to unpause the parent for example..

Oh, I guess I didn't realize the parent was kept paused. Such state
copying / caching may then indeed be a possibility, but that's nothing
I can see myself deal with, even less so in the context of this series.
I need a solution to the problem at hand within the scope of what is
there right now (or based on what could be provided e.g. by you within
the foreseeable future). Bubbling up the need for continuation from the
XENMEM_sharing_op_fork_reset path is the most I could see me handle
myself ... For vm_event_resume() bubbling state up the domctl path
_may_ also be doable, but mem_sharing_notification() and friends don't
even check the function's return value.

Jan
Tamas K Lengyel May 4, 2023, 12:50 p.m. UTC | #3
On Thu, May 4, 2023 at 3:44 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 03.05.2023 19:14, Tamas K Lengyel wrote:
> >> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
> >>
> >>   state:
> >>      if ( reset_state )
> >> +    {
> >>          rc = copy_settings(d, pd);
> >> +        /* TBD: What to do here with -ERESTART? */
> >
> > Ideally we could avoid hitting code-paths that are restartable during fork
> > reset since it gets called from vm_event replies that have no concept of
> > handling errors. If we start having errors like this we would just have to
> > drop the vm_event reply optimization and issue a standalone fork reset
> > hypercall every time which isn't a big deal, it's just slower.
>
> I'm afraid I don't follow: We are in the process of fork-reset here. How
> would issuing "a standalone fork reset hypercall every time" make this
> any different? The possible need for a continuation here comes from a
> failed spin_trylock() in map_guest_area(). That won't change the next
> time round.

Why not? Who is holding the lock and why wouldn't it ever relinquish it? If
that's really true then there is a larger issue than just not being able to
report the error back to the user on the vm_event_resume path and we need to
devise a way of being able to copy this from the parent bypassing this
lock. The parent is paused and its state should not be changing while forks
are active, so turning the lock into an rwlock of some sort so we can
acquire the read-lock would perhaps be a possible way out of this.

>
> But perhaps I should say that till now I didn't even pay much attention
> to the 2nd use of the function by vm_event_resume(); I was mainly
> focused on the one from XENMEM_sharing_op_fork_reset, where no
> continuation handling exists. Yet perhaps your comment is mainly
> related to that use?
>
> I actually notice that the comment ahead of the function already has a
> continuation related TODO, just that there thought is only of larger
> memory footprint.

With XENMEM_sharing_op_fork_reset the caller actually receives the error
code and can decide what to do next. With vm_event_resume there is no path
currently to notify the agent of an error. We could generate another
vm_event to send such an error, but the expectation with fork_reset is that
it will always work because the parent is paused, so not having that path
for an error to get back to the agent isn't a big deal.

Now, if it becomes the case that due to this locking we can get an error
even while the parent is paused, that will render the vm_event_resume path
unreliable, so we would just switch to using XENMEM_sharing_op_fork_reset so
that at least it can re-try in case of an issue. Of course, only if a
reissue of the hypercall has any reasonable chance of succeeding.

>
> > My
> > preference would actually be that after the initial forking is performed a
> > local copy of the parent's state is maintained for the fork to reset to so
> > there would be no case of hitting an error like this. It would also allow
> > us in the future to unpause the parent for example..
>
> Oh, I guess I didn't realize the parent was kept paused. Such state
> copying / caching may then indeed be a possibility, but that's nothing
> I can see myself deal with, even less so in the context of this series.
> I need a solution to the problem at hand within the scope of what is
> there right now (or based on what could be provided e.g. by you within
> the foreseeable future). Bubbling up the need for continuation from the
> XENMEM_sharing_op_fork_reset path is the most I could see me handle
> myself ... For vm_event_resume() bubbling state up the domctl path
> _may_ also be doable, but mem_sharing_notification() and friends don't
> even check the function's return value.

Sure, I wasn't expecting that work to be done as part of this series, just
as something I would like to get to in the future.

Tamas
Jan Beulich May 4, 2023, 2:25 p.m. UTC | #4
On 04.05.2023 14:50, Tamas K Lengyel wrote:
> On Thu, May 4, 2023 at 3:44 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 03.05.2023 19:14, Tamas K Lengyel wrote:
>>>> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
>>>>
>>>>   state:
>>>>      if ( reset_state )
>>>> +    {
>>>>          rc = copy_settings(d, pd);
>>>> +        /* TBD: What to do here with -ERESTART? */
>>>
>>> Ideally we could avoid hitting code-paths that are restartable during fork
>>> reset since it gets called from vm_event replies that have no concept of
>>> handling errors. If we start having errors like this we would just have to
>>> drop the vm_event reply optimization and issue a standalone fork reset
>>> hypercall every time which isn't a big deal, it's just slower.
>>
>> I'm afraid I don't follow: We are in the process of fork-reset here. How
>> would issuing "a standalone fork reset hypercall every time" make this
>> any different? The possible need for a continuation here comes from a
>> failed spin_trylock() in map_guest_area(). That won't change the next
>> time round.
> 
> Why not? Who is holding the lock and why wouldn't it ever relinquish it?

What state is the fork in at that point in time? We're talking about the
fork's hypercall deadlock mutex here, after all. Hence if we knew the
fork is paused (just like the parent is), then I don't think -ERESTART
can be coming back. (To be precise, both paths leading here are of
interest, yet the state the fork is in may be different in both cases.)

> If
> that's really true then there is a larger issue than just not being able to
> report the error back to the user on vm_event_resume path and we need to
> devise a way of being able to copy this from the parent bypassing this
> lock.

The issue isn't that the lock will never become available. But we can't
predict how many attempts it'll take.

But my earlier question went in a different direction anyway: You did
suggest to replace a fork-reset by "a standalone fork reset hypercall
every time". I somehow don't get the difference (but it looks like some
of your further reply down from here addresses that).

> The parent is paused and its state should not be changing while forks
> are active, so turning the lock into an rwlock of some sort so we can
> acquire the read-lock would perhaps be a possible way out of this.

Given the specific lock we're talking about here, an rwlock is out of
question, I think.

>> But perhaps I should say that till now I didn't even pay much attention
>> to the 2nd use of the function by vm_event_resume(); I was mainly
>> focused on the one from XENMEM_sharing_op_fork_reset, where no
>> continuation handling exists. Yet perhaps your comment is mainly
>> related to that use?
>>
>> I actually notice that the comment ahead of the function already has a
>> continuation related TODO, just that there thought is only of larger
>> memory footprint.
> 
> With XENMEM_sharing_op_fork_reset the caller actually receives the error
> code and can decide what to do next. With vm_event_resume there is no path
> currently to notify the agent of an error. We could generate another
> vm_event to send such an error, but the expectation with fork_reset is that
> it will always work because the parent is paused, so not having that path
> for an error to get back to the agent isn't a big deal.
> 
> Now, if it becomes the case that due to this locking we can get an error
> even while the parent is paused, that will render the vm_event_resume path
> unreliable, so we would just switch to using XENMEM_sharing_op_fork_reset so
> that at least it can re-try in case of an issue. Of course, only if a
> reissue of the hypercall has any reasonable chance of succeeding.

(I think this is the explanation for the "standalone reset hypercall".)

Jan
Roger Pau Monné Sept. 27, 2023, 11:08 a.m. UTC | #5
On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
> In preparation of the introduction of new vCPU operations allowing to
> register the respective areas (one of the two is x86-specific) by
> guest-physical address, add the necessary fork handling (with the
> backing function yet to be filled in).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Given the very limited and specific usage of the current Xen forking
code, do we really need to bother about copying such areas?

IOW: I doubt that not updating the runstate/time areas will make any
difference to the forking code ATM.

> ---
> v3: Extend comment.
> 
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>  }
>  
> +static int copy_guest_area(struct guest_area *cd_area,
> +                           const struct guest_area *d_area,
> +                           struct vcpu *cd_vcpu,
> +                           const struct domain *d)
> +{
> +    mfn_t d_mfn, cd_mfn;
> +
> +    if ( !d_area->pg )
> +        return 0;
> +
> +    d_mfn = page_to_mfn(d_area->pg);
> +
> +    /* Allocate & map a page for the area if it hasn't been already. */
> +    if ( !cd_area->pg )
> +    {
> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> +        p2m_type_t p2mt;
> +        p2m_access_t p2ma;
> +        unsigned int offset;
> +        int ret;
> +
> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> +        {
> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> +
> +            if ( !pg )
> +                return -ENOMEM;
> +
> +            cd_mfn = page_to_mfn(pg);
> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> +
> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> +                                 p2m->default_access, -1);
> +            if ( ret )
> +                return ret;
> +        }
> +        else if ( p2mt != p2m_ram_rw )
> +            return -EBUSY;

Shouldn't the populate of the underlying gfn in the fork case
be done by map_guest_area itself?

What if a forked guest attempts to register a new runstate/time info
against an address not yet populated?

> +        /*
> +         * Map the area into the guest. For simplicity specify the entire range up
> +         * to the end of the page: All the function uses it for is to check
> +         * that the range doesn't cross page boundaries. Having the area mapped
> +         * in the original domain implies that it fits there and therefore will
> +         * also fit in the clone.
> +         */
> +        offset = PAGE_OFFSET(d_area->map);
> +        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
> +                             PAGE_SIZE - offset, cd_area, NULL);
> +        if ( ret )
> +            return ret;
> +    }
> +    else
> +        cd_mfn = page_to_mfn(cd_area->pg);
> +
> +    copy_domain_page(cd_mfn, d_mfn);
> +
> +    return 0;
> +}
> +
>  static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
>  {
>      struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
> @@ -1733,6 +1795,16 @@ static int copy_vcpu_settings(struct dom
>              copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
>          }
>  
> +        /* Same for the (physically registered) runstate and time info areas. */
> +        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
> +                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
> +        if ( ret )
> +            return ret;
> +        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
> +                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
> +        if ( ret )
> +            return ret;
> +
>          ret = copy_vpmu(d_vcpu, cd_vcpu);
>          if ( ret )
>              return ret;
> @@ -1974,7 +2046,10 @@ int mem_sharing_fork_reset(struct domain
>  
>   state:
>      if ( reset_state )
> +    {
>          rc = copy_settings(d, pd);
> +        /* TBD: What to do here with -ERESTART? */
> +    }
>  
>      domain_unpause(d);
>  
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1572,6 +1572,13 @@ void unmap_vcpu_info(struct vcpu *v)
>      put_page_and_type(mfn_to_page(mfn));
>  }
>  
> +int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
> +                   struct guest_area *area,
> +                   void (*populate)(void *dst, struct vcpu *v))

Oh, the prototype for this is added in patch 1, almost missed it.

Why not also add this dummy implementation in patch 1 then?

Thanks, Roger.
Jan Beulich Sept. 27, 2023, 12:06 p.m. UTC | #6
On 27.09.2023 13:08, Roger Pau Monné wrote:
> On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
>> In preparation of the introduction of new vCPU operations allowing to
>> register the respective areas (one of the two is x86-specific) by
>> guest-physical address, add the necessary fork handling (with the
>> backing function yet to be filled in).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Given the very limited and specific usage of the current Xen forking
> code, do we really need to bother about copying such areas?
> 
> IOW: I doubt that not updating the runstate/time areas will make any
> difference to the forking code ATM.

I expect Tamas'es reply has sufficiently addressed this question.

>> --- a/xen/arch/x86/mm/mem_sharing.c
>> +++ b/xen/arch/x86/mm/mem_sharing.c
>> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>  }
>>  
>> +static int copy_guest_area(struct guest_area *cd_area,
>> +                           const struct guest_area *d_area,
>> +                           struct vcpu *cd_vcpu,
>> +                           const struct domain *d)
>> +{
>> +    mfn_t d_mfn, cd_mfn;
>> +
>> +    if ( !d_area->pg )
>> +        return 0;
>> +
>> +    d_mfn = page_to_mfn(d_area->pg);
>> +
>> +    /* Allocate & map a page for the area if it hasn't been already. */
>> +    if ( !cd_area->pg )
>> +    {
>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>> +        p2m_type_t p2mt;
>> +        p2m_access_t p2ma;
>> +        unsigned int offset;
>> +        int ret;
>> +
>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>> +        {
>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>> +
>> +            if ( !pg )
>> +                return -ENOMEM;
>> +
>> +            cd_mfn = page_to_mfn(pg);
>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>> +
>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>> +                                 p2m->default_access, -1);
>> +            if ( ret )
>> +                return ret;
>> +        }
>> +        else if ( p2mt != p2m_ram_rw )
>> +            return -EBUSY;
> 
> Shouldn't the populate of the underlying gfn in the fork case
> be done by map_guest_area itself?

I've used the existing logic for the info area to base my code on. As
noted in the patch switching the info area logic to use the generalized
infrastructure, I've taken the liberty to address an issue in the
original logic. But it was never a goal to re-write things from scratch.
If, as Tamas appears to agree, there's a better way of layering things
here, then please as a follow-on patch.

> What if a forked guest attempts to register a new runstate/time info
> against an address not yet populated?

What if the same happened for the info area? Again, I'm not trying to
invent anything new here. But hopefully Tamas'es reply has helped here
as well.

>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -1572,6 +1572,13 @@ void unmap_vcpu_info(struct vcpu *v)
>>      put_page_and_type(mfn_to_page(mfn));
>>  }
>>  
>> +int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
>> +                   struct guest_area *area,
>> +                   void (*populate)(void *dst, struct vcpu *v))
> 
> Oh, the prototype for this is added in patch 1, almost missed it.
> 
> Why not also add this dummy implementation in patch 1 then?

The prototype isn't dead code, but the function would be until it gains
users here. If anything, I'd move the prototype introduction here; it
merely seemed desirable to me to touch xen/include/xen/domain.h no
more frequently than necessary.

Also, to be quite frank, I find this kind of nitpicking odd after the
series has been pending for almost a year.

Jan
Roger Pau Monné Sept. 27, 2023, 1:54 p.m. UTC | #7
On Wed, Sep 27, 2023 at 07:43:26AM -0400, Tamas K Lengyel wrote:
> On Wed, Sep 27, 2023 at 7:08 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
> >
> > On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
> > > In preparation of the introduction of new vCPU operations allowing to
> > > register the respective areas (one of the two is x86-specific) by
> > > guest-physical address, add the necessary fork handling (with the
> > > backing function yet to be filled in).
> > >
> > > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >
> > Given the very limited and specific usage of the current Xen forking
> > code, do we really need to bother about copying such areas?
> >
> > IOW: I doubt that not updating the runstate/time areas will make any
> > difference to the forking code ATM.
> 
> The current implementation of forking allows for fully functional VM
> forks to be setup, including launching the dm for them. The toolstack
> side hasn't been merged for that into Xen but that doesn't mean it's
> not available or used already. So any internal changes to Xen that
> create guest-states that could potentially be interacted with and
> relied on by a guest should get properly wired in for forking. So I'm
> happy Jan took the initiative here for wiring this up, even if the
> use-case seems niche.

But there are still some bits missing in Xen, seeing the comment in
copy_vcpu_settings().  If we don't copy the timers already then I'm
unsure whether copying the runstate/time shared areas is very
relevant.

> >
> > > ---
> > > v3: Extend comment.
> > >
> > > --- a/xen/arch/x86/mm/mem_sharing.c
> > > +++ b/xen/arch/x86/mm/mem_sharing.c
> > > @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
> > >      hvm_set_nonreg_state(cd_vcpu, &nrs);
> > >  }
> > >
> > > +static int copy_guest_area(struct guest_area *cd_area,
> > > +                           const struct guest_area *d_area,
> > > +                           struct vcpu *cd_vcpu,
> > > +                           const struct domain *d)
> > > +{
> > > +    mfn_t d_mfn, cd_mfn;
> > > +
> > > +    if ( !d_area->pg )
> > > +        return 0;
> > > +
> > > +    d_mfn = page_to_mfn(d_area->pg);
> > > +
> > > +    /* Allocate & map a page for the area if it hasn't been already. */
> > > +    if ( !cd_area->pg )
> > > +    {
> > > +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> > > +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> > > +        p2m_type_t p2mt;
> > > +        p2m_access_t p2ma;
> > > +        unsigned int offset;
> > > +        int ret;
> > > +
> > > +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> > > +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> > > +        {
> > > +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> > > +
> > > +            if ( !pg )
> > > +                return -ENOMEM;
> > > +
> > > +            cd_mfn = page_to_mfn(pg);
> > > +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> > > +
> > > +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> > > +                                 p2m->default_access, -1);
> > > +            if ( ret )
> > > +                return ret;
> > > +        }
> > > +        else if ( p2mt != p2m_ram_rw )
> > > +            return -EBUSY;
> >
> > Shouldn't the populate of the underlying gfn in the fork case
> > be done by map_guest_area itself?
> 
> Would seem more logical, I agree, but then the call should be
> map_guest_area first, which conditionally calls down into a
> mem_sharing_copy_guest_area if the domain is a fork.
> 
> >
> > What if a forked guest attempts to register a new runstate/time info
> > against an address not yet populated?
> 
> Unpopulated memory will get forked from the parent for all access
> paths currently in existence, including access to a forked VMs memory
> from dom0/dm and the various internal Xen access paths. If the memory
> range is not mapped in the parent registering on that range ought to
> fail by I assume existing checks that validate that the memory is
> present during registration.

What I'm trying to say is that map_guest_area() already has to deal
with forked guests, and hence the populating of the gfn shouldn't be
needed as map_guest_area() must know how to deal with such guest
anyway.

Thanks, Roger.
Roger Pau Monné Sept. 27, 2023, 2:05 p.m. UTC | #8
On Wed, Sep 27, 2023 at 02:06:47PM +0200, Jan Beulich wrote:
> On 27.09.2023 13:08, Roger Pau Monné wrote:
> > On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
> >> In preparation of the introduction of new vCPU operations allowing to
> >> register the respective areas (one of the two is x86-specific) by
> >> guest-physical address, add the necessary fork handling (with the
> >> backing function yet to be filled in).
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Given the very limited and specific usage of the current Xen forking
> > code, do we really need to bother about copying such areas?
> > 
> > IOW: I doubt that not updating the runstate/time areas will make any
> > difference to the forking code ATM.
> 
> I expect Tamas'es reply has sufficiently addressed this question.

Seeing the TODO in copy_vcpu_settings() makes me wonder how well we
copy information for PV interfaces for forks.  Timers are not
migrated, neither runstate areas for example.

Which is all fine, but I expect VMs this is currently tested against
don't use (a lot of) PV interfaces, or else I don't know how they can
survive.

> >> --- a/xen/arch/x86/mm/mem_sharing.c
> >> +++ b/xen/arch/x86/mm/mem_sharing.c
> >> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
> >>      hvm_set_nonreg_state(cd_vcpu, &nrs);
> >>  }
> >>  
> >> +static int copy_guest_area(struct guest_area *cd_area,
> >> +                           const struct guest_area *d_area,
> >> +                           struct vcpu *cd_vcpu,
> >> +                           const struct domain *d)
> >> +{
> >> +    mfn_t d_mfn, cd_mfn;
> >> +
> >> +    if ( !d_area->pg )
> >> +        return 0;
> >> +
> >> +    d_mfn = page_to_mfn(d_area->pg);
> >> +
> >> +    /* Allocate & map a page for the area if it hasn't been already. */
> >> +    if ( !cd_area->pg )
> >> +    {
> >> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
> >> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
> >> +        p2m_type_t p2mt;
> >> +        p2m_access_t p2ma;
> >> +        unsigned int offset;
> >> +        int ret;
> >> +
> >> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
> >> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
> >> +        {
> >> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
> >> +
> >> +            if ( !pg )
> >> +                return -ENOMEM;
> >> +
> >> +            cd_mfn = page_to_mfn(pg);
> >> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
> >> +
> >> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
> >> +                                 p2m->default_access, -1);
> >> +            if ( ret )
> >> +                return ret;
> >> +        }
> >> +        else if ( p2mt != p2m_ram_rw )
> >> +            return -EBUSY;
> > 
> > Shouldn't the populate of the underlying gfn in the fork case
> > be done by map_guest_area itself?
> 
> I've used the existing logic for the info area to base my code on. As
> noted in the patch switching the info area logic to use the generalized
> infrastructure, I've taken the liberty to address an issue in the
> original logic. But it was never a goal to re-write things from scratch.
> If, as Tamas appears to agree, there's a better way of layering things
> here, then please as a follow-on patch.

Hm, I'm unsure the code that allocates the page and adds it to the p2m
for the vcpu_info area is required?  map_vcpu_info() should already be
able to handle this case and fork the page from the previous VM.

> > What if a forked guest attempts to register a new runstate/time info
> > against an address not yet populated?
> 
> What if the same happened for the info area? Again, I'm not trying to
> invent anything new here. But hopefully Tamas'es reply has helped here
> as well.

Yes, I don't think we should need to allocate and map a page for the
vcpu_info area before calling map_vcpu_info().

> >> --- a/xen/common/domain.c
> >> +++ b/xen/common/domain.c
> >> @@ -1572,6 +1572,13 @@ void unmap_vcpu_info(struct vcpu *v)
> >>      put_page_and_type(mfn_to_page(mfn));
> >>  }
> >>  
> >> +int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
> >> +                   struct guest_area *area,
> >> +                   void (*populate)(void *dst, struct vcpu *v))
> > 
> > Oh, the prototype for this is added in patch 1, almost missed it.
> > 
> > Why not also add this dummy implementation in patch 1 then?
> 
> The prototype isn't dead code, but the function would be until it gains
> users here. If anything, I'd move the prototype introduction here; it
> merely seemed desirable to me to touch xen/include/xen/domain.h no
> more frequently than necessary.
> 
> Also, to be quite frank, I find this kind of nitpicking odd after the
> series has been pending for almost a year.

TBH when I saw this I was about to comment that the patch won't build
because the prototypes was missing, but then I remembered about patch
1.  I don't think it's obvious, but anyway.

Thanks, Roger.
Jan Beulich Sept. 27, 2023, 3:11 p.m. UTC | #9
On 27.09.2023 16:05, Roger Pau Monné wrote:
> On Wed, Sep 27, 2023 at 02:06:47PM +0200, Jan Beulich wrote:
>> On 27.09.2023 13:08, Roger Pau Monné wrote:
>>> On Wed, May 03, 2023 at 05:56:46PM +0200, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/mm/mem_sharing.c
>>>> +++ b/xen/arch/x86/mm/mem_sharing.c
>>>> @@ -1641,6 +1641,68 @@ static void copy_vcpu_nonreg_state(struc
>>>>      hvm_set_nonreg_state(cd_vcpu, &nrs);
>>>>  }
>>>>  
>>>> +static int copy_guest_area(struct guest_area *cd_area,
>>>> +                           const struct guest_area *d_area,
>>>> +                           struct vcpu *cd_vcpu,
>>>> +                           const struct domain *d)
>>>> +{
>>>> +    mfn_t d_mfn, cd_mfn;
>>>> +
>>>> +    if ( !d_area->pg )
>>>> +        return 0;
>>>> +
>>>> +    d_mfn = page_to_mfn(d_area->pg);
>>>> +
>>>> +    /* Allocate & map a page for the area if it hasn't been already. */
>>>> +    if ( !cd_area->pg )
>>>> +    {
>>>> +        gfn_t gfn = mfn_to_gfn(d, d_mfn);
>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
>>>> +        p2m_type_t p2mt;
>>>> +        p2m_access_t p2ma;
>>>> +        unsigned int offset;
>>>> +        int ret;
>>>> +
>>>> +        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
>>>> +        if ( mfn_eq(cd_mfn, INVALID_MFN) )
>>>> +        {
>>>> +            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
>>>> +
>>>> +            if ( !pg )
>>>> +                return -ENOMEM;
>>>> +
>>>> +            cd_mfn = page_to_mfn(pg);
>>>> +            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
>>>> +
>>>> +            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
>>>> +                                 p2m->default_access, -1);
>>>> +            if ( ret )
>>>> +                return ret;
>>>> +        }
>>>> +        else if ( p2mt != p2m_ram_rw )
>>>> +            return -EBUSY;
>>>
>>> Shouldn't populating the underlying gfn in the fork case
>>> be done by map_guest_area() itself?
>>
>> I've used the existing logic for the info area to base my code on. As
>> noted in the patch switching the info area logic to use the generalized
>> infrastructure, I've taken the liberty to address an issue in the
>> original logic. But it was never a goal to re-write things from scratch.
>> If, as Tamas appears to agree, there is a better way of layering things
>> here, then please do so in a follow-on patch.
> 
> Hm, I'm unsure the code that allocates the page and adds it to the p2m
> for the vcpu_info area is required?  map_vcpu_info() should already be
> able to handle this case and fork the page from the previous VM.

I don't think that's the case. It would be able to unshare, but the fork
doesn't use shared pages aiui. I think it instead runs on an extremely
sparse p2m, where pages from the parent are brought in only as they're
touched. Tamas?

Jan
Patch

--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1641,6 +1641,68 @@  static void copy_vcpu_nonreg_state(struc
     hvm_set_nonreg_state(cd_vcpu, &nrs);
 }
 
+static int copy_guest_area(struct guest_area *cd_area,
+                           const struct guest_area *d_area,
+                           struct vcpu *cd_vcpu,
+                           const struct domain *d)
+{
+    mfn_t d_mfn, cd_mfn;
+
+    if ( !d_area->pg )
+        return 0;
+
+    d_mfn = page_to_mfn(d_area->pg);
+
+    /* Allocate & map a page for the area if it hasn't been already. */
+    if ( !cd_area->pg )
+    {
+        gfn_t gfn = mfn_to_gfn(d, d_mfn);
+        struct p2m_domain *p2m = p2m_get_hostp2m(cd_vcpu->domain);
+        p2m_type_t p2mt;
+        p2m_access_t p2ma;
+        unsigned int offset;
+        int ret;
+
+        cd_mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, NULL);
+        if ( mfn_eq(cd_mfn, INVALID_MFN) )
+        {
+            struct page_info *pg = alloc_domheap_page(cd_vcpu->domain, 0);
+
+            if ( !pg )
+                return -ENOMEM;
+
+            cd_mfn = page_to_mfn(pg);
+            set_gpfn_from_mfn(mfn_x(cd_mfn), gfn_x(gfn));
+
+            ret = p2m->set_entry(p2m, gfn, cd_mfn, PAGE_ORDER_4K, p2m_ram_rw,
+                                 p2m->default_access, -1);
+            if ( ret )
+                return ret;
+        }
+        else if ( p2mt != p2m_ram_rw )
+            return -EBUSY;
+
+        /*
+         * Map the area into the guest. For simplicity specify the entire range up
+         * to the end of the page: All the function uses it for is to check
+         * that the range doesn't cross page boundaries. Having the area mapped
+         * in the original domain implies that it fits there and therefore will
+         * also fit in the clone.
+         */
+        offset = PAGE_OFFSET(d_area->map);
+        ret = map_guest_area(cd_vcpu, gfn_to_gaddr(gfn) + offset,
+                             PAGE_SIZE - offset, cd_area, NULL);
+        if ( ret )
+            return ret;
+    }
+    else
+        cd_mfn = page_to_mfn(cd_area->pg);
+
+    copy_domain_page(cd_mfn, d_mfn);
+
+    return 0;
+}
+
 static int copy_vpmu(struct vcpu *d_vcpu, struct vcpu *cd_vcpu)
 {
     struct vpmu_struct *d_vpmu = vcpu_vpmu(d_vcpu);
@@ -1733,6 +1795,16 @@  static int copy_vcpu_settings(struct dom
             copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
         }
 
+        /* Same for the (physically registered) runstate and time info areas. */
+        ret = copy_guest_area(&cd_vcpu->runstate_guest_area,
+                              &d_vcpu->runstate_guest_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
+        ret = copy_guest_area(&cd_vcpu->arch.time_guest_area,
+                              &d_vcpu->arch.time_guest_area, cd_vcpu, d);
+        if ( ret )
+            return ret;
+
         ret = copy_vpmu(d_vcpu, cd_vcpu);
         if ( ret )
             return ret;
@@ -1974,7 +2046,10 @@  int mem_sharing_fork_reset(struct domain
 
  state:
     if ( reset_state )
+    {
         rc = copy_settings(d, pd);
+        /* TBD: What to do here with -ERESTART? */
+    }
 
     domain_unpause(d);
 
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1572,6 +1572,13 @@  void unmap_vcpu_info(struct vcpu *v)
     put_page_and_type(mfn_to_page(mfn));
 }
 
+int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
+                   struct guest_area *area,
+                   void (*populate)(void *dst, struct vcpu *v))
+{
+    return -EOPNOTSUPP;
+}
+
 /*
  * This is only intended to be used for domain cleanup (or more generally only
  * with at least the respective vCPU, if it's not the current one, reliably