diff mbox

[v2,1/2] xen: fix a (latent) cpupool-related race during domain destroy

Message ID 1468583562.13039.96.camel@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dario Faggioli July 15, 2016, 11:52 a.m. UTC
On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
> On 15/07/16 12:14, Dario Faggioli wrote:
> > In particular, I'm probably not fully understanding, from that
> > commit
> > changelog, what is the set of operations/command that I should run
> > to
> > check whether or not I reintroduced the issue back.
> You need to create a domain in a cpupool and destroy it again while
> some dom0 process still is holding a reference to it (resulting in a
> zombie domain). Then try to destroy the cpupool.
> 
Ah, I see. I wasn't get the fact that it needed to be a zombie domain
from anywhere.

> > What am I missing?
> The domain being a zombie domain might change the picture. Moving it
> to
> cpupool0 was failing before my patch and it might do so again with
> your
> patch applied.
> 
Mmmm... I don't immediately see the reason why moving a zombie domain
fails either, but I guess I'll have to try.

But then, correct me if I'm wrong, the situation is like this:
 - right now there's a (potential) race between domain's scheduling 
   data destruction and domain removal from a cpupool;
 - with my patch, the race goes away, but we risk not being able to 
   destroy a cpupool with a zombie domain in it.

I don't think we want to be in either of these two situations. :-(

The race is never triggering so far, but I've already patches to
Credit2 (finishing implementing soft affinity for it) that make it a
real thing.

I think I can work around that specific case by doing something like
the below:


But even if that would be acceptable (what do you think?), I still
don't like to have the race there lurking. :-/

Therefore, I still think this patch is correct, but I'm up for
investigating further and finding a way to solve the "zombie in
cpupool" issue as well.

Regards,
Dario

Comments

Jürgen Groß July 15, 2016, 12:52 p.m. UTC | #1
On 15/07/16 13:52, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
>> On 15/07/16 12:14, Dario Faggioli wrote:
>>> In particular, I'm probably not fully understanding, from that
>>> commit
>>> changelog, what is the set of operations/command that I should run
>>> to
>>> check whether or not I reintroduced the issue back.
>> You need to create a domain in a cpupool and destroy it again while
>> some dom0 process still is holding a reference to it (resulting in a
>> zombie domain). Then try to destroy the cpupool.
>>
> Ah, I see. I wasn't get the fact that it needed to be a zombie domain
> from anywhere.
> 
>>> What am I missing?
>> The domain being a zombie domain might change the picture. Moving it
>> to
>> cpupool0 was failing before my patch and it might do so again with
>> your
>> patch applied.
>>
> Mmmm... I don't immediately see the reason why moving a zombie domain
> fails either, but I guess I'll have to try.

Searching through the history I found commit
934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4 which might has removed the
problematic condition (cpupool->n_dom being non-zero while d->cpupool
was NULL already).

> But then, correct me if I'm wrong, the situation is like this:
>  - right now there's a (potential) race between domain's scheduling 
>    data destruction and domain removal from a cpupool;
>  - with my patch, the race goes away, but we risk not being able to 
>    destroy a cpupool with a zombie domain in it.

This one has been observed.

I do remember the following critical cases:

- removing a cpupool with a zombie domain
- shutting down the system with a domain in a cpupool
- not sure about the combination of both cases (shutting down with
  zombie domain in a cpupool): is this even possible without another
  bug in the hypervisor or dom0?

> Therefore, I still think this patch is correct, but I'm up for
> investigating further and finding a way to solve the "zombie in
> cpupool" issue as well.

I'm not saying your patch is wrong. I just wanted to give you a hint
about the history of the stuff you are changing. :-)

If it is working I'd really prefer it over the current situation.


Juergen
Dario Faggioli July 15, 2016, 2:23 p.m. UTC | #2
On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
> On 15/07/16 13:52, Dario Faggioli wrote:
> > 
> > On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
> > > 
> > > On 15/07/16 12:14, Dario Faggioli wrote:
> > > > 
> > > > In particular, I'm probably not fully understanding, from that
> > > > commit
> > > > changelog, what is the set of operations/command that I should
> > > > run
> > > > to
> > > > check whether or not I reintroduced the issue back.
> > > You need to create a domain in a cpupool and destroy it again
> > > while
> > > some dom0 process still is holding a reference to it (resulting
> > > in a
> > > zombie domain). Then try to destroy the cpupool.
> > > 
> > Ah, I see. I wasn't get the fact that it needed to be a zombie
> > domain
> > from anywhere.
> > 
> > > 
> > > > 
> > > > What am I missing?
> > > The domain being a zombie domain might change the picture. Moving
> > > it
> > > to
> > > cpupool0 was failing before my patch and it might do so again
> > > with
> > > your
> > > patch applied.
> > > 
> > Mmmm... I don't immediately see the reason why moving a zombie
> > domain
> > fails either, but I guess I'll have to try.
> Searching through the history I found commit
> 934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4 which might has removed the
> problematic condition (cpupool->n_dom being non-zero while d->cpupool
> was NULL already).
> 
Yeah.. And there's also this:

/*
 * unassign a specific cpu from a cpupool
 * we must be sure not to run on the cpu to be unassigned! to achieve this
 * the main functionality is performed via continue_hypercall_on_cpu on a
 * specific cpu.
 * if the cpu to be removed is the last one of the cpupool no active domain
 * must be bound to the cpupool. dying domains are moved to cpupool0 as they
 * might be zombies.
 * possible failures:
 * - last cpu and still active domains in cpupool
 * - cpu just being unplugged
 */
static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
{
    int work_cpu;
    int ret;
    struct domain *d;

    cpupool_dprintk("cpupool_unassign_cpu(pool=%d,cpu=%d)\n",
                    c->cpupool_id, cpu);
    [...]
        for_each_domain_in_cpupool(d, c)
        {
            if ( !d->is_dying )
            {
                ret = -EBUSY;
                break;
            }
            ret = cpupool_move_domain_locked(d, cpupool0);
            if ( ret )
                break;
        }
    [...]

So it really looks like it ought to be possible to move zombies to
cpupool0 these days. :-)

> > Therefore, I still think this patch is correct, but I'm up for
> > investigating further and finding a way to solve the "zombie in
> > cpupool" issue as well.
> I'm not saying your patch is wrong. I just wanted to give you a hint
> about the history of the stuff you are changing. :-)
> 
Sure, and that's much appreciated! :-)

> If it is working I'd really prefer it over the current situation.
> 
Right. I've got to leave now. But I'll produce some zombies on Monday,
and will see if they move. :-D

Thanks and Regards,
Dario
Dario Faggioli July 18, 2016, 2:03 p.m. UTC | #3
On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
> > On 15/07/16 13:52, Dario Faggioli wrote:
> > > Therefore, I still think this patch is correct, but I'm up for
> > > investigating further and finding a way to solve the "zombie in
> > > cpupool" issue as well.
> > I'm not saying your patch is wrong. I just wanted to give you a
> > hint
> > about the history of the stuff you are changing. :-)
> > 
> Sure, and that's much appreciated! :-)
> 
> > If it is working I'd really prefer it over the current situation.
> > 
> Right. I've got to leave now. But I'll produce some zombies on
> Monday,
> and will see if they move. :-D
> 
Here you go:

http://pastebin.com/r93TGCZU

Is that "vm1 -b-" --> "(null) ---" domain zombie enough?

If yes, as you can see, it moves to cpupool0 while being destroyed, and
I can destroy the pool without problems.

I also shutdown the system with the domain still there (in cpupool0),
and it works.

Regards,
Dario
Jürgen Groß July 18, 2016, 2:09 p.m. UTC | #4
On 18/07/16 16:03, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
>> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
>>> On 15/07/16 13:52, Dario Faggioli wrote:
>>>> Therefore, I still think this patch is correct, but I'm up for
>>>> investigating further and finding a way to solve the "zombie in
>>>> cpupool" issue as well.
>>> I'm not saying your patch is wrong. I just wanted to give you a
>>> hint
>>> about the history of the stuff you are changing. :-)
>>>
>> Sure, and that's much appreciated! :-)
>>
>>> If it is working I'd really prefer it over the current situation.
>>>
>> Right. I've got to leave now. But I'll produce some zombies on
>> Monday,
>> and will see if they move. :-D
>>
> Here you go:
> 
> http://pastebin.com/r93TGCZU
> 
> Is that "vm1 -b-" --> "(null) ---" domain zombie enough?

Yes, seems to be okay for a zombie.

> If yes, as you can see, it moves to cpupool0 while being destroyed, and
> I can destroy the pool without problems.

Great.

> I also shutdown the system with the domain still there (in cpupool0),
> and it works.

Fine. You can add

Acked-by: Juergen Gross <jgross@suse.com>

for this patch then.


Thanks, Juergen
Dario Faggioli July 28, 2016, 5:29 p.m. UTC | #5
On Mon, 2016-07-18 at 16:09 +0200, Juergen Gross wrote:
> Acked-by: Juergen Gross <jgross@suse.com>
> 
> for this patch then.
> 
George,

Ping about this series.

It's not terribly urgent, but it should be easy enough, so I guess
there is a chance that you can have a quick look.

If you can't, sorry for the noise, I'll re-ping you in a bit. :-)

Thanks and Regards,
Dario
George Dunlap Aug. 3, 2016, 11:54 a.m. UTC | #6
On Thu, Jul 28, 2016 at 6:29 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Mon, 2016-07-18 at 16:09 +0200, Juergen Gross wrote:
>> Acked-by: Juergen Gross <jgross@suse.com>
>>
>> for this patch then.
>>
> George,
>
> Ping about this series.

Dario,

Somehow I can only find patch 1/2 in either Google or my Citrix
mailbox.  What's the title of the 2nd patch?

 -George
George Dunlap Aug. 3, 2016, 12:27 p.m. UTC | #7
On 18/07/16 15:09, Juergen Gross wrote:
> On 18/07/16 16:03, Dario Faggioli wrote:
>> On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
>>> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
>>>> On 15/07/16 13:52, Dario Faggioli wrote:
>>>>> Therefore, I still think this patch is correct, but I'm up for
>>>>> investigating further and finding a way to solve the "zombie in
>>>>> cpupool" issue as well.
>>>> I'm not saying your patch is wrong. I just wanted to give you a
>>>> hint
>>>> about the history of the stuff you are changing. :-)
>>>>
>>> Sure, and that's much appreciated! :-)
>>>
>>>> If it is working I'd really prefer it over the current situation.
>>>>
>>> Right. I've got to leave now. But I'll produce some zombies on
>>> Monday,
>>> and will see if they move. :-D
>>>
>> Here you go:
>>
>> http://pastebin.com/r93TGCZU
>>
>> Is that "vm1 -b-" --> "(null) ---" domain zombie enough?
> 
> Yes, seems to be okay for a zombie.
> 
>> If yes, as you can see, it moves to cpupool0 while being destroyed, and
>> I can destroy the pool without problems.
> 
> Great.
> 
>> I also shutdown the system with the domain still there (in cpupool0),
>> and it works.
> 
> Fine. You can add
> 
> Acked-by: Juergen Gross <jgross@suse.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

And queued.

 -George
diff mbox

Patch

diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index bc0e794..d91fd70 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -193,12 +193,9 @@  struct cpupool
 
 static inline cpumask_t* cpupool_domain_cpumask(struct domain *d)
 {
-    /*
-     * d->cpupool is NULL only for the idle domain, and no one should
-     * be interested in calling this for the idle domain.
-     */
-    ASSERT(d->cpupool != NULL);
-    return d->cpupool->cpu_valid;
+    /* No one should be interested in calling this for the idle domain! */
+    ASSERT(!is_idle_domain(d));
+    return d->cpupool ? d->cpupool->cpu_valid : cpupool0->cpu_valid;
 }
 
 #endif /* __XEN_SCHED_IF_H__ */