Message ID | 1468583562.13039.96.camel@citrix.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 15/07/16 13:52, Dario Faggioli wrote: > On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote: >> On 15/07/16 12:14, Dario Faggioli wrote: >>> In particular, I'm probably not fully understanding, from that >>> commit >>> changelog, what is the set of operations/command that I should run >>> to >>> check whether or not I reintroduced the issue back. >> You need to create a domain in a cpupool and destroy it again while >> some dom0 process still is holding a reference to it (resulting in a >> zombie domain). Then try to destroy the cpupool. >> > Ah, I see. I wasn't get the fact that it needed to be a zombie domain > from anywhere. > >>> What am I missing? >> The domain being a zombie domain might change the picture. Moving it >> to >> cpupool0 was failing before my patch and it might do so again with >> your >> patch applied. >> > Mmmm... I don't immediately see the reason why moving a zombie domain > fails either, but I guess I'll have to try. Searching through the history I found commit 934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4 which might has removed the problematic condition (cpupool->n_dom being non-zero while d->cpupool was NULL already). > But then, correct me if I'm wrong, the situation is like this: > - right now there's a (potential) race between domain's scheduling > data destruction and domain removal from a cpupool; > - with my patch, the race goes away, but we risk not being able to > destroy a cpupool with a zombie domain in it. This one has been observed. I do remember the following critical cases: - removing a cpupool with a zombie domain - shutting down the system with a domain in a cpupool - not sure about the combination of both cases (shutting down with zombie domain in a cpupool): is this even possible without another bug in the hypervisor or dom0? > Therefore, I still think this patch is correct, but I'm up for > investigating further and finding a way to solve the "zombie in > cpupool" issue as well. I'm not saying your patch is wrong. I just wanted to give you a hint about the history of the stuff you are changing. :-) If it is working I'd really prefer it over the current situation. Juergen
On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote: > On 15/07/16 13:52, Dario Faggioli wrote: > > > > On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote: > > > > > > On 15/07/16 12:14, Dario Faggioli wrote: > > > > > > > > In particular, I'm probably not fully understanding, from that > > > > commit > > > > changelog, what is the set of operations/command that I should > > > > run > > > > to > > > > check whether or not I reintroduced the issue back. > > > You need to create a domain in a cpupool and destroy it again > > > while > > > some dom0 process still is holding a reference to it (resulting > > > in a > > > zombie domain). Then try to destroy the cpupool. > > > > > Ah, I see. I wasn't get the fact that it needed to be a zombie > > domain > > from anywhere. > > > > > > > > > > > > > What am I missing? > > > The domain being a zombie domain might change the picture. Moving > > > it > > > to > > > cpupool0 was failing before my patch and it might do so again > > > with > > > your > > > patch applied. > > > > > Mmmm... I don't immediately see the reason why moving a zombie > > domain > > fails either, but I guess I'll have to try. > Searching through the history I found commit > 934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4 which might has removed the > problematic condition (cpupool->n_dom being non-zero while d->cpupool > was NULL already). > Yeah.. And there's also this: /* * unassign a specific cpu from a cpupool * we must be sure not to run on the cpu to be unassigned! to achieve this * the main functionality is performed via continue_hypercall_on_cpu on a * specific cpu. * if the cpu to be removed is the last one of the cpupool no active domain * must be bound to the cpupool. dying domains are moved to cpupool0 as they * might be zombies. * possible failures: * - last cpu and still active domains in cpupool * - cpu just being unplugged */ static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu) { int work_cpu; int ret; struct domain *d; cpupool_dprintk("cpupool_unassign_cpu(pool=%d,cpu=%d)\n", c->cpupool_id, cpu); [...] for_each_domain_in_cpupool(d, c) { if ( !d->is_dying ) { ret = -EBUSY; break; } ret = cpupool_move_domain_locked(d, cpupool0); if ( ret ) break; } [...] So it really looks like it ought to be possible to move zombies to cpupool0 these days. :-) > > Therefore, I still think this patch is correct, but I'm up for > > investigating further and finding a way to solve the "zombie in > > cpupool" issue as well. > I'm not saying your patch is wrong. I just wanted to give you a hint > about the history of the stuff you are changing. :-) > Sure, and that's much appreciated! :-) > If it is working I'd really prefer it over the current situation. > Right. I've got to leave now. But I'll produce some zombies on Monday, and will see if they move. :-D Thanks and Regards, Dario
On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote: > On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote: > > On 15/07/16 13:52, Dario Faggioli wrote: > > > Therefore, I still think this patch is correct, but I'm up for > > > investigating further and finding a way to solve the "zombie in > > > cpupool" issue as well. > > I'm not saying your patch is wrong. I just wanted to give you a > > hint > > about the history of the stuff you are changing. :-) > > > Sure, and that's much appreciated! :-) > > > If it is working I'd really prefer it over the current situation. > > > Right. I've got to leave now. But I'll produce some zombies on > Monday, > and will see if they move. :-D > Here you go: http://pastebin.com/r93TGCZU Is that "vm1 -b-" --> "(null) ---" domain zombie enough? If yes, as you can see, it moves to cpupool0 while being destroyed, and I can destroy the pool without problems. I also shutdown the system with the domain still there (in cpupool0), and it works. Regards, Dario
On 18/07/16 16:03, Dario Faggioli wrote: > On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote: >> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote: >>> On 15/07/16 13:52, Dario Faggioli wrote: >>>> Therefore, I still think this patch is correct, but I'm up for >>>> investigating further and finding a way to solve the "zombie in >>>> cpupool" issue as well. >>> I'm not saying your patch is wrong. I just wanted to give you a >>> hint >>> about the history of the stuff you are changing. :-) >>> >> Sure, and that's much appreciated! :-) >> >>> If it is working I'd really prefer it over the current situation. >>> >> Right. I've got to leave now. But I'll produce some zombies on >> Monday, >> and will see if they move. :-D >> > Here you go: > > http://pastebin.com/r93TGCZU > > Is that "vm1 -b-" --> "(null) ---" domain zombie enough? Yes, seems to be okay for a zombie. > If yes, as you can see, it moves to cpupool0 while being destroyed, and > I can destroy the pool without problems. Great. > I also shutdown the system with the domain still there (in cpupool0), > and it works. Fine. You can add Acked-by: Juergen Gross <jgross@suse.com> for this patch then. Thanks, Juergen
On Mon, 2016-07-18 at 16:09 +0200, Juergen Gross wrote: > Acked-by: Juergen Gross <jgross@suse.com> > > for this patch then. > George, Ping about this series. It's not terribly urgent, but it should be easy enough, so I guess there is a chance that you can have a quick look. If you can't, sorry for the noise, I'll re-ping you in a bit. :-) Thanks and Regards, Dario
On Thu, Jul 28, 2016 at 6:29 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote: > On Mon, 2016-07-18 at 16:09 +0200, Juergen Gross wrote: >> Acked-by: Juergen Gross <jgross@suse.com> >> >> for this patch then. >> > George, > > Ping about this series. Dario, Somehow I can only find patch 1/2 in either Google or my Citrix mailbox. What's the title of the 2nd patch? -George
On 18/07/16 15:09, Juergen Gross wrote: > On 18/07/16 16:03, Dario Faggioli wrote: >> On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote: >>> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote: >>>> On 15/07/16 13:52, Dario Faggioli wrote: >>>>> Therefore, I still think this patch is correct, but I'm up for >>>>> investigating further and finding a way to solve the "zombie in >>>>> cpupool" issue as well. >>>> I'm not saying your patch is wrong. I just wanted to give you a >>>> hint >>>> about the history of the stuff you are changing. :-) >>>> >>> Sure, and that's much appreciated! :-) >>> >>>> If it is working I'd really prefer it over the current situation. >>>> >>> Right. I've got to leave now. But I'll produce some zombies on >>> Monday, >>> and will see if they move. :-D >>> >> Here you go: >> >> http://pastebin.com/r93TGCZU >> >> Is that "vm1 -b-" --> "(null) ---" domain zombie enough? > > Yes, seems to be okay for a zombie. > >> If yes, as you can see, it moves to cpupool0 while being destroyed, and >> I can destroy the pool without problems. > > Great. > >> I also shutdown the system with the domain still there (in cpupool0), >> and it works. > > Fine. You can add > > Acked-by: Juergen Gross <jgross@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> And queued. -George
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h index bc0e794..d91fd70 100644 --- a/xen/include/xen/sched-if.h +++ b/xen/include/xen/sched-if.h @@ -193,12 +193,9 @@ struct cpupool static inline cpumask_t* cpupool_domain_cpumask(struct domain *d) { - /* - * d->cpupool is NULL only for the idle domain, and no one should - * be interested in calling this for the idle domain. - */ - ASSERT(d->cpupool != NULL); - return d->cpupool->cpu_valid; + /* No one should be interested in calling this for the idle domain! */ + ASSERT(!is_idle_domain(d)); + return d->cpupool ? d->cpupool->cpu_valid : cpupool0->cpu_valid; } #endif /* __XEN_SCHED_IF_H__ */